usage guides

  • quickstart

    This is a quickstart guide that should get the pipeline running in most situations. A more detailed description of the structure of the parameters file and the command line usage is in the analysis configuration section.

    • scamp configuration

      These notes illustrate how to configure an analysis using {scamp}. Detailed descriptions of the parameters can be found in the modules or workflows documentation.

      • user-configurable parameters

        This post contains a more-detailed description of the parameters that are available to configure {scamp}. Some parameters are required, others may be defined from defaults while some may be provided by upstream {scamp} processes.

        Subsections of usage guides

        quickstart

        This is a quickstart guide that should get the pipeline running in most situations. A more detailed description of the structure of the parameters file and the command line usage is in the analysis configuration section.

        Running the pipeline

        For now, the processes are not containerised. All software, packages and libraries must be available from the shell. The {scamp} conda environment provides Nextflow, R and all R packages.

        conda activate /nemo/stp/babs/working/barrinc/conda/envs/scamp

        {scamp} can be used to generate a best-guess parameters file. The file it creates depends on the sample sheet and on the directory structure of the ASF’s data directory in babs/inputs on Nemo. The file should be checked, especially where multiple libraries contribute to individual samples, such as 10X multiome projects.

        The following snippet will use the guess_scamp_file.py script that is included with {scamp} to find the data directory for the project and parse the accompanying design file. When the Conda environment is loaded, NXF_HOME is checked and set to a reasonable default if not defined. But really, NXF_HOME should be defined in your .bashrc alongside the other Nextflow environment variables. If you want to, a path under working could be used, for example /nemo/stp/babs/working/${USER}/nextflow. The PATH is then modified to include the bin for {scamp} so that the guess_scamp_file.py executable from {scamp} can be found in your shell.

        Using nextflow pull, the most recent version of a pipeline can be downloaded into NXF_HOME. In the following chunk we direct a specific version of {scamp} to be downloaded using -revision. This is optional, but recommended to ensure you have the most-recent release available.

        cache {scamp}
        nextflow pull ChristopherBarrington/scamp -revision 
        nextflow pull ChristopherBarrington/scamp

        guess_scamp_file.py includes a set of default parameters, which will need to be updated as we include new protocols. Example usage is shown below, where we indicate the genome that we want to use for the project, the LIMS ID under which the data was produced and the name of the output YAML file. For command line options, use guess_scamp_file.py --help.

        For projects using dataset barcodes (10x Flex, Plex or HTO for example) a barcodes.csv file is required. This is a two-column CSV with “barcode” and “dataset” variables. Each row should be a unique barcode:dataset pair - if a dataset is labelled by multiple barcodes in the project, these should be represented on multiple rows. The “dataset” should match the name of the dataset in the project’s design file (either in the ASF data directory or specified by --design-file). The barcodes and design files are parsed and joined together using the “barcode” as the key. Barcode information is not tracked in the LIMS and must be provided by the scientist.
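The barcodes file and the join described above can be sketched in Python. This is a minimal illustration, not the code in guess_scamp_file.py: the barcode labels, dataset names, LIMS ID placeholders, the design columns and both function names are assumptions made for the example.

```python
import csv
import io

# Invented barcodes.csv contents; barcode labels and dataset names are placeholders.
BARCODES_CSV = """barcode,dataset
BC001,control rep1
BC002,control rep1
BC003,treated rep1
"""

# Invented design rows; a real design file will have its own columns.
DESIGN_ROWS = [
    {"barcode": "BC001", "limsid": "XXX0000A1"},
    {"barcode": "BC002", "limsid": "XXX0000A1"},
    {"barcode": "BC003", "limsid": "XXX0000A1"},
]


def read_barcodes(handle):
    """Read a two-column barcodes CSV and reject repeated barcode:dataset pairs."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != ["barcode", "dataset"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows, seen = [], set()
    for row in reader:
        pair = (row["barcode"], row["dataset"])
        if pair in seen:
            raise ValueError(f"duplicate barcode:dataset pair: {pair}")
        seen.add(pair)
        rows.append(row)
    return rows


def join_on_barcode(barcodes, design):
    """Attach design information to each barcode row, using "barcode" as the key."""
    design_by_barcode = {row["barcode"]: row for row in design}
    joined = []
    for row in barcodes:
        match = design_by_barcode.get(row["barcode"])
        if match is None:
            raise KeyError(f"barcode {row['barcode']} is not in the design")
        joined.append({**match, **row})
    return joined


joined = join_on_barcode(read_barcodes(io.StringIO(BARCODES_CSV)), DESIGN_ROWS)
for row in joined:
    print(row["barcode"], row["dataset"], row["limsid"])
```

Here the dataset "control rep1" is labelled by two barcodes, so it appears on two rows, as the text requires.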

        A collection of assays can be included from which the project type can be defined. The project assays are a curated set of keywords that define what types of data should be expected. For example, --project-assays 3prime 10x will be translated into --project-type 10x-3prime. A list of valid assay names can be found with guess_scamp_file.py --help. If --project-type is not provided, it is derived from --project-assays and vice versa. Only one of --project-type and --project-assays is required, but it is better to provide --project-assays because the assays in --project-type must be hyphen-separated and sorted alphabetically.
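The translation between assays and project type can be sketched as follows. This is an illustration of the rule stated above (hyphen-separated, sorted alphabetically), not the script's own code. The function names are hypothetical, and the use of Python's default string ordering for "alphabetically" is an assumption.

```python
def project_type_from_assays(assays):
    """Sort the assay keywords and join them with hyphens."""
    return "-".join(sorted(assays))


def assays_from_project_type(project_type):
    """Split a hyphen-separated project type back into its assay keywords."""
    return project_type.split("-")


# The order given on the command line does not matter.
print(project_type_from_assays(["3prime", "10x"]))
print(assays_from_project_type("10x-3prime"))
```

With the assays from the text, 3prime and 10x, the first call gives 10x-3prime because "1" sorts before "3".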

        guess_scamp_file.py usage examples
        # find the project data from its LIMS ID
        guess_scamp_file.py \
          --lims-id SC22034 \
          --genome mm10 \
          --project-assays 10x 3prime

        # point directly at the project's data directory
        guess_scamp_file.py \
          --data-path /nemo/stp/babs/inputs/sequencing/data/morisn/christopher.cooke/SC22034 \
          --genome mm10 \
          --project-assays 10x 3prime

        # specify the lab, scientist and LIMS ID
        guess_scamp_file.py \
          --lab morisn \
          --scientist christopher.cooke \
          --lims-id SC22034 \
          --genome mm10 \
          --project-assays 10x 3prime

        # include a barcodes file for a project that uses dataset barcodes
        guess_scamp_file.py \
          --lims-id SC22034 \
          --barcodes-file inputs/barcodes.csv \
          --project-assays 10x flex

        # show all command line options
        guess_scamp_file.py --help

        Check the guessed parameters file! Pay particular attention to the LIMS IDs associated with each dataset, the feature types and sample names!

        The parameters in the guessed scamp_file.yaml should now be checked: values may need to be corrected or amended, or new information included. For example, certain samples may need to be removed or different analysis workflows may be required. Examples of analysis parameters files can be found in the analysis configuration post.

        Once the pipeline parameters are encoded in the parameters file, the pipeline can then be launched using a specific release such as `` or the current version using main. Using a specific tag is recommended for reproducibility.

        If you want to test your configuration file without running any real analysis, you can run Nextflow in stub-run mode:

        nextflow run ChristopherBarrington/scamp -revision  \
          -stub-run -profile stub_run \
          --scamp_file scamp_file.yaml

        This will create empty files instead of analysing data but will produce errors if there is a configuration problem. Your analysis may still fail when it runs though! Once you are confident, you can run the pipeline:

        nextflow run ChristopherBarrington/scamp -revision  \
          --scamp_file scamp_file.yaml

        This should now start the pipeline and show the processes being run for each of the analysis workflows detailed in your configuration file.

        scamp configuration

        These notes illustrate how to configure an analysis using {scamp}. Detailed descriptions of the parameters can be found in the modules or workflows documentation.

        Analysis configuration

        A parameters YAML file is used to describe all aspects of a project and should serve as a record of the parameters used in the pipeline. It will be passed to Nextflow as a parameter using --scamp_file.

        The structure of the parameters file allows aspects of a project to be recorded alongside analysis parameters that can be specified for multiple datasets in a plain text and human-readable format. Parameter keys that begin with underscores are reserved by {scamp} and should not be used in other keys. At the first level, the project (_project), genome (_genome), common dataset parameters (_defaults) and datasets (_datasets) stanzas are specified. Within the datasets stanza, datasets can be freely (but sensibly) named.

        Example configuration file

        In this example for a scRNA-seq project, there are four datasets that will be quantified against mouse using the Cell Ranger mm10 reference from which Seurat objects will be prepared. To keep the file clear I have assumed symlinks are available in an inputs directory to other parts of the filesystem. The inputs/primary_data is a symlink to ASF’s outputs for this project and inputs/10x_indexes is a symlink to the communal 10X indexes resource.

        Example parameters
        _project:
            lab: morisn
            scientist: christopher.cooke
            lims id: SC22034
            babs id: nm322
            type: 10X-3prime

        _genome:
            organism: mus musculus
            assembly: mm10
            ensembl release: 98
            fasta file: inputs/10x_indexes/refdata-gex-mm10-2020-A/fasta/genome.fa
            fasta index file: inputs/10x_indexes/refdata-gex-mm10-2020-A/fasta/genome.fa.fai
            gtf file: inputs/10x_indexes/refdata-gex-mm10-2020-A/genes/genes.gtf

        _defaults:
            fastq paths:
                - inputs/primary_data/220221_A01366_0148_AH7HYGDMXY/fastq
                - inputs/primary_data/220310_A01366_0156_AH5YTYDMXY/fastq
                - inputs/primary_data/220818_A01366_0266_BHCJK7DMXY/fastq
                - inputs/primary_data/230221_A01366_0353_AHNH37DSX5/fastq
            feature types:
                Gene Expression:
                    - COO4671A1
                    - COO4671A2
                    - COO4671A3
                    - COO4671A4
            feature identifiers: name
            workflows:
                - quantification/cell ranger
                - seurat/prepare/cell ranger
            index path: inputs/10x_indexes/refdata-cellranger-mm10-3.0.0

        _datasets:
            stella 120h rep1:
                description: STELLA sorting at 120 hours
                limsid: COO4671A1
                dataset tag: ST120R1

            pecam1 120h rep1:
                description: PECAM1 sorting at 120 hours
                limsid: COO4671A2
                dataset tag: P120R1

            ssea1 120h rep1:
                description: SSEA1 sorting at 120 hours
                limsid: COO4671A3
                dataset tag: SS120R1

            blimp1 + ssea1 120h rep1:
                description: BLIMP1 and SSEA1 sorting at 120 hours
                limsid: COO4671A4
                dataset tag: BS120R1
        
        _project:
            lab: guillemotf
            scientist: sara.ahmeddeprado
            lims id: SC22051
            babs id: sa145
            type: 10X-Multiomics

        _genome:
            assembly: mm10 + mCherry
            organism: mus musculus
            ensembl release: 98
            non-nuclear contigs:
                - chrM
            fasta path: inputs/fastas
            gtf path: inputs/gtfs

        _defaults:
            fastq paths:
                - inputs/primary_data/220406_A01366_0169_AHC3HVDMXY/fastq
                - inputs/primary_data/220407_A01366_0171_AH3W3LDRX2/fastq
                - inputs/primary_data/220420_A01366_0179_BH72WWDMXY/fastq
                - inputs/primary_data/220422_A01366_0180_BHJLLNDSX3/fastq
            feature types:
                Gene Expression:
                    - AHM4688A1
                    - AHM4688A2
                    - AHM4688A3
                Chromatin Accessibility:
                    - AHM4688A4
                    - AHM4688A5
                    - AHM4688A6
            feature identifiers: name
            workflows:
                - quantification/cell ranger arc
                - seurat/prepare/cell ranger arc

        _datasets:
            8 weeks sample1:
                description: 8 weeks old, replicate 1
                limsid:
                    - AHM4688A1
                    - AHM4688A4
                dataset tag: 8WS1

            8 weeks sample2:
                description: 8 weeks old, replicate 2
                limsid:
                    - AHM4688A2
                    - AHM4688A5
                dataset tag: 8WS2

            8 weeks sample3:
                description: 8 weeks old, replicate 3
                limsid:
                    - AHM4688A3
                    - AHM4688A6
                dataset tag: 8WS3
        
        _project:
            lab: morisn
            scientist: christopher.cooke
            lims id: SC22034
            babs id: nm322
            type: 10X-3prime

        _genome:
            organism: mus musculus
            assembly: mm10
            ensembl release: 98
            fasta file: inputs/10x_indexes/refdata-gex-mm10-2020-A/fasta/genome.fa
            fasta index file: inputs/10x_indexes/refdata-gex-mm10-2020-A/fasta/genome.fa.fai
            gtf file: inputs/10x_indexes/refdata-gex-mm10-2020-A/genes/genes.gtf

        _defaults:
            fastq paths:
                - inputs/primary_data/220221_A01366_0148_AH7HYGDMXY/fastq
                - inputs/primary_data/220310_A01366_0156_AH5YTYDMXY/fastq
                - inputs/primary_data/220818_A01366_0266_BHCJK7DMXY/fastq
                - inputs/primary_data/230221_A01366_0353_AHNH37DSX5/fastq
            feature types:
                Gene Expression:
                    - COO4671A1
                    - COO4671A2
                    - COO4671A3
                    - COO4671A4
            feature identifiers: name
            workflows:
                - seurat/prepare/cell ranger
            index path: inputs/10x_indexes/refdata-cellranger-mm10-3.0.0

        _datasets:
            stella 120h rep1:
                description: STELLA sorting at 120 hours
                limsid: COO4671A1
                dataset tag: ST120R1
                quantification path: results/quantification/cell_ranger/outs/stella_120h_rep1
                quantification method: cell ranger

            pecam1 120h rep1:
                description: PECAM1 sorting at 120 hours
                limsid: COO4671A2
                dataset tag: P120R1
                quantification path: results/quantification/cell_ranger/outs/pecam1_120h_rep1
                quantification method: cell ranger

            ssea1 120h rep1:
                description: SSEA1 sorting at 120 hours
                limsid: COO4671A3
                dataset tag: SS120R1
                quantification path: results/quantification/cell_ranger/outs/ssea1_120h_rep1
                quantification method: cell ranger

            blimp1 + ssea1 120h rep1:
                description: BLIMP1 and SSEA1 sorting at 120 hours
                limsid: COO4671A4
                dataset tag: BS120R1
                quantification path: results/quantification/cell_ranger/outs/blimp1_ssea1_120h_rep1
                quantification method: cell ranger
        

        _project includes information about the project rather than parameters that should be applied to datasets. Most of the information in this stanza can be extracted from a path on Nemo and/or the LIMS.

        The _genome stanza is static across most projects though the ensembl release is tied to any index against which the data is aligned or quantified (etc).

        _defaults describes parameters that will be aggregated into every dataset in the _datasets stanza of the project, with the dataset-level parameter taking precedence, so that parameters do not need to be copied and pasted into every dataset. Depending on the analysis workflows, different keys will be expected. In this example we are going to quantify expression of a scRNA-seq dataset so we need to know where the FastQ files are in the file system. The paths (not files) are specified here with fastq paths. The feature identifiers key will be used when the Seurat object is created; specifying “name” will use the gene names (rather than Ensembl identifiers) as feature names. The default parameters stanza typically contains a set of analysis workflows, using the workflows key. This curated list of keywords identifies which workflows should be applied to the dataset(s). In the example we specify two workflows: quantification by Cell Ranger and Seurat object creation. The order of the workflows is not important. The keywords to include can be found in the workflows documentation.
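The way _defaults and dataset-level parameters combine can be pictured with a small Python sketch. It assumes a shallow merge in which a dataset-level key replaces the default of the same name outright; {scamp} itself may merge nested values differently. The function name and the workflows override on the second dataset are hypothetical.

```python
def apply_defaults(defaults, datasets):
    """Return one parameter set per dataset; dataset-level values win over defaults."""
    return {name: {**defaults, **params} for name, params in datasets.items()}


defaults = {
    "feature identifiers": "name",
    "workflows": ["quantification/cell ranger", "seurat/prepare/cell ranger"],
}

datasets = {
    "stella 120h rep1": {"limsid": "COO4671A1", "dataset tag": "ST120R1"},
    # Hypothetical override: this dataset only prepares a Seurat object.
    "pecam1 120h rep1": {
        "limsid": "COO4671A2",
        "dataset tag": "P120R1",
        "workflows": ["seurat/prepare/cell ranger"],
    },
}

resolved = apply_defaults(defaults, datasets)
for name, params in resolved.items():
    print(name, params["workflows"])
```

The first dataset inherits both default workflows, while the second keeps its own workflows list.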

        Each dataset is described in _datasets. Dataset stanzas must have unique names and ideally not contain odd characters. {scamp} will try to make the key directory-safe, however.
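To check in advance how dataset names might look as directory names, the sketch below replaces every run of non-alphanumeric characters with an underscore. This rule is an assumption, not {scamp}'s documented behaviour. It was chosen because it reproduces the directory names in the quantification path values of the example parameters. Both function names are hypothetical.

```python
import re


def directory_safe(name):
    """Replace each run of characters other than letters and digits with "_"."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")


def find_collisions(names):
    """Group dataset names that would map to the same directory-safe key."""
    by_key = {}
    for name in names:
        by_key.setdefault(directory_safe(name), []).append(name)
    return {key: group for key, group in by_key.items() if len(group) > 1}


names = ["stella 120h rep1", "blimp1 + ssea1 120h rep1"]
for name in names:
    print(name, "->", directory_safe(name))
print(find_collisions(names))
```

Two unique dataset names can still collapse to the same key under a rule like this, for example "a b" and "a+b", which is one reason to avoid odd characters.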

        user-configurable parameters

        This post contains a more-detailed description of the parameters that are available to configure {scamp}. Some parameters are required, others may be defined from defaults while some may be provided by upstream {scamp} processes.