scamp configuration

These notes illustrate how to configure an analysis using {scamp}. Detailed descriptions of the parameters can be found in the modules or workflows documentation.

Analysis configuration

A parameters YAML file is used to describe all aspects of a project and should serve as a record of the parameters used in the pipeline. It is passed to Nextflow as a parameter using --scamp_file.

The structure of the parameters file allows aspects of a project to be recorded alongside analysis parameters, which can be specified for multiple datasets, in a plain-text, human-readable format. Keys that begin with an underscore are reserved by {scamp}; do not begin any other key with an underscore. At the first level, the project (_project), genome (_genome), common dataset parameters (_defaults) and datasets (_datasets) stanzas are specified. Within the _datasets stanza, datasets can be freely (but sensibly) named.
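The first-level layout described above can be checked with a few lines of Python. This is only a sketch and not part of {scamp}: the function name check_parameters is hypothetical, the file is assumed to have already been parsed into a dictionary, and treating all four stanzas as mandatory is an assumption rather than something the documentation states.

```python
# Stanza names are taken from the text above; treating all four as
# mandatory is an assumption made for this illustration.
EXPECTED_STANZAS = ("_project", "_genome", "_defaults", "_datasets")


def check_parameters(config):
    """Return a list of problems found in a parsed parameters file (hypothetical helper)."""
    problems = []
    # Report any first-level stanza that is absent.
    for stanza in EXPECTED_STANZAS:
        if stanza not in config:
            problems.append(f"missing stanza: {stanza}")
    # Dataset names are free-form but must not use the reserved underscore prefix.
    for name in config.get("_datasets") or {}:
        if name.startswith("_"):
            problems.append(f"dataset name uses reserved underscore prefix: {name}")
    return problems


config = {
    "_project": {"lab": "morisn"},
    "_genome": {"assembly": "mm10"},
    "_defaults": {"feature identifiers": "name"},
    "_datasets": {"stella 120h rep1": {"limsid": "COO4671A1"}},
}
print(check_parameters(config))
```

An empty list means that none of the checked problems were found; it does not mean that the file is valid for any particular workflow.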

Example configuration file

In this example for a scRNA-seq project, four datasets will be quantified against the mouse genome using the Cell Ranger mm10 reference, and Seurat objects will then be prepared from the quantification. To keep the file clear I have assumed symlinks are available in an inputs directory to other parts of the filesystem. The inputs/primary_data is a symlink to ASF’s outputs for this project and inputs/10x_indexes is a symlink to the communal 10X indexes resource.

Example parameters
_project:
    lab: morisn
    scientist: christopher.cooke
    lims id: SC22034
    babs id: nm322
    type: 10X-3prime

_genome:
    organism: mus musculus
    assembly: mm10
    ensembl release: 98
    fasta file: inputs/10x_indexes/refdata-gex-mm10-2020-A/fasta/genome.fa
    fasta index file: inputs/10x_indexes/refdata-gex-mm10-2020-A/fasta/genome.fa.fai
    gtf file: inputs/10x_indexes/refdata-gex-mm10-2020-A/genes/genes.gtf

_defaults:
    fastq paths:
        - inputs/primary_data/220221_A01366_0148_AH7HYGDMXY/fastq
        - inputs/primary_data/220310_A01366_0156_AH5YTYDMXY/fastq
        - inputs/primary_data/220818_A01366_0266_BHCJK7DMXY/fastq
        - inputs/primary_data/230221_A01366_0353_AHNH37DSX5/fastq
    feature types:
        Gene Expression:
            - COO4671A1
            - COO4671A2
            - COO4671A3
            - COO4671A4
    feature identifiers: name
    workflows:
        - quantification/cell ranger
        - seurat/prepare/cell ranger
    index path: inputs/10x_indexes/refdata-cellranger-mm10-3.0.0

_datasets:
    stella 120h rep1:
        description: STELLA sorting at 120 hours
        limsid: COO4671A1
        dataset tag: ST120R1

    pecam1 120h rep1:
        description: PECAM1 sorting at 120 hours
        limsid: COO4671A2
        dataset tag: P120R1

    ssea1 120h rep1:
        description: SSEA1 sorting at 120 hours
        limsid: COO4671A3
        dataset tag: SS120R1

    blimp1 + ssea1 120h rep1:
        description: BLIMP1 and SSEA1 sorting at 120 hours
        limsid: COO4671A4
        dataset tag: BS120R1

Example parameters (10X-Multiomics project)
_project:
    lab: guillemotf
    scientist: sara.ahmeddeprado
    lims id: SC22051
    babs id: sa145
    type: 10X-Multiomics

_genome:
    assembly: mm10 + mCherry
    organism: mus musculus
    ensembl release: 98
    non-nuclear contigs:
        - chrM
    fasta path: inputs/fastas
    gtf path: inputs/gtfs

_defaults:
    fastq paths:
        - inputs/primary_data/220406_A01366_0169_AHC3HVDMXY/fastq
        - inputs/primary_data/220407_A01366_0171_AH3W3LDRX2/fastq
        - inputs/primary_data/220420_A01366_0179_BH72WWDMXY/fastq
        - inputs/primary_data/220422_A01366_0180_BHJLLNDSX3/fastq
    feature types:
        Gene Expression:
            - AHM4688A1
            - AHM4688A2
            - AHM4688A3
        Chromatin Accessibility:
            - AHM4688A4
            - AHM4688A5
            - AHM4688A6
    feature identifiers: name
    workflows:
        - quantification/cell ranger arc
        - seurat/prepare/cell ranger arc

_datasets:
    8 weeks sample1:
        description: 8 weeks old, replicate 1
        limsid:
            - AHM4688A1
            - AHM4688A4
        dataset tag: 8WS1

    8 weeks sample2:
        description: 8 weeks old, replicate 2
        limsid:
            - AHM4688A2
            - AHM4688A5
        dataset tag: 8WS2

    8 weeks sample3:
        description: 8 weeks old, replicate 3
        limsid:
            - AHM4688A3
            - AHM4688A6
        dataset tag: 8WS3

Example parameters (scRNA-seq project with existing Cell Ranger quantification)
_project:
    lab: morisn
    scientist: christopher.cooke
    lims id: SC22034
    babs id: nm322
    type: 10X-3prime

_genome:
    organism: mus musculus
    assembly: mm10
    ensembl release: 98
    fasta file: inputs/10x_indexes/refdata-gex-mm10-2020-A/fasta/genome.fa
    fasta index file: inputs/10x_indexes/refdata-gex-mm10-2020-A/fasta/genome.fa.fai
    gtf file: inputs/10x_indexes/refdata-gex-mm10-2020-A/genes/genes.gtf

_defaults:
    fastq paths:
        - inputs/primary_data/220221_A01366_0148_AH7HYGDMXY/fastq
        - inputs/primary_data/220310_A01366_0156_AH5YTYDMXY/fastq
        - inputs/primary_data/220818_A01366_0266_BHCJK7DMXY/fastq
        - inputs/primary_data/230221_A01366_0353_AHNH37DSX5/fastq
    feature types:
        Gene Expression:
            - COO4671A1
            - COO4671A2
            - COO4671A3
            - COO4671A4
    feature identifiers: name
    workflows:
        - seurat/prepare/cell ranger
    index path: inputs/10x_indexes/refdata-cellranger-mm10-3.0.0

_datasets:
    stella 120h rep1:
        description: STELLA sorting at 120 hours
        limsid: COO4671A1
        dataset tag: ST120R1
        quantification path: results/quantification/cell_ranger/outs/stella_120h_rep1
        quantification method: cell ranger

    pecam1 120h rep1:
        description: PECAM1 sorting at 120 hours
        limsid: COO4671A2
        dataset tag: P120R1
        quantification path: results/quantification/cell_ranger/outs/pecam1_120h_rep1
        quantification method: cell ranger

    ssea1 120h rep1:
        description: SSEA1 sorting at 120 hours
        limsid: COO4671A3
        dataset tag: SS120R1
        quantification path: results/quantification/cell_ranger/outs/ssea1_120h_rep1
        quantification method: cell ranger

    blimp1 + ssea1 120h rep1:
        description: BLIMP1 and SSEA1 sorting at 120 hours
        limsid: COO4671A4
        dataset tag: BS120R1
        quantification path: results/quantification/cell_ranger/outs/blimp1_ssea1_120h_rep1
        quantification method: cell ranger

_project includes information about the project rather than parameters that should be applied to datasets. Most of the information in this stanza can be extracted from a path on Nemo and/or the LIMS.

The _genome stanza is static across most projects, though the Ensembl release is tied to any index against which the data are aligned or quantified.

_defaults describes parameters that will be aggregated into every dataset in the _datasets stanza of the project, with the dataset-level parameter taking precedence, so that parameters need not be copied and pasted into every dataset. Different keys will be expected depending on the analysis workflows. In this example we are going to quantify expression of a scRNA-seq dataset, so we need to know where the FastQ files are in the file system; the paths (not files) are specified here with fastq paths. The feature identifiers key will be used when the Seurat object is created; specifying “name” will use the gene names (rather than Ensembl identifiers) as feature names. The default parameters stanza typically contains a set of analysis workflows, given with the workflows key. This curated list of keywords identifies which workflows should be applied to the dataset(s). In the example we specify two workflows: quantification by Cell Ranger and Seurat object creation. The order of the workflows is not important. The keywords to include can be found in the workflows documentation.
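The precedence rule for _defaults can be illustrated with a short Python sketch. It is not {scamp}'s implementation: the function name apply_defaults is hypothetical, the merge is assumed to be shallow (a dataset-level key replaces the whole default value), and the overriding index path below is an invented value used only to show the override.

```python
def apply_defaults(config):
    """Merge _defaults into each dataset; dataset-level keys win (assumed shallow merge)."""
    defaults = config.get("_defaults") or {}
    datasets = config.get("_datasets") or {}
    # Later entries in a dict literal override earlier ones, so the
    # dataset's own parameters replace any default with the same key.
    return {name: {**defaults, **(params or {})} for name, params in datasets.items()}


config = {
    "_defaults": {
        "feature identifiers": "name",
        "index path": "inputs/10x_indexes/refdata-cellranger-mm10-3.0.0",
    },
    "_datasets": {
        "stella 120h rep1": {"limsid": "COO4671A1"},
        # Hypothetical override, not taken from the example project.
        "pecam1 120h rep1": {"limsid": "COO4671A2", "index path": "inputs/10x_indexes/another-index"},
    },
}

for name, params in apply_defaults(config).items():
    print(name, params["index path"])
```

Under these assumptions the first dataset inherits the default index path and the second keeps its own value, while both inherit feature identifiers.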

Each dataset is described in _datasets. Dataset stanzas must have unique names and should ideally avoid unusual characters, although {scamp} will try to make each key directory-safe.
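The exact rule {scamp} uses to make a key directory-safe is not described here. As a rough guide, the sketch below uses a hypothetical helper, directory_safe, that assumes every run of characters other than letters and digits becomes a single underscore. That assumption reproduces the directory names seen in the quantification path values of the example above, but it may differ from the real behaviour for other names.

```python
import re


def directory_safe(name):
    """Collapse runs of non-alphanumeric characters to "_" (assumed rule, not scamp's own)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")


for dataset in ("stella 120h rep1", "blimp1 + ssea1 120h rep1"):
    print(dataset, "->", directory_safe(dataset))
```

Note that two different dataset names could map to the same directory name under such a rule, which is one reason to keep names simple.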