scamp
Coming soon...
Coming soon...
This is a quickstart guide that should get the pipeline running in most situations. A more detailed description of the structure of the parameters file and the command line usage is in the analysis configuration section.
For now, the processes are not containerised. All software, packages and libraries must be available from the shell. The {scamp} conda environment provides Nextflow, R and all R packages.
conda activate /nemo/stp/babs/working/barrinc/conda/envs/scamp
{scamp} can be used to generate a best-guess parameters file. The file it creates depends on the sample sheet and directory structure of the ASF's data directory on Nemo, under `babs/inputs`. The guessed file should be checked - especially where multiple libraries contribute to individual samples, such as in 10X multiome projects.
The following snippet will use the `guess_scamp_file.py` script that is included with {scamp} to find the data directory for the project and parse the accompanying design file. When the conda environment is loaded, `NXF_HOME` is checked and set to a reasonable default if not defined. Ideally, though, `NXF_HOME` should be defined in your `.bashrc` alongside the other Nextflow environment variables; a path under `working` could be used, for example `/nemo/stp/babs/working/${USER}/nextflow`. The `PATH` is then modified to include the `bin` directory of {scamp} so that the `guess_scamp_file.py` executable can be found in your shell.
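As a minimal sketch, the corresponding `.bashrc` line could look like the following (the path is only the suggestion from above; adjust to taste):

```bash
# define NXF_HOME alongside any other Nextflow environment variables
export NXF_HOME=/nemo/stp/babs/working/${USER}/nextflow
```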
Using `nextflow pull`, the most recent version of a pipeline can be downloaded into `NXF_HOME`. In the following chunk we direct a specific version of {scamp} to be downloaded using `-revision`. This is optional but recommended, to ensure the most recent release is available locally.
nextflow pull ChristopherBarrington/scamp -revision
nextflow pull ChristopherBarrington/scamp
`guess_scamp_file.py` includes a set of default parameters, which will need to be updated as we include new protocols. Example usage is shown below, where we indicate the genome that we want to use for the project, the LIMS ID under which the data was produced and the name of the output YAML file. For command line options, use `guess_scamp_file.py --help`.
For projects using dataset barcodes (10x Flex, Plex or HTO, for example) a `barcodes.csv` file is required. This is a two-column CSV with “barcode” and “dataset” variables. Each row should be a unique barcode:dataset pair - if a dataset is labelled by multiple barcodes in the project, these should be represented on multiple rows. The “dataset” should match the name of the dataset in the project's design file (either in the ASF `data` directory or specified by `--design-file`). The barcodes and design files are parsed and joined together using the “barcode” as the key. Barcode information is not tracked in the LIMS and must be provided by the scientist.
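As a sketch, a `barcodes.csv` for a hypothetical project with two datasets might look like the following (the barcode and dataset names are invented; “sample 2” is labelled by two barcodes and so appears on two rows):

```csv
barcode,dataset
BC001,sample 1
BC002,sample 2
BC003,sample 2
```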
A collection of assays can be included, from which the project `type` can be defined. The project's assays are a curated set of keywords that define what types of data should be expected. For example, `--project-assays 3prime 10x` will be translated into `--project-type 10x-3prime`. A list of valid assay names can be found with `guess_scamp_file.py --help`. If `--project-type` is not provided, it is derived from `--project-assays`, and vice versa. Only one of `--project-type` and `--project-assays` is required, but it is better to provide `--project-assays`; the assays in `--project-type` must be hyphen-separated and sorted alphabetically.
guess_scamp_file.py \
--lims-id SC22034 \
--genome mm10 \
--project-assays 10x 3prime
guess_scamp_file.py \
--data-path /nemo/stp/babs/inputs/sequencing/data/morisn/christopher.cooke/SC22034 \
--genome mm10 \
--project-assays 10x 3prime
guess_scamp_file.py \
--lab morisn \
--scientist christopher.cooke \
--lims-id SC22034 \
--genome mm10 \
--project-assays 10x 3prime
guess_scamp_file.py \
--lims-id SC22034 \
--barcodes-file inputs/barcodes.csv \
--project-assays 10x flex
guess_scamp_file.py --help
Check the guessed parameters file! Pay particular attention to the LIMS IDs associated with each dataset, the feature types and the sample names!
The parameters in the guessed `scamp_file.yaml` should now be checked; values may need to be corrected or amended, or new information included. For example, certain samples may need to be removed or different analysis workflows may be required. Examples of analysis parameters files can be found in the analysis configuration post.
Once the pipeline parameters are encoded in the parameters file, the pipeline can be launched using a specific release such as `` or the current version using `main`. Using a specific tag is recommended for reproducibility.
If you want to test your configuration file without running any real analysis, you can run Nextflow in `-stub-run` mode:
nextflow run ChristopherBarrington/scamp -revision \
-stub-run -profile stub_run \
--scamp_file scamp_file.yaml
This will create empty files instead of analysing data but will produce errors if there is a configuration problem. Your analysis may still fail when it runs though! Once you are confident, you can run the pipeline:
nextflow run ChristopherBarrington/scamp -revision \
--scamp_file scamp_file.yaml
This should now start the pipeline and show the processes being run for each of the analysis `workflows` detailed in your configuration file.
These notes illustrate how to configure an analysis using {scamp}. Detailed descriptions of the parameters can be found in the modules or workflows documentation.
A parameters YAML file is used to describe all aspects of a project and should serve as a record of the parameters used in the pipeline. It will be passed to Nextflow as a parameter using `--scamp_file`.
The structure of the parameters file allows aspects of a project to be recorded alongside analysis parameters, which can be specified for multiple datasets, in a plain-text, human-readable format. Parameter keys that begin with underscores are reserved by {scamp} and should not be used for other keys. At the first level, the project (`_project`), genome (`_genome`), common dataset parameters (`_defaults`) and dataset (`_datasets`) stanzas are specified. Within the datasets stanza, datasets can be freely (but sensibly) named.
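As a minimal sketch, the top level of the file has the following shape (stanza contents elided):

```yaml
_project:
  # information about the project itself
_genome:
  # the genome and annotation used for the project
_defaults:
  # parameters applied to every dataset
_datasets:
  # one freely named stanza per dataset
```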
In this example for a scRNA-seq project, there are four datasets that will be quantified against mouse using the Cell Ranger mm10 reference, from which Seurat objects will be prepared. To keep the file clear, I have assumed symlinks are available in an `inputs` directory to other parts of the filesystem: `inputs/primary_data` is a symlink to ASF's outputs for this project and `inputs/10x_indexes` is a symlink to the communal 10X indexes resource.
_project:
  lab: morisn
  scientist: christopher.cooke
  lims id: SC22034
  babs id: nm322
  type: 10X-3prime

_genome:
  organism: mus musculus
  assembly: mm10
  ensembl release: 98
  fasta file: inputs/10x_indexes/refdata-gex-mm10-2020-A/fasta/genome.fa
  fasta index file: inputs/10x_indexes/refdata-gex-mm10-2020-A/fasta/genome.fa.fai
  gtf file: inputs/10x_indexes/refdata-gex-mm10-2020-A/genes/genes.gtf

_defaults:
  fastq paths:
    - inputs/primary_data/220221_A01366_0148_AH7HYGDMXY/fastq
    - inputs/primary_data/220310_A01366_0156_AH5YTYDMXY/fastq
    - inputs/primary_data/220818_A01366_0266_BHCJK7DMXY/fastq
    - inputs/primary_data/230221_A01366_0353_AHNH37DSX5/fastq
  feature types:
    Gene Expression:
      - COO4671A1
      - COO4671A2
      - COO4671A3
      - COO4671A4
  feature identifiers: name
  workflows:
    - quantification/cell ranger
    - seurat/prepare/cell ranger
  index path: inputs/10x_indexes/refdata-cellranger-mm10-3.0.0

_datasets:
  stella 120h rep1:
    description: STELLA sorting at 120 hours
    limsid: COO4671A1
    dataset tag: ST120R1

  pecam1 120h rep1:
    description: PECAM1 sorting at 120 hours
    limsid: COO4671A2
    dataset tag: P120R1

  ssea1 120h rep1:
    description: SSEA1 sorting at 120 hours
    limsid: COO4671A3
    dataset tag: SS120R1

  blimp1 + ssea1 120h rep1:
    description: BLIMP1 and SSEA1 sorting at 120 hours
    limsid: COO4671A4
    dataset tag: BS120R1
_project:
  lab: guillemotf
  scientist: sara.ahmeddeprado
  lims id: SC22051
  babs id: sa145
  type: 10X-Multiomics

_genome:
  assembly: mm10 + mCherry
  organism: mus musculus
  ensembl release: 98
  non-nuclear contigs:
    - chrM
  fasta path: inputs/fastas
  gtf path: inputs/gtfs

_defaults:
  fastq paths:
    - inputs/primary_data/220406_A01366_0169_AHC3HVDMXY/fastq
    - inputs/primary_data/220407_A01366_0171_AH3W3LDRX2/fastq
    - inputs/primary_data/220420_A01366_0179_BH72WWDMXY/fastq
    - inputs/primary_data/220422_A01366_0180_BHJLLNDSX3/fastq
  feature types:
    Gene Expression:
      - AHM4688A1
      - AHM4688A2
      - AHM4688A3
    Chromatin Accessibility:
      - AHM4688A4
      - AHM4688A5
      - AHM4688A6
  feature identifiers: name
  workflows:
    - quantification/cell ranger arc
    - seurat/prepare/cell ranger arc

_datasets:
  8 weeks sample1:
    description: 8 weeks old, replicate 1
    limsid:
      - AHM4688A1
      - AHM4688A4
    dataset tag: 8WS1

  8 weeks sample2:
    description: 8 weeks old, replicate 2
    limsid:
      - AHM4688A2
      - AHM4688A5
    dataset tag: 8WS2

  8 weeks sample3:
    description: 8 weeks old, replicate 3
    limsid:
      - AHM4688A3
      - AHM4688A6
    dataset tag: 8WS3
_project:
  lab: morisn
  scientist: christopher.cooke
  lims id: SC22034
  babs id: nm322
  type: 10X-3prime

_genome:
  organism: mus musculus
  assembly: mm10
  ensembl release: 98
  fasta file: inputs/10x_indexes/refdata-gex-mm10-2020-A/fasta/genome.fa
  fasta index file: inputs/10x_indexes/refdata-gex-mm10-2020-A/fasta/genome.fa.fai
  gtf file: inputs/10x_indexes/refdata-gex-mm10-2020-A/genes/genes.gtf

_defaults:
  fastq paths:
    - inputs/primary_data/220221_A01366_0148_AH7HYGDMXY/fastq
    - inputs/primary_data/220310_A01366_0156_AH5YTYDMXY/fastq
    - inputs/primary_data/220818_A01366_0266_BHCJK7DMXY/fastq
    - inputs/primary_data/230221_A01366_0353_AHNH37DSX5/fastq
  feature types:
    Gene Expression:
      - COO4671A1
      - COO4671A2
      - COO4671A3
      - COO4671A4
  feature identifiers: name
  workflows:
    - seurat/prepare/cell ranger
  index path: inputs/10x_indexes/refdata-cellranger-mm10-3.0.0

_datasets:
  stella 120h rep1:
    description: STELLA sorting at 120 hours
    limsid: COO4671A1
    dataset tag: ST120R1
    quantification path: results/quantification/cell_ranger/outs/stella_120h_rep1
    quantification method: cell ranger

  pecam1 120h rep1:
    description: PECAM1 sorting at 120 hours
    limsid: COO4671A2
    dataset tag: P120R1
    quantification path: results/quantification/cell_ranger/outs/pecam1_120h_rep1
    quantification method: cell ranger

  ssea1 120h rep1:
    description: SSEA1 sorting at 120 hours
    limsid: COO4671A3
    dataset tag: SS120R1
    quantification path: results/quantification/cell_ranger/outs/ssea1_120h_rep1
    quantification method: cell ranger

  blimp1 + ssea1 120h rep1:
    description: BLIMP1 and SSEA1 sorting at 120 hours
    limsid: COO4671A4
    dataset tag: BS120R1
    quantification path: results/quantification/cell_ranger/outs/blimp1_ssea1_120h_rep1
    quantification method: cell ranger
`_project` includes information about the project rather than parameters that should be applied to datasets. Most of the information in this stanza can be extracted from a path on Nemo and/or the LIMS.
The `_genome` stanza is static across most projects, though the `ensembl release` is tied to any index against which the data is aligned or quantified (etc).
`_defaults` describes parameters that will be aggregated into every dataset in the `_datasets` stanza of the project, with the dataset-level parameter taking precedence (so we don't have a big copy/paste list of parameters). Depending on the analysis workflows, different keys will be expected. In this example we are going to quantify expression of a scRNA-seq dataset, so we need to know where the FastQ files are in the file system; the paths (not files) are specified here with `fastq paths`. The `feature identifiers` will be used when the Seurat object is created; specifying “name” will use the gene names (rather than Ensembl identifiers) as feature names. The default parameters stanza typically contains a set of analysis workflows, using the `workflows` key. This curated list of keywords identifies which workflows should be applied to the dataset(s). In the example we specify two workflows: quantification by Cell Ranger and Seurat object creation. The order of the workflows is not important. The keywords to include can be found in the workflows documentation.
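To illustrate the precedence (the dataset names and the custom index path are invented), a key defined at the dataset level overrides the same key inherited from `_defaults`:

```yaml
_defaults:
  feature identifiers: name
  index path: inputs/10x_indexes/refdata-cellranger-mm10-3.0.0

_datasets:
  dataset one:
    limsid: COO4671A1                               # inherits both default values
  dataset two:
    limsid: COO4671A2
    index path: inputs/10x_indexes/a_custom_index   # overrides the default value
```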
Each dataset is described in `_datasets`. Dataset stanzas must have unique names and ideally not contain odd characters; {scamp} will try to make the key directory-safe, however (in the example above, the key “stella 120h rep1” corresponds to the directory-safe “stella_120h_rep1” seen in the quantification paths).
This post contains a more-detailed description of the parameters that are available to configure {scamp}. Some parameters are required, others may be defined from defaults while some may be provided by upstream {scamp} processes.
These guides will hopefully help you to add features to {scamp}. New features can be added by writing modules; these are the basic building blocks around which a workflow is written. A workflow consists of several processes that are used in concert to achieve something. Workflows may be thought of as independent pipelines; in {scamp} we can chain multiple pipelines together to provide flexibility for the analysis.
Writing a module requires a script (in any language) to be written alongside a simple Nextflow process definition. Together these define how the input data is processed and what outputs are produced. Each module is documented, so it is a self-contained unit.
A workflow can include multiple modules and is where the management of parameters occurs. In the workflow, user parameters are manipulated and augmented with the output of processes so that successive processes can be managed to complete an analysis. Workflows can be nested into related topics, with workflows being able to initiate (sub)workflows (etc). Each workflow is documented in a `readme.yaml` that sits alongside its Nextflow file.
Coming soon...
Coming soon...
Modules represent specific steps of a pipeline that can be reused in multiple instances. A module should be written to be generic and not specifically tied to a pipeline, workflow or (sub)workflow. Each module performs a specific task and usually includes only a few different programs.
Working modules can be written independently from their inclusion in a pipeline, so do not worry about learning Nextflow if you don't want to - you can write a module script that can be wrapped in Nextflow later.
A suggested module structure could be as follows. The example module is called “a_new_module” and contains one subdirectory and four files, which are described more thoroughly below. Briefly, the `main.nf` file is where the Nextflow process is defined; this is the part of the module that controls the execution of the `main.sh` script (in this example). The `stub.sh` is an optional script that can be used to generate placeholder output files so that a pipeline can be tested without taking the time to analyse any data. The `readme.yaml` will be used to create documentation for this website.
a_new_module/
|-- main.nf
|-- readme.yaml
`-- templates
|-- main.sh
`-- stub.sh
The following are suggestions; this is the way that I have been writing modules. But there is flexibility: if you don't like the way I have written the scripts, you don't have to do it the same way!
Nextflow's process documentation
The process defines the context in which a processing step is executed on a set of inputs; a single process can become multiple tasks where each task has a different set of input parameters for the process.
An example process, “complicated_analysis”, is defined below in the `main.nf` file. The really important parts are the `input`, `output` and `script` stanzas.
The `input`s to a process are passed as channels from the Nextflow pipeline. The order and type of each channel is important; the definitions here must be adhered to in the pipeline. In this example, there are four inputs: `opt`, `tag`, `sample` and `db`. Their types are specified as either `val` or `file`. A `val` is a value which can be substituted into the script. The `file` will be a symlink to the target, named in this case “db”.
For {scamp} processes, the `opt` input should be used universally. The `opt` channel is a `map` of key/value pairs that can be accessed by the configuration file, allowing pipeline parameters that are not necessarily used in the process to be accessed outside the task and used to track input parameters in output channels. Beware, though, that only variables which affect the process's execution should be included, since extraneous values could invalidate the cache. The `tag` is a string that will be added to the Nextflow output log to identify an individual task. If omitted, a number is shown in the log instead.
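For illustration only - the exact keys depend on the workflow and are not defined here - an `opt` map for a task might carry parameters such as:

```groovy
// hypothetical contents of an `opt` map for one task
opt = ['dataset name': 'stella 120h rep1',
       'dataset tag' : 'ST120R1',
       'lims id'     : 'SC22034']
```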
The `output`s of a process are the files or variables produced by the script. The “complicated_analysis” module emits three output channels: the `opt` without modification from its input, a `task.yaml` file to track the software versions and parameters of a task, and the analysis output file, `output.file`. These are emitted to the pipeline in channels named `opt`, `task` and `output`.
For {scamp} processes, the `opt` and `task` outputs should be used. The `task` output may be used in the future to compose markdown reports.
The `script` stanza defines what analysis actually happens. I favour using templates here so that the scripts are kept separate from Nextflow. In this example, if the user has provided the `-stub-run` argument when invoking the pipeline, the `stub.sh` script is executed; otherwise `main.sh` will be executed.
Other Nextflow directives can be included but may not be completely relevant in the context of a module. For example, using `publishDir` should be the choice of the pipeline creator, so it may not be sensible to include it here. Directives included here can be overridden by a suitable configuration file, however. In this case we include some resource requests - `cpus`, `memory` and `time` - but no execution method (eg `SLURM`), which should be defined at execution by the user.
process complicated_analysis {
  tag "$tag"

  cpus 16
  memory '64GB'
  time '3d'

  input:
  val opt
  val tag
  val sample
  file 'db'

  output:
  val opt, emit: opt
  path 'task.yaml', emit: task
  path 'output.file', emit: output

  script:
  template workflow.stubRun ? 'stub.sh' : 'main.sh'
}
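Directives such as the resource requests above can be overridden at run time; a sketch of a user-supplied configuration file (the values here are illustrative, not defaults) might be:

```groovy
// hypothetical configuration supplied with -c, overriding the module's directives
process {
    executor = 'slurm'
    withName: 'complicated_analysis' {
        cpus   = 4
        memory = '16GB'
        time   = '6h'
    }
}
```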
Nextflow's script documentation
Nextflow is language agnostic: so long as the interpreter is available in the task's `PATH`, the script should run. These scripts can be tested outside Nextflow with equivalent parameters passed as environment variables, for example. Containers can be used and should be included in the directives of the process.
In this example there are two programs being used to create an output file from two inputs. The first tool uses the task's `sample` variable and the `db` file from the `input`s. The value of `sample` is interpolated into the script by `$sample`. For `db`, a symlink is staged in the work directory of the task between the target file and “db”, so we can specify `db` in the script as if it were that file, irrespective of its location in the filesystem.
Once `analysis_tool` has completed its work, the intermediate output file is parsed and `output.file` is written. Nextflow will provide this file to the pipeline since it was listed in the `output` stanza for the process.
The `task.yaml` file can be aggregated across workflow tasks, processes and the pipeline, and could be used in the future so that task-specific information and software versions can be included in reports.
An R script could be used here too, specifying `Rscript` instead of `bash` in the shebang line. Nextflow variables are similarly interpolated into the script, though, so be wary when accessing lists. Writing `task.yaml` can be taken care of using the [{scampr} package][gh scampr].
Nextflow will interpolate variables using `$variable`, so any scripts using `$` may have unexpected behaviour. Where possible, use non-dollar alternatives or escape the symbol.
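For example (assuming a bash template), a Nextflow variable is interpolated when the template is rendered, whereas a literal shell dollar must be escaped:

```bash
# ${task.index} is interpolated by Nextflow before the script runs
echo "this is task ${task.index}"

# \$HOME is escaped so that bash, not Nextflow, expands it at run time
echo "home directory is \$HOME"
```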
#! bash

analysis_tool --sample $sample --database db --parameter 100 --output intermediate.file
cat intermediate.file | parsing_tool > output.file

# write task information to a (yaml) file
cat <<-END_TASK > task.yaml
'${task.process}':
  task:
    '${task.index}':
      params:
        sample: $sample
      meta:
        workDir: `pwd`
  process:
    ext: []
  versions:
    analysis tool: `analysis_tool --version`
    parsing tool: `parsing_tool -v`
END_TASK
#! Rscript

library(magrittr)
library(scampr)

log_message('making a connection to biomart', level='main')

task.process <- "${task.process}"
task.index <- "${task.index}"

list(nvalues = "$nvalues") |>
  assign_and_record_task()

data.frame(x=rnorm(n=nvalues), y=rnorm(n=nvalues)) |>
  saveRDS(file='data.rds')
The optional `stub.sh` is an alternative script that can be executed when the user invokes `-stub-run`. The idea of this script is to create the output files expected by the pipeline without expending computational resource. In this way we can test how processes and channels interact in the pipeline without conjuring test data or worrying about cache validity.
The example below simply uses `touch` to create output files with no content.
#! bash

touch output.file

# write task information to a (yaml) file
cat <<-END_TASK > task.yaml
'${task.process}':
  task:
    '${task.index}':
      params:
        sample: $sample
      meta:
        workDir: `pwd`
  process:
    ext: []
  versions:
    analysis tool: `analysis_tool --version`
    parsing tool: `parsing_tool -v`
END_TASK
Each module should be documented using the `readme.yaml` file. This file will be used to populate the module documentation on this website.
name: A new module

description: |
  A short description of the module's function.

tags:
  - lowercase
  - strings

tools:
  name of software:
    description: A markdown-ready description - pillaged from its website!
    homepage: url, could be github
    documentation: maybe a readthedocs
    source: url to (eg) github
    doi: doi
    licence: eg MIT or GPL-3
    ext: extra arguments identifier
    multithreaded:
      - list of features
      - eg "multithreaded"
      - that appear in module documentation

input:
  - name: opt
    type: map
    description: A map of task-specific variables.
  - name: tag
    type: string
    description: A unique identifier to use in the tag directive.

output:
  - name: opt
    type: map
    description: A map of task-specific variables.
  - name: task
    type: file
    description: YAML-formatted file of task parameters and software versions used by the process.
    pattern: task.yaml

channel tags:
  - ':channel_1': Description of channel 1, without the shared root in the tag.
  - ':channel_2': Description of channel 2.

authors:
  - "@ChristopherBarrington"
A template module documentation file can be created using `hugo`. Suppose we wanted to add documentation to a new module for `cellranger count`, stored in `scamp/modules/cell_ranger/count`. Setting the environment variable `MODULE_PATH=modules/cell_ranger/count` and using Hugo as below will create a template `readme.md` in the module, which is subsequently renamed to a YAML file.
hugo new --kind module-readme \
--contentDir scamp \
${MODULE_PATH}/readme.md && \
rename --remove-extension \
--append \
.yaml scamp/$_
hugo new --kind module-readme \
--contentDir scamp \
${MODULE_PATH}/readme.md && \
rename .md .yaml scamp/$_
You’re on your own.
Coming soon...
Coming soon...
Aligns and quantifies FastQ files from a 10X snRNA+ATAC-seq experiment against a reference genome. Output matrices are provided in triplet and h5 formats.
Creates a sample sheet for a whole project, listing the sample names, assay types and paths to FastQ files. It can be subset to produce a sample sheet for a sample.
Creates an index for use with Cell Ranger ARC. It can produce custom genomes if provided with the relevant (and correctly formatted) FastA and GTF files.
Aligns and quantifies FastQ files from a multiomic 10x experiment against a reference genome, including VDJ-B/T and cell surface markers. Output matrices for gene expression and features are provided in triplet and h5 formats. VDJ data are provided separately.
Creates a configuration file for a library, listing the sample names, assay types and paths to FastQ files (etc).
Make a connection to the release-matched Ensembl database and saves the object as an RDS file.
Converts a FastA index (`fai`) to a {GenomeInfoDb} `Seqinfo` object and saves the object to an RDS file.
Reads a GTF file into a GRanges object and saves the object as an RDS file.
Writes assay objects as RDS files for a specified assay type.
A Seurat object is created from the assays, metadata and miscellaneous objects and written to an RDS file.
Adds a metadata variable that shows the percentage of a cell's data that originates from features that match a regex.
Reads a directory containing Cell Ranger-formatted output into a list of matrices.
Create a chromatin assay using Signac and a counts matrix.
Create a FastA index from a FastA file, providing a `.fai` file.
Concatenate multiple files into a single output file. Different input formats can be used; based on the extension, YAML files are concatenated using `yq`, otherwise `cat` is used.
These processes can be executed independently of datasets; their parameters do not depend on the data but on the genome used in the analysis.
A variety of methods are implemented that can take FastQ files and output quantified expression data, creating intermediate files such as indexes as required.
Use the 10X `cellranger` software to quantify expression and optionally create a genome index against which gene expression can be quantified.
Using `cellranger-arc` software, libraries for snRNA-seq and cell-matched snATAC-seq assays are quantified against an index, which can be optionally created.
Using `cellranger-multi` software, barcoded and probe-based libraries for gene expression assays are quantified against an index, which can be optionally created.
Analysis of single cell data using Seurat-based methods.
Using the filtered single cell expression matrix, output from Cell Ranger, a Seurat object is prepared with little modification.
Using the filtered single nucleus expression and accessibility matrices, written by Cell Ranger ARC, a Seurat object is prepared that contains RNA and chromatin accessibility assays with little modification.
Verify that keys in a collection of `maps` match.
Collect an emission from multiple channels or processes into a collection.
Concatenate a collection of `map`s into a single `map`, overriding with successive keys.
Iteratively search a `map` for matching keys which are converted to `file` types. Single values (`string`) are converted, or all elements in a `collection`, or all values of a `map`.
Joins strings together with a separator, with the aim of making a uniquely identifiable identifier, but does no checking for uniqueness.
Create `Map` from a `Collection` of values and keys.
Try to make a string safe for use as a directory (or file) name.
Combines keys from maps to give the set of input parameters to a process and its emitted values.
Iteratively merges output channels of a process into a single channel of maps.
Iteratively pluck keys from a map of maps.
Print a data structure in JSON format so it is more-easily readable.
Given a map, remove one or more keys and return the result.
Given a `map` of key/value pairs, a subset of keys can be renamed.