contribution guides

These guides will hopefully help you to add features to {scamp}. New features can be added by writing modules, which are the basic building blocks from which a workflow is written. A workflow consists of several processes that are used in concert to complete a stage of an analysis. Workflows may be thought of as independent pipelines; in {scamp} we can chain multiple pipelines together to provide flexibility in the analysis.

Writing a module requires a script (in any language) to be written alongside a simple Nextflow process definition. Together these define how the input data is processed and what outputs are produced. Each module is documented, making it a self-contained unit.

A workflow can include multiple modules and is where the management of parameters occurs. In the workflow, user parameters are manipulated and augmented with the output of processes so that successive processes can be coordinated to complete an analysis. Workflows can be nested into related topics, with workflows being able to initiate (sub)workflows (and so on). Each workflow is documented in a readme.yaml alongside its Nextflow file.
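
As a minimal sketch of that idea (the module names and channels here are hypothetical, not part of {scamp}), a workflow includes modules and wires the outputs of one process into the next; a (sub)workflow is included and called in exactly the same way:

include { quantify } from './modules/quantification/main.nf'
include { annotate } from './modules/annotation/main.nf'

workflow example_analysis {
  take:
    opt     // map of user and pipeline parameters
    sample  // sample identifier
    reads   // path to the input data

  main:
    // the output of the first module becomes an input of the second
    quantify(opt, sample, reads)
    annotate(quantify.out.opt, quantify.out.output)

  emit:
    annotated = annotate.out.output
}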

• modules

  Modules represent specific steps of a pipeline that can be reused in multiple instances. A module should be written to be generic and not specifically tied to a pipeline, workflow or (sub)workflow. Each module performs a specific task and usually includes only a few different programs.

• workflows

  Coming soon...

• documentation

  Coming soon...

Subsections of contribution guides

modules

Modules represent specific steps of a pipeline that can be reused in multiple instances. A module should be written to be generic and not specifically tied to a pipeline, workflow or (sub)workflow. Each module performs a specific task and usually includes only a few different programs.

Working modules can be written independently of their inclusion in a pipeline, so do not worry about learning Nextflow if you don't want to - you can write a module script that can be wrapped in Nextflow later.

A suggested module structure could be as follows. The example module is called “a_new_module” and contains one subdirectory and four files. These will be described more thoroughly below. Briefly, the main.nf file is where the Nextflow process is defined; this is the part of the module that controls the execution of the main.sh script (in this example). The stub.sh is an optional file that can be used to generate placeholder output files so that a pipeline can be tested without taking the time to analyse any data. The readme.yaml will be used to create documentation for this website.

a_new_module/
|-- main.nf
|-- readme.yaml
`-- templates
    |-- main.sh
    `-- stub.sh

The following are suggestions; this is the way that I have been writing modules. But there is flexibility: if you don't like the way I have written the scripts, you don't have to do it the same way!

Nextflow process

Nextflow's process documentation

The process defines the context in which a processing step is executed on a set of inputs; a single process can become multiple tasks where each task has a different set of input parameters for the process.

An example process, “complicated_analysis”, is defined below in the main.nf file. The really important parts are the input, output and script stanzas.

The inputs to a process are passed as channels from the Nextflow pipeline. The order and type of the channels are important; the definitions here must be adhered to in the pipeline. In this example, there are four inputs: opt, tag, sample and db. Their types are specified as either val or file. A val is a value that can be substituted into the script. The file will be staged as a symlink to the target, named in this case “db”.

For {scamp} processes, the opt input should be used universally. The opt channel is a map of key-value pairs, typically provided by the configuration file, which allows pipeline parameters that are not necessarily used in the process to be carried through the task and used to track input parameters in output channels. But beware: only variables that affect the process's execution should be included, because changes to any value in opt will invalidate the cache of its tasks. The tag is a string that will be added to the Nextflow output log to identify an individual task. If omitted, a number is shown in the log instead.
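
As a sketch of how a workflow might satisfy these inputs (the parameter values and database path are purely illustrative, and the process is assumed to be included or defined in the same script):

workflow {
  // only values that affect the process's execution should go into opt
  opt = [sample: 'sample_1', genome: 'mm10']

  // the inputs are provided in the declared order: opt, tag, sample and db
  complicated_analysis(opt, opt.sample, opt.sample, file('/path/to/reference.db'))  // illustrative path
}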

The outputs of a process are the files or variables produced by the script. The “complicated_analysis” module emits three output channels: the opt, unmodified from its input; a task.yaml file that tracks the software versions and task parameters of the process; and the analysis output file, output.file. These are emitted to the pipeline in channels named opt, task and output.

For {scamp} processes, the opt and task channels should be used. The task channel may be used in the future to compose markdown reports.
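
Downstream, these named channels are accessed through the process's .out property; a brief sketch from inside a workflow body (downstream_process is hypothetical):

// pass the parameter map and the analysis result on to a later step
downstream_process(complicated_analysis.out.opt, complicated_analysis.out.output)

// or inspect the analysis output channel directly
complicated_analysis.out.output | view()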

The script stanza defines what analysis actually happens. I favour using templates here so that the scripts are kept separate from Nextflow. In this example, if the user has provided the -stub-run argument when invoking the pipeline, the stub.sh script is executed; otherwise main.sh is executed.

Other Nextflow directives can be included but may not be completely relevant in the context of a module. For example, using publishDir should be the choice of the pipeline creator, so it may not be sensible to include it here. Directives included here can be overridden by a suitable configuration file, however. In this case we include some resource requests - cpus, memory and time - but no executor (eg SLURM), which should be defined by the user at execution.

process complicated_analysis {
  tag "$tag"

  cpus 16
  memory '64GB'
  time '3d'

  input:
    val opt
    val tag
    val sample
    file 'db'

  output:
    val opt, emit: opt
    path 'task.yaml', emit: task
    path 'output.file', emit: output

  script:
    template workflow.stubRun ? 'stub.sh' : 'main.sh'
}
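
As a sketch of how a user's configuration file could override these directives and select an executor (the values here are illustrative):

process {
  executor = 'slurm'

  withName: 'complicated_analysis' {
    cpus   = 8
    memory = '32GB'
    time   = '12h'
  }
}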
        

Executable script

Nextflow's script documentation

Nextflow is language agnostic: so long as the interpreter is available in the task's PATH, the script should run. These scripts can be tested outside Nextflow, with equivalent parameters passed as environment variables, for example. Containers can be used and should be included in the directives of the process.
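
A container can be declared as a directive alongside the others; a minimal sketch (the process name and image URI are illustrative, not part of {scamp}):

process containerised_analysis {
  tag "$tag"

  // the task runs inside this container, so the tools need not be on the host PATH
  container 'registry.example.org/analysis_tool:1.0'  // illustrative image URI

  input:
    val opt
    val tag

  output:
    val opt, emit: opt
    path 'versions.txt', emit: versions

  script:
    """
    analysis_tool --version > versions.txt
    """
}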

In this example there are two programs being used to create an output file from two inputs. The first tool uses the task's sample variable and the db file from the inputs. The value of sample is interpolated into the script by $sample. For db, a symlink named “db” is staged in the task's work directory, pointing to the target file, so we can refer to db in the script as if it were that file, irrespective of its location in the filesystem.

Once analysis_tool has completed its work, the intermediate output file is parsed and output.file is written. Nextflow will provide this file to the pipeline since it was listed in the output stanza of the process.

The task.yaml file can be aggregated across workflow tasks, processes and the pipeline, and could be used in the future so that task-specific information and software versions can be included in reports.
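
A sketch of how the task files might be gathered in the pipeline (the destination directory is illustrative):

// inside the workflow body: collect every task.yaml emitted by the process into one file
complicated_analysis.out.task
  | collectFile(name: 'tasks.yaml', storeDir: 'results/pipeline_info')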

An R script could be used here too, specifying Rscript instead of bash in the shebang line. Nextflow variables are interpolated into the script in the same way, though, so be wary when accessing lists. Writing task.yaml can be taken care of using the {scampr} package.

Nextflow will interpolate variables using $variable, so any scripts using $ may have unexpected behaviour. Where possible, use non-dollar alternatives or escape the symbol.
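
For example, in an inline script block (the same escaping should apply inside a template file), a Nextflow variable is interpolated while an escaped dollar is left for bash to expand at run time:

script:
  """
  echo "sample is $sample"      # interpolated by Nextflow before the task runs
  echo "running on \$HOSTNAME"  # escaped, expanded by bash when the task runs
  """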

process scripts

#! bash

analysis_tool --sample $sample --database db --parameter 100 --output intermediate.file
cat intermediate.file | parsing_tool > output.file

# write task information to a (yaml) file
cat <<-END_TASK > task.yaml
'${task.process}':
  task:
    '${task.index}':
      params:
        sample: $sample
      meta:
        workDir: `pwd`
  process:
    ext: []
    versions:
      analysis tool: `analysis_tool --version`
      parsing tool: `parsing_tool -v`
END_TASK
        
#! Rscript

library(magrittr)
library(scampr)

log_message('making a connection to biomart', level='main')

task.process <- "${task.process}"
task.index <- "${task.index}"

list(nvalues = "$nvalues") |>
    assign_and_record_task()

data.frame(x=rnorm(n=nvalues), y=rnorm(n=nvalues)) |>
    saveRDS(file='data.rds')
        

Stub script

The optional stub.sh is an alternative script that can be executed when the user invokes -stub-run. The idea of this script is to create the output files expected by the pipeline without expending computational resource. In this way we can test how processes and channels interact in the pipeline without conjuring test data or worrying about cache validity.

The example below simply uses touch to create output files with no content.

#! bash

touch output.file

# write task information to a (yaml) file
cat <<-END_TASK > task.yaml
'${task.process}':
  task:
    '${task.index}':
      params:
        sample: $sample
      meta:
        workDir: `pwd`
  process:
    ext: []
    versions:
      analysis tool: `analysis_tool --version`
      parsing tool: `parsing_tool -v`
END_TASK
        

Documentation

Each module should be documented using the readme.yaml file. This file will be used to populate the module documentation on this website.

name: A new module

description: |
  A short description of the module's function.

tags:
  - lowercase
  - strings

tools:
  name of software:
    description: A markdown-ready description - pillaged from its website!
    homepage: url, could be github
    documentation: maybe a readthedocs
    source: url to (eg) github
    doi: doi
    licence: eg MIT or GPL-3
    ext: extra arguments identifier
    multithreaded:
      - list of features
      - eg "multithreaded"
      - that appear in module documentation

input:
  - name: opt
    type: map
    description: A map of task-specific variables.
  - name: tag
    type: string
    description: A unique identifier to use in the tag directive.

output:
  - name: opt
    type: map
    description: A map of task-specific variables.
  - name: task
    type: file
    description: YAML-formatted file of task parameters and software versions used by the process.
    pattern: task.yaml

channel tags:
  - ':channel_1': Description of channel 1, without the shared root in the tag.
  - ':channel_2': Description of channel 2.

authors:
  - "@ChristopherBarrington"
        

A template module documentation file can be created using Hugo. Suppose we wanted to add documentation for a new module for cellranger count, stored in scamp/modules/cell_ranger/count. Setting the environment variable MODULE_PATH=modules/cell_ranger/count and using Hugo as below will create a template readme.md in the module, which is subsequently renamed to a YAML file. Two forms of the rename step are shown; use whichever your rename implementation supports.

create module documentation

hugo new --kind module-readme \
         --contentDir scamp \
         ${MODULE_PATH}/readme.md && \
rename --remove-extension \
       --append \
       .yaml scamp/$_

hugo new --kind module-readme \
         --contentDir scamp \
         ${MODULE_PATH}/readme.md && \
rename .md .yaml scamp/$_

You're on your own.

workflows

Coming soon...

documentation

Coming soon...