Process RNA-seq data end to end

This function selectively performs various steps to process RNA-seq data. See also the vignettes: browseVignettes('seeker').

Usage

seeker(params, parentDir = ".", dryRun = FALSE)

Arguments

params

Named list of parameters with components:

study: String used to name the output directory within parentDir.
metadata: Named list with components:
- run: Logical indicating whether to fetch metadata. See fetchMetadata(). If TRUE, saves a file parentDir/study/metadata.csv. If FALSE, expects that file to already exist. The unmodified fetched or found metadata is saved to a file parentDir/study/metadata_original.csv. Following components are only checked if run is TRUE.
- bioproject: String indicating the study's bioproject accession.
- include: Optional named list for specifying which rows of metadata to include for further processing, with components:
  - colname: String indicating column in metadata
  - values: Vector indicating values within colname
- exclude: Optional named list for specifying which rows of metadata to exclude from further processing (superseding include), with components:
  - colname: String indicating column in metadata
  - values: Vector indicating values within colname
fetch: Named list with components:
- run: Logical indicating whether to fetch files from SRA. See fetch(). If TRUE, saves files to parentDir/study/fetch_output. Whether TRUE or FALSE, expects metadata to have a column "run_accession", and updates metadata with column "fastq_fetched" containing paths to files in parentDir/study/fetch_output. Following components are only checked if run is TRUE.
- keep: Logical indicating whether to keep fastq.gz files when all processing steps have completed. NULL indicates TRUE.
- overwrite: Logical indicating whether to overwrite files that already exist. NULL indicates to use the default in fetch().
- keepSra: Logical indicating whether to keep the ".sra" files. NULL indicates to use the default in fetch().
- prefetchCmd: String indicating command for prefetch, which downloads ".sra" files. NULL indicates to use the default in fetch().
- prefetchArgs: Character vector indicating arguments to pass to prefetch. NULL indicates to use the default in fetch().
- fasterqdumpCmd: String indicating command for fasterq-dump, which uses ".sra" files to create ".fastq" files. NULL indicates to use the default in fetch().
- fasterqdumpArgs: Character vector indicating arguments to pass to fasterq-dump. NULL indicates to use the default in fetch().
- pigzCmd: String indicating command for pigz, which converts ".fastq" files to ".fastq.gz" files. NULL indicates to use the default in fetch().
- pigzArgs: Character vector indicating arguments to pass to pigz. NULL indicates to use the default in fetch().
trimgalore: Named list with components:
- run: Logical indicating whether to perform quality/adapter trimming of reads. See trimgalore(). If TRUE, expects metadata to have a column "fastq_fetched" containing paths to fastq files in parentDir/study/fetch_output, saves trimmed files to parentDir/study/trimgalore_output, and updates metadata with column "fastq_trimmed". If FALSE, expects and does nothing. Following components are only checked if run is TRUE.
- keep: Logical indicating whether to keep trimmed fastq files when all processing steps have completed. NULL indicates TRUE.
- cmd: Name or path of the command-line interface. NULL indicates to use the default in trimgalore().
- args: Additional arguments to pass to the command-line interface. NULL indicates to use the default in trimgalore().
- pigzCmd: String indicating command for pigz, which converts ".fastq" files to ".fastq.gz" files. NULL indicates to use the default in trimgalore().
fastqc: Named list with components:
- run: Logical indicating whether to perform QC on reads. See fastqc(). If TRUE and trimgalore$run is TRUE, expects metadata to have a column "fastq_trimmed" containing paths to fastq files in parentDir/study/trimgalore_output. If TRUE and trimgalore$run is FALSE, expects metadata to have a column "fastq_fetched" containing paths to fastq files in parentDir/study/fetch_output. If TRUE, saves results to parentDir/study/fastqc_output. If FALSE, expects and does nothing. Following components are only checked if run is TRUE.
- keep: Logical indicating whether to keep fastqc files when all processing steps have completed. NULL indicates TRUE.
- cmd: Name or path of the command-line interface. NULL indicates to use the default in fastqc().
- args: Additional arguments to pass to the command-line interface. NULL indicates to use the default in fastqc().
salmon: Named list with components:
- run: Logical indicating whether to quantify transcript abundances. See salmon(). If TRUE and trimgalore$run is TRUE, expects metadata to have a column "fastq_trimmed" containing paths to fastq files in parentDir/study/trimgalore_output. If TRUE and trimgalore$run is FALSE, expects metadata to have a column "fastq_fetched" containing paths to fastq files in parentDir/study/fetch_output. If TRUE, saves results to parentDir/study/salmon_output and parentDir/study/salmon_meta_info.csv. If FALSE, expects and does nothing. Following components are only checked if run is TRUE.
- indexDir: Directory that contains salmon index.
- sampleColname: String indicating column in metadata containing sample ids. NULL indicates "sample_accession", which should work for data from SRA and ENA.
- keep: Logical indicating whether to keep quantification results when all processing steps have completed. NULL indicates TRUE.
- cmd: Name or path of the command-line interface. NULL indicates to use the default in salmon().
- args: Additional arguments to pass to the command-line interface. NULL indicates to use the default in salmon().
multiqc: Named list with components:
- run: Logical indicating whether to aggregate results of various processing steps. See multiqc(). If TRUE, saves results to parentDir/study/multiqc_output. If FALSE, expects and does nothing. Following components are only checked if run is TRUE.
- cmd: Name or path of the command-line interface. NULL indicates to use the default in multiqc().
- args: Additional arguments to pass to the command-line interface. NULL indicates to use the default in multiqc().
tximport: Named list with components:
- run: Logical indicating whether to summarize transcript- or gene-level estimates for downstream analysis. See tximport(). If TRUE, expects metadata to have a column sampleColname of sample ids, and expects a directory parentDir/study/salmon_output containing directories of quantification results, and saves results to parentDir/study/tximport_output.qs. If FALSE, expects and does nothing. Following components are only checked if run is TRUE.
- tx2gene: Optional named list with components:
  - organism: String indicating organism and thereby ensembl gene dataset. See getTx2gene().
  - version: Optional number indicating ensembl version. NULL indicates the latest version. See getTx2gene().
  - filename: Optional string indicating name of pre-existing text file in parentDir/params$study containing mapping between transcripts (first column) and genes (second column), with column names in the first row. If filename is specified, organism and version must not be specified.
  If not NULL, saves a file parentDir/study/tx2gene.csv.gz.
- countsFromAbundance: String indicating whether or how to estimate counts using estimated abundances. See tximport::tximport().
- ignoreTxVersion: Logical indicating whether to ignore the version suffix on transcript ids. NULL indicates to use TRUE. See tximport::tximport().
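
The include and exclude components of metadata described above can be read as a row filter in which exclude takes precedence. The following is a minimal sketch of that reading on a toy data frame. It is not seeker's internal code, and the accessions and column values are made up for illustration.

```r
# Toy metadata; the accessions and strategies are invented for this example.
metadata = data.frame(
  run_accession = c('SRR1', 'SRR2', 'SRR3', 'SRR4'),
  library_strategy = c('RNA-Seq', 'RNA-Seq', 'ChIP-Seq', 'RNA-Seq'),
  stringsAsFactors = FALSE)

# Same shape as params$metadata$include and params$metadata$exclude.
include = list(colname = 'library_strategy', values = 'RNA-Seq')
exclude = list(colname = 'run_accession', values = 'SRR2')

# Keep rows matched by include, then drop rows matched by exclude,
# so that exclude supersedes include.
keepRows = metadata[[include$colname]] %in% include$values
keepRows = keepRows & !(metadata[[exclude$colname]] %in% exclude$values)
filtered = metadata[keepRows, , drop = FALSE]

print(filtered$run_accession)
```

In this toy case, SRR3 is dropped because it does not match include, and SRR2 is dropped because it matches exclude even though it also matches include.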

params can be derived from a yaml file; see vignette("introduction", package = "seeker"). The yaml representation of params will be saved to parentDir/params$study/params.yml.
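
As a rough illustration of the structure described above, params can also be built directly as an R list. This is a sketch under stated assumptions: the study name, bioproject accession, metadata column name, index directory, and tx2gene filename are hypothetical placeholders. Whether seeker() accepts the optional components being omitted as shown is also an assumption, and is worth confirming with dryRun = TRUE.

```r
# All concrete values below are hypothetical placeholders.
params = list(
  study = 'my_study',
  metadata = list(
    run = TRUE,
    bioproject = 'PRJNA000000',  # placeholder accession, not a real study
    include = list(colname = 'library_strategy', values = 'RNA-Seq')),
  fetch = list(run = TRUE, keep = FALSE),
  trimgalore = list(run = TRUE, keep = FALSE),
  fastqc = list(run = TRUE),
  salmon = list(run = TRUE, indexDir = 'path/to/salmon_index'),
  multiqc = list(run = TRUE),
  tximport = list(
    run = TRUE,
    # filename is given, so organism and version are left unspecified.
    tx2gene = list(filename = 'tx2gene.csv'),
    countsFromAbundance = 'lengthScaledTPM',
    ignoreTxVersion = TRUE))

# Inspect the top-level structure without printing every value.
str(params, max.level = 1)
```

Components left out of each step's list, such as cmd and args, are NULL when accessed, which the descriptions above say falls back to the default of the corresponding function.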

parentDir

Directory in which to store the output, which will be a directory named according to params$study.

dryRun

Logical indicating whether to check the validity of inputs without actually fetching or processing any data.

Value

Path to the output directory parentDir/params$study, invisibly.
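
Because the path is returned invisibly, it has to be assigned or printed explicitly to be seen. The sketch below validates the inputs with dryRun = TRUE and then captures the path from a real run. It assumes that the seeker and yaml packages are installed and that a parameter file exists. The filename 'my_params.yaml' is the same hypothetical file used in Examples, and the guard simply skips the calls when those assumptions do not hold.

```r
paramsFile = 'my_params.yaml'  # hypothetical parameter file

canRun = requireNamespace('seeker', quietly = TRUE) &&
  requireNamespace('yaml', quietly = TRUE) &&
  file.exists(paramsFile)

if (canRun) {
  params = yaml::read_yaml(paramsFile)

  # Check the validity of inputs without fetching or processing any data.
  seeker::seeker(params, parentDir = '.', dryRun = TRUE)

  # The return value is invisible, so assign it to use the path afterwards.
  outputDir = seeker::seeker(params, parentDir = '.')
  print(outputDir)
}
```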

Examples

if (FALSE) { # \dontrun{
doParallel::registerDoParallel()
params = yaml::read_yaml('my_params.yaml')
seeker(params)
} # }