snv-indels module

The snv-indels module aligns the reads to the reference and calls variants. The BAM and count files produced by this module are used by the fusion and gene expression modules.

Tools

This module uses STAR to align the reads to the reference using two-pass mode. VarDict is used to call variants, which are annotated using VEP. Variants are filtered based on the criteria defined in filter_criteria, and annotated based on annotation_criteria.

The variants annotated by VEP are then filtered based on a number of different criteria:

Variants that are present on the blacklist are excluded.
Only variants that are present on one of the specified transcripts in ref_id_mapping are included.
Only variants that match one of the consequences defined in vep_include_consequence are included.
Variants that have a population frequency of more than 1% in the gnomADe population are excluded.
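
The four filter rules above can be read as a single predicate on each annotated variant. The following is a minimal sketch of that logic, not the module's actual implementation: the `Variant` record, its field names, and the `keep_variant` function are hypothetical, the blacklist is assumed to identify variants by an HGVS-like string, and variants without a known population frequency are assumed to pass the frequency filter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    """Simplified, hypothetical stand-in for one VEP-annotated variant."""
    hgvsc: str                   # identifier used for the blacklist lookup (assumption)
    transcript: str              # transcript the annotation refers to
    consequences: frozenset      # VEP consequence terms for this annotation
    gnomade_af: Optional[float]  # population frequency, None when unknown


def keep_variant(variant, blacklist, transcripts, include_consequences,
                 max_population_af=0.01):
    """Return True when the variant passes all four filters described above."""
    if variant.hgvsc in blacklist:
        return False  # present on the blacklist
    if variant.transcript not in transcripts:
        return False  # not on one of the transcripts of interest
    if not variant.consequences & include_consequences:
        return False  # none of its consequences is in vep_include_consequence
    if variant.gnomade_af is not None and variant.gnomade_af > max_population_af:
        return False  # population frequency above 1%
    return True


# Made-up example data, for illustration only.
variants = [
    Variant("ENST0001.1:c.100A>T", "ENST0001.1", frozenset({"missense_variant"}), 0.0001),
    Variant("ENST0001.1:c.200G>C", "ENST0001.1", frozenset({"missense_variant"}), 0.25),
    Variant("ENST0002.1:c.5del", "ENST0002.1", frozenset({"frameshift_variant"}), None),
    Variant("ENST0001.1:c.300C>T", "ENST0001.1", frozenset({"synonymous_variant"}), None),
    Variant("ENST0001.1:c.400T>G", "ENST0001.1", frozenset({"stop_gained"}), None),
]

kept = [
    v.hgvsc
    for v in variants
    if keep_variant(
        v,
        blacklist={"ENST0001.1:c.400T>G"},
        transcripts={"ENST0001.1"},
        include_consequences=frozenset(
            {"missense_variant", "frameshift_variant", "stop_gained"}
        ),
    )
]
print(kept)
```

In this made-up data set only the first variant survives: the others are dropped for being too common, on another transcript, lacking an included consequence, and blacklisted, respectively.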

Picard is used to generate various alignment statistics.

Input

The input for this module is a single pair of FastQ files per sample, specified in a PEP configuration file, as is shown below.

Example input for the snv-indels module
sample_name	R1	R2	strandedness
MO1-RNAseq-1-16714	test/data/fastq/NOMO1-RNAseq-1-16714_R1.fastq.gz	test/data/fastq/NOMO1-RNAseq-1-16714_R2.fastq.gz	forward

Output

The output of this module is a JSON file with an overview of the most important results, along with a number of other output files:

A .bam file with the aligned reads and its .bai index, per sample.
A VEP output file (vep_high), which contains the final set of filtered variants.
A VEP output file (vep_target), which contains the variants on the transcripts of interest. These variants have not been filtered on vep_include_consequence terms.

Configuration

You can automatically generate a configuration for the snv-indels module using the utilities/create-config.py script.

Example

{
  "genome_fasta": "test/data/reference/hamlet-ref.fa",
  "genome_fai": "test/data/reference/hamlet-ref.fa.fai",
  "genome_dict": "test/data/reference/hamlet-ref.dict",
  "star_index": "test/data/reference/hamlet-star",
  "ref_id_mapping": "test/data/reference/id_mappings.tsv",
  "filter_criteria": "test/data/config/filter_criteria.tsv",
  "annotation_criteria": "test/data/config/annotation_criteria.tsv",
  "rrna_refflat": "test/data/reference/ucsc_rrna.refFlat",
  "gtf": "test/data/reference/hamlet-ref.gtf",
  "annotation_refflat": "test/data/reference/hamlet-ref.refFlat"
}

Note that the vep-cache entry is missing from this example, which means that the online API of VEP will be used. For the best performance, specify a vep-cache folder as well.

Configuration options

Option	Description	Required
forward_adapter	The forward adapter sequence	yes
reverse_adapter	The reverse adapter sequence	yes
genome_fasta	Reference genome, in FASTA format	yes
genome_fai	.fai index file for the reference fasta	yes
genome_dict	.dict index file for the reference fasta	yes
star_index	STAR index database	yes
ref_id_mapping	File of transcripts of interest	yes
filter_criteria	Criteria file to filter variants	yes
annotation_criteria	Criteria file to annotate variants	yes
rrna_refflat	File of rRNA transcripts	yes
gtf	GTF file with transcripts, used by STAR	yes
annotation_refflat	File used to determine exon coverage	yes
blacklist	File of blacklisted variants	yes
vep-cache	Folder containing the VEP cache	no
vep_include_consequence	List of VEP consequences to report	yes
variant_allele_frequency	Minimum variant allele frequency in the sample to call a variant (default=0.05)	no
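
The table above can also be used as a checklist for a configuration file. The sketch below is a hypothetical helper, not part of the pipeline: `check_config` reports required options that are absent and keys that do not appear in the table. It assumes the configuration has already been loaded into a flat dictionary, as in the JSON example earlier in this section. Whether the pipeline itself rejects unknown keys is not stated here, so treat the second list as a hint for spotting typos only.

```python
# Option names copied from the table above.
REQUIRED_OPTIONS = (
    "forward_adapter",
    "reverse_adapter",
    "genome_fasta",
    "genome_fai",
    "genome_dict",
    "star_index",
    "ref_id_mapping",
    "filter_criteria",
    "annotation_criteria",
    "rrna_refflat",
    "gtf",
    "annotation_refflat",
    "blacklist",
    "vep_include_consequence",
)
OPTIONAL_OPTIONS = ("vep-cache", "variant_allele_frequency")


def check_config(config):
    """Return (missing required options, keys not listed in the table)."""
    missing = [option for option in REQUIRED_OPTIONS if option not in config]
    known = set(REQUIRED_OPTIONS) | set(OPTIONAL_OPTIONS)
    unknown = sorted(set(config) - known)
    return missing, unknown


# Deliberately incomplete configuration; the second key is a misspelling
# of "vep-cache" and its path is made up for illustration.
partial_config = {
    "genome_fasta": "test/data/reference/hamlet-ref.fa",
    "vep_cache": "/hypothetical/path/to/vep-cache",
}

missing, unknown = check_config(partial_config)
print("Missing required options:", ", ".join(missing))
print("Unrecognised keys:", ", ".join(unknown))
```

Note that vep-cache is written with a hyphen while the other options use underscores, which makes it an easy key to misspell.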