snv-indels

The snv-indels module aligns the reads to the reference and calls SNVs and insertions/deletions (indels).

Tools

This module uses STAR to align the reads to the reference in two-pass mode. `VarDict <https://github.com/AstraZeneca-NGS/VarDictJava>`_ is used to call variants, which are then annotated using VEP. For each variant, the module determines whether it is located inside one of the regions defined in bed_variant_hotspots.
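
The hotspot check can be illustrated with a minimal sketch. This is not the module's own implementation: the function names ``parse_bed`` and ``in_hotspot`` are hypothetical, and the sketch only tests the variant's start position, assuming standard BED coordinates (0-based, half-open) and 1-based VCF positions.

```python
from typing import Iterable, List, Tuple

# (chrom, start, end) in BED convention: 0-based start, exclusive end
Region = Tuple[str, int, int]


def parse_bed(lines: Iterable[str]) -> List[Region]:
    """Parse the first three columns of BED lines, skipping headers and blanks."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def in_hotspot(chrom: str, pos: int, regions: Iterable[Region]) -> bool:
    """Return True if a 1-based VCF position falls inside any BED region."""
    pos0 = pos - 1  # convert to 0-based so it is comparable with BED coordinates
    return any(c == chrom and start <= pos0 < end for c, start, end in regions)


# Made-up regions, for illustration only
hotspots = parse_bed(["chr1\t100\t200", "chr2\t50\t60"])
```

With these made-up regions, ``in_hotspot("chr1", 101, hotspots)`` is true, because VCF position 101 corresponds to BED offset 100, while position 100 falls just outside the region.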

The variants annotated by VEP are then filtered based on a number of different criteria:

Variants that are present on the blacklist are excluded.
Only variants that are present on one of the specified transcripts in ref_id_mapping are included.
Only variants that match one of the consequences defined in vep_include_consequence are included.
Variants with a population frequency above 1% in gnomAD exomes (gnomADe) are excluded.
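
The four criteria above can be sketched as a single predicate. This is an illustration under stated assumptions, not the module's code: the ``Variant`` record and ``passes_filters`` function are hypothetical, and treating a variant without a known population frequency as passing is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class Variant:
    # Hypothetical, simplified record; not the module's internal representation
    key: str                       # identifier used to look up the blacklist
    transcript: str                # transcript the annotation refers to
    consequences: Tuple[str, ...]  # VEP consequence terms
    gnomade_af: Optional[float]    # gnomADe allele frequency, None if unknown


def passes_filters(
    v: Variant,
    blacklist: Set[str],
    transcripts: Set[str],
    include_consequences: Set[str],
    max_af: float = 0.01,
) -> bool:
    """Apply the four filter criteria in the order they are listed."""
    if v.key in blacklist:
        return False
    if v.transcript not in transcripts:
        return False
    if not include_consequences.intersection(v.consequences):
        return False
    # Assumption: variants without a known frequency are kept
    if v.gnomade_af is not None and v.gnomade_af > max_af:
        return False
    return True


# Made-up values, for illustration only
blacklist = {"chr1:5:A>T"}
transcripts = {"ENST0001"}
consequences = {"missense_variant", "stop_gained"}
rare = Variant("chr1:10:C>G", "ENST0001", ("missense_variant",), 0.001)
common = Variant("chr1:20:G>A", "ENST0001", ("missense_variant",), 0.05)
```

Here ``rare`` passes all four checks, while ``common`` is rejected because its frequency exceeds 1%.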

Picard is used to generate various alignment statistics.

Input

The input for this module is a single pair of FastQ files per sample, specified in a PEP configuration file, as shown below.

Example input for the snv-indels module
sample_name	R1	R2	strandedness
MO1-RNAseq-1-16714	test/data/fastq/NOMO1-RNAseq-1-16714_R1.fastq.gz	test/data/fastq/NOMO1-RNAseq-1-16714_R2.fastq.gz	forward
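
As a rough sketch, a sample table with these four columns could be checked before starting the pipeline. The helper ``read_samples`` is hypothetical and not part of the module, and the sketch assumes the table is tab-delimited as displayed above; the sample name and file names in the example are made up.

```python
import csv
import io

EXPECTED_COLUMNS = ["sample_name", "R1", "R2", "strandedness"]


def read_samples(handle):
    """Read a tab-delimited sample table and verify the expected columns exist."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        raise ValueError("Missing column(s): " + ", ".join(missing))
    return list(reader)


# Made-up sample table, for illustration only
example = io.StringIO(
    "sample_name\tR1\tR2\tstrandedness\n"
    "sample1\tsample1_R1.fastq.gz\tsample1_R2.fastq.gz\tforward\n"
)
samples = read_samples(example)
```

Each entry in ``samples`` is a dictionary keyed by column name, so ``samples[0]["R1"]`` gives the path to the first FastQ file of the first sample.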

Output

The output of this module is a JSON file with an overview of the most important results, along with a number of other output files:

A .bam and .bai per sample, which contain the aligned reads.
A VEP output file (vep_high), which contains the final set of filtered variants.
A VEP output file (vep_target), which contains the variants on the transcripts of interest. These variants have not been filtered on vep_include_consequence terms.
A VCF file that only contains those variants that fall in one of the bed_variant_hotspots regions.

Configuration

You can automatically generate a configuration for the snv-indels module using the utilities/create-config.py script.

Example

{
  "genome_fasta": "test/data/reference/hamlet-ref.fa",
  "genome_fai": "test/data/reference/hamlet-ref.fa.fai",
  "genome_dict": "test/data/reference/hamlet-ref.dict",
  "star_index": "test/data/reference/hamlet-star",
  "ref_id_mapping": "test/data/reference/id_mappings.tsv",
  "rrna_refflat": "test/data/reference/ucsc_rrna.refFlat",
  "bed_variant_hotspots": "test/data/reference/hotspots_genome.bed",
  "gtf": "test/data/reference/hamlet-ref.gtf",
  "annotation_refflat": "test/data/reference/hamlet-ref.refFlat",
  "vep_include_consequence": [
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
    "start_lost",
    "inframe_insertion",
    "inframe_deletion",
    "protein_altering_variant",
    "missense_variant"
  ]
}

Note that the vep-cache entry is missing for this example file, which means that the online API of VEP will be used. For the best performance, please specify a vep-cache folder as well.

Configuration options

Configuration options
Option	Description	Required
forward_adapter	The forward adapter sequence	yes
reverse_adapter	The reverse adapter sequence	yes
genome_fasta	Reference genome, in FASTA format	yes
genome_fai	.fai index file for the reference fasta	yes
genome_dict	.dict index file for the reference fasta	yes
star_index	STAR index database	yes
ref_id_mapping	File of transcripts of interest	yes
rrna_refflat	File of rRNA transcripts	yes
bed_variant_hotspots	BED file of hotspot regions	yes
gtf	GTF file with transcripts, used by STAR	yes
annotation_refflat	File used to determine exon coverage	yes
blacklist	File of blacklisted variants	yes
vep-cache	Folder containing the VEP cache	no
vep_include_consequence	List of `VEP consequences <http://www.ensembl.org/info/genome/variation/prediction/predicted_data.html>`_ to report	yes
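
A configuration can be checked against the table above with a short sketch like the following. The function ``check_config`` is hypothetical and not part of the module or of utilities/create-config.py; the list of required options is copied from the "Required" column, and the configuration values are made up.

```python
import json

# Options marked "yes" in the "Required" column of the table
REQUIRED_OPTIONS = [
    "forward_adapter",
    "reverse_adapter",
    "genome_fasta",
    "genome_fai",
    "genome_dict",
    "star_index",
    "ref_id_mapping",
    "rrna_refflat",
    "bed_variant_hotspots",
    "gtf",
    "annotation_refflat",
    "blacklist",
    "vep_include_consequence",
]


def check_config(config: dict) -> list:
    """Return the required options that are absent from the configuration."""
    return [opt for opt in REQUIRED_OPTIONS if opt not in config]


# Made-up, deliberately incomplete configuration, for illustration only
config = json.loads('{"genome_fasta": "ref.fa", "star_index": "star"}')
missing = check_config(config)
uses_online_vep = "vep-cache" not in config  # no cache folder configured
```

An empty ``missing`` list means every required option is present. The optional vep-cache entry is checked separately, since leaving it out means the online API of VEP will be used.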