snv-indels

The snv-indels module is responsible for aligning the reads to the reference, and for calling SNVs and insertions/deletions (indels).

Tools

This module uses STAR to align the reads to the reference in two-pass mode. `VarDict <https://github.com/AstraZeneca-NGS/VarDictJava>`_ is used to call variants, which are then annotated using VEP. For each variant, the module determines whether it is located inside one of the regions defined in bed_variant_hotspots.

The variants annotated by VEP are then filtered based on a number of different criteria:

  1. Variants that are present on the blacklist are excluded.

  2. Only variants that are present on one of the specified transcripts in ref_id_mapping are included.

  3. Only variants that match one of the consequences defined in vep_include_consequence are included.

  4. Variants that have a population frequency of more than 1% in the gnomADe population are excluded.
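The four criteria above can be sketched in Python as follows. This is a minimal illustration, not the module's actual implementation: the `Variant` record, its field names, the `passes_filters` helper and the example identifiers are all hypothetical. The sketch also assumes that a variant without a known population frequency is kept, which the text above does not specify.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Variant:
    """Hypothetical, simplified view of one VEP-annotated variant."""
    identifier: str               # illustrative variant identifier
    transcript: str               # transcript the consequence was predicted on
    consequences: Set[str]        # VEP consequence terms for this variant
    gnomade_af: Optional[float]   # gnomADe frequency, None if unknown


def passes_filters(variant, blacklist, transcripts, include_consequences,
                   max_af=0.01):
    """Return True if the variant survives all four criteria."""
    # 1. Exclude variants that are on the blacklist.
    if variant.identifier in blacklist:
        return False
    # 2. Keep only variants on a transcript of interest.
    if variant.transcript not in transcripts:
        return False
    # 3. Keep only variants with at least one accepted consequence.
    if not (variant.consequences & include_consequences):
        return False
    # 4. Exclude variants more frequent than 1% in the population.
    #    Assumption: an unknown frequency does not exclude the variant.
    if variant.gnomade_af is not None and variant.gnomade_af > max_af:
        return False
    return True


example = Variant("var1", "TX1", {"missense_variant"}, 0.002)
kept = passes_filters(
    example,
    blacklist={"var2"},
    transcripts={"TX1"},
    include_consequences={"missense_variant", "stop_gained"},
)
```

Ordering the checks from cheapest to most specific keeps the function easy to read, and each early return corresponds to exactly one numbered criterion.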

Picard is used to generate various alignment statistics.

Input

The input for this module is a single pair of FastQ files per sample, specified in a PEP configuration file, as shown below.

.. list-table:: Example input for the snv-indels module
   :header-rows: 1

   * - sample_name
     - R1
     - R2
     - strandedness
   * - MO1-RNAseq-1-16714
     - test/data/fastq/NOMO1-RNAseq-1-16714_R1.fastq.gz
     - test/data/fastq/NOMO1-RNAseq-1-16714_R2.fastq.gz
     - forward
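A sample table with these columns can be written with Python's standard `csv` module. The sketch below assumes that the PEP project refers to a CSV sample table; the `write_sample_table` helper is hypothetical, and the PEP project file that would point to the table is not shown.

```python
import csv
import io

# Column names taken from the example input table.
COLUMNS = ["sample_name", "R1", "R2", "strandedness"]


def write_sample_table(samples, handle):
    """Write one CSV row per sample to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for sample in samples:
        writer.writerow(sample)


samples = [
    {
        "sample_name": "MO1-RNAseq-1-16714",
        "R1": "test/data/fastq/NOMO1-RNAseq-1-16714_R1.fastq.gz",
        "R2": "test/data/fastq/NOMO1-RNAseq-1-16714_R2.fastq.gz",
        "strandedness": "forward",
    }
]

# An in-memory buffer is used here; pass an opened file to write to disk.
buffer = io.StringIO()
write_sample_table(samples, buffer)
table_text = buffer.getvalue()
```

`csv.DictWriter` raises a `ValueError` when a sample contains a key that is not listed in `COLUMNS`, which catches misspelled column names early.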

Output

The output of this module is a JSON file with an overview of the most important results, as well as a number of other output files:

  • A .bam and .bai per sample, which contain the aligned reads.

  • A VEP output file (vep_high), which contains the final set of filtered variants.

  • A VEP output file (vep_target), which contains the variants on the transcripts of interest. These variants have not been filtered on vep_include_consequence terms.

  • A VCF file that only contains those variants that fall in one of the bed_variant_hotspots regions.
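Deciding whether a variant falls in a hotspot region comes down to an interval check. The sketch below is not the module's own code: the `parse_bed` and `in_hotspot` helpers and the example coordinates are hypothetical. It relies on the standard conventions that BED intervals are 0-based and half-open while VCF positions are 1-based, and it only looks at the variant's start position.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples.

    start is 0-based and end is exclusive, as in the BED format.
    """
    regions = []
    for line in lines:
        line = line.strip()
        # Skip empty lines, comments and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def in_hotspot(chrom, pos, regions):
    """Return True if a 1-based VCF position lies inside any BED region."""
    zero_based = pos - 1  # convert the VCF position to BED coordinates
    return any(
        region_chrom == chrom and start <= zero_based < end
        for region_chrom, start, end in regions
    )


# Illustrative region covering 1-based positions 101 to 200 on chr1.
regions = parse_bed(["chr1\t100\t200\n"])
inside = in_hotspot("chr1", 101, regions)
outside = in_hotspot("chr1", 100, regions)
```

A linear scan is sufficient for a short hotspot list; for large BED files an interval tree or a sorted lookup would be a more suitable choice.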

Configuration

You can automatically generate a configuration for the snv-indels module using the utilities/create-config.py script.

Example

{
  "genome_fasta": "test/data/reference/hamlet-ref.fa",
  "genome_fai": "test/data/reference/hamlet-ref.fa.fai",
  "genome_dict": "test/data/reference/hamlet-ref.dict",
  "star_index": "test/data/reference/hamlet-star",
  "ref_id_mapping": "test/data/reference/id_mappings.tsv",
  "rrna_refflat": "test/data/reference/ucsc_rrna.refFlat",
  "bed_variant_hotspots": "test/data/reference/hotspots_genome.bed",
  "gtf": "test/data/reference/hamlet-ref.gtf",
  "annotation_refflat": "test/data/reference/hamlet-ref.refFlat",
  "vep_include_consequence": [
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
    "start_lost",
    "inframe_insertion",
    "inframe_deletion",
    "protein_altering_variant",
    "missense_variant"
  ]
}

Note that the vep-cache entry is missing for this example file, which means that the online API of VEP will be used. For the best performance, please specify a vep-cache folder as well.

Configuration options

.. list-table:: Configuration options
   :header-rows: 1

   * - Option
     - Description
     - Required
   * - forward_adapter
     - The forward adapter sequence
     - yes
   * - reverse_adapter
     - The reverse adapter sequence
     - yes
   * - genome_fasta
     - Reference genome, in FASTA format
     - yes
   * - genome_fai
     - .fai index file for the reference fasta
     - yes
   * - genome_dict
     - .dict index file for the reference fasta
     - yes
   * - star_index
     - STAR index database
     - yes
   * - ref_id_mapping
     - File of transcripts of interest
     - yes
   * - rrna_refflat
     - File of rRNA transcripts
     - yes
   * - bed_variant_hotspots
     - BED file of hotspot regions
     - yes
   * - gtf
     - GTF file with transcripts, used by STAR
     - yes
   * - annotation_refflat
     - File used to determine exon coverage
     - yes
   * - blacklist
     - File of blacklisted variants
     - yes
   * - vep-cache
     - Folder containing the VEP cache
     - no
   * - vep_include_consequence
     - List of `VEP consequences <http://www.ensembl.org/info/genome/variation/prediction/predicted_data.html>`_ to report
     - yes
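Before starting a run, it can help to check a configuration against the "Required" column of the table above. The following is a minimal sketch under that assumption: `REQUIRED_OPTIONS` is transcribed from the table, while the `missing_options` helper and the inline example configuration are hypothetical and not part of the pipeline.

```python
import json

# Options marked as required in the configuration options table.
REQUIRED_OPTIONS = [
    "forward_adapter",
    "reverse_adapter",
    "genome_fasta",
    "genome_fai",
    "genome_dict",
    "star_index",
    "ref_id_mapping",
    "rrna_refflat",
    "bed_variant_hotspots",
    "gtf",
    "annotation_refflat",
    "blacklist",
    "vep_include_consequence",
]


def missing_options(config):
    """Return the required options that are absent from the configuration."""
    return [option for option in REQUIRED_OPTIONS if option not in config]


# Deliberately incomplete configuration, for illustration only.
config = json.loads('{"genome_fasta": "test/data/reference/hamlet-ref.fa"}')
missing = missing_options(config)
if missing:
    print("Missing required options:", ", ".join(missing))
```

The check only looks at whether a key is present; it does not verify that the referenced files exist or that vep-cache, which is optional, has been set.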