snv-indels module

The snv-indels module aligns the reads to the reference and calls variants. The BAM and count files produced by this module are used by the fusion and gene expression modules.

Tools

This module uses STAR in two-pass mode to align the reads to the reference. VarDict is used to call variants, which are annotated using VEP. Variants are filtered based on the criteria defined in filter_criteria, and annotated based on annotation_criteria.

The variants annotated by VEP are then filtered based on a number of different criteria:

  1. Variants that are present on the blacklist are excluded.

  2. Only variants that match at least one criterion in filter_criteria are included.

  3. Variants that have a population frequency of more than 1% in the gnomADe population are excluded.
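
The three filtering steps above can be sketched in Python as follows. This is a minimal illustration and not HAMLET's actual implementation: the Variant record, its field names, the matches_any_criterion callback and the keep_variant function are all hypothetical, and a variant without a known population frequency is assumed to pass the frequency filter.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set

# "More than 1%" from the text, expressed as a fraction.
MAX_POPULATION_FREQUENCY = 0.01


@dataclass(frozen=True)
class Variant:
    """Hypothetical, simplified stand-in for one VEP-annotated variant."""

    identifier: str  # any unique description of the variant
    population_frequency: Optional[float]  # gnomADe frequency, None if unknown


def keep_variant(
    variant: Variant,
    blacklist: Set[str],
    matches_any_criterion: Callable[[Variant], bool],
) -> bool:
    """Return True if the variant survives all three filtering steps."""
    # Step 1: variants on the blacklist are excluded.
    if variant.identifier in blacklist:
        return False
    # Step 2: the variant must match at least one filter criterion.
    if not matches_any_criterion(variant):
        return False
    # Step 3: variants with a population frequency above 1% are excluded.
    # Assumption: an unknown frequency does not exclude the variant.
    frequency = variant.population_frequency
    if frequency is not None and frequency > MAX_POPULATION_FREQUENCY:
        return False
    return True
```

The criteria check is passed in as a callback so that the order and outcome of the three steps can be read separately from how a single criterion is evaluated.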

Picard is used to generate various alignment statistics.

Input

The input for this module is a single pair of FASTQ files per sample, specified in a PEP configuration file, as shown below.

Example input for the snv-indels module

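sample_name         R1                                                R2                                                strandedness
------------------  ------------------------------------------------  ------------------------------------------------  ------------
MO1-RNAseq-1-16714  test/data/fastq/NOMO1-RNAseq-1-16714_R1.fastq.gz  test/data/fastq/NOMO1-RNAseq-1-16714_R2.fastq.gz  reverse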
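
As a rough illustration of the sample table above, the sketch below reads it from CSV text and checks that each sample has a value in every column. It assumes the sample table is stored as a comma-separated file and that all four columns are mandatory; the read_samples function and the REQUIRED_COLUMNS constant are hypothetical names and not part of HAMLET or of any PEP library.

```python
import csv
import io

# Column names taken from the example table; treating all as required is an assumption.
REQUIRED_COLUMNS = ("sample_name", "R1", "R2", "strandedness")


def read_samples(handle):
    """Return the sample rows, checking that every required column is filled in."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"Missing column(s): {', '.join(missing)}")
    samples = []
    for row in reader:
        empty = [column for column in REQUIRED_COLUMNS if not row[column]]
        if empty:
            raise ValueError(f"Empty value(s) for: {', '.join(empty)}")
        samples.append(row)
    return samples


# The example row from this page, written out as CSV text.
EXAMPLE = io.StringIO(
    "sample_name,R1,R2,strandedness\n"
    "MO1-RNAseq-1-16714,"
    "test/data/fastq/NOMO1-RNAseq-1-16714_R1.fastq.gz,"
    "test/data/fastq/NOMO1-RNAseq-1-16714_R2.fastq.gz,"
    "reverse\n"
)
samples = read_samples(EXAMPLE)
for sample in samples:
    print(sample["sample_name"], sample["R1"], sample["R2"], sample["strandedness"])
```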

Output

The output of this module is a JSON file with an overview of the most important results, as well as a number of other output files:

  • A .bam and .bai per sample, which contain the aligned reads.

  • The filtered VEP output file (filter_vep), which contains the final set of filtered and annotated variants.

  • The counts file produced by STAR, which contains the coverage per gene.

Configuration

You can automatically generate a configuration for the snv-indels module using the utilities/create-config.py script.

Example

{
  "genome_fasta": "test/data/reference/hamlet-ref.fa",
  "genome_fai": "test/data/reference/hamlet-ref.fa.fai",
  "genome_dict": "test/data/reference/hamlet-ref.dict",
  "star_index": "test/data/reference/hamlet-star",
  "ref_id_mapping": "test/data/reference/id_mappings.tsv",
  "filter_criteria": "test/data/config/filter_criteria.tsv",
  "annotation_criteria": "test/data/config/annotation_criteria.tsv",
  "rrna_refflat": "test/data/reference/ucsc_rrna.refFlat",
  "gtf": "test/data/reference/hamlet-ref.gtf",
  "annotation_refflat": "test/data/reference/hamlet-ref.refFlat"
}

Note that the vep-cache entry is missing for this example file, which means that VEP will be run with only the fasta and gtf files as input. For the best performance, please specify a vep-cache folder as well.

Configuration options

Option                    Description                                                                      Required
------------------------  -------------------------------------------------------------------------------  --------
genome_fasta              Reference genome, in FASTA format                                                yes
genome_fai                .fai index file for the reference fasta                                          yes
genome_dict               .dict index file for the reference fasta                                         yes
star_index                STAR index database                                                              yes
ref_id_mapping            File of transcripts of interest                                                  yes
filter_criteria           Criteria file to filter variants                                                 yes
annotation_criteria       Criteria file to annotate variants                                               yes
rrna_refflat              File of rRNA transcripts                                                         yes
gtf                       GTF file with transcripts, used by STAR                                          yes
annotation_refflat        File used to determine exon coverage                                             yes
blacklist                 File of blacklisted variants                                                     no
vep-cache                 Folder containing the VEP cache                                                  no
variant_allele_frequency  Minimum variant allele frequency in the sample to call a variant (default=0.05)  no
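
A configuration can be checked against the options above before running the module. The sketch below is only an illustration: the option names are copied from the table, but the check_config function, the constant names and the deliberately incomplete example configuration are hypothetical and not part of HAMLET.

```python
import json

# Option names copied from the configuration options above.
REQUIRED_OPTIONS = (
    "genome_fasta",
    "genome_fai",
    "genome_dict",
    "star_index",
    "ref_id_mapping",
    "filter_criteria",
    "annotation_criteria",
    "rrna_refflat",
    "gtf",
    "annotation_refflat",
)
OPTIONAL_OPTIONS = ("blacklist", "vep-cache", "variant_allele_frequency")
DEFAULT_VARIANT_ALLELE_FREQUENCY = 0.05


def check_config(config):
    """Return (missing required options, unrecognised options)."""
    missing = [option for option in REQUIRED_OPTIONS if option not in config]
    known = set(REQUIRED_OPTIONS) | set(OPTIONAL_OPTIONS)
    unknown = sorted(set(config) - known)
    return missing, unknown


# A deliberately incomplete configuration with one misspelled key, for illustration.
config = json.loads(
    '{"genome_fasta": "test/data/reference/hamlet-ref.fa", "star_indx": "x"}'
)
missing, unknown = check_config(config)
print("Missing required options:", ", ".join(missing))
print("Unrecognised options:", ", ".join(unknown))

# Optional options fall back to their documented default.
min_vaf = config.get("variant_allele_frequency", DEFAULT_VARIANT_ALLELE_FREQUENCY)
print("Minimum variant allele frequency:", min_vaf)
```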

Filter and annotation criteria

HAMLET includes the ability to specify separate filter criteria for every transcript, based on the position and the VEP consequence of the variant. The criteria are used both to filter which variants will be part of the output (filter_criteria) and to annotate the identified variants (annotation_criteria).

The required columns are transcript_id, consequence, start and end. To annotate variants, the annotation column is used as well. Every column except for transcript_id can be empty.
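
How a single criterion could be evaluated is sketched below. This is an illustration under stated assumptions, not HAMLET's implementation: an empty column is assumed to match any value, start and end are assumed to be inclusive, transcript identifiers are compared as exact strings including the version suffix, and the Criterion class and the matches and annotations_for functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Criterion:
    """One row of a criteria file; None stands for an empty column."""

    transcript_id: str
    consequence: Optional[str] = None
    start: Optional[int] = None
    end: Optional[int] = None
    annotation: Optional[str] = None  # only used in annotation_criteria


def matches(criterion: Criterion, transcript_id: str, consequence: str, position: int) -> bool:
    """Return True if the variant satisfies every non-empty column."""
    if criterion.transcript_id != transcript_id:
        return False
    if criterion.consequence is not None and criterion.consequence != consequence:
        return False
    # Assumption: start and end are inclusive bounds on the variant position.
    if criterion.start is not None and position < criterion.start:
        return False
    if criterion.end is not None and position > criterion.end:
        return False
    return True


def annotations_for(
    criteria: List[Criterion], transcript_id: str, consequence: str, position: int
) -> List[str]:
    """Collect the annotation of every matching criterion."""
    return [
        criterion.annotation
        for criterion in criteria
        if criterion.annotation and matches(criterion, transcript_id, consequence, position)
    ]


# Rows modelled on the example files from the HAMLET tests.
filter_criteria = [Criterion("ENST00000361899.1", "missense_variant")]
annotation_criteria = [
    Criterion("ENST00000361899.1", "missense_variant", 334, 334, "Hotspot")
]

variant = ("ENST00000361899.1", "missense_variant", 334)
included = any(matches(criterion, *variant) for criterion in filter_criteria)
labels = annotations_for(annotation_criteria, *variant)
print(included, labels)
```

Keeping the matching logic in one function means the same rule set can serve both purposes: filtering only asks whether any criterion matches, while annotation collects the labels of those that do.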

Example filter_criteria file, from the HAMLET tests

transcript_id      consequence               start  end
-----------------  ------------------------  -----  ---
ENST00000361851.1  stop_gained
ENST00000361851.1  frameshift_variant
ENST00000361851.1  stop_lost
ENST00000361851.1  start_lost
ENST00000361851.1  inframe_insertion
ENST00000361851.1  inframe_deletion
ENST00000361851.1  protein_altering_variant
ENST00000361851.1  missense_variant
ENST00000361899.1  stop_gained
ENST00000361899.1  frameshift_variant
ENST00000361899.1  stop_lost
ENST00000361899.1  start_lost
ENST00000361899.1  inframe_insertion
ENST00000361899.1  inframe_deletion
ENST00000361899.1  protein_altering_variant
ENST00000361899.1  missense_variant
ENST00000361789.1  stop_gained
ENST00000361789.1  frameshift_variant
ENST00000361789.1  stop_lost
ENST00000361789.1  start_lost
ENST00000361789.1  inframe_insertion
ENST00000361789.1  inframe_deletion
ENST00000361789.1  protein_altering_variant
ENST00000361789.1  missense_variant
ENST00000241453.1  inframe_insertion

Example annotation_criteria file, from the HAMLET tests

transcript_id      consequence        start  end   annotation
-----------------  -----------------  -----  ----  ----------
ENST00000241453.1  inframe_insertion  1790   1801  Hotspot
ENST00000361899.1  missense_variant   334    334   Hotspot
ENST00000361851.1  missense_variant   25           Hotspot