snv-indels module

The snv-indels module aligns the reads to the reference and calls variants. The BAM and count files produced by this module are used by the fusion and gene expression modules.

Tools

This module uses STAR to align the reads to the reference using two-pass mode. VarDict is used to call variants, which are annotated using VEP. Variants are filtered based on the criteria defined in inclusion_criteria, and annotated based on annotation_criteria.

The variants annotated by VEP are then filtered and annotated in the following order:

1. Variants with a population frequency of more than 1% in the gnomADe population are removed.
2. Only variants that match the inclusion_criteria are included.
3. If a variant is present in known_variants, the annotation from that file is added to the variant.
4. If a variant is not in known_variants, it is checked against the criteria in annotation_criteria. The annotation from the first matching definition is added to the variant.
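The order of these steps can be illustrated with a minimal Python sketch. This is not HAMLET's implementation: the dictionary keys `gnomADe_AF` and `hgvsc`, the function name `annotate_variant`, and the two callables are hypothetical names chosen for illustration, and treating a missing population frequency as 0 is an assumption.

```python
from typing import Callable, Optional

GNOMAD_MAX_AF = 0.01  # the 1% population frequency cut-off described above


def annotate_variant(
    variant: dict,
    is_included: Callable[[dict], bool],
    known_variants: dict,
    first_matching_annotation: Callable[[dict], Optional[str]],
) -> Optional[dict]:
    """Return the annotated variant, or None if it is filtered out."""
    # Step 1: remove variants that are common in gnomADe.
    # Assumption: a missing frequency is treated as 0.
    if (variant.get("gnomADe_AF") or 0.0) > GNOMAD_MAX_AF:
        return None

    # Step 2: keep only variants that match the inclusion criteria.
    if not is_included(variant):
        return None

    # Step 3: an annotation from known_variants takes priority.
    annotation = known_variants.get(variant["hgvsc"])

    # Step 4: otherwise use the first matching annotation criterion.
    if annotation is None:
        annotation = first_matching_annotation(variant)

    return {**variant, "annotation": annotation}
```

Passing the inclusion and annotation checks in as callables keeps the sketch focused on the order of the steps rather than on how a single criterion is matched.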

Picard is used to generate various alignment statistics.

Variant annotations

By default, HAMLET comes with variant filters and annotations which are tuned towards diagnosing AML. When using the default variant filters and annotations, HAMLET uses the following definitions:

Annotation definitions
Annotation	Definition
Known pathogenic	This variant is known to be associated with AML
Pathogenic	All evidence points to this variant being pathogenic for AML
Likely pathogenic	This variant should be considered pathogenic, unless there is evidence to the contrary (e.g. it is a known benign variant)
Possible pathogenic	This variant should not be considered pathogenic, unless there is additional evidence (e.g. it is a known pathogenic variant)
Likely benign	This variant should not be considered pathogenic
Artifact	This variant is most likely an artifact produced by the pipeline, i.e. the variant is not truly present in the sample

Input

The input for this module is a single pair of FastQ files per sample, specified in a PEP configuration file, as shown below.

Example input for the snv-indels module
sample_name	R1	R2	strandedness
MO1-RNAseq-1-16714	test/data/fastq/NOMO1-RNAseq-1-16714_R1.fastq.gz	test/data/fastq/NOMO1-RNAseq-1-16714_R2.fastq.gz	reverse

Output

The output of this module is a JSON file with an overview of the most important results, as well as a number of other output files:

A .bam and .bai per sample, which contain the aligned reads.
The filtered VEP output file (filter_vep), which contains the final set of filtered and annotated variants.
The counts file produced by STAR, which contains the coverage per gene.
Various quality control metrics produced by MultiQC.

Configuration

You can automatically generate a configuration for the snv-indels module using the utilities/create-config.py script.

Example

$ python3 utilities/create-config.py --module snv-indels HAMLET-data

 {
  "annotation_criteria": "HAMLET-data/annotation_criteria.tsv",
  "annotation_refflat": "HAMLET-data/ucsc_gencode.refFlat",
  "inclusion_criteria": "HAMLET-data/inclusion_criteria.tsv",
  "genome_dict": "HAMLET-data/GCA_000001405.15_GRCh38_no_alt_analysis_set.dict",
  "genome_fai": "HAMLET-data/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai",
  "genome_fasta": "HAMLET-data/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna",
  "gtf": "HAMLET-data/Homo_sapiens.GRCh38.115.chr.gtf",
  "known_variants": "HAMLET-data/known_variants.tsv",
  "min_variant_depth": 2,
  "rrna_refflat": "HAMLET-data/ucsc_rrna.refFlat",
  "star_index": "HAMLET-data/star-index",
  "variant_allele_frequency": 0.05,
  "vep_cache": "HAMLET-data"
 }

Note that although the vep_cache entry is optional, specifying a VEP cache folder gives the best performance.

Configuration options

Option	Description	Required
annotation_criteria	Criteria file to annotate variants	yes
annotation_refflat	File used to determine exon coverage	yes
inclusion_criteria	Criteria file to filter variants	yes
genome_dict	.dict index file for the reference fasta	yes
genome_fai	.fai index file for the reference fasta	yes
genome_fasta	Reference genome, in FASTA format	yes
gtf	GTF file with transcripts, used by STAR	yes
known_variants	File containing known variants and their annotation	no
min_variant_depth	Minimum read depth to call a variant	no (default=2)
rrna_refflat	File of rRNA transcripts	yes
star_index	STAR index database	yes
variant_allele_frequency	Minimum variant allele frequency to call a variant	no (default=0.05)
vep_cache	Folder containing the VEP cache	no
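As a rough illustration of how the table above could be applied, the following Python sketch checks a JSON configuration for the required options and fills in the two documented defaults. The names `REQUIRED_OPTIONS`, `DEFAULTS` and `load_config` are hypothetical and are not part of HAMLET; the pipeline's own validation may work differently.

```python
import json

# Options marked "yes" in the Required column of the table above.
REQUIRED_OPTIONS = (
    "annotation_criteria",
    "annotation_refflat",
    "inclusion_criteria",
    "genome_dict",
    "genome_fai",
    "genome_fasta",
    "gtf",
    "rrna_refflat",
    "star_index",
)

# Defaults listed in the table for the optional numeric settings.
DEFAULTS = {"min_variant_depth": 2, "variant_allele_frequency": 0.05}


def load_config(text: str) -> dict:
    """Parse a JSON configuration and apply the documented defaults."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_OPTIONS if key not in config]
    if missing:
        raise ValueError("Missing required options: " + ", ".join(missing))
    # Values given in the configuration override the defaults.
    return {**DEFAULTS, **config}
```

A configuration that omits min_variant_depth would come back from this sketch with the value 2, while one that omits gtf would raise a ValueError naming the missing option.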

Filter and annotation criteria

HAMLET includes the ability to specify separate filter criteria for every transcript, based on the position and the VEP consequence of the variant. The criteria are used both to filter which variants will be part of the output (inclusion_criteria) and to annotate the identified variants (annotation_criteria).

The columns used are transcript_id, consequence, start, end and frame. For annotating variants, the annotation column is used as well. Every column except transcript_id can be empty.

Example `inclusion_criteria` file, from the HAMLET tests
transcript_id	consequence	start	end	frame
ENST00000361851	stop_gained
ENST00000361851	frameshift_variant
ENST00000361851	stop_lost
ENST00000361851	start_lost
ENST00000361851	inframe_insertion
ENST00000361851	inframe_deletion
ENST00000361851	protein_altering_variant
ENST00000361851	missense_variant
ENST00000361899	stop_gained
ENST00000361899	frameshift_variant
ENST00000361899	stop_lost
ENST00000361899	start_lost
ENST00000361899	inframe_insertion
ENST00000361899	inframe_deletion
ENST00000361899	protein_altering_variant
ENST00000361899	missense_variant
ENST00000361789	stop_gained
ENST00000361789	frameshift_variant
ENST00000361789	stop_lost
ENST00000361789	start_lost
ENST00000361789	inframe_insertion
ENST00000361789	inframe_deletion
ENST00000361789	protein_altering_variant
ENST00000361789	missense_variant
ENST00000241453	inframe_insertion

Example `annotation_criteria` file, from the HAMLET tests
transcript_id	consequence	start	end	frame	annotation
ENST00000241453				0	FLT3 in frame
ENST00000241453	inframe_insertion	1790	1801		Hotspot
ENST00000361899	missense_variant	334	334		Hotspot
ENST00000361851	missense_variant	25			Hotspot
ENST00000305877	intron_variant	1279	1280		BCR
ENST00000361453	synonymous_variant				MT-ND2
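To make the role of the columns more concrete, here is a minimal Python sketch of reading a criteria file and looking up the first matching annotation. It rests on several assumptions that the text above does not confirm: an empty field matches any value, start and end are inclusive bounds on a single variant position, and frame is compared with a precomputed value on the variant. The function names and the variant keys (`consequences`, `position`, `frame`) are hypothetical.

```python
import csv
import io
from typing import Optional


def read_criteria(text: str) -> list:
    """Parse a tab-separated criteria file into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Short rows yield None for the missing columns; normalise to "".
    return [
        {key: (value or "").strip() for key, value in row.items() if key is not None}
        for row in reader
    ]


def matches(criterion: dict, variant: dict) -> bool:
    """Assumption: an empty criterion field matches any value."""
    if criterion["transcript_id"] != variant["transcript_id"]:
        return False
    if criterion.get("consequence") and criterion["consequence"] not in variant["consequences"]:
        return False
    if criterion.get("start") and variant["position"] < int(criterion["start"]):
        return False
    if criterion.get("end") and variant["position"] > int(criterion["end"]):
        return False
    if criterion.get("frame") and int(criterion["frame"]) != variant["frame"]:
        return False
    return True


def first_annotation(criteria: list, variant: dict) -> Optional[str]:
    """Return the annotation of the first matching criterion, if any."""
    for criterion in criteria:
        if matches(criterion, variant):
            return criterion.get("annotation") or None
    return None
```

Because the first matching definition wins, the order of the rows in the file matters: under these assumptions, an in-frame variant in ENST00000241453 would receive "FLT3 in frame" before the more specific Hotspot row is ever considered.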

Known variant annotations

In addition to the annotation criteria described above, it is also possible to supply HAMLET with annotations for specific variants via the known_variants file. Annotations from this file have a higher priority than the annotations specified in annotation_criteria.

The columns used are variant and annotation. Neither column can be empty.

Example `known_variants` file, from the HAMLET tests
variant	annotation
ENST00000361899:c.334A>G	known variant
ENST00000241453:c.1758_1787dup	known variant
ENST00000361899:c.40A>G	artifact
ENST00000361899:c.175A>G	artifact
ENST00000361390:c.210C>A	known variant
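A file in this format could be loaded into a lookup table with a few lines of Python. The sketch below is only an illustration under stated assumptions: the function name `read_known_variants` is hypothetical, and rejecting rows with an empty column is one possible way to enforce the rule above, not necessarily what HAMLET does.

```python
import csv
import io


def read_known_variants(text: str) -> dict:
    """Map each variant description to its annotation."""
    known = {}
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        variant = (row.get("variant") or "").strip()
        annotation = (row.get("annotation") or "").strip()
        # Both columns are mandatory, so reject incomplete rows.
        if not variant or not annotation:
            raise ValueError(
                f"line {reader.line_num}: variant and annotation must both be set"
            )
        known[variant] = annotation
    return known
```

With the example file above, looking up `ENST00000361899:c.40A>G` in the resulting dictionary gives `artifact`.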