snv-indels module
The snv-indels module is responsible for aligning the reads to the reference, and calling variants. The bam and count files produced by this module are used in the fusion and gene expression modules.
Tools
This module uses STAR to align the reads
to the reference using two-pass mode. VarDict is used to call variants,
which are annotated using VEP. Variants are filtered based on the criteria
defined in filter_criteria, and annotated based on annotation_criteria.
The variants annotated by VEP are then filtered based on the following criteria:
- Variants that are present on the blacklist are excluded.
- Only variants that match at least one criterion in filter_criteria are included.
- Variants that have a population frequency of more than 1% in the gnomADe population are excluded.
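The three filtering steps above can be sketched in Python as follows. This is a minimal illustration and not HAMLET's actual implementation: the Variant class, its field names, and the representation of the blacklist as a set of variant identifiers are assumptions made for the example, and the criteria check is reduced to a simple (transcript, consequence) lookup.

```python
from dataclasses import dataclass


# Hypothetical, simplified variant record; the field names are assumptions.
@dataclass(frozen=True)
class Variant:
    variant_id: str
    transcript_id: str
    consequence: str
    gnomade_af: float = 0.0  # population frequency in gnomADe (0.0 if unknown)


MAX_POPULATION_FREQUENCY = 0.01  # variants above 1% are excluded


def keep_variant(variant, blacklist, criteria):
    """Return True if the variant passes all three filters."""
    # 1. Blacklisted variants are excluded.
    if variant.variant_id in blacklist:
        return False
    # 2. The variant must match at least one filter criterion.
    if (variant.transcript_id, variant.consequence) not in criteria:
        return False
    # 3. Variants that are common in the population are excluded.
    return variant.gnomade_af <= MAX_POPULATION_FREQUENCY


blacklist = {"var2"}
criteria = {("ENST00000361851.1", "missense_variant")}
variants = [
    Variant("var1", "ENST00000361851.1", "missense_variant", 0.0001),
    Variant("var2", "ENST00000361851.1", "missense_variant"),        # blacklisted
    Variant("var3", "ENST00000361851.1", "synonymous_variant"),      # no criterion
    Variant("var4", "ENST00000361851.1", "missense_variant", 0.05),  # too common
]
kept = [v.variant_id for v in variants if keep_variant(v, blacklist, criteria)]
print(kept)
```

Only var1 survives in this toy data set: each of the other three variants is rejected by exactly one of the filters.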
Picard is used to generate various alignment statistics.
Input
The input for this module is a single pair of FastQ files per sample, specified in a PEP configuration file, as is shown below.
| sample_name | R1 | R2 | strandedness |
|---|---|---|---|
| MO1-RNAseq-1-16714 | test/data/fastq/NOMO1-RNAseq-1-16714_R1.fastq.gz | test/data/fastq/NOMO1-RNAseq-1-16714_R2.fastq.gz | reverse |
Output
The output of this module is a JSON file with an overview of the most important results, as well as a number of other output files:
- A .bam and .bai per sample, which contain the aligned reads.
- The filtered VEP output file (filter_vep), which contains the final set of filtered and annotated variants.
- The counts file produced by STAR, which contains the coverage per gene.
Configuration
You can automatically generate a configuration for the snv-indels module using the utilities/create-config.py script.
Example
{
"genome_fasta": "test/data/reference/hamlet-ref.fa",
"genome_fai": "test/data/reference/hamlet-ref.fa.fai",
"genome_dict": "test/data/reference/hamlet-ref.dict",
"star_index": "test/data/reference/hamlet-star",
"ref_id_mapping": "test/data/reference/id_mappings.tsv",
"filter_criteria": "test/data/config/filter_criteria.tsv",
"annotation_criteria": "test/data/config/annotation_criteria.tsv",
"rrna_refflat": "test/data/reference/ucsc_rrna.refFlat",
"gtf": "test/data/reference/hamlet-ref.gtf",
"annotation_refflat": "test/data/reference/hamlet-ref.refFlat"
}
Note that the vep-cache entry is missing for this example file, which means
that VEP will be run with only the fasta and gtf files as input. For the best performance, please
specify a vep-cache folder as well.
Configuration options
| Option | Description | Required |
|---|---|---|
| genome_fasta | Reference genome, in FASTA format | yes |
| genome_fai | .fai index file for the reference fasta | yes |
| genome_dict | .dict index file for the reference fasta | yes |
| star_index | STAR index database | yes |
| ref_id_mapping | File of transcripts of interest | yes |
| filter_criteria | Criteria file to filter variants | yes |
| annotation_criteria | Criteria file to annotate variants | yes |
| rrna_refflat | File of rRNA transcripts | yes |
| gtf | GTF file with transcripts, used by STAR | yes |
| annotation_refflat | File used to determine exon coverage | yes |
| blacklist | File of blacklisted variants | no |
| vep-cache | Folder containing the VEP cache | no |
| variant_allele_frequency | Minimum variant allele frequency in the sample to call a variant (default=0.05) | no |
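As a rough illustration of how the required and optional settings above fit together, the sketch below checks a configuration dictionary for missing required options and fills in the documented default for variant_allele_frequency. The check_config function is hypothetical and not part of HAMLET, and the sketch assumes the settings form a flat JSON object as in the example above.

```python
import json

# Option names taken from the configuration options table.
REQUIRED_OPTIONS = [
    "genome_fasta", "genome_fai", "genome_dict", "star_index",
    "ref_id_mapping", "filter_criteria", "annotation_criteria",
    "rrna_refflat", "gtf", "annotation_refflat",
]
# Optional settings; only variant_allele_frequency has a documented default.
OPTIONAL_DEFAULTS = {
    "blacklist": None,
    "vep-cache": None,
    "variant_allele_frequency": 0.05,
}


def check_config(config):
    """Return (settings, missing): settings with defaults applied, and
    the list of required options that are absent from the config."""
    missing = [key for key in REQUIRED_OPTIONS if key not in config]
    settings = {**OPTIONAL_DEFAULTS, **config}
    return settings, missing


# A deliberately incomplete configuration, to show what gets reported.
example = json.loads("""
{
  "genome_fasta": "test/data/reference/hamlet-ref.fa",
  "genome_fai": "test/data/reference/hamlet-ref.fa.fai",
  "gtf": "test/data/reference/hamlet-ref.gtf"
}
""")
settings, missing = check_config(example)
print("missing:", missing)
print("variant_allele_frequency:", settings["variant_allele_frequency"])
if settings["vep-cache"] is None:
    print("no vep-cache: VEP will only use the fasta and gtf files")
```

Running a check like this before starting the pipeline surfaces a missing reference file early, rather than partway through alignment.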
Filter and annotation criteria
HAMLET includes the ability to specify separate filter criteria for every
transcript, based on the position and the VEP consequence of the variant. The
criteria are used both to filter which variants will be part of the output
(filter_criteria) and to annotate the identified variants
(annotation_criteria).
The required columns are transcript_id, consequence, start and end. For annotating variants, the annotation column is used as well. Every column except for transcript_id can be empty.
| transcript_id | consequence | start | end |
|---|---|---|---|
| ENST00000361851.1 | stop_gained | | |
| ENST00000361851.1 | frameshift_variant | | |
| ENST00000361851.1 | stop_lost | | |
| ENST00000361851.1 | start_lost | | |
| ENST00000361851.1 | inframe_insertion | | |
| ENST00000361851.1 | inframe_deletion | | |
| ENST00000361851.1 | protein_altering_variant | | |
| ENST00000361851.1 | missense_variant | | |
| ENST00000361899.1 | stop_gained | | |
| ENST00000361899.1 | frameshift_variant | | |
| ENST00000361899.1 | stop_lost | | |
| ENST00000361899.1 | start_lost | | |
| ENST00000361899.1 | inframe_insertion | | |
| ENST00000361899.1 | inframe_deletion | | |
| ENST00000361899.1 | protein_altering_variant | | |
| ENST00000361899.1 | missense_variant | | |
| ENST00000361789.1 | stop_gained | | |
| ENST00000361789.1 | frameshift_variant | | |
| ENST00000361789.1 | stop_lost | | |
| ENST00000361789.1 | start_lost | | |
| ENST00000361789.1 | inframe_insertion | | |
| ENST00000361789.1 | inframe_deletion | | |
| ENST00000361789.1 | protein_altering_variant | | |
| ENST00000361789.1 | missense_variant | | |
| ENST00000241453.1 | inframe_insertion | | |

| transcript_id | consequence | start | end | annotation |
|---|---|---|---|---|
| ENST00000241453.1 | inframe_insertion | 1790 | 1801 | Hotspot |
| ENST00000361899.1 | missense_variant | 334 | 334 | Hotspot |
| ENST00000361851.1 | missense_variant | 25 | | Hotspot |
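To illustrate how a criteria row might be applied, the sketch below matches a variant against annotation criteria. It assumes that an empty column acts as a wildcard, that start and end are inclusive positions on the transcript, and that a variant is described by a single position. The text above does not spell out these details, and all names (Criterion, matches, annotate) are hypothetical rather than HAMLET's own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Criterion:
    transcript_id: str
    consequence: Optional[str] = None  # None: any consequence (assumption)
    start: Optional[int] = None        # None: no lower bound (assumption)
    end: Optional[int] = None          # None: no upper bound (assumption)
    annotation: Optional[str] = None   # only used for annotation criteria


def matches(criterion, transcript_id, consequence, position):
    """Check one variant against one criterion; empty fields match anything."""
    if criterion.transcript_id != transcript_id:
        return False
    if criterion.consequence is not None and criterion.consequence != consequence:
        return False
    if criterion.start is not None and position < criterion.start:
        return False
    if criterion.end is not None and position > criterion.end:
        return False
    return True


def annotate(criteria, transcript_id, consequence, position):
    """Return the annotations of every criterion the variant matches."""
    return [
        c.annotation
        for c in criteria
        if c.annotation and matches(c, transcript_id, consequence, position)
    ]


# The first two rows of the five-column table above.
annotation_criteria = [
    Criterion("ENST00000241453.1", "inframe_insertion", 1790, 1801, "Hotspot"),
    Criterion("ENST00000361899.1", "missense_variant", 334, 334, "Hotspot"),
]

inside = annotate(annotation_criteria, "ENST00000361899.1", "missense_variant", 334)
outside = annotate(annotation_criteria, "ENST00000361899.1", "missense_variant", 335)
print(inside, outside)
```

The same matches function could serve filter_criteria as well: under the wildcard assumption, a row with only transcript_id and consequence filled in accepts that consequence at any position on the transcript.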