snv-indels
The snv-indels module is responsible for aligning the reads to the reference and for calling SNVs and insertions/deletions (indels).
Tools
This module uses STAR to align the reads to the reference in two-pass mode. `VarDict <https://github.com/AstraZeneca-NGS/VarDictJava>`_ is used to call variants, which are then annotated using VEP. For each variant, this module determines whether it is located inside one of the regions defined in bed_variant_hotspots.
The variants annotated by VEP are then filtered based on a number of different criteria:
Variants that are present on the blacklist are excluded.
Only variants that are present on one of the specified transcripts in ref_id_mapping are included.
Only variants that match one of the consequences defined in vep_include_consequence are included.
Variants that have a population frequency of more than 1% in the gnomADe population are excluded.
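The four criteria above can be pictured as a chain of checks that each variant must pass. The following is a minimal sketch and not the module's actual implementation: the Variant class, its field names, the passes_filters function and the format of the blacklist identifiers are all hypothetical, and the choice to keep variants without a known gnomADe frequency is an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Variant:
    """Hypothetical, simplified view of one VEP-annotated variant."""
    identifier: str                # how variants are identified on the blacklist is an assumption
    transcript: str                # transcript this annotation refers to
    consequences: FrozenSet[str]   # VEP consequence terms for this transcript
    gnomade_af: Optional[float]    # gnomADe allele frequency, None if unknown


def passes_filters(variant, blacklist, transcripts, include_consequences, max_af=0.01):
    """Return True if the variant survives all four criteria."""
    # 1. Variants on the blacklist are excluded.
    if variant.identifier in blacklist:
        return False
    # 2. Only variants on a transcript of interest (ref_id_mapping) are kept.
    if variant.transcript not in transcripts:
        return False
    # 3. At least one consequence must be listed in vep_include_consequence.
    if not (variant.consequences & set(include_consequences)):
        return False
    # 4. Variants more frequent than 1% in gnomADe are excluded.
    #    Assumption: a variant without a known frequency is kept.
    if variant.gnomade_af is not None and variant.gnomade_af > max_af:
        return False
    return True
```

Because the checks are independent, their order does not change which variants are kept; it only affects which check rejects a variant first.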
Picard is used to generate various alignment statistics.
Input
The input for this module is a single pair of FastQ files per sample, specified in a PEP configuration file, as shown below.
sample_name | R1 | R2 | strandedness
MO1-RNAseq-1-16714 | test/data/fastq/NOMO1-RNAseq-1-16714_R1.fastq.gz | test/data/fastq/NOMO1-RNAseq-1-16714_R2.fastq.gz | forward
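Before starting a run, it can help to check that the sample table has the four columns shown above and lists each sample only once. The sketch below assumes the sample table is stored as a comma-separated file; the function name read_sample_table is hypothetical and is not part of this module.

```python
import csv
import io

# Column names taken from the sample table shown above.
REQUIRED_COLUMNS = ("sample_name", "R1", "R2", "strandedness")


def read_sample_table(handle):
    """Read a CSV sample table and return a dict keyed by sample_name."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"Missing column(s): {', '.join(missing)}")

    samples = {}
    for row in reader:
        name = row["sample_name"]
        # The module expects a single pair of FastQ files per sample.
        if name in samples:
            raise ValueError(f"Sample {name!r} is listed more than once")
        samples[name] = row
    return samples


# Usage with the example row from the table above.
example = io.StringIO(
    "sample_name,R1,R2,strandedness\n"
    "MO1-RNAseq-1-16714,"
    "test/data/fastq/NOMO1-RNAseq-1-16714_R1.fastq.gz,"
    "test/data/fastq/NOMO1-RNAseq-1-16714_R2.fastq.gz,"
    "forward\n"
)
samples = read_sample_table(example)
```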
Output
The output of this module is a JSON file with an overview of the most important results, together with a number of other output files:
A .bam and .bai per sample, which contain the aligned reads.
A VEP output file (vep_high), which contains the final set of filtered variants.
A VEP output file (vep_target), which contains the variants on the transcripts of interest. These variants have not been filtered on vep_include_consequence terms.
A VCF file that only contains those variants that fall in one of the bed_variant_hotspots regions.
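To illustrate what "falls in one of the bed_variant_hotspots regions" means, here is a minimal sketch of a position-in-region check. It is not the module's own code: the function names load_bed and in_hotspot are hypothetical, and only the variant's start position is tested, which is a simplification for indels that span several bases. The one subtlety is the coordinate system: BED intervals are 0-based and half-open, whereas VCF positions are 1-based.

```python
def load_bed(lines):
    """Parse BED lines into {chromosome: [(start, end), ...]}."""
    regions = {}
    for line in lines:
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def in_hotspot(chrom, pos, regions):
    """Return True if a 1-based VCF position lies inside any BED region.

    A BED interval (start, end) is 0-based and half-open, so it covers
    the 1-based positions start + 1 up to and including end.
    """
    return any(start < pos <= end for start, end in regions.get(chrom, []))
```

A linear scan over the regions is sufficient for a short hotspot list; a sorted list or an interval tree would only be worth the extra complexity for much larger BED files.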
Configuration
You can automatically generate a configuration for the snv-indels module using the utilities/create-config.py script.
Example
{
  "genome_fasta": "test/data/reference/hamlet-ref.fa",
  "genome_fai": "test/data/reference/hamlet-ref.fa.fai",
  "genome_dict": "test/data/reference/hamlet-ref.dict",
  "star_index": "test/data/reference/hamlet-star",
  "ref_id_mapping": "test/data/reference/id_mappings.tsv",
  "rrna_refflat": "test/data/reference/ucsc_rrna.refFlat",
  "bed_variant_hotspots": "test/data/reference/hotspots_genome.bed",
  "gtf": "test/data/reference/hamlet-ref.gtf",
  "annotation_refflat": "test/data/reference/hamlet-ref.refFlat",
  "vep_include_consequence": [
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
    "start_lost",
    "inframe_insertion",
    "inframe_deletion",
    "protein_altering_variant",
    "missense_variant"
  ]
}
Note that the vep-cache entry is missing for this example file, which means that the online API of VEP will be used. For the best performance, please specify a vep-cache folder as well.
Configuration options
Option | Description | Required
forward_adapter | The forward adapter sequence | yes
reverse_adapter | The reverse adapter sequence | yes
genome_fasta | Reference genome, in FASTA format | yes
genome_fai | .fai index file for the reference fasta | yes
genome_dict | .dict index file for the reference fasta | yes
star_index | STAR index database | yes
ref_id_mapping | File of transcripts of interest | yes
rrna_refflat | File of rRNA transcripts | yes
bed_variant_hotspots | BED file of hotspot regions | yes
gtf | GTF file with transcripts, used by STAR | yes
annotation_refflat | File used to determine exon coverage | yes
blacklist | File of blacklisted variants | yes
vep-cache | Folder containing the VEP cache | no
vep_include_consequence | List of `VEP consequences <http://www.ensembl.org/info/genome/variation/prediction/predicted_data.html>`_ to report | yes
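A configuration file can be checked against the table above before a run. The sketch below is not part of the module: the function name missing_options is hypothetical, and the list of required options is simply copied from the "Required" column, so it needs updating whenever that table changes.

```python
import json

# Options marked "yes" in the "Required" column of the table above.
REQUIRED_OPTIONS = (
    "forward_adapter",
    "reverse_adapter",
    "genome_fasta",
    "genome_fai",
    "genome_dict",
    "star_index",
    "ref_id_mapping",
    "rrna_refflat",
    "bed_variant_hotspots",
    "gtf",
    "annotation_refflat",
    "blacklist",
    "vep_include_consequence",
)


def missing_options(config):
    """Return the required options that are absent from a parsed configuration."""
    return [option for option in REQUIRED_OPTIONS if option not in config]


# Usage: parse a (deliberately incomplete) configuration and report the gaps.
config = json.loads('{"genome_fasta": "test/data/reference/hamlet-ref.fa"}')
missing = missing_options(config)
if missing:
    print("Missing required options:", ", ".join(missing))
```

The optional vep-cache entry is deliberately left out of the check, since the module falls back to the online VEP API when it is absent.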