Resequencing data SNP detection

Specific tutorial: SNP detection in resequencing data


Performing variant detection on whole-genome resequencing data to obtain variants such as SNPs, indels, and SVs is a necessary step before conducting genetic analysis of a population. If there are many individuals to be tested, or the sequencing depth is high, the process also requires a computer with a large RAM capacity and a high-performance CPU.

Different software tools are required for variant detection, and the exact steps vary between operators. Below is the SNP calling workflow that I have personally run and verified to work, provided for reference only.

Software and environment preparation:

conda create -n snp_calling python=3.9 
conda activate snp_calling
conda install -c bioconda bwa hisat2 samtools picard gatk4

Here I use conda to manage the software, setting up the analysis environment in advance.

1.Index the reference genome

First, index the reference genome using bwa.

#Index the reference genome using BWA. Note that bwa index runs single-threaded and has no -t option.
bwa index ref_genome.fna

The indexing step itself cannot be parallelized; the thread settings below apply to alignment and later steps. The computing server I use has 384 CPU threads. The number of threads you pass to any tool should be chosen based on the actual core and thread count of your machine, and it must not exceed that count.
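If you are unsure how many threads your machine offers, you can query it first (`nproc` is part of GNU coreutils):

```shell
# Print the number of CPU threads available to the current process
threads=$(nproc)
echo "Available threads: $threads"
```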

After the indexing is completed, we align the reads with the following command.

bwa mem -t 10 ref_genome.fna R1.clean.fq.gz R2.clean.fq.gz > 01.sam

This command uses 10 threads for alignment. R1.clean.fq.gz and R2.clean.fq.gz are the forward and reverse reads of the quality-controlled paired-end data, and 01.sam is the output file. Adjust the paths as needed. If there are many individuals, you can launch the alignments in the background and use the wait command to run them for all individuals simultaneously.
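The wait-based parallel alignment could be sketched as follows; the directory names and the `<sample>_R1.clean.fq.gz` / `<sample>_R2.clean.fq.gz` naming pattern are assumptions, so adapt them to your files:

```shell
#!/bin/bash
# Align all samples in the background, then wait for every job to finish.
# Assumes paired files named <sample>_R1.clean.fq.gz and <sample>_R2.clean.fq.gz.
input_dir="clean_reads"
output_dir="sam_files"
mkdir -p "$output_dir"
for r1 in "$input_dir"/*_R1.clean.fq.gz; do
  [ -e "$r1" ] || continue          # skip if no files match the pattern
  r2="${r1/_R1.clean/_R2.clean}"    # derive the mate file name
  sample=$(basename "$r1" _R1.clean.fq.gz)
  bwa mem -t 10 ref_genome.fna "$r1" "$r2" > "$output_dir/${sample}.sam" &
done
wait
echo "All alignments finished"
```

Keep in mind that each background job uses 10 threads, so the total across all jobs must stay within the machine's capacity.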

2.Convert and sort SAM/BAM files

When using samtools for processing, the standard commands are as follows:

samtools view -@ 4 -bS 01.sam > 01.bam
samtools sort -@ 4 01.bam -o 01_sorted.bam
samtools index 01_sorted.bam

Similarly, when dealing with a large number of individuals, we can make the entire process multi-threaded and automated by using the following script:

#!/bin/bash
#Set the directories below before running; place the SAM files generated by the alignment step in the input directory beforehand.
#Note that the output directories must also be set correctly.
input_dir="sam_files"
output_dir_que="bam_files"
output_dir_modify="modify"
mkdir -p "$output_dir_que" "$output_dir_modify"
#Step 1: Convert .sam to .bam
for file in "$input_dir"/*.sam; do
  samtools view -@ 12 -bS "$file" > "$output_dir_que/$(basename "${file%.sam}").bam" &
done
wait
#Step 2: Sort .bam files
for file in "$output_dir_que"/*.bam; do
  samtools sort -@ 12 "$file" -o "$output_dir_modify/$(basename "${file%.bam}")_sorted.bam" &
done
wait
#Step 3: Index sorted .bam files
for file in "$output_dir_modify"/*_sorted.bam; do
  samtools index "$file" &
done
wait
echo "Execution ended"

The script above consists of three steps, and the output directories for each step can be modified within the script before running it, or variables can be passed via bash commands.

Additionally, ensure that the number of threads does not exceed the maximum CPU thread count. If you are using a script, also note that the total number of threads, obtained by multiplying the thread count per command by the number of instances, must not exceed your CPU's thread capacity.
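One simple way to stay within that limit is to derive the number of parallel jobs from the machine's thread count, for example:

```shell
# Derive how many 12-thread jobs fit on this machine
threads_per_job=12
total_threads=$(nproc)
max_jobs=$(( total_threads / threads_per_job ))
[ "$max_jobs" -lt 1 ] && max_jobs=1   # always allow at least one job
echo "Run at most $max_jobs jobs of $threads_per_job threads each"
```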

3.Mark PCR duplicates

Here we are using Picard, and it's important to note that recent Picard releases require a Java (OpenJDK) version of at least 17. If you encounter Java version issues, you can use conda to set up an appropriate environment:

conda install -c conda-forge openjdk=17

Standard commands for marking duplicates and indexing:

picard MarkDuplicates I=01_sorted.bam O=01_marked.bam M=01_metrics.txt
samtools index 01_marked.bam

The code for automated execution using a script is as follows:

#!/bin/bash
#Set the path of the Picard tool; modify it according to the actual situation.
PICARD_JAR="/the-path-to-your-picard.jar/picard.jar"
#Create output directory
mkdir -p modify/marked_bams
#Define a function to mark duplicates
mark_duplicates() {
    local bam_file=$1
    #In this tutorial, files are named 01_sorted.bam, 02_sorted.bam, and so on. If your filenames do not follow this format, modify the code below.
    local base_name=$(basename "$bam_file" "_sorted.bam")
    #Files ending with _sorted.bam are identified automatically, and _sorted.bam is stripped to obtain the sample name.
    local output_bam="modify/marked_bams/${base_name}_marked.bam"
    #Here defines the output file name.
    local metrics_file="modify/marked_bams/${base_name}_metrics.txt"
    #Here defines the output metrics file name.
    java -Xmx32g -jar "$PICARD_JAR" MarkDuplicates \
        I="$bam_file" \
        O="$output_bam" \
        M="$metrics_file" \
        CREATE_INDEX=true 2>&1 | tee "modify/marked_bams/${base_name}_markdup.log"
    #Here, the Picard tool is used to mark duplicates.
    if [ ! -f "$output_bam" ]; then
        echo "Failed to generate $output_bam, check modify/marked_bams/${base_name}_markdup.log for details." >> error_log.txt
    fi
    #If the output file does not exist, record an error message.
}
#Process 5 tasks in parallel; adjust as needed. Increase the number of parallel tasks and test: if no errors or interruptions occur within 2-3 minutes, the run can proceed normally.
max_jobs=5
current_jobs=0
for bam_file in modify/*_sorted.bam; do
    mark_duplicates "$bam_file" &
    current_jobs=$((current_jobs + 1))
    if [ "$current_jobs" -ge "$max_jobs" ]; then
        wait
        current_jobs=0
    fi
done
#Wait for all background tasks to complete
wait
echo "All BAM files have been marked for duplicates."

4.Add sample information to BAM file

In practice, GATK 4 reports an error on the marked BAM files if no sample (read group) information is found. Therefore, it is necessary to add the sample information to the BAM file first, using Picard.

picard AddOrReplaceReadGroups I=modify/marked_bams/01_marked.bam O=modify/marked_bams/01_marked_with_RG.bam RGID=1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample_name
#The sample_name and other fields here need to be modified according to your actual situation.

Here is a processing script. If you need to add different information, simply make the corresponding modifications.

#!/bin/bash

input_dir="modify/marked_bams"
output_dir="modify/marked_bams_with_RG"

mkdir -p $output_dir

for bam_file in $input_dir/*_marked.bam; do
    sample_name=$(basename $bam_file _marked.bam)
    output_file="$output_dir/${sample_name}_marked_with_RG.bam"
  
    picard AddOrReplaceReadGroups \
        I=$bam_file \
        O=$output_file \
        RGID=1 \
        RGLB=lib1 \
        RGPL=illumina \
        RGPU=unit1 \
        RGSM=$sample_name
  
    samtools index $output_file
done

5.GATK variant detection

Using GATK for SNP and indel detection. Below is the actual command from the analysis of one sample, using 15 CPU threads. If you need to obtain GVCF files, skip this block of code and see the GVCF version further below.

gatk HaplotypeCaller \
   -R ../../genome_data/ref_genome.fna \
   -I modify/marked_bams_with_RG/01_marked_with_RG.bam \
   -O output_vcf/01.vcf \
   --native-pair-hmm-threads 15

Below is the multi-threaded operation script:

#!/bin/bash
input_dir="modify/marked_bams_with_RG"
output_dir="output_vcf"
reference_genome="../../genome_data/ref_genome.fna"
mkdir -p $output_dir
for bam_file in $input_dir/*_marked_with_RG.bam; do
    sample_name=$(basename $bam_file _marked_with_RG.bam)
    output_vcf="$output_dir/${sample_name}.vcf"
    gatk HaplotypeCaller \
        -R $reference_genome \
        -I $bam_file \
        -O $output_vcf \
        --native-pair-hmm-threads 15 &
done
wait

Reference timing: in my own run, SNP detection used 15 CPU threads per individual and completed in about 12,000 CPU minutes, i.e. 200 CPU hours, or roughly 13.3 hours of wall-clock time per sample. The genome size was 2.2 Gb, the sequencing depth was 35X, and the paired-end sequencing files were about 7 GB each. The command above does not generate a GVCF file; below is the version for obtaining a GVCF.

gatk HaplotypeCaller \
   -R ../../genome_data/ref_genome.fna \
   -I modify/marked_bams_with_RG/01_marked_with_RG.bam \
   -O output_vcf/01.g.vcf.gz \
   --native-pair-hmm-threads 15 \
   -ERC GVCF -ploidy 2 -stand-call-conf 30.0

Batch script:

#!/bin/bash
input_dir="modify/marked_bams_with_RG"
output_dir="output_vcf"
reference_genome="../../genome_data/ref_genome.fna"
mkdir -p $output_dir
for bam_file in $input_dir/*_marked_with_RG.bam; do
    sample_name=$(basename $bam_file _marked_with_RG.bam)
    output_vcf="$output_dir/${sample_name}.g.vcf.gz"
    gatk HaplotypeCaller \
        -R $reference_genome \
        -I $bam_file \
        -O $output_vcf \
        -ERC GVCF -ploidy 2 -stand-call-conf 30.0 \
        --native-pair-hmm-threads 15 &
done
wait

6.Individual variant information VCF file merging

gatk CombineGVCFs -R ../../genome_data/ref_genome.fna \
--variant output_vcf/01.g.vcf.gz \
--variant output_vcf/02.g.vcf.gz \
--variant ... \
-O ./final_combine/combined.g.vcf.gz

If you have a large number of files, you can merge them in batches to speed up the process. For example, if you have 100 individuals, you can merge 10 at a time and finally merge all 10 together.
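A batched merge along those lines might look like this sketch; the batch size, the `output_vcf/` input directory, and the `final_combine/batches/` layout are assumptions to adapt:

```shell
#!/bin/bash
# Merge per-sample GVCFs in batches of 10, then merge the batch-level GVCFs
batch_size=10
mkdir -p final_combine/batches
gvcfs=(output_vcf/*.g.vcf.gz)
batch=0
for (( i=0; i<${#gvcfs[@]}; i+=batch_size )); do
  # Collect --variant arguments for this batch
  args=()
  for gvcf in "${gvcfs[@]:i:batch_size}"; do
    args+=(--variant "$gvcf")
  done
  gatk CombineGVCFs -R ../../genome_data/ref_genome.fna \
    "${args[@]}" -O "final_combine/batches/batch_${batch}.g.vcf.gz"
  batch=$((batch + 1))
done
# Final merge of all batch files into one combined GVCF
args=()
for gvcf in final_combine/batches/*.g.vcf.gz; do
  args+=(--variant "$gvcf")
done
gatk CombineGVCFs -R ../../genome_data/ref_genome.fna \
  "${args[@]}" -O final_combine/combined.g.vcf.gz
```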

7.Genotyping and SNP Extraction

Use the following command to genotype the variant sites.

gatk GenotypeGVCFs -R ../../genome_data/ref_genome.fna -V ./final_combine/combined.g.vcf.gz -O ./final_genotype/genotype.vcf.gz

Use the following command to extract SNPs

gatk SelectVariants -R ../../genome_data/ref_genome.fna -O SNP.vcf.gz --variant ./final_genotype/genotype.vcf.gz --select-type-to-include SNP

8.SNP hard filtering

Hard filtering lets you set the relevant parameters and filtering criteria based on your needs. Below is a reference.

gatk VariantFiltration \
-V SNP.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
-O SNP_filtered.vcf.gz

9.Follow-up actions

After obtaining the filtered SNP VCF, you can apply further filtering and other operations with VCFtools before downstream analysis.
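For example, a common VCFtools pass might look like this; the MAF and missingness thresholds are illustrative, not recommendations:

```shell
# Keep biallelic SNPs that passed the hard filters, with example MAF/missingness cutoffs
vcftools --gzvcf SNP_filtered.vcf.gz \
  --remove-filtered-all \
  --min-alleles 2 --max-alleles 2 \
  --maf 0.05 \
  --max-missing 0.9 \
  --recode --recode-INFO-all \
  --out SNP_clean
```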