Efficiently extract singletons from WGS data on UKB-RAP

Hi all, I'm trying to extract information of singletons (variant + its carrier ID) from pVCF data of UKB WGS. Currently i could only use swiss army knife to do this for a single pVCF:   dx run swiss-army-knife -iin "/Bulk/Whole genome sequences/Population level WGS variants, pVCF format - interim 200k release/ukb24304_c10_b1000_v1.vcf.gz" -icmd="bcftools view -i 'AC=1' ukb24304_c10_b1000_v1.vcf.gz > singletons.vcf && bcftools query -f '[%CHROM:%POS:%REF>%ALT] %s\n' -i 'GT="het"' singletons.vcf > singletons_samples.txt"   However, there are 60k pVCF which has to be iterated one by one in this way, and the time and money comsumption (~10 min per VCF with 4 cores) is also too heavy. I'm wondering whether there is any more efficient options (another applet or using swiss army knife in a parallel way). Thanks for your help!

Comments

5 comments

  • Comment author
    Ondrej Klempir DNAnexus Team

    I am wondering whether there would an alternative way of doing this, e.g. load and read PLINK (bed, bim, fam) or BGEN rather than pVCF. PLINK or BGEN are available per chromosome.

    0
  • Comment author
    Francesco Mazzarotto

    Hi, how did you end up solving this? 

    I need to extract >2 million variants from the WGS data, distributed across 76k pVCF files (averaging 28 per VCF). I currently developed scripts that can be launched in parallel, using tabix to perform these extractions by position (to then perform allele-matching and downstream analyses locally). But tabix extractions are way too slow, with each of them lasting several minutes if concerning 1-5 genomic positions, even hours if encompassing 50 or 60.

    I actually never considered doing this starting from a different type of file (e.g. PLINK files)….until now.

    Thanks very much!

    0
  • Comment author
    Ahmet Sayici

    New ML-Corrected DRAGEN whole genome sequencing (WGS) release makes everything so easy. Just go to Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release] and check .pvar files. You'll find AC in the INFO column. Then a short awk command would extract the singletons easily. 

    0
  • Comment author
    Francesco Mazzarotto

    Hi Ahmet, 

    thanks very much for your reply. I had a look into these pvar files, but it's not what I need as they report global AC, but they are not informative as to who carries a given variant. 

    I need to know who the carriers are for each of my >2 million variants of interest, and that's why I attempted to extract these info from pVCFs so far. 

    Do you have any idea if doing this from other files types is a possiblity, quicker than tabix-extracting positions of interest from pVCFs?

    0
  • Comment author
    Ahmet Sayici
    • Edited

    Hi Francesco,

    In this case I would use PLINK2 with Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release] in a ttyd instance. Bigger the ttyd, faster the analyses. 

    There is an option in PLINK2 called --export A which reports the SNP dosages per sample. Here is an example run with chromosome 1: 

    ./plink2 \
    	--pfile '/mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release]/ukb24308_c10_b0_v1' \
    	--export A \ 
    	--extract variant_list.txt \
    	--no-pheno \
    	--out carrier_report 

    I used mounted path, instead of using dx download to get the files to the ttyd, because it is too much faster. You can use the same path. Then we have --export A to get the SNP dosages. After that --extract variant_list.txt to specify which variants you want to report. You can use the same variant list file for every chromosome, you don't need to split it by chr. --no-pheno is required because DRAGEN .psam files are not standard and lack of phenotype column. And finally --out carrier_report to name the output file.

    You can run this in a for loop. At the end, you'll have 25 output files. The output files will be massive because the output format is a sparse table. Rows are samples and columns are variants. If a sample-variant pair has NA in the entry it means that sample doesn't have the variant. Otherwise the entries are sample-major additive (0/1/2) coding. Also - --export A looks for REF allele by default.
     

    EDIT 1: Please align with PVAR files. The variant ID format is a bit weird. They all follow this format DRAGEN:CHR:POS:REF:ALT. (DRAGEN is static) 


    Best, 
    Ahmet

    1

Please sign in to leave a comment.