Efficiently extract singletons from WGS data on UKB-RAP

27 October 2023 00:00
5 comments

Hi all, I'm trying to extract information of singletons (variant + its carrier ID) from pVCF data of UKB WGS. Currently i could only use swiss army knife to do this for a single pVCF: dx run swiss-army-knife -iin "/Bulk/Whole genome sequences/Population level WGS variants, pVCF format - interim 200k release/ukb24304_c10_b1000_v1.vcf.gz" -icmd="bcftools view -i 'AC=1' ukb24304_c10_b1000_v1.vcf.gz > singletons.vcf && bcftools query -f '[%CHROM:%POS:%REF>%ALT] %s\n' -i 'GT="het"' singletons.vcf > singletons_samples.txt" However, there are 60k pVCF which has to be iterated one by one in this way, and the time and money comsumption (~10 min per VCF with 4 cores) is also too heavy. I'm wondering whether there is any more efficient options (another applet or using swiss army knife in a parallel way). Thanks for your help!

Comments

5 comments

Ondrej Klempir DNAnexus Team
- 06 November 2023 15:55
I am wondering whether there would an alternative way of doing this, e.g. load and read PLINK (bed, bim, fam) or BGEN rather than pVCF. PLINK or BGEN are available per chromosome.

0
Francesco Mazzarotto
- 09 December 2025 15:39
Hi, how did you end up solving this?
I need to extract >2 million variants from the WGS data, distributed across 76k pVCF files (averaging 28 per VCF). I currently developed scripts that can be launched in parallel, using tabix to perform these extractions by position (to then perform allele-matching and downstream analyses locally). But tabix extractions are way too slow, with each of them lasting several minutes if concerning 1-5 genomic positions, even hours if encompassing 50 or 60.
I actually never considered doing this starting from a different type of file (e.g. PLINK files)….until now.
Thanks very much!

0
Ahmet Sayici
- 09 December 2025 15:46
New ML-Corrected DRAGEN whole genome sequencing (WGS) release makes everything so easy. Just go to Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release] and check .pvar files. You'll find AC in the INFO column. Then a short awk command would extract the singletons easily.

0
Francesco Mazzarotto
- 10 December 2025 16:51
Hi Ahmet,
thanks very much for your reply. I had a look into these pvar files, but it's not what I need as they report global AC, but they are not informative as to who carries a given variant.
I need to know who the carriers are for each of my >2 million variants of interest, and that's why I attempted to extract these info from pVCFs so far.
Do you have any idea if doing this from other files types is a possiblity, quicker than tabix-extracting positions of interest from pVCFs?

0
Ahmet Sayici
- Edited 11 December 2025 10:53
Hi Francesco,
In this case I would use PLINK2 with Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release] in a ttyd instance. Bigger the ttyd, faster the analyses.
There is an option in PLINK2 called --export A which reports the SNP dosages per sample. Here is an example run with chromosome 1:
```
./plink2 \
	--pfile '/mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release]/ukb24308_c10_b0_v1' \
	--export A \ 
	--extract variant_list.txt \
	--no-pheno \
	--out carrier_report 
```
I used mounted path, instead of using dx download to get the files to the ttyd, because it is too much faster. You can use the same path. Then we have --export A to get the SNP dosages. After that --extract variant_list.txt to specify which variants you want to report. You can use the same variant list file for every chromosome, you don't need to split it by chr. --no-pheno is required because DRAGEN .psam files are not standard and lack of phenotype column. And finally --out carrier_report to name the output file.
You can run this in a for loop. At the end, you'll have 25 output files. The output files will be massive because the output format is a sparse table. Rows are samples and columns are variants. If a sample-variant pair has NA in the entry it means that sample doesn't have the variant. Otherwise the entries are sample-major additive (0/1/2) coding. Also - --export A looks for REF allele by default.

EDIT 1: Please align with PVAR files. The variant ID format is a bit weird. They all follow this format DRAGEN:CHR:POS:REF:ALT. (DRAGEN is static)

Best,
Ahmet
1

Please sign in to leave a comment.