Hi all,
I'm trying to extract information of singletons (variant + its carrier ID) from pVCF data of UKB WGS. Currently i could only use swiss army knife to do this for a single pVCF:
dx run swiss-army-knife -iin "/Bulk/Whole genome sequences/Population level WGS variants, pVCF format - interim 200k release/ukb24304_c10_b1000_v1.vcf.gz" -icmd="bcftools view -i 'AC=1' ukb24304_c10_b1000_v1.vcf.gz > singletons.vcf && bcftools query -f '[%CHROM:%POS:%REF>%ALT] %s\n' -i 'GT="het"' singletons.vcf > singletons_samples.txt"
However, there are 60k pVCF which has to be iterated one by one in this way, and the time and money comsumption (~10 min per VCF with 4 cores) is also too heavy. I'm wondering whether there is any more efficient options (another applet or using swiss army knife in a parallel way). Thanks for your help!
I am wondering whether there would an alternative way of doing this, e.g. load and read PLINK (bed, bim, fam) or BGEN rather than pVCF. PLINK or BGEN are available per chromosome.
I need to extract >2 million variants from the WGS data, distributed across 76k pVCF files (averaging 28 per VCF). I currently developed scripts that can be launched in parallel, using tabix to perform these extractions by position (to then perform allele-matching and downstream analyses locally). But tabix extractions are way too slow, with each of them lasting several minutes if concerning 1-5 genomic positions, even hours if encompassing 50 or 60.
I actually never considered doing this starting from a different type of file (e.g. PLINK files)….until now.
New ML-Corrected DRAGEN whole genome sequencing (WGS) release makes everything so easy. Just go to Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release] and check .pvar files. You'll find AC in the INFO column. Then a short awk command would extract the singletons easily.
thanks very much for your reply. I had a look into these pvar files, but it's not what I need as they report global AC, but they are not informative as to who carries a given variant.
I need to know who the carriers are for each of my >2 million variants of interest, and that's why I attempted to extract these info from pVCFs so far.
Do you have any idea if doing this from other files types is a possiblity, quicker than tabix-extracting positions of interest from pVCFs?
In this case I would use PLINK2 with Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release] in a ttyd instance. Bigger the ttyd, faster the analyses.
There is an option in PLINK2 called --export A which reports the SNP dosages per sample. Here is an example run with chromosome 1:
./plink2 \
--pfile '/mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release]/ukb24308_c10_b0_v1' \
--export A \
--extract variant_list.txt \
--no-pheno \
--out carrier_report
I used mounted path, instead of using dx download to get the files to the ttyd, because it is too much faster. You can use the same path. Then we have --export A to get the SNP dosages. After that --extract variant_list.txt to specify which variants you want to report. You can use the same variant list file for every chromosome, you don't need to split it by chr. --no-pheno is required because DRAGEN .psam files are not standard and lack of phenotype column. And finally --out carrier_report to name the output file.
You can run this in a for loop. At the end, you'll have 25 output files. The output files will be massive because the output format is a sparse table. Rows are samples and columns are variants. If a sample-variant pair has NA in the entry it means that sample doesn't have the variant. Otherwise the entries are sample-major additive (0/1/2) coding. Also - --export A looks for REF allele by default.
EDIT 1: Please align with PVAR files. The variant ID format is a bit weird. They all follow this format DRAGEN:CHR:POS:REF:ALT. (DRAGEN is static)
Comments
5 comments
I am wondering whether there would an alternative way of doing this, e.g. load and read PLINK (bed, bim, fam) or BGEN rather than pVCF. PLINK or BGEN are available per chromosome.
Hi, how did you end up solving this?
I need to extract >2 million variants from the WGS data, distributed across 76k pVCF files (averaging 28 per VCF). I currently developed scripts that can be launched in parallel, using tabix to perform these extractions by position (to then perform allele-matching and downstream analyses locally). But tabix extractions are way too slow, with each of them lasting several minutes if concerning 1-5 genomic positions, even hours if encompassing 50 or 60.
I actually never considered doing this starting from a different type of file (e.g. PLINK files)….until now.
Thanks very much!
New ML-Corrected DRAGEN whole genome sequencing (WGS) release makes everything so easy. Just go to
Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release]and check.pvarfiles. You'll find AC in the INFO column. Then a short awk command would extract the singletons easily.Hi Ahmet,
thanks very much for your reply. I had a look into these pvar files, but it's not what I need as they report global AC, but they are not informative as to who carries a given variant.
I need to know who the carriers are for each of my >2 million variants of interest, and that's why I attempted to extract these info from pVCFs so far.
Do you have any idea if doing this from other files types is a possiblity, quicker than tabix-extracting positions of interest from pVCFs?
Hi Francesco,
In this case I would use
PLINK2withBulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release]in attydinstance. Bigger thettyd, faster the analyses.There is an option in PLINK2 called
--export Awhich reports the SNP dosages per sample. Here is an example run with chromosome 1:I used mounted path, instead of using
dx downloadto get the files to thettyd, because it is too much faster. You can use the same path. Then we have--export Ato get the SNP dosages. After that--extract variant_list.txtto specify which variants you want to report. You can use the same variant list file for every chromosome, you don't need to split it by chr.--no-phenois required because DRAGEN .psam files are not standard and lack of phenotype column. And finally--out carrier_reportto name the output file.You can run this in a for loop. At the end, you'll have 25 output files. The output files will be massive because the output format is a sparse table. Rows are samples and columns are variants. If a sample-variant pair has NA in the entry it means that sample doesn't have the variant. Otherwise the entries are sample-major additive (0/1/2) coding. Also -
--export Alooks for REF allele by default.EDIT 1: Please align with PVAR files. The variant ID format is a bit weird. They all follow this format DRAGEN:CHR:POS:REF:ALT. (DRAGEN is static)
Best,
Ahmet
Please sign in to leave a comment.