Optimising speed of extracting specific loci from WGS
I am trying to get information for c. 500,000 loci scattered throughout the genome (basically allele frequencies for the whole UK population at specific sites). That means I need to access thousands of VCFs for chunked regions. (I have an index file of the VCFs for this purpose, which may be helpful to others and which I can share.) The information I need is contained in the INFO column of the DRAGEN VCF files.
I have been experimenting using the DNAnexus ttyd app.
Retrieving each variant is so painfully slow that I am wondering whether anyone else has a faster approach.
I have tried the following, which takes about a minute per variant. It relies on dxfuse mounting; I tried downloading the VCF to the worker and it was no quicker.
```
# Ensure bcftools is installed on the worker.
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' \
  --regions-overlap 0 -i 'TYPE="snp"' \
  -r chr22:15690182,chr22:15690294,chr22:15690376,chr22:15690406,chr22:15690425 \
  "$DRAGENPATH"/ukb24310_c22_b784_v1.vcf.gz
```
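One variation on the command above, as a minimal sketch only: put every locus that falls in a given VCF chunk into a regions file and pass it once with `-R`, so each chunk is opened a single time. The file names `loci.txt`, `loci.regions.tsv` and `loci.info.tsv` are hypothetical, and whether this is actually faster over dxfuse is untested.

```shell
# Hypothetical input: one locus per line as CHROM:POS.
printf '%s\n' chr22:15690406 chr22:15690182 chr22:15690294 \
  chr22:15690376 chr22:15690425 > loci.txt

# Convert to the two-column, tab-delimited CHROM<TAB>POS form accepted by
# bcftools -R, sorted by position with duplicates removed.
tr ':' '\t' < loci.txt | sort -k1,1 -k2,2n -u > loci.regions.tsv

# One indexed pass over the chunk for all loci.
# Guarded so the sketch does nothing where bcftools or the VCF is absent.
vcf="${DRAGENPATH:-.}/ukb24310_c22_b784_v1.vcf.gz"
if command -v bcftools >/dev/null 2>&1 && [ -f "$vcf" ]; then
  bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' \
    --regions-overlap 0 -i 'TYPE="snp"' \
    -R loci.regions.tsv "$vcf" > loci.info.tsv
fi
```

If many loci fall within one chunk, `-T` (targets file) may be worth comparing against `-R`: it streams through the file rather than jumping via the index, which can be quicker when the index lookups themselves are the bottleneck.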
I also tried plink2, which is marginally faster:
```
plink2 --vcf "$DRAGENPATH/ukb24310_c22_b784_v1.vcf.gz" --extract bed1 dummybed.bed --freq --snps-only --out freq
```
Am I missing a trick here somewhere?
NB - I know that there is a UKB allele frequency browser but as far as I can tell there is no API or file to query thousands of loci.
Comments
Hi Gabriel,
The Swiss Army Knife app contains plink2 and bcftools. You can use it to create a series of individual jobs or batch jobs; it can be launched either interactively or from the command line.
Please see the following for more information: https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-data/accessing-bulk-data#analyzing-files-with-swiss-army-knife , https://www.youtube.com/watch?v=vJHzfqrDaFw
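As a rough sketch of the command-line route, assuming a hypothetical manifest that pairs each VCF chunk with a regions file of the loci it covers: the loop below prints one `dx run swiss-army-knife` call per chunk. The folder paths, the manifest layout, the `.tbi` index name and the instance type are all placeholders or assumptions, not confirmed RAP values, and the `run` helper only prints the command unless `DRY_RUN=0` is set.

```shell
# Dry-run helper: prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Hypothetical manifest: VCF chunk name <TAB> regions file for that chunk.
printf 'ukb24310_c22_b784_v1.vcf.gz\tloci_c22_b784.tsv\n' > manifest.tsv

VCF_DIR="/path/to/pvcf"    # placeholder project folder, not a real RAP path
LOCI_DIR="/path/to/loci"   # placeholder folder holding uploaded regions files
TAB="$(printf '\t')"

while IFS="$TAB" read -r vcf regions; do
  # Inside the job, inputs are referenced by basename (assumes the app
  # stages them into its working directory).
  cmd="bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' -R $regions $vcf > ${vcf%.vcf.gz}.info.tsv"
  run dx run swiss-army-knife \
    -iin="$VCF_DIR/$vcf" -iin="$VCF_DIR/$vcf.tbi" -iin="$LOCI_DIR/$regions" \
    -icmd="$cmd" --instance-type mem1_ssd1_v2_x2 -y --brief
done < manifest.tsv > jobs.txt
```

With thousands of chunks it may be cheaper to pass several chunks to each job rather than launching one job per chunk; the right grouping depends on file sizes and instance type.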
Hope this helps
George