Optimising speed of extracting specific loci from WGS

Edited 26 November 2024 10:46
1 comment

I am trying to extract information for c. 500,000 loci scattered throughout the genome (essentially, allele frequencies for the whole UK population at specific sites). That means I need to access thousands of VCFs, because the data are chunked by region. (I have an index file of the VCFs for this purpose, which may be helpful to others and which I can share.) The information I need is in the INFO column of the DRAGEN VCF files.

I have been experimenting with the DNAnexus ttyd app.

Retrieving each variant is so painfully slow that I am wondering whether anyone else has a better approach.

I have tried the following, which takes about a minute per variant. It relies on dxfuse mounting; I also tried downloading the VCF to the worker, and it was no quicker.

```
# Ensure bcftools is installed on the worker.
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' --regions-overlap 0 -i 'TYPE="snp"' -r chr22:15690182,chr22:15690294,chr22:15690376,chr22:15690406,chr22:15690425 "$DRAGENPATH"/ukb24310_c22_b784_v1.vcf.gz
```
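One variation on the command above, offered only as a sketch and not benchmarked: collect every target position that falls in a given chunk into a sorted regions file and pass it with `-R`, so each VCF is opened once per chunk rather than once per handful of loci. The file names `loci.txt`, `regions.tsv` and `chunk_info.tsv` are hypothetical, and the positions are simply the five from the command above. `-R` needs the VCF's `.tbi` or `.csi` index to be available next to the file.

```shell
# Hypothetical input: one "CHROM<TAB>POS" pair per line (1-based positions).
printf 'chr22\t15690406\nchr22\t15690182\nchr22\t15690294\nchr22\t15690425\nchr22\t15690376\n' > loci.txt

# Sort by chromosome, then numerically by position, so regions are in file order.
sort -k1,1 -k2,2n loci.txt > regions.tsv

# Query all positions for this chunk in a single bcftools call.
# Skipped if bcftools or the VCF is not available in the current environment.
VCF="$DRAGENPATH/ukb24310_c22_b784_v1.vcf.gz"
if command -v bcftools >/dev/null 2>&1 && [ -f "$VCF" ]; then
  bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' \
    --regions-overlap 0 -i 'TYPE="snp"' \
    -R regions.tsv "$VCF" > chunk_info.tsv
fi
```

Whether this is faster than listing positions with `-r` will depend on how many loci land in each chunk and on dxfuse read performance, so it is worth timing on a single chunk first.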

I also tried plink2, which is marginally faster:

```
plink2 --vcf "$DRAGENPATH/ukb24310_c22_b784_v1.vcf.gz" --extract bed1 dummybed.bed --freq --snps-only --out freq
```

Am I missing a trick here somewhere?

NB: I know that there is a UKB allele frequency browser, but as far as I can tell there is no API or downloadable file that would let me query thousands of loci.

Comments


George F UKB Community team Data Analyst
- 07 May 2025 11:09
Hi Gabriel,
The Swiss Army Knife app contains plink2 and bcftools. You can use it to create a series of individual jobs or batch jobs, and it can be launched either interactively or from the command line.
Please see the following for more information: https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-data/accessing-bulk-data#analyzing-files-with-swiss-army-knife and https://www.youtube.com/watch?v=vJHzfqrDaFw
Hope this helps
George
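
A minimal sketch of the batch approach suggested above, assuming one Swiss Army Knife job per VCF chunk. It only writes the `dx run` commands to a script for review and submits nothing. `VCF_DIR`, `chunks.txt`, `submit_jobs.sh` and the per-chunk `<vcf>.regions.tsv` files are hypothetical names, and the presence of a `.tbi` index beside each VCF is an assumption. The `TYPE="snp"` filter from the question is left out to keep the quoting simple.

```shell
# Hypothetical chunk list: one VCF file name per line (this name is from the question).
printf 'ukb24310_c22_b784_v1.vcf.gz\n' > chunks.txt

# Hypothetical project folder holding the VCFs, indexes and regions files; replace with the real path.
VCF_DIR="/path/to/dragen_vcfs"

# Start with an empty script, then append one dx run command per chunk.
: > submit_jobs.sh
while read -r vcf; do
  # Command executed inside the job; inputs are staged in the job's working directory.
  cmd="bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' -R ${vcf}.regions.tsv ${vcf} > ${vcf}.info.tsv"
  printf 'dx run app-swiss-army-knife -iin="%s/%s" -iin="%s/%s.tbi" -iin="%s/%s.regions.tsv" -icmd="%s" -y --brief\n' \
    "$VCF_DIR" "$vcf" "$VCF_DIR" "$vcf" "$VCF_DIR" "$vcf" "$cmd" >> submit_jobs.sh
done < chunks.txt
```

After checking the generated lines, running `sh submit_jobs.sh` from a session logged in with `dx` would submit the jobs; instance type and output destination are left at their defaults here.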
