Optimising speed of extracting specific loci from WGS

Gabriel Doctor

I am trying to get information for c. 500,000 loci scattered throughout the genome (basically, allele frequencies for the whole UK population at specific sites). That means I need to access thousands of VCFs, each covering one chunked region. (I have an index file of VCFs for this purpose, which may be helpful to others and which I can share.) The information I need is in the INFO column of the DRAGEN VCF files.
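As a rough sketch of how such an index could be used to work out which VCF each locus falls in: the index layout below (CHROM, START, END, VCF name, tab-separated) and the function name `assign_chunks` are assumptions made for illustration, not the actual format of the index file mentioned above.

```shell
# assign_chunks INDEX_TSV LOCI_TSV
# INDEX_TSV (hypothetical layout): CHROM <TAB> START <TAB> END <TAB> VCF_NAME
# LOCI_TSV:                        CHROM <TAB> POS   (1-based)
# Prints VCF_NAME <TAB> CHROM <TAB> POS for every locus that falls inside a chunk.
assign_chunks() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR {                       # first file: store chunk intervals per chromosome
      k = ++n[$1]
      lo[$1, k] = $2; hi[$1, k] = $3; vcf[$1, k] = $4
      next
    }
    {                                 # second file: linear scan over that chromosome only
      for (i = 1; i <= n[$1]; i++)
        if ($2 + 0 >= lo[$1, i] + 0 && $2 + 0 <= hi[$1, i] + 0) {
          print vcf[$1, i], $1, $2
          break
        }
    }
  ' "$1" "$2"
}
```

Sorting the output by its first column groups the loci by VCF, so that each file only needs to be opened once. The scan is linear per chromosome, which is simple but not the fastest option for very large indexes.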

I have been experimenting with the DNAnexus ttyd app.

It is so painfully slow to retrieve each variant that I am wondering whether anyone else has a better approach.

I have tried the following, which takes about a minute per variant. It relies on dxfuse mounting; I also tried downloading the VCF to the worker and it was no quicker.

```
# Ensure bcftools is installed on the worker.
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' --regions-overlap 0 -i 'TYPE="snp"' -r chr22:15690182,chr22:15690294,chr22:15690376,chr22:15690406,chr22:15690425 "$DRAGENPATH"/ukb24310_c22_b784_v1.vcf.gz
```
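If the time is dominated by per-call overhead (opening the VCF and its index over dxfuse) rather than by the lookup itself, one thing that may be worth trying is passing all loci for a given VCF in a single call through a regions file (`-R`) instead of a long `-r` list. The following is only a sketch under that assumption: `make_regions`, `query_loci`, and the one-`chr:pos`-per-line input format are hypothetical, and whether this helps on a dxfuse mount is untested.

```shell
# make_regions LOCI_FILE
# Input (hypothetical format): one "chr22:15690182"-style locus per line.
# Output: tab-delimited CHROM <TAB> POS, sorted by chromosome then position,
# with duplicate loci removed. bcftools reads the two-column form as 1-based positions.
make_regions() {
  tr ':' '\t' < "$1" | LC_ALL=C sort -u -k1,1 -k2,2n
}

# query_loci REGIONS_FILE VCF
# Same query as above, but with every locus for this VCF supplied at once via -R.
query_loci() {
  bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' \
    --regions-overlap 0 -i 'TYPE="snp"' -R "$1" "$2"
}
```

Usage would be along the lines of `make_regions loci.txt > regions.tsv` followed by `query_loci regions.tsv "$DRAGENPATH"/ukb24310_c22_b784_v1.vcf.gz`, where `loci.txt` is a placeholder file name.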

I also tried plink2, which is marginally faster:

```
plink2 --vcf "$DRAGENPATH/ukb24310_c22_b784_v1.vcf.gz" --extract bed1 dummybed.bed --freq --snps-only --out freq
```

Am I missing a trick here somewhere? 
 

NB: I know that there is a UKB allele frequency browser, but as far as I can tell it offers no API or downloadable file for querying thousands of loci.

 
