Hi, I am trying to run a frequency report using PLINK2, as in the following example for the chr22 (.bed / .bim / .fam) files:
plink2 --bfile ukb24305_c22_b0_v1 --freq counts --snps-only --no-psam-pheno --out "ukbb_chr22"
Some common variants seem to be missing from the output. For example, variants such as 22-15688736-C-T (gnomAD AF: 0.375467) and 22-16597059-G-A (gnomAD AF: 0.578828) are frequent enough in all gnomAD populations that they should also appear in UKBB; however, they do not appear in the output of plink2's frequency report.
Doing something similar with the BGEN files gives the same set of variants, so the same ones are missing from that output too:
qctool -g "$in_name" -snp-stats -snp-stats-columns allele-frequencies -osnp "${in_prefix}.txt"
Have the pVCF / PLINK / BGEN files been QC-ed in a way that removes some of these variants, or could this be an issue with the tools or my setup?
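One way to narrow this down is to look the positions up directly in the .bim file, which shows whether the variants are absent from the input fileset itself rather than dropped by a filter such as --snps-only. A minimal sketch, assuming the standard six-column .bim layout (chromosome, ID, cM, position, allele 1, allele 2); the function name `bim_lookup` is made up for illustration.

```shell
# Print every .bim record at a given chromosome and base-pair position.
# Usage: bim_lookup <file.bim> <chrom> <pos>
bim_lookup() {
  awk -v chrom="$2" -v pos="$3" '
    {
      c = $1
      sub(/^chr/, "", c)        # accept both "22" and "chr22" chromosome codes
    }
    c == chrom && $4 == pos     # no action given, so matching lines are printed
  ' "$1"
}
```

For example, `bim_lookup ukb24305_c22_b0_v1.bim 22 15688736` would print any records at that position. Empty output would suggest the variant is not in the PLINK fileset at all, so no choice of plink2 flags would bring it back.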
Comments
A potential answer on Twitter, for anyone who is interested:
https://twitter.com/nickywhiffin/status/1712118878501777898
Hello,
Have you found the best way of calculating allele frequencies for a list of variants from the whole-population pVCFs?
BW
Emanuele
Using bcftools query with the Swiss Army Knife app is probably your best bet. You could save time and money by filtering to just the regions you are interested in (the pVCFs are sharded into 50 kbp chunks).
I did it for the whole genome using multiple array jobs; it took about 4 days and cost about 80 GBP, so not cheap, but worth doing once. Working around this is a big hassle, but I haven't found anything better.
The pVCFs contain multi-allelic variants, which are missing from the PLINK and BGEN files.
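As a starting point, here is a rough sketch of the bcftools approach for a single variant on one pVCF shard. Several things are assumptions rather than facts about the UKB files: that bcftools with the fill-tags plugin is available, that the shard is bgzipped and indexed, and that contig names carry a `chr` prefix. The function names and the `shard.vcf.gz` path are placeholders for illustration.

```shell
# Turn a gnomAD-style ID (chrom-pos-ref-alt) into a bcftools region string.
# The optional second argument is the contig prefix (assumed "chr" by default;
# pass "" if the contigs are named "22" rather than "chr22").
variant_to_region() {
  _chrom=${1%%-*}               # text before the first "-"
  _rest=${1#*-}                 # text after the first "-"
  _pos=${_rest%%-*}             # position is the second field
  printf '%s%s:%s-%s\n' "${2-chr}" "$_chrom" "$_pos" "$_pos"
}

# Print CHROM, POS, REF, ALT, AC, AN, AF for all records overlapping the
# variant's position in one indexed shard.
# Usage: variant_af <shard.vcf.gz> <chrom-pos-ref-alt>
variant_af() {
  _region=$(variant_to_region "$2")
  bcftools view -Ou -r "$_region" "$1" |
    bcftools norm -Ou -m -any |                 # split multi-allelic records
    bcftools +fill-tags -Ou -- -t AN,AC,AF |    # recompute tags from genotypes
    bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AN\t%AF\n'
}
```

For example, `variant_af shard.vcf.gz 22-15688736-C-T` would list one line per alternate allele at that position. Splitting multi-allelic records first matters here because those are the sites reported as missing from the PLINK and BGEN files.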
Hi Elston, many thanks! @Chai Fungtammasan, could these results be shared for everyone's benefit, so that we don't each have to run these kinds of analyses multiple times?
The best way is to return this to UKB as a "return of results" and ask whether UKB can release it on RAP. These data are pseudonymized and protected by the MTA, so users cannot share them with each other unless they are on the same research application.