Hello,
During our research on chromosome 6, we came across many mutations with a high percentage of unknowns (i.e., "./."), for example 6:32664883:T:G and 6:32521980:CT:C, with 55% and 48% unknowns, respectively.
In fact, ~30% of the mutations that are relevant to our study have more than 20% unknowns!
Such a high percentage of unknowns is both strange (from an aligner algorithm point of view) and, more importantly, introduces a high level of noise into our model. Therefore, we would like to investigate this phenomenon further.
Any ideas?
Best wishes,
Eran
Comments
Hi Eran, could you give a bit more detail?
In particular, which data field are you using as input?
Also, what tools are you using to access the data? What percentage of unknowns would you expect? Is there a similar issue if you look at a different chromosome?
(Please don't post any participant-level data though).
You might find relevant information in the related Resources or Category Description or Field Notes for the data field on Showcase, https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=100314
Hi Rachael,
Thanks for replying.
Here is reproduction code (## are comments):
## within a Cloud Workstation instance, type:
## download chr6 plink files (ukb23158_c6_b0_v1.bed, ukb23158_c6_b0_v1.bim, and ukb23158_c6_b0_v1.fam),
## from path: 'Bulk' -> 'Exome sequences' -> 'Population level exome OQFE variants, PLINK format - final release'
dx download file-G98bV40JykJxP3Fv36j7jg9x
dx download file-G98Y96QJykJjZVZY3x12Z3qX
dx download file-GXvbJJjJ2BjK3ZbXz9q4ZyZp
## create a VCF file for mutation 6:32664883:T:G
echo "6:32664883:T:G" > muts.bim
./plink --bfile ukb23158_c6_b0_v1 --extract muts.bim --keep-allele-order --recode vcf --out data
## look at the mutation line (transpose it to one sample per line for easier counting)
tail -n1 data.vcf | tr "\t" "\n" | tail -n+10 > data.vcf.trans
## count overall entries (== 469,835)
wc -l data.vcf.trans
## count overall unknowns (i.e., "./.") (== 259,754)
grep "\./\." data.vcf.trans | wc -l
## %unknowns == 259,754 / 469,835 = 55%
Repeating the same steps for mutation 6:32521980:CT:C gives 48% unknowns.
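For convenience, the counting steps above can be folded into a single pass. This is only a minimal sketch: `missing_fraction` is a hypothetical helper name (not part of plink), and it assumes a tab-separated VCF whose last line is the one variant of interest, as produced by the `--recode vcf` command above.

```shell
## Sketch only: missing_fraction is a hypothetical helper, not a plink feature.
## Assumes the last line of the VCF is the variant of interest and that
## columns 1-9 are the fixed VCF fields, so samples start at column 10.
missing_fraction() {
  tail -n1 "$1" | awk -F'\t' '{
    missing = 0
    total = NF - 9
    for (i = 10; i <= NF; i++) if ($i == "./.") missing++
    ## prints: <unknown count> <sample count> <fraction unknown>
    printf "%d %d %.4f\n", missing, total, (total > 0 ? missing / total : 0)
  }'
}
## usage: missing_fraction data.vcf
```

Only exact "./." entries are counted, which matches the grep above for a plain GT-only VCF; it prints aggregate counts only, so no participant-level data is exposed.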
I would expect the data to contain some unknowns, but missing calls for roughly 50% of subjects seems to indicate a serious problem in the aligner used.
Best wishes,
Eran