Discrepancies between 200K and 270K WES datasets and with WGS data

21 October 2024 10:52
1 comment

It looks like at least some exome genotype calls are very inaccurate indeed.

Below I paste the genotype counts for two variants: first in the first 200K exomes and in the WGS data for the same subjects, then in the second 270K exomes and in the corresponding WGS data. The counts were simply extracted using plink.

Things to note are the following:

Far, far more het calls are made in the 200K exomes than in the 270K exomes. The differences are huge: 264 against 5 at position 58534246, and 460 against 15 at position 58530648.

In the WGS data, even more hets are called than in the 200K exomes. Again, the differences are huge: 1029 against 264, and 1979 against 460. And these are exactly the same subjects.

The WGS results seem closer to those in e.g. dbSNP and I suspect they are more or less correct.
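
To put the four sets of counts on a common scale, the A1 allele frequency implied by each one can be computed directly from the genotype counts. The following is a minimal sketch in Python using the counts pasted further down for position 58534246. The function name `a1_frequency` is illustrative only, and the haploid and missing columns are ignored on the assumption that only called diploid genotypes matter here (the haploid counts are all zero).

```python
def a1_frequency(hom_a1, het, hom_a2):
    """Frequency of allele A1 among called diploid genotypes."""
    called = hom_a1 + het + hom_a2
    return (2 * hom_a1 + het) / (2 * called)

# (C(HOM A1), C(HET), C(HOM A2)) copied from the .frqx rows for chr1:58534246
counts = {
    "200K WES": (1, 264, 200112),
    "200K WGS": (1, 1029, 198169),
    "270K WES": (0, 5, 269126),
    "270K WGS": (2, 1333, 267347),
}

freqs = {name: a1_frequency(*c) for name, c in counts.items()}
for name, f in freqs.items():
    print(f"{name}: A1 frequency = {f:.2e}")
```

On these counts the two WGS frequencies agree closely with each other (about 0.0026 and 0.0025), whereas the two WES frequencies differ from each other by well over an order of magnitude.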

I checked, and broadly speaking the subjects called as hets in the 200K exomes are also hets in the WGS data. So the WES calls are not unrelated to the WGS calls; the problem is that WGS calls far more subjects as hets.

I think the WGS data is probably about right, which means that the WES results are just completely wrong.

I have no idea how widespread this problem is. I looked at these two loci because they are critical to a potentially very interesting result.

It would be straightforward to examine systematically how well the WES calls match the WGS calls, but I am not resourced to do this. Nor do I regard it as my responsibility.
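
As a rough illustration of what such a systematic check could look like, the sketch below cross-tabulates WES against WGS genotype calls for a single variant. It reports overall concordance and the fraction of WGS non-reference calls that WES also calls non-reference. It assumes per-sample genotypes have already been exported as alternate-allele counts (0, 1, 2, or None for missing) keyed by sample ID. The function name, the input layout and the toy data are all hypothetical, not anything provided by UK Biobank or plink.

```python
from collections import Counter

def compare_calls(wes, wgs):
    """Cross-tabulate WES vs WGS genotypes for samples called in both.

    wes, wgs: dict mapping sample ID -> alt-allele count (0/1/2) or None.
    Returns (table, concordance, nonref_sensitivity).
    """
    table = Counter()
    for sample in wes.keys() & wgs.keys():
        g_wes, g_wgs = wes[sample], wgs[sample]
        if g_wes is None or g_wgs is None:
            continue  # skip samples missing in either dataset
        table[(g_wes, g_wgs)] += 1

    total = sum(table.values())
    agree = sum(n for (a, b), n in table.items() if a == b)
    wgs_nonref = sum(n for (a, b), n in table.items() if b > 0)
    both_nonref = sum(n for (a, b), n in table.items() if a > 0 and b > 0)

    concordance = agree / total if total else float("nan")
    sensitivity = both_nonref / wgs_nonref if wgs_nonref else float("nan")
    return table, concordance, sensitivity

# Toy data (hypothetical sample IDs): WES misses two of the three WGS hets,
# and s5 has no WES call so it is excluded from the comparison.
wes = {"s1": 0, "s2": 1, "s3": 0, "s4": 0, "s5": None, "s6": 0}
wgs = {"s1": 0, "s2": 1, "s3": 1, "s4": 1, "s5": 0, "s6": 0}

table, concordance, sensitivity = compare_calls(wes, wgs)
print(dict(table))
print(f"concordance={concordance:.2f}, non-ref sensitivity={sensitivity:.2f}")
```

For a rare variant, overall concordance is dominated by the homozygous-reference calls, so the non-reference figure is the more informative of the two for variants like those discussed here.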

What is the overall concordance between WES and WGS calls? Is the WES data actually very unreliable, such that people should only use the WGS data? Obviously, the results obtained to date from the WES data suggest that it is not completely random, but the discrepancies I am highlighting are far larger than I would expect. Have I just landed on a couple of outliers, with all the other data being fine? What could possibly cause discrepancies of this magnitude?

The relevant results are pasted below.

File                            CHR  SNP                A1  A2  C(HOM A1)  C(HET)  C(HOM A2)  C(HAP A1)  C(HAP A2)  C(MISSING)
200K.OnRAP.WES.1.58534246.frqx  1    chr1_58534246_T_C  C   T   1          264     200112     0          0          24
200K.OnRAP.WGS.1.58534246.frqx  1    .                  C   T   1          1029    198169     0          0          0
270K.OnRAP.WES.1.58534246.frqx  1    chr1_58534246_T_C  C   T   0          5       269126     0          0          38
270K.OnRAP.WGS.1.58534246.frqx  1    .                  C   T   2          1333    267347     0          0          1

File                            CHR  SNP                A1  A2  C(HOM A1)  C(HET)  C(HOM A2)  C(HAP A1)  C(HAP A2)  C(MISSING)
200K.OnRAP.WES.1.58530648.frqx  1    chr1_58530648_C_A  A   C   1          460     199844     0          0          96
200K.OnRAP.WGS.1.58530648.frqx  1    .                  A   C   5          1979    197215     0          0          0
270K.OnRAP.WES.1.58530648.frqx  1    chr1_58530648_C_A  A   C   0          15      268977     0          0          177
270K.OnRAP.WGS.1.58530648.frqx  1    .                  A   C   8          2681    265994     0          0          0

Comments


Lora B UKB Community team Data Analyst
- 21 October 2024 13:52
- Official comment
Thank you for highlighting potential issues with some variants. The results from both the WES and WGS methods underwent QC by the providers and the consortia generating the data. This included assessing concordance with genotype calls, and any samples failing QC were not included in the final dataset. The WES and WGS methodologies differ significantly and may result in different levels of coverage and sequence quality across different genomic regions, so some differences may be expected. That said, since you raised this with us last month, we have scheduled some time within our bioinformatics team for a further review. We will post a further comment once that review has been conducted.
