Expected format for 500k WGS variants?

10 November 2022 00:00
5 comments

I see that the public release of the 500k WGS data is expected in late 2023 (https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/future-data-release-timelines). Does anyone know what format the variant calls will be provided in? The 150k WGS compressed VCF files are 328 TB, and the 200k WGS compressed VCF files are 586 TB. With n^2 scaling, the variant calls alone for 500k WGS will be ~3.6 PB, which is nearly unusable. Will a more efficient storage format be provided? (The scaling problem occurs because each new sample adds both new rows, for its private variants, and a new column, yet the vast majority of the variant x sample matrix is ref/ref genotypes.)
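For anyone who wants to check the extrapolation, here is a rough sketch of the arithmetic. It assumes size grows as a simple power of the sample count, which is only an approximation; the function names are made up for illustration.

```python
import math


def fitted_exponent(n_a, size_a, n_b, size_b):
    """Exponent k such that size scales as n**k between two releases."""
    return math.log(size_b / size_a) / math.log(n_b / n_a)


def extrapolate_size_tb(known_n, known_tb, target_n, exponent=2.0):
    """Scale a known size to a new sample count, assuming size ~ n**exponent."""
    return known_tb * (target_n / known_n) ** exponent


# Figures quoted in the post: 150k -> 328 TB, 200k -> 586 TB.
k = fitted_exponent(150_000, 328, 200_000, 586)

# Sanity check: predicting the 200k size from the 150k size with n^2 scaling.
size_200k_tb = extrapolate_size_tb(150_000, 328, 200_000)

# Extrapolating the 200k size to 500k samples with n^2 scaling.
size_500k_tb = extrapolate_size_tb(200_000, 586, 500_000)

print(f"fitted exponent: {k:.2f}")
print(f"predicted 200k size: {size_200k_tb:.0f} TB")
print(f"predicted 500k size: {size_500k_tb / 1000:.2f} PB")
```

The two published sizes give a fitted exponent very close to 2, which is why n^2 scaling looks like a reasonable working assumption here.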

Comments

Chai Fungtammasan DNAnexus Team
- 10 November 2022 17:20
I would need to ask UKB, and they might need to negotiate with the data provider. Do you think that either PLINK or BGEN would be good enough for this case? That's what we have for WES.

Former User of DNAx Community_47
- 11 November 2022 11:01
Some people will probably want those formats, which will at least be smaller than VCF.
For my own purposes, Parquet would be preferable, since it would allow querying the data more efficiently and/or running on Spark.

There are potentially other sparse matrix formats that could be provided.
Internally I am converting to a format that lists sample IDs that are het/hom/missing for each variant. This avoids storing ref/ref genotypes, and the resulting files are about 250x smaller. (And the savings would be larger as the sample size grows.)
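
To make the idea concrete, here is a minimal sketch of that kind of sparse encoding. It is not the actual internal format, just an illustration: the genotype codes (0 = ref/ref, 1 = het, 2 = hom-alt, None = missing), the dictionary layout, and the function names are all assumptions.

```python
from typing import Dict, List, Optional

# Assumed genotype codes: 0 = ref/ref, 1 = het, 2 = hom-alt, None = missing.
Genotype = Optional[int]


def encode_sparse(sample_ids: List[str], genotypes: List[Genotype]) -> Dict[str, List[str]]:
    """Keep only the sample IDs whose genotype is not ref/ref, for one variant."""
    sparse: Dict[str, List[str]] = {"het": [], "hom": [], "missing": []}
    for sample_id, gt in zip(sample_ids, genotypes):
        if gt is None:
            sparse["missing"].append(sample_id)
        elif gt == 1:
            sparse["het"].append(sample_id)
        elif gt == 2:
            sparse["hom"].append(sample_id)
        # gt == 0 (ref/ref) is implied by absence and never stored.
    return sparse


def decode_sparse(sample_ids: List[str], sparse: Dict[str, List[str]]) -> List[Genotype]:
    """Rebuild the dense genotype row; samples not listed are ref/ref."""
    lookup: Dict[str, Genotype] = {}
    for sample_id in sparse["het"]:
        lookup[sample_id] = 1
    for sample_id in sparse["hom"]:
        lookup[sample_id] = 2
    for sample_id in sparse["missing"]:
        lookup[sample_id] = None
    return [lookup.get(sample_id, 0) for sample_id in sample_ids]


sample_ids = ["S1", "S2", "S3", "S4", "S5", "S6"]
dense_row = [0, 0, 1, 0, None, 2]

encoded = encode_sparse(sample_ids, dense_row)
roundtrip = decode_sparse(sample_ids, encoded)
print(encoded)
print(roundtrip == dense_row)
```

Because ref/ref is implied by absence, the storage per variant depends on the number of carriers and missing calls rather than on the total number of samples.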

Chai Fungtammasan DNAnexus Team
- 14 November 2022 16:25
Let me discuss this with the product team and UKB to see what is feasible on their end.

Former User of DNAx Community_47
- 15 November 2022 15:50
Thank you! I would be interested to hear any comments the team has on the subject.

Chai Fungtammasan DNAnexus Team
- 15 November 2022 18:38
From what we know so far, it will be in PLINK2 format, and there will be further discussion about whether BGEN or PLINK will also be available. I expect that at minimum it will be pVCF and PLINK2.
I have forwarded your feedback about storing data in Spark to the DNAnexus product team. This has to be done by DNAnexus rather than UKB or the data provider.
