I see that the public release of the 500k WGS data is expected in late 2023 (https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/future-data-release-timelines).
Does anyone know in what format the variant calls will be provided?
The 150k WGS compressed VCF files are 328 TB.
The 200k WGS compressed VCF files are 586 TB.
With n^2 scaling, the variant calls alone for the 500k WGS release will be ~3.6 PB, which is nearly unusable. Will a more efficient storage format be provided?
(The scaling problem occurs because each new sample adds both new rows, for its private variants, and a new column. But the vast majority of the variant x sample matrix is ref/ref genotypes.)
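As a rough check of that extrapolation, here is a minimal sketch that assumes size grows as a power of the sample count. The two sizes are the ones quoted above; the function and variable names are purely illustrative.

```python
import math

# Compressed VCF sizes quoted in the post, in terabytes.
N_150K, SIZE_150K_TB = 150_000, 328
N_200K, SIZE_200K_TB = 200_000, 586


def extrapolate_tb(known_n, known_tb, target_n, exponent=2.0):
    """Scale a known size to a new sample count, assuming size ~ n**exponent."""
    return known_tb * (target_n / known_n) ** exponent


# Exponent implied by the two observed releases (assumes a pure power law).
implied_exponent = math.log(SIZE_200K_TB / SIZE_150K_TB) / math.log(N_200K / N_150K)

# Does the 150k size predict the 200k size under n^2 scaling?
predicted_200k_tb = extrapolate_tb(N_150K, SIZE_150K_TB, N_200K)

# Extrapolate from the 200k release to 500k samples.
predicted_500k_tb = extrapolate_tb(N_200K, SIZE_200K_TB, 500_000)

print(f"implied exponent: {implied_exponent:.2f}")
print(f"predicted 200k size: {predicted_200k_tb:.0f} TB (observed {SIZE_200K_TB} TB)")
print(f"predicted 500k size: {predicted_500k_tb / 1000:.2f} PB")
```

The two observed sizes imply an exponent very close to 2, which is why quadratic scaling is a reasonable working assumption here. With only two data points, it is still just an assumption.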
Comments
I would need to ask UKB, and they might need to negotiate with the data provider. Do you think that either PLINK or BGEN would be good enough for this case? That's what we have for WES.
Some people will probably want those formats, which will at least be smaller than VCF.
For my own purposes Parquet would be preferable, since it would enable more efficient querying of the data and/or running on Spark.
There are potentially other sparse matrix formats that could be provided.
Internally I am converting to a format that lists sample IDs that are het/hom/missing for each variant. This avoids storing ref/ref genotypes, and the resulting files are about 250x smaller. (And the savings would be larger as the sample size grows.)
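To illustrate the idea, here is a minimal sketch of such a sparse per-variant layout. It is not the actual converter described above. The 0/1/2/None genotype encoding, the function names, and the sample IDs are all assumptions made for the example.

```python
def to_sparse(sample_ids, genotypes):
    """Convert one variant's dense genotype row into lists of sample IDs.

    Assumed encoding: 0 = ref/ref, 1 = het, 2 = hom-alt, None = missing.
    Ref/ref samples are not stored; they are implied by absence.
    """
    sparse = {"het": [], "hom": [], "missing": []}
    for sample_id, genotype in zip(sample_ids, genotypes):
        if genotype is None:
            sparse["missing"].append(sample_id)
        elif genotype == 1:
            sparse["het"].append(sample_id)
        elif genotype == 2:
            sparse["hom"].append(sample_id)
    return sparse


def to_dense(sample_ids, sparse):
    """Rebuild the dense genotype row from the sparse lists."""
    lookup = {sample_id: 1 for sample_id in sparse["het"]}
    lookup.update({sample_id: 2 for sample_id in sparse["hom"]})
    lookup.update({sample_id: None for sample_id in sparse["missing"]})
    # Any sample not listed is ref/ref.
    return [lookup.get(sample_id, 0) for sample_id in sample_ids]


# Hypothetical example: a rare variant across eight samples.
sample_ids = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
dense_row = [0, 0, 1, 0, None, 0, 2, 0]

sparse_row = to_sparse(sample_ids, dense_row)
print(sparse_row)
```

For a rare variant, the stored lists stay short no matter how many samples are added. The dense row grows by one entry per sample, which is where the savings come from.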
Let me discuss this with the product team and UKB to see what is feasible on their end.
Thank you! I would be interested to hear any comments the team has on the subject.
From what we know so far, it will be in PLINK2, but there will be more discussion about whether BGEN or PLINK will also be available. I expect that at minimum it will be pVCF and PLINK2.
I have forwarded your feedback about storing data in Spark to the DNAnexus product team. This has to be done by DNAnexus rather than UKB or the data provider.