Excessive cost of storing plink-format imputation files on RAP
Hello,
As it will soon no longer be allowed for UKB imputation data to be held locally, my team and I are starting to move our analyses surrounding imputation data to the RAP. We use plink heavily in these analyses. However, the imputation files (whether hg19, TOPMed or GEL) are only available in .bgen format on the RAP, which is a format plink must first convert to its pfile format (pgen/psam/pvar) to be able to use.
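For readers unfamiliar with the conversion step, here is a minimal sketch of how a one-off BGEN-to-pfile conversion could be scripted. The file names are hypothetical placeholders, and the `ref-first` modifier is an assumption about allele order that should be checked against the dataset's documentation. The script only builds and prints the plink2 command; it does not run it.

```python
import shlex


def bgen_to_pfile_cmd(bgen_path, sample_path, out_prefix, ref_mode="ref-first"):
    """Build a plink2 command converting a BGEN file to pgen/psam/pvar.

    ref_mode is an assumption about which allele is the reference;
    plink2 also accepts 'ref-last' and 'ref-unknown'.
    """
    return [
        "plink2",
        "--bgen", bgen_path, ref_mode,
        "--sample", sample_path,
        "--make-pgen",
        "--out", out_prefix,
    ]


# Hypothetical file names, for illustration only.
cmd = bgen_to_pfile_cmd("chr22.bgen", "chr22.sample", "chr22_pfile")
print(shlex.join(cmd))
```

The printed command could then be pasted into a RAP job, or passed to `subprocess.run(cmd, check=True)` on a machine where plink2 is installed.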
For the larger chromosomes, this conversion can take a few hours - something we would prefer not to have to do every time we use plink. I have tried creating pfile-format imputation files from the .bgen files and storing them in our RAP storage space, but this takes about 5 TB (I'm storing both the hg19 and hg38 versions), which costs about 80 GBP a month. That feels like a lot given this is essentially raw UKB data.
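To put those figures in perspective, the following rough calculation uses only the approximate numbers quoted above (about 5 TB and about 80 GBP a month). It is not an official RAP price, just the implied rate.

```python
# Approximate figures from the post above; not an official price list.
storage_tb = 5.0
monthly_cost_gbp = 80.0

cost_per_tb_month = monthly_cost_gbp / storage_tb  # implied GBP per TB per month
annual_cost_gbp = monthly_cost_gbp * 12            # cost of keeping the files for a year

print(f"Implied rate: {cost_per_tb_month:.2f} GBP per TB per month")
print(f"Annual cost:  {annual_cost_gbp:.2f} GBP")
```

So each research group keeping its own converted copy would pay on the order of 960 GBP a year for storage alone.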
Would it be possible for UKB to create such files and make them available as part of Bulk data? I feel like this problem is likely to affect numerous users, and since such a solution already exists for the WES data (which is available under Bulk as .bgen, .vcf and pfile format), I figure this is feasible.
Thanks!
Elby
Comments
Dear Elby,
This is not currently part of our future timelines, but it is something we will continue to consider in our future prioritisation.
In the meantime, please see if you are eligible for any of the credits available with UK Biobank: https://www.ukbiobank.ac.uk/use-our-data/fees/financial-support/
Thanks
George
I second this suggestion.
Plink is one of the most widely used genetics tools out there, and Plink's .bed format is widely supported by other genetics tools as well. Only making the data available in BGEN format (which will often have to be converted before use) is going to lead to a lot of researchers having to spend a lot of time and money making and storing their own converted versions of these files.
Please consider making a .bed version of the imputed genotypes available on RAP. This would be a tremendous help to many researchers who are going through the transition to RAP right now.
We will also be storing plink format versions of the imputed genotypes - so this will be an ongoing issue for us as well.
There is a plink version of the 500k WGS data available now in pgen format. If it makes sense for your project, you could migrate the analyses from the imputed data to the WGS plink datasets.
Good suggestion, but those files are much larger due to containing a lot of rare variants that aren't necessary for a lot of analyses. You can filter by allele frequency, but you still have to read all of the input data to do so, so it's still going to slow jobs down.
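As an illustration of the allele-frequency filtering mentioned above, here is a minimal sketch that builds a plink2 command to write a common-variant subset once, so later jobs can read the smaller files. The file prefixes and the 0.01 threshold are hypothetical, and the filtered copy would itself take up storage.

```python
def maf_filter_cmd(in_prefix, out_prefix, min_maf=0.01):
    """Build a plink2 command that keeps variants with MAF >= min_maf.

    plink2 still has to read the whole input once to compute frequencies.
    """
    return [
        "plink2",
        "--pfile", in_prefix,
        "--maf", str(min_maf),
        "--make-pgen",
        "--out", out_prefix,
    ]


# Hypothetical prefixes, for illustration only.
cmd = maf_filter_cmd("wgs_chr22", "wgs_chr22_common")
print(" ".join(cmd))
```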
The sequencing data is also GRCh38, while the old imputed data is GRCh37, so that might cause some unexpected hiccups for researchers who have to switch to RAP mid-project.