How to store large genotype/phenotype data on the RAP that will be queried within custom code?

20 May 2022 00:00
1 comment

We have to annotate all UKB WES and WGS variants with VEP and various other features, such as pathogenicity scores from dbNSFP, and organize the annotations and carrier information in a database that can be queried quickly from our custom code (Python and R) running on the RAP. Similarly, we want to process all the UKB phenotypic data (e.g. correcting for covariates) and store the results in a database format that can be queried quickly from the same code. So far we have used SQLite files to store variant annotations and carrier information, but this will likely not be suitable for the WGS data. What is the best way to store data this large on the RAP so that it can be queried quickly?

Comments

Ondrej Klempir DNAnexus Team
- 13 June 2022 13:30
UKB phenotypic data is stored on the RAP as a Dataset, which can be queried quickly using Spark, Spark SQL, or the dxdata Python package. I would start with the following links:

https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-spark-to-analyze-tabular-data
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-spark-to-analyze-tabular-data#notebook-starting-code
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-spark-to-analyze-tabular-data#accessing-the-database-directly-using-sql
https://dnanexus.gitbook.io/uk-biobank-rap/frequently-asked-questions#databases-and-datasets
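
To give a rough idea of what querying with Spark SQL from custom Python code can look like, here is a minimal sketch. It is not taken from the linked documentation. The database, table, and column names (`my_database.participant`, `eid`, `p31`) are placeholders, and the helper functions `build_query` and `run_query` are hypothetical; substitute the names that exist in your own project. It assumes the code runs in an environment where PySpark is available, such as a Spark-enabled notebook.

```python
import re
from typing import Optional, Sequence

# Accept only plain identifiers, optionally qualified as "database.table",
# so that names can be interpolated into the SQL text safely.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def build_query(table: str, columns: Sequence[str], limit: Optional[int] = None) -> str:
    """Build a simple SELECT statement for Spark SQL (hypothetical helper)."""
    for name in [table, *columns]:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    query = f"SELECT {', '.join(columns)} FROM {table}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


def run_query(query: str):
    """Run the query on the Spark cluster and return a Spark DataFrame."""
    # Imported here so the query builder above also works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    return spark.sql(query)


# Placeholder names: replace with the database, table, and columns in your project.
example_query = build_query("my_database.participant", ["eid", "p31"], limit=10)
```

Calling `run_query(example_query).toPandas()` would then bring the (small) result into a pandas DataFrame for further processing. Keeping the limit low, or filtering in SQL first, avoids pulling large tables onto the driver node.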
