How to extract specific SNPs from data?
Hi! I'm looking for help to correct my understanding of the data available in Biobank. Please correct me if I'm wrong:
- There exists no whole genome sequencing data, only a panel of SNPs (full list available at https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=1963)
- These SNPs are available for all participants, regardless of whether they were in the UK BiLEVE or UK Biobank cohorts.
My question is then: how do I extract the genotypes for participants in the UK Biobank for further analysis in a Jupyter Notebook? I have already compiled a list of rsIDs I'm interested in, e.g., ["rs8176749", "rs8176746", …]. I also understand that I can filter the participants to generate a smaller cohort (e.g. all males aged 40 at the time of the study) and save this smaller cohort as a .dataset, e.g., "target_cohort.dataset".
Now, how can I get the genotypes at the rsIDs in my list for the participants in my target cohort? I have access to the bulk data folder, but I have no clue how to access the data I'm interested in. Any help would be appreciated, thanks!
Comments
Hi Jun Yuan,
UK Biobank has Whole Genome Sequencing (WGS), Whole Exome Sequencing (WES), and the older genotyping data.
Please see these articles:
https://community.ukbiobank.ac.uk/hc/en-gb/articles/23472796568861-What-types-of-data-are-available-in-UK-Biobank
https://community.ukbiobank.ac.uk/hc/en-gb/articles/15468228735133-Whole-exome-sequencing-data
https://community.ukbiobank.ac.uk/hc/en-gb/articles/15468170680605-Genome-wide-genotyping
https://community.ukbiobank.ac.uk/hc/en-gb/articles/15468060583709-Whole-genome-sequencing-data
WES and WGS data are available for research projects that are Tier 3. Genotyping data is cost Tier 1, so it is available for all projects (Tier 1, 2 or 3).
There is ~95% overlap between the ~800,000 markers on the UK BiLEVE Axiom array and those on the UK Biobank Axiom array; see resource 146640. A full list of the markers in each of the two panels can be found in resource 149600 and resource 149601.
The Showcase Genomic search feature may be useful for searching the genotyping data for particular rsIDs or genomic regions of interest. In addition, there are imputed genotype and phased haplotype values; see category 100319.
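As a quick way to see which of your rsIDs are on which array, the check can be sketched in Python as below. This is only an illustration: it assumes you have already loaded the two marker lists into plain Python lists of rsID strings (the actual layout of the resource files is not described here), and the function name and the placeholder IDs are hypothetical.

```python
def classify_rsids(rsids, bileve_markers, axiom_markers):
    """Sort rsIDs by which of the two marker lists contain them."""
    bileve = set(bileve_markers)
    axiom = set(axiom_markers)
    result = {"both": [], "bileve_only": [], "axiom_only": [], "neither": []}
    for rsid in rsids:
        in_bileve = rsid in bileve
        in_axiom = rsid in axiom
        if in_bileve and in_axiom:
            key = "both"
        elif in_bileve:
            key = "bileve_only"
        elif in_axiom:
            key = "axiom_only"
        else:
            key = "neither"
        result[key].append(rsid)
    return result


# Placeholder IDs, not real markers
result = classify_rsids(
    ["rsA", "rsB", "rsC", "rsD"],
    bileve_markers=["rsA", "rsB"],
    axiom_markers=["rsA", "rsC"],
)
print(result)
```

Anything that lands in "neither" is not directly genotyped on either array, so it would have to come from the imputed or sequencing data instead.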
If the SNPs you are interested in are included in the older genotyping data, you can use the UKB JupyterLab notebook in the SNP-filtering repository to extract the data you need. See https://github.com/UK-Biobank/SNP-filtering on GitHub. The same repository also contains an applet for the same purpose.
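To give a rough idea of what such an extraction involves, here is a minimal Python sketch. It is not the code from the SNP-filtering repository. It assumes that PLINK 2 is installed, that the genotype calls are available as a PLINK bed/bim/fam fileset, and that the participant IDs of your cohort have been exported to a text file in a format PLINK accepts. The helper function and all file names are hypothetical.

```python
from pathlib import Path


def build_plink2_command(bfile_prefix, rsids, keep_file, out_prefix, workdir="."):
    """Write the rsID list to disk and return a PLINK 2 command as an argument list."""
    snp_file = Path(workdir) / "snps_of_interest.txt"
    # --extract reads one variant ID per line
    snp_file.write_text("\n".join(rsids) + "\n")
    return [
        "plink2",
        "--bfile", str(bfile_prefix),   # prefix of the .bed/.bim/.fam fileset
        "--extract", str(snp_file),     # keep only the listed variants
        "--keep", str(keep_file),       # keep only the listed participants
        "--export", "A",                # write per-sample allele counts to <out>.raw
        "--out", str(out_prefix),
    ]


rsids = ["rs8176749", "rs8176746"]
cmd = build_plink2_command(
    bfile_prefix="genotype_calls_chrN",   # hypothetical fileset prefix
    rsids=rsids,
    keep_file="target_cohort_ids.txt",    # hypothetical ID file for the cohort
    out_prefix="selected_snps",
)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

The genotype calls are usually split by chromosome, so you would run this once per chromosome that contains SNPs from your list and then combine the resulting .raw files in your notebook.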
These related community posts may also be useful, and searching the community for “SNP” will turn up more:
https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/18669657313437-How-do-I-extract-allele-combinations-at-specific-SNPs-using-Jupyterlab
https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/23773622473501-Availability-of-APOE-genotypes
https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/26983237384477-SNP-filtering-for-genomic-regions
https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/24709398255133-SNP-filtering-error-message
https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/26206858779037-SNP-availability-for-known-genes
If the SNPs you need are not in the genotyping data, they are likely to be present in the WES data. This can be accessed through the genomics tab of the Cohort Browser.
Note that the genotyping data uses the GRCh37 reference, whereas the WES and WGS data use GRCh38, so genomic positions are not directly comparable between them.