How do I extract allele combinations at specific SNPs using Jupyterlab?

06 May 2024 11:44
10 comments

Hello,

I have defined particular cohorts from my dataset based on phenotypic data in JupyterLab. However, I would now like to add columns of data detailing which alleles are present at 2 SNPs (rs7412 and rs429358) to determine which gene variants of a particular gene (APOE) each participant has, similar to this paper: https://pubmed.ncbi.nlm.nih.gov/32818802/

How may I go about doing this?

Thanks in advance

Comments

10 comments

Rachael W UKB Community team Data Analyst
- 07 May 2024 07:48
There is a notebook to help with that, see https://github.com/UK-Biobank/SNP-filtering .
To use it, take a copy into your RAP project storage area, for example by clicking Code, select Download ZIP to get it to your PC, unzip to find SNP-extraction_test.ipynb and ReadMe, open RAP, navigate to required folder, click Add - Upload Data, select file SNP-extraction_test.ipynb.
Follow instructions in the ReadMe and run the first four cells. To run the install in the first cell, launch a Terminal and enter the conda install bioconda::plink command there.

Both SNPs are on chromosome 19, so replace [14X] with [1][9].
Enter required rsIDs in the fourth cell.
Ignore the warning about only one item to merge.
The result will be a file called snp_ind_plink_results.raw in your project storage top level folder. Close kernel, shut tabs, DNAnexus End Session, Terminate, close jupyterlab tab.
Preview what is in snp_ind_plink_results.raw, by selecting the file and choosing preview from the more-actions.
Notice that the first few hundred rows don't have valid EIDs. This is because they correspond to withdrawn participants, so ignore them. All other rows have the EID in both the FID and IID columns. Ignore PAT and MAT which will be 0 as UKB data doesn't hold that info. If you like you can use the SEX column to check that it matches with the sex of the participants in your phenotype file. Ignore column PHENOTYPE which is all -9 because we didn't tell PLINK any pheno information. The last two columns will be rs429358_C and rs7412_T. A participant with a 2 in column 7412_T has TT at that position, a participant with a 0 has CC, and a participant with a 1 has TC. There are some NA rows.
If you don't already know what the Ref allele is (C for rs7412), you can find it at https://biobank.ndph.ox.ac.uk/showcase/gsearch.cgi .
Read the snp_ind_plink_results.raw file into your RStudio session, select the useful columns, and inner_join it to your pheno data by EID.

0
Rachael W UKB Community team Data Analyst
- 09 May 2024 08:51
Additional note for future reference:
The above only works for SNPs that are present in the Genotyping data, not for variants that are solely in the Whole Exome or Whole Genome Sequencing data.

0
Junling Gao
- 06 July 2024 05:34
Dear Rachael W
What means 0，1，2 for rs429358_C? THANK YOU!

0
Rachael W UKB Community team Data Analyst
- 07 July 2024 15:08
Hi Junling,
see the Genomics Search in Showcase at https://biobank.ndph.ox.ac.uk/showcase/gsearch.cgi, enter the rsid, and get:
This says the the Reference allele is T and the Alternative allele is C.
A participant with 0 for rs429358_C has 0 “C” alleles, so they must be TT
A participant with 1 for rs429358_C has 1 “C” allele, so they must be CT
A participant with 2 for rs429358_C has 2 “C” alleles, so they must be CC .

0
Junling Gao
- 08 July 2024 02:17
Dear Rachael,
Thank you for your promptly reply! I've got it.

0
Ian Meneghel Danilevicz
- 28 April 2025 09:16
Hello Rachael,
Thank you for sharing a link (https://github.com/UK-Biobank/SNP-filtering ) which has many tips on how to extract SNPs or chromosome positions. However, if I am interested in chr positions, what should be my file genomic_regions.txt ? I mean it should contain the correspond regions (example 1:182600492), how should I fill the txt file?
Example:
1:182600492,2:66523433,…
or
1:182600492
2:66523433
…
or
chr1:182600492,chr2:66523433,…

0
Rachael W UKB Community team Data Analyst
- 29 April 2025 07:27
Hi Ian,
I don't know the answer to that, as I have only ever tried it with rs ids.
I suggest you start a new forum thread (click the New Post button) with “SNP filtering for genomic regions” in the title, to see if any researchers can help.

0
Jacob Reiter
- 29 June 2025 20:32
Hi Rachel - I see you said this is only for genotyping data, not for variants that are solely in the Whole Exome or Whole Genome Sequencing data – I'm trying to get rs25531 , rs17291388, and several otherfs – none of which come up in the Genomic Search showcase tool. How would I get the SNP data for these? Does that github jupyter notebook not work then?

0
Rachael W UKB Community team Data Analyst
- Edited 14 September 2025 13:27
Hi Jacob,
the SNP-filtering github notebook does not work for the Whole Exome or Whole Genome Sequencing data.
To work with the Whole Exome Sequence data, it is possible to use the UKB-RAP Cohort Browser Genomic Search, [now renamed as “Germline Variants”] to extract lists of eids that are homozygous or heterozygous for a variant at a particular position.
To work with the Whole Genome Sequence data, I think you would need to use some of the tools in the Swiss Army Knife in the Tools tab. If you don't have a bioinformatics background, it might be challenging (I don't know how to do it).
Note that the WES and WGS use GRCh38, whereas the Genotyping uses the old GRCh37.

0
Rachael W UKB Community team Data Analyst
- 01 October 2025 18:41
This repo might be useful (I haven't tested it yet):
https://github.com/lcpilling/ukbrapR

0

Please sign in to leave a comment.