How do I extract allele combinations at specific SNPs using JupyterLab?
Hello,
I have defined particular cohorts from my dataset based on phenotypic data in JupyterLab. However, I would now like to add columns of data detailing which alleles are present at 2 SNPs (rs7412 and rs429358) to determine which gene variants of a particular gene (APOE) each participant has, similar to this paper: https://pubmed.ncbi.nlm.nih.gov/32818802/
How may I go about doing this?
Thanks in advance
Comments
There is a notebook to help with that, see https://github.com/UK-Biobank/SNP-filtering .
To use it, copy it into your RAP project storage area. For example: on GitHub, click Code and select Download ZIP to get it onto your PC, then unzip it to find SNP-extraction_test.ipynb and the ReadMe. Open RAP, navigate to the required folder, click Add - Upload Data, and select the file SNP-extraction_test.ipynb.
Follow the instructions in the ReadMe and run the first four cells. To run the install from the first cell, launch a Terminal and enter the command conda install bioconda::plink there.
Both SNPs are on chromosome 19, so replace [14X] with [1][9].
Enter required rsIDs in the fourth cell.
Ignore the warning about only one item to merge.
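On the chromosome pattern: the reason for writing [1][9] rather than [19] can be shown with a small sketch. It assumes the bracket expression in the notebook behaves like a shell-style character class, where each pair of brackets matches exactly one character. The file names below are made up for illustration and are not the real UKB file names.

```python
from fnmatch import fnmatch

# Hypothetical per-chromosome file names, for illustration only.
names = [f"chr{c}.bed" for c in ["1", "4", "9", "19", "X"]]

def matching(pattern):
    """Return the names that match a shell-style pattern."""
    return [n for n in names if fnmatch(n, pattern)]

original = matching("chr[14X].bed")  # one character: 1, 4 or X
wrong = matching("chr[19].bed")      # one character: 1 or 9, so not 19
right = matching("chr[1][9].bed")    # a 1 followed by a 9, so only 19

print(original)
print(wrong)
print(right)
```

So [14X] picks up chromosomes 1, 4 and X, [19] would pick up chromosomes 1 and 9, and [1][9] picks up chromosome 19 only.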
The result will be a file called snp_ind_plink_results.raw in the top-level folder of your project storage. When you have finished, close the kernel, shut the notebook tabs, select DNAnexus End Session, then Terminate, and close the JupyterLab tab.
To preview what is in snp_ind_plink_results.raw, select the file and choose Preview from the more-actions menu.
Notice that the first few hundred rows don't have valid EIDs. This is because they correspond to withdrawn participants, so ignore them. All other rows have the EID in both the FID and IID columns. Ignore PAT and MAT, which will be 0 because UKB data doesn't hold that information. If you like, you can use the SEX column to check that it matches the sex of the participants in your phenotype file. Ignore the PHENOTYPE column, which is all -9 because we didn't give PLINK any phenotype information.
The last two columns will be rs429358_C and rs7412_T. Each holds the number of copies of the named allele. A participant with a 2 in column rs7412_T has TT at that position, a participant with a 0 has CC, and a participant with a 1 has TC. Some rows contain NA instead of a count.
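If you would rather decode the counts in Python than by eye, here is a minimal sketch. It assumes the .raw file is whitespace-delimited with the header described above. The sample rows and EIDs are made up, and decode and read_raw are just illustrative helper names, not part of the notebook.

```python
import io

# Made-up rows in the layout described above; real files are much larger.
SAMPLE_RAW = """FID IID PAT MAT SEX PHENOTYPE rs429358_C rs7412_T
1000011 1000011 0 0 1 -9 0 0
1000022 1000022 0 0 2 -9 1 2
1000033 1000033 0 0 2 -9 NA 1
"""

# (counted allele, other allele) for each dosage column, as given in the thread.
ALLELES = {"rs429358_C": ("C", "T"), "rs7412_T": ("T", "C")}

def decode(dosage, counted, other):
    """Turn a 0/1/2 count of the counted allele into a genotype string."""
    if dosage not in ("0", "1", "2"):
        return None  # NA or anything unexpected
    n = int(dosage)
    return counted * n + other * (2 - n)

def read_raw(handle):
    """Yield (eid, {column: genotype}) for each data row."""
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        yield row["IID"], {
            col: decode(row[col], counted, other)
            for col, (counted, other) in ALLELES.items()
        }

# With a real file, replace io.StringIO(...) with open("snp_ind_plink_results.raw").
genotypes = dict(read_raw(io.StringIO(SAMPLE_RAW)))
for eid, calls in genotypes.items():
    print(eid, calls)
```

Rows with NA come back as None, so you can decide later whether to drop them.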
If you don't already know what the Ref allele is (C for rs7412), you can find it at https://biobank.ndph.ox.ac.uk/showcase/gsearch.cgi .
Read the snp_ind_plink_results.raw file into your RStudio session, select the useful columns, and inner_join it to your pheno data by EID.
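The join above is described for RStudio. If you stay in a Python notebook instead, the same inner join can be sketched with plain dictionaries keyed by EID. The EIDs, the age field and the values below are all made up for illustration.

```python
# Hypothetical phenotype data keyed by EID.
pheno = {
    "1000011": {"age": 61},
    "1000022": {"age": 57},
    "1000044": {"age": 49},  # no genotype row, so dropped by the join
}

# Hypothetical allele counts keyed by EID, as read from the .raw file.
geno = {
    "1000011": {"rs429358_C": 0, "rs7412_T": 0},
    "1000022": {"rs429358_C": 1, "rs7412_T": 0},
    "1000055": {"rs429358_C": 2, "rs7412_T": 0},  # not in the cohort, so dropped
}

# Inner join: keep only EIDs present on both sides and combine their columns.
merged = {
    eid: {**pheno[eid], **geno[eid]}
    for eid in pheno
    if eid in geno
}

for eid, row in merged.items():
    print(eid, row)
```

Make sure the EIDs have the same type on both sides (all strings or all integers), otherwise nothing will match. In pandas the equivalent is DataFrame.merge with how="inner".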
Additional note for future reference:
The above only works for SNPs that are present in the Genotyping data, not for variants that are solely in the Whole Exome or Whole Genome Sequencing data.
Dear Rachael W
What do 0, 1 and 2 mean for rs429358_C? Thank you!
Hi Junling,
See the Genomics Search in Showcase at https://biobank.ndph.ox.ac.uk/showcase/gsearch.cgi and enter the rsID.
The result says that the Reference allele is T and the Alternative allele is C.
A participant with 0 for rs429358_C has 0 “C” alleles, so they must be TT.
A participant with 1 for rs429358_C has 1 “C” allele, so they must be CT.
A participant with 2 for rs429358_C has 2 “C” alleles, so they must be CC.
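To get from the two counts to the APOE genotype asked about in the original question, a lookup table is enough. This is a sketch that assumes the usual APOE haplotype definitions, which are not stated in this thread: e2 is T at rs429358 with T at rs7412, e3 is T with C, and e4 is C with C. The function name apoe_genotype is just an illustrative choice.

```python
# Keys are (rs429358_C count, rs7412_T count), as found in the .raw file.
# Assumes the usual haplotype definitions: e2 = T/T, e3 = T/C, e4 = C/C
# at rs429358/rs7412 respectively.
APOE = {
    (0, 0): "e3/e3",
    (0, 1): "e2/e3",
    (0, 2): "e2/e2",
    (1, 0): "e3/e4",
    (2, 0): "e4/e4",
    (1, 1): "e2/e4",  # unphased data; e1/e3 gives the same counts but is very rare
}

def apoe_genotype(c_count, t_count):
    """Return the APOE genotype label, or None if it cannot be assigned."""
    try:
        key = (int(c_count), int(t_count))
    except (TypeError, ValueError):
        return None  # NA or missing count
    # Combinations not in the table would need the rare e1 haplotype.
    return APOE.get(key)

print(apoe_genotype(0, 0))
print(apoe_genotype("1", "0"))
print(apoe_genotype("NA", "1"))
```

Any combination that is not in the table, or that involves NA, returns None so those participants can be reviewed or excluded.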
Dear Rachael,
Thank you for your prompt reply! I've got it.