How do I extract allele combinations at specific SNPs using Jupyterlab?
Hello,
I have defined particular cohorts from my dataset based on phenotypic data in JupyterLab. However, I would now like to add columns of data detailing which alleles are present at 2 SNPs (rs7412 and rs429358) to determine which gene variants of a particular gene (APOE) each participant has, similar to this paper: https://pubmed.ncbi.nlm.nih.gov/32818802/
How may I go about doing this?
Thanks in advance
Comments
10 comments
There is a notebook to help with that, see https://github.com/UK-Biobank/SNP-filtering .
To use it, take a copy into your RAP project storage area, for example by clicking Code, select Download ZIP to get it to your PC, unzip to find SNP-extraction_test.ipynb and ReadMe, open RAP, navigate to required folder, click Add - Upload Data, select file SNP-extraction_test.ipynb.
Follow instructions in the ReadMe and run the first four cells. To run the install in the first cell, launch a Terminal and enter the conda install bioconda::plink command there.
Both SNPs are on chromosome 19, so replace [14X] with [1][9].
Enter required rsIDs in the fourth cell.
Ignore the warning about only one item to merge.
The result will be a file called snp_ind_plink_results.raw in your project storage top level folder. Close kernel, shut tabs, DNAnexus End Session, Terminate, close jupyterlab tab.
Preview what is in snp_ind_plink_results.raw, by selecting the file and choosing preview from the more-actions.
Notice that the first few hundred rows don't have valid EIDs. This is because they correspond to withdrawn participants, so ignore them. All other rows have the EID in both the FID and IID columns. Ignore PAT and MAT which will be 0 as UKB data doesn't hold that info. If you like you can use the SEX column to check that it matches with the sex of the participants in your phenotype file. Ignore column PHENOTYPE which is all -9 because we didn't tell PLINK any pheno information. The last two columns will be rs429358_C and rs7412_T. A participant with a 2 in column 7412_T has TT at that position, a participant with a 0 has CC, and a participant with a 1 has TC. There are some NA rows.
If you don't already know what the Ref allele is (C for rs7412), you can find it at https://biobank.ndph.ox.ac.uk/showcase/gsearch.cgi .
Read the snp_ind_plink_results.raw file into your RStudio session, select the useful columns, and inner_join it to your pheno data by EID.
Additional note for future reference:
The above only works for SNPs that are present in the Genotyping data, not for variants that are solely in the Whole Exome or Whole Genome Sequencing data.
Dear Rachael W
What means 0,1,2 for rs429358_C? THANK YOU!
Hi Junling,
see the Genomics Search in Showcase at https://biobank.ndph.ox.ac.uk/showcase/gsearch.cgi, enter the rsid, and get:
This says the the Reference allele is T and the Alternative allele is C.
A participant with 0 for rs429358_C has 0 “C” alleles, so they must be TT
A participant with 1 for rs429358_C has 1 “C” allele, so they must be CT
A participant with 2 for rs429358_C has 2 “C” alleles, so they must be CC .
Dear Rachael,
Thank you for your promptly reply! I've got it.
Hello Rachael,
Thank you for sharing a link (https://github.com/UK-Biobank/SNP-filtering ) which has many tips on how to extract SNPs or chromosome positions. However, if I am interested in chr positions, what should be my file genomic_regions.txt ? I mean it should contain the correspond regions (example 1:182600492), how should I fill the txt file?
Example:
1:182600492,2:66523433,…
or
1:182600492
2:66523433
…
or
chr1:182600492,chr2:66523433,…
Hi Ian,
I don't know the answer to that, as I have only ever tried it with rs ids.
I suggest you start a new forum thread (click the New Post button) with “SNP filtering for genomic regions” in the title, to see if any researchers can help.
Hi Rachel - I see you said this is only for genotyping data, not for variants that are solely in the Whole Exome or Whole Genome Sequencing data – I'm trying to get rs25531 , rs17291388, and several otherfs – none of which come up in the Genomic Search showcase tool. How would I get the SNP data for these? Does that github jupyter notebook not work then?
Hi Jacob,
the SNP-filtering github notebook does not work for the Whole Exome or Whole Genome Sequencing data.
To work with the Whole Exome Sequence data, it is possible to use the UKB-RAP Cohort Browser Genomic Search, [now renamed as “Germline Variants”] to extract lists of eids that are homozygous or heterozygous for a variant at a particular position.
To work with the Whole Genome Sequence data, I think you would need to use some of the tools in the Swiss Army Knife in the Tools tab. If you don't have a bioinformatics background, it might be challenging (I don't know how to do it).
Note that the WES and WGS use GRCh38, whereas the Genotyping uses the old GRCh37.
This repo might be useful (I haven't tested it yet):
https://github.com/lcpilling/ukbrapR
Please sign in to leave a comment.