SNP filtering for genomic regions
Hi all,
I am following a tutorial developed by Oliver Gray (many thanks, Oliver) (https://github.com/UK-Biobank/SNP-filtering), which explains how to obtain a list of SNPs. However, not all SNPs have an rs ID, so I am trying to work out how to select SNPs by their chromosome position instead. I believe the same tutorial covers this, but it does not explain how to fill in the external file (genomic_regions.txt) used to select the SNPs. For example, if I am interested in two SNPs, one on chromosome 1 and the other on chromosome 2, such as 1:182600492 and 2:66523433, how should I fill in the txt file?
Example 1:
1:182600492
2:66523433
Example 2:
chr1:182600492,chr2:66523433
Example 3:
Something else.
Thanks for your help.
Ian
PS: This discussion started on a previous post (https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/18669657313437-How-do-I-extract-allele-combinations-at-specific-SNPs-using-Jupyterlab), but Rachel suggested that I start a new thread, as the topic has shifted a little.
Comments
Hi Ian,
Thank you for reaching out. We are working on enhancing the documentation and tool for filtering SNPs in the genotyping data.
If you want to filter the data by genomic position, the genomic_regions.txt file should have the following structure:
1 30000000 35000000 R1
4 60000000 62000000 R2
The text file should contain four tab-separated columns: chromosome (e.g. 1, 15, X), region start (in base pair coordinates), region end (also in base pair coordinates), and a user-selected identifier for the region. Each region of interest should be on a separate line. Hope this helps.
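For anyone who prefers to generate the file rather than type it by hand, here is a minimal Python sketch of the four-column layout described above. The `write_regions` helper and the start/end check are illustrative assumptions and not part of the tutorial; the two regions are the ones from the example.

```python
import csv

# Example regions from the reply above: (chromosome, start_bp, end_bp, identifier)
regions = [
    ("1", 30000000, 35000000, "R1"),
    ("4", 60000000, 62000000, "R2"),
]

def write_regions(path, regions):
    """Write one tab-separated line per region (hypothetical helper)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for chrom, start, end, label in regions:
            # Sanity check added for illustration: a region must not end before it starts.
            if start > end:
                raise ValueError(f"start is after end for region {label}")
            writer.writerow([chrom, start, end, label])

write_regions("genomic_regions.txt", regions)
```

Writing the file with an explicit tab delimiter avoids the common problem of a text editor silently replacing tabs with spaces.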
Thank you for using the Community forum.
Thank you Lea!
So, for example, if I am interested in rs17400325 and rs9369062, which I can find in the GWAS Catalog, the structure would be something like this:
2 177701185 177701185 rs17400325
6 38469527 38469527 rs9369062
as the first one is on chromosome 2 and base pair location 177701185, the second is on chromosome 6 and base pair location 38469527.
Best
Hi Ian,
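That's the correct structure. However, please note that the base positions in the genotyping data are in GRCh37 coordinates, so if the positions you looked up are from a different genome build, the start and end positions in your file will need to be changed to their GRCh37 equivalents.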
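As a rough illustration of the single-SNP case, the sketch below turns a "chromosome:position" string into a region line whose start and end are the same position. The `position_to_region_line` helper is hypothetical, and stripping a leading "chr" is an assumption based on the bare chromosome names (1, 15, X) used earlier in this thread. The positions are the ones quoted above, so they would still need converting to GRCh37 before use.

```python
def position_to_region_line(position, label):
    """Turn 'chrom:pos' into a tab-separated single-base region line (hypothetical helper)."""
    chrom, pos = position.split(":")
    # Assumption: the file expects bare chromosome names, so drop any 'chr' prefix.
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    pos = int(pos)  # fails loudly if the position is not a whole number
    # Start and end are identical because the region covers a single SNP.
    return "\t".join([chrom, str(pos), str(pos), label])

# Positions as quoted in the thread; replace them with GRCh37 coordinates before use.
lines = [
    position_to_region_line("2:177701185", "rs17400325"),
    position_to_region_line("6:38469527", "rs9369062"),
]
print("\n".join(lines))
```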
Hope this helps. Thank you for using the Community Forum.