SNP filtering error message

Helena Bird

I've been following the UK-Biobank/SNP-filtering GitHub repo, which contains a JupyterLab notebook for filtering individual SNPs from the UKB genotyping data, either by individual SNP rsIDs or by genomic regions of interest.

I have got as far as the following commands:

dx download -f "snps_list.txt"
plink --bfile genotyping_merged --extract snps_list.txt --recode A --out snp_list_plink_results
dx upload snp_list_plink_results.raw

However, I then get an error message:

root@5a793ce36d85:/opt/notebooks# dx download -f "snps_list.txt"
dxpy.utils.resolver.ResolutionError: Unable to resolve "snps_list.txt" to a data object or folder name in '/'
root@5a793ce36d85:/opt/notebooks# plink --bfile genotyping_merged --extract snps_list.txt --recode A --out snp_list_plink_results
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to snp_list_plink_results.log.
Options in effect:
 --bfile genotyping_merged
 --extract snps_list.txt
 --out snp_list_plink_results
 --recode A

15614 MB RAM detected; reserving 7807 MB for main workspace.
47443 variants loaded from .bim file.
488377 people (223323 males, 264582 females, 472 ambiguous) loaded from .fam.
Ambiguous sex IDs written to snp_list_plink_results.nosex .
Error: Failed to open snps_list.txt.

Please can you advise how I can move forward?

Comments

7 comments

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    Hi Helena,

    If the first four cells in the notebook have given you the SNP data you want, then there is no need to run this fifth cell.

    This fifth cell is for people who want to provide their list of required SNPs as a file, instead of entering the rsIDs within the plink command as was done in cell four. The error appears because dx download cannot find a snps_list.txt file in your project, so plink then has nothing to open. You could create a snps_list.txt file in the top level of your main project storage if you want to try it. I think you would need one rsID per line, but I haven't tested that. If you do create a snps_list.txt file, it will probably need to have Unix line endings, not Windows line endings.

    This forum post about the SNP notebook and this forum post about line endings might be helpful.
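
As a rough illustration of the file described above, here is a minimal Python sketch that writes one rsID per line with Unix line endings, plus a helper that converts an existing file from Windows line endings. The function names and the rsIDs are placeholders invented for this example, and the one-rsID-per-line layout is the untested assumption mentioned above.

```python
from pathlib import Path


def write_snp_list(rsids, path="snps_list.txt"):
    """Write one rsID per line using Unix (LF) line endings."""
    cleaned = [r.strip() for r in rsids if r.strip()]
    # newline="\n" stops Python translating "\n" into "\r\n" on Windows.
    with open(path, "w", newline="\n") as handle:
        handle.write("\n".join(cleaned) + "\n")
    return Path(path)


def to_unix_line_endings(path):
    """Rewrite an existing file in place, replacing CRLF with LF."""
    data = Path(path).read_bytes()
    Path(path).write_bytes(data.replace(b"\r\n", b"\n"))


# Placeholder rsIDs for illustration only; substitute your own list.
example_rsids = ["rs0000001", "rs0000002", "rs0000003"]
snp_file = write_snp_list(example_rsids)
```

If the file is created inside the JupyterLab session, in the folder where plink is run, plink should be able to open it directly. Running dx upload snps_list.txt afterwards would put a copy in project storage so that the dx download line in the cell can find it.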

  • Comment author
    Helena Bird

    Thank you Rachael, that makes sense.

    Hopefully a quick follow-up question: in the 'forum post about the SNP notebook', the last line says:

    Read the snp_ind_plink_results.raw file into your RStudio session, select the useful columns, and inner_join it to your pheno data by EID

    Is there a tutorial on how to make the inner_join with pheno data? I've previously used Table Exporter to create a file with the biochemistry results I require and diagnosis, but it's a CSV file. The SNP file is a .raw file. Is there a tutorial on how to join the two different files by EID, please?

    Thank you

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    Hi Helena

    Once you have created the file, it isn't essential to use RStudio. You could use a JupyterLab session with either Python or R if you prefer. I can't remember why the other forum post mentions RStudio; probably because that researcher had specified they were using it. I would recommend a JupyterLab session, because that is what I am used to and because you don't risk leaving it running indefinitely.

    If I were doing this, I would start a JupyterLab session from the Tools tab, open a $_ terminal, use the dx download command to copy each file from your main project storage into the JupyterLab storage, open an R kernel, use the fread command from the data.table package to read each file into a dataframe, then use the inner_join command from the dplyr package to combine the two dataframes.

    I'm almost sure that fread will be able to read a .raw file.  If that doesn't work, post again here and I'll have a look.   (As always, make sure you don't post any UKB data or EIDs.)

    Once the files have been read in as dataframes, it shouldn't matter what the original file format was.   If you need more information, can you tell me which bit is still unclear?
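
Since Python is mentioned above as an alternative to R, here is a minimal pandas sketch of the same read-then-join idea. It uses small made-up tables in place of the real files (no UKB data). The .raw header shown follows PLINK's usual layout of FID, IID, PAT, MAT, SEX, PHENOTYPE and then one column per SNP; the eid column name and the phenotype columns are assumptions about what the Table Exporter CSV contains.

```python
import io

import pandas as pd

# Toy stand-ins for the two files (made-up IDs and values, not UKB data).
raw_text = (
    "FID IID PAT MAT SEX PHENOTYPE rs0000001_A rs0000002_G\n"
    "1001 1001 0 0 1 -9 0 1\n"
    "1002 1002 0 0 2 -9 2 NA\n"
    "1003 1003 0 0 2 -9 1 0\n"
)
pheno_text = (
    "eid,biochem_value,diagnosis\n"
    "1001,5.1,yes\n"
    "1003,4.2,no\n"
    "1004,6.0,no\n"
)

# A .raw file is whitespace-delimited text; "NA" marks a missing genotype.
snps = pd.read_csv(io.StringIO(raw_text), sep=r"\s+")
pheno = pd.read_csv(io.StringIO(pheno_text))

# Keep the participant ID, the sex column and the SNP dosage columns.
snp_cols = [c for c in snps.columns if c.startswith("rs")]
snps = snps[["IID", "SEX"] + snp_cols].rename(columns={"IID": "eid"})

# Inner join: only participants present in both tables are kept.
merged = pheno.merge(snps, on="eid", how="inner")
```

With real files, the two io.StringIO(...) arguments would be replaced by the paths of the downloaded .raw and CSV files.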

     

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    Don't forget to save your results to a file in JupyterLab storage, using fwrite, and then copy them to your main project storage, using dx upload. The JupyterLab storage will disappear when the JupyterLab session is shut down.
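
For anyone working in Python rather than R, the save-then-upload step might look like the sketch below. The function name, file name and toy table are made up for illustration. The upload is skipped when the dx command is not on the PATH, so the snippet also runs outside a DNAnexus session.

```python
import shutil
import subprocess

import pandas as pd


def save_and_upload(df, filename):
    """Write df to a CSV in session storage, then copy it to project storage."""
    df.to_csv(filename, index=False)
    if shutil.which("dx") is None:
        # dx is not available here, so only the local file is written.
        return False
    # Equivalent to typing `dx upload <filename>` in the terminal.
    subprocess.run(["dx", "upload", filename], check=True)
    return True


# Made-up results table standing in for the joined data.
results = pd.DataFrame({"eid": [1001, 1003], "rs0000001_A": [0, 1]})
uploaded = save_and_upload(results, "merged_results.csv")
```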

  • Comment author
    Helena Bird

    Thank you so much Rachael, I will try that now.

  • Comment author
    Helena Bird

    Hi Rachael, after filtering the SNPs successfully into a .raw file, I also have a column for sex containing either 1 or 2. In the Showcase, however, the coding for sex is 0 = Female and 1 = Male. What do 1 and 2 mean, please? Is it the same order, so 1 = Female and 2 = Male?

    Many thanks, Helena

  • Comment author
    Rachael W (UKB Community team, Data Analyst)

    Hi Helena, that's a good question, but I don't know the answer, sorry. It should be possible to work it out by trying one option and then the other. You should find that all (or almost all) items will match for one option, and that all or almost all will fail to match for the other option.

    If there are a small number that don't match, it could be because a participant reported a sex that is not the same as their genetic sex, knowingly or unknowingly, or it could be that some kind of technical error has happened.

    If neither option produces almost all matches, that would suggest that there is something wrong with the process.
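
The try-both-options check described above could be sketched in Python as follows, using a made-up table in which showcase_sex is a hypothetical column holding the Showcase coding (0 = Female, 1 = Male). For reference, PLINK's file format documentation gives the sex code as 1 = male, 2 = female and 0 = unknown, so with real data the second option below would be expected to be the one that matches; the toy rows were built to reflect that, with one deliberate mismatch.

```python
import pandas as pd

# Made-up rows for illustration; real data would come from the joined table.
df = pd.DataFrame({
    "SEX": [1, 2, 2, 1, 2, 1],           # from the PLINK .raw file
    "showcase_sex": [1, 0, 0, 1, 0, 0],  # hypothetical column: 0 = Female, 1 = Male
})

# Count every combination of the two codings.
table = pd.crosstab(df["SEX"], df["showcase_sex"])


def match_rate(mapping):
    """Fraction of rows where the recoded PLINK value equals the Showcase value."""
    return (df["SEX"].map(mapping) == df["showcase_sex"]).mean()


option_a = match_rate({1: 0, 2: 1})  # tries 1 = Female, 2 = Male
option_b = match_rate({1: 1, 2: 0})  # tries 1 = Male, 2 = Female
```

Whichever option gives a match rate close to 1 is the coding in use, and any rows left over are the small number of mismatches discussed above.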
