end_to_end_gwas_phewas/gwas-phenotype-samples-qc.ipynb

Permanently deleted user

Hello everyone,

I kindly request your assistance with the code in UKB_RAP/end_to_end_gwas_phewas/gwas-phenotype-samples-qc.ipynb. I am trying to reproduce it as I am confident that it will help me with my real-world tasks.

To start, I load JupyterLab and execute the following steps:

a) run the command !git clone https://github.com/dnanexus/UKB_RAP.git

b) open UKB_RAP/end_to_end_gwas_phewas/gwas-phenotype-samples-qc.ipynb

c) when I reach the "Load dataset" cell:

cmd = [
    'dx',
    'extract_dataset',
    control_dataset,
    '--fields',
    field_names,
    '--delimiter',
    ',',
    '--output',
    'control_dictionary.csv',
]
subprocess.check_call(cmd)

Unfortunately, this cell fails with an error.

  1. Can you please advise me on how to fix it?

For your information, if I modify the cell slightly, it executes successfully. I was able to execute it using the following format:

cmd = ["dx", "extract_dataset", control_dataset, "-ddd", "--delimiter", ","]

subprocess.check_call(cmd)

(See screenshot attached.)

Is this the correct approach? Also, in this case, how can I obtain a dataframe with the phenotype data for each eid from the three files that are loaded?

Thank you in advance for your prompt response.

 

Best regards,

Alex.

Comments

11 comments

  • Comment author
    Ondrej Klempir DNAnexus Team

    Hello Alex,

     

    I was not able to reproduce the error you observed. The cell ran without problems on my side when I followed the instructions in the notebook and used the recommended instance type. Both files (the case and control dictionary CSVs) were generated correctly.

     

    Screenshot 2023-03-31 at 13.32.44 

    Please make sure you do not have any syntax errors in your code.

     

    Regarding your note that the cell executes successfully when you modify it to:

    cmd = ["dx", "extract_dataset", control_dataset, "-ddd", "--delimiter", ","]

    In my opinion, that is not the fix here. Adding "-ddd" makes dx extract_dataset generate three helper (dictionary) files. The notebook then uses these files to work out the correctly formatted names of the individual fields. As I understand it, these are two distinct dx extract_dataset calls: the "-ddd" call only produces the dictionaries, and the call in the next stage is the one that actually exports the data for the case and control cohorts.
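To illustrate how the helper files feed into the second call, here is a minimal sketch of turning field IDs into the comma-separated value passed to `--fields`. It is not the notebook's own code: the toy rows stand in for the data dictionary file, the `entity` and `name` column names and the `p<id>_i<instance>_a<array>` naming pattern are assumptions, and both function names are hypothetical.

```python
import re

# Toy stand-in for rows of the data dictionary written by the "-ddd" call.
# The "entity" and "name" keys are assumed column names, not confirmed ones.
data_dictionary = [
    {"entity": "participant", "name": "eid"},
    {"entity": "participant", "name": "p31"},
    {"entity": "participant", "name": "p2966_i0"},
    {"entity": "participant", "name": "p29660_i0"},  # other field, must not match 2966
    {"entity": "participant", "name": "p22009_a1"},
    {"entity": "participant", "name": "p22009_a2"},
]


def field_names_for_id(field_id, rows, entity="participant"):
    """Return 'entity.name' for every dictionary row that belongs to one field ID."""
    # Optional instance (_i0) and array (_a1) suffixes are allowed after the ID.
    pattern = re.compile(rf"^p{re.escape(field_id)}(_i\d+)?(_a\d+)?$")
    return [
        f"{row['entity']}.{row['name']}"
        for row in rows
        if row["entity"] == entity and pattern.match(row["name"])
    ]


def build_fields_argument(field_ids, rows):
    """Join the participant ID column and all matching columns with commas."""
    names = ["participant.eid"]
    for field_id in field_ids:
        names.extend(field_names_for_id(field_id, rows))
    return ",".join(names)


field_names = build_fields_argument(["31", "2966", "22009"], data_dictionary)
print(field_names)
```

The anchored pattern matters: a looser match on "2966" would also pick up columns of unrelated fields whose IDs merely start with the same digits.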

     

    Regarding your question "Also, in this case, how can I obtain a dataframe with the phenotype data for each eid from the three files that are loaded?": see my explanation above. The three files only describe the dataset; they do not contain the phenotype data. Once the data export step succeeds, the exported files will appear in your folder structure and you can read them with, for example, pandas.
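As a rough sketch of that last step, the snippet below reads an exported CSV into one record per eid. It uses only the standard library so it runs anywhere; the in-memory sample, its participant IDs, and the `participant.eid` column name are made-up assumptions, and `load_by_eid` is a hypothetical helper.

```python
import csv
import io


def load_by_eid(csv_file, eid_column="participant.eid"):
    """Read an exported CSV and return a dict mapping each eid to its row."""
    reader = csv.DictReader(csv_file)
    return {row[eid_column]: row for row in reader}


# Made-up data standing in for an exported file. For a real file, pass
# open("your_export.csv", newline="") instead of the StringIO object.
sample = io.StringIO(
    "participant.eid,participant.p31,participant.p21022\n"
    "1000001,0,54\n"
    "1000002,1,61\n"
)

phenotypes = load_by_eid(sample)
print(phenotypes["1000002"]["participant.p31"])
```

With pandas, the equivalent would be along the lines of `pandas.read_csv(path)` followed by `set_index` on the eid column, which gives a dataframe with one row per participant.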

     

     

  • Comment author
    Permanently deleted user

    Hello @Ondrej Klempir,

    I appreciate your response and help with the issue I am facing.

    I have followed your recommendations and tried to run the code again with the recommended configuration:

    Cluster configuration: Single Node

    Recommended instance: mem1_ssd1_v2_x36

    Feature: PYTHON_R (the recommended feature is not available to me)

    Unfortunately, the error has occurred again.

     

    Regarding syntax errors: could you please advise what I should look for, given that I cloned the code from GitHub and only changed the path to the cohorts?

    Please let me know if there are any other suggestions you have for resolving this issue. Your help is greatly appreciated.

    Screenshot 2023-03-31 at 2.43.23 PM

    Best regards,

    Alex

  • Comment author
    Ondrej Klempir DNAnexus Team

    Hi @Alex Shemy, interesting. Two things to test:

    A) Add %%time as the first line of the cell in which you observe the error. This will measure the runtime of that cell.

    https://stackoverflow.com/questions/49403536/what-does-time-mean-in-python-3

    B) What happens if you reduce the size of the input list of field IDs? For example, keep just the first 4 items of this list:

     

    field_ids = [
        '31',
        '2966',
        '22001',
        '22006',
        '22019',
        '22021',
        '21022',
        '23104',
        '20160',
        '30760',
        '30780',
        '22020',
        '22009'
    ]

    -->

    field_ids = [
        '31',
        '2966',
        '22001',
        '22006'
    ]
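Both tests can also be done without relying on cell magics. The sketch below is only an illustration: `run_and_report` is a hypothetical helper, and the failing command is a stand-in for the real dx call. It measures the runtime and keeps the command's error output whether or not the command succeeds, which `subprocess.check_call` alone does not do.

```python
import subprocess
import sys
import time


def run_and_report(cmd):
    """Run cmd and return its exit code, captured output, and runtime in seconds."""
    start = time.perf_counter()
    # No check=True here, so a non-zero exit code does not raise an exception.
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return {
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
        "seconds": elapsed,
    }


# Stand-in command that writes to stderr and exits with code 3 on purpose.
failing_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('boom'); sys.exit(3)",
]

report = run_and_report(failing_cmd)
print(report["returncode"], report["stderr"], f"{report['seconds']:.2f}s")
```

In the notebook, the same helper could be given the dx extract_dataset command list, once with the full field list and once with the names built from `field_ids[:4]`, to compare exit codes, error text, and runtimes.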

     

  • Comment author
    Permanently deleted user

    @Ondrej Klempir

    The error still persists.

    Screenshots are attached.

    There is no runtime information displayed. This is probably because the runtime is only displayed upon successful execution (as in the example with printing '1').

     

    Screenshot 2023-03-31 at 4.20.48 PM

    Screenshot 2023-03-31 at 4.21.06 PM

    Screenshot 2023-03-31 at 4.22.05 PM

    Screenshot 2023-03-31 at 4.28.03 PM

  • Comment author
    Alexandra Lee DNAnexus Team

    @Alex Shemy I was also not able to reproduce your error when I ran the notebook from top to bottom.

    However, I get the same error if the `control_dictionary.csv` file already exists and I re-run the cell containing the dx extract_dataset command:

     

    cmd = [
        'dx',
        'extract_dataset',
        control_dataset,
        '--fields',
        field_names,
        '--delimiter',
        ',',
        '--output',
        'control_dictionary.csv',
    ]
    subprocess.check_call(cmd)

     

    Is it possible that this `control_dictionary.csv` file was already created when you were trying to run the cell? Maybe you can try to clear the output files generated and re-run the notebook from the top again. In general, sometimes you'll get unexpected behavior if you jump around to different parts of a notebook instead of running the notebook from the top.
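One way to rule this out is to delete any leftover output file right before the extraction cell runs. The following is a minimal sketch under that assumption; `remove_stale_output` is a hypothetical helper, and the temporary directory only stands in for the notebook's working directory.

```python
import tempfile
from pathlib import Path


def remove_stale_output(path):
    """Delete path if it exists and return True when a file was actually removed."""
    target = Path(path)
    existed = target.exists()
    target.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
    return existed


# Demonstration in a throwaway directory instead of the real working folder.
with tempfile.TemporaryDirectory() as workdir:
    output = Path(workdir) / "control_dictionary.csv"
    output.write_text("stale contents\n")
    first = remove_stale_output(output)   # a leftover file was present
    second = remove_stale_output(output)  # nothing left to remove
    still_there = output.exists()

print(first, second, still_there)
```

Calling such a helper on 'control_dictionary.csv' at the top of the extraction cell would make the cell safe to re-run, and its return value shows whether a leftover file was in fact present.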

     

    I hope this helps

  • Comment author
    Permanently deleted user

    Hi @Alexandra Lee,

    Thank you very much for your assistance in troubleshooting the issue with the non-working example from UKB_RAP.

    I have been executing the cells sequentially, and control_dictionary.csv was not created.

    Unfortunately, I still encountered the same error every time I created a new JupyterLab instance today. I have attached some screenshots that show which files exist and which ones do not.

    Could you please provide me with the full specifications for JupyterLab? I have been using the following:

    Cluster Configuration: Single Node

    Instance Type: mem1_ssd1_v2_x36

    Feature: PYTHON_R

    Is it possible that the way the data is provided to me is different from how you are accessing it? Should I verify this?

    I am grateful for any assistance you can provide. Thank you very much in advance.

    Best regards,

    Alex.

    Screenshot 2023-04-03 at 12.12.54 PM 

    Screenshot 2023-04-03 at 12.16.01 PM

  • Comment author
    Permanently deleted user

    @Ondrej Klempir

    I'm sorry to bother you.

    May I ask if you had a chance to review my response?

    Also, could you please suggest any other possible ways to resolve the issue with the non-working example?

  • Comment author
    Permanently deleted user

    Colleagues, I am still working on debugging the non-functional code end_to_end_gwas_phewas/gwas-phenotype-samples-qc.ipynb

    Are there any requirements for the cohort from which the fields are being obtained? If so, what are these requirements?

    Is it possible to create a cohort for all 502,384 patients and obtain phenotype data for them using this method?

    cmd = [
        'dx',
        'extract_dataset',
        control_dataset,
        '--fields',
        field_names,
        '--delimiter',
        ',',
        '--output',
        'control_dictionary.csv',
    ]
    subprocess.check_call(cmd)

     

    It is unclear whether there are any specific requirements for the cohort. The field IDs specified in the 'field_ids' variable appear to be specific to the control dataset being used, so it may be necessary to consult the dataset documentation or owners to determine any specific requirements for the cohort.

    Do I need to specify when creating a cohort that I will be extracting data from the fields listed below?

    field_ids = [
        '31',
        '2966',
        '22001',
        '22006',
        '22019',
        '22021',
        '21022',
        '23104',
        '20160',
        '30760',
        '30780',
        '22020',
        '22009'
    ]

     

    Thank you for your patience. I would be grateful for your assistance.


     

    Best,

    Alex

  • Comment author
    Alexandra Lee DNAnexus Team

    Here are the steps I followed:

     

    1. Launched a JupyterLab instance with the following configuration:

     

    Cluster Configuration: Single Node

    Instance Type: mem1_ssd1_v2_x36

    Feature: PYTHON_R

     

    2. In a Python notebook, cloned the repo: `!git clone https://github.com/dnanexus/UKB_RAP.git`

    3. Navigated into the `UKB_RAP/end_to_end_gwas_phewas` folder

    4. Opened `gwas-phenotype-samples-qc.ipynb` and ran it

     

    When I follow the above steps, I am not able to replicate your error. I wonder if other folks who've worked with dx extract_dataset could help?

    @Yimei Huang @Chinonso Odebeatu

     

     

     

  • Comment author
    Permanently deleted user

    @Alexandra Lee Thank you very much for the detailed response to my message.

    I'm doing the same as you, but I'm getting an error.

    Could the problem be with how UK BioBank exported the data to us? Who can help with this?

    What else can we look into?

  • Comment author
    Ondrej Klempir DNAnexus Team

    For detailed troubleshooting, you can contact ukbiobank-support@dnanexus.com.

    As a workaround, you could use the dxdata package instead of running dx extract_dataset (see https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb), or extract the data via Table Exporter.

     

