end_to_end_gwas_phewas/gwas-phenotype-samples-qc.ipynb

Permanently deleted user

30 March 2023 00:00
11 comments

Hello everyone,

I kindly request your assistance with the code in UKB_RAP/end_to_end_gwas_phewas/gwas-phenotype-samples-qc.ipynb. I am trying to reproduce it as I am confident that it will help me with my real-world tasks.

To start, I load JupyterLab and execute the following steps:

a) run the command !git clone https://github.com/dnanexus/UKB_RAP.git

b) open UKB_RAP/end_to_end_gwas_phewas/gwas-phenotype-samples-qc.ipynb

c) when I reach the "Load dataset" cell:

cmd = [
    'dx',
    'extract_dataset',
    control_dataset,
    '--fields',
    field_names,
    '--delimiter',
    ',',
    '--output',
    'control_dictionary.csv',
]
subprocess.check_call(cmd)

Unfortunately, this cell raises an error.

Can you please advise me on how to fix it?

For your information, if I modify the cell slightly, it executes successfully. I was able to execute it using the following format:

cmd = ["dx", "extract_dataset", control_dataset, "-ddd", "--delimiter", ","]

subprocess.check_call(cmd)

(See screenshot attached.)

Is this the correct approach? Also, in this case, how can I obtain a dataframe with the phenotype data for each eid from the three files that are loaded?

Thank you in advance for your prompt response.

Best regards,

Alex.

Comments


Ondrej Klempir DNAnexus Team
- 31 March 2023 11:45
Hello @Alex Shemy,

I was not able to reproduce the error you observed. It worked on my side when I followed the instructions in the notebook and used the recommended instance type. Both files (the case and control dictionary CSVs) were generated correctly.

Please make sure you do not have any syntax error in your code.

Regarding your note that the cell runs successfully if you change it to
cmd = ["dx", "extract_dataset", control_dataset, "-ddd", "--delimiter", ","]
I do not think that is the fix. Adding "-ddd" makes dx extract_dataset generate three helper files, which the notebook then uses to work out the correctly formatted names of the individual fields. As I understand it, that is a different call from the one that fails for you: in the next stage, "dx extract_dataset" is run again to export the actual data for the case and control cohorts.
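
To illustrate how the helper files feed the later export step, here is a minimal sketch of looking up full field names for one field ID from the names listed in the data dictionary. It assumes names follow the pattern p<field_id> with optional _i<instance> and _a<array> suffixes; that pattern, the function name, and the default entity name are assumptions for illustration, not code copied from the notebook.

```python
import re


def field_names_for_id(field_id, dictionary_names, entity="participant"):
    """Return entity-qualified field names that belong to one field ID.

    Assumes dictionary names look like p31, p2966_i0, or p22009_a1.
    """
    pattern = re.compile(
        r"^p{}(_i\d+)?(_a\d+)?$".format(re.escape(str(field_id)))
    )
    # Keep only names for exactly this field ID (p31 must not match p310).
    matches = sorted(name for name in dictionary_names if pattern.match(name))
    return ["{}.{}".format(entity, name) for name in matches]
```

Names collected this way for every field ID, joined with commas, would form the single string passed after '--fields'.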

Regarding your question "Also, in this case, how can I obtain a dataframe with the phenotype data for each eid from the three files that are loaded?"
--> See my explanation above. Once the data export step succeeds, the exported data will appear in the folder structure, and you can read it with, for example, pandas.
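
As a rough sketch of that last step, the exported CSV could be loaded into a dataframe indexed by participant ID. The column name "participant.eid" and the function name are assumptions here; check the header of your own export before relying on them.

```python
import pandas as pd


def load_export(csv_path, id_column="participant.eid"):
    """Read an exported CSV and index it by participant ID."""
    # Read the ID column as text so IDs are never coerced to floats.
    df = pd.read_csv(csv_path, dtype={id_column: str})
    return df.set_index(id_column)
```

Calling load_export('control_dictionary.csv') after a successful export would then give one row per eid.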

Permanently deleted user
- 31 March 2023 19:07
Hello @Ondrej Klempir,
I appreciate your response and help with the issue I am facing.
I have followed your recommendations and tried to run the code again with the recommended configuration:
Cluster configuration: Single Node
Recommended instance: mem1_ssd1_v2_x36
Feature (PYTHON_R) -- (don't have recommended feature)
Unfortunately, the error has occurred again.

Regarding syntax errors: could you please advise me on what to look for, given that I cloned the code from GitHub and only changed the path to the cohorts?
Please let me know if there are any other suggestions you have for resolving this issue. Your help is greatly appreciated.
Best regards,
Alex

Ondrej Klempir DNAnexus Team
- 31 March 2023 19:53
Hi @Alex Shemy, interesting. Two things to test:

A) Add %%time as the first line of the cell in which you are observing the error. This will measure the runtime of that cell.
https://stackoverflow.com/questions/49403536/what-does-time-mean-in-python-3

B) What happens if you reduce the size of the input list of field IDs? For example, keep just the first 4 items from this list:

field_ids = [
'31',
'2966',
'22001',
'22006',
'22019',
'22021',
'21022',
'23104',
'20160',
'30760',
'30780',
'22020',
'22009'
]

-->

field_ids = [
'31',
'2966',
'22001',
'22006'
]

Permanently deleted user
- 31 March 2023 20:32
@Ondrej Klempir
The error still persists.
Screenshots are attached.
There is no runtime information displayed. This is probably because the runtime is only displayed upon successful execution (as in the example with printing '1').
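
One way to get both the runtime and the underlying error text even when the command fails is to capture them directly instead of relying on %%time. The following is a minimal sketch using only the standard library; the function name is hypothetical.

```python
import subprocess
import time


def run_and_report(cmd):
    """Run a command and return (exit code, seconds elapsed, stderr text)."""
    start = time.perf_counter()
    # subprocess.run does not raise on a non-zero exit code, so the timing
    # and the error output are still available when the command fails.
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return result.returncode, elapsed, result.stderr
```

Passing the same cmd list used in the failing cell and printing the returned stderr text should show the message that dx itself reports.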

Alexandra Lee DNAnexus Team
- 03 April 2023 13:48
@Alex Shemy I was also not able to reproduce your error when I ran the notebook from top to bottom.

However, I get the same error if the `control_dictionary.csv` file already exists and I re-run the cell containing the dx extract_dataset command:

`cmd = [
'dx',
'extract_dataset',
control_dataset,
'--fields',
field_names,
'--delimiter',
',',
'--output',
'control_dictionary.csv',
]
subprocess.check_call(cmd)`

Is it possible that this `control_dictionary.csv` file was already created when you were trying to run the cell? Maybe you can try to clear the output files generated and re-run the notebook from the top again. In general, sometimes you'll get unexpected behavior if you jump around to different parts of a notebook instead of running the notebook from the top.

I hope this helps
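
If a leftover output file turns out to be the cause, one possible guard is to delete it before re-running the export. This is a minimal sketch; the helper name is hypothetical, and it assumes the output file is safe to overwrite.

```python
import subprocess
from pathlib import Path


def extract_fresh(cmd, output_name):
    """Delete a leftover output file, then run the extract command."""
    output_path = Path(output_name)
    # Remove the file from any previous run so the command starts clean.
    if output_path.exists():
        output_path.unlink()
    subprocess.check_call(cmd)
```

In the notebook this would be called as extract_fresh(cmd, 'control_dictionary.csv') in place of subprocess.check_call(cmd).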

Permanently deleted user
- 03 April 2023 16:28
Hi @Alexandra Lee,
Thank you very much for your assistance in troubleshooting the issue with the non-working example from UKB_RAP.
I have been executing the cells sequentially, and `control_dictionary.csv` was not created.
Unfortunately, I still encountered the same error every time I created a new JupyterLab instance today. I have attached some screenshots that show which files exist and which ones do not.
Could you please provide me with the full specifications for JupyterLab? I have been using the following:
Cluster Configuration: Single Node
Instance Type: mem1_ssd1_v2_x36
Feature: PYTHON_R
Is it possible that the way the data is provided to me is different from how you are accessing it? Should I verify this?
I am grateful for any assistance you can provide. Thank you very much in advance.
Best regards,
Alex.

Permanently deleted user
- 03 April 2023 16:35
@Ondrej Klempir
I'm sorry to bother you.
May I ask if you had a chance to review my response?
Also, could you please suggest any other possible ways to resolve the issue with the non-working example?

Permanently deleted user
- 03 April 2023 18:05
Colleagues, I am still working on debugging the non-functional notebook end_to_end_gwas_phewas/gwas-phenotype-samples-qc.ipynb.
Are there any requirements for the cohort from which the fields are being obtained? If so, what are these requirements?
Is it possible to create a cohort for all 502,384 patients and obtain phenotype data for them using this method?
cmd = [
'dx',
'extract_dataset',
control_dataset,
'--fields',
field_names,
'--delimiter',
',',
'--output',
'control_dictionary.csv',
]
subprocess.check_call(cmd)

It is unclear whether there are any specific requirements for the cohort. The field IDs specified in the 'field_ids' variable appear to be specific to the control dataset being used, so it may be necessary to consult the dataset documentation or owners to determine any specific requirements for the cohort.
Do I need to specify when creating a cohort that I will be extracting data from the fields listed below?
field_ids = [
'31',
'2966',
'22001',
'22006',
'22019',
'22021',
'21022',
'23104',
'20160',
'30760',
'30780',
'22020',
'22009'
]

Thank you for your patience. I would be grateful for your assistance.
{@005t0000006BZL2AAO}?
{@00560000001jOfvAAE}?
{@005t0000001TnmWAAS}?

Best,
Alex

Alexandra Lee DNAnexus Team
- 03 April 2023 20:16
Here are the steps I followed:

1. Launched a JupyterLab instance with the following configuration:

Cluster Configuration: Single Node
Instance Type: mem1_ssd1_v2_x36
Feature: PYTHON_R

2. In a Python notebook, clone the repo: `!git clone https://github.com/dnanexus/UKB_RAP.git`

3. Navigate into the `UKB_RAP/end_to_end_gwas_phewas` folder

4. Open `gwas-phenotype-samples-qc.ipynb` and run it

When I follow the above steps, I am not able to replicate your error. I wonder if other folks who have worked with dx extract_dataset could help?
@Yimei Huang @Chinonso Odebeatu

Permanently deleted user
- 04 April 2023 00:18
@Alexandra Lee Thank you very much for the detailed response to my message.
I am following the same steps as you, but I still get the error.
Could the problem be with how UK BioBank exported the data to us? Who can help with this?
What else can we look into?

Ondrej Klempir DNAnexus Team
- 04 April 2023 14:37
For detailed troubleshooting, you can contact ukbiobank-support@dnanexus.com.

As a workaround, you could use the dxdata package instead of running dx extract_dataset: https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb

Alternatively, you can extract the data via Table Exporter.
