Struggling with CLI
Hello!
I'm trying to get to grips with the command line interface on a Windows machine. I'm not very familiar with Python, but I've been trying to follow the instructions here on how to use the table exporter to get a file listing all the available data-fields in my dataset: https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/accessing-phenotypic-data-as-a-file
However, I've hit a problem I can't work around. dx extract_dataset needs pandas version 1.3.5 to run. Pandas 1.3.5 is not compatible with the current version of numpy, nor with any currently supported version of numpy.
The online Jupyter notebook uses numpy v1.23.5, but support for that version ended 3 weeks ago and it cannot be installed on the current release of Python. Attempting to install an earlier version of numpy leads to an “AttributeError: module 'pkgutil' has no attribute 'ImpImporter'” error; see https://stackoverflow.com/questions/77364550/attributeerror-module-pkgutil-has-no-attribute-impimporter-did-you-mean for discussion of the problem.
Using later versions of numpy gives me an incompatibility error: “ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject”.
Has anyone else had this problem? How did you resolve it?
Thank you
Comments
Hi Amy, here is one possible way to get the file dictionary:
Find your project ID. Click on the icon to the left of your project name so that a panel appears on the right with info about your project, including its ID.
Find your dataset record ID. Go into your project, look for the item with Type/Class Dataset Record, and click the icon to its left to open an info panel on the right, which shows the record ID.
Start a JupyterLab session from the Tools tab in your project on the RAP.
Open a $_ terminal.
Replace my project ID and my record ID with your own in the following command, and enter it in the terminal:
dx extract_dataset project-Ggp13kQJQ7zKy7rgwrgwRgwp:record-GgpVF50JFv1q2RGWRGW1JF6B -ddd --delimiter ","
When I do this, I get a warning message saying “For '-ddd' usage, the recommended pandas version is '1.3.5'. The installed version of pandas is 1.5.3. It is recommended to update pandas. For example, 'pip/pip3 install -I pandas==X.X.X' where X.X.X is '1.3.5'.”
However, this is only a warning, so I ignore it for now. It is possible that it only means “at least 1.3.5”, and that the system is incorrectly complaining about the more recent version number. After all, it says to “update” pandas, but if I were to install 1.3.5 instead of the current 1.5.3 it wouldn’t be an “update”.
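If you want to see which versions are actually installed in the JupyterLab session before deciding whether the warning matters, something like the following minimal sketch should do it. It uses only the Python standard library; the function name installed_versions is my own, not part of dx or pandas.

```python
import sys
from importlib import metadata


def installed_versions(packages=("pandas", "numpy")):
    """Return the Python version plus the installed version of each package.

    A package that is not installed is reported as None rather than raising.
    """
    versions = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per entry, e.g. the pandas version that the warning refers to.
for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

You can paste this into a notebook cell or run it with python3 in the terminal.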
More to the point, I also get the three output files that I need, saved in the JupyterLab Storage space.
If I right-click on one of them, and select Open, I can view it.
One of the files is a list of data codings. One is a list of Entities, which relate to the separate record-tables such as gp_scripts or hesin. The largest file (7Mb) includes a list of all the fields in the participant entity.
Save all the files you want to keep into your main project storage, using dx upload <filename>.
If you want to upload all of them, use dx upload *.csv.
If you don’t specify a folder, the default behaviour of dx upload is to save files to the top level of your main project storage. If necessary, you can then drag and drop them to a better folder.
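If you would rather send the files straight to a particular folder, dx upload accepts a --path option (check dx upload --help on your installation to confirm). Below is a rough sketch of building that command from Python. The folder name /dictionaries/ is only an illustration and is assumed to exist already in your project, and build_upload_command is a helper name I made up.

```python
import glob
import subprocess


def build_upload_command(pattern="*.csv", destination="/"):
    """Build a 'dx upload' command for every file matching pattern.

    destination is a folder in the project storage; '/' is the top level.
    """
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"No files match {pattern!r}")
    return ["dx", "upload", *files, "--path", destination]


def upload_csvs(pattern="*.csv", destination="/"):
    """Run the upload; raises CalledProcessError if dx reports a failure."""
    subprocess.run(build_upload_command(pattern, destination), check=True)


# Example (hypothetical folder name, run inside the JupyterLab session):
# upload_csvs("*.csv", "/dictionaries/")
```

Building the command as a list, rather than one long string, avoids quoting problems if a file name contains spaces.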
Check that you can see the files in your main project storage before you close the JupyterLab terminal, as they will be deleted from the JupyterLab storage when the session closes.
These three files do not include any participant data, so you can download them to your local Windows computer if you like, but you will probably want to use them within the RAP. To download a file, click on it (in the main project storage) and select Download data.
I hope this works for you.
By the way, I suggest you start with quite a small set of fields. You can always extract another set later, and then read both CSVs into your JupyterLab session.
You might find this post helpful for the next step: https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/19671290524317-How-to-extract-all-the-phenotypes-available-for-a-single-individual
Thank you so much - that sorted my issue!
For future reference:
Where I said “The largest file (7Mb) includes a list of all the fields in the participant entity.”, I should have said “The largest file (7Mb) includes a list of all the fields in all the entities, including the participant entity.”
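To see this for yourself, you can count how many fields the largest file lists for each entity. This is only a sketch: I am assuming the file has a column called entity, and the file name in the example is a placeholder, so check the header row of your own file and adjust both.

```python
import csv
from collections import Counter


def count_fields_by_entity(path, entity_column="entity"):
    """Count the rows (fields) listed for each entity in a data dictionary CSV.

    entity_column is an assumption about the header name; change it if your
    file labels the column differently.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if entity_column not in (reader.fieldnames or []):
            raise KeyError(f"No column named {entity_column!r} in {path}")
        return Counter(row[entity_column] for row in reader)


# Example (placeholder file name):
# for entity, n_fields in count_fields_by_entity("my_dataset.data_dictionary.csv").most_common():
#     print(entity, n_fields)
```

If the participant entity is only one of several entities in the output, that confirms the file covers all of them.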