Query CRAM files and collect EIDs
Hello, I am new to this platform and I am trying to get the paths to the CRAM files that correspond to a list of EIDs that I have.
The usual command is:
"dx find data --property eid=12345 --name '*.cram'"
But I have a huge list of EIDs....
Is there a way to query multiple EIDs at once?
I could do it this way:
"
while read -r e
do
  dx find data --property eid=${e} --name '*.cram' >> out
done < list_eids
"
But the above seems inefficient, since it runs one query per EID....?
I would also like to query the sex of these individuals. I tried opening a Jupyter notebook and doing it as follows:
import pyspark
import dxpy
import dxdata

# Start a Spark session
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

# Locate the dispensed database and dataset in the project
dispensed_database_name = dxpy.find_one_data_object(classname="database", name="app*", folder="/", name_mode="glob", describe=True)["describe"]["name"]
dispensed_dataset_id = dxpy.find_one_data_object(typename="Dataset", name="app*.dataset", folder="/", name_mode="glob")["id"]
dataset = dxdata.load_dataset(id=dispensed_dataset_id)
dataset.entities
participant = dataset["participant"]

def fields_for_id(field_id):
    # Return all fields (every instance and array index) for a showcase field ID
    from distutils.version import LooseVersion
    field_id = str(field_id)
    fields = participant.find_fields(name_regex=r'^p{}(_i\d+)?(_a\d+)?$'.format(field_id))
    return sorted(fields, key=lambda f: LooseVersion(f.name))

def field_names_for_id(field_id):
    # Helper that was called below but not defined in the original post
    return [f.name for f in fields_for_id(field_id)]

id_for_sex = '31'
id_for_crams = '23144'
field_names = ['eid'] + field_names_for_id(id_for_sex) + field_names_for_id(id_for_crams)
df = participant.retrieve_fields(names=field_names, engine=dxdata.connect())
df_pandas = df.toPandas()
display(df_pandas)
The above script is written on the basis that 'p31' and 'p23144' are the fields for sex and cram files:
https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23144
https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=31
However, the participant dataset that I am using does not seem to contain the field for CRAM files....
There just doesn't seem to be a way to link all of this data together...
Can anyone offer advice?
Comments
Hi,
This page of the documentation details how the bulk folders and files are structured and named: https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/working-with-ukb-data
So, within your project, I think you will need Bulk/Exome sequences/Exome OQFE CRAM files/
Then, within that folder, there are subfolders such as 10, which holds all the Exome CRAM data for participants whose EID starts with 10.
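Because the subfolder name is the first two digits of the EID, a list of EIDs can be grouped by subfolder first, so that each folder only needs to be listed once rather than once per EID. Here is a minimal Python sketch, assuming the folder layout described above; the function name group_eids_by_folder and the example EIDs are made up for illustration:

```python
from collections import defaultdict

# Folder path as quoted in this answer; adjust it if your project differs.
CRAM_ROOT = "Bulk/Exome sequences/Exome OQFE CRAM files"


def group_eids_by_folder(eids):
    """Map each two-digit subfolder path to the EIDs it should contain."""
    groups = defaultdict(list)
    for eid in eids:
        eid = str(eid).strip()
        if not eid:
            continue  # skip blank lines from the input list
        groups[f"{CRAM_ROOT}/{eid[:2]}/"].append(eid)
    return dict(groups)


# Made-up EIDs, for illustration only.
folders = group_eids_by_folder(["1000001", "1000002", "2500003"])
for folder, members in sorted(folders.items()):
    print(folder, members)
```

Each key is then a folder that could be listed once, with the results filtered down to the EIDs in its list.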
One further gotcha: it looks as if both the cram and the cram.crai files are in the same folders, and that both of them are listed AS IF their field-id is 23143 (and not 23144 as you might expect from the showcase field ids).
This is quite tricky to spot if you just browse the folders, as there are far too many files to see in the output, so you only see the most recently created, which unfortunately appears to have been the cram.crai files. In order to see that the cram files are also there, you can narrow the search, for example by specifying Name to be /NNNNNNN_23143_0_0.cram$/ where NNNNNNN is one of the EIDs that you can see in the folder.
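The same naming pattern can be applied in code to separate the .cram files from the .cram.crai index files in a folder listing. This is a minimal Python sketch, assuming the NNNNNNN_23143_0_0.cram pattern above holds for every participant; the function name crams_for_eids and the example file names are invented:

```python
import re

# Pattern from the answer: <EID>_23143_0_0.cram. The trailing $ excludes
# the .cram.crai index files that share the same folder and field ID.
CRAM_NAME = re.compile(r"^(\d+)_23143_0_0\.cram$")


def crams_for_eids(file_names, wanted_eids):
    """Return {eid: file_name} for wanted EIDs that have a CRAM in the listing."""
    wanted = {str(e) for e in wanted_eids}
    found = {}
    for name in file_names:
        match = CRAM_NAME.match(name)
        if match and match.group(1) in wanted:
            found[match.group(1)] = name
    return found


# Invented file names standing in for a folder listing.
listing = [
    "1000001_23143_0_0.cram",
    "1000001_23143_0_0.cram.crai",
    "1000002_23143_0_0.cram",
]
print(crams_for_eids(listing, ["1000001", "1000009"]))
```

Any wanted EID that is absent from the result, such as 1000009 here, has no CRAM file in that listing.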
The main Apache Parquet dataset, which needs a Spark instance to read it, holds the data that would be tabular fields in Showcase, i.e. one value per cell. This includes the Sex field. Think of it as a spreadsheet or dataframe with participants as the rows and Showcase fields as the columns. A participant's CRAM file won't fit into a cell, so it is saved separately as a file in the Bulk folder.
If you pull out just the tabular fields you need, for the participants (EIDs) you need, and save them into a CSV file, you might not need to use a Spark JupyterLab instance. The single-node JupyterLab instances are cheaper. For more on the different kinds of JupyterLab instance, see the DNAnexus videos about JupyterLab, which are the 3rd and 4th videos on this page: https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/research-analysis-platform-training-webinars
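As a rough illustration of that last step, the sketch below keeps only the rows for a chosen set of EIDs and writes the chosen columns to CSV using the Python standard library. The column name p31 for Sex follows the field naming used in the question; the rows, the values, the extra column, and the function name write_subset_csv are all invented for the example:

```python
import csv
import io


def write_subset_csv(rows, wanted_eids, columns, out):
    """Write the chosen columns for the wanted EIDs to a CSV stream."""
    wanted = {str(e) for e in wanted_eids}
    # extrasaction="ignore" drops any columns that were not requested.
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    kept = 0
    for row in rows:
        if str(row["eid"]) in wanted:
            writer.writerow(row)
            kept += 1
    return kept


# Invented rows standing in for the retrieved tabular data.
rows = [
    {"eid": "1000001", "p31": 0, "other_field": 55},
    {"eid": "1000002", "p31": 1, "other_field": 61},
    {"eid": "2500003", "p31": 1, "other_field": 48},
]
# An in-memory buffer keeps the sketch self-contained; for a real file,
# pass open("subset.csv", "w", newline="") instead.
buffer = io.StringIO()
n_kept = write_subset_csv(rows, ["1000001", "2500003"], ["eid", "p31"], buffer)
print(buffer.getvalue())
```

A small file like this can then be read in a single-node instance without Spark.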