Hi! Question: how can I retrieve all fields (from the phenotypic data) for a specific sample or a list of samples (provided in a file, for example)?

Comments

12 comments

  • Comment author
    Ben Busby DNAnexus Team

    Hi Catarina! let me take a look!

  • Comment author
    Ben Busby DNAnexus Team

    I don't see the file. Can you tell me which IDs you are using?

  • Comment author
    Ben Busby DNAnexus Team

    Aha, I think this should answer your question, please lmk if it doesn't: https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb

  • Comment author
    Former User of DNAx Community_85

    Participant IDs - They will be specific to my project and I don't think I should share them(?)

    I have checked that link but I couldn't retrieve all the data fields for a specific participant (or group of participants)

  • Comment author
    Ben Busby DNAnexus Team

    Agreed, don't share the IDs here.

    I just needed to know which IDs

  • Comment author
    Ben Busby DNAnexus Team

    Are you certain all of the fields you wanted were selected in Showcase?

    If you just got access to them, you may have to redispense

     

    doc on that for completeness: https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/creating-a-project

  • Comment author
    Ben Busby DNAnexus Team

    This is probably obvious, but (for others) you can check by running dataset.entities

  • Comment author
    Former User of DNAx Community_85

    I am going to recheck the first link you sent.

    My problem was in retrieving data for a particular set of participants (eids - e.g. 1234567, 12345678).

     

    Thanks!

  • Comment author
    Ben Busby DNAnexus Team

    Check and lmk if the answer is not there.

     

    I'll be back tomorrow morning at the latest!

  • Comment author
    Ben Busby DNAnexus Team

    Catarina, this may also be helpful:

     

    field_names = []
    for feature in feature_list:
        print(feature)
        names = field_names_for_id(feature_code_mapping[feature])
        print(names)
        field_names += names
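The loop above depends on names defined elsewhere (feature_list, feature_code_mapping, field_names_for_id). As a rough, self-contained sketch of what it accumulates, the version below swaps in hypothetical stand-ins: a hard-coded list of column names instead of the real dataset, a made-up feature mapping, and a regex-based field_names_for_id that assumes columns are named like p<field>_i<instance>_a<array>. None of these stand-ins are the notebook's actual code.

```python
import re

# Hypothetical column names standing in for what the dataset would expose.
available_columns = [
    "eid", "p31", "p21022",
    "p93_i0_a0", "p93_i0_a1", "p93_i1_a0",
    "p930_i0",
]

# Hypothetical mapping from readable feature names to field IDs.
feature_code_mapping = {"sex": 31, "systolic_bp_manual": 93}
feature_list = list(feature_code_mapping)


def field_names_for_id(field_id):
    """Stand-in helper: return every column belonging to one field ID.

    Assumes the naming scheme p<field>[_i<instance>][_a<array>].
    """
    pattern = re.compile(r"^p{}(_i\d+)?(_a\d+)?$".format(field_id))
    return [name for name in available_columns if pattern.match(name)]


# Same accumulation as the loop in the comment above.
field_names = []
for feature in feature_list:
    names = field_names_for_id(feature_code_mapping[feature])
    print(feature, names)
    field_names += names
```

The anchored pattern is there so that field 93 does not also pick up an unrelated column such as p930_i0.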

  • Comment author
    Ben Busby DNAnexus Team

    Hi! Instead of bits and pieces, my friend @Ondrej Klempir put everything in one place, using Koalas:

     

    # After the phenotypes are successfully loaded into a Spark dataframe
    # (https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb)

    import databricks.koalas as ks  # import Koalas (this also makes to_koalas() available on Spark dataframes)

    list_of_eids = ["1234567", "12345678"]  # can be hardcoded or, e.g., loaded from a file

    df_phenotypes_koalas = df_phenotypes.to_koalas()  # convert the Spark dataframe to enable filtering with Koalas
    print(df_phenotypes_koalas.shape)  # check shape before filtering

    filtered_phenotypes = df_phenotypes_koalas[df_phenotypes_koalas["eid"].isin(list_of_eids)]  # apply "isin" filtering
    print(filtered_phenotypes.shape)  # check shape after filtering
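Since list_of_eids "can be loaded from a file", here is a minimal sketch of that step in plain Python. It assumes a text file with one eid per line; the helper name load_eids and the demo file name are hypothetical. The IDs are kept as strings to match the hardcoded list above.

```python
import tempfile
from pathlib import Path


def load_eids(path):
    """Read one participant ID per line, skipping blank lines,
    '#' comment lines, and duplicates (first occurrence wins)."""
    eids = []
    seen = set()
    for line in Path(path).read_text().splitlines():
        eid = line.strip()
        if not eid or eid.startswith("#") or eid in seen:
            continue
        seen.add(eid)
        eids.append(eid)
    return eids


# Small demonstration with a throwaway file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "eids.txt"  # hypothetical file name
    demo_file.write_text("# my participants\n1234567\n12345678\n\n1234567\n")
    list_of_eids = load_eids(demo_file)

print(list_of_eids)
```

The resulting list can then be passed to the isin filter in place of the hardcoded one.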

  • Hi all, I would be grateful for some help on this topic too. I've been trying to create a cohort from a set of specific eids (several hundred), with no luck. When I run dxdata.create_cohort, it does create a cohort in the right folder, but the cohort always opens with an error. Can anyone share example code showing how to use dxdata.create_cohort properly?

     

    I've seen the solution using the koalas dataframe, but doesn't that involve having to query the entire 500,000 participant dataset for your fields of interest first? Wouldn't that use a huge amount of computing power? I would be grateful for any advice! Thanks, David

