Simple Proteomic Download

04 March 2025 20:50

Hi,

I'm trying to develop a really simple script for downloading certain data fields and saving them as CSV files, but I have recently run into some unusual behavior. To be clear, I've done this in the past, but now run times (on 4 nodes in a Spark cluster) are getting exorbitantly long. Below is the script:

import dxpy
import pandas as pd
import subprocess
import glob
import os
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
import dxdata
import warnings
warnings.filterwarnings("ignore")

# ---- Step 1 -----
project = os.getenv('DX_PROJECT_CONTEXT_ID')
record = os.popen("dx find data --type Dataset --delimiter ',' | awk -F ',' '{print $5}'").read().rstrip()
DATASET_ID = project + ":" + record
cmd = ["dx", "extract_dataset", DATASET_ID, "-ddd", "--delimiter", ","]
subprocess.check_call(cmd)

# ---- Step 2 -----
path = os.getcwd()
data_dict_csv = glob.glob(os.path.join(path, "*.data_dictionary.csv"))[0]
data_dict_df = pd.read_csv(data_dict_csv)

# ---- Step 3 -----
olink_df = data_dict_df.loc[data_dict_df["entity"].str.contains("olink_instance")]
print(olink_df.entity.unique())
print(olink_df.groupby('entity')['name'].size())

# ---- Step 4 -----
conf = pyspark.SparkConf().set("spark.kryoserializer.buffer.max", "256")
sc = pyspark.SparkContext(conf=conf)
spark = pyspark.sql.SparkSession(sc)
sqlContext = SQLContext(sc)

# ---- Step 5 -----
data_dict_csv = glob.glob(os.path.join(path, "*.data_dictionary.csv"))[0]
data_dict_df = pd.read_csv(data_dict_csv)
dataset = dxdata.load_dataset(id=DATASET_ID)
pheno_explore = dataset['participant']

# ---- Step 6 -----
pattern_exp = ".*eid|p41262|p41283|p40005|p41263|p40007|p21022|p53|.*"
field_names_exp = list(pheno_explore.find_fields(name_regex=pattern_exp))
pheno_data = pheno_explore.retrieve_fields(fields=field_names_exp, engine=dxdata.connect()).to_pandas_on_spark()

# ---- Step 7 -----
pheno_data = pheno_data.to_pandas()
pheno_data.to_csv("pheno_data1.csv", index=False)
print("Pheno data saved to: pheno_data1.csv")
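
One thing worth checking in Step 6: the pattern ends in `|.*`, and as an ordinary regular expression that last alternative matches every string, so the pattern may be selecting every field in the participant entity rather than the seven listed. That alone could explain the long run times. The sketch below uses only Python's `re` module to show the effect. How `find_fields` applies `name_regex` internally is an assumption here, the names in `sample_names` are made up for illustration, and the tighter pattern assumes field names look like `p<field>` with an optional `_i<instance>`-style suffix.

```python
import re

# Pattern copied from Step 6 of the script above.
broad_pattern = ".*eid|p41262|p41283|p40005|p41263|p40007|p21022|p53|.*"

# A tighter alternative (an assumption about the intended fields): match "eid"
# or one of the listed field IDs, optionally followed by an "_..." suffix.
field_ids = ["p41262", "p41283", "p40005", "p41263", "p40007", "p21022", "p53"]
tight_pattern = r"^(eid|(" + "|".join(field_ids) + r")(_.*)?)$"

# Hypothetical field names, for illustration only.
sample_names = ["eid", "p53_i0", "p41262", "p31", "p530_i0", "p20002_i0_a1"]

# The trailing ".*" alternative lets the broad pattern match every name.
broad_hits = [n for n in sample_names if re.fullmatch(broad_pattern, n)]
# The tight pattern keeps only "eid" and the listed field IDs.
tight_hits = [n for n in sample_names if re.fullmatch(tight_pattern, n)]

print("broad:", broad_hits)
print("tight:", tight_hits)
```

If the real field list comes back much longer than expected, printing `len(field_names_exp)` before calling `retrieve_fields` is a cheap way to confirm it.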

I've also tried using Koalas, but I now get errors with that version of my download template (something that wasn't happening before). It is identical to the version above except for these lines:

pattern_exp = ".*eid|p42018|p42020|p42022|p42024|p42032|p42006|p53|.*"

field_names_exp = list(pheno_explore.find_fields(name_regex=pattern_exp))
pheno_data = pheno_explore.retrieve_fields(fields=field_names_exp, engine=dxdata.connect()).to_koalas()

pheno_data_pandas = pheno_data.to_pandas()
pheno_data_pandas.to_csv("main_disease_data.csv", index=False)
print("Pheno data saved to: main_disease_data.csv")

Does anyone have a faster or more efficient way of doing this? I've tried following the GitHub notebook, but I run into "hive" errors when I use it: https://github.com/UK-Biobank/UKB-RAP-Notebooks-Access/blob/main/JupyterNotebook_Python/A102_Explore-participant-data_Python.ipynb
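
For a handful of fields, one route that may avoid Spark altogether is the same `dx extract_dataset` command that Step 1 already calls, passed a `--fields` list instead of `-ddd`. The following is a minimal sketch, not a tested recipe: `build_extract_cmd` is a hypothetical helper, the dataset ID and field names are placeholders, and it assumes fields are addressed as `entity.field`. It is worth checking `dx extract_dataset --help` for the flags supported by the installed dxpy version.

```python
def build_extract_cmd(dataset_id, field_names, output_path, entity="participant"):
    """Build a `dx extract_dataset` command that pulls only the named fields."""
    # Qualify each field with its entity, e.g. "participant.eid".
    qualified = [f"{entity}.{name}" for name in field_names]
    return [
        "dx", "extract_dataset", dataset_id,
        "--fields", ",".join(qualified),
        "--delimiter", ",",
        "-o", output_path,
    ]

# Placeholder dataset ID and hypothetical field names, for illustration only.
cmd = build_extract_cmd("project-xxxx:record-yyyy", ["eid", "p53_i0"], "pheno_data1.csv")
print(" ".join(cmd))

# To run it on the platform, pass the list to subprocess.check_call(cmd),
# as Step 1 of the script does for the data dictionary.
```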

Thanks!
