Where does saved data go on a Jupyter spark cluster?

07 October 2022 00:00
1 comment

I can process some test data with Hail on a JupyterLab Spark cluster and, for example, export a filtered VCF. How can I get this data back up to my DNAnexus project?

Example:

    import hail as hl
    from pyspark.sql import SparkSession

    builder = (
        SparkSession
        .builder
        .enableHiveSupport()
    )
    spark = builder.getOrCreate()
    hl.init(sc=spark.sparkContext)

    hl.utils.get_1kg('data/')
    hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
    mt = hl.read_matrix_table('data/1kg.mt')

    ## apply some filters / processing [...]

    mt.rows().export('test/mydata.tsv.gz', delimiter='\t')  # Where does this go??

Can I put it somewhere and then use dx-upload-all-outputs?

Comments


Ondrej Klempir DNAnexus Team
- 08 October 2022 06:40
My understanding is that a file such as test/mydata.tsv.gz is saved to HDFS (the cluster's distributed Hadoop storage). You could try copying it from HDFS to the local notebook storage (/opt/...) with "hdfs dfs -get <file>".

https://www.geeksforgeeks.org/hdfs-commands/

Once the file is in the local notebook storage (the same place as your notebook), you should be able to upload it to your project.
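The two steps above could be scripted from a notebook cell roughly as follows. This is only a sketch under some assumptions: that the `hdfs` and `dx` command-line tools are both on the PATH of the notebook environment, that the relative path test/mydata.tsv.gz resolves inside your HDFS home directory, and that "/" is the project folder you want. The helper names (`hdfs_get_command`, `dx_upload_command`, `fetch_and_upload`) are made up for illustration and are not part of Hail or dxpy.

```python
import shlex
import subprocess
from pathlib import PurePosixPath


def hdfs_get_command(hdfs_path, local_dir="."):
    """Build the `hdfs dfs -get` command that copies one HDFS file to local disk."""
    return ["hdfs", "dfs", "-get", hdfs_path, local_dir]


def dx_upload_command(local_file, project_folder="/"):
    """Build the `dx upload` command that sends a local file to the project."""
    return ["dx", "upload", local_file, "--path", project_folder]


def fetch_and_upload(hdfs_path, local_dir=".", project_folder="/", dry_run=False):
    """Copy a file out of HDFS, then upload the local copy to the project.

    With dry_run=True the commands are only printed, which is a cheap way
    to check them before touching the cluster or the project.
    """
    # The local copy keeps the file name it had in HDFS.
    local_file = str(PurePosixPath(local_dir) / PurePosixPath(hdfs_path).name)
    commands = [
        hdfs_get_command(hdfs_path, local_dir),
        dx_upload_command(local_file, project_folder),
    ]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True raises CalledProcessError if either tool fails.
            subprocess.run(cmd, check=True)
    return commands


# Print the two commands without running them.
fetch_and_upload("test/mydata.tsv.gz", dry_run=True)
```

Calling `fetch_and_upload("test/mydata.tsv.gz")` without `dry_run` would actually run both commands. If the export path was written somewhere other than your HDFS home directory, pass the full HDFS path instead.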
