I can process some test data with Hail on a JupyterLab Spark cluster and, for example, export a filtered VCF. How can I get this data back into my DNAnexus project?
Example:
import hail as hl
from pyspark.sql import SparkSession

# Build a Spark session with Hive support and hand its context to Hail
builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

# Download the 1000 Genomes test data and convert it to a MatrixTable
hl.utils.get_1kg('data/')
hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
mt = hl.read_matrix_table('data/1kg.mt')
## apply some filters / processing [...]
mt.rows().export('test/mydata.tsv.gz', delimiter='\t')
# Where does this go?? Can I put it somewhere and then use dx-upload-all-outputs?
Comments
My understanding is that a relative path such as test/mydata.tsv.gz is saved to HDFS (the cluster's distributed Hadoop storage), not to the notebook's local disk. You could try copying it from HDFS to local notebook storage (/opt/...) using "hdfs dfs -get <file>".
https://www.geeksforgeeks.org/hdfs-commands/
Once your file is in the local notebook storage (the same place as your notebook), you should be able to upload the file to your project.
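The two steps above could be scripted from a notebook cell roughly as follows. This is only a sketch under some assumptions: that the hdfs and dx command-line tools are both on the PATH of the notebook environment, that the export really landed in HDFS under the relative path used in the question, and that /results/ (a hypothetical folder name) already exists in the project. The helper names are made up for illustration.

```python
import subprocess


def hdfs_get_command(hdfs_path, local_path):
    """Build the command that copies a file from HDFS to local storage."""
    return ["hdfs", "dfs", "-get", hdfs_path, local_path]


def dx_upload_command(local_path, destination):
    """Build the command that uploads a local file to the DNAnexus project."""
    return ["dx", "upload", local_path, "--path", destination]


def copy_and_upload(hdfs_path, local_path, destination, dry_run=True):
    """Return both commands; run them in order unless dry_run is True."""
    commands = [
        hdfs_get_command(hdfs_path, local_path),
        dx_upload_command(local_path, destination),
    ]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError at the first failing step
            subprocess.run(cmd, check=True)
    return commands


# Dry run: print the commands without executing anything.
# "/results/" is a hypothetical destination folder in the project.
for cmd in copy_and_upload("test/mydata.tsv.gz", "mydata.tsv.gz", "/results/"):
    print(" ".join(cmd))
```

Calling copy_and_upload(..., dry_run=False) would actually run both commands. It is worth checking the printed commands first, since the HDFS path depends on how the cluster resolves relative paths.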