Comments
Hi Lora, for context, I'm trying to accomplish what is set out in this post, which is basically to analyze some of the pVCFs using HAIL.
I've now managed to locate the pVCF that I need using this file from the UK Biobank. I tried loading the file into HAIL within a Jupyter notebook with the following code (not showing the code I used to load HAIL itself):
hl.import_vcf('/mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c5_b4_v1.vcf.gz').write('ukb23157_c5_b4_v1.mt', overwrite=True)
But I get an error saying that the file doesn't exist.
I thought that maybe this is some problem with the file being gzipped. To get around this, I tried to extract the file using a terminal with the following command:
gunzip -c ukb23157_c5_b4_v1.vcf.gz > /ukb23157_c5_b4_v1.vcf
But although the command appears to complete, I don't see a file when it does. I presume that the system automatically deleted the file.
I also tried to write it to /mnt/project, but then I got an error saying the filesystem is read-only.
Could you please suggest how to go about this?
Best,
Jeremy
Hi Jeremy - thank you for your question! I might need to get some advice from our bioinformatician on best approaches to using the pVCF files with HAIL - will update!
Thank you very much! I think I could also benefit from understanding read/write permissions to the filesystem. Is there a place I could gunzip the vcf.gz file to?
Hi Jeremy
After some bioinformatics advice, our guess is that the problem may be due to HAIL not having access to the dx fuse system, which lets you use the /mnt/project area directly. An alternative approach would be to download the pVCF of interest to the local instance using the dx download command, e.g. dx download "/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c5_b4_v1.vcf.gz". You will then have a copy on your local system that you can use with HAIL. As for unzipping the file, the problem may again be that dx fuse is read-only; writing to the project should be done with dx upload. If you download the zipped file to your local instance, you should be able to unzip it locally. Hope that helps!
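For anyone following along, the download-then-import route suggested above could be sketched roughly as follows. This is only an illustration, not a confirmed fix: the helper names are made up for this example, the file:// prefix assumes that HAIL on a Spark cluster resolves un-prefixed paths against the cluster's default filesystem rather than local disk, and force_bgz=True and reference_genome="GRCh38" assume the pVCF is block-gzipped and aligned to GRCh38.

```python
import os
import shlex
import subprocess
from typing import List

# Project path copied from the thread. It contains spaces and a comma,
# so it must be passed as a single argument (or quoted in a shell).
PVCF = ("/Bulk/Exome sequences/Population level exome OQFE variants, "
        "pVCF format - final release/ukb23157_c5_b4_v1.vcf.gz")


def dx_download_command(project_path: str, dest_dir: str = ".") -> List[str]:
    """Build the argument list for `dx download`; -o names the output location."""
    return ["dx", "download", project_path, "-o", dest_dir]


def local_uri(path: str) -> str:
    """Turn a local path into a file:// URI (assumption: needed so that a
    Spark-backed HAIL session looks on local disk, not the default filesystem)."""
    return "file://" + os.path.abspath(path)


def download(project_path: str, dest_dir: str = ".") -> str:
    """Run dx download and return the local path of the downloaded copy."""
    subprocess.run(dx_download_command(project_path, dest_dir), check=True)
    return os.path.join(dest_dir, os.path.basename(project_path))


def import_pvcf(local_vcf_gz: str, out_mt: str) -> None:
    """Import a locally downloaded pVCF and write it as a MatrixTable."""
    import hail as hl  # imported lazily so the helpers above work without HAIL

    mt = hl.import_vcf(
        local_uri(local_vcf_gz),
        force_bgz=True,             # assumption: the .vcf.gz is block-gzipped
        reference_genome="GRCh38",  # assumption: exome data aligned to GRCh38
    )
    mt.write(out_mt, overwrite=True)


if __name__ == "__main__":
    # Print the shell-safe command instead of running it.
    print(shlex.join(dx_download_command(PVCF)))
```

In a notebook, the calls would be something like local = download(PVCF) followed by import_pvcf(local, "ukb23157_c5_b4_v1.mt"). Passing the command as a list avoids the quoting problems that the spaces in the project path cause in a shell.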
Hello, thanks for doing this Q&A session. My question is about the blood lipid level data in UKB. We have noticed that:
1. The UKB-performed lipid panel data from the "blood biochemistry" testing suggests a very high prevalence of dyslipidemia among UKB participants
2. The "NMR Metabolomics" data indicates a much lower prevalence
3. The NMR metabolomics testing includes two tests, "LDL" and "Clinical LDL", with very different results, and we can't find any documentation of the differences.
Can you help sort any of this out?
Great. On a broader level, I guess I don't understand how the dx fuse system works. Will try your suggestions later today and see what documentation I can find on dx fuse. Thank you all!
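On the read/write question, one way to see where files can be written is to probe a directory before decompressing into it. The sketch below is illustrative only: is_writable and gunzip_to are hypothetical helper names, and it assumes, as suggested earlier in the thread, that /mnt/project is a read-only mount while the instance's local working directory is writable.

```python
import gzip
import os
import shutil
import tempfile


def is_writable(directory: str) -> bool:
    """Return True if a file can actually be created in `directory`."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers read-only mounts, missing directories and permission errors.
        return False


def gunzip_to(src_gz: str, dest_dir: str) -> str:
    """Decompress `src_gz` into `dest_dir` and return the path of the new file."""
    if not is_writable(dest_dir):
        raise PermissionError(f"{dest_dir} is not writable (e.g. a read-only mount)")
    name = os.path.basename(src_gz)
    if name.endswith(".gz"):
        name = name[:-3]
    dest = os.path.join(dest_dir, name)
    # Stream in chunks so a large pVCF is never held in memory at once.
    with gzip.open(src_gz, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

For example, is_writable("/mnt/project") would be expected to return False if the mount is read-only, whereas gunzip_to("ukb23157_c5_b4_v1.vcf.gz", os.getcwd()) should work on a locally downloaded copy, given enough disk space for the uncompressed file.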
Hi Eric,
Thank you for your question! Generally, the UKB cohort comprises older individuals, which may account for the high prevalence of dyslipidaemia; is the prevalence higher than you would expect for a cohort of that age distribution? The blood biochemistry panel is available for a larger proportion of the cohort compared to NMR metabolomics data, which may explain the differences in prevalence between the two. Further details on the NMR metabolomics can be found under UK Biobank resource 3000 (https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=3000).
Thank you for your questions! Other members of the UK Biobank Data Analyst Team will be around to answer your questions later this week!