First, I converted a BAM file to a VCF file using the Mutectcaller (Parabricks accelerated) app in UKB RAP. Second, I annotated the variants in the VCF file using the SnpEff Annotate app, and then I filtered the snpEff.vcf file for specific genes using bcftools in SAK. Now I'd like to convert this filtered.vcf file to tabular data, as in the attached image. How can I do that? What would you suggest?
[Image: variants]
Comments
17 comments
In that case, I'm afraid there is no other solution than learning a programming language. Both Python and R would let you select only the data columns you need.
Since you can't download the data, there is probably no point in converting it to Excel, though.
@Burcu Çevik, did the above screenshot come from UKB data? If so, could you remove this post and repost with no UKB data?
What programming language are you using? We can share the code we have been using, but we need to know which language you can code in.
@Chai Fungtammasan No, it did not come from UKB data. It belongs to the pat lab where I am a student. Actually, I had hoped you would suggest an app in the Tools Library on UKB RAP. I'm sorry, I do not know any programming language.
I see. Thanks for confirming.
One quick workaround is to remove the header from the VCF file; you would end up with a tab-delimited file. However, I personally wouldn't recommend putting genetic data into an Excel file. Excel would badly mess up gene names and several other fields.
Since downloading VCF files is forbidden, I could not remove the header from the filtered.vcf file. I converted the VCF to Excel, but it contains just the chr, ref and alt information. That's not enough for me. I converted the VCF to txt, but it is so complicated. I can't find the information I was looking for.
In my UKB project, my goal is to obtain a somatic variant table for each sample in my cohort. I want this variant table to contain gene, location, length, % Frequency, Exon, Transcript, Coding, Amino Acid Change, Variant Effect and Coverage information. Do you have any other solution for this?
Okay, well, can you explain in a simple way, step by step, what I have to do, please? My preference is Python. I'd appreciate it if you could share related links.
Sure. You could try this course on Kaggle: https://www.kaggle.com/learn/pandas
You need to complete the first two lessons. There is also a Python course if you have no Python background.
You would also need code to convert a VCF to a pandas DataFrame. I found this code, but I have never tried it myself.
Could you please tell me how I would use it on UKB RAP after I finish the courses you linked? And could you share the code you found?
Could you share the one you're using for Python?
Hello, I looked into this and prepared some steps for you. You will need to run JupyterLab with Python/R. As a test VCF file, I used a publicly available SnpEff-annotated VCF from https://raw.githubusercontent.com/pcingola/SnpEff/master/examples/test.chr22.ann.vcf
The important step to get this running is to launch JupyterLab on UKB-RAP and then, inside a Python notebook, run the following cells one by one and see what each part produces:
import io
import os
import pandas as pd

def read_vcf(path):
    # Skip the '##' meta-information lines; keep the '#CHROM' column header line
    with open(path, 'r') as f:
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})

%%bash
# Cell magic: %%bash must be the first line of its own notebook cell
wget https://raw.githubusercontent.com/pcingola/SnpEff/master/examples/test.chr22.ann.vcf

df_vcf = read_vcf('test.chr22.ann.vcf')

# Split the whole INFO string on "|" into columns INFO_0, INFO_1, ...
# (a crude split: it ignores which INFO key each piece belongs to)
concat_vcf = pd.concat(
    [df_vcf[['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER']],
     df_vcf['INFO'].str.split("|", expand=True).add_prefix('INFO_')],
    axis=1)
concat_vcf

# filter columns / select only a subset
concat_vcf[['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO_1', 'INFO_2', 'INFO_3']]

# besides filtering, you can also rename columns so they are better described in the final file
concat_vcf[['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO_1', 'INFO_2', 'INFO_3']].rename(columns={"INFO_1": "annotation_1", "INFO_2": "annotation_2"})

# write the full table to a CSV file in the notebook's working directory
concat_vcf.to_csv('exported_table.csv', index=False)
Please note that the commands above are only a starting point (they will likely not work as-is for your use case, although I tried to make them as general as possible). Some additional custom steps will probably be needed, because every VCF can have different content in the INFO field and may have differences in the header, etc.
Once you have run all the commands, the exported table (exported_table.csv) will appear on the left side of your screen, under the small folder icon.
When working on this answer, I based one part on this gist:
https://gist.github.com/dceoy/99d976a2c01e7f0ba1c813778f9db744
In the future, you can also wrap all the logic into a Python script (runnable via SAK) or an applet.
Let me know if it helps.
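As one example of the "additional custom steps" mentioned above: allele frequency and coverage are usually not in INFO but in the per-sample column, whose layout is described by the FORMAT column. The sketch below is only an illustration. The function name `sample_fields`, the column name `TUMOR_SAMPLE` and the keys `AF` and `DP` are assumptions; check the `##FORMAT` header lines and the sample column name of your own VCF before relying on them.

```python
import pandas as pd

def sample_fields(df, sample_col, keys=("AF", "DP")):
    """Add one column per requested FORMAT key, taken from one sample column.

    `sample_col` and the default keys are assumptions for illustration only.
    """
    def pick(row):
        # FORMAT holds the key names, the sample column holds the values,
        # both separated by ":" and in the same order
        values = dict(zip(row["FORMAT"].split(":"), row[sample_col].split(":")))
        return pd.Series({k: values.get(k) for k in keys})
    return pd.concat([df, df.apply(pick, axis=1)], axis=1)

# Tiny hand-made example row (not real data)
example = pd.DataFrame({
    "CHROM": ["chr1"], "POS": [100],
    "FORMAT": ["GT:AD:AF:DP"],
    "TUMOR_SAMPLE": ["0/1:30,10:0.25:40"],
})
with_fields = sample_fields(example, "TUMOR_SAMPLE")
```

The extracted values stay as strings, because multi-allelic sites can carry several comma-separated values in one field; convert them to numbers only after deciding how to handle those rows.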
@Ondrej Klempir Thank you so much for the detailed response. I will try it and get back to you.
Hi @Ondrej Klempir,
I received a "FileNotFoundError". What should I write before the file name to avoid this error?
It seems to me you should first make the file "output.vcf" available on the cloud worker. As far as I can see from your screenshot, the file is located in your DNAnexus project. You basically have two options:
1. Use the "dx download" command to download the file to the worker.
2. Use dxfuse, i.e. give the full /mnt/project/... path to specify the location of your file.
You can read more about it here:
https://documentation.dnanexus.com/user/jupyter-notebooks#dxjupyterlab-environments
https://documentation.dnanexus.com/user/jupyter-notebooks#accessing-data
So, in theory, given that "output.vcf" is in the root directory of the project, one option would be something like:
df_vcf = read_vcf("/mnt/project/output.vcf")
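The two options can also be combined in a small helper, sketched below under some assumptions: `resolve_input` is a hypothetical name, the mount point is taken to be /mnt/project as described above, and the fallback assumes the `dx` command-line client is available in the JupyterLab session. It looks for a local copy first, then for the dxfuse-mounted file, and only then calls "dx download".

```python
import os
import subprocess

def resolve_input(name, mount_root="/mnt/project", workdir="."):
    """Return a readable path for a project file (hypothetical helper)."""
    # 1. A copy already present in the working directory
    local = os.path.join(workdir, os.path.basename(name))
    if os.path.exists(local):
        return local
    # 2. The read-only dxfuse mount of the project
    mounted = os.path.join(mount_root, name.lstrip("/"))
    if os.path.exists(mounted):
        return mounted
    # 3. Fall back to copying the file onto the worker with the dx CLI
    subprocess.run(["dx", "download", name, "-o", local], check=True)
    return local
```

With this in place, `read_vcf(resolve_input("output.vcf"))` would work regardless of whether the file was downloaded beforehand.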
Hi @Ondrej Klempir,
/mnt/project/ solved my problem. Thank you so much. I tried the rest of the code and got an exported text table. The table has 277719 rows and 3370 columns. I noticed that some columns in the table repeat, so I can't decide which to choose. For example, there is no single column that contains the gene name; many columns contain gene names. Why are there so many columns with the same information? Actually, I'm not sure whether it is exactly the same information. Do you have any idea?
SnpEff is very versatile. The annotation, and the resulting number of columns per row, depends on the selected annotation tool. In my experience, proper filtering is always needed to reduce the number of columns to only the subset relevant for your analysis.
You can read the specification of the SnpEff tool.
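To make the reason for the repeated columns concrete: SnpEff writes all annotations of a variant into a single INFO entry, ANN, with one comma-separated block per affected transcript or gene and "|"-separated subfields inside each block. Splitting the whole INFO string on "|" therefore lays every block side by side, so the gene name shows up once per block. A possible alternative, sketched here rather than offered as a definitive implementation, is to produce one row per annotation with named columns. The function names are hypothetical, and the subfield names follow the SnpEff ANN format; confirm them against the `##INFO=<ID=ANN,...>` header line of your own file.

```python
import pandas as pd

# Subfield names as listed in the SnpEff ANN format description; verify them
# against the '##INFO=<ID=ANN,...>' header line of your own VCF.
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank", "HGVS.c",
    "HGVS.p", "cDNA.pos/cDNA.length", "CDS.pos/CDS.length",
    "AA.pos/AA.length", "Distance", "ERRORS/WARNINGS/INFO",
]

def extract_ann(info):
    """Return the raw ANN value from an INFO string, or None if absent."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            return entry[len("ANN="):]
    return None

def explode_ann(df):
    """Return one row per (variant, annotation) with named ANN columns."""
    out = df.drop(columns=["INFO"]).copy()
    out["ANN"] = df["INFO"].map(extract_ann)
    out = out.dropna(subset=["ANN"])           # variants without annotation
    out["ANN"] = out["ANN"].str.split(",")     # one list item per annotation
    out = out.explode("ANN").reset_index(drop=True)
    parts = out["ANN"].str.split("|", expand=True)
    parts = parts.iloc[:, :len(ANN_FIELDS)]
    parts.columns = ANN_FIELDS[:parts.shape[1]]
    return pd.concat([out.drop(columns=["ANN"]), parts], axis=1)

# Tiny hand-made example (not real data): one variant with two annotations,
# and one variant without any ANN entry
ann_1 = "|".join(["T", "missense_variant", "MODERATE", "GENE1", "ID1",
                  "transcript", "TX1", "protein_coding", "2/5", "c.10A>T",
                  "p.Lys4Asn", "10/100", "10/90", "4/30", "", ""])
ann_2 = "|".join(["T", "upstream_gene_variant", "MODIFIER", "GENE2", "ID2",
                  "transcript", "TX2", "protein_coding", "", "c.-5A>T",
                  "", "", "", "", "5", ""])
example = pd.DataFrame({
    "CHROM": ["chr22", "chr22"], "POS": [100, 200],
    "INFO": ["AC=1;ANN=" + ann_1 + "," + ann_2, "AC=2"],
})
long_table = explode_ann(example)
```

From such a table, columns like Gene_Name, Annotation, Feature_ID, Rank, HGVS.c and HGVS.p can be selected directly, and rows can be filtered (for example to one transcript per gene) before exporting with to_csv.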