How can I convert a VCF file to tabular data?

Permanently deleted user

14 June 2023 00:00
17 comments

First, I converted a BAM file to a VCF file using the Mutectcaller (Parabricks accelerated) app in UKB RAP. Second, I annotated the variants in the VCF file using the SnpEff Annotate app, and then I filtered the snpEff.vcf file for specific genes using bcftools in SAK. Now I'd like to convert this filtered.vcf file to tabular data, as in the attached image. How can I do that? What would you suggest? [Image: variants]

Comments


Chai Fungtammasan DNAnexus Team
- 18 May 2023 07:46
In that case, I'm afraid there is no solution other than learning a programming language. Both Python and R can select only the data columns you need.
Since you can't download the data, there is probably no point in converting it to Excel, though.

Chai Fungtammasan DNAnexus Team
- 14 June 2023 14:40
@Burcu Çevik, did the screenshot above come from UKB data? If so, could you remove this post and repost it without UKB data?
Which programming language do you use? We can share the code we have been using, but we need to know which language you can code in.

Permanently deleted user
- 14 June 2023 16:05
@Chai Fungtammasan No, it did not come from UKB data. It belongs to the pat lab where I'm a student. Actually, I had hoped you would suggest an app in the Tools Library on UKB RAP. I'm sorry, I do not know any programming language.

Chai Fungtammasan DNAnexus Team
- 14 June 2023 16:19
I see. Thanks for confirming.

One quick workaround is to remove the header from the VCF file, which leaves you with a tab-delimited file. However, I personally wouldn't recommend putting genetic data into an Excel file. Excel tends to badly mangle gene names and several other fields.
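
A minimal sketch of that workaround in Python, assuming an uncompressed, plain-text VCF. The function name `vcf_to_tsv` and the file names in the usage note are only illustrative. It drops the `##` meta lines and keeps the `#CHROM` line as the column header:

```python
def vcf_to_tsv(vcf_path, tsv_path):
    """Copy a VCF to a tab-delimited file without the '##' meta lines."""
    kept = 0
    with open(vcf_path) as src, open(tsv_path, "w") as dst:
        for line in src:
            if line.startswith("##"):
                continue  # meta-information lines are not part of the table
            if line.startswith("#CHROM"):
                line = line[1:]  # drop the leading '#' from the column header
            dst.write(line)
            kept += 1
    return kept  # number of lines written, header included
```

It could be called as `vcf_to_tsv("filtered.vcf", "filtered.tsv")`. The INFO column is still one long string per row at this point, so this only gets you a table, not separate annotation columns.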

Permanently deleted user
- 09 July 2023 10:22
Since downloading VCF files is not allowed, I could not remove the header from the filtered.vcf file. I converted the VCF to Excel, but it contains only the chr, ref, and alt information. That's not enough for me. I converted the VCF to txt, but it is too complicated; I can't find the information I was looking for.

In my UKB project, my goal is to obtain a somatic variant table for each sample in my cohort. I want this table to contain Genes, Location, Length, % Frequency, Exon, Transcript, Coding, Amino Acid Change, Variant Effect, and Coverage information. Do you have any other solution for this?

Permanently deleted user
- 09 July 2023 21:35
Okay. Could you explain in a simple way, step by step, what I need to do? My preference is Python. I'd appreciate it if you could share related links.

Chai Fungtammasan DNAnexus Team
- 10 July 2023 00:51
Sure. You could try this Kaggle course: https://www.kaggle.com/learn/pandas
You need to complete the first two lessons. There is also a Python course if you have no Python background.
You would also need code to convert a VCF to a pandas DataFrame. I found this code, but I have never tried it myself.

Permanently deleted user
- 13 July 2023 10:10
Could you please tell me how to use it on UKB RAP after I finish the courses you linked? And could you share the code you found?

Permanently deleted user
- 13 July 2023 10:18
Could you share the one you're using for Python?

Ondrej Klempir DNAnexus Team
- 13 July 2023 19:09
Hello, I looked into this and prepared some steps for you. You will need to run JupyterLab with Python/R. As a test VCF file, I used a publicly available SnpEff-annotated VCF from https://raw.githubusercontent.com/pcingola/SnpEff/master/examples/test.chr22.ann.vcf

The important step is to start JupyterLab on UKB-RAP and then, inside a Python notebook, run the following lines/paragraphs one cell at a time and see what each part produces:

import io
import os
import pandas as pd

def read_vcf(path):
  with open(path, 'r') as f:
    lines = [l for l in f if not l.startswith('##')]
  return pd.read_csv(
    io.StringIO(''.join(lines)),
    dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
        'QUAL': str, 'FILTER': str, 'INFO': str},
    sep='\t'
  ).rename(columns={'#CHROM': 'CHROM'})

%%bash
wget https://raw.githubusercontent.com/pcingola/SnpEff/master/examples/test.chr22.ann.vcf

df_vcf = read_vcf('test.chr22.ann.vcf')

concat_vcf = pd.concat([df_vcf[['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER']], df_vcf['INFO'].str.split("|", expand=True).add_prefix('INFO_')], axis=1)

concat_vcf

# filter columns / select only subset
concat_vcf[['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO_1', 'INFO_2', 'INFO_3']]

# besides selecting columns, you can also rename them so they are better described in the final txt file
concat_vcf[['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO_1', 'INFO_2', 'INFO_3']].rename(columns={"INFO_1": "annotation_1", "INFO_2": "annotation_2"})

concat_vcf.to_csv('exported_table.csv', index=False)

Please note that the commands above are a starting point. They will likely not work as-is for your use case, although I tried to make them as general as possible. Some additional custom steps will probably be needed, because every VCF can have different content in the INFO field and may differ in its header, etc.
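
As one example of such a custom step: in a standard VCF, the INFO column is a semicolon-separated list of `KEY=value` entries, and flags appear as a bare key with no value. A minimal sketch that turns one INFO string into a dictionary, so individual keys can be picked by name instead of by position. The function name `parse_info` and the example string are illustrative, not taken from your file:

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'DP=10;DB;AF=0.5' into a dict."""
    fields = {}
    if info in ("", "."):
        return fields  # '.' means the INFO column is empty
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # flags have no '='
    return fields

example = parse_info("DP=10;DB;AF=0.5")
print(example)
```

With pandas, something like `pd.DataFrame(df_vcf['INFO'].map(parse_info).tolist())` should then give one column per INFO key. All values stay strings, so numeric keys would still need converting.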

Once you have run all the commands, the exported text table will appear on the left side of your screen, under the small folder icon.

I based one part of this answer on this gist:
https://gist.github.com/dceoy/99d976a2c01e7f0ba1c813778f9db744

In the future, you can also wrap all the logic into a Python script (runnable via SAK) or an applet.

Let me know if it helps.

Permanently deleted user
- 13 July 2023 22:08
@Ondrej Klempir Thank you so much for the detailed responses. I will try and get back to you.

Permanently deleted user
- 09 August 2023 17:49
Hi @Ondrej Klempir,

I received a "FileNotFoundError". What should I put before the file name to avoid this error?

Ondrej Klempir DNAnexus Team
- 10 August 2023 12:07
It seems to me you should first make the file "output.vcf" available on the cloud worker. As far as I can see from your screenshot, the file is located in the DNAnexus project. You basically have two options:
1. use the "dx download" command to download the file to the worker
2. use dxfuse, i.e. give the full /mnt/project/... path to specify the location of your file
You can read more about it here:
https://documentation.dnanexus.com/user/jupyter-notebooks#dxjupyterlab-environments
https://documentation.dnanexus.com/user/jupyter-notebooks#accessing-data
Ondrej Klempir DNAnexus Team
- 10 August 2023 12:09
So, assuming "output.vcf" is in the project's root directory, one option would be something like:

df_vcf = read_vcf("/mnt/project/output.vcf")
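
If the error persists, it can help to check the path before reading it. A small sketch, assuming the project is mounted at /mnt/project as described above. The helper name `project_path` is hypothetical:

```python
import os

def project_path(relative_path, mount_point="/mnt/project"):
    """Build the dxfuse path for a file stored in the DNAnexus project."""
    return os.path.join(mount_point, relative_path.lstrip("/"))

path = project_path("output.vcf")
# Report whether the worker can actually see the file at that location
print(path, "exists" if os.path.exists(path) else "not found")
```

If the file sits in a subfolder of the project, that folder has to be part of the relative path as well.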

Permanently deleted user
- 16 August 2023 07:47
Hi @Ondrej Klempir,

/mnt/project/ solved my problem. Thank you so much. I ran the rest of the code and got an exported text table. This table has 277719 rows and 3370 columns. I noticed that some columns repeat, so I can't decide which ones to choose. For example, there isn't a single column containing the gene name; many columns contain gene names. Why are there so many columns with the same information? Actually, I'm not sure whether it is exactly the same information. Do you have any idea?

Ondrej Klempir DNAnexus Team
- 25 August 2023 06:37
SnpEff is very versatile. The annotations, and therefore the number of columns per row, depend on the annotation tool you selected. In my experience, proper filtering is always needed to reduce the number of columns to the subset relevant to your analysis.
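
One likely reason for the repeated columns, assuming the VCF follows SnpEff's standard ANN format: a single variant can carry several annotations (for example one per transcript), separated by commas inside `ANN=`, and every annotation has the same pipe-separated subfields. Splitting the whole INFO string on "|" therefore repeats gene name, effect, and so on once per annotation. A minimal sketch that instead returns one record per annotation with named fields. The function name `ann_rows` and the example record are made up for illustration, and the subfield names follow the SnpEff ANN specification, so it is worth checking them against the `##INFO=<ID=ANN` header line of your own file:

```python
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank", "HGVS.c",
    "HGVS.p", "cDNA.pos/cDNA.length", "CDS.pos/CDS.length",
    "AA.pos/AA.length", "Distance", "ERRORS/WARNINGS/INFO",
]

def ann_rows(info):
    """Return one dict per SnpEff annotation found in a VCF INFO string."""
    for item in info.split(";"):
        if item.startswith("ANN="):
            annotations = item[len("ANN="):].split(",")
            return [dict(zip(ANN_FIELDS, a.split("|"))) for a in annotations]
    return []  # no ANN entry in this record

# Made-up INFO value with two annotations for the same variant
info = ("DP=50;ANN=T|missense_variant|MODERATE|GENE1|GENE1|transcript|TX1|"
        "protein_coding|2/5|c.10C>T|p.Arg4Cys|10/900|10/600|4/199||,"
        "T|upstream_gene_variant|MODIFIER|GENE2|GENE2|transcript|TX2|"
        "protein_coding||c.-100C>T|||||100|")
for row in ann_rows(info):
    print(row["Gene_Name"], row["Annotation"], row["HGVS.p"])
```

Each dict can then become one row of a table next to CHROM, POS, REF and ALT, so that gene, effect, transcript, and amino acid change each land in a single column rather than in thousands of positional ones.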

Ondrej Klempir DNAnexus Team
- 25 August 2023 06:38
You can read the specification of the SnpEff tool.

