How to filter snpEff.vcf.gz file on UKB RAP?

Permanently deleted user

08 May 2023 00:00
7 comments

I've been working with UKB WES data on the RAP and I have a snpEff.vcf.gz file. The file is very large, and I'm only interested in specific genes and specific information. How can I filter the file on the UKB RAP so that it keeps only what I need?

Comments


Ondrej Klempir DNAnexus Team
- 09 May 2023 07:40
Hi @Burcu Çevik,

What I would do is run JupyterLab and explore the VCF there.
If you prefer bash, one good option is vcftools [https://vcftools.sourceforge.net/man_latest.html].
If you prefer Python, my favourite tool is PyVCF [https://pyvcf.readthedocs.io/en/latest/FILTERS.html].

Besides interactive work in JupyterLab, vcftools is also included in the Swiss Army Knife app, which is supported on UKB-RAP.
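
For exploring in a JupyterLab terminal, here is a minimal sketch of filtering by gene name using only gzip and awk. It assumes the snpEff annotations are stored in the INFO column under an ANN= key, with the gene name as the fourth |-separated subfield (older snpEff versions use an EFF= key instead, which this does not handle). The demo file, the gene names GENE_A and GENE_B, and the positions are all made up for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical two-record demo VCF; gene names and positions are invented.
workdir=$(mktemp -d)
demo="$workdir/demo.snpEff.vcf.gz"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tG\tA\t.\tPASS\tAC=1;ANN=A|missense_variant|MODERATE|GENE_A|ID_A|transcript\n'
  printf 'chr1\t200\t.\tC\tT\t.\tPASS\tAC=2;ANN=T|synonymous_variant|LOW|GENE_B|ID_B|transcript\n'
} | gzip -c > "$demo"

gene="GENE_A"                      # gene of interest (placeholder name)
filtered="$workdir/filtered.vcf"

gzip -dc "$demo" | awk -F'\t' -v gene="$gene" '
  /^#/ { print; next }                      # keep all header lines
  {
    n = split($8, info, ";")                # INFO is column 8
    for (i = 1; i <= n; i++) {
      if (info[i] !~ /^ANN=/) continue      # only look at the ANN key
      sub(/^ANN=/, "", info[i])
      m = split(info[i], ann, ",")          # one entry per annotation
      for (j = 1; j <= m; j++) {
        split(ann[j], f, "|")
        if (f[4] == gene) { print; next }   # subfield 4 is the gene name
      }
    }
  }
' > "$filtered"

# Show the records that were kept (headers omitted).
grep -v '^#' "$filtered"
```

This streams the whole file, so on a full exome VCF an indexed region query with bcftools or vcftools will likely be much faster; the sketch is mainly useful for checking what the annotations look like.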

Chai Fungtammasan DNAnexus Team
- 09 May 2023 16:52
I personally like to use bcftools in swiss-army-knife.

Permanently deleted user
- 09 June 2023 15:36
@Ondrej Klempir @Chai Fungtammasan Thanks for your answers. I tried bcftools in SAK with a BED file to filter for specific genes, but I received an error. The log file is below. I don't know where I went wrong. Do you have any suggestions?

Failure from origin-job
--------------------------------
{
  "id": "job-GVQ8bZjJbBpgxgJBY2q8zk6X",
  "name": "Swiss Army Knife",
  "function": "main",
  "stage": null,
  "analysis": null,
  "executable": "app-GKyyzJQ951j4Bkfq4jFkGX1K",
  "executableName": "swiss-army-knife",
  "failureReason": "AppError",
  "failureMessage": "Error while running the command (please refer to the job log for more information)."
}

Origin-job Inputs
--------------------
{
  "in": [
    {
      "$dnanexus_link": {
        "project": "project-GKQbbYQJbBpk5YXf0JpzGXKK",
        "id": "file-GVQ8b50JbBpx6QJf70YkPPv2"
      }
    },
    {
      "$dnanexus_link": {
        "project": "project-GKQbbYQJbBpk5YXf0JpzGXKK",
        "id": "file-GVBJvy0JK69B1QFPkbb12k12"
      }
    }
  ],
  "cmd": "bcftools view -R deneme.bed 1798762.snpEff.vcf.gz > filtered.vcf",
  "mount_inputs": false
}

View log of failed sub-job
--------------------------------
Logging initialized (priority)
Downloading bundled file resources.tar.gz
>>> Unpacking resources.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file qctool.tar.gz
>>> Unpacking qctool.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file plato.tar.gz
>>> Unpacking plato.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file bedtools.tar.gz
>>> Unpacking bedtools.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file htslib.tar.gz
>>> Unpacking htslib.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file java.tar.gz
>>> Unpacking java.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file plink.tar.gz
>>> Unpacking plink.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file r.tar.gz
>>> Unpacking r.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file sambamba.tar.gz
>>> Unpacking sambamba.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file seqtk.tar.gz
>>> Unpacking seqtk.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file vcflib.tar.gz
>>> Unpacking vcflib.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file vcftools.tar.gz
>>> Unpacking vcftools.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file plink2.tar.gz
>>> Unpacking plink2.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file regenie.tar.gz
>>> Unpacking regenie.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file bolt-lmm_asset.tar.gz
>>> Unpacking bolt-lmm_asset.tar.gz to /
tar: Removing leading `/' from member names
Downloading bundled file bgen.tar.gz
>>> Unpacking bgen.tar.gz to /
tar: Removing leading `/' from member names
dxpy/0.346.0 (Linux-5.15.0-1031-aws-x86_64-with-glibc2.29)
bash running (job ID job-GVQ8bZjJbBpgxgJBY2q8zk6X)
downloading file: file-GVQ8b50JbBpx6QJf70YkPPv2 to filesystem: /home/dnanexus/in/in/0/deneme.bed
downloading file: file-GVBJvy0JK69B1QFPkbb12k12 to filesystem: /home/dnanexus/in/in/1/1798762.snpEff.vcf.gz
Using dxfuse version v1.0.0
The log file is located at /root/.dxfuse/dxfuse.log
starting fs daemon
wait for ready
Daemon started successfully
Downloading files using 4 threads
+ [[ '' == '' ]]
+ eval 'bcftools view -R deneme.bed 1798762.snpEff.vcf.gz > filtered.vcf'
++ bcftools view -R deneme.bed 1798762.snpEff.vcf.gz
[E::idx_find_and_load] Could not retrieve index file for '1798762.snpEff.vcf.gz'
Failed to read from 1798762.snpEff.vcf.gz: could not load index
END_LOG

Ondrej Klempir DNAnexus Team
- 12 June 2023 12:20
The error message "[E::idx_find_and_load] Could not retrieve index file for '1798762.snpEff.vcf.gz'" means that bcftools needs an index file for your VCF, because region filtering with -R requires one. Building the index with tabix or the bcftools index command should solve the problem; you can do this on the same worker, before the bcftools view command.

Not tested on my end, but I hope these threads are relevant:
https://github.com/samtools/bcftools/issues/129
https://www.biocomputix.com/post/bctools-index-how-to-create-index-for-vcf-files
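
The suggestion above could be sketched as a single command string for the Swiss Army Knife "cmd" input, which builds the index on the worker and then filters. This is untested on the RAP and assumes the VCF is bgzip-compressed (bcftools index will not accept plain gzip); the file names are taken from the job log, and the output name filtered.vcf.gz is only an example.

```shell
#!/bin/sh
set -eu

# File names as they appear in the job log.
vcf="1798762.snpEff.vcf.gz"
bed="deneme.bed"

# Step 1: build a .tbi index next to the VCF (-t selects the tabix format).
# Step 2: keep only records overlapping the BED regions; -Oz writes
#         bgzip-compressed output so the result stays small.
cmd="bcftools index -t ${vcf} && bcftools view -R ${bed} -Oz -o filtered.vcf.gz ${vcf}"

# Print the command so it can be pasted into the "cmd" input of the job.
echo "$cmd"
```

Note that bcftools only treats the regions file as 0-based BED when its name ends in .bed or .bed.gz; otherwise the coordinates are read as 1-based.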

Permanently deleted user
- 14 June 2023 10:11
Hi @Ondrej Klempir,
Is the index file the 1798762.snpEff.vcf.gz.tbi file? I already have that file, so I don't need to build one.

Ondrej Klempir DNAnexus Team
- 14 June 2023 10:14
Yes, I would say so. Sounds good. Try adding it as an additional input file to your Swiss Army Knife job, alongside the VCF and the BED file.
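
From the command line, that job might look roughly like the sketch below. It only prints the dx invocation rather than running it, and it assumes the three file names resolve in the currently selected project and folder; the output name filtered.vcf.gz is only an example. The idea is that the .tbi is passed as a third input so that bcftools can find it alongside the VCF on the worker.

```shell
#!/bin/sh
set -eu

# File names from the thread; the .tbi is the existing index for the VCF.
vcf="1798762.snpEff.vcf.gz"
bed="deneme.bed"

# Command executed inside Swiss Army Knife; no index step is needed
# because the .tbi is supplied as an input.
sak_cmd="bcftools view -R ${bed} -Oz -o filtered.vcf.gz ${vcf}"

# Full dx invocation with three inputs: BED, VCF, and the VCF index.
dx_line="dx run swiss-army-knife -iin=${bed} -iin=${vcf} -iin=${vcf}.tbi -icmd='${sak_cmd}' -y"

# Dry run: print the command for review before pasting it into a terminal.
echo "$dx_line"
```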

Permanently deleted user
- 14 June 2023 12:01
Thank you Ondrej. This time I did not receive an error message.
