How to merge VCF files with Swiss Army Knife?

28 July 2022 00:00
8 comments

I am using Swiss Army Knife to merge VCF files with bcftools. I am using the following command:

cmd: bcftools merge -m none --file-list files_to_merge.txt -Oz -o MergedFiles.vcf.gz

The input file, in[0]: files_to_merge.txt, contains the file names, one per row. I get the following error message:

Swiss Army Knife STDERR: Failed to open ukb23352_c10_b0_v1.vcf.gz: No such file or directory

(ukb23352_c10_b0_v1.vcf.gz is the first file in my list.) I have tried adding the full path to the file names, but it did not help. I then tried the web GUI, adding the file names as input together with the list of files, and got a different error this time:

Could not retrieve index file for 'ukb23352_c10_b0_v1.vcf.gz'

I would really appreciate any suggestions.

Thank you,
Marianna

Comments

Ondrej Klempir DNAnexus Team
- 01 August 2022 13:52
1) From the error message, it seems to me that you did not provide the file ukb23352_c10_b0_v1.vcf.gz as input for Swiss Army Knife. You can either provide it as input when you set up your job or you can reference the file using dxfuse, i.e. /mnt/project/...

How did you specify the "full path"?

2) The error "Could not retrieve index file for 'ukb23352_c10_b0_v1.vcf.gz'" suggests that you will also need to provide the corresponding index file for each vcf.gz.

Former User of DNAx Community_56
- 02 August 2022 09:27
1. I provided a list of input files containing the file names, one per row. I tried with and without the full path, since the files were in the same folder I was launching the command from, but it did not make any difference.

2. I now see that all the files will need to be unzipped and then re-zipped with bcftools in order to generate the index files, and only then merged and converted to PLINK files, which is my main aim. This seems complicated and likely to require a lot of memory, considering the large number of files. Is there a simpler way to do it?

Marianna Sanna

Anastazie Sedlakova DNAnexus Team
- 02 August 2022 11:09
@Marianna Sanna, can you please check that when you write the full path, you include the /mnt/project/ prefix?
For example:
/mnt/project/Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF/ukb23352_c10_b0_v1.vcf.gz
/mnt/project/Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF/ukb23352_c11_b0_v1.vcf.gz

Former User of DNAx Community_56
- 02 August 2022 15:47
Hi Anastazie,

I am providing the full path in my file list. I think the mistake was in adding the "\" escapes, as in "/mnt/project/Bulk/Whole\ genome\ sequences". It works fine now without them.

Another issue is that I need to add every single file as input. This is fine when testing with a few files, but what about a large number of files? It would not be feasible to input each of them one by one. Is there a more efficient way of doing it?

Thanks,
Marianna Sanna
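Since both errors in this thread came down to a listed path that could not be opened or an index that could not be found, a quick check of the list before calling bcftools may save a failed job. The following is only a sketch: check_merge_list is a hypothetical helper name, and it assumes the index sits next to each VCF as either .tbi or .csi. It reads each line literally, so a path written with "\ " escapes is reported as missing, which matches the behaviour described above.

```shell
# check_merge_list LIST_FILE   (hypothetical helper, not part of Swiss Army Knife)
# Prints every listed path that does not exist, or that has no .tbi/.csi
# index next to it. Returns non-zero if any problem was found.
check_merge_list() {
    list_file="$1"
    status=0
    # IFS= and -r keep spaces and backslashes exactly as written in the list.
    while IFS= read -r path || [ -n "$path" ]; do
        # Skip empty lines.
        [ -z "$path" ] && continue
        if [ ! -e "$path" ]; then
            echo "missing file: $path"
            status=1
        elif [ ! -e "$path.tbi" ] && [ ! -e "$path.csi" ]; then
            echo "missing index: $path"
            status=1
        fi
    done < "$list_file"
    return "$status"
}
```

One possible use is to run the check first in the Swiss Army Knife command, for example check_merge_list files_to_merge.txt && bcftools merge ..., so the job stops with a readable message instead of partway through the merge.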

Former User of DNAx Community_28
- 13 August 2022 20:58
If you are using the dx tools from the command line on Linux or macOS, you can create a file list like files_to_merge.txt by running these commands on your local computer:

dx ls Bulk/Whole\ genome\ sequences/Whole\ genome\ GraphTyper\ joint\ call\ pVCF/*gz > tempfile.txt ;

sort tempfile.txt | awk '{print "/mnt/project/Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF/"$1}' > full_path_allvcf.txt

In this case I am listing everything, but you would probably want to loop this over each chromosome separately, for example with *_c12_*gz instead of just *gz.
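The per-chromosome variant could be sketched as a small function that filters the output of dx ls. This is an illustration only: make_merge_list and the output file name are hypothetical, and the pattern assumes the segments are named like ukb23352_c10_b0_v1.vcf.gz, as in the examples earlier in this thread.

```shell
# dxfuse location of the pVCF folder, as given earlier in the thread.
PVCF_DIR="/mnt/project/Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF"

# make_merge_list CHROM NAMES_FILE   (hypothetical helper)
# Reads bare file names (one per line) from NAMES_FILE, keeps the .vcf.gz
# segments of chromosome CHROM, sorts them by block number, and prints
# the full /mnt/project paths without any backslash escapes.
make_merge_list() {
    chrom="$1"
    names_file="$2"
    grep -E "_c${chrom}_b[0-9]+_v[0-9]+\.vcf\.gz$" "$names_file" \
        | sort -t_ -k3.2,3n \
        | awk -v dir="$PVCF_DIR" '{ print dir "/" $0 }'
}

# Example, assuming tempfile.txt was produced by the dx ls command above:
# make_merge_list 12 tempfile.txt > ukb_c12_mergelist.txt
```

The numeric sort on the block field keeps b2 ahead of b10, which a plain sort would not do. Anchoring the pattern on .vcf.gz at the end of the line also drops any index files that end up in the listing.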

Former User of DNAx Community_16
- 03 October 2022 00:21
@Phil Greer
Can you explain a little bit more?
How do I give these multiple inputs to bcftools in Swiss Army Knife?

Thanks for your help

Best,
Vignesh

Former User of DNAx Community_28
- 03 October 2022 12:52
@Vignesh Arunachalam,

First, I wrote and published a script that creates the list of all the WGS pVCF segments per chromosome here:
https://github.com/pjgreer/ukb-rap-tools/blob/main/ukb-vcf-list.sh

This script is run on your local computer and generates a list of every pVCF segment for each chromosome, producing 23 files in total (chromosomes 1-22 plus X).

You must then upload all 23 files to a folder in your UKB RAP project via "dx upload" or using the web interface.

In your dx run command, you would then add the path to the merge list file as an input field: -iin=/data//ukb_cX_vcf_full_path_mergelist.txt

Hope this is more clear.

-Phil
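Putting the steps above together, the job submission for one chromosome might look like the sketch below. It only prints the dx run command so it can be reviewed before anything is submitted. print_merge_job, the /data/ folder and the output name merged_cN.vcf.gz are assumptions for illustration, and the bcftools options are the ones from the original question. The merge list is referenced by its bare file name inside the command, as in the original question.

```shell
# print_merge_job CHROM LIST_PATH   (hypothetical helper)
# Prints, but does not run, a dx run command that passes the uploaded
# merge list as the only input and merges the files it names.
print_merge_job() {
    chrom="$1"
    list_path="$2"
    # The bcftools command refers to the list by its bare file name.
    list_name=$(basename "$list_path")
    cmd="bcftools merge -m none --file-list ${list_name} -Oz -o merged_c${chrom}.vcf.gz"
    printf 'dx run swiss-army-knife -iin="%s" -icmd="%s" -y\n' "$list_path" "$cmd"
}

# Example with an assumed project path for the chromosome 12 list:
print_merge_job 12 "/data/ukb_c12_vcf_full_path_mergelist.txt"
```

Because the list contains /mnt/project/... paths, the VCF segments themselves are read through dxfuse and do not have to be added as inputs one by one, which is the point of the approach described above.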

Former User of DNAx Community_16
- 04 October 2022 07:05
@Phil Greer

Thanks, Phil.
It's really helpful.
I was doing it the same way, but in the awk step I had added quotes around the path, and that was throwing an error.
I have fixed it now.

Thanks for your help

Cheers,
Vignesh
