How to merge VCF files with Swiss Army Knife?

I am using Swiss Army Knife to merge VCF files with bcftools. I am using the following command:

cmd: bcftools merge -m none --file-list files_to_merge.txt -Oz -o MergedFiles.vcf.gz

The input file in[0]: files_to_merge.txt contains the file names, one per row. I get the following error message:

Swiss Army Knife STDERR Failed to open ukb23352_c10_b0_v1.vcf.gz: No such file or directory

(ukb23352_c10_b0_v1.vcf.gz is the first file in my list.)

I have tried adding the full path to the file names, but it did not help. I then tried the web GUI, adding the file names as input together with the list of files, and got a different error this time: 'Could not retrieve index file for 'ukb23352_c10_b0_v1.vcf.gz'.

I would really appreciate any suggestions.

Thank you,
Marianna

Comments

8 comments

  • Comment author
    Ondrej Klempir DNAnexus Team

    1) From the error message, it seems to me that you did not provide the file ukb23352_c10_b0_v1.vcf.gz as input for Swiss Army Knife. You can either provide it as input when you set up your job or you can reference the file using dxfuse, i.e. /mnt/project/...

     

    How did you specify the "full path"?

     

    2) The error 'Could not retrieve index file for 'ukb23352_c10_b0_v1.vcf.gz'' suggests to me that you will also need to provide the corresponding index file for the given vcf.gz.
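
    As a minimal sketch of point 2, the helper below reports which files in a list have no index next to them. The function name list_missing_indexes is hypothetical, and the sketch assumes each index sits beside its VCF with a .tbi or .csi suffix and that the files are already bgzip-compressed (the .vcf.gz suffix suggests this but does not guarantee it).

```shell
# Print every path from a list file (one path per line) that has
# neither a .tbi nor a .csi index file next to it.
list_missing_indexes() {
  while IFS= read -r vcf; do
    # Skip empty lines in the list.
    [ -n "$vcf" ] || continue
    if [ ! -e "$vcf.tbi" ] && [ ! -e "$vcf.csi" ]; then
      printf '%s\n' "$vcf"
    fi
  done < "$1"
}

# Usage (assumes bcftools is installed and the VCF directory is writable):
#   list_missing_indexes files_to_merge.txt | while IFS= read -r vcf; do
#     bcftools index -t "$vcf"   # -t writes a .tbi index
#   done
```

    If the list comes back empty, the indexes already exist and only need to be made available to the job alongside the VCFs.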

  • 1. I provided a list of input files containing the file names, one per row. I tried with and without the full path, since the files were in the same folder I was launching the command from, but it made no difference.

    2. I now see that all the files will need to be unzipped and then re-zipped with bcftools in order to generate the index files, before they can be merged and converted to plink files, which is my main aim. This seems complicated and likely to require a lot of memory, considering the large number of files. Is there a simpler way to do it?

    Marianna Sanna
  • Comment author
    Anastazie Sedlakova DNAnexus Team

    @Marianna Sanna, can you please check that when you write the full path, it starts with /mnt/project/?

    e.g.

    /mnt/project/Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF/ukb23352_c10_b0_v1.vcf.gz

    /mnt/project/Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF/ukb23352_c11_b0_v1.vcf.gz

  • Hi Anastazie, I am providing the full path in my file list. I think the mistake was in adding the "\" escapes, as in "/mnt/project/Bulk/Whole\ genome\ sequences". It works fine now without them.

    Another issue is that I need to add each single file as input. This is fine when testing with a few files, but what about a large number of files? It would not be feasible to input each of them one by one. Is there a more efficient way of doing it?

    Thanks,
    Marianna Sanna
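
    The fix described here fits the fact that a file list is read as literal paths, one per line, so shell-style backslash escapes become part of the file name. A small sketch for cleaning an existing list follows; the function name unescape_list and the output file name in the usage comment are hypothetical.

```shell
# Remove the backslash in front of each escaped space, so that
# "Whole\ genome" becomes "Whole genome". Reads stdin, writes stdout.
unescape_list() {
  sed 's/\\ / /g'
}

# Usage:
#   unescape_list < files_to_merge.txt > files_to_merge.fixed.txt
```

    Escaping or quoting is still needed when such a path is typed directly on a shell command line; it is only inside the list file that the path should appear unescaped.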
  • If you are using the dx tools on Linux or a Mac via the command line, you can create a files_to_merge.txt file by running these commands on your local computer:

     

     

    dx ls Bulk/Whole\ genome\ sequences/Whole\ genome\ GraphTyper\ joint\ call\ pVCF/*gz > tempfile.txt ;

     

    sort tempfile.txt | awk '{print "/mnt/project/Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF/"$1}' > full_path_allvcf.txt

     

     

    In this case I am merging everything, but you would probably want to run this for each chromosome separately, for example using *_c12_*gz instead of just *gz.
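
    The per-chromosome variant could be sketched roughly as below. The function name make_merge_list and the output file name in the usage comment are hypothetical, and the sketch assumes file names follow the ukb23352_cN_bM_v1.vcf.gz pattern seen earlier in this thread.

```shell
# Read bare file names on stdin, keep those for one chromosome,
# sort them, and prefix each with the dxfuse mount path.
make_merge_list() {
  chrom="$1"
  prefix="/mnt/project/Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF"
  # The trailing underscore stops "_c1_" from also matching "_c10_".
  grep "_c${chrom}_" | sort | awk -v p="$prefix" '{print p "/" $0}'
}

# Usage (assumes the dx toolkit is installed and you are logged in):
#   for chrom in 1 2 3; do
#     make_merge_list "$chrom" < tempfile.txt > "c${chrom}_mergelist.txt"
#   done
```

    The paths are written without quotes or backslashes, since the list is read literally.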

     

  • @Phil Greer

    Can you explain a little bit more?

    How do I give these multiple inputs to bcftools in Swiss Army Knife?

     

    Thanks for your help

     

    Best,

    Vignesh

  • @Vignesh Arunachalam,

     

    First I wrote and published a script to create the list of all the WGS pVCF segments per chromosome here:

    https://github.com/pjgreer/ukb-rap-tools/blob/main/ukb-vcf-list.sh

     

    This script is run on your local computer and generates a list of every pVCF segment for each chromosome, producing 23 files in total (chromosomes 1-22 plus X).

     

    You must then upload all 23 files to a folder in your UKB RAP project via "dx upload" or using the web interface.

     

    In your dx run command, you would then add the path to the merge list file as an input field: -iin=/data//ukb_cX_vcf_full_path_mergelist.txt
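
    Put together, the job submission might look roughly like the sketch below. It only prints the command so it can be checked before running. The function name build_merge_cmd, the folder in the usage comment, and the output file name are hypothetical, and the sketch assumes Swiss Army Knife places the -iin file in the job's working directory so the command can refer to it by its base name.

```shell
# Print a dx run command that passes one merge list as input and
# merges the files it names. Arguments: list path in the project,
# output file name.
build_merge_cmd() {
  list_path="$1"
  out_name="$2"
  # Inside the job, the list is referred to by its base name.
  list_name="$(basename "$list_path")"
  printf 'dx run swiss-army-knife -iin="%s" -icmd="bcftools merge -m none --file-list %s -Oz -o %s" -y\n' \
    "$list_path" "$list_name" "$out_name"
}

# Usage (hypothetical folder and output name):
#   build_merge_cmd "/my_lists/ukb_c10_vcf_full_path_mergelist.txt" "c10_merged.vcf.gz"
```

    Because every line of the list starts with /mnt/project/, the individual VCFs should not need to be added as separate inputs.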

     

    Hope this is more clear.

     

    -Phil

     

     

  • @Phil Greer

     

    Thanks, Phil

    It's really helpful.

    I was doing it the same way, but in my awk command I had added quotes around the path, and that was throwing an error.

    I have fixed it now.

     

     

    Thanks for your help

     

     

    Cheers,

    Vignesh

