Is it possible to dynamically input files in swiss-army-knife functions?
Hi all, I have some code using bcftools to make vcf files per gene (with the output vcf files dropping genotypes) at the moment I have some working code but I'd like to see if it's possible to convert this to submit a series of jobs running the code per gene instead (sorry in advance for the long post).
I have a list of genes with their chrom, start and end, and I am iterating over them one by one in a jupyterlab environment on dnanexus using this code:
# gene_boundaries is a CSV format with columns: gene,chrom,start,end
gene_boundaries = pd.read_csv('gene_boundaries.csv')
out_path = '/opt/notebooks/temp'
if not os.path.isdir(out_path):
os.mkdir(out_path)
bcftools_path = '/opt/notebooks/bcftools'
vcf_path = '/mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release'
for i in range(gene_boundaries.shape[0]):
chrom = gene_boundaries.loc[i, 'chromosome_name']
if not pd.isna(chrom):
chrom = str(chrom).strip()
start = str(gene_boundaries.loc[i, 'ext_start'])
end = str(gene_boundaries.loc[i, 'ext_end'])
gene = gene_boundaries.loc[i, 'hgnc_symbol']
print('Gene and chromosome:', gene, chrom)
files_all = os.listdir(vcf_path)
files_chrom = [f for f in files_all if f.startswith(f'ukb23157_c{chrom}_') and f.endswith('vcf.gz')]
for j in files_chrom:
print(j)
new_name = j.replace(".gz", "")
os.system(f'{bcftools_path}/bcftools '+
'view ' +
f'--regions chr{chrom}:{start}-{end} ' +
'--drop-genotypes '+
f'"{vcf_path}/{j}" ' +
f'--output {out_path}/ukb_whole_cohort_exome_{gene}_{chrom}_{new_name} '+
'--output-type v')This is slowly working, processing 1 gene at a time and taking from the many block vcf files per chromosome.
If possible, I am trying to figure out how I can submit this as a dnanexus job running per gene. However, I am getting stuck on the issue where per chromsome there are a variable number of VCF files (e.g., the files I am using look like this: /Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c1_b48_v1.vcf.gz with the b numbers, like b48, being variable per chromosome).
Does this look like something I could run instead using swiss-army-knife?
I am trying to develop something like this:
#!/bin/bash
project="project-ID"
file_dir="Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release"
data_field="ukb23157"
outpath_dnanexus='/genes_vcf'
while IFS=, read -r gene chrom start end; do
# Finding all the vcf block files per my gene's chromosome
mapfile -t files_all < <(dx find data --name "${data_field}_c${chrom}_b*_*_v1.vcf.gz" --path "${project}:${file_dir}" --brief)
# Making output filename
for vcf_path in "${files_all[@]}"; do
vcf_file=$(dx describe "${vcf_path}" --name)
base_name=$(basename "${vcf_file}")
chr_num=$(echo "${base_name}" | sed -E 's/^.*_c([0-9XY]+)_b.*_v1\.vcf\.gz$/\1/')
if [[ "${chr_num}" == "${chrom}" ]]; then
new_name="${gene}_${chrom}_${start}_${end}_${base_name%.vcf.gz}"
# Running bcftools
run_bcftools="bcftools \
view \
--regions chr${chrom}:${start}-${end} \
--drop-genotypes ${vcf_path} \
--output ${outpath_dnanexus}/ukb_whole_cohort_exome_${new_name}.vcf \
--output-type v"
dx run swiss-army-knife -iin="${vcf_path}" \
-icmd="${run_bcftools}" \
--tag="bcftools_c${chrom}_${gene}" \
--instance-type "mem1_hdd1_v2_x8" \
-imount_inputs=true --destination="${project}:${outpath_dnanexus}" --verbose --yes
fi
done
done < <(tail -n +2 /gene_boundaries_TEST.csv)
There's probably lots still to fix with this code, but the main thing is that the -iin="${vcf_path}" path is the part that would need to be dynamic and consider reading in all the vcf block/b files per a chromosome - is this possible?
Sorry again for the length and complexity going on here, any help in general is appreciated!
Comments
3 comments
The “b”-numbering of pVCFs lists all the variants within a chunk of chromosome co-ordinates (typically 500000-1000000 bp per chunk for field 23157, although gaps may appear for non-coding regions). Note that areas of low variability are likely to produce header-only VCFs.
We are currently working on releasing an index file for field 23157, as well as creating a GitHub notebook so researchers can create index files from other pVCF fields.
If this does not answer your query, please can you restate your question more concisely and note that I am unsure what you mean by “dynamically input files”.
Thank you for your reply and your help, I suspect that the github notebook will be very useful - thank you for taking the time to develop this!
And sorry for the confusion on my end, I may not be using the word dynamic right for my use-case.
As I understand it, just from looking at the files on the platform, each chromosome has a different number of chunks. So when I iterate over each gene in my code it needs to be working on a variable number of vcf input files/chunks for that gene's chromosome.
I was wondering if there's a way to set the
-iin="${vcf_path}"in my code so it's actually going to read in all vcfs for that chromosome, regardless of the number of files/chunks (and in that way be dynamic to the number of inputs per chromosome, not sure if I'm using that term correctly), so the bcftools function can run on all those files.There's no way to use wildcards for this that I know, you could build an app/applet/workflow!
But my hack for this is to print out the lines using a for loop then paste into the SAK block, eg:
for i in {1..10}; do echo '"--iin=file'${i}'.vcf.gz"' \\; doneEither that or write a script to list the start positions of each VCF and only work with the ones appropriate for each gene.
Hope that helps
Please sign in to leave a comment.