Query of the week #8: Filtering a BGEN file and exporting geno data to a Pandas df
After a one-week pause (we were enjoying the public holidays), we are back with Query of the week numero 8. This week we follow up on Query #7. There was a great discussion below that post, so this week's post elaborates on one way to load geno data and prepare it for a Pandas dataframe. In that discussion, a community member mentioned two approaches (PLINK and Hail). Today I am sharing my notes and one example of how to query a BGEN file directly.
We are going to explore a UKB BGEN file along with its sample file, and we will be using the bgen Python package (https://github.com/jeremymcrae/bgen). Its README says: "This has been optimized for UKBiobank bgen files (i.e. bgen version 1.2 with zlib compressed 8-bit genotype probabilities, but the other bgen versions and zstd compression have also been tested using example bgen files)". Among other things, it offers a set of methods for interacting with a BGEN file, such as with_rsid, at_position, varids, rsids, chroms and positions. See the GitHub page (https://github.com/jeremymcrae/bgen) for more details and give it a star if you like it! On UKB-RAP, you can easily install it within JupyterLab, for example with pip install bgen.
Now to the example itself and what worked for me. I was able to:
a) read a BGEN file together with its sample file
b) list all rsIDs in the BGEN file
c) get the BGEN geno data for a selected rsID for all samples in that BGEN file
d) subsample according to a filter
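For anyone who wants to try steps a) to c), here is a minimal sketch of how they could look with the package at github.com/jeremymcrae/bgen. Treat it as my reading of the README rather than a definitive recipe: I am assuming the reader class is called BgenReader (older releases may name it BgenFile), that the sample file is the second argument, and that with_rsid may return either one variant or a list of variants. The helper names open_reader and dosage_by_sample and the file paths are made up for illustration.

```python
def open_reader(bgen_path, sample_path):
    """Step a: open a BGEN file together with its sample file."""
    # Imported lazily so the helper below works without the package installed.
    # Assumption: the class is BgenReader; older releases may call it BgenFile.
    from bgen import BgenReader
    # Assumption: the sample file path is the second positional argument.
    return BgenReader(bgen_path, sample_path)


def dosage_by_sample(reader, rsid):
    """Steps b and c: return {sample_id: minor-allele dosage} for one rsID."""
    # Step b: list every rsID in the file and check that ours is present.
    if rsid not in set(reader.rsids()):
        raise KeyError(f"{rsid} not found in this BGEN file")
    # Step c: fetch the variant for that rsID.
    hit = reader.with_rsid(rsid)
    # Assumption: depending on the version, this is one variant or a list.
    variant = hit[0] if isinstance(hit, (list, tuple)) else hit
    # One dosage value per sample, in the same order as reader.samples.
    return {s: float(d) for s, d in zip(reader.samples, variant.minor_allele_dosage)}


# Usage (hypothetical paths and rsID, not run here):
# reader = open_reader("chr21.bgen", "chr21.sample")
# dosages = dosage_by_sample(reader, "rs123")
```

Keeping the file opening separate from the lookup means the lookup only relies on samples, rsids() and with_rsid(), so it is easy to adapt if your installed version differs.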
With this tool and the steps described above, you get a NumPy representation of the data, which can easily be converted to a Pandas dataframe. Looking forward to hearing about your experience with processing geno data!
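The NumPy-to-Pandas conversion and the subsampling in step d) could look roughly like the sketch below. It uses toy data rather than a real BGEN file, and it assumes a biallelic, diploid, unphased variant whose probability array has one row per sample with columns ordered P(AA), P(AB), P(BB). Please check that layout against your own data. The function and column names are hypothetical.

```python
import numpy as np
import pandas as pd


def probs_to_frame(sample_ids, probs):
    """Turn an (n_samples, 3) genotype-probability array into a dataframe."""
    probs = np.asarray(probs, dtype=float)
    if probs.shape != (len(sample_ids), 3):
        raise ValueError("expected one row of three probabilities per sample")
    df = pd.DataFrame(
        probs,
        index=pd.Index(sample_ids, name="sample_id"),
        columns=["p_aa", "p_ab", "p_bb"],  # assumed column order
    )
    # Expected count of the second allele: 0*P(AA) + 1*P(AB) + 2*P(BB).
    df["alt_dosage"] = probs @ np.array([0.0, 1.0, 2.0])
    return df


def subsample(df, keep_ids):
    """Step d: keep only the samples whose ID is in keep_ids."""
    return df.loc[df.index.isin(keep_ids)]


# Toy example: three samples, one variant.
toy = probs_to_frame(["s1", "s2", "s3"], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
kept = subsample(toy, ["s3", "s1"])  # rows keep their original file order
```

Filtering with isin silently ignores IDs that are not in the file, which is convenient when the filter list comes from a larger cohort table.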
Comments
Very nice! Glad to learn there is a package for this.
I personally like to use bgen-reader: https://bgen-reader.readthedocs.io/en/latest/quickstart.html
The documentation is good, and you can code in either NumPy or Dask style.
The bgen package can report the compression method though, so I also use that from time to time.
Dask looks great, will definitely read more about it.