Query of the week #8: Filtering BGEN file and exporting geno data to Pandas df

Ondrej Klempir DNAnexus Team

14 April 2023 00:00
3 comments

After a one-week pause (we were enjoying public holidays), we are back with Query of the week numero 8. This week we follow up on Query #7. There was a great discussion below that post, and this week's post elaborates on one way to load geno data and process it into a Pandas dataframe. A commenter in that discussion mentioned two approaches (PLINK and Hail). Today I am sharing my notes and one example of how to query a BGEN file directly.

We are going to explore a UKB BGEN file along with its sample file, and we will be using the bgen Python package and its BgenReader class (https://github.com/jeremymcrae/bgen). The authors of this tool say: "This has been optimized for UKBiobank bgen files (i.e. bgen version 1.2 with zlib compressed 8-bit genotype probabilities, but the other bgen versions and zstd compression have also been tested using example bgen files)". Among other things, it offers a set of methods for interacting with a BGEN file, such as with_rsid, at_position, varids, rsids, chroms and positions. See the GitHub page (https://github.com/jeremymcrae/bgen) for more details, and give them a star if you like it! On UKB-RAP, you can easily install it within JupyterLab.
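To make the method list above a bit more concrete, here is a minimal sketch that gathers per-variant metadata into plain rows. It assumes the reader object exposes `rsids()`, `chroms()` and `positions()` as named above, each returning one entry per variant in file order. The helper name `variant_table` and the file paths in the comments are hypothetical, not part of the package.

```python
def variant_table(reader):
    """Return one (rsid, chrom, pos) tuple per variant, in file order.

    `reader` is assumed to behave like bgen.BgenReader, i.e. to offer
    rsids(), chroms() and positions() with one entry per variant.
    """
    rsids = list(reader.rsids())
    chroms = list(reader.chroms())
    positions = list(reader.positions())
    if not (len(rsids) == len(chroms) == len(positions)):
        raise ValueError("metadata lists differ in length")
    return list(zip(rsids, chroms, positions))


# Possible usage in a JupyterLab cell (paths are placeholders):
# from bgen import BgenReader
# reader = BgenReader("my_file.bgen", sample_path="my_file.sample")
# print(variant_table(reader)[:5])
```

Keeping the helper independent of the file paths makes it easy to try on a small example BGEN file before pointing it at a full UKB chromosome.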

Now to the example itself and what worked for me. I was able to:

a) read a BGEN file together with its sample file

b) list all rsids in BGEN

[Screenshot: notebook cells for steps a) and b)]

c) get BGEN geno data for a selected rsid for all samples in that BGEN

d) subsample according to filter

[Screenshot: notebook cells for steps c) and d)]
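For readers who cannot see the screenshots, steps a) to d) could look roughly like the sketch below. This is not the exact notebook code. It assumes that `BgenReader` accepts a `sample_path` argument, that the reader has a `samples` attribute, and that a variant exposes `minor_allele_dosage`, as described in the package README. The helper names and the idea of filtering by a list of sample IDs are my own illustration, since the original filter is not spelled out.

```python
def open_reader(bgen_path, sample_path):
    """Step a: open a BGEN file together with its sample file."""
    # Third-party package; in JupyterLab it can be installed with `pip install bgen`.
    from bgen import BgenReader
    return BgenReader(bgen_path, sample_path=sample_path)


def dosage_for_rsid(reader, rsid):
    """Step c: return (sample_ids, dosages) for one rsid across all samples."""
    hit = reader.with_rsid(rsid)
    # Assumption: depending on the package version, with_rsid may return a
    # single variant or a list of variants; use the first match if it is a list.
    if isinstance(hit, (list, tuple)):
        if not hit:
            raise KeyError(f"{rsid} not found in BGEN file")
        hit = hit[0]
    return list(reader.samples), list(hit.minor_allele_dosage)


def subsample(sample_ids, dosages, keep_ids):
    """Step d: keep only the samples whose ID is in keep_ids (hypothetical filter)."""
    keep = set(keep_ids)
    pairs = [(s, d) for s, d in zip(sample_ids, dosages) if s in keep]
    return [s for s, _ in pairs], [d for _, d in pairs]


# Possible usage (paths and rsid are placeholders):
# reader = open_reader("my_file.bgen", "my_file.sample")
# all_rsids = reader.rsids()                       # step b
# ids, dosages = dosage_for_rsid(reader, all_rsids[0])
# ids_kept, dosages_kept = subsample(ids, dosages, ids[:100])
```

Note that sample IDs read from a sample file are typically strings, so the IDs in the filter list should be strings as well.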

With this tool and the steps described above, you get a NumPy representation of the data, which can easily be converted to a Pandas dataframe. I look forward to hearing about your experience with processing geno data!
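As a minimal sketch of that last conversion, the snippet below turns per-variant NumPy arrays into a samples x variants dataframe. The function name `dosages_to_frame`, the `sample_id` index name and the toy values are assumptions for illustration only; they are not real genotype data.

```python
import numpy as np
import pandas as pd


def dosages_to_frame(sample_ids, dosages_by_rsid):
    """Build a samples x variants DataFrame from per-variant dosage arrays.

    dosages_by_rsid maps rsid -> 1-D array with one value per sample,
    in the same order as sample_ids.
    """
    columns = {}
    for rsid, values in dosages_by_rsid.items():
        arr = np.asarray(values, dtype=float)
        if arr.shape != (len(sample_ids),):
            raise ValueError(f"{rsid}: expected {len(sample_ids)} values")
        columns[rsid] = arr
    return pd.DataFrame(columns, index=pd.Index(sample_ids, name="sample_id"))


# Toy values for illustration only.
df = dosages_to_frame(["s1", "s2", "s3"], {"rs_demo": np.array([0.0, 1.0, 2.0])})

# Once in a DataFrame, filtering samples is ordinary boolean indexing.
subset = df.loc[df["rs_demo"] > 0.5]
```

Indexing the dataframe by sample ID makes it straightforward to join the genotype columns with phenotype tables later on.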

Comments

Ted Laderas DNAnexus Team
- 14 April 2023 18:15
Very nice! Glad to learn there is a package for this.

Chai Fungtammasan DNAnexus Team
- 15 April 2023 04:03
I personally like to use this https://bgen-reader.readthedocs.io/en/latest/quickstart.html
The documentation is good, and you can code with either NumPy or Dask style.

The bgen package has the ability to check compression method though, so I also use that from time to time.

Ondrej Klempir DNAnexus Team
- 17 April 2023 10:11
Dask looks great, will definitely read more about it.
