Comments
9 comments
What is the field ID that you want to download?
The genotype calls (field 22418) and the imputed data (all methods) as well. I have already downloaded the genotype files, so I can use the participant IDs for each phase to parse on my end. However, if there is an easier way to do this with a new download, I can do that as well.
I'm just getting started on the imputation files and have a quick question on how best to use them:
1) I'm recreating a dataset used in Khera et al., 2018, and they used imputed data from all three. Is there a workflow that combines these, and are there overlapping SNPs in each of the datasets?
Many thanks for your help.
Keri
The newly released imputed data (GEL and TOPMed) cannot be downloaded, per the MTA with UKB.
I'm not aware of a pipeline to compare the same EID across all of them. I just know that you have to make sure to use the EID from the sample file rather than the one within the BGEN file. Also, the array and original imputed data are in GRCh37, while the rest of the main genomics datasets are in GRCh38.
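To illustrate the point about taking the EID from the sample file, here is a minimal Python sketch. It assumes an Oxford-format .sample file, where the first two rows are the column-name row and the column-type row and the first column holds the participant ID. The function name and the example IDs are made up for illustration.

```python
import io

def read_sample_eids(handle):
    """Return the first-column IDs from an Oxford-format .sample file, in file order."""
    rows = [line.split() for line in handle if line.strip()]
    # Skip the column-name row and the column-type row.
    return [fields[0] for fields in rows[2:]]

# Made-up example content; a real file would be opened with open(path).
example = io.StringIO(
    "ID_1 ID_2 missing sex\n"
    "0 0 0 D\n"
    "1000011 1000011 0 1\n"
    "1000027 1000027 0 2\n"
)
eids = read_sample_eids(example)
print(eids)
```

The row order is kept because the rows of the sample file are expected to line up with the sample order in the matching BGEN file. EID lists read this way from different datasets can then be compared with ordinary set operations.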
With some quick googling I found a tutorial on how to extract common variants from two sets of PLINK files. This tutorial uses PLINK, which is installed in the Swiss Army Knife tool on UKB RAP. Alternatively, you can install PLINK on a cloud workstation or ttyd (web-based terminal).
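As a rough sketch of the same idea outside PLINK, the Python below finds the variant IDs shared by two .bim files. It is not the tutorial's method. It assumes the standard .bim layout, with the variant ID in the second column, and it assumes both files name variants consistently. Matching is by ID only, so it does not reconcile positions across GRCh37 and GRCh38. The function name and example rows are hypothetical.

```python
import io

def bim_variant_ids(handle):
    """Collect variant IDs (second column) from a PLINK .bim file."""
    ids = set()
    for line in handle:
        fields = line.split()
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids

# Made-up example content; real files would be opened with open(path).
bim_a = io.StringIO("1\trs123\t0\t1000\tA\tG\n1\trs456\t0\t2000\tC\tT\n")
bim_b = io.StringIO("1\trs456\t0\t2000\tC\tT\n1\trs789\t0\t3000\tG\tA\n")

common = sorted(bim_variant_ids(bim_a) & bim_variant_ids(bim_b))
print(common)
```

Writing the shared IDs to a text file, one per line, gives a list that PLINK's --extract option can use to subset each dataset.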
Thanks for the reply and information. What I'm looking for is a list of people who are in the phase 1 release and a list of people who are in the phase 2 release of the UK Biobank data. Do you know where I might find this info?
Quotes from the publication may help with clarity:
"... a validation dataset of 120,280 participants of European ancestry derived from the UK Biobank phase 1 release" is used as a training set, and "The testing dataset was comprised of 288,978 UK Biobank phase 2 genotype data release participants distinct from those in the training dataset described above". Does that make sense? I'm trying to recreate these datasets for my research.
Thank you!
I see. I don't know where to find that list. You may contact UKB directly or ask the authors of the paper.
Ok. Thank you.
@Keri Multerer? I am assuming that you are citing this article. I was able to find documentation for the interim genotype results in the Showcase documentation, but it is not connected with any data field.
Thank you! This was a great suggestion, but it doesn't have the data I'm looking for. I've reached out to UK Biobank directly and may need to go to plan B, which is randomizing participants into two groups of the same sizes as in the publication. Theoretically, with these large numbers, this should be OK to do. Thanks for all of your help.
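For anyone following the same plan B, a minimal Python sketch of a reproducible random split is below. The function name, the seed, and the demo IDs are made up. For the real data the two sizes would be 120,280 and 288,978, as quoted from the publication. A split like this only matches the group sizes and does not recover actual phase 1 or phase 2 membership.

```python
import random

def random_split(eids, n_first, n_second, seed=0):
    """Randomly draw two non-overlapping groups of the requested sizes."""
    if n_first + n_second > len(eids):
        raise ValueError("not enough participants for the requested group sizes")
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    # Sort first so the result does not depend on the input order.
    chosen = rng.sample(sorted(eids), n_first + n_second)
    return chosen[:n_first], chosen[n_first:]

# Small made-up demo; real use would pass the full EID list and the published sizes.
demo_ids = [str(1000000 + i) for i in range(100)]
group_a, group_b = random_split(demo_ids, 30, 70, seed=42)
print(len(group_a), len(group_b))
```

Saving the seed alongside the two ID lists makes it possible to regenerate the same groups later.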