We need to impute a subset of UKB genotypes with HRC as reference. Since we can't download genotypes and upload to Michigan Server (or can we?), can someone please advise on how to run imputation with a different panel on the RAP?
Comments
I think the HRC-imputed dataset is already present on the platform. Please check the imputation folder in the bulk genotype area. As far as I remember, they also have 1000G-imputed and TOPMed-imputed data.
We need to impute a subset of samples using the HRC European panel. How can one impute with different reference panels if we can't use the Michigan Imputation Server, which is the most widely used method?
I think you should check this: https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=100319. HRC imputation of the UK Biobank dataset has already been done, and the link should give you the information needed. I just wanted to let you know that, to my understanding, the HRC-imputed UK Biobank files are already available. I am not suggesting anything regarding imputation with different panels, and I am sorry if my response implied that.
Thanks, but that's not what I need. I need to use the European reference panel of HRC to impute a subset of samples, some of which may not have been in the biobank at the time it was imputed with HRC. So if anyone has done this, please let me know how.
The current cost tier for field 22418 genotype calls is d2o1s2. This can be seen on the Showcase page for field 22418.
This means that a project with Tier 2 or above can download the data in field 22418, either from the RAP or in a basket.
As Akhil says, most participants already have imputed haplotypes using the HRC, see field 22438.
There are currently 968 participants that have genotype calls in field 22418 but do not have haplotypes in field 22438. However, I believe that these will probably turn out to be samples that failed some kind of QC, and it is quite likely that the Michigan server would also produce either null results or low-confidence results for them. I say this because at the time the imputation was done, there were genotype calls for 488377 participants, resulting in imputed data for 487442 participants and no imputed data for the remaining 935 participants. For these figures, see the paper by Bycroft et al. at https://www.biorxiv.org/content/10.1101/166298v1.full
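A count like the 968 above can be reproduced for your own subset with a simple set difference on sample IDs. The following is only a minimal sketch, assuming (this is not stated in the thread) that the genotype-call IDs come from a PLINK .fam file and the haplotype/imputed IDs from an Oxford-format .sample file; the function names are made up for illustration.

```python
def fam_ids(lines):
    """Collect individual IDs (second column) from PLINK .fam lines."""
    ids = set()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            ids.add(parts[1])
    return ids


def sample_ids(lines):
    """Collect ID_1 values from Oxford .sample lines.

    The first two rows of a .sample file are the column names and the
    column types, so they are skipped.
    """
    ids = set()
    for i, line in enumerate(lines):
        if i < 2:
            continue
        parts = line.split()
        if parts:
            ids.add(parts[0])
    return ids


def called_but_not_imputed(fam_lines, sample_lines):
    """IDs that have genotype calls but no entry in the .sample file."""
    return sorted(fam_ids(fam_lines) - sample_ids(sample_lines))


# Toy data (hypothetical IDs, not real participants).
toy_fam = [
    "1001 1001 0 0 1 -9",
    "1002 1002 0 0 2 -9",
    "1003 1003 0 0 1 -9",
]
toy_sample = [
    "ID_1 ID_2 missing sex",
    "0 0 0 D",
    "1001 1001 0 1",
    "1002 1002 0 2",
]

missing = called_but_not_imputed(toy_fam, toy_sample)
print(missing)
```

With real files you would pass the opened file objects instead of the toy lists. Anyone in the resulting list would be a candidate for the QC-failure explanation given above.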
The change in the number of participants in field 22418, from 488377 in 2017 to 488127 now, will be due to participants withdrawing their consent.
The difference between 968 participants and 935 participants could potentially also be due to withdrawals.
There are no participants that have genotype calls in field 22418 now that did not have them at the time of the imputations.
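For anyone who wants to check the arithmetic behind the figures quoted above, here is a small worked example. It only uses the numbers given in this comment; the variable names are mine.

```python
# Figures quoted in the comment above.
calls_2017 = 488377      # genotype calls at the time of imputation
imputed_2017 = 487442    # participants with imputed data at that time
calls_now = 488127       # genotype calls in field 22418 now
not_imputed_now = 968    # calls in 22418 now without data in 22438

# Participants with calls but no imputed data at the time of imputation.
not_imputed_2017 = calls_2017 - imputed_2017

# Drop in the number of participants with genotype calls since 2017.
fewer_calls = calls_2017 - calls_now

# Gap between the current and the original "not imputed" counts.
gap = not_imputed_now - not_imputed_2017

print(not_imputed_2017, fewer_calls, gap)
```

So the original shortfall is 935, the number of participants with genotype calls has fallen by 250, and the current shortfall of 968 is 33 higher than the original one.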
If you do decide that you need to use the Michigan Imputation Server, then it would count as a "3rd-party processor" within the terms of your project's MTA. In particular, see the 2018 amendment agreement available at https://www.ukbiobank.ac.uk/media/ujwedt1j/third_party_subcontractor_processors.pdf
The Michigan Imputation Server code is available as a Docker image, https://imputationserver.readthedocs.io/en/latest/docker/ . I don't know whether anyone has got that working on the RAP. You could try asking a new question with that in the title.
Unfortunately, the HRC doesn't appear to be one of the Reference Panels available from Michigan, see https://github.com/genepi/imputationserver-docker.
Worse yet, this paper by McCarthy et al. suggests that the full HRC will not be made publicly available, as it says:
"Since the HRC reference panel combines data from many different studies with a range of restrictions on data release we have developed centralized imputation server resources (see URLs). Under this model researchers upload phased or unphased genotype data and imputation is carried out on central servers. Once completed researchers can download imputed datasets. Along similar lines, we have also developed a lower throughput phasing server for haplotype estimation of clinical samples with genotypes from high-coverage WGS data that takes advantage of rare variant sharing 18 (see URLs). A limited subset of HRC haplotypes will be made available for researchers via the European Genome-phenome Archive (EGA) for the sole purpose of phasing and imputation."
Would the limited subset be sufficient?
Thanks. That's exactly why I want to use the Michigan server, because they made it available there. I believe the Michigan server process follows the rules outlined in the MTA, and they also make the user agree not to try to identify any participant etc., but using it would require downloading genotype data to a local computer (temporarily) and then downloading the output before uploading it to the RAP again. I just want to know if that would be allowed. I don't speak "legalese" and I don't know who would be able to give me a definitive answer.
You can always try running the imputation on the platform itself without downloading anything. It may be somewhat painful at first to figure out the installation and so on, but it's worth a shot. (It is one way to avoid figuring out the legal stuff.)
Hi Rona,
[I still think it is unlikely to get you any useful data].
Definitively, you can download the genotype data in field 22418 from the RAP to your university's servers. It is your university's responsibility to ensure that the data are stored with an appropriate amount of security etc., as per your MTA. In general, a personal PC would not be considered appropriate, but something large and not portable behind the firewall probably is.
The answer about "3rd-party processor" comes from our Head of Access, who is not a lawyer but has a lot of experience with the MTA.
The awkward part is that the document I linked to says you need to get a Written Agreement with the Michigan Imputation Server people, see section 1.3, and obviously since they are not making a profit they might be reluctant to spend time on this.
You would think it must be possible to get access to the HRC, since it appears that the Bycroft group managed it. I suggest you email either the Bycroft group or the Michigan Server group and ask what is the best way to progress. It might also be helpful to talk to the legal representative of your university that signed your MTA. You could also try an email to AMS Access, asking for an official exemption from the requirement for a Written Agreement, but I suspect you would need to make a compelling case (what makes you think it would produce any result of benefit to world health, and what alternatives you have explored), and I would not expect it to be successful. The general UKB legal team policy is that the MTA is non-negotiable.
Please note that I am neither a lawyer nor a geneticist. I am sure about the "Definitively" paragraph.
I guess the next question is why you believe you need the HRC panel.
Hi Akhil, unfortunately it seems there is legal stuff by that route too, in that the HRC panel doesn't appear to be publicly available.
Oh, my bad. I thought the HRC panel was publicly available. Scratch that off the list.