Impact of selection bias on polygenic risk score estimates in healthcare settings

Published online by Cambridge University Press: 25 May 2023

Younga Heather Lee
Affiliation:
Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA
Tanayott Thaweethai
Affiliation:
Harvard Medical School, Boston, Massachusetts, USA; Biostatistics Center, Massachusetts General Hospital, Boston, Massachusetts, USA
Yi-Han Sheu
Affiliation:
Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA
Yen-Chen Anne Feng
Affiliation:
Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA; Analytic and Translational Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA; Division of Biostatistics and Data Science, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
Elizabeth W. Karlson
Affiliation:
Harvard Medical School, Boston, Massachusetts, USA; Division of Rheumatology, Immunity, and Inflammation, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
Tian Ge
Affiliation:
Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA; Center for Precision Psychiatry, Massachusetts General Hospital, Boston, Massachusetts, USA
Peter Kraft
Affiliation:
Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, USA; Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, USA
Jordan W. Smoller*
Affiliation:
Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA; Center for Precision Psychiatry, Massachusetts General Hospital, Boston, Massachusetts, USA
*Corresponding author: Jordan W. Smoller; Email: jsmoller@mgh.harvard.edu

Abstract

Background

Hospital-based biobanks are increasingly considered as a resource for translating polygenic risk scores (PRS) into clinical practice. However, because these biobanks draw on patient populations, polygenic risk estimates may be biased by the overrepresentation of patients who interact with the healthcare system more frequently.

Methods

PRS for schizophrenia, bipolar disorder, and depression were calculated for a sample of 24 153 European ancestry participants in the Mass General Brigham (MGB) Biobank, using summary statistics from the largest available genomic studies. To correct for selection bias, we fitted logistic regression models with inverse probability (IP) weights. The weights were estimated using 1839 sociodemographic, clinical, and healthcare utilization features extracted from the electronic health records of 1 546 440 non-Hispanic White patients who were eligible to participate in the Biobank study at their first visit to MGB-affiliated hospitals.
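
As an illustration of the weighting idea only, the following minimal Python sketch builds IP weights from estimated selection probabilities and fits a weighted logistic regression with one binary predictor. It is not the authors' pipeline: the function names `ip_weights` and `weighted_logistic_fit`, the probability floor of 0.01, the Newton-Raphson fitting routine, and the toy data are all assumptions made for this example.

```python
import math

def ip_weights(selection_probs, floor=0.01):
    """Return inverse probability weights 1/p.

    Probabilities are floored (an assumption of this sketch) so that
    near-zero selection probabilities do not create extreme weights.
    """
    return [1.0 / max(p, floor) for p in selection_probs]

def weighted_logistic_fit(x, y, w, n_iter=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson on the
    weighted log-likelihood. Returns point estimates only."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (yi - p)          # weighted score contribution
            v = wi * p * (1.0 - p)     # weighted information contribution
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g for the 2x2 system
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical toy data: x = 1 marks a high-PRS participant, y = 1 a case.
# Cases are assumed to be selected with probability 0.5, non-cases with 0.25.
probs = [0.5] * 4 + [0.25] * 3 + [0.5] * 2 + [0.25] * 4
x = [1] * 7 + [0] * 6
y = [1] * 4 + [0] * 3 + [1] * 2 + [0] * 4

w = ip_weights(probs)
b0, b1 = weighted_logistic_fit(x, y, w)
print(b0, b1)
```

The sketch returns point estimates only. Weighted analyses of this kind generally also need variance estimates that account for the weights, which this example does not compute.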

Results

Case prevalence of bipolar disorder among participants in the top decile of bipolar disorder PRS was 10.0% (95% CI 8.8–11.2%) in the unweighted analysis but only 6.2% (5.0–7.5%) when selection bias was accounted for using IP weights. Similarly, case prevalence of depression among those in the top decile of depression PRS was reduced from 33.5% (31.7–35.4%) to 28.9% (25.8–31.9%) after IP weighting.
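
To make the comparison between unweighted and IP-weighted prevalence concrete, here is a small Python sketch of how case prevalence in the top PRS decile could be computed with and without weights. The function name `top_decile_prevalence`, the choice to define the decile by unweighted rank, and the toy numbers are assumptions for illustration and are not taken from the study.

```python
def top_decile_prevalence(prs, is_case, weights=None):
    """Weighted case prevalence among participants in the top 10% of PRS.

    The top decile is taken by unweighted rank (an assumption of this
    sketch). With weights=None every participant counts equally.
    """
    n = len(prs)
    if weights is None:
        weights = [1.0] * n
    k = max(1, n // 10)  # number of participants in the top decile
    top = sorted(range(n), key=lambda i: prs[i], reverse=True)[:k]
    total = sum(weights[i] for i in top)
    cases = sum(weights[i] for i in top if is_case[i])
    return cases / total

# Hypothetical toy sample of 20 participants with PRS values 0..19.
prs = list(range(20))
is_case = [False] * 19 + [True]   # only the highest-PRS participant is a case
weights = [1.0] * 20
weights[18] = 3.0                 # an underrepresented non-case gets a larger weight

unweighted = top_decile_prevalence(prs, is_case)
weighted = top_decile_prevalence(prs, is_case, weights)
print(unweighted, weighted)
```

In this toy sample the weighted prevalence is lower than the unweighted one because the non-case in the top decile stands in for more people in the source population, which is the same direction of change the results above report.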

Conclusions

Non-random selection of participants into volunteer biobanks may induce clinically relevant selection bias that could impact implementation of PRS in research and clinical settings. As efforts to integrate PRS in medical practice expand, recognition and mitigation of these biases should be considered and may need to be optimized in a context-specific manner.

Type: Original Article
Copyright © The Author(s), 2023. Published by Cambridge University Press

Supplementary material

Lee et al. supplementary material 1 (File, 1.6 MB)
Lee et al. supplementary material 2 (File, 96.4 KB)