Polygenic risk scores for prediction of breast cancer risk in Asian populations

Purpose Non-European populations are under-represented in genetics studies, hindering clinical implementation of breast cancer polygenic risk scores (PRSs). We aimed to develop PRSs using the largest available studies of Asian ancestry and to assess the transferability of PRS across ethnic subgroups. Methods The development data set comprised 138,309 women from 17 case-control studies. PRSs were generated using a clumping and thresholding method, lasso penalized regression, an Empirical Bayes approach, a Bayesian polygenic prediction approach, or linear combinations of multiple PRSs. These PRSs were evaluated in 89,898 women from 3 prospective studies (1592 incident cases). Results The best performing PRS (genome-wide set of single-nucleotide variations [formerly single-nucleotide polymorphism]) had a hazard ratio per unit SD of 1.62 (95% CI = 1.46-1.80) and an area under the receiver operating curve of 0.635 (95% CI = 0.622-0.649). Combined Asian and European PRSs (333 single-nucleotide variations) had a hazard ratio per SD of 1.53 (95% CI = 1.37-1.71) and an area under the receiver operating curve of 0.621 (95% CI = 0.608-0.635). The distribution of the latter PRS was different across ethnic subgroups, confirming the importance of population-specific calibration for valid estimation of breast cancer risk. Conclusion PRSs developed in this study, from association data from multiple ancestries, can enhance risk stratification for women of Asian ancestry.

. AUCs of PRSs generated using clumping and thresholding method for East Asian ancestry women PRS performance (AUC) in validation dataset generated using clumping and thresholding method at different p-value threshold. Each point represents a p-value threshold. The best-fit PRS consisted of 46 SNVs at p-value threshold of 5.74 x 10 -7 . Figure S3. AUCs of best PRSs generated using clumping and thresholding method with different clumping option for East Asian ancestry women Each point represents the AUC of the best PRS generate at the combination of the given clumping r 2 and clumping size. Figure S4. PRSs generating using different values of parameters in penalized regression for East Asian ancestry women Each point represents correlation between breast cancer status in validation dataset and PRS generated at the given combination of penalty parameter (lambda) and shrinkage parameter (s). The best PRS occurred at penalty parameter (λ) equal to 0.014 and shrinkage parameter (s) equals to 0.9, where 2,985 SNVs were selected. Figure S5. Performance of the PRSs in test dataset.
Forest plot showed the association between standardised PRSs and breast cancer risk in three prospective cohorts, China Kadoorie Biobank (CKB), Singapore Chinese Health Study (SCHS) and Korean Cancer Prevention Study II. The squares represent the hazard ratios (HRs), the horizontal lines represent the corresponding 95% confidence intervals and the diamond shapes represent the overall estimates. I-squared and p-value (two-sided) for heterogeneity were obtained by fitting a random-effects model and using generalized Q-statistic estimator (the rma() command in R).
(a) Performance of Asian-based and European-based PRSs (b) Performance of PRSs generated by linear combinations of multiple PRSs. Figure S6. Distribution of 10-years absolute breast cancer risk at age 40 by birth cohort.
Dashed vertical line equals to 2.3% (average 10-year absolute risk of a 50-years old European women). Area under the curve represents the proportion of women who would have absolute risk at age 40 greater than a specific risk threshold. For example, area to the right of the vertical line for the blue curve represent proportion of women who were born after 1979 who would have absolute risk at age 40 greater than 2.3%. The breast cancer incidence for birth cohort 1960-1969 and 1970-1979 were observed and determined using breast cancer incidence in Singapore from 1968 to 2017. For birth cohort 1980-1989, breast cancer incidences were projected by assuming an annual increase in breast cancer incidence of 3.9% 1 . Figure S7. Performance of the PRS46 + PRS287_EBin Chinese, Malay and Indian women from Malaysia and Singapore Forest plot showed the association between standardised PRSs and breast cancer risk in Chinese, Malay and Indian women from Malaysia and Singapore (validation cohort). Odds ratios (ORs) and AUCs were generated using data from Malaysia Breast Cancer Genetics (MyBrCa) and Singapore Breast Cancer Cohort (SGBCC) studies, stratified by ethnicity. The squares represent the odds ratios (ORs), the horizontal lines represent the corresponding 95% confidence intervals and the diamond shapes represent the overall estimates. I-squared and p-value (two-sided) for heterogeneity were obtained by fitting a random-effects model and using generalized Q-statistic estimator (the rma() command in R). The number of cases and controls for each ethnicity, ORs and corresponding 95% confidence intervals are tabulated in Table S8.
Please refer to attached excel for Table S1, S2, S3, and S4.   Table S1. Participating studies and the number of individuals used in polygenic risk scores evaluation analyses.  0.579 † Adjusted for first the 10 principal components and study, and standardised to SDs in controls of each PRS.
*Combined PRSs were generated using the formula + + where , and are the weights obtained by fitting a logistic regression model with breast cancer as outcome, and and are explanatory variables using either East Asian ancestry women in the validation dataset. The PRSs were standardized to standard deviation (SD) of the controls in the validation dataset.  (Table S1).
b Combined PRSs were generated using the formula + + where , and are the weights obtained by fitting a logistic regression model with breast cancer as outcome, and and are explanatory variables using Indian-ancestry women in the validation dataset. The weights for the considered combinations of PRSs can be found in bottom panel of Table S5. † Adjusted for first the 10 principal components and study, and standardised to SDs in controls of each PRS.

Study populations and genotyping
The study population was divided into three datasets. PRSs were developed using training  (Table S1).
The best PRSs were evaluated in the testing set comprising 89,898 women from three prospective cohorts of East Asian ancestry: (a) 10,021 women who had not had any cancer diagnosis prior to recruitment into Singapore Chinese Health Study (SCHS) 25 Genotype calling, quality control procedure and imputation have been described previously 23,30,31,33,37,38 . All data were imputed using the 1000 Genomes Project (Phase 3) data as the reference panel 39 , except BioBank Japan, for which the HapMap Phase II (release 22) 40 was used. SNVs with overall minor allele frequency in controls > 0.01 and imputation r 2 > 0.9 for OncoArray studies, imputation r 2 > 0.7 for Biobank Japan and imputation r 2 > 0.3 for other studies in the training and validation dataset were included in this analysis. Since all samples in the validation sets were genotyped using OncoArray, a higher threshold was imposed for OncoArray to ensure accurate determination of PRS in the validation datasets.
Ancestry informative principal components were available for Asian ancestry samples in the training dataset and the validation dataset, generated using methods as previously described 30 . Briefly, for the BCAC data, continental ancestry was derived by combining the data with the 1000 Genomes Project reference data. Individuals with >40% estimated East Asian ancestry were retained. In the second stage, principal components were generated on the Asian ancestry individuals using a subset of uncorrelated SNVs. Similar ancestry informative principal components were generated for the other dataset.
All studies were approved by the relevant institutional ethics committees and review boards, and all participants provided written informed consent.

Single-SNV association analysis in the training set
Single-SNV association tests were conducted in the BCAC studies separately for the iCOGS and OncoArray datasets, adjusted for age, the first two principal components and country/study to obtain the per-allele OR for each SNV using Plink 2.0 41 . Single-SNV association analyses in ABCC studies were previously conducted by Shu and colleagues (2020) 30 -only summary statistics from meta-analyses of BCAC and ABCC studies were available for this project. Briefly, the analyses were conducted separately for each study in ABCC, adjusted for age and first two principal components 30 . Combined weights and p-values were derived using fixed-effect meta-analysis with the software METAL 42 . A total of 20,768 SNVs significantly associated with breast cancer risk at p-values < 0.001 were selected. SNVs clumping (within 1Mb windows) was subsequently conducted using the software PRSice v2.11 43 to remove highly correlated SNVs (pairwise correlation r 2 > 0.9); the SNV with the lowest p-value for association in the correlated pairs was retained, resulting in 3,050 SNVs for subsequent analyses. Since the raw genotype data were not available for ABCC studies, the correlation r 2 was computed using 4,921 control samples in BCAC OncoArray studies only. Single-SNV association analyses in BBJ1 were previously conducted by Ishigaki et al (2020) 33 . Briefly, GWAS was conducted using generalised linear mixed model, adjusted for age and first five principal components. The GWAS summary statistics from BBJ were combined with GWAS from BCAC Asian studies using fixed-effect meta-analysis with the software METAL.

Clumping and thresholding (C+T) method
To account for the joint effect of SNVs used in derivation of the best PRS determined by the C+T method, the SNV weights should ideally be estimated jointly in a single logistic regression model. Raw genotype data of the training set were not available for the joint estimation.
However, these weights can be computed from the marginal effect sizes -if are the (conditionally independent) effect sizes and ′ = 2 (1 − ) are the corresponding normalized effect sizes, where is the effect allele frequency of SNV , then the predicted normalised marginal effect sizes of the SNVs ′ are given by ′ = ′, where is the matrix of correlations between the SNV genotypes. Thus ′ = ′, and the optimal weight. The optimal weight for SNV j is then given by: The correlation matrix was estimated using 4,921 Asian control samples in BCAC OncoArray studies. The concept of this method has been previously described in Section 3.3 of Prive et al (2020) 44 .

Re-weighting of European-based PRS
For these analyses, we considered PRS based on the 313 SNVs developed in European women 45 . Of the 313 SNVs, only 287 SNVs with imputation info score > 0.9 in OncoArray studies were retained for subsequent analyses. We considered two sets of weights for these SNVs: (i) Asian weights estimated from training set 1 alone; and (ii) weights based on a combination of the Asian (from training set 1) and European weights, allowing for these weights to differ but be correlated.
For (i), the optimal weights taking into account the correlation between SNVs were derived using Equation (1). For (ii), we combined Asian and European weights using an Empirical Bayes approach. In brief, we assume that the true population-specific effect sizes vary from a "global" effect size, , by a normally distributed amount, with variance , i.e.

Absolute risk of breast cancer by PRS percentiles
The age-specific absolute risks of developing breast cancer in each PRS percentile, g, were calculated using the following formula: where λ ( ) = ( )exp ( ) is the breast cancer incidence associated with PRS at age u, ( ) is the baseline incidence and the corresponding effect size , Sg(u) is the probability of being breast cancer free at age u, and Sm(u) is the probability of not dying from a cause other than breast cancer at age u. The theoretical effect sizes, OR = exp , for PRS interval between two percentiles (u,v), using middle quintile as reference (40- 60  where is the log OR per unit SD of the continuous PRS 46 . The PRS-specific breast cancer incidences, λ (u), were calculated iteratively by assuming that the average age-specific breast cancer incidence over all PRS percentiles agreed with the population breast cancer incidence.
The details of these methods have been described previously 47 . We calculated lifetime and 10-year absolute risks using Singaporean mortality and breast cancer incidence for Chinese, Malays and Indians in 2017 48,49 . Under polygenic model, the logarithm of risk in the population has been shown to follow a normal distribution 50 . The proportion of population above a specific risk threshold was given by the area under the curve of the distribution of logarithm of 10-year absolute risk at pre-specified age.
To generate the distribution of birth-cohort specific 10-year absolute risk at age 40, we used birth cohort--specific breast cancer incidences derived using population breast cancer incidence in Singapore from 1968 to 2017 51 . The population incidences were reported in five- year age intervals for calendar year 1968-1972, 1973-1977,…, 2013-2017. By taking the lowerbound of these five-year age interval and the midpoint of calendar-specific interval, the year of birth of the cohort that the reported population incidences were based on were calculated.
We took the average of the reported incidences according to three birth cohorts -1960-1969, 1970-1979 and 1980-1989. Birth cohort-specific incidence were observed for women who were born between 1960-1969 and 1970-1979. For women who were born between 1980-1989, breast cancer incidence was observed up to age 35. The breast cancer incidence for this birth cohort at age 40 was projected by assuming an annual increase in breast cancer incidence of 3.9% 1 .