
2014 Rare-Variant Association Analysis: Study Designs and Statistical Tests

by wycho 2021. 1. 22.

Seunggeung Lee, Gonçalo R. Abecasis, Michael Boehnke, Xihong Lin

AJHG VOLUME 95, ISSUE 1, P5-23, JULY 03, 2014


https://doi.org/10.1016/j.ajhg.2014.06.009

 

 

Highlight

Many Mendelian disorders and rare forms of common diseases are caused by highly penetrant rare variants. Evolutionary theory predicts that deleterious alleles are likely to be rare as a result of purifying selection, and indeed, loss-of-function variants, which prevent the generation of functional proteins, are especially rare.

 

.. the statistical power is low unless sample sizes or effect sizes are very large, ..

 

Low-depth sequencing relies on linkage-disequilibrium (LD)-based methods that leverage information across individuals to improve the quality of variant detection and estimated genotypes.

 

Initial simulation studies showed that low-depth sequencing for a larger sample might be more powerful than deep sequencing of fewer samples, both for variant detection and subsequent disease association studies.

 

Contaminated samples often have unusually high levels of heterozygosity.

 

.. the replication of rare-variant associations generally requires a large sample, ..

 

Noncoding regions can play an important role in complex diseases and traits. It has been shown that most GWAS loci lie in noncoding regions.

 

Large samples are needed for identifying low-frequency and rare disease-associated variants unless their effects are quite strong.

 

.. when samples are selected from the upper and lower 10% tails of the phenotype distribution, the number of individuals who must be sequenced for a given power can often be reduced by more than half.

 

Despite its advantages over random sampling in terms of power, extreme-phenotype sampling also has limitations. Notably, the results might not be generalizable to the underlying population and might be sensitive to outliers, sampling bias, and the assumption of normality for the underlying traits. If a complex trait is influenced by multiple loci, extreme-phenotype sampling can reduce power to detect loci with small effects.

 

Methods for Rare-Variant Association Testing

The analysis of rare variants is more challenging than that of common variants. First, a large sample size is needed for simply observing a rare variant with a high probability. Second, standard single-variant association analysis is underpowered to detect rare-variant associations.
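
The first point can be illustrated with a short calculation. This is only a rough sketch, assuming that the 2N chromosomes of N individuals are sampled independently (an assumption of this note, not a statement from the paper); under that assumption the probability of seeing a variant at least once is 1 - (1 - MAF)^(2N). The function name is hypothetical.

```python
def prob_observe_variant(maf: float, n_individuals: int) -> float:
    """Probability that at least one copy of the minor allele appears
    among 2 * n_individuals independently sampled chromosomes
    (independence is an assumption made for this sketch)."""
    return 1.0 - (1.0 - maf) ** (2 * n_individuals)


# Tabulate the probability for a few allele frequencies and sample sizes.
for maf in (0.01, 0.001, 0.0001):
    for n in (100, 1_000, 10_000):
        p = prob_observe_variant(maf, n)
        print(f"MAF={maf:<7} N={n:<6} P(observed at least once)={p:.3f}")
```

Under this simple model, a variant with MAF = 0.0001 is more likely to be missed than seen in a sample of 1,000 individuals, which is why sample size matters even before any association test is run.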

 

.. with an odds ratio (OR) = 1.4, the sample sizes required to achieve 80% power are 6,400, 54,000, and 540,000 for a MAF = 0.1, 0.01, and 0.001, respectively, if one assumes 5% disease prevalence and a significance level of 5 × 10^-8.

 

.. more stringent significance levels might be required, ..

 

.. if the sample sizes are large enough, the effects are very large, or the variants are not too rare.

 

.. addressing this issue will require more methodological development.

 

Adaptive Burden Tests

The estimated regression coefficient (EREC) test estimates a regression coefficient of each variant and uses this as a weight. The test is based on the expectation that the true regression coefficient βj is an optimal weight to maximize power. Because βj estimates are unstable when the minor allele count (MAC) is small, the EREC test stabilizes the estimates by adding a small constant to the estimated βj, which might reduce the optimality of the EREC test. .. only accurate for very large samples, ..

 

Adaptive burden tests are more robust than the original burden methods because they require fewer assumptions about the underlying genetic architecture at each locus.

 

.. some require estimation of regression coefficients of individual variants in the first stage, which is often difficult and unstable for rare variants.

 

Variance-Component Tests

SKAT can also accommodate SNP-SNP interactions.

 

The SKAT test statistic is a weighted sum of squares of single-variant score statistics Sj.
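
As a rough illustration of that sentence, the sketch below computes a SKAT-style statistic, the sum of (wj Sj)^2, next to a burden-style statistic, the square of the sum of wj Sj. It assumes a quantitative trait with no covariates, and the function names are hypothetical; it is not the SKAT package's implementation, and the p value calculation (a mixture of chi-square distributions) is left out.

```python
import numpy as np


def score_statistics(G, y):
    """Single-variant score statistics S_j = sum_i g_ij * (y_i - mean(y)).
    Assumes a quantitative trait and no covariates (simplification)."""
    G = np.asarray(G, dtype=float)           # n individuals x m variants
    resid = np.asarray(y, dtype=float) - np.mean(y)
    return G.T @ resid


def skat_statistic(S, w):
    """Weighted sum of squared score statistics: sum_j (w_j * S_j)^2."""
    S, w = np.asarray(S, dtype=float), np.asarray(w, dtype=float)
    return float(np.sum((w * S) ** 2))


def burden_statistic(S, w):
    """Squared weighted sum of score statistics: (sum_j w_j * S_j)^2."""
    S, w = np.asarray(S, dtype=float), np.asarray(w, dtype=float)
    return float(np.sum(w * S) ** 2)


# Toy score statistics with opposite signs (made-up numbers).
S = np.array([2.0, -2.0, 1.0])
w = np.ones(3)
print(skat_statistic(S, w))    # 9.0: squared terms cannot cancel
print(burden_statistic(S, w))  # 1.0: opposite-sign effects cancel in the sum
```

The toy numbers show why variance-component tests tolerate a mix of risk-increasing and protective variants better than burden tests do.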

For binary traits, large-sample-based p value calculations can produce inaccurate type I error rates when sample sizes or total MACs are small. In these situations, false-positive rates can be deflated when the numbers of affected and control individuals are equal and inflated when these numbers are unequal.

 

.. the naive approach of simply taking the minimum p value of different methods generally yields an inflated type I error rate.
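
A small simulation makes the inflation visible. The sketch below assumes the p values of the different methods are independent and uniform under the null, which is a simplification of this note: in practice tests run on the same data are correlated, so the real inflation would fall somewhere between the nominal level and the value simulated here. The function name is hypothetical.

```python
import numpy as np


def min_p_false_positive_rate(n_tests=2, alpha=0.05, n_sim=200_000, seed=0):
    """Fraction of null simulations in which the minimum of n_tests
    independent uniform p values falls below alpha."""
    rng = np.random.default_rng(seed)
    p = rng.uniform(size=(n_sim, n_tests))   # null p values, one column per test
    return float(np.mean(p.min(axis=1) < alpha))


# With two independent tests the expected rate is 1 - 0.95**2 = 0.0975,
# roughly double the nominal 0.05.
print(min_p_false_positive_rate(n_tests=2))
print(min_p_false_positive_rate(n_tests=1))  # close to the nominal 0.05
```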

 

The EC test uses an exponential sum of Sj^2, which was developed under a Bayesian framework with a sparse alternative prior under the assumption that only one variant in a gene or region is a causal variant.

 

.. the EC test can have higher power than burden or variance-component tests when only a very small proportion of variants are causal. .. less power when moderate or large proportions of variants are causal. The null distribution of Q_EC is unknown, and so permutations are required for estimating p values.

 

.. gene- and region-based tests are designed to increase power by aggregating association signals across multiple rare variants.

 

.. compared to single-variant-based tests, gene- and region-based tests can lead to loss of power when one or a very few of the variants in a gene are associated with the trait, when many variants have no effect, and when causal variants are low-frequency variants.

 

Meta-analysis

Meta-analysis provides an effective way to combine data from multiple studies.

 

.. it is well known that this approach is less powerful than joint analysis of individual-level data and fixed-effects meta-analysis.

 

Fixed-effects meta-analysis can use study-level summary statistics to achieve power essentially identical to that of joint analysis of individual-level data. These frameworks require that each study provide score statistics for individual variants and also between-variant covariance matrices that reflect region-specific LD information among variants. These matrices later allow asymptotic p values to be calculated. Burden tests, SKAT, SKAT-O, and VT have all been developed in this score-statistic-based meta-analysis framework.
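
A minimal sketch of how such summary statistics could be combined for a burden-type test is given below. It assumes that every study reports the same variants in the same order, that effects are homogeneous across studies, and that the combined statistic follows a chi-square distribution with 1 degree of freedom under the null. The function name and the toy numbers are hypothetical and do not reproduce the interface of any particular meta-analysis package.

```python
import math

import numpy as np


def meta_burden_test(score_list, cov_list, weights):
    """Fixed-effects burden test from per-study summary statistics.

    score_list: per-study vectors of single-variant score statistics
    cov_list:   per-study between-variant covariance matrices
    weights:    per-variant weights shared by all studies
    """
    # Sum the score vectors and covariance matrices across studies.
    S = np.sum([np.asarray(s, dtype=float) for s in score_list], axis=0)
    Phi = np.sum([np.asarray(c, dtype=float) for c in cov_list], axis=0)
    w = np.asarray(weights, dtype=float)
    # (w'S)^2 / (w' Phi w) is asymptotically chi-square with 1 df.
    stat = float((w @ S) ** 2 / (w @ Phi @ w))
    # Upper tail of a chi-square(1) variable: P(X > t) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Two toy studies with two variants each (made-up numbers).
scores = [[1.0, 2.0], [1.0, 0.0]]
covs = [2.0 * np.eye(2), 2.0 * np.eye(2)]
stat, p = meta_burden_test(scores, covs, weights=[1.0, 1.0])
print(stat, p)
```

Because only sums of scores and covariance matrices are needed, no individual-level genotypes have to leave the contributing studies.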

 

Case-control imbalances across different sequencing platforms might also increase type I error rates, given that traditional large-sample-based association tests of individual low-frequency variants might not be well calibrated for case-control imbalances.

 

Other Analytic Issues for Rare-Variant Association Studies

Population-Stratification Adjustment

PCA and mixed models both assume a smooth distribution of MAFs over geographical (or ancestry) space. Because rare variants are often sharply localized, PCA and mixed models might fail to correct for population stratification if the distribution of disease risk is also sharply localized.

 

PCA performance heavily depends on the underlying risk distribution and population structure, ..

 

Genotype Imputation

Imputation accuracy decreases as MAF decreases, making it challenging to impute very rare variants.

 

Conclusions

One strategy to improve power is to use publicly available data to augment the control set by selecting ancestry-matched controls.

 

.. the samples from different platforms can severely increase false-positive rates.

 

 

 
