Study design, laboratory methods, and analytic approaches differ by trait type (Mendelian or complex) and by the hypothesis being tested: rare disease‐rare variant (Mendelian positional cloning), CDCV (GWAS), or CDRV (WES or WGS with individual‐variant or set‐based association). These approaches are described in the following sections.
Components of a Disease Gene Discovery Study
Each genetically complex trait has its own peculiarities that require special attention. However, a guiding paradigm can be applied to most conditions. Originally, the general approach that was used for Mendelian single‐gene disorders was positional cloning. With the completion of the human genome reference sequence, cloning was no longer a necessary step – and therefore this general approach is better described as disease gene discovery. The classical approach (Figure 1.1) follows a generally linear series of events: defining the phenotype, identifying multi‐case families, collecting blood samples, genotyping markers, analyzing data for initial disease gene localization, refining the initial localization to define the minimum candidate region, and then sequencing genes within this region to find the causative mutation(s).
In contrast to the classical approach, the current approaches to finding genes for common and genetically complex traits are not linear, and many steps are works in progress, subject to further defining, refining, or replacement by subsequent steps. Figure 1.2 illustrates the stepwise and recursive nature of the components of a complex trait study. Each step has its own key factors that must be considered, and for complex traits, the order of these steps and the emphasis placed on each will vary from study to study. This fact is underappreciated and contrasts strongly with the classical disease gene discovery approach. Indeed, many of the difficulties in reconciling discordant studies of the same complex trait arise from study‐specific decisions made in the approach.
Figure 1.1 Steps in a Mendelian disease gene discovery (positional cloning) study.
Figure 1.2 Study cycle for a complex trait gene identification study.
This section discusses the steps in Figure 1.2, providing an overview of each component and a guide to the chapter(s) providing more detail on these points.
Define Disease Phenotype
The first step in any disease gene discovery process is to know what phenotype is being studied. This may sound obvious, but specifying the exact measures that will be used to reliably and validly determine the phenotype is often overlooked in the rush to move forward. There are three aspects that need to be considered: clinical definition, determining that a trait has a genetic component, and identification of datasets that can be studied.
Clinical Definition
It is not enough to define a trait in binary terms, such as the presence or absence of Huntington’s disease or diabetes. In Huntington’s disease, for example, there can be wide variation in the symptoms, with some individuals showing only psychological or very mild motor disturbances detectable by expert examination, and the age at which these symptoms begin is similarly variable. In diabetes, there are distinct subtypes (insulin‐dependent diabetes mellitus and non‐insulin‐dependent diabetes mellitus) as well as variable age at onset. Additionally, blood glucose levels (a quantitative trait) are strongly associated with diabetes (a qualitative trait) and could be used as a surrogate measure or endophenotype. One critical role of the clinician in study design is to assess the various diagnostic procedures and tools and determine which ones best define a consistent phenotype. Additionally, dissecting genetically complex diseases usually requires large datasets to supply enough power to unravel genetic effects. For this reason, participant ascertainment often extends to multiple sites. It is critical for multi‐site studies to establish consensus diagnostic procedures and criteria and apply them consistently across sites. For example, the establishment of a consensus diagnostic scheme (McKhann et al. 1984) played an important role in a successful complex disease linkage study in late‐onset familial Alzheimer’s disease (Pericak‐Vance et al. 1991) and in the subsequent identification of the association between Alzheimer’s disease and common variation in the APOE gene (Corder et al. 1993; Corder et al. 1994).
The phenotype assignment must be done in a rigorously consistent fashion. Even a small rate of phenotype error might alter analytic results – in some cases leading to false‐positive results and in others to false‐negative results. Thus, which data will be used to assign the trait status must be carefully determined. Must detailed clinical records of an examination specifically addressing the phenotype be obtained and reviewed for consistency on every participant? Is the self‐report of a participant or a participant’s relative sufficient? Is a note documenting a diagnosis (but no examination findings) from a medical record adequate? Or is direct examination of every participant using a standardized research protocol required? Additionally, investigators must consider whether to collect additional biomarker data (e.g. antibody titers, protein assays) or clinical tests (e.g. electroencephalogram, electrocardiogram, magnetic resonance imaging) that might correlate with the trait of interest. The goal of the phenotyping protocol is to standardize procedures, minimize error in determining the phenotype, and maximize the power of the dataset to detect genes underlying the trait.
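The effect of a small phenotype error rate can be illustrated with a short calculation. The Python sketch below is only an illustration under stated assumptions, not a method taken from this chapter: it assumes a single risk allele, hypothetical carrier frequencies of 0.30 in truly affected and 0.20 in truly unaffected individuals, and misclassification that is unrelated to genotype (non‐differential). The function names and all numbers are invented for the example. Under these assumptions the observed odds ratio is pulled toward 1, which is the false‐negative direction; error that differs by genotype or by site can bias results in other directions and is not modelled here.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def odds_ratio(carrier_freq_cases, carrier_freq_controls):
    """Odds ratio for carrying the risk allele, cases versus controls."""
    return odds(carrier_freq_cases) / odds(carrier_freq_controls)


def observed_odds_ratio(carrier_freq_cases, carrier_freq_controls,
                        case_error=0.0, control_error=0.0):
    """Odds ratio seen when some participants are mislabelled.

    case_error    -- fraction of labelled cases who are truly unaffected
    control_error -- fraction of labelled controls who are truly affected

    Assumption (hypothetical model): mislabelled individuals carry the
    allele at the frequency of their TRUE group, i.e. the error does not
    depend on genotype.
    """
    obs_cases = ((1 - case_error) * carrier_freq_cases
                 + case_error * carrier_freq_controls)
    obs_controls = ((1 - control_error) * carrier_freq_controls
                    + control_error * carrier_freq_cases)
    return odds_ratio(obs_cases, obs_controls)


if __name__ == "__main__":
    # Hypothetical carrier frequencies chosen only for illustration.
    true_cases, true_controls = 0.30, 0.20
    print("true OR: %.3f" % odds_ratio(true_cases, true_controls))
    for error in (0.05, 0.10, 0.20):
        observed = observed_odds_ratio(true_cases, true_controls,
                                       case_error=error, control_error=error)
        print("error rate %.2f -> observed OR %.3f" % (error, observed))
```

Running the script prints the true odds ratio followed by the attenuated values for each error rate, which makes concrete why the phenotyping protocol is treated as a determinant of power rather than as a clerical detail.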
Determining that a Trait Has a Genetic Component
It is critical that as much as possible be known about the genetic basis of a complex trait before determining the most appropriate study design for gene identification. That a trait “runs in families” is insufficient evidence, since this phenomenon can occur for several reasons other than shared genetic susceptibility, including shared environmental exposure and biased ascertainment. As outlined in Chapter 3, there are numerous lines of evidence that can be examined, including family studies, segregation analysis, twin studies, adoption studies, heritability studies, and population‐based risks to relatives of probands (the initially identified individuals with disease). For most traits being contemplated, some such data already exist in the literature. A thorough review of this literature may provide most of the necessary information and point out any missing data. The data may not only indicate the strength of the genetic effect on the trait but also give some indication of the underlying genetic model. For example, there may be obvious evidence of a single “major” gene, such as in Huntington’s disease, or multiple genes interacting in complex ways, such as in multiple sclerosis (Sadovnick et al. 1996).
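Two of these lines of evidence are often summarized with simple ratios, and a small Python sketch may help fix the ideas. It is offered as an illustration only: the recurrence risk ratio (risk to a class of relatives divided by population prevalence) and Falconer's twin‐based estimate, h² = 2(r_MZ − r_DZ), are standard textbook summaries rather than formulas quoted from this section, and the function names and input values are hypothetical. Neither number separates genetic from shared environmental causes on its own; Falconer's estimate in particular assumes that identical and fraternal twin pairs share their environments to the same degree.

```python
def recurrence_risk_ratio(risk_to_relatives, population_prevalence):
    """Risk to a class of relatives of probands divided by population risk.

    A value well above 1 shows familial aggregation, which may reflect
    shared genes, shared environment, or biased ascertainment.
    """
    if not 0.0 < population_prevalence <= 1.0:
        raise ValueError("population_prevalence must be in (0, 1]")
    if not 0.0 <= risk_to_relatives <= 1.0:
        raise ValueError("risk_to_relatives must be in [0, 1]")
    return risk_to_relatives / population_prevalence


def falconer_heritability(r_mz, r_dz):
    """Crude heritability estimate from twin correlations.

    r_mz -- trait correlation in monozygotic (identical) twin pairs
    r_dz -- trait correlation in dizygotic (fraternal) twin pairs
    """
    return 2.0 * (r_mz - r_dz)


if __name__ == "__main__":
    # Hypothetical inputs: 3% risk to siblings, 0.1% population prevalence.
    print("sibling recurrence risk ratio:", recurrence_risk_ratio(0.03, 0.001))
    # Hypothetical twin correlations.
    print("Falconer h^2:", falconer_heritability(0.8, 0.5))
```

In practice the inputs would come from the published family and twin studies identified in the literature review, and the estimates would be read alongside the other lines of evidence rather than in isolation.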
Identification of Datasets
It is helpful to identify early on what potential datasets exist or can be collected. Do large families exist or are most cases apparently sporadic? Are large cohort or case–control studies available? Are there repositories of multiplex families with associated clinical data available? Are there existing clinical networks or large specialty clinics available? Is the necessary phenotype data available in a biobank linked to an existing electronic health record? The answers to