Kevin W. Kelley, Hiromi Inoue, Anna V. Molofsky, Michael C. Oldham
Nat Neurosci. 2018 Sep;21(9):1171-1184. doi: 10.1038/s41593-018-0216-z. Epub 2018 Aug 28.
Link to full publication: https://www.nature.com/articles/s41593-018-0216-z
See also News and Views: https://www.nature.com/articles/s41593-018-0218-x
Understanding the molecular basis of cellular identity in the human central nervous system (CNS) is essential for ascertaining the functions of different cell types and their susceptibilities to diverse pathologies. Because gene expression lies at the root of cellular identity, human CNS transcriptomes provide a natural point of entry for this task. However, nearly all gene expression studies of the human CNS have analyzed heterogeneous (bulk) tissue samples composed of many different cell types. Therefore, it is often assumed that the cellular origins of gene expression in these samples cannot be determined without physically separating cell types. Although such ‘bottom-up’ methods are readily applied to model organisms, they are difficult to apply to the adult human CNS due to its size, limited accessibility, and resistance to dissociation.
An alternative approach is to estimate the covariation between the abundance of individual cell types and transcripts through integrative gene coexpression analysis of bulk tissue samples. This ‘top-down’ approach assumes that variation in cellular composition among biological replicate samples will drive covariation of transcripts that are uniquely or predominantly expressed in specific cell types. In contrast to single-cell methods, this approach is based on aggregate analysis of many billions of cells, and therefore permits highly robust inferences about the core transcriptional identities of major cell types.
By analyzing gene coexpression relationships in 62 datasets consisting of >7000 neurotypical adult human samples representing all major CNS regions and technology platforms, we identified consensus transcriptional signatures of astrocytes, oligodendrocytes, microglia, and neurons. We created a novel metric called ‘fidelity’, which quantifies the extent to which a gene’s expression levels are correlated with the inferred abundance of a cell type over all analyzed samples. Genes with the highest expression fidelity for a cell type are expressed with high sensitivity and specificity (Fig. 1).
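As a rough illustration of the idea behind fidelity, the following sketch computes a Pearson correlation between a gene's expression levels and an inferred cell-type abundance within one small set of samples. It is a simplification under stated assumptions, not the published metric: the helper `pearson`, the variable names, and all numeric values are hypothetical, and the actual fidelity score is computed over all analyzed samples as described in the accompanying publication.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical values for six bulk samples (not real data).
oligo_abundance = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]  # inferred cell-type abundance
marker_gene = [1.1, 2.4, 4.2, 5.4, 7.1, 8.3]            # tracks abundance closely
unrelated_gene = [5.0, 3.0, 6.0, 2.0, 5.5, 3.5]         # no clear relationship

marker_score = pearson(marker_gene, oligo_abundance)
unrelated_score = pearson(unrelated_gene, oligo_abundance)
print(f"marker gene: {marker_score:.3f}, unrelated gene: {unrelated_score:.3f}")
```

In this toy setting the gene that tracks abundance scores close to 1, while the unrelated gene scores near 0, which is the intuition behind ranking genes by how well they follow a cell type's inferred abundance.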
This web site allows users to explore the covariation of gene expression levels and cellular abundance in the human CNS in two ways: by querying a CNS region and cell type, or by querying a CNS region and gene.
Twenty-one broad CNS regions and the four major CNS cell types (neurons, astrocytes, oligodendrocytes, and microglia) are currently supported. Annotated usage examples are provided below. For additional details, please refer to the publication that accompanies this web site (citation above).
Fig. 2 provides an example of the ‘wheel plot’ that is returned after querying a particular CNS region / cell type. In this case, the CNS region is ‘All’ and the cell type is ‘astrocyte’. The meaning of each track is explained below the figure caption.
Wheel plots are downloadable as PDFs. A complete list of genes ranked by expression fidelity for a given CNS region / cell type, along with absolute expression levels, can be downloaded as a CSV file from the link adjacent to the wheel plot. Complete tables for all CNS regions / cell types are provided on the Data Download page.
Fig. 3 provides an example of the data that are returned after querying a particular CNS region / gene. In this case, the CNS region is ‘All’ and the gene symbol is ‘CNP’. The meaning of each panel is explained below the figure caption.
A) Genome-wide distributions of expression fidelity for astrocytes (A), oligodendrocytes (O), microglia (M), and neurons (N) over all analyzed samples are shown. The horizontal line denotes the expression fidelity of the query gene (here, CNP) for each cell type. The dashed horizontal line (only visible when analyzing genes over ‘All’ regions) denotes the threshold above which all fidelity scores had Confidence = 100 (see panel B).
B) This table provides more information on the data used to produce the distributions in panel (A):
C) Mean expression percentile ranks of the query gene in all analyzed datasets. Shapes and colors of points denote technology platforms and CNS regions, respectively.
D) Modeling results for expression levels of the query gene as a function of variation in the abundance of individual cell types. Simple linear regression is used to predict gene expression levels as a function of estimated neuron, astrocyte, oligodendrocyte, or microglia abundance in all human CNS regional datasets containing the query gene. Adjusted r² values and t-values are shown for each cell type (labeled by color, with shapes denoting technology platforms). In this example, CNP expression is consistently well-modeled by variation in oligodendrocyte abundance across datasets.
E) Modeling results for expression levels of the query gene as a function of variation in the abundance of all cell types. Multiple linear regression is used to predict gene expression levels as a function of estimated neuron, astrocyte, oligodendrocyte, and microglia abundance in all human CNS regional datasets containing the query gene. Adjusted r² values and F-statistics are shown (with colors and shapes denoting CNS regions and technology platforms, respectively). In this example, the full ‘AOMN model’ can explain ~90% of expression variation for CNP, on average, over all datasets.
F) The top 12 genes ranked by their aggregate correlation to the query gene’s expression levels over all analyzed samples. Among all ~18,500 genes in the database, CNP is most strongly correlated with MAG, followed by MOG, etc.
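The ranking described for panel F can be approximated, for a single expression matrix, by correlating every gene with the query gene and sorting the results. This is a minimal sketch only: the gene names and values are placeholders, the helpers `pearson` and `top_correlated` are hypothetical, and the site's aggregate correlation is computed over all analyzed samples rather than one toy matrix.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def top_correlated(expression, query, k):
    """Rank all other genes by correlation with the query gene; return the top k."""
    scores = [(gene, pearson(values, expression[query]))
              for gene, values in expression.items() if gene != query]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]

# Placeholder expression matrix: gene -> values across five samples (not real data).
toy_expression = {
    "QUERY":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_1": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with the query gene
    "GENE_2": [1.0, 3.0, 2.0, 5.0, 4.0],   # rises with it less consistently
    "GENE_3": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as the query gene rises
}

top_two = top_correlated(toy_expression, "QUERY", 2)
for gene, score in top_two:
    print(gene, round(score, 2))
```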
Region Abbreviations: FCX = frontal cortex; PCX = parietal cortex; TCX = temporal cortex; LIM = limbic cortex; IN = insular cortex; OCX = occipital cortex; BF = basal forebrain; CLA = claustrum; AMY = amygdala; HIP = hippocampus; STR = striatum; GP = globus pallidus; DI = diencephalon; MID = midbrain; PON = pons; MED = medulla; SC = spinal cord; CB = cerebellum; WM = white matter.
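The regression summaries described for panels D and E (adjusted r², t-values, and F-statistics) can be sketched with ordinary least squares. The example below is an illustration under stated assumptions, not the site's pipeline: the function `fit_ols`, the column order, and the simulated abundance and expression values are all hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares of y on the columns of X, with an intercept.

    Returns (coefficients, t_values, adjusted_r2, f_statistic).
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # a single predictor becomes one column
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    dof = n - p - 1
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / dof
    f_stat = (r2 / p) / ((1.0 - r2) / dof)
    # Standard errors from the diagonal of sigma^2 * (X'X)^-1.
    cov = (ss_res / dof) * np.linalg.inv(design.T @ design)
    t_values = coef / np.sqrt(np.diag(cov))
    return coef, t_values, adj_r2, f_stat

# Simulated inputs (not real data): 30 samples, four abundance estimates
# in the assumed column order astrocyte, oligodendrocyte, microglia, neuron.
rng = np.random.default_rng(0)
abundance = rng.uniform(0.0, 1.0, size=(30, 4))
# Toy gene driven mainly by the oligodendrocyte column, plus small noise.
expression = 2.0 + 8.0 * abundance[:, 1] + rng.normal(0.0, 0.1, size=30)

# Panel D style: one simple regression per cell type.
single = {name: fit_ols(abundance[:, j], expression)
          for j, name in enumerate(["A", "O", "M", "N"])}
# Panel E style: one multiple regression on all four cell types.
full_coef, full_t, full_adj_r2, full_f = fit_ols(abundance, expression)
```

In this simulation the single-predictor model for the oligodendrocyte column fits far better than the other three, which mirrors how a gene such as CNP would stand out in panel D, while the full model summarizes the joint fit reported in panel E.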
The data presented on this web site are derived from analyses that are ultimately correlative in nature. Therefore, it is important to acknowledge that covariation of gene expression levels and cellular abundance in bulk tissue samples can be confounded by diverse sources of biological and technical variation. For example, the analyses presented in Fig. 3 indicate that CNP has extremely high fidelity for oligodendrocytes, which is consistent with its reputation as a canonical marker of this cell type. However, CNP also appears to have much higher fidelity for microglia than for neurons or astrocytes. This result is driven by covariation of microglial and oligodendroglial abundance in human CNS samples, since both cell types are more abundant in white matter than in gray matter. Therefore, variation in the ratio of gray matter to white matter will drive covariation in the abundance of these cell types and in the genes that they express. Spurious correlations may also result from alternative splicing, microarray probe failures, batch effects, and limitations in dynamic range (sensitivity / saturation for microarray probes and transcript coverage / read depth for RNA-seq). Notwithstanding these limitations, the genes with the highest fidelity for a cell type are already remarkably stable. Ongoing efforts to assimilate new datasets will provide further improvements.
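The white matter confound described above can be reproduced in a toy setting. In the sketch below, two cell types both become more abundant as the white matter fraction of a sample rises, and a gene expressed only by the first cell type still correlates strongly with the abundance of the second. All values, names, and the 0.5 scaling factor are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical values for eight bulk samples (not real data).
white_matter = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # white matter fraction
oligo_jitter = [0.02, -0.03, 0.01, -0.02, 0.03, -0.01, 0.02, -0.02]
micro_jitter = [-0.03, 0.02, 0.03, -0.01, -0.02, 0.03, -0.03, 0.01]

# Both cell types become more abundant as the white matter fraction rises.
oligo = [w + e for w, e in zip(white_matter, oligo_jitter)]
micro = [0.5 * w + e for w, e in zip(white_matter, micro_jitter)]

# Toy gene expressed only by oligodendrocytes.
gene = [10.0 * o for o in oligo]

gene_vs_oligo = pearson(gene, oligo)  # essentially 1 by construction
gene_vs_micro = pearson(gene, micro)  # still high, despite no microglial expression
print(f"vs oligodendrocytes: {gene_vs_oligo:.3f}, vs microglia: {gene_vs_micro:.3f}")
```

The correlation with microglial abundance here arises entirely from the shared dependence on white matter content, which is why high fidelity for a second cell type should be interpreted with caution.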