Compositional analysis#
A central question in scRNA-seq experiments is if cell-type proportions have been changed between conditions. This seemingly simple question is technically challenging, due to the compositional nature of single-cell data. That is, if the abundance of a certain cell-type is increased, as a consequence, the abundance of all other cell-types will be decreased, since the overall number of cells profiled is limited. On top of that, cell proportions are not represented in an unbiased manner in scRNA-seq data, as, depending on the protocol, different cell-types are captured with different efficiencies [Lambrechts et al., 2018, Salcher et al., 2022].
Note
Several alternative methods are available for comparing compositional data. scCODA [Büttner et al., 2021] is a Bayesian model for compositional data analysis that models the uncertainty of cell-type fractions of each sample. It requires the definition of a reference cell-type that is assumed to be unchanged between conditions. tascCODA [Ostner et al., 2021] is an extension of the scCODA model that additionally takes the hierarchical relationships of cell lineages into account. Propeller [Phipson et al., 2022] uses a log-linear model to model cell-type proportions and was demonstrated to have high statistical power with few biological replicates. Finally, sccomp [Mangiola et al., 2022] provides a highly-flexible statistical framework that considers the presence of outliers and models group-specific variability of cell-type proportions.
Another group of tools works independent of discrete cell-types and are useful for finding more subtle changes in functional states based on the cell × cell neighborhood graph. DA-seq [Zhao et al., 2021] computes a differential abundance (DA)-score for each cell, based on the prevalence of conditions in neighborhoods of multiple sizes using a logistic regression classifier. Similarly, Milo [Dann et al., 2022] tests if in certain parts of the neighborhood graph cells from a certain condition are over-represented. Thanks to its statistics being based on a GLM with negative binomial noise model, it allows for flexible modeling of experimental designs and covariates.
There’s a dedicated chapter in the single-cell best practices book which provides additional information on compositional analyses.
In Salcher et al. [2022], we used scCODA for comparing cell-type fractions. In this section, we demonstrate how to apply it to a single-cell atlas. In this example, we are going to compare cell-type fractions between the two tumor types, LUAD and LUSC.
1. Import the required libraries#
import os
# Set tensorflow logging level to only display warnings and errors
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"
import altair as alt
import pandas as pd
import scanpy as sc
import sccoda.util.cell_composition_data as scc_dat
import sccoda.util.comp_ana as scc_ana
import tensorflow as tf
tf.random.set_seed(0)
2. Load the input data#
adata_path = "../../data/input_data_zenodo/atlas-integrated-annotated.h5ad"
adata = sc.read_h5ad(adata_path)
3. Compute cell-type count matrix#
As a first step, we compute the number of cells per sample and cell-type. This matrix is the basis for both qualitative visualization and quantitative analysis using scCODA.
Subset the AnnData object to the samples of interest, in our case only primary tumor samples and only patients with LUAD and LUSC.
adata_subset = adata[
(adata.obs["origin"] == "tumor_primary") & adata.obs["condition"].isin(["LUAD", "LUSC"]),
:,
]
Create a DataFrame with counts per cell-type using pandas
cells_per_patient = (
adata_subset.obs.groupby(
# groupby needs to include all covariates of interest, the column with
# the biological replicate (patient) and the cell-type
["dataset", "condition", "tumor_stage", "patient", "cell_type_coarse"],
observed=True,
)
.size()
.unstack(fill_value=0)
)
cells_per_patient
cell_type_coarse | B cell | T cell | Epithelial cell | Macrophage/Monocyte | Mast cell | Plasma cell | cDC | Stromal | NK cell | Endothelial cell | pDC | Neutrophils | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dataset | condition | tumor_stage | patient | ||||||||||||
Lambrechts_Thienpont_2018_6653 | LUSC | early | Lambrechts_Thienpont_2018_6653_7 | 405 | 759 | 718 | 191 | 125 | 160 | 56 | 22 | 127 | 14 | 4 | 1 |
LUAD | advanced | Lambrechts_Thienpont_2018_6653_6 | 994 | 2620 | 303 | 274 | 68 | 251 | 39 | 27 | 202 | 27 | 6 | 0 | |
Maynard_Bivona_2020 | LUAD | early | Maynard_Bivona_2020_TH238 | 9 | 300 | 325 | 115 | 111 | 41 | 100 | 146 | 11 | 62 | 5 | 3 |
advanced | Maynard_Bivona_2020_TH226 | 7 | 284 | 575 | 371 | 112 | 91 | 28 | 300 | 33 | 113 | 8 | 115 | ||
Maynard_Bivona_2020_TH236 | 49 | 140 | 160 | 65 | 74 | 107 | 19 | 219 | 5 | 82 | 2 | 2 | |||
Maynard_Bivona_2020_TH179 | 16 | 429 | 173 | 323 | 94 | 121 | 96 | 30 | 14 | 83 | 50 | 1 | |||
Maynard_Bivona_2020_TH248 | 32 | 154 | 58 | 73 | 12 | 171 | 6 | 127 | 30 | 17 | 2 | 0 | |||
Maynard_Bivona_2020_TH231 | 2 | 4 | 375 | 24 | 5 | 0 | 2 | 2 | 2 | 2 | 3 | 87 | |||
Maynard_Bivona_2020_TH158 | 10 | 70 | 8 | 7 | 3 | 0 | 1 | 9 | 5 | 1 | 0 | 0 | |||
Maynard_Bivona_2020_TH169 | 0 | 10 | 138 | 4 | 6 | 10 | 1 | 3 | 0 | 3 | 0 | 0 | |||
UKIM-V | LUSC | early | UKIM-V_P2 | 97 | 969 | 2187 | 641 | 53 | 182 | 140 | 29 | 67 | 10 | 8 | 2579 |
LUAD | early | UKIM-V_P1 | 82 | 273 | 133 | 30 | 3 | 7 | 27 | 17 | 50 | 1 | 25 | 22 | |
UKIM-V_P3 | 132 | 1376 | 387 | 1134 | 23 | 28 | 153 | 17 | 327 | 11 | 15 | 166 |
4. Visualization as bar chart#
For a first (qualitative) impression, we can make a bar chart to compare cell-type fractions between conditions.
Transform the count matrix from above into a table of average cell-type fracitons per condition. We compute the mean cell-type fraction per patient, in order to give equal weight to each patient (this is particularly important if there are different numbers of cells per patient)
average_fractions_per_condition = (
cells_per_patient.apply(lambda x: x / x.sum(), axis=1)
.melt(ignore_index=False, value_name="frac")
.reset_index()
.groupby(["condition", "cell_type_coarse"], observed=True)
.agg(mean_frac=pd.NamedAgg("frac", "mean"))
.reset_index()
)
average_fractions_per_condition.head()
condition | cell_type_coarse | mean_frac | |
---|---|---|---|
0 | LUSC | B cell | 0.085394 |
1 | LUSC | Endothelial cell | 0.003429 |
2 | LUSC | Epithelial cell | 0.296106 |
3 | LUSC | Macrophage/Monocyte | 0.083022 |
4 | LUSC | Mast cell | 0.028012 |
Create a bar chart using
altair
:
alt.Chart(average_fractions_per_condition).encode(y="condition", x="mean_frac", color="cell_type_coarse").mark_bar()
5. Quantitative analysis using scCODA#
Create an
AnnData
object for scCODA usingfrom_pandas()
sccoda_data = scc_dat.from_pandas(
cells_per_patient.reset_index(),
# we need to specify all columns that do not contain cell-type counts as covariate columns
covariate_columns=["patient", "dataset", "condition", "tumor_stage"],
)
Make “condition” a categorical column. The first category is considered the “base” category (i.e. denominator of fold changes)
sccoda_data.obs["condition"] = pd.Categorical(sccoda_data.obs["condition"], categories=["LUSC", "LUAD"])
Create the scCODA model. In
formula
, specify the condition and all covariates you would like to include into the model.
Important
scCODA requires to specify a “reference cell-type” that is considered unchanged between the conditions. For studying the tumor microenvironment, we can set the reference cell-type to Epithelial cells/Tumor cells to capture changes of the tumor microenvironment
sccoda_mod = scc_ana.CompositionalAnalysis(
sccoda_data,
formula="condition + tumor_stage + dataset",
# TODO consider changing reference cell-type to "tumor cells" in the final version
reference_cell_type="Epithelial cell",
)
Zero counts encountered in data! Added a pseudocount of 0.5.
Perform Hamiltonian Monte Carlo (HMC) sampling to estimate coefficients
# TODO: suitable number of iterations
sccoda_res = sccoda_mod.sample_hmc(num_results=20000)
WARNING:tensorflow:From /data/scratch/sturm/conda/envs/2023-atlas-protocol/lib/python3.10/site-packages/tensorflow_probability/python/internal/batched_rejection_sampler.py:102: calling while_loop_v2 (from tensorflow.python.ops.control_flow_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.while_loop(c, b, vars, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.while_loop(c, b, vars))
MCMC sampling finished. (312.484 sec)
Acceptance rate: 51.2%
Set the false-discovery-rate (FDR) level for the results.
Note
A smaller FDR value will produce more conservative results, but might miss some effects, while a larger FDR value selects more effects at the cost of a larger number of false discoveries. scCODA only computes a fold-change for cell-types that achieve an FDR smaller than the specified threshold. For more details, please refer to the scCODA documentation.
# TODO: suitable FDR based on final dataset
sccoda_res.set_fdr(0.5)
Inspect the results
sccoda_res.summary()
Compositional Analysis summary:
Data: 13 samples, 12 cell types
Reference index: 2
Formula: condition + tumor_stage + dataset
Intercepts:
Final Parameter Expected Sample
Cell Type
B cell 0.059 150.580711
T cell 1.408 580.273315
Epithelial cell 1.098 425.599697
Macrophage/Monocyte 0.378 207.161613
Mast cell -0.408 94.396050
Plasma cell -0.230 112.786790
cDC -0.508 85.413078
Stromal -0.561 81.004056
NK cell -0.368 98.248426
Endothelial cell -0.810 63.149140
pDC -1.166 44.234243
Neutrophils -1.059 49.229805
Effects:
Final Parameter \
Covariate Cell Type
condition[T.LUAD] B cell 0.004939
T cell 0.326575
Epithelial cell 0.000000
Macrophage/Monocyte 0.039490
Mast cell -0.046844
Plasma cell -0.017538
cDC 0.020343
Stromal 0.102844
NK cell -0.021084
Endothelial cell 0.070012
pDC 0.089475
Neutrophils -0.087406
tumor_stage[T.early] B cell 0.038741
T cell 0.041858
Epithelial cell 0.000000
Macrophage/Monocyte 0.089068
Mast cell -0.036403
Plasma cell -0.028537
cDC 0.111079
Stromal -0.020042
NK cell 0.061678
Endothelial cell -0.072053
pDC 0.063247
Neutrophils 0.069296
dataset[T.Maynard_Bivona_2020] B cell -0.789721
T cell -0.859565
Epithelial cell 0.000000
Macrophage/Monocyte -0.089072
Mast cell 0.301332
Plasma cell 0.023971
cDC -0.163217
Stromal 0.573186
NK cell -0.428521
Endothelial cell 0.389097
pDC -0.026125
Neutrophils 0.000658
dataset[T.UKIM-V] B cell -0.156524
T cell -0.086085
Epithelial cell 0.000000
Macrophage/Monocyte 0.139682
Mast cell -0.261072
Plasma cell -0.249341
cDC 0.095798
Stromal -0.139784
NK cell 0.111914
Endothelial cell -0.308733
pDC 0.048082
Neutrophils 0.867303
Expected Sample \
Covariate Cell Type
condition[T.LUAD] B cell 135.055282
T cell 717.893260
Epithelial cell 379.838278
Macrophage/Monocyte 192.334372
Mast cell 80.390921
Plasma cell 98.909734
cDC 77.795874
Stromal 80.125156
NK cell 85.855131
Endothelial cell 60.446437
pDC 43.173224
Neutrophils 40.259253
tumor_stage[T.early] B cell 151.936444
T cell 587.325675
Epithelial cell 413.113203
Macrophage/Monocyte 219.815713
Mast cell 88.351134
Plasma cell 106.397722
cDC 92.647346
Stromal 77.067321
NK cell 101.433140
Endothelial cell 57.035186
pDC 45.739801
Neutrophils 51.214237
dataset[T.Maynard_Bivona_2020] B cell 83.134691
T cell 298.753414
Epithelial cell 517.589852
Macrophage/Monocyte 230.467765
Mast cell 155.169029
Plasma cell 140.492525
cDC 88.231706
Stromal 174.752280
NK cell 77.840559
Endothelial cell 113.327328
pDC 52.407931
Neutrophils 59.909841
dataset[T.UKIM-V] B cell 130.213458
T cell 538.406462
Epithelial cell 430.392988
Macrophage/Monocyte 240.899627
Mast cell 73.525055
Plasma cell 88.886316
cDC 95.058913
Stromal 71.230055
NK cell 111.120169
Endothelial cell 46.897578
pDC 46.935808
Neutrophils 118.510491
log2-fold change
Covariate Cell Type
condition[T.LUAD] B cell -0.156987
T cell 0.307037
Epithelial cell -0.164112
Macrophage/Monocyte -0.107140
Mast cell -0.231694
Plasma cell -0.189414
cDC -0.134763
Stromal -0.015739
NK cell -0.194530
Endothelial cell -0.063106
pDC -0.035027
Neutrophils -0.290212
tumor_stage[T.early] B cell 0.012931
T cell 0.017428
Epithelial cell -0.042960
Macrophage/Monocyte 0.085538
Mast cell -0.095478
Plasma cell -0.084131
cDC 0.117293
Stromal -0.071875
NK cell 0.046023
Endothelial cell -0.146911
pDC 0.048286
Neutrophils 0.057013
dataset[T.Maynard_Bivona_2020] B cell -0.857014
T cell -0.957777
Epithelial cell 0.282312
Macrophage/Monocyte 0.153808
Mast cell 0.717042
Plasma cell 0.316895
cDC 0.046840
Stromal 1.109245
NK cell -0.335912
Endothelial cell 0.843661
pDC 0.244622
Neutrophils 0.283261
dataset[T.UKIM-V] B cell -0.209658
T cell -0.108037
Epithelial cell 0.016157
Macrophage/Monocyte 0.217675
Mast cell -0.360491
Plasma cell -0.343565
cDC 0.154365
Stromal -0.185508
NK cell 0.177615
Endothelial cell -0.429250
pDC 0.085525
Neutrophils 1.267411
Compute “credible effects” based on the FDR. This will select cell-types that meet the FDR threshold.
Note
condition[T.LUAD]
refers to the coefficient of the model that captures the differences inf LUAD
compared to LUSC
(which se set as the reference). The name of the coefficient can be found in the summary output above. By chosing a differen coefficient (e.g. tumor_stage[T.early]
) we could test the effects of a different variable (in this case tumor_stage
) on the cell-type composition.
credible_effects_condition = sccoda_res.credible_effects()["condition[T.LUAD]"]
Make a plot of log2 fold changes between LUAD and LUSC using
altair
.
alt.Chart(
sccoda_res.effect_df.loc["condition[T.LUAD]"].loc[credible_effects_condition].reset_index(),
title="condition",
).mark_bar().encode(
x=alt.X("Cell Type", sort="y"),
y="log2-fold change",
color=alt.Color("Cell Type"),
)