Compositional analysis#

A central question in scRNA-seq experiments is if cell-type proportions have been changed between conditions. This seemingly simple question is technically challenging, due to the compositional nature of single-cell data. That is, if the abundance of a certain cell-type is increased, as a consequence, the abundance of all other cell-types will be decreased, since the overall number of cells profiled is limited. On top of that, cell proportions are not represented in an unbiased manner in scRNA-seq data, as, depending on the protocol, different cell-types are captured with different efficiencies [Lambrechts et al., 2018, Salcher et al., 2022].

Note

Several alternative methods are available for comparing compositional data. scCODA [Büttner et al., 2021] is a Bayesian model for compositional data analysis that models the uncertainty of cell-type fractions of each sample. It requires the definition of a reference cell-type that is assumed to be unchanged between conditions. tascCODA [Ostner et al., 2021] is an extension of the scCODA model that additionally takes the hierarchical relationships of cell lineages into account. Propeller [Phipson et al., 2022] uses a log-linear model to model cell-type proportions and was demonstrated to have high statistical power with few biological replicates. Finally, sccomp [Mangiola et al., 2022] provides a highly-flexible statistical framework that considers the presence of outliers and models group-specific variability of cell-type proportions.

Another group of tools works independent of discrete cell-types and are useful for finding more subtle changes in functional states based on the cell × cell neighborhood graph. DA-seq [Zhao et al., 2021] computes a differential abundance (DA)-score for each cell, based on the prevalence of conditions in neighborhoods of multiple sizes using a logistic regression classifier. Similarly, Milo [Dann et al., 2022] tests if in certain parts of the neighborhood graph cells from a certain condition are over-represented. Thanks to its statistics being based on a GLM with negative binomial noise model, it allows for flexible modeling of experimental designs and covariates.

There’s a dedicated chapter in the single-cell best practices book which provides additional information on compositional analyses.

In Salcher et al. [2022], we used scCODA for comparing cell-type fractions. In this section, we demonstrate how to apply it to a single-cell atlas. In this example, we are going to compare cell-type fractions between the two tumor types, LUAD and LUSC.

1. Import the required libraries#

import os

# Set tensorflow logging level to only display warnings and errors
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"

import altair as alt
import pandas as pd
import scanpy as sc
import sccoda.util.cell_composition_data as scc_dat
import sccoda.util.comp_ana as scc_ana
import tensorflow as tf

tf.random.set_seed(0)

2. Load the input data#

adata_path = "../../data/input_data_zenodo/atlas-integrated-annotated.h5ad"
adata = sc.read_h5ad(adata_path)

3. Compute cell-type count matrix#

As a first step, we compute the number of cells per sample and cell-type. This matrix is the basis for both qualitative visualization and quantitative analysis using scCODA.

  1. Subset the AnnData object to the samples of interest, in our case only primary tumor samples and only patients with LUAD and LUSC.

adata_subset = adata[
    (adata.obs["origin"] == "tumor_primary") & adata.obs["condition"].isin(["LUAD", "LUSC"]),
    :,
]
  1. Create a DataFrame with counts per cell-type using pandas

cells_per_patient = (
    adata_subset.obs.groupby(
        # groupby needs to include all covariates of interest, the column with
        # the biological replicate (patient) and the cell-type
        ["dataset", "condition", "tumor_stage", "patient", "cell_type_coarse"],
        observed=True,
    )
    .size()
    .unstack(fill_value=0)
)
cells_per_patient
cell_type_coarse B cell T cell Epithelial cell Macrophage/Monocyte Mast cell Plasma cell cDC Stromal NK cell Endothelial cell pDC Neutrophils
dataset condition tumor_stage patient
Lambrechts_Thienpont_2018_6653 LUSC early Lambrechts_Thienpont_2018_6653_7 405 759 718 191 125 160 56 22 127 14 4 1
LUAD advanced Lambrechts_Thienpont_2018_6653_6 994 2620 303 274 68 251 39 27 202 27 6 0
Maynard_Bivona_2020 LUAD early Maynard_Bivona_2020_TH238 9 300 325 115 111 41 100 146 11 62 5 3
advanced Maynard_Bivona_2020_TH226 7 284 575 371 112 91 28 300 33 113 8 115
Maynard_Bivona_2020_TH236 49 140 160 65 74 107 19 219 5 82 2 2
Maynard_Bivona_2020_TH179 16 429 173 323 94 121 96 30 14 83 50 1
Maynard_Bivona_2020_TH248 32 154 58 73 12 171 6 127 30 17 2 0
Maynard_Bivona_2020_TH231 2 4 375 24 5 0 2 2 2 2 3 87
Maynard_Bivona_2020_TH158 10 70 8 7 3 0 1 9 5 1 0 0
Maynard_Bivona_2020_TH169 0 10 138 4 6 10 1 3 0 3 0 0
UKIM-V LUSC early UKIM-V_P2 97 969 2187 641 53 182 140 29 67 10 8 2579
LUAD early UKIM-V_P1 82 273 133 30 3 7 27 17 50 1 25 22
UKIM-V_P3 132 1376 387 1134 23 28 153 17 327 11 15 166

4. Visualization as bar chart#

For a first (qualitative) impression, we can make a bar chart to compare cell-type fractions between conditions.

  1. Transform the count matrix from above into a table of average cell-type fracitons per condition. We compute the mean cell-type fraction per patient, in order to give equal weight to each patient (this is particularly important if there are different numbers of cells per patient)

average_fractions_per_condition = (
    cells_per_patient.apply(lambda x: x / x.sum(), axis=1)
    .melt(ignore_index=False, value_name="frac")
    .reset_index()
    .groupby(["condition", "cell_type_coarse"], observed=True)
    .agg(mean_frac=pd.NamedAgg("frac", "mean"))
    .reset_index()
)
average_fractions_per_condition.head()
condition cell_type_coarse mean_frac
0 LUSC B cell 0.085394
1 LUSC Endothelial cell 0.003429
2 LUSC Epithelial cell 0.296106
3 LUSC Macrophage/Monocyte 0.083022
4 LUSC Mast cell 0.028012
  1. Create a bar chart using altair:

alt.Chart(average_fractions_per_condition).encode(y="condition", x="mean_frac", color="cell_type_coarse").mark_bar()

5. Quantitative analysis using scCODA#

  1. Create an AnnData object for scCODA using from_pandas()

sccoda_data = scc_dat.from_pandas(
    cells_per_patient.reset_index(),
    # we need to specify all columns that do not contain cell-type counts as covariate columns
    covariate_columns=["patient", "dataset", "condition", "tumor_stage"],
)
  1. Make “condition” a categorical column. The first category is considered the “base” category (i.e. denominator of fold changes)

sccoda_data.obs["condition"] = pd.Categorical(sccoda_data.obs["condition"], categories=["LUSC", "LUAD"])
  1. Create the scCODA model. In formula, specify the condition and all covariates you would like to include into the model.

Important

scCODA requires to specify a “reference cell-type” that is considered unchanged between the conditions. For studying the tumor microenvironment, we can set the reference cell-type to Epithelial cells/Tumor cells to capture changes of the tumor microenvironment

sccoda_mod = scc_ana.CompositionalAnalysis(
    sccoda_data,
    formula="condition + tumor_stage + dataset",
    # TODO consider changing reference cell-type to "tumor cells" in the final version
    reference_cell_type="Epithelial cell",
)
Zero counts encountered in data! Added a pseudocount of 0.5.
  1. Perform Hamiltonian Monte Carlo (HMC) sampling to estimate coefficients

# TODO: suitable number of iterations
sccoda_res = sccoda_mod.sample_hmc(num_results=20000)
WARNING:tensorflow:From /data/scratch/sturm/conda/envs/2023-atlas-protocol/lib/python3.10/site-packages/tensorflow_probability/python/internal/batched_rejection_sampler.py:102: calling while_loop_v2 (from tensorflow.python.ops.control_flow_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.while_loop(c, b, vars, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.while_loop(c, b, vars))
MCMC sampling finished. (312.484 sec)
Acceptance rate: 51.2%
  1. Set the false-discovery-rate (FDR) level for the results.

Note

A smaller FDR value will produce more conservative results, but might miss some effects, while a larger FDR value selects more effects at the cost of a larger number of false discoveries. scCODA only computes a fold-change for cell-types that achieve an FDR smaller than the specified threshold. For more details, please refer to the scCODA documentation.

# TODO: suitable FDR based on final dataset
sccoda_res.set_fdr(0.5)
  1. Inspect the results

sccoda_res.summary()
Compositional Analysis summary:

Data: 13 samples, 12 cell types
Reference index: 2
Formula: condition + tumor_stage + dataset

Intercepts:
                     Final Parameter  Expected Sample
Cell Type                                            
B cell                         0.059       150.580711
T cell                         1.408       580.273315
Epithelial cell                1.098       425.599697
Macrophage/Monocyte            0.378       207.161613
Mast cell                     -0.408        94.396050
Plasma cell                   -0.230       112.786790
cDC                           -0.508        85.413078
Stromal                       -0.561        81.004056
NK cell                       -0.368        98.248426
Endothelial cell              -0.810        63.149140
pDC                           -1.166        44.234243
Neutrophils                   -1.059        49.229805


Effects:
                                                    Final Parameter  \
Covariate                      Cell Type                              
condition[T.LUAD]              B cell                      0.004939   
                               T cell                      0.326575   
                               Epithelial cell             0.000000   
                               Macrophage/Monocyte         0.039490   
                               Mast cell                  -0.046844   
                               Plasma cell                -0.017538   
                               cDC                         0.020343   
                               Stromal                     0.102844   
                               NK cell                    -0.021084   
                               Endothelial cell            0.070012   
                               pDC                         0.089475   
                               Neutrophils                -0.087406   
tumor_stage[T.early]           B cell                      0.038741   
                               T cell                      0.041858   
                               Epithelial cell             0.000000   
                               Macrophage/Monocyte         0.089068   
                               Mast cell                  -0.036403   
                               Plasma cell                -0.028537   
                               cDC                         0.111079   
                               Stromal                    -0.020042   
                               NK cell                     0.061678   
                               Endothelial cell           -0.072053   
                               pDC                         0.063247   
                               Neutrophils                 0.069296   
dataset[T.Maynard_Bivona_2020] B cell                     -0.789721   
                               T cell                     -0.859565   
                               Epithelial cell             0.000000   
                               Macrophage/Monocyte        -0.089072   
                               Mast cell                   0.301332   
                               Plasma cell                 0.023971   
                               cDC                        -0.163217   
                               Stromal                     0.573186   
                               NK cell                    -0.428521   
                               Endothelial cell            0.389097   
                               pDC                        -0.026125   
                               Neutrophils                 0.000658   
dataset[T.UKIM-V]              B cell                     -0.156524   
                               T cell                     -0.086085   
                               Epithelial cell             0.000000   
                               Macrophage/Monocyte         0.139682   
                               Mast cell                  -0.261072   
                               Plasma cell                -0.249341   
                               cDC                         0.095798   
                               Stromal                    -0.139784   
                               NK cell                     0.111914   
                               Endothelial cell           -0.308733   
                               pDC                         0.048082   
                               Neutrophils                 0.867303   

                                                    Expected Sample  \
Covariate                      Cell Type                              
condition[T.LUAD]              B cell                    135.055282   
                               T cell                    717.893260   
                               Epithelial cell           379.838278   
                               Macrophage/Monocyte       192.334372   
                               Mast cell                  80.390921   
                               Plasma cell                98.909734   
                               cDC                        77.795874   
                               Stromal                    80.125156   
                               NK cell                    85.855131   
                               Endothelial cell           60.446437   
                               pDC                        43.173224   
                               Neutrophils                40.259253   
tumor_stage[T.early]           B cell                    151.936444   
                               T cell                    587.325675   
                               Epithelial cell           413.113203   
                               Macrophage/Monocyte       219.815713   
                               Mast cell                  88.351134   
                               Plasma cell               106.397722   
                               cDC                        92.647346   
                               Stromal                    77.067321   
                               NK cell                   101.433140   
                               Endothelial cell           57.035186   
                               pDC                        45.739801   
                               Neutrophils                51.214237   
dataset[T.Maynard_Bivona_2020] B cell                     83.134691   
                               T cell                    298.753414   
                               Epithelial cell           517.589852   
                               Macrophage/Monocyte       230.467765   
                               Mast cell                 155.169029   
                               Plasma cell               140.492525   
                               cDC                        88.231706   
                               Stromal                   174.752280   
                               NK cell                    77.840559   
                               Endothelial cell          113.327328   
                               pDC                        52.407931   
                               Neutrophils                59.909841   
dataset[T.UKIM-V]              B cell                    130.213458   
                               T cell                    538.406462   
                               Epithelial cell           430.392988   
                               Macrophage/Monocyte       240.899627   
                               Mast cell                  73.525055   
                               Plasma cell                88.886316   
                               cDC                        95.058913   
                               Stromal                    71.230055   
                               NK cell                   111.120169   
                               Endothelial cell           46.897578   
                               pDC                        46.935808   
                               Neutrophils               118.510491   

                                                    log2-fold change  
Covariate                      Cell Type                              
condition[T.LUAD]              B cell                      -0.156987  
                               T cell                       0.307037  
                               Epithelial cell             -0.164112  
                               Macrophage/Monocyte         -0.107140  
                               Mast cell                   -0.231694  
                               Plasma cell                 -0.189414  
                               cDC                         -0.134763  
                               Stromal                     -0.015739  
                               NK cell                     -0.194530  
                               Endothelial cell            -0.063106  
                               pDC                         -0.035027  
                               Neutrophils                 -0.290212  
tumor_stage[T.early]           B cell                       0.012931  
                               T cell                       0.017428  
                               Epithelial cell             -0.042960  
                               Macrophage/Monocyte          0.085538  
                               Mast cell                   -0.095478  
                               Plasma cell                 -0.084131  
                               cDC                          0.117293  
                               Stromal                     -0.071875  
                               NK cell                      0.046023  
                               Endothelial cell            -0.146911  
                               pDC                          0.048286  
                               Neutrophils                  0.057013  
dataset[T.Maynard_Bivona_2020] B cell                      -0.857014  
                               T cell                      -0.957777  
                               Epithelial cell              0.282312  
                               Macrophage/Monocyte          0.153808  
                               Mast cell                    0.717042  
                               Plasma cell                  0.316895  
                               cDC                          0.046840  
                               Stromal                      1.109245  
                               NK cell                     -0.335912  
                               Endothelial cell             0.843661  
                               pDC                          0.244622  
                               Neutrophils                  0.283261  
dataset[T.UKIM-V]              B cell                      -0.209658  
                               T cell                      -0.108037  
                               Epithelial cell              0.016157  
                               Macrophage/Monocyte          0.217675  
                               Mast cell                   -0.360491  
                               Plasma cell                 -0.343565  
                               cDC                          0.154365  
                               Stromal                     -0.185508  
                               NK cell                      0.177615  
                               Endothelial cell            -0.429250  
                               pDC                          0.085525  
                               Neutrophils                  1.267411  
  1. Compute “credible effects” based on the FDR. This will select cell-types that meet the FDR threshold.

Note

condition[T.LUAD] refers to the coefficient of the model that captures the differences inf LUAD compared to LUSC (which se set as the reference). The name of the coefficient can be found in the summary output above. By chosing a differen coefficient (e.g. tumor_stage[T.early]) we could test the effects of a different variable (in this case tumor_stage) on the cell-type composition.

credible_effects_condition = sccoda_res.credible_effects()["condition[T.LUAD]"]
  1. Make a plot of log2 fold changes between LUAD and LUSC using altair.

alt.Chart(
    sccoda_res.effect_df.loc["condition[T.LUAD]"].loc[credible_effects_condition].reset_index(),
    title="condition",
).mark_bar().encode(
    x=alt.X("Cell Type", sort="y"),
    y="log2-fold change",
    color=alt.Color("Cell Type"),
)