Differential gene expression#

One of the most commonly asked questions when comparing between conditions is which genes are differentially expressed. In statistical terms, we ask if the mean gene expression differs between two groups more than what would be expected by random chance. A common fallacy here is that frequently applied statistical tests assume independence of the observations, while cells from the same sample or patient are not independent. This is also called “pseudoreplication bias” and leads to drastically inflated p-values [Squair et al., 2021, Zimmerman et al., 2021].

Unfortunately, the differential testing functions available from the most commonly used frameworks (e.g. scanpy.tl.rank_genes_groups() in scanpy) do not account for this, allowing for qualitative comparisons at best. As an alternative, Squair et al. [2021] suggest to aggregate samples into “pseudo-bulks” and apply tools originally designed for comparing bulk RNA-seq samples.

In this section, we are going to demonstrate how to generate pseudo-bulk samples by cell-type using decoupler [Badia-i-Mompel et al., 2022] and apply DESeq2 [Love et al., 2014] to perform cell-type-specific comparison between conditions.

See also

1.Load libraries#

%load_ext autoreload
%autoreload 2
import re
import subprocess as sp
from pathlib import Path

import decoupler as dc
import matplotlib.pyplot as plt
import pandas as pd
import scanpy as sc
from anndata import AnnData
from tqdm.contrib.concurrent import process_map

import atlas_protocol_scripts as aps

2. Load input data#

  1. Load the annotated AnnData object

adata_path = "../../data/input_data_zenodo/atlas-integrated-annotated.h5ad"
adata = sc.read_h5ad(adata_path)
  1. Define the paths to the scripts for DESeq2. The script is shipped as part of the atlas_protocol repo.

deseq_script_path = "../../bin/deseq2.R"
  1. Create output directories

deseq_results = Path("../../data/results/differential_expression/")
!mkdir -p {deseq_results}

3. Generate Pseudobulk#

Here, we aggregate counts by biological replicate (patient) for each cell-type individually.

  1. subset AnnData to the cells of interest. In this case, we only focus on cells originating from primary tumor samples and the two conditions (LUAD vs. LUSC) we are going to compare.

adata = adata[adata.obs["origin"].isin(["tumor_primary"]) & adata.obs["condition"].isin(["LUAD", "LUSC"])]
adata
View of AnnData object with n_obs × n_vars = 25892 × 17837
    obs: 'sample', 'uicc_stage', 'sex', 'ever_smoker', 'driver_genes', 'condition', 'age', 'patient', 'tissue', 'origin', 'dataset', 'cell_type', 'cell_type_predicted', 'doublet_status', 'leiden', 'n_genes_by_counts', 'total_counts', 'total_counts_mito', 'pct_counts_mito', 'cell_type_coarse', 'cell_type_tumor', 'tumor_stage', 'EGFR_mutation', 'TP53_mutation', 'ALK_mutation', 'BRAF_mutation', 'ERBB2_mutation', 'KRAS_mutation', 'ROS_mutation', 'origin_fine', 'study', 'platform', 'platform_fine', 'cell_type_major', 'batch', '_predictions', '_leiden', '_cell_type_tumor_predicted'
    var: 'is_highly_variable', 'mito', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
    uns: '_cell_type_tumor_predicted_colors', '_scvi', 'cell_type_coarse_colors', 'cell_type_colors', 'cell_type_major_colors', 'cell_type_predicted_colors', 'cell_type_tumor_colors', 'condition_colors', 'dataset_colors', 'doublet_status_colors', 'ever_smoker_colors', 'hvg', 'leiden_colors', 'neighbors', 'origin_colors', 'platform_colors', 'sex_colors', 'tumor_stage_colors', 'uicc_stage_colors', 'umap'
    obsm: 'X_scANVI', 'X_scVI', 'X_umap'
    layers: 'raw_counts'
    obsp: 'connectivities', 'distances'
  1. Generate pseudobulk

Important

Generate pseudo-bulk based on raw counts and aggregate them by summing up.

# Get pseudo-bulk profile
pdata = dc.get_pseudobulk(
    adata,
    sample_col="sample",
    groups_col=["cell_type_coarse", "condition"],
    layer="raw_counts",  # layer where raw counts are store in adata
    mode="sum",
    min_cells=0,  # we are setting this to 0 and filter in an explicit, separate step.
    min_counts=0,
)
pdata
AnnData object with n_obs × n_vars = 223 × 17837
    obs: 'sample', 'uicc_stage', 'sex', 'ever_smoker', 'driver_genes', 'condition', 'age', 'patient', 'tissue', 'origin', 'dataset', 'doublet_status', 'total_counts_mito', 'pct_counts_mito', 'cell_type_coarse', 'tumor_stage', 'EGFR_mutation', 'TP53_mutation', 'ALK_mutation', 'BRAF_mutation', 'ERBB2_mutation', 'KRAS_mutation', 'ROS_mutation', 'origin_fine', 'study', 'platform', 'platform_fine', 'batch', '_predictions', '_leiden', '_cell_type_tumor_predicted', 'cell_type_coarse_condition', 'psbulk_n_cells', 'psbulk_counts'
    layers: 'psbulk_props'
  1. Inspect the table of “pseudobulk samples”

pdata.obs
sample uicc_stage sex ever_smoker driver_genes condition age patient tissue origin ... study platform platform_fine batch _predictions _leiden _cell_type_tumor_predicted cell_type_coarse_condition psbulk_n_cells psbulk_counts
Lambrechts_Thienpont_2018_6653_scrBT1430m_B cell_LUAD Lambrechts_Thienpont_2018_6653_scrBT1430m III male yes NaN LUAD 65.0 Lambrechts_Thienpont_2018_6653_6 lung tumor_primary ... Lambrechts_Thienpont_2018 10x 10x_3p_v2 NaN NaN NaN NaN B cell_LUAD 178.0 8.976160e+05
Lambrechts_Thienpont_2018_6653_scrBT1431m_B cell_LUAD Lambrechts_Thienpont_2018_6653_scrBT1431m III male yes NaN LUAD 65.0 Lambrechts_Thienpont_2018_6653_6 lung tumor_primary ... Lambrechts_Thienpont_2018 10x 10x_3p_v2 NaN NaN NaN NaN B cell_LUAD 702.0 3.154887e+06
Lambrechts_Thienpont_2018_6653_scrBT1432m_B cell_LUAD Lambrechts_Thienpont_2018_6653_scrBT1432m III male yes NaN LUAD 65.0 Lambrechts_Thienpont_2018_6653_6 lung tumor_primary ... Lambrechts_Thienpont_2018 10x 10x_3p_v2 NaN NaN NaN NaN B cell_LUAD 114.0 5.869860e+05
Maynard_Bivona_2020_LT_S42_B cell_LUAD Maynard_Bivona_2020_LT_S42 IV female no EGFR LUAD 56.0 Maynard_Bivona_2020_TH158 lung tumor_primary ... Maynard_Bivona_2020 Smart-seq2 Smart-seq2 NaN NaN NaN NaN B cell_LUAD 10.0 5.065048e+07
Maynard_Bivona_2020_LT_S56_B cell_LUAD Maynard_Bivona_2020_LT_S56 IV male no ALK LUAD 37.0 Maynard_Bivona_2020_TH231 lung tumor_primary ... Maynard_Bivona_2020 Smart-seq2 Smart-seq2 NaN NaN NaN NaN B cell_LUAD 2.0 1.941628e+06
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
UKIM-V_P1_tumor_primary_pDC_LUAD UKIM-V_P1_tumor_primary I female yes NaN LUAD 65.0 UKIM-V_P1 lung tumor_primary ... UKIM-V BD-Rhapsody BD-Rhapsody NaN NaN NaN NaN pDC_LUAD 25.0 8.279220e+05
UKIM-V_P3_tumor_primary_pDC_LUAD UKIM-V_P3_tumor_primary I male yes NaN LUAD 64.0 UKIM-V_P3 lung tumor_primary ... UKIM-V BD-Rhapsody BD-Rhapsody NaN NaN NaN NaN pDC_LUAD 15.0 1.456810e+05
Lambrechts_Thienpont_2018_6653_BT1375_pDC_LUSC Lambrechts_Thienpont_2018_6653_BT1375 I male yes NaN LUSC 60.0 Lambrechts_Thienpont_2018_6653_7 lung tumor_primary ... Lambrechts_Thienpont_2018 10x 10x_3p_v2 NaN NaN NaN NaN pDC_LUSC 3.0 6.947000e+03
Lambrechts_Thienpont_2018_6653_BT1377_pDC_LUSC Lambrechts_Thienpont_2018_6653_BT1377 I male yes NaN LUSC 60.0 Lambrechts_Thienpont_2018_6653_7 lung tumor_primary ... Lambrechts_Thienpont_2018 10x 10x_3p_v2 NaN NaN NaN NaN pDC_LUSC 1.0 6.048000e+03
UKIM-V_P2_tumor_primary_pDC_LUSC UKIM-V_P2_tumor_primary I female yes NaN LUSC 70.0 UKIM-V_P2 lung tumor_primary ... UKIM-V BD-Rhapsody BD-Rhapsody NaN NaN NaN NaN pDC_LUSC 8.0 1.085030e+05

223 rows × 34 columns

4. Filter samples#

In this step, we are removing samples that consist of too few cells or have a small number of total counts in both conditions.

  1. Plot relationship of counts and cells per dataset and cell-type

for col in ["dataset", "cell_type_coarse", "condition"]:
    dc.plot_psbulk_samples(pdata, groupby=col, figsize=(3, 3))
../_images/4128ce88e93560cca01c3e6699df9e3b3ddb5f6c66023353030c3230cac586f8.png ../_images/aa8ec0457fdf530089500cf6eb410c592b6d7928ce7c0920bb103989e2aa4f45.png ../_images/81afdd78d27ed0ab09844fab87a39e6ab68a939c179cd76fd5dab658152386a6.png
  1. Remove samples with less than 10 cells or less than 1000 counts

pdata = pdata[(pdata.obs["psbulk_n_cells"] >= 10) & (pdata.obs["psbulk_counts"] >= 1000)]

5. Split pseudobulk by cell-type#

For the following steps, we treat each cell-type separately, therefore we split the pseudobulk object into a dictionary of objects per cell-type.

  1. Define list of cell-types

cell_types = pdata.obs["cell_type_coarse"].unique()
  1. Generate per-cell-type dictionary

pdata_by_cell_type = {}
for ct in cell_types:
    pb = pdata[pdata.obs["cell_type_coarse"] == ct, :].copy()
    if pb.obs["condition"].nunique() <= 1:
        print(f"Cell type {ct} does not have samples in all groups")
    else:
        pdata_by_cell_type[ct] = pb
Cell type pDC does not have samples in all groups

6. Filter genes#

Here, we filter noisy genes with a low expression value. We remove genes that do not have a certain number of total counts (min_total_count) and have a minimum number of counts in at least \(n\) samples, where \(n\) equals to the number of samples in the smallest group (or less, if there is a sufficient number of samples).

This needs to be done for each cell-type individually since each cell-type may express different sets of genes.

See also

filterByExpr in edgeR.

  1. Inspect, for a given cell-type, the number of genes that the thresholds (here for Epithelial cells):

dc.plot_filter_by_expr(
    pdata_by_cell_type["Epithelial cell"],
    group="condition",
    min_count=10,
    min_total_count=15,
)
../_images/b77a93978895e30a16f40efcd6a8456a32212ae8faad1acafb96b3b07d25eeb7.png
  1. Apply filter genes for all cell-types. Here, we apply the same (default) filter to all cell-types.

for tmp_pdata in pdata_by_cell_type.values():
    dc.filter_by_expr(
        tmp_pdata,
        group="condition",
        min_count=10,
        min_total_count=15,
    )

7. Run DESeq#

  1. Inspect the help page of the DESeq2 script

!{deseq_script_path} --help
There were 17 warnings (use warnings() to see them)
?25h?25h?25h?25h?25h?25h?25h?25h?25h?25h?25h?25h?25h?25husage: deseq2.R [--] [--help] [--save_workspace] [--opts OPTS]
       [--resDir RESDIR] [--cond_col COND_COL] [--covariate_formula
       COVARIATE_FORMULA] [--prefix PREFIX] [--sample_col SAMPLE_COL]
       [--fdr FDR] [--c1 C1] [--c2 C2] [--cpus CPUS] countData colData

Input Differential Expression Analysis

positional arguments:
  countData            Count matrix
  colData              Sample annotation matrix

flags:
  -h, --help           show this help message and exit
  --save_workspace     Save R workspace

optional arguments:
  -x, --opts           RDS file containing argument values
  -r, --resDir         Output result directory [default: ./results]
  -c, --cond_col       Condition column in sample annotation [default:
                       group]
  --covariate_formula  Additional covariates that will be added to the
                       design formula [default: ]
  -p, --prefix         Prefix of result file [default: test]
  -s, --sample_col     Sample identifier in sample annotation [default:
                       sample]
  -f, --fdr            False discovery rate [default: 0.1]
  --c1                 Contrast level 1 (perturbation) [default: grpA]
  --c2                 Contrast level 2 (baseline) [default: grpB]
  --cpus               Number of cpus [default: 1]

?25h
  1. Define a function to create an output directory per cell-type

def _create_prefix(cell_type):
    ct_sanitized = re.sub("[^0-9a-zA-Z]+", "_", cell_type)
    prefix = deseq_results / "LUAD_vs_LUSC" / ct_sanitized
    prefix.mkdir(parents=True, exist_ok=True)
    return prefix
  1. Export pseudobulk objects to CSV files. This will generate a count matrix an a samplesheet for each cell-type.

for ct, tmp_pdata in pdata_by_cell_type.items():
    prefix = _create_prefix(ct)
    aps.io.write_deseq_tables(tmp_pdata, prefix / "samplesheet.csv", prefix / "counts.csv")
  1. Execute the DESeq2 script

Important

Always specify at least dataset as a covariate when performing differential expression analysis, to account for batch effects.

It may make sense to include other covariates (sex, tumor stage, age, …) into the model depending on your dataset and research question.

def _run_deseq(cell_type: str, pseudobulk: AnnData):
    """Function: Run DESeq2 on pseudobulk of all cell types from cell_type."""
    prefix = _create_prefix(cell_type)
    # fmt: off
    deseq_cmd = [
        deseq_script_path, prefix / "counts.csv", prefix / "samplesheet.csv",
        "--cond_col", "condition",
        "--c1", "LUAD",
        "--c2", "LUSC",
        "--covariate_formula", "+ dataset + sex",
        "--resDir", prefix,
        "--prefix", prefix.stem,
    ]
    # fmt: on
    with open(prefix / "out.log", "w") as stdout:
        with open(prefix / "err.log", "w") as stderr:
            sp.run(
                deseq_cmd,
                capture_output=False,
                stdout=stdout,
                stderr=stderr,
                check=True,
            )


_ = process_map(_run_deseq, pdata_by_cell_type, pdata_by_cell_type.values())
  1. import results

de_results = {}
for ct in pdata_by_cell_type:
    prefix = _create_prefix(ct)
    de_results[ct] = pd.read_csv(prefix / f"{prefix.stem}_DESeq2_result.tsv", sep="\t").assign(cell_type=ct)

8. Make volcano plots#

  1. Convert DE results to p-value/logFC matrices as used by decoupler

logFCs, pvals = aps.tl.long_form_df_to_decoupler(pd.concat(de_results.values()), p_col="padj")
  1. Make a volcano plot for all cell-types of interest (e.g. B cells)

fig, ax = plt.subplots()
dc.plot_volcano(logFCs, pvals, "B cell", name="B cell", top=10, sign_thr=0.1, lFCs_thr=0.5, ax=ax)
ax.set_ylabel("-log10(FDR)")
ax.set_xlabel("log2(FC)")
fig.show()
../_images/f3e4c6514dfef0a886f2199b5c9698c0dda0b8c05cfe1bda836e0b3ae57a1d20.png
  1. Normalize pseudobulk to log counts per million (logCPM)

9. Plot genes of interest as paired dotplots#

Here, we are visualizing the actual expression values for each sample for all genes and cell-types of interest

  1. Define list of genes to visualize

genes = ["EIF1AY", "GSTM1", "XIST", "MYRF"]
  1. Normalize pseudobulk to counts per million (CPM)

for tmp_pdata in pdata_by_cell_type.values():
    sc.pp.normalize_total(tmp_pdata, target_sum=1e6)
    sc.pp.log1p(tmp_pdata)
  1. Make plot (e.g. B cells)

# Get p-values for genes of interest
tmp_pvalues = de_results["B cell"].set_index("gene_id").loc[genes, "padj"].values

aps.pl.plot_paired(
    pdata_by_cell_type["B cell"],
    groupby="condition",
    var_names=genes,
    hue="dataset",
    ylabel="log(CPM+1)",
    panel_size=(2, 4),
    pvalues=tmp_pvalues,
    pvalue_template=lambda x: f"DESeq2 FDR={x:.3f}",
)
../_images/db9e223a3afc6b9d51c2ebea253561ae19f923b511713b250daaf14b6fcfc162.png