Differential gene expression#

One of the most commonly asked questions when comparing between conditions is which genes are differentially expressed. In statistical terms, we ask if the mean gene expression differs between two groups more than what would be expected by random chance. A common fallacy here is that frequently applied statistical tests assume independence of the observations, while cells from the same sample or patient are not independent. This is also called “pseudoreplication bias” and leads to drastically inflated p-values [Squair et al., 2021, Zimmerman et al., 2021].

Unfortunately, the differential testing functions available from the most commonly used frameworks (e.g. scanpy.tl.rank_genes_groups() in scanpy) do not account for this, allowing for qualitative comparisons at best. As an alternative, Squair et al. [2021] suggest to aggregate samples into “pseudo-bulks” and apply tools originally designed for comparing bulk RNA-seq samples.

In this section, we are going to demonstrate how to generate pseudo-bulk samples by cell-type using decoupler [Badia-i-Mompel et al., 2022] and apply DESeq2 [Love et al., 2014] to perform cell-type-specific comparison between conditions.

1.Load libraries#

%load_ext autoreload
%autoreload 2
import re
import subprocess as sp
from pathlib import Path

import decoupler as dc
import matplotlib.pyplot as plt
import pandas as pd
import scanpy as sc
from anndata import AnnData
from tqdm.contrib.concurrent import process_map

import atlas_protocol_scripts as aps

2. Load input data#

Load the annotated AnnData object

adata_path = "../../data/input_data_zenodo/atlas-integrated-annotated.h5ad"
adata = sc.read_h5ad(adata_path)

Define the paths to the scripts for DESeq2. The script is shipped as part of the atlas_protocol repo.

deseq_script_path = "../../bin/deseq2.R"

Create output directories

deseq_results = Path("../../data/results/differential_expression/")
!mkdir -p {deseq_results}

3. Generate Pseudobulk#

Here, we aggregate counts by biological replicate (patient) for each cell-type individually.

subset AnnData to the cells of interest. In this case, we only focus on cells originating from primary tumor samples and the two conditions (LUAD vs. LUSC) we are going to compare.

adata = adata[adata.obs["origin"].isin(["tumor_primary"]) & adata.obs["condition"].isin(["LUAD", "LUSC"])]
adata

View of AnnData object with n_obs × n_vars = 25892 × 17837
    obs: 'sample', 'uicc_stage', 'sex', 'ever_smoker', 'driver_genes', 'condition', 'age', 'patient', 'tissue', 'origin', 'dataset', 'cell_type', 'cell_type_predicted', 'doublet_status', 'leiden', 'n_genes_by_counts', 'total_counts', 'total_counts_mito', 'pct_counts_mito', 'cell_type_coarse', 'cell_type_tumor', 'tumor_stage', 'EGFR_mutation', 'TP53_mutation', 'ALK_mutation', 'BRAF_mutation', 'ERBB2_mutation', 'KRAS_mutation', 'ROS_mutation', 'origin_fine', 'study', 'platform', 'platform_fine', 'cell_type_major', 'batch', '_predictions', '_leiden', '_cell_type_tumor_predicted'
    var: 'is_highly_variable', 'mito', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
    uns: '_cell_type_tumor_predicted_colors', '_scvi', 'cell_type_coarse_colors', 'cell_type_colors', 'cell_type_major_colors', 'cell_type_predicted_colors', 'cell_type_tumor_colors', 'condition_colors', 'dataset_colors', 'doublet_status_colors', 'ever_smoker_colors', 'hvg', 'leiden_colors', 'neighbors', 'origin_colors', 'platform_colors', 'sex_colors', 'tumor_stage_colors', 'uicc_stage_colors', 'umap'
    obsm: 'X_scANVI', 'X_scVI', 'X_umap'
    layers: 'raw_counts'
    obsp: 'connectivities', 'distances'

Generate pseudobulk

Important

Generate pseudo-bulk based on raw counts and aggregate them by summing up.

# Get pseudo-bulk profile
pdata = dc.get_pseudobulk(
    adata,
    sample_col="sample",
    groups_col=["cell_type_coarse", "condition"],
    layer="raw_counts",  # layer where raw counts are store in adata
    mode="sum",
    min_cells=0,  # we are setting this to 0 and filter in an explicit, separate step.
    min_counts=0,
)
pdata

AnnData object with n_obs × n_vars = 223 × 17837
    obs: 'sample', 'uicc_stage', 'sex', 'ever_smoker', 'driver_genes', 'condition', 'age', 'patient', 'tissue', 'origin', 'dataset', 'doublet_status', 'total_counts_mito', 'pct_counts_mito', 'cell_type_coarse', 'tumor_stage', 'EGFR_mutation', 'TP53_mutation', 'ALK_mutation', 'BRAF_mutation', 'ERBB2_mutation', 'KRAS_mutation', 'ROS_mutation', 'origin_fine', 'study', 'platform', 'platform_fine', 'batch', '_predictions', '_leiden', '_cell_type_tumor_predicted', 'cell_type_coarse_condition', 'psbulk_n_cells', 'psbulk_counts'
    layers: 'psbulk_props'

Inspect the table of “pseudobulk samples”

pdata.obs

	sample	uicc_stage	sex	ever_smoker	driver_genes	condition	age	patient	tissue	origin	...	study	platform	platform_fine	batch	_predictions	_leiden	_cell_type_tumor_predicted	cell_type_coarse_condition	psbulk_n_cells	psbulk_counts
Lambrechts_Thienpont_2018_6653_scrBT1430m_B cell_LUAD	Lambrechts_Thienpont_2018_6653_scrBT1430m	III	male	yes	NaN	LUAD	65.0	Lambrechts_Thienpont_2018_6653_6	lung	tumor_primary	...	Lambrechts_Thienpont_2018	10x	10x_3p_v2	NaN	NaN	NaN	NaN	B cell_LUAD	178.0	8.976160e+05
Lambrechts_Thienpont_2018_6653_scrBT1431m_B cell_LUAD	Lambrechts_Thienpont_2018_6653_scrBT1431m	III	male	yes	NaN	LUAD	65.0	Lambrechts_Thienpont_2018_6653_6	lung	tumor_primary	...	Lambrechts_Thienpont_2018	10x	10x_3p_v2	NaN	NaN	NaN	NaN	B cell_LUAD	702.0	3.154887e+06
Lambrechts_Thienpont_2018_6653_scrBT1432m_B cell_LUAD	Lambrechts_Thienpont_2018_6653_scrBT1432m	III	male	yes	NaN	LUAD	65.0	Lambrechts_Thienpont_2018_6653_6	lung	tumor_primary	...	Lambrechts_Thienpont_2018	10x	10x_3p_v2	NaN	NaN	NaN	NaN	B cell_LUAD	114.0	5.869860e+05
Maynard_Bivona_2020_LT_S42_B cell_LUAD	Maynard_Bivona_2020_LT_S42	IV	female	no	EGFR	LUAD	56.0	Maynard_Bivona_2020_TH158	lung	tumor_primary	...	Maynard_Bivona_2020	Smart-seq2	Smart-seq2	NaN	NaN	NaN	NaN	B cell_LUAD	10.0	5.065048e+07
Maynard_Bivona_2020_LT_S56_B cell_LUAD	Maynard_Bivona_2020_LT_S56	IV	male	no	ALK	LUAD	37.0	Maynard_Bivona_2020_TH231	lung	tumor_primary	...	Maynard_Bivona_2020	Smart-seq2	Smart-seq2	NaN	NaN	NaN	NaN	B cell_LUAD	2.0	1.941628e+06
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
UKIM-V_P1_tumor_primary_pDC_LUAD	UKIM-V_P1_tumor_primary	I	female	yes	NaN	LUAD	65.0	UKIM-V_P1	lung	tumor_primary	...	UKIM-V	BD-Rhapsody	BD-Rhapsody	NaN	NaN	NaN	NaN	pDC_LUAD	25.0	8.279220e+05
UKIM-V_P3_tumor_primary_pDC_LUAD	UKIM-V_P3_tumor_primary	I	male	yes	NaN	LUAD	64.0	UKIM-V_P3	lung	tumor_primary	...	UKIM-V	BD-Rhapsody	BD-Rhapsody	NaN	NaN	NaN	NaN	pDC_LUAD	15.0	1.456810e+05
Lambrechts_Thienpont_2018_6653_BT1375_pDC_LUSC	Lambrechts_Thienpont_2018_6653_BT1375	I	male	yes	NaN	LUSC	60.0	Lambrechts_Thienpont_2018_6653_7	lung	tumor_primary	...	Lambrechts_Thienpont_2018	10x	10x_3p_v2	NaN	NaN	NaN	NaN	pDC_LUSC	3.0	6.947000e+03
Lambrechts_Thienpont_2018_6653_BT1377_pDC_LUSC	Lambrechts_Thienpont_2018_6653_BT1377	I	male	yes	NaN	LUSC	60.0	Lambrechts_Thienpont_2018_6653_7	lung	tumor_primary	...	Lambrechts_Thienpont_2018	10x	10x_3p_v2	NaN	NaN	NaN	NaN	pDC_LUSC	1.0	6.048000e+03
UKIM-V_P2_tumor_primary_pDC_LUSC	UKIM-V_P2_tumor_primary	I	female	yes	NaN	LUSC	70.0	UKIM-V_P2	lung	tumor_primary	...	UKIM-V	BD-Rhapsody	BD-Rhapsody	NaN	NaN	NaN	NaN	pDC_LUSC	8.0	1.085030e+05

223 rows × 34 columns

4. Filter samples#

In this step, we are removing samples that consist of too few cells or have a small number of total counts in both conditions.

Plot relationship of counts and cells per dataset and cell-type

for col in ["dataset", "cell_type_coarse", "condition"]:
    dc.plot_psbulk_samples(pdata, groupby=col, figsize=(3, 3))

../_images/4128ce88e93560cca01c3e6699df9e3b3ddb5f6c66023353030c3230cac586f8.png

../_images/aa8ec0457fdf530089500cf6eb410c592b6d7928ce7c0920bb103989e2aa4f45.png

../_images/81afdd78d27ed0ab09844fab87a39e6ab68a939c179cd76fd5dab658152386a6.png

Remove samples with less than 10 cells or less than 1000 counts

pdata = pdata[(pdata.obs["psbulk_n_cells"] >= 10) & (pdata.obs["psbulk_counts"] >= 1000)]

5. Split pseudobulk by cell-type#

For the following steps, we treat each cell-type separately, therefore we split the pseudobulk object into a dictionary of objects per cell-type.

Define list of cell-types

cell_types = pdata.obs["cell_type_coarse"].unique()

Generate per-cell-type dictionary

pdata_by_cell_type = {}
for ct in cell_types:
    pb = pdata[pdata.obs["cell_type_coarse"] == ct, :].copy()
    if pb.obs["condition"].nunique() <= 1:
        print(f"Cell type {ct} does not have samples in all groups")
    else:
        pdata_by_cell_type[ct] = pb

Cell type pDC does not have samples in all groups

6. Filter genes#

Here, we filter noisy genes with a low expression value. We remove genes that do not have a certain number of total counts (min_total_count) and have a minimum number of counts in at least \(n\) samples, where \(n\) equals to the number of samples in the smallest group (or less, if there is a sufficient number of samples).

This needs to be done for each cell-type individually since each cell-type may express different sets of genes.

7. Run DESeq#

Inspect the help page of the DESeq2 script

!{deseq_script_path} --help

There were 17 warnings (use warnings() to see them)
?25h?25h?25h?25h?25h?25h?25h?25h?25h?25h?25h?25h?25h?25husage: deseq2.R [--] [--help] [--save_workspace] [--opts OPTS]
       [--resDir RESDIR] [--cond_col COND_COL] [--covariate_formula
       COVARIATE_FORMULA] [--prefix PREFIX] [--sample_col SAMPLE_COL]
       [--fdr FDR] [--c1 C1] [--c2 C2] [--cpus CPUS] countData colData

Input Differential Expression Analysis

positional arguments:
  countData            Count matrix
  colData              Sample annotation matrix

flags:
  -h, --help           show this help message and exit
  --save_workspace     Save R workspace

optional arguments:
  -x, --opts           RDS file containing argument values
  -r, --resDir         Output result directory [default: ./results]
  -c, --cond_col       Condition column in sample annotation [default:
                       group]
  --covariate_formula  Additional covariates that will be added to the
                       design formula [default: ]
  -p, --prefix         Prefix of result file [default: test]
  -s, --sample_col     Sample identifier in sample annotation [default:
                       sample]
  -f, --fdr            False discovery rate [default: 0.1]
  --c1                 Contrast level 1 (perturbation) [default: grpA]
  --c2                 Contrast level 2 (baseline) [default: grpB]
  --cpus               Number of cpus [default: 1]

?25h

Define a function to create an output directory per cell-type

def _create_prefix(cell_type):
    ct_sanitized = re.sub("[^0-9a-zA-Z]+", "_", cell_type)
    prefix = deseq_results / "LUAD_vs_LUSC" / ct_sanitized
    prefix.mkdir(parents=True, exist_ok=True)
    return prefix

Export pseudobulk objects to CSV files. This will generate a count matrix an a samplesheet for each cell-type.

for ct, tmp_pdata in pdata_by_cell_type.items():
    prefix = _create_prefix(ct)
    aps.io.write_deseq_tables(tmp_pdata, prefix / "samplesheet.csv", prefix / "counts.csv")

Execute the DESeq2 script

Important

Always specify at least dataset as a covariate when performing differential expression analysis, to account for batch effects.

It may make sense to include other covariates (sex, tumor stage, age, …) into the model depending on your dataset and research question.

def _run_deseq(cell_type: str, pseudobulk: AnnData):
    """Function: Run DESeq2 on pseudobulk of all cell types from cell_type."""
    prefix = _create_prefix(cell_type)
    # fmt: off
    deseq_cmd = [
        deseq_script_path, prefix / "counts.csv", prefix / "samplesheet.csv",
        "--cond_col", "condition",
        "--c1", "LUAD",
        "--c2", "LUSC",
        "--covariate_formula", "+ dataset + sex",
        "--resDir", prefix,
        "--prefix", prefix.stem,
    ]
    # fmt: on
    with open(prefix / "out.log", "w") as stdout:
        with open(prefix / "err.log", "w") as stderr:
            sp.run(
                deseq_cmd,
                capture_output=False,
                stdout=stdout,
                stderr=stderr,
                check=True,
            )


_ = process_map(_run_deseq, pdata_by_cell_type, pdata_by_cell_type.values())

import results

de_results = {}
for ct in pdata_by_cell_type:
    prefix = _create_prefix(ct)
    de_results[ct] = pd.read_csv(prefix / f"{prefix.stem}_DESeq2_result.tsv", sep="\t").assign(cell_type=ct)

8. Make volcano plots#

Convert DE results to p-value/logFC matrices as used by decoupler

logFCs, pvals = aps.tl.long_form_df_to_decoupler(pd.concat(de_results.values()), p_col="padj")

Make a volcano plot for all cell-types of interest (e.g. B cells)

fig, ax = plt.subplots()
dc.plot_volcano(logFCs, pvals, "B cell", name="B cell", top=10, sign_thr=0.1, lFCs_thr=0.5, ax=ax)
ax.set_ylabel("-log10(FDR)")
ax.set_xlabel("log2(FC)")
fig.show()

../_images/f3e4c6514dfef0a886f2199b5c9698c0dda0b8c05cfe1bda836e0b3ae57a1d20.png

Normalize pseudobulk to log counts per million (logCPM)

9. Plot genes of interest as paired dotplots#

Here, we are visualizing the actual expression values for each sample for all genes and cell-types of interest

Define list of genes to visualize

genes = ["EIF1AY", "GSTM1", "XIST", "MYRF"]

Normalize pseudobulk to counts per million (CPM)

for tmp_pdata in pdata_by_cell_type.values():
    sc.pp.normalize_total(tmp_pdata, target_sum=1e6)
    sc.pp.log1p(tmp_pdata)

Make plot (e.g. B cells)

# Get p-values for genes of interest
tmp_pvalues = de_results["B cell"].set_index("gene_id").loc[genes, "padj"].values

aps.pl.plot_paired(
    pdata_by_cell_type["B cell"],
    groupby="condition",
    var_names=genes,
    hue="dataset",
    ylabel="log(CPM+1)",
    panel_size=(2, 4),
    pvalues=tmp_pvalues,
    pvalue_template=lambda x: f"DESeq2 FDR={x:.3f}",
)

../_images/db9e223a3afc6b9d51c2ebea253561ae19f923b511713b250daaf14b6fcfc162.png

Differential gene expression

Contents

Differential gene expression#

1.Load libraries#

2. Load input data#

3. Generate Pseudobulk#

4. Filter samples#

5. Split pseudobulk by cell-type#

6. Filter genes#

7. Run DESeq#

8. Make volcano plots#

9. Plot genes of interest as paired dotplots#