Before you begin

Hardware requirements

For working with up to 1 million cells, the absolute minimum is a workstation with 250 GB of RAM and 16 CPU cores. We additionally recommend a GPU, because data integration with scvi-tools is at least an order of magnitude faster on a GPU than with CPU-only computing.
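As a quick sanity check, the sketch below compares the current machine against the minimums stated above. It assumes a Linux system with `/proc/meminfo` and treats the presence of `nvidia-smi` as a proxy for a usable NVIDIA GPU; the variable names are illustrative and not part of the protocol.

```shell
# Minimums taken from the paragraph above; adjust them for smaller datasets.
min_cores=16
min_ram_gb=250

cores=$(nproc)
# MemTotal is reported in kB in /proc/meminfo (Linux only).
ram_gb=$(awk '/^MemTotal:/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)

echo "CPU cores: ${cores} (minimum ${min_cores})"
echo "RAM: ${ram_gb} GB (minimum ${min_ram_gb})"
[ "$cores" -ge "$min_cores" ] || echo "WARNING: fewer CPU cores than the stated minimum"
[ "$ram_gb" -ge "$min_ram_gb" ] || echo "WARNING: less RAM than the stated minimum"

# Assumption: an NVIDIA GPU is what you would use with scvi-tools.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader \
        || echo "nvidia-smi is installed, but no GPU is accessible"
else
    echo "nvidia-smi not found; data integration would run on the CPU"
fi
```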

Software prerequisites

For the following instructions, we assume that you are working on Linux and have set up the conda package manager and a way of working with Jupyter notebooks. If you have not, we recommend setting up miniforge and following the JupyterLab installation instructions.

Clone the atlas protocol repository

You can obtain all notebooks and helper scripts required for this tutorial from GitHub:

git clone https://github.com/icbi-lab/atlas_protocol.git
cd atlas_protocol

Installing software dependencies

All required software dependencies are declared in the env/environment.yml file. To install them, create a conda environment as follows:

conda env create -n atlas_protocol -f env/environment.yml
conda activate atlas_protocol
# install the atlas_protocol package with helper functions
pip install git+https://github.com/icbi-lab/atlas_protocol.git

To make conda environments available as kernels in Jupyter notebooks, we suggest installing nb_conda_kernels.

Alternatively, you can obtain a Singularity container with all dependencies pre-installed.

Obtain and preprocess single-cell datasets

Important

Choose datasets carefully. Datasets can differ substantially in their characteristics, for instance:

generated from frozen vs. fresh tissue
pre-sorted to contain only specific cell types
different sequencing platforms
whole cells vs. single nuclei
multimodal profiling

While dataset-to-dataset differences can be mitigated by data integration, they cannot be removed completely, so it is essential to be aware of possible biases beforehand.

Publicly available single-cell datasets come in all forms and flavors. To build the atlas, we need for each dataset (1) a gene expression count matrix and (2) cell-level metadata. Gene expression data is often available from standardized repositories such as the Gene Expression Omnibus (GEO), while metadata may be available as supplementary information of the original publication. For some datasets, only read-level data can be downloaded as FASTQ files, and you will need to preprocess the data from scratch. Ideally, all datasets would be re-processed from FASTQ files with a consistent reference genome and genome annotation. However, in our experience, some datasets are only available in processed form, which requires remapping gene identifiers later on.

Obtain bulk RNA-seq datasets and metadata

For the Scissor analysis, bulk data must be prepared as an R matrix of untransformed TPM values, with samples as column names and gene symbols as row names, stored as an RDS file. The associated clinical data must be a TSV file in which one column contains the sample identifiers used as column names of the TPM matrix.
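A mismatch between the sample identifiers in the two files is easy to miss, so it can help to check them against each other before running the analysis. The sketch below assumes that the sample identifiers of the TPM matrix have already been exported to a plain text file with one identifier per line (for example from R). The function name `missing_samples`, the file names, and the column name `sample_id` are hypothetical.

```shell
# Print every sample ID from a list that is absent from the clinical table.
#   $1: clinical TSV file with a header row
#   $2: name of the column holding the sample identifiers
#   $3: text file with one sample ID per line (exported from the TPM matrix)
missing_samples() {
    awk -F'\t' -v col="$2" '
        NR == FNR {
            if (FNR == 1) {
                # Locate the sample ID column by its header name.
                for (i = 1; i <= NF; i++) if ($i == col) idx = i
                if (!idx) { print "column not found: " col > "/dev/stderr"; exit 2 }
                next
            }
            seen[$idx] = 1
            next
        }
        # Second file: report IDs that never appeared in the clinical table.
        !($0 in seen) { print }
    ' "$1" "$3"
}

# Example call with hypothetical file names:
# missing_samples clinical_data.tsv sample_id tpm_sample_ids.txt
```

Empty output means that every sample in the list has a matching row in the clinical table.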

For this protocol, we provide both the TPM matrix and the clinical annotation table from the TCGA LUAD and LUSC cohorts as part of the example data. You can download them from Zenodo as follows:

curl TODO

Obtain reference genome GTF files

To facilitate integration of the four datasets, it is important to standardize the provided gene IDs. In this tutorial, we download the GTF files from GENCODE/Ensembl that were originally used to annotate the genes in each dataset, which enables us to remap the provided gene symbols. This remapping is necessary to resolve ambiguity in gene symbols and to ensure that only counts mapped to the same genomic location are merged, using unique Ensembl IDs as identifiers.

cd ./tables

# GENCODE release 32: download the GTF, convert it to a gene annotation table, then delete the GTF
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.primary_assembly.annotation.gtf.gz
../bin/gtf_to_table.sh gencode.v32.primary_assembly.annotation.gtf.gz gencode.v32_gene_annotation_table.csv gencode
rm gencode.v32.primary_assembly.annotation.gtf.gz

# Ensembl release 109: same steps for the Ensembl annotation
wget https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
../bin/gtf_to_table.sh Homo_sapiens.GRCh38.109.gtf.gz Homo_sapiens.GRCh38.109_gene_annotation_table.csv ensembl
rm Homo_sapiens.GRCh38.109.gtf.gz
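
To illustrate the kind of mapping such an annotation table provides, the following minimal sketch extracts gene IDs and gene symbols from the `gene` records of a gzipped GTF file. It is a hypothetical stand-in and not the repository's `gtf_to_table.sh`, whose output columns may differ. The function name `gtf_genes_to_csv` and the two-column layout are assumptions.

```shell
# Hypothetical stand-in for a GTF-to-table conversion; NOT the repository script.
#   $1: path to a gzipped GTF file; writes "gene_id,gene_name" rows to stdout
gtf_genes_to_csv() {
    gzip -dc "$1" | awk -F'\t' '
        BEGIN { print "gene_id,gene_name" }
        # Skip header comments and keep only gene-level records (column 3).
        $0 !~ /^#/ && $3 == "gene" {
            id = ""; name = ""
            # Column 9 holds the attributes, e.g. gene_id "..."; gene_name "...";
            if (match($9, /gene_id "[^"]+"/))   id   = substr($9, RSTART + 9,  RLENGTH - 10)
            if (match($9, /gene_name "[^"]+"/)) name = substr($9, RSTART + 11, RLENGTH - 12)
            # GENCODE appends a version suffix (ENSG00000223972.5); drop it so that
            # IDs are comparable with the unversioned Ensembl gene_id.
            sub(/\.[0-9]+$/, "", id)
            print id "," name
        }'
}

# Example call with a hypothetical output file name:
# gtf_genes_to_csv Homo_sapiens.GRCh38.109.gtf.gz > gene_id_to_symbol.csv
```

Each gene symbol can then be looked up against a stable Ensembl ID, which is the identifier the remapping relies on.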