Reference mapping with HNOCA-tools

In this tutorial we'll explore how to perform reference mapping using the HNOCA toolbox. Analogous to our analysis in the HNOCA manuscript, we map an organoid morphogen screen dataset (Amin at al.) to a reference atlas of the developing human brain (Braun at al.). We already trained a scANVI model for this. Based on this model HNOCA-tools provides functions to map the organoid data to the reference atlas and make quantitative comparisons.

To get started, we first load all necessary pacakges and load the data & model.

import os

import scanpy as sc
import scvi

import hnoca.map as hmap

os.chdir("/home/fleckj/scratch/hnoca/")

# Read reference and query data
ref_adata = sc.read("BraunLinnarsson_2023_devbrain_hv2k.h5ad")
query_adata = sc.read("AminPasca_2023_organoid_screen_hv2k.h5ad")

# Format batch columns
query_adata.obs["batch"] = query_adata.obs["orig.ident"].astype(str).copy()
ref_adata.obs["batch"] = ref_adata.obs["Donor"].astype(str).copy()

# Load reference model
ref_model = scvi.model.SCANVI.load(
    "scarches_BraunLinnarsson/model.pt",
    adata=ref_adata,
)
print(ref_model)

Now we can use the AtlasMapper from HNOCA-tools to map the organoid data to the reference atlas. Under the hood this is using the transfer learning functionality of scANVI to finetune te model to the query dataset. We finetune for 100 epochs (max_epochs) with a batch size of 1024 (batch_size) and use retrain="partial" to only retrain the weights corresponding to the new batch covariates.

# Initiate mapper
mapper = hmap.AtlasMapper(ref_model)
# Map query data to reference
mapper.map_query(query_adata, retrain="partial", max_epochs=100, batch_size=1024)

We then compute a weighted kNN graph between the query and reference data.

mapper.compute_wknn(k=100)

Based on this wkNN mapping, we can now transfer labels from the reference and to estimate cell type composition of the query. We can also compute presence scores with respect to the reference data to assess which reference cell types are covered in the screen.

subregion_transfer = mapper.transfer_labels(label_key="CellClass")
subregion_transfer[["best_label", "best_score"]]

	best_label	best_score
44_33_8__s1	Radial glia	1.789107
44_79_20__s1	Radial glia	0.617641
44_23_10__s1	Radial glia	1.113593
15_50_8__s1	Neurolast	0.360920
15_86_61__s1	Neurolast	0.315698
...	...	...
6_94_77__s8	Neuron	0.563865
35_39_65__s8	Oligo	16.431843
30_36_44__s8	Neuron	0.666777
30_56_8__s8	Neuron	0.315283
23_5_66__s8	Radial glia	1.151799

36265 rows × 2 columns

Let's check how this looks on a UMAP. We can see that this matches the original author annotations ("class") quite well.

query_adata.obs["best_label"] = subregion_transfer["best_label"]
query_adata.obs["best_score"] = subregion_transfer["best_score"]
sc.pl.umap(query_adata, color=["class", "best_label", "best_score"], ncols=1)

No description has been provided for this image

Next, we'd like to know which cell types of the reference atlas are covered by the organoid screen. For this, we can compute the per-condition presence scores based on the wkNN graph. (Depending on the number of conditions this can take a while, even if you are running it on a GPU.)

presence_scores = mapper.get_presence_scores(split_by="condition")

GPU detected and cuml installed. Use cuML for neighborhood estimation.

Now let's check out the presence scores on a umap of the reference atlas.

ref_adata.obs["max_presence_scores"] = presence_scores["max"]
sc.pl.umap(ref_adata, color=["CellClass", "Subregion", "max_presence_scores"], ncols=1, cmap="Blues")

From here on one can perform various other analysis based on the presence scores, e.g. compare them with the organoid atlas to reveal cell types for which coverage was improved by the screen.