Skip to content

Reference mapping with HNOCA-tools

In this tutorial we'll explore how to perform reference mapping using the HNOCA toolbox. Analogous to our analysis in the HNOCA manuscript, we map an organoid morphogen screen dataset (Amin at al.) to a reference atlas of the developing human brain (Braun at al.). We already trained a scANVI model for this. Based on this model HNOCA-tools provides functions to map the organoid data to the reference atlas and make quantitative comparisons.

To get started, we first load all necessary pacakges and load the data & model.

import os
import scvi 
import scanpy as sc
import pandas as pd
import numpy as np
import hnoca.map as hmap

os.chdir("/home/fleckj/scratch/hnoca/")
# Read reference and query data
ref_adata = sc.read("BraunLinnarsson_2023_devbrain_hv2k.h5ad")
query_adata = sc.read("AminPasca_2023_organoid_screen_hv2k.h5ad")

# Format batch columns
query_adata.obs["batch"] = query_adata.obs["orig.ident"].astype(str).copy()
ref_adata.obs["batch"] = ref_adata.obs["Donor"].astype(str).copy()

# Load reference model
ref_model = scvi.model.SCANVI.load(
    "scarches_BraunLinnarsson/model.pt",
    adata=ref_adata,
)
print(ref_model)

Now we can use the AtlasMapper from HNOCA-tools to map the organoid data to the reference atlas. Under the hood this is using the transfer learning functionality of scANVI to finetune te model to the query dataset. We finetune for 100 epochs (max_epochs) with a batch size of 1024 (batch_size) and use retrain="partial" to only retrain the weights corresponding to the new batch covariates.

# Initiate mapper
mapper = hmap.AtlasMapper(ref_model)
# Map query data to reference
mapper.map_query(
    query_adata, 
    retrain="partial", 
    max_epochs=100, 
    batch_size=1024
)

We then compute a weighted kNN graph between the query and reference data.

mapper.compute_wknn(k=100)

Based on this wkNN mapping, we can now transfer labels from the reference and to estimate cell type composition of the query. We can also compute presence scores with respect to the reference data to assess which reference cell types are covered in the screen.

subregion_transfer = mapper.transfer_labels(label_key="CellClass")
subregion_transfer[["best_label", "best_score"]]
best_label best_score
44_33_8__s1 Radial glia 1.789107
44_79_20__s1 Radial glia 0.617641
44_23_10__s1 Radial glia 1.113593
15_50_8__s1 Neurolast 0.360920
15_86_61__s1 Neurolast 0.315698
... ... ...
6_94_77__s8 Neuron 0.563865
35_39_65__s8 Oligo 16.431843
30_36_44__s8 Neuron 0.666777
30_56_8__s8 Neuron 0.315283
23_5_66__s8 Radial glia 1.151799

36265 rows × 2 columns

Let's check how this looks on a UMAP. We can see that this matches the original author annotations ("class") quite well.

query_adata.obs["best_label"] = subregion_transfer["best_label"]
query_adata.obs["best_score"] = subregion_transfer["best_score"]
sc.pl.umap(query_adata, color=["class", "best_label", "best_score"], ncols=1)
No description has been provided for this image

Next, we'd like to know which cell types of the reference atlas are covered by the organoid screen. For this, we can compute the per-condition presence scores based on the wkNN graph. (Depending on the number of conditions this can take a while, even if you are running it on a GPU.)

presence_scores = mapper.get_presence_scores(split_by="condition")
GPU detected and cuml installed. Use cuML for neighborhood estimation.

Now let's check out the presence scores on a umap of the reference atlas.

ref_adata.obs["max_presence_scores"] = presence_scores["max"]
sc.pl.umap(ref_adata, color=["CellClass", "Subregion", "max_presence_scores"], ncols=1, cmap="Blues")
No description has been provided for this image

From here on one can perform various other analysis based on the presence scores, e.g. compare them with the organoid atlas to reveal cell types for which coverage was improved by the screen.