Atlas annotation with snapseed

We developed snapseed to quickly provide marker-based annotations on large datasets. This can be very handy to provide "seed annotations" for label-dependent integration methods such as scANVI or scPoli.

Snapseed is based on the idea that a marker gene of a cell type should be specific for that cell type and should be expressed in a large fraction of the cells of that type. As a measure of specificity, snapseed computes single-feature ROC-AUC scores for each marker gene in each cluster. Together with the fraction of cells expressing the gene, we obtain a score that is used to assign clusters to cell types.

In this notebook, we'll showcase how to use snapseed to annotate cell types in the Human Neural Organoid Atlas dataset.

import os

import scanpy as sc
from flax.core.frozen_dict import FrozenDict

import hnoca.snapseed as snap
from hnoca.snapseed.utils import read_yaml

os.chdir("/home/fleckj/scratch/hnoca/")

# Read input data
hnoca_adata = sc.read("HNOCA_hv2k.h5ad")
print(hnoca_adata)

AnnData object with n_obs × n_vars = 1770578 × 2000
    obs: 'assay_sc', 'assay_differentiation', 'assay_type_differentiation', 'bio_sample', 'cell_line', 'cell_type', 'development_stage', 'disease', 'ethnicity', 'gm', 'id', 'individual', 'organ', 'organism', 'sex', 'state_exact', 'sample_source', 'source_doi', 'suspension_type_original', 'tech_sample', 'treatment', 'assay_sc_original', 'cell_line_original', 'cell_type_original', 'development_stage_original', 'disease_original', 'ethnicity_original', 'organ_original', 'organism_original', 'sex_original', 'suspension_type', 'obs_names_original', 'organoid_age_days', 'publication', 'doi', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'leiden_pca_unintegrated_1', 'leiden_pca_unintegrated_80', 'leiden_pca_rss_1', 'leiden_pca_rss_80', 'snapseed_pca_unintegrated_level_1', 'snapseed_pca_unintegrated_level_2', 'snapseed_pca_unintegrated_level_3', 'snapseed_pca_unintegrated_level_4', 'snapseed_pca_unintegrated_level_5', 'snapseed_pca_unintegrated_level_12', 'snapseed_pca_unintegrated_level_123', 'snapseed_pca_unintegrated_level_1234', 'snapseed_pca_unintegrated_level_12345', 'snapseed_pca_rss_level_1', 'snapseed_pca_rss_level_2', 'snapseed_pca_rss_level_3', 'snapseed_pca_rss_level_4', 'snapseed_pca_rss_level_5', 'snapseed_pca_rss_level_12', 'snapseed_pca_rss_level_123', 'snapseed_pca_rss_level_1234', 'snapseed_pca_rss_level_12345', 'leiden_scpoli_1', 'leiden_scpoli_80', 'snapseed_scpoli_level_1', 'snapseed_scpoli_level_2', 'snapseed_scpoli_level_3', 'snapseed_scpoli_level_4', 'snapseed_scpoli_level_5', 'snapseed_scpoli_level_12', 'snapseed_scpoli_level_123', 'snapseed_scpoli_level_1234', 'snapseed_scpoli_level_12345', 'annot_region_rev2', 'annot_level_3_rev2', 'annot_level_4_rev2', 'annot_ntt_rev2'
    var: 'ensembl', 'gene_symbol', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'gene_length', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: 'annot_region_rev2_colors', 'hvg', 'knn_pca_rss', 'knn_pca_unintegrated', 'knn_scpoli', 'log1p'
    obsm: 'X_pca_rss', 'X_pca_unintegrated', 'X_rss', 'X_scpoli', 'X_umap_pca_rss', 'X_umap_pca_unintegrated', 'X_umap_scpoli'
    layers: 'counts', 'counts_lengthnorm', 'lognorm'
    obsp: 'knn_pca_rss_connectivities', 'knn_pca_rss_distances', 'knn_pca_unintegrated_connectivities', 'knn_pca_unintegrated_distances', 'knn_scpoli_connectivities', 'knn_scpoli_distances'

In it's most basic form, snapeed takes adata object and a dict with marker genes for different cell types. A very simple example might look like this:

marker_dict = {
    "neuron": ["STMN2", "DCX"],
    "progenitor": ["SOX2", "VIM", "NES"],
}

assignments = snap.annotate(hnoca_adata, marker_dict, group_name="leiden_pca_rss_80")
assignments = assignments.set_index("leiden_pca_rss_80")

The output of snapseed is a pandas dataframe with annotations for each cluster. We can join this information back to the adata object to visualize the annotations on the single-cell level.

cell_assign = assignments.loc[hnoca_adata.obs["leiden_pca_rss_80"].astype(str).values, :]
hnoca_adata.obs["snap_class"] = cell_assign["class"].values

# Plot the annotations
sc.pl.scatter(hnoca_adata, color=["snap_class", "STMN2", "VIM"], basis="umap_scpoli")

No description has been provided for this image

This looks quite reasonable, so let's make it a bit more interesting. In practice, cell type annotations are often hierachical, so we might want to take this into account in the annotation process. For example, we can first annotate progenitors and neurons and then split them up into more fine-grained subtypes and regional identities.

Snapseed allows this by recusively annotating cells based on a hierarchy of cell types and marker genes. Such a hierarchy can be provided as a YAML file like this one:

(The FrozenDict is just to enable better printing of a large dictionary)

marker_hierarchy = read_yaml("brain_markers.yaml")
print(FrozenDict(marker_hierarchy))

FrozenDict({
    neural_progenitor_cell: {
        marker_genes: ['SOX2', 'VIM', 'NES'],
        subtypes: {
            glioblast: {
                marker_genes: ['HOPX', 'BCAN', 'TNC'],
            },
            telencephalic_npc: {
                marker_genes: ['FOXG1'],
                subtypes: {
                    dorsal_npc: {
                        marker_genes: ['EMX1'],
                    },
                    hippocampal_npc: {
                        marker_genes: ['GABRA2', 'GABRB1'],
                    },
                    ventral_npc: {
                        marker_genes: ['DLX2'],
                    },
                },
            },
            diencephalic_npc: {
                marker_genes: ['TCF7L2', 'SIX3'],
                subtypes: {
                    hypothalamic_npc: {
                        marker_genes: ['SIX3'],
                    },
                    thalamic_npc: {
                        marker_genes: ['TCF7L2'],
                    },
                },
            },
            mesencephalic_npc: {
                marker_genes: ['OTX2', 'PAX7'],
            },
            retinal_npc: {
                marker_genes: ['VSX2'],
            },
            rhombencephalic_npc: {
                marker_genes: ['CYP26A1', 'HOXA2', 'HOXB2', 'UNC5C', 'BCAN'],
                subtypes: {
                    cerebellar_npc: {
                        marker_genes: ['UNC5C', 'CYP26A1'],
                    },
                    pons_npc: {
                        marker_genes: ['GPC6', 'BCAN'],
                    },
                    medullary_npc: {
                        marker_genes: ['HOXB2', 'HOXD3', 'HOXA3'],
                    },
                },
            },
        },
    },
    neuron: {
        marker_genes: ['STMN2', 'DCX'],
        subtypes: {
            excitatory_neuron: {
                marker_genes: ['SLC17A7', 'SLC17A6'],
                subtypes: {
                    telencephalic_excitatory_neuron: {
                        marker_genes: ['FOXG1', 'EMX1', 'SLC17A7'],
                        subtypes: {
                            deeper_layer_cortical_excitatory_neuron: {
                                marker_genes: ['BCL11B'],
                            },
                            upper_layer_cortical_excitatory_neuron: {
                                marker_genes: ['SATB2'],
                            },
                            hippocampal_excitatory_neuron: {
                                marker_genes: ['GABRA2', 'GABRB1'],
                            },
                        },
                    },
                    non_telencephalic_excitatory_neuron: {
                        marker_genes: ['SLC17A6'],
                        subtypes: {
                            diencephalic_excitatory_neuron: {
                                marker_genes: ['TCF7L2', 'PITX2'],
                                subtypes: {
                                    hypothalamic_excitatory_neuron: {
                                        marker_genes: ['PITX2'],
                                    },
                                    thalamic_excitatory_neuron: {
                                        marker_genes: ['TCF7L2'],
                                    },
                                },
                            },
                            mesencephalic_excitatory_neuron: {
                                marker_genes: ['BARHL2', 'TFAP2D'],
                            },
                            rhombencephalic_excitatory_neuron: {
                                marker_genes: ['TFAP2A', 'HOXA2', 'HOXB2', 'HOXB2', 'HOXD3', 'HOXA3'],
                                subtypes: {
                                    cerebellar_excitatory_neuron: {
                                        marker_genes: ['UNCX', 'INSM1', 'UNC5C'],
                                    },
                                    pons_excitatory_neuron: {
                                        marker_genes: ['ROBO1', 'NEGR1'],
                                    },
                                    medullary_excitatory_neuron: {
                                        marker_genes: ['HOXB2', 'HOXD3', 'HOXA3'],
                                    },
                                },
                            },
                        },
                    },
                },
            },
            inhibitory_neuron: {
                marker_genes: ['GAD1', 'GAD2', 'SLC32A1'],
                subtypes: {
                    telencephalic_inhibitory_neuron: {
                        marker_genes: ['FOXG1'],
                        subtypes: {
                            ventral_inhibitory_neuron: {
                                marker_genes: ['DLX2', 'DLX5'],
                                subtypes: {
                                    mge_inhibitory_neuron: {
                                        marker_genes: ['NKX2-1'],
                                    },
                                    lge_inhibitory_neuron: {
                                        marker_genes: ['ISL1'],
                                    },
                                    cge_inhibitory_neuron: {
                                        marker_genes: ['NR2F1'],
                                    },
                                },
                            },
                        },
                    },
                    non_telencephalic_inhibitory_neuron: {
                        marker_genes: ['LHX5', 'LHX1'],
                        subtypes: {
                            diencephalic_inhibitory_neuron: {
                                marker_genes: ['TCF7L2', 'ISL1'],
                                subtypes: {
                                    hypothalamic_inhibitory_neuron: {
                                        marker_genes: ['ISL1'],
                                    },
                                    thalamic_inhibitory_neuron: {
                                        marker_genes: ['DLX1', 'DLX5'],
                                    },
                                },
                            },
                            mesencephalic_inhibitory_neuron: {
                                marker_genes: ['OTX2'],
                            },
                            rhombencephalic_inhibitory_neuron: {
                                marker_genes: ['SKOR2', 'HOXA2', 'HOXB2', 'CA8', 'LAMP5'],
                                subtypes: {
                                    cerebellar_inhibitory_neuron: {
                                        marker_genes: ['CA8', 'TFAP2A', 'SKOR2', 'UNC5C'],
                                    },
                                    medullary_inhibitory_neuron: {
                                        marker_genes: ['HOXB2', 'HOXD3', 'HOXA3', 'LAMP5'],
                                    },
                                    pons_inhibitory_neuron: {
                                        marker_genes: ['ROBO1', 'NEGR1'],
                                    },
                                },
                            },
                        },
                    },
                },
            },
        },
    },
    choroid_plexus_epithelium: {
        marker_genes: ['TTR'],
    },
    astrocyte: {
        marker_genes: ['GFAP', 'AQP4'],
    },
    oligodendrocyte_lineage: {
        marker_genes: ['OLIG1', 'MBP'],
        subtypes: {
            oligodendrocyte_precursor_cell: {
                marker_genes: ['OLIG1'],
            },
            mature_oligodendrocyte: {
                marker_genes: ['MBP'],
            },
        },
    },
    microglia: {
        marker_genes: ['AIF1'],
    },
    vascular_endothelial_cell: {
        marker_genes: ['CLDN5'],
    },
    mesenchymal_cell: {
        marker_genes: ['DCN'],
    },
    neural_crest: {
        marker_genes: ['SOX10'],
    },
    pns_neurons: {
        marker_genes: ['PRPH'],
    },
})

assignments = snap.annotate_hierarchy(hnoca_adata, marker_hierarchy, group_name="leiden_pca_rss_80")

Now, snapseeds outputs a dictionary with the assignment dataframe, as well as the hierarchy. Again, we can plot the result as in a UMAP.

cell_assign = assignments["assignments"].loc[hnoca_adata.obs["leiden_pca_rss_80"].astype(str).values, :]
hnoca_adata.obs["snap_level_1"] = cell_assign["level_1"].values
hnoca_adata.obs["snap_level_2"] = cell_assign["level_2"].values
hnoca_adata.obs["snap_level_3"] = cell_assign["level_3"].values

# Plot the annotations
sc.pl.scatter(hnoca_adata, color=["snap_level_1", "snap_level_2", "snap_level_3"], basis="umap_scpoli")

Here, snapseed gives annotations for each level of the provided hierarchy.