Mirrors the pangeapy hierarchical annotation flow in the browser. Level1 runs on every cell;
Level2 runs only on groups whose Level1 label has ≥ 50 cells and a matching Level2 model.
For large inputs or many samples, use the pangeapy API instead.
Input file configuration
Should contain a gene expression matrix (cell_barcode × gene_id)
Expression values must be 1e4-normalized and log1p-transformed, not raw counts:
each cell is scaled to 10,000 total counts, then transformed as log(1 + x)
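As a rough illustration of the expected preprocessing, the sketch below applies the two steps to a single cell using only the Python standard library. The function name `normalize_log1p` and the example counts are hypothetical, and the natural logarithm is assumed (as implied by "log1p").

```python
import math

def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's raw counts to target_sum in total, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # A cell with no counts cannot be scaled; return zeros.
        return [0.0 for _ in counts]
    return [math.log1p(c * target_sum / total) for c in counts]

# Hypothetical raw counts for four genes in one cell.
cell = [0, 3, 1, 6]
normalized = normalize_log1p(cell)
print(normalized)
```

Undoing the log with `math.expm1` and summing the values should give back roughly 10,000 for any non-empty cell, which is a quick way to check whether a matrix is already in the expected form.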
Level1 — runs the Whole model on every cell (32 broad cell types)
Level2 — for each Level1 label that has ≥ 50 cells and a matching Level2 model
(B_mature, Dendritic_classical, Ductal, Endothelial,
Fibroblast, Macrophage, Monocyte, Mural,
Squamous, T&NK), run the corresponding model on just those cells
Level2 models are downloaded on demand (only the ones needed)
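The Level2 gating rule described above can be sketched as follows. This is only an illustration of the selection logic, not the tool's actual code: the function `level2_groups` and the example label `Hepatocyte` are hypothetical, while the threshold and the set of Level2 labels are taken from the list above.

```python
from collections import Counter

MIN_CELLS = 50
# Level1 labels that have a Level2 model, as listed in this document.
LEVEL2_LABELS = {
    "B_mature", "Dendritic_classical", "Ductal", "Endothelial",
    "Fibroblast", "Macrophage", "Monocyte", "Mural", "Squamous", "T&NK",
}

def level2_groups(level1_labels, min_cells=MIN_CELLS):
    """Return {Level1 label: [cell indices]} for groups eligible for Level2."""
    counts = Counter(level1_labels)
    eligible = {label for label, n in counts.items()
                if n >= min_cells and label in LEVEL2_LABELS}
    groups = {label: [] for label in eligible}
    for index, label in enumerate(level1_labels):
        if label in eligible:
            groups[label].append(index)
    return groups

# 60 Fibroblast cells qualify; 10 Monocyte cells are too few;
# "Hepatocyte" (a hypothetical Level1 label) has no Level2 model.
labels = ["Fibroblast"] * 60 + ["Monocyte"] * 10 + ["Hepatocyte"] * 80
groups = level2_groups(labels)
print(sorted(groups))  # → ['Fibroblast']
```

Only the keys of the returned mapping would need their Level2 model downloaded, which matches the on-demand behaviour noted above.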
Meta prediction — mirrors MetaAnnotator().annotate()
Filter cells: Level1|conf_score > 0.5 AND Level2|conf_score > 0.5
If fewer than 500 cells remain, meta is skipped
Build a composition vector (Level1 proportions, plus Level2 proportions within each Level1 group that has ≥ 50 cells)
Organ predictor → top organ + probability distribution
Phenotype predictor: Blood model if the top organ is Blood with probability ≥ 0.5, otherwise Tissue model
Requires meta_*_portable.npz in /assets/models/ — see tools/convert_meta_models.py
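A minimal sketch of the filtering and model-selection logic in the meta steps above, under stated assumptions. The function names are hypothetical, only the Level1 part of the composition vector is built, and cells with no Level2 score are assumed to fail the filter (the text above does not say how they are handled). It does not reproduce MetaAnnotator().annotate() itself.

```python
from collections import Counter

CONF_MIN = 0.5
MIN_META_CELLS = 500

def level1_composition(cells):
    """cells: list of (level1_label, level1_score, level2_score) tuples.

    Returns Level1 proportions among confident cells, or None when fewer
    than MIN_META_CELLS cells pass the filter (meta is skipped).
    """
    # Assumption: a missing Level2 score (None) fails the filter.
    kept = [label for label, s1, s2 in cells
            if s2 is not None and s1 > CONF_MIN and s2 > CONF_MIN]
    if len(kept) < MIN_META_CELLS:
        return None
    counts = Counter(kept)
    return {label: n / len(kept) for label, n in counts.items()}

def pick_phenotype_model(top_organ, top_prob):
    """Blood model only when Blood is the top organ with probability >= 0.5."""
    return "Blood" if top_organ == "Blood" and top_prob >= 0.5 else "Tissue"

# Hypothetical scores: 600 cells pass the filter, 50 fail on Level1.
cells = ([("T&NK", 0.9, 0.8)] * 400
         + [("Monocyte", 0.7, 0.6)] * 200
         + [("Monocyte", 0.4, 0.9)] * 50)
composition = level1_composition(cells)
print(composition)
print(pick_phenotype_model("Blood", 0.45))  # → Tissue
```

In the full flow, the organ predictor would take the complete composition vector as input, and its top organ and probability would then decide which phenotype model runs.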
Output file configuration — pred.csv with columns
cell_id
Level1|predicted_label, Level1|conf_score
Level2|predicted_label, Level2|conf_score (blank if no Level2 was run for that cell)
PG_annotations — Level1|Level2 concatenated, or just Level1 when no Level2
PG_combined_score — geometric mean of the available level scores
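For illustration, the last two columns could be derived from the Level1 and Level2 columns roughly as below. The function name `combine` and the Level2 label `example_subtype` are hypothetical. With two scores the geometric mean is the square root of their product; with only Level1 it is the Level1 score itself.

```python
import math

def combine(level1_label, level1_score, level2_label=None, level2_score=None):
    """Return (PG_annotations, PG_combined_score) for one cell."""
    if level2_label is None or level2_score is None:
        # No Level2 was run for this cell: keep the Level1 values as they are.
        return level1_label, level1_score
    annotation = f"{level1_label}|{level2_label}"
    score = math.sqrt(level1_score * level2_score)
    return annotation, score

with_level2 = combine("T&NK", 0.9, "example_subtype", 0.4)
without_level2 = combine("Hepatocyte", 0.8)
print(with_level2)
print(without_level2)
```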