Building Unified Spatial Atlases: A Step-by-Step Guide to Integrating Fragmented Cell Maps


Overview

Recent advances in spatial multi-omics technologies let scientists map gene and protein activity at single-cell resolution within intact tissues. However, these high-resolution maps are often generated from different tissue samples, platforms, or experimental batches, which leaves them fragmented and hard to compare. A new computational method, detailed in a Nature Genetics study, addresses this by unifying the fragmented maps into coherent spatial atlases. This tutorial walks through the process step by step, from preprocessing individual datasets to integrating and evaluating them on your own data.

Source: phys.org

Prerequisites

Required Skills and Knowledge

Familiarity with single-cell sequencing data (e.g., scRNA-seq, scATAC-seq)
Basic programming in Python or R
Understanding of spatial transcriptomics (e.g., MERFISH, Visium, Slide-seq)
Experience with statistical analysis (normalization, dimensionality reduction)

Software and Tools

Python 3.8+ with packages: numpy, pandas, scipy, scikit-learn, scanpy, squidpy, anndata
Optional: R with Seurat or SpatialExperiment
Computational resources: at least 16 GB RAM (for moderate-sized datasets)

Data Requirements

You will need at least two spatial transcriptomics datasets from the same tissue type (e.g., mouse brain, human lymph node). Each dataset should contain:

Gene expression matrix (genes × spots/cells; note that AnnData stores it transposed, as spots/cells × genes)
Spatial coordinates (x, y) for each spot
Optional: metadata such as section ID, batch label

Step-by-Step Instructions

Step 1: Data Acquisition and Quality Control

Begin by loading your spatial datasets into AnnData objects (or Seurat). For each dataset, perform basic quality control:

Filter out spots with few detected genes (e.g., < 200) or high mitochondrial content
Normalize using library size scaling followed by a log transform, or use SCTransform
Identify highly variable genes for downstream integration

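# Python example
import scanpy as sc

adata1 = sc.read_h5ad('dataset1.h5ad')  # repeat for each dataset
sc.pp.filter_cells(adata1, min_genes=200)  # drop spots with fewer than 200 detected genes
sc.pp.normalize_total(adata1, target_sum=1e4)  # library-size scaling
sc.pp.log1p(adata1)  # the default highly_variable_genes flavor expects log-normalized data
sc.pp.highly_variable_genes(adata1, n_top_genes=2000)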
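
The snippet above does not cover the mitochondrial filter from the checklist. A minimal sketch of both filters on a raw count matrix follows. It assumes spots in rows, genes in columns, and mitochondrial genes named with an `MT-` or `mt-` prefix; the function name `qc_mask` and the 20% cutoff are illustrative assumptions, not values from the study.

```python
import numpy as np

def qc_mask(counts, gene_names, min_genes=200, max_mt_frac=0.2):
    """Return a boolean mask of spots that pass both QC filters.

    counts: (n_spots, n_genes) array of raw counts.
    gene_names: sequence of n_genes gene symbols.
    """
    counts = np.asarray(counts, dtype=float)
    # Assumption: mitochondrial genes carry an "MT-"/"mt-" prefix.
    is_mt = np.array([g.upper().startswith("MT-") for g in gene_names])
    genes_detected = (counts > 0).sum(axis=1)
    total = counts.sum(axis=1)
    mt_total = counts[:, is_mt].sum(axis=1)
    # Spots with zero total counts get a mitochondrial fraction of 0;
    # they fail the min_genes filter anyway.
    mt_frac = np.divide(mt_total, total, out=np.zeros_like(total), where=total > 0)
    return (genes_detected >= min_genes) & (mt_frac <= max_mt_frac)
```

With an AnnData object holding dense raw counts, the mask could be applied as `adata1 = adata1[mask].copy()` before normalization.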

Step 2: Preliminary Clustering and Annotation

Before integration, cluster each dataset independently to identify cell types or regions. This helps later in aligning spatial patterns.

Perform PCA on highly variable genes
Compute neighborhood graph and cluster (e.g., Leiden algorithm)
Optionally annotate clusters using known markers

Store these cluster labels in adata.obs for reference.

Step 3: Feature Selection for Integration

The unifying method relies on shared features across tissues. Select features (genes or proteins) that are:

Consistently expressed in both datasets (e.g., mean normalized expression > 0.1)
Spatially variable (using Moran’s I or SPARK-X)

This reduces noise and focuses on spatial patterns.
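
The two criteria above can be sketched with a small Moran's I implementation. This is a simplified illustration, not the study's feature selection: it assumes binary k-nearest-neighbor spatial weights, both matrices stored as spots × genes with identical gene order, and a hypothetical Moran's I cutoff (`min_i`) that you would tune for your data.

```python
import numpy as np

def morans_i(values, coords, k=6):
    """Moran's I of one feature using binary k-nearest-neighbor weights."""
    x = np.asarray(values, dtype=float)
    xy = np.asarray(coords, dtype=float)
    n = x.shape[0]
    # Pairwise squared distances (fine for a few thousand spots; use a KD-tree beyond that).
    d2 = ((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a spot is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]
    w = np.zeros((n, n))
    w[np.repeat(np.arange(n), k), neighbors.ravel()] = 1.0
    z = x - x.mean()
    denom = (z ** 2).sum()
    if denom == 0:
        return 0.0  # constant feature: no spatial signal to measure
    return float(n / w.sum() * (z @ w @ z) / denom)

def select_features(expr1, xy1, expr2, xy2, min_mean=0.1, min_i=0.1, k=6):
    """Indices of genes expressed in both datasets and spatially variable in both."""
    keep = []
    for g in range(expr1.shape[1]):
        if expr1[:, g].mean() <= min_mean or expr2[:, g].mean() <= min_mean:
            continue
        if morans_i(expr1[:, g], xy1, k) > min_i and morans_i(expr2[:, g], xy2, k) > min_i:
            keep.append(g)
    return keep
```

Values of Moran's I near 1 indicate smooth spatial structure, while values near 0 indicate spatially random expression. For large datasets, squidpy or SPARK-X will be far faster than this dense-matrix version.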

Step 4: Aligning Coordinate Systems

Fragmented maps often come from different sections or orientations. Use a landmark-based approach or a neural network (like a U-Net) to find a transformation that aligns tissue shapes. For simplicity, you can:

Manually identify a few corresponding points (e.g., tissue boundaries)
Apply a similarity transformation (translation, rotation, and uniform scaling) using Procrustes analysis

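from scipy.spatial import procrustes
# landmarks1 and landmarks2 are (n_points, 2) arrays of corresponding points, in the same order
# procrustes standardizes both inputs (centered, unit norm), then fits the second onto the first
ref_std, moved_std, disparity = procrustes(landmarks1, landmarks2)
# disparity is the residual sum of squared differences; values near 0 indicate a close fit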
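
scipy's `procrustes` reports how well two landmark sets match, but it does not return a transform you can apply to the remaining spots. A minimal sketch of fitting a similarity transform on the landmarks and then mapping every coordinate with it is shown here. It uses the standard least-squares (Kabsch/Umeyama) solution; the function names are illustrative, and reflections are deliberately excluded.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares scale, rotation, and translation mapping src landmarks onto dst."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    s0, d0 = src - mu_s, dst - mu_d
    # SVD of the cross-covariance gives the optimal rotation.
    u, sing, vt = np.linalg.svd(d0.T @ s0)
    sign = np.sign(np.linalg.det(u @ vt))
    fix = np.diag([1.0] * (src.shape[1] - 1) + [sign])  # forbid reflections
    rot = u @ fix @ vt
    scale = (sing * np.diag(fix)).sum() / (s0 ** 2).sum()
    shift = mu_d - scale * rot @ mu_s
    return scale, rot, shift

def apply_similarity(points, scale, rot, shift):
    """Map an (n, d) coordinate array with a fitted similarity transform."""
    return scale * np.asarray(points, dtype=float) @ rot.T + shift
```

Fit on the handful of landmark pairs, then call `apply_similarity` on the full coordinate array of the section being moved. If one section was mounted flipped, mirror its coordinates first, since this sketch will not recover a reflection.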

Step 5: Integrating Expression Data with Spatial Constraints

This is the core step. Use a graph-based integration that preserves both expression similarity and spatial proximity. The method from the paper leverages a spatial mutual nearest neighbors (MNN) approach. In outline:

Build spatial k-nearest neighbor graphs within each dataset (using coordinates)
Identify MNN pairs across datasets after embedding both datasets in a shared PCA space
Compute batch-correction vectors only for spatially consistent MNN pairs

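# Conceptual (simplified). Assumes X_pca is a shared embedding (PCA fitted on both datasets
# together) and that obsm['spatial'] holds coordinates already aligned in Step 4.
import numpy as np
from sklearn.neighbors import NearestNeighbors

pca1, pca2 = adata1.obsm['X_pca'], adata2.obsm['X_pca']
xy1, xy2 = adata1.obsm['spatial'], adata2.obsm['spatial']
k = 5
# k nearest neighbors in PCA space, searched in both directions
nn12 = NearestNeighbors(n_neighbors=k).fit(pca2).kneighbors(pca1, return_distance=False)
nn21 = NearestNeighbors(n_neighbors=k).fit(pca1).kneighbors(pca2, return_distance=False)
# Keep mutual pairs (i, j): j is among i's neighbors and i is among j's neighbors
sets21 = [set(row) for row in nn21]
pairs = np.array([(i, j) for i, row in enumerate(nn12) for j in row if i in sets21[j]], dtype=int).reshape(-1, 2)
# Keep only pairs whose two members are also close in aligned physical space
pair_dists = np.linalg.norm(xy1[pairs[:, 0]] - xy2[pairs[:, 1]], axis=1)
valid_pairs = pairs[pair_dists < 50]  # adjust threshold to your coordinate units
# Compute batch-correction vectors from valid_pairs only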
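
The code above stops before the correction itself. One way to finish the step is sketched here, under the assumption that you have an array of (reference index, query index) matches that passed the spatial filter. Each match contributes a correction vector, and every query cell is shifted by a Gaussian-weighted average of those vectors. This follows the general idea of MNN correction and is not the study's implementation; the function name and the `sigma` default are illustrative.

```python
import numpy as np

def correct_with_pairs(emb_ref, emb_query, pairs, sigma=1.0):
    """Shift emb_query toward emb_ref using correction vectors from matched pairs.

    pairs: (n_pairs, 2) integer array of (ref_index, query_index) matches.
    sigma: Gaussian kernel width in embedding units (an illustrative default).
    """
    emb_ref = np.asarray(emb_ref, dtype=float)
    emb_query = np.asarray(emb_query, dtype=float)
    pairs = np.asarray(pairs, dtype=int).reshape(-1, 2)
    if len(pairs) == 0:
        return emb_query.copy()  # nothing to anchor the correction on
    anchors = emb_query[pairs[:, 1]]
    vectors = emb_ref[pairs[:, 0]] - anchors  # one correction vector per pair
    # Weight each pair by how close the query cell is to that pair's query anchor.
    d2 = ((emb_query[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    # Subtracting the row minimum keeps the weights from underflowing to zero.
    weights = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / (2 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)
    return emb_query + weights @ vectors
```

A small `sigma` makes the correction local, so each cell follows its nearest matched pairs. A large `sigma` approaches a single global shift.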

Step 6: Visualization and Quality Assessment

After integration, visualize the unified atlas. Common plots:

Joint UMAP colored by dataset to check mixing
Spatial scatter plot with integrated clusters
Expression of key markers to verify spatial patterns

sc.pl.spatial(adata_combined, color=['leiden', 'dataset_id'], spot_size=10)

Evaluate integration success using:

Silhouette score computed on batch labels (lower is better; values near 0 or below indicate that batches are well mixed)
Correlation of spatial expression of conserved genes
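
The batch silhouette check can be sketched in plain NumPy as follows. It assumes Euclidean distance in the integrated embedding and builds the full pairwise distance matrix, so subsample large atlases first. The function name is illustrative; scikit-learn's `silhouette_score` computes the same quantity.

```python
import numpy as np

def batch_silhouette(embedding, batch_labels):
    """Mean silhouette width of cells grouped by batch (Euclidean distance)."""
    x = np.asarray(embedding, dtype=float)
    labels = np.asarray(batch_labels)
    batches = np.unique(labels)
    if len(batches) < 2:
        raise ValueError("need at least two batches")
    dist = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2))
    scores = np.zeros(len(x))
    for i in range(len(x)):
        same = labels == labels[i]
        n_same = same.sum() - 1  # exclude the cell itself
        if n_same == 0:
            continue  # singleton batch: silhouette is defined as 0
        a = dist[i, same].sum() / n_same  # mean distance within own batch
        b = min(dist[i, labels == other].mean() for other in batches if other != labels[i])
        if max(a, b) > 0:
            scores[i] = (b - a) / max(a, b)
    return float(scores.mean())
```

A score close to 1 means cells still separate by dataset, which suggests the integration left a batch effect in place. Compare the score before and after integration rather than reading a single value in isolation.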

Common Mistakes

Ignoring Batch Effects Within a Single Dataset

If your data comes from multiple runs, treat each run as a separate map. Failure to correct intra-dataset batch effects will cause misalignment.

Over-Aligning with Too Many Dimensions

Using 50+ PCs for MNN can over-correct and wash out biological variation. Stick to 15–30 PCs depending on dataset complexity.

Not Verifying Spatial Correspondence

MNN pairs must be spatially plausible. Without spatial filtering, you may link cells from opposite sides of the tissue and produce maps that look seamless but are wrong.

Using Different Gene Panels

If technologies measure distinct gene sets (e.g., MERFISH vs. Visium), restrict integration to the intersection and confirm that housekeeping genes are consistent.
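
Restricting two datasets to their shared genes, in a matching column order, can be sketched like this. It assumes spots × genes matrices and exact, case-sensitive gene symbol matches; the function name is illustrative.

```python
import numpy as np

def restrict_to_shared_genes(expr1, genes1, expr2, genes2):
    """Subset two (n_spots, n_genes) matrices to their shared genes, in the same order."""
    index2 = {g: j for j, g in enumerate(genes2)}
    shared = [g for g in genes1 if g in index2]  # keeps the order of genes1
    if not shared:
        raise ValueError("the two panels share no genes")
    cols1 = [i for i, g in enumerate(genes1) if g in index2]
    cols2 = [index2[g] for g in shared]
    return np.asarray(expr1)[:, cols1], np.asarray(expr2)[:, cols2], shared
```

With AnnData objects, the same result can be obtained by indexing both objects with `adata1.var_names.intersection(adata2.var_names)`. Check the size of the intersection before going further, since a targeted panel may share only a few hundred genes with a whole-transcriptome assay.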

Summary

Unifying fragmented cell maps into a single spatial atlas requires careful data handling, alignment, and integration that respects both gene expression and physical location. The workflow in this guide has four parts: preprocessing individual datasets, selecting shared spatially variable features, aligning coordinates, and applying spatially constrained MNN correction. Together they yield integrated atlases that show how cells organize across different sections or experiments. This approach, rooted in recent Nature Genetics methodology, is intended to accelerate the construction of whole-body spatial maps and to support deeper study of complex tissues such as the brain and immune system.
