biscot

Bacterial Integrative Single-Cell Optimal Transport

A computational framework for the integrative analysis of bacterial single-cell datasets, combining flow cytometry and single-cell RNA sequencing through Optimal Transport methods.

Key Features:

  • Automated gating: High-dimensional gating without manual intervention
  • GMM-OT alignment: Gaussian Mixture Model-based Optimal Transport for robust population matching
  • Cross-modal imputation: k-nearest neighbor strategies for imputing gene expression onto flow cytometry data
  • Temporal tracking: Track bacterial population dynamics over time (e.g., sporulation)
  • Scalable: Designed for high-throughput cytometric platforms
  • AnnData/MuData: Standard single-cell data structures for interoperability

Installation

git clone https://github.com/bio-datascience/biscot.git
cd biscot
pip install -e .

Requirements:

  • Python >= 3.12
  • Key dependencies: POT (optimal transport), scikit-learn, mudata, pytometry, torch

Quick Start

Unimodal: Temporal Population Tracking

from biscot import UniModalData, GMMConfig

# Create data container
data = UniModalData(
    channels=["FSC", "SSC", "DAPI"],
    analysis_mode="temporal",
    reference_timepoint="t2"
)

# Add flow cytometry datasets
data.add_dataset("t0", adata_t0)
data.add_dataset("t1", adata_t1)
data.add_dataset("t2", adata_t2)

# Run analysis with automatic component selection
data.analyze(gmm_config=GMMConfig(n_components="bic", k_range=(5, 15)))

# Compute transport between timepoints
data.compute_transport("t0", "t1")
data.compute_transport("t1", "t2")

Multimodal: Flow + RNA-seq Integration

from biscot import MultiModalData, GMMConfig, OTConfig

# Create multimodal container
data = MultiModalData()

# Add modalities
data.add_modality("flow", flow_adata, modality_type="flow")
data.add_modality("rna", rna_adata, modality_type="rna")

# Fit GMMs
data.analyze(gmm_config=GMMConfig(n_components=10))

# Align modalities via PCA + Procrustes
data.align_modalities("flow", "rna")

# Compute optimal transport
data.compute_transport("flow", "rna", OTConfig(method="bary"))

# Impute gene expression onto flow cells
data.impute_features(
    source_modality="rna",
    target_modality="flow",
    features=["groEL", "ftsZ", "spoIVA"]
)
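
As a rough illustration of what k-nearest-neighbor imputation does conceptually, the sketch below averages the expression of the k closest RNA cells for each flow cell in a shared embedding. It is not biscot's implementation: the function name `knn_impute`, the plain Euclidean metric, and the unweighted mean are assumptions made for the example, and the coordinates are toy values.

```python
import numpy as np


def knn_impute(source_coords, source_expr, target_coords, k=5):
    """Average the expression of the k nearest source cells for each target cell.

    Hypothetical helper for illustration only (not part of the biscot API).
    """
    source_coords = np.asarray(source_coords, dtype=float)
    source_expr = np.asarray(source_expr, dtype=float)
    target_coords = np.asarray(target_coords, dtype=float)
    k = min(k, source_coords.shape[0])

    # Pairwise squared Euclidean distances, shape (n_target, n_source)
    diff = target_coords[:, None, :] - source_coords[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)

    # Indices of the k closest source cells per target cell (order within k is arbitrary)
    nearest = np.argpartition(dist2, k - 1, axis=1)[:, :k]

    # Mean expression over the neighbours, shape (n_target, n_genes)
    return source_expr[nearest].mean(axis=1)


# Toy data: four RNA cells in two clusters, one gene, two flow cells
rna_coords = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
rna_expr = np.array([[1.0], [3.0], [10.0], [20.0]])
flow_coords = np.array([[0.05, 0.0], [5.05, 5.0]])

imputed = knn_impute(rna_coords, rna_expr, flow_coords, k=2)
print(imputed)  # [[ 2.] [15.]]
```

The dense distance matrix keeps the example short; for realistic event counts a tree- or index-based neighbor search would be the more practical choice.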

High-Level Unified API

For simpler workflows, use the unified API:

from biscot import analyze_biscot, GMMConfig

# One-line temporal analysis
results = analyze_biscot(
    data={"t0": adata0, "t1": adata1, "t2": adata2},
    mode="unimodal",
    analysis_type="temporal",
    gmm_config=GMMConfig(n_components="bic"),
    plot_results=True,
    output_dir="output/"
)

# Access results
similarity_matrix = results.similarity_matrix
results.plot_summary()
results.export("output/")

Data Requirements

Flow Cytometry Data (FCS Files)

  • Format: FCS v2.0 or v3.0
  • Preprocessing: compensated, singlet-gated, and live-cell-gated data recommended
  • Transform: supply raw values, NOT log-transformed (biscot applies the transform)
  • Events: at least 10,000 per file recommended

from biscot import load_fcs, PreprocessingConfig

# Load single FCS file
adata = load_fcs(
    "sample.fcs",
    preprocessing_config=PreprocessingConfig(
        channels_to_select=["FSC", "SSC", "DAPI"],
        apply_log_transform=True
    )
)

# Load batch of FCS files
from biscot import load_fcs_batch
adatas, metadata = load_fcs_batch(
    file_paths={"t0": "t0.fcs", "t1": "t1.fcs"},
    channels=["FSC", "SSC"],
    preprocessing_config=PreprocessingConfig(apply_log_transform=True)
)

RNA-seq Data (for Multimodal)

  • Format: AnnData with PCA coordinates in .obsm['X_pca']
  • Required: Gene expression accessible (in .X or .uns['original_expression'])

import scanpy as sc

# Ensure PCA is computed
sc.pp.pca(rna_adata, n_comps=3)
assert 'X_pca' in rna_adata.obsm

Core API

Data Classes

| Class | Purpose |
|---|---|
| UniModalData | Single-modality flow cytometry analysis (temporal tracking, similarity) |
| MultiModalData | Multi-modal integration (flow + RNA-seq) |

Configuration Classes

| Class | Purpose |
|---|---|
| GMMConfig | GMM parameters: n_components, covariance_type, BIC selection |
| OTConfig | Optimal transport: method ("bary", "emd", "sinkhorn"), epsilon |
| GateDefinition | Manual gate polygons for cell filtering |
| PreprocessingConfig | Data preprocessing: channel selection, transforms, filtering |
| PaddingConfig | Dimension padding for mismatched modalities |

Key Functions

Data Loading:

  • load_fcs() - Load single FCS file
  • load_fcs_batch() - Load multiple FCS files
  • export_to_fcs() - Export AnnData to FCS

Analysis:

  • temporal_analysis() - Run temporal tracking workflow
  • similarity_analysis() - Compute pairwise sample similarities
  • cross_modal_mapping() - Cross-modal feature imputation

Model Selection:

  • select_gmm_components_bic() - Automatic BIC-based component selection

Visualization:

  • plot_fcm_gates() - Flow cytometry gate visualization
  • plot_clusters() - Cluster scatter plots
  • plot_imputed_expression() - Gene expression heatmaps
  • plot_temporal_tracking() - Temporal population tracking
  • plot_similarity_matrix() - Sample similarity heatmap

Tutorials

Quickstart

| Notebook | Description |
|---|---|
| 01_wrapper_api.ipynb | Complete cross-modal workflow (recommended start) |
| 02_full_tutorial.ipynb | Detailed manual API tutorial |

Unimodal (Flow Cytometry Only)

| Notebook | Description |
|---|---|
| 01_temporal_automated.ipynb | Automated population tracking over time |
| 02_temporal_with_gates.ipynb | Temporal analysis with manual gates |
| 03_3d_tessellation.ipynb | 3D tessellation analysis |
| 04_spatial_biofilm.ipynb | Spatial biofilm analysis |
| 05_mixture_analysis.ipynb | Mixture similarity analysis |

Multimodal (Flow + RNA-seq)

| Notebook | Description |
|---|---|
| 01_baseline_2d.ipynb | 2D baseline cross-modal integration |
| 02_dimension_padding.ipynb | Handling dimension mismatches |
| 03_padding_evaluation.ipynb | Padding method comparison |

GMM Configuration

Fixed Components

gmm_config = GMMConfig(n_components=10)

Automatic BIC Selection

# Search range (5-15 components)
gmm_config = GMMConfig(n_components="bic", k_range=(5, 15))

# With elbow detection
gmm_config = GMMConfig(
    n_components="bic",
    k_range=(5, 20),
    bic_use_elbow=True,
    bic_replicates=3
)

Covariance Types

# Full covariance (default, most flexible)
GMMConfig(n_components=10, covariance_type="full")

# Diagonal (faster, fewer parameters)
GMMConfig(n_components=10, covariance_type="diag")

# Spherical (fastest, equal variance in all directions)
GMMConfig(n_components=10, covariance_type="spherical")
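
The trade-off between covariance types comes down to how many free parameters each fitted model has, which is also the quantity that BIC penalizes. The helper below counts them using the standard GMM parameter count and the textbook BIC formula. `gmm_n_parameters` and `bic` are illustrative names, not part of the biscot API, and the 10-component, 3-channel setting is just an example.

```python
import math


def gmm_n_parameters(n_components, n_features, covariance_type="full"):
    """Count the free parameters of a Gaussian mixture model (illustrative helper)."""
    k, d = n_components, n_features
    if covariance_type == "full":
        cov_params = k * d * (d + 1) // 2  # one symmetric d x d matrix per component
    elif covariance_type == "diag":
        cov_params = k * d  # one variance per feature per component
    elif covariance_type == "spherical":
        cov_params = k  # a single variance per component
    else:
        raise ValueError(f"unsupported covariance_type: {covariance_type!r}")
    mean_params = k * d
    weight_params = k - 1  # mixture weights sum to 1
    return cov_params + mean_params + weight_params


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion (lower is better)."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


for cov in ("full", "diag", "spherical"):
    print(cov, gmm_n_parameters(10, 3, cov))
# full 99
# diag 69
# spherical 49
```

With only three channels the gap is modest, but the full-covariance term grows quadratically with the number of channels, so the difference widens quickly on higher-dimensional panels.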

Manual Gating

Define polygon gates for cell filtering:

from biscot import GateDefinition, UniModalData

# Define gates
gates = [
    GateDefinition(
        label="Population_A",
        coords=[(0.5, 0.5), (0.5, 2.0), (2.0, 2.0), (2.0, 0.5)],
        coordinate_space="log"
    ),
    GateDefinition(
        label="Population_B",
        coords=[(2.0, 1.0), (2.0, 3.0), (4.0, 3.0), (4.0, 1.0)],
        coordinate_space="log"
    ),
]

# Apply gates
data = UniModalData(channels=["FSC", "DAPI"], gates=gates)
data.add_dataset("sample1", adata)
data.apply_gates()
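
For intuition, a polygon gate keeps the events whose coordinates fall inside the polygon. The sketch below uses standard ray casting on the Population_A vertices from the example above. `point_in_polygon` is a hypothetical helper, the event values are made up, and biscot's own gating (including how coordinate_space="log" maps raw values into gate coordinates) may differ.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies strictly inside the polygon.

    Illustrative helper only; points exactly on an edge are not handled specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Vertices of Population_A, assumed to be in the same space as the events
population_a = [(0.5, 0.5), (0.5, 2.0), (2.0, 2.0), (2.0, 0.5)]

# Made-up events, already expressed in gate coordinates
events = [(1.0, 1.0), (3.0, 1.5), (0.2, 0.2)]
kept = [e for e in events if point_in_polygon(e[0], e[1], population_a)]
print(kept)  # [(1.0, 1.0)]
```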

Troubleshooting

"KeyError: 'X_pca'" (RNA data)

import scanpy as sc
sc.pp.pca(rna_adata, n_comps=3)

"ValueError: shape mismatch" (dimension mismatch)

# Use dimension padding for mismatched modalities
data.impute_missing_dimensions("flow", "rna", n_components=10)

"GMM did not converge"

# Try fewer components or diagonal covariance
gmm_config = GMMConfig(n_components=5, covariance_type="diag")

Data already log-transformed

# Check the maximum value: below ~10 suggests the data is already log-transformed
import numpy as np
print(f"Max: {np.max(adata.X)}")  # Raw data typically exceeds 10,000

# Disable log transform
PreprocessingConfig(apply_log_transform=False)

How It Works

Raw Data (FCS/H5AD)
        |
        v
  Preprocessing (log transform, channel selection)
        |
        v
  Automated Gating / GMM Clustering (identify populations)
        |
        v
  GMM-OT Alignment (match populations via Optimal Transport)
        |
        v
  KNN Imputation (transfer features across modalities)
        |
        v
  Results (imputed gene expression, population tracking, similarity matrices)

Core Methodology (GMM-OT):

  1. Fit Gaussian Mixture Models to represent cell populations in each sample/modality
  2. Use Optimal Transport on GMM components for robust population alignment
  3. Apply k-nearest neighbor strategies to impute gene expression onto flow cytometry data
  4. Track population dynamics across time points or experimental conditions
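
Step 2 can be pictured as a small discrete transport problem between GMM components. The sketch below is a minimal version under stated assumptions: components are summarized by their means and mixture weights only, the cost is the squared Euclidean distance between means (covariances are ignored), and the plan is computed with plain Sinkhorn iterations in NumPy rather than with POT. All names and numbers are hypothetical; biscot's actual cost and solver may differ.

```python
import numpy as np


def sinkhorn_plan(weights_a, weights_b, cost, epsilon=0.1, n_iters=1000):
    """Entropy-regularized OT plan between two discrete distributions.

    Illustrative Sinkhorn iteration, not biscot's solver.
    """
    a = np.asarray(weights_a, dtype=float)
    b = np.asarray(weights_b, dtype=float)
    kernel = np.exp(-np.asarray(cost, dtype=float) / epsilon)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)    # rescale rows towards marginal a
        v = b / (kernel.T @ u)  # rescale columns towards marginal b
    return u[:, None] * kernel * v[None, :]


# Hypothetical GMM summaries: component means and mixture weights per timepoint
means_t0 = np.array([[0.0, 0.0], [4.0, 4.0]])
weights_t0 = np.array([0.6, 0.4])
means_t1 = np.array([[0.2, 0.1], [4.1, 3.9], [5.0, 5.0]])
weights_t1 = np.array([0.6, 0.25, 0.15])

# Squared Euclidean distance between component means, scaled to [0, 1]
diff = means_t0[:, None, :] - means_t1[None, :, :]
cost = (diff ** 2).sum(axis=-1)
cost = cost / cost.max()

plan = sinkhorn_plan(weights_t0, weights_t1, cost, epsilon=0.1)

# For each t1 component, the t0 component that contributes most of its mass
print(plan.argmax(axis=0))  # [0 1 1]
```

In this toy setup, both the second and third t1 components draw nearly all of their mass from the second t0 component, which is the kind of pattern a temporal analysis would read as one population splitting into two.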

Citation

If you use biscot in your research, please cite:

@article{biscot2025,
  title={biscot: an Optimal Transport framework for multimodal and unimodal bacterial single-cell data analysis},
  author={Feldl et al.},
  journal={},
  year={2025}
}

Development

# Clone and install in development mode
git clone https://github.com/bio-datascience/biscot.git
cd biscot
uv sync

# Run tests
uv run pytest

# Sort imports (runs only ruff's import-sorting rules, with autofix)
uv run ruff check --select I --fix

# Format code
uv run ruff format

# Build documentation
uv run mkdocs build

# Serve documentation locally
uv run mkdocs serve
