Tele-AI/TeleFuser

TeleFuser is a high-performance runtime for world model inference and multimodal generation. It is designed for continuous, low-latency, stateful visual generation workloads such as real-time world models, interactive agents, speech-driven animation, and streaming visual systems.

Why TeleFuser

Most open-source inference stacks are optimized for one of three cases:

  • one-shot image generation
  • offline video generation
  • general LLM serving

Real-time world models need a different runtime profile: continuous execution, streaming output, bidirectional interaction, stateful sessions, long-context efficiency, and stable performance under concurrency. TeleFuser focuses on those runtime problems directly.

The project treats a world model as more than a function that returns a single clip. It provides the infrastructure needed to run a model as a continuously updated system that can receive input, keep state, and emit frames progressively.

What TeleFuser Provides

  • World-model-oriented runtime: Support for continuous video generation, interactive sessions, and bidirectional control loops.
  • AI Dev First interfaces: Pipelines can publish PIPELINE_CONTRACT / PIPELINE_MANIFEST metadata so agents and services can discover tasks, inputs, and parameters programmatically.
  • Asynchronous pipeline scheduling: Stage-based execution with request isolation, resource locking, and parallel stage groups.
  • Streaming transport: WebRTC-based streaming with media tracks plus DataChannel control for real-time inference.
  • Scalable GPU runtime: Multi-GPU execution with tensor parallelism, sequence parallelism, Ray-based deployment, and distributed worker orchestration.
  • Inference optimization stack: Triton kernels, optimized attention backends, quantization, offload, and feature caching.
  • Unified serving: Local Python API, telefuser serve for task APIs, and telefuser stream-serve for continuous streaming services.

World Model Inference Focus

TeleFuser is built around the runtime requirements that world models expose in production:

  • Continuous execution instead of one-shot calls: stream frames as they are produced instead of waiting for full completion.
  • Interactive control: accept prompts, controls, images, audio, or action signals while a session is active.
  • Stateful sessions: keep runtime state across chunks rather than rebuilding the full pipeline every step.
  • Low first-frame latency: expose partial outputs quickly through async scheduling and streaming transport.
  • Long-horizon efficiency: reduce memory pressure for long videos and repeated denoising through sequence parallelism, offload, and caching.
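As a rough illustration of the difference between a one-shot call and a stateful, continuously streaming session, here is a minimal pure-Python sketch. `ToySession` and all of its fields are hypothetical names invented for this example and are not part of the TeleFuser API; the point is only that state survives across chunks, controls can arrive mid-session, and frames are yielded as soon as they exist.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ToySession:
    """Hypothetical stateful session; not a TeleFuser class."""

    frames_per_chunk: int = 4
    frame_index: int = 0  # state that survives across chunks
    pending_controls: List[str] = field(default_factory=list)

    def send_control(self, control: str) -> None:
        # Controls can arrive at any time while the session is active.
        self.pending_controls.append(control)

    def stream(self, num_chunks: int) -> Iterator[dict]:
        for _ in range(num_chunks):
            # Apply the most recent control, if any, at the chunk boundary.
            control = self.pending_controls[-1] if self.pending_controls else None
            self.pending_controls.clear()
            for _ in range(self.frames_per_chunk):
                # Yield each frame as soon as it exists instead of
                # returning the whole clip at the end.
                yield {"index": self.frame_index, "control": control}
                self.frame_index += 1


session = ToySession(frames_per_chunk=2)
session.send_control("move_forward")
frames = list(session.stream(num_chunks=1))
session.send_control("turn_left")
frames += list(session.stream(num_chunks=1))
```

The frame counter continues from 2 in the second call because the session object is reused rather than rebuilt, which is the property the "stateful sessions" bullet describes.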

Today that maps to concrete capabilities in this repo, including:

  • bidirectional WebRTC sessions for LingBot-World-Fast
  • speech-to-video generation for LiveAct
  • streaming video processing for FlashVSR
  • long-form and continuation workflows for LongCat-Video
  • batch and async video generation for WanVideo, HunyuanVideo, and LTX Video

Quick Start

Install

pip install -e .

For development:

pip install -e ".[dev]"

For WebRTC streaming:

pip install -e ".[webrtc]"

1. Batch Video Inference

import torch

from telefuser.pipelines.wan_video.wan21_video import Wan21VideoPipeline

# Load the Wan2.1 1.3B text-to-video pipeline in bfloat16 on the GPU.
pipe = Wan21VideoPipeline.from_pretrained(
    model_id_or_path="Wan-AI/Wan2.1-T2V-1.3B",
    device="cuda",
    torch_dtype=torch.bfloat16,
)

# Generate an 81-frame clip at 832x480 (width x height).
video = pipe(
    prompt="A cat playing piano",
    num_frames=81,
    height=480,
    width=832,
)

2. Real-Time World Model Demo

TeleFuser includes a bidirectional WebRTC demo for LingBot-World-Fast.

export LINGBOT_WORLD_CHECKPOINT_DIR=/path/to/LingBot-World

telefuser stream-serve examples/stream_server/stream_lingbot_world_fast.py \
  -p 8088 \
  --skip-validation

python examples/stream_server/webrtc_bidirectional_demo.py \
  --server-url http://localhost:8088 \
  --image-path /path/to/input.png

This starts a continuous session where the client sends control messages over a WebRTC DataChannel and receives generated video frames over media tracks.
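To make the control path more concrete, the sketch below shows one way a client might serialize control messages as JSON text before sending them over a DataChannel. The message schema (`type`, `seq`, `action`, `params`) and the helper names are assumptions made for illustration only; the actual wire format is defined by the stream protocol documentation, not by this example.

```python
import json


def encode_control(action: str, seq: int, **params) -> str:
    """Serialize one control message (hypothetical schema)."""
    return json.dumps(
        {"type": "control", "seq": seq, "action": action, "params": params}
    )


def decode_control(raw: str) -> dict:
    """Parse a message and reject anything that is not a control message."""
    msg = json.loads(raw)
    if msg.get("type") != "control" or "action" not in msg:
        raise ValueError("not a control message")
    return msg


raw = encode_control("move", seq=1, direction="forward", speed=0.5)
msg = decode_control(raw)
```

A sequence number like `seq` lets the receiver drop stale or out-of-order controls, which matters when the channel is configured for low latency rather than strict ordering.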

3. Batch Service Mode

telefuser serve examples/wan_video/wan22_14b_text_to_video_service.py --port 8000

TeleFuser exposes:

  • native task APIs under /v1/tasks/*
  • OpenAI-compatible image and video APIs under /v1/images and /v1/videos
  • service metadata that reflects the pipeline contract

See docs/en/service.md for full API details.

Serving Modes

telefuser serve

Use this mode for request-response inference with task management, standard REST APIs, and service metadata.

  • good fit for batch text-to-video, image-to-video, image generation, and super-resolution
  • supports pipeline contracts for structured parameter exposure
  • supports OpenAI-compatible routes for easier client integration

telefuser stream-serve

Use this mode for continuous streaming workloads.

  • server-push WebRTC for progressive video output
  • bidirectional WebRTC for interactive control loops
  • useful for real-time world models, speech-driven generation, and streaming media pipelines

See docs/en/stream_server.md for stream protocol details.

AI Dev First Runtime

TeleFuser is designed so pipelines are understandable not only to human developers, but also to automated systems and agents.

  • PIPELINE_CONTRACT and PIPELINE_MANIFEST define supported tasks, required file inputs, defaults, and user-facing parameters.
  • the service layer uses those contracts to expose machine-readable metadata
  • the same pipeline can be used locally, through REST APIs, or through streaming services

This is the core of the project's "AI Dev First" direction: standardize runtime behavior so orchestration systems can discover and use pipelines without reverse-engineering internal code paths.
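The following sketch illustrates the general idea of a machine-readable contract: a declarative structure that lists tasks, required inputs, and defaults, plus a small resolver that an orchestration layer could run before dispatching a request. The dictionary layout and the `resolve_request` helper are hypothetical and chosen for readability; the real PIPELINE_CONTRACT and PIPELINE_MANIFEST schemas may differ.

```python
# Hypothetical contract layout; not the actual PIPELINE_CONTRACT schema.
EXAMPLE_CONTRACT = {
    "tasks": {
        "t2v": {"required_files": [], "required_params": ["prompt"]},
        "i2v": {"required_files": ["image"], "required_params": ["prompt"]},
    },
    "defaults": {"num_frames": 81, "height": 480, "width": 832},
}


def resolve_request(contract: dict, task: str, params: dict, files: dict) -> dict:
    """Validate a request against the contract and fill in defaults."""
    if task not in contract["tasks"]:
        raise ValueError(f"unsupported task: {task}")
    spec = contract["tasks"][task]
    missing = [k for k in spec["required_params"] if k not in params]
    missing += [k for k in spec["required_files"] if k not in files]
    if missing:
        raise ValueError(f"missing inputs for {task}: {missing}")
    # User-supplied parameters override the contract defaults.
    return {**contract["defaults"], **params}


resolved = resolve_request(
    EXAMPLE_CONTRACT,
    "t2v",
    {"prompt": "A cat playing piano", "num_frames": 49},
    files={},
)
```

Because the contract is plain data, an agent can inspect it to learn that `i2v` needs an `image` file without reading the pipeline's implementation.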

Architecture

TeleFuser uses a layered runtime architecture that maps cleanly to the repository structure:

  1. Access layer: FastAPI task APIs and WebRTC streaming entrypoints.
  2. Service and scheduling layer: request routing, task management, stream sessions, and orchestration.
  3. Pipeline abstraction layer: stage-based pipelines with async execution, request isolation, and resource locks.
  4. Model and optimization layer: model loading, attention selection, quantization, offload, LoRA, and cache integration.
  5. Execution backend layer: optimized ops, Triton kernels, and device-specific implementations.
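The pipeline abstraction layer's combination of async execution and resource locks can be sketched with plain `asyncio`. This is a toy model under stated assumptions, not TeleFuser's scheduler: the stage names, the `run_stage` helper, and the single `gpu_lock` are all hypothetical. Independent stages run concurrently, while stages from two requests that need the same resource serialize on a lock.

```python
import asyncio


async def run_stage(name, seconds, log, lock=None):
    """Run one stage, optionally holding a lock for its whole duration."""

    async def body():
        log.append(f"start:{name}")
        await asyncio.sleep(seconds)  # stands in for real work
        log.append(f"end:{name}")

    if lock is None:
        await body()
    else:
        async with lock:
            await body()


async def run_requests(log):
    gpu_lock = asyncio.Lock()  # one lock group guarding a shared resource
    # Independent stages: both start before either finishes.
    await asyncio.gather(
        run_stage("text_encoder", 0.02, log),
        run_stage("image_encoder", 0.02, log),
    )
    # Two requests share the resource, so their stages run one at a time.
    await asyncio.gather(
        run_stage("denoise:req1", 0.01, log, gpu_lock),
        run_stage("denoise:req2", 0.01, log, gpu_lock),
    )


log = []
asyncio.run(run_requests(log))
```

In the resulting log, the two encoder stages interleave, whereas `denoise:req2` only starts after `denoise:req1` has ended.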

Relevant directories:

telefuser/
├── service/         # REST APIs, streaming APIs, WebRTC integration
├── orchestrator/    # Pipeline orchestration
├── pipelines/       # Model-specific pipelines
├── distributed/     # TP / SP / FSDP / Ray utilities
├── feature_cache/   # AdaTaylorCache
├── ops/             # Compile-aware operator dispatch
├── kernel/triton/   # Triton kernels
└── models/          # DiT, VAE, encoders, decoders

Runtime Capabilities

  • Async pipeline scheduling: run independent stages concurrently and gate shared resources with lock groups.
  • Distributed inference: tensor parallelism, sequence parallelism, Ray-based multi-GPU deployment, and pipeline-scale orchestration.
  • Attention backends: Torch SDPA, FlashAttention, SageAttention, sparse attention variants, and other configurable implementations.
  • Feature caching: AdaTaylorCache accelerates supported diffusion models with calibrated skip/reuse logic.
  • Memory optimization: CPU offload, weight reuse, and runtime-aware loading strategies for large video models.
  • Quantization: FP8 and INT8-related runtime support where the model/backend path allows it.
  • Streaming output: progressive frame delivery over WebRTC with optional audio tracks.
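For readers unfamiliar with skip/reuse feature caching, the sketch below shows the basic pattern in its simplest form: recompute an expensive function only when its input has changed enough since the last full computation, and otherwise return the cached output. This is a generic toy and explicitly not the AdaTaylorCache algorithm; the `SkipReuseCache` class and its fixed `threshold` (standing in for whatever a calibration step would produce) are assumptions for illustration.

```python
class SkipReuseCache:
    """Toy skip/reuse cache; NOT the AdaTaylorCache algorithm."""

    def __init__(self, threshold: float):
        self.threshold = threshold  # stand-in for a calibrated value
        self.last_input = None
        self.last_output = None
        self.computed = 0
        self.reused = 0

    def __call__(self, x, expensive_fn):
        if self.last_input is not None:
            # Relative change of the input since the last full compute.
            change = abs(x - self.last_input) / (abs(self.last_input) + 1e-8)
            if change < self.threshold:
                self.reused += 1
                return self.last_output
        # Input moved too far (or nothing is cached yet): recompute and store.
        self.last_input, self.last_output = x, expensive_fn(x)
        self.computed += 1
        return self.last_output


cache = SkipReuseCache(threshold=0.05)
inputs = [1.00, 1.01, 1.02, 1.20, 1.21]
outputs = [cache(x, lambda v: v * v) for x in inputs]
```

With these inputs the function is evaluated twice and reused three times. The trade-off is accuracy for speed, which is why the threshold would need to be tuned for each model.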

Supported Pipelines

World Model and Real-Time Oriented

| Pipeline | Task | Notes |
|---|---|---|
| LingBot-World-Fast | Bidirectional world-model streaming | Interactive WebRTC control loop via examples/stream_server/stream_lingbot_world_fast.py |
| LiveAct | S2V | Speech-driven talking head generation via examples/liveact/liveact_s2v_h100.py |
| FlashVSR | VSR | Streaming video super-resolution via examples/flashvsr/README.md |
| LongCat-Video | T2V, I2V, VC | Long-form generation and continuation via examples/longcat_video/README.md |

Video Generation

| Pipeline | Task | Notes |
|---|---|---|
| WanVideo (Wan2.1 / Wan2.2) | T2V, I2V, FL2V | Main video generation family, including async and service examples in examples/wan_video/README.md |
| HunyuanVideo | T2V, I2V | Supported via examples/hunyuan_video/README.md |
| LTX Video | I2V + Audio | Unified audio-video generation via examples/ltx_video/README.md |

Image Generation and Other Multimodal Pipelines

| Pipeline | Task | Notes |
|---|---|---|
| Qwen-Image | T2I, Edit | examples/qwen_image/README.md |
| Z-Image | T2I | examples/z_image/README.md |
| Flux2 Klein | T2I | examples/flux2_klein/README.md |

Examples

To inspect CLI options for an example:

python examples/wan_video/wan21_1_3b_text_to_video_hf.py --help

The examples/README.md file documents the example runner and baseline comparison workflow.

Known Limitations

  • AdaTaylorCache is only calibrated for selected model families.
  • torch.compile support is still experimental in parts of the stack.
  • Some optimized paths require specific GPU architectures and CUDA versions.
  • World-model examples such as LingBot-World-Fast require external checkpoints and environment setup.
  • Multi-machine deployment exists in the architecture but may require project-specific integration and validation.

Development

pip install -e ".[dev]"
pre-commit install
pytest tests/

See CONTRIBUTING.md for contribution workflow and AGENTS.md for project-specific agent guidance.

License

Apache 2.0 License. See LICENSE.

Acknowledgements

TeleFuser builds on and is inspired by a broad set of open-source efforts in multimodal generation and inference systems.
