Skip to content

wpferrell/Bigsmall

Repository files navigation

PyPI version DOI License Python Downloads

BigSmall — Lossless AI Model Compression

Make any AI model ~34% smaller with bit-identical weights — and let it decide how to run on your machine. Your machine runs bigger models than you think, and BigSmall never lies to you about it.

pip install bigsmall          # CLI + compression/decompression
pip install bigsmall[torch]   # add this for model loading (from_pretrained)

A 14 GB Mistral-7B becomes 9.3 GB. A fine-tuned model becomes a 5 GB patch on top of its 14 GB base. A 14.2 GB model runs in ~2.5 GB of working memory through the streaming executor (it reads weights from disk layer by layer instead of holding the whole model). The decompressed model is every weight bit-for-bit identical to the original — each tensor's md5 (a checksum) is verified on decompress.


60-second quickstart

Three commands. No settings to learn.

bigsmall profile          # once: ~10s hardware probe (saved, never asked again)
bigsmall plan model.bs    # one sentence: what would run, where, how faithfully
bigsmall run model.bs     # do it

Real output from the reference machine (8-core CPU, RTX A4500 busy with another job):

$ bigsmall plan qwen2.5-0.5b.bsd
Running perfect mode (bit-exact, receipt verified) in CPU RAM at full CPU speed while the GPU is busy — fast mode (lossy INT4) available with --mode fast.

$ bigsmall run qwen2.5-0.5b.bsd
Running perfect mode (bit-exact, receipt verified) in CPU RAM at full CPU speed while the GPU is busy — fast mode (lossy INT4) available with --mode fast.
loaded 290 tensors into host RAM in 27.0s [mode=perfect] (bit-exact receipt honoured)

The planner picks the highest fidelity that runs at usable speed on your hardware, and one rule is enforced by the test suite itself: anything below bit-exact is announced before it happens, never silently. Full walkthrough: docs/quickstart.md.


The improvement, in two pictures

Bar chart: loading the bit-exact original took 150.9 s (Qwen2.5-0.5B) and 411.2 s (Llama-3.2-1B) in the first 4.0 build; 4.0.0 ships 25.3 s and 56.2 s — 6.0x and 7.3x faster

Bar chart: one download, three sizes — the original is 100%, the lossless .bs is about 66%, and the Ferrell Duo fast member reads about 21.5%; the Duo file holds both for about 3% more than lossless alone


What's in 4.0

Feature What you get Command
Autopilot No decisions: profile once, then one-sentence picks with receipts bigsmall profile / plan / rundocs
Ferrell Duo (.bsd) One file. Two models: the fast one (INT4, reads 21.4% of raw) and the real one (bit-exact, proven) — for ~2.9 pp over lossless-only bigsmall dualdocs
Streaming executor Run models bigger than your RAM: bounded resident set, promise checked against measurement every run bigsmall run --stream, serve-streamdocs
FP8-native lossless Models released in fp8 compress to 0.829 of their fp8 bytes, bit-exact bigsmall compress (automatic) — docs
Capacity math What fits on your card, computed honestly (13B-class lossless resident on a 20 GB GPU) bigsmall plandocs
Integrity tooling Bit-exact receipts in the file, checkpoint forensics, an untrained-model trap bigsmall xray, verifydocs

Everything from 3.x is still here: delta patches for fine-tunes, KV-cache compression (entry v3 + FP8 + 32k chunked + GPU codec), xray, reshard, resume, ECC. Full changelog →


What BigSmall does

1. Make any model smaller

bigsmall compress mistral-7b/ -o mistral-7b.bs
bigsmall decompress mistral-7b.bs -o mistral-7b-restored/

Before: 14.2 GB of safetensors. After: 9.3 GB .bs file. Saved: 4.9 GB (34%).

Every weight is bit-for-bit identical. Works on any safetensors model — LLMs, diffusion, audio, vision. BF16, F16, F32, and FP8 weights are all native.

2. Store fine-tunes as patches

bigsmall compress qwen-instruct/ --delta-from qwen-base/ -o instruct.bs
bigsmall apply qwen-base/ instruct.bs -o qwen-instruct-restored/

Before: 14.2 GB Qwen2.5-7B-Instruct. After: ~5 GB patch. If your users already have the base model, they only download what changed. Delta size is pair-dependent — measured from under 1% of full size (best ≥7B SFT pairs) to ~61% (small-model full tunes); the ≥7B official-instruct class measures 34–50%. The engine measures both codings per tensor and never ships a delta larger than standalone. Full table: docs/delta-compression.md.

3. One file. Two models: the fast one and the real one — the Ferrell Duo

bigsmall dual qwen2.5-0.5b/model.safetensors -o qwen2.5-0.5b.bsd
bigsmall run qwen2.5-0.5b.bsd                # perfect: bit-exact, verified
bigsmall run qwen2.5-0.5b.bsd --mode fast    # fast: INT4, reads 21.4% of raw

A .bsd is a Ferrell Duo: use the fast one while you work (a lossy-INT4 member that loads by reading a fifth of the bytes); the real one is the exact original — every bit, proven. The same file holds the lossless residual that reconstructs the original bits, receipt included. Measured cost: a .bsd is ~2.9 pp of raw larger than the lossless-only .bs (details).

4. Run models bigger than your RAM

bigsmall run model.bsd --stream

The layer-streaming executor keeps the resident set bounded and tells you the bound before it starts. Measured receipt, Qwen2.5-7B (14.2 GB raw): streamed forward pass bit-exact against the fully-loaded model (identical logits sha256) in ~2.5 GB of hot state. Slow is stated plainly where it is slow — see docs/features/streaming.md for the measured rates.

5. Download smaller, use instantly

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall"
)

Works exactly like a normal HuggingFace model — BigSmall decompresses transparently on load. 25+ pre-compressed models ready to use (browse them all).


Compression numbers (every published model)

Every row is a real measurement. Click a model to download it.

Model Original BigSmall Saved
Qwen2.5-14B-Instruct 29.5 GB 19.5 GB 34%
Gemma-3-12B-it 22.7 GB 14.8 GB 35%
Gemma-2-9B-it 17.2 GB 11.3 GB 34%
Llama-3.1-8B-Instruct 15.0 GB 9.7 GB 35%
Llama-3-8B-Instruct 15.0 GB 9.8 GB 34%
Qwen3-8B 15.3 GB 10.1 GB 34%
Mistral-7B-Instruct v0.3 14.2 GB 8.9 GB 37%
Mistral-7B-Instruct v0.2 14.2 GB 8.9 GB 37%
Qwen2.5-7B-Instruct 14.2 GB 9.4 GB 34%
Phi-3.5-mini-instruct 7.1 GB 4.7 GB 34%
Gemma-3-4B-it 8.0 GB 5.2 GB 35%
Qwen3-4B-Instruct 7.5 GB 5.0 GB 34%
Llama-3.2-3B-Instruct 6.4 GB 3.9 GB 39%
Gemma-2-2B-it 4.9 GB 3.2 GB 34%
Qwen2.5-3B-Instruct 5.7 GB 3.8 GB 34%
Qwen2.5-1.5B-Instruct 2.9 GB 1.9 GB 34%
Llama-3.2-1B-Instruct 2.3 GB 1.5 GB 34%
Gemma-3-1B-it 1.9 GB 1.2 GB 35%
Qwen2.5-0.5B-Instruct 920 MB 610 MB 34%
GPT-2 (117M) 548 MB 414 MB 24%
Gemma-3-270M-it 500 MB 330 MB 34%
Gemma-3-270M 500 MB 330 MB 34%
Gemma-2-2B 9.7 GB 8.1 GB 17%

Browse all 25+ models on HuggingFace →

v4-line measurements on the reference machine (2026-06 campaign): Qwen2.5-7B compressed to 65.95% of raw and ran bit-exact through the streaming executor; a real fp8 release (Qwen3-0.6B-FP8) compressed to 0.829 of its fp8 weight bytes, 507/507 tensors bit-exact.


What "lossless" actually means

Every weight in the model is mathematically identical to the original — same bit pattern, same floating-point value, same gradient, same output.

  • Not quantization. Quantization rounds weights to fewer bits and the model's behaviour changes. (The .bsd fast member is quantization — and is labelled lossy every time it is picked, which is the point.)
  • Not pruning. Pruning deletes weights.
  • Not approximation. No tricks, no calibration data, no quality drop.

BigSmall finds redundancy in the bit pattern of neural weights and stores it more compactly — the same idea as ZIP for text, but tuned for BF16 floating-point distributions. md5 is verified on every tensor at decompression. If a single bit differs, verify fails.

The honesty rules are code, not policy: any pick below a file's best fidelity carries a mandatory announcement, the announcement is printed before the load, and a test sweep across every profile × file × override combination fails the suite if a downgrade could ever happen silently. Details: docs/features/integrity.md.


CLI

bigsmall profile                            one-time hardware probe (~10s)
bigsmall plan SRC [--mode M]                the decision, one sentence, no execution
bigsmall run SRC [--mode M] [--stream]      pick, announce, load (or stream)
bigsmall dual SRC [-o OUT.bsd]              create/inspect a dual-fidelity .bsd
bigsmall compress SRC [-o OUT] [--delta-from BASE] [--resume] [--ecc]
bigsmall decompress SRC [-o OUT] [--base BASE]
bigsmall transcode SRC DST.bsr [--mode M]   re-encode for decode speed
bigsmall serve-stream SRC [--prompt ...]    tiered weight-streaming inference
bigsmall xray SRC                           checkpoint forensics
bigsmall info | scan | stat | verify | diff | apply | repair | benchmark
bigsmall migrate | status | pipeline run | reshard

Every command has --help. See docs/cli-reference.md for full examples with real output.


Python API

import bigsmall

# Round-trip a model
bigsmall.compress("model/", "model.bs")
bigsmall.decompress("model.bs", "model_back/")

# Fine-tune as a delta patch
bigsmall.compress("finetune/", "patch.bs", delta_from="base/")
bigsmall.apply("base/", "patch.bs", "finetune_back/")

# Inspect before compressing
bigsmall.detect_bf16_native("model/")
bigsmall.scan_model("model/")

# Low-VRAM streaming inference (~12x less VRAM than from_pretrained)
from bigsmall import BigSmallStreamingModel
model = BigSmallStreamingModel.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall",
    device="cuda",
    lru_max_vram_gb=2.0,
)

# Stream-decompress straight from the HF CDN — no .bs written to disk
state_dict = bigsmall.stream_from_hub("wpferrell/gpt2-bigsmall", device="cpu")

# Reshard .bs files along layer boundaries, no re-encoding
bigsmall.reshard(["model.bs"], "resharded/", target_shard_size_gb=2.0)

Research

The lossless compression ceiling for BF16 neural weights has been measured. It is ~62% of raw BF16 for any model, ~34% for ≥7B instruct fine-tunes with delta compression. We ran 300+ experiments across every known mathematical approach — entropy coding, cross-tensor prediction, learned translators, persistent homology, optimal transport, quantum-inspired methods, and more — and proved that there is no further compression available within the strict bit-identity contract.

The floor is measured at the wall, not extrapolated: across 4,143 weight matrices in 8 architectures the per-tensor entropy floor is flat (coefficient of variation ≈ 0), and trained mantissa/sign bits are coder-equivalent to matched random controls in every family tested — training only writes the exponent. The floor exists at initialization and never moves during training. Details: docs/research.md.

Curious how BigSmall relates to DFloat11, ZipNN, GGUF quants, and the rest of the landscape? One honest reference page: docs/comparison.md.

The Ferrell Duo format has its own paper — how carrying the bit-exact original next to an INT4 fast member went from +13.4 to +2.9 points of file size, every number receipted: 10.5281/zenodo.20673133 (markdown, plain-English page).

Full findings, all experiments, all dead-ends: 10.5281/zenodo.20279247. Plain-English summary: docs/research.md.


Install

pip install bigsmall                  # core
pip install "bigsmall[hf]"            # + HuggingFace integration
pip install "bigsmall[ecc]"           # + Reed-Solomon error recovery
pip install "bigsmall[all]"           # everything

Requires Python 3.9+. Works on Linux, macOS, and Windows. CPU, NVIDIA, AMD, and Apple Silicon.

New here? Start with docs/quickstart.md, or find your situation in docs/how-it-helps.md.


License

Code: Elastic License 2.0. Free for personal, research, and commercial use. SaaS providers should see LICENSING.md.

Model weights distributed in .bs format keep the license of the original model.


Links


Feedback & Community

Did BigSmall work for your model? We'd love to know.

  • Open a Discussion — share your compression results, ask questions, or suggest improvements
  • File an Issue — if something didn't work, tell us exactly what happened
  • HuggingFace — all compressed models are at huggingface.co/wpferrell

We especially want to hear:

  • Which model you compressed and what ratio you got
  • Any errors or unexpected behaviour
  • Use cases we haven't thought of

About

Lossless AI model compression - ~34% smaller with bit-identical weights; the autopilot profiles your machine, picks the highest fidelity that runs, and streams models bigger than your RAM.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages