Oct 23, 2025

BitBIRCH-Lean: Chemical Space in the Palm of Your Workstation

Ultra-lean BitBIRCH rewrite that clusters hundred-million-scale molecular libraries with on-the-fly compression, optional C++ accelerators, and a batteries-included CLI.

BitBIRCH-Lean is a fast, memory-frugal reimplementation of the BitBIRCH clustering algorithm for massive molecular libraries. The new release keeps every step (fingerprint packing, distance evaluations, tree manipulations, and refinement passes) in compressed form, while optional C++ extensions accelerate the remaining hotspots by up to 2x. Benchmarking on GPU-ready datasets shows that the lean version can process hundreds of millions of drug-like fingerprints within minutes on a workstation without sacrificing cluster quality.
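To make "compressed form" concrete, here is a minimal pure-Python sketch of comparing fingerprints without unpacking them. It is an illustration under stated assumptions, not BitBIRCH-Lean's implementation: it assumes Tanimoto similarity (the usual choice for binary fingerprints) and a layout of 8 bits per byte, and the names `pack_bits`, `popcount`, and `tanimoto_packed` are hypothetical.

```python
from typing import Sequence


def pack_bits(bits: Sequence[int]) -> bytes:
    """Pack a 0/1 sequence into bytes, 8 bits per byte (first bit is the high bit)."""
    padded = list(bits) + [0] * (-len(bits) % 8)  # zero-pad to a multiple of 8
    out = bytearray()
    for i in range(0, len(padded), 8):
        byte = 0
        for bit in padded[i:i + 8]:
            byte = (byte << 1) | (1 if bit else 0)
        out.append(byte)
    return bytes(out)


def popcount(packed: bytes) -> int:
    """Count set bits directly on the packed bytes."""
    return bin(int.from_bytes(packed, "big")).count("1")


def tanimoto_packed(a: bytes, b: bytes) -> float:
    """Tanimoto similarity |a AND b| / |a OR b| on packed fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same packed length")
    inter = popcount(bytes(x & y for x, y in zip(a, b)))
    union = popcount(bytes(x | y for x, y in zip(a, b)))
    # Convention chosen for this sketch: two all-zero fingerprints count as identical.
    return inter / union if union else 1.0
```

For example, `tanimoto_packed(pack_bits([1, 1, 0, 0, 1, 0, 0, 0, 1]), pack_bits([1, 0, 0, 0, 1, 0, 0, 0, 1]))` compares two 9-bit fingerprints that are each stored in 2 bytes; the bitwise AND and OR operate on the packed bytes, so the 0/1 arrays are never rebuilt.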

Why it matters

  • Scales to billions - the packed tree layout keeps the full hierarchy in RAM, so ultra-large enumerations become a desktop workflow instead of an HPC job.
  • Deterministic multi-round mode - a parallel bb multiround driver merges independent BitBIRCH runs, letting you shard very large fingerprint batches without losing reproducibility.
  • Transparent telemetry - every run emits JSON summaries (parameters, memory peaks, timings) and optional Murcko scaffold analyses so results plug straight into dashboards and ELNs.
  • Tunable similarity - dynamic fingerprints let you switch among rdkit, ecfp4/6, and custom encodings; the docs list recommended threshold ranges (0.5-0.65 for RDKit, 0.3-0.4 for ECFP) so you start from sane defaults.
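As a small illustration of the threshold guidance in the last bullet, the sketch below checks a candidate threshold against the quoted ranges before launching a run. The ranges come from the text above; the dictionary keys and the helper name `in_recommended_range` are hypothetical and are not part of the bblean API.

```python
from typing import Dict, Tuple

# Ranges quoted in the text above; the keys are hypothetical labels for this sketch.
RECOMMENDED_THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "rdkit": (0.5, 0.65),
    "ecfp": (0.3, 0.4),
}


def in_recommended_range(fp_kind: str, threshold: float) -> bool:
    """Return True if `threshold` lies inside the recommended range (bounds included)."""
    low, high = RECOMMENDED_THRESHOLDS[fp_kind]
    return low <= threshold <= high
```

A threshold of 0.55 passes for the "rdkit" entry but not for "ecfp", which is a cheap sanity check to run before committing to a long clustering job.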

CLI workflow

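# Install the package
pip install bblean
# Convert a SMILES file into packed fingerprints
bb fps-from-smiles path/to/library.smi --out-dir packed_fps
# Cluster the packed fingerprints in multi-round parallel mode
bb multiround packed_fps --branching 64 --threshold 0.55 --refine-num 1
# Plot a summary of the finished run
bb plot-summary bb_multiround_outputs/<run-id> --top 20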

The bb toolchain converts SMILES into packed uint8 fingerprints, clusters them either serially (bb run) or in the new multi-round parallel mode, and finally surfaces diagnostics or t-SNE views (bb plot-tsne). Outputs follow a consistent naming scheme so downstream notebooks can glob over bb_run_outputs/* without custom wiring.
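The globbing workflow mentioned above might look like the following sketch in a downstream notebook. It assumes only that each run directory under the output root contains one or more JSON files; the exact file names and fields written by bb are not specified here, so none are hard-coded, and the helper name `collect_summaries` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def collect_summaries(root: str = "bb_run_outputs") -> Dict[str, Any]:
    """Load every JSON file found directly inside each run directory under `root`."""
    summaries: Dict[str, Any] = {}
    for path in sorted(Path(root).glob("*/*.json")):
        with path.open() as fh:
            # Key each summary by its path relative to the output root.
            summaries[str(path.relative_to(root))] = json.load(fh)
    return summaries
```

Calling `collect_summaries("bb_multiround_outputs")` would gather the multi-round runs in the same way, provided they follow the same layout.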

Resources

  • Preprint: “BitBIRCH-Lean: chemical space in the palm of your workstation,” bioRxiv (2025).
  • Documentation: developer notes, best practices, and parameter tuning guides live at mqcomplab.github.io/bblean/devdocs.
  • Source code: mqcomplab/bblean (GPL-3.0) with optional C++ accelerators.
  • Releases: reproducible wheels and provenance snapshots are archived on Zenodo.