BitBIRCH-Lean is a fast, memory-frugal reimplementation of the Bit-BIRCH clustering algorithm for massive molecular libraries. The new release keeps every step (fingerprint packing, distance evaluations, tree manipulations, and refinement passes) in compressed form, while optional C++ extensions accelerate the remaining hotspots by up to 2x. Benchmarking on GPU-ready datasets shows that the lean version can process hundreds of millions of drug-like fingerprints within minutes on a workstation without ceding cluster quality.
Why it matters
- Scales to billions - the packed tree layout keeps the full hierarchy in RAM, so ultra-large enumerations become a desktop workflow instead of an HPC job.
- Deterministic multi-round mode - a parallel bb multiround driver merges independent BitBIRCH runs, letting you shard very large fingerprint batches without losing reproducibility.
- Transparent telemetry - every run emits JSON summaries (parameters, memory peaks, timings) and optional Murcko scaffold analyses so results plug straight into dashboards and ELNs.
- Tunable similarity - dynamic fingerprints let you switch between rdkit, ecfp4/6, or custom encodings; the docs capture recommended threshold ranges (0.5-0.65 for RDKit, 0.3-0.4 for ECFP) so you start from sane defaults.
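To make the packed representation and the threshold parameter concrete, here is a minimal pure-Python sketch of Tanimoto similarity evaluated directly on packed bytes. It is not taken from bblean: the helper names (pack_bits, packed_tanimoto), the toy 16-bit fingerprints, and the choice to score two empty fingerprints as 0.0 are all assumptions made for illustration. A real implementation would likely use vectorized array operations instead of Python loops.

```python
# Illustrative sketch only; these helpers are hypothetical, not bblean's API.

# Popcount lookup: number of set bits in each possible byte value (0..255).
POPCOUNT = [bin(value).count("1") for value in range(256)]


def pack_bits(bits):
    """Pack a sequence of 0/1 ints into bytes, 8 bits per byte (MSB first)."""
    padded = list(bits) + [0] * (-len(bits) % 8)  # zero-pad to a whole byte
    out = bytearray()
    for i in range(0, len(padded), 8):
        byte = 0
        for bit in padded[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def packed_tanimoto(a, b):
    """Tanimoto similarity computed on packed bytes, without unpacking."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same packed length")
    common = sum(POPCOUNT[x & y] for x, y in zip(a, b))  # |A and B|
    union = sum(POPCOUNT[x | y] for x, y in zip(a, b))   # |A or B|
    # Assumption of this sketch: two all-zero fingerprints score 0.0.
    return common / union if union else 0.0


# Toy 16-bit fingerprints (real ones are typically far longer).
fp_a = pack_bits([1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])
fp_b = pack_bits([1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

THRESHOLD = 0.55  # inside the 0.5-0.65 range quoted above for RDKit
similarity = packed_tanimoto(fp_a, fp_b)
print(f"{similarity:.3f}", similarity >= THRESHOLD)  # prints "0.571 True"
```

The two toy fingerprints share 4 set bits out of 7 in their union, so the similarity is 4/7. Raising the threshold above that value would keep the pair out of the same cluster in a threshold-based scheme.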
CLI workflow
pip install bblean
bb fps-from-smiles path/to/library.smi --out-dir packed_fps
bb multiround packed_fps --branching 64 --threshold 0.55 --refine-num 1
bb plot-summary bb_multiround_outputs/<run-id> --top 20
The bb toolchain converts SMILES into packed uint8 fingerprints, clusters them either serially (bb run) or in the new multi-round parallel mode, and finally surfaces diagnostics or t-SNE views (bb plot-tsne). Outputs follow a consistent naming scheme so downstream notebooks can glob over bb_run_outputs/* without custom wiring.
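As a rough sketch of the "glob over outputs" pattern, the snippet below collects whatever JSON files sit in each run directory. The function name load_run_summaries is hypothetical, and the layout it assumes (JSON files placed directly inside each run directory) is a guess; the actual file names and keys should be checked against the bblean documentation. No particular JSON schema is assumed.

```python
# Illustrative sketch only; the directory layout is an assumption.
import json
from pathlib import Path


def load_run_summaries(root="bb_run_outputs"):
    """Map each run directory name to the parsed JSON files found inside it."""
    summaries = {}
    for run_dir in sorted(Path(root).glob("*")):
        if not run_dir.is_dir():
            continue  # skip stray files next to the run directories
        summaries[run_dir.name] = {
            path.name: json.loads(path.read_text())
            for path in sorted(run_dir.glob("*.json"))
        }
    return summaries


# List which JSON files each run produced, without assuming their contents.
for run_name, files in load_run_summaries().items():
    print(run_name, sorted(files))
```

Sorting the directory and file listings keeps the iteration order stable across machines, which is convenient when comparing runs in a notebook.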
Resources
- Preprint: “BitBIRCH-Lean: chemical space in the palm of your workstation,” bioRxiv (2025).
- Documentation: developer notes, best practices, and parameter tuning guides live at mqcomplab.github.io/bblean/devdocs.
- Source code: mqcomplab/bblean (GPL-3.0) with optional C++ accelerators.
- Releases: reproducible wheels and provenance snapshots are archived on Zenodo.