CellARC reframes “intelligence tests” for modern AI models by expressing every task as a cellular automaton episode. Each episode provides five support pairs, one held-out query grid, and the expected solution. Systems must induce the hidden rule table (radius, alphabet, window) and generalize to the query, mirroring ARC-AGI’s program-synthesis flavor but with the interpretable physics of discrete dynamical systems.
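To make the “rule table” framing concrete, here is a minimal sketch of how a cellular automaton rule table drives one update step. It assumes a 1D automaton with zero-padded borders; the alphabet, radius, and the rule-110 example table are illustrative and not taken from any specific CellARC episode.

```python
def step(state, rule, radius, fill=0):
    """Apply one CA step; `rule` maps neighborhood tuples to next states."""
    n = len(state)
    padded = [fill] * radius + list(state) + [fill] * radius
    return [rule[tuple(padded[i:i + 2 * radius + 1])] for i in range(n)]

# Elementary rule 110 (alphabet {0, 1}, radius 1) as an example rule table:
# the neighborhood, read as a binary number, indexes a bit of 110.
rule110 = {
    (a, b, c): (110 >> (a * 4 + b * 2 + c)) & 1
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
}

print(step([0, 1, 1, 0, 1], rule110, radius=1))  # [1, 1, 1, 1, 1]
```

A solver’s job in CellARC is to recover such a table (and its radius/alphabet) from the five support pairs alone.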
Benchmark structure
- Curated splits. Train (95k), validation (1k), `test_interpolation` (1k), and `test_extrapolation` (1k) cover both in-distribution rule families and deliberately out-of-distribution scenarios.
- Metadata everywhere. The `cellarc_100k_meta` release exposes alphabet sizes, radii, window sizes, Langton’s lambda, entropy, morphology descriptors, coverage stats, and the exact lookup tables (base64) for every episode.
- Reproducible subsets. Deterministic 100-episode slices (`train_100`, `val_100`, etc.) plus the corresponding ID lists under `subset_ids/` enable quick regression tests or API-budget-friendly evals.
- Scoring parity. The leaderboard/demo site runs the exact same JSONL/Parquet artifacts that are published on Hugging Face, so submissions reproduce locally before being posted publicly.
Evaluation workflow
```python
from datasets import load_dataset

ds = load_dataset(
    "mireklzicar/cellarc_100k_meta",
    split="test_extrapolation",
    streaming=False,
)

for episode in ds:
    support_pairs = episode["train"]
    query = episode["query"]
    # infer rule table, emit prediction, track metadata from episode["meta"]
```
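The “infer rule table” step in the loop above can be sketched as follows: collect every neighborhood-to-next-state mapping observed across the support pairs. This assumes a 1D automaton, a known radius, and zero padding at the borders, all simplifying assumptions for illustration; real episodes supply the radius and alphabet via their metadata.

```python
def induce_rule(pairs, radius, fill=0):
    """Build a partial rule table from (input_row, output_row) support pairs."""
    rule = {}
    for inp, out in pairs:
        padded = [fill] * radius + list(inp) + [fill] * radius
        for i, nxt in enumerate(out):
            rule[tuple(padded[i:i + 2 * radius + 1])] = nxt
    return rule

# Toy support pairs (not real CellARC data), radius 1.
pairs = [([0, 1, 0, 0], [1, 1, 1, 0]), ([0, 0, 1, 1], [0, 1, 1, 1])]
rule = induce_rule(pairs, radius=1)
print(rule[(0, 1, 0)])  # 1
```

Note that five support pairs may not exercise every neighborhood, so a full solver also needs a policy for unseen neighborhoods (or must search over candidate completions of the table).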
- Load metadata-rich splits directly from Hugging Face (PyTorch, JAX, or Polars/Dask pipelines all work thanks to the Parquet exports).
- Train or prompt your solver using the five support pairs while respecting the automaton constraints provided in `meta`.
- Submit predictions via the open-source `cellarc` repository and verify them on the public leaderboard/demo at cellarc.mireklzicar.com.
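Before submitting, it is worth scoring predictions locally. A minimal exact-match sketch over episode dicts; the `"query"` and `"solution"` field names are assumptions about the episode schema, so adapt them to the actual keys in the released splits, and the toy episodes and identity “solver” are purely illustrative:

```python
def exact_match_accuracy(episodes, predict):
    """Fraction of episodes where the predicted grid equals the solution exactly."""
    correct = sum(predict(ep) == ep["solution"] for ep in episodes)
    return correct / len(episodes)

# Toy episodes (field names hypothetical) with an identity "solver".
toy = [
    {"query": [0, 1, 0], "solution": [0, 1, 0]},
    {"query": [1, 1, 0], "solution": [0, 0, 1]},
]
print(exact_match_accuracy(toy, lambda ep: ep["query"]))  # 0.5
```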