Nov 09, 2025

CellARC: Measuring Intelligence with Cellular Automata

Introducing a cellular-automata-native benchmark, leaderboard, and dataset suite that scores reasoning systems on metadata-rich, ARC-AGI-style episodes.
Example of four multicolor 1D CA rules from CellARC extrapolation test split.

CellARC reframes “intelligence tests” for modern AI models by expressing every task as a cellular automaton episode. Each episode provides five support pairs, one held-out query grid, and the expected solution. Systems must induce the hidden rule table (radius, alphabet, window) and generalize to the query, mirroring ARC-AGI’s program-synthesis flavor but with the interpretable physics of discrete dynamical systems.
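To make the rule-table framing concrete, here is a minimal sketch of a 1D cellular automaton step driven by a lookup table keyed on neighborhood windows. This is an illustration of the general mechanism, not the CellARC episode schema; the dict-of-tuples representation and the periodic boundary are assumptions.

```python
def ca_step(state, rule_table, radius):
    """Advance a 1D CA one step: each cell's next value is looked up
    from the rule table using its (2*radius + 1)-cell window, with
    periodic (wrap-around) boundaries."""
    n = len(state)
    return [
        rule_table[tuple(state[(i + d) % n] for d in range(-radius, radius + 1))]
        for i in range(n)
    ]

# Toy example: the classic binary rule 110 (radius 1, alphabet {0, 1}).
rule_110 = {
    (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
}
state = [0, 0, 0, 1, 0, 0, 0, 0]
next_state = ca_step(state, rule_110, radius=1)  # -> [0, 0, 1, 1, 0, 0, 0, 0]
```

Larger alphabets and radii work the same way; only the window length and the number of table entries grow.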

Benchmark structure

  • Curated splits. Train (95k), validation (1k), test_interpolation (1k), and test_extrapolation (1k) cover both in-distribution rule families and deliberately out-of-distribution scenarios.
  • Metadata everywhere. The cellarc_100k_meta release exposes alphabet sizes, radii, window sizes, Langton’s lambda, entropy, morphology descriptors, coverage stats, and the exact lookup tables (base64) for every episode.
  • Reproducible subsets. Deterministic 100-episode slices (train_100, val_100, etc.) plus the corresponding ID lists under subset_ids/ enable quick regression tests or API-budget-friendly evals.
  • Scoring parity. The leaderboard/demo site runs the same JSONL/Parquet artifacts published on Hugging Face, so submissions can be reproduced locally before being posted publicly.

Evaluation workflow

from datasets import load_dataset

ds = load_dataset(
    "mireklzicar/cellarc_100k_meta",
    split="test_extrapolation",
    streaming=False,
)

for episode in ds:
    support_pairs = episode["train"]
    query = episode["query"]
    # infer rule table, emit prediction, track metadata from episode["meta"]

  1. Load metadata-rich splits directly from Hugging Face (PyTorch, JAX, or Polars/Dask pipelines all work thanks to the Parquet exports).
  2. Train or prompt your solver using the five support pairs while respecting the automaton constraints provided in meta.
  3. Submit predictions via the open-source cellarc repository and verify them on the public leaderboard/demo at cellarc.mireklzicar.com.
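A minimal baseline for step 2 is pure memorization: collect every window-to-cell mapping observed in the support pairs, then replay the induced table on the query. The sketch below assumes the episode layout from the snippet above (`episode["train"]` pairs and `episode["query"]`) plus `input`/`output` keys and flat integer grids; these field names are assumptions, not the confirmed schema.

```python
def fit_rule_table(support_pairs, radius):
    """Collect window -> next-cell mappings observed across support pairs."""
    table = {}
    for pair in support_pairs:
        inp, out = pair["input"], pair["output"]
        n = len(inp)
        for i in range(n):
            # Periodic-boundary window of 2*radius + 1 cells (an assumption).
            window = tuple(inp[(i + d) % n] for d in range(-radius, radius + 1))
            table[window] = out[i]
    return table

def predict(query, table, radius, default=0):
    """Apply the induced table; fall back to `default` on unseen windows."""
    n = len(query)
    return [
        table.get(tuple(query[(i + d) % n] for d in range(-radius, radius + 1)), default)
        for i in range(n)
    ]

# Toy usage with a single synthetic support pair.
table = fit_rule_table([{"input": [0, 0, 1, 0], "output": [0, 1, 1, 0]}], radius=1)
prediction = predict([0, 1, 0, 0], table, radius=1)  # -> [1, 1, 0, 0]
```

Such a baseline only succeeds when the support pairs cover the windows appearing in the query, which is exactly what the coverage stats in the metadata quantify.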