Temporal Heuristic Rapid Evaluation And Dynamics
T.H.R.E.A.D. is the advanced telemetry and benchmarking suite for the Loom AI framework.
Unlike traditional benchmarks (MMLU, HumanEval) that measure the destination (static accuracy), T.H.R.E.A.D. measures the trajectory of an algorithm during its first minute of life. It acts as a dynamometer for learning algorithms, stressing them with non-stationary data streams to quantify Plasticity, Stability, and Memory in real time.
"The First Minute is Everything."
In Edge AI, Robotics, and Real-Time Systems, you do not have the luxury of offline batch training. Models must adapt now. T.H.R.E.A.D. treats the training process itself as the product. It answers:
- Wake-up Time: How many milliseconds to go from random noise to usable predictions?
- Plasticity: When physics change (e.g., frequency shift), does the model adapt or crash?
- Memory: When the environment returns to a previous state, did the model remember it?
- Safety: Is the model "sort of wrong" (10% error) or "hallucinating" (>100% error)?
T.H.R.E.A.D. calculates six specific dimensions of learning health before aggregating them:
**Throughput ($S_{tput}$):** Raw speed is important, but with diminishing returns. We use a base-10 logarithm so that 100k tok/sec isn't weighted 10x higher than 10k tok/sec (which is already sufficient for real-time).
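To make the compression concrete (assuming the throughput score is the raw base-10 log; the exact scaling constant T.H.R.E.A.D. applies isn't shown here):

$$S_{tput} = \log_{10}(\text{tok/sec}) \quad\Rightarrow\quad \log_{10}(100{,}000) = 5 \;\;\text{vs.}\;\; \log_{10}(10{,}000) = 4$$

A 10x raw-speed advantage therefore becomes only a 1.25x scoring advantage.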
**Stability ($I_{stab}$):** Measures how smooth the learning curve is. High variance (thrashing) is penalized.
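As a minimal sketch of the idea (the function name and formula are assumptions, not T.H.R.E.A.D.'s actual implementation), smoothness can be scored by penalizing the spread of step-to-step loss changes:

```go
package main

import (
	"fmt"
	"math"
)

// stabilityScore is a hypothetical sketch: it rewards smooth loss curves
// by penalizing the standard deviation of step-to-step loss deltas.
func stabilityScore(losses []float64) float64 {
	if len(losses) < 2 {
		return 1.0
	}
	n := float64(len(losses) - 1)
	var mean float64
	for i := 1; i < len(losses); i++ {
		mean += losses[i] - losses[i-1]
	}
	mean /= n
	var variance float64
	for i := 1; i < len(losses); i++ {
		d := losses[i] - losses[i-1]
		variance += (d - mean) * (d - mean)
	}
	variance /= n
	return 1.0 / (1.0 + math.Sqrt(variance)) // smooth → near 1, thrashing → near 0
}

func main() {
	smooth := []float64{1.0, 0.8, 0.6, 0.5, 0.45}
	thrash := []float64{1.0, 0.3, 1.2, 0.2, 1.5}
	fmt.Printf("smooth: %.2f  thrashing: %.2f\n", stabilityScore(smooth), stabilityScore(thrash))
}
```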
**Consistency ($R_{cons}$):** A reliability metric. It asks: "Can I trust this model right now?"
**Plasticity ($Q_{plast}$):** Measures adaptation velocity. It calculates the "Recovery Time" (how long the model takes to return to usable accuracy after the environment shifts).
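A hedged sketch of how Recovery Time could be measured (the function name, threshold, and step timing are illustrative assumptions, not T.H.R.E.A.D. API):

```go
package main

import "fmt"

// recoveryTimeMS scans per-step errors recorded after an environment shift
// and reports how long the model took to get back under an acceptable
// error level. Returns -1 if it never recovered within the window.
func recoveryTimeMS(errsAfterShift []float64, threshold, msPerStep float64) float64 {
	for i, e := range errsAfterShift {
		if e < threshold {
			return float64(i+1) * msPerStep
		}
	}
	return -1
}

func main() {
	// Error spikes after a frequency switch, then decays as the model adapts.
	errs := []float64{0.9, 0.6, 0.4, 0.2, 0.08, 0.05}
	fmt.Printf("recovered in %.0f ms\n", recoveryTimeMS(errs, 0.1, 10)) // 50 ms
}
```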
**Memory ($\Delta_{mem}$):** Measures Catastrophic Forgetting. It compares performance on the first visit to a task vs. a return visit after interference: $$\Delta_{mem} = \text{Acc}_{\text{visit 2}} - \text{Acc}_{\text{visit 1}}$$ Positive scores indicate true learning/compression; negative scores indicate overwriting. For example, returning to Sin(1x) at 65% accuracy after a 60% first visit gives $\Delta_{mem} = +5$ points.
**Safety:** Derived from `DeviationMetrics`, this penalizes "Hallucinations" (errors > 100%) much more heavily than "Inaccuracies" (errors < 10%).
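A sketch of the bucketing idea (the middle tier and the penalty weights are assumptions; only the 10% and >100% thresholds come from the text above):

```go
package main

import "fmt"

// classifyError buckets a relative error into a label and a penalty weight:
// small misses are cheap, hallucinations are expensive.
func classifyError(relErr float64) (label string, penalty float64) {
	switch {
	case relErr < 0.10:
		return "inaccuracy", 1 // "sort of wrong"
	case relErr <= 1.00:
		return "error", 5
	default:
		return "hallucination", 50 // >100% off: penalized much more heavily
	}
}

func main() {
	for _, e := range []float64{0.05, 0.5, 2.3} {
		label, p := classifyError(e)
		fmt.Printf("relErr=%.2f → %s (penalty %g)\n", e, label, p)
	}
}
```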
The final score is a single integer that balances Speed (Log-Throughput) against Intelligence (Stability, Plasticity, Memory). It is composed from two parts, sketched in code after the list below:
- **The Engine** ($S_{tput} \times I_{stab} \times R_{cons}$):
  - A fast but unstable model gets a low score.
  - A stable but slow model gets a moderate score.
  - A fast AND stable model gets a high base score.
- **The Multipliers** (Plasticity & Memory):
  - Models that adapt instantly ($Q_{plast}$) get a massive bonus multiplier.
  - Models that remember the past ($\Delta_{mem}$) get a persistence bonus.
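A minimal sketch of that composition (the bonus formulas and the final scaling are assumptions; only the engine product and the multiplier structure come from the description above):

```go
package main

import (
	"fmt"
	"math"
)

// finalScore composes a multiplicative engine of throughput x stability x
// consistency, then boosts it with plasticity and memory multipliers.
// The 1+x bonus shapes and the x100 scaling are illustrative only.
func finalScore(tokPerSec, stab, cons, plast, memDelta float64) int {
	engine := math.Log10(tokPerSec) * stab * cons // S_tput × I_stab × R_cons
	plastBonus := 1.0 + plast                     // fast adapters get a big boost
	memBonus := 1.0 + math.Max(0, memDelta)       // persistence bonus, never a penalty here
	return int(engine * plastBonus * memBonus * 100)
}

func main() {
	// Illustrative numbers in the spirit of the sample output further down.
	fmt.Println("score:", finalScore(10504, 0.83, 0.79, 0.9, 0.05))
}
```

Multiplying rather than adding means a near-zero engine term collapses the whole score, which matches the intent: raw speed cannot compensate for instability.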
A brutal test of plasticity. The model must predict the next value of a sine wave, but the frequency switches every 150ms (1.0x → 2.0x → 3.0x → 1.0x); a sketch of this generator follows the list below.
- Goal: Adapt to the new physics in under 500ms.
- Challenge: Most gradient descent methods fail to adapt fast enough or forget the previous frequency immediately.
- Winner: `StepTweenChain` (Loom's geometric update with chain rule).
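For intuition, here is a sketch of the non-stationary stream the benchmark generates (the 150ms switching schedule and the 1.0x→2.0x→3.0x→1.0x cycle come from the description above; the 1 Hz base frequency and sampling cadence are assumptions):

```go
package main

import (
	"fmt"
	"math"
)

// sineFreqAt returns the frequency multiplier active at time tMS,
// cycling 1.0x → 2.0x → 3.0x → 1.0x with a switch every 150 ms.
func sineFreqAt(tMS float64) float64 {
	cycle := []float64{1.0, 2.0, 3.0, 1.0}
	return cycle[int(tMS/150)%len(cycle)]
}

func main() {
	// Sample the stream; the learner must predict each next value.
	for t := 0.0; t < 600; t += 100 {
		f := sineFreqAt(t)
		fmt.Printf("t=%3.0fms freq=%.1fx value=%+.3f\n", t, f, math.Sin(2*math.Pi*f*t/1000))
	}
}
```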
(Coming Soon: The Shifting Class MNIST, The Windy Pole Control, The Dynamic Cipher)
Run the suite locally to verify Edge Readiness:
```bash
# Clone the repository
git clone https://github.com/openfluke/thread.git
cd thread

# Run the Sine Wave Adaptation Benchmark (Test 41)
go run oldexample/test41_sine_adaptation_60s_idiot.go
```
Sample Output:
```text
╔═════════════════════════════════════════════════════════════════════════════════════╗
║ 🌊 TEST 41: SINE WAVE ADAPTATION BENCHMARK ║
║ TRAINING: Cycle Sin(1x)→Sin(2x)→Sin(3x)→Sin(1x) [IDIOT TEST] ║
╚═════════════════════════════════════════════════════════════════════════════════════╝
🚀 [StepTweenChain] Starting...
✅ [StepTweenChain] Done | Acc: 57.4% | Stab: 83% | Cons: 79% | Tput: 10504 | Score: 684
🏆 WINNER: StepTweenChain
```
Distributed under the Apache 2.0 License. See LICENSE for more information.