Hugging Face – Posts

posted an update 2 days ago

Darwin V9 — GPQA Diamond 90.9%, #1 on the leaderboard, with pure greedy decoding
Darwin-398B-JGOS reaches 90.9% (180/198) on GPQA Diamond, the PhD-level scientific reasoning benchmark, ranking #1 on the Hugging Face GPQA Diamond leaderboard. No self-consistency, no test-time compute scaling — this was achieved with a single greedy decode (temperature 0, single sample, max 16,384 tokens). The full eval config is published in the model card, so anyone can reproduce it. Raw reasoning, no score inflation.
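
As a quick check of the arithmetic above, the following minimal sketch recomputes the headline score from the stated counts. The helper name and the config key names are illustrative assumptions; only the values (temperature 0, single sample, 16,384 max tokens, 180 of 198 correct) come from the post.

```python
# Hypothetical helper: exact-match accuracy as a percentage.
def accuracy_percent(correct: int, total: int) -> float:
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Decoding settings as stated in the post; the key names are assumptions,
# not the keys used in the published eval config.
GREEDY_EVAL = {"temperature": 0.0, "num_samples": 1, "max_new_tokens": 16_384}

score = accuracy_percent(180, 198)
print(f"{score:.1f}%")  # prints 90.9%
```

With 198 questions, each question is worth about 0.5 percentage points, so small leaderboard gaps correspond to one or two answers.
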
The result comes from Darwin V9, a patented evolutionary model-development platform. Its core idea: it never trains a model from scratch.
Why Darwin V9 beats training from scratch

Cost & speed: no trillion-token pretraining run, no months of compute — a purpose-built, high-performance model is produced in a fraction of the time.
Reuse of proven intelligence: instead of re-learning every capability from a blank slate, it selects and combines only the strengths of already-trained, already-validated models, so results are stable and predictable.
Surgical transplantation: it identifies which neural region of which model holds which capability — at the FFN (Feed Forward Network) layer level — and grafts in only the segments that contribute to the target skill.

How it works: a large model (Qwen 3.5 397B) serves as the mother model (the substrate). Several father models specialized in reasoning, coding, and language are analyzed layer by layer across their FFN regions. The segments that contribute to the target performance are extracted and transplanted into the mother model to produce a new child model. The result is a ~400B MoE that activates only ~17B parameters per token at inference, giving large-model capacity with efficient inference.
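
To make the idea of layer-level FFN grafting concrete, here is a toy sketch on plain dictionaries. It is not Darwin V9's procedure, which the post does not detail: the `layers.N.ffn.` key naming, the function name, and the tiny weight lists are all hypothetical, and the selection of which layers to graft is assumed to be given.

```python
from typing import Dict, Iterable, List

Weights = Dict[str, List[float]]


def graft_ffn_layers(mother: Weights, father: Weights, layers: Iterable[int]) -> Weights:
    """Return a copy of `mother` whose FFN tensors in `layers` come from `father`."""
    child = {name: list(values) for name, values in mother.items()}
    for layer in layers:
        prefix = f"layers.{layer}.ffn."  # hypothetical naming scheme
        for name, values in father.items():
            if not name.startswith(prefix):
                continue
            # Only graft tensors that exist in the mother with the same size.
            if name not in child or len(child[name]) != len(values):
                raise ValueError(f"incompatible tensor: {name}")
            child[name] = list(values)
    return child


mother = {"layers.0.ffn.w": [1.0, 1.0], "layers.0.attn.w": [2.0], "layers.1.ffn.w": [3.0, 3.0]}
father = {"layers.0.ffn.w": [9.0, 9.0], "layers.0.attn.w": [8.0], "layers.1.ffn.w": [7.0, 7.0]}

child = graft_ffn_layers(mother, father, layers=[1])
print(child)
```

Only the FFN tensor of layer 1 changes in the child; attention weights and the other layers stay as they were in the mother.
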
If training from scratch means rebuilding everything from a blank page, Darwin V9 means precisely recombining intelligence that has already been proven. GPQA Diamond #1 is the proof.
Model: FINAL-Bench/Darwin-398B-JGOS

ovi054

posted an update 1 day ago

Qwen3-14B Manim Expert LoRA

For the Build Small Hackathon, I built a Gradio app that turns any concept into a Manim explainer video.

It is powered by Qwen3-14B plus a Manim LoRA that I trained on a synthetic 10k dataset I generated.

👉 Try it now: build-small-hackathon/anim-vid-ai

prithivMLmods

posted an update 1 day ago

Wan2.2-I2V-Fast with highly upscaled sequential frame sampling is now available as a Spaces demo, built using Wan2.2-I2V and FLUX.2-Klein. Try the demo using the links below.👇

➠ wan2.2-i2v-fast : prithivMLmods/wan2.2-i2v-fast
➠ github: https://github.com/prithivsakthiur/wan2.2-i2v-fast
➠ collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

⤷ To learn more, visit the app page or the respective model pages.

mmhamdy

posted an update about 18 hours ago

What if you could train a model on just 10 images instead of 60,000 and still get close to the same performance?

Traditional machine learning requires thousands, even millions, of data points to achieve high accuracy. But what if we could "distill" the entire dataset into just a few synthetic samples?

This is what Dataset Distillation offers. Unlike traditional knowledge distillation, we keep the model fixed and distill the knowledge contained in a massive training set into a tiny set of synthetic distilled images.

The goal is to train a model on this ultra-small set and achieve performance that almost matches what the same model would get when trained on the massive original dataset.

For example, training on only 10 distilled MNIST images (this is equivalent to a single image per class) yields 94% accuracy, compared to 99% when training on the full 60,000 images.

Interestingly, these distilled images look significantly different (as you can see in the image below) from natural images because they are optimized for model training rather than for matching the correct data distribution.

But that's not all.

Most importantly, this same method opens the door to a potent form of data poisoning. Because distilled images are specifically optimized for rapid learning, an attacker can create a tiny set of adversarial distilled images to cause a well-trained model to forget or misclassify a specific category.

What I find fascinating about dataset distillation is this: it mimics human-like learning by letting a model grasp a concept from a single example, but it does so using alien synthetic images that mean absolutely nothing to a human eye!

What about you? What are your thoughts on it?

kingkw1

posted an update 2 days ago

I built Read-Along AI for the Hugging Face Build Small Hackathon.

It is an offline-capable reading practice app for early readers: one short sentence at a time, tap-to-hear word help, record a read-aloud attempt, then get gentle feedback.

The goal is Backyard AI in the literal sense: a tool for real home reading practice, where feedback needs to be patient, developmentally fair, and private. A child’s voice should not need to leave the app just to practice “The dog ran fast.”

What makes it small-model native:

- Exact clean readings pass immediately.
- Close or ambiguous child-speech transcripts get a second look from a fine-tuned MiniCPM phonetic evaluator.
- Meaning-changing mistakes still fail closed, e.g. “blue hat” should not pass for “red hat.”
- Off the Grid Mode runs local ASR plus the MiniCPM GGUF evaluator through llama.cpp.
- Turbo Mode uses Modal endpoints for lower-latency ASR/TTS/evaluation.
- The UI is custom Gradio with a child-facing reading canvas, clickable words, progress feedback, and celebration on success.
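
The pass / second-look / fail-closed flow in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the app's code: the function names, the 0.6 similarity threshold, and the `second_look` callback (standing in for the fine-tuned MiniCPM evaluator) are all hypothetical.

```python
import difflib
from typing import Callable, List

PUNCT = ".,!?\"'"


def normalize(text: str) -> List[str]:
    """Lowercase words with surrounding punctuation removed."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def judge_reading(
    target: str,
    transcript: str,
    second_look: Callable[[str, str], bool],
    close_threshold: float = 0.6,  # assumed cutoff for "close enough to review"
) -> str:
    want, got = normalize(target), normalize(transcript)
    if want == got:
        return "pass"  # exact clean reading passes immediately
    if len(want) != len(got):
        return "retry"  # missing or extra words: fail closed
    for expected, heard in zip(want, got):
        if expected == heard:
            continue
        similarity = difflib.SequenceMatcher(None, expected, heard).ratio()
        # Distant words ("red" vs "blue") never reach the evaluator.
        if similarity < close_threshold or not second_look(expected, heard):
            return "retry"
    return "pass"


# Stand-in evaluator: accepts one known child-speech substitution.
def stub_evaluator(expected: str, heard: str) -> bool:
    return (expected, heard) in {("ran", "wan")}


print(judge_reading("The dog ran fast.", "the dog wan fast", stub_evaluator))
print(judge_reading("red hat", "blue hat", stub_evaluator))
```

The point of the ordering is that the evaluator only ever sees near-misses, so a meaning-changing substitution is rejected before any model is consulted.
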

Targeted tracks and badges:
Backyard AI, Off-Brand, Off the Grid, Llama Champion, Well-Tuned, Tiny Titan, Sharing is Caring, Field Notes.

Space:
build-small-hackathon/read-along-ai

Demo video:
https://youtu.be/4bpbwhipLU4

Repo:
https://github.com/kingkw1/read-along-ai

Built with Codex as the lead development partner.

YMRohit

posted an update 2 days ago

A 1B model that writes GPU kernels you can trust

I fine-tuned OpenBMB's MiniCPM5-1B to write Triton GPU kernels, then let an immutable referee decide if they are real: compile, check correctness against PyTorch on adversarial inputs, time against eager, torch.compile, and torch.compile max-autotune, then block the known ways of gaming the benchmark.
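
A referee of this kind can be sketched in a few lines. The version below is a simplified stand-in, not the project's verifier: it uses plain Python functions instead of Triton kernels and PyTorch baselines, skips the compile step and the anti-gaming checks, and every name in it is hypothetical.

```python
import math
import time
from typing import Callable, Dict, List, Sequence

Kernel = Callable[[List[float]], List[float]]


def is_correct(candidate: Kernel, reference: Kernel,
               inputs: Sequence[List[float]], tol: float = 1e-9) -> bool:
    """Compare candidate and reference outputs on every input."""
    for x in inputs:
        want = reference(list(x))  # copies guard against in-place mutation
        try:
            got = candidate(list(x))
        except Exception:
            return False
        if len(got) != len(want):
            return False
        if not all(math.isclose(g, w, rel_tol=tol, abs_tol=tol) for g, w in zip(got, want)):
            return False
    return True


def median_seconds(fn: Kernel, x: List[float], repeats: int = 5) -> float:
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(list(x))
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]


def referee(candidate: Kernel, reference: Kernel, baselines: Dict[str, Kernel],
            inputs: Sequence[List[float]], bench_input: List[float]) -> dict:
    # Correctness is checked first; timing is never reported for wrong code.
    if not is_correct(candidate, reference, inputs):
        return {"verified": False, "reason": "output mismatch"}
    cand = max(median_seconds(candidate, bench_input), 1e-12)
    speedups = {name: median_seconds(fn, bench_input) / cand for name, fn in baselines.items()}
    beats_all = all(s > 1.0 for s in speedups.values())
    return {"verified": beats_all, "reason": "timed", "speedups": speedups}


reference = lambda xs: [2.0 * v for v in xs]
cheat = lambda xs: [2.0] * len(xs)  # only right when every input is 1.0
adversarial = [[], [1.0, -3.0], [1e30, 0.0]]

result = referee(cheat, reference, {"eager": reference}, adversarial, [1.0] * 1000)
print(result["verified"], result["reason"])
```

The cheating candidate would pass a test made only of ones, which is why the input set includes empty, negative, and very large values.
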

The 1B setup beat torch.compile max-autotune in 12/12 independently seeded runs. The larger Qwen3.6-27B smith pushed the same referee loop further: 76 verified compiler-beating kernels on H200, with 69 surviving a 5-run stability gate and 7 kept as single-shot probes on unseen problems. On a 376-cell shape/dtype grid, the stability-gated kernels keep a 1.49x geomean speedup; about 10% of cells lose, and those losses are reported per cell.
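
For readers unfamiliar with the summary statistics, a geometric-mean speedup and a losing-cell fraction over a grid can be computed as in this small sketch. The function names are assumptions and the sample speedups are made-up placeholders, not values from the write-up.

```python
import math
from typing import Sequence


def geomean(speedups: Sequence[float]) -> float:
    """Geometric mean: exp of the average log speedup."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty sequence of positive values")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


def losing_fraction(speedups: Sequence[float]) -> float:
    """Share of cells where the candidate is slower than the baseline."""
    return sum(1 for s in speedups if s < 1.0) / len(speedups)


# Placeholder per-cell speedups (candidate time relative to a baseline).
cells = [2.0, 1.5, 0.8, 1.2]
print(f"geomean {geomean(cells):.2f}x, losing {losing_fraction(cells):.0%}")
```

The geometric mean is the usual choice for ratios because a 2x win and a 0.5x loss cancel out, which an arithmetic mean would hide.
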

Honest bound: these are scheduling wins on memory-bound ops, not new algorithms or wins over cuBLAS/FlashAttention. The scarce thing is not the big model, it is the verifier it cannot fool.

Full write-up: https://huggingface.co/blog/YMRohit/ouroboros-kernel-mint
Try it: build-small-hackathon/ouroboros-kernel-mint
2-min demo: https://youtu.be/ViicZHktb-A

Built for #BuildSmallHackathon with MiniCPM, Qwen, Triton, Gradio, Codex, and Modal H200s.

YerbaPage

posted an update 2 days ago

Why does your LLM agent start "forgetting" things? 🤯

New survey breaks down everything about context compression for long-horizon agents 📚

- what to compress
- how to compress it
- who decides when

Plus a curated paper collection to go with it ⭐

📄 Paper: https://doi.org/10.20944/preprints202605.2065.v1
⭐ Repo: https://github.com/YerbaPage/Awesome-Agent-Context-Compression

SeaWolf-AI

posted an update 3 days ago

🚀 Introducing FINAL-Bench Quantum — an open, neutral benchmark that finally puts quantum-computing methods on one fair yardstick.

Quantum results are notoriously hard to compare. The same "logical error rate" or "query fidelity" means very different things depending on the code, noise model, hardware, and shot count. FINAL-Bench Quantum fixes that: five events judged under identical, published protocols, where every number is labeled as either measured here or quoted from a source.

Five events: ① QEC Decoder ② Optimization (Max-Cut) ③ VQE ④ QRAM ⑤ Quantum Simulation

The rules are simple and strict:
✅ Track A (measured here, with 95% confidence intervals) is kept separate from Track B (quoted from papers, not directly comparable).
🔬 Simulation and real hardware are clearly distinguished, and no quantum-advantage claims are made.
🌍 Methods from Google, IBM, NVIDIA, USTC, Riverlane and more sit side by side, with origin flags and author credits.
📤 Anyone can submit their own method via the Submit tab for review and listing.
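
As an illustration of what a 95% confidence interval on a measured rate looks like, here is a sketch using the Wilson score interval for a failure count over a number of shots. The post does not say which interval method the benchmark uses, so the choice of Wilson, the function name, and the example counts are all assumptions.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, shots: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if shots <= 0 or not 0 <= failures <= shots:
        raise ValueError("need 0 <= failures <= shots and shots > 0")
    p = failures / shots
    denom = 1.0 + z * z / shots
    center = (p + z * z / (2 * shots)) / denom
    half = z * math.sqrt(p * (1 - p) / shots + z * z / (4 * shots * shots)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Placeholder counts: 50 logical errors observed in 1000 shots.
low, high = wilson_interval(50, 1000)
print(f"error rate 0.050, 95% CI [{low:.4f}, {high:.4f}]")
```

Unlike the simple normal approximation, this interval stays inside [0, 1] and remains usable when the observed error count is zero, which matters for low error rates at modest shot counts.
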

Already on the board: real IBM Heron r2 measurements (repetition-code distance boundary, 29–175× error reduction from d3 to d5), a real-chip QRAM query fidelity of 0.92, and H₂ VQE at chemical accuracy — always labeled honestly as simulation vs hardware.

A leaderboard is only useful if you can trust it, so neutrality is the whole point: strong competitors stay in even when they beat the host, sources are quoted faithfully, and a simulation is never rounded up into a hardware claim.

Leaderboard: FINAL-Bench/quantum-bench-leaderboard
Article: https://huggingface.co/blog/FINAL-Bench/quantum-leaderboard

#quantum #QEC #QuantumComputing #benchmark

kanaria007

posted an update about 14 hours ago

✅ Article highlight: *Institutional Memory & Forgetting for Learning Worlds* (art-60-172, v0.1)

TL;DR:
This article argues that if a living world becomes training data, memory becomes infrastructure.

Logs, dialogue, labels, releases, feature stores, and model weights can turn a world into something that cannot honestly forget. Article 172 turns deletion, redaction, exclusion, forgetting requests, SANITIZED/PUBLIC releases, and unlearning claims into receipted governance lifecycles.

Read:
kanaria007/agi-structural-intelligence-protocols

Why it matters:
• prevents learning worlds from becoming “unforgettable worlds”
• separates deletion, redaction, and future extraction exclusion
• makes right-to-be-forgotten requests caseable and appealable
• preserves canon facts without preserving every memory surface
• blocks public promises like “guaranteed deletion everywhere”

What’s inside:
• retention policy contracts for what may be kept, copied, trained on, or released
• corpus segment manifests and propagation indexes for known controlled copies
• forgetting request, adjudication, remedy, deletion, redaction, and exclusion receipts
• tombstone manifests and semantic preservation receipts for canon-safe forgetting
• use eligibility receipts for deciding whether a segment may train a future run
• release contracts, redaction maps, and irreversibility disclosures for SANITIZED/PUBLIC releases
• bounded unlearning contracts and post-unlearning verification receipts

Key idea:
Do not say:

*“we deleted it, so it is forgotten.”*

Say:

*“this subject was handled under this retention policy, propagation index, adjudication path, remedy contract, tombstone, semantic preservation receipt, extraction exclusion receipt, and bounded public claim.”*

Forgetting is not a button.

It is governance with receipts.

Kasualdad

posted an update 1 day ago

From Plain English to DuckDB SQL: Building LFEDS
🏫 I just shipped Local First Education Data Stack, a plain-English-to-SQL assistant for school district analytics, for the HF Build Small Hackathon.

The problem: school staff have useful data (attendance, grades, enrollment, discipline) but no fast, private way to ask questions. Most AI tools send that data to a cloud API. LFED doesn't.

What it does:
→ Type a question like "What's the average GPA for chronically absent students in 2023-2024?"
→ A fine-tuned Qwen2.5-Coder-14B model generates DuckDB SQL
→ A validation layer rejects anything that isn't a SELECT
→ Results come back as a summary, table, CSV download, and the SQL itself
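
A SELECT-only gate like the one described above might look roughly like this. It is a conservative sketch under stated assumptions, not the project's validator: the function name, the keyword blocklist, and the table and column names in the examples are hypothetical, and a string check like this is deliberately strict (it also rejects harmless queries that merely mention a blocked word).

```python
import re

# Assumed blocklist of statement keywords that should never appear.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|copy|install|load|pragma|call|export|import)\b",
    re.IGNORECASE,
)


def is_safe_select(sql: str) -> bool:
    """Accept only a single SELECT statement with no comments or blocked keywords."""
    statement = sql.strip().rstrip(";").strip()
    if not statement:
        return False
    if ";" in statement:  # more than one statement
        return False
    if "--" in statement or "/*" in statement:  # comments can hide payloads
        return False
    if not re.match(r"select\b", statement, re.IGNORECASE):
        return False
    return FORBIDDEN.search(statement) is None


print(is_safe_select("SELECT AVG(gpa) FROM students WHERE school_year = '2023-2024';"))
print(is_safe_select("SELECT 1; DELETE FROM students"))
```

In practice a check like this works best as one layer among several, for example alongside opening the database in read-only mode, since string matching alone cannot understand SQL.
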

Two flavors:
- Live Space demo: transformers + PEFT on HF ZeroGPU
- Local-first: llama.cpp + GGUF Q4_K_M on your own machine — no data leaves

The fine-tune:
- 27,859 synthetic NL→SQL pairs
- Unsloth QLoRA r=32 on Qwen2.5-Coder-14B
- Trained on Modal A10G

Hardest lessons were not model training:
1. Scope the model's job tightly — schema + few-shots + SELECT only.
2. Validate before executing. Always.
3. ZeroGPU is PyTorch-only; llama.cpp won't work there.
4. Gradio's scoped Svelte CSS beats generic selectors — inspect the live DOM.
5. modal deploy + fn.spawn() is fire-and-forget; modal run dies if your terminal drops.
6. Data artifacts matter as much as the model — Parquet seeds, dataset card, model card.

I also published the training dataset: 25,886 question→SQL pairs on the Hub.

Links:
- Demo: https://youtu.be/cE0yp4qmFIA
- Live Space: build-small-hackathon/Kasualdad_LFED
- LoRA adapter: build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora
- GGUF: build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf
- Dataset: build-small-hackathon/lfed-training-data

#BuildSmallHackathon #BackyardAI #HuggingFace #TextToSQL #DuckDB #LocalFirst #EdTech #Qwen #QLoRA #LLM
