Body:
Description
During extensive evaluation of the framework across both domain-specific (med, code) and general reasoning domains, we identified a critical, systematic bias in the evaluation pipeline (answer_utils.py) combined with a tensor slicing edge case (inference_mas.py).
These combined issues result in a significant number of false positives (incorrectly scored as correct=True) and artificially skew the gold standard target distribution toward choice A (index 0). In our benchmarks, this artifact produces a false success rate of up to 25% during partial generation failures.
Technical Breakdown & Root Causes
1. Cascading "A" Default Fallback (CRITICAL)
In answer_utils.py inside compare_answers(), the function handles parsing failures by silently falling back to "A".
# answer_utils.py — lines 411–413 (approx)
pred_answer = extract_choice_answer(pred_text, default="A")
gold_choice = extract_choice_answer(gold_answer, default="A")
pred_choice = extract_choice_answer(pred_answer, default="A")
The Flaw: If a small/quantized model (light configuration) experiences a runtime truncation or output corruption, pred_text parsing fails and returns "A". Crucially, if the benchmark's gold target field uses an unmapped variant format, gold_choice also defaults to "A".
As a result, gold_choice == pred_choice evaluates to True, silently masking a catastrophic inference failure as a successful prediction.
2. Ground Truth Corruption via extract_gold_answer()
At line 260 of answer_utils.py:
return choice if choice is not None else "A"
If a dataset contains a non-standard answer format or long-form reasoning in the ground truth field, it is silently rewritten as "A", altering the underlying benchmark baseline distribution.
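To gauge how much a given dataset is affected, one option is to count the gold fields that fail to parse instead of rewriting them. The following is a minimal sketch under stated assumptions: parse_gold_choice() and its regex are hypothetical stand-ins for extract_gold_answer(), whose real parsing rules are not reproduced here, and the sample gold fields are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for the repository's gold parser; the real rules may differ.
_CHOICE_RE = re.compile(r"^\s*\(?([A-D])\)?[\s.:)]*$")


def parse_gold_choice(gold_text):
    """Return the choice letter, or None when the field is not a bare choice."""
    match = _CHOICE_RE.match(gold_text)
    return match.group(1) if match else None


def gold_distribution(gold_fields, fallback=None):
    """Count parsed gold labels; unparseable fields are counted under `fallback`."""
    counts = Counter()
    for field in gold_fields:
        choice = parse_gold_choice(field)
        counts[choice if choice is not None else fallback] += 1
    return counts


# Invented sample: three bare choices and two non-standard gold fields.
gold_fields = ["B", "(C)", "D.", "The answer follows from the second law.", "option 2"]

skewed = gold_distribution(gold_fields, fallback="A")   # mimics the "A" default
honest = gold_distribution(gold_fields, fallback=None)  # keeps failures visible
```

With the "A" fallback, the two unparseable fields show up as two extra "A" labels even though no sample in the list has gold answer A. With the None fallback, they stay in a separate bucket that can be reported or excluded.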
3. Prompt Leakage via Latent Solver Slicing Heuristic
In inference_mas.py (lines 1520–1523), the heuristic used to separate the generated tokens from the prompt tokens relies on a strict max_new_tokens boundary:
if sequences.size(1) > max_new_tokens:
    gen_ids = sequences[:, prompt_len:]
else:
    gen_ids = sequences  # <--- Slicing failure
The Flaw: When under-parametrized models output a highly concise response (e.g., just \boxed{B}), the total tensor length sequences.size(1) can easily fall below max_new_tokens. This bypasses the slice, causing the entire prompt tensor (including system instructions) to be decoded as the final answer.
Consequently, the regex extractor frequently parses the first character of the system prompt (e.g., Y from "You are a..." or C from "Cutting Knowledge...") rather than the actual model generation.
Proposed Remediation
We have isolated a robust, minimal fix to decouple failure states from valid 'A' choices and enforce strict tensor slicing:
Fix for answer_utils.py:
Change fallbacks to None and enforce explicit validation matching:
pred_answer = extract_choice_answer(pred_text, default=None)
gold_choice = extract_choice_answer(gold_answer, default=None)
pred_choice = extract_choice_answer(pred_answer, default=None) if pred_answer else None
correct = (gold_choice is not None and pred_choice is not None and gold_choice == pred_choice)
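The effect of switching the default can be illustrated in isolation. This is a sketch, not the repository's code: extract_choice_sketch() is a hypothetical stand-in for extract_choice_answer() that only recognizes \boxed{X}, and the example strings are invented.

```python
import re

# Hypothetical stand-in for extract_choice_answer(); the real parser may differ.
_BOXED_RE = re.compile(r"\\boxed\{\s*([A-D])\s*\}")


def extract_choice_sketch(text, default=None):
    """Return the last boxed choice found in `text`, or `default` if there is none."""
    matches = _BOXED_RE.findall(text or "")
    return matches[-1] if matches else default


def compare_with_default(pred_text, gold_text, default):
    """Compare choices; a missing choice on either side never counts as correct."""
    pred_choice = extract_choice_sketch(pred_text, default=default)
    gold_choice = extract_choice_sketch(gold_text, default=default)
    return (
        pred_choice is not None
        and gold_choice is not None
        and pred_choice == gold_choice
    )


truncated_pred = r"The most likely diagnosis is \box"  # generation cut off mid-answer
gold_variant = "Answer: second option"                 # format the parser does not map

legacy = compare_with_default(truncated_pred, gold_variant, default="A")
fixed = compare_with_default(truncated_pred, gold_variant, default=None)
valid = compare_with_default(r"So the answer is \boxed{A}", r"\boxed{A}", default=None)
```

Here legacy evaluates to True because both sides collapse to "A", while fixed evaluates to False. A genuine "A" prediction against a genuine "A" gold answer (valid) is still scored as correct, so the change only affects parsing failures.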
Fix for inference_mas.py:
Tighten the latent sequence slicing threshold by explicitly comparing against prompt_len:
gen_len = sequences.size(1)
if gen_len > max_new_tokens or gen_len > prompt_len:
    gen_ids = sequences[:, prompt_len:]
else:
    gen_ids = sequences
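The difference between the two conditions can be checked without loading a model. The sketch below uses plain Python lists in place of token-id tensors and assumes the sequence holds the prompt followed by the generated tokens; the function names and token ids are illustrative only.

```python
def slice_original(sequence, prompt_len, max_new_tokens):
    """Original heuristic: strip the prompt only when the sequence is long."""
    if len(sequence) > max_new_tokens:
        return sequence[prompt_len:]
    return sequence


def slice_proposed(sequence, prompt_len, max_new_tokens):
    """Proposed heuristic: also strip when the sequence is longer than the prompt."""
    total_len = len(sequence)
    if total_len > max_new_tokens or total_len > prompt_len:
        return sequence[prompt_len:]
    return sequence


# Illustrative token ids: a 6-token prompt followed by a 3-token answer.
prompt_ids = [101, 102, 103, 104, 105, 106]
answer_ids = [900, 901, 902]
sequence = prompt_ids + answer_ids

leaked = slice_original(sequence, prompt_len=len(prompt_ids), max_new_tokens=256)
clean = slice_proposed(sequence, prompt_len=len(prompt_ids), max_new_tokens=256)
```

With a short answer, leaked still contains the six prompt ids, whereas clean contains only the three answer ids. One boundary case remains under the same assumption: if the model generates no tokens, the sequence length equals prompt_len, and the proposed condition still falls through to the else branch and returns the prompt.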
Environment & Reproducibility
- Framework Style: sequential_light / sequential_scaled
- Hardware Context: Local orchestration (VRAM high-saturation scenarios)
- Dataset observed: Multiple-choice benchmarks (4 options)
I have already implemented and locally verified these fixes on a dedicated branch. I would be happy to submit a Pull Request if the maintainers confirm this aligns with the repository's evaluation guidelines.