bug: Systematic 'A' (Index 0) Evaluation Bias and Improper Tensor Slicing in Latent Inference #24

@danielesalpietro

Description

During extensive evaluation of the framework across both domain-specific (med, code) and general reasoning domains, we identified a critical, systematic bias in the evaluation pipeline (answer_utils.py) combined with a tensor slicing edge case (inference_mas.py).

Together, these issues produce a significant number of false positives (samples incorrectly scored as correct=True) and artificially skew the gold answer distribution toward choice A (index 0). In our benchmarks, this artifact accounts for a false success rate of up to 25% during partial generation failures.


Technical Breakdown & Root Causes

1. Cascading "A" Default Fallback (CRITICAL)

In answer_utils.py inside compare_answers(), the function handles parsing failures by silently falling back to "A".

# answer_utils.py — lines 411–413 (approx)
pred_answer = extract_choice_answer(pred_text, default="A")   
gold_choice  = extract_choice_answer(gold_answer, default="A") 
pred_choice  = extract_choice_answer(pred_answer, default="A") 

The Flaw: If a small/quantized model (light configuration) experiences a runtime truncation or output corruption, pred_text parsing fails and returns "A". Crucially, if the benchmark's gold target field uses an unmapped variant format, gold_choice also defaults to "A".
As a result, gold_choice == pred_choice evaluates to True, silently masking a catastrophic inference failure as a successful prediction.
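
The failure mode can be reproduced in isolation. The sketch below is only an illustration: this `extract_choice_answer` is a hypothetical stand-in (a simple regex over A-D), not the implementation in answer_utils.py, and the two input strings are made-up examples of a truncated generation and an unmapped gold format.

```python
import re

def extract_choice_answer(text, default="A"):
    """Hypothetical stand-in: first standalone A-D letter, else `default`."""
    match = re.search(r"\b([A-D])\b", text or "")
    return match.group(1) if match else default

# Truncated generation: no choice letter survives in the output.
pred_choice = extract_choice_answer("The answer is most likel", default="A")
# Gold label stored in a format the parser does not map (a numeric index here).
gold_choice = extract_choice_answer("2", default="A")

is_correct = gold_choice == pred_choice
print(pred_choice, gold_choice, is_correct)  # A A True
```

Neither string contains a parseable choice, yet the sample is scored as correct because both sides collapse to the same default.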

2. Ground Truth Corruption via extract_gold_answer()

At line 260 of answer_utils.py:

return choice if choice is not None else "A"

If a dataset contains a non-standard answer format or long-form reasoning in the ground truth field, it is silently rewritten as "A", altering the underlying benchmark baseline distribution.
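
To make the distribution shift concrete, here is a minimal sketch under stated assumptions: `parse_choice` is a hypothetical parser, the `fallback` parameter is added purely for illustration (the real `extract_gold_answer()` signature may differ), and the gold fields are invented examples.

```python
import re
from collections import Counter

def parse_choice(text):
    """Hypothetical parser: a lone A-D letter, optionally written as '(B)' or 'B.'."""
    match = re.fullmatch(r"\(?([A-D])\)?\.?", text.strip())
    return match.group(1) if match else None

def extract_gold_answer(text, fallback):
    # Same shape as the quoted return statement, with the fallback made explicit.
    choice = parse_choice(text)
    return choice if choice is not None else fallback

gold_fields = ["B", "(C)", "D.", "The correct option is the second one", "3"]

with_fallback = Counter(extract_gold_answer(g, "A") for g in gold_fields)
without_fallback = Counter(extract_gold_answer(g, None) for g in gold_fields)

print(with_fallback["A"])      # 2: both unparseable labels became "A"
print(without_fallback[None])  # 2: the same rows, now visible as failures
```

None of the five gold fields is actually "A", but with the fallback in place "A" becomes the most frequent label.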

3. Prompt Leakage via Latent Solver Slicing Heuristic

In inference_mas.py (lines 1520–1523), the heuristic used to separate the generated tokens from the prompt tokens relies on a strict max_new_tokens boundary:

if sequences.size(1) > max_new_tokens:
    gen_ids = sequences[:, prompt_len:]
else:
    gen_ids = sequences   # <--- Slicing failure

The Flaw: When under-parametrized models output a highly concise response (e.g., just \boxed{B}), the total tensor length sequences.size(1) can easily fall below max_new_tokens. This bypasses the slice, causing the entire prompt tensor (including system instructions) to be decoded as the final answer.
Consequently, the regex extractor frequently parses the first character of the system prompt (e.g., Y from "You are a..." or C from "Cutting Knowledge...") rather than the actual model generation.
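
The leak is easy to see on toy data. The sketch below mirrors the quoted heuristic but uses a plain Python list in place of a tensor and operates on a single sequence; the token ids, the prompt length, and the function name `slice_generated` are all assumptions chosen for illustration.

```python
def slice_generated(sequence, prompt_len, max_new_tokens):
    """Mirror of the quoted heuristic, applied to one list of token ids."""
    if len(sequence) > max_new_tokens:
        return sequence[prompt_len:]
    return sequence  # prompt tokens are returned along with the generation

prompt = list(range(100, 140))  # 40 hypothetical prompt token ids
generated = [7, 8, 9]           # a 3-token answer, e.g. \boxed{B}
sequence = prompt + generated   # 43 tokens in total

leaked = slice_generated(sequence, len(prompt), max_new_tokens=256)
sliced = slice_generated(sequence, len(prompt), max_new_tokens=16)

print(len(leaked))  # 43: prompt and answer decoded together
print(sliced)       # [7, 8, 9]
```

With a generous `max_new_tokens`, the 43-token sequence never exceeds the threshold, so the prompt is passed on to decoding unchanged.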


Proposed Remediation

We have isolated a robust, minimal fix to decouple failure states from valid 'A' choices and enforce strict tensor slicing:

Fix for answer_utils.py:

Change the fallbacks to None and require both values to parse successfully before comparing them:

pred_answer = extract_choice_answer(pred_text, default=None)   
gold_choice  = extract_choice_answer(gold_answer, default=None) 
pred_choice  = extract_choice_answer(pred_answer, default=None) if pred_answer else None

correct = (gold_choice is not None and pred_choice is not None and gold_choice == pred_choice)
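
One possible extension, which is not part of the proposed patch, is to report parse failures as their own outcome so they can be counted separately from wrong answers. The sketch below assumes a hypothetical regex-based `extract_choice_answer` and a hypothetical `score` helper; the outcome labels are invented for illustration.

```python
import re
from typing import Optional

def extract_choice_answer(text: Optional[str], default: Optional[str] = None) -> Optional[str]:
    """Hypothetical stand-in: first standalone A-D letter, else `default`."""
    match = re.search(r"\b([A-D])\b", text or "")
    return match.group(1) if match else default

def score(pred_text: Optional[str], gold_answer: Optional[str]) -> str:
    """Return one of: correct, incorrect, pred_unparsed, gold_unparsed."""
    gold_choice = extract_choice_answer(gold_answer)
    pred_choice = extract_choice_answer(pred_text)
    if gold_choice is None:
        return "gold_unparsed"  # dataset problem, not a model error
    if pred_choice is None:
        return "pred_unparsed"  # generation or extraction failure
    return "correct" if gold_choice == pred_choice else "incorrect"

print(score("The answer is most likel", "2"))  # gold_unparsed
print(score("truncated outp", "B"))            # pred_unparsed
print(score("Answer: A", "A"))                 # correct
```

A genuine "A" prediction against an "A" gold label still scores as correct, while the two failure cases no longer do.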

Fix for inference_mas.py:

Tighten the latent sequence slicing threshold by explicitly comparing against prompt_len:

total_len = sequences.size(1)   # full sequence length, prompt included
if total_len > max_new_tokens or total_len > prompt_len:
    gen_ids = sequences[:, prompt_len:]
else:
    gen_ids = sequences
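
The proposed condition can be checked on toy data. This sketch uses a plain Python list in place of a tensor and a hypothetical function name, and it assumes the sequence always begins with the prompt tokens, which may not hold for every code path in inference_mas.py.

```python
def slice_generated_fixed(sequence, prompt_len, max_new_tokens):
    """Proposed condition, applied to one list of token ids."""
    total_len = len(sequence)
    if total_len > max_new_tokens or total_len > prompt_len:
        return sequence[prompt_len:]
    return sequence

prompt = list(range(100, 140))    # 40 hypothetical prompt token ids
short_answer = prompt + [7, 8, 9]

short_result = slice_generated_fixed(short_answer, len(prompt), max_new_tokens=256)
empty_result = slice_generated_fixed(prompt, len(prompt), max_new_tokens=256)

print(short_result)       # [7, 8, 9]
print(len(empty_result))  # 40
```

The short answer is now sliced correctly. Under the same assumption, a generation of zero new tokens still reaches the else branch and returns the prompt, so that case may deserve a separate check if it can occur in practice.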

Environment & Reproducibility

  • Framework Style: sequential_light / sequential_scaled
  • Hardware Context: Local orchestration (VRAM high-saturation scenarios)
  • Dataset observed: Multiple-choice benchmarks (4 options)

I have already implemented and locally verified these fixes on a dedicated branch. I would be happy to submit a Pull Request if the maintainers confirm this aligns with the repository's evaluation guidelines.
