bug: Systematic 'A' (Index 0) Evaluation Bias and Improper Tensor Slicing in Latent Inference

## Description

During extensive evaluation of the framework across both domain-specific (`med`, `code`) and `general` reasoning domains, we identified a critical, systematic bias in the evaluation pipeline (`answer_utils.py`) combined with a tensor slicing edge case (`inference_mas.py`).

Together, these issues produce a significant number of **false positives** (samples incorrectly scored as `correct=True`) and skew the gold-answer distribution toward choice `A` (index 0). In our benchmarks, this artifact accounts for up to 25% of the false successes recorded during partial generation failures.

---

## Technical Breakdown & Root Causes

### 1. Cascading "A" Default Fallback (CRITICAL)

In `answer_utils.py`, `compare_answers()` handles parsing failures by silently falling back to `"A"`:

```python
# answer_utils.py, lines 411–413 (approx)
pred_answer = extract_choice_answer(pred_text, default="A")
gold_choice = extract_choice_answer(gold_answer, default="A")
pred_choice = extract_choice_answer(pred_answer, default="A")
```

**The Flaw:** If a small or quantized model (`light` configuration) hits a runtime truncation or produces corrupted output, parsing of `pred_text` fails and the extractor returns `"A"`. Crucially, if the benchmark's gold field uses a format the extractor does not recognize, `gold_choice` also defaults to `"A"`.
As a result, `gold_choice == pred_choice` evaluates to `True`, silently recording a catastrophic inference failure as a correct prediction. Even when the gold field parses correctly, every failed prediction is still scored as correct whenever the true answer happens to be `A`.
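The failure mode can be reproduced in isolation. The sketch below uses a simplified, hypothetical stand-in for `extract_choice_answer` (a single regex); it is not the repository's implementation, and only the `default` parameter mirrors the call sites quoted above. The input strings are invented for illustration.

```python
import re

def extract_choice_answer(text, default="A"):
    """Hypothetical stand-in: return the first standalone A-D letter, else `default`."""
    match = re.search(r"\b([A-D])\b", text or "")
    return match.group(1) if match else default

# Truncated generation: no choice letter is present.
pred_text = "the answer is most likely"
# Gold label stored as option text rather than a letter.
gold_answer = "ibuprofen"

pred_choice = extract_choice_answer(pred_text, default="A")
gold_choice = extract_choice_answer(gold_answer, default="A")

# Both sides fell back to "A", so the comparison reports a match.
correct = gold_choice == pred_choice
print(pred_choice, gold_choice, correct)  # A A True
```

Neither input contains a usable answer, yet the sample is scored as correct.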

### 2. Ground Truth Corruption via `extract_gold_answer()`

At line 260 of `answer_utils.py`:

```python
return choice if choice is not None else "A"
```

If a dataset's ground-truth field contains a non-standard answer format or long-form reasoning, the value is silently rewritten as `"A"`, which changes the benchmark's gold-answer distribution and therefore the baseline it is scored against.
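To make the skew concrete, here is a small sketch that compares the gold distribution with and without the fallback. `extract_gold_with_fallback` is a hypothetical stand-in that mirrors only the quoted `return` line, and the sample gold fields are made up.

```python
from collections import Counter

VALID_CHOICES = {"A", "B", "C", "D"}

def extract_gold_with_fallback(raw, fallback="A"):
    """Hypothetical stand-in: accept a bare A-D letter, else return `fallback`."""
    candidate = raw.strip().upper()
    choice = candidate if candidate in VALID_CHOICES else None
    return choice if choice is not None else fallback

# Invented gold fields: four parseable letters, two unparseable entries.
gold_fields = ["B", "c", "D", "The second option is correct", "option 3", "B"]

with_fallback = Counter(extract_gold_with_fallback(g) for g in gold_fields)
without_fallback = Counter(
    extract_gold_with_fallback(g, fallback=None) for g in gold_fields
)

print(with_fallback["A"])      # 2 (no gold field actually said "A")
print(without_fallback[None])  # 2 (unparseable entries stay visible)
```

With the fallback, two samples appear to have gold answer `A` even though none did; returning `None` keeps them countable as unparsed.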

### 3. Prompt Leakage via Latent Solver Slicing Heuristic

In `inference_mas.py` (lines 1520–1523), the heuristic that separates generated tokens from prompt tokens compares the total sequence length against `max_new_tokens`:

```python
if sequences.size(1) > max_new_tokens:
    gen_ids = sequences[:, prompt_len:]
else:
    gen_ids = sequences   # <--- Slicing failure: prompt tokens are kept
```

**The Flaw:** When an under-parametrized model outputs a very concise response (e.g., just `\boxed{B}`), the total length `sequences.size(1)` (prompt plus generated tokens) can easily stay at or below `max_new_tokens`. The slice is then skipped, and the *entire* sequence, prompt and system instructions included, is decoded as the final answer.
Consequently, the regex extractor frequently parses the first character of the system prompt (e.g., **Y** from *"You are a..."* or **C** from *"Cutting Knowledge..."*) rather than the actual model generation.
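The condition is easy to trigger. The sketch below mirrors the quoted heuristic on a plain Python list instead of a tensor so it runs without PyTorch; the function name `split_generated` and the token ids are hypothetical.

```python
def split_generated(sequence, prompt_len, max_new_tokens):
    """Mirror of the quoted heuristic for one row of token ids."""
    if len(sequence) > max_new_tokens:
        return sequence[prompt_len:]
    return sequence  # prompt tokens are returned along with the answer

prompt_ids = list(range(100, 140))  # 40 prompt tokens
new_ids = [7, 8, 9]                 # 3 generated tokens, e.g. a boxed letter
sequence = prompt_ids + new_ids

leaked = split_generated(sequence, prompt_len=len(prompt_ids), max_new_tokens=256)

# 43 total tokens is not greater than 256, so nothing was sliced off.
print(len(leaked))                  # 43
print(leaked[:40] == prompt_ids)    # True
```

Any prompt shorter than `max_new_tokens` combined with a short answer takes the `else` branch, so the decoded "answer" starts with the prompt text.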

---

## Proposed Remediation

We have isolated a robust, minimal fix to decouple failure states from valid 'A' choices and enforce strict tensor slicing:

### Fix for `answer_utils.py`:

Change the fallbacks to `None` and require both sides to parse successfully before comparing:

```python
pred_answer = extract_choice_answer(pred_text, default=None)
gold_choice = extract_choice_answer(gold_answer, default=None)
pred_choice = extract_choice_answer(pred_answer, default=None) if pred_answer else None

correct = (gold_choice is not None and pred_choice is not None and gold_choice == pred_choice)
```
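As a rough check of the intended behaviour, the sketch below applies the `None` fallback using a hypothetical regex-based stand-in for `extract_choice_answer`. The `compare` helper and its `parsed` flag are illustrative additions, not part of the repository; the flag is one possible way to report parse failures separately from wrong answers.

```python
import re

def extract_choice_answer(text, default=None):
    """Hypothetical stand-in: return the first standalone A-D letter, else `default`."""
    match = re.search(r"\b([A-D])\b", text or "")
    return match.group(1) if match else default

def compare(pred_text, gold_answer):
    """Score a sample only when both sides parse; report parse status separately."""
    gold_choice = extract_choice_answer(gold_answer)
    pred_choice = extract_choice_answer(pred_text)
    parsed = gold_choice is not None and pred_choice is not None
    return {"correct": parsed and gold_choice == pred_choice, "parsed": parsed}

print(compare("the answer is most likely", "ibuprofen"))
# {'correct': False, 'parsed': False}
print(compare("Final answer: A", "A"))
# {'correct': True, 'parsed': True}
print(compare("Final answer: B", "A"))
# {'correct': False, 'parsed': True}
```

A genuine `A` prediction still scores as correct, while an unparseable sample is no longer indistinguishable from one.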

### Fix for `inference_mas.py`:

Tighten the latent sequence slicing threshold by explicitly comparing against `prompt_len`:

```python
# Total length of the returned sequence (prompt + generated tokens).
total_len = sequences.size(1)
if total_len > max_new_tokens or total_len > prompt_len:
    gen_ids = sequences[:, prompt_len:]
else:
    gen_ids = sequences
```
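The proposed condition can be exercised without PyTorch by running it on plain lists. This is only a sketch: `split_generated_fixed` and the token ids are hypothetical, and the third case rests on the assumption that some code path may return generated tokens without the prompt, which is not confirmed for this repository.

```python
def split_generated_fixed(sequence, prompt_len, max_new_tokens):
    """List-based mirror of the proposed slicing condition."""
    total_len = len(sequence)
    if total_len > max_new_tokens or total_len > prompt_len:
        return sequence[prompt_len:]
    return sequence

prompt_ids = list(range(100, 140))  # 40 prompt tokens
short_answer = [7, 8, 9]

# Case 1: prompt + short answer. The prompt is now stripped.
case1 = split_generated_fixed(prompt_ids + short_answer, 40, 256)
print(case1)  # [7, 8, 9]

# Case 2: output holds only new tokens and is shorter than the prompt. Unchanged.
case2 = split_generated_fixed(short_answer, 40, 256)
print(case2)  # [7, 8, 9]

# Case 3 (assumed scenario): output holds only new tokens but is longer
# than the prompt, so the first `prompt_len` generated tokens are dropped.
long_answer = list(range(50))
case3 = split_generated_fixed(long_answer, 40, 256)
print(len(case3))  # 10
```

If the third scenario can occur in practice, a length comparison alone cannot tell the two output layouts apart, and checking whether the sequence actually begins with the prompt ids may be a safer test.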

---

## Environment & Reproducibility

* **Framework Style:** `sequential_light` / `sequential_scaled`
* **Hardware Context:** Local orchestration (VRAM high-saturation scenarios)
* **Dataset observed:** Multiple-choice benchmarks (4 options)

I have already implemented and locally verified these fixes on a dedicated branch. I would be happy to submit a Pull Request if the maintainers confirm this aligns with the repository's evaluation guidelines.
