
Fixed output_all_hiddens for hubert in huggingface_wav2vec #1587

Merged
TParcollet merged 2 commits into speechbrain:develop from gorinars:fix-hubert-output-all
Sep 29, 2022

Conversation

@gorinars
Collaborator

I am trying to extract all hidden representations from several HF models using the output_all_hiddens property recently implemented in #1570.

Specifically, I used source=["facebook/wav2vec2-base", "facebook/hubert-base-ls960", "microsoft/wavlm-base", "microsoft/wavlm-base-plus"] in HuggingFaceWav2Vec2 class.

Everything works fine except HuBERT, where we have dim(out) = 2, so the code crashes.

Unlike the others, it does not expose a 512-dimensional representation in out[1], which is not used anyway.

Taking the last element of the output to access all transformer layers should work for all these models, unless I am missing something.
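For illustration, here is a minimal sketch of the two output layouts being described (the field order and shapes are assumptions based on this discussion of the HF base models, not code from SpeechBrain itself):

```python
import torch

B, T, D = 1, 99, 768
hidden_states = tuple(torch.rand(B, T, D) for _ in range(13))

# wav2vec2/WavLM-style tuple: (last_hidden_state, extract_features, hidden_states)
wav2vec2_out = (hidden_states[-1], torch.rand(B, T, 512), hidden_states)

# HuBERT-style tuple: (last_hidden_state, hidden_states) -- no 512-dim entry
hubert_out = (hidden_states[-1], hidden_states)

# Indexing position 2 raises IndexError for the HuBERT-style tuple, but the
# last element holds all transformer layers in both layouts:
assert len(wav2vec2_out[-1]) == 13
assert len(hubert_out[-1]) == 13
```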

@gorinars gorinars requested a review from BenoitWang September 28, 2022 16:20
@BenoitWang
Collaborator

Hi @gorinars, thanks for trying this PR and pointing this out.

It is true that I didn't test all the models; however, if I got it right, these models output attentions at the very end, which is None if not specified, and that blocks your code. So we should use the second-to-last output.

Please check the HuggingFace code for these models: wav2vec2, HuBERT

@gorinars
Collaborator Author

Thanks for the quick reply @BenoitWang.

It seems that all the models I am testing have output_attentions=False, and in that case, for some reason, I do not see None in the out variable.

Let me test a bit more with different settings.

Here is a simple test that I currently use; it passes for wav2vec2 and WavLM but fails for HuBERT.

import pytest
import torch

from speechbrain.lobes.models.huggingface_wav2vec import HuggingFaceWav2Vec2


@pytest.mark.slow
@pytest.mark.parametrize(
    "model",
    [
        "facebook/wav2vec2-base",
        "facebook/hubert-base-ls960",
        "microsoft/wavlm-base",
        "microsoft/wavlm-base-plus",
        "microsoft/wavlm-base-plus-sd",
    ],
)
@pytest.mark.parametrize("batch_size", [1, 4])
def test_sb_wav2vec(batch_size, model):
    model = HuggingFaceWav2Vec2(model, "data")

    # 2 seconds of random audio at 16 kHz
    wav = torch.rand([batch_size, 32000])

    # Extract wav2vec output
    out = model.model(wav, output_hidden_states=True)

    out_expected_len = 99
    assert len(out) == 3  # fails for HuBERT, where len(out) == 2
    assert out[0].shape == torch.Size([batch_size, out_expected_len, 768])
    assert out[1].shape == torch.Size((batch_size, out_expected_len, 512))
    assert len(out[2]) == 13
    assert out[2][12].shape == torch.Size((batch_size, out_expected_len, 768))

@gorinars
Collaborator Author

gorinars commented Sep 28, 2022

OK, so enforcing model.model.config.output_attentions=True in the test above makes len(out) == 4.
With that, I actually think the safest approach would be to use explicit names, like

out[0] -> out.last_hidden_state
out[2] -> out.hidden_states

Thoughts?
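For context, HF output objects support both tuple indexing and attribute access; a minimal stand-in (a simplification for illustration, not the actual transformers class) shows why named access is robust where positional indices are not:

```python
from typing import NamedTuple, Optional, Tuple

import torch


class BaseModelOutputSketch(NamedTuple):
    # Simplified stand-in for a transformers output class (assumption)
    last_hidden_state: torch.Tensor
    hidden_states: Optional[Tuple[torch.Tensor, ...]] = None
    attentions: Optional[Tuple[torch.Tensor, ...]] = None


out = BaseModelOutputSketch(
    last_hidden_state=torch.rand(1, 99, 768),
    hidden_states=tuple(torch.rand(1, 99, 768) for _ in range(13)),
)

# Named access does not depend on how many optional fields a given model
# populates, unlike positional indices such as out[1] or out[2]:
final_layer = out.last_hidden_state
all_layers = torch.stack(out.hidden_states)
assert all_layers.shape == (13, 1, 99, 768)
```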

@BenoitWang
Collaborator

Yes, much neater, and let's hope that they always use the same names for all the models :).

Thanks!

@TParcollet
Collaborator

@BenoitWang could you review the PR and merge if it looks good to you? Thanks!

@TParcollet TParcollet merged commit 211083f into speechbrain:develop Sep 29, 2022