
CL-MASR #2033

Closed

lucadellalib wants to merge 126 commits into speechbrain:develop from lucadellalib:continual-learning

Conversation

@lucadellalib
Collaborator

@lucadellalib lucadellalib commented Jun 15, 2023

CL-MASR: A Continual Learning Benchmark for Multilingual ASR

NOTE: when merging we should not automatically delete the branch in the source repository (URL must stay available).

@Adel-Moumen
Collaborator

Hello @lucadellalib (and @poonehmousavi?),

Thanks for this PR. What is the current state of this PR? Is it ready to review?

Many thanks. :-)

@mravanelli mravanelli self-requested a review June 16, 2023 13:40
@mravanelli
Collaborator

Hi @Adel-Moumen, I plan to review this PR very soon. Please @lucadellalib, make sure the tests are passing. @Adel-Moumen, I might request a second quick review after mine.

@mravanelli mravanelli self-assigned this Jun 16, 2023
@mravanelli
Collaborator

Hi @lucadellalib,
Thank you for this PR! I have some comments and suggestions for improvement:

1. Repository Size:
- We aim to keep the main repository as lightweight as possible. Currently, it is 44 MB, and with this PR, an additional 25 MB will be added. To address this, we can move the heaviest files to the official SpeechBrain Dropbox folder, which we are setting up. I will grant you access to it. For example, you can place the following files there:
- random_idxes.txt (14.9 MB)
- the tokenizer files of wavlm.
To automate the process of downloading from our Dropbox, you can modify the preparation script. You can utilize the download_file function from speechbrain.utils.data_utils, as we have done in other recipes.
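To sketch the idea, here is a minimal stdlib version of such an idempotent download step (the URL and destination paths below are placeholders; in the recipe you would call download_file from speechbrain.utils.data_utils instead):

```python
import os
import urllib.request

def fetch_if_missing(url: str, dest: str) -> str:
    """Download url to dest unless dest already exists, so repeated
    runs of the preparation script skip the download."""
    if not os.path.isfile(dest):
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        tmp = dest + ".partial"
        urllib.request.urlretrieve(url, tmp)  # raises on network/HTTP errors
        os.replace(tmp, dest)  # atomic rename: no truncated files on failure
    return dest

# Hypothetical usage in the preparation script:
# fetch_if_missing("https://www.dropbox.com/s/<id>/random_idxes.txt?dl=1",
#                  "data/random_idxes.txt")
```

The `.partial` temporary plus `os.replace` guards against a killed run leaving a half-downloaded file that a later run would mistake for a complete one.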

2. User Convenience:
We should strive to make it as easy as possible for our users. Currently, they have to manually download all the files for all languages, which can be quite annoying. I suggest uploading all the benchmark data to Zenodo. We have done this in the past for CommonLanguage (see: https://zenodo.org/record/5036977). Please double-check the licensing, but I believe this approach should be permissible.

3. Simplifying Results Analysis:
Our goal is to enable users to analyze the results effortlessly. However, according to the readme, users need to manually copy and paste files to a common directory for the analyze_logs.py script. Is there a way we can simplify this process?

4. Readme Enhancements:
In the readme, it would be beneficial to include the expected results with a link to the output folder. We can place the output folder (along with the other files) in Dropbox. This is a standard practice we follow for all our recipes (e.g., see: https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/ASR/seq2seq).
Additionally, it would be helpful to provide an example of what the output of analyze_logs should look like.

5. Reference Paper: At some point, we should publish the reference paper on Arxiv and cite it here. This step can also be done later if you prefer.

Feel free to change the code to address my comments. Once done, I will proceed and make sure everything is running and providing the expected results.

@mravanelli
Collaborator

Thank you @lucadellalib, for addressing my previous comments. After conducting a more in-depth review, I have identified the following additional comments:

README:
- It would be helpful to include a table in the README showcasing the results that can be achieved after running each recipe.
- Once the reference paper becomes available on arXiv, please add it to the README.
- Clarify that the dataset is extracted from Common Voice and provide a link to the original dataset and the publication. This information should also be added to Zenodo.

DEPENDENCIES:
- Regarding the dependency on transformers, using transformers>=4.24,<4.29 seems unusual. Can we investigate why the latest version of transformers doesn't work? It would be beneficial to address this issue and find a suitable solution.

COMMENTS, DOCSTRINGS, AND OTHER DETAILS:

  • In the YAML files, there is a line specifying max_durations: [36000, 3600, 3600]. It would be beneficial to add a comment explaining this line to ensure immediate understanding. For example: "The maximum total durations in seconds for the train, dev, and test splits"
  • In the train*.py files, please make sure that all the functions and classes you define have concise docstrings. Additionally, in the model.py files, please provide a full docstring with runnable examples.
  • Consider changing the variable name data_dir to data_folder for the sake of uniformity with other recipes.
  • The common_voice_prepare function appears to be too verbose. It might be worth exploring ways to simplify it.
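For instance, the comment on the max_durations line could read something like this (one possible wording, assuming the splits are listed in train/dev/test order as stated above):

```yaml
# Maximum total duration (in seconds) of the data selected for the
# train, dev, and test splits, respectively
max_durations: [36000, 3600, 3600]
```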

RUNNING ISSUES:

- I managed to successfully run wavlm/results/wavlm-large/PNN. However, when attempting to run the logger analyzer, I encountered the following error:
(myenv) [ravanelm@ng20604 cl-masr]$ python analyze_logs.py wavlm/results/wavlm-large/PNN/0/
PNN:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/scratch/ravanelm/speechbrain_clmars/recipes/CommonVoice/cl-masr/analyze_logs.py", line 769, in <module>
    wer_matrix = compute_wer_matrix(
  File "/scratch/ravanelm/speechbrain_clmars/recipes/CommonVoice/cl-masr/analyze_logs.py", line 138, in compute_wer_matrix
    raise RuntimeError("Fewer locales than expected")
RuntimeError: Fewer locales than expected

Is there something I might be missing? Please note that I ran and stopped this recipe multiple times, which might not be supported.

  • When attempting to run an experiment with Whisper, I encountered the following issue:
  File "/scratch/ravanelm/speechbrain_clmars/recipes/CommonVoice/cl-masr/whisper/train_pnn.py", line 500, in <module>
	train(hparams, run_opts)
  File "/scratch/ravanelm/speechbrain_clmars/recipes/CommonVoice/cl-masr/whisper/train_pnn.py", line 312, in train
	test(
  File "/scratch/ravanelm/speechbrain_clmars/recipes/CommonVoice/cl-masr/whisper/train_pnn.py", line 262, in test
	_, _, test_data = dataio_prepare(hparams, tokenizer)
  File "/scratch/ravanelm/speechbrain_clmars/recipes/CommonVoice/cl-masr/whisper/train_pnn.py", line 134, in dataio_prepare
	train_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
  File "/scratch/ravanelm/speechbrain_clmars/speechbrain/dataio/dataset.py", line 365, in from_csv
	data = load_data_csv(csv_path, replacements)
  File "/scratch/ravanelm/speechbrain_clmars/speechbrain/dataio/dataio.py", line 133, in load_data_csv
	for row in reader:
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/csv.py", line 110, in __next__
	self.fieldnames
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/csv.py", line 97, in fieldnames
	self._fieldnames = next(self.reader)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/encodings/ascii.py", line 26, in decode
	return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 111: ordinal not in range(128)

@poonehmousavi
Collaborator

poonehmousavi commented Jun 24, 2023

@mravanelli regarding the transformers dependency pinned below 4.29: transformers >= 4.30 changed the way audio processing is managed. They removed all the model-specific audio feature extraction functions and unified them in audio_utils, which supports different kinds of audio processing. This breaks a function in our whisper_huggingface.py class that generates the mel spectrogram. The issue is discussed and resolved in this PR:
#2016
So after merging that pull request, we won't have the dependency issue anymore.

@lucadellalib
Collaborator Author

@mravanelli regarding the UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 111: ordinal not in range(128): I have never experienced this issue before. I suspect it is related to the default encoding on your platform, which might not be "UTF-8". The code at https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/dataio/dataio.py#L129 does not force "UTF-8", so according to the docs (https://docs.python.org/3/library/functions.html#open) the platform default encoding is used. Can you please check? To do so, open an interactive Python session and run import locale; locale.getpreferredencoding(False).
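A minimal sketch of the suspected failure mode and the fix (the file content here is made up; the real fix would be passing encoding="utf-8" to open in load_data_csv):

```python
import tempfile

# Write a small CSV containing non-ASCII text, explicitly as UTF-8
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".csv",
                                 delete=False) as f:
    f.write("ID,wrd\n1,québécois\n")
    path = f.name

# Reading it back WITHOUT an explicit encoding uses the platform default
# (locale.getpreferredencoding(False)); on an ASCII/Latin-1 locale this
# raises UnicodeDecodeError. Forcing UTF-8 is platform-independent:
with open(path, encoding="utf-8") as f:
    rows = f.read().splitlines()

print(rows[1])  # 1,québécois
```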

@mravanelli mravanelli mentioned this pull request Jun 26, 2023
@mravanelli
Collaborator

Here is what I got:

Python 3.10.2 (main, Feb  4 2022, 19:10:35) [GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale; locale.getpreferredencoding(False)
'UTF-8'
>>> 

I'm actually using Narval. Are you able to replicate the issue on Narval?

@mravanelli
Collaborator

Thank you for making the changes. I'm running various tests and experiments, and so far, everything seems to be working well. Here are the last minor issues to address:

  1. The analyze log script generates files with spaces in their names (e.g., "Average WER.png"). This can be quite annoying. I suggest replacing the spaces with underscores ("_") instead.

  2. I still occasionally encounter the "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position ..." error. However, when I restart the script, the error disappears. Maybe we should specify the UTF-8 encoding when writing the CSV files? Investigating a bit more now will save our users trouble later.

  3. As we discussed in the previous SpeechBrain general meeting, let's close this pull request and integrate the code into the independent repository (speechbrain/benchmarks) that I recently created. I have granted you write permission for this private repository. This seems like a good moment to make the switch.
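The fixes for points 1 and 2 above are both one-liners; a sketch, with function names of my own choosing rather than the recipe's:

```python
import csv

def safe_filename(title: str) -> str:
    """Make plot titles shell-friendly: 'Average WER.png' -> 'Average_WER.png'."""
    return title.replace(" ", "_")

def write_csv(path: str, rows) -> None:
    # Force UTF-8 on write so readers never depend on the platform locale;
    # newline="" is required by the csv module to avoid extra blank lines
    # on Windows.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

print(safe_filename("Average WER.png"))  # Average_WER.png
```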
