
CL-MASR #2033

Closed

lucadellalib wants to merge 126 commits into speechbrain:develop from lucadellalib:continual-learning

Conversation

@lucadellalib
Collaborator

@lucadellalib lucadellalib commented Jun 15, 2023

CL-MASR: A Continual Learning Benchmark for Multilingual ASR

NOTE: when merging we should not automatically delete the branch in the source repository (URL must stay available).

@Adel-Moumen
Collaborator

Hello @lucadellalib (and @poonehmousavi?),

Thanks for this PR. What is the current state of this PR? Is it ready to review?

Many thanks. :-)

@mravanelli mravanelli self-requested a review June 16, 2023 13:40
@mravanelli
Collaborator

Hi @Adel-Moumen, I plan to review this PR very soon. Please @lucadellalib, make sure the tests are passing. @Adel-Moumen, I might request a second quick review after mine.

@mravanelli mravanelli self-assigned this Jun 16, 2023
@mravanelli
Collaborator

Hi @lucadellalib,
Thank you for this PR! I have some comments and suggestions for improvement:

1. Repository Size:
- We aim to keep the main repository as lightweight as possible. Currently, it is 44 MB, and with this PR, an additional 25 MB will be added. To address this, we can move the heaviest files to the official SpeechBrain Dropbox folder, which we are setting up. I will grant you access to it. For example, you can place the following files there:
- random_idxes.txt (14.9 MB)
- the tokenizer files of wavlm.
To automate the process of downloading from our Dropbox, you can modify the preparation script. You can utilize the download_file function from speechbrain.utils.data_utils, as we have done in other recipes.
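To sketch the idea, here is a minimal stdlib version of such an idempotent download step (the URL and destination paths below are placeholders; in the recipe you would call download_file from speechbrain.utils.data_utils instead):

```python
import os
import urllib.request

def fetch_if_missing(url: str, dest: str) -> str:
    """Download url to dest unless dest already exists, so repeated
    runs of the preparation script skip the download."""
    if not os.path.isfile(dest):
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        tmp = dest + ".partial"
        urllib.request.urlretrieve(url, tmp)  # raises on network/HTTP errors
        os.replace(tmp, dest)  # atomic rename: no truncated files on failure
    return dest

# Hypothetical usage in the preparation script:
# fetch_if_missing("https://www.dropbox.com/s/<id>/random_idxes.txt?dl=1",
#                  "data/random_idxes.txt")
```

The `.partial` temporary plus `os.replace` guards against a killed run leaving a half-downloaded file that a later run would mistake for a complete one.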

2. User Convenience:
We should strive to make it as easy as possible for our users. Currently, they have to manually download all the files for all languages, which can be quite annoying. I suggest uploading all the benchmark data to Zenodo. We have done this in the past for CommonLanguage (see: https://zenodo.org/record/5036977). Please double-check the licensing, but I believe this approach should be permissible.

3. Simplifying Results Analysis:
Our goal is to enable users to analyze the results effortlessly. However, according to the readme, users need to manually copy and paste files to a common directory for the analyze_logs.py script. Is there a way we can simplify this process?

4. Readme Enhancements:
In the readme, it would be beneficial to include the expected results with a link to the output folder. We can place the output folder (along with the other files) in Dropbox. This is a standard practice we follow for all our recipes (e.g., see: https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/ASR/seq2seq).
Additionally, it would be helpful to provide an example of what the output of analyze_logs should look like.

5. Reference Paper: At some point, we should publish the reference paper on Arxiv and cite it here. This step can also be done later if you prefer.

Feel free to change the code to address my comments. Once done, I will proceed and make sure everything is running and providing the expected results.

@mravanelli
Collaborator

Thank you @lucadellalib, for addressing my previous comments. After conducting a more in-depth review, I have identified the following additional comments:

README:
- It would be helpful to include a table in the README showcasing the results that can be achieved after running each recipe.
- Once the reference paper becomes available on arXiv, please add it to the README.
- Clarify that the dataset is extracted from Common Voice and provide a link to the original dataset and the publication. This information should also be added to Zenodo.

DEPENDENCIES:
- Regarding the dependency on transformers, using transformers>=4.24,<4.29 seems unusual. Can we investigate why the latest version of transformers doesn't work? It would be beneficial to address this issue and find a suitable solution.

COMMENTS, DOCSTRINGS, AND OTHER DETAILS:

  • In the YAML files, there is a line specifying max_durations: [36000, 3600, 3600]. It would be beneficial to add a comment explaining this line to ensure immediate understanding. For example: "The maximum total durations in seconds for the train, dev, and test splits"
  • In the train*.py files, please make sure that all the functions and classes you define have concise docstrings. Additionally, in the model.py files, please provide a full docstring with runnable examples.
  • Consider changing the variable name data_dir to data_folder for the sake of uniformity with other recipes.
  • The common_voice_prepare function appears to be too verbose. It might be worth exploring ways to simplify it.
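For instance, the comment on the max_durations line could read something like this (one possible wording, assuming the splits are listed in train/dev/test order as stated above):

```yaml
# Maximum total duration (in seconds) of the data selected for the
# train, dev, and test splits, respectively
max_durations: [36000, 3600, 3600]
```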

RUNNING ISSUES:

- I managed to successfully run wavlm/results/wavlm-large/PNN. However, when attempting to run the logger analyzer, I encountered the following error:
(myenv) [ravanelm@ng20604 cl-masr]$ python analyze_logs.py wavlm/results/wavlm-large/PNN/0/
PNN:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/scratch/ravanelm/speechbrain_clmars/recipes/CommonVoice/cl-masr/analyze_logs.py", line 769, in <module>
    wer_matrix = compute_wer_matrix(
  File "/scratch/ravanelm/speechbrain_clmars/recipes/CommonVoice/cl-masr/analyze_logs.py", line 138, in compute_wer_matrix
    raise RuntimeError("Fewer locales than expected")
RuntimeError: Fewer locales than expected

Is there something I might be missing? Please note that I ran and stopped this recipe multiple times, which might not be supported.

  • When attempting to run an experiment with Whisper, I encountered the following issue:
  File "/scratch/ravanelm/speechbrain_clmars/recipes/CommonVoice/cl-masr/whisper/train_pnn.py", line 500, in <module>
	train(hparams, run_opts)
  File "/scratch/ravanelm/speechbrain_clmars/recipes/CommonVoice/cl-masr/whisper/train_pnn.py", line 312, in train
	test(
  File "/scratch/ravanelm/speechbrain_clmars/recipes/CommonVoice/cl-masr/whisper/train_pnn.py", line 262, in test
	_, _, test_data = dataio_prepare(hparams, tokenizer)
  File "/scratch/ravanelm/speechbrain_clmars/recipes/CommonVoice/cl-masr/whisper/train_pnn.py", line 134, in dataio_prepare
	train_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
  File "/scratch/ravanelm/speechbrain_clmars/speechbrain/dataio/dataset.py", line 365, in from_csv
	data = load_data_csv(csv_path, replacements)
  File "/scratch/ravanelm/speechbrain_clmars/speechbrain/dataio/dataio.py", line 133, in load_data_csv
	for row in reader:
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/csv.py", line 110, in __next__
	self.fieldnames
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/csv.py", line 97, in fieldnames
	self._fieldnames = next(self.reader)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/encodings/ascii.py", line 26, in decode
	return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 111: ordinal not in range(128)

@poonehmousavi
Collaborator

poonehmousavi commented Jun 24, 2023

@mravanelli regarding the transformers dependency pinned below 4.29: transformers >= 4.30 changed the way audio processing is managed. They removed all the model-specific audio feature extraction functions and unified them in audio_utils, which supports different kinds of audio processing. This breaks a function in our whisper_huggingface.py class that generates the mel spectrogram. The issue is discussed and resolved in this PR:
#2016
So after merging that pull request, we won't have the dependency issue anymore.

@lucadellalib
Collaborator Author

@mravanelli regarding the UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 111: ordinal not in range(128): I have never experienced this issue before. I suspect it is related to the default encoding on your platform, which might not be "UTF-8". The code at https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/dataio/dataio.py#L129 does not force "UTF-8", so according to the docs (https://docs.python.org/3/library/functions.html#open) the platform default encoding is used. Can you please check? To do so, open an interactive Python session and run import locale; locale.getpreferredencoding(False).
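A minimal sketch of the suspected failure mode and the fix (the file content here is made up; the real fix would be passing encoding="utf-8" to open in load_data_csv):

```python
import tempfile

# Write a small CSV containing non-ASCII text, explicitly as UTF-8
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".csv",
                                 delete=False) as f:
    f.write("ID,wrd\n1,québécois\n")
    path = f.name

# Reading it back WITHOUT an explicit encoding uses the platform default
# (locale.getpreferredencoding(False)); on an ASCII/Latin-1 locale this
# raises UnicodeDecodeError. Forcing UTF-8 is platform-independent:
with open(path, encoding="utf-8") as f:
    rows = f.read().splitlines()

print(rows[1])  # 1,québécois
```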

@mravanelli mravanelli mentioned this pull request Jun 26, 2023
@mravanelli
Collaborator

Here is what I got:

Python 3.10.2 (main, Feb  4 2022, 19:10:35) [GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale; locale.getpreferredencoding(False)
'UTF-8'
>>> 

I'm actually using Narval. Are you able to replicate the issue on Narval?

@mravanelli
Collaborator

Thank you for making the changes. I'm running various tests and experiments, and so far, everything seems to be working well. Here are the last minor issues to address:

  1. The analyze log script generates files with spaces in their names (e.g., "Average WER.png"). This can be quite annoying. I suggest replacing the spaces with underscores ("_") instead.

  2. I still occasionally encounter the "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position ..." error. However, when I restart the script, the error disappears. Maybe we should specify the UTF-8 encoding when writing the CSV files? Investigating a bit more now will save our users trouble later.

  3. As we discussed in the previous SpeechBrain general meeting, let's close this pull request and integrate the code into the independent repository (speechbrain/benchmarks) that I recently created. I have granted you write permission for this private repository. This seems like a good moment to make the switch.
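The fixes for points 1 and 2 above are both one-liners; a sketch, with function names of my own choosing rather than the recipe's:

```python
import csv

def safe_filename(title: str) -> str:
    """Make plot titles shell-friendly: 'Average WER.png' -> 'Average_WER.png'."""
    return title.replace(" ", "_")

def write_csv(path: str, rows) -> None:
    # Force UTF-8 on write so readers never depend on the platform locale;
    # newline="" is required by the csv module to avoid extra blank lines
    # on Windows.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

print(safe_filename("Average WER.png"))  # Average_WER.png
```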
