Conversation
What is this variable connected to? A YAML parameter?
Yeah, kinda, but it might conflict with some other concept underneath. Well, actually, the sorting flag is the flag?
Just set it to "False" instead; this simply reverts the other PR that caused the bug without needing to rethink the recipe & hparam logic.
Added a test and an initial fix outline. The test uses a CPU DDP backend; I run into these nearly-sorted batches with sorting & DDP. Edit: also, only 38/100 examples are sampled when DDP uses sorting, which should not happen. Without DDP (and with shuffling DDP), all works fine. A sketch of such a CPU-only DDP test set-up follows below.
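For reference, a minimal sketch (hypothetical worker and port, not the PR's actual test) of how a DDP test can be spawned on CPU with the gloo backend, so it can run on CPU-only CI:

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank, world_size):
    # rendezvous info; a real launcher would set these
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # ... build the sorted dataloader here and run the sorting asserts ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)
```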
The test needed to be adjusted to DDP & PyTorch's DistributedSampler and its padding behavior. The test passes with the proposed fix under DDP & sorting. Because of drop_last & the replicas, the final elements of the last batch won't be sorted within that batch, yet DDP seems to distribute the batches across replicas in order otherwise.
The sorting asserts are thus tailored to the dummy data (and a bit lazy, with "[:-1]" subset testing across all batches instead of the final batch only). Ready for review.
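To illustrate the sampler behavior described above, a minimal sketch (dummy data, not the PR's test): with shuffle=False, DistributedSampler deals indices round-robin (rank, rank + num_replicas, ...), so each rank's shard stays sorted; only the padding that makes the length divisible by num_replicas wraps around to the start and can break order at the very end, which is why asserting on batch[:-1] suffices.

```python
from torch.utils.data import DistributedSampler

data = list(range(10))  # dummy dataset, already "sorted by duration"
for rank in range(3):
    sampler = DistributedSampler(
        data, num_replicas=3, rank=rank, shuffle=False, drop_last=False
    )
    shard = [data[i] for i in sampler]
    # all but the (possibly wrapped-around) last element stay in sorted order
    assert shard[:-1] == sorted(shard[:-1])
    print(rank, shard)  # e.g. rank 1 yields [1, 4, 7, 0]
```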
| "Killing process " + str() + "\n" | ||
| "Not enough GPUs available!" | ||
| ) | ||
| if not run_opts["distributed_backend"] == "gloo": |
Again, why is gloo treated like that? :p
CPU vs GPU: there is no GPU testing on GitHub, or you'd have to pay for it :p
The way it is implemented for GPU is incompatible with CPU: gloo will crash when run CPU-only.
This PyTorch example for gloo is incompatible with SB, as it was before the contributed fix:
https://pytorch.org/tutorials/intermediate/dist_tuto.html
For this part: if there is only a CPU, of course there are not enough GPUs ;)
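To make the gloo special-casing concrete, a minimal sketch (hypothetical init_ddp helper, not SB's actual code) of why the GPU-count check is skipped for the CPU backend:

```python
import os
import torch
import torch.distributed as dist

def init_ddp(backend: str, rank: int, world_size: int):
    """Hypothetical helper mirroring the guard in the diff above."""
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    # nccl pins one CUDA device per process; gloo runs on CPU,
    # so counting GPUs only makes sense for non-gloo backends
    if backend != "gloo" and torch.cuda.device_count() < world_size:
        raise RuntimeError("Not enough GPUs available!")
    dist.init_process_group(backend, rank=rank, world_size=world_size)
```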
As described in #1722, there is an issue with sorting and DDP.
This opens up a PR. The goal is not to fix all possibilities; the goal is to restore functionality.
Testing is not yet implemented (it needs a DDP set-up; that might be a different sort of GitHub workflow altogether).
speechbrain/speechbrain/core.py
Line 726 in c229dbc
defines a shuffle variable that is simply re-used for the flags that were redefined in #1518; it defaults to False, such that sorting is kept, if wanted, throughout.
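A minimal sketch of the idea behind the fix (hypothetical make_dataloader helper, not core.py's actual code): the shuffle flag, defaulting to False, is forwarded to the DistributedSampler so that pre-sorted data is not re-shuffled under DDP.

```python
from torch.utils.data import DataLoader, DistributedSampler

def make_dataloader(dataset, shuffle=False, **loader_kwargs):
    # assumes torch.distributed is already initialized (DDP context);
    # shuffle defaults to False so recipe-side sorting is preserved
    sampler = DistributedSampler(dataset, shuffle=shuffle)
    # shuffling is delegated to the sampler; the DataLoader itself must not shuffle
    return DataLoader(dataset, sampler=sampler, shuffle=False, **loader_kwargs)
```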