Skip to content

fix: allow doclist to have nested optional document#1472

Merged
samsja merged 28 commits into
mainfrom
fix-docarray-optional
May 5, 2023
Merged

fix: allow doclist to have nested optional document#1472
samsja merged 28 commits into
mainfrom
fix-docarray-optional

Conversation

@samsja

@samsja samsja commented Apr 27, 2023

Copy link
Copy Markdown
Member

Context

This pr fix #1468

What this PR do

This pr do two things related to Optional field

DocList support for Optional field

from docarray.typing import NdArray
import numpy as np

class ImageDoc(BaseDoc):
    tensor: NdArray

class ArticleDoc(BaseDoc):
    image: Optional[ImageDoc]
    title: str

docs = DocList[ArticleDoc](
    [
        ArticleDoc(image=ImageDoc(tensor=np.ones((3, 224, 224))), title="Hello"),
        ArticleDoc(image=None, title="World"),
    ]
)

Before calling docs.image would raise an error because it will try to return a DoctList[ImageDoc] but the None value cannot he pass into such a DocList.

This PR change this behavior (this is a breaking change) and now return a list of the collected object. It can contain None or a ImageDoc.

DocVec support for Optional field

from docarray.typing import NdArray
import numpy as np

class ImageDoc(BaseDoc):
    tensor: NdArray

class ArticleDoc(BaseDoc):
    image: Optional[ImageDoc]
    title: str

docs = DocVec[ArticleDoc]([ArticleDoc(title="Hello") for _ in range(10)])

before you could not construct such a DocVec because you the doc are set to None.

We enforce as well the following behavior: if one field is None all of the other field must be None. An error will be raise at init time if this is not the case.

This pr allows this. Instead of storing all of the None it will just store None for this column.

This behavior is uniformized for the tensor field as well.

This is a breaking change. Before we were casting the None value into Nan value of each of the tensor backends and we would store the whole column of Nan value.

Now this is different we just store the column as None.

Example usage

    class Features(BaseDoc):
        tensor: NdArray[100]

    class Image(BaseDoc):
        url: ImageUrl
        features: Optional[Features]

    docs = DocVec[Image](
        [Image(url='http://url.com/foo.png') for _ in range(10)]
    )

    print(docs.features) # None

    docs.features = [Features(tensor=np.random.random([100])) for _ in range(10)]
    print(docs.features) # <DocVec[Features] (length=10)>
    assert isinstance(docs.features, DocVec[Features])

    docs.features.tensor = np.ones((10, 100))

    assert docs[0].features.tensor.shape == (100, )

    docs.features = None

    assert docs[0].features is None

Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
Joan Fontanals and others added 2 commits April 28, 2023 11:46
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
@samsja samsja marked this pull request as ready for review May 2, 2023 12:14
@samsja samsja marked this pull request as draft May 2, 2023 12:15
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
@github-actions github-actions Bot added size/m and removed size/s labels May 3, 2023
samsja added 5 commits May 3, 2023 15:05
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
@samsja samsja marked this pull request as ready for review May 3, 2023 15:08
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
@JohannesMessner

Copy link
Copy Markdown
Member

This behavior is uniformized for the tensor field as well.

This is a breaking change. Before we were casting the None value into Nan value of each of the tensor backends and we would store the whole column of Nan value.

is this breaking change necessary? can't we just keep nan values in the respective backend format?

Comment thread docarray/array/doc_list/doc_list.py Outdated
Comment thread docarray/array/doc_list/doc_list.py Outdated
Comment thread docarray/array/doc_vec/column_storage.py Outdated
Comment thread docarray/array/doc_vec/column_storage.py Outdated
Comment thread docarray/array/doc_vec/doc_vec.py Outdated
Comment thread docarray/array/doc_vec/doc_vec.py
Comment thread docarray/array/doc_vec/doc_vec.py Outdated
Comment thread docarray/array/doc_vec/doc_vec.py
Comment thread docarray/array/doc_vec/doc_vec.py
Comment thread docarray/array/doc_vec/doc_vec.py
@samsja

samsja commented May 3, 2023

Copy link
Copy Markdown
Member Author

is this breaking change necessary? can't we just keep nan values in the respective backend format?

Hmm not it is none. But this is a way harder story to tell if we have different behavior for Optional[Tensor] or Optional[BaseDoc]. I would rather have the same behavior. Plus I think this one is way more optimized and solve benchmark problem we had before wit pytorch. Because we don't carry long none value anymore

samsja and others added 2 commits May 4, 2023 09:09
Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com>
Signed-off-by: samsja <55492238+samsja@users.noreply.github.com>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
samsja and others added 8 commits May 4, 2023 09:13
Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com>
Signed-off-by: samsja <55492238+samsja@users.noreply.github.com>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com>
Signed-off-by: samsja <55492238+samsja@users.noreply.github.com>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>

@JohannesMessner JohannesMessner left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be pain in the butt, a lot of suggestions :D

But honestly it seems like the entire doc section could be condensed down to a few sentences:

  • Nested optionals are a bit tricky, watch out
  • For doclist the return will be a DocList instead of DocList[...]
  • For docvec the return will be DocList[...] if all documents are there, or None if all docs are None. No mix of docs and None allowed!

I would lead with these "rules", I think they are most important to the user. And after that we can go into an explanation of why that is, maybe even hidden away behind a collapsed box.

Otherwise I think it is difficult for the users to tease out what is the important information

Comment thread docs/user_guide/representing/array.md Outdated
Comment thread docs/user_guide/representing/array.md Outdated
Comment on lines +448 to +449
!!! note
This whole section is specific for nested BaseDoc, DocList and DocVec.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the case then the heading for this should be different, something like "dealing with nested, optional fields" or something like that

Comment thread docs/user_guide/representing/array.md Outdated
Comment thread docs/user_guide/representing/array.md Outdated
Comment thread docs/user_guide/representing/array.md Outdated
Comment thread docs/user_guide/representing/array.md Outdated
assert docs.image == [ImageDoc(tensor=np.ones((3, 224, 224))), None]
```

but for DocVec it is a bit different. Indeed, DocVec store the data for each filed as contiguous column.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the fact that it stores it as a column to do with it not being a DocList? That column could in theory be a DocList, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well not if some document are none ?

Comment thread docs/user_guide/representing/array.md Outdated
Comment thread docs/user_guide/representing/array.md Outdated
Comment thread docs/user_guide/representing/array.md Outdated


```python
from docarray import DocVec

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use tabs for these examples?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split it a bit differently now

@samsja

samsja commented May 4, 2023

Copy link
Copy Markdown
Member Author

Sorry to be pain in the butt, a lot of suggestions :D

But honestly it seems like the entire doc section could be condensed down to a few sentences:

* Nested optionals are a bit tricky, watch out

* For doclist the return will be a `DocList` instead of `DocList[...]`

* For docvec the return will be `DocList[...]` if all documents are there, or `None` if all docs are None. No mix of docs and None allowed!

I would lead with these "rules", I think they are most important to the user. And after that we can go into an explanation of why that is, maybe even hidden away behind a collapsed box.

Otherwise I think it is difficult for the users to tease out what is the important information

sure I will follow your suggestion

samsja and others added 4 commits May 4, 2023 15:05
Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com>
Signed-off-by: samsja <55492238+samsja@users.noreply.github.com>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
@JoanFM JoanFM requested a review from JohannesMessner May 5, 2023 09:49
or extend your data.


## Dealing with Optional fields

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## Dealing with Optional fields
## Dealing with nested Optional fields

Everything below only applies to the nested case, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Comment thread docs/user_guide/representing/array.md Outdated
Comment thread docs/user_guide/representing/array.md Outdated
Comment thread docs/user_guide/representing/array.md Outdated
Comment thread docs/user_guide/representing/array.md Outdated

For nested BaseDoc
* DocList will return a list of document if the field is optional and a DocList if the field is not optional
* DocVec will return a DocList if all documents are there, or None if all docs are None. No mix of docs and None allowed!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't DocVec return a DocVec? Am i remembering correctly that we wanted to do that but haven't implemented it yet?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shit you are right ofc

Comment thread docs/user_guide/representing/array.md Outdated
Comment thread docs/user_guide/representing/array.md Outdated
Comment thread docs/user_guide/representing/array.md Outdated
samsja and others added 3 commits May 5, 2023 14:45
Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com>
Signed-off-by: samsja <55492238+samsja@users.noreply.github.com>
Signed-off-by: samsja <sami.jaghouar@hotmail.fr>
@github-actions

github-actions Bot commented May 5, 2023

Copy link
Copy Markdown

📝 Docs are deployed on https://ft-fix-docarray-optional--jina-docs.netlify.app 🎉

@samsja samsja merged commit b77df16 into main May 5, 2023
@samsja samsja deleted the fix-docarray-optional branch May 5, 2023 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Creating DocVec with the Object that has Optional Field

3 participants