-
+
The data structure for multimodal data
+### Compose nested Documents
-
-
- |
-
+Of course, you can compose Documents into a nested structure:
 ```python
-from docarray import dataclass, Document
-from docarray.typing import Image, Text, JSON
+from docarray import BaseDoc
+from docarray.documents import ImageDoc, TextDoc
+import numpy as np
 
-@dataclass
-class WPArticle:
-    banner: Image
-    headline: Text
-    meta: JSON
+class MultiModalDocument(BaseDoc):
+    image_doc: ImageDoc
+    text_doc: TextDoc
 
-a = WPArticle(
-    banner='https://.../cat-dog-flight.png',
-    headline='Everything to know about flying with pets, ...',
-    meta={
-        'author': 'Nathan Diller',
-        'Column': 'By the Way - A Post Travel Destination',
-    },
+doc = MultiModalDocument(
+    image_doc=ImageDoc(tensor=np.zeros((3, 224, 224))), text_doc=TextDoc(text='hi!')
 )
-
-d = Document(a)
 ```
-
-
-| left/00018.jpg | right/00018.jpg | left/00131.jpg | right/00131.jpg |
-|---|---|---|---|
-| (image) | (image) | (image) | (image) |
-
-| Pull from Cloud | Download, unzip, load from local |
-|---|---|
+It supports ANN vector search, text search, filtering, and hybrid search.
 ```python
-right_da = (
-    DocumentArray.pull('jina-ai/demo-rightda', show_progress=True)
-    .apply(preproc)
-    .embed(model, device='cuda')[:1000]
+from docarray import DocList, BaseDoc
+from docarray.index import HnswDocumentIndex
+import numpy as np
+
+from docarray.typing import ImageUrl, ImageTensor, NdArray
+
+
+class ImageDoc(BaseDoc):
+    url: ImageUrl
+    tensor: ImageTensor
+    embedding: NdArray[128]
+
+
+# create some data
+dl = DocList[ImageDoc](
+    [
+        ImageDoc(
+            url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg",
+            tensor=np.zeros((3, 224, 224)),
+            embedding=np.random.random((128,)),
+        )
+        for _ in range(100)
+    ]
 )
+
+# create a Document Index
+index = HnswDocumentIndex[ImageDoc](work_dir='/tmp/test_index')
+
+
+# index your data
+index.index(dl)
+
+# find similar Documents
+query = dl[0]
+results, scores = index.find(query, limit=10, search_field='embedding')
 ```
-
-
-
+
+---
+
+## Learn DocArray
+
+Depending on your background and use case, there are different ways for you to understand DocArray.
+
+### Coming from DocArray <=0.21
+
+If you are using a DocArray version below 0.30.0, you will be familiar with its [dataclass API](https://docarray.jina.ai/fundamentals/dataclass/).
+
+_DocArray >=0.30 is that idea, taken seriously._ Every document is created through a dataclass-like interface,
+courtesy of [Pydantic](https://pydantic-docs.helpmanual.io/usage/models/).
+
+This gives the following advantages:
+- **Flexibility:** No need to conform to a fixed set of fields -- your data defines the schema
+- **Multimodality:** At their core, documents are just dictionaries. This makes it easy to create and send them from any language, not just Python.
+
+You may also be familiar with our old Document Stores for vector DB integration.
+They are now called **Document Indexes** and offer the following improvements (see [here](#store) for the new API):
+
+- **Hybrid search:** You can now combine vector search with text search, and even filter by arbitrary fields
+- **Production-ready:** The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain
+- **Increased flexibility:** We strive to support any configuration or setting that you could perform through the DB's first-party client
+
+For now, Document Indexes support **[Weaviate](https://weaviate.io/)**, **[Qdrant](https://qdrant.tech/)**, **[ElasticSearch](https://www.elastic.co/)**, **[Redis](https://redis.io/)**, **[Mongo Atlas](https://www.mongodb.com/)**, Exact Nearest Neighbour search and **[HNSWLib](https://github.com/nmslib/hnswlib)**, with more to come.
+
+### Coming from Pydantic
+
+If you come from Pydantic, you can see DocArray documents as juiced-up Pydantic models, and DocArray as a collection of goodies around them.
+
+More specifically, we set out to **make Pydantic fit for the ML world** - not by replacing it, but by building on top of it!
+
+This means you get the following benefits:
+
+- **ML-focused types**: Tensor, TorchTensor, Embedding, ..., including **tensor shape validation**
+- Full compatibility with **FastAPI**
+- **DocList** and **DocVec** generalize the idea of a model to a _sequence_ or _batch_ of models. Perfect for **use in ML models** and other batch processing tasks.
+- **Types that are alive**: `ImageUrl` can `.load()` a URL into an image tensor, `TextUrl` can load and tokenize text documents, etc.
+- Cloud-ready: Serialization to **Protobuf** for use with microservices and **gRPC**
+- **Pre-built multimodal documents** for different data modalities: Image, Text, 3DMesh, Video, Audio and more. Note that all of these are valid Pydantic models!
+- **Document Stores** and **Document Indexes** let you store your data and retrieve it using **vector search**
+
+The most obvious advantage here is **first-class support for ML-centric data**, such as `{Torch, TF, ...}Tensor`, `Embedding`, etc.
+
+This includes handy features such as validating the shape of a tensor:
 ```python
-right_da = (
-    DocumentArray.from_files('right/*.jpg')[:1000]
-    .apply(preproc)
-    .embed(model, device='cuda')
-)
+from docarray import BaseDoc
+from docarray.typing import TorchTensor
+import torch
+
+
+class MyDoc(BaseDoc):
+    tensor: TorchTensor[3, 224, 224]
+
+
+doc = MyDoc(tensor=torch.zeros(3, 224, 224))  # works
+doc = MyDoc(tensor=torch.zeros(224, 224, 3))  # works by reshaping
+
+try:
+    doc = MyDoc(tensor=torch.zeros(224))  # fails validation
+except Exception as e:
+    print(e)
+    # tensor
+    # Cannot reshape tensor of shape (224,) to shape (3, 224, 224) (type=value_error)
+
+
+class Image(BaseDoc):
+    tensor: TorchTensor[3, 'x', 'x']
+
+
+Image(tensor=torch.zeros(3, 224, 224))  # works
+
+try:
+    Image(
+        tensor=torch.zeros(3, 64, 128)
+    )  # fails validation because second dimension does not match third
+except Exception as e:
+    print(e)
+
+
+try:
+    Image(
+        tensor=torch.zeros(4, 224, 224)
+    )  # fails validation because of the first dimension
+except Exception as e:
+    print(e)
+    # Tensor shape mismatch. Expected (3, 'x', 'x'), got (4, 224, 224) (type=value_error)
+
+try:
+    Image(
+        tensor=torch.zeros(3, 64)
+    )  # fails validation because it does not have enough dimensions
+except Exception as e:
+    print(e)
+    # Tensor shape mismatch. Expected (3, 'x', 'x'), got (3, 64) (type=value_error)
 ```
-
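The named-dimension rule used by `TorchTensor[3, 'x', 'x']` in the example above can be pictured in a few lines of plain Python. This is only an illustrative sketch and not DocArray's implementation: `shape_matches` is a hypothetical helper introduced here, it uses only the standard library, and it does not attempt the reshaping fallback that the `TorchTensor[3, 224, 224]` case relies on.

```python
from typing import Sequence, Union

# A spec entry is either a fixed size (int) or a dimension name (str).
Dim = Union[int, str]


def shape_matches(spec: Sequence[Dim], shape: Sequence[int]) -> bool:
    """Return True if `shape` satisfies `spec` (hypothetical helper).

    Integer entries must match exactly. A string entry names a dimension
    that may take any size, but every occurrence of the same name must
    resolve to the same size.
    """
    if len(spec) != len(shape):
        return False  # wrong number of dimensions

    bound = {}  # maps dimension name -> size seen first
    for expected, actual in zip(spec, shape):
        if isinstance(expected, int):
            if expected != actual:
                return False  # fixed dimension mismatch
        elif bound.setdefault(expected, actual) != actual:
            return False  # same name bound to two different sizes
    return True
```

Under these assumptions, `shape_matches((3, 'x', 'x'), (3, 224, 224))` is accepted, while `(3, 64, 128)`, `(4, 224, 224)` and `(3, 64)` are rejected for the same reasons given in the comments of the DocArray example.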
-
-{%- endblock %}
\ No newline at end of file
diff --git a/docs/_templates/sidebar/brand.html b/docs/_templates/sidebar/brand.html
deleted file mode 100644
index 4e9d09f841a..00000000000
--- a/docs/_templates/sidebar/brand.html
+++ /dev/null
@@ -1,48 +0,0 @@
-
- {% block brand_content %}
- {%- if logo_url %}
-
- {%- endif %}
- {%- if theme_light_logo and theme_dark_logo %}
-
- {%- endif %}
- {% if not theme_sidebar_hide_name %}
-
- {%- endif %}
- {% endblock brand_content %}
-
-
-
-