diff --git a/.github/ISSUE_TEMPLATE/bug-v1-deprecated.yml b/.github/ISSUE_TEMPLATE/bug-v1-deprecated.yml
index 885c37877b9..ecedf100fc6 100644
--- a/.github/ISSUE_TEMPLATE/bug-v1-deprecated.yml
+++ b/.github/ISSUE_TEMPLATE/bug-v1-deprecated.yml
@@ -1,4 +1,4 @@
-name: 🐛 DocArray V1 Bug (0.1.0 - 0.20.1) (Deprecated Version)
+name: 🐛 DocArray <=0.21 Bug (0.1.0 - 0.20.1) (Deprecated Version)
description: Report a bug or unexpected behavior in DocArray version prior to v2 (0.21.1)
labels: [bug V1, unconfirmed]
diff --git a/README.md b/README.md
index 0d60bbfdda2..656c595a871 100644
--- a/README.md
+++ b/README.md
@@ -12,49 +12,56 @@
-> ⬆️ **DocArray v2**: This readme is for the second version of DocArray (starting at 0.30). If you want to use the older
-> version (prior to 0.30) check out the [docarray-v1-fixes](https://github.com/docarray/docarray/tree/docarray-v1-fixes) branch
+> **Note**
+> The README you're currently viewing is for DocArray >=0.30, which introduces significant changes from DocArray 0.21. If you wish to continue using the older DocArray <=0.21, install it via `pip install docarray==0.21`. Refer to its [codebase](https://github.com/docarray/docarray/tree/v0.21.0), [documentation](https://docarray.jina.ai), and [hot-fixes branch](https://github.com/docarray/docarray/tree/docarray-v1-fixes) for more information.
-DocArray is a library for **representing, sending and storing multi-modal data**, perfect for **Machine Learning applications**.
-With DocArray you can:
+DocArray is a Python library for the [representation](#represent), [transmission](#send), [storage](#store), and [retrieval](#retrieve) of multimodal data. It is tailored to the development of multimodal AI applications and designed to integrate with the wider Python and machine learning ecosystems. Since January 2022, DocArray has been distributed under the open-source [Apache License 2.0](https://github.com/docarray/docarray/blob/main/LICENSE), and it is currently a sandbox project within the [LF AI & Data Foundation](https://lfaidata.foundation/).
-1. [**Represent data**](#represent)
-2. [**Send data**](#send)
-3. [**Store data**](#store)
-DocArray handles your data while integrating seamlessly with the rest of your **Python and ML ecosystem**:
-- :fire: Native compatibility for **[NumPy](https://github.com/numpy/numpy)**, **[PyTorch](https://github.com/pytorch/pytorch)** and **[TensorFlow](https://github.com/tensorflow/tensorflow)**, including for **model training use cases**
-- :zap: Built on **[Pydantic](https://github.com/pydantic/pydantic)** and out-of-the-box compatible with **[FastAPI](https://github.com/tiangolo/fastapi/)** and **[Jina](https://github.com/jina-ai/jina/)**
-- :package: Support for vector databases like **[Weaviate](https://weaviate.io/), [Qdrant](https://qdrant.tech/), [ElasticSearch](https://www.elastic.co/de/elasticsearch/)** and **[HNSWLib](https://github.com/nmslib/hnswlib)**
-- :chains: Send data as JSON over **HTTP** or as **[Protobuf](https://protobuf.dev/)** over **[gRPC](https://grpc.io/)**
+- :fire: Offers native support for **[NumPy](https://github.com/numpy/numpy)**, **[PyTorch](https://github.com/pytorch/pytorch)**, and **[TensorFlow](https://github.com/tensorflow/tensorflow)**, catering specifically to **model training scenarios**.
+- :zap: Based on **[Pydantic](https://github.com/pydantic/pydantic)**, and instantly compatible with web and microservice frameworks like **[FastAPI](https://github.com/tiangolo/fastapi/)** and **[Jina](https://github.com/jina-ai/jina/)**.
+- :package: Provides support for vector databases such as **[Weaviate](https://weaviate.io/), [Qdrant](https://qdrant.tech/), [ElasticSearch](https://www.elastic.co/de/elasticsearch/)**, and **[HNSWLib](https://github.com/nmslib/hnswlib)**.
+- :chains: Allows data transmission as JSON over **HTTP** or as **[Protobuf](https://protobuf.dev/)** over **[gRPC](https://grpc.io/)**.
-> :bulb: **Where are you coming from?** Based on your use case and background, there are different ways to understand DocArray:
->
-> - [Coming from pure PyTorch or TensorFlow](#coming-from-pytorch)
-> - [Coming from Pydantic](#coming-from-pydantic)
-> - [Coming from FastAPI](#coming-from-fastapi)
-> - [Coming from a vector database](#coming-from-vector-database)
-> - [Coming from Langchain](#coming-from-langchain)
+## Installation
+
+To install DocArray from the CLI, run the following command:
+
+```shell
+pip install -U docarray
+```
+
+> **Note**
+> To use DocArray <=0.21, install it via `pip install docarray==0.21` and check out its [codebase](https://github.com/docarray/docarray/tree/v0.21.0), [docs](https://docarray.jina.ai), and [hot-fixes branch](https://github.com/docarray/docarray/tree/docarray-v1-fixes).
+
+## Get Started
+New to DocArray? Depending on your use case and background, there are multiple ways to learn about DocArray:
+
+- [Coming from pure PyTorch or TensorFlow](#coming-from-pytorch)
+- [Coming from Pydantic](#coming-from-pydantic)
+- [Coming from FastAPI](#coming-from-fastapi)
+- [Coming from a vector database](#coming-from-vector-database)
+- [Coming from Langchain](#coming-from-langchain)
-DocArray has been distributed under the open-source [Apache License 2.0](https://github.com/docarray/docarray/blob/main/LICENSE) since January 2022. It is currently a sandbox project under [LF AI & Data Foundation](https://lfaidata.foundation/).
## Represent
-DocArray allows you to **represent your data**, in an ML-native way.
+DocArray empowers you to **represent your data** in a manner that is inherently attuned to machine learning.
-This is useful for different use cases:
+This is particularly beneficial for various scenarios:
-- :running: You are **training a model**: There are tensors of different shapes and sizes flying around, representing different _things_, and you want to keep a straight head about them.
-- :cloud: You are **serving a model**: For example through FastAPI, and you want to specify your API endpoints.
-- :card_index_dividers: You are **parsing data**: For later use in your ML or data science applications.
+- :running: You are **training a model**: You're dealing with tensors of varying shapes and sizes, each signifying different elements. You desire a method to logically organize them.
+- :cloud: You are **serving a model**: Let's say through FastAPI, and you wish to define your API endpoints precisely.
+- :card_index_dividers: You are **parsing data**: Perhaps for future deployment in your machine learning or data science projects.
-> :bulb: **Coming from Pydantic?** You should be happy to hear
-> that DocArray is built on top of, and is fully compatible with, Pydantic!
-> Also, we have a [dedicated section](#coming-from-pydantic) just for you!
+> :bulb: **Familiar with Pydantic?** You'll be pleased to learn
+> that DocArray is not only constructed atop Pydantic but also maintains complete compatibility with it!
+> Furthermore, we have a [specific section](#coming-from-pydantic) dedicated to your needs!
+
+In essence, DocArray facilitates data representation in a way that mirrors Python dataclasses, with machine learning being an integral component:
-Put simply, DocArray lets you represent your data in a dataclass-like way, with ML as a first class citizen:
```python
from docarray import BaseDoc
@@ -256,21 +263,22 @@ assert isinstance(dl_2, DocList)
## Send
-DocArray allows you to **send your data** in an ML-native way.
+DocArray facilitates the **transmission of your data** in a manner inherently compatible with machine learning.
+
+This includes native support for **Protobuf and gRPC**, along with **HTTP** and serialization to JSON, JSONSchema, Base64, and Bytes.
-This means there is native support for **Protobuf and gRPC**, on top of **HTTP** and serialization to JSON, JSONSchema, Base64, and Bytes.
+This feature proves beneficial for several scenarios:
-This is useful for different use cases:
+- :cloud: You are **serving a model**, perhaps through frameworks like **[Jina](https://github.com/jina-ai/jina/)** or **[FastAPI](https://github.com/tiangolo/fastapi/)**
+- :spider_web: You are **distributing your model** across multiple machines and need an efficient means of transmitting your data between nodes
+- :gear: You are architecting a **microservice** environment and require a method for data transmission between microservices
-- :cloud: You are **serving a model**, for example through **[Jina](https://github.com/jina-ai/jina/)** or **[FastAPI](https://github.com/tiangolo/fastapi/)**
-- :spider_web: You are **distributing your model** across machines and need to send your data between nodes
-- :gear: You are building a **microservice** architecture and need to send your data between microservices
+> :bulb: **Are you familiar with FastAPI?** You'll be delighted to learn
+> that DocArray maintains full compatibility with FastAPI!
+> Plus, we have a [dedicated section](#coming-from-fastapi) specifically for you!
-> :bulb: **Coming from FastAPI?** You should be happy to hear
-> that DocArray is fully compatible with FastAPI!
-> Also, we have a [dedicated section](#coming-from-fastapi) just for you!
+When it comes to data transmission, serialization is a crucial step. Let's delve into how DocArray streamlines this process:
-Whenever you want to send your data, you need to serialize it, so let's take a look at how that works with DocArray:
```python
from docarray import BaseDoc
@@ -305,18 +313,14 @@ Of course, serialization is not all you need. So check out how DocArray integrat
## Store
-Once you've modelled your data, and maybe sent it around, usually you want to **store it** somewhere.
-DocArray has you covered!
+After modeling and possibly distributing your data, you'll typically want to **store it** somewhere. That's where DocArray steps in!
-**Document Stores** let you, well, store your Documents, locally or remotely, all with the same user interface:
+**Document Stores** provide a seamless way to, as the name suggests, store your Documents. Be it locally or remotely, you can do it all through the same user interface:
-- :cd: **On disk** as a file in your local file system
+- :cd: **On disk**, as a file in your local filesystem
- :bucket: On **[AWS S3](https://aws.amazon.com/de/s3/)**
- :cloud: On **[Jina AI Cloud](https://cloud.jina.ai/)**
-
- See Document Store usage
-
The Document Store interface lets you push and pull Documents to and from multiple data sources, all with the same user interface.
For example, let's see how that works with on-disk storage:
@@ -334,7 +338,8 @@ docs.push('file://simple_docs')
docs_pull = DocList[SimpleDoc].pull('file://simple_docs')
```
-
+
+## Retrieve
**Document Indexes** let you index your Documents in a **vector database** for efficient similarity-based retrieval.
@@ -346,9 +351,6 @@ This is useful for:
Currently, Document Indexes support **[Weaviate](https://weaviate.io/)**, **[Qdrant](https://qdrant.tech/)**, **[ElasticSearch](https://www.elastic.co/)**, and **[HNSWLib](https://github.com/nmslib/hnswlib)**, with more to come!
-
- See Document Index usage
-
The Document Index interface lets you index and retrieve Documents from multiple vector databases, all with the same user interface.
It supports ANN vector search, text search, filtering, and hybrid search.
@@ -391,18 +393,21 @@ query = dl[0]
results, scores = index.find(query, limit=10, search_field='embedding')
```
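
To make the idea of similarity-based retrieval concrete, here is a minimal, library-free sketch of exact nearest-neighbour search by cosine similarity. It is not DocArray's implementation and does not use its API: the `cosine_similarity` and `find` functions below are hypothetical stand-ins, and a real Document Index would delegate this work to a vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find(query, embeddings, limit=10):
    """Hypothetical brute-force search: score every embedding against the
    query and return the indices and scores of the `limit` best matches."""
    scored = [(cosine_similarity(query, emb), i) for i, emb in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    top = scored[:limit]
    return [i for _, i in top], [s for s, _ in top]


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ids, scores = find([1.0, 0.0], embeddings, limit=2)
```

This exhaustive scan costs time linear in the number of stored vectors, which is why vector databases rely on approximate (ANN) indexes for larger collections.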
-
+
+---
+
+## Learn DocArray
Depending on your background and use case, there are different ways for you to understand DocArray.
-## Coming from old DocArray
+### Coming from DocArray <=0.21
Click to expand
If you are using DocArray version 0.30.0 or lower, you will be familiar with its [dataclass API](https://docarray.jina.ai/fundamentals/dataclass/).
-_DocArray v2 is that idea, taken seriously._ Every document is created through a dataclass-like interface,
+_DocArray >=0.30 is that idea, taken seriously._ Every document is created through a dataclass-like interface,
courtesy of [Pydantic](https://pydantic-docs.helpmanual.io/usage/models/).
This gives the following advantages:
@@ -420,7 +425,7 @@ For now, Document Indexes support **[Weaviate](https://weaviate.io/)**, **[Qdran
-## Coming from Pydantic
+### Coming from Pydantic
Click to expand
@@ -497,7 +502,7 @@ except Exception as e:
-## Coming from PyTorch
+### Coming from PyTorch
Click to expand
@@ -511,7 +516,7 @@ It offers you several advantages:
- **Go directly to deployment**, by re-using your data model as a [FastAPI](https://fastapi.tiangolo.com/) or [Jina](https://github.com/jina-ai/jina) API schema
- Connect model components between **microservices**, using Protobuf and gRPC
-DocArray can be used directly inside ML models to handle and represent multi-modal data.
+DocArray can be used directly inside ML models to handle and represent multimodal data.
This allows you to reason about your data using DocArray's abstractions deep inside of `nn.Module`,
and provides a FastAPI-compatible schema that eases the transition between model training and model serving.
@@ -609,7 +614,7 @@ schema definition (see [below](#coming-from-fastapi)). Everything is handled in
-## Coming from TensorFlow
+### Coming from TensorFlow
Click to expand
@@ -657,7 +662,7 @@ class MyPodcastModel(tf.keras.Model):
-## Coming from FastAPI
+### Coming from FastAPI
Click to expand
@@ -680,6 +685,7 @@ from docarray import BaseDoc
from docarray.documents import ImageDoc
from docarray.typing import NdArray
+
class InputDoc(BaseDoc):
img: ImageDoc
text: str
@@ -692,12 +698,15 @@ class OutputDoc(BaseDoc):
app = FastAPI()
+
def model_img(img: ImageTensor) -> NdArray:
return np.zeros((100, 1))
+
def model_text(text: str) -> NdArray:
return np.zeros((100, 1))
+
@app.post("/embed/", response_model=OutputDoc, response_class=DocArrayResponse)
async def create_item(doc: InputDoc) -> OutputDoc:
doc = OutputDoc(
@@ -705,16 +714,16 @@ async def create_item(doc: InputDoc) -> OutputDoc:
)
return doc
+
async with AsyncClient(app=app, base_url="http://test") as ac:
response = await ac.post("/embed/", data=input_doc.json())
-
```
Just like a vanilla Pydantic model!
-## Coming from a vector database
+### Coming from a vector database
Click to expand
@@ -770,14 +779,14 @@ Currently, DocArray supports the following vector databases:
An integration of [OpenSearch](https://opensearch.org/) is currently in progress.
-Legacy versions of DocArray also support [Redis](https://redis.io/) and [Milvus](https://milvus.io/), but these are not yet supported in the current version.
+DocArray <=0.21 also supports [Redis](https://redis.io/) and [Milvus](https://milvus.io/), but these are not yet supported in the current version.
Of course this is only one of the things that DocArray can do, so we encourage you to check out the rest of this readme!
-## Coming from Langchain
+### Coming from Langchain
Click to expand
@@ -835,7 +844,6 @@ db = InMemoryExactNNIndex[MovieDoc](docs)
3. Finally, initialize a retriever and integrate it into your chain!
```python
-
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.retrievers import DocArrayRetriever
@@ -859,20 +867,13 @@ Both are user-friendly and are best suited to small to medium-sized datasets.
-## Installation
-
-To install DocArray from the CLI, run the following command:
-
-```shell
-pip install -U docarray
-```
## See also
- [Documentation](https://docs.docarray.org)
+- [DocArray<=0.21 documentation](https://docarray.jina.ai/)
- [Join our Discord server](https://discord.gg/WaMp6PVPgR)
- [Donation to Linux Foundation AI&Data blog post](https://jina.ai/news/donate-docarray-lf-for-inclusive-standard-multimodal-data-model/)
-- ["Legacy" DocArray github page](https://github.com/docarray/docarray/tree/docarray-v1-fixes)
-- ["Legacy" DocArray documentation](https://docarray.jina.ai/)
+
> DocArray is a trademark of LF AI Projects, LLC
diff --git a/docarray/array/doc_list/io.py b/docarray/array/doc_list/io.py
index c2b531c2550..9667c673c09 100644
--- a/docarray/array/doc_list/io.py
+++ b/docarray/array/doc_list/io.py
@@ -555,7 +555,7 @@ def _stream_header(self) -> bytes:
# Binary format for streaming case
# V2 DocList streaming serialization format
- # | 1 byte | 8 bytes | 4 bytes | variable(docarray v2) | 4 bytes | variable(docarray v2) ...
+ # | 1 byte | 8 bytes | 4 bytes | variable(DocArray >=0.30) | 4 bytes | variable(DocArray >=0.30) ...
# 1 byte (uint8)
version_byte = b'\x02'
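
The layout in the comment above can be illustrated with a small standalone sketch. It assumes that the 8-byte field holds the number of documents, that each 4-byte field holds the length of the payload that follows it, and that integers are big-endian; none of this is confirmed by the hunk itself. The `pack_stream` and `unpack_stream` helpers are hypothetical and are not part of DocArray.

```python
import struct

VERSION = 2  # mirrors version_byte = b'\x02'
HEADER = ">BQ"  # assumed: uint8 version, uint64 document count, big-endian
LENGTH = ">I"  # assumed: uint32 length prefix before each payload


def pack_stream(payloads):
    """Hypothetical encoder: header, then (length, bytes) for every payload."""
    header = struct.pack(HEADER, VERSION, len(payloads))
    body = b"".join(struct.pack(LENGTH, len(p)) + p for p in payloads)
    return header + body


def unpack_stream(blob):
    """Hypothetical decoder: returns (version, list of payload bytes)."""
    version, count = struct.unpack_from(HEADER, blob, 0)
    offset = struct.calcsize(HEADER)  # 9 bytes, no padding with '>'
    payloads = []
    for _ in range(count):
        (size,) = struct.unpack_from(LENGTH, blob, offset)
        offset += struct.calcsize(LENGTH)
        payloads.append(blob[offset:offset + size])
        offset += size
    return version, payloads


blob = pack_stream([b"a", b"bcd"])
decoded = unpack_stream(blob)
```

Length-prefixing each payload lets a reader consume documents one at a time without loading the whole stream, which is the point of a streaming format.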
diff --git a/docarray/documents/legacy/legacy_document.py b/docarray/documents/legacy/legacy_document.py
index eea42f1d93e..74a105fbcfe 100644
--- a/docarray/documents/legacy/legacy_document.py
+++ b/docarray/documents/legacy/legacy_document.py
@@ -8,10 +8,10 @@
class LegacyDocument(BaseDoc):
"""
- This Document is the LegacyDocument. It follows the same schema as in DocArray v1.
+ This Document is the LegacyDocument. It follows the same schema as in DocArray <=0.21.
It can be useful to start migrating a codebase from v1 to v2.
- Nevertheless, the API is not totally compatible with DocArray v1 `Document`.
+ Nevertheless, the API is not totally compatible with DocArray <=0.21 `Document`.
Indeed, none of the method associated with `Document` are present. Only the schema
of the data is similar.
diff --git a/docs/assets/docarray-colorful.svg b/docs/assets/docarray-colorful.svg
new file mode 100644
index 00000000000..ed803d09d56
--- /dev/null
+++ b/docs/assets/docarray-colorful.svg
@@ -0,0 +1,16 @@
+
+
\ No newline at end of file
diff --git a/docs/assets/docarray-dark.svg b/docs/assets/docarray-dark.svg
index 7bb9d21c90e..e8c43ac48d4 100644
--- a/docs/assets/docarray-dark.svg
+++ b/docs/assets/docarray-dark.svg
@@ -2,7 +2,7 @@