
0.36 Nested Collections not stored in a separate Sub Index #1721

@dwlmt

Initial Checks

  • I have read and followed the docs and still think this is a bug

Description

I have the document structure shown below, defined as per the docs. Before version 0.36, the nested "paths" field was stored in a separate Qdrant collection: with a parent collection named "channel_category", the paths were stored in "channel_category__paths". With 0.36, the nested path documents and their vectors are stored in the parent collection "channel_category" as separate records. Is this an intended change, or a bug in how nested data is stored?

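from typing import Optional

from docarray import BaseDoc, DocList
from docarray.typing import AnyTensor
from pydantic import Field

# similarity_space and dim_size are defined elsewhere; their values are not shown in this report.

class MetaPathDoc(BaseDoc):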
    path_id: str
    level: int
    text: str
    embedding: Optional[AnyTensor] = Field(
        space=similarity_space, dim=dim_size)

class MetaCategoryDoc(BaseDoc):
    node_id: Optional[str]
    node_name: Optional[str]
    name: Optional[str]
    product_type_definitions: Optional[str]
    leaf: bool
    paths: Optional[DocList[MetaPathDoc]]
    embedding: Optional[AnyTensor] = Field(
        space=similarity_space, dim=dim_size)
    channel: str
    lang: str
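
To make the difference between the two layouts concrete, here is a minimal sketch of the check I would expect to pass. It assumes only the `parent__field` naming described above. The helper names are hypothetical, and the two collection lists are illustrative values, not output captured from a Qdrant instance. In practice the names could be read from the server, for example with `QdrantClient.get_collections()`.

```python
from typing import Iterable, List


def expected_subindex_names(parent: str, nested_fields: Iterable[str]) -> List[str]:
    """Names the pre-0.36 layout would use for nested DocList fields (assumed convention)."""
    return [f"{parent}__{field}" for field in nested_fields]


def missing_subindexes(
    existing: Iterable[str], parent: str, nested_fields: Iterable[str]
) -> List[str]:
    """Return the expected sub-index collections that are not in `existing`."""
    existing_set = set(existing)
    return [
        name
        for name in expected_subindex_names(parent, nested_fields)
        if name not in existing_set
    ]


# Illustrative collection listings (hypothetical, not real server output).
collections_before_036 = ["channel_category", "channel_category__paths"]
collections_with_036 = ["channel_category"]

print(missing_subindexes(collections_before_036, "channel_category", ["paths"]))
# -> []
print(missing_subindexes(collections_with_036, "channel_category", ["paths"]))
# -> ['channel_category__paths']
```

An empty result corresponds to the old behaviour, and a non-empty result corresponds to what I am seeing with 0.36.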

Example Code

I'm loading documents into Qdrant via a Jina Executor like this:

import os
import sys

import more_itertools
from docarray import DocList
from docarray.index import QdrantDocumentIndex
from utils.docs import MetaCategoryDoc
from jina import Executor, requests
from jina.logging.logger import JinaLogger
from qdrant_client.http import models

QDRANT_LOCATION = os.getenv('QDRANT_LOCATION', "http://localhost:6333")
QDRANT_API_KEY = os.getenv('QDRANT_API_KEY', None)

class MetaChannelCategoryIndexingExec(Executor):
    def __init__(self,
                 collection_name: str = "channel_category",
                 batch_size: int = 64,
                 qdrant_location: str = QDRANT_LOCATION,
                 qdrant_api_key: str = QDRANT_API_KEY,
                 *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.logger = JinaLogger('meta_channel_category_indexing')


        db_config = QdrantDocumentIndex.DBConfig(
            location=qdrant_location,
            api_key=qdrant_api_key,
            collection_name=collection_name,
            quantization_config=models.ScalarQuantization(
                scalar=models.ScalarQuantizationConfig(
                    type=models.ScalarType.INT8,
                    quantile=0.99,
                    always_ram=False,
                )
            ),
            optimizers_config=models.OptimizersConfigDiff(
                memmap_threshold=20000, indexing_threshold=20000),
            on_disk_payload=True,
            hnsw_config=models.HnswConfigDiff(m=16,ef_construct=100,on_disk=True),
            wal_config=models.WalConfigDiff(
                wal_capacity_mb=64, wal_segments_ahead=1),
            prefer_grpc=False)

        self.doc_index = QdrantDocumentIndex[MetaCategoryDoc](db_config)
        self.batch_size = batch_size

    @requests(
        request_schema=DocList[MetaCategoryDoc],
        response_schema=DocList[MetaCategoryDoc]
    )
    def index_metadata(self, docs, **kwargs):
        """Index the incoming documents into the vector DB in batches."""
        for doc_batch in more_itertools.chunked(docs, self.batch_size):
            # Indexing the documents
            self.doc_index.index(
                doc_batch
            )

Python, Pydantic & OS Version

0.36.0

Affected Components
