From b322422d1e5d0c38381ce4772ebd15e7713aa4c2 Mon Sep 17 00:00:00 2001 From: jvincent-mongodb Date: Wed, 29 Apr 2026 10:23:09 -0700 Subject: [PATCH 1/9] docs: Add MongoDB docs for data-source and offline-store integration Signed-off-by: jvincent-mongodb --- docs/SUMMARY.md | 2 + docs/reference/data-sources/README.md | 4 + docs/reference/data-sources/mongodb.md | 32 +++++++ docs/reference/offline-stores/README.md | 4 + docs/reference/offline-stores/mongodb.md | 102 +++++++++++++++++++++++ 5 files changed, 144 insertions(+) create mode 100644 docs/reference/data-sources/mongodb.md create mode 100644 docs/reference/offline-stores/mongodb.md diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index b29a0ac9ce8..44c1cc09477 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -113,6 +113,7 @@ * [Athena (contrib)](reference/data-sources/athena.md) * [Clickhouse (contrib)](reference/data-sources/clickhouse.md) * [Ray (contrib)](reference/data-sources/ray.md) + * [MongoDB (contrib)](reference/data-sources/mongodb.md) * [Offline stores](reference/offline-stores/README.md) * [Overview](reference/offline-stores/overview.md) * [Dask](reference/offline-stores/dask.md) @@ -129,6 +130,7 @@ * [Ray (contrib)](reference/offline-stores/ray.md) * [Oracle (contrib)](reference/offline-stores/oracle.md) * [Athena (contrib)](reference/offline-stores/athena.md) + * [MongoDB (contrib)](reference/offline-stores/mongodb.md) * [Remote Offline](reference/offline-stores/remote-offline-store.md) * [Hybrid](reference/offline-stores/hybrid.md) * [Online stores](reference/online-stores/README.md) diff --git a/docs/reference/data-sources/README.md b/docs/reference/data-sources/README.md index e25a7f6e8ae..24bf18dbe86 100644 --- a/docs/reference/data-sources/README.md +++ b/docs/reference/data-sources/README.md @@ -69,3 +69,7 @@ Please see [Data Source](../../getting-started/concepts/data-ingestion.md) for a {% content-ref url="ray.md" %} [ray.md](ray.md) {% endcontent-ref %} + +{% content-ref 
url="mongodb.md" %} +[mongodb.md](mongodb.md) +{% endcontent-ref %} diff --git a/docs/reference/data-sources/mongodb.md b/docs/reference/data-sources/mongodb.md new file mode 100644 index 00000000000..f2061389f7e --- /dev/null +++ b/docs/reference/data-sources/mongodb.md @@ -0,0 +1,32 @@ +# MongoDB source (contrib) + +## Description + +MongoDB data sources are [MongoDB](https://www.mongodb.com/) collections that can be used as a source for feature data. The `MongoDBSource` points at a MongoDB collection and provides the metadata Feast needs to read historical features from the offline store's `feature_history` collection. + +## Examples + +Defining a MongoDB source: + +```python +from feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb_source import ( + MongoDBSource, +) + +driver_stats_source = MongoDBSource( + name="driver_stats", + timestamp_field="event_timestamp", + created_timestamp_column="created_at", +) +``` + +The `name` field becomes the `feature_view` discriminator stored in every document in the `feature_history` collection. + +Configuration options such as `connection_string`, `database`, and `collection` are inherited from the offline store configuration in `feature_store.yaml`. + +The full set of configuration options is available [here](https://rtd.feast.dev/en/master/#feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb_source.MongoDBSource). + +## Supported Types + +MongoDB data sources support all eight primitive types (`bytes`, `string`, `int32`, `int64`, `float32`, `float64`, `bool`, `timestamp`) and their corresponding array types. Complex types such as `Map` and `Struct` are preserved through the MongoDB document model. +For a comparison against other batch data sources, please see [here](overview.md#functionality-matrix). 
diff --git a/docs/reference/offline-stores/README.md b/docs/reference/offline-stores/README.md index 5f4e146326a..1c0d24c8d07 100644 --- a/docs/reference/offline-stores/README.md +++ b/docs/reference/offline-stores/README.md @@ -62,6 +62,10 @@ Please see [Offline Store](../../getting-started/components/offline-store.md) fo [clickhouse.md](clickhouse.md) {% endcontent-ref %} +{% content-ref url="mongodb.md" %} +[mongodb.md](mongodb.md) +{% endcontent-ref %} + {% content-ref url="remote-offline-store.md" %} [remote-offline-store.md](remote-offline-store.md) {% endcontent-ref %} diff --git a/docs/reference/offline-stores/mongodb.md b/docs/reference/offline-stores/mongodb.md new file mode 100644 index 00000000000..a70f7d31387 --- /dev/null +++ b/docs/reference/offline-stores/mongodb.md @@ -0,0 +1,102 @@ +# MongoDB offline store (contrib) + +## Description + +The MongoDB offline store provides support for reading [MongoDBSource](../data-sources/mongodb.md). +* Entity dataframes can be provided as a Pandas dataframe. The offline store converts entity identifiers into serialized entity keys for efficient lookup against the `feature_history` collection. +* Uses a single shared `feature_history` collection with a compound index for all FeatureViews, distinguished by a `feature_view` discriminator field. + +## Getting started + +In order to use this offline store, you'll need to run `pip install 'feast[mongodb]'`. 
+ +## Example + +{% code title="feature_store.yaml" %} +```yaml +project: my_project +registry: data/registry.db +provider: local +offline_store: + type: feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb.MongoDBOfflineStore + connection_string: "mongodb+srv://user:pass@cluster.mongodb.net" + database: feast + collection: feature_history +online_store: + type: mongodb + connection_string: "mongodb+srv://user:pass@cluster.mongodb.net" + database_name: feast_online_store +``` +{% endcode %} + +The full set of configuration options is available in [MongoDBOfflineStoreConfig](https://rtd.feast.dev/en/master/#feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb.MongoDBOfflineStoreConfig). + +## Data Model + +The offline store uses a single shared collection (by default `feature_history`) that stores append-only historical feature rows for all feature views. Each document represents one observation of one entity for one FeatureView at a specific event timestamp: + +```json +{ + "entity_id": "Binary(...)", + "feature_view": "driver_stats", + "event_timestamp": "ISODate(2024-01-15T12:00:00Z)", + "created_at": "ISODate(2024-01-15T12:01:00Z)", + "features": { + "conv_rate": 0.72, + "acc_rate": 0.91, + "avg_daily_trips": 14 + } +} +``` + +Key properties: + +* **Append-only**: Historical data is treated as immutable; corrections are written as new rows with newer `created_at` timestamps rather than in-place updates. +* **Time-series friendly**: `event_timestamp` represents when the feature value was observed; `created_at` is used as a tie-breaker when multiple observations share the same event timestamp. +* **Feature grouping by FeatureView**: `feature_view` identifies which FeatureView the row belongs to, so a single collection can host multiple FVs. 
+ +A single compound index supports all major query patterns: + +``` +(entity_id ASC, feature_view ASC, event_timestamp DESC, created_at DESC) +``` + +This index enables efficient range scans over entities and feature views, while ensuring that the most recent observation per `(entity_id, feature_view)` is seen first during aggregation. The index is created lazily on first use and cached per connection string. + +## Key Optimizations + +* **K-collapse**: Multiple FeatureViews that share the same join keys are queried in a single aggregation using `feature_view: {$in: [...]}`, reducing round trips. +* **Scoring vs. training paths**: When `entity_df` has unique entity IDs (scoring), server-side `$group $first` is used for efficient retrieval. When entity IDs repeat (training), `pd.merge_asof` provides correct point-in-time joins. +* **Two-level chunking**: Outer `CHUNK_SIZE` (50,000 rows) limits the entity DataFrame slice; inner `MONGO_BATCH_SIZE` (10,000 entity IDs) limits `$in` array size per aggregation call. + +## Functionality Matrix + +The set of functionality supported by offline stores is described in detail [here](overview.md#functionality). +Below is a matrix indicating which functionality is supported by the MongoDB offline store. + +| | MongoDB | +| :----------------------------------------------------------------- | :------ | +| `get_historical_features` (point-in-time correct join) | yes | +| `pull_latest_from_table_or_query` (retrieve latest feature values) | yes | +| `pull_all_from_table_or_query` (retrieve a saved dataset) | yes | +| `offline_write_batch` (persist dataframes to offline store) | yes | +| `write_logged_features` (persist logged features to offline store) | no | + +Below is a matrix indicating which functionality is supported by `MongoDBRetrievalJob`. 
+ +| | MongoDB | +| ----------------------------------------------------- | ------- | +| export to dataframe | yes | +| export to arrow table | yes | +| export to arrow batches | no | +| export to SQL | no | +| export to data lake (S3, GCS, etc.) | no | +| export to data warehouse | no | +| export as Spark dataframe | no | +| local execution of Python-based on-demand transforms | yes | +| remote execution of Python-based on-demand transforms | no | +| persist results in the offline store | yes | +| preview the query plan before execution | no | +| read partitioned data | no | + +To compare this set of functionality against other offline stores, please see the full [functionality matrix](overview.md#functionality-matrix). From 506d235fa44b4cc5a151ee89b8b5ebe20bd91701 Mon Sep 17 00:00:00 2001 From: jvincent-mongodb Date: Wed, 29 Apr 2026 15:00:01 -0700 Subject: [PATCH 2/9] docs: various copy revisions Signed-off-by: jvincent-mongodb --- docs/reference/offline-stores/mongodb.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/reference/offline-stores/mongodb.md b/docs/reference/offline-stores/mongodb.md index a70f7d31387..53bdcfaa7bf 100644 --- a/docs/reference/offline-stores/mongodb.md +++ b/docs/reference/offline-stores/mongodb.md @@ -3,8 +3,8 @@ ## Description The MongoDB offline store provides support for reading [MongoDBSource](../data-sources/mongodb.md). -* Entity dataframes can be provided as a Pandas dataframe. The offline store converts entity identifiers into serialized entity keys for efficient lookup against the `feature_history` collection. * Uses a single shared `feature_history` collection with a compound index for all FeatureViews, distinguished by a `feature_view` discriminator field. +* Entity dataframes can be provided as a Pandas dataframe. The offline store converts entity identifiers into serialized entity keys for efficient lookup against the `feature_history` collection. 
## Getting started @@ -26,6 +26,8 @@ online_store: type: mongodb connection_string: "mongodb+srv://user:pass@cluster.mongodb.net" database_name: feast_online_store + collection_suffix: latest + client_kwargs: {} ``` {% endcode %} @@ -66,8 +68,8 @@ This index enables efficient range scans over entities and feature views, while ## Key Optimizations * **K-collapse**: Multiple FeatureViews that share the same join keys are queried in a single aggregation using `feature_view: {$in: [...]}`, reducing round trips. -* **Scoring vs. training paths**: When `entity_df` has unique entity IDs (scoring), server-side `$group $first` is used for efficient retrieval. When entity IDs repeat (training), `pd.merge_asof` provides correct point-in-time joins. -* **Two-level chunking**: Outer `CHUNK_SIZE` (50,000 rows) limits the entity DataFrame slice; inner `MONGO_BATCH_SIZE` (10,000 entity IDs) limits `$in` array size per aggregation call. +* **Scoring vs. training paths**: When each entity appears only once in `entity_df` (scoring/inference — one feature lookup per entity), server-side `$group $first` efficiently returns the single latest value per entity. When the same entity appears at multiple timestamps (training — building a dataset with many historical snapshots per entity), the store retrieves all candidate rows and uses `pd.merge_asof` to select the correct point-in-time value for each request timestamp. +* **Two-level chunking**: `CHUNK_SIZE` (50,000 rows) controls the size of intermediate DataFrames in memory; `MONGO_BATCH_SIZE` (10,000 entity IDs) limits the query size sent to MongoDB. 
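The training path and the inner batching described in the bullets above can be sketched in plain Python. This is a minimal model under stated assumptions, not the store's code: `as_of_join` and `batched` are hypothetical names, timestamps are plain integers for brevity, and the join mirrors the default backward direction of `pd.merge_asof` (latest observation at or before the request time).

```python
from bisect import bisect_right
from collections import defaultdict

MONGO_BATCH_SIZE = 10_000  # value quoted in the bullet above


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def as_of_join(requests, observations):
    """Backward as-of join in plain Python.

    requests:     list of (entity_id, request_ts)
    observations: list of (entity_id, event_ts, value)
    Returns one value (or None) per request: the latest observation whose
    event_ts is <= request_ts.
    """
    history = defaultdict(list)
    for entity_id, event_ts, value in observations:
        history[entity_id].append((event_ts, value))
    for rows in history.values():
        rows.sort(key=lambda row: row[0])

    results = []
    for entity_id, request_ts in requests:
        rows = history.get(entity_id, [])
        timestamps = [ts for ts, _ in rows]
        idx = bisect_right(timestamps, request_ts)
        results.append(rows[idx - 1][1] if idx else None)
    return results


observations = [("d1", 10, 0.5), ("d1", 20, 0.6), ("d2", 15, 0.9)]
# Training-style request: the same entity appears at several timestamps.
requests = [("d1", 12), ("d1", 25), ("d2", 14)]
print(as_of_join(requests, observations))  # [0.5, 0.6, None]

# Inner batching: each slice would become one `$in` list in a query.
entity_ids = [f"d{i}" for i in range(25_000)]
print([len(batch) for batch in batched(entity_ids, MONGO_BATCH_SIZE)])  # [10000, 10000, 5000]
```

In the scoring path each entity appears once, so a single "latest row per entity" lookup is enough and the as-of join is unnecessary; the join only matters when one entity is requested at several timestamps.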
## Functionality Matrix From e1ab8d8bbada0998cbe647609d2b851ad3bff6fd Mon Sep 17 00:00:00 2001 From: jvincent-mongodb Date: Thu, 30 Apr 2026 13:02:53 -0700 Subject: [PATCH 3/9] docs: Update MongoDB data-source documentation with Vector Search section Signed-off-by: jvincent-mongodb --- docs/reference/data-sources/mongodb.md | 79 ++++++++++++++++++++++++++ 1 file changed, 79 insertions(+) diff --git a/docs/reference/data-sources/mongodb.md b/docs/reference/data-sources/mongodb.md index f2061389f7e..0c6e62d42ab 100644 --- a/docs/reference/data-sources/mongodb.md +++ b/docs/reference/data-sources/mongodb.md @@ -26,6 +26,85 @@ Configuration options such as `connection_string`, `database`, and `collection` The full set of configuration options is available [here](https://rtd.feast.dev/en/master/#feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb_source.MongoDBSource). +## Vector Search + +The MongoDB online store supports [Atlas Vector Search](https://www.mongodb.com/docs/atlas/atlas-vector-search/), enabling similarity search over feature embeddings stored in MongoDB Atlas. This is powered by the `$vectorSearch` aggregation stage and requires MongoDB Atlas (or the `mongodb/mongodb-atlas-local` Docker image for local development). + +See [PR #6344](https://github.com/feast-dev/feast/pull/6344) for full implementation details. 
+ +### Configuration + +Enable vector search in your `feature_store.yaml`: + +```yaml +project: my_project +provider: local +online_store: + type: mongodb + connection_string: mongodb+srv://:@cluster.mongodb.net + vector_enabled: true + similarity: cosine # cosine | euclidean | dotProduct + vector_index_wait_timeout: 60 # seconds to wait for index to become queryable + vector_index_wait_poll_interval: 1.0 # seconds between polls +``` + +### Defining a Feature View with Vector Index + +Mark embedding fields with `vector_index=True` and specify `vector_length`: + +```python +from feast import Entity, FeatureView, Field, FileSource +from feast.types import Array, Float32, Int64, String +from datetime import timedelta + +item_embeddings = FeatureView( + name="item_embeddings", + entities=[Entity(name="item_id", join_keys=["item_id"])], + schema=[ + Field( + name="embedding", + dtype=Array(Float32), + vector_index=True, + vector_length=384, + vector_search_metric="cosine", + ), + Field(name="title", dtype=String), + Field(name="item_id", dtype=Int64), + ], + source=FileSource(path="items.parquet", timestamp_field="event_timestamp"), + ttl=timedelta(hours=24), +) +``` + +When `feast apply` (or `store.update()`) runs with `vector_enabled=True`, Atlas vector search indexes are automatically created for any field with `vector_index=True`. Indexes are also automatically dropped when feature views are removed. + +### Retrieving Documents via Vector Search + +Use `retrieve_online_documents_v2()` to perform similarity search: + +```python +results = store.retrieve_online_documents_v2( + config=repo_config, + table=item_embeddings, + requested_features=["embedding", "title"], + embedding=[0.1, 0.2, ...], # query vector + top_k=5, +) + +# Each result is a (event_timestamp, entity_key_proto, feature_dict) tuple. +# feature_dict includes a synthetic "distance" key with the vector search score. 
+for ts, entity_key, features in results: + print(features["title"].string_val, features["distance"].float_val) +``` + +### How It Works + +- **Index creation**: `update()` creates an Atlas vector search index named `____vs_index` for each vector-indexed field. It waits for the index to reach `READY` status before proceeding. +- **Query execution**: `retrieve_online_documents_v2()` builds a `$vectorSearch` aggregation pipeline with `numCandidates = max(top_k * 10, 100)` and the specified `limit`. +- **Score**: Results include a `distance` field populated from `$meta: "vectorSearchScore"`. +- **BSON compatibility**: Query vectors are coerced to native Python floats to avoid numpy serialization issues. +- **Idempotency**: Calling `update()` multiple times will not duplicate indexes. + ## Supported Types MongoDB data sources support all eight primitive types (`bytes`, `string`, `int32`, `int64`, `float32`, `float64`, `bool`, `timestamp`) and their corresponding array types. Complex types such as `Map` and `Struct` are preserved through the MongoDB document model. 
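The query construction described under "How It Works" can be approximated as follows. This is a sketch under assumptions: `build_vector_search_pipeline` is a hypothetical helper, the index name and field path are passed in by the caller rather than derived from the feature view, and the `$project` stage is one plausible way to surface the score; the real store's stage layout and document field paths may differ.

```python
def build_vector_search_pipeline(index_name, path, query_vector, top_k):
    """Build an aggregation pipeline for Atlas `$vectorSearch`.

    Hypothetical helper: `index_name` and `path` are supplied by the caller.
    """
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                # Coerce to built-in floats so BSON encoding never sees numpy scalars.
                "queryVector": [float(x) for x in query_vector],
                # Candidate pool size quoted in the list above.
                "numCandidates": max(top_k * 10, 100),
                "limit": top_k,
            }
        },
        # Assumed projection: keep the vector field and expose the score as "distance".
        {
            "$project": {
                "_id": 0,
                path: 1,
                "distance": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline(
    index_name="example_vs_index",  # placeholder name, not the store's naming scheme
    path="embedding",
    query_vector=[0.1, 0.2, 0.3],
    top_k=5,
)
print(pipeline[0]["$vectorSearch"]["numCandidates"])  # 100
```

With `top_k=5` the floor of 100 candidates applies; the `top_k * 10` term only takes over once `top_k` exceeds 10.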
From 03aaa782320622d7e2f507815513c951b086c0dd Mon Sep 17 00:00:00 2001 From: jvincent-mongodb Date: Thu, 30 Apr 2026 13:22:41 -0700 Subject: [PATCH 4/9] docs: fix example import path for MongoDB data-source Signed-off-by: jvincent-mongodb --- docs/reference/data-sources/mongodb.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/reference/data-sources/mongodb.md b/docs/reference/data-sources/mongodb.md index 0c6e62d42ab..ad820b5e394 100644 --- a/docs/reference/data-sources/mongodb.md +++ b/docs/reference/data-sources/mongodb.md @@ -9,7 +9,7 @@ MongoDB data sources are [MongoDB](https://www.mongodb.com/) collections that ca Defining a MongoDB source: ```python -from feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb_source import ( +from feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb import ( MongoDBSource, ) From 71b8853bf2f3f4b16257bc30cc175c5bfc28b402 Mon Sep 17 00:00:00 2001 From: jvincent-mongodb Date: Thu, 30 Apr 2026 13:31:34 -0700 Subject: [PATCH 5/9] docs: fix broken url Signed-off-by: jvincent-mongodb --- docs/reference/data-sources/mongodb.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/reference/data-sources/mongodb.md b/docs/reference/data-sources/mongodb.md index ad820b5e394..e47c6ca4244 100644 --- a/docs/reference/data-sources/mongodb.md +++ b/docs/reference/data-sources/mongodb.md @@ -24,7 +24,7 @@ The `name` field becomes the `feature_view` discriminator stored in every docume Configuration options such as `connection_string`, `database`, and `collection` are inherited from the offline store configuration in `feature_store.yaml`. -The full set of configuration options is available [here](https://rtd.feast.dev/en/master/#feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb_source.MongoDBSource). 
+The full set of configuration options is available [here](https://rtd.feast.dev/en/master/#feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb.MongoDBSource). ## Vector Search From 3eab42a63e719d65b54995b505bd20e2201cf343 Mon Sep 17 00:00:00 2001 From: jvincent-mongodb Date: Thu, 30 Apr 2026 13:40:37 -0700 Subject: [PATCH 6/9] docs: fix code example Signed-off-by: jvincent-mongodb --- docs/reference/data-sources/mongodb.md | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/docs/reference/data-sources/mongodb.md b/docs/reference/data-sources/mongodb.md index e47c6ca4244..cda9c1a8a10 100644 --- a/docs/reference/data-sources/mongodb.md +++ b/docs/reference/data-sources/mongodb.md @@ -84,17 +84,14 @@ Use `retrieve_online_documents_v2()` to perform similarity search: ```python results = store.retrieve_online_documents_v2( - config=repo_config, - table=item_embeddings, - requested_features=["embedding", "title"], - embedding=[0.1, 0.2, ...], # query vector + features=["item_embeddings:embedding", "item_embeddings:title"], + query=[0.1, 0.2, ...], # query vector top_k=5, ) - -# Each result is a (event_timestamp, entity_key_proto, feature_dict) tuple. -# feature_dict includes a synthetic "distance" key with the vector search score. -for ts, entity_key, features in results: - print(features["title"].string_val, features["distance"].float_val) +# Returns an OnlineResponse; to_dict() gives {feature_name: [values]}. 
+response_dict = results.to_dict() +for title, distance in zip(response_dict["title"], response_dict["distance"]): + print(title, distance) ``` ### How It Works From 8d6125abeec9eafa681868d666a3c1b5a59b286a Mon Sep 17 00:00:00 2001 From: jvincent-mongodb Date: Thu, 30 Apr 2026 16:01:33 -0700 Subject: [PATCH 7/9] docs: minor edits to MongoDB data-source and offline-store docs Signed-off-by: jvincent-mongodb --- docs/reference/data-sources/mongodb.md | 20 ++++++++++++-------- docs/reference/offline-stores/mongodb.md | 4 ++-- 2 files changed, 14 insertions(+), 10 deletions(-) diff --git a/docs/reference/data-sources/mongodb.md b/docs/reference/data-sources/mongodb.md index cda9c1a8a10..ad698fe7cb5 100644 --- a/docs/reference/data-sources/mongodb.md +++ b/docs/reference/data-sources/mongodb.md @@ -2,7 +2,7 @@ ## Description -MongoDB data sources are [MongoDB](https://www.mongodb.com/) collections that can be used as a source for feature data. The `MongoDBSource` points at a MongoDB collection and provides the metadata Feast needs to read historical features from the offline store's `feature_history` collection. +MongoDB data sources are [MongoDB](https://www.mongodb.com/) collections that can be used as a source for feature data. The `MongoDBSource` points at a MongoDB collection and provides the metadata Feast needs to read historical features from the offline store's collection. 
## Examples @@ -83,15 +83,18 @@ When `feast apply` (or `store.update()`) runs with `vector_enabled=True`, Atlas Use `retrieve_online_documents_v2()` to perform similarity search: ```python -results = store.retrieve_online_documents_v2( - features=["item_embeddings:embedding", "item_embeddings:title"], - query=[0.1, 0.2, ...], # query vector +results = FeatureStore.store.retrieve_online_documents_v2( + config=repo_config, + table=item_embeddings, + requested_features=["embedding", "title"], + embedding=[0.1, 0.2, ...], # query vector top_k=5, ) -# Returns an OnlineResponse; to_dict() gives {feature_name: [values]}. -response_dict = results.to_dict() -for title, distance in zip(response_dict["title"], response_dict["distance"]): - print(title, distance) + +# Each result is a (event_timestamp, entity_key_proto, feature_dict) tuple. +# feature_dict includes a synthetic "distance" key with the vector search score. +for ts, entity_key, features in results: + print(features["title"].string_val, features["distance"].float_val) ``` ### How It Works diff --git a/docs/reference/offline-stores/mongodb.md b/docs/reference/offline-stores/mongodb.md index 53bdcfaa7bf..aed0ab4c12a 100644 --- a/docs/reference/offline-stores/mongodb.md +++ b/docs/reference/offline-stores/mongodb.md @@ -3,8 +3,8 @@ ## Description The MongoDB offline store provides support for reading [MongoDBSource](../data-sources/mongodb.md). -* Uses a single shared `feature_history` collection with a compound index for all FeatureViews, distinguished by a `feature_view` discriminator field. -* Entity dataframes can be provided as a Pandas dataframe. The offline store converts entity identifiers into serialized entity keys for efficient lookup against the `feature_history` collection. +* Uses a single shared collection with a compound index for all FeatureViews, distinguished by a `feature_view` discriminator field. +* Entity dataframes can be provided as a Pandas dataframe.
The offline store converts entity identifiers into serialized entity keys for efficient lookup against the collection. ## Getting started From c6f12234adcb87a0e9fa842d904f2c929049a292 Mon Sep 17 00:00:00 2001 From: jvincent-mongodb Date: Fri, 1 May 2026 10:47:09 -0700 Subject: [PATCH 8/9] docs: minor docs edits Signed-off-by: jvincent-mongodb --- README.md | 2 ++ docs/reference/data-sources/mongodb.md | 3 ++- docs/reference/offline-stores/mongodb.md | 5 ++--- .../contrib/mongodb_offline_store/README.md | 5 ++--- .../contrib/mongodb_offline_store/mongodb.py | 8 ++------ 5 files changed, 10 insertions(+), 13 deletions(-) diff --git a/README.md b/README.md index ded20b15376..115bd37903f 100644 --- a/README.md +++ b/README.md @@ -185,6 +185,7 @@ The list below contains the functionality that contributors are planning to deve * [x] [Athena (contrib plugin)](https://docs.feast.dev/reference/data-sources/athena) * [x] [Clickhouse (contrib plugin)](https://docs.feast.dev/reference/data-sources/clickhouse) * [x] [Oracle (contrib plugin)](https://docs.feast.dev/reference/data-sources/oracle) + * [x] [MongoDB (contrib plugin)](https://docs.feast.dev/reference/data-sources/mongodb) * [x] [Ray source (contrib plugin)](https://docs.feast.dev/reference/data-sources/ray) * [x] Kafka / Kinesis sources (via [push support into the online store](https://docs.feast.dev/reference/data-sources/push)) * **Offline Stores** @@ -204,6 +205,7 @@ The list below contains the functionality that contributors are planning to deve * [x] [Clickhouse (contrib plugin)](https://docs.feast.dev/reference/offline-stores/clickhouse) * [x] [Ray (contrib plugin)](https://docs.feast.dev/reference/offline-stores/ray) * [x] [Oracle (contrib plugin)](https://docs.feast.dev/reference/offline-stores/oracle) + * [x] [MongoDB (contrib plugin)](https://docs.feast.dev/reference/offline-stores/mongodb) * [x] [Hybrid](https://docs.feast.dev/reference/offline-stores/hybrid) * [x] [Custom offline store 
support](https://docs.feast.dev/how-to-guides/customizing-feast/adding-a-new-offline-store) * **Online Stores** diff --git a/docs/reference/data-sources/mongodb.md b/docs/reference/data-sources/mongodb.md index ad698fe7cb5..c1b6eed1bed 100644 --- a/docs/reference/data-sources/mongodb.md +++ b/docs/reference/data-sources/mongodb.md @@ -83,7 +83,8 @@ When `feast apply` (or `store.update()`) runs with `vector_enabled=True`, Atlas Use `retrieve_online_documents_v2()` to perform similarity search: ```python -results = FeatureStore.store.retrieve_online_documents_v2( +store = FeatureStore(repo_path=".") +results = store.retrieve_online_documents_v2( config=repo_config, table=item_embeddings, requested_features=["embedding", "title"], diff --git a/docs/reference/offline-stores/mongodb.md b/docs/reference/offline-stores/mongodb.md index aed0ab4c12a..0e8d1786699 100644 --- a/docs/reference/offline-stores/mongodb.md +++ b/docs/reference/offline-stores/mongodb.md @@ -19,12 +19,12 @@ registry: data/registry.db provider: local offline_store: type: feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb.MongoDBOfflineStore - connection_string: "mongodb+srv://user:pass@cluster.mongodb.net" + connection_string: "mongodb+srv://user:pass@cluster.mongodb.net" # pragma: allowlist secret database: feast collection: feature_history online_store: type: mongodb - connection_string: "mongodb+srv://user:pass@cluster.mongodb.net" + connection_string: "mongodb+srv://user:pass@cluster.mongodb.net" # pragma: allowlist secret database_name: feast_online_store collection_suffix: latest client_kwargs: {} @@ -67,7 +67,6 @@ This index enables efficient range scans over entities and feature views, while ## Key Optimizations -* **K-collapse**: Multiple FeatureViews that share the same join keys are queried in a single aggregation using `feature_view: {$in: [...]}`, reducing round trips. * **Scoring vs.
training paths**: When each entity appears only once in `entity_df` (scoring/inference — one feature lookup per entity), server-side `$group $first` efficiently returns the single latest value per entity. When the same entity appears at multiple timestamps (training — building a dataset with many historical snapshots per entity), the store retrieves all candidate rows and uses `pd.merge_asof` to select the correct point-in-time value for each request timestamp. * **Two-level chunking**: `CHUNK_SIZE` (50,000 rows) controls the size of intermediate DataFrames in memory; `MONGO_BATCH_SIZE` (10,000 entity IDs) limits the query size sent to MongoDB. diff --git a/sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/README.md b/sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/README.md index 6a30854969c..24446f8c003 100644 --- a/sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/README.md +++ b/sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/README.md @@ -1,10 +1,9 @@ # MongoDB Offline Store This offline store lets you train models and run batch scoring directly from it. -All feature views share a single collection (`feature_history`). Reads use +All feature views share a single collection. Reads use MongoDB aggregation pipelines with a compound index, so per-entity cost is -O(log n_observations) regardless of collection size, and K feature views with the same -entity key collapse into one round-trip instead of K (1 if your data shares a unique id.) +O(log n_observations) regardless of collection size. 
## Schema diff --git a/sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/mongodb.py b/sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/mongodb.py index 07e35f66c15..76aa173cb95 100644 --- a/sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/mongodb.py +++ b/sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/mongodb.py @@ -17,11 +17,7 @@ Single-collection schema. Key optimizations: -1. K-collapse: feature views that share the same join key set are batched - into a single ``$match + $sort`` aggregation instead of K separate find - queries. Reduces round-trips from K to |unique join key signatures|. - -2. Server-side deduplication (scoring path): when entity_df has unique +1. Server-side deduplication (scoring path): when entity_df has unique entity IDs the aggregation adds a ``$group`` stage that returns at most one document per (entity_id, feature_view) pair — O(N×K) transfer instead of O(N×P×K). The compound index backs the entire pipeline, @@ -561,7 +557,7 @@ def get_historical_features( Training path (repeated entity IDs at different timestamps): Omits ``$group`` and uses ``merge_asof`` in Python, matching - standard PIT behaviour but still with K-collapsed queries. + standard PIT behaviour. 
Args: strict_pit: When True (default) features whose document timestamp From 49ab2397c6aec24c0aad0236af00e1a4f5d57cd1 Mon Sep 17 00:00:00 2001 From: jvincent-mongodb Date: Fri, 1 May 2026 10:47:09 -0700 Subject: [PATCH 9/9] docs: minor docs edits Signed-off-by: jvincent-mongodb --- docs/roadmap.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/roadmap.md b/docs/roadmap.md index 017127d7355..e47aa79b573 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -20,6 +20,7 @@ The list below contains the functionality that contributors are planning to deve * [x] [Athena (contrib plugin)](https://docs.feast.dev/reference/data-sources/athena) * [x] [Clickhouse (contrib plugin)](https://docs.feast.dev/reference/data-sources/clickhouse) * [x] [Oracle (contrib plugin)](https://docs.feast.dev/reference/data-sources/oracle) + * [x] [MongoDB (contrib plugin)](https://docs.feast.dev/reference/data-sources/mongodb) * [x] [Ray source (contrib plugin)](https://docs.feast.dev/reference/data-sources/ray) * [x] Kafka / Kinesis sources (via [push support into the online store](https://docs.feast.dev/reference/data-sources/push)) * **Offline Stores** @@ -39,6 +40,7 @@ The list below contains the functionality that contributors are planning to deve * [x] [Clickhouse (contrib plugin)](https://docs.feast.dev/reference/offline-stores/clickhouse) * [x] [Ray (contrib plugin)](https://docs.feast.dev/reference/offline-stores/ray) * [x] [Oracle (contrib plugin)](https://docs.feast.dev/reference/offline-stores/oracle) + * [x] [MongoDB (contrib plugin)](https://docs.feast.dev/reference/offline-stores/mongodb) * [x] [Hybrid](https://docs.feast.dev/reference/offline-stores/hybrid) * [x] [Custom offline store support](https://docs.feast.dev/how-to-guides/customizing-feast/adding-a-new-offline-store) * **Online Stores**