diff --git a/infra/website/docs/blog/feast-spark-kubernetes-production.md b/infra/website/docs/blog/feast-spark-kubernetes-production.md
new file mode 100644
index 00000000000..0d89dea6d88
--- /dev/null
+++ b/infra/website/docs/blog/feast-spark-kubernetes-production.md
@@ -0,0 +1,479 @@
+---
+title: "From Local to Production: BatchFeatureView + Spark on Kubernetes"
+description: "How to eliminate redundant UDF re-execution at training time with the query+path pattern — and then deploy the whole pipeline to Kubernetes without spending a week debugging environmental failures."
+date: 2026-05-27
+authors: ["Abhijeet Dhumal"]
+---
+
+# From Local to Production: BatchFeatureView + Spark on Kubernetes
+
+Two problems, same pipeline.
+
+The first: a data scientist runs three training experiments and waits 22 minutes for each one — not to train the model, but to re-run the same Spark window aggregations over the same 26M rows. The features from the morning's `feast materialize` are sitting in S3. Feast just doesn't know it's allowed to read them there.
+
+The second: you fix the first problem, your pipeline works perfectly on the dev cluster, and then you move it to Kubernetes. `feast materialize` exits 0. Nothing wrote to Redis. No error. The features are null. You spend the next three days finding out why.
+
+This post covers both: the `query+path` pattern that makes features compute once, and the ordered checklist of what breaks when you deploy Feast + Spark to Kubernetes — with a complete production reference config at the end.
+
+---
+
+## Part 1: Compute Features Once
+
+<div class="hero-image">
+  <img src="../../public/images/blog/batchfeatureview-compute-once.png" alt="BatchFeatureView compute-once pattern with Feast and Spark" loading="lazy">
+</div>
+
+### Why `get_historical_features()` Re-Runs Your UDF
+
+When you define a `BatchFeatureView` with a `SparkSource`, the source tells Feast where the raw data is. During `feast materialize`, Feast runs your UDF on that raw data and writes features to your online store. During `get_historical_features()`, Feast goes back to the same `SparkSource`, reads the same raw data, and runs your UDF again — because from Feast's perspective, the source is raw data and features are always derived on demand.
+
+There was no way to say: *"for materialization, use the query. For training, read the pre-computed output."* That distinction didn't exist.
+
+### The `query + path` Pattern
+
+> **Data Scientist** — *"I run `feast materialize` overnight. By morning, every training job reads pre-computed features in seconds — not minutes of Spark compute."*
+
+`SparkSource` now accepts both `query` and `path` together. `query` drives materialization. `path` drives training reads. Define both once; Feast routes automatically:
+
+```python
+from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
+
+user_features_source = SparkSource(
+    name="user_features_source",
+    # Materialization: reads raw data, runs UDF, writes to online store + path
+    query=(
+        "SELECT *, CAST(timestamp / 1000 AS TIMESTAMP) AS event_timestamp "
+        "FROM parquet.`s3a://data-lake/raw/reviews/*/`"
+    ),
+    # Training: reads pre-computed parquet — no UDF, no raw data scan
+    path="s3a://feast-features/offline/user_features/",
+    file_format="parquet",
+    timestamp_field="event_timestamp",
+)
+```
+
+| Operation | Field used | What happens |
+|-----------|------------|--------------|
+| `feast materialize` | `query` | Runs UDF on raw data → writes features to online store + `path` |
+| `get_historical_features()` | `path` | Reads pre-computed parquet directly |
+
+**Graceful fallback:** if `path` doesn't exist yet (before the first `materialize` completes), Feast falls back to the live `query` automatically. The pipeline works on day one without pre-computed data.
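+
+The routing rule above can be sketched in plain Python. This only illustrates the decision, it is not Feast's implementation; `Operation` and `choose_read_target` are hypothetical names introduced for the example:
+
+```python
+from enum import Enum
+
+
+class Operation(Enum):
+    MATERIALIZE = "materialize"
+    HISTORICAL = "get_historical_features"
+
+
+def choose_read_target(op: Operation, path_exists: bool) -> str:
+    """Return which SparkSource field a read would use (illustrative only)."""
+    if op is Operation.MATERIALIZE:
+        return "query"  # materialization always recomputes from raw data
+    # Training reads prefer pre-computed parquet and fall back to the query
+    # until the first materialization has written something to `path`.
+    return "path" if path_exists else "query"
+
+
+print(choose_read_target(Operation.HISTORICAL, path_exists=False))
+```
+
+On day one, before anything exists at `path`, the historical read resolves to `query`; after the first materialization it resolves to `path`.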
+
+### Defining a BatchFeatureView with TransformationMode.PYTHON
+
+> **Data Scientist** — *"I want to write a PySpark transformation once — window aggregations, joins, derived features — and have Feast handle when it runs and where the output goes."*
+
+```python
+from feast import Entity, Field
+from feast.batch_feature_view import batch_feature_view
+from feast.transformation.mode import TransformationMode
+from feast.types import Float32, Int32, Int64, String  # Int32 is used by the label view below
+from pyspark.sql import DataFrame
+from pyspark.sql import functions as F
+from pyspark.sql.window import Window
+from datetime import timedelta
+
+user = Entity(name="user_id", join_keys=["user_id"])
+
+@batch_feature_view(
+    source=user_features_source,
+    entities=[user],
+    mode=TransformationMode.PYTHON,
+    online=True,   # Written to Redis for real-time serving
+    offline=True,  # Written to S3 for point-in-time correct training data
+    schema=[
+        Field(name="user_avg_rating", dtype=Float32),
+        Field(name="user_review_count", dtype=Int64),
+        Field(name="user_primary_category", dtype=String),
+    ],
+    ttl=timedelta(days=7),
+)
+def user_features(df: DataFrame) -> DataFrame:
+    # Runs once per materialize cycle, not per training experiment
+    user_cat_counts = df.groupBy("user_id", "category").agg(
+        F.count("*").alias("cat_count")
+    )
+    # Most-reviewed category per user
+    w = Window.partitionBy("user_id").orderBy(F.desc("cat_count"))
+    primary = (
+        user_cat_counts.withColumn("rn", F.row_number().over(w))
+        .filter(F.col("rn") == 1)
+        .select("user_id", F.col("category").alias("user_primary_category"))
+    )
+    # Aggregate over all of a user's reviews, not only the primary category
+    totals = df.groupBy("user_id").agg(
+        F.avg("rating").alias("user_avg_rating"),
+        F.count("*").alias("user_review_count"),
+    )
+    return totals.join(primary, "user_id").withColumn(
+        "event_timestamp", F.current_timestamp()
+    )
+```
+
+### How Materialization and Training Fit Together
+
+> **MLOps Engineer** — *"I want materialization on a schedule and training jobs fully decoupled — data scientists iterate at experiment speed, not Spark job speed."*
+
+```bash
+# Runs on a schedule: nightly cron, Argo Workflow, Airflow DAG
+feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
+```
+
+During materialization, Feast runs `user_features_source.query` → `user_features()` UDF → writes to:
+- Redis (for `get_online_features()` at serving time)
+- `s3a://feast-features/offline/user_features/` as parquet (for training)
+
+At training time — regardless of how many experiments run that day:
+
+```python
+from feast import FeatureStore
+
+store = FeatureStore(repo_path=".")
+
+training_df = store.get_historical_features(
+    entity_df=entity_df,
+    features=[
+        "user_features:user_avg_rating",
+        "user_features:user_review_count",
+        "user_features:user_primary_category",
+        "item_features:item_avg_rating",
+        "interactions:label",
+    ],
+).to_df()
+```
+
+This reads pre-computed parquet. The UDF is not re-run and no raw data is scanned; the read happens in the training pod's own SparkSession. Point-in-time correctness is preserved because each materialized parquet partition carries its `event_timestamp`.
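+
+To make "point-in-time correct" concrete, here is a minimal pure-Python sketch of the lookup rule: for each entity timestamp, take the latest feature row at or before it, and discard it if it is older than the TTL. `point_in_time_lookup` is a hypothetical helper for illustration; Feast performs the equivalent join inside the offline store:
+
+```python
+from bisect import bisect_right
+from datetime import datetime, timedelta
+
+
+def point_in_time_lookup(feature_rows, entity_ts, ttl):
+    """feature_rows: list of (event_timestamp, value), sorted by timestamp."""
+    timestamps = [ts for ts, _ in feature_rows]
+    idx = bisect_right(timestamps, entity_ts) - 1  # latest row at or before entity_ts
+    if idx < 0:
+        return None  # no feature value existed yet at entity_ts
+    ts, value = feature_rows[idx]
+    return value if entity_ts - ts <= ttl else None  # expired rows are dropped
+
+
+rows = [
+    (datetime(2026, 5, 1), 4.1),
+    (datetime(2026, 5, 2), 4.3),
+    (datetime(2026, 5, 3), 4.0),
+]
+print(point_in_time_lookup(rows, datetime(2026, 5, 2, 12), timedelta(days=7)))
+```
+
+The lookup never returns a value stamped after the entity timestamp, which is what keeps future information out of training rows.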
+
+### Offline-Only Feature Views for Training Labels
+
+> **Data Scientist** — *"My interaction table is 500GB and training-only — it will never live in Redis. I still want its schema and transformation logic tracked in the registry, retrievable via the same API, and point-in-time correct like any other feature."*
+
+`online=False, offline=True` is now a first-class `BatchFeatureView` configuration:
+
+```python
+@batch_feature_view(
+    source=SparkSource(
+        name="interactions_source",
+        query="SELECT * FROM parquet.`s3a://data-lake/raw/interactions/*/`",
+        path="s3a://feast-features/offline/interactions/",
+        file_format="parquet",
+        timestamp_field="event_timestamp",
+    ),
+    entities=[user, item],  # assumes an `item` Entity keyed on item_id, defined like `user`
+    mode=TransformationMode.PYTHON,
+    online=False,   # Never written to Redis
+    offline=True,   # Written to S3 — retrievable via get_historical_features()
+    schema=[
+        Field(name="label", dtype=Int32),
+        Field(name="interaction_type", dtype=String),
+        Field(name="dwell_time_seconds", dtype=Int64),
+    ],
+    ttl=timedelta(days=90),
+)
+def interactions(df: DataFrame) -> DataFrame:
+    return df.select(
+        "user_id", "item_id", "label",
+        "interaction_type", "dwell_time_seconds", "event_timestamp",
+    )
+```
+
+Training labels, interaction histories, and large join tables are now first-class Feast objects. The registry tracks their schema and transformation logic (it does not version the data itself), lineage stays consistent, and retrieval goes through the same `get_historical_features()` API as your serving features.
+
+---
+
+## Part 2: Deploying to Kubernetes
+
+<div class="hero-image">
+  <img src="../../public/images/blog/feast-spark-k8s-production.png" alt="Feast + Spark on Kubernetes production deployment" loading="lazy">
+</div>
+
+Getting Feast + Spark working locally takes an afternoon. Getting it working on Kubernetes takes a week — not because Kubernetes is hard, but because the failures are environmental, sequential, and each one looks unrelated to the last.
+
+Here's the ordered checklist: what breaks, in roughly the order you encounter it, why, and exactly how to fix it.
+
+### Step 1: Build the Right Container Images
+
+> **MLOps Engineer** — *"I need driver and executor images that agree on Feast version, Spark version, and JARs — and don't break on UBI9 with FIPS enabled."*
+
+Two images are needed: a **driver image** (runs the Feast server and SparkComputeEngine) and an **executor image** (spawned by Kubernetes for each Spark worker).
+
+**Java 11 + FIPS disable**
+
+PySpark 3.5.x runs on Java 8, 11, or 17; this guide standardizes on Java 11. On UBI9/RHEL environments (OpenShift AI, RHEL nodes), the JVM runs in FIPS mode by default. FIPS blocks the HMAC-SHA256 key operations used by AWS SDK v1, so every `s3a://` read/write silently fails:
+
+```dockerfile
+RUN microdnf install -y java-11-openjdk-headless git && microdnf clean all
+ENV JAVA_HOME=/usr/lib/jvm/jre-11
+ENV PATH="${JAVA_HOME}/bin:${PATH}"
+# Disables FIPS JVM mode — required for AWS SDK v1 HMAC signing on UBI9/RHEL
+ENV JAVA_TOOL_OPTIONS="-Dcom.redhat.fips=false"
+```
+
+**S3A JARs — version pinning matters**
+
+PySpark 3.5.x bundles Hadoop 3.3.4, which uses AWS SDK v1. The `hadoop-aws` JAR must match the bundled Hadoop version exactly, and the SDK bundle must come from the v1 (1.12.x) line:
+
+```dockerfile
+RUN SPARK_JARS=$(python3 -c 'import pyspark, os; print(os.path.join(os.path.dirname(pyspark.__file__), "jars"))') && \
+    curl -fsSL -o "${SPARK_JARS}/hadoop-aws-3.3.4.jar" \
+      https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar && \
+    curl -fsSL -o "${SPARK_JARS}/aws-java-sdk-bundle-1.12.367.jar" \
+      https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.367/aws-java-sdk-bundle-1.12.367.jar
+```
+
+> ⚠️ **Do not upgrade to PySpark 4.0 yet.** PySpark 4.0 bundles Hadoop 3.4.x, whose S3A connector requires AWS SDK v2, and the pip-installed distribution does not ship that SDK. Every `s3a://` access fails with `NoClassDefFoundError`.
+
+**Executor `/opt/spark` symlinks**
+
+When using `spark.master: "k8s://..."`, the driver launches executor pods by calling `/opt/spark/bin/spark-class`. A pip-installed PySpark puts its binaries inside site-packages — not `/opt/spark`:
+
+```dockerfile
+RUN PYSPARK_HOME=$(python3 -c 'import pyspark, os; print(os.path.dirname(pyspark.__file__))') && \
+    mkdir -p /opt/spark && \
+    ln -sf "${PYSPARK_HOME}/bin"    /opt/spark/bin    && \
+    ln -sf "${PYSPARK_HOME}/jars"   /opt/spark/jars   && \
+    ln -sf "${PYSPARK_HOME}/python" /opt/spark/python
+ENV SPARK_HOME=/opt/spark
+```
+
+Add an entrypoint shim — the Spark k8s executor contract expects `/opt/entrypoint.sh`:
+
+```bash
+#!/bin/bash
+# /opt/entrypoint.sh
+if [ "$1" = "executor" ]; then
+  exec /opt/spark/bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend "$@"
+fi
+exec "$@"
+```
+
+---
+
+### Step 2: S3 Event Logging Without Killing SparkContext
+
+> **MLOps Engineer** — *"I enabled Spark event logging to S3. Now `feast materialize` exits 0 and writes nothing."*
+
+On a fresh bucket, `SparkContext` initialization verifies the event log directory exists. S3 has no real directories — a missing prefix returns 404, the pre-flight check raises, and Feast catches it silently. The job exits 0. Nothing was written.
+
+**The fix:** Feast now automatically writes a zero-byte `.keep` placeholder to the event log prefix before `SparkSession` initialization:
+
+```yaml
+offline_store:
+  type: spark
+  spark_conf:
+    spark.eventLog.enabled: "true"
+    spark.eventLog.dir: "s3a://feast-logs/spark-events/"
+    spark.hadoop.fs.s3a.endpoint: "http://minio.feast-system.svc.cluster.local:9000"
+    spark.hadoop.fs.s3a.path.style.access: "true"
+```
+
+No bucket setup required. With event logging working, Spark History Server gives you stage timelines and executor utilization for every `feast materialize` job.
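+
+As an illustration of what "a placeholder under the event log prefix" means, the sketch below derives the bucket and object key from `spark.eventLog.dir`. `placeholder_location` is a hypothetical helper, and the exact key Feast writes may differ; only the `.keep` name comes from the description above:
+
+```python
+from urllib.parse import urlparse
+
+
+def placeholder_location(event_log_dir: str, marker: str = ".keep"):
+    """Split an s3a:// event log URI into (bucket, key-of-placeholder)."""
+    parsed = urlparse(event_log_dir)
+    prefix = parsed.path.strip("/")
+    # An object under the prefix makes the "directory" visible to S3A listings.
+    key = f"{prefix}/{marker}" if prefix else marker
+    return parsed.netloc, key
+
+
+print(placeholder_location("s3a://feast-logs/spark-events/"))
+```
+
+For the config above this yields the bucket `feast-logs` and the key `spark-events/.keep`.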
+
+---
+
+### Step 3: Window Operations on Spark 3.5+
+
+> **Data Scientist** — *"My BFV uses `Window.partitionBy`. It worked on Spark 3.4. On Spark 3.5 (Databricks 14+, EMR 7+) it crashes with a serializer error."*
+
+Spark 3.5 introduced `WindowGroupLimitExec`, a new physical plan node that inserts itself upstream of Arrow-based UDF execution when `Window` operations are present. The fix lives inside Feast, which replaces its Arrow UDF bridge with `foreachPartition`, so the BFV definition itself needs no changes:
+
+```python
+# Works correctly on all Spark versions including 3.5+
+@batch_feature_view(
+    source=user_reviews_source,  # `source=` (singular), matching the other BatchFeatureView examples
+    entities=[user],
+    mode=TransformationMode.PYTHON,
+    schema=[Field(name="user_primary_category", dtype=String)],
+)
+def user_features(df: DataFrame) -> DataFrame:
+    w = Window.partitionBy("user_id").orderBy(F.desc("cnt"))
+    counts = df.groupBy("user_id", "category").agg(F.count("*").alias("cnt"))
+    return (
+        counts.withColumn("rn", F.row_number().over(w))
+        .filter(F.col("rn") == 1)
+        .select("user_id", F.col("category").alias("user_primary_category"))
+        .withColumn("event_timestamp", F.current_timestamp())
+    )
+```
+
+No version pinning, no workarounds. The same BFV code runs on Databricks Runtime 14+, EMR 7+, or self-hosted Spark 3.5.
+
+---
+
+### Step 4: Staging Reads Fail on MinIO and Custom S3 Endpoints
+
+> **MLOps Engineer** — *"Materialization succeeds. Features land in Redis. But `get_historical_features()` raises `FileNotFoundError` on the staging path — the same bucket that just received the write."*
+
+Without an explicit filesystem client, PyArrow resolves `s3://` URIs using the default credential chain and connects to `s3.amazonaws.com`. On MinIO or any private endpoint, PyArrow is connecting to the wrong host.
+
+Feast now builds the correct `pyarrow.fs.S3FileSystem` from your environment. Set `AWS_ENDPOINT_URL_S3`:
+
+```bash
+# Pod environment — mount from Kubernetes Secret
+AWS_ENDPOINT_URL_S3=http://minio.feast-system.svc.cluster.local:9000
+AWS_ACCESS_KEY_ID=<from-secret>
+AWS_SECRET_ACCESS_KEY=<from-secret>  # pragma: allowlist secret
+```
+
+GCS (`gs://`) and local paths are handled the same way — Feast selects `GcsFileSystem` or `LocalFileSystem` based on URI scheme.
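+
+The scheme-based selection can be sketched as follows. This is a hedged illustration rather than Feast's code: `filesystem_for` and `SCHEME_TO_FILESYSTEM` are hypothetical, and the sketch returns the `pyarrow.fs` class name plus keyword arguments instead of importing PyArrow, so it runs anywhere. `endpoint_override` is the `S3FileSystem` argument that points the client at a non-AWS host:
+
+```python
+import os
+from urllib.parse import urlparse
+
+SCHEME_TO_FILESYSTEM = {
+    "s3": "S3FileSystem",
+    "gs": "GcsFileSystem",
+    "file": "LocalFileSystem",
+    "": "LocalFileSystem",  # bare paths such as /tmp/staging
+}
+
+
+def filesystem_for(uri, env=None):
+    """Return (pyarrow.fs class name, constructor kwargs) for a staging URI."""
+    env = os.environ if env is None else env
+    scheme = urlparse(uri).scheme
+    if scheme not in SCHEME_TO_FILESYSTEM:
+        raise ValueError(f"unsupported URI scheme: {scheme!r}")
+    name = SCHEME_TO_FILESYSTEM[scheme]
+    kwargs = {}
+    endpoint = env.get("AWS_ENDPOINT_URL_S3")
+    if name == "S3FileSystem" and endpoint:
+        # Without this, the client would talk to s3.amazonaws.com.
+        kwargs["endpoint_override"] = endpoint
+    return name, kwargs
+
+
+minio_env = {"AWS_ENDPOINT_URL_S3": "http://minio.feast-system.svc.cluster.local:9000"}
+print(filesystem_for("s3://feast-features/feast-staging/", minio_env))
+```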
+
+---
+
+### Step 5: Executor Pods Getting OOMKilled on Large Feature Views
+
+> **MLOps Engineer** — *"Materialization of the 10M-key user feature view keeps OOMKilling executor pods. Raising memory doesn't help beyond a point."*
+
+The original write path accumulated an entire Spark partition in memory before flushing to the online store. Feast now writes in fixed-size chunks, bounding peak memory per executor regardless of partition size:
+
+```yaml
+batch_engine:
+  type: spark.engine
+  partitions: 200
+  spark_conf:
+    spark.executor.memory: "6g"
+    spark.executor.memoryOverhead: "14g"  # GPU UVM counted against overhead for RAPIDS
+    spark.executor.instances: "2"
+    spark.executor.cores: "4"
+```
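+
+The chunked write path boils down to a familiar iterator pattern. The sketch below is a generic version of it, not Feast's writer; `iter_chunks` is a hypothetical helper and the chunk size is arbitrary:
+
+```python
+from itertools import islice
+from typing import Iterable, Iterator, List, TypeVar
+
+T = TypeVar("T")
+
+
+def iter_chunks(rows: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
+    """Yield lists of at most chunk_size rows without materializing the input."""
+    if chunk_size <= 0:
+        raise ValueError("chunk_size must be positive")
+    it = iter(rows)
+    while True:
+        chunk = list(islice(it, chunk_size))
+        if not chunk:
+            return
+        yield chunk  # only this chunk is held in memory at a time
+
+
+for chunk in iter_chunks(range(5), chunk_size=2):
+    print(chunk)  # stand-in for one online-store write call
+```
+
+Peak memory is bounded by `chunk_size` rows rather than by the size of the partition, which is why raising executor memory stops being the only lever.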
+
+---
+
+## The Full Production Configuration
+
+With all five steps addressed, here is a production-ready `feature_store.yaml` for a Kubernetes deployment with MinIO, RAPIDS GPU executors, and a stable driver service:
+
+```yaml
+project: my-ml-project
+registry:
+  registry_type: file
+  path: /feast-registry/registry.db
+
+# offline_store: used by training pods (get_historical_features) with local[*] SparkSession
+offline_store:
+  type: spark
+  spark_conf:
+    spark.master: "local[*]"
+    spark.driver.memory: "8g"
+    spark.sql.shuffle.partitions: "200"
+    spark.default.parallelism: "200"
+    spark.sql.session.timeZone: "UTC"
+    spark.sql.runSQLOnFiles: "true"
+    spark.sql.execution.arrow.pyspark.enabled: "true"
+    spark.sql.execution.arrow.fallback.enabled: "true"
+    spark.sql.execution.arrow.maxRecordsPerBatch: "50000"
+    spark.driver.extraJavaOptions: "-Dcom.redhat.fips=false"
+    spark.hadoop.fs.s3a.endpoint: "http://minio.feast-system.svc.cluster.local:9000"
+    spark.hadoop.fs.s3a.path.style.access: "true"
+    spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
+    spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.EnvironmentVariableCredentialsProvider"
+    spark.hadoop.fs.s3a.connection.ssl.enabled: "false"
+    spark.hadoop.fs.s3a.connection.maximum: "100"
+    spark.hadoop.fs.s3a.threads.max: "64"
+  staging_location: "s3://feast-features/feast-staging/"
+
+# batch_engine: used by the feast-spark server pod for feast materialize
+# spark.master k8s:// spawns GPU executor pods on demand
+batch_engine:
+  type: spark.engine
+  partitions: 200
+  spark_conf:
+    spark.master: "k8s://https://kubernetes.default.svc:443"
+    spark.submit.deployMode: "client"
+    spark.driver.host: "feast-spark-driver.feast-system.svc.cluster.local"
+    spark.driver.bindAddress: "0.0.0.0"
+    spark.driver.port: "7078"
+    spark.blockManager.port: "7079"
+    spark.driver.memory: "8g"
+    spark.driver.maxResultSize: "4g"
+    spark.driver.extraJavaOptions: "-Dcom.redhat.fips=false"
+    spark.kubernetes.namespace: "feast-system"
+    spark.kubernetes.container.image: "your-registry/feast-spark-executor-rapids:latest"
+    spark.kubernetes.container.image.pullPolicy: "Always"
+    spark.kubernetes.authenticate.driver.serviceAccountName: "feast-sa"
+    spark.kubernetes.executor.deleteOnTermination: "true"
+    spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID: "feast-s3-credentials:AWS_ACCESS_KEY_ID"
+    spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY: "feast-s3-credentials:AWS_SECRET_ACCESS_KEY"  # pragma: allowlist secret
+    spark.executor.instances: "2"
+    spark.executor.cores: "4"
+    spark.kubernetes.executor.request.cores: "3"
+    spark.executor.memory: "6g"
+    spark.executor.memoryOverhead: "14g"
+    spark.executor.extraJavaOptions: "-Duser.dir=/tmp -Dcom.redhat.fips=false"
+    # RAPIDS GPU plugin
+    spark.rapids.sql.enabled: "true"
+    spark.plugins: "com.nvidia.spark.SQLPlugin"
+    spark.rapids.sql.concurrentGpuTasks: "2"
+    spark.rapids.memory.pinnedPool.size: "1g"
+    spark.task.resource.gpu.amount: "0.25"
+    spark.executor.resource.gpu.amount: "1"
+    spark.executor.resource.gpu.vendor: "nvidia.com"
+    spark.executor.resource.gpu.discoveryScript: "/opt/spark/getGpusResources.sh"
+    spark.kubernetes.executor.limit.nvidia.com/gpu: "1"
+    # Resilience
+    spark.executor.maxNumFailures: "16"
+    spark.task.maxFailures: "8"
+    spark.network.timeout: "600s"
+    spark.executor.heartbeatInterval: "60s"
+    spark.kubernetes.executor.missingPodDetectDelta: "120s"
+    spark.sql.runSQLOnFiles: "true"
+    spark.sql.session.timeZone: "UTC"
+    spark.sql.shuffle.partitions: "200"
+    spark.sql.execution.arrow.pyspark.enabled: "true"
+    spark.sql.execution.arrow.maxRecordsPerBatch: "10000"
+    spark.hadoop.fs.s3a.endpoint: "http://minio.feast-system.svc.cluster.local:9000"
+    spark.hadoop.fs.s3a.path.style.access: "true"
+    spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
+    spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.EnvironmentVariableCredentialsProvider"
+    spark.hadoop.fs.s3a.connection.ssl.enabled: "false"
+    spark.hadoop.fs.s3a.connection.maximum: "100"
+    spark.hadoop.fs.s3a.threads.max: "64"
+
+online_store:
+  type: redis
+  connection_string: "${REDIS_HOST}:${REDIS_PORT},password=${REDIS_PASSWORD}"
+
+entity_key_serialization_version: 3
+```
+
+```bash
+# Pod environment — mount from Kubernetes Secrets
+AWS_ENDPOINT_URL_S3=http://minio.feast-system.svc.cluster.local:9000
+AWS_ACCESS_KEY_ID=<minio-access-key>
+AWS_SECRET_ACCESS_KEY=<minio-secret-key>  # pragma: allowlist secret
+REDIS_HOST=redis.feast-system.svc
+REDIS_PORT=6379
+REDIS_PASSWORD=<redis-password>
+JAVA_TOOL_OPTIONS=-Dcom.redhat.fips=false
+```
+
+Key notes:
+- `offline_store.spark_conf` uses `local[*]` — training pods run a local SparkSession, no k8s executor spawning needed
+- `batch_engine.spark_conf` uses `k8s://` — the materialization server spawns GPU executor pods on demand
+- `spark.executor.memoryOverhead: "14g"` is intentionally large: RAPIDS GPU UVM is allocated from the overhead budget, not the JVM heap
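+
+A quick way to sanity-check these numbers against node capacity is to do the arithmetic explicitly. The sketch below uses only values from the config above; `parse_gib` is a hypothetical helper that handles the `g` suffix only, and the task-slot estimate assumes the default `spark.task.cpus=1`:
+
+```python
+def parse_gib(value: str) -> int:
+    """Parse a Spark memory string such as "6g" into whole GiB."""
+    if not value.lower().endswith("g"):
+        raise ValueError(f"unsupported memory unit: {value!r}")
+    return int(value[:-1])
+
+
+conf = {
+    "spark.executor.instances": "2",
+    "spark.executor.cores": "4",
+    "spark.executor.memory": "6g",
+    "spark.executor.memoryOverhead": "14g",
+    "spark.executor.resource.gpu.amount": "1",
+    "spark.task.resource.gpu.amount": "0.25",
+}
+
+# Each executor pod requests heap plus overhead.
+pod_memory_gib = parse_gib(conf["spark.executor.memory"]) + parse_gib(
+    conf["spark.executor.memoryOverhead"]
+)
+total_memory_gib = pod_memory_gib * int(conf["spark.executor.instances"])
+
+# Concurrent tasks per executor are capped by both CPU cores and GPU shares.
+cpu_slots = int(conf["spark.executor.cores"])
+gpu_slots = int(
+    float(conf["spark.executor.resource.gpu.amount"])
+    / float(conf["spark.task.resource.gpu.amount"])
+)
+tasks_per_executor = min(cpu_slots, gpu_slots)
+
+print(pod_memory_gib, total_memory_gib, tasks_per_executor)  # 20 40 4
+```
+
+With this config each executor pod needs 20 GiB of schedulable memory and one GPU, so the two executors together need 40 GiB across the GPU nodes.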
+
+## GPU vs CPU RAPIDS Benchmark
+
+> 📊 **Coming soon:** end-to-end materialization benchmarks comparing CPU-only vs RAPIDS GPU-accelerated executors across tabular aggregation BFVs and embedding BFVs. Results will be added to this post.
+
+---
+
+## What You Get When It All Works
+
+**The Spark cluster stops being the experiment bottleneck.** Materialization runs once on a schedule. Every training job after that reads pre-computed parquet. The cluster's compute budget goes to feature refresh, not feature re-derivation.
+
+**Training-serving consistency becomes structural.** `get_historical_features()` reads the same parquet that `feast materialize` wrote. There is no re-implementation of transformation logic. Training-serving skew cannot emerge because there is only one code path.
+
+**A complete feature catalog.** Serving features, training features, and training labels — all in the same Feast repository, with the same APIs for discovery, lineage, and retrieval.
+
+**Platform freedom.** Works on OpenShift AI, self-hosted Kubernetes, Kubeflow, and any K8s-based ML platform with MinIO or GCS. No AWS dependency.
+
+**Operational visibility.** Spark event logging to S3 gives you stage timelines and executor utilization for every materialization job — the primitives needed for feature freshness SLAs.
+
+---
+
+## Share Your Feedback
+
+We want to hear from you! Try the compute-once pattern and the deployment guide and tell us:
+
+- Which Kubernetes platform and object storage backend you're running on
+- How the `query + path` routing fits your existing materialization schedules
+- Which step in the K8s guide unblocked your deployment — or what's still blocking it
+
+Join the conversation on [GitHub](https://github.com/feast-dev/feast) or in the [Feast Slack community](https://slack.feast.dev). Your feedback directly shapes what we build next.
diff --git a/infra/website/docs/blog/spark-gpu-embeddings-rag.md b/infra/website/docs/blog/spark-gpu-embeddings-rag.md
new file mode 100644
index 00000000000..63f3c2bc666
--- /dev/null
+++ b/infra/website/docs/blog/spark-gpu-embeddings-rag.md
@@ -0,0 +1,391 @@
+---
+title: "GPU-Accelerated Embedding Pipelines with Feast, Spark & Milvus"
+description: "How to generate, materialize, and serve millions of embedding vectors for semantic search using BatchFeatureView and pandas_udf on GPU Spark executors — no separate Ray cluster needed if you're already on Spark."
+date: 2026-05-27
+authors: ["Abhijeet Dhumal"]
+---
+
+# GPU-Accelerated Embedding Pipelines with Feast, Spark & Milvus
+
+Your ML infrastructure already runs Spark. You use it for tabular feature engineering: user aggregations, item statistics, interaction labels. `BatchFeatureView` with `SparkComputeEngine` handles all of it.
+
+Now the product team wants semantic search. Users should be able to ask questions and get results ranked by meaning, not just keywords. The answer is embeddings — dense vector representations of text, stored in a vector database, retrieved by cosine similarity.
+
+Your infrastructure team scopes it out: you need a model inference step (sentence-transformers), a vector store (Milvus), and something to orchestrate the pipeline at scale. Someone suggests adding Ray to the stack. But your Spark cluster already has GPU nodes. The A100s sitting idle between nightly materialization runs could be doing this work.
+
+This post shows how to build the entire embedding pipeline — generation, materialization, and serving — as a first-class Feast `BatchFeatureView`, using the same Spark GPU cluster you already operate.
+
+<div class="hero-image">
+  <img src="../../public/images/blog/spark-gpu-embeddings-rag.png" alt="GPU embedding pipeline with Feast, Spark, and Milvus" loading="lazy">
+</div>
+
+---
+> **Ray or Spark?** If you're starting fresh without existing Spark infrastructure, Feast's [Ray integration](./feast-ray-distributed-processing) is an excellent entry point — it's purpose-built for distributed embedding generation with minimal setup. This post is specifically for teams already running `SparkComputeEngine` for tabular feature engineering who want to add semantic search without introducing a second compute framework.
+
+---
+
+## How It All Fits Together
+
+```
+Raw text data (S3 parquet)
+        │
+        ▼
+@batch_feature_view (TransformationMode.PYTHON)
+        │
+  pandas_udf (_embed_udf)
+  sentence-transformers on GPU executor pods
+        │
+        ▼
+Milvus (online store, IVF_FLAT COSINE index)
+        │
+        ▼
+retrieve_online_documents_v2()
+```
+
+One `feast materialize` call. No separate Ray cluster. No custom Airflow DAG for embedding generation. The same scheduling, registry, and serving primitives you already use for tabular features handle embeddings end-to-end.
+
+---
+
+## Step 1: Define the Embedding Source and Entity
+
+> **Data Scientist** — *"I want to embed my text corpus once per refresh cycle and query results at serving time — without managing a separate embedding pipeline."*
+
+The `SparkSource` query prepares raw rows for the embedding step: it derives `event_timestamp` from the epoch-millisecond `timestamp` column and generates a stable hash ID per document. Concatenating the review title and body happens later, inside the feature view:
+
+```python
+from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
+from feast import Entity
+from feast.value_type import ValueType
+
+review = Entity(
+    name="review_id",
+    value_type=ValueType.STRING,
+    description="Unique document identifier for vector lookup",
+)
+
+reviews_source = SparkSource(
+    name="reviews_source",
+    query=(
+        "SELECT *, "
+        "CAST(timestamp / 1000 AS TIMESTAMP) AS event_timestamp, "
+        "SHA2(CONCAT_WS('_', user_id, COALESCE(parent_asin, asin), CAST(timestamp AS STRING)), 256) AS review_id "
+        "FROM parquet.`s3a://data-lake/raw/reviews/*/`"
+    ),
+    timestamp_field="event_timestamp",
+)
+```
+
+The SHA2 hash produces a stable, deterministic `review_id` from user + item + timestamp — no UUID generation, reproducible across runs.
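+
+For checking IDs outside Spark (for example in a serving-side test), the same hash can be reproduced in plain Python. This is a sketch under the assumption that inputs are UTF-8 strings; `review_id` here is a hypothetical helper mirroring the SQL expression, including `COALESCE` and the way `CONCAT_WS` skips NULLs:
+
+```python
+import hashlib
+
+
+def review_id(user_id, parent_asin, asin, timestamp_ms):
+    """Mirror SHA2(CONCAT_WS('_', user_id, COALESCE(parent_asin, asin), ts), 256)."""
+    item = parent_asin if parent_asin is not None else asin  # COALESCE
+    ts = str(timestamp_ms) if timestamp_ms is not None else None  # CAST AS STRING
+    parts = [p for p in (user_id, item, ts) if p is not None]  # CONCAT_WS skips NULLs
+    return hashlib.sha256("_".join(parts).encode("utf-8")).hexdigest()
+
+
+rid = review_id("u1", None, "B01", 1700000000000)
+print(len(rid), rid == review_id("u1", None, "B01", 1700000000000))
+```
+
+The result is a 64-character hex string, and identical inputs always map to the same ID.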
+
+---
+
+## Step 2: Define the Embedding BatchFeatureView
+
+> **Data Scientist** — *"I want to write the embedding logic as a Python function and have Feast handle when it runs, how it scales, and where the vectors land."*
+
+The embedding UDF is a `pandas_udf` — Spark serializes batches of text to the Python worker as a `pd.Series`, the model encodes them, and the results come back as a `pd.Series` of float lists:
+
+```python
+from feast.batch_feature_view import batch_feature_view
+from feast.field import Field
+from feast.transformation.mode import TransformationMode
+from feast.types import Array, Float32, Float64, String
+from datetime import timedelta
+
+EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
+EMBEDDING_DIM = 384
+
+@batch_feature_view(
+    name="review_embeddings",
+    entities=[review],
+    source=reviews_source,
+    mode=TransformationMode.PYTHON,
+    online=True,    # Written to Milvus for vector similarity search
+    offline=False,  # Embeddings not needed for point-in-time training joins
+    schema=[
+        Field(name="item_id", dtype=String),
+        Field(name="user_id", dtype=String),
+        Field(name="rating", dtype=Float64),
+        Field(name="review_title", dtype=String),
+        Field(name="embed_text", dtype=String),
+        Field(
+            name="embedding",
+            dtype=Array(Float32),
+            vector_index=True,
+            vector_search_metric="COSINE",
+        ),
+    ],
+    ttl=timedelta(days=90),
+)
+def review_embeddings(df):
+    import pandas as pd
+    from pyspark.sql import functions as F
+    from pyspark.sql.types import ArrayType, FloatType
+
+    # Concatenate title + body into the text to embed
+    df = df.withColumn(
+        "embed_text",
+        F.concat_ws(
+            " ",
+            F.coalesce(F.col("title"), F.lit("")),
+            F.coalesce(F.col("text"), F.lit("")),
+        ),
+    ).filter(F.length("embed_text") >= 20)
+
+    staging = df.select(
+        F.col("review_id"),
+        F.col("parent_asin").alias("item_id"),
+        F.col("user_id"),
+        F.col("rating").cast("double"),
+        F.col("title").alias("review_title"),
+        F.col("embed_text").substr(1, 511).alias("embed_text"),
+        F.current_timestamp().alias("event_timestamp"),
+    )
+
+    @F.pandas_udf(ArrayType(FloatType()))
+    def _embed_udf(texts: pd.Series) -> pd.Series:
+        from sentence_transformers import SentenceTransformer
+        # Cache on the function object — loaded once per Python worker process,
+        # not once per Arrow batch. Avoids repeated 90MB weight loads.
+        if not hasattr(_embed_udf, "_model"):
+            _embed_udf._model = SentenceTransformer(EMBEDDING_MODEL)
+        embeddings = _embed_udf._model.encode(
+            texts.tolist(),
+            normalize_embeddings=True,
+            batch_size=64,
+            show_progress_bar=False,
+        )
+        return pd.Series([e.tolist() for e in embeddings])
+
+    return staging.withColumn("embedding", _embed_udf(F.col("embed_text")))
+```
+
+A few design decisions worth noting:
+
+- **`online=True, offline=False`**: embeddings go to Milvus for serving. They're not needed for point-in-time training joins — that's what user/item tabular features are for.
+- **`normalize_embeddings=True`**: produces unit vectors, making cosine similarity equivalent to dot product — faster at query time in most vector databases.
+- **`batch_size=64`**: controls how many texts the model processes per GPU forward pass. Tune based on GPU memory; larger batches improve throughput until you hit VRAM limits.
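+- **`substr(1, 511)`**: caps `embed_text` at 511 characters before encoding. This is a character cap, not a token cap: by default `all-MiniLM-L6-v2` truncates input longer than 256 word pieces, so the cap bounds payload size but very long reviews can still be truncated by the tokenizer.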
+- **Model caching via `hasattr`**: `pandas_udf` is called once per Arrow batch, not once per partition. Without caching, the 90MB model weights would be loaded from disk on every batch call. Storing the model on the function object (`_embed_udf._model`) means it's loaded once per Python worker process and reused across all batches that worker handles.
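+
+The claim about `normalize_embeddings=True` is easy to verify numerically. The sketch below uses plain Python lists in place of real embeddings; the helper names are illustrative:
+
+```python
+import math
+
+
+def dot(a, b):
+    return sum(x * y for x, y in zip(a, b))
+
+
+def l2_normalize(v):
+    norm = math.sqrt(dot(v, v))
+    return [x / norm for x in v]
+
+
+def cosine(a, b):
+    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
+
+
+a, b = [3.0, 4.0], [1.0, 2.0]
+# After L2 normalization, the plain dot product equals cosine similarity.
+print(math.isclose(cosine(a, b), dot(l2_normalize(a), l2_normalize(b))))
+```
+
+Because the norms are divided out once at write time, the vector store only has to compute a dot product per candidate at query time.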
+
+---
+
+## Step 3: Configure Milvus as the Online Store
+
+> **MLOps Engineer** — *"I want vector similarity search served by Milvus with the same Feast API I already use for tabular feature retrieval."*
+
+```yaml
+# feature_store.yaml — Milvus configuration
+online_store:
+  type: milvus
+  host: milvus.feast-system.svc.cluster.local
+  port: 19530
+  vector_enabled: true
+  embedding_dim: 384
+  index_type: "IVF_FLAT"
+  metric_type: "COSINE"
+  index_params:
+    nlist: 1024
+  search_params:
+    nprobe: 16
+```
+
+For IVF_FLAT with COSINE metric:
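+- `nlist: 1024` sets the number of Voronoi clusters the index partitions vectors into. Roughly `sqrt(n_vectors)` is a good starting point for up to 1M vectors; scale up for larger collections.
+- `nprobe: 16` sets how many clusters are searched per query. Higher values improve recall at the cost of latency. Recall at a given `nprobe` depends on the data distribution and collection size, so treat 16 as a starting value and measure recall against an exact (brute-force) search on a sample of your own queries before relying on it.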
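+
+The arithmetic behind these two parameters can be worked through in a few lines. This is only a sketch of the `sqrt(n_vectors)` starting point described above; `suggest_nlist` and `probed_fraction` are hypothetical helper names, not part of Milvus or Feast.
+
+```python
+import math
+
+
+def suggest_nlist(n_vectors: int) -> int:
+    """Starting-point heuristic from the text: roughly sqrt(n_vectors), at least 1."""
+    return max(1, round(math.sqrt(n_vectors)))
+
+
+def probed_fraction(nprobe: int, nlist: int) -> float:
+    """Share of clusters scanned per query."""
+    return nprobe / nlist
+
+
+nlist_1m = suggest_nlist(1_000_000)    # sqrt(1,000,000) = 1000, close to the 1024 used above
+fraction = probed_fraction(16, 1024)   # 16 of 1024 clusters are scanned per query
+print(nlist_1m, fraction)
+```
+
+Seeing the probed share as a number makes the trade-off concrete: raising `nprobe` scans proportionally more of the index on every query.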
+
+---
+
+## Step 4: Build the Executor Image with Model Weights
+
+> **MLOps Engineer** — *"Every executor pod downloading 90MB of model weights at startup creates a race condition under load. Bake the weights into the image."*
+
+For GPU executor pods that run the embedding UDF, extend the RAPIDS executor image with `sentence-transformers` and pre-cached model weights:
+
+```dockerfile
+FROM your-registry/feast-spark-executor-rapids:latest
+
+USER 0
+
+RUN pip install --no-cache-dir sentence-transformers==3.4.1
+
+# Cache model weights at build time — avoids cold-start downloads in air-gapped clusters
+# or when many executor pods start simultaneously
+ENV HF_HOME=/opt/hf-cache
+ENV TRANSFORMERS_CACHE=/opt/hf-cache
+
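+# Backslash continuations keep this a single RUN instruction; without them the
+# Dockerfile parser would read the Python `from ...` line as a new FROM instruction.
+RUN python3 -c "\
+from sentence_transformers import SentenceTransformer; \
+model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2'); \
+emb = model.encode(['test']); \
+assert len(emb[0]) == 384; \
+print('Model cached: dim=384')" \
+ && chmod -R g+rX /opt/hf-cache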
+
+USER 1001
+```
+
+This is critical for Kubernetes deployments: if 4 executor pods start simultaneously and each tries to download 90MB from HuggingFace Hub (or your model registry), whether a task succeeds depends on the download finishing before Spark gives up on it. In an air-gapped cluster the download cannot succeed at all.
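+
+One way to catch a missing or unreadable cache before Spark schedules any work is a small preflight check. The sketch below is an assumption-laden illustration: `find_cached_model` and `preflight` are hypothetical helpers, and the `models--<org>--<name>` directory name is the Hugging Face hub cache layout assumed here; verify it against the image you actually build.
+
+```python
+import os
+import tempfile
+from pathlib import Path
+from typing import Optional
+
+
+def find_cached_model(cache_root: str, repo_id: str) -> Optional[Path]:
+    """Return the cached model directory under cache_root, or None if absent.
+
+    Assumes the hub cache names model directories 'models--<org>--<name>'.
+    """
+    wanted = "models--" + repo_id.replace("/", "--")
+    for dirpath, dirnames, _ in os.walk(cache_root):
+        if wanted in dirnames:
+            return Path(dirpath) / wanted
+    return None
+
+
+def preflight(cache_root: str, repo_id: str) -> bool:
+    """True when the model directory exists and the current user can read and traverse it."""
+    found = find_cached_model(cache_root, repo_id)
+    return found is not None and os.access(found, os.R_OK | os.X_OK)
+
+
+# Demo against a throwaway directory that mimics the assumed layout.
+with tempfile.TemporaryDirectory() as root:
+    os.makedirs(os.path.join(root, "hub", "models--sentence-transformers--all-MiniLM-L6-v2"))
+    ok = preflight(root, "sentence-transformers/all-MiniLM-L6-v2")
+    missing = preflight(root, "some-org/other-model")
+```
+
+Running a check like this as the non-root user (`1001` in the Dockerfile above) also exercises the permissions set by the `chmod` step.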
+
+---
+
+## Step 5: Run Your First Materialization
+
+> **MLOps Engineer** — *"I want a single command that generates and stores all embedding vectors — on a schedule, incrementally, without touching any custom pipeline code."*
+
+```bash
+feast apply
+
+# Materialize embeddings for a date range
+feast materialize 2024-01-01T00:00:00 2024-12-31T23:59:59 \
+  --views review_embeddings
+```
+
+`feast materialize` runs the `review_embeddings` BFV: the `reviews_source` query reads raw parquet from S3, the `review_embeddings()` function runs on each partition via `foreachPartition`, `_embed_udf` executes on GPU executor pods in batches of 64, and the resulting 384-dim vectors are written to Milvus.
+
+For incremental updates (new reviews since last run):
+
+```bash
+feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S") \
+  --views review_embeddings
+```
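+
+The time window an incremental run covers can be sketched as follows. This is an illustration, not Feast's code: it assumes the window starts where the previous run for that feature view ended and falls back to `now - ttl` on the first run, and `next_window` is a hypothetical helper.
+
+```python
+from datetime import datetime, timedelta, timezone
+from typing import Optional, Tuple
+
+
+def next_window(
+    last_end: Optional[datetime], now: datetime, ttl: timedelta
+) -> Tuple[datetime, datetime]:
+    """Return (start, end) for the next incremental run.
+
+    Assumption: start at the previous run's end, or at now - ttl on the first run.
+    """
+    start = last_end if last_end is not None else now - ttl
+    return start, now
+
+
+now = datetime(2024, 6, 2, tzinfo=timezone.utc)
+first_run = next_window(None, now, ttl=timedelta(days=30))
+second_run = next_window(datetime(2024, 6, 1, tzinfo=timezone.utc), now, ttl=timedelta(days=30))
+print(first_run, second_run)
+```
+
+The second call covers a single day, which is why the embedding UDF only sees reviews that arrived since the previous run.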
+
+---
+
+## Step 6: Serve Semantic Search at Query Time
+
+> **Data Scientist** — *"I want to retrieve the most semantically similar documents for a user's query — with the same Feast API I use for every other feature."*
+
+At serving time, encode the query with the same model used for materialization, then call `retrieve_online_documents_v2()`:
+
+```python
+from feast import FeatureStore
+from sentence_transformers import SentenceTransformer
+
+store = FeatureStore(repo_path=".")
+embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+
+def semantic_search(query: str, top_k: int = 5) -> list[dict]:
+    query_embedding = embed_model.encode(query).tolist()
+
+    result_df = store.retrieve_online_documents_v2(
+        features=[
+            "review_embeddings:embedding",
+            "review_embeddings:embed_text",
+            "review_embeddings:item_id",
+            "review_embeddings:rating",
+            "review_embeddings:review_title",
+        ],
+        query=query_embedding,
+        top_k=top_k,
+        distance_metric="COSINE",
+    ).to_df()
+
+    return result_df.to_dict(orient="records")
+```
+
+The `retrieve_online_documents_v2()` call routes to Milvus, performs an approximate nearest-neighbor search using the IVF_FLAT index, and returns the top-k results with their feature values. The same features available at materialization time — `item_id`, `rating`, `review_title`, `embed_text` — are returned alongside the similarity score.
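+
+To make the search step concrete, here is the exact (brute-force) version of the ranking that an approximate index like IVF_FLAT speeds up. It is a standalone sketch with toy 2-dimensional vectors; `cosine` and `top_k` are hypothetical helpers and do not call Milvus or Feast.
+
+```python
+import math
+from typing import Dict, List, Tuple
+
+
+def cosine(a: List[float], b: List[float]) -> float:
+    """Cosine similarity; returns 0.0 when either vector has zero length."""
+    dot = sum(x * y for x, y in zip(a, b))
+    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
+    return dot / norm if norm else 0.0
+
+
+def top_k(query: List[float], corpus: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
+    """Score every stored vector against the query and keep the k most similar."""
+    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in corpus.items()]
+    scored.sort(key=lambda pair: pair[1], reverse=True)
+    return scored[:k]
+
+
+corpus = {"r1": [1.0, 0.0], "r2": [0.6, 0.8], "r3": [0.0, 1.0]}
+hits = top_k([1.0, 0.0], corpus, k=2)
+print(hits)
+```
+
+An exact search like this is also a convenient baseline for measuring the recall of the approximate index on a small sample.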
+
+For a FastAPI serving endpoint with Prometheus metrics:
+
+```python
+from contextlib import asynccontextmanager
+from fastapi import FastAPI
+from feast import FeatureStore
+from prometheus_client import Histogram, make_asgi_app
+from sentence_transformers import SentenceTransformer
+
+RETRIEVAL_DURATION = Histogram(
+    "rag_retrieval_duration_seconds",
+    "Feast/Milvus vector search latency",
+    buckets=[.01, .025, .05, .1, .25, .5, 1],
+)
+
+store: FeatureStore | None = None
+embed_model: SentenceTransformer | None = None
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    global store, embed_model
+    store = FeatureStore(repo_path=".")
+    embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+    yield
+
+app = FastAPI(lifespan=lifespan)
+app.mount("/metrics", make_asgi_app())
+
+@app.post("/search")
+def search(query: str, top_k: int = 5):
+    with RETRIEVAL_DURATION.time():
+        query_vec = embed_model.encode(query).tolist()
+        return store.retrieve_online_documents_v2(
+            features=[
+                "review_embeddings:embedding",
+                "review_embeddings:embed_text",
+                "review_embeddings:item_id",
+            ],
+            query=query_vec,
+            top_k=top_k,
+            distance_metric="COSINE",
+        ).to_df().to_dict(orient="records")
+```
+
+---
+
+## GPU vs CPU: RAPIDS Acceleration Benchmark
+
+> 📊 **Coming soon:** end-to-end materialization benchmarks for the embedding BFV comparing:
+> - CPU-only Spark executors (`local[*]`, 8 cores)
+> - k8s:// Spark executors, CPU only
+> - k8s:// Spark executors with RAPIDS GPU plugin (A100, CUDA 13)
+>
+> Metrics: total materialization time, vectors/second throughput, cost per million embeddings. Results will be added to this post.
+
+---
+
+## What You Can Build With This
+
+> **Data Scientist + MLOps Engineer** — *"Tabular features and semantic search from the same pipeline, the same registry, the same materialization schedule, the same serving API."*
+
+**One cluster, two workloads.** Tabular BFVs (user aggregations, item statistics) and embedding BFVs (semantic search vectors) run on the same Spark GPU cluster, scheduled by the same `feast materialize` command. GPU nodes that sit idle between tabular feature runs generate embeddings during those windows.
+
+**No training-serving skew for retrieval.** The embedding model used at materialization time is the same model used at query time: both load `sentence-transformers/all-MiniLM-L6-v2`. The Feast registry keeps the feature view definition, including the transformation logic, versioned and consistent across runs; the query-side model is loaded by your serving code, so pin the same model name and version there.
+
+**Scale to millions of vectors.** With memory-bounded, chunked writes, materializing 8M+ 384-dim embedding vectors to Milvus is practical on a 2-pod GPU Spark cluster without OOMKill.
+
+**Incremental updates.** `feast materialize-incremental` only processes new documents since the last materialization. Embedding generation doesn't re-run on the full corpus every refresh cycle — only on newly ingested data.
+
+**Extend to RAG.** The same pattern extends naturally to full RAG pipelines: retrieve similar documents via `retrieve_online_documents_v2()`, combine with user context features from Redis via `get_online_features()`, and pass the assembled context to an LLM. Feast manages both the vector retrieval and the entity-based feature retrieval through a unified API.
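+
+The context-assembly step of that RAG flow might look like the sketch below. It assumes the retrieved rows have already been converted to dictionaries keyed by the feature names used earlier (`item_id`, `review_title`, `rating`, `embed_text`); `build_prompt` and the user feature names are hypothetical, and the LLM call itself is left out.
+
+```python
+from typing import Dict, List
+
+
+def build_prompt(question: str, docs: List[Dict], user_features: Dict) -> str:
+    """Combine retrieved reviews and entity features into a single prompt string."""
+    context_lines = [
+        f"- [{d['item_id']}] {d['review_title']} (rating {d['rating']}): {d['embed_text']}"
+        for d in docs
+    ]
+    profile = ", ".join(f"{k}={v}" for k, v in sorted(user_features.items()))
+    parts = [
+        f"User profile: {profile}",
+        "Relevant reviews:",
+        *context_lines,
+        f"Question: {question}",
+    ]
+    return "\n".join(parts)
+
+
+docs = [
+    {
+        "item_id": "B001",
+        "review_title": "Solid headphones",
+        "rating": 5,
+        "embed_text": "Great sound for the price.",
+    }
+]
+user_features = {"avg_rating": 4.2, "review_count": 17}  # hypothetical feature names
+prompt = build_prompt("Which headphones are good value?", docs, user_features)
+print(prompt)
+```
+
+Keeping the assembly in a plain function makes it easy to test separately from both the retrieval call and the model call.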
+
+---
+
+## Getting Started
+
+```bash
+pip install "feast[spark,milvus]"
+```
+
+Configure Milvus as your online store, define your embedding BFV with `vector_index=True` on the embedding field, and run:
+
+```bash
+feast apply
+feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
+```
+
+See also:
+- [BatchFeatureView documentation](https://docs.feast.dev/reference/batch-feature-view)
+- [Milvus online store reference](https://docs.feast.dev/reference/online-stores/milvus)
+- [retrieve_online_documents_v2 API](https://docs.feast.dev/reference/feature-retrieval)
+- [RAG with Feast overview](./rag-with-feast)
+
+---
+
+## Share Your Feedback
+
+We want to hear from you! Try the GPU embedding pipeline and tell us:
+
+- Which embedding models and vector stores you're using with Feast + Spark
+- How `feast materialize-incremental` fits into your embedding refresh workflow
+- What RAG patterns you're building on top of `retrieve_online_documents_v2()`
+
+Join the conversation on [GitHub](https://github.com/feast-dev/feast) or in the [Feast Slack community](https://slack.feast.dev). Your feedback directly shapes what we build next.
diff --git a/infra/website/public/images/blog/batchfeatureview-compute-once.png b/infra/website/public/images/blog/batchfeatureview-compute-once.png
new file mode 100644
index 00000000000..7f8cd934ee1
Binary files /dev/null and b/infra/website/public/images/blog/batchfeatureview-compute-once.png differ
diff --git a/infra/website/public/images/blog/feast-spark-k8s-production.png b/infra/website/public/images/blog/feast-spark-k8s-production.png
new file mode 100644
index 00000000000..4143b98aa33
Binary files /dev/null and b/infra/website/public/images/blog/feast-spark-k8s-production.png differ
diff --git a/infra/website/public/images/blog/spark-gpu-embeddings-rag.png b/infra/website/public/images/blog/spark-gpu-embeddings-rag.png
new file mode 100644
index 00000000000..723de52ba78
Binary files /dev/null and b/infra/website/public/images/blog/spark-gpu-embeddings-rag.png differ