[migrations] Spark-to-Feldera migration tool PoC. #5837
Draft
wilmaontherun wants to merge 25 commits into main
f878931 to 3f816fb
CLI tool using LLM to translate and syntactically validate Spark SQL programs to Feldera SQL.
Signed-off-by: Wilma <wilmaontherun@gmail.com>
4d43839 to 046f795
Signed-off-by: feldera-bot <feldera-bot@feldera.com>
mihaibudiu (Contributor) reviewed on Mar 16, 2026:
We should build a library with compatibility functions that people can just reuse, especially if they can be written in SQL.
addressed remaining comments
Signed-off-by: feldera-bot <feldera-bot@feldera.com>
Signed-off-by: feldera-bot <feldera-bot@feldera.com>
Signed-off-by: feldera-bot <feldera-bot@feldera.com>
- Added --model CLI option to translate, translate-file, and example commands
- Model and compiler path now read exclusively from .env / CLI flags
- Removed OpenAI provider support (untested)
- Removed hardcoded default compiler path
- Updated README for consistency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
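The "CLI flags first, then .env" lookup described in this commit can be sketched roughly as follows. This is an illustrative sketch only, not the PR's code: the `--model` and `--compiler` flags and the `FELDERA_COMPILER` variable appear in the commit messages, but the `felderize` program name, the `FELDERIZE_MODEL` variable, and the `resolve_settings` helper are assumptions made for the example. It also assumes the `.env` values have already been loaded into the environment mapping.

```python
import argparse
import os


def resolve_settings(argv, env=None):
    """Return (model, compiler), preferring CLI flags over environment values.

    There is deliberately no hardcoded default: a missing value is an error.
    """
    env = os.environ if env is None else env

    parser = argparse.ArgumentParser(prog="felderize")  # program name is assumed
    parser.add_argument("--model", default=None)
    parser.add_argument("--compiler", default=None)
    args = parser.parse_args(argv)

    # FELDERIZE_MODEL is a hypothetical variable name used only in this sketch.
    model = args.model or env.get("FELDERIZE_MODEL")
    compiler = args.compiler or env.get("FELDERA_COMPILER")

    missing = [name for name, value in (("model", model), ("compiler", compiler)) if not value]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    return model, compiler
```

Raising when a value is missing, instead of falling back to a built-in path, mirrors the stated removal of the hardcoded default compiler path.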
6bd2d52 to 14e7cc6
- Replaced custom semicolon scanner with sqlparse.split(), which handles string literals, comments, and block comments correctly
- Added sqlparse>=0.5.0 to dependencies, removed openai dependency
- Fixed README: clarified FELDERA_COMPILER comment (not a default, just repo location)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
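To illustrate why a hand-rolled semicolon scanner is fragile, here is a small sketch of the failure mode this commit is addressing. The `naive_split` helper is hypothetical and is not the scanner that was removed; it only shows what happens when a semicolon inside a string literal is treated as a statement terminator. The sketch sticks to the standard library, so it does not call `sqlparse.split()` itself.

```python
def naive_split(sql: str) -> list[str]:
    """Split on every semicolon, ignoring SQL quoting rules (deliberately wrong)."""
    return [part.strip() for part in sql.split(";") if part.strip()]


script = "INSERT INTO t VALUES ('a;b'); SELECT 1;"

# The literal 'a;b' is cut in half, so two statements come back as three pieces.
pieces = naive_split(script)
print(pieces)
```

A tokenizer-aware splitter such as the `sqlparse.split()` call named in the commit is meant to keep the quoted semicolon inside its statement.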
- llm.py: wrap system prompt in cache_control ephemeral block to enable Anthropic prompt caching; add retry with exponential backoff on rate limits
- translator.py: omit examples on first translation attempt (skills only) to reduce token usage and latency (~20s → ~4s for simple queries)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
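The two llm.py changes can be sketched as below. This is a minimal sketch under stated assumptions and not the code in the PR: `RateLimited`, `call_with_backoff`, and `build_system_blocks` are hypothetical names, the delays and attempt count are arbitrary, and no Anthropic client is called. The system block follows the documented shape for Anthropic prompt caching, a text block carrying `cache_control` with type `ephemeral`.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # 1x, 2x, 4x, ... the base delay, plus jitter to avoid synchronized retries
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


def build_system_blocks(prompt: str) -> list[dict]:
    """Wrap the system prompt in a single block marked as cacheable."""
    return [{"type": "text", "text": prompt, "cache_control": {"type": "ephemeral"}}]
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting.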
- docs.py: replace module-level _FUNC_ANCHORS with per-dir _get_cats_and_anchors() cache
- llm.py: move imports to top level, add unreachable guard
- translator.py: move sqlparse import to top level, fix LLMClient type annotation, remove double-strip
- feldera_client.py: keep f.name usage inside with block
- skills.py: remove redundant intermediate sort
- cli.py: remove untested batch command, fix Status import, add missing --compiler/--model to all commands
- pyproject.toml: remove unused httpx dependency
- README.md: update to reflect removed batch command and full options list
- spark_skills.md: add rewrite rules and unsupported constructs from test investigation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- STRING/TEXT type mapping: STRING→VARCHAR, TEXT→VARCHAR
- Remove duplicate HEX/UNHEX from Hashing section
- Remove CAST(INTERVAL SECOND AS DECIMAL) from Unsupported (contradicted Supported section)
- Window: unify ROWS/RANGE BETWEEN into one Unsupported entry
- split(str,delim,limit): clarify 2-arg form is supported
- LN/LOG10: "runtime error" → "drops the row (WorkerPanic)" for negative input
- TIMESTAMP_NTZ: clarify "replace with TIMESTAMP in DDL"
- FIRST_VALUE/LAST_VALUE notes consistent with Window unsupported section
- Scalar subquery rule: fix incorrect "subquery with FROM → mark unsupported"
- Remove unexplained CREATE TYPE + jsonstring_as_ hint
- trunc(d,'Q'): move to Unsupported (DATE_TRUNC QUARTER fails at runtime)
- make_timestamp: move to Rewritable with PARSE_TIMESTAMP rewrite
- from_unixtime: use TIMESTAMPADD directly (consistent with to_timestamp)
- encode/decode: remove misleading "IS rewritable as CASE WHEN" note
- width_bucket: remove stray extra column
- SIGN: remove misleading "Input/output: DECIMAL" note
- date_format: handle TIMESTAMP input via CAST to DATE
- log(base,x): add examples to reinforce arg swap rule
- [GBD-ARRAY-ORDER]: new GBD entry; annotate ARRAY_UNION/ARRAY_EXCEPT
- Bitwise scalar operators moved to Unsupported section

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: feldera-bot <feldera-bot@feldera.com>
- Add ground truth note: all signatures from spark.apache.org/docs/latest/api/sql/index.html
- Fix unix_millis/unix_micros: take timestamp arg (not no-arg current-time)
- Fix pmod: unified formula MOD(MOD(a,ABS(b))+ABS(b),ABS(b)) for all divisor signs
- Move try_divide/try_add/try_subtract/try_multiply to unsupported (semantic mismatch)
- Move map_entries to unsupported (returns array of structs, no Feldera equivalent)
- Fix from_unixtime: note STRING vs TIMESTAMP type difference, mark fmt-arg as unsupported
- Fix posexplode: subtract 1 from ORDINALITY (Spark 0-based, SQL 1-based)
- Add translate warning: REGEXP_REPLACE treats chars as regex patterns
- Fix lpad/rpad: document optional pad arg (defaults to space)
- Fix months_between: add roundOff note, precise fractional example
- Fix trunc WEEK: move to unsupported (same Sunday/Monday mismatch as date_trunc WEEK)
- Move try_* from String to Math in unsupported section
- Add trunc YYYY/MM/MON aliases, to_date using PARSE_TIMESTAMP

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
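As a quick sanity check of the pmod rewrite quoted in this commit, the formula MOD(MOD(a,ABS(b))+ABS(b),ABS(b)) can be evaluated step by step. The sketch below assumes that MOD returns a remainder whose sign follows the dividend (truncating division); that is an assumption of this example, not something the PR text states. The helper names are hypothetical, and the sketch does not compare the results against Spark's own pmod.

```python
def trunc_mod(a: int, b: int) -> int:
    """Integer remainder whose sign follows the dividend (assumed MOD semantics)."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def pmod_formula(a: int, b: int) -> int:
    """Evaluate MOD(MOD(a, ABS(b)) + ABS(b), ABS(b)) for integers, with b != 0."""
    n = abs(b)
    return trunc_mod(trunc_mod(a, n) + n, n)


# Under the assumption above, the result always lies in [0, |b|).
for a, b in [(7, 3), (-7, 3), (7, -3), (-7, -3)]:
    print(a, b, pmod_formula(a, b))
```

The inner MOD can yield a negative remainder for negative `a`; adding ABS(b) before the outer MOD is what moves it into the non-negative range.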
…vial entries

- Fix json_extraction: lateral alias not supported in Feldera; repeat PARSE_JSON per field
- Fix datediff: use correct DATEDIFF(DAY, start, end) instead of TIMESTAMPDIFF
- Replace null_safe_equality with LOG argument order reversal (critical gotcha)
- Replace nvl_coalesce with LPAD/RPAD rewrite (no native support in Feldera)
- Improve array_map_functions: add element_at(map,key) → map[key], CARDINALITY NULL note
- Add explode_unnest: LATERAL VIEW explode/posexplode/inline → UNNEST patterns
- Add json_extraction: get_json_object → PARSE_JSON + bracket syntax, CTE for GROUP BY
- Remove array_lambda (unsupported-only, no rewrite value)
- Remove row_number_topk (trivial CREATE VIEW wrapper, no translation needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CLI tool using LLM to translate and syntactically validate Spark SQL programs to Feldera SQL.
Requires an Anthropic API key in felderize/.env.
Describe Manual Test Plan
No automated tests yet. Tested manually using examples in the demo folder.