feat: Implement data quality field contraint protos#366
Conversation
| if len(proto.allowed_values) > 0: | ||
| kwargs["allowed_values"] = list(proto.allowed_values) | ||
| # `regex` is a plain string in proto3; empty == unset. | ||
| if proto.regex != "": |
There was a problem hiding this comment.
Saw the empty == unset note here, makes sense for proto3. Only nit: the validator still accepts regex="" and to_proto writes it, so you can create a value that silently reads back as None. Could reject empty regex in _regex_compiles to keep the two ends consistent.
There was a problem hiding this comment.
Nice find. I added a check to reject empty regex strings and a unit test that should reject them as well.
| allowed_values: Optional[List[str]] = None | ||
| regex: Optional[str] = None | ||
| unique: Optional[bool] = None | ||
| custom: Optional[Dict[str, str]] = None |
There was a problem hiding this comment.
Every other field validates at definition time (regex compiles, ranges in [0,1], etc.), but custom predicates are stored as-is, so a bad/empty predicate only fails at the next FE run. Could we reject empty predicate strings here? Maybe a non-empty check would catch the common typo early.
There was a problem hiding this comment.
I added a method _custom_nonempty that checks that the predicate and name are non-empty. Other invalid custom sql expressions will still be accepted by the registry, but we dont have a spark session at the feast apply stage, so basic syntax checks seem like the best we can do for now. The upstream feature engineering job will (likely) have a spark session to perform that check.
What this PR does / why we need it:
Adds typed
FieldConstraintsandImputationproto messages plus matching Python/Pydantic surfaces so customers can declare per-column data quality rules and null-imputation strategies on aSparkSource. Constraints are stored on the source's proto and round-trip through the registry. The runtime that enforces them lives downstream in implementing feature engineering jobs that handle them.Which issue(s) this PR fixes:
Misc