feat: add StringTokenScannerSymbols for configurable multi-character delimiters (fixes #195)#1305
feat: add StringTokenScannerSymbols for configurable multi-character delimiters (fixes #195)#1305jmoraleda wants to merge 5 commits into
Conversation
jasmith-hs
left a comment
There was a problem hiding this comment.
@jmoraleda , since they're almost fully different implementations, what are your thoughts on having there be a StringTokenScanner and TreeParser determines whether the scanner it uses is TokenScanner or StringTokenScanner based on whether symbols.isStringBased?
|
Hello @jasmith-hs. Thank you. Good idea. Done. |
|
Hello @jasmith-hs I wonder if you have time to review and move forward with this and the two related PR's. To recap, the PR's are built on top of each other, so it makes sense to review one only after the previous one has been merged. This is the order: Thank you! |
4d42df5 to
49ce62a
Compare
|
@jmoraleda I'll try to get to it this week if my week isn't too chaotic |
|
@jasmith-hs Thank you. I just updated all three PR's to migrate to the most recent config structures so they will merge cleanly into master. |
This PR supersedes #1303 (same content from the correct branch).
Here's the revised PR description:
Title:
feat: add StringTokenScannerSymbols for configurable multi-character delimiters (fixes #195)Description:
Closes #195.
Python's Jinja2 allows full customization of the six delimiter strings via its
Environmentconstructor (block_start_string,block_end_string,variable_start_string,variable_end_string,comment_start_string,comment_end_string), plusline_statement_prefixandline_comment_prefix. Jinjava had no equivalent, making it impossible to use Jinja-style templating in contexts where{{,{%, or{#appear as literal content (e.g. LaTeX documents, some JSON schemas, or Kubernetes YAML with Helm-style markers).What this PR adds:
A new
StringTokenScannerSymbolsclass with a builder API that allows all six delimiter strings to be configured independently, with no constraint on length or shared prefix characters:Changes:
StringTokenScannerSymbols(new) — builder-configuredTokenScannerSymbolsimplementation. Uses Unicode Private Use Area sentinel characters as internal token-kind discriminators soToken.newToken()dispatches correctly without changes toToken.TokenScanner— adds a string-matching scan path (getNextTokenStringBased()) activated whensymbols.isStringBased()is true. The original char-based path is completely unchanged. Also supportslineStatementPrefixandlineCommentPrefix, matching Python Jinja2 semantics including indented prefixes.TokenScannerSymbols— addsisStringBased()(defaultfalse), six delimiter-length accessors (getTagStartLength()etc.), and two optional line-prefix accessors (getLineStatementPrefix(),getLineCommentPrefix()). All default implementations preserve existing behaviour.TagToken,ExpressionToken,NoteToken— replaced hardcoded delimiter offsets with calls to the new length accessors onsymbols. This is a correctness fix that affects allTokenScannerSymbolsimplementations, not justStringTokenScannerSymbols:ExpressionToken.parse()was callingWhitespaceUtils.unwrap(image, "{{", "}}")with literal strings regardless of the configured symbols, meaning any custom char-based subclass (like the one inCustomTokenScannerSymbolsTest) would silently fail to strip its expression delimiters. The fix usessymbols.getExpressionStart()andsymbols.getExpressionEnd()instead.Backward compatibility:
The char-based scan path and all existing
TokenScannerSymbolssubclasses are completely unaffected. The new length accessors onTokenScannerSymbolsdefault togetTheCorrespondingString().length(), which forDefaultTokenScannerSymbolsalways returns2. The full test suite passes without modification.