Line normalization before diff #219

@epictecch

Description

The DiffRowGenerator class offers the lineNormalizer property. By default, it is used to replace < and > with their escaped versions &lt; and &gt;.

The lineNormalizer is applied to the input texts before the diff is calculated. While I see this as a useful feature, with the default settings it may be surprising that the resulting text is no longer properly HTML-escaped:

final var generator = DiffRowGenerator.create() //
        .mergeOriginalRevised(true) //
        .showInlineDiffs(true) //
        .inlineDiffByWord(true) //
        .build();

final var rows = generator.generateDiffRows(List.of("hello <world>"), List.of("bye >world<"));

final var resultingText = rows.stream() //
        .map(DiffRow::getOldLine) //
        .collect(Collectors.joining(StringUtils.LF));

The resulting text is

<span class="editOldInline">hello</span><span class="editNewInline">bye</span> &<span class="editOldInline">lt</span><span class="editNewInline">gt</span>;world&<span class="editOldInline">gt</span><span class="editNewInline">lt</span>;

Note that the & is treated as unchanged text because both replacements, &lt; and &gt;, start with an ampersand. The inline diff markup therefore ends up inside the escaped entities, and the resulting text is no longer valid HTML.
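
To illustrate why the ampersand ends up as shared text, here is a minimal sketch using only the JDK. The class name, the `escape` helper and the `DELIMITERS` pattern are hypothetical and only mimic the idea of a word splitter that treats & and ; as delimiters; this is not the library's actual SPLIT_BY_WORD_PATTERN.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedTokenDemo {
    // Hypothetical delimiter set (assumption), NOT the library's pattern.
    static final Pattern DELIMITERS = Pattern.compile("\\s+|[&;<>]");

    // Mimics the default normalization described above: escape < and > only.
    static String escape(String line) {
        return line.replace("<", "&lt;").replace(">", "&gt;");
    }

    // Splits text into tokens and keeps the delimiters as tokens of their own.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = DELIMITERS.matcher(text);
        int pos = 0;
        while (m.find()) {
            if (m.start() > pos) {
                tokens.add(text.substring(pos, m.start()));
            }
            tokens.add(m.group());
            pos = m.end();
        }
        if (pos < text.length()) {
            tokens.add(text.substring(pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Escaping happens first, tokenizing second, as in the issue.
        System.out.println(tokenize(escape("<world>"))); // [&, lt, ;, world, &, gt, ;]
        System.out.println(tokenize(escape(">world<"))); // [&, gt, ;, world, &, lt, ;]
    }
}
```

Only the tokens `lt` and `gt` differ between the two lists, so a word-level diff would wrap just those in markup and leave `&` and `;` outside of it, which matches the broken output shown above.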

In order for this behaviour to be a problem, the following conditions must all be true:

  1. inlineDiffByWord must be enabled
  2. The default lineNormalizer must be used
  3. The two provided texts must differ at a position that starts with a character replaced by the lineNormalizer
  4. A release >= 4.15 must be used

Workaround
Override the lineNormalizer, e.g. by using the SPLIT_BY_WORD_PATTERN of release 4.12, in which the ampersand was not considered a character that splits words.
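
As a rough illustration of why a splitter that ignores the ampersand avoids the problem, the following JDK-only sketch tokenizes already-escaped text with a hypothetical pattern in which neither & nor ; is a delimiter. The class name and the pattern are assumptions for illustration; the exact 4.12 pattern is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeEntityTokenDemo {
    // Hypothetical delimiter set (assumption): whitespace and angle brackets only.
    static final Pattern DELIMITERS = Pattern.compile("\\s+|[<>]");

    // Splits text into tokens and keeps the delimiters as tokens of their own.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = DELIMITERS.matcher(text);
        int pos = 0;
        while (m.find()) {
            if (m.start() > pos) {
                tokens.add(text.substring(pos, m.start()));
            }
            tokens.add(m.group());
            pos = m.end();
        }
        if (pos < text.length()) {
            tokens.add(text.substring(pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The input is already escaped, so no raw < or > is left to split on.
        System.out.println(tokenize("hello &lt;world&gt;")); // [hello,  , &lt;world&gt;]
    }
}
```

With such a pattern each escaped entity stays inside a single token, so a word-level diff would mark the whole token as changed instead of cutting through `&lt;` or `&gt;`.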

Solution approaches
IMHO, the SPLIT_BY_WORD_PATTERN of release 4.15+ is fine and I do not consider it to be the problem.

The library could offer one of the following features:

  1. a parameter that defines when the lineNormalizer is applied (before or after diffing)
  2. a second type of line normalizer that is applied after diffing
  3. an option to have the library apply the processDiffs function to non-diffs as well
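
The idea behind approaches 1 and 2, escaping after the comparison rather than before it, can be sketched with the JDK alone. Everything here is hypothetical: the class name, the delimiter pattern, and in particular `render`, which uses a naive position-wise token comparison as a stand-in for a real diff algorithm and only handles inputs with the same number of tokens. It is not how the library works internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeAfterCompareDemo {
    // Hypothetical delimiters for raw (unescaped) text; not the library's pattern.
    static final Pattern DELIMITERS = Pattern.compile("\\s+|[<>]");

    // Escapes < and > only, mirroring the default normalization in the issue.
    static String escape(String s) {
        return s.replace("<", "&lt;").replace(">", "&gt;");
    }

    // Splits text into tokens and keeps the delimiters as tokens of their own.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = DELIMITERS.matcher(text);
        int pos = 0;
        while (m.find()) {
            if (m.start() > pos) {
                tokens.add(text.substring(pos, m.start()));
            }
            tokens.add(m.group());
            pos = m.end();
        }
        if (pos < text.length()) {
            tokens.add(text.substring(pos));
        }
        return tokens;
    }

    // Compares raw tokens position by position, then escapes each token on output.
    static String render(String oldLine, String newLine) {
        List<String> a = tokenize(oldLine);
        List<String> b = tokenize(newLine);
        if (a.size() != b.size()) {
            throw new IllegalArgumentException("sketch only handles equal token counts");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < a.size(); i++) {
            if (a.get(i).equals(b.get(i))) {
                out.append(escape(a.get(i)));
            } else {
                out.append("<span class=\"editOldInline\">").append(escape(a.get(i))).append("</span>");
                out.append("<span class=\"editNewInline\">").append(escape(b.get(i))).append("</span>");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("hello <world>", "bye >world<"));
    }
}
```

Because the comparison runs on the raw text, each `<` or `>` is a token of its own and is escaped only when written out, so every entity stays intact inside its span.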
