Member

@macchiati macchiati commented Nov 10, 2024

Code to generate the property files and test data for UTS 58

  • Incorporates changes from the PAG
  • Also some wording updates
  • The Link_Email property is also modified to be narrower. The reasons for this are:
    • If we have to make changes later, it is less disruptive to broaden the character set than to narrow it.
    • Non-ASCII characters are currently less commonly supported.
    • I went with the identifiers from UAX 31, modified by what is valid in the ASCII ranges for the local-part:
      • \p{XID_Continue}
      • [\p{block=basic_latin}-\p{Cc}] // ASCII
      • -[\u0020 ; : " ( ) [ ] @ \ < >] // email exclusions from ASCII
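
One way to read the set expression above is as a per-character predicate. The sketch below is only an approximation under stated assumptions: it uses Python's `str.isidentifier()` as a stand-in for `\p{XID_Continue}` (so it follows whatever Unicode version the interpreter ships), reads the ASCII line as U+0020..U+007E (Basic Latin minus controls), and applies the exclusion list to the union. The names `is_xid_continue` and `in_narrow_link_email_set` are hypothetical and are not part of this PR's Java tooling.

```python
# Hypothetical helper for illustration; not the generator code from this PR.
# ASCII characters excluded for the local-part, copied from the list above.
EMAIL_ASCII_EXCLUSIONS = set(' ;:"()[]@\\<>')


def is_xid_continue(ch: str) -> bool:
    # Assumption: Python accepts a character in a non-initial identifier
    # position exactly when it is XID_Continue in the interpreter's
    # Unicode version.
    return ("a" + ch).isidentifier()


def in_narrow_link_email_set(ch: str) -> bool:
    """True if the single character ch is in the narrowed set as read above."""
    if ch in EMAIL_ASCII_EXCLUSIONS:
        return False
    printable_ascii = 0x20 <= ord(ch) <= 0x7E  # Basic Latin minus Cc
    return printable_ascii or is_xid_continue(ch)
```

Because none of the excluded characters is XID_Continue, it makes no difference to this sketch whether the exclusions are subtracted from the ASCII part only or from the whole union.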

See also the related spec changes in https://github.com/unicode-org/unicode-reports/pull/247

@macchiati macchiati marked this pull request as draft November 10, 2024 18:20
Member Author

macchiati commented Dec 2, 2025

TODO:

  1. Update data file & test data generator to match rev 1 draft 5.
  2. Add newer test data from ICANN.
  3. Change the data file folder to /Public/<version>/linkification/
  4. Change the filename SerializationTest.txt to FormattingTest.txt
  5. Create a unicodetools PR with the code, get it reviewed & merged.

@macchiati macchiati changed the title from "Linkification testing" to "Linkification Data files and tooling" Dec 4, 2025
@macchiati macchiati force-pushed the Linkification-testig branch from eebcf83 to 992ff07 December 4, 2025 23:34
@macchiati macchiati marked this pull request as ready for review December 14, 2025 07:51
@macchiati
Member Author

The following failures are a mystery to me:

@eggrobin
Member

@macchiati:

> The following failures are a mystery to me:

Expected until β, ignore those.

That is a failure in your new test LinkUtilitiesTest, so you better demystify it!

Error:  Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 4.18 s <<< FAILURE! - in org.unicode.unittest.LinkUtilitiesTest
Error:  org.unicode.unittest.LinkUtilitiesTest.testMinimumEscaping  Time elapsed: 0.004 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: 10) {QUERY=a%3D%26%=%3D%26%} ==> expected: <?a%3D%26%=%3D%26%> but was: <?a%253D%2526%=%253D%2526%>
	at org.unicode.unittest.LinkUtilitiesTest.testMinimumEscaping(LinkUtilitiesTest.java:200)
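
The two strings in the assertion differ in one respect only: every `%` that is followed by two hex digits has become `%25`, while the lone `%` characters are unchanged. A minimal sketch of an escaper with that behaviour turns the `expected` query into the `but was` query. This is only a guess at the mechanism, written in Python for brevity; `protect_percent_triplets` is a hypothetical name and is not taken from the PR's Java code.

```python
import re

# A '%' that begins something that looks like a percent-escape (%XX).
_PERCENT_TRIPLET = re.compile(r"%(?=[0-9A-Fa-f]{2})")


def protect_percent_triplets(text: str) -> str:
    """Escape '%' as '%25' only where it would otherwise read as %XX."""
    return _PERCENT_TRIPLET.sub("%25", text)


expected_query = "a%3D%26%=%3D%26%"  # query part of the expected value
print(protect_percent_triplets(expected_query))
# prints a%253D%2526%=%253D%2526%
```

If that is what happens, the open question is whether the test input is meant as already-escaped text (in which case `%3D` should pass through untouched) or as literal text (in which case `%253D` is the faithful output).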

Member

@markusicu markusicu left a comment

Review of only the data files so far.

# All code points not explicitly listed for Link_Bracket
# have the value undefined.
#
# @missing: 0000..10FFFF; undefined
Member

I don't think we have ever had an undefined missing value.
Let's make it the same as for Bidi_Paired_Bracket and lots of other properties:

Suggested change
- # @missing: 0000..10FFFF; undefined
+ # @missing: 0000..10FFFF; <none>
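
For consumers of the file, the practical effect of this suggestion is on how the default value is read. A rough sketch of a reader for `@missing` lines in this syntax follows. The function name `parse_missing_line` is hypothetical, and mapping `<none>` to Python's `None` is an assumption about how a client might represent "no value".

```python
import re

# Matches lines such as "# @missing: 0000..10FFFF; <none>".
_MISSING = re.compile(
    r"^#\s*@missing:\s*([0-9A-F]{4,6})\.\.([0-9A-F]{4,6})\s*;\s*(.+?)\s*$"
)


def parse_missing_line(line: str):
    """Return (first, last, default) for an @missing line, else None."""
    m = _MISSING.match(line)
    if m is None:
        return None
    first, last = int(m.group(1), 16), int(m.group(2), 16)
    default = m.group(3)
    # Assumption: "<none>" means "no value" and is surfaced as None.
    return first, last, (None if default == "<none>" else default)
```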

# The short name of the property is the same as its long name.
#
# All code points not explicitly listed for Link_Bracket
# have the value undefined.
Member

Suggested change
- # have the value undefined.
+ # have the value <none>.

see below

# ================================================

0029 ; 0028 #1.1 () ⇒ () RIGHT PARENTHESIS
003E ; 003C #1.1 (&gt; ⇒ &lt;) GREATER-THAN SIGN
Member

Escaping < and > as HTML entities in this context seems weird.

# Format
#
# Field 0: code point range
# Field 1: binary value
Member

I guess this works, but it looks a bit odd. The second column is basically useless.

This must be the first time that we have a data file dedicated to one single binary property.

When we list binary properties in multi-property data files, we don't have a value column. By analogy, we only really need a single code point column here.
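
To illustrate the single-column layout suggested here, a sketch of a reader that treats every listed code point or range as Yes and everything else as No. The sample lines are made up for illustration and are not taken from the generated file; `parse_binary_property` is a hypothetical name.

```python
def parse_binary_property(lines):
    """Collect code points from 'XXXX' or 'XXXX..YYYY' lines into a set."""
    members = set()
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop any trailing comment
        if not line:
            continue  # blank line or comment-only line
        first, _, last = line.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        members.update(range(start, end + 1))
    return members


# Illustrative input only; not real Link_Email data.
sample = [
    "# illustrative lines only",
    "0030..0039  # digits",
    "005F        # LOW LINE",
]
link_email = parse_binary_property(sample)
```

With this layout no `@missing` line is needed: membership in the set is the Yes value, and absence is the No default.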

# All code points not explicitly listed for Link_Email
# have the value No.
#
# @missing: 0000..10FFFF; No
Member

Same as above: we haven't written @missing lines for binary properties; they all just default to No.

Of course the @missing line, when present in a data file, should have the same syntax as the value lines in the same file. So if we explicitly print Yes, then I guess we should have this line as well. But I think that just shows even more how weird it is. I think we should omit the Yes values and also omit the @missing line.

See @example.😎

# No local-part
See ⸠@example.com⸡
Member

should this really linkify?

See ..john.doe@example.com

#Quoted local-parts (not in the base algorithm).
See "john\ ⸠doe@example.com⸡"
Member

in formal email syntax, the closing quote would be right before the at sign, correct?

# Field 4: Result — with minimal escaping
#
# Empty lines, and lines starting with # are ignored.
# Spaces around the semicolons are ignored.
Member

Please swap this line with the next one.

# Spaces around the semicolons are ignored.
# Otherwise # is treated like any other character.
#
# The Path, Query, and Fragment will contain backslash escapes when characters would otherwise be
Member

I have questions in the "Notes" doc about the purpose and behavior of the backslash syntax. You didn't respond there. I think we should drop this.

https:// ; mad.wikipedia.org ; /wiki/Tasè’ ; ; ; https://mad.wikipedia.org/wiki/Tasè%E2%80%99
https:// ; wuu.wikipedia.org ; /wiki/聖保羅(巴西) ; ; ; https://wuu.wikipedia.org/wiki/聖保羅(巴西)
https:// ; vep.wikipedia.org ; /wiki/Brüssel' ; ; ; https://vep.wikipedia.org/wiki/Brüssel%27
https:// ; tw.wikipedia.org ; /wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web;_WWW_) ; ; ; https://tw.wikipedia.org/wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web;_WWW_)
Member

you can't have a literal ; in a field in a semicolon-separated file -- see the "Notes" doc
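
A quick way to see the problem is to split the last quoted test line the way the file header describes (semicolon-separated, surrounding spaces ignored) and compare the field count with a neighbouring line. This is a minimal sketch, not the parser used by the test code; `split_fields` is a hypothetical name.

```python
def split_fields(line: str):
    """Split a data line on ';' and trim spaces around each field."""
    return [field.strip() for field in line.split(";")]


ok = ("https:// ; vep.wikipedia.org ; /wiki/Brüssel' ; ; ; "
      "https://vep.wikipedia.org/wiki/Brüssel%27")
bad = ("https:// ; tw.wikipedia.org ; "
       "/wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web;_WWW_) ; ; ; "
       "https://tw.wikipedia.org/wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web;_WWW_)")

print(len(split_fields(ok)))   # 6
print(len(split_fields(bad)))  # 8
```

In the second line the path is cut into `/wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web` and `_WWW_)`, so a reader cannot tell where the path ends without some escaping convention for `;`.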
