Linkification Data files and tooling #961
TODO:
|
…mple test cases, match spec better
…p wikipedia pages), gather statistics on mismatches.
Plus fixed ICANN file
eebcf83 to 992ff07
|
The following failures are a mystery to me:
|
Expected until β, ignore those.
That is a failure in your new test LinkUtilitiesTest, so you better demystify it!
markusicu
left a comment
Review of only the data files so far.
> See @example.😎
>
> # No local-part
> See ⸠@example.com⸡
should this really linkify?
> # Spaces around the semicolons are ignored.
> # Otherwise # is treated like any other character.
> #
> # The Path, Query, and Fragment will contain backslash escapes when characters would otherwise be
I have questions in the "Notes" doc about the purpose and behavior of the backslash syntax. You didn't respond there. I think we should drop this.
|
I think I got all of them except for the @example.com; that needs a code
fix that I'll get to later today.
On Mon, Dec 15, 2025 at 8:17 PM Markus Scherer ***@***.***> wrote:
***@***.**** requested changes on this pull request.
Review of only the data files so far.
------------------------------
In unicodetools/data/linkification/dev/LinkBracket.txt:
> +# Property: Link_Bracket
+# Format
+#
+# Field 0: code point
+# Field 1: code point
+# For more information, see https://www.unicode.org/reports/tr58/#property-data.
+#
+# For the purpose of link detection and formatting operations, the property Link_Bracket is defined as
+# a string property whose value is either a single code point or is undefined.
+#
+# The short name of the property is the same as its long name.
+#
+# All code points not explicitly listed for Link_Bracket
+# have the value undefined.
+#
+# @missing: 0000..10FFFF; undefined
I don't think we have ever had an undefined missing value.
Let's make it the same as for Bidi_Paired_Bracket and lots of other
properties:
⬇️ Suggested change
-# @missing: 0000..10FFFF; undefined
+# @missing: 0000..10FFFF; <none>
done
(Thanks, you can tell I haven't done this in a while)
------------------------------
In unicodetools/data/linkification/dev/LinkBracket.txt:
> +# ================================================
+#
+# Property: Link_Bracket
+# Format
+#
+# Field 0: code point
+# Field 1: code point
+# For more information, see https://www.unicode.org/reports/tr58/#property-data.
+#
+# For the purpose of link detection and formatting operations, the property Link_Bracket is defined as
+# a string property whose value is either a single code point or is undefined.
+#
+# The short name of the property is the same as its long name.
+#
+# All code points not explicitly listed for Link_Bracket
+# have the value undefined.
⬇️ Suggested change
-# have the value undefined.
+# have the value <none>.
see below
------------------------------
In unicodetools/data/linkification/dev/LinkBracket.txt:
> +# For more information, see https://www.unicode.org/reports/tr58/#property-data.
+#
+# For the purpose of link detection and formatting operations, the property Link_Bracket is defined as
+# a string property whose value is either a single code point or is undefined.
+#
+# The short name of the property is the same as its long name.
+#
+# All code points not explicitly listed for Link_Bracket
+# have the value undefined.
+#
+# @missing: 0000..10FFFF; undefined
+#
+# ================================================
+
+0029 ; 0028 #1.1 () ⇒ () RIGHT PARENTHESIS
+003E ; 003C #1.1 (> ⇒ <) GREATER-THAN SIGN
escaping < and > in this context seems weird
Ah, I'd noticed that but forgot to fix it. It uses a transliterator to make
sure values are kosher, but I had the wrong one.
fixed
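
For readers following the file format discussed above, here is a minimal sketch of how a consumer might read Link_Bracket data lines. It assumes only what the quoted header and sample lines show: two hexadecimal code point fields separated by a semicolon, a trailing `#` comment, and unlisted code points defaulting to `<none>`. The function names are hypothetical and are not part of unicodetools.

```python
# Hypothetical helpers, not part of unicodetools. They assume the line shape
# shown in the quoted sample: "0029 ; 0028 # comment".

def parse_link_bracket_line(line):
    """Return (field0_cp, field1_cp), or None for blank and comment-only lines."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}: {line!r}")
    return int(fields[0], 16), int(fields[1], 16)


def load_link_bracket(lines):
    """Build a dict from field 0 to field 1; absent keys model the <none> default."""
    table = {}
    for line in lines:
        parsed = parse_link_bracket_line(line)
        if parsed is not None:
            table[parsed[0]] = parsed[1]
    return table


sample = [
    "# @missing: 0000..10FFFF; <none>",
    "",
    "0029 ; 0028 # RIGHT PARENTHESIS",
    "003E ; 003C # GREATER-THAN SIGN",
]
table = load_link_bracket(sample)
```

With this shape, `table.get(cp)` returning `None` plays the role of the `<none>` missing value agreed on above.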
------------------------------
In unicodetools/data/linkification/dev/LinkEmail.txt:
> @@ -0,0 +1,1298 @@
+# LinkEmail.txt
+# Date: 2025-12-14, 06:39:23 GMT
+# © 2025 Unicode®, Inc.
+# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
+# For terms of use and license, see https://www.unicode.org/terms_of_use.html
+#
+# The usage and stability of these values is covered in https://www.unicode.org/reports/tr58/
+#
+# ================================================
+#
+# Property: Link_Email
+# Format
+#
+# Field 0: code point range
+# Field 1: binary value
I guess this works, but it looks a bit odd. The second column is basically
useless.
This must be the first time that we have a data file dedicated to one
single *binary* property.
When we list binary properties in multi-property data files, we don't have
a value column. By analogy, we only really need a single code point column
here.
------------------------------
In unicodetools/data/linkification/dev/LinkEmail.txt:
> +# Property: Link_Email
+# Format
+#
+# Field 0: code point range
+# Field 1: binary value
+# For more information, see https://www.unicode.org/reports/tr58/#property-data.
+#
+# For the purpose of link detection and formatting operations, the property Link_Email is defined as
+# a binary property.
+#
+# The short name of the property is the same as its long name.
+#
+# All code points not explicitly listed for Link_Email
+# have the value No.
+#
+# @missing: 0000..10FFFF; No
same as above -- we haven't written @missing lines for binary properties;
they just all default to No
of course the @missing line, when present in a data file, should have the
same syntax as the value lines in the same file. so if we explicitly print
Yes then i guess we should have this line as well. but i think it just
shows even more how weird it is. i think we should omit the Yes values
and also omit the @missing line.
done
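
As a rough illustration of the single-column layout suggested above, the sketch below reads a binary-property file in which each data line is just a code point or a code point range, with no value column and no @missing line. The exact line syntax is an assumption modeled on other Unicode data files, the helper names are hypothetical, and the sample ranges are placeholders rather than real Link_Email data.

```python
# Hypothetical reader for a single-column binary property file.
# Assumed line shapes: "XXXX # comment" or "XXXX..YYYY # comment".

def parse_code_point_range(line):
    """Return (first, last) inclusive, or None for blank and comment-only lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    start, _, end = data.partition("..")
    first = int(start, 16)
    last = int(end, 16) if end else first
    return first, last


def load_binary_property(lines):
    """Collect every listed code point; anything not in the set is implicitly No."""
    members = set()
    for line in lines:
        parsed = parse_code_point_range(line)
        if parsed is not None:
            members.update(range(parsed[0], parsed[1] + 1))
    return members


# Placeholder data for illustration only, not taken from LinkEmail.txt.
sample = [
    "# Property: Link_Email",
    "0030..0039 # placeholder range",
    "005F       # placeholder single code point",
]
members = load_binary_property(sample)
```

Because membership in the set is the whole value, a reader of this layout needs neither a Yes column nor a default line.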
------------------------------
In unicodetools/data/linkification/dev/LinkTerm.txt:
> +# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
+# For terms of use and license, see https://www.unicode.org/terms_of_use.html
+#
+# The usage and stability of these values is covered in https://www.unicode.org/reports/tr58/
+#
+# ================================================
+#
+# Property: Link_Term
+# Format
+#
+# Field 0: code point range
+# Field 1: a Link_Term value
+# For more information, see https://www.unicode.org/reports/tr58/#property-data.
+#
+# For the purpose of detection and formatting operations, the property Link_Term is defined as
+# mapping each code point to a set of enumerated values.
⬇️ Suggested change
-# mapping each code point to a set of enumerated values.
+# an enumerated property of code points.
like in
https://www.unicode.org/Public/17.0.0/security/IdentifierStatus.txt
done
------------------------------
In unicodetools/data/linkification/dev/LinkTerm.txt:
> +# © 2025 Unicode®, Inc.
+# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
+# For terms of use and license, see https://www.unicode.org/terms_of_use.html
+#
+# The usage and stability of these values is covered in https://www.unicode.org/reports/tr58/
+#
+# ================================================
+#
+# Property: Link_Term
+# Format
+#
+# Field 0: code point range
+# Field 1: a Link_Term value
+# For more information, see https://www.unicode.org/reports/tr58/#property-data.
+#
+# For the purpose of detection and formatting operations, the property Link_Term is defined as
We have been writing "For the purpose of regular expressions" as a hook
for documenting names/types/values of non-UCD properties.
Ah, didn't know the right incantation
done
------------------------------
In unicodetools/data/linkification/dev/LinkBracket.txt:
> +# © 2025 Unicode®, Inc.
+# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
+# For terms of use and license, see https://www.unicode.org/terms_of_use.html
+#
+# The usage and stability of these values is covered in https://www.unicode.org/reports/tr58/
+#
+# ================================================
+#
+# Property: Link_Bracket
+# Format
+#
+# Field 0: code point
+# Field 1: code point
+# For more information, see https://www.unicode.org/reports/tr58/#property-data.
+#
+# For the purpose of link detection and formatting operations, the property Link_Bracket is defined as
We have been writing "For the purpose of regular expressions" as a hook
for documenting names/types/values of non-UCD properties.
done
------------------------------
In unicodetools/data/linkification/dev/LinkEmail.txt:
> +# © 2025 Unicode®, Inc.
+# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
+# For terms of use and license, see https://www.unicode.org/terms_of_use.html
+#
+# The usage and stability of these values is covered in https://www.unicode.org/reports/tr58/
+#
+# ================================================
+#
+# Property: Link_Email
+# Format
+#
+# Field 0: code point range
+# Field 1: binary value
+# For more information, see https://www.unicode.org/reports/tr58/#property-data.
+#
+# For the purpose of link detection and formatting operations, the property Link_Email is defined as
We have been writing "For the purpose of regular expressions" as a hook
for documenting names/types/values of non-UCD properties.
done
------------------------------
In unicodetools/data/linkification/dev/LinkDetectionTest.txt:
> @@ -0,0 +1,434 @@
+# LinkDetectionTest.txt
+# Date: 2025-12-16, 01:20:14 GMT
+# © 2025 Unicode®, Inc.
+# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
+# For terms of use and license, see https://www.unicode.org/terms_of_use.html
+#
+# The usage and stability of these values is covered in https://www.unicode.org/reports/tr58/
+#
+# ================================================
+#
+# Format:
+# Each line contains zero or more marked links, such as ⸠abc.com⸡
+#
+# Operation:# For each line.
⬇️ Suggested change
-# Operation:# For each line.
+# Operation:
+# For each line.
done
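
To make the test-file format above concrete, here is a small sketch of how a test harness might turn one marked line into plain text plus the expected link spans. It assumes only what the header states, namely that links are wrapped in ⸠ and ⸡. The function name and the choice of character offsets are hypothetical, not taken from the actual test code.

```python
# Hypothetical harness helper: strips the ⸠ ... ⸡ markers from a test line and
# records where each expected link sits in the unmarked text.

LINK_START = "⸠"
LINK_END = "⸡"


def expected_links(line):
    """Return (plain_text, spans); spans are (start, end) offsets into plain_text."""
    plain = []
    spans = []
    start = None
    for ch in line:
        if ch == LINK_START:
            start = len(plain)
        elif ch == LINK_END:
            if start is None:
                raise ValueError(f"unmatched end marker in {line!r}")
            spans.append((start, len(plain)))
            start = None
        else:
            plain.append(ch)
    if start is not None:
        raise ValueError(f"unmatched start marker in {line!r}")
    return "".join(plain), spans


text, spans = expected_links("See ⸠TEST.COM⸡ on…")
links = [text[a:b] for a, b in spans]
```

A detector under test would then be run on `text`, and its output compared with `spans`; a line with no markers simply expects no links.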
------------------------------
In unicodetools/data/linkification/dev/LinkDetectionTest.txt:
> +# Misc. test cases
+
+# Implementations may differ as to whether they linkify text where the TLD is invalid (not listed in IANA's tlds-alpha-by-domain.txt)
+# So this is not specifically tested for
+# a test.invalid b
+
+# Sample uppercase
+See ⸠TEST.COM⸡ on…
+See ⸠FOO.VERMÖGEN.com⸡ on…
+
+# Illegal domain names
+See http://.foo.example.com/αβγ on…
+See http://foo..example.com/αβγ on…
+See http://-foo.example-.com. on…
+
+# Legal but unusual. Because we might be at the end of a sentence, we don't include the . unless followed by [/$#]
What does the $ stand for?
Typo; changed to "followed by a Path, Query, or Fragment".
------------------------------
In unicodetools/data/linkification/dev/LinkDetectionTest.txt:
> +See ***@***.*** on…
+
+#Stop backing up when a space is hit
+See ***@***.***⸡
+
+#Include the medial dot
+See ***@***.***⸡.
+
+#Handle non-ASCII
+See ***@***.***⸡
+
+#No valid domain name
+See @example.😎
+
+# No local-part
+See ***@***.***⸡
should this really linkify?
That's a mistake in my code; it shouldn't. Will fix
------------------------------
In unicodetools/data/linkification/dev/LinkDetectionTest.txt:
> +#Handle non-ASCII
+See ***@***.***⸡
+
+#No valid domain name
+See @example.😎
+
+# No local-part
+See ***@***.***⸡
+
+# No valid local-part
+See ***@***.***
+See .***@***.***
+See ..***@***.***
+
+#Quoted local-parts (not in the base algorithm).
+See "john\ ***@***.***⸡"
in formal email syntax, the closing quote would be right before the at
sign, correct?
Yes. Fixed the sample, but still have to fix the @example problem.
------------------------------
In unicodetools/data/linkification/dev/LinkFormattingTest.txt:
> +# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
+# For terms of use and license, see https://www.unicode.org/terms_of_use.html
+#
+# The usage and stability of these values is covered in https://www.unicode.org/reports/tr58/
+#
+# ================================================
+#
+# Format: Each line has the following fields, separated by semicolons.
+# Field 0: Scheme/host
+# Field 1: Path
+# Field 2: Query
+# Field 3: Fragment
+# Field 4: Result — with minimal escaping
+#
+# Empty lines, and lines starting with # are ignored.
+# Spaces around the semicolons are ignored.
Please swap this line with the next one.
done
------------------------------
In unicodetools/data/linkification/dev/LinkFormattingTest.txt:
> +# The usage and stability of these values is covered in https://www.unicode.org/reports/tr58/
+#
+# ================================================
+#
+# Format: Each line has the following fields, separated by semicolons.
+# Field 0: Scheme/host
+# Field 1: Path
+# Field 2: Query
+# Field 3: Fragment
+# Field 4: Result — with minimal escaping
+#
+# Empty lines, and lines starting with # are ignored.
+# Spaces around the semicolons are ignored.
+# Otherwise # is treated like any other character.
+#
+# The Path, Query, and Fragment will contain backslash escapes when characters would otherwise be
I have questions in the "Notes" doc about the purpose and behavior of the
backslash syntax. You didn't respond there. I think we should drop this.
We can't. But let's discuss in the meeting.
------------------------------
In unicodetools/data/linkification/dev/LinkFormattingTest.txt:
> +https:// ; example.com ; ; α%β%3D=γ%δ%3D ; ; https://example.com?α%β%253D=γ%δ%253D
+
+# Wikipedia test cases
+
+https:// ; ru.wikinews.org ; /wiki/Категория:Вселенная ; ; ; https://ru.wikinews.org/wiki/Категория:Вселенная
+https:// ; av.wikipedia.org ; /wiki/Ракь_(планета) ; ; ; https://av.wikipedia.org/wiki/Ракь_(планета)
+https:// ; bo.wikipedia.org ; /wiki/སའི་གོ་ལ། ; ; ; https://bo.wikipedia.org/wiki/སའི་གོ་ལ%E0%BC%8D
+https:// ; fiu-vro.wikipedia.org ; /wiki/Maa_(hod'otäht) ; ; ; https://fiu-vro.wikipedia.org/wiki/Maa_(hod'otäht)
+https:// ; ty.wikipedia.org ; /wiki/’Afirita ; ; ; https://ty.wikipedia.org/wiki/’Afirita
+https:// ; ab.wikipedia.org ; /wiki/Вашингтон,_Џьорџь ; ; ; https://ab.wikipedia.org/wiki/Вашингтон,_Џьорџь
+https:// ; mni.wikipedia.org ; /wiki/ꯅ꯭ꯌꯨ_ꯌꯣꯔ꯭ꯛ_ꯁꯤꯇꯤꯒꯤ_ꯌꯨ.ꯑꯦꯁ. ; ; ; https://mni.wikipedia.org/wiki/ꯅ꯭ꯌꯨ_ꯌꯣꯔ꯭ꯛ_ꯁꯤꯇꯤꯒꯤ_ꯌꯨ.ꯑꯦꯁ%2E
+https:// ; azb.wikipedia.org ; /wiki/واشینقتن،_دی.سی. ; ; ; https://azb.wikipedia.org/wiki/واشینقتن،_دی.سی%2E
+https:// ; mad.wikipedia.org ; /wiki/Tasè’ ; ; ; https://mad.wikipedia.org/wiki/Tasè%E2%80%99
+https:// ; wuu.wikipedia.org ; /wiki/聖保羅(巴西) ; ; ; https://wuu.wikipedia.org/wiki/聖保羅(巴西)
+https:// ; vep.wikipedia.org ; /wiki/Brüssel' ; ; ; https://vep.wikipedia.org/wiki/Brüssel%27
+https:// ; tw.wikipedia.org ; /wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web;_WWW_) ; ; ; https://tw.wikipedia.org/wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web;_WWW_)
you can't have a literal ; in a field in a semicolon-separated file --
see the "Notes" doc
Ok, I'll filter those out.
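
The problem with a literal semicolon can be seen with a naive field splitter. The sketch below assumes six fields per data line, which is what the quoted sample lines show (scheme, host, path, query, fragment, result); the header quoted earlier lists five, with scheme and host combined. The helper names and the field-count constant are hypothetical.

```python
# Hypothetical check: split a test line on semicolons and flag lines whose
# field count is off, which is what a literal ";" inside a field causes.

EXPECTED_FIELDS = 6  # assumption based on the quoted sample lines


def split_fields(line):
    """Split on ';' and trim the spaces around each field."""
    return [field.strip() for field in line.split(";")]


def has_expected_field_count(line):
    return len(split_fields(line)) == EXPECTED_FIELDS


ok = ("https:// ; av.wikipedia.org ; /wiki/Ракь_(планета) ; ; ; "
      "https://av.wikipedia.org/wiki/Ракь_(планета)")
bad = ("https:// ; tw.wikipedia.org ; "
       "/wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web;_WWW_) ; ; ; "
       "https://tw.wikipedia.org/wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web;_WWW_)")

ok_fields = split_fields(ok)
bad_fields = split_fields(bad)
```

The second line splits into more fields than expected because the path and the result each contain a semicolon, so a generator could use a check like `has_expected_field_count` to filter such cases out.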
|
|
Mark: Please respond to GitHub comments on GitHub, not via email -- email replies show up in the stream, but are separate from what they respond to, and are a pain to track.
FYI -- I resolved my comments that you took care of, keeping the unresolved visible.
|
Ok, and thanks for the detailed review. I had a brain-hiccough; should have
responded in GitHub.
…On Tue, Dec 16, 2025 at 10:21 AM Markus Scherer ***@***.***> wrote:
*markusicu* left a comment (unicode-org/unicodetools#961)
<#961 (comment)>
Mark: Please respond to GitHub comments on GitHub, not via email -- email
replies show up in the stream, but are separate from what they respond to,
and are a pain to track.
FYI -- I resolved my comments that you took care of, keeping the
unresolved visible.
I will continue reviewing other files today.
|
Code to generate the property files and test data for UTS 58
See also the related spec changes in https://github.com/unicode-org/unicode-reports/pull/247