Linkification Data files and tooling #961
|
TODO:
|
…mple test cases, match spec better
…p wikipedia pages), gather statistics on mismatches.
Plus fixed ICANN file
eebcf83 to
992ff07
|
The following failures are a mystery to me:
|
Expected until β, ignore those.
That is a failure in your new test, LinkUtilitiesTest, so you had better demystify it!
markusicu
left a comment
Review of only the data files so far.
| # All code points not explicitly listed for Link_Bracket | ||
| # have the value undefined. | ||
| # | ||
| # @missing: 0000..10FFFF; undefined |
I don't think we have ever had an undefined missing value.
Let's make it the same as for Bidi_Paired_Bracket and lots of other properties:
| # @missing: 0000..10FFFF; undefined | |
| # @missing: 0000..10FFFF; <none> |
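To illustrate why the spelling of the default matters to consumers of the file, here is a minimal Python sketch of a reader that picks up the `@missing` line and treats `<none>` as "no value". The function names and the sample text are hypothetical (the two data lines are simplified from the excerpt quoted further down); this is not the unicodetools parser.

```python
import re

# Matches lines such as "# @missing: 0000..10FFFF; <none>".
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+)\s*;\s*(\S+)"
)

def parse_property_file(text):
    """Return (default, values) for a 'code point ; value' data file."""
    default = None
    values = {}
    for line in text.splitlines():
        m = MISSING_RE.match(line)
        if m:
            default = m.group(3)
            continue
        # Drop the trailing comment, then skip empty lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        cp, value = (field.strip() for field in data.split(";"))
        values[int(cp, 16)] = value
    return default, values

def link_bracket(cp, default, values):
    """Look up a code point; the <none> default maps to Python's None."""
    value = values.get(cp, default)
    return None if value == "<none>" else value

# Hypothetical sample, not the real data file.
SAMPLE = """\
# @missing: 0000..10FFFF; <none>
0029 ; 0028 # RIGHT PARENTHESIS
003E ; 003C # GREATER-THAN SIGN
"""

default, values = parse_property_file(SAMPLE)
```

With a conventional `<none>` default, a reader written along these lines for other properties would not need a special case for this file.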
| # The short name of the property is the same as its long name. | ||
| # | ||
| # All code points not explicitly listed for Link_Bracket | ||
| # have the value undefined. |
| # have the value undefined. | |
| # have the value <none>. |
see below
| # ================================================ | ||
|
|
||
| 0029 ; 0028 #1.1 () ⇒ () RIGHT PARENTHESIS | ||
| 003E ; 003C #1.1 (> ⇒ <) GREATER-THAN SIGN |
escaping < and > in this context seems weird
| # Format | ||
| # | ||
| # Field 0: code point range | ||
| # Field 1: binary value |
I guess this works, but it looks a bit odd. The second column is basically useless.
This must be the first time that we have a data file dedicated to one single binary property.
When we list binary properties in multi-property data files, we don't have a value column. By analogy, we only really need a single code point column here.
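As a rough sketch of what the single-column layout would mean for a consumer, the Python below reads a file that lists only code points or ranges and treats everything unlisted as No. The function names and the sample lines are assumptions made for illustration; they are not taken from the data file under review.

```python
def parse_binary_property(text):
    """Return a list of (start, end) ranges from a single-column file."""
    ranges = []
    for line in text.splitlines():
        # Drop the trailing comment, then skip empty lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        first, _, last = data.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        ranges.append((start, end))
    return ranges

def has_property(cp, ranges):
    """A code point has the property only if it is listed; the default is No."""
    return any(start <= cp <= end for start, end in ranges)

# Hypothetical single-column listing; not the real data.
SAMPLE = """\
002D          # HYPHEN-MINUS
0030..0039    # DIGIT ZERO..DIGIT NINE
"""

ranges = parse_binary_property(SAMPLE)
```

No value column and no `@missing` line are needed here, because presence in the file is the only information a binary property carries.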
| # All code points not explicitly listed for Link_Email | ||
| # have the value No. | ||
| # | ||
| # @missing: 0000..10FFFF; No |
Same as above: we haven't written @missing lines for binary properties; they all just default to No.
Of course, the @missing line, when present in a data file, should have the same syntax as the value lines in the same file, so if we explicitly print Yes then I guess we should have this line as well. But I think that shows even more how odd this is. I think we should omit the Yes values and also omit the @missing line.
| See @example.😎 | ||
|
|
||
| # No local-part | ||
| See ⸠@example.com⸡ |
Should this really linkify?
| See ..john.doe@example.com | ||
|
|
||
| #Quoted local-parts (not in the base algorithm). | ||
| See "john\ ⸠doe@example.com⸡" |
In formal email syntax, the closing quote would be right before the at sign, correct?
| # Field 4: Result — with minimal escaping | ||
| # | ||
| # Empty lines, and lines starting with # are ignored. | ||
| # Spaces around the semicolons are ignored. |
Please swap this line with the next one.
| # Spaces around the semicolons are ignored. | ||
| # Otherwise # is treated like any other character. | ||
| # | ||
| # The Path, Query, and Fragment will contain backslash escapes when characters would otherwise be |
I have questions in the "Notes" doc about the purpose and behavior of the backslash syntax. You didn't respond there. I think we should drop this.
| https:// ; mad.wikipedia.org ; /wiki/Tasè’ ; ; ; https://mad.wikipedia.org/wiki/Tasè%E2%80%99 | ||
| https:// ; wuu.wikipedia.org ; /wiki/聖保羅(巴西) ; ; ; https://wuu.wikipedia.org/wiki/聖保羅(巴西) | ||
| https:// ; vep.wikipedia.org ; /wiki/Brüssel' ; ; ; https://vep.wikipedia.org/wiki/Brüssel%27 | ||
| https:// ; tw.wikipedia.org ; /wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web;_WWW_) ; ; ; https://tw.wikipedia.org/wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web;_WWW_) |
You can't have a literal ; in a field in a semicolon-separated file; see the "Notes" doc.
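A small Python sketch of the problem, assuming a reader that simply splits each line on semicolons and trims spaces (the helper name is hypothetical, and the two lines are copied from the excerpt above):

```python
def split_fields(line):
    """Naive reader: split on every semicolon and trim surrounding spaces."""
    return [field.strip() for field in line.split(";")]

# A line whose fields contain no literal semicolon.
GOOD = (
    "https:// ; vep.wikipedia.org ; /wiki/Brüssel' ; ; ; "
    "https://vep.wikipedia.org/wiki/Brüssel%27"
)

# The line flagged above: the path and the result each contain a literal ';'.
BAD = (
    "https:// ; tw.wikipedia.org ; "
    "/wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web;_WWW_) ; ; ; "
    "https://tw.wikipedia.org/wiki/Wiase_Nyinaa_Wɛbsaet_(_World_Wide_Web;_WWW_)"
)

good_fields = split_fields(GOOD)
bad_fields = split_fields(BAD)
print(len(good_fields), len(bad_fields))
```

The first line splits into six fields, while the second splits into eight, with the path cut off after `World_Wide_Web`. Any fix, such as escaping the semicolon or choosing a different test case, is a separate decision.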
Code to generate the property files and test data for UTS 58
See also the related spec changes in https://github.com/unicode-org/unicode-reports/pull/247