Do not detect MD5s as UUIDs, and preserve UUID casing for UUID PKs by nolar · Pull Request #813 · datafold/data-diff

nolar · 2023-12-27T11:58:58Z

Comparing MD5s as UUIDs does not work anyway: the code slices and then compares the values incorrectly, because it always renders UUIDs as abcdabcd-abcd-abcd-abcd-abcdabcdabcd (dashed and lowercase), while the actual values stored in MD5 (i.e. string) PKs can be uppercase and are typically non-dashed (e.g. ABCDABCDABCDABCDABCDABCDABCDABCD). As a result, all such MD5 PKs fall into one pseudo-UUID range, usually the first one (because in ASCII and UTF-8, uppercase letters sort before lowercase letters).
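The mismatch can be reproduced with plain string comparison. This is a minimal sketch, assuming the segment bounds are rendered as lowercase dashed strings and the stored key is an uppercase, non-dashed hex string; the bounds `lo` and `hi` are made up for illustration and are not taken from data-diff:

```python
from uuid import UUID

stored = "ABCDABCDABCDABCDABCDABCDABCDABCD"  # value as stored in a text PK
rendered = str(UUID(stored))                 # 'abcdabcd-abcd-abcd-abcd-abcdabcdabcd'

# Hypothetical segment bounds, rendered the way UUIDs are rendered.
lo = "a0000000-0000-0000-0000-000000000000"
hi = "b0000000-0000-0000-0000-000000000000"

# Compared as 128-bit numbers, the value belongs to the segment.
in_range_numeric = UUID(lo).int <= UUID(stored).int < UUID(hi).int

# Compared as text, it does not: 'A' (0x41) sorts before 'a' (0x61).
in_range_text = lo <= stored < hi

print(in_range_numeric, in_range_text)
```

The stored value and its rendered form denote the same number, yet the text comparison places the stored value before every bound that starts with a lowercase letter.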

The root cause is that Python's UUID can parse even such values:

In [3]: from hashlib import md5
In [8]: s=md5(b'hello').hexdigest()

In [9]: s
Out[9]: '5d41402abc4b2a76b9719d911017c592'

In [10]: from uuid import uuid4, UUID
In [11]: UUID(s)
Out[11]: UUID('5d41402a-bc4b-2a76-b971-9d911017c592')

This PR excludes MD5s and other UUID-like textual PKs from UUID detection.
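One possible way to tell canonical UUID strings apart from other 32-hex-digit values is to compare the input with its canonical rendering. This is only a sketch: the helper name `looks_like_canonical_uuid` is hypothetical, and the check is an assumption about how such filtering could work, not the code this PR adds:

```python
from hashlib import md5
from uuid import UUID, uuid4


def looks_like_canonical_uuid(value: str) -> bool:
    """Hypothetical check: accept only the dashed 8-4-4-4-12 form, in either case."""
    try:
        parsed = UUID(value)
    except ValueError:
        return False
    # UUID() tolerates missing dashes, braces, and a urn: prefix, so compare
    # against the canonical rendering to reject everything but the dashed form.
    return str(parsed) == value.lower()


print(looks_like_canonical_uuid(str(uuid4())))                 # dashed UUID
print(looks_like_canonical_uuid(md5(b"hello").hexdigest()))    # non-dashed MD5
```

With this check, a dashed UUID is accepted regardless of case, while a bare MD5 hex digest is rejected even though `UUID()` can parse it.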

As an extra change (separate commits), this PR also preserves the information on how the database presents the UUIDs — either lowercased or uppercased, and renders the actual sliced UUID values accordingly. This does not matter for native UUIDs (stored & compared as numbers), but does matter for UUIDs stored and/or compared as strings (at least from one side of the diff).
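The idea of remembering the casing and rendering values back accordingly could look roughly like this. The names `prefers_uppercase` and `render_uuid` are hypothetical, and the heuristic (uppercase only if every sample that contains letters is fully uppercase) is an assumption for illustration:

```python
from typing import Iterable
from uuid import UUID


def prefers_uppercase(samples: Iterable[str]) -> bool:
    """Hypothetical heuristic: True if all samples that contain letters are uppercase."""
    lettered = [s for s in samples if s.lower() != s.upper()]
    return bool(lettered) and all(s == s.upper() for s in lettered)


def render_uuid(value: UUID, uppercase: bool) -> str:
    """Render a UUID in the casing observed in the database."""
    text = str(value)  # str(UUID) is always lowercase and dashed
    return text.upper() if uppercase else text


samples = ["5D41402A-BC4B-2A76-B971-9D911017C592"]
value = UUID("5d41402a-bc4b-2a76-b971-9d911017c592")
print(render_uuid(value, prefers_uppercase(samples)))
```

Samples consisting only of digits and dashes carry no casing information, so this sketch falls back to lowercase for them.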

nolar · 2023-12-27T16:39:11Z

FIXED. In an unrelated discussion, this came up: the two sides should be lower-/upper-cased independently, based on each side's samples. However, we currently slice by the PK ranges of one side and propagate that side's values to the other. The casing of the "other" side must be preserved.
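A rough sketch of the per-side rendering described here, assuming the range bounds are kept as integers and each side carries its own casing flag. The function `render_bounds`, the flags, and the bound values are all hypothetical:

```python
from typing import List, Sequence
from uuid import UUID


def render_bounds(bounds: Sequence[int], uppercase: bool) -> List[str]:
    """Render integer range bounds as dashed UUID strings in one side's casing."""
    rendered = [str(UUID(int=b)) for b in bounds]
    return [s.upper() for s in rendered] if uppercase else rendered


# Bounds computed once (e.g. from side A), then rendered separately per side.
bounds = [0, 0xAB << 120]
side_a = render_bounds(bounds, uppercase=False)  # side A stores lowercase
side_b = render_bounds(bounds, uppercase=True)   # side B stores uppercase
print(side_a[1], side_b[1])
```

The point of the design is that the numeric bounds are shared, while the textual form sent to each database follows that database's own casing.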

dagadbm

to unblock if needed but needs proper review


nolar requested a review from dlawin December 27, 2023 11:59

This was referenced Dec 27, 2023

Retrieve collations from the schema (and refactor the column info structures) #804

Closed

Retrieve collations from the schema (and refactor the column info structures) #814

Merged

dagadbm approved these changes Dec 28, 2023


dlawin approved these changes Dec 29, 2023


nolar force-pushed the uuid-misdetection branch 2 times, most recently from 136e605 to 2114ede Compare December 30, 2023 14:12

Sergey Vasilyev added 4 commits December 30, 2023 19:49

Cease detecting MD5 hashes as UUIDs

50c1595

It fails the comparison anyway — because of casing & dashes not fitting into alphanumeric ranges/slices.

Refactor UUID & ArithUUID from inheritance into composition

871d8e2

Preserve lower-/upper-case mode of UUIDs and render them back accordingly

e8ec55b

Restore the proper meta-params of PK column type relevant to each side when slicing

9a99030

Otherwise, it uses the same PK values, e.g. `ArithUUID` from the side A, and then pushes them to side B, where improper rendering can lead to improper slicing.

nolar force-pushed the uuid-misdetection branch from 2114ede to 9a99030 Compare December 30, 2023 18:49

Update the locked setuptools to fix the CI issues

6886ecc

nolar merged commit 8f55fb4 into master Dec 30, 2023

nolar deleted the uuid-misdetection branch December 30, 2023 19:52

nolar merged 5 commits into master from uuid-misdetection
