Add an ability to ignore bad content-type headers by iamamoose · Pull Request #246 · apache/incubator-ponymail-foal

iamamoose · 2023-03-28T10:29:18Z

This adds an option to ignore bad content-type headers instead of failing on them. A fair number of emails in the wild have bad headers, so if this option is set, just try the default ones instead.

Example:

mjc10-sanitised.mbox.txt

iamamoose · 2023-03-28T10:47:05Z

Trying to import the attached file without the patch:

$ pipenv run ./import-mbox.py --source /tmp/mjc10-sanitised.mbox --html2text --lid mark@awe.com --dry
...
Thread-1: Slurping /tmp/mjc10-sanitised.mbox
unknown encoding: utf-8 charset="iso-8859-1"
Thread-1: Failed to parse: Return=mark@awe.com Message-Id=<6c0d22d9cf684e98bf41a>
Thread-1: Parsed 0 records (failed: 1) from /tmp/mjc10-sanitised.mbox
Thread-1: Done, 0 elements left to slurp
All done! 0 records inserted after 0 seconds. 1 records were bad and ignored. 0 duplicates were ignored.

After the patch, with "ignore_bad_contenttype: true" set:

$ pipenv run ./import-mbox.py --source /tmp/mjc10-sanitised.mbox --html2text --lid mark@awe.com --dry
...
Thread-1: Slurping /tmp/mjc10-sanitised.mbox
Ignoring bad Content-Type: utf-8 charset="iso-8859-1"
Thread-1: Parsed 1 records (failed: 0) from /tmp/mjc10-sanitised.mbox
Thread-1: Done, 0 elements left to slurp
All done! 1 records inserted after 0 seconds. 0 records were bad and ignored. 0 duplicates were ignored.

sebbASF · 2023-03-28T10:57:48Z

                        break
                    except UnicodeDecodeError:
                        pass
+                    except:


Please only catch the Exception which is specific to this case.

iamamoose · 2023-03-28

I fed around 300k emails using import-mbox and had many hundreds of content-type header failures. If there's a way for someone to make an invalid content-type header, I had a mail for it! So after making this change I didn't capture all the other broken headers, and figured the option should be "allow overriding any bad decoding".

sebbASF · 2023-03-28

At this point in the logic, the only thing that can legitimately fail is the string.decode(), so that is all that should be permitted. For example, IndexError should not occur here.
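To illustrate the narrower handling requested in the review, here is a minimal sketch. It is not the code from this PR: `decode_body`, `DEFAULT_CHARSETS`, and the function parameters are hypothetical names chosen for illustration. It relies only on the standard Python behaviour that `bytes.decode()` raises `LookupError` for an unknown codec name (the "unknown encoding" failure in the log above) and `UnicodeDecodeError` when the bytes do not fit the codec.

```python
from typing import Optional, Sequence

# Hypothetical fallback list; the real importer's defaults may differ.
DEFAULT_CHARSETS = ("utf-8", "iso-8859-1")


def decode_body(
    payload: bytes,
    declared: Optional[str],
    ignore_bad_contenttype: bool = False,
    defaults: Sequence[str] = DEFAULT_CHARSETS,
) -> Optional[str]:
    """Try the declared charset first, then the defaults."""
    candidates = ([declared] if declared else []) + list(defaults)
    for charset in candidates:
        try:
            return payload.decode(charset)
        except UnicodeDecodeError:
            # Known codec, but the bytes do not fit it: try the next one.
            continue
        except LookupError:
            # Unknown codec name, e.g. 'utf-8 charset="iso-8859-1"'.
            if not ignore_bad_contenttype:
                raise
            print(f"Ignoring bad Content-Type: {charset}")
    return None
```

Any other exception type (an `IndexError`, say) still propagates, so an unrelated bug in the surrounding logic is not silently hidden by the option.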

sebbASF · 2023-03-28T13:57:57Z

It would be useful to update the tests as well.

Humbedooh · 2023-06-16T15:12:48Z

How would you envision the tests for this? That is, what should the criteria be for pass/fail.

Add an ability to ignore bad content-type headers rather than failing on them. A fair number of emails in the wild have bad headers so just try the default ones instead if this option is set

7ddacb4

sebbASF requested changes Mar 28, 2023

iamamoose wants to merge 1 commit into apache:master from iamamoose:ignorebadcontenttype