Skip to content

Add an ability to ignore bad content-type headers#246

Open
iamamoose wants to merge 1 commit into
apache:masterfrom
iamamoose:ignorebadcontenttype
Open

Add an ability to ignore bad content-type headers#246
iamamoose wants to merge 1 commit into
apache:masterfrom
iamamoose:ignorebadcontenttype

Conversation

@iamamoose

Copy link
Copy Markdown
Member

Instead of failing on them. A fair number of emails in the wild have bad headers so just try the default ones instead if this option is set.

Example:

mjc10-sanitised.mbox.txt

failing on them. A fair number of emails in the wild have bad
headers so just try the default ones instead if this option is
set
@iamamoose

Copy link
Copy Markdown
Member Author

Trying to import the attached file without the patch:

$ pipenv run ./import-mbox.py --source /tmp/mjc10-sanitised.mbox --html2text --lid mark@awe.com --dry
...
Thread-1: Slurping /tmp/mjc10-sanitised.mbox
unknown encoding: utf-8 charset="iso-8859-1"
Thread-1: Failed to parse: Return=mark@awe.com Message-Id=<6c0d22d9cf684e98bf41a>
Thread-1: Parsed 0 records (failed: 1) from /tmp/mjc10-sanitised.mbox
Thread-1: Done, 0 elements left to slurp
All done! 0 records inserted after 0 seconds. 1 records were bad and ignored. 0 duplicates were ignored.

after the patch, setting "ignore_bad_contenttype: true"

$ pipenv run ./import-mbox.py --source /tmp/mjc10-sanitised.mbox --html2text --lid mark@awe.com --dry
...
Thread-1: Slurping /tmp/mjc10-sanitised.mbox
Ignoring bad Content-Type: utf-8 charset="iso-8859-1"
Thread-1: Parsed 1 records (failed: 0) from /tmp/mjc10-sanitised.mbox
Thread-1: Done, 0 elements left to slurp
All done! 1 records inserted after 0 seconds. 0 records were bad and ignored. 0 duplicates were ignored.

Comment thread tools/archiver.py
break
except UnicodeDecodeError:
pass
except:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please only catch the Exception which is specific to this case.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fed around 300k emails using import-mbox and had many hundreds of content-type header failures. If there's a way for someone to make an invalid content-type header, I had a mail for it! So after making this change I didn't capture all the other broken headers and figured the option should be "allow overiding any bad decoding"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point in the logic, the only thing that can legitimately fail is the string.decode(), so that is all that should be permitted. For example, IndexError should not occur here

@sebbASF

sebbASF commented Mar 28, 2023

Copy link
Copy Markdown
Contributor

It would be useful to update the tests as well.

@Humbedooh

Copy link
Copy Markdown
Member

How would you envision the tests for this? That is, what should the criteria be for pass/fail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants