Add an ability to ignore bad content-type headers#246
Conversation
failing on them. A fair number of emails in the wild have bad headers so just try the default ones instead if this option is set
|
Trying to import the attached file without the patch: $ pipenv run ./import-mbox.py --source /tmp/mjc10-sanitised.mbox --html2text --lid mark@awe.com --dry after the patch, setting "ignore_bad_contenttype: true" $ pipenv run ./import-mbox.py --source /tmp/mjc10-sanitised.mbox --html2text --lid mark@awe.com --dry |
| break | ||
| except UnicodeDecodeError: | ||
| pass | ||
| except: |
There was a problem hiding this comment.
Please only catch the Exception which is specific to this case.
There was a problem hiding this comment.
I fed around 300k emails using import-mbox and had many hundreds of content-type header failures. If there's a way for someone to make an invalid content-type header, I had a mail for it! So after making this change I didn't capture all the other broken headers and figured the option should be "allow overiding any bad decoding"
There was a problem hiding this comment.
At this point in the logic, the only thing that can legitimately fail is the string.decode(), so that is all that should be permitted. For example, IndexError should not occur here
|
It would be useful to update the tests as well. |
|
How would you envision the tests for this? That is, what should the criteria be for pass/fail. |
Instead of failing on them. A fair number of emails in the wild have bad headers so just try the default ones instead if this option is set.
Example:
mjc10-sanitised.mbox.txt