
Mailbox - Failed to decode message, invalid UTF-8 encoding encountered


Martyn Houghton


We process around 9,000 emails a month through Hornbill and have been seeing a small but increasing number of emails failing to be received properly into the Mailbox, with the affected email being inserted as an error with the subject 'Failed to decode message, invalid UTF-8 encoding encountered'. We have raised this with Hornbill support (IN00144192), who have advised that this is down to a component used in Hornbill not supporting the character set being used by the system sending the emails, which means we struggle to receive emails from the affected customer. The same emails are readable in Exchange/Outlook.

They have advised that this is not a defect, just a limitation of the current component support for different character sets used in emails and that I should raise this on the forum to determine if this is impacting other users. I have asked for details of the character sets supported by the currently implemented component.

Is this issue affecting anyone else?

Cheers

Martyn


Hi Martyn,

Just to clarify, is the specific issue related to UTF-7 encoded emails, which our mail servers are unable to process? UTF-7 is a somewhat esoteric encoding scheme that was "proposed" but never actually standardised. Please correct me if this is not the UTF-7 encoding issue. If it is, then the rest of my comment applies; if not, please ignore me and see the next post below :)

You can read more about it here on Wikipedia: 

https://en.wikipedia.org/wiki/UTF-7

But I will draw your attention to this specific paragraph, where it states...

Quote

UTF-7 is generally not used as a native representation within applications as it is very awkward to process. Despite its size advantage over the combination of UTF-8 with either quoted-printable or base64, the Internet Mail Consortium recommends against its use.

Now we do our very best to adhere to all standards, but UTF-7 has never made it as a standard, mainly because it does not offer anything useful that cannot already be achieved with UTF-8, yet its use would likely create lots of interoperability issues because of the complexity and ambiguity in its specification. That's why the IMC recommends against its use, and why we do not support it. UTF-7 also opens up possible security holes which make it possible for ASCII-escaped Unicode blocks to slip malicious strings past the UTF-7 processor. There is a known XSS issue in older versions of Internet Explorer that does exactly this.
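For readers unfamiliar with UTF-7, here is a small illustrative sketch using Python's built-in codecs (nothing to do with Hornbill's own mail code). It shows how the same text looks in UTF-7 versus UTF-8: UTF-7 hides non-ASCII characters inside pure-ASCII `+...-` blocks, which is what makes it awkward to process and easy to misinterpret.

```python
# Illustration only: compare UTF-7 and UTF-8 encodings of the same text
# using Python's standard codecs.
text = "\u00a31"  # the string "£1"

utf7 = text.encode("utf-7")  # non-ASCII characters become '+<base64>-' blocks
utf8 = text.encode("utf-8")  # non-ASCII characters become multi-byte sequences

print(utf7)  # b'+AKM-1'
print(utf8)  # b'\xc2\xa31'

# The UTF-7 form is plain 7-bit ASCII, so a naive reader sees ordinary text.
print(all(b < 128 for b in utf7))  # True

# Round-tripping through the UTF-7 codec recovers the original string.
print(utf7.decode("utf-7") == text)  # True
```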

It would be good to understand where mail with this encoding scheme is originating from. We process an awful lot of mail messages on our platform every single day, but we very rarely see this problem.

Is it possible to inform the originator of these messages and see if they can change their mail system config to use something more aligned with standards? Also, can you confirm that you still receive the messages, but with the message body added to the mail you receive as a text file (that's what should happen)?

Hope that makes sense?

Gerry


Hi Martyn,

Ah, actually, checking back on the internal workspace, your error may be this one. Please correct me if I am wrong. These posts are useful knowledge to share with our community, so I thought it worth posting anyway.

Quote

UTF-8 sequence under run (11xxxxxx). Character: 0x91 at location: Line=154, Col=31

This is another oddity, and we have seen this a small number of times. Here is a full explanation of the actual problem.

SUMMARY

When processing a multi-part MIME-encoded email, the text and/or HTML body parts are declared as UTF-8 encoded, but when we try to process the message we encounter the above UTF-8 character sequence under run error. This is caused by the content of the MIME part not being a valid UTF-8 encoded text stream despite being declared as UTF-8.

IN DETAIL

In the small number of cases where this has been reported to us, it has consistently been identified that a corporate email disclaimer injected at the bottom of the mail message contains text that is not encoded in UTF-8. This means the message part that was previously encoded correctly now has invalid encoding, and that's why our system is unable to process the message. The error message pinpoints the problem to a specific character (see attached image for an example of this).

bad-utf8-disclaimer.png

This only occurs (from the instances that we have seen) when the mail disclaimer contains non-ASCII characters. All of the examples we have seen have been Latin 1 extended characters that fall outside of the standard US-ASCII range.
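To make the failure mode concrete, here is a minimal Python sketch (an illustration only, not Hornbill's decoder). It assumes a disclaimer encoded as Windows-1252 has been appended to a body that was valid UTF-8, and uses a hypothetical helper, `find_invalid_utf8`, to report the first offending byte with a line and column in the same spirit as the error quoted above.

```python
def find_invalid_utf8(data: bytes):
    """Return (byte_value, line, col) for the first invalid UTF-8 byte, or None.

    Line and column are 1-based; the column is counted in bytes.
    """
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        offset = exc.start
        line = data.count(b"\n", 0, offset) + 1
        line_start = data.rfind(b"\n", 0, offset) + 1
        return data[offset], line, offset - line_start + 1


# A body that is valid UTF-8 on its own.
body = "Thanks for your help.\n".encode("utf-8")

# Assumed scenario: a gateway appends a disclaimer encoded as Windows-1252,
# where the curly quotes become the single bytes 0x91 and 0x92.
disclaimer = "This e-mail is \u2018confidential\u2019.\n".encode("cp1252")

mixed = body + disclaimer

result = find_invalid_utf8(mixed)
if result is not None:
    byte_value, line, col = result
    print(f"Invalid UTF-8 byte 0x{byte_value:02X} at Line={line}, Col={col}")
```

In UTF-8, 0x91 can only appear as a continuation byte inside a multi-byte sequence, so on its own it makes the whole part invalid even though every other byte is fine.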

When processing mail, what we do in this situation is attach the message part as a plain text file attachment (of obviously unknown character encoding), report the actual error in the message body, and deliver the message as usual. It is generally possible to open the attachment to see the original message body content in something like Notepad.

We have not been able to identify the mail system(s) that allow this type of corruption to be applied to the email messages they emit, but we have certainly seen it on more than one occasion.

SOLUTION

The solution to this problem is obviously to expect mail transmitted to our system to at least be correctly encoded, so we would recommend that the organisation emitting these malformed messages be notified and asked to fix the problem, which does not seem to be an unreasonable position.

While it is entirely possible to hack something together to ignore this error, I would be uncomfortable putting a *hack* in our codebase to work around this, or ignoring it only to be held accountable for not correctly handling character encoding. We have put a lot of effort into ensuring we have reliable and predictable email handling by conforming properly to well-defined and ratified standards. Getting the error message we report for this specific error condition to be so accurate and precise took time and effort, and we were really forced into it because, in the absence of this level of specific error reporting, we were being held accountable for these failures, with claims like "it works OK on Exchange/Outlook so it must be your systems" :(

In the statement we communicated to you regarding a "limitation of the current component support for different character sets used in emails", we probably did not communicate this correctly. It's not a limitation as such, it's an inability to make sense of an incorrectly encoded data stream, because - well, it's an incorrectly encoded data stream! With regards to "The same emails are readable in Exchange/Outlook", it's probably Exchange that's emitting the incorrectly encoded messages in the first place, so it would not be surprising that it can handle such a situation. That does not mean that Microsoft have got it right; the standards are really clear about this and the error condition is very easily demonstrable.

So maybe there is a moral dilemma here: should we hack our own system and make it ignore the encoding errors because Microsoft Exchange does? If we do, what do we tell the next customer who complains that our system breaks their email because their disclaimer text no longer shows the characters we had to drop to fix the broken content? Anyone else looking at their malformed message is going to blame us for breaking their mail. Or perhaps we "pass it through" as is and ignore the UTF-8 errors. Then what happens in the various browsers that people are using? What errors will they throw, what malicious XSS code could get injected as a result of us passing invalid UTF-8 streams into the browser, and what if the browser starts reporting encoding errors or shows corrupt or malformed content when you view the message? These are all things that we would be expected to fix.

The only correct answer to this is to fix the source. We have considered every other possibility and that was our conclusion, but I am happy to have a debate should you or anyone in the community wish to contribute their views and thoughts.

Gerry


Hi Martyn,

In answer to your specific question, "I have asked for details of the character sets supported by the currently implemented component": we fully support Unicode, so all possible characters defined in the Unicode standard are supported throughout the entire system - or at least that is our design intent (as of the time of writing, we do have some Unicode issues in WebDAV URL paths which we are working on resolving, but these have nothing to do with email processing).

If your problem is the second one above, relating to invalid UTF-8 encoding, it's not a question of character set support at all. The problem is that the mail message contains a mix of UTF-8 and something else (probably a Latin 1 type character set), so we are unable to process it, partly because it's mixed and therefore invalid, and partly because there is no possible way to know what the character encoding is supposed to be.
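As a small illustration of why the intended encoding cannot be recovered from the bytes alone, the Python sketch below decodes the stray byte 0x91 under three single-byte encodings. The choice of candidate encodings is an assumption for the example; each one succeeds, and each gives a different character.

```python
# The same stray byte decodes "successfully" under several single-byte
# encodings, each producing a different character, so nothing in the data
# itself tells a receiver which one the sender meant.
raw = b"\x91"

candidates = {enc: raw.decode(enc) for enc in ("cp1252", "latin-1", "cp437")}

for enc, char in candidates.items():
    print(f"{enc:8} -> U+{ord(char):04X}")

# cp1252  gives U+2018 (left single quotation mark)
# latin-1 gives U+0091 (a C1 control character)
# cp437   gives U+00E6 (the letter ae ligature)
```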

The other thing we do not support is UTF-7 obviously, which is an encoding scheme rather than a character set.

Gerry


@Gerry

Just to confirm, the error we are getting is about UTF-8 rather than UTF-7. The response we received is below. We will also review the disclaimers on the affected organisation's emails to determine if the invalid characters are present.

Thanks for the detailed replies.

Cheers

Martyn

From: Hornbill Support [mailto:hornbill.support@hornbill.com]
Sent: 16 November 2016 14:27
To: Martyn Houghton 
Subject: Hornbill Incident IN00144192 Update - Failed to decode message, invalid UTF-8 encoding encountered

 

Hi Martyn,

Regarding the issue you raised about the failed to decode message invalid UTF-8 encoding encountered, I gave you a call just now to discuss the way forward with this issue but you were not available. As Victor explained in the email he sent to you on 14th November, the error message you got is because the incoming emails have characters encoded in a "character set" which our mail decoder does not know about. Your other email processors handle this better, as they most likely contain more character sets than Hornbill does.

Victor also mentioned that this can not be considered as a defect, as our email decoder behaves as intended although it may not process all possible character sets. What you can do is raise this on the forums as an Enhancement Request. Kindly let me know once you have done this so that I close this ticket.

Many thanks

Pamela

Hornbill Application Support Team

 

 


Hi Martyn,

Yes, some detail would appear to have been lost in the translation from the internal comms back to you, my apologies. I will try to make sure that in future we move these deeply technical types of conversations onto the forums, where the subject matter expert can communicate directly and at the right level of detail to avoid translation loss.

Back to the problem - when you receive such a message you will get the text part attached. You can also turn on message tracking and get the originating RFC822 message itself. In both of these you will see the illegal character streams.
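If you have the originating RFC822 message saved to a file, a quick way to check it outside Hornbill is sketched below using Python's standard `email` package. This is an assumed workflow rather than anything Hornbill provides: the sample message, the addresses and the helper name `check_text_parts` are all made up for the example. It walks each text part and tries to decode it strictly using the charset the part declares.

```python
import email


def check_text_parts(raw: bytes):
    """Return (content_type, declared_charset, error_or_None) per text part."""
    msg = email.message_from_bytes(raw)
    results = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        charset = part.get_content_charset() or "us-ascii"
        # decode=True undoes the transfer encoding (quoted-printable/base64)
        # and returns the raw bytes of the part.
        payload = part.get_payload(decode=True) or b""
        try:
            payload.decode(charset)  # strict decode with the declared charset
            error = None
        except (UnicodeDecodeError, LookupError) as exc:
            error = str(exc)
        results.append((part.get_content_type(), charset, error))
    return results


# Made-up sample: the plain-text part claims UTF-8 but contains the
# Windows-1252 quote bytes 0x91/0x92 (written as =91/=92 in quoted-printable).
raw_message = (
    b"From: sender@example.com\r\n"
    b"To: support@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Hello\r\n=91Confidential=92\r\n"
    b"--b1\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<p>Hello</p>\r\n"
    b"--b1--\r\n"
)

results = check_text_parts(raw_message)
for content_type, charset, error in results:
    print(content_type, charset, error or "OK")
```

For a real message you would read the bytes from the saved file (for example with `open(path, "rb").read()`) instead of using the inline sample.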

We have discussed internally introducing a system setting "hack switch" which when turned on would force our system to ignore the errors and replace all invalid characters with '?' or something like that.  This way an individual customer could "switch on" the hack and get that behaviour, but on the understanding that it is in fact a hack and other character encoding/formatting issues might occur as a result.   If you desperately need us to do this, let me know and I will see what we can do. 

Please let me know how you get on. 

Gerry


  • 2 weeks later...

Just to round this thread out. We have added a new system setting called 'mail.importer.forceUtf8FixHack' which enables a hack to force fix invalid UTF-8 encoded MIME parts. We have called this a hack because we would rather the mail content be correctly encoded in the first place. Nonetheless, there are rogue mail systems out there and this does happen. When enabled, this option will brute fix the content by replacing all invalid UTF-8 byte sequences with the '?' character.

This should make it to the live production environment early next week.
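For anyone curious what this kind of "brute fix" looks like in principle, here is a rough Python sketch. It is not the Hornbill implementation: the error-handler name `qmark` and the function `force_fix_utf8` are invented for the example, and Python's decoder decides what counts as one invalid sequence, which may differ from how the setting behaves.

```python
import codecs


def _question_mark(exc):
    """Decode error handler: substitute '?' for each invalid byte sequence."""
    if isinstance(exc, UnicodeDecodeError):
        # Resume decoding just after the bytes the decoder rejected.
        return ("?", exc.end)
    raise exc


# "qmark" is a made-up handler name registered for this example.
codecs.register_error("qmark", _question_mark)


def force_fix_utf8(data: bytes) -> str:
    """Decode as UTF-8, replacing invalid sequences with '?' instead of failing."""
    return data.decode("utf-8", errors="qmark")


print(force_fix_utf8(b"Hello \x91world\x92"))         # Hello ?world?
print(force_fix_utf8("caf\u00e9".encode("utf-8")))    # valid input is unchanged
```

Note that the replaced characters are lost for good, which is exactly the trade-off described above: the message becomes processable, but it no longer shows what the sender wrote.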

Hope this is useful. 

Gerry
