danielwertheim


notes from a passionate developer


Disclaimer

This is a personal blog. The opinions expressed here represent my own and not those of my employer, nor current or previous. All content is published "as is", without warranty of any kind and I don't take any responsibility and can't be liable for any claims, damages or other liabilities that might be caused by the content.

UTF-8 BOM adventures in C#

Time for a quick look at UTF-8 encoding and the byte order mark (BOM). Let's jump right into some code. Given the title you are probably on the alert already, but would you have expected this test to pass?

[Fact]
public void Utf8Strings()
{
    var initial = "Hello world!";

    using var ms = new MemoryStream();
    using var writer = new StreamWriter(ms, Encoding.UTF8);

    writer.Write(initial);
    writer.Flush();

    Assert.Equal(
        initial,
        Encoding.UTF8.GetString(ms.ToArray()));
}

So what is happening here? Let's look at a second test to make it clearer.

[Fact]
public void Utf8Arrays()
{
    var initial = "Hello world!";

    using var ms = new MemoryStream();
    using var writer = new StreamWriter(ms, Encoding.UTF8);

    writer.Write(initial);
    writer.Flush();

    Assert.Equal(
        Encoding.UTF8.GetBytes(initial),
        ms.ToArray());
}

What are those extra bytes?

Both tests fail because the stream starts with three extra bytes. They are the byte order mark (BOM), and for UTF-8 it essentially signals that the stream consists of UTF-8 encoded bytes. In UTF-16 and UTF-32 the BOM also tells whether the bytes are in little- or big-endian order. Here's a good place to read about it in a somewhat understandable way: https://www.unicode.org/faq/utf_bom.html#bom1

Here are some extracted parts from Unicode.Org's FAQ:

Q: What does ‘endian’ mean?

A: Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian...

(https://www.unicode.org/faq/utf_bom.html#bom3)

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order...

(https://www.unicode.org/faq/utf_bom.html#bom5)

Can we find the BOM for UTF-8 in .NET?

Yes. It is exposed through the Encoding.Preamble property and the Encoding.GetPreamble() method:

[Fact]
public void ItIsTheBom()
{
    Assert.Equal(
        new[] { 0xEF, 0xBB, 0xBF },
        new[] { 239, 187, 191 });

    Assert.Equal(
        new byte[] { 239, 187, 191 },
        Encoding.UTF8.GetPreamble());
}

The docs (https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding.getpreamble?view=netcore-3.1) say:

When overridden in a derived class, returns a sequence of bytes that specifies the encoding used.
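To make "a sequence of bytes that specifies the encoding used" a bit more concrete, here is a small sketch that prints the preamble of a few built-in encodings. The BomTable class and PreambleHex helper are names made up for this illustration; only GetPreamble() and BitConverter.ToString() come from .NET.

```csharp
using System;
using System.Text;

public static class BomTable
{
    // Formats an encoding's preamble (its BOM) as space-separated hex bytes.
    public static string PreambleHex(Encoding encoding) =>
        BitConverter.ToString(encoding.GetPreamble()).Replace("-", " ");

    public static void Main()
    {
        Console.WriteLine(PreambleHex(Encoding.UTF8));             // EF BB BF
        Console.WriteLine(PreambleHex(Encoding.Unicode));          // FF FE (UTF-16, little-endian)
        Console.WriteLine(PreambleHex(Encoding.BigEndianUnicode)); // FE FF (UTF-16, big-endian)
    }
}
```

The two UTF-16 preambles are the same code point (U+FEFF) written in opposite byte orders, which is where the endianness detection comes from. The UTF-8 preamble is that same code point in UTF-8 form.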

According to the Unicode Standard, the BOM is not required for UTF-8 in particular (see D95 under section 3.10, Unicode Encoding Schemes).
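Since the BOM is optional, code that reads UTF-8 bytes may receive input with or without it, and Encoding.UTF8.GetString keeps a leading BOM in the decoded string as the character U+FEFF. Below is a minimal sketch of one way to deal with that on the reading side. BomStripper and DecodeUtf8 are hypothetical names for this example, and it assumes .NET Core 2.1 or later for the span-based Encoding.Preamble and GetString overloads.

```csharp
using System;
using System.Text;

public static class BomStripper
{
    // Decodes UTF-8 bytes, skipping a leading BOM when one is present.
    public static string DecodeUtf8(byte[] bytes)
    {
        ReadOnlySpan<byte> data = bytes;
        ReadOnlySpan<byte> bom = Encoding.UTF8.Preamble; // EF BB BF

        if (data.StartsWith(bom))
        {
            data = data.Slice(bom.Length);
        }

        return Encoding.UTF8.GetString(data);
    }
}
```

With this helper, bytes that begin with EF BB BF and bytes that don't decode to the same string. If the data comes from a stream, a StreamReader with its default settings should give a similar result, since it detects and skips the BOM on its own.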

Can we get rid of it?

Yes. Instead of using Encoding.UTF8, create your own UTF8Encoding instance and tell it not to emit the BOM: new UTF8Encoding(false)

[Fact]
public void Utf8StringsWithoutBom()
{
    var initial = "Hello world!";

    using var ms = new MemoryStream();
    using var writer = new StreamWriter(ms, new UTF8Encoding(false));

    writer.Write(initial);
    writer.Flush();

    Assert.Equal(
        initial,
        Encoding.UTF8.GetString(ms.ToArray()));
}

Great! But then, do I really need a Stream and a StreamWriter? Surely the encoding instance itself decides whether the preamble ends up in the bytes. Right?

[Fact]
public void Outsmarted()
{
    var initial = "Hello world!";
    var encWithBom = new UTF8Encoding(true);
    var encWithoutBom = new UTF8Encoding(false);

    var rWithBom = encWithBom.GetBytes(initial);
    var rWithoutBom = encWithoutBom.GetBytes(initial);

    // Fails: GetBytes never emits the preamble, so both arrays are equal.
    Assert.NotEqual(
        rWithBom,
        rWithoutBom);
}

No. The test fails because GetBytes never includes the preamble, so both arrays are identical. It's the StreamWriter that writes the encoding's preamble at the start of the stream. Creating a UTF8Encoding instance with false just makes its preamble an empty array of bytes.
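If you do want the BOM without going through a StreamWriter, you have to prepend the preamble yourself. The following is a rough sketch of that idea, not a copy of what StreamWriter does internally. BomWriter and GetBytesWithPreamble are hypothetical names used only here.

```csharp
using System;
using System.Text;

public static class BomWriter
{
    // Encodes text and prepends the encoding's preamble, if it has one.
    public static byte[] GetBytesWithPreamble(Encoding encoding, string text)
    {
        byte[] preamble = encoding.GetPreamble();
        byte[] body = encoding.GetBytes(text);

        var result = new byte[preamble.Length + body.Length];
        preamble.CopyTo(result, 0);
        body.CopyTo(result, preamble.Length);
        return result;
    }
}
```

Called with new UTF8Encoding(true) and "Hello world!" this returns 15 bytes, the first three being the BOM. With new UTF8Encoding(false) the preamble is empty and you get the plain 12 bytes.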

That's all for this post. Hope I clarified something.

Cheers,

//Daniel
