Date: | September 7, 2004 / year-entry #330 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20040907-00/?p=37943 |
Comments: | 7 |
Summary: | For some reason, this question gets asked a lot. How do I convert a byte[] to a System.String? (Yes, this is a CLR question. Sorry.) You can use String System.Text.UnicodeEncoding.GetString() which takes a byte[] array and produces a string. Note that this is not the same as just blindly copying the bytes from the byte[]... |
For some reason, this question gets asked a lot. How do I convert a byte[] to a System.String? (Yes, this is a CLR question. Sorry.) You can use String System.Text.UnicodeEncoding.GetString() which takes a byte[] array and produces a string. Note that this is not the same as just blindly copying the bytes from the byte[] array into a hunk of memory and calling it a string. The GetString() method must validate the bytes and forbid invalid surrogates, for example. You might be tempted to create a string and just mash the bytes into it, but that violates string immutability and can lead to subtle problems. |
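A minimal sketch of the straightforward case, assuming the bytes really are little-endian UTF-16 (the format UnicodeEncoding decodes):
<pre>
using System;
using System.Text;

public class ByteArrayToString
{
    public static void Main()
    {
        // UTF-16LE bytes for "hello", the format UnicodeEncoding expects.
        byte[] bytes = Encoding.Unicode.GetBytes("hello");

        // GetString validates the bytes as it decodes; it does not
        // simply reinterpret the memory as a string.
        string s = Encoding.Unicode.GetString(bytes);
        Console.WriteLine(s); // prints "hello"
    }
}
</pre>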
Comments (7)
On a related question: how do those of us not using .NET achieve streamable character conversion? That is, conversion where the converter can perform a partial conversion, report errors such as "the last n bytes of input begin but don't complete a multibyte character" or "the output buffer is too small, so only m bytes of input were converted", and then let you continue with another block of input data and/or another output buffer. MLang appeared to offer this, but as far as I can see it doesn't, or at least the documentation doesn't cover it. Yet IE is presumably doing it, and MLang is part of IE…
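(For comparison, on the CLR side this is exactly what System.Text.Decoder provides: it buffers an incomplete multibyte sequence internally between calls. A minimal sketch:)
<pre>
using System;
using System.Text;

public class StreamingDecode
{
    public static void Main()
    {
        // UTF-8 encodes "é" as the two-byte sequence 0xC3 0xA9;
        // split it across two buffers to simulate streaming input.
        byte[] chunk1 = { (byte)'a', 0xC3 };
        byte[] chunk2 = { 0xA9, (byte)'b' };

        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[16];

        // The decoder holds the dangling 0xC3 in internal state...
        int n = decoder.GetChars(chunk1, 0, chunk1.Length, chars, 0);
        Console.WriteLine(new string(chars, 0, n)); // prints "a"

        // ...and completes the character on the next call.
        n = decoder.GetChars(chunk2, 0, chunk2.Length, chars, 0);
        Console.WriteLine(new string(chars, 0, n)); // prints "éb"
    }
}
</pre>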
Something else to mention is that you should match the System.Text.Encoding subclass to the contents of the byte[]. For example, passing a byte[] that contains UTF-8 text to UnicodeEncoding's GetString method won't decode it properly:
<pre>
using System;
using System.Text;

public class MyClass
{
    public static void Main()
    {
        // "my string" encoded as UTF-8.
        byte[] text = Encoding.UTF8.GetBytes("my string");

        // Wrong: decoding UTF-8 bytes as UTF-16LE produces garbage.
        string s = Encoding.Unicode.GetString(text);
        Console.WriteLine(s);

        // Right: decode with the same encoding that produced the bytes.
        s = Encoding.UTF8.GetString(text);
        Console.WriteLine(s);
    }
}
</pre>
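(The first WriteLine prints gibberish because the decoder pairs up the UTF-8 bytes two at a time as UTF-16 code units, yielding unrelated CJK-looking characters plus a leftover odd byte; the second prints "my string".)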
Unicode? We don’ need no stinkin’ Unicode! :)
<pre>
string s = System.Text.Encoding.ASCII.GetString(buffer, 0, buffer.Length);
</pre>
not actually a .NET blog?
Regarding the Abrams link:
Why, oh why, does the string have a cast operator to a non-const C-string, if the string is immutable?
In VC++ 2005 beta 1, either the _T() macro doesn't work, or there's something funny about the UNICODE and _UNICODE macros. I haven't had time to investigate. When I had time to experiment with VC++ 2005 beta 1, I just worked around it by changing _T("string") to L"string", forcing them to be wide strings, and wide strings are Unicode on Windows.
But this didn't have to be done with all strings. Some of them I just left as "string", leaving them as multibyte strings. Automatic conversion to type System::String^ correctly converted some of these ANSI strings to Unicode, but garbled some others. I haven't had time to investigate whether there's a reason for this.
(This didn't seem to be the worst issue I found in VC++ 2005 beta 1, since the IDE kept operating and forms could still be edited afterward. But I didn't have time to investigate whether there's a more serious underlying cause.)