TCHAR in new C++ programs
Posted on May 3rd, 2009
I have been debating with myself whether I should use TCHAR in the C++ examples on this blog.
When you generate a new C++ console project in Visual Studio 2008, it automatically includes
TCHAR and friends. I think this would be a bit confusing for a beginner who is trying to learn
C++ for the first time. The approach I think I’m going to take is to delete all the generated
files and let the beginner build a first program up from scratch in a new file, using only standard
types.

First a review of the standard C++ char types
A null-terminated string is an array of chars with a null (0) character at the end
to signal the end of the string. It is usually referred to with a char* pointer to the
first character in the string. Alternatively, the characters can be stored in a std::string or
std::vector, which is safer.

A wide-character null-terminated string is the same as the above, except that it uses wchar_t
instead of char.

According to the C++ standard, char is a type large enough to hold any member of the
implementation's basic character set. wchar_t is a type that can represent every character of
the largest extended character set among the supported locales; it is also at least as large as char.

Generally, on contemporary operating systems, char is 8 bits wide and wchar_t is 16 bits wide on Windows operating systems.
char is 8 bits wide and wchar_t is 32 bits wide on Unix and Linux operating systems.

For a simple character set like ASCII, which has only 128 characters, you can fit the entire set into
a char, and there is a nice one-to-one correspondence between a character and the char type.
However, a more complex character set like Unicode (currently) has code points that go up to
1114111 (0x10FFFF), so it would need a type with at least 21 bits to
be able to represent any of its characters in a single unit.

So how can we use Unicode on Windows, which only has 8 bit and 16 bit character types?
The solution is simple: we use more than one char or wchar_t to represent a single character.
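
To make the idea of spreading one character across several units concrete, here is a minimal sketch that encodes a single Unicode code point as UTF-8 (one to four 8 bit units) and as UTF-16 (one or two 16 bit units). The function names encode_utf8 and encode_utf16 are made up for this illustration, and the UTF-16 units are stored in std::uint16_t rather than wchar_t so that the sketch behaves the same on Windows and Linux. It assumes a C++11 compiler and a valid code point (no surrogates, nothing above 0x10FFFF), and it does no error checking.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Encode one code point as UTF-8. Assumes cp is a valid Unicode
// code point; no validation is done in this sketch.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 unit: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 units
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 units
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 units
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as UTF-16. Code points above 0xFFFF
// become a surrogate pair (two 16 bit units).
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
    }
    return out;
}
```

For example, encode_utf8(0x20AC) (the euro sign) gives three chars, while encode_utf16(0x20AC) gives a single 16 bit unit. A code point outside the first 65536, such as 0x1F600, needs four chars in UTF-8 and two units in UTF-16.
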
UTF-8 is the encoding that is used with 8 bit chars and UTF-16 is the encoding used for
16 bit wchar_ts. UTF-32 is the encoding used for 32 bit wchar_ts on Unix/Linux, which is a very simple
one-to-one correspondence of each character to a wchar_t.

There are other legacy character sets that use multiple chars to represent a character, however
I strongly recommend you avoid them and just use Unicode.

The point I am trying to make here is that the type you use for string storage, whether
char or wchar_t, is independent of the character set and encoding used for the string.
char* doesn’t automatically mean ASCII and wchar_t* doesn’t automatically mean Unicode.

C++ itself can generate strings. They are called string literals and are represented in the code
by text surrounded by double quotation marks. There are two types of string literals: narrow
char* string literals, and wide wchar_t* string literals, which have an L prepended in front.

For example
cout << "Hello World";   //char* Narrow string literal, 8 bits just about everywhere
wcout << L"Hello World"; //wchar_t* Wide string literal, 16 bits on windows, 32 bits on Linux
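
If you want to check these widths on your own compiler, a small sketch like the following may help. The names bits_per_unit, narrow_hello, wide_hello, narrow_length and wide_length are made up for this illustration. It only relies on sizeof, CHAR_BIT and the standard strlen and wcslen functions, so the numbers you get depend on your platform.

```cpp
#include <climits>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Width in bits of one unit of the given character type
// on the platform this is compiled for.
template <typename CharT>
std::size_t bits_per_unit() {
    return sizeof(CharT) * CHAR_BIT;
}

// Both arrays hold the 11 characters of "Hello World"
// plus the terminating null character.
const char    narrow_hello[] = "Hello World";
const wchar_t wide_hello[]   = L"Hello World";

// Length in units, not counting the terminating null.
std::size_t narrow_length() { return std::strlen(narrow_hello); }
std::size_t wide_length()   { return std::wcslen(wide_hello); }
```

Both lengths come out as 11, but sizeof(wide_hello) is 12 * sizeof(wchar_t), so the wide array takes two or four times the storage of the narrow one depending on whether wchar_t is 16 or 32 bits.
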
The standard doesn’t specify what encoding or character set is used for string literals, but it does say that
for a narrow string literal (char*) a universal character name can map to more than one char. It says
nothing of the sort for wchar_t.

I’ve concluded that VS2008 doesn’t exactly follow the standard for string literals. If it
did, wchar_t would be 32 bits instead of 16 bits, since according to the standard any character
in any character set in any locale should be representable in a single wchar_t. I suppose
it depends on how you define locale (I’m willing to bet Unicode isn’t in a locale on Windows; I haven’t checked this yet).

Why would you use TCHAR
I’ve been thinking a bit about TCHAR and its purpose. A lot of programmers think that simply
adding _T to all their literals and then using the t string functions will automatically
make their code Unicode compliant (whatever that means). This is wrong; there is a lot more
to Unicode than simply using _T.

What is TCHAR?
TCHAR is basically a type that is defined as char or wchar_t depending on whether UNICODE
is defined in the project. _T is a macro that inserts L in front of a string literal if UNICODE is
defined, and does nothing otherwise. (Strictly speaking, _T and the other tchar.h names key off
_UNICODE, which Visual Studio projects define together with UNICODE.)

_T is used for the string literals that will be scattered around in your program.
The idea is that you can change the UNICODE definition to produce an “ANSI” build or a
“UNICODE” build. All it actually does is switch your literals and some APIs between
chars and wchar_ts.

For example, if you have a string literal that is compiled as wchar_t with some Unicode characters in it and
you undefine UNICODE, VS2008 chokes on those characters when compiling the literal
and replaces them with ? characters (it gives a warning that it is doing this). This is almost
certainly not what you want. The other way around works fine though, going from char* to wchar_t*.
As soon as you add non-ASCII characters to your literals and strings, you will get unpredictable results
when you remove the UNICODE define.
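
The switching mechanism in this section can be sketched without any Windows headers. The names below (MY_UNICODE, my_tchar, MY_T, my_tstring, greeting, bytes_per_unit) are hypothetical stand-ins invented for this illustration; they are not the real TCHAR and _T definitions, they only mimic the general shape of the idea.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for TCHAR and _T, switched by a
// hypothetical MY_UNICODE macro (NOT the real Windows definitions).
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(s) L##s      // paste an L in front of the literal
#else
typedef char my_tchar;
#define MY_T(s) s         // leave the literal as it is
#endif

// A string type that follows whichever character type was chosen.
typedef std::basic_string<my_tchar> my_tstring;

// The same source line compiles to a narrow or a wide literal,
// depending on whether MY_UNICODE is defined.
my_tstring greeting() { return MY_T("Hello World"); }

// Size in bytes of one unit of the chosen character type.
std::size_t bytes_per_unit() { return sizeof(my_tchar); }
```

Notice that the macro only changes the storage type of the literal. It does nothing about which characters the literal can hold, which is why non-ASCII text breaks when the define is removed.
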
So I think TCHAR only makes sense if you’re only ever going to be using ASCII in your program
and would like to be able to compile it for an ASCII-only system. For such a program, changing
all the strings and string handling to Unicode only makes sense if you need the extra performance
that comes from avoiding the string conversions on a Unicode operating system.

Conclusions
Don’t use TCHAR for new programs, unless you want them to be backward compatible with ASCII-only
systems and need the performance of skipping string conversions on Unicode systems.

Otherwise you don’t need to use it at all:
- If you want to write an ASCII only program just use char* and stick with it.
- If you want to write a UNICODE only program just use wchar_t* and stick with it.
- Yes, you can also write a UNICODE program using only char*s with UTF-8.
- If you want to use a mixture, use char* and wchar_t* as needed and convert between them when necessary.
Also:
- char != ASCII / ANSI
- wchar_t != UNICODE
They are just containers, and their usage by your program, the compiler and OS APIs determines whether what is inside them is UNICODE, ASCII or any other character set.
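
As a small illustration of the "just containers" point, here is a sketch of a std::string (plain chars) holding UTF-8 text. The names count_code_points and sample are made up for this example, and the function assumes its input is already valid UTF-8. The sample text is written with hex escapes so that the result doesn't depend on how the source file itself is encoded.

```cpp
#include <cstddef>
#include <string>

// Count the code points in a string assumed to hold valid UTF-8.
// Every byte that is not a continuation byte (bit pattern 10xxxxxx)
// starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// The word "hello" with an e-acute (U+00E9) as its second letter;
// in UTF-8 that letter is the two bytes C3 A9.
std::string sample() { return "h\xC3\xA9llo"; }
```

Here sample().size() is 6 because std::string counts chars, while count_code_points(sample()) is 5. The container is the same char type you would use for ASCII; only the interpretation of its contents differs.
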
That’s all for now
I might expand on this topic in the future and try to pin it down a bit more clearly, with better definitions and explanations.
Happy Programming
One response to “TCHAR in new C++ programs”
-
Use only char with utf-8!
(speaking as someone who wasted countless hours unicodifying his source to support TCHAR, only to realize that UTF-16 is horrible anyway)