Practical, Pragmatic, C++
  • What character set does C++ use?

    Posted on May 4th, 2009 wozname No comments

    I believe the answer to this question is that it depends on the implementation. The standard only requires that char is wide enough to hold any member of the “basic execution character set” and that wchar_t is wide enough to hold any character of the largest extended character set among the locales the implementation supports.
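
    To see what a particular implementation chose, the widths can simply be queried. This is a minimal sketch; the helper names char_bits and wchar_bits are made up for illustration and are not part of any library.

```cpp
#include <climits>   // CHAR_BIT

// Number of bits in a char on this implementation (the standard requires at least 8).
int char_bits() { return CHAR_BIT; }

// Number of bits of storage in a wchar_t on this implementation.
// sizeof counts in units of char, so multiply by CHAR_BIT.
int wchar_bits() { return static_cast<int>(sizeof(wchar_t) * CHAR_BIT); }
```

    The results differ between platforms, which is the point: only the lower bounds are fixed by the standard.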

    The next question: in what character set is a C++ program actually represented? Researching this, I found that the character set of the .h and .cpp files is implementation-defined, so they can be stored in any character set the compiler accepts. However, the first thing the compiler does is map the source characters to the “basic source character set”.

    Any trigraph sequences in the original file are replaced by the single characters they stand for. Characters that cannot be represented in the basic source character set are converted to “the universal-character-name that designates that character”. In other words, they are converted to \uXXXX (or \UXXXXXXXX) notation, where the hexadecimal digits give the character's code point in Unicode/ISO 10646.
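
    A universal-character-name can also be written by hand. The sketch below assumes a C++11 or later compiler, because it uses char16_t and static_assert, which postdate this post; the names e_acute and e_acute_code are made up for illustration.

```cpp
// Assumption: C++11 or later (char16_t, constexpr, static_assert).

// \u00E9 names LATIN SMALL LETTER E WITH ACUTE by its ISO 10646 code point.
// In a char16_t literal the value of the literal is that code point.
constexpr char16_t e_acute = u'\u00E9';
static_assert(e_acute == 0x00E9, "a UCN designates the ISO 10646 code point");

// Returns the code point stored in e_acute as a plain unsigned integer.
unsigned e_acute_code() { return static_cast<unsigned>(e_acute); }
```

    Writing the character this way keeps the source file itself within the basic source character set, whatever encoding the file is saved in.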

    There are 3 character sets that the compiler deals with:

    • The basic source character set
    • The basic execution character set
    • The basic execution wide character set

    The basic source character set consists of the following characters:
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    0 1 2 3 4 5 6 7 8 9
    _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
    As well as the space character and these control characters: new-line, form feed, vertical tab, horizontal tab.
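
    The list above can be turned into a small membership test. This is only a sketch: kBasicSource, in_basic_source_set and basic_source_count are hypothetical names, and the table follows the set as it stood when this post was written (91 graphic characters, plus the space character that the standard also includes, plus the four control characters).

```cpp
#include <cstddef>   // std::size_t
#include <cstring>   // std::strchr, std::strlen

// The 91 graphic characters, then space, horizontal tab, vertical tab,
// form feed and new-line: 96 characters in total.
static const char kBasicSource[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"
    " \t\v\f\n";

// True if c is a member of the basic source character set.
// The null character is rejected explicitly, because strchr would
// otherwise match the terminator of the table.
bool in_basic_source_set(char c) {
    return c != '\0' && std::strchr(kBasicSource, c) != nullptr;
}

// Number of characters in the table.
std::size_t basic_source_count() { return std::strlen(kBasicSource); }
```

    Characters such as @, $ and the backquote are not in the table, which is why they only ever appear inside literals and comments in portable code of that era.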

    The basic execution character set contains all the characters of the basic source character set plus these:
    null character (with all zero bits)
    control characters: carriage return, backspace, alert
    Each of its members fits in a single char.

    The basic execution wide character set contains all the characters of the basic source character set plus these:
    null wide character (with all zero bits)
    control characters: carriage return, backspace, alert
    Each of its members fits in a single wchar_t.

    As well as these minimum specifications, there are also the execution character set and the execution wide-character set, which are supersets of the basic execution character set and the basic execution wide character set respectively. I believe these are the sets the compiler actually uses when generating code from narrow and wide string literals. However, the values of their members are implementation-defined, and any additional members are locale-specific.
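
    The difference between narrow and wide string literals is easy to observe. In this sketch the same text is stored both ways; the variable and function names are made up for illustration.

```cpp
#include <cstddef>   // std::size_t

// The same three characters as a narrow and as a wide string literal.
static const char    narrow[] = "abc";
static const wchar_t wide[]   = L"abc";

// Element counts: both include the terminating null (wide) character.
std::size_t narrow_elems() { return sizeof(narrow) / sizeof(narrow[0]); }
std::size_t wide_elems()   { return sizeof(wide) / sizeof(wide[0]); }

// Storage in bytes: one char per narrow element, one wchar_t per wide element.
std::size_t narrow_bytes() { return sizeof(narrow); }
std::size_t wide_bytes()   { return sizeof(wide); }
```

    Both arrays hold four elements, but the wide one takes sizeof(wchar_t) times as many bytes, and the numeric values stored for each character are whatever the implementation's execution character sets assign.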

    Happy Programming :)
