Character Set Description (charmap) Source File Format
Purpose
Defines character symbols as character encodings.
Description
The character set description (charmap) source file defines character symbols as character encodings. The /usr/lib/nls/charmap directory contains charmap source files for supported locales. The localedef command recognizes two sections in charmap source files, the CHARMAP section and the CHARSETID section:
Item | Description |
---|---|
CHARMAP | Maps symbolic character names to code points. This section must precede all other sections, and is mandatory. |
CHARSETID | Maps the code points within the code set to a character set ID. This section is optional. |
The CHARMAP Section
The CHARMAP section of the charmap file maps symbolic character names to code points. All supported code sets have the portable character set as a proper subset. Only symbols that are not defined in the portable character set must be defined in the CHARMAP section. The portable character set consists of the following character symbols (listed by their standardized symbolic names) and encodings:
Symbol Name | Code (hexadecimal) |
---|---|
<NUL> | 000 |
<SOH> | 001 |
<STX> | 002 |
<ETX> | 003 |
<EOT> | 004 |
<ENQ> | 005 |
<ACK> | 006 |
<alert> | 007 |
<backspace> | 008 |
<tab> | 009 |
<new-line> | 00A |
<vertical-tab> | 00B |
<form-feed> | 00C |
<carriage-return> | 00D |
<SO> | 00E |
<SI> | 00F |
<DLE> | 010 |
<DC1> | 011 |
<DC2> | 012 |
<DC3> | 013 |
<DC4> | 014 |
<NAK> | 015 |
<SYN> | 016 |
<ETB> | 017 |
<CAN> | 018 |
<EM> | 019 |
<SUB> | 01A |
<ESC> | 01B |
<IS4> | 01C |
<IS3> | 01D |
<IS2> | 01E |
<IS1> | 01F |
<space> | 020 |
<exclamation-mark> | 021 |
<quotation-mark> | 022 |
<number-sign> | 023 |
<dollar-sign> | 024 |
<percent> | 025 |
<ampersand> | 026 |
<apostrophe> | 027 |
<left-parenthesis> | 028 |
<right-parenthesis> | 029 |
<asterisk> | 02A |
<plus-sign> | 02B |
<comma> | 02C |
<hyphen> | 02D |
<period> | 02E |
<slash> | 02F |
<zero> | 030 |
<one> | 031 |
<two> | 032 |
<three> | 033 |
<four> | 034 |
<five> | 035 |
<six> | 036 |
<seven> | 037 |
<eight> | 038 |
<nine> | 039 |
<colon> | 03A |
<semi-colon> | 03B |
<less-than> | 03C |
<equal-sign> | 03D |
<greater-than> | 03E |
<question-mark> | 03F |
<commercial-at> | 040 |
<A> | 041 |
<B> | 042 |
<C> | 043 |
<D> | 044 |
<E> | 045 |
<F> | 046 |
<G> | 047 |
<H> | 048 |
<I> | 049 |
<J> | 04A |
<K> | 04B |
<L> | 04C |
<M> | 04D |
<N> | 04E |
<O> | 04F |
<P> | 050 |
<Q> | 051 |
<R> | 052 |
<S> | 053 |
<T> | 054 |
<U> | 055 |
<V> | 056 |
<W> | 057 |
<X> | 058 |
<Y> | 059 |
<Z> | 05A |
<left-bracket> | 05B |
<backslash> | 05C |
<right-bracket> | 05D |
<circumflex> | 05E |
<underscore> | 05F |
<grave-accent> | 060 |
<a> | 061 |
<b> | 062 |
<c> | 063 |
<d> | 064 |
<e> | 065 |
<f> | 066 |
<g> | 067 |
<h> | 068 |
<i> | 069 |
<j> | 06A |
<k> | 06B |
<l> | 06C |
<m> | 06D |
<n> | 06E |
<o> | 06F |
<p> | 070 |
<q> | 071 |
<r> | 072 |
<s> | 073 |
<t> | 074 |
<u> | 075 |
<v> | 076 |
<w> | 077 |
<x> | 078 |
<y> | 079 |
<z> | 07A |
<left-brace> | 07B |
<vertical-line> | 07C |
<right-brace> | 07D |
<tilde> | 07E |
<DEL> | 07F |
The CHARMAP section contains the following sections:
- The CHARMAP section header.
- An optional special symbolic name-declarations section. The symbolic name and value must be separated by one or more blank characters. The following are the special symbolic names and their meanings:
Item | Description |
---|---|
<code_set_name> | Specifies the name of the coded character set for which the charmap file is defined. This value determines the value returned by the nl_langinfo subroutine. The <code_set_name> must be specified using any character from the portable character set, except for control and space characters. |
<mb_cur_max> | Specifies the maximum number of bytes in a multibyte character for the encoded character set. Valid values are 1 to 4. The default value is 1. |
<mb_cur_min> | Specifies the minimum number of bytes in a multibyte character for the encoded character set. Since all supported code sets have the portable character set as a proper subset, this value must be 1. |
<escape_char> | Specifies the escape character that indicates encodings in hexadecimal or octal notation. The default value is a \ (backslash). |
<comment_char> | Specifies the character used to indicate a comment within a charmap file. The default value is a # (pound sign). With the exception of optional comments following a character symbol encoding, comments must start with a comment character in the first column of a line. |
- Character set mapping statements for the defined code set.
Each statement in this section defines a symbolic name for a character encoding. A character symbol begins with the < (less-than) character and ends with the > (greater-than) character. The characters between the < (less-than) and > (greater-than) can be any characters from the portable character set, except for control and space characters. The > (greater-than) character may be used if it is escaped with the escape character (as specified by the <escape_char> special symbolic name). A character symbol cannot exceed 32 characters in length.
The format of a character symbol definition is:
<char_symbol> encoding optional comment
An encoding is specified as one or more character constants, with the maximum number of character constants specified by the <mb_cur_max> special symbolic name. The localedef command supports decimal, octal, and hexadecimal constants with the following formats:
hexadecimal constant \xddd
octal constant \oddd
decimal constant \dddd
Some examples of character symbol definitions are:
<A> \d65 decimal constant
<B> \x42 hexadecimal constant
<j10101> \x81\d254 mixed hex and decimal constants
A range of one or more symbolic names and corresponding encoding values may also be defined, where the nonnumeric prefix for each symbolic name is common, and the numeric portion of the second symbolic name is equal to or greater than the numeric portion of the first symbolic name. In this format, a symbolic name value consists of zero or more nonnumeric characters followed by an integer of one or more decimal digits. This format defines a series of symbolic names. For example, the string <j0101>...<j0104> is interpreted as the <j0101>, <j0102>, <j0103>, and <j0104> symbolic names, in that order.
In statements defining ranges of symbolic names, the encoded value is the value for the first symbolic name in the range. Subsequent symbolic names have encoding values in increasing order. For example:
<j0101>...<j0104> \d129\d254
This character set mapping statement is interpreted as follows:
<j0101> \d129\d254
<j0102> \d129\d255
<j0103> \d130\d0
<j0104> \d130\d1
Symbolic names must be unique, but two or more symbolic names can have the same value.
- The END CHARMAP section trailer.
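The range rule described above can be sketched in Python. This is an illustration only, not how the localedef command is implemented: the helper names parse_encoding and expand_range are hypothetical, only the \d (decimal) and \x (hexadecimal) constants used in the examples above are handled, and the carry from \d129\d255 to \d130\d0 is inferred from the worked example rather than from a stated rule.

```python
import re


def parse_encoding(text):
    """Convert a string such as r'\x81\d254' into a list of byte values.

    Hypothetical helper: handles only \\d (decimal) and \\x (hex) constants.
    """
    if not re.fullmatch(r"(?:\\(?:d[0-9]+|x[0-9A-Fa-f]+))+", text):
        raise ValueError(f"unsupported encoding: {text!r}")
    values = []
    for kind, digits in re.findall(r"\\(d|x)([0-9A-Fa-f]+)", text):
        value = int(digits, 10 if kind == "d" else 16)
        if value > 255:
            raise ValueError(f"constant out of byte range: {value}")
        values.append(value)
    return values


def expand_range(first, last, encoding):
    """Expand '<j0101>...<j0104> enc' into (symbol, byte list) pairs."""
    m_first = re.fullmatch(r"<(\D*)(\d+)>", first)
    m_last = re.fullmatch(r"<(\D*)(\d+)>", last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("range endpoints must share a nonnumeric prefix")
    prefix, start_digits = m_first.groups()
    start, stop = int(start_digits), int(m_last.group(2))
    if stop < start:
        raise ValueError("second symbolic name must not be lower than the first")

    code = parse_encoding(encoding)
    # Assumption: the encoding is incremented as one big-endian number,
    # which reproduces the carry shown in the example above.
    base = int.from_bytes(bytes(code), "big")
    width = len(start_digits)  # keep leading zeros in generated names
    result = []
    for offset in range(stop - start + 1):
        value = (base + offset).to_bytes(len(code), "big")
        result.append((f"<{prefix}{start + offset:0{width}d}>", list(value)))
    return result


for name, value in expand_range("<j0101>", "<j0104>", r"\d129\d254"):
    print(name, value)
```

Treating the encoding as a single number keeps the sketch short; a real implementation would also need to check the result against <mb_cur_max>.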
Examples
The following is an example of a portion of a possible CHARMAP section from a charmap file:
CHARMAP
<code_set_name> ISO8859-1
<mb_cur_max> 1
<mb_cur_min> 1
<escape_char> \
<comment_char> #
<NUL> \x00
<SOH> \x01
<STX> \x02
<ETX> \x03
<EOT> \x04
<ENQ> \x05
<ACK> \x06
<alert> \x07
<backspace> \x08
<tab> \x09
<newline> \x0a
<vertical-tab> \x0b
<form-feed> \x0c
<carriage-return> \x0d
END CHARMAP
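As a rough illustration of how a CHARMAP section like the one above could be read, the following Python sketch collects the special symbolic names and the character symbol definitions into dictionaries. The function name parse_charmap is hypothetical, range statements and octal constants are deliberately left out, and no limits (such as <mb_cur_max> or the 32-character symbol length) are enforced.

```python
import re

SPECIAL_NAMES = {
    "<code_set_name>", "<mb_cur_max>", "<mb_cur_min>",
    "<escape_char>", "<comment_char>",
}


def parse_charmap(text):
    """Return (settings, symbols) for the CHARMAP section found in text."""
    # Defaults stated in the description: 1 byte, backslash, pound sign.
    settings = {"<mb_cur_max>": "1", "<escape_char>": "\\", "<comment_char>": "#"}
    symbols = {}
    inside = False
    for raw in text.splitlines():
        # Comment lines start with the comment character in column one.
        if not raw.strip() or raw.startswith(settings["<comment_char>"]):
            continue
        fields = raw.split()
        if fields == ["CHARMAP"]:
            inside = True
            continue
        if fields == ["END", "CHARMAP"]:
            break
        if not inside:
            continue
        if len(fields) < 2:
            raise ValueError(f"missing value in line: {raw!r}")
        name, value = fields[0], fields[1]  # anything after is a comment
        if name in SPECIAL_NAMES:
            settings[name] = value
            continue
        escape = re.escape(settings["<escape_char>"])
        constants = re.findall(escape + r"(d|x)([0-9A-Fa-f]+)", value)
        if not constants:
            raise ValueError(f"unsupported encoding in line: {raw!r}")
        symbols[name] = bytes(
            int(digits, 10 if kind == "d" else 16) for kind, digits in constants
        )
    return settings, symbols


sample = """CHARMAP
<code_set_name> ISO8859-1
<mb_cur_max> 1
<escape_char> \\
<comment_char> #
<NUL> \\x00
<tab> \\x09
<A> \\d65 decimal constant
END CHARMAP
"""
settings, symbols = parse_charmap(sample)
print(settings["<code_set_name>"], symbols["<tab>"], symbols["<A>"])
```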
The CHARSETID Section
The CHARSETID section maps the code points within the code set to a character set ID. The CHARSETID section contains the following sections:
- The CHARSETID section header.
- Character set ID mappings for the defined code sets.
- The END CHARSETID section trailer.
Character set ID mappings are defined by listing symbolic names or code points for symbolic names and their associated character set IDs. The following are possible formats for a character set ID mapping statement:
<character_symbol> number
<character_symbol>...<character_symbol> number
character_constant number
character_constant...character_constant number
The <character_symbol> used must have previously been defined in the CHARMAP section. The character_constant must follow the format described for the CHARMAP section.
Individual character set mappings are accomplished by indicating either the symbolic name (defined in the CHARMAP section or the portable character set) followed by the character set ID, or the code point associated with a symbolic name followed by the character set ID value. Symbolic names and code points must be separated from a character set ID value by one or more blank characters. Ranges of code points can be mapped to a character set ID value by indicating appropriate combinations of symbolic names and code point values as endpoints to the range, separated by ... (ellipsis) to indicate the intermediate characters, and followed by the character set ID for the range. The first endpoint value must be less than or equal to the second endpoint value.
Examples
The following is an example of a portion of a possible CHARSETID section from a charmap file:
CHARSETID
<space>...<nobreakspace> 0
<tilde>...<y-diaeresis> 1
END CHARSETID
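The four statement formats listed for the CHARSETID section can be sketched as a small Python lookup. Everything here is an assumption made for illustration: the names parse_charsetid and charset_id are hypothetical, the symbol table is supplied by the caller, only single-byte \d and \x constants are handled, and the first matching range wins because the text above does not say how overlapping ranges are resolved.

```python
import re


def parse_charsetid(text, symbols):
    """Return a list of (low, high, charset_id) tuples.

    symbols maps symbolic names such as '<space>' to integer code points;
    it stands in for the names defined in the CHARMAP section.
    """
    def endpoint(token):
        if token.startswith("<"):
            return symbols[token]
        match = re.fullmatch(r"\\(d|x)([0-9A-Fa-f]+)", token)
        if not match:
            raise ValueError(f"unsupported endpoint: {token!r}")
        kind, digits = match.groups()
        return int(digits, 10 if kind == "d" else 16)

    ranges = []
    inside = False
    for line in text.splitlines():
        line = line.strip()
        if line == "CHARSETID":
            inside = True
        elif line == "END CHARSETID":
            break
        elif inside and line:
            target, number = line.split()[:2]
            first, _, last = target.partition("...")
            low = endpoint(first)
            high = endpoint(last) if last else low  # single character mapping
            if low > high:
                raise ValueError("first endpoint must not exceed the second")
            ranges.append((low, high, int(number)))
    return ranges


def charset_id(ranges, code_point):
    """Return the ID of the first range containing code_point, or None."""
    for low, high, number in ranges:
        if low <= code_point <= high:
            return number
    return None


# Illustrative symbol table and input, not taken from a shipped charmap file.
symbols = {"<space>": 0x20, "<tilde>": 0x7E,
           "<nobreakspace>": 0xA0, "<y-diaeresis>": 0xFF}
text = ("CHARSETID\n"
        "<space>...<tilde> 0\n"
        "<nobreakspace>...<y-diaeresis> 1\n"
        "\\x7f 0\n"
        "END CHARSETID\n")
ranges = parse_charsetid(text, symbols)
print(charset_id(ranges, 0x41), charset_id(ranges, 0xE9))
```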