# ****************************************************************************** # * # * Copyright (C) 1995-2014, International Business Machines # * Corporation and others. All Rights Reserved. # * # ****************************************************************************** # If this converter alias table looks very confusing, a much easier to # understand view can be found at this demo: # http://demo.icu-project.org/icu-bin/convexp # IMPORTANT NOTE # # This file is not read directly by ICU. If you change it, you need to # run gencnval, and eventually run pkgdata to update the representation that # ICU uses for aliases. The gencnval tool will normally compile this file into # cnvalias.icu. The gencnval -v verbose option will help you when you edit # this file. # Please be friendly to the rest of us that edit this table by # keeping this table free of tabs. # This is an alias file used by the character set converter. # A lot of converter information can be found in unicode/ucnv.h, but here # is more information about this file. # # If you are adding a new converter to this list and want to include it in the # icu data library, please be sure to add an entry to the appropriate ucm*.mk file # (see ucmfiles.mk for more information). # # Here is the file format using BNF-like syntax: # # converterTable ::= tags { converterLine* } # converterLine ::= converterName [ tags ] { taggedAlias* }'\n' # taggedAlias ::= alias [ tags ] # tags ::= '{' { tag+ } '}' # tag ::= standard['*'] # converterName ::= [0-9a-zA-Z:_'-']+ # alias ::= converterName # # Except for the converter name, aliases are case insensitive. # Names are separated by whitespace. # Line continuation and comment sytax are similar to the GNU make syntax. # Any lines beginning with whitespace (e.g. U+0020 SPACE or U+0009 HORIZONTAL # TABULATION) are presumed to be a continuation of the previous line. # The # symbol starts a comment and the comment continues till the end of # the line. # # The converter # # All names can be tagged by including a space-separated list of tags in # curly braces, as in ISO_8859-1:1987{IANA*} iso-8859-1 { MIME* } or # some-charset{MIME* IANA*}. The order of tags does not matter, and # whitespace is allowed between the tagged name and the tags list. # # The tags can be used to get standard names using ucnv_getStandardName(). # # The complete list of recognized tags used in this file is defined in # the affinity list near the beginning of the file. # # The * after the standard tag denotes that the previous alias is the # preferred (default) charset name for that standard. There can only # be one of these default charset names per converter. # The world is getting more complicated... # Supporting XML parsers, HTML, MIME, and similar applications # that mark encodings with a charset name can be difficult. # Many of these applications and operating systems will update # their codepages over time. # It means that a new codepage, one that differs from an # old one by changing a code point, e.g., to the Euro sign, # must not get an old alias, because it would mean that # old files with this alias would be interpreted differently. # If an codepage gets updated by assigning characters to previously # unassigned code points, then a new name is not necessary. # Also, some codepages map unassigned codepage byte values # to the same numbers in Unicode for roundtripping. It may be # industry practice to keep the encoding name in such a case, too # (example: Windows codepages). # The aliases listed in the list of character sets # that is maintained by the IANA (http://www.iana.org/) must # not be changed to mean encodings different from what this # list shows. Currently, the IANA list is at # http://www.iana.org/assignments/character-sets # It should also be mentioned that the exact mapping table used for each # IANA names usually isn't specified. This means that some other applications # and operating systems are left to interpret the exact mappings for the # underspecified aliases. For instance, Shift-JIS on a Solaris platform # may be different from Shift-JIS on a Windows platform. This is why # some of the aliases can be tagged to differentiate different mapping # tables with the same alias. If an alias is given to more than one converter, # it is considered to be an ambiguous alias, and the affinity list will # choose the converter to use when a standard isn't specified with the alias. # Name matching is case-insensitive. Also, dashes '-', underscores '_' # and spaces ' ' are ignored in names (thus cs-iso_latin-1, csisolatin1 # and "cs iso latin 1" are the same). # However, the names in the left column are directly file names # or names of algorithmic converters, and their case must not # be changed - or else code and/or file names must also be changed. # For example, the converter ibm-921 is expected to be the file ibm-921.cnv. # The immediately following list is the affinity list of supported standard tags. # When multiple converters have the same alias under different standards, # the standard nearest to the top of this list with that alias will # be the first converter that will be opened. The ordering of the aliases # after this affinity list does not affect the preferred alias, but it may # affect the order of the returned list of aliases for a given converter. # # The general ordering is from specific and frequently used to more general # or rarely used at the bottom. { UTR22 # Name format specified by http://www.unicode.org/unicode/reports/tr22/ HTML # WHATWG's encoding spec; https://encoding.spec.whatwg.org IANA # Source: http://www.iana.org/assignments/character-sets MIME # Source: http://www.iana.org/assignments/character-sets } UTF-8 { MIME* HTML* } unicode-1-1-utf-8 utf8 utf-16be { MIME* HTML* } utf-16le { MIME* HTML* } utf-16 ibm866-html IBM866 { MIME* HTML* } 866 cp866 csibm866 iso-8859-2-html ISO-8859-2 { MIME* HTML* } csisolatin2 iso-ir-101 iso8859-2 iso88592 iso_8859-2 iso_8859-2:1987 l2 latin2 iso-8859-3-html ISO-8859-3 { MIME* HTML* } csisolatin3 iso-ir-109 iso8859-3 iso88593 iso_8859-3 iso_8859-3:1988 l3 latin3 iso-8859-4-html ISO-8859-4 { MIME* HTML* } csisolatin4 iso-ir-110 iso8859-4 iso88594 iso_8859-4 iso_8859-4:1988 l4 latin4 iso-8859-5-html ISO-8859-5 { MIME* HTML* } csisolatincyrillic cyrillic iso-ir-144 iso8859-5 iso88595 iso_8859-5 iso_8859-5:1988 iso-8859-6-html ISO-8859-6 { MIME* HTML* } arabic asmo-708 csiso88596e csiso88596i csisolatinarabic ecma-114 iso-8859-6-e iso-8859-6-i iso-ir-127 iso8859-6 iso88596 iso_8859-6 iso_8859-6:1987 iso-8859-7-html ISO-8859-7 { MIME* HTML* } csisolatingreek ecma-118 elot_928 greek greek8 iso-ir-126 iso8859-7 iso88597 iso_8859-7 iso_8859-7:1987 sun_eu_greek iso-8859-8-html ISO-8859-8 { MIME* HTML* } csiso88598e { MIME } csisolatinhebrew hebrew ISO-8859-8-E ISO-8859-8-I iso-ir-138 iso8859-8 iso88598 iso_8859-8 iso_8859-8:1988 visual # adding this one leads to a failure in encoding-labels.html # csiso88598i # This alias has to be dealt with by TextCodecICU unless # multiple encodings can share a single mapping table. #ISO-8859-8-I { MIME* HTML* } # csiso88598i # logical iso-8859-10-html ISO-8859-10 { MIME* HTML* } csisolatin6 iso-ir-157 iso8859-10 iso885910 l6 latin6 iso-8859-13-html ISO-8859-13 { MIME* HTML* } iso8859-13 iso885913 iso-8859-14-html ISO-8859-14 { MIME* HTML* } iso8859-14 iso885914 iso-8859-15-html ISO-8859-15 { MIME* HTML* } csisolatin9 iso8859-15 iso885915 iso_8859-15 l9 iso-8859-16-html ISO-8859-16 { MIME* HTML* } koi8-r-html KOI8-R { MIME* HTML* } cskoi8r koi koi8 koi8_r koi8-u-html KOI8-U { MIME* HTML* } koi8-ru macintosh-html macintosh { MIME* HTML* } csmacintosh mac x-mac-roman windows-874-html windows-874 { MIME* HTML* } dos-874 iso-8859-11 iso8859-11 iso885911 tis-620 windows-1250-html windows-1250 { MIME* HTML* } cp1250 x-cp1250 windows-1251-html windows-1251 { MIME* HTML* } cp1251 x-cp1251 windows-1252-html windows-1252 { MIME* HTML* } ansi_x3.4-1968 ascii cp1252 cp819 csisolatin1 ibm819 iso-8859-1 iso-ir-100 iso8859-1 iso88591 iso_8859-1 iso_8859-1:1987 l1 latin1 us-ascii x-cp1252 windows-1253-html windows-1253 { MIME* HTML* } cp1253 x-cp1253 windows-1254-html windows-1254 { MIME* HTML* } cp1254 csisolatin5 iso-8859-9 iso-ir-148 iso8859-9 iso88599 iso_8859-9 iso_8859-9:1989 l5 latin5 x-cp1254 windows-1255-html windows-1255 { MIME* HTML* } cp1255 x-cp1255 windows-1256-html windows-1256 { MIME* HTML* } cp1256 x-cp1256 windows-1257-html windows-1257 { MIME* HTML* } cp1257 x-cp1257 windows-1258-html windows-1258 { MIME* HTML* } cp1258 x-cp1258 x-mac-cyrillic-html x-mac-cyrillic { MIME* HTML* } x-mac-ukrainian # Keep GBK and GB18030 separate for now until we decide # what to do about them: crbug.com/339862 # The encoding spec requires that decoding to Unicode should use GB18030 # while encoding from Unicode should use GBK. windows-936-2000 GBK { MIME* IANA* } chinese { IANA } iso-ir-58 { IANA } GB2312 { IANA MIME } GB_2312-80 { IANA } gb_2312 csGB2312 { IANA } csiso58gb231280 x-gbk # GB 18030 is partly algorithmic, using the MBCS converter gb18030 { IANA* } gb18030 { HTML* MIME* } big5-html Big5 { MIME* HTML* } cn-big5 csbig5 x-x-big5 Big5-HKSCS euc-jp-html EUC-JP { MIME* HTML* } cseucpkdfmtjapanese x-euc-jp ISO_2022,locale=ja,version=0 ISO-2022-JP { MIME* HTML* } csiso2022jp shift_jis-html Shift_JIS { MIME* HTML* } csshiftjis ms_kanji ms932 shift-jis sjis windows-31j x-sjis euc-kr-html EUC-KR { MIME* HTML* } cseuckr csksc56011987 iso-ir-149 korean ks_c_5601-1987 ks_c_5601-1989 ksc5601 ksc_5601 windows-949 # We need to keep these aliases so that documents labelled with them # are converted to a single U+FFFD instead of being rendered as a gibberish. ISO-2022-KR { HTML* MIME* } csISO2022KR { IANA } ISO-2022-CN { IANA* HTML* } csISO2022CN x-ISO-2022-CN-GB ISO-2022-CN-EXT { IANA* HTML* } HZ-GB-2312 { HTML* IANA* } HZ