The Unicode Standard Characters Repertoire
The Unicode Standard specifies a numeric value (also known as a code point) and a name for each of its characters. In this respect, it is similar to other character encoding standards from ASCII onward. Character codes and names alone are not enough to ensure legible text: a character’s case, directionality, and alphabetic properties must also be well defined. The Unicode Standard defines these and other semantic values, and includes application data such as case mapping tables and character property tables as part of the Unicode Character Database. Character properties define a character’s identity and behavior; they ensure consistency in the processing and interchange of Unicode data. (See the section Unicode Character Properties.)
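As an illustration of the code point, name, and property data described above, the following minimal sketch queries Python's standard `unicodedata` module, which exposes part of the Unicode Character Database. The helper name `describe` and the two sample characters are assumptions chosen for this example; they are not part of the Standard.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode Character Database values for one character.

    `describe` is a hypothetical helper written for this illustration.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",             # numeric value
        "name": unicodedata.name(ch),                 # character name
        "category": unicodedata.category(ch),         # general category
        "bidi_class": unicodedata.bidirectional(ch),  # directionality
        "lowercase": ch.lower(),                      # case mapping
    }


latin_a = describe("A")           # a cased, left-to-right letter
hebrew_alef = describe("\u05D0")  # an uncased, right-to-left letter

print(latin_a)
print(hebrew_alef)
```

The Latin letter reports an uppercase category and a left-to-right class, while the Hebrew letter has no case distinction and a right-to-left class, which is the kind of information a text renderer needs beyond the code point itself.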
The Unicode Standard contains 1,114,112 code points, most of which are available for encoding of characters. The majority of the common characters used in the major languages of the world are encoded in the first 65,536 code points, known as the Basic Multilingual Plane (BMP). The overall capacity for a little over a million characters is more than sufficient for all currently known character encoding requirements, including full coverage of all minority and historic scripts of the world. Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). The 8-bit, byte-oriented form, UTF-8, has been designed for ease of use with existing ASCII-based systems. The Unicode Standard is code-for-code identical with International Standard ISO/IEC 10646. Any implementation that is conformant to Unicode is therefore conformant to ISO/IEC 10646.
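To make the three encoding forms concrete, the sketch below encodes a few sample characters with Python's built-in codecs and reports how many bytes each form needs. The function names `encoded_lengths` and `in_bmp` and the choice of samples are assumptions made for this example. The big-endian codec variants are used only so that no byte order mark is added to the output.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes one character occupies in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),       # 1 to 4 bytes
        "utf-16": len(ch.encode("utf-16-be")),  # 2 bytes, or 4 for a surrogate pair
        "utf-32": len(ch.encode("utf-32-be")),  # always 4 bytes
    }


def in_bmp(ch: str) -> bool:
    """True if the character lies in the first 65,536 code points (the BMP)."""
    return ord(ch) <= 0xFFFF


# Sample characters: ASCII letter, accented Latin letter, Han ideograph, emoji.
samples = ["A", "\u00E9", "\u4E2D", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", "BMP" if in_bmp(ch) else "supplementary",
          encoded_lengths(ch))
```

The ASCII letter occupies a single byte in UTF-8, which is what makes that form convenient for existing ASCII-based systems, while the character outside the BMP needs four bytes in every form.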
As of Version 12.1, the Unicode Standard contains a total of 137,929 characters from the world’s scripts. These characters are more than sufficient for communication in all modern languages and for representing the classical forms of many languages. The Standard encompasses the European alphabetic scripts, Middle Eastern right-to-left scripts, and other regional scripts such as those of Asia and Africa. Likewise, many archaic and historic scripts are encoded. The Han script includes 87,887 unified ideographic characters defined by national, international, and industry standards of China, Japan, Korea, Taiwan, Vietnam, and Singapore. Additionally, the Standard contains many important symbol sets, including currency symbols, punctuation marks, mathematical symbols, technical symbols, geometric shapes, dingbats, and emoji.
List of Unicode characters
In Unicode, the range of integers used to code characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is assigned to a given code point in the codespace, it is then referred to as an encoded character. The Unicode codespace consists of the integers from 0 to 10FFFF (hexadecimal), comprising 1,114,112 code points available for encoding the repertoire of abstract characters. The table below presents an ordered list of all the code points defined in the current repertoire of the Unicode Standard.
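The size of the codespace follows directly from its hexadecimal bounds, as the short sketch below checks. The constant and function names are assumptions introduced for this example. The division into planes of 65,536 code points matches the size of the Basic Multilingual Plane.

```python
CODESPACE_MAX = 0x10FFFF  # highest code point in the Unicode codespace
PLANE_SIZE = 0x10000      # 65,536 code points per plane (the BMP is plane 0)

# The codespace starts at 0, so its size is the upper bound plus one.
total_code_points = CODESPACE_MAX + 1
plane_count = total_code_points // PLANE_SIZE


def plane_of(code_point: int) -> int:
    """Return the plane number of a code point, rejecting out-of-range values."""
    if not 0 <= code_point <= CODESPACE_MAX:
        raise ValueError(f"{code_point:#x} is outside the Unicode codespace")
    return code_point // PLANE_SIZE


print(total_code_points)  # 1114112
print(plane_count)        # 17
print(plane_of(0x0041), plane_of(0x1F600), plane_of(CODESPACE_MAX))
```

Python's own `chr` enforces the same upper bound: `chr(0x10FFFF)` succeeds, while `chr(0x110000)` raises `ValueError`.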
NOTE: The Unicode Standard does not encode idiosyncratic, novel, or private-use characters, nor does it encode logos or graphics. Graphologies unrelated to text, such as dance notations, are likewise outside the scope of Unicode. Font variants are explicitly not encoded. The Standard reserves 6,400 code points in the BMP for private use, which may be used to assign codes to characters not included in the Unicode repertoire. Another 131,068 private-use code points are available outside the BMP, should 6,400 prove insufficient for particular applications.
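The private-use figures in the note can be reproduced from the ranges themselves. The sketch below assumes the range boundaries U+E000 to U+F8FF in the BMP, U+F0000 to U+FFFFD in plane 15, and U+100000 to U+10FFFD in plane 16. These boundaries are not stated in the text above, and the names `PRIVATE_USE_RANGES`, `range_size`, and `is_private_use` are hypothetical helpers for this example.

```python
# Assumed private-use ranges as (first, last) code points, inclusive.
PRIVATE_USE_RANGES = {
    "BMP Private Use Area": (0xE000, 0xF8FF),
    "Plane 15 private use": (0xF0000, 0xFFFFD),
    "Plane 16 private use": (0x100000, 0x10FFFD),
}


def range_size(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1


def is_private_use(code_point: int) -> bool:
    """True if the code point falls in any of the assumed private-use ranges."""
    return any(first <= code_point <= last
               for first, last in PRIVATE_USE_RANGES.values())


bmp_private = range_size(*PRIVATE_USE_RANGES["BMP Private Use Area"])
supplementary_private = sum(
    range_size(first, last)
    for label, (first, last) in PRIVATE_USE_RANGES.items()
    if label != "BMP Private Use Area"
)

print(bmp_private)            # 6400
print(supplementary_private)  # 131068
```

The last two code points of planes 15 and 16 are excluded from the assumed ranges, which is why each plane contributes 65,534 rather than 65,536 private-use code points.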