The Unicode Standard Character Database
The Unicode Standard (commonly known simply as Unicode) is a universal character encoding standard for written characters and text. It defines a consistent way of encoding multilingual text, so that text in the world's classical and modern languages, and in many technical disciplines, can be represented for computer processing and display. As the default encoding of HTML and XML, the Unicode Standard underpins the World Wide Web and today's global business ecosystem. Required in new Internet protocols and implemented in all modern operating systems and programming languages, Unicode is the basis of software that must function all around the world. With Unicode, the technology industry has replaced proliferating character sets with a single, stable, and universal character repertoire that allows for global interoperability and reliable cross-language data interchange.
From a software developer's point of view, the Unicode Standard and its associated specifications provide programmers with a unified universal character encoding, extensive descriptions, and vast amounts of data about how characters in the Unicode repertoire function. The specifications describe how to form words and break lines; sort text in different languages; format numbers, dates, and times in ways appropriate to particular languages; and display languages whose written form flows from right to left, such as Arabic, Hebrew, and Thaana, or whose written form splits, combines, and reorders, such as the languages of South Asia. Without the character properties and algorithms in the Unicode Standard and its associated core specifications, interoperability between different implementations would be impossible, and much of the vast breadth of the world’s languages would lie outside the reach of modern computer software.
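As a small illustration of the per-character data behind right-to-left display, the sketch below looks up the bidirectional class of one letter from each script named above. It assumes Python's standard `unicodedata` module as the access path to the property data; the sample letters, the `SAMPLES` table, and the helper `is_strong_rtl` are hypothetical choices made for this example. It only inspects a property and does not implement the Bidirectional Algorithm itself.

```python
import unicodedata

# Sample letters chosen for illustration only (hypothetical selection).
SAMPLES = {
    "Latin": "A",
    "Hebrew": "\u05D0",  # HEBREW LETTER ALEF
    "Arabic": "\u0627",  # ARABIC LETTER ALEF
    "Thaana": "\u0780",  # THAANA LETTER HAA
}

# Bidirectional classes that mark strongly right-to-left characters:
# "R" (right-to-left) and "AL" (Arabic letter).
RTL_CLASSES = {"R", "AL"}


def is_strong_rtl(ch: str) -> bool:
    """Return True if the character has a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


for script, ch in SAMPLES.items():
    bidi_class = unicodedata.bidirectional(ch)
    print(f"{script:7} U+{ord(ch):04X} class={bidi_class} rtl={is_strong_rtl(ch)}")
```

A full layout engine would feed these classes into the Bidirectional Algorithm to resolve the display order of mixed-direction text; the lookup above is only its raw input.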
The Unicode Standard associates a rich set of semantics with each encoded character: properties that are required for interoperability and correct behavior in implementations, as well as for Unicode conformance. These semantics are cataloged in the Unicode Character Database, a collection of data files that contain the Unicode character code points and character names. The data files define character properties and mappings between Unicode characters (such as case mappings). The Unicode Character Database is an integral part of the Unicode Standard and contains the normative property and mapping information required to implement Unicode Standard algorithms such as the Bidirectional, Line Breaking, Normalization, Word Boundary Determination, and Case Folding algorithms. The data files also contain additional informative and provisional character property information.
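To make the idea of properties and mappings concrete, here is a minimal sketch that reads a few of them through Python's standard `unicodedata` module, which ships with a snapshot of the Unicode Character Database. The helper `describe` and the chosen sample characters are hypothetical, introduced only for this example; the set of properties shown is a small subset of what the data files define.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode Character Database properties for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<no name>"),
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "decomposition": unicodedata.decomposition(ch),
    }


# Version of the Unicode data bundled with this Python build.
print("UCD version:", unicodedata.unidata_version)

# Properties of a base letter, a precomposed letter, and a combining mark.
for ch in ("A", "\u00E9", "\u0301"):
    print(describe(ch))

# Mappings: normalization composes "e" + COMBINING ACUTE ACCENT into U+00E9,
# and case folding maps the German sharp s to "ss" for caseless comparison.
composed = unicodedata.normalize("NFC", "e\u0301")
folded = "Stra\u00DFe".casefold()
print(composed == "\u00E9", folded)
```

Because the module reflects whichever Unicode version the interpreter was built with, property values for recently encoded characters may differ between Python releases; printing `unidata_version` makes that dependency visible.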