This document provides definitions for various terms related to W3C internationalization.
We welcome comments on this document, but to make it easier to track them, please raise separate issues for each comment, and point to the section you are commenting on using a URL.
This document can be pointed to for definitions of terms, or these definitions may be copied to other documents and slightly adapted.
The W3C Internationalization Working Group also uses definitions provided by the Unicode Consortium.
Application internal identifiers. Identifiers defined by or assigned by a user in a vocabulary that is internal to the document format or protocol and not intended for human interaction. Such values are generally not localizable content.
ASCII case-insensitive matching. Defined in [[INFRA]], this compares two sequences of code points as if all ASCII code points in the range 0x41 to 0x5A (A to Z) were mapped to the corresponding code points in the range 0x61 to 0x7A (a to z), but other code points are not case-folded. ASCII case-insensitive matching can be required when a vocabulary is itself constrained to ASCII.
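As a minimal sketch of the comparison described above, the following Python maps only the code points 0x41 to 0x5A onto 0x61 to 0x7A and leaves everything else untouched. The function names are illustrative and do not come from [[INFRA]].

```python
def ascii_lowercase(text: str) -> str:
    """Map A-Z (0x41-0x5A) to a-z (0x61-0x7A); leave all other code points alone."""
    return "".join(
        chr(ord(ch) + 0x20) if 0x41 <= ord(ch) <= 0x5A else ch
        for ch in text
    )


def ascii_case_insensitive_match(a: str, b: str) -> bool:
    """Compare two strings, ignoring case differences in ASCII letters only."""
    return ascii_lowercase(a) == ascii_lowercase(b)


print(ascii_case_insensitive_match("Content-Type", "content-type"))  # True
print(ascii_case_insensitive_match("É", "é"))                        # False
```

Note that non-ASCII letters such as É and é are deliberately not treated as equal, which is the point of restricting the mapping to ASCII.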
Base direction determines the general arrangement and progression of content when bidirectional text is displayed. The Unicode Bidirectional Algorithm is primarily focused on arranging adjacent characters, based on character properties. Base direction works at a higher level, and dictates (a) the visual order and direction in which runs of strongly-typed LTR and RTL characters are displayed, and (b) where there are weakly-typed characters such as punctuation, the placement of those items relative to the other content.
Basic Multilingual Plane (BMP). The first 65,536 code point positions in the Unicode character set are said to constitute the Basic Multilingual Plane. The BMP includes most of the more commonly used characters.
Bidi algorithm, see Unicode Bidirectional Algorithm.
Bidirectional text (often referred to as "bidi text" for short) refers to text that mixes runs of both LTR and RTL text inline. It is common for right-to-left scripts, such as Arabic and Hebrew, to contain short runs of left-to-right text (most commonly in the Latin script), and several of the scripts that are predominantly right-to-left display numbers from left-to-right. Bidirectional text is the source of many of the difficulties when dealing with RTL scripts.
Basic language range. A language range consisting of a sequence of subtags separated by hyphens. That is, it is identical in appearance to a language tag.
Bidi isolation often needs to be applied to a range of text in order to prevent the automatic rules of the Unicode Bidirectional Algorithm from incorrectly ordering that content in relation to the surrounding text. For example, numbers following right-to-left text in memory are automatically positioned to the left of that text by the Bidi Algorithm, but sometimes need to appear to the right. Another example occurs when lists of RTL items occur in an LTR sentence: the Bidi Algorithm will automatically assume that the order of items in the list should be "3 ,2 ,1", but actually what's needed is "1, 2, 3". In HTML, bidi isolation can be applied to a range of text by enclosing it in an element with a dir attribute. In plain text there are Unicode formatting characters that can do the job. These mechanisms remove unwanted 'spillover effects'.
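For plain text, one way to isolate a range is to wrap it in the isolate formatting characters defined in [[UAX9]]: U+2068 FIRST STRONG ISOLATE and U+2069 POP DIRECTIONAL ISOLATE. The sketch below is illustrative only; the helper name and the sample list are assumptions, not part of any specification.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE: opens an isolated run, direction guessed from content
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: closes the most recent isolate


def isolate(text: str) -> str:
    """Wrap text so the bidi algorithm treats it as a unit separate from its surroundings."""
    return f"{FSI}{text}{PDI}"


# Hypothetical example: a list of RTL items inserted into an LTR sentence.
items = ["אחד", "שתיים", "שלוש"]
sentence = "Items: " + ", ".join(isolate(item) for item in items)
```

Each item is isolated separately, so the commas between them belong to the LTR sentence and the items keep the order "1, 2, 3".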
Canonical Unicode locale identifier. A well-formed language tag resulting from the application of the Unicode locale identifier canonicalization rules found in [[UAX35]]. This process converts any valid [[BCP47]] language tag into a valid Unicode locale identifier. For example, deprecated subtags or irregular grandfathered tags are replaced with their preferred value from the IANA language subtag registry.
Character encoding. The way the coded character set is mapped to bytes for manipulation in a computer. Commonly referred to as just the encoding. For examples and further descriptions see Character encodings: Essential concepts.
Character set or repertoire. The set of characters one might use for a particular purpose – be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).
Case mapping is the process of transforming characters to a specific case, such as UPPER, lower, or Titlecase. For those scripts that have a case distinction, Unicode defines a default UPPER, lower, and Titlecase character mapping for each Unicode code point. Case mapping appears simple at first. However, there are variations that need to be considered when treating the full range of Unicode in diverse languages.
Case folding is the process of making two texts which differ only in case identical for comparison purposes, that is, it is meant for the purpose of string matching. This is distinct from case mapping, which is primarily meant for display purposes. As with the default case mappings, Unicode defines default case fold mappings ("case folding") for each Unicode code point. Unicode defines two forms of case folding.
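The difference between case mapping and case folding can be seen in a short Python sketch. It assumes Python's built-in `str.casefold()`, which the Python documentation describes as implementing Unicode case folding, and `str.lower()` as a default lowercase mapping. The function name is illustrative.

```python
def caseless_match(a: str, b: str) -> bool:
    """Compare two strings after case folding (no Unicode normalization is applied)."""
    return a.casefold() == b.casefold()


# Lowercase mapping keeps the sharp s, so the two spellings still differ.
print("Straße".lower() == "STRASSE".lower())   # False
# Case folding maps the sharp s to "ss", so the two spellings match.
print(caseless_match("Straße", "STRASSE"))     # True
```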
Case sensitive matching. Code points are compared directly with no case folding.
Coded character set. A set of characters where a unique number has been assigned to each character. Units of a coded character set are known as code points.
Code point. A code point value represents the position of a character in a coded character set. For example, the code point for the letter á in the Unicode coded character set is 225 in decimal, or 0xE1 in hexadecimal notation. Hexadecimal notation is commonly used for referring to code points. See also Unicode code point.
CLDR, see Common Locale Data Repository.
Unicode code point. The numeric value assigned to each Unicode character. Unicode code points range from 0 to 0x10FFFF. (See Section 4.1 in [[CHARMOD]] for a deeper discussion of character encoding terminology.) Unicode code points are denoted as U+hhhh, where hhhh is a sequence of at least four, and at most six hexadecimal digits. For example, the character € [U+20AC EURO SIGN] has the code point U+20AC, while the character 😺 [U+1F63A SMILING CAT FACE WITH OPEN MOUTH] has the code point U+1F63A.
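The U+hhhh notation can be produced mechanically from a character. The following Python sketch pads to at least four hexadecimal digits, as described above; the function name is an illustrative choice.

```python
def to_u_notation(character: str) -> str:
    """Return the U+hhhh notation for a single Unicode character."""
    if len(character) != 1:
        raise ValueError("expected exactly one code point")
    # :04X pads to at least four uppercase hex digits; larger values use five or six.
    return f"U+{ord(character):04X}"


print(to_u_notation("€"))   # U+20AC
print(to_u_notation("😺"))  # U+1F63A
```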
Common Locale Data Repository (or CLDR). The Common Locale Data Repository ([[CLDR]]) is a Unicode Consortium project that defines, collects, and curates sets of data needed to enable locales in systems or operating environments. CLDR data and its locale model are widely adopted, particularly in browsers.
Consumer. When talking about strings on the Web, the W3C Internationalization group refers to a consumer as any process that receives natural language strings, either for display or processing.
Daylight Saving Time (DST) or "Summer Time" was adopted as a way of allowing people more sunlight hours in the evening. DST varies from country to country (not to mention locality-to-locality) and often has special one-off changes to accommodate special events. Not all regions observe DST: usually those closer to the equator do not need it. In converting times it is important to know when DST was introduced, and sometimes abandoned, for the local area, as well as on what dates DST starts and ends (which can vary from year to year). For example, Korea Standard Time and Japan Standard Time currently use the same zone offset and neither uses daylight saving. However, Japan abandoned DST in 1951, while South Korea used it last in 1988, so an application that tracks time values that reach back that far might need to track these time zones separately.
Document character set. For XML and HTML (from version 4.0 onwards) the document character set is defined to be the Universal Character Set (UCS) as defined by both ISO/IEC 10646 and Unicode standards. What this means is that the logical model describing how XML and HTML are processed is described in terms of the set of characters defined by Unicode. In practical terms, this means that browsers usually convert all text to Unicode internally.
Extended language range. A language range consisting of a sequence of hyphen-separated subtags. In an extended language range, a subtag can either be a valid subtag or the wildcard subtag *, which matches any value.
Field-based formats divide the date and/or time into separate field values such as year, month, day, hour, minute, and second, for example 2016-06-11T06:10:16. Contrast this with an alternative way to express the same time, 1465621816590, which is not field-based and is rather hard to read. Field-based times may or may not be tied to either UTC or the local time zone – or they may be indeterminate. Field-based times are also typically tied to a specific calendar (such as the Gregorian calendar). The formats described by the ISO 8601 standard are field-based.
First-strong detection is an algorithm that looks for the first strongly-directional character in a string (while ignoring embedded runs of isolated text), and then uses that to guess at the appropriate base direction for the string as a whole. Unicode code points are associated with properties relating to text direction: generally, letters in right-to-left scripts such as Arabic and Hebrew have a strong RTL direction, whereas Latin and Han characters have a strong LTR direction. Other characters, such as punctuation, only have a weak intrinsic directionality, and the actual directionality is determined according to the context in which they are found.
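A rough sketch of first-strong detection in Python follows. It relies on the standard `unicodedata.bidirectional()` function, which returns the bidi class of a character, and it skips anything between an isolate initiator and its matching PDI. The function name and the "ltr"/"rtl" return values are assumptions made for illustration; this is not a full implementation of the rules in [[UAX9]].

```python
import unicodedata
from typing import Optional

ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}


def first_strong_direction(text: str) -> Optional[str]:
    """Guess a base direction from the first strong character outside any isolate."""
    depth = 0  # how many isolates are currently open
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ISOLATE_INITIATORS:
            depth += 1
        elif bidi_class == "PDI":
            if depth > 0:
                depth -= 1
        elif depth == 0:
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    return None  # no strong character found; the caller needs a default


print(first_strong_direction("Hello"))       # ltr
print(first_strong_direction("123 مرحبا"))   # rtl
```

Digits, spaces, and punctuation are not strong characters, so the second example is decided by the first Arabic letter.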
Floating times are not fixed to a specific incremental time value or time zone. When you apply time zone information to floating times they produce a range of acceptable incremental time values, because they represent a nominal time which is described in the same way in all time zones around the world. For example, Saturday 11 June 2016 happens to be the date of the British Queen's official 90th birthday. The specific time when 11th June starts or ends in Britain may actually be on Friday or Sunday in other countries, because their clocks are set differently, but the date of the event is always referred to as Saturday 11 June. Other examples of floating time events include the publication date for an issue of a newspaper, the date the Tokyo Olympics starts, the time the New Year starts, office hours set to "9 to 5" regardless of time zone, and so on.
Glyph. The visual representation of a character when rendered by a particular font. In more complex orthographies a glyph may represent only a part of a character, or may represent more than one character. A font is a collection of glyph shapes, and different fonts or font rules can render a given character using a variety of different glyphs. For example, the letter 'a' can be represented using regular (a), bold (a), or italic (a) glyphs.
Grapheme. A character or a sequence of characters in a visual representation of some text that a typical user would perceive as being a single unit (character). Graphemes are important for a number of text operations such as sorting or text selection, so it is necessary to be able to compute the boundaries between each user-perceived character. For more information about graphemes and grapheme clusters, with examples, see Character encodings: Essential concepts.
Grapheme clusters are defined by the Unicode Standard as the default mechanism for computing an approximation to graphemes (see Unicode Standard Annex #29: Text Segmentation [[UAX29]]). Two types of default grapheme cluster are defined. Unless otherwise noted, grapheme cluster in this document refers to an extended default grapheme cluster. (A discussion of grapheme clusters is also given in Section 2 of the Unicode Standard, [[Unicode]]. Cf. near the end of Section 2.11 in version 8.0 of The Unicode Standard.) Because different natural languages have different needs, grapheme clusters can also sometimes require tailoring. For example, a Slovak user might wish to treat the default pair of grapheme clusters "ch" as a single grapheme cluster. Note that the interaction between the language of string content and the end-user's preferences might be complex.
Gregorian calendar. The most widely used way of representing civil time. It is a solar calendar, with years usually consisting of 365 days, plus the concept of a "leap year". This adds an additional day every 4 years, except when the year is evenly divisible by 100 (unless the year is also evenly divisible by 400). There are numerous other calendars in use around the world, some of which are lunar calendars, some that are based on a different start date than the Gregorian calendar, and some that are reset each time a prominent person dies. Often these calendars are still used for religious purposes, but sometimes you will also find them being used in newspapers and emails, or for birth dates. There are technologies, such as ICU or Dojo, that support conversion between different calendaring systems.
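The leap year rule described above can be written as a single expression. This is a small illustrative sketch; the function name is arbitrary.

```python
def is_gregorian_leap_year(year: int) -> bool:
    """Every 4th year is a leap year, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


print(is_gregorian_leap_year(2016))  # True
print(is_gregorian_leap_year(1900))  # False: divisible by 100 but not by 400
print(is_gregorian_leap_year(2000))  # True: divisible by 400
```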
Incremental time is a way of representing time in computers that is based on a progression of fixed integer units that increase monotonically from a specific point in time (called the "epoch"). Java (and many other systems) count time as the number of milliseconds since midnight (00:00 a.m.) on January 1, 1970 in UTC (less all of the intervening leap seconds). Other systems use different units and/or epochs. For example, the incremental time for 11 June, 2016 at 6.10am BST in JavaScript is 1465621816590. Most programming languages and operating environments provide or use incremental time for working with time values. However, incremental time is not usually seen directly by users, but is typically mapped to a field-based time format for interchange or for human consumption.
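To illustrate the mapping between incremental time and a field-based form, the sketch below converts a millisecond count since the 1970 epoch into a UTC date and time, and back again. It uses integer arithmetic rather than floating point so that the millisecond value round-trips exactly. The function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_millis(millis: int) -> datetime:
    """Convert milliseconds since the epoch to a field-based UTC time."""
    return EPOCH + timedelta(milliseconds=millis)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a time zone aware datetime back to milliseconds since the epoch."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


moment = from_epoch_millis(1465621816590)
print(moment.isoformat())        # 2016-06-11T05:10:16.590000+00:00
print(to_epoch_millis(moment))   # 1465621816590
```

The result is 05:10 in UTC, which corresponds to 06:10 in British Summer Time.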
Internationalization. The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated i18n because there are eighteen letters between the "I" and the "N" in the English word.
International Preferences. A user's particular set of language and formatting preferences and associated cultural conventions. Software can use these preferences to correctly process or present information exchanged with that user.
IANA Language Subtag Registry. A machine-readable text file available via IANA which contains a comprehensive list of all of the subtags valid in language tags. (Link: Registry)
Language metadata. When contrasted with the text-processing language, this indicates the intended use of the resource as a whole. For example, such metadata may be used for searching for a relevant resource, for serving the right language version, for classification, etc. This type of language declaration differs from that of the text-processing language declaration in that (a) the value for such declarations may be more than one language subtag, and (b) the language value declared doesn't indicate which bits of a multilingual resource are in which language.
Language tag extension. A system of additional [[BCP47]] subtags introduced by a single letter or digit subtag registered with IANA and permitting additional types of language identification.
Language negotiation is any process which selects or filters content based on language. Usually this implies selecting content in a single language (or falling back to some meaningful default language that is available) by finding the best matching values when several languages or locales [[LTLI]] are present in the content. Some common language negotiation algorithms include the Lookup algorithm in [[BCP47]] or the BestFitMatcher in [[ECMA-402]].
Language priority list. A collection of one or more language ranges identifying the user's language preferences for use in matching. As the name suggests, such lists are normally ordered or weighted according to the user's preferences. The HTTP [[RFC2616]] Accept-Language [[RFC3282]] header is an example of one kind of language priority list.
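As an illustration of a weighted language priority list, the following sketch parses the value of an Accept-Language header into language ranges ordered by their q weights. It is a simplified reading of the header syntax (a missing weight is treated as 1.0 and entries with unparsable weights are skipped), and the function name is an assumption.

```python
from typing import List, Tuple


def parse_language_priority_list(header_value: str) -> List[Tuple[str, float]]:
    """Return (language range, weight) pairs, highest weight first."""
    entries = []
    for item in header_value.split(","):
        language_range, _, parameter = item.strip().partition(";")
        language_range = language_range.strip()
        if not language_range:
            continue
        weight = 1.0  # default when no q parameter is given
        parameter = parameter.strip()
        if parameter.startswith("q="):
            try:
                weight = float(parameter[2:])
            except ValueError:
                continue  # simplification: ignore entries with malformed weights
        entries.append((language_range, weight))
    # sorted() is stable, so ranges with equal weight keep their original order.
    return sorted(entries, key=lambda entry: -entry[1])


print(parse_language_priority_list("en;q=0.5, fr"))
# [('fr', 1.0), ('en', 0.5)]
```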
Language range. A string similar in structure to a language tag that is used for "identifying sets of language tags that share specific attributes".
Language subtag. A sequence of ASCII letters or digits separated from other subtags by the hyphen-minus character and identifying a specific element of meaning within the overall language tag. In [[BCP47]], subtags can consist of upper or lowercase ASCII letters (the case carries no distinction) or ASCII digits. Subtags are limited to no more than eight characters (although additional length restrictions apply depending on the specific use of the subtag).
Language tag. A string used as an identifier for a language, usually referring explicitly to a [[BCP47]] language tag. These language tags consist of one or more subtags.
Legacy character encodings are character encoding forms that do not encode the full repertoire of characters in the Unicode character set.
Locale. An identifier (such as a language tag) for a set of international preferences. Usually this identifier indicates the preferred language of the user and possibly includes other information, such as a geographic region (such as a country). A locale is passed in APIs or set in the operating environment to obtain culturally-affected behavior within a system or process.
Locale-aware (or Enabled). A system that can respond to changes in the locale with culturally and language-specific behavior or content. Generally, systems that are internationalized can support a wide range of locales in order to meet the international preferences of many kinds of users.
Locale fallback. The process of searching for translated content, locale data, or other resources by "falling back" from more-specific resources to more-general ones following a deterministic pattern.
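One common deterministic pattern is to remove subtags from the end of a language tag until a match is found. The sketch below models this on the truncation used by the Lookup algorithm, which also drops a single-character subtag when it would otherwise be left at the end. The function names and the set of available resources are hypothetical, and tags are compared exactly as given for brevity.

```python
from typing import Iterable, List


def fallback_chain(tag: str) -> List[str]:
    """List a tag followed by progressively more general versions of it."""
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A trailing single-character subtag (such as "u" or "x") is removed too.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def find_resource(tag: str, available: Iterable[str], default: str) -> str:
    """Return the most specific available resource, or the default if none match."""
    available = set(available)
    for candidate in fallback_chain(tag):
        if candidate in available:
            return candidate
    return default


print(fallback_chain("zh-Hant-TW"))                       # ['zh-Hant-TW', 'zh-Hant', 'zh']
print(find_resource("zh-Hant-TW", {"zh", "en"}, "en"))    # zh
```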
Locale-neutral. A non-linguistic field is said to be locale-neutral when it is stored or exchanged in a format that is not specifically appropriate for any given language, locale, or culture and which can be interpreted unambiguously for presentation in a locale aware way.
Localizable content. Document contents intended as human-readable text, as distinct from any of the surrounding or embedded syntactic content that forms part of the document structure. Note that syntactic content can have localizable content embedded in it, such as when an [[HTML]] img element has an alt attribute containing a description of the image.
Localization. The tailoring of a system to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as l10n because there are ten letters between the "L" and the "N" in the English word. When a particular set of content and preferences corresponding to a specific set of international preferences is operationally available, then the system is said to be localized.
Logical order. Some scripts, in particular Arabic and Hebrew, are written from right to left. Text including characters from these scripts can run in both directions and is therefore called bidirectional text. The Unicode Standard [[Unicode]] requires that characters be stored and interchanged in logical order, i.e. roughly corresponding to the order in which text is typed in via the keyboard or spoken (for a more detailed definition see [[Unicode]], Section 2.2). Logical ordering is important to ensure interoperability of data, and also benefits accessibility, searching, and collation.
LTR stands for "left-to-right" and refers to the inline base direction of left-to-right [[UAX9]]. This is the base text direction used by languages whose starting character progression begins on the left side of the page in horizontal text. It's used for scripts such as Latin, Cyrillic, Devanagari, and many others.
Natural Language (sometimes just language) refers to the spoken, written, or signed communications used by human beings.
Non-linguistic Field. Any element of a data structure not intended for the storage or interchange of natural language textual data. This includes non-string data types, such as booleans, numbers, dates, and so forth. It also includes strings, such as program or protocol internal identifiers. This document uses the term field as a shorthand for this concept.
Metadata is additional information about data. Key types of metadata for internationalization are language metadata and metadata to support bidirectional text. Metadata has a scope, e.g., a string or a set of strings. In absence of explicit metadata, defaults might apply, e.g. defaults for the base direction of a text.
Producer. When talking about strings on the Web, the W3C Internationalization group refers to a producer as any process where natural language string data is created for later storage, processing, or interchange.
Resource. In the context of W3C Internationalization documents, a given document, file, or protocol "message" which includes both the localizable content as well as the syntactic content such as identifiers surrounding or containing it. For example, in an HTML document that also has some CSS and a few script tags with embedded JavaScript, the entire HTML document, considered as a file, is a resource. This term is intentionally similar to the term 'resource' as used in [[RFC3986]], although here the term is applied loosely.
RTL stands for "right-to-left" and refers to the inline base direction of right-to-left [[UAX9]]. This is the base text direction used by languages whose starting character progression begins on the right side of the page in horizontal text. It's used for a variety of scripts which include Arabic, Hebrew, N'Ko, Adlam, Thaana, and Syriac among others.
Serialization agreement. When talking about strings on the Web, the W3C Internationalization group refers to serialization agreements as the common understanding between a producer and consumer about the serialization of string metadata: how it is to be understood, serialized, read, transmitted, removed, etc.
Supplementary characters. Beyond the Basic Multilingual Plane the Unicode character set also contains space for around a million additional code point positions. Characters in this latter range are referred to as supplementary characters. In the UTF-16 encoding, each supplementary character is encoded using a pair of surrogate code units.
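The arithmetic that UTF-16 uses to split a supplementary code point into two surrogates can be sketched as follows. The function name is illustrative; the constants are those of the UTF-16 encoding form.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F63A)
print(f"{high:04X} {low:04X}")  # D83D DE3A
```

The same two values appear when the character is encoded directly, for example with `"😺".encode("utf-16-be")`.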
Syntactic content is any text in a document format or protocol that belongs to the structure of the format or protocol. This definition includes values that are typically thought of as "markup" but can also include other values, such as the name of a field in an HTTP header. Syntactic content consists of all of the characters that form the structure of a format or protocol. For example, < and > (as well as the element name and various attributes they surround) are part of the syntactic content in an HTML document. Syntactic content usually is defined by a specification or specifications and includes both the defined, reserved keywords for the given protocol or format as well as string tokens and identifiers that are defined by document authors to form the structure of the document (rather than the "content" of the document).
Text-processing language. The language in which a specific range of text is actually written. This needs to be declared so that user agents or applications that manipulate the text, such as voice browsers, spell checkers, style processors, hyphenators, etc., can apply the appropriate rules to the text in question. So we are, by necessity, talking about associating a single language with a specific range of text. Contrast this with language metadata.
Time zone. A set of rules for determining the local time (wall time) as it relates to incremental time (as used in most computing systems) for a particular geographical region, and vice versa. Time zone rules have to take into account zone offsets plus any daylight saving modifications to wall time that apply.
Time zone identifiers allow you to refer to a particular difference from UTC that includes both zone offsets and daylight saving time. The most definitive reference for identifying sets of time zone rules is the TZ database (also known as the Olson time zone database), which is used by systems such as various commercial UNIX operating systems, Linux, Java, CLDR, ICU, and many other systems and libraries. Other systems exist: for example, Microsoft Windows uses its own data set and identifiers. In the TZ database, time zones are given IDs that typically consist of a region and exemplar city. An exemplar city is a city in the time zone in question that should be well-known to people using the time zone. For example, the U.S. Pacific time zone has a TZ database ID of America/Los_Angeles. The TZ database also supplies aliases for many IDs; for example, Asia/Ulan_Bator is equivalent to Asia/Ulaanbaatar. The Common Locale Data Repository (CLDR) can be used to provide a localized form for the IDs. Note that some systems, such as Apple's Mac OS, provide additional exemplar cities.
Transcoder. A process that converts text between two character encodings. Most commonly in W3C internationalization documents it refers to a process that converts from a legacy character encoding to a Unicode encoding form, such as UTF-8.
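A transcoder can be sketched in a few lines of Python by decoding the source bytes to Unicode code points and then encoding them again. The function name and the default target of UTF-8 are illustrative choices; the encoding labels are ones that Python's standard codecs accept.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Convert bytes in one character encoding to bytes in another."""
    text = data.decode(source_encoding)   # bytes -> Unicode code points
    return text.encode(target_encoding)   # code points -> bytes in the target encoding


# In windows-1252 the euro sign is the single byte 0x80; in UTF-8 it takes three bytes.
print(transcode(b"\x80", "windows-1252"))  # b'\xe2\x82\xac'
```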
Unicode Bidirectional Algorithm or Bidi algorithm. This is the name for the rules described in the Unicode Standard Annex #9, “Unicode Bidirectional Algorithm” [[UAX9]]. Those rules describe how inline bidirectional text should be rendered for scripts such as Arabic, Hebrew, Thaana, N'Ko, Adlam, etc. The effects of the bidi algorithm depend on the base direction and the directional properties of the characters to which it is applied.
Unicode Locale Identifier or Unicode Locale. A language tag that follows the additional rules and restrictions on subtag choice defined in UTS #35 [[UAX35]]. Any valid Unicode locale identifier is also a valid [[BCP47]] language tag, but a few valid language tags are not also valid Unicode locale identifiers.
Coordinated Universal Time (UTC) is the basis for modern timekeeping. Among other things, it provides a common baseline for converting between incremental and wall time. UTC is also known as GMT (Greenwich Mean Time). There are some subtle differences between the two, but none that the average person would notice.
The time zone offset for UTC is 0. UTC is often indicated in field-based formats using Z.
User-facing identifiers are identifiers defined by or assigned by a user in a vocabulary that is intended to be at least potentially visible to end-users (and thus is localizable content).
User-perceived character. See grapheme.
User-supplied value. Unreserved syntactic content in a vocabulary that is assigned by users, as distinct from reserved keywords in a given format or protocol. Users generally expect that their user-supplied values can be words or phrases in their preferred natural language. This is why [[CHARMOD]] recommends that "Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive."
Valid language tag. A language tag that is well-formed and which also conforms to the additional conformance requirements in [[BCP47]], notably that each of the subtags appears in the IANA Language Subtag Registry.
Vocabulary. The list of reserved keywords and/or rules for assigning user-supplied values (such as identifiers) in a format or protocol. This can include restrictions on range, order, or type of characters that can appear in different places. For example, HTML defines the names of its elements and attributes, as well as enumerated attribute values, which defines the "vocabulary" of HTML syntactic content. Another example would be ECMAScript, which restricts the range of characters that can appear at the start or in the body of an identifier or variable name. It applies different rules for other cases, such as to the values of string literals. Values within a vocabulary fall into two broad classes: those that are meant to be seen, read, or interacted with by humans (and thus might be expected to contain natural language text); and those that are application or protocol internal and not intended for human interaction.
Wall time or local time is a moment in time that can be mapped to a specific point in incremental time if you apply any relevant time zone information, but it corresponds to what a person would recognise the time to be if they looked at a clock and/or calendar mounted on a wall in a particular place. So, for example, the time displayed by a computer in the UK may be Sat 11 Jun 06:10. By applying knowledge about how that time relates to UTC (in this case, adjusting by one hour to account for British Summer Time) it is possible to convert that to the incremental time 1465621816590. It's also possible to convert that to a wall time in another location, such as San Francisco, where someone looking at their computer's time display at the same time would have seen Fri 10 Jun 22:10.
Well-formed language tag. A language tag that follows the grammar defined in [[BCP47]]. That is, it is structurally correct, consisting of subtags made up of ASCII letters and digits, each of the prescribed length, separated by hyphens.
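The structural shape described here (hyphen-separated subtags of one to eight ASCII letters or digits) can be checked with a short Python sketch. This is a deliberately loose approximation: the full grammar in [[BCP47]] also constrains the length and position of each kind of subtag, so a tag that passes this check is not guaranteed to be well-formed. The function name is an assumption.

```python
import re

# One subtag: 1 to 8 ASCII letters or digits.
SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def has_language_tag_shape(tag: str) -> bool:
    """Check only the general subtag shape, not the complete BCP 47 grammar."""
    return all(SUBTAG.fullmatch(subtag) for subtag in tag.split("-"))


print(has_language_tag_shape("zh-Hant-TW"))  # True
print(has_language_tag_shape("en_US"))       # False: underscore is not a valid separator
```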
Zone offset. An amount that is added to or subtracted from UTC based on the location of the event around the world relative to the prime meridian. Usually offsets are at one-hour intervals, but offsets can also include other differences, such as 30 or 45 minutes. A common way to express a zone offset in field-based formats is with +/- followed by the offset. For example, Japan is 9 hours ahead of UTC, so you may see a time written as 2016-06-11 05:10+09:00. Note, importantly, that the zone offset does not help you convert times to wall time where daylight saving time is in force.
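The example time above can be read back into its UTC equivalent using the offset it carries. This sketch assumes Python 3.7 or later, where `datetime.fromisoformat()` accepts this form; the function name is illustrative. It recovers only the offset, not the time zone rules behind it.

```python
from datetime import datetime, timezone


def parse_with_offset(text: str) -> datetime:
    """Parse a field-based time that carries an explicit zone offset."""
    parsed = datetime.fromisoformat(text)
    if parsed.utcoffset() is None:
        raise ValueError("no zone offset present")
    return parsed


tokyo = parse_with_offset("2016-06-11 05:10+09:00")
print(tokyo.utcoffset())               # 9:00:00
print(tokyo.astimezone(timezone.utc))  # 2016-06-10 20:10:00+00:00
```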