- Windows uses UTF-16LE encoding internally for Unicode strings.
- In Windows, strings are either ANSI (encoded in the local machine's system code page, and therefore not portable) or Unicode (stored internally as UTF-16LE).
- UTF-8 is an encoding, and Unicode is a character set.
- A character set is a list of characters with unique numbers (code points).
- An encoding is an algorithm that translates a list of numbers to binary for disk storage.
- Binary data -> ENCODING -> Numbers -> CHARACTER SET -> Characters
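- In everyday use, "Unicode" is also a generic (ambiguous) label for the whole family of standards: the character set plus its encodings (UTF-8, UTF-16, UTF-32). Windows uses it to mean UTF-16LE.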
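- UTF-8 is variable length, with a minimum of one byte and a maximum of four bytes per character.
- UTF-16 is variable length, with a minimum of two bytes.
- UTF-8 is backward compatible with ASCII: the 128 ASCII characters encode to the same single bytes.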
- UTF-8 is the most commonly used encoding for the WWW.
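- A Byte Order Mark (BOM) is used by Unicode encodings to indicate whether a file is little or big endian.
- It is optional; when present, it is the first character in the file.
- Used with the UTF encodings (UTF-8, UTF-16 and UTF-32).
- In UTF-16, the value is U+FEFF, stored as the bytes FE, FF (big endian) or FF, FE (little endian).
- In UTF-8, the value is the bytes EF, BB, BF.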
- In UTF-8, the BOM is redundant (allowed but not recommended): UTF-8 is defined as a sequence of single bytes, so there is no byte order to indicate. It can still be used to signal that a file is UTF-8.
- In ISO 8859-1 (Latin1), the UTF-16 BOM characters appear as "þÿ" (Big Endian) or "ÿþ" (Little Endian), and the UTF-8 BOM appears as "ï»¿".
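- UCS-2 is obsolete and replaced by UTF-16, which is more powerful: it can represent every Unicode code point, whereas UCS-2 is limited to the 65,536 code points of the Basic Multilingual Plane (BMP). UTF-16 is not more compact: characters available in both take the same two bytes, and the extra characters take four.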
- UCS-2 is fixed width, UTF-16 is variable width with a minimum of two bytes and a maximum of four bytes.
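- UCS-2 and UTF-16 encode the characters of the BMP identically (same code points, same two bytes). They differ only for code points above U+FFFF, which UTF-16 encodes as a pair of surrogates and UCS-2 cannot represent.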
- Bi-directional text such as Arabic and Hebrew is handled by Unicode itself, so UTF-16 (like any other UTF encoding) can represent it.
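- Unicode (and therefore UTF-16) supports normalisation, which means that different code point sequences for the same character can be treated as the same (e.g. "é" as the single code point U+00E9, or as "e" followed by the combining accent U+0301). It does not equate different spellings such as can't and cannot.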
- UCS-2 has three forms: UCS-2, UCS-2LE and UCS-2BE. For plain UCS-2, a BOM is needed (mandatory), because the name alone does not say which byte order is used.
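The points in the list above (code points versus encoded bytes, variable-length encodings, BOM values and normalisation) can be checked with a short Python sketch. It uses only the standard `codecs` and `unicodedata` modules; the helper name `encoded_sizes` and the sample characters are illustrative choices, not something from these notes.

```python
import codecs
import unicodedata


def encoded_sizes(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    # "utf-16-le" is used so that no BOM is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# Character set: each character has a code point (a number).
# Encoding: each code point becomes one or more bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji
for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: {utf8_len} byte(s) in UTF-8, {utf16_len} in UTF-16")

# BOM byte sequences, as defined by the standard library.
print(codecs.BOM_UTF8.hex(), codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex())

# A UTF-16 BOM misread as ISO 8859-1 shows up as two visible characters.
bom_be_as_latin1 = codecs.BOM_UTF16_BE.decode("latin-1")
bom_le_as_latin1 = codecs.BOM_UTF16_LE.decode("latin-1")

# Normalisation: two code point sequences for the same character compare
# equal only after normalising (here to the composed form, NFC).
composed = "\u00e9"      # single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

The emoji sample lies outside the BMP, which is why it is the only one that needs four bytes in UTF-16.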
Vim supports (amongst others):
- latin1 (alias=ansi): AKA ISO 8859-1; also used for CP1252, which is very similar but not the same
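- cp437: the original IBM PC (DOS) code page. Vim describes it as similar to ISO 8859-1; both are 8-bit and share the ASCII range, but most bytes from 0x80 upwards map to different characters.
- utf-8 (alias=utf8): UTF-8 encoded Unicode (Vim's description says "32 bit", which presumably refers to the code point range, not to a fixed character width)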
- ucs-2 (alias=unicode): 16 bit UCS-2 encoded Unicode
- ucs-2le: 16 bit little-endian UCS-2 encoded Unicode
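Windows 1252 is an 8-bit character encoding of the Latin alphabet, and the default character set in English Windows. It is an extension of ISO 8859-1: the bytes 0x80 to 0x9F, which are control codes in ISO 8859-1, are used for printable characters such as the euro sign and curly quotes.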
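To see how these 8-bit code pages relate, the same byte can be decoded under each of them. This is a minimal sketch using Python's built-in codecs `latin-1`, `cp1252` and `cp437`; the helper `decode_byte` and the three sample bytes are illustrative assumptions.

```python
def decode_byte(value, encoding):
    """Decode a single byte value under the given 8-bit encoding."""
    return bytes([value]).decode(encoding)


# 0x41 is ASCII, 0xE9 is in the upper half, 0x80 is in the range where
# Windows 1252 departs from ISO 8859-1.
for value in (0x41, 0xE9, 0x80):
    code_points = {
        enc: f"U+{ord(decode_byte(value, enc)):04X}"
        for enc in ("latin-1", "cp1252", "cp437")
    }
    print(f"0x{value:02X} -> {code_points}")
```

Byte 0x41 is "A" in all three. Byte 0xE9 is "é" in both latin-1 and cp1252 but a Greek capital theta in cp437. Byte 0x80 is a control code in latin-1 and the euro sign in cp1252. A few bytes (for example 0x81) are undefined in Python's `cp1252` codec and raise an error when decoded.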
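Mac's TextEdit "Western (Windows Latin 1)" is Windows 1252, not UTF-8. A file that contains only ASCII characters is byte-for-byte identical in both encodings (and has no BOM), so it can look like UTF-8 without a BOM.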
Windows 7's Notepad saves in the following encodings:
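- ANSI (the system code page, no BOM)
- Unicode (UTF-16LE with a BOM)
- Unicode big endian (UTF-16BE with a BOM)
- UTF-8 (with a BOM)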
- If the BOM is missing from a UTF-16 file, RFC 2781 says that big-endian encoding should be assumed. (In practice, because Windows uses little-endian order by default, many applications assume little-endian encoding instead.)
- ASP default codepage is 1252 (a superset of ISO 8859-1).
- Serbian letters with accents appear as symbols in codepage 1252, but as the same letters without the accents in codepage 1251.
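The BOM handling noted above (use the BOM when present, otherwise fall back to a default byte order) can be sketched in Python as follows. The function name `decode_with_bom` and its `default` parameter are hypothetical, and the sketch assumes the input is known to be either UTF-8 with a BOM or UTF-16; it makes no attempt to detect UTF-32 or ANSI code pages.

```python
import codecs


def decode_with_bom(data, default="utf-16-be"):
    """Decode bytes using a leading BOM if there is one.

    Assumption: without a BOM the data is UTF-16, decoded with `default`.
    Big endian follows the RFC 2781 recommendation; pass "utf-16-le" to
    mimic the little-endian guess common in Windows software.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # No BOM: fall back to the caller's default byte order.
    return data.decode(default)


print(decode_with_bom(b"\xff\xfeh\x00i\x00"))   # little endian, with BOM
print(decode_with_bom(b"\x00h\x00i"))           # no BOM, big endian assumed
```

Python's own `utf-16` codec also consumes a BOM when decoding, but without one it falls back to the platform's native byte order, which is why the default is made explicit here.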