- LC_ALL is the environment variable that overrides all the other localisation settings (except $LANGUAGE under some circumstances).
- Different aspects of localisation (like the thousands separator or decimal point character, character set, sorting order, month and day names, language of application messages such as error messages, currency symbol) can be set using a few environment variables.
- You’ll typically set $LANG to your preference with a value that identifies your region (like fr_CH.UTF-8 if you’re in French-speaking Switzerland, using UTF-8).
- The individual LC_xxx variables each override a certain aspect. LC_ALL overrides them all. The locale command, when called without arguments, gives a summary of the current settings.
For instance, on a GNU system, you get:
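As a minimal sketch of what that summary looks like, the snippet below runs locale with LC_ALL forced to C so that the result does not depend on your own settings. It assumes the locale utility is installed. Categories whose value is only implied by LC_ALL or LANG, rather than set directly, are printed in double quotes.

```shell
# Force every category to C so the summary is reproducible.
summary=$(LC_ALL=C locale)
printf '%s\n' "$summary"

# Pick out two lines: LC_ALL is set explicitly, so it is printed unquoted;
# LC_TIME is only implied by LC_ALL, so its value is printed in double quotes.
printf '%s\n' "$summary" | grep -E '^(LC_ALL|LC_TIME)='
```

Running plain locale without the LC_ALL=C prefix shows your actual settings instead.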
- You can override an individual setting by setting the corresponding LC_xxx variable, for instance LC_TIME for the date and time format.
- Or override everything with LC_ALL.
- In a script, if you want to force a specific setting, your best and generally only option is to force LC_ALL.
- The C locale is a special locale that is meant to be the simplest locale. The other locales are for humans; the C locale is for computers.
- In the C locale, characters are single bytes and the charset is ASCII (it is not required to be, but in practice it will be on the systems most of us will ever get to use). The sorting order is based on the byte values, and the language is usually US English, though for application messages (as opposed to things like month or day names or messages by system libraries) that is at the discretion of the application author. Things like currency symbols are not defined.
- On some systems, there’s a difference with the POSIX locale where for instance the sort order for non-ASCII characters is not defined.
- You generally run a command with LC_ALL=C to keep the user’s settings from interfering with your script.
- For instance, if you want [a-z] to match the 26 ASCII characters from a to z, you have to set LC_ALL=C.
- On GNU systems, LC_ALL=C and LC_ALL=POSIX (or LC_MESSAGES=C|POSIX) override $LANGUAGE, while LC_ALL=anything-else wouldn’t.
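The points above about byte-value sorting and [a-z] in the C locale can be checked with a short sketch. It assumes only POSIX sort, grep and paste; the sample input is made up for illustration.

```shell
# Byte-value ordering: in the C locale, uppercase ASCII letters (0x41-0x5A)
# sort before lowercase ones (0x61-0x7A).
sorted=$(printf '%s\n' b A a B | LC_ALL=C sort | paste -sd ' ' -)
echo "$sorted"    # A B a b

# In the C locale, [a-z] matches only the 26 lowercase ASCII letters.
# The input lines are: a, m, z, A, e-acute (UTF-8 bytes 0xC3 0xA9) and 5.
count=$(printf 'a\nm\nz\nA\n\303\251\n5\n' | LC_ALL=C grep -c '^[a-z]$')
echo "$count"     # 3
```

Without LC_ALL=C, both results depend on the collation rules of whatever locale the user happens to have.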
A few cases where you typically need to set LC_ALL=C:
- sort -u or sort … | uniq…. In many locales other than C, on some systems (notably GNU ones), some characters have the same sorting order. sort -u doesn’t report unique lines, but one of each group of lines that have equal sorting order. So if you do want unique lines, you need a locale where characters are bytes and all characters have a different sorting order.
- The same applies to the = operator of POSIX-compliant expr or the == operator of POSIX-compliant awks (mawk and gawk are not POSIX in that regard), which don’t check whether two strings are identical but whether they sort the same.
- Character ranges like in grep. If you mean to match a letter in the user’s language, use grep '[[:alpha:]]' and don’t modify LC_ALL. But if you want to match the a-zA-Z ASCII characters, you need either LC_ALL=C grep '[[:alpha:]]' or LC_ALL=C grep '[a-zA-Z]'.
- [a-z] matches the characters that sort after a and before z (though with many APIs it’s more complicated than that). In other locales, you generally don’t know what those are. For instance, some locales ignore case for sorting, so [a-z] in some APIs, like bash patterns, could include [B-Z] or [A-Y].
- In many UTF-8 locales (including en_US.UTF-8 on most systems), [a-z] will include the Latin letters from a to y with diacritics but not those of z (since z sorts before them).
- Floating-point arithmetic in ksh93. ksh93 honours the decimal_point setting in LC_NUMERIC. If you write a script that contains a=$((1.2/7)), it will stop working when run by a user whose locale has comma as the decimal separator.
Then you need things like:
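One portable way to do that is sketched below. It swaps in awk for ksh93’s own arithmetic, so it is not the ksh93-specific approach. The idea is to pin the locale to C for the computation only, so that the dot is always the decimal point. The helper name c_div is hypothetical, and a POSIX awk is assumed.

```shell
# Hypothetical helper: divide $1 by $2 with "." as the decimal point,
# regardless of the caller's LC_NUMERIC. LC_ALL=C applies to awk only.
c_div() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.4f\n", a / b }'
}

c_div 1.2 7    # prints 0.1714
```

The result is printed with a dot too, so it would need converting back if it is to be shown to the user in their own convention.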
- As a side note: the , decimal separator conflicts with the , arithmetic operator which can cause even more confusion.
- When you need characters to be bytes. Nowadays, most locales are UTF-8 based, which means characters can take up from 1 to 4 bytes (up to 6 in the original, pre-RFC 3629 definition of UTF-8). When dealing with data that is meant to be bytes, with text utilities, you’ll want to set LC_ALL=C. It will also improve performance significantly because parsing UTF-8 data has a cost.
- A corollary of the previous point: when processing text where you don’t know what character set the input is written in, but can assume it’s compatible with ASCII (as virtually all charsets are). For instance, grep '<.*>' to look for lines containing a <, > pair will not work if you’re in a UTF-8 locale and the input is encoded in a single-byte 8-bit character set like iso8859-15. That’s because . only matches characters, and non-ASCII characters in iso8859-15 are likely not to form a valid character in UTF-8. On the other hand, LC_ALL=C grep '<.*>' will work because any byte value forms a valid character in the C locale.
- Any time you process input or output data that does not come from, or is not intended for, a human. If you’re talking to a user, you may want to use their conventions and language, but if, for instance, you generate some numbers to feed some other application that expects English-style decimal points, or English month names, you’ll want to set LC_ALL=C.
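As an illustration of machine-oriented output, here is a minimal sketch with two hypothetical helpers, http_date and machine_number. It assumes POSIX date and awk. Both helpers force LC_ALL=C so that the consumer always sees English abbreviations and a dot as the decimal point.

```shell
# Hypothetical helper: an HTTP-style timestamp, which must use English
# day and month abbreviations whatever the user's locale is.
http_date() {
    LC_ALL=C date -u '+%a, %d %b %Y %H:%M:%S GMT'
}
http_date

# Hypothetical helper: print a number with "." as the decimal point,
# as expected by a program that parses English-style numbers.
machine_number() {
    LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.2f\n", n }'
}
machine_number 3.14159    # prints 3.14
```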
That also applies to things like case-insensitive comparison (like in grep -i) and case conversion (awk’s toupper(), dd conv=ucase…).
For instance, grep -i i is not guaranteed to match on I in the user’s locale. In some Turkish locales, for instance, it doesn’t, as upper-case i is İ (note the dot) there and lower-case I is ı (note the missing dot).
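When ASCII case rules are what you want, forcing the C locale makes the i/I mapping predictable. A minimal sketch, assuming POSIX tr and grep:

```shell
# Case conversion with ASCII rules: i always becomes I.
upper=$(printf 'i\n' | LC_ALL=C tr '[:lower:]' '[:upper:]')
echo "$upper"      # I

# Case-insensitive match with ASCII rules: I always matches the pattern i.
matches=$(printf 'I\n' | LC_ALL=C grep -ci 'i')
echo "$matches"    # 1
```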