UTF-8 is an ASCII-compatible encoding of the Unicode (ISO 10646) character set. Since UTF-8 is now the default encoding on most modern Linux distributions, machines in the Computer Laboratory are moving their default character encoding from the 8-bit encoding ISO 8859-1 to UTF-8.

General information about using UTF-8 on Linux/Unix machines: http://www.cl.cam.ac.uk/~mgk25/unicode.html

Implementation status at the Computer Laboratory

On the SUSE Linux and Fedora Core 5 and later machines, users are by default already using the locale setting LC_CTYPE=en_GB.UTF-8, which tells applications that their main character set is UTF-8.

However, on the older Fedora Core 3 installations, the default setting is still LC_CTYPE=en_GB, which implies that the user's primary character set is ISO 8859-1. This is due to a local modification of the file /etc/sysconfig/i18n.
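
To see which of these settings governs the character set in a given session, the POSIX precedence rule can be sketched as a small shell function: LC_ALL wins if it is set and non-empty, then LC_CTYPE, then LANG. The function name effective_ctype is made up for this illustration and is not an installed tool.

```shell
# Print the locale name that governs the character set (LC_CTYPE category),
# following the POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
# "effective_ctype" is a hypothetical helper name used only in this sketch.
effective_ctype() {
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"
  elif [ -n "${LC_CTYPE:-}" ]; then
    printf '%s\n' "$LC_CTYPE"
  else
    # If LANG is unset too, fall back to "C" here as a stand-in for
    # the implementation default.
    printf '%s\n' "${LANG:-C}"
  fi
}

effective_ctype
```

A name ending in .UTF-8 indicates a UTF-8 session. The standard command "locale charmap" reports the character set that the C library actually derives from these variables.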

Since 3 January 2007, the lab's main filer (elmer) has assumed that the character encoding used in NFS filenames is UTF-8. This means that any non-ASCII characters in Unix filenames must be encoded in UTF-8 to appear correctly under Windows/CIFS. (In practice, restricting filenames to 7-bit ASCII is still a good idea.)

User overrides

Users can override the default settings by changing locale environment variables in their relevant start-up script, e.g. .profile or .bash_profile or .bashrc.

A typical UTF-8 setup would use

  export LANG=en_GB.UTF-8
  export LC_COLLATE=C

and a typical ISO 8859-1 setup would use

  export LANG=en_GB
  export LC_COLLATE=C

(The LC_COLLATE=C setting disables the "culturally correct" dictionary-style alphabetic sorting rules that come with the en_GB locale setting. This preserves traditional string-sorting behaviours such as "ls" sorting all uppercase letters before all lowercase ones, and "rm [A-Z]*" only deleting files starting with an uppercase letter.)
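
The effect of LC_COLLATE=C on sorting can be tried out with a short sketch. The helper name sort_bytewise is an assumption for this example. It clears LC_ALL inside a subshell, because LC_ALL, when set, overrides LC_COLLATE.

```shell
# Sort standard input by raw byte value, as LC_COLLATE=C does.
# LC_ALL is unset in the subshell because it would override LC_COLLATE.
# "sort_bytewise" is a hypothetical helper name used only in this sketch.
sort_bytewise() {
  ( unset LC_ALL; LC_COLLATE=C sort )
}

# Uppercase letters (bytes 0x41-0x5A) come before lowercase (0x61-0x7A).
printf '%s\n' apple Zebra banana Apple | sort_bytewise
# prints: Apple, Zebra, apple, banana (one per line)
```

Running the same input through plain "sort" in a session with the en_GB collation rules should instead interleave upper and lower case in dictionary style, although the exact order depends on the locale data installed.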

Tools for UTF-8 conversion

The only commonly used non-ASCII character found on British keyboards is the pound sign (£). In local plain-text files, it is the most likely reason why any conversion is necessary at all. Files that contain only 7-bit ASCII characters are already correctly encoded in UTF-8.
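
As an illustration, the following sketch shows the bytes involved. It assumes that iconv and od are available and uses ISO-8859-1 as the encoding name. The helper hex_bytes is made up for this example.

```shell
# Print the bytes read from standard input as one run of hex digits.
# "hex_bytes" is a hypothetical helper name used only in this sketch.
hex_bytes() {
  od -An -tx1 | tr -d ' \n'
  echo
}

# The pound sign is the single byte 0xA3 (octal 243) in ISO 8859-1 ...
printf '\243' | hex_bytes                                   # prints a3
# ... and the two-byte sequence 0xC2 0xA3 in UTF-8.
printf '\243' | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes    # prints c2a3
# Pure ASCII text passes through the conversion unchanged.
printf 'GBP' | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes     # prints 474250
```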

A simple shell script such as /home/mgk25/w/scripts/latin1toutf8

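  #!/bin/bash
  # Convert each named file from ISO 8859-1 to UTF-8,
  # keeping the original as FILE.bak.
  for i in "$@" ; do
    echo "Converting $i ..."
    # Quoting protects filenames that contain spaces or wildcards;
    # the conversion runs only if the backup was made.
    mv -i "$i" "$i.bak" && iconv -f ISO8859-1 -t UTF-8 "$i.bak" >"$i"
  done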

can be used to convert collections of plain-text files from ISO 8859-1 to UTF-8.

The tool /home/mgk25/local/arch-ix86/bin/utf8_test prints out all lines in a plain-text file that violate the UTF-8 syntax. It can be used to check which plain-text files still need to be converted to UTF-8.
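
Where utf8_test is not at hand, a rough file-level check can be sketched with iconv alone. This is not how utf8_test works and it does not print the offending lines. It relies on the assumption that iconv exits with a non-zero status when it meets an illegal UTF-8 sequence. The function name check_utf8 is made up for this example.

```shell
# Succeed if every named file is well-formed UTF-8; name those that are not.
# Assumption: iconv exits non-zero on an illegal input sequence.
# "check_utf8" is a hypothetical helper name used only in this sketch.
check_utf8() {
  local f status=0
  for f in "$@"; do
    # Convert UTF-8 to UTF-8 and discard the output; only the exit
    # status matters here.
    if ! iconv -f UTF-8 -t UTF-8 <"$f" >/dev/null 2>&1; then
      printf '%s: not valid UTF-8\n' "$f"
      status=1
    fi
  done
  return "$status"
}
```

A call such as "check_utf8 *.txt" lists the files that still need conversion. Files that pass should not be run through an ISO 8859-1 to UTF-8 conversion again, since that would double-encode any non-ASCII characters.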

There are some useful Perl one-liners to help with UTF-8 conversion listed at http://www.cl.cam.ac.uk/~mgk25/unicode.html#perl.

Some example UTF-8 files for tests can be found at

Web pages

The web server is at present (October 2006) configured to announce to the browser for every HTML file that the file is encoded in UTF-8. Users can easily override this for *.html and *.txt files across entire directory trees by placing in the relevant top-level subdirectory a .htaccess file with the lines

AddType text/html;charset=UTF-8 html
AddType text/plain;charset=UTF-8 txt

Keyboard entry

UTF-8 now gives users access to typographic characters beyond what used to be provided by typewriters, including directional quotation marks, en/em dashes, minus (as opposed to hyphen), Greek letters, etc. They can make a substantial improvement to the typographic quality of web pages. The easiest way to use these is to change the personal keyboard mapping in ~/.Xmodmap, such that additional characters become available via the AltGr key. An example is given at http://www.cl.cam.ac.uk/~mgk25/unicode.html#input.

Known problems

Most of the known problems that plagued users when UTF-8 was first introduced with Red Hat 8 have long been fixed, but there are still a number of issues that warrant attention.

Xterm starts very slowly for remote access

As described in "Is X11 ready for Unicode?", X11 has problems with XFontStruct for sparsely populated fonts. For remote access over broadband, this may saturate the link for 30 seconds, during which time all traffic is delayed. One way round this is to set LANG to a value that does not include .UTF-8 before xterm is started. The local command cl-xon can be made to do this by passing it the "-zaplang" flag or by setting the environment variable "XON_KEEP_LANG=false". LANG is then reset when ssh is invoked and passed over to sshd on the remote machine, so that xterm does not use sparse fonts. As xterm is called with the "-ls" flag, it starts a login shell, which applies the remote machine's LANG setting, so that bash has the expected LANG.
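
When xterm is started by hand rather than through cl-xon, the same idea can be sketched in the shell. The helper lang_without_utf8 is made up for this example and is not part of cl-xon. It only strips a trailing UTF-8 codeset suffix and leaves other locale names, including ones with an @modifier, unchanged.

```shell
# Strip a trailing UTF-8 codeset suffix from a locale name,
# e.g. en_GB.UTF-8 -> en_GB. Other names pass through unchanged.
# "lang_without_utf8" is a hypothetical helper name used only in this sketch.
lang_without_utf8() {
  case "$1" in
    *.[Uu][Tt][Ff]-8|*.[Uu][Tt][Ff]8) printf '%s\n' "${1%.*}" ;;
    *) printf '%s\n' "$1" ;;
  esac
}

lang_without_utf8 en_GB.UTF-8   # prints en_GB

# Possible use when starting xterm by hand (assumes LANG is the only
# locale variable that selects UTF-8 in this session):
#   LANG=$(lang_without_utf8 "$LANG") xterm -ls &
```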

Exmh "xterm" mail submission and first UTF-8 message

A particular instance of the above occurs when exmh submits email using an xterm (selected by setting Preferences -> MH Tweaks -> How to send messages to xterm). The delay can be avoided by setting LANG, as described above, before invoking exmh.

However, no way has yet been found to avoid the delay the first time that exmh displays a message with a UTF-8 font.

Printing

Cut & paste in Emacs

SysInfo/UTF-8 (last edited 2007-01-03 15:14:17 by MarkusKuhn)