NLS filenames

Why the character set of a request matters

As an example, the character à (accented 'a') falls outside the 7-bit US-ASCII range (its code is greater than 127). In the ISO8859-1 character set it is represented by the single byte E0; in UTF-8 it is represented by the two-byte sequence C3 A0. Depending on the browser configuration, the request can then be generated with one of the following strings:

URL                     ISO8859-1                  UTF-8
http://host/à           http://host/%E0            http://host/%C3%A0
http://host/càt.html    http://host/c%E0t.html     http://host/c%C3%A0t.html
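The two escaped forms can be reproduced with a short Python sketch. It is only an illustration of percent-encoding under each character set; the function name `escaped_forms` is hypothetical and nothing here is part of IHS.

```python
from urllib.parse import quote

def escaped_forms(path):
    """Return the percent-encoded path under ISO8859-1 and under UTF-8."""
    return {
        "ISO8859-1": quote(path, encoding="iso-8859-1"),
        "UTF-8": quote(path, encoding="utf-8"),
    }

# "\u00e0" is the accented 'a' used in the table above.
forms = escaped_forms("/c\u00e0t.html")
print(forms["ISO8859-1"])  # /c%E0t.html
print(forms["UTF-8"])      # /c%C3%A0t.html
```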

If the request is not properly url-escaped by the browser, it will show in your logs as:

URL                     ISO8859-1                  UTF-8
http://host/à           http://host/\xe0           http://host/\xc3\xa0
http://host/càt.html    http://host/c\xe0t.html    http://host/c\xc3\xa0t.html

IHS will generally (see below) not map between different character sets when trying to resolve a request URI to a filename. You must have agreement between the links and the filenames on disk, or URL-encode the links in advance to avoid ambiguity.

Codepage problems

Browsers may send UTF-8 or ISO8859-1 requests.

When a user clicks a link, or types a URL in the address bar, that contains characters outside the 7-bit US-ASCII range, most browsers by default send the request in a single-byte encoding such as ISO8859-1. These bytes are then URL-encoded so that only 7-bit US-ASCII characters are present in the request.

Both Mozilla and Internet Explorer have non-default options to send requests as UTF-8. When this is selected by the user, characters that don't map to 7-bit US-ASCII characters are converted to their multi-byte UTF-8 form before being URL-encoded.

The most straightforward way to circumvent these types of problems is to URL-encode the links in your HTML documents so that the client only sees 7-bit characters. The string you choose to pre-encode is determined by the sequence of bytes that exists in your actual filesystem. Another option is to create symbolic links from the real files to the alternatively named versions.
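As a rough sketch of pre-encoding, the following Python derives the link text directly from the bytes of the filename as stored on disk, so the link matches the filesystem regardless of which character set produced those bytes. The function names are hypothetical, and the sketch assumes a Unix filesystem where names are plain byte strings.

```python
import os
from urllib.parse import quote

def link_for(filename_bytes):
    """Percent-encode the exact on-disk bytes of a filename for use in a link."""
    return quote(filename_bytes)

def links_for_directory(directory):
    """List pre-encoded links for every entry in a directory.

    Passing a bytes path to os.listdir makes it return raw bytes names,
    so no character set guess is involved.
    """
    return sorted(link_for(name) for name in os.listdir(os.fsencode(directory)))

print(link_for(b"c\xe0t.html"))      # c%E0t.html     (ISO8859-1 name on disk)
print(link_for(b"c\xc3\xa0t.html"))  # c%C3%A0t.html  (UTF-8 name on disk)
```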

Possible Solutions

mod_fileiri conversion between local codepage and UTF-8

The third-party module mod_fileiri can convert between UTF-8 and local codepages for on-disk content, redirecting either internally (like mod_rewrite) or externally via an HTTP redirect.

Map selected byte sequences from UTF-8 to local codepage:

An example usage of mod_rewrite to map UTF-8 requests to a local codepage is provided below. Because UTF-8 has strict rules about what sequences are invalid, this script assumes that any request containing invalid UTF-8 is already in the proper local codepage and doesn't alter it. If the request is a valid UTF-8 string, it performs the replacements it's configured for (it does not blindly convert to the local character set to minimize false positives).
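The decision logic described above can be sketched in Python as follows. This is not the nlsMap.pl script; it is a minimal illustration that assumes the input is the already-unescaped request path as raw bytes and that the local codepage is ISO8859-1. The table `REPLACEMENTS` and the function `map_request` are hypothetical names.

```python
# Hypothetical table: character to map -> its byte in the local codepage
# (ISO8859-1 is assumed here). Only listed characters are converted.
REPLACEMENTS = {
    "\u00e0": b"\xe0",  # a with grave accent
    "\u00e9": b"\xe9",  # e with acute accent
}

def map_request(raw):
    """Map selected UTF-8 sequences in a request path to the local codepage."""
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Invalid UTF-8: assume the request is already in the local codepage.
        return raw
    # Valid UTF-8: replace only configured characters, keep the rest as UTF-8.
    return b"".join(REPLACEMENTS.get(ch, ch.encode("utf-8")) for ch in text)

print(map_request(b"/c\xc3\xa0t.html"))  # b'/c\xe0t.html'
print(map_request(b"/c\xe0t.html"))      # b'/c\xe0t.html' (left unchanged)
```

A program used with a prg: RewriteMap would wrap a function like this in a loop that reads one lookup key per line from standard input and writes one unbuffered result line to standard output.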

A more sophisticated program, intended to correct entire URLs in the wrong codepage rather than individual character sequences, might maintain a list of potential filesystem codepages and probe the filesystem to see whether a specific sequence of bytes actually exists there. The key passed in at lookup time might include data captured from the request, such as User-Agent or Accept-Language, as heuristics to guess the original encoding or otherwise affect the mapping.

The script linked below is an UNSUPPORTED example to be used as a REFERENCE. For a variety of reasons, we advise limiting which requests are seen by scripts of this nature, for example by configuring the rewrite in a directory context or by restricting it with a RewriteCond.
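One way such a probing program might look is sketched below in Python. The codepage list, the function name `find_on_disk`, and the injectable `exists` check are all assumptions made for illustration; the sketch does no path validation and ignores request headers entirely.

```python
import os

# Hypothetical list of codepages the filesystem content might use.
CANDIDATE_CODEPAGES = ["utf-8", "iso-8859-1"]

def find_on_disk(docroot, request_path, exists=os.path.exists):
    """Return the request path re-encoded to match a file on disk, or None.

    docroot and request_path are bytes; `exists` can be replaced for testing.
    """
    # Guess the text of the request: strict UTF-8 first, otherwise treat the
    # bytes as ISO8859-1 (which can decode any byte sequence).
    try:
        text = request_path.decode("utf-8")
    except UnicodeDecodeError:
        text = request_path.decode("iso-8859-1")
    for codepage in CANDIDATE_CODEPAGES:
        try:
            candidate = text.encode(codepage)
        except UnicodeEncodeError:
            continue  # this codepage cannot represent the name at all
        full_path = docroot.rstrip(b"/") + b"/" + candidate.lstrip(b"/")
        if exists(full_path):
            return candidate
    return None

# Simulated filesystem holding only the ISO8859-1 named file.
on_disk = {b"/srv/www/c\xe0t.html"}
print(find_on_disk(b"/srv/www", b"/c\xc3\xa0t.html", exists=on_disk.__contains__))
```

The simulated lookup returns the ISO8859-1 form of the path because that is the only byte sequence present in the stand-in filesystem.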

nlsMap.pl

 
RewriteEngine on
RewriteLock /opt/IHS/logs/rewrite.lock
RewriteMap nlsmap prg:/opt/nlsMap.pl
RewriteCond ...
RewriteRule (.*) %{nlsmap:$1} 

Platform Considerations

Windows

IHS 2.0.42 and later on Windows requires URL encoding to be in UTF-8 and converts all requests into the Unicode format used by supported Windows versions. Windows uses the same Unicode representation for filenames no matter what the system's local codepage is.

Unix

The requested filename must match exactly (byte-for-byte) the filename as created in the filesystem. Because there are no guarantees or metadata available that describe the character set of a filesystem, neither IHS nor the OS performs any translation. You can see these low-level bytes (without the risk of environment or terminal issues) by doing the following:

ls -1 *somewildcard*| od -t x1 -c

Example: I've created two files named càt, each with an accented 'a' in the middle but with the filename encoded in a different character set. The first file uses UTF-8 and the second uses the ISO8859-1 single-byte encoding. Command: ls -1 c*t.html | od -t x1 -c

Output for UTF8 file representation of càt:

0000000 63 c3  a0   74        (hex)
        c  303 240  t         (character/octal)

Output for ISO8859-1 representation of càt:

0000020   63 e0   74 
          c  340  t   

Without any other precautions in place, an ISO8859-1 encoded request and a UTF-8 encoded request would each find a different file in the filesystem. If one of the files did not exist, one of the requests would result in a 404. See "Why the character set of a request matters" for how these types of requests arise.
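To see why the two requests cannot resolve to the same file, the following Python sketch decodes each escaped request into the raw bytes that would be compared against the filesystem. The helper name `requested_name` is hypothetical; it only illustrates the byte-for-byte comparison and is not how IHS is implemented.

```python
from urllib.parse import unquote_to_bytes

def requested_name(escaped_path):
    """Undo percent-encoding and return the raw bytes of the request path."""
    return unquote_to_bytes(escaped_path)

iso = requested_name("/c%E0t.html")
utf8 = requested_name("/c%C3%A0t.html")

print(iso.hex(" "))   # 2f 63 e0 74 2e 68 74 6d 6c
print(utf8.hex(" "))  # 2f 63 c3 a0 74 2e 68 74 6d 6c
print(iso == utf8)    # False: the two requests name different files
```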

Known Problems

  • WebSphere plug-in incorrectly decodes certain incoming URLs into signed characters with negative values.

    This was fixed by APAR PK09023. The fix was included in WAS 5.0.2.13, 5.1.1.7, and 6.0.2.1 service pack plug-ins.

Mustgather

  • access_log showing a 404 for NLS filename content

  • The hyperlink the user followed

  • The filesystem encoding (see the od command under Platform Considerations, Unix)