# NLS filenames

## Why the character set of a request matters

As an example, the character **à** (accented 'a') falls outside the 7-bit US-ASCII range (greater than 127) and is represented by the byte E0 in the ISO8859-1 character set. The UTF-8 encoding is the two-byte sequence C3 A0. Depending on the browser configuration, the request can be generated with one of the following strings:
| URL | ISO8859-1 | UTF-8 |
| --- | --- | --- |
| http://host/à | http://host/%E0 | http://host/%C3%A0 |
| http://host/càt.html | http://host/c%E0t.html | http://host/c%C3%A0t.html |
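The percent-encoded forms in the table can be reproduced with a few lines of Python (a quick illustration; the filenames are the hypothetical examples from the table):

```python
from urllib.parse import quote

# The same character 'à' percent-encodes differently depending on which
# byte encoding the browser applies before escaping.
print(quote("à".encode("iso8859-1")))         # %E0
print(quote("à".encode("utf-8")))             # %C3%A0
print(quote("càt.html".encode("iso8859-1")))  # c%E0t.html
print(quote("càt.html".encode("utf-8")))      # c%C3%A0t.html
```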
If the request is not properly URL-escaped by the browser, it will show in your logs as:
| URL | ISO8859-1 | UTF-8 |
| --- | --- | --- |
| http://host/à | http://host/\xe0 | http://host/\xc3\xa0 |
| http://host/càt.html | http://host/c\xe0t.html | http://host/c\xc3\xa0t.html |
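The `\xe0` and `\xc3\xa0` forms above are the access log's escaping of raw high-bit bytes. Python's bytes `repr` happens to use the same `\xNN` notation, so it can preview how an unescaped request path would appear (a small sketch using the example filename):

```python
# A non-URL-escaped request path contains raw high-bit bytes; the access
# log escapes each one as \xNN, which matches Python's bytes repr.
print("/càt.html".encode("iso8859-1"))  # b'/c\xe0t.html'
print("/càt.html".encode("utf-8"))      # b'/c\xc3\xa0t.html'
```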
IHS will generally (see below) not map between different character sets when trying to resolve a request URI to a filename. You must have agreement between the links and the filenames on disk, or URL-encode the links in advance to avoid ambiguity.

## Codepage problems

### Browsers may send UTF-8 or ISO8859-1 requests

When a user clicks or types a URL in the address bar that contains characters outside the 7-bit US-ASCII range, most browsers by default send the request in a single-byte encoding such as ISO8859-1. These bytes are then URL-encoded so that only 7-bit US-ASCII characters are present in the request. Both Mozilla and Internet Explorer have non-default options to send requests as UTF-8. When this is selected by the user, characters that don't map to 7-bit US-ASCII are converted to their multi-byte UTF-8 form before being URL-encoded.

The most straightforward way to circumvent these types of problems is to URL-encode the links in your HTML documents so that the client only sees 7-bit characters. The string you choose to pre-encode is determined by the sequence of bytes that actually exists in your filesystem. Another option is creating symbolic links from the real files to the alternatively named versions.

### Links in HTML documents may be in a different encoding than what exists on the filesystem

Saving or converting HTML files that have external links with characters outside of 7-bit US-ASCII as UTF-8 may change the behavior of your links if the filenames on disk are still encoded in a local codepage. If your HTML is UTF-8 encoded and contains characters outside of the 7-bit US-ASCII range, you have the following options: make sure your filenames on disk are also UTF-8, or pre-encode the links in a URL-escaped form.
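The symbolic-link option mentioned above can be sketched as follows, assuming a Unix filesystem where filenames are raw bytes (the directory and names here are hypothetical; on a real server this is simply an `ln -s`):

```python
import os
import tempfile

# Sketch: the real file's name is UTF-8 encoded on disk; add a symlink
# under the ISO8859-1 byte spelling so a request in either encoding
# resolves to the same content. Unix-only: filenames are raw bytes.
docroot = tempfile.mkdtemp().encode()          # hypothetical docroot
utf8_name = "càt.html".encode("utf-8")         # b'c\xc3\xa0t.html'
latin1_name = "càt.html".encode("iso8859-1")   # b'c\xe0t.html'

with open(os.path.join(docroot, utf8_name), "wb") as f:
    f.write(b"<html>cat</html>")
os.symlink(utf8_name, os.path.join(docroot, latin1_name))

print(sorted(os.listdir(docroot)))  # both byte spellings now resolve
```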
## Possible Solutions

### Simplest solution: pre-encode hyperlinks to the files' encoding

The typical problem our users encounter is that, depending on browser settings or how a user finds a page, the request for the same resource comes in both kinds of encodings -- and only one works at a time. The best short answer we have is: URL-encode your links ahead of time to take the decision out of the hands of the client. Simply put, if your link contains characters outside of US-ASCII, create links in the form of http://host/%E0 or http://host/%C3%A0 (depending on what exists in the filesystem).

An example alternative solution using mod\_rewrite to map UTF-8 sequences to other charsets is provided [below](#REWRITE).

### mod\_fileiri conversion between local codepage and UTF-8

The third-party module [mod\_fileiri](http://www.w3.org/2003/06/mod_fileiri/) can convert between UTF-8 and local codepages for on-disk content, redirecting either internally (like mod\_rewrite) or externally via an HTTP redirect.

### Map selected byte sequences from UTF-8 to the local codepage

An example use of mod\_rewrite to map UTF-8 requests to a local codepage is provided below. Because UTF-8 has strict rules about which sequences are invalid, the script assumes that any request containing invalid UTF-8 is already in the proper local codepage and doesn't alter it. If the request is a valid UTF-8 string, it performs only the replacements it's configured for (it does not blindly convert to the local character set, to minimize false positives).

A more sophisticated program, intended to correct entire URLs in the wrong codepage rather than individual character sequences, might maintain a list of potential filesystem codepages and check whether a specific sequence of bytes actually exists in the filesystem.
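The logic just described can be sketched roughly like this (an illustrative Python stand-in, not the actual nlsMap.pl script; the one-entry mapping table is an assumed example):

```python
import sys

# Mapping table: UTF-8 byte sequence -> local (ISO8859-1) bytes. The single
# entry for 'à' is an assumed example; a real deployment would list the
# sequences that actually occur in its filenames.
UTF8_TO_LOCAL = {b"\xc3\xa0": b"\xe0"}

def map_uri(uri: bytes) -> bytes:
    try:
        uri.decode("utf-8")  # validity check only; result is discarded
    except UnicodeDecodeError:
        return uri           # invalid UTF-8: assume local codepage, leave alone
    for utf8_seq, local_seq in UTF8_TO_LOCAL.items():
        uri = uri.replace(utf8_seq, local_seq)
    return uri

if __name__ == "__main__":
    # RewriteMap prg: protocol: one key per input line, one reply per line,
    # flushed immediately so httpd is not left waiting.
    for raw in sys.stdin.buffer:
        sys.stdout.buffer.write(map_uri(raw.rstrip(b"\n")) + b"\n")
        sys.stdout.buffer.flush()
```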
The key passed in at lookup time might include data captured from the request, such as User-Agent or Accept-Language, to drive heuristics that guess the original encoding or otherwise affect the mapping.

The script linked below is an UNSUPPORTED example to be used as a REFERENCE. We'd advise limiting the scope of what requests are seen by scripts of this nature (i.e. restrict them to a directory context or with a RewriteCond) for a variety of reasons.

[nlsMap.pl](examples/nlsMap.pl)

```
RewriteEngine on
RewriteLock /opt/IHS/logs/rewrite.lock
RewriteMap nlsmap prg:/opt/nlsMap.pl
RewriteCond ...
RewriteRule (.*) %{nlsmap:$1}
```

## Platform Considerations

### Windows

IHS 2.0.42 and later on Windows requires URL encoding to be in UTF-8 and will convert all requests into the Unicode format used by supported Windows versions. Windows uses the same Unicode representation for filenames regardless of the system's local codepage.

### Unix (all IHS releases)

The filename must match the filename as created in the filesystem exactly, byte for byte. Because there are no guarantees or metadata available that describe the character set of a filesystem, neither IHS nor the OS performs any translation.

You can see these low-level bytes (without the risk of environment or terminal issues) with:

```
ls -1 *somewildcard* | od -t x1 -c
```

Example: I've created two files named c**à**t -- each with an accented 'a' in the middle, but with filenames composed of differing character sets. The first file is the UTF-8 version and the second uses the ISO8859-1 single-byte encoding.

Command:

```
ls -1 c*t.html | od -t x1 -c
```

Output for the UTF-8 representation of c**à**t (first line hex, second line character/octal):

```
0000000  63  c3  a0  74
          c 303 240   t
```

Output for the ISO8859-1 representation of c**à**t:

```
0000020  63  e0  74
          c 340   t
```

Without any other precautions in place, an ISO8859-1 encoded request and a UTF-8 encoded request would each find a different file in the filesystem.
If one of the files did not exist, then one of the requests would result in a 404. Read on for information on how these types of requests are manifested.

## Known Problems

- The WebSphere plug-in incorrectly decodes certain incoming URLs into signed characters with negative values. This was fixed by APAR PK09023. The fix was included in the WAS 5.0.2.13, 5.1.1.7, and 6.0.2.1 service pack plug-ins.

## Mustgather

- access\_log showing a 404 for NLS filename content
- The hyperlink the user followed
- The filesystem encoding (see [here](#unix-all-ihs-releases))