Reference: IHS Performance

Determining maximum simultaneous connections

The first tuning decision you'll need to make is determining how many simultaneous connections your IBM HTTP Server installation will need to support. Many other tuning decisions are dependent on this value.

For some IBM HTTP Server deployments, the amount of load on the web server is directly related to the typical business day, and may show a load pattern such as the following:

    Simultaneous
    connections

            |
       2000 |
            |
            |                            **********
            |                        ****          ***
       1500 |                   *****                 **
            |               ****                        ***
            |            ***                               ***
            |           *                                     **
       1000 |          *                                        **
            |         *                                           *
            |         *                                           *
            |        *                                             *
        500 |        *                                             *
            |        *                                              *
            |      **                                                *
            |   ***                                                  ***
          1 |***                                                        **
 Time of    +-------------------------------------------------------------
   day         7am  8am  9am  10am  11am  12pm  1pm  2pm  3pm  4pm  5pm

For other IBM HTTP Server deployments, which provide applications used across many time zones, load on the server varies much less during the day.

The maximum number of simultaneous connections must be based on the busiest part of the day. This maximum number of simultaneous connections is only loosely related to the number of users accessing the site. At any given moment, a single user can require anywhere from zero to four independent TCP connections.

The typical way to determine the maximum number of simultaneous connections is to monitor mod_mpmstats or mod_status reports during the day until typical behavior is understood.
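
If mod_status is not already enabled, a minimal configuration sketch follows; the /server-status path and the client address are placeholders, and the access-control syntax shown is the httpd 2.2 style used by IHS releases before 9.0 (IHS 9.0 and later can use the newer Require syntax instead):

    LoadModule status_module modules/mod_status.so
    ExtendedStatus On
    <Location /server-status>
       SetHandler server-status
       Order deny,allow
       Deny from all
       # placeholder: replace with an administrative client or subnet
       Allow from 192.0.2.10
    </Location>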

The information in this document assumes the tuning exercise is occurring on an individual instance of IBM HTTP Server. When scaled horizontally across a farm of web servers, you may need to account for a surge of connections when some capacity is removed for maintenance or otherwise taken offline.

In IHS 7.0 and later, the default configuration enables mod_mpmstats, which results in periodic messages like the following being written to the error log:

[Thu Aug 19 14:01:00 2004] [notice] mpmstats: rdy 712 bsy 312 rd 121 wr 173 ka 0 log 0 dns 0 cls 18                               
[Thu Aug 19 14:02:30 2004] [notice] mpmstats: rdy 809 bsy 215 rd 131 wr 44 ka 0 log 0 dns 0 cls 40                                
[Thu Aug 19 14:04:01 2004] [notice] mpmstats: rdy 707 bsy 317 rd 193 wr 97 ka 0 log 0 dns 0 cls 27                                
[Thu Aug 19 14:05:32 2004] [notice] mpmstats: rdy 731 bsy 293 rd 196 wr 39 ka 0 log 0 dns 0 cls 58
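
One way to find the peak number of busy threads over a period is to extract the bsy values from these messages. A minimal sketch, assuming the error log is at logs/error_log:

    grep 'mpmstats:' logs/error_log | sed 's/.*bsy \([0-9]*\).*/\1/' | sort -n | tail -1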

Note that if the web server has not been configured to support enough simultaneous connections, one of the following messages will be logged to the web server error log and clients will experience delays accessing the server.

Windows
[warn] Server ran out of threads to serve requests. Consider raising the ThreadsPerChild setting

Linux and Unix
[error] server reached MaxClients setting, consider raising the MaxClients setting

Check the error log for a message like this to determine if the IBM HTTP Server configuration needs to be changed. When MaxClients has been reached, new connections are buffered by the operating system and do not consume a request processing thread until the current work can complete. The length of this queue is controlled by the OS and can be influenced by the ListenBacklog directive.

Once the maximum number of simultaneous connections has been determined, add 25% as a safety factor. The next section discusses how to use this number in the web server configuration file.

As you approach a requirement exceeding 3000-4000 simultaneous connections, you will instead need to fan out to more instances of IHS, generally on different OS instances. Exceeding this soft limit is not recommended and may result in unpredictable or unintuitive behavior.

The KeepAliveTimeout directive can affect the apparent number of simultaneous requests being processed by the server. Increasing KeepAliveTimeout effectively reduces the number of threads available to service new inbound requests, and results in a higher maximum number of simultaneous connections which must be supported by the web server. Decreasing KeepAliveTimeout can drive extra load on the server by incurring unnecessary TCP connection setup overhead. A setting of 5 to 10 seconds is reasonable for serving requests over high-speed, low-latency networks.
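
A configuration fragment in the suggested range would look like this; 5 seconds is just a starting point to adjust based on observed keepalive usage:

    KeepAlive On
    KeepAliveTimeout 5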

For example, on a non-Event system:

[Thu Aug 28 10:12:17 2008] [notice] mpmstats: rdy 0 bsy 600 rd 1 wr 70 ka 484 log 0 dns 0 cls 45

... shows that all threads are busy (0 threads are ready "rdy", 600 are busy "bsy"). 484 of them are just waiting for another keepalive request ("ka"), and yet the server will be rejecting requests because it has no threads available to work on them. Lowering KeepAliveTimeout would cause those threads to close their connections sooner and become available for more work.

On 8.5.5 and later under z/OS, and on 9.0 and later under Linux, keepalive connections do not tie up a dedicated thread due to the use of the "Event MPM". On these platforms, there is very little value in reducing the keepalive timeout. In most normal browser traffic patterns, with responsive applications, servers using the Event MPM will require drastically fewer threads than were required with previous MPMs.

TCP connection states and thread/process requirements

The netstat command can be used to show the state of TCP connections between clients and IBM HTTP Server. For some of these connection states, a web server thread (or child process, with 1.3.x on Unix) is consumed; for other states, no web server thread is consumed. See the following table to determine whether a TCP connection in a particular state requires a web server thread; a command for summarizing connection states appears after the table.

LISTEN
no connection
WebServer Thread: no
SYN_RCVD
not ready to be processed
WebServer Thread: no
ESTABLISHED
ready for the web server to accept and process requests, or already processing requests
WebServer Thread: yes, as soon as the web server notices that the connection is established; but if there aren't enough configured web server threads (e.g., MaxClients is too small), the connection may wait in the backlog until a thread becomes ready. No thread is held for connections idle in HTTP keepalive with the Event MPM.
FIN_WAIT1
web server has closed the socket; the connection remains in this state until an ACK is received from the client
WebServer Thread: up to 30 seconds (2 seconds if closure is due to an error condition) if the ACK is delayed
CLOSE_WAIT
client has closed the socket, but the web server hasn't yet noticed
WebServer Thread: usually
LAST_ACK, CLOSING
client closed the socket, then the web server closed the socket
WebServer Thread: no
FIN_WAIT2
web server closed the socket, then the client ACKed; the connection remains in this state until a FIN is received from the client or an OS-specific timeout occurs
WebServer Thread: up to 30 seconds (2 seconds if closure is due to an error condition) if the FIN is delayed
TIME_WAIT
waiting for the 2*MSL timeout before allowing the address/port quad to be reused
WebServer Thread: no
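
A quick way to see how many connections are in each state is to summarize netstat output. A minimal sketch, assuming a Unix-like netstat and a server listening on port 80 (some platforms separate the port with a dot rather than a colon, so adjust the pattern accordingly):

    netstat -an | grep ':80 ' | awk '{print $6}' | sort | uniq -c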

Handling enough simultaneous connections with IBM HTTP Server on Windows / ThreadsPerChild setting

IBM HTTP Server on Windows has a single Parent process and a single multi-threaded Child process.

Special concerns for 32-bit IHS

Most IHS installs prior to 9.0.5.4 are 32-bit, and these runtimes have special scalability limitations.

On 64-bit Windows operating systems, each 32-bit instance of IHS is limited to approximately 2000 ThreadsPerChild. On 32-bit Windows, this number can be closer to 4000, but most operating system versions supported by modern releases of IHS are exclusively 64-bit.

  • The IHS Archive installation includes a 64-bit option.

  • New installations of 9.0.5.4 and later default to 64-bit (https://www.ibm.com/support/pages/apar/PH23893).

  • With 9.0, or with PI04922 applied to 8.5.5, it's possible to use nearly twice as much memory (and therefore nearly twice as many threads).

These numbers are not exact limits, because the real limit is the sum of the fixed startup memory cost of each thread plus the maximum runtime memory usage per thread, which varies based on configuration and workload. Raising ThreadsPerChild and approaching these limits risks child process crashes when runtime memory usage pushes the process address space over the 2GB or 3GB barrier. The upper limits become even more restrictive when loading other modules such as mod_mem_cache, or when there are many RewriteCond/RewriteRule directives.

When approaching the limits, the server may run for a while before the memory limit is exceeded and the child process crashes. Values even higher than this would probably cause an immediate crash during startup.

No specific limits can be provided, but anything over a value of 2000 for ThreadsPerChild on a Windows operating system could be at risk. Note that, prior to the 64-bit installations described above, IHS on Windows is a 32-bit application even when running on a 64-bit Windows OS.

General Info

The relevant config directives for tuning the thread values described in this section on Windows are:

  • ThreadsPerChild

    The ThreadsPerChild directive places an upper limit on the number of simultaneous connections the server can handle. ThreadsPerChild should be set according to the maximum number of expected simultaneous connections, but within the restrictions described above.

  • ThreadLimit (valid with IHS 2.0 and above)

    ThreadsPerChild has a built-in upper limit. Use ThreadLimit to increase the upper limit of ThreadsPerChild. The value of ThreadLimit affects the size of the shared memory segment the server uses to perform inter-process communication between the parent and the single child process. Do not increase ThreadLimit beyond what is required for ThreadsPerChild.
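
As an illustration, a Windows configuration sized for roughly 1000 measured simultaneous connections (1250 after the 25% safety factor discussed earlier) might look like the following; the numbers are assumptions to replace with your own measurements:

    # ThreadLimit should appear before ThreadsPerChild in the configuration file
    ThreadLimit      1250
    ThreadsPerChild  1250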

Handling enough simultaneous connections with IBM HTTP Server 2.0 and above on Linux and Unix systems

On UNIX and Linux platforms, a running instance of IBM HTTP Server consists of one single-threaded Parent process which starts and maintains one or more multi-threaded Child processes. HTTP requests are received and processed by threads running in the Child processes. Each simultaneous request (TCP connection) consumes a thread. Use the appropriate configuration directives to control how many threads the server starts to handle requests; on UNIX and Linux, you can also control how the threads are distributed amongst the Child processes.

Relevant config directives on UNIX platforms:

  • StartServers

    The StartServers directive controls how many Child Processes are started when the web server initializes. The recommended value is 1. Do not set this higher than MaxSpareThreads divided by ThreadsPerChild. Otherwise, processes will be started at initialization and terminated immediately thereafter.

    Every second, IHS checks if new child processes are needed, so generally tuning of StartServers will be moot as early as a minute after IHS has started.

  • ServerLimit

    There is a built-in upper limit on the number of child processes. At runtime, the actual upper limit on the number of child processes is MaxClients divided by ThreadsPerChild. When using the Event MPM, reserve a few extra processes' worth of slack between ServerLimit and the number of child processes normally used, to leave room for processes that are gracefully exiting.

    This should only be changed when you have reason to change MaxClients or ThreadsPerChild; it does not directly dictate the number of child processes created at runtime.

    It is possible to see more child processes than this if some of them are gracefully stopping. If there are many of them, it probably means that MaxSpareThreads is set too small, or that MaxRequestsPerChild is non-zero and not large enough; see below for more information on both these directives.

  • ThreadsPerChild

    Use the ThreadsPerChild directive to control how many threads each Child process starts. More information on strategies for distributing threads amongst child processes is included below.

  • ThreadLimit

    ThreadsPerChild has a built-in upper limit. Use ThreadLimit to increase the upper limit of ThreadsPerChild. The value of ThreadLimit affects the size of the shared memory segment the server uses to perform inter-process communication between the parent and child processes. Do not increase ThreadLimit beyond what is required for ThreadsPerChild.

  • MaxClients

    The MaxClients directive places an upper limit on the number of simultaneous connections the server can handle. MaxClients should be set according to the expected load.

The MaxSpareThreads and MinSpareThreads directives affect how the server reacts to changes in server load. You can use these directives to instruct the server to automatically increase the number of Child processes when server load increases (subject to limits imposed by ServerLimit and MaxClients) and to decrease the number of Child processes when server load is low. This feature can be useful for managing overall system memory utilization when your server is being used for tasks other than serving HTTP requests.

Setting MaxSpareThreads to a relatively small value has a performance penalty: Extra CPU to terminate and create child processes. During normal operation, the load on the server may vary widely (e.g., from 150 busy threads to 450 busy threads). If MaxSpareThreads is smaller than this variance (e.g., 450-150=300), then the web server will terminate and create child processes frequently, resulting in reduced performance.

Recommended settings:

  • ThreadsPerChild

    Leave at the default value, or increase to a larger proportion of MaxClients for better coordination of WebSphere Plugin processing threads (via fewer child processes). Larger ThreadsPerChild (and fewer processes) also results in fewer dedicated web container threads being used by the ESI invalidation feature of the WebSphere Plugin. Increasing ThreadsPerChild too high on heavily loaded SSL servers may cause extra CPU usage and throughput problems, as there is additional contention for memory.

  • MaxClients

    Set this to the desired maximum number of simultaneous
    connections, rounded up to an even multiple of ThreadsPerChild.

  • StartServers

    Set it to "1" or "2".

  • MinSpareThreads

    Set it to a small multiple of ThreadsPerChild

  • MaxSpareThreads

    In general, set it to the same value as MaxClients or a large multiple of ThreadsPerChild. Setting it low helps reclaim memory when load has subsided.

    Setting the value too low (too close to MinSpareThreads) causes IHS processes to be recycled more frequently. This can have some negative side effects:

    • Troubleshooting problems is difficult when processes are short lived

    • Slow requests may delay the exit of processes that are reaped due to this setting

    • Exiting processes have to shut down existing keepalive connections, which can race with the client sending another request on those connections.

  • ServerLimit

    MaxClients divided by ThreadsPerChild, plus some additional space to account for exiting children. Doubling the value is usually fine.

  • ThreadLimit

    Set it to the same value as ThreadsPerChild

Note: ThreadLimit and ServerLimit need to appear before these other directives in the configuration file.
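
Putting these recommendations together, here is a sketch for a server measured at roughly 2000 peak simultaneous connections (2500 after the 25% safety factor); all values are illustrative and assume a ThreadsPerChild of 100:

    # ThreadLimit and ServerLimit must appear before the other directives
    ThreadLimit          100
    # 2500/100 = 25 child processes, doubled to leave slack for exiting processes
    ServerLimit           50
    StartServers           1
    MaxClients          2500
    # a small multiple of ThreadsPerChild
    MinSpareThreads      200
    # equal to MaxClients, to avoid terminating and re-creating child processes
    MaxSpareThreads     2500
    ThreadsPerChild      100
    MaxRequestsPerChild    0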

Default settings in 7.0-8.5:

ThreadLimit         25
ServerLimit         64
StartServers         2
MaxClients         600
MinSpareThreads     25
MaxSpareThreads     75
ThreadsPerChild     25
MaxRequestsPerChild  0

Default settings in 9.0.0.3 and later:

ThreadLimit          100
# After 9.0.0.3, it's important for the event MPM to have some slack space for ServerLimit
ServerLimit           18
StartServers           1
MaxClients          1200
MinSpareThreads       50
# PI74200: When using the event MPM, discourage process termination during runtime.
MaxSpareThreads      600
ThreadsPerChild      100
MaxRequestsPerChild    0
MaxMemFree 2048

If memory is constrained

If there is concern about available memory on the server, some additional tuning can be done.

Increasing ThreadsPerChild (and ThreadLimit) will reduce the number of total server processes needed, reducing the per-server memory overhead. However, there are a number of possible drawbacks to increasing ThreadsPerChild. Search this document for ThreadsPerChild and consider all the warnings before changing it. Notably, this increases the per-process footprint which can be detrimental on 32-bit httpds.

On Linux, VSZ can appear very large if ulimit -s is left at a large value or unlimited. Reduce it in bin/envvars to e.g. 512 with ulimit -s if this is a concern. A high VSZ has no real cost (it does not consume memory), but it is sometimes noticed.

Setting MaxMemFree to e.g. 512 will limit the memory retained by each thread.
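
As a sketch (the values are illustrative, not recommendations): cap the per-thread stack size in bin/envvars, and limit retained free memory in the configuration file:

    # in bin/envvars
    ulimit -s 512

    # in the IHS configuration file
    MaxMemFree 512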

Out of the box tuning concerns


All platforms

Access logs

Disk writes to the access logs can become a bottleneck at high loads on any operating system. mpmstats output will show many busy threads in log state. Specify

   BufferedLogs on

in the IHS configuration file to reduce the disk I/O rate due to access logging.

MaxClients, ThreadsPerChild, etc.

Refer to the previous section.

TCP buffer sizes

By default, IHS allows the platform's default TCP send and receive buffer sizes to be used. This gives the underlying OS and administrator the most flexibility. If poor throughput for static files is observed, try combinations of the SendBufferSize directive with disabling EnableMMAP and EnableSendfile. Values between 32k and 512k are reasonable and can sometimes lead to drastic improvements in throughput, especially on Windows.

The relationship between the above directives is that changes to SendBufferSize may be ignored by the OS when the features controlled by EnableMMAP and EnableSendfile are used. Neither feature is ever used for SSL connections.

Beyond the obvious effect of increasing the size of the buffer, setting SendBufferSize to any value at all may disable complex/unpredictable dynamic buffer sizing such as that documented for Windows.
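
A sketch of the kind of combination to experiment with; the buffer size is one example within the 32k-512k range above, and the mmap/sendfile features are disabled so the fixed buffer size actually takes effect:

    SendBufferSize 131072
    EnableMMAP Off
    EnableSendfile Off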

cipher ordering (SSL only)

Historical interest only: The default SSLCipherSpec ordering enables maximum strength SSL connections at a significant performance penalty. A much better performing and reasonably strong SSLCipherSpec configuration is given below.

Sendfile (non-SSL only)

With IBM HTTP Server 2.0 and above, Sendfile usage is disabled in the current default configuration files. This avoids some occasional platform-specific problems, but it may also increase CPU utilization on platforms on which sendfile is supported (Windows, AIX, Linux, HP-UX, and Solaris/x64).

If you enable sendfile usage on AIX, ensure that the nbc_limit setting displayed by the no program is not too high for your system. On many systems, the AIX system default is 768MB. We recommend setting this to a much more conservative value, such as 256MB. If the limit is too high, and the web server's use of sendfile results in a large amount of network buffer cache memory utilization, a wide range of other system functions may fail. In situations like that, the best diagnostic step is to check network buffer cache utilization by running netstat -c. If it is relatively high (hundreds of megabytes), disable sendfile usage and see if the problem occurs again. Alternatively, nbc_limit can be lowered significantly while leaving sendfile enabled.
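
To check the current limit and watch buffer-cache utilization before and after a change (these are the same AIX commands referenced above):

    # display the current Network Buffer Cache limit
    no -o nbc_limit
    # display network buffer cache statistics
    netstat -c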

Some Apache users on Solaris have noted that sendfile is slower than the normal file handling, and that sendfile may not function properly on that platform with ZFS or some Ethernet drivers. IBM HTTP Server provides support for sendfile on Solaris/x64 but not Solaris/SPARC.

AIX

With IBM HTTP Server 2.0.42 and above, the default IHSROOT/bin/envvars file specifies the setting MALLOCMULTIHEAP=considersize,heaps:8. This enables a memory management scheme for the AIX heap library which is better for multithreaded applications, and configures it to try to minimize memory use and to use a moderate number of heaps. For configurations with extensive heap operations (SSL or certain third-party modules), CPU utilization can be lowered by changing this setting to the following: MALLOCMULTIHEAP=true. This may increase the memory usage slightly.

Windows

The Fast Response Cache Accelerator (FRCA, aka AFPA) is disabled in the current default configuration files because some common Windows extensions, such as Norton Antivirus, are not compatible with it. FRCA is a kernel resident micro HTTP server optimized for serving static, non-access protected files directly out of the file system. The use of FRCA can dramatically reduce CPU utilization in some configurations. FRCA cannot be used for serving content over HTTPS/SSL connections.


Features to avoid

IBM HTTP Server supports some features and configuration directives that can have a severe impact on server performance. Use of these features should be avoided unless there are compelling reasons to enable them.

  • HostnameLookups On

    Performance penalty: Extra DNS lookups per request.

    This is disabled by default in the sample configuration files.

  • IdentityCheck On

    Performance penalty: Delays introduced in each request to contact an RFC 1413 ident daemon possibly running on the client machine

    This is disabled by default in the sample configuration files.

  • mod_mime_magic

    Performance penalty: Extra CPU and disk I/O to try to find the file type

    This is disabled by default in the sample configuration files.

  • ContentDigest On (1.3 only)

    Performance penalty: Extra CPU to compute MD5 hash of the response

    This is disabled by default in the sample configuration files.

  • setting MaxRequestsPerChild to non-zero

    Performance penalty:

    • Extra CPU to terminate and create child processes

    • With IHS 2 or higher on Linux and Unix, this can lead to an excessive number of child processes, which in turn can lead to excessive swap space usage. Once a child process reaches MaxRequestsPerChild it will not handle any new connections, but existing connections are allowed to complete. In other words, even a single long-running request in the process will keep the process active, sometimes indefinitely. In environments where long-running requests are not unusual, a large number of exiting child processes can build up.

    This is set to the optimal setting (0) in default configuration files for recent releases.

    In rare cases, IHS support will recommend setting MaxRequestsPerChild to non-zero to work around a growth in resources, based on an understanding of what type of resource is growing in use, and what other mechanisms are available to address that growth.

    With IBM HTTP Server 1.3 on Linux and Unix, a setting of a high value such as 10000 is not a concern. The child processes each handle only a single connection, so they cannot be prevented from exiting by long-running requests.

    With IBM HTTP Server 2.0 and above on Linux and Unix, if the feature must be used, then only set it to a relatively high value such as 50000 or more to limit the risk of building up a large number of child processes which are trying to exit but which can't because of a long-running request which has not completed.

  • .htaccess files

    Performance penalty: Extra CPU and disk I/O to locate .htaccess files in directories where static files are served

    .htaccess files are disabled in the sample configuration files.

  • detailed logging

    Detailed logging (SSLTrace, plug-in LogLevel=trace, IBM Global Security Kit (GSKit) trace, third-party module logging) is often enabled as part of problem diagnosis. When one or more of these traces is left enabled after the problem is resolved, CPU utilization is higher than normal.

    Detailed logging is disabled in the sample configuration files.

  • disabling Options FollowSymLinks

    If the static files are maintained by untrusted users, you may want to disable this option in the configuration file, in order to prevent those untrusted users from creating symbolic links to private files that should not ordinarily be served. But disabling FollowSymLinks to prevent this problem will result in performance degradation since the web server then has to check every component of the pathname to determine if it is a symbolic link.

    Following symbolic links is enabled in the sample configuration files.


Common Changes

IBM HTTP Server 2.0 and above on Linux and Unix systems: ThreadsPerChild

This directive is commonly modified as part of tuning the web server. There are advantages and disadvantages for different values of ThreadsPerChild:

  • Higher values for ThreadsPerChild result in lower overall memory use for the server, as long as the value of ThreadsPerChild isn't higher than the normal number of concurrent TCP connections handled by the server.

  • Extremely high values for ThreadsPerChild may result in encountering address space limitations.

  • Higher values for ThreadsPerChild often result in fewer connections maintained by the WebSphere plug-in to the application server, and better sharing of markdown information.

  • Higher values for ThreadsPerChild result in higher CPU utilization for SSL processing.

  • Higher ThreadsPerChild results in a more effective use of the cache and connection pooling in mod_ibm_ldap.

  • Higher ThreadsPerChild results in a more effective use of the cache in mod_mem_cache, because each child must fill its own cache. Setting MaxSpareThreads equal to MaxClients is also beneficial for mod_mem_cache, because it prevents child processes that have built up large caches from being gracefully terminated.

System tuning changes may be necessary to run with higher values for ThreadsPerChild. If IBM HTTP Server fails to start after increasing ThreadsPerChild, check the error log for any error messages. A common failure is a failed attempt to create a worker thread.

IBM HTTP Server 2.0 and above on Linux and Unix systems: MaxClients

This directive is commonly modified as part of tuning the web server to handle a greater client load (more concurrent TCP connections).

When MaxClients is increased, the value for MaxSpareThreads should be scaled up as well. Otherwise, extra CPU will be spent terminating and creating child processes when the load changes by a relatively small amount.

ExtendedStatus

This directive controls whether some important information is saved in the scoreboard for use by mod_status and diagnostic modules. When this is set to On, web server CPU usage may increase by as much as one percent. However, it can make mod_status reports and some other diagnostic tools more useful.


WebSphere plug-in


Tuning IHS to make the MaxConnections parameter more effective

The use of the MaxConnections parameter in the WebSphere plug-in configuration is most effective when IBM HTTP Server 2.0 and above is used and there is a single IHS child process. However, there are some operational tradeoffs to using it effectively in a multi-process webserver like IHS.

It is usually much more effective to actively prevent backend systems from accepting more connections than they can reliably handle, performing the throttling at the TCP level on the backend. When the limiting is instead done at the client (HTTP plug-in) side, there is no cross-system or cross-process coordination, which makes the limits ineffective.

Using MaxConnections with more than one child process, or across a web server farm, introduces a number of complications. Each IHS child process must have a high enough MaxConnections value to allow each thread to be able to find a backend server, but in aggregate the child processes should not be able to overrun an individual application server.

Choosing a value for MaxConnections

  • MaxConnections has no effect if it exceeds ThreadsPerChild, because no child could try to use that many connections in the first place.

  • Upper limit

    If you are concerned about a single HTTP Server overloading an Application server, you must first determine "N" -- the maximum number of requests the single AppServer can handle.

    MaxConnections would then be = (N / (MaxClients / ThreadsPerChild)), or N divided by the maximum number of child processes based on your configuration. This represents the worst-case number of connections from IHS to a single Application Server. As the number of backends grows, the likelihood of the worst-case scenario decreases, because even the uncoordinated child processes still distribute load with respect to session affinity and load balancing.

    For example, if you wish to restrict each Application Server to a total of 200 connections, spread out among 4 child processes, you must set the MaxConnections parameter to 50 because each child process keeps its own count.

  • Lower Limit

    If MaxConnections is too small, a child process may start returning errors because it has no AppServers to use.

    To prevent problems, MaxConnections * (number of usable backend servers) should exceed ThreadsPerChild.

    For example, if each child process has 128 ThreadsPerChild and MaxConnections is only 50 with two backend AppServers, a single child process may not be able to fulfill all 128 requests because only 50 * 2 connections can be made.

To use MaxConnections, IHS should be configured to use a small, fixed number of child processes, and to not vary them in response to changes in load. This provides a consistent, predictable number of child processes that each have a fixed MaxConnections parameter.

  • MinSpareThreads and MaxSpareThreads should be set to the same value as MaxClients.

  • StartServers should be set to MaxClients / ThreadsPerChild.
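
A minimal sketch of such a fixed-pool configuration, assuming an illustrative four child processes of 50 threads each:

    ThreadLimit       50
    # 4 child processes, doubled for exiting-process slack
    ServerLimit        8
    ThreadsPerChild   50
    MaxClients       200
    # MaxClients / ThreadsPerChild
    StartServers       4
    # equal to MaxClients so child processes are never reaped
    MinSpareThreads  200
    MaxSpareThreads  200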

When more than one child process is configured (the number of child processes is MaxClients/ThreadsPerChild), setting MaxSpareThreads equal to MaxClients can have the effect of keeping multiple child processes alive when they aren't strictly needed. This can be detrimental to the WebSphere Plugin detecting markdowns, because the threads in each child process must independently discover that a server should be marked down. See the next section, Tuning IHS for efficiency of Plugin markdown handling.

Tuning IHS for efficiency of Plugin markdown handling

Only WebSphere Plugin threads in a single IHS child process share info about AppServer markdowns, so some customers wish to aggressively limit the number of child processes that are running at any given time. If a user has problems with markdowns being discovered by many different child processes in the same webserver, consider increasing ThreadsPerChild and reducing MinSpareThreads and MaxSpareThreads as detailed below.

  • One approach is to use a single child process, where MaxClients and ThreadsPerChild are set to the same value. IHS will never create or destroy child processes in response to load.

    Cautions

    • A WebServer crash impacts 100% of the clients.

    • Some types of hangs may influence 100% of the clients.

    • CPU usage may increase if SSL is used and ThreadsPerChild exceeds a few hundred.

    • More ramifications of high ThreadsPerChild are discussed elsewhere in this document.

  • A second approach is to use a variable number of child processes, but to aggressively limit the number created by IHS in response to demand (and aggressively remove unneeded processes). This is accomplished by setting ThreadsPerChild to 25% or 50% of MaxClients, setting MinSpareThreads and MaxSpareThreads low (relative to recommendations here).

    Cautions:

    • MaxSpareThreads < MaxClients causes IHS to routinely kill off child processes; however, it may take some time for these processes to exit while slow requests finish processing.

    • A lower MaxSpareThreads can cause extra CPU usage for the creation of replacement child processes.

    • Caches for ESI and mod_mem_cache are thrown away when child processes exit.


Tuning IHS for efficiency of ESI invalidation servlet / web container threads

As the number of child processes increases (the ratio of ThreadsPerChild to MaxClients shrinks), if the ESI Invalidation Servlet is used with the WebSphere Plugin, more and more Web Container threads will be permanently consumed. Each child process uses one ESI invalidation thread (when the feature is configured), and this thread is used synchronously in the web container.

This requires careful consideration of the number of child processes per webserver, the number of webservers, and the number of configured Web Container threads.

SSL Performance

ciphers

When an SSL connection is established, the client (web browser) and the web server negotiate the cipher to use for the connection. The web server has an ordered list of ciphers, and the first cipher in that list which is supported by the client will be selected.

By default, IBM HTTP Server prefers AES ciphers, which are not computationally expensive, so tuning the order of SSL cipher directives for performance reasons is generally not needed.

The full list of supported ciphers can be displayed by running bin/apachectl -t -DDUMP_SSL_CIPHERS on any server with SSL enabled.

In modern times, tuning of SSL ciphers is for security not performance. Most of the performance limitations of SSL are handshake related, and not related to the selected cipher. The example below for 7.0 and earlier explicitly selects from the strongest available ciphers only:

<VirtualHost *:443>
  SSLEnable
  Keyfile keyfile.kdb

  ## FIPS approved SSLV3 and TLSv1 128 bit AES Cipher
  SSLCipherSpec TLS_RSA_WITH_AES_128_CBC_SHA

  ## FIPS approved SSLV3 and TLSv1 256 bit AES Cipher
  SSLCipherSpec TLS_RSA_WITH_AES_256_CBC_SHA
</VirtualHost>

For IHS V8R0 and later, new ciphers and a new syntax for SSLCipherSpec are supported. This is a required change, since new TLS protocols with a disjoint set of ciphers are supported. These releases of IHS also favor strong ciphers, and they completely disable weak and export ciphers.

To remove RC4 from the defaults, which are already restricted to medium and strong ciphers:

    SSLCipherSpec ALL -SSL_RSA_WITH_RC4_128_SHA -SSL_RSA_WITH_RC4_128_MD5

Individual ciphers can be listed to further influence their ordering, but this is generally an unnecessary optimization.

You can use the following LogFormat directive to view and log the SSL cipher negotiated for each connection:

    LogFormat "%h %l %u %t \"%r\" %>s %b \"SSL=%{HTTPS}e\" \"%{HTTPS_CIPHER}e\" \"%{HTTPS_KEYSIZE}e\" \"%{HTTPS_SECRETKEYSIZE}e\"" ssl_common
    CustomLog logs/ssl_cipher.log ssl_common

This logformat will produce an output to the ssl_cipher.log that looks something like this:

    127.0.0.1 - - [18/Feb/2005:10:02:05 -0500] "GET / HTTP/1.1" 200 1582 "SSL=ON" "SSL_RSA_WITH_RC4_128_MD5" "128" "128"

Server certificate size

Larger server certificates are also costly. Every doubling of key size costs 4-8 times more CPU for the required computation.

Unfortunately, you don't have a lot of choice in the size of your server certificate; the industry is currently (2010) moving from 1024-bit to 2048-bit certificates to keep up with the increasing compute power available to those trying to break SSL. But there are some SSL performance tuning tips that can help.

The primary cost of the computation associated with a larger server certificate size is in the SSL handshake when a new session is created, so using keep-alive and re-using SSL sessions can make a significant difference in performance. See more about that below.

Linux and Unix systems, IBM HTTP Server 2.0 and higher: ThreadsPerChild

The SSL CPU utilization will be lower with lower values of ThreadsPerChild. We recommend using a maximum of 100 if your server handles a lot of SSL traffic, so that the client load is spread among multiple child processes. (Note: This optimization is not possible on Windows, which supports only a single child process.)

AIX, IBM HTTP Server 2.0 and higher: MALLOCMULTIHEAP setting in IHSROOT/bin/envvars

Set this to the value true when there is significant SSL work-load, as this will result in better performance for the heap operations used by SSL processing.
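
For example, in IHSROOT/bin/envvars (replacing the default considersize setting described earlier):

    MALLOCMULTIHEAP=true
    export MALLOCMULTIHEAP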

Should I use a cryptographic accelerator?

The preferred approach to improving SSL performance is to use software tuning to the greatest extent possible. Installation and maintenance of crypto cards is relatively complex and usually results in a relatively small reduction in CPU usage. We have observed many situations where the improvement is less than 10%.

HTTP keep-alive and SSL

HTTP keep-alive has a much larger benefit for SSL than for non-SSL. If the goal is to limit the number of worker threads utilized for keep-alive handling, performance will be much better if KeepAlive is enabled with a small timeout for SSL-enabled virtual hosts, than if keep-alive is disabled altogether.

Example:

    <VirtualHost *:443>
    normal configuration
    # enable keepalive support, but with very small timeout
    # to minimize the use of worker threads
    KeepAlive On
    KeepAliveTimeout 1
    </VirtualHost>

Warning! We are not recommending "KeepAliveTimeout 1" in general. We are suggesting that this is much better than setting KeepAlive Off. Larger values for KeepAliveTimeout will result in slightly better SSL session utilization at the expense of tying up a worker thread for a longer period of time in case the browser sends in another request before the timeout is over. There are diminishing returns for larger values, and the optimal values are dependent upon the interaction between your application and client browsers.

SSL Sessions and Load Balancers

An SSL session is a logical connection between the client and web server for secure communications. During the establishment of the SSL session, public key cryptography is used to exchange a shared secret master key between the client and the server, and other characteristics of the communication, such as the cipher, are determined. Later data transfer over the session is encrypted and decrypted with symmetric key cryptography, using the shared key created during the SSL handshake.

The generation of the shared key is very CPU intensive. In order to avoid generating the shared key for every TCP connection, there is a capability to reuse the same SSL session for multiple connections. The client must request to reuse the same SSL session in the subsequent handshake, and the server must have the SSL session identifier cached. When these requirements are met, the handshake for the subsequent TCP connection requires far less server CPU (80% less in some tests). All web browsers in general use are able to reuse the same SSL session. Custom web clients sometimes do not have the necessary support, however.

The use of load balancers between web clients and web servers presents a special problem. IBM HTTP Server cannot share a session id cache across machines. Thus, the SSL session can be reused only if a subsequent TCP connection from the same client is sent by the load balancer to the same web server. If it goes to another web server, the session cannot be reused and the shared key must be regenerated, at great CPU expense.

Because of the importance of reusing the same SSL session, load balancer products generally provide the capability of establishing affinity between a particular web client and a particular web server, as long as the web client tries to reuse an existing SSL session. Without the affinity, subsequent connections from a client will often be handled by a different web server, which will require that a new shared key be generated because a new SSL session will be required.

Some load balancer products refer to this feature as SSL Sticky or Session Affinity. Other products may use their own terminology. It is important to activate the appropriate feature to avoid unnecessary CPU usage in the web server, by increasing the frequency that SSL sessions can be reused on subsequent TCP connections.

End users will generally not be aware that the SSL session is not being reused unless the overhead of continually negotiating new sessions causes excessive delay in responses. Web server administrators will generally only become aware of this situation when they observe the CPU utilization approaching 100%. The point at which this becomes noticeable will depend on the performance of the web server hardware, and whether or not a cryptographic accelerator is being used.

When SSL is being used and excessive web server CPU utilization is noticed, it is important to first confirm that Session Affinity is enabled if a load balancer is being used.

Checking the actual reuse of SSL sessions

First, get the number of new sessions and reused sessions. LogLevel must be set to info or debug.

IBM HTTP Server 2.0.42 or 2.0.47 with cumulative fix PK13230 or later and IBM HTTP Server 6.0.2.1 and later writes messages of this format for each handshake:

    [Sat Oct 01 15:30:17 2005] [info] [client 9.49.202.236] Session ID: YT8AAPUJ4gWir+U4v2mZFaw5KDlYWFhYyOM+QwAAAAA= (new)
    [Sat Oct 01 15:30:32 2005] [info] [client 9.49.202.236] Session ID: YT8AAPUJ4gWir+U4v2mZFaw5KDlYWFhYyOM+QwAAAAA= (reused)

To get the relative stats:

    $ grep "Session ID.*reused" logs/error_log  | wc -l
    1115
    $ grep "Session ID:.*new" logs/error_log  | wc -l
    163

The percentage of expensive handshakes for this test run is 163/(1115 + 163) or 12.8%. To confirm that the load balancer is not impeding the reuse of SSL sessions, perform a load test with and without the load balancer and compare the percentage of expensive handshakes in both tests.

Alternately, use the load balancer for both tests, but for one load test have the load balancer send all connections to a particular web server, and for the other load test have it balance the load between multiple web servers.

Session ID cache limits

IBM HTTP Server uses an external session ID cache with no practical limits on the number of session IDs unless the operating system is Windows or the directive SSLCacheDisable is present in the IHS configuration.

When the operating system is Windows or the SSLCacheDisable directive is present, IBM HTTP Server uses the IBM Global Security Kit (GSKit) internal session ID cache which is limited to 512 entries by default.

This limit can be increased to a maximum of 4095 (64000 for z/OS) entries by setting the environment variable GSK_V3_SIDCACHE_SIZE to the desired value.
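
One way to set the variable, as a sketch, is to export it from IHSROOT/bin/envvars before the server starts; the value shown is the documented maximum for platforms other than z/OS:

    GSK_V3_SIDCACHE_SIZE=4095
    export GSK_V3_SIDCACHE_SIZE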

Network Tuning


All platforms

Problem Description

Low data transfer rates handling large POST requests.

This problem can be caused by a small TCP receive buffer size being used for web server sockets. This results in the client being limited in how much data it can send before the server machine has to acknowledge it, resulting in poor network utilization.

Resolution

Some data transfer performance problems can be solved using the native operating system mechanism for increasing the default size of TCP receive buffers. IBM HTTP Server must be restarted after making the change. The following levels of IBM HTTP Server contain a ReceiveBufferSize directive for setting this value in a platform-independent manner, and only for the web server:

  • 2.0.42.2 with cumulative e-fix PK07831 or later

  • 2.0.47.1 with cumulative e-fix PK07831 or later

  • 6.0.2 or later (6.0.2.1 or later on Windows)

Usage:

ReceiveBufferSize number-of-bytes

This directive must appear at global scope in the configuration file.

Making the adjustment

  1. Check the current system default using the platform-specific command in the previous table.

  2. Use either 131072 bytes, or twice the current system default, whichever is greater. Example ReceiveBufferSize directive:

         ReceiveBufferSize 131072

     If the ReceiveBufferSize directive is not available, use the platform-specific command in the previous table to change the system default.

  3. Restart the web server, then retry the testcase.

  4. If POST performance did not improve enough, double the receive buffer value and try again.

AIX

Problem Description

Low data transfer rates running on AIX 5 when handling large (multi-megabyte) POST requests from Windows machines. Network traces show large delays (~150 ms) between packet acknowledgments.

Resolution

This performance problem can be corrected by setting an AIX network tuning option and applying AIX maintenance.

For all releases of AIX, set the tcp_nodelayack network option to 1 by using the following command:

no -o tcp_nodelayack=1

Problem Description

Unexpected network latency when the application is somewhat slow. Network traces show a normal HTTP 200 OK message for the first part of the response, then AIX waits ~150ms for a delayed ACK from the client.

Resolution

This performance problem can be corrected by setting an AIX network tuning option.

Set the rfc2414 network option to 1 by using the following command:

no -o rfc2414=1

Operating System Tuning Reference Materials

Instructions for tuning some operating system parameters are available in the WebSphere InfoCenter. Many of these parameters, such as TCP layer configuration or file descriptor configuration, apply to IBM HTTP Server as well.

Specific performance symptoms

Slow startup, or slow response time from proxy or LDAP with IBM HTTP Server 2.0 or above on AIX

In support of IPv6 networking, these levels of IBM HTTP Server will query the resolver library for IPv4 and IPv6 addresses for a host. This can result in extra DNS queries on AIX, even when the IPv4 address is defined in /etc/hosts. To work around this issue, IPv6 lookups can be disabled.

This setting can be changed system-wide by editing /etc/netsvc.conf, which configures the resolver system-wide. Add or modify the lookup rule for hosts so that it has this setting:

hosts=local4,bind4

That will disable IPv6 lookups. Now restart IBM HTTP Server and confirm that the delays with proxy requests or LDAP have been resolved.

The setting can be changed for IHS only by adding this to the end of IHSROOT/bin/envvars:

NSORDER=local4,bind4
export NSORDER

High disk I/O with IBM HTTP Server on AIX

A customer reported that an internal disk mirror showed a high level of write I/O every 60 seconds which was related empirically to client load on the web server and which was determined to be unrelated to logging. AIX support narrowed down the specific web server activity related to the high write I/O and determined that it was due to file access times being updated by the filesystem when the web server served the page.

IBM HTTP Server 2.0 and above can send static files using the AIX send_file() API, which in turn can enable the AIX kernel to deliver the file contents to the client from the network buffer cache. This results in the file access time remaining unchanged, which solved this particular disk I/O problem.

The use of send_file() is controlled with the EnableSendfile directive. Several potential problems must be considered when IBM HTTP Server uses send_file(); thus it is disabled by default in the configuration files provided with the last several releases.

High CPU in child processes after WebSphere plugin config is updated

The WebSphere plugin will normally reload the plugin configuration file (plugin-cfg.xml) during steady state operation if the file is modified. When the reload occurs during steady state operation, it must be reloaded in every web server child process serving requests. Initialization of https transports is particularly CPU-intensive so, if there are many such transports defined or many child processes, the CPU impact can be high.

One way to address the issue on Unix and Linux platforms is to disable automatic reload by setting RefreshInterval to -1 in plugin-cfg.xml, then use the apachectl graceful command to restart the web server when the new plug-in configuration must be activated. This will result in the reload occurring only once -- in the IHS parent process. The new configuration will be inherited by the new child processes which are created by the restart operation.
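
For example, after updating plugin-cfg.xml with the automatic reload disabled, the new plug-in configuration can be activated with a graceful restart (run from the IHS installation root):

    # verify the configuration, then restart gracefully
    ./bin/apachectl -t
    ./bin/apachectl graceful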

Another way to address this issue is to utilize WebSphere 6.1 (or later) web server definitions. This allows smaller plug-in configuration files, because each plugin-cfg.xml is generated with only the transports relevant to that web server. When the reload occurs, it doesn't reinitialize all the transports; only the transports that changed in the config will be reinitialized.

z/OS Considerations

Miscellaneous z/OS issues

  • Some users have reported delays on servers with many static HFS/zFS files unless EnableMMAP Off is set.

  • IHS threads can block due to disk filesystem sync activity even with logging disabled. This can be caused by updates to file access times for IHS static content files on an HFS. A mount parameter of SYNC(99999) prevents the metadata updates. zFS uses asynchronous disk writes and should not be affected as much by sync activity.

Heap contention without CPACF

On SSL-enabled systems where ICSF is not using crypto accelerators via CPACF, busy servers may see poor performance due to heap latch contention. z/OS offers a runtime option that allows multiple threads within a process to access the heap more efficiently.

Instances of IHS created after PH59165 will configure HEAPPOOLS/HEAPPOOLS64 by default which makes multi-threaded heap access much more efficient. Instances created prior to this level can apply the customization documented in the APAR.

USS save areas

Currently USS has 14,400 save areas per LPAR in a cell pool. One of these is used for every Unix system call. If the LPAR supports more concurrent web connections than this, the cell pool may become depleted. Once this happens, subsequent system calls will still work, but they must first obtain the local lock and use slower methods of allocating storage. Since the cell pool is an LPAR wide resource, other Unix applications (WAS for example) running on the same LPAR are affected.

If your LPAR needs to support tens of thousands of concurrent web connections, check the mpmstats output in the error log for each IHS server. The bsy count shows the total number of busy threads. Each of the busy threads will typically need a USS save area. The ka count shows the number of threads that are busy due to keepalive reads. With IHS 8.5 and later, the ka count will always be zero because keepalive reads are handled asynchronously and do not tie up a thread. With IHS 8.0 and earlier, you may be able to reduce the KeepaliveTimeout value to lower the number of busy threads. Disabling Keepalive altogether is not recommended due to increased CPU overhead needed to re-establish a connection if the client requests another web object. The additional CPU overhead will be much greater for re-establishing SSL connections.

Default stack size for 31-bit IHS on z/OS

IHS 8.5.5.0 introduced an optional 31-bit IHS on z/OS for compatibility with 31-bit only libraries required by some DGW Plugins. If this 31-bit IHS is used, the default stack size should be increased to e.g. STACK(1M,128K,ANY,KEEP,1M,128K) in CEE_RUNOPTS. Otherwise, high CPU can be observed in CEE@HDPSO (xplink stack overflow).
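
As a sketch, one way to apply the setting is through the _CEE_RUNOPTS environment variable in IHSROOT/bin/envvars; whether your site sets Language Environment runtime options this way or through another mechanism is an assumption to verify:

    _CEE_RUNOPTS="STACK(1M,128K,ANY,KEEP,1M,128K)"
    export _CEE_RUNOPTS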

zEDC compression offload

Compressing large files on the fly with mod_deflate can use significant CPU. IHS 8.5.5.4 (PI24424) and later includes an alternate mod_deflate.so module, mod_deflate_z.so, that can use zEnterprise Data Compression (zEDC) when configured. In future releases, the default mod_deflate.so will contain zEDC support.

In IHS benchmarks, more than 10x CPU can be saved in configurations that use high CPU due to mod_deflate.


Limiting number of address spaces

If there is a concern about the number of address spaces used by IHS, a few things can be done to minimize the number created:

  • Specify a higher ThreadsPerChild. The number of address spaces used for handling requests is the ratio of MaxClients to ThreadsPerChild.

  • Specify a lower MaxSpareThreads and/or MinSpareThreads. This helps reduce the number of idle address spaces used to handle requests.

  • Limit the use of piped loggers.

  • 8.5.5 and earlier: If using piped logs (rotatelogs), specify IHS_PIPED_LOGGER_NO_SHELL in $IHSROOT/bin/envvars. This prevents each piped logger from starting a /bin/sh address space, and it is the default behavior in version 9.

One reason to be concerned about the number of address spaces is that z/OS dumps are limited to 15 address spaces.