# MustGather: Hangs

This document describes common problems leading to hang-like conditions,
as well as instructions for running the executable collector on Unix platforms.
The `GatherHangDoc` collector should only be run on a system actively
experiencing high CPU.

1. [Review known issues](#known-issues-to-check-for-first)
2. [Run the collector](#running-the-tool)


## Known issues to check for first

Review the issues below before running the MustGather or contacting support.  New or recent issues
are added to the top.

### IHS stops accepting connections with MaxRequestsPerChild non-zero

Prior to 9.0.5.11, IHS 9.0 on Linux or z/OS with a non-zero MaxRequestsPerChild
can stop accepting new connections after some time. When this happens, the normal
set of processes is not visible with `ps -ef`; only the main parent process and 1 or 2
utility processes remain.  See APAR PH41945 here: <https://www.ibm.com/support/pages/node/6539882>.
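A quick check for this condition can be sketched as follows. This is a minimal sketch: the helper name `nonzero_mrpc` is illustrative only, the install path in the usage comment is hypothetical, and files pulled in with `Include` are not followed.

```sh
# Print any MaxRequestsPerChild directive that has a non-zero value.
# Assumption: the directive is set directly in the file given as $1.
# Commented-out directives are ignored because their first field starts with "#".
nonzero_mrpc() {
    awk 'tolower($1) == "maxrequestsperchild" && $2 + 0 != 0 { print $1, $2 }' "$1"
}

# Example (hypothetical install path):
#   nonzero_mrpc /usr/HTTPServer/conf/httpd.conf
#   ps -ef | grep httpd     # compare against the usual set of processes
```

If the function prints a directive and `ps -ef` shows only the parent and one or two utility processes, the symptoms match the description above.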

### Hangs and delays on z/OS

If other observations indicate that IHS takes an unreasonable amount of
time to read requests or send responses, ensure that IHS, WAS, and
the client are not running in a WLM service class that uses DISCRETIONARY as a goal.

### [High CPU with mod_ldap and virtualhosts](#PI94050)
mod_ldap prior to PI94050 could cause loops, hangs, or high CPU when used
with virtual hosts.

### Hangs, delays, or frontend timeouts with IHS on Linux/PPC or AIX<!-- {#RNG} -->

See <a href="gather_highcpu_doc.html#GSKITICC_HIGHCPU">gather_highcpu_doc.html#GSKITICC_HIGHCPU</a>

### z/OS Only: server stops accepting new connections under light load<!-- {#MSGRCV} -->

If the server stops accepting new connections on z/OS, make sure LE APAR PM90528 is present.

### All Platforms; WebSphere Application Server MustGathers <!-- {#WASMUSTGATHER} -->

In many cases, the ServerDoc tool will simply show that the WebSphere 
Plugin is waiting for a response from a backend Application Server that
is not responding in a timely fashion. For this reason, we encourage
customers to proactively gather the WebSphere Application Server MustGather for
Hangs and submit it along with the IBM HTTP Server MustGather.


For WebSphere Application Server MustGathers for your platform, see
[here](http://www-01.ibm.com/support/search.wss?rs=180&tc=SSEQTP&tc1=SSCMPB9&q=mustgather).

###  Startup hangs with mod_proxy_balancer<!-- {#BALANCERRANDOM} -->
`mod_proxy_balancer` requires 2 bytes of random data during
start/restart which will result in a blocking read of `/dev/random`.
On some virtualized systems, or systems where some other process is exhausting
`/dev/random`, startup may hang until enough entropy can be gathered
by the system.
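On Linux, the kernel's entropy estimate can be inspected before a start or restart. The sketch below assumes the estimate is exposed at `/proc/sys/kernel/random/entropy_avail`, which is Linux-specific; other platforms need a different check, and on recent kernels the value may be less informative. The helper name `entropy_avail` is illustrative only.

```sh
# Print the kernel's current entropy estimate, or "unknown" if it cannot be read.
# Assumption: Linux, where the estimate is exposed under /proc.
# The optional argument overrides the path, which is mainly useful for testing.
entropy_avail() {
    f=${1:-/proc/sys/kernel/random/entropy_avail}
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo unknown
    fi
}

# Example:
#   entropy_avail
```

A persistently low value while startup appears stuck is consistent with the blocking read described above.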

### Solaris
For web server hangs with IHS 2.0 on Solaris, please see
<a href="solaris_hang.html">this document</a> first.

### AIX 5.3

For web server hangs with any release of IHS on AIX 5.3:

* AIX APAR IY58143 is required.  If that APAR is not currently
installed, you must install that APAR and reproduce the problem (if
possible) before sending hang documentation to IBM HTTP Server
support.  You can check for the installation of this APAR in the
following manner:

```
$ /usr/sbin/instfix -vik IY58143
IY58143 Abstract: Required fixes for AIX 5.3

    Fileset X11.Dt.lib:5.3.0.1 is applied on the system.
    Fileset X11.Dt.rte:5.3.0.1 is applied on the system.
    Fileset X11.base.rte:5.3.0.1 is applied on the system.
    Fileset X11.fnt.ucs.ttf is not applied on the system.
    Fileset X11.fnt.ucs.ttf_extb:5.3.0.1 is applied on the system.
    ...
    Fileset devices.vdevice.hvterm1.rte:5.3.0.1 is applied on the system.
    Fileset devices.vtdev.scsi.rte:5.3.0.1 is applied on the system.
    Fileset sysmgt.websm.apps:5.3.0.1 is applied on the system.
    Fileset sysmgt.websm.framework:5.3.0.1 is applied on the system.
    Fileset sysmgt.websm.rte:5.3.0.1 is applied on the system.
    Fileset sysmgt.websm.webaccess:5.3.0.1 is applied on the system.
    All filesets for IY58143 were found.

```

## What we expect to learn from this information

Common web server hang conditions can be categorized as
follows:

* IHS is not responding because all IHS threads are waiting on an
external application to respond

* IHS is not responding because of a problem preventing it from
processing new client connections

As discussed in the following sections, the root cause of the
problem may not reside in IHS, so analysis of the IHS hang
documentation may indicate that a different type of information is
necessary.

### IHS is waiting on an external application

A primary use of IHS is as a front-end to the WebSphere Application
Server.  It is possible for applications running in WebSphere to have
a delayed response, or no response at all, so that all IHS threads are
waiting for an application server response and no free IHS threads are
available to handle new client connections.

Some authentication mechanisms for IHS, such as the LDAP authentication
capability provided with IHS or by a third-party vendor, must contact a server
over the network as part of IHS request processing.  If that
communication stalls, it is possible that after some time all IHS
threads are waiting on an authentication response and no free IHS
threads are available to handle new client connections.

In any situation where all IHS threads are waiting on an external
application, the IHS hang documentation will show which component is
waiting but it cannot determine the root cause for why the application
is not responding.


Note: If the IHS hang documentation shows that IHS is waiting for
a WebSphere response, related documentation for WebSphere will need to
be gathered.  Instructions for this WebSphere documentation can
be found [here](http://www-1.ibm.com/support/search.wss?rs=180&tc=SSEQTP&tc1=SSCMPB9&q=mustgather).
This documentation can be collected at the same time as the
IHS hang documentation, so that the required WebSphere
information is available to IBM support if it is needed.

Vendors of third-party components which run inside IHS may provide
similar information for gathering documentation on problems that can
cause the component to hang or stall; contact the vendor for more
information.

### IHS has a problem with processing new client connections

For this type of problem, IBM support anticipates being able to
determine the failing component, as well as whether or not this is a
known problem.  Occasionally there are operating system issues which
prevent IHS from finding out about new client connections.  If
analysis of the IHS hang documentation shows such a problem, network
traces may be necessary and operating system support may suggest
further diagnostic information.

## Making sure required support programs are available

Please refer to <a href="check_platform.html">these
instructions</a> for verifying that required support programs are
installed.

## Running the tool

You will need to [download the collector](install.html) first.

### Warning 
This tool uses native tools such as `strace` and `truss`
to obtain system call traces, which include the contents of buffers used to
read and write data on the network. The collected documentation may therefore
contain sensitive request and response data.

Note: This executable mustgather is not used on Windows or z/OS.

* On Windows, refer to <a href="win32_hang_doc.html">win32_hang_doc.html</a>

* On z/OS, system dumps should be collected that include the httpd address spaces. Previously,
the executable mustgather was provided on z/OS, but it depends on the dbx debugger, which
conflicts with the very common SAFRunAs directive.

Run the tool as `root` to avoid any permissions problems
with obtaining backtraces or reading files, such as log files and
configuration files.  (More information about the requirement to run
this tool as `root` is available <a
href="#root">here</a>.)

ServerDoc takes four parameters for gathering hang documentation:

* `GatherHangDoc`
* the name of the IHS installation directory (e.g., /usr/HTTPServer)
* the web server parent process id, or "auto" if the parent process
has exited and left stranded child processes or if CGI
processes are stalling
* the address of a non-SSL port handled by the web server (e.g.,
127.0.0.1:80), or "-" if there is no non-SSL port

```
# java -jar ServerDoc.jar GatherHangDoc /path/to/IHS 1398 127.0.0.1:80
```

The tool creates a new directory which contains a timestamp in the
name, and the hang documentation will be saved in that directory.

### Determining the value of the non-SSL address parameter

If the IHS installation only supports SSL, then use
*-* (hyphen) for this parameter.  Otherwise, specify an IP address and
port which can be used to reach the server from the local machine
without using SSL.

* If the server has no non-SSL listening ports, use *-*
* If the server has a typical `Listen 80` or `Listen 0.0.0.0:80`, use *127.0.0.1:80*
* If the server listens on a particular interface and port, like `Listen 192.168.1.15:81`, use that address and port verbatim: *192.168.1.15:81*
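
The rules above can be sketched as a small shell function. This is a minimal sketch, not part of ServerDoc: the name `listen_to_addr` is hypothetical, only the three forms listed above are handled (IPv4, no wildcard or IPv6 syntax), and the caller is assumed to have already confirmed that the port is not an SSL port.

```sh
# Map the argument of a Listen directive to the non-SSL address parameter.
# Assumption: $1 is empty when the server has no non-SSL listening port.
listen_to_addr() {
    case "$1" in
        "")         echo "-" ;;                          # no non-SSL port at all
        0.0.0.0:*)  echo "127.0.0.1:${1#0.0.0.0:}" ;;    # all interfaces: use loopback
        *:*)        echo "$1" ;;                         # specific interface: use verbatim
        *)          echo "127.0.0.1:$1" ;;               # bare port, e.g. "Listen 80"
    esac
}

# Examples:
#   listen_to_addr 80
#   listen_to_addr 192.168.1.15:81
```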

<div id="section">
<h4 id="samplerun"><a href="#samplerun">a sample run </a></h4>

For this example, IHS is installed in `/scratch/IHS`,
the parent process id is stored in file
`/scratch/IHS/logs/httpd.pid`, the non-SSL port can be
reached from the web server machine on address
`127.0.0.1:8080`, and ihsdiag was unpacked into directory
`/root/ihsdiag-1.3.0`.


```
# cd /tmp
# java -jar /root/ihsdiag-1.3.0/ServerDoc.jar GatherHangDoc \
/scratch/IHS `cat /scratch/IHS/logs/httpd.pid` 127.0.0.1:8080
Gathering doc on 4 web server processes...
5985  5986  5988  5984

Seconds remaining before gathering information again:
60...54...48...42...36...30...24...18...12...6...

Gathering doc on 4 web server processes...
5985  5986  5988  5984

Seconds remaining before gathering information again:
30...27...24...21...18...15...12...9...6...3...

Gathering doc on 4 web server processes...
5985  5986  5988  5984

Reports, log files, and configuration files have been saved to
directory
  HangDoc.200408310607
If you have additional log files or configuration files, copy them
there before packing up the directory.
Web server log and conf files other than the default will have to be
copied manually.  WebSphere plug-in conf and log files will have to be copied manually.

Hint for packing up the directory:
  tar -cf HangDoc.200408310607.tar HangDoc.200408310607
  gzip HangDoc.200408310607.tar
# ls -l HangDoc.200408310607/
total 772
-rw-rw-r--    1 trawick  trawick         0 Aug 31 06:07 access_log
-rw-rw-r--    1 trawick  trawick      5358 Aug 31 06:07 apachectl
-rw-rw-r--    1 trawick  trawick       118 Aug 31 06:07 error_log
-rw-rw-r--    1 trawick  trawick    462978 Aug 31 06:07 httpd
-rw-rw-r--    1 trawick  trawick     28790 Aug 31 06:07 httpd.conf
-rw-rw-r--    1 trawick  trawick    255056 Aug 31 06:08 log
-rw-rw-r--    1 trawick  trawick        56 Aug 31 06:07 redhat-release
-rw-rw-r--    1 trawick  trawick      5453 Aug 31 06:08 report
```
</div>

### What if the HangDoc tool is taking a very long time?

If you need to interrupt the tool so the web server can be
restarted (to try to resolve the hang condition), the best place to
interrupt it is when it is counting down the number of seconds until
it checks the web server state again.  The last lines of output on the
display will look like this:


```
Seconds remaining before gathering information again:
60...54...48...42...36...30.
```

If the tool is interrupted at a different time, incomplete
information will be gathered on the state of the web server.  This
will introduce some risk into our analysis of the problem, but as long
as a meaningful percentage of the web server processes have been
examined (more than 30%), it is usually possible to find a probable cause of
the hang.

#### The display is not being updated after several minutes

If the IHS child processes have a very large number of threads
(e.g., ThreadsPerChild is higher than 200), the likely cause is that
the system debugger performs poorly when analyzing such
processes.

It is also possible that the HangDoc tool has a problem interacting
with the system debugger, and it will never finish.

To find out more information about the cause of the delay, take
these steps:


* Make sure you've waited at least four minutes from the time that
the display was last updated.
* From another terminal window, save the output of `ps -ef`
to a file.  This must be done before interrupting the
HangDoc tool.
* Interrupt the HangDoc tool and find the most recent
`HangDoc.xxxx` directory, which is what it was using when
it stalled.
* Cut and paste the HangDoc display to a file.
* Send in the ps listing, the HangDoc display, and the
`HangDoc.xxxx` directory to IHS support for analysis.
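
The steps above for locating the directory the tool was using can be sketched in shell. The helper name `latest_hangdoc_dir` and the file names in the usage comment are hypothetical; the function relies on the timestamp in the directory name, so a plain sort is chronological.

```sh
# Print the newest HangDoc.* directory under $1 (default: current directory).
# Only directories are considered, so packed-up .tar or .gz files are ignored.
latest_hangdoc_dir() {
    for d in "${1:-.}"/HangDoc.*; do
        [ -d "$d" ] && printf '%s\n' "$d"
    done | sort | tail -n 1
}

# Example, from a second terminal and before interrupting the tool:
#   ps -ef > /tmp/ps-ef.out
# Then, after interrupting the tool in the directory where it was started:
#   latest_hangdoc_dir
```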

### Copying other web server and plug-in files

The next step is to copy any other web server or plug-in
configuration files and logs into the new HangDoc directory.  Here is
a list of files to copy if they are being used:

* any IHS configuration file other than httpd.conf in the
IHS install directory
* any additional web server error or access log files, such as log
files specific to each virtual host or log files created by rotatelogs
* the WebSphere plug-in configuration file
* the WebSphere plug-in log file
* any platform-specific WebSphere Application Server <a href="#WASMUSTGATHER">MustGathers</a>
that were proactively collected.
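
Copying the files listed above can be sketched as a small helper. The function name `copy_extra_docs` and every path in the usage comment are hypothetical; substitute the locations used by your installation. Files with the same base name (for example, per-virtual-host logs that are all called `error_log`) would overwrite each other, so rename those before copying.

```sh
# Copy each existing file named after the first argument into directory $1,
# preserving timestamps. Missing files are reported on stderr and skipped.
copy_extra_docs() {
    dest=$1
    shift
    for f in "$@"; do
        if [ -f "$f" ]; then
            cp -p "$f" "$dest/"
        else
            echo "skipped (not found): $f" >&2
        fi
    done
}

# Example (all paths are placeholders):
#   copy_extra_docs HangDoc.200408310607 \
#       /path/to/plugin-cfg.xml /path/to/http_plugin.log /path/to/vhost1_error_log
```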


### Saving the documentation directory

The last step is to pack up and compress the documentation
directory using zip, tar followed by gzip, or pax followed by 
compress.  The easiest way is to cut and paste the messages displayed
by ServerDoc previously which showed the commands to use.  The
suggested commands will vary by platform.  On z/OS, for example,
pax and compress will be suggested instead of tar and gzip.


Don't forget to collect the corresponding WebSphere Application Server <a href="#WASMUSTGATHER">MustGathers</a> and
include them in your submission.

A sample run:

```
 tar -cf HangDoc.200408310607.tar HangDoc.200408310607
 gzip HangDoc.200408310607.tar
```


The resulting compressed file is the file to send to IBM
support.

### Understanding the `root` requirement <!-- {#ROOT} -->

When gathering information on web server hangs, the tool must
attach to live web server processes to obtain information about the
state of those processes.

If the web server is started as `root`, then at least one of these processes
will be owned by `root` and other processes will be owned by the web server
user id (e.g., `nobody` or `www`).  Only `root` has the authority to attach to
all of the processes, so the tool itself must be run as `root`.  If the web
server administrator does not have authority to log in or switch user to
`root`, a simple script can be created to gather the hang documentation, and
the system administrator can give the web server administrator `sudo` access to
that script.  `sudo` is a third-party tool available without cost for all
appropriate platforms.
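
A wrapper script of the kind described above might look like the following. This is a sketch under stated assumptions, not a supported tool: the script location, the three paths and addresses at the top, and the function names are all hypothetical, and falling back to "auto" when the pid file is missing or empty is an assumption that the parent process has exited.

```sh
#!/bin/sh
# Hypothetical wrapper, e.g. saved as /usr/local/sbin/gather-ihs-hang and
# writable only by root. All values below are assumptions; adjust them.
IHS_ROOT=/usr/HTTPServer
SERVERDOC_JAR=/root/ihsdiag/ServerDoc.jar
NONSSL_ADDR=127.0.0.1:80

# Print the parent pid recorded in pid file $1, or "auto" if the file is
# missing or empty.
hang_pid_arg() {
    if [ -s "$1" ]; then
        cat "$1"
    else
        echo auto
    fi
}

# Run the collector from /tmp so the HangDoc.* directory is created there.
gather_hang_doc() {
    cd /tmp || return 1
    java -jar "$SERVERDOC_JAR" GatherHangDoc "$IHS_ROOT" \
        "$(hang_pid_arg "$IHS_ROOT/logs/httpd.pid")" "$NONSSL_ADDR"
}

# Collect only when invoked as "<script> run", so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
    gather_hang_doc
fi
```

The system administrator could then allow a hypothetical `webadmin` user to run it with a sudoers entry such as `webadmin ALL = (root) NOPASSWD: /usr/local/sbin/gather-ihs-hang run`.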

If the web server is not started as `root`, there are no
such concerns, and the hang documentation tool may be run by the user
id which starts the web server.

If the tool is run as non-`root` and it is unable to
gather the required information, the problem will have to be
recreated.  It may not be possible to determine if this problem
occurred until the documentation has been analyzed by IBM HTTP Server
support.


### Interpreting the report output <!-- {#INTERPRET} -->

* Why is `gsk_get_last_validation_error` in the call stack in unexpected places?

Sometimes a series of IBM Global Security Kit (GSKit) internal functions show up as `gsk_get_last_validation_error` in the backtraces,
but they are not a cause for concern.  Usually the lowest call in the stack is a properly displayed GSKit function
(such as `gsk_secure_soc_init`) and higher in the stack will be the IHS or WebServer Plugin I/O callbacks
(`secure_read` or `plugin_ssl_read`).

* Why do some processes have a number of threads less than `ThreadsPerChild`?

This is normal for some utility processes created by IHS. The
IHS parent process, sidd, and the CGI daemon will be decorated as such in the report.

If one of the threads is decorated as the "IHS main thread waiting for
process to exit...", then the remaining threads are likely to be hung or stuck
in a loop (causing high CPU).  We'd typically look for a blocking system call
(read, poll, select, mutex/lock related) in the first few frames for a hung
thread, then look to identify the module owning the hung code by looking
farther down in the stack to see where control was handed off from the core of
Apache.