Thread Pools

Thread pools and their corresponding threads control all execution of the application. The more threads you have, the more requests you can be servicing at once. However, the more threads you have the more they are competing for shared resources such as CPUs and Java heap and the slower the overall response time may become as these shared resources are contended or exhausted. If you are not reaching a target CPU percentage usage, you can increase the pool sizes, but this will probably require more memory and should be sized properly. If there is a bottleneck other than the CPUs, then CPU usage will stop increasing. You can think of thread pools as queuing mechanisms to throttle how many concurrent requests you will have running at any one time in your application.

The most commonly used (and tuned) thread pools within the application server are:

  1. HTTP: WebContainer
  2. JMS (SIB): SIBJMSRAThreadPool
  3. JMS (MQ Activation Specifications): WMQJCAResourceAdapter
  4. JMS (MQ Listener Ports): MessageListenerThreadPool
  5. EJB: ORB.thread.pool
  6. z/OS: WebSphere WLM Dispatch Thread

Sizing Thread Pools

Understand which thread pools your application uses and size all of them appropriately based on utilization you see in tuning exercises through thread dumps or PMI/TPV.

If the application server ends up being stalled 1/2 of the time it is working on an individual request (likely due to waiting for a database query to start returning data), then you want to run with 2X the number of threads than cores being pinned. Similarly if it's 25%, then 4X, etc.

Use TPV or the IBM Thread and Monitor Dump Analyzer to analyze thread pools.

Thread pools need to be sized with the total number of hardware processor cores in mind.

  • If sharing a hardware system with other WAS instances, thread pools have to be tuned with that in mind.
  • You need to more than likely cut back on the number of threads active in the system to ensure good performance for all applications due to context switching at OS layer for each thread in the system
  • Sizing or restricting the max number of threads an application can have, will help prevent rouge applications from impacting others.

The ActiveCount statistic on a thread pool in WebSphere is defined as "the number of concurrently active threads" managed by that thread pool. This metric is particularly useful on the WebContainer thread pool because it gives an indication of the number of HTTP requests processed concurrently.

Note: The concurrent thread pool usage (PMI ActiveCount) may not necessarily be the concurrently "active" users hitting the application server. This is not due just to human think times and keepalive between requests, but also because of asynchronous I/O where active connections may not be actively using a thread until I/O activity completes (non-blocking I/O). Therefore, it is incorrect to extrapolate incoming concurrent activity from snapshots of thread pool usage.

If this metric approaches its maximum (which is determined by the maximum pool size), then you know that either the pool is simply too small or that there is a bottleneck that blocks the processing of some of the requests.

If the machine has multiple application server instances, then these sizes should be reduced accordingly. Conversely, there could be situations where the thread pool size might need to be increased to account for slow I/O or long running back-end connections.

Recent versions of WAS report when a thread pool has reached 80% or 100% of maximum capacity. Whether or not this is sustained or just a blip needs to be determined with diagnostics or PMI.

WSVR0652W: The size of thread pool "WebContainer" has reached 100 percent of its maximum.

Hung Thread Detection

WAS hung thread detection may be more accurately called WAS long response time detection (which defaults to watching requests taking more than 10-13 minutes) and the "may be hung" warning may be more accurately read as "has been executing for more than the configured threshold." The thread may or may not be actually hung at the time of the detection.

WSVR0605W is the warning printed when WAS detects that a unit of work is taking longer than the WAS hung thread detection threshold. Hang detection only monitors most WAS managed threads, such as the WebContainer thread pool. Any native threads, or threads spawned by an application are not monitored. The warning includes the stack of the thread at the moment the warning is printed which often points to the delay:

[11/16/09 12:41:03:296 PST] 00000020 ThreadMonitor W WSVR0605W: Thread "WebContainer : 0" (00000021) has been active for 655546 milliseconds and may be hung.
There is/are 1 thread(s) in total in the server that may be hung.
  at java.lang.Thread.sleep(Native Method)
  at java.lang.Thread.sleep(Thread.java:851)
  at com.ibm.Sleep.doSleep(Sleep.java:55)
  at com.ibm.Sleep.service(Sleep.java:35)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:831)...

WAS will check threads every com.ibm.websphere.threadmonitor.interval seconds (default 180) and any threads dispatched more than com.ibm.websphere.threadmonitor.threshold seconds (default 600) will be dumped. Therefore, any thread dispatched between com.ibm.websphere.threadmonitor.threshold seconds and com.ibm.websphere.threadmonitor.threshold + com.ibm.websphere.threadmonitor.interval seconds will be marked.

Hung thread detection includes the option of exponential backoff so that logs are not flooded with WSVR0605W warnings. Every com.ibm.websphere.threadmonitor.false.alarm.threshold number of warnings (default 100), the threshold is increased by 1.5X.

The amount of time the thread has been active is approximate and is based on each container's ability to accurately reflect a thread's waiting or running state; however, in general, it is the amount of milliseconds that a thread has been dispatched and doing "work" (i.e. started or reset to "non waiting" by a container) within a WAS managed thread pool.

To configure hung thread detection, change the following properties and restart: $SERVER } Server Infrastructure } Administration } Custom Properties:

  • com.ibm.websphere.threadmonitor.interval: The frequency (in seconds) at which managed threads in the selected application server will be interrogated. Default: 180 seconds (three minutes).
  • com.ibm.websphere.threadmonitor.threshold: The length of time (in seconds) in which a thread can be active before it is considered hung. Any thread that is detected as active for longer than this length of time is reported as hung. Default: The default value is 600 seconds (ten minutes).

Hung Thread Detection Overhead

The hung thread detection algorithm is very simple: it's basically a loop that iterates over every thread and compares the dispatch time (a long) to the current time (a long) and checks if the difference is greater than the threshold. Therefore, in general, it is possible to set the threshold and interval very low to capture "long" responses of a very short duration. For example, some customers run the following in production:

  1. $SERVER } Server Infrastructure } Administration } Custom Properties
  2. com.ibm.websphere.threadmonitor.interval = 1
  3. com.ibm.websphere.threadmonitor.threshold = 5
  4. Restart

OS Core Dumps on Hung Thread Warnings with J9

For OpenJ9 and IBM Java, you can also produce core dumps on a hung thread warning using -Xtrace:trigger:

-Xtrace:trigger=method{com/ibm/ws/runtime/component/ThreadMonitorImpl.threadIsHung,sysdump,,,1}

In this example, the maximum number of system dumps to produce for this trigger is 1. Enabling certain -Xtrace options on IBM Java <= 7.1 may affect the performance of the entire JVM (see the -Xtrace section).

Thread Pool Statistics

Starting with WAS 7.0.0.31, 8.0.0.8, and 8.5.5.2, thread pool statistics may be written periodically to SystemOut.log or trace.log. This information may be written to SystemOut.log by enabling the diagnostic trace Runtime.ThreadMonitorHeartbeat=detail or to trace.log by enabling the diagnostic trace Runtime.ThreadMonitorHeartbeat=debug. Example output:

[1/12/15 19:38:15:208 GMT] 000000d4 ThreadMonitor A   UsageInfo[ThreadPool:hung/active/size/max]={
  SIBFAPThreadPool:0/2/4/50,
  TCPChannel.DCS:0/3/18/20,
  server.startup:0/0/1/3,
  WebContainer:0/3/4/12,
  SIBJMSRAThreadPool:0/0/10/41,
  ProcessDiscovery:0/0/1/2,
  Default:0/2/7/20,
  ORB.thread.pool:0/0/10/77,
  HAManager.thread.pool:0/0/2/2
  }

When the diagnostic trace is enabled, this output is written every com.ibm.websphere.threadmonitor.interval seconds. Only thread pools that have at least one worker thread (whether active or idle) will be reported.

BoundedBuffer

Consider BoundedBuffer tuning: https://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tprf_tunechain.html

The thread pool request buffer is essentially a backlog in front of the thread pool. If the thread pool is at its maximum size and all of the threads are dispatched, then work will queue in the requestBuffer. The maximum size of the requestBuffer is equal to the thread pool maximum size; however, if the unit of work is executed on the thread pool with a blocking mode of EXPAND_WHEN_QUEUE_IS_FULL_ERROR_AT_LIMIT or EXPAND_WHEN_QUEUE_IS_FULL_WAIT_AT_LIMIT, then the maximum size is ThreadPoolMaxSize * 10. When the requestBuffer fills up, then WSVR0629I is issued (although only the first time this happens per JVM run per thread pool). When the requestBuffer is full, work will either wait or throw a ThreadPoolQueueIsFullException, depending on how the unit of work is executed.

How the JVM MBean dumpThreads method works

WAS exposes a JVM MBean for each process that has methods to create thread dumps, heap dumps, and system dumps. For example, to produce a thread dump on server1, use this wsadmin command (-lang jython):

AdminControl.invoke(AdminControl.completeObjectName("type=JVM,process=server1,*"), "dumpThreads")

The dumpThreads functionality is different depending on the operating system:

  • POSIX (AIX, Linux, Solaris, etc.): kill(pid, SIGQUIT)
  • Windows: raise(SIGBREAK)
  • z/OS: In recent versions, produces a javacore, heapdump, and SYSTDUMP by default

For any customers that have changed the behavior of the JVM (-Xdump) in how it responds to SIGQUIT/SIGBREAK (i.e. kill -3), then dumpThreads will respond accordingly (unless running z/OS, in which case use wsadmin_dumpthreads* properties). For anyone wishing to keep a non-default behavior for SIGQUIT/SIGBREAK but still have a scriptable way to produce only javacores, see the Troubleshooting chapters on alternative ways of requesting thread dumps.