Troubleshooting WAS traditional Recipes

  1. Periodically monitor WAS logs for warning and error messages.
  2. Set the maximum size of JVM logs to at least 256MB and maximum number of historical files to at least 4.
  3. Set the maximum size of diagnostic trace to at least 256MB and maximum number of historical files to at least 4.
  4. Change the hung thread detection threshold and interval to something smaller that is tuned for each application, and enable a limited number of thread dumps when these events occur. For example:
    1. com.ibm.websphere.threadmonitor.threshold=30
    2. com.ibm.websphere.threadmonitor.interval=1
    3. com.ibm.websphere.threadmonitor.dump.java=15
    4. com.ibm.websphere.threadmonitor.dump.java.track=3
  5. Unless com.ibm.websphere.threadmonitor.interval has been set very low, consider enabling periodic thread pool statistics logging with the diagnostic trace *=info:Runtime.ThreadMonitorHeartbeat=detail
  6. Monitor for increases in the Count column in the FFDC summary file (${SERVER}_exception.log) for each server, because only the first occurrence of an FFDC prints a warning to the logs; later occurrences only increment the count.
  7. Review relevant timeout values such as JDBC, HTTP, etc.
  8. A well-tuned WAS is a better-behaving WAS, so also review the WAS traditional tuning recipes.
  9. Review the Troubleshooting Operating System Recipes and Troubleshooting Java Recipes.
  10. Review all warnings and errors in System*.log (or with logViewer if HPEL is enabled) before and during the problem. A regular expression that matches them is " [WE] ". One common type of warning is an FFDC warning, which points to a matching file in the FFDC logs directory.
    1. If you're on Linux or use Cygwin, use the following command:
      find . -name "*System*" -print0 | xargs -0 grep " [WE] " | grep -v -e supposedly_benign_message1 -e supposedly_benign_message2
  11. Review all JVM messages in native_stderr.log before and during the problem. This may include events such as OutOfMemoryErrors. The filenames of the diagnostic artifacts produced for such events include a timestamp of the form YYYYMMDD.
  12. Review any strange messages in native_stdout.log before and during the problem.
  13. If verbose garbage collection is enabled, load the verbosegc output into the IBM Garbage Collection and Memory Visualizer Tool. The output is in native_stderr.log (IBM Java), native_stdout.log (HotSpot Java), or any verbosegc.log files (if using -Xverbosegclog or -Xloggc). Ensure that the proportion of time spent in garbage collection for a relevant period before and during the problem is less than 5-10%.
  14. Review any javacore*.txt files in the IBM Thread and Monitor Dump Analyzer tool. Review the causes of the thread dump (e.g. user-generated, OutOfMemoryError, etc.) and review threads with large stacks and any monitor contention.
  15. Review any heapdump*.phd and core*.dmp files in the Eclipse Memory Analyzer Tool.
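
Where find, xargs, and grep are not available, the log search in step 10 can be approximated in Python. The following is a minimal sketch, not a supported tool: the function name find_warnings_and_errors and the benign_markers parameter are hypothetical, and it assumes the basic (non-HPEL) log format in which the event type letter is surrounded by spaces. Unlike grep -v, the benign markers are matched as plain substrings rather than regular expressions.

```python
import re
from pathlib import Path

# A W (warning) or E (error) event type letter surrounded by spaces.
EVENT_PATTERN = re.compile(r" [WE] ")


def find_warnings_and_errors(log_root, benign_markers=()):
    """Return (path, line) pairs for warning and error lines under log_root.

    log_root       -- directory searched recursively for files matching *System*
    benign_markers -- substrings identifying messages to skip (hypothetical
                      parameter, playing the role of grep -v -e ...)
    """
    matches = []
    for path in sorted(Path(log_root).rglob("*System*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if not EVENT_PATTERN.search(line):
                    continue
                if any(marker in line for marker in benign_markers):
                    continue
                matches.append((path, line.rstrip("\n")))
    return matches
```

For example, calling find_warnings_and_errors(".", benign_markers=("supposedly_benign_message1", "supposedly_benign_message2")) from a server's logs directory returns the same kind of lines as the command in step 10.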

Troubleshooting WAS traditional on z/OS

  1. Consider increasing the value of server_region_stalled_thread_threshold_percent so that a servant is only abended when a large percentage of threads are taking a long time. Philosophies on this differ, but consider a value of 10.
  2. Set control_region_timeout_delay to give some time for work to finish before the servant is abended; for example, 5.
  3. Set control_region_timeout_dump_action to gather useful diagnostics when a servant is abended; for example, IEATDUMP.
  4. Consider reducing the control_region_$PROTOCOL_queue_timeout_percent values so that requests time out earlier if they queue for a long time; for example, 10.
  5. If necessary, apply granular timeouts to particular requests.
  6. Run listTimeoutsV85.py to review and tune timeouts.
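
To get a feel for the effect of the queue timeout percentage in step 4, the arithmetic can be sketched as follows. This assumes that the percentage is applied to the dispatch timeout configured for the same protocol, which should be confirmed against the product documentation for the release in use. The function name and the 300-second dispatch timeout are hypothetical values chosen for illustration.

```python
def queue_timeout_seconds(dispatch_timeout_seconds, queue_timeout_percent):
    """Estimate how long a request may sit in the queue before timing out.

    Assumption: the queue timeout is queue_timeout_percent percent of the
    protocol's dispatch timeout. Rounding applied by the product, if any,
    is not modeled here.
    """
    if not 0 <= queue_timeout_percent <= 100:
        raise ValueError("queue_timeout_percent must be between 0 and 100")
    return dispatch_timeout_seconds * queue_timeout_percent / 100.0


# Hypothetical 300-second dispatch timeout with the example value of 10.
print(queue_timeout_seconds(300, 10))  # 30.0
```

Under that assumption, a value of 10 means a request that has waited in the queue for a tenth of its dispatch timeout is timed out rather than dispatched.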

Additional Recipes