Logging and Tracing

The SystemOut.log is the main log file (e.g. $WAS/profiles/$PROFILE/logs/$SERVER/SystemOut.log) and contains WAS messages and System.out messages. The SystemErr.log is also an important log file that contains System.err messages (for example, from Throwable.printStackTrace). The native_stderr.log file is another important file as it includes all native stderr messages, such as JVM warnings and errors (in general, search this file for messages from the JVM). The native_stdout.log is a lesser-used file that contains native stdout messages.

Unless you are consuming JMX notifications for log events, disable them to improve performance of logging and tracing by up to 50% using the system property -Dcom.ibm.ejs.ras.disablerasnotifications=true.
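
For example, here is a minimal wsadmin -lang jython sketch that sets this property on a server (server1 is a placeholder; note that modifying genericJvmArguments overwrites the existing value, so merge with the current arguments if any are already set):

# Find the server's JVM configuration object
jvm = AdminConfig.list('JavaVirtualMachine', AdminConfig.getid('/Server:server1/')).splitlines()[0]
# Overwrites existing genericJvmArguments; append to the current value if others are set
AdminConfig.modify(jvm, [['genericJvmArguments', '-Dcom.ibm.ejs.ras.disablerasnotifications=true']])
AdminConfig.save()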

Starting in WAS 8, the IBM service log (activity.log) is disabled by default. Before WAS 8, it is recommended to disable the activity.log.

Trace Overhead

The overhead of WAS diagnostic trace is proportional to the breadth of the trace specification and the number of concurrent threads (e.g. requests) exercising the enabled trace points. The overhead is inversely proportional to the filesystem speed, available system capacity (e.g. CPU and caches), number of CPU hardware threads, and available physical memory. It is very difficult to estimate the overhead of a trace specification, even for commonly used ones, because just one of these variables may have a significant effect. For example, the broad WAS security trace, which enables all security trace points, may have very different overhead depending on which security features are configured. Therefore, a customer should run a baseline performance test that is representative of production traffic in a test environment, then run the same test with the desired trace enabled, and calculate the overhead.

In one DayTrader benchmark, the diagnostic trace ejbcontainer=fine, which is a detailed trace of EJB activity, reduced throughput by 75%. Starting with WAS 8, the optional High Performance Extensible Logging (HPEL) diagnostic trace alternative (with TextLog disabled) reduced that same benchmark overhead by 50%. With both WAS diagnostic trace systems, if log statement JMX notifications are not needed, -Dcom.ibm.ejs.ras.disablerasnotifications=true should also be used.

Here are some ideas to improve the trace experience:

  1. On WAS >=8, switch to HPEL with the TextLog disabled (for convenience, the TextLog may be enabled for only a slight penalty, as it doesn't contain traces).
  2. Tune the speed of the filesystem where the trace is written.
  3. Consider using operating system RAMdisks to dedicate RAM to a virtual filesystem and write the traces to that mount (see the example after this list).
  4. If possible, use the generic JVM argument -Dcom.ibm.ejs.ras.disablerasnotifications=true.
  5. If the problem can be reproduced with a single user, isolate a production server from production traffic, enable all the full required traces, and use some mechanism to only allow one problematic user onto that server (e.g. direct IPs, ODR routing rules, etc.).
  6. Disable the IBM service log (activity.log). On WAS 8 and later versions, it is disabled by default.
  7. If trace is being written to a networked filesystem, write to a local filesystem instead (or RAMdisk).
  8. Ask IBM development for a reduced trace string or diagnostic patch with very specific trace points.
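
For idea 3, here is a hypothetical Linux example that dedicates 2GB of RAM to a tmpfs mount (the mount point and size are illustrative). After mounting, point the diagnostic trace file name at a path under the mount, and remember that the contents are lost on reboot, so copy traces off before restarting:

$ mkdir /ramdisk
$ mount -t tmpfs -o size=2g tmpfs /ramdisk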

Controlling Trace Levels

The diagnostic trace level defaults to *=info. The level specification is a colon-delimited list of the form name=level and it may be changed dynamically at runtime: http://www-01.ibm.com/support/docview.wss?uid=swg21254706
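
For example, the following specification keeps everything at info except for a detailed web container trace (the package pattern is illustrative; consult IBM MustGather documents for component-specific trace strings):

*=info:com.ibm.ws.webcontainer*=all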

Depending on the trace specification and application activity, the volume of trace written may be very high. It is often recommended to update the trace log rotation to something like File > Maximum Size = 250MB and Maximum Number of Historical Log Files = 4 (http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/utrb_traceservice.html?cp=SSAW57_8.5.5%2F3-18-6-295&lang=en).
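
Here is a minimal wsadmin -lang jython sketch of the equivalent configuration change (server1 is a placeholder and the attribute names assume the TraceService configuration object; configuration changes take effect after a server restart):

ts = AdminConfig.list('TraceService', AdminConfig.getid('/Server:server1/'))
# Rotate the trace at 250MB and keep 4 historical files
AdminConfig.modify(ts, [['traceLog', [['rolloverSize', 250], ['maxNumberOfBackupFiles', 4]]]])
# Optionally persist a trace specification across restarts
AdminConfig.modify(ts, [['startupTraceSpecification', '*=info']])
AdminConfig.save()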

Print the current trace level using wsadmin -lang jython:

AdminControl.getAttribute(AdminControl.completeObjectName("type=TraceService,process=server1,*"), "traceSpecification")

Dynamically update trace level using wsadmin -lang jython:

AdminControl.setAttribute(AdminControl.completeObjectName("type=TraceService,process=server1,*"), "traceSpecification", "*=all")

In WAS >= 7.0.0.37, 8.0.0.10, and 8.5.5.5, a new setTraceSpecification method has been added which returns the trace specification string as it was finally applied (useful to verify typos and to see any normalizations):

AdminControl.invoke(AdminControl.completeObjectName("type=TraceService,process=server1,*"), "setTraceSpecification", "*=all:*=info")

The diagnostic trace level may also be used to control java.util.logging.Logger (JUL) thresholds. Here is an example servlet that uses a JUL logger: https://raw.githubusercontent.com/kgibm/problemdetermination/master/scripts/java/SimpleWebServlet.java

If the WAS diagnostic trace level is set to *=info:com.ibm.simpleweb.SimpleWebServlet=all, then trace.log will show matching JUL statements:

[10/6/14 12:45:15:158 PDT] 0000009f SimpleWebServ > com.ibm.simpleweb.SimpleWebServlet service ENTRY
[10/6/14 12:45:15:159 PDT] 0000009f SimpleWebServ < com.ibm.simpleweb.SimpleWebServlet service RETURN

However, you will receive the following warning when using such a specification in the administrative console. This warning may be disregarded.

The configured trace state included the following specifications that do not match any loggers currently registered in the server: ''com.ibm.simpleweb.SimpleWebServlet=all''

High Performance Extensible Logging (HPEL)

Consider using High Performance Extensible Logging (HPEL). In benchmarks, HPEL reduced the overhead of logs and trace by about 50%. In general, unless you are listening to log notifications, also consider setting -Dcom.ibm.ejs.ras.disablerasnotifications=true. If possible, disable the HPEL text log to further improve performance. The text log content is redundant and only for convenience; the same information is stored in the binary repositories. Note that HPEL does not use less disk space and in fact will use more disk space; the performance improvements occur for other reasons.

logViewer

The logViewer tool is used to read binary HPEL logs. There are various options, including a -monitor [seconds] option to dynamically tail logs (http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/rtrb_logviewer.html?lang=en):

$ logViewer -monitor
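
For example, to extract only warnings and above from particular loggers in the advanced format (these options are documented at the link above; the logger name is illustrative):

$ logViewer -minLevel WARNING -includeLoggers "com.ibm.simpleweb.*" -format advanced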

Cross Component Trace (XCT)

XCT is available starting in WAS 8.5. XCT adds a unique request identifier to log and trace entries. XCT is similar to request metrics in many ways, but it is more deeply ingrained into the flow. XCT requires that High Performance Extensible Logging (HPEL) be enabled instead of classic logging, and XCT itself must also be enabled.

There are four XCT modes: Disabled, Enabled, Enabled+XCT Records, and Enabled+XCT Records+Data Snapshots. The simple Enabled mode adds a unique request ID to every applicable log and trace record. You can dump this data using the HPEL logViewer with the "-format advanced" argument. For example, consider an application that causes a transaction timeout. Traditionally, all you would get is a WTRN0124I message with the last thread stack and a WTRN0041I message noting the timeout. After enabling the minimal tracing for getting WAS response times and running logViewer -format advanced:

[7/10/12 9:11:45:121 PDT] 00000099 I UOW= source=com.ibm.websphere.XCT org=null prod=null component=null thread=[WebContainer : 2] requestID=[AABHT9d/5yd-AAAAAAAAAAB] BEGIN AABHT9d/5yd-AAAAAAAAAAB 00000000000-cccccccccc2 HTTPCF(InboundRequest /TransactionTest/Test RemoteAddress(0:0:0:0:0:0:0:1) RequestContext(2072483128))
[7/10/12 9:13:45:125 PDT] 0000007e I UOW= source=com.ibm.ws.tx.jta.TimeoutManager org=IBM prod=WebSphere component=Application Server thread=[Non-deferrable Alarm : 1] WTRN0124I: When the timeout occurred the thread with which the transaction is, or was most recently, associated was Thread[WebContainer : 2,5,main]. The stack trace of this thread when the timeout occurred was: ...

First Failure Data Capture (FFDC)

Since 7.0.0.19, the algorithm used to decide when the FFDC summary file is updated after an FFDC exception is thrown is described in APAR PM39875 (http://www-01.ibm.com/support/docview.wss?uid=swg1PM39875):

"...for the FFDC summary file to be updated for a given incident...
1.  When there have been more than 10 incidents and at least a minute has passed after the last time the summary table was updated.
2.  It has been more than 5 minutes since the last time the summary table was updated."

When this happens, the same file name is used - ${server}_exception.log - but the file is simply truncated and rewritten.

The _exception.log file is only rotated on JVM startup: http://www-01.ibm.com/support/docview.wss?uid=swg1PK86345

The FFDC1003I message is only printed the first time each "type" of an FFDC exception is thrown. After that, only the summary _exception.log file is updated. This can be configured differently but it would create a lot more FFDC log files.

Example _exception.log:

 Index  Count  Time of first Occurrence    Time of last Occurrence     Exception SourceId ProbeId
------+------+---------------------------+---------------------------+---------------------------
     0      4   10/20/14 10:54:32:479 PDT   10/20/14 11:05:32:584 PDT java.io.IOException com.ibm.ws.management.discovery.DiscoveryService.sendQuery 189 ...txt
     1      4   10/20/14 11:23:16:003 PDT   10/20/14 11:23:27:173 PDT org.omg.CORBA.INV_OBJREF com.ibm.ws.giop.message.GIOPRequestMessage.populate 192 ...txt

Transaction Log

The Transaction log directory can be set in the administrative console by navigating to Servers => Application Servers => server_name => Container Services => Transaction Service.

When an application that runs on the application server accesses more than one resource, the application server stores transaction information in the product directory so that it can coordinate and manage the distributed transaction correctly. When there is a higher transaction load, storing persistent information in this way can slow the performance of the application server because it depends on the operating system and the underlying storage systems. To achieve better performance, designate a new directory for the log files on a separate, physically larger, storage system.

https://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/udat_contranserv.html
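
Here is a minimal wsadmin -lang jython sketch of the same change (server1 and the directory are placeholders; the console panel above sets the TransactionService transactionLogDirectory attribute):

ts = AdminConfig.list('TransactionService', AdminConfig.getid('/Server:server1/'))
AdminConfig.modify(ts, [['transactionLogDirectory', '/tranlogs/server1']])
AdminConfig.save()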

The transaction log is most commonly stored either in a shared filesystem or in a database. In general, internal benchmarks show that using a database is approximately 10% slower, but the time spent processing the transaction log is usually a small proportion of the total transaction time, so this difference is often imperceptible.

Database Transaction Log

In terms of multithreading, the log is written to on multiple threads, though the threads are serialized so that only one writes to the database at a time. The threads involved are simply the threads on which the application executes transactions (so threads across multiple thread pools). To make this scale, the basic idea is that one thread is capable of writing data on behalf of several other threads (i.e. it batches SQL statements); the other threads wait on an intrinsic lock, and once they acquire it, they can return without writing to the database because their changes have already been persisted.

-Dcom.ibm.ws.recoverylog.custom.jdbc.ThrottleThreshold is an integer value that defaults to 6 and is related to the batching of updates. It is a measure of how many threads are requesting that their data be persisted to the logs. When that number is reached, the log service prioritizes the forcing threads so that they can return with a no-op once the thread currently writing their data has dropped the intrinsic lock (to a rough approximation). In theory, in a very busy system with very high levels of concurrency, increasing the value could help throughput (fewer, bigger batches); conversely, if concurrency is quite low, a smaller value may batch the SQL more efficiently.
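
The following plain Python sketch illustrates the general group-commit pattern described above. It is an illustration of the idea only, not WAS's actual implementation, and all names are invented:

import threading

class GroupCommitLog:
    # One thread flushes batched entries on behalf of many; threads whose
    # entries were covered by another thread's batch return with a no-op.
    def __init__(self, flush):
        self.flush = flush        # e.g. executes one batched SQL statement
        self.cond = threading.Condition()
        self.pending = []         # entries appended but not yet durable
        self.flushing = 0         # nonzero while some thread is writing
        self.durable_seq = 0      # highest sequence number already persisted
        self.next_seq = 1

    def force(self, entry):
        self.cond.acquire()
        try:
            seq = self.next_seq
            self.next_seq = seq + 1
            self.pending.append(entry)
            # Wait while another thread is writing and our entry is not yet durable
            while self.flushing and self.durable_seq < seq:
                self.cond.wait()
            if self.durable_seq >= seq:
                return            # another thread's batch covered us: no-op
            batch = self.pending  # take the whole pending batch
            self.pending = []
            self.flushing = 1
        finally:
            self.cond.release()
        self.flush(batch)         # one write on behalf of every batched thread
        self.cond.acquire()
        try:
            self.durable_seq = self.durable_seq + len(batch)
            self.flushing = 0
            self.cond.notifyAll()
        finally:
            self.cond.release()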

Networked Filesystem (NFS)

http://www-01.ibm.com/support/docview.wss?uid=swg21456699

CPU Starvation Detected Warning (HMGR0152W)

[9/23/14 14:17:05:923 CDT] 0000008f CoordinatorCo W HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 7 seconds.

The HMGR0152W starvation detection warning works by looping: noting time X, calling java/lang/Thread.sleep(Y=IBM_CS_THREAD_SCHED_DETECT_PERIOD, default 30 seconds), noting time Z upon return, and then reporting (Z - X) - Y as the scheduling delay if it is over the threshold IBM_CS_THREAD_SCHED_DETECT_ERROR (default 5 seconds).

For example, by default, a report of a 7 second scheduling delay means that a thread called Thread.sleep(30), but returned 37 seconds later, 2 seconds more than the threshold.
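
A minimal plain Python sketch of the detection loop described above (illustration only; the real logic runs inside the WAS high availability manager):

import time

PERIOD = 30   # IBM_CS_THREAD_SCHED_DETECT_PERIOD (seconds)
ERROR = 5     # IBM_CS_THREAD_SCHED_DETECT_ERROR (seconds)

while 1:
    x = time.time()                    # X
    time.sleep(PERIOD)                 # Y
    delay = time.time() - x - PERIOD   # (Z - X) - Y
    if delay > ERROR:
        print("HMGR0152W: CPU Starvation detected. Current thread scheduling delay is %d seconds." % delay)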

If this message appears frequently, or if this message occurs at about the same time as significant performance slowdowns or timeouts, you may want to investigate further. This message will disappear when the thread scheduling delay has been corrected. Perform resource analysis to determine the proper course of action. Common items to review:

  • The most common cause of this is a long, stop-the-world garbage collection cycle, because Java threads, including the timer that prints this warning, cannot be dispatched during this cycle. Review verbose:gc or a monitoring tool for garbage collections immediately preceding this warning that take longer than IBM_CS_THREAD_SCHED_DETECT_ERROR.
  • Review operating system statistics immediately preceding the warning such as high processor utilization, processor run queues greater than available processors, low memory and paging activity, virtualization steal times, etc. Operating system statistics are often gathered at intervals such as 60 or 300 seconds. If this interval is greater than IBM_CS_THREAD_SCHED_DETECT_ERROR, then the relevant symptoms may be averaged out of the operating system numbers. In this case, reduce the operating system statistics gathering interval to less than or equal to the IBM_CS_THREAD_SCHED_DETECT_ERROR.

http://www-01.ibm.com/support/docview.wss?uid=swg21236327