IBM Java

IBM Java product documentation: http://www-01.ibm.com/support/knowledgecenter/SSYKE2/welcome_javasdk_family.html?lang=en

IBM Java Recipe

In addition to the overall recipe in the Java chapter,

  1. In most cases, the gencon garbage collection policy works best, with the key tuning being the maximum heap size (-Xmx) and maximum nursery size (-Xmn).
  2. If physical memory allows, increase the size of the shared class cache (-Xshareclasses).
  3. Test with large pages (-Xlp). These are enabled by default in recent versions, but may require operating system configuration for full enablement.
  4. Test with -Xaggressive.
  5. Test with -XtlhPrefetch.
  6. Consider enabling IBM Java Health Center (-Xhealthcenter) by default so that you can attach to particular processes if they start to have trouble.
  7. If using IBM Java >= 7 SR3, the IBM JCE security provider and recent Intel, AMD, or POWER >= 8 CPUs, then AESNI hardware acceleration for AES encryption and decryption can be exploited with -Dcom.ibm.crypto.provider.doAESInHardware=true. This can reduce TLS overhead by up to 35%.
  8. If the static IP address of the node on which Java is running is unlikely to change, use -Dcom.ibm.cacheLocalHost=true to reduce localhost lookup time.
  9. Take a javacore thread dump and review the Java arguments (UserArgs) and Environment Variables sections for uncommon or debug options.

General

By default, Java will cache non-localhost lookups; however, localhost lookups are not cached in case localhost changes. In some operating systems or configurations, localhost lookups add significant overhead. If the static IP address of the node on which Java is running is unlikely to change, use -Dcom.ibm.cacheLocalHost=true to reduce localhost lookup time (http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.multiplatform.doc/ae/tprf_tunejvm_v61.html).
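
As a rough check (not an official benchmark), a minimal Java sketch such as the following can be used to observe localhost lookup cost with and without the property; the class name and iteration count are arbitrary:

import java.net.InetAddress;

public class LocalhostLookupTimer {
    public static void main(String[] args) throws Exception {
        // Each iteration performs a localhost lookup; without
        // -Dcom.ibm.cacheLocalHost=true, the result is not cached.
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            InetAddress localhost = InetAddress.getLocalHost();
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println(localhost + " resolved in " + micros + " us");
        }
    }
}

Run it once as-is and once with -Dcom.ibm.cacheLocalHost=true and compare the per-iteration times.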

Garbage Collection

-Xgcpolicy:gencon is the default garbage collection policy starting in Java 6.26 (WAS 8) - it is a copy collector in the nursery area and a mark-sweep-compact collector in the tenured area. Previously, the default policy was -Xgcpolicy:optthruput.

In garbage collection, generally the term parallel means running on multiple threads, and concurrent means running at the same time as the application (i.e. not stop-the-world). Thread local heaps (TLH) are used by each thread for very small objects to reduce cross thread contention (global heap lock).

Comparing Policies

| -Xgcpolicy: | gencon | optthruput | optavgpause | balanced | metronome |
| Generational - most GC pauses are short (nursery/scavenge collections) | Yes | No | No | Yes | No |
| Compaction | Sometimes | Sometimes | Sometimes | Partial, full in overload conditions | Never |
| Large Heaps (>10GB) | Yes, depending on heap utilization | No | No | Yes | Yes |
| Soft Real Time - all GC pauses are very short (unless CPU/heap exhaustion occurs) | No | No | No | No | Yes |
| Hard Real Time - requires hard real time OS, all GC pauses are very short (unless CPU/heap exhaustion occurs) | No | No | No | No | Yes |
| Benefits | Tries to balance application throughput with low pause times | Tries to optimize application throughput; CPU efficient | Tries to flatten out average pause times | Tries to deal with large heaps by breaking memory into many regions; may help with NUMA | Tries to have consistently low pause times |
| Potential Consequences | Long global GC pauses with large heaps; occasional long compactions; benefits negated by frequent large object allocations if they are long-lived | Longer average pause times | Reduced throughput; increased CPU; poorly handles large heap usage variations | Increased CPU; reduced throughput | Increased CPU; increased heap usage |
| Recommended for | General use (e.g. web applications, messaging systems) | Batch applications | Consistent pause time requirement | Large heaps (>10GB) | Very low, consistent GC latency |

-Xgcpolicy:gencon

The idea [of a generational collector] is to divide the heap up into different areas, and collect these areas at different rates. New objects are allocated out of one such area, called the nursery (or newspace). Since most objects in this area will become garbage quickly, collecting it offers the best chance to recover memory. Once an object has survived for a while, it is moved into a different area, called tenure (or oldspace). These objects are less likely to become garbage, so the collector examines them much less frequently...

IBM's gencon policy (-Xgcpolicy:gencon) offers a generational GC ("gen-") on top of [-Xgcpolicy:optavgpause]. The tenure space is collected as described above, while the nursery space uses a copying collector. This algorithm works by further subdividing the nursery area into allocate and survivor spaces... New objects are placed in allocate space until its free space has been exhausted. The application is then halted, and any live objects in allocate are copied into survivor. The two spaces then swap roles; that is, survivor becomes allocate, and the application is resumed. If an object has survived for a number of these copies, it is moved into the tenure area instead.

http://www.ibm.com/developerworks/websphere/techjournal/1106_bailey/1106_bailey.html

The default maximum nursery size (-Xmn) in Java 5 is 64MB. The default in Java 6 is 25% of -Xmx. The larger the nursery, the greater the time between collects and the fewer objects that are likely to survive; however, the longer a copy can potentially take. In general the advice is to have as large a nursery as you can afford to avoid full collects - but the full collects shouldn't be any worse than the optavgpause case. The use of concurrent collection is still in place, and the presence of the nursery should mean that there's less likelihood of compaction being required in the tenured space.

For -Xgcpolicy:gencon, consider tuning the nursery size (-Xmn) to a larger proportion of -Xmx (the default is 25%)... For applications with more short-lived objects, a performance improvement can be seen by increasing the nursery size (http://www.ibm.com/developerworks/websphere/techjournal/0909_blythe/0909_blythe.html#sec3a)
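
For example, the following illustrative (not prescriptive) settings raise the nursery to 50% of a 2GB maximum heap:

-Xgcpolicy:gencon -Xmx2g -Xmn1g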

The scenarios where gencon generally falls down is where there are large object allocations that get tenured directly, or where the nursery is too small and objects are copied several times before they "die." In an ideal world, no object is copied more than once - after the first copy it either dies or is tenured because it is long lived.

Tenure age: "Tenure age is a measure of the object age at which it should be promoted to the tenure area. This age is dynamically adjusted by the JVM and reaches a maximum value of 14. An object’s age is incremented on each scavenge. A tenure age of x means that an object is promoted to the tenure area after it has survived x flips between survivor and allocate space. The threshold is adaptive and adjusts the tenure age based on the percentage of space used in the new area." (http://pic.dhe.ibm.com/infocenter/java7sdk/v7r0/topic/com.ibm.java.lnx.70.doc/diag/understanding/mm_gc_generational_tenure.html)

A high tenure age means the JVM is aggressive about leaving objects in the nursery, trying to let them die there, which is generally healthy, since the JVM observes that it is able to collect most garbage in a scavenge.

As the nursery size increases, the maximum copy count and the adaptive tenure age will trend to 1. Once the application self-optimizes to a tenure age of 1 at runtime, it may make sense to set the tenure age to 1 explicitly to make startup faster. That sets the tenure age where it will end up anyway, and ensures we don't do a lot of copying of "infrastructure" objects allocated at startup. Fix the tenure age, e.g.: -Xgc:scvNoAdaptiveTenure,scvTenureAge=1

A healthy used tenured heap (after collection) will show a sawtooth pattern where garbage accumulates in tenured continuously until a full collection. If the nursery size is too large (or the overall heap size is too small), then this plot will lack the sawtooth and you will see a low tenure age. This will cause the JVM to constantly run full collections and may increase the rate of compactions. A rough guide is that the size of the sawtooth drop should be about 25% of -Xmx. The tenured area may be allowed to grow and shrink by specifying -Xmos and -Xmox.

You want the nursery to be large enough that data is at most copied once. Once that occurs, the duration of a nursery collect is largely fixed at the copy time of the data, so after that, increasing the nursery size increases the time between nursery collects - and therefore drops the GC overhead, and most likely the frequency of global collections as well.

If you've got large amounts of available RAM and process address space, the extreme tuning solution is a very large nursery with a tenure age of 1. This works on the theory that transactional data will be copied at most once, and anything surviving two collects should be put into the old generation as it is non-transactional (i.e., allocated at startup). You can fix the tenure age via a command line option.

There's no easy (low overhead) way of finding out what the average flip count is, but the following will give you a histogram on each scavenge collect: http://publib.boulder.ibm.com/infocenter/java7sdk/v7r0/topic/com.ibm.java.lnx.70.doc/diag/tools/gcpd_tracing_scavenger.html

The maximum nursery size should be greater than or equal to the maximum concurrent transaction data for all threads. The average number of times that non-tenured objects are copied should be ~= 1.

https://w3-03.ibm.com/tools/cm/iram/artifact/{D9D29FCF-84FA-B2C4-6922-1EEE32C76B3C}/HitchhikersGuide/WAS.hitchhikers.guide.html#2.17.3.3.gencon|outline

To force full GCs after each N scavenges, use -Xgc:maxScavengeBeforeGlobal=N

If you would like to tail the verbosegc log, it is generally recommended to look at free memory after global collections only because scavenges do not touch trash in the tenured region. On Linux, for example:

$ tail -f native_stderr.log | grep -A 1 "gc-end.*global"
<gc-end id="1748" type="global" contextid="1741" durationms="670.959" timestamp="2014-07-02T16:28:22.476">
  <mem-info id="1749" free="156456360" total="311361536" percent="50">

Tilt Ratio

The tilt ratio is (size of new or allocate space)/(size of survivor space). The tilt ratio starts at 50% and is dynamically updated in an attempt to maximize the time between scavenges: http://www-01.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.lnx.71.doc/diag/understanding/mm_gc_generational_tilt.html?lang=en

Concurrent Marking

If generational garbage collection is desired, but the overhead of concurrent marking (both the marking thread itself and the extra book-keeping required when allocating and manipulating objects) is not, then concurrent marking may be disabled with the -Xconcurrentlevel0 option. This option is appropriate for workloads that benefit from gencon's optimizations for object allocation and lifetimes but also require maximum throughput and minimal GC overhead while application threads are running. In general, for both the gencon and optavgpause GC policies, concurrent marking can be tuned with the -Xconcurrentlevel<number> option, which specifies the ratio between the amount of heap allocated and the amount of heap marked. The default value is 8. The number of low priority mark threads can be set with the -Xconcurrentbackground<number> option. By default, 1 thread is used for concurrent marking.

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Java+Performance+on+POWER7
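
For example, the following illustrative combination keeps the gencon policy but disables concurrent marking:

-Xgcpolicy:gencon -Xconcurrentlevel0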

-Xgcpolicy:optthruput

"The simplest possible garbage collection technique is to continue allocating until free memory has been exhausted, then stop the application and process the entire heap. While this results in a very efficient garbage collector, it means that the user program must be able to tolerate the pauses introduced by the collector. Workloads that are only concerned about overall throughput might benefit from this strategy." (http://www.ibm.com/developerworks/websphere/techjournal/1106_bailey/1106_bailey.html)

-Xgcpolicy:optavgpause

"For applications that are willing to trade some overall throughput for shorter pauses... -Xgcpolicy:optavgpause attempts to do as much GC work as possible before stopping the application, leading to shorter pauses... The same mark-sweep-compact collector is used, but much of the mark and sweep phases can be done as the application runs. Based on the program's allocation rate, the system attempts to predict when the next garbage collection will be required. When this threshold approaches, a concurrent GC begins. As application threads allocate objects, they will occasionally be asked to do a small amount of GC work before their allocation is fulfilled. The more allocations a thread does, the more it will be asked to help out. Meanwhile, one or more background GC threads will use idle cycles to get additional work done. Once all the concurrent work is done, or if free memory is exhausted ahead of schedule, the application is halted and the collection is completed. This pause is generally short, unless a compaction is required. Because compaction requires moving and updating live objects, it cannot be done concurrently." (http://www.ibm.com/developerworks/websphere/techjournal/1106_bailey/1106_bailey.html)

Optimized for applications with responsiveness criteria. It reduces and makes more consistent the time spent inside the stop-the-world operations by carrying out some of the stop-the-world activity while the application is running. This has an additional overhead. Optavgpause is suited for consistent allocation patterns or when very large objects adversely affect gencon.

-Xgcpolicy:balanced

The balanced GC policy (available starting with Java 7) is suitable for arbitrarily large heaps, and includes various techniques to prevent worst-case pause time from growing linearly with total heap size. Balanced is a generational policy, so as with gencon most collections will be of the nursery space, and thus will be quite brief. An incremental compaction function performs a subset of compaction work during each GC pause, to avoid the very large pause time associated with compacting the entire heap in a single operation. Tenured space collections are performed on sub-areas of the tenured heap, and objects are grouped by lifespan within the heap, to make tenured collections more efficient and brief.

The primary goal of the balanced collector is to amortize the cost of global garbage collection across many GC pauses, reducing the effect of whole heap collection times. At the same time, each pause should attempt to perform a self contained collection, returning free memory back to the application for immediate reuse.

To achieve this, the balanced collector uses a dynamic approach to select heap areas to collect in order to maximize the return-on-investment of time and effort. This is similar to the gencon policy approach, but is more flexible as it considers all parts of the heap for collection during each pause, rather than a statically defined new space.

http://www.ibm.com/developerworks/websphere/techjournal/1108_sciampacone/1108_sciampacone.html

The balanced policy can better utilize NUMA node groupings.

-Xgcpolicy:metronome

The metronome GC policy (available in the base VM starting with Java 7) invokes the WebSphere RealTime (WRT) collector. WRT performs GC in small increments using a time-bounded algorithm to ensure that any individual GC pause is very brief. This behavior is suitable for applications needing consistent low latency response times, e.g. financial transaction systems. The trade-off for getting very low GC latency is some increase in CPU and heap consumption. Unlike gencon, optthruput, and optavgpause collectors, GC pause time with WRT does not increase linearly with heap size, so WRT is suitable for use with very large heaps.

WRT Java info: http://www.ibm.com/developerworks/data/ue/kp-test/template-dw-knowledge-path-6.0.html

http://www.ibm.com/developerworks/library/j-rtj4/

Should you set minimum heap equal to the maximum heap?

For generational policies, the guidance is that you should fix the nursery size (-Xmns == -Xmnx) and allow the tenured heap to vary (-Xmos != -Xmox). For non-generational policies you only have a tenured heap, so -Xms != -Xmx applies.

The reason is that the ability to expand the heap adds resilience into the system to avoid OutOfMemoryErrors. If you're worried about the potential cost of expansion/shrinkage that this introduces by causing compactions, that can be mitigated by adjusting -Xmaxf and -Xminf to make expand/shrink a rare event.
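
For example, an illustrative (not prescriptive) gencon configuration following this guidance fixes the nursery at 1GB and lets the tenured area vary between 1GB and 3GB:

-Xgcpolicy:gencon -Xmns1g -Xmnx1g -Xmos1g -Xmox3g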

Long Mark Times

Long mark times can occur for the following reasons:

  1. Increase in the number of Objects on the Java heap
  2. Increase in the Java heap size
  3. CPU contention
  4. System paging

An increase in the number of objects on the Java heap or an increase in the Java heap size is typical. They are the two major factors contributing to GC duration; more Java objects take longer to mark, and more Java heap space means more time is required to traverse the larger memory space. CPU contention and system paging are caused by system resource contention, which you can determine if the paging and CPU information is available.

Old Java Diagnostic Guide

Long Sweep Times

Long sweep times can occur for the following reasons:

  1. Increase in Java heap size
  2. CPU contention
  3. System paging

An increase in Java heap size is typical because the major factor contributing to the duration of the sweep phase is the size of the Java heap that must be traversed. If sweep times increase significantly, the most likely cause is system resource contention, which you can determine if the paging and CPU information is available.

Old Java Diagnostic Guide

Compaction

When compactions occur, they use most of the garbage collection cycle. The Garbage Collector avoids compaction where possible. However, when compactions must occur, the raw verbose:gc output contains a message explaining why the compaction occurred.

The most common cause of avoidable long GC cycles is Java heap expansion and shrinkage. When the Java heap shrinks in size, a compaction is probably required to allow the shrinkage to occur. When the Java heap expands, a compaction might occur before the expansion, particularly when the Java heap occupancy is growing rapidly.

A correctly sized Java heap aims to keep the Java heap occupancy between 40% and 70% of the maximum heap size, which are the trigger occupancy levels for heap expansion and shrinkage. If the range of occupancy is too great to stay within the recommended range, it is more important to keep the occupancy under 70% of the maximum than it is to stay over 40%.

You can remove or reduce the need for shrinkage by increasing the -Xmaxf option from its default value of 0.6, which controls the 40% threshold for shrinkage. By increasing the -Xmaxf value, the lower threshold is reduced below the normal range of occupancy, and shrinkage can be avoided during the normal operation of the application, while still leaving the possibility of shrinkage if the Java heap occupancy drops dramatically.

The -Xmaxf parameter specifies the amount of the Java heap that must be free before shrinkage occurs, so a setting of -Xmaxf0.7 will cause shrinkage when the occupancy is below 30% (70% is free), and -Xmaxf0.9 causes shrinkage when the occupancy is below 10% (90% is free).

Explicit requests for garbage collection to run using calls to the System.gc() or Runtime.gc() methods cause a compaction to occur during the garbage collection cycle if compaction did not occur in the previous garbage collection cycle. Explicit garbage collection calls cause garbage collection to run more frequently than necessary, and are likely to cause a compaction to occur. Remove explicit garbage collection calls where possible.

Old Java Diagnostic Guide

To disable heap shrinkage: -Xmaxf1.0

Verbose garbage collection (-verbose:gc)

See the verbosegc section in the general Java chapter for background.

One benchmark by the IBM Java team running on WAS 8.5.5, IBM Java 7, and Power hardware shows the overhead of verbosegc with local disks is 0.13% (http://www.slideshare.net/cnbailey/websphere-technical-university-introduction-to-java-diagnostic-tools).

By default, output will be sent to stderr (in WAS, native_stderr.log): http://www-01.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.lnx.71.doc/diag/understanding/mm_heapsizing_verbosegc.html?lang=en

Specifying -Xverbosegclog implicitly enables -verbose:gc and allows you to write verbosegc to a particularly named file instead, along with the option of rotating said files after certain numbers of GC events (this works on all platforms including z/OS): http://www-01.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.lnx.71.doc/diag/appendixes/cmdline/xverbosegclog.html?lang=en. If you are concerned about performance, you can use -Xverbosegclog to write the data to a RAMdisk (see your operating system's chapter). If the JVM is unable to create the file (e.g. permissions, disk space, etc.), verbosegc will fall back to stderr.

When using -Xverbosegclog, always specify unique dump tokens such as the time or PID so that previous files are not overwritten when the JVM restarts. For example:

-Xverbosegclog:verbosegc.%Y%m%d.%H%M%S.%pid.log

If you specify X,Y after the log name, output is redirected to X files, each containing Y GC cycles. You can only roll over by the number of GC cycles and not by raw file size; however, garbage collection events are roughly equivalent in size, so you should be able to approximate. As a rough starting point, one GC cycle outputs about 2KB. Therefore, if, say, you wanted to roll over at about 100MB, you would do:

A=Desired size in MB
B=Average GC cycle size output in bytes
Y=(A*1024*1024)/B

So, with A=100 and B=2048, Y would be 51,200, and then you would use:

-Xverbosegclog:verbosegc.%Y%m%d.%H%M%S.%pid.log,5,51200

That would create 5 historical files with roughly 100MB each. If you wanted to better approximate Y, then you need to better understand B. For that, you could do a historical analysis of verbosegc and calculate the median and mean sizes, in bytes, of each GC event, and fiddle around with B until you get close to A per file.
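
If you would rather script that estimate, the following minimal Java sketch (not an official tool) approximates B by dividing an existing log's size by its number of <gc-end> stanzas, assuming one stanza per GC cycle; the class name and the 100MB target are arbitrary:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class VerbosegcRollover {
    public static void main(String[] args) throws Exception {
        Path log = Paths.get(args[0]);  // an existing verbosegc log
        long a = 100;                   // A: desired file size in MB
        long totalBytes = Files.size(log);
        long cycles;
        try (Stream<String> lines = Files.lines(log)) {
            cycles = lines.filter(l -> l.contains("<gc-end")).count();
        }
        long b = totalBytes / Math.max(cycles, 1); // B: average bytes per GC cycle
        long y = (a * 1024L * 1024L) / b;          // Y: GC cycles per file
        System.out.println("B=" + b + " bytes/cycle; Y=" + y + " cycles/file");
    }
}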

Time spent unloading classes

If you find long total GC pause times and the break down includes long times in "time spent unloading classes" in GCMV, then there are a few options:

  1. Investigate which classes and classloaders are being unloaded and review if creating these can be reduced or avoided (for example, see the discussion on reflection inflation):

    -verbose:class -Xgc:verboseExtensions
  2. Consider using -Xgc:classUnloadingKickoffThreshold=N (http://www.ibm.com/developerworks/websphere/techjournal/1106_bailey/1106_bailey.html#migrating)
  3. Consider using -Xgc:maxScavengeBeforeGlobal=N
  4. Consider changing to a different -Xgcpolicy
  5. Ensure IBM Java APAR IV49664 is applied: http://www-01.ibm.com/support/docview.wss?uid=swg1IV49664
  6. If unloading times increase as the number of class(loader)es increases, test with -Xjit:disableCHOpts,disableCHTable or, more aggressively (if there are no Java agents), -Xjit:disableCHOpts,disableCHTable,noRecompile

Example verbosegc tag showing time spent unloading classes:

<classunloading classloaders="325178" classes="905" timevmquiescems="0.000" timetakenms="16990.786" />

Exclusive Access Time

Before a garbage collection, the GC requests "exclusive access" to the JVM. Normally, this should take almost no time. This time is not included in the "Total Pause Time" statistic in GCMV (instead there is an Exclusive Access Time statistic). If this is taking a long time, then most likely some other JVM thread is holding exclusive access for that time. You can determine how long these are by looking for:

<exclusive-start id="1628" timestamp="2014-03-31T16:13:51.448" intervalms="16331.866">
  <response-info timems="1499.726" idlems="999.647" threads="2" lastid="000000000FC2C600" lastname="Thread-123" />
</exclusive-start>

The only real way to investigate these is to take a core dump by using the -Xdump slow event and setting the threshold below the average timems value; for example: -Xdump:system:events=slow,filter=1000ms,range=1..2

Load the dump into IDDE, run "!info lock" and search for this section:

id: 0x2aaab4000ed0 name: VM exclusive access
    owner thread id: 27707 name: Thread-105
    waiting thread id: 26717 name: defaultJavaTimer-thread-1

The current thread should match the owner thread, so then just run "!info thread" and you'll see the stack (top frame should be in a native method).

Excessive Garbage Collection

By default, if the JVM detects "excessive time" spent in garbage collection (default 95%), an OutOfMemoryError is thrown: http://pic.dhe.ibm.com/infocenter/java7sdk/v7r0/topic/com.ibm.java.lnx.71.doc/diag/appendixes/cmdline/xenableexcessivegc.html

The 95% threshold can be changed with -Xgc:excessiveGCratio=90 where 90 is an example different percentage.

Explicit Garbage Collection (System.gc, Runtime.gc)

In addition to the cases covered in the general Java chapter, the IBM JVM may explicitly call System.gc in certain situations. For example, if the JVM is unable to get native memory for class(loader) metadata, it will call System.gc in case this indirectly cleans up native resources. In fact, in this case, the JVM calls an internal method so a full GC will occur even if -Xdisableexplicitgc is set. If the JVM runs out of native memory but continues to run and continues to try to allocate native class(loader) metadata, this can cause a full GC storm.

Garbage Collection Threads

The maximum number of logical CPU cores is read and fixed at JVM startup by querying the operating system. If the number of logical CPUs decreases at runtime, and -Xgcthreads is not specified, then the JVM may decide to use fewer CPUs during a garbage collection based on how many are available. If the number of logical CPU cores increases beyond the number available at JVM startup, the JVM will not use these additional cores for garbage collection.
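
For example, to cap garbage collection at four parallel threads on a machine running several JVMs (the value is illustrative):

-Xgcthreads4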

Garbage Collection Notes

A scavenge which is converted into a global collection is called a percolate.

An "aggressive" GC is declared if a previous GC was unable to reclaim sufficient resources. It means that the GC will try as much as it can, including compaction, class unloading, softref clearing, etc. An aggressive collect may also be triggered if two explicit GCs happen back-to-back.

Just in Time (JIT) Compiler

The JIT compiler samples Java method execution at runtime and compiles the byte code of more frequently invoked methods into optimized native code.

The -XsamplingExpirationTime${SECONDS} option allows you to disable this background process a certain number of seconds after the JVM starts when you think that the most important JITting has been completed: http://pic.dhe.ibm.com/infocenter/java7sdk/v7r0/topic/com.ibm.java.lnx.71.doc/diag/tools/jitpd_idle.html. A related option helps control sampling frequency when the JVM is idle: -Xjit:samplingFrequencyInIdleMode=${ms}
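
For example, the following illustrative values disable the sampling thread 10 minutes (600 seconds) after startup and reduce idle sampling to once every 5 seconds:

-XsamplingExpirationTime600 -Xjit:samplingFrequencyInIdleMode=5000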

The JIT has two caches: code and data. The code cache holds the actual compiled native code for any methods that are JITted, and the data cache holds metadata for said code (which is relatively much smaller than the code). If the application uses a lot of classes or classloaders, the JIT code cache may fill up and subsequent JITting is reduced. Check the size of the JIT code cache by taking a javacore and searching for the code cache segments. Each row is 8MB. You can increase the code cache segments and total code cache size with -Xcodecache and -Xjit:codetotal (http://www-01.ibm.com/support/docview.wss?uid=swg1IV26401). You can also use the -Xjit:verbose option and review whether there are any failures due to lack of code cache. There is also -Xjit:dataTotal=XKB. You may also make more room in the JIT code and data caches by excluding some methods from being JITted with -Xjit:exclude.

Starting with IBM Java 6.26, by default, there may be up to 4 JIT compilation threads. These can be quite intensive, and if there are many Java processes on a machine and they happen to run at the same time, the processors may become saturated. In the same way that -Xgcthreads must be considered when running multiple JVMs on a machine, -XcompilationThreads can be reduced: http://pic.dhe.ibm.com/infocenter/java7sdk/v7r0/topic/com.ibm.java.lnx.71.doc/diag/appendixes/cmdline/xcompilationthreads.html
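
For example, to reduce JIT compilation to two threads (the value is illustrative):

-XcompilationThreads2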

There is an option to increase the size of the JIT profiling buffer (default 1024): -Xjit:iprofilerBufferSize=${bytes}

The option -Xjit:noServer may be used to reduce the level of inlining and therefore reduce JIT CPU utilization, although the program may run more slowly. The option -Xjit:virtualizationHighDensity may be used to be even more aggressive in reducing JIT CPU utilization (it is a superset of -Xjit:noServer), although the program may run even more slowly.

Another way to reduce the CPU usage of JIT compilation is to increase the size of the shared class cache (-Xscmx) and consequently the likelihood that ahead-of-time (AOT) compiled methods can be reused. In general, the AOT cache can be as large as disk space and physical memory allow.

By default, the JIT will compile methods after a certain number of invocations. This can be changed with -Xjit:count (use 0 to compile immediately, although this is generally not recommended): http://www-01.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.zos.70.doc/diag/appendixes/cmdline/xjit.html?lang=en

You can create a log of JIT events, which should be inherently bounded in size to just a few tens of MB, using -Xjit:verbose={compile*},vlog=jitvlog.txt. Example output:

+ (AOT load) sun/io/ByteToCharUTF8.reset()V @ 00002AAAB4D9B5A8-00002AAAB4D9B6C4 compThread 0
#CR  000000000050C100   Compile request rqk=8 j9method=000000000053BF38 java/util/Hashtable.rehash()V
#CR  000000000050C100   Compile request rqk=8 j9method=0000000000520268 java/lang/String.hashCode()I
 (warm) Compiling java/util/Hashtable.rehash()V  t=10 rqk=8

Shared Classes (-Xshareclasses)

Sharing classes in a cache can improve startup time and reduce memory footprint. Processes, such as application servers, node agents, and deployment managers, can use the share classes option... The default is enabled. (http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.multiplatform.doc/ae/tprf_tunejvm_v61.html)

In general, the shared class cache is a good thing to enable and primarily has a positive impact on JVM startup time and memory footprint.

The share classes option of the IBM [JVM] lets you share classes in a cache. Sharing classes in a cache can improve startup time and reduce memory footprint. Processes, such as application servers, node agents, and deployment managers, can use the share classes option.

This option is enabled by default in [WAS]. To clear the cache, call the app_server_root/bin/clearClassCache utility [when all Java processes are stopped].

If you need to disable the share classes option for a process, specify the generic JVM argument -Xshareclasses:none for that process.

http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.multiplatform.doc/ae/tprf_tunejvm_v61.html

List all shared class caches: java -Xshareclasses:listAllCaches

Delete a shared class cache: java -Xshareclasses:name=${NAME},destroy
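
For example, an illustrative configuration naming the cache and raising its size to 200MB:

-Xshareclasses:name=myappcache -Xscmx200m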

The AOT compiler can be more aggressive with -Xaot:forceAOT.

-Xquickstart

"The IBM® JIT compiler is tuned for long-running applications typically used on a server. You can use the -Xquickstart command-line option to improve the performance of short-running applications, especially for applications in which processing is not concentrated into a few methods.

-Xquickstart causes the JIT compiler to use a lower optimization level by default and to compile fewer methods. Performing fewer compilations more quickly can improve application startup time. When the AOT compiler is active (both shared classes and AOT compilation enabled), -Xquickstart causes all methods selected for compilation to be AOT compiled, which improves the startup time of subsequent runs. -Xquickstart might degrade performance if it is used with long-running applications that contain methods using a large amount of processing resource. The implementation of -Xquickstart is subject to change in future releases." (http://pic.dhe.ibm.com/infocenter/java7sdk/v7r0/topic/com.ibm.java.lnx.71.doc/diag/tools/jitpd_short_run.html)

-Xaggressive

Consider testing with -Xaggressive: "Enables performance optimizations and new platform exploitation that are expected to be the default in future releases." (http://pic.dhe.ibm.com/infocenter/java7sdk/v7r0/topic/com.ibm.java.zos.70.doc/diag/appendixes/cmdline/Xaggressive.html)

-XtlhPrefetch

Test with -XtlhPrefetch, as noted in the recipe above. This option speculatively prefetches memory ahead of the thread local heap (TLH) allocation pointer into processor cache, which can help workloads with high object allocation rates.

Large Object Area

By default, the large object area (LOA) is only used as a last resort for large objects when there is no room in the rest of the heap: an object typically has to be larger than ~16KB-64KB to be allocated in the LOA, and that only happens if there is not enough space in the small object area (SOA).

-Xrs

The -Xrs flag is used to disable the default signal handler (for things such as javacores with kill -3, etc.); however, using this option may reduce performance by up to 5% due to the way the JIT works and the way Java uses signals when available for performance boosts.

External Delays

Performance problems can sometimes be caused by the poor responsiveness of external resources that your application is attempting to access. These external resources include databases, file I/O, other applications, and legacy systems. To see if the problem is caused by external delays:

Identify that a number of threads are waiting on external resources and what those resources are, by examining the javacore.txt file that has been collected.

Profile the responsiveness of the resource to see if response times are longer than expected. You can use a method trace to profile when the call to the resource returns, or you can profile the resource being accessed.

Java thread information is displayed in the "THREADS subcomponent" section of the Javadump. The stack trace is provided for each thread, which can be used to determine whether there are any threads waiting on external resources. A thread may wait on an external resource either in a wait, read, or receive method. In this example, the threads are in the Object.wait() method because of a call to AS400ThreadedServer.receive(), which is an external resource:

3XMTHREADINFO  "WebContainer : 0" (TID:0x0000000001191E00, sys_thread_t:0x00000000010955C0, state:CW, native ID:0x0000000000004454) prio=5  

4XESTACKTRACE    at java/lang/Object.wait(Native Method)  

4XESTACKTRACE    at java/lang/Object.wait(Object.java:199(Compiled Code))  

4XESTACKTRACE    at com/ibm/as400/access/AS400ThreadedServer.receive(AS400ThreadedServer.java:281(Compiled Code))

4XESTACKTRACE    at com/ibm/as400/access/AS400ThreadedServer.sendAndReceive(AS400ThreadedServer.java:419(Compiled Code))  

4XESTACKTRACE    at com/ibm/as400/access/BaseDataQueueImplRemote.read(BaseDataQueueImplRemote.java:220(Compiled Code))  

4XESTACKTRACE    at com/ibm/as400/access/KeyedDataQueue.read(KeyedDataQueue.java:413(Compiled Code))  

4XESTACKTRACE    at com/ibm/testapp/vjops/infra/cdapj/trans/CDAPDataQRouter.readByteBuffer(Bytecode PC:36(Compiled Code))  

4XESTACKTRACE    at com/ibm/testapp/vjops/infra/cdapj/trans/CDAPDataQRouter.getMessage(Bytecode PC:28(Compiled Code))  

4XESTACKTRACE    at com/ibm/testapp/vjops/infra/cdapj/trans/DataQueueMsgTransactor.doCDAPTransaction(Bytecode PC:175(Compiled Code))

...

3XMTHREADINFO  "WebContainer : 2" (TID:0x0000000001495100, sys_thread_t:0x000000000135D6B0, state:CW, native ID:0x000000000000445C) prio=5  

4XESTACKTRACE    at java/lang/Object.wait(Native Method)  

4XESTACKTRACE    at java/lang/Object.wait(Object.java:199(Compiled Code))  

4XESTACKTRACE    at com/ibm/as400/access/AS400ThreadedServer.receive(AS400ThreadedServer.java:281(Compiled Code))  

4XESTACKTRACE    at com/ibm/as400/access/AS400ThreadedServer.sendAndReceive(AS400ThreadedServer.java:419(Compiled Code))  

4XESTACKTRACE    at com/ibm/as400/access/BaseDataQueueImplRemote.read(BaseDataQueueImplRemote.java:220(Compiled Code))  

4XESTACKTRACE    at com/ibm/as400/access/KeyedDataQueue.read(KeyedDataQueue.java:413(Compiled Code))  

4XESTACKTRACE    at com/ibm/testapp/vjops/infra/cdapj/trans/CDAPDataQRouter.readByteBuffer(Bytecode PC:36(Compiled Code))  

4XESTACKTRACE    at com/ibm/testapp/vjops/infra/cdapj/trans/CDAPDataQRouter.getMessage(Bytecode PC:28(Compiled Code))  

4XESTACKTRACE    at com/ibm/testapp/vjops/infra/cdapj/trans/DataQueueMsgTransactor.doCDAPTransaction(Bytecode PC:175(Compiled Code))  

...

3XMTHREADINFO  "WebContainer : 3" (TID:0x000000000167A800, sys_thread_t:0x0000000000E57AE0, state:B, native ID:0x0000000000005072) prio=5  

4XESTACKTRACE    at java/lang/Object.wait(Native Method)  

4XESTACKTRACE    at java/lang/Object.wait(Object.java:231(Compiled Code))  

4XESTACKTRACE    at com/ibm/ws/util/BoundedBuffer.waitGet_(BoundedBuffer.java:188(Compiled Code))  

4XESTACKTRACE    at com/ibm/ws/util/BoundedBuffer.take(BoundedBuffer.java:522(Compiled Code))  

4XESTACKTRACE    at com/ibm/ws/util/ThreadPool.getTask(ThreadPool.java:816(Compiled Code))  

4XESTACKTRACE    at com/ibm/ws/util/ThreadPool$Worker.run(ThreadPool.java:1476(Compiled Code))

One of the threads is in BoundedBuffer.waitGet_(), which is an internal resource [and thus not an external delay; in this case the thread is waiting for work]. If the Javadump shows threads that are suspected to be blocking on external resources, the next step is to profile the response time of those resources to see if they are taking a long time.

You can profile the amount of time taken by a method that accesses an external resource by using method trace. Method trace can capture trace data for the JVM, the Java Class Libraries (JCL), and Java application code. You do not need to modify your application to use method trace, which is useful if the source code for the methods of interest is not available. The following resources describe how to activate and control method trace:

... For example, you might profile the "AS400ThreadedServer.receive()" method, using the following command-line options:

-Xtrace:maximal=mt,output=mtrace#.out,10m,10,methods={com/ibm/as400/access/AS400ThreadedServer.receive*}

These options create up to ten files called mtrace#.out, where the # symbol is replaced with a sequence number. Each is up to 10 MB in size. When all ten possible files have been created, the trace engine begins to overwrite the first file in the sequence. You can then format the mtrace#.out files as described in the IBM Diagnostic Guide for Java. These files provide microsecond precision timing information for the entry and exit of each call to the AS400ThreadedServer.receive() method. You can use this information to calculate the average response time and determine if responsiveness is a problem.

Old Java Diagnostic Guide

Lock Contention

A monitor has a "thin" lock that can be tested efficiently, but which does not support blocking, and -- only when necessary -- an "inflated" lock. The inflated lock is typically implemented using OS resources that can support blocking, but also is less efficient because of the additional path length required when making the calls to the operating system. Because thin locks don't support blocking, spinning is often used such that threads will spin for a short period of time in case the lock becomes available soon after they first try to acquire it.

Analysis of typical locking patterns gives us the insight that spinning helps most cases, but for some specific cases it does not. Before running an application, it is impossible to know for which monitors spinning will not be useful. It is possible, however, to observe monitor usage and identify at run time those monitors for which you do not believe spinning will be helpful. You can then reduce or eliminate spinning for those specific monitors.

The JVM shipped with WebSphere Application Server V8 includes spinning refinements that capture locking history and use this history to adaptively decide which monitors should use spin and which should not. This can free up additional cycles for other threads with work to do and, when CPU resources are fully utilized, improve overall application performance.

http://www.ibm.com/developerworks/websphere/techjournal/1111_dawson/1111_dawson.html

Starting in Java 6.0.1, various improvements were made that are expected to improve CPU efficiency. If CPU utilization decreases but application performance also decreases, test with -Xthr:secondarySpinForObjectMonitors. If application performance is affected after the application has run for some time or after a period of heavy load, test with -Xthr:noAdaptSpin. If heap usage is reduced but overall application performance decreases, test -Xlockword:mode=all (http://pic.dhe.ibm.com/infocenter/java7sdk/v7r0/topic/com.ibm.java.lnx.70.doc/diag/problem_determination/optimizations_pd.html).

In a javacore, you may see most threads in Conditional Wait (CW) states which you would normally expect to show as Runnable instead. This is "by design" starting in IBM JVM 5. If the top of a thread stack is neither in Object.wait, nor Thread.sleep, nor Thread.join, nor a native method, then the JVM will put the thread into CW state in preparation for the javacore and will return it to Runnable after the javacore is finished. This is done by having all of the aforementioned threads wait for exclusive access to the JVM by waiting on the "Thread public flags mutex lock." This is done to get an internally consistent snapshot of Java stack and monitor states. (http://www-01.ibm.com/support/docview.wss?uid=swg21413580)

Consider upgrading to the latest version of Java because there are often performance improvements in lock contention in the JDK (for example, http://www-01.ibm.com/support/docview.wss?uid=swg1IV67003).

Lock Reservation

Synchronization and locking are an important part of any multi-threaded application. Shared resources must be adequately protected by monitors to ensure correctness, even if some resources are only infrequently shared. If a resource is primarily accessed by a single thread at any given time, that thread will frequently be the only thread to acquire the monitor guarding the resource. In such cases, the cost of acquiring the monitor can be reduced with the -XlockReservation option. With this option, it is assumed that the last thread to acquire the monitor will likely also be the next thread to acquire it. The lock is therefore said to be reserved for that thread, thereby minimizing its cost to acquire and release the monitor. This option is well-suited to workloads using many threads and many shared resources that are infrequently shared in practice.
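
The following minimal sketch illustrates the access pattern lock reservation targets: a resource that is guarded by a monitor for correctness but, in practice, is almost always acquired by the same thread. The class and member names are arbitrary:

public class MostlySingleThreadedCounter {
    private final Object lock = new Object();
    private long count;

    public void increment() {
        // With -XlockReservation, repeated acquisition of this monitor by the
        // same (reserving) thread avoids most of the usual acquire/release cost.
        synchronized (lock) {
            count++;
        }
    }

    public long get() {
        synchronized (lock) {
            return count;
        }
    }
}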

Deadlocks

The Javadump file that should have been collected contains a 'LOCKS' subcomponent. During the generation of the javacore.txt file, a deadlock detector is run, and, if a deadlock is discovered, it is detailed in this section, showing the threads and locks involved in the deadlock:

=======================  
        Deadlock detected !!!  
        ---------------------  
       
          Thread "DeadLockThread 1" (0x41DAB100)  
            is waiting for:  
              sys_mon_t:0x00039B98 infl_mon_t:   0x00039BD8:  
              java/lang/Integer@004B2290/004B229C:  
            which is owned by:  
          Thread "DeadLockThread 0" (0x41DAAD00)  
            which is waiting for:  
              sys_mon_t:0x00039B40 infl_mon_t:   0x00039B80:  
              java/lang/Integer@004B22A0/004B22AC:  
            which is owned by:  
          Thread "DeadLockThread 1" (0x41DAB100)  

This example was taken from a deadlock test program where two threads DeadLockThread 0 and DeadLockThread 1 unsuccessfully attempted to synchronize (Java keyword) on two java/lang/Integers.

You can see in the example that DeadLockThread 0 has locked the object instance java/lang/Integer@004B2290. The monitor has been created as a result of a Java code fragment looking like synchronized(count0), and this monitor has DeadLockThread 1 waiting to get a lock on the same object instance (count0 from the code fragment). Below this is another monitor, locked by DeadLockThread 1, that has DeadLockThread 0 waiting.

Old Java Diagnostic Guide

Reflection Inflation

For a discussion of reflection and inflation, see the general Java chapter. On the IBM JVM, the option -Dsun.reflect.inflationThreshold=0 disables inflation completely.

The sun.reflect.inflationThreshold property tells the JVM what number of times to use the JNI accessor. If it is set to 0, then the JNI accessors are always used. Since the bytecode accessors use more native memory than the JNI ones, if we are seeing a lot of Java reflection, we will want to use the JNI accessors. To do this, we just need to set the inflationThreshold property to zero. (http://www-01.ibm.com/support/docview.wss?uid=swg21566549)

On IBM Java, the default -Dsun.reflect.inflationThreshold=15 means that the JVM will use the JNI accessor for the first 15 accesses, after which it will switch to the Java bytecode accessor. The first invocation of a bytecode accessor currently costs 3-4x more than an invocation via the JNI accessor, but subsequent invocations have been benchmarked at over 20x faster than the JNI accessor.
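
The following minimal sketch (class name arbitrary) crosses the default threshold so the effect can be observed; running it with -verbose:class should show sun/reflect/GeneratedMethodAccessor classes loading once inflation occurs, and not when -Dsun.reflect.inflationThreshold=0 is set:

import java.lang.reflect.Method;

public class InflationDemo {
    public static void main(String[] args) throws Exception {
        Method length = String.class.getMethod("length");
        // With the default threshold of 15, roughly the first 15 invocations
        // use the JNI accessor; later invocations use a generated bytecode
        // accessor (inflation), which consumes additional native memory.
        for (int i = 0; i < 20; i++) {
            length.invoke("hello");
        }
    }
}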

Advanced Encryption Standard New Instructions (AESNI)

AESNI is a set of CPU instructions to improve the speed of encryption and decryption using AES ciphers. It is available on recent Intel and AMD CPUs (http://en.wikipedia.org/wiki/AES_instruction_set#Supporting_CPUs) and POWER >= 8 CPUs (http://www.redbooks.ibm.com/redpieces/pdfs/sg248171.pdf). If using IBM Java >= 6 and the IBM JCE security provider, then AESNI, if available, can be exploited with -Dcom.ibm.crypto.provider.doAESInHardware=true (http://www-01.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.security.component.70.doc/security-component/JceDocs/aesni.html?lang=en).

In some benchmarks, SSL/TLS overhead was reduced by up to 35%.

Use -Dcom.ibm.crypto.provider.AESNITrace=true to check whether the processor supports the AESNI instruction set.

Large Object Allocation Stack Traces

For a 5MB threshold:

-Xdump:stack:events=allocation,filter=#5m

https://www.ibm.com/developerworks/mydeveloperworks/blogs/troubleshootingjava/entry/profiling_large_objects

For a size range (5 to 6 MB): -Xdump:stack:events=allocation,filter=#5m..6m

Compressed References

64-bit processes primarily offer a much larger address space, thereby allowing for larger Java heaps, larger JIT code caches, and reduced effects of memory fragmentation in the native heap. Certain platforms also offer additional benefits in 64-bit mode, such as more CPU registers. However, 64-bit processes also must deal with increased overhead. The overhead comes from the increased memory usage and decreased cache utilization. This overhead is present with every single object allocation, as each object must now be referred to with a 64-bit address rather than a 32-bit address. To alleviate this, the -Xcompressedrefs option may be used, and it is enabled by default in certain releases on certain operating systems. When enabled, the JVM will use smaller references to objects instead of 64-bit references when possible. Object references are compressed and decompressed as necessary at minimal cost.

In order to determine the compression/decompression overhead for a given heap size on a particular platform, review verbosegc:

<attribute name="compressedRefsDisplacement" value="0x0" />
<attribute name="compressedRefsShift" value="0x0" />

Values of 0 essentially indicate that no work has to be done in order to convert between references. Under these circumstances, 64-bit JVMs running with -Xcompressedrefs can reduce the overhead of 64-bit addressing even more and achieve better performance.

-Xcompressedrefs is enabled by default in Java 6.0.1 SR5 and Java 7 SR4 when the size of the heap allows it. -Xnocompressedrefs can be used to explicitly disable it. On z/OS, before Java 7.1, compressed references was disabled by default, but it could be enabled explicitly.

Some benchmarks show a 10-20% relative throughput decrease when disabling compressed references: "Analysis shows that a 64-bit application without CR yields only 80-85% of 32-bit throughput but with CR yields 90-95%. Depending on application requirements, CR can improve performance up to 20% over standard 64-bit." (ftp://public.dhe.ibm.com/software/webserver/appserv/was/WAS_V7_64-bit_performance.pdf). You may be able to recover some of this drop by increasing L2/L3 processor cache sizes. Disabling compressed references will also dramatically increase Java heap usage by up to 70%. Additional background: http://www-01.ibm.com/support/docview.wss?uid=swg21660890

-Xgc:preferredHeapBase

With compressed references enabled, due to the design of Java, native metadata must all be allocated in the virtual memory range 0-4GB. This includes all native objects backing classes, classloaders, threads, and monitors. If there is insufficient space for additional metadata to be allocated, then a native OutOfMemoryError (NOOM) will be thrown. In general, this can happen for two reasons: 1) there is a class, classloader, thread, or monitor leak, and 2) the Java heap is sharing the 0-4GB space. The first cause can be investigated with the javacore.txt file that's produced with the NOOM by searching for large numbers of these objects.

The second cause is due to the default performance optimizations that Java makes. The location of the Java heap will affect the type of compression operations that must be performed on each Java pointer reference (http://www-01.ibm.com/support/docview.wss?uid=swg21660890). If the Java heap can fit completely underneath 4GB, then no "compression" needs to occur - the top 32 bits are simply truncated. Otherwise, for different locations of the Java heap, different arithmetic operations need to be performed. Before APAR IV37797 on non-z/OS operating systems, part of the Java heap would sometimes be allocated below 4GB without the benefit that comes with putting all of it under 4GB. APAR IV37797 changes the behavior for larger heaps to allocate them higher in the virtual memory area by default. On z/OS operating systems, the default was not changed because of operating system considerations. On all operating systems, there are cases where the Java heap will be preferred underneath 4GB and squeeze the metadata space, thus causing NOOMs. One option is to reduce metadata demands, and the second option is to specify where the Java heap should start. Usually, it is sufficient to start the Java heap at the 4GB mark: -Xgc:preferredHeapBase=0x100000000

-Xgc:classUnloadingKickoffThreshold

Classloaders and classes tend to be long lived objects, so they will usually be tenured; however, they also retain native memory. If a full collection does not run for some time and if there are virtual or physical memory pressures, then you can induce full collections using -Xgc:classUnloadingKickoffThreshold:

The command line option -Xgc:classUnloadingKickoffThreshold=<number> tells the system to start a concurrent tenure collection every time <number> new class loaders have been created. For example, specifying -Xgc:classUnloadingKickoffThreshold=100 starts a concurrent tenure collection whenever a nursery collection notices that 100 new class loaders have been created since the last tenure collection. (http://www.ibm.com/developerworks/websphere/techjournal/1106_bailey/1106_bailey.html)

Method Tracing (-Xtrace methods)

Before IBM Java 7.1, using any method trace may have a significant performance overhead, in some cases up to 40%, and up to 70% during JVM startup. This only affects the -Xtrace "method" option (including simple triggers), not tpnid or other options. This overhead has been mostly removed in Java 7.1.

To use -Xtrace as a simple tracing profiler, see the Java Profilers chapter.

The IBM JVM includes a default -Xtrace which is written to the in-memory ring buffer and dumped out into a Snap.trc file in some situations. This can be disabled with -Xtrace:none.

Use -Xtrace triggers to gather diagnostics when specified Java methods are executed. For example, to take a javacore on the first 1500 executions:

-Xtrace:trigger=method{ilog/rules/factory/IlrReflect.*Listener,javadump,,,1500}

For example, here is a trace that tracks Java socket I/O activity:

-Xtrace:none -Xtrace:maximal=tpnid{IO.0-50},output=javatrace.log

Example output:

17:11:02.473807000          0x12b83f00         IO.18   Entry      >IO_Connect(descriptor=353, connect_to(AF_INET6: port=7272 flow=0 addr=...
17:11:02.473944000          0x12b83f00         IO.20   Exit       <IO_Connect - return =0
17:11:02.474078000          0x12b83f00         IO.32   Entry      >IO_Send(descriptor=353, msg=4197800128, len=20, flags=0)
17:11:02.474117000          0x12b83f00         IO.34   Exit       <IO_Send - bytes sent=20
17:11:02.474124000          0x12b83f00         IO.32   Entry      >IO_Send(descriptor=353, msg=4197800128, len=193, flags=0)
17:11:02.474145000          0x12b83f00         IO.34   Exit       <IO_Send - bytes sent=193
17:11:02.474149000          0x12b83f00         IO.32   Entry      >IO_Send(descriptor=353, msg=4197800128, len=1498, flags=0)
17:11:02.474171000          0x12b83f00         IO.34   Exit       <IO_Send - bytes sent=1498
17:12:20.422571000          0x13090c00         IO.21   Entry      >IO_Recv(descriptor=311, buffer=4195936448, len=88, flags=0)
17:12:20.422577000          0x13090c00         IO.23   Exit       <IO_Recv - bytes read=88
17:11:02.474183000          0x12b83f00         IO.43   Entry      >IO_Dup2(fd1=290, fd2=353)
17:11:02.474206000          0x12b83f00         IO.44   Exit       <IO_Dup2 - error=353
17:11:02.474209000          0x12b83f00         IO.47   Entry      >IO_Close(descriptor=353)
17:11:02.474210000          0x12b83f00         IO.49   Exit       <IO_Close - return code=0

To format an xtrace output file:

java com.ibm.jvm.format.TraceFormat xtrace.out

Trace history for a specific thread can be retrieved through jdmpview or IDDE: !snapformat -t <J9VMThread address>

Object Request Broker (ORB) and Remote Method Invocation (RMI)

Default ORB configuration is specified in ${java}/jre/lib/orb.properties. Key parameters include (http://pic.dhe.ibm.com/infocenter/java7sdk/v7r0/topic/com.ibm.java.lnx.71.doc/diag/understanding/orb_using.html):

  • com.ibm.CORBA.ConnectTimeout: socket connect timeout
  • com.ibm.CORBA.MaxOpenConnections: maximum number of in-use connections that are to be kept in the connection cache table at any one time
  • com.ibm.CORBA.RequestTimeout: number of seconds to wait before timing out on a Request message
  • com.ibm.CORBA.ThreadPool.MaximumSize: maximum size of the ORB thread pool

Note that in WAS, there is an ORB.thread.pool configuration which is normally used; however, if the ThreadPool properties are specified in orb.properties, then they override the WAS configuration.
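
For example, the key properties may also be set as generic JVM arguments rather than edited in orb.properties; the timeout values below are illustrative (both are in seconds):

-Dcom.ibm.CORBA.ConnectTimeout=10 -Dcom.ibm.CORBA.RequestTimeout=30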

You may see ORB reader threads (RT) and writer threads (WT). For example, here is a reader thread:

3XMTHREADINFO      "RT=265:P=941052:O=0:WSTCPTransportConnection[addr=...,port=2940,local=48884]" J9VMThread:0x000000000E255600, j9thread_t:0x00002AAAC15D5470, java/lang/Thread:0x000000004CF4B4F0, state:R, prio=5

3XMTHREADINFO1            (native thread ID:0x7EFD, native priority:0x5, native policy:UNKNOWN)

3XMTHREADINFO2            (native stack address range from:0x00002AAAD7D6A000, to:0x00002AAAD7DAB000, size:0x41000)

3XMTHREADINFO3           Java callstack:

4XESTACKTRACE                at java/net/SocketInputStream.socketRead0(Native Method)

4XESTACKTRACE                at java/net/SocketInputStream.read(SocketInputStream.java:140(Compiled Code))

4XESTACKTRACE                at com/ibm/rmi/iiop/Connection.readMoreData(Connection.java:1642(Compiled Code))

4XESTACKTRACE                at com/ibm/rmi/iiop/Connection.createInputStream(Connection.java:1455(Compiled Code))

4XESTACKTRACE                at com/ibm/rmi/iiop/Connection.doReaderWorkOnce(Connection.java:3250(Compiled Code))

4XESTACKTRACE                at com/ibm/rmi/transport/ReaderThread.run(ReaderPoolImpl.java:142(Compiled Code))

Reader threads will normally be in R (runnable) state, even when they are simply waiting for an incoming message.

The number of reader threads (RT) is controlled by the number of active socket connections, not by the ORB thread pool size. An RT is created for every socket connect/accept call and removed when the socket closes. The number of RTs is not bounded by MaxConnectionCacheSize, which is a soft limit: the connection cache can grow beyond MaxConnectionCacheSize, and once the cache reaches that size, the ORB will try to remove stale (i.e. unused) connections.

The ORB thread pool size caps the maximum number of writer threads (WT), since at most that many ORB threads can be writing at once.

Connection Multiplicity

com.ibm.CORBA.ConnectionMultiplicity: The value of ConnectionMultiplicity defines the number of concurrent TCP connections between the server and client ORBs. By default, this value is 1; that is, there is only one connection between the client and server ORBs, and all requests between them are multiplexed onto that one connection. This can become a performance bottleneck in J2EE deployments with a large number of concurrent requests between the client and server ORBs. (http://www-01.ibm.com/support/docview.wss?uid=swg21669697)

For example, -Dcom.ibm.CORBA.ConnectionMultiplicity=N, where N is the desired number of concurrent connections.

Fragment Size

The ORB separates messages into fragments to send over the ORB connection. You can configure this fragment size through the com.ibm.CORBA.FragmentSize parameter.

To determine and change the size of the messages that transfer over the ORB and the number of required fragments, perform the following steps:

  1. In the administrative console, enable ORB tracing in the ORB Properties page.
  2. Enable the ORBRas diagnostic trace ORBRas=all (http://www-01.ibm.com/support/docview.wss?uid=swg21254706).
  3. Increase the trace file sizes, because tracing can generate a lot of data.
  4. Restart the server and run at least one iteration (preferably several) of the case that you are measuring.
  5. Look at the trace file and search for Fragment to follow: Yes.

This value indicates that the ORB transmitted a fragment but still has at least one more fragment to send before the entire message has been transmitted. A Fragment to follow: No value indicates that the particular fragment is the last in the entire message. This fragment can also be the first, if the message fits entirely into one fragment.

If you go to the spot where Fragment to follow: Yes is located, you find a block that looks similar to the following example:

Fragment to follow: Yes
Message size:       4988 (0x137C)
--
Request ID:         1411

This example indicates that the amount of data in the fragment is 4988 bytes and the Request ID is 1411. If you search for all occurrences of Request ID: 1411, you can see the number of fragments that are used to send that particular message. If you add all the associated message sizes, you have the total size of the message that is being sent through the ORB.
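
For instance (hypothetical values), if Request ID: 1411 appears on three fragments of 4988, 4988, and 1200 bytes, the total message size is 4988 + 4988 + 1200 = 11176 bytes.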

You can configure the fragment size by setting the com.ibm.CORBA.FragmentSize ORB custom property.

http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.multiplatform.doc/ae/rorb_tims.html

The default fragment size is 1-4KB, depending on release: http://www-01.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.lnx.71.doc/user/corba_tuning.html?lang=en

In general, the ideal fragment size is 0 (no fragmentation), but that assumes there is only a single response in flight per connection, which is often not the case with the default value of ConnectionMultiplicity. One major purpose of a non-zero fragment size is to prevent one large response from monopolizing a connection that other responses need to use. However, if ConnectionMultiplicity is tuned to eliminate connection contention, then use -Dcom.ibm.CORBA.FragmentSize=0.

Interceptors

Interceptors are ORB extensions that can set up the context before the ORB runs a request. For example, the context might include transactions or activity sessions to import. If the client creates a transaction and then flows the transaction context to the server, the server imports the transaction context onto the server request through the interceptors.

Most clients do not start transactions or activity sessions, so most systems can benefit from removing the interceptors that are not required.

To remove the interceptors, manually edit the server.xml file and remove the interceptor lines that are not needed from the ORB section.

http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.multiplatform.doc/ae/rorb_tims.html

ORB IBM Data Representation (IDR)

The ORB in IBM Java 7.1 introduced dramatic performance improvements in IDR processing.

java.nio.DirectByteBuffer

The option -Dcom.ibm.nio.DirectByteBuffer.AggressiveMemoryManagement=true may be used to enable a more aggressive DirectByteBuffer cleanup algorithm, which may increase the frequency of calls to System.gc.
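
For background, the native (off-heap) memory behind a direct buffer is released only once the small Java wrapper object is garbage collected, which is why heavy direct-buffer churn can pressure the collector. A minimal sketch (the buffer size is arbitrary):

import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // allocateDirect reserves native memory outside the Java heap; only a
        // small wrapper object lives on the heap, and the native memory is
        // freed when that wrapper is garbage collected.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024); // 1MB
        buffer.putInt(42);
        System.out.println("Direct buffer capacity: " + buffer.capacity());
    }
}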

XML and XSLT

Profile your application using tools such as the IBM Java Health Center or more simply by taking multiple thread dumps (e.g. the WAIT tool). If you observe significant lock contention on an instance of java/lang/Class and/or significant CPU time in com/ibm/xtq/xslt/* classes, then consider testing the older XSLT4J interpreter to see if you have better results:

From Version 6, the XL TXE-J compiler replaces the XSLT4J interpreter as the default XSLT processor.

The XL TXE-J compiler is faster than the XSLT4J interpreter when you are applying the same transformation more than once. If you perform each individual transformation only once, the XL TXE-J compiler is slower than the XSLT4J interpreter because compilation and optimization reduce performance.

For best performance, ensure that you are not recompiling XSLT transformations that can be reused. Use one of the following methods to reuse compiled transformations:

  • If your stylesheet does not change at run time, compile the stylesheet as part of your build process and put the compiled classes on your classpath. Use the org.apache.xalan.xsltc.cmdline.Compile command to compile the stylesheet and set the http://www.ibm.com/xmlns/prod/xltxe-j/use-classpath transformer factory attribute to true to load the classes from the classpath.
  • If your application will use the same stylesheet during multiple runs, set the http://www.ibm.com/xmlns/prod/xltxe-j/auto-translet transformer factory attribute to true to automatically save the compiled stylesheet to disk for reuse. The compiler will use a compiled stylesheet if it is available, and compile the stylesheet if it is not available or is out-of-date. Use the http://www.ibm.com/xmlns/prod/xltxe-j/destination-directory transformer factory attribute to set the directory used to store compiled stylesheets. By default, compiled stylesheets are stored in the same directory as the stylesheet.
  • If your application is a long-running application that reuses the same stylesheet, use the transformer factory to compile the stylesheet and create a Templates object. You can use the Templates object to create Transformer objects without recompiling the stylesheet. The Transformer objects can also be reused, but are not thread-safe. (A sketch of this approach follows after this list.)
  • If your application uses each stylesheet just once or a very small number of times, or you are unable to make any of the other changes listed in this step, you might want to continue to use the XSLT4J interpreter by setting the javax.xml.transform.TransformerFactory service provider to org.apache.xalan.processor.TransformerFactoryImpl.

http://www-01.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.lnx.71.doc/user/xml/xslt_migrate.html
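
As a minimal sketch of the Templates approach above (the class name, stylesheet, and file names are illustrative):

import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StylesheetCache {
    // Compile the stylesheet once; a Templates object is thread-safe
    // and can be shared across the whole application.
    private static final Templates TEMPLATES;
    static {
        try {
            TEMPLATES = TransformerFactory.newInstance()
                    .newTemplates(new StreamSource("transform.xsl"));
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Each caller gets its own Transformer, which is cheap to create from
    // Templates but not thread-safe, so it must not be shared across threads.
    public static void transform(String inXml, String outXml) throws TransformerException {
        Transformer transformer = TEMPLATES.newTransformer();
        transformer.transform(new StreamSource(inXml), new StreamResult(outXml));
    }
}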

For additional information, see http://www-01.ibm.com/support/docview.wss?uid=swg21639667

Javacore Thread Dump

Review the native stack traces as well for hot stacks, because these may point to a more fundamental issue in the operating system (e.g. malloc contention).

Per-thread CPU usage in javacore (Java 7 SR6, Java 6.26 SR7, and Java 7.1): a new line has been added to the header section for each thread, giving CPU usage information for that thread (as available from the OS):

3XMTHREADINFO      "main" J9VMThread:0x0000000022C80100, j9thread_t:0x0000000000D4E5C0, java/lang/Thread:0x0000000022B96250, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1, isDaemon:false)
3XMTHREADINFO1            (native thread ID:0xE90, native priority:0x5, native policy:UNKNOWN)
3XMCPUTIME               CPU usage total: 0.249601600 secs, user: 0.218401400 secs, system: 0.031200200 secs
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=25368 (0x6318)
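
The same kind of per-thread CPU data can also be sampled at runtime through the standard ThreadMXBean API; a minimal sketch:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCpuSample {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // getThreadCpuTime returns nanoseconds, or -1 if the JVM or OS
        // does not support (or has disabled) per-thread CPU measurement.
        for (long id : threads.getAllThreadIds()) {
            long cpuNanos = threads.getThreadCpuTime(id);
            if (cpuNanos >= 0) {
                System.out.printf("Thread %d CPU: %.6f secs%n", id, cpuNanos / 1e9);
            }
        }
    }
}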

Starting with Java 8, CPU usage of JVM-attached threads is tracked by thread category, which can be disabled with -XX:-ReduceCPUMonitorOverhead: http://www-01.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/com.ibm.java.lnx.80.doc/diag/preface/changes_80/whatsnew.html?lang=en#changes__mbean. New lines at the end of the THREADS section in javacore provide the accumulated CPU totals in each category, for example:

1XMTHDSUMMARY  Threads CPU Usage Summary
NULL           =========================
1XMTHDCATEGORY All JVM attached threads: 134.253955000 secs
1XMTHDCATEGORY |
2XMTHDCATEGORY +--System-JVM: 8.642450000 secs
2XMTHDCATEGORY |  |
3XMTHDCATEGORY |  +--GC: 1.216805000 secs
2XMTHDCATEGORY |  |
3XMTHDCATEGORY |  +--JIT: 6.224438000 secs
1XMTHDCATEGORY |
2XMTHDCATEGORY +--Application: 125.611505000 secs

In the header lines for each thread, an additional field at the end of the 3XMCPUTIME line indicates the current CPU usage category of that thread, for example:

3XMTHREADINFO   "JIT Compilation Thread-0 Suspended" J9VMThread:0x000000000F01EB00, j9thread_t:0x000000000296A7F8
 java/lang/Thread:0x00000000E0029718, state:R, prio=10
3XMJAVALTHREAD         (java/lang/Thread getId:0x4, isDaemon:true)
3XMTHREADINFO1         (native thread ID:0xDFC, native priority:0xB, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x01000001)
3XMCPUTIME            CPU usage total: 5.912437900 secs, user: 5.865637600 secs, system: 0.046800300 secs, current category="JIT"

http://www-01.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/com.ibm.java.lnx.80.doc/diag/preface/changes_80/whatsnew.html?lang=en#changes__mbean

Stack Size (-Xss)

If using large stack sizes, consider setting -Xssi (the Java thread stack increment size) as well: http://www-01.ibm.com/support/docview.wss?uid=swg21659956
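
For example (the values are purely illustrative and MyApplication is a hypothetical main class):

java -Xss4m -Xssi512k MyApplication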

Large Pages (-Xlp)

Details of enabling large pages are on each operating system page. To see whether large pages are in use in a running JVM, compare pageSize (the page size actually obtained) with requestedPageSize (the page size requested, for example via -Xlp) in verbosegc; if pageSize is smaller than requestedPageSize, the operating system did not provide the requested large pages:

<attribute name="pageSize" value="0x1000" />
<attribute name="requestedPageSize" value="0x20000" />
