Intelligent Management

Intelligent Management Recipe

  1. If using Java On Demand Routers:
    1. Test the relative performance of an increased maximum size of the Default thread pool.
    2. If ODRs are on shared installations, consider using separate shared class caches.
    3. If using Windows:
      1. If using AIO (the default), test the relative performance of -DAIONewWindowsCancelPath=1
      2. If using AIO (the default), test the relative performance of disabling AIO and using NIO

Background

Intelligent Management (IM) was formerly a separate product called WebSphere Virtual Enterprise (WVE); it became part of WebSphere Network Deployment starting with version 8.5.

IM introduces the On Demand Router which supports application editioning, health policies, service policies, maintenance mode, automatic discovery, dynamic clusters, traffic shaping, and more. The ODR was first delivered as a Java process based on the Proxy Server, and it was normally placed between a web server and the application servers. Starting with WAS 8.5.5, there is an option called Intelligent Management for Web Servers (colloquially, ODRLib) which is a native C component that delivers some of the same functionality but is integrated directly into the IBM HTTP Server (IHS) web server.

Java On Demand Router (ODR)

The Java On Demand Router (ODR) is built on top of the WAS Java Proxy Server. Both of these write the following log files asynchronously in a background LoggerOffThread:

  • local.log: A log of the communications between the client (e.g. browser) and the ODR, i.e. the activities in the "local" ODR process.
  • proxy.log: A log of the communications between the ODR and the backend server (e.g. application server).

The weighted least outstanding request (WLOR) load balancing algorithm is generally superior to the available load balancing algorithms in the WebSphere plugin. WLOR takes into account both the weight of the server and the number of outstanding requests, so it is better at evening out load if one server slows down. WLOR is the default in both ODRLib and the Java ODR.
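The core idea behind WLOR can be sketched in a few lines of Python. This is an illustration of the heuristic, not the ODR's actual scoring formula; the server and field names are invented for the example:

```python
def choose_server(servers):
    """Pick a backend using a weighted-least-outstanding-requests heuristic.

    servers: list of dicts with 'name', 'weight' (configured routing weight),
    and 'outstanding' (requests currently in flight). Illustrative sketch only;
    the real ODR implementation may score differently.
    """
    # Fewer outstanding requests per unit of weight wins; the +1 keeps the
    # score meaningful when a server has zero requests in flight.
    return min(servers, key=lambda s: (s["outstanding"] + 1) / s["weight"])

servers = [
    {"name": "app1", "weight": 2, "outstanding": 10},  # slow or busy member
    {"name": "app2", "weight": 2, "outstanding": 2},
]
```

With equal weights, the member with fewer in-flight requests (app2) is chosen, which is how WLOR evens out load when one server slows down.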

The "excessive request timeout condition" and "excessive response time condition" are useful health policies that the ODR can monitor to gather diagnostics on anomalous requests.

Conditional Request Trace enables traces only for requests that match a particular condition such as a URI.

The ODR measures "service time" as the time from when the request is sent to the application server until the first response chunk arrives.

Default Thread Pool

The Java ODR/Proxy primarily uses the Default thread pool for its HTTP proxying function; however, most of its activity is asynchronous, so a very large volume of traffic would be required to overwhelm this thread pool. In that case, it may help to increase the pool's maximum size, although exhaustion of the Default thread pool may simply be a symptom of upstream or downstream issues.

Maintenance Mode

Putting servers into maintenance mode is a great way to gather performance diagnostics while reducing the potential impact to customers. One maintenance mode option is to allow users with affinity to continue making requests while sending new requests to other servers.

Putting a server into maintenance mode is a persistent change. In other words, a server remains in maintenance mode (even if the server is restarted) until the mode is explicitly changed. The maintenance mode of a server is stored persistently as a server custom property named "server.maintenancemode" under Application Servers > Administration > Custom Properties. Possible values for that property are:

  • false - maintenance mode is disabled
  • affinity - only route traffic with affinity to the server
  • break - don't route any traffic to the server
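The routing behavior implied by the three values can be summarized in a small sketch. This illustrates the documented semantics of the server.maintenancemode property; it is not ODR code:

```python
def route_allowed(mode, request_has_affinity):
    """Return True if a request may be routed to a server whose
    server.maintenancemode custom property has the given value.
    Illustrative sketch of the documented semantics."""
    if mode == "false":
        return True                   # maintenance mode disabled: route normally
    if mode == "affinity":
        return request_has_affinity   # only requests with affinity to this server
    if mode == "break":
        return False                  # route no traffic to this server
    raise ValueError("unknown maintenance mode: %r" % mode)
```

For example, with mode "affinity", existing users with session affinity keep reaching the server while new requests go elsewhere.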

Custom Logging

The Java ODR supports custom logging, which logs information about HTTP responses, allows conditions on what is logged, and has very flexible fields for logging.

The condition uses HTTP request and response operands. Response operands include response code, target server, response time, and service time.

There are various fields available to print.

Instructions to log all responses:

  1. Log into the machine that runs the WAS DMGR, open a command prompt, and change directory to the $WAS/bin/ directory.
  2. Run the following command for each ODR, replacing $ODRNODE with the ODR's node and $ODRSERVER with the name of the ODR:
    wsadmin -f manageODR.py insertCustomLogRule $ODRNODE:$ODRSERVER 1 "service.time >= 0" "http.log %h %t %r %s %b %Z %v %R %T"
  3. In the WAS DMGR administrative console, for each ODR, go to: Servers > Server Types > On Demand Routers > $ODR > On Demand Router Properties > On Demand Router settings > Custom Properties
    1. Click New and set Name=http.log.maxSize and Value=100 and click OK. This value is in MB.
    2. Click New and set Name=http.log.history and Value=10 and click OK
    3. Click Review, check the box to synchronize, and click Save
  4. Restart the ODRs
  5. Observe that there is now an http.log file in $WAS/profiles/$PROFILE/logs/$ODR/

The default value for http.log.maxSize is 500 MB and the default value for http.log.history is 1.

Note that the number of historical files is in addition to the current file, meaning that the defaults will produce up to 1GB in two files. Also note that changing the values affects not only the ODR custom logs, but also the proxy.log, local.log, and cache.log.
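The worst-case disk usage follows directly from these two settings; a quick sanity check (function name is illustrative):

```python
def max_log_disk_mb(max_size_mb, history):
    """Upper bound on disk used by one log, in MB: the historical
    files are in addition to the current file."""
    return max_size_mb * (history + 1)

defaults_mb = max_log_disk_mb(500, 1)   # default settings: up to 1000 MB in two files
tuned_mb = max_log_disk_mb(100, 10)     # values from the steps above: up to 1100 MB
```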

Other notes:

Log rules may be listed with:

$ wsadmin -f manageODR.py listCustomLogRules $ODRNODE:$ODRSERVER
WASX7209I: Connected to process "dmgr" on node dmgr1 using SOAP connector;  The type of process is: DeploymentManager
WASX7303I: The following options are passed to the scripting environment and are available as arguments that are stored in the argv variable: "[listCustomLogRules, odr1:odrserver1]"
1: condition='service.time >= 0' value='http.log %h %t %r %s %b %Z %v %R %T'

Log rules may be removed by referencing the rule number (specified in insertCustomLogRule or listed on the left side of the output of listCustomLogRules):

$ wsadmin -f manageODR.py removeCustomLogRule ${ODRNODE}:${ODRSERVER} 1
WASX7209I: Connected to process "dmgr" on node dmgr1 using SOAP connector;  The type of process is: DeploymentManager
WASX7303I: The following options are passed to the scripting environment and are available as arguments that are stored in the argv variable: "[removeCustomLogRule, odr1:odrserver1, 1]"
Removed log rule #1

If the overhead of the example log rule above is too high, it may be reduced significantly by logging only requests that take a long time: change the service.time threshold (in milliseconds) to some large value. For example (the name of the log is also changed to something more meaningful, such as http_slow.log):

$ ./wsadmin.sh -f manageODR.py insertCustomLogRule ${ODRNODE}:${ODRSERVER} 1 "service.time >= 5000" "http_slow.log %h %t %r %s %b %Z %v %R %T"
WASX7209I: Connected to process "dmgr" on node dmgr1 using SOAP connector;  The type of process is: DeploymentManager
WASX7303I: The following options are passed to the scripting environment and are available as arguments that are stored in the argv variable: "[insertCustomLogRule, odr1:odrserver1, 1, service.time >= 5000, http_slow.log %h %t %r %s %b %Z %v %R %T]"
Inserted log rule #1

Example output:

localhost6.localdomain6 09/Jan/2018:14:33:55 PST "GET /swat/Sleep HTTP/1.1" 200 326 cell1/node1/dc1_node1 oc3466700346 6006 6004

Note that %r will be double-quoted without you needing to specify the double quotes in insertCustomLogRule. In fact, insertCustomLogRule does not support double quotes around any field.
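Entries in this format are straightforward to parse for ad-hoc analysis. The sketch below assumes the field order matches the rule "%h %t %r %s %b %Z %v %R %T" (host, timestamp with zone, quoted request, status, bytes, target server, an opaque token for %v, response time, service time); the interpretation of %v is an assumption, not confirmed by the source:

```python
import re

# Assumed field mapping for "%h %t %r %s %b %Z %v %R %T" (see lead-in).
LOG_RE = re.compile(
    r'^(?P<host>\S+) (?P<time>\S+ \S+) "(?P<request>[^"]*)" '
    r'(?P<status>\d+) (?P<bytes>\d+) (?P<target>\S+) (?P<token>\S+) '
    r'(?P<response_ms>\d+) (?P<service_ms>\d+)$'
)

# The example output line from above.
line = ('localhost6.localdomain6 09/Jan/2018:14:33:55 PST '
        '"GET /swat/Sleep HTTP/1.1" 200 326 cell1/node1/dc1_node1 '
        'oc3466700346 6006 6004')

entry = LOG_RE.match(line).groupdict()
```

This makes it easy, for example, to sort entries by service_ms to find the slowest requests.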

Binary Trace Facility (BTF)

The Java ODR supports a different type of tracing from the traditional diagnostic trace. Btrace enables tracing on a per-request basis and covers infrequently-occurring conditions out-of-the-box (e.g. the reason for a 503 response). Btrace is hierarchical with respect to function rather than code, and trace records are organized top-down and left-to-right (in processing order). The trace specification can be set as a cell custom property whose name starts with trace, e.g. name=trace.http, value=http.request.loadBalance=2

The trace command in the WAS installation directory can be used to format btrace data:

$WAS/bin/trace read $SERVER_LOGS_DIRECTORY $SPEC_TO_READ

Dynamic clusters

Application Placement Controller (APC)

The Application Placement Controller code runs in one JVM in the cell and coordinates stopping and starting JVMs when dynamic clusters are in automatic mode, or creates runtime tasks for doing so when dynamic clusters are in supervised mode. The frequency of changes is throttled by the minimum time between placements option. Some of the basic theory of the APC is described in Tang et al., 2007.

Investigate autonomic dynamic cluster size violations.

Investigate APC issues:

  1. Check all node agents are running and healthy and the core group is marked as stable.
  2. Check if any nodes or servers are in maintenance mode.
  3. Check the server logs to see whether server starts were attempted but failed for some reason (e.g. application initialization).
  4. Check each node's available physical memory to confirm there is sufficient free memory for additional servers.
  5. Find where the APC is running (DCPC0001I/HAMI0023I) and not stopped (DCPC0002I/HAMI0023I), and ensure that it is actually running at the interval of the minimum time between placements option (otherwise, it may be hung).
  6. Check if APC detected a violation with the DCPC0309I message. If found, check for any subsequent errors or warnings.
  7. Check the apcReplayer.log, find the **BEGIN PLACEMENT INPUT DUMP** section, and review if all nodes are registered with lines starting with {CI.
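Steps 5 and 6 amount to scanning logs for the APC message IDs. A quick filter can be sketched as below; the sample log lines are illustrative, not verbatim WAS output:

```python
import re

# Message IDs from the investigation steps above.
APC_IDS = ("DCPC0001I", "DCPC0002I", "DCPC0309I", "HAMI0023I")

def apc_events(lines):
    """Return log lines mentioning any of the APC-related message IDs."""
    pattern = re.compile("|".join(APC_IDS))
    return [line for line in lines if pattern.search(line)]

# Illustrative sample; real SystemOut.log lines differ in detail.
sample = [
    "[1/3/19 9:00:00] DCPC0001I: application placement controller started",
    "[1/3/19 9:05:00] WSVR0001I: server open for e-business",
    "[1/3/19 9:10:00] DCPC0309I: placement violation detected",
]
```

Running apc_events over each server's logs quickly shows where the APC is active and whether violations were reported.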

If APC is constantly stopping and starting JVMs seemingly needlessly, test various options such as:

  • APC.BASE.PlaceConfig.DEMAND_DISTANCE_OVERALL=0.05
  • APC.BASE.PlaceConfig.UTILITY_DISTANCE_PER_APPL=0.05
  • APC.BASE.PlaceConfig.WANT_VIOLATION_SCORE=true
  • APC.BASE.PlaceConfig.PRUNE_NO_HELP=false

Service Policies

Service policies define application goals (e.g. average response time less than 1 second) and relative priorities (e.g. application A is High). The Java ODR uses these policies in its request prioritization and routing decisions.

CPU/Memory Overload Protection

These overload protection features cause the Java ODR to queue work to application servers that it sees are over the configured thresholds of CPU and/or memory usage.

Health Policies

When using the "excessive memory usage" health policy, set usexdHeapModule=true. Otherwise, the heap usage is sampled and this can create false positives with generational garbage collection policies such as gencon. The "memory leak" health policy uses the built-in traditional WAS performance advisor and this always samples, so it's not recommended with generational garbage collectors.

Visualization Data Service

This service logs key performance data into CSV log files. The logs are written to the deployment manager profile directory at $DMGR_PROFILE/logs/visualization/*.log

  1. System Administration > Visualization Data Service > Check "Enable Log"
    1. Timestamp format = MM/dd/yyyy HH:mm:ss
      1. If this is not specified, it defaults to the number of milliseconds since the standard base time known as "the epoch", namely January 1, 1970, 00:00:00 GMT - i.e. new Date(timestamp)
    2. Max file size = 20MB
    3. Max historical files = 5
      1. The max file size and historical files apply to each viz data log file, individually.
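When no timestamp format is configured, the raw epoch-milliseconds values can be converted back to readable timestamps as sketched below (the sample value is illustrative):

```python
from datetime import datetime, timezone

# An illustrative raw visualization-data timestamp: milliseconds since
# the epoch, as written when no timestamp format is configured.
ts_ms = 1546508753000

# Convert milliseconds to seconds for fromtimestamp().
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
formatted = dt.strftime("%m/%d/%Y %H:%M:%S")  # matches MM/dd/yyyy HH:mm:ss
```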

Example output of ServerStatsCache.log:

timeStamp,name,node,cellName,version,weight,cpu,usedMemory,uptime,totalRequests,liveSessions,updateTime,highMemMark,residentMemory,totalMemory,db_averageResponseTime,db_throughput,totalMethodCalls  
01/03/2019 09:45:53,server1,localhostNode01,localhostCell01,XD 9.0.0.9,1,0.26649348143619733,80953,846,1337,0,01/03/2019 09:45:44,,334792,5137836,,,
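Because the logs are plain CSV, they are easy to load for analysis. A minimal sketch using the example above (the assumption that the cpu column is a 0-1 fraction is mine, not stated by the source):

```python
import csv
import io

# Reproduced from the ServerStatsCache.log example above.
raw = """timeStamp,name,node,cellName,version,weight,cpu,usedMemory,uptime,totalRequests,liveSessions,updateTime,highMemMark,residentMemory,totalMemory,db_averageResponseTime,db_throughput,totalMethodCalls
01/03/2019 09:45:53,server1,localhostNode01,localhostCell01,XD 9.0.0.9,1,0.26649348143619733,80953,846,1337,0,01/03/2019 09:45:44,,334792,5137836,,,
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Assumed to be a 0-1 CPU fraction; empty columns parse as empty strings.
cpu = float(rows[0]["cpu"])
```

In practice you would open the .log file directly instead of an in-memory string.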

Bulletin Board over the Structured Overlay Network (BBSON)

BBSON is an alternative to the High Availability Manager (HAManager) that allows some of the WAS components which traditionally relied on the HAManager to use a different approach. BBSON is built on the P2P component, which is peer-to-peer with small groups rather than a mesh network like the HAManager. This allows for greater scalability and removes the need for core group bridges. All IM components can use BBSON, and WAS WLM can also use BBSON.

The SON thread pool sizes may be set with cell custom properties son.tcpInThreadPoolMin, son.tcpInThreadPoolMax, son.tcpOutThreadPoolMin, and son.tcpOutThreadPoolMax.

High Availability Deployment Manager (HADMGR)

The high availability deployment manager allows multiple instances of the deployment manager to share the same configuration (using a networked filesystem) so that the deployment manager is no longer a single point of failure if one instance is unavailable. The HADMGR must be accessed through an On Demand Router (ODR), which routes to one of the active deployment managers. The deployment manager can be very chatty, making many small file I/O accesses, so the performance of the networked filesystem is critical.

PMI

In WAS ND 8.5 and above, if you are not using any Intelligent Management capabilities, PMI can be disabled completely by setting the cell custom property LargeTopologyOptimization=false, disabling PMI, and restarting:

Intelligent Management, which is part of WebSphere Application Server V8.5.0.0 and later, requires the default PMI counters to be enabled. It is not possible to disable PMI or the default PMI counters when using Intelligent Management capabilities. If no Intelligent Management capabilities will ever be used, then the property described in this fix can be used to disable Intelligent Management. In turn, this allows disabling the PMI Monitoring Infrastructure's default PMI counters.

  1. System Administration > Cell > Additional Properties > Custom Properties > New
    1. Name: LargeTopologyOptimization
    2. Value: false
    3. OK
  2. Servers > Server Types > WebSphere application servers > $SERVER > Performance > Performance Monitoring Infrastructure (PMI)
    1. Uncheck "Enable Performance Monitoring Infrastructure"
    2. OK
  3. Review
    1. Check "Synchronize changes with Nodes"
    2. Save
  4. Restart $SERVER