Configuration Analysis

WAS traditional Extracting Properties

A subset of configuration properties may be extracted using an MBean: https://www.ibm.com/support/knowledgecenter/en/SSAW57_9.0.5/com.ibm.websphere.nd.multiplatform.doc/ae/txml_7propsfile.html

Additional background on properties-based configuration, including this note:

You cannot extract whole configuration properties from one cell, and apply to another empty cell to clone your environment.

WebSphere Application Server Configuration Visualizer

The WAS Configuration Visualizer visualizes a WAS traditional config directory in an HTML page: https://www.ibm.com/support/pages/websphere-application-server-configuration-visualizer

WebSphere Application Server Configuration Comparison Tool

The following tool performs configuration comparison for WAS traditional: https://www.ibm.com/support/pages/websphere-application-server-configuration-comparison-tool

General Health Check Points

  1. Ask for any previous health checks that have been done
  2. Involve the account team throughout the process
  3. Ask what the current problems and pain points are
  4. If there is time, perform a preliminary review of findings and recommendations (e.g. in the middle of the health check) to make sure you're on the right track and covering the key areas
  5. Mark findings and recommendations with a priority level, effort level, category (e.g. applications, performance, up-time, product level, architecture, security, configuration, past issues, problem determination, logging and monitoring, HA/DR, etc.), environment, status (needs attention, information only, not optimal, etc.), etc.
    1. Create a spreadsheet with titles of findings/recommendations and a column for each of the above. This allows different people to quickly filter to what's important to them.
  6. Point out things that are going well
  7. For each recommendation, end with the reason; for example, "Change configuration X to improve resiliency"
  8. What are the response time targets, and what are the observations?
  9. What are the CPU and memory utilization targets, and what are the observations?
  10. How does testing work? How does performance testing compare to production?
  11. Which highly available services are used (e.g. transaction logs, session replication/persistence)?
  12. Create an architecture diagram
  13. Invite relevant management for the final presentation/review
  14. Review:
    1. Software versions
    2. Hardware configuration (e.g. CPU number/speed, RAM amount, etc.)
    3. Architecture of component interactions
    4. CPU utilization over time
    5. Memory utilization over time
    6. Network utilization over time
    7. Disk utilization over time
    8. Software logs (WAS, Java, OS, etc.) for warnings/errors
    9. Process arguments
    10. Proportion of time in garbage collection over time
    11. Longest garbage collection pause times
    12. Thread pool utilization over time (WAS, IHS, etc.)
    13. Connection pool (e.g. DB, JMS) utilization over time
    14. Timeouts
    15. Security configuration
    16. Operating system core hard ulimits and how cores are saved/truncated
    17. Review cache usage (e.g. HTTP sessions)
    18. Difference in behavior (response times, GC, etc.) between similar cluster members
    19. Memory leaks
    20. Review all components (WAS, IHS, etc.)
    21. Review the recipes from this cookbook

WAS traditional Health Check

Gather WAS traditional Health Check Data

  1. Run the collector on the deployment manager: https://www.ibm.com/support/knowledgecenter/en/SSAW57_9.0.5/com.ibm.websphere.nd.multiplatform.doc/ae/ttrb_runct.html
    1. Log in as the same user that's running WAS
    2. mkdir -p /tmp/was/
    3. cd /tmp/was
    4. export IBM_JAVA_OPTIONS="-Xmx2g"
    5. ${WAS}/profiles/${PROFILE}/bin/collector.sh
    6. Gather the file named *WASenv.jar
  2. Run the collector on at least one random application node (for log analysis); ideally, all.
  3. Gather any historical operating system statistics such as nmon, perfmon, etc.
  4. Upload all *WASenv.jar files (there should be at least 2) and any OS statistics if available.
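
The collector steps above could be wrapped in a small script so that they run the same way on every node. The following is a minimal sketch under stated assumptions: run_collector is a hypothetical helper name, the WAS install root and profile name are passed as arguments, and the work directory defaults to the /tmp/was directory used above.

```shell
#!/bin/sh
# Sketch of steps 1.2 through 1.6; run as the same user that runs WAS.
# run_collector is a hypothetical helper name, not part of the product.
# $1 = WAS install root, $2 = profile name, $3 = work directory (default /tmp/was)
run_collector() {
  was_home="$1"
  profile="$2"
  workdir="${3:-/tmp/was}"
  collector="${was_home}/profiles/${profile}/bin/collector.sh"

  if [ ! -x "$collector" ]; then
    echo "collector not found or not executable: $collector" >&2
    return 1
  fi

  mkdir -p "$workdir" || return 1
  (
    # As in the steps above, run the collector from the work directory,
    # where the output jar is expected to be written.
    cd "$workdir" || exit 1
    IBM_JAVA_OPTIONS="-Xmx2g"
    export IBM_JAVA_OPTIONS
    "$collector"
  ) || return 1

  # List the resulting jar(s) to upload.
  ls "$workdir"/*WASenv.jar
}
```

For example, run_collector "${WAS}" "${PROFILE}" runs the collector for that profile and prints the path of each *WASenv.jar file to upload.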

Analyze WAS traditional Health Check Data

Analyze (after expanding the *WASenv.jar files):

  1. WAS versions:
    1. find . -type f -name "*SystemOut*log*" -exec grep -H "^WebSphere" {} \; | awk '{print $(NF-4),$3}' | sort | uniq
    2. find . -type f -name node-metadata.properties -exec grep -H ProductVersion {} \; | grep -v -e wxdop -e xdProduct | sed 's/.*nodes\///g' | sed 's/\/node-metadata.*:/: /g' | sort
  2. Operating system versions:
    1. find . -type f -name "*SystemOut*log*" -exec grep -H "Host Operating System is" {} \; | sed 's/.* is //g' | sort | uniq
  3. Java versions:
    1. find . -type f -name "*SystemOut*log*" -exec grep -H "Java version = " {} \; | sed 's/.*Java version = //g' | sort | uniq
  4. Max file descriptors:
    1. find . -type f -name "*SystemOut*log*" -exec grep -H "Max file descriptor count = " {} \; | sed 's/.*://g' | sort | uniq
  5. If needed, review installed APARs: AppServer/properties/version/installed.xml
  6. Check for hung thread warnings:
    1. find . -type f -name "*SystemOut*log*" -exec grep -Hn WSVR0605W {} \;
  7. Find warnings and errors in SystemOut* logs:
    1. find . -type f -name "*SystemOut*log*" -exec grep -H " [WE] " {} \; > sysout_warnings_errors.txt
      1. awk '{print $7}' sysout_warnings_errors.txt | grep "[WE]:$" | sort | uniq -c | sort -nr | head -10
  8. Find warnings and errors in SystemErr* logs:
    1. find . -name "*SystemErr*log" -exec grep -H "." {} \; | grep -v -e "SystemErr[[:blank:]]\+R[[:blank:]]\+at" -e "Display Current Environment"
  9. Find log message rate:
    1. find . -type f \( -name "*SystemOut*log*" -or -name "messages*log*" \) -exec grep "^\[.*" {} \; | awk '{print $1,$2}' | sed 's/:[0-9][0-9][0-9]$//g' | sed 's/\[//g' | sort | uniq -c
  10. Find startup trace specification:
    1. find . -type f -name server.xml -not -path "*template*" -exec grep -H "startupTraceSpecification" {} \; | sed 's/:.* startupTraceSpecification="/ /g' | sed 's/".*//g'
  11. Find applications deployed by cluster:
    1. find . -type f -name deployment.xml -exec grep -H deploymentTargets {} \; | grep -v -e ibmasyncrsp -e isclite -e OTiS -e WebSphereWSDM | sed 's/:.*name="/ /g' | sed 's/".*//g' | sed 's/\// /g' | awk '{print $(NF),$(NF-4)}' | sort | uniq | sort -k 2
  12. Find generic JVM arguments:
    1. find . -type f -name server.xml -not -path "*template*" -exec grep -H genericJvm {} \; | sed 's/:.*genericJvmArguments="\([^"]*\)".*/: \1/g' | grep -v ': $'
    2. find . -type f -name server.xml -not -path "*templates*" -exec grep -H systemProperties {} \; | grep -v -e 'value="off"' -e java.awt.headless
  13. Find where unexpected JVM debugging is enabled:
    1. find . -type f -name server.xml -not -path "*template*" -exec grep -H genericJvm {} \; | grep -e 'verboseModeClass="true"' -e 'verboseModeJNI="true"' -e 'runHProf="true"' -e 'debugMode="true"'
  14. Find LTPA timeout (minutes):
    1. find . -type f -name security.xml -exec grep -H system.LTPA {} \; | sed 's/:.*timeout/: timeout/g' | sed 's/" .*/"/g'
  15. Check if "Enable failover of transaction log recovery" is enabled (disabled may not show in the grep):
    1. find . -type f -name cluster.xml -exec grep -H enableHA {} \; | sed 's/:.*enableHA/: enableHA/g' | sed 's/>//g'
  16. See if the transaction log directory has been modified (default of a local directory shows no results):
    1. find . -type f -name serverindex.xml -exec grep recoveryLog {} \;
  17. Find transaction timeout settings:
    1. find . -type f -name server.xml -not -path "*template*" -exec grep -H "services.*TransactionService" {} \; | sed 's/:.*total/: total/g'
      1. totalTranLifetimeTimeout = Total transaction lifetime timeout
      2. propogatedOrBMTTranLifetimeTimeout = Maximum transaction timeout
      3. clientInactivityTimeout = Client inactivity timeout
      4. asyncResponseTimeout = Async response timeout
  18. Review defined resources and their scopes:
    1. find . -type f -name resources.xml -not -path "*template*" -exec grep -H "<resources" {} \; | sed 's/xmi.*name/name/g' | sed 's/ description.*//g' | sed 's/" .*/"/g' | sort -k 2 | grep -v -e URLProvider -e MailProvider -e JavaEEDefaultResources
  19. Check if IBM Service Log (activity.log) is enabled:
    1. find . -type f -name server.xml -not -path "*template*" -exec grep -H "serviceLog.*true" {} \;
  20. Check the type of log rollover (SIZE, TIME, or BOTH):
    1. find . -type f -name server.xml -not -path "*template*" -exec grep -H rolloverType {} \; | sed 's/:.*rolloverType="/: /g' | sed 's/".*//g'
  21. Check rollover sizes of logs:
    1. find . -type f -name server.xml -not -path "*template*" -exec grep -H "rolloverType=\"[SB]" {} \; | sed 's/:.*fileName="/: /g' | sed 's/".*maxNumberOfBackupFiles="\([^"]\+\)"/ maxNumberOfBackupFiles \1/g' | sed 's/rolloverSize="\([^"]\+\)"/rolloverSize \1;/g' | sed 's/;.*//g'
  22. Find thread pool sizes:
    1. find . -type f -name server.xml -not -path "*template*" -exec grep -H "<threadPool.*name" {} \; | grep -v "<threadPool .*ORB" | grep -e WebContainer -e Default -e Message.Listener.Pool -e ORB.thread.pool -e SIBJMSRAThreadPool -e WMQJCAResourceAdapter | sed 's/^\(.*\): .*minimumSize="\([^"]\+\)".*maximumSize="\([^"]\+\)".*name="\([^"]\+\)".*/\1 \4 \2 \3/g'
  23. Find JMS activation specifications:
    1. find . -type f -name resources.xml -not -path "*templates*" -exec grep -H -A 20 j2cActivationSpec.*name= {} \; | grep -e j2cActivationSpec -e maxConcurrency | sed 's/\(.*\): .*jndiName="\([^"]\+\)".*/\1 \2/g' | sed 's/\(.*\)- .*maxConcurrency.*value="\([^"]\+\)".*/\1 maxConcurrency \2/g'
  24. Find listener ports:
    1. find . -type f -name server.xml -not -path "*templates*" -exec grep -H "<listenerPorts" {} \; | sed 's/^\(.*\): .*name="\([^"]\+\)".*maxSessions="\([^"]\+\)".*maxMessages="\([^"]\+\)".*/\1 \2 maxSessions \3 maxMessages \4/g'
  25. Find data source maximum connections:
    1. find . -type f -name resources.xml -not -path "*templates*" -exec grep -H -A 50 "<factories.*DataSource" {} \; | grep -e "<factories.*DataSource" -e "<connectionPool" | sed 's/\(.*\): .*jndiName="\([^"]\+\)".*/\1 \2/g' | sed 's/\(.*\)- .*maxConnections="\([^"]\+\)".*/\1 maxConnections \2/g' | grep -B 1 maxConnections | grep -v "\-\-"
  26. Find list of all custom properties:
    1. find . -type f -name "*xml" -not -path "*templates*" -exec grep "<properties " {} \; | grep com.ibm | sed 's/.*name="\([^"]\+\)".*/\1/g' | sort | uniq
  27. Review operating system statistics
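
The log message rate pipeline in step 9 is easier to follow when broken into its stages. The following sketch applies the same stages to log lines on standard input; log_rate_per_second is a hypothetical helper name, and the timestamp layout ([date time:millis zone]) is assumed to be the default SystemOut.log format.

```shell
#!/bin/sh
# Same pipeline as step 9, reading log lines from standard input.
# log_rate_per_second is a hypothetical helper name.
# Stages:
#   grep  keeps only lines that start with a "[" timestamp
#   awk   keeps the first two fields: "[date" and "time:millis"
#   sed   drops the trailing ":millis", then the leading "["
#   sort | uniq -c  counts the lines seen in each distinct second
log_rate_per_second() {
  grep "^\[.*" |
    awk '{print $1,$2}' |
    sed 's/:[0-9][0-9][0-9]$//g' |
    sed 's/\[//g' |
    sort | uniq -c
}
```

For example, find . -type f -name "*SystemOut*log*" -exec cat {} \; | log_rate_per_second | sort -nr | head lists the busiest seconds first.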

WebSphere Liberty Health Check

Gather WebSphere Liberty Health Check Data

  1. If attaching to the running process to dump basic information and a thread dump is an acceptable risk:
    1. Liberty server dump: https://www.ibm.com/support/knowledgecenter/en/SSAW57_liberty/com.ibm.websphere.wlp.nd.multiplatform.doc/ae/twlp_setup_dump_server.html
      1. Log in as the same user that's running Liberty
      2. ${WAS}/bin/server dump ${NAME} --include=thread
      3. Gather the file as noted in the message: Server ${NAME} dump complete in ${FILE}
  2. Otherwise:
    1. server.xml and any included xml files
    2. jvm.options (if any)
    3. bootstrap.properties (if any)
    4. Container logs and any messages.log and FFDC
  3. Gather any historical operating system statistics such as nmon, perfmon, etc.
  4. Upload all Liberty file collections and any OS statistics if available.
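
When a server dump is not an option, the files in step 2 could be gathered into one archive. The following is a minimal sketch, assuming the default layout in which server.xml, jvm.options, bootstrap.properties, and the logs directory (messages.log and FFDC) all sit in the server directory; collect_liberty_files is a hypothetical helper name. Included XML files stored elsewhere and container logs are not handled and must be added by hand.

```shell
#!/bin/sh
# Sketch of step 2: archive the configuration and logs of one Liberty server
# without attaching to the running process.
# collect_liberty_files is a hypothetical helper name.
# $1 = server directory (the one containing server.xml)
# $2 = output tar file (use an absolute path, because the function changes
#      directory before writing the archive)
collect_liberty_files() {
  server_dir="$1"
  archive="$2"

  if [ ! -f "$server_dir/server.xml" ]; then
    echo "no server.xml in $server_dir" >&2
    return 1
  fi

  (
    cd "$server_dir" || exit 1
    # Start with server.xml, then add the optional items that exist.
    set -- server.xml
    for item in jvm.options bootstrap.properties logs; do
      [ -e "$item" ] && set -- "$@" "$item"
    done
    tar -cf "$archive" "$@"
  )
}
```

The archive may contain sensitive configuration values, so review it before uploading.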

Analyze WebSphere Liberty Health Check Data

  1. Check for transaction manager configuration (e.g. transactionLogDirectory if storing trans logs on shared disk, nested dataSource if storing trans logs in DB, etc.):
    1. find . -type f -name "*xml" -exec grep -H "<transaction" {} \;
  2. Find Liberty versions:
    1. find . -type f -name "*messages*log*" -exec grep -H "product = " {} \; | sed 's/.*:product = //g' | sort | uniq
  3. Find Java versions:
    1. find . -type f -name "*messages*log*" -exec grep -H "java.runtime = " {} \; | sed 's/.*:java.runtime = //g' | sort | uniq
  4. Find operating system versions:
    1. find . -type f -name "*messages*log*" -exec grep -H "os = " {} \; | sed 's/.*:os = //g' | sort | uniq
  5. Review JVM parameters:
    1. find . -type f -name "*jvm.options*" -exec grep -Hn "." {} \;
  6. Review server.xml configuration for best practices
  7. Find warnings and errors:
    1. find . -type f -name "*messages*log*" -exec grep -H " [WE] " {} \; > messages_warnings_errors.txt
      1. awk '{print $7}' messages_warnings_errors.txt | grep "[WE]:$" | sort | uniq -c | sort -nr | head -10
  8. Find log message rate:
    1. find . -type f \( -name "*SystemOut*log*" -or -name "messages*log*" \) -exec grep "^\[.*" {} \; | awk '{print $1,$2}' | sed 's/:[0-9][0-9][0-9]$//g' | sed 's/\[//g' | sort | uniq -c
  9. Review operating system statistics
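
The warning and error summary in step 7 can also be run as a single pass over log text. The following sketch uses the same stages, reading from standard input; top_message_ids is a hypothetical helper name, and it assumes the default messages.log layout in which the message ID is the seventh whitespace-separated field.

```shell
#!/bin/sh
# Same stages as step 7, reading log lines from standard input.
# top_message_ids is a hypothetical helper name.
# Stages:
#   grep " [WE] "   keeps lines logged at warning or error level
#   awk $7          extracts the message ID field (assumes the default layout)
#   grep "[WE]:$"   keeps IDs whose severity suffix is W or E
#   sort | uniq -c | sort -nr | head -10  prints the ten most frequent IDs
top_message_ids() {
  grep " [WE] " |
    awk '{print $7}' |
    grep "[WE]:$" |
    sort | uniq -c | sort -nr | head -10
}
```

For example, find . -type f -name "*messages*log*" -exec cat {} \; | top_message_ids prints the most frequent warning and error message IDs across all collected logs.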

Java Health Check

Gather:

  1. Gather 10 thread dumps about 30 seconds apart on one JVM during normal load; for example create a script and pass the PID as an argument and then upload stdout:
    #!/bin/sh
    for i in $(seq 1 10); do
      kill -3 "$1"
      sleep 30
    done
  2. Gather and upload all JVM and application logs for one JVM
  3. If you would like to review Java heap utilization, gather a core dump (J9 JVM) or heapdump (HotSpot JVM) (note that this will pause the JVM for dozens of seconds so it should be done at off-peak times and it may have sensitive contents)
  4. If you would like to gather sampling profiler data, capture and upload 5 minutes' worth of data.

Analyze:

  1. Find JVM diagnostics:
    1. find . -type f \( -name "*javacore*txt" -or -name "*phd" -or -name "*dmp" -or -name "*trc" -or -name "*hcd" \)
    2. find . -type f \( -name "*stderr*log*" -or -name "*console*log*" \) -exec grep -H JVM {} \;
  2. Find longest GC pauses:
    1. find . -type f \( -name "*verbosegc*log*" -or -name "*stderr*log*" -or -name "*console*log*" \) -exec grep -H exclusive-end {} \; | sed 's/:</ </g' | awk '{print $(NF-1),$(NF-2),$1}' | sed 's/"//g' | sed 's/timestamp=//g' | sed 's/durationms=//g' | sort -nr | head
  3. Find verbosegc warnings:
    1. find . -type f \( -name "*verbosegc*log*" -or -name "*stderr*log*" -or -name "*console*log*" \) -exec grep -H "<warning" {} \;
  4. Find if verbose classloading is enabled:
    1. find . -type f \( -name "*stderr*log*" -or -name "*console*log*" \) -exec grep -H "class load:" {} \;
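
The longest GC pauses pipeline in step 2 relies on the position of the timestamp and durationms attributes in each exclusive-end line. The following sketch applies the same transformation to verbosegc text on standard input, without the file name column; longest_gc_pauses is a hypothetical helper name, and it assumes J9 verbosegc lines in which durationms is the last attribute and timestamp the one before it.

```shell
#!/bin/sh
# Same transformation as step 2, applied to verbosegc text on standard input.
# longest_gc_pauses is a hypothetical helper name; it prints
# "<duration in ms> <timestamp>" with the longest pauses first.
# Assumes each line ends with: timestamp="..." durationms="..." />
longest_gc_pauses() {
  grep exclusive-end |
    awk '{print $(NF-1),$(NF-2)}' |
    sed 's/"//g' |
    sed 's/timestamp=//g' |
    sed 's/durationms=//g' |
    sort -nr |
    head
}
```

For example, longest_gc_pauses < verbosegc.log prints up to ten pauses for one file, which makes it easy to compare cluster members one at a time.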

IBM HTTP Server and WAS Plugin Health Check

Gather on at least one node (ideally, all):

  1. httpd.conf and any included *conf files
  2. plugin-cfg.xml
  3. access_log and error_log
  4. http_plugin.log

Analyze:

  1. Find logging configuration:
    1. find . -type f -name "*conf*" -exec grep -H -e LogFormat -e CustomLog {} \; | grep -v -e ":#" -e /templates/
  2. Find if IHS threads are saturated or nearly saturated:
    1. find . -type f -name "*error_log*" -exec grep -H mpmstats {} \; | grep "rdy . "
  3. Find HTTP 5XX errors:
    1. find . -type f -name "*access_log*" -exec grep -H "HTTP/1.1\" 5" {} \;
  4. Find any non-informational entries in WAS plugin log:
    1. find . -type f -name "*http_plugin*log*" -exec grep -H "." {} \;
  5. Find key WAS Plugin configuration:
    1. find . -type f -name plugin-cfg.xml -not -path "*templates*" -exec grep -Hn -e ServerIOTimeout -e ConnectTimeout {} \;
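
When step 3 returns many lines, a tally by status code is often easier to read. The following is a minimal sketch, assuming the status code is the ninth whitespace-separated field, as in the Apache common and combined log formats; a custom LogFormat may place it elsewhere. count_5xx_by_status is a hypothetical helper name.

```shell
#!/bin/sh
# Tally 5XX responses by status code from access log lines on standard input.
# Assumes the status code is the 9th whitespace-separated field, as in the
# Apache common and combined log formats; adjust for a custom LogFormat.
# count_5xx_by_status is a hypothetical helper name.
count_5xx_by_status() {
  awk '$9 ~ /^5[0-9][0-9]$/ { count[$9]++ }
       END { for (s in count) print s, count[s] }' | sort
}
```

For example, find . -type f -name "*access_log*" -exec cat {} \; | count_5xx_by_status prints one line per 5XX status code with its count.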

Linux Configuration Health Check

Gather the following as root in a bash shell (the &> and &>> redirections are bash syntax) and upload healthcheck_linux*.txt:

date &> healthcheck_linux_$(hostname).txt
echo "=== hostname ===" &>> healthcheck_linux_$(hostname).txt
hostname &>> healthcheck_linux_$(hostname).txt
echo "=== uname ===" &>> healthcheck_linux_$(hostname).txt
uname -a &>> healthcheck_linux_$(hostname).txt
echo "=== cmdline ===" &>> healthcheck_linux_$(hostname).txt
cat /proc/cmdline &>> healthcheck_linux_$(hostname).txt
echo "=== cpuinfo ===" &>> healthcheck_linux_$(hostname).txt
cat /proc/cpuinfo &>> healthcheck_linux_$(hostname).txt
echo "=== lscpu ===" &>> healthcheck_linux_$(hostname).txt
lscpu &>> healthcheck_linux_$(hostname).txt
echo "=== meminfo ===" &>> healthcheck_linux_$(hostname).txt
cat /proc/meminfo &>> healthcheck_linux_$(hostname).txt
echo "=== sysctl ===" &>> healthcheck_linux_$(hostname).txt
sysctl -a &>> healthcheck_linux_$(hostname).txt
echo "=== messages ===" &>> healthcheck_linux_$(hostname).txt
cat /var/log/messages &>> healthcheck_linux_$(hostname).txt
echo "=== syslog ===" &>> healthcheck_linux_$(hostname).txt
cat /var/log/syslog &>> healthcheck_linux_$(hostname).txt
echo "=== journal ===" &>> healthcheck_linux_$(hostname).txt
journalctl --since "7 days ago" &>> healthcheck_linux_$(hostname).txt
echo "=== netstat ===" &>> healthcheck_linux_$(hostname).txt
netstat -s &>> healthcheck_linux_$(hostname).txt
echo "=== nstat ===" &>> healthcheck_linux_$(hostname).txt
nstat -asz &>> healthcheck_linux_$(hostname).txt
echo "=== top ===" &>> healthcheck_linux_$(hostname).txt
top -b -d 1 -n 2 &>> healthcheck_linux_$(hostname).txt
echo "=== top -H ===" &>> healthcheck_linux_$(hostname).txt
top -H -b -d 1 -n 2 &>> healthcheck_linux_$(hostname).txt
echo "=== ps ===" &>> healthcheck_linux_$(hostname).txt
ps -elfyww &>> healthcheck_linux_$(hostname).txt
echo "=== iostat ===" &>> healthcheck_linux_$(hostname).txt
iostat -xm 1 2 &>> healthcheck_linux_$(hostname).txt
echo "=== ip addr ===" &>> healthcheck_linux_$(hostname).txt
ip addr &>> healthcheck_linux_$(hostname).txt
echo "=== ip -s ===" &>> healthcheck_linux_$(hostname).txt
ip -s link &>> healthcheck_linux_$(hostname).txt
echo "=== ss summary ===" &>> healthcheck_linux_$(hostname).txt
ss --summary &>> healthcheck_linux_$(hostname).txt
echo "=== ss ===" &>> healthcheck_linux_$(hostname).txt
ss -amponeti &>> healthcheck_linux_$(hostname).txt
echo "=== nstat ===" &>> healthcheck_linux_$(hostname).txt
nstat -saz &>> healthcheck_linux_$(hostname).txt
echo "=== netstat -i ===" &>> healthcheck_linux_$(hostname).txt
netstat -i &>> healthcheck_linux_$(hostname).txt
echo "=== netstat -s ===" &>> healthcheck_linux_$(hostname).txt
netstat -s &>> healthcheck_linux_$(hostname).txt
echo "=== netstat -anop ===" &>> healthcheck_linux_$(hostname).txt
netstat -anop &>> healthcheck_linux_$(hostname).txt
echo "=== systemd-cgtop ===" &>> healthcheck_linux_$(hostname).txt
systemd-cgtop -b --depth=5 -d 1 -n 2 &>> healthcheck_linux_$(hostname).txt
echo "=== journalctl -b ===" &>> healthcheck_linux_$(hostname).txt
journalctl -b | head -2000 &>> healthcheck_linux_$(hostname).txt
echo "=== journalctl -b -n ===" &>> healthcheck_linux_$(hostname).txt
journalctl -b -n 2000 &>> healthcheck_linux_$(hostname).txt
echo "=== journalctl warning ===" &>> healthcheck_linux_$(hostname).txt
journalctl -p warning -n 500 &>> healthcheck_linux_$(hostname).txt
echo "=== ulimit ===" &>> healthcheck_linux_$(hostname).txt
ulimit -a &>> healthcheck_linux_$(hostname).txt
echo "=== df -h ===" &>> healthcheck_linux_$(hostname).txt
df -h &>> healthcheck_linux_$(hostname).txt
echo "=== systemctl list-units ===" &>> healthcheck_linux_$(hostname).txt
systemctl list-units &>> healthcheck_linux_$(hostname).txt
echo "=== systemd-cgls ===" &>> healthcheck_linux_$(hostname).txt
systemd-cgls &>> healthcheck_linux_$(hostname).txt
echo "=== pstree ===" &>> healthcheck_linux_$(hostname).txt
pstree -pT &>> healthcheck_linux_$(hostname).txt
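
The header-then-command pattern above repeats for every item, so it could be factored into a helper. The following is a minimal sketch, not a replacement for the list above: section and HC_OUT are hypothetical names, and the output file defaults to the same healthcheck_linux_$(hostname).txt name. It uses only POSIX redirections, so it does not depend on bash.

```shell
#!/bin/sh
# Hypothetical helper: append a "=== command ===" header followed by the
# command's output (stdout and stderr) to the report file.
# HC_OUT may be set to override the report file name.
section() {
  out="${HC_OUT:-healthcheck_linux_$(hostname).txt}"
  echo "=== $* ===" >> "$out"
  # Record a non-zero exit status instead of stopping, since some commands
  # or files (for example /var/log/syslog) do not exist on every distribution.
  "$@" >> "$out" 2>&1 || echo "(command exited with status $?)" >> "$out"
}
```

For example, section uname -a and section cat /proc/meminfo each add one labeled block to the report. Entries that use a pipe need a wrapping shell, such as section sh -c 'journalctl -b | head -2000'.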

IBM Visual Configuration Explorer (VCE)

The IBM Visual Configuration Explorer (VCE) tool is no longer publicly available.