Configuration Analysis
WAS traditional Extracting Properties
A subset of configuration properties may be extracted using an MBean: https://www.ibm.com/support/knowledgecenter/en/SSAW57_9.0.5/com.ibm.websphere.nd.multiplatform.doc/ae/txml_7propsfile.html
Additional background on properties-based configuration, including this note:
You cannot extract whole configuration properties from one cell, and apply to another empty cell to clone your environment.
WebSphere Application Server Configuration Visualizer
The WAS Configuration Visualizer visualizes a WAS traditional config directory in an HTML page: https://www.ibm.com/support/pages/websphere-application-server-configuration-visualizer
WebSphere Application Server Configuration Comparison Tool
The following tool performs configuration comparison for WAS traditional: https://www.ibm.com/support/pages/websphere-application-server-configuration-comparison-tool
General Health Check Points
- Ask for any previous health checks that have been done
- Involve the account team throughout the process
- Ask what the current problems and pain points are
- If there is time, perform a preliminary review of findings and recommendations (e.g. in the middle of the health check) to make sure you're on the right track and covering the key areas
- Mark findings and recommendations with a priority level, effort
level, category (e.g. applications, performance, up-time, product level,
architecture, security, configuration, past issues, problem
determination, logging and monitoring, HA/DR, etc.), environment, status
(needs attention, information only, not optimal, etc.), etc.
- Create a spreadsheet with titles of findings/recommendations with a column for each of the above. This allows different people to quickly filter to what's important to them.
- Point out things that are going well
- For each recommendation, end with the reason; for example, "Change configuration X to improve resiliency"
- What are the response time targets, and what are the observations?
- What are the CPU and memory utilization targets, and what are the observations?
- How does testing work? How does performance testing compare to production?
- Which highly available services are used (e.g. transaction logs, session replication/persistence)?
- Create an architecture diagram
- Invite relevant management for the final presentation/review
- Review:
- Software versions
- Hardware configuration (e.g. CPU number/speed, RAM amount, etc.)
- Architecture of component interactions
- CPU utilization over time
- Memory utilization over time
- Network utilization over time
- Disk utilization over time
- Software logs (WAS, Java, OS, etc.) for warnings/errors
- Process arguments
- Proportion of time in garbage collection over time
- Longest garbage collection pause times
- Thread pool utilization over time (WAS, IHS, etc.)
- Connection pool (e.g. DB, JMS) utilization over time
- Timeouts
- Security configuration
- Operating system core hard ulimits and how cores are saved/truncated
- Review cache usage (e.g. HTTP sessions)
- Difference in behavior (response times, GC, etc.) between similar cluster members
- Memory leaks
- Review all components (WAS, IHS, etc.)
- Review the recipes from this cookbook
WAS traditional Health Check
Gather WAS traditional Health Check Data
- Run the collector on the deployment manager: https://www.ibm.com/support/knowledgecenter/en/SSAW57_9.0.5/com.ibm.websphere.nd.multiplatform.doc/ae/ttrb_runct.html
- Log in as the same user that's running WAS
mkdir -p /tmp/was/
cd /tmp/was
export IBM_JAVA_OPTIONS="-Xmx2g"
${WAS}/profiles/${PROFILE}/bin/collector.sh
- Gather the file named *WASenv.jar
- Run the collector on at least one random application node (for log analysis); ideally, all.
- Gather any historical operating system statistics such as nmon, perfmon, etc.
- Upload all *WASenv.jar files (there should be at least 2) and any OS statistics if available.
Analyze WAS traditional Health Check Data
Analyze (after expanding the *WASenv.jar files):
- WAS versions:
find . -type f -name "*SystemOut*log*" -exec grep -H "^WebSphere" {} \; | awk '{print $(NF-4),$3}' | sort | uniq
find . -type f -name node-metadata.properties -exec grep -H ProductVersion {} \; | grep -v -e wxdop -e xdProduct | sed 's/.*nodes\///g' | sed 's/\/node-metadata.*:/: /g' | sort
- Operating system versions:
find . -type f -name "*SystemOut*log*" -exec grep -H "Host Operating System is" {} \; | sed 's/.* is //g' | sort | uniq
- Java versions:
find . -type f -name "*SystemOut*log*" -exec grep -H "Java version = " {} \; | sed 's/.*Java version = //g' | sort | uniq
- Max file descriptors:
find . -type f -name "*SystemOut*log*" -exec grep -H "Max file descriptor count = " {} \; | sed 's/.*://g' | sort | uniq
- If needed, review installed APARs: AppServer/properties/version/installed.xml
- Check for hung thread warnings:
find . -type f -name "*SystemOut*log*" -exec grep -Hn WSVR0605W {} \;
- Find warnings and errors in SystemOut* logs:
find . -type f -name "*SystemOut*log*" -exec grep -H " [WE] " {} \; > sysout_warnings_errors.txt
awk '{print $7}' sysout_warnings_errors.txt | grep "[WE]:$" | sort | uniq -c | sort -nr | head -10
- Find warnings and errors in SystemErr* logs:
find . -name "*SystemErr*log" -exec grep -H "." {} \; | grep -v -e "SystemErr[[:blank:]]\+R[[:blank:]]\+at" -e "Display Current Environment"
- Find log message rate:
find . -type f \( -name "*SystemOut*log*" -or -name "messages*log*" \) -exec grep "^\[.*" {} \; | awk '{print $1,$2}' | sed 's/:[0-9][0-9][0-9]$//g' | sed 's/\[//g' | sort | uniq -c
- Find startup trace specification:
find . -type f -name server.xml -not -path "*template*" -exec grep -H "startupTraceSpecification" {} \; | sed 's/:.* startupTraceSpecification="/ /g' | sed 's/".*//g'
- Find applications deployed by cluster:
find . -type f -name deployment.xml -exec grep -H deploymentTargets {} \; | grep -v -e ibmasyncrsp -e isclite -e OTiS -e WebSphereWSDM | sed 's/:.*name="/ /g' | sed 's/".*//g' | sed 's/\// /g' | awk '{print $(NF),$(NF-4)}' | sort | uniq | sort -k 2
- Find generic JVM arguments:
find . -type f -name server.xml -not -path "*template*" -exec grep -H genericJvm {} \; | sed 's/:.*genericJvmArguments="\([^"]*\)".*/: \1/g' | grep -v ': $'
find . -type f -name server.xml -not -path "*templates*" -exec grep -H systemProperties {} \; | grep -v -e 'value="off"' -e java.awt.headless
- Find where unexpected JVM debugging is enabled:
find . -type f -name server.xml -not -path "*template*" -exec grep -H genericJvm {} \; | grep -e 'verboseModeClass="true"' -e 'verboseModeJNI="true"' -e 'runHProf="true"' -e 'debugMode="true"'
- Find LTPA timeout (minutes):
find . -type f -name security.xml -exec grep -H system.LTPA {} \; | sed 's/:.*timeout/: timeout/g' | sed 's/" .*/"/g'
- Check if "Enable failover of transaction log recovery" is enabled
(disabled may not show in the grep):
find . -type f -name cluster.xml -exec grep -H enableHA {} \; | sed 's/:.*enableHA/: enableHA/g' | sed 's/>//g'
- See if the transaction log directory has been modified (default of a
local directory shows no results):
find . -type f -name serverindex.xml -exec grep recoveryLog {} \;
- Find transaction timeout settings:
find . -type f -name server.xml -not -path "*template*" -exec grep -H "services.*TransactionService" {} \; | sed 's/:.*total/: total/g'
- totalTranLifetimeTimeout = Total transaction lifetime timeout
- propogatedOrBMTTranLifetimeTimeout = Maximum transaction timeout
- clientInactivityTimeout = Client inactivity timeout
- asyncResponseTimeout = Async response timeout
- Review defined resources and their scopes:
find . -type f -name resources.xml -not -path "*template*" -exec grep -H "<resources" {} \; | sed 's/xmi.*name/name/g' | sed 's/ description.*//g' | sed 's/" .*/"/g' | sort -k 2 | grep -v -e URLProvider -e MailProvider -e JavaEEDefaultResources
- Check if IBM Service Log (activity.log) is enabled:
find . -type f -name server.xml -not -path "*template*" -exec grep -H "serviceLog.*true" {} \;
- Check the type of log rollover (SIZE, TIME, or BOTH):
find . -type f -name server.xml -not -path "*template*" -exec grep -H rolloverType {} \; | sed 's/:.*rolloverType="/: /g' | sed 's/".*//g'
- Check rollover sizes of logs:
find . -type f -name server.xml -not -path "*template*" -exec grep -H "rolloverType=\"[SB]" {} \; | sed 's/:.*fileName="/: /g' | sed 's/".*maxNumberOfBackupFiles="\([^"]\+\)"/ maxNumberOfBackupFiles \1/g' | sed 's/rolloverSize="\([^"]\+\)"/rolloverSize \1;/g' | sed 's/;.*//g'
- Find thread pool sizes:
find . -type f -name server.xml -not -path "*template*" -exec grep -H "<threadPool.*name" {} \; | grep -v "<threadPool .*ORB" | grep -e WebContainer -e Default -e Message.Listener.Pool -e ORB.thread.pool -e SIBJMSRAThreadPool -e WMQJCAResourceAdapter | sed 's/^\(.*\): .*minimumSize="\([^"]\+\)".*maximumSize="\([^"]\+\)".*name="\([^"]\+\)".*/\1 \4 \2 \3/g'
- Find JMS activation specifications:
find . -type f -name resources.xml -not -path "*templates*" -exec grep -H -A 20 j2cActivationSpec.*name= {} \; | grep -e j2cActivationSpec -e maxConcurrency | sed 's/\(.*\): .*jndiName="\([^"]\+\)".*/\1 \2/g' | sed 's/\(.*\)- .*maxConcurrency.*value="\([^"]\+\)".*/\1 maxConcurrency \2/g'
- Find listener ports:
find . -type f -name server.xml -not -path "*templates*" -exec grep -H "<listenerPorts" {} \; | sed 's/^\(.*\): .*name="\([^"]\+\)".*maxSessions="\([^"]\+\)".*maxMessages="\([^"]\+\)".*/\1 \2 maxSessions \3 maxMessages \4/g'
- Find data source maximum connections:
find . -type f -name resources.xml -not -path "*templates*" -exec grep -H -A 50 "<factories.*DataSource" {} \; | grep -e "<factories.*DataSource" -e "<connectionPool" | sed 's/\(.*\): .*jndiName="\([^"]\+\)".*/\1 \2/g' | sed 's/\(.*\)- .*maxConnections="\([^"]\+\)".*/\1 maxConnections \2/g' | grep -B 1 maxConnections | grep -v "\-\-"
- Find list of all custom properties:
find . -type f -name "*xml" -not -path "*templates*" -exec grep "<properties " {} \; | grep com.ibm | sed 's/.*name="\([^"]\+\)".*/\1/g' | sort | uniq
- Review operating system statistics
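As a possible variation on the SystemOut warning and error search above, the following sketch wraps the same idea in a shell function that skips the intermediate file. The function name top_message_ids is hypothetical, and the sketch assumes the basic log format in which the severity (W or E) is the sixth whitespace-separated field and the message ID is the seventh.

```shell
#!/bin/sh
# top_message_ids DIR [N]
# Hypothetical helper: print the N (default 10) most frequent warning/error
# message IDs found in SystemOut logs under DIR, as "count id" lines.
# Assumes lines such as:
#   [10/1/20 12:00:00:000 UTC] 00000001 ThreadMonitor W   WSVR0605W: ...
top_message_ids() {
  dir=$1
  limit=${2:-10}
  find "$dir" -type f -name "*SystemOut*log*" -exec cat {} + 2>/dev/null |
    awk '$6 ~ /^[WE]$/ && $7 ~ /:$/ { sub(/:$/, "", $7); print $7 }' |
    sort | uniq -c | sort -k1,1nr -k2,2 | head -n "$limit" |
    awk '{ print $1, $2 }'
}
```

For example, running top_message_ids . 5 from the directory holding the expanded collector output would list the five most common IDs.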
WebSphere Liberty Health Check
Gather WebSphere Liberty Health Check Data
- If attaching to the running process to dump basic information and a
thread dump is an acceptable risk:
- Liberty server dump: https://www.ibm.com/support/knowledgecenter/en/SSAW57_liberty/com.ibm.websphere.wlp.nd.multiplatform.doc/ae/twlp_setup_dump_server.html
- Log in as the same user that's running WAS
${WAS}/bin/server dump ${NAME} --include=thread
- Gather the file as noted in the message:
Server ${NAME} dump complete in ${FILE}
- Otherwise:
- server.xml and any included xml files
- jvm.options (if any)
- bootstrap.properties (if any)
- Container logs and any messages.log and FFDC
- Gather any historical operating system statistics such as nmon, perfmon, etc.
- Upload all Liberty file collections and any OS statistics if available.
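When a server dump is not an option, the manual file collection above could be scripted along these lines. This is only a sketch: the function name collect_liberty_files is hypothetical, and it assumes the files sit directly under the server directory (server.xml, jvm.options, bootstrap.properties, configDropins, logs). Included XML files stored elsewhere and container logs would still need to be gathered separately.

```shell
#!/bin/sh
# collect_liberty_files SERVER_DIR ARCHIVE
# Hypothetical helper: bundle the usual Liberty configuration and log files
# from SERVER_DIR into a gzip-compressed tar archive.
collect_liberty_files() {
  server_dir=$1
  archive=$2
  # Build the list of entries that actually exist in this server directory.
  set --
  for f in server.xml jvm.options bootstrap.properties configDropins logs; do
    if [ -e "$server_dir/$f" ]; then
      set -- "$@" "$f"
    fi
  done
  if [ $# -eq 0 ]; then
    echo "No Liberty files found in $server_dir" >&2
    return 1
  fi
  # -C keeps the paths inside the archive relative to the server directory.
  tar -czf "$archive" -C "$server_dir" "$@"
}
```

The archive may contain sensitive configuration values, so it should be handled like any other diagnostic upload.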
Analyze WebSphere Liberty Health Check Data
- Check for transaction manager configuration (e.g.
transactionLogDirectory if storing trans logs on shared disk, nested
dataSource if storing trans logs in DB, etc.):
find . -type f -name "*xml" -exec grep -H "<transaction" {} \;
- Find Liberty versions:
find . -type f -name "*messages*log*" -exec grep -H "product = " {} \; | sed 's/.*:product = //g' | sort | uniq
- Find Java versions:
find . -type f -name "*messages*log*" -exec grep -H "java.runtime = " {} \; | sed 's/.*:java.runtime = //g' | sort | uniq
- Find operating system versions:
find . -type f -name "*messages*log*" -exec grep -H "os = " {} \; | sed 's/.*:os = //g' | sort | uniq
- Review JVM parameters:
find . -type f -name "*jvm.options*" -exec grep -Hn "." {} \;
- Review server.xml configuration for best practices
- Find warnings and errors:
find . -type f -name "*messages*log*" -exec grep -H " [WE] " {} \; > messages_warnings_errors.txt
awk '{print $7}' messages_warnings_errors.txt | grep "[WE]:$" | sort | uniq -c | sort -nr | head -10
- Find log message rate:
find . -type f \( -name "*SystemOut*log*" -or -name "messages*log*" \) -exec grep "^\[.*" {} \; | awk '{print $1,$2}' | sed 's/:[0-9][0-9][0-9]$//g' | sed 's/\[//g' | sort | uniq -c
- Review operating system statistics
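The log message rate search above counts entries per second, which can be noisy on busy servers. A rough sketch of a per-minute roll-up follows. The function name busiest_minutes is hypothetical, and the sketch assumes each entry starts with a bracketed timestamp such as [10/1/20 12:00:00:000 UTC] (a comma after the date, which some locales add, is tolerated).

```shell
#!/bin/sh
# busiest_minutes N FILE...
# Hypothetical helper: print the N busiest minutes across the given log
# files as "count date hour:minute" lines.
busiest_minutes() {
  limit=$1
  shift
  cat "$@" |
    awk '/^\[/ {
      day = $1
      sub(/^\[/, "", day)   # drop the leading bracket
      sub(/,$/, "", day)    # drop a trailing comma, if the locale adds one
      split($2, t, ":")     # t[1] = hour, t[2] = minute
      print day, t[1] ":" t[2]
    }' |
    sort | uniq -c | sort -k1,1nr | head -n "$limit" |
    awk '{ print $1, $2, $3 }'
}
```

Lines that do not begin with a bracket, such as stack trace continuations, are ignored, so the counts reflect log entries rather than raw lines.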
Java Health Check
Gather:
- Gather 10 thread dumps about 30 seconds apart on one JVM during
normal load; for example create a script and pass the PID as an argument
and then upload stdout:
#!/bin/sh
for i in $(seq 1 10); do
  kill -3 "$1"
  sleep 30
done
- Gather and upload all JVM and application logs for one JVM
- If you would like to review Java heap utilization, gather a core dump (J9 JVM) or heapdump (HotSpot JVM) (note that this will pause the JVM for dozens of seconds so it should be done at off-peak times and it may have sensitive contents)
- If you would like to gather sampling profiler data, capture and upload 5 minutes' worth of data.
Analyze:
- Find JVM diagnostics:
find . -type f \( -name "*javacore*txt" -or -name "*phd" -or -name "*dmp" -or -name "*trc" -or -name "*hcd" \)
find . -type f \( -name "*stderr*log*" -or -name "*console*log*" \) -exec grep -H JVM {} \;
- Find longest GC pauses:
find . -type f \( -name "*verbosegc*log*" -or -name "*stderr*log*" -or -name "*console*log*" \) -exec grep -H exclusive-end {} \; | sed 's/:</ </g' | awk '{print $(NF-1),$(NF-2),$1}' | sed 's/"//g' | sed 's/timestamp=//g' | sed 's/durationms=//g' | sort -nr | head
- Find verbosegc warnings:
find . -type f \( -name "*verbosegc*log*" -or -name "*stderr*log*" -or -name "*console*log*" \) -exec grep -H "<warning" {} \;
- Find if verbose classloading is enabled:
find . -type f \( -name "*stderr*log*" -or -name "*console*log*" \) -exec grep -H "class load:" {} \;
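Beyond the longest individual pauses, a quick total can hint at how much time is spent paused overall. The sketch below sums the durationms values from the same exclusive-end entries that the search above matches. The function name gc_pause_summary is hypothetical, and the sketch assumes J9-style verbosegc lines carrying a durationms="..." attribute.

```shell
#!/bin/sh
# gc_pause_summary FILE...
# Hypothetical helper: report the number of exclusive-end entries, their
# total duration, and the longest single duration (all in milliseconds).
gc_pause_summary() {
  cat "$@" |
    grep "<exclusive-end" |
    sed 's/.*durationms="\([0-9.]*\)".*/\1/' |
    awk '{
      n++
      total += $1
      if ($1 + 0 > max + 0) max = $1 + 0
    }
    END {
      printf "pauses=%d total_ms=%.3f max_ms=%.3f\n", n, total, max
    }'
}
```

Dividing total_ms by the wall-clock span covered by the log gives an approximate proportion of time in exclusive pauses.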
IBM HTTP Server and WAS Plugin Health Check
Gather on at least one node (ideally, all):
- httpd.conf and any included *conf files
- plugin-cfg.xml
- access.log & error.log
- http_plugin.log
Analyze:
- Find logging configuration:
find . -type f -name "*conf*" -exec grep -H -e LogFormat -e CustomLog {} \; | grep -v -e ":#" -e /templates/
- Find if IHS threads are saturated or nearly saturated:
find . -type f -name "*error_log*" -exec grep -H mpmstats {} \; | grep "rdy . "
- Find HTTP 5XX errors:
find . -type f -name "*access_log*" -exec grep -H "HTTP/1.1\" 5" {} \;
- Find any non-informational entries in WAS plugin log:
find . -type f -name "*http_plugin*log*" -exec grep -H "." {} \;
- Find key WAS Plugin configuration:
find . -type f -name plugin-cfg.xml -not -path "*templates*" -exec grep -Hn -e ServerIOTimeout -e ConnectTimeout {} \;
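The HTTP 5XX search above lists every matching request, which can be long. One way to get an overview first is to count responses per status code, as in this sketch. The function name count_5xx is hypothetical, and the sketch assumes the common or combined LogFormat, where the status code is the ninth whitespace-separated field; a customized LogFormat would need a different field number.

```shell
#!/bin/sh
# count_5xx FILE...
# Hypothetical helper: print "status count" for each 5XX status code found
# in the given access logs.
count_5xx() {
  cat "$@" |
    awk '$9 ~ /^5[0-9][0-9]$/ { print $9 }' |
    sort | uniq -c | awk '{ print $2, $1 }'
}
```

Unlike the grep for HTTP/1.1" 5, this also counts requests made with other protocol versions, as long as the field layout matches.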
Linux Configuration Health Check
Gather the following as root and upload healthcheck_linux*.txt:
date &> healthcheck_linux_$(hostname).txt
echo "=== hostname ===" &>> healthcheck_linux_$(hostname).txt
hostname &>> healthcheck_linux_$(hostname).txt
echo "=== uname ===" &>> healthcheck_linux_$(hostname).txt
uname -a &>> healthcheck_linux_$(hostname).txt
echo "=== cmdline ===" &>> healthcheck_linux_$(hostname).txt
cat /proc/cmdline &>> healthcheck_linux_$(hostname).txt
echo "=== cpuinfo ===" &>> healthcheck_linux_$(hostname).txt
cat /proc/cpuinfo &>> healthcheck_linux_$(hostname).txt
echo "=== lscpu ===" &>> healthcheck_linux_$(hostname).txt
lscpu &>> healthcheck_linux_$(hostname).txt
echo "=== meminfo ===" &>> healthcheck_linux_$(hostname).txt
cat /proc/meminfo &>> healthcheck_linux_$(hostname).txt
echo "=== sysctl ===" &>> healthcheck_linux_$(hostname).txt
sysctl -a &>> healthcheck_linux_$(hostname).txt
echo "=== messages ===" &>> healthcheck_linux_$(hostname).txt
cat /var/log/messages &>> healthcheck_linux_$(hostname).txt
echo "=== syslog ===" &>> healthcheck_linux_$(hostname).txt
cat /var/log/syslog &>> healthcheck_linux_$(hostname).txt
echo "=== journal ===" &>> healthcheck_linux_$(hostname).txt
journalctl --since "7 days ago" &>> healthcheck_linux_$(hostname).txt
echo "=== netstat ===" &>> healthcheck_linux_$(hostname).txt
netstat -s &>> healthcheck_linux_$(hostname).txt
echo "=== nstat ===" &>> healthcheck_linux_$(hostname).txt
nstat -asz &>> healthcheck_linux_$(hostname).txt
echo "=== top ===" &>> healthcheck_linux_$(hostname).txt
top -b -d 1 -n 2 &>> healthcheck_linux_$(hostname).txt
echo "=== top -H ===" &>> healthcheck_linux_$(hostname).txt
top -H -b -d 1 -n 2 &>> healthcheck_linux_$(hostname).txt
echo "=== ps ===" &>> healthcheck_linux_$(hostname).txt
ps -elfyww &>> healthcheck_linux_$(hostname).txt
echo "=== iostat ===" &>> healthcheck_linux_$(hostname).txt
iostat -xm 1 2 &>> healthcheck_linux_$(hostname).txt
echo "=== ip addr ===" &>> healthcheck_linux_$(hostname).txt
ip addr &>> healthcheck_linux_$(hostname).txt
echo "=== ip -s ===" &>> healthcheck_linux_$(hostname).txt
ip -s link &>> healthcheck_linux_$(hostname).txt
echo "=== ss summary ===" &>> healthcheck_linux_$(hostname).txt
ss --summary &>> healthcheck_linux_$(hostname).txt
echo "=== ss ===" &>> healthcheck_linux_$(hostname).txt
ss -amponeti &>> healthcheck_linux_$(hostname).txt
echo "=== nstat ===" &>> healthcheck_linux_$(hostname).txt
nstat -saz &>> healthcheck_linux_$(hostname).txt
echo "=== netstat -i ===" &>> healthcheck_linux_$(hostname).txt
netstat -i &>> healthcheck_linux_$(hostname).txt
echo "=== netstat -s ===" &>> healthcheck_linux_$(hostname).txt
netstat -s &>> healthcheck_linux_$(hostname).txt
echo "=== netstat ===" &>> healthcheck_linux_$(hostname).txt
netstat -anop &>> healthcheck_linux_$(hostname).txt
echo "=== systemd-cgtop ===" &>> healthcheck_linux_$(hostname).txt
systemd-cgtop -b --depth=5 -d 1 -n 2 &>> healthcheck_linux_$(hostname).txt
echo "=== journalctl -b ===" &>> healthcheck_linux_$(hostname).txt
journalctl -b | head -2000 &>> healthcheck_linux_$(hostname).txt
echo "=== journalctl -b -n ===" &>> healthcheck_linux_$(hostname).txt
journalctl -b -n 2000 &>> healthcheck_linux_$(hostname).txt
echo "=== journalctl warning ===" &>> healthcheck_linux_$(hostname).txt
journalctl -p warning -n 500 &>> healthcheck_linux_$(hostname).txt
echo "=== ulimit ===" &>> healthcheck_linux_$(hostname).txt
ulimit -a &>> healthcheck_linux_$(hostname).txt
echo "=== df -h ===" &>> healthcheck_linux_$(hostname).txt
df -h &>> healthcheck_linux_$(hostname).txt
echo "=== systemctl list-units ===" &>> healthcheck_linux_$(hostname).txt
systemctl list-units &>> healthcheck_linux_$(hostname).txt
echo "=== systemd-cgls ===" &>> healthcheck_linux_$(hostname).txt
systemd-cgls &>> healthcheck_linux_$(hostname).txt
echo "=== pstree ===" &>> healthcheck_linux_$(hostname).txt
pstree -pT &>> healthcheck_linux_$(hostname).txt
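Since every step above repeats the same echo and redirect pattern, the collection could also be written around a small helper. This is a sketch only: the function name section and the OUT variable are hypothetical, uname -n is used in place of hostname for portability, and commands that are not installed are noted in the output rather than producing shell errors.

```shell
#!/bin/sh
# Output file; override by setting OUT before calling section.
OUT=${OUT:-healthcheck_linux_$(uname -n).txt}

# section TITLE COMMAND [ARGS...]
# Hypothetical helper: append a "=== TITLE ===" header followed by the
# command's stdout and stderr to $OUT.
section() {
  title=$1
  shift
  {
    echo "=== $title ==="
    if command -v "$1" >/dev/null 2>&1; then
      "$@" 2>&1
    else
      echo "(skipped: $1 not found)"
    fi
  } >> "$OUT"
}

# Example calls (uncomment to run):
# section "uname" uname -a
# section "meminfo" cat /proc/meminfo
# section "journalctl -b" sh -c 'journalctl -b | head -2000'
```

Steps that use a pipe need to be wrapped in sh -c, as in the last example, because the helper runs a single command with its arguments.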
IBM Visual Configuration Explorer (VCE)
The IBM Visual Configuration Explorer (VCE) tool is no longer publicly available.