z/OS

z/OS Recipe

  1. CPU core(s) should not be consistently saturated.
  2. Generally, physical memory should never be saturated and the operating system should not page memory out to disk.
  3. Input/Output interfaces such as network cards and disks should not be saturated, and should not have poor response times.
  4. TCP/IP and network tuning, whilst sometimes complicated to investigate, may have dramatic effects on performance.
  5. Consider tuning TCP/IP network buffer sizes.
  6. Collect and archive various RMF/SMF records on 10 or 15 minute intervals:
    1. SMF 30 records
    2. SMF 70-78 records
    3. SMF 113 subtype 1 (counters) records
    4. With recent versions of z/OS, Correlator SMF 98.1 records
    5. SMF 99 subtype 6 records
    6. If not active, activate Hardware Instrumentation Services (HIS) and collect hardware counters.
  7. Review ps -p $PID -m and D OMVS,PID=$PID output over time for processes of interest.
  8. Operating system level statistics and optionally process level statistics should be periodically monitored and saved for historical analysis.
  9. Review system logs for any errors, warnings, or high volumes of messages.
  10. Review snapshots of process activity, and for the largest users of resources, review per thread activity.
  11. If the operating system is running in a virtualized guest, review the configuration and whether or not resource allotments are changing dynamically.
  12. Use the Workload Activity Report to review performance.
  13. If there is sufficient network capacity for the additional packets, consider reducing the default TCP keepalive timer (TCPCONFIG INTERVAL) from 2 hours to a value less than intermediate device idle timeouts (e.g. firewalls).
  14. Review SYS1.PARMLIB (and SYS1.IPLPARM if used).
  15. Test disabling delayed ACKs.

Also review the general topics in the Operating Systems chapter.

Documentation

General

z/OS is normally accessed through 3270 clients, telnet, SSH, or FTP.

z/OS uses the EBCDIC character set by default instead of ASCII/UTF; however, some files produced by Java are written in ASCII or UTF. These can be converted using the iconv USS command or downloaded through FTP in BINARY mode to an ASCII/UTF based computer.
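As a rough illustration of what the conversion involves, the following Python sketch decodes EBCDIC bytes on an ASCII/UTF based computer. It assumes Python's built-in cp037 codec as a stand-in for IBM-1047 (the two code pages agree on letters, digits, and common punctuation but differ for a few characters such as the square brackets), and it maps the EBCDIC newline (0x15, which cp037 decodes as U+0085) to "\n". For real files, iconv with the exact code page remains the safer choice.

```python
def ebcdic_to_text(data: bytes) -> str:
    """Decode EBCDIC bytes to a str.

    Assumption: cp037 approximates IBM-1047 closely enough for letters,
    digits and common punctuation; a few characters (e.g. [ and ]) differ.
    """
    text = data.decode("cp037")
    # z/OS UNIX text files end lines with EBCDIC NL (0x15), which cp037
    # decodes to U+0085 (NEL); normalize it to the usual "\n".
    return text.replace("\x85", "\n")


def text_to_ebcdic(text: str) -> bytes:
    """Encode a str as EBCDIC (cp037), turning "\n" into EBCDIC NL (0x15)."""
    return text.replace("\n", "\x85").encode("cp037")


if __name__ == "__main__":
    sample = b"\xc8\x85\x93\x93\x96\x15"  # "Hello" followed by NL, in EBCDIC
    print(repr(ebcdic_to_text(sample)))  # 'Hello\n'
```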

Unix System Services (USS) and OMVS

ps

ps may be used to display address space information and CPU utilization; for example, to list all processes and accumulated CPU time:

By Process
ps -A -o xasid,jobname,pid,ppid,thdcnt,vsz,vsz64,vszlmt64,time,args

Example output:

ASID JOBNAME         PID       PPID THCNT     VSZ      VSZ64   VSZLMT64        TIME COMMAND
 160 SSHD7      16916941   67248492     1   16192   13631488     16383P    00:00:00 /usr/sbin/sshd -R
 175 KGRIGOR3   84025829   33693995    29  217448 1527775232     20480M    00:00:31 java -Xmx1g ...
  fb KGRIGOR8   50471447     139700     1     456   13631488     20480M    00:00:00 ps -A -o xasid,jobname,pid,ppid,thdcnt,vsz,vsz64,vszlmt64,time,args

By Thread

To display details about each thread of a process, use -p $PID and -m; for example:

ps -p $PID -m -o xasid,jobname,pid,ppid,xtid,xtcbaddr,vsz,vsz64,vszlmt64,time,semnum,lpid,lsyscall,syscall,state,args

Example output with TIME showing accumulated CPU by thread:

ASID JOBNAME         PID       PPID              TID  TCBADDR     VSZ      VSZ64   VSZLMT64        TIME  SNUM       LPID LASTSYSC             SYSC S      COMMAND
 175 KGRIGOR3   84025829   33693995                -        -  217448 1513095168     20480M    00:01:38     -          - -                    -    HR     java -Xmx1g ...
   - -                 -          - 1cc3300000000001   8ce048  217448          -          -    00:00:00     -          0 1NOP1NOP1NOP1NOP1NOP 1IPT KU     -
   - -                 -          - 1cc4b00000000002   8fb2f8  217448          -          -    00:01:16     -          0 1NOP1NOP1NOP1NOP1NOP -    RJ     -
   - -                 -          - 1cc4d80000000003   8bce78  217448          -          -    00:00:00     -          0 1NOP1NOP1NOP1NOP1NOP -    RNJV   -
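When a process has many threads, sorting the per-thread TIME column helps find the heaviest CPU consumers. The following Python sketch is a hypothetical post-processing helper, assuming output captured from ps -p $PID -m -o xtid,time (a subset of the columns shown above) in which the first data row is the process summary with a TID of -. It accepts TIME values as HH:MM:SS, MM:SS, or with an optional leading days- prefix.

```python
def cpu_seconds(value: str) -> int:
    """Convert a ps CPU time such as '00:01:16', '1:35' or '2-03:04:05' to seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    seconds = 0
    for part in value.split(":"):
        seconds = seconds * 60 + int(part)
    return days * 86400 + seconds


def top_threads(ps_output: str, count: int = 5) -> list[tuple[str, int]]:
    """Return (thread ID, CPU seconds) pairs sorted by CPU, highest first.

    Assumes two whitespace-separated columns, TID and TIME, after a header line.
    """
    threads = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split()
        # Skip blank lines and the process summary row (TID shown as "-").
        if len(fields) != 2 or fields[0] == "-" or fields[1] == "-":
            continue
        threads.append((fields[0], cpu_seconds(fields[1])))
    threads.sort(key=lambda item: item[1], reverse=True)
    return threads[:count]


# Illustrative input built from the thread rows in the example above.
sample = """\
             TID        TIME
               -    00:01:38
1cc3300000000001    00:00:00
1cc4b00000000002    00:01:16
1cc4d80000000003    00:00:00
"""

print(top_threads(sample, 1))  # [('1cc4b00000000002', 76)]
```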

USS Settings

Display global USS settings: /D OMVS,O

BPXO043I 10.14.11 DISPLAY OMVS 616
OMVS     000F ACTIVE             OMVS=(S4)
CURRENT UNIX CONFIGURATION SETTINGS:
MAXPROCSYS      =       1900    MAXPROCUSER     =        500
MAXFILEPROC     =      65535    MAXFILESIZE     = NOLIMIT
MAXCPUTIME      = 2147483647    MAXUIDS         =        500
MAXPTYS         =        750
MAXMMAPAREA     =       128K    MAXASSIZE       = 2147483647
MAXTHREADS      =      50000    MAXTHREADTASKS  =       5000
MAXCORESIZE     =      7921K    MAXSHAREPAGES   =         4M...
MAXQUEUEDSIGS   =      10000    SHRLIBRGNSIZE   =   67108864...

opercmd

opercmd may be an available command to execute operator commands normally run through a 3270 session, though it requires special permission. For example:

opercmd "D OMVS,O"

Tips

  1. A dataset may be copied to a file with: cp "//'cbc.sccnsam(ccnubrc)'" ccnubrc.C
  2. To get to OMVS: ISPF > 6 COMMAND > omvs
  3. OMVS disable autoscroll: type NOAUTO and press F2
  4. OMVS enable autoscroll: type AUTO and press F2
  5. Convert file from ASCII to EBCDIC: iconv -fiso8859-1 -tibm-1047 server.xml > server.ebcdic

Language Environment (LE)

z/OS provides a built-in mechanism to recommend fine-tuned values for the LE heap. Run with LE RPTSTG(ON) and consult the resulting output: http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/tprf_tunezleheap.html

Ensure that you are NOT using the following options during production: RPTSTG(ON), RPTOPTS(ON), HEAPCHK(ON)

For best performance, use the LPALSTxx parmlib member to ensure that LE and C++ runtimes are loaded into LPA.

Ensure that the Language Environment data sets, SCEERUN and SCEERUN2, are authorized to enable xplink... For best performance, compile applications that use JNI services with xplink enabled.

http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/rprf_tunezle.html

pax

The pax command may be used to create or unpack compressed or uncompressed archive files (similar to POSIX tar and gzip/gunzip).

  • Create with compression: pax -wzvf $FILE.pax.Z $FILES_OR_DIRS
  • Create without compression: pax -wvf $FILE.pax $FILES_OR_DIRS
  • Unpack (compression autodetected): pax -ppx -rf $FILE.pax.Z

uncompress

The uncompress command may be used to decompress .Z files:

uncompress $FILE.Z

3270 Clients

The z/OS interactive user interface is normally accessed through a 3270 (text-based terminal) session. Commonly used client programs are Personal Communications, Host On-Demand, x3270 (Linux), and c3270 (macOS). Some notes:

  1. On some 3270 client keyboard layouts, the right "Ctrl" key is used as the "Enter" key.
  2. Function keys are used heavily. In general, F3 goes back to the previous screen.
  3. Some clients such as Host On-Demand allow showing a virtual keyboard (View > Keypad), which may help when needing obscure keys.
  4. If you receive a red X in the bottom left, then you probably tried to press a key when your cursor was not in an input area. Press the SysReq key and press Enter to reset the screen.
  5. On some screens, you may move your cursor to any point (normally the top line) and press F2 to split the screen. Use F9 to switch between screens.
  6. Usually, page up with F7 and page down with F8. If you type m in the input field and press F7 or F8, then you will scroll to the top or bottom, respectively.
  7. Usually, page right with F11 and page left with F10.
  8. Usually, type f "SEARCH" to find something and then press F5 to find the next occurrence. Use prev to search backwards: f "SEARCH" prev

z/OS Version

Display z/OS version with /D IPLINFO

Search for the "RELEASE" line:

IEE254I  13.06.07 IPLINFO DISPLAY 033
   SYSTEM IPLED AT 09.38.57 ON 05/15/2018
   RELEASE z/OS 02.02.00    LICENSE = z/OS
   USED LOADRE IN SYS1.IPLPARM ON 00340
   ARCHLVL = 2   MTLSHARE = N
   IEASYM LIST = (05,RE,L)
   IEASYS LIST = (LF,KB) (OP)
   IODF DEVICE: ORIGINAL(00340) CURRENT(00340)
   IPL DEVICE: ORIGINAL(00980) CURRENT(00980) VOLUME(PDR22 )

Interactive System Productivity Facility (ISPF)

After logging in through a 3270 session, it is common to access most programs by typing ISPF.

Typically, if available, F7 is page up, F8 is page down, F10 is page left, and F11 is page right. Typing "m" followed by F7 or F8 scrolls to the top or bottom, respectively.

If available, type "F STRING" to find the first occurrence of STRING, then press F5 to find each subsequent occurrence.

Normally, F3 returns to the parent screen (i.e. exits the current screen).

Command (normally type 6 in ISPF) allows things such as TSO commands or going into Unix System Services (USS).

Utilities - Data Set List (normally type 3.4 in ISPF) allows browsing data sets.

Central Processing Unit (CPU)

z/OS offers multiple types of processors that may have different costs. The most general processor type is called the general purpose processor or central processor (CP or GCP). There are also z Integrated Information Processors (zIIPs), which can run eligible workloads such as Java, z Application Assist Processors (zAAPs, also known as IFAs), Integrated Facility for Linux processors (IFLs) for Linux and z/VM, and others.

SMF 98.1

With z/OS 2.2 and above, always collect and archive SMF 98.1 records if IBM z/OS Workload Interaction Correlator is entitled. These records provide "valuable data for diagnosing transient performance problems; summary activities with a worst offender and its activity every 5 seconds" with minimal overhead: "IBM benchmarks were run with all available Correlator SMF records being captured and logged and could not detect any additional CPU cost from the increased data collection".

Post-process the output data set using SMF_CORE. To get statistics on the average CP, zIIP, and zAAP utilization and top consumers of each:

java -Xmx1g "-Dcom.ibm.ws390.smf.dateTimeFormat=yyyy-MM-dd'T'HH:mm:ss.SSSZ" -DPRINT_WRAPPER=false -jar smftools.jar "INFILE(DATASET)" "PLUGIN(com.ibm.smf.plugins.Type98CPU,STDOUT)"

Example output:

DateTime,LPAR,AvgCpuBusyCP,AvgCpuBusyzAAP,Avg_CpuBusy_zIIP,AddressSpaceMostCPU_CP,AddressSpaceMostCPU_zAAP,AddressSpaceMostCPU_zIIP
2023-12-11T12:30:00.000+0000,DBOC,43,0,41,D2PDDIST,,D2PDDIST
2023-12-11T12:30:05.000+0000,DBOC,50,0,44,D2PDDIST,,D2PDDIST
[...]
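Because the plugin writes CSV to standard output, the results can be summarized with any CSV tool. As a minimal Python sketch, assuming the column names shown in the example output above, the following finds the 5-second interval with the highest general purpose CP utilization and the address space reported as the top CP consumer at that time.

```python
import csv
import io


def busiest_cp_interval(csv_text: str) -> dict[str, str]:
    """Return the CSV row with the highest AvgCpuBusyCP value."""
    rows = [
        row
        for row in csv.DictReader(io.StringIO(csv_text))
        if row.get("AvgCpuBusyCP")  # skip blank, truncated or elided lines
    ]
    return max(rows, key=lambda row: float(row["AvgCpuBusyCP"]))


# The two data rows from the example output above.
sample = """\
DateTime,LPAR,AvgCpuBusyCP,AvgCpuBusyzAAP,Avg_CpuBusy_zIIP,AddressSpaceMostCPU_CP,AddressSpaceMostCPU_zAAP,AddressSpaceMostCPU_zIIP
2023-12-11T12:30:00.000+0000,DBOC,43,0,41,D2PDDIST,,D2PDDIST
2023-12-11T12:30:05.000+0000,DBOC,50,0,44,D2PDDIST,,D2PDDIST
"""

peak = busiest_cp_interval(sample)
print(peak["DateTime"], peak["AvgCpuBusyCP"], peak["AddressSpaceMostCPU_CP"])
# 2023-12-11T12:30:05.000+0000 50 D2PDDIST
```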

Display processors

Display processors with /D M=CPU; for example, this shows four general purpose processors and four zAAP processors:

D M=CPU
IEE174I 15.45.46 DISPLAY M 700
PROCESSOR STATUS
ID  CPU                  SERIAL
00  +                     0C7B352817
01  +                     0C7B352817
02  +                     0C7B352817
03  +                     0C7B352817
04  +A                    0C7B352817
05  +A                    0C7B352817
06  +A                    0C7B352817
07  +A                    0C7B352817
+ ONLINE    - OFFLINE    . DOES NOT EXIST    W WLM-MANAGED      N NOT AVAILABLE     A  APPLICATION ASSIST PROCESSOR (zAAP)
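For a quick count of processors in each state, the status column can be tallied. This Python sketch assumes the layout shown above, where the CPU column starts with + (online) or - (offline), optionally followed by a type letter such as A (zAAP); other letters are counted as-is rather than interpreted, so check the legend line of your own output.

```python
import re
from collections import Counter

# Matches rows such as "04  +A                    0C7B352817":
# a hex processor ID, a status flag (+ or -), and an optional type letter.
ROW = re.compile(r"^([0-9A-F]{2,})\s+([+-])([A-Z]?)\s")


def count_processors(display: str) -> Counter:
    """Count processors by (status, type letter); no letter means a general CP."""
    counts = Counter()
    for line in display.splitlines():
        match = ROW.match(line)
        if match:
            status, kind = match.group(2), match.group(3)
            counts[(status, kind or "CP")] += 1
    return counts


sample = """\
D M=CPU
IEE174I 15.45.46 DISPLAY M 700
PROCESSOR STATUS
ID  CPU                  SERIAL
00  +                     0C7B352817
01  +                     0C7B352817
02  +                     0C7B352817
03  +                     0C7B352817
04  +A                    0C7B352817
05  +A                    0C7B352817
06  +A                    0C7B352817
07  +A                    0C7B352817
+ ONLINE    - OFFLINE    . DOES NOT EXIST    W WLM-MANAGED      N NOT AVAILABLE     A  APPLICATION ASSIST PROCESSOR (zAAP)
"""

print(dict(count_processors(sample)))  # {('+', 'CP'): 4, ('+', 'A'): 4}
```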

Display threads in an address space

Display threads in an address space and the accumulated CPU by thread: /D OMVS,PID=XXX (search for the PID in the joblogs of the address space). This output includes a CT_SECS field, which shows the total CPU seconds consumed by the address space. Note that the sum of all the ACC_TIME values in the report will not equal CT_SECS or the address space CPU reported by RMF or SDSF, because some threads may have terminated. The ACC_TIME and CT_SECS fields wrap after about 11.5 days and then contain *****; therefore, the /D OMVS,PID= display is less useful when the address space has been running for longer than that.

-RO MVSA,D OMVS,PID=67502479
BPXO040I 11.09.56 DISPLAY OMVS 545
OMVS 000F ACTIVE OMVS=(00,FS,0A)
USER JOBNAME ASID PID PPID STATE START CT_SECS
WSASRU WSODRA S 0190 67502479 1 HR---- 23.01.38 13897.128 1
LATCHWAITPID= 0 CMD=BBOSR
THREAD_ID            TCB@    PRI_JOB USERNAME ACC_TIME SC STATE
2621F4D000000008 009C7938                       12.040 PTX JY V

All the threads/TCBs are listed and uniquely identified by their thread ID under the THREAD_ID column. The accumulated CPU time for each thread is under the ACC_TIME column. The thread ID is the first 8 hexadecimal characters in the THREAD_ID and can be found in a matching javacore.txt file. In the example above, the Java thread ID is 2621F4D0.
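The mapping from THREAD_ID to the javacore thread ID is mechanical. This Python sketch assumes rows laid out like the THREAD_ID lines in the example (a 16-character THREAD_ID, an 8-character TCB@, then ACC_TIME as the first decimal number, since PRI_JOB and USERNAME may be blank). It extracts the Java thread ID and the accumulated CPU seconds, returning None for an ACC_TIME that has wrapped to asterisks.

```python
import re
from typing import Optional, Tuple

THREAD_ROW = re.compile(r"^([0-9A-F]{16})\s+([0-9A-F]{8})\s")
ACC_TIME = re.compile(r"^(\d+\.\d+|\*+)$")


def parse_thread_row(line: str) -> Optional[Tuple[str, str, Optional[float]]]:
    """Return (java_thread_id, tcb_address, acc_seconds), or None for non-thread rows.

    acc_seconds is None when ACC_TIME has wrapped and shows asterisks.
    """
    match = THREAD_ROW.match(line)
    if not match:
        return None
    thread_id, tcb = match.groups()
    acc_seconds = None
    for field in line.split()[2:]:
        if ACC_TIME.match(field):
            acc_seconds = None if field.startswith("*") else float(field)
            break
    # The javacore thread ID is the first 8 hex characters of THREAD_ID.
    return thread_id[:8], tcb, acc_seconds


row = "2621F4D000000008 009C7938                       12.040 PTX JY V"
print(parse_thread_row(row))  # ('2621F4D0', '009C7938', 12.04)
```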

The threads with eye-catcher WLM are those from the ORB thread pool which are the threads that run the application enclave workload. Be careful when attempting to reconcile these CPU times with CPU accounting from RMF and SMF. This display shows all the threads in the address space, but remember that threads that are WLM managed (e.g. the Async Worker threads and the ORB threads) have their CPU time recorded in RMF/SMF under the enclave which is reported in the RMF report class that is associated with the related WLM classification rule for the CB workload type. The other threads will have their CPU time charged to the address space itself as it is classified in WLM under the STC workload type.

WebSphere trace entries also contain the TCB address of the thread generating those entries. For example:

THREAD_ID TCB@ PRI_JOB USERNAME ACC_TIME SC STATE
2707058000000078 009BDB58 178.389 STE JY V
Trace: 2009/03/19 08:28:35.069 01 t=9BDB58 c=UNK key=P8 (0000000A)
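In this example, t=9BDB58 in the trace entry is the TCB address 009BDB58 with its leading zeros dropped. A minimal Python sketch, assuming trace lines contain a t=<hex> token as shown, compares the two values as integers rather than strings so that the zero padding does not matter:

```python
import re
from typing import Optional

TCB_TOKEN = re.compile(r"\bt=([0-9A-Fa-f]+)\b")


def trace_tcb(trace_line: str) -> Optional[int]:
    """Return the TCB address from a 't=...' token as an int, or None if absent."""
    match = TCB_TOKEN.search(trace_line)
    return int(match.group(1), 16) if match else None


def same_tcb(display_tcb: str, trace_line: str) -> bool:
    """True if a TCB@ value from /D OMVS,PID= matches the trace entry's t= value."""
    return trace_tcb(trace_line) == int(display_tcb, 16)


trace = "Trace: 2009/03/19 08:28:35.069 01 t=9BDB58 c=UNK key=P8 (0000000A)"
print(same_tcb("009BDB58", trace))  # True
```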

The SDSF.PS display provides an easy way to issue this command for one or more address spaces. Type d next to an address space to get this same output. Type ULOG to see the full output or view in SDSF.LOG.

Similar information can be found from USS:

$ ps -p $PID -m -o xtid,xtcbaddr,tagdata,state=STATE -o atime=CPUTIME -o syscall
            TID TCBADDR STATE CPUTIME SYSC
                    - -    HR 14:12 -
1e4e300000000000 8d0e00    YU  0:20
1e4e400000000001 8d07a8   YJV  0:00
1e4e500000000002 8d0588  YNJV  0:00
1e4e600000000003 8d0368   YJV  1:35
1e4e700000000004 8d0148   YJV  0:25

31-bit vs 64-bit

z/OS does not have a 32-bit addressing mode. Instead, in addition to 64-bit addressing, it has a 31-bit addressing mode (as well as the original 24-bit mode).
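The practical difference is the size of the address range each mode can reach. The arithmetic below, in Python, shows that 31-bit addressing reaches 2 GiB (2^31 bytes, the "bar"), half of the 4 GiB a 32-bit mode would allow, while 24-bit addressing reaches 16 MiB (the "line"). The highest 31-bit address, 2^31 - 1, is the same value, 2147483647, that appears as MAXASSIZE in the /D OMVS,O output earlier in this chapter.

```python
MiB = 2**20
GiB = 2**30


def addressable_bytes(bits: int) -> int:
    """Number of distinct byte addresses for the given address width."""
    return 2**bits


print(addressable_bytes(24) // MiB, "MiB reachable with 24-bit addressing")
print(addressable_bytes(31) // GiB, "GiB reachable with 31-bit addressing")
print(addressable_bytes(32) // GiB, "GiB for a (hypothetical) 32-bit mode")
print(addressable_bytes(31) - 1)  # 2147483647, the highest 31-bit address
```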

zIIP/zAAP Processors

Review zIIP processors:

  1. Type /D IPLINFO and search for LOADY2.
  2. Go to the data set list and type the name from LOADY2 in Dsname level and press enter (e.g. SYS4.IPLPARM).
  3. Type b to browse the data set members and search for PARMLIB.
  4. Go to the data set list, type the name (e.g. USER.PARMLIB), and find the IEAOPTxx member.

Inside SYS1.PARMLIB(IEAOPTxx), the following options affect how zAAP (IFA) engines process work; zIIPs have the analogous IIPHONORPRIORITY option.

  1. IFACrossOver = YES / NO
    • YES - zAAP-eligible work can run on both zAAPs and general purpose CPs
    • NO - zAAP-eligible work will run only on zAAPs unless there are no zAAPs
  2. IFAHonorPriority = YES / NO
    • YES - WLM manages the priority of zAAP-eligible work for CPs
    • NO - zAAP-eligible work can run on CPs but at a priority lower than any non-zAAP work

Java zIIP/zAAP usage

Restart with -Xtrace:iprint=j9util.48 and review stderr for libraries using zIIP/zAAP with the following message:

validateLibrary shared library [...]/lib/default/zip flagged as zAAP enabled

System Display and Search Facility (SDSF)

SDSF (normally type S in ISPF) provides a system overview.

SDSF.LOG

LOG shows the system log, and it is the most common place to execute system commands. To enter a system command, type / and press Enter, then type the command and press Enter. Then press F8 or Enter to refresh the screen and see the command's output.

Display system activity summary: /D A

IEE114I 16.18.32 2011.250 ACTIVITY 733                                  
JOBS     M/S    TS USERS    SYSAS    INITS   ACTIVE/MAX VTAM     OAS   
00008    00034    00001      00035    00034    00001/00300       00019

Display users on the system: /D TS,L

IEE114I 16.51.50 2011.251 ACTIVITY 298
JOBS     M/S    TS USERS    SYSAS    INITS   ACTIVE/MAX VTAM     OAS
00008    00039    00002      00039    00034    00002/00300       00029
DOUGMAC  OWT      WITADM1 IN

Check global resource contention with /D GRS,C

SDSF.DA

SDSF.DA shows active address spaces. CPU/L/Z A/B/C shows current CPU use, where A=total, B=LPAR usage, and C=zAAP/zIIP usage.

Type PRE * to show all address spaces.

Type SORT X to sort, e.g. SORT CPU%.

Page right to see useful information such as MEMLIMIT, RPTCLASS, WORKLOAD, and SRVCLASS.

In the NP column, type S next to an address space to get all of its output, or type ? to get a member list and then type S for a particular member (e.g. SYSOUT, SYSPRINT).

When viewing joblog members of an address space (? in SDSF.DA), type XDC next to a member to transfer it to a data set.

SDSF.ST is similar to DA and includes completed jobs.

Physical Memory (RAM)

Use /D M=STOR to display available memory. The ONLINE sections show available memory. For example, this shows 64GB:

D M=STOR
IEE174I 16.00.48 DISPLAY M 238
REAL STORAGE STATUS
ONLINE-NOT RECONFIGURABLE
  0M-64000M
ONLINE-RECONFIGURABLE
  NONE
PENDING OFFLINE
  NONE
0M IN OFFLINE STORAGE ELEMENT(S)
0M UNASSIGNED STORAGE
STORAGE INCREMENT SIZE IS 256M

Use /D ASM to display paging spaces. The FULL columns for LOCAL entries should never be greater than 0%. For example:

D ASM
IEE200I 15.30.16 DISPLAY ASM 205
TYPE   FULL STAT  DEV DATASET NAME
PLPA    79%   OK 0414 SYS1.S12.PLPA
COMMON   0%   OK 0414 SYS1.S12.COMMON
LOCAL    0%   OK 0414 SYS1.S12.LOCAL1
LOCAL    0%   OK 0445 SYS1.S12.PAGE01
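When checking many systems, the rule that LOCAL page data sets should stay at 0% full can be scripted. This Python sketch assumes the column order shown above (TYPE, FULL, STAT, DEV, DATASET NAME) and returns any LOCAL page data sets whose FULL percentage is above zero:

```python
def busy_local_page_datasets(display: str) -> list[tuple[str, int]]:
    """Return (data set name, percent full) for LOCAL entries above 0% full."""
    flagged = []
    for line in display.splitlines():
        fields = line.split()
        # Expected: TYPE FULL STAT DEV DATASET-NAME, e.g. "LOCAL 0% OK 0414 SYS1.S12.LOCAL1"
        if len(fields) == 5 and fields[0] == "LOCAL" and fields[1].endswith("%"):
            percent = int(fields[1].rstrip("%"))
            if percent > 0:
                flagged.append((fields[4], percent))
    return flagged


sample = """\
TYPE   FULL STAT  DEV DATASET NAME
PLPA    79%   OK 0414 SYS1.S12.PLPA
COMMON   0%   OK 0414 SYS1.S12.COMMON
LOCAL    0%   OK 0414 SYS1.S12.LOCAL1
LOCAL    0%   OK 0445 SYS1.S12.PAGE01
"""

print(busy_local_page_datasets(sample))  # []
```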

Display total virtual storage: /D VIRTSTOR,HVSHARE

IAR019I  17.08.47 DISPLAY VIRTSTOR 313
SOURCE = DEFAULT
TOTAL SHARED = 522240G
SHARED RANGE = 2048G-524288G
SHARED ALLOCATED = 262244M

Some systems display free memory with /F AXR,IAXDMEM:

IAR049I DISPLAY MEMORY V1.0 233           
PAGEABLE 1M STATISTICS                    
   66.7GB : TOTAL SIZE                    
   50.8GB : AVAILABLE FOR PAGEABLE 1M PAGE
 2404.0MB : IN-USE FOR PAGEABLE 1M PAGES  
 5238.0MB : MAX IN-USE FOR PAGEABLE 1M PAG
    0.0MB : FIXED PAGEABLE 1M FRAMES      
LFAREA 1M STATISTICS - SOURCE = DEFAULT   
    0.0MB : TOTAL SIZE                    
    0.0MB : AVAILABLE FOR FIXED 1M PAGES  
    0.0MB : IN-USE FOR FIXED 1M PAGES     
    0.0MB : MAX IN-USE FOR FIXED 1M PAGES
LFAREA 2G STATISTICS - SOURCE = DEFAULT   
    0.0MB : TOTAL SIZE = 0                
    0.0MB : AVAILABLE FOR 2G PAGES = 0    
    0.0MB : IN-USE FOR 2G PAGES = 0       
    0.0MB : MAX IN-USE FOR 2G PAGES = 0  

Job Entry Subsystem (JES)

Use /$DSPOOL to list spool utilization. For example:

$HASP646   41.0450 PERCENT SPOOL UTILIZATION

Workload Management (WLM)

WLM only makes noticeable decisions about resources when resources are low.

WLM performs better with fewer service classes.

  • Service Classes - goals for a particular type of work - you can have as many of these as you want but from a performance perspective the fewer service classes the better
  • Classification Rules - classification rules tie an address space or group of address spaces to a goal or service class
  • Report Classes - report classes have nothing to do with classification of work but they do allow you to show reports from a particular perspective for problem and performance diagnosis

Display WLM configuration: /D WLM

IWM025I  14.31.46  WLM DISPLAY 214
ACTIVE WORKLOAD MANAGEMENT SERVICE POLICY NAME: CBPTILE
ACTIVATED: 2011/06/13  AT: 16:15:27  BY: WITADM1   FROM: S12
DESCRIPTION: CB trans w/short percentile goal
RELATED SERVICE DEFINITION NAME: CBPTILE
INSTALLED: 2011/06/13  AT: 16:15:08  BY: WITADM1   FROM: S12

The related service definition name is the currently configured WLM definition.

Classify location service daemons and controllers as SYSSTC or high velocity.

Set achievable percentage response time goals. For example, a goal that 80% of the work will complete in 0.25 seconds is typical. Velocity goals for application work are not meaningful and should be avoided.
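A percentile goal is met when at least the stated share of completed work finishes within the target time. As an illustration with made-up response times (not taken from any RMF report), this Python sketch checks the 80%-within-0.25-seconds example:

```python
def percentile_goal_met(response_times: list[float], percent: float, target: float) -> bool:
    """True if at least `percent` % of the response times are <= `target` seconds."""
    if not response_times:
        return False
    within = sum(1 for t in response_times if t <= target)
    return within * 100 >= percent * len(response_times)


# Hypothetical response times in seconds.
samples = [0.10, 0.12, 0.20, 0.22, 0.24, 0.18, 0.30, 0.15, 0.21, 0.90]
print(percentile_goal_met(samples, 80, 0.25))  # True: 8 of 10 finished within 0.25 s
```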

Make your goals multi-period: This strategy might be useful if you have distinctly short and long running transactions in the same service class. On the other hand, it is usually better to filter this work into a different service class if you can. Being in a different service class will place the work in a different servant which allows WLM much more latitude in managing the goals.

Define unique WLM report classes for servant regions and for applications running in your application environment. Defining unique WLM report classes enables the resource measurement facility (RMF) to report performance information with more granularity.

Periodically review the results reported in the RMF Postprocessor workload activity report:

  • Transactions per second (not always the same as the client transaction rate)
  • Average response times (and the distribution of response times)
  • CPU time used
  • Percent of response time associated with various delays

Watch out for work that defaults to SYSOTHER.

http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/rprf_tunezwlm.html

Delay monitoring: http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/rprf_wlmdm.html

Example: http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/rprf_RMFsamples.html

You can print the entire WLM definition from the main screen.

Within the Subsystem types section you will find the classification rules that tie the address spaces to the service classes and report classes. You can also find this by paging right in SDSF.DA.

So what is the Response Time Ratio and what does it tell us? WLM calculates the Response Time Ratio by dividing the actual response time (enclave create to enclave delete) by the GOAL for this service class and multiplying by 100. It is, basically, a percentage of the goal. Note that WLM caps the value at 1000 so if the goal is badly missed you might see some big numbers but they will never exceed 1000. (http://www-03.ibm.com/support/techdocs/atsmastr.nsf/5cb5ed706d254a8186256c71006d2e0a/0c808594b1db5c6286257bb1006118ab/$FILE/ATTHSSAD.pdf/WP102311_SMF_Analysis.pdf)
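The ratio described above is simple arithmetic. A minimal Python sketch following that description (actual response time divided by the goal, multiplied by 100, and capped at 1000), with hypothetical input values:

```python
def response_time_ratio(actual_seconds: float, goal_seconds: float) -> float:
    """Actual response time as a percentage of the goal, capped at 1000."""
    if goal_seconds <= 0:
        raise ValueError("goal must be positive")
    return min(actual_seconds / goal_seconds * 100, 1000.0)


print(response_time_ratio(0.125, 0.25))  # 50.0: finished in half the goal time
print(response_time_ratio(5.0, 0.25))    # 1000.0: badly missed goal, capped
```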

A CPU goal uses CPU Service Units which normalize across different CPU models: https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.4.0/com.ibm.zos.v2r4.iear100/calc.htm

HTTP Request Distribution

When multiple servants are bound to the same service class, WLM attempts to dispatch the new requests to a hot servant. A hot servant has a recent request dispatched to it and has threads available. If the hot servant has a backlog of work, WLM dispatches the work to another servant.
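The dispatching preference can be illustrated with a toy model. This Python sketch is not WLM's algorithm: the hypothetical Servant fields and the choice of fallback servant are simplifying assumptions that only encode the rule stated above, namely prefer the hot servant if it has free threads and no backlog, otherwise send the work to another servant.

```python
from dataclasses import dataclass


@dataclass
class Servant:
    name: str
    free_threads: int
    queued: int = 0  # requests waiting for a thread (the "backlog")


def choose_servant(servants: list[Servant], hot: Servant) -> Servant:
    """Pick a servant for a new request, preferring the hot one (toy model only)."""
    if hot.free_threads > 0 and hot.queued == 0:
        return hot
    # Hot servant has a backlog: assume the other servant with the most free threads.
    others = [s for s in servants if s is not hot]
    return max(others, key=lambda s: s.free_threads) if others else hot


a = Servant("SR1", free_threads=3)
b = Servant("SR2", free_threads=5)
print(choose_servant([a, b], hot=a).name)  # SR1
a.queued = 2
print(choose_servant([a, b], hot=a).name)  # SR2
```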

Normally running this hot servant strategy is good because the hot servant likely has all its necessary pages in storage, has the just-in-time (JIT) compiled application methods saved close by, and has a cache full of data for fast data retrieval. However, this strategy presents a problem in the following situations: [...]

https://www.ibm.com/support/knowledgecenter/SS7K4U_9.0.5/com.ibm.websphere.zseries.doc/ae/crun_wlm_sessionplacement.html

The default workload distribution strategy uses a hot servant for running requests that create HTTP session objects. Consider configuring the product and the z/OS Workload Manager to distribute your HTTP session objects in a round-robin manner in the following conditions:

  • HTTP session objects in memory are used, causing dispatching affinities.
  • The HTTP sessions in memory last for many hours or days.
  • A large number of clients with HTTP session objects must be kept in memory.
  • The loss of a session object is disruptive to the client or server.
  • There is a large amount of time between requests that create HTTP sessions.

https://www.ibm.com/support/knowledgecenter/SS7K4U_9.0.5/com.ibm.websphere.zseries.doc/ae/trun_wlm_sessionplacement.html

Execution Velocity

"The execution velocity is a measure of how fast work is running compared to ideal conditions without delays." (https://www.ibm.com/support/knowledgecenter/SSLTBW_2.4.0/com.ibm.zos.v2r4.erbb500/exvel.htm)

System Management Facilities (SMF)

SMF captures operating system statistics to data sets.

Display what SMF is currently recording: /D SMF,O

IEE967I 08.21.08 SMF PARAMETERS 439
MEMBER = SMFPRMT8...
SYS(DETAIL) -- PARMLIB
SYS(TYPE(0,2:10,14,15,20,22:24,26,30,32:34,40,42,47:48,58,64,
70:83,84,88,89,90,100:103,110,120,127:134,148:151,161,199,225,
244,245,253)) -- PARMLIB...
INTVAL(30) -- PARMLIB...
DSNAME(SYS1.S34.MAN2) -- PARMLIB
DSNAME(SYS1.S34.MAN1) -- PARMLIB
ACTIVE -- PARMLIB

The MEMBER is the PARMLIB member holding the configuration. The SYS line shows which SMF record types are being recorded. INTVAL is the recording interval (in minutes). The DSNAME entries are the data sets that SMF records into.
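To confirm that the record types recommended in the recipe at the top of this chapter are being recorded, the TYPE(...) list can be expanded. This Python sketch assumes the comma-separated values and low:high ranges shown in the example above (joined into one string, because the display wraps across lines); subtype specifications, if present, are not handled.

```python
def expand_smf_types(type_list: str) -> set[int]:
    """Expand an SMF TYPE list such as '0,2:10,14' into a set of record types."""
    types = set()
    for item in type_list.replace(" ", "").replace("\n", "").split(","):
        if not item:
            continue
        if ":" in item:
            low, high = item.split(":")
            types.update(range(int(low), int(high) + 1))
        else:
            types.add(int(item))
    return types


recorded = expand_smf_types(
    "0,2:10,14,15,20,22:24,26,30,32:34,40,42,47:48,58,64,"
    "70:83,84,88,89,90,100:103,110,120,127:134,148:151,161,199,225,"
    "244,245,253"
)
wanted = {30, *range(70, 79), 98, 99, 113}  # record types from the recipe
print(sorted(wanted - recorded))  # [98, 99, 113]
```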

Modify the RMF ZZ (Monitor I) session recording interval dynamically: /F RMF,MODIFY ZZ,SYNC(RMF,0),INTERVAL(15M)

Display SMF data set usage: /D SMF

RESPONSE=S12                                                          
NAME                  VOLSER SIZE(BLKS) %FULL  STATUS       
P-SYS1.S12.MANC       SMF001    180000    79  ACTIVE        
S-SYS1.S12.MAND       SMF001    180000     0  ALTERNATE

When the active data set fills up, SMF switches to the alternate data set. The switch can also be done manually with /I SMF.

Example JCL to Dump SMF

//SMFD3 JOB MSGCLASS=H,MSGLEVEL=(1,1),REGION=128M,TIME=5,
//  NOTIFY=&SYSUID  
// SET SMFIN=S25J.ZTESTB.S12.SMF.G0213V00
//* OUTPUT DATASET NAME
// SET DUMPOUT=ZPER.WM0.SMFS12.D213
//*
//S0      EXEC PGM=IFASMFDP,REGION=128M
//SYSPRINT DD  SYSOUT=*
//DUMPIN1      DD DISP=SHR,DSN=&SMFIN
//DUMPOUT      DD DISP=(,CATLG,DELETE),UNIT=SYSDA,
//             SPACE=(CYL,(400,100),RLSE),
//             DSN=&DUMPOUT,
//             LRECL=32760,BLKSIZE=23467,RECFM=VBS
//SYSIN        DD *
INDD(DUMPIN1,OPTIONS(DUMP))
OUTDD(DUMPOUT,TYPE(0:255))
/*

Example JCL to Dump Live SMF Data Sets into a Permanent One

//SMFD3 JOB MSGCLASS=H,MSGLEVEL=(1,1),REGION=128M,TIME=5,
//  NOTIFY=&SYSUID
// SET SMFIN=S25J.ZTESTG.S34.SMF.G1017V00
//* OUTPUT DATASET NAME
// SET DUMPOUT=ZPER.S34.MEVERET.D092211.A
//*
//S0      EXEC PGM=IFASMFDP,REGION=128M
//SYSPRINT DD  SYSOUT=*
//DUMPIN1      DD DISP=SHR,DSN=&SMFIN
//DUMPOUT      DD DISP=(,CATLG,DELETE),UNIT=SYSDA,
//             SPACE=(CYL,(400,100),RLSE),
//             DSN=&DUMPOUT,
//             LRECL=32760,BLKSIZE=23467,RECFM=VBS
//SYSIN        DD *
INDD(DUMPIN1,OPTIONS(DUMP))
OUTDD(DUMPOUT,TYPE(0:255))
/*

The output from the JCL contains the types of records and number of records in the raw data:

IFA020I DUMPOUT  -- ZPER.S34.MEVERET.D092211.A                                  
IFA020I DUMPIN1  -- S25J.ZTESTG.S34.SMF.G1017V00                                
                                           SUMMARY ACTIVITY REPORT              
      START DATE-TIME  09/22/2011-09:33:34                         END DATE-TIME
      RECORD RECORDS PERCENT      AVG. RECORD       MIN. RECORD   MAX.
        TYPE     READ        OF TOTAL           LENGTH            LENGTH                 
          2             1             .00 %                18.00                18       
          3             1             .00 %                18.00                18      
...
     TOTAL       42,572       100 %             1,233.27                18
     NUMBER OF RECORDS IN ERROR               0

Example JCL to Produce an RMF Workload Activity Report

//SMFR1 JOB MSGLEVEL=(1,1),MSGCLASS=H
//WKLD@PGP EXEC PGM=ERBRMFPP,REGION=0K
//MFPINPUT DD   DSN=ZPER.WM0.SMFS12.D203,DISP=SHR
//PPXSRPTS DD   SYSOUT=*,DCB=(RECFM=FBA,LRECL=133)
//MFPMSGDS DD   SYSOUT=*
//SYSOUT   DD   SYSOUT=*
//SYSIN    DD   *
 ETOD(0000,2400)
 PTOD(0000,2400)
 RTOD(0000,2400)
 STOD(0000,2400)
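 SYSRPTS(WLMGL(RCLASS(W*)))
    SYSOUT(H)
/*
//* Alternative request (service class periods plus report classes
//* WT7*); use it in place of the SYSRPTS statement above:
//*  SYSRPTS(WLMGL(SCPER,RCLASS(WT7*)))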

See also http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/tprf_capwar.html

Example JCL to Clear SMF

//SMFCLEAR JOB MSGLEVEL=(1,1)
//STEP1  EXEC PGM=IFASMFDP
//DUMPIN   DD DSN=SYS1.S12.MANC,DISP=SHR
//*
//* SYS1.S34.MAN1
//* SYS1.S34.MAN2
//*
//*DUMPIN   DD DSN=SYS1.S12.MANC,DISP=SHR
//DUMPOUT DD DUMMY
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  INDD(DUMPIN,OPTIONS(CLEAR))
  OUTDD(DUMPOUT,TYPE(000:255))
/*

Resource Measurement Facility (RMF)

  • Display if RMF ZZ monitor is running: /F RMF,D ZZ
  • Start RMF ZZ monitor: /F RMF,S ZZ
  • Start RMFGAT: /F RMF,S III

Monitoring RMF in live mode can be very useful (navigate through ISPF). F10 and F11 page backwards and forwards through time.

Use RMF Monitor 3 > CPC for overall CPU. Execute procu for detailed CPU. Execute sysinfo for general information. Execute syssum for a sysplex summary.

Workload Activity Report

The JCL to produce this was covered above.

Example snippet output:

Important values:

  1. CPU - this is the total amount of processor time (excluding SRB time) used during this interval. It includes time spent on general purpose CPs, zAAPs, and zIIPs.
  2. SRB - this is the amount of processor time consumed by SRBs during the interval. An SRB is a special unit of work used primarily by the operating system to schedule functions that need to run quickly and with high priority.
  3. AAP - this is the amount of time work was running on zAAP processors during the interval. The IIP field is exactly the same as AAP except it reports time spent on zIIP processors. On our system there were no zIIP processors defined so it will be ignored.
  4. Ended - this is the total number of WebSphere requests that completed during the interval.
  5. CP - this value represents the amount of time spent on general purpose processors. It includes the CP time and the zAAP time that is reported under the "SERVICE TIME" heading, fields CPU and SRB.
    The length of this interval is 5 minutes or 300 seconds so using the CP field value under the "APPL %" heading the amount of CP time is:
    (CP * interval length) / 100 or (0.20 * 300) / 100 = 0.600 (rounding error)
  6. AAPCP - this value is the amount of zAAP time that ran on a CP which could have run on a zAAP had a zAAP processor been available. It is a subset of the CP value. The system must be configured to capture this value. It is controlled by the parmlib option xxxxxxxxxxxx. Our system did not have this option set. To convert this percentage to time is simple:
    (AAPCP * interval length) / 100
  7. IIPCP - same as AAPCP except for zIIP processors
  8. AAP - this is the amount of zAAP time consumed during the interval. It reports the same value as the AAP field under the "SERVICE TIME" heading.
  9. IIP - same as AAP except for zIIP processors.

The APPL% values are processor times reported as a percentage. They are reported as the percentage of a single processor so it is common to see values greater than 100% on multi-processor systems.

Given this information, calculating the amount of processor time used during the interval is very straightforward. The amount of zAAP processor time is simply the value reported in the AAP field, 2.015 seconds. Remember the CPU field contains the time spent on zAAPs so if we want to calculate the total amount of general purpose CP time we must subtract the AAP value from the total of the CPU and SRB values.

In the example above, which is a report class that defines enclave work, the SRB field will always be zero so to calculate the CP time we simply need to subtract the AAP value from the CPU value or 2.171 - 2.015 = 0.156. So in this example, an enclave service class, the total amount of CP and zAAP processor time spent by work executing under this report class is simply the CPU value.

Since we are using a WebSphere example, we should also include the amount of processor time consumed by the deployment manager address spaces (control and servant), the node agent address space, and the application server address spaces (control and servant). For these address spaces the SRB field is non-zero, so remember to add the SRB value to the CPU value to get the total amount of CP and zAAP time consumed during the interval. Then subtract the AAP value from this total to get the amount of CP processor time.
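The calculations described above can be written down directly. This Python sketch uses the figures quoted in the text (CPU 2.171 s, SRB 0 s, AAP 2.015 s, and a CP APPL% of 0.20 over a 300-second interval) and assumes, as the text does, that the CPU field already includes the time spent on zAAPs:

```python
def appl_percent_to_seconds(appl_percent: float, interval_seconds: float) -> float:
    """Convert an APPL% value (percent of one processor) to processor seconds."""
    return appl_percent * interval_seconds / 100


def general_purpose_cp_seconds(cpu: float, srb: float, aap: float) -> float:
    """CP time = (CPU + SRB) - AAP, since the CPU field includes zAAP time."""
    return cpu + srb - aap


print(round(general_purpose_cp_seconds(2.171, 0.0, 2.015), 3))  # 0.156
print(round(appl_percent_to_seconds(0.20, 300), 3))             # 0.6
```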

Example Analysis of Multi-Period Discretionary Delays
        z/OS V2R3               SYSPLEX AAAAAA             DATE 04/16/2020           INTERVAL 15.00.032   MODE = GOAL
                                RPT VERSION V2R3 RMF       TIME 16.00.00

POLICY=BBBBBBBB   WORKLOAD=CCCCCC     SERVICE CLASS=DDDDDDDD     RESOURCE GROUP=*NONE      PERIOD=2 IMPORTANCE=DISC
                                      CRITICAL     =NONE

-TRANSACTIONS--  TRANS-TIME HHH.MM.SS.FFFFFF  TRANS-APPL%-----CP-IIPCP/AAPCP-IIP/AAP  ---ENCLAVES---
AVG        1.48  ACTUAL 3.703426  TOTAL         0.11        0.05  225.28  AVG ENC   1.48
MPL        1.48  EXECUTION          3.702356  MOBILE        0.00        0.00    0.00  REM ENC   0.00
ENDED 494  QUEUED                 1069  CATEGORYA     0.00        0.00    0.00  MS ENC    0.00
END/S      0.55  R/S AFFIN                 0  CATEGORYB     0.00        0.00    0.00
#SWAPS        0  INELIGIBLE                0
EXCTD         0  CONVERSION                0
                 STD DEV 10.163165

----SERVICE----   SERVICE TIME  ---APPL %---  --PROMOTED--  --DASD I/O---  ----STORAGE----  -PAGE-IN RATES-
IOC           0   CPU 2028.572  CP      0.11  BLK    0.000  SSCHRT    4.1  AVG        0.00  SINGLE      0.0
CPU       67902K  SRB    0.000  IIPCP   0.05  ENQ    0.006  RESP      0.6  TOTAL      0.00  BLOCK       0.0
MSO           0   RCT    0.000  IIP   100.47  CRM    0.000  CONN      0.1  SHARED     0.00  SHARED      0.0
SRB           0   IIT    0.000  AAPCP   0.00  LCK    0.134  DISC      0.4                   HSP         0.0
TOT       67902K  HST    0.000  AAP      N/A  SUP    0.000  Q+PEND    0.1
/SEC      75446   IIP  904.286                              IOSQ      0.0
ABSRPTN      51K  AAP      N/A
TRX SERV     51K

GOAL: DISCRETIONARY

         RESPONSE TIME    EX   PERF  AVG   --EXEC USING%--  -------------- EXEC DELAYS % -----------  -USING%-  --- DELAY % ---    %
SYSTEM                    VEL% INDX ADRSP  CPU AAP IIP I/O  TOT IIP CPU                                CRY CNT  UNK IDL CRY CNT  QUI

*ALL        --N/A--       91.5        1.5  0.1 N/A  70 0.1  6.5 6.4 0.1                                0.1 0.0   23 0.0 0.0 0.0  0.0
EE01                      92.0        0.6  0.0 N/A  72 0.0  6.3 6.3 0.0                                0.1 0.0   21 0.0 0.0 0.0  0.0
EE02                      91.1        0.8  0.1 N/A  69 0.1  6.7 6.6 0.2                                0.1 0.0   24 0.0 0.0 0.0  0.0

This says that 494 WAS transactions completed under the discretionary period 2 goal. The 15-minute interval ended at 16:00:00. Their average execution time was 3.7 seconds, and the transaction time standard deviation (STD DEV) was 10.2 seconds. The total sampled execution delays averaged 6.5%, mostly delays for zIIPs. For detailed descriptions of the fields, see https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.erbb500/wfields.htm
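As a sanity check on how these fields relate, the following sketch recomputes END/S and APPL% IIP from the report above. It assumes the INTERVAL field is in mm.ss.ttt form (minutes, seconds, milliseconds), which matches the 15-minute interval here. The helper name is hypothetical.

```python
def rmf_interval_seconds(interval: str) -> float:
    """Convert an RMF INTERVAL field such as '15.00.032' (assumed mm.ss.ttt) to seconds."""
    minutes, seconds, millis = interval.split(".")
    return int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


interval = rmf_interval_seconds("15.00.032")  # 900.032 seconds
ended = 494                                   # ENDED
iip_seconds = 904.286                         # SERVICE TIME IIP

end_per_sec = ended / interval                # transactions ended per second
iip_appl_pct = iip_seconds / interval * 100.0 # percent of one processor

print(f"END/S: {end_per_sec:.2f}")            # END/S: 0.55 (matches the report)
print(f"APPL% IIP: {iip_appl_pct:.2f}")       # APPL% IIP: 100.47 (matches the report)
```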

When setting goals for this kind of work, also consider this guidance: "Set achievable percentage response time goals. Velocity goals for application work are not meaningful and should be avoided. [...] Watch out for work that [...] has a discretionary goal."

FTP

FTP can be used to download both USS files and data sets. To download a data set, surround the data set name with apostrophes:

ftp> ascii
200 Representation type is Ascii NonPrint
ftp> get 'WITADM1.SPF1.LIST'
...

To convert character sets from EBCDIC to ASCII, use FTP ASCII mode. If the file was written on the z/OS system with an ASCII character set, then download the file using FTP BINARY mode.
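The same choice can be scripted with Python's standard ftplib module. Its retrlines method uses ASCII mode (TYPE A), and retrbinary uses BINARY mode (TYPE I). This is a sketch: the host name, credentials, and helper names are hypothetical. The EBCDIC helper assumes cp037 is close enough. Python's standard library has no IBM-1047 codec.

```python
import ftplib


def quote_data_set(dsn: str) -> str:
    """Wrap a fully qualified data set name in apostrophes for the z/OS FTP server."""
    return f"'{dsn}'"


def download(ftp: ftplib.FTP, remote: str, local_path: str, binary: bool) -> None:
    """Fetch a USS file or data set.

    binary=False: ASCII mode (TYPE A); the z/OS server converts EBCDIC to ASCII.
    binary=True:  BINARY mode (TYPE I); bytes arrive unchanged, as needed for
                  files already written in ASCII/UTF and for dumps.
    """
    if binary:
        with open(local_path, "wb") as out:
            ftp.retrbinary(f"RETR {remote}", out.write)
    else:
        with open(local_path, "w", encoding="utf-8") as out:
            # retrlines strips line terminators, so add them back locally.
            ftp.retrlines(f"RETR {remote}", lambda line: out.write(line + "\n"))


def ebcdic_to_text(data: bytes, codec: str = "cp037") -> str:
    """Decode an EBCDIC file that was downloaded in BINARY mode by mistake.

    cp037 matches IBM-1047 for letters and digits but differs for a few symbols
    such as square brackets. EBCDIC NL (0x15) decodes to U+0085, so map it to '\n'.
    """
    return data.decode(codec).replace("\x85", "\n")


# Usage (hypothetical host and credentials):
# with ftplib.FTP("zos.example.com") as ftp:
#     ftp.login("USERID", "PASSWORD")
#     download(ftp, quote_data_set("WITADM1.SPF1.LIST"), "spf1.list.txt", binary=False)
```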

Input/Output (I/O)

Ensure that DASD volumes are on the fastest available devices and use striping where appropriate.

Networking

To discover the host name, run the system command /D SYMBOLS and find the TCPIP address space name. In the TCPIP address space's job log, find the TCP/IP profile configuration data set:

PROFILE DD DISP=SHR,DSN=TCPIP.PROFILE(&SYSN.)...

In ISPF option 3.4, browse this data set. It shows the host name and IP address mapping.

Increase MAXSOCKETS and MAXFILEPROC to 64000 (http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tprf_tunetcpip.html, http://www.ibm.com/support/knowledgecenter/SSLTBW_2.2.0/com.ibm.zos.v2r2.bpxb200/mxflprc.htm)
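To check a BPXPRMxx member against this recommendation, a small script can pull out the coded values. In BPXPRMxx, MAXFILEPROC is coded as MAXFILEPROC(nnnnn). We assume the socket limit is the MAXSOCK(nnnnn) operand of each NETWORK statement. Confirm that keyword against the BPXPRMxx reference for your release. The sample member text and function names are hypothetical.

```python
import re

TARGET = 64000


def find_limits(member_text: str, keywords: list) -> dict:
    """Collect every KEYWORD(number) value for each keyword in a parmlib member."""
    # Drop /* ... */ comments so commented-out values are ignored.
    text = re.sub(r"/\*.*?\*/", "", member_text, flags=re.S)
    return {kw: [int(v) for v in re.findall(rf"\b{kw}\((\d+)\)", text)] for kw in keywords}


def check_limits(found: dict, target: int = TARGET) -> dict:
    """Report keywords that are not coded or are coded below the target."""
    status = {}
    for kw, values in found.items():
        if not values:
            status[kw] = "not coded (system default applies)"
        elif min(values) < target:
            status[kw] = f"{min(values)} is below {target}"
        else:
            status[kw] = "ok"
    return status


# Hypothetical BPXPRMxx excerpt.
SAMPLE = """
  MAXFILEPROC(64000)
  NETWORK DOMAINNAME(AF_INET) DOMAINNUMBER(2) MAXSOCK(64000) TYPE(INET)
"""
print(check_limits(find_limits(SAMPLE, ["MAXFILEPROC", "MAXSOCK"])))
```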

Consider disabling delayed acknowledgments (NODELAYACKS). Warning: whether this helps depends on the workload (see the discussion of delayed acknowledgments). (http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tprf_tunetcpip.html)

Set SOMAXCONN=511 (http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tprf_tunetcpip.html)
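SOMAXCONN typically caps the backlog an application passes to listen(). Both the application's own backlog and SOMAXCONN therefore matter. Below is a minimal Python sketch of a listener that requests a 511-entry backlog. The function name is hypothetical. Python's socket.SOMAXCONN is a compile-time constant of the Python build, not the live value from the TCP/IP profile.

```python
import socket

REQUESTED_BACKLOG = 511  # matches the SOMAXCONN=511 recommendation


def open_listener(host: str = "127.0.0.1", port: int = 0,
                  backlog: int = REQUESTED_BACKLOG) -> socket.socket:
    """Create a listening TCP socket. The stack may silently reduce the backlog
    to its SOMAXCONN setting, so raise SOMAXCONN as well as the application value."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))  # port 0 lets the stack pick a free port
    sock.listen(backlog)
    return sock


# Usage:
# listener = open_listener(port=9080)
# ... accept connections ...
# listener.close()
```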

Monitoring dispatch requests: http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/tprf_monitor_dispatch_requests.html

Run the TSO command HOMETEST (for example, from ISPF option 6, Command) to display the IP host name and address.

DNS

The resolver's DNS lookup timeout (which Java uses) defaults to 5 seconds.
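To see whether lookups are approaching that timeout, you can time them through the system resolver, which Java also uses by default. This sketch assumes Python is available where the lookups happen. The threshold heuristic is our assumption, not a documented rule. It says that a lookup taking at least one full timeout period probably waited on an unresponsive name server. The function names are hypothetical.

```python
import socket
import time

RESOLVER_TIMEOUT_S = 5.0  # default DNS lookup timeout cited in the text


def time_lookup(host: str) -> tuple:
    """Resolve host through the system resolver; return (elapsed seconds, addresses)."""
    start = time.monotonic()
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    elapsed = time.monotonic() - start
    # info[4] is the sockaddr tuple; its first element is the address string.
    return elapsed, sorted({info[4][0] for info in infos})


def looks_like_timeout(elapsed: float, timeout: float = RESOLVER_TIMEOUT_S) -> bool:
    """Heuristic: a lookup lasting at least one timeout period likely hit a timeout."""
    return elapsed >= timeout


# Usage (hypothetical host name):
# elapsed, addrs = time_lookup("appserver.example.com")
# print(f"{elapsed:.3f}s {addrs} timeout suspected: {looks_like_timeout(elapsed)}")
```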

TCP Congestion Control

Review the background on TCP congestion control.

Consider tuning TCP/IP buffer sizes using TCPCONFIG; for example:

  • TCPSENDBFRSIZE=131070
  • TCPRCVBUFRSIZE=131070
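One way to reason about these sizes is the bandwidth-delay product. A TCP window smaller than bandwidth multiplied by round-trip time limits a connection's throughput on that path. For example, 1 Gbit/s with a 1 ms round trip is 125,000 bytes, close to the 131070 values above. The sketch below computes that figure. It also shows how an application can override the stack-wide defaults for one socket with SO_SNDBUF and SO_RCVBUF. The function names and the example link speed are assumptions. Some stacks (Linux, for example) report back a different value than the one requested.

```python
import socket

TCP_BUFFER_BYTES = 131070  # example TCPSENDBFRSIZE / TCPRCVBUFRSIZE value


def bandwidth_delay_product(bits_per_second: float, rtt_seconds: float) -> float:
    """Bytes in flight needed to keep a path full: bandwidth x RTT, in bytes."""
    return bits_per_second * rtt_seconds / 8


def max_throughput_bits(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on one connection's throughput for a given window and RTT."""
    return window_bytes * 8 / rtt_seconds


def apply_buffers(sock: socket.socket, size: int = TCP_BUFFER_BYTES) -> tuple:
    """Request per-socket buffer sizes and return (send, receive) as reported back."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))


print(bandwidth_delay_product(1e9, 0.001))  # 125000.0 bytes for 1 Gbit/s, 1 ms RTT
print(max_throughput_bits(TCP_BUFFER_BYTES, 0.001))  # about 1.05e9 bits/s ceiling
```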

netstat

netstat may be used to list sockets from USS. The -A flag provides details such as send and receive queues. For example:

$ netstat -A | head -50
MVS TCP/IP NETSTAT CS V2R4       TCPIP Name: TCPIP           17:14:53
Client Name: BBODMGR                  Client Id: 0000043D
  Local Socket: ::ffff:9.57.7.207..6001                             
  Foreign Socket: ::ffff:9.57.7.207..1550                             
    BytesIn:            00000000000037507439
    BytesOut:           00000000000000240750
    SegmentsIn:         00000000000000001232
    SegmentsOut:        00000000000000001230
    StartDate:          01/20/2021       StartTime:          20:30:14
    Last Touched:       21:10:34         State:              Establsh
    RcvNxt:             1627254963       SndNxt:             0500396862
    ClientRcvNxt:       1627254963       ClientSndNxt:       0500396862
    InitRcvSeqNum:      1589747523       InitSndSeqNum:      0500156111
    CongestionWindow:   0000130966       SlowStartThreshold: 0000065535
    IncomingWindowNum:  1627517107       OutgoingWindowNum:  0500659006
    SndWl1:             1627248788       SndWl2:             0500396862
    SndWnd:             0000262144       MaxSndWnd:          0000262144
    SndUna:             0500396862       rtt_seq:            0500156111
    MaximumSegmentSize: 0000065483       DSField:            00
    Round-trip information: 
      Smooth trip time: 0.000            SmoothTripVariance: 0.000     
    ReXmt:              0000000000       ReXmtCount:         0000000000
    DupACKs:            0000000000       RcvWnd:             0000262144
    SockOpt:            8C00             TcpTimer:           00
    TcpSig:             05               TcpSel:             40
    TcpDet:             F8               TcpPol:             00
    TcpPrf:             81               TcpPrf2:            22
    TcpPrf3:            00
    DelayAck:           Yes    
    QOSPolicy:          No 
    RoutingPolicy:      No 
    ReceiveBufferSize:  0000262144       SendBufferSize:     0000262144
    ReceiveDataQueued:  0000000000
    SendDataQueued:     0000000000
    SendStalled:        No 
    Ancillary Input Queue: N/A
----
Client Name: BBODMGR                  Client Id: 0000021D
  Local Socket: ::..9809                                            
  Foreign Socket: ::..0                                               
[...]

Resource Recovery Service (RRS)

RRS is used to guarantee transactional support.

For best throughput, use coupling facility (CF) logger for the RRS log.

Ensure that your CF logger configuration is optimal by using SMF 88 records.

Set adequate default values for the LOGR policy.

If you don't need the archive log, you should eliminate it since it can introduce extra DASD I/Os. The archive log contains the results of completed transactions. Normally, the archive log is not needed.

http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/rprf_tunezrrs.html

SVCDUMPs, SYSTDUMPs

Issue the following command to start dump processing:

/DUMP COMM='Dump Description'
83 IEE094D SPECIFY OPERAND(S) FOR DUMP COMMAND

In this case, 83 is the reply ID of the WTOR (write to operator with reply). Use it to reply to the system with the dump parameters.

To reply with the appropriate dump parameters, you need the address space ID (ASID) of the address space you want to dump. There are other options for dumping address spaces; however, this section dumps one address space at a time. To find the ASIDX, go to SDSF.DA (page right with F11).

The template for replying to a dump for a WebSphere address space, where [xx] is the reply ID and [yy] is the ASID: R [xx],ASID=([yy]),SDATA=(RGN,TRT,CSA,NUC,PSA,GRSQ,LPA,SQA,SUM)

For example, to dump the servant with ASIDX 16D, enter the following reply (in SDSF.LOG):

/R 83,ASID=(16D),SDATA=(RGN,TRT,CSA,NUC,PSA,GRSQ,LPA,SQA,SUM)
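The sketch below builds the reply text for one address space from the template above. It substitutes the reply ID and ASID for the [xx] and [yy] placeholders and rejects an ASID that is not 1 to 4 hex digits. The function name is hypothetical. The leading / is the SDSF prefix for entering a system command.

```python
import re

# SDATA options from the WebSphere template above.
WAS_SDATA = ("RGN", "TRT", "CSA", "NUC", "PSA", "GRSQ", "LPA", "SQA", "SUM")


def dump_reply(reply_id: int, asid: str, sdata: tuple = WAS_SDATA) -> str:
    """Build the reply to IEE094D for a single address space."""
    if not re.fullmatch(r"[0-9A-Fa-f]{1,4}", asid):
        raise ValueError(f"ASID must be 1-4 hex digits, got {asid!r}")
    return f"R {reply_id},ASID=({asid.upper()}),SDATA=({','.join(sdata)})"


print("/" + dump_reply(83, "16D"))
# /R 83,ASID=(16D),SDATA=(RGN,TRT,CSA,NUC,PSA,GRSQ,LPA,SQA,SUM)
```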

After 2 minutes or so the following appears:

IEF196I IEF285I   SYS1.DUMP.D111011.T193242.S34.S00005         CATALOGED
IEF196I IEF285I   VOL SER NOS= XDUMP8.
IEA611I COMPLETE DUMP ON SYS1.DUMP.D111011.T193242.S34.S00005 646

The data set named in the IEA611I COMPLETE DUMP ON message can be downloaded using FTP BINARY mode.

svcdump.jar

svcdump.jar is an "AS IS" utility that can process SVCDUMPs and print various information: https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=diagjava

Examples:

  • Print threads: java -cp svcdump.jar com.ibm.zebedee.dtfj.PrintThreads <dumpname>
  • Extract PHD heapdump: java -cp svcdump.jar com.ibm.zebedee.commands.Convert -a <asid>

Security

When a SAF (RACF or equivalent) class is active, the number of profiles in a class will affect the overall performance of the check. Placing these profiles in a (RACLISTed) memory table will improve the performance of the access checks. Audit controls on access checks also affect performance. Usually, you audit failures and not successes.

Use a minimum number of EJBROLEs on methods.

If using Secure Sockets Layer (SSL), select the lowest level of encryption consistent with your security requirements. WebSphere Application Server enables you to select which cipher suites you use. The cipher suites dictate the encryption strength of the connection. The higher the encryption strength, the greater the impact on performance.

Use RACLIST to place profiles in memory where doing so improves performance. Specifically, ensure that you RACLIST the following classes (if used): CBIND, EJBROLE, SERVER, STARTED, FACILITY, SURROGAT.

If you are a heavy SSL user, ensure that you have appropriate hardware, such as PCI crypto cards, to speed up the handshake process.

Here's how you define the BPX.SAFFASTPATH facility class profile. This profile allows you to bypass SAF calls which can be used to audit successful shared file system accesses.

Define the facility class profile to RACF.

RDEFINE FACILITY BPX.SAFFASTPATH UACC(NONE)

Activate this change by doing one of the following:

  • Re-IPL.
  • Invoke the SETOMVS or SET OMVS operator command.

Use VLF caching of the UIDs and GIDs

Do not enable global audit ALWAYS on the RACF (SAF) classes that control access to objects in the UNIX file system. If audit ALWAYS is specified in the SETR LOGOPTIONS for RACF classes DIRACC, DIRSRCH, FSOBJ or FSSEC, severe performance degradation occurs. If auditing is required, audit only failures using SETR LOGOPTIONS, and audit successes for only selected objects that require it. After changing the audit level on these classes, always verify that the change has not caused an unacceptable impact on response times and/or CPU usage.

http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/rprf_tunezsec.html

Global Resource Serialization (GRS)

Check global resource contention: /D GRS,C

ISG343I 16.57.02 GRS STATUS 300
NO ENQ RESOURCE CONTENTION EXISTS
NO REQUESTS PENDING FOR ISGLOCK STRUCTURE
NO LATCH CONTENTION EXISTS

WebSphere Application Server for z/OS uses global resource serialization (GRS) to communicate information between servers in a sysplex... WebSphere Application Server for z/OS uses GRS to determine where the transaction is running.

WebSphere Application Server for z/OS uses GRS enqueues in the following situations:

  • Two-phase commit transactions involving more than one server
  • HTTP sessions in memory
  • Stateful EJBs
  • "Sticky" transactions to keep track of pseudo-conversational states

If you are not in a sysplex, configure GRS=NONE; if you are in a sysplex, configure GRS=STAR. GRS=STAR requires configuring GRS to use the coupling facility.

http://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/tprf_tunezgrs.html

z/VM

Memory Overcommit

In this document, we will define [overcommit] as the total of the virtual memory of the started (logged on) virtual machines to the total real memory available to the z/VM system.

When planning whether memory can be overcommitted in a z/VM LPAR, the most important thing is to understand the usage pattern and characteristics of the applications, and to plan for the peak period of the day. This will allow you to plan the most effective strategy for utilizing your z/VM system's ability to overcommit memory while meeting application-based business requirements.

For z/VM LPARs where all started guests are heavily-used production WAS servers that are constantly active, no overcommitment of memory should be attempted.

In other cases where started guests experience some idle time, overcommitment of memory is possible.

http://www.vm.ibm.com/perf/tips/memory.html
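The overcommit ratio defined above is simple arithmetic. Here is a minimal sketch with hypothetical guest sizes. Four guests defined with 16, 16, 8, and 8 GB of virtual memory on an LPAR with 32 GB of real memory give 48 / 32 = 1.5, that is, 50% overcommitted. A ratio of 1.0 or less means no overcommitment. That is what the guidance above recommends for constantly active production WAS guests. The function name is hypothetical.

```python
def overcommit_ratio(guest_virtual_gb: list, real_gb: float) -> float:
    """Total virtual memory of logged-on guests divided by real memory available to z/VM."""
    if real_gb <= 0:
        raise ValueError("real memory must be positive")
    return sum(guest_virtual_gb) / real_gb


print(overcommit_ratio([16, 16, 8, 8], 32))  # 1.5
```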

zLinux

Although a bit old, review https://public.dhe.ibm.com/software/dw/linux390/perf/gm13-0635-00.pdf

Hardware Counters

It is generally recommended to activate HIS and collect hardware counters:

S HIS
MODIFY HIS,BEGIN,CTR=HDWR,CNTFILE=NO,CTRONLY
F HIS,B,TT='Text',CTRONLY,CTR=ALL,SI=SYNC,CNTFILE=NO
D HIS