Windows

Windows Recipe

  1. CPU core(s) should not be consistently saturated.
  2. Generally, physical memory should never be saturated and the operating system should not page memory out to disk.
  3. Input/Output interfaces such as network cards and disks should not be saturated, and should not have poor response times.
  4. TCP/IP and network tuning, whilst sometimes complicated to investigate, may have dramatic effects on performance.
  5. Consider changing Processor Performance Management (PPM) to the "High Performance" setting or disabling it.
  6. Operating system level statistics and optionally process level statistics should be periodically monitored and saved for historical analysis.
  7. Review operating system logs for any errors, warnings, or high volumes of messages.
  8. Review snapshots of process activity, and for the largest users of resources, review per thread activity.
  9. If the operating system is running in a virtualized guest, review the configuration and whether or not resource allotments are changing dynamically.
  10. Use Perfmon to review performance activity.
  11. Use the Windows Performance Toolkit to review sampled native processor usage.
  12. If there is sufficient network capacity for the additional packets, consider reducing the default TCP keepalive timer (HKLM\System\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime) from 2 hours to a value less than intermediate device idle timeouts (e.g. firewalls).
  13. Test disabling delayed ACKs.

Also review the general topics in the Operating Systems chapter.
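As a rough illustration of recipe item 12, the following Python sketch picks a KeepAliveTime value (in milliseconds) that stays below an intermediate device's idle timeout. The function name and the 0.5 safety factor are assumptions made for this example, not documented recommendations; choose a margin that suits your network.

```python
DEFAULT_KEEPALIVE_MS = 2 * 60 * 60 * 1000  # Windows default KeepAliveTime: 2 hours


def keepalive_time_ms(device_idle_timeout_s: float, safety_factor: float = 0.5) -> int:
    """Suggest a KeepAliveTime (ms) below an intermediate device's idle timeout.

    safety_factor is a hypothetical margin chosen for illustration only.
    """
    if device_idle_timeout_s <= 0:
        raise ValueError("idle timeout must be positive")
    if not 0 < safety_factor < 1:
        raise ValueError("safety_factor must be between 0 and 1")
    candidate = int(device_idle_timeout_s * 1000 * safety_factor)
    # Never suggest a value larger than the operating system default.
    return min(candidate, DEFAULT_KEEPALIVE_MS)


if __name__ == "__main__":
    # Example: a firewall that drops idle connections after one hour.
    print(keepalive_time_ms(3600))
```

The result is the REG_DWORD value to consider for KeepAliveTime, provided the network has capacity for the additional keepalive packets.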

General

Check the Windows Event log (eventvwr.exe) for any warnings, error messages, or repeated informational messages.

Microsoft performance tuning guidelines by server version: https://msdn.microsoft.com/en-us/library/windows/hardware/dn529134

Command Prompt

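Recursive search for a file pattern (inside a batch file, write %%i instead of %i; "delims=" keeps paths that contain spaces intact):

> @echo off
> for /F "usebackq delims=" %i in (`dir /s /b *.pdb`) do echo %i
> @echo on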

Windows Registry

Many operating system settings are changed in the Windows registry. To open the registry, execute regedit.exe.

We recommend periodically backing up the registry, particularly before any significant changes:

  • File > Export
  • Export Range=All
  • Save as some file.reg

Performance Monitor (Perfmon)

Perfmon is the generally recommended tool for Windows performance analysis.

"Windows Performance Monitor is a Microsoft Management Console (MMC) snap-in that provides tools for analyzing system performance. From a single console, you can monitor application and hardware performance in real time, customize what data you want to collect in logs, define thresholds for alerts and automatic actions, generate reports, and view past performance data in a variety of ways." (https://technet.microsoft.com/en-us/library/cc749154.aspx)

By default, counter instance names do not include the process ID, so multiple java processes appear as java, java#1, java#2, and so on. If one process ends, the instances with higher indexes are renumbered, so a counter may silently start reporting on a different process. It is recommended to change to the PID format (https://support.microsoft.com/kb/281884):

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PerfProc\Performance
DWORD ProcessNameFormat=2

No restarts of the machine or Java are required - just restart Perfmon if it was open.

View Live Data

In the left pane, select Performance > Monitoring Tools > Performance Monitor.

This will show a live graph of % Processor Time.

Some useful tips:

  • To delete a counter, select the row in the bottom table and click Delete.
  • Click the pencil toggle button to highlight the currently selected counter.
  • By default, all counter values are scaled between 0 and 100. You can see if values are scaled by looking at the Scale column.
  • "Last" is the last sampled value ("Minimum" and "Maximum" are also useful).
  • "Average" is the average of all sampled values.
  • "Duration" is the amount of time (rolling) that Perfmon will capture and display data. To extend this, right click on the graph > Properties > General > Duration = X seconds
  • There are more options in the properties dialog that are worth exploring.

To add a counter, click the green plus icon:

Select a counter and the instances and click Add >>. In general, select <All instances> to ensure you get all the data. For example, if you select Process > % Processor time and you select <All instances>, if a process is spawned after data collection starts, it will be captured.

The instances are a way to look at counters in a more granular way. For example, the 0 and 1 instances above correspond to the two processors on this machine. If we select _Total, we will get the average of both processors. If we select <All instances>, this is a convenience and it is equivalent to multi-selecting _Total, 0, and 1.

Check "Show description" to better understand each counter.

Logging Perfmon Data to Files

For historical analysis of system metrics, configure Microsoft Perfmon to log statistics to files:

  1. Start perfmon.exe
  2. Performance > Monitoring Tools > Right Click on Performance Monitor > New > Data Collector Set, or Performance > Data Collector Sets > Right Click on User Defined > New > Data Collector Set
  3. Specify any name, select "Create manually (Advanced)" and click Next
  4. Under the "Create data logs" section, select the Performance counter box, and click Next.
  5. In the "Performance counters:" section, click the "Add" button. Select each of the following counters:
    1. Expand Processor > % Interrupt Time, % Privileged Time, % Processor Time, % User Time > All instances > Add
    2. Expand Network Interface > Bytes Received/sec, Bytes Sent/sec, Output Queue Length, Packets Outbound Discarded, Packets Outbound Errors, Packets Received Discarded, Packets Received Errors > All instances
    3. Expand Process > % Privileged Time, % Processor Time, % User Time, IO Data Bytes/sec, IO Data Operations/sec, IO Other Bytes/sec, IO Other Operations/sec > All instances
    4. Expand Thread > % Processor Time, ID Process, ID Thread > All instances
    5. Expand LogicalDisk > % Disk Read Time, % Disk Write Time, % Free Space, % Idle Time, Avg. Disk Bytes/Read, Avg. Disk Bytes/Write, Avg. Disk sec/Read, Avg. Disk Read Queue Length, Avg. Disk sec/Write, Avg. Disk Write Queue Length, Disk Read Bytes/sec, Disk Reads/sec, Disk Write Bytes/sec, Disk Writes/sec > All instances
    6. Expand Memory > Available MBytes, Cache Bytes, Cache Faults/sec, Committed Bytes, Free System Page Table Entries, Page Faults/sec, Pages Input/sec, Pages Output/sec, Pool Nonpaged Bytes, Pool Paged Bytes, System Cache Resident Bytes
    7. Expand Paging File > % Usage
    8. Expand System > File Control Bytes/sec, File Control Operations/sec, File Data Operations/sec, File Read Bytes/sec, File Read Operations/sec, File Write Bytes/sec, File Write Operations/sec, Processor Queue Length, System Calls/sec
  6. Change the "Sample Interval" to 30 seconds and click Next.
  7. In the "Where would you like the data to be saved?" section, change the path for the Perfmon files if you would like, click Next.
  8. In the "Create the data collector set?" section, click on the Finish button.
  9. Ensure that the directory where the Perfmon files will be written has sufficient space.
  10. Start the collection by right clicking and clicking Start.
  11. After the test is complete, click Stop.
  12. Gather *.blg from the output directory

For similar instructions and screenshots, see https://www.ibm.com/support/pages/node/411769

Load Existing Logs into Perfmon

  1. In the left pane, select Performance > Monitoring Tools > Performance Monitor.
  2. Select the icon for View Log Data.
  3. Select Log files: and click Add... and browse to the location of the Perfmon blg log files.
  4. Click Add to select from the available counters in the data.

typeperf

The Windows typeperf command allows for simple access to performance counters from the command line: https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/typeperf
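When typeperf output is saved as CSV, it can be summarized with a short script. The following Python sketch averages each counter column. It assumes the usual layout in which the first column is a timestamp and the remaining columns are counter values; the host name and sample values are made up for illustration.

```python
import csv
import io
from statistics import mean

# Made-up sample in the CSV layout assumed above (hypothetical host and values).
SAMPLE = r'''"(PDH-CSV 4.0)","\\HOST\Processor(_Total)\% Processor Time"
"02/05/2014 14:24:04.000","10.000000"
"02/05/2014 14:24:34.000","20.000000"
"02/05/2014 14:25:04.000"," "
'''


def average_counters(csv_text: str) -> dict:
    """Return the mean of every counter column, skipping blank samples."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    samples = {name: [] for name in header[1:]}  # header[0] is the timestamp
    for row in reader:
        for name, value in zip(header[1:], row[1:]):
            try:
                samples[name].append(float(value))
            except ValueError:
                continue  # a blank or non-numeric sample
    return {name: mean(values) for name, values in samples.items() if values}


if __name__ == "__main__":
    for counter, average in average_counters(SAMPLE).items():
        print(counter, average)
```

A file of this kind might be produced with something like typeperf "\Processor(_Total)\% Processor Time" -si 30 -sc 10 -o cpu.csv; check the typeperf documentation linked above for the options available on your version.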

Central Processing Unit (CPU)

The key Perfmon counters are Processor > % Interrupt Time, % Privileged Time, % Processor Time, % User Time > <All instances>. Note that the Process > % Processor Time counter for a particular process is the sum across all CPUs rather than a percentage of total CPU capacity, so if a process is using 2 CPUs at 100%, the sampled value will be 200.

"Where the "_Total" line reaches 100%, the Java process probably became constrained on CPU. If all the CPU is being used by the Java process, the performance is being limited by the machine. If another process is taking large amounts of CPU at those points in time, CPU contention is limiting the performance of the Java process." (Old Java Diagnostic Guide)
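Because a per-process sample can exceed 100 (for example, 200 for two fully used CPUs), it is often convenient to express it as a share of the whole machine. A minimal Python sketch, with a hypothetical function name and an example CPU count:

```python
def share_of_machine(process_percent: float, logical_cpus: int) -> float:
    """Convert a per-process % Processor Time sample into a share of all CPUs.

    process_percent may exceed 100 because it is summed across CPUs.
    """
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return process_percent / logical_cpus


if __name__ == "__main__":
    # A process using 2 CPUs at 100% on a hypothetical 8-CPU machine.
    print(share_of_machine(200.0, 8))
```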

Per-Thread CPU Usage

With the Perfmon Thread counters, identify the threads that are using high CPU and convert the "ID Thread" value to hexadecimal. On IBM Java, if a thread dump was taken during these high CPU times, search the javacore file for the hexadecimal identifier to find the Java stack.

The reason for generating per-thread CPU usage information about the Java process is to understand what is happening to the process. The Java process might be deadlocked if all the threads are taking little or no CPU time. Points of contention or delay might be in the Java process if it does not take all the available CPU, even though the CPU usage is spread evenly over a number of threads in the process. This CPU usage pattern might also indicate a scalability issue. Finally, you might have a looping problem if the Java CPU usage is approaching 100% and a small number of the threads account for all of that CPU usage. The threads using the most process time might be looping. When you find some threads of interest, note the ID Thread values. Convert the values to hexadecimal, and look for the threads in the thread stack trace of the javacore.txt file. This trace helps you to determine if the thread is part of a thread pool and to understand what kind of work the thread performs. For example, an ID thread of 9244 becomes 241C in hexadecimal and is found in the "native ID" value in the javacore.txt file.

Perfmon counters: "% Processor Time", "ID Thread", and any other counters in which you are interested for all the Java thread instances

Old Java Diagnostic Guide
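The decimal-to-hexadecimal conversion and the javacore search can be sketched in Python as follows. The function names are hypothetical and the sample javacore line is only illustrative of where a native ID might appear; real javacore files vary by Java version.

```python
import re


def thread_id_to_hex(thread_id: int) -> str:
    """Convert a decimal Perfmon "ID Thread" value to uppercase hexadecimal."""
    return format(thread_id, "X")


def find_native_id_lines(javacore_lines, thread_id: int):
    """Return the lines that mention the thread's hexadecimal native ID."""
    # Allow leading zeros and require a word boundary so that 0x241C does
    # not also match a longer value such as 0x241C0.
    pattern = re.compile(r"0x0*" + thread_id_to_hex(thread_id) + r"\b", re.IGNORECASE)
    return [line for line in javacore_lines if pattern.search(line)]


if __name__ == "__main__":
    # Illustrative lines only; not taken from a real javacore.
    sample = [
        '3XMTHREADINFO1 (native thread ID:0x241C, native priority:0x5)',
        '3XMTHREADINFO1 (native thread ID:0x241C0, native priority:0x5)',
    ]
    print(thread_id_to_hex(9244))
    print(find_native_id_lines(sample, 9244))
```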

PsList

An alternative tool is pslist which is part of pstools: https://technet.microsoft.com/en-us/sysinternals/bb896682.aspx. See also http://www-01.ibm.com/support/docview.wss?uid=swg21304776

In most modes, you can filter the results by passing a process name prefix (such as java) or a PID at the end of the command.

No arguments prints the accumulated CPU time of each process and the elapsed time each process has been running:

> pslist.exe

Process information for ADMINIB-I6CU78U:
 
Name                Pid Pri Thd  Hnd   Priv        CPU Time    Elapsed Time
Idle                  0   0   2    0      0    11:08:07.609     0:00:00.000
System                4   8  82 4062    108     0:01:36.500     5:41:15.690
smss                240  11   2   30    440     0:00:01.484     5:41:13.940
csrss               316  13   9  871   2324     0:00:02.312     5:40:51.518...
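The CPU Time and Elapsed Time columns can be combined into a rough lifetime average. The Python sketch below parses the H:MM:SS.mmm values shown above and reports the average as a percentage of one CPU; the function names are hypothetical.

```python
def parse_duration(text: str) -> float:
    """Parse a pslist H:MM:SS.mmm value into seconds."""
    hours, minutes, seconds = text.strip().split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def average_cpu_percent(cpu_time: str, elapsed_time: str) -> float:
    """Average CPU use over the process lifetime, as a percentage of one CPU."""
    elapsed = parse_duration(elapsed_time)
    if elapsed == 0:
        return 0.0  # e.g. the Idle row reports no elapsed time
    return 100.0 * parse_duration(cpu_time) / elapsed


if __name__ == "__main__":
    # Values from the System row in the output above.
    print(average_cpu_percent("0:01:36.500", "5:41:15.690"))
```

This is only a lifetime average; it says nothing about when the CPU time was consumed, which is why the sampled views described below are usually more useful.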

The pslist argument -s shows an auto-updating view similar to task manager (similar to the top command on Unix platforms):

> pslist -s
2:24:04 PM 2/5/2014 Process information for ADMINIB-I6CU78U:
 
Name                Pid CPU Thd  Hnd   Priv        CPU Time    Elapsed Time
Idle                  0  97   2    0      0    11:15:27.906     5:45:06.985
pslist             4348   3   2  155   2840     0:00:02.015     0:00:30.546
smss                240   0   2   30    440     0:00:01.484     5:45:05.235
csrss               316   0   9  847   2324     0:00:02.312     5:44:42.813
csrss               364   0   8  403   2504     0:00:01.234     5:44:41.250
wininit             372   0   3   77   1472     0:00:00.234     5:44:41.250
winlogon            404   0   3  113   2728     0:00:00.265     5:44:41.188...

The pslist argument -t shows a tree view of process ownership:

> pslist -t
 
Process information for ADMINIB-I6CU78U:
 
Name                             Pid Pri Thd  Hnd      VM      WS    Priv
Idle                               0   0   2    0       0      24       0
  System                           4   8  82 4030    3380     300     108
    smss                         240  11   2   30    4024    1100     440
java                            2684   8  87  989  816720  315196  294696
csrss                            316  13   9  853   50260    4780    2324
csrss                            364  13   8  406  210896   12332    2504
  conhost                       3484   8   2   79   77380    9916    4400
wininit                          372  13   3   77   49392    4452    1472
  services                       460   9   9  248   45168    9796    6204
    svchost                      580   8  10  362   46512    9492    3832
      WmiPrvSE                  2152   8   7  339   80312   16304    8368
      ProtectionUtilSurrogate   4036   8  10  289   98168   13184    4304...

The pslist argument -d prints the accumulated CPU times of each thread as well as the elapsed times the threads have existed:

> pslist -d java
 
Thread detail for ADMINIB-I6CU78U:
 
java 2684:
 Tid Pri    Cswtch            State     User Time   Kernel Time   Elapsed Time
2688   9         6     Wait:UserReq  0:00:00.000   0:00:00.000    5:47:24.155
2696   9      8465     Wait:UserReq  0:00:07.515   0:00:06.906    5:47:24.155
2700   8        22     Wait:UserReq  0:00:00.000   0:00:00.000    5:47:24.155
2704  15      8401     Wait:UserReq  0:00:08.921   0:00:02.203    5:47:24.092
2716  15   1146663     Wait:UserReq  0:00:00.000   0:00:00.000    5:47:23.733
2720   9     33519     Wait:UserReq  0:00:00.578   0:00:00.468    5:47:23.733...

The pslist argument -x is the same as -d but also prints memory information about the process (to get processes' memory without threads, use -m):

> pslist -x java
 
Process and thread information for ADMINIB-I6CU78U:
 
Name                Pid      VM      WS    Priv Priv Pk   Faults   NonP Page
java               2684  816720  315328  304244  313384   159552    633  257
 Tid Pri    Cswtch            State     User Time   Kernel Time   Elapsed Time
2688   9         6     Wait:UserReq  0:00:00.000   0:00:00.000    5:47:41.686
2696   9      8465     Wait:UserReq  0:00:07.515   0:00:06.906    5:47:41.686
2700   8        22     Wait:UserReq  0:00:00.000   0:00:00.000    5:47:41.686
2704  15      8402     Wait:UserReq  0:00:08.937   0:00:02.203    5:47:41.624
2716  15   1146681     Wait:UserReq  0:00:00.000   0:00:00.000    5:47:41.264...

Windows Performance Toolkit (WPT)

The Windows Performance Toolkit is a free tool from Microsoft that provides various dimensions of performance analysis: https://docs.microsoft.com/en-us/windows-hardware/test/wpt/

Installation

  1. Download Windows Assessment and Deployment Kit (Windows ADK)
  2. On the "Select the features you want to install" screen, only "Windows Performance Toolkit" is required.
  3. On 64-bit Windows 7 and Windows Server 2008 (but not newer versions such as Windows 8 and Windows Server 2012), add the following registry entry and reboot:
    REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f

Collect Data

There are two main ways to collect data (ETL file):

  1. GUI:
    1. Start C:\Program Files*\Windows Kits\*\Windows Performance Toolkit\WPRUI.exe as Administrator (replace with the correct path to WPRUI.exe)
    2. Leave the defaults of Performance Scenario=General, Detail level=Verbose, Logging mode=Memory. These buffer data to memory, so available RAM is needed. There are also options to flush to files.
    3. Check Resource Analysis > CPU usage
    4. Click Start
    5. Reproduce the problem for at least a few minutes
    6. Click Save to stop
  2. Command line:
    1. Start command prompt as Administrator
    2. Start collection (replace with the correct path):
      "C:\Program Files*\Windows Kits\*\Windows Performance Toolkit\xperf.exe" -on SysProf -stackwalk Profile -BufferSize 1024 -MaxBuffers 320
    3. These options buffer data to memory, so available RAM is needed. There are also options to flush to files.
    4. Reproduce the problem for at least a few minutes
    5. Stop collection (replace with the correct path to xperf.exe):
      "C:\Program Files*\Windows Kits\*\Windows Performance Toolkit\xperf.exe" -d callstacks.etl

By default, WPT data is written to %HOMEPATH%\Documents\WPR Files\*.etl. When clicking the "Start" button, the old collection files are not overwritten.

Also consider UIforETW.

Analyze Data

To view an ETL file, follow these three steps:

  1. Windows Performance Analyzer (WPA.exe %ETL%)

  2. Trace > Configure Symbol Paths
    If .NET code was running at the time of the capture, an NGENPDB folder will be automatically created under %HOMEPATH%\Documents\WPR Files\ with the name of the .etl file. If you may need to investigate .NET code, copy this path (which WPA automatically includes in its default symbol path) and add it to the end of the final symbol path.

    Example: C:\work\WAS8554_20140924\java\jre\bin\;C:\work\WAS8554_20140924\java\jre\bin\compressedrefs\;C:\work\WAS8554_20140924\lib\native\win\x86_64\;srv*C:\Symbols*http://msdl.microsoft.com/download/symbols;

  3. Trace > Load Symbols

Absolute Times

WPA shows all data in relative terms (seconds). Unfortunately, there doesn't appear to be an option to use absolute timestamps. To determine when the tracing started:

  1. Click Trace > System Configuration
  2. Click Traces
  3. Review Start Time (UTC)

The default ETL file name will include the date and time in local format, but this appears to be roughly the time the trace is requested to be stopped.

It is common to see a delay of about 200 seconds between the start of the capture and the availability of some data (presumably while the kernel is initializing tracing).
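Since WPA only shows relative times, an absolute timestamp has to be computed by hand from the Start Time (UTC) value. A minimal Python sketch, where the start time and offset are made-up example inputs:

```python
from datetime import datetime, timedelta, timezone


def absolute_time(trace_start_utc: datetime, offset_seconds: float) -> datetime:
    """Add a WPA relative offset (seconds) to the trace's Start Time (UTC)."""
    return trace_start_utc + timedelta(seconds=offset_seconds)


if __name__ == "__main__":
    # Hypothetical Start Time (UTC) copied from Trace > System Configuration > Traces.
    start = datetime(2014, 9, 24, 14, 0, 0, tzinfo=timezone.utc)
    # An event shown at 200 seconds on the WPA timeline.
    print(absolute_time(start, 200).isoformat())
```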

CPU Analysis

  1. Expand Computation > CPU Usage (Sampled) > Utilization by Process, Stack

Flamegraphs can also be generated: https://randomascii.wordpress.com/2013/03/26/summarizing-xperf-cpu-usage-with-flame-graphs/

CPU Usage by Thread

  1. Expand Computation > CPU Usage (Attributed) > Utilization by Process, Thread, Activity *

Disk Analysis

  1. Expand Storage > Disk Usage > Service Time by Process, Path Name, Stack
  2. The disk times are in microseconds (https://blogs.technet.microsoft.com/b/robertsmith/archive/2012/02/07/analyzing-storage-performance-using-the-windows-performance-toolkit.aspx).

Analyzing on Another Machine

  1. Gather the etl file from %HOMEPATH%\Documents\WPR Files\
  2. Gather all *.pdb files from the WebSphere folder.
  3. If .NET code was running at the time of the capture, an NGENPDB folder will be automatically created under %HOMEPATH%\Documents\WPR Files\ with the name of the .etl file. If it may be necessary to investigate .NET code, also gather this folder.

TPROF

The open source performance inspector suite (originally created by IBM) includes a native Windows sampling profiler called TPROF: http://perfinsp.sourceforge.net/tprof.html

This is a great way to understand which native modules are using the CPU and it is requested as part of the IBM Performance MustGather on Windows: http://www-01.ibm.com/support/docview.wss?uid=swg21111364

The reason this tool is so useful is that it is a sampling profiler (see the Java Profilers chapter for background on this topic). It will sample the native stacks of the processes approximately every 7 milliseconds. This tends to be a very low overhead (less than a few percent) way to get insight into CPU usage without dramatically impacting the system. In general, TPROF can be used in production environments, although you should fully test this in a test environment first.

The instructions to install and use TPROF are quite straightforward: http://www-01.ibm.com/support/docview.wss?uid=swg21403450

Currently, TPROF does not work on Windows Server >= 2012.

Install with tinstall.cmd

Run with:

> setrunenv.cmd
> run.tprof.cmd
Press ENTER to start capturing data
Reproduce the problem
Press ENTER again to stop capturing data
Open tprof.out to analyze the results (see the TechNote above for a description of the various sections)

For example, in one case we were led to investigate some third party drivers by seeing a significant amount of CPU usage in the kernel (and other modules that are not shown here for confidentiality):

PID 695 51.00 java.exe_0c8c
  MOD 320 20.46 C:\Windows\system32\ntoskrnl.exe

Processor Performance Management (PPM)

Processor Performance Management (PPM) is a power saving feature. It may be changed to the "High Performance" setting: https://technet.microsoft.com/en-us/library/dd744398%28v=ws.10%29.aspx

A common symptom in profilers such as TPROF is a high CPU usage in, for example, the intelppm.sys driver:

   LAB    TKS   %%%     NAMES
    MOD 20448  7.13    C:\Windows\system32\DRIVERS\intelppm.sys

For example, the intelppm driver may be disabled with the following command followed by a restart:

> sc config intelppm start= disabled

Memory

Terms:

  • Memory may be reserved for future use although this puts no demands on RAM or paging spaces.
  • Reserved memory may be concurrently or subsequently committed which ensures there is virtual space in RAM or paging spaces although committed memory only becomes resident in RAM once it's touched (read/written). Programs such as Task Manager have an option to add a column called the "Commit Size" which is the total committed. Reserved and committed memory are roughly two different ways of looking at the "virtual size" of the process from the terms of other operating systems.
  • The working set of a process is the amount of memory resident in RAM. This is roughly the "resident set size" of the process from the terms of other operating systems.
  • The commit limit of a Windows node is the size of RAM plus all paging spaces. If the commit charge hits the commit limit, after the maximum number of paging space auto-increases (when "Automatically manage paging file size for all drives" is checked), a request to commit memory will fail even if there is available physical RAM. Unlike Linux, for example, Windows does not allow overcommit of virtual memory (other than reservations without commits):

    The system commit charge is the total committed or "promised" memory of all committed virtual memory in the system. If the system commit charge reaches the system commit limit, the system and processes might not get committed memory. This condition can cause freezing, crashing, and other malfunctions. Therefore, make sure that you set the system commit limit high enough to support the system commit charge during peak usage.
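The relationship between the commit limit and the commit charge is simple arithmetic. The Python sketch below follows the definition given above (RAM plus all paging spaces); the function names and sizes are hypothetical examples, and the figures Windows itself reports may differ slightly from this approximation.

```python
def commit_limit_mb(ram_mb: int, paging_file_sizes_mb: list) -> int:
    """Approximate commit limit: physical RAM plus all paging files."""
    return ram_mb + sum(paging_file_sizes_mb)


def commit_headroom_mb(commit_charge_mb: int, ram_mb: int, paging_file_sizes_mb: list) -> int:
    """Memory that can still be committed before commit requests start failing."""
    return commit_limit_mb(ram_mb, paging_file_sizes_mb) - commit_charge_mb


if __name__ == "__main__":
    # Hypothetical machine: 16 GB of RAM with two paging files.
    print(commit_limit_mb(16384, [4096, 2048]))
    print(commit_headroom_mb(20000, 16384, [4096, 2048]))
```

Note that in this example the commit charge already exceeds RAM, yet commits still succeed because the paging files extend the limit.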

Physical Memory (RAM)

Perfmon counters (https://technet.microsoft.com/en-us/library/2008.08.pulse.aspx):

  • Memory\Available Bytes = The amount of free physical memory available for running processes.
  • Memory\Cache Bytes = The amount of physical memory used by the file system cache.
  • Memory\Free System Page Table Entries = The number of free PTEs. Should be non-zero.
  • Memory\Pool Nonpaged Bytes = Memory used by the kernel which cannot be paged out.
  • Memory\Pool Paged Bytes = Memory used by the kernel which can be paged out.

See also https://learn.microsoft.com/en-us/troubleshoot/windows-client/performance/how-to-determine-the-appropriate-page-file-size-for-64-bit-versions-of-windows#memorypagesec-and-other-hard-page-fault-counters

Process Memory Usage

To monitor process memory usage in Perfmon, check Process\Virtual Bytes and Process\Private Bytes.

VMMap is a useful tool to get a detailed breakdown of process memory usage: https://technet.microsoft.com/en-us/sysinternals/dd535533.aspx

Windows 32-bit uses a default virtual user address space of 2GB (http://download.boulder.ibm.com/ibmdl/pub/software/dw/jdk/diagnosis/dw3gbswitch3.pdf).

This can be changed to a 3GB virtual user address space.

The OS space (Windows kernel) is used for things such as the paged and non-paged pools (e.g. network buffers, see https://blogs.technet.microsoft.com/b/markrussinovich/archive/2009/03/26/3211216.aspx), page table entries (https://technet.microsoft.com/en-us/library/cc784475(v=WS.10).aspx and https://technet.microsoft.com/en-us/library/cc786709(WS.10).aspx), and drivers.

On older versions of Windows, you enable 3GB mode with a /3GB flag in boot.ini and reboot the box: https://technet.microsoft.com/en-us/library/bb124810.aspx and https://msdn.microsoft.com/en-us/library/bb613473(v=vs.85).aspx

On newer versions of Windows, use BCDEdit /set increaseuserva 3072 and reboot the box: https://msdn.microsoft.com/en-us/library/ff542202.aspx

In 3GB mode, some libraries are still based at the 2GB boundary, so -Xmx is practically limited to between -Xmx1408m and -Xmx1856m because it is a single, contiguous allocation. Library rebasing is possible but then shared libraries are loaded privately.

Starting in IBM Java 6, the split heap option may be used which forces gencon and allows you to straddle nursery and tenured regions around the 2GB area. For example: -Xgc:splitheap -Xmx2800m -Xmox1800m (http://www.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/com.ibm.java.win.80.doc/diag/appendixes/cmdline/xgcsplitheap.html)

A program must be linked with /LARGEADDRESSAWARE to utilize a system configured in a way other than the default 2GB mode. IBM Java is linked with this option.

"If an application was linked with /LARGEADDRESSAWARE, DUMPBIN /HEADERS will display information to that effect."
https://msdn.microsoft.com/en-us/library/wz223b1z.aspx

This option is not risk free: Third party JNI libraries with pointer arithmetic may have unexpected issues or crashes. The kernel itself may also run into issues, particularly with exhausted page translation table entries or an exhausted non-paged pool when there is a lot of network activity.

Input/Output (I/O)

Useful Perfmon counters for disks are (https://technet.microsoft.com/en-us/library/cc722466.aspx):

  • LogicalDisk\Avg. Disk sec/Read: Average time, in seconds, of a read of data from the disk
  • LogicalDisk\Avg. Disk sec/Write: Average time, in seconds, of a write of data to the disk
  • LogicalDisk\Current Disk Queue Length: Indicates the number of disk requests that are currently waiting as well as requests currently being serviced.
  • LogicalDisk\% Idle Time: Reports the percentage of time that the disk system was not processing requests and no work was queued.
  • LogicalDisk\Disk Reads/sec
  • LogicalDisk\Disk Writes/sec
  • LogicalDisk\Disk Read Bytes/sec
  • LogicalDisk\Disk Write Bytes/sec
  • Process\IO Read Bytes/sec
  • Process\IO Write Bytes/sec

Defragmentation

As you delete files, you create gaps in the arrangement of the contiguously stored files. As you save new files (and this is especially true for large files), the file system uses up all of these bits of free space - resulting in the new files being scattered all over the disk in noncontiguous pieces. And thus we end up with fragmented disks and system performance issues because the disk heads have to spend time moving from cluster to cluster before they can read or write the data.

[The Disk Defragmenter] utility physically rearranges the files so that they are stored (as much as possible) in physically contiguous clusters. In addition to the consolidation of files and folders, the Defragmenter utility also consolidates free space - meaning that it is less likely for new files to be fragmented when you save them. For operating systems prior to Windows Vista, you had to manually run the utility or schedule automatic defragmentation via a scheduled task. On Windows Vista, Disk Defragmenter runs as a low-priority background task that is automatically run on a weekly basis without requiring user intervention. On Windows Server 2008, which uses the same Disk Defragmenter, the automatic defragmentation is not enabled by default... The basic operation of the utility involves passing it a drive letter, for example: defrag.exe c: would perform a defragmentation of the C: drive.

> defrag c: -a

https://blogs.technet.microsoft.com/b/askperf/archive/2008/03/14/disk-fragmentation-and-system-performance.aspx

CIFS/SMB

The most common protocols for networked file systems on Windows are Common Internet File System (CIFS) and Server Message Block (SMB). The SMB version 2 protocol is new and no longer synonymous with CIFS (https://msdn.microsoft.com/en-us/library/ee441790.aspx).

The versions of SMB2 are 2.002, 2.1, 3.0, and 3.02 (https://msdn.microsoft.com/en-us/library/cc246492.aspx).

If acceptable from a security point of view, consider disabling SMB packet signing: "By default, client-side SMB signing is enabled on workstations, servers, and domain controllers... Using SMB packet signing can degrade performance up to 15 percent on file service transactions" (https://technet.microsoft.com/en-us/library/cc731957.aspx) and "... the overhead could get extremely high, up to 40 percent in some situations" (https://technet.microsoft.com/en-us/library/cc512612.aspx). Disable 'Microsoft network client: Digitally sign communications (if server agrees)' and 'Microsoft network client: Digitally sign communications (always)'.

SMB2.1 introduces large Maximum Transmission Unit (MTU) support up to 1MB (https://technet.microsoft.com/en-us/library/ff625695(v=ws.10).aspx). It is enabled with HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\LanmanWorkstation\Parameters\DisableLargeMtu = 0 followed by a reboot (http://download.microsoft.com/download/9/B/2/9B205446-37EE-4BB1-9A50-E872565692F1/PerfTuningGuideServer2012R2.pdf).

The Perfmon counter Network Interface\Bytes Total/sec may be used to test the throughput behavior of SMB: https://blogs.technet.microsoft.com/b/josebda/archive/2008/11/11/file-server-performance-improvements-with-the-smb2-protocol-in-windows-server-2008.aspx

Test the response time of an SMB copy using a large file by creating a batch file such as largefilecopy.bat:

@echo off
echo %TIME%
xcopy /J /Y %PATHTOLARGELOCALFILE% \\%SMBPATH%
echo %TIME%
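As an alternative to comparing the two %TIME% values by hand, the copy can be timed in a script. This Python sketch is a rough equivalent with hypothetical paths and function names; unlike xcopy /J, shutil.copyfile uses buffered I/O, so results for the same file may differ from the batch file above.

```python
import os
import shutil
import time


def timed_copy(source: str, destination: str):
    """Copy a file and return (elapsed seconds, bytes copied)."""
    start = time.perf_counter()
    shutil.copyfile(source, destination)
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(destination)


def throughput_mb_per_s(size_bytes: int, elapsed_s: float) -> float:
    """Throughput in MB/s (1 MB = 1024 * 1024 bytes)."""
    if elapsed_s <= 0:
        return float("inf")
    return size_bytes / (1024 * 1024) / elapsed_s


if __name__ == "__main__":
    # Hypothetical paths; replace with a large local file and an SMB share.
    elapsed, size = timed_copy(r"C:\temp\largefile.bin", r"\\server\share\largefile.bin")
    print(f"{size} bytes in {elapsed:.3f}s = {throughput_mb_per_s(size, elapsed):.1f} MB/s")
```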

One technique for determining what proportion of time a process spends waiting for SMB responses is to gather a network trace, filter to the times spanning a particular process request, add a Wireshark column for smb2.time, export to CSV, sum the service response times, and compare to the elapsed time of the process request.
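The final summing and comparison steps can be sketched in Python. This assumes the exported CSV has a header row and that the added column is titled smb2.time (the title is whatever you named the column in Wireshark); the sample rows are made up.

```python
import csv
import io

# Made-up export: only response packets carry an smb2.time value.
SAMPLE = '''"No.","Time","smb2.time"
"1","0.000",""
"2","0.010","0.010"
"3","0.050",""
"4","0.070","0.020"
'''


def sum_response_times(csv_text: str, column: str = "smb2.time") -> float:
    """Sum the SMB2 service response times (seconds) in an exported CSV."""
    total = 0.0
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(column) or "").strip()
        if value:
            total += float(value)
    return total


def smb_wait_proportion(csv_text: str, elapsed_seconds: float, column: str = "smb2.time") -> float:
    """Fraction of the request's elapsed time spent waiting on SMB responses."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return sum_response_times(csv_text, column) / elapsed_seconds


if __name__ == "__main__":
    # Hypothetical process request that took 0.1 seconds end to end.
    print(smb_wait_proportion(SAMPLE, 0.1))
```

If requests are issued concurrently, response times overlap and the sum can exceed the elapsed time, so treat the proportion as an upper bound in that case.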

Some people suggest disabling "Domain member: Digitally encrypt secure channel data"; however, this option does not appear to be related to SMB traffic (https://technet.microsoft.com/en-us/library/jj852270(v=ws.10).aspx).

Networking

Update TIME_WAIT timeout: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpTimedWaitDelay = REG_DWORD value 30 (https://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tprf_tunewindows.html)

Update maximum ephemeral local port: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\MaxUserPort = REG_DWORD value 65534 (https://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tprf_tunewindows.html)

Consider disabling delayed TCP acknowledgments: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\<interface GUID>\TcpAckFrequency = REG_DWORD value 1 (set per network interface). Warning: This option may or may not be better depending on the workload (see the discussion of delayed acknowledgments). (https://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tprf_tunewindows.html, https://support.microsoft.com/kb/328890)

Consider increasing the TCP maximum window size. For example, to set the value to 65535, HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\GlobalMaxTcpWindowSize = REG_DWORD value 0xFFFF: https://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tprf_tunetcpip.html
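To apply several of these Tcpip\Parameters values consistently across machines, one option is to generate a .reg file and import it with regedit. The Python sketch below only renders the file text; the function name is hypothetical, the values are the examples from the text above, and the per-interface TcpAckFrequency setting is deliberately left out because its key depends on the interface. Review the output and back up the registry before importing anything.

```python
TCPIP_PARAMETERS = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

# Example values taken from the settings discussed above.
EXAMPLE_VALUES = {
    "TcpTimedWaitDelay": 30,
    "MaxUserPort": 65534,
    "GlobalMaxTcpWindowSize": 0xFFFF,
}


def render_reg_file(key: str, dword_values: dict) -> str:
    """Render REG_DWORD values for one registry key in .reg file syntax."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    for name, value in dword_values.items():
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError(f"{name} does not fit in a DWORD")
        # .reg files store DWORDs as eight lowercase hexadecimal digits.
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_reg_file(TCPIP_PARAMETERS, EXAMPLE_VALUES))
```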

Consider increasing the maximum number of TCP/IP control blocks (MaxFreeTcbs) when using large numbers of connections: https://technet.microsoft.com/en-us/library/cc938178.aspx. When modifying MaxFreeTcbs, MaxHashTableSize must also be modified proportionally: https://technet.microsoft.com/en-us/library/cc938176.aspx

Starting with Windows Server 2008, it is no longer applicable to modify EnableDynamicBacklog/MinimumDynamicBacklog/MaximumDynamicBacklog/DynamicBacklogGrowthDelta (https://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tprf_tunewindows.html, https://support.microsoft.com/kb/142641, https://msdn.microsoft.com/en-us/library/ff648853.aspx, https://blogs.technet.microsoft.com/b/nettracer/archive/2010/08/11/where-have-those-afd-driver-related-registry-dynamicbackloggrowthdelta-enabledynamicbacklog-maximumdynamicbacklog-minimumdynamicbacklog-keys-gone.aspx)

Increase network adapter receive buffers: https://support.microsoft.com/kb/981482

It appears that TCP/IP in Windows Server 2012 is the same as in Windows Server 2008 R2, so all of the same tuning applies: "In Windows Server 2012, TCP/IP - including both Internet Protocol version 4 (IPv4) and IPv6 - is unchanged from TCP/IP in Windows Server 2008 R2. For more information, see TCP/IP in the Windows Server 2008 and Windows Server 2008 R2 Technical Library." (https://technet.microsoft.com/en-us/library/jj573587.aspx).

Ping a remote host. In general, and particularly for LANs, ping times should be less than a few hundred milliseconds with little standard deviation.

> ping -t 10.20.30.1
 
Pinging 10.20.30.1 [10.20.30.1] with 32 bytes of data:
Reply from 10.20.30.1: bytes=32 time=92ms TTL=249
Reply from 10.20.30.1: bytes=32 time=89ms TTL=249
Reply from 10.20.30.1: bytes=32 time=155ms TTL=249
Reply from 10.20.30.1: bytes=32 time=89ms TTL=249
 
Ping statistics for 10.20.30.1:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 89ms, Maximum = 155ms, Average = 106ms
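
The ping summary reports minimum, maximum, and average but not the standard deviation. A small sketch such as the following can compute it from captured output; it assumes the English-language reply format shown above ("time=92ms" or "time<1ms") and the function name is hypothetical.

```python
import re
import statistics


def ping_summary(ping_output: str) -> dict:
    """Extract round-trip times from Windows ping output and summarize them."""
    # Windows prints "time=92ms", or "time<1ms" for sub-millisecond replies.
    times = [int(t) for t in re.findall(r"time[=<](\d+)ms", ping_output)]
    if not times:
        raise ValueError("no ping replies found")
    return {
        "count": len(times),
        "mean_ms": statistics.mean(times),
        # Sample standard deviation needs at least two replies.
        "stdev_ms": statistics.stdev(times) if len(times) > 1 else 0.0,
        "max_ms": max(times),
    }


sample = """Reply from 10.20.30.1: bytes=32 time=92ms TTL=249
Reply from 10.20.30.1: bytes=32 time=89ms TTL=249
Reply from 10.20.30.1: bytes=32 time=155ms TTL=249
Reply from 10.20.30.1: bytes=32 time=89ms TTL=249
"""

print(ping_summary(sample))
```

For the four replies above the mean is 106.25 ms and the sample standard deviation is roughly 32.5 ms, driven almost entirely by the single 155 ms reply.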

TCP Congestion Control

Review the background on TCP congestion control.

Review https://docs.microsoft.com/en-us/windows-server/networking/technologies/network-subsystem/net-sub-performance-tuning-nics

Initial Congestion Window Size

The initial congestion window size may be changed with (https://support.microsoft.com/kb/2472264):

> netsh interface tcp set supplemental template=custom icw=10
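
To get a feel for why the initial congestion window matters for short transfers, the sketch below counts the round trips needed to send a payload under an idealized slow start. It assumes a 1460-byte maximum segment size, a window that doubles every round trip, no packet loss, and no receive-window limit; real connections will differ, and the function name is hypothetical.

```python
def round_trips_to_send(total_bytes: int, icw_segments: int,
                        mss_bytes: int = 1460) -> int:
    """Count round trips to send total_bytes under idealized slow start.

    Assumptions: the congestion window starts at icw_segments * mss_bytes,
    doubles after every round trip, and nothing is lost or throttled.
    """
    window = icw_segments * mss_bytes
    sent = 0
    rounds = 0
    while sent < total_bytes:
        sent += window
        window *= 2
        rounds += 1
    return rounds


# A 100,000-byte response with two different initial windows.
print(round_trips_to_send(100_000, icw_segments=10))
print(round_trips_to_send(100_000, icw_segments=2))
```

Under these assumptions the larger initial window roughly halves the number of round trips for a response of this size; the benefit shrinks for long-lived connections that have already left slow start.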

netstat

Create a snapshot of socket information:

> netstat -a -b -n -o
 
Active Connections
 
  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:7278           0.0.0.0:0              LISTENING       2684
 [java.exe]
  TCP    0.0.0.0:8881           0.0.0.0:0              LISTENING       2684
 [java.exe]
  TCP    0.0.0.0:9045           0.0.0.0:0              LISTENING       2684
 [java.exe]...
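
A snapshot like this is easier to compare over time once it is reduced to a count of connections per state (for example, to watch TIME_WAIT or CLOSE_WAIT growth). The following sketch assumes the five-column TCP row layout shown above; the sample rows and function name are hypothetical.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections by state in "netstat -a -n -o" style output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows have five columns: Proto, Local, Foreign, State, PID.
        # Owner lines such as "[java.exe]" and UDP rows are skipped.
        if len(fields) == 5 and fields[0] == "TCP":
            states[fields[3]] += 1
    return states


sample = """  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:7278           0.0.0.0:0              LISTENING       2684
 [java.exe]
  TCP    0.0.0.0:8881           0.0.0.0:0              LISTENING       2684
 [java.exe]
  TCP    10.20.30.5:9045        10.20.30.9:51234       ESTABLISHED     2684
 [java.exe]
  TCP    10.20.30.5:51300       10.20.30.9:9080        TIME_WAIT       0
"""

for state, count in count_tcp_states(sample).most_common():
    print(f"{state:12} {count}")
```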

Show per-protocol statistics:

> netstat -s
 
IPv4 Statistics
 
  Received Header Errors             = 0
  Received Address Errors            = 0
  Unknown Protocols Received         = 0
  Received Packets Discarded         = 9
  Routing Discards                   = 0
  Discarded Output Packets           = 17
  Output Packet No Route             = 0
  Reassembly Required                = 0
  Reassembly Failures                = 0
  Datagrams Failing Fragmentation    = 0...
 
TCP Statistics for IPv4
 
  Failed Connection Attempts          = 445
  Reset Connections                   = 149
  Segments Retransmitted              = 921...
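
Raw counters such as Segments Retransmitted are easier to judge as a ratio of segments sent. The sketch below parses "name = value" lines grouped under section headings and computes that ratio; the Segments Sent figure in the sample is made up because the excerpt above elides it, and the function names are hypothetical.

```python
def parse_netstat_s(output: str) -> dict[str, dict[str, int]]:
    """Parse "netstat -s" style output into {section: {counter: value}}."""
    sections: dict[str, dict[str, int]] = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "=" in line:
            name, _, value = line.partition("=")
            value = value.strip()
            if current is not None and value.isdigit():
                sections[current][name.strip()] = int(value)
        else:
            # A non-empty line without "=" starts a new section.
            current = line
            sections.setdefault(current, {})
    return sections


def retransmit_ratio(tcp_stats: dict[str, int]) -> float:
    """Fraction of sent TCP segments that were retransmissions."""
    return tcp_stats["Segments Retransmitted"] / tcp_stats["Segments Sent"]


sample = """TCP Statistics for IPv4

  Failed Connection Attempts          = 445
  Reset Connections                   = 149
  Segments Sent                       = 184200
  Segments Retransmitted              = 921
"""

stats = parse_netstat_s(sample)
print(f"{retransmit_ratio(stats['TCP Statistics for IPv4']):.2%}")
```

Because the counters are cumulative since boot, comparing two snapshots taken around a test gives a more meaningful ratio than a single reading.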

Show Ethernet interface statistics:

> netstat -e
Interface Statistics
 
                           Received            Sent
Bytes                     275244337        12757159...
Discards                          0               0
Errors                            0               0
Unknown protocols                 0

Wireshark

Capture network packets using Wireshark (covered in the Major Tools chapter).

Start the capture:

  1. Install Wireshark: https://www.wireshark.org/#download
  2. Start Wireshark as Administrator
  3. Click "Capture" > "Options"
  4. Select the network interface in the "Input" box
  5. Click the "Output" tab and enter a "File" such as C:\wireshark.pcap
  6. Click the "Options" tab and uncheck "Update list of packets in realtime" and click "Start"

Stop the capture:

  1. Click "Capture" > "Stop"

netsh

netsh is a command line tool to help configure networking.

Disable IPv6 router discovery (automatic address negotiation):

> netsh interface ipv6 set interface %INTERFACE% routerdiscovery=disabled

Message Analyzer

The official way to capture network packets on newer versions of Microsoft Windows is Microsoft Message Analyzer: http://www.microsoft.com/en-us/download/details.aspx?id=44226

Network Monitor

The official way to capture network packets on older versions of Microsoft Windows is Microsoft Network Monitor: https://support.microsoft.com/kb/148942

Process Monitor (ProcMon.exe)

Microsoft Process Monitor provides detailed information on file system activity, registry activity, network activity and process/thread activity: https://technet.microsoft.com/en-us/sysinternals/bb896645. ProcMon replaces previous tools such as FileMon.

  1. Delete any existing PML files from previous runs.
  2. Command Prompt> ProcMon.exe /NoConnect (the /NoConnect option avoids immediately starting collection so that you can configure whatever's needed)
  3. File > Backing Files > Select "Use file named" and enter a path such as C:\ProcMon.pml and click OK.
  4. Filter > Uncheck "Drop Filtered Events"
  5. Options > Configure Symbols... > Ensure DbgHelp.dll points to an existing path (install Debugging Tools if not), and set symbol paths to include a local symbol cache directory, such as srv*c:\symbols*http://msdl.microsoft.com/download/symbols
  6. Options > Profiling Events > Check "Generate thread profiling events" and select "Every 100 milliseconds"
  7. In the menu bar on the right, uncheck the 5 boxes named "Show Registry Activity," "Show File System Activity," etc. so that only the backing file captures the events and not the GUI as well.
  8. File > Check "Capture Events".
  9. Reproduce the problem.
  10. File > Uncheck "Capture Events" (or run ProcMon.exe /terminate from another command prompt). This step is required; otherwise, you may receive the following error when trying to open the PML files: "The file %FILE% was not closed cleanly during capture and is corrupt."
  11. Load the PML file.

Thread Profiling Analysis

Click Tools > Stack Summary..., sort by Time %, and expand the largest stack paths.

Large Pages

The -Xlp option requests that the JVM allocate the Java heap with large pages. This option is available only on Windows Server 2003, Windows Vista, Windows Server 2008, and above. To use large pages, the user that runs Java must have the authority to "lock pages in memory".

To enable this authority, as administrator go to Control Panel > Administrative Tools > Local Security Policy and then find Local Policies > User Rights Assignment > Lock pages in memory. Add the user who runs the Java process, and reboot the machine.

Note: On Microsoft Windows Vista and Windows Server 2008, use of large pages is affected by the User Account Control (UAC) feature. When UAC is enabled, a regular user (a member of the Users group) can use the -Xlp option as normal. However, an administrative user (a member of the Administrators group) must run the application as an administrator to gain the privileges required to lock pages in memory. To run as administrator, right-click the application and select Run as administrator. If the user does not have the necessary privileges, an error message is produced, advising that the System configuration does not support option '-Xlp'.