Sunday, July 29, 2007

Why Is Mac OS X Able to Run Faster?

Making An Operating System Faster

http://www.kernelthread.com/mac/apme/optimizations/


Introduction

The performance of computer hardware generally only grows with time. While the same could be said of software, the rate at which software performance improves is far slower than that of hardware. In fact, many people argue that the performance of a great deal of software actually keeps degrading as time goes by. Moreover, it is very difficult to establish an objective standard for judging performance, let alone for software as complex as an operating system: a "faster OS" is a very subjective and context-dependent term.

The architecture of an OS is far longer-lived than typical hardware. OS researchers do not usually deliver newer, faster algorithms at the same steady, frequent pace at which hardware improvements arrive. Nevertheless, those involved in "producing" an OS -- researchers, designers, implementers, even marketers -- have the difficult task of ensuring that the relevant performance curves keep up. There are not many survivors in the OS market (some might argue, rhetorically, that there is really only one). Still, it is a tough market, and OS vendors must keep "improving" their systems.

Now, given that you are unlikely to see an earth-shattering algorithmic breakthrough in every OS release cycle, how do you make your system faster? This question has multiple answers:

* Rather than looking for generalized optimizations pedantically, you could look into making the system faster for one (or more, but a few) "common" usage scenario.

* You could consider numerous minor and mundane performance improvements, even if they are technically unimpressive/uninteresting, or ugly/kludgy to implement. Together, such improvements could lead to a perceptible performance gain from the users' point of view.

* You could vary the granularity at which you would usually make improvements. For example, in addition to improving typical OS algorithms, you could look into improving more meta operations, such as the entire boot process.

* The most important kind of performance is the one perceived by the eventual users of a system. Thus, in any usage scenario, a "faster workflow" would be tantamount to "higher performance".

It might be possible to make the workflow faster without making fundamental changes in the design and implementation of the components involved. With a priori knowledge of how the system would be typically used, you could rearrange the order in which things happen (even if the resulting order is unnatural or unclean), if doing so makes the user believe that things are happening faster.


Example: Mac OS X

This document describes ten things Apple has done (beyond the fundamental design and implementation of the OS) to improve the performance of Mac OS X. Some are clearly good ideas that were obvious candidates waiting to be implemented; some are guidelines or tools that help developers build higher-performance applications; and some are proactive attempts to extract more performance through strategic choices.

The ten items considered below are in no particular order:

* BootCache
* Kernel Extensions Cache
* Hot File Clustering
* Working Set Detection
* On-the-fly Defragmentation
* Prebinding
* Helping Developers Create Code Faster
* Helping Developers Create Faster Code
* Journaling in HFS Plus
* Instant-on


1. BootCache

Mac OS X uses a boot-time optimization (effectively a smart read-ahead) that monitors the pattern of incoming read requests to a block device (the boot disk), and sorts the pattern into a "playlist", which is used to cluster reads into a private cache. This "boot cache" is then used for satisfying incoming read requests, if possible. The scheme also measures the cache hit rate, and stores the request pattern into a "history list" for being adaptive in future. If the hit rate is too low, the caching is disabled.
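
The playlist idea can be sketched in a few lines of C. The structures and names below are hypothetical, purely for illustration and not Apple's actual BootCache code: recorded read requests are sorted by disk offset and then replayed as large, mostly sequential reads.

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical playlist entry: one recorded read request. */
struct pl_entry {
    off_t  offset;   /* byte offset on the block device */
    size_t length;   /* number of bytes read            */
};

static int by_offset(const void *a, const void *b)
{
    const struct pl_entry *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Sort the recorded pattern and prefetch it -- the essence of a
 * boot-time read-ahead cache. */
static void prefetch_playlist(int dev_fd, struct pl_entry *pl, size_t n)
{
    qsort(pl, n, sizeof(*pl), by_offset);
    for (size_t i = 0; i < n; i++) {
        char *buf = malloc(pl[i].length);
        if (buf == NULL)
            continue;
        /* The real system fills a private cache; here we merely read
         * the data so later requests hit the buffer cache. */
        (void) pread(dev_fd, buf, pl[i].length, pl[i].offset);
        free(buf);
    }
}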

The loadable (sorted) read pattern is stored in /var/db/BootCache.playlist. Once this pattern is loaded, the cache comes into effect. The entire process is invisible to users.

This feature is only supported on the root device. Further, it requires at least 128 MB of physical RAM before it is enabled (automatically).

/System/Library/Extensions/BootCache.kext is the location of the kernel extension implementing the cache while Contents/Resources/BootCacheControl within that directory is the user-level control utility (it lets you load the playlist, among other things).

The effectiveness of BootCache can be gauged from the following: in a particular update to "Panther", a reference to BootCacheControl was broken. BootCache is started (via BootCacheControl, the control utility) in /etc/rc, and a prefetch tag is inserted (unless the system is booting in safe mode). /etc/rc looks for BootCacheControl in the Resources directory of the BootCache.kext bundle, as well as in /usr/sbin, and finds it in the former (it doesn't exist in the latter). However, another program (loginwindow.app) accesses /usr/sbin/BootCacheControl directly, and does not find it. For what it's worth, making BootCacheControl available in /usr/sbin, say via a symbolic link, reduces the boot time (measured from clicking on the "Restart" confirmation button to the point where absolutely everything has shown up on the system menu) from 135 seconds to 60 seconds on one of my machines.

2. Kernel Extensions Cache

There may be close to a hundred kernel extensions that are loaded on a typical Mac OS X system, and perhaps twice as many residing in the system's "Extensions" folder(s). Kernel extensions may have dependencies on other extensions. Rather than scan all these every time the system boots (or worse, every time an extension is to be loaded), Mac OS X uses caching for kernel extensions, and the kernel itself.

There are three types of kernel/kext caches used in this context:

  • The kernel cache contains the kernel code, linked kernel extensions, and info dictionaries of any number of kernel extensions. The default cache directory for this type of cache is /System/Library/Caches/com.apple.kernelcaches. The cache files in this directory are named kernelcache.XXXXXXXX, where the suffix is an Adler-32 checksum (the 32-bit checksum algorithm used by zlib; a short sketch of this checksum appears after this list).
  • The multi-extension, or mkext, cache contains multiple kernel extensions and their info dictionaries. Such caches are used during early system startup. BootX, the bootloader, tries to load a previously cached list of device drivers (created/updated by /usr/sbin/kextcache). If the mkext cache is corrupt or missing, BootX looks in /System/Library/Extensions for extensions that are needed in the current scenario (as determined by the value of the OSBundleRequired property in the Info.plist file of the extension's bundle). The mkext cache exists by default as /System/Library/Extensions.mkext. You can use /usr/sbin/mkextunpack to extract the contents of an mkext archive.
  • The kext repository cache contains the info dictionaries for all kernel extensions in a single repository directory, including their plugins. This cache exists by default as /System/Library/Extensions.kextcache. Note that this file is simply a large property list (XML) file that is Gzip-compressed.
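
Because the kernel cache file names embed that checksum, a small illustration of the algorithm may help. The following is a plain C version of Adler-32 (the checksum defined for zlib); it only shows how such a suffix could be computed and is not Apple's code.

#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u   /* largest prime below 2^16 */

/* Adler-32 over a byte buffer. */
uint32_t adler32(const unsigned char *data, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}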

3. Hot File Clustering

Hot File Clustering (HFC) aims to improve the performance of small, frequently accessed files on HFS Plus volumes. This optimization is currently used only on boot volumes. HFC is a multi-staged clustering scheme that records "hot" files (except journal files, and ideally quota files) on a volume, and moves them to the "hot space" on the volume (0.5% of the total filesystem size located at the end of the default metadata zone, which itself is at the start of the volume). The files are also defragmented. The various stages in this scheme are DISABLED, IDLE, BUSY, RECORDING, EVALUATION, EVICTION, and ADOPTION. At most 5000 files, and only files less than 10 MB in size are "adopted" under this scheme.

The "metadata zone" referred to in the above description is an area on disk that may be used by HFS Plus for storing volume metadata: the Allocation Bitmap File, the Extents Overflow File, the Journal File, the Catalog File, Quota Files, and Hot Files. Mac OS X 10.3.x places the metadata zone near the beginning of the volume, immediately after the volume header.

HFC (and the metadata zone policy) are used only on journaled HFS Plus volumes that are at least 10 GB in size.

Note that what constitutes the set of hot files on your system will depend on your usage pattern over a few days. If you are doing extensive C programming for a few days, say, then it is likely that many of your hot files will be C headers. You can use hfsdebug to explore the working of Hot File Clustering.

% sudo hfsdebug -H -t 10   # Top 10 Hottest Files on the Volume
rank   temperature  cnid     path
 1     537          7453     Macintosh HD:/usr/share/zoneinfo/US/Pacific
 2     291          7485     Macintosh HD:/private/var/db/netinfo/local.nidb/Store.128
 3     264          7486     Macintosh HD:/private/var/db/netinfo/local.nidb/Store.160
 4     204          7495     Macintosh HD:/private/var/db/netinfo/local.nidb/Store.96
 5     204          2299247  Macintosh HD:/Library/Receipts/iTunes4.pkg/Contents\
                             /Resources/package_version
 6     192          102106   Macintosh HD:/usr/include/mach/boolean.h
 7     192          102156   Macintosh HD:/usr/include/mach/machine/boolean.h
 8     192          102179   Macintosh HD:/usr/include/mach/ppc/boolean.h
 9     188          98711    Macintosh HD:/usr/include/string.h
10     178          28725    Macintosh HD:/HFS+ Private Data/iNode1038632980
3365 active Hot Files.
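
The "temperature" column is commonly described as the number of bytes read from a file during the recording window divided by the file's size. Under that assumption, the ranking metric can be sketched in C as follows (hypothetical structure and field names):

#include <stdint.h>

/* Hypothetical bookkeeping kept for a file during the RECORDING stage. */
struct hot_file {
    uint32_t cnid;        /* catalog node ID            */
    uint64_t size;        /* file size in bytes         */
    uint64_t bytes_read;  /* bytes read while recording */
};

/* Temperature: bytes read during the recording window divided by the
 * file's size.  Hotter files are candidates for adoption, subject to
 * the 10 MB size limit and the 5000-file cap mentioned above. */
static uint32_t temperature(const struct hot_file *f)
{
    if (f->size == 0)
        return 0;
    return (uint32_t)(f->bytes_read / f->size);
}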

4. Working Set Detection

The Mach kernel uses physical memory as a cache for virtual memory. When new pages are to be brought in as a result of page faults, the kernel would need to decide which pages to reclaim from amongst those that are currently in memory. For an application, the kernel should ideally keep those pages in memory that would be needed very soon.

In the Utopian OS, one would know ahead of time the pages an application references as it runs. There have been several algorithms that approximate such optimal page replacement. Another approach is to make use of the locality of reference of processes. According to the Principle of Locality, a process refers to a small, slowly changing subset of its set of pages. This subset is the Working Set. Studies have shown that the working set of a process needs to be resident (in-memory) in order for it to run with acceptable performance (that is, without causing an unacceptable number of page faults).

The Mac OS X kernel incorporates a subsystem (let us call it TWS, for Task Working Set) for detecting and maintaining the working sets of tasks. This subsystem is integrated with the kernel's page fault handling mechanism. TWS builds and maintains a profile of each task's fault behavior. The profiles are per-user, and are stored on-disk, under /var/vm/app_profile/. This information is then used during fault handling to determine which nearby pages should be brought in.

Several aspects of this scheme contribute to performance:

  • Bringing a number of pages in (that would hopefully be needed in the near future) results in a single large request to the pager.
  • TWS captures, and stores on disk, an application's (initial) working set the first time it is started (by a particular user). This file is used for seeding (sometimes called pre-heating) the application's working set, as its profile is built over time.
  • The locality of reference of memory is usually captured on disk (because files on disk usually have good locality on HFS Plus volumes). Thus, there should not be too much seeking involved in reading the working set from disk.

For a user with uid U, the application profiles are stored as two page cache files: #U_names and #U_data under /var/vm/app_profile/ (#U is the hexadecimal representation of U).

The "names" file, essentially a simple database, contains a header followed by profile elements:

typedef unsigned int natural_t;
typedef natural_t vm_size_t;

struct profile_names_header {
    unsigned int number_of_profiles;
    unsigned int user_id;
    unsigned int version;
    off_t        element_array;
    unsigned int spare1;
    unsigned int spare2;
    unsigned int spare3;
};

struct profile_element {
    off_t        addr;
    vm_size_t    size;
    unsigned int mod_date;
    unsigned int inode;
    char         name[12];
};

The "data" file contains the actual working sets.

5. On-the-fly Defragmentation

When a file is opened on an HFS Plus volume, the following conditions are tested:

  • If the file is less than 20 MB in size
  • If the file is not already busy
  • If the file is not read-only
  • If the file has more than eight extents
  • If the system has been up for at least three minutes

If all of the above conditions are satisfied, the file is relocated -- it is defragmented on-the-fly.
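
As a sketch, the checks above amount to a small predicate along the following lines (hypothetical types and field names; the real test lives in the HFS Plus code path that handles file opens):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the state consulted when a file is opened. */
struct open_file_info {
    uint64_t size_bytes;    /* logical size of the file           */
    unsigned extent_count;  /* number of extents backing the file */
    bool     busy;          /* already busy elsewhere             */
    bool     read_only;     /* file opened read-only              */
};

#define MAX_DEFRAG_SIZE  (20u * 1024u * 1024u)  /* 20 MB */
#define MIN_UPTIME_SECS  (3u * 60u)             /* 3 min */
#define MAX_EXTENTS      8u

/* Returns true if the file should be relocated (defragmented) now. */
static bool should_relocate(const struct open_file_info *f,
                            unsigned uptime_secs)
{
    return f->size_bytes < MAX_DEFRAG_SIZE &&
           !f->busy &&
           !f->read_only &&
           f->extent_count > MAX_EXTENTS &&
           uptime_secs >= MIN_UPTIME_SECS;
}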

File contiguity (regardless of file size) is promoted in general as a consequence of the extent-based allocation policy in HFS Plus, which also delays actual allocation. Refer to Fragmentation In HFS Plus Volumes for more details.

6. Prebinding

Mac OS X uses a concept called "prebinding" to optimize Mach-O (the default executable format) applications to launch faster (by reducing the work of the runtime linker).

The dynamic link editor resolves undefined symbols in an executable (and dynamic libraries) at run time. This activity involves mapping the dynamic code to free address ranges and computing the resultant symbol addresses. If a dynamic library is compiled with prebinding support, it can be predefined at a given (preferred) address range. This way, dyld can use predefined addresses to reference symbols in such a library. For this to work, libraries cannot have preferred addresses that overlap. Apple marks several address ranges as either "reserved" or "preferred" for its own software, and specifies allowable ranges for 3rd party (including the end users') libraries to use to support prebinding.

update_prebinding is run to (attempt to) synchronize prebinding information when new files are added to a system. This can be a time consuming process even if you add or change a single file, say, because all libraries and executables that might dynamically load the new file must be found (package information is used to help in this, and the process is further optimized by building a dependency graph), and eventually redo_prebinding is run to prebind files appropriately.

Prebinding is the reason you see the "Optimizing ..." message when you update the system, or install certain software.

/usr/bin/otool can be used to determine if a binary is prebound:


# otool -hv /usr/lib/libc.dylib
/usr/lib/libc.dylib:
Mach header
magic cputype cpusubtype filetype ncmds sizeofcmds flags
MH_MAGIC PPC ALL DYLIB 10 1940 \
NOUNDEFS DYLDLINK PREBOUND SPLIT_SEGS TWOLEVEL


7. Helping Developers Create Code Faster

Mac OS X includes a few optimizations that benefit developers by making development workflow -- the edit-compile-debug cycle -- faster. Some of these were introduced with Mac OS X Panther.

  • Precompiled Headers: Xcode (gcc, specifically) supports precompiled headers. Xcode uses this functionality to precompile prefix headers.



% cat foo.h
#define FOO 10
% cat foo.c
#include "foo.h"
#include <stdio.h>

int
main()
{
    printf("%d\n", FOO);
}
% ls foo.*
foo.c foo.h
% gcc -x c-header -c foo.h
% ls foo.*
foo.c foo.h foo.gch
% gcc -o foo foo.c
% ./foo
10
% rm foo.h
% gcc -o foo foo.c
% ./foo
10


  • Distributed Builds: Xcode (through distcc) supports distributed builds, wherein it is possible to distribute builds across several machines on the network.
  • Predictive compilation runs the compiler in the background (as soon as it can, even as you edit the source). Once you are ready to build, the hope is that much of the building would have been done already.
  • Zero Link, a feature useful for development builds, links at runtime instead of compile time, whereby only code needed to run the application is linked in and loaded (that is, as an application runs within Xcode, each object file is linked as needed). A related feature is "Fix and Continue", courtesy of which you can (with caveats) make a change to your code and have the code compiled and inserted into a running program.

8. Helping Developers Create Faster Code

Apple provides a variety of performance measurement/debugging tools for Mac OS X. Some of these are part of Mac OS X, while many others are available if you install the Apple Developer Tools. Quite expectedly, Apple encourages its own developers, as well as 3rd party developers, to create code in conformance with performance guidelines.

As mentioned earlier, perceived performance is quite important. For example, it is desirable for an application to display its menu bar and to start accepting user input as soon as possible. Reducing this initial response time might involve deferring certain initializations or reordering the "natural" sequence of events, etc.
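
One common way to shorten that initial response time is to push expensive start-up work onto a background thread so the interface can appear immediately. A minimal, generic C sketch of the idea (not tied to any particular Apple API):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for expensive start-up work (cache warming, plug-in scans, ...). */
static void *deferred_init(void *arg)
{
    (void) arg;
    sleep(2);                 /* pretend this takes a while */
    puts("deferred initialization finished");
    return NULL;
}

int main(void)
{
    pthread_t t;

    /* Kick the slow work off in the background ... */
    pthread_create(&t, NULL, deferred_init, NULL);

    /* ... and become responsive right away. */
    puts("menu bar up, accepting input");

    pthread_join(t, NULL);
    return 0;
}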

Mac OS X Tools

Mac OS X includes several common GNU/Unix profiling/monitoring/dissecting tools, such as gprof, lsof, nm, top, and vm_stat, as well as the following (refer to Apple's documentation for these tools for more details):

  • fs_usage Report system calls and page faults related to filesystem activity.
  • heap List all malloc-allocated buffers in a process's heap.
  • ktrace/kdump Enable/view (from a trace) kernel process tracing.
  • leaks Search a process's memory for unreferenced malloc buffers.
  • malloc_history Show a process's malloc allocations.
  • otool Display various parts of an object file.
  • pagestuff Display information on specified pages of a Mach-O file.
  • sample Profile a process during a time interval.
  • sc_usage Show system call usage statistics.
  • vmmap Display virtual memory regions allocated in a process.

Performance Measurement Tools

  • MallocDebug Tracks and analyzes allocated memory.
  • ObjectAlloc Tracks Objective-C and Core Foundation object allocations and deallocations.
  • OpenGL Profiler Tool for profiling OpenGL applications.
  • PEFViewer Viewer for the contents of a PEF binary.
  • QuartzDebug Visualizer for an application's screen drawing behavior -- the areas being redrawn are flashed briefly.
  • Sampler Viewer for execution behavior of a program.
  • Spin Control Samples applications that cause the spinning cursor to appear.
  • Thread Viewer Viewer for threads and their activity.

CHUD Tools

The Computer Hardware Understanding Development (CHUD) Tools package, an optional installation, provides tools such as the following:

  • BigTop A graphical equivalent to top, vm_stat, etc. Displays system statistics.
  • CacheBasher Measures cache performance.
  • MONster Tool for collecting and visualizing hardware level performance data.
  • PMC Index Tool for searching Performance Monitoring Counter (PMC) events.
  • Reggie SE A viewer (and editor) for CPU and PCI configuration registers.
  • Saturn Tool for profiling applications at the function-call level, and visualizing the profile data.
  • Shark Performs system-wide sampling/profiling to create a profile of the execution behavior of a program, so as to help you understand where time is being spent as your code runs.
  • Skidmarks GT Processor performance benchmark (integer, floating-point, and vector benchmarks).
  • SpindownHD Utility for displaying the sleep/active status of attached drives.
  • acid Analyzes traces generated by amber (only the TT6E format).
  • amber Traces all threads of execution in a process, recording every instruction and data access to a trace file.
  • simg4 A cycle-accurate core simulator of the Motorola PowerPC G4 processor.
  • simg5 A cycle-accurate core simulator of the IBM PowerPC 970 (G5) processor.

9. Journaling in HFS Plus

While modern filesystems are often journaled by design, journaling came to HFS Plus rather late. Apple retrofitted journaling into HFS Plus as a supplementary mechanism to the erstwhile working of the filesystem, with Panther being the first version to have journaling turned on by default.

On a journaled HFS Plus volume, file object metadata and volume structures are journaled, but not file object data (fork contents, that is). The primary purpose of the journal is to make recovery faster and more reliable, in case a volume is unmounted uncleanly, but it may improve the performance of metadata operations.

10. Instant-on

Apple computers do not hibernate. Rather, when they "sleep", enough devices (in particular, the dynamic RAM) are kept alive (at the cost of some battery life, if the computer is running on battery power). Consequently, upon wakeup, the user perceives instant-on behavior: a very desirable effect.

Similarly, by default the system tries to keep network connections alive even if the machine sleeps. For example, if you log in (via SSH, say) from one PowerBook to another, and both of them go to sleep, your login should stay alive within the constraints of the protocols.

Epilogue

Using Mac OS X as an example, we looked at a few kinds of optimizations that "OS people" (particularly those involved in creating an end-user system) adopt to improve performance. The integration of all such optimizations is perhaps even more important than the optimizations themselves. The end result should be a perceptible improvement in performance. A desirable manifestation of such improvement would be a faster workflow for the end-user.