You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
1020 lines
53 KiB
1020 lines
53 KiB
/** \page TNMDC Metadata Caching in HDF5
|
|
|
|
\todo Revise this!
|
|
|
|
\section intro Introduction
|
|
|
|
In the 1.6.4 release, we introduced a re-implementation of the metadata
|
|
cache. That release contained an incomplete version of the cache which could not
|
|
be controlled via the API. The version in the 1.8 release is more mature and
|
|
includes new API calls that allow the user program to configure the metadata
|
|
cache both on file open and at run time.
|
|
|
|
From the user perspective, the most striking effect of the new cache should be a
|
|
large reduction in the cache memory requirements when working with complex HDF5
|
|
files.
|
|
|
|
Those working with such files may also notice a reduction in file close time.
|
|
|
|
Those working with HDF5 files with a simple structure shouldn't notice any
|
|
particular changes in most cases. In rare cases, there may be a significant
|
|
improvement in performance.
|
|
|
|
The remainder of this document contains an architectural overview of the old and
|
|
new metadata caches, a discussion of algorithms used to automatically adjust
|
|
cache size to circumstances, and a high-level discussion of the cache
|
|
configuration controls. It can be safely skipped by anyone who works only with
|
|
HDF5 files with relatively simple structure (i.e. no huge groups, no datasets
|
|
with large numbers of chunks, and no objects with large numbers of attributes.)
|
|
|
|
On the other hand, it is mandatory reading if you want to use something other
|
|
than the default metadata cache configuration. The documentation on the metadata
|
|
cache-related API calls will not make much sense without this background.
|
|
|
|
\section oldnew Old and New Metadata Cache
|
|
|
|
\subsection old The Old Metadata Cache
|
|
|
|
The old metadata cache indexed the cache with a hash table with no provision for
|
|
collisions. Instead, collisions were handled by evicting the existing entry to
|
|
make room for the new entry. Aside from flushes, there was no other mechanism
|
|
for evicting entries, so the replacement policy could best be described as
|
|
"Evict on Collision".
|
|
|
|
As a result, if two frequently used entries hashed to the same location, they
|
|
would evict each other regularly. To decrease the likelihood of this situation,
|
|
the default hash table size was set fairly large -- slightly more than
|
|
10,000. This worked well, but since the size of metadata entries is not bounded,
|
|
and since entries were only evicted on collision, the large hash table size
|
|
allowed the cache size to explode when working with HDF5 files with complex
|
|
structure.
|
|
|
|
The "Evict on Collision" replacement policy also caused problems with the
|
|
parallel version of the HDF5 library, as a collision with a dirty entry could
|
|
force a write in response to a metadata read. Since all metadata writes must be
|
|
collective in the parallel case while reads need not be, this could cause the
|
|
library to hang if only some of the processes participated in a metadata read
|
|
that forced a write. Prior to the implementation of the new metadata cache, we
|
|
dealt with this issue by maintaining a shadow cache for dirty entries evicted by
|
|
a read.
|
|
|
|
\subsection new The New Metadata Cache
|
|
|
|
The new metadata cache was designed to address the above issues. After
|
|
implementation, it became evident that the working set size for HDF5 files
|
|
varies widely depending on both structure and access patterns. Thus it was
|
|
necessary to add support for cache size adjustment under either automatic or
|
|
user program control (see section 2.3 for details).
|
|
|
|
When the cache is operating under direct user program control, it is also
|
|
possible to temporarily disable evictions from the metadata cache so as to
|
|
maximize raw data throughput at the expense of allowing the cache to grow
|
|
without bound until evictions are enabled again.
|
|
|
|
Structurally, the new metadata cache can be thought of as a heavily modified
|
|
version of the UNIX buffer cache as described in chapter three of M. J. Bach's
|
|
"The Design of the UNIX Operating System" In essence, the UNIX buffer cache uses
|
|
a hash table with chaining to index a pool of fixed-size buffers. It uses the
|
|
LRU replacement policy to select candidates for eviction.
|
|
|
|
Since HDF5 metadata entries are not of fixed size and may grow arbitrarily
|
|
large, the size of the new metadata cache cannot be controlled by setting a
|
|
maximum number of entries. Instead, the new cache keeps a running sum of the
|
|
sizes of all entries and attempts to evict entries as necessary to stay within a
|
|
user-specified maximum size. (Note the use of the word "attempts" here -- as
|
|
will be seen, it is possible for the cache to exceed its currently specified
|
|
maximum size.) At present, the LRU replacement policy is the only option for
|
|
selecting candidates for eviction.
|
|
|
|
Per the standard Unix buffer cache, dirty entries are given two passes through
|
|
the LRU list before being evicted. The first time they reach the end of the LRU
|
|
list, they are flushed, marked as clean, and moved to the head of the LRU
|
|
list. When a clean entry reaches the end of the LRU list, it is simply evicted
|
|
if space is needed.
|
|
|
|
The cache cannot evict entries that are locked, and thus it will temporarily
|
|
grow beyond its maximum size if there are insufficient unlocked entries
|
|
available for eviction.
|
|
|
|
In the parallel version of the library, only the cache running under process 0
|
|
of the file communicator is allowed to write metadata to file. All the other
|
|
caches must retain dirty metadata until the process 0 cache tells them that the
|
|
metadata is clean.
|
|
|
|
Since all operations modifying metadata must be collective, all caches see the
|
|
same stream of dirty metadata. This fact is used to allow them to synchronize
|
|
every n bytes of dirty metadata, where n is a user-configurable value that
|
|
defaults to 256 KB.
|
|
|
|
To avoid sending the other caches messages from the future, process 0 must not
|
|
write any dirty entries until it reaches a synchronization point. When it
|
|
reaches a synchronization point, it writes entries as needed, and then
|
|
broadcasts the list of flushed entries to the other caches. The caches on the
|
|
other processes use this list to mark entries clean before they leave the
|
|
synchronization point, allowing them to evict those entries as needed.
|
|
|
|
The caches will also synchronize on a user-initiated flush.
|
|
|
|
To minimize overhead when running in parallel, the cache maintains a "clean" LRU
|
|
list in addition to the regular LRU list. This list contains only clean entries
|
|
and is used as a source of candidates for eviction when flushing dirty entries
|
|
is not allowed.
|
|
|
|
Since flushing entries is forbidden most of the time when running in parallel,
|
|
the caches can be forced to exceed their maximum sizes if they run out of clean
|
|
entries to evict.
|
|
|
|
To decrease the likelihood of this event, the new cache allows the user to
|
|
specify a minimum clean size -- which is a minimum total size of all the entries
|
|
on the clean LRU plus all unused space in the cache.
|
|
|
|
While the clean LRU list is only maintained in the parallel version of the HDF5
|
|
library, the notion of a minimum clean size still applies in the serial
|
|
case. Here it is used to force a mix of clean and dirty entries in the cache
|
|
even in the write-only case.
|
|
|
|
This, in turn, reduces the number of redundant flushes by avoiding the case in
|
|
which the cache fills with dirty metadata and all entries must be flushed before
|
|
a clean entry can be evicted to make room for a new entry.
|
|
|
|
Observe that in both the serial and parallel cases, the maintenance of a minimum
|
|
clean size modifies the replacement policy, as dirty entries may be flushed
|
|
earlier than would otherwise be the case so as to maintain the desired amount of
|
|
clean and/or empty space in the cache.
|
|
|
|
While the new metadata cache only supports the LRU replacement policy at
|
|
present, that may change. Support for multiple replacement policies was very
|
|
much in mind when the cache was designed, as was the ability to switch
|
|
replacement policies at run time. The situation has been complicated by the
|
|
later addition of the adaptive cache resizing requirement, as two of the
|
|
resizing algorithms piggyback on the LRU list. However, if there is a need for
|
|
additional replacement policies, it shouldn't be too hard to implement them.
|
|
|
|
\section adapt Adaptive Cache Resizing in the New Metadata Cache
|
|
|
|
As mentioned earlier, the metadata working set size for an HDF5 file varies
|
|
wildly depending on the structure of the file and the access pattern. For
|
|
example, a 2MB limit on metadata cache size is excessive for an H5repack of
|
|
almost all HDF5 files we have tested. However, I have a file submitted by one of
|
|
our users that will run a 13% hit rate with this cache size and will lock up one
|
|
of our Linux boxes using the old metadata cache. Increase the new metadata cache
|
|
size to 4 MB, and the hit rate exceeds 99%.
|
|
|
|
In this case, the main culprit is a root group with more than 20,000 entries in
|
|
it. As a result, the root group heap exceeds 1 MB, which tends to crowd out the
|
|
rest of the metadata in a 2 MB cache
|
|
|
|
This case and a number of synthetic tests convinced us that we needed to modify
|
|
the new metadata cache to expand and contract according to need within
|
|
user-specified bounds.
|
|
|
|
I was unable to find any previous work on this problem, so I invented solutions
|
|
as I went along. If you are aware of prior work, please send me references. The
|
|
closest I was able to come was a group of embedded CPU designers who were
|
|
turning off sections of their cache to conserve power.
|
|
|
|
\subsection increasing Increasing the Cache Size
|
|
|
|
In the context of the HDF5 library, the problem of increasing the cache size as
|
|
necessary to contain the current working set turns out to involve two rather
|
|
different issues.
|
|
|
|
The first of these, which was recognized immediately, is the problem of
|
|
recognizing long term changes in working set size, and increasing the cache size
|
|
accordingly, while not reacting to transients.
|
|
|
|
The second, which I recognized the hard way, is to adjust the cache size for
|
|
sudden, dramatic increases in working set size caused by requests for large
|
|
pieces of metadata which may be larger than the current metadata cache size.
|
|
|
|
The algorithms for handling these situations are discussed below. These problems
|
|
are largely orthogonal to each other, so both algorithms may be used
|
|
simultaneously.
|
|
|
|
\subsubsection hrtcsi Hit Rate Threshold Cache Size Increment
|
|
|
|
Perhaps the most obvious heuristic for identifying cases in which the cache is
|
|
too small involves monitoring the hit rate. If the hit rate is low for a while,
|
|
and the cache is at its current maximum size, the current maximum cache size is
|
|
probably too small.
|
|
|
|
The hit rate threshold algorithm for increasing cache size applies this
|
|
intuition directly.
|
|
|
|
Hit rate statistics are collected over a user-specified number of cache
|
|
accesses. This period is known as an epoch.
|
|
|
|
At the end of each epoch, the hit rate is computed, and the counters are
|
|
reset. If the hit rate is below a user-specified threshold and the cache is at
|
|
its current maximum size, the maximum size of the cache is increased by a
|
|
user-specified multiple. If required, the new cache maximum size is clipped to
|
|
stay within the user-specified upper bound on the maximum cache size, and
|
|
optionally, within a user-specified maximum increment.
|
|
|
|
My tests indicate that this algorithm works well in most cases. However, in a
|
|
synthetic test in which hit rate increased slowly with cache size, and load
|
|
remained steady for many epochs, I observed a case in which cache size increased
|
|
until the hit rate just exceeded the specified minimum and then stalled. This is
|
|
a problem, as to avoid volatility, it is necessary to set the minimum hit rate
|
|
threshold well below the desired hit rate. Thus we may find ourselves with a
|
|
cache running with a 91% hit rate when we really want it to increase its size
|
|
until the hit rate is about 99%.
|
|
|
|
If this case occurs frequently in actual use, I will have to come up with an
|
|
improved algorithm. Please let me know if you see this behavior. However, I had
|
|
to work rather hard to create it in my synthetic tests, so I would expect it to
|
|
be uncommon.
|
|
|
|
\subsubsection fcsi Flash Cache Size Increment
|
|
|
|
A fundamental problem with the above algorithm is that contains the hidden
|
|
assumption that cache entries are relatively small in comparison to the cache
|
|
itself. While I knew this assumption was not generally true when I developed the
|
|
algorithm, I thought that cases, where it failed, would be so rare as to not be
|
|
worth considering, as even if they did occur, the above algorithm would rectify
|
|
the situation within an epoch or two.
|
|
|
|
While it is true that such occurrences are rare, and it is true that the hit
|
|
rate threshold cache size increment algorithm will rectify the situation
|
|
eventually, the performance degradation experienced by users while waiting for
|
|
the epoch to end was so extreme that some way of accelerating response to such
|
|
situations was essential.
|
|
|
|
To understand the problem, consider the following use case:
|
|
|
|
Suppose we create a group, and then repeatedly create a new data set in the
|
|
group, write some data to it and then close it.
|
|
|
|
In some versions of the HDF5 file format, the names of the datasets will be
|
|
stored in a local heap associated with the group, and the space for that heap
|
|
will be allocated in a single, contiguous chunk. When this local heap is full,
|
|
we allocate a new chunk twice the size of the old, copy the data from the old
|
|
local heap into the new, and discard the old local heap.
|
|
|
|
By default, the minimum metadata cache size is set to 2 MB. Thus in this use
|
|
case, our hit rate will be fine as long as the local heap is no larger than a
|
|
little less than 2 MB, as the group related metadata is accessed frequently and
|
|
never evicted, and the data set related metadata is never accessed once the data
|
|
set is closed, and thus is evicted smoothly to make room for new data sets.
|
|
|
|
All this changes abruptly when the local heap finally doubles in size to a value
|
|
above the slightly less than 2 MB limit. All of a sudden, the local heap is the
|
|
size of the metadata cache, and the cache must constantly swap it in to access
|
|
it, and then swap it out to make room for other metadata.
|
|
|
|
The hit rate threshold-based algorithm for increasing the cache size will fix
|
|
this problem eventually, but performance will be very bad until it does, as the
|
|
metadata cache will largely ineffective until its size is increased.
|
|
|
|
An obvious heuristic for addressing this "big rock in a small pond" issue is to
|
|
watch for large "incoming rocks", and increase the size of the "pond" if the
|
|
rock is so big that it will force most of the "water" out of the "pond".
|
|
|
|
The add space flash cache size increment algorithm applies this intuition
|
|
directly:
|
|
|
|
Let x be either the size of a newly inserted entry, a newly loaded entry, or the
|
|
number of bytes by which the size of an existing entry has been increased
|
|
(i.e. the size of the "rock").
|
|
|
|
If x is greater than some user-specified fraction of the current maximum cache
|
|
size, increase the current maximum cache size by x times some user-specified
|
|
multiple, less any free space that was in the cache, to begin with. Further, to
|
|
avoid confusing the other cache size increment/decrement code, start a new
|
|
epoch.
|
|
|
|
At present, this algorithm pays no attention to any user-specified limit on the
|
|
maximum size of any single cache size increase, but it DOES stay within the
|
|
user-specified upper bound on the maximum cache size.
|
|
|
|
While it should be easy to see how this algorithm could be fooled into
|
|
inactivity by a large number of entries that were not quite large enough to
|
|
cross the threshold, in practice it seems to work reasonably well.
|
|
|
|
Needless to say, I will revisit the issue should this cease to be the case.
|
|
|
|
\subsection decreasing Decreasing the Cache Size
|
|
|
|
Identifying cases in which the maximum cache size is larger than necessary
|
|
turned out to be more difficult.
|
|
|
|
\subsubsection hrtcsr Hit Rate Threshold Cache Size Reduction
|
|
|
|
One obvious heuristic is to monitor the hit rate and guess that we can safely
|
|
decrease cache size if the hit rate exceeds some user-supplied threshold (say
|
|
.99995). The hit rate threshold size decrement algorithm implemented in the new
|
|
metadata cache implements this intuition as follows:
|
|
|
|
At the end of each epoch (this is the same epoch that is used in the cache size
|
|
increment algorithm), the hit rate is compared with the user-specified
|
|
threshold. If the hit rate exceeds that threshold, the current maximum cache
|
|
size is decreased by a user-specified factor. If required, the size of the
|
|
reduction is clipped to stay within a user-specified lower bound on the maximum
|
|
cache size, and optionally, within a user-specified maximum decrement.
|
|
|
|
In my synthetic tests, this algorithm works poorly. Even with a very high
|
|
threshold and a small maximum reduction, it results in cache size
|
|
oscillations. The size increment code typically increments the maximum cache
|
|
size above the working set size. This results in a high hit rate, which causes
|
|
the threshold size decrement code to reduce the maximum cache size below the
|
|
working set size, which causes the hit rate to crash causing the cycle to
|
|
repeat. The resulting average hit rate is poor.
|
|
|
|
It remains to be seen if this behavior will be seen in the field. The algorithm
|
|
is available for use, but it wouldn't be my first choice. If you use it, please
|
|
report back.
|
|
|
|
\subsubsection acsr Ageout Cache Size Reduction
|
|
|
|
Another heuristic for dealing with oversized cache conditions is to look for
|
|
entries that haven't been accessed for a long time, evict them, and reduce the
|
|
cache size accordingly.
|
|
|
|
The age out cache size reduction applies this intuition as follows: At the end
|
|
of each epoch (again the same epoch as used in the cache size increment
|
|
algorithm), all entries that haven't been accessed for a user-configurable
|
|
number of epochs (1 - 10 at present) are evicted. The maximum cache size is then
|
|
reduced to equal the sum of the sizes of the remaining entries. The size of the
|
|
reduction is clipped to stay within a user-specified lower bound on maximum
|
|
cache size, and optionally, within a user-specified maximum decrement.
|
|
|
|
In addition, the user may specify a minimum fraction of the cache which must be
|
|
empty before the cache size is reduced. Thus if an empty reserve of 0.1 was
|
|
specified on a 10 MB cache, there would be no cache size reduction unless the
|
|
eviction of aged out entries resulted in more than 1 MB of empty space. Further,
|
|
even after the reduction, the cache would be one-tenth empty.
|
|
|
|
In my synthetic tests, the age out algorithm works rather well, although it is
|
|
somewhat sensitive to the epoch length and age out period selection.
|
|
|
|
\subsubsection awhrtcsr Ageout With Hit Rate Threshold Cache Size Reduction
|
|
|
|
To address these issues, I combined the hit rate threshold and age out
|
|
heuristics.
|
|
|
|
Age out with threshold works just like age out, except that the algorithm is not
|
|
run unless the hit rate exceeded a user-specified threshold in the previous
|
|
epoch.
|
|
|
|
In my synthetic tests, age out with threshold seems to work nicely, with no
|
|
observed oscillation. Thus I have selected it as the default cache size
|
|
reduction algorithm.
|
|
|
|
For those interested in such things, the age out algorithm is implemented by
|
|
inserting a marker entry at the head of the LRU list at the beginning of each
|
|
epoch. Entries that haven't been accessed for at least n epochs are simply
|
|
entries that appear in the LRU list after the n-th marker at the end of an
|
|
epoch.
|
|
|
|
\section configuring Configuring the New Metadata Cache
|
|
|
|
Due to a lack of resources, the design work on the automatic cache size
|
|
adjustment algorithms was done hastily, using primarily synthetic tests. I don't
|
|
think I spent more than a couple weeks writing and running performance tests --
|
|
most time went into coding and functional testing.
|
|
|
|
As a result, while I think the algorithms provided for adaptive cache resizing
|
|
will work well in actual use, I don't really know (although preliminary results
|
|
from the field are promising). Fortunately, the issue shouldn't arise for the
|
|
vast majority of HDF5 users, and those for whom it may arise should be savvy
|
|
enough to recognize problems and deal with them.
|
|
|
|
For this latter class of users, I have implemented a number of new API calls
|
|
allowing the user to select and configure the cache resize algorithms, or to
|
|
turn them off and control cache size directly from the user program. There are
|
|
also API calls that allow the user program to monitor hit rate and cache size.
|
|
|
|
From the user perspective, all the cache configuration data for a given file is
|
|
contained in an instance of the \ref H5AC_cache_config_t structure -- the definition
|
|
of which is given below:
|
|
|
|
\snippet H5ACpublic.h H5AC_cache_config_t_snip
|
|
|
|
This structure is defined in \c H5ACpublic.h. Each field is discussed below and in
|
|
the associated header comment.
|
|
|
|
The C API allows you to get and set this structure directly. Unfortunately, the
|
|
Fortran API has to do this with individual parameters for each of the fields
|
|
(with the exception of version).
|
|
|
|
While the API calls are discussed individually in the reference manual, the
|
|
following high-level discussion of what fields to change for different purposes
|
|
should be useful.
|
|
|
|
\subsection gconfig General Configuration
|
|
|
|
The \c version field is intended to allow \THG to change the \c
|
|
H5AC_cache_config_t structure without breaking old code. For now, this field
|
|
should always be set to \c H5AC__CURR_CACHE_CONFIG_VERSION, even when you are
|
|
getting the current configuration data from the cache. The library needs the
|
|
version number to know where fields are located with reference to the supplied
|
|
base address.
|
|
|
|
The \ref H5AC_cache_config_t.rpt_fcn_enabled "rpt_fcn_enabled" field is a
|
|
boolean flag that allows you to turn on and off the resize reporting function
|
|
that reports the activities of the adaptive cache resize code at the end of each
|
|
epoch -- assuming that it is enabled.
|
|
|
|
The report function is unsupported, so you are on your own if you use it. Since
|
|
it dumps status data to stdout, you should not attempt to use it with Windows
|
|
unless you modify the source. You may find it useful if you want to experiment
|
|
with different adaptive resize configurations. It is also a convenient way of
|
|
diagnosing poor cache configuration. Finally, if you do lots of runs with
|
|
identical behavior, you can use it to determine the metadata cache size needed
|
|
in each phase of your program so you can set the required cache sizes manually.
|
|
|
|
The trace file fields are also unsupported. They allow one to open and close a
|
|
trace file in which all calls to the metadata cache are logged in a
|
|
user-specified file for later analysis. The feature is intended primarily for
|
|
THG use in debugging or optimizing the metadata cache in cases where users in
|
|
the field observe obscure failures or poor performance that we cannot re-create
|
|
in the lab. The trace file will allow us to re-create the exact sequence of
|
|
cache operations that are triggering the problem.
|
|
|
|
At present we do not have a playback utility for trace files, although I imagine
|
|
that we will write one quickly when and if we need it.
|
|
|
|
To enable the trace file, you load the full path of the desired trace file into
|
|
\ref H5AC_cache_config_t.trace_file_name "trace_file_name", and set \ref
|
|
H5AC_cache_config_t.open_trace_file "open_trace_file" to \c TRUE. In the
|
|
parallel case, an ASCII representation of the rank of each process is appended
|
|
to the supplied trace file name to create a unique trace file name for that
|
|
process.
|
|
|
|
To close an open trace file, set \ref H5AC_cache_config_t.close_trace_file
|
|
"close_trace_file" to \c TRUE.
|
|
|
|
It must be emphasized that you are on your own if you play with the trace file
|
|
feature absent a request from \THG. Needless to say, the trace file feature is
|
|
disabled by default. If you enable it, you will take a large performance hit and
|
|
generate huge trace files.
|
|
|
|
The \ref H5AC_cache_config_t.evictions_enabled "evictions_enabled" field is a
|
|
boolean flag allowing the user to disable the eviction of entries from the
|
|
metadata cache. Under normal operation conditions, this field will always be set
|
|
to \c TRUE.
|
|
|
|
In rare circumstances, the raw data throughput requirements may be so high that
|
|
the user wishes to postpone metadata writes so as to reserve I/O throughput for
|
|
raw data. The \ref H5AC_cache_config_t.evictions_enabled "evictions_enabled"
|
|
field exists to allow this -- although the user is to be warned that the
|
|
metadata cache will grow without bound while evictions are disabled. Thus
|
|
evictions should be re-enabled as soon as possible, and it may be wise to
|
|
monitor cache size and statistics (to see how to enable statistics, see the
|
|
debugging facilities section below).
|
|
|
|
Evictions may only be disabled when the automatic cache resize code is disabled
|
|
as well. Thus to disable evictions, not only must the user set the \ref
|
|
H5AC_cache_config_t.evictions_enabled "evictions_enabled" field to \c FALSE, but
|
|
he must also set \ref H5AC_cache_config_t.incr_mode "incr_mode" to
|
|
#H5C_incr__off, set \ref H5AC_cache_config_t.flash_incr_mode "flash_incr_mode"
|
|
to #H5C_flash_incr__off, and set \ref H5AC_cache_config_t.decr_mode "decr_mode"
|
|
to #H5C_decr__off.
|
|
|
|
To re-enable evictions, just set \ref H5AC_cache_config_t.evictions_enabled
|
|
"evictions_enabled" back to \c TRUE.
|
|
|
|
Before passing on to other subjects, it is worth re-iterating that disabling
|
|
evictions is an extreme step. Before attempting it, you might consider setting a
|
|
large cache size manually, and flushing the cache just before high raw data
|
|
throughput is required. This may yield the desired results without the risks
|
|
inherent in disabling evictions.
|
|
|
|
The \ref H5AC_cache_config_t.set_initial_size "set_initial_size" and \ref
|
|
H5AC_cache_config_t.initial_size "initial_size" fields allow you to specify an
|
|
initial maximum cache size. If \ref H5AC_cache_config_t.set_initial_size
|
|
"set_initial_size" is \c TRUE, \ref H5AC_cache_config_t.initial_size
|
|
"initial_size" must lie in the interval [\ref H5AC_cache_config_t.min_size
|
|
"min_size", \ref H5AC_cache_config_t.max_size "max_size"] (see below for a
|
|
discussion of the \ref H5AC_cache_config_t.min_size "min_size" and \ref
|
|
H5AC_cache_config_t.max_size "max_size" fields).
|
|
|
|
If you disable the adaptive cache resizing code (done by setting \ref
|
|
H5AC_cache_config_t.incr_mode "incr_mode" to #H5C_incr__off, \ref
|
|
H5AC_cache_config_t.flash_incr_mode "flash_incr_mode" to #H5C_flash_incr__off,
|
|
and \ref H5AC_cache_config_t.decr_mode "decr_mode" to #H5C_decr__off), you can
|
|
use these fields to control maximum cache size manually, as the maximum cache
|
|
size will remain at the initial size.
|
|
|
|
Note, that the maximum cache size is only modified when \ref
|
|
H5AC_cache_config_t.set_initial_size "set_initial_size" is \c TRUE. This allows
|
|
the use of configurations specified at compile time to change resize
|
|
configuration without altering the current maximum size of the cache. Without
|
|
this feature, an additional call would be required to get the current maximum
|
|
cache size so as to set the \ref H5AC_cache_config_t.initial_size "initial_size"
|
|
to the current maximum cache size, and thereby avoid changing it.
|
|
|
|
The \ref H5AC_cache_config_t.min_clean_fraction "min_clean_fraction" sets the
|
|
current minimum clean size as a fraction of the current max cache size. While
|
|
this field was originally used only in the parallel version of the library, it
|
|
now applies to the serial version as well. Its value must lie in the range
|
|
\Code{[0.0, 1.0]}. 0.01 is reasonable in the serial case, and 0.3 in the
|
|
parallel.
|
|
|
|
A potential interaction, discovered at release 1.8.3, between the enforcement of
|
|
the \ref H5AC_cache_config_t.min_clean_fraction "min_clean_fraction" and the
|
|
adaptive cache resize code can severely degrade performance. While this
|
|
interaction is easily dealt with in the serial case by setting \ref
|
|
H5AC_cache_config_t.min_clean_fraction "min_clean_fraction" to 0.01, the problem
|
|
is more difficult in the parallel case. Please see the Interactions section
|
|
below for further details.
|
|
|
|
The \ref H5AC_cache_config_t.max_size "max_size" and \ref
|
|
H5AC_cache_config_t.min_size "min_size" fields specify the range of maximum
|
|
sizes that may be set for the cache by the automatic resize code. \ref
|
|
H5AC_cache_config_t.min_size "min_size" must be less than or equal to
|
|
\ref H5AC_cache_config_t.max_size "max_size", and both must lie in the range
|
|
\Code{[H5C__MIN_MAX_CACHE_SIZE, H5C__MAX_MAX_CACHE_SIZE]} -- currently [1 KB,
|
|
128 MB]. If you routinely run a cache size in the top half of this range, you
|
|
should increase the hash table size. To do this, modify the \c
|
|
H5C__HASH_TABLE_LEN \Code{\#define} in \c H5Cpkg.h and re-compile. At present,
|
|
\c H5C__HASH_TABLE_LEN must be a power of two.
|
|
|
|
The \c epoch_length is the number of cache accesses between runs of the adaptive
|
|
cache size control algorithms. It is ignored if these algorithms are turned
|
|
off. It must lie in the range \Code{[H5C__MIN_AR_EPOCH_LENGTH,
|
|
H5C__MAX_AR_EPOCH_LENGTH]} -- currently [100, 1000000]. The above constants are
|
|
defined in \c H5Cprivate.h. 50000 is a reasonable value.
|
|
|
|
\subsection increment Increment Configuration
|
|
|
|
The \ref H5AC_cache_config_t.incr_mode "incr_mode" field specifies the cache
|
|
size increment algorithm used. Its value must be a member of the \ref
|
|
H5C_cache_incr_mode enum type -- currently either #H5C_incr__off or
|
|
#H5C_incr__threshold (note the double underscores after \c "incr"). This type is
|
|
defined in H5Cpublic.h.
|
|
|
|
If \ref H5AC_cache_config_t.incr_mode "incr_mode" is set to #H5C_incr__off,
|
|
regular automatic cache size increases are disabled, and the \ref
|
|
H5AC_cache_config_t.lower_hr_threshold "lower_hr_threshold", \ref
|
|
H5AC_cache_config_t.increment "increment", \ref
|
|
H5AC_cache_config_t.apply_max_increment "apply_max_increment", and \ref
|
|
H5AC_cache_config_t.max_increment "max_increment", fields are ignored.
|
|
|
|
The \ref H5AC_cache_config_t.flash_incr_mode "flash_incr_mode" field specifies
|
|
the flash cache size increment algorithm used. Its value must be a member of the
|
|
\ref H5C_cache_flash_incr_mode enum type -- currently either
|
|
#H5C_flash_incr__off or #H5C_flash_incr__add_space (note the double underscores
|
|
after \c "incr"). This type is defined in H5Cpublic.h.
|
|
|
|
If \ref H5AC_cache_config_t.flash_incr_mode "flash_incr_mode" is set to
|
|
#H5C_flash_incr__off, flash cache size increases are disabled, and the \ref
|
|
H5AC_cache_config_t.flash_multiple "flash_multiple", and \ref
|
|
H5AC_cache_config_t.flash_threshold "flash_threshold", fields are ignored.
|
|
|
|
\subsubsection hrtcsic Hit Rate Threshold Cache Size Increase Configuration
|
|
|
|
If \ref H5AC_cache_config_t.incr_mode "incr_mode" is #H5C_incr__threshold, the
|
|
cache size is increased via the hit rate threshold algorithm. The remaining
|
|
fields in the section are then used as follows:
|
|
|
|
\ref H5AC_cache_config_t.lower_hr_threshold "lower_hr_threshold" is the
|
|
threshold below which the hit rate must fall to trigger an increase. The value
|
|
must lie in the range \Code{[0.0 - 1.0]}. In my tests, a relatively high value
|
|
seems to work best -- 0.9 for example.
|
|
|
|
\ref H5AC_cache_config_t.increment "increment" is the factor by which the old
|
|
maximum cache size is multiplied to obtain an initial new maximum cache size
|
|
when an increment is needed. The actual change in size may be smaller as
|
|
required by \ref H5AC_cache_config_t.max_size "max_size" (above) and \c
|
|
max_increment (discussed below). increment must be greater than or equal to
|
|
1.0. If you set it to 1.0, you will effectively turn off the increment code. 2.0
|
|
is a reasonable value.
|
|
|
|
\ref H5AC_cache_config_t.apply_max_increment "apply_max_increment" and \ref
|
|
H5AC_cache_config_t.max_increment "max_increment" allow the user to specify a
|
|
maximum increment. If \ref H5AC_cache_config_t.apply_max_increment
|
|
"apply_max_increment" is \c TRUE, the cache size will never be increased by more
|
|
than the number of bytes specified in \ref H5AC_cache_config_t.max_increment
|
|
"max_increment" in any single increase.
|
|
|
|
\subsubsection fcsic Flash Cache Size Increase Configuration
|
|
|
|
If \ref H5AC_cache_config_t.flash_incr_mode "flash_incr_mode" is set to
|
|
#H5C_flash_incr__add_space, flash cache size increases are enabled. The size of
|
|
the cache will be increased under the following circumstances:
|
|
|
|
Let \c t be the current maximum cache size times the value of the \ref
|
|
H5AC_cache_config_t.flash_threshold "flash_threshold" field.
|
|
|
|
Let \c x be either the size of the newly inserted entry, the size of the newly
|
|
loaded entry, or the number of bytes added to the size of the entry under
|
|
consideration for triggering a flash cache size increase.
|
|
|
|
If \Code{t < x}, the basic condition for a flash cache size increase is met, and
|
|
we proceed as follows:
|
|
|
|
Let \c space_needed equal \c x less the amount of free space in the cache.
|
|
|
|
Further, let \ref H5AC_cache_config_t.increment "increment" equal \c
|
|
space_needed times the value of the \ref H5AC_cache_config_t.flash_multiple
|
|
"flash_multiple" field. If \ref H5AC_cache_config_t.increment "increment" plus
|
|
the current cache size is greater than \ref H5AC_cache_config_t.max_size
|
|
"max_size" (discussed above), reduce \ref H5AC_cache_config_t.increment
|
|
"increment" so that \ref H5AC_cache_config_t.increment "increment" plus the
|
|
current cache size is equal to \ref H5AC_cache_config_t.max_size "max_size".
|
|
|
|
If the increment is greater than zero, increase the current cache size by \ref
|
|
H5AC_cache_config_t.increment "increment". To avoid confusing the other cache
|
|
size increment or decrement algorithms, start a new epoch. Note, however, that
|
|
we do not cycle the epoch markers if some variant of the age out algorithm is in
|
|
use.
|
|
|
|
The use of the \ref H5AC_cache_config_t.flash_threshold "flash_threshold" field
|
|
is discussed above. It must be a floating-point value in the range of
|
|
\Code{[0.1, 1.0]}. 0.25 is a reasonable value.
|
|
|
|
The use of the \ref H5AC_cache_config_t.flash_multiple "flash_multiple" field is
|
|
also discussed above. It must be a floating-point value in the range of
|
|
\Code{[0.1, 10.0]}. 1.4 is a reasonable value.
|
|
|
|
\subsection decrement Decrement Configuration
|
|
|
|
The \ref H5AC_cache_config_t.decr_mode "decr_mode" field specifies the cache
|
|
size decrement algorithm used. Its value must be a member of the \ref
|
|
H5C_cache_decr_mode enum type -- currently either #H5C_decr__off,
|
|
#H5C_decr__threshold, #H5C_decr__age_out, or #H5C_decr__age_out_with_threshold
|
|
(note the double underscores after \c "decr"). This type is defined in
|
|
H5Cpublic.h.
|
|
|
|
If \ref H5AC_cache_config_t.decr_mode "decr_mode" is set to #H5C_decr__off,
|
|
automatic cache size decreases are disabled, and the remaining fields in the
|
|
cache size decrease control section are ignored.
|
|
|
|
\subsubsection hrtcsdc Hit Rate Threshold Cache Size Decrease Configuration
|
|
|
|
If \ref H5AC_cache_config_t.decr_mode "decr_mode" is #H5C_decr__threshold, the
|
|
cache size is decreased by the threshold algorithm, and the remaining fields of
|
|
the decrement section are used as follows:
|
|
|
|
\ref H5AC_cache_config_t.upper_hr_threshold "upper_hr_threshold" is the
|
|
threshold above which the hit rate must rise to trigger cache size reduction. It
|
|
must be in the range \Code{[0.0, 1.0]}. In my synthetic tests, very high values
|
|
like .9995 or .99995 seemed to work best.
|
|
|
|
\ref H5AC_cache_config_t.decrement "decrement" is the factor by which the
|
|
current maximum cache size is multiplied to obtain a tentative new maximum cache
|
|
size. It must lie in the range \Code{[0.0, 1.0]}. Relatively large values like
|
|
.9 seem to work best in my synthetic tests. Note that the actual size reduction
|
|
may be smaller as required by \ref H5AC_cache_config_t.min_size "min_size" and
|
|
\ref H5AC_cache_config_t.max_decrement "max_decrement" (discussed below). \ref
|
|
H5AC_cache_config_t.apply_max_decrement "apply_max_decrement" and \ref
|
|
H5AC_cache_config_t.max_decrement "max_decrement" allow the user to specify a
|
|
maximum decrement. If \ref H5AC_cache_config_t.apply_max_decrement
|
|
"apply_max_decrement" is \c TRUE, the cache size will never be reduced by more
|
|
than \ref H5AC_cache_config_t.max_decrement "max_decrement" bytes in any single
|
|
reduction.
|
|
|
|
With the hit rate threshold cache size decrement algorithm, the remaining fields
|
|
in the section are ignored.
|
|
|
|
\subsubsection dacsr Ageout Cache Size Reduction
|
|
|
|
If \ref H5AC_cache_config_t.decr_mode "decr_mode" is #H5C_decr__age_out the
|
|
cache size is decreased by the ageout algorithm, and the remaining fields of the
|
|
decrement section are used as follows:
|
|
|
|
\ref H5AC_cache_config_t.epochs_before_eviction "epochs_before_eviction" is the
|
|
number of epochs an entry must reside unaccessed in the cache before it is
|
|
evicted. This value must lie in the range \Code{[1, H5C__MAX_EPOCH_MARKERS]}. \c
|
|
H5C__MAX_EPOCH_MARKERS is defined in H5Cprivate.h, and is currently set to 10.
|
|
|
|
\ref H5AC_cache_config_t.apply_max_decrement "apply_max_decrement" and \ref
|
|
H5AC_cache_config_t.max_decrement "max_decrement" are used as in section
|
|
2.4.3.1.
|
|
|
|
\ref H5AC_cache_config_t.apply_empty_reserve "apply_emty_reserve" and \ref
|
|
H5AC_cache_config_t.empty_reserve "empty_reserve" allow the user to specify a
|
|
minimum empty reserve as discussed in section 2.3.2.2. An empty reserve of 0.05
|
|
or 0.1 seems to work well.
|
|
|
|
The \ref H5AC_cache_config_t.decrement "decrement" and \ref
|
|
H5AC_cache_config_t.upper_hr_threshold "upper_hr_threshold" fields are ignored
|
|
in this case.
|
|
|
|
\subsubsection dawhrtcsr Ageout With Hit Rate Threshold Cache Size Reduction
|
|
|
|
If \ref H5AC_cache_config_t.decr_mode "decr_mode" is
|
|
#H5C_decr__age_out_with_threshold, the cache size is decreased by the ageout
|
|
with hit rate threshold algorithm, and the fields of decrement section are used
|
|
as per the Ageout algorithm (see 5.3.2) with the exception of \ref
|
|
H5AC_cache_config_t.upper_hr_threshold "upper_hr_threshold".
|
|
|
|
Here, \ref H5AC_cache_config_t.upper_hr_threshold "upper_hr_threshold" is the
|
|
threshold above which the hit rate must rise to trigger cache size reduction. It
|
|
must be in the range \Code{[0.0, 1.0]}. In my synthetic tests, high values like
|
|
.999 seemed to work well.
|
|
|
|
\subsection parallel Parallel Configuration
|
|
|
|
This section is a catch-all for parallel specific configuration data. At
|
|
present, it has only one field --
|
|
\ref H5AC_cache_config_t.dirty_bytes_threshold "dirty_bytes_threshold".
|
|
|
|
In PHDF5, all operations that modify metadata must be executed collectively. We
|
|
used to think that this was enough to ensure consistency across the metadata
|
|
caches, but since we allow processes to read metadata individually, the order of
|
|
dirty entries in the LRU list can vary across processes. This, in turn, can
|
|
change the order in which dirty metadata cache entries reach the bottom of the
|
|
LRU and are flushed to disk -- opening the door to messages from the past and
|
|
messages from the future bugs.
|
|
|
|
To prevent this, only the metadata cache on process 0 of the file communicator
|
|
is allowed to write to file, and then only after entering a sync point with the
|
|
other caches. After it writes entries to file, it sends the base addresses of
|
|
the now clean entries to the other caches, so they can mark these entries clean
|
|
as well, and then leaves the sync point. The other caches mark the specified
|
|
entries as clean before they leave the sync point as well. (Observe, that since
|
|
all caches see the same stream of dirty metadata, they will all have the same
|
|
set of dirty entries upon sync point entry and exit.)
|
|
|
|
The different caches know when to synchronize by counting the number of bytes of
|
|
dirty metadata created by the collective operations modifying metadata. Whenever
|
|
this count exceeds the value specified in the \ref
|
|
H5AC_cache_config_t.dirty_bytes_threshold "dirty_bytes_threshold", they all
|
|
enter the sync point, and process 0 flushes down to its minimum clean size and
|
|
sends the list of newly cleaned entries to the other caches.
|
|
|
|
Needless to say, the value of the \ref H5AC_cache_config_t.dirty_bytes_threshold
|
|
"dirty_bytes_threshold" field must be consistent across all the caches operating
|
|
on a given file.
|
|
|
|
All dirty metadata can also by flushed under programmatic control via the
|
|
H5Fflush() call. This call must be collective and will reset the dirty data
|
|
counts on each metadata cache.
|
|
|
|
Absent calls to H5Fflush(), dirty metadata will only be flushed when the \ref
|
|
H5AC_cache_config_t.dirty_bytes_threshold "dirty_bytes_threshold" is exceeded,
|
|
and then only down to the H5AC_cache_config_t.min_clean_fraction
|
|
"min_clean_fraction". Thus, if a program does all its metadata modifications in
|
|
one phase, and then doesn't modify metadata thereafter, a residue of dirty
|
|
metadata will be frozen in the metadata caches for the remainder of the
|
|
computation -- effectively reducing the sizes of the caches.
|
|
|
|
In the default configuration, the caches will eventually resize themselves to
|
|
maintain an acceptable hit rate. However, this will take time, and it will
|
|
increase the application's footprint in memory.
|
|
|
|
If your application behaves in this manner, you can avoid this by a collective
|
|
call to H5Fflush() immediately after the metadata modification phase.
|
|
|
|
\subsection interactions Interactions
|
|
|
|
Evictions may not be disabled unless the automatic cache resize code is disabled
|
|
as well (by setting \ref H5AC_cache_config_t.decr_mode "decr_mode" to
|
|
#H5C_decr__off, \c flash_decr_mode to #H5C_flash_incr__add_space, and \ref
|
|
H5AC_cache_config_t.incr_mode "incr_mode" to #H5C_incr__off) -- thus placing the
|
|
cache size under the direct control of the user program.
|
|
|
|
There is no logical necessity for this restriction. It is imposed because it
|
|
simplifies testing greatly and because I can't see any reason why one would want
|
|
to disable evictions while the automatic cache size adjustment code was
|
|
enabled. This restriction can be relaxed if anyone can come up with a good
|
|
reason to do so.
|
|
|
|
At present, there are two interactions between the increment and decrement
|
|
sections of the configuration.
|
|
|
|
If \ref H5AC_cache_config_t.incr_mode "incr_mode" is #H5C_incr__threshold, and
|
|
\ref H5AC_cache_config_t.decr_mode "decr_mode" is either #H5C_decr__threshold or
|
|
#H5C_decr__age_out_with_threshold, then \ref
|
|
H5AC_cache_config_t.lower_hr_threshold "lower_hr_threshold" must be strictly
|
|
less than \ref H5AC_cache_config_t.upper_hr_threshold "upper_hr_threshold".
|
|
|
|
Also, if the flash cache size increment code is enabled and is triggered, it
|
|
will restart the current epoch without calling any other cache size increment or
|
|
decrement code.
|
|
|
|
In both the serial and parallel cases, there is the potential for an interaction
|
|
between the \ref H5AC_cache_config_t.min_clean_fraction "min_clean_fraction" and
|
|
the cache size increment code that can severely degrade
|
|
performance. Specifically, if the \ref H5AC_cache_config_t.min_clean_fraction
|
|
"min_clean_fraction" is large enough, it is possible that keeping the specified
|
|
fraction of the cache clean may generate enough flushes to seriously degrade
|
|
performance even though the hit rate is excellent.
|
|
|
|
In the serial case, this is easily dealt with by selecting a very small \ref
|
|
H5AC_cache_config_t.min_clean_fraction "min_clean_fraction" -- 0.01 for example
|
|
-- as this still avoids the "metadata blizzard" phenomenon that appears when the
|
|
cache fills with dirty metadata and must then flush all of it before evicting an
|
|
entry to make space for a new entry.
|
|
|
|
The problem is more difficult in the parallel case, as the \ref
|
|
H5AC_cache_config_t.min_clean_fraction "min_clean_fraction" is used to ensure
|
|
that the cache contains clean entries that can be evicted to make space for new
|
|
entries when metadata writes are forbidden -- i.e. between sync points.
|
|
|
|
This issue was discovered shortly before release 1.8.3 and an automated solution
|
|
has not been implemented. Should it become an issue for an application, try
|
|
manually setting the cache size to ~1.5 times the maximum working set size for
|
|
the application, and leave \ref H5AC_cache_config_t.min_clean_fraction
|
|
"min_clean_fraction" set to 0.3.
|
|
|
|
You can approximate the working set size of your application via repeated calls
|
|
to H5Fget_mdc_size() and H5Fget_mdc_hit_rate() while running your program with
|
|
the cache resize code enabled. The maximum value returned by H5Fget_mdc_size()
|
|
should be a reasonable approximation -- particularly if the associated hit rate
|
|
is good. In the parallel case, there is also an interaction between \c
|
|
min_clean_fraction and \ref H5AC_cache_config_t.dirty_bytes_threshold
|
|
"dirty_bytes_threshold". Absent calls to H5Fflush() (discussed above), the upper
|
|
bound on the amount of dirty data in the metadata caches will oscillate between
|
|
(1 - \ref H5AC_cache_config_t.min_clean_fraction "min_clean_fraction") times
|
|
current maximum cache size, and that value plus the \ref
|
|
H5AC_cache_config_t.dirty_bytes_threshold "dirty_bytes_threshold". Needless to
|
|
say, it will be best if the \ref H5AC_cache_config_t.min_size "min_size", \ref
|
|
H5AC_cache_config_t.min_clean_fraction "min_clean_fraction", and the \ref
|
|
H5AC_cache_config_t.dirty_bytes_threshold "dirty_bytes_threshold"
|
|
are chosen so that the cache can't fill with dirty data.
|
|
|
|
\subsection defaults Default Metadata Cache Configuration
|
|
|
|
Starting with release 1.8.3, HDF5 provides different default metadata cache
|
|
configurations depending on whether the library is compiled for serial or
|
|
parallel.
|
|
|
|
The default configuration for the serial case is as follows:
|
|
|
|
\code{.c}
|
|
{
|
|
/* int version = */ H5C__CURR_AUTO_SIZE_CTL_VER,
|
|
/* hbool_t rpt_fcn_enabled = */ FALSE,
|
|
/* hbool_t open_trace_file = */ FALSE,
|
|
/* hbool_t close_trace_file = */ FALSE,
|
|
/* char trace_file_name[] = */ "",
|
|
/* hbool_t evictions_enabled = */ TRUE,
|
|
/* hbool_t set_initial_size = */ TRUE,
|
|
/* size_t initial_size = */ ( 2 * 1024 * 1024),
|
|
/* double min_clean_fraction = */ 0.01,
|
|
/* size_t max_size = */ (32 * 1024 * 1024),
|
|
/* size_t min_size = */ ( 1 * 1024 * 1024),
|
|
/* long int epoch_length = */ 50000,
|
|
/* enum H5C_cache_incr_mode incr_mode = */ H5C_incr__threshold,
|
|
/* double lower_hr_threshold = */ 0.9,
|
|
/* double increment = */ 2.0,
|
|
/* hbool_t apply_max_increment = */ TRUE,
|
|
/* size_t max_increment = */ (4 * 1024 * 1024),
|
|
/* enum H5C_cache_flash_incr_mode */
|
|
/* flash_incr_mode = */ H5C_flash_incr__add_space,
|
|
/* double flash_multiple = */ 1.4,
|
|
/* double flash_threshold = */ 0.25,
|
|
/* enum H5C_cache_decr_mode decr_mode = */ H5C_decr__age_out_with_threshold,
|
|
/* double upper_hr_threshold = */ 0.999,
|
|
/* double decrement = */ 0.9,
|
|
/* hbool_t apply_max_decrement = */ TRUE,
|
|
/* size_t max_decrement = */ (1 * 1024 * 1024),
|
|
/* int epochs_before_eviction = */ 3,
|
|
/* hbool_t apply_empty_reserve = */ TRUE,
|
|
/* double empty_reserve = */ 0.1,
|
|
/* int dirty_bytes_threshold = */ (256 * 1024)
|
|
}
|
|
\endcode
|
|
|
|
The default configuration for the parallel case is as follows:
|
|
|
|
\code{.c}
|
|
{
|
|
/* int version = */ H5C__CURR_AUTO_SIZE_CTL_VER,
|
|
/* hbool_t rpt_fcn_enabled = */ FALSE,
|
|
/* hbool_t open_trace_file = */ FALSE,
|
|
/* hbool_t close_trace_file = */ FALSE,
|
|
/* char trace_file_name[] = */ "",
|
|
/* hbool_t evictions_enabled = */ TRUE,
|
|
/* hbool_t set_initial_size = */ TRUE,
|
|
/* size_t initial_size = */ ( 2 * 1024 * 1024),
|
|
/* double min_clean_fraction = */ 0.3,
|
|
/* size_t max_size = */ (32 * 1024 * 1024),
|
|
/* size_t min_size = */ ( 1 * 1024 * 1024),
|
|
/* long int epoch_length = */ 50000,
|
|
/* enum H5C_cache_incr_mode incr_mode = */ H5C_incr__threshold,
|
|
/* double lower_hr_threshold = */ 0.9,
|
|
/* double increment = */ 2.0,
|
|
/* hbool_t apply_max_increment = */ TRUE,
|
|
/* size_t max_increment = */ (4 * 1024 * 1024),
|
|
/* enum H5C_cache_flash_incr_mode */
|
|
/* flash_incr_mode = */ H5C_flash_incr__add_space,
|
|
/* double flash_multiple = */ 1.0,
|
|
/* double flash_threshold = */ 0.25,
|
|
/* enum H5C_cache_decr_mode decr_mode = */ H5C_decr__age_out_with_threshold,
|
|
/* double upper_hr_threshold = */ 0.999,
|
|
/* double decrement = */ 0.9,
|
|
/* hbool_t apply_max_decrement = */ TRUE,
|
|
/* size_t max_decrement = */ (1 * 1024 * 1024),
|
|
/* int epochs_before_eviction = */ 3,
|
|
/* hbool_t apply_empty_reserve = */ TRUE,
|
|
/* double empty_reserve = */ 0.1,
|
|
/* int dirty_bytes_threshold = */ (256 * 1024)
|
|
}
|
|
\endcode
|
|
|
|
The default serial configuration should be adequate for most serial HDF5 users.
|
|
|
|
The same may not be true for the default parallel configuration due to the
|
|
interaction between the \ref H5AC_cache_config_t.min_clean_fraction "min_clean_fraction" and the cache size increase code. See
|
|
the Interactions section for further details.
|
|
|
|
Should you need to change the default configuration, it can be found in
|
|
H5ACprivate.h. Look for the definition of H5AC__DEFAULT_RESIZE_CONFIG.
|
|
|
|
\section controlling Controlling the New Metadata Cache Size From Your Program
|
|
|
|
You have already seen how \ref H5AC_cache_config_t has facilities that allow you
|
|
to control the metadata cache size directly. Use H5Fget_mdc_config() and
|
|
H5Fset_mdc_config() to get and set the metadata cache configuration on an open
|
|
file. Use H5Pget_mdc_config() and H5Pset_mdc_config() to get and set the initial
|
|
metadata cache configuration in a file access property list. Recall that this
|
|
list contains configuration data used when opening a file.
|
|
|
|
Use H5Fget_mdc_hit_rate() to get the average hit rate since the last time the
|
|
hit rate stats were reset. This happens automatically at the beginning of each
|
|
epoch if the adaptive cache resize code is enabled. You can also do it manually
|
|
with H5Freset_mdc_hit_rate_stats(). Be careful about doing this if the adaptive
|
|
cache resize code is enabled, as you may confuse it.
|
|
|
|
Use H5Fget_mdc_size() to get metadata cache size data on an open file.
|
|
|
|
Finally, note that cache size and cache footprint are two different things -- in
|
|
my tests, the cache footprint (as inferred from the UNIX \c top command) is
|
|
typically about three times the maximum cache size. I haven't tracked it down
|
|
yet, but I would guess that most of this is due to the very small typical cache
|
|
entry size combined with the rather large size of the cache entry header
|
|
structure. This should be investigated further, but there are other matters of
|
|
higher priority.
|
|
|
|
\section news New Metadata Cache Debugging Facilities
|
|
|
|
The new metadata cache has a variety of debugging facilities that may be of
|
|
use. I doubt that any other than the report function and the trace file will
|
|
ever be accessible via the API, but they are relatively easy to turn on in the
|
|
source code.
|
|
|
|
Note that none of this should be viewed as supported -- it is described here on
|
|
the off chance that you want to use it, but you are on your own if you do. Also,
|
|
there are no promises as to consistency between versions.
|
|
|
|
As mentioned above, you can use the \ref H5AC_cache_config_t.rpt_fcn_enabled "rpt_fcn_enabled" field of the
|
|
configuration structure to enable the default reporting function
|
|
(H5C_def_auto_resize_rpt_fcn() in H5C.c). If this function doesn't work for you,
|
|
you will have to write your own. In particular, remember that it uses \c stdout,
|
|
so it will probably be unhappy under Windows.
|
|
|
|
Again, remember that this facility is not supported. Further, it is likely to
|
|
change every time I do any serious work on the cache.
|
|
|
|
There is also an extensive statistics collection code. Use
|
|
H5C_COLLECT_CACHE_STATS and H5C_COLLECT_CACHE_ENTRY_STATS in H5Cprivate.h to
|
|
turn this on. If you also turn on H5AC_DUMP_STATS_ON_CLOSE in H5ACprivate.h,
|
|
stats will be dumped when you close a file. Alternatively you can call
|
|
H5C_stats() and H5C_stats__reset() within the library to dump and reset
|
|
stats. Both of these functions are defined in H5C.c.
|
|
|
|
Finally, the cache also contains an extensive sanity checking code. Much of this
|
|
is turned on when you compile in debug mode, but to enable the full suite, turn
|
|
on H5C_DO_SANITY_CHECKS in H5Cprivate.h.
|
|
|
|
\section trouble Trouble Shooting
|
|
|
|
Absent major bugs in the cache, the only troubleshooting you should have to do
|
|
is diagnosing and fixing problems with your cache configuration.
|
|
|
|
Assuming it runs on your platform (I've only used it under Linux), the reporting
|
|
function is probably the most convenient diagnosis tool. However, since it is
|
|
unsupported code, I will not discuss it further beyond directing you to the
|
|
source (H5C_def_auto_resize_rpt_fcn() in H5C.c).
|
|
|
|
Absent the reporting function, regular calls to H5Fget_mdc_hit_rate() should
|
|
give you a good idea of the hit rate over time. Remember that the hit rate stats
|
|
are reset at the end of each epoch (when adaptive cache resizing is enabled), so
|
|
you should expect some jitter.
|
|
|
|
Similar calls to H5Fget_mdc_size() should allow you to monitor cache size and
|
|
the fraction of the current maximum cache size that is actually in use.
|
|
|
|
If the hit rate is consistently low, and the cache it at its current maximum
|
|
size, increasing the maximum size is an obvious fix.
|
|
|
|
If you see hit rate and cache size oscillations, try disabling adaptive cache
|
|
resizing and setting a fixed cache size a bit greater than the high end of the
|
|
cache size oscillations you observed.
|
|
|
|
If the hit rate oscillations don't go away, you are probably looking at a
|
|
feature of your application that can't be helped without major changes to the
|
|
cache. Please send along a description of the situation.
|
|
|
|
If the oscillations do go away, you may be able to come up with a configuration
|
|
that deals with the situation. If that fails, control the cache size manually,
|
|
and write to me, so I can try to develop an adaptive resize algorithm that works
|
|
in your case.
|
|
|
|
Needless to say, you should give the cache a few epochs to adapt to
|
|
circumstances. If that is too slow for you, try manual cache size control.
|
|
|
|
If you find it necessary to disable evictions, you may find it useful to enable
|
|
the internal statistics collection code mentioned above in the section on
|
|
debugging facilities.
|
|
|
|
Amongst many other things, the stats code will report the maximum cache size,
|
|
and the average successful and unsuccessful search depths in the hash table. If
|
|
these latter figures are significantly above 1, you should increase the size of
|
|
the hash table.
|
|
|
|
*/ |