HDF5 allows chunked data to pass through user-defined filters
on the way to or from disk. The filters operate on chunks of an
H5D_CHUNKED dataset and can be arranged in a pipeline so that the
output of one filter becomes the input of the next filter.
Each filter has a two-byte identification number (type
H5Z_filter_t) allocated by The HDF Group and can also be
passed application-defined integer resources to control its
behavior. Each filter also has an optional ASCII comment
string.
Values for H5Z_filter_t | Description |
---|---|
0-255 | These values are reserved for filters predefined and registered by the HDF5 library and of use to the general public. They are described in a separate section below. |
256-511 | Filter numbers in this range are used for testing only and can be used temporarily by any organization. No attempt is made to resolve numbering conflicts since all definitions are by nature temporary. |
512-65535 | Reserved for future assignment. Please contact the HDF5 development team to reserve a value or range of values for use by your filters. |
Two types of filters can be applied to raw data I/O: permanent
filters and transient filters. The permanent filter pipeline is
defined when the dataset is created while the transient pipeline
is defined for each I/O operation. During an H5Dwrite() the
transient filters are applied first in the order defined and then
the permanent filters are applied in the order defined. For an
H5Dread() the opposite order is used: permanent filters in reverse
order, then transient filters in reverse order. An H5Dread() must
result in the same amount of data for a chunk as the original
H5Dwrite().
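The ordering rule above can be sketched without the HDF5 library. The following is only an illustration: `demo_pipeline_t`, `demo_write_order()` and `demo_read_order()` are hypothetical names, and a pipeline is modeled as nothing more than an ordered list of filter names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a pipeline: an ordered list of filter names. */
typedef struct {
    const char *names[8];
    int         count;
} demo_pipeline_t;

/* Order of application for a write: transient filters in the order
 * defined, then permanent filters in the order defined.
 * Returns the number of entries stored in out[]. */
int demo_write_order(const demo_pipeline_t *transient,
                     const demo_pipeline_t *permanent,
                     const char *out[], int max)
{
    int n = 0;
    assert(transient != NULL && permanent != NULL && out != NULL);
    for (int i = 0; i < transient->count && n < max; i++)
        out[n++] = transient->names[i];
    for (int i = 0; i < permanent->count && n < max; i++)
        out[n++] = permanent->names[i];
    return n;
}

/* Order of application for a read: permanent filters in reverse order,
 * then transient filters in reverse order. */
int demo_read_order(const demo_pipeline_t *transient,
                    const demo_pipeline_t *permanent,
                    const char *out[], int max)
{
    int n = 0;
    assert(transient != NULL && permanent != NULL && out != NULL);
    for (int i = permanent->count - 1; i >= 0 && n < max; i--)
        out[n++] = permanent->names[i];
    for (int i = transient->count - 1; i >= 0 && n < max; i--)
        out[n++] = transient->names[i];
    return n;
}
```

With one transient filter T0 and two permanent filters P0 and P1, the write order is T0, P0, P1 and the read order is P1, P0, T0, so each filter undoes its own transformation in exactly the reverse sequence.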
The permanent filter pipeline is defined by calling
H5Pset_filter() for a dataset creation property
list while the transient filter pipeline is defined by calling
that function for a dataset transfer property list.
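herr_t H5Pset_filter (hid_t plist, H5Z_filter_t filter, unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[])
This function adds the specified filter to the end of the permanent
or transient pipeline, depending on whether plist is a dataset
creation or dataset transfer property list. The cd_values array
holds cd_nelmts integers of auxiliary data for the filter.
int H5Pget_nfilters (hid_t plist)
This function returns the number of filters, N, defined in the
pipeline of plist. The filters in a pipeline are numbered from
0 through N-1.
H5Z_filter_t H5Pget_filter (hid_t plist, int filter_number, unsigned int *flags, size_t *cd_nelmts, unsigned int *cd_values, size_t namelen, char name[])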
This function is the query counterpart of H5Pset_filter() and
returns information about a particular filter number in a
permanent or transient pipeline depending on whether plist is a
dataset creation or dataset transfer property list. On input,
cd_nelmts indicates the number of entries in the cd_values
array allocated by the caller while on exit it contains the
number of values defined by the filter. The filter_number
should be a value between zero and N-1 as described for
H5Pget_nfilters() and the function will return failure (a
negative value) if the filter number is out of range. If name
is a pointer to an array of at least namelen bytes then the
filter name will be copied into that array. The name will be
null terminated if namelen is large enough. The filter name
returned will be the name appearing in the file, or else the
name registered for the filter, or else an empty string.
The flags argument to the functions above is a bit vector of the following fields:
Values for flags | Description |
---|---|
H5Z_FLAG_OPTIONAL | If this bit is set then the filter is optional. If the filter fails (see below) during an H5Dwrite() operation then the filter is just excluded from the pipeline for the chunk for which it failed; the filter will not participate in the pipeline during an H5Dread() of the chunk. This is commonly used for compression filters: if the compression result would be larger than the input then the compression filter returns failure and the uncompressed data is stored in the file. If this bit is clear and a filter fails then the H5Dwrite() or H5Dread() also fails. |
Each filter is bidirectional, handling both input and output to the file, and a flag is passed to the filter to indicate the direction. In either case the filter reads a chunk of data from a buffer, usually performs some sort of transformation on the data, places the result in the same buffer or a new one, and returns the buffer pointer and size to the caller. If something goes wrong the filter should return zero to indicate a failure.
During output, a filter that fails or isn't defined and is marked as optional is silently excluded from the pipeline and will not be used when reading that chunk of data. A required filter that fails or isn't defined causes the entire output operation to fail. During input, any filter that has not been excluded from the pipeline during output and fails or is not defined will cause the entire input operation to fail.
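The failure rules for optional and required filters can be sketched as a small stand-alone simulation. Everything here is hypothetical and simplified: the `demo_` names, the flag values, the reduced callback signature, and the use of a per-chunk bit mask to remember which filters were excluded are assumptions made for illustration, not a description of the library's internals. The sketch also assumes that a failing filter leaves the buffer untouched.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FLAG_OPTIONAL 0x0001u /* illustrative stand-in for H5Z_FLAG_OPTIONAL */
#define DEMO_FLAG_REVERSE  0x0100u /* illustrative stand-in for H5Z_FLAG_REVERSE */

/* Simplified in-place filter: returns the number of valid bytes, or zero on failure. */
typedef size_t (*demo_filter_fn)(unsigned int flags, size_t nbytes, unsigned char *buf);

typedef struct {
    demo_filter_fn fn;    /* NULL models a filter that is not defined */
    unsigned int   flags;
} demo_stage_t;

/* Example filter: inverts every byte; it is its own inverse and never fails. */
size_t demo_invert(unsigned int flags, size_t nbytes, unsigned char *buf)
{
    (void)flags;
    for (size_t i = 0; i < nbytes; i++)
        buf[i] = (unsigned char)~buf[i];
    return nbytes;
}

/* Example filter that always reports failure and leaves the data alone. */
size_t demo_always_fail(unsigned int flags, size_t nbytes, unsigned char *buf)
{
    (void)flags; (void)nbytes; (void)buf;
    return 0;
}

/* Output direction. Bit i of *excluded is set when stage i was skipped.
 * Returns 0 on success, -1 when a required filter fails or is undefined. */
int demo_pipeline_write(const demo_stage_t *stages, int nstages,
                        unsigned char *buf, size_t *nbytes, unsigned int *excluded)
{
    assert(nstages >= 0 && nstages <= 32); /* one mask bit per stage */
    *excluded = 0;
    for (int i = 0; i < nstages; i++) {
        size_t result = stages[i].fn ? stages[i].fn(stages[i].flags, *nbytes, buf) : 0;
        if (result == 0) {
            if (!(stages[i].flags & DEMO_FLAG_OPTIONAL))
                return -1;            /* required filter: whole write fails */
            *excluded |= 1u << i;     /* optional filter: silently skipped */
            continue;
        }
        *nbytes = result;
    }
    return 0;
}

/* Input direction: reverse order, skipping stages excluded during output.
 * Any remaining stage that fails or is undefined fails the whole read. */
int demo_pipeline_read(const demo_stage_t *stages, int nstages,
                       unsigned char *buf, size_t *nbytes, unsigned int excluded)
{
    assert(nstages >= 0 && nstages <= 32);
    for (int i = nstages - 1; i >= 0; i--) {
        if (excluded & (1u << i))
            continue;
        if (stages[i].fn == NULL)
            return -1;
        size_t result = stages[i].fn(stages[i].flags | DEMO_FLAG_REVERSE, *nbytes, buf);
        if (result == 0)
            return -1;
        *nbytes = result;
    }
    return 0;
}
```

In this sketch a pipeline of `demo_invert` (required) followed by `demo_always_fail` (optional) writes successfully with the second stage recorded as excluded, and the read then only runs `demo_invert` in reverse. Clearing the optional flag on the second stage makes the same write fail.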
Filters are defined in two phases. The first phase is to
define a function to act as the filter and link the function
into the application. The second phase is to register the
function, associating the function with an H5Z_filter_t
identification number and a comment.
typedef size_t (*H5Z_func_t)(unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[], size_t nbytes, size_t *buf_size, void **buf)
The flags, cd_nelmts, and cd_values arguments are the same as for
the H5Pset_filter() function with the additional flag
H5Z_FLAG_REVERSE which is set when the filter is
called as part of the input pipeline. The input buffer is
pointed to by *buf and has a total size of
*buf_size bytes but only nbytes are valid
data. The filter should perform the transformation in place if
possible and return the number of valid bytes or zero for
failure. If the transformation cannot be done in place then
the filter should allocate a new buffer with malloc()
and assign it to *buf, assigning the allocated size of that
buffer to *buf_size. The old buffer should be freed
by calling free().
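As a minimal sketch of this contract, the function below follows the callback signature given above and transforms the data in place, so it never needs to replace the buffer. The name `demo_xor_filter` and the idea of taking an XOR key from the first client data value are illustrative assumptions. The typedef is repeated locally only so the sketch compiles without the HDF5 headers.

```c
#include <assert.h>
#include <stddef.h>

/* Callback type with the signature given in the text, redeclared locally
 * under a hypothetical name so that no HDF5 header is needed. */
typedef size_t (*demo_H5Z_func_t)(unsigned int flags, size_t cd_nelmts,
                                  const unsigned int cd_values[], size_t nbytes,
                                  size_t *buf_size, void **buf);

/* XORs every valid byte with a key taken from cd_values[0].
 * XOR is its own inverse, so the direction flag can be ignored.
 * Returns the number of valid bytes, or zero on failure. */
size_t demo_xor_filter(unsigned int flags, size_t cd_nelmts,
                       const unsigned int cd_values[], size_t nbytes,
                       size_t *buf_size, void **buf)
{
    (void)flags;

    /* Reject calls that do not supply exactly one client data value
     * or that describe an inconsistent buffer. */
    if (cd_nelmts != 1 || cd_values == NULL)
        return 0;
    if (buf == NULL || *buf == NULL || buf_size == NULL || nbytes > *buf_size)
        return 0;

    unsigned char  key = (unsigned char)(cd_values[0] & 0xFFu);
    unsigned char *p   = (unsigned char *)*buf;

    /* Only the first nbytes of the *buf_size bytes are valid data. */
    for (size_t i = 0; i < nbytes; i++)
        p[i] ^= key;

    /* In-place transformation: *buf and *buf_size are left unchanged. */
    return nbytes;
}
```

Calling the function twice with the same key restores the original bytes, which is the property a bidirectional filter needs. A filter whose output can be larger than its input would additionally have to allocate a new buffer as described above.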
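herr_t H5Zregister (H5Z_filter_t filter_id, const char *comment, H5Z_func_t filter)
This function associates the filter function with the
identification number filter_id and the comment string.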
If zlib version 1.1.2 or later was found
during configuration then the library will define a filter whose
H5Z_filter_t number is H5Z_FILTER_DEFLATE. Since this compression
method has the potential for generating compressed data which is
larger than the original, the H5Z_FLAG_OPTIONAL flag
should be turned on so such cases can be handled gracefully by
storing the original data instead of the compressed data. The
cd_nelmts should be one with cd_values[0]
being a compression aggression level between zero and nine,
inclusive (zero is the fastest compression while nine results in
the best compression ratio).
A convenience function for adding the H5Z_FILTER_DEFLATE
filter to a pipeline is:
herr_t H5Pset_deflate (hid_t plist, unsigned aggression)
Even if zlib isn't detected during
configuration the application can define H5Z_FILTER_DEFLATE
as a permanent filter. If the filter is marked as optional
(as with H5Pset_deflate()) then it will always fail and be
automatically removed from the pipeline. Applications that read
data will fail only if the data is actually compressed; they
won't fail if H5Z_FILTER_DEFLATE was part of the
permanent output pipeline but was automatically excluded because
it didn't exist when the data was written.
zlib can be acquired from http://www.cdrom.com/pub/infozip/zlib/.
This example shows how to define and register a simple filter that adds a checksum capability to the data stream.
The function that acts as the filter always returns zero
(failure) if the md5() function was not detected at
configuration time (left as an exercise for the reader).
Otherwise the function is split into an input half and an output
half. The output half calculates a checksum, increases the size
of the output buffer if necessary, and appends the checksum to
the end of the buffer. The input half calculates the checksum
on the first part of the buffer and compares it to the checksum
already stored at the end of the buffer. If the two differ then
zero (failure) is returned, otherwise the buffer size is reduced
to exclude the checksum.
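A checksum filter along these lines might look like the sketch below. It is not the original example: it swaps md5() for a small hand-written Adler-32 so that it has no external dependency, the `demo_` names are hypothetical, and `DEMO_FLAG_REVERSE` is a local stand-in for H5Z_FLAG_REVERSE so that the code compiles without the HDF5 headers. The function follows the callback signature given earlier in the text.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_FLAG_REVERSE 0x0100u /* illustrative stand-in for H5Z_FLAG_REVERSE */
#define DEMO_SUM_SIZE     4u      /* bytes appended to each chunk */

/* Adler-32 checksum, used here in place of md5(). */
static uint32_t demo_adler32(const unsigned char *p, size_t n)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < n; i++) {
        a = (a + p[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Store and load the checksum big-endian so the layout is host independent. */
static void demo_store_be32(unsigned char *dst, uint32_t v)
{
    dst[0] = (unsigned char)(v >> 24);
    dst[1] = (unsigned char)(v >> 16);
    dst[2] = (unsigned char)(v >> 8);
    dst[3] = (unsigned char)v;
}

static uint32_t demo_load_be32(const unsigned char *src)
{
    return ((uint32_t)src[0] << 24) | ((uint32_t)src[1] << 16) |
           ((uint32_t)src[2] << 8)  |  (uint32_t)src[3];
}

size_t demo_checksum_filter(unsigned int flags, size_t cd_nelmts,
                            const unsigned int cd_values[], size_t nbytes,
                            size_t *buf_size, void **buf)
{
    (void)cd_nelmts; (void)cd_values; /* this filter takes no client data */
    if (buf == NULL || *buf == NULL || buf_size == NULL || nbytes > *buf_size)
        return 0;
    unsigned char *data = (unsigned char *)*buf;

    if (flags & DEMO_FLAG_REVERSE) {
        /* Input half: verify the trailing checksum, then drop it. */
        if (nbytes <= DEMO_SUM_SIZE)
            return 0;
        size_t payload = nbytes - DEMO_SUM_SIZE;
        if (demo_adler32(data, payload) != demo_load_be32(data + payload))
            return 0;
        return payload;
    }

    /* Output half: grow the buffer if the checksum does not fit. */
    if (nbytes + DEMO_SUM_SIZE > *buf_size) {
        unsigned char *bigger = malloc(nbytes + DEMO_SUM_SIZE);
        if (bigger == NULL)
            return 0;
        memcpy(bigger, data, nbytes);
        free(data);
        *buf = bigger;
        *buf_size = nbytes + DEMO_SUM_SIZE;
        data = bigger;
    }
    demo_store_be32(data + nbytes, demo_adler32(data, nbytes));
    return nbytes + DEMO_SUM_SIZE;
}
```

Because the input half returns zero for any mismatch, a chunk corrupted on disk makes the read fail instead of silently returning bad data. The sketch assumes the incoming buffer was obtained from malloc(), as the contract for replacing buffers requires.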
Once the filter function is defined it must be registered so
the HDF5 library knows about it. Since we're testing this
filter we choose one of the H5Z_filter_t numbers
from the range set aside for testing (256-511). We'll randomly
choose 305.
Now we can use the filter in a pipeline. We could have added the filter to the pipeline before defining or registering it, as long as the filter was defined and registered by the time we tried to use it. If the filter is marked as optional then we could even have used it without defining it: the library would have automatically removed it from the pipeline for each chunk written before the filter was defined and registered.
If the library is compiled with debugging turned on for the H5Z
layer (usually as a result of configure --enable-debug=z)
then filter statistics are printed when
the application exits normally or the library is closed. The
statistics are written to the standard error stream and include
two lines for each filter that was used: one for input and one
for output. The following fields are displayed:
Field Name | Description |
---|---|
Method | This is the name of the method as defined with H5Zregister() with the character "<" or ">" prepended to indicate input or output. |
Total | The total number of bytes processed by the filter including errors. This is the maximum of the nbytes argument and the return value. |
Errors | This field shows the number of bytes of the Total column which can be attributed to errors. |
User, System, Elapsed | These are the amount of user time, system time, and elapsed time in seconds spent in the filter function. Elapsed time is sensitive to system load. These times may be zero on operating systems that don't support the required operations. |
Bandwidth | This is the filter bandwidth which is the total number of bytes processed divided by elapsed time. Since elapsed time is subject to system load the bandwidth numbers cannot always be trusted. Furthermore, the bandwidth includes bytes attributed to errors which may significantly taint the value if the function is able to detect errors without much expense. |
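The arithmetic behind the Total, Errors and Bandwidth fields can be sketched as follows. This is one reading of the field descriptions above and not the library's own accounting code: the `demo_stats_t` structure and both functions are hypothetical, and the sketch assumes that a call returning zero contributes all of its bytes to the Errors column.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-filter, per-direction counters. */
typedef struct {
    unsigned long long total;   /* bytes processed, including errors */
    unsigned long long errors;  /* bytes of total attributed to failed calls */
    double             elapsed; /* seconds spent in the filter function */
} demo_stats_t;

/* Record one filter call: nbytes went in, `returned` came back (0 = failure). */
void demo_stats_record(demo_stats_t *s, size_t nbytes, size_t returned, double elapsed)
{
    assert(s != NULL);
    /* Total counts the larger of the input size and the return value. */
    size_t processed = nbytes > returned ? nbytes : returned;
    s->total += processed;
    if (returned == 0)
        s->errors += processed;
    s->elapsed += elapsed;
}

/* Bandwidth in bytes per second; zero when no elapsed time was measured. */
double demo_stats_bandwidth(const demo_stats_t *s)
{
    assert(s != NULL);
    return s->elapsed > 0.0 ? (double)s->total / s->elapsed : 0.0;
}
```

For example, a successful call on 100 bytes followed by a failed call on 200 bytes, each taking half a second, gives a total of 300 bytes, 200 error bytes and a bandwidth of 300 bytes per second. The failed call inflates the bandwidth figure exactly as the Bandwidth row warns.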
Footnote 1: Dataset chunks can be compressed through the use of filters. Developers should be aware that reading and rewriting compressed chunked data can result in holes in an HDF5 file. In time, enough such holes can increase the file size enough to impair application or library performance when working with that file. See Freespace Management in the chapter Performance Analysis and Issues.