<html>
|
|
<head>
|
|
<title>Filters</title>
|
|
<h1>Filters in HDF5</h1>
|
|
|
|
<b>Note: Transient pipelines described in this document have not
|
|
been implemented.</b>
|
|
|
|
<h2>Introduction</h2>
|
|
|
|
<p>HDF5 allows chunked data to pass through user-defined filters
|
|
on the way to or from disk.  The filters operate on chunks of an
<code>H5D_CHUNKED</code> dataset and can be arranged in a pipeline
so that the output of one filter becomes the input of the next filter.
|
|
|
|
</p><p>Each filter has a two-byte identification number (type
|
|
<code>H5Z_filter_t</code>) allocated by The HDF Group and can also be
|
|
passed application-defined integer resources to control its
|
|
behavior. Each filter also has an optional ASCII comment
|
|
string.
|
|
|
|
</p>
|
|
<table>
|
|
<tbody><tr>
|
|
<th>Values for <code>H5Z_filter_t</code></th>
|
|
<th>Description</th>
|
|
</tr>
|
|
|
|
<tr valign="top">
|
|
<td><code>0-255</code></td>
|
|
<td>These values are reserved for filters predefined and
|
|
registered by the HDF5 library and of use to the general
|
|
public. They are described in a separate section
|
|
below.</td>
|
|
</tr>
|
|
|
|
<tr valign="top">
|
|
<td><code>256-511</code></td>
|
|
<td>Filter numbers in this range are used for testing only
|
|
and can be used temporarily by any organization. No
|
|
attempt is made to resolve numbering conflicts since all
|
|
definitions are by nature temporary.</td>
|
|
</tr>
|
|
|
|
<tr valign="top">
|
|
<td><code>512-65535</code></td>
|
|
<td>Reserved for future assignment. Please contact the
|
|
<a href="mailto:help@hdfgroup.org">HDF5 development team</a>
|
|
to reserve a value or range of values for
|
|
use by your filters.</td>
|
|
</tr></tbody></table>
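<p>The three ranges in the table above can be checked mechanically.
The following is a minimal sketch, not part of the HDF5 API: the
enum and the function name <code>classify_filter_id</code> are
hypothetical and exist only to illustrate the numbering scheme.</p>

```c
#include <stddef.h>

/* Hypothetical names for the ranges listed in the table; not HDF5 symbols. */
typedef enum {
    FILTER_RANGE_PREDEFINED, /* 0-255: predefined and registered by the library */
    FILTER_RANGE_TESTING,    /* 256-511: temporary use, testing only */
    FILTER_RANGE_RESERVED,   /* 512-65535: reserved for future assignment */
    FILTER_RANGE_INVALID     /* does not fit in a two-byte identifier */
} filter_range_t;

/* Map a candidate filter number to the range it falls in. */
static filter_range_t classify_filter_id(long id)
{
    if (id < 0 || id > 65535)
        return FILTER_RANGE_INVALID;
    if (id <= 255)
        return FILTER_RANGE_PREDEFINED;
    if (id <= 511)
        return FILTER_RANGE_TESTING;
    return FILTER_RANGE_RESERVED;
}
```

<p>A filter under development would normally pick a number for which
this sketch reports the testing range.</p>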
|
|
|
|
<h2>Defining and Querying the Filter Pipeline</h2>
|
|
|
|
<p>Two types of filters can be applied to raw data I/O: permanent
|
|
filters and transient filters. The permanent filter pipeline is
|
|
defined when the dataset is created while the transient pipeline
|
|
is defined for each I/O operation. During an
|
|
<code>H5Dwrite()</code> the transient filters are applied first
|
|
in the order defined and then the permanent filters are applied
|
|
in the order defined. For an <code>H5Dread()</code> the
|
|
opposite order is used: permanent filters in reverse order, then
|
|
transient filters in reverse order. An <code>H5Dread()</code>
|
|
must result in the same amount of data for a chunk as the
|
|
original <code>H5Dwrite()</code>.
|
|
|
|
</p><p>The permanent filter pipeline is defined by calling
|
|
<code>H5Pset_filter()</code> for a dataset creation property
|
|
list while the transient filter pipeline is defined by calling
|
|
that function for a dataset transfer property list.
|
|
|
|
</p><dl>
|
|
<dt><code>herr_t H5Pset_filter (hid_t <em>plist</em>,
|
|
H5Z_filter_t <em>filter</em>, unsigned int <em>flags</em>,
|
|
size_t <em>cd_nelmts</em>, const unsigned int
|
|
<em>cd_values</em>[])</code>
|
|
</dt><dd>This function adds the specified <em>filter</em> and
|
|
corresponding properties to the end of the transient or
|
|
permanent output filter pipeline (depending on whether
|
|
<em>plist</em> is a dataset creation or dataset transfer
|
|
property list). The <em>flags</em> argument specifies certain
|
|
general properties of the filter and is documented below. The
|
|
<em>cd_values</em> is an array of <em>cd_nelmts</em> integers
|
|
which are auxiliary data for the filter. The integer values
|
|
will be stored in the dataset object header as part of the
|
|
filter information.
|
|
</dd><dt><code>int H5Pget_nfilters (hid_t <em>plist</em>)</code>
|
|
</dt><dd>This function returns the number of filters defined in the
|
|
permanent or transient filter pipeline depending on whether
|
|
<em>plist</em> is a dataset creation or dataset transfer
|
|
property list. In each pipeline the filters are numbered from
|
|
0 through <em>N</em>-1 where <em>N</em> is the value returned
|
|
by this function. During output to the file the filters of a
|
|
pipeline are applied in increasing order (the inverse is true
|
|
for input). Zero is returned if there are no filters in the
|
|
pipeline and a negative value is returned for errors.
|
|
</dd><dt><code>H5Z_filter_t H5Pget_filter (hid_t <em>plist</em>,
|
|
int <em>filter_number</em>, unsigned int *<em>flags</em>,
|
|
size_t *<em>cd_nelmts</em>, unsigned int
|
|
*<em>cd_values</em>, size_t namelen, char name[])</code>
|
|
</dt><dd>This is the query counterpart of
|
|
<code>H5Pset_filter()</code> and returns information about a
|
|
particular filter number in a permanent or transient pipeline
|
|
depending on whether <em>plist</em> is a dataset creation or
|
|
dataset transfer property list. On input, <em>cd_nelmts</em>
|
|
indicates the number of entries in the <em>cd_values</em>
|
|
array allocated by the caller while on exit it contains the
|
|
number of values defined by the filter. The
|
|
<em>filter_number</em> should be a value between zero and
|
|
<em>N</em>-1 as described for <code>H5Pget_nfilters()</code>
|
|
and the function will return failure (a negative value) if the
|
|
filter number is out of range. If <em>name</em> is a pointer
|
|
to an array of at least <em>namelen</em> bytes then the filter
|
|
name will be copied into that array. The name will be null
|
|
terminated if the <em>namelen</em> is large enough. The
|
|
filter name returned will be the name appearing in the file or
|
|
else the name registered for the filter or else an empty string.
|
|
</dd></dl>
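<p>The ordering rule (filters 0 through <em>N</em>-1 in increasing
order on output, the inverse on input) can be illustrated without the
library.  The sketch below is a stand-alone model, not HDF5 code: the
type <code>stage_fn</code> and both toy stages are hypothetical.  The
two stages do not commute, so the round trip only succeeds when the
read side walks the pipeline in reverse.</p>

```c
#include <stddef.h>

/* Hypothetical stage type: transforms n bytes in place.
 * reverse == 0 applies the transformation, reverse != 0 undoes it. */
typedef void (*stage_fn)(unsigned char *data, size_t n, int reverse);

/* Toy stage: add one to every byte (wraps around at 255). */
static void add_one(unsigned char *data, size_t n, int reverse)
{
    for (size_t i = 0; i < n; i++)
        data[i] = (unsigned char)(data[i] + (reverse ? -1 : 1));
}

/* Toy stage: rotate every byte left by three bits. */
static void rotate_left3(unsigned char *data, size_t n, int reverse)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char b = data[i];
        data[i] = reverse ? (unsigned char)((b >> 3) | (b << 5))
                          : (unsigned char)((b << 3) | (b >> 5));
    }
}

/* Output direction: stages 0 .. N-1 in increasing order. */
static void pipeline_write(const stage_fn *stages, size_t nstages,
                           unsigned char *data, size_t n)
{
    for (size_t i = 0; i < nstages; i++)
        stages[i](data, n, 0);
}

/* Input direction: stages N-1 .. 0, each one undoing its own step. */
static void pipeline_read(const stage_fn *stages, size_t nstages,
                          unsigned char *data, size_t n)
{
    for (size_t i = nstages; i-- > 0; )
        stages[i](data, n, 1);
}
```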
|
|
|
|
<p>The flags argument to the functions above is a bit vector of
|
|
the following fields:
|
|
|
|
</p>
|
|
<table>
|
|
<tbody><tr>
|
|
<th>Values for <em>flags</em></th>
|
|
<th>Description</th>
|
|
</tr>
|
|
|
|
<tr valign="top">
|
|
<td><code>H5Z_FLAG_OPTIONAL</code></td>
|
|
<td>If this bit is set then the filter is optional. If
|
|
the filter fails (see below) during an
|
|
<code>H5Dwrite()</code> operation then the filter is
|
|
just excluded from the pipeline for the chunk for which
|
|
it failed; the filter will not participate in the
|
|
pipeline during an <code>H5Dread()</code> of the chunk.
|
|
This is commonly used for compression filters: if the
|
|
compression result would be larger than the input then
|
|
the compression filter returns failure and the
|
|
uncompressed data is stored in the file. If this bit is
|
|
clear and a filter fails then the
|
|
<code>H5Dwrite()</code> or <code>H5Dread()</code> also
|
|
fails.</td>
|
|
</tr>
|
|
</tbody></table>
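<p>The effect of the optional bit can be modeled as follows.  This is
a minimal sketch under stated assumptions, not the library's
implementation: the names <code>sketch_stage</code>,
<code>write_chunk</code>, and <code>read_chunk</code> are
hypothetical, the flag value is illustrative, and the per-chunk
bit mask of excluded filters is simply one way to remember which
filters to skip when the chunk is read back.</p>

```c
#include <stddef.h>

/* Stand-in for H5Z_FLAG_OPTIONAL; the numeric value is illustrative only. */
#define SKETCH_FLAG_OPTIONAL 0x0001u

/* Hypothetical in-place filter: returns the number of valid bytes, 0 on failure.
 * A filter that fails is assumed to leave the data untouched. */
typedef size_t (*sketch_filter_fn)(int reverse, unsigned char *data, size_t nbytes);

typedef struct {
    sketch_filter_fn fn;
    unsigned int     flags;
} sketch_stage;

static size_t xor_filter(int reverse, unsigned char *data, size_t nbytes)
{
    (void)reverse;                      /* XOR is its own inverse */
    for (size_t i = 0; i < nbytes; i++)
        data[i] ^= 0x5A;
    return nbytes;
}

static size_t failing_filter(int reverse, unsigned char *data, size_t nbytes)
{
    (void)reverse; (void)data; (void)nbytes;
    return 0;                           /* always reports failure */
}

/* Output: returns 0 on success, -1 if a required filter failed.
 * Bit i of *skipped is set when optional stage i was excluded (fewer than 32 stages). */
static int write_chunk(const sketch_stage *p, size_t n, unsigned char *data,
                       size_t *nbytes, unsigned int *skipped)
{
    *skipped = 0;
    for (size_t i = 0; i < n; i++) {
        size_t out = p[i].fn(0, data, *nbytes);
        if (out == 0) {
            if (!(p[i].flags & SKETCH_FLAG_OPTIONAL))
                return -1;              /* required filter failed: the write fails */
            *skipped |= 1u << i;        /* optional filter: exclude it for this chunk */
            continue;
        }
        *nbytes = out;
    }
    return 0;
}

/* Input: reverse order, skipping the stages excluded during output. */
static int read_chunk(const sketch_stage *p, size_t n, unsigned char *data,
                      size_t *nbytes, unsigned int skipped)
{
    for (size_t i = n; i-- > 0; ) {
        if (skipped & (1u << i))
            continue;
        size_t out = p[i].fn(1, data, *nbytes);
        if (out == 0)
            return -1;                  /* any remaining filter that fails is fatal */
        *nbytes = out;
    }
    return 0;
}
```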
|
|
|
|
<h2>Defining Filters</h2>
|
|
|
|
<p>Each filter is bidirectional, handling both input and output to
|
|
the file, and a flag is passed to the filter to indicate the
|
|
direction. In either case the filter reads a chunk of data from
|
|
a buffer, usually performs some sort of transformation on the
|
|
data, places the result in the same or new buffer, and returns
|
|
the buffer pointer and size to the caller. If something goes
|
|
wrong the filter should return zero to indicate a failure.
|
|
|
|
</p><p>During output, a filter that fails or isn't defined and is
|
|
marked as optional is silently excluded from the pipeline and
|
|
will not be used when reading that chunk of data. A required
|
|
filter that fails or isn't defined causes the entire output
|
|
operation to fail. During input, any filter that has not been
|
|
excluded from the pipeline during output and fails or is not
|
|
defined will cause the entire input operation to fail.
|
|
|
|
</p><p>Filters are defined in two phases. The first phase is to
|
|
define a function to act as the filter and link the function
|
|
into the application. The second phase is to register the
|
|
function, associating the function with an
|
|
<code>H5Z_filter_t</code> identification number and a comment.
|
|
|
|
</p><dl>
|
|
<dt><code>typedef size_t (*H5Z_func_t)(unsigned int
|
|
<em>flags</em>, size_t <em>cd_nelmts</em>, const unsigned int
|
|
<em>cd_values</em>[], size_t <em>nbytes</em>, size_t
|
|
*<em>buf_size</em>, void **<em>buf</em>)</code>
|
|
</dt><dd>The <em>flags</em>, <em>cd_nelmts</em>, and
|
|
<em>cd_values</em> are the same as for the
|
|
<code>H5Pset_filter()</code> function with the additional flag
|
|
<code>H5Z_FLAG_REVERSE</code> which is set when the filter is
|
|
called as part of the input pipeline. The input buffer is
|
|
pointed to by <em>*buf</em> and has a total size of
|
|
<em>*buf_size</em> bytes but only <em>nbytes</em> are valid
|
|
data. The filter should perform the transformation in place if
|
|
possible and return the number of valid bytes or zero for
|
|
failure. If the transformation cannot be done in place then
|
|
the filter should allocate a new buffer with
|
|
<code>malloc()</code> and assign it to <em>*buf</em>,
|
|
assigning the allocated size of that buffer to
|
|
<em>*buf_size</em>. The old buffer should be freed
|
|
by calling <code>free()</code>.
|
|
|
|
<br><br>
|
|
</dd><dt><code>herr_t H5Zregister (H5Z_filter_t <em>filter_id</em>,
|
|
const char *<em>comment</em>, H5Z_func_t
|
|
<em>filter</em>)</code>
|
|
</dt><dd>The <em>filter</em> function is associated with a filter
|
|
number and a short ASCII comment which will be stored in the
|
|
HDF5 file if the filter is used as part of a permanent
|
|
pipeline during dataset creation.
|
|
</dd></dl>
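<p>The buffer hand-off described for <code>H5Z_func_t</code> is
easiest to see with a transformation that changes the data size.  The
sketch below follows the callback signature given above but is
otherwise hypothetical: <code>double_bytes_filter</code> is a toy
filter, and <code>SKETCH_FLAG_REVERSE</code> is a local stand-in for
<code>H5Z_FLAG_REVERSE</code> so the code compiles without the HDF5
headers.  For simplicity the output half always allocates a new
buffer, even when the old one might have been large enough.</p>

```c
#include <stdlib.h>
#include <stddef.h>

/* Local stand-in for H5Z_FLAG_REVERSE so the sketch builds without HDF5 headers. */
#define SKETCH_FLAG_REVERSE 0x0100u

/* Toy filter: output writes every byte twice, input keeps every other byte.
 * Returns the number of valid bytes, or zero for failure. */
static size_t
double_bytes_filter(unsigned int flags, size_t cd_nelmts,
                    const unsigned int cd_values[], size_t nbytes,
                    size_t *buf_size, void **buf)
{
    unsigned char *in = *buf;
    (void)cd_nelmts;
    (void)cd_values;

    if (flags & SKETCH_FLAG_REVERSE) {
        /* Input: the result is smaller, so it can be produced in place. */
        if (nbytes == 0 || nbytes % 2 != 0)
            return 0;
        for (size_t i = 0; i < nbytes / 2; i++)
            in[i] = in[2 * i];
        return nbytes / 2;
    }

    /* Output: allocate a new buffer for the larger result. */
    if (nbytes == 0)
        return 0;
    size_t out_size = 2 * nbytes;
    unsigned char *out = malloc(out_size);
    if (out == NULL)
        return 0;                 /* *buf and *buf_size are left untouched */
    for (size_t i = 0; i < nbytes; i++) {
        out[2 * i]     = in[i];
        out[2 * i + 1] = in[i];
    }

    free(*buf);                   /* release the old buffer */
    *buf = out;                   /* hand the new buffer back to the caller */
    *buf_size = out_size;         /* report its allocated size */
    return out_size;
}
```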
|
|
|
|
<h2>Predefined Filters</h2>
|
|
|
|
<p>If <code>zlib</code> version 1.1.2 or later was found
|
|
during configuration then the library will define a filter whose
|
|
<code>H5Z_filter_t</code> number is
|
|
<code>H5Z_FILTER_DEFLATE</code>. Since this compression method
|
|
has the potential for generating compressed data which is larger
|
|
than the original, the <code>H5Z_FLAG_OPTIONAL</code> flag
|
|
should be turned on so such cases can be handled gracefully by
|
|
storing the original data instead of the compressed data. The
|
|
<em>cd_nelmts</em> should be one with <em>cd_values</em>[0]
|
|
being a compression aggression level between zero and nine,
|
|
inclusive (zero is the fastest compression while nine results in
|
|
the best compression ratio).
|
|
|
|
</p><p>A convenience function for adding the
|
|
<code>H5Z_FILTER_DEFLATE</code> filter to a pipeline is:
|
|
|
|
</p><dl>
|
|
<dt><code>herr_t H5Pset_deflate (hid_t <em>plist</em>, unsigned
|
|
<em>aggression</em>)</code>
|
|
</dt><dd>The deflate compression method is added to the end of the
|
|
permanent or transient filter pipeline depending on whether
|
|
<em>plist</em> is a dataset creation or dataset transfer
|
|
property list. The <em>aggression</em> is a number between
|
|
zero and nine (inclusive) to indicate the tradeoff between
|
|
speed and compression ratio (zero is fastest, nine is best
|
|
ratio).
|
|
</dd></dl>
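<p>The auxiliary data for the deflate filter is a single integer.  As
a small illustration, the hypothetical helper below (not an HDF5
function) validates an aggression level and fills in the
<em>cd_nelmts</em> / <em>cd_values</em> pair in the shape that
<code>H5Pset_filter()</code> expects according to its signature
above.</p>

```c
#include <stddef.h>

/* Hypothetical helper: prepare the auxiliary data for a deflate-style filter.
 * Returns 0 on success, -1 if the aggression level is outside 0-9.
 * The resulting pair would be passed as the last two arguments of
 *   H5Pset_filter(plist, H5Z_FILTER_DEFLATE, H5Z_FLAG_OPTIONAL,
 *                 cd_nelmts, cd_values);                            */
static int deflate_params(unsigned int aggression,
                          unsigned int cd_values[1], size_t *cd_nelmts)
{
    if (aggression > 9)
        return -1;            /* only levels zero through nine are defined */
    cd_values[0] = aggression;
    *cd_nelmts = 1;           /* exactly one auxiliary value */
    return 0;
}
```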
|
|
|
|
<p>Even if <code>zlib</code> isn't detected during
|
|
configuration the application can define
|
|
<code>H5Z_FILTER_DEFLATE</code> as a permanent filter. If the
|
|
filter is marked as optional (as with
|
|
<code>H5Pset_deflate()</code>) then it will always fail and be
|
|
automatically removed from the pipeline. Applications that read
|
|
data will fail only if the data is actually compressed; they
|
|
won't fail if <code>H5Z_FILTER_DEFLATE</code> was part of the
|
|
permanent output pipeline but was automatically excluded because
|
|
it didn't exist when the data was written.
|
|
|
|
</p><p><code>zlib</code> can be acquired from
|
|
<code><a href="http://www.cdrom.com/pub/infozip/zlib/">
|
|
http://www.cdrom.com/pub/infozip/zlib/</a></code>.
|
|
|
|
</p><h2>Example</h2>
|
|
|
|
<p>This example shows how to define and register a simple filter
|
|
that adds a checksum capability to the data stream.
|
|
|
|
</p><p>The function that acts as the filter always returns zero
|
|
(failure) if the <code>md5()</code> function was not detected at
|
|
configuration time (left as an exercise for the reader).
|
|
Otherwise the function is broken down to an input and output
|
|
half. The output half calculates a checksum, increases the size
|
|
of the output buffer if necessary, and appends the checksum to
|
|
the end of the buffer. The input half calculates the checksum
|
|
on the first part of the buffer and compares it to the checksum
|
|
already stored at the end of the buffer. If the two differ then
|
|
zero (failure) is returned, otherwise the buffer size is reduced
|
|
to exclude the checksum.
|
|
|
|
</p>
|
|
<table>
|
|
<tbody><tr>
|
|
<td>
|
|
<p><code></code></p><pre><code>
size_t
md5_filter(unsigned int flags, size_t cd_nelmts,
           const unsigned int cd_values[], size_t nbytes,
           size_t *buf_size, void **buf)
{
#ifdef HAVE_MD5
    unsigned char       cksum[16];

    if (flags &amp; H5Z_FLAG_REVERSE) {
        /* Input: the last 16 bytes of the chunk hold the stored checksum */
        assert(nbytes&gt;=16);
        md5(nbytes-16, *buf, cksum);

        /* Compare */
        if (memcmp(cksum, (char*)(*buf)+nbytes-16, 16)) {
            return 0; /*fail*/
        }

        /* Strip off checksum */
        return nbytes-16;

    } else {
        /* Output */
        md5(nbytes, *buf, cksum);

        /* Increase buffer size if necessary */
        if (nbytes+16&gt;*buf_size) {
            void *new_buf = realloc(*buf, nbytes + 16);
            if (!new_buf) return 0; /*fail; *buf is still valid*/
            *buf = new_buf;
            *buf_size = nbytes + 16;
        }

        /* Append checksum */
        memcpy((char*)(*buf)+nbytes, cksum, 16);
        return nbytes+16;
    }
#else
    return 0; /*fail*/
#endif
}
</code></pre>
|
|
</td>
|
|
</tr>
|
|
</tbody></table>
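<p>Because <code>md5()</code> is left undefined in the example above,
the same append-and-verify logic can be exercised end to end by
swapping in a trivial checksum.  The sketch below makes that
substitution explicit: <code>simple_sum</code> is a plain byte sum
stored in four bytes (not MD5 and not suitable for real integrity
checking), and <code>SKETCH_FLAG_REVERSE</code> is a local stand-in
for <code>H5Z_FLAG_REVERSE</code>.  All names are hypothetical.</p>

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

#define SKETCH_FLAG_REVERSE 0x0100u  /* local stand-in for H5Z_FLAG_REVERSE */
#define CKSUM_LEN 4                  /* size of the toy checksum in bytes */

/* Stand-in for md5(): sum of all bytes, stored big-endian in four bytes. */
static void simple_sum(size_t n, const unsigned char *data,
                       unsigned char out[CKSUM_LEN])
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    out[0] = (unsigned char)((sum >> 24) & 0xFF);
    out[1] = (unsigned char)((sum >> 16) & 0xFF);
    out[2] = (unsigned char)((sum >> 8) & 0xFF);
    out[3] = (unsigned char)(sum & 0xFF);
}

static size_t
sum_filter(unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[],
           size_t nbytes, size_t *buf_size, void **buf)
{
    unsigned char cksum[CKSUM_LEN];
    (void)cd_nelmts;
    (void)cd_values;

    if (flags & SKETCH_FLAG_REVERSE) {
        /* Input: recompute the checksum over the payload and compare. */
        if (nbytes <= CKSUM_LEN)
            return 0;
        simple_sum(nbytes - CKSUM_LEN, *buf, cksum);
        if (memcmp(cksum, (unsigned char *)*buf + nbytes - CKSUM_LEN, CKSUM_LEN) != 0)
            return 0;                       /* mismatch: report failure */
        return nbytes - CKSUM_LEN;          /* strip the checksum */
    }

    /* Output: grow the buffer if needed, then append the checksum. */
    if (nbytes == 0)
        return 0;
    if (nbytes + CKSUM_LEN > *buf_size) {
        void *grown = realloc(*buf, nbytes + CKSUM_LEN);
        if (grown == NULL)
            return 0;                       /* *buf is still valid */
        *buf = grown;
        *buf_size = nbytes + CKSUM_LEN;
    }
    simple_sum(nbytes, *buf, cksum);
    memcpy((unsigned char *)*buf + nbytes, cksum, CKSUM_LEN);
    return nbytes + CKSUM_LEN;
}
```

<p>Calling the filter once without the reverse flag and once with it
should return the original byte count; flipping any payload bit in
between should make the reverse call return zero.</p>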
|
|
|
|
<p>Once the filter function is defined it must be registered so
|
|
the HDF5 library knows about it.  Since we're testing this
filter we choose one of the <code>H5Z_filter_t</code> numbers
from the range set aside for testing (256-511).  We'll arbitrarily choose 305.
|
|
|
|
</p><p>
|
|
</p>
|
|
<table>
|
|
<tbody><tr>
|
|
<td>
|
|
<p><code></code></p><pre><code>
#define FILTER_MD5 305
herr_t status = H5Zregister(FILTER_MD5, "md5 checksum", md5_filter);
</code></pre>
|
|
</td>
|
|
</tr>
|
|
</tbody></table>
|
|
|
|
<p>Now we can use the filter in a pipeline. We could have added
|
|
the filter to the pipeline before defining or registering the
|
|
filter as long as the filter was defined and registered by the time
|
|
we tried to use it (if the filter is marked as optional then we
|
|
could have used it without defining it and the library would
|
|
have automatically removed it from the pipeline for each chunk
|
|
written before the filter was defined and registered).
|
|
|
|
</p><p>
|
|
</p>
|
|
<table>
|
|
<tbody><tr>
|
|
<td>
|
|
<p><code></code></p><pre><code>
hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
hsize_t chunk_size[3] = {10,10,10};
H5Pset_chunk(dcpl, 3, chunk_size);
H5Pset_filter(dcpl, FILTER_MD5, 0, 0, NULL);
hid_t dset = H5Dcreate(file, "dset", H5T_NATIVE_DOUBLE, space, dcpl);
</code></pre>
|
|
</td>
|
|
</tr>
|
|
</tbody></table>
|
|
|
|
<h2>Filter Diagnostics</h2>
|
|
|
|
<p>If the library is compiled with debugging turned on for the H5Z
|
|
layer (usually as a result of <code>configure
|
|
--enable-debug=z</code>) then filter statistics are printed when
|
|
the application exits normally or the library is closed. The
|
|
statistics are written to the standard error stream and include
|
|
two lines for each filter that was used: one for input and one
|
|
for output. The following fields are displayed:
|
|
|
|
</p><p>
|
|
</p>
|
|
<table>
|
|
<tbody><tr>
|
|
<th>Field Name</th>
|
|
<th>Description</th>
|
|
</tr>
|
|
|
|
<tr valign="top">
|
|
<td>Method</td>
|
|
<td>This is the name of the method as defined with
<code>H5Zregister()</code> with the character
"&lt;" or "&gt;" prepended to indicate
input or output.</td>
|
|
</tr>
|
|
|
|
<tr valign="top">
|
|
<td>Total</td>
|
|
<td>The total number of bytes processed by the filter
|
|
including errors.  This is the maximum of the
<em>nbytes</em> argument and the return value.
|
|
</td></tr>
|
|
|
|
<tr valign="top">
|
|
<td>Errors</td>
|
|
<td>This field shows the number of bytes of the Total
|
|
column which can be attributed to errors.</td>
|
|
</tr>
|
|
|
|
<tr valign="top">
|
|
<td>User, System, Elapsed</td>
|
|
<td>These are the amounts of user time, system time, and
|
|
elapsed time in seconds spent in the filter function.
|
|
Elapsed time is sensitive to system load. These times
|
|
may be zero on operating systems that don't support the
|
|
required operations.</td>
|
|
</tr>
|
|
|
|
<tr valign="top">
|
|
<td>Bandwidth</td>
|
|
<td>This is the filter bandwidth which is the total
|
|
number of bytes processed divided by elapsed time.
|
|
Since elapsed time is subject to system load the
|
|
bandwidth numbers cannot always be trusted.
|
|
Furthermore, the bandwidth includes bytes attributed to
|
|
errors which may significantly taint the value if the
|
|
function is able to detect errors without much
|
|
expense.</td>
|
|
</tr>
|
|
</tbody></table>
|
|
|
|
<p>
|
|
</p>
|
|
<table>
|
|
<caption align="bottom">
|
|
<b>Example: Filter Statistics</b>
|
|
</caption>
|
|
<tbody><tr>
|
|
<td>
|
|
<p><code></code></p><pre><code>H5Z: filter statistics accumulated over life of library:
   Method     Total  Errors  User  System  Elapsed Bandwidth
   ------     -----  ------  ----  ------  ------- ---------
   &gt;deflate  160000   40000  0.62    0.74     1.33 117.5 kBs
   &lt;deflate  120000       0  0.11    0.00     0.12 1.000 MBs
</code></pre>
|
|
</td>
|
|
</tr>
|
|
</tbody></table>
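<p>The relationship between the Total, Errors, Elapsed, and Bandwidth
columns can be sketched as follows.  This is an illustrative model,
not the library's bookkeeping code: the struct and function names are
hypothetical, and counting the <em>nbytes</em> of a failed call as
error bytes is an assumption based on the field descriptions
above.</p>

```c
#include <stddef.h>

/* Hypothetical accumulator for one direction of one filter. */
typedef struct {
    unsigned long total;   /* bytes processed, including errors */
    unsigned long errors;  /* bytes attributed to failed calls (assumed rule) */
    double        elapsed; /* seconds spent in the filter function */
} sketch_stats;

/* Record one filter call: nbytes went in, ret came back (0 means failure). */
static void stats_record(sketch_stats *s, size_t nbytes, size_t ret, double secs)
{
    size_t bytes = nbytes > ret ? nbytes : ret;  /* maximum of the two */
    s->total += bytes;
    if (ret == 0)
        s->errors += nbytes;
    s->elapsed += secs;
}

/* Bandwidth in bytes per second: total bytes divided by elapsed time. */
static double stats_bandwidth(const sketch_stats *s)
{
    return s->elapsed > 0.0 ? (double)s->total / s->elapsed : 0.0;
}
```

<p>With the figures from the &gt;deflate row above (160000 bytes in
1.33 seconds) this yields roughly 120,300 bytes per second, which is
about 117.5 when divided by 1024.</p>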
|
|
|
|
<hr>
|
|
|
|
|
|
<p><a name="fn1">Footnote 1:</a> Dataset chunks can be compressed
|
|
through the use of filters. Developers should be aware that
|
|
reading and rewriting compressed chunked data can result in holes
|
|
in an HDF5 file. In time, enough such holes can increase the
|
|
file size enough to impair application or library performance
|
|
when working with that file. See
|
|
<a href="https://support.hdfgroup.org/HDF5/doc1.6/Performance.html#Freespace">
|
|
Freespace Management</a>
|
|
in the chapter
|
|
<a href="https://support.hdfgroup.org/HDF5/doc1.6/Performance.html">
|
|
Performance Analysis and Issues</a>.</p>
|
|
</html>