Cloned library HDF5-1.14.1 with extra build files for internal package management.
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

451 lines
19 KiB

2 years ago
<html>
<head>
<title>Filters</title>
<h1>Filters in HDF5</h1>
<b>Note: Transient pipelines described in this document have not
been implemented.</b>
<h2>Introduction</h2>
<p>HDF5 allows chunked data to pass through user-defined filters
on the way to or from disk. The filters operate on chunks of an
<code>H5D_CHUNKED</code> dataset can be arranged in a pipeline
so output of one filter becomes the input of the next filter.
</p><p>Each filter has a two-byte identification number (type
<code>H5Z_filter_t</code>) allocated by The HDF Group and can also be
passed application-defined integer resources to control its
behavior. Each filter also has an optional ASCII comment
string.
</p>
<table>
<tbody><tr>
<th>Values for <code>H5Z_filter_t</code></th>
<th>Description</th>
</tr>
<tr valign="top">
<td><code>0-255</code></td>
<td>These values are reserved for filters predefined and
registered by the HDF5 library and of use to the general
public. They are described in a separate section
below.</td>
</tr>
<tr valign="top">
<td><code>256-511</code></td>
<td>Filter numbers in this range are used for testing only
and can be used temporarily by any organization. No
attempt is made to resolve numbering conflicts since all
definitions are by nature temporary.</td>
</tr>
<tr valign="top">
<td><code>512-65535</code></td>
<td>Reserved for future assignment. Please contact the
<a href="mailto:help@hdfgroup.org">HDF5 development team</a>
to reserve a value or range of values for
use by your filters.</td>
</tr></tbody></table>
<h2>Defining and Querying the Filter Pipeline</h2>
<p>Two types of filters can be applied to raw data I/O: permanent
filters and transient filters. The permanent filter pipeline is
defined when the dataset is created while the transient pipeline
is defined for each I/O operation. During an
<code>H5Dwrite()</code> the transient filters are applied first
in the order defined and then the permanent filters are applied
in the order defined. For an <code>H5Dread()</code> the
opposite order is used: permanent filters in reverse order, then
transient filters in reverse order. An <code>H5Dread()</code>
must result in the same amount of data for a chunk as the
original <code>H5Dwrite()</code>.
</p><p>The permanent filter pipeline is defined by calling
<code>H5Pset_filter()</code> for a dataset creation property
list while the transient filter pipeline is defined by calling
that function for a dataset transfer property list.
</p><dl>
<dt><code>herr_t H5Pset_filter (hid_t <em>plist</em>,
H5Z_filter_t <em>filter</em>, unsigned int <em>flags</em>,
size_t <em>cd_nelmts</em>, const unsigned int
<em>cd_values</em>[])</code>
</dt><dd>This function adds the specified <em>filter</em> and
corresponding properties to the end of the transient or
permanent output filter pipeline (depending on whether
<em>plist</em> is a dataset creation or dataset transfer
property list). The <em>flags</em> argument specifies certain
general properties of the filter and is documented below. The
<em>cd_values</em> is an array of <em>cd_nelmts</em> integers
which are auxiliary data for the filter. The integer values
will be stored in the dataset object header as part of the
filter information.
</dd><dt><code>int H5Pget_nfilters (hid_t <em>plist</em>)</code>
</dt><dd>This function returns the number of filters defined in the
permanent or transient filter pipeline depending on whether
<em>plist</em> is a dataset creation or dataset transfer
property list. In each pipeline the filters are numbered from
0 through <em>N</em>-1 where <em>N</em> is the value returned
by this function. During output to the file the filters of a
pipeline are applied in increasing order (the inverse is true
for input). Zero is returned if there are no filters in the
pipeline and a negative value is returned for errors.
</dd><dt><code>H5Z_filter_t H5Pget_filter (hid_t <em>plist</em>,
int <em>filter_number</em>, unsigned int *<em>flags</em>,
size_t *<em>cd_nelmts</em>, unsigned int
*<em>cd_values</em>, size_t namelen, char name[])</code>
</dt><dd>This is the query counterpart of
<code>H5Pset_filter()</code> and returns information about a
particular filter number in a permanent or transient pipeline
depending on whether <em>plist</em> is a dataset creation or
dataset transfer property list. On input, <em>cd_nelmts</em>
indicates the number of entries in the <em>cd_values</em>
array allocated by the caller while on exit it contains the
number of values defined by the filter. The
<em>filter_number</em> should be a value between zero and
<em>N</em>-1 as described for <code>H5Pget_nfilters()</code>
and the function will return failure (a negative value) if the
filter number is out of range. If <em>name</em> is a pointer
to an array of at least <em>namelen</em> bytes then the filter
name will be copied into that array. The name will be null
terminated if the <em>namelen</em> is large enough. The
filter name returned will be the name appearing in the file or
else the name registered for the filter or else an empty string.
</dd></dl>
<p>The flags argument to the functions above is a bit vector of
the following fields:
</p>
<table>
<tbody><tr>
<th>Values for <em>flags</em></th>
<th>Description</th>
</tr>
<tr valign="top">
<td><code>H5Z_FLAG_OPTIONAL</code></td>
<td>If this bit is set then the filter is optional. If
the filter fails (see below) during an
<code>H5Dwrite()</code> operation then the filter is
just excluded from the pipeline for the chunk for which
it failed; the filter will not participate in the
pipeline during an <code>H5Dread()</code> of the chunk.
This is commonly used for compression filters: if the
compression result would be larger than the input then
the compression filter returns failure and the
uncompressed data is stored in the file. If this bit is
clear and a filter fails then the
<code>H5Dwrite()</code> or <code>H5Dread()</code> also
fails.</td>
</tr>
</tbody></table>
<h2>Defining Filters</h2>
<p>Each filter is bidirectional, handling both input and output to
the file, and a flag is passed to the filter to indicate the
direction. In either case the filter reads a chunk of data from
a buffer, usually performs some sort of transformation on the
data, places the result in the same or new buffer, and returns
the buffer pointer and size to the caller. If something goes
wrong the filter should return zero to indicate a failure.
</p><p>During output, a filter that fails or isn't defined and is
marked as optional is silently excluded from the pipeline and
will not be used when reading that chunk of data. A required
filter that fails or isn't defined causes the entire output
operation to fail. During input, any filter that has not been
excluded from the pipeline during output and fails or is not
defined will cause the entire input operation to fail.
</p><p>Filters are defined in two phases. The first phase is to
define a function to act as the filter and link the function
into the application. The second phase is to register the
function, associating the function with an
<code>H5Z_filter_t</code> identification number and a comment.
</p><dl>
<dt><code>typedef size_t (*H5Z_func_t)(unsigned int
<em>flags</em>, size_t <em>cd_nelmts</em>, const unsigned int
<em>cd_values</em>[], size_t <em>nbytes</em>, size_t
*<em>buf_size</em>, void **<em>buf</em>)</code>
</dt><dd>The <em>flags</em>, <em>cd_nelmts</em>, and
<em>cd_values</em> are the same as for the
<code>H5Pset_filter()</code> function with the additional flag
<code>H5Z_FLAG_REVERSE</code> which is set when the filter is
called as part of the input pipeline. The input buffer is
pointed to by <em>*buf</em> and has a total size of
<em>*buf_size</em> bytes but only <em>nbytes</em> are valid
data. The filter should perform the transformation in place if
possible and return the number of valid bytes or zero for
failure. If the transformation cannot be done in place then
the filter should allocate a new buffer with
<code>malloc()</code> and assign it to <em>*buf</em>,
assigning the allocated size of that buffer to
<em>*buf_size</em>. The old buffer should be freed
by calling <code>free()</code>.
<br><br>
</dd><dt><code>herr_t H5Zregister (H5Z_filter_t <em>filter_id</em>,
const char *<em>comment</em>, H5Z_func_t
<em>filter</em>)</code>
</dt><dd>The <em>filter</em> function is associated with a filter
number and a short ASCII comment which will be stored in the
hdf5 file if the filter is used as part of a permanent
pipeline during dataset creation.
</dd></dl>
<h2>Predefined Filters</h2>
<p>If <code>zlib</code> version 1.1.2 or later was found
during configuration then the library will define a filter whose
<code>H5Z_filter_t</code> number is
<code>H5Z_FILTER_DEFLATE</code>. Since this compression method
has the potential for generating compressed data which is larger
than the original, the <code>H5Z_FLAG_OPTIONAL</code> flag
should be turned on so such cases can be handled gracefully by
storing the original data instead of the compressed data. The
<em>cd_nvalues</em> should be one with <em>cd_value[0]</em>
being a compression aggression level between zero and nine,
inclusive (zero is the fastest compression while nine results in
the best compression ratio).
</p><p>A convenience function for adding the
<code>H5Z_FILTER_DEFLATE</code> filter to a pipeline is:
</p><dl>
<dt><code>herr_t H5Pset_deflate (hid_t <em>plist</em>, unsigned
<em>aggression</em>)</code>
</dt><dd>The deflate compression method is added to the end of the
permanent or transient filter pipeline depending on whether
<em>plist</em> is a dataset creation or dataset transfer
property list. The <em>aggression</em> is a number between
zero and nine (inclusive) to indicate the tradeoff between
speed and compression ratio (zero is fastest, nine is best
ratio).
</dd></dl>
<p>Even if the <code>zlib</code> isn't detected during
configuration the application can define
<code>H5Z_FILTER_DEFLATE</code> as a permanent filter. If the
filter is marked as optional (as with
<code>H5Pset_deflate()</code>) then it will always fail and be
automatically removed from the pipeline. Applications that read
data will fail only if the data is actually compressed; they
won't fail if <code>H5Z_FILTER_DEFLATE</code> was part of the
permanent output pipeline but was automatically excluded because
it didn't exist when the data was written.
</p><p><code>zlib</code> can be acquired from
<code><a href="http://www.cdrom.com/pub/infozip/zlib/">
http://www.cdrom.com/pub/infozip/zlib/</a></code>.
</p><h2>Example</h2>
<p>This example shows how to define and register a simple filter
that adds a checksum capability to the data stream.
</p><p>The function that acts as the filter always returns zero
(failure) if the <code>md5()</code> function was not detected at
configuration time (left as an exercise for the reader).
Otherwise the function is broken down to an input and output
half. The output half calculates a checksum, increases the size
of the output buffer if necessary, and appends the checksum to
the end of the buffer. The input half calculates the checksum
on the first part of the buffer and compares it to the checksum
already stored at the end of the buffer. If the two differ then
zero (failure) is returned, otherwise the buffer size is reduced
to exclude the checksum.
</p>
<table>
<tbody><tr>
<td>
<p><code></code></p><pre><code>
size_t
md5_filter(unsigned int flags, size_t cd_nelmts,
const unsigned int cd_values[], size_t nbytes,
size_t *buf_size, void **buf)
{
#ifdef HAVE_MD5
unsigned char cksum[16];
if (flags &amp; H5Z_REVERSE) {
/* Input */
assert(nbytes&gt;=16);
md5(nbytes-16, *buf, cksum);
/* Compare */
if (memcmp(cksum, (char*)(*buf)+nbytes-16, 16)) {
return 0; /*fail*/
}
/* Strip off checksum */
return nbytes-16;
} else {
/* Output */
md5(nbytes, *buf, cksum);
/* Increase buffer size if necessary */
if (nbytes+16&gt;*buf_size) {
*buf_size = nbytes + 16;
*buf = realloc(*buf, *buf_size);
}
/* Append checksum */
memcpy((char*)(*buf)+nbytes, cksum, 16);
return nbytes+16;
}
#else
return 0; /*fail*/
#endif
}
</code></pre>
</td>
</tr>
</tbody></table>
<p>Once the filter function is defined it must be registered so
the HDF5 library knows about it. Since we're testing this
filter we choose one of the <code>H5Z_filter_t</code> numbers
from the reserved range. We'll randomly choose 305.
</p><p>
</p>
<table>
<tbody><tr>
<td>
<p><code></code></p><pre><code>
#define FILTER_MD5 305
herr_t status = H5Zregister(FILTER_MD5, "md5 checksum", md5_filter);
</code></pre>
</td>
</tr>
</tbody></table>
<p>Now we can use the filter in a pipeline. We could have added
the filter to the pipeline before defining or registering the
filter as long as the filter was defined and registered by time
we tried to use it (if the filter is marked as optional then we
could have used it without defining it and the library would
have automatically removed it from the pipeline for each chunk
written before the filter was defined and registered).
</p><p>
</p>
<table>
<tbody><tr>
<td>
<p><code></code></p><pre><code>
hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
hsize_t chunk_size[3] = {10,10,10};
H5Pset_chunk(dcpl, 3, chunk_size);
H5Pset_filter(dcpl, FILTER_MD5, 0, 0, NULL);
hid_t dset = H5Dcreate(file, "dset", H5T_NATIVE_DOUBLE, space, dcpl);
</code></pre>
</td>
</tr>
</tbody></table>
<h2>6. Filter Diagnostics</h2>
<p>If the library is compiled with debugging turned on for the H5Z
layer (usually as a result of <code>configure
--enable-debug=z</code>) then filter statistics are printed when
the application exits normally or the library is closed. The
statistics are written to the standard error stream and include
two lines for each filter that was used: one for input and one
for output. The following fields are displayed:
</p><p>
</p>
<table>
<tbody><tr>
<th>Field Name</th>
<th>Description</th>
</tr>
<tr valign="top">
<td>Method</td>
<td>This is the name of the method as defined with
<code>H5Zregister()</code> with the characters
"&lt; or "&gt;" prepended to indicate
input or output.</td>
</tr>
<tr valign="top">
<td>Total</td>
<td>The total number of bytes processed by the filter
including errors. This is the maximum of the
<em>nbytes</em> argument or the return value.
</td></tr>
<tr valign="top">
<td>Errors</td>
<td>This field shows the number of bytes of the Total
column which can be attributed to errors.</td>
</tr>
<tr valign="top">
<td>User, System, Elapsed</td>
<td>These are the amount of user time, system time, and
elapsed time in seconds spent in the filter function.
Elapsed time is sensitive to system load. These times
may be zero on operating systems that don't support the
required operations.</td>
</tr>
<tr valign="top">
<td>Bandwidth</td>
<td>This is the filter bandwidth which is the total
number of bytes processed divided by elapsed time.
Since elapsed time is subject to system load the
bandwidth numbers cannot always be trusted.
Furthermore, the bandwidth includes bytes attributed to
errors which may significantly taint the value if the
function is able to detect errors without much
expense.</td>
</tr>
</tbody></table>
<p>
</p>
<table>
<caption align="bottom">
<b>Example: Filter Statistics</b>
</caption>
<tbody><tr>
<td>
<p><code></code></p><pre><code>H5Z: filter statistics accumulated ov=
er life of library:
Method Total Errors User System Elapsed Bandwidth
------ ----- ------ ---- ------ ------- ---------
&gt;deflate 160000 40000 0.62 0.74 1.33 117.5 kBs
&lt;deflate 120000 0 0.11 0.00 0.12 1.000 MBs
</code></pre>
</td>
</tr>
</tbody></table>
<hr>
<p><a name="fn1">Footnote 1:</a> Dataset chunks can be compressed
through the use of filters. Developers should be aware that
reading and rewriting compressed chunked data can result in holes
in an HDF5 file. In time, enough such holes can increase the
file size enough to impair application or library performance
when working with that file. See
<a href="https://support.hdfgroup.org/HDF5/doc1.6/Performance.html#Freespace">
Freespace Management</a>
in the chapter
<a href="https://support.hdfgroup.org/HDF5/doc1.6/Performance.html">
Performance Analysis and Issues</a>.</p>
</html>