<html>
<head>
<title>Filters</title>
</head>
<body>
<h1>Filters in HDF5</h1>
<b>Note: Transient pipelines described in this document have not
been implemented.</b>
<h2>Introduction</h2>
<p>HDF5 allows chunked data to pass through user-defined filters
on the way to or from disk. The filters operate on chunks of an
<code>H5D_CHUNKED</code> dataset and can be arranged in a pipeline
so that the output of one filter becomes the input of the next.
</p><p>Each filter has a two-byte identification number (type
<code>H5Z_filter_t</code>) allocated by The HDF Group and can also be
passed application-defined integer values to control its
behavior. Each filter also has an optional ASCII comment
string.
</p>
<table>
<tbody><tr>
<th>Values for <code>H5Z_filter_t</code></th>
<th>Description</th>
</tr>
<tr valign="top">
<td><code>0-255</code></td>
<td>These values are reserved for filters predefined and
registered by the HDF5 library and of use to the general
public. They are described in a separate section
below.</td>
</tr>
<tr valign="top">
<td><code>256-511</code></td>
<td>Filter numbers in this range are used for testing only
and can be used temporarily by any organization. No
attempt is made to resolve numbering conflicts since all
definitions are by nature temporary.</td>
</tr>
<tr valign="top">
<td><code>512-65535</code></td>
<td>Reserved for future assignment. Please contact the
<a href="mailto:help@hdfgroup.org">HDF5 development team</a>
to reserve a value or range of values for
use by your filters.</td>
</tr></tbody></table>
<h2>Defining and Querying the Filter Pipeline</h2>
<p>Two types of filters can be applied to raw data I/O: permanent
filters and transient filters. The permanent filter pipeline is
defined when the dataset is created while the transient pipeline
is defined for each I/O operation. During an
<code>H5Dwrite()</code> the transient filters are applied first
in the order defined and then the permanent filters are applied
in the order defined. For an <code>H5Dread()</code> the
opposite order is used: permanent filters in reverse order, then
transient filters in reverse order. An <code>H5Dread()</code>
must result in the same amount of data for a chunk as the
original <code>H5Dwrite()</code>.
</p><p>The permanent filter pipeline is defined by calling
<code>H5Pset_filter()</code> for a dataset creation property
list while the transient filter pipeline is defined by calling
that function for a dataset transfer property list.
</p><dl>
<dt><code>herr_t H5Pset_filter (hid_t <em>plist</em>,
H5Z_filter_t <em>filter</em>, unsigned int <em>flags</em>,
size_t <em>cd_nelmts</em>, const unsigned int
<em>cd_values</em>[])</code>
</dt><dd>This function adds the specified <em>filter</em> and
corresponding properties to the end of the transient or
permanent output filter pipeline (depending on whether
<em>plist</em> is a dataset creation or dataset transfer
property list). The <em>flags</em> argument specifies certain
general properties of the filter and is documented below. The
<em>cd_values</em> is an array of <em>cd_nelmts</em> integers
which are auxiliary data for the filter. The integer values
will be stored in the dataset object header as part of the
filter information.
</dd><dt><code>int H5Pget_nfilters (hid_t <em>plist</em>)</code>
</dt><dd>This function returns the number of filters defined in the
permanent or transient filter pipeline depending on whether
<em>plist</em> is a dataset creation or dataset transfer
property list. In each pipeline the filters are numbered from
0 through <em>N</em>-1 where <em>N</em> is the value returned
by this function. During output to the file the filters of a
pipeline are applied in increasing order (the inverse is true
for input). Zero is returned if there are no filters in the
pipeline and a negative value is returned for errors.
</dd><dt><code>H5Z_filter_t H5Pget_filter (hid_t <em>plist</em>,
int <em>filter_number</em>, unsigned int *<em>flags</em>,
size_t *<em>cd_nelmts</em>, unsigned int
*<em>cd_values</em>, size_t <em>namelen</em>, char <em>name</em>[])</code>
</dt><dd>This is the query counterpart of
<code>H5Pset_filter()</code> and returns information about a
particular filter number in a permanent or transient pipeline
depending on whether <em>plist</em> is a dataset creation or
dataset transfer property list. On input, <em>cd_nelmts</em>
indicates the number of entries in the <em>cd_values</em>
array allocated by the caller while on exit it contains the
number of values defined by the filter. The
<em>filter_number</em> should be a value between zero and
<em>N</em>-1 as described for <code>H5Pget_nfilters()</code>
and the function will return failure (a negative value) if the
filter number is out of range. If <em>name</em> is a pointer
to an array of at least <em>namelen</em> bytes then the filter
name will be copied into that array. The name will be null
terminated if <em>namelen</em> is large enough. The
filter name returned will be the name appearing in the file or
else the name registered for the filter or else an empty string.
</dd></dl>
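<p>As a sketch of how the query functions above fit together, the
following loop lists every filter in a pipeline. It assumes
<code>dcpl</code> is an open dataset creation property list and uses
the <code>H5Pget_filter()</code> form documented here; newer releases
of the library add further arguments to this call.
</p>
<table>
<tbody><tr>
<td>
<pre><code>
int nfilters = H5Pget_nfilters(dcpl);
for (int i = 0; i &lt; nfilters; i++) {
    unsigned int flags;
    size_t       cd_nelmts = 8;          /* capacity of cd_values[] on input */
    unsigned int cd_values[8];
    char         name[64];

    H5Z_filter_t id = H5Pget_filter(dcpl, i, &amp;flags, &amp;cd_nelmts,
                                    cd_values, sizeof(name), name);
    printf("filter %d: id=%d (%s), %lu client values, name=\"%s\"\n",
           i, (int)id,
           (flags &amp; H5Z_FLAG_OPTIONAL) ? "optional" : "required",
           (unsigned long)cd_nelmts, name);
}
</code></pre>
</td>
</tr>
</tbody></table>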
<p>The flags argument to the functions above is a bit vector of
the following fields:
</p>
<table>
<tbody><tr>
<th>Values for <em>flags</em></th>
<th>Description</th>
</tr>
<tr valign="top">
<td><code>H5Z_FLAG_OPTIONAL</code></td>
<td>If this bit is set then the filter is optional. If
the filter fails (see below) during an
<code>H5Dwrite()</code> operation then the filter is
just excluded from the pipeline for the chunk for which
it failed; the filter will not participate in the
pipeline during an <code>H5Dread()</code> of the chunk.
This is commonly used for compression filters: if the
compression result would be larger than the input then
the compression filter returns failure and the
uncompressed data is stored in the file. If this bit is
clear and a filter fails then the
<code>H5Dwrite()</code> or <code>H5Dread()</code> also
fails.</td>
</tr>
</tbody></table>
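<p>For instance, the predefined deflate filter described below is
normally added with this bit set, so that a chunk which does not
compress well is simply stored uncompressed. A minimal sketch,
assuming <code>dcpl</code> is a dataset creation property list:
</p>
<table>
<tbody><tr>
<td>
<pre><code>
unsigned int level = 6;   /* zlib compression level passed as client data */
H5Pset_filter(dcpl, H5Z_FILTER_DEFLATE, H5Z_FLAG_OPTIONAL, 1, &amp;level);
</code></pre>
</td>
</tr>
</tbody></table>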
<h2>Defining Filters</h2>
<p>Each filter is bidirectional, handling both input and output to
the file, and a flag is passed to the filter to indicate the
direction. In either case the filter reads a chunk of data from
a buffer, usually performs some sort of transformation on the
data, places the result in the same or new buffer, and returns
the buffer pointer and size to the caller. If something goes
wrong the filter should return zero to indicate a failure.
</p><p>During output, a filter that fails or isn't defined and is
marked as optional is silently excluded from the pipeline and
will not be used when reading that chunk of data. A required
filter that fails or isn't defined causes the entire output
operation to fail. During input, any filter that has not been
excluded from the pipeline during output and fails or is not
defined will cause the entire input operation to fail.
</p><p>Filters are defined in two phases. The first phase is to
define a function to act as the filter and link the function
into the application. The second phase is to register the
function, associating the function with an
<code>H5Z_filter_t</code> identification number and a comment.
</p><dl>
<dt><code>typedef size_t (*H5Z_func_t)(unsigned int
<em>flags</em>, size_t <em>cd_nelmts</em>, const unsigned int
<em>cd_values</em>[], size_t <em>nbytes</em>, size_t
*<em>buf_size</em>, void **<em>buf</em>)</code>
</dt><dd>The <em>flags</em>, <em>cd_nelmts</em>, and
<em>cd_values</em> are the same as for the
<code>H5Pset_filter()</code> function with the additional flag
<code>H5Z_FLAG_REVERSE</code> which is set when the filter is
called as part of the input pipeline. The input buffer is
pointed to by <em>*buf</em> and has a total size of
<em>*buf_size</em> bytes but only <em>nbytes</em> are valid
data. The filter should perform the transformation in place if
possible and return the number of valid bytes or zero for
failure. If the transformation cannot be done in place then
the filter should allocate a new buffer with
<code>malloc()</code> and assign it to <em>*buf</em>,
assigning the allocated size of that buffer to
<em>*buf_size</em>. The old buffer should be freed
by calling <code>free()</code>.
<br><br>
</dd><dt><code>herr_t H5Zregister (H5Z_filter_t <em>filter_id</em>,
const char *<em>comment</em>, H5Z_func_t
<em>filter</em>)</code>
</dt><dd>The <em>filter</em> function is associated with a filter
number and a short ASCII comment which will be stored in the
hdf5 file if the filter is used as part of a permanent
pipeline during dataset creation.
</dd></dl>
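<p>As a minimal illustration of this calling convention (not part of
the library), a toy filter could flip every byte of the chunk in
place. Because the transformation is its own inverse, the same code
serves both directions and the filter never needs to reallocate the
buffer:
</p>
<table>
<tbody><tr>
<td>
<pre><code>
static size_t
invert_filter(unsigned int flags, size_t cd_nelmts,
              const unsigned int cd_values[], size_t nbytes,
              size_t *buf_size, void **buf)
{
    unsigned char *p = (unsigned char *)*buf;
    size_t i;

    (void)flags; (void)cd_nelmts; (void)cd_values; (void)buf_size;

    /* Flip each valid byte in place; XOR is self-inverse so the
     * H5Z_FLAG_REVERSE (input) case needs no special handling. */
    for (i = 0; i &lt; nbytes; i++)
        p[i] ^= 0xff;

    return nbytes;   /* number of valid bytes; zero would signal failure */
}
</code></pre>
</td>
</tr>
</tbody></table>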
<h2>Predefined Filters</h2>
<p>If <code>zlib</code> version 1.1.2 or later was found
during configuration then the library will define a filter whose
<code>H5Z_filter_t</code> number is
<code>H5Z_FILTER_DEFLATE</code>. Since this compression method
has the potential for generating compressed data which is larger
than the original, the <code>H5Z_FLAG_OPTIONAL</code> flag
should be turned on so such cases can be handled gracefully by
storing the original data instead of the compressed data. The
<em>cd_nelmts</em> should be one with <em>cd_values[0]</em>
being a compression aggression level between zero and nine,
inclusive (zero is the fastest compression while nine results in
the best compression ratio).
</p><p>A convenience function for adding the
<code>H5Z_FILTER_DEFLATE</code> filter to a pipeline is:
</p><dl>
<dt><code>herr_t H5Pset_deflate (hid_t <em>plist</em>, unsigned
<em>aggression</em>)</code>
</dt><dd>The deflate compression method is added to the end of the
permanent or transient filter pipeline depending on whether
<em>plist</em> is a dataset creation or dataset transfer
property list. The <em>aggression</em> is a number between
zero and nine (inclusive) to indicate the tradeoff between
speed and compression ratio (zero is fastest, nine is best
ratio).
</dd></dl>
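<p>A minimal sketch of its use, assuming the dataset will be created
with a chunked layout (filters only apply to chunked datasets):
</p>
<table>
<tbody><tr>
<td>
<pre><code>
hid_t   dcpl = H5Pcreate(H5P_DATASET_CREATE);
hsize_t chunk_size[2] = {100, 100};

H5Pset_chunk(dcpl, 2, chunk_size);   /* deflate requires a chunked layout */
H5Pset_deflate(dcpl, 6);             /* moderate speed/ratio tradeoff */
</code></pre>
</td>
</tr>
</tbody></table>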
<p>Even if the <code>zlib</code> isn't detected during
configuration the application can define
<code>H5Z_FILTER_DEFLATE</code> as a permanent filter. If the
filter is marked as optional (as with
<code>H5Pset_deflate()</code>) then it will always fail and be
automatically removed from the pipeline. Applications that read
data will fail only if the data is actually compressed; they
won't fail if <code>H5Z_FILTER_DEFLATE</code> was part of the
permanent output pipeline but was automatically excluded because
it didn't exist when the data was written.
</p><p><code>zlib</code> can be acquired from
<code><a href="http://www.cdrom.com/pub/infozip/zlib/">
http://www.cdrom.com/pub/infozip/zlib/</a></code>.
</p><h2>Example</h2>
<p>This example shows how to define and register a simple filter
that adds a checksum capability to the data stream.
</p><p>The function that acts as the filter always returns zero
(failure) if the <code>md5()</code> function was not detected at
configuration time (left as an exercise for the reader).
Otherwise the function is broken down to an input and output
half. The output half calculates a checksum, increases the size
of the output buffer if necessary, and appends the checksum to
the end of the buffer. The input half calculates the checksum
on the first part of the buffer and compares it to the checksum
already stored at the end of the buffer. If the two differ then
zero (failure) is returned, otherwise the buffer size is reduced
to exclude the checksum.
</p>
<table>
<tbody><tr>
<td>
<pre><code>
size_t
md5_filter(unsigned int flags, size_t cd_nelmts,
           const unsigned int cd_values[], size_t nbytes,
           size_t *buf_size, void **buf)
{
#ifdef HAVE_MD5
    unsigned char cksum[16];

    if (flags &amp; H5Z_FLAG_REVERSE) {
        /* Input: the last 16 bytes of the chunk hold the stored checksum */
        assert(nbytes &gt;= 16);
        md5(nbytes - 16, *buf, cksum);

        /* Compare the recomputed checksum against the stored one */
        if (memcmp(cksum, (char *)(*buf) + nbytes - 16, 16)) {
            return 0; /* fail */
        }

        /* Strip off the checksum */
        return nbytes - 16;

    } else {
        /* Output: checksum the chunk data */
        md5(nbytes, *buf, cksum);

        /* Increase the buffer size if necessary */
        if (nbytes + 16 &gt; *buf_size) {
            void *new_buf = realloc(*buf, nbytes + 16);
            if (!new_buf) {
                return 0; /* fail */
            }
            *buf = new_buf;
            *buf_size = nbytes + 16;
        }

        /* Append the checksum */
        memcpy((char *)(*buf) + nbytes, cksum, 16);
        return nbytes + 16;
    }
#else
    return 0; /* fail */
#endif
}
</code></pre>
</td>
</tr>
</tbody></table>
<p>Once the filter function is defined it must be registered so
the HDF5 library knows about it. Since we're testing this
filter we choose one of the <code>H5Z_filter_t</code> numbers
from the range set aside for testing. We'll arbitrarily choose 305.
</p><p>
</p>
<table>
<tbody><tr>
<td>
<pre><code>
#define FILTER_MD5 305
herr_t status = H5Zregister(FILTER_MD5, "md5 checksum", md5_filter);
</code></pre>
</td>
</tr>
</tbody></table>
<p>Now we can use the filter in a pipeline. We could have added
the filter to the pipeline before defining or registering the
filter as long as the filter was defined and registered by the time
we tried to use it (if the filter is marked as optional then we
could have used it without defining it and the library would
have automatically removed it from the pipeline for each chunk
written before the filter was defined and registered).
</p><p>
</p>
<table>
<tbody><tr>
<td>
<pre><code>
hid_t   dcpl = H5Pcreate(H5P_DATASET_CREATE);
hsize_t chunk_size[3] = {10, 10, 10};

H5Pset_chunk(dcpl, 3, chunk_size);
H5Pset_filter(dcpl, FILTER_MD5, 0, 0, NULL);   /* required filter, no client data */
hid_t dset = H5Dcreate(file, "dset", H5T_NATIVE_DOUBLE, space, dcpl);
</code></pre>
</td>
</tr>
</tbody></table>
<h2>Filter Diagnostics</h2>
<p>If the library is compiled with debugging turned on for the H5Z
layer (usually as a result of <code>configure
--enable-debug=z</code>) then filter statistics are printed when
the application exits normally or the library is closed. The
statistics are written to the standard error stream and include
two lines for each filter that was used: one for input and one
for output. The following fields are displayed:
</p><p>
</p>
<table>
<tbody><tr>
<th>Field Name</th>
<th>Description</th>
</tr>
<tr valign="top">
<td>Method</td>
<td>This is the name of the method as defined with
<code>H5Zregister()</code> with the characters
"&lt; or "&gt;" prepended to indicate
input or output.</td>
</tr>
<tr valign="top">
<td>Total</td>
<td>The total number of bytes processed by the filter
including errors. This is the maximum of the
<em>nbytes</em> argument or the return value.
</td></tr>
<tr valign="top">
<td>Errors</td>
<td>This field shows the number of bytes of the Total
column which can be attributed to errors.</td>
</tr>
<tr valign="top">
<td>User, System, Elapsed</td>
<td>These are the amount of user time, system time, and
elapsed time in seconds spent in the filter function.
Elapsed time is sensitive to system load. These times
may be zero on operating systems that don't support the
required operations.</td>
</tr>
<tr valign="top">
<td>Bandwidth</td>
<td>This is the filter bandwidth which is the total
number of bytes processed divided by elapsed time.
Since elapsed time is subject to system load the
bandwidth numbers cannot always be trusted.
Furthermore, the bandwidth includes bytes attributed to
errors which may significantly taint the value if the
function is able to detect errors without much
expense.</td>
</tr>
</tbody></table>
<p>
</p>
<table>
<caption align="bottom">
<b>Example: Filter Statistics</b>
</caption>
<tbody><tr>
<td>
<pre><code>H5Z: filter statistics accumulated over life of library:
Method      Total  Errors  User  System  Elapsed  Bandwidth
------      -----  ------  ----  ------  -------  ---------
&gt;deflate   160000   40000  0.62    0.74     1.33  117.5 kBs
&lt;deflate   120000       0  0.11    0.00     0.12  1.000 MBs
</code></pre>
</td>
</tr>
</tbody></table>
<hr>
<p><a name="fn1">Footnote 1:</a> Dataset chunks can be compressed
through the use of filters. Developers should be aware that
reading and rewriting compressed chunked data can result in holes
in an HDF5 file. Over time, such holes can increase the file size
enough to impair application or library performance when working
with that file. See
<a href="https://support.hdfgroup.org/HDF5/doc1.6/Performance.html#Freespace">
Freespace Management</a>
in the chapter
<a href="https://support.hdfgroup.org/HDF5/doc1.6/Performance.html">
Performance Analysis and Issues</a>.</p>
</body>
</html>