Filters in HDF5

Note: Transient pipelines described in this document have not been implemented.

Introduction

HDF5 allows chunked data to pass through user-defined filters on the way to or from disk. The filters operate on chunks of an H5D_CHUNKED dataset and can be arranged in a pipeline so that the output of one filter becomes the input of the next filter.

Each filter has a two-byte identification number (type H5Z_filter_t) allocated by The HDF Group and can also be passed application-defined integer resources to control its behavior. Each filter also has an optional ASCII comment string.

Values for H5Z_filter_t Description
0-255 These values are reserved for filters predefined and registered by the HDF5 library and of use to the general public. They are described in a separate section below.
256-511 Filter numbers in this range are used for testing only and can be used temporarily by any organization. No attempt is made to resolve numbering conflicts since all definitions are by nature temporary.
512-65535 Reserved for future assignment. Please contact the HDF5 development team to reserve a value or range of values for use by your filters.
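The three ranges in the table can be sketched as a small classifier. This is only an illustration: the enum and both function names are hypothetical and are not part of the HDF5 API; only the numeric boundaries come from the table above.

```c
#include <assert.h>

/* Hypothetical names; only the numeric ranges come from the table. */
typedef enum {
    FILTER_RANGE_PREDEFINED,  /* 0-255: predefined by the library      */
    FILTER_RANGE_TESTING,     /* 256-511: temporary, testing only      */
    FILTER_RANGE_RESERVED,    /* 512-65535: reserved for assignment    */
    FILTER_RANGE_INVALID      /* does not fit a two-byte identifier    */
} filter_range_t;

/* Map a candidate filter number to the range it falls in. */
static filter_range_t classify_filter_id(long id)
{
    if (id < 0 || id > 65535) return FILTER_RANGE_INVALID;
    if (id <= 255)            return FILTER_RANGE_PREDEFINED;
    if (id <= 511)            return FILTER_RANGE_TESTING;
    return FILTER_RANGE_RESERVED;
}

/* Guard an application might run before registering an experimental filter. */
static void require_testing_id(long id)
{
    assert(classify_filter_id(id) == FILTER_RANGE_TESTING);
}
```

An application experimenting with its own filter would expect a number such as 305 to pass require_testing_id().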

Defining and Querying the Filter Pipeline

Two types of filters can be applied to raw data I/O: permanent filters and transient filters. The permanent filter pipeline is defined when the dataset is created while the transient pipeline is defined for each I/O operation. During an H5Dwrite() the transient filters are applied first in the order defined and then the permanent filters are applied in the order defined. For an H5Dread() the opposite order is used: permanent filters in reverse order, then transient filters in reverse order. An H5Dread() must result in the same amount of data for a chunk as the original H5Dwrite().
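The ordering rule can be modelled without the library. The sketch below is a stand-alone illustration, not HDF5 code: pipeline_order() and its parameters are hypothetical, and filters are represented only by integer identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Fill `order` with filter ids in the sequence they would be applied.
 * `order` must have room for n_transient + n_permanent entries.
 *   reading == 0: write path, transient filters then permanent filters,
 *                 each in the order defined.
 *   reading != 0: read path, permanent filters in reverse order, then
 *                 transient filters in reverse order.
 * Returns the number of entries written to `order`. */
static size_t pipeline_order(const int *transient, size_t n_transient,
                             const int *permanent, size_t n_permanent,
                             int reading, int *order)
{
    size_t n = 0, i;

    assert(order != NULL);
    if (!reading) {
        for (i = 0; i < n_transient; i++) order[n++] = transient[i];
        for (i = 0; i < n_permanent; i++) order[n++] = permanent[i];
    } else {
        for (i = n_permanent; i > 0; i--) order[n++] = permanent[i - 1];
        for (i = n_transient; i > 0; i--) order[n++] = transient[i - 1];
    }
    return n;
}
```

With transient filters {1, 2} and permanent filters {3, 4}, the write path yields 1, 2, 3, 4 and the read path yields 4, 3, 2, 1, so each filter undoes its own transformation in mirror order.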

The permanent filter pipeline is defined by calling H5Pset_filter() for a dataset creation property list while the transient filter pipeline is defined by calling that function for a dataset transfer property list.

herr_t H5Pset_filter (hid_t plist, H5Z_filter_t filter, unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[])
This function adds the specified filter and corresponding properties to the end of the permanent or transient output filter pipeline (depending on whether plist is a dataset creation or dataset transfer property list). The flags argument specifies certain general properties of the filter and is documented below. The cd_values argument is an array of cd_nelmts integers which are auxiliary data for the filter. The integer values will be stored in the dataset object header as part of the filter information.
int H5Pget_nfilters (hid_t plist)
This function returns the number of filters defined in the permanent or transient filter pipeline depending on whether plist is a dataset creation or dataset transfer property list. In each pipeline the filters are numbered from 0 through N-1 where N is the value returned by this function. During output to the file the filters of a pipeline are applied in increasing order (the inverse is true for input). Zero is returned if there are no filters in the pipeline and a negative value is returned for errors.
H5Z_filter_t H5Pget_filter (hid_t plist, int filter_number, unsigned int *flags, size_t *cd_nelmts, unsigned int *cd_values, size_t namelen, char name[])
This is the query counterpart of H5Pset_filter() and returns information about a particular filter number in a permanent or transient pipeline depending on whether plist is a dataset creation or dataset transfer property list. On input, cd_nelmts indicates the number of entries in the cd_values array allocated by the caller while on exit it contains the number of values defined by the filter. The filter_number should be a value between zero and N-1 as described for H5Pget_nfilters() and the function will return failure (a negative value) if the filter number is out of range. If name is a pointer to an array of at least namelen bytes then the filter name will be copied into that array. The name will be null terminated if the namelen is large enough. The filter name returned will be the name appearing in the file or else the name registered for the filter or else an empty string.
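The in/out use of cd_nelmts and the name-termination rule are easy to get wrong, so here is a minimal model of the calling convention. It does not call HDF5: struct filter_entry and query_entry() are hypothetical stand-ins, and copying at most the caller's capacity is an assumption about how such a query would behave.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record standing in for one entry of a filter pipeline. */
struct filter_entry {
    int          id;
    unsigned int flags;
    size_t       cd_nelmts;
    unsigned int cd_values[8];
    const char  *name;
};

/* Query one entry. On input *cd_nelmts is the capacity of cd_values;
 * on exit it holds the number of values the filter defines. At most
 * the caller's capacity is copied (an assumption of this sketch). */
static int query_entry(const struct filter_entry *e, unsigned int *flags,
                       size_t *cd_nelmts, unsigned int *cd_values,
                       size_t namelen, char name[])
{
    size_t i, ncopy;

    assert(e != NULL);
    if (flags) *flags = e->flags;
    if (cd_nelmts) {
        ncopy = *cd_nelmts < e->cd_nelmts ? *cd_nelmts : e->cd_nelmts;
        if (cd_values)
            for (i = 0; i < ncopy; i++) cd_values[i] = e->cd_values[i];
        *cd_nelmts = e->cd_nelmts;
    }
    if (name && namelen > 0) {
        /* strncpy() null-terminates only when the name is shorter than
         * namelen, matching "null terminated if namelen is large enough". */
        strncpy(name, e->name ? e->name : "", namelen);
    }
    return e->id;
}
```

A caller that receives a value of *cd_nelmts larger than the capacity it passed in knows its array was too small and can retry with a bigger one.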

The flags argument to the functions above is a bit vector of the following fields:

Values for flags Description
H5Z_FLAG_OPTIONAL If this bit is set then the filter is optional. If the filter fails (see below) during an H5Dwrite() operation then the filter is just excluded from the pipeline for the chunk for which it failed; the filter will not participate in the pipeline during an H5Dread() of the chunk. This is commonly used for compression filters: if the compression result would be larger than the input then the compression filter returns failure and the uncompressed data is stored in the file. If this bit is clear and a filter fails then the H5Dwrite() or H5Dread() also fails.

Defining Filters

Each filter is bidirectional, handling both input and output to the file, and a flag is passed to the filter to indicate the direction. In either case the filter reads a chunk of data from a buffer, usually performs some sort of transformation on the data, places the result in the same or new buffer, and returns the buffer pointer and size to the caller. If something goes wrong the filter should return zero to indicate a failure.

During output, a filter that fails or isn't defined and is marked as optional is silently excluded from the pipeline and will not be used when reading that chunk of data. A required filter that fails or isn't defined causes the entire output operation to fail. During input, any filter that has not been excluded from the pipeline during output and fails or is not defined will cause the entire input operation to fail.
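The output-side rules for optional and required filters can be sketched as a loop over pipeline stages. Everything here is a hypothetical model rather than library code: DEMO_FLAG_OPTIONAL stands in for H5Z_FLAG_OPTIONAL, the toy filters only change a byte count, and recording exclusions in a per-chunk bit mask is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FLAG_OPTIONAL 0x0001u  /* hypothetical stand-in for H5Z_FLAG_OPTIONAL */

/* A toy filter: returns the new byte count, or 0 on failure. */
typedef size_t (*demo_filter_t)(size_t nbytes);

struct demo_stage {
    unsigned int  flags;
    demo_filter_t func;   /* NULL models a filter that is not defined */
};

/* Run the output pipeline for one chunk.
 * Sets bit i of *excluded when optional stage i failed or was undefined.
 * Returns the final byte count, or 0 if a required stage failed. */
static size_t run_output(const struct demo_stage *stages, size_t nstages,
                         size_t nbytes, unsigned int *excluded)
{
    size_t i;

    assert(stages != NULL && excluded != NULL && nstages <= 32);
    *excluded = 0;
    for (i = 0; i < nstages; i++) {
        size_t result = stages[i].func ? stages[i].func(nbytes) : 0;
        if (result == 0) {
            if (stages[i].flags & DEMO_FLAG_OPTIONAL) {
                *excluded |= 1u << i;  /* skip this stage when reading */
                continue;
            }
            return 0;                  /* required filter: the write fails */
        }
        nbytes = result;
    }
    return nbytes;
}

/* Two toy filters used to exercise the loop. */
static size_t demo_always_fails(size_t nbytes) { (void)nbytes; return 0; }
static size_t demo_adds_16(size_t nbytes)      { return nbytes + 16; }
```

On input, a reader following the same model would walk the stages in reverse and skip every stage whose bit is set in the mask recorded for that chunk.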

Filters are defined in two phases. The first phase is to define a function to act as the filter and link the function into the application. The second phase is to register the function, associating the function with an H5Z_filter_t identification number and a comment.

typedef size_t (*H5Z_func_t)(unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[], size_t nbytes, size_t *buf_size, void **buf)
The flags, cd_nelmts, and cd_values are the same as for the H5Pset_filter() function with the additional flag H5Z_FLAG_REVERSE which is set when the filter is called as part of the input pipeline. The input buffer is pointed to by *buf and has a total size of *buf_size bytes but only nbytes are valid data. The filter should perform the transformation in place if possible and return the number of valid bytes or zero for failure. If the transformation cannot be done in place then the filter should allocate a new buffer with malloc() and assign it to *buf, assigning the allocated size of that buffer to *buf_size. The old buffer should be freed by calling free().

herr_t H5Zregister (H5Z_filter_t filter_id, const char *comment, H5Z_func_t filter)
The filter function is associated with a filter number and a short ASCII comment which will be stored in the HDF5 file if the filter is used as part of a permanent pipeline during dataset creation.
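The buffer contract of H5Z_func_t (transform in place when possible, otherwise allocate a new buffer with malloc() and free the old one) can be sketched with a toy filter. This is not a useful filter and does not depend on HDF5: DEMO_FLAG_REVERSE is a hypothetical stand-in for H5Z_FLAG_REVERSE with an arbitrary value, and double_filter() is an invented name.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_FLAG_REVERSE 0x0100u  /* hypothetical stand-in for H5Z_FLAG_REVERSE */

/* Toy filter using the signature documented above.
 * Output: writes every byte twice, which needs a larger buffer.
 * Input:  keeps every second byte, which can be done in place.
 * Returns the number of valid bytes, or 0 on failure. */
static size_t
double_filter(unsigned int flags, size_t cd_nelmts,
              const unsigned int cd_values[], size_t nbytes,
              size_t *buf_size, void **buf)
{
    unsigned char *src;
    size_t i;

    (void)cd_nelmts;
    (void)cd_values;
    assert(buf != NULL && *buf != NULL && buf_size != NULL);
    src = (unsigned char *)*buf;

    if (flags & DEMO_FLAG_REVERSE) {
        /* In place: the read index 2*i never trails the write index i */
        if (nbytes == 0 || nbytes % 2 != 0) return 0;
        for (i = 0; i < nbytes / 2; i++) src[i] = src[2 * i];
        return nbytes / 2;
    } else {
        unsigned char *dst;

        if (nbytes == 0) return 0;
        dst = (unsigned char *)malloc(2 * nbytes);
        if (dst == NULL) return 0;     /* old buffer is left untouched */
        for (i = 0; i < nbytes; i++) dst[2 * i] = dst[2 * i + 1] = src[i];

        free(*buf);                    /* release the old buffer */
        *buf = dst;                    /* hand the new one to the caller */
        *buf_size = 2 * nbytes;
        return 2 * nbytes;
    }
}
```

Note that the old buffer is freed only after the new one has been allocated successfully, so a failed call leaves the caller's data intact.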

Predefined Filters

If zlib version 1.1.2 or later was found during configuration then the library will define a filter whose H5Z_filter_t number is H5Z_FILTER_DEFLATE. Since this compression method has the potential for generating compressed data which is larger than the original, the H5Z_FLAG_OPTIONAL flag should be turned on so such cases can be handled gracefully by storing the original data instead of the compressed data. The cd_nelmts should be one with cd_values[0] being a compression aggression level between zero and nine, inclusive (zero is the fastest compression while nine results in the best compression ratio).

A convenience function for adding the H5Z_FILTER_DEFLATE filter to a pipeline is:

herr_t H5Pset_deflate (hid_t plist, unsigned aggression)
The deflate compression method is added to the end of the permanent or transient filter pipeline depending on whether plist is a dataset creation or dataset transfer property list. The aggression is a number between zero and nine (inclusive) to indicate the tradeoff between speed and compression ratio (zero is fastest, nine is best ratio).

Even if zlib isn't detected during configuration the application can define H5Z_FILTER_DEFLATE as a permanent filter. If the filter is marked as optional (as with H5Pset_deflate()) then it will always fail and be automatically removed from the pipeline. Applications that read data will fail only if the data is actually compressed; they won't fail if H5Z_FILTER_DEFLATE was part of the permanent output pipeline but was automatically excluded because it didn't exist when the data was written.
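Why a compression filter is normally marked optional can be seen with a toy encoder that reports failure whenever its output would be larger than its input. The sketch below uses simple run-length encoding instead of deflate and has nothing to do with zlib; rle_encode_or_fail() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Run-length encode src into dst as (count, byte) pairs.
 * Returns the encoded size, or 0 (the "filter failed" signal) when the
 * result would be larger than the input or would not fit in dst. */
static size_t rle_encode_or_fail(const unsigned char *src, size_t nbytes,
                                 unsigned char *dst, size_t dst_cap)
{
    size_t in = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (in < nbytes) {
        unsigned char value = src[in];
        size_t run = 1;

        /* Extend the run while bytes repeat, up to one count byte */
        while (in + run < nbytes && src[in + run] == value && run < 255)
            run++;

        /* Give up as soon as the output would outgrow the input */
        if (out + 2 > dst_cap || out + 2 > nbytes)
            return 0;

        dst[out++] = (unsigned char)run;
        dst[out++] = value;
        in += run;
    }
    return out;
}
```

Eight identical bytes encode to two bytes, while the four distinct bytes "abcd" would need eight, so the encoder returns zero; with the optional flag set, a pipeline would then store that chunk unencoded instead of failing the write.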

zlib can be acquired from http://www.cdrom.com/pub/infozip/zlib/.

Example

This example shows how to define and register a simple filter that adds a checksum capability to the data stream.

The function that acts as the filter always returns zero (failure) if the md5() function was not detected at configuration time (left as an exercise for the reader). Otherwise the function is broken down to an input and output half. The output half calculates a checksum, increases the size of the output buffer if necessary, and appends the checksum to the end of the buffer. The input half calculates the checksum on the first part of the buffer and compares it to the checksum already stored at the end of the buffer. If the two differ then zero (failure) is returned, otherwise the buffer size is reduced to exclude the checksum.


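    #include <stdlib.h>   /* realloc */
    #include <string.h>   /* memcmp, memcpy */
    #include <hdf5.h>     /* H5Z_FLAG_REVERSE */

    size_t
    md5_filter(unsigned int flags, size_t cd_nelmts,
               const unsigned int cd_values[], size_t nbytes,
               size_t *buf_size, void **buf)
    {
    #ifdef HAVE_MD5
        unsigned char cksum[16];

        if (flags & H5Z_FLAG_REVERSE) {
            /* Input: a valid chunk holds at least the 16-byte checksum */
            if (nbytes < 16) {
                return 0; /*fail*/
            }
            md5(nbytes-16, *buf, cksum);

            /* Compare with the checksum stored at the end of the buffer */
            if (memcmp(cksum, (char*)(*buf)+nbytes-16, 16)) {
                return 0; /*fail*/
            }

            /* Strip off checksum */
            return nbytes-16;

        } else {
            /* Output */
            md5(nbytes, *buf, cksum);

            /* Increase buffer size if necessary; the old buffer stays
             * valid if realloc() fails */
            if (nbytes+16 > *buf_size) {
                void *new_buf = realloc(*buf, nbytes+16);
                if (!new_buf) {
                    return 0; /*fail*/
                }
                *buf = new_buf;
                *buf_size = nbytes+16;
            }

            /* Append checksum */
            memcpy((char*)(*buf)+nbytes, cksum, 16);
            return nbytes+16;
        }
    #else
        return 0; /*fail*/
    #endif
    }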

Once the filter function is defined it must be registered so the HDF5 library knows about it. Since we're testing this filter we choose one of the H5Z_filter_t numbers from the testing range (256-511). We'll arbitrarily choose 305.


    #define FILTER_MD5 305
    herr_t status = H5Zregister(FILTER_MD5, "md5 checksum", md5_filter);

Now we can use the filter in a pipeline. We could have added the filter to the pipeline before defining or registering it, as long as it was defined and registered by the time we tried to use it. (If the filter is marked as optional then we could even have used it without defining it: the library would have automatically removed it from the pipeline for each chunk written before the filter was defined and registered.)


    /* file and space are assumed to be an open file and a
     * three-dimensional data space created elsewhere */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk_size[3] = {10,10,10};
    H5Pset_chunk(dcpl, 3, chunk_size);
    H5Pset_filter(dcpl, FILTER_MD5, 0, 0, NULL);
    hid_t dset = H5Dcreate(file, "dset", H5T_NATIVE_DOUBLE, space, dcpl);

Filter Diagnostics

If the library is compiled with debugging turned on for the H5Z layer (usually as a result of configure --enable-debug=z) then filter statistics are printed when the application exits normally or the library is closed. The statistics are written to the standard error stream and include two lines for each filter that was used: one for input and one for output. The following fields are displayed:

Field Name Description
Method This is the name of the method as defined with H5Zregister() with the character "<" or ">" prepended to indicate input or output.
Total The total number of bytes processed by the filter including errors. This is the maximum of the nbytes argument or the return value.
Errors This field shows the number of bytes of the Total column which can be attributed to errors.
User, System, Elapsed These are the amount of user time, system time, and elapsed time in seconds spent in the filter function. Elapsed time is sensitive to system load. These times may be zero on operating systems that don't support the required operations.
Bandwidth This is the filter bandwidth which is the total number of bytes processed divided by elapsed time. Since elapsed time is subject to system load the bandwidth numbers cannot always be trusted. Furthermore, the bandwidth includes bytes attributed to errors which may significantly taint the value if the function is able to detect errors without much expense.
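The relationship between the Total, Errors, Elapsed, and Bandwidth fields can be sketched as a small accumulator. The structure and function names are hypothetical, and charging the whole call to Errors when a filter returns zero is an assumption of this sketch rather than a statement about the library's bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-filter, per-direction counters. */
struct filter_stats {
    double total_bytes;   /* Total column   */
    double error_bytes;   /* Errors column  */
    double elapsed_sec;   /* Elapsed column */
};

/* Record one filter call. `ret` is the filter's return value, where
 * zero means failure. The bytes processed are taken as the larger of
 * the nbytes argument and the return value. */
static void stats_record(struct filter_stats *s, size_t nbytes, size_t ret,
                         double elapsed_sec)
{
    size_t processed = nbytes > ret ? nbytes : ret;

    assert(s != NULL);
    s->total_bytes += (double)processed;
    if (ret == 0)
        s->error_bytes += (double)processed;  /* assumption: whole call counts */
    s->elapsed_sec += elapsed_sec;
}

/* Bandwidth in bytes per second; zero when no time was measured. */
static double stats_bandwidth(const struct filter_stats *s)
{
    assert(s != NULL);
    return s->elapsed_sec > 0.0 ? s->total_bytes / s->elapsed_sec : 0.0;
}
```

Because failed calls still add to the total, a filter that detects errors cheaply reports a higher bandwidth than it achieves on successful calls, which is the distortion the Bandwidth description warns about.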

Example: Filter Statistics

    H5Z: filter statistics accumulated over life of library:
       Method     Total  Errors  User  System  Elapsed Bandwidth
       ------     -----  ------  ----  ------  ------- ---------
       >deflate  160000   40000  0.62    0.74     1.33 117.5 kBs
       <deflate  120000       0  0.11    0.00     0.12 1.000 MBs

Footnote 1: Dataset chunks can be compressed through the use of filters. Developers should be aware that reading and rewriting compressed chunked data can result in holes in an HDF5 file. In time, enough such holes can increase the file size enough to impair application or library performance when working with that file. See Freespace Management in the chapter Performance Analysis and Issues.