# HDF5 Parallel Compression

## Introduction
When an HDF5 dataset is created, the application can specify
optional data filters to be applied to the dataset (as long as
the dataset uses a chunked data layout). These filters may
perform compression, shuffling, checksumming/error detection
and more on the dataset data. The filters are added to a filter
pipeline for the dataset and are automatically applied to the
data during dataset writes and reads.

Prior to the HDF5 1.10.2 release, a parallel HDF5 application
could read datasets with filters applied to them, but could
not write to those datasets in parallel. The datasets would
have to first be written in a serial HDF5 application or from
a single MPI rank in a parallel HDF5 application. This
restriction was in place because:

- Updating the data in filtered datasets requires management
  of file metadata, such as the dataset's chunk index and file
  space for data chunks, which must be done collectively in
  order for MPI ranks to have a consistent view of the file.
  At the time, HDF5 lacked the collective coordination of
  this metadata management.

- When multiple MPI ranks are writing independently to the
  same chunk in a dataset (even if their selected portions of
  the chunk don't overlap), the whole chunk has to be read,
  unfiltered, modified, re-filtered and then written back to
  disk. This read-modify-write style of operation would cause
  conflicts among the MPI ranks and lead to an inconsistent
  view of the file.

Introduced in the HDF5 1.10.2 release, the parallel compression
feature allows an HDF5 application to write in parallel to
datasets with filters applied to them, as long as collective
I/O is used. The feature introduces new internal infrastructure
that coordinates the collective management of the file metadata
between MPI ranks during dataset writes. It also accounts for
multiple MPI ranks writing to a chunk by assigning ownership to
one of the MPI ranks, at which point the other MPI ranks send
their modifications to the owning MPI rank.

The parallel compression feature is enabled whenever HDF5
is built with parallel support, unless the necessary MPI-3
routines are not available, in which case it is disabled.
HDF5 therefore conditionally defines the macro
`H5_HAVE_PARALLEL_FILTERED_WRITES`, which an application can
check to see whether the feature is available.
## Examples

Using the parallel compression feature is very similar to using
compression in serial HDF5, except that dataset writes **must**
be collective:
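```
/* Request collective I/O on the dataset transfer property list;
   this is required for parallel writes to filtered datasets */
hid_t dxpl_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE);
H5Dwrite(..., dxpl_id, ...);
```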
The following are two simple examples of using the parallel compression
feature:

[ph5_filtered_writes.c](https://github.com/HDFGroup/hdf5/blob/develop/examples/ph5_filtered_writes.c)

[ph5_filtered_writes_no_sel.c](https://github.com/HDFGroup/hdf5/blob/develop/examples/ph5_filtered_writes_no_sel.c)

The former contains simple examples of using the parallel
compression feature to write to compressed datasets, while the
latter contains an example of how to write to compressed datasets
when one or more MPI ranks don't have any data to write to a dataset.
Remember that the feature requires these writes to use collective
I/O, so the MPI ranks which have nothing to contribute must still
participate in the collective write call.
## Incremental file space allocation support

HDF5's [file space allocation time](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALLOC_TIME)
is a dataset creation property that can have significant effects
on application performance, especially if the application uses
parallel HDF5. In a serial HDF5 application, the default file space
allocation time for chunked datasets is "incremental". This means
that allocation of space in the HDF5 file for data chunks is
deferred until data is first written to those chunks. In parallel
HDF5, the file space allocation time was previously always forced
to "early", which allocates space in the file for all of a dataset's
data chunks at creation time (or during the first open of a dataset
if it was created serially). This would ensure that all the necessary
file space was allocated so MPI ranks could perform independent I/O
operations on a dataset without needing further coordination of file
metadata as described previously.

While this strategy has worked in the past, it has some noticeable
drawbacks. For one, the larger the chunked dataset being created,
the more noticeable overhead there will be during dataset creation
as all of the data chunks are being allocated in the HDF5 file.
Further, these data chunks will, by default, be [filled](https://portal.hdfgroup.org/display/HDF5/H5P_SET_FILL_VALUE)
with HDF5's default fill data value, leading to extraordinary
dataset creation overhead and resulting in pre-filling large
portions of a dataset that the application might have been planning
to overwrite anyway. Even worse, there will be more initial overhead
from compressing that fill data before writing it out, only to have
it read back in, unfiltered and modified the first time a chunk is
written to. In the past, it was typically suggested that parallel
HDF5 applications should use [H5Pset_fill_time](https://portal.hdfgroup.org/display/HDF5/H5P_SET_FILL_TIME)
with a value of `H5D_FILL_TIME_NEVER` in order to disable writing of
the fill value to dataset chunks, but this isn't ideal if the
application actually wishes to make use of fill values.

With [improvements made](https://www.hdfgroup.org/2022/03/parallel-compression-improvements-in-hdf5-1-13-1/)
to the parallel compression feature for the HDF5 1.13.1 release,
"incremental" file space allocation is now the default for datasets
created in parallel *only if they have filters applied to them*.
"Early" file space allocation is still supported for these datasets
if desired and is still forced for datasets created in parallel that
do *not* have filters applied to them. This change should significantly
reduce the overhead of creating filtered datasets in parallel HDF5
applications and should be helpful to applications that wish to
use a fill value for these datasets. It should also help significantly
reduce the size of the HDF5 file, as file space for the data chunks
is allocated as needed rather than all at once.
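To get a feel for how much work "early" allocation can imply, the following minimal sketch estimates the number of chunks in a dataset and the amount of fill data that would have to be produced (and pushed through the filter pipeline) at creation time. The function names are hypothetical helpers and not part of the HDF5 API. The sketch assumes that every chunk, including a partially covered edge chunk, is allocated at its full unfiltered size and that a fill value is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of chunks needed to cover a dataset: the product over all
 * dimensions of ceil(dims[i] / chunk_dims[i]). Partially covered
 * edge chunks are counted as whole chunks (assumption of this sketch). */
uint64_t chunk_count(size_t rank, const uint64_t *dims, const uint64_t *chunk_dims)
{
    uint64_t total = 1;
    for (size_t i = 0; i < rank; i++) {
        assert(chunk_dims[i] > 0);
        total *= (dims[i] + chunk_dims[i] - 1) / chunk_dims[i];
    }
    return total;
}

/* Unfiltered size in bytes of a single chunk. */
uint64_t chunk_bytes(size_t rank, const uint64_t *chunk_dims, uint64_t elem_size)
{
    uint64_t bytes = elem_size;
    for (size_t i = 0; i < rank; i++)
        bytes *= chunk_dims[i];
    return bytes;
}

/* Bytes of fill data written up front when all chunks are allocated
 * and filled at dataset creation time. */
uint64_t early_fill_bytes(size_t rank, const uint64_t *dims,
                          const uint64_t *chunk_dims, uint64_t elem_size)
{
    return chunk_count(rank, dims, chunk_dims) * chunk_bytes(rank, chunk_dims, elem_size);
}
```

For example, a 10000 x 10000 dataset of 8-byte elements with 1000 x 1000 chunks has 100 chunks, so under these assumptions early allocation with fill values would have to produce roughly 800 MB of fill data before the application writes anything. With incremental allocation, that cost is only paid for chunks that are actually written.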
## Performance Considerations

Since getting good performance out of HDF5's parallel compression
feature involves several factors, the following is a list of
performance considerations (generally from most to least important)
and best practices to take into account when trying to get the
optimal performance out of the parallel compression feature.
### Begin with a good chunking strategy

[Starting with a good chunking strategy](https://portal.hdfgroup.org/display/HDF5/Chunking+in+HDF5)
will generally have the largest impact on overall application
performance. The different chunking parameters can be difficult
to fine-tune, but it is essential to start with a well-performing
chunking layout before adding compression and parallel I/O into
the mix. Compression itself adds overhead and may have side
effects that necessitate further adjustment of the chunking
parameters and HDF5 application settings. Consider that the
chosen chunk size becomes a very important factor when compression
is involved, as data chunks have to be completely read and
re-written to perform partial writes to the chunk.

[Improving I/O performance with HDF5 compressed datasets](http://portal.hdfgroup.org/display/HDF5/Improving+IO+Performance+When+Working+with+HDF5+Compressed+Datasets)
is a useful reference for more information on getting good
performance when using a chunked dataset layout.
### Avoid chunk sharing

Since the parallel compression feature has to assign ownership
of data chunks to a single MPI rank in order to avoid the
previously described read-modify-write issue, an HDF5 application
may need to take care when determining how a dataset will be
divided up among the MPI ranks writing to it. Each dataset data
chunk that is written to by more than one MPI rank will incur extra
MPI overhead as one of the ranks takes ownership and the other
ranks send it their data and information about where in the chunk
that data belongs. While this is not always possible, an HDF5
application will get the best performance out of parallel compression
if it can avoid writing in a way that causes more than one MPI rank
to write to any given data chunk in a dataset.
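One way to avoid chunk sharing, for a dataset that is partitioned along its first dimension, is to give each MPI rank a range of rows that starts and ends on chunk boundaries. The sketch below is a hypothetical helper (`chunk_aligned_rows` and `row_range_t` are illustrative names, not HDF5 API) and assumes that each chunk spans the full extent of every other dimension, so that whole rows of chunks can be dealt out to ranks.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open row range [start, start + count) assigned to one rank. */
typedef struct {
    uint64_t start;
    uint64_t count;
} row_range_t;

/* Split `total_rows` among `nranks` so that every range begins on a
 * chunk boundary. Whole chunks are distributed as evenly as possible,
 * with the lower ranks receiving the remainder. A rank may receive
 * zero rows when there are fewer chunks than ranks. */
row_range_t chunk_aligned_rows(uint64_t total_rows, uint64_t chunk_rows,
                               int nranks, int rank)
{
    assert(chunk_rows > 0 && nranks > 0 && rank >= 0 && rank < nranks);

    uint64_t nchunks = (total_rows + chunk_rows - 1) / chunk_rows;
    uint64_t base    = nchunks / (uint64_t)nranks;
    uint64_t extra   = nchunks % (uint64_t)nranks;
    uint64_t r       = (uint64_t)rank;

    /* Ranks below `extra` own one additional chunk each. */
    uint64_t first_chunk = r * base + (r < extra ? r : extra);
    uint64_t my_chunks   = base + (r < extra ? 1 : 0);

    uint64_t start = first_chunk * chunk_rows;
    uint64_t end   = (first_chunk + my_chunks) * chunk_rows;

    /* The last chunk may extend past the end of the dataset. */
    if (end > total_rows)
        end = total_rows;
    if (start > total_rows)
        start = total_rows;

    row_range_t out = { start, end - start };
    return out;
}
```

The returned `start` and `count` could then be used for the first dimension of a hyperslab selection. A rank that receives a count of zero has nothing to write, but it must still take part in the collective write call, as in the `ph5_filtered_writes_no_sel.c` example referenced earlier.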
### Collective metadata operations

The parallel compression feature typically works with a significant
amount of metadata related to the management of the data chunks
in datasets. In initial performance results gathered from various
HPC machines, it was found that the parallel compression feature
did not scale well at around 8k MPI ranks and beyond. On further
investigation, it became obvious that the bottleneck was due to
heavy filesystem pressure from the metadata management for dataset
data chunks as they changed size (as a result of data compression)
and moved around in the HDF5 file.

Enabling collective metadata operations in the HDF5 application
(as in the snippet below) showed significant improvement in
performance and scalability and is generally recommended
unless doing so is shown to hurt application performance.
```
...
hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);
/* Perform metadata reads collectively */
H5Pset_all_coll_metadata_ops(fapl_id, 1);
/* Perform metadata writes collectively */
H5Pset_coll_metadata_write(fapl_id, 1);
hid_t file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);
...
```
### Align chunks in the file

The natural layout of an HDF5 file may cause dataset data
chunks to end up at addresses in the file that do not align
well with the underlying file system, possibly leading to
poor performance. As an example, Lustre performance is generally
good when writes are aligned with the chosen stripe size.
The HDF5 application can use [H5Pset_alignment](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT)
to have a bit more control over where objects in the HDF5
file end up. However, do note that setting the alignment
of objects generally wastes space in the file and has the
potential to dramatically increase its resulting size, so
caution should be used when choosing the alignment parameters.

[H5Pset_alignment](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT)
has two parameters that control the alignment of objects in
the HDF5 file: the "threshold" value and the alignment
value. The threshold value specifies that any object greater
than or equal in size to that value will be aligned in the
file at addresses which are multiples of the chosen alignment
value. While the value 0 can be specified for the threshold
to make every object in the file be aligned according to
the alignment value, this isn't generally recommended, as it
will likely waste an excessive amount of space in the file.

In the example below, the chosen dataset chunk size is
provided for the threshold value and 1 MiB is specified for
the alignment value. Assuming that 1 MiB is an optimal
alignment value (e.g., assuming that it matches well with
the Lustre stripe size), this should cause dataset data
chunks to be well-aligned and generally give good write
performance.
```
hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);
/* Assuming Lustre stripe size is 1 MiB, align data chunks
   in the file to address multiples of 1 MiB. */
H5Pset_alignment(fapl_id, dataset_chunk_size, 1048576);
hid_t file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);
```
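The threshold and alignment rule described above, and the space it can waste, can be modeled with a few lines of arithmetic. This is only an illustrative sketch of the rule as stated in this section, not HDF5's actual allocator; `align_up`, `place_object` and `padding_bytes` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Round `addr` up to the next multiple of `alignment`. */
uint64_t align_up(uint64_t addr, uint64_t alignment)
{
    assert(alignment > 0);
    return ((addr + alignment - 1) / alignment) * alignment;
}

/* Address at which an object of `size` bytes would be placed, given the
 * next free address in the file. Objects whose size is greater than or
 * equal to `threshold` are moved up to a multiple of `alignment`;
 * smaller objects are placed at the next free address unchanged.
 * A threshold of 0 therefore aligns every object. */
uint64_t place_object(uint64_t next_free_addr, uint64_t size,
                      uint64_t threshold, uint64_t alignment)
{
    if (size >= threshold)
        return align_up(next_free_addr, alignment);
    return next_free_addr;
}

/* Bytes skipped (and potentially wasted) in front of the object. */
uint64_t padding_bytes(uint64_t next_free_addr, uint64_t size,
                       uint64_t threshold, uint64_t alignment)
{
    return place_object(next_free_addr, size, threshold, alignment) - next_free_addr;
}
```

For instance, with a threshold of 4 MiB and an alignment of 1 MiB, a 4 MiB object whose next free address is 1500000 would be placed at 2097152, skipping 597152 bytes, while a 512-byte object would stay at 1500000. Repeating this padding for many objects is what can inflate the file size when the threshold is set too low.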
### File free space managers

As data chunks in a dataset get written to and compressed,
they can change in size and be relocated in the HDF5 file.
Since parallel compression usually involves many data chunks
in a file, this can create significant amounts of free space
in the file over its lifetime and eventually cause performance
issues.

An HDF5 application can use [H5Pset_file_space_strategy](http://portal.hdfgroup.org/display/HDF5/H5P_SET_FILE_SPACE_STRATEGY)
with a value of `H5F_FSPACE_STRATEGY_PAGE` to enable the paged
aggregation feature, which can accumulate metadata and raw
data for dataset data chunks into well-aligned, configurably
sized "pages" for better performance. However, note that using
the paged aggregation feature will cause any setting from
[H5Pset_alignment](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT)
to be ignored. While an application should be able to get
comparable performance effects by [setting the size of these pages](http://portal.hdfgroup.org/display/HDF5/H5P_SET_FILE_SPACE_PAGE_SIZE)
to be equal to the value that
would have been set for [H5Pset_alignment](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT),
this may not necessarily be the case and should be studied.

Note that [H5Pset_file_space_strategy](http://portal.hdfgroup.org/display/HDF5/H5P_SET_FILE_SPACE_STRATEGY)
has a `persist` parameter. This determines whether or not the
file free space manager should include extra metadata in the
HDF5 file about free space sections in the file. If this
parameter is `false`, any free space in the HDF5 file will
become unusable once the HDF5 file is closed. For parallel
compression, it's generally recommended that `persist` be set
to `true`, as this will keep better track of file free space
for data chunks between accesses to the HDF5 file.
```
hid_t fcpl_id = H5Pcreate(H5P_FILE_CREATE);
/* Use persistent free space manager with paged aggregation */
H5Pset_file_space_strategy(fcpl_id, H5F_FSPACE_STRATEGY_PAGE, 1, 1);
/* Assuming Lustre stripe size is 1MiB, set page size to that */
H5Pset_file_space_page_size(fcpl_id, 1048576);
...
hid_t file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, fcpl_id, fapl_id);
```
### Low-level collective vs. independent I/O

While the parallel compression feature requires that the HDF5
application set and maintain collective I/O at the application
interface level (via [H5Pset_dxpl_mpio](https://portal.hdfgroup.org/display/HDF5/H5P_SET_DXPL_MPIO)),
it does not require that the actual MPI I/O that occurs at
the lowest layers of HDF5 be collective; independent I/O may
perform better depending on the application I/O patterns and
parallel file system performance, among other factors. The
application may use [H5Pset_dxpl_mpio_collective_opt](https://portal.hdfgroup.org/display/HDF5/H5P_SET_DXPL_MPIO_COLLECTIVE_OPT)
to control this setting and see which I/O method provides the
best performance.
```
hid_t dxpl_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE);
H5Pset_dxpl_mpio_collective_opt(dxpl_id, H5FD_MPIO_INDIVIDUAL_IO); /* Try independent I/O */
H5Dwrite(..., dxpl_id, ...);
```
### Runtime HDF5 Library version

An HDF5 application can use the [H5Pset_libver_bounds](http://portal.hdfgroup.org/display/HDF5/H5P_SET_LIBVER_BOUNDS)
routine to set the upper and lower bounds on library versions
to use when creating HDF5 objects. For parallel compression
specifically, setting the library version to the latest available
version can allow access to better/more efficient chunk indexing
types and data encoding methods. For example:
```
...
hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
/* Use the latest library version for both the lower and upper bound */
H5Pset_libver_bounds(fapl_id, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
hid_t file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);
...
```