# Lossy Compression with Quantize

## Introduction {#quantize}

The quantize feature was initially developed as part of the Community
Codec Repository (CCR) [2]. The CCR project allows netCDF users to
make use of HDF5 plugins (a.k.a. “filters”), which can add new
compression and other algorithms to the HDF5 library. As part of CCR,
the quantization algorithms were implemented as HDF5 filters.

However, one drawback of implementing quantization as a filter is
that the filter is also required when reading the data [1]. Although
this makes sense for compression/decompression algorithms, the
quantize algorithms are only needed when data are written. Requiring
readers of the data to install the filters as well places an
unnecessary burden on them. Furthermore, using the quantize
filter results in data that cannot be read by netCDF-Java or by versions
of netcdf-c before 4.8.0, when support for multiple HDF5 filters was
added. For these reasons, it was decided to merge the quantize
algorithms into the netcdf-c library [5].

As part of the netcdf-c library, the quantize algorithms are available
for netCDF/HDF5 files and for the new ncZarr format. They produce data
files that are fully backward compatible with all versions of netcdf-c
since 4.0, and fully compatible with netCDF-Java.

## The Quantize Feature

The quantize algorithms assist with lossy compression by setting
excess bits to all zeros or all ones (in alternate array values). This
allows a subsequent compression algorithm, like the zlib-based
deflation built into netCDF-4, to better compress the data.

The quantize feature is applied to a variable in a netCDF file, and
may only be used with single or double precision floating point
variables (netCDF types NC_FLOAT and NC_DOUBLE). Attempting to turn on
quantize for any other type of netCDF variable will result in an error.

Note that turning on quantize does not, by itself, reduce the size of
the data. Quantize yields additional compression only if subsequent
compression is also used.

![Quantization of Pi](quantize_pi.png)
Figure 1: The value of Pi expressed as a 32-bit floating point number,
with different levels of quantization applied, from Number of
Significant Digits equal to 8 (no quantization), to 1 (maximum
quantization). The least significant bits of the significand are
replaced with zeros, to the extent possible, while preserving the
desired number of significant digits. In this example the Bit Grooming
quantization algorithm is used.

## Quantization Algorithms

Three different quantization algorithms are provided in the netcdf-c
quantize feature. Each does a somewhat different calculation to
determine the number of bits that can be set to zeros (or ones), while
preserving the precision specified by the user.

Two of the algorithms, Bit Grooming and Granular Bit Round, accept the
number of significant decimal digits to be preserved in the data. The
third, Bit Round, accepts the number of binary bits to preserve.

### Bit Grooming

The Bit Grooming algorithm determines the number of bits which
are necessary for the required number of significant decimal
digits. This determination is made at the beginning of processing and
is applied to all values.

Bit Grooming then sets the excess bits of the first array value to zero,
the excess bits of the next array value to one, and continues
alternating between zero and one for the excess bits of each subsequent
array value. In this way, the average value of the array is preserved.

For the Bit Grooming algorithm, the NSD parameter refers to the number
of significant decimal digits that will be preserved. The number of
significant digits may be 1-7 for single precision floating point, or
1-15 for double precision floating point.

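The alternating pattern can be illustrated with a short C sketch. It is
not the netcdf-c implementation: the function name `groom_floats` and
the `keep_bits` parameter (the number of explicit significand bits to
retain) are hypothetical, and the conversion from NSD to a bit count is
left out. It assumes IEEE 754 single precision and finite input values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FLOAT_EXPLICIT_BITS 23u /* explicit significand bits in an IEEE 754 float */

/* Groom an array in place, keeping `keep_bits` explicit significand bits.
 * `keep_bits` is a hypothetical parameter for this sketch; the library
 * derives its own bit count from the requested NSD. Inputs are assumed
 * to be finite (no NaN or infinity). */
void groom_floats(float *data, size_t len, unsigned keep_bits)
{
    assert(keep_bits >= 1 && keep_bits < FLOAT_EXPLICIT_BITS);

    /* Mask that clears the excess bits, and its complement that sets them. */
    uint32_t zero_mask = UINT32_MAX << (FLOAT_EXPLICIT_BITS - keep_bits);
    uint32_t one_mask = ~zero_mask;

    for (size_t i = 0; i < len; i++) {
        uint32_t bits;
        memcpy(&bits, &data[i], sizeof bits); /* well-defined type punning */
        if (i % 2 == 0)
            bits &= zero_mask;                /* even elements: excess bits to 0 */
        else if (data[i] != 0.0f)
            bits |= one_mask;                 /* odd elements: excess bits to 1 */
        /* Zero is left untouched so that it is never pushed away from zero. */
        memcpy(&data[i], &bits, sizeof bits);
    }
}
```

For example, calling `groom_floats(values, n, 12)` clears the 11 lowest
significand bits of `values[0]` and sets them in `values[1]`, so the
first value can only move toward zero and the second away from it.
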
### Granular Bit Round

Granular Bit Round determines the number of required bits for each
value in the array, and uses IEEE rounding to change the data
value. It achieves a better overall compression ratio by more
aggressively determining the minimum number of bits required to
preserve the specified number of significant decimal digits.

For the Granular Bit Round algorithm, the NSD parameter refers to the
number of significant decimal digits that will be preserved (as with
the Bit Grooming algorithm). The number of significant digits may be
1-7 for single precision floating point, or 1-15 for double precision
floating point.

### Bit Round

The Bit Round algorithm allows the user to directly specify the number
of bits of the significand which will be preserved, and then rounds
each value to that many bits, setting the excess bits to zero.

For the Bit Round algorithm, the NSD parameter refers to the number of
significant binary digits that will be preserved. The number of
significant digits may be 1-23 for single precision floating point, or
1-52 for double precision floating point.

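Keeping a fixed number of significand bits can be sketched in C as
follows. This is a minimal illustration under stated assumptions, not
the library's code: `bitround_float` is a hypothetical name, the input
is assumed to be a finite IEEE 754 single precision value, and the
sketch uses round-to-nearest (ties away from zero), which is consistent
with the 0.5 * V * 2^(-NSB) error bound quoted in the section on
distortions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round one float to `nsb` explicit significand bits (1..23).
 * Assumes a finite input; NaN and infinity are not handled here. */
float bitround_float(float value, unsigned nsb)
{
    assert(nsb >= 1 && nsb <= 23);
    if (nsb == 23)
        return value;                          /* nothing to discard */

    unsigned drop = 23u - nsb;                 /* number of excess bits */
    uint32_t zero_mask = UINT32_MAX << drop;   /* keeps sign, exponent, nsb bits */
    uint32_t half = UINT32_C(1) << (drop - 1); /* most significant excess bit */

    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    bits += half;      /* round to nearest; a carry may reach the exponent */
    bits &= zero_mask; /* clear the excess bits */
    memcpy(&value, &bits, sizeof bits);
    return value;
}
```

With these definitions, `bitround_float(1.1f, 3)` yields 1.125, the
nearest value that has only three explicit significand bits, and the
result never differs from the input by more than half of the last
retained bit.
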
## Quantize Attribute

When the quantize feature is used, an integer attribute is added to
the variable which contains the NSD setting. Without this attribute it
would be impossible for readers to know that quantize had been applied
to the data. The name of the attribute reflects the quantize algorithm
used. In accordance with the conventions established by the NetCDF
Users Guide, these attribute names begin with an underscore,
indicating that they are added by the library and should not be
modified or deleted by users [6].

Algorithm | Attribute Name
----------|---------------
Bit Groom | _QuantizeBitGroomNumberOfSignificantDigits
Granular Bit Round | _QuantizeGranularBitRoundNumberOfSignificantDigits
Bit Round | _QuantizeBitRoundNumberOfSignificantBits

Figure 2: Table showing the names of the attribute added to a variable
after the quantize feature has been applied. The name of the attribute
indicates the algorithm used; the integer value represents the number
of significant decimal digits (for Bit Groom and Granular Bit Round),
or the number of significand bits retained (for Bit Round).

## Handling of Fill Values

In a netCDF file, the fill value is the value used for elements of
the data not written by the user. For example, if a variable contains
an array of 10 values, and the user only writes 8 of them, the other
two values will be set to the fill value for that variable.

The fill value of a variable may be set by the user by adding an
attribute of the same type as the variable with the name
“_FillValue”. If present, the value of this attribute will be used as
the fill value for that variable. If not specified, a default value
for each type is used as the fill value. The default fill values may
be found in the netcdf.h file.

When using the quantize feature, any fill values will remain
unquantized. That is, the excess bits of any array element will not be
changed if that element is the fill value. This is necessary if the
fill value is to retain its purpose as an indicator of values that
have not been written.

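The fill value rule can be sketched in C as a guard inside the
quantization loop. The names `shave_skip_fill` and `drop_bits` are
hypothetical, simple bit shaving stands in for whichever quantize
algorithm is selected, and IEEE 754 single precision is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Clear the `drop_bits` least significant significand bits of every
 * element except those equal to `fill_value`, which keep all their bits.
 * A NaN fill value would need an isnan() test instead of `==`. */
void shave_skip_fill(float *data, size_t len, unsigned drop_bits, float fill_value)
{
    assert(drop_bits < 23);
    uint32_t zero_mask = UINT32_MAX << drop_bits;

    for (size_t i = 0; i < len; i++) {
        if (data[i] == fill_value)
            continue;                         /* fill values stay unquantized */

        uint32_t bits;
        memcpy(&bits, &data[i], sizeof bits); /* well-defined type punning */
        bits &= zero_mask;                    /* shave the excess bits */
        memcpy(&data[i], &bits, sizeof bits);
    }
}
```

If a variable uses, say, -999.9 as its `_FillValue`, every element equal
to -999.9 comes out of `shave_skip_fill` bit-for-bit unchanged, so a
reader can still recognise unwritten elements by comparing against the
attribute.
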
## Distortions Introduced by Lossy Compression

Any lossy compression introduces distortions to data.

The quantization algorithms implemented in netcdf-c introduce a
distortion that can be quantified in terms of a _relative_ error. The
magnitude of the distortion introduced to every single value V is
guaranteed to be within a certain fraction of V, expressed as
`0.5 * V * 2^(-NSB)`, where NSB is the number of significand bits
retained: i.e. it is 0.5V for NSB=0, 0.25V for NSB=1, 0.125V for NSB=2,
etc.

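The bound is easy to evaluate directly. The helper below is a
hypothetical illustration (`bitround_error_margin` is not a netcdf-c
function) that returns the worst-case relative error 0.5 * 2^(-NSB) for
a given NSB.

```c
#include <assert.h>

/* Worst-case relative error when NSB significand bits are retained:
 * 0.5 * 2^(-NSB), computed by repeated halving (exact in binary). */
double bitround_error_margin(unsigned nsb)
{
    assert(nsb <= 52); /* 52 is the largest NSB the text allows (double precision) */

    double margin = 0.5;
    for (unsigned i = 0; i < nsb; i++)
        margin /= 2.0;
    return margin;
}
```

Calling the helper with 0, 1 and 2 returns 0.5, 0.25 and 0.125, matching
the fractions listed above; multiplying the result by V gives the
absolute bound for a particular value.
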
The two decimal-digit algorithms, BitGroom and Granular BitRound, use
different definitions of _decimal precision_, though both are
guaranteed to reproduce NSD decimals when printed.

The relative error margins introduced by these methods are summarised
in the table below:

```
NSD                 1       2       3       4       5       6       7

BitGroom
Error Margin      3.1e-2  3.9e-3  4.9e-4  3.1e-5  3.8e-6  4.7e-7    -

GranularBitRound
Error Margin      1.4e-1  1.9e-2  2.2e-3  1.4e-4  1.8e-5  2.2e-6    -
```

If one defines decimal precision as in BitGroom, i.e. the introduced
relative error must not exceed half of the unit at the decimal place
NSD in the worst-case scenario, the following values of NSB should be
used for BitRound:

```
NSD   1   2   3   4   5   6   7
NSB   3   6   9  13  16  19  23
```

The resulting application of BitRound is as fast as BitGroom, and is
free from the artifacts in multipoint statistics introduced by BitGroom
(see https://doi.org/10.5194/gmd-14-377-2021).

## Using the Quantize Feature

Turning on the quantize feature must be done on a per-variable basis,
after the variable has been defined, and before nc_enddef() (or its
Fortran equivalent) has been called. (Recall that for netCDF/HDF5
files, nc_enddef() is automatically called when data are written to or
read from a variable.)

In accordance with the usual NetCDF API practice, an inquiry function
is also provided which may be called to check if quantize has been
turned on for a variable. Calling the inquiry function is not required
when reading the data; it is provided for user convenience.

### Using Quantize with the NetCDF C API

Quantize is available in the main branch of the netcdf-c library, and
will be part of the next release (netcdf-c-4.9.0).

To turn on the quantize feature, call the nc_def_var_quantize()
function. To inquire about whether quantize has been turned on for a
variable, use the nc_inq_var_quantize() function.

@code
    /* ncid, dimid, NDIM1, NSD_3 and the ERR macro are assumed to be
     * defined elsewhere in the surrounding program. */

    /* Create two variables, one float, one double. Quantization
     * may only be applied to floating point data. */
    if (nc_def_var(ncid, "var1", NC_FLOAT, NDIM1, &dimid, &varid1)) ERR;
    if (nc_def_var(ncid, "var2", NC_DOUBLE, NDIM1, &dimid, &varid2)) ERR;

    /* Set up quantization. This will not make the data any
     * smaller, unless compression is also turned on. In this
     * case, we will set 3 significant digits. */
    if (nc_def_var_quantize(ncid, varid1, NC_QUANTIZE_BITGROOM, NSD_3)) ERR;
    if (nc_def_var_quantize(ncid, varid2, NC_QUANTIZE_BITGROOM, NSD_3)) ERR;

    /* Set up zlib compression. This will work better because the
     * data are quantized, yielding a smaller output file. We will
     * set compression level to 1, which is usually the best
     * choice. */
    if (nc_def_var_deflate(ncid, varid1, 0, 1, 1)) ERR;
    if (nc_def_var_deflate(ncid, varid2, 0, 1, 1)) ERR;
@endcode

Figure 3: Example of using the quantize feature in C. Note that the
example also demonstrates adding zlib (a.k.a. deflate) compression to
the variables. Without turning on the compression, use of quantize
alone will not result in smaller data output.

### Using Quantize with the NetCDF Fortran 90 API

Quantize is available on a branch of the netcdf-fortran libraries, and
will be merged to main after the next netcdf-c release (4.9.0) and
will be released as part of the netCDF Fortran 90 API in the
subsequent release of netcdf-fortran.

In the Fortran 90 API, quantization is turned on by using two new
optional arguments to nf90_def_var(), the quantize_mode and the nsd
arguments.

@code
    ! Define some variables.
    call check(nf90_def_var(ncid, VAR1_NAME, NF90_FLOAT, dimids, varid1&
         &, deflate_level = DEFLATE_LEVEL, quantize_mode =&
         & nf90_quantize_bitgroom, nsd = 3))
    call check(nf90_def_var(ncid, VAR2_NAME, NF90_DOUBLE, dimids,&
         & varid2, contiguous = .TRUE., quantize_mode =&
         & nf90_quantize_bitgroom, nsd = 3))
@endcode

Figure 4: In the Fortran 90 netCDF API, two additional optional
parameters are available for the quantize feature, the quantize_mode
and nsd parameters.

### Using Quantize with the NetCDF Fortran 77 API

Quantize is available on a branch of the netcdf-fortran libraries, and
will be merged to main after the next netcdf-c release (4.9.0) and
will be released as part of the netCDF Fortran 77 API in the
subsequent release of netcdf-fortran.

@code
C     Create some variables.
      do x = 1, NVARS
         retval = nf_def_var(ncid, var_name(x), var_type(x), NDIM1,
     $        dimids, varid(x))
         if (retval .ne. nf_noerr) stop 3

C     Turn on quantize.
         retval = nf_def_var_quantize(ncid, varid(x),
     $        NF_QUANTIZE_BITGROOM, NSD_3)
         if (retval .ne. nf_noerr) stop 3

C     Turn on zlib compression.
         retval = nf_def_var_deflate(ncid, varid(x), 0, 1, 1)
         if (retval .ne. nf_noerr) stop 3
      end do
@endcode

Figure 5: In the Fortran 77 netCDF API, nf_def_var_quantize() and
nf_inq_var_quantize() are provided, which wrap the quantize functions
from the C API.

## Performance

![Quantization Performance](images/quantize_performance.png)

Figure 6: Compression ratio of the E3SM Atmosphere Model (EAM) v2
default monthly dataset of raw size 445 MB, compressed with the default
netCDF lossless compression algorithm (DEFLATE, compression level=1)
alone (leftmost), or after pre-filtering with one of three lossy codecs
(BitGroom, Granular BitRound, or BitRound) with quantization increasing
(and precision decreasing) to the right.

## References

1. HDF5 Dynamically Loaded Filters, The HDF Group, retrieved on
December 2, 2021 from
https://support.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf.

2. Hartnett, E., Zender, C. S. (2020), Additional netCDF Compression
Options with the Community Codec Repository (CCR), American
Meteorological Society (AMS) Annual Meeting, retrieved on July 3, 2021
from
https://www.researchgate.net/publication/347726695_ADDITIONAL_NETCDF_COMPRESSION_OPTIONS_WITH_THE_COMMUNITY_CODEC_REPOSITORY_CCR.

3. Zender, C. S. (2016), Bit Grooming: Statistically accurate
precision-preserving quantization with compression, evaluated in the
netCDF Operators (NCO, v4.4.8+), Geosci. Model Dev., 9, 3199-3211,
doi:10.5194/gmd-9-3199-2016, retrieved on Sep 21, 2020 from
https://www.researchgate.net/publication/301575383_Bit_Grooming_Statistically_accurate_precision-preserving_quantization_with_compression_evaluated_in_the_netCDF_Operators_NCO_v448.

4. Delaunay, X., A. Courtois, and F. Gouillon (2019), Evaluation of
lossless and lossy algorithms for the compression of scientific
datasets in netCDF-4 or HDF5 files, Geosci. Model Dev., 12(9),
4099-4113, doi:10.5194/gmd-2018-250, retrieved on Sep 21, 2020 from
https://www.researchgate.net/publication/335987647_Evaluation_of_lossless_and_lossy_algorithms_for_the_compression_of_scientific_datasets_in_netCDF-4_or_HDF5_files.

5. Hartnett, E., et al., “Provide a way to do bit grooming before
compression”, netcdf-c GitHub Issue #1548,
https://github.com/Unidata/netcdf-c/issues/1548.

6. Rew, R., et al., NetCDF Users Guide, Appendix A: Attribute
Conventions, Unidata,
https://docs.unidata.ucar.edu/netcdf-c/current/attribute_conventions.html.