PyTables has a powerful capability to deal with native HDF5 files created with other tools. However, there are situations where you may want to create truly native PyTables files with those tools while retaining full compatibility with the PyTables format. That is perfectly possible, and this appendix presents the format that you should give to your own-generated files in order to get a fully PyTables-compatible file.
We are going to describe version 1.6 of the PyTables file format (introduced in PyTables version 1.3). At this stage, this file format is considered stable enough not to introduce significant changes for a reasonable amount of time. As time goes by, some changes will be introduced (and documented here) in order to cope with new necessities. However, the changes will be carefully pondered so as to ensure backward compatibility whenever possible.
A PyTables file is composed of arbitrarily large amounts of HDF5 groups (Groups in the PyTables naming scheme) and datasets (Leaves in the PyTables naming scheme). For groups, the only requirement is that they must have some system attributes available. By convention, system attributes in PyTables are written in upper case, and user attributes in lower case, but this is not enforced by the software. In the case of datasets, besides the mandatory system attributes, some further conditions are needed in their storage layout, as well as in the datatypes used there, as we will see shortly.
As a final remark, you can use any filter you want to create a PyTables file, provided that the filter is a standard one in HDF5, like zlib, shuffle or szip (although the last one cannot be used from within PyTables to create a new file, datasets compressed with szip can be read, because it is the HDF5 library which does the decompression transparently).
The File object is, in fact, a special HDF5 group structure that is the root for the rest of the objects in the object tree. The next attributes are mandatory for the HDF5 root group structure in PyTables files:

CLASS: This attribute should always be set to 'GROUP' for group structures.

PYTABLES_FORMAT_VERSION: It represents the internal format version, and currently should be set to the '1.6' string.

TITLE: A string where the user can put some description of what this group is used for.

VERSION: Should contain the string '1.0'.
The next attributes are mandatory for group structures:

CLASS: This attribute should always be set to 'GROUP' for group structures.

TITLE: A string where the user can put some description of what this group is used for.

VERSION: Should contain the string '1.0'.
This depends on the kind of Leaf. The format for each type follows.
The next attributes are mandatory for table structures:

CLASS: Must be set to 'TABLE'.

TITLE: A string where the user can put some description of what this dataset is used for.

VERSION: Should contain the string '2.6'.

FLAVOR: This is meant to provide information about the kind of object kept in the Table, i.e. when the dataset is read, it will be converted to the indicated flavor. It can take one of the next string values:

'numarray': The read operations will return a numarray object.

'numpy': The read operations will return a NumPy object.

FIELD_X_NAME: It contains the names of the different fields. The X means the number of the field, zero-based (beware, order does matter). You should add as many attributes of this kind as fields you have in your records.

FIELD_X_FILL: It contains the default values of the different fields. All the datatypes are supported natively, except for complex types, which are currently serialized using Pickle. The X means the number of the field, zero-based (beware, order does matter). You should add as many attributes of this kind as fields you have in your records. These attributes are meant for saving the default values persistently, and their existence is optional.

NROWS: This should contain the number of compound data type entries (i.e. rows) in the dataset. It must be an int data type.
The datatype of the elements (rows) of a Table must be the H5T_COMPOUND compound data type, and each of these compound components must be built with only the next HDF5 data type classes:
H5T_BITFIELD: This class is used to represent the Bool type. Such a type must be built using an H5T_NATIVE_B8 datatype, followed by an HDF5 H5Tset_precision call to set its precision to just 1 bit.
H5T_INTEGER: This includes the next data types:

H5T_NATIVE_SCHAR: This represents a signed char C type, but it is effectively used to represent an Int8 type.

H5T_NATIVE_UCHAR: This represents an unsigned char C type, but it is effectively used to represent a UInt8 type.

H5T_NATIVE_SHORT: This represents a short C type, and it is effectively used to represent an Int16 type.

H5T_NATIVE_USHORT: This represents an unsigned short C type, and it is effectively used to represent a UInt16 type.

H5T_NATIVE_INT: This represents an int C type, and it is effectively used to represent an Int32 type.

H5T_NATIVE_UINT: This represents an unsigned int C type, and it is effectively used to represent a UInt32 type.

H5T_NATIVE_LONG: This represents a long C type, and it is effectively used to represent an Int32 or an Int64, depending on whether you are running a 32-bit or 64-bit architecture.

H5T_NATIVE_ULONG: This represents an unsigned long C type, and it is effectively used to represent a UInt32 or a UInt64, depending on whether you are running a 32-bit or 64-bit architecture.

H5T_NATIVE_LLONG: This represents a long long C type (__int64, if you are using a Windows system) and it is effectively used to represent an Int64 type.

H5T_NATIVE_ULLONG: This represents an unsigned long long C type (beware: this type does not have a correspondence on Windows systems) and it is effectively used to represent a UInt64 type.
H5T_FLOAT: This includes the next datatypes:

H5T_NATIVE_FLOAT: This represents a float C type and it is effectively used to represent a Float32 type.

H5T_NATIVE_DOUBLE: This represents a double C type and it is effectively used to represent a Float64 type.
H5T_TIME: This includes the next datatypes:

H5T_UNIX_D32: This represents a POSIX time_t C type and it is effectively used to represent a 'Time32' aliasing type, which corresponds to an Int32 type.

H5T_UNIX_D64: This represents a POSIX struct timeval C type and it is effectively used to represent a 'Time64' aliasing type, which corresponds to a Float64 type.
H5T_STRING: The datatype used to describe strings in PyTables is H5T_C_S1 (i.e. a string C type) followed by a call to the HDF5 H5Tset_size() function to set their length.
H5T_ARRAY: This allows the construction of homogeneous, multidimensional arrays, so that you can include such objects in compound records. The types supported as elements of H5T_ARRAY data types are the ones described above. Currently, PyTables does not support nested H5T_ARRAY types.
H5T_COMPOUND: This allows the support of complex numbers. Its format is described below:

The H5T_COMPOUND type class contains two members. Both members must have the H5T_FLOAT atomic datatype class. The name of the first member should be "r" and represents the real part. The name of the second member should be "i" and represents the imaginary part. The precision property of both H5T_FLOAT members must be either 32 significant bits (e.g. H5T_NATIVE_FLOAT) or 64 significant bits (e.g. H5T_NATIVE_DOUBLE). They represent Complex32 and Complex64 types, respectively.

Currently, PyTables does not support nested H5T_COMPOUND types, the only exception being the support of complex numbers in Table objects as described above.
The next attributes are mandatory for array structures:

CLASS: Must be set to 'ARRAY'.

FLAVOR: This is meant to provide information about the kind of object kept in the Array, i.e. when the dataset is read, it will be converted to the indicated flavor. It can take one of the next string values:

'numarray': The read operations will return a numarray object.

'numpy': The read operations will return a NumPy object.

'numeric': The read operations will return a Numeric object.

'python': The read operations will return a Python list object in case the dataset has dimensionality. If the dataset is a scalar, then an appropriate Python scalar will be returned instead.

TITLE: A string where the user can put some description of what this dataset is used for.

VERSION: Should contain the string '2.3'.
An Array has a dataspace with an N-dimensional contiguous layout (if you prefer a chunked layout, see EArray below).
The elements of Array must have either HDF5 atomic data types or a compound data type representing a complex number. The atomic data types can currently be one of the next HDF5 data type classes: H5T_BITFIELD, H5T_INTEGER, H5T_FLOAT and H5T_STRING. The H5T_TIME class is also supported for reading existing Array objects, but not for creating them. See the Table format description in Section D.3.1 for more info about these types.

In addition to the HDF5 atomic data types, the Array format supports complex numbers with the H5T_COMPOUND data type class. See the Table format description in Section D.3.1 for more info about this special type.

You should note that H5T_ARRAY class datatypes are not allowed in Array objects.
The next attributes are mandatory for carray structures:

CLASS: Must be set to 'CARRAY'.

FLAVOR: This is meant to provide information about the kind of objects kept in the CArray, i.e. when the dataset is read, it will be converted to the indicated flavor. It can take the same values as the Array object.

TITLE: A string where the user can put some description of what this dataset is used for.

VERSION: Should contain the string '1.0'.
The elements of CArray must have either HDF5 atomic data types or a compound data type representing a complex number. The atomic data types can currently be one of the next HDF5 data type classes: H5T_BITFIELD, H5T_INTEGER, H5T_FLOAT and H5T_STRING. The H5T_TIME class is also supported for reading existing CArray objects, but not for creating them. See the Table format description in Section D.3.1 for more info about these types.

In addition to the HDF5 atomic data types, the CArray format supports complex numbers with the H5T_COMPOUND data type class. See the Table format description in Section D.3.1 for more info about this special type.

You should note that H5T_ARRAY class datatypes are not allowed in CArray objects.
The next attributes are mandatory for earray structures:

CLASS: Must be set to 'EARRAY'.

EXTDIM: (Integer) Must be set to the extensible dimension. Only one extensible dimension is supported right now.

FLAVOR: This is meant to provide information about the kind of objects kept in the EArray, i.e. when the dataset is read, it will be converted to the indicated flavor. It can take the same values as the Array object (see D.3.2), except "Int" and "Float".

TITLE: A string where the user can put some description of what this dataset is used for.

VERSION: Should contain the string '1.3'.
The elements of EArray are allowed to have the same data types as the elements in the Array format. They can be one of the HDF5 atomic data type classes: H5T_BITFIELD, H5T_INTEGER, H5T_FLOAT, H5T_TIME or H5T_STRING; see the Table format description in Section D.3.1 for more info about these types. They can also be an H5T_COMPOUND datatype representing a complex number; see the Table format description in Section D.3.1.

You should note that H5T_ARRAY class data types are not allowed in EArray objects.
The next attributes are mandatory for vlarray structures:

CLASS: Must be set to 'VLARRAY'.

FLAVOR: This is meant to provide information about the kind of objects kept in the VLArray, i.e. when the dataset is read, it will be converted to the indicated flavor. It can take one of the next values:

'numarray': The dataset will be returned as a numarray object.

'numpy': The dataset will be returned as a NumPy object.

'numeric': The dataset will be returned as a Numeric object.

'python': The dataset will be returned as a Python list object in case the dataset has dimensionality. If the dataset is a scalar, then an appropriate Python scalar will be returned instead.

'Object': The elements in the dataset will be interpreted as pickled objects (i.e. objects serialized through the use of the Pickle Python module) and returned as generic Python objects. Only one such object will be deserialized per entry. As the Pickle module is not normally available in other languages, this flavor won't be useful in general.

'VLString': The elements in the dataset will be returned as Python string objects of any length, with the twist that Unicode strings are supported as well (provided you use the UTF-8 encoding, see below). However, only one such object will be deserialized per entry.

TITLE: A string where the user can put some description of what this dataset is used for.

VERSION: Should contain the string '1.2'.
The data type of the elements (rows) of VLArray objects must be the H5T_VLEN variable-length (or VL for short) datatype, and the base datatype specified for the VL datatype can be of any atomic HDF5 datatype listed in the Table format description in Section D.3.1. That includes these classes:

H5T_BITFIELD
H5T_INTEGER
H5T_FLOAT
H5T_TIME
H5T_STRING
H5T_ARRAY

They can also be an H5T_COMPOUND data type representing a complex number; see the Table format description in Section D.3.1 for a detailed description.
You should note that this does not include another VL datatype, or a compound datatype that does not fit the description of a complex number. Note as well that, for the Object and VLString special flavors, the base for the VL datatype is always an H5T_NATIVE_UCHAR. That means that the complete row entry in the dataset has to be used in order to fully serialize the object or the variable-length string.
In addition, if you plan to use the VLString flavor for your text data and you are using ascii-7 (7-bit ASCII) encoding for your strings, but you don't know how (or just don't want) to convert them to the required UTF-8 encoding, you should not worry too much, because ASCII characters with values in the range [0x00, 0x7f] are directly mapped to Unicode characters in the range [U+0000, U+007F], and the UTF-8 encoding has the useful property that a UTF-8 encoded ascii-7 string is indistinguishable from a traditional ascii-7 string. So you will not need any further conversion in order to save your ascii-7 strings with a VLString flavor.