PyTables comes with a couple of utilities that make the user's life easier. One is called ptdump and lets you see the contents of a PyTables file (or a generic HDF5 file, if supported). The other one is named ptrepack and allows you to (recursively) copy sub-hierarchies of objects present in a file into another one, changing, if desired, some of the filters applied to the leaves during the copy process.
Normally, these utilities will be installed somewhere in your PATH during the installation of the PyTables package, so that you can invoke them from any place in your file system once the installation has successfully finished.
As mentioned before, the ptdump utility allows you to look into the contents of your PyTables files. It lets you see not only the data but also the metadata (that is, the structure and additional information in the form of attributes).
For instructions on how to use it, just pass the
-h
flag to the command:
$ ptdump -h
to see the usage message:
usage: ptdump [-R start,stop,step] [-a] [-h] [-d] [-v] file[:nodepath]
  -R RANGE -- Select a RANGE of rows in the form "start,stop,step"
  -a -- Show attributes in nodes (only useful when -v or -d are active)
  -c -- Show info of columns in tables (only useful when -v or -d are active)
  -i -- Show info of indexed columns (only useful when -v or -d are active)
  -d -- Dump data information on leaves
  -h -- Print help on usage
  -v -- Dump more meta-information on nodes
Let's suppose that we want to know only the structure of a file. In order to do that, don't pass any flag at all; just give the file as the parameter:
$ ptdump vlarray1.h5
Filename: 'vlarray1.h5' Title: '' , Last modif.: 'Fri Feb 6 19:33:28 2004' , rootUEP='/', filters=Filters(), Format version: 1.2
/ (Group) ''
/vlarray1 (VLArray(4,), shuffle, zlib(1)) 'ragged array of ints'
We can see that the file contains just one leaf object called vlarray1, which is an instance of VLArray, has 4 rows, and was created using two filters: shuffle and zlib (with a compression level of 1).
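If you prefer this kind of quick inspection from a Python session instead of the shell, a minimal sketch along these lines should give a similar overview (it assumes the tables.openFile spelling of the API used throughout this manual, and that printing a File instance displays its object tree):

import tables

# Open the file read-only and print its object tree; this gives an
# overview similar to a plain ptdump call.
fileh = tables.openFile("vlarray1.h5", mode="r")
print(fileh)
fileh.close()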
Let's say we want more meta-information. Just add the
-v
(verbose) flag:
$ ptdump -v vlarray1.h5
/ (Group) ''
  children := ['vlarray1' (VLArray)]
/vlarray1 (VLArray(4,), shuffle, zlib(1)) 'ragged array of ints'
  atom = Atom(type=Int32, shape=1, flavor='Numeric')
  nrows = 4
  flavor = 'Numeric'
  byteorder = 'little'
So we can see more info about the atoms that are the components of the vlarray1 dataset: they are scalars of type Int32 with Numeric flavor.
If we want information about the attributes on the nodes, we
must add the -a
flag:
$ ptdump -va vlarray1.h5
/ (Group) ''
  children := ['vlarray1' (VLArray)]
  /._v_attrs (AttributeSet), 5 attributes:
   [CLASS := 'GROUP', FILTERS := None, PYTABLES_FORMAT_VERSION := '1.2', TITLE := '', VERSION := '1.0']
/vlarray1 (VLArray(4,), shuffle, zlib(1)) 'ragged array of ints'
  atom = Atom(type=Int32, shape=1, flavor='Numeric')
  nrows = 4
  flavor = 'Numeric'
  byteorder = 'little'
  /vlarray1.attrs (AttributeSet), 4 attributes:
   [CLASS := 'VLARRAY', FLAVOR := 'Numeric', TITLE := 'ragged array of ints', VERSION := '1.0']
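As a side note, these attributes can also be read directly from Python; a minimal sketch, assuming the .attrs and ._v_attrs accessors described elsewhere in this manual:

import tables

# Leaves expose their attribute set as .attrs; groups use ._v_attrs.
fileh = tables.openFile("vlarray1.h5", mode="r")
vlarray = fileh.root.vlarray1
print(vlarray.attrs.TITLE)    # 'ragged array of ints'
print(vlarray.attrs.FLAVOR)   # 'Numeric'
print(fileh.root._v_attrs.PYTABLES_FORMAT_VERSION)   # '1.2'
fileh.close()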
Let's have a look at the real data:
$ ptdump -d vlarray1.h5
/ (Group) ''
/vlarray1 (VLArray(4,), shuffle, zlib(1)) 'ragged array of ints'
  Data dump:
[array([5, 6]), array([5, 6, 7]), array([5, 6, 9, 8]), array([ 5, 6, 9, 10, 12])]
Here we see a data dump of the 4 rows of the vlarray1 object, in the form of a list. Because the object is a VLArray, each row holds a different number of integers.
Say that we are interested only in a specific row range of the /vlarray1 object:
$ ptdump -R2,4 -d vlarray1.h5:/vlarray1
/vlarray1 (VLArray(4,), shuffle, zlib(1)) 'ragged array of ints'
  Data dump:
[array([5, 6, 9, 8]), array([ 5, 6, 9, 10, 12])]
Here, we have specified the range of rows between 2 and 4 (the upper limit excluded, as usual in Python). See how we have selected only the /vlarray1 object for the dump (vlarray1.h5:/vlarray1).
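The same row range can also be read from Python. A minimal sketch, assuming that VLArray.read() accepts start and stop arguments in the same way as the other leaf read methods:

import tables

# Read rows in the range [2, 4), like ptdump -R2,4 -d does.
# The start/stop arguments to VLArray.read() are an assumption here.
fileh = tables.openFile("vlarray1.h5", mode="r")
rows = fileh.root.vlarray1.read(2, 4)
print(rows)   # [array([5, 6, 9, 8]), array([ 5,  6,  9, 10, 12])]
fileh.close()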
Finally, you can mix several kinds of information at once:
$ ptdump -R2,4 -vad vlarray1.h5:/vlarray1
/vlarray1 (VLArray(4,), shuffle, zlib(1)) 'ragged array of ints'
  atom = Atom(type=Int32, shape=1, flavor='Numeric')
  nrows = 4
  flavor = 'Numeric'
  byteorder = 'little'
  /vlarray1.attrs (AttributeSet), 4 attributes:
   [CLASS := 'VLARRAY', FLAVOR := 'Numeric', TITLE := 'ragged array of ints', VERSION := '1.0']
  Data dump:
[array([5, 6, 9, 8]), array([ 5, 6, 9, 10, 12])]
The ptrepack utility is a very powerful one and lets you copy any leaf, group or complete subtree into another file. During the copy process you are allowed to change the filter properties if you want. Also, in the case of duplicated pathnames, you can decide whether you want to overwrite already existing nodes in the destination file. Generally speaking, ptrepack can be useful in many situations, like replicating a subtree in another file, changing the filters of objects and seeing how this affects the compression ratio or I/O performance, consolidating specific data in repositories, or even importing generic HDF5 files and creating true PyTables counterparts.
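Under the hood, ptrepack relies on the copying facilities of the library itself, so the same kind of operation can be scripted from Python. The following is only a rough sketch; copyChildren() and its keyword arguments are assumptions about the API of your installed version, so check them before relying on this:

import tables

# Recursively copy a sub-hierarchy of a source file into the root group
# of a freshly created destination file, changing nothing else.
# copyChildren() and its keyword names are assumptions; check your API.
h5in = tables.openFile("tutorial1.h5", mode="r")
h5out = tables.openFile("reduced.h5", mode="w")
h5in.copyChildren(h5in.root.columns, h5out.root, recursive=True)
h5out.close()
h5in.close()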
For instructions on how to use it, just pass the
-h
flag to the command:
$ ptrepack -h
to see the usage message:
usage: ptrepack [-h] [-v] [-o] [-R start,stop,step] [--non-recursive] [--dest-title=title] [--dont-copyuser-attrs] [--overwrite-nodes] [--complevel=(0-9)] [--complib=lib] [--shuffle=(0|1)] [--fletcher32=(0|1)] [--keep-source-filters] sourcefile:sourcegroup destfile:destgroup
  -h -- Print usage message.
  -v -- Show more information.
  -o -- Overwite destination file.
  -R RANGE -- Select a RANGE of rows (in the form "start,stop,step") during the copy of *all* the leaves.
  --non-recursive -- Do not do a recursive copy. Default is to do it.
  --dest-title=title -- Title for the new file (if not specified, the source is copied).
  --dont-copy-userattrs -- Do not copy the user attrs (default is to do it)
  --overwrite-nodes -- Overwrite destination nodes if they exist. Default is to not overwrite them.
  --complevel=(0-9) -- Set a compression level (0 for no compression, which is the default).
  --complib=lib -- Set the compression library to be used during the copy. lib can be set to "zlib", "lzo", "ucl" or "bzip2". Defaults to "zlib".
  --shuffle=(0|1) -- Activate or not the shuffling filter (default is active if complevel>0).
  --fletcher32=(0|1) -- Whether to activate or not the fletcher32 filter (not active by default).
  --keep-source-filters -- Use the original filters in source files. The default is not doing that if any of --complevel, --complib, --shuffle or --fletcher32 option is specified.
Imagine that we have finished tutorial 1 (see the output of examples/tutorial1-1.py), and we want to copy our reduced data (i.e. those datasets that hang from the /columns group) to another file. First, let's recall the content of examples/tutorial1.h5:
$ ptdump tutorial1.h5
Filename: 'tutorial1.h5' Title: 'Test file' , Last modif.: 'Fri Feb 6 19:33:28 2004' , rootUEP='/', filters=Filters(), Format version: 1.2
/ (Group) 'Test file'
/columns (Group) 'Pressure and Name'
/columns/name (Array(3,)) 'Name column selection'
/columns/pressure (Array(3,)) 'Pressure column selection'
/detector (Group) 'Detector information'
/detector/readout (Table(10L,)) 'Readout example'
Now, copy the /columns group to another (non-existing) file. That's easy:
$ ptrepack tutorial1.h5:/columns reduced.h5
That's all. Let's see the contents of the newly created
reduced.h5
file:
$ ptdump reduced.h5
Filename: 'reduced.h5' Title: '' , Last modif.: 'Fri Feb 20 15:26:47 2004' , rootUEP='/', filters=Filters(), Format version: 1.2
/ (Group) ''
/name (Array(3,)) 'Name column selection'
/pressure (Array(3,)) 'Pressure column selection'
So, you have copied the children of the /columns group into the root of the reduced.h5 file.
Now, you suddenly realize that what you intended to do was to copy the whole hierarchy, the group /columns itself included. You can do that by just specifying the destination group:
$ ptrepack tutorial1.h5:/columns reduced.h5:/columns
$ ptdump reduced.h5
Filename: 'reduced.h5' Title: '' , Last modif.: 'Fri Feb 20 15:39:15 2004' , rootUEP='/', filters=Filters(), Format version: 1.2
/ (Group) ''
/name (Array(3,)) 'Name column selection'
/pressure (Array(3,)) 'Pressure column selection'
/columns (Group) ''
/columns/name (Array(3,)) 'Name column selection'
/columns/pressure (Array(3,)) 'Pressure column selection'
OK. Much better. But you want to get rid of the existing nodes in the new file. You can achieve this by adding the -o flag:
$ ptrepack -o tutorial1.h5:/columns reduced.h5:/columns
$ ptdump reduced.h5
Filename: 'reduced.h5' Title: '' , Last modif.: 'Fri Feb 20 15:41:57 2004' , rootUEP='/', filters=Filters(), Format version: 1.2
/ (Group) ''
/columns (Group) ''
/columns/name (Array(3,)) 'Name column selection'
/columns/pressure (Array(3,)) 'Pressure column selection'
where you can see how the old contents of the reduced.h5 file have been overwritten.
You can copy just one single node in the repacking operation and change its name in the destination:
$ ptrepack tutorial1.h5:/detector/readout reduced.h5:/rawdata
$ ptdump reduced.h5
Filename: 'reduced.h5' Title: '' , Last modif.: 'Fri Feb 20 15:52:22 2004', rootUEP='/', filters=Filters(), Format version: 1.2
/ (Group) ''
/rawdata (Table(10L,)) 'Readout example'
/columns (Group) ''
/columns/name (Array(3,)) 'Name column selection'
/columns/pressure (Array(3,)) 'Pressure column selection'
where /detector/readout has been copied to /rawdata in the destination.
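The equivalent single-node copy can also be scripted from Python. This is just a sketch; it assumes that Leaf.copy() accepts a destination group (possibly belonging to another file) and a new name:

import tables

# Copy a single leaf into another (already existing) file, renaming it on
# the way. Leaf.copy() accepting a group of another file is an assumption.
h5in = tables.openFile("tutorial1.h5", mode="r")
h5out = tables.openFile("reduced.h5", mode="a")
readout = h5in.root.detector.readout
readout.copy(h5out.root, "rawdata")
h5out.close()
h5in.close()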
We can change the filter properties as well:
$ ptrepack --complevel=1 tutorial1.h5:/detector/readout reduced.h5:/rawdata
Problems doing the copy from 'tutorial1.h5:/detector/readout' to 'reduced.h5:/rawdata'
The error was --> exceptions.ValueError: The destination (/rawdata (Table(10L,)) 'Readout example') already exists. Assert the overwrite parameter if you really want to overwrite it.
The destination file looks like:
Filename: 'reduced.h5' Title: ''; Last modif.: 'Fri Feb 20 15:52:22 2004'; rootUEP='/'; filters=Filters(), Format version: 1.2
/ (Group) ''
/rawdata (Table(10L,)) 'Readout example'
/columns (Group) ''
/columns/name (Array(3,)) 'Name column selection'
/columns/pressure (Array(3,)) 'Pressure column selection'
Traceback (most recent call last):
  File "../utils/ptrepack", line 358, in ?
    start=start, stop=stop, step=step)
  File "../utils/ptrepack", line 111, in copyLeaf
    raise RuntimeError, "Please, check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired."
RuntimeError: Please, check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired.
Ooops! We ran into problems: we forgot that the /rawdata pathname already existed in the destination file. Let's add the --overwrite-nodes flag, as the verbose error suggested:
$ ptrepack --overwrite-nodes --complevel=1 tutorial1.h5:/detector/readout reduced.h5:/rawdata
$ ptdump reduced.h5
Filename: 'reduced.h5' Title: ''; Last modif.: 'Fri Feb 20 16:02:20 2004'; rootUEP='/'; filters=Filters(), Format version: 1.2
/ (Group) ''
/rawdata (Table(10L,), shuffle, zlib(1)) 'Readout example'
/columns (Group) ''
/columns/name (Array(3,)) 'Name column selection'
/columns/pressure (Array(3,)) 'Pressure column selection'
You can check how the filter properties have been changed for the /rawdata table. Note that the other nodes still exist.
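From Python, the new filter set used above would be described with a Filters instance; whether a given copy method accepts it through a filters keyword depends on your version, so take this only as a sketch:

import tables

# Filter set equivalent to "--complevel=1": zlib level 1 plus shuffling
# (shuffle is active by default when complevel > 0).
new_filters = tables.Filters(complevel=1, complib="zlib", shuffle=1)
print(new_filters)
# Passing this object through a "filters" argument of a copy method would
# mirror the ptrepack call above; that argument name is an assumption.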
Finally, let's copy a slice of the readout table from the origin to the destination, under a new group called /slices and with the name, for example, aslice:
$ ptrepack -R1,8,3 tutorial1.h5:/detector/readout reduced.h5:/slices/aslice
$ ptdump reduced.h5
Filename: 'reduced.h5' Title: ''; Last modif.: 'Fri Feb 20 16:17:13 2004'; rootUEP='/'; filters=Filters(); Format version: 1.2
/ (Group) ''
/rawdata (Table(10L,), shuffle, zlib(1)) 'Readout example'
/columns (Group) ''
/columns/name (Array(3,)) 'Name column selection'
/columns/pressure (Array(3,)) 'Pressure column selection'
/slices (Group) ''
/slices/aslice (Table(3L,)) 'Readout example'
Note how only 3 rows of the original readout table have been copied to the new aslice destination (rows 1, 4 and 7, following the "1,8,3" range specification). Note as well how the previously non-existent slices group has been created in the same operation.
The nctoh5 tool is able to convert a file in NetCDF format to a PyTables file (and hence, to an HDF5 file). However, for this to work, you will need the NetCDF interface for Python that comes with the excellent Scientific Python package (see [16]). This script was initially contributed by Jeff Whitaker. It has been updated to support selectable filters from the command line and some other small improvements.
If you want other file formats to be converted to PyTables, have a look at the SciPy project (see [17]), in particular its io subpackage, and look for methods to import them into NumPy/Numeric/numarray objects. Following the SciPy documentation, you can read, among other formats, ASCII files (read_array), binary files written from C or Fortran (fopen) and MATLAB (version 4, 5 or 6) files (loadmat). Once you have the content of your files as NumPy/Numeric/numarray objects, you can save them as regular (E)Arrays in PyTables files. Remember, if you end up with a nice converter, do not forget to contribute it back to the community. Thanks!
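As an illustration of the last point, here is a minimal sketch of such a conversion: it reads an ASCII file into an array with the old-style scipy.io.read_array function and stores it as a regular Array in a new PyTables file (the file and node names are hypothetical):

import tables
from scipy.io import read_array   # old-style SciPy ASCII reader

# Hypothetical file names: read a whitespace-separated ASCII file into an
# array and store it as a regular Array in a brand-new PyTables file.
data = read_array("mydata.txt")
fileh = tables.openFile("mydata.h5", mode="w", title="Imported data")
fileh.createArray(fileh.root, "mydata", data, "Data imported from mydata.txt")
fileh.close()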
For instructions on how to use it, just pass the
-h
flag to the command:
$ nctoh5 -h
to see the usage message:
usage: nctoh5 [-h] [-v] [-o] [--complevel=(0-9)] [--complib=lib] [--shuffle=(0|1)] [--fletcher32=(0|1)] [--unpackshort=(0|1)] [--quantize=(0|1)] netcdffilename hdf5filename
  -h -- Print usage message.
  -v -- Show more information.
  -o -- Overwite destination file.
  --complevel=(0-9) -- Set a compression level (0 for no compression, which is the default).
  --complib=lib -- Set the compression library to be used during the copy. lib can be set to "zlib", "lzo", "ucl" or "bzip2". Defaults to "zlib".
  --shuffle=(0|1) -- Activate or not the shuffling filter (default is active if complevel>0).
  --fletcher32=(0|1) -- Whether to activate or not the fletcher32 filter (not active by default).
  --unpackshort=(0|1) -- unpack short integer variables to float variables using scale_factor and add_offset netCDF variable attributes (not active by default).
  --quantize=(0|1) -- quantize data to improve compression using least_significant_digit netCDF variable attribute (not active by default). See http://www.cdc.noaa.gov/cdc/conventions/cdc_netcdf_standard.shtml for further explanation of what this attribute means.
If you have followed the small tutorial on the ptrepack utility (see C.2), you should readily understand what most of the different flags mean.