... durch planmässiges Tattonieren. [... through systematic, palpable experimentation.]
In this chapter, you will get deeper knowledge of PyTables internals. PyTables has several places where the user can improve the performance of an application. If you are planning to deal with really large data, you should read this section carefully in order to learn how to get an important efficiency boost for your code. But if your datasets are small or medium sized (say, up to 10 MB), you should not worry about that, as the default parameters in PyTables are already tuned to handle them perfectly.
The underlying HDF5 library used by PyTables allows certain datasets (chunked datasets) to take the data in bunches of a certain length, the so-called chunks, and write them to disk as a whole, i.e. the HDF5 library treats chunks as atomic objects and disk I/O is always made in terms of complete chunks. This allows data filters to be defined by the application to perform tasks such as compression, encryption, checksumming, etc. on entire chunks.
An in-memory B-tree is used to map chunk structures on disk. The more chunks that are allocated for a dataset the larger the B-tree. Large B-trees take memory and cause file storage overhead as well as more disk I/O and higher contention for the metadata cache. Consequently, it's important to balance between memory and I/O overhead (small B-trees) and time to access data (big B-trees).
PyTables can determine an optimum chunk size that keeps the B-tree adequate for your dataset size if you help it by providing an estimate of the number of rows of a table. This must be done at table creation time by passing this value to the expectedrows keyword of the createTable method (see description).
When your table size is bigger than 10 MB (take this figure only as a reference, not strictly), providing this guess of the number of rows will optimize the access to your data. When the table size is larger than, say, 100 MB, you are strongly encouraged to provide such a guess; failing to do so may cause your application to do very slow I/O operations and to demand huge amounts of memory. You have been warned!
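For illustration, here is a minimal sketch of how expectedrows is passed at creation time. The file name and the Particle description are hypothetical; adapt them to your own schema:

from tables import openFile, IsDescription, IntCol, FloatCol

class Particle(IsDescription):
    var1 = IntCol()       # hypothetical record layout, just for illustration
    var2 = FloatCol()

fileh = openFile("large.h5", mode="w")
# Telling PyTables that about 10 million rows will be appended lets it choose
# a chunk size (and hence a B-tree) suited to a table of that size.
table = fileh.createTable(fileh.root, "bigtable", Particle, "A large table",
                          expectedrows=10*1000*1000)
fileh.close()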
If you are going to use a lot of searches like the next one:
result = [ row['var2'] for row in table if row['var1'] <= 20 ]
(for future reference, we will call this the standard selection mode) and you want to improve the time taken to run it, keep reading.
PyTables provides a way to accelerate data selections when they are simple, i.e. when only one column is involved in the selection process, through the use of the where iterator (see description). We will call this mode of selecting data in-kernel. Let's see an example of in-kernel selection based on the standard selection mentioned above:
result = [ row['var2'] for row in table.where(table.cols.var1 <= 20) ]
This simple change of selection mode can account for an improvement in search times of up to a factor of 10 (see Figure 5.1).
Figure 5.1. Times for different selection modes over Int32 values. Benchmark made on a machine with Itanium (IA64) @ 900 MHz processors with SCSI disk @ 10K RPM.
Figure 5.2. Times for different selection modes over Float64 values. Benchmark made on a machine with Itanium (IA64) @ 900 MHz processors with SCSI disk @ 10K RPM.
So, where is the trick? It's easy. In the standard selection mode the data for column var1 has to be carried up to Python space in order to evaluate the condition and decide whether the var2 value should be added to the result list. In contrast, in the in-kernel mode, the condition is passed to the PyTables kernel (hence the name), written in C, and evaluated there at C speed (with some help from the numarray package), so that the only values brought to Python space are the references to the rows that fulfilled the condition.
You should note, however, that currently the where method only accepts conditions along a single column[14]. Fortunately, you can mix the in-kernel and standard selection modes for evaluating arbitrarily complex conditions along several columns at once. Look at this example:
result = [ row['var2'] for row in table.where(table.cols.var3 == "foo")
           if row['var1'] <= 20 ]
Here, we have used an in-kernel selection to filter the rows whose var3 field is equal to the string "foo". Then, we apply a standard selection to complete the query.
Of course, when you mix the in-kernel and standard selection modes you should pass the most restrictive condition to the in-kernel part, i.e. to the where iterator. In situations where it is not clear which is the most restrictive condition, you might want to experiment a bit in order to find the best combination.
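As a quick way of checking which ordering performs better, you can simply time both variants on your own data. The following sketch assumes a table like the ones used above, with columns var1, var2 and var3; the column names and conditions are only illustrative:

from time import time

# Variant 1: the numeric condition goes in-kernel, the string one stays in Python space.
t0 = time()
res1 = [ r['var2'] for r in table.where(table.cols.var1 <= 20)
         if r['var3'] == "foo" ]
print "var1 in-kernel: %.3f s" % (time() - t0)

# Variant 2: the string condition goes in-kernel, the numeric one stays in Python space.
t0 = time()
res2 = [ r['var2'] for r in table.where(table.cols.var3 == "foo")
         if r['var1'] <= 20 ]
print "var3 in-kernel: %.3f s" % (time() - t0)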
When you need more speed than in-kernel selections can offer you, PyTables offers a third selection method, the so-called indexed mode. In this mode, you have to decide which column(s) you are going to do your selections on, and index them. Indexing is just a kind of sort operation, so that later searches along a column can look at the sorted information using a binary search, which is much faster than a sequential search.
You can index your selected columns in several ways:
First, you can declare a column as being indexed at table creation time by passing the indexed parameter to the column descriptor. That is:
class Example(IsDescription):
    var1 = StringCol(length=4, dflt="", pos=1, indexed=1)
    var2 = BoolCol(0, indexed=1, pos=2)
    var3 = IntCol(0, indexed=1, pos=3)
    var4 = FloatCol(0, indexed=0, pos=4)
In this case, we are stating that the var1, var2 and var3 columns will be indexed automatically when you add rows to a table with this description.
Second, you can create an index on an already existing table. For example:
indexrows = table.cols.var1.createIndex()
indexrows = table.cols.var2.createIndex()
indexrows = table.cols.var3.createIndex()
will create indexes for the var1, var2 and var3 columns and, after doing that, they will behave as regular indexes.
After you have indexed a column, you can use it through the Table.where method:
result = [ row['var2'] for row in table.where(table.cols.var1 == "foo") ]
or, if you want to add more conditions, you can mix the indexed selection with a standard one:
result = [ row['var2'] for row in table.where(table.cols.var3 <= 20)
           if row['var1'] == "foo" ]
Remember to pass the most restrictive condition to the where iterator.
You can see in Figures 5.1 and 5.2 that indexing can accelerate your data selections in tables quite a lot. For moderately large tables (> one million rows), you can get speedups on the order of 100x with respect to in-kernel selections, and on the order of 1000x with respect to standard selections.
One important aspect of indexing in PyTables is that it has been implemented with the goal of being capable of managing very large tables effectively. In Figure 5.3, you can see that the time to index columns in tables always grows linearly. In particular, the time to index a couple of columns with 1 billion rows each is 40 min. (roughly 20 min. each), which is a quite reasonable figure. This is because PyTables has chosen an algorithm that does a partial sort of the columns in order to ensure that the indexing time grows linearly. In contrast, most relational databases try to do a complete sort of the columns, which makes the time to index grow much faster with the number of rows.
The fact that relational databases use a complete sorting algorithm for indexes means that their indexes will be somewhat more effective (but not by a large extent) for searching purposes than the PyTables approach. However, for relatively large tables (> 10 million rows) the time required to complete such a sort can be so long that indexing is normally not worth the effort. In other words, PyTables indexing scales much better than relational databases. So don't worry if you have extremely large columns to index: PyTables is designed to cope with that perfectly.
Figure 5.3. Times for indexing a couple of columns of data type Int32 and Float64. Benchmark made on a machine with Itanium (IA64) @ 900 MHz processors with SCSI disk @ 10K RPM.
One of the beauties of PyTables is that it supports compression on tables and arrays[15], although it is not used by default. Compressing large amounts of data might be a somewhat controversial feature, because compression has a reputation for being a very big consumer of CPU resources. However, if you are willing to check whether compression can help not only by reducing your dataset file size but also by improving I/O efficiency, especially when dealing with very large datasets, keep reading.
There is a common scenario where users need to save duplicated data in some record fields, while other fields have varying values. In a relational database approach such redundant data can normally be moved to other tables and a relationship between the rows in the separate tables can be created. But that takes analysis and implementation time, and makes the underlying libraries more complex and slower.
PyTables transparent compression allows users not to worry about finding the optimum strategy for their data tables, but rather to use fewer, not directly related, tables with a larger number of columns, while still not cluttering the database too much with duplicated data (compression takes care of avoiding that). As a side effect, data selections can be made more easily because you have more fields available in a single table, and they can be referred to in the same loop. This normally results in a simpler, yet powerful, manner of processing your data (although you should still be careful about the kinds of scenarios in which the use of compression is convenient or not).
The compression library used by default is Zlib (see [5]). Since HDF5 requires it, you can safely use it and expect that your HDF5 files will be readable on any other platform that has the HDF5 libraries installed. Zlib provides a good compression ratio, although it is somewhat slow at compressing and reasonably fast at decompressing. Because of that, it is a good candidate for compressing your data.
However, in some situations it is critical to have very good decompression speed (at the expense of lower compression ratios or more CPU spent on compression, as we will see soon). In others, the emphasis is put on achieving the maximum compression ratio, no matter which reading speed results. This is why support for two additional compressors has been added to PyTables: LZO (see [13]) and bzip2 (see [14]). According to the author of LZO (and checked by the author of this section, as you will see soon), LZO offers pretty fast compression (though a small compression ratio) and extremely fast decompression. In fact, LZO is so fast when compressing/decompressing that it may well happen (depending on your data, of course) that writing or reading a compressed dataset is sometimes faster than if it is not compressed at all (especially when dealing with extremely large datasets). This fact is very important, especially if you have to deal with very large amounts of data. Regarding bzip2, it has a reputation for achieving excellent compression ratios, but at the price of spending much more CPU time, which results in very low compression/decompression speeds.
Be aware that the LZO and bzip2 support in PyTables is not standard in HDF5, so if you are going to use your PyTables files in contexts other than PyTables you will not be able to read them. Still, see Section C.2 (where the ptrepack utility is described) to find a way to free your files from LZO or bzip2 dependencies, so that you can use these compressors locally with the guarantee that you can replace them with Zlib (or even remove compression completely) if you want to use these files with other HDF5 tools or platforms afterwards.
In order to allow you to grasp what amount of compression can be achieved, and how this affects performance, a series of experiments has been carried out. All the results presented in this section (and in the next one) have been obtained with synthetic data and using PyTables 1.3. Also, the tests have been conducted on an IBM OpenPower 720 (e-series) with a PowerPC G5 at 1.65 GHz and a hard disk spinning at 15K RPM. As your data and platform may be totally different, take this just as a guide, because your mileage will probably vary. Finally, and in order to be able to play with tables with as many rows as possible, the record size has been chosen to be small (16 bytes). Here is its definition:
class Bench(IsDescription):
    var1 = StringCol(length=4)
    var2 = IntCol()
    var3 = FloatCol()
With this setup, you can look at the compression ratios that can be achieved in Figure 5.4. As you can see, LZO is the compressor that performs worst in this sense but, curiously enough, there is not much difference between Zlib and bzip2.
Also, PyTables lets you select different compression levels for Zlib and bzip2, although you may be a bit disappointed by the small improvement these compressors show when dealing with a combination of numbers and strings as in our example. As a reference, see plot 5.5 for a comparison of the compression achieved by selecting different levels of Zlib. Very oddly, the best compression ratio corresponds to level 1 (!). This is difficult to explain, but it serves to reaffirm that there is no replacement for experiments with your own data. In general, it is recommended to select the lowest level of compression in order to achieve the best performance and a decent (if not the best!) compression ratio. See later for more figures in this regard.
Also have a look at Figure 5.6. It shows how the speed of writing rows evolves as the size (the number of rows) of the table grows. Even though in these graphs the size of one single row is 16 bytes, you can most probably extrapolate these figures to other row sizes.
In Figure 5.7 you can see how compression affects reading performance. In fact, what you see in the plot is an in-kernel selection speed, but provided that this operation is very fast (see Section 5.2.1), we can accept it as an actual read test. Compared with the reference line without compression, the general trend here is that LZO does not affect reading performance very much (and in some points it is actually better), Zlib makes the speed drop to half, while bzip2 performs very slowly (up to 8x slower).
Also, in the same Figure 5.7 you can notice some strange peaks in the speed that we might be tempted to attribute to the libraries on which PyTables relies (HDF5, compressors...), or to PyTables itself. However, Figure 5.8 reveals that, if we put the file in the filesystem cache (by reading it several times beforehand, for example), the evolution of the performance is much smoother. So, the most probable explanation is that such peaks are a consequence of the underlying OS filesystem, rather than a flaw in PyTables (or any other library behind it). Another consequence that can be derived from the above plot is that LZO decompression performance is much better than Zlib's, allowing an improvement in overall speed of more than 2x and, perhaps more importantly, the read performance for really large datasets (i.e. when they do not fit in the OS filesystem cache) can actually be better than not using compression at all. Finally, one can see that reading performance is very badly affected when bzip2 is used (it is 10x slower than LZO and 4x slower than Zlib), but this is not too strange anyway.
So, generally speaking and looking at the experiments above, you can expect that LZO will be the fastest at both compressing and decompressing, but the one that achieves the worst compression ratio (although that may be just fine for many situations, especially when used in combination with the shuffle filter described in Section 5.4). bzip2 is by far the slowest at both compressing and decompressing and, besides, it does not achieve any better compression ratio than Zlib. Zlib represents a balance between them: it is somewhat slower than LZO at compressing (2x) and decompressing (3x), but it normally achieves fairly good compression ratios.
Finally, by looking at plots 5.9, 5.10 and the aforementioned 5.5, you can see why the recommended compression level for all compression libraries is 1. This is the lowest level of compression, but if you take the approach suggested above, the redundant data will normally be found in the same rows, making redundancy locality very high, so that a low level of compression should be enough to achieve a good compression ratio on your data tables, saving CPU cycles for doing other things. Nonetheless, in some situations you may want to check for yourself how the different compression levels affect your application.
You can select the compression library and level by setting the complib and complevel keywords in the Filters class (see 4.17.1). A compression level of 0 will completely disable compression (the default), 1 is the least CPU-demanding level, while 9 is the maximum level and the most CPU intensive. Finally, keep in mind that LZO does not accept a compression level right now, so, when using LZO, 0 means that compression is not active, and any other value means that LZO is active.
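As a minimal sketch of how these keywords fit together (the file and table names are hypothetical, and the Bench description from above is assumed), you could create two versions of the same table with different compressors and compare their sizes and I/O speed:

from tables import openFile, Filters

fileh = openFile("compressed.h5", mode="w")
# Same record layout, two different compressors, both at level 1.
table_lzo = fileh.createTable(fileh.root, "lzo1", Bench, "LZO level 1",
                              filters=Filters(complevel=1, complib="lzo"),
                              expectedrows=1000*1000)
table_zlib = fileh.createTable(fileh.root, "zlib1", Bench, "Zlib level 1",
                               filters=Filters(complevel=1, complib="zlib"),
                               expectedrows=1000*1000)
fileh.close()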
So, in conclusion, if your ultimate goal is writing and reading as fast as possible, choose LZO. If you want to reduce your data as much as possible, while retaining acceptable read speed, choose Zlib. Finally, if portability is important for you, Zlib is your best bet. So, when would you want to use bzip2? Well, looking at the results, it is difficult to recommend its use in general, but you may want to experiment with it in those cases where you know that it is well suited to your data pattern (for example, for dealing with repetitive string datasets).
Figure 5.10. Selecting values in tables with different levels of compression. The file is in the OS cache.
The HDF5 library provides an interesting filter that can leverage the results of your favorite compressor. Its name is shuffle and, because it can greatly benefit compression while not taking many CPU resources (see below for a justification), it is active by default in PyTables whenever compression is activated (independently of the chosen compressor). It is of course deactivated when compression is off (which is the default, as you should already know). Of course, you can deactivate it if you want, but this is not recommended.
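If you do want to deactivate it anyway (for instance, to compare both cases), the shuffle keyword of the Filters class can be used for that. A minimal sketch, again assuming the Bench description from above and a hypothetical file name:

from tables import openFile, Filters

fileh = openFile("noshuffle.h5", mode="w")
# shuffle=0 turns the shuffle filter off; it is on by default when complevel > 0.
table = fileh.createTable(fileh.root, "zlib1_noshuffle", Bench, "Zlib, no shuffle",
                          filters=Filters(complevel=1, complib="zlib", shuffle=0))
fileh.close()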
So, how exactly does this mysterious filter work? From the HDF5 reference manual: “The shuffle filter de-interlaces a block of data by reordering the bytes. All the bytes from one consistent byte position of each data element are placed together in one block; all bytes from a second consistent byte position of each data element are placed together a second block; etc. For example, given three data elements of a 4-byte datatype stored as 012301230123, shuffling will re-order data as 000111222333. This can be a valuable step in an effective compression algorithm because the bytes in each byte position are often closely related to each other and putting them together can increase the compression ratio.”
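To make the quoted example concrete, here is a tiny, purely illustrative Python snippet (not PyTables code) that reproduces the byte reordering on the 012301230123 pattern:

elements = ["0123", "0123", "0123"]            # three 4-byte elements
block = "".join(elements)                      # "012301230123", as laid out on disk
# Gather byte 0 of every element, then byte 1 of every element, and so on.
shuffled = "".join([block[i::4] for i in range(4)])
print shuffled                                 # prints "000111222333"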
In Figure 5.11 you can see a benchmark that shows how the shuffle filter can help the different libraries compress data. In this experiment, shuffle has made LZO compress almost 3x better (!), while Zlib and bzip2 see improvements of 2x. Once again, the data for this experiment is synthetic, and shuffle seems to do a great job with it, but in general the results will vary in each case[16].
Figure 5.11. Comparison between different compression libraries with and without the shuffle filter.
At any rate, the most remarkable fact about the shuffle filter is the relatively high level of compression that compressor filters can achieve when used in combination with it. A curious thing to note is that the bzip2 compression ratio does not seem much improved (less than 40%) and, what is more striking, bzip2+shuffle compresses quite a bit less than the Zlib+shuffle or LZO+shuffle combinations, which is kind of unexpected. What seems clear is that bzip2 is not very good at compressing the patterns that result from the shuffle operation. As always, you may want to experiment with your own data before widely applying the bzip2+shuffle combination, in order to avoid surprises.
Now, how does shuffling affect performance? Well, if you look at plots 5.12, 5.13 and 5.14, you will get a somewhat unexpected (but pleasant) surprise. Roughly, shuffle makes the writing process (shuffling+compressing) faster (approximately 15% for LZO, 30% for bzip2 and 80% for Zlib), which is an interesting result by itself. But perhaps more exciting is the fact that the reading process (unshuffling+decompressing) is also accelerated by a similar extent (roughly 20% for LZO, 60% for Zlib and 75% for bzip2).
Figure 5.13. Reading with different compression libraries with the shuffle filter. The file is not in OS cache.
Figure 5.14. Reading with different compression libraries with and without the shuffle filter. The file is in OS cache.
You may wonder why introducing another filter into the write/read pipelines effectively accelerates the throughput. Well, maybe data elements are more similar or related column-wise than row-wise, i.e. contiguous elements in the same column are more alike, so shuffling makes the job of the compressor easier (faster) and more effective (greater ratios). As a side effect, compressed chunks fit better in the CPU cache (at least, the chunks are smaller!) so that the unshuffle/decompress process can make better use of the cache (i.e. reducing the number of CPU cache misses).
So, given the potential gains (faster writing and reading, but especially the much improved compression level), it is a good thing to have such a filter enabled by default in the battle for discovering redundancy when you want to compress your data, just as PyTables does.
Psyco (see [15]) is a kind of specialized compiler for Python that typically accelerates Python applications with no change in source code. You can think of Psyco as a kind of just-in-time (JIT) compiler, a little bit like Java's, that emits machine code on the fly instead of interpreting your Python program step by step. The result is that your unmodified Python programs run faster.
Psyco is very easy to install and use, so in most scenarios it is worth giving it a try. However, it only runs on Intel 386 architectures, so if you are using another architecture, you are out of luck (at least until Psyco supports yours).
As an example, imagine that you have a small script that reads and selects data over a series of datasets, like this:
from tables import openFile

def readFile(filename):
    "Select data from all the tables in filename"
    fileh = openFile(filename, mode="r")
    result = []
    for table in fileh.walkNodes("/", "Table"):
        result += [ p['var3'] for p in table if p['var2'] <= 20 ]
    fileh.close()
    return result

if __name__ == "__main__":
    print readFile("myfile.h5")
In order to accelerate this piece of code, you can rewrite your main program to look like:
if __name__ == "__main__":
    import psyco
    psyco.bind(readFile)
    print readFile("myfile.h5")
That's all! From now on, each time you execute your Python script, Psyco will deploy its sophisticated algorithms to accelerate your calculations.
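If you prefer not to pick the functions to accelerate one by one, Psyco can also be asked to compile as much of your program as it can. This is only a sketch of that alternative; whether it pays off depends on your script:

if __name__ == "__main__":
    import psyco
    psyco.full()          # let Psyco compile everything it can, not just readFile
    print readFile("myfile.h5")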
You can see in graphs 5.15 and 5.16 how much I/O speed improvement you can get by using Psyco. By looking at these figures you can get an idea of whether these improvements are of interest to you or not. In general, if you are not going to use compression you will take advantage of Psyco if your tables are medium sized (from a thousand to a million rows), and this advantage will disappear progressively as the number of rows grows well over one million. However, if you use compression, you will probably see improvements even beyond this limit (see Section 5.3). As always, there is no substitute for experimentation with your own dataset.
Starting with PyTables 1.2, a new LRU cache has been introduced that avoids loading all the nodes of the object tree into memory. This cache is responsible for loading just up to a certain number of nodes and discarding the least recently used ones when new ones need to be loaded. This represents a big advantage over the old schema, especially in terms of memory usage (as there is no need to load every node in memory), but it also adds very convenient optimizations for working interactively, like, for example, speeding up the opening times of files with lots of nodes, allowing almost any kind of file to be opened in typically less than one tenth of a second (compare this with the more than 10 seconds for files with more than 10000 nodes in the PyTables pre-1.2 era). See [18] for more info on the advantages (and also drawbacks) of this approach.
One thing that deserves some discussion is the choice of the parameter that sets the maximum number of nodes to be held in memory at any time. As PyTables is meant to be deployed on machines that may have little memory, the default for it is quite conservative (you can look at its actual value in the NODE_CACHE_SIZE parameter in the module tables/constants.py). However, if you usually have to deal with files that have many more nodes than the default maximum, and you have a lot of free memory in your system, then you may want to experiment in order to find the value of NODE_CACHE_SIZE that better fits your needs.
As an example, look at the next code:
def browse_tables(filename):
    fileh = openFile(filename, 'a')
    group = fileh.root.newgroup
    for j in range(10):
        for tt in fileh.walkNodes(group, "Table"):
            title = tt.attrs.TITLE
            for row in tt:
                pass
    fileh.close()
We will be running the code above against a couple of files having a /newgroup containing 100 tables and 1000 tables respectively. We will run this small benchmark for different values of the LRU cache size, namely 256 and 1024. You can see the results in Table 5.1.
                                         100 nodes                      1000 nodes
                                  Memory (MB)    Time (ms)       Memory (MB)    Time (ms)
Node is coming from...  Cache size:  256  1024    256   1024       256   1024    256   1024
From disk                             14    14   1.24   1.24        51     66   1.33   1.31
From cache                            14    14   0.53   0.52        65     73   1.35   0.68
Table 5.1. Retrieval speed and memory consumption as a function of the number of nodes in the LRU cache.
From the data in Table 5.1, one can see that, when the number of objects you are dealing with fits in the cache, you will get better access times to them. Also, increasing the node cache size effectively consumes more memory only if the total number of nodes exceeds the number of slots in the cache; otherwise the memory consumption remains the same. It is also worth noting that increasing the node cache size so that all your nodes fit in the cache does not take much more memory than a conservative setting. On the other hand, the speed-up that you can achieve by allocating more slots in your cache may not be worth the amount of memory used.
Anyway, if you feel that this issue is important for you, set up your own experiments and proceed to fine-tune the NODE_CACHE_SIZE parameter.
Note: After the introduction of the new object tree cache in PyTables 1.2, this feature is not very useful anymore and might become deprecated in future versions.
If you have a huge tree with many nodes in your data file, creating the object tree will take a long time. Many times, however, you are only interested in accessing a part of the complete tree, so you won't strictly need PyTables to build the entire object tree in memory, but only the interesting part.
This is where the rootUEP parameter of the openFile function (see description) can be helpful. Imagine that you have a file called "test.h5" with the associated tree that you can see in Figure 5.17, and you are interested only in the section marked in red. You can avoid building the whole object tree by telling openFile that your root will be the /Group2/Group3 group. That is:
fileh = openFile("test.h5", rootUEP="/Group2/Group3")
As a result, the actual object tree built will be like the one that can be seen in Figure 5.18.
Of course this has been a simple example and the use of the rootUEP parameter was not really necessary. But when you have thousands of nodes in a tree, you will certainly appreciate the rootUEP parameter.
Let's suppose that you have a file in which you have made a lot of row deletions on one or more tables, or deleted many leaves or even entire subtrees. These operations might leave holes (i.e. space that is no longer used) in your files, which may potentially affect not only the size of the files but, more importantly, I/O performance. This is because when you delete a lot of rows in a table, the space is not automatically recovered on the fly. In addition, if you add many more rows to a table than specified in the expectedrows keyword at creation time, this may affect performance as well, as explained in Section 5.1.
In order to cope with these issues, a handy PyTables utility called ptrepack can be very useful, not only to compact your existing leaky files, but also to adjust some internal parameters (both in memory and on file) in order to create adequate buffer sizes and chunk sizes for optimum I/O speed. Please check Section C.2 for a brief tutorial on its use.
Another thing that you might want to use ptrepack for is changing the compression filters or compression levels of your existing data for different goals, like checking how this can affect both final size and I/O performance, or getting rid of the optional compressors like LZO, UCL or bzip2 in your existing files in case you want to use them with generic HDF5 tools that do not have support for these filters.