Software Quarterly

Database Parallelism:
Unparalleled Performance

By Janet Hedges

Janet Hedges is a product planner for DB2 client/server products at the IBM Software Solutions Laboratory in Toronto, Canada. Before joining IBM five years ago, she was a DB2 system programmer for the Royal Bank of Canada. Janet holds a bachelor of mathematics in computer science from the University of Waterloo, Waterloo, Ontario.


Any business runs on data -- barrels of data on payroll costs, benefits, taxes, sales, expenses, customer orders, invoices, and a whole lot more. But data alone can't support a company's efforts to increase revenue, reduce costs, satisfy customers, and gain a competitive edge.

A business also needs a "killer" application, one that can help make it a contender in its industry -- or even a champion. That application should be able to race through data and extract vital information instantly. More than that, it should be able to quickly spot business trends and alert management to any that are beginning to get out of line. Are material costs exceeding estimates? Are back orders piling up? Is the inventory getting low? Are profit margins narrowing?

The trends are all there, hidden in the data, waiting to be plucked out. Armed with such information, managers can take corrective actions immediately. And the winning application that can do all that is a relational database management system that uses the latest software innovation -- parallelism -- to get the toughest job done in a wink.

Such a relational database management system, or RDBMS, provides four key benefits. It can increase throughput with better price performance. It can reduce the time it takes to access data and answer queries. It can provide flexible scalability. And it can help ensure high system availability.

Leading-edge RDBMS software is available now. In fact, this year IBM brought parallel processing into the mainstream by introducing three new fully parallel database products (see related story). This article relates the key benefits provided by these new products.


Parallel technology can help your company keep on top of its business and on top of its bottom line.


Four Reasons Why

First, depending on the application, parallel databases can process more transactions in a given time, leading to increased revenue -- in stock trading, for example.

Moreover, because parallel databases offer better price/performance, a company can achieve that increased revenue potential at a lower cost. Today's commercial parallel systems keep costs down by using low-cost processor components, such as CMOS- and RISC-based processors, and those savings can be passed on to customers.

Second, reduced query time will allow users to build a new class of applications and queries to gain more in-depth knowledge of their businesses. In fact, the queries could retrieve information so important that a company would gain a competitive advantage -- allowing it to substantially increase revenues or reduce costs.

A simple example occurs in target marketing. A company trying to sell a new product will have a better chance of success if it can identify only those customers most likely to buy it. The company could query a database of purchasing information for the last five years and retrieve a list of customers who have purchased similar or related products. As a result, it could avoid unnecessary expenses -- such as the cost of mailing brochures to customers or prospects who are unlikely to buy the new product.
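
To make the idea concrete, here is a minimal sketch in Python using the built-in sqlite3 module; the purchases table, its columns, and the sample rows are illustrative assumptions, not a schema from any DB2 product.

import sqlite3
from datetime import date, timedelta

# Illustrative schema and data -- a tiny stand-in for five years of
# purchasing history.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases "
             "(customer_id INTEGER, product_category TEXT, purchase_date TEXT)")

recent = (date.today() - timedelta(days=365)).isoformat()       # last year
too_old = (date.today() - timedelta(days=8 * 365)).isoformat()  # eight years ago
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)",
                 [(1, "camping gear", recent),
                  (2, "office supplies", recent),
                  (3, "camping gear", too_old)])

# Customers who bought a related category within the last five years --
# the ones worth mailing a brochure to.
rows = conn.execute(
    "SELECT DISTINCT customer_id FROM purchases "
    "WHERE product_category = ? AND purchase_date >= date('now', '-5 years')",
    ("camping gear",)).fetchall()
print([r[0] for r in rows])   # [1]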

In short, businesses can now access and profitably use vast amounts of historical data -- something they couldn't do before because the needed hardware cost too much or the response time was prohibitively slow. And with the newly available data, businesses can get a competitive edge and potentially improve their bottom line.

The third good reason: Parallelism allows a company to expand a system or application base incrementally. That is, it can start with low-cost entry configurations that match its requirements. Then, as the company needs to increase processing power or gain access to more data, it can easily add new units to the system. With parallelism, incremental growth is non-disruptive; it results in a smaller incremental cost and allows a company to buy just the processing power it needs.

Last: With many processors in a parallel environment, a system can be designed so that if a single processor becomes unavailable, it won't severely impact the system workload. Other processors will execute the workload that the failed processor would have run. That's especially important for systems that run mission-critical applications -- because the loss of continuous availability can result in lost revenue.
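
As a sketch of the idea only -- the scheme and names below are assumptions for illustration, not IBM's failover design -- a parallel system might hand a failed processor's data partitions to the survivors like this, in Python:

def redistribute(assignments, failed_node):
    """Reassign a failed node's partitions round-robin to the survivors."""
    orphans = assignments.pop(failed_node, [])
    survivors = sorted(assignments)
    for i, partition in enumerate(orphans):
        assignments[survivors[i % len(survivors)]].append(partition)
    return assignments

# Three processors, each owning two data partitions; node2 fails.
cluster = {"node0": [0, 3], "node1": [1, 4], "node2": [2, 5]}
print(redistribute(cluster, "node2"))
# {'node0': [0, 3, 2], 'node1': [1, 4, 5]}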

Using the Parallel Advantage

Large-scale parallel processing has traditionally been used in compute-intensive applications -- in science and engineering, for example. Now, with the availability of multiple processors and commercial parallel solutions, more individual tasks can be run simultaneously, and big tasks can be divided into independent, concurrent subtasks.

Dividing large tasks is, in some cases, the only way to complete them fast enough to meet the user's needs. This high degree of parallelism -- enabled by dividing input/output (I/O)- and compute-intensive tasks into subtasks -- is being integrated into more and more commercial software solutions.


A business needs a "killer"

application, one that can help

make it a contender in its

industry -- or even a champion.


When an RDBMS can split a database task -- distributing both the processing and the data access across multiple processors -- it is using parallelism to its fullest advantage. In addition, an RDBMS that uses parallel techniques increases a system's performance in many ways and boosts its capacity.

Relational database management systems are inherently well suited to "parallelization." Built on set theory, an RDBMS operates on sets of data while "hiding" the physical implementation from the user. An RDBMS implements sophisticated optimizers to analyze data requests from applications and turn them into physical access paths to the data.

Optimizers identify subtasks, such as sorting and scanning, that can be performed in parallel. Since operating on a relation (or table) yields another relation, a given operation can be performed on subsets of the data in parallel, and the partial results merged into an answer that is itself a relation. Optimizer technology is an important part of implementing an efficient, parallel RDBMS.
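
The point can be seen in a few lines of Python, with ordinary sets standing in for relations (the rows and the predicate are made up): a selection applied to each partition, then merged, gives exactly the relation the same selection gives over the whole table.

# Two partitions of an orders "relation" (customer, amount).
partition_a = {("Acme", 1200), ("Baker", 300)}
partition_b = {("Cargo", 950), ("Delta", 75)}
whole_table = partition_a | partition_b

def select_large_orders(rows):
    """A simple selection: keep rows whose amount exceeds 500."""
    return {row for row in rows if row[1] > 500}

# Selecting per partition and merging yields the same relation as
# selecting over the whole table -- so the subtasks can run in parallel.
merged = select_large_orders(partition_a) | select_large_orders(partition_b)
assert merged == select_large_orders(whole_table)
print(merged)   # {('Acme', 1200), ('Cargo', 950)}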

Exploiting Multiple Processors

How does parallel RDBMS technology take advantage of multiple processors? It does so in several key ways (see Figure 1).

Transaction parallelism. Many different transactions can be serviced simultaneously on a single database image by processing a transaction at each available processor. As a result, throughput increases, thereby enabling the system to support more users without disruption or increased application costs.
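
A rough sketch of the idea in Python: a pool of workers services many independent transactions against one shared database image, here simulated with a dictionary and a lock. The accounts and amounts are made up, and the threads merely model the concurrent servicing.

from concurrent.futures import ThreadPoolExecutor
from threading import Lock

balances = {"acct-1": 100, "acct-2": 100, "acct-3": 100}
lock = Lock()

def run_transaction(txn):
    account, amount = txn
    with lock:                 # stands in for the database manager's locking
        balances[account] += amount

# 300 small, independent transactions serviced by four workers at once.
transactions = [("acct-1", 25), ("acct-2", -10), ("acct-3", 40)] * 100
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_transaction, transactions))

print(balances)   # {'acct-1': 2600, 'acct-2': -900, 'acct-3': 4100}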

Query parallelism. Spreading the workload of a single query or statement across many processors dramatically shortens the time needed to obtain an answer and allows the data to be analyzed in greater depth. This, in turn, can increase revenues, reduce costs, improve customer satisfaction, and provide a competitive edge.

In query parallelism, a single query can be subdivided in two ways. First, it can be split into many subqueries, each processed against a different set of data. This splitting with respect to data is generally called query decomposition -- or, more formally, partition parallelism.* It reduces query times, since smaller portions of the data are processed in parallel; a brief sketch appears below.

Second, a query can be divided into a series of operators, such as scanning followed by sorting, with the output from one operator passed on as input to the next while both execute in parallel.
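
Here is a minimal sketch of the first form, query decomposition, in Python. It assumes a made-up sales table split into four partitions: each subquery computes a partial sum over its own partition on a separate process, and the partial results are combined at the end.

from concurrent.futures import ProcessPoolExecutor

def partial_sum(partition):
    """The subquery run against one data partition."""
    return sum(amount for _customer, amount in partition)

if __name__ == "__main__":
    # Four partitions of a made-up sales table (customer, amount).
    partitions = [
        [("Acme", 120), ("Baker", 80)],
        [("Cargo", 45), ("Delta", 300)],
        [("Echo", 10)],
        [("Foxtrot", 95), ("Golf", 50)],
    ]
    with ProcessPoolExecutor() as pool:
        total = sum(pool.map(partial_sum, partitions))
    print(total)   # 700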

Pipelined parallelism. This approach allows more complex processing to proceed faster. In a query with a GROUP BY or ORDER BY clause, for example, the output from the scan operation -- which would result in rows that matched some criteria specified in the query -- can immediately be passed on to a sort operation without the need to produce a complete interim result.
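
The dataflow can be sketched with Python generators (the table, predicate, and grouping are made up): the scan hands each matching row straight to the grouping step with no complete interim result. The generators model only the dataflow; in a parallel DBMS the two stages would also run on different processors at the same time.

from collections import defaultdict

table = [("east", 100), ("west", 40), ("east", 25), ("north", 60), ("west", 5)]

def scan(rows, predicate):
    """Scan operator: yield matching rows one at a time."""
    for row in rows:
        if predicate(row):
            yield row

def group_by_sum(rows):
    """Grouping operator: consumes rows as they arrive from the scan."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(sorted(totals.items()))    # ORDER BY region

result = group_by_sum(scan(table, lambda row: row[1] >= 25))
print(result)   # {'east': 125, 'north': 60, 'west': 40}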

I/O parallelism. Spreading data across many disk devices allows it to be accessed simultaneously. Even if a query is not "parallelized," I/O parallelism brings the data to the processor before it's needed. Effectively organizing the subdivision of data across disks can enhance query parallelism capabilities.

For example, response times of complex customer inquiries are reduced, letting a business respond more quickly, improving customer satisfaction, and potentially increasing revenues. With data partitioned across disk devices, it's also possible to run various utilities against that data in parallel, such as loading, reorganizing, and performing backups. When data maintenance is performed in parallel, there are shorter outages and an increased availability of data.
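
A small Python sketch of the idea, with one file standing in for each disk (the file names and contents are made up): a thread pool reads all the partitions at once, which is exactly where I/O-bound work overlaps well.

import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Write four small partition files standing in for data spread across disks.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(4):
    path = os.path.join(tmpdir, f"partition_{i}.csv")
    with open(path, "w") as f:
        f.write("\n".join(f"row-{i}-{j}" for j in range(3)))
    paths.append(path)

def read_partition(path):
    """I/O-bound work: the threads overlap while each waits on its disk."""
    with open(path) as f:
        return f.read().splitlines()

# Read all four partitions at the same time and merge the rows.
with ThreadPoolExecutor(max_workers=4) as pool:
    rows = [row for chunk in pool.map(read_partition, paths) for row in chunk]
print(len(rows))   # 12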

Hardware exploitation. To achieve a high degree of parallelism, the parallel database architecture must complement or match the underlying hardware architecture in which I/O, transaction, and query parallelism techniques are applied.

The benefits of query parallelism are available with the newest DB2 family members.

Optimizer technology. Parallel technology is being applied to very large databases with very complex query requirements. Here, the strength of the cost-based optimizer is crucial. Indeed, as the size of the database grows, so does the importance of the optimizer's ability to choose the best data access path; a wrong choice slows response time because more data must be searched.
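
A toy cost model in Python makes the point; the cost figures and selectivity values are invented for illustration and bear no relation to DB2's actual optimizer.

def table_scan_cost(total_rows):
    return total_rows * 1.0                # touch every row once

def index_scan_cost(total_rows, selectivity):
    return total_rows * selectivity * 4.0  # fewer rows, but costlier per row

def choose_path(total_rows, selectivity):
    scan = table_scan_cost(total_rows)
    index = index_scan_cost(total_rows, selectivity)
    return ("index scan", index) if index < scan else ("table scan", scan)

# A selective predicate favors the index; a loose one favors the full scan.
# The larger the table, the more an estimation error costs.
for selectivity in (0.02, 0.60):
    print(selectivity, choose_path(10_000_000, selectivity))
# 0.02 ('index scan', 800000.0)
# 0.6 ('table scan', 10000000.0)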

Application transparency. A key advantage of database parallelism is that managing and exploiting the parallel environment is the job of the database manager alone. Applications don't have to be altered to gain the benefits of parallelism, which increases application portability and minimizes application development costs.

Parallel Processing: All in the Family

Exploiting multiple processors is nothing new to IBM's DB2 family of products. DB2 in the MVS, VM, VSE, and OS/400 operating environments has supported tightly coupled multiprocessor hardware for years. These DB2 products are designed to exploit the strengths and capabilities of their operating environments, thereby providing excellent transaction throughput rates in a large system environment and illustrating efficient transaction parallelism.

Now, DB2/2 and DB2/6000 join the family in this ability to take advantage of tightly coupled multiprocessor systems. (Systems that share memory are classified as tightly coupled, while those that link together several processors with a distributed memory are classified as loosely coupled.)

For query performance needs, the entire DB2 family incorporates the ability to process I/O in parallel, so that data can be accessed many times faster than a single disk could deliver it.

The full benefits of query parallelism are available with the newest DB2 family members: the System/390 Parallel Query Server, DB2 Version 4 for the System/390 Parallel Transaction Server and the Parallel Sysplex, and DB2 Parallel Edition for the POWERparallel systems.

Relational databases, parallelism, subtasks, multiprocessors -- they're all just tools and techniques for processing and screening data. But what they provide is important: the big picture that keeps a company on top of its business and on top of its bottom line.

Reference:
*David DeWitt and Jim Gray, "Parallel Database Systems: The Future of High Performance Database Systems," Communications of the ACM, June 1992.

See also:

DB2: What Happens Next?

Database Parallelism: The Best and the Latest
