


Optimizing Code

This chapter discusses the impact that various switches and pragmas have on the optimization of code generated by the Rational Ada compiler. Specifically, this chapter discusses coding strategies, the factors that affect optimization and how to set them, settings for specific code-generation goals, debugging optimized code, and the optimization process in detail.


Coding Strategies in Rational Ada

The Rational Ada compiler can generate code with various goals: ease of debugging, fast compilation, small code size, or fast execution.

Typically, code that is easy to debug also has the fastest compilation time. Compilation time is usually inversely related to the size and execution time of the generated code: spending more time in compilation generally yields smaller, faster code.

Typically, optimizing for execution time requires larger code size (smaller code size results in somewhat slower code execution), and optimizing for execution time or code space reduces debugging ability.

Any given optimization can be an exception to these guidelines.

Optimization Factors in Rational Ada

Apex allows you to specify, using pragmas or switches, several factors that affect the optimization of code generated by the Rational Ada compiler: the optimization level, the optimization objective, and inlining behavior.

These factors affect each other, as described in Setting Optimization Factors.

Optimizing for Specific Effects

In many cases, you should generate code to meet more than one goal or to move gradually in the direction of a single goal.

For example, optimization level 0 provides the greatest debugging ability and the slowest execution speed, whereas level 2 provides the opposite.

Table 1 shows the Apex settings to use for various goals; you can mix settings to achieve the best balance for your project.

Table 1 Settings for Specific Code-Generation Goals (1)

    For the best:         Opt. level    Opt. objective
    Code size             2             Space
    Execution speed       2             Time
    Compilation speed     0             Space
    Debugging ability     0             Space

(1) Some settings for some factors make the settings of other factors irrelevant. Refer to the detailed descriptions.


Setting Optimization Factors

This section describes how to set the optimization level, how to set the optimization objective, and the inlining pragmas.

Setting the Optimization Level

The optimization level affects the amount of execution optimization performed during compilation. To specify this value, set the OPTIMIZATION_LEVEL context switch.

Optimization levels 0 through 2 are valid: level 0 performs no optimization, level 1 (the default) performs optimizations that do not interfere with debugging, and level 2 performs full optimization at the expense of some debuggability.

You can use this factor to trade compilation speed and debugging ability against the execution performance of the generated code.

Table 2 describes the optimizations performed at levels 1 and 2. At level 1, the optimizations are performed only so far as to have minimum impact on debugging; level 2 removes this restriction.

Table 2 Optimizations Applied at Levels 1 and 2

Constant folding: Performs computations involving constant arguments at compile time.
Algebraic transformations: Uses algebraic laws to simplify arithmetic expressions.
Common subexpression elimination: Eliminates the reevaluation of expressions whose values have already been computed.
Constraint-check elimination: Removes constraint checks that can be shown to be unnecessary.
Evaluation-order determination: Selects the evaluation order of subexpressions to minimize register usage.
Equivalence propagation: Substitutes a simpler expression for a more complex one known to have the same value.
Copy propagation: Replaces uses of a variable that was assigned directly from another variable with that source variable, making the copy unnecessary.
Overflow-check elimination: Removes overflow checks from expressions that cannot cause overflows.
Short-circuit evaluation: Eliminates the evaluation of irrelevant Boolean expressions.
Peephole optimization: Provides some platform-specific code simplification.
Instruction scheduling: Reorders instructions to avoid pipeline and memory conflicts that cause delays.
Lifetime determination: Allocates to the same locations objects that have non-overlapping lifetimes.
Dead-assignment elimination: Eliminates assignments to objects whose values will not be used.
Dead-code elimination: Removes code that will never be executed.
Static-link elimination: Identifies subprograms that do not require static links.
Loop-invariant code motion: Moves out of loops the computation of expressions whose values do not change.
Loop-strength reduction: When the optimization objective is Time, transforms multiplications into less-costly additions within loops; particularly useful for array-indexing expressions.
Tail-recursion elimination: Turns tail-recursive calls into branches (level 2 only).
Check hoisting: Moves constraint checks of loop-invariant expressions out of loops.

All optimizations are performed subject to the constraints of the Ada LRM, Section 11.6.
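To make a few of these concrete, here is a small, purely illustrative fragment (the names are hypothetical) in which several of the optimizations above can apply:

    function Scaled_Area (W, H : Integer) return Integer is
       Scale  : constant Integer := 4 * 256;  -- constant folding: evaluated at compile time
       P      : constant Integer := W * H;
       Q      : constant Integer := W * H;    -- common subexpression: reuses the value of P
       Unused : Integer := P;                 -- dead assignment: the value is never used
    begin
       return P + Q * Scale;
    end Scaled_Area;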

Setting the Optimization Objective

The optimization objective specifies whether execution speed (Time) or code size (Space) is the primary goal of the compiler. It can be set with the standard Ada pragma Optimize, which takes the argument Time or Space.
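For example, pragma Optimize can appear in a declarative part to state the objective for that unit. This is a minimal sketch; the package and subprogram names are hypothetical:

    package Checksums is
       function Sum (A, B : Integer) return Integer;
    end Checksums;

    package body Checksums is
       pragma Optimize (Time);   -- execution speed is the primary goal for this body
       function Sum (A, B : Integer) return Integer is
       begin
          return A + B;
       end Sum;
    end Checksums;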

Inlining Pragmas

Apex Ada supports the inlining behavior specified in the ARM.

The standard pragma Inline requests inlining; at optimization levels 1 and 2, the compiler honors the request whenever inlining is possible (it is not possible to inline recursive code, for example) and beneficial.

Even in the absence of an inlining pragma, at optimization level 2 the compiler may perform inlining of code local to the compilation unit if it is beneficial to do so.

Apex Ada supports two additional inlining pragmas: pragma Inline_Always and pragma Inline_Never.
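As a brief sketch (the package and subprogram names are hypothetical, and the argument form of the Apex-specific pragmas is assumed here to mirror that of pragma Inline):

    package Geometry is
       type Vector is array (1 .. 3) of Float;
       function Dot (A, B : Vector) return Float;
       pragma Inline (Dot);             -- standard request, honored at levels 1 and 2
       procedure Log (Msg : String);
       pragma Inline_Never (Log);       -- never inline this routine
       function Square (X : Float) return Float;
       pragma Inline_Always (Square);   -- request inlining at all optimization levels
    end Geometry;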


Debugging Optimized Code

You can do source level debugging on code optimized at any level. And, of course, you can always do machine level debugging regardless of how the code was compiled.

Generally, however, the higher the optimization level, the less information is retained for debugging. More specifically:

At optimization level 0:

No optimization is performed. You should have full debugging capabilities.

At optimization level 1:

This is the default. At this level, the compiler performs optimizations that maximize performance without sacrificing debuggability. No optimizations are performed that move code from its "natural" order; code hoisting and pipelining, for example, are not applied. In addition, no code or objects are eliminated.

At optimization level 2:

At this level, debuggability is sacrificed for performance. Code is eliminated. If you try to set a breakpoint at a source line which has been optimized away, the debugger will give you a message saying that there is no code associated with that source line.

Objects can be optimized away. If you try to display such an object, the debugger will tell you that it has been optimized away.

Code can be re-ordered. There are several optimizations that can cause this: code hoisting, loop unrolling, and pipelining are some examples. In this case, the debugger still generally knows which source line is associated with every instruction. However, when you're stepping through re-ordered or pipelined code, you may execute statements out of order or execute the same statement more than once!

Inlining is performed. If you step into inlined code, the debugger will still be able to place breakpoints and do other debugging operations associated with code locations, but will not be able to display objects declared in inlined routines. The only exception to this is that the debugger is able to display the parameters of C++ inlined member functions. For Ada, the debugger cannot display any objects declared in an inlined subprogram.


The Optimization Process in Detail

This section provides an overview of Apex Ada optimizations and describes methods by which optimizations may be controlled at a very fine level of granularity.

Optimization Process Overview

The Apex compiler executes a front-end process, then an optimizer, and finally a code generator. Optimizations take place in each of these steps.

The front-end process parses and analyzes Ada code to produce an extended Diana tree representation, called the persistent intermediate representation (PIR). The front-end optimizes the PIR, with a focus on Ada-specific optimizations and optimizations that can best be done with tree representations of the program.

The front-end then "walks" the PIR tree to generate a linear representation of the program using the so-called intermediate language (IL). The IL is essentially an abstract assembly code.

The optimizer transforms the IL into more efficient IL: the input and output use the same IL instruction set, but the output sequence executes much more efficiently.

Finally the code generator phase processes the IL to produce actual target machine code. During this process, a number of target-specific optimizations are applied.

The following sections describe the optimizations that take place during each processing step.

Front-End Optimizations

The Apex front-end performs optimizations on a tree-based representation of the program called MILO.

ILX (inline expansion)

At optimization level 1, only those routines that have a pragma Inline applied are candidates for inlining. The optimizer does not inline routines if it determines that there would be no significant benefit. For example, if a routine is quite large, the benefit of removing call overhead would likely be outweighed by the negative effects of code expansion on the processor's caches.

At optimization level 2, the optimizer treats all routines for which bodies are available at compilation time as candidates for inlining, even if there is no pragma Inline applied to the routine spec. The optimization objective switch controls how aggressively the compiler inlines: with an objective of Time, inlining is more aggressive than with an objective of Space.

The Apex inlining approach means users can design their Ada code with proper abstraction layers without fear of large calling penalties. A coder need worry only about algorithm efficiency and good, maintainable design, not about how to structure routines to eliminate calls.

Apex provides two additional inlining pragmas that apply to optimization levels 0, 1, and 2: pragma Inline_Never and pragma Inline_Always.

GTE (goto elimination)

The GTE optimization eliminates jumps to the next location and collapses chains of jumps to other jumps, returns, and raises. While the middle-pass optimizer also performs this optimization on the linear IL, it is essential to do it at the Milo level as well, to clean up the Milo trees so that other Milo optimizations can be triggered.

For example, a goto statement whose target label is on the statement immediately following it is removed, as in the sketch below. If there are then no other references to the label, say <<L>>, the statement with label <<L>> is eliminated by DCE.
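A source-level sketch of the idea (the optimization actually operates on the Milo trees; the procedure name is hypothetical):

    procedure Gte_Demo is
    begin
       goto L;        -- a jump to the very next location: removed by GTE
       <<L>> null;    -- if nothing else references L, this labeled statement
                      -- is subsequently removed by DCE
    end Gte_Demo;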

DCE (dead code elimination)

DCE eliminates code that is unreachable, that is, basic blocks to which there is no control path.

UCE (unreferenced code elimination)

UCE eliminates routines, objects, constants, symbols, and so on that are never referenced. The major impact is the elimination of routines from generic instantiations that are never called. This optimization works only at the compilation-unit level, so routines that are externally visible cannot be eliminated (that cannot be done until link time).

UCR (useless code removal)

UCR eliminates code that has no side-effects. The middle pass frequently generates code only for its side-effects; when it turns out that there are none (often the case), the useless code can be eliminated. For example, if the value of an expression such as X + 1 is computed but never used, its evaluation can be eliminated, provided it can be shown not to overflow.
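A source-level analogue (the middle pass actually sees this situation in the generated IL; the names are hypothetical):

    procedure Ucr_Demo (X : Integer) is
       Unused : Integer;
    begin
       Unused := X + 1;   -- the value is never used; if X + 1 can be shown not to
                          -- overflow, the entire evaluation has no effect and is removed
    end Ucr_Demo;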

TRE (tail recursion elimination)

Tail recursion is transformed into iteration. This is done as part of inline expansion. For example, a subprogram whose recursive call is the last action it performs is transformed so that the call becomes a branch back to the beginning of the subprogram, as sketched below.
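A hedged sketch of the transformation at the source level (the compiler performs it on its internal representation; the function names are hypothetical):

    --  Tail-recursive form: the recursive call is the last action performed.
    function Gcd (A, B : Natural) return Natural is
    begin
       if B = 0 then
          return A;
       end if;
       return Gcd (B, A mod B);   -- tail call: becomes a branch back to the top
    end Gcd;

    --  Conceptually, TRE turns it into an equivalent loop:
    function Gcd_Iterative (A, B : Natural) return Natural is
       X : Natural := A;
       Y : Natural := B;
       T : Natural;
    begin
       while Y /= 0 loop
          T := X mod Y;
          X := Y;
          Y := T;
       end loop;
       return X;
    end Gcd_Iterative;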

EVO (Evaluation reordering)

EVO reorders the evaluation of expressions to reduce the number of temporaries required. It also reorders the evaluation of the actual parameters of a call to reduce the number of temporaries and to increase the likelihood that an actual parameter can be evaluated into the register used for parameter passing.

BND (Object binder)

BND attempts to bind objects that do not have overlapping lifetimes to the same location in the stack frame to reduce the size of the stack frame. This phase does not attempt to optimize the binding of (symbolic) register objects; that is handled by the middle-pass and code generator optimizations.

Middle-Pass Optimizations

The middle-pass optimizer transforms IL code into optimized IL code using a number of well-known optimization algorithms. The code is optimized by applying the algorithms in some order, often reapplying one or more of the algorithms after some other algorithm is run.

Optimization Algorithms

There are three "canned" algorithm sequences, known simply as optimization levels 0 to 2 (level 0 performs no optimization).

The optimization algorithms are referenced by single letters, with no particular meaning attached to the choice of letter. The following sections present the available middle-pass optimizations in groups, with some explanation of the transformations each group provides.

Propagation Optimization Algorithms

A: Full address propagation
E: Allowed address propagation
a: Free address propagation
I: Propagate integer adds/subs
M: Propagate all constants and memory
P: Propagate registers and all constants
p: Propagate registers
t: Propagate registers and inexpensive constants

Propagation moves an item forward from computation to use, often allowing the computation to be folded or coalesced into the using instruction and eliminating the original computation. Address propagation often moves the computation of an address offset into the operand of a subsequent instruction in the form "op offset(reg)".

Address Conversion and Coalescing Optimization Algorithms

e: Convert add's of addresses into addr's
i: Coalesce address increments
z: Convert simple addr's into add's of addresses

Address conversion and coalescing often allows address propagation to proceed in situations that might not otherwise permit propagation. In particular, note that `e' and `z' are inverses; `e' is used to collapse simple address computations into complex address operands preparatory to an address propagation, then `z' is used to convert the results back to simple address computations that are better suited to efficient code generation and code scheduling.

Address Simplification Optimization Algorithms

X: Simplify addresses completely
x: Simplify addresses

Address simplification is used to break complex addresses into simple expressions that permit better propagation. Often propagation will reverse the result of these transformations after propagation of other computations has taken place. For example, an operand of the form offset_const(reg), where reg holds the sum of a base register and a constant const, is first broken by transformation "x" into explicit address arithmetic; transformations "I" and "e" can then propagate and recombine it into a single operand offset_const2(base register), where offset_const2 is the sum of const and offset_const.

Common Sub-Expression Elimination Optimization Algorithms

C: Global CSE
V: Local CSE
c: Global CSE within loops, no increments
v: Local CSE, but not constants

Common sub-expression elimination (CSE) finds expressions that are repeated and moves the computation ahead of all uses so that it is only computed once. Often repeated expressions occur implicitly within Ada code, such as index computations for different arrays within a loop.

Local CSE restricts the search, and any resulting code motion, to basic blocks, that is, code sequences with no branches and no entry labels. Global CSE searches across basic blocks and thus involves complex flow analysis.
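For example, in a fragment like the following (hypothetical names), the index computation for I + 1 is a repeated expression that CSE computes only once:

    procedure Cse_Demo (I : Integer) is
       type Table is array (1 .. 100) of Integer;
       A, B : Table := (others => 0);
    begin
       A (I + 1) := A (I + 1) + B (I + 1);   -- I + 1 and the derived index offsets
       B (I + 1) := 0;                       -- are computed once and reused
    end Cse_Demo;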

Code Hoisting Optimization Algorithms

H: Code hoisting
h: Hoisting CSE from if-statements

Code hoisting involves moving computations out of a loop (where they are executed on each iteration) into the prefix code for the loop. It is also possible to hoist code that occurs in both branches of an if-statement into the prefix code before the if-statement. While that transformation alone does not speed up the code, it does reduce code size and also permits other transformations, such as code coalescing, to work on the hoisted code.
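A small illustrative fragment (hypothetical names) in which both forms of hoisting described above can apply:

    function Hoist_Demo (X, Y : Integer; Flag : Boolean) return Integer is
       type Vec is array (1 .. 100) of Integer;
       V : Vec;
       R : Integer;
    begin
       for I in V'Range loop
          V (I) := X * Y;       -- X * Y is loop-invariant: hoisted into the loop prefix
       end loop;
       if Flag then
          R := X * Y + V (1);   -- X * Y occurs in both branches, so its computation
       else                     -- can be hoisted ahead of the if-statement
          R := X * Y - V (1);
       end if;
       return R;
    end Hoist_Demo;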

Reduction Optimization Algorithms

S: Reducing induction expressions
d: Reducing moves
m: Reducing moves and renumbering regs

Strength reduction, a transformation associated with code hoisting, replaces an expensive operation within a loop with a less expensive one. The classic case is the multiplication of a loop index by a constant, as in array-element address computations: instead of recomputing index * size on each iteration, the multiply is changed to an add by keeping a running value that is incremented by size each time around the loop.
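A source-level illustration (the reduction is applied to the address arithmetic the compiler generates, not to the Ada text; the names and the 4-byte element size are assumptions):

    procedure Sr_Demo is
       V : array (1 .. 1000) of Integer := (others => 0);
    begin
       for I in V'Range loop
          V (I) := I;   -- the element address is nominally Base + (I - 1) * 4;
                        -- strength reduction replaces the multiply with an address
                        -- that is simply incremented by 4 on each iteration
       end loop;
    end Sr_Demo;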

Merge Bit Optimization Algorithms

B: Merge bit operations
b: Limited merge bit operations

Bit operation merging involves merging operations on successive bit fields into a single operation on a larger field. This permits Boolean bit arrays that are being ANDed bit by bit, for example, to be converted to 8-bit or possibly 64-bit integer AND instructions, and byte-by-byte operations to be merged into a single 32-bit or 64-bit integer instruction.
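For instance, a bit-by-bit AND over packed Boolean arrays (hypothetical names) is a candidate for this merging:

    procedure Bit_Demo is
       type Bit_Set is array (0 .. 63) of Boolean;
       pragma Pack (Bit_Set);
       A, B, C : Bit_Set := (others => False);
    begin
       for I in A'Range loop
          C (I) := A (I) and B (I);   -- successive single-bit ANDs on adjacent bits can
       end loop;                      -- be merged into one or two word-wide integer ANDs
    end Bit_Demo;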

Loop Unrolling Optimization Algorithm

L: Unroll loops

Loop unrolling involves replicating the code within a loop several times (four to eight copies seems to be the best trade-off) so that each pass through the unrolled loop performs the work of several original iterations. While unrolling does reduce the number of iterator computations and end tests, the more important benefit is to provide more instructions for the code generator's scheduling optimizer to work with. Often, having instructions from multiple iterations allows the scheduler to delay the use of a computation sufficiently that high-performance pipelines are not stalled waiting for results. Loop unrolling is limited to relatively small loops, so the resulting code expansion does not have a great impact on instruction caches.
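The effect, sketched at the source level (the compiler performs the unrolling on the IL; the names are hypothetical):

    procedure Unroll_Demo is
       Data : array (0 .. 99) of Integer := (others => 1);
       Sum  : Integer := 0;
       I    : Integer := 0;
    begin
       --  Original loop:
       for J in Data'Range loop
          Sum := Sum + Data (J);
       end loop;

       --  Conceptually unrolled four ways:
       Sum := 0;
       while I < 100 loop
          Sum := Sum + Data (I);
          Sum := Sum + Data (I + 1);
          Sum := Sum + Data (I + 2);
          Sum := Sum + Data (I + 3);   -- more independent work for the instruction scheduler
          I := I + 4;
       end loop;
    end Unroll_Demo;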

Range Check Elimination Optimization Algorithm

R: Eliminate range checks

Range check elimination computes maximum limits for computation results and uses those limits to eliminate limit checks whenever the result cannot possibly exceed a (sub)type limit. The computation result limits are computed using (sub)type limits for operands, propagated constants, and instruction properties.
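For example, in the following fragment (hypothetical names), the loop variable can never lie outside the array's index subtype, so the index checks can be removed:

    procedure Rce_Demo is
       subtype Index is Integer range 1 .. 10;
       V : array (Index) of Integer := (others => 0);
    begin
       for I in Index loop
          V (I) := I;   -- I is always within Index, so the index check
       end loop;        -- on V (I) is provably unnecessary and is removed
    end Rce_Demo;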

Useless Computations and Variables Optimization Algorithms

D: Delete dead variables
U: Remove useless computations including range instructions
u: Remove useless computations

Often, other transformations result in pieces of code that cannot possibly have any external effect on the program. These transformations eliminate those useless computations.

Control Flow Optimization Algorithm

f: Simplify control flow

This operation manipulates the control flow to reduce the number of branches and tests.

Memory Optimization Algorithms

T: Extracting expensive constants from instructions
Y: Extracting memory operations from instructions
y: Extracting memory loads from instructions

These operations eliminate redundant loads and stores from memory. Whenever an instruction operand refers to memory, it is transformed into a load or store register instruction followed by the original instruction using the register for its operand. Multiple register loads and stores are then collapsed into one load and one store where possible. Depending on the results from other transformations, it is possible that the loads and stores remaining after this transformation may then be reintegrated into the using instructions as operands referencing memory.

Normalization Optimization Algorithms

N: Normalization
n: Normalization

Normalization reorders the IL code within basic blocks into sequences that can be more easily processed. Normalization is important for many propagation, merging, and coalescing transformations.

Register Alignment Optimization Algorithm

G: Finding register alignment

Register alignment controls code generation with respect to register sizes, including sizes used when spilling registers.

Miscellaneous Optimization Algorithms

F: Counting registers
g: Generating debug aliases
r: Mark debug registers

These are miscellaneous transformations used to set up for subsequent optimizations, or to set up additional information used by the debugger to properly associate registers with symbols in the original Ada code.

Delineate Sections of Transformations

{: Optional loop transformation start
}: Optional loop transformation end

Braces delineate section(s) of transformations that are taken only if there are loops in the section of code being optimized.

Optimizations Performed at Each Level

For each optimization level, a predetermined sequence of transformations is executed. These sequences are currently stable, but future experience could prompt a change in these, possibly for a particular platform. The current sequences for optimization levels 0-2 are shown in the following table. A quick reference key to the algorithm letters is provided immediately below the table.

    Level    Transformation sequence
    0        (no optimization)
    1        NmPDFVPCPHAbyXMDmG
    2        ndPDmFncPADIRyxcPXpdzcPV{HPSrPDuHPSrPHPS}rMDLCHPAUDIiBYxcPXzCPVmnHCPEUDMDeEmVUDTGDgz
Key to the algorithm letters:

{: Optional loop transformation start
}: Optional loop transformation end
A: Full address propagation
a: Free address propagation
B: Merge bit operations
b: Limited merge bit operations
C: Global CSE
c: Global CSE w/in loops, no increments
D: Delete dead variables
d: Reduce moves
E: Allowed address propagation
e: Convert add's of addresses into addr's
F: Counting registers
f: Simplify control flow
G: Find register alignment
g: Generate debug aliases
H: Code hoisting
h: Hoist CSE from if-statements
I: Propagate integer adds/subs
i: Coalesce address increments
L: Unroll loops
M: Propagate all consts and memory
m: Reduce moves and renumbering regs
N: Normalization
n: Normalization
P: Propagate registers and all consts
p: Propagate registers
R: Eliminate range checks
r: Mark debug registers
S: Reduce induction expressions
T: Extract expensive consts from instrns
t: Propagate regs and inexpensive consts
U: Remove useless computations including range instructions
u: Remove useless computations
V: Local CSE
v: Local CSE, but not consts
X: Simplify addresses completely
x: Simplify addresses
Y: Extract memory operations from instrns
y: Extract memory loads from instrns
z: Convert simple addr's into add's of addresses

Code Generator Optimizations

The Apex code generators also optimize code. These optimizations are typically target-specific. This section discusses the general areas of optimization performed by the code generators.

Register Usage Optimization

Code generators assign physical registers to the pseudo-registers specified within the IL. Complex graph-coloring algorithms and target-specific heuristics are used to minimize the number of register spills and register moves.

A register spill occurs when a variable allocated to a register has to be temporarily saved to the stack so the register may be used for some other variable. Register spills can significantly slow execution speed, so the code generator makes significant efforts to minimize these occurrences.

Register moves occur when a variable allocated to one register has to be moved to another register, for example for a procedure call. Register moves do not have as negative an impact as register spills, but they waste instructions when the values could have been computed directly into the appropriate result register. Consequently, code generators also attempt to minimize register moves.

Instruction Scheduling

All present-day computer architectures (whether RISC or not) have heavily pipelined computational units in order to achieve their high speeds. Depending on the architecture, any or all of the following scheduling issues may apply.

Delay slots
These are instruction slots executed after a branch instruction but before the branch takes effect; they can hold a useful instruction if, and only if, that instruction's result is valid regardless of the branch path.
Result delays
The result from an instruction is not available for a number of cycles after the instruction is executed. Optimal scheduling does not make use of a result register until after the appropriate number of instructions are executed. The amount of the delay depends on the instruction type.
Parallel initiation
The processor fetches multiple instructions in one cycle. If those instructions are a correct mix, e.g. one floating point instruction and one integer instruction, then the instructions can be initiated in parallel, effectively executing more than one instruction per cycle.
Branch target alignment
The target of a branch must be fetched by the processor before execution. On most modern architectures, instruction fetching is done in aligned groups of instructions, typically two or four at a time. If a branch target is not aligned, the first instruction fetch at the target picks up fewer instructions.
Branch prediction
Architectures contain implicit and explicit hints about branch prediction. By properly ordering branch blocks, and by generating the correct tests, the effects of branches on code can be virtually eliminated with most modern architectures.
Trap delays
Pipelined architectures often will not trap a failed instruction, such as a floating point overflow, until many instructions past the failing instruction have been executed. If traps are essential to exception generation, then the scheduler must ensure that all possible instruction traps have occurred within the code range of any exception handler. That is, the scheduler cannot move instructions outside of blocks with associated exception handlers.

Existing Apex code generators use heuristics that make use of all the above scheduling issues, if applicable to the particular target architecture.

Advanced Instruction Sets

Most architectures are constantly evolving. Apex code generators frequently have architecture version switches to enable use of more advanced instructions. When those switches are enabled, the code generator takes a different path for some Ada code and generates machine code that uses the advanced instructions.

However, these alternative architecture switches do not come without a price. Users often want one executable that will run on different architecture versions. Most modern operating systems provide emulations for missing hardware instructions, but those emulations are often orders of magnitude slower than code generated without the advanced instructions. The result is that if the advanced instructions are enabled and the resulting executable is run on an older architecture version, the application can suffer significant degradation in execution speed.

Advanced instruction sets, taken in combination, also increase the number of code-execution scenarios and elevate the importance of the operating-system version (version 10.2 may properly emulate an instruction, but 10.1 may have bugs). For this reason, architecture variants are reviewed very carefully for potential benefit before architecture-variant switches are supported. Often, several variants are grouped into one switch setting to minimize the number of testing cases.

Peephole Optimization

During actual target code generation, it is often possible to collapse adjacent IL instructions into a single target instruction. This is called peephole optimization because the transformation operates only on a small "peephole" view of the instruction stream. This optimization may also be run on the target instructions after code generation, rather than on the IL during code generation; the principles are the same.

An example is generation of the S8ADDL instruction for DEC Alpha targets, which shifts one register left by 3 bits (a multiply by 8) and then adds it to another register. This operation is often used for computing array offsets. If the peephole optimizer sees an IL multiply by 8 followed by an IL add of the result to another register, the code generator collapses those two IL instructions into the single Alpha S8ADDL instruction.


Rational Software Corporation 
http://www.rational.com
support@rational.com
techpubs@rational.com
Copyright © 1993-2004, Rational Software Corporation. All rights reserved.