Chapter 7. Writing Assembly Language Code

This chapter gives rules and examples to follow when designing an assembly language program. The chapter includes a tutorial section that explains how calling sequences work. The approach is to write a skeleton version of your prospective assembly routine in a high-level language and then compile it with the -S option to generate a human-readable assembly language file, which you can use as the starting point for coding your routine. See “Using the .s Assembly Language File” for details about the assembly language file produced with this option.

This assembler works in 32-bit, high-performance 32-bit (N32), or 64-bit compilation mode. Although these modes are very similar, differences in data, register, and address sizes mean that the N32 and 64-bit assembler linkage conventions are not always the same as those for 32-bit mode. For details on some of these differences, see the MIPSpro 64-Bit Porting and Transition Guide and the MIPSpro N32 ABI Handbook.

The procedures and examples in this chapter, for the most part, describe 32-bit compilation mode. In some cases, specific differences necessitated by 64-bit mode are highlighted.

Introduction

When you write assembly language routines, you should follow the same calling conventions that the compilers observe, for two reasons:

  • Often your code must interact with compiler-generated code, accepting and returning arguments or accessing shared global data.

  • The symbolic debugger gives better assistance in debugging programs using standard calling conventions.

The conventions for the compiler system are a bit more complicated than some, mostly to enhance the speed of each procedure call. Specifically:

  • The compilers use the full, general calling sequence only when necessary; where possible, they omit unneeded portions of it. For example, the compilers avoid using a register as a frame pointer whenever possible.

  • The compilers and debugger observe certain implicit rules rather than communicating via instructions or data at execution time. For example, the debugger looks at information placed in the symbol table by a “.frame” directive at compilation time, so that it can tolerate the lack of a register containing a frame pointer at execution time.

Program Design

This section describes some general areas of concern to the assembly language programmer:

  • Stack frame requirements on entering and exiting a routine.

  • The “shape” of data (scalars, arrays, records, sets) laid out by the various high-level languages.

For information about register format, and general, special, and floating-point registers, see Chapter 1.

The Stack Frame

This discussion of the stack frame, particularly the figures, describes 32-bit operations. In 32-bit mode, restrictions such as stack addressing are enforced strictly. Although these restrictions are not enforced rigidly for 64-bit stack frame usage, observing them is still good coding practice, especially if you count on reliable debugging information.

The compilers classify each routine into one of the following categories:

  • Non-leaf routines, that is, routines that call other procedures.

  • Leaf routines, that is, routines that do not themselves execute any procedure calls. Leaf routines are of two types:

    • Leaf routines that require stack storage for local variables

    • Leaf routines that do not require stack storage for local variables.

You must decide the routine category before determining the calling sequence.
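The classification above can be sketched in C as follows. This is only an illustration of the decision; the type and function names are hypothetical and are not part of the compiler system.

```c
#include <stdbool.h>

/* The three routine categories described above. */
enum routine_kind {
    NON_LEAF,          /* calls other procedures */
    LEAF_WITH_STACK,   /* calls nothing, but needs stack storage for locals */
    LEAF_NO_STACK      /* calls nothing and needs no stack storage */
};

/* Classify a routine from two facts about it. */
enum routine_kind classify_routine(bool calls_others, bool needs_stack_locals)
{
    if (calls_others)
        return NON_LEAF;
    return needs_stack_locals ? LEAF_WITH_STACK : LEAF_NO_STACK;
}

/* Only a leaf routine with no stack storage can skip frame allocation. */
bool needs_stack_frame(enum routine_kind kind)
{
    return kind != LEAF_NO_STACK;
}

/* Only non-leaf routines must save the return address register $31. */
bool must_save_ra(enum routine_kind kind)
{
    return kind == NON_LEAF;
}
```

The two predicates correspond to the choices made in the procedure that follows: whether to allocate a frame at all, and whether $31 must be saved.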

To write a program with proper stack frame usage and debugging capabilities, use the following procedure:

  1. Regardless of the type of routine, you should include a .ent pseudo-op and an entry label for the procedure. The .ent pseudo-op is for use by the debugger, and the entry label is the procedure name. The syntax is:

         .ent    procedure_name
    procedure_name:

  2. If you are writing a leaf procedure that does not use the stack, skip to step 3. For a leaf procedure that uses the stack, or for a non-leaf procedure, you must allocate all the stack space that the routine requires. The syntax to adjust the stack size is:

         subu    $sp,framesize

    where framesize is the size of frame required; framesize must be a multiple of 16. Space must be allocated for:

    • Local variables.

    • Saved general registers. Space should be allocated only for those registers saved. For non-leaf procedures, you must save $31, which is used in the calls to other procedures from this routine. If you use registers $16-$23, you must also save them.

    • Saved floating-point registers. Space should be allocated only for those registers saved. If you use registers $f20-$f30 (for 32-bit) or $f24-$f31 (for 64-bit), you must also save them.

    • Procedure call argument area. You must allocate the maximum number of bytes for arguments of any procedure that you call from this routine.


      Note: Once you have modified $sp, you should not modify it again for the rest of the routine.


  3. Now include a .frame pseudo-op:

         .frame framereg,framesize,returnreg

    The virtual frame pointer is a frame pointer as used in other compiler systems but has no register allocated for it. It consists of the framereg ($sp, in most cases) added to the framesize (see step 2 above). The following figures show the stack components for -32 and -n32 and -64.

    The returnreg specifies the register containing the return address (usually $31). These usual values may change if you use a varying stack pointer or are specifying a kernel trap routine.

    Figure 7-1. Stack Organization for -32

    Figure 7-2. Stack Organization for -n32 and -64

  4. If the procedure is a leaf procedure that does not use the stack, skip to step 7. Otherwise you must save the registers you allocated space for in step 2.

    To save the general registers, use the following operations:

         .mask    bitmask,frameoffset
         sw reg,framesize+frameoffset-N($sp)

    The .mask directive specifies the registers to be stored and where they are stored. A bit should be on in bitmask for each register saved (for example, if register $31 is saved, bit 31 should be 1 in bitmask). Bits are set in bitmask in little-endian order, even if the machine configuration is big-endian. The frameoffset is the offset from the virtual frame pointer (this number is usually negative). N should be 0 for the highest-numbered register saved and then incremented by four for each subsequently lower-numbered register saved. For example:

         sw    $31,framesize+frameoffset($sp)
         sw    $17,framesize+frameoffset-4($sp)
         sw    $16,framesize+frameoffset-8($sp)

    Figure 7-3 illustrates this example.

    Now save any floating-point registers that you allocated space for in step 2 as follows:

         .fmask    bitmask,frameoffset
         s.[sd]    reg,framesize+frameoffset-N($sp)

    Notice that saving floating-point registers is identical to saving general registers, except that you use the .fmask pseudo-op instead of .mask, and the stores are of floating-point singles or doubles. The discussion regarding saving general registers applies here as well, but remember that N should be incremented by 16 for doubles. The stack framesize must be a multiple of 16.

    Figure 7-3. Stack Example

  5. This step describes parameter passing: how to access arguments passed into your routine and passing arguments correctly to other procedures. For information on high-level language-specific constructs (call-by-name, call-by-value, string or structure passing), refer to the MIPSpro N32/64 Compiling and Performance Tuning Guide.

    As specified in step 2, space must be allocated on the stack for all arguments even though they may be passed in registers. This provides a saving area if their registers are needed for other variables.

    Arguments are passed in registers where possible. For 32-bit compilations, general registers $4-$7 and floating-point registers $f12 and $f14 are used for passing the first four arguments (if possible). For floating-point arguments that appear in registers, you must allocate a pair of registers that starts with an even-numbered register, even for a single-precision argument.

    For 64-bit compilations, general registers $4-$11 and floating-point registers $f12 through $f19 are used for passing the first eight arguments (if possible).

    In Table 7-1 and Table 7-2, the “fN” arguments are considered single- and double-precision floating-point arguments, and “nN” arguments are everything else. The ellipses (...) mean that the rest of the arguments do not go in registers regardless of their type. The “stack” assignment means that you do not put this argument in a register. The register assignments occur in the order shown to satisfy optimizing compiler protocols:

    Table 7-1. Parameter Passing (-32)

    Argument List       Register and Stack Assignments
    -------------       ------------------------------
    f1, f2              $f12, $f14
    f1, n1, f2          $f12, $6, stack
    f1, n1, n2          $f12, $6, $7
    n1, n2, n3, n4      $4, $5, $6, $7
    n1, n2, n3, f1      $4, $5, $6, stack
    n1, n2, f1          $4, $5, ($6, $7)
    n1, f1              $4, ($6, $7)


    Table 7-2. Parameter Passing (-n32 and -64)

    Argument List                   Register and Stack Assignments
    -------------                   ------------------------------
    d1, d2                          $f12, $f13
    s1, s2                          $f12, $f13
    s1, d1                          $f12, $f13
    d1, s1                          $f12, $f13
    n1, d1                          $4, $f13
    d1, n1, d1                      $f12, $5, $f14
    n1, n2, d1                      $4, $5, $f14
    d1, n1, n2                      $f12, $5, $6
    s1, n1, n2                      $f12, $5, $6
    d1, s1, s2                      $f12, $f13, $f14
    s1, s2, d1                      $f12, $f13, $f14
    n1, n2, n3, n4                  $4, $5, $6, $7
    n1, n2, n3, d1                  $4, $5, $6, $f15
    n1, n2, n3, s1                  $4, $5, $6, $f15
    s1, s2, s3, s4                  $f12, $f13, $f14, $f15
    s1, n1, s2, n2                  $f12, $5, $f14, $7
    n1, s1, n2, s2                  $4, $f13, $6, $f15
    n1, s1, n2, n3                  $4, $f13, $6, $7
    d1, d2, d3, d4, d5              $f12, $f13, $f14, $f15, $f16
    d1, d2, d3, d4, d5, s1, s2, s3, s4
                                    $f12, $f13, $f14, $f15, $f16, $f17, $f18, $f19, stack
    d1, d2, d3, s1, s2, s3, n1, n2, n3
                                    $f12, $f13, $f14, $f15, $f16, $f17, $10, $11, stack


  6. Next, you must restore registers that were saved in step 4. To restore general purpose registers:

         lw reg,framesize+frameoffset-N($sp)

    To restore the floating-point registers:

         l.[sd] reg,framesize+frameoffset-N($sp)

    (Refer to step 4 for a discussion of the value of N.)

  7. Get the return address:

         lw $31,framesize+frameoffset($sp)

  8. Clean up the stack:

         addu $sp,framesize

  9. Return:

         j $31

  10. To end the procedure:

         .end procedure_name
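The arithmetic behind steps 2 and 4 can be sketched in C. This is a minimal illustration for 32-bit mode under the rules stated above (framesize a multiple of 16, one mask bit per saved register, N increasing by four for each lower-numbered general register); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round a byte count up to the next multiple of 16 (step 2). */
uint32_t round_up_16(uint32_t bytes)
{
    return (bytes + 15u) & ~(uint32_t)15u;
}

/* Frame size: locals + saved registers + outgoing argument area. */
uint32_t frame_size(uint32_t local_bytes, uint32_t saved_reg_bytes,
                    uint32_t arg_area_bytes)
{
    return round_up_16(local_bytes + saved_reg_bytes + arg_area_bytes);
}

/* Bit for register $reg in a .mask or .fmask bitmask (bit 31 for $31). */
uint32_t mask_bit(unsigned reg)
{
    assert(reg < 32);
    return (uint32_t)1 << reg;
}

/* $sp-relative offset of a saved general register in 32-bit mode.
 * index is 0 for the highest-numbered saved register, 1 for the next
 * lower one, and so on, so that N = 4 * index. */
int32_t gpr_save_offset(uint32_t framesize, int32_t frameoffset, unsigned index)
{
    return (int32_t)framesize + frameoffset - 4 * (int32_t)index;
}
```

For a routine that saves $31, $17, and $16, the bitmask is mask_bit(31) | mask_bit(17) | mask_bit(16), or 0x80030000. With an assumed framesize of 32 and frameoffset of -4, the three registers land at 28($sp), 24($sp), and 20($sp).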

The difference in stack frame usage for -n32 and -64 operations can be summarized as follows:

The portion of the argument structure beyond the initial eight doublewords is passed in memory on the stack, pointed to by the stack pointer at the time of call. The caller does not reserve space for the register arguments; the callee is responsible for reserving it if required (either adjacent to any caller-saved stack arguments if required, or elsewhere as appropriate). No requirement is placed on the callee either to allocate space and save the register parameters, or to save them in any particular place.
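The pattern visible in Table 7-2 can be sketched in C: an argument in one of the first eight positions goes in the general or floating-point register that corresponds to its position, and later arguments go on the stack. This sketch is an assumption-laden illustration covering only the scalar cases shown in the table; it does not model structures, variable argument lists, or other cases defined by the ABI, and the names are hypothetical.

```c
enum arg_kind  { ARG_INTEGER, ARG_FLOAT };   /* "n" versus "s"/"d" in Table 7-2 */
enum arg_place { IN_GPR, IN_FPR, ON_STACK };

struct arg_location {
    enum arg_place place;
    unsigned       reg;   /* register number; meaningful only for IN_GPR/IN_FPR */
};

/* Location of the argument at zero-based position `position`. */
struct arg_location n32_arg_location(enum arg_kind kind, unsigned position)
{
    struct arg_location loc = { ON_STACK, 0 };

    if (position < 8) {
        if (kind == ARG_FLOAT) {
            loc.place = IN_FPR;
            loc.reg = 12 + position;   /* $f12 through $f19 */
        } else {
            loc.place = IN_GPR;
            loc.reg = 4 + position;    /* $4 through $11 */
        }
    }
    return loc;
}
```

For the table row n1, s1, n2, s2 this yields $4, $f13, $6, $f15, and for a ninth argument of any kind it yields the stack.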

The Shape of Data

In most cases, high-level language routines and assembly routines communicate via simple variables: pointers, integers, booleans, and single- and double-precision real numbers. Describing the details of the various high-level data structures (arrays, records, sets, and so on) is beyond the scope of this book. If you need to access such a structure as an argument or as a shared global variable, refer to the MIPSpro N32/64 Compiling and Performance Tuning Guide.

Examples

This section contains examples that illustrate the program design rules. Each example shows a procedure written in C and its equivalent written in assembly language.

Example 7-1. Non-leaf procedure

The following example shows a non-leaf procedure. Notice that it creates a stackframe, and also saves its return address since it must put a new return address into register $31 when it invokes its callee:

float
nonleaf(int i, int *j)
     {
     double atof();
     int temp;

     temp = i - *j;
     if (i < *j) temp = -temp;
     return atof(temp);
     }

           .globl       nonleaf
   #   1   float
   #   2   nonleaf(i, j)
   #   3   int i, *j;
   #   4   {
           .ent        nonleaf 2
   nonleaf:
           .cpload $25              ## Load $gp
           subu        $sp, 32      ## Create stackframe
           sw          $31, 20($sp) ## Save the return
                                    ## address
           sw          $gp, 24($sp) ## Save gp
           .mask       0x80000000, -4
           .frame    $sp, 32, $31
   #  5    double atof();
   #  6    int temp;
   #  7
   #  8    temp = i - *j;
           lw      $2, 0($5)        ## Arguments are in
                                    ## $4 and $5
           subu     $3, $4, $2
   #  9    if (i < *j) temp = -temp;
           bge      $4, $2, $32     ## Note: $32 is a label,
                                    ##  not a reg
           negu     $3, $3
$32:
   #  10   return atof(temp);
           move     $4, $3
           jal      atof
           cvt.s.d  $f0, $f0       ## Return value goes in $f0
           lw       $gp, 24($sp)   ## Restore gp
           lw       $31, 20($sp)   ## Restore return address
           addu     $sp, 32        ## Delete stackframe
           j        $31            ## Return to caller
           .end     nonleaf   

The -n32 code for the previous example is shown below. Note that this code is under .set noreorder, so be aware of delay slots.

  .set          noreorder
         # Program Unit: nonleaf
  .ent          nonleaf
  .globl        nonleaf
nonleaf:        # 0x0
  .frame        $sp, 32, $31
  .mask         0x80000000, -32
  lw $7,0($5)              # load *j
  addiu $sp,$sp,-32        #.frame.len.nonleaf
  sd $gp,8($sp)            # save $gp
  sd $31,0($sp)            # save $ra
  lui $31,%hi(%neg(%gp_rel(nonleaf+0))) #load new $gp
  addiu $31,$31,%lo(%neg(%gp_rel(nonleaf+0))) #
  addu $gp,$25,$31         #
  slt $1,$4,$7             # compare i to *j
  beq $1,$0,.L.1.1.temp    #
  subu $7,$4,$7            # i-*j, in delay slot of branch
  subu $7,$0,$7            # temp = -temp
.L.1.1.temp:     # 0x2c
  lw $25,%call16(atof)($gp)#
  jalr $25                 #atof
  or $4,$7,$0              # delay slot of jalr loads arg
  ld $31,0($sp)            # restore $ra
  cvt.s.d $f0,$f0          #
  ld $gp,8($sp)            # restore $gp
  jr $31                   #
  addiu $sp,$sp,32         # .frame.len.nonleaf
  .end   nonleaf

           


Example 7-2. Leaf Procedure

This example shows a leaf procedure that does not require stack space for local variables. Notice that it creates no stackframe, and saves no return address.

int
leaf(p1, p2)
    int p1, p2;
    {
    return (p1 > p2) ? p1 : p2;
    }

                .globl        leaf
   #    1       int
   #    2       leaf(p1, p2)
   #    3         int p1, p2;
   #    4         {
                .ent          leaf 2
leaf:
                .frame        $sp, 0, $31
   #    5         return (p1 > p2) ? p1 : p2;
                 ble          $4, $5, $32    ## Arguments in
                                             ##  $4 and $5
                 move         $3, $4
                 b            $33
$32:
                 move         $3, $5
$33:
                 move         $2, $3         ## Return value
                                             ##  goes in $2
                 j            $31            ## Return to
                                             ##  caller
   #    6          }
                 .end    leaf

The -n32 code for the previous example looks like this:

   .set    noreorder
   .ent    leaf
   .globl  leaf
leaf:   #0x0
   .frame  $sp, 0, $31
   slt $2,$5,$4           # compare p1 and p2
   beq $2, $0,.L.1.2.temp #
   or $9,$4,$0            # delay slot
   b .L.1.1.temp          #
   or $2,$9,$0            # delay slot, return p1
.L.1.2.temp:  # 0x14
   or $2,$5,$0            # return p2
.L.1.1.temp:  # 0x18
   jr $31                 #
   nop                    # delay slot
   .end     leaf

Interfaces Between Assembly Routines and Other Languages

The rules and parameter requirements that exist between assembly language and other languages are varied and complex. The simplest approach to coding an interface between an assembly routine and a routine written in a high-level language is to do the following:

  • Use the high-level language to write a skeletal version of the routine that you plan to code in assembly language.

  • Compile the program using the -S option, which creates an assembly language (.s) version of the compiled source file (the -O option, though not required, reduces the amount of code generated, making the listing easier to read).

  • Study the assembly-language listing and then, imitating the rules and conventions used by the compiler, write your assembly language code.
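As a sketch of the first step, a skeletal C routine might look like the following. The file name skeleton.c and the routine scale_sum are hypothetical; the point is only that the skeleton has the same name, argument types, and return type as the routine you intend to hand-code, so that the generated listing shows where each argument arrives and where the result is returned.

```c
/* skeleton.c: hypothetical stand-in for a routine to be hand-coded later.
 * Compiling this file with the -S option (optionally with -O) produces
 * skeleton.s, for example with a command such as "cc -S -O skeleton.c",
 * assuming the compiler driver is invoked as cc. */
double scale_sum(const double *v, int n, double factor)
{
    double sum = 0.0;
    int i;

    /* The body only needs to touch every argument and produce a result,
     * so that the listing shows how each one is accessed. */
    for (i = 0; i < n; i++)
        sum += v[i];
    return sum * factor;
}
```

A mix of pointer, integer, and floating-point arguments like this one is useful because the listing then shows both general and floating-point argument registers in use.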

Using the .s Assembly Language File

The MIPSpro compilers can produce a .s file rather than a .o file. The file is produced by specifying the -S option on the command line instead of the -c option.

The assembly language file that is produced contains exactly the same set of instructions that would have been produced in the .o object file, and inputting the .s file to the assembler produces an object file with the same instructions that the compiler would have produced. The .s file is a listing of the instructions, but it does not contain all the object information that a .o file contains. Therefore, a .o file generated from a .s file is not exactly the same as one generated directly by the compiler, and the two are not guaranteed to work identically (for example, reorg_common information is lost).

In addition to the program's instructions, the .s file contains comments indicating the effects of various optimization transformations that were made by the compiler.

Most of these comments are self-explanatory or contain easily understood information, while other comments require a detailed knowledge of the compiler's internal workings. The following information is intended to describe the more useful, non-obvious, features of the file without getting into the details of optimization theory. For more detailed information about optimization see the MIPSpro N32/64 Compiling and Performance Tuning Guide.

The following subsections describe the different elements of the .s file.

Program Header

The file begins with comments that indicate the name of the source file and the compiler that was used to produce the .s file. The options that were used by the compiler are also listed. It is often important to know the target machine that the instructions were intended for; this is discussed in the following subsections. By default, only a select set of options is included in the file. More detail can be obtained by including the -LIST:options flag on the compiler's command line.

Instruction Alignment

One of the first pseudo-instructions in the file is similar to the following example:

     .section .text, 1, 0x00000006, 4, 64

or

     .section .text, 1, 0x00000006, 4, 16

This directive is used by the loader to align the start of the program's instructions at particular byte-address boundaries. The rightmost field is 16 if quad word alignment is required, or is 64 if cache line alignment is needed. The proper number is determined by the target processor type and the optimization level that was used because some optimizations require an exact knowledge of the I-Cache placement of each instruction while others do not benefit from this level of control.
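The effect of the two alignment values can be illustrated with a small C sketch that rounds an address up to a power-of-two boundary. The function names are hypothetical; 16 and 64 are the two values of the rightmost field described above.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of alignment (a power of two). */
uint64_t align_up(uint64_t addr, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Nonzero if addr already lies on the requested boundary. */
int is_aligned(uint64_t addr, uint64_t alignment)
{
    return align_up(addr, alignment) == addr;
}
```

For example, a candidate start address of 0x1004 becomes 0x1010 under quad word (16-byte) alignment and 0x1040 under cache line (64-byte) alignment.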

Label Offset Comments

A comment is attached to each label definition (recognized by the colon (:) following the name). This comment provides the byte offset of the label's location relative to the start of the .section .text directive. The first label, which usually corresponds to the first entry point of the first function, is 0x0.

The remaining labels have addresses that are increased by 4 bytes for each instruction that is placed between successive labels. These offsets are the same for both the .s file and the related .o files, although the loader can choose to place the start of the program (the 0x0 location) anywhere in the machine's address space. The start is subject only to the alignment restriction placed on the .section directive (see “Instruction Alignment”).

This is useful to note when you are using a debugger and trying to correlate the assembly file to the executed instructions. The machine addresses are sometimes difficult to translate to the file's relative offsets when only quad word alignment was requested.

The following is an example of this comment:

.BB1.kernel_:  # 0x0  
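The offset arithmetic can be sketched in C as follows. This assumes that only 4-byte instructions lie between the start of the section and the label; the function names are hypothetical.

```c
#include <stdint.h>

/* Offset of a label that follows `instructions_before` instructions
 * counted from the start of the .text section (4 bytes each). */
uint32_t label_offset(uint32_t instructions_before)
{
    return 4u * instructions_before;
}

/* Convert a machine address seen in a debugger back to the relative
 * offset used in the .s file, given the address at which the loader
 * placed the 0x0 location. */
uint64_t text_offset(uint64_t runtime_addr, uint64_t text_base)
{
    return runtime_addr - text_base;
}
```

In the -n32 listing of Example 7-1, eleven instructions precede the label .L.1.1.temp, which matches its offset comment of 0x2c.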

Source Code Comments

To help associate the compiler-generated code with the original source code, the line number and source code are inserted into comment lines that are interspersed with the assembly instructions. Each comment usually appears ahead of the machine instructions that are generated for that source line. However, various optimizations may cause instructions to be moved or reordered, and it is sometimes difficult to understand where they appear.

A further difficulty can arise if inline code expansion occurs. In these cases, the line number (503 in the following example) may refer to the line of the module that contained the inlined routine, and not to the original source code module of the compiled program. This can be especially confusing if the -ipa option was requested, and if several source code files were intermixed.

To determine the original file that contains a particular source code line, search for the immediately preceding .loc directive. This directive contains the line number and an index to a previous .file directive that identifies the file that the source code was read from. See Chapter 8, “Pseudo Op-Codes (Directives)” for information about the .loc directive.

The following is an example of this comment:

# 503              x[k] = q + y[k]*( r*z[k+10] + t*z[k+11] )

Relative Instruction Issue Times

When any level of optimization greater than -O0 is requested on the command line, comments are added to the right of machine instructions that indicate the compiler's knowledge of the relative issue time for the particular instruction. These comments consist of an integer between square brackets, as shown in the following example:

     mul.d $f2,$f2,$f10              # [11]

In this example, the [11] indicates the clock cycle (relative to the start of the block) in which this instruction will be issued by the processor.

The assembly files targeted for processors that can only issue a single instruction in a clock period have unique times for each instruction in the block, while target processors that can issue multiple instructions may show that several instructions have the same integer in the issue time comment.

The times for processors that support Out-Of-Order issue of instructions may sometimes appear unusual because an instruction may be issued before other instructions that precede it in the block. This is common processor behavior. The compiler attempts to model the queuing mechanisms contained by the hardware and it uses knowledge of the details to arrive at meaningful times to place in these comment fields. The times are accurate to the limit that the machine is modeled.

Several simplifying assumptions are made to calculate these times, which make it difficult to estimate the performance of the code by using these comments alone. The most important point to make is that program flow is not taken into account. The actual performance of a program is influenced by the path taken into a particular block of code, which often determines when the inputs that an instruction needs will be ready. It would be difficult to model an entire program and take into account all possible paths into a block, so it is assumed that all inputs computed outside a block are available at the start of the block, and that all functional units are initially free to accept new operations.

Even with these restrictions, it is difficult to accurately model the behavior of load and store instructions. The compiler attempts to recognize accesses that will be satisfied from a data cache and use an appropriate latency. Although performance data suggests that most data references are to a cache, this can be very program-dependent. With the additional complexity introduced when multiple levels of cache are available, the compiler can never be certain that it is using the correct memory latency to produce the issue time comments. Because of these uncertainties, the compiler uses times that match what happens in the average program.

A limitation on the use of these times is illustrated with the following program example run on a machine with an R10000 processor:

.Lt.0.224:      # 0x508
        .loc    1 589 17  
# 589                  temp -= x[j]*y[j];
         ldc1 $f9,24024($3)              # [0]
         ldc1 $f10,32064($5)             # [1]
         addiu $2,$2,-1                  # [0]
         addiu $3,$3,8                   # [0]
         addiu $5,$5,8                   # [1]
         bne $2,$0,.Lt.0.224             # [1,1]
         nmsub.d $f4,$f4,$f9,$f10        # [4]

With just a glance at the times, you might conclude that an nmsub instruction will only be issued every 5 clock periods. However, as long as execution stays within the loop, the processor will prefetch instructions faster than it can execute them, resulting in an average issue of 1 nmsub instruction every 2 clock periods, limited by the 2 memory accesses that take 2 clock periods to issue.
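The steady-state figure follows from a simple resource bound, sketched here in C. The sketch assumes, as the text implies, that the processor issues one memory access per clock period; it ignores every other limit on performance, and the function name is hypothetical.

```c
#include <assert.h>

/* Lower bound on clock periods per loop iteration imposed by a single
 * resource: the operations that need it, divided by how many of them
 * can issue per clock period, rounded up. */
unsigned resource_bound_cycles(unsigned ops_per_iteration,
                               unsigned issue_slots_per_cycle)
{
    assert(issue_slots_per_cycle > 0);
    return (ops_per_iteration + issue_slots_per_cycle - 1)
           / issue_slots_per_cycle;
}
```

With the two ldc1 instructions in the loop and one memory access per clock period, the bound is 2 clock periods per iteration, which is the rate quoted above.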

There are occasions when the first instruction is not considered to be part of the block and no instruction issue time is computed for it. This happens when the block is frequently branched to using a label+4 address specification. The following example code illustrates this:

.Lt.0.274:      # 0x9ac
         or $3,$9,$0                     #
         or $6,$10,$0                    # [0]

The most frequent transfer to the label is the following instruction:

     bne $8,$30,.Lt.0.274+4                   # [0,1]  

Relative Branch Prediction Times

If the target processor is an Out-Of-Order processor, the cycle when the hardware will predict the direction of a conditional branch is estimated. This happens at the time the instruction is first read into the instruction decode buffer and is independent of the time that the instruction actually issues.

This time is reported as the first of a pair of integers, in square brackets, in the comment field of the instruction. The second field is the issue time. In the preceding example, branch prediction happens in cycle 0, but the instruction will not issue until cycle 1 (because it has to wait for an input).
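A tool that reads these comments might extract the times as in the following C sketch. It assumes the comment forms shown in this chapter ("# [11]" and "# [0,1]"); the function name is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Parse the bracketed time comment on one line of a .s file.
 * Returns 2 if a "[predict,issue]" pair was found (both outputs set),
 * 1 if only an "[issue]" time was found (only *issue set),
 * 0 if the line carries no time comment. */
int parse_issue_comment(const char *line, int *predict, int *issue)
{
    const char *p = strchr(line, '#');   /* start of the comment field */
    int first, second;

    if (p == NULL)
        return 0;
    p = strchr(p, '[');                  /* start of the bracketed time */
    if (p == NULL)
        return 0;

    switch (sscanf(p, "[%d,%d", &first, &second)) {
    case 2:
        *predict = first;
        *issue = second;
        return 2;
    case 1:
        *issue = first;
        return 1;
    default:
        return 0;
    }
}
```

Applied to the bne line of the preceding example, this returns 2 with a prediction time of 0 and an issue time of 1.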

The compiler attempts to compute the inputs to conditional branches as early as possible so that the branch prediction and issue times are identical, although some conditions prevent it from doing so. The goal is to minimize the number of instructions that are speculatively executed after the prediction and before the direction of the branch can be determined with certainty. It is only when the instruction completes execution that the hardware is certain which branch direction is correct.

If the wrong direction was predicted, all speculatively executed instructions will need to be aborted, wasting time that could have been devoted to completing the program.

nop Instructions

A nop instruction is a real operation that does not change the contents of any registers. There are several that could be used, but the preferred one is sll $0,$0,0, which means “shift left by 0 bits the contents of register $0 and store the result into register $0”.
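The preferred nop is convenient because its machine encoding is all zero bits. The following C sketch builds the encoding of an sll instruction from the standard MIPS register-format fields (opcode 0, rs 0, function code 0); the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Encode "sll rd, rt, sa". The opcode, rs, and function fields are all
 * zero for sll, so only rt, rd, and the shift amount contribute bits. */
uint32_t encode_sll(unsigned rd, unsigned rt, unsigned sa)
{
    assert(rd < 32 && rt < 32 && sa < 32);
    return ((uint32_t)rt << 16) | ((uint32_t)rd << 11) | ((uint32_t)sa << 6);
}
```

Calling encode_sll(0, 0, 0) gives 0x00000000, the encoding of sll $0,$0,0.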

nop instructions usually waste space and should be deleted by the compiler, but there are situations where they are necessary for the correctness of the executed code and cases where they can improve the performance of the executed code. They are most often encountered as a placeholder for the delay slot of a branch instruction, when no other instruction can be found. The following code sequence illustrates this:

         addiu $5,$5,1                   # [0]
         bne $5,$30,.Lt.0.460            # [0,1]
         nop                             # [0]  

Other than their use in the delay slot of conditional branches, nop instructions are used to optimize the fetch and decode performance of processor types that can read, decode and execute multiple instructions in each clock period. These processors cannot group together instructions when a cache line boundary occurs between them, resulting in a delay that can be avoided by inserting one or more nop instructions ahead of a label.

The optimization that attempts this alignment depends on the processor type and the optimization levels selected. In the common case, the first block of each loop is forced to start on a quad word boundary. This is simple and fast although it sometimes causes nops to be added in the middle of a cache line, where they are not useful.

For the highest level of optimization, and only for Out-Of-Order issue processors, closer track is kept of cache line boundaries. This requires that the start of the module (that is, the address of the first text label) be aligned on a cache line boundary, increasing the size of the generated executable but allowing the compiler to avoid unnecessary instructions.

Along with optimally aligning instructions on Out-Of-Order processors, attention is paid to a timing "hiccup" that can occur if a branch instruction is separated from its delay slot instruction by a cache line break. The insertion of a nop before the branch can improve performance slightly. The following is an example of this. The nop forces the bne instruction to start in the next cache line, as can be determined by the address comment in the label field of the next block.

         nop                             # [1]
         bne $0,$1,.Lt.0.550             # [3,5]
         xori $1,$1,1                    # [5]
.BB307.kernel_: # 0x2408
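The check that motivates this nop can be sketched in C. The sketch assumes a 64-byte cache line (the value used for cache line alignment in the .section directive) and 4-byte instructions; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if a branch at branch_addr and its delay slot instruction at
 * branch_addr + 4 fall in different cache lines. */
int delay_slot_splits_line(uint64_t branch_addr, uint64_t line_bytes)
{
    assert(line_bytes >= 8 && (line_bytes & (line_bytes - 1)) == 0);
    return (branch_addr / line_bytes) != ((branch_addr + 4) / line_bytes);
}
```

In the example, the label at 0x2408 places the bne at 0x2400 and the xori at 0x2404, both in the same line. Without the nop, the bne would sit at 0x23fc and its delay slot at 0x2400, on opposite sides of a line boundary.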

Loop Information Comments

Comments are added at the start of loops to indicate the transformations that were applied to the loop. The meanings of most of these are obvious, but some need some explanation:

  • Comments that start with <swpf> or <loop>Not unrolled: indicate that software pipelining failed to optimize the loop. A reason is usually given, although its meaning can be obscure and can refer to details of software theory.

  • Comments that look like <swps> xx cycles per iteration may not contain an accurate count of the number of cycles for Out-Of-Order processors. This is because the exact cycle times are determined much later in the compiler process than when this cycle count is estimated and the comment is constructed. These inaccuracies also affect the numbers that precede % of peak comments.

  • Similarly, for Out-Of-Order processors, the cycle count in comments similar to <sched> Loop schedule length: xxx cycles (ignoring nested loops) is sometimes wrong.

Block Information

A block is a sequence of instructions between two labels. Blocks are usually identified in the assembly file by a comment, placed between the starting label and the first instruction, that contains BB:xxx. The block number that follows BB: identifies each unique block of the program. Comments that start with <freq> BB:xxx frequency = yyy.yy indicate how often the compiler believes the block is executed for each invocation of the function where the block is located.

The comment is followed by (heuristic) or (feedback) to indicate how that average was arrived at. Because many optimizations utilize this information, incorrect information can result in sub-optimal compiler output. It is important that the feedback data be generated by tests that truly represent the expected behavior of the final program so that accurate decisions can be made by the compiler.

Blocks that end with conditional branches also contain comments similar to <freq> BB:xxx => BB:yyy probability = 0.zzzzz. These indicate the compiler's estimation for the direction of each possible branch. Again, it is important for optimal performance that feedback be generated by test cases that are representative of the actual workload.
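If you want to collect these estimates from a listing, a parser along the following lines may help. It assumes the comment has exactly the form quoted above, "<freq> BB:xxx => BB:yyy probability = 0.zzzzz"; because the text describes the comments only as similar to that form, treat the format string as an assumption. The function name is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Parse a branch probability comment. Returns 1 and fills in the three
 * outputs on success; returns 0 if the line is not such a comment. */
int parse_branch_probability(const char *line, int *from_bb, int *to_bb,
                             double *probability)
{
    const char *p = strstr(line, "<freq>");

    if (p == NULL)
        return 0;
    return sscanf(p, "<freq> BB:%d => BB:%d probability = %lf",
                  from_bb, to_bb, probability) == 3;
}
```

A block frequency comment of the form <freq> BB:xxx frequency = yyy.yy does not match this pattern, so the function returns 0 for it.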