Try   HackMD

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Introduction

Design and implementation of assemblers

  • Some common fundamental functions such as:
    • translating mnemonic operation codes to their machine language equivalents
    • assigning machine addresses to symbolic labels used by the programmer.
  • The features and design of an assembler depend heavily upon the source language it translates and the machine language it produces.
  • Machine dependent/independent features

Basic Assembler Functions

  • Example of SIC assembler language program

  • Assembler Directive: 在組語程式中,用來告訴 Assembler 某些資訊或是該做哪些動作,不屬於 CPU 的指令

    • START: specify name and starting address for the program.
    • END: indicate the end of the source program and (optionally) specify the first executable instruction in the program.
    • Variable/Constant Declaration
      • BYTE
      • WORD
      • RESB
      • RESW
  • The line numbers are for reference only and are not part of the program

A Simple SIC Assembler

  • The translation of source program to object code requires us to accomplish the following functions
    (not necessarily in the order given)
    1. Convert mnemonic operation codes to their machine language equivalents
      • translate STL to 14 (line 10)
    2. Convert symbolic operands to their equivalent machine addresses
      • translate RETADR to 1033 (line 10)
    3. Build the machine instruction in the proper format.
    4. Convert the data constants specified in the source program into their internal machine representations
      • translate EOF to 454F46 (line 80)
    5. Write the object program and the assembly listing.

Forward Reference & Two Passes

  • The instruction at line 10 contains a forward reference.
    • Line 10 存取 RETADR,但 RETADR 在 line 95 才定義
  • Because of forward reference, most assemblers make two passes over the source program.
    • The first pass:
      • does little more than scan the source program for label definitions and assign addresses.
    • The second pass:
      • performs most of the actual translation previously described.
  • In addition, the assembler must process statements called assembler directives (or pseudo-instructions).

Simple Object Program

  • The simple object program format we use contains three types of records.
    • Header
      • program name
      • starting address
      • length
    • Text
      • translated instructions
      • data
      • indication of the addresses where these are to be loaded
    • End
      • marks the end of the object program
      • specifies the address in the program where execution is to begin

Header record:

  • Col. 1: H
  • Col. 2-7: Program name
  • Col. 8-13: starting address of object program
  • Col. 14-19: length of object program in bytes

Text record:

  • Col. 1: T
  • Col. 2-7: starting address for object code in this record
  • Col. 8-9: length of object code in this record in bytes
  • Col. 10-69: object code, represented in hexadecimal
  • Object code 的部分可以容納 60 個 symbol(每個 4 bit),若以 byte 計算就是 30 bytes。
  • 一個 instruction 長 3 bytes,所以可以放 10 個 instructions
  • 在指令不連續時(通常是中間有宣告變數),會在沒填滿的情況下使用新的 text record。

End record:

  • Col.1: E
  • Col. 2-7: address of first executable instruction in object program

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

  • There is no object code corresponding to addresses 1033-2038.
  • This storage is simply reserved by the loader for use by the program during execution.

Two Passes

Pass 1 (define symbols):

  • Assign addresses to all statements in the program.
  • Save the values (addresses) assigned to all labels for use in Pass 2.
  • Perform some processing of assembler directives.
    • e.g., determining the length of data areas defined by BYTE, RESW, etc.)

Pass 2 (assemble instructions and generate object program):

  • Assemble instructions (translating operation codes and looking up addresses)
  • Generate data values defined by BYTE, WORD, etc.
  • Perform processing of assembler directives not done during Pass 1.
  • Write the object program and the assembly listing.

Assembler Algorithm and Data Structures

Major internal data structures of simple assembler:

  • Operation code table (OPTAB)
  • Symbol table (SYMTAB)
  • Location Counter (LOCCTR)
    • Initialized to the beginning address specified in the START statement.

Operation Code Table (OPTAB)

  • Contains (at least) the mnemonic operation code and its machine language equivalents
    • In more complex assemblers, this table also contains information about instruction format and length.
  • OPTAB is usually organized as a hash table, with mnemonic operation code as the key.
    • The information in OPTAB is predefined when the assembler itself is written.

Example of OPTAB

Mnemonic Machine Language
ADD 18
ADDF 58
ADDR 90
  • Operation code table 的內容不太會變動
  • 使用目的是用來加快 mnemonic
    machine language 的速度
    • 需要使用查詢速度很快的資料結構
  • 所以會使用 Hash Table 來實作 OPTAB
    • 查詢的平均複雜度是
      O(1)

Symbol Table (SYMTAB)

  • Includes the name and value (address) for each label in the source program
    • together with flags to indicate error conditions
      • e.g., a symbol defined in two different places
  • It may also contain other information about the data area or instruction labeled
    • for example, its type or length.
  • During Pass 1
    • labels are entered into SYMTAB as they are encountered in the source program
    • along with their assigned addresses.
  • During Pass 2
    • symbols used as operands are looked up in SYMTAB
      the addresses to be inserted in the assembled instructions.
  • SYMTAB is usually organized as a hash table
    • for efficiency of insertion and retrieval.
  • Since entries are rarely deleted from this table, efficiency of deletion is not an important consideration.

Example: SYMTAB of SIC Program

Label Location
COPY 1000
FIRST 1000
CLOOP 1003
ENDFIL 1015
EOF 102A
THREE 102D

對 hash table 來說,insertion(插入) 和 retrieval(查詢),平均複雜度都是

O(1)

Hash Table

  • Care should be taken in the selection of a hashing function.
    • Programmers often select many labels that have similar characteristics
      • for example, labels that start or end with the same characters (like LOOP1, LOOP2, LOOPA).
    • It is important that the hashing function used perform well with such non-random keys.
      • 對於任意的 key,hash 的結果越不容易重複越好
    • Division of the entire key by a prime table length often gives good results.

Algorithms of Passes

Pass 1 usually writes an intermediate file

  • contains each source statement together with its assigned addresses, error indicators, etc.
  • This file is used as the input to Pass 2.

Pseudo Code of Pass1

begin read first input line if OPCODE = 'START' then begin save #[OPERAND] as starting address initialize LOCCTR to starting address write line to intermediate file read next input line end {if START} else initialize LOCCTR to 0 while OPCODE != 'END' do begin if this is not a comment line then begin if there is a symbol in the LABEL field then begin search SYMTAB for LABEL if found then set error flag (duplicate symbol) else insert (LABEL, LOCCTR) into SYMTAB end {if symbol} search OPTAB for OPCODE if found then add 3 {instruction length} to LOCCTR else if OPCODE = 'WORD ' then add 3 to LOCCTR else if OPCODE = 'RESW' then add 3 * #[OPERAND] to LOCCTR else if OPCODE = 'RESB' then add #[OPERAND] to LOCCTR else if OPCODE = 'BYTE' then begin find length of constant in bytes add length to LOCCTR end {if BYTE} else set error flag (invalid operation code) end {if not a comment} write line to intermediate file read next input line end {while not END} write last line to intermediate file save (LOCCTR - starting address) as program length end {Pass 1}

Presudo Code of Pass2

begin read first input line {from intermediate file} if OPCODE = 'START' then begin Write listing line read next input line end {if START} write Header record to object program initialize first Text record while OPCODE != 'END' do begin if this is not a comment line then begin search OPTAB for OPCODE if found then begin if there is a symbol in OPERAND field then begin search SYMTAB for OPERAND if found then store symbol value as operand address else begin store 0 as operand address set error flag (undefined symbol) end end {if symbol} else store 0 as operand address assemble the object code instruction end {if opcode found} else if OPCODE = 'BYTE' or 'WORD' then convert constant to object code if object code will not fit into the current Text record then begin write Text record to object program initialize new Text record end add object code to Text record end {if not comment } write listing line read next input line end {while not END} write last Text record to object program write End record to object program write last listing line end {Pass 2}

Machine-Dependent Assembler Features

  • Example of SIC/XE Program

  • The assembler directive BASE is used in conjunction with base relative addressing.

    • BASE 讓 assembler 知道 base register 的值
    • Assembler 必須知道 base 的值,才能計算指令中的 displacement
  • If the displacements required for relative addressing are too large to fit into a 3-byte instruction

    the 4-byte extended format (Format 4) must be used.

    • The bit e is set to 1 to indicate extended instruction format.
    • The extended instruction format is specified with the prefix + added to the operation code in the source statement.
    • It is the programmer’s responsibility to specify this form of addressing when it is required.
  • Advantages of the more advanced SIC/XE architecture to improve the execution speed of the program

    • Register-to-register instructions are faster than register-to-memory operations.
    • When using immediate addressing, the operand is already present as part of the instruction and need not be fetched from anywhere.
    • The use of indirect addressing often avoids the need for another instruction.

Instruction Formats and Addressing Modes

  • The conversion of register mnemonics to numbers can be done with a separate table

    • however, it is often convenient to use the symbol table for this purpose.
    • To do this, SYMTAB would be preloaded with the register names and their values.
  • If extended format is not specified, the assembler first attempts to translate the instruction using program-counter relative addressing.

    • If this is not possible (because the required displacement is out of range), the assembler then attempts to use base relative addressing.
    • If neither form of relative addressing is applicable and extended format is not specified, then the instruction cannot be properly assembled.
      • In this case, the assembler must generate an error message.

Displacement Calculation

  • The computation that the assembler needs to perform is essentially the target address calculation in reverse.
    • disp=TA(B)
    • disp=TA(PC)

PC Relative

  • The program counter is advanced after each instruction is fetched and before it is executed.
    • Thus during the execution of an instruction, program counter will contain the address of the next instruction.
  • The assembler knows what the contents of the program counter will be at execution time.

Base Relative

  • The base register is under control of the programmer.
  • The programmer must tell the assembler what the base register will contain during execution
    so that the assembler can compute displacements.
    • This is done with the assembler directive BASE.
    • NOBASE
    • These assembler directives produce no executable code.
    • The programmer must provide instructions that load the proper value into the base register during execution.

Program Relocation

當一個設備同時執行多個程式時,我們就無法知道該東西當時的記憶體位置,所以要藉由告知 loader 一些資訊,讓其可以在程式執行時做一些修正。

  • It is often desirable to have more than one program at a time sharing the memory and other resources of the machine.

    • In such a situation the actual starting address of the program is not known until load time.
    • Some address portions of the program need to be changed, while some parts (such as constants) should remain the same regardless of where the program is loaded.
    • Looking at the object code alone, it is in general not possible to tell which values represent addresses and which represent constant data items.
  • Since the assembler does not know the actual location where the program will be loaded, it cannot make the necessary changes in the addresses. However, the assembler can identify for the loader those parts of the object program that need modification.

  • An object program that contains the information necessary to perform this kind of modification is called a relocatable program(可重定位程式).

  • We can solve the relocation problem in the following way:

    • When the assembler generates the object code, it will produce the addresses relative to the start of the program.
    • The assembler will also produce a command for the loader, instructing it to add the beginning address of the program to the address fields at load time.
  • Modification record (part of the object program)

    • Col. 1: M
    • Col. 2-7: starting location of the address field to be modified, relative to the beginning of the program (hexadecimal)
    • Col. 8-9: length of the address field to be modified, in half bytes (hexadecimal)
    • If the field occupies an odd number of half-bytes, it is assumed to begin in the middle of the first byte at the starting location.
    • M00000705 for Line 15

  • In some cases no modification is needed because the instruction operand is not a memory address at all (e.g., CLEAR S or LDA #3).

  • In other cases no modification is needed because the operand is specified using program-counter relative or base relative addressing.

  • It should be clear that the only parts of the program that require modification at load time are those that specify direct addresses.

    • The program in Fig. 2.1 requires relocation for almost every instruction.
  • Figure 2.8 shows the complete object program.

    • There is one Modification record for each address field that needs to be changed when the program is relocated.

基本上,只有 format 4 會需要 modification。

Machine-Independent Assembler Features

Literals

  • It is often convenient for the programmer to be able to write the value of a constant operand as a part of the instruction that uses it. Such an operand is called a literal because the value is stated literally in the instruction.

  • A literal is identified with the prefix =, which is followed by a specification of the literal value, using the same notation as in the BYTE statement.

  • The effect of using a literal is exactly the same as if the programmer had defined the constant explicitly and used the label assigned to the constant as the instruction operand.

  • It is important to understand the difference between a literal and an immediate operand.

    • With immediate addressing, the operand value is assembled as part of the machine instruction. (line 55 in Figure 2.10)
    • With a literal, the assembler generates the specified value as a constant at some other memory location. (line 45 in Figure 2.10)
  • All of the literal operands used in a program are gathered together into one or more literal pools. (Figure 2.10)

  • In some cases, however, it is desirable to place literals into a pool at some other location in the object program. (e.g., avoiding having to use extended format instructions when referring to the literals)

    • LTORG (line 93 in Figure 2.10)
  • Most assemblers recognize duplicate literals and store only one copy of the specified data value. (the literal =X’05’ on lines 215 and 230)

  • The easiest way to recognize duplicate literals is by comparison of the character strings defining them. Sometimes a slight additional saving is possible if we look at the generated data value instead of the defining expression. (e.g., the literals =C’EOF’ and =X’454F46’ would specify identical operand values.)

    • However, the benefits realized in this way are usually not great enough to justify the additional complexity in the assembler.
  • If we use the character string defining a literal to recognize duplicates, we must be careful of literals whose value depends upon their location in the program.

    • BASE *
    • LDB =*
  • A literal table LITTAB is often organized as a hash table, using the literal name or value as the key.

    • Literal name
    • The operand value and length
    • The address assigned in a literal pool
  • Each literal operand is recognized during Pass 1.

  • During Pass 2, the operand address for use in generating object code is obtained by searching LITTAB for each literal operand encountered.

  • If a literal value represents an address in the program (e.g., a location counter value), the assembler must also generate the appropriate Modification record.

Symbol-Defining Statements

Symbol EQU value

The value may be given as a constant or as any expression involving constants and previously defined symbols.

  • One common use of EQU is to establish symbolic names that can be used for improved readability in place of numeric values.

    1. ​​​​​​​+LDT  #4096
      
    2. ​​​​​​​MAXLEN  EQU  4096
      ​​​​+LDT   #MAXLEN
      
  • The resulting object code is exactly the same as in the original version of the instruction; however, the source statement is easier to understand. It is also much easier to find and change the value of MAXLEN.

  • Another common use of EQU is in defining mnemonic names for registers.

    ​​​​A  EQU 0
    ​​​​X  EQU 1
    ​​​​L  EQU 2
    

ORG value

Where value is a constant or an expression involving constants and previously defined symbols.

When this statement is encountered during assembly of a program, the assembler resets its location counter to the specified value.

  • Consider a symbol table STAB with 100 entries, the SYMBOL field contains a 6-byte user-defined symbol; VALUE is a one-word representation of the value assigned to the symbol; FLAGS is a 2-byte field that specifies symbol type and other information.

    • ​​​​​​STAB  RESB  1100
      ​​​​​​SYMBOL   EQU  STAB
      ​​​​​​VALUE    EQU   STAB+6
      ​​​​​​FLAGS   EQU   STAB+9
      ​​​​​​LDA   VALUE, X
      
    • However, this method of definition simply defines the labels; it does not make the structure of the table as clear as it might be.
    • ​​​​​​STAB   RESB   1100
      ​​​​​​       ORG    STAB
      ​​​​​​SYMBOL   RESB  6
      ​​​​​​VALUE    RESW  1
      ​​​​​​FLAGS    RESB  2
      ​​​​​​       ORG    STAB+1100
      
    • The last ORG statement is very important. It sets LOCCTR back to its previous value.
    • In some assemblers the previous value of LOCCTR is automatically remembered, so we can simply write
      ​​​​​​​​ORG
      
  • All symbols used on the right-hand side of the statement—that is, all terms used to specify the value of the new symbol—must have been defined previously in the program. (e.g., the following example would not be allowed)

    • ​​​​​​BETA   EQU   ALPHA
      ​​​​​​ALPHA  RESW  1 
      

Expressions

  • Individual terms in the expression may be constants, user-defined symbols, or special terms (e.g., the value of the location counter)
  • The values of terms and expressions are either relative or absolute.
    • A constant is, of course, an absolute term.
    • Labels on instructions and data areas, and references to the location counter value, are relative terms.
  • An expression that contains only absolute terms is, of course, an absolute expression.
  • Absolute expressions may also contain relative terms provided the relative terms occur in pairs and the terms in each such pair have opposite signs.
    • None of the relative terms may enter into a multiplication or division operation.
  • A relative expression is one in which all of the relative terms except one can be paired as described above; the remaining unpaired relative term must have a positive sign.
  • Absolute, relative, or error? (in Figure 2.10)
    ​​​​RETADR            // Absolute
    ​​​​BUFFER            // Absolute
    ​​​​BUFFEND           // Relative
    ​​​​MAXLEN            // Absolute
    ​​​​BUFEND+BUFFER     // Error
    ​​​​100-BUFFER        // Error
    ​​​​3*BUFFER          // Error
    
  • We need a flag in the symbol table to indicate type of value (absolute or relative) in addition to the value itself.

Program Blocks

  • We use the term program blocks to refer to segments of code that are rearranged within a single object program unit.
  • The assembler directive USE indicates which portions of the source program belong to the various blocks.
  • Figure 2.11
    • The first (unnamed) program block contains the executable instructions of the program.
    • The second (named CDATA) contains all data areas that are a few words or less in length.
    • The third (named CBLKS) contains all data areas that consist of larger blocks of memory.
  • The assembler will (logically) rearrange these segments to gather together the pieces of each block. These blocks will then be assigned addresses in the object program, with the blocks appearing in the same order in which they were first begun in the source program.
  • The assembler accomplishes this logical rearrangement of code by maintaining, during Pass 1, a separate location counter for each program block.
  • When labels are entered into the symbol table, the block name or number is stored along with the assigned relative address.
  • At the end of Pass 1 the latest value of the location counter for each block indicates the length of that block. The assembler can then assign to each block a starting address in the object program.
  • For code generation during Pass 2, the assembler simply adds the location of the symbol, relative to the start of its block, to the assigned block starting address.
  • Figure 2.12
    • Notice that the value of the symbol MAXLEN (line 107) is shown without a block number. This indicates that MAXLEN is an absolute symbol, whose value is not relative to the start of any program block.
    • Line 20
  • We can immediately see that the separation of the program into blocks has considerably reduced the addressing problems. Because the large buffer area is moved to the end of the object program, we no longer need to use extended format instructions on lines 15, 35, and 65.
  • Program readability is often improved if the definitions of data areas are placed in the source program close to the statements that reference them.
  • The use of program blocks is one way of satisfying both of these requirements, with the assembler providing the required reorganization.
  • It is not necessary to physically rearrange the generated code in the object program to place the pieces of each program block together. The assembler can simply write the object code as it is generated during Pass 2 and insert the proper load address in each Text record. The loader will simply load the object code from each record at the indicated address.
    • Figure 2.13, Figure 2.14

Control Sections and Program Linking

  • A control section is a part of the program that maintains its identity after assembly; each such control section can be loaded and relocated independently of the others.
  • When control sections form logically related parts of a program, it is necessary to provide some means for linking them together.
  • Instructions in one control section might need to refer to instructions or data located in another section. The assembler is unable to process these references in the usual way. The assembler has no idea where any other control section will be located at execution. Such references between control sections are called external references.

Figure 2.15

  • CSECT
  • Three control sections: COPY, RDREC, WRREC
  • The assembler establishes a separate location counter for each control section, just as it does for program blocks.
  • EXTDEF (external definition for external symbol)
  • EXTREF (external reference)

Figure 2.16

  • Line 15:
    ​​​​15    0003    CLOOP     +JSUB	RDREC		4B100000
    
  • Line 160
    ​​​​160   0017              +STCH	BUFFER,X	57900000
    
  • Line 190
    ​​​​190   0028    MAXLEN    WORD	BUFEND-BUFFER	000000
    
    • The assembler has no idea where the control section containing RDREC will be loaded, so it cannot assemble the address for this instruction. Instead the assembler inserts an address of zero and passes information to the loader, which will cause the proper address to be inserted at load time.
    • The address of RDREC will have no predictable relationship to anything in this control section therefore relative addressing is not possible. Thus an extended format instruction must be used to provide room for the actual address to be inserted.
  • The assembler must also allow the same symbol to be used in different control sections.

Define record

  • Col. 1: D
  • Col. 2-7: name of external symbol defined in this control section
  • Col. 8-13: relative address of symbol within this control section
  • Col. 14-73: Repeat information in Col. 2-13 for other external symbols.

Refer record

  • Col. 1: R
  • Col. 2-7: name of external symbol referred to in this control section
  • Col. 8-73: names of other external reference symbols

New format of the Modification record type

  • Col. 1: M
  • Col. 2-7: starting address of the field to be modified
  • Col. 8-9: length of the field to be modified, in half-bytes
  • Col. 10: Modification flag (+ or -)
  • Col. 11-16: External symbol whose value is to be added to or subtracted from the indicated field.

Figure 2.17

  • Our earlier definitions required that all of the relative terms in an expression be paired (for an absolute expression), or that all except one be paired (for a relative expression). We must now extend this restriction to specify that both terms in each pair must be relative within the same control section.
  • When an expression involves external references, the assembler cannot in general determine whether or not the expression is legal. Whether the terms occur in the same control section is unknown at assembly time. In such a case, the assembler generates Modification records so the loader can check the expression for errors.