Chapter02 - Assemblers - HackMD

<style> font { color: red; font-weight: bold; } </style> ![](https://i.imgur.com/hSiRp3g.png) ## Introduction Design and implementation of assemblers - Some **common fundamental functions** such as: - translating **mnemonic operation codes** to their **machine language** equivalents - assigning machine **addresses** to **symbolic labels** used by the programmer. - The features and design of an assembler depend heavily upon the source language it translates and the machine language it produces. - **Machine dependent/independent features** ## Basic Assembler Functions - [**Example of SIC assembler language program**](/bi01pjLMRFKh0euMhhuPGQ?view#Example-of-SIC-Assembly-Program) - **Assembler Directive**: 在組語程式中，用來告訴 Assembler 某些資訊或是該做哪些動作，不屬於 CPU 的指令 - `START`: specify **name** and **starting address** for the program. - `END`: indicate the **end of the source program** and (optionally) specify the first executable instruction in the program. - **Variable/Constant Declaration** - `BYTE` - `WORD` - `RESB` - `RESW` - The line numbers are for reference only and are not part of the program ### A Simple SIC Assembler - The translation of source program to object code requires us to accomplish the following functions (not necessarily in the order given) 1. Convert **mnemonic operation codes** to their machine **language equivalents** - translate STL to 14 (line 10) 2. Convert **symbolic operands** to their equivalent **machine addresses** - translate RETADR to 1033 (line 10) 3. Build the **machine instruction** in the **proper format**. 4. Convert the **data constants** specified in the source program into their **internal machine representations** - translate EOF to 454F46 (line 80) 5. **Write the object program** and the **assembly listing**. ### Forward Reference & Two Passes - The instruction at line 10 contains a <font>forward reference</font>. - Line 10 存取 `RETADR`，但 `RETADR` 在 line 95 才定義 - Because of forward reference, most assemblers make **two passes** over the source program. - **The first pass**: - does little more than scan the source program for **label definitions** and **assign addresses**. - **The second pass**: - performs most of the **actual translation** previously described. - In addition, the assembler must **process statements** called <font>assembler directives</font> (or <font>pseudo-instructions</font>). ### Simple Object Program - The simple object program format we use contains **three types of records**. - **Header** - program name - starting address - length - **Text** - translated instructions - data - indication of the addresses where these are to be loaded - **End** - marks the end of the object program - specifies the address in the program where execution is to begin **Header record**: - Col. 1: H - Col. 2-7: Program name - Col. 8-13: starting address of object program - Col. 14-19: length of object program in bytes **Text record**: - Col. 1: T - Col. 2-7: starting address for object code in this record - Col. 8-9: length of object code in this record in bytes - Col. 10-69: object code, represented in **hexadecimal** :::info - Object code 的部分可以容納 60 個 **symbol**（每個 4 bit），若以 byte 計算就是 30 bytes。 - 一個 instruction 長 3 bytes，所以可以放 10 個 instructions - 在指令不連續時（通常是中間有宣告變數），會在沒填滿的情況下使用新的 text record。 ::: **End record**: - Col.1: E - Col. 2-7: address of first executable instruction in object program ![](https://i.imgur.com/K3uyzDu.jpg) - There is no object code corresponding to addresses 1033-2038. - This storage is simply reserved by the loader for use by the program during execution. ### Two Passes **Pass 1** (define symbols): - **Assign addresses** to all **statements** in the program. - Save the values (addresses) assigned to all **labels** for use in Pass 2. - Perform some **processing of assembler directives**. - e.g., **determining the length of data areas** defined by BYTE, RESW, etc.) **Pass 2** (assemble instructions and generate object program): - Assemble **instructions** (translating operation codes and looking up addresses) - Generate **data values** defined by BYTE, WORD, etc. - Perform processing of assembler directives not done during Pass 1. - Write the object program and the assembly listing. ## Assembler Algorithm and Data Structures Major internal data structures of simple assembler: - **Operation code table** (OPTAB) - Symbol table (SYMTAB) - Location Counter (LOCCTR) - **Initialized to the beginning address** specified in the START statement. ### Operation Code Table (OPTAB) - Contains (at least) the **mnemonic operation code** and its **machine language equivalents** - In more complex assemblers, this table also contains information about instruction format and length. - OPTAB is usually organized as a **hash table**, with **mnemonic operation code as the <font>key</font>**. - The information in OPTAB is **predefined** when the assembler itself is written. **Example of OPTAB** | Mnemonic | Machine Language | |:-------- |:---------------- | | `ADD` | 18 | | `ADDF` | 58 | | `ADDR` | 90 | | ... | ... | :::info - Operation code table 的內容不太會變動 - 使用目的是用來加快 mnemonic $\Rightarrow$ machine language 的速度 - 需要使用**查詢**速度很快的資料結構 - 所以會使用 Hash Table 來實作 OPTAB - 查詢的平均複雜度是 $O(1)$ ::: ### Symbol Table (SYMTAB) - Includes the **name** and **value (address)** for each **label** in the source program - together with **flags** to indicate <font>error conditions</font> - e.g., a symbol defined in two different places - It may also contain other information about the data area or instruction labeled - for example, its **type** or **length**. - During <font>Pass 1</font> - **labels** are entered into SYMTAB as they are encountered in the source program - along with their **assigned addresses**. - During <font>Pass 2</font> - symbols used as **operands** are **looked up in SYMTAB** $\Rightarrow$ **the addresses** to be inserted in the assembled instructions. - SYMTAB is usually organized as a **hash table** - for **efficiency** of <font>insertion</font> and <font>retrieval</font>. - Since entries are rarely deleted from this table, efficiency of deletion is not an important consideration. **Example: SYMTAB of [SIC Program](/bi01pjLMRFKh0euMhhuPGQ#Example-of-SIC-Program)** | Label | Location | |:-------- |:-------- | | `COPY` | 1000 | | `FIRST` | 1000 | | `CLOOP` | 1003 | | `ENDFIL` | 1015 | | `EOF` | 102A | | `THREE` | 102D | | ... | ... | :::info 對 hash table 來說，**insertion**(插入) 和 **retrieval**(查詢)，平均複雜度都是 $O(1)$ ::: ### Hash Table - Care should be taken in the selection of a hashing function. - Programmers often select many **labels** that have **similar characteristics** - for example, labels that start or end with the same characters (like LOOP1, LOOP2, LOOPA). - It is important that the hashing function used perform well with such non-random keys. - 對於任意的 key，hash 的結果越不容易重複越好 - Division of the entire key by a **prime table length** often gives good results. ### Algorithms of Passes Pass 1 usually writes an <font>intermediate file</font> - contains each **source statement** together with its **assigned addresses**, **error indicators**, etc. - This file is used **as the input to Pass 2**. **Pseudo Code of Pass1** ```vhdl= begin read first input line if OPCODE = 'START' then begin save #[OPERAND] as starting address initialize LOCCTR to starting address write line to intermediate file read next input line end {if START} else initialize LOCCTR to 0 while OPCODE != 'END' do begin if this is not a comment line then begin if there is a symbol in the LABEL field then begin search SYMTAB for LABEL if found then set error flag (duplicate symbol) else insert (LABEL, LOCCTR) into SYMTAB end {if symbol} search OPTAB for OPCODE if found then add 3 {instruction length} to LOCCTR else if OPCODE = 'WORD ' then add 3 to LOCCTR else if OPCODE = 'RESW' then add 3 * #[OPERAND] to LOCCTR else if OPCODE = 'RESB' then add #[OPERAND] to LOCCTR else if OPCODE = 'BYTE' then begin find length of constant in bytes add length to LOCCTR end {if BYTE} else set error flag (invalid operation code) end {if not a comment} write line to intermediate file read next input line end {while not END} write last line to intermediate file save (LOCCTR - starting address) as program length end {Pass 1} ``` **Presudo Code of Pass2** ```vhdl= begin read first input line {from intermediate file} if OPCODE = 'START' then begin Write listing line read next input line end {if START} write Header record to object program initialize first Text record while OPCODE != 'END' do begin if this is not a comment line then begin search OPTAB for OPCODE if found then begin if there is a symbol in OPERAND field then begin search SYMTAB for OPERAND if found then store symbol value as operand address else begin store 0 as operand address set error flag (undefined symbol) end end {if symbol} else store 0 as operand address assemble the object code instruction end {if opcode found} else if OPCODE = 'BYTE' or 'WORD' then convert constant to object code if object code will not fit into the current Text record then begin write Text record to object program initialize new Text record end add object code to Text record end {if not comment } write listing line read next input line end {while not END} write last Text record to object program write End record to object program write last listing line end {Pass 2} ``` ## Machine-Dependent Assembler Features - [**Example of SIC/XE Program**](/bi01pjLMRFKh0euMhhuPGQ?view#Example-of-SICXE-Program) - The **assembler directive** `BASE` is used in conjunction with **base relative addressing**. - 用 `BASE` 讓 assembler 知道 base register 的值 - Assembler 必須知道 base 的值，才能計算指令中的 displacement - If the **displacements** required for relative addressing are **too large** to fit into a **3-byte instruction** $\Rightarrow$ the **4-byte extended format** (Format 4) must be used. - The **bit e is set to 1** to indicate extended instruction format. - The extended instruction format is **specified with the prefix +** added to the **operation code** in the <font>source statement</font>. - It is the **programmer’s responsibility to specify** this form of addressing when it is required. - Advantages of the more advanced SIC/XE architecture to improve the execution speed of the program - **Register-to-register instructions** are faster than register-to-memory operations. - When using **immediate addressing**, the operand is already present as part of the instruction and **need not be fetched from anywhere**. - The use of **indirect addressing** often **avoids the need for another instruction**. ### Instruction Formats and Addressing Modes - The conversion of **register mnemonics** to **numbers** can be done with a separate table - however, it is often convenient to use the **symbol table** for this purpose. - To do this, SYMTAB would **be preloaded with the register names** and their values. - If extended format is not specified, the assembler **first** attempts to translate the instruction using **program-counter relative addressing**. - If this is not possible (because the required displacement is out of range), the assembler then attempts to use **base relative addressing**. - If neither form of relative addressing is applicable and extended format is not specified, then the instruction cannot be properly assembled. - In this case, the assembler must generate an <font>error message</font>. ### Displacement Calculation - The computation that the assembler needs to perform is essentially the target address calculation in reverse. - $disp = TA - (B)$ - $disp = TA - (PC)$ **PC Relative** - The program counter is advanced after each instruction is fetched and before it is executed. - Thus during the execution of an instruction, program counter will contain **the address of the next instruction**. - The assembler knows what the contents of the program counter will be at execution time. **Base Relative** - The base register is under control of the programmer. - The programmer must tell the assembler what the base register will contain during execution $\Rightarrow$ so that the assembler can compute displacements. - This is done with the **assembler directive `BASE`**. - `NOBASE` - These assembler directives produce no executable code. - The programmer must provide instructions that load the proper value into the base register during execution. ### Program Relocation :::warning 當一個設備同時執行多個程式時，我們就無法知道該東西當時的記憶體位置，所以要藉由告知 loader 一些資訊，讓其可以在程式執行時做一些修正。 ::: - It is often desirable to have more than one program at a time sharing the memory and other resources of the machine. - In such a situation the actual starting address of the program is not known until load time. - Some address portions of the program need to be changed, while some parts (such as constants) should remain the same regardless of where the program is loaded. - Looking at the object code alone, it is in general **not possible to tell which values represent addresses and which represent constant data items**. - Since the assembler does not know the actual location where the program will be loaded, it cannot make the necessary changes in the addresses. However, the **assembler** can identify for the **loader** those parts of the object program that need modification. - An object program that contains the information necessary to perform this kind of modification is called a <font>relocatable</font> program(可重定位程式). ![](https://i.imgur.com/w3yW6U3.png) - We can solve the relocation problem in the following way: - When the assembler generates the object code, it will produce the addresses relative to the start of the program. - The assembler will also produce a command for the loader, instructing it to add the beginning address of the program to the address fields at load time. - **Modification record** (part of the object program) - **Col. 1:** M - **Col. 2-7:** starting location of the address field to be modified, relative to the beginning of the program (hexadecimal) - **Col. 8-9:** length of the address field to be modified, in <font>half bytes</font> (hexadecimal) - If the field occupies an odd number of half-bytes, it is assumed to begin in the middle of the first byte at the starting location. - `M00000705` for Line 15 ![](https://i.imgur.com/iSkmAdF.jpg) - In some cases no modification is needed because the instruction operand is not a memory address at all (e.g., CLEAR S or LDA #3). - In other cases no modification is needed because the operand is specified using program-counter relative or base relative addressing. - It should be clear that the only parts of the program that require modification at load time are those that specify direct addresses. - The program in Fig. 2.1 requires relocation for almost every instruction. - Figure 2.8 shows the complete object program. - There is one Modification record for each address field that needs to be changed when the program is relocated. ==基本上，只有 format 4 會需要 modification。== ## Machine-Independent Assembler Features ### Literals - It is often convenient for the programmer to be able to write the value of a constant operand as a part of the instruction that uses it. Such an operand is called a <font>literal</font> because the value is stated literally in the instruction. - A literal is identified with the prefix =, which is followed by a specification of the literal value, using the same notation as in the BYTE statement. - The effect of using a literal is exactly the same as if the programmer had defined the constant explicitly and used the label assigned to the constant as the instruction operand. - It is important to understand the difference between a literal and an immediate operand. - With immediate addressing, the operand value is assembled as part of the machine instruction. (line 55 in Figure 2.10) - With a literal, the assembler generates the specified value as a constant at some other memory location. (line 45 in Figure 2.10) - All of the literal operands used in a program are gathered together into one or more <font>literal pools</font>. (Figure 2.10) - In some cases, however, it is desirable to place literals into a pool at some other location in the object program. (e.g., avoiding having to use extended format instructions when referring to the literals) - <font>LTORG</font> (line 93 in Figure 2.10) - Most assemblers recognize duplicate literals and store only one copy of the specified data value. (the literal `=X’05’` on lines 215 and 230) - The easiest way to recognize duplicate literals is by comparison of the character strings defining them. Sometimes a slight additional saving is possible if we look at the generated data value instead of the defining expression. (e.g., the literals `=C’EOF’` and `=X’454F46’` would specify identical operand values.) - However, the benefits realized in this way are usually not great enough to justify the additional complexity in the assembler. - If we use the character string defining a literal to recognize duplicates, we must be careful of literals whose value depends upon their location in the program. - `BASE *` - `LDB =*` - A literal table LITTAB is often organized as a hash table, using the literal name or value as the key. - Literal name - The operand value and length - The address assigned in a literal pool - Each literal operand is recognized during Pass 1. - During Pass 2, the operand address for use in generating object code is obtained by searching LITTAB for each literal operand encountered. - If a literal value represents an address in the program (e.g., a location counter value), the assembler must also generate the appropriate Modification record. ### Symbol-Defining Statements #### **Symbol <font>`EQU`</font> value** The value may be given as a constant or as any expression involving constants and previously defined symbols. - One common use of EQU is to establish symbolic names that can be used for improved readability in place of numeric values. 1. ``` +LDT #4096 ``` 2. ``` MAXLEN EQU 4096 +LDT #MAXLEN ``` - The resulting object code is exactly the same as in the original version of the instruction; however, the source statement is easier to understand. It is also much easier to find and change the value of MAXLEN. - Another common use of EQU is in defining mnemonic names for registers. ``` A EQU 0 X EQU 1 L EQU 2 ``` #### **<font>`ORG`</font> value** Where value is a constant or an expression involving constants and previously defined symbols. When this statement is encountered during assembly of a program, the assembler resets its location counter to the specified value. - Consider a symbol table STAB with 100 entries, the SYMBOL field contains a 6-byte user-defined symbol; VALUE is a one-word representation of the value assigned to the symbol; FLAGS is a 2-byte field that specifies symbol type and other information. - ``` STAB RESB 1100 SYMBOL EQU STAB VALUE EQU STAB+6 FLAGS EQU STAB+9 LDA VALUE, X ``` - However, this method of definition simply defines the labels; it does not make the structure of the table as clear as it might be. - ``` STAB RESB 1100 ORG STAB SYMBOL RESB 6 VALUE RESW 1 FLAGS RESB 2 ORG STAB+1100 ``` - The last `ORG` statement is very important. It sets LOCCTR back to its previous value. - In some assemblers the previous value of LOCCTR is automatically remembered, so we can simply write ``` ORG ``` - All symbols used on the right-hand side of the statement—that is, all terms used to specify the value of the new symbol—must have been defined previously in the program. (e.g., the following example would not be allowed) - ``` BETA EQU ALPHA ALPHA RESW 1 ``` ### Expressions - Individual terms in the expression may be constants, user-defined symbols, or special terms (e.g., the value of the location counter) - The values of terms and expressions are either <font>relative</font> or <font>absolute</font>. - A constant is, of course, an absolute term. - Labels on instructions and data areas, and references to the location counter value, are relative terms. - An expression that contains only absolute terms is, of course, an absolute expression. - Absolute expressions may also contain relative terms provided the relative terms occur in pairs and the terms in each such pair have opposite signs. - None of the relative terms may enter into a multiplication or division operation. - A relative expression is one in which all of the relative terms except one can be paired as described above; the remaining unpaired relative term must have a positive sign. - Absolute, relative, or error? (in Figure 2.10) ``` RETADR // Absolute BUFFER // Absolute BUFFEND // Relative MAXLEN // Absolute BUFEND+BUFFER // Error 100-BUFFER // Error 3*BUFFER // Error ``` - We need a flag in the symbol table to indicate type of value (absolute or relative) in addition to the value itself. ### Program Blocks - We use the term <font>program blocks</font> to refer to segments of code that are rearranged within a single object program unit. - The assembler directive USE indicates which portions of the source program belong to the various blocks. - **Figure 2.11** - The first (unnamed) program block contains the executable instructions of the program. - The second (named CDATA) contains all data areas that are a few words or less in length. - The third (named CBLKS) contains all data areas that consist of larger blocks of memory. - The assembler will (logically) rearrange these segments to gather together the pieces of each block. These blocks will then be assigned addresses in the object program, with the blocks appearing in the same order in which they were first begun in the source program. - The assembler accomplishes this logical rearrangement of code by maintaining, during Pass 1, a separate location counter for each program block. - When labels are entered into the symbol table, the block name or number is stored along with the assigned relative address. - At the end of Pass 1 the latest value of the location counter for each block indicates the length of that block. The assembler can then assign to each block a starting address in the object program. - For code generation during Pass 2, the assembler simply adds the location of the symbol, relative to the start of its block, to the assigned block starting address. - **Figure 2.12** - Notice that the value of the symbol MAXLEN (line 107) is shown without a block number. This indicates that MAXLEN is an absolute symbol, whose value is not relative to the start of any program block. - Line 20 - We can immediately see that the separation of the program into blocks has considerably reduced the addressing problems. Because the large buffer area is moved to the end of the object program, we no longer need to use extended format instructions on lines 15, 35, and 65. - Program <font>readability</font> is often improved if the definitions of data areas are placed in the source program close to the statements that reference them. - The use of program blocks is one way of satisfying both of these requirements, with the assembler providing the required reorganization. - It is not necessary to physically rearrange the generated code in the object program to place the pieces of each program block together. The assembler can simply write the object code as it is generated during Pass 2 and insert the proper load address in each Text record. The loader will simply load the object code from each record at the indicated address. - Figure 2.13, Figure 2.14 ![](https://i.imgur.com/hmGOUb6.jpg) ![](https://i.imgur.com/Sp87SNz.jpg) ## Control Sections and Program Linking - A <font>control section</font> is a part of the program that maintains its identity after assembly; each such control section can be loaded and relocated independently of the others. - When control sections form logically related parts of a program, it is necessary to provide some means for <font>linking</font> them together. - Instructions in one control section might need to refer to instructions or data located in another section. The assembler is unable to process these references in the usual way. The assembler has no idea where any other control section will be located at execution. Such references between control sections are called <font>external references</font>. ### Figure 2.15 - <font>CSECT</font> - Three control sections: `COPY`, `RDREC`, `WRREC` - The assembler establishes a separate location counter for each control section, just as it does for program blocks. - <font>EXTDEF</font> (external definition for external symbol) - <font>EXTREF</font> (external reference) ### Figure 2.16 - **Line 15:** ``` 15 0003 CLOOP +JSUB RDREC 4B100000 ``` - **Line 160** ``` 160 0017 +STCH BUFFER,X 57900000 ``` - **Line 190** ``` 190 0028 MAXLEN WORD BUFEND-BUFFER 000000 ``` - The assembler has no idea where the control section containing `RDREC` will be loaded, so it cannot assemble the address for this instruction. Instead the assembler inserts an address of zero and passes information to the loader, which will cause the proper address to be inserted at load time. - The address of `RDREC` will have no predictable relationship to anything in this control section therefore relative addressing is not possible. Thus an extended format instruction must be used to provide room for the actual address to be inserted. - The assembler must also allow the same symbol to be used in different control sections. **Define record** - **Col. 1:** D - **Col. 2-7:** name of external symbol defined in this control section - **Col. 8-13:** relative address of symbol within this control section - **Col. 14-73:** Repeat information in Col. 2-13 for other external symbols. **Refer record** - **Col. 1:** R - **Col. 2-7:** name of external symbol referred to in this control section - **Col. 8-73:** names of other external reference symbols **New format of the Modification record type** - **Col. 1:** M - **Col. 2-7:** starting address of the field to be modified - **Col. 8-9:** length of the field to be modified, in half-bytes - **Col. 10:** Modification flag (+ or -) - **Col. 11-16:** External symbol whose value is to be added to or subtracted from the indicated field. ### Figure 2.17 - Our earlier definitions required that all of the relative terms in an expression be paired (for an absolute expression), or that all except one be paired (for a relative expression). We must now extend this restriction to specify that both terms in each pair must be relative within the same control section. - When an expression involves external references, the assembler cannot in general determine whether or not the expression is legal. Whether the terms occur in the same control section is unknown at assembly time. In such a case, the assembler generates Modification records so the loader can check the expression for errors.