300 views
# Internals of CPython * current version: Python 3.7 alpha * git commit: 03f68b60e17b57f * contributer: Louie Lu <<me at louie.lu>> This document is base on the work from **Prashanth Raghu**, who wrote [Internals of CPython 3.6](https://mail.python.org/mailman/private/core-mentorship/2017-January/003778.html). You can found it at this [google documents](https://docs.google.com/document/d/1nzNN1jeNCC_bg1LADCvtTuGKvcyMskV1w8Ad2iLlwoI/) and Python [core-mentorship 2017-Jan](https://mail.python.org/mailman/private/core-mentorship). And, also reference from **Guido** [Yet another guided tour of CPython](https://paper.dropbox.com/doc/Yet-another-guided-tour-of-CPython-XY7KgFGn88zMNivGJ4Jzv), and cpython-devguide [chapter compiler](http://cpython-devguide.readthedocs.io/compiler.html). Most of the action were do at this environment: ```shell $ uname -a 4.8.3-1-ARCH $ gcc -v gcc version 6.3.1 20170109 (GCC) ``` ## Table of Contents * 00. Pre-requirements * 01. Set up the environment * 02. Debugging the Parser * 03. Debugging the AST(Abstract Syntax Tree) generator * 04. Debugging the symbol table generator * 05. Debugging the compiler and code generator * 06. Debugging the interpreter loop * 07. Debugging Python Objects * 08. Debugging the memory allocator * 09. Debugging Python threads and sockets * 10. Python Optimizations * Appendix A - Python Sandard Library * Appendix B - Introduction to PyPy * Appendix C - Some useful terms ## Why re-write this documents? The great work from Prashanth Raghu were using Eclipse IDE as its debugging environment, which I want to change is to remove the requirement of IDE. It should be work with every where, without any specific IDE for this tour. ## 0. Pre-requirements 1. Linux based operating system 2. Python source code, now the code is migrate to [GitHub](https://github.com/python/cpython), so you can easily get the latest source code by entering `git clone https://github.com/python/cpython` 3. gdb 4. make ## 01. Set up the environment 1. Clone the latest cpython source code from GitHub: ```shell $ git clone https://github.com/python/cpython $ cd cpython ``` 2. Configure with --with-pydebug then build python from source code: ```shell $ ./configure --with-pydebug $ make -j8 -s ``` :::info You can change the number of `-j` parameter, for preformance issue, this number is recommand to set to `Physics CPU core * 2`. ::: After this, you will see a binary file `python` in your directory, if you are using macOS, this will probably be `python.exe` ```shell $ ls aclocal.m4 config.log configure Doc install-sh LICENSE Makefile.pre Modules PC pyconfig.h Python python-gdb.py test.py build config.status configure.ac Grammar Lib Mac Makefile.pre.in Objects PCbuild pyconfig.h.in python-config README.rst Tools config.guess config.sub coverage.info Include libpython3.7dm.a Makefile Misc Parser Programs python python-config.py setup.py ``` If you saw it, then you are ready to go! ## 02. Debugging with GDB We will using `gdb` to trace the behaivor of `python` that we built when running it. An therefore, this chapter will introduce the easy `gdb` tutorial, to let you be confidence when using `gdb`. ### GDB shortcut * r (run): run the program * b (break): to set the break point. * s (step): step-to-step * c (continue): continue the program, this will stop at breakpoint or some trap. * l (list): list the code where the program now running. * ctrl+x a: open tui mode. * ctrl+p: up * ctrl+n: down ### GDB intro Now, type the command `gdb python` under the directory where you build cpython: ```shell $ gdb python GNU gdb (GDB) 7.12 Copyright (C) 2016 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-pc-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from python...done. (gdb) ``` You will see the message in the bottom line says: Reading symbols from python...done. Great, that means we can now debug python via `gdb`. Now, the program isn't run yet. It stop at the front of the program. So we can plan what we want to do. First, let us familiar with `gdb`, every C program starts its execution in `main()`, we can set a breakpoint for `main()`: ``` (gdb) b main Breakpoint 1 at 0x41d1d6: file ./Programs/python.c, line 20. (gdb) ``` gdb then set a breakpoint at `Programs/python.c, line 20`, this message show that the entry point of `Python` is at `Programs/python.c:20`. alternative, if you know where you want to set the breakpoint, in this example, at Programs/python.c, line 20. we can set the breakpoint like this: ``` (gdb) b Programs/python.c:20` Breakpoint 3 at 0x41d1d6: file ./Programs/python.c, line 20. (gdb) ``` We can run `python` now: ```shell (gdb) r Starting program: /home/grd/Python/cpython/python [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib/libthread_db.so.1". Breakpoint 1, main (argc=1, argv=0x7fffffffdff8) at ./Programs/python.c:20 20 { (gdb) ``` We are now stop at the breakpoint we set before, list the source code with `list`: ``` (gdb) l 15 } 16 #else 17 18 int 19 main(int argc, char **argv) 20 { 21 wchar_t **argv_copy; 22 /* We need a second copy, as Python might modify the first one. */ 23 wchar_t **argv_copy2; 24 int i, res; (gdb) ``` Or just call out `tui` by ctrl+x a: ``` ┌──./Programs/python.c─────────────────────────────────────────────────────────┐ │15 } │ │16 #else │ │17 │ │18 int │ │19 main(int argc, char **argv) │ B+>│20 { │ │21 wchar_t **argv_copy; │ │22 /* We need a second copy, as Python might modify the first one. */│ │23 wchar_t **argv_copy2; │ │24 int i, res; │ │25 char *oldloc; │ │26 │ └──────────────────────────────────────────────────────────────────────────────┘ multi-thre Thread 0x7ffff7f9de In: main L20 PC: 0x41d1d6 (gdb) ``` In tui mode, you can see now we are stop at line 20 in the source code. If we didn't set any arguments for `python`, running `python` will get a REPL in linux or macOS. You can try with `continue` in gdb: ``` (gdb) c │29 │ │30 argv_copy = (wchar_t **)PyMem_RawMalloc(sizeof(wchar_t*) * (argc+1)); │ │31 argv_copy2 = (wchar_t **)PyMem_RawMalloc(sizeof(wchar_t*) * (argc+1)); │ │32 if (!argv_copy || !argv_copy2) { │ └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ multi-thre Thread 0x7ffff7f9de In: main L20 PC: 0x41d1d6 [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib/libthread_db.so.1". Breakpoint 1, main (argc=1, argv=0x7fffffffdff8) at ./Programs/python.c:20 (gdb) c Continuing. Python 3.7.0a0 (default, Feb 22 2017, 22:10:22) [GCC 6.3.1 20170109] on linux Type "help", "copyright", "credits" or "license" for more information. >>> ``` Yep, you will get a REPL as we think. We are now get some basic concept from gdb, then move on to debug Python! ## 03. Debugging the parser ### 3bb.1 Debugging the parser Before starting this chapter, lets create a simple python script, name it `test.py`, with following contents: ```python= a = 100 ``` Now open gdb, then set the arguments, run python with the script we create: ```shell $ gdb python (gdb) set args test.py // Or doing this $ gdb --args python test.py ``` :::info This is equivalent with what we do in linux command line: ``` $ python test.py ``` ::: Then setting the breakpoint at `main` as we do it before, then run it: ``` (gdb) b main Breakpoint 1 at 0x41d1d6: file ./Programs/python.c, line 20. (gdb) r Starting program: /home/grd/Python/cpython/python [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib/libthread_db.so.1". Breakpoint 1, main (argc=1, argv=0x7fffffffdff8) at ./Programs/python.c:20 20 {Starting program: /home/grd/Python/cpython/python [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib/libthread_db.so.1". Breakpoint 1, main (argc=1, argv=0x7fffffffdff8) at ./Programs/python.c:20 20 { (gdb) ``` #### Programs/python.c If you take a look to the source code `Programs/python.c`, you can found that the `main()` do a lot of setting with it, the core is to call the function `Py_Main(argc, argv_copy)` to start Python, which defined in the file `Modules/main.c`: ```c=64 argv_copy2[argc] = argv_copy[argc] = NULL; setlocale(LC_ALL, oldloc); PyMem_RawFree(oldloc); res = Py_Main(argc, argv_copy); /* Force again malloc() allocator to release memory blocks allocated before Py_Main() */ (void)_PyMem_SetupAllocators("malloc"); ``` #### Modules/main.c If you are interesting in what `modules/main.c` doing, I'll recommend you read Guido's work - [Yet another guided tour of CPython - Chapter 1](https://paper.dropbox.com/doc/Yet-another-guided-tour-of-CPython-XY7KgFGn88zMNivGJ4Jzv), to simplify, in `Py_Main` it will do: 1. [Initialization for hash randomiztion](https://mail.python.org/pipermail/python-announce-list/2012-March/009394.html):391. 2. Reset Warning options and GetOpt: 394,395 3. Parse command line argument via `_PyOS_GetOpt()`: 397 4. Initialize python via `Py_Initialize()`: 693 5. Import `readline`: 723 6. Decide how to run: 1. run_command() 2. RunModule() 3. RunInteractiveHook() 4. RunMainFromImporter() 5. run_file() 6. PyRun_AnyFileFlags() Now, we are focus on road number 5, `run_file` in line 793: ```c=792 if (sts == -1) sts = run_file(fp, filename, &cf); ``` Stop here by typing `b 793` or `b run_file` to stop inside the function, we can check filename by `p filename`, go into function by `step`: ``` (gdb) b Modules/main.c:793 (gdb) c (gdb) p filename $1 = 0x931050 L"test.py (gdb) s ``` `run_file` is just a wrapper of `PyRun_AnyFileExFlags` which lives in `Python/pythonrun.c` and is a wrapper that calls either `PyRun_InteractiveLoopFlags()` or `PyRun_SimpleFileExFlags()`, cause we set the arguments `test.py`, then it will run `PyRun_SimpleFileExFlags()`. #### Python/pythonrun.c `PyRun_SimpleFileExFlags()` will then check if there is a .pyc file to run with `maybe_pyc_file()` at line 366 if-else statement. In our case, is should not run with .pyc, but from .py, it will then call `PyRun_FileExFlags()` in same file, it will then try to construct a abstract syntax tree from file by `PyParser_ASTFromFileObject()`. This function (`PyParser_ASTFromFileObject()`) invokes the parser in the function `PyParser_ParseFileObject()`, which lives in `Parser/parsetok.c` to construct a node object, and invokes the AST tree contructor function `PyAST_FromNodeObject()` to create AST tree by node object. #### Parser/parsetok.c For now, We take a closer look inside `PyParser_ParseFileObjet()`. It first initialize the syntax token from `PyTokenizer_FromFile()`, then pass to `parsetok()` to create node object. `parsetok()` will parse the input from the given tokenizer structure, and return error code. The interesting part is inside the infinity for loop, take a tour here. ```c= type = PyTokenizer_Get(tok, &a, &b); ``` `PyTokenizer_Get()`, is a wrapper of `tok_get()`, it will return token type defined in `token.h`: :::info include/token.h ```cpp= #define ENDMARKER 0 #define NAME 1 #define NUMBER 2 #define STRING 3 #define NEWLINE 4 #define INDENT 5 #define DEDENT 6 #define LPAR 7 #define RPAR 8 #define LSQB 9 ... #define RARROW 51 #define ELLIPSIS 52 /* Don't forget to update the table _PyParser_TokenNames in tokenizer.c! */ #define OP 53 #define AWAIT 54 #define ASYNC 55 #define ERRORTOKEN 56 #define N_TOKENS 57 ``` ::: in the first round of the for-loop, we can print out the type, cause it is `a`, it is a NAME, the value of type should be 1: ``` (gdb) p type $1 = 1 ``` :::info include/token.h ```c=13 #define NAME 1 ``` ::: at line 236, `str[lbbben] = '\0'` will store the token string value, if we print out, it will be `a`. ``` (gdb) p str $2 = 0x7ffff6eb5a18 "a" ``` This make sense, since our source code is `a = 100`, the first token string should be `a`, and its type is `NAME`. Then calling `PyParser_AddToken()`, this will add the token into parser tree at this moment. You can now try to follow the other token `=` and `100`, trace how they are added into the parse tree. ### 3.2 Generation of the grammar The textual representationi of the grammer is present in the file `Grammar/Grammar`. It is written in [yacc](https://en.wikipedia.org/wiki/Yacc) and I would suggest you to go through it. The numerical representation of the grammar is present in the file `Python/graminit.c` which contains the array of [DFA's](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) representing the source program. We will now try to understand how this work in Python. First, open `test.py` and modify the content to: ```python class foo: pass ``` Then, start gdb and set the breakpoint at `PyParser_AddToken()`: ```bash $ gdb python (gdb) b PyParser_AddToken (gdb) r test.py [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib/libthread_db.so.1". Breakpoint 1, PyParser_AddToken (ps=ps@entry=0x9cf4f0, type=type@entry=1, str=str@entry=0x7ffff6eb5a18 "class", lineno=1, col_offset=0, expected_ret=expected_ret@entry=0x7fffffffdc34) at Parser/parser.c:229 229 { ``` At this moment, look inside the for-loop, we want to know the DFA state now, so set a breakpoint at `dfa *d = ps->p_stack.s_top->s_dfa` next line, and run the program by `continue`: ```bash (gdb) b 244 Breakpoint 2 at 0x5b5933: file Parser/parser.c, line 244. (gdb) c Breakpoint 2, PyParser_AddToken (ps=ps@entry=0x9cf4f0, type=type@entry=1, str=str@entry=0x7ffff6eb5a18 "class", lineno=1, col_offset=0, expected_ret=expected_ret@entry=0x7fffffffdc34) at Parser/parser.c:244 244 state *s = &d->d_state[ps->p_stack.s_top->s_state]; ``` We can then print out the value in `d`, for example, print out its name: ```bash (gdb) p d->d_name 'file_input' (gdb) ``` Or, print out as Python `dir()`: ```bash (gdb) p *d $6 = {d_type = 269, d_name = 0x606340 "stmt", d_initial = 0, d_nstates = 2, d_state = 0x8a6a40 <states_13>, d_first = 0x60658c ""} (gdb) set print pretty (gdb) p *d $8 = { d_type = 269, d_name = 0x606340 "stmt", d_initial = 0, d_nstates = 2, d_state = 0x8a6a40 <states_13>, d_first = 0x60658c "" } ``` Compare the value of `d_name`, you can found it in `Grammar/Grammar`: ```yacc # Grammar for Python single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE file_input: (NEWLINE | stmt)* ENDMARKER eval_input: testlist NEWLINE* ENDMARKER ``` Run it more time, you will see that `d_name` will change as this sequence: `file_input`, `stmt`, `compound_stmt`, `classdef`, `classdef`, `classdef`, `classdef`, `suite`, `suite`, `suite`, `stmt`, `simple_stmt`, `small_stmt`, `pass_stmt`, `simple_stmt`, `suite`, `file_input`, `file_input`. ### 3.3 Review #### Backtrace * main() * Py_Main() * run_file() * PyRun_AnyFileExFlags() * PyRun_SimpleFileExFlags() * PyRun_FileExFlags() * PyParser_ASTFromFileObject() * PyParser_ParseFileObject() * parsetok() * PyParser_AddToken() * PyAST_FromNodeObject() ## 04. Debugging the AST (Abstract Syntax Tree) generator At this moment, we build our parser tree via `PyParser_ParseFileObject()`, next step is to generate AST from the parser node we built. Before getting down, there are several macros needs to introduce, these macros used to querying data from node structs, which defined in `Include/node.h`: * CHILD(node *, int) Returns the nth child of the node using zero-offset indexing * RCHILD(node *, int) Returns the nth child of the node from the right side; use negative numbers! * NCH(node *) Number of children the node has * STR(node *) String representation of the node; e.g., will return `:` for a * COLON token TYPE(node *) The type of node as specified in `Include/graminit.h` * REQ(node *, TYPE) Assert that the node is the type that is expected * LINENO(node *) retrieve the line number of the source code that led to the creation of the parse rule; defined in `Python/ast.c`. We can then try to convert Parse Tree to AST, using the function `PyAST_FromNodeObject()` in `Python/ast.c`. ```c=786 for (i = 0; i < NCH(n) - 1; i++) { ch = CHILD(n, i); if (TYPE(ch) == NEWLINE) continue; REQ(ch, stmt); num = num_stmts(ch); if (num == 1) { s = ast_for_stmt(&c, ch); if (!s) goto out; asdl_seq_SET(stmts, k++, s); } else { ch = CHILD(ch, 0); REQ(ch, simple_stmt); for (j = 0; j < num; j++) { s = ast_for_stmt(&c, CHILD(ch, j * 2)); if (!s) goto out; asdl_seq_SET(stmts, k++, s); } } } res = Module(stmts, arena); ``` `ast_for_stmt()` is a wrapper for `ast_for_xx` where xx is the grammer rule that the function handles. Setting breakpoint at `ast_for_stmt()` to observe how ASDL statement is generated. ## 05. Debugging the symbol table generator Back into the function `PyRun_FileExFlags()` in `Python/pythonrun.c:906`, we finish our tour for `PyParser_ASTFromFileObject()`, it will then pass the result `mod` into `run_mod()` function above. It has two principal functions to generate the code object (`PyAST_CompileObject()`), and second to call the interpreter loop (`PyEval_EvalCode()`). Let us step into `PyAST_CompileObject()` first, this function is in `Python/compile.c`, It has two important function: 1. PySumtable_BuildObject() 2. compiler_mod `PySumtable_BuildObject()` used to generate the symbol table, which is defined in `Python/symtable.c` at line 240. We shall step into this function and insert a breakpoint at line 281, where is a for-loop to iterates through asdl sequence (`for (i = 0; i < asdl_seq_LEN(seq); i++)`) for the module and generates the symbol table entry in the function `symtable_visit_stmt()`. Before we step into this function we need to understand the structures that support the symbol table. They are defined in the file `Include/symtble.h`. ```c= struct _symtable_entry; struct symtable { PyObject *st_filename; /* name of file being compiled, decoded from the filesystem encoding */ struct _symtable_entry *st_cur; /* current symbol table entry */ struct _symtable_entry *st_top; /* symbol table entry for module */ PyObject *st_blocks; /* dict: map AST node addresses * to symbol table entries */ PyObject *st_stack; /* list: stack of namespace info */ PyObject *st_global; /* borrowed ref to st_top->ste_symbols */ int st_nblocks; /* number of blocks used. kept for consistency with the corresponding compiler structure */ PyObject *st_private; /* name of current class or NULL */ PyFutureFeatures *st_future; /* module's future features that affect the symbol table */ int recursion_depth; /* current recursion depth */ int recursion_limit; /* recursion limit */ }; typedef struct _symtable_entry { PyObject_HEAD PyObject *ste_id; /* int: key in ste_table->st_blocks */ PyObject *ste_symbols; /* dict: variable names to flags */ PyObject *ste_name; /* string: name of current block */ PyObject *ste_varnames; /* list of function parameters */ PyObject *ste_children; /* list of child blocks */ PyObject *ste_directives;/* locations of global and nonlocal statements */ _Py_block_ty ste_type; /* module, class, or function */ int ste_nested; /* true if block is nested */ unsigned ste_free : 1; /* true if block has free variables */ unsigned ste_child_free : 1; /* true if a child block has free vars, including free refs to globals */ unsigned ste_generator : 1; /* true if namespace is a generator */ unsigned ste_coroutine : 1; /* true if namespace is a coroutine */ unsigned ste_varargs : 1; /* true if block has varargs */ unsigned ste_varkeywords : 1; /* true if block has varkeywords */ unsigned ste_returns_value : 1; /* true if namespace uses return with an argument */ unsigned ste_needs_class_closure : 1; /* for class scopes, true if a closure over __class__ should be created */ int ste_lineno; /* first line of block */ int ste_col_offset; /* offset of first line of block */ int ste_opt_lineno; /* lineno of last exec or import * */ int ste_opt_col_offset; /* offset of last exec or import * */ int ste_tmpname; /* counter for listcomp temp vars */ struct symtable *ste_table; } PySTEntryObject; ``` We can observe that the symbol table is a dictionary containing the symbol table entries. For more specific information, please reference from comment below the source code. Back to the function `symtable_visit_stmt`, we can set the breakpoint on it: ``` (gdb) b symtable_visit_stmt ``` Here we can observe that the kind of expression is `xx_kind`, for `Name_kind`, it will simply call the function `symtable_add_def()` to add a definition into the symbol table. I would suggest you to debug this fucntion on your own as it is quite straight forward. :::info Exercise: Trace the symbol table entries for the following python files: ```python class test: pass ``` ```python i = range(0, 10) m = map(lambda x: x * 2, i) ``` In this exercise we have understood how symbol table entries are created. ::: ## 06. Debugging the compiler and code generator Back to the function `PyAST_CompileObject()`, the next step is `compiler_mod()`, which then convert AST into CFG (content-free graph). Set a breakpoint to it `b compiler_mod`, and we can step into it. The switch will then bring us into `Module_kind`,into this there is a call to the function `compiler_body()`, step into and there is a for-loop ```c= for (; i < asdl_seq_LEN(stmts); i++) VISIT(c, stmt, (stmt)ty)asdl_seq_GET(stmts, i)); ``` Here we observe that we iterate through the asdl statements and call the macro `VISIT`, then calls `compiler_visit_expr(c, node)`. Emission of bytecode is handled by the following macros: * `ADDOP()` add a specified opcode * `ADDOP_I()` add an opcode that takes an argument * `ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)` add an opcode with the proper argument based on the position of the specified PyObject in * PyObject sequence object, but with no handling of mangled names; used for when you need to do named lookups of objects such as globals, consts, or parameters where name mangling is not possible and the scope of the name is known * `ADDOP_NAME()` just like ADDOP_O, but name mangling is also handled; used for attribute loading or i*mporting based on name * `ADDOP_JABS()` create an absolute jump to a basic block * `ADDOP_JREL()` create a relative jump to a basic block Several helper functions that will emit bytecode and are named `compiler_xx()`, where `xx` is what the function helps with (list, boolop, etc.) :::danger not yet done. miss some part. ::: To verify if we have generated the right opcode, on the file `test.py` execute the following command: ```bash $ python -m dis test.py 1 0 LOAD_CONST 0 (100) 2 STORE_NAME 0 (a) 4 LOAD_CONST 1 (None) 6 RETURN_VALUE ``` From this we can verify that our generated bytecode is correct. I would suggest you to debug larger programs and see how the opcode is generated for the same. ## 07. Debugging the interpreter loop Once the bytecode is generated the next step is to execute the program by the interpreter. Back to the file `Python/pythonrun.c`, we then call the function `PyEval_EvalCode()`, it is a wrapper function to `PyEval_EvalCodeEx()`, and is another wrapper function for `_PyEval_EvalCodeWithName()` :::warning Deprecate to 2.7, PyEval_EvalCodeEx won't setup frame before entering `_PyEval_EvalCodeWithName()`, frame create is move into _PyEval_EvalCodeWithName ::: Let us examine the structure of the frame object, which is defined in the file `Include/frameobject.h`: ```cpp typedef struct _frame { PyObject_VAR_HEAD struct _frame *f_back; /* previous frame, or NULL */ PyCodeObject *f_code; /* code segment */ PyObject *f_builtins; /* builtin symbol table (PyDictObject) */ PyObject *f_globals; /* global symbol table (PyDictObject) */ PyObject *f_locals; /* local symbol table (any mapping) */ PyObject **f_valuestack; /* points after the last local */ /* Next free slot in f_valuestack. Frame creation sets to f_valuestack. Frame evaluation usually NULLs it, but a frame that yields sets it to the current stack top. */ PyObject **f_stacktop; PyObject *f_trace; /* Trace function */ /* In a generator, we need to be able to swap between the exception state inside the generator and the exception state of the calling frame (which shouldn't be impacted when the generator "yields" from an except handler). These three fields exist exactly for that, and are unused for non-generator frames. See the save_exc_state and swap_exc_state functions in ceval.c for details of their use. */ PyObject *f_exc_type, *f_exc_value, *f_exc_traceback; /* Borrowed reference to a generator, or NULL */ PyObject *f_gen; int f_lasti; /* Last instruction if called */ /* Call PyFrame_GetLineNumber() instead of reading this field directly. As of 2.3 f_lineno is only valid when tracing is active (i.e. when f_trace is set). At other times we use PyCode_Addr2Line to calculate the line from the current bytecode index. */ int f_lineno; /* Current line number */ int f_iblock; /* index in f_blockstack */ char f_executing; /* whether the frame is still executing */ PyTryBlock f_blockstack[CO_MAXBLOCKS]; /* for try and loop blocks */ PyObject *f_localsplus[1]; /* locals+stack, dynamically sized */ } PyFrameObject; ``` What is important is that this frame contains an instruction pointer which is very useful when we understand how generators work in the coming chapters. In `_PyEval_EvalCodeWithName()`, it will create a frame by `_PyFrame_New_NoTrack()`, and at the bottom call the function `PyEval_EvalFrameEx()`. `PyEval_EvalFrameEx()` then will calls the `eval_frame()` function on the PyThreadState which is nothin gbut the function `_PyEval_EvalFrameDefault()`. This function can alsa be said the virtual machine of Python. Trace into the fucntion `_PyEval_EvalFrameDefault()`, we can then observe a infinity for-loop at line 1054, where will then generate opcode, we can trace it, and you will see it switch to the corresponding opcode block. For example, running code with `a = 100`, will first using `LOAD_CONST`, then `LOAD_NAME`, then so on by observe with `python -m dis test.py`. :::info Please, run this to make sure you familiar with this process. ```python i = range(0, 10) m = map(lambda x: x * 2, i) ``` ```py def logger(func): def logged(name) print('logging starts') func(name) print('logging ends') return logged @logger def print_name(name): print(name) print_name('Resolution Debugging') ``` ::: ### Review about the process This three chapter is the process of converting AST to CFG to Bytecode, you can get more information and refernece at devguide [25.7 AST to CFG to Bytecode](http://cpython-devguide.readthedocs.io/compiler.html#ast-to-cfg-to-bytecode) ## 08. Debugging Python Objects In this chapter we shall understand the implementation of the Python Objects. Let us begin with the PyObject which is the generic Python Object. It can be also called as the Object class of Python. It is defined in the file `Include/object.h` ### 8.0 Before starting PyObject tour In this chapter, you needs to use gdb to trace the code, each PyObject can be trace as these step: * Open gdb with python * Set the break point at Object create function * Take out REPL, then type the object create you want * Trap, start to trace it For example, if you want to trace about PyBoolObject, you can do following: ```bash $ gdb python (gdb) b bool_newbb Breakpoint 1 at 0x44812f: file Objects/boolobject.c, line 44. (gdb) r [GCC 6.3.1 20170109] on linux Type "help", "copyright", "credits" or "license" for more information. >>> a = bool(1) Breakpoint 1, bool_new (type=0x87a700 <PyBool_Type>, args=(1,), kwds=0x0) at Objects/boolobject.c:44 44 { ``` Have fun! ### 8.1 The PyObject Python generic object is defined as follow: ```cpp typedef struct _object { _PyObject_HEAD_EXTRA Py_ssize_t ob_refcnt; struct _typeobject *ob_type; } PyObjecdt; ``` After preprocessor expand _PyObject_HEAD_EXTRA, it will look like, it define two pointers to support a doubly-linked list of all live heap objects: ```cpp typedef struct _object { struct _object *_ob_next; struct _object *_ob_prev; Py_ssize_t ob_refcnt; struct _typeobject *ob_type; } PyObjecdt; ``` Inside PyObject, it contain two important elements, the reference count and type object. We will look into these object after. ### 8.2 The PyVarObject The definition for the Python Objects of variable length which is defined as: ```cpp typedef struct { PyObject ob_base; Py_ssize_t ob_size; /* Number of items in variable part */ } PyVarObject; ``` It is basically the same as PyObject, but with an additional field to denote the size of the variable length of the object. :::info Every pointer to a Python object can be cast to a PyObject*. Similarly every pointer to a variable-size Python object can, in addition, be cast to PyVarObject* ::: ### 8.3 The PyTypeObject The PyTypeObject is the representation of the type of the Python object. to get the type of any object in Python, you can type these following statements: ```py >>> t = type(1) >>> dir(t) ['__abs__', '__add__', '__and__', '__bool__', '__ceil__', '__class__', '__delattr__', '__dir__', '__divmod__', '__doc__', '__eq__', '__float__', '__floor__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getnewargs__', '__gt__', '__hash__', '__index__', '__init__', '__init_subclass__', '__int__', '__invert__', '__le__', '__lshift__', '__lt__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rlshift__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rrshift__', '__rshift__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__trunc__', '__xor__', 'bit_length', 'conjugate', 'denominator', 'from_bytes', 'imag', 'numerator', 'real', 'to_bytes'] ``` We can see the definition of these methods in the PyTypeObject. It is as well as defined in the same file, `Include/object.t`. ```cpp #ifdef Py_LIMITED_API typedef struct _typeobject PyTypeObject; /* opaque */ #else typedef struct _typeobject { PyObject_VAR_HEAD const char *tp_name; /* For printing, in format "<module>.<name>" */ Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */ /* Methods to implement standard operations */ destructor tp_dealloc; printfunc tp_print; getattrfunc tp_getattr; setattrfunc tp_setattr; PyAsyncMethods *tp_as_async; /* formerly known as tp_compare (Python 2) or tp_reserved (Python 3) */ reprfunc tp_repr; /* Method suites for standard classes */ PyNumberMethods *tp_as_number; PySequenceMethods *tp_as_sequence; PyMappingMethods *tp_as_mapping; /* More standard operations (here for binary compatibility) */ hashfunc tp_hash; ternaryfunc tp_call; reprfunc tp_str; getattrofunc tp_getattro; setattrofunc tp_setattro; /* Functions to access object as input/output buffer */ PyBufferProcs *tp_as_buffer; /* Flags to define presence of optional/expanded features */ unsigned long tp_flags; const char *tp_doc; /* Documentation string */ /* Assigned meaning in release 2.0 */ /* call function for all accessible objects */ traverseproc tp_traverse; /* delete references to contained objects */ inquiry tp_clear; /* Assigned meaning in release 2.1 */ /* rich comparisons */ richcmpfunc tp_richcompare; /* weak reference enabler */ Py_ssize_t tp_weaklistoffset; /* Iterators */ getiterfunc tp_iter; iternextfunc tp_iternext; /* Attribute descriptor and subclassing stuff */ struct PyMethodDef *tp_methods; struct PyMemberDef *tp_members; struct PyGetSetDef *tp_getset; struct _typeobject *tp_base; PyObject *tp_dict; descrgetfunc tp_descr_get; descrsetfunc tp_descr_set; Py_ssize_t tp_dictoffset; initproc tp_init; allocfunc tp_alloc; newfunc tp_new; freefunc tp_free; /* Low-level free-memory routine */ inquiry tp_is_gc; /* For PyObject_IS_GC */ PyObject *tp_bases; PyObject *tp_mro; /* method resolution order */ PyObject *tp_cache; PyObject *tp_subclasses; PyObject *tp_weaklist; destructor tp_del; /* Type attribute cache version tag. Added in version 2.6 */ unsigned int tp_version_tag; destructor tp_finalize; #ifdef COUNT_ALLOCS /* these must be last and never explicitly initialized */ Py_ssize_t tp_allocs; Py_ssize_t tp_frees; Py_ssize_t tp_maxalloc; struct _typeobject *tp_prev; struct _typeobject *tp_next; #endif } PyTypeObject; #endif ``` All the integer object are now implement in `Objects/longobject.c`, it defined `PyLong_Type` at line 5389, with PyTypeObject: ```cpp PyTypeObject PyLong_Type = { PyVarObject_HEAD_INIT(&PyType_Type, 0) "int", /* tp_name */ offsetof(PyLongObject, ob_digit), /* tp_basicsize */ sizeof(digit), /* tp_itemsize */ long_dealloc, /* tp_dealloc */ 0, /* tp_print */ 0, /* tp_getattr */ 0, /* tp_setattr */ 0, /* tp_reserved */ long_to_decimal_string, /* tp_repr */ &long_as_number, /* tp_as_number */ 0, /* tp_as_sequence */ 0, /* tp_as_mapping */ (hashfunc)long_hash, /* tp_hash */ 0, /* tp_call */ long_to_decimal_string, /* tp_str */ PyObject_GenericGetAttr, /* tp_getattro */ 0, /* tp_setattro */ 0, /* tp_as_buffer */ Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE | Py_TPFLAGS_LONG_SUBCLASS, /* tp_flags */ long_doc, /* tp_doc */ 0, /* tp_traverse */ 0, /* tp_clear */ long_richcompare, /* tp_richcompare */ 0, /* tp_weaklistoffset */ 0, /* tp_iter */ 0, /* tp_iternext */ long_methods, /* tp_methods */ 0, /* tp_members */ long_getset, /* tp_getset */ 0, /* tp_base */ 0, /* tp_dict */ 0, /* tp_descr_get */ 0, /* tp_descr_set */ 0, /* tp_dictoffset */ 0, /* tp_init */ 0, /* tp_alloc */ long_new, /* tp_new */ PyObject_Del, /* tp_free */ }; ``` ### 8.4 PyLongObject Defined at `Include/longobject.h`, as this: ```cpp typedef struct _longobject PyLongObject; /* Revealed in longintrepr.h */ ``` ### 8.5 PyBoolObject The PyBoolObject is used to store boolean types in Python. The definition is in the file `include/boolobject.h`. ### 8.6 PyFloatObject In file `Include/floatobject.h` ```cpp typedef struct { PyObject_HEAD double ob_fval; } PyFloatObject; ``` ### 8.7 PyListObject In file `Include/listobject.h`. ```cpp typedef struct { PyObject_VAR_HEAD /* Vector of pointers to list elements. list[0] is ob_item[0], etc. */ PyObject **ob_item; /* ob_item contains space for 'allocated' elements. The number * currently in use is ob_size. * Invariants: * 0 <= ob_size <= allocated * len(list) == ob_size * ob_item == NULL implies ob_size == allocated == 0 * list.sort() temporarily sets allocated to -1 to detect mutations. * * Items must normally not be NULL, except during construction when * the list is not yet visible outside the function that builds it. */ Py_ssize_t allocated; } PyListObject; ``` ## Appendix C - Some useful terms * ASDL (Abstract Syntax Definition Language) * yacc and lex * Parser * AST (Abstract Syntax Tree) * CFG (Context-free graph) * LL (LL(k) parser)