Internals of CPython

current version: Python 3.7 alpha
git commit: 03f68b60e17b57f
contributer: Louie Lu <<me at louie.lu>>

This document is based on the work of Prashanth Raghu*, who wrote Internals of CPython 3.6. You can find it at this Google document and Python Core-Mentorship 2017-Jan.

And, also reference from Guido Yet another guided tour of CPython, and cpython-devguide chapter compiler.

Most of the actions were performed in this environment:

$ uname -a
4.8.3-1-ARCH
$ gcc -v
gcc version 6.3.1 20170109 (GCC)

1. Pre-requirements
1. Set up the environment
1. Debugging the Parser
1. Debugging the AST(Abstract Syntax Tree) generator
1. Debugging the symbol table generator
1. Debugging the compiler and code generator
1. Debugging the interpreter loop
1. Debugging Python Objects
1. Debugging the memory allocator
1. Debugging Python threads and sockets
1. Python Optimizations
Appendix A - Python Sandard Library
Appendix B - Introduction to PyPy
Appendix C - Some useful terms

Why re-write these documents?

The great work from Prashanth Raghu was using Eclipse IDE as its debugging environment. What I want to change is to remove the requirement of an IDE. It should be able to work everywhere, without any specific IDE for this tour.

0. Prerequisites

Linux based operating system
Python source code. The code has been migrated to GitHub, so you can easily get the latest source code by running git clone https://github.com/python/cpython
gdb
make

01. Set up the environment

Clone the latest cpython source code from GitHub:

$ git clone https://github.com/python/cpython
$ cd cpython

Configure with --with-pydebug then build python from source code:

$ ./configure --with-pydebug
$ make -j8 -s

You can change the number for the -j parameter. For preformance reasons, this number is recommended to be set to # of physical CPU cores * 2.

After this, you will see a binary file called python in your directory. If you are using macOS this will probably be python.exe. If this is the case, replace python with python.exe in all the commands that follow later in the tutorial.

$ ls
aclocal.m4    config.log     configure      Doc      install-sh        LICENSE   Makefile.pre     Modules  PC        pyconfig.h     Python            python-gdb.py  test.py
build         config.status  configure.ac   Grammar  Lib               Mac       Makefile.pre.in  Objects  PCbuild   pyconfig.h.in  python-config     README.rst     Tools
config.guess  config.sub     coverage.info  Include  libpython3.7dm.a  Makefile  Misc             Parser   Programs  python         python-config.py  setup.py

If you see it, then you are ready to go!

02. Debugging with GDB

We will using gdb to trace the behaivor of python that we built when running it.

As such, this chapter will also introduce an easy gdb tutorial to give you confidence when using gdb.

GDB shortcut

r (run): run the program
b (break): to set the break point.
s (step): step-to-step
c (continue): continue the program, this will stop at breakpoint or some trap.
l (list): list the code where the program now running.
ctrl+x a: open tui mode.
ctrl+p: up
ctrl+n: down

GDB intro

Now, type the command gdb python under the directory where you built cpython:

$ gdb python
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb)

You will see the message in the bottom line that says: Reading symbols from python...done.. Great, that means we can now debug python via gdb.

Now, the program hasn't run yet, it stops at the front of the program. Now we can plan what we want to do.

First, let us familiarise ourselves with gdb. Every C program starts its execution in main(), so we will set a breakpoint for main():

(gdb) b main
Breakpoint 1 at 0x41d1d6: file ./Programs/python.c, line 20.
(gdb)

gdb then sets a breakpoint at Programs/python.c, line 20. This message shows that the entry point of Python is at Programs/python.c:20. Alternativey, if you know where you want to set the breakpoint (in this example, at Programs/python.c, line 20) we can set the breakpoint like this:

(gdb) b Programs/python.c:20
Breakpoint 3 at 0x41d1d6: file ./Programs/python.c, line 20.
(gdb)

We can run python now:

(gdb) r
Starting program: /home/grd/Python/cpython/python 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Breakpoint 1, main (argc=1, argv=0x7fffffffdff8) at ./Programs/python.c:20
20	{
(gdb)

We are now stop at the breakpoint we set before, list the source code with list:

(gdb) l
15	}
16	#else
17	
18	int
19	main(int argc, char **argv)
20	{
21	    wchar_t **argv_copy;
22	    /* We need a second copy, as Python might modify the first one. */
23	    wchar_t **argv_copy2;
24	    int i, res;
(gdb)

Or just call out tui by ctrl+x a:

   ┌──./Programs/python.c─────────────────────────────────────────────────────────┐
   │15      }                                                                     │
   │16      #else                                                                 │
   │17                                                                            │
   │18      int                                                                   │
   │19      main(int argc, char **argv)                                           │
B+>│20      {                                                                     │
   │21          wchar_t **argv_copy;                                              │
   │22          /* We need a second copy, as Python might modify the first one. */│
   │23          wchar_t **argv_copy2;                                             │
   │24          int i, res;                                                       │
   │25          char *oldloc;                                                     │
   │26                                                                            │
   └──────────────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7ffff7f9de In: main                         L20   PC: 0x41d1d6 
(gdb)

In tui mode, you can see that we are now stopped at line 20 in the source code.

If we didn't set any arguments for python, running python will get a REPL in linux or macOS. You can try with continue in gdb:

(gdb) c
   │29                                                                                                                                                                                                                                      │
   │30          argv_copy = (wchar_t **)PyMem_RawMalloc(sizeof(wchar_t*) * (argc+1));                                                                                                                                                       │
   │31          argv_copy2 = (wchar_t **)PyMem_RawMalloc(sizeof(wchar_t*) * (argc+1));                                                                                                                                                      │
   │32          if (!argv_copy || !argv_copy2) {                                                                                                                                                                                            │
   └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7ffff7f9de In: main                                                                                                                                                                                   L20   PC: 0x41d1d6 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Breakpoint 1, main (argc=1, argv=0x7fffffffdff8) at ./Programs/python.c:20
(gdb) c
Continuing.
Python 3.7.0a0 (default, Feb 22 2017, 22:10:22) 
[GCC 6.3.1 20170109] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Yep, you will get a REPL as we expected.

Now that we understand some basic gdb concepts, we can move on to debugging Python!

03. Debugging the parser

3bb.1 Debugging the parser

Before starting this chapter, let's create a simple python script, name it test.py, with following the contents:


a = 100

Now open gdb, set the arguments, and run python with the script we created:

$ gdb python
(gdb) set args test.py
// Or doing this
$ gdb --args python test.py

This is equivalent to what we do in the linux command line:

$ python test.py

Now set the breakpoint at main as we did before, then run it:

(gdb) b main
Breakpoint 1 at 0x41d1d6: file ./Programs/python.c, line 20.
(gdb) r
Starting program: /home/grd/Python/cpython/python 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Breakpoint 1, main (argc=1, argv=0x7fffffffdff8) at ./Programs/python.c:20
20	{Starting program: /home/grd/Python/cpython/python 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Breakpoint 1, main (argc=1, argv=0x7fffffffdff8) at ./Programs/python.c:20
20	{
(gdb)

Programs/python.c

If you take a look to the source code Programs/python.c, you can find that the main() does a lot of setting with it, the core is to call the function Py_Main(argc, argv_copy) to start Python, which is defined in the file Modules/main.c:










    argv_copy2[argc] = argv_copy[argc] = NULL;

    setlocale(LC_ALL, oldloc);
    PyMem_RawFree(oldloc);

    res = Py_Main(argc, argv_copy);

    /* Force again malloc() allocator to release memory blocks allocated
       before Py_Main() */
    (void)_PyMem_SetupAllocators("malloc");

Modules/main.c

If you are interesting in what modules/main.c is doing, I recommend you read Guido's work - Yet another guided tour of CPython - Chapter 1. To simplify, in Py_Main it will do:

Initialization for hash randomiztion:391
Reset Warning options and GetOpt: 394,395
Parse command line argument via _PyOS_GetOpt(): 397
Initialize python via Py_Initialize(): 693
Import readline: 723
Decide how to run:
1. run_command()
2. RunModule()
3. RunInteractiveHook()
4. RunMainFromImporter()
5. run_file()
6. PyRun_AnyFileFlags()

Now, we are focused on road number 5, run_file in line 793:


if (sts == -1)
    sts = run_file(fp, filename, &cf);

Stop here by typing b 793 or b run_file to stop inside the function, we can check filename by p filename, go into function by step:

(gdb) b Modules/main.c:793
(gdb) c
(gdb) p filename
$1 = 0x931050 L"test.py
(gdb) s

run_file is just a wrapper of PyRun_AnyFileExFlags which lives in Python/pythonrun.c and is a wrapper that calls either PyRun_InteractiveLoopFlags() or PyRun_SimpleFileExFlags(). Because we set the arguments test.py, it will run PyRun_SimpleFileExFlags().

Python/pythonrun.c

PyRun_SimpleFileExFlags() will then check if there is a .pyc file to run with maybe_pyc_file() at the line 366 if-else statement.

In our case, is should not run with .pyc, but from .py. It will then call PyRun_FileExFlags() in same file, and then try to construct an abstract syntax tree (AST) from file by PyParser_ASTFromFileObject().

This function (PyParser_ASTFromFileObject()) invokes the parser in the function PyParser_ParseFileObject(), which lives in Parser/parsetok.c to construct a node object, and invokes the AST tree contructor function PyAST_FromNodeObject() to create AST tree by node object.

Parser/parsetok.c

For now, let's take a closer look inside PyParser_ParseFileObject().

It first initializes the syntax token from PyTokenizer_FromFile(), then passes to parsetok() to create the node object.

parsetok() will parse the input from the given tokenizer structure and return an error code.

The interesting part is inside the infinite for loop; take a look here:


type = PyTokenizer_Get(tok, &a, &b);

PyTokenizer_Get(), is a wrapper of tok_get(). It will return token type defined in token.h:

include/token.h



















#define ENDMARKER	0
#define NAME		1
#define NUMBER		2
#define STRING		3
#define NEWLINE		4
#define INDENT		5
#define DEDENT		6
#define LPAR		7
#define RPAR		8
#define LSQB		9
...
#define RARROW          51
#define ELLIPSIS        52
/* Don't forget to update the table _PyParser_TokenNames in tokenizer.c! */
#define OP		53
#define AWAIT		54
#define ASYNC		55
#define ERRORTOKEN	56
#define N_TOKENS	57

In the first round of the for-loop, we can print out the type, because it is a, it is a NAME, the value of type should be 1:

(gdb) p type
$1 = 1

include/token.h


#define NAME		1

At line 236, str[lbbben] = '\0' will store the token string value. If we print it out, it will be a.

(gdb) p str
$2 = 0x7ffff6eb5a18 "a"

This make sense, since our source code is a = 100, the first token string should be a, and its type is NAME.

Then calling PyParser_AddToken(), this will add the token into the parser tree at this moment. You can now try to follow the other token = and 100 and trace how they are added into the parser tree.

3.2 Generation of the grammar

The textual representation of the grammar is present in the file Grammar/Grammar. It is written in yacc and I would suggest you to go through it. The numerical representation of the grammar is present in the file Python/graminit.c which contains the array of DFA's representing the source program.

We will now try to understand how this works in Python. First, open test.py and modify the contents to:

class foo:
    pass

Then, start gdb and set the breakpoint at PyParser_AddToken():

$ gdb python
(gdb) b PyParser_AddToken
(gdb) r test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Breakpoint 1, PyParser_AddToken (ps=ps@entry=0x9cf4f0, type=type@entry=1, str=str@entry=0x7ffff6eb5a18 "class", lineno=1, col_offset=0, expected_ret=expected_ret@entry=0x7fffffffdc34) at Parser/parser.c:229
229	{

At this moment, look inside the for-loop. We want to know the DFA state now, so set a breakpoint at dfa *d = ps->p_stack.s_top->s_dfa next line, and run the program by continue:

(gdb) b 244
Breakpoint 2 at 0x5b5933: file Parser/parser.c, line 244.
(gdb) c
Breakpoint 2, PyParser_AddToken (ps=ps@entry=0x9cf4f0, type=type@entry=1, str=str@entry=0x7ffff6eb5a18 "class", lineno=1, col_offset=0, expected_ret=expected_ret@entry=0x7fffffffdc34) at Parser/parser.c:244
244	        state *s = &d->d_state[ps->p_stack.s_top->s_state];

We can then print out the value in d, for example, print out its name:

(gdb) p d->d_name
'file_input'
(gdb)

Or, print out as Python dir():

(gdb) p *d
$6 = {d_type = 269, d_name = 0x606340 "stmt", d_initial = 0, d_nstates = 2, d_state = 0x8a6a40 <states_13>, d_first = 0x60658c ""}
(gdb) set print pretty
(gdb) p *d
$8 = {
  d_type = 269,
  d_name = 0x606340 "stmt",
  d_initial = 0,
  d_nstates = 2,
  d_state = 0x8a6a40 <states_13>,
  d_first = 0x60658c ""
}

Compare the value of d_name, you can find it in Grammar/Grammar:

# Grammar for Python

single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
file_input: (NEWLINE | stmt)* ENDMARKER
eval_input: testlist NEWLINE* ENDMARKER

Run it more time, you will see that d_name will change as this sequence: file_input, stmt, compound_stmt, classdef, classdef, classdef, classdef, suite, suite, suite, stmt, simple_stmt, small_stmt, pass_stmt, simple_stmt, suite, file_input, file_input.

3.3 Review

Backtrace

main()
Py_Main()
run_file()
PyRun_AnyFileExFlags()
PyRun_SimpleFileExFlags()
PyRun_FileExFlags()
PyParser_ASTFromFileObject()
- PyParser_ParseFileObject()
  - parsetok()
  - PyParser_AddToken()
- PyAST_FromNodeObject()

04. Debugging the AST (Abstract Syntax Tree) generator

At this moment, we build our parser tree via PyParser_ParseFileObject(). The next step is to generate the AST from the parser node we built.

Before getting into it, there are several macros we need to introduce. These macros are used to query data from node structs, which are defined in Include/node.h:

CHILD(node *, int)
Returns the nth child of the node using zero-offset indexing
RCHILD(node *, int)
Returns the nth child of the node from the right side; use negative numbers!
NCH(node *)
The number of children the node has
STR(node *)
A string representation of the node; e.g. will return : for a * COLON token
TYPE(node *)
The type of node as specified in Include/graminit.h
REQ(node *, TYPE)
Assert that the node is the type that is expected
LINENO(node *)
Retrieve the line number of the source code that led to the creation of the parse rule; defined in Python/ast.c.

We can now try to convert the Parse Tree to an AST using the function PyAST_FromNodeObject() in Python/ast.c.
























for (i = 0; i < NCH(n) - 1; i++) {
    ch = CHILD(n, i);
    if (TYPE(ch) == NEWLINE)
        continue;
    REQ(ch, stmt);
    num = num_stmts(ch);
    if (num == 1) {
        s = ast_for_stmt(&c, ch);
        if (!s)
            goto out;
        asdl_seq_SET(stmts, k++, s);
    }
    else {
        ch = CHILD(ch, 0);
        REQ(ch, simple_stmt);
        for (j = 0; j < num; j++) {
            s = ast_for_stmt(&c, CHILD(ch, j * 2));
            if (!s)
                goto out;
            asdl_seq_SET(stmts, k++, s);
        }
    }
}
res = Module(stmts, arena);

ast_for_stmt() is a wrapper for ast_for_xx where xx is the grammer rule that the function handles.

Set a breakpoint at ast_for_stmt() to observe how the ASDL statement is generated.

05. Debugging the symbol table generator

Back into the function PyRun_FileExFlags() in Python/pythonrun.c:906, we finish our tour for PyParser_ASTFromFileObject(). It will then pass the result mod into the run_mod() function above. It has two principal functions: The first to generate the code object (PyAST_CompileObject()) and the second to call the interpreter loop (PyEval_EvalCode()).

Let us step into PyAST_CompileObject() first, this function is in Python/compile.c. It has two important functions:

PySumtable_BuildObject()
compiler_mod()

PySumtable_BuildObject() is used to generate the symbol table, which is defined in Python/symtable.c at line 240.

We shall step into this function and insert a breakpoint at line 281, where there is a for-loop that iterates through the ASDL sequence (for (i = 0; i < asdl_seq_LEN(seq); i++)) for the module and generates the symbol table entry in the function symtable_visit_stmt().

Before we step into this function we need to understand the structures that support the symbol table. They are defined in the file Include/symtble.h.


















































struct _symtable_entry;

struct symtable {
    PyObject *st_filename;          /* name of file being compiled,
                                       decoded from the filesystem encoding */
    struct _symtable_entry *st_cur; /* current symbol table entry */
    struct _symtable_entry *st_top; /* symbol table entry for module */
    PyObject *st_blocks;            /* dict: map AST node addresses
                                     *       to symbol table entries */
    PyObject *st_stack;             /* list: stack of namespace info */
    PyObject *st_global;            /* borrowed ref to st_top->ste_symbols */
    int st_nblocks;                 /* number of blocks used. kept for
                                       consistency with the corresponding
                                       compiler structure */
    PyObject *st_private;           /* name of current class or NULL */
    PyFutureFeatures *st_future;    /* module's future features that affect
                                       the symbol table */
    int recursion_depth;            /* current recursion depth */
    int recursion_limit;            /* recursion limit */
};

typedef struct _symtable_entry {
    PyObject_HEAD
    PyObject *ste_id;        /* int: key in ste_table->st_blocks */
    PyObject *ste_symbols;   /* dict: variable names to flags */
    PyObject *ste_name;      /* string: name of current block */
    PyObject *ste_varnames;  /* list of function parameters */
    PyObject *ste_children;  /* list of child blocks */
    PyObject *ste_directives;/* locations of global and nonlocal statements */
    _Py_block_ty ste_type;   /* module, class, or function */
    int ste_nested;      /* true if block is nested */
    unsigned ste_free : 1;        /* true if block has free variables */
    unsigned ste_child_free : 1;  /* true if a child block has free vars,
                                     including free refs to globals */
    unsigned ste_generator : 1;   /* true if namespace is a generator */
    unsigned ste_coroutine : 1;   /* true if namespace is a coroutine */
    unsigned ste_varargs : 1;     /* true if block has varargs */
    unsigned ste_varkeywords : 1; /* true if block has varkeywords */
    unsigned ste_returns_value : 1;  /* true if namespace uses return with
                                        an argument */
    unsigned ste_needs_class_closure : 1; /* for class scopes, true if a
                                             closure over __class__
                                             should be created */
    int ste_lineno;          /* first line of block */
    int ste_col_offset;      /* offset of first line of block */
    int ste_opt_lineno;      /* lineno of last exec or import * */
    int ste_opt_col_offset;  /* offset of last exec or import * */
    int ste_tmpname;         /* counter for listcomp temp vars */
    struct symtable *ste_table;
} PySTEntryObject;

We can observe that the symbol table is a dictionary containing the symbol table entries. For more specific information, please refer to the comment below the source code.

Back to the function symtable_visit_stmt, we can set the breakpoint on it:

(gdb) b symtable_visit_stmt

Here we can observe that the kind of expression is xx_kind, for Name_kind, it will simply call the function symtable_add_def() to add a definition into the symbol table.

I would suggest you to debug this function on your own as it is quite straight forward.

Exercise:

Trace the symbol table entries for the following Python files:

class test:
    pass

i = range(0, 10)
m = map(lambda x: x * 2, i)

In this exercise we have understood how symbol table entries are created.

06. Debugging the compiler and code generator

Back to the function PyAST_CompileObject(), the next step is compiler_mod(), which then converts the AST into a CFG (control flow graph).

Set a breakpoint to it (b compiler_mod), and we can step into it. The switch will then bring us into Module_kind, into this there is a call to the function compiler_body(). Step into it and you will see that there is a for-loop:


for (; i < asdl_seq_LEN(stmts); i++)
    VISIT(c, stmt, (stmt)ty)asdl_seq_GET(stmts, i));

Here we observe that we iterate through the ASDL statements and call the macro VISIT, then calls compiler_visit_expr(c, node).

The emission of bytecode is handled by the following macros:

ADDOP()
Add a specified opcode
ADDOP_I()
Add an opcode that takes an argument
ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)
Add an opcode with the proper argument based on the position of the specified PyObject in * PyObject sequence object, but with no handling of mangled names; this is used for when you need to do named lookups of objects such as globals, consts, or parameters where name mangling is not possible and the scope of the name is known
ADDOP_NAME()
Just like ADDOP_O, but name mangling is also handled; used for attribute loading or importing based on name
ADDOP_JABS()
Create an absolute jump to a basic block
ADDOP_JREL()
Create a relative jump to a basic block

Several helper functions that will emit bytecode and are named compiler_xx(), where xx is what the function helps with (list, boolop, etc.)

Not yet done. Missing some parts.

To verify if we have generated the right opcode, on the file test.py execute the following command:

$ python -m dis test.py
  1           0 LOAD_CONST               0 (100)
              2 STORE_NAME               0 (a)
              4 LOAD_CONST               1 (None)
              6 RETURN_VALUE

From this we can verify that our generated bytecode is correct.

I would suggest you to debug larger programs and see how the opcode is generated for the same.

07. Debugging the interpreter loop

Once the bytecode is generated the next step is to execute the program by the interpreter. Back to the file Python/pythonrun.c, we then call the function PyEval_EvalCode(), it is a wrapper function to PyEval_EvalCodeEx(), and is another wrapper function for _PyEval_EvalCodeWithName()

Deprecate to 2.7, PyEval_EvalCodeEx won't setup frame before entering _PyEval_EvalCodeWithName(), frame create is moved into _PyEval_EvalCodeWithName.

Let us examine the structure of the frame object, which is defined in the file Include/frameobject.h:

typedef struct _frame {
    PyObject_VAR_HEAD
    struct _frame *f_back;      /* previous frame, or NULL */
    PyCodeObject *f_code;       /* code segment */
    PyObject *f_builtins;       /* builtin symbol table (PyDictObject) */
    PyObject *f_globals;        /* global symbol table (PyDictObject) */
    PyObject *f_locals;         /* local symbol table (any mapping) */
    PyObject **f_valuestack;    /* points after the last local */
    /* Next free slot in f_valuestack.  Frame creation sets to f_valuestack.
       Frame evaluation usually NULLs it, but a frame that yields sets it
       to the current stack top. */
    PyObject **f_stacktop;
    PyObject *f_trace;          /* Trace function */

    /* In a generator, we need to be able to swap between the exception
       state inside the generator and the exception state of the calling
       frame (which shouldn't be impacted when the generator "yields"
       from an except handler).
       These three fields exist exactly for that, and are unused for
       non-generator frames. See the save_exc_state and swap_exc_state
       functions in ceval.c for details of their use. */
    PyObject *f_exc_type, *f_exc_value, *f_exc_traceback;
    /* Borrowed reference to a generator, or NULL */
    PyObject *f_gen;

    int f_lasti;                /* Last instruction if called */
    /* Call PyFrame_GetLineNumber() instead of reading this field
       directly.  As of 2.3 f_lineno is only valid when tracing is
       active (i.e. when f_trace is set).  At other times we use
       PyCode_Addr2Line to calculate the line from the current
       bytecode index. */
    int f_lineno;               /* Current line number */
    int f_iblock;               /* index in f_blockstack */
    char f_executing;           /* whether the frame is still executing */
    PyTryBlock f_blockstack[CO_MAXBLOCKS]; /* for try and loop blocks */
    PyObject *f_localsplus[1];  /* locals+stack, dynamically sized */
} PyFrameObject;

What is important is that this frame contains an instruction pointer which is very useful when we try to understand how generators work in the coming chapters.

In _PyEval_EvalCodeWithName(), it will create a frame by _PyFrame_New_NoTrack(), and at the bottom call the function PyEval_EvalFrameEx().

PyEval_EvalFrameEx() then will call the eval_frame() function on the PyThreadState which is nothing but the function _PyEval_EvalFrameDefault(). This function can also be called the virtual machine of Python.

Trace into the function _PyEval_EvalFrameDefault(), we can then observe an infinite for-loop at line 1054, which will then generate opcode. We can trace it, and you will see it switch to the corresponding opcode block.

For example, running code with a = 100 will first use LOAD_CONST, then LOAD_NAME, then so on, which we can observe with python -m dis test.py.

Please, run this to make sure you are familiar with this process.

i = range(0, 10)
m = map(lambda x: x * 2, i)

def logger(func):
    def logged(name)
        print('logging starts')
        func(name)
        print('logging ends')
    return logged
    
@logger
def print_name(name):
    print(name)
    
print_name('Resolution Debugging')

Review of the process

This three chapter is the process of converting AST to CFG to Bytecode, you can get more information by referencing the devguide 25.7 AST to CFG to Bytecode

08. Debugging Python Objects

In this chapter we shall learn about the implementation of Python Objects. Let us begin with the PyObject which is the generic Python Object. It can be also called the Object class of Python. It is defined in the file Include/object.h.

8.0 Before starting PyObject tour

In this chapter, you need to use gdb to trace the code. Each PyObject can be traced with these steps:

Open gdb with python
Set the break point at Object create function
Take out REPL, then type the object create you want
Trap, start to trace it

For example, if you want to trace about PyBoolObject, you can do the following:

$ gdb python
(gdb) b bool_newbb
Breakpoint 1 at 0x44812f: file Objects/boolobject.c, line 44.
(gdb) r
[GCC 6.3.1 20170109] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = bool(1)

Breakpoint 1, bool_new (type=0x87a700 <PyBool_Type>, args=(1,), kwds=0x0) at Objects/boolobject.c:44
44	{

Have fun!

8.1 The PyObject

The generic Python object is defined as follows:

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObjecdt;

After the preprocessor expands _PyObject_HEAD_EXTRA, it will look like, it defines two pointers to support a doubly-linked list of all live heap objects:

typedef struct _object {
    struct _object *_ob_next;
    struct _object *_ob_prev;
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObjecdt;

Inside PyObject, it contains two important elements, the reference count and type object. We will look into these objects after.

8.2 The PyVarObject

The definition for the Python Objects of variable length is defined as:

typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;

It is basically the same as PyObject, but with an additional field to denote the size of the variable length of the object.

Every pointer to a Python object can be cast to a PyObject*. Similarly every pointer to a variable-size Python object can, in addition, be cast to PyVarObject*.

8.3 The PyTypeObject

The PyTypeObject is the representation of the type of the Python object. to get the type of any object in Python, you can type the following statements:

>>> t = type(1)
>>> dir(t)
['__abs__', '__add__', '__and__', '__bool__', '__ceil__', '__class__', '__delattr__', '__dir__', '__divmod__', '__doc__', '__eq__', '__float__', '__floor__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getnewargs__', '__gt__', '__hash__', '__index__', '__init__', '__init_subclass__', '__int__', '__invert__', '__le__', '__lshift__', '__lt__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rlshift__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rrshift__', '__rshift__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__trunc__', '__xor__', 'bit_length', 'conjugate', 'denominator', 'from_bytes', 'imag', 'numerator', 'real', 'to_bytes']

We can see the definition of these methods in the PyTypeObject. It is also defined in the same file, Include/object.t.

#ifdef Py_LIMITED_API
typedef struct _typeobject PyTypeObject; /* opaque */
#else
typedef struct _typeobject {
    PyObject_VAR_HEAD
    const char *tp_name; /* For printing, in format "<module>.<name>" */
    Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */

    /* Methods to implement standard operations */

    destructor tp_dealloc;
    printfunc tp_print;
    getattrfunc tp_getattr;
    setattrfunc tp_setattr;
    PyAsyncMethods *tp_as_async; /* formerly known as tp_compare (Python 2)
                                    or tp_reserved (Python 3) */
    reprfunc tp_repr;

    /* Method suites for standard classes */

    PyNumberMethods *tp_as_number;
    PySequenceMethods *tp_as_sequence;
    PyMappingMethods *tp_as_mapping;

    /* More standard operations (here for binary compatibility) */

    hashfunc tp_hash;
    ternaryfunc tp_call;
    reprfunc tp_str;
    getattrofunc tp_getattro;
    setattrofunc tp_setattro;

    /* Functions to access object as input/output buffer */
    PyBufferProcs *tp_as_buffer;

    /* Flags to define presence of optional/expanded features */
    unsigned long tp_flags;

    const char *tp_doc; /* Documentation string */

    /* Assigned meaning in release 2.0 */
    /* call function for all accessible objects */
    traverseproc tp_traverse;

    /* delete references to contained objects */
    inquiry tp_clear;

    /* Assigned meaning in release 2.1 */
    /* rich comparisons */
    richcmpfunc tp_richcompare;

    /* weak reference enabler */
    Py_ssize_t tp_weaklistoffset;

    /* Iterators */
    getiterfunc tp_iter;
    iternextfunc tp_iternext;

    /* Attribute descriptor and subclassing stuff */
    struct PyMethodDef *tp_methods;
    struct PyMemberDef *tp_members;
    struct PyGetSetDef *tp_getset;
    struct _typeobject *tp_base;
    PyObject *tp_dict;
    descrgetfunc tp_descr_get;
    descrsetfunc tp_descr_set;
    Py_ssize_t tp_dictoffset;
    initproc tp_init;
    allocfunc tp_alloc;
    newfunc tp_new;
    freefunc tp_free; /* Low-level free-memory routine */
    inquiry tp_is_gc; /* For PyObject_IS_GC */
    PyObject *tp_bases;
    PyObject *tp_mro; /* method resolution order */
    PyObject *tp_cache;
    PyObject *tp_subclasses;
    PyObject *tp_weaklist;
    destructor tp_del;

    /* Type attribute cache version tag. Added in version 2.6 */
    unsigned int tp_version_tag;

    destructor tp_finalize;

#ifdef COUNT_ALLOCS
    /* these must be last and never explicitly initialized */
    Py_ssize_t tp_allocs;
    Py_ssize_t tp_frees;
    Py_ssize_t tp_maxalloc;
    struct _typeobject *tp_prev;
    struct _typeobject *tp_next;
#endif
} PyTypeObject;
#endif

All the integer objects are now implemented in Objects/longobject.c; it is defined as PyLong_Type at line 5389, with PyTypeObject:

PyTypeObject PyLong_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "int",                                      /* tp_name */
    offsetof(PyLongObject, ob_digit),           /* tp_basicsize */
    sizeof(digit),                              /* tp_itemsize */
    long_dealloc,                               /* tp_dealloc */
    0,                                          /* tp_print */
    0,                                          /* tp_getattr */
    0,                                          /* tp_setattr */
    0,                                          /* tp_reserved */
    long_to_decimal_string,                     /* tp_repr */
    &long_as_number,                            /* tp_as_number */
    0,                                          /* tp_as_sequence */
    0,                                          /* tp_as_mapping */
    (hashfunc)long_hash,                        /* tp_hash */
    0,                                          /* tp_call */
    long_to_decimal_string,                     /* tp_str */
    PyObject_GenericGetAttr,                    /* tp_getattro */
    0,                                          /* tp_setattro */
    0,                                          /* tp_as_buffer */
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE |
        Py_TPFLAGS_LONG_SUBCLASS,               /* tp_flags */
    long_doc,                                   /* tp_doc */
    0,                                          /* tp_traverse */
    0,                                          /* tp_clear */
    long_richcompare,                           /* tp_richcompare */
    0,                                          /* tp_weaklistoffset */
    0,                                          /* tp_iter */
    0,                                          /* tp_iternext */
    long_methods,                               /* tp_methods */
    0,                                          /* tp_members */
    long_getset,                                /* tp_getset */
    0,                                          /* tp_base */
    0,                                          /* tp_dict */
    0,                                          /* tp_descr_get */
    0,                                          /* tp_descr_set */
    0,                                          /* tp_dictoffset */
    0,                                          /* tp_init */
    0,                                          /* tp_alloc */
    long_new,                                   /* tp_new */
    PyObject_Del,                               /* tp_free */
};

8.4 PyLongObject

Defined at Include/longobject.h, as this:

typedef struct _longobject PyLongObject; /* Revealed in longintrepr.h */

8.5 PyBoolObject

The PyBoolObject is used to store boolean types in Python. The definition is in the file include/boolobject.h.

8.6 PyFloatObject

In file Include/floatobject.h

typedef struct {
    PyObject_HEAD
    double ob_fval;
} PyFloatObject;

8.7 PyListObject

In file Include/listobject.h.

typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     * list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_ssize_t allocated;
} PyListObject;

Appendix C - Some useful terms

ASDL (Abstract Syntax Definition Language)
yacc and lex
Parser
AST (Abstract Syntax Tree)
CFG (Context-free graph)
LL (LL(k) parser)

Guest McGee2021/06/28 15:12:42

(Context-free

wasan't it control flow graph? (Edited)

刘备

2023/03/11 10:51:03

PySumtable_BuildObject

PySymtable_BuildObject , not PySumtable_BuildObject (Edited)

2023/03/16 13:17:22

```bash $ gdb python (gdb) b bool_newbb Breakpoint 1 at 0x44812f: file Objects/boolobject.c, line 44. (gdb) r [GCC 6.3.1 20170109] on linux Type "help", "copyright", "credits" or "license" for more information. >>> a = bool(1) Breakpoint 1, bool_new (type=0x87a700 <PyBool_Type>, args=(1,), kwds=0x0) at Objects/boolobject.c:44 44 { ```

bool_new not bool_newbb (Edited)