# Book & Implementation : ゼロから作るDeep Learning ―Pythonで學ぶディープラーニングの理論と実裝 & Implement with Eigen C++
> Reference \:
> * Deep Learning|用 Python 進行深度學習的基礎理論實作 Deep Learning ―Pythonで學ぶディープラーニングの理論と実裝
> https://www.tenlong.com.tw/products/9789864764846
> * DeZeroCpp Github Repo
>https://github.com/bugph0bia/DeZeroCpp/tree/master
*This note is based on the above book. In this note, I only focus on the C++ & Eigen implementation details. Please refer to the book for the full explanations.*
:::info
:information_source: **Previous Attempt**
https://hackmd.io/@Erebustsai/HkQGGTe7a
:::
:::success
:bulb: **Similar Work**
https://github.com/bugph0bia/DeZeroCpp/tree/master
When I was working on the code, I found this great work by bugph0bia, which uses `NumCpp` for the implementation. However, this work will eventually extend to **CUDA C** or **OpenCL**, since this framework serves as a reference or backbone for further extension.
Note that I will work on my own version of the framework while reading bugph0bia's notes. Many implementation considerations definitely take that repo as a reference.
:::
:::info
:information_source: **DeZero with CPP Road Map**
* **First Version** \: Use `Eigen` and run all the computation purely on the CPU.
* **Second Version** \: Support `ViennaCL` to be able to use the GPU \(supporting OpenCL 1.2\) for computation.
* **Third Version** \: Take out `Eigen` & `ViennaCL` and build all the computation and data structures in raw C++ and OpenCL. Note that using `cuBLAS` or any other BLAS library is not considered since **this framework is purely for research and will NOT go into production**.
:::
# Stage \#1 \: Calculate Derivatives Automatically
In the early days of AI frameworks, a framework could be treated as a tool for calculating derivatives. Therefore, the first stage of the book builds a framework that can calculate derivatives automatically.
## Step 1
:::info
:information_source: **Scalar, Vector and Matrix**
In **step 1** of the book \(page 6\), the author notes that the DeZero framework will use data up to matrices for now. This makes `Eigen::MatrixXf` the best choice for the data type. Note that Eigen can handle more than two dimensions with `Eigen::Tensor`, but that module is only community-supported \(it lives in Eigen's unsupported modules\). If DeZero is going to use more than two dimensions, changing the backbone from `Eigen::MatrixXf` to `Eigen::Tensor` might be needed.
:::
### Implementation Considerations
1. For the first iteration, I use `Eigen::MatrixXf` in the `Variable` class because I want to leave the memory management to `Eigen` itself \(a minimal sketch follows this list\). The similar work referred to above uses an additional layer of pointer for the data instance.
2. I want to keep it as close to `C++17` as possible, but I will still retain some of the habits that I collected when working on other projects. Therefore, **the final program will definitely require a peer review**.
3. I noticed that the default storage order of matrix data is different between `numpy` \(row-major\) and `Eigen` \(column-major\). See my other [post](https://hackmd.io/HMzKBN64Q4yRbInS2JZxHA).
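A minimal sketch of what this first-iteration `Variable` might look like \(only the data member for now; the exact constructor is an assumption\):
```cpp
#include <Eigen/Dense>
using Eigen::MatrixXf;

// First-iteration Variable: it only wraps the data and lets Eigen manage the memory.
class Variable {
public:
    MatrixXf data;
    explicit Variable(const MatrixXf &data) : data(data) {}
};
```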
## Step 2
:::info
:information_source: **Diagram of Computation Graph**
A computation graph is used to represent the relations between data and functions. As in the examples provided by the book, we will use circles for variables and squares for functions.

:::
### Implementation of `Function` Class
1. The `Function` class should take a `Variable` instance as input and return a `Variable` instance as output.
2. The actual data is stored inside the `Variable` instance.
### Implementation Considerations
1. By overloading `operator()` in C++ \(making the class a functor\), we can make a C++ class work similarly to a Python class that defines `__call__`.
2. Use inheritance with a base class `Function` and implement the different computation operations on top of it.
3. The base `Function` diagram will be like \:
```cpp
Variable operator()(const Variable &input) {
    MatrixXf x = input.data;          // unpack the raw data from the Variable
    MatrixXf y = this->forward(x);    // run the concrete computation
    Variable output(y);               // wrap the result back into a Variable
    return output;
}

virtual MatrixXf forward(const MatrixXf &x) = 0;  // real computation, implemented by derived classes
```
The function `forward` is used to invoke the computation on the data, while the functor body prepares the data.
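A sketch of a concrete operation derived from the base class, assuming the interface sketched above:
```cpp
// Square inherits from Function and only overrides the real computation.
class Square : public Function {
protected:
    MatrixXf forward(const MatrixXf &x) override {
        return x.array().square().matrix();   // element-wise square
    }
};
```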
## Step 3
The output of `Function::operator()` is a `Variable`, so it can be passed directly to another functor. In this case, I rely on `RVO` in C++ to help eliminate the temporary variables.
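A quick sketch of such chaining, following the book's Step 3 example \(`Square` -> `Exp` -> `Square` with `x = 0.5`\); the intermediate `Variable`s returned by each call are candidates for RVO / copy elision:
```cpp
Square A;
Exp B;        // Exp is the book's exponential function class, implemented the same way as Square
Square C;

Variable x(MatrixXf::Constant(1, 1, 0.5f));
Variable y = C(B(A(x)));   // each temporary result is passed straight into the next functor
```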
:::info
:information_source: **Note for RVO \(Return Value Optimization\)**
https://wdv4758h.github.io/notes/cpp/rvo.html
According to the above article, RVO is faster than `std::move`.
:::
## Step 4 \& 5
These two steps introduce how derivatives are computed, which is what we will need later to train a model. Please refer to the book for details.
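The book's Step 4 computes a numerical derivative with the central difference `(f(x + eps) - f(x - eps)) / (2 * eps)`. A sketch of how that could look with the classes above \(the `Variable` constructor from a `MatrixXf` is assumed\):
```cpp
// Central-difference numerical differentiation, as in the book's numerical_diff(f, x, eps).
MatrixXf numerical_diff(Function &f, const Variable &x, float eps = 1e-4f) {
    Variable x0((x.data.array() - eps).matrix());
    Variable x1((x.data.array() + eps).matrix());
    Variable y0 = f(x0);
    Variable y1 = f(x1);
    return (y1.data - y0.data) / (2.0f * eps);
}
```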
:::info
:information_source: **Chain Rule Review**
https://hackmd.io/ZhWoUuriQemVnf3Oj78c6Q#Derivatives-Gradients-Partial-Derivatives-and-Chain-Rule
:::
## Step 6
:::success
:bulb: **Check If `Eigen::MatrixXf` is Assigned with Any Value**
> Reference \:
> https://stackoverflow.com/a/59839698
Basically, check the size of the `MatrixXf` against `0`. Note that there is no `None` or `uninitialized` state for a `MatrixXf`; a default-constructed `MatrixXf` simply has size `0`, so size `0` is used as the "not assigned yet" marker.
:::
## Step 7
Build a connection between `Function` and `Variable` to keep track of the creator of a `Variable`
:::info
:bulb: **forward declaration**
In the code for this step, a forward declaration is used since class `Function` is used in class `Variable` and vice versa. We need to tell the compiler that class `Function` exists and will be defined later. **Additionally, since class `Function` is not yet defined at that point \(it is an incomplete type\), we cannot declare a member of that type by value and can only use a pointer.**
:::
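A sketch of how the header might be laid out at this step \(member names follow the `backward()` snippet below\):
```cpp
class Function;   // forward declaration: Function is referenced before it is defined

class Variable {
public:
    MatrixXf data;
    MatrixXf grad;
    Function *creator = nullptr;   // only a pointer is possible here: Function is still incomplete
    void set_creator(Function *func) { this->creator = func; }
    void backward();
};

class Function {
public:
    Variable *input = nullptr;     // Variable is fully defined at this point
    // forward() / backward() omitted
};
```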
### An Abstract of `backward()` in this step.
```cpp
#include "framework.h"

void Variable::backward()
{
    Function* f = this->creator;
    if (f != nullptr) {
        std::cout << "backward\n\n";
        f->input->grad = f->backward(this->grad);
        std::cout << "Data : " << f->input->data << " Grad : " << f->input->grad << std::endl;
        f->input->backward(); // recursively
    }
}
```
Notice that this is done recursively. Since recursion is generally not preferred, the next step changes it to a loop.
## Step 8
Avoiding recursion is commonly accepted practice. Please refer to the following article if you think otherwise.
:::info
:bulb: **Recursion vs loops**
https://stackoverflow.com/questions/660337/recursion-vs-loops
Although this is an old post, the answers still help us understand the pros and cons of each.
:::
In the following code, a `std::vector` is used to store all the functions. Using `std::queue` is another option, but `std::queue` is by default built on top of `std::deque`, which has a bigger overhead than `std::vector`.
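A sketch of the loop-based `backward()` \(assuming, as in the book, that `Function` also keeps a pointer to its output\):
```cpp
void Variable::backward() {
    std::vector<Function*> funcs { this->creator };
    while (!funcs.empty()) {
        Function* f = funcs.back();
        funcs.pop_back();
        // propagate the gradient from the function's output to its input
        f->input->grad = f->backward(f->output->grad);
        if (f->input->creator != nullptr) {
            funcs.push_back(f->input->creator);   // schedule the previous function
        }
    }
}
```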
:::info
:bulb: **C++ STL queue memory usage compared to vector?**
https://stackoverflow.com/questions/14784551/c-stl-queue-memory-usage-compared-to-vector
:::
## Step 9
```cpp
shared_ptr<Variable> square(shared_ptr<Variable> x) {
    shared_ptr<Function> f = make_shared<Square>();
    return (*f)(x);
}

shared_ptr<Variable> exp(shared_ptr<Variable> x) {
    shared_ptr<Function> f = make_shared<Exp>();
    return (*f)(x);
}
```
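A quick usage sketch of the wrappers above, following the book's Step 9 example with `x = 0.5` \(the `Variable` constructor from a `MatrixXf` is assumed\):
```cpp
auto x = make_shared<Variable>(MatrixXf::Constant(1, 1, 0.5f));
auto y = square(exp(square(x)));   // the wrappers compose naturally
y->backward();
std::cout << x->grad << std::endl;
```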
## Extra Step
I found that the framework has a lot of issues when using raw pointers. Therefore, **we replace all the raw pointers with smart pointers**. *This part of the implementation is heavily inspired by bugph0bia's work since I am still learning.*
:::info
:bulb: **Using std::enable_shared_from_this\<T\>**
https://blog.csdn.net/caoshangpa/article/details/79392878
After using it, the following error occurred:
```
terminate called after throwing an instance of 'std::bad_weak_ptr'
what(): bad_weak_ptr
Aborted (core dumped)
```
Please make sure that in the main function you are using `shared_ptr<Function> A = make_shared<Square>();` instead of `Function* A = new Square()`.
The reason is described in the following Stack Overflow post.
https://stackoverflow.com/questions/27697973/shared-from-this-causing-bad-weak-ptr
:::
Please refer to the GitHub repository for a better understanding. The following steps will use the modified version.
## Step 10
Skip
# Stage \#2 \: Make Framework Easier to Use
## Step 11 \& 12 \& 13
Before this step, all the implemented functions have a single input and a single output. In this step, the book extends this to multiple inputs and outputs.
:::warning
:warning: **Prevent Unnecessary Copies**
This implementation makes use of `std::vector`, which **by default copies all of its elements when passed into or returned from a function by value**. For this implementation, I will try to keep everything as **pass by reference**.
:::
```cpp
// Overloading : one input with multiple outputs
vector<shared_ptr<Variable>> operator()(shared_ptr<Variable> input);
// Overloading : multiple inputs with multiple outputs
vector<shared_ptr<Variable>> operator()(const vector<shared_ptr<Variable>>& inputs);
```
### Implementation Considerations
* In the book, the output of a `Function` is always a list, and extracting the single value is handled outside the class.
* The single-input `Function operator()(input)` overload basically wraps the input into `inputs`, which in this case is a `vector`, and forwards to the multi-input overload.
* A pointer is 8 bytes, and pass by reference still passes a pointer under the hood. Therefore, `Function operator()(input)` will be kept as pass by value.
* `std::shared_ptr<Base> Ptr = std::make_shared<Derived> ()` Reference \: https://stackoverflow.com/a/24968387
:::info
:information_source: **Python `*` Unpacking**
C++ has a similar mechanism in `std::initializer_list`; however, in this case, we can just overload the function and make everything work as described in the section above.
*Some improvements are made to make the code easier to use. However, some improvements that are reasonable in Python are not in C++.*
:::
### Backpropagation
In the backpropagation of addition there is one input and there are two outputs, which is the opposite of the forward pass. To support multiple outputs, we need to modify the `backward` function in the `Variable` class.
Notice that the Python implementation uses `zip(f.inputs, gxs)` to iterate over these two containers together. There is no such function in C++, so `gxs.size()` is used as the loop bound. To use it safely like this, an additional check is added, which can be taken out when the framework is released.
```cpp
#ifdef DEBUG
if (f->inputs.size() != gxs.size()) {
    perror("Backward Error : backward size does not match forward size\n");
    exit(EXIT_FAILURE);
}
#endif
```
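A sketch of the zip-like loop itself \(at this stage `grad` and `gxs` are still plain `MatrixXf`\):
```cpp
for (size_t i = 0; i < gxs.size(); ++i) {
    f->inputs[i]->grad = gxs[i];   // pair each input with its gradient, like Python's zip
}
```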
## Step 14
### Problem
In `step13.py` line 29, `x.grad` stores `gx` directly.
```python
for x, gx in zip(f.inputs, gxs):
    x.grad = gx
```
However, if we reuse the same variable in a computation, this behavior will ***replace*** the previously calculated `x.grad`.
### Solution
```python
if x.grad is None:
    x.grad = gx
else:
    x.grad = x.grad + gx
```
In the above code snippet, when the variable is backpropagated for the first time, we store `gx`; if it is not the first time, we add onto the previously stored value.
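In the Eigen version, the same check can be expressed with the size-based "unassigned" test from Step 6. A sketch:
```cpp
if (f->inputs[i]->grad.size() == 0) {
    f->inputs[i]->grad = gxs[i];                        // first time: just store the gradient
} else {
    f->inputs[i]->grad = f->inputs[i]->grad + gxs[i];   // variable reused: accumulate
}
```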
### Reset Grad
If we want to save memory and call `y.backward()` twice, we need a way to reset the grad from the previous call.
In Python, we can assign `None` to `self.grad`. With Eigen, we have to resize the matrix to size `0`, which triggers Eigen to deallocate the memory block.
```cpp
this->grad.resize(0, 0);
```
## Step 15
*For details, please refer to the book.*
Until now, we have only worked on computation graphs forming a single line. However, our current scheme does not work on more complex computation graphs.

With our current model, graph \#1 above produces the correct backpropagation result, but graph \#2 does not.
Notice that we should finish all the previous branches before moving on to the next variables and functions. Therefore, we introduce a generation number for variables and functions: all the child generations have to be finished before working on their parents.
## Step 16
Add a variable representing the generation to both the `Variable` and `Function` classes. The `Variable`\'s generation is set to the generation of the function that generated it `+ 1`. On the other hand, the `Function`\'s generation is set to the maximum generation among its input `Variable`s.
### Generation in `Variable`
```cpp
void Variable::set_creator(shared_ptr<Function> func) {
    this->creator = func;
    this->generation = func->generation + 1;
}
```
### Generation in `Function`
```cpp
shared_ptr<Variable> max = *(std::max_element(inputs.begin(), inputs.end(),
    [](const shared_ptr<Variable> lhs, const shared_ptr<Variable> rhs) { return lhs->generation < rhs->generation; }));
this->generation = max->generation;
```
Notice that in the above code, we use `std::max_element` to retrieve the biggest generation from all the input `Variable`s.
### Backward in `Variable`
This step performs backpropagation with the generation in mind.
```cpp
/**
 * @brief Use lambda captures to avoid passing `funcs` and `seen_set` as arguments
 * @todo Migrate `funcs` to a priority queue.
 * @todo Migrate to a private member function
 */
auto add_func = [&funcs, &seen_set] (const shared_ptr<Function> &f) {
    if (seen_set.find(f) == seen_set.end()) {
        funcs.push_back(f);
        seen_set.insert(f);
        std::sort(funcs.begin(), funcs.end(),
            [] (const shared_ptr<Function> &lhs, const shared_ptr<Function> &rhs) { return lhs->generation < rhs->generation; });
    }
};
```
We use a lambda inside a member function of the `Variable` class. This allows us to capture `funcs` and `seen_set` by reference instead of passing them around as arguments. This might not be better performance-wise; migrating it to a private member function could be done later. \(TODO\)
## Step 17
In Python, memory is managed with reference counting, which is also what we rely on when using `std::shared_ptr`. However, there is one pattern that reference counting cannot handle correctly\: **circular references**.
In Python, the Garbage Collector takes over and handles such dangling objects. The GC is able to deal with circular references; however, there is a delay before the memory is freed, since the GC only runs when memory becomes insufficient.
### Circular Reference in DeZero till Now

The `Function.outputs` is a reference to `Variable` and the `Variable.creator` is a reference to `Function`. Thus, a circular reference.
### Using `std::weak_ptr`
**My Other Posts**
* [Effective Modern C++ Item 20](https://hackmd.io/7n-VHaZtQBmLUgHEmc2uIA?view#Item-20--Use-stdweak_ptr-to-replace-stdshared_ptr-that-can-be-danggled)
### Solution
* Change `Function`\'s `outputs` to weak reference `std::weak_ptr`
:::info
:information_source: **Weak Pointers**
https://shengyu7697.github.io/std-weak_ptr/
In the framework code, I use explicit type casting; however, implicit type casting from `shared_ptr` to `weak_ptr` is allowed.
```cpp
for (const auto &output : outputs) {
    this->outputs.push_back(static_cast<weak_ptr<Variable>>(output));
}
```
:::
## Step 18 \!TODO \: Might Be a Better Way
### Retain Gradient or Not
Usually, we don't need the gradients of a model's intermediate results. Therefore, a way to release them and reduce memory usage is applied.
:::success
Unlike the *book* and *DeZeroCpp*, the basic data structure within `Variable` is not a pointer but a `MatrixXf`. Therefore, I cannot set the `MatrixXf` to `nullptr` or `None`. Similar to `cleargrad()`, I call `MatrixXf.resize(0, 0)` to empty the `MatrixXf` and release its memory. As mentioned above, I rely on Eigen\'s memory system to maintain the basic data structure.
```cpp
if (!retain_grad) {
    for (const auto &y : f->outputs) {
        y.lock()->grad.resize(0, 0); // setting the MatrixXf size to 0 makes Eigen release the memory
    }
}
```
:::
### Switch Between Inference Mode and Training Mode
In the book, the `with` statement used in Python is basically RAII in C++. See [reference](https://stackoverflow.com/a/11493847). Therefore, creating a class that does the work in its constructor and destructor is the implementation chosen by *DeZeroCpp*.
I am not going to follow the code structure of DeZero completely, since that is not practical and would make the code much more complex. **Most importantly, I am lazy**.
```cpp
struct Config {
    bool enable_backprop = true;
}; // default value is set

Config config; // global variable for global configuration

class using_config {
public:
    using_config(Config new_values) {
#ifdef DEBUG
        std::cout << "Apply new configuration\n";
#endif
        old_values = config;
        config = new_values;
    }

    ~using_config() {
#ifdef DEBUG
        std::cout << "Restore to old configuration\n";
#endif
        config = old_values;
    }

    using_config() = delete;
    using_config(const using_config&) = delete;
    using_config(using_config&&) = delete;
    using_config& operator=(const using_config&) = delete;
    using_config& operator=(using_config&&) = delete;

private:
    Config old_values;
};
```
Basically, I **use a global variable `config` to maintain the configuration**. This will definitely make the code structure differ from the original Python DeZero. However, this is the most straightforward way to do it.
The `using_config` class is simple as well.
* Only the constructor with a `Config` parameter is allowed.
* The `Config` structure is passed around by **deep copy**.
**Example**
```cpp
{
    Config no_grad;
    no_grad.enable_backprop = false;
    using_config temp_config(no_grad);
    /* ... */
}
```
## Step 19
This step works on making the `Variable` class more informative. In other words, we need to provide more information about the underlying `MatrixXf` through the `Variable` class.
* `name` \: provide a `name` member
* `shape` \: provide a member function `shape()` that returns a `std::pair`, since a `MatrixXf` is always 2D \(see the sketch below\)
* `size` \: provide a member function `size()`
* `ndim` \: no implementation, since it will always be 2
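A sketch of the two member functions \(assuming they simply forward to Eigen\):
```cpp
std::pair<Eigen::Index, Eigen::Index> Variable::shape() const {
    return { this->data.rows(), this->data.cols() };
}

Eigen::Index Variable::size() const {
    return this->data.size();   // total number of coefficients, rows * cols
}
```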
:::success
:bulb: **Underlying Data Structure**
We are using `MatrixXf` to store the data. However, if more than 2D is needed, the entire model will have to be modified to use `Eigen::Tensor`.
:::
## Step 20
:::success
:bulb: **Debug Diary**
The correct implementation can be either of the following two, ***but not both combined!!***
**\#1 \: Calls Lambda Function for the First Element in `Func`**
```cpp
vector<shared_ptr<Function>> funcs {};
set<shared_ptr<Function>> seen_set {};

/**
 * @brief Use lambda captures to avoid passing `funcs` and `seen_set` as arguments
 * @todo Migrate `funcs` to a priority queue.
 * @todo Migrate to a private member function
 */
auto add_func = [&funcs, &seen_set] (const shared_ptr<Function> &f) {
    if (seen_set.find(f) == seen_set.end()) {
        funcs.emplace_back(f);
        seen_set.insert(f);
        std::sort(funcs.begin(), funcs.end(),
            [] (const shared_ptr<Function> &lhs, const shared_ptr<Function> &rhs) { return lhs->generation < rhs->generation; });
    }
};

add_func(this->creator);
```
**\#2 \: Add in Initializer for First Element in `Func`**
```cpp
vector<shared_ptr<Function>> funcs { this->creator };
set<shared_ptr<Function>> seen_set {};

/**
 * @brief Use lambda captures to avoid passing `funcs` and `seen_set` as arguments
 * @todo Migrate `funcs` to a priority queue.
 * @todo Migrate to a private member function
 */
auto add_func = [&funcs, &seen_set] (const shared_ptr<Function> &f) {
    if (seen_set.find(f) == seen_set.end()) {
        funcs.emplace_back(f);
        seen_set.insert(f);
        std::sort(funcs.begin(), funcs.end(),
            [] (const shared_ptr<Function> &lhs, const shared_ptr<Function> &rhs) { return lhs->generation < rhs->generation; });
    }
};

// add_func(this->creator);
```
Notice that \#2 might have slightly better performance, since the first element never needs to be checked for duplication. However, keeping the implementation as the book provides it might be a better idea, since this behavior might have its meaning in the future.
***However, this can be optimized if it proves not to be useful.***
:::
### Backward for Multiplication
```cpp
vector<MatrixXf> Mul::backward(vector<MatrixXf> &gys) {
    return { gys[0].array() * this->inputs[1]->data.array(), gys[0].array() * this->inputs[0]->data.array() };
}
```
### Operator Overloading
```cpp
shared_ptr<Variable> operator+(const shared_ptr<Variable> &lhs, const shared_ptr<Variable> &rhs) {
    vector<shared_ptr<Variable>> xs {lhs, rhs};
    shared_ptr<Function> f = make_shared<Add>();
    return (*f)(xs)[0];
}

shared_ptr<Variable> operator*(const shared_ptr<Variable> &lhs, const shared_ptr<Variable> &rhs) {
    vector<shared_ptr<Variable>> xs {lhs, rhs};
    shared_ptr<Function> f = make_shared<Mul>();
    return (*f)(xs)[0];
}
```
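A quick usage sketch of the overloads above, mirroring the book's `a * b + c` example \(`to_variable` and the no-argument `backward()` call are assumed to behave as in the Step 24 example further below\):
```cpp
auto a = to_variable(MatrixXf::Constant(1, 1, 3.0f));
auto b = to_variable(MatrixXf::Constant(1, 1, 2.0f));
auto c = to_variable(MatrixXf::Constant(1, 1, 1.0f));

auto y = a * b + c;   // forward pass: y = 7
y->backward();
std::cout << a->grad << " " << b->grad << std::endl;   // expected: 2 and 3
```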
## Step 21
Provide the following overloads for a `float` as `lhs` or `rhs`.
```cpp
shared_ptr<Variable> operator+(const float lhs, const shared_ptr<Variable> &rhs) {
    vector<shared_ptr<Variable>> xs { to_variable(MatrixXf::Constant(1, 1, lhs)), rhs };
    shared_ptr<Function> f = make_shared<Add>();
    return (*f)(xs)[0];
}

shared_ptr<Variable> operator+(const shared_ptr<Variable> &lhs, const float rhs) {
    vector<shared_ptr<Variable>> xs { lhs, to_variable(MatrixXf::Constant(1, 1, rhs)) };
    shared_ptr<Function> f = make_shared<Add>();
    return (*f)(xs)[0];
}
```
## Step 22
Create `neg`, `sub`, `div`, `pow` classes and functions.
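A sketch of one of them, following the same forward/backward pattern as `Mul` above \(class and member names mirror the existing functions and are otherwise assumptions\):
```cpp
// Negation: forward flips the sign of the data, and backward flips the sign of the gradient.
vector<MatrixXf> Neg::forward(vector<MatrixXf> &xs) {
    return { -xs[0] };
}

vector<MatrixXf> Neg::backward(vector<MatrixXf> &gys) {
    return { -gys[0] };
}
```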
## Step 23
In this step, I am going to just study the book.
***No Implementation***
## Step 24
**In Step 21**, I provide operator overloading for both `shared_ptr<Variable>` and `float`.
### Example
```cpp
int main(int argc, char** argv)
{
    auto x = to_variable(MatrixXf::Constant(1, 1, 1.0f));
    auto y = to_variable(MatrixXf::Constant(1, 1, 1.0f));

    auto z = to_variable(MatrixXf::Constant(1, 1, 0.26f)) * (pow(x, 2) + pow(y, 2)) - to_variable(MatrixXf::Constant(1, 1, 0.48f)) * x * y;
    z->backward();

    std::cout << x->grad << "\n";
    std::cout << y->grad << "\n";

    return 0;
}
```
## Define by Run \& Define and Run
> Reference \:
> * Define-and-Run vs. Define-by-Run
> https://medium.com/@zzemb6/define-and-run-vs-define-by-run-b527d127e13a
In this step, I am going to use Chinese to take note since this will reduce ambiguity.
### Define and Run \(static computation graph\)

It will be like my other projects
* [CUDA_AI_Framework](https://github.com/Chen-KaiTsai/CUDA_AI_Framework)
* [Eigen_AI_Framework](https://github.com/Chen-KaiTsai/Eigen_AI_Framework)
However, the above projects are pretty naive; there is no compile stage and performance is not good.
```cpp
int main()
{
    framework::getDeviceName();
    framework::getDeviceInfo(0);

    std::unique_ptr<float[]> predict = std::make_unique<float[]>(1 * 2);
    std::unique_ptr<float[]> data = std::make_unique<float[]>(1 * 3 * 224 * 224);

    std::future<void> backgroundThread = std::async(std::launch::async, bufferLoader, data.get());

    framework::sequential model;
    model.add(new framework::input({1, 3, 224, 224}));
    model.add(new framework::conv2d(model.back(), 32, 3, 2, 1, false, 1, {16, 16, 4}));
    model.add(new framework::batchnorm(model.back(), 1e-5, true, {16, 16, 4}));
    model.add(new framework::relu6(model.back(), true, {1024}));
    // 1
    model.add(new framework::dwconv2d(model.back(), 3, 1, 1, false, 1, {16, 16, 4})); //4
    model.add(new framework::batchnorm(model.back(), 1e-5, true, {16, 16, 4}));
    model.add(new framework::relu6(model.back(), true, {1024}));
    model.add(new framework::pwconv2d(model.back(), 16, false, 1, {16, 16, 4}));
    model.add(new framework::batchnorm(model.back(), 1e-5, true, {16, 16, 4}));
    ...
    model.inference(data.get(), predict.get());
```
### Define by Run \(dynamic computation graph\)
The data is passed through and the computation graph is built at the same time.

# Stage \#3 \: Calculate Higher-order Derivatives
## Step 25
There are two ways I can provide this feature. One is to embed the entire graph-drawing facility into the framework; however, this would add one more library dependency and a more complex `Makefile`. The other way is a code-generation approach: generate the DOT source for `graphviz` and provide a script to produce a `png` of the compute graph, thus reducing the complexity of the framework.
### Study `Graphviz`
**My Other Post \:**
* [Graphviz \& DOT Language](https://hackmd.io/@Erebustsai/HJCyK2ELxx)
## Step 26
Apparently, code generation is also the approach `dezero` chooses\: the function `get_dot_graph` takes `y`, the final `Variable` of a compute graph, as its input.
### Implementation Considerations
:::info
:information_source: `typeid()` function
> Reference \:
> * Displaying the full `typeid` type name in C++ \(in Chinese\)
> https://viml.nchc.org.tw/show-complete-typeid-name/
> * Viewing the complete variable type in C++ \(in Chinese\)
> https://blog.csdn.net/m0_48692571/article/details/106783440
```cpp
#include <cxxabi.h>   // GCC/Clang specific demangling API

abi::__cxa_demangle(typeid(*f).name(), 0, 0, 0);
```
:::
:::info
:information_source: **Store Address of Variable**
* Which cast to use when casting between pointers and `intptr_t`\/`uintptr_t`\?
https://stackoverflow.com/a/77208571
:::
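Putting the two notes above together, a sketch of how a single `Variable` node could be emitted \(the function name, label and colors are illustrative, following the book's `_dot_var`\):
```cpp
#include <cstdint>
#include <sstream>

// Emit one DOT node per Variable, using its address as a unique node id
// (the C++ counterpart of Python's id() in the book).
std::string dot_var(const shared_ptr<Variable> &v) {
    std::ostringstream oss;
    oss << reinterpret_cast<std::uintptr_t>(v.get())
        << " [label=\"" << v->name << "\", color=orange, style=filled]\n";
    return oss.str();
}
```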
Notice that I am not going to use C++ to run any external command, since it is best to keep things as simple as possible \(I am lazy and don't want to deal with cross-platform issues\). Instead, a message is printed on the terminal to show an example of the DOT compile command.
### Sample Output Image

## Step 27
### Implementation Considerations
Simply call the member functions `Eigen::MatrixXf.array().sin()` and `Eigen::MatrixXf.array().cos()` to do the computation. Notice that printing an `Eigen::MatrixXf` only shows about six significant digits by default, so the output is rounded.
:::info
:information_source: **C++ Use Constant from `<cmath>`**
```cpp
#define _USE_MATH_DEFINES   // must be defined before including <cmath> (required on MSVC)
#include <cmath>
```
https://learn.microsoft.com/zh-tw/cpp/c-runtime-library/math-constants?view=msvc-170
:::
### Taylor Series
Taylor series can be used to approximate a function.
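The general expansion around a point $a$ \(the standard Taylor series, as used in the book's Step 27\)\:
$$
f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \cdots
$$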

In the above formula, `a` can be any number; however, using `a = 0` makes things easier. This special case of the Taylor series is called the Maclaurin series.
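With `a = 0` this becomes\:
$$
f(x) = f(0) + f'(0)\,x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \cdots
$$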

When we apply `f(x) = sin(x)`, the pattern above simplifies further. Since `sin(0) = 0` and `cos(0) = 1`, we can easily evaluate the formula as\:
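$$
\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots = \sum_{i=0}^{\infty} (-1)^i \frac{x^{2i+1}}{(2i+1)!}
$$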

## Step 28
### Function Optimization
Find the set of parameters that gives the minimum or maximum value of a function.
### Rosenbrock Function
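The Rosenbrock function used here, matching the code below \(its global minimum is at $(x_0, x_1) = (1, 1)$\)\:
$$
y = 100\,(x_1 - x_0^2)^2 + (x_0 - 1)^2
$$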

```cpp
int main(int argc, char** argv)
{
    auto x0 = to_variable(MatrixXf::Constant(1, 1, 0.0f));
    auto x1 = to_variable(MatrixXf::Constant(1, 1, 2.0f));

    auto y = 100.0f * pow((x1 - pow(x0, 2)), 2) + pow((x0 - 1.0f), 2);
    y->backward();

    std::cout << x0->grad << " " << x1->grad << "\n";

    return 0;
}
```
```bash
-2 400
```
`x0` and `x1` have `-2.0` and `400.0` as their derivatives, respectively. If we view these values as the vector `(-2.0, 400.0)`, it is the gradient vector. The gradient represents the direction of steepest increase of the function output at a given point. Therefore, at `(x0, x1)` the direction in which `y` increases the most is `(x0.grad, x1.grad)`, which means that the negative direction `-(x0.grad, x1.grad)` decreases `y` the most.
### Gradient Descent
```cpp
// lr = 0.001f and itr = 1000 reproduce the output below (the book's hyperparameters).
for (unsigned int i = 0; i < itr; ++i) {
    std::cout << "( " << x0 << ", " << x1 << " )\n";

    auto y = 100.0f * pow((x1 - pow(x0, 2)), 2) + pow((x0 - 1.0f), 2);

    x0->cleargrad();
    x1->cleargrad();
    y->backward();

    x0->data -= lr * x0->grad;
    x1->data -= lr * x1->grad;
}
```
```bash
( variable(0), variable(2) )
( variable(0.002), variable(1.6) )
( variable(0.005276), variable(1.28) )
( variable(0.0099667), variable(1.02401) )
...
( variable(0.683051), variable(0.465046) )
( variable(0.683272), variable(0.465348) )
( variable(0.683492), variable(0.465651) )
```
## Step 29
In this step, we manually implement optimization with Newton's method.
### Newton's Method
Using this algorithm can achieve faster convergence.
> Reference \:
> * Calculus for Machine Learning and Data Science\(Week3: Lesson 2 \- Newton\'s Method\)
> https://hackmd.io/@shaoeChen/ryk5yli33
Refer to the above article and the book to get the full detail.
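The update rule for the single-variable case used here\:
$$
x \leftarrow x - \frac{f'(x)}{f''(x)}
$$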

We can see Newton\'s method as gradient descent whose learning rate is set to `1 / f''(x)`.
## Step 30 \& 31
These steps enable DeZero to calculate higher-order derivatives.
### Theory
The framework currently only makes connections between variables and functions in the forward pass. To enable the framework to calculate higher-order derivatives, we need to make these connections in the backward pass as well.
### Code Review
```cpp
vector<shared_ptr<Variable>> Function::operator()(const vector<shared_ptr<Variable>> &inputs) {
    vector<MatrixXf> xs {};
    // deep copy each input's data into xs
    for (const auto &x : inputs) {
        xs.emplace_back(x->data);
    }

    vector<MatrixXf> ys = this->forward(xs);

    vector<shared_ptr<Variable>> outputs {};
    // wrap each result in ys into a new output Variable
    for (auto &y : ys) {
        shared_ptr<Variable> temp = make_shared<Variable>(y);
        temp->set_creator(shared_from_this());
        outputs.emplace_back(temp);
    }

    if (config.enable_backprop == true) {
        shared_ptr<Variable> max = *(std::max_element(inputs.begin(), inputs.end(),
            [](const shared_ptr<Variable> lhs, const shared_ptr<Variable> rhs) { return lhs->generation < rhs->generation; }));
        this->generation = max->generation;

        this->inputs = inputs;
        for (const auto &output : outputs) {
            this->outputs.emplace_back(static_cast<weak_ptr<Variable>>(output));
        }
    }

    return outputs;
}
```
The above code snippet is the `Function` class we have so far. Notice that the connection is made when `set_creator()` is called, so the program only creates ***connections*** in the forward pass.
### How to do it\?
We need to store the gradient `gx` as a `Variable`, not just a `MatrixXf`, so that it can have its own `grad`, which is itself a `Variable`. This lets us take derivatives of arbitrary order.
*However, it might not be that easy to do in the C++ version, since the data types need to be explicit.*
## Step 32
*This step is a huge change to the entire framework, thus I am going to include what I do and how I approach it.*
### Approach
1. Fully modify the header `framework.h`.
2. Similar to TDD, fill in the needed implementations in the `.cpp` files.
### Implementation Considerations
In the `Variable` class, I define things as follows...
```cpp
class Variable
{
public:
    MatrixXf data;
    shared_ptr<Variable> grad;
    std::string name;

    shared_ptr<Function> creator = nullptr;
    unsigned int generation = 0;
    ...
```
Notice that, as [mentioned before](https://hackmd.io/z3VqORjgS9ONy1VCXtRVvA?view#Implementation-Considerations), I want to keep `data` as an instance of `MatrixXf`, since `Eigen::MatrixXf` already uses a pointer for its real data underneath. Adding another layer of pointer over `MatrixXf` doesn't make sense to me **for now**.
On the other hand, I need to use `shared_ptr<Variable>` for `grad` \(explained in the next section\), so the `backward` interfaces need to use `vector<shared_ptr<Variable>>` to accommodate this.
Notice that to clear the grad, we can now just assign `nullptr` to the `shared_ptr`.
:::info
:information_source: **Class in Class**
A class cannot contain an instance of itself, since its size would be infinite \(every instance would have to contain another complete instance\). Therefore, we can only use a pointer for `grad`, and all the `backward` functions derived from `Function` need to use `vector<shared_ptr<Variable>>` instead of `vector<Variable>`.
:::
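As an illustration of what this change means for a concrete function, a sketch of how `Square::backward` could now look \(it composes `Variable` operations, so the backward pass itself builds a graph; the `float * Variable` overload is assumed from Steps 21/22\):
```cpp
vector<shared_ptr<Variable>> Square::backward(vector<shared_ptr<Variable>> &gys) {
    auto x = this->inputs[0];
    // d(x^2)/dx = 2x, expressed with DeZero operations so it stays differentiable
    return { 2.0f * x * gys[0] };
}
```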
### `Config` Control
```cpp
void Variable::backward(bool retain_grad, bool create_graph) {
    ...
    while (!funcs.empty())
    {
        shared_ptr<Function> f = funcs.back();
        funcs.pop_back();

        vector<shared_ptr<Variable>> gys {};
        for (const auto &output : f->outputs) {
#ifdef DEBUG
            if (output.expired()) {
                perror("Memory Management : Weak ptr expired");
                exit(EXIT_FAILURE);
            }
#endif
            gys.emplace_back(output.lock()->grad);
        }

        { // backprop lives within it
            using_config backprop_control({create_graph});
            vector<shared_ptr<Variable>> gxs = f->backward(gys);
            ...
```
## Step 33
When calculating the second-order derivative, we need to call `cleargrad` on the input `x` in between. Otherwise, the first- and second-order derivatives would be accumulated into the same `grad`.
```cpp
auto x = to_variable(MatrixXf::Constant(1, 1, 2.0f));
int it = 10;

for (int i = 0; i < it; ++i) {
    std::cout << i << " iteration: " << x << std::endl;

    auto y = f(x);                // f is the target function, y = x^4 - 2x^2 in the book's example
    x->cleargrad();
    y->backward(false, true);     // create_graph = true so the backward pass itself builds a graph

    auto gx = x->grad;            // first derivative
    x->cleargrad();
    gx->backward();
    auto gx2 = x->grad;           // second derivative

    x->data.array() -= gx->data.array() / gx2->data.array();   // Newton update on the raw data
}
```
```
0 iteration: variable(2)
1 iteration: variable(1.45455)
2 iteration: variable(1.15105)
3 iteration: variable(1.02533)
4 iteration: variable(1.00091)
5 iteration: variable(1)
6 iteration: variable(1)
7 iteration: variable(1)
8 iteration: variable(1)
9 iteration: variable(1)
```
The update part \(the last line of the loop\) uses `MatrixXf` directly, so it does not behave as a `Variable` and does not create a computation graph.
## Step 34
Add two additional functions, `sin()` and `cos()`.
For these additional functions, the framework separates them into a different file, and I do the same. Their implementations live in `functions.cpp`, while the header remains the same.
I didn\'t use a `namespace` to separate these functions from the core functions. This might change later.
```cpp
int main(int argc, char** argv)
{
    auto x = to_variable(MatrixXf::Constant(1, 1, 1.0f));
    auto y = sin(x);
    y->backward(false, true);

    for (int i = 0; i < 3; ++i) {
        auto gx = x->grad;
        x->cleargrad();
        gx->backward(false, true);
        std::cout << x->grad << std::endl;
    }

    return 0;
}
```
```
./main
variable(-0.841471)
variable(-0.540302)
variable(0.841471)
```
## Step 35
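Step 35 adds `tanh`. Written in terms of the forward output `y`, its derivative is \(this is what the `backward` below implements\)\:
$$
y = \tanh(x), \qquad \frac{\partial y}{\partial x} = 1 - y^2
$$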

```cpp
// Tanh class
vector<MatrixXf> Tanh::forward(vector<MatrixXf> &xs) {
    return { xs[0].array().tanh() };
}

vector<shared_ptr<Variable>> Tanh::backward(vector<shared_ptr<Variable>> &gys) {
    return { gys[0] * (1.0f - (this->outputs[0].lock() * this->outputs[0].lock())) };
}
```
```cpp
int main(int argc, char** argv)
{
    auto x = to_variable(MatrixXf::Constant(1, 1, 1.0f));
    auto y = tanh(x);
    x->name = "x";
    y->name = "y";
    y->backward(false, true);

    for (int i = 0; i < 6; ++i) {
        auto gx = x->grad;
        x->cleargrad();
        gx->backward(false, true);
    }

    auto gx = x->grad;
    gx->name = "gx";
    plot_dot_graph(gx);

    return 0;
}
```

The above diagrams show `tanh` at the first, second, and sixth order of derivative. The sixth-order graph is huge and takes some time to generate \(on an EPYC 7302P system using one core\), so I only show part of it.
## Step 36
*No implementation; only extra materials are included.*
:::info
:information_source: **Double Backprop**
https://ai.stackexchange.com/a/25115
:::
# Stage \#4 \: Construct Neural Networks
## Step 37
To support matrix-based computation instead of only scalars, a checking and review pass over the existing code is required.
```cpp
int main(int argc, char** argv)
{
    MatrixXf IN(2, 3);
    IN << 1, 2, 3,
          4, 5, 6;
    std::cout << IN << std::endl;

    auto x = to_variable(std::move(IN));
    auto y = sin(x);
    y->backward(false, true);
    std::cout << x->grad << std::endl;

    auto w = to_variable(MatrixXf::Constant(1, 1, 1.0f));
    auto z = sin(w);
    z->backward(false, true);
    std::cout << w->grad << std::endl;

    return 0;
}
```
The above code checks that a matrix input produces the same per-element result as the scalar case. Note that `Variable.data` and `Variable.grad` should have the same shape, and the following code snippet makes sure of it.
```cpp
void Variable::backward(bool retain_grad, bool create_graph) {
    if (this->grad == nullptr) {
        this->grad = to_variable(MatrixXf::Constant(this->data.rows(), this->data.cols(), 1.0f));
    }
    ...
```
## Extra Step
`Eigen` is not `numpy` in many ways. This step is needed mostly to make `Eigen` behave the way the book \(which assumes numpy\) expects.
```cpp
int main(int argc, char** argv)
{
    Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> IN(2, 3);
    IN << 1, 2, 3,
          4, 5, 6;

    IN.resize(3, 2);
    std::cout << IN << std::endl;

    return 0;
}
```
```
./main
1 2
3 4
5 6
```
This step showcases that an `Eigen` matrix is not row-major by default, unlike what most of us are used to from numpy. Resizing a default \(column-major\) matrix reinterprets the underlying data in column-major order, which makes the result really hard to read. However, we can specify the storage order when declaring the matrix type, as done above with `Eigen::RowMajor`.
*The possible performance impact is not clear yet; we might look into it later.*
```cpp
int main(int argc, char** argv)
{
    Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> IN(2, 3);
    IN << 1, 2, 3,
          4, 5, 6;

    auto x = to_variable(IN);
    std::cout << x << std::endl;

    auto y = sin(x);
    y->backward(false, true);
    std::cout << x->grad << std::endl;

    return 0;
}
```
Backpropagation works correctly when using an `Eigen::RowMajor` `Matrix`. Therefore, we need to replace every `MatrixXf` with the type alias below, so that the `Eigen::RowMajor` setting is kept when matrices are passed through functions.
```cpp
using MatrixR = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
```
In order to make things easier to read, I deleted `using Eigen::MatrixXf` and added the above `using` alias. This makes sure that only the configured `Matrix` type can be used.
```cpp
shared_ptr<Variable> operator+(const float lhs, const shared_ptr<Variable> &rhs) {
    vector<shared_ptr<Variable>> xs { to_variable(MatrixR::Constant(rhs->shape().first, rhs->shape().second, lhs)), rhs };
    shared_ptr<Function> f = make_shared<Add>();
    return (*f)(xs)[0];
}
...
```
In the above code snippet, I only show the `+` operator with a `float` operand on the left-hand side, but this change needs to be applied to all the overloaded operators. It makes sure that when a constant scalar is used, a proper broadcast is performed.
## Step 38
This step works on functions that only change the tensor\'s shape.
### `reshape` Function
`reshape` only changes the shape of the input, so in the backward pass we need to reshape the gradient back to the shape of the forward-pass input.
Additionally, the data order should behave as `RowMajor`, since the book uses `numpy`, which is a row-major compute backend. From the extra step above, we can force the `RowMajor` data order on the `Eigen::Matrix`.
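A sketch of what a `Reshape` function could look like under this scheme \(`in_rows`/`in_cols`/`out_rows`/`out_cols` are hypothetical members, and `reshape()` is assumed to be a thin wrapper that applies a `Reshape` instance, like `transpose()` below\):
```cpp
vector<MatrixR> Reshape::forward(vector<MatrixR> &xs) {
    // remember the input shape so that backward can restore it
    this->in_rows = xs[0].rows();
    this->in_cols = xs[0].cols();
    // reinterpret the row-major buffer with the target shape, like numpy.reshape
    MatrixR y = Eigen::Map<const MatrixR>(xs[0].data(), this->out_rows, this->out_cols);
    return { y };
}

vector<shared_ptr<Variable>> Reshape::backward(vector<shared_ptr<Variable>> &gys) {
    return { reshape(gys[0], this->in_rows, this->in_cols) };
}
```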

:::warning
:warning: **Not Implemented**
In 38.2 section of the book, in order to make the `Variable` look more like a numpy array, a member function of `reshape` is added to `Variable`. However, currently, I do not want to include this function since it might cause issues and this function is just for convenience.
:::
### `transpose` Function
```cpp
vector<MatrixR> Transpose::forward(vector<MatrixR> &xs) {
    return { xs[0].transpose() };
}

vector<shared_ptr<Variable>> Transpose::backward(vector<shared_ptr<Variable>> &gys) {
    return { transpose(gys[0]) };
}
```
Note that the `transpose` function used in `backward()` is the wrapper that applies `Transpose` \(just like `square()` and `exp()` earlier\), not a raw Eigen call. Remember that to support higher-order derivatives, all the computation in `backward()` has to be expressed as a combination of DeZero functions.
For the same reason, a `transpose` member function for `Variable` will not be implemented for now.
## Extra Step\: Optimize `Variable.backward()` with `priority_queue`
As the book suggests, `Variable.backward()` can be changed into a version that does not need to re-sort `funcs` every time a `Function` is added.
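A sketch of the change \(only the container and its usage change; `seen_set` stays the same\):
```cpp
#include <queue>

// Order functions by generation so that the one with the highest generation is processed first.
struct GenerationLess {
    bool operator()(const shared_ptr<Function> &lhs, const shared_ptr<Function> &rhs) const {
        return lhs->generation < rhs->generation;
    }
};

std::priority_queue<shared_ptr<Function>, vector<shared_ptr<Function>>, GenerationLess> funcs;

// funcs.push(f);                          // replaces push_back + std::sort in add_func
// shared_ptr<Function> f = funcs.top();   // highest generation first
// funcs.pop();                            // replaces funcs.back() + pop_back()
```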
## Step 39 \& 40
At this stage of the framework, most of the implementation work is about bridging the differences between `numpy` and `Eigen`.
> Reference\:
> * Numpy and Eigen \: matrix libraries in Python and C++ \(in Chinese\)
> https://www.hbblog.cn/python%26C%2B%2B/matrix_python_cpp/
> * The Eigen `.replicate()` function \(in Chinese\)
> https://blog.csdn.net/weixin_42208705/article/details/122757675
### `sum_to`
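`sum_to(x, shape)` sums the elements of `x` so that the result has the requested shape \(it is the backward counterpart of `broadcast_to`\). A sketch for the 2D `MatrixR` case \(a hypothetical free function, not the final framework API\):
```cpp
MatrixR sum_to(const MatrixR &x, Eigen::Index rows, Eigen::Index cols) {
    MatrixR y = x;
    if (rows == 1 && y.rows() > 1) {
        y = y.colwise().sum().eval();   // collapse all rows into a single row
    }
    if (cols == 1 && y.cols() > 1) {
        y = y.rowwise().sum().eval();   // collapse all columns into a single column
    }
    return y;
}
```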
### `broadcast_to`
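`broadcast_to(x, shape)` replicates `x` to the requested shape, like `numpy.broadcast_to`. With Eigen this maps naturally onto `.replicate()` from the reference above. A sketch \(again a hypothetical helper; it assumes the target shape is a multiple of the input shape\):
```cpp
MatrixR broadcast_to(const MatrixR &x, Eigen::Index rows, Eigen::Index cols) {
    // replicate the block enough times in each direction to reach the target shape
    return x.replicate(rows / x.rows(), cols / x.cols());
}
```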