Pascal.Jr.Tikeng.Notsawo

@6LQ4mvRtS4Sc3LHkNEvDXQ

Joined on May 7, 2022

  • * Since 2021, when I watch a movie I write it down here (with a rating out of 10 next to it, and sometimes a comment).
    * I also plan ahead the movies I'm going to watch.
    * I don't rate certain movies that are too old. They're not movies of my time, so it's hard for me to rate them.
    * I've also aged 40 years in the last 4 years, so don't put too much faith in what I say.
    * Rating scale:
      * >= 8/10 : 'absolute cinema'
      * 8/10 > rating >= 7/10 : you can watch it
      * 7/10 > rating > 5/10 : I don't know
      * <= 5/10 : don't waste your time (go play a round of your favorite sport instead)
      * <= 2.5/10 : whoever made that should go to jail

    Update: I'm on Letterboxd (https://letterboxd.com/tq9_matusalem/)
  • The material below is implemented with more details in this notebook: https://colab.research.google.com/drive/19IZiHeZTWK6pjGsrq0iJRJhXuaWNfzpM?usp=sharing

    For $p, q, m \in \mathbb{N}^*$, let:
    * $\mathcal{S} = [p]^q$ : the set of sequences of length $q$ over the alphabet $[p] = \{0, 1, \cdots, p-1\}$. We have $|\mathcal{S}| = p^q$. Another way of looking at it is to consider $\mathcal{S} = \{0, 1, \cdots, p^q-1\}$, the set of all integers whose base-$p$ representation (for example $p=10$ for decimals) has length at most $q$ (hence $p^q-1$ being the largest one), with each representation padded in front with $0$'s to reach length exactly $q$ (if $q=3$ and $p=2$ for example, $1$ becomes $001$, while $101$ remains $101$, etc.). This view is more flexible for generating sequences, since all we have to do is generate integers between $0$ and $p^q-1$ (which is quite easy to do), then pad them.
    * $\mathcal{F} = S_q$ : the symmetric group $S_q$, i.e. the set of permutations of $q$ elements. We have $|\mathcal{F}| = q!$.

    We randomly partition $\mathcal{F}$ into two disjoint and non-empty sets $\mathcal{F}_{\text{train}}$ and $\mathcal{F}_{\text{val}}$, for the training and the validation dataset respectively. The training data fraction $r = |\mathcal{F}_{\text{train}}| / |\mathcal{F}|$ is a hyperparameter. A minimal sketch of this setup is given below.
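    A minimal Python sketch of this setup, assuming only the definitions above (the function names `make_sequences` and `split_permutations` are mine, not from the notebook):

    ```python
    import itertools
    import random

    def make_sequences(p: int, q: int):
        """All sequences in [p]^q, built by zero-padding the base-p digits
        of the integers 0 .. p^q - 1 (the 'integer' view described above)."""
        seqs = []
        for n in range(p ** q):
            digits = []
            while n:
                digits.append(n % p)
                n //= p
            digits += [0] * (q - len(digits))  # pad in front with 0's
            seqs.append(tuple(reversed(digits)))
        return seqs

    def split_permutations(q: int, r: float, seed: int = 0):
        """Randomly partition F = S_q into disjoint train/val sets with
        |F_train| / |F| close to r (assumes q >= 2 so both parts are non-empty)."""
        perms = list(itertools.permutations(range(q)))  # |S_q| = q!
        random.Random(seed).shuffle(perms)
        k = max(1, min(len(perms) - 1, round(r * len(perms))))
        return perms[:k], perms[k:]

    seqs = make_sequences(p=2, q=3)             # 2^3 = 8 sequences
    f_train, f_val = split_permutations(q=3, r=0.5)
    print(seqs[:4], len(f_train), len(f_val))   # e.g. first sequences, 3, 3
    ```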
  • References on McDiarmid's inequality and conditional expectation:
    * https://web.eecs.umich.edu/~cscott/past_courses/eecs598w14/notes/09_bounded_difference.pdf
    * https://stats.stackexchange.com/questions/21362/understanding-proof-of-mcdiarmids-inequality
    * https://www.stat.cmu.edu/~larry/=stat705/Lecture2.pdf
    * https://ocw.mit.edu/courses/18-s997-high-dimensional-statistics-spring-2015/a69e2f53bb2eeb9464520f3027fc61e6_MIT18_S997S15_Chapter1.pdf
    * https://math.stackexchange.com/questions/41536/intuitive-explanation-of-the-tower-property-of-conditional-expectation
  • Let's suppose we're training a model parameterized by $\theta$, and let's denote by $\theta_t$ the parameter $\theta$ at step $t$ of the optimization algorithm of our choice. In machine learning, it is often helpful to be able to decompose the error $E(\theta)$ as $B^2(\theta)+V(\theta)+N(\theta)$, where $B$ represents the bias, $V$ the variance, and $N$ the noise (irreducible error). In most cases, the decomposition is performed on an optimal solution $\theta^*$ (for instance, $\lim_{t \rightarrow \infty} \theta_t$, or its early-stopping version), for example in order to understand how the bias and variance change with the complexity of the function implementing $\theta$, the size of this function, etc. This has helped explain phenomena such as model-wise double descent. On the other hand, it can also be interesting to visualize how $B(\theta_t)$ and $V(\theta_t)$ evolve with $t$ (which can help explain phenomena like epoch-wise double descent): that's what we'll be doing in this blog post; see the sketch after this note.

    Notations:
    * $\mathcal{X}$ : domain set (input space)
    * $\mathcal{Y}$ : label set (output space)
    * $\mathcal{H}$ : hypothesis class (class of possible models we can learn)

    Definitions and Preliminaries

    Definition 1 (Loss Function). The loss function $\ell(t, y)$ is a function that takes two labels and produces a value between $0$ and some constant $M \in [0, \infty]$, and measures the cost of predicting $y$ when the true value is $t$:
    \begin{align*}
    \ell : \mathcal{Y} \times \mathcal{Y} \rightarrow [0, M]
    \end{align*}
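    A minimal sketch of how one might track $B^2(\theta_t)$ and $V(\theta_t)$ across training steps, assuming the classical squared-loss decomposition on a toy regression problem. All names and the setup here are mine (not from the post): the expectation over training sets is approximated by averaging over several independently sampled datasets.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy regression problem: y = f(x) + noise, so N(theta) = sigma^2.
    f = lambda x: np.sin(3 * x)
    n, sigma, n_runs, n_steps, lr = 64, 0.3, 20, 200, 0.1

    x_test = np.linspace(-1, 1, 100)
    phi = lambda z: np.stack([np.sin(k * z) for k in range(1, 6)], axis=1)
    preds = np.zeros((n_runs, n_steps, x_test.size))  # h_t(x) per run and step

    for run in range(n_runs):
        # Fresh training set each run (simulates sampling D ~ P^n).
        x = rng.uniform(-1, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        # Linear model on fixed sinusoidal features, trained by gradient descent.
        w = np.zeros(5)
        for t in range(n_steps):
            grad = 2 * phi(x).T @ (phi(x) @ w - y) / n
            w -= lr * grad
            preds[run, t] = phi(x_test) @ w

    # Classical decomposition at each step t (squared loss):
    # B^2(t) = mean_x (E_D[h_t(x)] - f(x))^2,  V(t) = mean_x Var_D[h_t(x)].
    mean_pred = preds.mean(axis=0)                       # E_D[h_t(x)]
    bias2 = ((mean_pred - f(x_test)) ** 2).mean(axis=1)  # B^2(theta_t)
    var = preds.var(axis=0).mean(axis=1)                 # V(theta_t)
    print(bias2[::50], var[::50])
    ```

    Typically the bias term falls as $t$ grows while the variance term rises, which is exactly the trade-off one wants to visualize step by step.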
  • Author: Pascal Tikeng

    Notations:
    * $\Theta \subseteq \mathbb{R}^d$ : parameter (weight) space, $|\Theta| = d$. We will consider $\Theta$ as an affine space to which we associate a vector space $\vec{\Theta}$ : $\forall (\theta, \theta') \in \Theta^2, \ \vec{\theta\theta'} = \theta' - \theta \in \vec{\Theta}$. The elements of $\vec{\Theta}$ will be noted with an arrow on top of them.
    * $I_n = \{1, \dots, n\} \ \forall n \in \mathbb{N}$
    * $| \cdot |$ : Frobenius norm
    * $\langle \cdot, \cdot, \cdot \rangle$ : bilinear product, $\langle u, A, v \rangle = u^TAv = \sum_{i}\sum_{j} u_i a_{ij} v_j$ for a matrix $A \in \mathbb{R}^{n \times m}$ and two vectors $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^m$
    * $\langle \cdot, \cdot \rangle$ :
      * scalar product : $\langle u, v \rangle = u^Tv = \sum_{i} u_iv_i$ for two vectors $u, v \in \mathbb{R}^n$
      * quadratic product : $\langle A, u \rangle = \langle u, A, u \rangle = u^TAu = \sum_{i}\sum_{j} u_i a_{ij} u_j$ for a matrix $A \in \mathbb{R}^{n \times n}$ and a vector $u \in \mathbb{R}^n$
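    A quick NumPy check of these products (a sketch of the notation only; the variable names are mine):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 3, 4
    u = rng.normal(size=n)
    v = rng.normal(size=m)
    A = rng.normal(size=(n, m))
    B = rng.normal(size=(n, n))

    bilinear = u @ A @ v           # <u, A, v> = u^T A v
    scalar = u @ u                 # <u, u> = u^T u
    quadratic = u @ B @ u          # <B, u> = <u, B, u> = u^T B u
    frobenius = np.linalg.norm(B)  # |B| : Frobenius norm
    print(bilinear, scalar, quadratic, frobenius)
    ```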