---
title: Coarse HP Models
---
[TOC]
## Influence transmission and reception model
### 0. Setup and notation
{%hackmd b-i7Qp9rTDaurB_IhqJ3Cw %}
**Note**: Everything in this note is for a single cascade. For multiple cascades, the overall log-likelihood is the average of the log-likelihoods of the individual cascades.
### 1. Vanilla HP model
We start with a vanilla multivariate HP model, in which the intensity for a single source $m$ is defined as:
$\lambda_{m} (t)=\mu_{m} + \sum\limits_{i:\,t_i < t}\alpha_{m_i \rightarrow m}~\kappa(t - t_i)$,
where $\mathbf{\mu} \in \mathbb{R}^{M}$ and $\mathbf{\alpha} \in \mathbb{R}^{M \times M}$ are the parameters of the model. The parameters are learned by maximizing the log-likelihood for the observed cascade.
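To make the setup concrete, here is a minimal NumPy sketch of this intensity. The kernel is an assumption: the note doesn't fix one, so the sketch uses a unit-rate exponential, $\kappa(\Delta)=e^{-\Delta}$, which is also what makes the $(1-\kappa(T-t_n))$ compensator terms below come out exactly.

```python
import numpy as np

def kappa(dt):
    """Unit-rate exponential kernel (an assumption; the note does not fix kappa).
    Chosen so that its integral over [0, x] equals 1 - kappa(x), matching the
    compensator terms used later in this note."""
    return np.exp(-dt)

def intensity(m, t, times, marks, mu, alpha):
    """lambda_m(t) = mu_m + sum_{i: t_i < t} alpha_{m_i -> m} * kappa(t - t_i).

    times : (N,) event times of the cascade, sorted
    marks : (N,) integer source ids m_i in [0, M)
    mu    : (M,) base rates
    alpha : (M, M) dyadic excitation, alpha[u, v] = alpha_{u -> v}
    """
    past = times < t
    return mu[m] + np.sum(alpha[marks[past], m] * kappa(t - times[past]))
```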
#### 1.1 Log likelihood
The log likelihood is given as:
$\mathcal{L} = \sum\limits_{n=1}^{N}\log\left(\lambda_{m_n}(t_n)\right) - \sum\limits_{m=1}^{M}\int\limits_{0}^{T}\lambda_m(t)dt$
With some reorganization and simplification, shown [here](/tBpeVSYxQoiWUQZB36Jp8w) and partially visible [here](https://arxiv.org/pdf/1609.02075.pdf) and [here](http://www.smallake.kr/wp-content/uploads/2015/01/HawkesCourseSlides.pdf), the log likelihood can be written in its expanded form as:
$\mathcal{L(\mathbf{\mu},\mathbf{\alpha})}=\sum\limits_{n=1}^{N}\log(\mu_{m_n} + \sum\limits_{m^{'}=1}^{M}\alpha_{m^{'}\rightarrow m_n}~R_{m^{'}\rightarrow m_n }(t_n)) - (\sum\limits_{m=1}^{M}(\mu_{m}T ~+ \sum\limits_{n=1}^{N}\alpha_{m_n \rightarrow m}(1-\kappa(T-t_n))))$,
where:
$$
R_{u \rightarrow v}(i)=
\begin{cases}
\kappa(t_{i}^{v} - t_{i-1}^{v})R_{u \rightarrow v}(i-1) + \sum\limits_{\{t_{i-1}^{v} \leq t_j^{u} \leq t_{i}^{v}\}}\kappa(t_i^{v} - t_j^{u}),& \text{if } u\neq v\\
\kappa(t_{i}^{v} - t_{i-1}^{v})(1+R_{u \rightarrow v}(i-1)), & \text{if } u=v
\end{cases}
$$
**Note**: The notation here is a bit overloaded, but assume for the sake of completeness that this calculation of $R$ is available. Notice that it is independent of the parameters.
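Because the exponential kernel is memoryless ($\kappa(a+b)=\kappa(a)\,\kappa(b)$), the recursion can be computed in a single pass over the cascade. A sketch under that assumption, reusing the data layout from the intensity sketch above (`compute_R` is a hypothetical helper name):

```python
import numpy as np

kappa = lambda dt: np.exp(-dt)   # unit-rate exponential kernel, as in the earlier sketch

def compute_R(times, marks, M):
    """R[n, u] = R_{u -> m_n}(t_n), built with the recursion above.

    The recursion is exact only for an exponential kernel, since it relies on
    kappa(a + b) = kappa(a) * kappa(b). Assumes times are sorted and distinct.
    """
    N = len(times)
    R = np.zeros((N, M))
    prev = {}                                      # per recipient v: index of v's previous event
    for n in range(N):
        v, t = marks[n], times[n]
        if v in prev:
            p = prev[v]
            decay = kappa(t - times[p])
            R[n] = decay * R[p]                    # u != v: carry over and decay
            R[n, v] = decay * (1.0 + R[p, v])      # u == v: previous event of v also excites
            for j in range(p + 1, n):              # events of other sources in (t_{i-1}^v, t_i^v)
                if marks[j] != v:
                    R[n, marks[j]] += kappa(t - times[j])
        else:
            for j in range(n):                     # first event of v: all prior events contribute
                R[n, marks[j]] += kappa(t - times[j])
        prev[v] = n
    return R
```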
#### 1.2 Gradients
$\frac{\partial \mathcal{L}}{\partial \mu_{m}} = \sum\limits_{n=1}^{N}\frac{\delta(m_n=m)}{\lambda_{m_n}(t_n)} - T$
$\frac{\partial \mathcal{L}}{\partial \alpha_{u \rightarrow v}} = \sum\limits_{n=1}^{N}\frac{\sum\limits_{m^{'} }^{M}R_{m^{'} \rightarrow m_n}(t_n)(\delta(m^{'} = u)\times \delta(m_{n} = v))}{\lambda_{m_n}(t_n)} - \sum\limits_{m=1}^{M}\sum\limits_{n=1}^{N}(1-\kappa(T-t_{n}))(\delta(m_n=u) \times \delta (m=v))$
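Putting the pieces together, here is a sketch of the log-likelihood and both gradients written against the precomputed $R$ matrix (same kernel assumption as above; `compute_R` and the data layout are from the earlier sketches, and `vanilla_loglik_and_grads` is a hypothetical helper name):

```python
import numpy as np

def vanilla_loglik_and_grads(times, marks, R, mu, alpha, T):
    """Log-likelihood (1.1) and gradients (1.2) against the precomputed R.

    times, marks : cascade arrays as in the earlier sketches
    R            : (N, M) array with R[n, u] = R_{u -> m_n}(t_n)
    Returns (loglik, grad_mu, grad_alpha).
    """
    M = mu.shape[0]
    # lambda_{m_n}(t_n) = mu_{m_n} + sum_{m'} alpha_{m' -> m_n} R_{m' -> m_n}(t_n)
    lam = mu[marks] + np.einsum('nu,un->n', R, alpha[:, marks])
    survive = 1.0 - np.exp(-(T - times))           # the (1 - kappa(T - t_n)) factors
    loglik = (np.log(lam).sum()
              - T * mu.sum()
              - np.sum(survive * alpha[marks].sum(axis=1)))
    # dL/dmu_m: sum of 1/lambda over events of m, minus T
    grad_mu = np.bincount(marks, weights=1.0 / lam, minlength=M) - T
    # dL/dalpha_{u -> v}
    grad_alpha = np.zeros((M, M))
    for n in range(len(times)):
        grad_alpha[:, marks[n]] += R[n] / lam[n]   # first term: events received by v = m_n
    grad_alpha -= np.bincount(marks, weights=survive, minlength=M)[:, None]  # second term: events emitted by u = m_n
    return loglik, grad_mu, grad_alpha
```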
### 2. Coarse HP model
The vanilla HP model gives us dyadic influence parameters. Consider a case where we're more interested in knowing the aggregate influence/leadership of a source directly; conversely, one might be interested in knowing the extent to which a source tends to be a follower. For example, if sources are publication venues in NLP, we could be interested in knowing the overall influence of ACL and not so much about the influence of ACL on EMNLP.
One way to answer this would be to sum together the dyadic influence parameters out of $u$ as follows:
$\text{Influence (u)} = \sum\limits_{v^{'}=1}^{M}\alpha_{u \rightarrow v^{'}}$
However, this estimate of influence is likely to have high variance: the dyadic parameters are themselves high-variance when estimated on small amounts of data, and pooling them amplifies that variance further. A more direct approach is to make the following modification instead.
$\alpha_{u \rightarrow v} = \mathbf{b}_u \cdot \mathbf{c}_v + s_{u,v}$
where:
* $\mathbf{s} \in \mathbb{R}^{M \times M}$ is a diagonal matrix that captures self-excitation; $s_{u,u} \in \mathbb{R}$ and $s_{u,v} = 0$ for $u \neq v$.
* $\mathbf{b} \in \mathbb{R}^{M \times k}$ captures the transmission/influence/leadership parameters for each source; $\mathbf{b}_u \in \mathbb{R}^{k}$.
* $\mathbf{c} \in \mathbb{R}^{M \times k}$ captures the reception parameters for each source; $\mathbf{c}_v \in \mathbb{R}^{k}$.
Note that instead of learning $M(M+1)$ parameters as in the vanilla HP model, we now learn $2M(1+k)$ parameters. For our purposes, $k=1$, which turns the dot product into scalar multiplication.
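In code, the coarse parameterization is just a rank-$k$ factorization plus a diagonal; a minimal sketch (`coarse_alpha` is a hypothetical helper name):

```python
import numpy as np

def coarse_alpha(b, c, s):
    """alpha_{u -> v} = b_u . c_v + s_{u,v}, with s diagonal.

    b, c : (M, k) transmission and reception factors
    s    : (M,)   self-excitation terms (the diagonal of the s matrix)
    """
    return b @ c.T + np.diag(s)
```

Since $\alpha$ is a deterministic function of $(\mathbf{b}, \mathbf{c}, \mathbf{s})$, the log-likelihood in 2.1 can be evaluated by plugging this matrix into the section-1 routine, and the gradients in 2.2 then follow by the chain rule (see the sketch after 2.2).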
#### 2.1 Log likelihood
Let's rewrite the log likelihood compactly.
$$\mathcal{L(\mathbf{\mu},\mathbf{b}, \mathbf{c}, \mathbf{s})}=\sum\limits_{n=1}^{N}\log(\mu_{m_n} + \sum\limits_{m^{'}=1}^{M}(b_{m^{'}}\cdot c_{m_n} + s_{m^{'}}\delta(m^{'}=m_n))~R_{m^{'}\rightarrow m_n }(t_n)) \\ - (\sum\limits_{m=1}^{M}(\mu_{m}T ~+ \sum\limits_{n=1}^{N}(b_{m_n}\cdot c_m + s_{m_{n}}\delta(m_n=m))(1-\kappa(T-t_n))))$$
#### 2.2 Gradients
The gradient for $\mu$ is unchanged from [1.2](#12-Gradients).
$\frac{\partial \mathcal{L}}{\partial {b_u}} = \sum\limits_{n=1}^{N}\frac{\sum\limits_{m^{'} }^{M}c_{m_n}R_{m^{'} \rightarrow m_n}(t_n)\delta(m^{'} = u)}{\lambda_{m_n}(t_n)} - \sum\limits_{m=1}^{M}\sum\limits_{n=1}^{N}c_m(1-\kappa(T-t_{n}))\delta(m_n=u)$
$\frac{\partial \mathcal{L}}{\partial {c_u}} = \sum\limits_{n=1}^{N}\frac{\sum\limits_{m^{'} }^{M}b_{m^{'}}R_{m^{'} \rightarrow m_n}(t_n)\delta(m_n = u)}{\lambda_{m_n}(t_n)} - \sum\limits_{m=1}^{M}\sum\limits_{n=1}^{N}b_{m_n}(1-\kappa(T-t_{n}))\delta(m=u)$
$\frac{\partial \mathcal{L}}{\partial {s_u}} = \sum\limits_{n=1}^{N}\frac{\sum\limits_{m^{'} }^{M}R_{m^{'} \rightarrow m_n}(t_n)\delta(m_n = u) \times \delta(m^{'} = u)}{\lambda_{m_n}(t_n)} - \sum\limits_{m=1}^{M}\sum\limits_{n=1}^{N}(1-\kappa(T-t_{n}))\delta(m=u) \times \delta(m_n=u)$
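These are exactly the chain rule applied to the vanilla $\partial\mathcal{L}/\partial\alpha$ from 1.2, so in code they can be obtained by reusing the earlier sketches (`coarse_alpha` and `vanilla_loglik_and_grads` are the hypothetical helpers defined above):

```python
import numpy as np

def coarse_loglik_and_grads(times, marks, R, mu, b, c, s, T):
    """Section 2 log-likelihood and gradients via the chain rule on dL/dalpha.

    Reuses coarse_alpha and vanilla_loglik_and_grads from the sketches above.
    """
    alpha = coarse_alpha(b, c, s)
    loglik, grad_mu, grad_alpha = vanilla_loglik_and_grads(times, marks, R, mu, alpha, T)
    grad_b = grad_alpha @ c              # dL/db_u  = sum_v (dL/dalpha_{u->v}) c_v
    grad_c = grad_alpha.T @ b            # dL/dc_v  = sum_u (dL/dalpha_{u->v}) b_u
    grad_s = np.diag(grad_alpha).copy()  # dL/ds_u  = dL/dalpha_{u->u}
    return loglik, grad_mu, grad_b, grad_c, grad_s
```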
### 3. Controlling for activation and source size
It's possible that some sources are not active at all times in $[0,T)$. In our case of modeling influence between NLP venues, not every venue is active in every year (e.g., NAACL is not held every third year), so it makes sense to learn excitation parameters that depend on time. A further modification is therefore as follows.
$\alpha_{u \rightarrow v}(t) = (\mathbf{b}_u \cdot \mathbf{c}_v) ~f(v,t) + s_{u,v}$
where:
* $f(\cdot, \cdot) \in \{0,1\}$ is a gating function; $f(v,t)=1$ if source $v$ is active at time $t$, $0$ otherwise.
Also, let $g(m,t)$ be the size of source $m$ at time $t$, so that we can decompose $\mu_m$ as follows.
* $\mu_m(t) = \gamma_m ~g(m,t)$
**Note**: We only want to check if the recipient is active because the sender is implicitly active when calculating the log-likelihood.
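For the venue use case, $f$ and $g$ can be simple lookup tables indexed by source and (binned) time; a minimal sketch, in which `active` and `size` are hypothetical per-year tables, not anything defined in this note:

```python
import numpy as np

# Hypothetical per-year lookup tables (illustrative values only):
# active[m, year] in {0, 1}, size[m, year] >= 0.
active = np.array([[1, 1, 1],
                   [1, 0, 1]])            # e.g. source 1 is inactive in year 1
size = np.array([[120., 130., 150.],
                 [80., 0., 95.]])         # e.g. papers published per year

def f(v, t):
    """Gating function: 1 if source v is active at time t, else 0."""
    return active[v, int(t)]              # years as integer time bins

def g(m, t):
    """Size of source m at time t (e.g. number of papers that year)."""
    return size[m, int(t)]
```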
#### 3.1 Log likelihood
Again, rewriting the log likelihood from [2.1](#21-Log-likelihood),
$$\mathcal{L(\mathbf{\gamma},\mathbf{b}, \mathbf{c}, \mathbf{s})}=\sum\limits_{n=1}^{N}\log\left(\gamma_{m_n}~g(m_n, t_n) + \sum\limits_{m^{'}=1}^{M}\left(b_{m^{'}}~ c_{m_n} ~ f(m_n, t_n)+ s_{m^{'}}\delta(m^{'}=m_n)\right)~R_{m^{'}\rightarrow m_n }(t_n)\right) \\ - \sum\limits_{m=1}^{M}\left(\gamma_{m}\sum\limits_{n=1}^{N}g(m,t_n) ~+ \sum\limits_{n=1}^{N}\left(b_{m_n} ~ c_m ~ f(m,t_n)+ s_{m_{n}}\delta(m_n=m)\right)\left(1-\kappa(T-t_n)\right)\right)$$
<span style="color:red"> Still to be resolved:
* Is the part for time dependent $\mu_m$ right?
* In this expression, the $f(m_n, t_n)$ --- in the first part of the expression under log --- will always evaluate to $1$ because we only evaluate the log-likelihood at $t_n$ and $m_n$ is acting as the recipient at $t_n$; arguably, the gating need not be applied here because $R$ acts as a soft gating. Is this right?
</span>
#### 3.2 Gradients
$\frac{\partial \mathcal{L}}{\partial \gamma_{m}} = \sum\limits_{n=1}^{N}\frac{\delta(m_n=m)~g(m_n,t_n)}{\lambda_{m_n}(t_n)} - \sum\limits_{n=1}^{N}g(m,t_n)$
$\frac{\partial \mathcal{L}}{\partial {b_u}} = \sum\limits_{n=1}^{N}\frac{\sum\limits_{m^{'} }^{M}c_{m_n}~f(m_n,t_n)~R_{m^{'} \rightarrow m_n}(t_n)\delta(m^{'} = u)}{\lambda_{m_n}(t_n)} - \sum\limits_{m=1}^{M}\sum\limits_{n=1}^{N}c_m~f(m,t_n)(1-\kappa(T-t_{n}))\delta(m_n=u)$
$\frac{\partial \mathcal{L}}{\partial {c_u}} = \sum\limits_{n=1}^{N}\frac{\sum\limits_{m^{'} }^{M}b_{m^{'}}~f(m_n,t_n)~R_{m^{'} \rightarrow m_n}(t_n)\delta(m_n = u)}{\lambda_{m_n}(t_n)} - \sum\limits_{m=1}^{M}\sum\limits_{n=1}^{N}b_{m_n}~f(m, t_n)~(1-\kappa(T-t_{n}))\delta(m=u)$
$\frac{\partial \mathcal{L}}{\partial {s_u}} = \sum\limits_{n=1}^{N}\frac{\sum\limits_{m^{'} }^{M}R_{m^{'} \rightarrow m_n}(t_n)\delta(m_n = u) \times \delta(m^{'} = u)}{\lambda_{m_n}(t_n)} - \sum\limits_{m=1}^{M}\sum\limits_{n=1}^{N}(1-\kappa(T-t_{n}))\delta(m=u) \times \delta(m_n=u)$
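Given how easy it is to drop a $\delta$ or a summation in these hand-derived gradients (see the open questions in 3.1), a central finite-difference check against any of the log-likelihoods above is a cheap sanity test; a generic sketch:

```python
import numpy as np

def finite_diff_check(loss_fn, grad_analytic, theta, eps=1e-6):
    """Compare an analytic gradient against a central finite-difference estimate.

    loss_fn       : maps an array of theta's shape to the scalar log-likelihood
    grad_analytic : analytic gradient, same shape as theta
    Returns the largest absolute discrepancy across coordinates.
    """
    theta = np.asarray(theta, dtype=float)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        bump = np.zeros_like(theta)
        bump.flat[i] = eps
        numeric.flat[i] = (loss_fn(theta + bump) - loss_fn(theta - bump)) / (2 * eps)
    return np.max(np.abs(numeric - np.asarray(grad_analytic, dtype=float)))
```

For example, hold everything but $\mathbf{b}$ fixed, wrap the 2.1 log-likelihood as a function of $\mathbf{b}$, and compare against the analytic $\partial\mathcal{L}/\partial b_u$; a discrepancy well above numerical noise usually indicates a missing term.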
###### tags: `HP models` `PhD` `css` `science of science` `semantic leadership`