\title{
Matrix Differentiation
}
\section{Introduction}
Throughout this presentation I have chosen to use a symbolic matrix notation. This choice was not made lightly. I am a strong advocate of index notation, when appropriate. For example, index notation greatly simplifies the presentation and manipulation of differential geometry. As a rule-of-thumb, if your work is going to primarily involve differentiation with respect to the spatial coordinates, then index notation is almost surely the appropriate choice.
In the present case, however, I will be manipulating large systems of equations in which the matrix calculus is relatively simple while the matrix algebra and matrix arithmetic are messy and more involved. Thus, I have chosen to use symbolic notation.
\section{Notation and Nomenclature}
Definition 1 Let $a_{i j} \in \Re, i=1,2, \ldots, m, j=1,2, \ldots, n$. Then the ordered rectangular array
$$
\mathbf{A}=\left[\begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1 n} \\
a_{21} & a_{22} & \cdots & a_{2 n} \\
\vdots & \vdots & & \vdots \\
a_{m 1} & a_{m 2} & \cdots & a_{m n}
\end{array}\right]
$$
is said to be a real matrix of dimension $m \times n$.
When writing a matrix I will occasionally write down its typical element as well as its dimension. Thus,
$$
\mathbf{A}=\left[a_{i j}\right], \quad i=1,2, \ldots, m ; \; j=1,2, \ldots, n
$$
denotes a matrix with $m$ rows and $n$ columns, whose typical element is $a_{i j}$. Note that the first subscript locates the row in which the typical element lies, while the second subscript locates the column. For example, $a_{j k}$ denotes the element lying in the $j$th row and $k$th column of the matrix $\mathbf{A}$.
Definition 2 A vector is a matrix with only one column. Thus, all vectors are inherently column vectors.
\section{Convention 1}
Multi-column matrices are denoted by boldface uppercase letters: for example, $\mathbf{A}, \mathbf{B}, \mathbf{X}$. Vectors (single-column matrices) are denoted by boldfaced lowercase letters: for example, $\mathbf{a}, \mathbf{b}, \mathbf{x}$. I will attempt to use letters from the beginning of the alphabet to designate known matrices, and letters from the end of the alphabet for unknown or variable matrices.
\section{Convention 2}
When it is useful to explicitly attach the matrix dimensions to the symbolic notation, I will use an underscript. For example, $\underset{m \times n}{\mathbf{A}}$, indicates a known, multi-column matrix with $m$ rows and $n$ columns.
A superscript ${ }^{\top}$ denotes the matrix transpose operation; for example, $\mathbf{A}^{\top}$ denotes the transpose of $\mathbf{A}$. Similarly, if $\mathbf{A}$ has an inverse it will be denoted by $\mathbf{A}^{-1}$. The determinant of $\mathbf{A}$ will be denoted by either $|\mathbf{A}|$ or $\operatorname{det}(\mathbf{A})$. Similarly, the rank of a matrix $\mathbf{A}$ is denoted by $\operatorname{rank}(\mathbf{A})$. An identity matrix will be denoted by $\mathbf{I}$, and $\mathbf{0}$ will denote a null matrix.
\section{Matrix Multiplication}
Definition 3 Let $\mathbf{A}$ be $m \times n$, and $\mathbf{B}$ be $n \times p$, and let the product $\mathbf{A B}$ be
$$
\mathbf{C}=\mathbf{A B}
$$
then $\mathbf{C}$ is an $m \times p$ matrix, with element $(i, j)$ given by
$$
c_{i j}=\sum_{k=1}^{n} a_{i k} b_{k j}
$$
for all $i=1,2, \ldots, m, \quad j=1,2, \ldots, p$.
Proposition 1 Let $\mathbf{A}$ be $m \times n$, and $\mathbf{x}$ be $n \times 1$, then the typical element of the product
$$
\mathbf{z}=\mathbf{A x}
$$
is given by
$$
z_{i}=\sum_{k=1}^{n} a_{i k} x_{k}
$$
for all $i=1,2, \ldots, m$. Similarly, let $\mathbf{y}$ be $m \times 1$, then the typical element of the product
$$
\mathbf{z}^{\top}=\mathbf{y}^{\top} \mathbf{A}
$$
is given by
$$
z_{i}=\sum_{k=1}^{m} a_{k i} y_{k}
$$
for all $i=1,2, \ldots, n$. Finally, the scalar resulting from the product
$$
\alpha=\mathbf{y}^{\top} \mathbf{A} \mathbf{x}
$$
is given by
$$
\alpha=\sum_{j=1}^{m} \sum_{k=1}^{n} a_{j k} y_{j} x_{k}
$$
Proof: These are merely direct applications of Definition 3. q.e.d.

Proposition 2 Let $\mathbf{A}$ be $m \times n$, and $\mathbf{B}$ be $n \times p$, and let the product $\mathbf{A B}$ be
$$
\mathbf{C}=\mathbf{A B}
$$
then
$$
\mathbf{C}^{\top}=\mathbf{B}^{\top} \mathbf{A}^{\top}
$$
Proof: The typical element of $\mathbf{C}$ is given by
$$
c_{i j}=\sum_{k=1}^{n} a_{i k} b_{k j}
$$
By definition, the typical element of $\mathbf{C}^{\top}$, say $d_{i j}$, is given by
$$
d_{i j}=c_{j i}=\sum_{k=1}^{n} a_{j k} b_{k i}
$$
Hence,
$$
\mathbf{C}^{\top}=\mathbf{B}^{\top} \mathbf{A}^{\top}
$$
q.e.d.
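As a quick numerical illustration (my own addition, not part of the original development), the following NumPy sketch checks the element formulas of Proposition 1 and the transpose rule of Proposition 2; the dimensions and random test data are arbitrary assumptions.

```python
# Illustrative check of Propositions 1 and 2 (not from the notes): compare the
# element-by-element sums against NumPy's built-in matrix products.
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 3, 4, 2
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
x = rng.standard_normal(n)
y = rng.standard_normal(m)

# z_i = sum_k a_ik x_k  (Proposition 1)
z = np.array([sum(A[i, k] * x[k] for k in range(n)) for i in range(m)])
assert np.allclose(z, A @ x)

# alpha = sum_j sum_k a_jk y_j x_k  (Proposition 1)
alpha = sum(A[j, k] * y[j] * x[k] for j in range(m) for k in range(n))
assert np.isclose(alpha, y @ A @ x)

# (AB)^T = B^T A^T  (Proposition 2)
assert np.allclose((A @ B).T, B.T @ A.T)
```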
Proposition 3 Let $\mathbf{A}$ and $\mathbf{B}$ be $n \times n$ invertible matrices. Let the product $\mathbf{A B}$ be given by
$$
\mathbf{C}=\mathbf{A B}
$$
then
$$
\mathbf{C}^{-1}=\mathbf{B}^{-1} \mathbf{A}^{-1}
$$
Proof:
$$
\mathbf{C} \mathbf{B}^{-1} \mathbf{A}^{-1}=\mathbf{A} \mathbf{B} \mathbf{B}^{-1} \mathbf{A}^{-1}=\mathbf{I}
$$
q.e.d.
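A one-line NumPy check of Proposition 3 (an illustration of mine, not from the notes); the random test matrices are an assumption, with a shifted diagonal to keep them invertible.

```python
# Illustrative check of Proposition 3: (AB)^-1 = B^-1 A^-1.
import numpy as np

rng = np.random.default_rng(1)
n = 4
# Adding n*I to a random matrix makes it safely invertible for this test (an assumption).
A = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((n, n)) + n * np.eye(n)
assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))
```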
\section{Partitioned Matrices}
Frequently, I will find it convenient to deal with partitioned matrices ${ }^{1}$. Such a representation, and the manipulation of this representation, are two of the relative advantages of the symbolic matrix notation.
Definition 4 Let $\mathbf{A}$ be $m \times n$ and write
$$
\mathbf{A}=\left[\begin{array}{ll}
\mathbf{B} & \mathbf{C} \\
\mathbf{D} & \mathbf{E}
\end{array}\right]
$$
where $\mathbf{B}$ is $m_{1} \times n_{1}, \mathbf{E}$ is $m_{2} \times n_{2}, \mathbf{C}$ is $m_{1} \times n_{2}, \mathbf{D}$ is $m_{2} \times n_{1}, m_{1}+m_{2}=m$, and $n_{1}+n_{2}=n$. The above is said to be a partition of the matrix $\mathbf{A}$.
${ }^{1}$ Much of the material in this section is extracted directly from Dhrymes (1978, Section 2.7).

Proposition 4 Let $\mathbf{A}$ be a square, nonsingular matrix of order $m$. Partition $\mathbf{A}$ as
$$
\mathbf{A}=\left[\begin{array}{ll}
\mathbf{A}_{11} & \mathbf{A}_{12} \\
\mathbf{A}_{21} & \mathbf{A}_{22}
\end{array}\right]
$$
so that $\mathbf{A}_{11}$ is a nonsingular matrix of order $m_{1}$, $\mathbf{A}_{22}$ is a nonsingular matrix of order $m_{2}$, and $m_{1}+m_{2}=m$. Then
$$
\mathbf{A}^{-1}=\left[\begin{array}{cc}
\left(\mathbf{A}_{11}-\mathbf{A}_{12} \mathbf{A}_{22}^{-1} \mathbf{A}_{21}\right)^{-1} & -\mathbf{A}_{11}^{-1} \mathbf{A}_{12}\left(\mathbf{A}_{22}-\mathbf{A}_{21} \mathbf{A}_{11}^{-1} \mathbf{A}_{12}\right)^{-1} \\
-\mathbf{A}_{22}^{-1} \mathbf{A}_{21}\left(\mathbf{A}_{11}-\mathbf{A}_{12} \mathbf{A}_{22}^{-1} \mathbf{A}_{21}\right)^{-1} & \left(\mathbf{A}_{22}-\mathbf{A}_{21} \mathbf{A}_{11}^{-1} \mathbf{A}_{12}\right)^{-1}
\end{array}\right]
$$
Proof: Direct multiplication of the proposed $\mathbf{A}^{-1}$ and $\mathbf{A}$ yields
$$
\mathbf{A}^{-1} \mathbf{A}=\mathbf{I}
$$
q.e.d.
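The following NumPy sketch (my own illustration) verifies the partitioned-inverse formula of Proposition 4 on a random test matrix; the block sizes and the diagonal shift used to guarantee nonsingular blocks are assumptions.

```python
# Numerical check of the partitioned inverse in Proposition 4 (illustration only).
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(2)
m1, m2 = 3, 2
m = m1 + m2
A = rng.standard_normal((m, m)) + m * np.eye(m)   # nonsingular, with nonsingular blocks
A11, A12 = A[:m1, :m1], A[:m1, m1:]
A21, A22 = A[m1:, :m1], A[m1:, m1:]

S11 = inv(A11 - A12 @ inv(A22) @ A21)             # (A11 - A12 A22^-1 A21)^-1
S22 = inv(A22 - A21 @ inv(A11) @ A12)             # (A22 - A21 A11^-1 A12)^-1
A_inv = np.block([[S11,                   -inv(A11) @ A12 @ S22],
                  [-inv(A22) @ A21 @ S11,  S22                 ]])
assert np.allclose(A_inv, inv(A))
```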
\section{Matrix Differentiation}
In the following discussion I will differentiate matrix quantities with respect to the elements of the referenced matrices. Although no new concept is required to carry out such operations, the element-by-element calculations involve cumbersome manipulations and, thus, it is useful to derive the necessary results and have them readily available.${ }^{2}$

${ }^{2}$ Much of the material in this section is extracted directly from Dhrymes (1978, Section 4.3). The interested reader is directed to this worthy reference to find additional results.
\section{Convention 3}
Let
$$
\mathbf{y}=\psi(\mathbf{x})
$$
where $\mathbf{y}$ is an $m$-element vector, and $\mathbf{x}$ is an $n$-element vector. The symbol
$$
\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left[\begin{array}{cccc}
\frac{\partial y_{1}}{\partial x_{1}} & \frac{\partial y_{1}}{\partial x_{2}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}} \\
\frac{\partial y_{2}}{\partial x_{1}} & \frac{\partial y_{2}}{\partial x_{2}} & \cdots & \frac{\partial y_{2}}{\partial x_{n}} \\
\vdots & \vdots & & \vdots \\
\frac{\partial y_{m}}{\partial x_{1}} & \frac{\partial y_{m}}{\partial x_{2}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
\end{array}\right]
$$
will denote the $m \times n$ matrix of first-order partial derivatives of the transformation from $\mathbf{x}$ to $\mathbf{y}$. Such a matrix is called the Jacobian matrix of the transformation $\psi()$.
Notice that if $\mathbf{x}$ is actually a scalar in Convention 3 then the resulting Jacobian matrix is an $m \times 1$ matrix; that is, a single column (a vector). On the other hand, if $\mathbf{y}$ is actually a scalar in Convention 3 then the resulting Jacobian matrix is a $1 \times n$ matrix; that is, a single row (the transpose of a vector).
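Convention 3 can also be realized numerically. The routine below is a minimal central-difference sketch of my own (the helper name, example map, and step size are assumptions, not part of the notes); it builds the $m \times n$ Jacobian with row $i$ holding the partials of $y_{i}$.

```python
# A finite-difference routine that realizes Convention 3 numerically.
import numpy as np

def jacobian_fd(psi, x, h=1e-6):
    """Central-difference approximation of the Jacobian matrix of Convention 3."""
    x = np.asarray(x, dtype=float)
    y0 = np.atleast_1d(psi(x))
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (np.atleast_1d(psi(x + e)) - np.atleast_1d(psi(x - e))) / (2 * h)
    return J

# Example map: psi(x) = (x1*x2, x1^2) has Jacobian [[x2, x1], [2*x1, 0]].
x = np.array([1.5, -2.0])
print(jacobian_fd(lambda v: np.array([v[0] * v[1], v[0] ** 2]), x))
```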
Proposition 5 Let
$$
\mathbf{y}=\mathbf{A x}
$$
where $\mathbf{y}$ is $m \times 1$, $\mathbf{x}$ is $n \times 1$, $\mathbf{A}$ is $m \times n$, and $\mathbf{A}$ does not depend on $\mathbf{x}$, then
$$
\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\mathbf{A}
$$
Proof: Since the $i$ th element of $\mathbf{y}$ is given by
$$
y_{i}=\sum_{k=1}^{n} a_{i k} x_{k}
$$
it follows that
$$
\frac{\partial y_{i}}{\partial x_{j}}=a_{i j}
$$
for all $i=1,2, \ldots, m, \quad j=1,2, \ldots, n$. Hence
$$
\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\mathbf{A}
$$
q.e.d.
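A finite-difference check of Proposition 5 (my own illustration): the numerically estimated Jacobian of $\mathbf{y}=\mathbf{A x}$ should reproduce $\mathbf{A}$. The test matrix, vector, and step size are assumptions.

```python
# Finite-difference check of Proposition 5: for y = A x, the Jacobian dy/dx is A.
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

h = 1e-6
J = np.zeros((m, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = h
    J[:, j] = (A @ (x + e) - A @ (x - e)) / (2 * h)   # column j of dy/dx
assert np.allclose(J, A, atol=1e-6)
```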
Proposition 6 Let
$$
\mathbf{y}=\mathbf{A x}
$$
where $\mathbf{y}$ is $m \times 1$, $\mathbf{x}$ is $n \times 1$, $\mathbf{A}$ is $m \times n$, and $\mathbf{A}$ does not depend on $\mathbf{x}$, as in Proposition 5. Suppose that $\mathbf{x}$ is a function of the vector $\mathbf{z}$, while $\mathbf{A}$ is independent of $\mathbf{z}$. Then
$$
\frac{\partial \mathbf{y}}{\partial \mathbf{z}}=\mathbf{A} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}
$$
Proof: Since the $i$ th element of $\mathbf{y}$ is given by
$$
y_{i}=\sum_{k=1}^{n} a_{i k} x_{k}
$$
for all $i=1,2, \ldots, m$, it follows that
$$
\frac{\partial y_{i}}{\partial z_{j}}=\sum_{k=1}^{n} a_{i k} \frac{\partial x_{k}}{\partial z_{j}}
$$
but the right hand side of the above is simply element $(i, j)$ of $\mathbf{A} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}$. Hence
$$
\frac{\partial \mathbf{y}}{\partial \mathbf{z}}=\frac{\partial \mathbf{y}}{\partial \mathbf{x}} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}=\mathbf{A} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}
$$
q.e.d.
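The chain rule of Proposition 6 can be checked numerically as well (a sketch of mine; the map $\mathbf{x}(\mathbf{z})$ below is an arbitrary smooth example, not taken from the notes).

```python
# Check of Proposition 6: with y = A x(z), dy/dz = A dx/dz.
import numpy as np

rng = np.random.default_rng(4)
m, n, q = 3, 4, 2
A = rng.standard_normal((m, n))
z = rng.standard_normal(q)

def x_of_z(z):
    return np.array([z[0] * z[1], np.sin(z[0]), z[1] ** 2, z[0] - z[1]])

def jac(f, z, h=1e-6):
    """Central-difference Jacobian, rows indexed by outputs (Convention 3)."""
    f0 = np.atleast_1d(f(z))
    J = np.zeros((f0.size, z.size))
    for j in range(z.size):
        e = np.zeros_like(z)
        e[j] = h
        J[:, j] = (np.atleast_1d(f(z + e)) - np.atleast_1d(f(z - e))) / (2 * h)
    return J

lhs = jac(lambda z_: A @ x_of_z(z_), z)   # dy/dz
rhs = A @ jac(x_of_z, z)                  # A dx/dz
assert np.allclose(lhs, rhs, atol=1e-5)
```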
Proposition 7 Let the scalar $\alpha$ be defined by
$$
\alpha=\mathbf{y}^{\top} \mathbf{A x}
$$
where $\mathbf{y}$ is $m \times 1$, $\mathbf{x}$ is $n \times 1$, $\mathbf{A}$ is $m \times n$, and $\mathbf{A}$ is independent of $\mathbf{x}$ and $\mathbf{y}$, then
$$
\frac{\partial \alpha}{\partial \mathbf{x}}=\mathbf{y}^{\top} \mathbf{A}
$$
and
$$
\frac{\partial \alpha}{\partial \mathbf{y}}=\mathbf{x}^{\top} \mathbf{A}^{\top}
$$
Proof: Define
$$
\mathbf{w}^{\top}=\mathbf{y}^{\top} \mathbf{A}
$$
and note that
$$
\alpha=\mathbf{w}^{\top} \mathbf{x}
$$
Hence, by Proposition 5 we have that
$$
\frac{\partial \alpha}{\partial \mathbf{x}}=\mathbf{w}^{\top}=\mathbf{y}^{\top} \mathbf{A}
$$
which is the first result. Since $\alpha$ is a scalar, we can write
$$
\alpha=\alpha^{\top}=\mathbf{x}^{\top} \mathbf{A}^{\top} \mathbf{y}
$$
and applying Proposition 5 as before we obtain
$$
\frac{\partial \alpha}{\partial \mathbf{y}}=\mathbf{x}^{\top} \mathbf{A}^{\top}
$$
q.e.d.
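A finite-difference check of Proposition 7 (illustration only; the helper and random test data are assumptions): the numerical gradients of $\alpha=\mathbf{y}^{\top} \mathbf{A x}$ should match $\mathbf{y}^{\top} \mathbf{A}$ and $\mathbf{x}^{\top} \mathbf{A}^{\top}$.

```python
# Check of Proposition 7: d(alpha)/dx = y^T A and d(alpha)/dy = x^T A^T.
import numpy as np

rng = np.random.default_rng(5)
m, n = 3, 4
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = rng.standard_normal(m)

def grad(f, v, h=1e-6):
    """Row vector of partial derivatives of the scalar f with respect to v."""
    g = np.zeros(v.size)
    for j in range(v.size):
        e = np.zeros_like(v)
        e[j] = h
        g[j] = (f(v + e) - f(v - e)) / (2 * h)
    return g

assert np.allclose(grad(lambda x_: y @ A @ x_, x), y @ A, atol=1e-5)
assert np.allclose(grad(lambda y_: y_ @ A @ x, y), x @ A.T, atol=1e-5)
```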
Proposition 8 For the special case in which the scalar $\alpha$ is given by the quadratic form
$$
\alpha=\mathbf{x}^{\top} \mathbf{A x}
$$
where $\mathbf{x}$ is $n \times 1$, $\mathbf{A}$ is $n \times n$, and $\mathbf{A}$ does not depend on $\mathbf{x}$, then
$$
\frac{\partial \alpha}{\partial \mathbf{x}}=\mathbf{x}^{\top}\left(\mathbf{A}+\mathbf{A}^{\top}\right)
$$
Proof: By definition
$$
\alpha=\sum_{j=1}^{n} \sum_{i=1}^{n} a_{i j} x_{i} x_{j}
$$
Differentiating with respect to the $k$ th element of $\mathbf{x}$ we have
$$
\frac{\partial \alpha}{\partial x_{k}}=\sum_{j=1}^{n} a_{k j} x_{j}+\sum_{i=1}^{n} a_{i k} x_{i}
$$
for all $k=1,2, \ldots, n$, and consequently,
$$
\frac{\partial \alpha}{\partial \mathbf{x}}=\mathbf{x}^{\top} \mathbf{A}^{\top}+\mathbf{x}^{\top} \mathbf{A}=\mathbf{x}^{\top}\left(\mathbf{A}^{\top}+\mathbf{A}\right)
$$
q.e.d.

Proposition 9 For the special case where $\mathbf{A}$ is a symmetric matrix and
$$
\alpha=\mathbf{x}^{\top} \mathbf{A x}
$$
where $\mathbf{x}$ is $n \times 1$, $\mathbf{A}$ is $n \times n$, and $\mathbf{A}$ does not depend on $\mathbf{x}$, then
$$
\frac{\partial \alpha}{\partial \mathbf{x}}=2 \mathbf{x}^{\top} \mathbf{A}
$$
Proof: This is an obvious application of Proposition 8. q.e.d.
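A numerical sketch of my own covering Propositions 8 and 9 (random test data and step size are assumptions): the gradient of the quadratic form is $\mathbf{x}^{\top}(\mathbf{A}+\mathbf{A}^{\top})$, which collapses to $2 \mathbf{x}^{\top} \mathbf{A}$ when $\mathbf{A}$ is symmetric.

```python
# Check of Propositions 8 and 9: gradients of the quadratic form x^T A x.
import numpy as np

rng = np.random.default_rng(6)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def grad(f, v, h=1e-6):
    g = np.zeros(v.size)
    for j in range(v.size):
        e = np.zeros_like(v)
        e[j] = h
        g[j] = (f(v + e) - f(v - e)) / (2 * h)
    return g

assert np.allclose(grad(lambda x_: x_ @ A @ x_, x), x @ (A + A.T), atol=1e-5)

S = A + A.T                         # a symmetric test matrix for Proposition 9
assert np.allclose(grad(lambda x_: x_ @ S @ x_, x), 2 * x @ S, atol=1e-5)
```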
Proposition 10 Let the scalar $\alpha$ be defined by
$$
\alpha=\mathbf{y}^{\top} \mathbf{x}
$$
where $\mathbf{y}$ is $n \times 1$, $\mathbf{x}$ is $n \times 1$, and both $\mathbf{y}$ and $\mathbf{x}$ are functions of the vector $\mathbf{z}$. Then
$$
\frac{\partial \alpha}{\partial \mathbf{z}}=\mathbf{x}^{\top} \frac{\partial \mathbf{y}}{\partial \mathbf{z}}+\mathbf{y}^{\top} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}
$$
Proof: We have
$$
\alpha=\sum_{j=1}^{n} x_{j} y_{j}
$$
Differentiating with respect to the $k$ th element of $\mathbf{z}$ we have
$$
\frac{\partial \alpha}{\partial z_{k}}=\sum_{j=1}^{n}\left(x_{j} \frac{\partial y_{j}}{\partial z_{k}}+y_{j} \frac{\partial x_{j}}{\partial z_{k}}\right)
$$
for all $k=1,2, \ldots, n$, and consequently,
$$
\frac{\partial \alpha}{\partial \mathbf{z}}=\frac{\partial \alpha}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{z}}+\frac{\partial \alpha}{\partial \mathbf{x}} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}=\mathbf{x}^{\top} \frac{\partial \mathbf{y}}{\partial \mathbf{z}}+\mathbf{y}^{\top} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}
$$
q.e.d.
Proposition 11 Let the scalar $\alpha$ be defined by
$$
\alpha=\mathbf{x}^{\top} \mathbf{x}
$$
where $\mathbf{x}$ is $n \times 1$, and $\mathbf{x}$ is a function of the vector $\mathbf{z}$. Then
$$
\frac{\partial \alpha}{\partial \mathbf{z}}=2 \mathbf{x}^{\top} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}
$$
Proof: This is an obvious application of Proposition 10. q.e.d.
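A sketch of mine checking Propositions 10 and 11 numerically; the maps $\mathbf{x}(\mathbf{z})$ and $\mathbf{y}(\mathbf{z})$ below are arbitrary smooth examples chosen for the test, not part of the notes.

```python
# Check of Propositions 10 and 11: product rule for y^T x and the special case x^T x.
import numpy as np

rng = np.random.default_rng(7)
z = rng.standard_normal(2)

def x_of_z(z):
    return np.array([z[0] ** 2, z[1], z[0] * z[1]])

def y_of_z(z):
    return np.array([np.sin(z[0]), z[0] + z[1], z[1] ** 3])

def jac(f, z, h=1e-6):
    f0 = np.atleast_1d(f(z))
    J = np.zeros((f0.size, z.size))
    for j in range(z.size):
        e = np.zeros_like(z)
        e[j] = h
        J[:, j] = (np.atleast_1d(f(z + e)) - np.atleast_1d(f(z - e))) / (2 * h)
    return J

x, y = x_of_z(z), y_of_z(z)
# d(y^T x)/dz = x^T dy/dz + y^T dx/dz   (Proposition 10)
assert np.allclose(jac(lambda z_: y_of_z(z_) @ x_of_z(z_), z),
                   x @ jac(y_of_z, z) + y @ jac(x_of_z, z), atol=1e-5)
# d(x^T x)/dz = 2 x^T dx/dz             (Proposition 11)
assert np.allclose(jac(lambda z_: x_of_z(z_) @ x_of_z(z_), z),
                   2 * x @ jac(x_of_z, z), atol=1e-5)
```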
Proposition 12 Let the scalar $\alpha$ be defined by
$$
\alpha=\mathbf{y}^{\top} \mathbf{A x}
$$
where $\mathbf{y}$ is $m \times 1$, $\mathbf{x}$ is $n \times 1$, $\mathbf{A}$ is $m \times n$, and both $\mathbf{y}$ and $\mathbf{x}$ are functions of the vector $\mathbf{z}$, while $\mathbf{A}$ does not depend on $\mathbf{z}$. Then
$$
\frac{\partial \alpha}{\partial \mathbf{z}}=\mathbf{x}^{\top} \mathbf{A}^{\top} \frac{\partial \mathbf{y}}{\partial \mathbf{z}}+\mathbf{y}^{\top} \mathbf{A} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}
$$
Proof: Define
$$
\mathbf{w}^{\top}=\mathbf{y}^{\top} \mathbf{A}
$$
and note that
$$
\alpha=\mathbf{w}^{\top} \mathbf{x}
$$
Applying Proposition 10 we have
$$
\frac{\partial \alpha}{\partial \mathbf{z}}=\mathbf{x}^{\top} \frac{\partial \mathbf{w}}{\partial \mathbf{z}}+\mathbf{w}^{\top} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}
$$
Substituting back in for $\mathbf{w}$ we arrive at
$$
\frac{\partial \alpha}{\partial \mathbf{z}}=\frac{\partial \alpha}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{z}}+\frac{\partial \alpha}{\partial \mathbf{x}} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}=\mathbf{x}^{\top} \mathbf{A}^{\top} \frac{\partial \mathbf{y}}{\partial \mathbf{z}}+\mathbf{y}^{\top} \mathbf{A} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}
$$
q.e.d.
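A numerical check of Proposition 12 (my own sketch; the maps $\mathbf{x}(\mathbf{z})$ and $\mathbf{y}(\mathbf{z})$ and the random $\mathbf{A}$ are assumed test examples).

```python
# Check of Proposition 12: d(y^T A x)/dz = x^T A^T dy/dz + y^T A dx/dz.
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((3, 2))
z = rng.standard_normal(2)

def x_of_z(z):
    return np.array([z[0] * z[1], z[0] ** 2])

def y_of_z(z):
    return np.array([np.cos(z[1]), z[0], z[0] + z[1] ** 2])

def jac(f, z, h=1e-6):
    f0 = np.atleast_1d(f(z))
    J = np.zeros((f0.size, z.size))
    for j in range(z.size):
        e = np.zeros_like(z)
        e[j] = h
        J[:, j] = (np.atleast_1d(f(z + e)) - np.atleast_1d(f(z - e))) / (2 * h)
    return J

x, y = x_of_z(z), y_of_z(z)
lhs = jac(lambda z_: y_of_z(z_) @ A @ x_of_z(z_), z)
rhs = x @ A.T @ jac(y_of_z, z) + y @ A @ jac(x_of_z, z)
assert np.allclose(lhs, rhs, atol=1e-5)
```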
Proposition 13 Let the scalar $\alpha$ be defined by the quadratic form
$$
\alpha=\mathbf{x}^{\top} \mathbf{A x}
$$
where $\mathbf{x}$ is $n \times 1$, $\mathbf{A}$ is $n \times n$, and $\mathbf{x}$ is a function of the vector $\mathbf{z}$, while $\mathbf{A}$ does not depend on $\mathbf{z}$. Then
$$
\frac{\partial \alpha}{\partial \mathbf{z}}=\mathbf{x}^{\top}\left(\mathbf{A}+\mathbf{A}^{\top}\right) \frac{\partial \mathbf{x}}{\partial \mathbf{z}}
$$
Proof: This is an obvious application of Proposition 12. q.e.d.
Proposition 14 For the special case where $\mathbf{A}$ is a symmetric matrix and
$$
\alpha=\mathbf{x}^{\top} \mathbf{A} \mathbf{x}
$$
where $\mathbf{x}$ is $n \times 1$, $\mathbf{A}$ is $n \times n$, and $\mathbf{x}$ is a function of the vector $\mathbf{z}$, while $\mathbf{A}$ does not depend on $\mathbf{z}$. Then
$$
\frac{\partial \alpha}{\partial \mathbf{z}}=2 \mathbf{x}^{\top} \mathbf{A} \frac{\partial \mathbf{x}}{\partial \mathbf{z}}
$$
Proof: This is an obvious application of Proposition 13. q.e.d.
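A combined check of Propositions 13 and 14 (illustration only; the map $\mathbf{x}(\mathbf{z})$ and the random test matrix are assumptions).

```python
# Check of Propositions 13 and 14: d(x^T A x)/dz = x^T (A + A^T) dx/dz,
# which becomes 2 x^T A dx/dz when A is symmetric.
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((3, 3))
S = A + A.T                                   # symmetric case for Proposition 14
z = rng.standard_normal(2)

def x_of_z(z):                                # an assumed smooth map z -> x
    return np.array([z[0] ** 2, z[0] * z[1], np.sin(z[1])])

def jac(f, z, h=1e-6):
    f0 = np.atleast_1d(f(z))
    J = np.zeros((f0.size, z.size))
    for j in range(z.size):
        e = np.zeros_like(z)
        e[j] = h
        J[:, j] = (np.atleast_1d(f(z + e)) - np.atleast_1d(f(z - e))) / (2 * h)
    return J

x, Jx = x_of_z(z), jac(x_of_z, z)
assert np.allclose(jac(lambda z_: x_of_z(z_) @ A @ x_of_z(z_), z),
                   x @ (A + A.T) @ Jx, atol=1e-5)
assert np.allclose(jac(lambda z_: x_of_z(z_) @ S @ x_of_z(z_), z),
                   2 * x @ S @ Jx, atol=1e-5)
```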
Definition 5 Let $\mathbf{A}$ be an $m \times n$ matrix whose elements are functions of the scalar parameter $\alpha$. Then the derivative of the matrix $\mathbf{A}$ with respect to the scalar parameter $\alpha$ is the $m \times n$ matrix of element-by-element derivatives:
$$
\frac{\partial \mathbf{A}}{\partial \alpha}=\left[\begin{array}{cccc}
\frac{\partial a_{11}}{\partial \alpha} & \frac{\partial a_{12}}{\partial \alpha} & \ldots & \frac{\partial a_{1 n}}{\partial \alpha} \\
\frac{\partial a_{21}}{\partial \alpha} & \frac{\partial a_{22}}{\partial \alpha} & \ldots & \frac{\partial a_{2 n}}{\partial \alpha} \\
\vdots & \vdots & & \vdots \\
\frac{\partial a_{m 1}}{\partial \alpha} & \frac{\partial a_{m 2}}{\partial \alpha} & \ldots & \frac{\partial a_{m n}}{\partial \alpha}
\end{array}\right]
$$
Proposition 15 Let $\mathbf{A}$ be a nonsingular, $m \times m$ matrix whose elements are functions of the scalar parameter $\alpha$. Then
$$
\frac{\partial \mathbf{A}^{-1}}{\partial \alpha}=-\mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \alpha} \mathbf{A}^{-1}
$$
Proof: Start with the definition of the inverse
$$
\mathbf{A}^{-1} \mathbf{A}=\mathbf{I}
$$
and differentiate, yielding
$$
\mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \alpha}+\frac{\partial \mathbf{A}^{-1}}{\partial \alpha} \mathbf{A}=\mathbf{0}
$$
rearranging the terms yields
$$
\frac{\partial \mathbf{A}^{-1}}{\partial \alpha}=-\mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \alpha} \mathbf{A}^{-1}
$$
q.e.d.
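Finally, a numerical sketch of my own for Proposition 15; the matrix family $\mathbf{A}(\alpha)$ below is an arbitrary smooth, nonsingular example assumed for the test.

```python
# Check of Proposition 15: d(A^-1)/d(alpha) = -A^-1 (dA/d(alpha)) A^-1.
import numpy as np
from numpy.linalg import inv

def A_of(alpha):                               # an assumed smooth matrix family
    return np.array([[2.0 + alpha,  np.sin(alpha)],
                     [alpha ** 2,   3.0 + np.cos(alpha)]])

def dA_of(alpha):                              # its exact element-by-element derivative
    return np.array([[1.0,          np.cos(alpha)],
                     [2.0 * alpha, -np.sin(alpha)]])

alpha, h = 0.7, 1e-6
lhs = (inv(A_of(alpha + h)) - inv(A_of(alpha - h))) / (2 * h)   # numerical d(A^-1)/d(alpha)
rhs = -inv(A_of(alpha)) @ dA_of(alpha) @ inv(A_of(alpha))
assert np.allclose(lhs, rhs, atol=1e-5)
```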
\section{References}
- Dhrymes, Phoebus J., 1978, Mathematics for Econometrics, Springer-Verlag, New York, 136 pp.
- Golub, Gene H., and Charles F. Van Loan, 1983, Matrix Computations, Johns Hopkins University Press, Baltimore, Maryland, 476 pp.
- Graybill, Franklin A., 1983, Matrices with Applications in Statistics, 2nd Edition, Wadsworth International Group, Belmont, California, 461 pp.