---
title: Independent Component Analysis
tags: ICA
---

# Independent Component Analysis

We are familiar with the problem that our high-dimensional data sources (images, video, fMRI, etc.) are redundant. Actually, we don't care about the values of individual pixels/voxels; we need some compact representation. Do you know state-space models? Well, something like that would be great. Of course, we already have dimensionality-reduction methods, like Principal Component Analysis (PCA) or Factor Analysis (FA). Why do we need another one then? Did someone just try to get a PhD by tweaking things a bit to get something published? Fortunately, Independent Component Analysis is much more than that.

## Notation

- $s$ - signal sources/latents
- $x$ - signal mixtures/observations
- $y$ - reconstructed signal sources
- $g$ - prescribed CDF function
- $Y$ - $g(y)$
- $A$ - mixing matrix
- $W$ - unmixing matrix
- $p_s$ - signal PDF model / prior PDF

## Blind source separation

The main motivation for ICA is usually formulated as the problem of blind source separation - also known as the cocktail party problem. Oh, I didn't mention that we will have party tiiiiiiiiiime! (In a corona-conform way, of course.) Imagine that you are attending a cocktail party (in your imagination, or in reality if you happen to be vaccinated) and you want to get the most recent gossip there, but your old friend also decides to unload his recent experience with correlation analysis. You don't care about that, but you don't want to hurt your friend's feelings. Still, the gossip needs to be heard. You never know...

This problem can be described mathematically as separating the mixture signal $x$ into its components, gossip $g$ and math stuff $m$, i.e.:

$$
x = A s,
$$

where $A$ is called the mixing matrix - I will denote its inverse by $W := A^{-1}$ - and $s=(g,m)=(s_1,s_2)$. $x$ can contain multiple mixtures of the signals; if your signals are EEG measurements, for example, you can have multiple sensors. But we don't have access to $s$, only to $x$! Otherwise, the problem would be solved. So the problem we need to solve is:

$$
s = Wx.
$$

The goal in this setting is to somehow recover $s$ with a close enough estimate. You can formulate this problem as dimensionality reduction; namely, you want to extract the non-redundant components. Thus, PCA or FA could be used. We'll soon see why ICA is better.
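Just to make the mixing model concrete, here is a minimal NumPy sketch; the two toy sources, the mixing matrix `A`, and the sample count are made-up illustrations, not values from any real dataset.

```python
# A toy instance of the mixing model x = A s.
import numpy as np

rng = np.random.default_rng(0)

N = 1000                                        # number of samples
s = np.vstack([
    np.sign(np.sin(np.linspace(0, 40, N))),     # "gossip": a square-ish wave
    rng.laplace(size=N),                        # "math stuff": a Laplacian source
])                                              # s has shape (2, N)

A = np.array([[1.0, 0.5],                       # hypothetical mixing matrix
              [0.3, 1.0]])
x = A @ s                                       # what the "microphones" record

# If we knew A, unmixing would be trivial:
W = np.linalg.inv(A)
s_hat = W @ x
print(np.allclose(s_hat, s))                    # True - but in practice A is unknown
```

The whole difficulty of blind source separation is that only `x` is observed: both `A` and `s` in the sketch above are exactly the quantities we are not given.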
## A comparison with PCA and FA

I only give a one-liner as an introduction to PCA and FA, as it is their assumptions that matter most for differentiating ICA from them.

### PCA

PCA basically solves the $s = Wx$ equation while maximizing the variance in the _(centered)_ data $x-\mathbb{E}(x)$ by selecting, out of the $n$-dimensional orthonormal eigenbasis of the covariance matrix, the $m$ eigenvectors with the largest eigenvalues. Since the principal axes are orthogonal to each other, decomposing any signal in that basis yields components without _(linear)_ correlation - the basis is linearly independent, you see. Moreover, PCA assumes Gaussian latents, i.e. $p_s = \mathcal{N}(0, I)$. In this case, the ML estimate (cf. it [below](/tFr-eBO5R7WLtEc1PHRDKw#Maximum-Likelihood)) is invariant to any rotation $R$: if the reconstructed signal $Wx$ is Gaussian, then $RWx$ will be too. Moreover, as $\det R=1$, the change of variables formula is also invariant to rotations.

### FA

Factor Analysis goes a bit further by also incorporating the noise characteristics of the measurements; thus, the model here is $x = A s + \epsilon$.

### Different assumptions

The most important thing is to list the main assumptions of these methods, as this determines their applicability in a given case. ICA strives to extract **statistically independent** factors, whereas both PCA and FA get us **uncorrelated factors**. We will reason about the distributions of these factors, so you can think of them as random variables. The definition of independence is:

$$
p\left(x,y\right) = p\left(x\right)p\left(y\right).
$$

Moreover, in the ICA case, the following will also hold (as a consequence of independence):

$$
p\left(x^k,y^p\right) = p\left(x^k\right)p\left(y^p\right) \qquad \forall k,p \in \mathbb{Z}^+,
$$

which means many more constraints than in the uncorrelated case, where the correlation coefficient $\rho$ is zero (and that's it):

$$
\rho = \dfrac{\mathbb{E}(XY)-\mathbb{E}(X)\mathbb{E}(Y)}{\sigma_X\sigma_Y},
$$

which corresponds to the special case of the above with $k=p=1$. That is, independence implies uncorrelatedness, but uncorrelatedness does not imply independence. And anyway, zero correlation means zero **linear** correlation - for higher values of $k$ and $p$ it need not hold (the small numerical sketch after this section illustrates this). In the literature, it is also common to state the factorization for independent distributions in terms of functions of the variables - i.e., for $p( f(x), g(y))$. But as you can take a Taylor series approximation, the powers $x^k$ and $y^p$ are fine too. This means that ICA restricts the "search space" of the factors much more significantly, as they need to satisfy more constraints.

Although this is not the only difference, let's focus now on the following question: is there any case when the assumptions of PCA/FA are sufficient? It turns out that there is such a case, namely, when you have Gaussian random variables. As moments up to (and including) second order fully describe a Gaussian, you are all set with assuming uncorrelatedness - in this setting, it implies independence as well.
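Here is a small NumPy sketch of the claim above that uncorrelatedness is much weaker than independence; the variables `u` and `v` are hypothetical, chosen only for illustration.

```python
# Zero linear correlation does not imply independence:
# take a symmetric variable u and a deterministic function of it, v = u^2.
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(-1, 1, size=100_000)   # symmetric around 0
v = u**2                               # completely determined by u

# The linear correlation is (numerically) zero ...
print(np.corrcoef(u, v)[0, 1])         # ~ 0

# ... but a higher-order moment condition (k=2, p=1) is clearly violated,
# so u and v are dependent.
print(np.mean(u**2 * v) - np.mean(u**2) * np.mean(v))   # ~ 0.09, not 0
```

A method that only checks correlations cannot tell `u` and `v` apart from truly independent factors, whereas ICA's independence requirement rules them out.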
## Linear ICA

Okay, we need to ensure that we have _independent_ components, but how do we do that? The idea is simple, but turning it into a practical algorithm is a bit burdensome.

**Note:** I will provide two derivations. ML is easier, but Infomax is more eye-widening (at least for me), so I will begin with Infomax.

### ICA assumptions

But first, I should say something... Hmmm, ICA is _different_, you know? You are probably used to assuming that everything is Gaussian. Well, ICA assumes that the components are **independent** aaand **non-Gaussian**. Moreover, ICA assumes that mixing the independent source signals has the following effects:

1. The mixtures (observations) will not be independent, as they contain a combination of the independent sources.
2. By the Central Limit Theorem, the mixture of the non-Gaussian sources will be closer to a Gaussian distribution than the latents themselves.
3. The mixture of latents/source signals has a more complicated structure than any of its components.

Regarding point 3, you can think about Fourier series, where you have simple periodic components, but as a result, you can get very complicated signals. One more remark: ICA extracts all components in a parallel manner.

### Shortcomings

You might be wondering what the problem with Gaussians is. The answer is that, among distributions with independent components, the Gaussian is the only one that is rotationally symmetric. This introduces another degree of freedom in the process of determining the source axes (i.e., when you calculate $W$) - an ambiguity we cannot handle.

You can relax this condition a bit though: if **at most one** source is Gaussian, then ICA still works. Nonetheless, there are two shortcomings:

1. The results are only determined up to a permutation $\pi$, so what you get is not $(s_1,s_2)$ but $(s_{\pi(1)},s_{\pi(2)})$. In this aspect, PCA is better, as it has a clear ordering between the components.
2. The variances of the signals cannot be determined in an absolute manner. Namely, $x = As = (A/\lambda)(\lambda s)$ for every $\lambda \neq 0$, so $\lambda W$ recovers $\lambda s$ just as well as $W$ recovers $s$.

### Infomax

In the Infomax approach, the goal is to maximize the mutual information between $x$ and a signal $Y=g(y),$ where $y$ denotes the recovered signals - i.e., $y=Wx$. Here $g$ is a CDF (cumulative distribution function), which is invertible - this acts as a model, as you need to specify it in advance based on prior knowledge. This means that you want a mapping of the extracted signals $y$ that cannot tell you anything about the signal mixtures $x$. This is what we want, since the source signals $s$ (for which $y$ is hopefully a good estimate) carry no information about the mixtures - the matrix $A$ holds all of that information.

But why do we use $g$ at all? To get independent signals, we want to have maximal entropy, i.e., a multivariate uniform distribution as the joint PDF (probability density function). The following turns out to be true for maximum entropy settings:

1. The uniform distribution has maximal entropy.
2. If the joint is uniform, then the components (the random variables) are independent.
3. If you transform a uniform distribution component-wise with invertible mappings (e.g. a CDF), then the result is also mutually independent.

So now you see where we can use $g$, don't you? When we extract $y$ from the data $x$, it will have a PDF - probably not a uniform one. To turn it into a uniform one, we need to transform $y$. Its CDF comes in handy here, because if you transform a random variable with its own CDF, you get a uniform distribution. At first this didn't seem trivial to me, so I guess you might be happy with an example. Imagine you have $z \sim \mathcal{N}(0,1)$. Most probably you get a value near 0 if you sample from this distribution. But if you transform these samples with the CDF, you get a uniform distribution. This is because the CDF is the steepest where the PDF has its maximum (which is trivial, as $\mathrm{CDF} = \int \mathrm{PDF}$). Thus, the more samples lie near the maximum, the further apart they are spread. Where the PDF has a low value, the CDF will "collect" the points in the transformed domain. In the end, this equalizes the distribution into a uniform one - see the small sketch below.
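Here is a quick sketch of that last claim (the probability integral transform); the sample size and bin count are arbitrary, and `scipy.stats.norm` is used only to get the Gaussian CDF.

```python
# Pushing Gaussian samples through their own CDF flattens them into a uniform.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)       # z ~ N(0, 1): samples concentrate around 0
u = norm.cdf(z)                        # transform each sample with its own CDF

# The histogram of u is (approximately) flat on [0, 1]:
counts, _ = np.histogram(u, bins=10, range=(0.0, 1.0))
print(counts)                          # each bin holds roughly 10,000 samples
```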
There is one more thing we need for Infomax ICA: the change of variables rule for probability distributions. The intuition here is that the _probability contained in the differential area_ must be invariant to the transformation, i.e.

$$
\begin{align}
p_Y(Y) \left|\dfrac{\partial Y}{\partial y}\right| &= p_y(y).
\end{align}
$$

Here $\left|\dfrac{\partial Y}{\partial y}\right|=g'(y)=p_s(y),$ where $p_s(y)$ is the prescribed model (prior) distribution for the signals $s$. Substituting this into the expression above gives:

$$
p_Y(Y) = \dfrac{p_y(y)}{p_s(y)},
$$

which means that the only way to get a uniform distribution for $p_Y(Y)$ is to have $p_y(y)=p_s(y)$. If this equality holds, then the signal model holds as well. Moreover, $p_Y(Y)$ then has maximal entropy, and by point 3 above the components of $y = g^{-1}(Y)$ are mutually independent too.

As we want to maximize the entropy of $Y$, we can express $H(Y)$ with this fractional relationship (using the sample average over $N$ observations):

$$
H(Y) = - \dfrac{1}{N}\sum_{i=1}^{N}\log \dfrac{p_y(y_i)}{p_s(y_i)}.
$$

Now we apply the change of variables rule to $p_y(y)$ to get $W$ into the expression:

$$
p_y(y) \left|\dfrac{\partial y}{\partial x}\right| = p_x(x).
$$

With $y=Wx,$ the Jacobian equals $|W|,$ thus $p_y(y) |W|= p_x(x),$ which, in terms of the entropy, gives:

$$
\begin{align}
H(Y) &= - \dfrac{1}{N}\sum_{i=1}^{N}\log \dfrac{p_x(x_i)}{p_s(y_i)|W|} \\
&= -\dfrac{1}{N}\sum_{i=1}^{N}\log p_x(x_i) + \dfrac{1}{N}\sum_{i=1}^{N}{\log p_s(y_i)} + \log|W| \\
&= H(X) + \dfrac{1}{N}\sum_{i=1}^{N}{\log p_s(y_i)} + \log|W|.
\end{align}
$$

As $H(X)$ doesn't depend on $W,$ it can be neglected during optimization, so the objective $H_{\mathrm{infomax}}(Y)$ is:

$$
H_{\mathrm{infomax}}(Y) = \dfrac{1}{N}\sum_{i=1}^{N}{\log p_s(y_i)} + \log|W|.
$$

To optimize this expression, we need a signal model $p_s$ in advance.

Okay, we got the objective to optimize, but what is the relationship to mutual information? The statement is that by getting $p_y(y)=p_s(y)$, the mutual information $I(X;Y)$ will be maximized. Let's recall its definition:

$$
\begin{align}
I(X;Y) = H(X) - H(X|Y) \geq 0.
\end{align}
$$

As the signal mixtures $x$ are given, $H(X)$ is fixed; thus, maximizing the mutual information is the same as _minimizing_ the conditional entropy $H(X|Y)$. Given $x$, the value $Y = g(Wx)$ is fully determined by $g$ and $W$. If $H(X|Y)$ is minimized (i.e., it equals zero), this means that getting to know $x$ provides no additional information once $Y$ is known. Namely, $x$ carries no new information regarding either the latents or the mixing process; everything can be described with $Y$ and $W$. This is what we want! For if $Y$ is known, then $y=g^{-1}(Y)$ is known too. Thus, we have the signal components - knowing the mixtures adds no further information. Imagine that $H(X|Y)\neq 0$ - this would mean that $x$ **contains** information about the mixing process and/or the latents that our model (with $Y$ and $W$) does not capture; thus, our model would not be sufficient.

### Maximum Likelihood

If you fancy, you can do this the ML way as well. In this case, we end up at the same objective (surpriiiiise), but the route is quite different. Our a priori information is distilled into the signal model $p_s,$ and the goal is to determine the unmixing matrix $W$ under which the data has the highest likelihood. For writing down the likelihood of the data as a function of $W$, the change of variables rule comes in handy again:

$$
\begin{align}
L(W) &= p_x(x) = p_s(y) \left|\dfrac{\partial y}{\partial x}\right| \\
&= p_s(y)|W|,
\end{align}
$$

where $L(W)$ stands for the likelihood of $W$ given the data $x$. Taking the $\log$ and using the independence of the source signals, $p_s(y)=\prod_{i=1}^{N} p_s(y_i),$ gives:

$$
l(W) = \sum_{i=1}^{N}\log p_s(y_i) +N\log|W|,
$$

from which it can be seen that $l(W)/N = H_{\mathrm{infomax}}(Y)$.
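To make the objective concrete, here is a minimal gradient-ascent sketch of $H_{\mathrm{infomax}}$ under a logistic prior $p_s$ (so $g$ is the logistic sigmoid and $\partial \log p_s(y)/\partial y = 1-2g(y) = -\tanh(y/2)$). The sources, the mixing matrix, the learning rate, and the iteration count are made-up illustrations; practical implementations (e.g. the natural-gradient Infomax rule or FastICA) are considerably more refined.

```python
# Gradient ascent on H_infomax(Y) = (1/N) * sum_i log p_s(y_i) + log|det W|,
# assuming a logistic prior p_s. Illustrative values only.
import numpy as np

rng = np.random.default_rng(0)

# Two made-up, independent, super-Gaussian (Laplacian) sources and a made-up mixing matrix.
N = 5000
s = rng.laplace(size=(2, N))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = A @ s                          # observed mixtures, shape (2, N)

W = np.eye(2)                      # initial guess for the unmixing matrix
lr = 0.01
for _ in range(5000):
    y = W @ x                                      # current source estimates
    psi = -np.tanh(y / 2.0)                        # score of the logistic prior: 1 - 2 g(y)
    grad = np.linalg.inv(W).T + (psi @ x.T) / N    # gradient of H_infomax w.r.t. W
    W += lr * grad                                 # ascent step

y = W @ x                          # recovered sources, up to permutation and scale
# Each recovered component should correlate strongly with exactly one true source.
print(np.round(np.corrcoef(np.vstack([y, s]))[:2, 2:], 2))
```

Note how the two shortcomings show up here as well: the rows of `y` may come out in swapped order and with arbitrary scale, which is exactly the permutation and variance ambiguity discussed above.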
