# Ethnicity Adjustment
## Adding missing ethnicity:
For example, for aian (American Indian and Alaska Native):
$$CV_{aian} = CV_{Unknown} \cdot \frac{census_{aian}}{census_{Unknown} + census_{aian}}$$
$$CV_{Unknown} = (1 - CV_{aian}) \cdot CV_{Unknown}$$
If several ethnicities are missing, we apply the same two steps sequentially, one ethnicity at a time.
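
The two update formulas can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function names, the dict-based representation, and the sample numbers are all hypothetical, and the second formula is applied exactly as written above.

```python
def add_missing_ethnicity(cv, census, missing, unknown="Unknown"):
    """Return a copy of `cv` with `missing` carved out of the `unknown` bucket."""
    out = dict(cv)
    # Share of the Unknown bucket attributed to the missing ethnicity,
    # proportional to the census values of the two groups.
    ratio = census[missing] / (census[unknown] + census[missing])
    out[missing] = out[unknown] * ratio
    # Second formula, applied exactly as written in the text.
    out[unknown] = (1 - out[missing]) * out[unknown]
    return out


def add_all_missing(cv, census, unknown="Unknown"):
    """Apply the update one ethnicity at a time, in census key order."""
    out = dict(cv)
    for eth in census:
        if eth not in out:
            out = add_missing_ethnicity(out, census, eth, unknown)
    return out


# Made-up numbers, for illustration only.
cv = {"white": 0.6, "Unknown": 0.4}
census = {"white": 0.6, "Unknown": 0.3, "aian": 0.1}
adjusted = add_all_missing(cv, census)
```

The sketch does not renormalize its output. If the result has to sum to $1$ before the optimization step, that would be a separate step that is not described here.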
## Optimization
Let $p^1$ be the CV distribution (the one we want to adjust) and $p^2$ the census distribution (the reference distribution).
Both $p^1$ and $p^2$ are asserted to sum to $1$.
Let $x$ be a vector of additive adjustments, with one entry $x_{eth}$ per ethnicity.
We want to minimize:
$$\sum_{eth} \left[\frac{p^1_{eth} + x_{eth} - p^2_{eth}}{\min(p^2_{eth},\, 1-p^2_{eth})}\right]^2$$
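Subject to:
$$\sum_{eth} x_{eth} = 0$$
$$\sum_{eth} |x_{eth}| \leq 2\lambda$$
The first constraint keeps the adjusted distribution summing to $1$. Because the adjustments sum to zero, the positive and negative adjustments each total half of $\sum_{eth} |x_{eth}|$, so the second constraint means that at most $\lambda$ of probability mass moves between ethnicities.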
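
To make the objective and constraints concrete, here is a small sketch that only evaluates them for a given candidate $x$; it is not a solver. The function names, the tolerance `tol`, and the sample numbers are hypothetical, and every entry of $p^2$ is assumed to lie strictly between $0$ and $1$ so the denominator is never zero.

```python
def loss(p1, p2, x):
    """Sum over ethnicities of ((p1 + x - p2) / min(p2, 1 - p2)) ** 2."""
    total = 0.0
    for eth in p2:
        # Normalization coefficient; assumes 0 < p2[eth] < 1.
        scale = min(p2[eth], 1 - p2[eth])
        total += ((p1[eth] + x[eth] - p2[eth]) / scale) ** 2
    return total


def is_feasible(x, lam, tol=1e-9):
    """Check sum(x) == 0 and sum(|x|) <= 2 * lam, up to a small tolerance."""
    values = list(x.values())
    sums_to_zero = abs(sum(values)) <= tol
    within_budget = sum(abs(v) for v in values) <= 2 * lam + tol
    return sums_to_zero and within_budget


# Made-up numbers, for illustration only.
p1 = {"a": 0.7, "b": 0.2, "c": 0.1}  # CV distribution
p2 = {"a": 0.6, "b": 0.3, "c": 0.1}  # census distribution
no_move = {"a": 0.0, "b": 0.0, "c": 0.0}
half_move = {"a": -0.05, "b": 0.05, "c": 0.0}

print(loss(p1, p2, no_move), loss(p1, p2, half_move))
print(is_feasible(half_move, lam=0.05), is_feasible(half_move, lam=0.04))
```

Moving halfway toward the census lowers the loss, and it needs a budget of at least $\lambda = 0.05$, since $\sum |x_{eth}| = 0.1$.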
## Stopping Criteria:
Average loss per ethnicity $\leq \mu$.
I found $\mu = 0.1$ to be a good threshold.
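
A possible reading of this criterion, sketched in Python: the sketch assumes that "average loss per ethnicity" means the objective value divided by the number of ethnicities, which the text does not state explicitly. The function names and sample numbers are hypothetical.

```python
def average_loss(p1, p2, x):
    """Objective value divided by the number of ethnicities (assumed reading)."""
    total = sum(
        ((p1[e] + x[e] - p2[e]) / min(p2[e], 1 - p2[e])) ** 2 for e in p2
    )
    return total / len(p2)


def should_stop(p1, p2, x, mu=0.1):
    """True once the average loss per ethnicity is at most mu."""
    return average_loss(p1, p2, x) <= mu


# Made-up numbers, for illustration only.
p1 = {"a": 0.9, "b": 0.1}
p2 = {"a": 0.5, "b": 0.5}
zero = {"a": 0.0, "b": 0.0}
moved = {"a": -0.3, "b": 0.3}

print(should_stop(p1, p2, zero), should_stop(p1, p2, moved))
```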
## Insights
The issues with the CV collection are:
* The number of CVs per title is low.
* CVs tend to come from a limited number of locations across the U.S.
* This implies a sampling bias that we need to remove while keeping the signal from the CV data.
* Using $\min(x, 1-x)$ as the normalization coefficient makes sure that dominant ethnicities aren't suppressed.
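
As a worked illustration of the last point, take a hypothetical dominant ethnicity with census share $p^2_{eth} = 0.9$ and a remaining deviation of $0.05$. The comparison against plain relative-error scaling (dividing by $p^2_{eth}$ itself) is an assumption made here for contrast; the numbers are made up.

$$\left(\frac{0.05}{\min(0.9,\, 0.1)}\right)^2 = \left(\frac{0.05}{0.1}\right)^2 = 0.25 \qquad \text{versus} \qquad \left(\frac{0.05}{0.9}\right)^2 \approx 0.0031$$

With $\min(x, 1-x)$ the same deviation costs about $81$ times more, so the optimizer cannot cheaply absorb the adjustment budget by shifting mass away from the dominant group.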
## Some results:

