# Ethnicity Adjustment

## Adding missing ethnicity:

For example, for aian:

$$CV_{aian} = CV_{Unknown} \cdot \frac{census_{aian}}{census_{Unknown} + census_{aian}}$$

$$CV_{Unknown} = (1 - CV_{aian}) \cdot CV_{Unknown}$$

If multiple ethnicities are missing, we can apply the same logic sequentially, one ethnicity at a time.

## Optimization

Let $p^1$ be the CV distribution (the one we want to adjust) and $p^2$ the census distribution (the reference distribution). Both $p^1$ and $p^2$ are asserted to sum to $1$.

Let $x$ be a vector of adjustment factors, one per ethnicity. We want to minimize:

$$\sum_{eth} \left[\frac{p^1_{eth} + x_{eth} - p^2_{eth}}{\min(p^2_{eth},\, 1-p^2_{eth})}\right]^2$$

Subject to:

$$\sum_{dx \in x} dx = 0$$

$$\sum_{dx \in x} |dx| \leq 2\lambda$$

The first constraint keeps the adjusted distribution $p^1 + x$ summing to $1$. Because the adjustments sum to zero, the positive and negative adjustments carry equal mass, so the second constraint means that at most $\lambda$ of probability mass is moved between ethnicities.

## Stopping Criteria:

Average loss per ethnicity $\leq \mu$.

I found $\mu = 0.1$ to be a good threshold.

## Insights

The issues with the CV collection are:

* The number of CVs per title is low
* CVs tend to come from a limited number of locations across the U.S.

This implies a sampling bias that we need to remove while keeping the signal from the CV data.

The normalization coefficient $\min(x, 1-x)$ is used to make sure that dominant ethnicities aren't suppressed.

## Some results:

![](https://i.imgur.com/xie0pU6.png)

![](https://i.imgur.com/2C9P3aL.png)
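To make the Optimization and Stopping Criteria sections concrete, here is a minimal Python sketch that evaluates the objective, checks the two constraints, and applies the $\mu$ threshold for a candidate adjustment vector $x$. It does not solve the optimization itself. The function names, the ethnicity labels, and all the numbers are hypothetical and chosen only for illustration; the sketch also assumes every census share is strictly between 0 and 1, so the normalization term is never zero.

```python
def loss_terms(p_cv, p_census, x):
    """Per-ethnicity objective terms: ((p1 + x - p2) / min(p2, 1 - p2)) ** 2."""
    terms = {}
    for eth, p2 in p_census.items():
        norm = min(p2, 1.0 - p2)  # assumed > 0, i.e. 0 < p2 < 1
        terms[eth] = ((p_cv[eth] + x[eth] - p2) / norm) ** 2
    return terms


def is_feasible(x, lam, tol=1e-9):
    """Check sum(x) == 0 and sum(|x|) <= 2 * lambda, up to a float tolerance."""
    total = sum(x.values())
    l1_norm = sum(abs(dx) for dx in x.values())
    return abs(total) <= tol and l1_norm <= 2.0 * lam + tol


def should_stop(p_cv, p_census, x, mu=0.1):
    """Stopping criterion: average loss per ethnicity <= mu."""
    terms = loss_terms(p_cv, p_census, x)
    return sum(terms.values()) / len(terms) <= mu


# Hypothetical distributions, for illustration only (each sums to 1).
p_cv = {"white": 0.70, "black": 0.10, "asian": 0.15, "hispanic": 0.05}
p_census = {"white": 0.60, "black": 0.13, "asian": 0.07, "hispanic": 0.20}

# A hand-picked candidate adjustment (not an optimizer output).
x = {"white": -0.05, "black": 0.0, "asian": -0.05, "hispanic": 0.10}
no_adjustment = {eth: 0.0 for eth in p_cv}

print("feasible for lambda=0.1:", is_feasible(x, lam=0.1))
print("stop without adjustment:", should_stop(p_cv, p_census, no_adjustment))
print("stop with adjustment:", should_stop(p_cv, p_census, x))
```

In this made-up example the candidate $x$ moves 0.1 of probability mass, so it satisfies the constraints for $\lambda = 0.1$ but not for a smaller budget such as $\lambda = 0.05$.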