---
tags: little_pink
---
Other Classification Options for Comparison
===
Current classification pipeline for a user, $u$:
$$
\begin{align}
X_u^t &= \{x_1^t, \dots, x_n^t\} \\
\hat{Y}_u^t &= \{\hat{\mathbf{y}}_1^t, \dots, \hat{\mathbf{y}}_n^t\} \\
X_u^a &= \phi_{\text{stat\_feats}}(\hat{Y}_u^t) \\
\hat{\mathbf{y}}_u^a &= \phi_{\text{log\_regs}}(X_u^a)
\end{align}
$$
Where:
- $x_i^t$ is the text for tweet $i$;
- $\hat{\mathbf{y}}_i^t$ is the output label and probability from our RoBERTa-based text classifier;
- $X_u^a$ is the set of statistical features extracted from the predictions that are fed into our logistic regression classifiers;
- $\hat{\mathbf{y}}_u^a$ is a vector of two predictions, one for little pink and one for anti-CCP, currently produced by two separate logistic regression classifiers.
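The current pipeline might be sketched as follows. The exact statistical feature set and the weights here are illustrative assumptions, not the trained model:

```python
import math
from statistics import mean, pstdev

def stat_feats(probs):
    """phi_stat_feats: summarize per-tweet probabilities into a fixed-length
    account-level feature vector. The feature set here (mean, max, min, std,
    share above 0.5) is an assumption for illustration."""
    return [mean(probs), max(probs), min(probs), pstdev(probs),
            sum(p > 0.5 for p in probs) / len(probs)]

def log_reg(x, w, b):
    """One logistic-regression head: sigmoid(w . x + b)."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# Toy per-tweet probabilities from the text classifier.
probs = [0.9, 0.8, 0.2, 0.6]
x_a = stat_feats(probs)  # X_u^a

# Two separate heads (placeholder weights): little pink and anti-CCP.
y_a = [log_reg(x_a, [0.5] * 5, -1.0),   # little pink
       log_reg(x_a, [-0.5] * 5, 1.0)]   # anti-CCP
```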
Can we design a deep-learning alternative as a comparison?
$$
\begin{align}
X_u^t &= \{x_1^t, \dots, x_n^t\} \\
X_u^{t,r} &= \{ \phi_{L-1}(x_1^t), \dots, \phi_{L-1}(x_n^t)\} \\
\mathbf{x}_u^t &= [\text{mean}(X_u^{t,r});\max(X_u^{t,r});\min(X_u^{t,r});\text{std}(X_u^{t,r})] \\
\hat{\mathbf{y}}_u^a &= \phi_{\text{cls\_layers}}(\mathbf{x}_u^t)
\end{align}
$$
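The pooling step above can be sketched with toy embeddings (the dimensions and values are fabricated; $\phi_{L-1}$ would really be the classifier's penultimate layer):

```python
from statistics import mean, pstdev

# Fake penultimate-layer embeddings phi_{L-1}(x_i^t): n=3 tweets, d=4 dims.
X_r = [
    [0.1, -0.2, 0.3, 0.5],
    [0.4, 0.0, -0.1, 0.2],
    [-0.3, 0.6, 0.2, 0.1],
]

def pool(rows, fn):
    """Apply fn element-wise across tweets, i.e. once per dimension."""
    return [fn([row[j] for row in rows]) for j in range(len(rows[0]))]

# Concatenate mean/max/min/std pooling -> a 4d-dimensional account vector,
# which phi_cls_layers (e.g. a small MLP) would then map to the two labels.
x_u = pool(X_r, mean) + pool(X_r, max) + pool(X_r, min) + pool(X_r, pstdev)
```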
Alternatively, represent an account as a single document of joined tweets, using only its top tweets: e.g., the top 3 per label with the highest certainty, measured by the absolute value of these probabilities (expressing confidence one way or the other).
Use 3-5 tweets per account, padding shorter accounts, much like treating tweets as sentences. The less padding the better, so we need a document length that balances these concerns.
Document representation.
Account level features:
- Number of tweets
- Retweet ratio
- Tweeting time profile
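The account-level features above could be computed along these lines (the record layout and the hourly-histogram choice for the time profile are assumptions):

```python
from datetime import datetime

# Toy tweet records: (timestamp, is_retweet). Field layout is an assumption.
tweets = [
    (datetime(2023, 1, 1, 8), True),
    (datetime(2023, 1, 1, 9), False),
    (datetime(2023, 1, 2, 21), True),
]

n_tweets = len(tweets)
retweet_ratio = sum(rt for _, rt in tweets) / n_tweets

# Tweeting time profile: fraction of tweets in each of 24 hourly bins.
profile = [0.0] * 24
for ts, _ in tweets:
    profile[ts.hour] += 1 / n_tweets
```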