---
tags: little_pink
---
Other Classification Options for Comparison
===
Current classification pipeline for a user, $u$:
$$
\begin{align}
X_u^t &= \{x_1^t, \dots, x_n^t\} \\
\hat{Y}_u^t &= \{\hat{\mathbf{y}}_1^t, \dots, \hat{\mathbf{y}}_n^t\} \\
X_u^a &= \phi_{\text{stat\_feats}}(\hat{Y}_u^t) \\
\hat{\mathbf{y}}_u^a &= \phi_{\text{log\_regs}}(X_u^a)
\end{align}
$$
Where:
- $x_i^t$ is the text for tweet $i$;
- $\hat{\mathbf{y}}_i^t$ is the output label and probability from our RoBERTa-based text classifier;
- $X_u^a$ is the set of statistical features extracted from the predictions that are fed into our logistic regression classifiers;
- $\hat{\mathbf{y}}_u^a$ is a vector of two predictions, one for little pink and one for anti-CCP, currently produced by two separate logistic regression classifiers.
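The current pipeline might be sketched as follows. The exact statistical feature set and the weights here are illustrative assumptions, not the trained model:

```python
import math
from statistics import mean, pstdev

def stat_feats(probs):
    """phi_stat_feats: summarize per-tweet probabilities into a fixed-length
    account-level feature vector. The feature set here (mean, max, min, std,
    share above 0.5) is an assumption for illustration."""
    return [mean(probs), max(probs), min(probs), pstdev(probs),
            sum(p > 0.5 for p in probs) / len(probs)]

def log_reg(x, w, b):
    """One logistic-regression head: sigmoid(w . x + b)."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# Toy per-tweet probabilities from the text classifier.
probs = [0.9, 0.8, 0.2, 0.6]
x_a = stat_feats(probs)  # X_u^a

# Two separate heads (placeholder weights): little pink and anti-CCP.
y_a = [log_reg(x_a, [0.5] * 5, -1.0),   # little pink
       log_reg(x_a, [-0.5] * 5, 1.0)]   # anti-CCP
```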
Can we design a deep-learning alternative as a comparison?
$$
\begin{align}
X_u^t &= \{x_1^t, \dots, x_n^t\} \\
X_u^{t,r} &= \{ \phi_{L-1}(x_1^t), \dots, \phi_{L-1}(x_n^t)\} \\
\mathbf{x}_u^t &= [\text{mean}(X_u^{t,r});\max(X_u^{t,r});\min(X_u^{t,r});\text{std}(X_u^{t,r})] \\
\hat{\mathbf{y}}_u^a &= \phi_{\text{cls\_layers}}(\mathbf{x}_u^t)
\end{align}
$$
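The pooling step above can be sketched with toy embeddings (the dimensions and values are fabricated; $\phi_{L-1}$ would really be the classifier's penultimate layer):

```python
from statistics import mean, pstdev

# Fake penultimate-layer embeddings phi_{L-1}(x_i^t): n=3 tweets, d=4 dims.
X_r = [
    [0.1, -0.2, 0.3, 0.5],
    [0.4, 0.0, -0.1, 0.2],
    [-0.3, 0.6, 0.2, 0.1],
]

def pool(rows, fn):
    """Apply fn element-wise across tweets, i.e. once per dimension."""
    return [fn([row[j] for row in rows]) for j in range(len(rows[0]))]

# Concatenate mean/max/min/std pooling -> a 4d-dimensional account vector,
# which phi_cls_layers (e.g. a small MLP) would then map to the two labels.
x_u = pool(X_r, mean) + pool(X_r, max) + pool(X_r, min) + pool(X_r, pstdev)
```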
Alternatively, represent an account as a single document of joined tweets, using only its top tweets: e.g., the top 3 per label with the highest certainty, measured by the absolute value of these probabilities (expressing confidence one way or the other).
Use 3-5 tweets per account, padding shorter accounts, much like treating tweets as sentences. The less padding the better, so we need a document length that balances these concerns.
Document representation.
Account level features:
- Number of tweets
- Retweet ratio
- Tweeting time profile
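The account-level features above could be computed along these lines (the record layout and the hourly-histogram choice for the time profile are assumptions):

```python
from datetime import datetime

# Toy tweet records: (timestamp, is_retweet). Field layout is an assumption.
tweets = [
    (datetime(2023, 1, 1, 8), True),
    (datetime(2023, 1, 1, 9), False),
    (datetime(2023, 1, 2, 21), True),
]

n_tweets = len(tweets)
retweet_ratio = sum(rt for _, rt in tweets) / n_tweets

# Tweeting time profile: fraction of tweets in each of 24 hourly bins.
profile = [0.0] * 24
for ts, _ in tweets:
    profile[ts.hour] += 1 / n_tweets
```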