# Operators and Pipelines
Our goal here is to list all relevant types of operators and pipelines so that we can make decisions about the *scope* of our work.
## Operators
### Unary Map
* **Format:** `op(a: Any) -> Any`
* **Examples:**
* `convert` - performs various forms of type casting (e.g. string to integer)
* `hash` - computes a hash value from the input
* `na_indicator` - returns a `Bool` that indicates if the input is a missing value or not
* `log` - computes a logarithm of the input
* `lambda` - applies an arbitrary lambda expression to a given input value
* `stopword_remover` - takes an input list of string tokens and removes those that correspond to stop words (e.g. "the", "and", etc.); the list of stop words is specified as an additional argument
* `token_counter` - takes an input list of string tokens and returns a dictionary where each unique token is paired with the number of its occurrences in the list
* It is possible to pass a token dictionary (which can be obtained either independently or with the `token_dictionary` operator), in which case the tokens that appear in the output dictionary will be exactly the tokens from the provided token dictionary (tokens missing in the input will have count zero)
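As a sketch, two of the unary maps above might look like the following (treating `None` as the missing value and accepting an optional token dictionary are assumptions, not specified requirements):

```python
def na_indicator(a):
    # Returns True iff the input is a missing value (None here).
    return a is None

def token_counter(tokens, token_dictionary=None):
    # Count occurrences of each token. If a token dictionary is given,
    # the output keys are exactly its tokens (missing ones get count zero).
    if token_dictionary is None:
        counts = {}
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
        return counts
    return {t: tokens.count(t) for t in token_dictionary}
```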
### Binary Numerical Map
* **Format:** `op(a: Numeric, b: Numeric) -> Numeric`
* Type `Numeric` is either integer or real
* **Examples:** `add`, `sub`, `mul`, `div`
### Binary and Unary Logical Map
* **Format:** `op(a: Any, b: Any) -> Bool` or `op(a: Any) -> Bool`
* Type `Any` has a zero value, so it can be safely converted to `True`/`False`
* **Examples:** `equal`, `and`, `or`, `not`
### Multi-Value Numerical Map
* **Format:** `op(a: Tuple[Numeric]) -> Tuple[Numeric]`
* **Examples:**
* `normalize` - Scale the input vector to have unit norm.
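A minimal sketch of `normalize` using the L2 norm (the choice of norm and the pass-through behavior for zero vectors are assumptions):

```python
import math

def normalize(a):
    # Scale the input tuple to unit L2 norm; a zero vector is
    # returned unchanged to avoid division by zero.
    norm = math.sqrt(sum(x * x for x in a))
    if norm == 0:
        return a
    return tuple(x / norm for x in a)
```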
### Numerical Aggregate Reduce
* **Format:** `op(a: Column[Numeric]) -> Numeric`
* Type `Column[Type]` is a column where each cell is of type `Type`
* **Examples:** `sum`, `count`, `mean`, `std`
### Numerical Element Selector Reduce
* **Format:** `op(a: Column[Numeric]) -> Numeric`
* **Examples:** `min`, `max`
* **Note:** Although the format of this operator is the same as Numerical Aggregate Reduce, these operators differ in that their result depends only on the extreme elements of the column, so it is less sensitive to element removal or addition.
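The distinction can be seen in a small example: adding a non-extreme element changes an aggregate like `sum`, but leaves the selectors untouched.

```python
base = [1, 5, 3]
extended = base + [2]  # 2 is neither the min nor the max

# Selector reduces are unaffected by the addition...
assert min(base) == min(extended) == 1
assert max(base) == max(extended) == 5
# ...while an aggregate reduce changes with every addition.
assert sum(base) != sum(extended)
```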
### Numerical Scaler Map*
* **Format:** `op(a: Numeric, stats: Dict[str, Numeric]) -> Numeric`
* The parameter `stats` is a dictionary of parameters that represent statistics ***computed over the whole column***.
* **Examples:**
* `min_max_scaler` - requires `min` and `max` as stats (**requires reduce**)
* `mean_std_scaler` - requires `mean` and `std` as stats (**requires reduce**)
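Both scalers can be sketched as pure per-value maps that take the column-level statistics as an explicit argument (the exact `stats` key names are assumptions):

```python
def min_max_scaler(a, stats):
    # Requires stats["min"] and stats["max"], computed over the whole column.
    return (a - stats["min"]) / (stats["max"] - stats["min"])

def mean_std_scaler(a, stats):
    # Requires stats["mean"] and stats["std"], computed over the whole column.
    return (a - stats["mean"]) / stats["std"]
```

Keeping the statistics as a separate argument is what makes these maps: the reduce that produces `stats` runs once, after which each row can be scaled independently.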
### Column Summarizer Reduce
* **Format:** `op(a: Column[Any]) -> Summary`
* Type `Summary` is an object that encodes some key information obtained by summarizing the input column
* **Examples:**
* `unique_summarizer` - produces a list of all unique elements and assigns them an ordinal number
* `counter` - produces a list of all unique elements coupled with their number of occurrences
* `token_dictionary` - takes a column containing lists of word tokens and summarizes it into a dictionary of all unique tokens found in the entire column
* `token_document_frequency` - takes a column containing lists of word tokens and summarizes it into a dictionary where each unique token is paired with the number of rows in the column where that token appears (this summary is used to compute the TF-IDF map)
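Two of these summarizers might be sketched as follows (representing a `Summary` as a plain dict and using first-seen order for ordinals are assumptions):

```python
def unique_summarizer(column):
    # Assign an ordinal number to each unique element, in first-seen order.
    summary = {}
    for x in column:
        if x not in summary:
            summary[x] = len(summary)
    return summary

def token_document_frequency(column):
    # For each unique token, count the number of rows it appears in.
    df = {}
    for tokens in column:
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    return df
```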
### Encoder Map*
* **Format:** `op(a: Any, summary: Summary) -> Any`
* **Examples:**
* `ordinal_encoder` - replaces each distinct element with the unique ordinal integer obtained from `unique_summarizer` (**requires reduce**)
* `one_hot_encoder` - same as `ordinal_encoder` but instead of an integer, it returns a one-hot encoded binary vector (**requires reduce**)
* `tf_idf_encoder` - takes an input dictionary of token counts along with a summary obtained with `token_document_frequency` and outputs a vector of TF-IDF features (count of each token in an input divided by the number of rows that have that token)
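Sketches of the three encoders, assuming the dict-shaped summaries produced by `unique_summarizer` and `token_document_frequency` above (the sorted output order of the TF-IDF vector is an assumption):

```python
def ordinal_encoder(a, summary):
    # summary maps each distinct element to its ordinal integer.
    return summary[a]

def one_hot_encoder(a, summary):
    # Same as ordinal_encoder, but returns a one-hot binary vector.
    vec = [0] * len(summary)
    vec[summary[a]] = 1
    return tuple(vec)

def tf_idf_encoder(counts, doc_freq):
    # counts: token -> count in this row; doc_freq: token -> number of
    # rows containing the token. Uses the simple count/df form above.
    return tuple(counts.get(t, 0) / doc_freq[t] for t in sorted(doc_freq))
```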
### Row Filter
* **Format:** `op(a: Column[Any]) -> Column[Any, superset=a]`
* Here we parametrize the output type `Column` by specifying that the input column `a` is its ***superset***, i.e. the output contains a subset of the input rows
* **Examples:**
* `na_filter` - removes all rows that have a missing value (**~map**)
* `range_filter` - removes all rows whose value doesn't fall in the specified range (**~map**)
* `take_filter` - removes the first `N` rows from a column (**not a map**)
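The three filters might be sketched as follows (representing a column as a Python list and `None` as the missing value are assumptions):

```python
def na_filter(column):
    # Keep only rows that are not missing; decidable per row (~map).
    return [x for x in column if x is not None]

def range_filter(column, low, high):
    # Keep only rows whose value falls in [low, high]; also ~map.
    return [x for x in column if low <= x <= high]

def take_filter(column, n):
    # Remove the first n rows; depends on row position, so it
    # cannot be expressed as a per-row map.
    return column[n:]
```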
### Model Train Reduce
* **Format:** `op(features: Column[Feature], labels: Column[Label], hyperparameters: Dict[str, Any]) -> Model`
* Type `Feature` corresponds to `Tuple[Numeric]`
* Type `Label` corresponds to one of the following: `Categorical`, `Numeric`, `Tuple[Numeric]` (and maybe `Tuple[Categorical]`)
* Type `Model` is any model that, when invoked with some features, returns predicted labels
* **Example models:** linear models, tree-based models, embeddings, text sentiment models, LDA model, PCA
* Embeddings, LDA and PCA return a `Tuple[Numeric]` result upon prediction
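As a toy illustration of the interface (not one of the model families listed above), a nearest-centroid classifier can be reduced from the feature and label columns into a callable `Model`; invoking the returned callable on a single feature tuple is then the Model Predict Map:

```python
def train_nearest_centroid(features, labels, hyperparameters=None):
    # Reduce (features, labels) into a Model: a callable that predicts
    # the label of the nearest class centroid (squared L2 distance).
    by_label = {}
    for f, y in zip(features, labels):
        by_label.setdefault(y, []).append(f)
    centroids = {
        y: tuple(sum(col) / len(fs) for col in zip(*fs))
        for y, fs in by_label.items()
    }

    def model(feature):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(feature, c))
        return min(centroids, key=lambda y: dist(centroids[y]))

    return model
```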
### Model Predict Map
* **Format:** `op(features: Feature, model: Model) -> Label`
* **Note:** This includes any model-based map such as embeddings of various types, tree featurizers, sentiment predictor etc.
### Column Selector Reduce
* **Format:** `op(Column[Tuple], num_features: int, parameters: Dict[str, Any]) -> List[str]`
* **Examples:**
* `model_based_selector` - selects the features based on feature importance of a specified model that was trained on the columns
* `nondefault_count_selector` - selects the features based on the number of non-default values in each column
* `mutual_information_selector` - selects the features that have the highest amount of mutual information shared between them and a specified label column
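A sketch of `nondefault_count_selector`, assuming the output is the list of selected column indices and that the default value is passed via `parameters` (here simplified to a keyword argument):

```python
def nondefault_count_selector(rows, num_features, default=0):
    # rows: Column[Tuple]. Keep the num_features columns with the
    # most non-default values; return their indices in schema order.
    width = len(rows[0])
    counts = [sum(1 for r in rows if r[i] != default) for i in range(width)]
    ranked = sorted(range(width), key=lambda i: counts[i], reverse=True)
    return sorted(ranked[:num_features])
```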
### Schema Transformer Map
* **Format:** `op(a: Tuple, columns: List[str]) -> Tuple`
* **Examples:**
* `project` - selects a specific subset of columns from the input tuple (deleting specific columns is a version of this operator as well)
* `duplicate` - produces multiple copies of a specified column
* `concat` - takes the specified set of columns and replaces them with a single column of type vector
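Sketches of `project` and `concat`, assuming the schema is passed as a parallel list of column names and that the concatenated vector is appended at the end of the tuple:

```python
def project(row, columns, keep):
    # Keep only the named columns, preserving schema order.
    return tuple(v for v, c in zip(row, columns) if c in keep)

def concat(row, columns, group):
    # Replace the columns in `group` with a single vector-valued
    # column, appended after the remaining columns.
    rest = tuple(v for v, c in zip(row, columns) if c not in group)
    vec = tuple(v for v, c in zip(row, columns) if c in group)
    return rest + (vec,)
```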
### Data Augmentation Fork
* **Format:** `op(a: Tuple) -> List[Tuple]`
* **Examples:** any data augmentation operation, whether image-based or text-based
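To illustrate the fork shape with a deliberately trivial text augmentation (the casing-variant operator is a made-up example, not one from a real augmentation library):

```python
def text_case_fork(row, text_index=0):
    # Fork one row into several augmented rows by varying the
    # casing of one string column; duplicates are removed.
    text = row[text_index]
    variants = {text, text.lower(), text.upper()}
    out = []
    for v in sorted(variants):
        new_row = list(row)
        new_row[text_index] = v
        out.append(tuple(new_row))
    return out
```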