# Operators and Pipelines
Our goal here is to list all relevant types of operators and pipelines so that we can make decisions about the *scope* of our work.
## Operators
### Unary Map
* **Format:** `op(a: Any) -> Any`
* **Examples:**
* `convert` - performs various forms of type casting (e.g. string to integer)
* `hash` - computes a hash value from the input
* `na_indicator` - returns a `Bool` that indicates if the input is a missing value or not
* `log` - computes a logarithm of the input
* `lambda` - applies an arbitrary lambda expression to a given input value
* `stopword_remover` - takes an input list of string tokens and removes those that correspond to stop words (e.g. "the", "and", etc.); the list of stop words is specified as an additional argument
* `token_counter` - takes an input list of string tokens and returns a dictionary where each unique token is paired with the number of its occurrences in the list
* It is possible to pass a token dictionary (which can be obtained either independently or with the `token_dictionary` operator), in which case the tokens that appear in the output dictionary will be exactly the tokens from the provided token dictionary (tokens missing in the input will have count zero)
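As a sketch, two of the unary maps above might look like the following (treating `None` as the missing value and accepting an optional token dictionary are assumptions, not specified requirements):

```python
def na_indicator(a):
    # Returns True iff the input is a missing value (None here).
    return a is None

def token_counter(tokens, token_dictionary=None):
    # Count occurrences of each token. If a token dictionary is given,
    # the output keys are exactly its tokens (missing ones get count zero).
    if token_dictionary is None:
        counts = {}
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
        return counts
    return {t: tokens.count(t) for t in token_dictionary}
```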
### Binary Numerical Map
* **Format:** `op(a: Numeric, b: Numeric) -> Numeric`
* Type `Numeric` is either integer or real
* **Examples:** `add`, `sub`, `mul`, `div`
### Binary and Unary Logical Map
* **Format:** `op(a: Any, b: Any) -> Bool` or `op(a: Any) -> Bool`
* Type `Any` has a zero value, so it can be safely converted to `True`/`False`
* **Examples:** `equal`, `and`, `or`, `not`
### Multi-Value Numerical Map
* **Format:** `op(a: Tuple[Numeric]) -> Tuple[Numeric]`
* **Examples:**
* `normalize` - Scale the input vector to have unit norm.
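A minimal sketch of `normalize` using the L2 norm (the choice of norm and the pass-through behavior for zero vectors are assumptions):

```python
import math

def normalize(a):
    # Scale the input tuple to unit L2 norm; a zero vector is
    # returned unchanged to avoid division by zero.
    norm = math.sqrt(sum(x * x for x in a))
    if norm == 0:
        return a
    return tuple(x / norm for x in a)
```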
### Numerical Aggregate Reduce
* **Format:** `op(a: Column[Numeric]) -> Numeric`
* Type `Column[Type]` is a column where each cell is of type `Type`
* **Examples:** `sum`, `count`, `mean`, `std`
### Numerical Element Selector Reduce
* **Format:** `op(a: Column[Numeric]) -> Numeric`
* **Examples:** `min`, `max`
* **Note:** Although the format of this operator is the same as Numerical Aggregate Reduce, these operators differ in that their result depends only on the extreme elements of the column, so it is less sensitive to element removal or addition.
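The distinction can be seen in a small example: adding a non-extreme element changes an aggregate like `sum`, but leaves the selectors untouched.

```python
base = [1, 5, 3]
extended = base + [2]  # 2 is neither the min nor the max

# Selector reduces are unaffected by the addition...
assert min(base) == min(extended) == 1
assert max(base) == max(extended) == 5
# ...while an aggregate reduce changes with every addition.
assert sum(base) != sum(extended)
```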
### Numerical Scaler Map*
* **Format:** `op(a: Numeric, stats: Dict[str, Numeric]) -> Numeric`
* The parameter `stats` is a dictionary of parameters that represent statistics ***computed over the whole column***.
* **Examples:**
* `min_max_scaler` - requires `min` and `max` as stats (**requires reduce**)
* `mean_std_scaler` - requires `mean` and `std` as stats (**requires reduce**)
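Both scalers can be sketched as pure per-value maps that take the column-level statistics as an explicit argument (the exact `stats` key names are assumptions):

```python
def min_max_scaler(a, stats):
    # Requires stats["min"] and stats["max"], computed over the whole column.
    return (a - stats["min"]) / (stats["max"] - stats["min"])

def mean_std_scaler(a, stats):
    # Requires stats["mean"] and stats["std"], computed over the whole column.
    return (a - stats["mean"]) / stats["std"]
```

Keeping the statistics as a separate argument is what makes these maps: the reduce that produces `stats` runs once, after which each row can be scaled independently.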
### Column Summarizer Reduce
* **Format:** `op(a: Column[Any]) -> Summary`
* Type `Summary` is an object that encodes some key information obtained by summarizing the input column
* **Examples:**
* `unique_summarizer` - produces a list of all unique elements and assigns them an ordinal number
* `counter` - produces a list of all unique elements coupled with their number of occurrences
* `token_dictionary` - takes a column containing lists of word tokens and summarizes it into a dictionary of all unique tokens found in the entire column
* `token_document_frequency` - takes a column containing lists of word tokens and summarizes it into a dictionary where each unique token is paired with the number of rows in the column where that token appears (this summary is used to compute the TF-IDF map)
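Two of these summarizers might be sketched as follows (representing a `Summary` as a plain dict and using first-seen order for ordinals are assumptions):

```python
def unique_summarizer(column):
    # Assign an ordinal number to each unique element, in first-seen order.
    summary = {}
    for x in column:
        if x not in summary:
            summary[x] = len(summary)
    return summary

def token_document_frequency(column):
    # For each unique token, count the number of rows it appears in.
    df = {}
    for tokens in column:
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    return df
```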
### Encoder Map*
* **Format:** `op(a: Any, summary: Summary) -> Any`
* **Examples:**
* `ordinal_encoder` - replaces each distinct element with the unique ordinal integer obtained from `unique_summarizer` (**requires reduce**)
* `one_hot_encoder` - same as `ordinal_encoder` but instead of an integer, it returns a one-hot encoded binary vector (**requires reduce**)
* `tf_idf_encoder` - takes an input dictionary of token counts along with a summary obtained with `token_document_frequency` and outputs a vector of TF-IDF features (count of each token in an input divided by the number of rows that have that token)
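Sketches of the three encoders, assuming the dict-shaped summaries produced by `unique_summarizer` and `token_document_frequency` above (the sorted output order of the TF-IDF vector is an assumption):

```python
def ordinal_encoder(a, summary):
    # summary maps each distinct element to its ordinal integer.
    return summary[a]

def one_hot_encoder(a, summary):
    # Same as ordinal_encoder, but returns a one-hot binary vector.
    vec = [0] * len(summary)
    vec[summary[a]] = 1
    return tuple(vec)

def tf_idf_encoder(counts, doc_freq):
    # counts: token -> count in this row; doc_freq: token -> number of
    # rows containing the token. Uses the simple count/df form above.
    return tuple(counts.get(t, 0) / doc_freq[t] for t in sorted(doc_freq))
```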
### Row Filter
* **Format:** `op(a: Column[Any]) -> Column[Any, superset=a]`
* Here we parametrize the output type `Column` by specifying that the input column `a` is its ***superset***, i.e. the output contains a subset of the input rows
* **Examples:**
* `na_filter` - removes all rows that have a missing value (**~map**)
* `range_filter` - removes all rows whose value doesn't fall in the specified range (**~map**)
* `take_filter` - removes the first `N` rows from a column (**not a map**)
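The three filters might be sketched as follows (representing a column as a Python list and `None` as the missing value are assumptions):

```python
def na_filter(column):
    # Keep only rows that are not missing; decidable per row (~map).
    return [x for x in column if x is not None]

def range_filter(column, low, high):
    # Keep only rows whose value falls in [low, high]; also ~map.
    return [x for x in column if low <= x <= high]

def take_filter(column, n):
    # Remove the first n rows; depends on row position, so it
    # cannot be expressed as a per-row map.
    return column[n:]
```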
### Model Train Reduce
* **Format:** `op(features: Column[Feature], labels: Column[Label], hyperparameters: Dict[str, Any]) -> Model`
* Type `Feature` corresponds to `Tuple[Numeric]`
* Type `Label` corresponds to one of the following: `Categorical`, `Numeric`, `Tuple[Numeric]` (and maybe `Tuple[Categorical]`)
* Type `Model` is any model that, when invoked with some features, returns predicted labels
* **Example models:** linear models, tree-based models, embeddings, text sentiment models, LDA model, PCA
* Embeddings, LDA and PCA return a `Tuple[Numeric]` result upon prediction
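As a toy illustration of the interface (not one of the model families listed above), a nearest-centroid classifier can be reduced from the feature and label columns into a callable `Model`; invoking the returned callable on a single feature tuple is then the Model Predict Map:

```python
def train_nearest_centroid(features, labels, hyperparameters=None):
    # Reduce (features, labels) into a Model: a callable that predicts
    # the label of the nearest class centroid (squared L2 distance).
    by_label = {}
    for f, y in zip(features, labels):
        by_label.setdefault(y, []).append(f)
    centroids = {
        y: tuple(sum(col) / len(fs) for col in zip(*fs))
        for y, fs in by_label.items()
    }

    def model(feature):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(feature, c))
        return min(centroids, key=lambda y: dist(centroids[y]))

    return model
```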
### Model Predict Map
* **Format:** `op(features: Feature, model: Model) -> Label`
* **Note:** This includes any model-based map such as embeddings of various types, tree featurizers, sentiment predictor etc.
### Column Selector Reduce
* **Format:** `op(Column[Tuple], num_features: int, parameters: Dict[str, Any]) -> List[str]`
* **Examples:**
* `model_based_selector` - selects the features based on feature importance of a specified model that was trained on the columns
* `nondefault_count_selector` - selects the features based on the number of non-default values in each column
* `mutual_information_selector` - selects the features that have the highest amount of mutual information shared between them and a specified label column
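A sketch of `nondefault_count_selector`, assuming the output is the list of selected column indices and that the default value is passed via `parameters` (here simplified to a keyword argument):

```python
def nondefault_count_selector(rows, num_features, default=0):
    # rows: Column[Tuple]. Keep the num_features columns with the
    # most non-default values; return their indices in schema order.
    width = len(rows[0])
    counts = [sum(1 for r in rows if r[i] != default) for i in range(width)]
    ranked = sorted(range(width), key=lambda i: counts[i], reverse=True)
    return sorted(ranked[:num_features])
```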
### Schema Transformer Map
* **Format:** `op(a: Tuple, columns: List[str]) -> Tuple`
* **Examples:**
* `project` - selects a specific subset of columns from the input tuple (deleting specific columns is a version of this operator as well)
* `duplicate` - produces multiple copies of a specified column
* `concat` - takes the specified set of columns and replaces them with a single column of type vector
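Sketches of `project` and `concat`, assuming the schema is passed as a parallel list of column names and that the concatenated vector is appended at the end of the tuple:

```python
def project(row, columns, keep):
    # Keep only the named columns, preserving schema order.
    return tuple(v for v, c in zip(row, columns) if c in keep)

def concat(row, columns, group):
    # Replace the columns in `group` with a single vector-valued
    # column, appended after the remaining columns.
    rest = tuple(v for v, c in zip(row, columns) if c not in group)
    vec = tuple(v for v, c in zip(row, columns) if c in group)
    return rest + (vec,)
```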
### Data Augmentation Fork
* **Format:** `op(a: Tuple) -> List[Tuple]`
* **Examples:** any data augmentation operation, whether image-based or text-based
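To illustrate the fork shape with a deliberately trivial text augmentation (the casing-variant operator is a made-up example, not one from a real augmentation library):

```python
def text_case_fork(row, text_index=0):
    # Fork one row into several augmented rows by varying the
    # casing of one string column; duplicates are removed.
    text = row[text_index]
    variants = {text, text.lower(), text.upper()}
    out = []
    for v in sorted(variants):
        new_row = list(row)
        new_row[text_index] = v
        out.append(tuple(new_row))
    return out
```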