# TF2.0 APIs
## Global (these APIs are used at the global/module level)
* tf.keras.utils.get_file(fname, origin, extract=False)
* [return] Path to the downloaded file.
    * fname: Name of the file. If an absolute path `/path/to/file.txt` is specified, the file will be saved at that location.
    * origin: Original URL of the file.
    * extract: If True, tries extracting the file as an archive, like tar or zip.
```
import tensorflow as tf

# TRAIN_DATA_URL is the URL of the CSV file to download (defined elsewhere)
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
```
* tf.data.Dataset.from_tensor_slices(tensors): creates a Dataset whose elements are slices of the given tensors, e.g. a (features, labels) tuple where features is a 2-d numpy array (or a dict of columns) and labels is a 1-d numpy array.
```
# pop the label column first; otherwise dict(train_df) would still contain it
labels = train_df.pop('survived')
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_df), labels))
```
---
## Dataset (methods called on a Dataset object)
* shuffle(buffer_size)
    * [return] shuffled dataset
    * buffer_size: the number of elements the shuffle buffer holds; for a full shuffle, use a value at least as large as the dataset.
* batch(batchSize)
* [return] batched dataset
* batchSize: the batch size
`train_dataset = train_dataset.shuffle(len(train_df)).batch(64)`
---
## tensorflow (usually imported as `import tensorflow as tf`)
* **feature_column**
* numeric_column(columnName)
* [return] A NumericColumn.
* columnName: A unique string identifying the input feature. It is used as the column name and the dictionary key for feature parsing configs, feature Tensor objects, and feature columns.
* categorical_column_with_vocabulary_list(columnName, vocabularyList)
* [return] A CategoricalColumn with in-memory vocabulary.
* columnName: A unique string identifying the input feature. It is used as the column name and the dictionary key for feature parsing configs, feature Tensor objects, and feature columns.
* vocabularyList: An ordered iterable defining the vocabulary. Each feature is mapped to the index of its value (if present) in vocabulary_list. Must be castable to dtype.
* indicator_column(categorical_column): Represents multi-hot representation of given categorical column.
* [return] An IndicatorColumn.
* categorical_column: A CategoricalColumn which is created by categorical_column_with_* or crossed_column functions.
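A minimal sketch combining the three; the column names ('age', 'sex') and the vocabulary are hypothetical placeholders for your own data:
```
import tensorflow as tf

# 'age' is a numeric feature; 'sex' is categorical with a known vocabulary
age = tf.feature_column.numeric_column('age')
sex = tf.feature_column.categorical_column_with_vocabulary_list('sex', ['male', 'female'])
sex_one_hot = tf.feature_column.indicator_column(sex)  # multi-hot encoding of 'sex'
feature_columns = [age, sex_one_hot]
```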
---
# tensorflow.keras.models
* **[Common]**
* compile(loss, optimizer, metrics)
* loss: String (name of objective function), objective function or tf.keras.losses.Loss instance. See tf.keras.losses. An objective function is any callable with the signature scalar_loss = fn(y_true, y_pred). If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses.
* optimizer: String (name of optimizer) or optimizer instance. See tf.keras.optimizers.
* metrics: List of metrics to be evaluated by the model during training and testing. Typically you will use metrics=['accuracy']. To specify different metrics for different outputs of a multi-output model, you could also pass a dictionary, such as metrics={'output_a': 'accuracy', 'output_b': ['accuracy', 'mse']}. You can also pass a list (len = len(outputs)) of lists of metrics such as metrics=[['accuracy'], ['accuracy', 'mse']] or metrics=['accuracy', ['accuracy', 'mse']].
* **Sequential()**: Linear stack of layers.
    * add(layer): adds a layer instance on top of the layer stack.
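A minimal sketch putting Sequential and compile together, assuming a binary classifier over 4 input features (the layer sizes are placeholders):
```
import tensorflow as tf

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(16, activation='relu', input_shape=(4,)))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```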
---
# tensorflow.keras.layers
* **Dense(units, activation, input_shape)**: A regular densely-connected NN layer.
* units: Positive integer, dimensionality of the output space.
* activation: Activation function to use.
* linear (default)
* relu
* elu
* tanh
* sigmoid
        * softmax
    * input_shape: shape tuple of the input, needed only when this is the first layer of the model.
```
import tensorflow as tf
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D

# base_model is assumed to be a pretrained convolutional base (e.g. from tf.keras.applications)
model = tf.keras.Sequential()
model.add(base_model)
model.add(GlobalAveragePooling2D())
model.add(Dense(1024, activation='relu'))
model.add(Dense(5, activation='softmax'))  # 5-class classification head
```
* **DenseFeatures(feature_columns)**: A layer that produces a dense Tensor based on given feature_columns.
* feature_columns: An iterable containing the FeatureColumns to use as inputs to your model.
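Continuing the hypothetical feature_columns list built in the tensorflow section above:
```
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
model = tf.keras.Sequential([
    feature_layer,
    tf.keras.layers.Dense(1, activation='sigmoid')
])
```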
* **Embedding(input_dim, output_dim)**: Turns positive integers (indexes) into dense vectors of fixed size.
* input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
    * output_dim: int >= 1. Dimension of the dense embedding, i.e. the size of the vector each index is mapped to (e.g. 32 or 100); try different values for your problem.
* **GRU(units, activation='tanh', dropout)**: Gated Recurrent Unit, a gated RNN layer similar to LSTM but with fewer parameters.
* units: Positive integer, dimensionality of the output space.
* activation: Activation function to use. Default: hyperbolic tangent (tanh). If you pass None, no activation is applied (ie. "linear" activation: a(x) = x).
* dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
* **Bidirectional(layer)**: Bidirectional wrapper for RNNs.
* layer: the RNN layer to be wrapped.
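A minimal text-model sketch tying these layers together (the vocabulary size and dimensions are placeholder values):
```
import tensorflow as tf
from tensorflow.keras.layers import Embedding, GRU, Bidirectional, Dense

model = tf.keras.Sequential()
model.add(Embedding(input_dim=10000, output_dim=64))  # vocabulary of 10000 tokens
model.add(Bidirectional(GRU(64, dropout=0.2)))        # bidirectional GRU over the sequence
model.add(Dense(1, activation='sigmoid'))             # e.g. binary sentiment output
```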
---
# tensorflow.keras.preprocessing.sequence
* pad_sequences(sequences, maxlen): Pads sequences to the same length.
* [return] Numpy array with shape `(len(sequences), maxlen)`
* sequences: List of lists, where each element is a sequence.
* maxlen: Int, maximum length of all sequences.
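For example (the values are illustrative):
```
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences([[1, 2, 3], [4, 5]], maxlen=4)
# [[0 1 2 3]
#  [0 0 4 5]]  -- zero-padded at the front by default
```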
---
# tensorflow_datasets.core.features.text.text_encoder
* **Tokenizer**: Splits a string into an array of tokens.
    * `__init__(alphanum_only=True, reserved_tokens=None)`
        * alphanum_only: if True, only parse out alphanumeric tokens (non-alphanumeric characters are dropped); otherwise, keep all characters (individual tokens will still be either all alphanumeric or all non-alphanumeric).
        * reserved_tokens: list of strings that, if present in the input, will be preserved as whole tokens, even if they contain mixed alphanumeric/non-alphanumeric characters.
    * tokenize(s): Splits a string s into (an array of) tokens.
```
import tensorflow_datasets as tfds

tokenizer = tfds.features.text.Tokenizer()
tokenizer.tokenize("this is a test")  # ['this', 'is', 'a', 'test']
```
* **TokenTextEncoder**: TextEncoder for converting between text and integers backed by a list of tokens.
    * `__init__(vocab_list, lowercase=False, tokenizer=None)`
        * vocab_list: list of tokens in the vocabulary.
        * lowercase: whether to make all text and tokens lowercase.
        * tokenizer: responsible for converting incoming text into a list of tokens (usually you should use the same Tokenizer that was used to build the vocabulary).
    * encode(s): Encodes text into a list of integers.
    * decode(ids): Decodes a list of integers into text.
```
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set, tokenizer=tokenizer)
df['Encode'] = df['Text'].map(lambda t: encoder.encode(t))
```
---
# sklearn.model_selection
* train_test_split(array, test_size): split arrays or a DataFrame into random train and test subsets
    * [return] List containing train-test split of inputs.
    * array: the original dataset, e.g. a numpy array or a DataFrame
    * test_size: float between 0.0 and 1.0, the proportion of the dataset to include in the test split
```
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.3)  # 70% train, 30% test
```
# numpy
# pandas
* read_csv(path): Read a comma-separated values (csv) file into DataFrame.
* [return] a DataFrame.
* path: the path to the csv file.
* **DataFrame**: Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
    * `__init__(data, columns)`:
        * data: 2-d array of data
        * columns: list of column names
    * describe()
        * [return] Descriptive statistics that summarize the central tendency, dispersion and shape of the dataset's distribution, excluding NaN values.
    * count(axis=0)
        * [return] Series with the count of non-NA cells for each column or row.
        * axis: {0 or 'index', 1 or 'columns'}, default 0. If 0 or 'index', counts are generated for each column; if 1 or 'columns', for each row.
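A short sketch of these together (the file name train.csv is a placeholder):
```
import pandas as pd

df = pd.read_csv("train.csv")  # load a CSV file into a DataFrame
df.describe()                  # summary statistics for the numeric columns
df.count()                     # non-NA counts per column
```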
{"metaMigratedAt":"2023-06-15T03:20:48.998Z","metaMigratedFrom":"Content","title":"TF2.0 APIs","breaks":true,"contributors":"[{\"id\":\"479d3e08-f7cc-43d9-9d6f-3130e8b200e2\",\"add\":8216,\"del\":355}]"}