---
tags: COTAI LHP
---
# Lecture Notes ML4AI 2021
## SESSION 3: NONLINEAR PREDICTORS
###### **_NOTE_** Online LaTeX (for practicing equation input, etc.)
### LESSON REVIEW
### Linear predictors:
+ $X\rightarrow Y$ (regression or classification)
+ **Characteristic equation**:
${\hat y} = s(Wz+b)$
* For regression, **s** is **linear** (the identity):
    * $s(x) = x \Leftrightarrow s(Wz+b) = Wz+b$
* For classification, **s** is **softmax, sigmoid, etc.**, with output interpreted as a belief (confidence); see the sketch below.
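A minimal NumPy sketch of this characteristic equation (the shapes and numbers are made up for illustration):

```python
import numpy as np

def softmax(a):
    """Turn raw scores into beliefs (non-negative, summing to 1)."""
    e = np.exp(a - a.max())
    return e / e.sum()

# Hypothetical sizes: 4-dimensional embedding z, 3 output classes.
W = np.random.randn(3, 4)          # weights
b = np.zeros(3)                    # bias
z = np.random.randn(4)             # embedding vector

y_hat_reg = W @ z + b              # regression: s = identity
y_hat_cls = softmax(W @ z + b)     # classification: s = softmax
print(y_hat_cls, y_hat_cls.sum())  # beliefs over 3 classes, summing to 1.0
```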
### Regression curve (non-linear)
* Linear predictor may not be accurate (low accuracy)
* Regression surfaces are curves
#### Decision Boundary:
The surface separating the predicted classes; on the boundary itself, we cannot determine which class a point belongs to.
#### Non-linearly separable
Use curves to separate classes.
* When the inputs are noisy (sparse), linear separation is highly inaccurate, or the classes are simply non-separable.
#### Non-linear Transformation
transform the data into a $Z$ space where classes can be separated by linear cuts
### Manifold Hypothesis (thin regions, sheets, manifolds)
"high dimensional data are points in a meaningful low dimensional manifold with high dimensional noise added"
#### Multimodal embeddings
maximally compatible w.r.t. the multimodal training data.
#### Entangled embedding manifolds (twisted, tangled)
* use transformations to "stretch", thus, disentangle the data
### Transformations: stretch out Z space
- We want to transform objects from ambient space into coordinate space.
- We want to use simple “cuts” to separate concepts in embedding space $Z$
- Rotate, stretch, bend, curl to disentangle the embedding manifolds.
- Input: $Z$ ; Output: $Z'$
- transform $Z$ using matrix $A$ to get the $Z'$ space: $Z' = AZ$, then warp it with the nonlinearity $\gamma$
### Matrix multiplication
$C_{m\times p} = A_{m\times n}B_{n\times p}$
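A quick NumPy check of this shape rule (random matrices, just to show the dimensions):

```python
import numpy as np

A = np.random.randn(2, 3)   # m x n
B = np.random.randn(3, 4)   # n x p
C = A @ B                   # the inner dimension n must match
print(C.shape)              # (2, 4), i.e., m x p
```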
### Transformations
#### Linear transformations = rotate/flip + scale + rotate/flip
#### Nonlinear Transformations
Activation functions (e.g., sigmoid, tanh, ReLU), applied elementwise.
### Nonlinear predictors
- Feature engineering (hand-crafted):
- MLP ("multi-layer perceptron"), a neural network:
$Z \rightarrow Z' \rightarrow Z'' \rightarrow Z'''$ (deep learning); see the sketch below.
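A minimal sketch of such a stack of transformations in NumPy, assuming ReLU as the nonlinearity $\gamma$ and made-up layer sizes:

```python
import numpy as np

def relu(x):                 # an example nonlinearity (gamma)
    return np.maximum(0, x)

def softmax(a):              # final transfer function s for classification
    e = np.exp(a - a.max())
    return e / e.sum()

# Hypothetical sizes: z in R^4, hidden layers of 8 and 6 units, 3 classes.
W2, b2 = np.random.randn(8, 4), np.zeros(8)
W1, b1 = np.random.randn(6, 8), np.zeros(6)
W,  b  = np.random.randn(3, 6), np.zeros(3)

z  = np.random.randn(4)
z1 = relu(W2 @ z  + b2)      # Z   -> Z'
z2 = relu(W1 @ z1 + b1)      # Z'  -> Z''
y_hat = softmax(W @ z2 + b)  # linear predictor on the disentangled Z''
```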
#### Interactive demo: TensorFlow Playground
[Find out here](https://news.ycombinator.com/item?id=11486058)
#### Visualize in 2D: nonlinear decision boundary
http://cs231n.github.io/neural-networks-case-study/
https://machinelearningcoban.com/2017/02/24/mlp/
#### Kernel trick in kernel machines
### Decision Tree & Locally linear models
Split the input space into many small regions, with a linear model in each region.
## SESSION 4: RecSys: Recommender Systems
- widely used as the core engine of user platforms (Amazon, Spotify, etc.)
((recsys success story insert))
- can learn to serve humans: Virtual assistant, cobot
### Generalization counts
- Observe items or users and recommend other items to the users
- **vital for a good RecSys**
### Personalization counts
- going beyond the common statistics of the mass
- maintaining distinct, differentiated statistics for each user
- **vital for a good RecSys**
### Long-tail issues
(insert image here)
### Problem formulation
- Input: users' ratings $\rightarrow$ Output: predictions (ratings, stars, etc.)
- Two common methods: **User-based filtering** and **Item-based filtering**
#### Input Data

#### Taxonomy

### Collaborative filtering

- Require:
- $\overrightarrow{Z}$ : embeddings
- similarity
##### Item-based CF

- Key idea: users as features → compute item-item similarity matrix.
- Advantage: items change much less over time than users do.
##### User-based CF

- Key idea: items as features → compute user-user similarity matrix.
###### Note: Transposing the utility matrix turns user-based CF into the item-based computation (and vice versa); item-based is usually more efficient since the number of users > the number of items. See the similarity sketch below.
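A small NumPy sketch of both directions on a made-up utility matrix (rows = users, columns = items, 0 = no rating):

```python
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

def cosine_sim(M):
    """Cosine similarity between the *columns* of M."""
    Mn = M / (np.linalg.norm(M, axis=0, keepdims=True) + 1e-9)
    return Mn.T @ Mn

item_sim = cosine_sim(R)     # item-based CF: users as features
user_sim = cosine_sim(R.T)   # user-based CF: items as features (transpose!)
```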
##### Matrix factorization (model-based) CF

- Key idea: user-item weighted-sum/similarity score ≈ rating score
- SVD: singular value decomposition.
- Matrix factorization solved by Regularized Alternating Least Squares (see the sketch below)
- Can also be solved by truncated SVD ***after*** filling the missing entries (e.g., with CF)
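A minimal sketch of regularized alternating least squares on a tiny made-up rating matrix (the rank, regularization, and iteration count are arbitrary):

```python
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5]], dtype=float)
mask = R > 0                       # only observed ratings enter the loss
k, lam = 2, 0.1                    # latent rank and regularization strength
U = np.random.randn(R.shape[0], k) # user factors
V = np.random.randn(R.shape[1], k) # item factors

for _ in range(20):                # alternate two least-squares solves
    for i in range(R.shape[0]):    # fix V, solve for user i's factors
        Vi = V[mask[i]]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(k),
                               Vi.T @ R[i, mask[i]])
    for j in range(R.shape[1]):    # fix U, solve for item j's factors
        Uj = U[mask[:, j]]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(k),
                               Uj.T @ R[mask[:, j], j])

print(np.round(U @ V.T, 1))        # predicted ratings, incl. missing cells
```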
### Personalized RecSys

#### Content-based filtering

- Key idea: weighted-sum score $\theta_k \cdot z_i + b_k \approx$ rating of user $k$ for item $i$ (could be formulated as a regression or a classification problem); see the sketch below.
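A minimal sketch of the regression formulation for one user, with made-up item features and ratings:

```python
import numpy as np

# Hypothetical item embeddings z_i (2-dimensional here) and user k's ratings.
Z = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9]])
r = np.array([5.0, 4.0, 1.0])

# Fit theta_k, b_k so that theta_k . z_i + b_k ~ rating r_i (least squares).
Zb = np.hstack([Z, np.ones((len(Z), 1))])       # absorb bias into theta
theta_k = np.linalg.lstsq(Zb, r, rcond=None)[0]

z_new = np.array([0.5, 0.5, 1.0])               # an unseen item (+ bias term)
print(theta_k @ z_new)                          # predicted rating
```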
#### Context-aware & KB personalized RecSys!
![](https://i.imgur.com/7tlsUUH.png)
### Hybrid RecSys

## SESSION 6: MIDTERM EXAM
1. [**5** Point] Given a set of inputs $(X^1,\dots,X^N)$, we use PCA to represent them as $X^t = X_0 + z_1U_1+\dots+z_nU_n$ with $X_0$ the mean input and $U_i$ the orthonormal principal components.
- [**2** Points] Write down equation for $X_0$, and equations for properties of $U_i,U_j$: unit length & mutual orthogonal. **Solution**:
    - mean: $X_0 = \frac{1}{N}\sum_{t=1}^{N}X^t$
    - unit length: $U_i \cdot U_i = \lVert U_i\rVert^2 = 1$
    - mutually orthogonal: $U_i \cdot U_j = 0$ for $i \neq j$
- [**1** Point] We need to reduce the dimensions of $X^t$ to visualize them on 2D. What is the embedding vector ${\bf z}^t$ of $X^t$ if we use only 2 first principal components to represent it? What is the last feature of ${\bf z}^t$ in this case? **Solution**: ${\bf z}^t = (z_1^t, z_2^t)$ with $z_i^t = U_i\cdot(X^t - X_0)$; the last feature is $z_2^t$, the coordinate along $U_2$.
- [**1** Point] What are the main differences between representations by PCA and by sparse coding? **Solution**: PCA represents every input with the same few global orthogonal components (a dense code), while sparse coding uses an overcomplete dictionary in which each input activates only a few components.
- [**1** Point] If we cluster the dataset into 3 groups with centroids $({\bf m}_1, {\bf m}_2, {\bf m}_3),$ what is the label embedding coordinates of $X^t$ if it belongs to cluster 2? **Solution**: the one-hot embedding $(0, 1, 0)$.
2. [**1** Point] If we use each song as a feature to represent the users, what is the embedding coordinates ${\bf z}_A$ of user A in the dataset below? **Solution**:$z_A = [5,5,0,1,1]$

3. [**3** Point] From the general form of linear predictors: $\hat{y}=\mathsf{s}(Wz+b)$ with $\mathsf{s}(\cdot)$ a transfer function for desired output interpretation.
- [**1** Point] What is $W$ for
        - 1 dimensional linear regression? **Solution**: a $1 \times 1$ matrix (a single scalar weight)
        - softmax regression with 3 classes? **Solution**: a $3 \times d$ matrix, one row of weights per class
- [**1** Point] What is function $\mathsf{s}(\cdot)$ for
        - 1 dimensional linear regression? **Solution**: the identity, $s(x) = x$
        - SVM binary classification? **Solution**: the sign function, $s(x) = \mathrm{sign}(x)$ (the sigmoid belongs to logistic regression)
- [**1** Point] Why logistic regression (for binary classification) has only 1 probability output while there are 2 classes? **Solution**: the two class probabilities sum to 1, so a single output $p$ determines both: $P(\text{class }1) = p$ and $P(\text{class }2) = 1 - p$.
4. [**2** Points] Evaluation procedure
- [**1** Point] Explain the main use of the train--dev (validation)--test sets. **Solution**: the train set fits the parameters, the dev (validation) set tunes hyperparameters and selects models, and the test set gives the final unbiased performance estimate.
- [**1** Point] What are the main similarity and differences between linear SVM and logistic regression? **Solution**:
    + differences: SVM is deterministic (classifies by the sign of the margin) while logistic regression is probabilistic (outputs a class probability).
    + similarity: both learn a linear decision boundary for binary classification.
5. [**2** Points] There are **1100 items** and **one million users**. We need to build a content-based RecSys by extracting **120 features** ${\bf z}_i$ describing each item $i$ then learn a classifier ${\bf \theta}_j$ for each user $j$ to predict **ratings from 1 to 5 stars** of each user for each item.

- [**1** Point] How many classes do we need? **Solution**: *5 classes* (ratings from 1 to 5 stars)
- [**1** Point] What is the size of $W$ if we use softmax regression $\hat{y}=s(Wz+b)$ to classify ratings? **Solution**: $W \in \mathbb{R}^{5\times 120}$ (5 rating classes $\times$ 120 features)
6. [**2** Points] Nonlinear predictors have general form $\hat{y}=s(W'\phi(z)+b')$. For Multilayer Perceptrons (MLP) in particular: $\phi(z) = \gamma(Wz+b)$ recursively, each called a "hidden layer".
- [**1** Point] Give explicit equation of an MLP with 2 hidden layers. **Solution**:
    + $\hat y = s\big(W'\,\gamma(W_1\,\gamma(W_2 z + b_2) + b_1) + b'\big)$
- [**1** Point] What are the parameters of the fully-connected layer in your equation? **Solution**:
    - $W'$, $W_1$, $W_2$: weights
    - $b'$, $b_1$, $b_2$: biases
    - ($s$ and $\gamma$ are fixed transfer/activation functions, not learned parameters)
7. [**2** Points] Kernel machines use "kernel trick" $\phi(z_i)\cdot\phi(z_j) = \kappa(z_i,z_j)$.
- [**1** Point] Explain why kernel trick is useful. **Solution**: it lets us compute inner products $\phi(z_i)\cdot\phi(z_j)$ in a very high- (even infinite-) dimensional feature space without ever computing $\phi$ explicitly.
- [**1** Point] Explain how we can use kernel trick in feature-space prediction $\hat{y}=s(W^\phi\phi(z)+b)$ to turn a linear predictor into a nonlinear one. **Solution**: write $W^\phi = \sum_i \alpha_i \phi(z_i)$ as a combination of training points; then $W^\phi\phi(z) = \sum_i \alpha_i\,\kappa(z_i, z)$, a predictor that is linear in feature space but nonlinear in $z$, computed with kernel evaluations only.
-------------------------
## SESSION 7: Search: Optimization & training
The workhorse of machine learning: derivative-based optimization.
##### Notes: features $\neq$ embedding vector Z
### Main goal of today's session:
- to find the best function $f$ in the function space $F$
- further understand TEFPA model:
- T: Task
- E: Experience
- F: Function space
- P: Performance
- A: Algorithm to search
- focus on **searching Algorithm A**
#### Maximizing performance measure = minimizing loss
- Search/optimization: moving in parameter space $\theta = (\theta_1,\theta_2)$ to find a "good" function $f_{\theta}$
##### From parameter space to loss surface

- we need to find the optimum $\theta^{*} = \arg\min_\theta J(\theta; X, {\bf y})$
- following the steepest direction (like water) $\rightarrow$ gradient.

###### Each point corresponds to one set of weights and biases.
##### Velocity and gradient
Velocity $v_i = \lim_{\Delta t\to0}\frac{\Delta p_i}{\Delta t}$
- The direction of the velocity vector is the direction in which the rate of change is greatest.
- We look for this steepest direction in parameter space $\theta$.
- The velocity vector has one component per parameter, here $(\theta_1,\theta_2)$.
Gradient vector: $\nabla_\theta J(\theta;X,{\bf y}) = \Big(\frac{\partial}{\partial\theta_1}J_\theta, \dots, \frac{\partial}{\partial\theta_k}J_\theta, \dots, \frac{\partial}{\partial\theta_n}J_\theta\Big)$
with $\frac{\partial}{\partial\theta_k}J_\theta = \lim_{\Delta \theta_k\to 0}\frac{J(\theta_1,\dots,\theta_k + \Delta \theta_k,\dots,\theta_n) - J(\theta_1,\dots,\theta_k,\dots,\theta_n)}{\Delta \theta_k}$
- To minimize the loss, move *against* the gradient, i.e., opposite the direction of fastest increase.
- Gradient-descent update: $\theta \leftarrow \theta - \eta \nabla_\theta J$ with step size $0 < \eta \leq 1$ called the "learning rate" (see the sketch below)
- Ex: if $\nabla_\theta J = (0.5, -0.2)$, the update moves $\theta$ by $\eta\,(-0.5, 0.2)$
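A minimal gradient-descent sketch on a made-up quadratic loss (the loss function and learning rate are arbitrary):

```python
import numpy as np

def grad_J(theta):
    """Gradient of the toy loss J = (t1 - 3)^2 + 2 (t2 + 1)^2."""
    return np.array([2 * (theta[0] - 3), 4 * (theta[1] + 1)])

theta = np.zeros(2)
eta = 0.1                            # learning rate
for step in range(100):
    theta -= eta * grad_J(theta)     # move *against* the gradient
print(theta)                         # approaches the minimizer [3, -1]
```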
##### back-propagation(MLP)
- The forward pass computes the loss; the backward pass propagates gradients to update the weights.
##### Realistic loss surface


- local optimum
- saddle point
- flat area
- narrow valleys
Other issues:

- Large learning rate: the loss may oscillate (jump up and down) or even diverge.
- Small learning rate: training is very slow.
- Batch $\rightarrow$ mini-batch $\rightarrow$ SGD: use fewer samples per update.
- Momentum: accumulate past gradients to smooth updates and push through flat areas.

- some optimizers: SGD, SGD with momentum, AdaGrad, RMSprop, Adam (most common)
##### Heuristics
- validation-set loss typically starts increasing once the model overfits → **early stopping**.
- **Dropout** forces a neural network to learn more robust features.
----------------------
## SESSION 8: Convolutional neural networks
Used for processing images, videos, or audio.
### Revision:
- Dot product for matching & similarity
### Spatial matching: sliding
- Use the dot product to recognize patterns by template matching.
- Feature maps
### Matching + sliding:
##### 1D convolution: horizontal **OR** vertical slide only:

##### 2D convolution: horizontal **AND** vertical slides (see the sketch below):
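A minimal NumPy sketch of 2D convolution as sliding dot products (following the deep-learning convention of not flipping the kernel; the kernel values are made up):

```python
import numpy as np

def conv2d(x, k):
    """'Valid' 2D convolution: dot product of k with every patch of x."""
    H, W = x.shape
    h, w = k.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):            # vertical slide
        for j in range(out.shape[1]):        # horizontal slide
            out[i, j] = np.sum(x[i:i+h, j:j+w] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1., 0.],
              [0., -1.]])                    # a made-up 2x2 filter
print(conv2d(x, k))                          # 3x3 feature map
```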

### CNN operations
#### Padding
- keep or increase the output shape
- reduce information loss at the edges
#### Stride
- how the filter moves across the input
- controls how quickly the data shrinks
#### Pooling
- reduces the shape of the input, typically by half, e.g. $4 \times 4 \rightarrow 2\times 2$
- Max pooling and average pooling

- implemented as a kernel with stride = 2
- pooling has no learnable parameters (the "filter" is fixed).
- Max pooling: keeps sharp features
- Avg pooling: smooths features
###### Convolution Bias:
- each convolution filter has a single bias added to its whole output, so a filter's parameter count = kernel (filter) size + 1 (the bias)
### CNN output shape
- **Formula**: $n_{out} = \Big\lfloor \frac{n_{in} +2p - k}{s} \Big\rfloor + 1$ (see the helper sketched below)
- **Parameters**:
    - $n_{in}$: number of input features
    - $n_{out}$: number of output features
    - $k$: convolution kernel size
    - $p$: convolution padding size
    - $s$: convolution stride size
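The formula above as a small helper function, checked against two common configurations:

```python
def conv_out_size(n_in, k, p=0, s=1):
    """n_out = floor((n_in + 2p - k) / s) + 1"""
    return (n_in + 2 * p - k) // s + 1

print(conv_out_size(28, k=3, p=1, s=1))  # 28: "same" padding keeps the size
print(conv_out_size(28, k=2, p=0, s=2))  # 14: 2x2 pooling at stride 2 halves it
```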
### DCNN: Deep CNN
- DCNN for MNIST:

- DCNN for face:

- Some popular DCNNs:

### CNN effective receptive field

- A single unit in layer 3 "sees" (represents) a whole patch of layer 1: the effective receptive field grows with depth.
-------------------
## SESSION 9: Recurrent neural networks (RNNs)
- Sequential Data
- Vanilla RNN
- Vanishing Gradient Problem
- Introduction to LSTM & GRU
###### RNNs produce a series of predictions as output
- In image classification, the output for one image does not affect the output for another image.
- RNNs output a series of outputs in which one output affects the others:
- Time series.
- Voice/speech recognition.
#### Moving in concept embedding spaces


[More examples](http://visual.cs.ucl.ac.uk/pubs/interpTransform/)
- When moving in concept embedding space, the values of $Z$ change.
### 'Moving' mathematical meaning
- $z \rightarrow z' = z + \Delta z$
- $z \rightarrow z' = \sigma(Uz + W\Delta{z} +b)$
- $h \rightarrow h' = \sigma(Uh + Wx +b)$
- $h_{t-1} \rightarrow h_t = \sigma(Uh_{t-1}+Wx_t + b)$
    - $h_{t-1}$: old concept
    - $h_t$: new concept
### Vanilla RNNs
- Feed-forward networks

- To turn a feed-forward network into an RNN: leverage the information already extracted from the sequence and maintain it in the "hidden states".
- Vanilla RNNs:

- output becomes new input.
- Math:

- RNNs: weight sharing in time. CNNs: weight sharing in space.
- Backpropagation through time: backpropagate errors to earlier time-steps during training.

#### Training Vanilla RNNs
Gradient-based method: backpropagation through time (BPTT)
- Treat the unfolded network as a big feed-forward network.
- The whole input sequence is given to the FFNN.
- The weight updates are computed for each copy in the unfolded network using the usual back-propagation method.
- All the updates are then summed (or averaged) and applied to the RNN (shared) weights; see the unrolled forward pass sketched below.
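A minimal sketch of the unrolled forward pass that BPTT differentiates through (the sizes and tanh activation are assumptions):

```python
import numpy as np

# Hypothetical sizes: hidden state h in R^5, inputs x_t in R^3.
U = 0.1 * np.random.randn(5, 5)   # recurrent weights, shared across time
W = 0.1 * np.random.randn(5, 3)   # input weights, shared across time
b = np.zeros(5)

def rnn_forward(xs):
    """Unroll h_t = tanh(U h_{t-1} + W x_t + b) over a whole sequence."""
    h, hs = np.zeros(5), []
    for x in xs:
        h = np.tanh(U @ h + W @ x + b)
        hs.append(h)
    return hs   # BPTT backpropagates errors through this entire chain

xs = [np.random.randn(3) for _ in range(4)]   # a length-4 input sequence
print(rnn_forward(xs)[-1])                    # final hidden state
```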
#### Common structures

#### Vanishing or Exploding Gradients
- Recurrent layer’s weights are shared across time-steps.
- Issue:
    - When gradients are multiplied across many time-steps, their values can shrink (vanishing) or grow (exploding).
    - Vanishing gradients cause 'short-term' memory: information from early time-steps is gradually forgotten.
#### Other RNNs:
- Vanilla RNNs are simple but do not work very well.
- Some variants of vanilla RNNs can improve their drawbacks: LSTMs & GRUs (more on later sessions).

## SESSION 10: K-means Clustering: Unsupervised Learning
### In Today's Lesson:
- Unsupervised Learning & Clustering
- K-means algorithm & objective function
- Elbow method
- Silhouette method
### Unsupervised Learning: Clustering
#### Unsupervised Learning:
Learn from inputs X to produce $\hat{Y}$ without needing example output labels Y.
#### Clustering & Classification:

- Classification maps one input X to one output Y (a known label or class).
- Clustering groups the inputs X themselves: a segment of Xs is assigned to a class Y without predefined labels.
#### Applications:
- Customer segmentation:
    - Example: using customers' shopping frequency or spending to group them into different classes, so each customer can be offered better-suited services and goods
### K-means Clustering
- centroid = mean of values in a cluster
- Each input is assigned to the nearest centroid using Euclidean distance.
- Implementation:

- Example (see the sketch below):
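A minimal NumPy sketch of the assign-then-recompute loop (Lloyd's algorithm) on made-up 2D data:

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Assign each point to its nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), K, replace=False)]           # initial centroids
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - m[None], axis=2)  # point-centroid dists
        labels = d.argmin(axis=1)                         # nearest centroid
        for k in range(K):
            if np.any(labels == k):
                m[k] = X[labels == k].mean(axis=0)        # centroid = mean
    return labels, m

# Three made-up blobs in 2D.
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, K=3)
```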
#### Objective of K-means Clustering
- In K-means clustering, performance depends on how tightly the points in each cluster stick together (intra-cluster) and how far apart the clusters are (inter-cluster).
- Optimizing K-means means minimizing the mean sum of intra-cluster distances:
- Formula: $J = \sum_{k=1}^{K}\sum_{x_i \in C_k}\lVert x_i - {\bf m}_k\rVert^2$
#### How to choose a good K
##### Elbow method:
- Plot $J$ against the number of clusters $K$ and pick the $K$ at the "elbow", where the decrease in $J$ slows down.
##### Silhouette method:
- Formula: $S_i = \frac{b_i - a_i}{\max(a_i,b_i)}$ with $a_i$ the mean distance from point $i$ to the other points in its cluster, and $b_i$ the mean distance from $i$ to the points of the nearest other cluster.
## SESSION 11: MDP Planning
### WHAT IS PLANNING ?
##### Planning is at the core of intelligence
- 5P: proper planning prevents poor performance. Fail to plan = plan to fail.
- From input X, extract Z, predict $\hat{Y}$, and from $\hat{Y}$ derive an action A.
- For example:
    - playing chess: the input is the current board; extract Z (strategies) and predict the next move with the highest chance of winning.
    - Tower of Hanoi: find the best sequence of moves to solve the problem and achieve the goal.
- From observation O, feature extraction creates Z. Sometimes the observation O itself is the Z.
- **Good planning**:
- **Bad planning**: 
### Planning
- Planning is the highest level of data processing.
- A plan = a sequence of actions to achieve a goal as efficiently as possible → sequential decision making.
- Examples: chess playing, autonomous driving, assistant robots, etc.
#### Elements to formulate a planning problem
- Actions: depend on the environment
- Performance evaluation: e.g., the client's satisfaction
- State of the agent (environment): observation O and Z
- Ex:
- The planning model is called the transition model: at state $O_1$ take action $A_1$; when the state changes to $O_n$, switch to the corresponding action $A_n$.
#### Simplest formulation:
- MDP (Markov decision process) = $(S, A, T, R, \gamma)$ with:
    - S: states
    - A: actions
    - T: transition model
    - R: reward
    - $\gamma$: discount factor (a reward received now is worth more than the same reward received after waiting longer, so the agent prefers earning rewards early)
- At state $S_t$, the agent decides which action to take in order to maximize the reward R.
#### Optimal plan
- Policy $\pi(s)$: which action $A$ to take in each and every situation $S$
- Action value $q^{\pi}(s,a)$: the expected (probability-weighted average) sum of rewards when taking action $a$ in state $s$ and then following $\pi$
- *Find $q$ to find $\pi$*
- We can also find $\pi$ directly using policy search (will be taught later)
#### Trajectory, return, value function, and optimal policy
The plan with the highest return (total reward) is the best plan.
- (slide 16 session 11. remember to write down!!!!!)
#### Bellman equations
- Formula: $q^*(s,a) = R(s,a) + \gamma\sum_{s'} P(s'\mid s,a)\,\max_{a'} q^*(s',a')$ (see the value-iteration sketch below)
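A minimal sketch of solving the Bellman optimality equation by fixed-point iteration (Q-value iteration) on a tiny made-up MDP:

```python
import numpy as np

# Made-up MDP with 2 states and 2 actions.
# T[s, a, s'] = P(s' | s, a); R[s, a] = immediate reward.
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(200):   # iterate q*(s,a) = R + gamma sum_s' P max_a' q*
    Q = R + gamma * T @ Q.max(axis=1)
pi_star = Q.argmax(axis=1)        # optimal policy: greedy w.r.t. q*
print(np.round(Q, 2), pi_star)
```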
#### Reinforcement learning
will be taught in the next class
---------------------
# ML4AI Session 12 Final Exam (for Students)
CoTAI AI Training Program for Highschool Students
© [CoTAI](https://gem.cot.ai) 2021
---
Open book. Choose to answer as many questions as possible, as concisely as possible. Ask instructors if you have difficulty in understanding the questions.
**Questions**
1. [**8** Points] The unified TEFPA framework to build a ML model: task $\mathcal{T}$, experience $\mathcal{E}$, function space $\mathcal{F}$, performance $\mathcal{P}$, and algorithm $\mathcal{A}$ to search. What are the elements in TEFPA framework...
- 1.1 [**2** Point, 3.1] to build a face recognition model using a DCNN?:
   - Answer:
     - Task: Input: images or camera footage - Output: face identities (names, ...)
     - Experience: a dataset of labeled face images
     - Function Space: DCNNs (all weight settings of the chosen architecture)
     - Performance: classification accuracy / cross-entropy loss
     - Algorithm: gradient descent with back-propagation
- 1.2 [**2** Point, 3.1] to build a RecSys? (using one of the models you have learned: item-based, user-based, content-based, MF)
   - Answer:
     - Task: Input: a user's id. Output: recommendations (predicted ratings)
     - Experience: users' interactions, item features, or user features (or both)
     - Function Space: similarity/embedding models of users and items (e.g., MF factor matrices)
     - Performance: rating-prediction error on held-out user interactions
     - Algorithm: e.g., regularized alternating least squares for MF
- 1.3 [**2** Point, 3.1] to build a customer segmentation model using k-means?:
   - Answer:
     - Task: Input: customers. Output: customer segments
     - Experience: customers' traits (hobbies, purchased goods, etc.)
     - Function Space: assignments of customers to K cluster centroids
     - Performance: intra-cluster distances (the K-means objective J)
     - Algorithm: K-means (assign to nearest centroid, recompute centroids, repeat)
- 1.4 [**2** Point, 3.1] to build a sentiment analysis model (good, bad, neutral comments) using RNN+Softmax Classifier?:
   - Answer:
     - Task: Input: a comment - Output: sentiment (good, bad, neutral)
     - Experience: previously labeled comments
     - Function Space: RNN + softmax classifiers (all weight settings)
     - Performance: cross-entropy / accuracy against the comments' labels
     - Algorithm: back-propagation through time with gradient descent
2. [**6** Points] Convolutional filters (or kernels)
- 2.1 [**1** Point, 1.1, 3.2] How do we extend them from dot product?:
   - Answer: by sliding: a convolutional filter applies the same dot product (template matching) to every local patch of the input.
- 2.2 [**1** Point, 1.1, 3.2, 3.4] Why do we call their outputs "feature maps"?:
   - Answer: because each number in the output measures how strongly the kernel's feature (template) is present at that location of the input.
- 2.3 [**1** Point, 3.2] Explain padding: how to do & main purpose:
   - Answer:
     - Main purpose: keep or increase the output shape and reduce information loss at the edges
     - How to do: add pads (e.g., zeros) around the data
- 2.4 [**1** Point, 3.2] Explain pooling: how to do & main purpose:
- Answer:
- main purpose: reduces input shape size
- Avg Pool (mean of values inside kernel) or Max Pool (max of values inside kernel)
- 2.5 [**1** Point, 3.2] Explain stride: how to do & main purpose:
   - Answer:
     - Main purpose: decides how the kernel moves across the data (step size), controlling how fast the output shrinks
     - How to do: move the kernel $s$ positions at a time
 - 2.6 [**1** Point, 3.2, 3.4] Explain their **effective** receptive field: why do they produce highly abstract features?:
   - Answer: a number in the output layer(s) represents a feature computed from a large patch of the input; stacking layers makes each output unit "see" more of the input, producing highly abstract features.
3. [**6** Points] Recurrent neural networks (RNNs) can be used for sequential modeling.
- 3.1 [**1** Point, 3.2] What does sequential data mean?:
- Answer: a series of data in time.
- 3.2 [**1** Point, 1.1, 3.2, 3.4] Explain each element in this basic equation of RNNs $h_t = \mathsf{\gamma}(Ah_{t-1}+Wz_t)$:
   - Answer:
     - $h_t$: new concept (new hidden state)
     - $\gamma$: activation function
     - $A$: recurrent weight matrix applied to the previous hidden state
     - $h_{t-1}$: old concept (previous hidden state)
     - $W$: input weight matrix applied to the new input $z_t$
 - 3.3 [**2** Point, 1.3, 2.1, 3.2] What does back-propagation-through-time mean, why do we need it instead of using plain back-prop, and how does it work for training RNNs?
- 3.4 [**1** Point, 1.3, 3.2] Explain vanishing gradient problem for simple RNNs.
- 3.5 [**1** Point, 3.1, 3.3] If we want to classify the sentiment of each user comment (good, bad, neutral) at the end of each sequence using RNN+Softmax classifier: explain briefly the model architecture.
4. [**6** Points] Planning in Markov Decision Process (MDP) $(S,A,T,R,\gamma)$.
- 4.1 [**1** Point, 3.1, 3.2] Explain 5 elements in MDP model (equation of each element if available).
- 4.2 [**1** Point, 3.2] Following a policy $\pi(s)$ to generate a trajectory of 10 time steps $(s_t,a_t,s_{t+1},r_{t+1})$. Compute the return. Equation of $a_t$?
- 4.3 [**1** Point, 1.2, 3.2] Repeat for 10 days: from $s_0 = \text{HOME}$ take action $a_0 = \text{GET_BUS}$ with cost $r_1 = 6000 \text{VNĐ}$ then following policy $\pi(s)$ to generate $K=10$ trajectories, each with total cost $G_k$. Compute the average cost of taking bus then following $\pi$: $Q^\pi(\text{HOME, GET_BUS})$.
- 4.4 [**1** Point, 1.1, 1.3, 2.1, 3.2] How do we compute an optimal plan (i.e., optimal policy $\pi^*$) of a known MDP $(S,A,T,R,\gamma)$?
- 4.5 [**1** Point, 3.2] Why do we say that the action value function $Q^\pi(s,a)$ gives predictions into very far future?
- 4.6 [**1** Point, 1.2, 3.2] What is the meaning of action value function when we set $\gamma = 1$? $\gamma = 0$?
5. [**7** Points] Unified ML models
$\text{Input } X \xrightarrow[B_{\beta}]{\text{Features}}\text{ Embedding Coordinates }Z \xrightarrow[P_{\theta}]{\text{Predictor}}\text{ Predictions }\hat{Y} \xrightarrow[{\pi^*}]{\text{Policy}}\text{ Action }A$
- 5.1 [**2** Points] List all *taught* algorithms for feature extraction and their main ideas.
   - Answer: PCA (project onto the principal components of the data), CNN (learn convolutional filters that respond to local patterns).
- 5.2 [**2** Points] List all *taught* algorithms for making predictions and their main ideas.
- Answer: Linear Regression, Logistic Regression, SVM, KMeans, kNN.
- 5.3 [**2** Points] What are the main *general* differences between linear predictors? And in your opinion why do we need different algorithms?:
   - Answer: each linear predictor uses a different transfer function $s(\cdot)$ and loss, matched to its output type (real value, probability, hard class); we need different algorithms because different tasks need different output interpretations.
- 5.4 [**1** Points] For MDPs, what are the predictions $\hat{Y}$ used to make decisions $A$?
- Answer: Action value $Q$
6. [**2** Points] RecSys

We build item embeddings ${\bf z}_i \in \mathbb{R}^2$ as in table, and use **softmax regression** to predict ratings 1 to 5. Choose a specific user $X\in \{A,\dots,F\}$, what is the training set for learning $\theta_X$? What are the parameters $\theta_X$ to be learned (with their shapes)?
 - Answer:
   - For user $X = A$: the training set is the pairs $({\bf z}_i, r_{A,i})$ of item embeddings and A's known ratings of those items. The parameters are $\theta_A \in \mathbb{R}^{5\times 2}$ plus a bias $b_A \in \mathbb{R}^{5}$ (softmax over the 5 rating classes on 2-dimensional embeddings).
7. [**6** Points] MDP Planning for playing Chess. Let rewards = 1 for winning, -1 for losing, and 0 for a draw or unfinished game, and no discount.
- 7.1 [**2** Points] What is the range of value of the optimal action-value function $Q^*(s,a)$, and how to derive probability of win/loss from it?
- 7.2 [**2** Points] If we use all the games already played in history to compute $Q^*(s,a)$, explain the method?
- 7.3 [**2** Points] Because there are so many state and action pairs $(s,a)$, we need to use *learning* to approximate and generalize for all $(s,a)$ pairs. If we use MLP to learn $Q^*(s,a)$, what is the dataset and possible network structure?