# Projects
## AACAS
### Intro
This is my capstone project in the MRSD program, on which I collaborated with 3 teammates to develop a drone perception, planning, and guidance system. The goal is to automate a drone so that it can navigate along waypoints and avoid collisions with obstacles along its path.
(In manual mode, our system displays the planned path on a user interface instead, so that the human operator can fly the drone along the suggested path.)
### Perception
My part of the project was developing the perception functionality of the system. The pipeline input of my part is mainly camera images, along with some point cloud and drone pose information from the lidar and DJI onboard sensors. The output is 2D or 3D trajectories of obstacles in the drone's field of view.
My implementation is based on C++, Python, and PyTorch, and everything is integrated on ROS.
* Object detection & classification: We use YOLOv3, which is a neural network based approach that simultaneously does object detection and classification.
* For tracking, we use a Kalman filter-based algorithm, SORT. The idea is to match new detections with current tracklets of objects.
### Questions: General
* Overall pipeline?
* Experiments?
* Biggest challenge?
* Sensor platform?
### Questions: YOLOv3
* Why YOLOv3?
* YOLOv3 is able to perform real-time object detection and classification simultaneously.
* It can be trained end-to-end.
* Its training process can also be broken down into pretraining/finetuning the image features and training the classifier. The latter does not take much time and can easily be adapted to different datasets.
* It does not require additional sampling methods (such as region proposals), which greatly slow down both training and inference.
* Training pipeline?
* The model is trained on an AWS EC2 GPU instance, since it is not very efficient or necessary to do the training on the Xavier which has limited specifications.
* We collect and manually label the dataset.
* We collect data at similar locations where we test fly the drone and deploy the model, and with similar object settings, so that training and inference data are similar (image background, obstacles).
* Inference pipeline?
* The trained model weights are then moved to and deployed on the Xavier. First, we have to ensure that the PyTorch models on AWS and the Xavier have identical architectures (including input resolution) so that the weights can be loaded correctly.
* Pass each frame of the video stream (batch size=1, reshaped, normalized) through the trained model and return the bounding boxes & classes of detected objects.
* Integration with ROS: Since the downstream (trajectory prediction) only needs trajectories of objects instead of bounding boxes, we first do NMS, then pass the boxes through a matching and tracking algorithm and assign them object IDs before passing them downstream. This increases efficiency by providing only the necessary information (see the sketch below).
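A minimal sketch of the per-frame inference step, assuming a PyTorch YOLOv3 whose forward pass returns a (num_boxes, 5 + num_classes) prediction tensor; the function names and thresholds are illustrative, not our exact code.

```python
import torch
import torchvision

def detect(model, frame, conf_thresh=0.6, iou_thresh=0.5, size=416):
    """One inference step: preprocess an RGB frame (H, W, 3 numpy array),
    run YOLOv3 with batch size 1, and apply NMS."""
    img = torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0   # normalize
    img = torch.nn.functional.interpolate(img[None], size=(size, size))
    with torch.no_grad():
        preds = model(img)[0]                      # (num_boxes, 5 + num_classes)
    preds = preds[preds[:, 4] > conf_thresh]       # objectness confidence threshold
    xyxy = torch.empty(len(preds), 4)
    xyxy[:, :2] = preds[:, :2] - preds[:, 2:4] / 2 # (cx, cy, w, h) -> (x1, y1, x2, y2)
    xyxy[:, 2:] = preds[:, :2] + preds[:, 2:4] / 2
    keep = torchvision.ops.nms(xyxy, preds[:, 4], iou_thresh)
    classes = preds[keep][:, 5:].argmax(dim=1)     # class with the highest score
    return xyxy[keep], preds[keep, 4], classes     # boxes, scores, class ids
```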
* Did you make any modifications to the YOLOv3 architecture?
* In order to adapt to our dataset with 3 classes, we trained the classifier part of the YOLOv3 model, replacing that of the original implementation which has 20 classes.
* Tweaked some NMS settings -- increased the bounding box confidence threshold so that we don't spend time processing spurious bounding boxes.
* How did you monitor the inference performance?
* Test fly & display recordings with visualization functions, and see if the model fails on specific cases (e.g. very large/close objects, which appear less often in the dataset; the white drone, which easily blends into the background and becomes hard to detect).
* How did you enhance the inference performance?
* Collect more data (~10k images)
* Add augmentation: color space augmentation (random color jittering / grayscale), and augmentations which require modifying the labels -- horizontal flipping, random zoom-in.
* Adjust some parameters to fit the current case specifically. We don't have cluttered background or objects, so in our case it is fine to increase the bounding box confidence and IoU thresholds when doing non-maximum suppression.
* What are the targets and how did you obtain data?
* Since we have limited time and project scope, the target classes are restricted to {ball, person, drone}.
* Recorded ROS bags (collection of all topics in the ROS system, including images, point clouds, drone pose, and predicted bounding boxes) with either the drone or only the sensor pod.
### Questions: Tracking
* What tracking algorithm did you use?
* SORT, which is a Kalman filter-based algorithm.
* It tracks bounding boxes with a 2D linear motion model.
* YOLOv3 is for 2D images. How did you achieve detection and tracking in 3D?
* When we receive 2D bounding boxes, we project the 3D lidar point cloud onto the 2D camera frame, and some points fall inside the bounding boxes. We group these points and apply RANSAC filtering, which gives us a "cluster" of points belonging to the object. The centroid of the cluster is our estimate of the object's 3D position.
* At every timestep we update and keep track of the 2D bounding box of each object, and repeat this 3D localization process (sketched below).
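A minimal numpy sketch of that localization step, with the RANSAC stage approximated by a simple median-depth filter for brevity; `K`, `T_cam_lidar`, and all names are illustrative stand-ins for our calibration and code.

```python
import numpy as np

def localize_3d(points_lidar, box, K, T_cam_lidar):
    """Estimate an object's 3D position from lidar points falling in a 2D box.

    points_lidar: (N, 3) points in the lidar frame
    box: (x1, y1, x2, y2) bounding box in pixels
    K: (3, 3) camera intrinsics; T_cam_lidar: (4, 4) lidar-to-camera extrinsics
    """
    pts = np.c_[points_lidar, np.ones(len(points_lidar))] @ T_cam_lidar.T
    pts = pts[pts[:, 2] > 0, :3]                       # keep points in front of camera
    uv = pts @ K.T
    uv = uv[:, :2] / uv[:, 2:3]                        # perspective divide -> pixels
    x1, y1, x2, y2 = box
    inside = (uv[:, 0] > x1) & (uv[:, 0] < x2) & (uv[:, 1] > y1) & (uv[:, 1] < y2)
    cluster = pts[inside]
    # stand-in for the RANSAC / outlier-rejection step: keep points near the median depth
    cluster = cluster[np.abs(cluster[:, 2] - np.median(cluster[:, 2])) < 1.0]
    return cluster.mean(axis=0)                        # centroid = 3D position estimate
```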
* How do you model object state?
* The state of each object includes the x, y positions, velocities, and accelerations, plus the width, height, and their first derivatives. The state vector is of size 10.
* State vector $X_t=[x, y, w, h, \dot{x}, \dot{y}, \dot{w}, \dot{h}, \ddot{x}, \ddot{y}]$
* The class prediction from YOLO is preserved here, so that the matching process between tracklets and new detections is easier.
* What are your prediction and update processes?
* Prediction: At each timestep we estimate the object's state based on the previous state and covariance (a sketch of the full filter follows this list).
State prediction $\hat{X}_t:$
$\hat{f}_t=f_{t-1}+\dot{f}_{t-1}\Delta t+\ddot{f}_{t-1}\frac{\Delta t^2}{2}\; , \hat{\dot{f}}_t=\dot{f}_{t-1}+\ddot{f}_{t-1}\Delta t\;, f \in \{x, y\}$
$\hat{g}_t=g_{t-1}+\dot{g}_{t-1}\Delta t\; ,g \in \{w, h\}$
* Matching: A new detection is assigned to a tracklet if their class labels match, and the bounding boxes of the new detection & the last step's prediction have an IoU score above a certain threshold.
* Measurement: The newly assigned bounding box serves as the measurement for the update process. The state (10 elements) is mapped to measurement space (x, y, w, h) with a $4 \times 10$ matrix.
Measurement $\tilde{X}_t=[x, y, w, h]$
* Update: We use the "Measurement" which is the newly assigned bounding box to compute the Kalman Gain and update our object state.
* Task-specific implementations: If a detection does not have a matching tracklet, it is viewed as a new object and a new tracklet will be started.
We keep track of objects through time, even if they are "absent" (no matching detection) for a few timesteps, in order to deal with short-term occlusions. If a tracklet has been inactive for a certain number of timesteps, it will be removed from the current collection (i.e. object has already left the scene).
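A minimal numpy sketch of the filter above, directly encoding the prediction equations and the $4 \times 10$ measurement matrix; `Q` and `R` stand in for the (tuned) process and measurement noise covariances.

```python
import numpy as np

def make_F(dt):
    """Constant-acceleration transition for X = [x,y,w,h, dx,dy,dw,dh, ddx,ddy]."""
    F = np.eye(10)
    F[0, 4] = F[1, 5] = F[2, 6] = F[3, 7] = dt       # f += f_dot * dt
    F[0, 8] = F[1, 9] = 0.5 * dt**2                  # only x, y get the acceleration term
    F[4, 8] = F[5, 9] = dt                           # x_dot += x_ddot * dt
    return F

H = np.zeros((4, 10)); H[:4, :4] = np.eye(4)         # measure [x, y, w, h] only

def predict(x, P, F, Q):
    return F @ x, F @ P @ F.T + Q                    # prior state and covariance

def update(x, P, z, R):
    S = H @ P @ H.T + R                              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                   # Kalman gain
    x = x + K @ (z - H @ x)                          # correct with the matched detection
    return x, (np.eye(10) - K @ H) @ P
```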
* How do you deal with the drone's motion?
* The drone's motion is handled and modeled with the control input term. Once we have the relative 3D position between the object and the camera in the previous frame, and the drone pose difference between the previous and current frames (obtained from the onboard inertial and GPS sensors), we can use a homography to estimate the amount of change in the bounding box x, y positions. We add this to our state estimate in the form of a control input.

### Questions: ROS and Software Implementation
* On what platform did you deploy your model?
NVIDIA Jetson Xavier, an onboard computer running Linux, with a GPU.
* How was the inference performance?
We are able to perform inference (detection, classification, and tracking) in real time, at around 60 fps. We reduced the input image size to maintain this inference speed.
* ROS integration problems
---
## Graph Neural Network for 3d Object Detection
### Intro
This is a research project during my internship last summer.
To start off this summer research project, I chose one of the methods from the KITTI benchmark, a graph neural network-based method, then tried to modify it and improve its performance.
The graph neural network takes in a graph data structure consisting of nodes and edges, so the graph encodes rich information about the structure and connectivity of the scene.
During both training and inference, the features of each node are updated by aggregating the features of its neighboring nodes and connecting edges, and we predict the label of each node from the updated features.
### Preprocessing
* Projection
* First we create graph nodes by projecting the point cloud onto the camera image. The graph nodes are the projected 2D points, each initialized with a 4-dimensional feature vector: the lidar reflectivity and the RGB values of the pixel it is projected onto in the camera image.
* Node Sampling
* The voxelized point cloud is too dense (10-20k points per frame), so we have to downsample it again, leaving only around 1,000-2,000 points, or nodes.
* Graph construction
* Nodes: We use the downsampled point cloud as the graph nodes. Each is initialized with the 4-dimensional voxel features (irgb).
* Edges: We connect two nodes with an edge if they are within a certain radius of each other. The edge features are initialized with a 3D vector: the relative offset in 3D between the two nodes (see the sketch below).
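A minimal sketch of this construction with PyTorch-Geometric, using the `radius_graph` helper from `torch_geometric.nn` (it requires the `torch_cluster` package); the radius value and function names are illustrative.

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import radius_graph

def build_graph(xyz, irgb, radius=4.0):
    """xyz: (N, 3) downsampled points; irgb: (N, 4) reflectivity + RGB features."""
    edge_index = radius_graph(xyz, r=radius, loop=False)   # (2, M) pairs within radius
    src, dst = edge_index
    edge_attr = xyz[dst] - xyz[src]                        # 3D offset as the edge feature
    return Data(x=irgb, pos=xyz, edge_index=edge_index, edge_attr=edge_attr)
```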
### Architecture and ablation
* PyTorch-Geometric implementation
* Ported the original network implementation and loss function from TensorFlow to PyTorch-Geometric.
* This made implementing and trying different architectures easier.
* Since the research goal is on different model architectures and features, the implementation of everything else (preprocessing, box grouping, NMS) remained the same.
* Initialization features
* The original approach only considers the lidar reflectivity of each point, rather than the 4D voxel features including the RGB values. Adding the RGB values did improve performance a bit.
* We also tried replacing the RGB image with an upsampled feature map from a pretrained network, which also improved performance (we cached the feature maps to reduce training time).
* Architectures
* Original implementation: Linear aggregation, with MLPs for updating node and edge features. The problem is that it is very heavyweight, because we have individual weights for all ~100k edges.
* Attention graph neural network: The first attempt was to add attention weights to all the nodes and edges. This did improve performance, but training is slow.
* GraphSAGE: This is another type of graph neural network operator; the idea is to have all edges share the same weights. This would ideally tackle the edge-weight problem, but because in our case the edges carry a lot of information (the 3D offset between their connecting points), it did not improve performance, even after we added an attention mechanism to GraphSAGE.
* A possible solution is to incorporate the (normalized) 3D coordinates of the points into the node features, instead of encoding them in the edges.
### Loss
* Classification loss: Just like the simplest case for graph neural networks, we use cross-entropy loss for the node label classification problem.
* On its own this has very limited effect, because nodewise label predictions carry no information about the surrounding structure, namely the "clusters" that are actually objects.
* Another reason is that the node labels are heavily imbalanced. Over 80% of the points have label 0, the background, so the model can simply predict label 0 everywhere and still reach a low loss and a high accuracy.
* Localization loss: offset between predicted and ground truth boxes. Position, size and orientation.
* Since nodewise label classification loss is apparently insufficient, the second loss function is introduced.
* The predicted boxes are formed by grouping nodes with the same predicted label (other than 0), and non-maximum suppression is applied to remove duplicate boxes.
* The loss between predicted and ground truth boxes is a summed square loss of all terms, including 3D position, sizes and orientation.
* This loss function is critical, so it is weighted much more heavily than the classification loss.
* Regularization
### Questions: General
* K-means clustering: To evaluate if the feature maps method makes sense.
* Visualization: Point cloud projections (depth and reflectivity), bounding boxes.
* What will you do to deploy this model for inference?
* Try to encode the 3D information in the node features so that we can use the GraphSAGE architecture, whose shared set of edge weights greatly reduces model size and inference time.
### Questions: PyTorch-Geometric
* Dataloader architecture?
* The graph representation of each sample consists of two components: node features (of size N × node_feature_size) and edge indices (of size 2 × M, where each column holds two integers between 0 and N-1, the indices of the connected nodes). Optional components include edge features (of size M × edge_feature_size) and labels (of size N, or N × num_classes).
* The graph dataloader uses a special collate function, which concatenates each component across the samples in the batch and provides a batch index indicating which sample each node belongs to (needed to parse the network outputs). See the sketch below.
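A small runnable sketch of this batching behavior, using `Data` and the `DataLoader` from `torch_geometric.loader` (PyG 2.x); the sizes and random tensors are just for illustration.

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

def random_sample(n_nodes=5, n_edges=8, n_feats=4):
    return Data(x=torch.randn(n_nodes, n_feats),
                edge_index=torch.randint(0, n_nodes, (2, n_edges)),
                edge_attr=torch.randn(n_edges, 3),
                y=torch.randint(0, 4, (n_nodes,)))

loader = DataLoader([random_sample() for _ in range(16)], batch_size=4)
batch = next(iter(loader))
# Node features of all 4 graphs are concatenated, edge indices are offset, and
# batch.batch maps each node back to its sample so per-graph outputs can be parsed.
print(batch.x.shape, batch.edge_index.shape, batch.batch.shape)
```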
* How does GNN layers work?
* GNN layers generally have an input size and an output size, similar to linear layers: they convert each node's feature vector from the input size to the specified output size.
* The feature vectors of neighboring nodes and edges are passed through separate sets of linear weights with the same output shape, and combined into the target node's feature vector.
* Attention weights enter in the step where those features are combined.
* Reduction: We usually don't sum the features of all neighboring nodes and edges into the target node, but take the max or mean instead. Max reduction is similar in concept to maxpooling: taking the most significant feature. (See the sketch below.)
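A minimal layer of this kind, written against PyTorch-Geometric's `MessagePassing` base class with max aggregation; the layer name and linear decomposition are illustrative, not the exact operator we used.

```python
import torch
from torch import nn
from torch_geometric.nn import MessagePassing

class EdgeConvLayer(MessagePassing):
    """Minimal GNN layer: messages from neighbor + edge features, max-reduced."""
    def __init__(self, in_dim, edge_dim, out_dim):
        super().__init__(aggr='max')                 # max reduction, like maxpooling
        self.node_lin = nn.Linear(in_dim, out_dim)
        self.edge_lin = nn.Linear(edge_dim, out_dim)
        self.self_lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, edge_index, edge_attr):
        msg = self.propagate(edge_index, x=x, edge_attr=edge_attr)
        return self.self_lin(x) + msg                # update the target node features

    def message(self, x_j, edge_attr):
        # combine each neighbor's features with the connecting edge's features
        return self.node_lin(x_j) + self.edge_lin(edge_attr)
```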
* How does nodewise label prediction work?
* Very similar to image classification: we take the final updated feature vector of each node and pass it through a linear classifier with one output per class.
---
## Pedestrian Re-identification and Tracking
### Intro
This is a research project during my previous job as a research assistant, at Academia Sinica, Taiwan.
Pedestrian re-identification is identifying the same individuals across a sequence of camera frames, so that we can keep track of them.
### Pipeline
* The input is a sequence of videos, probably from a surveillance CCTV.
* First, we use YOLOv3 to detect the pedestrians.
* Then we pass the "proposals", which are cropped regions of pedestrians in the input image, to a feature extraction network which produces a single feature vector for each pedestrian detected.
* Finally we compare the distance between feature vectors of two detections, to determine if they are the same person.
### Training
We use the YOLO detector off-the-shelf and focus on training the feature network. We train directly on human detections (cropped images).
This is a scenario in which we are likely to encounter "new identities" all the time, so classification is not suitable here. Therefore we use metric learning instead of classification.
The network will produce a feature vector for a human detection image. We compare the distance between two vectors and use a threshold to determine if they are the same person (positive pair) or otherwise (negative pair).
* Siamese loss: Aims to minimize the distance between feature vectors of instances of the same class. The weakness is that it does nothing about the distance between different classes.
* Triplet loss: Designed to tackle the problem of Siamese loss. During training it tries to minimize the distance between intraclass feature vectors, while maximizing the distance between interclass feature vectors.
* We also did some sampling for triplet-loss training. If we take all possible combinations of training instances, there is a large imbalance between the numbers of positive and negative pairs, so we sample and keep only some negative pairs so that the overall dataset has a neg/pos ratio of around 1. (A sketch of the loss follows below.)
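A minimal PyTorch sketch of the triplet objective using the built-in `nn.TripletMarginLoss`; the margin value and the `net` feature extractor are illustrative.

```python
import torch
from torch import nn

triplet = nn.TripletMarginLoss(margin=0.3)

def triplet_step(net, anchor, positive, negative):
    """anchor/positive: crops of the same identity; negative: a different identity."""
    fa, fp, fn_ = net(anchor), net(positive), net(negative)
    # pulls (anchor, positive) together, pushes (anchor, negative) apart by a margin
    return triplet(fa, fp, fn_)
```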
### Inference and tracking
Detections are directly passed into the feature network to produce the feature vectors.
Tracking only happens in inference phase. We use a Kalman filter-based method called DeepSort, which is SORT + deep features.
A detection is considered to be matching with a track if:
* They have an IoU overlap above certain threshold.
* Their feature vectors have a distance under certain threshold.
If there is a match, the new detection is assigned to the existing tracklet, and the tracklet state (including the position and "average features") is refined with the new detection. A simplified version of the matching check is sketched below.
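A simplified per-pair sketch of those two gates; the real DeepSort additionally solves a global assignment between all tracks and detections, so the thresholds and arguments here are illustrative.

```python
import numpy as np

def iou(a, b):
    """a, b: boxes as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def matches(track_box, track_feat, det_box, det_feat,
            iou_thresh=0.3, dist_thresh=0.5):
    """A detection matches a track only if both gates pass."""
    ok_iou = iou(track_box, det_box) > iou_thresh                   # spatial gate
    ok_feat = np.linalg.norm(track_feat - det_feat) < dist_thresh   # appearance gate
    return ok_iou and ok_feat
```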
### Challenges and remedy
* Occlusion: partially occluded instances are probably the hardest to deal with
* Determine which body regions are visible -- head, upper torso, lower torso, upper legs, lower legs -- using off-the-shelf pose estimation pipelines.
* Feature vector representation: previously we had only 1 global feature vector, extracted from the entire detection. Now we have 5 extra feature vectors, one per body region, each from its own feature embedding network, which we call local feature vectors.
* We compare a detection's local feature vectors against the tracklet's only for regions that are present and not occluded.
* For the global detection, we pad the occluded regions with zeros before reshaping them and feeding them into the feature embedding network.
### Metrics
Identity switching, broken tracklets, ...
---
## Robotic High Jump Coach
### Intro
This is a course project at CMU on which I collaborated with 2 teammates. The goal is to develop a system to aid high jumpers in their training.
The input is a 2D video of a high jump, and the outputs include a prediction of the bar outcome (either successfully clearing the bar or not) and a set of verbal coaching suggestion labels.
The implementation involves pose estimation and a sequence-analyzing network combining convolutional and recurrent architectures, with many of the ideas inspired by NLP tasks.
### Pipeline
* Pose estimation
* The input video is first passed through an off-the-shelf 3D pose estimator in order to obtain joint predictions for each frame. The 3D coordinates are then normalized to be between 0 and 1.
* For each frame, there are 25 keypoints, each in 3 coordinates, which yields a flattened feature vector of size 75 in total.
* Sequence segmentation
* Since a full high jump consists of several distinct stages, we decided to handle each "segment" individually, including training separate neural network blocks and producing separate sets of verbal labels. (In fact, coaches often judge and advise athletes on each high jump stage separately.)
* We currently do this segmentation manually.
* Neural network
* For each video / training instance, we pass the stacked pose sequence of each segment into a neural network, which predicts the verbal suggestions for each segment and also predicts a binary bar outcome label for the whole instance.
### Network architecture
* Overall architecture: The full model consists of 3 feature embedding blocks, 3 verbal label classifiers, and 1 bar outcome classifier.
* Many concepts in the network design are inspired by NLP practices and models.

* Feature embedding blocks
Here we use a hierarchical embedding strategy. This is common in NLP models, where we want embeddings at different hierarchies, for example words, phrases, and sentences. Likewise, we extract embedded features for single frames, a few consecutive frames, and the entire sequence segment.

* Single frame: We use an unbiased linear layer to extract the feature vector, which is a weighted linear combination of the input joint vector. This captures the idea of "which keypoints are important and which are not".
* Speaking of the linear layer's bias: there is another important factor in the videos -- direction, an indicator of whether the jumper approaches the bar from the right or the left.
In NLP we usually use an embedding layer to get a feature vector for each tokenized word (a row of the weight matrix). 'right' and 'left' are clearly discrete tokens, so we can directly look up their embedding vectors. The direction vector is added to the single-frame features, which is exactly how a bias is applied to linear layer outputs -- so we use it as the bias.
* Several frames: We also want to consider an embedding spanning across several consecutive frames, so we run a 1D convolution along the time dimension, with a certain kernel size, or sliding window size, on the single frame embeddings.
* Sequence: Finally we apply a unidirectional RNN or LSTM to the previous hierarchical embeddings to embed the entire sequence, and take the hidden vector at each timestep as our sequence-level embedded features.
* Finally we apply maxpooling across the three levels of embedding features and output the element-wise maximum of the three (see the sketch below).
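A minimal sketch of one feature embedding block under these ideas; the dimensions, kernel size, and module names are illustrative, not our exact architecture.

```python
import torch
from torch import nn

class SegmentEmbedder(nn.Module):
    """Frame / several-frame / sequence embeddings, fused by element-wise max."""
    def __init__(self, in_dim=75, emb_dim=128, kernel=5):
        super().__init__()
        self.frame = nn.Linear(in_dim, emb_dim, bias=False)   # unbiased: weights keypoints
        self.direction = nn.Embedding(2, emb_dim)             # 'left' / 'right' as the bias
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel, padding=kernel // 2)
        self.rnn = nn.LSTM(emb_dim, emb_dim, batch_first=True)

    def forward(self, poses, direction):
        # poses: (B, T, 75); direction: (B,) with 0 = left, 1 = right
        f = self.frame(poses) + self.direction(direction)[:, None, :]  # single-frame level
        c = self.conv(f.transpose(1, 2)).transpose(1, 2)               # several-frame level
        s, _ = self.rnn(c)                                             # sequence level
        return torch.max(torch.max(f, c), s)                           # (B, T, emb_dim)
```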
* Verbal suggestions classifier
* This is also inspired by NLP verbal answer prediction, mostly used in VQA (Visual Question Answering) tasks. The output space there is a "vocabulary", and the labels are multi-hot vectors (i.e. there can be multiple positive labels in one training instance). Likewise, we create a phrase-level vocabulary of verbal suggestions (such as "run faster") to be the output space.
* For the ground truth vector, the binary label 1 for a specific verbal suggestion indicates that in this run, this suggestion is what the jumper has to improve.
* There are different sets of vocabulary and thus classifiers for each of the 3 high jump stages.
* We sum the embedding vectors obtained in the previous module along the time dimension and pass the result through a classifier module -- a stack of linear layers with batch normalization and ReLU activations between them.
* The output is a vector whose length is the size of our vocabulary. After a sigmoid activation, which scales every element to between 0 and 1, each element represents the probability of the corresponding term being positive.
* Bar outcome prediction classifier
* Besides producing verbal labels for coaching, we also want to predict the outcome of the jump, which is whether the athlete will successfully jump through the bar.
* This is inspired by NLP again -- by probably the simplest NLP task, binary sentiment analysis. Binary sentiment analysis has 2 labels, positive and negative: embeddings are passed into a recurrent architecture, and the hidden vector of the last word/timestep is passed into a classifier to determine whether the sentiment is positive or negative.
* Likewise, we take the embedded vector of the last timestep of all 3 high jump stages, concatenate them, and pass the result into a classifier with 2 output neurons, followed of course by a softmax.
* The softmax probability of the label 1 is the predicted probability of the athlete successfully jumping through the bar.
### Loss
There are 2 outputs, namely the verbal suggestions and the bar outcome prediction, so we have 2 different loss functions, which are summed.
* Bar outcome prediction loss: This is a softmax probability of 2 classes, therefore we use a Cross Entropy loss (which combines the softmax function and the negative log likelihood loss) to compute the loss.
* Verbal suggestion loss: For multi-hot labels where each term has its own probability between 0 and 1, we use the Binary Cross Entropy loss. The loss is weighted because the original dataset is imbalanced, with a neg/pos ratio of around 5. (Both losses are sketched below.)
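A minimal PyTorch sketch of the combined objective, using `BCEWithLogitsLoss` (the standard fused sigmoid + BCE) with `pos_weight` for the class imbalance; the vocabulary size, weight value, and dummy tensors are illustrative.

```python
import torch
from torch import nn

vocab_size = 20                                   # illustrative vocabulary size
outcome_loss = nn.CrossEntropyLoss()              # softmax + NLL over the 2 outcomes
verbal_loss = nn.BCEWithLogitsLoss(               # independent sigmoid per term;
    pos_weight=torch.full((vocab_size,), 5.0))    # offsets the ~5:1 neg/pos imbalance

outcome_logits = torch.randn(8, 2)                # dummy batch of 8 instances
verbal_logits = torch.randn(8, vocab_size)
bar_labels = torch.randint(0, 2, (8,))
multi_hot = torch.randint(0, 2, (8, vocab_size)).float()

loss = outcome_loss(outcome_logits, bar_labels) + verbal_loss(verbal_logits, multi_hot)
```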
### Questions
---
## Listen, Attend & Spell
This is part of my coursework at CMU, which is to implement the neural network for speech-to-text generation based on the paper "Listen, Attend and Spell".
The network accepts preprocessed phonemes as the input, and tries to generate the matching text as the output.
### Architecture
The model architecture consists of 3 parts.
* Encoder / Listener
The purpose of the encoder is to analyze the audio input with a biLSTM structure. However, some audio sequences are long and difficult to handle with a classic LSTM.
The workaround is to use a pyramidal biLSTM structure to reduce the time resolution, so that long sequences can also be handled well.
At every layer of the pyramidal architecture, the sequence length is halved and the feature dimension is doubled (see the sketch below).
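A minimal sketch of one pyramidal layer: adjacent timesteps are concatenated before the biLSTM, halving the length and doubling the width; the dimensions are illustrative.

```python
import torch
from torch import nn

class PyramidalBiLSTM(nn.Module):
    """One pBLSTM layer: halve the sequence length, double the feature width."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim * 2, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, x):
        B, T, D = x.shape
        T = T - T % 2                              # drop a trailing frame if T is odd
        x = x[:, :T, :].reshape(B, T // 2, D * 2)  # concatenate adjacent timesteps
        out, _ = self.lstm(x)                      # (B, T // 2, 2 * hidden_dim)
        return out
```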
* Attention
The attention weights are computed with
* A key/value pair, which are from separate linear layers following the encoder
* And also a query which is from the decoder.
To produce the attended audio context:
* First we build a mask of the appropriate length over the input, to get rid of the padded timesteps.
* Then we apply the softmaxed attention scores of each element to the value sequence (see the sketch below).
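A minimal sketch of this attention step; `lengths` holds the number of valid encoder timesteps per utterance, and the names are illustrative.

```python
import torch

def attend(query, keys, values, lengths):
    """Dot-product attention over encoder outputs.

    query: (B, D); keys/values: (B, T, D); lengths: (B,) valid timesteps per utterance.
    """
    energy = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)        # (B, T) scores
    mask = torch.arange(keys.size(1))[None, :] >= lengths[:, None] # True on padding
    energy = energy.masked_fill(mask, float('-inf'))               # ignore padded steps
    attn = torch.softmax(energy, dim=1)                            # attention weights
    context = torch.bmm(attn.unsqueeze(1), values).squeeze(1)      # (B, D) context
    return context, attn
```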
* Decoder / Speller
This is the trickiest part, where the goal is to predict the next character in the script given the audio input and the previous character. Therefore, the LSTM here is unidirectional.
For sequence generation we feed the start-of-sequence token at timestep 0 and gradually produce the rest.
For each timestep:
* The previous token prediction is passed into the embedding layer which yields the corresponding feature vector.
* The feature vector is concatenated with the attended context from the previous LSTM output.
* Then pass the concatenated vector to the LSTM cell again.
* The current token / character prediction is generated by a softmax layer. (One decoder timestep is sketched below.)
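A sketch of one speller timestep using the `attend` helper above; `embedding`, `lstm_cell`, `char_proj`, and `query_proj` are illustrative module names, not our exact code.

```python
import torch

def decode_step(prev_token, context, hidden, embedding, lstm_cell, char_proj,
                query_proj, keys, values, lengths):
    """One speller timestep (the module arguments are illustrative names)."""
    emb = embedding(prev_token)                          # embed the previous prediction
    h, c = lstm_cell(torch.cat([emb, context], dim=1), hidden)
    logits = char_proj(h)                                # softmax over characters
    new_context, _ = attend(query_proj(h), keys, values, lengths)  # context for next step
    return logits, new_context, (h, c)
```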
---
## Hierarchical Co-Attention VQA
The goal for this project is to process and comprehend both of the inputs, the image and the question, in order to predict the most suitable answer to the question.
The implementation is based on the paper "Hierarchical Question-Image Co-Attention for Visual Question Answering", which introduces a co-attention based approach. The pipeline looks like the following:
### Image
We pass the image through a pretrained deep backbone network and extract the image features. This gives a feature matrix of shape (spatial resolution × embed_size), where embed_size is the number of feature maps.
### Question
To process the question, we first create a tokenized set of question word vocabulary, then extract the vector for each tokenized word in the question, then embed the question in 3 different hierarchies:
* Words: The embedded feature vector of each token.
* Phrases: We consider 3 phrase lengths, n = 1, 2, and 3 words. For each phrase size n, we apply a 1D convolution with window size n along the time dimension, and finally apply a maxpool across the three results to get the final phrase embedding vector at each timestep t.
* Sentence: We use a unidirectional LSTM to encode the entire sequence, and take the hidden vector at each timestep t.
### Answer
We create another tokenized vocabulary set from the answers in the dataset; answers are treated at the sentence level, i.e. a single answer token per question.
Now we have 3 different hierarchical feature matrices of size (sequence length * embed size).
### Co-attention
The attended context is computed for each of the hierarchies, respectively.
The idea of co-attention is to use one modality's features to guide the other, alternately.
* At each level, the co-attention operation is applied for 3 rounds. The first round uses the question context to guide the image context, and the roles are swapped alternately for two more rounds.
* In each co-attending round, both contexts are passed through their own linear layers, and a softmax (across the "time" or "spatial resolution" dimension) computes the attention mask. The attended context is then generated by applying the mask to the guided features.
* The final co-attended context is obtained by adding the visual and verbal attended vectors (see the sketch below).
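A loose sketch of one level of this alternating scheme; the mean-pooled initial question summary and the single shared linear layer per modality are simplifying assumptions, not the paper's exact formulation.

```python
import torch
from torch import nn

class CoAttention(nn.Module):
    """One level of alternating co-attention between question and image features."""
    def __init__(self, dim):
        super().__init__()
        self.q_lin = nn.Linear(dim, dim)
        self.v_lin = nn.Linear(dim, dim)

    def guide(self, guide_vec, feats, lin):
        # score each position against the guiding vector, softmax, then pool
        scores = torch.bmm(lin(feats), guide_vec.unsqueeze(2)).squeeze(2)
        attn = torch.softmax(scores, dim=1)              # over time / spatial positions
        return torch.bmm(attn.unsqueeze(1), feats).squeeze(1)

    def forward(self, q_feats, v_feats):
        # q_feats: (B, Tq, D) question features; v_feats: (B, Tv, D) image features
        q0 = q_feats.mean(dim=1)                         # round 1: summarize the question
        v1 = self.guide(q0, v_feats, self.v_lin)         # question guides image
        q1 = self.guide(v1, q_feats, self.q_lin)         # round 2: image guides question
        v2 = self.guide(q1, v_feats, self.v_lin)         # round 3: swap roles again
        return q1 + v2                                   # sum of attended contexts
```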
### Prediction
We now have 3 co-attended contexts, one for each of the word/phrase/sentence hierarchies.
For predicting the answers, we pass the co-attended context through a linear layer followed by an activation, then we concatenate the output with the co-attended context of the next hierarchical level, and repeat the process.
Finally a softmax is applied to predict the final answer.