# Post-Delay Analysis
:::info
- **Date:** 22.06.2019
- **Authors:** Huy-Bao Le, Robin Petersen, Sebastian Berghoff, Paul Duy An Nguyen
- **Description:** Analysis of our week of bug-hunting and other adventures.
:::
This document explains the issues with our software design and implementation that prevented us from completing the project by the given deadline, and how we resolved them.
Fortunately, we were given another week for bug-fixing and testing, so this document also covers the analysis of the bug fixes and problem resolutions from that week.
#### Dates:
- Deadline: 17.06.2019
- Duration of bug-fixing phase: 1 week
- Grace deadline: 24.06.2019
## Problem Analysis
The overall issues can be grouped into the following categories:
- Organizational
- Design
- Implementation
### Coordination of tasks (Organizational)
#### Observation & Issues
The human resources required were greatly underestimated: as the code-base grew in size, it was maintained mostly by one or two maintainers. This caused the project to stagnate, especially close to deadlines, because getting additional members to participate was difficult: the code-base has a rather steep learning curve, since it uses advanced coding techniques (e.g. C++ template metaprogramming).
#### Solution & Future outlook
The planning of human resources should definitely not be underestimated, and every project member should be required to participate in the code-base. To flatten the learning curve, one should not rely on complicated language features such as C++ template metaprogramming.
#### References
- [Gitlab Repo Graphs](https://gitlab.informatik.haw-hamburg.de/ack231/esep_fhc/-/graphs/master)
### Unsound architecture (Design)
#### Observation & Issues
After the initial design phase and our first review, the design seemed well-thought-out and sound; however, most of the issues with the design only surfaced once we started implementing. For example, the Finite-State-Machines modeling our main logic were underspecified and unsound: many core details and assumptions were missing or simply not documented. This significantly slowed down our progress towards finishing the initial implementation, as discussions kept coming up about which solution was the correct one and how it would interact with other components.
#### Solution & Future outlook
Probably the most important improvement would have been to document design decisions better, especially the seemingly small ones. Another improvement would be an even more thorough code/design review. Covering all possible scenarios is not feasible, however, and doing so well depends largely on experience in the given subject area.
### Monolithic FSMs (Design)
#### Observation & Issues
The main logic components of our application are modeled as Finite-State-Machines. The decision was made to design the main FSM around the life cycle of a puck; combined with the choice to stick with the Master-Slave architecture, the Puck-FSM grew into a monolith that included all kinds of error handling. To break down the architecture, further FSMs were designed to decouple the individual components. This led to several FSMs surrounding the Puck-FSM, communicating with each other by sending events.
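To illustrate this style (all names here are hypothetical, not taken from our code-base): each FSM boils down to a state variable plus an event handler, and cross-FSM communication happens by emitting events that the Broker forwards to the surrounding FSMs.
```cpp
#include <functional>

enum class PuckState { OnBelt, InHeightSensor, Sorted, Error };
enum class PuckEvent { HeightMeasured, SortedOut, Timeout };

struct PuckFsm {
    PuckState state = PuckState::OnBelt;
    std::function<void(PuckEvent)> emit;  // hands events to the surrounding FSMs

    void handle(PuckEvent event) {
        switch (state) {
            case PuckState::OnBelt:
                if (event == PuckEvent::HeightMeasured) {
                    state = PuckState::InHeightSensor;
                } else if (event == PuckEvent::Timeout) {
                    state = PuckState::Error;
                    emit(PuckEvent::Timeout);  // e.g. notify an Error-FSM
                }
                break;
            case PuckState::InHeightSensor:
                if (event == PuckEvent::SortedOut) state = PuckState::Sorted;
                break;
            default:
                break;
        }
    }
};
```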
#### Solution & Future outlook
Decomposing the Puck-FSM into the smaller, event-driven FSMs described above was our practical mitigation. For future projects, responsibilities should be split across several small, focused FSMs from the start instead of retrofitting the decomposition onto a monolith.
#### References
- [Gitlab FSMs](https://gitlab.informatik.haw-hamburg.de/ack231/esep_fhc/wikis/04-Design/FSMs)
### Segmentation-Faults (Implementation)
#### Observation & Issues
Our initial testing triggered an at first unknown fatal error which manifested as a segmentation fault. We tried to debug the issue with a debugger (gdb), but this was not helpful at all because of the corrupted stack. At first this pointed to a stack-corruption issue, but after hours of further inspection we traced the problem back to the Broker system.
The probability was high that we were indeed invoking undefined behavior.
The following lines turned out to be the culprit:
```cpp
// Server: reply by byte-copying `response` into the client's buffer
responses::Response response{responses::MAGIC, event};
MsgReply(rcvId, IPC_OK, &response, sizeof(response));

// Client: receive the reply bytes directly into `res`
responses::Response res;
MsgSend_r(..., reinterpret_cast<void*>(&res), sizeof(res));
```
How can these lines invoke undefined behavior? The problem is hidden in the data type `responses::Response`: it is non-POD (not Plain-Old-Data), which means it contains data structures that cannot be moved/copied simply by copying bytes; instead, C++'s copy/move constructors have to run. Imagine `response` contains a `shared_ptr` with a reference count of one: we want to move ownership of the `shared_ptr` object to the client with `MsgReply`, but `MsgReply` performs a plain byte copy, so we are invoking UB here, because the following happens:
```cpp
// event = shared_ptr { ref = 1 }
// constructing response copies the shared_ptr:
//   shared_ptr { ref = 2 }
responses::Response response{responses::MAGIC, event};
// MsgReply byte-copies response into the client's buffer;
// the byte copy does NOT increment the reference count
MsgReply(rcvId, IPC_OK, &response, sizeof(response));
...
// response goes out of scope => shared_ptr { ref = 1 }
responses::Response res;
MsgSend_r(..., reinterpret_cast<void*>(&res), sizeof(res));
// res goes out of scope => shared_ptr { ref = 0 }
// => resources are freed, even though `event` may still reference them!
```
#### Solution & Future outlook
Strictly speaking, the fix requires the transferred data to be POD. Since we only want to transfer ownership of a `shared_ptr`, however, we can get away with transferring its raw bytes, as long as the sender-side destructor never runs, so that the reference count is handed over exactly once.
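The commit linked below contains the actual fix. As a rough illustration of the idea (type and function names here are ours, not the project's, and it assumes sender and receiver share one address space), a trivially copyable envelope can carry the bytes of a `shared_ptr` while making sure no sender-side destructor ever touches them:
```cpp
#include <memory>
#include <new>

struct Event { int id; };

// POD envelope: byte-copying it through MsgReply/MsgSend_r is well-defined.
struct Envelope {
    alignas(std::shared_ptr<Event>)
    unsigned char raw[sizeof(std::shared_ptr<Event>)];
};

Envelope pack(std::shared_ptr<Event> event) {
    Envelope env{};
    // Move the shared_ptr into the raw buffer; the local `event` is left
    // empty, so its destructor will not decrement the reference count.
    new (env.raw) std::shared_ptr<Event>(std::move(event));
    return env;
}

std::shared_ptr<Event> unpack(Envelope& env) {
    auto* ptr = std::launder(reinterpret_cast<std::shared_ptr<Event>*>(env.raw));
    std::shared_ptr<Event> out = std::move(*ptr);
    ptr->~shared_ptr();  // end the lifetime of the buffered object
    return out;          // exactly one live owner remains
}
```
This way the reference count is incremented once (in `pack`) and decremented once (when the last receiver-side copy dies), matching the single ownership transfer we intend.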
#### References
- [Broker fix UB](https://gitlab.informatik.haw-hamburg.de/ack231/esep_fhc/commit/5f02966e5c2116e263e54033334c10c9468922f2)
### Timing-Issues (Design/Implementation)
#### Observation & Issues
In our initial design phase, we decided on a simple Master-Slave architecture, with a Controller (Master) and a Proxy (Slave), where the Proxy redirects events and commands to and from the Controller. The advantage is that the problems of data consistency and sharing between the two physical entities (transport belts) disappear, since the Proxy completely abstracts the second physical entity away.
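Conceptually (the interfaces here are hypothetical, not our actual code), the Proxy makes the second belt look like a local component that accepts commands and produces events:
```cpp
// Assumed wrapper around the serial port between the two systems.
struct SerialLink {
    struct Command { int id; };
    struct Event   { int id; };
    void send(const Command&) { /* serialize + write to the port */ }
    bool poll(Event&)         { return false; /* read + deserialize */ }
};

// Runs on the master; the Controller talks to this instead of to belt 2.
class Belt2Proxy {
public:
    explicit Belt2Proxy(SerialLink& link) : link_(link) {}
    void execute(const SerialLink::Command& cmd) { link_.send(cmd); }       // forward commands
    bool nextEvent(SerialLink::Event& out)       { return link_.poll(out); } // relay events
private:
    SerialLink& link_;
};
```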
Initial implementation and testing suggested that this design decision was a good one, without drawbacks. However, before implementing the full FSM designs we started small with our MVP "Little elephant", which successfully realized the sorting and other features, but only on one transport belt. Soon we implemented the rest of our FSM designs, which now used the abstracted second-transport-belt proxy layer.
This revealed a problem: the serial communication has too much overhead, which makes it too slow for our event-driven design and for the way the puck-detection system from our MVP works.
Another timing-related issue was our decision to update each puck's position (in mm) every 30 ms. The first problem was that the timer used for this task overloaded the Broker component, which is the core multiplexer for events.
On top of that, tracking positions by accumulated distance is extremely imprecise: each time-to-distance conversion carries a small error, and that error accumulates with every position update (in this case, every 30 ms), as the following illustration shows.
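A minimal illustration (the velocity and jitter values are made up): if each tick blindly adds `velocity * 30 ms` while the timer actually fires slightly late, the scheduling jitter accumulates, whereas a timestamp-based computation carries no per-tick error.
```cpp
#include <cstdio>

int main() {
    const double beltMmPerSec   = 50.0;  // assumed belt speed
    const double nominalTickSec = 0.030; // what the update code assumes
    const double actualTickSec  = 0.032; // 2 ms scheduling delay per tick (assumed)

    double perTick = 0.0;
    for (int tick = 0; tick < 100; ++tick)
        perTick += beltMmPerSec * nominalTickSec;  // error added on every tick

    const double byTimestamp = beltMmPerSec * (100 * actualTickSec);
    std::printf("per-tick estimate: %.0f mm, actual: %.0f mm, drift: %.0f mm\n",
                perTick, byTimestamp, byTimestamp - perTick);
}
```
After roughly three seconds the per-tick estimate is already off by 10 mm, and the drift keeps growing with every further update.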
#### Solution & Future outlook
We had two choices: a complete redesign towards a Master-Master architecture, or offloading the time-critical parts (e.g. puck detection) to the Proxy component. The first option was clearly not viable given our time constraints, so we took the second approach, which turns our architecture into more of a hybrid Master-Slave system.
We also eliminated the error-prone distance conversions by determining distances entirely from timestamps and durations. The different belt velocities still need to be accounted for, but this reduces to a single multiplication by a factor determined during calibration, by running the belt in slow and fast mode; a sketch follows below.
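A minimal sketch of the timestamp-based approach (names are hypothetical, not from our code-base): instead of integrating small per-tick increments, the travelled distance is computed in one step from the time elapsed since the puck was last detected.
```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

struct Puck {
    Clock::time_point enteredAt;  // set when a light barrier detects the puck
};

// beltMmPerSec is the calibration factor mentioned above, measured once
// per belt mode (slow/fast).
double travelledMm(const Puck& puck, double beltMmPerSec) {
    const std::chrono::duration<double> elapsed = Clock::now() - puck.enteredAt;
    return beltMmPerSec * elapsed.count();  // single conversion, no accumulation
}
```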
#### References
- [Modification to proxy](https://gitlab.informatik.haw-hamburg.de/ack231/esep_fhc/commit/c99b2bd8feb51e4d95fc24360ed0b3d3a3f00c23)
- [Rewrite calibration](https://gitlab.informatik.haw-hamburg.de/ack231/esep_fhc/commit/6456fd5d18913ace0ee5930483f494d596d40f1f)
## Retrospective
There was a lot to learn from a project of this scale, such as realising when one is too ambitious and getting a better sense of what is possible in a limited amount of time.
Other lessons learned include making sure that software which is crucial to the process is fit for use and will not stop working or show limits that were not planned for.
When we started, we aimed for the most comprehensive interpretation of the given task, taking care to design a system that could deal with fringe cases while covering all functionalities mentioned in the "Aufgabenstellung" (task specification) document. We planned our project under the assumption that all functionalities must be present, and even when that proved to be a very complex, maybe too complex, task, the decision was made to invest more time rather than to lower our expectations.
We aimed for the "best result" rather than the "best possible result with minimal expenditure", designing a system that could be expanded easily and was modular, even though that was not necessary for the project requirements. In cases like these, "doing it right" trumped "doing it in the easiest/quickest way".
The designed system, while intuitive in theory and seemingly easy to design and implement, proved to be much more demanding and difficult to work with in practice, as a project of this complexity leaves many opportunities for minuscule mistakes, misunderstandings or typos to cost a lot of development time.