# Project Description
Euclid is a space survey mission by the ESA and will launch a satellite in 2022.
The satellite will measure the redshift of distant galaxies to help understand
the role of dark matter and dark energy in our universe. The data collected by
the satellite will be sent back to earth and analyzed by various processing
jobs.
The software infrastructure, developed by the Institute for Data Science
(I4DS) at the FHNW, plays a key role in the Euclid project. Under the lead of
Prof. Dr. Martin Melchior, the institute developed multiple software components
relevant to our project:
The first component, called the Metascheduler, orchestrates the processing of
pipelines and dispatches jobs to the different science data centers. There, the
second component, called the Pipelinerunner, picks up the tasks and queues the
work on the compute nodes in the data center. The so-called Pilot Agent then
processes the queued jobs by fetching the necessary data from the Distributed
Storage System and executing them.
All components generate logs, which contain valuable information about the state
of the processing pipeline. Some jobs might take a long time to complete, and
the log data could give insight into what is going on and how far along a job
is. Occasionally, something might go wrong and the log data could help pinpoint
the problem and assist in debugging.
Logs are currently fed into an Elasticsearch instance with a Kibana frontend
that is not under the control of the Institute for Data Science. Considerable
time and effort would be required to make the current infrastructure fit the
institute's needs. Log messages are not parsed into a standardized format,
which makes searching difficult. Moreover, the sheer volume of logs overwhelms
human operators when it comes to debugging this highly complex distributed
system.
# Project Goal
Our job is to build a central logging infrastructure to collect, store and
analyze the generated logs. The collected logs should be parsed into a
structured format to enable performant searches and to ease debugging when
problems occur in the processing pipeline. An interactive dashboard may
provide an interface to search and filter log data and include metrics to
visualize running jobs and the overall system status.
Due to the volume of logs being generated, a manual analysis might not always
be feasible. To address this, one or more machine-learning algorithms should be
evaluated and used to gain additional insight into the vast amount of log data.
It is part of the project to evaluate which tools from the supervised,
unsupervised, and reinforcement-learning toolbox will help us with this
endeavor. For example, we might be able to send alerts when we detect anomalies
in the log data or condense many hundreds of log entries into a single
meaningful data point for better analysis.
# Project Requirements
The key words “MUST”, “SHOULD” and “MAY” in this document are to be interpreted
as described in [RFC 2119](http://www.ietf.org/rfc/rfc2119.txt).
- Design an architecture for the central logging infrastructure.
- The architecture MUST be horizontally scalable to match current estimations
of future workloads.
- The architecture MUST take into account that the Euclid project spans
multiple data centers.
- The architecture MUST allow flexible integrations with machine learning
algorithms (e.g. batch or stream processing workloads).
- The architecture SHOULD be maintainable by the I4DS and the Euclid project
team.
  - The architecture MUST handle input from the existing Filebeat infrastructure (IAL logs).
- Log data from science jobs MAY be handled by the logging infrastructure.
- Build a proof-of-concept (PoC) infrastructure.
- The PoC MUST implement a subset of the defined architecture.
- Parse log data
- A minimal standardized log format MUST be defined.
  - Log data MUST be parsed into the defined format (e.g. JSON; see the parsing sketch after this list).
  - Multi-line log messages SHOULD be parsable (e.g. tracebacks).
  - Semi-structured information MUST be extracted from log data (e.g. timestamp, severity).
- Domain-specific knowledge SHOULD be extracted from log data (e.g. job-id).
- There MUST be a way to identify wrongly parsed log messages.
- Unimportant log messages SHOULD be filterable.
- Data retention
  - Log data MUST be retainable through a time-based policy.
  - Log data SHOULD be retainable through a space-based policy.
  - Log data MAY be retainable through a job-based policy.
- We SHOULD find out which raw log datasets need to be preserved.
- Alerting based on log data
- The PoC MUST implement concrete alerts for already known problems (e.g. send alerts on messages with severity "ERROR").
- The documentation SHOULD describe best practices for alerting based on log
data.
- The documentation SHOULD describe best practices for where in the stack the alerting should happen (e.g. collection, ingestion, machine learning, ...)
- Machine learning with log data
  - We MUST find suitable machine learning algorithms that work with log data.
  - The PoC SHOULD integrate machine learning algorithms into the central
    logging infrastructure.
- The PoC MAY use machine learning based alerting.
- Debugging
- The PoC MUST allow interactive exploration of log data (e.g. queries).
  - The PoC MUST provide a way to visualize the log data (e.g. graphs and dashboards).
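
To make the parsing requirements above more concrete, the following is a
minimal sketch in Python of how a raw log line could be parsed into a
structured record. The line layout, the field names, and the `job-id` pattern
are assumptions for illustration only; the actual standardized format is a
deliverable of the project.

```python
import json
import re

# Assumed raw line layout for illustration, e.g.:
#   2019-10-15T14:03:22Z ERROR pilot-agent job-id=42 Failed to fetch input data
# The actual layout and field names will be defined together with the customer.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+)\s+"
    r"(?P<severity>DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+"
    r"(?P<component>\S+)\s+"
    r"(?P<message>.*)"
)
JOB_ID_PATTERN = re.compile(r"job-id=(?P<job_id>\w+)")


def parse_line(raw: str) -> dict:
    """Parse one raw log line into a structured record."""
    match = LOG_PATTERN.match(raw)
    if match is None:
        # Requirement: wrongly parsed messages must stay identifiable, so the
        # raw line is kept and the record is tagged instead of being dropped.
        return {"parse_error": True, "raw": raw}
    record = match.groupdict()
    record["parse_error"] = False
    # Domain-specific knowledge (here: the job id) is extracted separately.
    job = JOB_ID_PATTERN.search(record["message"])
    if job is not None:
        record["job_id"] = job.group("job_id")
    return record


if __name__ == "__main__":
    line = "2019-10-15T14:03:22Z ERROR pilot-agent job-id=42 Failed to fetch input data"
    print(json.dumps(parse_line(line), indent=2))
```

A record tagged with `parse_error` keeps wrongly parsed messages identifiable,
and a concrete alert for an already known problem could then be as simple as
checking the `severity` field of a parsed record for the value "ERROR".
Multi-line messages such as tracebacks would be joined into a single event
before this step, e.g. by the log shipper.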
# Project Boundaries
The following points are explicitly not part of the project.
- **Log Collection**: We do not have to collect the log files; it is not part
  of the project to provide a solution or product that collects log entries
  and sends them to our infrastructure. Instead, we have to provide a
  documented endpoint at which we will collect log entries.
- **Hardware**: The required storage and compute resources are provided by the
customer.
- **Capacity estimates**: Capacity estimates for the future production
infrastructure are provided by the customer. This includes the average log
volume per day.
# Our Approach
In the first phase, we study the state of the art logging best practices done in
the industry today. We will study publications, blog posts, and conference talks
from large technology companies on how they handle and incorporate logging in
their daily business.
We then evaluate products and technologies for building a central logging
infrastructure from first principles. To do so, we will define technology
requirements together with the customer and reduce a list of possible stacks
down to a few suitable options by ranking them on each requirement. Finally,
we will make a recommendation and, together with the customer, agree on the
final solution.
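
To illustrate the ranking mechanism, the following minimal sketch scores each
candidate stack per requirement and ranks the candidates by a weighted sum.
The requirements, weights, and stacks shown are placeholders, not actual
evaluation results.

```python
# Hypothetical requirements, weights, and candidate stacks for illustration;
# the real lists will be defined together with the customer.
weights = {"scalability": 3, "maintainability": 2, "ml_integration": 2}

# Score per stack and requirement (1 = poor, 5 = excellent).
stacks = {
    "Stack A": {"scalability": 4, "maintainability": 3, "ml_integration": 5},
    "Stack B": {"scalability": 5, "maintainability": 4, "ml_integration": 2},
}


def total_score(scores: dict) -> int:
    """Weighted sum of the per-requirement scores."""
    return sum(weights[req] * score for req, score in scores.items())


# Rank the candidates from best to worst.
for name in sorted(stacks, key=lambda n: total_score(stacks[n]), reverse=True):
    print(f"{name}: {total_score(stacks[name])}")
```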
Based on the chosen technologies, we will design an architecture that
satisfies the project requirements. We will then build a proof-of-concept
infrastructure based on this architecture.
Once we have a working proof of concept and can process log data, we can start
researching different machine learning techniques that we can apply to log
analysis. The goal is to narrow the set of suitable machine learning
techniques down to the most promising ones.
We will prototype different approaches and periodically present our results to
the customer. Similar to the technology evaluation, we will issue a
recommendation for the most promising machine learning techniques. Together
with the customer, we will then decide which techniques should be investigated
further.
In the last phase of the project, we will then try to integrate those machine
learning techniques into our proof of concept. Integration could mean
visualizing predictions or defining alerts based on classification results.
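
As an example of what such an integration could look like, the sketch below
applies an unsupervised anomaly detector (scikit-learn's IsolationForest) to
per-time-window counts of log messages by severity. Both the feature choice
and the algorithm are placeholders for illustration; selecting the actual
techniques is itself part of the project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical features: one row per time window, holding the number of log
# messages per severity (DEBUG, INFO, WARNING, ERROR). In the real system,
# these counts would be aggregated from the parsed log records.
rng = np.random.default_rng(seed=42)
normal_windows = rng.poisson(lam=[50, 200, 5, 1], size=(500, 4))

# Fit the unsupervised detector on windows assumed to be normal.
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(normal_windows)

# A window with an unusual burst of ERROR messages should be flagged (-1).
suspicious_window = np.array([[48, 190, 6, 40]])
if model.predict(suspicious_window)[0] == -1:
    # Integration point: instead of printing, this could raise an alert in
    # the central logging infrastructure or annotate a dashboard.
    print("Anomaly detected: unusual log volume per severity.")
```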
# Risks
- Capacity planning
  - We have no way to guarantee that the infrastructure will scale to the
    production workload, because the launch of the mission is far in the
    future.
- Machine learning
  - The logs may not contain sufficient information for in-depth analysis.
  - We have to train our models on non-production data. This may result in
    overfitting the models to the test data and lead to poorly performing or
    completely unusable models/algorithms when they are later run on the
    production dataset.
  - We might not succeed in implementing machine learning algorithms and might
    spend too much time in the process.
# Milestone Planning
The following milestone plan is used to track the progress of the project.
-------------------------------------------------------------------------------
Milestone   Due Date     Party        Acceptance criteria
----------- ------------ ------------ -----------------------------------------
MS1         15.10.2019   All          Project agreement and requirements are
                                      accepted by all parties.

MS2         22.10.2019   All          Technology evaluation spreadsheet is
                                      accepted by all parties.

MS3         29.10.2019   MM/SM        Hardware resources are provided.

MS4         05.11.2019   RH/PW        PoC infrastructure is running.

MS5         17.12.2019   All          A selection of machine learning
                                      algorithms has been agreed upon for
                                      further investigation.

MS6         20.03.2020   All          Final product is accepted by all parties
                                      (project requirements defined in the
                                      agreement are fulfilled).
-------------------------------------------------------------------------------
# Signatures
\vspace*{\fill}
\noindent\begin{tabular}{ll}
\makebox[0.4\textwidth]{\hrulefill} & \makebox[0.55\textwidth]{\hrulefill}\\
Date & Martin Melchior\\[12ex]
\makebox[0.4\textwidth]{\hrulefill} & \makebox[0.55\textwidth]{\hrulefill}\\
Date & Simon Marcin\\[12ex]
\makebox[0.4\textwidth]{\hrulefill} & \makebox[0.55\textwidth]{\hrulefill}\\
Date & Ralph Huwiler\\[12ex]
\makebox[0.4\textwidth]{\hrulefill} & \makebox[0.55\textwidth]{\hrulefill}\\
Date & Patrick Winter\\
\end{tabular}