# Project Description
Euclid is a space survey mission by the ESA and will launch a satellite in 2022.
The satellite will measure the redshift of distant galaxies to help understand
the role of dark matter and dark energy in our universe. The data collected by
the satellite will be sent back to earth and analyzed by various processing
jobs.
The software infrastructure, developed by the Institute for Data Science
(I4DS) at the FHNW, plays a key role in the Euclid project. Under the lead of
Prof. Dr. Martin Melchior, the institute developed multiple software components
relevant to our project:
The first component, called the Metascheduler, orchestrates the processing of
pipelines and dispatches jobs to the different science data centers. There, the
second component, called the Pipelinerunner, picks up the tasks and queues the
work on the compute nodes in the data center. The so-called Pilot Agent then
processes the queued jobs by fetching the necessary data from the Distributed
Storage System and executing them.
All components generate logs, which contain valuable information about the state
of the processing pipeline. Some jobs might take a long time to complete, and
the log data could give insight into what is going on and how far along a job
is. Occasionally, something might go wrong and the log data could help pinpoint
the problem and assist in debugging.
Logs are currently fed into an Elasticsearch instance with a Kibana frontend
that is not under the control of the Institute for Data Science. Considerable
time and effort would be required to make the current infrastructure fit the
institute's needs. Log messages are not parsed into a standardized format,
which makes searching difficult. Moreover, the sheer volume of logs overwhelms
human operators when it comes to debugging this highly complex distributed
system.
# Project Goal
Our job is to build a central logging infrastructure to collect, store and
analyze the generated logs. The collected logs should be parsed into a
structured format to enable performant searches and to ease debugging when
problems occur in the processing pipeline. An interactive dashboard may
provide an interface to search and filter log data and include metrics to
visualize running jobs and the overall system status.
Due to the volume of logs being generated, a manual analysis might not always
be feasible. To address this, one or more machine-learning algorithms should be
evaluated and used to gain additional insight into the vast amount of log data.
It is part of the project to evaluate which tools from the supervised,
unsupervised, and reinforcement-learning toolbox will help us with this
endeavor. For example, we might be able to send alerts when we detect anomalies
in the log data or condense many hundreds of log entries into a single
meaningful data point for better analysis.
# Project Requirements
The key words “MUST”, “SHOULD” and “MAY” in this document are to be interpreted
as described in [RFC 2119](http://www.ietf.org/rfc/rfc2119.txt).
- Design an architecture for the central logging infrastructure.
- The architecture MUST be horizontally scalable to match current estimations
of future workloads.
- The architecture MUST take into account that the Euclid project spans
multiple data centers.
- The architecture MUST allow flexible integrations with machine learning
algorithms (e.g. batch or stream processing workloads).
- The architecture SHOULD be maintainable by the I4DS and the Euclid project
team.
  - The architecture MUST handle input from the existing Filebeat infrastructure (IAL logs).
- Log data from science jobs MAY be handled by the logging infrastructure.
- Build a proof-of-concept (PoC) infrastructure.
- The PoC MUST implement a subset of the defined architecture.
- Parse log data
- A minimal standardized log format MUST be defined.
  - Log data MUST be parsed into the defined format (e.g. JSON; see the parsing sketch after this list).
  - Multi-line log messages SHOULD be parsable (e.g. tracebacks).
  - Semi-structured information MUST be extracted from log data (e.g. timestamp, severity).
- Domain-specific knowledge SHOULD be extracted from log data (e.g. job-id).
- There MUST be a way to identify wrongly parsed log messages.
- Unimportant log messages SHOULD be filterable.
- Data retention
  - Log data MUST be retainable through a time-based policy.
  - Log data SHOULD be retainable through a space-based policy.
  - Log data MAY be retainable through a job-based policy.
- We SHOULD find out which raw log datasets need to be preserved.
- Alerting based on log data
- The PoC MUST implement concrete alerts for already known problems (e.g. send alerts on messages with severity "ERROR").
- The documentation SHOULD describe best practices for alerting based on log
data.
- The documentation SHOULD describe best practices for where in the stack the alerting should happen (e.g. collection, ingestion, machine learning, ...)
- Machine learning with log data
  - We MUST find suitable machine learning algorithms that work with log data.
  - The PoC SHOULD integrate machine learning algorithms into the central
    logging infrastructure.
- The PoC MAY use machine learning based alerting.
- Debugging
- The PoC MUST allow interactive exploration of log data (e.g. queries).
  - The PoC MUST provide a way to visualize the log data (e.g. graphs and dashboards).
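
To make the parsing requirements above more concrete, the following is a
minimal sketch in Python of how a raw log line could be parsed into a
structured record. The line layout, the field names, and the `job-id` pattern
are assumptions for illustration only; the actual standardized format is a
deliverable of the project.

```python
import json
import re

# Assumed raw line layout for illustration, e.g.:
#   2019-10-15T14:03:22Z ERROR pilot-agent job-id=42 Failed to fetch input data
# The actual layout and field names will be defined together with the customer.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+)\s+"
    r"(?P<severity>DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+"
    r"(?P<component>\S+)\s+"
    r"(?P<message>.*)"
)
JOB_ID_PATTERN = re.compile(r"job-id=(?P<job_id>\w+)")


def parse_line(raw: str) -> dict:
    """Parse one raw log line into a structured record."""
    match = LOG_PATTERN.match(raw)
    if match is None:
        # Requirement: wrongly parsed messages must stay identifiable, so the
        # raw line is kept and the record is tagged instead of being dropped.
        return {"parse_error": True, "raw": raw}
    record = match.groupdict()
    record["parse_error"] = False
    # Domain-specific knowledge (here: the job id) is extracted separately.
    job = JOB_ID_PATTERN.search(record["message"])
    if job is not None:
        record["job_id"] = job.group("job_id")
    return record


if __name__ == "__main__":
    line = "2019-10-15T14:03:22Z ERROR pilot-agent job-id=42 Failed to fetch input data"
    print(json.dumps(parse_line(line), indent=2))
```

A record tagged with `parse_error` keeps wrongly parsed messages identifiable,
and a concrete alert for an already known problem could then be as simple as
checking the `severity` field of a parsed record for the value "ERROR".
Multi-line messages such as tracebacks would be joined into a single event
before this step, e.g. by the log shipper.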
# Project Boundaries
The following points are explicitly not part of the project.
- **Log Collection**: We do not have to collect the log files; it is not part
  of the project to provide a solution or product that collects log entries
  and sends them to our infrastructure. Instead, we have to provide a
  documented endpoint at which we will collect log entries.
- **Hardware**: The required storage and compute resources are provided by the
customer.
- **Capacity estimates**: Capacity estimates for the future production
infrastructure are provided by the customer. This includes the average log
volume per day.
# Our Approach
In the first phase, we study the state of the art logging best practices done in
the industry today. We will study publications, blog posts, and conference talks
from large technology companies on how they handle and incorporate logging in
their daily business.
We then evaluate products and technologies for building a central logging
infrastructure from first principles. To do so, we will define technology
requirements together with the customer and reduce a list of possible stacks
down to a few suitable options by ranking them on each requirement. Finally,
we will make a recommendation and, together with the customer, agree on the
final solution.
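
To illustrate the ranking mechanism, the following minimal sketch scores each
candidate stack per requirement and ranks the candidates by a weighted sum.
The requirements, weights, and stacks shown are placeholders, not actual
evaluation results.

```python
# Hypothetical requirements, weights, and candidate stacks for illustration;
# the real lists will be defined together with the customer.
weights = {"scalability": 3, "maintainability": 2, "ml_integration": 2}

# Score per stack and requirement (1 = poor, 5 = excellent).
stacks = {
    "Stack A": {"scalability": 4, "maintainability": 3, "ml_integration": 5},
    "Stack B": {"scalability": 5, "maintainability": 4, "ml_integration": 2},
}


def total_score(scores: dict) -> int:
    """Weighted sum of the per-requirement scores."""
    return sum(weights[req] * score for req, score in scores.items())


# Rank the candidates from best to worst.
for name in sorted(stacks, key=lambda n: total_score(stacks[n]), reverse=True):
    print(f"{name}: {total_score(stacks[name])}")
```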
Based on the chosen technologies, we will design an architecture that
satisfies the project requirements. We will then build a proof-of-concept
infrastructure based on this architecture.
Once we have a working proof of concept and can process log data, we can start
researching different machine learning techniques that we can apply to log
analysis. The goal is to narrow the set of suitable machine learning
techniques down to the most promising ones.
We will prototype different approaches and periodically present our results to
the customer. Similar to the technology evaluation, we will issue a
recommendation for the most promising machine learning techniques. Together
with the customer, we will then decide which techniques should be investigated
further.
In the last phase of the project, we will then try to integrate those machine
learning techniques into our proof of concept. Integration could mean
visualizing predictions or defining alerts based on classification results.
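
As an example of what such an integration could look like, the sketch below
applies an unsupervised anomaly detector (scikit-learn's IsolationForest) to
per-time-window counts of log messages by severity. Both the feature choice
and the algorithm are placeholders for illustration; selecting the actual
techniques is itself part of the project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical features: one row per time window, holding the number of log
# messages per severity (DEBUG, INFO, WARNING, ERROR). In the real system,
# these counts would be aggregated from the parsed log records.
rng = np.random.default_rng(seed=42)
normal_windows = rng.poisson(lam=[50, 200, 5, 1], size=(500, 4))

# Fit the unsupervised detector on windows assumed to be normal.
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(normal_windows)

# A window with an unusual burst of ERROR messages should be flagged (-1).
suspicious_window = np.array([[48, 190, 6, 40]])
if model.predict(suspicious_window)[0] == -1:
    # Integration point: instead of printing, this could raise an alert in
    # the central logging infrastructure or annotate a dashboard.
    print("Anomaly detected: unusual log volume per severity.")
```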
# Risks
- Capacity planning
  - We have no way to guarantee that the infrastructure will scale to the
    production workload, because the launch of the mission is far in the
    future.
- Machine learning
  - The logs may not contain sufficient information for in-depth analysis.
  - We have to train our models on non-production data. This may result in
    overfitting the models to the test data and lead to poorly performing or
    completely unusable models/algorithms when they are later run on the
    production dataset.
  - We might not succeed in implementing machine learning algorithms and might
    spend too much time in the process.
# Milestone Planning
The following milestone plan is used to track the progress of the project.
-------------------------------------------------------------------------------
Milestone   Due Date     Party        Acceptance criteria
----------- ------------ ------------ -----------------------------------------
MS1         15.10.2019   All          Project agreement and requirements are
                                      accepted by all parties.

MS2         22.10.2019   All          Technology evaluation spreadsheet is
                                      accepted by all parties.

MS3         29.10.2019   MM/SM        Hardware resources are provided.

MS4         05.11.2019   RH/PW        PoC infrastructure is running.

MS5         17.12.2019   All          A selection of machine learning
                                      algorithms has been agreed upon for
                                      further investigation.

MS6         20.03.2020   All          Final product is accepted by all parties
                                      (project requirements defined in the
                                      agreement are fulfilled).
-------------------------------------------------------------------------------
# Signatures
\vspace*{\fill}
\noindent\begin{tabular}{ll}
\makebox[0.4\textwidth]{\hrulefill} & \makebox[0.55\textwidth]{\hrulefill}\\
Date & Martin Melchior\\[12ex]
\makebox[0.4\textwidth]{\hrulefill} & \makebox[0.55\textwidth]{\hrulefill}\\
Date & Simon Marcin\\[12ex]
\makebox[0.4\textwidth]{\hrulefill} & \makebox[0.55\textwidth]{\hrulefill}\\
Date & Ralph Huwiler\\[12ex]
\makebox[0.4\textwidth]{\hrulefill} & \makebox[0.55\textwidth]{\hrulefill}\\
Date & Patrick Winter\\
\end{tabular}