Please bear in mind that the assignment you choose to work on should help you to

- relate the knowledge gained from the lectures to a specific field of application or use case;
- demonstrate your understanding of the topic;
- discern the available options and alternatives (as far as they are relevant to the topic);
- apply available tools & techniques in practice;
- show the importance of the topic in your own company (without disclosing private information);
- apply lateral thinking to this assignment, i.e., consider different perspectives (commercial, engineering, cultural, and so on). Why might the perspectives differ?

Choose from the suggested open topics below:

# Open topics for selection

## 1. Data-driven Workflow Automation (Dataiku/PowerAutomate)

* **Statement:** Modern companies have many "boring" tasks (like moving Excel data to a database). Automation tools help do this without writing thousands of lines of code.
* **Argumentation:** Using a tool like PowerAutomate is faster and less prone to "typo" errors than doing it manually every Monday morning.
* **Evidence:** Compare how long it takes to process 100 rows manually vs. using the automated flow.
* **Methodology:** 1. Pick a tool (e.g., PowerAutomate). 2. Create a flow: "When an email arrives with a CSV, save it to a MySQL table." 3. Test it with 5 different files.
* **Results:** A working automation that updates a live dashboard whenever a new file is uploaded.
* **Research Question:** How much time is saved over a month by automating a daily data-entry task?

## 2. Using AI Assistants with MySQL

* **Statement:** Tools like ChatGPT or GitHub Copilot can help students write SQL queries faster.
* **Argumentation:** AI is a "tutor" that helps find syntax errors, but you still need to know the basics to check whether the AI is hallucinating.
* **Evidence:** A log of 10 complex query requests and how many times the AI got the syntax right on the first try.
* **Methodology:** 1. Take a standard database (like 'Sakila'). 2. Try to generate 10 specific reports using AI. 3. Fix the mistakes the AI makes and document them.
* **Results:** A "Best Practices" guide for students on how to prompt an AI for SQL help.
* **Research Question:** Can a student with basic SQL knowledge complete an advanced reporting task 50% faster using AI?

## 3. Comparison of Low-Code Data Science Tools (Orange, KNIME)

* **Statement:** You don't always need to be a Python expert to do Data Science. Visual tools let you drag-and-drop your analysis.
* **Argumentation:** Visual tools help you see the "logic flow" of data better than a wall of code.
* **Evidence:** A table comparing how many clicks it takes to run a "Linear Regression" workflow in Orange vs. KNIME.
* **Methodology:** 1. Load the same dataset into Orange and KNIME. 2. Build a simple prediction model in both. 3. Compare which tool was easier to navigate.
* **Results:** A step-by-step tutorial on building a "House Price Predictor" in both tools.
* **Research Question:** Which tool is more user-friendly, productive and versatile for non-programmers who want to implement real Data Science use cases?

## 4. How to Effectively Use the MySQL Performance Schema

* **Statement:** Sometimes a database is slow. The Performance Schema is like a "Check Engine" light for MySQL.
* **Argumentation:** Instead of guessing why a query is slow, we can use built-in tables to see exactly which part of the system is lagging.
* **Evidence:** "Before and After" execution times of a slow query after finding the bottleneck.
* **Methodology:** 1. Enable the Performance Schema. 2. Run a "heavy" query. 3. Query the `sys` or `performance_schema` tables to see what happened (see the Python sketch below).
* **Results:** A "Top 5 List" of the most common causes of slow queries found during testing.
* **Research Question:** Can the Performance Schema help a student identify a missing index in under 5 minutes?
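A minimal sketch for topic 4: listing the "heaviest" statements via the `sys` schema from Python. It assumes MySQL 5.7+/8.x with the Performance Schema enabled, the `mysql-connector-python` package, and placeholder credentials; the exact columns of `sys.statement_analysis` may differ slightly between versions.

```python
# Minimal sketch: list the heaviest statements via the sys schema.
# Connection details are placeholders; adjust to your server.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="appuser", password="secret"  # placeholder credentials
)
cur = conn.cursor()

# sys.statement_analysis aggregates performance_schema digests (typically
# sorted by total latency); full_scan hints at a possibly missing index.
cur.execute("""
    SELECT query, exec_count, avg_latency, rows_examined, full_scan
    FROM sys.statement_analysis
    LIMIT 5
""")
for query, execs, avg_latency, rows_examined, full_scan in cur.fetchall():
    flag = "FULL SCAN" if full_scan else ""
    print(f"{avg_latency} avg | {execs} runs | {rows_examined} rows examined {flag}")
    print(f"  {query}")

cur.close()
conn.close()
```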
## 5. SQLMesh

* **Statement:** SQLMesh is a newer tool that helps teams manage their SQL code like real software (with versions and testing).
* **Argumentation:** Traditional SQL scripts are messy; SQLMesh keeps them organized and ensures that if you change one table, you don't break the whole database.
* **Evidence:** A log of "errors caught" by SQLMesh before they were deployed to the main database.
* **Methodology:** 1. Install SQLMesh. 2. Create three connected tables. 3. Change a column name and see how SQLMesh handles the update automatically.
* **Results:** A small "Data Pipeline" that updates itself safely.
* **Research Question:** How does SQLMesh simplify the process of making changes to an existing database schema?

## 6. Comparing Data Warehouse Design (Kimball vs. Inmon vs. Data Vault)

* **Statement:** There are different "blueprints" for building a big data storage system (Warehouse).
* **Argumentation:** Choosing the right blueprint (like Kimball’s "Star Schema") makes it much easier for business users to get their reports.
* **Evidence:** A comparison of how many "JOINs" are needed for a simple report in each design.
* **Methodology:** 1. Take a small sales dataset. 2. Design it once as a Star Schema (Kimball) and once as a Normalized system (Inmon). 3. Write the same query for both.
* **Results:** A visual diagram showing which design is simpler for a basic "Total Sales by Month" report.
* **Research Question:** Which design is easier to build and query?

## 7. PostgreSQL with OpenTofu (IaC)

* **Statement:** Instead of clicking buttons to install a database, you can write a "recipe" (code) that does it for you.
* **Argumentation:** Using code (OpenTofu/Terraform) means you can delete your database and rebuild it exactly the same way in 30 seconds.
* **Evidence:** A screenshot of a single command spinning up a full PostgreSQL environment.
* **Methodology:** 1. Install OpenTofu. 2. Write a `.tf` script to create a Docker container with PostgreSQL. 3. Run it, delete it, and run it again.
* **Results:** A "One-Click" setup script for a local development database.
* **Research Question:** Does using Infrastructure-as-Code reduce the setup time for a new project compared to manual installation?

## 8. Database High Availability with MariaDB MaxScale

* **Statement:** If one database server crashes, the website shouldn't go down. MaxScale acts like a traffic cop to send users to a working server.
* **Argumentation:** High Availability isn't just for huge companies; students should learn how to make systems that don't break.
* **Evidence:** A video or log showing the system staying online while you manually "kill" one database node.
* **Methodology:** 1. Set up two MariaDB servers. 2. Put MaxScale in front of them. 3. Turn off Server A and see if Server B takes over automatically.
* **Results:** A "Fault-Tolerant" database setup.
* **Research Question:** How many seconds of downtime occur during an automatic failover using MaxScale?

## 9. Driving Analytics with Columnar Storage

* **Statement:** Regular databases store data in rows. Columnar databases store it in columns, which is much faster for math (like averages).
* **Argumentation:** For big data analysis, switching to columnar storage is like switching from a bicycle to a car.
* **Evidence:** A bar chart showing the time difference for a `SUM()` query on 1 million rows in Row-storage vs. Column-storage.
* **Methodology:** 1. Load a large CSV into a normal table and a Columnar table (like MariaDB ColumnStore or Parquet). 2. Run 5 analytical queries. 3. Record the speed (see the Python sketch below).
* **Results:** A report showing when it's worth the effort to switch to columnar storage.
* **Research Question:** Is columnar storage at least 5x faster than row storage for simple aggregation queries?
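A minimal sketch of the row-vs-columnar comparison in topic 9, using SQLite as the row store and a Parquet file (read with pyarrow) as a stand-in for a columnar engine such as MariaDB ColumnStore. It assumes `pandas`, `numpy` and `pyarrow` are installed; one million synthetic rows stand in for the "large CSV" from the methodology.

```python
import sqlite3
import time
import numpy as np
import pandas as pd
import pyarrow.parquet as pq
import pyarrow.compute as pc

df = pd.DataFrame({
    "order_id": np.arange(1_000_000),
    "amount": np.random.uniform(1, 500, size=1_000_000),
})

# Row storage: a plain SQLite table.
conn = sqlite3.connect("sales_rows.db")
df.to_sql("sales", conn, index=False, if_exists="replace")

# Columnar storage: a Parquet file.
df.to_parquet("sales.parquet", index=False)

t0 = time.perf_counter()
row_sum = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
t_row = time.perf_counter() - t0

t0 = time.perf_counter()
col = pq.read_table("sales.parquet", columns=["amount"])["amount"]  # only one column is read
col_sum = pc.sum(col).as_py()
t_col = time.perf_counter() - t0

print(f"row store : {row_sum:,.2f} in {t_row:.3f}s")
print(f"columnar  : {col_sum:,.2f} in {t_col:.3f}s")
conn.close()
```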
## 10. Configuring SSL On Your MySQL Server

* **Statement:** Hackers can "listen" to your database traffic. SSL (Encryption) makes that data unreadable to them.
* **Argumentation:** Encryption in transit should be "on by default" to prevent data theft.
* **Evidence:** A Wireshark capture showing the plaintext password before SSL is enabled, and unreadable, encrypted traffic afterwards.
* **Methodology:** 1. Set up a standard MySQL server. 2. Create SSL certificates (using OpenSSL). 3. Force the connection to use them and verify with a packet sniffer.
* **Results:** A guide on how to secure a database connection for any project.
* **Research Question:** Does adding SSL encryption significantly slow down the database response time?

## 11. Integration of MFA Services in PostgreSQL

* **Statement:** Passwords are often stolen. Multi-Factor Authentication (MFA) adds a "second lock" to the database door.
* **Argumentation:** It is better to have a slightly slower login process than to have your entire customer database leaked because of a weak password.
* **Evidence:** A walkthrough of a "blocked" login attempt where the password was correct but the MFA code was missing.
* **Methodology:** 1. Set up a PostgreSQL server. 2. Configure it to use a PAM module (like Google Authenticator). 3. Test logging in via a terminal and a database GUI (like pgAdmin).
* **Results:** A working login system that requires a code from a smartphone app.
* **Research Question:** How difficult is it to integrate MFA into existing database workflows for a small team?

## 12. The Data Build Tool (dbt)

* **Statement:** dbt is like a "compiler" for SQL. It lets you write clean code and automatically turns it into complex tables and views.
* **Argumentation:** Writing SQL in dbt is better than writing long scripts because dbt automatically tracks which tables depend on each other.
* **Evidence:** A "Lineage Graph" generated by dbt showing how data moves from raw tables to final reports.
* **Methodology:** 1. Load some raw data (e.g., CSVs) into a database. 2. Write dbt "models" (SQL SELECT statements) to clean the data. 3. Run `dbt test` to check for errors.
* **Results:** A clean, documented set of tables ready for a BI tool (like PowerBI).
* **Research Question:** How does dbt help prevent "broken" reports when the underlying data format changes?

## 13. Migrating a Database using Structured Tools

* **Statement:** Moving data from one type of database (like MySQL) to another (like PostgreSQL) is hard because they "speak" different dialects.
* **Argumentation:** Using a tool like `pgloader` is safer than manually exporting and importing SQL files because the tool handles data type conversions for you.
* **Evidence:** A report showing zero data loss (matching row counts) between the old and new database.
* **Methodology:** 1. Create a MySQL database with 10,000 rows. 2. Use `pgloader` or a tool like Hevo to move it to PostgreSQL. 3. Compare the data to ensure nothing was corrupted (see the Python sketch below).
* **Results:** A "Migration Runbook" that lists the steps and common errors to avoid.
* **Research Question:** What are the three most common "data type" conflicts when moving from MySQL to PostgreSQL?
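A minimal sketch of the post-migration check in topic 13: comparing row counts per table between the MySQL source and the PostgreSQL target. It assumes `mysql-connector-python` and `psycopg2` are installed; credentials, database names and the table list are placeholders.

```python
import mysql.connector
import psycopg2

TABLES = ["customers", "orders", "order_items"]  # hypothetical table names

mysql_conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="shop"
)
pg_conn = psycopg2.connect(
    host="localhost", user="postgres", password="secret", dbname="shop"
)

def count_rows(conn, table):
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    (n,) = cur.fetchone()
    cur.close()
    return n

for table in TABLES:
    src, dst = count_rows(mysql_conn, table), count_rows(pg_conn, table)
    status = "OK" if src == dst else "MISMATCH"
    print(f"{table:15} MySQL={src:>8}  PostgreSQL={dst:>8}  {status}")

mysql_conn.close()
pg_conn.close()
```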
## 14. Providing Data Governance and Provenance with Apache NiFi

* **Statement:** In big companies, you need to know exactly where a piece of data came from. This is called "Data Provenance."
* **Argumentation:** Apache NiFi is better than a "hidden" Python script because you can visually see the path the data takes on a map.
* **Evidence:** A screenshot of the NiFi "Provenance" window showing the history of a single data packet.
* **Methodology:** 1. Install NiFi. 2. Create a flow that fetches a web file, changes a value, and saves it to a DB. 3. Use the "Data Provenance" feature to "replay" a failed data transfer.
* **Results:** A visual pipeline that tracks every change made to the data in real-time.
* **Research Question:** How does a visual tool like NiFi help a non-programmer understand the data lifecycle?

## 15. Apache Arrow for In-Memory Data Retrieval

* **Statement:** Moving data between a database and a Python script can be slow. Apache Arrow makes this nearly instant by using a clever memory format.
* **Argumentation:** For data science projects, "Zero-Copy" reading (using Arrow) is much more efficient than traditional "Row-by-Row" reading.
* **Evidence:** A graph comparing the time to load 1 million rows into Python with and without using the Arrow format.
* **Methodology:** 1. Set up a PostgreSQL database. 2. Use a standard Python connector to fetch 1M rows. 3. Use an Arrow-based connector (like ADBC) to fetch the same data. 4. Record the speed.
* **Results:** A benchmark showing that Arrow is significantly faster for data-heavy applications.
* **Research Question:** Can Apache Arrow reduce data loading times by more than 50% for typical datasets?

## 16. Presto: A Distributed SQL Query Engine

* **Statement:** Presto lets you run one SQL query that pulls data from a MySQL database and a MongoDB database at the same time.
* **Argumentation:** It is easier to use Presto to "join" data from different places than it is to move all that data into one single database first.
* **Evidence:** A single SQL query result that contains data from two completely different database types.
* **Methodology:** 1. Install Presto (or Trino). 2. Connect it to a MySQL instance and a CSV file. 3. Write a `JOIN` query that combines them.
* **Results:** A working "Virtual Database" that gives a single view of multiple data sources.
* **Research Question:** What is the performance "cost" (delay) of querying data where it lives instead of moving it to a central warehouse?

## 17. Proof of Concept for a NoSQL Database (e.g., MongoDB or Neo4j)

* **Statement:** Not all data belongs in tables. For things like social networks or chat logs, NoSQL databases are often a better fit.
* **Argumentation:** A graph database (Neo4j) is much more natural for finding "friends of friends" than a relational database with complex joins.
* **Evidence:** Compare the SQL code needed for a "friend-of-friend" search vs. the much shorter "Cypher" code in Neo4j (see the sketch below).
* **Methodology:** 1. Pick a NoSQL type (e.g., Document or Graph). 2. Load a sample dataset (like a small social network). 3. Run a query that would be hard in SQL.
* **Results:** A working prototype showing why your chosen NoSQL database is faster for that specific task.
* **Research Question:** In what specific scenarios does a NoSQL database become easier to use than a standard MySQL database?
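A minimal sketch for topic 17: the "friend-of-friend" comparison. The SQL version needs a self-join over a friendship table, while the Cypher version walks the graph directly. It assumes the `neo4j` Python driver, a local Neo4j instance containing `(:Person)-[:FRIEND]-(:Person)` data, and placeholder credentials.

```python
from neo4j import GraphDatabase

# Relational approach (for contrast): two joins over a friendships(person_id, friend_id) table.
SQL_FRIEND_OF_FRIEND = """
SELECT DISTINCT f2.friend_id
FROM friendships f1
JOIN friendships f2 ON f1.friend_id = f2.person_id
WHERE f1.person_id = 42 AND f2.friend_id <> 42;
"""

# Graph approach: one pattern match in Cypher.
CYPHER_FRIEND_OF_FRIEND = """
MATCH (p:Person {name: $name})-[:FRIEND]-()-[:FRIEND]-(fof)
WHERE fof <> p
RETURN DISTINCT fof.name AS friend_of_friend
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    for record in session.run(CYPHER_FRIEND_OF_FRIEND, name="Alice"):
        print(record["friend_of_friend"])
driver.close()
```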
## 18. Setting up an Open-Source Database Cluster in Containers

* **Statement:** Instead of one big server, we can use several small "containers" (Docker) to create a database cluster that is easier to manage.
* **Argumentation:** Containerized databases are better for learning because you can "spin them up" or "tear them down" in seconds without messing up your computer.
* **Evidence:** A screenshot of Docker Desktop showing 3 database nodes running simultaneously.
* **Methodology:** 1. Write a `docker-compose.yml` file for a 3-node PostgreSQL or MariaDB cluster. 2. Start the cluster. 3. Shut down one node and show that the others still work.
* **Results:** A repeatable script that builds a complete database lab on any computer.
* **Research Question:** How much faster is it to deploy a 3-node cluster using Docker compared to a manual installation?

## 19. Best Coding Practices for Transactional Database Programming

* **Statement:** In distributed databases (like CockroachDB), things can go wrong (like network lag). Your code needs to be "smart" enough to retry a transaction if it fails.
* **Argumentation:** "Good" code expects errors and knows how to handle them, whereas "Bad" code just crashes when the database is busy.
* **Evidence:** A log showing a Python script successfully retrying a "failed" transaction 3 times until it works.
* **Methodology:** 1. Set up a CockroachDB instance. 2. Write a Python script to transfer "money" between accounts. 3. Intentionally cause a "collision" (concurrency error) and show the retry logic in action (see the Python sketch below). Refer to "Enterprise Application Development with CockroachDB" (Parts [1](https://university.cockroachlabs.com/courses/course-v1:crl+client-side-txn-handling+self-paced/about), [2](https://friends.cockroachlabs.com/MzUwLVFJTi04MjcAAAGQ20visMBZzU37yvzG1xxfZpr5O6hQDVzJm_yeenIFVQqpid9W1XqS7gk5OXRYBg4z-sqizhM=) and [3](https://friends.cockroachlabs.com/MzUwLVFJTi04MjcAAAGQ20visBz6LCJY3OW8cUp7x9iHDSZQ5xDAkfTl-yQd5C1GdCAjjnv0HRAcHwQ4AC8EeJzc988=)).
* **Results:** A template for "Resilient" database code that doesn't lose data during errors.
* **Research Question:** How often might transactions fail in a distributed environment compared to a single-server setup, and for what reasons?
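A minimal sketch of the retry pattern from topic 19: a money transfer that is retried when CockroachDB reports a serialization conflict (SQLSTATE 40001). It assumes `psycopg2`, a local CockroachDB node, and a placeholder `accounts` table and connection string.

```python
import time
import psycopg2
from psycopg2 import errors

conn = psycopg2.connect("postgresql://root@localhost:26257/bank?sslmode=disable")

def transfer(from_id, to_id, amount, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            with conn:                      # commits on success, rolls back on error
                with conn.cursor() as cur:
                    cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                                (amount, from_id))
                    cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                                (amount, to_id))
            return
        except errors.SerializationFailure:
            conn.rollback()                 # clear the aborted transaction before retrying
            print(f"conflict, retrying (attempt {attempt})")
            time.sleep(0.1 * 2 ** attempt)  # simple exponential backoff
    raise RuntimeError("transfer failed after retries")

transfer(1, 2, 25.00)
conn.close()
```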
## 20. Recursive Common Table Expressions (CTEs)

* **Statement:** Recursive CTEs are a special SQL trick that lets a query "loop" over itself, which is perfect for hierarchical data like family trees or folder structures.
* **Argumentation:** Using a Recursive CTE is much more elegant than writing a complex Python loop to navigate a tree structure.
* **Evidence:** A "Tree View" output generated by a single SQL query.
* **Methodology:** 1. Create a table of "Employees" and their "Managers." 2. Write a Recursive CTE to show the entire management chain for any employee. 3. Compare the speed of this query vs. multiple separate queries (see the sketch after topic 23).
* **Results:** A "Cheat Sheet" of recursive SQL patterns for common business hierarchies.
* **Research Question:** Does using a Recursive CTE reduce the amount of code a developer has to write by more than 50% for tree-traversal tasks?

## 21. Load Testing / Performance Benchmarking with `mysqlslap`

* **Statement:** You can't know if a database is "fast" until you push it to its limits. `mysqlslap` is a tool that simulates thousands of users hitting the database at once.
* **Argumentation:** It is better to find out your database crashes under pressure *now* during a test than *later* when a real application goes live.
* **Evidence:** A graph showing how response times increase as you go from 10 users to 100 users.
* **Methodology:** 1. Set up a MySQL server. 2. Use `mysqlslap` to run a "mixed" load of reads and writes. 3. Change a setting (like the `innodb_buffer_pool_size`) and run the test again to see if it helped.
* **Results:** A "Stress Test" report that identifies exactly how many users the server can handle before it slows down.
* **Research Question:** At what number of concurrent users does the database response time become "unacceptable" (e.g., > 1 second)?

## 22. Overview of PostgreSQL Architecture & Deployment

* **Statement:** PostgreSQL can be installed on your laptop, on a server, or in the cloud. Each choice changes how the database performs and how much it costs.
* **Argumentation:** For a student project, a local Docker setup is best, but for a real business, a cloud-managed service is safer because it handles backups automatically.
* **Evidence:** A comparison table of "Manual Management" vs. "Cloud Managed" tasks (Backups, Patches, Scaling).
* **Methodology:** 1. Install PostgreSQL locally. 2. Set up a "Free Tier" cloud instance (e.g., on Aiven, Supabase or Neon). 3. Compare the configuration options available in both.
* **Results:** A guide explaining which deployment method is best for different types of startup projects.
* **Research Question:** What are the top three administrative tasks that are automated when moving from local PostgreSQL to a cloud-managed service?

## 23. LlamaIndex Framework

* **Statement:** LlamaIndex is a framework that connects an LLM to your own private data (like your PDF notes or a database).
* **Argumentation:** Instead of uploading all your data to a (public) LLM, LlamaIndex "indexes" it so the AI only looks at the parts it needs to answer a specific question.
* **Evidence:** A comparison of an AI's answer *without* LlamaIndex (wrong/generic) vs. *with* LlamaIndex (correct/specific).
* **Methodology:** 1. Take a collection of text files. 2. Use LlamaIndex to create a searchable index. 3. Build a simple "Chat with my Documents" interface using Python.
* **Results:** A functional AI assistant that can answer questions about a specific dataset it wasn't originally trained on.
* **Research Question:** How much does "indexing" improve the accuracy of an AI when answering questions about a niche topic?
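A minimal sketch for topic 20 above: a recursive CTE that walks an employee/manager hierarchy. SQLite is used here only to keep the example self-contained; the same `WITH RECURSIVE` syntax works in MySQL 8.0+ and PostgreSQL. Table and employee names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ada',   NULL),   -- top of the hierarchy
        (2, 'Grace', 1),
        (3, 'Linus', 2),
        (4, 'Margo', 3);
""")

# Walk upwards from employee 4 to the top of the management chain.
rows = conn.execute("""
    WITH RECURSIVE chain(id, name, manager_id, depth) AS (
        SELECT id, name, manager_id, 0 FROM employees WHERE id = 4
        UNION ALL
        SELECT e.id, e.name, e.manager_id, c.depth + 1
        FROM employees e
        JOIN chain c ON e.id = c.manager_id
    )
    SELECT name, depth FROM chain ORDER BY depth
""").fetchall()

for name, depth in rows:
    print("  " * depth + name)   # simple indented "tree view"
```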
## 24. LangChain Framework

* **Statement:** LangChain lets you build "chains" of actions for an AI, such as: "First, look at the database. Second, summarize the results. Third, email the summary."
* **Argumentation:** LangChain makes AI more useful by allowing it to *act* on data, rather than just *talking* about it.
* **Evidence:** A step-by-step log showing the "Reasoning" process of a LangChain agent as it solves a problem.
* **Methodology:** 1. Connect LangChain to a small SQL database. 2. Create a "SQL Agent" that translates human questions (e.g., "Who bought the most?") into SQL code and runs it.
* **Results:** A Python application where a user can "talk" to a database without knowing SQL.
* **Research Question:** How well does a LangChain agent handle "ambiguous" questions compared to a human writing SQL?

## 25. Writing SQL Code with AI Assistance: An Overview

* **Statement:** AI tools (like GitHub Copilot or ChatGPT) are changing how developers write SQL code.
* **Argumentation:** AI assistance is a great "starter," but it often suggests "old" or inefficient ways of doing things, so the developer must still be the expert.
* **Evidence:** A "Code Review" of 5 AI-generated queries, highlighting where the AI was efficient and where it was wrong.
* **Methodology:** 1. Solve 5 medium-difficulty SQL problems manually. 2. Solve the same 5 problems using an AI assistant. 3. Compare the time taken and the quality of the code.
* **Results:** A "Survival Guide" for students on how to use AI for SQL without making common mistakes.
* **Research Question:** Does AI-assisted SQL writing usually lead to more or fewer syntax errors?

## 26. Databricks: An Overview and Assessment

* **Statement:** Databricks is a platform that combines the "fast math" of a Data Warehouse with the "storage" of a Data Lake.
* **Argumentation:** For big data, Databricks is better than a traditional database because it can process massive amounts of data in parallel using a technology called Spark.
* **Evidence:** A benchmark showing how long it takes to count words in a massive text file using standard Python vs. Databricks (Spark) (see the sketch after topic 27).
* **Methodology:** 1. Sign up for a Databricks Community Edition (Free). 2. Upload a large dataset. 3. Use "Notebooks" to run a simple data analysis.
* **Results:** A report on the "Lakehouse" concept and why it's becoming popular in big companies.
* **Research Question:** How does the "Notebook" style of Databricks change the way data engineers collaborate?

## 27. Introducing GQL (Graph Query Language)

* **Statement:** GQL is a brand-new standard for querying graph databases, making it as official as SQL is for relational databases.
* **Argumentation:** Now that there is a standard language (GQL), you can learn it once and use it across many different graph database brands.
* **Evidence:** A side-by-side comparison of the same query written in SQL (many joins) vs. GQL (clear path).
* **Methodology:** 1. Map out a small network (e.g., airports and flights). 2. Write queries to find "the shortest path between two cities" using a GQL-compatible tool.
* **Results:** A demonstration of why "Pathfinding" is easier in a graph language than in SQL.
* **Research Question:** Is GQL significantly easier for beginners to read than the older "Cypher" language?
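A minimal sketch for topic 26 above: the word-count comparison on Spark. Inside a Databricks notebook a `SparkSession` named `spark` already exists; running it locally assumes the `pyspark` package is installed. The file path is a placeholder.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("big_text_file.txt")            # one row per line, column "value"
words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
counts = (words.where(F.col("word") != "")
               .groupBy("word")
               .count()
               .orderBy(F.col("count").desc()))

counts.show(10)        # top 10 most frequent words
spark.stop()
```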
## 28. The Parquet Columnar Data Format

* **Statement:** Parquet is a file format (like a better version of CSV) that is optimized for data science and machine learning.
* **Argumentation:** Storing data in Parquet saves disk space (it's compressed) and is much faster to read into Python than a standard CSV file.
* **Evidence:** A table showing file size: 100MB CSV vs. 20MB Parquet for the exact same data.
* **Methodology:** 1. Take a large CSV file. 2. Convert it to Parquet using Python (Pandas). 3. Measure the file size and the "Time to Load" for both (see the sketch after topic 31).
* **Results:** A clear proof that Parquet is the superior choice for "Big Data" storage.
* **Research Question:** How much storage space (in %) can a company save by switching from CSV to Parquet?

## 29. DuckDB - A Lightweight In-Process Database

* **Statement:** DuckDB is a database that lives inside your Python script or your browser. You don't have to install a "Server."
* **Argumentation:** For a student analyzing a single CSV or Parquet file, DuckDB is much faster and easier to set up than a full PostgreSQL server.
* **Evidence:** A comparison of the "Setup Time" and "Query Speed" for a 1GB file in DuckDB vs. MySQL.
* **Methodology:** 1. Install DuckDB (it’s just one command). 2. Run a "heavy" analytical query on a Parquet file. 3. Document how easy it was to set up.
* **Results:** A working example of "Serverless" data analysis on a local machine.
* **Research Question:** Can DuckDB replace a traditional database for small-to-medium data science projects?

## 30. Apache Pinot OLAP Datastore

* **Statement:** Apache Pinot is built to answer questions instantly, even if millions of new rows are being added every second.
* **Argumentation:** Standard databases get slow when they are huge; Pinot stays fast because it uses specialized "indexes" designed for real-time data.
* **Evidence:** A screenshot showing a dashboard that updates in "Real-Time" (sub-second) as data flows in.
* **Methodology:** 1. Set up a local Pinot instance (using Docker). 2. Ingest a "Batch" of data. 3. Run queries and measure the response time (in milliseconds).
* **Results:** A PoC showing a "Real-Time Leaderboard" for a game or a stock market.
* **Research Question:** How does Pinot achieve sub-100ms response times on datasets with millions of rows?

## 31. Database in the Browser using WebAssembly (WASM)

* **Statement:** You can now run a full database (like SQLite) inside a website, so the data never has to leave the user's computer.
* **Argumentation:** This makes websites faster and more private, as the user's data isn't sent to a central server every time they click a button.
* **Evidence:** A demonstration of a web app that works perfectly even when you "Turn Off" the internet.
* **Methodology:** 1. Create a simple HTML page. 2. Include the SQLite-WASM library. 3. Write a script that lets the user save and search their own "Notes" in a local database.
* **Results:** A "Local-First" web application that is fast and private.
* **Research Question:** What is the maximum amount of data you can store in a browser-based database before the browser slows down?
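A minimal sketch for topic 28 above: convert a CSV to Parquet and compare file size and load time. It assumes `pandas` and `pyarrow` are installed; `sales.csv` is a placeholder for the large CSV from the methodology.

```python
import os
import time
import pandas as pd

df = pd.read_csv("sales.csv")
df.to_parquet("sales.parquet", index=False)   # compressed columnar copy of the same data

csv_mb = os.path.getsize("sales.csv") / 1e6
parquet_mb = os.path.getsize("sales.parquet") / 1e6

t0 = time.perf_counter()
pd.read_csv("sales.csv")
t_csv = time.perf_counter() - t0

t0 = time.perf_counter()
pd.read_parquet("sales.parquet")
t_parquet = time.perf_counter() - t0

print(f"CSV     : {csv_mb:7.1f} MB, loaded in {t_csv:.2f}s")
print(f"Parquet : {parquet_mb:7.1f} MB, loaded in {t_parquet:.2f}s")
```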
## 32. Apache Kafka: A Data Analytics Streaming Platform

* **Statement:** Kafka is like a "Post Office" for data. It sends messages from one system to another in real-time.
* **Argumentation:** For modern apps (like Uber or Amazon), data needs to move *instantly* between services; Kafka is the standard for doing this at scale.
* **Evidence:** A log showing data being "Produced" by one script and "Consumed" by another in under 10 milliseconds.
* **Methodology:** 1. Set up a Kafka broker (Docker). 2. Write a Python "Producer" that sends fake temperature data. 3. Write a "Consumer" that alerts you if the temperature is too high (see the sketch after topic 35).
* **Results:** A working "Real-Time Alert" system.
* **Research Question:** How does Kafka ensure that no data is lost even if the "Consumer" script crashes for a few minutes?

## 33. KDB-X: A High-Performance, In-Memory Database

* **Statement:** KDB-X is used by big banks to process billions of stock trades. It is incredibly fast because it keeps all its data in the computer's RAM.
* **Argumentation:** When you need to do math on millions of numbers in a split second, an in-memory database like KDB-X is the only option.
* **Evidence:** A comparison of "Calculation Speed" for a moving average in KDB-X vs. a standard SQL database.
* **Methodology:** 1. Download the "Personal" version of KDB+. 2. Load a small dataset of numbers. 3. Use the `q` language to run a simple calculation and time it.
* **Results:** A report on why "In-Memory" is the secret to high-frequency trading.
* **Research Question:** Why is the `q` language considered so powerful for time-series data compared to SQL?

## 34. Replicating a MySQL Database

* **Statement:** Replication means having a "Clone" of your database. If the main one breaks, the clone can take over.
* **Argumentation:** Every student should know how to set up a "Source" and a "Replica" to prevent data loss in a real production environment.
* **Evidence:** A log showing that a new row added to "Server A" appeared on "Server B" automatically within one second.
* **Methodology:** 1. Set up two MySQL servers. 2. Configure "Master-Slave" (Source-Replica) replication. 3. Add data to one and verify it appears on the other.
* **Results:** A "Disaster Recovery" plan showing how to switch to the replica if the source fails.
* **Research Question:** What is "Replication Lag" and what causes it in a standard MySQL setup?

## 35. Using Orange Data Mining for Data Analytics

* **Statement:** Orange is a "Drag-and-Drop" tool for data science. You connect icons with lines to analyze your data.
* **Argumentation:** Orange is great for Data Science students & professionals because it lets you *visualize* the math of machine learning without getting stuck on Python syntax errors.
* **Evidence:** A screenshot of a "Decision Tree" created in Orange that explains why a certain classification was made.
* **Methodology:** 1. Load a dataset (like the "Titanic" passenger list). 2. Create a workflow to predict "Who survived?". 3. Use the "Scatter Plot" and "Box Plot" widgets to explore the data.
* **Results:** A complete data analysis project done entirely without writing a single line of code.
* **Research Question:** Can a visual tool like Orange produce a prediction model that is as accurate as a manually coded one?
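A minimal sketch for topic 32 above: a temperature producer and an alerting consumer. It assumes a Kafka broker on `localhost:9092` (e.g. started via Docker) and the `kafka-python` package; the topic name and alert threshold are arbitrary choices.

```python
import json
import random
import time
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "temperatures"

def produce(n=100):
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for _ in range(n):
        producer.send(TOPIC, {"sensor": "room-1", "celsius": round(random.uniform(18, 35), 1)})
        time.sleep(0.1)
    producer.flush()

def consume():
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        group_id="alerting",   # committed offsets let a restarted consumer resume where it stopped
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        reading = message.value
        if reading["celsius"] > 30:
            print(f"ALERT: {reading['sensor']} at {reading['celsius']} °C")

# Run produce() in one terminal and consume() in another.
```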
# Archive (already selected topics)

## WWI23SCB

### ~~Data Preparation, Analysis and Visualization with Jupyter Notebooks~~

- **Understanding of the subject**: Jupyter Notebook is an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text. It is crucial for data engineering as it facilitates interactive data analysis and visualization, making it easier to understand complex datasets.
- **Implementation Details**: Set up a Jupyter environment and import the necessary libraries for data analysis, preparation, processing and visualization (e.g. NumPy, Pandas, Matplotlib, scikit-learn, Seaborn, SciPy, TensorFlow and PyTorch). Challenges may include managing dependencies and ensuring compatibility with different data sources.
- **Real Engineering Use Case (PoC)**: Analyze a dataset, such as a Black Friday Sales dataset from Kaggle, to identify trends and patterns. Illustrate data preparation and cleansing. Use visualizations like line charts and histograms to present findings. Reference: [Data Analysis and Visualization with Jupyter Notebook](https://www.geeksforgeeks.org/data-analysis-and-visualization-with-jupyter-notebook/?ref=ml_lbp)
- **Evaluation and Analysis**: Evaluate the success by performing an exploratory data analysis (EDA) to understand data distributions and identify patterns, by preparing data for cleaning and processing, and by demonstrating the clarity and insightfulness of the visualizations.

---

### ~~Developing Applications with MariaDB & Containers – via Docker~~

- **Understanding of the subject**: Using Docker to containerize MariaDB applications allows for consistent development environments and easy deployment. It is important for data engineering as it enhances scalability and portability.
- **Implementation Details**: Create a Docker container for a MariaDB instance and develop a sample application. Challenges include managing container orchestration and networking.
- **Real Engineering Use Case (PoC)**: Develop a microservice / call interface that interacts with a MariaDB database, demonstrating CRUD operations within a containerized environment.
- **Evaluation and Analysis**: Evaluate by the ease of deployment and scalability. Metrics include deployment time and resource efficiency.

---

### ~~Streamlit - A Library for Prototyping Interactive Data-driven Web Apps with Only Python~~

- **Understanding of the subject**: Streamlit is a Python library that allows for the rapid development of interactive web applications. It is important for data engineering as it enables quick prototyping and sharing of data insights.
- **Implementation Details**: Develop a Streamlit app that visualizes a dataset and provides interactive features (see the sketch after this group of entries). Challenges include managing app performance and user interface design.
- **Real Engineering Use Case (PoC)**: Create an interactive dashboard that allows users to explore and visualize COVID-19 data trends.
- **Evaluation and Analysis**: Evaluate by the interactivity and usability of the app. Metrics include user engagement and feedback.

---

### ~~Building a Business Intelligence (BI) Dashboard with Apache Superset~~

- **Understanding of the subject**: Apache Superset is an open-source BI tool that allows for the creation of interactive dashboards. It is important for data engineering as it facilitates data-driven decision-making.
- **Implementation Details**: Set up Apache Superset and connect it to a data source. Challenges include designing effective visualizations and ensuring data accuracy.
- **Real Engineering Use Case (PoC)**: Build a BI dashboard that visualizes key performance indicators (KPIs) for a retail business.
- **Evaluation and Analysis**: Success is measured by the clarity and usefulness of the dashboard. Analyze user feedback and decision-making impact.
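A minimal sketch of the kind of app described in the Streamlit entry above: an interactive chart in a few lines. It assumes `streamlit`, `pandas` and a local `covid_cases.csv` file with `date`, `country` and `new_cases` columns (all placeholders); run it with `streamlit run app.py`.

```python
import pandas as pd
import streamlit as st

st.title("COVID-19 case trends")

df = pd.read_csv("covid_cases.csv", parse_dates=["date"])
country = st.selectbox("Country", sorted(df["country"].unique()))

subset = df[df["country"] == country].set_index("date")
st.line_chart(subset["new_cases"])
st.metric("Total cases", int(subset["new_cases"].sum()))
```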
---

### ~~An Overview of the Python Data Engineering Ecosystem with Some Illustrative Use Cases~~

- **Understanding of the subject**: The Python data engineering ecosystem includes libraries and tools for data processing, analysis, and visualization. It is important for data engineering as it provides a comprehensive toolkit for handling data workflows.
- **Implementation Details**: Explore key Python libraries like Pandas, NumPy, and scikit-learn. Challenges include integrating different tools and managing dependencies.
- **Real Engineering Use Case (PoC)**: Demonstrate a data pipeline that cleans, analyzes, and visualizes a dataset using Python libraries.
- **Evaluation and Analysis**: Evaluate by the efficiency and effectiveness of the data pipeline. Metrics include processing time and data quality.

---

### ~~Performanceprobleme in relationalen Datenbanksystemen~~

- **Understanding of Database Performance**: How can you identify improvement opportunities (including indexes, queries, clusters, schemas)?
- **Implementation Details**: Choose e.g. MariaDB, PostgreSQL or CockroachDB as a basis.
- **Real Engineering Use Case (PoC)**: Develop and implement strategies to optimize the performance of the database system. What open-source tools are available? How do you qualify them (based on which criteria)? Provide an exemplary utility analysis (Nutzwertanalyse, NWA).
- **Real Engineering Use Case**: Illustrate your insights with real-world experiences.

---

### ~~Orange Data Mining~~

- **Understanding of the subject**: Orange is an open-source data visualization and analysis tool for data mining. It is important for data engineering as it provides an intuitive interface for exploring data patterns.
- **Implementation Details**: Set up Orange and import a dataset for analysis. Challenges include managing data preprocessing and ensuring visualization clarity.
- **Real Engineering Use Case (PoC)**: Analyze a customer dataset to identify segments and visualize the results using Orange.
- **Evaluation and Analysis**: Success is measured by the insightfulness of visualizations and analysis. Analyze user feedback and data-driven decisions.

---

### ~~Star Schema Data Modeling for Data Warehousing~~

- **Understanding of the subject**: The star schema is a data modeling technique used in data warehousing to optimize query performance. It is important for data engineering as it simplifies complex queries and enhances data retrieval speed.
- **Implementation Details**: Design a star schema for a sample data warehouse and implement it (see the sketch below). Challenges include managing data redundancy and ensuring data integrity.
- **Real Engineering Use Case (PoC)**: Develop a data warehouse for a retail business using a star schema to optimize sales reporting.
- **Evaluation and Analysis**: Evaluate by the speed and efficiency of queries. Metrics include query execution time and data accuracy.
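A minimal sketch related to the star-schema entry above: a tiny fact/dimension model and the "total sales by month" query it is optimized for. SQLite is used only to keep the example self-contained; table names and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, quantity INTEGER, revenue REAL);

    INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO fact_sales VALUES (20240101, 1, 3, 2997.0), (20240101, 2, 1, 349.0), (20240201, 1, 2, 1998.0);
""")

# One join between the fact table and a single dimension answers the report.
for year, month, revenue in conn.execute("""
    SELECT d.year, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.year, d.month
    ORDER BY d.year, d.month
"""):
    print(f"{year}-{month:02d}: {revenue:,.2f}")
```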
---

### ~~Evaluation und praktische Anwendungsmöglichkeiten von SurrealDB als Multi-Modell-Datenbank~~

- **Understanding of the subject**: This term paper examines SurrealDB as a modern multi-model database that aims to combine the advantages of relational, document-based and graph-based databases. The goal is to gain a deeper understanding of SurrealDB's architecture and functionality and to evaluate its potential areas of application. The focus is primarily on SurrealQL as the query language as well as on scenarios for using this database.
- **Implementation Details**: As part of this investigation, SurrealDB is installed and tested in order to understand its different operating modes. The evaluation covers the use of the REST and WebSocket APIs as well as testing schema-based and schemaless data models. In addition, security mechanisms such as authentication and role-based access control are analyzed. The implementation is documented with practical examples to make the usage options tangible.
- **Real Engineering Use Case (PoC)**: A proof of concept (PoC) is created to identify a possible use case for SurrealDB. It examines how well SurrealDB's features are suited to practical use. The different database models/types on offer are to be used as effectively as possible within one application scenario. On this basis it is tested and assessed whether SurrealDB is a suitable alternative to existing solutions. Possible use cases are discussed based on the test results.
- **Evaluation and Analysis**: The term paper aims to assess the performance, scalability and flexibility of SurrealDB. Benchmarks are run to measure performance across different data models and query types. SurrealDB is also compared with other database solutions in order to work out its advantages and disadvantages. Finally, it is discussed for which use cases SurrealDB is particularly well suited and in which scenarios other database solutions would be preferable.

---

### ~~Blockchain als dezentrale Datenbank: Chancen & Grenzen~~

- **Understanding of the subject**: Is a blockchain comparable to a classical database? Where does it make sense to use a blockchain for data storage, and where does it not?
- **Implementation Details**: Comparison of Ethereum, Bitcoin & Hyperledger as databases; storing & retrieving data in a smart-contract-based structure; performance analysis of queries & transaction speed.
- **Real Engineering Use Case (PoC)**: Investigate whether a blockchain could replace a conventional SQL database in a company.
- **Evaluation and Analysis**: Comparison with centralized databases (PostgreSQL, MySQL); analysis of storage costs, scalability & performance; security aspects: when is blockchain-based data storage sensible, and when is it not?

---

### ~~Echtzeit-Datenbanken für Krypto-Trading & Orderbuch-Management~~

- **Understanding of the subject**: Why are low latencies & high write speeds crucial for crypto trading? Which databases are suitable for live market data & order-book storage?
- **Implementation Details**: Set up an in-memory database (Redis, TimescaleDB) for order-book data; store trades, spreads & liquidity data in real time; compare with traditional SQL & NoSQL databases for trading applications (see the sketch below).
- **Real Engineering Use Case (PoC)**: Develop a fast database solution for a crypto-trading platform with a live order book.
- **Evaluation and Analysis**: Latency comparison: SQL vs. NoSQL for high-frequency trading; data consistency & scalability: which databases can handle millions of trades per second? Security risks: how do you prevent DDoS attacks & manipulation?
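A minimal sketch related to the crypto-trading entry above: an order book kept in Redis sorted sets, where the score is the price. It assumes the `redis` (redis-py) package and a local Redis server; key names, order IDs and prices are arbitrary.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Insert a few bids and asks (member = order id, score = price).
r.zadd("orderbook:BTCUSD:bids", {"order-101": 64980.5, "order-102": 64975.0, "order-103": 64990.0})
r.zadd("orderbook:BTCUSD:asks", {"order-201": 65010.0, "order-202": 65005.5})

# Best bid = highest price, best ask = lowest price; both lookups stay cheap as the book grows.
best_bid = r.zrevrange("orderbook:BTCUSD:bids", 0, 0, withscores=True)[0]
best_ask = r.zrange("orderbook:BTCUSD:asks", 0, 0, withscores=True)[0]

print(f"best bid {best_bid[1]:.2f} ({best_bid[0]}), best ask {best_ask[1]:.2f} ({best_ask[0]})")
print("spread:", round(best_ask[1] - best_bid[1], 2))
```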
---

### ~~Sicherheitslücken in modernen Datenbanken: Vergleich von SQL-Injection in relationalen und NoSQL-Datenbanken sowie Schutzmaßnahmen~~

- **Understanding of the subject**:
- **Implementation Details**:
- **Real Engineering Use Case (PoC)**:
- **Evaluation and Analysis**:

---

### ~~Strukturelle Potenziale und Grenzen von SharePoint und OneDrive im Kontext klassischer Datenbankkonzepte~~

- **Understanding of the subject**:
- **Implementation Details**:
- **Real Engineering Use Case (PoC)**:
- **Evaluation and Analysis**:

---

### ~~Entwicklung eines Retrieval-Augmented-Generation Systems zur Bereitstellung von Wissen aus Schulungsunterlagen~~

- **Understanding of the subject**:
- **Implementation Details**:
- **Real Engineering Use Case (PoC)**:
- **Evaluation and Analysis**:

## WWI22SCB

### New Topic

- **Understanding of the subject**:
- **Implementation Details**:
- **Real Engineering Use Case (PoC)**:
- **Evaluation and Analysis**: