# Project for HackMD Data Engineer 2025

## 1. Project Background and Vision

In today's rapidly evolving era of digital collaboration, knowledge management has become a core competitive advantage for enterprises and teams. HackMD is committed to creating a platform that is more than just a note-taking tool: an intelligent knowledge ecosystem.

Our mission is to build a system that can intelligently understand, organize, and recommend knowledge. Imagine a platform that not only stores documents but actively helps users discover relevant knowledge, fostering team collaboration and knowledge dissemination.

### Our Challenge

Traditional knowledge management systems are often static and passive. We aim to create an actively learning, continuously evolving intelligent system that can:

- Understand users' knowledge needs
- Recommend relevant content in real time
- Facilitate the natural flow and connection of knowledge

## 2. Data-Driven Intelligent Platform

- **Source**: Kaggle (Cornell-University/arxiv)
- **Link**: [Scholarly Articles Dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv)

### Dataset Selection: Scholarly Articles Dataset

We've chosen the scholarly articles dataset from Kaggle: not just a data source, but a microcosm of knowledge collaboration.

#### Why This Dataset?

Academic paper datasets offer rich metadata, including titles, abstracts, authors, and research domains. They closely mirror the complexity of knowledge collaboration:

- Collaboration patterns across different authors
- Interdisciplinary knowledge intersections
- Evolution of research topics

## 3. System Architecture

### 3.1 Technical Stack

- **Cloud Infrastructure**: AWS
- **Data Collection**: AWS Lambda, AWS CloudWatch, Node.js
- **Data Processing**: Python, Pandas, NumPy
- **Storage**: OpenSearch, DynamoDB, S3
- **Machine Learning**: Scikit-learn, AWS SageMaker
- **Monitoring**: Prometheus, Grafana, AWS CloudWatch

### 3.2 System Components

1. Multi-Source Data Collection Layer
2. Distributed Data Processing Layer
3. Intelligent Recommendation System

## 4. Detailed Requirements

### 4.1 Data Collection Layer

- Integrate data from diverse sources (APIs, databases, logs, event streams)
- Support flexible, configurable, reusable ingestion strategies
- Ensure data integrity with robust error handling and retry mechanisms

### 4.2 Distributed Data Processing

- Design and implement fault-tolerant, parallelized pipelines
- Ensure traceability for large-scale data transformations

### 4.3 Observability & Monitoring for Data Systems

- Track real-time system performance and error metrics
- Enable real-time alerting and automated recovery mechanisms
- Ensure comprehensive observability of the pipeline

### 4.4 Data Quality, Modeling & Storage

- Design and maintain dimensional data models
- Implement and maintain data warehouses or data lakes
- Monitor data quality and track data lineage

## 5. Task Description

You are a data engineer working at an academic institution. You collaborate closely with data analysts and data scientists, focusing on arXiv metadata to build data products that support both academic research and institutional decision-making.

Here are some ongoing and upcoming projects your team supports:

1. **Management Dashboard**
   Analysts maintain weekly-refreshed dashboards to monitor academic trends, including:
   - Average number of updates per paper in each discipline
   - Median time from arXiv submission to journal/conference publication in each discipline, by week
   - Cumulative number of submissions per institution or author
   - Academic co-authorship networks
2. **Institutional Rankings**
   The team produces subject-wise rankings of academic institutions based on both the quantity and quality of their publications. This involves enriching arXiv metadata with external datasets (e.g., journal impact factors, citation counts, CORE rankings).
3. **Paper Recommendation System**
   A planned product will recommend papers based on users' browsing behavior and paper metadata. This will involve NLP techniques such as abstract/title vectorization and other feature extraction methods.

As a data engineer, your mission is to build a solid, scalable, and observable data infrastructure that supports these use cases while maintaining flexibility for future applications.

### 5.1 Task Goal

**Objective:** Design and implement a reusable, configurable pipeline to collect and process arXiv metadata. You may use any tools listed in section 3.1. The pipeline should:

1. Store arXiv data in a format that is queryable and accessible for downstream users
2. Include built-in mechanisms for quality monitoring, anomaly detection, and alerting
3. Follow a logical, intuitive, and scalable database schema

If you believe the tools in section 3.1 are insufficient for your approach, you are welcome to use alternatives that are both cost-effective and well-suited to your solution. Please justify your choice in the written portion of your submission.

### 5.2 Submission Guidelines

Your submission should include:

- A **GitHub repository** with your code and documentation
  - We value clean, well-structured code
  - Please provide thorough documentation describing your design decisions, system structure, and how to run the pipeline
- A **README or explanatory document** that addresses the following:
  - Performance and scalability considerations
  - Rationale behind your architectural and processing design, especially in relation to future analytical needs
  - Any challenges or irregularities you encountered when handling arXiv metadata
  - Based on your experience, what could go wrong with this type of data? How did you safeguard data quality?
  - Measures taken to ensure fault tolerance and recovery in the pipeline
  - Any assumptions made throughout your implementation
- A **prototype or proposal** for a monitoring dashboard to track the pipeline's health and performance

## 6. Evaluation Dimensions: Holistic Assessment Beyond Technology

### Technical Capabilities

We focus not just on code, but on:

- Elegance of system architecture
- Innovation in problem-solving
- Code readability and maintainability

### Timeline

1. **Analysis (7 calendar days):** You have slightly more than one week (until the end of the 7th day) to finish the challenge. For example, if your first interview with us is on Monday at 1pm, we expect to receive your analysis and presentation by the end of day (11:59:59pm) of the following Monday.
2. **Review (7 calendar days):** We will review your work within 7 days of receiving it and let you know if you're invited to the presentation. Following the previous example, you can expect to hear from us no later than Friday should you submit on Monday.
3. **Presentation (1 hour):** If you're invited to the presentation, there will be a 55-minute session for your presentation and Q&A with a panel of interviewers. We recommend preparing a 20-minute presentation and being ready to **explain your thought process** and **defend your recommendations** during the session. You are welcome and encouraged to present in English if you feel comfortable, but this is not required. Clear communication should be prioritized.
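### Appendix: Illustrative Retry Sketch

Section 4.1 asks for robust error handling and retry mechanisms in the collection layer. As a non-prescriptive illustration of what "retry with backoff" can mean in practice, here is a minimal Python sketch; the function name `with_retries` and its parameters are hypothetical, and production ingestion code would typically also emit each attempt to CloudWatch or another metrics sink rather than failing silently between sleeps.

```python
import random
import time


def with_retries(fn, max_attempts=4, base_delay=1.0,
                 retriable=(ConnectionError, TimeoutError)):
    """Call fn(); on a transient error, retry with exponential backoff and jitter.

    Non-retriable exceptions propagate immediately; retriable ones are
    re-raised only after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts:
                raise  # exhausted: surface the error to the caller / alerting layer
            # Full-jitter backoff: sleep somewhere in [0, base_delay * 2^(attempt - 1)]
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

The jitter spreads out retries from many concurrent workers (e.g., parallel Lambda invocations) so a flaky upstream API is not hammered in lockstep; your actual submission may of course take a different approach.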