# 2023H1 ITESO MAF DE/A

## Data Engineering Course

**Professor**: Rodrigo H. Mota

**Quick Links**:
* Zoom Session: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
* Virtual Board: [here](https://miro.com/app/board/uXjVPvZCqeE=/?share_link_id=437079415297)
* Video Lecture Recordings: [here](https://www.youtube.com/playlist?list=PLIbTa97DHk7j4dn_0uggKNx589pG7T8x1)
* Data Engineering Toolbox: [Github Repo](https://github.com/dataengtoolbox/dataengtoolbox)
* Data Engineering Course: [Github Repo](https://github.com/data-engineering-course)

**Link to this document**:
* Link: https://hackmd.io/@rhdzmota/B1JubK7ss
* Access ID: `B1JubK7ss`

:hourglass_flowing_sand:: Session Pending
:ballot_box_with_check:: Session Done
--

****

## Last Week

Hello everyone, this is our last week. We'll have our last session on Wed. 05/10, where we'll review the core concepts & final grades.

* **Mon 05/08**: We won't have this session. Make sure that you & your team have all the coding implementations available in Github.
* **Wed 05/10**: Final session at ITESO.

## :ballot_box_with_check: 2023-05-03 (Wed) Agenda & Notes

Use this session to work on the homework `Tree Traversal Levels`. I'll be adding some hints in this section.

**Example / Reference**: consider that we have a linked list like the following: `a -> b -> c`. We should be able to create a python class to represent each node in the list.

```python=
class Node:

    def __init__(self, value, child):
        self.value = value
        self.child = child
```

By using the `Node` class, we can create a "composition of nodes" to represent the idea of a linked list:

```python=
# Composition of nodes
my_linked_list = Node(value="a", child=Node(value="b", child=Node(value="c", child=None)))
```

How can I traverse the linked list? You can visit the nodes one at a time, getting the value from each node as you go.
```python=
# Recursive traverse function
def traverse(node: Node):
    print(node.value)
    if node.child:
        # Recursion: continue with the next node
        return traverse(node.child)

# You can apply this function to the top-level node of the list
# and it'll traverse the complete structure until it reaches `None`
traverse(my_linked_list)
```

**Hint**: you can use the same idea of a `Node` to represent each element of the tree structure. Instead of `child` you may use `children` to represent multiple downstream nodes.

## :ballot_box_with_check: 2023-04-26 (Wed) Agenda & Notes

:::info
Today (**Wednesday**) we'll have a **virtual session**:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

Coding session link: [here](https://code-with-me.global.jetbrains.com/MOrVBeF2l_e-zv4vWJCPYQ#p=PY&fp=04A1108AC9FCBF307C7C6262370295A2B5258B23992A4C4246DE79E458FCC78F)

* Transformation abstraction
    * Create "infrastructure" & core interfaces
    * Add functionality for pandas dataframes
    * Generalize to use different dataframe types.
        * [pandas](https://github.com/pandas-dev/pandas)
        * [polars](https://github.com/pola-rs/polars)
        * Apache Spark (pyspark)
            * [DataFrame](https://spark.apache.org/docs/latest/sql-programming-guide.html)
            * [koalas](https://github.com/databricks/koalas)

## :ballot_box_with_check: 2023-04-17 (Mon) Agenda & Notes

:::info
Today (**Monday**) we'll have a **virtual session**:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

* **Apache Spark** Theory.
* Google Sheets: configuring the "write" permissions.

Update: `crawler` application to be delivered on Thursday.

## :ballot_box_with_check: 2023-04-12 (Wed) Agenda & Notes

:::warning
Today (**Wednesday**) we'll have our session at ITESO!
:::

Before working on the classwork, consider the following recommendations.
**Recommendation 1**: Re-install the `dataengtoolbox`

```commandline=
$ pip install -e dataengtoolbox
```

**Recommendation 2**: Execute the `setup` command

```commandline=
$ python -m dataengtoolbox.iteso setup
```

* You should have a `dataengtoolbox_database` file in the root of your repository.
* DO NOT add this file to Github, ever.

**Recommendation 3**: Be sure that the `hello_world` application can be executed.

```commandline=
$ pip install -e services/hello_world
```

```commandline=
$ python -m dataengtoolbox.iteso execute --app-name hello_world --command run
```

Expected output similar to:

```text=
INFO:hello_world:Starting execution of python application hello_world: 63477a06-e2ac-4ace-925a-ac47eaaf756a
INFO:hello_world:On Start method call; not implemented.
INFO:hello_world.cli:This is a test log message; executing run command from hello world.
INFO:hello_world:Stopping application execution hello_world: 63477a06-e2ac-4ace-925a-ac47eaaf756a
INFO:hello_world:Application Status hello_world:63477a06-e2ac-4ace-925a-ac47eaaf756a:success
INFO:hello_world:Application Duration (seconds) hello_world:63477a06-e2ac-4ace-925a-ac47eaaf756a:0.001852
INFO:hello_world:On Finalize method call; not implemented.
Hello, world
```

You should also have the following directory (folder) in the root of your repo: `dataengtoolbox_execution_logs`
* DO NOT add this folder to Github, ever.

```commandline=
$ python -m dataengtoolbox.iteso frontend --app-name hello_world
```

* This command should open the frontend (streamlit).
* You can stop the execution via `ctrl + c` in the terminal.

### Classwork: Python crawler application

Read the ORM documentation regarding the `models & fields` [here](https://docs.peewee-orm.com/en/latest/peewee/models.html#).

**Instructions**: Create a python crawler application that's capable of extracting all the links in a target webpage.
The solution should be able to extract the information from "n" degrees (levels) deep relative to the target webpage (e.g., 1 = extract the top-level URLs).
* Example of root page: https://rhdzmota.com/post/the-best-way-to-install-python/

The table should be similar to the following:
* `id` contains the id of the record; this is automatically configured by the `BaseModel`.
* `execution_id` a simple random uuid (string) that represents an execution. This is relevant to distinguish between different executions. You can create a random uuid via Python with: `import uuid; str(uuid.uuid4())`
* `url` a string that represents the URL of the website.
* `content` a string that contains the website content (HTML).
* `parent` a foreign key to this same table that allows us to represent the hierarchy.

Consider that the `BaseModel` will also add additional columns (e.g., `created_at`). That's okay; you can add any other column that you consider relevant as well.

TABLE NAME: `crawlerresult`
CLASS NAME: `CrawlerResult`

| id | execution_id | url | content | parent |
|----|--------------|-----|---------|--------|
|    |              |     |         |        |
|    |              |     |         |        |
|    |              |     |         |        |
|    |              |     |         |        |

Execution example: `setup` command

```commandline=
$ python -m dataengtoolbox.iteso execute \
    --app-name crawler \
    --command setup
```

* Include a `--reset` argument (boolean) to optionally drop the tables.

Execution example: `run` command

```commandline=
$ python -m dataengtoolbox.iteso execute \
    --app-name crawler \
    --command run \
    --target 'https://rhdzmota.com/post/the-best-way-to-install-python/' \
    --levels 1
```

* You should be able to use any other website as target.

Considerations:
* The `crawler` python application should have a CLI that inherits from the `AbstractCLI`.
* You should define the table structure via the ORM (python class that inherits from the `BaseModel`).
* Use the `logger` instead of print statements when required.
* The `crawler` application should have its own `models` module.
* The `setup` command should create the table in the database.
    * You can reuse the database available at `dataengtoolbox.iteso.dao.database`
    * The `setup` command should be executed once to ensure that the tables are correctly created in SQLite.
    * You should be able to re-run this command and re-create the backend tables by using the `--reset` flag (be sure to implement the drop statement for this to work).
    * You can manually delete the database (delete the file on your laptop) to completely reset the system. Consider that you might need to run the `setup` command from `dataengtoolbox` and from your application as well.

ORM References to consider:
* Models & Fields: https://docs.peewee-orm.com/en/latest/peewee/models.html#
* Foreign keys and composite keys: https://docs.peewee-orm.com/en/latest/peewee/models.html#primary-keys-composite-keys-and-other-tricks
* Self-referential foreign keys:
    * https://docs.peewee-orm.com/en/latest/peewee/models.html#self-referential-foreign-keys
    * https://stackoverflow.com/questions/59944653/foreign-key-to-the-same-table-in-python-peewee

---

Hints:
--

Import the database:

```python=
from dataengtoolbox.iteso.dao.database import db
```

Import the `BaseModel`:

```python=
from dataengtoolbox.iteso.dao.iterface import BaseModel
```

Create your model table in the database:

```python=
from dataengtoolbox.iteso.dao.database import db

with db:
    db.create_tables(models=[])
```

* Add your model classes into the list.

Consider that some operations require a `commit` to be persisted in the database. You can commit on demand via `db.commit()`.

---

You can get the html content from a webpage via `requests.get`. Then, you can use `BeautifulSoup` to parse and extract elements from the HTML.

Take a look at these references to see how to extract `href` values from the HTML:
* Review the `sic_industries` python application
* Check out this [stackoverflow post](https://stackoverflow.com/questions/5815747/beautifulsoup-getting-href).
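
As a rough illustration of the link-extraction step, here is a minimal sketch. Note that it swaps in the standard library's `html.parser` instead of `BeautifulSoup`, only so that it runs without extra dependencies. The names `LinkExtractor` and `extract_links` are hypothetical (they are not part of `dataengtoolbox`), and the HTML string stands in for what `requests.get(url).text` would return.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> List[str]:
    """Return all the links found in the HTML, in order of appearance."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Stand-in for the content of a downloaded page (assumption for the example)
sample_html = '<a href="/about">About</a> <a href="https://example.com/x">X</a> <a>no link</a>'
print(extract_links(sample_html, base_url="https://example.com/post/"))
```

For "n" levels, one option is to call a function like `extract_links` again on each URL returned by the previous level, storing the originating record as the `parent`.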
---

You can query the database via the ORM or a `sqlite3` connection:

```python=
import sqlite3

# Open a database connection
database_connection = sqlite3.connect("dataengtoolbox_database")

# Execute a query
cursor = database_connection.cursor()
result = cursor.execute("SELECT * FROM serviceexecution")
result.fetchall()

# Close the connection
database_connection.close()
```

**For the adventurers**: if you want a challenge, wrap the `requests.get` calls into our `Future` monad (located at `dataengtoolbox.iteso.monads.future`) to make the implementation concurrent.

## :ballot_box_with_check: 2023-04-10 (Mon) Agenda & Notes

:::info
Today (**Monday**) we'll have a **virtual session**:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

## :ballot_box_with_check: 2023-03-29 (Wed) Agenda & Notes

:::warning
~~Today (**Wednesday**) we'll have our session at ITESO!~~
:::

:::info
Today (**Wednesday**) we'll have a **virtual session**:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

PyCharm Code with me session: [here](https://code-with-me.global.jetbrains.com/b8yZSp6d_kCobFR3DcSjDw#p=PY&fp=CFE9627282A3FE34CC1AC788CAC994458FA57EE55DBC1AF3A199F186BF86FC3C)

## :ballot_box_with_check: 2023-03-27 (Mon) Agenda & Notes

:::info
~~Today (**Monday**) we'll have a **virtual session**:~~
~~* Start time: **20:30hrs CST**~~
~~* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)~~
:::

:::danger
**2023-03-27 Session Cancelled**

Today's virtual session has been cancelled due to a work issue. We'll continue our session on Thursday (ITESO). Please use today's time to start working on the homework available in Canvas.
:::

## :ballot_box_with_check: 2023-03-22 (Wed) Agenda & Notes

:::info
Today (**Wednesday**) we'll have a **virtual session**:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

PyCharm Code with me session: [here](https://code-with-me.global.jetbrains.com/WFTMmnTMgpxLzr5x_qz8MA#p=PY&fp=E529C1F8C40C3E6D308E5855533315126801F567CDE04CF0F8C6120E9C9CAF33)

* Logging (persistent)
* CLI - Command Line Interface
* Visualization Tools
* Data models for monitoring

**Objective**: Refactor our python applications to be executed via a centralized framework (e.g., dataengtoolbox).

**Motivation**: Share common abstractions, infrastructure and code.

Expected execution pattern:

```commandline=
$ python -m dataengtoolbox.iteso execute \
    --app-name hello_world \
    --command run <args>
```

Old implementation:

```commandline=
$ python -m hello_world run
```

## :ballot_box_with_check: 2023-03-15 (Wed) Agenda & Notes

:::warning
Today (**Wednesday**) we'll have our session at ITESO!
:::

PyCharm Code with me session: [here](https://code-with-me.global.jetbrains.com/53gCMJxpZefggbrq4sxC6g#p=PY&fp=654DC9F2DC0B16B2CF6A46FA383C9D2FB04BCE7CCC459C86329A09F7C868ECF2)

1. Configure the data engineering toolbox
2. Classwork exercise: Google API

Python namespace packages: https://packaging.python.org/en/latest/guides/packaging-namespace-packages/

Usage example: https://github.com/RHDZMOTA/python-namespace-example

### Google Sheets

Reference: https://rhdzmota.com/post/quickstart-google-sheets-with-python/

Create a class called `GoogleSheet` with the following methods:

:::danger
**UPDATE**: Due to current API limitations, only focus on implementing the methods that retrieve data!
* get_value
* get_values
* get_pandas

Remember that your implementation **must be** a python class located in the dataengtoolbox extension package.
:::

#### GoogleSheet initialization arguments

* gs_name: name of the google sheet
* gs_id: spreadsheet id
* key: developer key to be used

#### Single operations

* `get_value(col: str, row: int) -> Any`
* `update_value(value: Any, col: str, row: int)`

Considerations:
* The `value` can be of `Any` type. You can get this type from: `from typing import Any`
* When interacting with the Google Sheets API, you'll need to transform `Any` into a string.
    * Verify if a variable is a string: `isinstance(something, str)`
    * The "casting" of an object into a string is called "serialization". In python, you can serialize an object via:
        * The `json` module: `json.dumps(obj)`
        * The `str` function: `str(obj)` (not recommended)

#### Multiple operations

* `get_values(from_col: str, from_row: int, to_col: str, to_row: int) -> Union[List[Any], List[List[Any]]]`
* `update_values(values: List, from_col: str, from_row: int)`

Considerations:
* You can get the `List` and `Union` types from:
    * `from typing import List, Union`
* The return type of the `get_values` method should be:
    * Option 1: `List[Any]` when either the columns or rows are the same.
    * Option 2: `List[List[Any]]` when retrieving data from multiple rows and multiple columns.
* The `update_values` method argument `values` operates in the same manner (it can be a single list or a list of lists). You can assume that the outer list represents the rows and the inner lists represent the columns. Therefore, a single list should update the row values for the same column.
    * You can infer the number of rows from the length of the outer list.
    * You can infer the number of cols from the length of the inner list.
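
To make the "single list vs. list of lists" convention concrete, here is a small sketch that only deals with the shape of `values`, without calling the Google Sheets API. The helper names `normalize_values` and `infer_shape` are hypothetical (they are not part of the required `GoogleSheet` interface), and the sketch assumes that a list whose elements are all lists should be read as rows of columns.

```python
from typing import Any, List, Tuple, Union

Values = Union[List[Any], List[List[Any]]]


def normalize_values(values: Values) -> List[List[Any]]:
    """Return `values` as a list of rows, where each row is a list of columns."""
    if values and all(isinstance(row, list) for row in values):
        # Already a list of lists: outer list = rows, inner lists = columns
        return values
    # A single list updates consecutive rows of the same column
    return [[value] for value in values]


def infer_shape(values: Values) -> Tuple[int, int]:
    """Infer (number of rows, number of columns) from the nested lists."""
    rows = normalize_values(values)
    n_rows = len(rows)
    n_cols = max((len(row) for row in rows), default=0)
    return n_rows, n_cols


print(infer_shape([1, 2, 3]))                  # single column, three rows
print(infer_shape([[1, 2], [3, 4], [5, 6]]))   # three rows, two columns
```

With the shape and the starting cell (`from_col`, `from_row`) you have enough information to work out the target range of an update.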
#### Pandas compatibility (dataframes)

* `get_pandas(from_col: str, from_row: int, to_col: str, to_row: int, header: bool = False) -> pd.DataFrame`
* `update_from_pandas(df: pd.DataFrame, from_col: str, from_row: int, include_header: bool = False)`

#### Hints

There's a python module called `string` that can be used to get the ascii letters:

```python=
import string

string.ascii_uppercase
```

**Why is this relevant?** You can create a function that allows you to get the column name given the position. For example:
* Position 1 should return `A`
* Position 100 should return `CV`

## :ballot_box_with_check: 2023-03-13 (Mon) Agenda & Notes

:::info
Today (**Monday**) we'll have a **virtual session**:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

Branch programming is not recommended for "complex" logic with several nested cases.

```python=
if condition_1:
    if condition_2:
        something_1
    else:
        something_2
elif condition_3:
    something_3
else:
    something_4
```

### Future Monad

We can use python native `Threads` to execute a target function in the background. But how can we use the outputs of that function execution? A possible alternative is using thread pools (reference [here](https://superfastpython.com/threadpool-python/)).

Another alternative is for us to create our own abstraction. The `Future` monad should allow us to work with future values in the present. What do we mean by "working" with the future results? We should be able to do the following operations:

* Transform the future value `A` into a `B` value.
    * Map: `A => B`
* Apply other transformations over `A` that might result in a future value `Future[B]`.
    * FlatMap: `A => Future[B]`

## :ballot_box_with_check: 2023-03-08 (Wed) Agenda & Notes

:::info
Today (**Wednesday**) we'll have a **virtual session**:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

```python=
from typing import TypeVar, List

A = TypeVar("A")


def first(my_list: List[A]) -> A:
    return my_list[0]
```

What's the type signature of the `first` function?
* `List[A] => A`

```text
Callable[[List[A]], A]
```

What's the `pure` operation in the context of a `list` in python?

```python=
from typing import TypeVar, List, Callable, Generic

T = TypeVar("T")
A = TypeVar("A")
B = TypeVar("B")


class ListMonad(Generic[T]):

    def __init__(self, value: List[T]):
        self.value = value

    @staticmethod
    def pure(x: T) -> List[T]:
        return [x]

    def flat_map(
            self,
            f: Callable[[A], List[B]]
    ) -> List[B]:
        return [
            result
            for element in self.value
            for result in f(element)
        ]

    def map(
            self,
            f: Callable[[A], B]
    ) -> List[B]:
        # Transform f into the type needed for flat_map
        g = lambda x: self.pure(f(x))
        return self.flat_map(f=g)
```

```python=
my_list = ["first", "second"]


def function_1(x: str) -> List[str]:
    return [char for char in x]


def function_2(x: str) -> int:
    return len(x)
```

```python=
# Apply function 1 via a flat_map
ListMonad(my_list).flat_map(function_1)
```

```python=
# Apply function 2 via a map
ListMonad(my_list).map(function_2)
```

## :ballot_box_with_check: 2023-03-06 (Mon) Agenda & Notes

:::info
Today (**Monday**) we'll have a **virtual session**:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

Agenda:
1. Functional programming paradigm - example with concurrent programming.
2. Integrate the `dataengtoolbox` into our course repository to allow sharing abstractions with other python services.
```python=
import datetime as dt


class Student:

    def __init__(self, name: str, birthdate: str):
        self.name = name
        self.birthdate = birthdate

    def age(self):
        current_year = dt.datetime.utcnow().year
        student_year = dt.datetime.strptime(
            self.birthdate,
            "%Y-%m-%d"
        ).year
        return current_year - student_year


student = Student(name="example", birthdate="2000-01-01")
student.age()
```

In this case, the method `age` has the following type signature:

```text=
(Student, ) -> int
```

Consider the following function:

```python=
def add(a: int, b: int) -> int:
    return a + b
```

* The type signature of this function is: `(int, int) => int`
* Using the python hints/type annotations notation:

```text=
from typing import Callable

add_type_signature = Callable[[int, int], int]
```

Consider the following value in Python:

```python=
my_list = [1, 2, 3]
```

* The data type for the `my_list` variable is: `List[int]`

**What can we generalize?** The first thing we can generalize is the type of the values that the list contains. We can use a generic datatype `A`.
* `List[A]`

**What else can we generalize?** We can attempt to generalize the "context" that contains `A`. Let's call `F[]` the new context.
* `F[A]`

Consider now the following function: `g: A -> B`

We should be able to apply `g` over `F[A]` without changing the context, therefore resulting in `F[B]`.

We can define `F[]` as a monad if it contains the following operations:
* `pure`: `A => F[A]`
* `flat_map`: `(F[A], A => F[B]) => F[B]`
* `map`: `(F[A], A => B) => F[B]`
    * `F.flat_map(F.pure * (A => B) )`

Apache Spark:
* `ds: Dataset[A]`
* `ds.map(A => B)`
* `ds.flatMap(A => Dataset[B])`

## :ballot_box_with_check: 2023-03-01 (Wed) Agenda & Notes

:::warning
Today (**Wednesday**) we'll have our session at ITESO!
:::

SIC Industry Search Problem Statement - implement the missing `search` command in the `sic_industries` python application.
* Instructions: [here](https://github.com/data-engineering-course/ITESO2023H1DEA/tree/main/services/sic_industries).
**How to do an exact pattern search?** You can easily do this in python by comparing both strings (the search condition & industry title) via the `in` operator. For example:

```python=
search_condition = "example"
industry_title = "This is an example"


def search(pattern: str, title: str, exact: bool = False):
    if exact:
        # Exact search implementation
        return pattern in title
```

**How to do a similarity search in python?** The `difflib` module, built into the python standard library, can be used to solve this. For example:

```python=
from difflib import SequenceMatcher

search_condition = "example"
industry_title = "This is an example"

similarity_ratio = SequenceMatcher(None, search_condition, industry_title).ratio()
print(similarity_ratio)
```

* You can assume two strings are "similar enough" if the similarity ratio is greater than `0.5`.
* I highly recommend completing the `search` function so that it falls back to the similarity search when the `exact` parameter is `False`.

**How can I iterate over a dictionary?** There are several ways... Some examples here:

```python=
example = {"a": 1, "b": 2, "c": 3}

# Option 1
for key in example:
    print(key)

# Option 2
for key, val in example.items():
    print(key, val)

# And more... complement with a simple google search.
```

**How can I load the json file into Python?** There are two options:
* **Option 1**: Load into our custom model by using the SICHelper class: `instance = SICHelper.from_file(filename)`
* **Option 2**: Open the file, read the content, and load as a dictionary via `json.loads(content)` (this requires importing `json`).
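
Putting the hints above together, here is one possible sketch of a `search` function that falls back to the similarity search when `exact` is `False`. The `search_titles` helper, the `SIMILARITY_THRESHOLD` constant and the sample dictionary (with made-up keys) are assumptions for illustration; the real data structure in `sic_industries` may look different.

```python
from difflib import SequenceMatcher
from typing import Dict, List

# "Similar enough" cutoff suggested in the notes
SIMILARITY_THRESHOLD = 0.5


def search(pattern: str, title: str, exact: bool = False) -> bool:
    if exact:
        # Exact search: the pattern must appear verbatim in the title
        return pattern in title
    # Fallback: similarity search based on the ratio of matching characters
    ratio = SequenceMatcher(None, pattern, title).ratio()
    return ratio > SIMILARITY_THRESHOLD


def search_titles(pattern: str, industries: Dict[str, str], exact: bool = False) -> List[str]:
    """Return the keys of the industries whose title matches the pattern."""
    return [
        key
        for key, title in industries.items()
        if search(pattern, title, exact=exact)
    ]


# Hypothetical sample data (the keys are made up for the example)
industries = {"code_a": "Dairy Farms", "code_b": "Coal Mining"}
print(search_titles("Dairy Farm", industries))
print(search_titles("Mining", industries, exact=True))
```

Note that both searches are case sensitive as written; lowercasing the pattern and the title before comparing is a simple way to relax that.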
## :ballot_box_with_check: 2023-02-27 (Mon) Agenda & Notes

:::info
Today (**Monday**) we'll have a **virtual session**:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

PyCharm code with me session: [here](https://code-with-me.global.jetbrains.com/CqeQ2CVlW7jgDYuu84I6vw#p=PY&fp=11D5F47E27E99E03E0ED5BDCDC303AFF3A159E4FF309BB78D122C7BC352342BD)

Python Replicator implementation:
* Conc. Exec: 343.14
* Seq. Exec: 123.067
* What can we do about it? What did we get wrong?

Logging with Python: https://docs.python.org/3/howto/logging.html

```python=
import os
import logging
from typing import Optional

DEFAULT_LOG_LEVEL = os.environ.get(
    "DEFAULT_LOG_LEVEL",
    default="INFO"
).strip()


def get_logger(name: str, log_level: Optional[str] = None):
    logging.basicConfig(
        level=log_level or DEFAULT_LOG_LEVEL,
    )
    return logging.getLogger(name=name)
```

Python profilers:
* Allow us to analyze our implementation and identify potential performance bottlenecks in our python code.
* Recommendation: [Scalene](https://github.com/plasma-umass/scalene)

Python Namespace Packages:
* Documentation [here](https://packaging.python.org/en/latest/guides/packaging-namespace-packages/)
* Examples [here](https://github.com/RHDZMOTA/python-namespace-example)

PyPI:
* PyPI is a cloud repository that contains all the python libraries available for installation via `pip`.
* You can create an account [here](https://pypi.org/account/login/).

## :ballot_box_with_check: 2023-02-22 (Wed) Agenda & Notes

:::info
Today we'll have a **virtual session**:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

Code with me in PyCharm Session: [here](https://code-with-me.global.jetbrains.com/J3UN4e3kAuzdlyeXuWJF3g#p=PY&fp=F37C688C2C403FE4496B7E248E955689E5DF3A2160E063ABFCD156F27A20206E)

Agenda:
* Python application example using concurrency model.
* Integrate the `dataengtoolbox` in our class repository.
* Functional programming principles in concurrency model.
* DAGs & Orchestrators
    * DAG = Directed Acyclic Graph

Reactive Manifesto: https://www.reactivemanifesto.org/
* Examples:
    * [Akka Streams](https://doc.akka.io/docs/akka/current/stream/stream-introduction.html#motivation) in Scala
    * [Apache Kafka](https://kafka.apache.org/)

> Premature optimization is the root of all evil

* [Donald Knuth](https://www.youtube.com/watch?v=74RdET79q40)
* Recommendation: "The Art of Computer Programming" Book Series by D. Knuth

Function Types: pure, impure

`Pure Functions` are deterministic and do not depend on external resources. They **do not have any side effects**.

```python=
# This is a pure function
# (Int, Int) => Int
def add(a: int, b: int) -> int:
    return a + b


# This is an impure function
# (Int, Int) => (Side effect) Int | Exception | ...
def add(a: int, b: int) -> int:
    print("Add function call: ", a, b)
    database.query("INSERT INTO TABLE VALUES (a, b)")
    return a + b
```

Examples of functional programming languages:
* [Haskell](https://www.haskell.org/)
* [Scala](https://www.scala-lang.org/)

Recommendation:
* [Essential Scala by Noel Welsh](https://books.underscore.io/essential-scala/essential-scala.html)
* [Functional Programming Principles in Scala](https://www.coursera.org/specializations/scala#instructors)

## :ballot_box_with_check: 2023-02-20 (Mon) Agenda & Notes

:::info
Today we'll have a **virtual session**:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

## :ballot_box_with_check: 2023-02-15 (Wed) Agenda & Notes

:::warning
Today we'll have our session at ITESO!
:::

* SIC Industries - closure
* Packaging python applications
* Team Exercise: scanner application (Friday)

You can create a python wheel by:
* Installing the following dependency: `wheel==0.38.4`
* Executing this command: `python setup.py bdist_wheel`

## :ballot_box_with_check: 2023-02-13 (Mon) Agenda & Notes

:::info
Today we'll have a **virtual session**:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

*Announcement*: Repository forking has been enabled in our github organization!

Code with me session: [here](https://code-with-me.global.jetbrains.com/KkDHLR4ZCutv4VgtiYHm6Q#p=PY&fp=F37C688C2C403FE4496B7E248E955689E5DF3A2160E063ABFCD156F27A20206E)

Agenda:
* Finish our "SIC Industry" data extraction ETL development.
* **Concurrency** with python using multithreading!

Relational Database Management Systems:
* MySQL
* Postgres
* ...

**ORM Tooling**: Object Relational Mappings
* [Python ORM with Django](https://docs.djangoproject.com/en/4.1/topics/db/queries/)
* [Python ORM with SQLAlchemy](https://docs.sqlalchemy.org/en/20/orm/quickstart.html#declare-models)

SIC Industry Data Extraction:

For each industry, we need to create the following structure:

```json
{
    "title": "...",
    "sub_industries": [
        {...},
        {...}
    ]
}
```

We can use a python class to create our "data models" and allow transforming an instance of the class into a dictionary.
```python=
class Student:

    def __init__(self, name: str, age: int):
        self.name = name
        self.age = age

    def to_dict(self):
        return {
            "name": self.name,
            "age": self.age,
        }


student_a = Student("Alice", 25)
student_b = Student("Bob", 22)
student_c = Student("Carol", 27)
```

```python=
payload = student_a.to_dict()  # This is a dictionary
```

Payload contains the following:

```jsonld=
{
    "name": "Alice",
    "age": 25
}
```

We can create the json by transforming the dictionary into a string and saving it into a file:

```python=
import json

with open("path/to/file.json", "w") as file:
    content = json.dumps(payload)
    file.write(content)
```

You can read back the information into a student instance by running the following:

```python=
import json

with open("path/to/file.json", "r") as file:
    payload = json.loads(file.read())
    # The payload keys match the constructor arguments (name, age)
    student_a = Student(**payload)
```

There are two components to consider in this example:
* Serialization process
* Deserialization process

JSON: JavaScript Object Notation
* JavaScript Object => String

## :ballot_box_with_check: 2023-02-08 (Wed) Agenda & Notes

:::info
Today we'll have a virtual session:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

Code with me session: [here](https://code-with-me.global.jetbrains.com/104Ss8WqzYptEw0Z67rZQQ#p=PY&fp=953987B0D48B1E2F20405D9C912E08A13D67B87CBC7C1E0FF8569CEAD513CF93)

Data Source: https://www.osha.gov/data/sic-manual

Python service layout:

```text=
<python-application-name>
├── README.md                       # Usage documentation
├── requirements.txt                # Dependencies
├── setup.py                        # Should be consistent with the app-name
└── src                             # Source code
    └── <python-application-name>
        ├── cli.py                  # Contains the CLI Class
        ├── settings.py             # Contains parametrizations & constants
        ├── utils.py                # Utility & reusable code
        ├── models.py               # Data models (entities)
        ├── helpers.py / admin.py   # Helper classes to manage the models
        ├── __init__.py             # Initializes the python module
        └── __main__.py             # Executes fire; allows CLI commands to be exposed in the terminal
```

Execute the following command to install the python app:

```commandline=
$ pip install -e services/<app-name>
```

Execute the following to run the application:

```commandline=
$ python -m <app-name> <command> <arguments>
```

* Example: `python -m hello_world run --name world`

When creating a python application you need to be sure to follow these principles:
* Modularity
* Separation of Concerns
* Single Responsibility Principle

HTTP Requests:
* GET
* POST / PUT
* DELETE
* ...

Python dunder methods.

## :ballot_box_with_check: 2023-02-01 (Wed) Agenda & Notes

:::warning
Today we'll have our session at ITESO!
:::

## :ballot_box_with_check: 2023-01-30 (Mon) Agenda & Notes

:::warning
Today we'll have our session at ITESO!
:::

Agenda:
* Review: data application infrastructure
* Be sure to have the tooling installed in your system.
    * Python installation manager: [pyenv](https://rhdzmota.com/post/the-best-way-to-install-python/)
    * PyCharm Professional (installation guide [here](https://hackmd.io/@credit-risk-iteso/SJMtcIsWd))
    * Git bash (installation guide [here](https://hackmd.io/@credit-risk-iteso/r1PogIiW_))

### Installation Guide - Verification

Steps to verify your local installation:

**STEP 1**: Clone the repository in your computer

```commandline=
$ git clone https://github.com/dataengtoolbox/dataengtoolbox.git
```

Consider that you can choose where to clone the repo (e.g., `Documents`). Be sure to "change into" the repository base once the clone is successful (example: `cd dataengtoolbox`).

**STEP 2**: Create a python virtual environment in the repo.
Using pyenv:

```commandline=
$ pyenv local 3.9.5
$ pyenv exec python -m venv venv
```

Using python directly:

```commandline=
$ python -m venv venv
```

Activate the virtual environment:
* Windows: `source venv/Scripts/activate`
* Mac/Linux: `source venv/bin/activate`

Be sure that the virtual environment is using `python 3.9.5` by running:
* `python --version`

**STEP 3**: Install our python application.

Execute the following command in the base of the cloned repository:

```commandline=
$ pip install -e .
```

Verify that the application is correctly installed by successfully running at least one of the following:
* Option 1: `dataengtoolbox hello`
* Option 2: `python -c "import fire; from dataengtoolbox.cli import CLI; fire.Fire(CLI())" hello`
* Option 3: `python -m dataengtoolbox hello`

### Exercise: Scanner Implementation

Be sure to have implemented the following functions before our next session:
* `list_files`
* `scan_path`

This is not an official homework, but the implementations will be used for the next session's classwork.

#### Function: List Files

Create a python function called `list_files` that receives a `target_path` as an input and returns a generator where each item is a string with the path of one file in the target directory. If the `include_nested` boolean argument is set to true, the result should contain all the nested files as well.

```python=
from typing import Generator


def list_files(target_path: str, include_nested: bool = False) -> Generator:
    pass
```

#### Function: Scan Path

Create a function called `scan_path` that returns a list of dictionaries that represent the filename and external metadata. The function should have the following inputs:
* `target_path` & `include_nested` have the same argument definition as in the `list_files` function.
* `include_created_at`: boolean value (should default to `False`) that adds the `file_created_at` key to the dictionary with the creation timestamp of the file.
* `include_file_size`: boolean value (should default to `False`) that adds the `file_size` metadata into the dictionary.

```python=
from typing import Dict, Generator, List


def list_files(target_path: str, include_nested: bool = False) -> Generator:
    pass


def scan_path(
        target_path: str,
        include_nested: bool = False,
        include_created_at: bool = False,
        include_file_size: bool = False,
) -> List[Dict]:
    pass
```

Complete output (one dictionary per file):

```json
{
    "file_path": "xxxx",
    "file_name": "xxxx",
    "file_created_at": "xxxx",
    "file_size": "xxxx"
}
```

* The `file_created_at` key should not be included when the `include_created_at` argument is set to `False` (default).
* The `file_size` key should not be included when the `include_file_size` argument is set to `False` (default).

Considerations:
* Use the `list_files` function in the `scan_path` implementation.
* **Hint 1**: Consider the following standard library functions to get the results of listing the target directory:
    * [os.walk](https://docs.python.org/3/library/os.html#os.walk)
    * [os.listdir](https://docs.python.org/3/library/os.html#os.listdir)
* **Hint 2**: Consider the following standard library function to get the creation timestamp of a file: [os.path.getctime](https://docs.python.org/3/library/os.path.html#os.path.getctime).
* **Hint 3**: Consider the following standard library function to get the file size: [os.path.getsize](https://docs.python.org/3/library/os.path.html#os.path.getsize).
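
**Reference sketch**: the hints above can be combined roughly as follows. This is only one possible approach, not the official solution. It assumes that `list_files` yields full file paths (the exercise only asks for a string per file) and that timestamps and sizes are stored as the raw values returned by the standard library.

```python
import os
from typing import Dict, Generator, List


def list_files(target_path: str, include_nested: bool = False) -> Generator[str, None, None]:
    """Yield the path of each file found in `target_path`."""
    if include_nested:
        # os.walk visits the target directory and all of its subdirectories.
        for dirpath, _dirnames, filenames in os.walk(target_path):
            for filename in filenames:
                yield os.path.join(dirpath, filename)
    else:
        # os.listdir returns files and directories, so keep only the files.
        for entry in os.listdir(target_path):
            full_path = os.path.join(target_path, entry)
            if os.path.isfile(full_path):
                yield full_path


def scan_path(
        target_path: str,
        include_nested: bool = False,
        include_created_at: bool = False,
        include_file_size: bool = False,
) -> List[Dict]:
    """Return one dictionary per file, with optional metadata keys."""
    results = []
    for file_path in list_files(target_path, include_nested=include_nested):
        record = {
            "file_path": file_path,
            "file_name": os.path.basename(file_path),
        }
        if include_created_at:
            # Caveat: on Unix, getctime reports the last metadata change,
            # which is not necessarily the creation time.
            record["file_created_at"] = os.path.getctime(file_path)
        if include_file_size:
            # Size in bytes.
            record["file_size"] = os.path.getsize(file_path)
        results.append(record)
    return results
```

For example, `scan_path(".", include_nested=True, include_file_size=True)` returns a list where each dictionary has the `file_path`, `file_name` and `file_size` keys, and no `file_created_at` key.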
## :ballot_box_with_check: 2023-01-25 (Wed) Agenda & Notes

:::info
Today we'll have a virtual session:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

**Collaborative coding session**: [here](https://code-with-me.global.jetbrains.com/TzOUlJxnaMFL9_R4IQ0uOg#p=PY&fp=C9359314554A4A35CBD5C43F76B3A65764B79E2C831AFFA201013A5AB7E4E56A)

* (concurrent) AsyncIO Python module documentation: [here](https://docs.python.org/3/library/asyncio-task.html)
* (concurrent) Python Threading documentation: [here](https://docs.python.org/3/library/threading.html)
* (parallel) Python Multiprocessing documentation: [here](https://docs.python.org/3/library/multiprocessing.html)

**Complementary programming languages:**
* Scala: https://www.scala-lang.org/
* Haskell: https://www.haskell.org/

**Python Coding Style Guide Reference**: [here](https://peps.python.org/pep-0008/)

## :ballot_box_with_check: 2023-01-23 (Mon) Agenda & Notes

:::info
Today we'll have a virtual session:
* Start time: **20:30hrs CST**
* Zoom URL: [here](https://us02web.zoom.us/j/83284544756?pwd=QXV1NzhaK2FzdVVmTy9BS0p3NjJEUT09)
:::

**Collaborative Coding Session**: [here](https://code-with-me.global.jetbrains.com/Vh3EHNcElsrvEwa-uBFg4A#p=PY&fp=C9359314554A4A35CBD5C43F76B3A65764B79E2C831AFFA201013A5AB7E4E56A)

Example Github Repo: https://github.com/dataengtoolbox/dataengtoolbox

Install python `3.9.5` using pyenv:
* Install python version in your system: `pyenv install 3.9.5`
* Specify that `3.9.5` should be used in the current directory: `pyenv local 3.9.5`

Create a python virtual env in the current directory:
* Create virtual env: `pyenv exec python -m venv venv`
* Activate virtual env:
    * Mac/Linux: `source venv/bin/activate`
    * Windows: `source venv/Scripts/activate`

Install requirements:

```commandline=
$ pip install -r requirements.txt
```

## :ballot_box_with_check: 2023-01-18 (Wed) Agenda & Notes

* Intro to Python
* ~~Python
applications~~
* Classwork exercise
    * Link to [classwork exercise](https://forms.gle/Ejjpy1LfV2wcGkhn7)
    * Due date: Friday (2023-01-20)

Python Benchmarks: The following repository contains a set of benchmarks to compare the naive python performance vs different implementations.
* Link to benchmark repo [here](https://github.com/RHDZMOTA/python-benchmark-primes).

Terminal recommendations:
* Review the resources provided in the previous session containing a guide and intro on using the terminal.
* For `Windows` users, I recommend installing [Git Bash](https://git-scm.com/downloads) or [WSL with Ubuntu](https://learn.microsoft.com/en-us/windows/wsl/about).
* For `Mac` users, you can use the default terminal or install [iterm2](https://iterm2.com/).
* For `Linux` users, the default terminal should work just fine.

**PyEnv**: Install different python versions on your computer.
* Installation guide [here](https://rhdzmota.com/post/the-best-way-to-install-python/).
* Python version for this course: `3.9.5`

**Python virtual environments (venv)**: manage python dependencies for a project.

Recommendation: use the python version and dependencies specified in the [databricks runtime](https://docs.databricks.com/release-notes/runtime/releases.html). At the time of this course, the latest LTS Runtime is the [11.3 LTS](https://docs.databricks.com/release-notes/runtime/11.3.html).

:::info
**Teams**

Be sure to send me an email to `rodrigohm@iteso.mx` with your team (max 4 members).
Specify the following information for each member:
* Full Name
* Student ID
* Github Username
:::

## :ballot_box_with_check: 2023-01-16 (Mon) Agenda & Notes

Link to temporary document:
* Link: https://hackmd.io/@rhdzmota/B1JubK7ss
* Access ID: `B1JubK7ss`

Data Roles (21:45 - 21:55 hrs): https://www.wizeline.com/data-roles-friends-but-not-the-same/

Survey (21:07 - 21:17 hrs): [Tell me more about yourself](https://forms.gle/fWvbQAasfdZUvDLk9)

Databricks Runtimes: [here](https://docs.databricks.com/release-notes/runtime/releases.html)

----

Consider reviewing the following resources about the command line & git:
* Terminal:
    * [10 Linux Terminal Commands for Beginners by Gary Sims](https://www.youtube.com/watch?v=CpTfQ-q6MPU)
    * [Beginner’s Guide to the Bash Terminal by Joe Collins](https://www.youtube.com/watch?v=oxuRxtrO2Ag)
* Git & Github:
    * [Git Core Concepts](https://www.youtube.com/watch?v=uR6G2v_WsRA)
    * [Git Branching and Merging](https://www.youtube.com/watch?v=FyAAIHHClqI)
    * [Github cheat sheet](https://training.github.com/downloads/github-git-cheat-sheet/)
    * [Forking a Github Repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo)