Gregory Malcom
## 🧭 Emoji Legend (Note Formatting Guide)

<details>
<summary>🧭 Click to expand Emoji Legend</summary>

| Icon | Meaning / Use Case | Example Usage |
|:----:|--------------------|----------------|
| 🧩 | **Main topic or section header** — introduces a major theme or concept | `## 🧩 Groupwise Analysis` |
| 🔸 | **Major concept or subsection** — represents a key idea under the main topic | `### 🔸 Why Group Data?` |
| 🔹 | **Detailed operation or example** — used for methods, step-by-step logic, or code examples | `### 🔹 groupby() minimum` |
| ⚙️ | **Mechanics** — explains what’s happening under the hood | “⚙️ Mechanics: Pandas scans column values and builds sub-DataFrames.” |
| 🎯 | **Goal / Purpose** — clarifies what we’re trying to achieve | “🎯 Goal: Count how many rows belong to each group.” |
| 💡 | **Tip / Takeaway / Insight** — highlights key points, advice, or conceptual notes | “💡 Takeaway: Use .agg() for multiple summary stats.” |
| ✅ | **Shortcut / Confirmation** — shows equivalent methods or quick wins | “✅ Equivalent: `reviews.points.value_counts()`” |
| 🧠 | **Conceptual Insight** — emphasizes mental models or deeper understanding | “🧠 Think of groupby() as split → apply → combine.” |
| 📘 | **Reference / Documentation Link** — links to external docs or sources | “📘 [Pandas MultiIndex Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)” |
| 🪄 | **Trick / Tip** — optional visual for clever techniques or shortcuts | “🪄 Use .apply() to manipulate each group independently.” |
| 🧭 | **Legend / Navigation Aid** — used for orientation or overview sections | “## 🧭 Emoji Legend (Note Formatting Guide)” |

---

🪄 *Tip:* All of these are Unicode emojis that work natively in HackMD, GitHub, Notion, and most Markdown renderers.
You can insert them using your OS emoji picker:

- **Windows:** `Win + .`
- **macOS:** `Cmd + Ctrl + Space`

</details>

## 🧭 Table of Contents

[TOC]

---

# PMT2-FA22

[Link to PMT2-FA22 on Vocareum](https://labs.vocareum.com/main/main.php?m=editor&asnid=4490389&stepid=4490392&hideNavBar=1)

A Pandas-heavy notebook - in fact, the vast majority of it is Pandas!

## 🧩 Data Tables

We are given the following four tables:

![image](https://hackmd.io/_uploads/S13gSnLAxx.png)

### 🔸 Strategy

On Midterm 1, my primary failure seemed to stem from trying to force my way through the exercises in numerical order. The strategy for this midterm was to locate the 1-point questions and attempt those first, then move on to the 2-point questions. If time remains, tackle the 3-point questions.

### 🔹 Exercise 0 - Filtering, Structuring and Cleaning (Pandas)

**🎯 Goal**: Return a new DataFrame of active promotions (time_left > 0) that contains the following keys:

* `cust_id`
* `promo`
* `service`

**➡️ Input**: The DataFrame `promos` that has the columns:

* `cust_id` → id of a single customer
* `promo` → name of a promo in which the customer has participated for the associated service
* `service` → a service where the customer has participated in a promo
* `time_left` → the time the customer has left on the promo

![image](https://hackmd.io/_uploads/r1ORL4YRex.png)

**⬅️ Output**: A new DataFrame with active promotions, which does the following:

* `cust_id` → the id of the customer for each unique combination with `service`
* `promo` → the active promo for the `cust_id`/`service` pair:
  * when `time_left` = 0 → column value = base
  * when `time_left` != 0 → match promo values across all applicable `cust_id`
* `service` → identifies a service for which the customer has an active promotion.
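The pair-then-left-merge pattern this exercise calls for can be sketched on a tiny, hypothetical `promos` table (the column names come from the exercise; the data is made up):

```python
import pandas as pd

# Hypothetical mini version of the `promos` table (all values are strings).
promos = pd.DataFrame({
    'cust_id':   ['1', '1', '2'],
    'service':   ['audio', 'audio', 'video'],
    'promo':     ['base', 'half_off', 'base'],
    'time_left': ['0', '3', '0'],
})

# Unique (cust_id, service) pairs.
pairs = promos[['cust_id', 'service']].drop_duplicates()

# Rows with an active promo, at most one per pair.
active = promos.loc[promos['time_left'] != '0', ['cust_id', 'service', 'promo']]
active = active.drop_duplicates(subset=['cust_id', 'service'])

# Left-join, then fill pairs with no active promo with 'base'.
result = pairs.merge(active, on=['cust_id', 'service'], how='left')
result['promo'] = result['promo'].fillna('base')
```

Customer 1's audio service picks up `half_off`; customer 2's video service has no active promo, so it falls back to `base`.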
**The customer may not actually be subscribed to the service!**

![image](https://hackmd.io/_uploads/r1R1wVKRex.png)

**✅ Solution**:

```python
### Exercise 0 solution
def get_active_promos(promos):
    # 1) Select only the columns we need
    cols_needed = ['cust_id', 'service', 'promo', 'time_left']
    promos_subset = promos[cols_needed]

    # 2) Build a DataFrame of unique (cust_id, service) pairs
    cust_service = promos_subset[['cust_id', 'service']]
    cust_service_unique = cust_service.drop_duplicates()

    # 3) Identify rows with an active promo (time_left != '0')
    is_active_row = promos_subset['time_left'] != '0'
    active_rows = promos_subset[is_active_row]

    # 4) Keep only the columns needed from active rows
    active_cols = ['cust_id', 'service', 'promo']
    active_promos_only = active_rows[active_cols]

    # 5) Ensure one active promo per (cust_id, service) pair
    active_promos_unique = active_promos_only.drop_duplicates(subset=['cust_id', 'service'])

    # 6) Left-join unique pairs to their active promo (if present)
    merged = cust_service_unique.merge(active_promos_unique,
                                       on=['cust_id', 'service'],
                                       how='left')

    # 7) Fill missing promos with 'base'
    missing_mask = merged['promo'].isna()
    merged.loc[missing_mask, 'promo'] = 'base'

    # 8) Reorder columns exactly as required
    result = merged[['cust_id', 'service', 'promo']]

    # 9) Return the final DataFrame
    return result
```

**⚙️ Pandas Concepts**:

* `df = df[[list of column names]]` → filters a DataFrame, returning a new one with only the supplied columns
* `df.drop_duplicates()` → removes duplicate rows from a DataFrame
* `mask = df[col_name] != '0'` → returns a Boolean Series for the supplied condition
* `df1.merge()` → merges another table into df1 using the inputs (and order) below:
  * `df2` → table that will merge into df1
  * `how=` → defines the way in which the tables will merge:
    * `"left"` → adds columns from df2 on the right side of df1, where columns with the same name and values are matched across the tables
    * `"right"` → adds columns from df2 on the left side of df1, where columns with the same name and values are matched across the tables
    * `"outer"` → adds ALL columns from df2 and sorts the resulting DataFrame lexicographically; this is a union of the keys
    * `"inner"` → adds SIMILAR columns from df2 and preserves the order of the left keys; this is an intersection of the keys
    * `"cross"` → creates the [cartesian product](https://www.geeksforgeeks.org/maths/cartesian-product-of-sets/) from both DataFrames, ordered on the left keys
  * `on=` → column or index level names to join on
  * `left_on/right_on=` → the names of the columns that share values but have different keys (column names)
* `df[[col1, col2]]` → reorders columns based on the list (any column names left out are dropped)

### 🔹 Exercise 1 - Merging (Pandas)

![image](https://hackmd.io/_uploads/rk6Gea80ge.png)

**🎯 Goal**: Given several DataFrames, `customers`, `services`, `active_promos`, and `prices`, use a series of left merges to create the required output.

![image](https://hackmd.io/_uploads/B15TigOCgg.png)

**✅ Solution**:

```python
### Exercise 1 solution
def denormalize(customers, services, active_promos, prices):
    ###
    df = services.copy()
    df = df.merge(customers, how="left", left_on="cust_id", right_on="id")
    df = df.drop('id', axis=1)
    df = df.merge(active_promos, how="left")
    df = df.merge(prices, how="left")
    df = df.rename(columns={'cust_id': 'id'})
    reorder = ['id', 'paid', 'service', 'tier', 'promo', 'price']
    df = df[reorder]
    return df
    ###
```

![image](https://hackmd.io/_uploads/ryYqVzOCgg.png)

**⚙️ Pandas Concepts**:

* `df1.merge(df2, ...)` → see Exercise 0 for the full breakdown of `how=`, `on=`, and `left_on/right_on=`
  * 📘 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
* `df.drop(labels, axis=0/1)` → whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’)
  * 📘 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

### 🔹 Exercise 2 - Sum and Type Conversion (Pandas)

**🎯 Goal**: Given a DataFrame, `df`, with the structure below, return the total of the `price` column rounded to 2 decimal places.

**💡 Note**: All data fields are strings, so we will need to cast to float prior to computing the total.
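In isolation, the string-to-float cast and sum looks like this (a minimal sketch using the demo prices from the exercise):

```python
import pandas as pd

# Prices arrive as strings, so cast before summing.
price = pd.Series(['8.99', '15.99', '12.99', '8.99', '10.99'])
total = round(price.astype(float).sum(), 2)
print(total)  # 57.95
```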
**`df` Structure**:

```python
### Define demo inputs
demo_df_ex2 = pd.DataFrame({'id': {0: '0', 1: '2', 2: '2', 3: '3', 4: '4'},
                            'paid': {0: 'True', 1: 'True', 2: 'True', 3: 'True', 4: 'True'},
                            'service': {0: 'audio', 1: 'video', 2: 'audio', 3: 'audio', 4: 'video'},
                            'tier': {0: '1', 1: '2', 2: '2', 3: '1', 4: '1'},
                            'promo': {0: 'base', 1: 'base', 2: 'base', 3: 'base', 4: 'base'},
                            'price': {0: '8.99', 1: '15.99', 2: '12.99', 3: '8.99', 4: '10.99'}})
# Solution should produce 57.95
```

**✅ Solution**:

```python
### Exercise 2 solution
def get_revenue(df):
    pickle = df['price']
    pickle = pickle.astype(float)
    return pickle.sum()

print(get_revenue(demo_df_ex2))  # 57.95
```

### 🔹 Exercise 3 - Filtering and Structuring

**🎯 Goal**: Partition (split) an existing “journal” DataFrame into two separate DataFrames — one containing historical (expired) records and one containing active (current) records — while ensuring the output has the correct schema, even if no prior journal exists.

**➡️ Input**:

* `df` → DataFrame - we don't care about its structure
* `audit_cols` → list of strings - names of audit columns used to track history in the journal
* `existing_journal` (optional) → DataFrame or None - if it is a DataFrame, it will have all of the columns of `df` plus `audit_cols`

![image](https://hackmd.io/_uploads/HkwOFHFCxx.png)

**⬅️ Output**: Two DataFrames returned as a tuple:

* `historical_journal` →
  * if `existing_journal` is None → empty DataFrame with the columns from `df` and `audit_cols`
  * if `existing_journal` is not None → a DataFrame containing all rows of `existing_journal` where `exp_dt` is not '9999-12-31'
* `active_journal` →
  * if `existing_journal` is None → empty DataFrame with the columns from `df` and `audit_cols`
  * if `existing_journal` is not None → a DataFrame containing all rows of `existing_journal` where `exp_dt` is '9999-12-31'

**✅ Solution**:

```python
### Exercise 3 solution
def partition_journal(df, audit_cols, existing_journal=None):
    if existing_journal is None:
        cols = list(df.columns) + audit_cols
        existing_journal = pd.DataFrame(columns=cols)
    active_mask = existing_journal['exp_dt'] == '9999-12-31'
    historical_journal = existing_journal.loc[~active_mask, :]
    active_journal = existing_journal.loc[active_mask, :]
    return historical_journal, active_journal
```

![image](https://hackmd.io/_uploads/H1VPYrK0xe.png)

**⚙️ Pandas Concepts**:

* `list(df.columns)` → converts a Pandas Index to a Python list
* `mask = df['column name'] == 'condition'` → returns a Boolean Series for the given condition
* `df = df.loc[~mask, :]` → keeps only the rows where the mask is False
  * `~` → negation of the mask (the opposite)

### 🔹 Exercise 4 - Merge, Partition and Custom Function Use

**🎯 Goal**: Compare the current business DataFrame `df` to the active journal `active_journal` using a set of key columns. Identify:

* new records → rows only in `df`
* expired records → rows only in `active_journal`
* new and old records → rows in both

**➡️ Input**:

* `df` → a DataFrame

  ![image](https://hackmd.io/_uploads/r17tL8t0gl.png)
* `active_journal` → a DataFrame that has all of the columns in `df`, with or without some additional ones. It might also be empty.

  ![image](https://hackmd.io/_uploads/ByATULtCel.png)
* `key_cols` → list of strings denoting columns in `df` and `active_journal` whose combined values uniquely identify one record in either `df` or `active_journal`.
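The `suffixes=`/`indicator=` outer-merge machinery this exercise is built on can be previewed on two small hypothetical frames (the column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical current data vs. active journal, keyed on 'id'.
df = pd.DataFrame({'id': [1, 2], 'tier': ['1', '2']})
aj = pd.DataFrame({'id': [2, 3], 'tier': ['2', '1']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'.
# suffixes renames the overlapping non-key columns.
merged = df.merge(aj, on='id', suffixes=['_df', '_aj'], indicator=True, how='outer')
labels = dict(zip(merged['id'], merged['_merge'].astype(str)))
```

Here `id` 1 exists only in `df`, 3 only in `aj`, and 2 in both, which is exactly the partitioning signal the solution filters on.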
![image](https://hackmd.io/_uploads/SytC88tAgx.png)

**⬅️ Output**: A tuple of newly created DataFrames, each sorted lexicographically on `key_cols`:

* `merged` → outer merge of `df` and `active_journal` on `key_cols`; use `suffixes=` and `indicator=`
* 3 partitioned DataFrames based on the value in the `_merge` column:
  * `df_only` → all rows in `merged` with keys existing only in `df`
  * `aj_only` → all rows in `merged` with keys existing only in `active_journal`
  * `both` → all rows in `merged` with keys existing in both `df` and `active_journal`
* Copies of slices taken from the partitions. Use the provided function `drop_rename_sort(df, drop_pattern, rename_pattern, key_cols)` to create the following:
  * `new_df` - all columns from `df_only` which are not duplicate columns originating from `active_journal`
  * `expired_df` - all columns from `aj_only` which are not duplicate columns originating from `df`
  * `compare_new_df` - all columns from `both` which are not duplicate columns originating from `active_journal`
  * `compare_old_df` - all columns from `both` which are not duplicate columns originating from `df`
* Return the copies as a tuple

![image](https://hackmd.io/_uploads/HJGP3ItCll.png)

**✅ Solution**:

```python
### Exercise 4 solution
def compare_to_journal(df, key_cols, active_journal):
    # Outer merge with suffixes and indicator set
    merged = df.merge(active_journal,
                      on=key_cols,
                      suffixes=['_df', '_aj'],  # '_df' marks data from `df`, '_aj' data from `active_journal`
                      indicator=True,           # adds a '_merge' column naming which DataFrame each key was found in
                      how='outer')

    # Partition `merged` based on the '_merge' column, then drop it
    df_only = merged.loc[merged['_merge'] == 'left_only'].drop('_merge', axis=1)
    aj_only = merged.loc[merged['_merge'] == 'right_only'].drop('_merge', axis=1)
    both = merged.loc[merged['_merge'] == 'both'].drop('_merge', axis=1)

    # Pull data originating from the appropriate source
    new_df = drop_rename_sort(df_only, '_aj$', '_df$', key_cols)
    expired_df = drop_rename_sort(aj_only, '_df$', '_aj$', key_cols)
    compare_new_df = drop_rename_sort(both, '_aj$', '_df$', key_cols)
    compare_old_df = drop_rename_sort(both, '_df$', '_aj$', key_cols)

    # Return the results
    return (new_df, expired_df, compare_new_df, compare_old_df)
```

![image](https://hackmd.io/_uploads/ryE4hLKAgx.png)

**⚙️ Pandas Concepts**:

* `df = df1.merge(df2, ...)` → see Exercise 0 for the full breakdown of `how=`, `on=`, and `left_on/right_on=`
* `suffixes=` → a list-like input appending the given suffixes to overlapping column names
* `indicator=True/False` → creates a column called `_merge` that records which DataFrame each merge key came from

### 🔹 Exercise 5 - Detect Changed Records
**🎯 Goal**:

* Detect which records changed vs. did not change between two aligned DataFrames.
* Ignore any “audit-only” columns when deciding whether a record changed.

**➡️ Input**:

* `compare_new_df` → DataFrame (new/current values), same shape/index/columns as the old
* `compare_old_df` → DataFrame (old/previous values), aligned with `compare_new_df`
* `audit_cols` → list of column names to exclude from the comparison

![image](https://hackmd.io/_uploads/HJP493YRlg.png)

**⬅️ Output**:

* `unchanged` → rows from `compare_new_df` where all comparable columns match the old
* `old_changed` → rows from `compare_old_df` where any comparable column differs
* `new_changed` → rows from `compare_new_df` where any comparable column differs

**✅ Solution**:

```python
### Exercise 5 solution
def compare_changes(compare_new_df, compare_old_df, audit_cols):
    ###
    # Handle the case of empty DataFrame inputs:
    # return 3 empty DataFrames with the proper columns
    if compare_new_df.shape[0] == 0:
        return compare_new_df.copy(), compare_new_df.copy(), compare_new_df.copy()

    # Identify all columns which are not `audit_cols`
    cols = [c for c in compare_new_df.columns if c not in audit_cols]

    # Create a Boolean mask - True when there is any difference between
    # the two frames, ignoring `audit_cols`
    different = (compare_new_df[cols] != compare_old_df[cols]).any(axis=1)

    # Use the mask to partition the DataFrames and return the result
    unchanged = compare_new_df.loc[~different, :]
    old_changed = compare_old_df.loc[different, :]
    new_changed = compare_new_df.loc[different, :]
    return (unchanged, old_changed, new_changed)
```

![image](https://hackmd.io/_uploads/SJrm5htAgg.png)

**⚙️ Pandas Concepts**:

### 🔹 Exercise 6 - Convert Timestamp to String & Assign Data

**🎯 Goal**: Given a DataFrame, `df`, that contains `eff_dt` and `exp_dt`, assign those columns the following values:

* `eff_dt` → `data_date`, a Timestamp that needs to be converted to a string
* `exp_dt` → '9999-12-31'

**✅ Solution**:

```python
### Exercise 6 solution
def add_records(df, data_date):
    ###
    new_df = df.copy()
    data_date_string = data_date.strftime('%Y-%m-%d')
    new_df['eff_dt'] = data_date_string
    new_df['exp_dt'] = '9999-12-31'
    return new_df
    ###
```

**⚙️ Pandas Concepts**:

* `df['column_name']` → access a column
* `df['column_name'] = data` → assign the same data to every row in the column
* `timestamp.strftime()` → convert a Timestamp to a string
  * 📘 https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.strftime.html

### 🔹 Exercise 7 - Expire Records

**🎯 Goal**: Change the `exp_dt` column values to be 1 day prior to the provided `data_date`.

**➡️ Input**: Timestamp `data_date` and DataFrame `df`, which can be assumed to have the columns:

* `eff_dt`
* `exp_dt`

![image](https://hackmd.io/_uploads/rJb2hnFRll.png)

**⬅️ Output**: A copy of DataFrame `df` whose `exp_dt` values are one day before the provided `data_date`.

**✅ Solution**:

```python
### Exercise 7 solution
def expire_records(df, data_date):
    df = df.copy()
    df['exp_dt'] = (data_date - pd.Timedelta('1 day')).strftime('%Y-%m-%d')
    return df
```

![image](https://hackmd.io/_uploads/HJR-anFCll.png)

**⚙️ Pandas Concepts**:

* 📘 https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.strftime.html
* 📘 https://pandas.pydata.org/docs/user_guide/timeseries.html

### 🔹 Exercise 8 - Time Travel

**🎯 Goal**: Return the records that were in effect as of a given date.

**➡️ Input**: Timestamp `data_date` and DataFrame `df`, which has at least the columns:

* `eff_dt`
* `exp_dt`

**⬅️ Output**: A new DataFrame containing all records where `eff_dt` ≤ `data_date` ≤ `exp_dt`.
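One detail worth noting: the journal columns are assigned as 'YYYY-MM-DD' strings (see Exercises 6 and 7), and comparing such strings with `<=` works because ISO-formatted dates sort lexicographically in chronological order. A quick check on made-up dates:

```python
import pandas as pd

# ISO 'YYYY-MM-DD' strings compare in chronological order,
# so plain string comparisons can implement a date-window filter.
eff_dt = pd.Series(['2023-01-01', '2023-02-15', '2023-12-01'])
in_window = (eff_dt <= '2023-02-28')
```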
**✅ Solution**:

```python
### Exercise 8 solution
def time_travel(journal, data_date):
    eff_mask = journal['eff_dt'] <= data_date
    exp_mask = journal['exp_dt'] >= data_date
    mask = eff_mask & exp_mask
    return journal.loc[mask, :].drop(columns=['eff_dt', 'exp_dt'])
```

![image](https://hackmd.io/_uploads/HJyipnFReg.png)

**⚙️ Pandas Concepts**:

---

# PMT2-SP23

[Link to PMT2-SP23 on Vocareum](https://labs.vocareum.com/main/main.php?m=editor&asnid=4277966&stepid=4277967&hideNavBar=1)

### 🔸 dfs_to_conn(conn_dfs, index=False)

Provides a function that turns Pandas DataFrames into a SQLite database connection.

```python
def dfs_to_conn(conn_dfs, index=False):
    import sqlite3
    conn = sqlite3.connect(':memory:')
    for table_name, df in conn_dfs.items():
        df.to_sql(table_name, conn, if_exists='replace', index=index)
    return conn
```

### 🔸 Goodreads Data (grdbconn)

SQL Table → `Interactions`

* `user_id`
* `book_id`
* `is_read` → user marked the book as read
* `rating` → user gave the book a rating
* `is_reviewed` → user wrote a public review

![image](https://hackmd.io/_uploads/By2piuqAll.png)

**💡 Note**: Row 2 is an example of the user viewing the book and saving it to their personal library.

### 🔹 Exercise 0 - SQL, Pandas and Python

**🎯 Goal**: Create a function that calculates the following:

* Total number of interactions → `SELECT COUNT()`
* Number of unique user IDs → `SELECT COUNT(DISTINCT)`
* Number of unique books → `SELECT COUNT(DISTINCT)`
* Number of interactions where the user:
  * read the book
  * rated the book
  * reviewed the book

**➡️ Input**: SQL Table → `Interactions`

* `user_id`
* `book_id`
* `is_read` → user marked the book as read
* `rating` → user gave the book a rating
* `is_reviewed` → user wrote a public review

**⬅️ Output**: A string formatted as follows:

```
There are 370,818 interactions.
- Unique users: 2,000
- Unique books: 138,633
- Number of reads: 208,701 (56.3% of all interactions)
- Number of ratings: 194,243 (52.4%)
- Number of reviews: 23,720 (6.4%)
```

**✅ Solution**:

```python
### Exercise 0 solution ###
def summarize_interactions_str(conn):
    # Use or adapt this template as you see fit:
    template = """There are {} interactions.
- Unique users: {}
- Unique books: {}
- Number of reads: {} ({}% of all interactions)
- Number of ratings: {} ({}%)
- Number of reviews: {} ({}%)"""
    ### BEGIN SOLUTION
    # Step 1 - Read the Interactions table into a DataFrame
    df = pd.read_sql("SELECT * FROM Interactions", conn)

    # Step 2 - Count the UNIQUE users and books using len()
    n_users = len(df['user_id'].unique())
    n_books = len(df['book_id'].unique())

    # Step 3 - Count reads, ratings, and reviews:
    #   `is_read` → 1/0 (T/F), so a sum suffices
    #   `rating` → 1-5, where 0 means not rated; the conditional creates a Boolean
    #   `is_reviewed` → 1/0 (T/F)
    n_reads = df['is_read'].sum()
    n_rated = (df['rating'] > 0).sum()
    n_rev = df['is_reviewed'].sum()

    # Step 4 - Wrap all of our numbers in the provided template
    #   f"{var_name}" → 12345 | f"{var_name:,}" → 12,345
    template = f"""There are {len(df):,} interactions.
- Unique users: {n_users:,}
- Unique books: {n_books:,}
- Number of reads: {n_reads:,} ({n_reads/len(df)*100:.1f}% of all interactions)
- Number of ratings: {n_rated:,} ({n_rated/len(df)*100:.1f}%)
- Number of reviews: {n_rev:,} ({n_rev/len(df)*100:.1f}%)"""
    return template
    ### END SOLUTION
```

![image](https://hackmd.io/_uploads/BkpMBc5Cgg.png)

**⚙️ Concepts/References**:

* Pandas/Python - we can combine these:
  * `var_name = df['col_name']` → attach a Series to a variable
  * `len(df['col_name'])` → returns the length of the Series (# of rows)
  * `df['col_name'].sum()` → sums the values in a Series
    * 📘 [Pandas Documentation: .sum()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html)
  * `f"{x:,}"` → adds thousands separators (i.e. 1,000)
  * `f"{x:.1f}"` → formats with a single decimal place

### 🔹 Exercise 1 - SQL - Query f-string, Pandas - Manipulation

**🎯 Goal**: Return a DataFrame that provides a count of how many interactions (rows) there are for each unique value.

**➡️ Input**: `col`, the name of the column, and `conn`, the database connection for `Interactions`.
SQL Table → `Interactions`

* `user_id`
* `book_id`
* `is_read` → user marked the book as read
* `rating` → user gave the book a rating
* `is_reviewed` → user wrote a public review

![image](https://hackmd.io/_uploads/r1yNYqqRee.png)

**⬅️ Output**: A DataFrame with two columns:

* `col` → a column with the same name as the given input column, holding the unique values
* `count` → a column with the number of interactions for each unique value

**✅ Solution**:

```python
### Exercise 1 solution
def count_interactions_by(col, conn):
    ### BEGIN SOLUTION
    # query = f"SELECT {col}, COUNT(*) AS count FROM Interactions GROUP BY {col}"
    # return pd.read_sql(query, conn)
    ### END SOLUTION
    ### MY SOLUTION
    # Step 1: Read the full Interactions table into a DataFrame
    df = pd.read_sql("SELECT * FROM Interactions", conn)
    # Step 2: Group by the specified column and count how many rows fall in each group
    grouped = df.groupby(col).size().reset_index(name="count")
    return grouped
    ### END MY SOLUTION

### demo function calls ###
display(count_interactions_by(demo_col_ex1, demo_conn_ex1))
display(count_interactions_by('is_read', demo_conn_ex1))
```

![image](https://hackmd.io/_uploads/S19Vj950eg.png)

**⚙️ Concepts/References**:

* SQLite
  * `query = f"..."` → wrap the SQL query in an f-string
  * `SELECT {col}, COUNT(*) AS count` → select the provided column and count all rows (* = all)
  * `FROM Interactions` → within this table
  * `GROUP BY {col}` → group the results by the provided column
* Pandas
  * `pd.read_sql(query, conn)` → return a DataFrame of the resulting table

### 🔸 Power Laws Data

++Power Law Distributions++: a probability density $f(x)$ is a power law if it behaves like $\frac{1}{x^d}$ for some constant $d$ when $x$ is large.
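A tiny numeric check of that definition: for $f(x) = 1/x^d$, doubling $x$ always scales the density by the same factor $2^{-d}$, which is why logarithmic (log-two) bins are a natural fit for such data:

```python
# For a power law f(x) = 1/x**d, the ratio f(2x)/f(x) is the constant 2**(-d):
# every doubling of x shrinks the density by the same factor.
d = 2.0

def f(x):
    return x ** -d

ratios = [f(2 * x) / f(x) for x in (10.0, 100.0, 1000.0)]  # all equal to 0.25
```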
### 🔹 Exercise 2 - Numpy, Pandas - Series Manipulation

**🎯 Goal**: Given a `series` object holding values from 1 to $n$, count how many of those integers lie within the log-two bins:

* $[1, 2)$: that is, starting at one, up to but excluding 2;
* $[2, 4)$: starting at two, up to but excluding 4;
* $[4, 8)$: starting at 4, up to but excluding 8;
* ...
* and $[2^{k-1}, 2^k)$: starting at $2^{k-1}$, up to but excluding $2^k$, where $2^k$ is the first power of two greater than $n$.

**➡️ Input**: `series` is a Pandas Series object holding the values.

**⬅️ Output**: A DataFrame with two columns:

* `bin start` → the value of the left edge of a bin; these are integers starting at 1, all of the form $2^i$
* `count` → the number of values in the series that lie in $[2^{i}, 2^{i+1})$

**💡 Notes and Hints**:

1. You may assume all input values are integers greater than or equal to 1.
2. Given a value $x$, the next largest power of two is $2^k$ where $k = \lceil \log_2 x \rceil + 1$.
3. A helpful function is `pandas.cut` (`pd.cut`), but you certainly do not have to use it.
4. Recall that you should omit empty bins.

**✅ Solution**:

```python
### Exercise 2 solution
def log2bin_count(series):
    ### BEGIN SOLUTION
    from pandas import cut
    bins = make_log2bins(series)
    bin_owners = cut(series, bins=bins, labels=bins[:-1], right=False).astype(int)
    counts = bin_owners.value_counts()
    df = counts.to_frame().reset_index()
    df = df.rename(columns={'index': 'bin_start', 0: 'count'})
    return df

def make_log2bins(series):
    from numpy import log2, ceil, arange
    pmax = int(ceil(log2(series.max()))) + 1
    return 2**arange(pmax)
    ### END SOLUTION
```

![image](https://hackmd.io/_uploads/H1GMZj9Rle.png)

**⚙️ Concepts/References**:

### 🔹 Exercise 3 - FREEBIE

**🎯 Goal**:

**➡️ Input**:

**⬅️ Output**:

**✅ Solution**:

**⚙️ Concepts/References**:

### 🔹 Exercise 4 - SQL - `WHERE` Conditional

**🎯 Goal**: Use SQLite/Pandas to return a DataFrame where the rating is 4 or more.
**➡️ Input**: `conn` → connection to the database containing the `Interactions` table.

**⬅️ Output**: A DataFrame where the rating is 4 or more.

**✅ Solution**:

```python
### Exercise 4 solution
def form_analysis_sample(conn):
    ###
    df = pd.read_sql("""SELECT * FROM Interactions WHERE rating >= 4""", conn)
    return df
    ###

### demo function call ###
form_analysis_sample(demo_conn_ex4)
```

![image](https://hackmd.io/_uploads/SklPqkiRxg.png)

**⚙️ Concepts/References**:

### 🔹 Exercise 5 - Pandas - Group and Reformat a DataFrame

**🎯 Goal**: Given a DataFrame and a threshold integer signifying connection strength, determine which pairs of users are connected.

**➡️ Input**:

* `ubdf` → A DataFrame with the columns `user_id` and `book_id` where each row indicates the user gave a rating of 4 or more
* `threshold` → integer threshold on connection strength

![image](https://hackmd.io/_uploads/Skp5beo0gg.png)

**⬅️ Output**: A new DataFrame with the columns:

* `user_id_x` → a user id
* `user_id_y` → another user id
* `count` → number of books they both rated in common (must be at least `threshold`)

**✅ Solution**:

```python
### Exercise 5 solution
def connect_users(ubdf, threshold):
    ### BEGIN SOLUTION
    # Step 1 — Self-join on book_id to produce all user pairs who touched the same book.
    # Why: joining ubdf to itself on book_id gives every (user_x, user_y) pair that co-rated that book.
    paired = ubdf.merge(ubdf, on='book_id')

    # Step 2 — Group by user pairs and count how many shared books they have.
    # Why: the number of rows per (user_id_x, user_id_y) equals their shared-book count.
    counts = (
        paired.groupby(['user_id_x', 'user_id_y'])
        .size()
        .reset_index()
        .rename(columns={0: 'count'})
    )

    # Step 3 — Remove trivial self-pairs (a user paired with themselves).
    # Why: we only care about connections between *different* users.
    non_self = counts[counts['user_id_x'] != counts['user_id_y']]

    # Step 4 — Keep only pairs with shared-book count ≥ threshold.
    # Why: enforce the minimum strength required to call them "connected".
    strong = non_self[non_self['count'] >= threshold]

    # Step 5 — Clean up the index for a tidy return.
    # Why: downstream code often assumes a simple RangeIndex.
    result = strong.reset_index(drop=True)

    return result

### demo function call ###
connect_users(demo_ubdf_ex5, demo_threshold_ex5)
```

![image](https://hackmd.io/_uploads/B1E_Wxs0xx.png)

**⚙️ Concepts/References**:

* Pandas
  * `df.merge(df, on='col_id')` → merging a DataFrame with itself on a specified column creates all pairs of rows that match on that value

![image](https://hackmd.io/_uploads/rJSNGliCxl.png)

### 🔹 Exercise 6

**🎯 Goal**: Turn a list of sets into a DataFrame, where each set is a collection of user ids.

**➡️ Input**: `communities`, a list of sets, where each set's index in the list represents its `comm_id` and the values within the set are the `user_id`s.

**⬅️ Output**: A DataFrame where the columns are:

* `user_id` → the values within the set
* `comm_id` → the index position of the set within the list

**✅ Solution**:

```python
### Exercise 6 solution
def assign_communities(communities):
    ###
    # Step 1 — Prepare an empty list to collect rows
    rows = []

    # Step 2 — Loop through each community using enumerate()
    # Why: enumerate provides both the community id (its index) and the set of users
    for i, community in enumerate(communities):
        # Step 3 — Loop through each user in the current community
        for member in community:
            # Step 4 — Append a dictionary for each (comm_id, user_id) pair
            rows.append({
                'comm_id': i,
                'user_id': member
            })

    # Step 5 — Convert the list of dictionaries into a DataFrame
    df = pd.DataFrame(rows)

    # Step 6 - Reorder the columns
    r = ['user_id', 'comm_id']
    return df[r]
    ###

### demo function call ###
assign_communities(demo_communities_ex6)
```

![image](https://hackmd.io/_uploads/SJGokoiCee.png)

**⚙️ Concepts/References**:

* `enumerate()` → Yields `(index, item)` pairs for an iterable, with the index starting from 0

### 🔹
Exercise 7 - Pandas - Merge, Group, Average by Row

**🎯 Goal**: Calculate the averages of `is_read`, `rating` and `is_reviewed` per community.

**➡️ Input**:

* `intdf` → A DataFrame
  * `user_id`
  * `book_id`
  * `is_read`
  * `rating`
  * `is_reviewed`
* `comdf` → A DataFrame
  * `user_id`
  * `comm_id`

![image](https://hackmd.io/_uploads/SJTCNsiRle.png)

**⬅️ Output**: A new DataFrame with the columns:

* `comm_id` → one row per community
* `is_read` → mean of the `is_read` values across all `intdf` rows for the community's users
* `rating` → mean of the `rating` values across all `intdf` rows for the community's users
* `is_reviewed` → mean of the `is_reviewed` values across all `intdf` rows for the community's users

**✅ Solution**:

```python
### Exercise 7 solution
def means_by_community(intdf, comdf):
    # Step 1 — Define the columns whose means we'll calculate
    VALUES = ['is_read', 'rating', 'is_reviewed']

    # Step 2 — Merge the interactions and communities DataFrames on 'user_id'
    # Why: to attach each user's community ID to their interaction data
    df = intdf.merge(comdf, on='user_id')

    # Step 3 — Group all rows by community ID
    # Why: so that each community's users' interactions are grouped together
    grouped = df.groupby('comm_id')

    # Step 4 — Select only the columns we want to average
    # Why: restrict aggregation to the three numeric columns of interest
    selected = grouped[VALUES]

    # Step 5 — Compute the mean for each selected column within each community
    # Why: summarize user behavior per community
    means = selected.mean()

    # Step 6 — Move 'comm_id' back from the index to a normal column
    # Why: groupby() sets it as an index, but we want a clean DataFrame output
    result = means.reset_index()

    # Step 7 — Return the final result
    return result
    ### END SOLUTION ###

### demo function call ###
demo_result_ex7 = means_by_community(demo_intdf_ex7, demo_comdf_ex7)
display(demo_result_ex7)
```

![image](https://hackmd.io/_uploads/H1TdsTiRee.png)

**⚙️ Concepts/References**:

* Pandas
  * `df.groupby('col_name')` → Groups rows by the supplied
column name. DOES NOT COMPUTE ANYTHING, just creates a lazy grouped view. ### 🔹 Exercise 8 - Pandas - **🎯 Goal**: For EACH community, calculate what percentage of its users read each book. Identify the top `rank` books. Some useful strategies include: * Determine the number of unique users in each community. You need this information to get percentages. * Determine how many users read each book by community. * Normalize these counts by the community size. * Sort and return the results, retaining just the top rank books in each community. **➡️ Input**: * `xcdf` → Dataframe with the columns: * `user_id` → a user id (integer) * `book_id` → a book id that this user read (integer) * `comm_id` → the community id which the user belongs to (integer) * `rank` → integer indicating the number of top ranked books to return ![image](https://hackmd.io/_uploads/Sy0YpasRgx.png) **⬅️ Output**: A new DataFrame with the following columns: * `comm_id` → the community id * `book_id` → a book id that was read in that community * `percent` → percent of the community that read that book * `comm_size` → number of users in the community **✅ Solution**: ```python ### Exercise 8 solution def get_topreads_by_community(xcdf, rank=5): """ Goal: For each community, find the top `rank` books by % of community members who read them. Returns columns: ['comm_id', 'book_id', 'percent', 'comm_size']. """ # Step 1 — Count unique users per community (community sizes). # Why: We need the denominator to compute "what % of the community read book X". sizesdf = count_users_by_comm(xcdf, outcol='comm_size') # Step 2 — Count readers per (community, book). # Why: This is the numerator for the percentage. readersdf = count_readers_by_comm(xcdf, outcol='num_read') # Step 3 — Join counts so each (comm_id, book_id) row has both num_read and comm_size. # Why: Put numerator and denominator together to compute percentages. 
df = readersdf.merge(sizesdf, on='comm_id') # Step 4 — Compute the percentage of community that read the book. # Why: percent = (num_read / comm_size) * 100. df['percent'] = df['num_read'] / df['comm_size'] * 100 # Step 5 — Sort within each community by: # (a) percent DESC, then (b) book_id ASC (tie-breaker). # Why: Define the ranking order and ensure tie-breaking by lowest book_id. dftop = ( df.sort_values(['comm_id', 'percent', 'book_id'], ascending=[True, False, True]) .reset_index(drop=True) ) # Step 6 — Take top `rank` rows per community. # Why: Keep only the top-N books for each community after sorting. dftop = ( dftop.groupby('comm_id').apply(lambda g: g.iloc[:rank]).reset_index(drop=True) ) # Step 7 — Return only required columns. # Why: Match the specified output schema. return dftop[['comm_id', 'book_id', 'percent', 'comm_size']] def count_users_by_comm(xcdf, **kwargs): """Count unique users per community (drop duplicates first).""" # Step A — Keep only (comm_id, user_id) and drop duplicates. # Why: A user may have multiple rows; each user counts once per community. cudf = xcdf[['comm_id', 'user_id']].drop_duplicates() # Step B — Group by community and count rows → community size. # Why: Each remaining row is a distinct user in that community. return count_by_group(cudf, 'comm_id', **kwargs) def count_readers_by_comm(xcdf, **kwargs): """Count readers per (community, book).""" # Step A — Group by (comm_id, book_id) and count rows. # Why: Each row indicates a user read that book; counting gives readers per book per community. return count_by_group(xcdf, ['comm_id', 'book_id'], **kwargs) def count_by_group(df, cols, outcol='count'): # Step — Generic "count rows per group". # Why: Reusable helper to produce a tidy (group cols + count) DataFrame. 
return df.groupby(cols).size().reset_index().rename(columns={0: outcol}) ### END SOLUTION ### demo function call ### get_topreads_by_community(demo_xcdf_ex8, demo_rank_ex8) ``` ![image](https://hackmd.io/_uploads/HyV-AToCge.png) **⚙️ Concepts/References**: ### 🔹 Exercise 9 **🎯 Goal**: **➡️ Input**: **⬅️ Output**: **✅ Solution**: **⚙️ Concepts/References**: ### 🔹 Exercise 10 **🎯 Goal**: **➡️ Input**: **⬅️ Output**: **✅ Solution**: **⚙️ Concepts/References**: --- # PMT2 - FA23 [Link to PMT2-F23 on Vocareum](https://labs.vocareum.com/main/main.php?m=editor&asnid=4278040&stepid=4278041&hideNavBar=1) ## 🧩 American Football Data A Pandas DataFrame with the columns: * `play_id` → unique id for a play * `drive_id` → unique id for a possession * `event_id` → unique id for a game * `type` → type of play ran * `scoringPlay` → T/F to signal a score * `awayScore` → score of the away team AFTER the play occurred * `homeScore` → score of the home team AFTER the play occurred * `period` → the quarter * `clock` → time remaining when the play started * `homeTeamPoss` → T if home team had possession of the ball when play started * `down` → the down * `distance` → distance to line to gain * `yardsToEndzone` → distance to the endzone ![image](https://hackmd.io/_uploads/BkOj_RsRxg.png) ### 🔹 Exercise 0 - Pandas - Time Conversion using Timedelta Objects **🎯 Goal**: Convert the game clock to time remaining in the game. 
$$
timeLeft = (15 \times 60)(4 - \text{period}) + 60(\text{clockMinutes}) + \text{clockSeconds}
$$

**➡️ Input**: `all_events_df` where the `clock` field is a string with the format `{clockMinutes}:{clockSeconds}`

**⬅️ Output**: A new DataFrame that returns the following:

* `index` → DO NOT RESET
* `event_id`
* `period`
* `clock`
* `timeLeft`

**✅ Solution**:

```python
### Exercise 0 solution
def calc_time_left(all_events_df: pd.DataFrame) -> pd.DataFrame:
    ###
    #Step 1 - Make a copy of the DF
    all_df = all_events_df.copy()

    #Step 2 - Use the provided formula to calculate the time left - convert the output to an integer
    all_df['timeLeft'] = (
        900 * (4 - all_df['period'])
        + (pd.to_timedelta('00:' + all_df['clock']) // pd.Timedelta(seconds=1))
    ).astype('int64')

    return all_df
    ###

### demo function call
calc_time_left(demo_all_events_df_ex0)
```

![image](https://hackmd.io/_uploads/Hk9zmy30ll.png)

**⚙️ Concepts/References**:

* Pandas
  * `pd.to_timedelta(args)` → converts a string, number or Series into a Timedelta object (a difference in time)
  * `//` → integer division operator
  * `pd.Timedelta(value=..., unit=None)` → creates a single Timedelta object (a duration)
  * `pd.to_timedelta(args) // pd.Timedelta(value=..., unit=None)` → converts our supplied clock time to Timedelta objects (e.g. 1:23 → 0 days 00:01:23) then divides the given Timedelta by 1 second (e.g. 0 days 00:01:23 → 83)

### 🔹 Exercise 1 - Pandas - Multiple Conditions, `.loc`

**🎯 Goal**: Remove events that are not plays, using the given list, and remove events/games that went into extra periods.
**➡️ Input**: A DataFrame that contains:

* `type`
* `timeLeft`
* `event_id`

**⬅️ Output**: A copy of the DataFrame that:

* `non_play_rows` → identify all rows with a `type` value that is in `non_play_types`
* `ot_event_ids` → identify all unique `event_id` values occurring in rows where `timeLeft` is less than 0
* `ot_event_rows` → identify all rows where the `event_id` value is in `ot_event_ids`

**✅ Solution**:

```python
### Exercise 1 solution
def filter_non_plays_and_ot(df: pd.DataFrame) -> pd.DataFrame:
    non_play_types = ['Penalty', 'End Period', 'Two-minute warning', 'Timeout',
                      'End of Half', 'End of Game', 'Official Timeout',
                      'Defensive 2pt Conversion', 'Two Point Rush', 'Extra Point Good']

    #Step 1 - Create a copy
    df1 = df.copy()

    #Step 2 - Define filters
    ##non_play_rows → all rows with a 'type' value in non_play_types
    non_play_rows = df1['type'].isin(non_play_types)
    ##ot_event_ids → all unique event id values in rows where timeLeft < 0
    ot_event_ids = df1.loc[df1['timeLeft'] < 0, 'event_id'].unique()
    ##ot_event_rows → all rows where the 'event_id' is in ot_event_ids
    ot_event_rows = df1['event_id'].isin(ot_event_ids)

    #Step 3 - Return the filtered DataFrame
    filtered_df = df1[~non_play_rows & ~ot_event_rows]
    return filtered_df

### demo function call
filter_non_plays_and_ot(demo_df_ex1)
```

![image](https://hackmd.io/_uploads/HJsfzfhCeg.png)

**⚙️ Concepts/References**:

* `df['col_name'].isin(list/series)` → returns a T/F mask for the column, checking membership in the supplied list/series
* `df.loc[df['col_name1'] < 0, 'col_name2'].unique()` → indexes the provided rows and column and returns the unique values as an array
* `.loc[ROW, COLUMN]` → index the provided row and column

**🧠 Insight**: Think of `.loc[rows, columns]`: `rows` is actually a Boolean mask, list, or slice; `columns` is the field we want to return.
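The `.loc` pattern from this exercise can be seen on a tiny toy frame. The values below are invented for illustration; only the column names mirror the exercise:

```python
import pandas as pd

# Toy frame with the exercise's columns (made-up values)
df = pd.DataFrame({
    'type':     ['Rush', 'Penalty', 'Pass', 'Timeout'],
    'timeLeft': [900, -10, 450, -5],
    'event_id': [1, 2, 1, 2],
})

# Boolean mask: rows whose type is in a drop list
drop_mask = df['type'].isin(['Penalty', 'Timeout'])

# .loc[rows, column]: event_ids of rows where timeLeft < 0 (the "OT" games)
ot_ids = df.loc[df['timeLeft'] < 0, 'event_id'].unique()

# Keep rows that are neither non-plays nor part of an OT game
kept = df[~drop_mask & ~df['event_id'].isin(ot_ids)]
print(kept['type'].tolist())  # ['Rush', 'Pass']
```

Note that `~` negates the whole Boolean mask and `&` combines masks element-wise, which is why each condition is built as its own named mask before the final filter.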
### 🔹 Exercise 2 - Pandas - Partition Column by Another & Create Subsets **🎯 Goal**: Model the probability that the offense will advance the ball to the line to gain. **➡️ Input**: The helper function `converted_by_drive(group: pd.DataFrame) -> pd.DataFrame` which takes a DataFrame containing **only records from the same drive** and returns a new DataFrame with the new column `converted` added. `event_df` → DataFrame * `drive_id` * `type` * `scoringPlay` * `down` * `timeLeft` **⬅️ Output**: There are 2 steps to this output. 1. Subsets * Partition `event_df` by `drive_id` * Apply `converted_by_drive` to each partition 2. Concatenation * Concatenate all results into a single DataFrame **✅ Solution**: ```python ### Exercise 2 solution ### Helper function provided as part of the starter code def converted_by_drive(group: pd.DataFrame) -> pd.DataFrame: group = group.sort_values('timeLeft', ascending=False)\ .reset_index(drop=True) offensive_touchdown_types = ['Passing Touchdown', 'Rushing Touchdown', 'Fumble Recovery (Own)', 'Rush', 'Pass Reception'] # `pd.DataFrame.shift` might be useful later... 
    first_downs = (group['down'] == 1).shift(-1, fill_value=False)
    scores = (group['scoringPlay'] == True) & (group['type'].isin(offensive_touchdown_types))
    group['converted'] = (first_downs | scores)
    return group

### Your solution
def converted(event_df: pd.DataFrame) -> pd.DataFrame:
    ###
    #Step 0 - Copy of our DataFrame
    a_df = event_df.copy()

    #Step 1 - Partition using groupby
    drive_groups = a_df.groupby('drive_id')  #→ returns a DataFrameGroupBy object

    #Step 2 - APPLY the function AND drop the index
    ##Drop because the index changes to the drive_id values
    result = drive_groups.apply(converted_by_drive).reset_index(drop=True)

    return result
    ###

converted(demo_event_df_ex2)
```

![image](https://hackmd.io/_uploads/Sku3KMnAee.png)

**⚙️ Concepts/References**:

* `df.groupby('col_name')` → returns a DataFrameGroupBy object; an aggregation or apply step is needed to get a DataFrame back
* `df.apply(function_name)` → applies the provided function to the DataFrame

### 🔹 Exercise 3 - Pandas - Sort, Identify Max & Python Conditions

**🎯 Goal**: Model win probability by creating a new column that returns a T/F given the parameters:

* Identify the final score for both teams by either sorting and indexing OR using the maximum value
* Determine which team had the higher score
* Return a copy of `event_df` with the new column `won` that is set to True if the team on offense won the game (`homeTeamPoss`)

**➡️ Input**: `event_df` → DataFrame with the columns:

* `awayScore`
* `homeScore`
* `homeTeamPoss`
* `timeLeft`

**⬅️ Output**: A new DataFrame that is a copy of `event_df` but now has the column `won`.
**✅ Solution**:

```python
### Exercise 3 solution
def who_won(event_df: pd.DataFrame) -> pd.DataFrame:
    ###
    #Step 1 - Create a copy of the input DataFrame
    df = event_df.copy()

    #Step 2 - Sort by timeLeft; optional since we use .max(), but we could have indexed the values from the last row instead
    df = df.sort_values(by='timeLeft', ascending=False)

    #Step 3 - Identify the final scores of each team
    home_final = df['homeScore'].max()
    away_final = df['awayScore'].max()

    #Step 4 - Logical operations using the final scores
    if home_final > away_final:
        df['won'] = df['homeTeamPoss']
    elif away_final > home_final:
        df['won'] = ~df['homeTeamPoss']
    else:
        df['won'] = False

    return df
    ###

### demo function call
demo_soln_ex3 = who_won(demo_event_df_ex3)
display(demo_soln_ex3.head())
display(demo_soln_ex3.tail())
```

![image](https://hackmd.io/_uploads/HJNPPB2Rxe.png)

**⚙️ Concepts/References**:

* `df['col_name'].max()` → Finds the maximum value
* `df.sort_values(by=col_name, ascending=T/F)` → Sorts the DataFrame by the values within the provided column

### 🔹 Exercise 4 -

**🎯 Goal**: Calculate a column, `nextScore`, that provides an integer based on the next `scoringPlay=True` and the scoring team, using `homeTeamPoss`, `awayScore` and `homeScore`.

**➡️ Input**:

* `event_df` → A DataFrame, as pictured below
* `get_update_list(df: pd.DataFrame) → list` → A function that accepts a DataFrame of the play data for a single game that has been sorted by `timeLeft`. Returns a list of tuples, where each tuple is `(a, b)` such that `a` is the play number and `b` is the change in score in relation to the home team

![image](https://hackmd.io/_uploads/rJDLYPCAge.png)

**⬅️ Output**: A DataFrame that has been through the following:

* Copy `event_df`, sort the copy by `timeLeft` in descending order, and reset the index. We will call this copy `df`.
* Call `get_update_list` on the `df`. We will call the output `update_list`.
* Create a new column `nextScore` in `df`.
Set the values of `nextScore` based on the `update_list`. Hint: clever use of the `zip` function and slicing of `update_list` will make this an easy task.
* Negate `df['nextScore']` in all rows where `df['homeTeamPoss']` is False

**✅ Solution**:

```python
### Exercise 4 solution
def add_next_score(event_df: pd.DataFrame) -> pd.DataFrame:
    ### BEGIN SOLUTION
    #Step 1 - Create a copy
    df = event_df.copy()

    #Step 2 - Sort the DataFrame by timeLeft - Reset the index as a precaution
    df = event_df.sort_values('timeLeft', ascending=False).reset_index(drop=True)

    #Step 3 - Call the function on the DataFrame
    #Note: We did not use .apply() because we're not changing values in the DataFrame, but rather creating tuples from the DataFrame
    update_list = get_update_list(df)

    #Step 4 - Slice the list, zip it together, update nextScore
    ##Slice the list so that one copy is missing the last element, the other the first
    ##Zip these two together (Recall (a, b) → a - play # where score updates, b - score change)
    ##This pairs consecutive tuples so that we create ranges and values, based on the change of score and when it happened
    ##.loc[row, column] specifies the row range and the column to be updated, and sets it equal to the value
    for start, end in zip(update_list[:-1], update_list[1:]):
        df.loc[(start[0]):(end[0]), 'nextScore'] = end[1]

    #Step 5 - Flip the sign of the nextScore value when homeTeamPoss is False
    df.loc[~df['homeTeamPoss'], 'nextScore'] = -1*df.loc[~df['homeTeamPoss'], 'nextScore']

    df['nextScore'] = df['nextScore'].astype(int)
    return df
    ### END SOLUTION

### demo function call
demo_output_ex4 = add_next_score(demo_event_df_ex4)
```

![image](https://hackmd.io/_uploads/BkOAYPA0ex.png)

**⚙️ Concepts/References**:

*

### 🔹 Exercise 5 - Pandas - Partition and Shift Row Values by 1

**🎯 Goal**: The score field currently updates AFTER a play has occurred, but should be updated so that it reflects the score BEFORE the play.
**➡️ Input**: `all_events_df` → DataFrame

![image](https://hackmd.io/_uploads/SypufF0Ale.png)

**⬅️ Output**: A copy of the input, produced as follows:

* Copy `all_events_df` and sort it by `event_id` and `timeLeft`. The order of `event_id` doesn't matter (sorting helps things run faster). However, sorting the records for each event by `timeLeft` in descending order is critical (which is why we must sort by both columns). We will call the sorted copy `df`.
* Partition `df` by `event_id`.
* For each partition, set the `homeScore` and `awayScore` values to their values from one row prior. Since the first row has no rows which are prior to it, we will set both score values to 0.
  * Hint - The `pd.DataFrame.shift` method will be helpful in accomplishing this step.
* Concatenate the partitions together.
* Return the result.

**✅ Solution**:

```python
### Exercise 5 solution
def lag_score(all_events_df: pd.DataFrame) -> pd.DataFrame:
    ###
    df = all_events_df.copy()
    df = df.sort_values(by=['event_id', 'timeLeft'], ascending=[True, False])
    # Partition by event_id, apply the helper to each partition, and concatenate
    df = df.groupby('event_id', as_index=False).apply(helper).reset_index(drop=True)
    return df
    ###

def helper(df: pd.DataFrame) -> pd.DataFrame:
    df[['homeScore', 'awayScore']] = df[['homeScore', 'awayScore']].shift(1, fill_value=0)
    return df

### demo function call
demo_output = lag_score(demo_all_events_df_ex5)
print(demo_output)
```

![image](https://hackmd.io/_uploads/SJ6ojBh0eg.png)

**⚙️ Concepts/References**:

### 🔹 Exercise 6 - Conditional Filters and Create Multiple DataFrames

**🎯 Goal**: Using 1 input DataFrame, create 3 DataFrames with the appropriate columns.

**➡️ Input**: `all_events_df` → A DataFrame that includes the columns:

* `'distance'`
* `'nextScore'`
* `'timeLeft'`
* `'play_id'`
* `'won'`
* `'homeTeamPoss'`
* `'awayScore'`
* `'converted'`
* `'down'`
* `'yardsToEndzone'`
* `'event_id'`
* `'homeScore'`

![image](https://hackmd.io/_uploads/ByUyrtA0ll.png)

**⬅️ Output**: Several DataFrames

* Copy `all_events_df`.
We will call this `df`.
* Identify all rows where `df['down']` is 0. Filter them out of `df`.
* Calculate `conversion_data` from `df`.
  * should include only fields in `conversion_fields` (part of the starter code)
  * should only include rows where `df['down']` is 3 or 4
  * should not include rows whose `df['type'].lower()` contains "field goal".
  * should not include rows whose `df['type'].lower()` contains "punt".
* Calculate `ep_data` from `df`
  * `ep_data` should include only fields in `ep_fields` (part of the starter code)
* Calculate `wp_data` from `df`
  * `wp_data` should include only fields in `win_prob_fields` (part of the starter code)
* Return `conversion_data`, `ep_data` and `wp_data`

**✅ Solution**:

```python
### Exercise 6 solution
def build_model_inputs(all_events_df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame, pd.DataFrame):
    conversion_fields = ['event_id', 'play_id', 'down', 'distance', 'converted']
    ep_fields = ['event_id', 'play_id', 'down', 'distance', 'yardsToEndzone', 'nextScore']
    win_prob_fields = ['event_id', 'play_id', 'down', 'distance', 'yardsToEndzone',
                       'timeLeft', 'awayScore', 'homeScore', 'homeTeamPoss', 'won']
    ###
    df = all_events_df.copy()
    df = df[df['down'] != 0]

    conversion_data = df.copy()
    conversion_data = conversion_data[(conversion_data['down'] == 3) | (conversion_data['down'] == 4)]
    conversion_data = conversion_data[~(conversion_data['type'].str.lower().str.contains("field goal"))]
    conversion_data = conversion_data[~(conversion_data['type'].str.lower().str.contains("punt"))]
    col_drop = df.columns.difference(conversion_fields)
    conversion_data = conversion_data.drop(columns=col_drop)

    ep_data = df.copy()
    ep_drop = df.columns.difference(ep_fields)
    ep_data = ep_data.drop(columns=ep_drop)

    wp_data = df.copy()
    wp_drop = df.columns.difference(win_prob_fields)
    wp_data = wp_data.drop(columns=wp_drop)

    return conversion_data, ep_data, wp_data
    ###

### demo function call
(demo_conversion_data_ex6, demo_ep_data_ex6, demo_wp_data_ex6) =
build_model_inputs(demo_all_events_df_ex6) for df in (demo_conversion_data_ex6, demo_ep_data_ex6, demo_wp_data_ex6): display(df) ``` ![image](https://hackmd.io/_uploads/Syd-LYARxg.png) **⚙️ Concepts/References**: ### 🔹 Exercise 7 **🎯 Goal**: **➡️ Input**: **⬅️ Output**: **✅ Solution**: ```python ### Exercise 7 solution def win_prob_model(df: pd.DataFrame) -> np.ndarray: ### ### BEGIN SOLUTION offense_ep = ep_model(df) mu = (offense_ep+df['homeScore']-df['awayScore']) * df['homeTeamPoss'] + \ (offense_ep+df['awayScore']-df['homeScore']) * (1 - df['homeTeamPoss']) sigma = 13.85 / np.sqrt((3600)/(1 + df['timeLeft'])) return 1 - norm.cdf(0.5, loc=mu, scale=sigma) ### END SOLUTION ### ### demo function call win_prob_model(demo_df_ex7) ``` ![image](https://hackmd.io/_uploads/SJDrDFCAgg.png) **⚙️ Concepts/References**: ### 🔹 Exercise 8 **🎯 Goal**: **➡️ Input**: **⬅️ Output**: **✅ Solution**: ```python ### Exercise 8 solution def sim_outcomes(df: pd.DataFrame, models: dict) -> pd.DataFrame: conversion_model = models['convert'] fg_model = models['fg'] win_prob_model = models['win_prob'] def_win_prob_model = lambda d: 1 - win_prob_model(d) df = df.copy() goal_to_go = df['distance'] == df['yardsToEndzone'] fg_make_prob = fg_model(df['yardsToEndzone']) go_succeed_prob = conversion_model(df['distance']) df['go_succeed_prob'] = go_succeed_prob df['fg_make_prob'] = fg_make_prob # Make columns for win probability if a particular outcome occurs df['fg_make_wp'] = def_win_prob_model(simulate_fg_make(df)) df['fg_miss_wp'] = def_win_prob_model(simulate_fg_miss(df)) df['fail_wp'] = def_win_prob_model(simulate_fourth_down_fail(df)) df.loc[goal_to_go, 'succeed_wp'] = def_win_prob_model(simulate_fourth_down_succeed(df))[goal_to_go] df.loc[~goal_to_go, 'succeed_wp'] = win_prob_model(simulate_fourth_down_succeed(df))[~goal_to_go] # Make df_punt df_punt = simulate_punt(df) ### ### BEGIN SOLUTION fg_miss_prob = 1-fg_make_prob go_fail_prob = 1-go_succeed_prob df['punt_wp'] = 
def_win_prob_model(df_punt)
    df['kick_wp'] = fg_make_prob*df['fg_make_wp'] + fg_miss_prob*df['fg_miss_wp']
    df['go_wp'] = go_succeed_prob*df['succeed_wp'] + go_fail_prob*df['fail_wp']
    return df.round(4)
    ### END SOLUTION ###

### demo function call
models = {
    'convert': conversion_model,
    'fg': fg_model,
    'win_prob': win_prob_model
}
sim_outcomes(demo_df_ex8, models)
```

![image](https://hackmd.io/_uploads/rkYmvK00gx.png)

**⚙️ Concepts/References**:

# PMT2 - SP24

[Link to PMT2-SP24 on Vocareum](https://labs.vocareum.com/main/main.php?m=editor&asnid=4491736&stepid=4491737&hideNavBar=1)

## 🧩 Data - New York Collisions

Detailed traffic collision data collected by the City of New York. There are 3 tables in our connection, but we do not know the relationships yet:

* `CRASHES`
* `VEHICLES`
* `PERSON`

**💡 Note**: Quickly view the table structure using: `pd.read_sql("PRAGMA table_info(TABLE_NAME);", conn)`

**💡 Note**: Quickly view the data using: `pd.read_sql("SELECT * FROM table_name LIMIT 5;", conn)`

### 🔹 Exercise 0 - SQL - Dynamic Query w/ Parameter, Pandas - SQL Query to DataFrame

**🎯 Goal**: Identify the column names of a given `table_name`.
**➡️ Input**:

* `table_name` → string, name of a table
* `conn` → Connection object, name of the connection to our database

**⬅️ Output**: A list of column names for the given `table_name`, extracted by:

* verifying that `table_name` contains only letters, numbers, or underscores
  * if not, raise a ValueError
* generating a dynamic query to pass in a parameter
* using the query to generate a DataFrame
* extracting the column names as a list and ordering them alphabetically

**✅ Solution**:

```python
### Solution - Exercise 0
def get_table_cols(table_name, conn):
    ###
    #Step 0 - Ensure that table_name contains only letters, numbers, or underscores
    if re.search(r'\W', table_name) is not None:
        raise ValueError()

    #Step 1 - Dynamic query with a parameterized FROM clause
    ##Dynamic is simply writing the query as a variable
    sql = f'''
    SELECT *
    FROM {table_name}
    LIMIT 1'''

    #Step 2 - Load the query as a DataFrame
    df = pd.read_sql(sql, conn)

    #Step 3 - Extract the column names as a list
    ##Could also use df.columns.to_list()
    col_names = list(df.columns)

    return sorted(col_names)
    ###

### Demo function call
print(get_table_cols(table_name='vehicles', conn=conn))
```

![image](https://hackmd.io/_uploads/ByJZ-qAAxx.png)

**⚙️ Key Concepts**:

* Regex
  * `re.search(pattern, string)` → Locate the first occurrence of a pattern in a string
  * `\W` → Non-word character, any character that is not A-z0-9_
* SQLite
  * `sql = f''' '''` → Dynamic query syntax, used for passing parameters
  * `SELECT *` → Select all columns
  * `FROM {table_name}` → From this table
  * `LIMIT 1` → Return only the first row
* Pandas
  * `pd.read_sql(query_variable, conn)` → Turns the SQL query into a DataFrame

### 🔹 Exercise 1 - Count Items in Sublist

**🎯 Goal**: Given a list of lists, identify names that are present in at least `num` of the lists.

**➡️ Input**:

* `lol` → list
* `num` → integer

**⬅️ Output**: A set containing all names appearing in at least `num` of the lists.
**✅ Solution**:

```python
### Solution - Exercise 1
def intersection_of_cols(lol: list, num=None) -> set:
    assert len(lol) > 1, '`lol` must have at least 2 items'
    if num is None:
        num = len(lol)
    ###
    from collections import Counter

    #Step 1 - Apply Counter across all the sublists
    ##Counter returns a dictionary-like count of each item
    ##for l in lol → iterates over each sublist
    ##for col_name in l → iterates over each name within that sublist
    c = Counter([col_name for l in lol for col_name in l])

    #Step 2 - Access the key, value pairs in our Counter and apply the condition
    d = {col for col, count in c.items() if count >= num}
    return d
    ###

### Demo function call
for num in [2,3]:
    demo_intersection = intersection_of_cols(
        lol=[['ringo', 'phyllis', 'angela', 'paul'],
             ['kevin', 'paul', 'oscar', 'kelly', 'phyllis'],
             ['phyllis', 'oscar', 'ryan', 'john', 'toby']],
        num=num)
    print(f'num={num}; intersection={demo_intersection}')
```

![image](https://hackmd.io/_uploads/SyGLI50Cgg.png)

**⚙️ Key Concepts**:

* `Counter()` → Counts occurrences of each item in an iterable, returning item → count mappings like a dictionary

### 🔹 Exercise 2 - SQL - Aggregate Functions, GROUP BY, and SUBSTR

**🎯 Goal**: Build a SQL query and return the result as a Pandas DataFrame.

**➡️ Input**: `conn` → Connection to the database.

**⬅️ Output**: A DataFrame for the query that:

* `YEAR`: Year for which summary stats are computed. (taken from the `CRASH_DATE` field)
* `INJURY_COUNT`: Count of records with 'Injured' values in the `PERSON_INJURY` field for the same year.
* `DEATHS_COUNT`: Count of records with 'Killed' values in the `PERSON_INJURY` field for the same year.
* `UNSPECIFIED_COUNT`: Count of records with 'Unspecified' values in the `PERSON_INJURY` field for the same year.
**✅ Solution**:

```python
### Solution - Exercise 2
def summarize_person(conn):
    ###
    #No input parameter, so no f-string needed
    query = """
    SELECT
        SUBSTR(CRASH_DATE, 7) AS YEAR,
        SUM(PERSON_INJURY='Injured') AS INJURY_COUNT,
        SUM(PERSON_INJURY='Killed') AS DEATHS_COUNT,
        SUM(PERSON_INJURY='Unspecified') AS UNSPECIFIED_COUNT
    FROM PERSON
    GROUP BY YEAR"""
    return pd.read_sql(query, conn)
    ###

### Demo function call
demo_person_summary_df = summarize_person(conn)
display(demo_person_summary_df)
```

![image](https://hackmd.io/_uploads/rJVVCqCRel.png)

**⚙️ Key Concepts**:

* SQLite
  * `SELECT` → Used to pull columns from the table and perform calculations on them
  * `SUBSTR(string, start_position, length) AS COL_NAME` → Pull a substring from a given string
  * `SUM(TABLE_COL_NAME = condition) AS COL_NAME` → The condition evaluates each row to True/False (1, 0) and sums up all occurrences when the condition is true
  * `COUNT(CASE WHEN condition THEN 1 END)` → The same SUM-condition trick, but written with COUNT
  * `GROUP BY COL_NAME` → Groups results based on the column name AND performs the aggregate functions based on that grouping

### 🔹 Exercise 3 - SQLite - SUBSTR, COUNT(\*), Quotient (/)

**🎯 Goal**: Return a DataFrame using a query statement.

**➡️ Input**: `conn` → Connection to the database.

**⬅️ Output**: A DataFrame for the query that:

* `YEAR`: Year for which summary stats are computed. (taken from the "CRASH DATE" field)
* `CRASH_COUNT`: Count of crashes in the same year.
* `DEATHS`: Sum of the "NUMBER OF PERSONS KILLED" values in the same year.
* `PEDESTRIAN_DEATHS`: Sum of the "NUMBER OF PEDESTRIANS KILLED" values in the same year.
* `PEDESTRIAN_DEATH_SHARE`: The quotient PEDESTRIAN_DEATHS/DEATHS.
* `DEATHS_PER_CRASH`: The quotient DEATHS/CRASH_COUNT.

**✅ Solution**:

```python
### Solution - Exercise 3
def summarize_crashes(conn):
    ###
    query = '''
        SELECT
            SUBSTR("CRASH DATE", 7) YEAR,
            COUNT(*) CRASH_COUNT,
            SUM("NUMBER OF PERSONS KILLED") DEATHS,
            SUM("NUMBER OF PEDESTRIANS KILLED") PEDESTRIAN_DEATHS,
            SUM("NUMBER OF PEDESTRIANS KILLED") / SUM("NUMBER OF PERSONS KILLED") PEDESTRIAN_DEATH_SHARE,
            SUM("NUMBER OF PERSONS KILLED") / COUNT(*) DEATHS_PER_CRASH
        FROM CRASHES
        GROUP BY YEAR
    '''
    return pd.read_sql(query, conn)
    ###

### Demo function call
demo_crashes_summary = summarize_crashes(conn)
display(demo_crashes_summary)
```

![image](https://hackmd.io/_uploads/SkhMasCAxx.png)

**⚙️ Key Concepts**:
* SQLite
* `SELECT` → the aggregates in the SELECT list are computed per `GROUP BY` group
* `COUNT(*)` → counts all rows
* `SUM("COLUMN NAME")` → sums the values within the column
* `SUM("COLUMN NAME") / SUM("COLUMN NAME")` → divides one sum by another

**💡 Note**: `"COLUMN NAME"` needs double quotes when the name contains a space or special character; otherwise write it as `COLUMN_NAME`.

### 🔹 Exercise 4 - SQLite - INNER JOIN, WHERE not null, SELECT multiple conditions

**🎯 Goal**: Return a DataFrame using a query statement that meets the criteria below.

**➡️ Input**:
* `conn` → connection to the database
* `CRASHES` → A table from our database ![image](https://hackmd.io/_uploads/SympI5kyZg.png)
* `VEHICLES` → A table from our database ![image](https://hackmd.io/_uploads/SkFePcyJZl.png)

**⬅️ Output**: A DataFrame that provides a summary with these columns:
* `VEHICLE_TYPE`: The upper-case `VEHICLE_TYPE` from `VEHICLES` for which summary stats are computed.
  * You may encounter records with NULL values in the `VEHICLE_TYPE` field. These should not be included in your calculation.
* `COUNT`: The number of records with the same `VEHICLE_TYPE`.
* `OUT_OF_STATE`: The number of records with the same `VEHICLE_TYPE` having upper-case `STATE_REGISTRATION` values in `VEHICLES` other than 'NY'.
  * NULL values in the `STATE_REGISTRATION` field do not count as `OUT_OF_STATE`.
* `DEATHS`: The sum of the "NUMBER OF PERSONS KILLED" values in `CRASHES` associated with the same `VEHICLE_TYPE`.
  * You can relate `VEHICLES` to `CRASHES` by the `COLLISION_ID` column, which is common to both tables.

Additional requirements:
* The result should be sorted by `COUNT` in descending order, then by `VEHICLE_TYPE` in ascending order.
* The result should include at most 30 rows.
* Your result should only be derived from records which have a `COLLISION_ID` value present in both tables.
* Your result should not include any row with NULL as the `VEHICLE_TYPE` value.

**✅ Solution**:

```python
### Solution - Exercise 4
def summarize_vehicles(conn):
    # SOLUTION (VERSION 1) - SQL Only:
    # Google Search: 'sqlite join'
    # Google Result: https://www.sqlitetutorial.net/sqlite-join/
    query = '''
        SELECT
            upper(vehicle_type) AS VEHICLE_TYPE,
            count(*) AS COUNT,
            sum(upper(state_registration) <> "NY"
                and state_registration is not null) AS OUT_OF_STATE,
            sum("NUMBER OF PERSONS KILLED") AS DEATHS
        FROM vehicles
        INNER JOIN crashes USING(collision_id)
        WHERE vehicle_type is not null
        GROUP BY upper(vehicle_type)
        ORDER BY COUNT desc, VEHICLE_TYPE
        LIMIT 30
    '''
    return pd.read_sql(query, conn)

### Demo function call
demo_vehicle_summary = summarize_vehicles(conn)
display(demo_vehicle_summary)
print()
```

![image](https://hackmd.io/_uploads/HJsoHqy1be.png)

**⚙️ Key Concepts**:
* SQL
* `SELECT` → compute and/or relabel columns
* `FROM table_primary` → pull data from this (primary) table
* `INNER JOIN table_secondary USING(column_name)` → matches rows on the shared column, merges them, and discards non-matches. Keeps all other columns from both tables.
  * 📘 https://www.sqlitetutorial.net/sqlite-join/
* `WHERE` → filters records, keeping only the rows that meet the criteria

### 🔹 Exercise 5 - Pandas - Query to DF w/ Params, Cast string to datetime64

**🎯 Goal**: Given a query, connection, and parameters, run the query and convert (cast) the date string to datetime64.

**➡️ Input**:
* `qry` → a provided query
* `conn` → connection to the database
* `params` → provided parameters

**⬅️ Output**: A DataFrame with the `CRASH_DATE` column changed from a text type to a datetime64 type.

**✅ Solution**:

```python
### Solution - Exercise 5
def geo_filter_crashes(qry, conn, params):
    ###
    df = pd.read_sql(qry, conn, params=params)
    df['CRASH_DATE'] = pd.to_datetime(df['CRASH_DATE'])
    return df
    ###

### Demo function call
demo_qry = '''
    select
        "CRASH DATE" AS CRASH_DATE,
        LATITUDE,
        LONGITUDE,
        "NUMBER OF PERSONS KILLED" AS PERSONS_KILLED
    from crashes
    where latitude is not null and latitude between ? and ?
        and longitude is not null and longitude between ? and ?
        and PERSONS_KILLED is not null and PERSONS_KILLED > 0
'''
demo_df = geo_filter_crashes(demo_qry, conn, [40.5, 40.95, -74.1, -73.65])
display(demo_df.head())
```

![image](https://hackmd.io/_uploads/Bk93GjJk-g.png)

**⚙️ Key Concepts**:
* Pandas
* `df['column_name'] = pd.to_datetime(df['column_name'])` → converts a column from text to datetime64

### 🔹 Exercise 6 - NumPy - linspace, meshgrid, Density Grid

**🎯 Goal**: Build a `grid_size` × `grid_size` grid of (x, y) points over the given bounds and evaluate `estimator` at every point.

**➡️ Input**:
* `bounds` → a list `[x_min, x_max, y_min, y_max]`
* `grid_size` → an integer, the number of points along each axis
* `estimator` → a callable applied to the stacked grid points

**⬅️ Output**: A 2-D NumPy array of shape `(grid_size, grid_size)` holding the estimator's value at each grid point.

**✅ Solution**:

```python
### Solution - Exercise 6
# np.linspace(start, stop, num) → num evenly spaced points from start to stop
def density_grid(bounds, grid_size, estimator):
    ###
    x = np.linspace(bounds[0], bounds[1], grid_size)
    y = np.linspace(bounds[2], bounds[3], grid_size)
    xx, yy = np.meshgrid(x, y)
    test = np.vstack((xx.flatten(), yy.flatten()))
    a = estimator(test).reshape((grid_size, grid_size))
    return a
    ###

### Demo function call
demo_bounds = [0, 9, 0, 90]
demo_grid_size = 10
# estimator adds the x and y values
demo_estimator = lambda a: a[0] + a[1]
demo_grid = density_grid(demo_bounds, demo_grid_size, demo_estimator)
print(demo_grid)
```

![image](https://hackmd.io/_uploads/Sy9-VikkWx.png)

**⚙️ Key Concepts**:
* NumPy
* `np.meshgrid(x, y)` → builds coordinate matrices from the two axis vectors
* `np.vstack((a, b))` → stacks the flattened coordinate arrays into a 2 × N array

---

# PMT2 - FA24

[Link to PMT2-FA24 on Vocareum](https://labs.vocareum.com/main/main.php?m=editor&asnid=4278380&stepid=4278381&hideNavBar=1)

## 🧩

### 🔹 Exercise 0 - Pandas - to_datetime(args, utc=True), dt.tz_convert('America/New_York')

**🎯 Goal**: Given a list of dictionaries, create a DataFrame that meets the criteria.
**➡️ Input**: `records` → a list of dictionaries, where each dictionary contains information for a single row of data

**⬅️ Output**: A Pandas DataFrame containing the data held in `records` with columns:
* `Firm` → in the dictionary under the `Stock` key
* `Date` → convert the string to datetime64[ns, America/New_York]
* `Open`
* `Close`
* `Volume`

**✅ Solution**:

```python
### Solution - Exercise 0
def load_data(records: list) -> pd.DataFrame:
    ###
    df = pd.DataFrame(records)
    df = df.rename(columns={'Stock': 'Firm'})
    df['Date'] = pd.to_datetime(df['Date'], utc=True).dt.tz_convert('America/New_York')
    # df['Date'] = df['Date'].dt.tz_localize('UTC')
    return df
    ###

### Demo function call
### `convert_timezone_types` demo ---------------------------------------------
# This helper function will help you change the data-type
# -- Input: pd.Series, with dtype "string"
# -- Output: pd.Series, with dtype "datetime64[ns, America/New_York]"
type_conversion_demo = pd.DataFrame({
    'Date': ['2017-11-15 00:00:00-05:00', '2021-01-15 00:00:00-05:00'],
    'DemoColumn': [1, 2]
})

# Convert the type!
type_conversion_demo['Date'] = convert_timezone_types(type_conversion_demo['Date'])
assert type_conversion_demo['Date'].dtype == 'datetime64[ns, America/New_York]'
# It works!

### `load_data` demo ----------------------------------------------------------
# We'll subsample the dataset to 10 records, just to make debugging easier.
random.seed(6040)
stock_data_subsample = random.sample(stock_data, k=10)

# Here's what your solution produces!
print('Here is what your loaded data looks like in Pandas:')
stock_subsample_df = load_data(stock_data_subsample)
display(stock_subsample_df)
```

![image](https://hackmd.io/_uploads/Hy_ltQxJ-x.png)

**⚙️ Key Concepts**:
* Pandas
* `pd.to_datetime(column, utc=True)` → converts strings with timezones to a single universal (UTC) timezone
* `.dt.tz_convert('America/New_York')` → converts a datetime type to the specified timezone

📘 https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
📘 https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.tz_convert.html
📘 https://stackoverflow.com/questions/55385497/how-can-i-convert-my-datetime-column-in-pandas-all-to-the-same-timezone/55386135#55386135

### 🔹 Exercise 1 - Pandas - .head(), .sample()

**🎯 Goal**: Preview the data sources.

**➡️ Input**:
* `records` → DataFrame with the results from the exercise above
* `conn` → SQLite connection to the database

**⬅️ Output**: A tuple with:
* `records_preview` → DataFrame containing the first 10 entries
* `nflx_originals_preview` → DataFrame containing 5 different, random entries from the nflx_originals SQL database table
* `nflx_top_preview` → DataFrame containing 5 different, random entries from the nflx_top SQL database table

**✅ Solution**:

```python
### Solution - Exercise 1
def preview_data(records: list, conn: sqlite3.Connection) -> tuple:
    # Take the first 10 rows of the records DataFrame
    records_preview = records.head(n=10)

    # Select everything from nflx_originals and randomly sample 5 rows
    nflx_originals_preview = pd\
        .read_sql('SELECT * FROM nflx_originals', conn)\
        .sample(5)

    # Select everything from nflx_top and randomly sample 5 rows
    nflx_top_preview = pd\
        .read_sql('SELECT * FROM nflx_top', conn)\
        .sample(5)

    return records_preview, nflx_originals_preview, nflx_top_preview

### Demo function call
records_preview, nflx_originals, nflx_top = preview_data(stock_df, conn)
print('Preview of Stock Data')
display(records_preview)
print('------------------------------------------------------------------------')
print('Preview of Netflix Originals Table')
display(nflx_originals)
print('------------------------------------------------------------------------')
print('Preview of Netflix Top Programs Table')
display(nflx_top)
```

![image](https://hackmd.io/_uploads/rJ6DyVlJZl.png)

**⚙️ Key Concepts**:
* Pandas
* `df.head(n=10)` → returns the first 10 rows
* `df.sample(5)` → returns 5 randomly sampled rows

### 🔹 Exercise 2 - Pandas - Shift, Column Calculations, Outer Merge

**🎯 Goal**: Calculate `Day_Growth` for a given firm.

**➡️ Input**:
* `stock_records` → a DataFrame
* `firm` → a string specifying a particular firm's stock

**⬅️ Output**: `stock_growth` → a DataFrame with the columns:
* `Date`
* `Firm`
* `Open`
* `Close`
* `Volume`
* `Day_Growth`

Requirements/Steps:
1. Do not use SQLite.
2. Create a new Pandas DataFrame with `Day_Growth` appended as a new column. The growth should be specific to the firm specified by the `firm` parameter.
3. Some days have no recorded values for opening and closing prices. For example, stock prices are not recorded during the weekend.
   * You should use the next valid close price to calculate growth in these instances.
4. Daily growth is defined as
   $$growth_{0} = (close_{0} / close_{-1}) - 1$$
   * In English: "Daily Growth is Today's Close divided by Yesterday's Close, minus one."
   * $close_{x}$ is the value of the Close, offset by $x$. So, $close_{-1}$ is the value of Close, offset by 1 position.
   * You may find the `pd.DataFrame.shift()` method, and the related `GroupBy` shift method, helpful for offsetting the close value!
5. To repeat: Some days have no recorded values for opening and closing prices. For example, stock prices are not recorded during the weekend.
   * You should use the next valid close price to calculate growth in these instances.
   * For example, if:
     A. The close price on Friday is 10
     B. There are no close prices for the next two days (Saturday and Sunday)
     C. The close price on Monday is 12
     ... then the daily growth for Monday should be equal to (12/10) - 1 = 0.2.
   * You can make this easier by filtering out all NaN values when you initially calculate the growth.
6. Your final DataFrame (`stock_growth`) should contain the same number of records as `stock_records` for that firm.
7. Finally, reset the indexes before returning the resulting DataFrame.

![image](https://hackmd.io/_uploads/ryx4jzWybx.png)

**✅ Solution**:

```python
### Solution - Exercise 2
def calculate_daily_growth(stock_records: pd.DataFrame, firm: str) -> pd.DataFrame:
    ### BEGIN SOLUTION
    # Filter the DataFrame for the firm's stock
    stockdf = stock_records[stock_records['Firm'] == firm].copy()

    # Remove NA values in a copy.
    # We'll join everything back together at the end.
    stockdf2 = stockdf.copy()
    stockdf2 = stockdf2.dropna()

    # Lag the close and calculate the daily growth
    stockdf2['Yest_Close'] = stockdf2.groupby('Firm')['Close'].shift(1)
    stockdf2['Day_Growth'] = (stockdf2['Close'] / stockdf2['Yest_Close']) - 1

    # Merge everything back together and fill missing values with 0
    output = stockdf.merge(
        stockdf2,
        how='outer',
        left_index=True,
        right_index=True,
        suffixes=(None, '_y'))
    output = output[['Date', 'Firm', 'Open', 'Close', 'Volume', 'Day_Growth']]
    output['Day_Growth'] = output['Day_Growth'].fillna(0)

    # Return the result
    return output.reset_index(drop=True)
    ### END SOLUTION

### Demo function call
print("Demo daily growth for Netflix's stock:")
daily_growth_nflx_demo = calculate_daily_growth(
    stock_records=calculate_daily_growth_demo_input,
    firm='NFLX'
)
display(daily_growth_nflx_demo)
```

**⚙️ Key Concepts**:
* Pandas
* `df[df['column_name'] == condition]` → applies a condition to filter out ROWS that do not meet the criteria
* `df['new_column_name'] = df.groupby('column_name')['column_to_shift'].shift(#)` → groups by the given column and shifts the values of the column-to-shift down by # rows within each group

### 🔹 Exercise 3 - Pandas - Year String Extraction, Product and Column Calculation

**🎯 Goal**: Compute each firm's annual growth by multiplying daily returns within the same calendar year, then attach that value back to every row of the original records for that firm-year.

**➡️ Input**:
* `stock_records` → `pd.DataFrame` produced after Exercise 2 (must include `Date`, `Firm`, and `Day_Growth`).

**⬅️ Output**:
* `annual_growth` → pd.DataFrame with all original `stock_records` columns plus:
  * `Year` (string)
  * `Annual_Growth` (float), repeated for all rows in the same (Firm, Year).

Requirements/Steps:
* Extract the year from `Date` and store it as a string: e.g., "2019"
* For each `(Firm, Year)`, compute the annual growth as the product
  $$\text{Annual\_Growth} = \prod_i (1 + \text{Day\_Growth}_i) - 1$$
* `Day_Growth` → fill `NaN` with 0
* Merge the results back onto a copy of `stock_records` by `(Firm, Year)`
* Reset the index

**✅ Solution**:

```python
### Solution - Exercise 3
def calculate_annual_growth(stock_records: pd.DataFrame) -> pd.DataFrame:
    ### BEGIN SOLUTION
    # Step 0 - Create a copy
    stockdf = stock_records.copy()

    # Step 1 - Extract the year
    stockdf['Year'] = stockdf['Date'].dt.strftime('%Y')

    # Step 2 - Create the growth-ratio column: add 1 to each Day_Growth value
    ## (this will eventually become Annual_Growth)
    ## Why - break the formula down into small bites rather than one large bite
    stockdf['Growth_Ratio'] = 1 + stockdf['Day_Growth']

    # Step 3 - Keep only the columns needed for the product calculation,
    # group them, and apply .prod() to the Growth_Ratio column
    stockdf2 = stockdf[['Year', 'Growth_Ratio', 'Firm']].copy()
    resultdf = stockdf2.groupby(['Year', 'Firm']).prod() - 1

    # Step 4 - Ensure resultdf has the correct index and column names
    resultdf = resultdf.reset_index()
    resultdf.columns = ['Year', 'Firm', 'Annual_Growth']

    # Step 5 - Merge our DataFrames together
    output = pd.merge(stockdf, resultdf)

    # Step 6 - Reorder and return the output
    r = ['Date', 'Year', 'Firm', 'Open', 'Close', 'Volume', 'Day_Growth', 'Annual_Growth']
    return output[r]
    ### END SOLUTION

### Demo function call
annual_growth_demo_df = calculate_annual_growth(stock_records=stock_daily_growths_reduced_df)
display(annual_growth_demo_df)
```

![image](https://hackmd.io/_uploads/HkFMgFM1be.png)

**⚙️ Key Concepts**:
* Pandas
* `df['date_column'].dt.strftime('%Y')` → returns a Series of strings with the year extracted from the date column
* `df['date_column'].dt.year.astype(str)` → also works
* `df.groupby([list_of_columns]).prod()` → groups by the provided columns and computes the product of each remaining numeric column

### 🔹 Exercise 4 - Pandas - Rolling Average

**🎯 Goal**: Implement a function that calculates a concept known as the 'Golden Cross'.

**➡️ Input**:
* `stockdf` → a Pandas DataFrame. It contains the results calculated by exercise 3
* `firm` → a string. It specifies which firm to calculate the golden cross results for
* `short_average` → an integer. It defines the window for the short moving average, i.e. the number of days to include in the window
* `long_average` → an integer. It defines the window for the long moving average, i.e. the number of days to include in the window

**⬅️ Output**: `golden_cross_results` → DataFrame with all columns from `stockdf` plus:
* `Short_Average`
* `Long_Average`
* `Golden_Cross`

Requirements/Steps:
1. Use Pandas. Do not use SQLite.
2. Filter the DataFrame to include records for the `firm` parameter.
3. Some Close values are recorded as NaN. These should be filtered out before calculating the moving averages.
   * You may find the `pd.DataFrame.notna()` method useful for filtering these values out!
4. Add the columns `Short_Average` and `Long_Average` to the Pandas DataFrame.
   * These columns should calculate the moving average of the `Close` column.
   * The window of the moving average should be equal to the size specified by the `short_average` and `long_average` arguments.
   * You may find the `pd.DataFrame.rolling()` method useful for creating the windows!
5. Add the column `Golden_Cross`, a boolean, to the Pandas DataFrame.
   * This column should indicate whether the short-window moving average is greater than the long-window moving average.
6. You may assume that only one firm is provided for the `firm` argument.
7. You may assume that `stockdf` is already ordered by date in ascending order.
8. Finally, reset the indexes before returning the resulting DataFrame.

**✅ Solution**:

```python
### Solution - Exercise 4
def golden_cross_strategy(stockdf: pd.DataFrame,
                          firm: str,
                          short_average: int,
                          long_average: int) -> pd.DataFrame:
    ###
    # Step 0 - Create a copy
    df = stockdf.copy()

    # Step 1 - Filter rows to the firm string provided
    df = df[df['Firm'] == firm]

    # Step 2 - Filter out NaN values in Close
    df = df[df['Close'].notna()]

    # Step 3 - Compute the rolling averages using .rolling
    df['Short_Average'] = df['Close'].rolling(window=short_average).mean()
    df['Long_Average'] = df['Close'].rolling(window=long_average).mean()

    # Step 4 - Flag where the short average is above the long average
    df['Golden_Cross'] = df['Short_Average'] > df['Long_Average']

    df = df.reset_index(drop=True)
    return df
    ###

## Demo function call
golden_cross_demo_df = golden_cross_strategy(
    stockdf=annual_growth_df,
    firm='NFLX',
    short_average=3,
    long_average=7)

# Let's only look at the first 15 days for our input
display(golden_cross_demo_df.head(15))
```

**⚙️ Key Concepts**:
* Pandas
* `df['column_name'].rolling(window=N).mean()` → computes a rolling average of the column, where `window` is the number of rows in each window

### 🔹 Exercise 5 - Pandas - pivot, Average Growth of Other Firms

**🎯 Goal**: Pivot the daily growth data by firm and compute the average daily growth of every firm other than the chosen one.

**➡️ Input**:
* `daily_growth_by_firm` → a DataFrame of daily growths by firm
* `firm` → a string specifying the chosen firm

**⬅️ Output**: A pivoted DataFrame with one column per firm plus a `Non_{firm}_Average_Growth` column.

**✅ Solution**:

```python
### Solution - Exercise 5
def normalize_stock_growths(daily_growth_by_firm: pd.DataFrame, firm: str) -> pd.DataFrame:
    # Create a list of every firm, except for the one we specified in the inputs
    non_chosen_firms = set(daily_growth_by_firm['Firm']) - set([firm])
    non_chosen_firms = list(non_chosen_firms)

    # Pivot the data for easy manipulation
    pivoted_data = daily_growth_by_firm.pivot(
        index='Date',
        columns='Firm',
        values='Day_Growth'
    ).reset_index()

    # Calculate the average growth for every firm that isn't our firm
    pivoted_data[f'Non_{firm}_Average_Growth'] = pivoted_data[non_chosen_firms]\
        .sum(axis=1) / len(non_chosen_firms)

    return pivoted_data

### Demo function call
normalized_returns_demo_df = normalize_stock_growths(golden_crosses_reduced_df, 'NFLX')
display(normalized_returns_demo_df)
```

**⚙️ Key Concepts**:
* Pandas
* `df.pivot(index=, columns=, values=)` → reshapes the data so each unique value of `columns` becomes its own column

### 🔹 Exercise 6 - SQLite - SELECT, COUNT(\*), GROUP BY, ORDER BY

**🎯 Goal**: Write a function that returns an SQL query that summarizes the trends of Netflix originals by year and genre.

**➡️ Input**: None

**⬅️ Output**: `query` → a Python string with an SQL query. It should query the database for these columns:
* `year`
* `genre`
* `genre_count`

Requirements/Steps:
1. Count the number of Netflix original shows, grouped by year and by genre.
   * The year variable should be extracted from the `premiere` variable. Note that in this case, year will be a string. We suggest you use the STRFTIME SQLite function to extract year from the `premiere` variable.
2. Order them by:
   A. `genre_count` (descending)
   B. `year` (ascending)
   C. `genre` (ascending)
3. Return the top 10 rows.

The database table you will need is named `nflx_originals`. Make sure you rename your column variables so that they match the specifications above.
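Worth a quick sanity check before writing the query: SQLite's STRFTIME returns the requested date part as a text value, which is why `year` comes back as a string. A minimal standalone sketch (the in-memory table and sample dates are made up):

```python
import sqlite3

# Throwaway in-memory database with hypothetical sample rows
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute("CREATE TABLE nflx_originals (title TEXT, premiere TEXT)")
demo_conn.executemany("INSERT INTO nflx_originals VALUES (?, ?)",
                      [('Show A', '2019-04-26'), ('Show B', '2020-07-10')])

# strftime('%Y', ...) extracts the year as a string
rows = demo_conn.execute(
    "SELECT title, strftime('%Y', premiere) AS year FROM nflx_originals").fetchall()
print(rows)  # [('Show A', '2019'), ('Show B', '2020')]
```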
**✅ Solution**:

```python
### Solution - Exercise 6
def summarize_netflix_original_genres() -> str:
    ###
    query = f'''
        SELECT
            strftime('%Y', premiere) AS year,
            genre,
            COUNT(*) AS genre_count
        FROM nflx_originals
        GROUP BY year, genre
        ORDER BY genre_count DESC, year ASC, genre ASC
        LIMIT 10
    '''
    return query
    ###

### Demo function call
summary_query = summarize_netflix_original_genres()
netflix_original_genres = pd.read_sql(summary_query, conn)
display(netflix_original_genres)
```

![image](https://hackmd.io/_uploads/H1iYwFf1Zx.png)

**⚙️ Key Concepts**:
* SQLite
* `SELECT`
* `strftime('%Y', date_column)` → pulls the year values as strings
* `COUNT(*)` → counts the rows for each unique (year, genre) combination

### 🔹 Exercise 7 - SQLite - Subquery

**🎯 Goal**: Write a query that scores shows by their daily top-10 rank and summarizes each show's total and average score.

**➡️ Input**: None

**⬅️ Output**: `query` → string with a SQLite query. The query should return the following columns:
* `title` → title of the show
* `total_score` → sum of the show's scores
* `occurrence` → the number of times a show appeared in the top 10
* `avg_score` → `total_score` divided by `occurrence`

Requirements/Steps:
1. You will need to calculate show scores to solve this problem:
   * A show ranked #1 on a day should earn 10 points towards its total score. A show ranked #2 should earn 9, etc., until a show ranked #10 should earn 1 point.
2. After counting the occurrences, you'll need to calculate the `avg_score` as a floating-point value.
3. Order the results by show:
   * `total_score` in descending order
   * `title` in descending order
4. Return the top 10 rows.

**✅ Solution**:

```python
### Solution - Exercise 7
def calc_netflix_top10_scores() -> str:
    ###
    query = f'''
        WITH scores AS (
            SELECT
                SUM(11 - rank) AS total_score,
                COUNT(title) AS occurrence,
                title
            FROM nflx_top
            GROUP BY title
            ORDER BY total_score DESC, title DESC)
        SELECT
            title,
            total_score,
            occurrence,
            (total_score * 1.0 / occurrence) AS avg_score
        FROM scores
        LIMIT 10
    '''
    return query
    ###

### Demo function call
scores_query = calc_netflix_top10_scores()
netflix_top10_scores = pd.read_sql(scores_query, conn)
display(netflix_top10_scores)
```

![image](https://hackmd.io/_uploads/rJdXnqfkWg.png)

**⚙️ Key Concepts**:
* SQLite
* `WITH name AS (...)` → defines a subquery (CTE) that the outer SELECT can read from
* `total_score * 1.0 / occurrence` → multiplying by 1.0 forces floating-point division

### 🔹 Exercise 8 -

**🎯 Goal**:

**➡️ Input**:

**⬅️ Output**:

**✅ Solution**:

```python
```

**⚙️ Key Concepts**:

### 🔹 Exercise 9 - Pandas - pivot, Merge, Model Matrix

**🎯 Goal**: Prepare a model matrix for regression by combining show scores (independent variables) and relative growth (response variable) into one DataFrame.

**➡️ Input**:
* `normalized_growths_df` → DataFrame from exercise 5
* `show_scores_df` → DataFrame that contains pre-computed scores for each show on a given date

**⬅️ Output**:
* `model_matrix` → DataFrame containing the independent variables (show scores) and the response variable (`Relative_Growth`)

Requirements/Steps:
1. To obtain the independent variables, you will need to pivot `show_scores_df` so that each show is represented as its own column. The index should reflect the dates. The values should be set to the show's score.
   * Fill any missing values with 0's.
   * See `pandas.DataFrame.pivot` for more information.
   * You may also refer to Exercise 5 above.
2. You'll need to join the pivoted DataFrame with `normalized_growths_df` on date. We'll call this `model_matrix`.
3. Return the `model_matrix`, but only with the columns for the shows and `Relative_Growth`.

**✅ Solution**:

```python
### Solution - Exercise 9
def construct_model_inputs(normalized_growths_df: pd.DataFrame,
                           show_scores_df: pd.DataFrame) -> pd.DataFrame:
    ###
    # Pivot show scores so each show becomes its own column; fill missing scores with 0
    ssdf = show_scores_df.copy()
    ssdf = ssdf.fillna(0)
    ssdf = ssdf.pivot(
        index='date',
        columns='title',
        values='score'
    ).reset_index().fillna(0)

    # Collect the show columns (everything except the date column)
    ssdf1 = ssdf.copy()
    ssdf1 = ssdf1.iloc[:, 1:]
    columns = ssdf1.columns.to_list()

    # Append the response-variable column name from the growths DataFrame
    ngdf = normalized_growths_df.copy()
    col1 = ngdf.columns.to_list()
    columns1 = col1[-1]
    columns.append(columns1)
    cols = columns[-1:] + columns[:-1]

    # Join on date and keep only the response + show columns
    model_growth = ssdf.merge(ngdf, how="outer", left_on='date', right_on='Date').dropna()
    model_growth = model_growth[cols]
    return model_growth
    ###

### Demo function call
model_matrix = construct_model_inputs(normalized_growths_df, nflx_scores_demo_df)
display(model_matrix)
```

**⚙️ Key Concepts**:
* Pandas
* `df.pivot(index=, columns=, values=)` → reshapes long data into wide form, one column per unique `columns` value
* `df.merge(other, how='outer', left_on=, right_on=)` → joins two DataFrames on differently named key columns

---

# PMT2-SP25

[Link to PMT2-SP25 on Vocareum](https://labs.vocareum.com/main/main.php?m=editor&asnid=4278070&stepid=4278071&hideNavBar=1)

**1 Point Questions and Takeaways**:
* 0 - Freebie
* 2
  * SQL - Struggled to get an output here: used 'column_name' quoting, which was not necessary
* 7
  * Pandas - Reordering columns - find a faster method to reorder columns in place

**2 Point Questions and Takeaways**:
* 1
* 3
* 6
  * Pandas, Python
* 8
* 9

CTRL+F: "1.4 Schema Diagram"

![image](https://hackmd.io/_uploads/HyNgl7BJZl.png)

Helper Functions:
* `get_table_list(conn)` → retrieves the list of tables from the SQLite database. ![image](https://hackmd.io/_uploads/BkCOx7rkZg.png)
* `get_column_details(conn, table_name)` → retrieves column details for a specific table. ![image](https://hackmd.io/_uploads/SJoclQByWl.png)
* `get_foreign_key_details(conn, table_name)` → retrieves foreign key details for a specific table. ![image](https://hackmd.io/_uploads/BJNCx7B1bl.png)
* `get_table_schema(conn, table_name)` → combines column details and foreign key details for a specific table.
![image](https://hackmd.io/_uploads/HJqZZ7S1-e.png)
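The helper implementations themselves aren't shown in these notes. A minimal sketch of what a helper like `get_table_list` might do, assuming the standard `sqlite_master` catalog query (the demo table names below are made up):

```python
import sqlite3

def get_table_list(conn):
    # sqlite_master is SQLite's built-in catalog of schema objects
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name").fetchall()
    return [name for (name,) in rows]

# Demo on a throwaway in-memory database (hypothetical tables)
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
demo_conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
print(get_table_list(demo_conn))  # ['customers', 'orders']
```

The column and foreign-key helpers would most likely be built the same way around SQLite's `PRAGMA table_info(table_name)` and `PRAGMA foreign_key_list(table_name)` statements.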
