## 🧭 Emoji Legend (Note Formatting Guide)
<details>
<summary>🧭 Click to expand Emoji Legend</summary>
| Icon | Meaning / Use Case | Example Usage |
|:----:|--------------------|----------------|
| 🧩 | **Main topic or section header** — introduces a major theme or concept | `## 🧩 Groupwise Analysis` |
| 🔸 | **Major concept or subsection** — represents a key idea under the main topic | `### 🔸 Why Group Data?` |
| 🔹 | **Detailed operation or example** — used for methods, step-by-step logic, or code examples | `### 🔹 groupby() minimum` |
| ⚙️ | **Mechanics** — explains what’s happening under the hood | “⚙️ Mechanics: Pandas scans column values and builds sub-DataFrames.” |
| 🎯 | **Goal / Purpose** — clarifies what we’re trying to achieve | “🎯 Goal: Count how many rows belong to each group.” |
| 💡 | **Tip / Takeaway / Insight** — highlights key points, advice, or conceptual notes | “💡 Takeaway: Use .agg() for multiple summary stats.” |
| ✅ | **Shortcut / Confirmation** — shows equivalent methods or quick wins | “✅ Equivalent: `reviews.points.value_counts()`” |
| 🧠 | **Conceptual Insight** — emphasizes mental models or deeper understanding | “🧠 Think of groupby() as split → apply → combine.” |
| 📘 | **Reference / Documentation Link** — links to external docs or sources | “📘 [Pandas MultiIndex Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)” |
| 🪄 | **Trick / Tip** — optional visual for clever techniques or shortcuts | “🪄 Use .apply() to manipulate each group independently.” |
| 🧭 | **Legend / Navigation Aid** — used for orientation or overview sections | “## 🧭 Emoji Legend (Note Formatting Guide)” |
---
🪄 *Tip:* All of these are Unicode emojis that work natively in HackMD, GitHub, Notion, and most Markdown renderers.
You can insert them using your OS emoji picker:
- **Windows:** `Win + .`
- **macOS:** `Cmd + Ctrl + Space`
</details>
## 🧭 Table of Contents
[TOC]
---
# PMT2-FA22
[Link to PMT2-FA22 on Vocareum](https://labs.vocareum.com/main/main.php?m=editor&asnid=4490389&stepid=4490392&hideNavBar=1)
Pandas heavy notebook - in fact, the vast majority of it is Pandas!
## 🧩 Data Tables
We are given the following four tables:

### 🔸 Strategy
On Midterm 1, my primary failure seemed to stem from trying to force myself through the exercises in numerical order.
The strategy for this midterm was to locate the 1 point questions and attempt those first. Then move on to the 2 point questions. If there is time remaining, I will attempt to tackle the 3 point questions.
### 🔹 Exercise 0 - Filtering, Structuring and Cleaning (Pandas)
**🎯 Goal**: Return a new DataFrame of active promotions (time_left > 0) that contains the following keys:
* `cust_id`
* `promo`
* `service`
**➡️ Input**: The DataFrame `promos` that has the columns:
* `cust_id` → id of a single customer
* `promo` → name of a promo in which the customer has participated for the associated service
* `service` → a service for which the customer has participated in a promo
* `time_left` → the time the customer has left on the promo

**⬅️ Output**: A new DataFrame with active promotions, which does the following:
* `cust_id` → The id of the customer for each unique combination with `service`
* `promo` → the active promo for the `cust_id/service` pair:
    * when `time_left` = '0' → the `promo` value is `'base'`
    * when `time_left` != '0' → use the promo value from the customer's active row for that service
* `service` → identifies a service for which the customer has an active promotion. **The customer may not actually be subscribed to the service!**

**✅ Solution**:
```python
### Exercise 0 solution
def get_active_promos(promos):
# 1) Select only the columns we need
cols_needed = ['cust_id', 'service', 'promo', 'time_left']
promos_subset = promos[cols_needed]
# 2) Build a DataFrame of unique (cust_id, service) pairs
cust_service = promos_subset[['cust_id', 'service']]
cust_service_unique = cust_service.drop_duplicates()
# 3) Identify rows with an active promo (time_left != '0')
is_active_row = promos_subset['time_left'] != '0'
active_rows = promos_subset[is_active_row]
# 4) Keep only columns needed from active rows
active_cols = ['cust_id', 'service', 'promo']
active_promos_only = active_rows[active_cols]
# 5) Ensure one active promo per (cust_id, service) pair
active_promos_unique = active_promos_only.drop_duplicates(subset=['cust_id', 'service'])
# 6) Left-join unique pairs to their active promo (if present)
merged = cust_service_unique.merge(active_promos_unique, on=['cust_id', 'service'], how='left')
# 7) Fill missing promos with 'base'
missing_mask = merged['promo'].isna()
merged.loc[missing_mask, 'promo'] = 'base'
# 8) Reorder columns exactly as required
result = merged[['cust_id', 'service', 'promo']]
# 9) Return the final DataFrame
return result
```
**⚙️ Pandas Concepts**:
* `df = df[[list of column names]]` → filters a DataFrame to return a new one with the supplied column names
* `df.drop_duplicates()` → removes duplicate values from a DataFrame
* `mask = df[col_name] != '0'` → returns a Boolean Series for the supplied condition
* `df1.merge()` → merges a defined table into df1 on the inputs (and order) below:
* `df2` → table that will merge into df1
* `how=` → define the way in which the tables will merge:
* `"left"` → adds columns from df2 on the right side of df1, where columns with the same name and values are matched across each table
* `"right"` → adds columns from df2 on the left side of df1, where columns with the same name and values are matched across each table
* `"outer"` → adds ALL columns from df2 and sorts the resulting DataFrame lexographically, this is a union of the keys
* `"inner"` → adds SIMILAR columns from df2 and preserves the order of the left keys, this is an intersection of the keys
* `"cross"` → creates the [cartesian product](https://www.geeksforgeeks.org/maths/cartesian-product-of-sets/) from both DataFrames, ordered on the left keys
* `on=` → column or index level names to join on
* `left_on/right_on=` → the names of the columns that share values, but have different keys (column names)
* `df[[col1, col2]]` → reorders columns based on the list (will drop any column names left out)
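💡 A minimal sketch of the "left-merge then fill the gaps" pattern above, using made-up toy data rather than the exam's `promos` table:
```python
import pandas as pd

pairs = pd.DataFrame({'cust_id': [1, 1, 2], 'service': ['audio', 'video', 'audio']})
active = pd.DataFrame({'cust_id': [1], 'service': ['video'], 'promo': ['half_off']})

# Left merge keeps every (cust_id, service) pair; unmatched rows get NaN in 'promo'
merged = pairs.merge(active, on=['cust_id', 'service'], how='left')
# Fill the pairs that have no active promo with the default value 'base'
merged.loc[merged['promo'].isna(), 'promo'] = 'base'
print(merged)   # promo is 'half_off' for (1, 'video') and 'base' for the other two pairs
```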
### 🔹 Exercise 1 - Merging (Pandas)

**🎯 Goal**: Given several DataFrames, `customers`, `services`, `active_promos`, and `prices`, use a series of `left merges` to create the required output.

**✅ Solution**:
```python
### Exercise 1 solution
def denormalize(customers, services, active_promos, prices):
###
df = services.copy()
df = df.merge(customers, how="left", left_on="cust_id", right_on="id")
df = df.drop('id', axis=1)
df = df.merge(active_promos, how ="left")
df = df.merge(prices, how="left")
df = df.rename(columns={'cust_id': 'id'})
reorder = ['id', 'paid', 'service', 'tier', 'promo', 'price']
df = df[reorder]
return df
###
```

**⚙️ Pandas Concepts**:
* `df1.merge()` → merges a defined table into df1 on the inputs (and order) below:
* `df2` → table that will merge into df1
* `how=` → define the way in which the tables will merge:
* `"left"` → adds columns from df2 on the right side of df1, where columns with the same name and values are matched across each table
* `"right"` → adds columns from df2 on the left side of df1, where columns with the same name and values are matched across each table
* `"outer"` → adds ALL columns from df2 and sorts the resulting DataFrame lexographically, this is a union of the keys
* `"inner"` → adds SIMILAR columns from df2 and preserves the order of the left keys, this is an intersection of the keys
* `"cross"` → creates the [cartesian product](https://www.geeksforgeeks.org/maths/cartesian-product-of-sets/) from both DataFrames, ordered on the left keys
* `on=` → column or index level names to join on
* `left_on/right_on=` → the names of the columns that share values, but have different keys (column names)
📘 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
* `df.drop(index/key, axis=0/1)` → whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’)
📘 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
### 🔹 Exercise 2 - Sum and Type Conversion (Pandas)
**🎯 Goal**: Given a DataFrame, `df`, with the structure below, return the total of the `price` column rounded to 2 decimal places.
**💡 Note**: All data fields are strings, so we need to cast the prices to float before computing the total.
**`df` Structure**:
```python
### Define demo inputs
demo_df_ex2 = pd.DataFrame({'id': {0: '0', 1: '2', 2: '2', 3: '3', 4: '4'},
'paid': {0: 'True', 1: 'True', 2: 'True', 3: 'True', 4: 'True'},
'service': {0: 'audio', 1: 'video', 2: 'audio', 3: 'audio', 4: 'video'},
'tier': {0: '1', 1: '2', 2: '2', 3: '1', 4: '1'},
'promo': {0: 'base', 1: 'base', 2: 'base', 3: 'base', 4: 'base'},
'price': {0: '8.99', 1: '15.99', 2: '12.99', 3: '8.99', 4: '10.99'}})
#Solution should produce 57.95
```
**✅ Solution**:
```python
### Exercise 2 solution
def get_revenue(df):
pickle = df['price']
pickle = pickle.astype(float)
return pickle.sum()
print(get_revenue(demo_df_ex2))
#57.95
```
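🪄 The goal mentions rounding to 2 decimal places; the demo total happens to come out clean, so the solution above skips it. If explicit rounding were needed, a hedged variant (hypothetical helper name):
```python
def get_revenue_rounded(df):
    # Cast the string prices to float, sum them, and round to 2 decimal places
    return round(df['price'].astype(float).sum(), 2)
```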
### 🔹 Exercise 3 - Filtering and Structuring
**🎯 Goal**: Partition (split) an existing “journal” DataFrame into two separate DataFrames — one containing historical (expired) records and one containing active (current) records — while ensuring the output has the correct schema, even if no prior journal exists.
**➡️ Input**:
* `df` → DataFrame - we don't care about its structure
* `audit_cols` → List of strings - names of audit columns used to track history in the journal
* `existing_journal` (Optional) → DataFrame or None - if it is a DataFrame it will have all of the columns of `df` and `audit_cols`

**⬅️ Output**: Two DataFrames returned as a tuple
* `historical_journal` →
* if `existing_journal = None` → empty DataFrame with the columns from `df` and `audit_cols`
* if `existing_journal != None` → a DataFrame containing all rows of `existing_journal` where `exp_dt` is not '9999-12-31'
* `active_journal` →
* if `existing_journal = None` → empty DataFrame with the columns from `df` and `audit_cols`
* if `existing_journal != None` → a DataFrame containing all rows of `existing_journal` where `exp_dt` is '9999-12-31'
**✅ Solution**:
```python
### Exercise 3 solution
def partition_journal(df, audit_cols, existing_journal=None):
if existing_journal is None:
cols = list(df.columns) + audit_cols
existing_journal = pd.DataFrame(columns=cols)
active_mask = existing_journal['exp_dt'] == '9999-12-31'
historical_journal = existing_journal.loc[~active_mask, :]
active_journal = existing_journal.loc[active_mask, :]
return historical_journal, active_journal
```

**⚙️ Pandas Concepts**:
* `list(df.columns)` → converts Pandas index to a Python list
* `df_condition = df['column name'] == 'condition'` → returns a Boolean Series for the given condition
* `df = df.loc[~df_condition, :]` → keeps only the rows where the Boolean Series is False (filters out the True rows)
* `~` → negation of the mask (the opposite)
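💡 A quick toy illustration of the mask-and-negate partition used above:
```python
import pandas as pd

journal = pd.DataFrame({'id': [1, 2, 3],
                        'exp_dt': ['2020-01-31', '9999-12-31', '9999-12-31']})

active_mask = journal['exp_dt'] == '9999-12-31'   # Boolean Series
historical = journal.loc[~active_mask, :]         # rows where the mask is False
active = journal.loc[active_mask, :]              # rows where the mask is True
print(len(historical), len(active))               # 1 2
```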
### 🔹 Exercise 4 - Merge, Partition and Custom Function Use
**🎯 Goal**: Compare the current business DataFrame `df` to the active journal `active_journal` using a set of key columns. Identify:
* new records → rows only in `df`
* expired records → rows only in `active_journal`
* new and old records → rows in both
**➡️ Input**:
* `df` → a DataFrame

* `active_journal` → a DataFrame that has all of the columns in `df`, possibly with some additional ones. It might also be empty.

* `key_cols` → list of strings naming columns present in both `df` and `active_journal`; the combination of these columns uniquely identifies one record in either DataFrame.

**⬅️ Output**: a tuple of the newly created DataFrames, each sorted lexicographically by `key_cols`:
* `merged` → Outer merge `df` and `active_journal` on `key_cols`; use `suffixes=` and `indicator=`
* 3 Partitioned DataFrames based on the value in the `_merge` column:
* `df_only` → all rows in `merged` with keys existing only in `df`
* `aj_only` → all rows in `merged` with keys existing only in `active_journal`
* `both` → all rows in `merged` with keys existing in both `df` and `active_journal`
* Copies of slices taken from the partitions. Utilize the provided function `def drop_rename_sort(df, drop_pattern, rename_pattern, key_cols)` to create the following:
* `new_df` - all columns from `df_only` which are not duplicate columns originating from `active_journal`.
* `expired_df` - all columns from `aj_only` which are not duplicate columns originating from `df`.
* `compare_new_df` - all columns from `both` which are not duplicate columns originating from `active_journal`.
* `compare_old_df` - all columns from `both` which are not duplicate columns originating from `df`.
* Return the copies as a tuple

**✅ Solution**:
```python
### Exercise 4 solution
def compare_to_journal(df, key_cols, active_journal):
# Outer merge with suffixes and indicator set
merged = df.merge(active_journal, on=key_cols,
suffixes=['_df', '_aj'], # _df indicates data comes from `df`, _aj ... from `active_journal`
indicator=True, # adds '_merge' column to indicate which DataFrame the keys were found in
how='outer')
# Partition `merged` based on '_merge' column, then drop it
df_only = merged.loc[merged['_merge'] == 'left_only'].drop('_merge', axis=1)
aj_only = merged.loc[merged['_merge'] == 'right_only'].drop('_merge', axis=1)
both = merged.loc[merged['_merge'] == 'both'].drop('_merge', axis=1)
# Pull data originating from the appropriate source
new_df = drop_rename_sort(df_only, '_aj$', '_df$', key_cols)
expired_df = drop_rename_sort(aj_only, '_df$', '_aj$', key_cols)
compare_new_df = drop_rename_sort(both, '_aj$', '_df$', key_cols)
compare_old_df = drop_rename_sort(both, '_df$', '_aj$', key_cols)
# Return the results
return (new_df, expired_df, compare_new_df, compare_old_df)
```

**⚙️ Pandas Concepts**:
* `df = df1.merge()` → merges a defined table into df1 on the inputs (and order) below:
* `df2` → table that will merge into df1
* `how=` → define the way in which the tables will merge:
* `"left"` → adds columns from df2 on the right side of df1, where columns with the same name and values are matched across each table
* `"right"` → adds columns from df2 on the left side of df1, where columns with the same name and values are matched across each table
* `"outer"` → adds ALL columns from df2 and sorts the resulting DataFrame lexographically, this is a union of the keys
* `"inner"` → adds SIMILAR columns from df2 and preserves the order of the left keys, this is an intersection of the keys
* `"cross"` → creates the [cartesian product](https://www.geeksforgeeks.org/maths/cartesian-product-of-sets/) from both DataFrames, ordered on the left keys
* `on=` → column or index level names to join on
* `suffixes=` → a list/tuple of two strings appended to overlapping (non-key) column names to disambiguate them
* `indicator=True/False` → creates a column called `_merge` that provides a value for where the merge key came from
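💡 A toy sketch of what `suffixes=` and `indicator=True` produce on an outer merge:
```python
import pandas as pd

left = pd.DataFrame({'k': [1, 2], 'v': ['a', 'b']})
right = pd.DataFrame({'k': [2, 3], 'v': ['B', 'C']})

m = left.merge(right, on='k', how='outer', suffixes=['_df', '_aj'], indicator=True)
print(m)
# The overlapping column 'v' becomes 'v_df' / 'v_aj', and '_merge' is
# 'left_only' for k=1, 'both' for k=2, and 'right_only' for k=3.
```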
### 🔹 Exercise 5 - Detect Changed vs Unchanged Records
**🎯 Goal**:
* Detect which records changed vs did not change between two aligned DataFrames.
* Ignore any “audit-only” columns when deciding whether a record changed.
**➡️ Input**:
* `compare_new_df`: DataFrame (new/current values), same shape/index/columns as the old.
* `compare_old_df`: DataFrame (old/previous values), aligned with compare_new_df.
* `audit_cols`: list of column names to exclude from the comparison.

**⬅️ Output**:
* `unchanged`: rows from `compare_new_df` where all comparable columns match the old.
* `old_changed`: rows from `compare_old_df` where any comparable column differs.
* `new_changed`: rows from `compare_new_df` where any comparable column differs.
**✅ Solution**:
```python
### Exercise 5 solution
def compare_changes(compare_new_df, compare_old_df, audit_cols):
###
# Handle the case of empty DataFrame inputs
    if compare_new_df.shape[0] == 0:
        return compare_new_df.copy(), compare_new_df.copy(), compare_new_df.copy()  # 3 empty DataFrames with the proper columns
# Identify all columns which are not `audit_cols`
cols = [c for c in compare_new_df.columns if c not in audit_cols]
# Create boolean mask - True when there is any difference between the two frames, ignoring `audit_cols`
different = (compare_new_df[cols] != compare_old_df[cols]).any(axis=1)
# Use the mask to partition the DataFrames and return result
unchanged = compare_new_df.loc[~different, :]
old_changed = compare_old_df.loc[different, :]
new_changed = compare_new_df.loc[different, :]
return (unchanged,
old_changed,
new_changed)
```

**⚙️ Pandas Concepts**:
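* `(df1[cols] != df2[cols])` → element-wise comparison of two aligned DataFrames; returns a Boolean DataFrame
* `.any(axis=1)` → collapses each row to True if any value in that row is True

💡 A toy sketch of the row-wise "did anything change?" mask, ignoring an audit column:
```python
import pandas as pd

new = pd.DataFrame({'id': [1, 2], 'price': [9.99, 15.99], 'eff_dt': ['2023-01-01'] * 2})
old = pd.DataFrame({'id': [1, 2], 'price': [9.99, 12.99], 'eff_dt': ['2022-01-01'] * 2})

cols = [c for c in new.columns if c not in ['eff_dt']]   # skip the audit column
different = (new[cols] != old[cols]).any(axis=1)         # True where any comparable value differs
print(different.tolist())                                # [False, True]
```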
### 🔹 Exercise 6 - Convert Timestamp to String & Assign Data
**🎯 Goal**: Given a DataFrame, `df`, that contains `eff_dt` and `exp_dt`, assign the following columns the following values:
* `eff_dt` → `data_date` which is a Timestamp that needs to be converted to a string
* `exp_dt` → '9999-12-31'
**✅ Solution**:
```python
### Exercise 6 solution
def add_records(df, data_date):
###
new_df = df.copy()
data_date_string = data_date.strftime('%Y-%m-%d')
new_df['eff_dt'] = data_date_string
new_df['exp_dt'] = '9999-12-31'
return new_df
###
```
**⚙️ Pandas Concepts**:
* `df['column_name']` → Access a column
* `df['column_name'] = data` → Assign the same data to every row in the column
* `input_data.strftime()` → Convert a timestamp to a string
* 📘 https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.strftime.html
### 🔹 Exercise 7 - Expire Records with a Timedelta
**🎯 Goal**: Change the `exp_dt` column values to be one day prior to the provided `data_date`.
**➡️ Input**: Timestamp `data_date` and DataFrame `df` which can be assumed to have columns:
* `eff_dt`
* `exp_dt`

**⬅️ Output**: A copy of DataFrame `df` that alters the `exp_dt` values to be one day before the provided `data_date`.
**✅ Solution**:
```python
### Exercise 7 solution
def expire_records(df, data_date):
df = df.copy()
df['exp_dt'] = (data_date - pd.Timedelta('1 day')).strftime('%Y-%m-%d')
return df
```

**⚙️ Pandas Concepts**:
* https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.strftime.html
* https://pandas.pydata.org/docs/user_guide/timeseries.html
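💡 A one-liner sketch of the Timestamp arithmetic used above:
```python
import pandas as pd

data_date = pd.Timestamp('2023-06-15')
print((data_date - pd.Timedelta('1 day')).strftime('%Y-%m-%d'))   # 2023-06-14
```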
### 🔹 Exercise 8 - Filter the Journal by Date ("Time Travel")
**🎯 Goal**: Return the journal records that were active as of `data_date`, i.e. rows where `eff_dt` ≤ `data_date` ≤ `exp_dt`.
**➡️ Input**: Timestamp `data_date` and DataFrame `df` which has at least the columns:
* `eff_dt`
* `exp_dt`
**⬅️ Output**: A new DataFrame containing all records where `eff_dt` ≤ `data_date` ≤ `exp_dt` (the solution below also drops the `eff_dt` and `exp_dt` columns).
**✅ Solution**:
```python
### Exercise 8 solution
def time_travel(journal, data_date):
eff_mask = journal['eff_dt'] <= data_date
exp_mask = journal['exp_dt'] >= data_date
mask = eff_mask & exp_mask
return journal.loc[mask, :].drop(columns=['eff_dt', 'exp_dt'])
```

**⚙️ Pandas Concepts**:
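💡 A toy sketch of combining two Boolean masks with `&` (ISO-formatted date strings compare correctly with `<=`/`>=`):
```python
import pandas as pd

journal = pd.DataFrame({'rec': ['a', 'b'],
                        'eff_dt': ['2023-01-01', '2023-07-01'],
                        'exp_dt': ['9999-12-31', '9999-12-31']})

data_date = '2023-06-15'
mask = (journal['eff_dt'] <= data_date) & (journal['exp_dt'] >= data_date)
print(journal.loc[mask, 'rec'].tolist())   # ['a']
```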
---
# PMT2-SP23
[Link to PMT2-23 on Vocareum](https://labs.vocareum.com/main/main.php?m=editor&asnid=4277966&stepid=4277967&hideNavBar=1)
### 🔸 dfs_to_conn(conn_dfs, index=False)
Helper that loads a dictionary of Pandas DataFrames (table name → DataFrame) into an in-memory SQLite database and returns the connection.
```python
def dfs_to_conn(conn_dfs, index=False):
import sqlite3
conn = sqlite3.connect(':memory:')
for table_name, df in conn_dfs.items():
df.to_sql(table_name, conn, if_exists='replace', index=index)
return conn
```
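💡 Hypothetical usage sketch with toy data (the real notebook builds the connection from the Goodreads tables):
```python
import pandas as pd

demo_df = pd.DataFrame({'user_id': [1, 2], 'rating': [5, 0]})
conn = dfs_to_conn({'Interactions': demo_df})   # table name → DataFrame
print(pd.read_sql('SELECT COUNT(*) AS n FROM Interactions', conn))   # n = 2
```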
### 🔸 Goodreads Data (grdbconn)
SQL Table → `Interactions`
* `user_id`
* `book_id`
* `is_read` → user marked the book as read
* `rating` → user gave the book a rating
* `is_reviewed` → user wrote a public review

**💡 Note**: Row 2 is an example of the user viewing the book and saving it to their personal library.
### 🔹 Exercise 0 - SQL, Pandas and Python
**🎯 Goal**: Create a function that calculates the following:
* Total number of interactions → `SELECT COUNT()`
* Number of unique user IDs → `SELECT COUNT(DISTINCT)`
* Number of unique books → `SELECT COUNT(DISTINCT)`
* Number of interactions where the user:
* read the book
* rated the book
* reviewed the book
**➡️ Input**:
SQL Table → `Interactions`
* `user_id`
* `book_id`
* `is_read` → user marked the book as read
* `rating` → user gave the book a rating
* `is_reviewed` → user wrote a public review
**⬅️ Output**: A string formatted as follows:
There are 370,818 interactions.
\- Unique users: 2,000
\- Unique books: 138,633
\- Number of reads: 208,701 (56.3% of all interactions)
\- Number of ratings: 194,243 (52.4%)
\- Number of reviews: 23,720 (6.4%)
**✅ Solution**:
```python
### Exercise 0 solution ###
def summarize_interactions_str(conn):
# Use or adapt this template as you see fit:
template = """There are {} interactions.
- Unique users: {}
- Unique books: {}
- Number of reads: {} ({}% of all interactions)
- Number of ratings: {} ({}%)
- Number of reviews: {} ({}%)"""
### BEGIN SOLUTION
#Step 1 - Use SQLite and grab the Interactions table and apply this to a `df` variable
df = pd.read_sql("SELECT * FROM Interactions", conn)
#Step 2 - Create a Series of UNIQUE values for books and users - get the count using len()
n_users = len(df['user_id'].unique())
n_books = len(df['book_id'].unique())
#Step 3 - Create a Series for `is_read`, `rating`, and `is_reviewed`:
##`is_read` → 1/0 (T/F) so a sum suffices
##`rating` → 1-5, where 0 means not rated - by using a conditional, we create a Boolean
##`is_reviewed` → 1/0 (T/F)
n_reads = df['is_read'].sum()
n_rated = (df['rating'] > 0).sum()
n_rev = df['is_reviewed'].sum()
#print(df['rating'] > 0)
#Step 4 - Wrap all of our numbers in the provided template
##f""" {var_name}""" → 12345 | f""" {var_name:,}""" → 12,345
template = f"""There are {len(df):,} interactions.
- Unique users: {n_users:,}
- Unique books: {n_books:,}
- Number of reads: {n_reads:,} ({n_reads/len(df)*100:.1f}% of all interactions)
- Number of ratings: {n_rated:,} ({n_rated/len(df)*100:.1f}%)
- Number of reviews: {n_rev:,} ({n_rev/len(df)*100:.1f}%)"""
return template
### END SOLUTION
```

**⚙️ Concepts/References**:
* Pandas/Python - We can combine these
* `var_name = df['col_name']` → Attach a Series to a variable
* `len(df['col_name'])` → Returns the length of the Series (# of rows)
* `df['col_name'].sum()` → Sums values in a Series
* 📘 [Pandas Documentation: .sum()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html)
* `f"""{:,}"""` → Adds a comma to applicable numbers (i.e. 1,000)
* `f"""{:.1f}"""` → Creates a single decimal place
### 🔹 Exercise 1 - SQL - Query f-string, Pandas - Manipulation
**🎯 Goal**: Return a DataFrame that provides a count of how many interactions (rows) there are for each unique value of a given column.
**➡️ Input**: `col`, the name of the column. `conn` database connection to connect to `Interactions`.
SQL Table → `Interactions`
* `user_id`
* `book_id`
* `is_read` → user marked the book as read
* `rating` → user gave the book a rating
* `is_reviewed` → user wrote a public review

**⬅️ Output**: A DataFrame with two columns:
* `col` → A column with the same name as the given input column holding the unique values
* `count` → A column with the number of interactions for each unique value
**✅ Solution**:
```python
### Exercise 1 solution
def count_interactions_by(col, conn):
### BEGIN SOLUTION
#query = f"SELECT {col}, COUNT(*) AS count FROM Interactions GROUP BY {col}"
#return pd.read_sql(query, conn)
### END SOLUTION
### MY SOLUTION
# Step 1: Read the full Interactions table into a DataFrame
df = pd.read_sql("SELECT * FROM Interactions", conn)
# Step 2: Group by the specified column and count how many rows fall in each group
grouped = df.groupby(col).size().reset_index(name="count")
return grouped
### END MY SOLUTION
### demo function calls ###
display(count_interactions_by(demo_col_ex1, demo_conn_ex1))
display(count_interactions_by('is_read', demo_conn_ex1))
```

**⚙️ Concepts/References**:
* SQLite
* `query = f"""` → Wrap the SQL query in a f-string
* `SELECT {col}, COUNT(*) AS count` → In the provided column, count all rows (* = all)
* `FROM Interactions` → Within this table
* `GROUP BY {col}` → Group the results by the provided column
* Pandas
* `pd.read_sql(query, conn)` → Return a DataFrame of the resulting table
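💡 A toy sketch showing that the SQL `GROUP BY` query and the pandas `groupby().size()` route produce the same counts (reuses the `dfs_to_conn` helper defined above):
```python
import pandas as pd

toy = pd.DataFrame({'user_id': [1, 1, 2], 'is_read': [1, 0, 1]})
conn = dfs_to_conn({'Interactions': toy})

sql_counts = pd.read_sql(
    "SELECT is_read, COUNT(*) AS count FROM Interactions GROUP BY is_read", conn)
pandas_counts = toy.groupby('is_read').size().reset_index(name='count')
print(sql_counts)      # is_read 0 → 1 row, is_read 1 → 2 rows
print(pandas_counts)   # same result
```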
### 🔸 Power Laws Data
++Power Law Distributions++: a probability density $f(x)$ is a power law if it behaves like $\frac{1}{x^d}$ for some constant $d$ when $x$ is large.
### 🔹 Exercise 2 - Numpy, Pandas - Series Manipulation
**🎯 Goal**: Given a `series` object holding values from 1 to $n$, count how many of those integers lie within the log-two bins
* $[1, 2)$: that is, starting at 1 up to but excluding 2;
* $[2, 4)$: starting at 2 up to but excluding 4;
* $[4, 8)$: starting at 4 up to but excluding 8;
* ...
* and $[2^{k-1}, 2^k)$: starting at $2^{k-1}$ up to but excluding $2^k$, where $2^k$ is the first power of two greater than $n$.
**➡️ Input**: `series` is a Pandas Series object holding the values.
**⬅️ Output**: A DataFrame with two columns:
* `bin start` the value of the left edge of a bin; these are integers starting at 1, all of the form $2^i$
* `count` the number of values in a series that lie in $[2^{i}, 2^{i+1})$
**💡 Notes and Hints**:
1. You may assume all input values are integers greater than or equal to 1.
2. Given a value $x$, the next largest power of two is $2^k$ where $k = \lceil \log_2 x \rceil + 1$.
3. A helpful function is `pandas.cut` (`pd.cut`), but you certainly do not have to use it.
4. Recall that you should omit empty bins.
**✅ Solution**:
```python
### Exercise 2 solution
def log2bin_count(series):
### BEGIN SOLUTION
from pandas import cut
bins = make_log2bins(series)
bin_owners = cut(series, bins=bins, labels=bins[:-1], right=False).astype(int)
counts = bin_owners.value_counts()
df = counts.to_frame().reset_index()
df = df.rename(columns={'index': 'bin_start', 0: 'count'})
return df
def make_log2bins(series):
from numpy import log2, ceil, arange
pmax = int(ceil(log2(series.max()))) + 1
return 2**arange(pmax)
### END SOLUTION
```

**⚙️ Concepts/References**:
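💡 A toy sketch of `pd.cut` with log-two bin edges and `right=False` (left-closed, right-open intervals):
```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 5, 9])
# Bin edges 2**0 .. 2**ceil(log2(max)) → [1, 2, 4, 8, 16]
bins = 2 ** np.arange(int(np.ceil(np.log2(s.max()))) + 1)
owners = pd.cut(s, bins=bins, labels=bins[:-1], right=False)
print(owners.value_counts().sort_index())
# bin 1 → 1 value, bin 2 → 2 values, bin 4 → 1 value, bin 8 → 1 value
```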
### 🔹 Exercise 3 - FREEBIE
**🎯 Goal**:
**➡️ Input**:
**⬅️ Output**:
**✅ Solution**:
**⚙️ Concepts/References**:
### 🔹 Exercise 4 - SQL - `WHERE` Conditional
**🎯 Goal**: Use SQLite/Pandas to return a DataFrame where the rating is 4 or more.
**➡️ Input**: `conn` to connect to the `Interactions` DataFrame.
**⬅️ Output**: A DataFrame where the rating is 4 or more.
**✅ Solution**:
```python
### Exercise 4 solution
def form_analysis_sample(conn):
###
df = pd.read_sql("""SELECT *
FROM Interactions
WHERE rating >= 4""", conn)
return df
###
### demo function call ###
form_analysis_sample(demo_conn_ex4)
```

**⚙️ Concepts/References**:
### 🔹 Exercise 5 - Pandas - Group and Reformat a DataFrame
**🎯 Goal**: Given a DataFrame and a threshold integer signifying connection strength, determine which pairs of users are connected.
**➡️ Input**:
* `ubdf` → A DataFrame with the columns `user_id` and `book_id` where each row indicates the user gave a rating of 4 or more
* `threshold` → integer threshold on connection strength

**⬅️ Output**: A new DataFrame with the columns:
* `user_id_x` → a user id
* `user_id_y` → another user id
* `count` → number of books they both rated in common (must be at least `threshold`)
**✅ Solution**:
```python
### Exercise 5 solution
def connect_users(ubdf, threshold):
### BEGIN SOLUTION
# Step 1 — Self-join on book_id to produce all user pairs who touched the same book.
# Why: joining ubdf to itself on book_id gives every (user_x, user_y) pair that co-rated that book.
paired = ubdf.merge(ubdf, on='book_id')
# Step 2 — Group by user pairs and count how many shared books they have.
# Why: the number of rows per (user_id_x, user_id_y) equals their shared-book count.
counts = (
paired.groupby(['user_id_x', 'user_id_y'])
.size()
.reset_index()
.rename(columns={0: 'count'})
)
# Step 3 — Remove trivial self-pairs (a user paired with themselves).
# Why: we only care about connections between *different* users.
non_self = counts[counts['user_id_x'] != counts['user_id_y']]
# Step 4 — Keep only pairs with shared-book count ≥ threshold.
# Why: enforce the minimum strength required to call them "connected".
strong = non_self[non_self['count'] >= threshold]
# Step 5 — Clean up the index for a tidy return.
# Why: downstream code often assumes a simple RangeIndex.
result = strong.reset_index(drop=True)
return result
### demo function call ###
connect_users(demo_ubdf_ex5, demo_threshold_ex5)
```

**⚙️ Concepts/References**:
* Pandas
* `df.merge(df, on='col_id')` → merging a dataframe with itself on a specified column finds and creates value pairs that match for that value
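💡 A toy sketch of the self-merge: pandas auto-suffixes the clashing `user_id` column as `user_id_x` / `user_id_y`:
```python
import pandas as pd

ub = pd.DataFrame({'user_id': [1, 2, 3], 'book_id': [10, 10, 11]})
pairs = ub.merge(ub, on='book_id')
print(pairs)
# book 10 yields the pairs (1,1), (1,2), (2,1), (2,2); book 11 yields only (3,3)
```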

### 🔹 Exercise 6 - Python/Pandas - List of Sets to DataFrame
**🎯 Goal**: Turn a list of sets into a DataFrame, where each set is a collection of user ids.
**➡️ Input**: `communities`, a list of sets, where each set's index in the list represents its `comm_id` and the values within the set are the `user_id`s.
**⬅️ Output**: A DataFrame where the columns are:
* `user_id` → the values within the set
* `comm_id` → the index position of the set within the list
**✅ Solution**:
```python
### Exercise 6 solution
def assign_communities(communities):
###
# Step 1 — Prepare an empty list to collect rows
rows = []
# Step 2 — Loop through each community using enumerate()
# Why: enumerate provides both the community_id (its index) and the set of users
for i, community in enumerate(communities):
# Step 3 — Loop through each user in the current community
for member in community:
# Step 4 — Append a dictionary for each (community_id, user_id) pair
rows.append({
'comm_id': i,
'user_id': member
})
# Step 5 — Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(rows)
# Step 6 - Reorder the columns
r = ['user_id', 'comm_id']
return df[r]
###
### demo function call ###
assign_communities(demo_communities_ex6)
```

**⚙️ Concepts/References**:
* `enumerate()` → Pairs each item of an iterable with an index, starting from 0
### 🔹 Exercise 7 - Pandas - Merge, Group, Average by Row
**🎯 Goal**: Calculate the averages of `is_read`, `rating` and `is_reviewed` per community.
**➡️ Input**:
* `intdf` → A DataFrame
* `user_id`
* `book_id`
* `is_read`
* `rating`
* `is_reviewed`
* `comdf` → A DataFrame
* `user_id`
* `comm_id`

**⬅️ Output**: A new DataFrame with the columns:
* `comm_id` → one per row
* `is_read` → mean of `is_read` over all `intdf` rows belonging to users in the community
* `rating` → mean of `rating` over all `intdf` rows belonging to users in the community
* `is_reviewed` → mean of `is_reviewed` over all `intdf` rows belonging to users in the community
**✅ Solution**:
```python
### Exercise 7 solution
def means_by_community(intdf, comdf):
# Step 1 — Define the columns whose means we’ll calculate
VALUES = ['is_read', 'rating', 'is_reviewed']
# Step 2 — Merge the interactions and communities DataFrames on 'user_id'
# Why: to attach each user's community ID to their interaction data
df = intdf.merge(comdf, on='user_id')
# Step 3 — Group all rows by community ID
# Why: so that each community’s users’ interactions are grouped together
grouped = df.groupby('comm_id')
# Step 4 — Select only the columns we want to average
# Why: restrict aggregation to the three numeric columns of interest
selected = grouped[VALUES]
# Step 5 — Compute the mean for each selected column within each community
# Why: summarize user behavior per community
means = selected.mean()
# Step 6 — Move 'comm_id' back from the index to a normal column
# Why: groupby() sets it as an index, but we want a clean DataFrame output
result = means.reset_index()
# Step 7 — Return the final result
return result
### END SOLUTION
###
### demo function call ###
demo_result_ex7 = means_by_community(demo_intdf_ex7, demo_comdf_ex7)
display(demo_result_ex7)
```

**⚙️ Concepts/References**:
* Pandas
* `df.groupby('col_name')` → Groups rows by the supplied column name. DOES NOT COMPUTE ANYTHING, just creates a lazy grouped view.
### 🔹 Exercise 8 - Pandas - Grouped Counts, Percentages and Top-N
**🎯 Goal**: For EACH community, calculate what percentage of its users read each book. Identify the top `rank` books. Some useful strategies include:
* Determine the number of unique users in each community. You need this information to get percentages.
* Determine how many users read each book by community.
* Normalize these counts by the community size.
* Sort and return the results, retaining just the top rank books in each community.
**➡️ Input**:
* `xcdf` → DataFrame with the columns:
* `user_id` → a user id (integer)
* `book_id` → a book id that this user read (integer)
* `comm_id` → the community id which the user belongs to (integer)
* `rank` → integer indicating the number of top ranked books to return

**⬅️ Output**: A new DataFrame with the following columns:
* `comm_id` → the community id
* `book_id` → a book id that was read in that community
* `percent` → percent of the community that read that book
* `comm_size` → number of users in the community
**✅ Solution**:
```python
### Exercise 8 solution
def get_topreads_by_community(xcdf, rank=5):
"""
Goal: For each community, find the top `rank` books by % of community members who read them.
Returns columns: ['comm_id', 'book_id', 'percent', 'comm_size'].
"""
# Step 1 — Count unique users per community (community sizes).
# Why: We need the denominator to compute "what % of the community read book X".
sizesdf = count_users_by_comm(xcdf, outcol='comm_size')
# Step 2 — Count readers per (community, book).
# Why: This is the numerator for the percentage.
readersdf = count_readers_by_comm(xcdf, outcol='num_read')
# Step 3 — Join counts so each (comm_id, book_id) row has both num_read and comm_size.
# Why: Put numerator and denominator together to compute percentages.
df = readersdf.merge(sizesdf, on='comm_id')
# Step 4 — Compute the percentage of community that read the book.
# Why: percent = (num_read / comm_size) * 100.
df['percent'] = df['num_read'] / df['comm_size'] * 100
# Step 5 — Sort within each community by:
# (a) percent DESC, then (b) book_id ASC (tie-breaker).
# Why: Define the ranking order and ensure tie-breaking by lowest book_id.
dftop = (
df.sort_values(['comm_id', 'percent', 'book_id'],
ascending=[True, False, True])
.reset_index(drop=True)
)
# Step 6 — Take top `rank` rows per community.
# Why: Keep only the top-N books for each community after sorting.
dftop = (
dftop.groupby('comm_id').apply(lambda g: g.iloc[:rank]).reset_index(drop=True)
)
# Step 7 — Return only required columns.
# Why: Match the specified output schema.
return dftop[['comm_id', 'book_id', 'percent', 'comm_size']]
def count_users_by_comm(xcdf, **kwargs):
"""Count unique users per community (drop duplicates first)."""
# Step A — Keep only (comm_id, user_id) and drop duplicates.
# Why: A user may have multiple rows; each user counts once per community.
cudf = xcdf[['comm_id', 'user_id']].drop_duplicates()
# Step B — Group by community and count rows → community size.
# Why: Each remaining row is a distinct user in that community.
return count_by_group(cudf, 'comm_id', **kwargs)
def count_readers_by_comm(xcdf, **kwargs):
"""Count readers per (community, book)."""
# Step A — Group by (comm_id, book_id) and count rows.
# Why: Each row indicates a user read that book; counting gives readers per book per community.
return count_by_group(xcdf, ['comm_id', 'book_id'], **kwargs)
def count_by_group(df, cols, outcol='count'):
# Step — Generic "count rows per group".
# Why: Reusable helper to produce a tidy (group cols + count) DataFrame.
return df.groupby(cols).size().reset_index().rename(columns={0: outcol})
### END SOLUTION
### demo function call ###
get_topreads_by_community(demo_xcdf_ex8, demo_rank_ex8)
```

**⚙️ Concepts/References**:
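💡 One common alternative to the `.apply(lambda g: g.iloc[:rank])` step is `GroupBy.head`; a toy sketch:
```python
import pandas as pd

df = pd.DataFrame({'comm_id': [0, 0, 0, 1, 1],
                   'book_id': [10, 11, 12, 10, 13],
                   'percent': [90.0, 75.0, 50.0, 80.0, 60.0]})

top2 = (df.sort_values(['comm_id', 'percent', 'book_id'], ascending=[True, False, True])
          .groupby('comm_id')
          .head(2)                  # keep the first 2 rows of each (already sorted) group
          .reset_index(drop=True))
print(top2)   # community 0 keeps books 10 and 11; community 1 keeps books 10 and 13
```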
### 🔹 Exercise 9
**🎯 Goal**:
**➡️ Input**:
**⬅️ Output**:
**✅ Solution**:
**⚙️ Concepts/References**:
### 🔹 Exercise 10
**🎯 Goal**:
**➡️ Input**:
**⬅️ Output**:
**✅ Solution**:
**⚙️ Concepts/References**:
---
# PMT2 - FA23
[Link to PMT2-F23 on Vocareum](https://labs.vocareum.com/main/main.php?m=editor&asnid=4278040&stepid=4278041&hideNavBar=1)
## 🧩 American Football Data
A Pandas DataFrame with the columns:
* `play_id` → unique id for a play
* `drive_id` → unique id for a possession
* `event_id` → unique id for a game
* `type` → type of play ran
* `scoringPlay` → T/F to signal a score
* `awayScore` → score of the away team AFTER the play occurred
* `homeScore` → score of the home team AFTER the play occurred
* `period` → the quarter
* `clock` → time remaining when the play started
* `homeTeamPoss` → T if home team had possession of the ball when play started
* `down` → the down
* `distance` → distance to line to gain
* `yardsToEndzone` → distance to the endzone

### 🔹 Exercise 0 - Pandas - Time Conversion using Timedelta Objects
**🎯 Goal**: Convert the game clock to time remaining in the game.
$$
timeLeft = (15 \times 60)(4 - \text{period}) + 60(\text{clockMinutes}) + \text{clockSeconds}
$$
**➡️ Input**: `all_events_df` where the `clock` field is a string with the format `{clockMinutes}:{clockSeconds}`
**⬅️ Output**: A new DataFrame that returns the following:
* `index` → DO NOT RESET
* `event_id`
* `period`
* `clock`
* `timeLeft`
**✅ Solution**:
```python
### Exercise 0 solution
def calc_time_left(all_events_df: pd.DataFrame) -> pd.DataFrame:
###
#Step 1 - Make a copy of the DF
all_df = all_events_df.copy()
#Step 2 - Use the provided formula to calculate the time left - convert the output to an integer
all_df['timeLeft'] = (
900 * (4 - all_df['period']) +
(pd.to_timedelta('00:' + all_df['clock']) // pd.Timedelta(seconds=1))
).astype('int64')
return all_df
###
### demo function call
calc_time_left(demo_all_events_df_ex0)
```

**⚙️ Concepts/References**:
* Pandas
* `pd.to_timedelta(args)` → converts a string, number or Series in a Timedelta object (differences in time)
* `//` → integer division operator
* `pd.Timedelta(value=..., unit=None)` → creates a single Timedelta object (a duration)
* `pd.to_timedelta(args) // pd.Timedelta(value=..., unit=None)` → converts our supplied clock time to Timedelta objects (e.g. 1:23 → 0 days 00:01:23) then divides the given Timedelta by 1 second (e.g. 0 days 00:01:23 → 83)
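💡 A toy sketch of the clock-string → seconds conversion:
```python
import pandas as pd

clock = pd.Series(['12:35', '01:23'])
seconds = pd.to_timedelta('00:' + clock) // pd.Timedelta(seconds=1)
print(seconds.tolist())   # [755, 83]
```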
### 🔹 Exercise 1 - Pandas - Multiple Conditions, `.loc()`
**🎯 Goal**: Remove events that are not plays, using the given list, and remove events/games that went into extra periods.
**➡️ Input**: A DataFrame that contains:
* `type`
* `timeLeft`
* `event_id`
**⬅️ Output**: A copy of the DataFrame with the non-plays and overtime games removed, built from these masks:
* `non_play_rows` → all rows with a `type` value that is in `non_play_types`
* `ot_event_ids` → all unique `event_id` values occurring in rows where `timeLeft` is less than 0
* `ot_event_rows` → all rows where the `event_id` value is in `ot_event_ids`
**✅ Solution**:
```python
### Exercise 1 solution
def filter_non_plays_and_ot(df: pd.DataFrame) -> pd.DataFrame:
non_play_types = ['Penalty', 'End Period', 'Two-minute warning', 'Timeout', 'End of Half',
'End of Game', 'Official Timeout', 'Defensive 2pt Conversion',
'Two Point Rush', 'Extra Point Good']
#Step 1 - Create a copy
df1 = df.copy()
#Step 2 - Define filters
##non_play_rows → all rows with a 'type' value in non_play_types
non_play_rows = df1['type'].isin(non_play_types)
##ot_event_ids → all unique event id values in rows where timeLeft < 0
ot_event_ids = df1.loc[df1['timeLeft'] < 0, 'event_id'].unique()
##ot_event_rows → all rows where the 'event_id' is in ot_event_ids
ot_event_rows = df1['event_id'].isin(ot_event_ids)
#Step 3 - Return the filtered DataFrame
filtered_df = df1[~non_play_rows & ~ot_event_rows]
return filtered_df
### demo function call
filter_non_plays_and_ot(demo_df_ex1)
```

**⚙️ Concepts/References**:
* `df['col_name'].isin(list/series)` → returns a T/F for the column to check membership in the supplied list/series
* `df.loc[df['col_name1'] < 0, 'col_name2'].unique()` → indexes the provided rows and column and returns an array of the unique values
* `.loc[ROW, COLUMN]` → index the provided row and column
**🧠 Insight**: Think of `.loc[rows, columns]`: `rows` is a Boolean mask, list, or slice, and `columns` is the field (or fields) we want to return.
### 🔹 Exercise 2 - Pandas - Partition Column by Another & Create Subsets
**🎯 Goal**: Model the probability that the offense will advance the ball to the line to gain.
**➡️ Input**: The helper function `converted_by_drive(group: pd.DataFrame) -> pd.DataFrame` which takes a DataFrame containing **only records from the same drive** and returns a new DataFrame with the new column `converted` added.
`event_df` → DataFrame
* `drive_id`
* `type`
* `scoringPlay`
* `down`
* `timeLeft`
**⬅️ Output**: There are 2 steps to this output.
1. Subsets
* Partition `event_df` by `drive_id`
* Apply `converted_by_drive` to each partition
2. Concatenation
* Concatenate all results into a single DataFrame
**✅ Solution**:
```python
### Exercise 2 solution
### Helper function provided as part of the starter code
def converted_by_drive(group: pd.DataFrame) -> pd.DataFrame:
group = group.sort_values('timeLeft', ascending=False)\
.reset_index(drop=True)
offensive_touchdown_types = ['Passing Touchdown', 'Rushing Touchdown',
'Fumble Recovery (Own)', 'Rush', 'Pass Reception']
# `pd.DataFrame.shift` might be useful later...
first_downs = (group['down'] == 1).shift(-1, fill_value=False)
scores = (group['scoringPlay'] == True)&(group['type'].isin(offensive_touchdown_types))
group['converted'] = (first_downs|scores)
return group
### Your solution
def converted(event_df: pd.DataFrame) -> pd.DataFrame:
###
#Step 0 - Copy of our DataFrame
a_df = event_df.copy()
#Step 1 - Partition using groupby
    drive_groups = a_df.groupby('drive_id') #→ returns a DataFrameGroupBy object
#Step 2 - APPLY the function AND drop the index
##Drop because the index changes to the drive_id values
result = drive_groups.apply(converted_by_drive).reset_index(drop=True)
return result
###
converted(demo_event_df_ex2)
```

**⚙️ Concepts/References**:
* `df.groupby('col_name')` → returns a DataFrameGroupBy object that must perform a calculation of some kind to get a DataFrame returned
* `df.apply(function_name)` → applies the provided function to the dataframe
### 🔹 Exercise 3 - Pandas - Sort, Identify Max & Python Conditions
**🎯 Goal**: Model win probability by creating a new column that returns a T/F given the parameters:
* Identify the final score for both teams by either sorting and indexing OR using the maximum value
* Determine which team had the higher score
* Return a copy of `event_df` with the new column `won` that is set to True if the team on offense won the game (`homeTeamPoss`)
**➡️ Input**: `event_df` → DataFrame with the columns:
* `awayScore`
* `homeScore`
* `homeTeamPoss`
* `timeLeft`
**⬅️ Output**: A new DataFrame, that is a copy of `event_df` but now has the column `won`.
**✅ Solution**:
```python
### Exercise 3 solution
def who_won(event_df: pd.DataFrame) -> pd.DataFrame:
###
#Step 1 - Create a copy of the input DataFrame
df = event_df.copy()
#Step 2 - Sort by timeLeft, although this is optional due to using .max(), we could have indexed the values from the last row
df = df.sort_values(by='timeLeft', ascending=False)
#Step 3 - Identify the final scores of each team
home_final = df['homeScore'].max()
away_final = df['awayScore'].max()
#Step 4 - Logical operations using the final scores
if home_final > away_final:
df['won'] = df['homeTeamPoss']
elif away_final > home_final:
df['won'] = ~df['homeTeamPoss']
else:
df['won'] = False
return df
###
### demo function call
demo_soln_ex3 = who_won(demo_event_df_ex3)
display(demo_soln_ex3.head())
display(demo_soln_ex3.tail())
```

**⚙️ Concepts/References**:
* `df['col_name'].max()` → Finds the maximum value
* `df.sort_values(by=col_name , ascending=T/F)` → Sorts the DataFrame by the values within the provided column
### 🔹 Exercise 4 - Pandas - `zip` Pairs and Ranged `.loc` Assignment
**🎯 Goal**: Create a column `nextScore` that holds, for each play, the next score's point value relative to the team on offense, determined from the next play where `scoringPlay=True` together with `homeTeamPoss`, `awayScore` and `homeScore`.
**➡️ Input**:
* `event_df` → A DataFrame, as pictured below
* `get_update_list(df: pd.DataFrame) → list` → A function that accepts a DataFrame of the play data for a single game that has been sorted by `timeLeft`. Returns a list of tuples, where each tuple is `(a, b)` such that `a` is the play number and `b` is the change in score in relation to the home team

**⬅️ Output**: A DataFrame that has been through the following:
* Copy `event_df`, sort the copy by `timeLeft` in descending order, and reset the index. We will call this copy `df`.
* Call `get_update_list` on the `df`. We will call the output `update_list`.
* Create a new column `nextScore` in `df`. Set the values of `nextScore` based on the `update_list`. Hint: clever use of the `zip` function and slicing of `update_list` will make this an easy task.
* Negate `df['nextScore']` in all rows where `df['homeTeamPoss']` is False (so the value is always relative to the team on offense)
**✅ Solution**:
```python
### Exercise 4 solution
def add_next_score(event_df: pd.DataFrame) -> pd.DataFrame:
### BEGIN SOLUTION
#Step 1 - Create a copy
df = event_df.copy()
#Step 2 - Sort the DataFrame by timeLeft - Reset the index as a precaution
    df = df.sort_values('timeLeft', ascending=False).reset_index(drop=True)
#Step 3 - Call the function on the DataFrame
#Note: We did not use .apply() because we're not changing values in the DataFrame, but rather creating tuples from the DataFrame
update_list = get_update_list(df)
    #Step 4 - Slice the list, zip it together, update nextScore
    ##Slice the list so that one copy is missing the last element and one is missing the first
##Zip these two together (Recall (a, b) → a - play # where score updates, b - score change)
##This creates a tuple of tuples so that we create ranges and values, based on the change of score and when it happened
##.loc[(row, column)] specifies the row range, the column to be updated, and sets it equal to the value
for start, end in zip(update_list[:-1], update_list[1:]):
df.loc[(start[0]):(end[0]), 'nextScore'] = end[1]
#Step 5 - Flip the sign of the nextScore value when homeTeamPoss is False
df.loc[~df['homeTeamPoss'], 'nextScore'] = -1*df.loc[~df['homeTeamPoss'], 'nextScore']
df['nextScore'] = df['nextScore'].astype(int)
return df
### END SOLUTION
### demo function call
demo_output_ex4 = add_next_score(demo_event_df_ex4)
```

**⚙️ Concepts/References**:
* `zip(lst[:-1], lst[1:])` → pairs each element of a list with its successor
* `df.loc[a:b, 'col'] = value` → label-based slice assignment; unlike regular Python slicing, `.loc` includes BOTH endpoints
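💡 A toy sketch of the pairwise `zip` plus `.loc` range-assignment trick (made-up play numbers and score changes):
```python
import pandas as pd

df = pd.DataFrame({'play': range(6)})
update_list = [(0, 0), (2, 7), (5, 3)]   # (play index where the score changes, score change)

for start, end in zip(update_list[:-1], update_list[1:]):
    df.loc[start[0]:end[0], 'nextScore'] = end[1]
print(df)
# Rows 0-1 end with nextScore 7 and rows 2-5 with 3
# (the later slice overwrites the shared boundary row, and .loc includes both endpoints).
```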
### 🔹 Exercise 5 - Pandas - Partition and Shift (Lag) Row Values by 1
**🎯 Goal**: The score field currently updates AFTER a play has occurred, but should be updated so that it reflects the score BEFORE the play.
**➡️ Input**: `all_events_df` → DataFrame

**⬅️ Output**: A copy of the input that:
* Copy `all_events_df` and sort it by `event_id` and `timeLeft`. The order of `event_id` doesn't matter (sorting helps things run faster). However, sorting the records for each event by timeLeft in descending order is critical (which is why we must sort by both columns). We will call the sorted copy `df`.
* Partition `df` by `event_id`.
* For each partition, set the `homeScore` and `awayScore` values to their values from one row prior. Since the first row has no rows which are prior to it, we will set both score values to 0.
* Hint - The pd.DataFrame.shift method will be helpful in accomplishing this step.
* Concatenate the partitions together.
* Return the result.
**✅ Solution**:
```python
### Exercise 5 solution
def lag_score(all_events_df: pd.DataFrame) -> pd.DataFrame:
###
df = all_events_df.copy()
df = df.sort_values(by=['event_id', 'timeLeft'], ascending = [True, False])
partition = df.groupby('event_id')
df = df.groupby('event_id', as_index=False).apply(helper).reset_index(drop=True)
return df
###
def helper(df: pd.DataFrame) -> pd.DataFrame:
df[['homeScore', 'awayScore']] = df[['homeScore', 'awayScore']].shift(1, fill_value=0)
return df
### demo function call
demo_output = lag_score(demo_all_events_df_ex5)
print(demo_output)
```

**⚙️ Concepts/References**:
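💡 A toy sketch of lagging a column within each group via `groupby(...)['col'].shift(...)` (an alternative to the `apply(helper)` route above):
```python
import pandas as pd

df = pd.DataFrame({'event_id': [1, 1, 1, 2, 2],
                   'homeScore': [0, 7, 14, 3, 10]})

# Shift homeScore down by one row WITHIN each game so it reflects the score before the play
df['homeScore'] = df.groupby('event_id')['homeScore'].shift(1, fill_value=0)
print(df['homeScore'].tolist())   # [0, 0, 7, 0, 3]
```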
### 🔹 Exercise 6 - Conditional Filters and Create Multiple DataFrames
**🎯 Goal**: Using 1 input DataFrame, create 3 DataFrames with the appropriate columns.
**➡️ Input**: `all_events_df` → A DataFrame that includes the columns:
* `'distance'`
* `'nextScore'`
* `'timeLeft'`
* `'play_id'`
* `'won'`
* `'homeTeamPoss'`
* `'awayScore'`
* `'converted'`
* `'down'`
* `'yardsToEndzone'`
* `'event_id'`
* `'homeScore'`

**⬅️ Output**: Several DataFrames
* Copy `all_events_df`. We will call this `df`.
* Identify all rows where `df['down']` is 0. Filter them out of `df`.
* Calculate `conversion_data` from `df`.
    * should include only fields in `conversion_fields` (part of the starter code)
    * should only include rows where `df['down']` is 3 or 4
    * should not include rows that have "field goal" in `df['type'].lower()`
    * should not include rows that have "punt" mentioned in `df['type'].lower()`
* Calculate `ep_data` from `df`
    * `ep_data` should include only fields in `ep_fields` (part of the starter code)
* Calculate `wp_data` from `df`
    * `wp_data` should include only fields in `win_prob_fields` (part of the starter code)
* Return `conversion_data`, `ep_data` and `wp_data`
**✅ Solution**:
```python
### Exercise 6 solution
def build_model_inputs(all_events_df: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame, pd.DataFrame):
conversion_fields = ['event_id', 'play_id', 'down', 'distance', 'converted']
ep_fields = ['event_id', 'play_id', 'down', 'distance', 'yardsToEndzone', 'nextScore']
win_prob_fields = ['event_id', 'play_id', 'down', 'distance', 'yardsToEndzone',
'timeLeft', 'awayScore', 'homeScore', 'homeTeamPoss', 'won']
###
df = all_events_df.copy()
df = df[df['down'] != 0]
conversion_data = df.copy()
conversion_data = conversion_data[(conversion_data['down'] == 3) | (conversion_data['down'] == 4)]
conversion_data = conversion_data[~(conversion_data['type'].str.lower().str.contains("field goal"))]
conversion_data = conversion_data[~(conversion_data['type'].str.lower().str.contains("punt"))]
col_drop = df.columns.difference(conversion_fields)
conversion_data = conversion_data.drop(columns=col_drop)
ep_data = df.copy()
ep_drop = df.columns.difference(ep_fields)
ep_data = ep_data.drop(columns=ep_drop)
wp_data = df.copy()
wp_drop = df.columns.difference(win_prob_fields)
wp_data = wp_data.drop(columns=wp_drop)
return conversion_data, ep_data, wp_data
###
### demo function call
(demo_conversion_data_ex6, demo_ep_data_ex6, demo_wp_data_ex6) = build_model_inputs(demo_all_events_df_ex6)
for df in (demo_conversion_data_ex6, demo_ep_data_ex6, demo_wp_data_ex6):
display(df)
```

**⚙️ Concepts/References**:
### 🔹 Exercise 7 - Win Probability Model
**🎯 Goal**: Estimate a win probability for each play by combining the expected-points model output with the current score margin and the time remaining (normal approximation).
**➡️ Input**: `df` → DataFrame with at least the columns `homeScore`, `awayScore`, `homeTeamPoss` and `timeLeft` (plus whatever `ep_model` expects).
**⬅️ Output**: A NumPy array with one win-probability estimate per row (probability that the team on offense wins).
**✅ Solution**:
```python
### Exercise 7 solution
def win_prob_model(df: pd.DataFrame) -> np.ndarray:
###
### BEGIN SOLUTION
offense_ep = ep_model(df)
mu = (offense_ep+df['homeScore']-df['awayScore']) * df['homeTeamPoss'] + \
(offense_ep+df['awayScore']-df['homeScore']) * (1 - df['homeTeamPoss'])
sigma = 13.85 / np.sqrt((3600)/(1 + df['timeLeft']))
return 1 - norm.cdf(0.5, loc=mu, scale=sigma)
### END SOLUTION
###
### demo function call
win_prob_model(demo_df_ex7)
```

**⚙️ Concepts/References**:
### 🔹 Exercise 8 - Simulating Fourth-Down Outcomes
**🎯 Goal**: Combine the conversion, field-goal and win-probability models to compute the expected win probability of each fourth-down decision.
**➡️ Input**: `df` → DataFrame of plays; `models` → dict with the keys `'convert'`, `'fg'` and `'win_prob'`.
**⬅️ Output**: A copy of `df` with the columns `punt_wp`, `kick_wp` and `go_wp` added (values rounded to 4 decimal places).
**✅ Solution**:
```python
### Exercise 8 solution
def sim_outcomes(df: pd.DataFrame, models: dict) -> pd.DataFrame:
conversion_model = models['convert']
fg_model = models['fg']
win_prob_model = models['win_prob']
def_win_prob_model = lambda d: 1 - win_prob_model(d)
df = df.copy()
goal_to_go = df['distance'] == df['yardsToEndzone']
fg_make_prob = fg_model(df['yardsToEndzone'])
go_succeed_prob = conversion_model(df['distance'])
df['go_succeed_prob'] = go_succeed_prob
df['fg_make_prob'] = fg_make_prob
# Make columns for win probability if a particular outcome occurs
df['fg_make_wp'] = def_win_prob_model(simulate_fg_make(df))
df['fg_miss_wp'] = def_win_prob_model(simulate_fg_miss(df))
df['fail_wp'] = def_win_prob_model(simulate_fourth_down_fail(df))
df.loc[goal_to_go, 'succeed_wp'] = def_win_prob_model(simulate_fourth_down_succeed(df))[goal_to_go]
df.loc[~goal_to_go, 'succeed_wp'] = win_prob_model(simulate_fourth_down_succeed(df))[~goal_to_go]
# Make df_punt
df_punt = simulate_punt(df)
###
### BEGIN SOLUTION
fg_miss_prob = 1-fg_make_prob
go_fail_prob = 1-go_succeed_prob
df['punt_wp'] = def_win_prob_model(df_punt)
df['kick_wp'] = fg_make_prob*df['fg_make_wp'] + fg_miss_prob*df['fg_miss_wp']
df['go_wp'] = go_succeed_prob*df['succeed_wp'] + go_fail_prob*df['fail_wp']
return df.round(4)
### END SOLUTION
###
### demo function call
models = {
'convert': conversion_model,
'fg': fg_model,
'win_prob': win_prob_model
}
sim_outcomes(demo_df_ex8, models)
```

**⚙️ Concepts/References**:
# PMT2 - SP24
[Link to PMT2-SP24 on Vocareum](https://labs.vocareum.com/main/main.php?m=editor&asnid=4491736&stepid=4491737&hideNavBar=1)
## 🧩 Data - New York Collisions
Detailed traffic collision data collected by the City of New York. There are 3 tables in our connection, but we do not know the relationships yet:
* `CRASHES`
* `VEHICLES`
* `PERSON`
**💡 Note**: Quickly view the table structure using: `pd.read_sql("PRAGMA table_info(TABLE_NAME);", conn)`
**💡 Note**: Quickly view the data using: `pd.read_sql("SELECT * FROM TABLE_NAME LIMIT 5;", conn)`
### 🔹 Exercise 0 - SQL - Dynamic Query w/ Parameter, Pandas - SQL Query to DataFrame
**🎯 Goal**: Identify the column names of a given `table_name`.
**➡️ Input**:
* `table_name` → string, name of a table
* `conn` → Connect object, name of the connection to our database
**⬅️ Output**: A list of column names for the given `table_name`, extracted by:
* verifying that the `table_name` contains only letters, numbers, or underscores
* if not, raise a ValueError
* generating a dynamic query to pass in a parameter
* use the query to generate a DataFrame
* extract the column names as a list and order them alphabetically
**✅ Solution**:
```python
### Solution - Exercise 0
def get_table_cols(table_name, conn):
###
    #Step 0 - Ensure that table_name contains only letters, numbers, or underscores
    if re.search(r'\W', table_name) is not None:
raise ValueError()
#Step 1 - Dynamic query with a parameter FROM clause
##Dynamic is simply writing the query as a variable
sql = f'''
SELECT *
FROM {table_name}
LIMIT 1'''
#Step 2 - Load the query as a DataFrame
df = pd.read_sql(sql, conn)
#Step 3 - Extract the column names as a list
##Could also use df.columns.to_list()
col_names = list(df.columns)
return sorted(col_names)
###
### Demo function call
print(get_table_cols(table_name='vehicles', conn=conn))
```

**⚙️ Key Concepts**:
* Regex
* `re.search(pattern, string)` → Locate the first occurrence of a string
* `\W` → Non-word character: any character that is not a letter, digit, or underscore (`[^A-Za-z0-9_]`)
* SQLite
* `sql = f''' '''` → Dynamic Query syntax, used for passing parameters
* `SELECT *` → Select all columns
* `FROM {table_name}` → From this table
* `LIMIT 1` → Return only the first row
* Pandas
* `pd.read_sql(query_variable, conn)` → Turns the SQL query into a DataFrame
### 🔹 Exercise 1 - Count Items in Sublist
**🎯 Goal**: Given a list of lists, identify the names that are present in at least `num` of the lists.
**➡️ Input**:
* `lol` → list
* `num` → integer
**⬅️ Output**: A set containing all names appearing in at least `num` of the lists.
**✅ Solution**:
```python
### Solution - Exercise 1
def intersection_of_cols(lol: list, num=None) -> set:
assert len(lol) > 1, '`lol` must have at least 2 items'
if num is None:
num = len(lol)
###
from collections import Counter
#Step 1 - Apply Counter to each sublist
##Counter returns a dictionary count of each item
    ##for l in lol → iterates over each sublist
    ##for col_name in l → iterates over each name within that sublist
##col_name identifies each name
c = Counter([col_name for l in lol for col_name in l])
#Step 2 - Access the key, value in our Counter return and apply the condition
d = {col for col, count in c.items() if count >= num}
return d
###
### Demo function call
for num in [2,3]:
demo_intersection = intersection_of_cols(
lol=[['ringo', 'phyllis', 'angela', 'paul'],
['kevin', 'paul', 'oscar', 'kelly', 'phyllis'],
['phyllis', 'oscar', 'ryan', 'john', 'toby']],
num=num)
print(f'num={num}; intersection={demo_intersection}')
```

**⚙️ Key Concepts**:
* `Counter()` → Counts occurrences of each item in an iterable, returns the item and count as a dictionary
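💡 A quick sketch of `Counter` over a flattened list of lists:
```python
from collections import Counter

lol = [['a', 'b'], ['b', 'c'], ['b', 'a']]
c = Counter(name for sub in lol for name in sub)
print(c)                                            # Counter({'b': 3, 'a': 2, 'c': 1})
print({name for name, n in c.items() if n >= 2})    # {'a', 'b'} (set order may vary)
```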
### 🔹 Exercise 2 - SQL - Aggregate Functions, GROUP BY, and SUBSTR
**🎯 Goal**: Return a Pandas DataFrame and build the SQL Query.
**➡️ Input**: `conn` → Connection to the database.
**⬅️ Output**: A DataFrame for the query that:
* `YEAR`: Year for which summary stats are computed. (taken from the `CRASH_DATE` field)
* `INJURY_COUNT`: Count of records with 'Injured' values in the `PERSON_INJURY` field for the same year.
* `DEATHS_COUNT`: Count of records with 'Killed' values in the `PERSON_INJURY` field for the same year.
* `UNSPECIFIED_COUNT`: Count of records with 'Unspecified' values in the `PERSON_INJURY` field for the same year.
**✅ Solution**:
```python
### Solution - Exercise 2
def summarize_person(conn):
###
#No input, so no f-string
query = """
SELECT SUBSTR(CRASH_DATE, 7) AS YEAR,
SUM(PERSON_INJURY='Injured') AS INJURY_COUNT,
SUM(PERSON_INJURY='Killed') AS DEATHS_COUNT,
SUM(PERSON_INJURY='Unspecified') AS UNSPECIFIED_COUNT
FROM PERSON
GROUP BY YEAR"""
return pd.read_sql(query, conn)
###
### Demo function call
demo_person_summary_df = summarize_person(conn)
display(demo_person_summary_df)
```

**⚙️ Key Concepts**:
* SQLite
    * `SELECT` → Used to pull columns from the table and perform calculations on them
    * `SUBSTR(string, start_position, length) AS COL_NAME` → Pull a substring from a given string (positions are 1-based; omit `length` to take everything from `start_position` onward)
    * `SUM(TABLE_COL_NAME = condition) AS COL_NAME` → The condition evaluates each row to True/False (1/0), so the sum counts the rows where the condition is true
    * `COUNT(CASE WHEN condition THEN 1 END)` → The same conditional count, written with COUNT instead of SUM
    * `GROUP BY COL_NAME` → Groups rows by the column AND evaluates the aggregate functions within each group
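🪄 The `SUM(condition)` and `COUNT(CASE WHEN ...)` forms are interchangeable; a hedged sketch against the same `PERSON` table (assuming `conn` is still open):
```python
# Sketch: two equivalent conditional counts per group (assumes the PERSON table and `conn` from this exam)
query = """
SELECT SUBSTR(CRASH_DATE, 7) AS YEAR,
       SUM(PERSON_INJURY = 'Injured') AS INJURY_COUNT,
       COUNT(CASE WHEN PERSON_INJURY = 'Injured' THEN 1 END) AS INJURY_COUNT_ALT
FROM PERSON
GROUP BY YEAR"""
# pd.read_sql(query, conn)  # both INJURY_COUNT columns should agree row for row
```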
### 🔹 Exercise 3 - SQLite - SUBSTR, COUNT(\*), Quotient (/)
**🎯 Goal**: Return a DataFrame using a query statement.
**➡️ Input**: `conn` → Connection to the database.
**⬅️ Output**: A DataFrame with the following columns:
* YEAR: Year for which summary stats are computed. (taken from the "CRASH DATE" field)
* CRASH_COUNT: Count of crashes in the same year.
* DEATHS: Sum of the "NUMBER OF PERSONS KILLED" values in the same year.
* PEDESTRIAN_DEATHS: Sum of the "NUMBER OF PEDESTRIANS KILLED" values in the same year.
* PEDESTRIAN_DEATH_SHARE: The quotient PEDESTRIAN_DEATHS/DEATHS
* DEATHS_PER_CRASH: The quotient DEATHS/CRASH_COUNT
**✅ Solution**:
```python
### Solution - Exercise 3
def summarize_crashes(conn):
###
query = '''
SELECT SUBSTR("CRASH DATE", 7) YEAR,
COUNT(*) CRASH_COUNT,
SUM("NUMBER OF PERSONS KILLED") DEATHS,
SUM("NUMBER OF PEDESTRIANS KILLED") PEDESTRIAN_DEATHS,
SUM("NUMBER OF PEDESTRIANS KILLED") / SUM("NUMBER OF PERSONS KILLED") PEDESTRIAN_DEATH_SHARE,
SUM("NUMBER OF PERSONS KILLED") / COUNT(*) DEATHS_PER_CRASH
FROM CRASHES
GROUP BY YEAR
'''
return pd.read_sql(query, conn)
###
### Demo function call
demo_crashes_summary = summarize_crashes(conn)
display(demo_crashes_summary)
```

**⚙️ Key Concepts**:
* SQLite
    * `SELECT` → aggregate expressions in the SELECT list are evaluated per `GROUP BY` group
    * `COUNT(*)` → Count all rows in the group
    * `SUM("COLUMN NAME")` → Sum the values within the column
    * `SUM("COLUMN NAME") / SUM("COLUMN NAME")` → Divides one sum by another
**💡 Note**: Wrap a column name in double quotes (`"COLUMN NAME"`) when it contains a space or special character; otherwise a bare `COLUMN_NAME` works.
### 🔹 Exercise 4 - SQLite - INNER JOIN, WHERE not null, SELECT multiple conditions
**🎯 Goal**: Return a DataFrame using a query statement that meets the criteria below.
**➡️ Input**:
* `conn` → connection to the database
* `CRASHES` → A table from our database

* `VEHICLES` → A table from our database

**⬅️ Output**: A DataFrame that provides a summary with these columns:
* `VEHICLE_TYPE`: The upper-case `VEHICLE_TYPE` from `VEHICLES` for which summary stats are computed
* You may encounter records with NULL values in the `VEHICLE_TYPE` field. These should not be included in your calculation.
* `COUNT`: The number of records with the same `VEHICLE_TYPE`
* `OUT_OF_STATE`: The number of records with the same `VEHICLE_TYPE` having upper-case `STATE_REGISTRATION` values in `VEHICLES` other than 'NY'
* NULL values in the `STATE_REGISTRATION` field do not count as `OUT_OF_STATE`.
* `DEATHS`: The sum of the "NUMBER OF PERSONS KILLED" values in `CRASHES` associated with the same `VEHICLE_TYPE`.
* You can relate `VEHICLES` to `CRASHES` by the `COLLISION_ID` column which is common to both tables.
Additional requirements:
* The result should be sorted by `COUNT` in descending order then by `VEHICLE_TYPE` in ascending order.
* The result should include at most 30 rows
* Your result should only be derived from records which have a `COLLISION_ID` value present in both tables.
* Your result should not include any row with NULL as the `VEHICLE_TYPE` value.
**✅ Solution**:
```python
### Solution - Exercise 4
def summarize_vehicles(conn):
# SOLUTION (VERSION 1) - SQL Only:
# Google Search: 'sqlite join'
# Google Result: https://www.sqlitetutorial.net/sqlite-join/
query = '''
SELECT upper(vehicle_type) AS VEHICLE_TYPE,
count(*) AS COUNT,
sum(upper(state_registration) <> "NY" and state_registration is not null) AS OUT_OF_STATE,
sum("NUMBER OF PERSONS KILLED") AS DEATHS
FROM vehicles
INNER JOIN crashes USING(collision_id)
WHERE vehicle_type is not null
GROUP BY upper(vehicle_type)
ORDER BY COUNT desc, VEHICLE_TYPE
LIMIT 30
'''
return pd.read_sql(query, conn)
### Demo function call
demo_vehicle_summary = summarize_vehicles(conn)
display(demo_vehicle_summary)
print()
```

**⚙️ Key Concepts**:
* SQL
    * `SELECT` → compute and/or relabel columns
    * `FROM table_primary` → pull data from this table (the primary table)
    * `INNER JOIN table_secondary USING(column_name)` → keeps only the rows whose `column_name` value appears in both tables, merges them, and discards non-matches; all other columns from both tables are kept
        * 📘 https://www.sqlitetutorial.net/sqlite-join/
    * `WHERE` → filter records, keeping only the rows that satisfy the condition
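🪄 A self-contained sketch of `INNER JOIN ... USING` on two toy tables (illustrative schema, not the exam database):
```python
# Sketch: INNER JOIN ... USING keeps only the rows present in both tables
import sqlite3
import pandas as pd

demo = sqlite3.connect(':memory:')
demo.executescript("""
CREATE TABLE a (collision_id INTEGER, vehicle_type TEXT);
CREATE TABLE b (collision_id INTEGER, deaths INTEGER);
INSERT INTO a VALUES (1, 'sedan'), (2, 'bike'), (3, 'van');
INSERT INTO b VALUES (1, 0), (2, 1);
""")
print(pd.read_sql('SELECT * FROM a INNER JOIN b USING(collision_id)', demo))
# Only collision_id 1 and 2 survive; 3 has no match in b, so it is dropped.
demo.close()
```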
### 🔹 Exercise 5 - Pandas - Query to DF w/ Params, Cast string to datetime64
**🎯 Goal**: Given a query, connection, and parameters, run the query and convert (cast) the `CRASH_DATE` string column to datetime64.
**➡️ Input**:
* `qry` → a provided query
* `conn` → connection to the database
* `params` → provided parameters
**⬅️ Output**: A DataFrame that changes the `CRASH_DATE` from a text type to a datetime64 type.
**✅ Solution**:
```python
### Solution - Exercise 5
def geo_filter_crashes(qry, conn, params):
###
df = pd.read_sql(qry, conn, params=params)
df['CRASH_DATE'] = pd.to_datetime(df['CRASH_DATE'])
return df
###
### Demo function call
demo_qry = '''
select
"CRASH DATE" AS CRASH_DATE,
LATITUDE,
LONGITUDE,
"NUMBER OF PERSONS KILLED" AS PERSONS_KILLED
from
crashes
where
latitude is not null and latitude between ? and ?
and
longitude is not null and longitude between ? and ?
and
PERSONS_KILLED is not null and PERSONS_KILLED > 0
'''
demo_df = geo_filter_crashes(demo_qry, conn, [40.5, 40.95, -74.1, -73.65])
display(demo_df.head())
```

**⚙️ Key Concepts**:
* Pandas
    * `df['column_name'] = pd.to_datetime(df['column_name'])` → converts a text column to datetime64
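🪄 A short sketch of the cast plus parameterized querying (toy frame; the commented `read_sql` call assumes the exam's `conn` and `crashes` table):
```python
# Sketch: string → datetime64 cast, and where the ? placeholders get their values
import pandas as pd

df = pd.DataFrame({'CRASH_DATE': ['09/17/2021', '01/03/2022']})
df['CRASH_DATE'] = pd.to_datetime(df['CRASH_DATE'])
print(df.dtypes)  # CRASH_DATE is now datetime64[ns]
# With a live connection, positional params fill the ? placeholders in order:
# pd.read_sql('SELECT * FROM crashes WHERE latitude BETWEEN ? AND ?', conn, params=[40.5, 40.95])
```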
### 🔹 Exercise 6 - NumPy - linspace, meshgrid, Estimator over a Grid
**🎯 Goal**: Given coordinate bounds, a grid size, and an estimator function, evaluate the estimator over a regular grid of points.
**➡️ Input**:
* `bounds` → list `[x_min, x_max, y_min, y_max]`
* `grid_size` → integer; number of points along each axis
* `estimator` → callable that takes a 2×N array of stacked (x, y) coordinates and returns N values
**⬅️ Output**: A `grid_size × grid_size` NumPy array of estimator values over the grid.
**✅ Solution**:
```python
### Solution - Exercise 6
#np.linspace(start, stop, num) → num evenly spaced samples from start to stop
def density_grid(bounds, grid_size, estimator):
###
x = np.linspace(bounds[0], bounds[1] , grid_size)
y = np.linspace(bounds[2], bounds[3] , grid_size)
xx, yy = np.meshgrid(x, y)
test = np.vstack((xx.flatten(), yy.flatten()))
a = estimator(test).reshape((grid_size, grid_size))
return a
###
### Demo function call
demo_bounds = [0, 9, 0, 90]
demo_grid_size = 10
# estimator adds the x and y values
demo_estimator = lambda a: a[0] + a[1]
demo_grid = density_grid(demo_bounds, demo_grid_size, demo_estimator)
print(demo_grid)
```

**⚙️ Key Concepts**:
* NumPy
    * `np.linspace(start, stop, num)` → `num` evenly spaced values from `start` to `stop`, endpoints included
    * `np.meshgrid(x, y)` → builds 2-D coordinate grids from the 1-D axis arrays
    * `np.vstack((a, b))` → stacks the flattened x/y coordinates into a 2×N array of points
    * `.reshape((grid_size, grid_size))` → folds the estimator's flat output back into the grid shape
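🪄 A tiny sketch of how `meshgrid`, `flatten`, and `reshape` fit together (3×3 grid, illustrative numbers):
```python
# Sketch: evaluate a function over every (x, y) pair of a small grid
import numpy as np

x = np.linspace(0, 2, 3)                        # [0. 1. 2.]
y = np.linspace(0, 20, 3)                       # [ 0. 10. 20.]
xx, yy = np.meshgrid(x, y)                      # both 3x3
pts = np.vstack((xx.flatten(), yy.flatten()))   # shape (2, 9): every (x, y) pair
vals = (pts[0] + pts[1]).reshape((3, 3))        # evaluate, then fold back into the grid
print(vals)
```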
---
# PMT2 - FA24
[Link to PMT2-FA24 on Vocareum](https://labs.vocareum.com/main/main.php?m=editor&asnid=4278380&stepid=4278381&hideNavBar=1)
## 🧩
### 🔹 Exercise 0 - Pandas - to_datetime(args, utc=True), dt.tz_convert('America/New_York')
**🎯 Goal**: Given a list of dictionaries, create a DataFrame that meets the criteria.
**➡️ Input**: `records` → a list of dictionaries, where each dictionary contains information for a single row of data
**⬅️ Output**: A Pandas DataFrame containing the data held in `records` with columns:
* `Firm` → in the dictionary under the `Stock` key
* `Date` → convert the string to datetime64[ns, America/New_York]
* `Open`
* `Close`
* `Volume`
**✅ Solution**:
```python
### Solution - Exercise 0
def load_data(records: list) -> pd.DataFrame:
###
df = pd.DataFrame(records)
df = df.rename(columns={'Stock':'Firm'})
df['Date'] = pd.to_datetime(df['Date'], utc=True).dt.tz_convert('America/New_York')
#df['Date'] = df['Date'].dt.tz_localize('UTC')
return df
###
### Demo function call
### `convert_timezone_types` demo ---------------------------------------------
# This helper function will help you change the data-type
# -- Input: pd.Series, with dtype "string"
# -- Output: pd.Series, with dtype "datetime64[ns, America/New_York]"
type_conversion_demo = pd.DataFrame({
'Date': ['2017-11-15 00:00:00-05:00', '2021-01-15 00:00:00-05:00'],
'DemoColumn': [1, 2]
})
# Convert the type!
type_conversion_demo['Date'] = convert_timezone_types(type_conversion_demo['Date'])
assert type_conversion_demo['Date'].dtype == 'datetime64[ns, America/New_York]' # It works!
### `load_data` demo ----------------------------------------------------------
# We'll sample the dataset to 20 records, just to make debugging easier.
random.seed(6040)
stock_data_subsample = random.sample(stock_data, k=10)
# Here's what your solution produces!
print('Here is what your loaded data looks like in Pandas:')
stock_subsample_df = load_data(stock_data_subsample)
display(stock_subsample_df)
```

**⚙️ Key Concepts**:
* Pandas
    * `pd.to_datetime(column, utc=True)` → parses strings that carry timezone offsets and converts them all to UTC
    * `.dt.tz_convert('America/New_York')` → converts a timezone-aware datetime column to the specified timezone
📘 https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
📘 https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.tz_convert.html
📘 https://stackoverflow.com/questions/55385497/how-can-i-convert-my-datetime-column-in-pandas-all-to-the-same-timezone/55386135#55386135
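🪄 The full string → UTC → New York round trip on two illustrative timestamps:
```python
# Sketch: normalize mixed-offset strings to UTC, then shift to America/New_York
import pandas as pd

s = pd.Series(['2017-11-15 00:00:00-05:00', '2021-06-15 00:00:00-04:00'])
s = pd.to_datetime(s, utc=True)            # different offsets → one common UTC clock
s = s.dt.tz_convert('America/New_York')    # then re-express in the target timezone
print(s.dtype)                             # datetime64[ns, America/New_York]
```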
### 🔹 Exercise 1 - Pandas - .head(), .sample()
**🎯 Goal**: Preview the data sources.
**➡️ Input**:
* `records` → DataFrame with the results from the exercise above
* `conn` → SQLite connection to the database
**⬅️ Output**: A tuple with:
* `records_preview` → DataFrame containing the first 10 entries of `records`
* `nflx_originals_preview` → DataFrame containing 5 different, random entries from the nflx_originals SQL database table
* `nflx_top_preview` → DataFrame containing 5 different, random entries from the nflx_top SQL database table
**✅ Solution**:
```python
### Solution - Exercise 1
def preview_data(records: pd.DataFrame, conn: sqlite3.Connection) -> tuple:
    # Keep only the first 10 rows of the stock records
    records_preview = records.head(n=10)
# Select everything from nflx_originals and randomly sample 5 rows
nflx_originals_preview = pd\
.read_sql('SELECT * FROM nflx_originals', conn)\
.sample(5)
# Select everything from nflx_top and randomly sample 5 rows
nflx_top_preview = pd\
.read_sql('SELECT * FROM nflx_top', conn)\
.sample(5)
return records_preview, nflx_originals_preview, nflx_top_preview
### Demo function call
records_preview, nflx_originals, nflx_top = preview_data(stock_df, conn)
print('Preview of Stock Data')
display(records_preview)
print('------------------------------------------------------------------------')
print('Preview of Netflix Originals Table')
display(nflx_originals)
print('------------------------------------------------------------------------')
print('Preview of Netflix Top Programs Table')
display(nflx_top)
```

**⚙️ Key Concepts**:
* Pandas
    * `df.head(n=10)` → the first `n` rows of a DataFrame
    * `df.sample(5)` → 5 randomly selected rows
    * `pd.read_sql(query, conn)` → run a query and return the result as a DataFrame
### 🔹 Exercise 2 - Pandas - Shift, Column Calculations, Outer Merge
**🎯 Goal**: Calculate `Daily_Growth` for a given firm.
**➡️ Input**:
* `stock_records` → a DataFrame
* `firm` → a string specifying a particular firm's stock
**⬅️ Output**: `stock_growth` → a DataFrame with the columns:
* `Date`
* `Firm`
* `Open`
* `Close`
* `Volume`
* `Day_Growth`
Requirements/Steps:
1. Do not use SQLite.
2. Create a new Pandas DataFrame with Day_Growth appended as a new column. The growth should be specific to the firm specified by the firm parameter.
3. Some days have no recorded values for opening and closing prices. For example, stock prices are not recorded during the weekend.
* You should use the next valid close price to calculate growth in these instances.
4. Daily growth is defined as
$$growth_0 = \frac{close_0}{close_{-1}} - 1$$
* In English: "Daily Growth is Today's Close divided by Yesterday's Close, minus one."
* $close_x$ is the value of the Close, offset by $x$. So, $close_{-1}$ is the value of Close, offset by 1 position.
* You may find the `pd.DataFrame.shift()` method, and the related `GroupBy` shift method helpful for offsetting the close value!
5. To repeat: Some days have no recorded values for opening and closing prices. For example, stock prices are not recorded during the weekend.
* You should use the next valid close price to calculate growth in these instances.
* For example, if:
A. The close price on Friday is 10
B. There are no close prices for the next two days (Saturday and Sunday)
C. The close price on Monday is 12 ... then the daily growth for Monday should be equal to (12/10)-1=0.2.
* You can make this easier by filtering out all NaN values when you initially calculate the growth.
6. Your final DataFrame (stock_growth) should contain the same number of records as stock_records for that firm.
7. Finally, reset the indexes before returning the resulting dataframe.

**✅ Solution**:
```python
### Solution - Exercise 2
def calculate_daily_growth(stock_records: pd.DataFrame, firm: str) -> pd.DataFrame:
### BEGIN SOLUTION
# Filter the DataFrame for the stock
stockdf = stock_records[stock_records['Firm']==firm].copy()
# Remove NA values in a copy.
# We'll join everything back together at the end.
stockdf2 = stockdf.copy()
stockdf2 = stockdf2.dropna()
# Lag the close and calculate the daily growth
stockdf2['Yest_Close'] = stockdf2.groupby('Firm')['Close'].shift(1)
stockdf2['Day_Growth'] = (stockdf2['Close']/stockdf2['Yest_Close']) - 1
# Merge everything back together and fill missing values with 0
output = stockdf.merge(
stockdf2,
how='outer',
left_index=True,
right_index=True,
suffixes=(None, '_y'))
output = output[['Date', 'Firm', 'Open', 'Close', 'Volume', 'Day_Growth']]
output['Day_Growth'] = output['Day_Growth'].fillna(0)
# Return the result
return output.reset_index(drop=True)
### END SOLUTION
### Demo function call
print("Demo daily growth for Netflix's stock:")
daily_growth_nflx_demo = calculate_daily_growth(
stock_records=calculate_daily_growth_demo_input,
firm='NFLX'
)
display(daily_growth_nflx_demo)
```
**⚙️ Key Concepts**:
* Pandas
    * `df[df['column_name'] == condition]` → boolean filter that keeps only the ROWS meeting the criteria
    * `df['new_column_name'] = df.groupby('column_name')['column_to_shift'].shift(#)` → within each group, shifts the chosen column's values down by the given number of rows and stores the result in the new column
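🪄 A toy version of the lag-and-divide step (made-up prices):
```python
# Sketch: lag the close within each firm, then compute daily growth
import pandas as pd

toy = pd.DataFrame({
    'Firm':  ['NFLX', 'NFLX', 'NFLX'],
    'Close': [10.0, 12.0, 15.0],
})
toy['Yest_Close'] = toy.groupby('Firm')['Close'].shift(1)            # NaN, 10, 12
toy['Day_Growth'] = (toy['Close'] / toy['Yest_Close'] - 1).fillna(0)
print(toy)   # Day_Growth: 0.0, 0.2, 0.25
```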
### 🔹 Exercise 3 - Pandas - Year String Extraction, Product and Column Calculation
**🎯 Goal**: Compute each firm’s annual growth by multiplying daily returns within the same calendar year, then attach that value back to every row of the original records for that firm-year.
**➡️ Input**:
* `stock_records` → `pd.DataFrame` produced after Exercise 2 (must include `Date`, `Firm`, and `Day_Growth`).
**⬅️ Output**:
* `annual_growth` → pd.DataFrame with all original `stock_records` columns plus:
* `Year` (string)
* `Annual_Growth` (float), repeated for all rows in the same (Firm, Year).
Requirements/Steps:
* Extract year from Date and store as string: e.g., "2019"
* For each `(Firm, Year)`
* Compute $\text{Annual\_Growth}_{f,y} = \prod_i \left(1 + \text{Day\_Growth}_i\right) - 1$ over that firm-year's rows
* `Day_Growth` → Fill `NaN` with 0
* Merge the results back onto a copy of `stock_records` by `(Firm, Year)`
* Reset the index
**✅ Solution**:
```python
### Solution - Exercise 3
def calculate_annual_growth(stock_records: pd.DataFrame) -> pd.DataFrame:
### BEGIN SOLUTION
#Step 0 - Create a copy
stockdf = stock_records.copy()
#Step 1 - Extract the year
stockdf['Year'] = stockdf['Date'].dt.strftime('%Y')
#Step 2 - Create the growth ratio column and add 1 to each value for Day_Growth → Eventually will become Annual_Growth
##Why - Break the formula down into small bites as opposed to a large bite
stockdf['Growth_Ratio'] = 1 + stockdf['Day_Growth']
#Step 3 - Separate the needed columns for our Product calculation, group them and apply .prod on the Growth_Ratio column
stockdf2 = stockdf[['Year', 'Growth_Ratio', 'Firm']].copy()
resultdf = stockdf2.groupby(['Year', 'Firm']).prod('Growth_Ratio') - 1
#Step 4 - Ensure resultdf is correct index and correct column names
resultdf = resultdf.reset_index()
resultdf.columns = ['Year', 'Firm', 'Annual_Growth']
#Step 5 - Merge our DataFrames together
output = pd.merge(stockdf, resultdf)
#Step 6 - Reorder and return the output
r = ['Date', 'Year', 'Firm', 'Open', 'Close', 'Volume', 'Day_Growth', 'Annual_Growth']
return output[r]
### END SOLUTION
### Demo function call
annual_growth_demo_df = calculate_annual_growth(stock_records=stock_daily_growths_reduced_df)
display(annual_growth_demo_df)
```

**⚙️ Key Concepts**:
* Pandas
    * `df['date_column'].dt.strftime('%Y')` → returns a Series of four-digit year strings extracted from the datetime column
    * `df['date_column'].dt.year.astype(str)` → also works
    * `df.groupby([list of columns]).prod('column')` → groups by the provided columns and takes the product of the remaining numeric column(s) (here, `Growth_Ratio`)
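🪄 Compounding daily growth into annual growth on toy numbers:
```python
# Sketch: product of (1 + daily growth) per (Year, Firm), minus 1
import pandas as pd

toy = pd.DataFrame({
    'Firm': ['NFLX'] * 3,
    'Year': ['2020'] * 3,
    'Day_Growth': [0.10, -0.05, 0.02],
})
toy['Growth_Ratio'] = 1 + toy['Day_Growth']
annual = toy.groupby(['Year', 'Firm'])['Growth_Ratio'].prod() - 1
print(annual)   # 1.10 * 0.95 * 1.02 - 1 ≈ 0.0659
```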
### 🔹 Exercise 4 - Pandas - Rolling Average
**🎯 Goal**: Implement a function that calculates a concept known as the 'Golden Cross'.
**➡️ Input**:
* `stockdf` → a Pandas DataFrame. It contains the results calculated by exercise 3
* `firm` → a string. It specifies which firm to calculate the golden cross results for
* `short_average` → an integer. It defines the window for the short moving average. The number represents the number of days to include in the window
* `long_average` → an integer. It defines the window for the long moving average. The number represents the number of days to include in the window
**⬅️ Output**: `golden_cross_results` → DataFrame with all columns from `stockdf` plus:
* `Short_Average`
* `Long_Average`
* `Golden_Cross`
Requirements/Steps:
1. Use Pandas. Do not use SQLite.
2. Filter the DataFrame to include records for the firm parameter.
3. Some Close values are recorded as NaN. These should be filtered out before calculating the moving averages.
* You may find the `pd.DataFrame.notna()` method useful for filtering these values out!
4. Add the columns Short_Average and Long_Average to the Pandas DataFrame.
* These columns should calculate the moving average of the Close column.
* The window of the moving average should be equal to the size specified by the short_average and long_average arguments.
* You may find the `pd.DataFrame.rolling()` method useful for creating the windows!
5. Add the column 'Golden_Cross', a boolean, to the Pandas DataFrame.
* This column should indicate whether the short-window moving average is greater than the long-window moving average.
6. You may assume that only one firm is provided for the firm argument.
7. You may assume that stockdf is already ordered by date in ascending order.
8. Finally, reset the indexes before returning the resulting dataframe.
**✅ Solution**:
```python
### Solution - Exercise 4
def golden_cross_strategy(stockdf: pd.DataFrame, firm: str, short_average: int, long_average: int) -> pd.DataFrame:
###
#Step 0 - Create a copy
df = stockdf.copy()
#Step 1 - Filter rows to the firm string provided
df = df[df['Firm'] == firm]
#Step 2 - Filter out NaN values in Close
df = df[df['Close'].notna()]
#Step 3 - Implement Rolling averages using .rolling
df['Short_Average'] = df['Close'].rolling(window=short_average).mean()
df['Long_Average'] = df['Close'].rolling(window=long_average).mean()
#Step 4 -
df['Golden_Cross'] = df['Short_Average'] > df['Long_Average']
df = df.reset_index(drop=True)
return df
###
## Demo function call
golden_cross_demo_df = golden_cross_strategy(
stockdf=annual_growth_df,
firm='NFLX',
short_average=3,
long_average=7)
# Let's only look at the first 15 days for our input
display(golden_cross_demo_df.head(15))
```
**⚙️ Key Concepts**:
* Pandas
    * `df['column_name'].rolling(window= ).mean()` → creates a rolling average of the column, where `window` is the number of rows included in each window
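🪄 A toy golden-cross calculation (illustrative prices; the comparison treats the leading NaN rows as False):
```python
# Sketch: short vs long rolling averages and the Golden_Cross flag
import pandas as pd

toy = pd.DataFrame({'Close': [10, 11, 12, 11, 13, 14, 15]})
toy['Short_Average'] = toy['Close'].rolling(window=2).mean()
toy['Long_Average'] = toy['Close'].rolling(window=4).mean()
toy['Golden_Cross'] = toy['Short_Average'] > toy['Long_Average']
print(toy)   # rows where either average is NaN compare as False
```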
### 🔹 Exercise 5 - Pandas - pivot(), Row-wise Average Across Columns
**🎯 Goal**: Pivot the daily growth data so each firm becomes a column, then add the average daily growth of every firm other than the chosen one.
**➡️ Input**:
* `daily_growth_by_firm` → DataFrame with `Date`, `Firm`, and `Day_Growth` columns
* `firm` → string naming the firm to exclude from the average
**⬅️ Output**: A pivoted DataFrame with one growth column per firm plus a `Non_<firm>_Average_Growth` column.
**✅ Solution**:
```python
### Solution - Exercise 5
def normalize_stock_growths(daily_growth_by_firm: pd.DataFrame, firm: str) -> pd.DataFrame:
# Create a list of every firm, except for the one we specified in the inputs
non_chosen_firms = set(daily_growth_by_firm['Firm']) - set([firm])
non_chosen_firms = list(non_chosen_firms)
# Pivot the data for easy manipulation
pivoted_data = daily_growth_by_firm.pivot(
index='Date',
columns='Firm',
values='Day_Growth'
).reset_index()
# Calculate the total growth for every firm that isn't our firm
pivoted_data[f'Non_{firm}_Average_Growth'] = pivoted_data[non_chosen_firms]\
.sum(axis=1) / len(non_chosen_firms)
return pivoted_data
### Demo function call
normalized_returns_demo_df = normalize_stock_growths(golden_crosses_reduced_df, 'NFLX')
display(normalized_returns_demo_df)
```
**⚙️ Key Concepts**:
* Pandas
    * `df.pivot(index=, columns=, values=)` → reshapes long data to wide: one row per index value, one column per `columns` value
    * `df[cols].sum(axis=1)` → row-wise sum across the selected columns
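🪄 A toy pivot showing how the non-chosen-firm average is built:
```python
# Sketch: long → wide pivot, then average the columns for every other firm
import pandas as pd

toy = pd.DataFrame({
    'Date': ['d1', 'd1', 'd2', 'd2'],
    'Firm': ['NFLX', 'DIS', 'NFLX', 'DIS'],
    'Day_Growth': [0.02, 0.01, -0.01, 0.03],
})
wide = toy.pivot(index='Date', columns='Firm', values='Day_Growth').reset_index()
others = ['DIS']                                      # every firm except the chosen one
wide['Non_NFLX_Average_Growth'] = wide[others].sum(axis=1) / len(others)
print(wide)
```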
### 🔹 Exercise 6 - SQLite - SELECT, COUNT(\*), GROUP BY, ORDER BY
**🎯 Goal**: Write a function that returns an SQL query that summarizes the trends of Netflix originals by year and genre.
**➡️ Input**: None
**⬅️ Output**: `query` → a Python string with an SQL query. It should query the database for these columns:
* `year`
* `genre`
* `genre_count`
Requirements/Steps:
1. Count the number of Netflix original shows, grouped by year and by genre.
* The year variable should be extracted from the premiere variable. Note that in this case, year will be a string. We suggest you use the STRFTIME SQLite function to extract year from the premiere variable.
2. Order them by:
A. genre_count (descending)
B. year (ascending)
C. genre (ascending)
3. Return the top 10 rows
The database table you will need is named `nflx_originals`.
Make sure you rename your column variables so that they match the specifications above.
**✅ Solution**:
```python
### Solution - Exercise 6
def summarize_netflix_original_genres()-> str:
###
query = f'''
SELECT strftime('%Y', premiere) AS year,
genre,
COUNT(*) AS genre_count
FROM nflx_originals
GROUP BY year, genre
ORDER BY genre_count DESC, year ASC, genre ASC
LIMIT 10
'''
return query
###
### Demo function call
summary_query = summarize_netflix_original_genres()
netflix_original_genres = pd.read_sql(summary_query, conn)
display(netflix_original_genres)
```

**⚙️ Key Concepts**:
* SQLite
    * `SELECT`
    * `strftime('%Y', date_column)` → pulls the year out of a date as a string
    * `COUNT(*)` → counts the rows for each unique (year, genre) combination
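🪄 Two ways to pull a year out of a date string in SQLite (toy table; `strftime` wants an ISO-style date, while `SUBSTR` works on any fixed layout):
```python
# Sketch: strftime vs SUBSTR for extracting the year
import sqlite3

demo = sqlite3.connect(':memory:')
demo.execute("CREATE TABLE shows (premiere TEXT)")
demo.execute("INSERT INTO shows VALUES ('2019-07-04')")
print(demo.execute("SELECT strftime('%Y', premiere), SUBSTR(premiere, 1, 4) FROM shows").fetchone())
# ('2019', '2019')
demo.close()
```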
### 🔹 Exercise 7 - SQLite - Subquery / CTE (WITH ... AS)
**🎯 Goal**: Return a SQL query that scores the Netflix top-10 shows and reports each show's total score, number of appearances, and average score.
**➡️ Input**: None
**⬅️ Output**: `query` → string with a SQLite query. The query should produce the following columns:
* `title` → title of the show
* `total_score` → sum of the show's score
* `occurrence` → the number of times a show appeared in the top 10
* `avg_score` → `total_score` divided by `occurrence`
Requirements/Steps:
1. You will need to calculate show scores to solve this problem:
* A show ranked #1 on a day should earn 10 points towards its total score. A show ranked #2 should earn 9, etc. until a show ranked #10 should earn 1 point.
2. After counting the occurrences, you'll need to calculate the `avg_score` to produce a floating-point value
3. Order the results by show:
* `total_score` in descending order
* `title` in descending order
4. Return the top 10 rows
**✅ Solution**:
```python
### Solution - Exercise 7
def calc_netflix_top10_scores() -> str:
###
query = f'''
WITH scores AS (SELECT SUM(11 - rank) AS total_score,
COUNT(title) AS occurrence,
title
FROM nflx_top
GROUP BY title
ORDER BY total_score DESC, title DESC)
SELECT title,
total_score,
occurrence,
(total_score * 1.0 / occurrence) AS avg_score
FROM scores
LIMIT 10
'''
return query
###
### Demo function call
scores_query = calc_netflix_top10_scores()
netflix_top10_scores = pd.read_sql(scores_query, conn)
display(netflix_top10_scores)
```

**⚙️ Key Concepts**:
* SQLite
    * `WITH name AS (SELECT ...)` → a common table expression (CTE); the outer SELECT reads from it like a table
    * `SUM(11 - rank)` → converts ranks 1–10 into scores 10 down to 1 before summing
    * `expr * 1.0 / expr` → multiplying by `1.0` forces floating-point division
### 🔹 Exercise 8 -
**🎯 Goal**:
**➡️ Input**:
**⬅️ Output**:
**✅ Solution**:
```python
```
**⚙️ Key Concepts**:
### 🔹 Exercise 9 - Pandas - Pivot, Merge, Model Matrix
**🎯 Goal**: Prepare a model matrix for regression by combining show scores (independent variables) and relative growth (response variable) into one DataFrame.
**➡️ Input**:
* `normalized_growths_df` → DataFrame from Exercise 5
* `show_scores_df` → DataFrame that contains pre-computed scores for each show on a given date
**⬅️ Output**:
* `model_matrix` → DataFrame containing independent variables (show scores) and response variable (Relative_Growth)
Requirements/Steps:
1. To obtain the independent variables, you will need to pivot the show_scores_df so that each show is represented as its own column. The index should reflect the dates. The values should be set to the show's score.
* Fill any missing values with 0's.
* See pandas.DataFrame.pivot for more information
* You may also refer to Exercise 5 from above too
2. You'll need to join the pivoted dataframe with the normalized_growths_df on date. We'll call this model_matrix
3. Return the model_matrix but only with the columns for the shows and Relative_Growth
**✅ Solution**:
```python
### Solution - Exercise 9
def construct_model_inputs(normalized_growths_df: pd.DataFrame, show_scores_df: pd.DataFrame) -> pd.DataFrame:
###
ssdf = show_scores_df.copy()
ssdf = ssdf.fillna(0)
ssdf = ssdf.pivot(
index='date',
columns='title',
values='score'
).reset_index().fillna(0)
ssdf1 = ssdf.copy()
ssdf1 = ssdf1.iloc[:, 1:]
columns = ssdf1.columns.to_list()
ngdf = normalized_growths_df.copy()
col1 = ngdf.columns.to_list()
columns1 = col1[-1]
columns.append(columns1)
cols = columns[-1:] + columns[:-1]
model_growth = ssdf.merge(ngdf, how="outer", left_on='date', right_on='Date').dropna()
model_growth = model_growth[cols]
return model_growth
###
### Demo function call
model_matrix = construct_model_inputs(normalized_growths_df, nflx_scores_demo_df)
display(model_matrix)
```
**⚙️ Key Concepts**:
* Pandas
    * `df.pivot(index='date', columns='title', values='score')` → one column per show, indexed by date
    * `.fillna(0)` → replaces missing scores with 0
    * `df.merge(other, how='outer', left_on='date', right_on='Date')` → joins on key columns with different names
---
# PMT2-SP25
[Link to PMT2-SP25 on Vocareum](https://labs.vocareum.com/main/main.php?m=editor&asnid=4278070&stepid=4278071&hideNavBar=1)
**1 Point Questions and Takeaways**:
* 0 - Freebie
* 2
    * SQL - Struggled to get an output here: quoted the column name (`'column_name'`), which was not necessary
* 7
    * Pandas - Reordering columns - find a faster method to reorder columns in place
**2 Point Questions and Takeaways**:
* 1
* 3
* 6
    * Pandas, Python -
* 8
* 9

💡 CTRL+F "1.4 Schema Diagram" in the notebook to jump to the schema diagram.

Helper Functions:
* `get_table_list(conn)` → retrieves the list of tables from the SQLite database.

* `get_column_details(conn, table_name)` → retrieves column details for a specific table.

* `get_foreign_key_details(conn, table_name)` → retrieves foreign key details for a specific table

* `get_table_schema(conn, table_name)` → combines column details and foreign key details for a specific table.
