Better Research Software - 2026-03-09-NCL

# Building Better Research Software - 2026-03-09-NCL

### This document: bit.ly/2026-03-09-NCL

https://hackmd.io/@rseteam/2026-03-09-NCL/edit
(_**no need** to log in to Hackmd.io, edit mode: CTRL+ALT+B_)

## Online Meeting:

https://teams.microsoft.com/meet/31298350057381?p=qf6hykS6p6yQVFwe2D

# Day 2 Schedule

The schedule on the website for Day 2 is not correct. We will have breaks at the same times as yesterday (may vary by a few minutes):

- Morning Break: 10:30 - 10:45
- LUNCH: 12:30 - 13:30
- Afternoon Break: 15:00 - 15:15

## Links:

- [The UK Reproducibility Network (UKRN)](https://www.ukrn.org/) funded the [Software Sustainability Institute](https://software.ac.uk) to develop this Carpentries Style training workshop
- [The Carpentries](https://carpentries.org)
- [Newcastle University RSE Team](https://rse.ncldata.dev)
- [Code of Conduct](https://docs.carpentries.org/policies/coc/)
- [Workshop website](https://nclrse-training.github.io/2026-03-09-NCL/)
- [Link to lesson website](https://carpentries-incubator.github.io/better-research-software/)
- [Software Used](https://carpentries-incubator.github.io/better-research-software/#software-setup)
- [Check your GitHub Account](https://github.com/) (SSH key on your local machine)
- [Link to lesson data and code](https://github.com/carpentries-incubator/better-research-software/raw/refs/heads/main/learners/spacewalks.zip)
- [Pre-workshop survey](https://forms.office.com/e/mhTCTTGtb6)
- [Post-workshop survey](https://forms.office.com/e/x8SrtmPRkB)
- [Code Community](https://teams.microsoft.com/l/team/19%3aG79Rz7Mhk6rC0mhia04YCD-nj7WabLMxhnyb1YLp04A1%40thread.tacv2/conversations?groupId=7059214c-2200-4ad6-a739-9d350c74c7a9&tenantId=9c5012c9-b616-44c2-a917-66814fbe3e87)

## Your instructors

[Email us your questions](mailto:training.researchcomputing@newcastle.ac.uk) at training.researchcomputing@newcastle.ac.uk

- Carol Booth (carol.booth2@newcastle.ac.uk)
- Dr. Frances Turner (frances.hutchings@newcastle.ac.uk)
- Dr. Jannetta Steyn (jannetta.steyn@newcastle.ac.uk)

## GitHub Account:

Paste your [<span style="color:red">GitHub login name</span>](https://github.com) on a new line

*Please enter your GitHub username below:*

carolbooth2
Fehings
jsteyn
Kat-01-del
OliverWillis-CPI
minzhang904
UniCandice

## Useful information:

_Everyone is welcome to add tips, tricks and comments to this document_ by editing text in the left pane of this document in edit mode (CTRL+ALT+B)

- The Conda equivalent of `pip freeze > requirements.txt` is `conda list --explicit > spec-file.txt` (as recommended by https://docs.conda.io/)
- https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-template-repository. You can make an existing repository a template, so you and others can generate new repositories with the same directory structure, branches, and files. This can include things like generic tests and some CI/CD, such as pre-commit hooks that check the requirements/imports for forbidden packages on each git commit.
- https://github.com/the-turing-way/reproducible-project-template is an example public project template from the Turing Way
- NewcastleRSE/Standard-Project on GitHub: a template repo for the standard RSE project
- Python may still use system-level packages instead of the versions installed into the virtual environment.
  Previous participants solved the problem by unsetting the `PYTHONHOME`, `PYTHONPATH` and `PYTHONSTARTUP` environment variables before creating the virtual environment for the project (Bash):

  `$ unset PYTHONHOME PYTHONPATH PYTHONSTARTUP`

  `$ python -m venv ./venv_spacewalks`
- Right-click and 'rename symbol' can be helpful for changing a variable name throughout a whole script; or use 'refactor' in PyCharm.
- Generate citation files: https://citation-file-format.github.io/cff-initializer-javascript/#/

## Virtual Environments Conversation

We had a chat at the start of Day 2 about issues with virtual environments using 'system' packages instead of the ones within the venv. It's best to keep your system 'clean', with minimal Python and package installations, to avoid these conflicts.

## File naming when using Pytest

Pytest automatically discovers tests by looking for files named like `test_*.py` or `*_test.py`.
Pytest then runs any functions in those files whose names start with `test_`.

It's a good idea, though not essential, to reserve this naming convention for test code. Any files with names like this will be examined, but only functions named `test_*` will be run by the test suite.

### Checking test coverage with pytest-cov

`python -m pytest --cov --cov-report=html`

Min checked whether a new Python script in our directory would be picked up for coverage by pytest, and found that only scripts referenced (e.g. imported) in a `test_*.py` file are checked.

## Code performance - Profiling Tools

Not really covered in the lesson but very handy, so Frances gave a demo...
Python has an inbuilt tool called `cProfile` (avoid the older `profile` if possible, as `cProfile` is implemented in C for greater efficiency).

`$ python`

```
>>> import cProfile
>>> from eva_data_analysis import main
>>> cProfile.run('main("data/eva-data.json","results/eva-data.csv","results/cumulative_eva_graph.png")')
```

gave a LOT of output because of calls to packages like matplotlib

```
>>> import cProfile
>>> from eva_data_analysis import text_to_duration
>>> cProfile.run('text_to_duration("10:00")')
```

gave a smaller, more digestible output

## Continuous Integration for automated testing

https://carpentries-incubator.github.io/better-research-software/instructor/ci-for-testing.html

GitHub Marketplace allows you to choose from published actions (be aware that anyone can publish): https://github.com/marketplace?type=actions

## Places to start

- Decide which tools you will use and stick with them!
- Agree on coding conventions (e.g. variable naming)
- Use a template repo with:
  - Structured Readme.md
  - Directory structure (e.g. have a tests directory)
  - Action on push to run tests (even if tests don't exist yet)
- Provide tests to your developers so they can check before pushing

----------

**Current state of the code** can be found at the start of each episode in the [lesson website](https://carpentries-incubator.github.io/better-research-software/) but we have also put them here for convenience:
[Better Research Software - Code State](https://hackmd.io/@RSETeam/BetterRS-CodeState)

---------

## Code State

At certain points in the lesson we will update the code to fit better practice. As this is not specifically a Python course, but rather a course on best practice with Python as an example, we will at points ask you to update your code by copying the code versions below for speed.
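
Following up on the profiling demo above: when `cProfile.run` prints more than you can read, one option is to collect the statistics with `cProfile.Profile` and let the standard-library `pstats` module sort and trim the report. The sketch below is not from the lesson; `profile_call` is a hypothetical helper name, and `text_to_duration` is a stand-in copy of the lesson function (with the `/60` fix) so the example runs on its own.

```python
import cProfile
import io
import pstats


def text_to_duration(duration):
    """Stand-in for the lesson function: convert "HH:MM" to hours."""
    hours, minutes = duration.split(":")
    return int(hours) + int(minutes) / 60


def profile_call(func, *args, top=5):
    """Hypothetical helper: profile one call and return (result, report).

    The report lists only the `top` entries, sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    # Write the report into a string instead of printing it directly
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(text_to_duration, "10:00")
print(report)
```

Passing `main` and its three file arguments instead of `text_to_duration` should cut the matplotlib-heavy report down to the `top` most expensive entries by cumulative time.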
Starting Code:

<details>

```python=
# https://data.nasa.gov/resource/eva.json (with modifications)
data_f = open('./eva-data.json', 'r', encoding='ascii')
data_t = open('./eva-data.csv','w', encoding='utf-8')
g_file = './cumulative_eva_graph.png'

fieldnames = ("EVA #", "Country", "Crew ", "Vehicle", "Date", "Duration", "Purpose")

data=[]
import json

for i in range(375):
    line=data_f.readline()
    print(line)
    data.append(json.loads(line[1:-1]))
#data.pop(0)
## Comment out this bit if you don't want the spreadsheet
import csv

w=csv.writer(data_t)

import datetime as dt

time = []
date =[]

j=0
for i in data:
    print(data[j])
    # and this bit
    w.writerow(data[j].values())
    if 'duration' in data[j].keys():
        tt=data[j]['duration']
        if tt == '':
            pass
        else:
            t=dt.datetime.strptime(tt,'%H:%M')
            ttt = dt.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second).total_seconds()/(60*60)
            print(t,ttt)
            time.append(ttt)
            if 'date' in data[j].keys():
                date.append(dt.datetime.strptime(data[j]['date'][0:10], '%Y-%m-%d'))
                #date.append(data[j]['date'][0:10])
            else:
                time.pop(0)
    j+=1

t=[0]
for i in time:
    t.append(t[-1]+i)

date,time = zip(*sorted(zip(date, time)))

import matplotlib.pyplot as plt

plt.plot(date,t[1:], 'ko-')
plt.xlabel('Year')
plt.ylabel('Total time spent in space to date (hours)')
plt.tight_layout()
plt.savefig(g_file)
plt.show()
```

</details>

### Code Readability Lesson:

<details>

#### Code update 1: Pandas refactor

<details>

Please copy the below code and paste over the contents of `eva_data_analysis.py` when asked:

```python=
import matplotlib.pyplot as plt
import pandas as pd

# Data source: https://data.nasa.gov/resource/eva.json (with modifications)
input_file = open('./eva-data.json', 'r', encoding='ascii')
output_file = open('./eva-data.csv', 'w', encoding='utf-8')
graph_file = './cumulative_eva_graph.png'

eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
eva_df['eva'] = eva_df['eva'].astype(float)
eva_df.dropna(axis=0, subset=['duration', 'date'],
              inplace=True)
eva_df.to_csv(output_file, index=False, encoding='utf-8')
eva_df.sort_values('date', inplace=True)
eva_df['duration_hours'] = eva_df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
eva_df['cumulative_time'] = eva_df['duration_hours'].cumsum()
plt.plot(eva_df['date'], eva_df['cumulative_time'], 'ko-')
plt.xlabel('Year')
plt.ylabel('Total time spent in space to date (hours)')
plt.tight_layout()
plt.savefig(graph_file)
plt.show()
```

</details>

#### Code update 2: initial functions

<details>

Please copy the below code and paste over the contents of `eva_data_analysis.py` when asked:

```python=
import matplotlib.pyplot as plt
import pandas as pd


def read_json_to_dataframe(input_file):
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


# Main code

print("--START--")

input_file = open('./eva-data.json', 'r', encoding='ascii')
output_file = open('./eva-data.csv', 'w', encoding='utf-8')
graph_file = './cumulative_eva_graph.png'

# Read the data from JSON file
eva_data = read_json_to_dataframe(input_file)

# Convert and export data to CSV file
write_dataframe_to_csv(eva_data, output_file)

# Sort dataframe by date ready to be plotted (date values are on x-axis)
eva_data.sort_values('date', inplace=True)

# Plot cumulative time spent in space over years
print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
eva_data['duration_hours'] = eva_data['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
eva_data['cumulative_time'] = eva_data['duration_hours'].cumsum()
plt.plot(eva_data['date'], eva_data['cumulative_time'], 'ko-')
plt.xlabel('Year')
plt.ylabel('Total time spent in space to date (hours)')
plt.tight_layout()
plt.savefig(graph_file)
plt.show()

print("--END--")
```

</details>

#### Code update 3: Docstrings

<details>

```python=
import matplotlib.pyplot as plt
import pandas as pd


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.
    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df['duration_hours'] = df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


# Main code

print("--START--")

input_file = open('./eva-data.json', 'r', encoding='ascii')
output_file = open('./eva-data.csv', 'w', encoding='utf-8')
graph_file = './cumulative_eva_graph.png'

# Read the data from JSON file
eva_data = read_json_to_dataframe(input_file)

# Convert and export data to CSV file
write_dataframe_to_csv(eva_data, output_file)

# Sort dataframe by date ready to be plotted (date values are on x-axis)
eva_data.sort_values('date', inplace=True)

# Plot cumulative time spent in space over years
plot_cumulative_time_in_space(eva_data, graph_file)

print("--END--")
```

</details>

</details>

### Code Structure Lesson:

<details>

#### Code update 1: Adding more functions

<details>

```python=
import matplotlib.pyplot as plt
import pandas as pd


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.
    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6  # there is an intentional bug on this line (should divide by 60 not 6)
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.
    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


# Main code

print("--START--")

input_file = open('./eva-data.json', 'r', encoding='ascii')
output_file = open('./eva-data.csv', 'w', encoding='utf-8')
graph_file = './cumulative_eva_graph.png'

# Read the data from JSON file
eva_data = read_json_to_dataframe(input_file)

# Convert and export data to CSV file
write_dataframe_to_csv(eva_data, output_file)

# Sort dataframe by date ready to be plotted (date values are on x-axis)
eva_data.sort_values('date', inplace=True)

# Plot cumulative time spent in space over years
plot_cumulative_time_in_space(eva_data, graph_file)

print("--END--")
```

</details>

#### Code update 2: Adding a main function

<details>

```python=
import matplotlib.pyplot as plt
import pandas as pd


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.
    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.
    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6  # there is an intentional bug on this line (should divide by 60 not 6)
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


if __name__ == "__main__":

    input_file = './eva-data.json'
    output_file = './eva-data.csv'
    graph_file = './cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```

</details>

#### Code update 3: Adding CLI with sys.argv

<details>

```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.
    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6  # there is an intentional bug on this line (should divide by 60 not 6)
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print(f'Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = 'results/cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```

</details>

</details>

### Start of Day 2

Where the code should be at the start of day 2:

<details>

```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.
    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6  # there is an intentional bug on this line (should divide by 60 not 6)
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print(f'Using default file paths and filenames: {input_file}, {output_file}')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]

    graph_file = 'results/cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```

</details>

### Lesson 6: New code from collaborator

<details>

This is to be added to eva_data_analysis.py

```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re  # added this line


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Calculate and add crew size to data
    eva_data = add_crew_size_column(eva_data)  # added this line

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")
```

and also add the following:

```python=
def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy
```

</details>

### Code state: full code with new collaborator functions

<details>

```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Add crew size variable to dataset
    eva_data = add_crew_size_column(eva_data)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.
    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.
    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print(f'Using default file paths and filenames: {input_file}, {output_file}')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]

    graph_file = 'results/cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```

</details>

### Code state: end of lesson 6

<details>

Code for eva_data_analysis.py

```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Calculate and add crew size to data
    eva_data = add_crew_size_column(eva_data)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.
    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.
    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print('Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = 'results/cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```

Test script test_eva_analysis.py:

```python=
from eva_data_analysis import text_to_duration, calculate_crew_size
import pytest


def test_text_to_duration_integer():
    """
    Test that text_to_duration returns expected values for
    typical durations with a zero minutes component
    """
    assert text_to_duration("10:00") == 10


def test_text_to_duration_float():
    """
    Test that text_to_duration returns expected ground truth values for
    typical durations with non-zero minutes components
    """
    assert text_to_duration("10:20") == pytest.approx(10.333333)


@pytest.mark.parametrize("input_value, expected_result", [
    ("Valentina Tereshkova;", 1),
    ("Judith Smith; Sally Rider;", 2),
    ("", None)
])
def test_calculate_crew_size(input_value, expected_result):
    """
    Test that calculate_crew_size returns expected size for typical crew values
    """
    actual_result = calculate_crew_size(input_value)
    assert actual_result == expected_result
```

</details>

### Lesson 8: Adding new function

Add this new function to your code:

```python=
def summary_duration_by_astronaut(df):
    """
    Summarise the duration data by each astronaut

    Args:
        df (pd.DataFrame): Input dataframe to be summarised

    Returns:
        subset (pd.DataFrame): Data frame with a row for each astronaut and a summarised column
    """
    print('Calculating summary of total EVA time by astronaut')
    subset = df.loc[:,['crew', 'duration']] # subset to work with only relevant columns
    subset = add_duration_hours(subset) # need duration_hours for easier calcs
    subset = subset.drop('duration', axis=1) # dropping the extra 'duration' column as it contains string values not suitable for calculations
    # as_index=False keeps 'crew' as a column, so it survives to_csv(index=False)
    subset = subset.groupby('crew', as_index=False).sum()
    return subset
```

**Full code after adding new function:**

<details>

```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re


def main(input_file, output_file, duration_by_astronaut_output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Calculate summary table for total EVA per astronaut
    duration_by_astronaut_df = summary_duration_by_astronaut(eva_data)

    # Save summary duration data by each astronaut to CSV file
    write_dataframe_to_csv(duration_by_astronaut_df, duration_by_astronaut_output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.
    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.
    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.
    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy


def summary_duration_by_astronaut(df):
    """
    Summarise the duration data by each astronaut

    Args:
        df (pd.DataFrame): Input dataframe to be summarised

    Returns:
        subset (pd.DataFrame): Data frame with a row for each astronaut and a summarised column
    """
    print('Calculating summary of total EVA time by astronaut')
    subset = df.loc[:,['crew', 'duration']] # subset to work with only relevant columns
    subset = add_duration_hours(subset) # need duration_hours for easier calcs
    subset = subset.drop('duration', axis=1) # dropping the extra 'duration' column as it contains string values not suitable for calculations
    subset = subset.groupby('crew', as_index=False).sum()
    return subset


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print('Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = 'results/cumulative_eva_graph.png'
    duration_by_astronaut_output_file = 'results/duration_by_astronaut.csv'
    main(input_file, output_file, duration_by_astronaut_output_file, graph_file)
```

</details>
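
**Possible test for the new function (sketch):**

One way to check `summary_duration_by_astronaut` is a small pytest-style test, in the same spirit as `test_eva_analysis.py`. The sketch below is an assumption about how such a test could look, not part of the lesson material: the input rows are invented for illustration, and the three functions are repeated (without the `print` call) only so the block runs on its own. In your project you would import them from `eva_data_analysis` instead. Note that the function groups on the whole `crew` string, so an entry such as `"Judith Smith; Sally Rider;"` is summed as one group rather than per person.

```python
import pandas as pd
import pandas.testing as pdt


def text_to_duration(duration):
    """Convert an "HH:MM" string to hours (same logic as in eva_data_analysis.py)."""
    hours, minutes = duration.split(":")
    return int(hours) + int(minutes) / 60


def add_duration_hours(df):
    """Return a copy of df with a duration_hours column added."""
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(text_to_duration)
    return df_copy


def summary_duration_by_astronaut(df):
    """Sum duration_hours for each distinct 'crew' entry (as_index=False variant)."""
    subset = df.loc[:, ['crew', 'duration']]
    subset = add_duration_hours(subset)
    subset = subset.drop('duration', axis=1)
    return subset.groupby('crew', as_index=False).sum()


def test_summary_duration_by_astronaut():
    """Check that durations are summed per distinct crew entry."""
    # Hypothetical rows made up for this test (not taken from eva-data.json)
    input_df = pd.DataFrame({
        'crew': ['Valentina Tereshkova;',
                 'Judith Smith; Sally Rider;',
                 'Valentina Tereshkova;'],
        'duration': ['1:30', '3:15', '0:30'],
    })
    # groupby sorts the group keys, so 'Judith ...' comes before 'Valentina ...'
    expected_df = pd.DataFrame({
        'crew': ['Judith Smith; Sally Rider;', 'Valentina Tereshkova;'],
        'duration_hours': [3.25, 2.0],
    })
    actual_df = summary_duration_by_astronaut(input_df)
    # assert_frame_equal compares column names, dtypes, index and values
    pdt.assert_frame_equal(actual_df, expected_df)
```

Saved in a file such as `test_eva_analysis.py`, this runs with `python -m pytest`. Comparing whole data frames with `pandas.testing.assert_frame_equal` also catches a missing `crew` column, which is what happens if `as_index=False` is left out and the index is later dropped when writing the CSV.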
