Better Research Software - 2026-02-24-NCL

# Building Better Research Software - 2026-02-24-NCL

## This document: bit.ly/2026-02-24-NCL

### https://hackmd.io/@rseteam/2026-02-24-NCL/edit

(_**no need** to log in to Hackmd.io, edit mode: CTRL+ALT+B_)

[toc]

--------

### Links:

- [The UK Reproducibility Network (UKRN)](https://www.ukrn.org/) funded the [Software Sustainability Institute](https://software.ac.uk) to develop this training workshop and provided our catering today
- [Carpentries](https://carpentries.org)
- [RSE Team](https://rse.ncldata.dev)
- [Code of Conduct](https://docs.carpentries.org/policies/coc/)
- [Workshop website](https://nclrse-training.github.io/2026-02-24-NCL/)
- [Link to lesson website](https://carpentries-incubator.github.io/better-research-software/)
- [Software Used](https://carpentries-incubator.github.io/better-research-software/#software-setup) is pre-installed on our cluster PCs
- [Check your GitHub Account](https://github.com/) (SSH key on your H: drive or laptop local drive)
- [Link to lesson data and code](https://github.com/carpentries-incubator/better-research-software/raw/refs/heads/main/learners/spacewalks.zip)
- [Pre-workshop survey](https://forms.office.com/e/Qu8auyhQzS)
- [Post-workshop survey](https://forms.gle/NdtugmSvM9PS2k6T8)
- [Code Community](https://teams.microsoft.com/l/team/19%3aG79Rz7Mhk6rC0mhia04YCD-nj7WabLMxhnyb1YLp04A1%40thread.tacv2/conversations?groupId=7059214c-2200-4ad6-a739-9d350c74c7a9&tenantId=9c5012c9-b616-44c2-a917-66814fbe3e87)

### Attendance:

You will be asked to show the QR code we sent to you via email

## Your instructors

[Email us your questions](mailto:training.researchcomputing@newcastle.ac.uk) to training.researchcomputing@newcastle.ac.uk

Carol Booth (carol.booth2@newcastle.ac.uk)

Dr. Frances Turner (<frances.hutchings@newcastle.ac.uk>)

Dr. Robin Wardle (<robin.wardle@newcastle.ac.uk>)

### GitHub Account:

Paste your [<span style="color:red">GitHub login name</span>](https://github.com) on a new line

Please enter your GitHub username below:

## Useful information:

_Everyone is welcome to add tips, tricks and comments to this document_ by editing text in the left pane of this document in edit mode (CTRL+ALT+B)

### Git Default Branch

You can change your default branch name from `master` to `main`:

```
git config --global init.defaultBranch main
```

If you've already created a `master` branch, you can rename it:

```
git branch -m master main
```

### **Day 2: Software Management and Collaboration - where to get a DOI?**

This workshop looks at Zenodo https://zenodo.org/ for DOIs.

#### We use https://sandbox.zenodo.org/ to practice

You may have heard of obtaining DOIs using https://data.ncl.ac.uk/, which is Newcastle University's institutional Figshare instance. Figshare was originally aimed at image and document sharing, but you can also use it like Zenodo with GitHub: https://info.figshare.com/user-guide/how-to-connect-figshare-with-your-github-account/

The research data management team provide guidance at: https://www.ncl.ac.uk/library/academics-and-researchers/lrs/rdm/working/organise/

### Renaming Files

- you can use the one-liner `git mv <old_file.py> <new_file.py>` instead of the sequence `mv <old_file.py> <new_file.py>; git rm <old_file.py>; git add <new_file.py>` (the removal of the old name has to be staged as well as the new file)

### Renaming Variables

- [the dangers of heedless global search and replace](https://selinker.livejournal.com/32929.html)

### Private email addresses preventing pushing to GitHub

["Your push would publish a private email address"](https://stackoverflow.com/questions/43863522/error-your-push-would-publish-a-private-email-address) - this happens when you have ["Block command line pushes that expose my email"](https://github.com/settings/emails) set in GitHub. A solution is given in the linked Stack Overflow article.
For a local repository:

```
git config user.email "{ID}+{username}@users.noreply.github.com"
```

For all repositories on the computer:

```
git config --global user.email "{ID}+{username}@users.noreply.github.com"
```

### Why make a copy of a dataframe inside a function?

https://stackoverflow.com/questions/46533197/understanding-variable-scope-and-changes-in-python#:~:text=I'm%20using%20Python%203.6%20and,dataframe%20to%20the%20original%20columns.

This Stack Overflow post covers the question of variable scope. We expect variables to be isolated inside functions, but the lines can be blurred because the variable is actually a reference (a pointer) to your dataframe, so changes made through it inside the function are also seen by the caller.

### pandas read_json input

The pandas `read_json` function accepts either a string (a path or URL) or a file-like object (such as an open file) as its input, which can seem strange at first. This is why the docstrings in the code below describe `input_file` as `file or str`.

### How does Pytest know which test script to run?

By default Pytest looks in the directory it is run from, and in its subdirectories, for files named `test_*.py` or `*_test.py`, and inside those files it runs the functions whose names start with `test`. Keeping the test scripts in `./test` is a convention that separates them from the main code; the `test_` prefix on the file name is what Pytest actually relies on.

### mkdocs with main script in a subdirectory

In the reference file use `:::mydir.myfile` rather than `:::mydir/myfile` or `:::myfile`

### Wrap Up: What are we trying to do?

The Research Software Development Principles are explained by the Software Sustainability Institute’s Director, Neil Chue Hong, in his [keynote at RSECon23](https://rsecon24.society-rse.org/about/research-software-development-principles/#neil-chue-hong-rse23-keynote); the section from approximately minute 21 to minute 29 covers the principles.

----------

# Code State

## Code state: Start of day 2

The below code is what your `eva_data_analysis.py` file should look like at the start of day 2.

<details>

You can copy the below code to your eva_data_analysis.py file, overwriting the previous contents:

```python=
import matplotlib.pyplot as plt
import pandas as pd


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df['duration_hours'] = df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


# Main code

print("--START--")

input_file = open('./eva-data.json', 'r', encoding='ascii')
output_file = open('./eva-data.csv', 'w', encoding='utf-8')
graph_file = './cumulative_eva_graph.png'

# Read the data from JSON file
eva_data = read_json_to_dataframe(input_file)

# Convert and export data to CSV file
write_dataframe_to_csv(eva_data, output_file)

# Sort dataframe by date ready to be plotted (date values are on x-axis)
eva_data.sort_values('date', inplace=True)

# Plot cumulative time spent in space over years
plot_cumulative_time_in_space(eva_data, graph_file)

print("--END--")
```

For the wider code state, at this point, the code in your local software project’s directory should be as in: https://github.com/carpentries-incubator/bbrs-software-project/tree/05-code-structure

</details>

-------------

### Code state: lesson 5, new functions

<details>

You can copy the below code to your eva_data_analysis.py file, overwriting the previous contents:

```python=
import matplotlib.pyplot as plt
import pandas as pd


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6  # there is an intentional bug on this line (should divide by 60 not 6)
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


# Main code

print("--START--")

input_file = open('./eva-data.json', 'r', encoding='ascii')
output_file = open('./eva-data.csv', 'w', encoding='utf-8')
graph_file = './cumulative_eva_graph.png'

# Read the data from JSON file
eva_data = read_json_to_dataframe(input_file)

# Convert and export data to CSV file
write_dataframe_to_csv(eva_data, output_file)

# Sort dataframe by date ready to be plotted (date values are on x-axis)
eva_data.sort_values('date', inplace=True)

# Plot cumulative time spent in space over years
plot_cumulative_time_in_space(eva_data, graph_file)

print("--END--")
```

</details>

### Code state: lesson 5 move to 'main'

<details>

```python=
import matplotlib.pyplot as plt
import pandas as pd


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6  # there is an intentional bug on this line (should divide by 60 not 6)
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


if __name__ == "__main__":

    input_file = './eva-data.json'
    output_file = './eva-data.csv'
    graph_file = './cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```

</details>

### Lesson 6: New code from collaborator

<details>

This is to be added to eva_data_analysis.py

```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re  # added this line


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Calculate and add crew size to data
    eva_data = add_crew_size_column(eva_data)  # added this line

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")
```

and also add the following:

```python=
def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy
```

</details>

### Code state: end of lesson 6

<details>

Code for eva_data_analysis.py

```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Calculate and add crew size to data
    eva_data = add_crew_size_column(eva_data)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print(f'Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = 'results/cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```

Test script test_eva_analysis.py:

```python=
from eva_data_analysis import text_to_duration, calculate_crew_size
import pytest


def test_text_to_duration_integer():
    """
    Test that text_to_duration returns expected values for typical
    durations with a zero minutes component
    """
    assert text_to_duration("10:00") == 10


def test_text_to_duration_float():
    """
    Test that text_to_duration returns expected ground truth values for
    typical durations with non-zero minutes components
    """
    assert text_to_duration("10:20") == pytest.approx(10.333333)


@pytest.mark.parametrize("input_value, expected_result", [
    ("Valentina Tereshkova;", 1),
    ("Judith Smith; Sally Rider;", 2),
    ("", None)
])
def test_calculate_crew_size(input_value, expected_result):
    """
    Test that calculate_crew_size returns expected size for typical crew values
    """
    actual_result = calculate_crew_size(input_value)
    assert actual_result == expected_result
```

</details>

## Lesson 8: Adding new function

Add this new function to your code:

```python=
def summary_duration_by_astronaut(df):
    """
    Summarise the duration data by each astronaut and saves resulting table to a CSV file

    Args:
        df (pd.DataFrame): Input dataframe to be summarised

    Returns:
        sum_by_astro (pd.DataFrame): Data frame with a row for each astronaut and a summarised column
    """
    print(f'Calculating summary of total EVA time by astronaut')
    subset = df.loc[:,['crew', 'duration']]  # subset to work with only relevant columns
    subset = add_duration_hours(subset)  # need duration_hours for easier calcs
    subset = subset.drop('duration', axis=1)  # dropping the extra 'duration' column as it contains string values not suitable for calculations
    subset = subset.groupby('crew').sum()
    return subset
```

**Full code after adding new function:**

<details>

```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re


def main(input_file, output_file, duration_by_astronaut_output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Calculate summary table for total EVA per astronaut
    duration_by_astronaut_df = summary_duration_by_astronaut(eva_data)

    # Save summary duration data by each astronaut to CSV file
    write_dataframe_to_csv(duration_by_astronaut_df, duration_by_astronaut_output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy


def summary_duration_by_astronaut(df):
    """
    Summarise the duration data by each astronaut and saves resulting table to a CSV file

    Args:
        df (pd.DataFrame): Input dataframe to be summarised

    Returns:
        sum_by_astro (pd.DataFrame): Data frame with a row for each astronaut and a summarised column
    """
    print(f'Calculating summary of total EVA time by astronaut')
    subset = df.loc[:,['crew', 'duration']]  # subset to work with only relevant columns
    subset = add_duration_hours(subset)  # need duration_hours for easier calcs
    subset = subset.drop('duration', axis=1)  # dropping the extra 'duration' column as it contains string values not suitable for calculations
    subset = subset.groupby('crew').sum()
    return subset


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print(f'Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = 'results/cumulative_eva_graph.png'
    duration_by_astronaut_output_file = 'results/duration_by_astronaut.csv'
    main(input_file, output_file, duration_by_astronaut_output_file, graph_file)
```

</details>
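
### Example: copying a dataframe inside a function

The note earlier in this document asks why `add_duration_hours` and `add_crew_size_column` work on `df.copy()`. The sketch below is one way to see the difference; the two helper functions and the `flag` column are hypothetical names made up for this illustration and are not part of the lesson code.

```python
import pandas as pd


def add_flag_in_place(df):
    """Hypothetical helper: adds a column to the dataframe it was given."""
    df["flag"] = True  # df is the caller's object, so the caller sees this change
    return df


def add_flag_to_copy(df):
    """Hypothetical helper: makes the same change on a copy, as add_duration_hours does."""
    df_copy = df.copy()
    df_copy["flag"] = True  # only the copy gains the column
    return df_copy


original = pd.DataFrame({"duration": ["1:00", "2:30"]})

# Working on a copy leaves the caller's dataframe untouched
result = add_flag_to_copy(original)
columns_after_copy = list(original.columns)

# Working in place changes the caller's dataframe as a side effect
add_flag_in_place(original)
columns_after_in_place = list(original.columns)

print(columns_after_copy)
print(columns_after_in_place)
```

The first list printed contains only `duration`, while the second also contains `flag`, even though the script never assigned the return value of `add_flag_in_place` to anything. Copying inside the function avoids that kind of hidden side effect, at the cost of some extra memory.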

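
### Example: how the crew size calculation counts names

The `calculate_crew_size` function in the code states above returns `len(re.split(r';', crew))-1`. The sketch below walks through why one is subtracted; `count_crew` is a hypothetical re-statement of that function written only for this illustration, and the crew strings are the ones used in the test script.

```python
import re


def count_crew(crew):
    """Hypothetical re-statement of calculate_crew_size, for illustration only."""
    if crew.split() == []:
        return None  # empty or whitespace-only entry: crew size unknown
    # Each name is followed by ';', so splitting leaves one empty piece at the end
    return len(re.split(r';', crew)) - 1


# Splitting on ';' gives one piece per name plus a trailing empty string
pieces = re.split(r';', "Judith Smith; Sally Rider;")
print(pieces)

print(count_crew("Valentina Tereshkova;"))
print(count_crew("Judith Smith; Sally Rider;"))
print(count_crew(""))

# An entry without the trailing ';' has no empty piece, so it is undercounted
print(count_crew("Judith Smith; Sally Rider"))
```

The calculation therefore assumes every name in the `crew` column is terminated by a semicolon. An entry that lacks the final semicolon would be counted as one crew member short, which is the kind of edge case worth adding as a test.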