# Building Better Research Software - 2026-02-24-NCL
## This document: bit.ly/2026-02-24-NCL
### https://hackmd.io/@rseteam/2026-02-24-NCL/edit
(_**no need** to log in to Hackmd.io, edit mode: CTRL+ALT+B_)
[toc]
--------
### Links:
- [The UK Reproducibility Network (UKRN)](https://www.ukrn.org/) funded the [Software Sustainability Institute](https://software.ac.uk) to develop this training workshop and provided our catering today
- [Carpentries](https://carpentries.org)
- [RSE Team](https://rse.ncldata.dev)
- [Code of Conduct](https://docs.carpentries.org/policies/coc/)
- [Workshop website](https://nclrse-training.github.io/2026-02-24-NCL/)
- [Link to lesson website](https://carpentries-incubator.github.io/better-research-software/)
- [Software Used](https://carpentries-incubator.github.io/better-research-software/#software-setup) is pre-installed on our cluster PCs
- [Check your GitHub Account](https://github.com/) (SSH key on your H: drive or laptop local drive)
- [Link to lesson data and code](https://github.com/carpentries-incubator/better-research-software/raw/refs/heads/main/learners/spacewalks.zip)
- [Pre-workshop survey](https://forms.office.com/e/Qu8auyhQzS)
- [Post-workshop survey](https://forms.gle/NdtugmSvM9PS2k6T8)
- [Code Community](https://teams.microsoft.com/l/team/19%3aG79Rz7Mhk6rC0mhia04YCD-nj7WabLMxhnyb1YLp04A1%40thread.tacv2/conversations?groupId=7059214c-2200-4ad6-a739-9d350c74c7a9&tenantId=9c5012c9-b616-44c2-a917-66814fbe3e87)
### Attendance:
You will be asked to show the QR code we sent to you via email
## Your instructors
[Email us your questions](mailto:training.researchcomputing@newcastle.ac.uk) to training.researchcomputing@newcastle.ac.uk
Carol Booth (carol.booth2@newcastle.ac.uk)
Dr. Frances Turner (<frances.hutchings@newcastle.ac.uk>)
Dr. Robin Wardle (<robin.wardle@newcastle.ac.uk>)
### GitHub Account: Paste your [<span style="color:red">GitHub login name</span>](https://github.com) on a new line
Please enter your github username below:
## Useful information:
_Everyone is welcome to add tips, tricks and comments to this document_ by editing text in the left pane of this
document in edit mode (CTRL+ALT+B)
### Git Default Branch
You can change your default branch name from `master` to `main`:
```
git config --global init.defaultBranch main
```
If you've already created a `master` branch, you can rename it:
```
git branch -m master main
```
### **Day 2: Software Management and Collaboration - where to get a DOI?**
This workshop looks at Zenodo https://zenodo.org/ for DOIs.
#### We use https://sandbox.zenodo.org/ to practice
You may have heard of obtaining DOIs using https://data.ncl.ac.uk/, which is Newcastle University's institutional Figshare instance. Figshare was originally aimed at image and document sharing, but you can also use it like Zenodo with GitHub: https://info.figshare.com/user-guide/how-to-connect-figshare-with-your-github-account/
The research data management team provide guidance at: https://www.ncl.ac.uk/library/academics-and-researchers/lrs/rdm/working/organise/
### Renaming Files
- You can use the one-liner `git mv <old_file.py> <new_file.py>` instead of the sequence `mv <old_file.py> <new_file.py>; git add <old_file.py> <new_file.py>` (staging the old path records its removal).
### Renaming Variables
- [the dangers of heedless global search and replace](https://selinker.livejournal.com/32929.html)
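
As a small illustration of why a blind search and replace is risky, the sketch below renames a variable in a made-up snippet of source text. The snippet and the names `df`, `pdf` and `eva_df` are invented for this example; the point is only that a plain string replacement also rewrites longer names that contain the old one, while a word-boundary pattern does not.

```python
import re

# A made-up line of source code held as text (names are illustrative only)
source = "df = load(pdf)\nprint(df.head())"

# Naive replacement also rewrites the "df" inside "pdf"
naive = source.replace("df", "eva_df")

# \b matches only at word boundaries, so "pdf" is left alone
careful = re.sub(r"\bdf\b", "eva_df", source)

print(naive)
print(careful)
```

Most editors offer a "whole word" option or a language-aware rename that behaves like the second version.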
### Private email addresses preventing pushing to GitHub
["Your push would publish a private email address"](https://stackoverflow.com/questions/43863522/error-your-push-would-publish-a-private-email-address) - this happens when you have ["Block command line pushes that expose my email"](https://github.com/settings/emails) set in GitHub. A solution is given in the linked Stack Overflow article.
For a local repository:
```
git config user.email "{ID}+{username}@users.noreply.github.com"
```
For all repositories on the computer:
```
git config --global user.email "{ID}+{username}@users.noreply.github.com"
```
### Why make a copy of a dataframe inside a function?
https://stackoverflow.com/questions/46533197/understanding-variable-scope-and-changes-in-python#:~:text=I'm%20using%20Python%203.6%20and,dataframe%20to%20the%20original%20columns.
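This Stack Overflow post covers the question of variable scope. We expect variables to be isolated inside functions, but the lines can be blurred: the function's parameter is a reference to the caller's dataframe, so changes made to it in place inside the function are visible outside the function too. Working on a copy avoids this.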
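
A minimal sketch of the difference, using a tiny made-up dataframe (the function names and the `value` and `doubled` columns are invented for this example and are not part of the lesson code):

```python
import pandas as pd


def add_column_in_place(df):
    # df refers to the same object as the caller's dataframe,
    # so this assignment changes the caller's data too
    df["doubled"] = df["value"] * 2
    return df


def add_column_to_copy(df):
    # Work on a copy so the caller's dataframe is left untouched
    df_copy = df.copy()
    df_copy["doubled"] = df_copy["value"] * 2
    return df_copy


original = pd.DataFrame({"value": [1, 2, 3]})

result = add_column_to_copy(original)
copy_left_original_alone = "doubled" not in original.columns

add_column_in_place(original)
in_place_changed_original = "doubled" in original.columns

print(copy_left_original_alone, in_place_changed_original)
```

This is the same pattern that `add_duration_hours` follows in the lesson code: it copies the dataframe first and returns the copy.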
### pandas read_json input
The pandas `read_json` function accepts either a path (as a string) or a file-like object (a file buffer) as its input, which can be surprising at first.
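
A short sketch of the two input styles, using a small made-up JSON string rather than the lesson data. The file name `example.json` and the variable names are assumptions for this illustration. Note that recent pandas versions discourage passing the JSON text itself as a plain string, which is why the text is wrapped in `io.StringIO` here.

```python
import io
import os
import tempfile

import pandas as pd

# Made-up records for illustration only
json_text = '[{"eva": 1, "duration": "0:23"}, {"eva": 2, "duration": "2:10"}]'

# 1. Read from a file-like object (here an in-memory text buffer)
df_from_buffer = pd.read_json(io.StringIO(json_text))

# 2. Read from a path on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "example.json")
    with open(json_path, "w", encoding="ascii") as f:
        f.write(json_text)
    df_from_path = pd.read_json(json_path)

print(df_from_buffer.shape, df_from_path.shape)
```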
### how does Pytest know which test script to run?
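By default, pytest collects files named `test_*.py` or `*_test.py`, searching the current directory and its subdirectories (or whichever directories you pass on the command line). Inside those files it runs the functions whose names start with `test`. Keeping the test scripts in `./test`, as we do here, is a convention rather than a requirement.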
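
The sketch below mimics the file name matching with Python's `fnmatch` module. It is only an illustration of the two default patterns, not pytest's actual collection code, and the helper `looks_like_test_file` and the candidate file names are made up for this example.

```python
from fnmatch import fnmatch

# The two default file name patterns described above
DEFAULT_PATTERNS = ["test_*.py", "*_test.py"]


def looks_like_test_file(filename):
    """Return True if filename matches one of the default patterns."""
    return any(fnmatch(filename, pattern) for pattern in DEFAULT_PATTERNS)


candidates = ["test_eva_analysis.py", "eva_data_analysis.py", "analysis_test.py"]
collected = [name for name in candidates if looks_like_test_file(name)]

print(collected)
```

Running `pytest --collect-only` in your project directory lists what pytest itself would pick up.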
### mkdocs with main script in a subdirectory
In the reference file use `:::mydir.myfile` rather than `:::mydir/myfile` or `:::myfile`
### Wrap Up: What are we trying to do?
The research software development principles were explained by the Software Sustainability Institute’s Director, Neil Chue Hong, during his [keynote at RSECon23](https://rsecon24.society-rse.org/about/research-software-development-principles/#neil-chue-hong-rse23-keynote). The section from approximately minute 21 to minute 29 covers the Research Software Development Principles.
----------
# Code State
## Code state: Start of day 2
The below code is what your `eva_data_analysis.py` file should look like at the start of day 2.
<details>
You can copy the below code to your eva_data_analysis.py file, overwriting the previous contents:
```python=
import matplotlib.pyplot as plt
import pandas as pd


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df['duration_hours'] = df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


# Main code

print("--START--")

input_file = open('./eva-data.json', 'r', encoding='ascii')
output_file = open('./eva-data.csv', 'w', encoding='utf-8')
graph_file = './cumulative_eva_graph.png'

# Read the data from JSON file
eva_data = read_json_to_dataframe(input_file)

# Convert and export data to CSV file
write_dataframe_to_csv(eva_data, output_file)

# Sort dataframe by date ready to be plotted (date values are on x-axis)
eva_data.sort_values('date', inplace=True)

# Plot cumulative time spent in space over years
plot_cumulative_time_in_space(eva_data, graph_file)

print("--END--")
```
For the wider code state, at this point, the code in your local software project’s directory should be as in: https://github.com/carpentries-incubator/bbrs-software-project/tree/05-code-structure
</details>
-------------
### Code state: lesson 5, new functions
<details>
You can copy the below code to your eva_data_analysis.py file, overwriting the previous contents:
```python=
import matplotlib.pyplot as plt
import pandas as pd


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6  # there is an intentional bug on this line (should divide by 60 not 6)
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


# Main code

print("--START--")

input_file = open('./eva-data.json', 'r', encoding='ascii')
output_file = open('./eva-data.csv', 'w', encoding='utf-8')
graph_file = './cumulative_eva_graph.png'

# Read the data from JSON file
eva_data = read_json_to_dataframe(input_file)

# Convert and export data to CSV file
write_dataframe_to_csv(eva_data, output_file)

# Sort dataframe by date ready to be plotted (date values are on x-axis)
eva_data.sort_values('date', inplace=True)

# Plot cumulative time spent in space over years
plot_cumulative_time_in_space(eva_data, graph_file)

print("--END--")
```
</details>
### Code state: lesson 5 move to 'main'
<details>
```python=
import matplotlib.pyplot as plt
import pandas as pd


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6  # there is an intentional bug on this line (should divide by 60 not 6)
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


if __name__ == "__main__":
    input_file = './eva-data.json'
    output_file = './eva-data.csv'
    graph_file = './cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```
</details>
### Lesson 6: New code from collaborator
<details>
This is to be added to eva_data_analysis.py
```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re  # added this line


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Calculate and add crew size to data
    eva_data = add_crew_size_column(eva_data)  # added this line

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")
```
and also add the following:
```python=
def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy
```
</details>
### Code state: end of lesson 6
<details>
Code for eva_data_analysis.py
```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Calculate and add crew size to data
    eva_data = add_crew_size_column(eva_data)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print(f'Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = 'results/cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```
Test script test_eva_analysis.py:
```python=
from eva_data_analysis import text_to_duration, calculate_crew_size
import pytest


def test_text_to_duration_integer():
    """
    Test that text_to_duration returns expected values for typical
    durations with a zero minutes component
    """
    assert text_to_duration("10:00") == 10


def test_text_to_duration_float():
    """
    Test that text_to_duration returns expected ground truth values for
    typical durations with non-zero minutes components
    """
    assert text_to_duration("10:20") == pytest.approx(10.333333)


@pytest.mark.parametrize("input_value, expected_result", [
    ("Valentina Tereshkova;", 1),
    ("Judith Smith; Sally Rider;", 2),
    ("", None)
])
def test_calculate_crew_size(input_value, expected_result):
    """
    Test that calculate_crew_size returns expected size for
    typical crew values
    """
    actual_result = calculate_crew_size(input_value)
    assert actual_result == expected_result
```
</details>
## Lesson 8: Adding new function
Add this new function to your code:
```python=
def summary_duration_by_astronaut(df):
    """
    Summarise the duration data by each astronaut

    Args:
        df (pd.DataFrame): Input dataframe to be summarised

    Returns:
        subset (pd.DataFrame): Data frame with a row for each astronaut and a summarised column
    """
    print(f'Calculating summary of total EVA time by astronaut')
    subset = df.loc[:,['crew', 'duration']]  # subset to work with only relevant columns
    subset = add_duration_hours(subset)  # need duration_hours for easier calcs
    subset = subset.drop('duration', axis=1)  # dropping the extra 'duration' column as it contains string values not suitable for calculations
    subset = subset.groupby('crew').sum()
    return subset
```
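
One thing worth checking when you save this summary: after `groupby('crew').sum()` the `crew` values become the index of the result, and `write_dataframe_to_csv` writes with `index=False`. The sketch below uses a tiny made-up dataframe (the crew labels and hours are invented) to show what that combination does, and how `reset_index()` turns the index back into an ordinary column. It is an illustration to discuss in code review, not a change the lesson prescribes.

```python
import io

import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "crew": ["A;", "B;", "A;"],
    "duration_hours": [1.5, 2.0, 0.5],
})

summary = df.groupby("crew").sum()  # 'crew' is now the index, not a column

# Writing with index=False drops the index, and with it the crew names
buffer_without_index = io.StringIO()
summary.to_csv(buffer_without_index, index=False)

# reset_index() moves 'crew' back into a regular column before writing
buffer_with_column = io.StringIO()
summary.reset_index().to_csv(buffer_with_column, index=False)

print(buffer_without_index.getvalue())
print(buffer_with_column.getvalue())
```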
**Full code after adding new function:**
<details>
```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re


def main(input_file, output_file, duration_by_astronaut_output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Calculate summary table for total EVA per astronaut
    duration_by_astronaut_df = summary_duration_by_astronaut(eva_data)

    # Save summary duration data by each astronaut to CSV file
    write_dataframe_to_csv(duration_by_astronaut_df, duration_by_astronaut_output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy


def summary_duration_by_astronaut(df):
    """
    Summarise the duration data by each astronaut

    Args:
        df (pd.DataFrame): Input dataframe to be summarised

    Returns:
        subset (pd.DataFrame): Data frame with a row for each astronaut and a summarised column
    """
    print(f'Calculating summary of total EVA time by astronaut')
    subset = df.loc[:,['crew', 'duration']]  # subset to work with only relevant columns
    subset = add_duration_hours(subset)  # need duration_hours for easier calcs
    subset = subset.drop('duration', axis=1)  # dropping the extra 'duration' column as it contains string values not suitable for calculations
    subset = subset.groupby('crew').sum()
    return subset


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print(f'Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = 'results/cumulative_eva_graph.png'
    duration_by_astronaut_output_file = 'results/duration_by_astronaut.csv'
    main(input_file, output_file, duration_by_astronaut_output_file, graph_file)
```
</details>