# Building Better Research Software - 2026-03-09-NCL
### This document: bit.ly/2026-03-09-NCL
https://hackmd.io/@rseteam/2026-03-09-NCL/edit
(_**no need** to log in to Hackmd.io, edit mode: CTRL+ALT+B_)
## Online Meeting: https://teams.microsoft.com/meet/31298350057381?p=qf6hykS6p6yQVFwe2D
# Day 2 Schedule
The schedule on the website for Day 2 is not correct. We will have breaks at the same times as yesterday (may vary by a few minutes):
Morning Break: 10:30 - 10:45
LUNCH: 12:30 - 13:30
Afternoon Break: 15:00 - 15:15
## Links:
- [The UK Reproducibility Network (UKRN)](https://www.ukrn.org/) funded the [Software Sustainability Institute](https://software.ac.uk) to develop this Carpentries-style training workshop
- [The Carpentries](https://carpentries.org)
- [Newcastle University RSE Team](https://rse.ncldata.dev)
- [Code of Conduct](https://docs.carpentries.org/policies/coc/)
- [Workshop website](https://nclrse-training.github.io/2026-03-09-NCL/)
- [Link to lesson website](https://carpentries-incubator.github.io/better-research-software/)
- [Software Used](https://carpentries-incubator.github.io/better-research-software/#software-setup)
- [Check your GitHub Account](https://github.com/) (SSH key on your local machine)
- [Link to lesson data and code](https://github.com/carpentries-incubator/better-research-software/raw/refs/heads/main/learners/spacewalks.zip)
- [Pre-workshop survey](https://forms.office.com/e/mhTCTTGtb6)
- [Post-workshop survey](https://forms.office.com/e/x8SrtmPRkB)
- [Code Community](https://teams.microsoft.com/l/team/19%3aG79Rz7Mhk6rC0mhia04YCD-nj7WabLMxhnyb1YLp04A1%40thread.tacv2/conversations?groupId=7059214c-2200-4ad6-a739-9d350c74c7a9&tenantId=9c5012c9-b616-44c2-a917-66814fbe3e87)
## Your instructors
[Email us your questions](mailto:training.researchcomputing@newcastle.ac.uk) at training.researchcomputing@newcastle.ac.uk
Carol Booth (carol.booth2@newcastle.ac.uk)
Dr. Frances Turner (frances.hutchings@newcastle.ac.uk)
Dr. Jannetta Steyn (jannetta.steyn@newcastle.ac.uk)
## GitHub Account: Paste your [<span style="color:red">GitHub login name</span>](https://github.com) on a new line
*Please enter your GitHub username below:*
carolbooth2
Fehings
jsteyn
Kat-01-del
OliverWillis-CPI
minzhang904
UniCandice
## Useful information:
_Everyone is welcome to add tips, tricks and comments to this document_ by editing text in the left pane of this document in edit mode (CTRL+ALT+B)
- The Conda equivalent of `pip freeze > requirements.txt` is `conda list --explicit > spec-file.txt` (as recommended by https://docs.conda.io/)
- https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-template-repository
  You can make an existing repository a template, so you and others can generate new repositories with the same directory structure, branches, and files. A template can therefore include things like generic tests and some CI/CD, such as pre-commit hooks that check the requirements/imports for forbidden packages on each git commit.
- https://github.com/the-turing-way/reproducible-project-template: an example public project template from the Turing Way
- [NewcastleRSE/Standard-Project](https://github.com/NewcastleRSE/Standard-Project): a template repo for the standard RSE project
- If Python still uses system-level packages instead of the versions installed into the virtual environment, try the fix that worked for previous participants: unset the `PYTHONHOME`, `PYTHONPATH` and `PYTHONSTARTUP` environment variables before creating the virtual environment for the project (in Bash):
  - `unset PYTHONHOME PYTHONPATH PYTHONSTARTUP`
  - `python -m venv ./venv_spacewalks`
- Right-click and 'rename symbol' can help change a variable name throughout the whole script; in PyCharm, use 'refactor'.
- generate citation files: https://citation-file-format.github.io/cff-initializer-javascript/#/
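As a companion to the `pip freeze` tip above, here is a minimal Python sketch that lists the packages visible to the current interpreter in `name==version` form. It is an illustration only, not a replacement for `pip freeze` or `conda list`; the function name `installed_requirements` is hypothetical.

```python
from importlib.metadata import distributions


def installed_requirements():
    """Return sorted 'name==version' strings for the active environment."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip any distribution whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    for line in installed_requirements():
        print(line)
```

Running this inside and outside a virtual environment is a quick way to see which packages each interpreter can actually import.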
## Virtual Environments Conversation
We had a chat at the start of Day 2 about issues with virtual environments using 'system' packages instead of the packages installed within the venv. It's best to keep your system 'clean', with minimal Python and package installations, to avoid this kind of conflict.
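A quick way to check which Python you are actually running is to ask the interpreter itself. The sketch below is a hedged example: the function names are hypothetical, and it only reports what it finds rather than fixing anything.

```python
import os
import sys


def in_virtual_environment():
    """Return True when the interpreter is running inside a venv."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


def report_environment():
    """Collect details that help diagnose which Python is in use."""
    return {
        "executable": sys.executable,
        "in_venv": in_virtual_environment(),
        # If set, these variables can make Python look outside the venv.
        "overrides": {
            name: os.environ[name]
            for name in ("PYTHONHOME", "PYTHONPATH", "PYTHONSTARTUP")
            if name in os.environ
        },
    }


if __name__ == "__main__":
    for key, value in report_environment().items():
        print(f"{key}: {value}")
```

If `in_venv` is `False` after activating the environment, or `overrides` is not empty, that is a hint the system installation may be interfering.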
## File naming when using Pytest
Pytest automatically discovers tests by looking for files named like `test_*.py` or `*_test.py`.
Pytest then runs any functions in those files whose names start with `test_`.
It's a good idea, though not essential, to reserve this naming convention for test code. Any file with a name like this will be examined, but only functions whose names start with `test_` will be run by the test suite.
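To make the naming rules concrete, here is a sketch of what a file called `test_durations.py` (a hypothetical name) might contain. The `text_to_duration` function is a stand-in copy so the example is self-contained; in the real project it would be imported from `eva_data_analysis`.

```python
# Contents of a hypothetical file named test_durations.py

def text_to_duration(duration):
    """Stand-in for the project function: convert "HH:MM" to hours."""
    hours, minutes = duration.split(":")
    return int(hours) + int(minutes) / 60


def make_duration(hours, minutes):
    """Helper: pytest does not run this because its name lacks the test_ prefix."""
    return f"{hours:02d}:{minutes:02d}"


def test_text_to_duration_whole_hours():
    """Run by pytest because the name starts with test_."""
    assert text_to_duration(make_duration(10, 0)) == 10


def test_text_to_duration_with_minutes():
    assert text_to_duration(make_duration(10, 30)) == 10.5
```

Running `python -m pytest` in the same directory should collect the two `test_` functions and ignore the helper.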
### Checking test coverage with pytest-cov
`python -m pytest --cov --cov-report=html`
Min checked whether a new Python script in our directory would be picked up for coverage by pytest, and found that only scripts used (imported) by a `test_*.py` file are included in the coverage report.
## Code performance - Profiling Tools
Not really covered in the lesson, but very handy, so Frances gave a demo...
Python has a built-in profiling module called `cProfile` (prefer it to the pure-Python `profile` module where possible: `cProfile` is implemented in C, so it adds less overhead)
`$ python`
```
>>> import cProfile
>>> from eva_data_analysis import main
>>> cProfile.run('main("data/eva-data.json","results/eva-data.csv","results/cumulative_eva_graph.png")')
```
This gave a LOT of output because of calls to packages like matplotlib.
```
>>> import cProfile
>>> from eva_data_analysis import text_to_duration
>>> cProfile.run('text_to_duration("10:00")')
```
This gave a smaller, more digestible output.
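When the profiler output is too long to read, the standard-library `pstats` module can sort and trim it. The following is a minimal sketch, not part of the lesson: `profile_summary` is a hypothetical helper, and `text_to_duration` is a stand-in copy of the lesson function so the example runs on its own.

```python
import cProfile
import io
import pstats


def text_to_duration(duration):
    """Stand-in for the lesson function: convert "HH:MM" to hours."""
    hours, minutes = duration.split(":")
    return int(hours) + int(minutes) / 60


def profile_summary(func, *args, top=10):
    """Profile func(*args) and return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    # Send the report to a string buffer instead of the terminal.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_summary(text_to_duration, "10:00"))
```

Limiting the report to the most expensive entries keeps the focus on your own functions rather than on library internals.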
## Continuous Integration for automated testing
https://carpentries-incubator.github.io/better-research-software/instructor/ci-for-testing.html
GitHub Marketplace lets you choose from published actions (be aware that anyone can publish an action)
https://github.com/marketplace?type=actions
## Places to start
- Decide which tools you will use, and stick with them!
- Agree on coding conventions (e.g. variable naming)
- Use a template repo with:
  - Structured Readme.md
  - Directory structure (e.g. have a tests directory)
  - Action on push to run tests (even if tests don't exist yet)
- Provide tests to your developers so they can check before pushing
----------
**Current state of the code** can be found at the start of each episode in the [lesson website](https://carpentries-incubator.github.io/better-research-software/) but we have also put them here for convenience: [Better Research Software - Code State](https://hackmd.io/@RSETeam/BetterRS-CodeState)
---------
## Code State
At certain points in the lesson we will update the code to follow better practice. As this is not specifically a Python course, but rather a course on best practice with Python as an example, we will at times ask you to update your code by copying the code versions below, to save time.
Starting Code:
<details>
```python=
# https://data.nasa.gov/resource/eva.json (with modifications)
data_f = open('./eva-data.json', 'r', encoding='ascii')
data_t = open('./eva-data.csv','w', encoding='utf-8')
g_file = './cumulative_eva_graph.png'

fieldnames = ("EVA #", "Country", "Crew ", "Vehicle", "Date", "Duration", "Purpose")

data=[]

import json

for i in range(375):
    line=data_f.readline()
    print(line)
    data.append(json.loads(line[1:-1]))
#data.pop(0)
## Comment out this bit if you don't want the spreadsheet
import csv

w=csv.writer(data_t)

import datetime as dt

time = []
date =[]

j=0
for i in data:
    print(data[j])
    # and this bit
    w.writerow(data[j].values())
    if 'duration' in data[j].keys():
        tt=data[j]['duration']
        if tt == '':
            pass
        else:
            t=dt.datetime.strptime(tt,'%H:%M')
            ttt = dt.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second).total_seconds()/(60*60)
            print(t,ttt)
            time.append(ttt)
            if 'date' in data[j].keys():
                date.append(dt.datetime.strptime(data[j]['date'][0:10], '%Y-%m-%d'))
                #date.append(data[j]['date'][0:10])
            else:
                time.pop(0)
    j+=1

t=[0]
for i in time:
    t.append(t[-1]+i)

date,time = zip(*sorted(zip(date, time)))

import matplotlib.pyplot as plt

plt.plot(date,t[1:], 'ko-')
plt.xlabel('Year')
plt.ylabel('Total time spent in space to date (hours)')
plt.tight_layout()
plt.savefig(g_file)
plt.show()
```
</details>
### Code Readability Lesson:
<details>
#### Code update 1: Pandas refactor
<details>
Please copy the below code and paste over the contents of `eva_data_analysis.py` when asked:
```python=
import matplotlib.pyplot as plt
import pandas as pd
# Data source: https://data.nasa.gov/resource/eva.json (with modifications)
input_file = open('./eva-data.json', 'r', encoding='ascii')
output_file = open('./eva-data.csv', 'w', encoding='utf-8')
graph_file = './cumulative_eva_graph.png'
eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
eva_df['eva'] = eva_df['eva'].astype(float)
eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
eva_df.to_csv(output_file, index=False, encoding='utf-8')
eva_df.sort_values('date', inplace=True)
eva_df['duration_hours'] = eva_df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
eva_df['cumulative_time'] = eva_df['duration_hours'].cumsum()
plt.plot(eva_df['date'], eva_df['cumulative_time'], 'ko-')
plt.xlabel('Year')
plt.ylabel('Total time spent in space to date (hours)')
plt.tight_layout()
plt.savefig(graph_file)
plt.show()
```
</details>
#### Code update 2: initial functions
<details>
Please copy the below code and paste over the contents of `eva_data_analysis.py` when asked:
```python=
import matplotlib.pyplot as plt
import pandas as pd


def read_json_to_dataframe(input_file):
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


# Main code
print("--START--")
input_file = open('./eva-data.json', 'r', encoding='ascii')
output_file = open('./eva-data.csv', 'w', encoding='utf-8')
graph_file = './cumulative_eva_graph.png'
# Read the data from JSON file
eva_data = read_json_to_dataframe(input_file)
# Convert and export data to CSV file
write_dataframe_to_csv(eva_data, output_file)
# Sort dataframe by date ready to be plotted (date values are on x-axis)
eva_data.sort_values('date', inplace=True)
# Plot cumulative time spent in space over years
print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
eva_data['duration_hours'] = eva_data['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
eva_data['cumulative_time'] = eva_data['duration_hours'].cumsum()
plt.plot(eva_data['date'], eva_data['cumulative_time'], 'ko-')
plt.xlabel('Year')
plt.ylabel('Total time spent in space to date (hours)')
plt.tight_layout()
plt.savefig(graph_file)
plt.show()
print("--END--")
```
</details>
#### Code update 3: Docstrings
<details>
```python=
import matplotlib.pyplot as plt
import pandas as pd


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df['duration_hours'] = df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


# Main code
print("--START--")
input_file = open('./eva-data.json', 'r', encoding='ascii')
output_file = open('./eva-data.csv', 'w', encoding='utf-8')
graph_file = './cumulative_eva_graph.png'
# Read the data from JSON file
eva_data = read_json_to_dataframe(input_file)
# Convert and export data to CSV file
write_dataframe_to_csv(eva_data, output_file)
# Sort dataframe by date ready to be plotted (date values are on x-axis)
eva_data.sort_values('date', inplace=True)
# Plot cumulative time spent in space over years
plot_cumulative_time_in_space(eva_data, graph_file)
print("--END--")
```
</details>
</details>
### Code Structure Lesson:
<details>
#### Code update 1: Adding more functions
<details>
```python=
import matplotlib.pyplot as plt
import pandas as pd


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6  # there is an intentional bug on this line (should divide by 60 not 6)
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


# Main code
print("--START--")
input_file = open('./eva-data.json', 'r', encoding='ascii')
output_file = open('./eva-data.csv', 'w', encoding='utf-8')
graph_file = './cumulative_eva_graph.png'
# Read the data from JSON file
eva_data = read_json_to_dataframe(input_file)
# Convert and export data to CSV file
write_dataframe_to_csv(eva_data, output_file)
# Sort dataframe by date ready to be plotted (date values are on x-axis)
eva_data.sort_values('date', inplace=True)
# Plot cumulative time spent in space over years
plot_cumulative_time_in_space(eva_data, graph_file)
print("--END--")
```
</details>
#### Code update 2: Adding a main function
<details>
```python=
import matplotlib.pyplot as plt
import pandas as pd


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6  # there is an intentional bug on this line (should divide by 60 not 6)
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


if __name__ == "__main__":
    input_file = './eva-data.json'
    output_file = './eva-data.csv'
    graph_file = './cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```
</details>
#### Code update 3: Adding CLI with sys.argv
<details>
```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6  # there is an intentional bug on this line (should divide by 60 not 6)
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print(f'Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = 'results/cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```
</details>
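
The `sys.argv` approach above is the one used in the lesson. As a possible next step (not covered in the lesson), the standard-library `argparse` module can handle the same two optional arguments and adds a `--help` message for free. This is a minimal sketch: the function name `parse_arguments` and the help texts are assumptions, while the default paths are taken from the code above.

```python
import argparse


def parse_arguments(argv=None):
    """Parse optional input and output file paths from the command line."""
    parser = argparse.ArgumentParser(
        description="Analyse spacewalk (EVA) data"
    )
    # nargs="?" makes each positional argument optional with a default.
    parser.add_argument(
        "input_file", nargs="?", default="data/eva-data.json",
        help="path to the input JSON file",
    )
    parser.add_argument(
        "output_file", nargs="?", default="results/eva-data.csv",
        help="path to the output CSV file",
    )
    # argv=None tells argparse to read sys.argv[1:].
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_arguments()
    print(f"Input: {args.input_file}, output: {args.output_file}")
```

One behavioural difference to be aware of: with this sketch, a single argument overrides only the input file, whereas the `sys.argv` version falls back to both defaults unless two arguments are given.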
</details>
### Start of Day 2
Where the code should be at the start of day 2:
<details>
```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6  # there is an intentional bug on this line (should divide by 60 not 6)
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print(f'Using default file paths and filenames: {input_file}, {output_file}')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]

    graph_file = 'results/cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```
</details>
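
The `text_to_duration` function in the code above still contains the intentional bug (dividing minutes by 6). The sketch below shows, under stated assumptions, how a few known input/output pairs can expose it. The two function names and the `check_examples` helper are hypothetical and exist only for this illustration; the expected values follow directly from "HH:MM" arithmetic.

```python
import math


def text_to_duration_buggy(duration):
    """Copy of the lesson function, including the intentional bug."""
    hours, minutes = duration.split(":")
    return int(hours) + int(minutes) / 6


def text_to_duration_fixed(duration):
    """The same function with minutes divided by 60."""
    hours, minutes = duration.split(":")
    return int(hours) + int(minutes) / 60


def check_examples(convert):
    """Return the inputs for which `convert` gives the wrong number of hours."""
    expected = {"10:00": 10.0, "10:15": 10.25, "00:30": 0.5}
    return [
        text for text, hours in expected.items()
        if not math.isclose(convert(text), hours)
    ]


if __name__ == "__main__":
    print("Buggy version fails on:", check_examples(text_to_duration_buggy))
    print("Fixed version fails on:", check_examples(text_to_duration_fixed))
```

Note that the whole-hour input `"10:00"` passes even with the bug, which is why it helps to test inputs with non-zero minutes as well.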
### Lesson 6: New code from collaborator
<details>
This is to be added to eva_data_analysis.py
```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re # added this line


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Calculate and add crew size to data
    eva_data = add_crew_size_column(eva_data)  # added this line

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")
```
and also add the following:
```python=
def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy
```
</details>
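
Before relying on a collaborator's code it is worth writing a couple of tests for it. The following is a small sketch of what tests for `calculate_crew_size` might look like. The function is copied here so the example is self-contained, and the crew strings are illustrative inputs that assume each name in the column is followed by a semicolon.

```python
import re


def calculate_crew_size(crew):
    """Copy of the collaborator's function from the notes above."""
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew)) - 1


def test_calculate_crew_size():
    """Typical values: one name and two names, each ending with ';'."""
    assert calculate_crew_size("Valentina Tereshkova;") == 1
    assert calculate_crew_size("Judith Resnik; Sally Ride;") == 2


def test_calculate_crew_size_edge_cases():
    """An empty crew entry gives None rather than 0."""
    assert calculate_crew_size("") is None
```

In the project itself these tests would live in a `test_*.py` file and import `calculate_crew_size` from `eva_data_analysis` rather than copying it.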
### Code state: full code with new collaborator functions
<details>
```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re
def main(input_file, output_file, graph_file):
print("--START--")
# Read the data from JSON file
eva_data = read_json_to_dataframe(input_file)
# Add crew size variable to dataset
eva_data = add_crew_size_column(eva_data)
# Convert and export data to CSV file
write_dataframe_to_csv(eva_data, output_file)
# Sort dataframe by date ready to be plotted (date values are on x-axis)
eva_data.sort_values('date', inplace=True)
# Plot cumulative time spent in space over years
plot_cumulative_time_in_space(eva_data, graph_file)
print("--END--")
def read_json_to_dataframe(input_file):
"""
Read the data from a JSON file into a Pandas dataframe.
Clean the data by removing any rows where the 'duration' value is missing.
Args:
input_file (file or str): The file object or path to the JSON file.
Returns:
eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
"""
print(f'Reading JSON file {input_file}')
# Read the data from a JSON file into a Pandas dataframe
eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
eva_df['eva'] = eva_df['eva'].astype(float)
# Clean the data by removing any rows where duration is missing
eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
return eva_df
def calculate_crew_size(crew):
"""
Calculate the size of the crew for a single crew entry
Args:
crew (str): The text entry in the crew column containing a list of crew member names
Returns:
(int): The crew size
"""
if crew.split() == []:
return None
else:
return len(re.split(r';', crew))-1
def add_crew_size_column(df):
"""
Add crew_size column to the dataset containing the value of the crew size
Args:
df (pd.DataFrame): The input data frame.
Returns:
df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
"""
print('Adding crew size variable (crew_size) to dataset')
df_copy = df.copy()
df_copy["crew_size"] = df_copy["crew"].apply(
calculate_crew_size
)
return df_copy
def write_dataframe_to_csv(df, output_file):
"""
Write the dataframe to a CSV file.
Args:
df (pd.DataFrame): The input dataframe.
output_file (file or str): The file object or path to the output CSV file.
Returns:
None
"""
print(f'Saving to CSV file {output_file}')
# Save dataframe to CSV file for later analysis
df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    # Minutes are converted to a fraction of an hour by dividing by 60
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print(f'Using default file paths and filenames: {input_file}, {output_file}')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]

    graph_file = 'results/cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```
</details>
### Code state: end of lesson 6
<details>
Code for eva_data_analysis.py
```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re


def main(input_file, output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Calculate and add crew size to data
    eva_data = add_crew_size_column(eva_data)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' or 'date' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration or date is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print('Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = 'results/cumulative_eva_graph.png'
    main(input_file, output_file, graph_file)
```
Test script test_eva_analysis.py:
```python=
from eva_data_analysis import text_to_duration, calculate_crew_size
import pytest


def test_text_to_duration_integer():
    """
    Test that text_to_duration returns expected values for typical
    durations with a zero minutes component
    """
    assert text_to_duration("10:00") == 10


def test_text_to_duration_float():
    """
    Test that text_to_duration returns expected ground truth values for
    typical durations with non-zero minutes components
    """
    assert text_to_duration("10:20") == pytest.approx(10.333333)


@pytest.mark.parametrize("input_value, expected_result", [
    ("Valentina Tereshkova;", 1),
    ("Judith Smith; Sally Rider;", 2),
    ("", None)
])
def test_calculate_crew_size(input_value, expected_result):
    """
    Test that calculate_crew_size returns expected size for
    typical crew values
    """
    actual_result = calculate_crew_size(input_value)
    assert actual_result == expected_result
```
</details>
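The tests above cover typical values. As a minimal sketch of how edge cases might also be checked, the block below copies `text_to_duration` and `calculate_crew_size` from the listing above and probes inputs the test script does not exercise: a duration with no colon, a non-numeric duration, and a crew entry with no trailing semicolon. The helper `raises_value_error` is hypothetical and exists only to keep the example free of pytest; in the test script itself, `pytest.raises(ValueError)` would be the usual way to express the same check.

```python
import re


def text_to_duration(duration):
    """Convert a text format duration "HH:MM" to duration in hours."""
    hours, minutes = duration.split(":")
    return int(hours) + int(minutes) / 60


def calculate_crew_size(crew):
    """Count crew members in a semicolon-terminated list of names."""
    if crew.split() == []:
        return None
    return len(re.split(r';', crew)) - 1


def raises_value_error(func, value):
    """Hypothetical helper: return True if func(value) raises ValueError."""
    try:
        func(value)
    except ValueError:
        return True
    return False


# A duration without a colon cannot be unpacked into hours and minutes
malformed_rejected = raises_value_error(text_to_duration, "1030")

# Non-numeric parts fail inside int() with the same exception type
non_numeric_rejected = raises_value_error(text_to_duration, "ten:30")

# Without the trailing semicolon the last name is not counted
undercount = calculate_crew_size("Judith Smith; Sally Rider")

print(malformed_rejected, non_numeric_rejected, undercount)
```

The last result shows that `calculate_crew_size` relies on every name being followed by `;`. An entry that is missing the final semicolon is undercounted by one.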
### Lesson 8: Adding a new function
Add this new function to your code:
```python=
def summary_duration_by_astronaut(df):
    """
    Summarise the duration data by each astronaut

    Args:
        df (pd.DataFrame): Input dataframe to be summarised

    Returns:
        subset (pd.DataFrame): Data frame with a row for each astronaut and a summarised column
    """
    print('Calculating summary of total EVA time by astronaut')
    subset = df.loc[:,['crew', 'duration']] # subset to work with only relevant columns
    subset = add_duration_hours(subset) # need duration_hours for easier calcs
    subset = subset.drop('duration', axis=1) # dropping the extra 'duration' column as it contains string values not suitable for calculations
    # as_index=False keeps 'crew' as a column so it is not lost when saving with index=False
    subset = subset.groupby('crew', as_index=False).sum()
    return subset
```
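To see what the new function produces, here is a minimal sketch that runs it on a tiny hand-made dataframe. The three rows are made up for illustration and are not taken from `eva-data.json`. `text_to_duration` and `add_duration_hours` are copied from the main script so that the block runs on its own. It uses `as_index=False`, as the full listing below does, so that `crew` stays an ordinary column and is not dropped when the table is later written with `to_csv(index=False)`. Rows are grouped by the exact text of the `crew` entry, so a multi-person crew forms its own group.

```python
import pandas as pd


def text_to_duration(duration):
    """Convert a text format duration "HH:MM" to duration in hours."""
    hours, minutes = duration.split(":")
    return int(hours) + int(minutes) / 60


def add_duration_hours(df):
    """Return a copy of df with a numeric duration_hours column added."""
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(text_to_duration)
    return df_copy


def summary_duration_by_astronaut(df):
    """Total duration in hours for each distinct crew entry."""
    subset = df.loc[:, ['crew', 'duration']]
    subset = add_duration_hours(subset)
    subset = subset.drop('duration', axis=1)
    return subset.groupby('crew', as_index=False).sum()


# Made-up rows for illustration only (not real EVA records)
toy = pd.DataFrame({
    'crew': ['Astronaut A;', 'Astronaut B;Astronaut C;', 'Astronaut A;'],
    'duration': ['1:30', '0:45', '2:00'],
    'eva': [1.0, 2.0, 3.0],
})

summary = summary_duration_by_astronaut(toy)
print(summary)
```

The two `Astronaut A;` rows are combined into a single row of 3.5 hours, while the two-person crew keeps its own row of 0.75 hours.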
**Full code after adding new function:**
<details>
```python=
import matplotlib.pyplot as plt
import pandas as pd
import sys
import re


def main(input_file, output_file, duration_by_astronaut_output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Calculate summary table for total EVA per astronaut
    duration_by_astronaut_df = summary_duration_by_astronaut(eva_data)

    # Save summary duration data by each astronaut to CSV file
    write_dataframe_to_csv(duration_by_astronaut_df, duration_by_astronaut_output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' or 'date' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration or date is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy


def summary_duration_by_astronaut(df):
    """
    Summarise the duration data by each astronaut

    Args:
        df (pd.DataFrame): Input dataframe to be summarised

    Returns:
        subset (pd.DataFrame): Data frame with a row for each astronaut and a summarised column
    """
    print('Calculating summary of total EVA time by astronaut')
    subset = df.loc[:,['crew', 'duration']] # subset to work with only relevant columns
    subset = add_duration_hours(subset) # need duration_hours for easier calcs
    subset = subset.drop('duration', axis=1) # dropping the extra 'duration' column as it contains string values not suitable for calculations
    subset = subset.groupby('crew', as_index=False).sum()
    return subset


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print('Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = 'results/cumulative_eva_graph.png'
    duration_by_astronaut_output_file = 'results/duration_by_astronaut.csv'
    main(input_file, output_file, duration_by_astronaut_output_file, graph_file)
```
</details>