2023-04-04 <br> DiMeN Reproducible Research Day 2: Python and GitHub

---
tags: teaching
---

# 2023-04-04 <br> DiMeN Reproducible Research Day 2: Python and GitHub

Welcome to today's hackpad! We'll use this editable space to collect links to all of today's resources and as a place for you to ask questions.

Alex Coleman, <a.coleman1@leeds.ac.uk>

## Contents

- [Links for today](#Links-for-today)
- [Agenda](#Agenda)
- [Setup](#Setup)
- [Questions](#Questions)
- [Further reading](#Further-reading)

## Links for today

- [Hackpad](https://hackmd.io/@research-computing-leeds/2022-03-dimen-python)
- [Training material](https://arctraining.github.io/quant-python-03-2022/index.html)
- [Register for GitHub](https://github.com/join)
- [Introduction to Git training material](https://arctraining.github.io/swd2_git/)

## Agenda

| Time | Agenda                                     |
| -------- | ------------------------------------------ |
| 0900     | Arrival                                    |
| 0915     | Getting everyone set up, intro to Python   |
| 0955     | Break ☕💨                                  |
| 1000     | Crash course Python, starting with data    |
| 1030     | Break ☕                                    |
| 1100     | Manipulating data                          |
| 1140     | Quick comfort break (depending on how in the flow we all are) |
| 1145     | Wrap up                                    |
| 1200     | Lunch 🥪                                   |
| 1300     | Wrapping up Python                         |
| 1350     | Break ☕                                    |
| 1410     | GitHub and tying everything together       |
| 1450     | Questions and close                        |
| 1500     | End                                        |

## Setup

### Pre-course prep

To get ready for today's session you'll need to do the following steps:

1. [Sign up for a GitHub account](https://github.com/join). Everything we're doing today will require a GitHub account, even the non-GitHub stuff, so sign up now using your academic email address
2. Navigate to https://github.com/ARCTraining/dimen-python-2023
3. Click the [`Fork`](https://github.com/ARCTraining/dimen-python-2023/fork) button

![](https://i.imgur.com/3MZ9AZx.png)

## Questions

Click the edit button (pencil symbol) on the top right and use the dark background edit mode to write your own question below.

## Further reading

### Software Sustainability Institute Research Software camps

These include 1-2-1 mentoring schemes - https://software.ac.uk/research-software-camps

### Link to Git and GitHub Materials

Introduction to Git and GitHub materials from the Carpentries - https://swcarpentry.github.io/git-novice/

### Link to Conda tutorial

Managing packages and dependencies is a tricky topic. Conda is a package and environment manager that is common in research. We have some course notes introducing conda for our HPC2 course at Leeds - https://arctraining.github.io/hpc2-software/course/conda.html

### Specific Python things

#### Copies and references

**Summary**: _Be careful when making changes to subsets of data. To avoid making changes to the original data, make a copy of it._

When we assign an existing pandas dataframe to a new variable (`new_df = old_df`), we don't get a new dataframe; we get a second name that refers to the same one. This means if we edit the data through the new variable, we also edit the original dataframe.

```python=
>>> # we have our initial dataframe
>>> surveys_df.head()
   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          1      7   16  1977        2         NL   M             32.0     NaN
1          2      7   16  1977        3         NL   M             33.0     NaN
2          3      7   16  1977        2         DM   F             37.0     NaN
3          4      7   16  1977        7         DM   M             36.0     NaN
4          5      7   16  1977        3         DM   M             35.0     NaN
>>> # we create a new variable from our existing dataframe
>>> ref_surveys_df = surveys_df
>>> # we change the data in our ref_surveys_df
>>> ref_surveys_df[0:3] = 0
>>> # we have a quick look at this dataframe
>>> ref_surveys_df.head()
   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          0      0    0     0        0          0   0              0.0     0.0
1          0      0    0     0        0          0   0              0.0     0.0
2          0      0    0     0        0          0   0              0.0     0.0
3          4      7   16  1977        7         DM   M             36.0     NaN
4          5      7   16  1977        3         DM   M             35.0     NaN
>>> # we look back at our original dataframe and find it's changed too! arghhh
>>> surveys_df.head()
   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          0      0    0     0        0          0   0              0.0     0.0
1          0      0    0     0        0          0   0              0.0     0.0
2          0      0    0     0        0          0   0              0.0     0.0
3          4      7   16  1977        7         DM   M             36.0     NaN
4          5      7   16  1977        3         DM   M             35.0     NaN
>>> # we can check that both names actually refer to the same object
>>> ref_surveys_df is surveys_df
True
>>> # to create a true copy with pandas we have to use the .copy() method
>>> copy_surveys_df = surveys_df.copy()
>>> copy_surveys_df is surveys_df
False
>>> # now we can change copy_surveys_df without affecting our original data (surveys_df)
```

For more information, see when [Pandas returns a view or a copy](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-view-versus-copy), the [Python docs](https://docs.python.org/3/library/copy.html), and [Real Python](https://realpython.com/copying-python-objects/).

#### Virtual environments and conda

There is a nice introduction to the concepts around using virtual environments in this [Carpentries Incubator workshop](https://carpentries-incubator.github.io/python-intermediate-development/12-virtual-environments/index.html). It nicely covers what they are, how you use them and how you share them.

Conda is another (very popular) tool for doing the same thing as virtual environments, and again there is a nice guide from the [Carpentries](https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/).

I would recommend reading both of these, as it's worth knowing about both, but if you just want to crack on I'd start with the conda one as that is probably the most commonly used in your domain.

#### Objects in Python

For the very brave, you can read more about the concept of objects in Python and its associated programming paradigm, object-orientated programming (OOP), in this [Real Python tutorial](https://realpython.com/python3-object-oriented-programming/).
This answer on [Stack Overflow](https://stackoverflow.com/questions/56310092/what-is-an-object-in-python) is also good. Crucially, this isn't something you should worry about too much as a beginner, but it is worth being aware of as you expand your Python experience.

#### Dropping NAs in one column only

In the example I showed for dropping NAs we get some slightly funny behaviour.

```python!
surveys_df["weight"] = surveys_df["weight"].dropna() * 1.1
```

This will not actually drop the rows with NAs from the `surveys_df` dataframe. When the result is assigned back, pandas lines it up with the original index, so it just fills in the rows where a value is present and the other rows keep an NA in `weight`.

To subset the dataframe so we only have rows where the `weight` column has a value we need to do the following:

```python!
# subset the surveys_df dataframe
# dropping any rows where the weight column is NA
surveys_df = surveys_df.dropna(subset=["weight"])

# dropping rows leaves gaps in the index so we need to reset it
# drop=True here drops the old index, otherwise
# it is added as a new column
surveys_df = surveys_df.reset_index(drop=True)
```
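
If you want to see the difference between the two approaches side by side, here is a minimal sketch you can run yourself. It uses a tiny made-up dataframe rather than the real surveys data, and the names `column_only` and `rows_dropped` are just illustrative, so treat it as a demonstration of the idea rather than part of the course material.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df: made-up values, two missing weights
surveys_df = pd.DataFrame(
    {
        "record_id": [1, 2, 3, 4],
        "species_id": ["NL", "DM", "DM", "PF"],
        "weight": [float("nan"), 40.0, float("nan"), 8.0],
    }
)

# Approach 1: drop NAs from the column only, then assign the result back.
# The assignment is aligned on the index, so rows 0 and 2 are still there
# and still have NaN in the weight column.
column_only = surveys_df.copy()
column_only["weight"] = column_only["weight"].dropna() * 1.1

# Approach 2: drop whole rows where weight is missing, then reset the index.
rows_dropped = surveys_df.dropna(subset=["weight"]).reset_index(drop=True)

# Compare the number of rows and the number of missing weights in each
print(len(column_only), column_only["weight"].isna().sum())
print(len(rows_dropped), rows_dropped["weight"].isna().sum())
print(list(rows_dropped.index))
```

The first approach keeps all four rows, with the two missing weights still missing, while the second keeps only the two rows that have a weight and renumbers them from zero. Note the `.copy()` in the first approach: it stops the experiment changing our original `surveys_df`.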