UPGG Informatics Orientation Bootcamp 2023

---
tags: course notes
---

# UPGG Informatics Orientation Bootcamp 2023

:::warning
The [notes for Wednesday and Thursday](https://hackmd.io/@8dEsA7nbQsexUv4D0NazQA/SkwLQNBan/edit) have been moved to a separate document.
:::

## Friday Morning

### Version Control with Git

#### Instructor: Raven

#### Summary and Setup

Today we'll follow the swcarpentry notes: https://swcarpentry.github.io/git-novice/

Create and navigate to a new working directory:

````
mkdir bootcamp
cd bootcamp
````

Version control keeps track of both what you ADD to and what you SUBTRACT from a document, so it works like an unlimited "undo". However, a conflict can arise if two users change the same section of the document.

#### Setup Git

You only need to configure `git` once, at the start!

````bash
git config --global user.name "Vlad Dracula"
git config --global user.email "vlad@tran.sylvan.ia"
````

Your computer encodes the `enter` key as a character ("newline"), and different operating systems encode it differently. You can change the way Git recognizes and encodes line endings.

On macOS and Linux:

```bash
git config --global core.autocrlf input
```

And on Windows:

```bash
git config --global core.autocrlf false
```

Set up a default text editor (optional):

```bash
git config --global core.editor "nano -w"
```

Configure the default branch name to be `main`:

```bash
git config --global init.defaultBranch main
```

You can view all the configurations you have for `git`:

```bash
git config --list
```

##### Creating a Repository

Make sure you are in your working directory for this lecture. Create a new directory and initialize a new repository.

```bash
mkdir planets
cd planets
git init
```

Print your git version.

```bash
git version
```

Create the `main` branch and make sure you are on it.

```bash
git checkout -b main
```

You can get a report on the repository.

```bash
git status
```

It's recommended **NOT** to initialize a new git repository within a pre-existing one, e.g.,
making a directory called `moons` and calling `git init` inside the `moons` directory. If you accidentally make nested git repositories, just remove the inner `.git` directory.

```bash
rm -rf moons/.git
```

Version control keeps track of files only (not directories).

##### Tracking Changes

Make a new file and type some notes.

```bash
nano mars.txt
```

We can see that git detected the new file, but it's not being tracked yet:

```bash
git status
```

To start tracking changes:

```bash
git add mars.txt
```

To **save** the changes, run the `commit` command (basically **save** in the `git` lexicon).

```bash
git commit -m "Start notes on Mars as a base"
```

This saves a snapshot of the changes you have staged since the last `commit`. The text after the `-m` flag is a note to your future self or a collaborator describing what changed in this commit. It should be relatively brief.

Take a look at the commit history.

```bash
git log
```

If you make further changes, you can run `git diff` to show the differences between the current state of the file and the most recently saved version.

```bash
git diff
```

Git add your new changes before committing:

```bash
git add mars.txt
git commit -m "Add concerns about effects of Mars' moons on Wolfman"
```

By default, `git diff` compares your unstaged changes in the working directory against the staging area. If you run `git diff` after staging your new changes, it won't find a difference.

```bash
nano mars.txt
git add mars.txt
git diff
```

However, you can compare the staged version with the last committed version:

```bash
git diff --staged
```

Change the visualization from line-level changes (the default) to word-level changes:

```bash
git diff --color-words
```

If `git log` is very long:

```bash
git log -1 ## Limits to just last commit, change 1 to any number of logs you want
git log --oneline ## reduces information to just one line
```

You can git add a directory even though directories themselves are not tracked (git tracks the contents).
```bash
mkdir spaceships
touch spaceships/apollo-11 spaceships/sputnik-1
git status
git add spaceships
git status
```

Notice the difference in git status before and after adding spaceships. Commit the changes:

```bash
git commit -m "Add some initial thoughts on spaceships"
```

It's best practice to be descriptive with your commit messages.

###### Exercises:

###### Which command(s) below would save the changes of myfile.txt to my local Git repository?

Answer number 3 is correct:

```bash
git add myfile.txt
git commit -m "my recent changes"
```

The staging area can hold changes from any number of files that you want to commit as a single snapshot.

###### Adding multiple files

- Add some text to mars.txt noting your decision to consider Venus as a base

```bash
nano mars.txt
```

- Create a new file venus.txt with your initial thoughts about Venus as a base for you and your friends

```bash
nano venus.txt
```

- Add changes from both files to the staging area, and commit those changes.

```bash
git add mars.txt venus.txt
git commit -m "Started considering Venus as a base"
```

###### `bio` Repository

- Create a new Git repository on your computer called bio.
- Write a three-line biography for yourself in a file called me.txt, commit your changes
- Modify one line, add a fourth line
- Display the differences between its updated state and its original state.

```bash
cd .. # leave planets, which is already a git repository
mkdir bio
cd bio
git init # initializes git
nano me.txt # write biography in the file
git add me.txt
git commit -m "Add biography file"
nano me.txt # modify one line and add the fourth line
git diff me.txt # shows the differences
```

##### Exploring History

Make some more changes in `mars.txt`

```bash
nano mars.txt
cat mars.txt
```

See what changed. This compares the file with the most recent commit.

```bash
git diff HEAD mars.txt
```

Use `HEAD~1` to step back one commit from the most recent one (i.e., to the second most recent commit).
`HEAD` refers to the most recent commit, and earlier commits can be referred to relative to `HEAD`.

```bash
git diff HEAD~1 mars.txt
```

`git show` displays the commit message along with the changes made in that commit, rather than the differences between a commit and our working directory.

```bash
git show HEAD~2 mars.txt
```

`HEAD` is good for finding a recent commit because it is relative to the most recent one, but if you want to point at a commit with an absolute identifier, you can obtain the ID from `git log`. Each commit gets its own unique ID.

```bash
git diff f22b25e3233b4645dabd0d81e651fe074bd8e73b mars.txt
```

You don't have to use the full 40-character string. As long as the first few characters you type are unique among commit IDs, git will know which one you are referring to (similar logic to `tab completion`).

```bash
git diff f22b25e mars.txt
```

`git checkout` checks out (i.e., restores) an old version of a file. In this case, we are telling Git we want to recover the version of the file recorded in `HEAD`.

```bash
git checkout HEAD mars.txt
```

##### Don't Lose Your Head

If you want to restore a version of a file but forget to add the filename after `git checkout <commit ID>`, you will enter a special state called `detached HEAD`. You have basically entered an area dedicated to experimenting with the `<commit ID>` version of your repository. In this state you can change files and even commit those changes. HOWEVER, once you "reattach" the `HEAD`, the changes and commits you made in the `detached HEAD` state won't come with you.

There's a way to retain all the commits you made in the `detached HEAD` state. Basically, you make a new `branch` to keep those commits, creating an `alternate reality` version of your repository. Once you've done that, you can "reattach" `HEAD` to return to the original repository on the `main` branch. Essentially, you now have 2 versions of your repository on 2 different branches.
The commands below demonstrate this:

```bash
git checkout f22b25e # enter the 'detached HEAD' state; the repository looks as it did when you committed f22b25e...
nano mars.txt # in this state, you can make changes...
git add mars.txt
git commit -m "made changes in mars.txt" # ...and even commit them

### YOU CAN RETURN TO MAIN WITHOUT RETAINING THOSE COMMITS BY
git checkout main

### ALTERNATIVELY, if you wish to retain those changes somewhere, make a new branch before returning to main
git checkout -b <new-branch-name> # this branch keeps all the commits you made in the 'detached HEAD' state
git checkout main # brings you back to the state before you ran 'git checkout f22b25e'
```

###### Exercise: RECOVERING OLDER VERSIONS OF A FILE

Which commands below will let her recover the last committed version of her Python script called data_cruncher.py?

The answers are 2 and 4:

```bash
git checkout HEAD data_cruncher.py
```

or

```bash
git checkout <unique ID of last commit> data_cruncher.py
```

##### **Important!!!** `Checkout` vs `Restore`

`git checkout` does two different things:

1. Restoring files from a previous commit
2. Navigating to a branch

This is confusing, so the Git developers split the two functions into two separate commands: `git restore` for restoring files and `git switch` for changing branches. Now, to restore a file from a commit, run:

```bash
git restore --source <commit ID> <filename>
```

The restore role of `git checkout` is gradually being superseded. Let's make it a habit to use `restore` for restoring files!!!

##### Ignoring things

You don't always want `git` to keep track of every file in the repository (e.g., raw files, large files, etc.). You can tell `git` to ignore these files.

Create a new directory and files:

```bash
mkdir results
touch a.dat b.dat c.dat results/a.out results/b.out
git status
```

`git status` will show the new files. To ignore them, create a `.gitignore` file:
```bash
nano .gitignore
```

Type:

````
*.dat
results/
````

```bash
git status
```

You also need to keep track of your `.gitignore`:

```bash
git add .gitignore
git commit -m "Ignore data files and the results folder"
git status
```

Take a look at your ignored files:

```bash
git status --ignored
```

You can override the ignore settings by using `-f` to force-add a file to the staging area.

```bash
git add -f a.dat
```

###### Exercise: IGNORING NESTED FILES

If you have a directory structure like this:

````
results/data
results/plots
````

You can ignore only one of the subdirectories by specifying it in `.gitignore`:

````
results/plots/
````

##### Remotes in GitHub

###### Create a remote repository

Log into GitHub and click to create a new repository. Fill in the name and click the create repository button.

###### Connect local to remote repository

Copy the SSH URL of your remote repository. Go to your local directory:

```bash
git remote add origin git@github.com:vlad/planets.git
```

Check that it worked:

```bash
git remote -v
```

###### SSH Background and Setup

You need a private and public key pair on your computer. First check whether one already exists:

```bash=
ls -al ~/.ssh
```

Create an SSH key pair if you don't already have one. Choose whatever passphrase you like, write it down, and make sure to remember it.

```bash
ssh-keygen -t ed25519 -C "vlad@tran.sylvan.ia"
```

If you type

```bash
ls -al ~/.ssh
```

the new key pair will show.

Copy the public key to GitHub:

```bash
cat ~/.ssh/id_ed25519.pub
```

Copy the output and go to GitHub. Click your profile picture > Settings > SSH and GPG keys, add a new SSH key, and paste the public key.

Check that the key is set up on GitHub:

```bash
ssh -T git@github.com
```

###### Push local changes to a remote

Check that you are still in your planets directory. After authenticating your SSH key pair, let's push your changes from the local to the remote repository.
```bash
git push origin main
```

If you go to your GitHub page, you should be able to see all the files that you committed and the version history of your repository, including your local changes.

###### Collaborating

Practice adding a collaborator:

- Go to the settings of your planets repository.
- Go to Collaborators
- Add people
- Search by GitHub username or email

You should receive an email invitation from the person who added you as a collaborator.

Create a new directory called collaboration:

```bash=
cd .. # get out of your own planets directory
mkdir collaboration
cd collaboration
```

Clone your collaborator's repository from GitHub:

```bash=
git clone git@github.com:vlad/planets.git
```

Go into your copy of your collaborator's repository and create a new file:

```bash=
cd planets
nano pluto.txt
```

Commit your changes:

```bash=
git add pluto.txt
git commit -m "Add notes about Pluto"
```

And push to your collaborator's GitHub repository:

```bash=
git push origin main
```

If you go to the repository's GitHub page, you can see your new file and your commit in the commit history. However, your collaborator won't have the new file locally unless they pull from the remote repository.

```bash=
git pull origin main
```

To download the updates and inspect them before merging with `git pull`, you can use:

```bash=
git fetch origin main
```

You can then compare your local branch with the fetched remote branch:

```bash=
git diff main origin/main
```

##### Conflicts

Modify the mars.txt file in your copy of your collaborator's repository.

```bash
nano mars.txt
```

Push the change to GitHub:

```bash=
git add mars.txt
git commit -m "Add a line in our home copy"
git push origin main
```

Now suppose the owner also makes parallel changes in their own repository:

```bash=
nano mars.txt
git add mars.txt
git commit -m "Add a line in my copy"
git push origin main
```

The push is rejected because the remote contains changes to mars.txt that the owner's local copy does not have. You'll need to pull from the remote and resolve the conflict:

```bash
git pull origin main
```

Edit the file to resolve the conflicts, then commit the changes.
The conflicting section is marked in the file by `<<<<<<< HEAD`, `=======`, and `>>>>>>>`.

```bash
nano mars.txt
```

```bash
git add mars.txt
git commit -m "Merge changes from GitHub"
```

**Remember to always git pull before you git push.**

##### Branches

You can branch off the main version of the repository into your own space, where you can commit and push freely. Your collaborator can do the same on their own branch. Then, when you want to merge those branches into `main` (i.e., make the changes part of the main version), you can review the merge, check whether it has any conflicts, and approve it.

### Data and Project Organization

#### Instructor: Emiliano

Data organization is an ongoing process!

Setup: open a terminal, a text editor of your choice, and the "gapminderDataFiveYear_superDirty data.xlsx" file, and create a new directory to work in.

#### Introduction

You just started a new job or a new rotation, and someone hands you this data file to analyze.

```bash=
#Let's look at the data
head "gapminderDataFiveYear_superDirty data.xlsx"
```

We can't tell:

- What is it?
- Where did it come from?
- When was it collected?
- Has anything been changed? If so, why was it changed?

We'll look into how to store data for yourself and others.

#### Project Structure

Here are some characteristics of files you should pay attention to:

- File History
- File Function
- File Format
- File Origin

A basic, intuitive project directory:

- Code Directory (keep scripts here, separate from data)
- README.txt (add info here about the project and its organization: project name, date, contact info, where data came from, other info about the project)
- Data Directory keeps all your data
    - Raw data dir: keep raw data separate from everything else
    - Modified data dir: keep data here that has been processed by scripts or saved at stopping points
- Output Directory for files generated from other files
    - Like figures, stats, paper etc.
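
The directory layout listed above can be sketched with a few shell commands. This is a minimal sketch rather than the instructor's exact setup: the project name `my_project` and the subdirectory names (`code`, `data/raw`, `data/modified`, `output`) are illustrative assumptions, and the README fields are placeholders to fill in by hand.

```bash
# Hypothetical project name (an assumption for this sketch); change as needed.
project="my_project"

# Create the directory skeleton. -p creates parent directories and is safe to re-run.
mkdir -p "$project/code" "$project/data/raw" "$project/data/modified" "$project/output"

# Start a README with a few basic fields; Contact and Data source are left blank on purpose.
printf 'Project name: %s\nDate: %s\nContact:\nData source:\n' \
    "$project" "$(date +%F)" > "$project/README.txt"

# List the directories that were created.
find "$project" -type d | sort
```

Keeping the skeleton in a small script means every new project starts with the same structure, which makes it easier to find things later.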
Key points:

- Organize files so that they are intuitive
- Have README files within folders to describe the project and give context for the analyses
- Make a copy of the raw file, so you never have to modify the raw version
- Keep a clear record of the modifications that have been made by making sure your scripts are reproducible from raw data
- Compartmentalize your directory

##### Helpful Naming conventions

- Keep track of steps by numbers (e.g., `01_file.txt`, `02_file.txt`)
- Use dates and versions of files (e.g., `2023-08-24_file.txt`, `2023-08-25_file.txt`)

##### Let's organize the directory that we just created.

```bash=
#Inside our working directory that we just created
#Creating data dir with original and raw subdirs
mkdir data
mkdir data/original
mkdir data/raw
mv gapminder* data/original #moving data files to data/original
cd data/original
ls
#gapminderDataFiveYear_superDirty.xlsx
#gapminderDataFiveYear_superDirty.txt
chmod 444 gapminder* #makes the files read-only for owner, group, and others
ls -l #checking file permissions: -r--r--r--@
cd .. #going up one level, to the data directory
nano README.txt
#Inside our README.txt>>
#Project Name: UPGG Bootcamp Data Organization
#Date: 2023-08-25
#Email/Contact info:Emiliano Sotelo-jemilianos@gmail.com
#Data downloaded from: https://reproducible-science-curriculum.github.io/organization-RR-Jupyter/setup/#:~:text=gapminderDataFiveYear_superDirty%20data.xlsx
#Goal is to learn how to organize our data projects.
```

You might want to list your personal email as a contact instead of your Duke email: your README might outlast your access to the Duke account, and you will most likely keep your personal email much longer.

##### Modifying Data

To avoid modifying the original data, we should make a copy and work from there.

```bash=
mkdir cleaned #in same dir as original
cp original/gapminder*.xlsx cleaned/.
cd cleaned
chmod 777 gapminder*.xlsx #gives all the permissions
open gapminder*xlsx # this will open the file in excel (on macOS)
nano README.txt
#Inside README.txt>>
#-Data cleaning steps for gapminder:
#-Removed 5th row
#Better to write out a script to modify our data than to manually clean/edit our data
```

After we've cleaned the data in the copied file, let's make some more directories to sort our files:

```bash=
cd ../.. #back to the main project folder
mkdir code
mkdir output
mkdir doc
```

Here's an example of a template for an analysis that you can adapt: https://github.com/jemilianosf/template_analysis.

### Sharing Jupyter Notebooks

#### Instructor: Hilmar Lapp

You already have all the materials needed for this last lesson, but it will apply to any other work as well.

##### Sharing Jupyter Notebooks using GitHub

How do we share work with other people?

1. Static - sharing a snapshot of your work that cannot be changed
2. Dynamic - others can interact with, change, or run the code without having to ask you to change anything

##### Binder

Running code is harder than displaying code. To run code you need:

- Hardware
- Code, including dependencies (like R or Python, and packages)

Binder provides both.

Example binder link: https://mybinder.org/v2/gh/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter/gh-pages?filepath=notebooks%2FData_exploration_run.ipynb

You can execute each chunk of code in a "live" notebook at the link above.

Go to your GitHub repository, copy the https link, and paste it into https://mybinder.org/ under "GitHub repository name or URL". Paste the path to your notebook under "Path to a notebook file (optional)".

Mybinder creates a container, a very lightweight virtual machine running on host hardware. After creating your notebook execution environment, you get a URL that you can share with others.

###### Adding dependencies

If you only have simple code that uses standard Python, it should run without extra steps.
But if you have a more complex project, you need to tell Binder what your dependencies are (which packages you import). Go to your GitHub repository and create a new file called `requirements.txt`. You can add dependencies like `pandas`, `numpy`, `matplotlib`, `seaborn`, one per line. Binder will recognize your `requirements.txt` and install those packages in your notebook's container.

###### Adding data

You won't be able to host data on Binder. If your data is large, you won't be able to save it on GitHub either. You can use an external data repository and get links to your data that you can refer to inside your notebook.

Hugging Face is a service to store and document large datasets. Here's an example: https://huggingface.co/imageomics/BGNN-trait-segmentation

Note: Python gets updated frequently, and that's something to be aware of. Be explicit about the versions of Python and packages you use.

FYI, an initiative to build reproducible containers: https://codeocean.com/
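
To make the note about explicit versions concrete, here is a minimal sketch of a `requirements.txt` with pinned versions. The four packages are the ones named in these notes, but the version numbers are illustrative assumptions, not versions used in the lesson; replace them with the versions you actually ran your notebook with.

```bash
# Write a requirements.txt that pins each package to one exact version.
# The version numbers below are placeholders chosen for illustration only.
cat > requirements.txt <<'EOF'
pandas==2.0.3
numpy==1.25.2
matplotlib==3.7.2
seaborn==0.12.2
EOF

# Show the file that would be committed to the GitHub repository.
cat requirements.txt
```

To find the versions installed on your own machine, `python3 -m pip freeze` prints them in the same `name==version` format.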
