Introduction to computing for the social sciences
Learning objectives
- Introduce myself
- Identify major course objectives
- Identify course logistics
- Introduce basic principles of data science workflow and programming
- Explain how to get started in R
Monday, April 6, 2020
General Notes
- Everything will be available on the course website!
- The course mainly focuses on learning basic programmatic and computational skills
- Stick to the 15-minute rule when trying to solve a problem
- Otherwise, ask for help using:
- Grades in this class are based on evaluations of the weekly programming assignments
- Any adjustments that need to be made will happen at the end of the quarter
- If you want to take the class P/F, email Professor Soltoff before the last week of the class, and check in around week 5 or 6
- Homework due every Monday at 11:59pm CST
Lecture Notes
Computational Social Science Workflow
- Importation Stage
- Have data in some sort of format and need to prepare it for analysis
- Tidy
- Get data into a usable format, generally a data frame
- Will spend a lot of time at this stage!
- Middle Stages (Statistical Analysis): Transform, Visualize, Model
- Cyclical process
- As you go through the process, you will likely learn something new and go through the process several times until you arrive at an end product
- Communicate
- Some sort of end-product like a report or a visualization
We will be learning how to go through all these steps using programming
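The stages above can be sketched in a few lines of base R. This is only an illustrative sketch, not the course's own example: it assumes the built-in `mtcars` data set as a stand-in for a file you would normally import, and the variable names (`cars_raw`, `cars`, `wt_lbs`, `plot_file`, `fit`) are made up for illustration.

```r
# Import: a real project would read a file, e.g. read.csv("data/my-file.csv")
# (hypothetical path). The built-in mtcars data set stands in here.
cars_raw <- mtcars

# Tidy: move the car names from row names into an ordinary column
cars <- cbind(model = rownames(cars_raw), cars_raw)
rownames(cars) <- NULL

# Transform: derive a new variable (wt is recorded in units of 1000 lbs)
cars$wt_lbs <- cars$wt * 1000

# Visualize: scatterplot written to a temporary PDF file
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
plot(cars$wt_lbs, cars$mpg, xlab = "Weight (lbs)", ylab = "Miles per gallon")
dev.off()

# Model: a simple linear regression of fuel economy on weight
fit <- lm(mpg ~ wt_lbs, data = cars)

# Communicate: report the estimated coefficients
print(coef(fit))
```

In practice the middle steps repeat: a plot or model often sends you back to transform the data again before you have something worth communicating.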
Program
- A series of instructions that specifies how to perform a computation
- Includes: Input, Output, Math, Conditional execution, Repetition
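As a rough illustration, the five components listed above can all appear in one very short R program. The scores below are made-up values used only for the example, and the names `scores`, `average`, and `label` are hypothetical.

```r
# Input: the data the program starts from (made-up exam scores)
scores <- c(72, 88, 95, 61)

# Math: compute the mean score
average <- mean(scores)

# Repetition: visit each score in turn
for (score in scores) {
  # Conditional execution: choose a label depending on the value
  if (score >= average) {
    label <- "at or above average"
  } else {
    label <- "below average"
  }
  # Output: print the result
  cat(score, "is", label, "\n")
}
```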
GUI (Graphical User Interface)
- In contrast to pointing and clicking in a GUI, we want to think of programming as an explicit activity -> actually writing it out
- A little more difficult because we have to remember certain syntax
Example: "Jane: a GUI workflow"
- Searches for data files online
- Cleans the files in Excel
- Analyzes the data in Stata
- Writes her report in Google docs
Example: "Sally: a programmatic workflow"
- Creates a folder specifically for this project, containing:
  - data
  - graphics
  - output
- Searches for data files online
- Cleans the files in R
- Analyzes the files in R
- Writes her report in R Markdown
All of Sally's work is done using code!
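Even the project folder itself can be created with code. The sketch below assumes that data, graphics, and output are subfolders of the project folder; the project name `my-project` is hypothetical, and a temporary directory is used so that running the example does not touch your own files.

```r
# Hypothetical project location; a temporary directory keeps the sketch harmless
project_dir <- file.path(tempdir(), "my-project")

# The three folder names from Sally's example
subfolders <- c("data", "graphics", "output")

# Create each subfolder (recursive = TRUE also creates the project folder)
for (folder in subfolders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.files(project_dir)
```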
But there's an issue for Jane when asked to re-do the analysis…
- Jane does not have the original data, nor does she remember the steps taken to analyze her data
- Sally will benefit because she has a written record of what she's done -> automation
- Much easier to implement in the long run
Reproducibility
- Are my results valid? Can they be replicated?
- The idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them
Version control
- Revisions in research
- Tracking revisions
- Multiple copies (e.g. analysis-1.r, analysis-2.r, analysis-3.r)
- This is not an optimal system for a number of reasons
- Cloud storage (e.g. Dropbox, Google Drive, Box)
- Slight improvement, but it's still difficult to track changes and to keep track of who made which changes when collaborating, among other reasons
- Version control software
- Repository
- Git
- Can work on a single computer or across a network
- There's an explicit record of who's making changes
Documentation
- Comments are the what
- Let others know what is happening in the code
- Code is the how
- Computer code should also be self-documenting
- Future-proofing
- Several weeks or months later, you should still know how that code works
- Likewise, if you're sharing code with colleagues, they should be able to understand what's happening as well
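A small sketch of the difference, assuming a made-up task (computing a voter turnout percentage). Both functions are hypothetical and compute the same thing; only the second tells a future reader what is going on.

```r
# Harder to follow: the names say nothing about what is being computed
f <- function(a, b) a / b * 100

# Self-documenting: the names carry the "how", and the comment states the "what"
# Compute voter turnout as a percentage of registered voters
compute_turnout_pct <- function(votes_cast, registered_voters) {
  votes_cast / registered_voters * 100
}

# Made-up numbers, used only to show the call
compute_turnout_pct(votes_cast = 450, registered_voters = 900)
```

Weeks later, a call to `compute_turnout_pct()` still explains itself, while a call to `f()` sends you back to the function body to work out what it does.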
Software