---
title: "Finding Data for Your Data Project"
author: "Tim Dennis"
date: "October 2, 2024"
---
### Finding Data
#### Tim Dennis, Director of Library Data Science Center
**Stats 15: Introduction to Data Science, Dr. Gould**
These notes: <https://hackmd.io/@timdennis/H16PoWqCC>
---
#### This presentation will cover:
- Key concepts and terms used in data scholarship (e.g., DOI, data citation).
- Important tools and resources for finding datasets.
- Best practices for evaluating data quality.
- How to ensure data is FAIR (Findable, Accessible, Interoperable, Reusable).
- Q&A
---
### Why Finding Data is Important
Data supports your arguments and enables analysis in your research project.
Helps turn your research question into something tangible.
Note:
- Explain that finding the right data is essential for supporting their research project.
- Data is the foundation for answering research questions and validating findings.
---
### Narrowing Your Data Search
- **Research topic**: Define your subject.
- **Measurement**: Identify data type (surveys, counts, etc.).
- **Geographic scope**: Specify region/unit.
- **Time frame**: Choose relevant years/frequency.
- **Data type**: Quantitative (numbers) or qualitative (opinions)?
Note:
- Ask key questions about their research topic.
- Be specific about the geographic and temporal scope of the data they are looking for.
---
### Evaluating Data Quality
- **Read Technical Documentation**
  - Look for a **Data Dictionary**, **Codebook**, or well-thought-out **README** file.
- **Understand Bias and Purpose**: Who collected the data and why?
- **Consider Methodology**: How was the data gathered? (by survey, instrument, interview, etc.)
- **Comparability Issues**: Are datasets consistent over time?
Note:
- Read the technical documentation for any dataset they plan to use.
- Is the data adequately described? Do you know what the variables are and their measurement units? (A quick check is sketched after these notes.)
- This will help you understand the limitations, bias, and methodology behind the data.
- Ensure the data they choose is comparable across time if they’re working with time-series data.
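A minimal sketch of this check in Python, assuming you have downloaded a CSV version of a candidate dataset (the file name, column layout, and missing-value codes below are hypothetical):

```python
import pandas as pd

# Hypothetical candidate dataset downloaded from a repository.
df = pd.read_csv("candidate_dataset.csv")

# Compare what you actually received against the data dictionary / codebook.
print(df.columns.tolist())  # variable names
print(df.dtypes)            # are numeric variables actually numeric?
print(df.describe())        # value ranges hint at units (e.g., dollars vs. thousands of dollars)

# Codebooks often document sentinel codes for missing values (e.g., -9 or 999);
# make sure such codes are treated as missing rather than real measurements.
print(df.isna().sum())
```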
---
### Key Terms in Data Scholarship
- **DOI**: A unique identifier for digital objects, providing a permanent link to datasets.
  - Looks like this: <https://doi.org/10.25346/S6/OBHVMJ>
- **Data Citation**: A formal standard reference to a dataset, similar to a publication citation. The goal is to treat data as "legitimate, citable products of research" just like articles or books.
- See: *Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014 https://doi.org/10.25490/a97f-egyk*
- **Data Usage Agreement**: Outlines terms for accessing and using data.
- **Data Availability Statement**: Specifies where and how data supporting a study can be accessed.
- Example:
- >The data that support the findings of this study are openly available in [UCLA Dataverse](https://dataverse.ucla.edu/) at https://doi.org/10.25346/S6/NNGECQ.
---
### Additional Data Scholarship Terms
- **Codebook**: Describes a dataset, e.g., its variables, how it was collected, and its provenance.
- See [What is a code book](https://www.icpsr.umich.edu/web/ICPSR/cms/1983)
- **Data Dictionary**: Explains dataset structure and variables.
- See <https://library.ucmerced.edu/data-dictionaries>
- **README**: A file providing context and instructions for using a dataset.
  - See: <https://cornell.app.box.com/v/ReadmeTemplate> for a dataset README template
- **Metadata**: Descriptive information about a dataset to help users understand and locate it.
---
### [FAIR Data Principles](https://www.go-fair.org/fair-principles/)
- **Findable**: Data is easily located with clear metadata (e.g., DOI).
- **Accessible**: Data can be accessed under defined conditions.
- **Interoperable**: Data is in a format that can be integrated with other datasets and tools.
- **Reusable**: Data is well-documented and available for future research.
- FAIR is an emerging set of practices and a [maturity model](https://datascience.codata.org/articles/10.5334/dsj-2020-041) for data repositories to signal data quality. If the dataset authors or the repository claim the data is FAIR, that's a good sign!
---
### When You Can’t Find the Perfect Data
- **Be flexible**: Adjust your question if necessary.
- **Combine Datasets**: Use multiple sources to create a comprehensive analysis.
- **Use Proxy Variables**: If exact data isn’t available, use approximations.
Note:
- It's common not to find perfect data.
- Be flexible; combining datasets or using proxy variables is a valid approach to dealing with imperfect data (a sketch follows these notes).
- Examples include using average zip code income for household-level data or using regional health statistics when individual data is unavailable.
- You can talk about the limitations of the data you find in your paper.
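A minimal sketch of the combine-and-proxy idea in Python, assuming two hypothetical files: household survey records keyed by ZIP code, and a table of average income per ZIP code standing in for unavailable household-level income:

```python
import pandas as pd

# Hypothetical inputs: household-level survey data without an income variable,
# and ZIP-code-level average income from a separate public source.
households = pd.read_csv("household_survey.csv")  # includes a "zip" column
zip_income = pd.read_csv("zip_avg_income.csv")    # columns: "zip", "avg_income"

# Combine the two sources; avg_income acts as a proxy for household income.
combined = households.merge(zip_income, on="zip", how="left")

# Note how many households could not be matched -- a limitation to report in your paper.
unmatched_share = combined["avg_income"].isna().mean()
print(f"Households without a matching ZIP code: {unmatched_share:.1%}")
```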
---
### Tools & Resources for Finding Data
- **Library Licensed Resources**: [ICPSR](http://www.icpsr.umich.edu/), [Sage Data](https://data.sagepub.com/), [more ...](https://guides.library.ucla.edu/az.php?t=3849)
  - The Library buys data for use in your research. Often, licensed data is of higher quality than what you find on the open web.
- **Curated Lists**: [Data/Stats LibGuides](https://guides.library.ucla.edu/directory/formats) (by area), [UCLA Library data resources](https://guides.library.ucla.edu/az.php?t=3849) (A-Z list)
- [Ask the DSC](https://datascience.ucla.edu) if you need help navigating these resources.
- **Search for Data**: [Google Dataset Search](https://datasetsearch.research.google.com/), [Data.gov](https://data.gov/), [LA Open Data](https://data.lacity.org/) (municipalities are increasingly offering open data catalogs of city data)
Note:
- Highlight the UCLA Library’s licensed resources, which offer access to a variety of datasets.
- Highlight how curated lists and databases like ICPSR can help them find more specialized data for their projects.
---
### ICPSR (Inter-university Consortium for Political and Social Research)
- <https://www.icpsr.umich.edu/web/pages/>
- You will need to authenticate with your UCLA Logon ID to accept the DUA and download data (though many datasets are public).
- Large archive of surveys useful for **secondary analysis** and **quantitative research**.
- Includes studies in politics and elections, education, population research, social indicators, and health.
- Contains data from the United States and many countries worldwide.
- [Publication bibliography](https://www.icpsr.umich.edu/web/pages/ICPSR/citations/) and [variable search](https://www.icpsr.umich.edu/web/pages/ICPSR/ssvd/).
- The gold standard for well-curated social science data.
- Data comes set up for common statistical tools (R, Stata, SPSS), which saves time (see the sketch below).
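As a rough illustration (not an official ICPSR workflow), the prepared Stata file included in a study's download bundle can be read directly with pandas; the path below is a hypothetical placeholder:

```python
import pandas as pd

# Hypothetical path to the Stata file from an ICPSR download bundle.
# Prepared files carry variable and value labels, which saves recoding work
# compared with starting from raw text data.
df = pd.read_stata("ICPSR_study_folder/study-data.dta")

print(df.shape)   # rows x columns
print(df.head())  # first few records, with labeled values
```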
Note:
- Valuable resource for conducting secondary data analysis, particularly in social science fields.
- It includes a wide range of studies across different subjects, making it a versatile resource for student research.
---
### Sage Data
- <https://data.sagepub.com/>
- Web-based application with access to **65 billion datasets** from **90+ sources**.
- Includes both **licensed** and **public domain datasets**.
- View data in **side-by-side tables and charts**; create customized visualizations.
- Data is fully standardized with detailed metadata (up to 37 fields).
- Includes **premium datasets**: China Data Institute, Claritas Consumer Profiles, Woods and Poole, and more.
- Easy to export data and visualizations, with built-in citations for both the original source and Data Planet.
- **Limit**: Up to **500 charts/tables** per work product.
Note:
- **Data Planet** is a powerful tool for exploring large datasets across many subject areas.
- It's particularly useful for creating visualizations and analyzing data in a way that's accessible for students working on research projects.
- You can also extract tables for external analysis.
- **UCLA authentication** is required.
---
### Next Steps for Your Data Project
1. Narrow down your research question.
2. Look for articles about similar studies.
   - The [ICPSR Bibliography](https://www.icpsr.umich.edu/web/pages/ICPSR/citations/) lists publications related to datasets.
3. Identify three potential datasets.
4. Evaluate dataset quality.
   - Does it have a **codebook** or **README**? A **DOI** and **data citation**? Is it well organized? Does it have setup files for the tool you want to analyze it in?
   - Data cleaning can often take 80% or more of the work; well-described, well-organized data will save you time.
5. Start exploring the data and refining your project (a quick first-pass sketch follows this list).
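As a first pass on any candidate dataset, a few lines of Python can signal how much cleaning lies ahead (a sketch; the file name is hypothetical):

```python
import pandas as pd

# Hypothetical candidate dataset.
df = pd.read_csv("candidate_dataset.csv")

# Quick signals of how much cleaning work to expect:
df.info()                                                     # column names, types, non-null counts
print(df.isna().mean().sort_values(ascending=False).head())   # most-missing variables
print(df.duplicated().sum())                                  # exact duplicate rows
```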
Note:
- Start by narrowing your questions, identifying datasets, and evaluating the quality before diving deeper into your analysis.
---
## Need Help Finding or Using Data?
* Reach out to the [Library Data Science Center](https://www.library.ucla.edu/visit/locations/data-science-center/)
* Meet with UCLA DataSquad or DSC.
---
### UCLA Data Science Center (DSC)
- Provide support for **finding data**, **data analysis**, **visualization**, and **computational research**.
- Offer workshops and training on data tools like R, Python, SQL, and more.
- Facilitate access to datasets and software for research.
- **Contact Us**:
- **Email**: datascience@ucla.edu
- **Visit**: [UCLA Data Science Center Website](https://datascience.ucla.edu)
- **[Schedule a meeting](https://calendly.com/data-science-team)**
- **Location**: YRL, first floor in Collaboration Pod space
---
### UCLA DataSquad
- A team of trained students providing data support to researchers across campus.
- Offer assistance with data cleaning, statistics, visualization, and analysis.
- Collaborate with faculty and students on data-driven projects.
- **Contact Us**:
- **Visit**: [UCLA DataSquad Website](https://ucla-datasquad.github.io)
---
### Questions?
Contact: [tdennis@library.ucla.edu](mailto:tdennis@library.ucla.edu)
Notes from talk: <https://hackmd.io/@timdennis/H16PoWqCC>