--- title: "Finding Data for Your Data Project" author: "Tim Dennis" date: "October 2, 2024" --- ### Finding Data #### Tim Dennis, Director of Library Data Science Center **Stats 15: Introduction to Data Science, Dr. Gould** These notes: <https://hackmd.io/@timdennis/H16PoWqCC> --- #### This presentation will cover: - Key concepts and terms used in data scholarship (e.g., DOI, data citation). - Important tools and resources for finding datasets. - Best practices for evaluating data quality. - How to ensure data is FAIR (Findable, Accessible, Interoperable, Reusable). - Q&A --- ### Why Finding Data is Important Data supports your arguments and enables analysis in your research project. Helps turn your research question into something tangible. Note: - Explain that finding the right data is essential for supporting their research project. - Data is the foundation for answering research questions and validating findings. --- ### Narrowing Your Data Search - **Research topic**: Define your subject. - **Measurement**: Identify data type (surveys, counts, etc.). - **Geographic scope**: Specify region/unit. - **Time frame**: Choose relevant years/frequency. - **Data type**: Quantitative (numbers) or qualitative (opinions)? Note: - Ask key questions about their research topic. - Be specific about the geographic and temporal scope of the data they are looking for. --- ### Evaluating Data Quality - **Read Technical Documentation** - Look for a **Data Dictionary**, **Codebook**) or well thoughtout **README** file. - **Understand Bias and Purpose**: Who collected the data and why? - **Consider Methodology**: How was the data gathered? (by survey, instrument, interview, etc.) - **Comparability Issues**: Are datasets consistent over time? Note: - Read the technical documentation for any dataset they plan to use. - Is the data adequately described, do you know what variables are, their measurment units? - This will help you understand the limitations, bias, and methodology behind the data. - Ensure the data they choose is comparable across time if they’re working with time-series data. --- ### Key Terms in Data Scholarship - **DOI**: A unique identifier for digital objects, providing a permanent link to datasets. - Looks like this <https://doi.org/10.25346/S6/OBHVMJ> - **Data Citation**: A formal standard reference to a dataset, similar to a publication citation. The goal is to treat data as "legitimate, citable products of research" just like articles or books. - See: *Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014 https://doi.org/10.25490/a97f-egyk* - **Data Usage Agreement**: Outlines terms for accessing and using data. - **Data Availability Statement**: Specifies where and how data supporting a study can be accessed. - Example: - >The data that support the findings of this study are openly available in [UCLA Dataverse](https://dataverse.ucla.edu/) at https://doi.org/10.25346/S6/NNGECQ. --- ### Additional Data Scholarship Terms - **Codebook**: Describes a dataset, e.g. variables, how was it collected, it's provenance. - See [What is a code book](https://www.icpsr.umich.edu/web/ICPSR/cms/1983) - **Data Dictionary**: Explains dataset structure and variables. - See <https://library.ucmerced.edu/data-dictionaries> - **README**: A file providing context and instructions for using a dataset. 

---

### Key Terms in Data Scholarship

- **DOI**: A unique identifier for digital objects, providing a permanent link to datasets.
  - Looks like this: <https://doi.org/10.25346/S6/OBHVMJ>
- **Data Citation**: A formal, standardized reference to a dataset, similar to a publication citation. The goal is to treat data as "legitimate, citable products of research," just like articles or books.
  - See: *Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego, CA: FORCE11; 2014. <https://doi.org/10.25490/a97f-egyk>*
- **Data Usage Agreement**: Outlines the terms for accessing and using data.
- **Data Availability Statement**: Specifies where and how the data supporting a study can be accessed.
  - Example:
    > The data that support the findings of this study are openly available in [UCLA Dataverse](https://dataverse.ucla.edu/) at <https://doi.org/10.25346/S6/NNGECQ>.

---

### Additional Data Scholarship Terms

- **Codebook**: Describes a dataset, e.g., its variables, how it was collected, and its provenance.
  - See [What is a codebook?](https://www.icpsr.umich.edu/web/ICPSR/cms/1983)
- **Data Dictionary**: Explains a dataset's structure and variables.
  - See <https://library.ucmerced.edu/data-dictionaries>
- **README**: A file providing context and instructions for using a dataset.
  - See <https://cornell.app.box.com/v/ReadmeTemplate> for a README template for datasets.
- **Metadata**: Descriptive information about a dataset that helps users understand and locate it.

---

### [FAIR Data Principles](https://www.go-fair.org/fair-principles/)

- **Findable**: Data is easily located with clear metadata (e.g., a DOI).
- **Accessible**: Data can be accessed under defined conditions.
- **Interoperable**: Data is in a format that can be integrated with other datasets and tools.
- **Reusable**: Data is well documented and available for future research.
- FAIR is an emerging set of practices and a [maturity model](https://datascience.codata.org/articles/10.5334/dsj-2020-041) for data repositories to signal data quality. If the dataset authors or the repository claim the data is FAIR, that's a good sign!

---

### When You Can’t Find the Perfect Data

- **Be flexible**: Adjust your question if necessary.
- **Combine Datasets**: Use multiple sources to create a comprehensive analysis.
- **Use Proxy Variables**: If exact data isn’t available, use approximations.

Note:
- It's common not to find perfect data.
- Be flexible; combining datasets or using proxy variables is a valid approach to dealing with imperfect data.
- Examples include using average zip code income in place of household-level income, or using regional health statistics when individual data is unavailable (a merge sketch follows on the next slide).
- You can discuss the limitations of the data you find in your paper.
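
---

### Example: Adding a Proxy Variable in R

A minimal sketch of the zip-code-income proxy mentioned on the previous slide, assuming two hypothetical files: `respondents.csv` (individual records with a `zip` column) and `zip_median_income.csv` (ZIP-level median income).

```r
# Household-level income isn't available, so use ZIP-level median income
# as a proxy, merged onto the individual records.
people     <- read.csv("respondents.csv")        # includes a 'zip' column
zip_income <- read.csv("zip_median_income.csv")  # columns: zip, median_income

merged <- merge(people, zip_income, by = "zip", all.x = TRUE)

# Respondents whose ZIP didn't match get NA for the proxy;
# report or handle these, and note the proxy's limitations in your paper.
sum(is.na(merged$median_income))
```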

---

### Tools & Resources for Finding Data

- **Library Licensed Resources**: [ICPSR](http://www.icpsr.umich.edu/), [Sage Data](https://data.sagepub.com/), [more ...](https://guides.library.ucla.edu/az.php?t=3849)
  - The Library buys data for use in your research. Licensed data is often of higher quality than what you find on the open web.
- **Curated Lists**: [Data/Stats LibGuides](https://guides.library.ucla.edu/directory/formats) (by area), [UCLA Library data resources](https://guides.library.ucla.edu/az.php?t=3849) (A-Z list)
  - [Ask the DSC](https://datascience.ucla.edu) if you need help navigating these resources.
- **Search for Data**: [Google Dataset Search](https://datasetsearch.research.google.com/), [Data.gov](https://data.gov/), [LA Open Data](https://data.lacity.org/) (municipalities are increasingly offering open data catalogs of city data)

Note:
- Highlight the UCLA Library’s licensed resources, which offer access to a variety of datasets.
- Highlight how curated lists and databases like ICPSR can help them find more specialized data for their projects.

---

### ICPSR (Inter-university Consortium for Political and Social Research)

- <https://www.icpsr.umich.edu/web/pages/>
- You will need to authenticate via UCLA Logon ID to accept the DUA and download data (though many datasets are public).
- Large archive of surveys useful for **secondary analysis** and **quantitative research**.
- Includes studies in politics and elections, education, population research, social indicators, and health.
- Contains data from the United States and many countries worldwide.
- [Publication bibliography](https://www.icpsr.umich.edu/web/pages/ICPSR/citations/) and [variable search](https://www.icpsr.umich.edu/web/pages/ICPSR/ssvd/).
- The gold standard for well-curated social science data.
- Data comes set up for common statistical tools (R, Stata, SPSS), which saves time! (A loading sketch is in the appendix at the end of these notes.)

Note:
- A valuable resource for conducting secondary data analysis, particularly in the social sciences.
- It includes a wide range of studies across different subjects, making it a versatile resource for student research.

---

### Sage Data

- <https://data.sagepub.com/>
- Web-based application with access to **65 billion datasets** from **90+ sources**.
- Includes both **licensed** and **public domain** datasets.
- View data in **side-by-side tables and charts**; create customized visualizations.
- Data is fully standardized, with detailed metadata (up to 37 fields).
- Includes **premium datasets**: China Data Institute, Claritas Consumer Profiles, Woods and Poole, and more.
- Easy to export data and visualizations, with built-in citations for the original source and Data Planet.
- **Limit**: Up to **500 charts/tables** per work product.

Note:
- **Data Planet** is a powerful tool for exploring large datasets across many subject areas.
- It's particularly useful for creating visualizations and analyzing data in a way that's accessible for students working on research projects.
- You can also extract tables for external analysis.
- **UCLA authentication** is required.

---

### Next Steps for Your Data Project

1. Narrow down your research question.
2. Look for articles about similar studies.
   - [ICPSR Bibliography](https://www.icpsr.umich.edu/web/pages/ICPSR/citations/) of data-related publications.
3. Identify three potential datasets.
4. Evaluate dataset quality.
   - Does it have a **codebook** or **README**? A **DOI** and **data citation**? Is it well organized? Does it have setup files for the tool you want to analyze it in?
   - Data cleaning can often take 80% or more of the work. Finding well-described, well-organized data will save you time.
5. Start exploring the data and refining your project.

Note:
- Start by narrowing your questions, identifying datasets, and evaluating their quality before diving deeper into your analysis.

---

## Need Help Finding or Using Data?

* Reach out to the [Library Data Science Center](https://www.library.ucla.edu/visit/locations/data-science-center/)
* Meet with the UCLA DataSquad or the DSC.

---

### UCLA Data Science Center (DSC)

- Provides support for **finding data**, **data analysis**, **visualization**, and **computational research**.
- Offers workshops and training on data tools like R, Python, SQL, and more.
- Facilitates access to datasets and software for research.
- **Contact Us**:
  - **Email**: datascience@ucla.edu
  - **Visit**: [UCLA Data Science Center Website](https://datascience.ucla.edu)
  - **[Schedule a meeting](https://calendly.com/data-science-team)**
  - **Location**: YRL, first floor, in the Collaboration Pod space

---

### UCLA DataSquad

- A team of trained students providing data support to researchers across campus.
- Offers assistance with data cleaning, statistics, visualization, and analysis.
- Collaborates with faculty and students on data-driven projects.
- **Contact Us**:
  - **Visit**: [UCLA DataSquad Website](https://ucla-datasquad.github.io)

---

### Questions?

Contact: [tdennis@library.ucla.edu](mailto:tdennis@library.ucla.edu)

Notes from talk: <https://hackmd.io/@timdennis/H16PoWqCC>
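
---

### Appendix: Loading an ICPSR Download in R

A minimal sketch, not an official recipe: it assumes a hypothetical ICPSR study delivered as a Stata file and uses the `haven` package so the study's variable labels come along with the data. Paths and variable names are placeholders.

```r
# install.packages("haven")   # once, if not already installed
library(haven)

# ICPSR downloads typically include ready-made R, Stata, or SPSS files;
# the path below is a placeholder, so use the path from your own download.
dat <- read_dta("ICPSR_00000/DS0001/data.dta")   # read_sav() for SPSS files

head(dat)
attr(dat$V1, "label")   # the study's label for a (hypothetical) variable V1
```

Using the labelled files (rather than re-typing everything from a CSV) preserves the documentation the codebook describes and cuts down on cleaning time.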