# Data cleaning
## Data cleaning issues
---
Text datasets can be messy! They often need to be cleaned up, edited, transformed, and structured. Data cleaning is a time-consuming process, but the benefits of working with clean, structured data make it worth it. Here are some of the most common issues that scholars, students, and developers deal with when cleaning texts and datasets:

---
- OCR correction and cleanup
- reconciling conflicts in values/expressions (for example date formats)
- character encoding (especially for multilingual content)
- converting geographical and spatiotemporal data (e.g. place names, dates) into machine-readable formats
- standardizing abbreviations, acronyms, spellings
- splitting/merging texts into smaller units or larger datasets for analysis
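
As one illustration of reconciling conflicting values, the sketch below converts dates written in several styles into a single machine-readable form (ISO 8601, `YYYY-MM-DD`). The format list and the name `to_iso_date` are assumptions made for this example; it also assumes English month names and that slash-separated dates are day-first, which you would need to confirm against your own data.

```python
from datetime import datetime

# Hypothetical list of formats expected in the source data; extend as needed.
# Assumption: slash-separated dates are day/month/year.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %B %Y"]


def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None  # leave unparseable values for manual review


dates = ["1848-03-15", "15/03/1848", "March 15, 1848", "15 March 1848", "mid-March 1848"]
cleaned = [to_iso_date(d) for d in dates]
print(cleaned)
```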
---
## Tidying up and preparing your data
- use UTF-8 encoding
- be consistent in formats/expressions of place/time (dates, locations)
- test your data, or samples of it, that feature alternate names, non-Roman characters, or different languages
- remove characters that are not useful in your analysis (HTML tags, "stopwords")
- use data cleaning software tools to speed up the process
- check that you have the right dataset for your research question or hypothesis (is anything left out? why?)
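
A few of the steps above can be sketched with Python's standard library alone. This is a minimal illustration, not a full cleaning pipeline: the function names, the tiny stopword list, and the sample string are all made up for the example, and the regular expression is only a rough way to strip tags (a real HTML parser is more robust on messy markup).

```python
import html
import re
import unicodedata

# Tiny illustrative stopword list; real projects usually need a fuller,
# language-specific one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to"}

TAG_RE = re.compile(r"<[^>]+>")  # rough match for HTML tags


def reencode_to_utf8(raw, source_encoding):
    """Convert bytes in a known legacy encoding to UTF-8 bytes."""
    return raw.decode(source_encoding).encode("utf-8")


def tidy(text):
    """Strip tags, decode HTML entities, normalize Unicode, collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    text = unicodedata.normalize("NFC", text)  # one consistent form for accented characters
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


# "e" + combining accent and an HTML entity, as often seen in scraped text
sample = "<p>The cafe\u0301 in   Z&uuml;rich</p>"
clean = tidy(sample)
tokens = remove_stopwords(clean)
print(clean)
print(tokens)
```

Running a sample like this one, with accented and non-Roman characters, is a quick way to check that each cleaning step keeps the text you care about.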