# Data cleaning
## Data cleaning issues
Text datasets can be messy! They often need to be cleaned up, edited, transformed, and structured. Data cleaning is a time-consuming process, but the benefits of working with clean, structured data make it worth it. Here are some of the most common issues scholars, students, and developers deal with when cleaning texts and datasets:
- OCR correction and cleaning up
- reconciling conflicts in values/expressions (for example date formats)
- character encoding (especially for multilingual content)
- converting geographical and spatiotemporal data (e.g. place names, dates) into machine-readable formats
- standardizing abbreviations, acronyms, spellings
- splitting/merging texts into smaller units or larger datasets for analysis
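
As an illustration of reconciling conflicting date formats, here is a minimal Python sketch that rewrites several spellings of the same date as ISO 8601 (`YYYY-MM-DD`). The list of formats, the sample values, and the helper name `to_iso_date` are assumptions made for this example, not part of any particular dataset; it also assumes day-first order for slash-separated dates and English month names.

```python
from datetime import datetime

# Assumed input formats; extend this tuple to match your own data.
# "%d/%m/%Y" treats slash-separated dates as day-first (an assumption).
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %B %Y")

def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None

# Hypothetical sample values showing one date written four ways.
dates = ["1968-05-13", "13/05/1968", "May 13, 1968", "13 May 1968", "spring 1968"]
cleaned = [to_iso_date(d) for d in dates]
print(cleaned)
```

Values that match none of the formats come back as `None`, so they can be reviewed by hand instead of being silently guessed.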
## Tidying up and preparing your data
- use UTF-8 encoding
- be consistent in formats/expressions of place/time (dates, locations)
- test your data or parts of your data featuring alternate names, non-Roman characters, different languages
- remove characters that are not useful in your analysis (HTML tags, "stopwords")
- use data cleaning software tools to speed up the process
- check that you have the right dataset for your research question or hypothesis (is anything left out? why?)
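
Two of the steps above can be sketched in a few lines of Python: normalizing Unicode so that visually identical strings compare as equal, and removing stopwords. The stopword list here is a tiny illustrative set and the function names are hypothetical; a real project would likely use a fuller list suited to its language and research question.

```python
import unicodedata

# Tiny illustrative stopword list (an assumption for this example).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to"}

def normalize_text(text):
    """Put text into Unicode NFC form so equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)

def remove_stopwords(text):
    """Drop stopwords, matching case-insensitively on whitespace-separated tokens."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

composed = "Montr\u00e9al"     # "é" stored as a single code point
decomposed = "Montre\u0301al"  # "e" followed by a combining acute accent

# The two strings look the same on screen but differ until normalized.
same_before = composed == decomposed
same_after = normalize_text(composed) == normalize_text(decomposed)
print(same_before, same_after)

print(remove_stopwords("The history of the city in the 1960s"))
```

When reading files, passing `encoding="utf-8"` to Python's built-in `open()` makes the expected encoding explicit instead of relying on the platform default.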
## Text / data cleaning tools
### Open Refine
Open Refine is a popular and comprehensive tool for data cleaning. You can use Open Refine to manipulate texts and text excerpts in spreadsheet format by replacing words, phrases, and characters, removing gaps, or merging and combining values. It can save a great deal of time when cleaning, transforming, and reconciling data. There are installation kits for Windows, Mac OS X, and Linux available for [download](https://openrefine.org/download.html).
Open Refine is widely used in digital humanities projects as part of the data preparation and processing workflow. Thomas Padilla has written a great guide to Open Refine with a digital humanities focus and flavor:
http://thomaspadilla.org/dataprep/
### Text Cleanr
[Text Cleanr](http://www.textcleanr.com/) is a simple, intuitive tool for cleaning and preparing texts for analysis. You can try it here:
http://www.textcleanr.com/
Task:
* Go to: [Homage to What? by Peter McGregor](http://library.nothingness.org/articles/SI/en/display/272)
* Right click on the page and select "View source".
You should now see the page's raw HTML source, including its tags.

* Copy all of the text and paste it into Cleanr's textbox.
* Find the HTML option menu and select "Remove." Set any other variables or filters you wish for the text.
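
For comparison, the same tag-removal step can be sketched in Python using only the standard library's `html.parser` module. This is not how Text Cleanr itself works; the class name `TagStripper`, the helper `strip_html`, and the sample markup are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect text content, skipping tags and <script>/<style> bodies."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)

def strip_html(markup):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    # Join chunks with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())

# Hypothetical sample markup, standing in for a page's "View source" text.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Some <b>bold</b> &amp; plain text.</p></body></html>"
)
print(strip_html(sample))
```

Because chunks are joined with spaces, punctuation that directly follows an inline tag may pick up an extra space, so the output is worth a quick check before analysis.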