Introduction to OpenRefine
2022-01-18
Chapter 1 Welcome
Got messy spreadsheets? OpenRefine is a powerful, free, open-source software tool for cleaning and transforming data in a way that is easy to reproduce. This hands-on class is aimed at people who need to clean messy data, such as spreadsheets of survey responses, patient encounters, financial records, or workshop attendance. Together we will work through the basics of cleaning data in OpenRefine and then cover some more advanced techniques, including pulling in additional data from an API. If you want something more powerful than Excel but don’t want to spend the time to learn a programming language like R or Python, OpenRefine could be the perfect tool for you!
1.1 Pre-Class Prep
Before joining the workshop, please complete the following activities:
- Download OpenRefine
  - For Windows machines, we recommend the “Windows Kit with Embedded Java”
  - For Mac machines, you will likely need to allow OpenRefine to open by going to System Preferences > Security and Privacy > General and clicking “Open Anyway”
- Download the class data
1.2 Learning Objectives
By the end of the class, learners should be able to:
- Explain how OpenRefine works on their computer
- Use OpenRefine to:
  - Split data into multiple columns
  - Facet data to find typos and errors
  - Cluster data to easily correct typos at scale
  - Pull in additional data from an API
- Export their cleaned data in a variety of formats
- Save their cleaning scripts so they can be re-used
1.3 Class Data
The data for this class was pulled from the National Library of Medicine Dietary Supplements Label Database. The version we are using has been intentionally “dirtied” to introduce errors and typos.
1.4 About the Instructor
Ariel Deardorff is the Data Services Librarian at UCSF and a member of the Library’s Data Science Initiative team. She teaches classes and does research on data management, open science, and reproducibility in the health sciences. For questions about this course or other courses, email ariel.deardorff@ucsf.edu or visit the Data Science Initiative website.
This work is licensed under a Creative Commons Attribution 4.0 International License.