Chapter 1 Welcome

Got messy spreadsheets? OpenRefine is a powerful, free, open-source software tool for cleaning and transforming data in a way that is easy to reproduce. This hands-on class is targeted at people who need to clean messy data, including spreadsheets of survey responses, patient encounters, financial records, or workshop attendance. Together we will work through the basics of cleaning data in OpenRefine and also go over some more advanced techniques including pulling in additional data from an API. If you want something more powerful than Excel but don’t want to spend the time to learn a programming language like R or Python, OpenRefine could be the perfect tool for you!

1.1 Pre-Class Prep

Before joining the workshop, please complete the following activities:

  • Download OpenRefine
    • For Windows machines we recommend the “Windows Kit with Embedded Java”
    • For Mac machines you will likely need to allow OpenRefine to open by going to System Preferences > Security and Privacy > General and clicking “Open Anyway”
  • Download the class data

1.2 Learning Objectives

By the end of the class, learners should be able to:

  • Explain how OpenRefine works on their computer
  • Use OpenRefine to:
    • Split data into multiple columns
    • Facet data to find typos and errors
    • Cluster data to easily correct typos at scale
    • Pull in additional data from an API
  • Export their cleaned data in a variety of formats
  • Save their cleaning scripts so they can be re-used

1.3 Class Data

The data for this class was pulled from the National Library of Medicine Dietary Supplements Label Database. The version we are using has been intentionally “dirtied” to introduce errors and typos.

1.4 About the Instructor

Ariel Deardorff is the Data Services Librarian at UCSF and a member of the Library’s Data Science Initiative team. She teaches classes and does research on data management, open science, and reproducibility in the health sciences. For questions about this course or other courses, email her or visit the Data Science Initiative website.

This work is licensed under a Creative Commons Attribution 4.0 International License.