Data Processing
“Raw data, like raw potatoes, usually require cleaning before use.”
— Ronald A. Thisted, Professor Emeritus, Departments of Statistics and Public Health Sciences, The University of Chicago
“Happy families are all alike; every unhappy family is unhappy in its own way.”
— Leo Tolstoy, Russian author known as one of the world’s greatest novelists.
“Tidy datasets are all alike, but every messy dataset is messy in its own way.”
— Hadley Wickham, Chief Scientist at Posit (formerly RStudio)
Good data organization and formatting are the foundation of any data project. Once the data are collected, you will need to process them so they can be used in your analyses or product development steps.
It’s commonly understood in the data science field that 80% of data analysis is spent on cleaning and preparing the data (Dasu and Johnson 2003). Expect the same to be true for your project, and prepare to invest time and resources accordingly.
The Resources section below links to resources that provide detailed technical guidance on how to process data using data science methods. Here, we discuss why and how to approach this step with an equity lens.
When we talk about data processing here, we’re referring to the steps required to organize, format, and clean the data so it’s more efficient to use in future steps, and so others can consistently reproduce your work. This is commonly referred to as tidying data.
Why Tidy Data?
From a data science lens: investing the time to properly tidy your data upfront makes the analysis and product development steps much more efficient and reproducible. This cuts down the long-term time needed to complete those steps (even if they need to evolve or change over time!), enabling you to spend more time on the actual data, science, management, and equity questions you have for your project (aka the good stuff!).
From an equity lens: having consistent and reproducible data makes it easy for others to replicate your work, which can increase the transparency of your process and ultimately help build trust with your partners, the communities impacted by the management decisions associated with your project, and the public at large.
Additionally, working from a foundation of tidy data (and using consistent tools) can make it easier for others to collaborate with and contribute to your data project, which generally results in a better product in the end.
Processing Data with an Equity Lens
Keep your “raw data raw”: It’s important to keep (and back up) a version of your data that is untouched by your process (i.e., raw). We also recommend backing up a version of your data that is tidied and cleaned (i.e., what you will use for analysis). Not only does doing so protect you if your working version of the data are lost for some reason (e.g., a computer crash or freak system failure), but it also helps with the reproducibility and transparency of your process.
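As a minimal sketch of this pattern in Python with pandas (the file paths and names here are hypothetical), all cleaning happens on a copy while the raw file is never overwritten:

```python
import pandas as pd

# Hypothetical paths: keep the raw file in its own (ideally read-only) location
RAW_PATH = "data/raw/survey_responses_raw.csv"        # never overwritten
TIDY_PATH = "data/processed/survey_responses_tidy.csv"

# Read the raw data without modifying the source file
raw = pd.read_csv(RAW_PATH)

# All cleaning and tidying happens on a copy; the raw file stays untouched
tidy = raw.copy()
# ... cleaning / tidying steps go here ...

# Write the cleaned version to a separate location (and back up both files)
tidy.to_csv(TIDY_PATH, index=False)
```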
Remove duplicate or irrelevant data: When you’re processing your data and making it tidy, think through which data from your sources are applicable to your question(s). Sometimes we don’t have the technical or logistical capacity to keep ALL of the data we are able to collect, so if you are going to remove data at this step, be sure to document WHY you’re removing it in a way that makes the reasoning behind your data decisions easy for others to understand (see the sketch after this list). Some reasons might include:
- Data are outside of the geographic scope of the analysis (e.g. we’re focused on a particular region for the project, and we are removing data outside of that region so we can increase loading/analysis speeds)
- Fields/columns from the source data are not relevant to our analysis (with the specific reasons documented).
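A sketch of how these removals might look in pandas (the file name, column names, and region filter are all hypothetical), with the reasoning documented alongside the code:

```python
import pandas as pd

df = pd.read_csv("data/raw/monitoring_sites_raw.csv")  # hypothetical file

# Remove exact duplicate records (e.g., from overlapping source exports)
df = df.drop_duplicates()

# WHY: the project is scoped to a single region, so we drop records outside
# it to increase loading/analysis speeds (hypothetical 'region' column)
df = df[df["region"] == "North Coast"]

# WHY: these source columns are not relevant to our analysis questions
# (hypothetical column names)
df = df.drop(columns=["internal_id", "legacy_notes"])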
Transform your data into a tidy structure: Here you will reformat or restructure your dataset so it can be used in your future analyses or product development steps in a way that is efficient, effective, and meaningful for your project’s objectives. A common data transformation at this phase is converting a dataset from a wide format to a long format, as shown below.
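For example, a wide table with one column per year can be reshaped into a long format with one row per observation using pandas (the site/year/flow column names here are hypothetical):

```python
import pandas as pd

# Wide format: one column per year
wide = pd.DataFrame({
    "site": ["A", "B"],
    "2021": [4.2, 3.9],
    "2022": [4.5, 4.1],
})

# Long format: one row per site-year observation
long = wide.melt(id_vars="site", var_name="year", value_name="flow_cfs")
print(long)
#   site  year  flow_cfs
# 0    A  2021       4.2
# 1    B  2021       3.9
# 2    A  2022       4.5
# 3    B  2022       4.1
```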
Explore your data, but don’t analyze it: Once duplicate and irrelevant data are removed from the dataset, you may want to explore your data using simple visualizations or statistics (see the Data Exploration Checklist for ideas). This will help you find extreme outliers in your data, or otherwise incorrect or missing data that should be removed, separated, or corrected before you begin your data analysis phase. As when you removed duplicate or irrelevant data, you will want to document WHY you’re removing or correcting these data in a way that makes the reasoning behind your data decisions easy for others to understand.
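A few quick, non-analytical checks in pandas might look like the sketch below (the file and the 'site' column are hypothetical):

```python
import pandas as pd

df = pd.read_csv("data/processed/monitoring_sites_tidy.csv")  # hypothetical file

# Summary statistics help surface extreme outliers or impossible values
print(df.describe())

# Count missing values per column to find gaps that need documenting
print(df.isna().sum())

# Check categorical fields for unexpected or misspelled entries
print(df["site"].value_counts(dropna=False))
```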
Any aggregation or preliminary analysis of your data should be completed during the analysis phase of the project. Keeping the values and aggregation of the data as they are at this phase increases reproducibility and transparency now and during future steps. Note that this is different from transforming data as described above (e.g. restructuring from wide to long format), which is a common and very useful part of the data processing phase.
You might be tempted to begin your data quality assessments as you’re looking at the data during this phase, but hold off for now. You’ll complete your quality assurance steps after all of your data have been tidied and you can look at all of your data holistically.
Resources
- Julia Lowndes and Allison Horst (2020). Tidy Data for reproducibility, efficiency, and collaboration. Openscapes blog.
- College of Water Informatics Data Management Handbook - Collect and Process Section
- Hadley Wickham (2014). Tidy Data. Journal of Statistical Software 59 (10).
- Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. R for Data Science (2e) - Data Tidying Chapter
- The Py4DS Community. Python for Data Science - Tidy Data Chapter
- Karl W. Broman and Kara H. Woo (2018). Data Organization in Spreadsheets. The American Statistician 72 (1). Available open access as a PeerJ preprint.