Keeping Your Data So Fresh and So Clean
A Playlist for Squeaky Clean Data
I have been consuming lots of data science and analysis content, and one of my major takeaways is that no model will yield actionable insights if you feed it garbage data. One of my instructors holds that the first thing newbies need to learn is the GIGO principle. He would put it more succinctly: it is better to spend 80% of your time cleaning your data than to build a shiny model and feed it garbage.
The GIGO (Garbage In, Garbage Out) principle holds that models, like humans, are what they consume. If you feed your model unrefined data, you can expect your metrics to suffer from the inaccuracies that compound along your data pipeline.
With this in mind, I created a playlist to keep my mind focused while cleaning data and thought that you might find it helpful if you are new to digging into data like I am.
Let’s start with data in the .csv file format, and assume we have already imported all of the libraries and visualization packages necessary to build a model.
My typical first step is to use pd.read_csv to create a data frame called df.
df = pd.read_csv('myfile.csv', index_col = 0)
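Here is a minimal sketch of that loading step that you can run without a file on disk. The CSV text and its column names (id, artist, plays) are made up for illustration and stand in for 'myfile.csv'.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'myfile.csv' so the example is self-contained.
csv_text = """id,artist,plays
1,Jill Scott,120
2,Sean Paul,95
3,Outkast,210
"""

# index_col=0 tells pandas to use the first column ("id") as the row index
# instead of generating a fresh 0..n-1 index.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df.index.name)     # the column that became the index
print(list(df.columns))  # the remaining data columns
```

If your file does not have an ID-like first column, leaving out index_col is probably the safer choice, since otherwise a real data column gets turned into the index.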
After I have an initial data frame, I clean the data. The first song on my list, which I am typically listening to while reading the .csv file and creating my initial data frame, is:
Jill Scott’s Golden
Cleaning Data
To clean data, I often spend lots of time:
- Removing Duplicates — duplicates take up space, can slow the implementation of models, and may lead to erroneous insights.
- Replacing Missing Values — while there is no optimal way to deal with missing values, learning how to handle missing values is a skill that budding data nerds should work on building up.
- Replacing Placeholder Values — placeholders like ‘—’, NaNs, ‘?’, and ‘#’ consistently gum up the works and lead to errors both seen and unseen in analysis. If you are lucky, error messages abound; if you are supremely unlucky, you only notice after trying to visualize the correlation between dependent and independent variables and getting noticeably improbable graphs.
- Casting Datatypes to More Appropriate Datatypes — often, this shows up when I have naively been treating a categorical variable (think something with a fixed number of values) as though it were continuous (something that can take any value between a minimum and a maximum).
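The last item in the list above, casting a column to a more appropriate datatype, can be sketched as follows. The data frame and its bedrooms and price columns are hypothetical; the point is only that a numeric-looking column with a handful of fixed values can be stored as a pandas category.

```python
import pandas as pd

# Hypothetical data: "bedrooms" is stored as integers but only takes a few fixed values.
df = pd.DataFrame({
    "bedrooms": [1, 2, 2, 3, 1],
    "price": [100.0, 150.0, 155.0, 210.0, 98.0],
})

# Cast the column to the "category" dtype so later steps treat it as categorical
# rather than as a continuous number.
df["bedrooms"] = df["bedrooms"].astype("category")

print(df.dtypes)
print(df["bedrooms"].cat.categories.tolist())  # the distinct category values
```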
Where do I begin?
I look at the shape of the data frame first, because I need to know how many rows and columns I am working with and need to wrangle into submission. As I jam to the legendary vocalist from Philadelphia, Pennsylvania, I use the following:
- df.shape — Review the dimensions of the data
- df.info() — Review a concise summary of the data
- df.head() — Review the first five rows of the data (you can increase this number by adding a number n inside of the parentheses; for example, df.head(10) will yield the first ten rows of the data)
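To see these three inspection calls side by side, here is a small sketch on a made-up data frame with twelve rows. The column names track_id and plays are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data frame with 12 rows and 2 columns.
df = pd.DataFrame({
    "track_id": range(1, 13),
    "plays": range(10, 130, 10),
})

rows, cols = df.shape    # shape is an attribute (no parentheses): (rows, columns)
df.info()                # prints column names, dtypes, and non-null counts
first_five = df.head()   # default is the first five rows
first_ten = df.head(10)  # pass n to get the first n rows

print(rows, cols)
print(first_five)
```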
Onto the Next
After I have learned the basic features of the data, I check the entire data frame for placeholders, often with something that looks like the following:
df.isin(['?', '#', 'NaN', 'null', 'N/A', '-']).any()
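Here is a sketch of what that check returns and one way to act on it. The data frame below is made up, and swapping the placeholders for NaN with df.replace is my own suggested follow-up rather than the only option.

```python
import numpy as np
import pandas as pd

placeholders = ["?", "#", "NaN", "null", "N/A", "-"]

# Hypothetical data with placeholder strings hiding in two columns.
df = pd.DataFrame({
    "age": ["34", "?", "29", "41"],
    "city": ["Nassau", "Freeport", "N/A", "Nassau"],
    "score": [88, 92, 75, 60],
})

# One True/False per column: does the column contain any placeholder?
has_placeholder = df.isin(placeholders).any()

# How many placeholder cells each column contains.
placeholder_counts = df.isin(placeholders).sum()

# Swap every placeholder for a real missing value so pandas can handle it.
cleaned = df.replace(placeholders, np.nan)

print(has_placeholder)
print(placeholder_counts)
```

One thing to keep in mind: with its default settings, pd.read_csv already converts strings such as 'NaN', 'null', and 'N/A' into real missing values while loading, so those may never show up in an isin check. Running df.isna().any() alongside it should catch them.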
This is typically when I start listening to old-school reggae artists like Baby Cham, Capleton, and Sean Paul, who bring back the nostalgia of my high school projects in the physics lab with Mr. Green, who had the best rhythmic reggae soundtrack in the late 90s on the East Side of New Providence, bar none.
Reviewing the Changes
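After identifying placeholders, I convert the affected column to numeric so that anything that cannot be parsed as a number becomes a missing value. I use steps like this while jamming to old-school reggae: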
df['column_with_placeholder'] = pd.to_numeric(df['column_with_placeholder'], errors = "coerce")
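A small sketch of what errors="coerce" does in practice. The values in the column are made up; the column name is the one from the line above.

```python
import pandas as pd

# Hypothetical column where numbers were read in as text because of placeholders.
df = pd.DataFrame({"column_with_placeholder": ["10", "?", "32.5", "#"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
# instead of raising an error.
df["column_with_placeholder"] = pd.to_numeric(
    df["column_with_placeholder"], errors="coerce"
)

n_missing = int(df["column_with_placeholder"].isna().sum())

print(df["column_with_placeholder"].dtype)
print(n_missing)
```

Because coercion silently turns every unparseable value into NaN, it is worth comparing the missing-value count before and after, in case a typo like "1O" gets swept up with the placeholders.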
Keeping the Groove Going
I typically transition to Dirty South jams while removing duplicates and deciding whether to drop rows with null values from the data frame or replace those values with the median or mode. My favorite artists in that genre are from the Dungeon Family.
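Here is a rough sketch of those choices on a made-up data frame: removing duplicate rows, then either dropping rows with nulls or filling them in. Using the median for the numeric column and the mode for the text column is an assumption on my part about which goes where.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0 and 1 are duplicates, and two rows have a null.
df = pd.DataFrame({
    "plays": [10.0, 10.0, np.nan, 30.0, 50.0],
    "genre": ["soul", "soul", "reggae", None, "reggae"],
})

# Keep the first copy of each fully identical row.
deduped = df.drop_duplicates()

# Option 1: drop every row that has at least one null value.
dropped = deduped.dropna()

# Option 2: keep the rows and fill the gaps instead.
filled = deduped.copy()
filled["plays"] = filled["plays"].fillna(filled["plays"].median())   # numeric: median
filled["genre"] = filled["genre"].fillna(filled["genre"].mode()[0])  # text: most common value

print(len(df), len(deduped), len(dropped))
print(filled)
```

Dropping is simpler but throws away whole rows, while filling keeps the rows at the cost of inventing values, so which one fits depends on how much data you can afford to lose.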
My playlist typically looks like this:
- Jill Scott — Golden
- Baby Cham featuring Alicia Keys — Ghetto Story
- Wayne Wonder — No Letting Go
- Sean Paul — Temperature
- Beenie Man — Romie
- Dungeon Family — Excalibur
- Erykah Badu — On and On
- Outkast — Hey Ya!
- Outkast — Rosa Parks
- Lauryn Hill — Ex Factor
What’s my next step?
I would like to build more flexibility and variety into my playlist:
- How do I use data science to create a better playlist?
I was thinking of leveraging the YouTube API, but I am still mulling this over, so for right now I just add songs based on my mood.
- I am currently working on a trip-planning tool for my upcoming trip to the Bahamas to help celebrate our 50th year of Independence.
Contact Me
If you would like to be updated with my latest articles, follow me on Medium. You can also connect with me on LinkedIn or email me at tenicka.norwood@gmail.com.