3 Steps to Consider BEFORE Deciding to Impute Missing Data
Your datasets will more often than not contain missing values. Every data analyst, regardless of experience, has to deal with this. Missing data becomes especially problematic when fitting algorithm-based models, since most models will fail when missing values are present. As a data analyst, you have many techniques for imputing missing data at your disposal.
This article is NOT about which imputation technique you should use. Instead, it offers three tips for understanding when and where it is appropriate to apply these techniques: 1) understand why values are missing, 2) assess how many data points are missing and decide whether to impute or toss out data, and 3) understand how you will use the dataset in your analysis.
Tip 1: Understand why the data is missing
The first step in deciding whether to impute missing values is to understand why the data is missing. There are three main reasons. First, data may be unknown, unavailable, or unreported during the data collection period. This can happen, for example, when a survey respondent does not answer every question or a machine breakdown temporarily prevents an organization from collecting data. You will often find such gaps recorded in the raw data as an empty cell, ‘NA’, ‘N/A’, or ‘NaN’ value, depending on the software that generated or is reading the data. Second, data may be removed or withheld, for example when personal information is stripped from a dataset for privacy reasons, or when the data was never collected or is not available for public release. Data documentation usually describes these missing values so that users know what to expect. Third, missing data may result from human error, such as a typo or a misclassification. These mistakes usually remain in the data, and it is up to the data analyst to decide what to do with them. Sometimes a flag or data notation indicates that a data point may not be accurate.
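As a minimal sketch of that first case, the pandas snippet below (assuming a hypothetical file name, survey_responses.csv, and that the export mixes blank cells with ‘NA’, ‘N/A’, and ‘NaN’ markers) normalizes those markers into true missing values before any other work:

```python
import pandas as pd

# Hypothetical export: blank cells, 'NA', 'N/A', and 'NaN' all mean "not collected".
raw = pd.read_csv(
    "survey_responses.csv",           # assumed file name
    na_values=["NA", "N/A", "NaN"],   # treat each marker as a missing value
    keep_default_na=True,             # blank cells are already handled by default
)

# Count missing values per column before deciding how to handle them.
print(raw.isna().sum())
```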
One of the best datasets I have come across in my career for showing the importance of understanding your data is the Quarterly Census of Employment and Wages (QCEW) from the U.S. Bureau of Labor Statistics (BLS) (https://www.bls.gov/cew/). It illustrates each of the points above. I will start with a subset of this dataset: 2021 construction employment data for Boone County, Nebraska, a sparsely populated rural county.
Table 1 above shows the total number of construction establishments and total construction employment in the county, along with the same data for construction sub-sectors at the 4-digit NAICS level. In Boone County, 25 construction businesses employed 90 people. The QCEW reports the number of establishments at every NAICS level. However, the BLS is required to suppress any other data (the number of employees, total establishment payroll, and total taxes paid by the establishment) that might allow a data analyst to identify an individual business. A disclosure code of ‘N’ tells the analyst that the data has been suppressed. The employment value reported for suppressed records (‘0’) is misleading, because every reported establishment has, by definition, at least one employee. These ‘0’ values should be replaced with ‘NA’ before proceeding with further data analysis.
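A minimal pandas sketch of that replacement, assuming hypothetical file and column names (disclosure_code and employment) for the county extract:

```python
import numpy as np
import pandas as pd

# Assumed file and column names for a QCEW county extract; adjust to your download.
qcew = pd.read_csv("qcew_boone_2021.csv")

# Where the disclosure code is 'N', the reported zeros are suppressed values,
# not real counts, so convert them to missing before any further analysis.
suppressed = qcew["disclosure_code"].eq("N")
qcew.loc[suppressed, "employment"] = np.nan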
There is another data quirk to grasp. Table 1 reports establishments in seven construction sub-sectors, but there are in fact ten 4-digit construction sub-sectors in the QCEW data, shown in the table below. Like the suppressed values, the ‘NA’s generated when expanding the dataset to all ten sub-sectors are also misleading, though not for the reasons discussed above. They are misleading because those sub-sectors have neither establishments nor employees in the county, so the ‘NA’s actually represent zero values. Table 3 (below) shows how the corrected data should look.
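One way to sketch that correction in pandas, assuming a hypothetical extract keyed by a 4-digit industry_code with establishments and employment columns (the code list is the standard set of 4-digit construction NAICS codes):

```python
import pandas as pd

# The ten 4-digit construction NAICS sub-sectors.
all_subsectors = ["2361", "2362", "2371", "2372", "2373",
                  "2379", "2381", "2382", "2383", "2389"]

# Assumed county extract with one row per reported sub-sector.
county = pd.read_csv("qcew_boone_2021.csv", dtype={"industry_code": str})
county = county.set_index("industry_code")

# Reindexing adds NA rows for sub-sectors with no activity in the county.
expanded = county.reindex(all_subsectors)

# Only the newly added rows are true zeros; rows suppressed with code 'N'
# must stay missing, so fill zeros only where no row was reported at all.
added = ~expanded.index.isin(county.index)
expanded.loc[added, ["establishments", "employment"]] = 0
```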
Tip 2: Think about your data before deciding whether to impute or toss it out
Depending on the size of your dataset, some scenarios favor imputing missing values and others favor discarding them. The larger the share of missing values, the harder it is to make assumptions about the population based on imputed values. As a general rule, if the number of missing values is small relative to the sample size (less than 5%), you may opt to impute. If a larger proportion of data points is missing (greater than 20%), it usually makes more sense to discard the data.
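As a rough sketch of how that rule of thumb can be applied to a single column (the thresholds come from the text; the toy values are only illustrative):

```python
import numpy as np
import pandas as pd

def imputation_decision(series: pd.Series) -> str:
    """Apply the rough 5% / 20% rule of thumb to one column."""
    share = series.isna().mean()
    if share < 0.05:
        return f"{share:.1%} missing: small enough to impute"
    if share > 0.20:
        return f"{share:.1%} missing: consider discarding"
    return f"{share:.1%} missing: judgment call"

# Toy column with two missing values out of eight.
employment = pd.Series([12, 45, np.nan, 30, 22, np.nan, 18, 40])
print(imputation_decision(employment))   # 25.0% missing: consider discarding
```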
I will focus on nonresidential building construction (NAICS 2362) from Tip 1 to explain why it is important to understand the data when deciding whether to impute. One way of imputing missing employment for this sub-sector would be to query the sub-sector’s establishments and wages and use them to estimate employees per business. If we use the raw QCEW data for all U.S. counties as reported, there appear to be no missing values because, as you will recall, the missing values are recorded as ‘0’s in the raw data. Chart 1 shows that after correcting this by replacing the Number of Employees with ‘NA’ wherever the Disclosure Code equals ‘N’, 43 percent of counties reporting businesses in this sub-sector have suppressed data.
But we still do not have the full picture of counties with missing data. The dataset does not include counties with no establishments or jobs in this sub-sector. We need to add these 1,905 missing counties back into the dataset to get the true number of counties with missing employment values. After making these changes, Chart 2 shows that the true share of missing values in this sub-sector is 35 percent, and we would still likely decide to exclude it from any further modeling. However, there could be other sub-sectors in the dataset which, once the data is corrected, give us enough confidence to impute the missing values.
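A rough pandas sketch of both corrections, assuming hypothetical file and column names (a QCEW extract for NAICS 2362 keyed by area_fips, plus a reference list of every U.S. county):

```python
import numpy as np
import pandas as pd

# Assumed inputs: the NAICS 2362 rows QCEW actually reports, plus a full county list.
reported = pd.read_csv("qcew_naics2362_2021.csv", dtype={"area_fips": str})
counties = pd.read_csv("us_counties.csv", dtype={"area_fips": str})

# Step 1: suppressed zeros become missing values.
reported.loc[reported["disclosure_code"].eq("N"), "employment"] = np.nan

# Step 2: counties absent from the extract have no establishments at all,
# so their employment is a genuine zero, not a missing value.
full = counties.merge(reported, on="area_fips", how="left", indicator=True)
full.loc[full["_merge"].eq("left_only"), "employment"] = 0

# Step 3: the share of counties whose employment is truly unknown.
print(f"{full['employment'].isna().mean():.0%} of counties have suppressed employment")
```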
Once again, this discussion highlights the need for a thorough understanding of your data before jumping directly into imputing missing values.
Tip 3: Understand how you will use the data
When deciding whether to impute, understanding why values are missing matters here too. Why was the data not collected? Is it the result of negligence or oversight? Was its acquisition simply too expensive? Or was there some technical reason the data could not be collected? Knowing what caused the missing data in the first place can help you make an informed decision about how to address it in your analysis.
Also consider how you will use the data after imputation. Do you plan to use the data for modeling or visualization? If so, it is essential to understand how much of an effect missing values have on your model output or chart. If the impact of the missing values is negligible, then imputing them may make sense. If you expect the missing values to have a large impact on your results, then throwing out the data may be the better option.
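A toy sensitivity check along those lines (the values and the choice of median imputation are purely illustrative) is to compare the statistic you plan to report with the missing values dropped versus imputed:

```python
import numpy as np
import pandas as pd

# Illustrative weekly wage column with two missing values.
wages = pd.Series([620, 540, np.nan, 700, 580, np.nan, 660, 610])

dropped = wages.dropna().mean()                 # estimate ignoring missing values
imputed = wages.fillna(wages.median()).mean()   # estimate after median imputation

print(f"mean with rows dropped:   {dropped:.1f}")
print(f"mean with median imputed: {imputed:.1f}")
print(f"difference:               {abs(dropped - imputed):.1f}")
```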
Finally, think about the other information reported along with the missing data. Is it important to include in your model? In tabular data, for example, do some rows (observations) have mostly missing values? If so, you may decide to toss those rows and then re-evaluate the remaining missing values to determine whether imputation now makes sense. In my experience, these decisions are more of an art grounded in your knowledge of the dataset than a strict scientific procedure.
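A small sketch of that row-level check in pandas, using a made-up table in which some rows are mostly empty:

```python
import numpy as np
import pandas as pd

# Made-up table in which some rows are mostly missing.
df = pd.DataFrame({
    "establishments": [12, 3, np.nan, 8, np.nan],
    "employment":     [90, np.nan, np.nan, 55, np.nan],
    "total_wages":    [4.1, np.nan, np.nan, 2.8, np.nan],
})

print(f"missing before: {df.isna().mean().mean():.0%}")

# Keep only rows with at least two non-missing values, then re-check whether
# the remaining gaps are small enough to impute.
trimmed = df.dropna(thresh=2)
print(f"missing after:  {trimmed.isna().mean().mean():.0%}")
```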
Ultimately, understanding how you will use the data and why it is missing will help determine whether imputation or discarding is the right course of action.