Surveys are a useful way to measure some of our attitudes (see this post and this one, where I explore public views on foreign policy and different preferences).
However, surveys are completed by people, and some people, whether out of impatience, unwillingness, or for some other reason, do not answer all the questions in a survey. That creates a problem for analysis because if our data are not complete, the results may be biased.
How to tackle missing data?
There are various options, including dropping the missing observations or using the average value among the non-missing data points to fill in the missing observations. However, a more robust approach is multiple imputation. The idea is to create multiple copies of the data, filling in replacements for the missing values based on an imputation model (if you really want to understand it, check this paper). Then, we run our analysis on each completed dataset and combine the results to get pooled estimates (and standard errors).
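To make that workflow concrete, here is a minimal sketch using the mice package (which I use below). The data frame `dat` and the model formula are placeholders rather than the actual data or model from this project, and `pool()` combines the estimates following Rubin's rules:

```r
library(mice)

# create m = 5 imputed copies of a data frame with missing values
imp <- mice(dat, m = 5, seed = 123, printFlag = FALSE)

# fit the same model on each imputed dataset, then pool the estimates
# and standard errors across the five fits
fit <- with(imp, lm(y ~ x1 + x2))
summary(pool(fit))
```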
My missing data experience
I recently worked on a project that collected survey responses related to the adoption of technological innovations into the American healthcare system. The final sample consisted of 198 individuals but had a not insignificant amount of missing responses.
The first thing we can do is figure out the proportion of missing values in our data (and even better, visualize it). Using a simple function, we can see how much missing data we are dealing with.
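For example, a quick way to do this (assuming the responses are stored in a data frame called `survey_data`; the object name in the original analysis may differ) is:

```r
# proportion of missing values in each variable, sorted from most to least missing
prop_missing <- sort(colMeans(is.na(survey_data)), decreasing = TRUE)
round(prop_missing, 2)

# overall share of missing cells in the data
mean(is.na(survey_data))
```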
We can also plot it...
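The post does not show which plotting function was used; one common option is `aggr()` from the VIM package, which displays both the proportion of missing values per variable and the missingness pattern:

```r
library(VIM)

# left panel: proportion missing per variable; right panel: missingness pattern
aggr(survey_data, prop = TRUE, numbers = TRUE, sortVars = TRUE)
```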
The plot above shows the pattern of missingness in the data as well as the variables with the highest proportion of missing values. It seems that some of these items have a large number of missing values.
We implement the imputation procedure using the mice package in R. After the procedure runs, we end up with five imputed datasets, and we can visualize the imputed and original data.
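A sketch of that step (again assuming the data frame is called `survey_data`; the seed is arbitrary, and the plots shown are just two of the diagnostics mice offers):

```r
library(mice)

# mice() defaults to five imputed datasets (m = 5); predictive mean matching
# is the default method for numeric variables
imp <- mice(survey_data, m = 5, seed = 2023, printFlag = FALSE)

# compare the distributions of observed (blue) and imputed (red) values
densityplot(imp)
stripplot(imp)
```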
Using the complete() function, I 'select' one of the imputed datasets and then use the cbind() function to create a complete dataset of my survey.
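For instance (the second argument picks which of the five imputed datasets to extract; `other_vars` stands in for whatever columns were not part of the imputation):

```r
# pull out the first imputed dataset as a regular data frame
imputed_first <- complete(imp, 1)

# bind it to the columns that were not imputed to get one complete survey dataset
full_survey <- cbind(imputed_first, other_vars)
```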
My goal in this short blog post is not to discuss the research itself, but rather to present the implementation of the imputation procedure. After running all my models with the complete dataset (which includes the first of the five imputed datasets), I conduct a few more tests to check that the results are consistent across all imputed datasets. First, I create five complete datasets that include the original data and the imputed values. Then, I run my selected regression model on each complete dataset and store the results in a list.
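A minimal version of that loop, with `outcome`, `x1`, and `x2` as placeholder variable names rather than the actual variables in the study:

```r
# fit the same regression on each of the five imputed datasets and keep the fits in a list
models <- list()
for (i in 1:5) {
  completed_i <- complete(imp, i)
  models[[paste0("Imputation ", i)]] <- lm(outcome ~ x1 + x2, data = completed_i)
}
```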
Finally, using modelplot() from the modelsummary package, I plot the coefficients from all the models.
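Something along these lines:

```r
library(modelsummary)

# coefficient plot comparing the same model across the five imputed datasets
modelplot(models, coef_omit = "Intercept")
```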
What does the plot tell us? The fact that the model coefficients from all five imputed datasets are relatively similar increases my confidence in the results based on the imputation procedure.
As always, code for the analysis in this post is available on my GitHub.