Missingness Maps and Cross-Country Data

I've been doing some work on diagnostics for missing data issues and one that I have found particularly useful and enlightening has been what I've been calling a "missingness map." In the last few days, I used it on some World Bank data I downloaded to see what missingness looks like in a typical comparative political economy dataset.

missmap2.png


The rows (y-axis) are country-years and the columns (x-axis) are variables. We draw a red square where the country-year-variable cell is missing and a light green square where the cell is observed. We can see immediately that a whole set of variables in the middle columns is almost always unobserved. These are variables measuring income inequality, and they are known to have extremely poor coverage. The plot very quickly shows us how listwise deletion will affect our analyzed sample and what patterns of missingness occur in our data. For example, in these data it seems that if GDP is missing, then many of the other variables, such as imports and exports, are also missing. I think this is a neat way to get a quick, broad view of missingness.
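To make the idea concrete, here is a minimal sketch in plain Python of the matrix behind such a map. It is not how Amelia draws the plot; the toy records, the variable names, and the helper functions (`missingness_matrix`, `render_map`, `listwise_complete`) are all made up for illustration, and the "map" is rendered as text rather than as colored squares.

```python
# Toy country-year records; None marks a missing cell. All values are invented
# for illustration and are not World Bank figures.
rows = [
    {"country": "A", "year": 1980, "gdp": None, "imports": None, "exports": None, "gini": None},
    {"country": "A", "year": 1990, "gdp": 5.1, "imports": 1.2, "exports": 1.0, "gini": None},
    {"country": "B", "year": 1980, "gdp": 3.3, "imports": 0.9, "exports": None, "gini": None},
    {"country": "B", "year": 1990, "gdp": 4.0, "imports": 1.1, "exports": 1.3, "gini": 0.41},
]
variables = ["gdp", "imports", "exports", "gini"]


def missingness_matrix(rows, variables):
    """Return one list of booleans per row; True means the cell is missing."""
    return [[row[var] is None for var in variables] for row in rows]


def render_map(matrix):
    """Draw the map as text: 'X' for a missing cell, '.' for an observed one."""
    return "\n".join("".join("X" if cell else "." for cell in line) for line in matrix)


def listwise_complete(matrix):
    """Count rows that survive listwise deletion (rows with no missing cells)."""
    return sum(1 for line in matrix if not any(line))


miss = missingness_matrix(rows, variables)
text_map = render_map(miss)
print(text_map)
print("rows kept under listwise deletion:", listwise_complete(miss), "of", len(rows))
```

Even in this tiny example, a single poorly covered column (the hypothetical `gini`) is enough to knock most rows out of a listwise-deleted sample.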


We can also change the ordering of the rows to give a better sense of missingness. For the World Bank data, it makes sense to re-sort the rows by time and see how missingness changes over the years.

missmap-time2.png


A clear pattern emerges: the World Bank has better and better data as we move forward in time (the map becomes more "clear"). This is not surprising, but it is an important point when, say, deciding on the population under study in a comparative study. Clearly, listwise deletion will radically change the sample we analyze (the answers will be biased toward more recent data, at the very least). The standard statistical advice of imputation or data augmentation is also tricky here because we need to choose what to impute. Should we go ahead with imputation given that income inequality measures seem to be completely unavailable before 1985? If we remove the observations before that year, how do we qualify our findings?
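The time pattern can also be summarized numerically. The following is a rough sketch, again in plain Python with invented toy data and hypothetical names (`missing_share_by_year`, `by_time`), of sorting country-years by time and computing the share of missing cells in each year. It only illustrates the bookkeeping and is not part of Amelia.

```python
from collections import defaultdict

# Toy country-year records in arbitrary order; None marks a missing cell.
# Values are invented for illustration only.
rows = [
    {"country": "B", "year": 1990, "gdp": 4.0, "imports": 1.1, "gini": 0.41, "exports": 1.3},
    {"country": "A", "year": 1980, "gdp": None, "imports": None, "gini": None, "exports": None},
    {"country": "A", "year": 1990, "gdp": 5.1, "imports": 1.2, "gini": None, "exports": 1.0},
    {"country": "B", "year": 1980, "gdp": 3.3, "imports": 0.9, "gini": None, "exports": None},
]
variables = ["gdp", "imports", "exports", "gini"]

# Re-sort the rows by time (then country) so the map reads from old to recent.
by_time = sorted(rows, key=lambda r: (r["year"], r["country"]))


def missing_share_by_year(rows, variables):
    """Map each year to the fraction of its cells that are missing."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        for var in variables:
            total[row["year"]] += 1
            if row[var] is None:
                missing[row["year"]] += 1
    return {year: missing[year] / total[year] for year in sorted(total)}


shares = missing_share_by_year(by_time, variables)
for year, share in shares.items():
    print(year, f"{share:.0%} of cells missing")
```

A table like this makes it easier to argue for a particular cutoff year, though where to draw that line remains a substantive judgment.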

Any input on the missingness map would be amazing, as I am trying to add it as a diagnostic to a new version of Amelia. What would make these plots better?

Posted by Matt Blackwell at February 25, 2009 2:58 PM