Policymakers, businesses and NGOs need disaggregated data to guide their actions. But typically this data does not exist below the first administrative level, if at all.
A recent paper in Science by Blumenstock et al. suggests that anonymized mobile phone metadata might provide an answer. There has been prior work linking the volume of call use, aggregated up, to population statistics. But this paper is a departure because it focuses on “understanding how the digital footprints of a single individual can be used to accurately predict that same individual’s economic characteristics”.
To do this, they gained access to an anonymized 2009 database from Rwanda’s largest mobile operator containing records of billions of interactions. They then drew a geographically stratified random sample of 856 individual subscribers (from 1.5 million), covering 30 districts and 300 cell towers. They conducted a phone interview with each subscriber (asking 75 questions) and got permission from each of them to merge the survey data with the mobile phone interaction data.
The authors use a “structured combinatorial method to automatically generate several thousand metrics from the phone logs that quantify factors such as the total volume, intensity, timing and directionality of communication; the structure of the individual’s contact network; patterns of mobility and migration based on geospatial markers in the data, and so forth”. The authors then use “elastic net regularization” to remove less relevant metrics and arrive at a manageable model. Actual wealth (as constructed from answers given in the phone survey) is correlated with predicted wealth (using the regression model developed) at the individual level at r = 0.68.
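For readers who have not met it, elastic net regularization fits a regression while penalizing both the absolute size and the squared size of the coefficients, so that weak predictors are pushed to exactly zero and dropped. What follows is a minimal sketch of the idea, not the authors' code: the function names, the toy data and the penalty settings (`alpha`, `l1_ratio`) are all my own assumptions, and the toy features are deliberately constructed to be centred and uncorrelated.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; anything inside the band becomes 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=100):
    """Cyclic coordinate descent for elastic net (no intercept, assumes centred data).

    Minimizes (1/2n)*||y - Xb||^2 + alpha*l1_ratio*||b||_1
              + 0.5*alpha*(1 - l1_ratio)*||b||^2.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X*beta, and beta starts at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero feature carries no information
            # Covariance of feature j with the residual that ignores feature j
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, residual)) / n
            new_b = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_b - beta[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(col, residual)]
                beta[j] = new_b
    return beta


# Hypothetical toy data: three centred, mutually uncorrelated "phone metrics".
X = [
    [1, 1, 1], [-1, 1, 1], [1, -1, 1], [-1, -1, 1],
    [1, 1, -1], [-1, 1, -1], [1, -1, -1], [-1, -1, -1],
]
# "Wealth" depends strongly on metric 0, barely on metric 1, not at all on metric 2.
y = [2.0 * row[0] + 0.02 * row[1] for row in X]

coefs = elastic_net(X, y, alpha=0.1, l1_ratio=0.5)
print(coefs)
```

In this toy case the strong metric keeps a (slightly shrunken) coefficient while the two weak ones are set to zero, which is the pruning behavior the paper relies on to get from several thousand metrics down to a manageable model.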
Then they use this model to generate out-of-sample predictions for the remaining 1.5 million Rwandan mobile phone users. This is the whole point of the exercise: can we predict the wealth of individual subscribers based solely on anonymized transaction records? They compare various predictions with 2007 and 2010 DHS survey data (using those reporting mobile phone ownership or not in the DHS survey), which allow wealth indices to be constructed.
These out-of-sample predicted wealth estimates are aggregated up to the district level (30 of them) and compared to the DHS wealth estimates at the district level. For the 3 big districts (more than 400k people) the correlations are tight. For the small ones (fewer than 200k people) the correlations are looser. In fact the district wealth maps don’t look THAT similar for the two data sources (Figure 3 A and B) even though the r value is about 0.91. The r value for correlations between the 2 data sources at the cluster level is 0.79.
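The aggregation step described here is simple enough to sketch: average the individual predictions within each district, then correlate those district means with the survey-based district means. This is an illustration under my own assumptions, not the paper's procedure in detail; the district labels, the numbers and the function names are invented.

```python
from collections import defaultdict
from math import sqrt


def district_means(records):
    """Average a value per district from (district, value) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # district -> [running sum, count]
    for district, value in records:
        totals[district][0] += value
        totals[district][1] += 1
    return {d: total / count for d, (total, count) in totals.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical individual-level predicted wealth scores, tagged by district.
predicted = [("A", 1.0), ("A", 3.0), ("B", 0.0), ("B", 1.0), ("C", -1.0), ("C", -2.0)]
# Hypothetical survey-based district wealth index.
survey = {"A": 1.8, "B": 0.9, "C": -1.2}

phone_means = district_means(predicted)
districts = sorted(survey)
r = pearson([phone_means[d] for d in districts], [survey[d] for d in districts])
print(phone_means, round(r, 2))
```

Note that a simple mean gives every subscriber equal weight, so districts with few sampled subscribers will have noisier averages, which fits with the looser correlations seen for the small districts.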
The paper does not (as far as I can tell) dwell on its limitations, but there are several. First, obviously, how representative are mobile phone owners relative to the general population? It would have been good to see some geographic data on this, based on the last census. Second, how representative is the sample of 856 of the 1.5 million mobile phone users? How were refusals dealt with? Third, how contemporaneous are the survey data (year not given) and the phone data (2009)? Fourth, what is elastic net regularization? It would be nice to see the workings. Fifth, I’m not sure, but I’d be surprised if the DHS is representative at the district level, so how valid are the comparisons at that level? Sixth, I would have liked to have seen the Spearman correlations (on the ranks of the districts by wealth) to get a sense of what difference the data source makes for action. Seventh, what are the safeguards for the subscribers whose records were used (presumably 1.5 million did not give their permission)?
But these are all things that can be worked on and improved (and perhaps they have been). I’m not going to be too critical because I admire the creativity of the work and its potential value: mobile phones are everywhere, while survey enumerators are rare sightings.
I would like to see if the correlations are any good with stunting of under-3s or under-5s, perhaps keying the model into health-information-seeking behaviors used by subscribers.
When the data you want are scarce, you have to creatively use the data you have to patch things over while you make the case for investment in the former. I would like to see those district rankings, however. That would give us a sense of how serious misclassification is likely to be (do we implement an infant complementary feeding programme in District A or District B?).
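To show why the rankings matter, here is a small sketch of how a high Pearson r can coexist with district rankings that disagree badly. The five district values are invented purely for illustration, and the function names are mine; nothing here comes from the paper's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # average of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical district wealth: one rich district, four poor ones close together.
phone_estimate = [5.0, 1.0, 1.2, 0.9, 1.1]
survey_estimate = [5.2, 1.1, 0.9, 1.2, 1.0]

print(round(pearson(phone_estimate, survey_estimate), 2))
print(round(spearman(phone_estimate, survey_estimate), 2))
```

Here the single wealthy district drives Pearson's r close to 1, while the two sources order the four poorer districts quite differently and the rank correlation collapses. If a programme is targeted at the poorest districts, it is the second number that tells you how often you would pick the wrong one.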
This post was written by Lawrence Haddad and first appeared on Development Horizons.