About the Turkish citizens database breach: geography (part III)

[continued from part II]

Geographic patterns

We can also look at the geographic distribution of people, since the database contains columns for both city of birth and city of current residence. There are exactly 81 distinct values for the column labelled “address_city”, corresponding to the 81 provinces of Turkey at last count. (That number has steadily crept up over the years, with former towns vying for the privilege of receiving province status.)



As expected, Istanbul dwarfs every other province, with more people than the next three cities combined. The ranking of provinces is not too far from the 2014 population estimates, taking into account that these figures likely reflect eligible voters only, as opposed to all residents. Cities with very young populations will be under-represented in an electoral database because many of their residents will not meet the age criteria for eligibility.

The other column, labelled “birth_city”, paints a messier picture. Unlike the current place of residence, there is no a priori requirement for this field to map to an existing province of Turkey.** Indeed there are many foreign cities, often with inconsistent representations. For example, the following distinct values with subtle differences all appear to refer to New York City, with “ABD” being the Turkish abbreviation for USA:

  • NEW YORK A.B.D (periods in the abbreviation)
  • NEW YORK /A.B.D (slash)
  • NEW YORK A.B.D. (period at the end)
  • NEW YORK N.Y. A.B.D. (state spelled out)
  • N.Y. NEWYORK/A.B.D (inverted order of state & city)

Not to mention malformed variants such as missing spacing (“NEWYORK A.B.D”) and outright misspellings such as “NEVYORK” and “NEVWYORK.”
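To get a feel for how far simple cleanup would go, here is a minimal sketch that keeps only the letters of each value and groups the variants listed above by the result. The helper name punctuation_key is made up for illustration, and nothing suggests the database maintainers ever ran anything like it.

```python
def punctuation_key(raw: str) -> str:
    """Keep only the letters of a birth_city value, dropping spaces,
    periods and slashes. Case is left alone on purpose."""
    return "".join(ch for ch in raw if ch.isalpha())


# Variants quoted in the text above
variants = [
    "NEW YORK A.B.D",
    "NEW YORK /A.B.D",
    "NEW YORK A.B.D.",
    "NEW YORK N.Y. A.B.D.",
    "N.Y. NEWYORK/A.B.D",
    "NEWYORK A.B.D",
]

# Group the raw values by their letters-only key
groups = {}
for value in variants:
    groups.setdefault(punctuation_key(value), []).append(value)

for key, members in groups.items():
    print(key, len(members))
```

Stripping punctuation and spacing folds the simpler variants together, but the entries that add the state abbreviation still land in separate groups, and misspellings such as “NEVYORK” would need fuzzy matching on top of this.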

Case in point

This suggests no attempt was made to normalize or otherwise correct this column. It is also one of the few places where mixed-case entries appear. While the majority of entries in the database are entirely upper-case (even names; danah boyd would not be pleased), addresses on rare occasions feature lower-case letters to accommodate Turkish spelling. For example, the city Çorlu is represented as cORLU, with the lower-case c standing for the original Ç, Unicode character U+00C7. This trick works because for each letter in the ASCII set, there is at most one alternative in the Turkish alphabet that differs from it only by a diacritic, such as O and Ö (U+00D6). There is no Unicode or UTF-8 in the database, but lower-casing appears sparingly. Out of more than 49.6 million records, about 30 thousand have a lower-case letter in at least one field.
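The lower-case trick can be undone mechanically. Below is a rough sketch of such a decoder. Only the c to Ç pair is confirmed by the Çorlu example; the remaining pairs in the table are assumptions based on the Turkish alphabet, and the function name decode_turkish is invented here.

```python
# Lower-case ASCII letter -> accented upper-case Turkish letter.
# Only "c" is confirmed by the cORLU example; the rest are assumed.
ASSUMED_DIACRITICS = {
    "c": "Ç",
    "g": "Ğ",
    "i": "İ",
    "o": "Ö",
    "s": "Ş",
    "u": "Ü",
}


def decode_turkish(raw: str) -> str:
    """Replace each lower-case marker with its accented upper-case letter.
    Characters without an entry in the table pass through unchanged."""
    return "".join(ASSUMED_DIACRITICS.get(ch, ch) for ch in raw)


print(decode_turkish("cORLU"))
```

Entries that are already fully upper-case come back untouched, so the decoder could be applied to every record without special-casing the roughly 30 thousand that use the trick.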

Another complication is that even within Turkey, city of birth is captured at a different level of granularity than city of residence. In particular, it does not map to provinces. A straightforward query asking what percentage of Istanbul residents were actually born in Istanbul turns up the surprising conclusion that barely one million out of nearly nine million residents were natives. At first blush this would appear to confirm the Istanbulites’ gripe about massive waves of immigration from the heartland crowding their city beyond recognition. But a closer look at the distribution of birth place for residents of Istanbul at the time of this snapshot (2009) tells a different story:


As before, Istanbul “natives” take up a small fraction of the pie. But on closer inspection this turns out to be an artifact of inconsistent classification. Bakırköy, Üsküdar, Kadıköy and Şişli are all townships within the larger Istanbul province. So are Fatih, Kartal and Eyüp. These are not cases of people relocating to Istanbul, any more than a person born in Manhattan and presently living in Brooklyn could claim to have “relocated” to NYC. The capital city Ankara is the first data-point here that represents a genuine influx of people relocating from a different region.
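The effect of that misclassification is easy to illustrate. The sketch below folds the townships named above into their province before counting natives. The rows are made up, and the ASCII upper-case spellings of the township names are an assumption about how they appear in the database.

```python
# Townships named in the text; the spellings below are assumed.
ISTANBUL_TOWNSHIPS = {
    "BAKIRKOY", "USKUDAR", "KADIKOY", "SISLI", "FATIH", "KARTAL", "EYUP",
}


def birth_province(birth_city: str) -> str:
    """Map a township inside Istanbul to the province; leave others as-is."""
    return "ISTANBUL" if birth_city in ISTANBUL_TOWNSHIPS else birth_city


# Made-up birth_city values for eight people whose address_city is ISTANBUL
residents_birth_city = [
    "ISTANBUL", "BAKIRKOY", "USKUDAR", "ANKARA",
    "FATIH", "SIVAS", "KADIKOY", "ISTANBUL",
]

# Natives counted by exact string match only
naive = sum(c == "ISTANBUL" for c in residents_birth_city)
# Natives counted after folding townships into the province
adjusted = sum(birth_province(c) == "ISTANBUL" for c in residents_birth_city)

print(f"naive: {naive}/{len(residents_birth_city)}")
print(f"adjusted: {adjusted}/{len(residents_birth_city)}")
```

A complete fix would need a township-to-province table for the whole country, not just the seven names that happen to show up at the top of the Istanbul chart.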

Such inconsistencies make it difficult to correlate this data-set against patterns of migration directly. But we can ask a different question: for each province, how many people born in that province were living in the same place?
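For readers who want to reproduce this kind of tally, here is one way it could be done, sketched with Python's built-in sqlite3 module. The column names birth_city and address_city come from the database itself. The table name citizen and the handful of rows are placeholders, not real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "citizen" is a placeholder table name; the rows are invented.
conn.execute("CREATE TABLE citizen (birth_city TEXT, address_city TEXT)")
conn.executemany(
    "INSERT INTO citizen VALUES (?, ?)",
    [
        ("IZMIR", "IZMIR"),
        ("IZMIR", "IZMIR"),
        ("IZMIR", "ISTANBUL"),
        ("ANKARA", "ANKARA"),
        ("ANKARA", "ISTANBUL"),
        ("ISTANBUL", "ISTANBUL"),
    ],
)

# For each birth city: how many were born there, and how many still live there
rows = conn.execute(
    """
    SELECT birth_city,
           COUNT(*) AS born,
           SUM(CASE WHEN address_city = birth_city THEN 1 ELSE 0 END) AS stayed
    FROM citizen
    GROUP BY birth_city
    ORDER BY birth_city
    """
).fetchall()

retention = {city: stayed / born for city, born, stayed in rows}
for city, share in retention.items():
    print(f"{city}: {share:.0%}")
```

Because the comparison is a plain string match, this only works for birth_city values that are spelled exactly like a province name, which is the limitation discussed above.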


Places with the greatest likelihood of outward migration are rural cities in the Eastern and Southeastern regions of the country. In the most extreme cases only one in four people born there has stuck around. One outlier in this picture is Hatay. It’s clearly incorrect for only 649 people from a major province to appear in the database. The explanation turns out to be another inconsistency in classification: that province also goes by another name, “Antakya”, derived from the original Antioch. There are over a quarter million records where the city of birth is Antakya but exactly zero where the current city is labelled as such.
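An alias table is one way to patch over the Hatay anomaly. The sketch below merges Antakya into Hatay before counting births. The single alias comes from the paragraph above, while the sample values and the names PROVINCE_ALIASES and canonical_province are invented for illustration.

```python
from collections import Counter

# Alternate name -> province name as used in address_city
PROVINCE_ALIASES = {"ANTAKYA": "HATAY"}


def canonical_province(name: str) -> str:
    """Return the province name a birth_city alias should be counted under."""
    return PROVINCE_ALIASES.get(name, name)


# Made-up birth_city values
birth_cities = ["ANTAKYA", "ANTAKYA", "ANTAKYA", "HATAY", "ADANA"]

raw_counts = Counter(birth_cities)
merged_counts = Counter(canonical_province(c) for c in birth_cities)

print(raw_counts["HATAY"], merged_counts["HATAY"])
```

The same lookup would have to be applied before comparing birth_city against address_city, otherwise everyone born in Antakya and still living in Hatay would be counted as having moved.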


On the opposite extreme, large cities including İstanbul, Antalya, Bursa and İzmir have loyal natives who choose to put down roots. Ankara, on the other hand, does not appear to inspire that level of attachment. Geography might offer another explanation: Aydın, Antalya and İzmir are coastal cities, as are Muğla, Mersin and Adana, which rank high on the list. Chalk it up to the Mediterranean climate. (Nearby Denizli and Manisa are inland from the Aegean coast but also retain a large percentage of their voting-age population according to these figures.) Konya, Kahramanmaraş and Gaziantep provide something of a counterpoint to that “sunshine-and-beach” theory of what makes cities appealing. Located further inland, they have relatively harsh climates, and Gaziantep has the additional disadvantage of bordering the failing state of Syria, although that would have been less of a factor in 2009. Still, the glaring outlier here is Şırnak. With an adult population around 400,000 according to the 2010 census, only 10% of its residents are present in this data-set. This suggests that the inflated figure may be more an artifact of the criteria for inclusion in this database.


We can check for a correlation between local climate and the tendency of people growing up in that location to relocate. Each circle in the above graph represents one province, using data sourced from a Norwegian meteorological site.* The vertical axis shows the percentage of individuals born in that province who currently live there. A score of 100% indicates everyone is still living there, while 0% means everyone born there has moved away. (These figures do not rule out the possibility of having moved at some point in the past and later returned; the database only affords a snapshot in time.) The diameter of each circle is proportional to the total number of people with that birth-city; the largest circle corresponds to Istanbul. The dotted trend-line does suggest a small correlation between higher average temperatures and the tendency to stay put.
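A trend-line of this sort boils down to an ordinary least-squares fit. Here is a small sketch of that calculation along with the Pearson correlation coefficient. The temperature and retention numbers are invented, and whether the line in the graph was weighted by population is not something the data tells us, so this version treats every province equally.

```python
import math


def fit_trend(xs, ys):
    """Return (slope, intercept, pearson_r) of an unweighted least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, pearson_r


# Invented figures: average temperature (Celsius) and percent who stayed
avg_temp = [8.0, 11.0, 14.0, 17.0, 19.0]
pct_stayed = [40.0, 55.0, 50.0, 65.0, 70.0]

slope, intercept, r = fit_trend(avg_temp, pct_stayed)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.2f}")
```

A positive slope means warmer provinces retain more of their natives. How much weight to put on it depends on r and on how many provinces sit far from the line.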

[Continued: trending names]


** There were no records in the database for citizens residing overseas. Those individuals can still vote in general elections by going through the local consulate. It’s likely that a different system is used to track voter rolls for that scenario, which would explain why they were not affected by the breach.

* Osmaniye has been removed due to missing data.

