[continued from part I]
There are different ways to gauge the accuracy of this data set:
- Spot-check known records. For example, individuals can search the data for their own information or that of friends/relatives. While this is useful for preliminary screening (finding correct records lends support to the allegation that this data originates from a government database), it does not say much about the accuracy of the data-set as a whole.
- Statistical distribution. Look for patterns in the dump and compare against external references on demographic trends in Turkey. The challenge is that without knowing the exact criteria for inclusion, the correct external data-set is a guess. For example, if the data-set contains only persons who were alive at the time of the breach, then population records will not line up exactly. Similarly if the data excludes citizens who live abroad or have never received a citizen ID, it will not be fully representative of all Turkish citizens.
Years of birth
First we look at the distribution of birth years in the data. The latest records are from 1991: there is no record of any citizen born after Mar 29, 1991. Interestingly this corroborates an earlier statement from the government that this is not a new data-set, but a full publication of data previously lost in 2009. Given that the voting age in Turkey is 18 and the youngest voters at that time would have been born in 1991, the dump could have originated with an election system. As it turns out, local elections were held on Mar 29, 2009.
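The cutoff arithmetic is easy to verify. Below is a minimal sketch, assuming eligibility simply means having turned 18 on or before election day (the actual registration rules may differ); the function name is made up for illustration.

```python
from datetime import date


def latest_eligible_birth_date(election_day, voting_age=18):
    """Latest birth date of someone who is at least `voting_age` on election day.

    Assumption: a voter qualifies if their birthday falls on or before
    election day. This is a simplification, not the official rule.
    """
    try:
        return election_day.replace(year=election_day.year - voting_age)
    except ValueError:
        # Election day is Feb 29 and the target year is not a leap year.
        return date(election_day.year - voting_age, 2, 28)


print(latest_eligible_birth_date(date(2009, 3, 29)))  # 1991-03-29
```

The result matches the latest birth date observed in the dump, which is what makes the election-system theory attractive.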
The overall distribution is plausible, but there are some quirks:
- The last data point corresponding to 1991 shows a steep drop. This is expected, and it is an artifact of the data-set not being complete for that year.
- There are more mysterious drops for certain years, which are not easily explained by reference to historic events. For example, there is no a priori reason to expect significantly fewer people from 1988, 1982 & 1983 (although this corresponds to the aftermath of a military coup in late 1980), 1975, 1967 or 1961 (also the year following a military coup). These drops could also be a reflection of large-scale redactions performed by the individual(s) releasing the dump, to protect their own records or those of their associates.
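One rough way to flag such drops is to compare each year against the average of its two neighbors. The sketch below is an illustration rather than the method used for the chart: the function name, the 0.8 ratio and the sample counts are all invented.

```python
def flag_drops(counts, ratio=0.8):
    """Return years whose record count is below `ratio` times the mean
    of the two adjacent years. `counts` maps birth year -> record count.

    The 0.8 threshold is an arbitrary choice for illustration.
    """
    flagged = []
    for year in sorted(counts):
        before = counts.get(year - 1)
        after = counts.get(year + 1)
        if before is None or after is None:
            continue  # skip years without both neighbors (e.g. endpoints)
        if counts[year] < ratio * (before + after) / 2:
            flagged.append(year)
    return flagged


# Invented counts, not taken from the dump.
sample = {1986: 100, 1987: 102, 1988: 70, 1989: 104, 1990: 105}
print(flag_drops(sample))  # [1988]
```

A neighbor-based test like this will understate runs of consecutive low years (such as 1982 and 1983), since each low year drags down the baseline for the other.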
The Ottoman zombie-apocalypse?
The oldest record in the above graph goes back to 1888, corresponding to a citizen who was 120 years old at the time of the incident. But some outliers were removed from the set prior to creating that chart. First, there are records where the date-of-birth field is formatted incorrectly, not conforming to the DD/MM/YYYY pattern. More interestingly, there are a few hundred records where the year falls in the 1300s. They cluster around the 1340s, as depicted in this graph:
There is one theory for explaining these records without invoking 600-year-old zombies or ancient civilizations with secret health-diets. They may correspond to dates in the Rumi calendar, in use by the Ottoman Empire starting 1839. Such dates would be offset 584 years from the conventional Gregorian dates (although one “year” is still 365 or 366 days, unlike the shorter years of the Hijri calendar). Most records in the data-set interpreted as Rumi years correspond to the early 20th century, consistent with the adoption of the Gregorian calendar in 1927 by the modern Republic of Turkey.
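To make the theory concrete, here is a small sketch that rejects malformed fields and shifts suspiciously old years by the 584-year offset. The 1400 cutoff and the function name are assumptions for illustration, and the conversion is year-level only: it ignores any month- or day-level differences between the two calendars.

```python
import re

DATE_PATTERN = re.compile(r"(\d{2})/(\d{2})/(\d{4})")  # DD/MM/YYYY
RUMI_OFFSET = 584   # year offset quoted in the text
RUMI_CUTOFF = 1400  # assumed threshold: earlier years are treated as Rumi


def gregorian_birth_year(raw):
    """Return an approximate Gregorian birth year, or None if the
    field does not match the DD/MM/YYYY pattern."""
    match = DATE_PATTERN.fullmatch(raw.strip())
    if match is None:
        return None
    year = int(match.group(3))
    return year + RUMI_OFFSET if year < RUMI_CUTOFF else year


print(gregorian_birth_year("01/01/1340"))  # 1924
print(gregorian_birth_year("29/03/1991"))  # 1991
print(gregorian_birth_year("1991-03-29"))  # None
```

Under this reading, a record with year 1340 belongs to someone born around 1924, in their mid-eighties at the time of the incident rather than several centuries old.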
Closer look at distribution
There are other quirks in the data. What if we look at the most popular birthdays as day/month pairs? Of the top 25 most common birth-days in the data-set across all years, all fall on January 1st. Not only that, but in many years they account for more than 10% of all records: as if one out of ten people that year were born on New Year’s Day.
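A ranking like this takes only a few lines to reproduce. The sketch below assumes birth dates have already been parsed into (day, month, year) tuples; the function names and sample records are invented for illustration.

```python
from collections import Counter


def top_birth_dates(records, n=25):
    """Most common full birth dates, as ((day, month, year), count) pairs."""
    return Counter(records).most_common(n)


def january_first_share(records):
    """Fraction of each year's records that fall on January 1st."""
    per_year = Counter(year for _, _, year in records)
    jan_first = Counter(year for day, month, year in records
                        if (day, month) == (1, 1))
    return {year: jan_first[year] / total for year, total in per_year.items()}


# Invented sample, not taken from the dump.
sample = [(1, 1, 1950)] * 3 + [(15, 6, 1950)] + [(1, 1, 1960)] * 2 + [(1, 7, 1960)] * 2
print(top_birth_dates(sample, n=1))
print(january_first_share(sample))
```

A share well above 1/365 (roughly 0.3%) for a single day is the signal of interest here.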
Similarly, the top 50 contains only two different calendar days: January 1st and July 1st. The skew remains when looking at the top one hundred: every one of those falls on the first day of a month. While it is known that the distribution of birthdays is not uniform, such an extreme skew is not plausible in an “organic” distribution. More likely it reflects limitations of record-keeping. For example, individuals born in a particular month could have been arbitrarily assigned the first day of that month. Alternatively, when exact dates were not known due to missing birth certificates, Jan 1st may have been chosen. These biases persist even when looking at more recent years, where one would expect more precise record-keeping. Here is the distribution of birth-days for 1990, the latest complete year in the data-set, with January 1st removed to reduce the bias:
There are still sharp peaks corresponding to the start of each month. There is another periodic pattern, exhibiting peaks on a weekly basis. (That is the opposite of the expected result: it is known that fewer births fall on weekends, which should have manifested itself as small dips lasting two days every week instead of noticeable spikes.) There are also steep drops where the data-set contains only a few records for “impossible” dates such as April 31st and June 31st.
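Both observations can be checked mechanically: calendar validation catches the impossible dates, and a day-of-week tally exposes the weekly pattern. This is a minimal sketch assuming (day, month, year) tuples as input; the function names are made up for illustration.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def split_valid_invalid(records):
    """Separate real calendar dates from impossible ones such as April 31st."""
    valid, invalid = [], []
    for day, month, year in records:
        try:
            valid.append(date(year, month, day))
        except ValueError:  # raised for dates that do not exist
            invalid.append((day, month, year))
    return valid, invalid


def weekday_counts(dates):
    """Tally dates by day of the week, Monday through Sunday."""
    tally = Counter(d.weekday() for d in dates)  # Monday is 0
    return {name: tally.get(i, 0) for i, name in enumerate(WEEKDAYS)}


valid, invalid = split_valid_invalid([(1, 1, 1990), (31, 4, 1990), (31, 6, 1990)])
print(invalid)                # [(31, 4, 1990), (31, 6, 1990)]
print(weekday_counts(valid))  # only Mon is non-zero: Jan 1, 1990 was a Monday
```

If weekends really had fewer births, the Sat and Sun tallies over a full year would come out lower than the weekday ones.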
Updated: Revised second graph showing records with year of birth prior to 1400.
[continued: geographic patterns]