On April 4th, a website appeared claiming to release data on nearly 50 million Turkish citizens from the breach of a government database. This series of posts looks at the contents of that alleged database dump.
The website announcing availability of the data was hosted at IP address 220.127.116.11. Geolocation services trace that address to Romania, near the capital city of Bucharest.
As of this writing, no group has stepped forward to claim responsibility for the breach. There are no credit card numbers, bank account numbers or other directly monetizable information in the dump, suggesting that financial gain is an unlikely motivation. (Not to mention that groups driven by monetary incentive would be more likely to exploit the information for private gain, instead of releasing it publicly to make a statement.) A more likely candidate is hacktivism: attackers motivated by ideology to embarrass their target by revealing gaps in security. The announcement lends support to that in at least two ways. It contains critical opinions about contemporary Turkish politics, including unfavorable comparisons to a current US Republican candidate. The authors also attempt to pin the blame for the security breach (“crumbling and weak technical infrastructure”) on the governing party. The perpetrators offer a detailed critique of specific issues they encountered with the system:
- Weak encryption, described as “bit-shifting”, suggesting that the system was not using modern cryptography to protect sensitive information
- Weak authentication in the user-interface (“hard-coded password”)
- Failure to optimize the database for efficient queries
Attribution for data breaches is tricky. Perpetrators have a strong incentive to use false-flag tactics to confuse or mislead investigators. While opposition groups inside Turkey would be prime suspects, the wording of the message suggests US origin. The choice of pronouns in “we really shouldn’t elect Trump” and “your country” implies authors located in the US, reflecting on the upcoming 2016 Presidential election and drawing a comparison between the internal politics of the two countries. Meanwhile the text does not have obvious grammatical errors or awkward phrasing that would suggest non-native speakers.
Questions about the data
Putting aside the difficult problem of attribution, there are other questions that can be answered from the data:
- Is the data-set valid? In other words, does this dump include information about actual people or is it complete junk? How accurate are the entries? (This concern applies even to commercial databases used for profit. For example, the FTC found that 5% of consumer records at credit-reporting agencies had errors.)
- Who appears in this data-set? Turkey boasts a population close to 75 million but this set contains roughly 49 million records. What are the criteria for inclusion?
- What type of large-scale demographic trends can be learned from this data?
Working with the data
Before answering these questions, some tactical notes on the structure of the data released. The website did not host the data directly, but instead linked to a torrent. That torrent leads to a 1.5GB compressed archive. Decompressed, it yields a massive 7GB file “data_dump.sql”. (For reference, its SHA256 hash is 6580e9f592ce21d2b750135cb4d6357e668d2fb29bc69353319573422d49ca2d.) This file is a PostgreSQL dump with three parts:
- Initial segment defines the table schema.
- This preamble is followed by the bulk of the file containing information about citizens. Each line of text contains one row and columns are separated by tabs.
- Finally, the last few lines set up indices for faster searching on first/last name, city of birth, current city of residence etc.
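For readers who want to confirm they are looking at the same file, the hash quoted above can be checked with a few lines of Python. This is only a minimal sketch: the function name sha256_of_file and the chunk size are choices made here for illustration, and the file is read in chunks so that a 7GB dump never has to fit in memory.

```python
import hashlib

# SHA256 value quoted in the post for data_dump.sql
EXPECTED = "6580e9f592ce21d2b750135cb4d6357e668d2fb29bc69353319573422d49ca2d"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it 1 MiB at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling sha256_of_file("data_dump.sql") and comparing the result against EXPECTED indicates whether the local copy matches the file described here.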
Reanimating the database
Because each individual record appears on a line by itself, it’s possible to run ad hoc queries using nothing fancier than grep and regular expressions. But for more efficient querying, it makes sense to treat the database dump as a database proper. While the file is ready as-is for loading into PostgreSQL, with some minor transformations we can also use a more lightweight solution with sqlite. There are how-to guides for importing PostgreSQL dumps into SQLite but these instructions are dated and no longer work. (In particular the COPY command is no longer supported in sqlite3.) One approach is:
- Create a new file that contains only the row-records from the original file, stripping the preamble and indices
- Start sqlite, manually create the table and any desired indices
- Change the field separator to tab and use .import to load the file created in the first step
One note about hardware requirements: running this import process initially will consume a very large amount of memory. (Case in point: this blogger used a machine with 16GB RAM and the sqlite3 process peaked around 12GB.) After the resulting database is saved to a file in native sqlite format, it can be opened in the future with only a few MB of memory used.
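As an alternative to the sqlite shell, the same steps can be sketched with Python's built-in sqlite3 module. This is not the procedure used for this post, only an illustration under some assumptions: the rows are assumed to sit in a standard pg_dump block that starts with a "COPY table (columns) FROM stdin;" line and ends with a line containing only "\.", and the name load_dump and the batch size are invented here. Column names are taken from the COPY header rather than guessed, every column is created without a type, and backslash escapes inside fields are not decoded. Inserting in batches keeps memory use modest because only one batch of rows is held at a time.

```python
import re
import sqlite3

# Matches the header line pg_dump writes before the tab-separated rows
COPY_HEADER = re.compile(r"^COPY\s+(\S+)\s+\((.*)\)\s+FROM\s+stdin;\s*$", re.IGNORECASE)


def load_dump(lines, conn, batch_size=50_000):
    """Stream the COPY rows of a PostgreSQL dump into SQLite; return the row count."""
    insert_sql = None  # set while inside a COPY block, None otherwise
    batch, total = [], 0

    def flush():
        nonlocal total
        if batch:
            conn.executemany(insert_sql, batch)
            conn.commit()
            total += len(batch)
            batch.clear()

    for raw in lines:
        line = raw.rstrip("\r\n")
        if insert_sql is None:
            # Outside a COPY block: skip schema and index statements
            match = COPY_HEADER.match(line)
            if match:
                table = match.group(1).split(".")[-1].strip('"')
                columns = [c.strip().strip('"') for c in match.group(2).split(",")]
                col_list = ", ".join(f'"{c}"' for c in columns)
                conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_list})')
                marks = ", ".join("?" for _ in columns)
                insert_sql = f'INSERT INTO "{table}" VALUES ({marks})'
            continue
        if line == "\\.":
            # End-of-data marker: write what is pending and leave the block
            flush()
            insert_sql = None
            continue
        # PostgreSQL writes NULL as \N in COPY text format
        batch.append([None if field == "\\N" else field for field in line.split("\t")])
        if len(batch) >= batch_size:
            flush()
    flush()
    return total
```

A file object opened on the dump can be passed directly as lines, together with a connection from sqlite3.connect pointing at a new database file. Indices on the name and city columns would still need to be created afterwards with ordinary CREATE INDEX statements.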
[continue to part II: demographics]