Looking back on the Google Wallet plastic card

[Full disclosure: This blogger worked on Google Wallet 2011-2013]

Google recently announced that its Wallet Card, launched in 2013, will be discontinued this summer:

“After careful consideration, we’ve decided that we’ll no longer support the Wallet Card as of June 30. Moving forward, we want to focus on making it easier than ever to send and receive money with the Google Wallet app.”

This is the latest in a series of changes, starting with the rebranding of the original Google Wallet into Android Pay. It is also a good time to look back on the Wallet Card experiment in the context of the overall consumer-payments ecosystem.

Early iteration of the Wallet card


Bootstrapping NFC payments

The original Google Wallet, launched in 2011, was a mobile application focused on contactless payments, colloquially known as tap-and-pay: making purchases at brick-and-mortar stores using the emerging Near Field Communication (NFC) wireless interface on Android phones. NFC already enjoyed support from the payment industry, having been anointed as the next-generation payment interface combining the security of chip & PIN protocols with a modern form factor suitable for smartphones. (Despite impressive advances in manufacturing ever-thinner phones, it’s still not possible to squeeze one into a credit-card slot, although LoopPay had an interesting solution to that problem.)

There were already pilot projects with cards supporting NFC, typically launched without much marketing fanfare. At one point Chase and American Express shipped NFC-enabled cards. That is remarkable considering that on the whole, banks have been slow to jump on the more established contact-based chip & PIN technology. NFC involves even more moving parts. Not only must the card contain a similar chip to execute more secure payment protocols, but it also requires an antenna and additional circuitry to draw power wirelessly from the field generated by a point-of-sale terminal. In engineering terms, that translates into more opportunities for a transaction to fail and leave a card-holder frustrated.

Uphill battle for NFC adoption

Payment instruments have multiple moving pieces controlled by different entities: banks issue the cards, merchants accept them as a form of payment with help from payment processors, and ultimately consumers make payments. Bootstrapping a new technology can either gather speed in a virtuous cycle or remain stuck in a chicken-and-egg stalemate.

  • Issuers: It’s one thing for banks to issue NFC-enabled plastic cards on their own, quite another for those cards to be usable through Google Wallet. After all, the whole point of having a card with a chip is that one cannot make a functioning “copy” of that card by typing the number, expiration date and CVC into a form. Instead the bank must cooperate with the mobile-wallet provider (in other words, Google) to provision cryptographic keys over the air into special hardware on the phone. Such integrations were far from standardized in 2011 when Wallet launched, leaving customers with only two choices: a Citibank MasterCard or a white-label prepaid card from Metabank. Not surprisingly, this was a significant limitation for consumers who were not existing Citibank customers or interested in the hassle of maintaining a prepaid card. It would have been a hard slog to scale up one issuer at a time, but a better option presented itself with the TxVia acquisition: virtual cards for relaying transactions transparently via the cloud to any major credit card already held by the customer. That model wasn’t without its own challenges, including unfavorable economics and fraud-risk concentration at Google. But it did solve the problem of issuer support for users.
  • Merchants: Upgrading point-of-sale equipment is an upfront expense for merchants, who are reluctant to spend that money without a value proposition. For some, being on the cutting edge is sufficient. When mobile wallets were new (and Google enjoyed a roughly three-year lead before ApplePay arrived on the scene) it was an opportunity to attract a savvy audience of early adopters. But PR benefits only extend so far. Card networks did not help the case either: NFC transactions still incurred the same credit-card processing fees, even though expected fraud rates are lower with NFC than with magnetic stripes, which are trivially cloned.
  • Users: For all the challenges of merchant adoption, there was still a decent cross-section of merchants accepting NFC payments in 2011: organic grocery chain Whole Foods, Peet’s Coffee, clothing retailer Gap, Walgreens pharmacies, even taxicabs in NYC. But merchants were far from the only limiting factor for Google. In the US, wireless carriers represented an even more formidable obstacle. Verizon, AT&T and T-Mobile had thrown in their lot with a competing mobile-payments consortium called ISIS (later renamed Softcard to avoid confusion with the terrorist group) and moved to block their own subscribers from installing Google Wallet on their phones.



From virtual to physical: evolution of the proxy-card

Shut out of its own customers’ devices and locked in an uneasy alliance with wireless carriers over the future of Android, Google turned to an alternative strategy to deliver a payment product with broader reach, accessible to customers who either did not have an NFC-enabled phone or could not run Google Wallet for any reason. This was going to be a regular plastic card, powered by the same virtual-card technology used in NFC payments.

For all intents and purposes, it was an ordinary MasterCard that could be swiped anywhere MasterCard was accepted. It could also be used online for card-not-present purchases with its CVC2 code. Under the covers, it was a prepaid card: consumers could only spend existing balances loaded ahead of time. There was no credit extended, no interest accruing on balances, no late fees. It did not show up on credit history or influence FICO scores.

There would still be a Google Wallet app for these users; it would show transactions and manage funding sources. But it could not be used for tap-and-pay. NFC payments, once the defining feature of this product, had been factored out of the mobile application, becoming an optional feature available to a minority of users when the stars aligned.

Prepaid vs “prepaid”

But there was one crucial difference from the NFC virtual card: users had to fund their card ahead of time with a prepaid balance. That might seem obvious given the “prepaid” moniker, yet it was precisely a clever run-around of that limitation which had made the Google Wallet NFC offering a compelling product. When users tapped their phone, the request to authorize that transaction was routed to Google. But before returning a thumbs up or down, Google in turn attempted to place a charge for the exact same amount on the credit card the customer had set up in the cloud. The original payment was authorized only after this secondary transaction cleared. In effect, the consumer had just funded their virtual card by transferring $43.98 from an existing debit/credit card, and immediately turned around to spend that balance on a purchase which coincidentally was exactly $43.98.

Not so for the plastic card: there was an explicit stored-value account to keep track of. This time around that account had to be prepaid for real, with an explicit step taken by the consumer to transfer funds from an existing bank account or debit/credit card associated with the Google account. Not only that, but using a credit card as the funding source incurred explicit fees to the tune of 2.9% to cover payment processing. (If the same logic had applied to the NFC scenario, a $97 purchase at the cash register would have been reflected as a $100 charge against the original funding source.)

The economics of the plastic card necessitated this. Unlike its NFC incarnation, this product could be used at ATMs to withdraw money. If there were no fees for funding from a credit card, it would have effectively created a loophole for free cash advances: tapping into available credit on a card without generating any of the interchange fees associated with credit transactions. While having to fund in advance was a distinct disadvantage, in principle an existing balance could be spent through alternative channels such as purchases from the Google Store or peer-to-peer payments to other users. But none of those other use-cases involve swiping, which raises the question: what is the value proposition of a plastic card in the first place?
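The back-to-back authorization flow described above can be sketched in a few lines. All names here are hypothetical; this is an illustration of the idea, not Google’s implementation:

```python
from dataclasses import dataclass

@dataclass
class BackingCard:
    available_credit_cents: int

def charge_backing_card(card: BackingCard, amount_cents: int) -> bool:
    # Stand-in for the real charge placed against the user's stored
    # debit/credit card; the actual network call is not modeled here.
    if card.available_credit_cents >= amount_cents:
        card.available_credit_cents -= amount_cents
        return True
    return False

def authorize_virtual_card(card: BackingCard, amount_cents: int) -> str:
    # The virtual card carries no balance of its own: the front-end
    # authorization succeeds only if the exact same amount clears
    # against the backing card first.
    return "APPROVED" if charge_backing_card(card, amount_cents) else "DECLINED"

card = BackingCard(available_credit_cents=10_000)
print(authorize_virtual_card(card, 4_398))  # the $43.98 purchase: APPROVED
```

The key property is that the “prepaid” balance exists only for the instant between the secondary charge clearing and the original authorization being approved.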


End of the road, or NFC reaches critical mass

In retrospect the plastic card was stuck in no man’s land. From the outset it was a temporary work-around, a bridge solution until mobile wallets could run on every device and merchants accepted NFC consistently. The first problem was eventually solved by jettisoning the embedded secure-element chip at the heart of the controversy with wireless carriers, and falling back to a less secure but more open alternative called host-card emulation. As for the second problem, time eventually took care of that with a helping hand from ApplePay, which gave NFC a significant boost. In the end, the plastic proxy-card lived out its shelf-life, which is the eventual fate of all technologies predicated on squeezing a few more years out of swipe transactions, including dynamic/programmable stripes and LoopPay.


Bitcoin’s meta problem: governance (part I)

Layer 9: you are here

Bitcoin has room for improvement. Putting aside regulatory uncertainty, there is the unsustainable waste of electricity consumed by mining operations, unclear profitability for miners as block rewards decrease and, last but not least, difficulty scaling beyond its Lilliputian capacity of handling only a few transactions per second globally. (You want to pay for something using Bitcoin? Better hope not many other people have that same idea in the next 10 minutes or so.) In theory all of these problems can be solved. What stands in the way of a solution is not the hard reality of mathematics; this is not a case of trying to square the circle or solve the halting problem. Nor are these insurmountable engineering problems. Unlike calls for designing “secure” systems with built-in backdoors accessible only to good guys, there is plenty of academic research and some real-world experience building trusted, distributed systems to show the way. Instead Bitcoin the protocol is running into problems squarely at “layer 9:” politics and governance.

This last problem of scaling has occupied the public agenda recently and festered into a full-fledged PR crisis last year, with predictions of the end of Bitcoin. Much of the conflict focuses on the so-called “block-size”: the maximum size of each virtual page added to the global ledger of all transactions maintained by the system. The more space in that page, the more transactions can be squeezed in. That matters for throughput because the protocol also fixes the rate at which pages can be added, to roughly one every 10 minutes. But TANSTAAFL still holds: there are side-effects to increasing this limit, which was first put in place by Satoshi himself/herself/themselves to mitigate denial-of-service attacks against the protocol.
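The arithmetic behind that “few transactions per second” figure is simple; the ~250-byte average transaction size below is an assumption, and real transactions vary widely:

```python
# Rough throughput implied by a 1 MB block cap and 10-minute block interval.
block_size_bytes = 1_000_000   # the 1 MB block-size limit
avg_tx_bytes = 250             # assumed average transaction size
block_interval_s = 600         # target interval: one block per 10 minutes

tx_per_block = block_size_bytes // avg_tx_bytes   # ~4000 transactions per block
tps = tx_per_block / block_interval_s             # ~6.7 per second, globally
print(tx_per_block, round(tps, 1))
```

Doubling the block size or halving the interval each doubles throughput, and each roughly doubles the bandwidth demanded of every node relaying blocks.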

Game of chicken

Two former Bitcoin Core developers found this out the hard way last summer when they tried to force the issue. They created a fork of the popular open-source implementation  of bitcoin (Bitcoin Core) called BitcoinXT with support for expanded block size. The backlash came swift and loud. XT did not go anywhere, its supporters were banned from Reddit forums and the main developer rage-quit Bitcoin entirely with a scathing farewell. But that was not the end of the scaling experiment. Take #2 followed shortly afterwards as a new fork dubbed Bitcoin Classic, with more modest and incremental changes to block-size to address criticisms in XT. As of this writing, Classic has more traction than XT ever managed but remains far from reaching the 75% threshold required to trigger a permanent change in protocol dynamics.

Magic numbers and arbitrary decisions

This is a good time to step back and ask the obvious question: why is it so difficult to change the Bitcoin protocol? There are many arbitrary “magic numbers” and design choices hard-coded into the protocol:

  • Money supply is fixed at 21 million bitcoins.
  • The block reward started at 50 bitcoins and halves periodically, with the next decrease expected around June of this year
  • Mining uses a proof-of-work algorithm based on the SHA2 hash function
  • Proof-of-work construction encourages the creation of special-purpose ASIC chips, because they have significant efficiency advantages over using ordinary CPUs or GPUs that ship with off-the-shelf PCs/servers.
  • That same design is “pool-friendly:” it permits the creation of mining pools, where a centralized pool operator coordinates work by thousands of independent contributors and distributes rewards based on each contributor’s share of the work
  • The difficulty level for that proof-of-work is adjusted every 2016 blocks, with the goal of keeping the interval between blocks at 10 minutes
  • Transactions are signed using the ECDSA algorithm over one specific elliptic curve, secp256k1
  • And of course, blocks are limited to 1MB in size
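The difficulty-adjustment item above can be sketched as follows. This is a simplified model of the retargeting rule, not the exact consensus code:

```python
def retarget(old_target: int, actual_timespan_s: int) -> int:
    # Every 2016 blocks, the proof-of-work target is scaled by how long
    # those blocks actually took versus the expected two weeks
    # (2016 blocks * 600 s). A larger target means easier mining.
    # The adjustment is clamped to a factor of 4 in either direction.
    expected_s = 2016 * 600
    clamped = max(expected_s // 4, min(actual_timespan_s, expected_s * 4))
    return old_target * clamped // expected_s

# Blocks arrived twice as fast as intended -> target halves (harder):
print(retarget(1000, 2016 * 300))  # prints 500
```

This feedback loop is also why a sudden exodus of miners after a reward halving is worrying: blocks slow down, but the correction only arrives after the (now slower) 2016-block window completes.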

Where did all of these decisions come from? To what extent are they fundamental aspects of Bitcoin (it would not be “Bitcoin” as we understand it without that property) as opposed to arbitrary decisions made by Satoshi that could have gone a different way? What is sacred about the number 21 million? (Is it half of 42, the answer to the meaning of life?) Each of these decisions can be questioned, and in fact many have been challenged. For example, proof-of-stake has been offered as an alternative to proof-of-work to halt the runaway costs and CO2 emissions of electricity wasted on mining. Meanwhile later designs such as Ethereum tailor their proof-of-work system explicitly to discourage ASIC mining, by reducing the advantage such custom hardware would have over vanilla hardware. Other researchers proposed discouraging pooled mining by making it possible for the participant who solves the PoW puzzle to keep the reward, instead of having it automatically returned to the pool operator for distribution. One core developer even proposed (and later withdrew) a special-case adjustment to block difficulty for the upcoming change to block rewards. It was motivated by the observation that many mining operations will become unprofitable when rewards are cut in half, powering off their rigs and causing a sharp drop in total mining power that will remain uncorrected for a long stretch as blocks are mined at a slower rate.

Some of these numbers reflect limitations or trade-offs necessitated by current infrastructure. For example, one can imagine a version of Bitcoin that runs twice as fast, generating blocks every 5 minutes instead of 10. But that version would require each node running the software to exchange data twice as fast, because Bitcoin relies on a peer-to-peer network for distributing transactions and mined blocks. This goes back to the same objection levied against large-block proposals such as XT and Classic. Many miners are based in countries with high-latency, low-bandwidth connections such as China, a situation not helped by economics that drive mining operations to locate in the middle of nowhere, close to cheap sources of power such as dams, but away from fiber. There is a legitimate concern that if bandwidth requirements escalate (either because block sizes go up or because blocks are minted more frequently) they will not be able to keep up. But what happens when those limitations go away, when multi-gigabit pipes are available to even the most remote locations and the majority of mining power is no longer constrained by networking?

Planning for change

Once we acknowledge that change is necessary, the question becomes how such changes are made. This is as much a question of governance as it is of technology. Who gets to make the decision? Who gets veto power? Does everyone have to agree? What happens to participants who are not on board with the new plan?

Systems can be limited by a failure in either domain. Some protocols were designed with insufficient versioning and forward-compatibility, which makes it very difficult for them to operate in a heterogeneous environment where “old” and “new” versions exist side-by-side. Introducing upgrades then becomes a coordination problem, because everyone must agree on a “flag-day” to upgrade everything at once. In other cases, the design is flexible enough to allow small, local improvements, but the incentives for upgrading are absent. Perhaps the benefits of an upgrade are not compelling enough, or there is no single entity in charge of the system capable of forcing all participants to go along.

For example, credit-card networks have long been aware of the vulnerabilities associated with magnetic-stripe cards. Yet it has been a slow uphill battle to get issuing banks to replace existing cards and especially merchants to upgrade their point-of-sale terminals to support EMV. Incidentally, that is a relatively centralized system: card networks such as Visa and MasterCard sit in the middle of every transaction, mediating the movement of funds from the bank that issued the credit card to the merchant. Visa/MC call the shots around who gets to participate in this network and under what conditions, with some limits imposed by regulatory watchdogs worried about concentration in this space. In fact it was their considerable leverage over banks and merchants which allowed the card networks to push for the EMV upgrade in the US, by dangling economic incentives and penalties in front of both sides. Capitalizing on the climate of panic in the aftermath of the Target data breach, these networks were able to move forward with their upgrade objectives.




About the Turkish citizenship database breach: trending names (part IV)

[continued from part III]

One final aggregate pattern in the data we can sanity-check is the distribution of names. While the US Census Bureau publishes statistics on popular names by year and even crunches the data for trends, there is no comparable official source of statistics for Turkey. We can look at patterns in this data-set as a proxy for that.

Last names

There are over 332,000 distinct last names in the data. Here are the most popular ones in order of decreasing frequency:

Popular last names (all years)

This data is relatively stable over time and consistent with expectations. The most popular surnames in 1960 were:

Yilmaz, Kaya, Demir, Sahin, Celik,
Yildiz, Yildirim, Ozturk, Aydin, Ozdemir

Fast forward thirty years to 1990, the last complete year in this data-set, the picture changes only slightly:

Yilmaz, Demir, Kaya, Celik, Yildiz,
Sahin, Yildirim, Aydin, Ozturk, Ozdemir

There is only minor reshuffling of the original group. Demir has moved up a spot, Sahin has dropped a couple, Aydin and Ozturk have swapped places. Remarkably, all 10 names are identical between the individual years 1960/1990 and the entire data-set going back to the 19th century.

First names

By contrast, the distribution of first names starts with greater concentration but also exhibits more change over time, with new trendy names appearing and existing ones falling out of favor. There is both greater concentration (the top 10 most popular names account for a much larger fraction of all first names, compared to last names) and, surprisingly, greater overall diversity. With nearly 628,000 unique first names, the greater concentration must be accompanied by a long tail of relatively uncommon names.

The names they are a-changing

For women born in 1950 the most popular names were:


By 1990 the list has changed (more than half the entries are new) and experienced a dramatic flattening effect:


The pronounced clustering around a handful of choices has weakened. Twice as many people were born in 1990 as forty years earlier, and still the number of newborns with popular names has decreased in absolute terms. Collectively they account for a smaller percentage, suggesting increasing diversity. We can visualize these trends by taking all names appearing in both lists and charting over time the number of people born in a given year with that name:


Some trends stand out:

  • Between 1950 and 1965, popular names are still holding their own and continuing to hit new highs in absolute numbers. (But overall population is also growing at an increasing pace; the next section considers adjusted numbers relative to overall births in that year.)
  • That trend plateaus in the 1970s and reverses sharply after 1980. Several of the names in the top 10 for 1950 start declining in absolute numbers even as more people are born each year.
  • Popular names come out of nowhere and take off quickly. The 1960s witness the rise of Özlem and Esra, the 1970s introduce Tuğba and the 1980s show a hockey-stick pattern for Merve. These names were hardly on the radar in the 1950s, registering in the single digits most years and in some cases exactly zero.
  • There are also names walking a middle ground, bucking both trends, such as Elif and Zeynep. They hold steady through the first two decades, inching higher in the 70s and 80s.
  • The mysterious drop for certain years affecting overall numbers is also reflected here. 1951, 1957, 1961, 1967, 1975 and 1982 feature declines across the board for all names. Interestingly there is no similar correction observed for 1988.

Viewing popularity as a percentage of people born that year removes the artifacts caused by those anomalies in the data, revealing a steady erosion in the incidence of popular names. The overarching trend is towards greater diversity and less concentration in a handful of popular options.


Paging class of 1988

Similar trends apply to names for men. Here are the most popular names in 1950 in descending order:

Mehmet, Mustafa, Ahmet, Ali, Huseyin,
Hasan, Ismail, Ibrahim, Osman, Halil

In 1990:

Mehmet, Mustafa, Ahmet, Murat, Ali,
Gokhan, Ibrahim, Huseyin, Emre, Ugur

The first three spots are identical but Huseyin has dropped to #8, Ismail is no longer on the list, while Emre, Ugur and Gokhan make an appearance. Murat came out of left field to claim #4. While the overall ranking has not changed appreciably, the trends are more pronounced when visualized over time:


There is that significant dip for 1988 again, only far more pronounced. Here is one of the sharp differences from the graphs for women: while this graph also has drops around 1951, 1957, 1961, 1967, 1975 and 1982, it is unique in having a dramatic across-the-board decline in 1988. That difference may point towards one explanation: military service. With some exceptions, Turkey requires all men to serve in the armed forces. Those born in 1988 would have reached their first year of eligibility in 2009, coinciding with the timing of the local elections these records are believed to be associated with. During their compulsory service, these men would not be eligible to vote. Removing them from the voter rolls could explain the anomaly for 1988. Looking at percentages instead of absolute numbers smooths out the anomaly:


Again the overall trend is towards greater diversity and less concentration among a handful of popular names. Most of the lines are sloping downward. Murat had an unusual burst of popularity through the 1960s but peaked in the following decade. Even names that came into vogue more recently in the 70s and early 80s are starting to plateau.


About the Turkish citizens database breach: geography (part III)

[continued from part II]

Geographic patterns

We can also look at the geographic distribution of people, since the database contains columns for both city of birth and city of current residence. There are exactly 81 distinct values for the column labelled “address_city”, corresponding to the 81 provinces of Turkey at last count. (That number has steadily crept up over the years, with former towns vying for the privilege of receiving province status.)



As expected, Istanbul dwarfs every other province with more people than the next three cities combined. That stack ranking of provinces is not too far from the 2014 population estimates, taking into account the fact that these figures likely reflect eligible voters only, as opposed to all residents. Cities with very young populations will be under-represented in an electoral database because many of their residents will not meet the age criteria for eligibility.

Looking at the other column labelled “birth_city” in the database paints a messy picture. Unlike the current place of residence, there is no a priori requirement for this field to map to an existing province of Turkey.** Indeed there are many foreign cities, with often inconsistent representations. For example the following distinct values with subtle differences all appear to refer to New York City, with “ABD” being the translation of USA:

  • NEW YORK A.B.D (periods in the abbreviation)
  • NEW YORK /A.B.D (slash)
  • NEW YORK A.B.D. (period at the end)
  • NEW YORK N.Y. A.B.D. (state spelled out)
  • N.Y. NEWYORK/A.B.D (inverted order of state & city)

Not to mention malformed variants such as missing spacing (“NEWYORK A.B.D”) and outright misspellings such as “NEVYORK” and “NEVWYORK.”
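A minimal sketch (hypothetical, not the cleaning the source applied, since the text above notes no normalization was done at all) of how punctuation and spacing variants could be collapsed:

```python
import re

def normalize(raw: str) -> str:
    # Strip everything except letters. Enough to collapse punctuation
    # and spacing variants; misspellings like "NEVYORK" still survive.
    return re.sub(r"[^A-Z]", "", raw.upper())

variants = [
    "NEW YORK A.B.D",
    "NEW YORK /A.B.D",
    "NEW YORK A.B.D.",
    "NEWYORK A.B.D",
]
print({normalize(v) for v in variants})  # all four collapse to one key
```

Catching the misspelled variants would require fuzzy matching (edit distance or similar) rather than this purely mechanical pass.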

Case in point

This suggests no attempt was made to normalize or otherwise correct this column. It is also one of the few places where mixed-case entries appear. While the majority of entries in the database are entirely upper-case (even names; danah boyd would not be pleased) addresses on rare occasions feature lower-case to accommodate Turkish spelling. For example the city Çorlu is represented as cORLU, with the lower-case c standing in for the accented Ç, Unicode character U+00C7. This trick works because for each letter in the ASCII set, there is at most one alternative in the Turkish alphabet differing from that letter by a diacritic, such as O and Ö (U+00D6). There is no Unicode or UTF-8 in the database, but lower-casing appears sparingly in a handful of records: out of more than 49.6 million records, about 30 thousand have a lower-case letter in at least one field.
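The decoding implied by that convention can be sketched as follows; the mapping table is reconstructed from the description above (each ASCII letter has at most one diacritic-bearing Turkish counterpart):

```python
# Lower-case ASCII letters signal the corresponding accented upper-case
# Turkish letter; everything else passes through unchanged.
ASCII_TO_TURKISH = {
    "c": "Ç", "g": "Ğ", "i": "İ", "o": "Ö", "s": "Ş", "u": "Ü",
}

def decode(raw: str) -> str:
    return "".join(ASCII_TO_TURKISH.get(ch, ch) for ch in raw)

print(decode("cORLU"))  # → ÇORLU
```

The scheme is unambiguous precisely because no ASCII letter has two possible Turkish counterparts; if it did, lower-casing alone could not encode the distinction.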

Another complication is that even within Turkey, city of birth is captured at a different level of granularity than city of residence. In particular, it does not map to provinces. A straightforward query asking what percent of Istanbul residents were actually born in Istanbul turns up the surprising conclusion that barely one million out of nearly nine million residents at the time of this snapshot were natives. At first blush this would appear to confirm the Istanbulites’ gripe about massive waves of immigration from the heartland crowding out their city beyond recognition. But a closer look at the distribution of birth place for residents of Istanbul at the time of this snapshot (2009) tells a different story:


As before, Istanbul “natives” take up a small fraction of the pie. But on closer inspection this turns out to be an artifact of inconsistent classification. Bakırköy, Üsküdar, Kadıköy and Şişli are all townships within the larger Istanbul province. So are Fatih, Kartal and Eyüp. These are not cases of people relocating to Istanbul, any more than a person born in Manhattan and presently living in Brooklyn could claim to have “relocated” to NYC. The capital city Ankara is the first data-point here that represents a genuine influx of people relocating from a different region.

Such inconsistencies make it difficult to correlate this data-set directly against patterns of migration. But we can ask a different question: for each province, how many people born in that province were still living in the same place?
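That retention question reduces to a simple aggregation. The rows below are hypothetical stand-ins for the real records (column order follows the birth_city/address_city labels described above); this is an illustrative sketch, not the actual analysis:

```python
from collections import Counter

# Hypothetical (birth_city, address_city) pairs.
rows = [
    ("ISTANBUL", "ISTANBUL"),
    ("ISTANBUL", "ANKARA"),
    ("IZMIR", "IZMIR"),
    ("IZMIR", "IZMIR"),
    ("SIRNAK", "ISTANBUL"),
]

born = Counter(birth for birth, _ in rows)
stayed = Counter(birth for birth, addr in rows if birth == addr)

# Share of each province's natives still living there.
retention = {city: stayed[city] / born[city] for city in born}
print(retention)
```

On the real data this computation is only as good as the classification of birth_city, which is exactly where the Hatay/Antakya and township inconsistencies bite.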


Places with the greatest likelihood of outward migration are rural cities in the Eastern and Southeastern regions of the country. In the most extreme cases only one in four residents have stuck around. One outlier in this picture is Hatay: it is clearly incorrect that only 649 people from a major province appear in the database. The explanation turns out to be another inconsistency in classification: that province also goes by another name, “Antakya”, derived from the original Antioch. There are over a quarter million records where the city of birth is Antakya but exactly zero where the current city is labelled as such.


On the opposite extreme, large cities including İstanbul, Antalya, Bursa and İzmir have loyal natives who choose to set down roots. Ankara, on the other hand, does not appear to inspire that level of attachment. Geography might offer another explanation: Aydın, Antalya and İzmir are coastal cities, as are Muğla, Mersin and Adana, which rank high on the list; chalk it up to the Mediterranean climate. (Nearby Denizli and Manisa are inland from the Aegean coast but also retain a large percentage of their voting-age population according to these figures.) Konya, Kahramanmaraş and Gaziantep provide something of a counterpoint to that “sunshine-and-beach” theory of what makes cities appealing. Located further inland, they have relatively harsh climates, and Gaziantep has the additional disadvantage of bordering the failing state of Syria, although that would have been less of a factor in 2009. Still the glaring outlier here is Şırnak. With an adult population around 400,000 according to the 2010 census, only about 10% of its residents are present in this data-set, suggesting its inflated retention figure may be an artifact of the criteria for inclusion in this database.


We can check for a correlation between local climate and the tendency of people growing up in a location to relocate. Each circle in the above graph represents one province, using data sourced from a Norwegian meteorological site.* The vertical axis shows the percentage of individuals born and currently living in that province: a 100% score indicates everyone is still living there, while 0% means all residents have moved away. (These figures do not rule out the possibility of having moved at some point in the past and later returned; the database only affords a snapshot in time.) The diameter of each circle is proportional to the total number of residents with that birth-city; the largest circle corresponds to Istanbul. The dotted trend-line does suggest a small correlation between higher average temperatures and a tendency to stay put.

[Continued: trending names]


* There were no records in the database on citizens residing overseas. Those individuals can still vote in general elections by going through the local consulate. It’s likely that a different system is used to track voter rolls for that scenario, which explains why they may not have been affected by the breach.

** Osmaniye has been removed due to missing data.



About the Turkish citizenship database breach: demographics (part II)

[continued from part I]

There are different ways to gauge the accuracy of this data set:

  • Spot-check known records. For example, individuals can search the data for their own information or that of friends/relatives. While this is useful for preliminary screening (finding correct records lends support to the allegation that this data originates from a government database) it does not provide much guidance on the accuracy of the data-set as a whole.
  • Statistical distribution. Look for patterns in the dump and compare against external references on demographic trends in Turkey. The challenge is that without knowing the exact criteria for inclusion, the correct external data-set to compare against is a guess. For example, if the data-set contains only persons who were alive at the time of the breach, then population records will not line up exactly. Similarly if the data excludes citizens who live abroad or have never received a citizen ID, it will not be fully representative of all Turkish citizens.

Years of birth

First we look at the distribution of birth years in the data. The latest records are from 1991; there is no record of any citizen born after March 29, 1991. Interestingly, this corroborates an earlier statement from the government that this is not a new data-set, but a full publication of data previously lost in 2009. Given that the voting age in Turkey is 18, the youngest voters at that time would have been born in 1991, so the dump could have originated with an election system. As it turns out, there were local elections held on March 29, 2009.

Citizens by birth-year

The overall distribution is plausible, but there are some quirks:

  • The last data point, corresponding to 1991, shows a steep drop. This is expected: it is an artifact of the data-set not being complete for that year, since records end on March 29.
  • There are more mysterious drops for certain years which are not easily explained by reference to historic events. For example, there is no a priori reason to expect significantly fewer people born in 1988, 1982 & 1983 (although this corresponds to the aftermath of a military coup in late 1980), 1975, 1967 or 1961 (also the year following a military coup). These drops could also be a reflection of large-scale redactions performed by the individual(s) releasing the dump, to protect their own records or those of their associates.
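A birth-year histogram like the one discussed above can be reproduced with a short query. This sketch assumes dates are stored as DD/MM/YYYY text in a column named date_of_birth; both the column name and the sample rows are hypothetical:

```python
import sqlite3

# Tiny stand-in for the real table; dates are DD/MM/YYYY text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citizens (date_of_birth TEXT)")
conn.executemany(
    "INSERT INTO citizens VALUES (?)",
    [("12/05/1985",), ("03/11/1985",), ("29/03/1991",), ("5/6/91",)],
)

# The last four characters of a well-formed date are the year; the
# LIKE pattern weeds out malformed records before counting.
years = conn.execute("""
    SELECT substr(date_of_birth, -4) AS year, COUNT(*)
    FROM citizens
    WHERE date_of_birth LIKE '__/__/____'
    GROUP BY year ORDER BY year
""").fetchall()
```

Feeding the resulting (year, count) pairs into any plotting tool yields the chart shown.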

The Ottoman zombie-apocalypse?

The oldest record in the above graph goes back to 1888, corresponding to a citizen roughly 120 years old at the time of the incident. But some outliers were removed from the set prior to creating that chart. First, there are records where the date-of-birth field is formatted incorrectly, not conforming to the DD/MM/YYYY pattern. More interestingly, there are a few hundred records where the year falls in the 1300s. They cluster around the 1340s, as depicted in this graph:


There is one theory for explaining these records without invoking 600-year-old zombies or ancient civilizations with secret health-diets. They may correspond to dates in the Rumi calendar, in use by the Ottoman Empire starting in 1839. Such dates would be offset 584 years from the conventional Gregorian dates (although one “year” is still 365 or 366 days, unlike the shorter years of the Hijri calendar). Most records in the data-set interpreted as Rumi years correspond to the early 20th century, consistent with the adoption of the Gregorian calendar in 1926 by the modern Republic of Turkey.
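Since year lengths match between the two calendars, the reinterpretation is a fixed offset. A minimal sketch:

```python
# Rumi years are offset by 584 from Gregorian years; unlike the
# Hijri calendar, a Rumi "year" is still 365 or 366 days, so a
# constant shift suffices.
RUMI_OFFSET = 584

def rumi_to_gregorian(year: int) -> int:
    return year + RUMI_OFFSET

# A birth year recorded as 1340 would correspond to 1924, shortly
# before the Republic's switch to the Gregorian calendar.
print(rumi_to_gregorian(1340))  # 1924
```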

Closer look at distribution

There are other quirks in the data. What if we look at the most popular birthdays as day/month pairs? Of the top 25 most common birthdays in the data-set across all years, every single one falls on January 1st. Not only that, but in many years they correspond to more than 10% of all records: as if one out of ten people born that year arrived on New Year’s Day.

Similarly, the top 50 contains only two distinct calendar days: January 1st and July 1st. The skew remains when looking at the top one hundred: every one of those falls on the first day of a month. While it is known that the distribution of birthdays is not uniform, such an extreme skew is not plausible in an “organic” distribution. More likely it reflects limitations of record-keeping. For example, individuals born in a particular month could have been arbitrarily assigned the first day of that month. Alternatively, when exact dates were not known due to missing birth certificates, January 1st was chosen. These biases persist even when looking at more recent years, where one would expect more precise record-keeping. Here is the distribution of birthdays for 1990, the latest complete year in the data-set, with January 1st removed to reduce the bias:


There are still sharp peaks corresponding to the start of each month. There is another periodic pattern, exhibiting peaks on a weekly basis. (That is the opposite of the expected result: it is known that fewer births fall on weekends, which should have manifested itself as small dips lasting two days every week, instead of noticeable spikes.) There are also steep drops where the data-set contains a few “impossible” dates such as April 31st and June 31st.
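The day/month ranking itself is a one-line aggregation over the date strings. As before, the column name and sample rows below are hypothetical, assuming DD/MM/YYYY formatting:

```python
import sqlite3

# Small stand-in data-set with the kind of skew described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citizens (date_of_birth TEXT)")
conn.executemany(
    "INSERT INTO citizens VALUES (?)",
    [("01/01/1990",), ("01/01/1988",), ("01/07/1985",), ("15/03/1990",)],
)

# The first five characters of a DD/MM/YYYY date are the day/month
# pair; count each pair across all years and rank by frequency.
top = conn.execute("""
    SELECT substr(date_of_birth, 1, 5) AS daymonth, COUNT(*) AS n
    FROM citizens
    GROUP BY daymonth
    ORDER BY n DESC, daymonth
    LIMIT 25
""").fetchall()
```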

Updated: Revised second graph showing records with year of birth prior to 1400.

[continued: geographic patterns]





About the Turkish citizenship database breach

On April 4th, a website appeared claiming to release data on nearly 50 million Turkish citizens from the breach of a government database. This series of posts looks at the contents of that alleged database dump.


Screenshot of the website announcing the dump

The website announcing availability of the data was hosted at an IP address that geolocation services trace to Romania, near the capital city of Bucharest.


Attribution challenges

As of this writing, no group has stepped forward to claim responsibility for the breach. There are no credit-cards, bank account numbers or other directly monetizable information in the dump, suggesting that financial gain is an unlikely motivation. (Not to mention that groups driven by monetary incentives would be more likely to exploit the information for private gain, instead of releasing it publicly to make a statement.) A more likely candidate is hacktivism: attackers motivated by ideology to embarrass their target by revealing gaps in security. The announcement lends support to that theory in at least two ways. It voices critical opinions about contemporary Turkish politics, including unfavorable comparisons to a current US Republican candidate. The authors also attempt to pin the blame for the security breach (“crumbling and weak technical infrastructure”) on the governing party. The perpetrators offer a detailed critique of specific issues they encountered with the system:

  • Weak encryption, described as “bit-shifting” suggesting that the system was not using modern cryptography to protect sensitive information
  • Weak authentication in the user-interface (“hard-coded password”)
  • Failure to optimize the database for efficient queries

Attribution for data breaches is tricky. Perpetrators have strong incentives to use false-flag tactics to confuse or mislead investigation. While opposition groups inside Turkey would be prime suspects, the wording of the message suggests US origin. The choice of pronouns in “we really shouldn’t elect Trump” and “your country” implies authors located in the US, reflecting on the upcoming 2016 Presidential election and drawing a comparison between the internal politics of the two countries. Meanwhile the text does not have obvious grammatical errors or awkward phrasing that would suggest non-native speakers.

Questions about the data

Putting aside the difficult problem of attribution, there are other questions that can be answered from the data:

  • Is the data-set valid? In other words, does this dump include information about actual people or is it complete junk? How accurate are the entries? (This concern applies even to commercial databases used for profit. For example, the FTC found that 5% of consumer records at credit-reporting agencies had errors.)
  • Who appears in this data-set? Turkey boasts a population close to 75 million but this set contains roughly 49 million records. What are the criteria for inclusion?
  • What type of large-scale demographic trends can be learned from this data?

Working with the data

Before answering these questions, some tactical notes on the structure of the data released. The website did not host the data directly, but instead linked to a torrent. That torrent leads to a 1.5GB compressed archive. Decompressed, it yields a massive 7GB file, “data_dump.sql”. (For reference, its SHA256 hash is 6580e9f592ce21d2b750135cb4d6357e668d2fb29bc69353319573422d49ca2d.) This file is a PostgreSQL dump with three parts:

  • Initial segment defines the table schema.
  • This preamble is followed by the bulk of the file containing information about citizens. Each line of text contains one row and columns are separated by tabs.
  • Finally, the last few lines set up indices for faster searching on first/last name, city of birth, current city of residence etc.

Reanimating the database

Because each individual record appears on a line by itself, it’s possible to run ad hoc queries using nothing more fancy than grep and regular expressions. But for more efficient queries, it makes sense to treat the database dump as a database proper. While the file is ready as-is for loading into PostgreSQL, with some minor transformations we can also use a more lightweight solution with sqlite. There are how-to guides for importing PostgreSQL dumps into SQLite but these instructions are dated and no longer work. (In particular the COPY command has been deprecated in sqlite3.) One approach is:

  1. Create a new file that contains only the row-records from the original file, stripping the preamble and indices.
  2. Start sqlite3, manually create the table and any desired indices.
  3. Change the field separator to tab and use .import to load the file created in step #1.

One note about hardware requirements: the initial import consumes a very large amount of memory. (Case in point: this blogger used a machine with 16GB RAM and the sqlite3 process peaked around 12GB.) Once the resulting database is saved to a file in native sqlite format, it can be opened in the future using only a handful of megabytes of memory.

[continue to part II: demographics]


Future-proofing software updates: Global Platform and lessons from FBiOS (part III)

[continued from part II]

(Full disclosure: this blogger worked on Google Wallet 2011-2013)


Global Platform constrains the power of manufacturers and system operators to insert back-doors into a deployed system after the fact. But there are some caveats to cover before we jump to any conclusions about how that could have altered the dynamics of the FBI/Apple skirmish. There are still some things that a rogue card-manager— or more precisely “someone who knows one of the issuer security-domain keys” in GP terminology, a role often outsourced to trusted third-parties after deployment— can try in order to subvert security policy. These are not universal attacks; instead they depend on implementation details outside the scope of GP.

Card OS vulnerabilities

Global Platform is agnostic about what operating system is running on the hardware, and for that matter about the isolation guarantees provided by the OS for restricting each application to its own space. If that isolation boundary is flawed and application A can steal or modify data owned by application B, there is room for the issuer to work around GP protections. While there is no way to directly tamper with the internal state of application B, one can install a brand-new application A that exploits the weak isolation to steal private data from B. Luckily most modern card OSes do in fact provide isolation between mutually distrustful applications, along with limited facilities for interaction provided both sides opt in to exchanging messages with another application. For example, JavaCard-based systems apply standard JVM restrictions around access to memory, type-safety and automatic memory management.

Granted, implementation bugs in these mechanisms can be exploited to breach containment. For example, early JavaCard implementations did not even implement the full range of bytecode checks expected of a typical JVM. Instead they called for a trusted off-card verifier to weed out malformed byte-code prior to installing the application. This is a departure from the security guarantees provided by a standard desktop implementation of the JVM. In theory the JVM can handle hostile byte-code by performing the necessary static and run-time checks to maintain the integrity of the sandbox. (In reality JVM implementations have been far from perfect in living up to that expectation.) The standard excuse for the much weaker guarantees in JavaCard goes back to hardware limitations: performing these additional checks on-card adds to the complexity of the JVM implementation, which must run in the limited CPU/memory/storage environment of the card. The problem is, off-card verification is useless against a malicious issuer seeking to install deliberately malformed Java bytecode with the explicit goal of breaking out of the VM.

It’s worth pointing out that this is not a generic problem with card operating systems, but a specific case of cutting-corners in some versions of a common environment. Later generations of JavaCard OS have increasingly hardened their JVM and reduced dependence on off-card verification, to the point that at least some manufacturers claim installing applets with invalid byte-code will not permit breaking out of the JVM sandbox.

Global state shared between applications

Another pitfall is shared state across applications. For example, GP defines a card-global PIN object that any application on the card can use for authenticating users. This makes sense from a usability perspective: it would be confusing if every application on the card had its own PIN and users had to remember whether they are authenticating to the SSH app vs the GPG app, for instance. But the downside of the global PIN is that applications installed with the right privilege can change it. That means a rogue issuer can install a malicious app designed to reset that PIN, undermining an existing application which relied on that PIN to distinguish authorized access.

There is a straight-forward mitigation for this: each application can instead use its own, private PIN object for authorization checks, at the expense of usability. (Factoring out PIN checks into an independent application accessed via inter-process communication is not trivial. A malicious issuer could replace that applet with a back-doored version that always returns “yes” in response to any submitted PIN, while keeping the same application identifier. Some type of authenticated channel is required.) In many scenarios this is already inevitable due to the limited semantics of the global PIN object, including mobile payments such as Apple Pay and Google Wallet which support multiple interfaces and retain PIN verification state during a reset of the card.

Hardware vulnerabilities

There is another way OS isolation can be defeated: by exploiting the underlying hardware. Some of these attacks involve painstakingly going after the persistent storage, scraping data while the card is powered off and all software checks are out of the picture. Others are more subtle, relying on fault-injection to trigger controlled errors in the implementation of security checks, such as by using a focused laser-beam to induce bit flips. Interestingly enough, these exploits can be aided by installing new, colluding applications on the card designed to create a very specific condition (such as a specific memory layout) susceptible to that fault. For example, this 2003 paper describes an attack involving Java byte-code deliberately crafted to take advantage of random bit-flip errors in memory. In other words, while issuer privileges do not directly translate into 0wning the device outright, they can facilitate exploitation of other vulnerabilities in hardware.

Defending against Apple-gone-rogue

Speaking of Apple, there is a corollary here for the FBiOS skirmish. Manufacturers, software vendors and cloud-service operators all present a clear danger to the safety of their own customers. These organizations can be unknowingly compromised by attackers interested in going after customer data; this is what happened to Google in 2009 when attackers connected to China breached the company. Or they can be compelled by law-enforcement, as in the case of Apple, called on to attack their own customers.

“Secure enclave,” despite the fancy name, is home-brew proprietary technology from Apple without a proven track-record or anywhere near the level of adversarial security research aimed at smart-cards. While the actual details of the exploit used by the FBI to gain access to the phone are still unknown, one point remains beyond dispute: Apple could have complied with the order. Apple could have updated the software running in the secure enclave to weaken previously enforced security guarantees on any phone of that particular model. That was the whole reason this dispute went to court: Apple argued that the company was not required to deliver such an update, without ever challenging the FBI’s assertion that it was capable of doing so.

Global Platform mitigates that scenario by offering a different model for managing multiple applications on a trusted execution environment. If disk-encryption and PIN verification were implemented in GP-compliant hardware, Apple would not face the dilemma of subverting that system after the fact. Nothing in Global Platform permits even the most-privileged “issuer” to arbitrarily take control of an existing application already installed. Apple could even surrender the card-manager keys for that particular device to the FBI, and it would not help the FBI defeat PIN verification, absent some other exploit against the card OS or hardware.

SE versus eSE

The strange part: there is already a Global Platform-compliant chip included in newer generation iPhones. It does not look like a “card.” That word evokes images of plastic ID cards with specific dimensions and rounded corners, standardized as ISO 7810 ID-1. While that may have been the predominant form-factor for secure execution environments when GP specifications emerged, these days such hardware comes in different shapes and incarnations. On mobile devices, it goes by the name embedded secure element—another “SE” that has no relationship to the proprietary Apple secure enclave. For all intents and purposes, the eSE is the same type of hardware one would find on a chip & PIN enabled credit-card being issued by US banks today to improve the security of payments. In fact mobile payments over NFC were the original driver for shipping phones equipped with an eSE, starting with Google Wallet. While Google Wallet (now Android Pay) later ditched the eSE entirely, Apple picked up the same hardware infrastructure, even the same manufacturer (NXP Semiconductors) for its own payments product.

The device at the heart of the FBI/Apple confrontation was an iPhone 5C, which lacks an eSE; Apple Pay is only supported on the iPhone 6 and later iterations. But even on these newer models, the eSE hardware is not used for anything beyond payments. In other words, there is already hardware present to help deliver the result Apple is seeking— being in a position where the company cannot break into a device after the fact. But it sits on the sidelines. Will that change?

In fairness, Apple is not alone in under-utilizing the eSE. When this blogger worked on Google Wallet, general-purpose security applications of the eSE were an obvious next step after mobile payments. For example, the original implementation of disk encryption in Android was susceptible to brute-force attacks: it used a key directly derived from a user-chosen PIN/password for encrypting the disk. (It did not help that the same PIN/password would be used for unlocking the screen all the time, all but guaranteeing that it had to be short.) Using the eSE to verify the PIN and output a random key would greatly improve security, in the same way using a TPM with a PIN check improves the security of disk encryption compared to relying on a user-chosen password directly. But entrenched opposition from wireless carriers meant Android could not count on access to the eSE on any given device, much less a baseline of applications present on the eSE. (Applications can be pre-installed or burnt into the ROM mask at the factory, but that would have involved waiting for a new generation of hardware to reach market.) In the end Google abandoned the secure element, settling instead for the much weaker TrustZone-backed solution for general purpose cryptography.