I am also the editor of the Neohapsis Labs blog. The following is reprinted with permission from
By Michael Pearce, Neohapsis & Neolabs
There has been a lot of concern and online chatter about iPhone/mobile applications and the private data that some send to various parties. Starting with the discovery of Path sending your entire address book to their servers, it has since also been revealed that other applications do the same thing. The other offenders include Facebook, Twitter, Instagram, Foursquare, Foodspotting, Yelp, and Gowalla. This corresponds nicely with some research I have been doing into device ID leakage on mobile devices, where I have seen the same leakages, excuses, and techniques applied and abused as those discussed around the address book leakages.
I have observed a few posts discussing the issues and proposing solutions. These solutions range from requiring iOS to request permission for address book access (as it does for location) to advising developers to hash the sensitive data they send and compare the hashes server side.
The first idea is a very good one; I see few reasons why a device's geolocation is less sensitive than its address book. The second, however, is only partial advice: if taken as it is given in Martin May’s post or Matt Gemmell’s arguments, it will not solve the privacy problems on its own. This is because 1. anonymised data isn’t anonymous, and 2. no matter what hashing algorithm you use, if the input material is sufficiently constrained you can compute, or precompute, all possible values.
Martin May’s two characteristics of a hash [link] :
- Identical inputs will yield the same hash
- It is virtually impossible to deduce the original input from a hash if a strong hashing algorithm is used.
Of these two characteristics of a hash, the privacy implications of the first are not fully discussed, and the second is incorrect as stated.
Hashing will not solve the privacy concerns because:
- Hashing Data does not Guarantee Privacy (When the same data is input)
- Hashing Data does not Guarantee Secrecy (When the input values are constrained)
The reasons for this, which those posts do not discuss, are centered on the fact that real world input is constrained, not infinite. Telephone numbers are an extreme case of this, as I will discuss later.
A quick primer on hashing
Hashing is a destructive, theoretically one-way process where some data is taken and put through an algorithm to produce some output that is a shadow of the input. Like a shadow, the same output is always produced by the same input, but not the other way around. (Same car, same shadow).
A very simple example of a hashing function is the modulus (or remainder). For instance, the output from 3 mod 2 is the remainder when 3 is divided by 2, or 1. The percent sign is commonly used in programming languages to denote this operation, so similarly:
1 % 3 is 1
2 % 3 is 2
3 % 3 is 0
4 % 3 is 1
5 % 3 is 2
etc.
If you take some input, you get the same output every time from the same hashing function. The reason the hashing process is one way is that it intentionally discards some data about the original. This results in what are called collisions, and we can see some in our earlier example: using mod 3, 1 and 4 give the same hash, as do 2 and 5. The example given will cause collisions approximately one time in 3; modern strong hashing functions, however, are a great deal more complex than modulo 3. Even the “very broken” MD5 has collisions occur only one time in every 2^24, or 1 in ~17 000 000.
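The mod 3 example can be sketched in a few lines of Python. This is only an illustration of the toy hash described above; the function names `mod_hash` and `find_collisions` are made up for this sketch.

```python
from collections import defaultdict


def mod_hash(value: int, modulus: int = 3) -> int:
    """Toy hash: keep only the remainder and discard everything else."""
    return value % modulus


def find_collisions(inputs, modulus: int = 3) -> dict:
    """Group inputs by their toy hash; a group with 2+ members is a collision."""
    buckets = defaultdict(list)
    for value in inputs:
        buckets[mod_hash(value, modulus)].append(value)
    return dict(buckets)


buckets = find_collisions(range(1, 6))
print(buckets)  # {1: [1, 4], 2: [2, 5], 0: [3]}
```

Given only the output 1, there is no way to tell whether the input was 1 or 4, which is the one-way property in miniature.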
A key point is that, with a hashing algorithm, for any output there are theoretically an infinite number of inputs that can give it, and thus it is a one-way, irreversible process.
A second key point is that any input gives the same output every time. So, by checking if the hashes of two items are the same you can be pretty sure they are from the same source material.
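As a rough illustration of that second point, the sketch below hashes the same phone number twice and a different number once. SHA-256 and the helper name `hash_number` are assumptions made for the example; the posts under discussion do not say which algorithm any given app uses.

```python
import hashlib


def hash_number(number: str) -> str:
    """Hash a phone number string (SHA-256 is assumed purely for illustration)."""
    return hashlib.sha256(number.encode("utf-8")).hexdigest()


# Two users upload the same contact; a third uploads a different one.
alice_upload = hash_number("0012345678910")
bob_upload = hash_number("0012345678910")
carol_upload = hash_number("0012345678911")

print(alice_upload == bob_upload)    # True: same input, same hash
print(alice_upload == carol_upload)  # False: different input
```

This is exactly what makes server side matching work, and also what lets anyone holding the hashes link records together.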
Cooking Some Phone Number Hash(es)
(All calculations are approximate, if I’m not out by two orders of magnitude then…)
Phone numbers conform to a rather well known format, or set of formats. A modern GPU can run about 20 million hashes per second (2*10^7), or 1.7 trillion (1.7*10^12) per day. So, how does this fit with possible phone numbers?
A pretty standard phone number is made up of 1-3 digits for a country code, 3 for the local code, and 7 for the number itself, with perhaps 4 for the extension.
So, we have the following range of numbers:
0000000000000-0000 to 9999999999999-0000
Or 10^13 possible numbers… At 2*10^7 hashes per second, that is about 6 days' work to compute all possible values (and a LOT of storage space…)
If we now represent it in a few other forms that may occur to programmers…
+001 (234) 567-8910, 0012345678910, 001-234-5678910, 0012345678910(US), 001(234)5678910
We have maybe 10-20 times that, or a few months of calculations…
But real world phone numbers don’t fill all possible values. For instance, take a US phone number. It is also made up of the country code, 3 digits for the local code, and 7 for the number, with perhaps 4 for the extension. But:
- The country code is known
- The area code is only about 35% used, since only 350 values are in use
- The 7 digit codes are not completely full (let’s guess 80%)
- Most numbers do not use extensions (let’s say 5% use them)
Now, we only have 350 * (10 000 000 * .8) * 1.05, or 2.94 billion, combinations (2.94*10^9). That is only a little over two minutes on a modern GPU. Even allowing for different representations of numbers, you could store that in a few gigabytes of RAM for instant lookup, or recalculate every time and take longer. This is what is called a time-space tradeoff: the space of the memory or the time to recalculate.
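The arithmetic for the US case can be checked with a short script. All of the figures are the rough guesses from the text (350 area codes, 80% fill, 5% extensions, 2*10^7 hashes per second); the variable names are just for this sketch.

```python
HASHES_PER_SECOND = 2 * 10**7  # the GPU estimate used in this post

area_codes_in_use = 350
local_numbers = int(10_000_000 * 0.8)  # 7-digit numbers, guessed 80% full
extension_factor = 1.05                # rough allowance for extensions

us_keyspace = area_codes_in_use * local_numbers * extension_factor
seconds_to_enumerate = us_keyspace / HASHES_PER_SECOND

print(f"US keyspace: about {us_keyspace:.3g} numbers")
print(f"Time to hash them all: about {seconds_to_enumerate:.0f} seconds")
```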
Anyway, the two takeaways for our discussion here regarding privacy are:
1. Every unique output value probably corresponds to a unique input value, so this hashing anonymisation still has privacy concerns.
Since the number of possible phone numbers is small compared with the output space of even a broken hashing algorithm, there is probably little chance of collisions.
2. Phone numbers can be reverse computed from raw hashes alone.
Because of the known constraints on input values, it is possible either to brute force the reverse values or to build a reasonably sized rainbow table on a modern system.
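The second takeaway can be demonstrated at small scale. The sketch below is a plain precomputed lookup table (simpler than a true rainbow table) covering only the 10,000 numbers that share one hypothetical prefix, so that it runs instantly. SHA-256, the prefix, and the function names are assumptions for the example; the same loop over the full constrained keyspace is what the timing estimates above describe.

```python
import hashlib


def hash_number(number: str) -> str:
    """Unsalted hash of a phone number (SHA-256 assumed for illustration)."""
    return hashlib.sha256(number.encode("utf-8")).hexdigest()


def build_lookup(prefix: str, digits: int) -> dict:
    """Precompute hash -> number for every number that starts with `prefix`."""
    table = {}
    for n in range(10**digits):
        number = f"{prefix}{n:0{digits}d}"
        table[hash_number(number)] = number
    return table


# Country code 001, area code 234, exchange 567: 10,000 candidates.
lookup = build_lookup("001234567", 4)

# A hash observed on the wire, with the original number unknown to the attacker.
leaked_hash = hash_number("0012345678910")
print(lookup.get(leaked_hash))  # 0012345678910
```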
Hashing Does NOT Guarantee Privacy
Anonymising data by removing specific user identifying information but leaving in unique identifiers does not work to assuage privacy concerns. This is because often clues are in the data, or in linkages between the data. AOL learned this the hard way when they released “anonymised” search data.
Furthermore, the network effect can reveal a lot about you: how many people you connect to, and how many they connect to, can be a powerful identifier of you. It can also predict a lot of things, like your career area and salary point (since more connections tends to mean richer).
For a good discussion of some of the privacy issues related to hashes see Matt Gemmell’s post, Hashing for Privacy in social apps.
Mobile apps also often send the device hardware identifier (which cannot be changed or removed) to servers and advertising networks. And I have also observed the hash of this (or the WiFi MAC address) sent through. This hardly helps accomplish anything, as anyone who knows the device ID can hash it and look for that, and anyone who knows the hash can look for it, just as with the phone numbers. This hash is equally unique to my device, and unable to be changed.
Hashing Does not equal Secrecy
As discussed under “cooking some hash(es)” it is possible to work back from a hash to the input since we know some of the constraints operating upon phone numbers. Furthermore, even if we are not sure exactly how you are hashing data then we can simply put test data in and look for known hashes of it. If I know what 123456789 hashes to and I see it in the output, then I know how your app is hashing phone numbers.
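A minimal sketch of that probing technique follows. The list of candidate algorithms and formats is an assumption made for the example (a real investigation would try many more), and `identify_scheme` is a hypothetical helper name.

```python
import hashlib

# Algorithms an app might plausibly use; extend as needed.
CANDIDATE_ALGORITHMS = ("md5", "sha1", "sha256")


def candidate_formats(digits: str) -> dict:
    """A few ways a programmer might format the same number before hashing."""
    return {
        "digits only": digits,
        "leading plus": "+" + digits,
    }


def identify_scheme(known_input: str, observed_hash: str):
    """Return (algorithm, format) if some combination reproduces the observed hash."""
    for format_name, text in candidate_formats(known_input).items():
        for algorithm in CANDIDATE_ALGORITHMS:
            digest = hashlib.new(algorithm, text.encode("utf-8")).hexdigest()
            if digest == observed_hash:
                return algorithm, format_name
    return None


# Simulate an app that hashes "+<digits>" with SHA-1, then probe it.
observed = hashlib.sha1(b"+123456789").hexdigest()
print(identify_scheme("123456789", observed))  # ('sha1', 'leading plus')
```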
The Full Solution to Privacy and Secrecy: Salt
Both of these issues can be greatly helped by increasing the complexity of the input into the hash function. This can both remove the tendency for anonymised data to carry identical identifiers across instances, and also reduce the chance of it becoming feasible to reverse-calculate all possible values. Unfortunately there is no perfect solution to this if user-matching functionality comes first.
The correct solution, as should be used to store passwords, is entry-specific salting (for example with bcrypt). It is not feasible for a matching algorithm, however, as it only works for comparing hashed input to stored hashes and will not work for comparing stored hashes to stored hashes.
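The limitation can be seen in a short sketch. Here PBKDF2 from the Python standard library stands in for bcrypt (an assumption made so the example has no third-party dependencies), and the iteration count and function names are illustrative only.

```python
import hashlib
import hmac
import os

ITERATIONS = 10_000  # illustrative value only


def store(number: str) -> tuple:
    """Hash a number with a fresh random salt; return (salt, digest)."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", number.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify(number: str, salt: bytes, digest: bytes) -> bool:
    """Compare fresh input against one stored record, reusing that record's salt."""
    candidate = hashlib.pbkdf2_hmac("sha256", number.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)


record_a = store("0012345678910")
record_b = store("0012345678910")

# Stored hash vs stored hash: the same number no longer matches itself.
print(record_a[1] == record_b[1])
# Input vs stored hash: still works, because the salt is stored with the record.
print(verify("0012345678910", *record_a))
```

The first comparison is what a contact-matching service needs, and it is the one that per-entry salting breaks.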
However, if you as a developer are determined to make a server side matching service for your users, then you need to apply a hybrid approach. This is not good practice for highly sensitive information, but it should retain the functionality needed for server side matching.
Your first privacy step is to make sure your hashes do not match those collected or used by anyone else. Do this by adding some constant secret to them, a process called salting.
e.g., adding 9835476579080945368095468905486 to the start of every number before you hash
This will make all of your hashes different to those used by any other developer, but will still compare them properly. The same input will give the same output.
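A minimal sketch of the constant-secret approach, assuming SHA-256 and using the example value above as the secret; `salted_hash` and `plain_hash` are hypothetical helper names.

```python
import hashlib

# The example secret from the text; a real one would be kept private.
APP_SECRET = "9835476579080945368095468905486"


def plain_hash(number: str) -> str:
    """Unsalted hash, as another developer might compute it."""
    return hashlib.sha256(number.encode("utf-8")).hexdigest()


def salted_hash(number: str, secret: str = APP_SECRET) -> str:
    """Prepend the constant secret to the number before hashing."""
    return hashlib.sha256((secret + number).encode("utf-8")).hexdigest()


number = "0012345678910"
print(salted_hash(number) == salted_hash(number))                  # True: still matchable
print(salted_hash(number) == plain_hash(number))                   # False: differs from unsalted
print(salted_hash(number) == salted_hash(number, "other-secret"))  # False: differs per secret
```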
However, there is still a problem: if your secret salt is leaked or disclosed, the reversing attacks outlined earlier become possible. To avoid this, increase the complexity of the input by hashing more complex data. So, rather than just hashing the phone number, hash the name, email, and phone number together. This does introduce the problem of causing hashes to disagree if any part of the input differs through misspellings, typos, etc.
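A sketch of the combined-record idea follows. The normalisation step (lower-casing and stripping whitespace) is an assumption added here to soften the mismatch problem; it is not part of the advice above, and it cannot rescue a genuine typo. The names `normalise` and `record_hash` are hypothetical.

```python
import hashlib

APP_SECRET = "9835476579080945368095468905486"  # example secret from the text


def normalise(field: str) -> str:
    """Assumed normalisation: lower-case and remove all whitespace."""
    return "".join(field.lower().split())


def record_hash(name: str, email: str, phone: str, secret: str = APP_SECRET) -> str:
    """Hash name, email, and phone together with the constant secret."""
    parts = [normalise(name), normalise(email), normalise(phone)]
    material = secret + "|".join(parts)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


original = record_hash("Jane Doe", "jane@example.com", "0012345678910")
reformatted = record_hash("jane doe ", "Jane@Example.com", "0012345678910")
misspelt = record_hash("Jane Do", "jane@example.com", "0012345678910")

print(original == reformatted)  # True: only case and spacing differ
print(original == misspelt)     # False: one missing letter breaks the match
```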
The best way to protect your users’ data from disclosure, and your reputation from damage due to a privacy breach, is to:
- Don’t collect or send sensitive user data or hashes in the first place – using the security principle of least privilege.
- Ask for access in a very obvious and unambiguous way – informed consent.
[Update] Added author byline and clarified some wording.
By Michael Pearce, a Security Consultant and Researcher at Neohapsis
Throughout the US on Groundhog Day, an inordinate amount of media attention will be given to small furry creatures and whether or not they emerge into bright sunlight or cloudy skies. In a tradition that may seem rather topsy-turvy to those not familiar with it, the story says that if the groundhog sees his shadow (indicating the sun is shining), he returns to his hole to sleep for six more weeks and avoid the winter weather that is to come.
Similarly, when a company comes into the world of security and begins to endure the glare of security testing, the shadow of what they find can be enough to send them back into hiding. However, with the right preparation and mindset, businesses can not only withstand the sight of insecurity, they can begin to make meaningful and incremental improvements to ensure that the next time they face the sun the shadow is far less intimidating.
Hundreds or thousands of issues – Why?
It is not uncommon for a Neohapsis consultant to find hundreds of potential issues to sort through when assessing a legacy application or website for the first time. This can be due to a number of reasons, but the most prominent are:
- Security tools that are paranoid/badly tuned/misunderstood
- Lack of developer security awareness
- Threats and technologies have evolved since the application was designed/deployed/developed
Security Tools that are Paranoid/Badly Tuned/Misunderstood
Security testing and auditing tools, by their nature, have to be flexible and able to work in most environments and at various levels of paranoia. Because of this, if they are not configured and interpreted with the specifics of your application in mind, they will often find a large number of issues, the majority of which are noise that should be ignored until the more important issues are fixed. If you have a serious, unauthenticated SQL injection that exposes plain-text credit card and payment details, you probably shouldn’t spend a moment’s thought stressing about whether your website allows 4 or 5 failed logins before locking an account.
Lack of Developer Security Awareness
Developers are human (at least in my experience!), and have all the usual foibles of humanity. They are affected by business pressures to release first and fix bugs later, with the result that security bugs may be de-prioritized as “no-one will find that” and so “later” never comes. Developers are also often taught about security as an addition rather than a core concept. For instance, when I was learning programming, I was first taught to construct SQL strings and verbatim webpage output, and only much later to use parameterized queries and HTML encoding. As a result, even though I know better, I sometimes find myself falling into bad practices that could introduce SQL injection or cross-site scripting, as the practices that introduce these threats come more naturally to me than the secure equivalents.
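The contrast between the two habits can be sketched with Python's built-in sqlite3 and html modules. The table, its contents, and the function names are invented for the example; the point is only the difference between string construction and the parameterized or encoded equivalent.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)


def lookup_unsafe(name: str) -> list:
    """The habit learned first: build the SQL string by concatenation."""
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def lookup_safe(name: str) -> list:
    """Parameterized query: the driver treats `name` purely as data."""
    return conn.execute("SELECT secret FROM users WHERE name = ?", (name,)).fetchall()


malicious = "nobody' OR '1'='1"
print(len(lookup_unsafe(malicious)))  # 2: every row leaks
print(len(lookup_safe(malicious)))    # 0: no user has that literal name

# The same idea for page output: encode rather than echo verbatim.
print(html.escape("<script>alert(1)</script>"))  # &lt;script&gt;alert(1)&lt;/script&gt;
```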
Threats and Technologies have Evolved Since the Application was Designed/Deployed/Developed
To make it even harder to manage security, many legacy applications are developed in old technologies which are either unaware of security issues, have no way of dealing with them, or both. For instance, while SQL injection has been known about for around 15 years, and cross-site scripting a little less than that, other threats are far more recent, such as clickjacking and CSS history stealing.
When an application was developed without awareness of a threat, it is often more vulnerable to it, and when it was built on a technology that was less mature in approaching the threat, remediating the issues can be far more difficult. For instance, try remediating SQL injection in a legacy ASP application by changing queries from string concatenation to parameterized queries (ADODB objects aren’t exactly elegant to use!).
Dealing with issues
Once you have found issues, then comes the daunting task of prioritizing, managing, and preventing their reoccurrence. This is the part that can bring the shock, and the part that can require the most care, as this is a task in managing complexity.
The response to issues requires not only looking at what you have found previously, but also what you have to do, and where you want to go. Breaking this down:
- Understand the Past – Deal with existing issues
- Manage the Present – Remedy old issues, prevent introduction of new issues where possible
- Prepare for the Future – Expect new threats to arise
Understand the Past – Deal with Existing Issues
When dealing with security reports, it is important to always be psychologically and organizationally prepared for what you find. As already discussed, this is often unpleasant and the first reactions can lead to dangerous behaviors such as overreaction (“fire the person responsible”) or disillusionment (“we couldn’t possibly fix all that!”). The initial results may be frightening, but flight is not an option, so you need to fight.
To understand what you have in front of you, and to react appropriately, it is imperative that the person interpreting the results understands the tools used to develop the application; the threats surrounding the application; and the security tool and its results. If your organization is not confident in this ability, consider getting outside help or consultants (such as Neohapsis) in to explain the background and context of your findings.
Manage the Present – Remedy Old Issues, Prevent Introduction of New Issues Where Possible
As with any software bug or defect, once you have an idea of what your overall results mean you should start making sense of them. This can be greatly aided through the use of a system (such as Neohapsis Security Manager) which can take vulnerability data from a large number of sources and track issues across time in a similar way to a bug tracker.
Issues found should then be dealt with in order of the threat they present to your application and organization. We have often observed a tendency to go for the vulnerabilities labeled as “critical” by a tool, irrespective of their meaning in the context of your business and application. A SQL injection bug in your administration interface that is only accessible by trusted users is probably a lot less serious than a logic flaw that allows users to order items and modify the price communicated and charged to zero.
Also, if required, your organization should rapidly institute training and awareness programs so that no more avoidable issues are introduced. This can be aided by integrating security testing into your QA and pre-production testing.
Prepare for the Future – Expect New Threats to Arise
Nevertheless, even if you do everything right, and even if your developers do not introduce any avoidable vulnerabilities, new issues will probably be found as the threats evolve. To detect these, you need to regularly have security tests performed (both human and automated), keep up with the security state of the technologies in use, and have plans in place to deal with any new issues that are found.
It is not unusual to find a frightening degree of insecurity when you first bring your applications into the world of security testing, but diving back to hide is not prudent. Utilizing the right experience and tools can turn being afraid of your own shadow into being prepared for the changes to come. After all, if the cloud isn’t on the horizon for your company then you are probably already immersed in it.