Using secure pseudonymous identifiers to protect identification numbers

By now you probably have read of the Equifax data leakage. This reminded me of the idea of secure pseudonymous identifiers I had been thinking on for some time. Secure pseudonymous identifiers make use of cryptography to make it hard or impossible to recover the original identifiers representing a specific person. To be sincere, I wouldn’t be surprised if somebody else came with this idea first and named it something else since it’s quite simple. Sadly, I really haven’t had the time to check that out.

In this article I will introduce the concept of identifiers then pseudonymous identifiers and finally secure pseudonymous identifiers. I will also explain why they are a very useful technique to deter the risk of leakage on data whose only purpose is identifying an individual like U.S.A.’s SSN numbers or Spanish ID card numbers among others.

An identifier is a piece of data that makes it possible to distinguish a person from others, for example, your full name is likely to identify you clearly in local ambits like your friends, family or work; in a way similar your SSN or tax number identifies you uniquely to your country’s tax office (and most likely a lot of other organisms too). Identifiers are important because they make it possible to specify to which person, something (for example, data, debts, taxes, money, things) belongs. This is also the case on most computer systems as you need to map data to specific persons. Identifiers are intended to identify (i.e. say who you are) and not authenticate (prove that you are who you said you are).

Pseudonymous identifiers are basically identifiers that identify a person in a way in which just knowing the identifier doesn’t allow to determine which person it identifies. For example your nickname is one because unless you tell somebody you use it it can’t be mapped directly to you. Usually pseudonymous identifiers are easier to change, you can always create a new account under a different nickname on your favorite social network if you want to, but changing your tax number isn’t that easy.

Secure pseudonymous identifiers try to get both the uniqueness of the usual government issued identifiers and the difficulty of finding who they belong to of pseudonyms. This happens because pseudonymous identifiers make use of a key without which it is impossible to verify who they identify. From a mathematical point of view, a secure pseudonymous identifier is the result of a function that mixes together the key with the identifier. Unlike other techniques like password hashing where the salt must be different for every single person, the key used with a secure pseudonymous identifier must be kept secret and be the same for every user, this is because otherwise finding the data set associated with a specific identifier will require processing half of the identifiers on the database on average whilst, when using the same key, searches on the database become possible.

The choice of function though allows for different security properties. For example, if we use a block cipher to combine the identifier with the key, everybody with access to the key will be able to enumerate all the identifiers on the DB just by decrypting them. If identifier enumeration is only needed eventually, it’s preferable to use some public key approach (which lowers their security a bit) but allows you to use the public key to encrypt the identifiers. On the other hand, if we use a cryptographic hash function the only way to enumerate the identifiers is by testing every possible one with the key they are encrypted with, similarly, when using a strong key stretch function, the time to enumerate the identifiers will increase accordingly. Keep in mind though that in any case bruteforce will always be trivial to parallelize.

In order to be secure, the function used has to make it impossible for an attacker to find out the key or identifier used given the function’s result and to find the original identifier if the key is unknown. The function still has to make it easy to find the pseudonymous identifier associated with a specific identifier if the key used is the right one. Also the function must keep the possibilities of finding two identifiers generating the same pseudonymous identifier as low as possible. Additionally, unless the ability to enumerate the original identifiers is required, the function should also be one-way thus making it impossible to find out the identifier associated with a pseudonymous identifier if the key is known. Of course, one-way functions can be broken, if the key is known, by trying all the possible identifiers and seeing which is the correct one.

So, which functions could be used here? Well, if you need to enumerate the original identifiers; a block cipher that either has a block size equal or larger than the identifier (with static padding of the identifier to match the block size) or in CBC mode (with the IV being part of the key) with CTS is good enough, if using CBC remind you will have to use the same IV for all entries (which can be an added part of the secret key). Otherwise, you can use: cryptographic hash functions where the key is used as the initial state; cryptographic hash functions in HMAC mode, with the HMAC key being the secret key; or just as is prepending the key to the identifier. Also, some password stretching functions with the key as SALT can be used as long as all parameters are the same, for example PBKDF-2 with a constant number of iterations. As usual though, the best solution may involve a combination of public key cryptography and key stretching as you’ll find out on the example at the bottom.

At all times, the key and the data should be kept on separate places as, once the attacker finds the key, he can run a bruteforce attack, or just perform the enumeration if it is allowed, to map the secure pseudonymous identifiers back to the original identifiers they represent.

So let’s see a more realistic example. At a bank they are required to report all movements belonging to all the tax numbers once a year for taxing purposes. Other than, customers need to identify and authenticate themselves at branch-offices (providing their tax number and a proof of ID) and when using the online services (providing their password). The bank decided to store the data of the users mapped to the 32 bytes PBKDF2 result (with SHA256-HMAC and 1000000 iterations) of their tax number including the identifier encrypted using a public key approach (with RSA-4096 and a randomly generated AES key which is encrypted with it and used to encrypt the identifier using AES-CBC-XTS). When data needs to be sent to the tax office, the private key matching the one used is taken from the safe and used on an air gapped system along with the reports using the RSA encrypted identifiers, the system then decrypts these and fills them on the form which are then sent to the tax office. On the other hand the PBKDF2 system keeps the key inside a small secure microcontroller connected using the serial port that will only accept as input an identifier and return as output the PBKDF2 result for it with the key it contains, therefore, when a customer wants to use the Internet banking services or the branch offices’ one he only needs to provide his tax identifier (and proof of his identity like the ID cards or passwords) to do so. Once the microcontroller hashes the user’s identifier it’s trivial to match it against the one identifying the user dataset to verify he is who he says to be. Also it is easy to attach the encrypted identifier to the user’s session to keep track of him as he uses the online banking services.

Would this approach have worked on something like Equifax breach? In part, certain data like Full Names you want to keep on your database for things like sending mail to the users, and this technique wouldn’t have protected that unless the public key approach was used which would make it impossible to show this data to legitimate users easily. But, if the users’ SSNs were hashed using something like PBKDF2 with a key held on an external system, the only way in which the attackers (with a web server exploit) could have gained the SSNs would have been by accessing this second device and trying every possible SSN to find out which anonymous identifier is associated to them.