
If the digital workplace is being driven by data, and increasingly tough regulation is forcing enterprises to anonymize that data, it would stand to reason that many of the cases currently underway over General Data Protection Regulation (GDPR) violations could be shelved, provided it could be absolutely guaranteed that data cannot be de-anonymized.

But data is big business for the tech companies that serve the marketing firms that have been buying data for a competitive edge. While we may not like the idea that our personal data is harvested in the first place, by accepting the terms and conditions on most of the websites we visit, we theoretically agree that at least some of our personal data will be used. The understanding, or at least the hope, is that our data will remain anonymous.

Aspirational Data Privacy

However, if you’ve always suspected the promise of data anonymity was aspirational rather than real, you were correct. New research from a team at the Catholic University of Louvain in Belgium and Imperial College London, published in Nature Communications, shows that with a relatively simple machine learning algorithm and a handful of data points from a dataset, it is possible to re-identify a so-called anonymous data contributor.

Using machine learning, the researchers developed a system to estimate the likelihood that a specific person could be re-identified from an anonymized data set containing demographic characteristics. They found that over 99% of Americans could be correctly re-identified from any dataset using 15 demographic attributes, including age, gender and marital status. “While there might be a lot of people who are in their thirties, male and living in New York City, far fewer of them were also born on Jan. 5, are driving a red sports car and live with two kids (both girls) and one dog,” said Luc Rocher, a PhD candidate at Université Catholique de Louvain and the study’s lead author, in a CNBC report.
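
The researchers trained a generative model to score how identifying any given combination of attributes is, but the underlying intuition can be shown with a much simpler measurement: count how many records in a dataset are the only match for their combination of quasi-identifiers. Below is a minimal Python sketch along those lines; the synthetic data, column names and attribute choices are illustrative assumptions, not the study's actual method or data.

```python
# Minimal sketch: how quickly do combinations of demographic attributes
# single people out? Synthetic data; NOT the study's generative model.
import numpy as np
import pandas as pd

def uniqueness(df: pd.DataFrame, attributes: list) -> float:
    """Fraction of records that are the only match for their
    combination of values on the given attributes."""
    sizes = df.groupby(attributes)[attributes[0]].transform("size")
    return float((sizes == 1).mean())

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "gender": rng.choice(["F", "M"], n),
    "marital_status": rng.choice(["single", "married", "divorced", "widowed"], n),
    "zip3": rng.integers(100, 1000, n),  # first three digits of a ZIP code
    "birth_month": rng.integers(1, 13, n),
})

# Each added attribute sharply increases the share of unique records.
attrs = []
for col in df.columns:
    attrs.append(col)
    print(f"{len(attrs)} attribute(s): {uniqueness(df, attrs):.1%} unique")
```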

The findings are not entirely new. In fact, one of the researchers from Imperial College, Yves-Alexandre de Montjoye, showed in a 2015 report that anonymized credit card datasets can easily be traced back to individuals using a small number of data points within the data. In the report abstract, de Montjoye wrote, "We study 3 months of credit card records for 1.1 million people and show that four spatiotemporal points are enough to uniquely re-identify 90% of individuals. We show that knowing the price of a transaction increases the risk of re-identification by 22%." So what does this mean for data?
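
The mechanics of such a linkage attack are straightforward to sketch: given a few known points about a target, say where and when they made a purchase, an attacker intersects the sets of people matching each point. The snippet below is a hedged illustration with made-up field names ("user_id", "shop", "day"), not the study's actual methodology.

```python
# Sketch of a linkage attack on "anonymized" transaction data: intersect
# the sets of users matching each known (shop, day) point. Field names
# and data are illustrative assumptions.
import pandas as pd

def candidates(txns: pd.DataFrame, known_points: list) -> set:
    """Pseudonymous IDs whose history contains every known point."""
    matches = None
    for shop, day in known_points:
        users = set(txns.loc[(txns["shop"] == shop) & (txns["day"] == day), "user_id"])
        matches = users if matches is None else matches & users
    return matches if matches is not None else set()

txns = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "shop":    ["bakery", "garage", "bakery", "cinema", "bakery", "garage"],
    "day":     ["Mon", "Tue", "Mon", "Tue", "Mon", "Wed"],
})

# Two observed points already narrow the field to a single person.
print(candidates(txns, [("bakery", "Mon"), ("garage", "Tue")]))  # {'u1'}
```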

Privacy by Design

Jahia, which develops open-source content management and digital experience applications, recently joined a number of American companies and government agencies collaborating to help define the international standard for "Consumer Protection: Privacy by Design."

The standard is part of ISO Project Committee 317. As one of 14 countries with Participant status in ISO/PC 317, the United States is represented by its Technical Advisory Group (TAG), administered by the American National Standards Institute (ANSI) in partnership with the OASIS standards and open source consortium.

At the time, Elie Auvray, co-founder and head of business development at Jahia, wrote: "Data privacy is the most important issue facing technology providers today. ...Data privacy is not a constraint, it will become a key lever of growth for any digital enterprise."

Felix Sebastian, managing editor at Termly, which provides advice to business owners and digital professionals on how they can best comply with the changing legal landscape of data privacy, said that with continual advances in computing technology, keeping data truly anonymous is becoming a near impossibility. “Much like other complementary technologies, encryption and decryption methods are in a constant battle to keep ahead of the other, resulting in a cat and mouse game for those involved,” he said.

De-identification methods were largely pioneered in the medical research field, with best practices in this field subsequently being adopted by other disciplines. However, the medical research field has seen several instances of de-identified data being re-identified — risking the privacy of millions of patients.

With studies already reporting successful re-identification efforts in 2017 and 2018, the latest report, published in July 2019, is surely not going to be the last.

Awareness of data privacy issues has increased in light of legislation such as the GDPR and the California Consumer Privacy Act (CCPA) and fines being handed out to corporate giants such as Google, Facebook and British Airways. If public faith in de-identification methods erodes, it could hinder medical and statistical research that relies on human participants. “To assume any public data in the current era is truly de-identified and anonymous would be foolish. Good faith efforts by researchers and practitioners may not be enough. Only strict and massive legal ramifications for data breaches and unauthorized data processing can help maintain a semblance of data privacy in the digital era,” Sebastian added.


Aspirational Anonymized Data

Although it's great that laws and policies require anonymized data and are being updated to address current flaws, a successful release of truly anonymized information as a timely dataset has yet to happen, said Allan Buxton of Secure Forensics.

As far as technology research goes, information collected is irrevocably tied to account identifiers (advertising IDs, user accounts, device IDs, etc.) and even more personal information, such as location data.
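
A common industry answer is to hash those identifiers, but a hashed ID is a pseudonym, not anonymity: the token stays stable across records, so the data remains linkable, and a single external clue can unmask everything tied to it. A simplified sketch (the fixed salt and field name are assumptions):

```python
# Sketch: "anonymizing" a device ID by hashing. The token is stable, so
# every record for that device can still be joined together, and one
# external linkage re-identifies them all. A fixed salt is a simplification.
import hashlib

SALT = b"example-static-salt"

def pseudonymize(device_id: str) -> str:
    """Replace a raw identifier with a stable, opaque token."""
    return hashlib.sha256(SALT + device_id.encode()).hexdigest()

# Same device -> same token, across any number of datasets.
print(pseudonymize("ad-id-1234") == pseudonymize("ad-id-1234"))  # True
```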

Since no one has found a method for successful anonymization, perhaps the better approach is to require that collected data be treated securely, and that findings gleaned from it be reviewed for personally identifying information prior to publication, said Jeff McGehee, a senior data scientist and Internet of Things (IoT) practice lead at Very, an IoT design and development firm.
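
In its simplest form, such a pre-publication review might scan outgoing text or datasets for obvious identifiers before release. The pattern list below is a bare-bones assumption; production reviews combine far more signals than a few regular expressions.

```python
# Minimal sketch of a pre-publication PII scan: flag obvious identifiers
# before findings are released. Patterns are illustrative, not exhaustive.
import re

PII_PATTERNS = {
    "email":    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_phone": re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"),
    "us_ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_for_pii(text: str) -> dict:
    """Return all matches per pattern; an empty dict means a clean pass."""
    return {name: hits for name, pat in PII_PATTERNS.items()
            if (hits := pat.findall(text))}

print(scan_for_pii("Results: contact jane.doe@example.com or 555-867-5309."))
```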

He told us that rendering data truly anonymous will be extremely challenging, even virtually impossible. As technologies such as IoT expand to include everything from smart light bulbs to smart cars, data collected on any particular individual can include when they leave their home, sound clips from smart speakers, or the contents of their fridge.

“While arguably some of this data sounds less serious than, say, medical records, what this shows is how easy it is to glean literally tens of thousands of data points about an individual user,” he said.

“Regulations like GDPR can help minimize obvious identifiers, but with the anticipated growth of connected devices, standards are not likely to keep pace. That’s why the emphasis on security (such as California’s cybersecurity law covering smart devices — the first of its kind in the U.S.) is critical to securing personal data — and could also be a springboard for other emerging technologies.”

The Role of GDPR

Last year, GDPR went into effect and introduced a framework to ensure that digital consumers would be protected from data harvesting. It fundamentally changed the way consumer data is collected, stored and activated. At the time, Bruce Orcutt, senior vice president of global marketing at ABBYY, a global provider of content intelligence solutions and services, explained that one of the major issues with GDPR is identifying and anonymizing data that may break the rules.

“At the root of GDPR is personal data that directly or indirectly identifies a natural person in any format,” he said. “It mandates that organizations cannot keep data and content forever and advocates better records management and strong information governance. That, however, is where the compliance challenge lies: information is locked inside of documents.”

Companies are turning to cognitive robotic process automation (RPA), which combines advanced technologies such as natural language processing, artificial intelligence (AI), machine learning and data analytics to mimic human activities such as perceiving, inferring, gathering evidence, hypothesizing, reasoning and interacting with human counterparts. But it is still a struggle to keep data anonymized, and it is likely to get more difficult as the technologies used to mine data grow more sophisticated.
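
As a rough illustration of the NLP piece of that pipeline, an off-the-shelf named entity recognizer can surface likely personal data buried in free text, the first step before anything can be redacted or governed. The sketch below uses spaCy's small English model as an assumption; a real compliance pipeline layers many more detectors on top.

```python
# Hedged sketch: use off-the-shelf NER to surface likely personal data
# locked inside free-text documents. Requires:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# Entity labels that often point at a natural person, given GDPR's broad
# definition of personal data. This label set is an assumption.
PERSONAL_LABELS = {"PERSON", "GPE", "LOC", "ORG", "DATE"}

def flag_personal_data(text: str) -> list:
    """Return (entity, label) pairs worth a human reviewer's attention."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents
            if ent.label_ in PERSONAL_LABELS]

print(flag_personal_data(
    "John Smith of Acme Corp renewed his policy in Brussels on Jan. 5."
))
```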