[Ed Note: The following post is part of the TLF Editorial Board Test 2019-20. It has been authored by Gitika Lahiri, a third year student of NALSAR University of Law.]
The present Data Protection Bill introduced by the Ministry of Electronics and Information Technology aims to change the manner in which data is stored and processed in India. In doing so, it regulates the personal data of data principals collected by data fiduciaries and the government. Fiduciaries have a set of obligations that they must follow while processing data. These include the transparency and accountability measures under §31, which state that data fiduciaries and processors must implement necessary safeguards, including de-identification and encryption of data. However, the effectiveness of such a policy and its application in India is uncertain.
De-identification refers to the process of removing all personally identifiable information (data identifiers) from the data that is collected. It is typically done by replacing the identifiers with category names, symbols, random values or generic data. It is considered essential since it allows organisations to reap the benefits of data collection without excessive legal scrutiny. However, if implemented without a thorough analysis of the data, it leads to cases of re-identification and linkage attacks. A linkage attack occurs when records in one data set are linked with similar records in another data set to reveal the identity of the data subject. Standard measures of de-identification such as salting and hashing are not sufficient to ensure the complete protection of data and the anonymity of data principals.
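As a rough illustration, the following Python sketch shows what replacing a direct identifier with a salted hash might look like for a single record. The field names and the salted-hash approach are assumptions made for illustration, not a technique prescribed by the Bill:

```python
import hashlib
import secrets

# A per-dataset salt. Illustrative only: in practice the salt must be
# stored securely, since anyone who obtains it can recompute the mapping.
SALT = secrets.token_bytes(16)

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a salted SHA-256 digest."""
    return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()

record = {"name": "Asha Rao", "zip": "500033", "diagnosis": "asthma"}
deidentified = {**record, "name": pseudonymise(record["name"])}
print(deidentified)
```

Note that the zip code and diagnosis pass through untouched: hashing the name alone leaves exactly the residual attributes that the attacks discussed below exploit.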
Present Flaws with De-identified Data
1) Smaller Data Pools
Despite the key markers of a person’s identity being removed, certain risks still exist. Data may inadvertently reveal the identity of a person despite data masking steps being in place. When the pool of data is relatively small, it is easier to identify people through certain unique markers. For example, in data collected in the field of education, smaller rural communities often have very few students of colour, which can lead to their identities being revealed.
The most commonly discussed linkage attack was carried out by an MIT student doing her graduate work in the 1990s. Latanya Sweeney was able to re-identify the medical records of the Massachusetts Governor William Weld. At the time, the state of Massachusetts was distributing a research dataset containing de-identified insurance reimbursement records of Massachusetts state employees who had been hospitalised. To protect the employees’ privacy, their names were stripped from the dataset, but their date of birth, zip code, and sex were preserved to allow for statistical analysis. Thus, information that is often provided for demographic analysis becomes potentially dangerous.
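Mechanically, Sweeney’s attack reduces to a single join on the shared columns. The Python/pandas sketch below uses invented records to show how it works; the voter-roll framing mirrors her use of publicly available voter registration data:

```python
import pandas as pd

# Hypothetical "de-identified" hospital records: names stripped, but
# date of birth, zip code and sex retained for statistical analysis.
medical = pd.DataFrame([
    {"dob": "1945-07-31", "zip": "02138", "sex": "M", "diagnosis": "cardiac"},
    {"dob": "1962-03-14", "zip": "02139", "sex": "F", "diagnosis": "asthma"},
])

# A hypothetical public voter roll carrying the same three attributes.
voters = pd.DataFrame([
    {"name": "W. Weld",  "dob": "1945-07-31", "zip": "02138", "sex": "M"},
    {"name": "J. Smith", "dob": "1980-01-02", "zip": "02139", "sex": "M"},
])

# Joining on the shared quasi-identifiers re-attaches names to the
# "anonymous" medical records.
linked = medical.merge(voters, on=["dob", "zip", "sex"])
print(linked[["name", "diagnosis"]])
```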
2) Larger and High-dimensional Data Pools
In other instances, a vast amount of data from a single source allows researchers to narrow down to particular individuals. This was witnessed during the AOL search data leak, which was considered a ticking time-bomb of privacy. AOL published the search histories of its users over three months, after removing names and obvious ‘markers’, for the benefit of the academic community. Despite this basic removal of identifiable data, personally identifiable information remained available, since one could read all the searches made by a single ID, making it fairly simple to uncover the identity of the data subject. Further, when de-identified data is high-dimensional, i.e., each record contains many distinct attributes, it becomes much easier to identify specific individuals. An increased number of variables reduces the number of people who share any given combination of values, shrinking the groups into which individuals can blend.
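The dimensionality point can be made concrete in a few lines of Python (the dataset is invented, not real figures): as quasi-identifier columns are added, the smallest group of indistinguishable records shrinks until some individuals become unique.

```python
import pandas as pd

# Invented records: every added quasi-identifier column splits the
# population into smaller groups.
df = pd.DataFrame({
    "zip": ["500033"] * 4 + ["500034"] * 4,
    "age": [34, 34, 51, 51, 34, 34, 51, 51],
    "sex": ["F", "M", "F", "M", "F", "F", "M", "M"],
})

# Size of the smallest group sharing identical values; once it reaches 1,
# at least one record is unique and exposed to re-identification.
for cols in (["zip"], ["zip", "age"], ["zip", "age", "sex"]):
    k = df.groupby(cols).size().min()
    print(cols, "-> smallest group size:", k)
```

Run on this toy data, the smallest group size falls from 4 to 2 to 1 as each column is added, which is exactly the mechanism by which high-dimensional data defeats naive de-identification.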
3) Binary Lens
De-identified data is also looked at primarily through a binary lens: either data is considered de-identified and therefore completely safe, or it is considered identified data. However, this way of looking at data can be fundamentally problematic. The standard for anonymisation in other jurisdictions is usually that the process must be completely “irreversible and identifying the data subject must no longer be possible”.[1] However, this standard is fairly expensive and difficult to implement and tends to yield sub-optimal results. Such high standards would disincentivise organisations from attempting to anonymise their data at all. In contrast, India presently does not have any tangible regulations on what constitutes de-identified data or the manner in which de-identification must be carried out.
Risk-Based Systems of De-identification
In response to these criticisms of de-identified data, new risk-based methods of analysis have developed that could address the problems with the earlier ideal of complete de-identification. Newer models suggest that data be categorised based on the risk it poses, with the time and investment in de-identification varying accordingly.
Data can be divided into direct identifiers and indirect identifiers, also called quasi-identifiers.[2] Direct identifiers are those which can be used to identify particular individuals without cross-referencing any additional data that exists in the public domain. These are removed simply through suppression or replacement. In contrast, indirect identifiers help connect pieces of information so as to single out a particular person. These could include information such as gender, age, postal codes or other demographic data. Quasi-identifiers are much harder to remove since they are often important to the data set for analysis. They are often addressed by generalising values, swapping data between records, or perturbing the data by adding noise, as sketched below.
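A minimal Python sketch of two of these treatments, generalisation and perturbation, on a single invented record (the field names, masking depth and noise spread are illustrative choices, not recommended parameters):

```python
import random

def generalise_zip(zip_code: str, digits_kept: int = 3) -> str:
    """Generalise a postal code by masking its trailing digits."""
    return zip_code[:digits_kept] + "*" * (len(zip_code) - digits_kept)

def perturb_age(age: int, spread: int = 2) -> int:
    """Perturb an age by adding small uniform random noise."""
    return age + random.randint(-spread, spread)

record = {"zip": "500033", "age": 34, "sex": "F"}
treated = {
    "zip": generalise_zip(record["zip"]),  # generalisation
    "age": perturb_age(record["age"]),     # perturbation / noise
    "sex": record["sex"],                  # retained for analysis
}
print(treated)  # e.g. {'zip': '500***', 'age': 33, 'sex': 'F'}
```

The design trade-off is visible even here: each treatment degrades analytical precision slightly, which is why quasi-identifiers are transformed rather than deleted outright.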
A risk-based system suggests that legal regulation can be of varying degrees, depending upon the identifiability of the data. The parameters are calibrated in terms of the organisation’s safeguards and controls, as well as the data’s sensitivity, accessibility and permanence. This system does not use a dichotomous framing for understanding and analysing data, instead suggesting that data be viewed through multiple categories with varying levels of legal and technical boundaries. It also suggests creating policies that incentivise organisations to ensure that there is no explicit identification, while maintaining the utility of the data itself.
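One way to picture such calibration is a simple scoring function over the factors the model names. The following sketch is purely hypothetical, with invented weights, scales and tiers, meant only to show the shape of a non-binary assessment:

```python
# Illustrative only: scores on a 0-3 scale and the thresholds below are
# invented; a real framework would define these through regulation.
def risk_tier(sensitivity: int, accessibility: int,
              permanence: int, safeguards: int) -> str:
    """Stronger organisational safeguards reduce the overall risk score."""
    score = sensitivity + accessibility + permanence - safeguards
    if score <= 2:
        return "low risk: light-touch obligations"
    if score <= 5:
        return "medium risk: contractual and access controls"
    return "high risk: enclave-style access only"

# Sensitive health data, moderately accessible, with strong safeguards.
print(risk_tier(sensitivity=3, accessibility=2, permanence=2, safeguards=4))
```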
The risk-based model also considers the nature of the contract or agreement through which the data was obtained in ascertaining the risk associated with the data and the manner in which it should be protected. The typical models for the release of data include: the “Release and Forget” model, where data is published publicly or made available on the internet; the “Data Use Agreements” model, where data is provided under legally binding contracts detailing how data may and may not be used (typically either in a negotiated agreement with a “qualified investigator” or via “click-through” license agreements); and the “Enclave” model, where the data is “kept in some kind of segregated enclave that accepts queries from qualified researchers, runs the queries on the de-identified data, and responds with results.”
Two forms of non-technical safeguards and controls are internal administrative and physical controls (internal controls) and external contractual and legal protections (external controls). Internal controls could include security policies, access limits, employee training, data segregation guidelines, and data deletion practices that prevent leaks of confidential information to the public. By implementing such administrative safeguards, organisations provide important privacy protections independent of technical de-identification.
In addition, another method of de-identification/anonymisation of data is the Anonymization Decision-Making Framework (ADF). The ADF functions on the principle that analysing data in a vacuum is not sufficient to assess the level of protection a particular data set requires; the analysis is a context-dependent process. To properly assess the risk associated with any data, one also needs to analyse the environment into which the data set will be introduced. Re-identification does not take place due to the data alone, but also due to the interaction of that data with other information that exists in the public domain, and it is therefore crucial to analyse that environment as well.
Conclusion
The Data Protection Bill does not refer to any specific measures for de-identification with which data fiduciaries or processors would have to comply. As a result, data fiduciaries could be lax while still claiming that they have complied with the necessary measures. However, if stricter regulations for data de-identification are put in place, it could lead to the development of an effective system that protects the right to privacy of data principals. Despite not being foolproof, it could enable the sharing and maintenance of research data in an easier and safer manner. It is now accepted that zero risk is an unrealistic goal when de-identifying data, since the potential for harm always exists. Therefore, the aim is to balance the risk and the utility of the data being used when considering the degree of protection that should apply to it.
De-identification and anonymisation of data may become extremely effective steps towards ensuring the right to privacy. However, to ensure that data fiduciaries and processors invest in and meaningfully comply with the process of de-identification, which is elaborate and time-consuming, the Bill must include provisions specifying concrete measures to ensure that corporations handle the personal data of individuals with care.
[1] Article 29 Data Protection Working Party, Opinion 05/2014 on Anonymisation Techniques (WP216).
[2] Garfinkel, S. L. (2015). De-Identification of Personal Information (NISTIR 8053). doi:10.6028/NIST.IR.8053