Photo by Greyson Joralemon on Unsplash

ODI Fellow research – Anonymous data: Emerging risks in the data economy

Tue Nov 15, 2022

This blogpost is written by ODI Fellow Georgia Meyer

Anonymised personal data shifts risks from privacy to purposes and demands that more attention is paid to the emerging ways of thinking about participation in the data economy

You can get in touch with Georgia by emailing g.meyer@lse.ac.uk

The UK government’s recently published Data Protection and Digital Information Bill (DPDI), the fate of which is now uncertain given recent turmoil, re-clarifies the already present General Data Protection Regulation (GDPR) position that anonymised personal data is not subject to data protection legislation. With increasing investment in Privacy Enhancing Technologies (PETs), it is important to consider what the implications are of a data economy increasingly fuelled by data processing that will arguably be divorced from oversight in many settings. If ‘privacy’ can, for argument’s sake, be ‘guaranteed’, what else is at stake?

This is the subject of my research as a new Research Fellow at the Open Data Institute (ODI) and as an MPhil/PhD student (Information Systems) at the London School of Economics and Political Science (LSE), supervised by Dr Edgar Whitley. In this blogpost, I set out some of the conceptual issues that may emerge from the increased use of anonymised personal information that follows from the wider deployment of PETs. The first part reviews the legal status of anonymised personal data in the DPDI (and GDPR). The second part addresses PETs in practice and their complex and contextual techno-organisational arrangements. The piece closes by teasing out some of the conceptual issues, such as forfeited agency, that may arise from an increasing reliance on PETs, and highlights efforts to stitch people back closer to the processes that determine the nature of the data collected about them and the purposes to which it will be put.

If you are keen to discuss this further, please email me at g.meyer@lse.ac.uk.

The legal status of anonymised personal data

The DPDI, which had its first reading in the House of Commons on 18 July 2022, reasserted that anonymised data is not subject to controller and processor obligations (set out in Chapters 12-20 of the DPDI and all data). Under the GDPR, anonymised personal data is not stitched to identifiable data subjects and therefore their data subject rights (as set out in ICO guidance) are forfeited. This is because the information, once anonymised, does not relate to an ‘identifiable living individual’. In light of this, it is reasonable to expect that there won’t be significant changes to this definition despite the new secretary at the Department for Digital, Culture, Media and Sport (DCMS) signalling that the government plans to revise parts of the DPDI before it is reintroduced to the Commons. For a full list of definitions in the Bill, see Table 1.

The Bill sets out that, in order for information to be considered ‘personal data’, it must relate to an ‘identified or identifiable living individual’ (Clause 101). It goes on to outline that identification can happen directly or indirectly. In both cases the text stipulates that identification might occur by ‘reasonable means’ at the time of processing – taking account of the technology available at the time. Legal expert Chris Pounder has commented: ‘In other words, non-personal data and anonymous data are both free of any data protection concern (i.e. no obligations arise from the UK GDPR). The Bill’s objective is to widen the scope of these two categories by narrowing the scope of those data that are classified as “personal data”.’

Assuming that one or more PETs do work in practice and anonymise data derived from individuals, then the Bill stipulates that: ‘The legislation does not apply to non-personal or anonymous data.’ (Clause 101). There may be other governance mechanisms at work in a given context to ensure that data is used responsibly, that re-identification does not occur, and that the data is being used in the ways that were outlined when it was collected. However, this still points to a number of issues about what it might mean to have a data economy fuelled by anonymised information.

PETs in practice

PETs are a range of statistical, hardware and cryptographic techniques that are being designed to enable data processing that does not disclose information relating to identifiable living individuals. However, they are by no means foolproof (re-identification has been shown to be possible in various cases) and there is a risk that their adoption could still result in exposure of the information of identifiable individuals. In spite of these concerns, PETs have long been seen as a potential means of enabling greater data sharing in a ‘privacy’-preserving manner, and are indeed already precipitating it. As such, they are seen by many as a key part of ushering in the benefits of a data-rich economy.
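To make the re-identification risk concrete, here is a minimal sketch of a so-called linkage attack, written in Python. Everything in it is hypothetical: the records, the field names and the `link` helper are invented for illustration and do not come from any real dataset or case. It simply shows how a dataset with names removed can still point to one person when a few remaining attributes are matched against other information.

```python
# Hypothetical "anonymised" release: names removed, other attributes kept.
released = [
    {"postcode": "AB1 2CD", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "AB1 2CD", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "EF3 4GH", "birth_year": 1984, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical auxiliary information that a third party might already hold.
auxiliary = [
    {"name": "Person A", "postcode": "AB1 2CD", "birth_year": 1984, "sex": "F"},
    {"name": "Person B", "postcode": "EF3 4GH", "birth_year": 1971, "sex": "M"},
]

# Attributes that are not names but can still single someone out in combination.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def link(released_rows, auxiliary_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, record) pairs where exactly one released record matches."""
    matches = []
    for person in auxiliary_rows:
        candidates = [
            row for row in released_rows
            if all(row[key] == person[key] for key in keys)
        ]
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((person["name"], candidates[0]))
    return matches


reidentified = link(released, auxiliary)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

In this toy case one of the three released records is uniquely matched. Whether a match like this counts as identification by ‘reasonable means’ is exactly the kind of contextual question the legal definitions leave open.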

In a given context, it may well be that a number of PETs are employed during various stages of data collection, management, sharing and analysis. For an overview of some of the most commonly used PETs see Table 2, and for a summary of the state of PETs see The Royal Society’s 2019 report ‘Protecting Privacy in Practice’ (an update is due for publication in December 2022). Increased resources are being devoted to accelerating the deployment of PETs to remedy some of the risks associated with the use of personal data. Long heralded as a route to protecting privacy in healthcare and smart city settings, PETs are increasingly being seen as a potential route to inscribe ‘privacy by design’ into digital collection, processing and application architectures. A recently launched US-UK PETs challenge speaks to the recent increase in attention that their potential is receiving.

It is important to acknowledge that there are of course contexts within which the deployment of PETs can support important research and development breakthroughs (though of course defining importance is a contestable endeavour in and of itself). Professor Alison Noble is spearheading research with The Royal Society and The Alan Turing Institute to carefully steward this emergent terrain. Indeed, there will be situations where PETs are used in conjunction with additional elements of sound data governance, for example a data cooperative that stewards purposes, alongside a technique like homomorphic encryption which can resolve concerns over identification. The Royal Society 2019 report notes: ‘there is no technology that replaces the need for good governance and proper business practice relating to the use of data.’
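For readers who want to see what computing on encrypted data looks like, the sketch below is a toy version of the Paillier scheme, one well-known additively (that is, partially) homomorphic scheme. It is an assumption-laden illustration only: the primes are tiny, the function names are my own, and nothing here is secure or taken from a real deployment. It shows two values being added while both remain encrypted.

```python
import math
import random

# Toy key generation with tiny primes: insecure, for illustration only.
p, q = 17, 19
n = p * q                      # public modulus
n_sq = n * n
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1), private
mu = pow(lam, -1, n)           # modular inverse of lam mod n (Python 3.8+), private


def encrypt(m, rng):
    """Encrypt an integer 0 <= m < n using the public modulus n."""
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:  # the random blinding value must be coprime to n
            break
    # With generator g = n + 1, g**m mod n**2 equals 1 + m * n.
    return ((1 + m * n) * pow(r, n, n_sq)) % n_sq


def decrypt(c):
    """Recover the plaintext using the private values lam and mu."""
    return ((pow(c, lam, n_sq) - 1) // n) * mu % n


def add_encrypted(c1, c2):
    """Multiplying two ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % n_sq


rng = random.Random(42)
c_a = encrypt(120, rng)  # hypothetical value held by one party
c_b = encrypt(75, rng)   # hypothetical value held by another party
total = decrypt(add_encrypted(c_a, c_b))
print(total)  # 195, computed without decrypting either input
```

The party doing the addition never sees 120 or 75, only ciphertexts. Who holds the decryption key, and what they are permitted to do with the result, is a governance question that the mathematics does not answer.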

At the same time, it is also important to note developments in large-scale digital experimentation techniques that utilise enormous amounts of digital behavioural and psychographic data about people with the express aim of extrapolating ‘population level insights’. For example, see a recent conference at MIT (NB the DPDI Bill also stipulates that processing personal data for statistical purposes, whereby the data generated is aggregate data, is not subject to data protections). In such settings, by virtue of the fact that anonymisation permits insight generation whilst simultaneously protecting privacy (and hence circumvents data protection), it is prudent to begin to consider what else is at stake.

Forfeited agency

Considering this question goes to the very heart of epistemic issues about how knowledge gets produced, by whom and in service of whom/what. It also raises the question of what reality-making properties all ‘knowledge’, in turn, comes to have. One lens through which to consider the material impacts of technologies is that of performativity. There is an interesting and growing body of work looking to account for the impact of predictions on outcomes in models (see Mendler-Dünner et al 2022 and Perdomo et al 2021), calling into question the nature of the causal relations between variables and their purported relative impacts on outcomes. The relationship between models as companions to and/or creators of realities is a subject also explored extensively in Dr Erica Thompson’s forthcoming book ‘Escape From Model Land’ (out in December 2022).

Performativity is a useful lens through which to consider the potential negative consequences of anonymised personal information. It enables us to begin to think through how people come to be divorced from input into the reality-making properties which their data comes to have. The sensemaking frameworks which give data meaning are constructed. These constructed meanings, or even more fundamentally the choice of variables to classify and measure in the first place, in turn come to shape how we interpret reality. Recent work from the ODI’s Experimentalism and the Fourth Industrial Revolution project highlighted this in the ‘Asimov and data mirrors’ piece looking at post-Brexit data policy. Gavin Freeguard’s provocation in this chapter of the project asks precisely this: “How do the methods and metrics we use for data policy impact assessment shape data policy outcomes?”

Considering what types of decision-making are lost when anonymised personal information circumvents legal protections raises questions about whether the privacy paradigm is robust enough to take account of the relational nature of data. Investigating the relational nature of data is the ambition of Connected by data, the new organisation of Dr Jeni Tennison, former CEO at the ODI, whose mission is to: ‘put community at the centre of data narratives, practices and policies by advocating for collective and open data governance’.

There is a slight shift in emphasis here, away from the language of data rights that is typically used to frame sound data governance and towards embedding community within narratives and practices. When data campaigning that advocates for people to have more control over the purposes to which ‘their’ data will be put is framed around data as owned and with rights, it can miss the relational aspect of data. Much of the language of data rights is enshrined in data protection legislation in obligations like purpose limitation, data minimisation, the right to be forgotten etc. A conversation remains to be had, though, about the extent to which an economic and legal lexicon around stable, classifiable, ownable and valuable ‘personal data’ is compatible with a relational concept of data and with emerging ideas about relational ethics.

Connected by data’s mission to put communities at the centre of data practices, and foreground relational impacts from data use, is also reflected in the enormous range of work around more localised community-centred data arrangements that place emphasis on collective agency in determining collection and purposes (which comprises initiatives around personal and non-personal data). See the work of Aapti Institute’s Data Economy Lab (especially the Data Stewardship Playbook and Data Stewardship Navigator), Mozilla Foundation’s Database of Initiatives: Alternative Data Governance in Practice, the ODI’s Data Institutions Register, and Pollicy’s work on data governance and rights.

What ties these examples together is the focus on narrowing the space between the people from whom data is collected and the decision-making processes about the purposes to which that data will be put, often articulated in terms of dimensions of value beyond simply market measures. Considering the risks of a data economy fuelled by anonymised data could therefore start with recognising that this scenario sets up an opposite force: one that increasingly separates individual and collective decision-making processes from how data is used in practice.

Growing the conversation

It strikes me that some of the consequences of this emergent terrain will start to pull at the salience of various frameworks for ‘protecting’ data, such as the conceptual robustness of privacy. Focused attention will also need to be devoted to thinking through how various PETs unfold in practice, with some of the above in mind.

I’m pursuing both of these themes in my research and would be delighted to hear from anyone else working in this technical field, as well as from PhD or postdoc researchers. It would be great to share ideas and find ways to collaborate on where and how conceptual issues meet technical and legal challenges.

Get in touch by emailing g.meyer@lse.ac.uk


Table 1 – Definitions in the Data Protection and Digital Information Bill (DPDI)

| Term / phrase | Definitions in the Bill and in the Bill explanatory notes | Reference |
| --- | --- | --- |
| Personal data | Clause 101: ‘The UK GDPR and the DPA 2018 apply to the processing of personal data’. Clause 101: Personal data ‘is defined as information relating to an identified or identifiable living individual.’ | |
| Identification | Clause 102: ‘... a living individual may be identifiable either directly or indirectly.’ | |
| Direct identification | ‘(3A) An individual is identifiable from information “directly” if the individual can be identified without the use of additional information.’ | 2 |
| Indirect identification | ‘(3B) An individual is identifiable from information “indirectly” if the individual can be identified only with the use of additional information.’ | 2 |
| Information being processed is information relating to an identifiable living individual only in cases. | ‘(2) The first case is where the living individual is identifiable (as described in section 3(3)) by the controller or processor by reasonable means at the time of the processing.’ | 2 |
| Information being processed is information relating to an identifiable living individual only in cases. | ‘(3) The second case is where the controller or processor knows, or ought reasonably to know, that— (a) another person will, or is likely to, obtain the information as a result of the processing, and (b) the living individual will be, or is likely to be, identifiable (as described in section 3(3)) by that person by reasonable means at the time of the processing.’ | 2 |
| Pseudonymisation | ‘(5) “pseudonymisation” means the processing of personal data in such a manner that it becomes information relating to a living individual who is only indirectly identifiable; but personal data is only pseudonymised if the additional information needed to identify the individual is kept separately and is subject to technical and organisational measures to ensure that the personal data is not information relating to an identified or directly identifiable living individual;’ | 3 |
| Anonymous data | Clause 101: ‘The legislation does not apply to non-personal or anonymous data.’ | |
| Statistical surveys or statistical results | ‘6. References in this Regulation to the processing of personal data for statistical purposes are references to processing for statistical surveys or for the production of statistical results where— (a) the information that results from the processing is aggregate data that is not personal data, and (b) neither that information, nor the personal data processed, is used in support of measures or decisions with respect to a particular individual.’ | 3-4 |
| Synthetic data | There is no mention of synthetic data directly in either the Bill or the accompanying explanatory notes. Definition from the Alan Turing Institute: A ‘synthetic dataset does not contain the exact datapoints of the original dataset (technically, it is possible that real datapoints are reproduced, but at random). Rather, synthetic data retains the statistical properties of the original dataset – or the ‘shape’ (distribution) of the original dataset.’ | |
| Customer data | ‘452. “Customer data” means information relating to the customer of a trader: customer data could, for instance, include information on the customer’s usage patterns and the price paid to aid personalised price comparisons.’ | 63 |
| Business data | ‘451. “Business” data is information about goods or services and digital content supplied by the relevant trader and includes information relating to customer feedback: business data could, for instance, include data about the products the trader offers and their prices (to enable price comparison).’ | 63 |

Table 2 – Most commonly used Privacy Enhancing Technologies (PETs)

| PET | How it works |
| --- | --- |
| Differential privacy | Differential privacy refers to a set of statistical techniques that involve ‘adding noise’ to a dataset to obscure individually identifiable attributes within it (Royal Society 2019: 28). Any noise added to a dataset reduces the accuracy of that dataset. But for large datasets a variable degree of accuracy can be forfeited in ways that preserve the meaning of the output of the computations on that dataset while ensuring that information relating to an identifiable living individual is not exposed. This variation in noise is mathematically determined, context-specific and is known as the ‘privacy budget’ (Royal Society 2019: 44). |
| Synthetic data | Synthetic data is typically created using algorithms; it mimics real data and can be used to train machine learning models. It is thought to be able to sidestep concerns over individual privacy given the information is not, in theory, derived from individuals in the first place. |
| Homomorphic encryption | Homomorphic encryption enables computations to be performed on encrypted data which yield the same result once the data is decrypted (McCarthy and Fourniol 2020). When used in the field, it tends to be what’s known as Partially Homomorphic Encryption (PHE) as opposed to fully homomorphic encryption which, due to its drain on server capabilities, is ‘inefficient in practice’ (Royal Society 2019: 32). |
| Secure multi-party computation | Secure multi-party computation (MPC) enables two or more entities to share data securely without revealing which entity shared what data. It is a technique that allows ‘computation or analysis on combined data without the different parties revealing their own private input’ and so can be used ‘when two or more parties want to carry out analyses on their combined data but, for legal or other reasons, they cannot share data with one another’ (Royal Society 2019: 38). Notably, secure multi-party computation can also be used by different parties to train their models on multiple datasets ‘without seeing each other’s unencrypted data’ (Royal Society 2019: 38). There is currently disagreement about whether personal data is still considered personal data (and at what point it is and isn’t) in contexts of MPC implementation. |
| Federated learning | Federated learning trains algorithms on decentralised edge devices or using data stored on local hard drives. The goal of federated learning is to increase the volume of data that a model can be trained on, improving the efficacy of a model, without exposing any of the source training data. ‘Deep learning’ models can therefore be trained without centralised stores of data, implying that, so long as model-reversal isn’t possible, training data is not processed by third parties and can remain with the controller that collected the data in the first place. |
| Personal data stores | Personal data stores are not typically considered a type of privacy enhancing technology. Rather, they are decentralised data stores that put individuals at the centre of decision-making and controls over the use of their data, and for this reason they are worth including. Individuals’ data is held in decentralised data stores and can be released for specific purposes and specific requests. Any released data may well be processed with an additional privacy preserving technology so that the individually identifiable information is still anonymised. Nonetheless, the individual from whom the data came is given more agency in determining the purposes to which their data will be put. |
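
The differential privacy entry in Table 2 can be made more concrete with a small sketch. The Python below uses the Laplace mechanism, one standard way of ‘adding noise’ to a counting query. The data, the function names and the choice of `epsilon` are all hypothetical and chosen only for illustration; `epsilon` here plays the role of the privacy parameter, with smaller values meaning more noise and stronger protection.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Answer 'how many values satisfy predicate?' with Laplace noise added.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1 / epsilon.
    """
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 29, 62, 57, 38, 44]  # hypothetical records

# The true answer is 4; each call returns a different noisy answer around it.
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(round(noisy, 2))
```

The trade-off described in Table 2 is visible here: a single noisy answer can be some way off the true count, while averages over many records or queries stay meaningful. How much noise is acceptable is a contextual judgement rather than a purely technical one.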