AI Ethics - Privacy and Security Issues
I work on a variety of themes in AI Ethics as an AI Ethicist at NordSec, a leading internet privacy and cybersecurity company. We are engaging with leading researchers to build an operational research program.
There is a dichotomy between privacy in a shared environment and privacy in a controlled environment. We argue that these two understandings differ substantially with respect to the application of machine learning (ML), since shared environments present non-zero-sum situations and are subject to data poisoning and similar attacks. Yet, we contend, such attacks can be justified if they serve to preserve personal privacy. Moreover, data poisoning is in many cases the only way to retroactively affect data that has been scraped, leaked, or otherwise acquired.
Privacy is a basic virtue: it is not reducible to, or grounded in, other virtues. Historically, privacy has been treated as an instrumental virtue for negative liberty (the freedom from constraints), especially in state-to-citizen relations: privacy was considered an instrument to ensure that the state does not interfere with a citizen’s negative freedom without legal ground. In recent years, however, and largely due to developments in ML, privacy has become a key issue not only in vertical relations (state-to-citizen) but also in horizontal relations (corporation-to-person and person-to-person). In the contemporary world, access to a person’s data provides a strong opportunity to manipulate, influence, and harm that person. Protecting a person’s privacy is therefore essential to human flourishing, because the loss of privacy results in direct harm. Thus, privacy protection does not need to be justified in terms of other virtues.
Since privacy is essential to human flourishing, it cannot be traded away for technological advancement or any other potential gain. One position in the technological age holds that because technology inherently causes privacy issues, we should abandon the notion of privacy altogether. Its proponents argue both that protecting privacy is impossible and that attempting to do so would hamper technological growth, reducing economic and other gains. However, since privacy is crucial to human flourishing and its loss would create significant risks of exploitation, this virtue cannot be given away for any potential gain.
Usually, privacy is understood as data protection: the fundamental right to control access to and use of one’s data. This definition, however, is too narrow. Once data is collected and stored, it can always be shared, leaked, transferred, aggregated, anonymized, and so on. Moreover, this kind of data protection often relies on terms-of-service agreements, which companies can easily manipulate. Such agreements rest on the faulty premise that people are fully informed, read the terms in full, and have sufficient background knowledge to evaluate the risks. This is simply not the case. Moreover, declining the terms of service might bar a user from a service that has become crucial to their social life and work. A person also grows and changes over the course of their life, so their decisions may change too. Yet once access to data is granted, the decision is effectively irreversible, given how data leaking, sharing, and storage currently work.
Thus, the definition of privacy must be expanded to include not only control of access to one’s data but also control over what kind of data is accessible in the shared environment. Data is commonly collected either from a controlled environment (in which a person chooses to provide specific information about themselves) or from a shared environment (in which an agent - usually a bot or a scraper - automatically gathers information about a person from an independent source). Under the expanded definition, privacy includes a person’s right to change, present, and manipulate information about themselves in both controlled and shared environments.
Examples of personal data acquired from a shared environment include public-space surveillance, third-party data scraping, database leaking and selling, and similar cases. Non-personal data is also subject to shared-environment collection and manipulation, for example in financial markets and GovTech. Models can be built on data collected from the shared environment. For example, Clearview AI built a facial recognition database covering billions of people by scraping their public social media profiles. The application is currently used by law enforcement to identify potential suspects and retrieve their names and addresses.
Similarly, competitors and government agents can use reconnaissance techniques to extract sensitive information from a shared environment. Adversaries may leverage publicly available information, or open-source intelligence (OSINT), about an organization to identify where or how ML is being used in a system and to tailor an attack accordingly. Such sources include technical publications, blog posts, press releases, software repositories, public data repositories, and social media posts. Adversaries may attempt to identify machine learning pipelines that exist on the system and gather information about them, including the software stack used to train and deploy models, training and testing data repositories, model repositories, and the repositories containing the algorithms themselves. This information can be used to identify targets for further collection, exfiltration, or disruption, or to tailor and improve attacks.
Shared-environment exploitation is essentially different from controlled-environment exploitation because the potential damage is virtually unlimited. For example, most two-player board games are strictly zero-sum, so deploying an adversarial policy there would be ethically permissible (if perhaps unsporting). Some important real-world settings, such as financial trading, also have an explicitly sanctioned adversarial element: it is legally permissible (and expected) for two professional market participants to make a trade each considers favorable even while expecting the counterparty to lose. At the same time, rules against market manipulation would, prima facie, forbid adversarial policies. This suggests two significantly different understandings of “adversarial policies” in zero-sum and open-ended environments.
Masquerading techniques make it possible to manipulate shared information in order to hide one’s true identity or attributes in the data models built by other agents. We argue that this right is extremely important in the current ML context because masquerading might be the only way to reverse data protection breaches. For example, once a person has granted access to their data, and the data has been collected, leaked, shared, or stored, and ML algorithms have been employed to build a model of the person, that decision is usually irreversible. Masquerading, however, can mitigate further damage or even disrupt already-built models by supplying additional data that distorts the current model.
Such masquerading techniques are usually classed as adversarial attacks, i.e., data poisoning. Adversaries may attempt to poison datasets used by an ML system by modifying the underlying data or its labels. They may also add their own data to an open-source dataset, which can create a classification backdoor: the adversary can then cause a targeted misclassification whenever certain triggers are present in the query, while the model performs well otherwise.
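To make the mechanics concrete, the trigger-based backdoor described above can be sketched in a few lines. The function below is a deliberately simplified illustration, not any particular published attack: the array shapes, the 3x3 trigger patch, and the poisoning rate are all assumptions chosen for clarity. It stamps a small trigger on a fraction of training images and flips their labels to an attacker-chosen target, so that a model trained on the poisoned set can learn to associate the trigger with that target class.

```python
import numpy as np

def poison_with_trigger(images, labels, target_label, rate=0.1, seed=0):
    """Return poisoned copies of (images, labels) plus the poisoned indices.

    images: float array of shape (n, h, w), values in [0, 1]
    labels: int array of shape (n,)
    A fraction `rate` of samples receive a 3x3 white trigger patch in the
    bottom-right corner and have their label flipped to `target_label`.
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n = len(images)
    poisoned = rng.choice(n, size=int(n * rate), replace=False)
    for i in poisoned:
        images[i, -3:, -3:] = 1.0   # stamp the trigger patch
        labels[i] = target_label    # mislabel so the model learns the backdoor
    return images, labels, poisoned

# Illustrative usage on synthetic data
imgs = np.zeros((20, 8, 8))
labs = np.arange(20) % 2
p_imgs, p_labs, idx = poison_with_trigger(imgs, labs, target_label=1, rate=0.25, seed=42)
```

A model trained on `(p_imgs, p_labs)` behaves normally on clean inputs but can be steered toward `target_label` by anyone who knows to place the trigger patch in a query.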
However, we argue that the application of such techniques can be part of the right to privacy, since it is the only way to reverse privacy damage that has already been done. For example, there are algorithms that allow individuals to distort the models Clearview AI builds of them by adding to social media pictures of themselves processed with special filters. We argue that such data poisoning can be ethically justified on privacy grounds.
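As a rough illustration of how such a filter operates: cloaking tools such as Fawkes compute an imperceptible perturbation, optimized against a face-embedding model, and apply it to a photo before it is shared. The sketch below is a simplified stand-in under stated assumptions - it substitutes bounded random noise for the optimized perturbation, and the function name and epsilon budget are illustrative - but it shows the shape of the operation: a small, visually negligible change to every pixel before publication.

```python
import numpy as np

def cloak_image(image, epsilon=0.03, seed=0):
    """Apply a small, bounded perturbation to an image before sharing it.

    Real cloaking tools optimize this perturbation against a face-embedding
    model so the learned representation shifts while the photo looks the same;
    here bounded random noise stands in for that optimization step.

    image: float array, values in [0, 1]
    epsilon: maximum per-pixel change
    """
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-epsilon, epsilon, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

# Illustrative usage: a flat gray "photo"
img = np.full((4, 4), 0.5)
out = cloak_image(img, epsilon=0.03, seed=1)
```

The per-pixel budget `epsilon` is what keeps the change invisible to humans while still supplying distorted data to any model scraped from the shared environment.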