Data privacy is essential, and yet it can be a significant obstacle in machine learning. AI development today is dominated by data-intensive algorithms that depend on data siloed across millions of owners, and much of that data may be inaccessible because of its private content. Unless we develop algorithms that are either less data intensive or able to reach these disparate data sources, we risk stalling innovation.
The AI Security and Privacy Solutions team at IBM Research – Almaden is committed to providing techniques that enable the collaborative training of machine learning models without transmitting data to a central place, while ensuring that inference attacks are prevented. In this blog post, we provide an overview of one of the techniques we propose to achieve this objective, and we hope you attend our presentation at AISec 2019, co-located with ACM CCS, for more details [1][2].
What’s federated learning?
Federated learning is an emerging practice in which data scientists can satisfy privacy requirements by training models on local data and sharing model parameters rather than raw data. Our animated demo here showcases some of the advantages of using federated learning. Let’s use this blog post to walk through a more technical description of the basic federated learning process.
Figure 1: Vanilla Federated Learning Framework [9]
In this diagram, we illustrate the primary functionality of federated learning. At a basic level, the framework consists of a centralized Aggregator (A), and multiple parties (Pi), each with its own unique dataset (Di). A high-level description of federated learning can be summarized as follows:
- The Aggregator sends a query (Q) to all or a subset of participating parties {P1, P2, …, Pn}. The query requests parties to provide information based on their own dataset, for example, requesting updated model parameters after several epochs of local training
- After receiving the query, each party runs the required functions on their respective data, creating the replies {R1, R2, …, Rn}. For example, each party may train for a single epoch and share the current model parameters
- Each party sends replies {R1, R2, …, Rn} back to the centralized Aggregator
- The Aggregator combines the information in the replies received from the parties
- Once combined, the aggregator updates the global model (M) based on the combined information and issues the next query to parties who then continue with the next training step
This process is repeated until a final model (M) is created and shared with parties. Note that data always stays with the parties! However, this simple approach may not fully protect private data under adversarial settings.
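To make this loop concrete, here is a minimal, self-contained sketch in Python of the query/reply cycle with simulated in-process parties and simple federated averaging. The class and function names (Party, Aggregator, local_update) are illustrative and not tied to any particular library; a real deployment would exchange parameters over a network.

```python
import numpy as np

# Minimal simulation of the vanilla federated learning loop described above.
# All names here are illustrative; real systems exchange model parameters
# over a network rather than in-process.

class Party:
    def __init__(self, features, labels):
        self.X, self.y = features, labels          # dataset D_i stays local

    def local_update(self, weights, lr=0.1, epochs=1):
        """Reply R_i: weights after a few epochs of local gradient descent
        on a simple linear model with squared loss."""
        w = weights.copy()
        for _ in range(epochs):
            grad = self.X.T @ (self.X @ w - self.y) / len(self.y)
            w -= lr * grad
        return w

class Aggregator:
    def __init__(self, dim):
        self.global_weights = np.zeros(dim)        # global model M

    def round(self, parties):
        # Query Q: the current global weights; replies R_i: locally updated weights.
        replies = [p.local_update(self.global_weights) for p in parties]
        # Combine the replies (here: a simple average, as in federated averaging).
        self.global_weights = np.mean(replies, axis=0)

# Toy run with three simulated parties drawing from the same linear model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
parties = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    parties.append(Party(X, X @ true_w + rng.normal(scale=0.1, size=100)))

agg = Aggregator(dim=2)
for _ in range(50):                                # repeat until the model converges
    agg.round(parties)
print("learned weights:", agg.global_weights)      # close to [2, -1]
```

Note that only weight vectors ever leave a Party object; the feature and label arrays stay local, mirroring the property highlighted above.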
How can private data be compromised?
Federated learning provides opportunities to develop AI models for enterprise clients without accessing their data, while addressing privacy regulations such as GDPR, and it can allow participants across multiple companies to engage in joint training efforts. While this approach may be sufficient for some use cases, to adhere to regulatory requirements many enterprise clients require privacy and security guarantees when sharing model parameters with third parties. In particular, if an enterprise partner participates in a federated learning consortium, it will likely use data that is considered private to its business processes and clients. Competitors, as well as external malicious actors, have an incentive to gain access to this private data. Hence, it is vital for any federated learning platform to account for colluding and malicious agents who seek access to such data.
There are multiple threats to data privacy within the aforementioned simple federated learning platform:
1) inference of parties’ data during model training and
2) leakage of parties’ data via deploying the final predictive model.
We also identify two sets of potential malicious entities:
1) inside attackers, where one or more of the parties collude to infer data from other participants, and
2) outside attackers, who may attempt to obtain private information from the final model through attacks such as membership inference attacks.
Studies such as [5] demonstrate that these attacks are feasible unless proactive steps are taken. To mitigate these risks, the use of multiparty computation techniques to protect individual replies, together with local differential privacy to prevent membership attacks against the final model (and leakage of individual replies), has been proposed. While differential privacy can deter attacks on the final model by introducing noise into the replies sent to the aggregator, adding such noise can decrease model efficacy in terms of precision and recall, especially in the distributed setting. Therefore, choosing between maximizing privacy and maximizing accuracy is a delicate balancing act.
How can we address this gap?
The AI Security and Privacy Solutions team at IBM Research has been working on privacy-preserving techniques that allow users to collaboratively train a highly accurate machine learning model in a federated learning fashion while ensuring data owners’ privacy. The team developed a new federated learning framework, shown in Figure 2, in which participants use a combination of multiparty computation and differential privacy. In this hybrid approach, not only is the data stored on the local device, but it is also protected against malicious agents via encryption and differential privacy. As a result, parties engaged in the training process have mathematically provable privacy guarantees while maintaining model performance.
Figure 2: IBM Research Hybrid Approach to Privacy-Preserving Federated Learning [1]
By using threshold homomorphic encryption based on the Paillier cryptosystem [3], it is possible to reduce the amount of noise each party needs to inject to achieve the same differential privacy guarantee (to see how this is mathematically possible, please check our paper). In a nutshell, our private federated learning framework works as follows (a simplified code sketch follows the list):
- The Aggregator sends a query (Q) to each party {P1, P2, …, Pn}, and each party computes its reply {R1, R2, …, Rn} based on its respective data
- Compared to the vanilla federated learning process, here each party adds noise to its reply, calibrated to the number of parties queried, to ensure that the reply is differentially private. Prior to sending the noisy replies back to the Aggregator, each party encrypts its reply using the specified cryptosystem
- The Aggregator combines the encrypted replies from each party
- The encrypted combined result is sent back to each party, which uses its own partial key to partially decrypt the result and sends the partially decrypted result back to the Aggregator
- The Aggregator combines the partially decrypted results from each party, obtains a plaintext version of the combined noisy replies, e.g., the average of the noisy replies, and issues the next query to each party, which can then continue with the next training step
- The process is repeated until a final model (M) is created and shared with participants
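Below is a minimal sketch of one round of this private loop, assuming the open-source python-paillier (phe) package for additively homomorphic encryption. For brevity it uses a single, non-threshold Paillier key pair, whereas the framework described above uses a threshold scheme so that no single entity can decrypt on its own; the noise scale here is also purely illustrative rather than the calibration from the paper.

```python
import numpy as np
from phe import paillier    # python-paillier: additively homomorphic encryption

# Sketch of one round of the private protocol above. We use a single
# (non-threshold) Paillier key pair for simplicity; the framework in this
# post uses *threshold* Paillier, so no single entity holds the full private
# key. The noise scale below is illustrative, not the paper's calibration.

n_parties = 3
pub_key, priv_key = paillier.generate_paillier_keypair(n_length=1024)

def noisy_encrypted_reply(local_value, sigma_total):
    """Party side: add Gaussian noise scaled down by the number of parties,
    then encrypt the noisy reply before sending it to the Aggregator."""
    noise = np.random.normal(scale=sigma_total / np.sqrt(n_parties))
    return pub_key.encrypt(float(local_value + noise))

# Each party's "reply" here is a single model parameter for simplicity.
local_params = [0.9, 1.1, 1.0]
encrypted_replies = [noisy_encrypted_reply(p, sigma_total=0.1) for p in local_params]

# Aggregator side: combine the ciphertexts without decrypting them.
encrypted_sum = encrypted_replies[0]
for c in encrypted_replies[1:]:
    encrypted_sum = encrypted_sum + c          # Enc(a) + Enc(b) -> Enc(a + b)

# In the real protocol, this decryption is performed jointly by a threshold
# of parties using their partial keys; only the *combined* noisy sum is ever
# revealed in plaintext.
noisy_average = priv_key.decrypt(encrypted_sum) / n_parties
print("combined noisy parameter:", noisy_average)
```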
A few key building blocks ensure this new federated learning approach is private and secure against malicious agents. Let’s talk about what those are:
- Differential Privacy: an algorithm is considered differentially private if the inclusion of a single instance in the training dataset causes only statistically insignificant changes to the algorithm’s output. By limiting the effect that each individual instance has on the final model, we can limit the ability of a malicious agent to infer data membership [7]. A small sketch of this idea follows the list below.
- Additive Homomorphic Encryption: an encryption technique that allows the Aggregator to perform basic calculations on encrypted data without decrypting it [8]. In this framework, an additively homomorphic encryption scheme is used to ensure privacy:
Enc(R’1) ◦ Enc(R’2) = Enc(R’1 + R’2)
where ◦ is a pre-defined operation applied by the Aggregator (for the Paillier cryptosystem, multiplication of ciphertexts corresponds to addition of plaintexts). With this, the Aggregator can securely combine each party’s reply without ever decrypting it. Because results are combined prior to decryption, no user ever has access to a single party’s decrypted reply, which maintains a much higher level of privacy. After the Aggregator combines the encrypted replies, it contacts the threshold number of parties required to decrypt the combined result.
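To make the differential privacy building block above more concrete, here is a small sketch of the classic Laplace mechanism applied to a mean query. The query, clipping bounds, and epsilon value are illustrative and are not the calibration used in our framework.

```python
import numpy as np

# Illustration of the differential privacy idea above: the Laplace mechanism
# releases a noisy mean whose distribution barely changes when any single
# record is replaced. The epsilon and the query are illustrative only.

def private_mean(values, lower, upper, epsilon):
    """Release the mean of `values` (clipped to [lower, upper]) with
    epsilon-differential privacy via the Laplace mechanism."""
    clipped = np.clip(values, lower, upper)
    # Replacing any single record changes the clipped mean by at most this amount.
    sensitivity = (upper - lower) / len(clipped)
    noise = np.random.laplace(scale=sensitivity / epsilon)
    return clipped.mean() + noise

data = np.random.uniform(0, 100, size=1000)
print("non-private mean:", data.mean())
print("private mean (eps=0.5):", private_mean(data, 0, 100, epsilon=0.5))
```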
Importantly, it is possible to demonstrate that, by combining noisy models prior to decryption, we can reduce the amount of noise needed per dataset to ensure differential privacy [1]. Our goal is to ensure that any decrypted result is differentially private to prevent data leakage while, at the same time, reducing the noise added during parties’ local training to the minimum required. In fact, we can reduce noise levels in direct proportion to the number of parties participating in model training.
To summarize, the capabilities of additive homomorphic encryption allow us to reduce the amount of noise each party needs to add. No results are decrypted until all queried parties’ replies are combined (i.e., no single party can access another party’s individual reply). Once results are combined and decrypted, the total noise is enough to ensure differential privacy. This means that, as the number of contributing data parties increases, the total amount of injected noise remains constant, enabling the training of a machine learning model with high performance.
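Here is a simplified numerical illustration of that noise-reduction effect, using Gaussian noise as a stand-in: if each of n parties injects noise with variance σ²/n, the aggregate carries total variance σ², which is exactly what a single party would otherwise have had to add on its own. The exact calibration in the paper also accounts for the collusion threshold, so treat this only as intuition.

```python
import numpy as np

# Simplified intuition for the noise-reduction argument above, using Gaussian
# noise: n independent draws of N(0, sigma^2 / n) sum to N(0, sigma^2), so the
# aggregate carries the same total noise however many parties contribute.
# The paper's exact calibration (accounting for the collusion threshold) differs.

sigma_total = 1.0
for n_parties in (1, 10, 100):
    per_party_scale = sigma_total / np.sqrt(n_parties)
    # Each party adds its own small noise sample; the aggregator only ever
    # sees the (decrypted) sum of all of them.
    trials = np.random.normal(scale=per_party_scale, size=(100_000, n_parties)).sum(axis=1)
    print(f"{n_parties:>3} parties: per-party std = {per_party_scale:.3f}, "
          f"aggregate std ≈ {trials.std():.3f}")
```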
This framework can be used to train many kinds of machine learning models, ranging from neural networks and decision trees to SVMs and many others. The figure below compares F1 scores of a simple federated learning platform without encryption, labeled as No privacy, two baselines that make random guesses (any model producing predictions below those ranges is worse than random guessing), and Local DP, where each party adds differentially private noise independently. As expected, protecting privacy by adding differentially private noise reduces model performance with respect to the No privacy baseline. Further, when comparing our approach to the Local DP baseline, we see a drastic difference in model performance: our proposed framework consistently achieves higher F1 scores, while the F1 score for Local DP drops as the number of parties increases. To find out more about this experiment and our proposed privacy-preserving framework, read the paper here!
Note: Thanks to Austin Bell for his contributions to this project and to these blog posts during Summer 2019.
References
[1] A Hybrid Approach to Privacy-Preserving Federated Learning (Best paper award) Stacey Truex, Nathalie Baracaldo, Ali Anwar, Thomas Steinke, Heiko Ludwig, Rui Zhang, Yi Zhou https://arxiv.org/abs/1812.03224
[2] HybridAlpha: An Efficient Approach for Privacy-Preserving Federated Learning. Runhua Xu, Nathalie Baracaldo, Yi Zhou, Ali Anwar, Heiko Ludwig
[3] Damgård, I., and Jurik, M. (2001). A generalization, a simplification and some applications of paillier’s probabilistic public-key system. In Proceedings of the 4th International Workshop on Practice and Theory in Public Key Cryptography: Public Key Cryptography, PKC ’01, 119–136. London, UK, UK: Springer-Verlag.
[4] C. Dwork. Differential privacy. In M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, editors, ICALP (2), volume 4052 of Lecture Notes in Computer Science, pages 1–12. Springer, 2006. ISBN 3-540-35907-9.
[5] Reza Shokri, Marco Stronati, Congzheng Song, Vitaly Shmatikov. (2017). Membership Inference Attacks Against Machine Learning Models. IEEE
[6] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Agüera y Arcas. (2016). Communication-Efficient Learning of Deep Networks from Decentralized Data
[7] Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate. (2011). Differentially Private Empirical Risk Minimization. Journal of Machine Learning Research
[8] Paillier, P. (1999). Public-key cryptosystems based on composite degree residuosity classes. In International Conference on the Theory and Applications of Cryptographic Techniques, 223–238. Springer.
[9] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Agüera y Arcas. (2016). Communication-Efficient Learning of Deep Networks from Decentralized Data