In the first post of our blog series on federated learning, written in partnership with IBM Research, we covered the mechanism behind federated learning as well as the technology behind the paper that won the Best Paper Award at AISec 2019. We recommend starting with that post before reading this one, which dives more deeply into applications.
As a reminder, federated learning allows multiple parties to collaboratively train a machine learning model without sharing their private training data. In this framework, participants train local models on their own data, and then privately and securely combine those models with other participants’ locally trained models.
Combining local models is effectively equivalent to pooling disparate data resources across industries that are constrained by competition or privacy.
The process is roughly as follows:
- An aggregator queries involved parties, for instance Company A, Company B, and Company C, to get model updates
- Company A, Company B, and Company C each train a machine learning model on their own data.
- The aggregator is then responsible for securely combining model updates at each step of training (e.g., each epoch). Because only model updates are exchanged, the raw training data never leaves the party that owns it.
Figure 1- Federated Learning Entities
As a result of this training paradigm, each company can create machine learning models trained on a much larger set of data assets from the combined companies, without infringing on their respective data privacy rights.
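The three steps outlined above can be sketched as a toy training loop. This is a minimal illustration only, assuming a one-feature linear model, plain gradient descent at each party, and a sample-weighted average at the aggregator (in the spirit of federated averaging). The function names, the synthetic data, and the learning rate are all hypothetical, and the secure combination step is deliberately left out.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """One party trains y ~ w*x + b on its own data, starting from the global model."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    # Only the model parameters and the sample count leave the party.
    return (w, b), n

def aggregate(updates):
    """Aggregator: average the parties' models, weighted by local sample count."""
    total = sum(n for _, n in updates)
    w = sum(model[0] * n for model, n in updates) / total
    b = sum(model[1] * n for model, n in updates) / total
    return (w, b)

def federated_training(parties, rounds=50):
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Step 1 and 2: query each party, which trains locally.
        updates = [local_update(global_weights, data) for data in parties.values()]
        # Step 3: combine the updates into a new global model.
        global_weights = aggregate(updates)
    return global_weights

def make_party(rng, n, low, high):
    """Synthetic private dataset drawn from y = 3x + 1 plus a little noise."""
    xs = [rng.uniform(low, high) for _ in range(n)]
    return [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.01)) for x in xs]

rng = random.Random(0)
parties = {
    "Company A": make_party(rng, 40, 0.0, 1.0),
    "Company B": make_party(rng, 60, 0.5, 1.5),
    "Company C": make_party(rng, 30, -1.0, 0.0),
}
w, b = federated_training(parties)
print(f"global model: w={w:.3f}, b={b:.3f}")
```

Each company sees only its own rows, yet the global model should land close to the slope and intercept that generated all three datasets.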
To summarize, some of the key reasons for adopting federated learning include:
- Improved user data privacy: each participant trains local models on its own data, so that data stays private; the results are then privately and securely combined with the other local models through the aggregator
- Current and upcoming privacy regulations (e.g., GDPR): multinationals can train models within their own companies across borders without violating user privacy
- Institutional flexibility in sharing data while preserving privacy: multiple institutions can effectively pool their data resources, providing more data to their machine learning models
- Distributed data warehousing: in some cases, training models locally and sharing parameters is less costly than transferring large data lakes across resources
In the past, AI vendors invested heavily in creating and maintaining large centralized data warehouses from which they could access their data. That naturally led to sequestered pools of data. Federated learning represents a new computing paradigm in which models are trained on data that stays stored in that distributed manner. Local models trained on distributed datasets are aggregated into a single centralized model, removing the need to share data across parties. Portions of the data can now be kept separate and private from the data scientists who are responsible for training the model.
What would that actually look like? What are the applications?
2. Federated Learning Types
Federated learning can be classified into Horizontal and Vertical Federated Learning according to the distribution of features among multiple parties.
2.1. Horizontal Federated Learning
Horizontal federated learning is when all participating parties have the entire feature set and labels available to train their local model.
A horizontal framework would need to securely identify the shared feature space and combine model results appropriately.
An easy way to think of horizontal federated learning is that the resulting model is equivalent to stacking two datasets with the same feature space and training a single model. For example, consider a grocery chain, Coast Groceries, that has stores primarily on the West and East Coasts of the US, whereas Central Groceries’ stores are concentrated in the Midwest and South. The two chains have entirely different user bases, yet all of their customers are grocery consumers. Both companies collect similar data about customers’ purchases, so they can reduce geographic bias by pooling what they learn. They could engage in federated learning to collaboratively train models to potentially better understand and serve their customers.
Figure 2- Subset of data shared within a Horizontal Federated Learning Framework
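The "stacking" intuition can be checked on a deliberately simple case. The sketch below assumes a one-feature least-squares model, for which each store can share a handful of summary sums instead of its rows. The feature (household size), the target (weekly spend), and every number are made up for illustration. The exact equivalence shown here is a property of this simple model; iterative schemes that average model weights generally only approximate the stacked result.

```python
def local_statistics(rows):
    """Summary sums a store can share instead of its raw (x, y) rows."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, sxy)

def combine(*stats):
    """Aggregator: add the parties' summaries element by element."""
    return tuple(sum(values) for values in zip(*stats))

def fit_from_statistics(stats):
    """Closed-form least-squares line from the summary sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical rows: (household size, weekly spend). Same features, different customers.
coast_groceries = [(1.0, 12.0), (2.0, 19.5), (3.0, 31.0), (4.0, 39.0)]
central_groceries = [(1.5, 14.0), (2.5, 27.0), (5.0, 52.0)]

federated = fit_from_statistics(
    combine(local_statistics(coast_groceries), local_statistics(central_groceries))
)
stacked = fit_from_statistics(local_statistics(coast_groceries + central_groceries))

print("federated:", federated)
print("stacked:  ", stacked)
```

Neither chain sends a customer row to the other, and the federated fit matches the fit on the stacked table up to floating-point rounding.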
2.2. Vertical Federated Learning
By comparison, vertical federated learning is when parties collect data with different feature spaces and only one party has access to the label. Given this lack of information, parties are unable to train a model on their own.
This adds a layer of complexity for the aggregator: a new step of securely aligning the sample spaces is now required, and the training process requires exchanging partial model updates.
Consider a credit card provider and an e-commerce company both looking to improve their fraud detection algorithms.
Vertical federated learning allows the two companies to create a single feature vector that can be used to train a fraud detection model that reduces potential fraud and provides a more secure user experience.
Figure 3- Combined feature space for shared users within a Vertical Federated Learning Framework
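The alignment step can be illustrated with a toy join on blinded user identifiers. This is a sketch under loud assumptions: salted SHA-256 hashing is used here only as a stand-in for a real private set intersection protocol (it is not secure against guessing of identifiers), the feature names and values are invented, and the combined vector is built in one place purely to show what "aligned" means. In an actual vertical setup each party would keep its own feature columns and exchange partial model updates instead.

```python
import hashlib

SALT = "shared-secret-salt"  # hypothetical value agreed on by both parties

def blind(user_id, salt=SALT):
    """Replace a user ID with a salted hash so raw IDs are not exchanged."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()

def blind_table(table, salt=SALT):
    return {blind(user_id, salt): features for user_id, features in table.items()}

def align(table_a, table_b):
    """Keep only users present in both tables and join their feature columns."""
    shared = sorted(set(table_a) & set(table_b))
    return {key: table_a[key] + table_b[key] for key in shared}

# Credit card provider: [transaction amount, transactions today]; it also holds the label.
card_features = {"u1": [120.0, 3], "u2": [15.5, 1], "u4": [980.0, 7]}
fraud_labels = {"u1": 0, "u2": 0, "u4": 1}

# E-commerce company: [account age in years, shipping address mismatch flag]; no labels.
shop_features = {"u2": [2, 0], "u3": [5, 1], "u4": [1, 1]}

aligned = align(blind_table(card_features), blind_table(shop_features))
labels = {blind(uid): y for uid, y in fraud_labels.items() if blind(uid) in aligned}

for key, vector in aligned.items():
    print(key[:8], vector, "label:", labels[key])
```

Only the two users known to both companies survive the join, and only the credit card provider ever supplies a label.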
3. Key Applications
The seminal paper on federated learning focused on training local machine learning models on mobile devices and sharing model updates with a centralized aggregator, so that private mobile data would not need to be accessed or moved from the phone. The applications of federated learning, however, are much wider. We focus on federated learning for enterprise settings, so we’ll look into a different set of use cases and benefits.
Federated learning within the enterprise largely focuses on the ability to collaboratively train models in competitively constrained and privacy-constrained situations. With advancements in federated learning, enterprises will be able to:
- Ensure that client data privacy is maintained throughout deployed model training
- Train intra-company models while adhering to country-specific privacy laws (e.g., GDPR)
- Collaborate with other institutions (even competitors) to improve internal machine learning models
If an AI vendor sells its products to clients whose data must remain private and securely located within their own network, as in manufacturing, that data is out of reach of the vendor’s original training pipeline. A significant benefit of federated learning is that AI vendors can now ensure that data privacy is maintained throughout model re-training. In such a scenario, client data never leaves the client’s network and only model parameters are shared back to the vendor during training. The private data remains with its owner. Consider a vendor that provides smart manufacturing robots to automotive companies.
The vendor deploys these robots to its customers, who subsequently use them to manufacture cars. The manufacturer’s data is private, potentially containing trade secrets on how their cars are developed, and thus cannot leave their secure network. As a consequence, the manufacturer is unable to share their data with the vendor despite the potential benefit of receiving a more performant manufacturing robot. The vendor wants to work with its client and make the most of the data generated by its users in order to provide the best service possible. Through federated learning, the vendor can now train AI models using private data without removing or directly accessing the manufacturer’s data: models are trained locally within the client’s environment and only model parameters are shared with the vendor. The manufacturer can retain its privacy while also reaping the benefits of AI.
In an additional scenario, an AI vendor can incorporate multi-party collaboration while still retaining complete client privacy through federated learning. While there will be small differences, many underlying models may be similar across automotive manufacturing robots. For example, many robotic units may require similar object detection algorithms throughout their system. The vendor can continue to train such algorithms across multiple automotive clients, combining locally trained model parameters without ever accessing client data. The end result is that each client takes advantage of locally trained models from all parties, which can lead to a better performing model for each client and less time spent training, a benefit to all parties.
Currently, due to privacy laws (e.g., GDPR in Europe and HIPAA in the US), it is nearly impossible to share patient data without first going through a very lengthy anonymization process (and even then, advanced attacks such as membership inference mean it is not always secure). As a result, hospital networks are largely reliant on training models limited to their own data. One aim of federated learning is to provide a framework that allows hospitals or research institutions to benefit from each other’s data without fear of infringing on patient privacy.
Imagine that there are three separate hospitals in different geographic regions that collect data on whether a patient will be readmitted within 30 days. Each hospital collects similar data, but for a separate set of patients. Under the current paradigm, each hospital trains a readmission model utilizing only their own data. However, under a federated learning paradigm, participating hospitals will collaboratively train a model. The benefit is the potential to reduce a problem that costs $41 billion annually[i].
Machine learning models are largely limited by the data available during training, and data is often siloed within individual companies and heavily protected. The result is less available training data and weaker models. Through federated learning, businesses can securely and effectively pool data with one another to improve their own machine learning models. It is important to reiterate that through decentralized training, differential privacy, and encryption, no user’s raw data or model results are ever exposed.
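One way to picture how an aggregator can combine updates without seeing any single party’s contribution is pairwise additive masking, a simple form of secure aggregation. The sketch below is a toy under stated assumptions, not the scheme used by any particular framework: updates are assumed to be already quantized to small non-negative integers, a single seeded random generator stands in for the pairwise key agreement a real protocol would use, party dropout is ignored, and differential privacy noise is not shown. All names and numbers are hypothetical.

```python
import random

MODULUS = 2 ** 32  # all arithmetic is done modulo this value

def pairwise_masks(party_ids, length, seed=0):
    """For every pair, one party adds a random mask and the other subtracts it."""
    rng = random.Random(seed)  # stand-in for secrets shared between each pair
    masks = {party: [0] * length for party in party_ids}
    for i, first in enumerate(party_ids):
        for second in party_ids[i + 1:]:
            shared = [rng.randrange(MODULUS) for _ in range(length)]
            masks[first] = [(m + s) % MODULUS for m, s in zip(masks[first], shared)]
            masks[second] = [(m - s) % MODULUS for m, s in zip(masks[second], shared)]
    return masks

def mask_update(update, mask):
    """What a party actually sends: its update hidden behind its mask."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]

def aggregate(masked_updates):
    """Aggregator sums the masked vectors; the pairwise masks cancel out."""
    return [sum(column) % MODULUS for column in zip(*masked_updates)]

# Hypothetical quantized model updates from three hospitals.
updates = {
    "hospital_1": [12, 7, 30],
    "hospital_2": [5, 9, 1],
    "hospital_3": [20, 4, 8],
}

masks = pairwise_masks(list(updates), length=3)
masked = {party: mask_update(update, masks[party]) for party, update in updates.items()}
total = aggregate(list(masked.values()))

print("sent by hospital_1:", masked["hospital_1"])
print("sum of all updates:", total)
```

Each masked vector looks random on its own, while their sum equals the sum of the true updates, which is all the aggregator needs in order to average them.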
Figure 5- Comparing Frameworks
The hospital network readmission model is a great example of horizontal federated learning used in practice to effect positive change. It is one in which each hospital collects similar features, but for different users (i.e., patients). One can imagine how this extends to vertical federated learning, where providers are in close proximity and therefore share patients, but collect different features.
The above has been a conceptual walkthrough of the practices of federated learning, and what they might mean in the enterprise. If you’re curious what real use cases of this technology look like, check out the studies in our references. Our own Nathalie Baracaldo has published one on anti-money laundering, and it’s a must-read.
This second blog is only one part of a longer series on federated learning in partnership with the excellent research staff from IBM Research.
Special thanks to IBM Research AI Security and Privacy Solutions team for partnering with us in this series! Special thanks to the individual team members including Nathalie Baracaldo, Yi Zhou, Ali Anwar and Heiko Ludwig!
To see subsequent installments in this federated learning series, and other essential data science stories like it, join the IBM Data Science community!
Toyotaro Suzumura, Yi Zhou, Nathalie Baracaldo, Guangann Ye, Keith Houck, Ryo Kawahara, Ali Anwar, Lucia Larise Stavarache, Yuji Watanabe, Pablo Loyola, Daniel Klyashtorny, Heiko Ludwig, and Kumar Bhaskaran, "Towards Federated Graph Learning for Collaborative Financial Crimes Detection", NeurIPS 2019 Workshop on Robust AI in Financial Services, 2019. https://arxiv.org/pdf/1909.12946.pdf
H. B. McMahan, E. Moore, D. Ramage, S. Hampson, "Communication-efficient learning of deep networks from decentralized data", arXiv preprint arXiv:1602.05629, 2016.
O. Choudhury, A. Gkoulalas-Divanis, T. Salonidis, I. Sylla, Y. Park, G. Hsu, A. Das, "Differential Privacy-enabled Federated Learning for Sensitive Health Data", NeurIPS ML4H (Machine Learning for Health), 2019.
O. Choudhury, Y. Park, T. Salonidis, A. Gkoulalas-Divanis, I. Sylla, A. Das, "Predicting Adverse Drug Reactions on Distributed Health Data using Federated Learning", American Medical Informatics Association (AMIA), 2019. Nominated for Distinguished Paper Award (decision pending).
[i] Anika L. Hines, Marguerite L. Barrett, H. Joanna Jiang, Claudia A. Steiner. (2011). Conditions With the Largest Number of Adult Hospital Readmissions by Payer. Agency for Healthcare Research and Quality.