In the era of artificial intelligence (AI) and big data, predictive models have become essential tools in a variety of industries, including healthcare, finance, and genomics. These models rely heavily on processing sensitive information, making data privacy a major concern. A key challenge is maximizing the usefulness of the data without compromising the confidentiality and integrity of the associated information. Achieving this balance is essential for the continued advancement and acceptance of AI technologies.
By Zama’s Machine Learning Technical Lead
Collaboration and Open Source
Creating a robust dataset for training machine learning models is a major challenge. For example, AI technologies such as ChatGPT have been developed by collecting huge amounts of data available on the internet, but healthcare data cannot be collected so freely due to privacy concerns. Building a healthcare dataset requires integrating data from multiple sources across doctors, hospitals, and borders.
While the healthcare sector is highlighted here because of its societal importance, the principles apply broadly: autocorrect features on smartphones that personalize predictions based on user data must address similar privacy concerns, and the financial sector faces its own obstacles to data sharing due to its competitive nature.
Thus, collaboration has emerged as a key element in safely harnessing the potential of AI in our society. However, an aspect that is often overlooked is the actual execution environment of AI and the underlying hardware that powers it. Today’s advanced AI models require robust hardware, including large CPU/GPU resources, large amounts of RAM, and even more specialized technologies such as TPUs, ASICs, and FPGAs. At the same time, user-friendly interfaces with easy-to-understand APIs are gaining popularity. This combination highlights the importance of developing solutions that allow AI to run on third-party platforms without compromising privacy, and the need for open-source tools that facilitate these privacy-preserving technologies.
Privacy Solutions for Training Machine Learning Models
To address privacy challenges in AI, several elegant solutions have been developed, each focused on specific needs and scenarios.
Federated Learning (FL) allows machine learning models to be trained across multiple distributed devices or servers, each holding local data samples, without actually exchanging the data. Similarly, Secure Multiparty Computation (MPC) allows multiple parties to jointly compute a function over their inputs while keeping those inputs private, ensuring that sensitive data never leaves its original environment.
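To make the FL idea concrete, here is a minimal sketch of federated averaging; the clients, data, and single-gradient-step update rule are hypothetical simplifications (closer to FedSGD than a production FedAvg implementation).

```python
import numpy as np

# Hypothetical local datasets held by three clients; the raw rows never
# leave the device that owns them.
client_data = [
    (np.random.randn(100, 5), np.random.randn(100)),
    (np.random.randn(80, 5), np.random.randn(80)),
    (np.random.randn(120, 5), np.random.randn(120)),
]

def local_update(weights, X, y, lr=0.01):
    """One gradient step of linear regression on a client's private data."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

# Federated averaging: each client trains locally, then only the model
# weights (not the data) are sent back and averaged by the server,
# weighted by each client's dataset size.
weights = np.zeros(5)
for _ in range(50):
    updates = [local_update(weights, X, y) for X, y in client_data]
    sizes = [len(y) for _, y in client_data]
    weights = np.average(updates, axis=0, weights=sizes)
```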
Another set of solutions focuses on transforming the data itself to enable useful analytics while maintaining privacy. Differential Privacy (DP) introduces carefully calibrated noise so that aggregate statistics remain accurate while individual identities stay protected. Data Anonymization (DA) removes personally identifiable information from datasets, ensuring a degree of anonymity and reducing the risk of data breaches.
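As a small illustration of how DP works in practice, the classic Laplace mechanism adds noise scaled to a query's sensitivity; the dataset and privacy budget below are hypothetical.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Differentially private count: a counting query has sensitivity 1,
    so Laplace noise with scale 1/epsilon yields epsilon-DP."""
    true_count = sum(predicate(x) for x in data)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical patient ages; a smaller epsilon means more privacy
# but a noisier (less accurate) answer.
ages = [34, 45, 29, 61, 50, 38, 72, 55]
print(laplace_count(ages, lambda age: age > 40, epsilon=0.5))
```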
Finally, Homomorphic Encryption (HE) allows operations to be performed directly on encrypted data, producing an encrypted result that, when decrypted, matches the result of the same operation performed on the plaintext.
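A short example using the open-source python-paillier library (`phe`), which implements a partially homomorphic scheme supporting addition and multiplication by a plaintext constant (not the fully homomorphic case discussed below), shows the principle; the values are arbitrary.

```python
# Requires the open-source python-paillier package: pip install phe
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Encrypt two values; a third party only ever sees ciphertexts.
a, b = public_key.encrypt(17), public_key.encrypt(25)

# Arithmetic happens directly on the encrypted values,
# without decrypting them.
encrypted_sum = a + b
encrypted_scaled = a * 3

assert private_key.decrypt(encrypted_sum) == 42
assert private_key.decrypt(encrypted_scaled) == 51
```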
A Perfect Fit
Each of these privacy solutions has its own advantages and disadvantages. For example, FL still exchanges model updates with a third-party server, which can leak information about the underlying data, and MPC rests on robust cryptographic principles in theory, but in practice it can require significant bandwidth.
DP requires manual configuration to strategically add noise to the data, which limits the types of operations that can be performed, since the noise must be carefully balanced to preserve privacy while maintaining the data’s usefulness. DA is widely used but often offers the least privacy protection; anonymization is usually performed on a third-party server, and anonymized records risk being re-identified through cross-referencing with other datasets.
HE, and in particular Fully Homomorphic Encryption (FHE), stands out because it allows computations on encrypted data that closely mirror those performed on plaintext. This makes FHE highly compatible with existing systems and straightforward to adopt thanks to accessible open-source libraries and compilers such as Concrete ML, which are designed to give developers easy-to-use tools for building a wide variety of applications. Its main drawback at the moment is slower computation speed, which can impact performance.
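As a rough sketch of what this looks like with Concrete ML’s scikit-learn-style interface (exact parameters and the `fhe` flag may vary between versions, and the toy dataset is a stand-in for real sensitive data):

```python
# Requires the open-source Concrete ML package: pip install concrete-ml
from concrete.ml.sklearn import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset standing in for sensitive records.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on plaintext as usual, then compile the model to an FHE circuit.
model = LogisticRegression(n_bits=8)
model.fit(X_train, y_train)
model.compile(X_train)

# Inference runs on encrypted inputs; the server never sees the data.
y_pred = model.predict(X_test, fhe="execute")
```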
While all of the solutions and technologies discussed here encourage collaboration and joint efforts, FHE can drive innovation by strengthening data privacy protections, enabling scenarios in which people can enjoy a service or product without any trade-off that puts their personal data at risk.