Machine learning is being introduced into industry at lightning speed. As a result, data has rapidly become one of the most valuable assets an organization can own. However, for use cases where regulatory compliance and data privacy are paramount, unlocking the full potential of data poses unique challenges.
Why productize data now?
The quality of machine learning is determined by the data used for training. Developing machine learning systems that deliver breakthrough results requires the right quantity and quality of data to be readily available. Ten years ago, the prevailing approach was theory-driven modeling, where models were built on expert knowledge and predefined rules. Today, we are in a new era where data-driven approaches rule. Enterprises can now leverage the availability of data and compute to harness the power of large models trained on large datasets.
As a result, we are now witnessing a paradigm shift in which commoditization enables greater innovation. Until now, companies have struggled with small, bespoke models trained on limited datasets, with suboptimal results. Today, companies increasingly take advantage of this commoditization, leveraging foundation models pre-trained on broad datasets and then adapting them to their specific proprietary data. This approach yields far more accurate models, as exemplified by generative AI. Generative AI is currently used primarily for text, but images and time series are not far behind.
This new approach enables businesses to realize the full potential of their unique data sets by addressing the historic challenge of data scarcity that many have faced. Some industries are leveraging these benefits of AI to perform tasks such as internet searches, digital personal assistants, and targeted advertising with great success.
What are the challenges?
However, for use cases involving sensitive data such as personally identifiable information (PII), companies must take a more nuanced approach. This data is sensitive in nature and remains unsuitable for powering commercial products without appropriate safeguards in place. In other cases where datasets are company-specific, it becomes essential to protect valuable IP encoded in proprietary datasets.
While highly regulated industries recognize the benefits of AI, they must be careful because the data is extremely sensitive. In healthcare and similarly data-sensitive sectors such as finance and the public sector, the presence of sensitive data creates constraints that limit an organization’s ability to productize this data.
In the healthcare industry, collaboration among many independent organizations is essential, and that collaboration must routinely cross regulatory and geographic boundaries governing data access and sharing. Whether complying with GDPR in Europe or health data restrictions in other regions, such as HIPAA in the United States, data controllers are bound by their own sets of regulations, which in some cases may simply prohibit sharing the actual data altogether.
It is inevitable that machine learning will be incorporated into digital products. However, these efforts are often hampered by information security (InfoSec) and data protection teams who must balance providing access to data with ensuring adequate governance.
How can these challenges be met?
There are three main considerations that both business leaders and CISOs should take into account when looking to make the most of sensitive data in regulated industries.
- Bringing compute to data
Increasing regulations and practical considerations such as security and cost require that data, especially sensitive data, should not be moved. Adopting a decentralized approach to data makes it easier for organizations to comply. The distributed computing paradigm also helps organizations protect their privacy and intellectual property. So a better approach is not to bring the data to the compute, but to bring the compute to the data. This makes it easier to ensure compliance for both the data and the resulting model trained on the data.
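As a rough illustration of this pattern, the sketch below ships a computation to where the data lives and returns only an aggregate; the class and function names are hypothetical, not any particular product's API:

```python
# Minimal sketch of "bringing compute to data": instead of exporting raw
# records, each data node runs a supplied computation locally and returns
# only an aggregate result. All names here are illustrative.

class DataNode:
    """Holds raw records that never leave this node."""

    def __init__(self, records):
        self._records = records  # raw data stays private to the node

    def run(self, computation):
        # Only the aggregate produced by `computation` leaves the node.
        return computation(self._records)


def mean_age(records):
    """An example aggregate computation sent to each node."""
    ages = [r["age"] for r in records]
    return sum(ages) / len(ages)


# Two sites keep their own data; the analyst ships the computation to them.
site_a = DataNode([{"age": 34}, {"age": 51}])
site_b = DataNode([{"age": 47}])

aggregates = [site_a.run(mean_age), site_b.run(mean_age)]
```

Only the per-site means cross the boundary; the individual records never do, which is what makes compliance for both the data and any resulting model easier to demonstrate.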
- Consider federated learning when working with distributed and sensitive data
Even if data residency requirements are strictly followed, the data in one isolated pool is not necessarily adequate to build a sufficiently accurate model. Ideally, you could increase the size of your data pool without jeopardizing privacy or security. One solution is to expose distributed data repositories to machine learning algorithms without moving or sharing the raw data, a practice called distributed machine learning. Federated learning, originally proposed by Google, has emerged as the leading distributed machine learning solution. Although it was originally proposed for mobile use cases, it is gaining traction in markets with access to other types of data fleets, such as healthcare, industrial equipment operations, and finance.
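To make the idea concrete, here is a deliberately simplified sketch of federated averaging, the aggregation rule at the heart of Google's FedAvg. The one-parameter linear model and the site datasets are invented for illustration; real systems train neural networks and add secure aggregation:

```python
# Illustrative sketch of federated averaging: each site trains on its own
# data, and only model weights (never raw records) are shared and averaged.

def local_update(weight, data, lr=0.1):
    """One pass of gradient descent for y = w*x on local data."""
    w = weight
    for x, y in data:
        grad = 2 * (w * x - y) * x  # d/dw of the squared error (w*x - y)^2
        w -= lr * grad
    return w

def federated_average(local_weights, sizes):
    """Aggregate site weights, weighting each site by its dataset size."""
    total = sum(sizes)
    return sum(w * n for w, n in zip(local_weights, sizes)) / total

# Two hospitals hold private samples of the same underlying relation y = 2x.
site_data = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
]

w_global = 0.0
for _ in range(50):  # communication rounds
    local = [local_update(w_global, d) for d in site_data]
    w_global = federated_average(local, [len(d) for d in site_data])
```

After the communication rounds, the shared weight converges to the true slope (2.0) even though neither site ever saw the other's records.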
- Focus on technology regulation as a governance solution
Many enterprise use cases are bound by complex regulations, often dismissed as “red tape.” This regulatory landscape can vary widely by geographic location and even within the same organization. Techno-regulation refers to influencing behavior by embedding values and rules in technology itself. It allows data controllers to enforce computational governance through technical solutions, giving organizations the means to efficiently enforce constantly changing, location-specific regulations.
Techno-regulation can also be applied at the granular level required by audits or regulations. Technology solutions allow data owners to control what computations are allowed on what data without significantly slowing an organization’s pace of innovation.
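One way to picture techno-regulation is as machine-checkable policy. In the toy sketch below, the jurisdictions, operation names, and rules are all invented for illustration; it shows how a data controller might automatically approve or reject a proposed computation:

```python
# Toy sketch of techno-regulation: encode location-specific rules as
# machine-checkable policy so proposed computations can be approved or
# rejected automatically. Jurisdictions and operations are illustrative.

POLICIES = {
    # jurisdiction -> operations permitted on PII-bearing datasets
    "EU": {"aggregate", "train_federated"},
    "US": {"aggregate", "train_federated", "row_level_query"},
}

def is_allowed(jurisdiction, operation, contains_pii):
    """Non-PII data is unrestricted; PII falls under jurisdiction rules."""
    if not contains_pii:
        return True
    return operation in POLICIES.get(jurisdiction, set())

# An EU hospital dataset: federated training passes, raw queries do not.
assert is_allowed("EU", "train_federated", contains_pii=True)
assert not is_allowed("EU", "row_level_query", contains_pii=True)
```

Because the rules live in code rather than in a manual review queue, updating a regulation means updating a policy table, and every computation request is checked the same way, which is what makes audits tractable.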
Conclusion
In the current climate, businesses are grappling with the complexity of sensitive data and face challenges such as regulatory compliance, data scarcity, and privacy protection obligations. However, there is a path forward for product teams looking to productize their data assets.
To enable the productization of data assets within an organization, CISOs must first ensure that data residency complies with applicable regulations. Federated learning then offers an elegant solution to the problems of data scarcity, data residency, and privacy protection. Finally, focusing on technology solutions that encode regulatory requirements can dramatically increase the pace of innovation, even when dealing with sensitive data.
By being compliant and embracing technological advances, businesses can navigate complex data environments, foster cross-border collaboration, and unlock the true potential of their data.
About the author
Ellie Dobson is Vice President of Products at Apheris and has an extensive career spanning multiple industries. Previously, she held key leadership positions at Graphcore and Arundo Analytics, Inc., where she leveraged her expertise in product management and data science. Ellie has her academic roots at the University of Oxford, where she obtained her MPhys and DPhil in particle physics. Her career, from research fellow at CERN to leadership roles at technology companies, reflects her commitment to innovation and leadership. Ellie has an extensive background in technology and data science and is a recognized leader in the field.