Beyond the Hype: Real-world AI Governance and Data Quality Essentials

As excitement about artificial intelligence (AI), and Generative AI (GenAI) in particular, grows rapidly, this guide goes beyond the hype and dives into real-world AI governance and data quality essentials.

What's the difference between AI and GenAI?

Excitement about artificial intelligence (AI), and Generative AI (GenAI) in particular, is growing exponentially at the moment. Many people don't understand the difference between them, so before we explore the role of data quality in AI, let's clarify the point. It seems only right to ask ChatGPT to explain the difference. To summarize:

"AI is a broad term that covers all aspects of machines emulating human intelligence, Generative AI specifically refers to the branch of AI that is concerned with creating new, original outputs that resemble the training data but do not replicate it. This distinction highlights the versatility of AI technologies and their potential to both automate tasks and foster creative endeavours."

AI has been around for many years and GenAI is not as new as you might think, but natural language processing that "create[s] new, original outputs that resemble the training data" has made it accessible to multitudes of people, and it has experienced rapid adoption. It's an exciting time and there are so many possibilities for how we can use GenAI, not least of which is using it to analyze our data more easily and faster than ever before.

That sounds like a good thing, but I do have some concerns over the speed with which it is being adopted. GenAI is always going to sound more exciting than data quality, but the outputs from AI are only as good as the model and the data that it is using.

So why data quality for AI?

In the summary above, it is clear that AI uses training data to process and prepare responses to input. If that training data is not of good enough quality, or even of terrible quality, GenAI comes to the wrong conclusions, regardless of how good the model is. GenAI tools are only as good as the data they consume. For example, ChatGPT was trained on vast amounts of text from the internet. Does anyone believe that every fact on the internet is correct? Even ChatGPT has this warning message:

"ChatGPT can make mistakes. Consider checking important information."

Bad data leads to incorrect AI outcomes, so it follows that good data quality is a foundational element for the successful adoption of AI. But it's not just data quality that is needed for successful AI; data governance is fundamental too.
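To make the data quality point concrete, here is a minimal sketch of the kind of check that could run before data is handed to a model. It is an illustration only: the function names (`quality_report`, `passes_gate`), the sample records, and the 95% completeness threshold are all assumptions made for this example, not a description of any particular tool.

```python
from collections import Counter


def quality_report(rows, required_fields, key_field):
    """Return simple completeness and uniqueness metrics for a list of dict records."""
    total = len(rows)
    if total == 0:
        return {"rows": 0, "completeness": {}, "duplicate_keys": []}

    # Completeness: share of rows where each required field is actually filled in.
    completeness = {}
    for field_name in required_fields:
        filled = sum(1 for row in rows if row.get(field_name) not in (None, ""))
        completeness[field_name] = filled / total

    # Uniqueness: key values that appear on more than one row.
    key_counts = Counter(row.get(key_field) for row in rows)
    duplicate_keys = sorted(k for k, n in key_counts.items() if k is not None and n > 1)

    return {"rows": total, "completeness": completeness, "duplicate_keys": duplicate_keys}


def passes_gate(report, min_completeness=0.95):
    """True only if every required field is complete enough and no key is duplicated."""
    if report["rows"] == 0:
        return False
    complete = all(score >= min_completeness for score in report["completeness"].values())
    return complete and not report["duplicate_keys"]


# Hypothetical sample data, invented for this illustration.
customers = [
    {"id": 1, "email": "a@example.com", "country": "BE"},
    {"id": 2, "email": "", "country": "UK"},
    {"id": 2, "email": "c@example.com", "country": None},
    {"id": 4, "email": "d@example.com", "country": "US"},
]

report = quality_report(customers, required_fields=["email", "country"], key_field="id")
print(report)
print("Ready for model training:", passes_gate(report))
```

In practice the threshold and the set of checks would be agreed with the data owners. The point is simply that the check happens before the data reaches the model, not after the model has produced a questionable answer.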

Why data governance for AI?

In my last article, I discussed the relationship between data quality and data governance. If it is essential that a healthy AI trains on high-quality, trusted data, it follows that data governance is also critical in supporting the quality of that data. Data governance traditionally helps us understand what we have, but data governance's role is now larger and supports the responsible use of AI. With AI we need to go one step further and understand what we should be using that data for.

Data governance plays a crucial role in AI by establishing clear policies and procedures for data quality, which is essential for training reliable and ethical AI models. It supports AI success by fostering trust and transparency in AI applications among users and stakeholders. We need to trust that AI will do the right thing and give us the right answers. We need to make sure that we develop and use GenAI correctly and appropriately. It is very clear that there is an interconnectedness between data quality, data governance, and AI. That overlap is AI governance.

What is AI governance?

In the same way that data governance is about way more than just the data, AI governance is much wider than just the AI models. Sol Rashidi explains this well in her book, Your AI Survival Guide, when she says:

"Regardless of your role and reasons, just know that 70% of the success of your AI deployment has nothing to do with the technology. As a matter of fact, the tech is the easiest part of the life cycle, most of the work is dealing with human capital and relationships, aligning with goals, overcoming fears, picking the right strategy, picking the right use case, and finding the ambition."

AI governance is about managing the ethical use, risks, and principles associated with AI, which inherently includes managing the quality and governance of the underlying data, as illustrated in the diagram above.

Non-data people often think of AI governance as being only about ethics and avoiding unintentional bias, but I believe that our considerations in more broadly adopting and trusting AI should be wider and include governance of the data the model trains on, as well as the model itself.

I'm not alone in thinking this; there are a growing number of organizations in which AI governance is already acknowledged as a subset of data governance, and AI is discussed regularly at data governance committee meetings.

But you don't have to take my word for it. Lara Gureje, Founder of DatOculi and former VP Enterprise Data Governance at BNP Paribas Bank Group, offers this perspective on data quality, data governance, and AI.

AI's reliance on data quality and governance

"At the heart of AI innovation lies a relentless pursuit of intelligence. However, this journey towards intelligence is heavily dependent on the quality and governance of data. Governed data serves as the foundational raw material for optimizing AI and machine learning (ML) models. Without a solid foundation, trust in models diminishes, rendering analytics meaningless.

If organizations find themselves grappling with the performance of their AI models and analytics, the fundamental question arises: "What is the quality of the data being fed into these models?" It's time for every organization striving to extract insights from their data to recognize the imperative need for a governed data environment as the foremost critical success factor.

The synergy between AI/ML and data governance is where the true potential of data assets is unlocked. Establishing data governance as an integral part of organizational culture becomes the cornerstone for driving innovation and extracting meaningful value from data.

A poignant fact emerges: the marriage of AI/ML and data governance represents the sweet spot for optimizing data assets. It's a symbiotic relationship; you cannot do one without the other.

Why does it matter?

  • Recognize that it's not about the volume of data but rather its quality and trustworthiness that matters when feeding models.
  • Govern the health of data sets to pave the way for uncovering optimal insights and value within the data.
  • Ensure efficient resource utilization on curating and collecting data for models.

Here is how to optimize AI/ML models by following these practical steps to integrate data governance into your initiatives.

  1. Engage early and kick off data governance as a pre-requisite for AI/ML projects to ensure alignment and effectiveness.
  2. Harmonize and integrate governed, trusted data seamlessly into AI initiatives to enhance their efficacy.
  3. Consistently assess the quality of data being fed into models and prioritize trusted data sources.
  4. Educate data citizens and foster a cultural mindset of stewardship and due diligence around data assets to ensure sustainable governance practices.
  5. Continuously measure the model outputs against input data quality to drive refinement.
  6. Establish formalized processes for attestation and certification of data inputs and outputs to ensure accountability and reliability.

In conclusion, the future of AI and machine learning is poised to change and improve various aspects of our lives. However, to unlock its true potential, we must acknowledge that the quality of output is intrinsically linked to the quality of input data. Governed data yields better outcomes, trusted data yields better models, and better models lead to superior AI predictions.

Data governance holds a non-negotiable role in shaping the AI landscape for the better."

You can listen to Lara and me discuss this topic in detail on our podcast here.

Data quality is vital for a healthy AI

Without a doubt, ingesting high-quality and well-governed data improves the accuracy of AI models. Having good data quality is fundamental for training AI models and achieving desired outputs. After all, how can you trust the data that GenAI models are creating if the underlying data is poor quality? AI, generative or otherwise, cannot provide the correct answers or generate useful new insights if the quality of the underlying data is poor. The old adage holds true: "garbage in, garbage out".

Conversely, using good-quality data ensures human trust in AI models and their outputs, so having high-quality data is the foundation for organizations to confidently develop and use AI. The intertwined nature of data quality, data governance, and AI governance means that it is vital to embrace all three to ensure responsible and ethical AI use. As my friend Tudor Borlea said, "Generative AI is growing so quickly that if you don't get a grasp on it early on with governance, it will just explode in your face".

So, if your organization has already embraced or is about to embrace GenAI, make sure that you are thinking about AI governance.

AI governance is vital because AI initiatives are growing rapidly in scale and usage. To manage AI responsibly, we should take the same approach as with any other asset of our organization and do so intentionally and thoughtfully. Perhaps the use of the word "governance" conveys the incorrect impression that AI governance is all about stopping people from creating and using AI models, but really, it's just about ensuring that we are doing our best to do the right things in the right way.

AI models are making decisions at high speed, and with speed comes greater risk. Governance is needed so that proper principles and guidelines are in place for the responsible development of AI and use of its output.

Key principles for AI governance

A couple of months ago, I interviewed Sony Europe's Head of Data and AI Governance, Sayantan Chaklader, on my podcast, The Data Governance Podcast. He agreed with me that the principles for AI governance are very similar to those for data governance. As mentioned above, it's about doing the right thing in the right way, and that applies whether we are talking about data or AI.

What follows are several principles you might want to consider for your approach to AI governance.

  • Ownership: Ensure proper ownership and accountability for AI models and outputs.
  • Transparency: Model purpose and logic must be explainable and documented.
  • Documentation: Source data must be understood and documented.
  • Appropriateness: The quality of the source data must be fit for its purpose.
  • Security: Data used in the models must be secured in the same way it is in any of our organization's other systems.
  • Privacy: Use of personal data in the models must be compliant with data protection and privacy regulations.
  • Enablement: Training should be provided to all AI users to build understanding of acceptable use of AI. (From experience, this sits well as part of data literacy training.)

This list focuses on the data side of AI governance, but you will also need principles that address ethical use, regulatory compliance (such as some of the provisions of the EU AI Act that will come into effect later this year), and risk management.
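One lightweight way to put the data-side principles above into practice is to keep a structured record for each model and flag whatever is missing. The sketch below is purely illustrative: `ModelRecord`, its fields, and `governance_gaps` are hypothetical names invented for this example, not part of any standard, regulation, or product.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    """Hypothetical register entry for one AI model (fields are assumptions)."""
    name: str
    owner: str = ""                 # Ownership: who is accountable
    purpose: str = ""               # Transparency: what the model is for
    logic_summary: str = ""         # Transparency: how it works, in plain words
    source_datasets: list = field(default_factory=list)  # Documentation
    contains_personal_data: bool = False
    privacy_review_done: bool = False


def governance_gaps(record):
    """Return a list of principle-level gaps found in a model record."""
    gaps = []
    if not record.owner:
        gaps.append("Ownership: no accountable owner recorded")
    if not record.purpose or not record.logic_summary:
        gaps.append("Transparency: purpose or logic is not documented")
    if not record.source_datasets:
        gaps.append("Documentation: source data is not listed")
    if record.contains_personal_data and not record.privacy_review_done:
        gaps.append("Privacy: personal data used without a completed review")
    return gaps


# Hypothetical draft entry with several items still missing.
draft = ModelRecord(
    name="churn-predictor",
    purpose="Predict which customers are likely to leave",
    contains_personal_data=True,
)

for gap in governance_gaps(draft):
    print(gap)
```

A register like this could live in an existing data catalog rather than in code. The sketch only shows that the principles can be turned into concrete, checkable fields.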

An AI governance framework?

While some of the principles listed above are related to data security and data protection, the remainder of the principles are the same, or very similar, to those for a data governance framework. That is why I am a firm believer that you do not need to create a separate framework just for AI governance.

Just as data quality is covered by your data governance framework, you should evolve that framework and extend its scope to cover AI. This is easier than designing another framework from scratch, it will be much simpler for your business stakeholders to get their heads around, and it avoids having multiple frameworks that cover different things.

First steps for AI governance

So, what steps can you take to start your journey towards responsible AI governance? You are not going to be able to design and implement a program overnight, but I would start the process with the following steps:

  1. Foster a culture of responsible AI use. Getting formal governance in place is likely to take some time, so begin small by getting everyone to think intentionally about AI use.
  2. Add AI governance to your data governance agenda. Get your senior stakeholders talking about it and understanding its importance.
  3. Create and deliver data literacy training that includes AI awareness. It needs to cover what staff can and can't do with GenAI. Too many organizations are rolling out tools like Microsoft Copilot with no guidance to staff on what they should and, more importantly, should not be used for.
  4. Assess your organization's current use of AI and the risks of its usage. Is it within your organization's current risk appetite?
  5. Develop your AI governance policy. Determine what you aim to achieve with AI governance, such as ensuring ethical use, complying with regulations, promoting transparency, and safeguarding against biases.
  6. Coordinate your AI governance approach with your data governance approach and framework to make sure that data quality, data governance, and AI governance are all fully aligned. Wherever possible expand the scope of your existing data governance team and processes to include AI.
  7. Train or add some AI expertise to your data governance team.

Is your organization ready to embrace AI?

Let’s face it: GenAI is rapidly becoming part of everyday operations in a huge and diverse collection of industries. The possibilities associated with GenAI adoption are very exciting, assuming it is embraced in the right way.

In a recent data governance survey I ran with Baringa Partners, we asked participants some questions about AI and were surprised that only just over half of respondents had considered the impact of their organization’s adoption of AI on their approach to data governance. Despite that lower-than-expected number, many respondents added comments to highlight that they felt that AI and data governance activities need to be aligned and integrated, while a number voiced concerns that the quality of their data was not currently good enough to support AI initiatives.

The survey also asked for views on whether teams could be successful at managing data quality, and other data management activities, without AI. Over 70% felt that they could, and while there have been limited options to use AI in these activities until recently, I think that it is important to look for opportunities to use AI to speed up these activities. But I’ll get into that in my next blog.

In a similar vein, it is interesting that Soda, a company whose software is on the cutting edge of data quality with GenAI, recently ran a poll on LinkedIn to see whether people thought that GenAI will disrupt the data quality workflow. The results showed somewhat mixed and ultimately inconclusive views. It seems to be a case of “watch this space!”

These are exciting times and GenAI is definitely going to impact our data quality activities, so we need to make sure that we are ready for it. To help you get started, use this checklist to determine if your organization’s data is ready for you to embrace AI.

Checklist: Is your data ready for AI?

  • Do you already have a data governance framework in place with documented roles, responsibilities, and processes?
  • Is data governance embedded in your project and change management processes?
  • Do you have a process to identify and classify the most important data?
  • Do you use a data catalog?
  • Do you have processes to monitor the quality of important data?
  • Do you have processes in place to address poor-quality data and issues with data?
  • Do you have something in place that offers data observability to help you monitor data reliability?
  • Do you have a reasonable level of data literacy across the organization?

If you answered “no” to any of these questions, it’s time to take action.  

Take action!

Good luck!