It’s no secret that data privacy is one of the most prevalent qualms people have with AI.
Generative AI, in particular, is trained on vast amounts of information. That’s just how it works.
But when you consider that some of your data may have been used to train an AI model — without your explicit consent — it’s a little concerning.
For marketing professionals, business leaders and company execs, AI training data raises a couple of troubling questions:
- Has your personal or company data been exposed to AI systems?
- If so, what sort of data has been collected and parsed?
The newness of AI and its reliance on publicly shared information have made it difficult for people, companies and governments to know how to handle these questions, so let’s explore them together.
How Concerned Are Folks About AI and Data Privacy?
Many companies have announced bans on the use of generative AI tools in the workplace over the past few years. Even so, Cisco’s recent Data Privacy Benchmark Study revealed some telling truths.
Managing data and risks looks different when it comes to GenAI. In fact, 92% of the 2,600+ security professionals surveyed said GenAI requires new skills and techniques to manage data and mitigate risks, which means most organizations are likely still trying to catch up.
Additionally, 48% of respondents have entered non-public company data into a generative AI app at some point, and 69% said they’re “concerned GenAI could hurt [their] company’s legal rights and intellectual property.”
This creates a gap: Organizations need experts with the right knowledge and skills to manage AI-related data and risk, yet nearly half of respondents are already entering (or have entered) non-public company information into GenAI apps. Until organizations find the right people to rein in GenAI risk, that gap will only widen.
In the meantime, something needs to hold the line. In this age of evolving and increasingly ubiquitous generative AI, how can people, artists and companies choose how (and when and if) their data is used? Before I answer that question, let’s take a look at an example of AI training done well, albeit on a much smaller scale than the likes of OpenAI and ChatGPT.
AI Training With Data Privacy In Mind: Holly Herndon and Holly+
Have you heard of Holly Herndon? I only recently learned about her, but in the AI world, she’s something of a superstar. Her visual and musical AI creations are impressive, but she’s better known for how she’s approached the relationship between generative AI and its training data sources.
She and her husband Mat Dryhurst (also an artist) boarded the AI train early; in 2016, they created Holly+, an AI singing application. (And honestly, Holly+’s rendition of “Jolene” may be my second-favorite cover of that country classic, after the one on Cowboy Carter.)
When they were developing Holly+, they purposefully limited their training sources to their own data or that which they had express consent to use.
This is not the standard. Most data on the internet has been used, in some way and at some point, to train some sort of AI. The truth is, we don’t know the extent of this — and that has a lot of people concerned.
(Will my teenage Facebook posts come back to haunt me in some sort of AI-generated ad someday?)
For individuals and artists, this brings up terrifying questions about the impact these technologies could have on their future careers and livelihoods. For organizations, the worry is that large language models might inadvertently reproduce verbatim content from their training data, especially if that data was heavily repeated or uniquely identifiable. That could mean surfacing information from proprietary reports, emails, customer records and more, which opens up a whole can of data privacy worms.
But what can we do about it?
How To Take Control of Your Data
You can’t really remove items from a dataset they’re already part of. However, there are proactive measures you can take to opt out of future training.
For Individuals
Holly Herndon and Mat Dryhurst didn’t just create an AI that mimics an artist’s voice; they created a solution for artists, too.
It’s a website called Have I Been Trained? and the idea is simple: It lets you search for images, domains and more to see if your work is included in popular AI training datasets, like LAION-5B.
While you can’t pull items out of an existing dataset, Have I Been Trained? allows artists to preemptively opt their work out of future training. That’s an easy option if you find yourself worried about your data being used to train robots.
Additionally, more companies are adding opt-out options for users, a trend I hope continues. Adobe, for example, gives users a simple toggle switch right within its privacy page. Switch it off to register your preference, and the platform won’t analyze your content.
For Organizations
For organizations, opting out isn’t as direct or straightforward as flicking a switch. Companies are collections of individuals who all make unique decisions every day, which makes data security inherently more challenging to manage. That said, businesses still have some options:
Create and Update Company-Wide AI and Data Protection Policies
Create policies that outline which tasks are and are not permissible to carry out with the aid of AI. Be as specific as possible, and ensure that all members of your company are aware of and understand those policies.
Stay On Top of Vendors and Third Parties
When working with vendors or third parties, ask how they use AI and what their policies are. If they use AI tools, find out whether that use would violate your internal policies. Be vocal about what you’re comfortable with and what isn’t OK.
For every tool in your tech stack, review its privacy policy and the measures it takes to protect your data.
Update Your robots.txt File
A robots.txt file is a simple text file placed in the root directory of a website to guide web crawlers on which parts of the site can or cannot be accessed and indexed.
Having control over this handy file means you can use directives such as “Disallow” to prevent crawlers from accessing certain directories, files or types of content. This can enhance data security by restricting access to sensitive or proprietary content that might otherwise be scraped and potentially included in AI training datasets.
If you’ve yet to do so, consider disallowing access to:
- Directories containing confidential files.
- Research materials.
- User-generated content.
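To make this concrete, here’s a minimal robots.txt sketch. The directory paths are placeholders you’d swap for your site’s actual structure, and the user agents shown (GPTBot, Google-Extended and CCBot) are AI-related crawlers that document robots.txt support; check each vendor’s current documentation before relying on them.

```
# Block known AI training crawlers from the entire site.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Keep sensitive directories off-limits to all crawlers.
# These paths are placeholders; use your site's real structure.
User-agent: *
Disallow: /confidential/
Disallow: /research/
Disallow: /user-content/
```

Keep in mind that robots.txt is a voluntary standard: reputable crawlers honor it, but it isn’t access control, so truly sensitive material should also sit behind authentication.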
Afterward, stay on top of your robots.txt file by regularly reviewing and updating it to account for newly added web pages and content.
Preparing for an Inevitable AI Future
Data privacy and security have always been concerns for businesses and consumers alike, but with the proliferation of GenAI, those concerns are only growing.
While I’m confident that most reputable businesses have data privacy policies in place, it never hurts to give them a comprehensive review and update to ensure they align with a digital landscape that’s now rife with generative AI tools and applications.
AI will continue to change rapidly over the next few years. Taking steps now to protect your data, whether personal or commercial, puts you at the forefront of that change and helps ensure you, your business and your customers stay protected.
Note: This article was originally published on contentmarketing.ai.