New Search Tool Allows Website Owners To Figure Out If Their Content Was Used To Train AI Systems As A Part Of Google’s C4 Dataset

A new report by The Washington Post highlights a search tool that publishers can use to find out whether their website or content was used to train AI systems.

Simply put, the tool lets you check whether your site is part of Google’s C4 dataset. The question then is whether you care, and if you don't, whether you should.

The dataset covers a wide range of websites, including many content creators that generative AI could harm or even wipe out. Among them are news outlets, publishers in media and marketing, and some blogs.

The tool appears in the outlet’s latest report, titled "Inside the secret list of websites that make AI like ChatGPT sound smart." The report ranks websites by the number of tokens each one contributed to the dataset. For those asking what tokens are, they are the small pieces of text, sometimes a word and sometimes a phrase, that are used to process otherwise disorganized information.
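To make the idea of ranking sites by token count concrete, here is a minimal sketch. It is not the method used in the report: the function name `rank_domains_by_tokens`, the sample pages, and the choice to treat every whitespace-separated word as one token are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_domains_by_tokens(pages):
    """Rank domains by how many tokens their pages contribute.

    `pages` is an iterable of (url, text) pairs. A "token" here is simply
    a whitespace-separated word, which is an assumption for illustration;
    real tokenizers split text differently.
    """
    counts = Counter()
    for url, text in pages:
        domain = urlparse(url).netloc.lower()
        counts[domain] += len(text.split())
    # most_common() returns (domain, token_count) pairs, largest first
    return counts.most_common()


# Hypothetical sample data, not taken from the C4 dataset
sample_pages = [
    ("https://example.com/a", "one two three"),
    ("https://example.com/b", "four five"),
    ("https://example.org/x", "six"),
]

ranking = rank_domains_by_tokens(sample_pages)
for domain, tokens in ranking:
    print(domain, tokens)
```

Grouping by domain rather than by individual page mirrors how a site owner would look up a whole website, though the real tool may aggregate differently.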

Search Engine Land is one example of a site that was used. Further searching showed that Marketing Land Events also made the list, as did Search Engine Land's parent company, Third Door Media.

Some sites were used only in bits and pieces, alongside data drawn from the likes of Reddit and Wikipedia, among others. And on that note, Reddit deserves a closer look.

The company wants to be paid whenever other companies use its data to train AI models, as confirmed in a recent report by The New York Times. Reddit has already updated the terms for its API and will now charge some firms, such as Google and OpenAI, for greater access. Reddit’s co-founder and CEO confirmed the change.

The news is not too surprising, as this seems to be within Reddit’s rights. The company holds a lot of data that it considers valuable.

From its point of view, there is no reason to hand that data to leading brands and enterprises for free. Reddit clearly has a problem with businesses building value of their own from the data without giving credit to the company or its users. It evidently sees this as the time to tighten up a number of things, API access being one of them.

The irony is that Reddit did not create any of this value itself. In reality, its users did, so they are the ones who deserve the credit. But there is little room to argue, since all of it was posted and recorded on the platform itself.

Dr. Hura Anwar
