
AI scraping your content for “free training”? Not anymore.

Freelancers and clients have a choice: not only can they stop LLMs from training on their content for free, they can also make revenue from AI scraping

As legal cases against tech giants mount, it’s essential for freelancers and their clients to safeguard their digital assets. Learn how to prevent AI from freely training on your valuable content.


Legal battle exposes widespread content harvesting

The legal landscape has shifted dramatically for content creators and freelancers following Reddit’s lawsuit against AI company Anthropic, filed in the San Francisco Superior Court. Reddit accuses the Claude chatbot developer of unlawfully training its models on Reddit users’ personal data without a licence, highlighting a critical issue affecting millions of freelancers and digital creators worldwide.

The case reveals that Anthropic’s bots continued to scrape the platform more than 100,000 times even after the company claimed to have blocked such activities, according to Reddit’s legal filing. The lawsuit comes at a time when AI companies are increasingly hungry for training data and often take it without compensating or notifying content creators.

Some companies are using LLMs such as Claude to increase sales and operational efficiencies. For example, Replit integrated Claude into its “Agent” to turn natural language into code, driving 10X revenue growth; Thomson Reuters’ tax platform CoCounsel uses Claude to assist tax professionals; Novo Nordisk has used Claude to cut clinical study report writing from 12 weeks to 10 minutes; and Claude now helps power Alexa+, bringing advanced AI capabilities to millions of households and Prime members.

Costly lawsuits could not have come at a worse time for Anthropic or its investors. In May 2025 Anthropic raised $3.5 billion at a $61.5 billion post-money valuation. The round was led by Lightspeed Venture Partners, with participation from Bessemer Venture Partners, Cisco Investments, D1 Capital Partners, Fidelity Management & Research Company, General Catalyst, Jane Street, Menlo Ventures and Salesforce Ventures, among other new and existing investors.

Why should freelancers care about AI scraping?

For freelancers, photographers, writers, and digital creators, this case represents more than just corporate squabbling. It’s about the value of their intellectual property and whether AI companies will follow the rules. Reddit’s primary claim against Anthropic, for example, is breach of contract arising from violations of its user agreement, which could establish an important precedent for how platforms and AI companies handle user-generated content.

However, the implications extend far beyond Reddit. Meta’s AI systems, OpenAI’s ChatGPT, and dozens of other AI companies are continuously scanning the internet for training material, potentially including freelancers’ portfolios, blog posts, social media content, and client work posted online.

Step-by-step content protection guide

Implement robots.txt protection

Your first line of defence could be the robots.txt file, which sits in your website’s root directory. Well-behaved AI bots follow robots.txt directives and don’t use unlicensed content to train their models, though some companies ignore these directives entirely.

Other ways to prevent AI from scraping your website are to block the IP addresses of known scrapers and to use a web application firewall (WAF) such as Cloudflare’s. You can also use CAPTCHAs to deter automated scraping.
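To complement robots.txt, here is a minimal server-level sketch for nginx that refuses requests from known AI-crawler user agents. The domain, paths, and agent list are illustrative, and user-agent strings are easily spoofed, so treat this as a deterrent rather than a guarantee:

    # Assumes you manage your own nginx config; the map goes in the http {} context.
    map $http_user_agent $is_ai_bot {
        default 0;
        # Case-insensitive match on common AI-crawler user agents.
        "~*(GPTBot|CCBot|ChatGPT-User|omgili|Bytespider)" 1;
    }

    server {
        listen 80;
        server_name example.com;
        root /var/www/html;

        location / {
            # Refuse matched crawlers before any content is served.
            if ($is_ai_bot) {
                return 403;
            }
        }
    }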

Each domain (or subdomain) can have a robots.txt file, according to the Google specification. As data platform Bright Data explains, “this is optional and must be placed in the root directory of the domain. In other words, if the base URL of a site is https://example.com, then the robots.txt file will be available at https://example.com/robots.txt.”


Here are some other sources that may be useful:

  • Remove your own site’s images from Google Search
  • Video SEO Best Practices

For WordPress users:

  • Install the WP File Manager plugin
  • Navigate to your site’s root directory
  • Create or edit the robots.txt file

Consider adding these AI crawler blocks:

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: omgili
    Disallow: /

    User-agent: omgilibot
    Disallow: /

    User-agent: ByteSpider
    Disallow: /
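Once your robots.txt is live, you can verify that it actually blocks the crawlers you care about. A quick check using only Python’s standard library (example.com is a placeholder for your own domain):

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the live robots.txt (replace with your own domain).
    parser = RobotFileParser("https://example.com/robots.txt")
    parser.read()

    # Confirm each AI user agent is denied access to the site root.
    for agent in ["GPTBot", "Google-Extended", "CCBot", "ChatGPT-User", "ByteSpider"]:
        status = "allowed" if parser.can_fetch(agent, "https://example.com/") else "blocked"
        print(f"{agent}: {status}")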

Secure client work and portfolios

  • Password-protect sensitive portfolios: Use authentication layers for client work samples

  • Implement noindex meta tags: Add <meta name="robots" content="noindex, nofollow"> to pages you don’t want crawled (see the header example after this list)
  • Use watermarks: Mark visual content with subtle identifying information
  • Consider private hosting: Keep work-in-progress projects on password-protected sites
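A caveat on the noindex meta tag: it only works for HTML pages. For portfolio assets such as images and PDFs, the equivalent is the X-Robots-Tag HTTP response header. A sketch for an Apache server, assuming mod_headers is enabled (the extension list is illustrative):

    # In .htaccess or your site config: send a noindex header for binary assets.
    <FilesMatch "\.(pdf|jpe?g|png|gif)$">
        Header set X-Robots-Tag "noindex, nofollow"
    </FilesMatch>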

Monitor your content use

  • Set up Google Alerts: Create and monitor alerts for your unique phrases and content

  • Use reverse image search: Regularly check if your images appear elsewhere
  • Check AI training datasets: Some companies publish lists of websites they’ve scraped. But there are other ways to check for AI scraping. For example, Spawning is a service that allows people to see if their creations have been scraped and then opt out of any future training.

According to Jordan Meyer, co-founder and CEO of Spawning, “anything with a URL can be opted out. Our search engine only searches images, but our browser extension lets you opt out any media type.” The startup’s image search covers datasets used to train text-to-image tools such as Stable Diffusion.

  • Monitor forum activity: Be aware that posts on platforms like Reddit may be harvested

Before posting on any platform, review its terms regarding AI training and data usage. Several AI companies have been found circumventing the Robots Exclusion Protocol (robots.txt) to scrape content from websites without permission, making it important to understand what protection each platform actually provides.

AI lawsuits set to soar

Reddit’s lawsuit against Anthropic joins a growing list of legal challenges facing AI companies. Reddit is seeking damages, restitution, and a court order barring Anthropic from using any Reddit-derived data in its products, which could set important precedents for content creators’ rights.

The case particularly resonates because it involves user-generated content—the same type of material freelancers regularly create and share online. If successful, it could establish stronger protections for individual creators against unauthorised AI training.

How to generate revenue from AI training

But what if you are happy for AI to train on your content, for a price? Then you need to set up a contract or AI training data licence agreement. When you’re looking for a contract template, focus on “Data License Agreement” or “Data Use Agreement” forms. You will then need to draft clauses covering charges, licence scope, and ownership of trained models, so that fees are clearly structured and tied to the value of your content.

What should freelancers do now?

Freelancers can no longer assume their work online is safe from commercial exploitation without consent or compensation.

Here’s what you can consider doing in the meantime:

  • Audit your online presence and implement robots.txt protections
  • Review terms of service for platforms where you share work
  • Consider legal consultations for valuable intellectual property
  • Document your original work with timestamps and metadata (see the sketch after this list)
  • Network with other creators to share protection strategies
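On documenting your work: a simple, verifiable record is a cryptographic hash plus a UTC timestamp for each file. A minimal Python sketch, where portfolio_piece.jpg is a placeholder path:

    import hashlib
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    def fingerprint(path: str) -> dict:
        # Hash the file bytes so any later copy can be matched to this record.
        data = Path(path).read_bytes()
        return {
            "file": path,
            "sha256": hashlib.sha256(data).hexdigest(),
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        }

    # Log the record as JSON; keep it somewhere dated, e.g. an email to yourself.
    print(json.dumps(fingerprint("portfolio_piece.jpg"), indent=2))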

Digital rights in the AI boom

As TechCrunch reports, this legal battle represents a defining moment for digital rights in the AI era. The outcome could determine whether content creators receive fair compensation for their intellectual property or continue to see their work harvested without consent.

For freelancers, proactive protection of digital assets is no longer optional; it’s an essential business practice. The tools exist to safeguard your content, but only if you take action before your work ends up training the next generation of AI systems.

The freelance economy depends on protecting creators’ rights to control and monetise their intellectual property. Reddit’s bold legal stance may well determine the future of that fundamental principle in our increasingly AI-hungry world. But don’t hold your breath.

