

Comments on Should we disallow ChatGPT-User crawler (and others) from scraping Software Codidact?

Parent

Should we disallow ChatGPT-User crawler (and others) from scraping Software Codidact?

+10
−2

Stack Overflow has recently announced OverflowAI, and I think this video summarises it pretty well. The main drawback is that users have less incentive to put effort into answering questions when that effort is then fed into SO's LLM.

I am wondering if it makes sense to start blocking known agents that feed LLMs. I am thinking specifically about ChatGPT-User. This seems to be very simple to do, as shown in this article: disallow ChatGPT-User user agent on the site in the robots.txt.
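For illustration, here is a minimal sketch of what such a rule might look like and how a well-behaved crawler would interpret it. The robots.txt content and the example.com URL are assumptions made for this example, not the site's actual configuration or the linked article's text. The check uses Python's standard urllib.robotparser, and it only models crawlers that choose to honour robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the ChatGPT-User agent, allow everyone else.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /

User-agent: *
Allow: /
"""

# Placeholder URL, used only to evaluate the rules.
URL = "https://example.com/posts/1"


def build_parser(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant crawler identifying as ChatGPT-User should be refused.
print("ChatGPT-User allowed:", parser.can_fetch("ChatGPT-User", URL))

# Other agents fall through to the wildcard entry and are allowed.
print("Googlebot allowed:", parser.can_fetch("Googlebot", URL))
```

Other crawlers could be blocked the same way by adding further User-agent entries, one per agent name.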

I know the community is small now, but it is growing, especially due to SO's decline, and one day it might become interesting for ChatGPT and others.


1 comment thread

Official link (2 comments)
Post
+4
−1

robots.txt amounts to politely asking people to please not crawl you, so I wouldn't expect it to do much.

At the same time, you might as well ask politely, to avoid giving the impression that you do want them to crawl you. Letting Google crawl you is fine because Google actually sends traffic back. But AFAIK ChatGPT doesn't send back much traffic. Their LLM even avoids prompts about specific websites.

I wish the terms of the site could be changed, to make it specifically forbidden to use this data for model training without permission. I doubt that would do much either, unless you have money set aside to actually sue, but at least it would make OpenAI's lawyers a bit nervous.


1 comment thread

Model training vs. Creative Commons (5 comments)
Karl Knechtel‭ wrote over 1 year ago

In re "I wish the terms of the site could be changed, to make it specifically forbidden to use this data for model training without permission."

Adding this sort of restriction explicitly is not possible using Creative Commons licenses, and the necessary legal work to create modified versions is not reasonable (or at least, anyone can draft whatever license they like, but there's nothing like a guarantee that any court would respect it).

However, by their nature, LLMs are generally abysmal at attributing their output, and depending on the input they can end up plagiarizing quite blatantly. This should be all that is needed for a legal cause of action if they are trained on any kind of CC BY-licensed data. They also don't attempt to license their output at all, so it seems to me that they cannot honour -SA licenses.

Disclaimer: I am not a lawyer and this is not legal advice.

matthewsnyder‭ wrote over 1 year ago

That's a good point - CC BY would "catch" the LLMs on attribution. However it has the side effect of also requiring humans to do it. Personally speaking, I don't care about making people do the busywork of citing my posts just for some trivial copy and paste from them. If I cared to take credit I'd publish an article or book with proper copyright. I post them truly intending it to be a "common resource", so let them "steal" away. The fame I get from an online post is tiny, and requiring attribution would destroy the convenience provided by them.

On the other hand, I don't like the idea that people use my posts, and many other people's, en masse to train models which then compete with the site that fomented those posts in the first place. It's cannibalistic. So what I would really like is a "do wtf you want, except training models" license.

matthewsnyder‭ wrote over 1 year ago

As for the legalities - a lot of people act like copyright is some magic spell that works automatically but it doesn't. There is no copyright police that will come stop me automatically as soon as I use your work without proper license, the way they would stop me from stealing your car even if you didn't call to report it. Copyright only matters when you go to court, and by extension via people's fear of being taken to court. The fear is often irrational and grounded in fantasy because few lay people seem willing to actually learn copyright law. In this case, we all know CD is very unlikely to have resources to go up against OpenAI's highly paid legal team, so we should instead appeal to the irrational fear.

matthewsnyder‭ wrote over 1 year ago

There also seems to be a strange but widespread cargo cult about contracts, where people think just because professional lawyers write very obtuse and wordy contracts, that if they write one in simple language it's somehow not valid. In reality there is nothing wrong with writing a contract in your own words. Lawyers use the forms they do because (1) they don't want clients asking why they're getting billed thousands for something they can do themselves and (2) lawyers actually do go to court frequently, over millions of dollars, against offenders who themselves have lawyers employing many loopholes, and in that case the "legalese" does matter.

In everyday, simple interactions, a plain language statement has every bit of legal force that any contract would. So there is nothing wrong with modifying existing contracts, again unless you're trying to protect something worth millions of dollars.

Karl Knechtel‭ wrote over 1 year ago

When I'm answering questions about code nowadays, I tend to have way more prose than actual code. If there's a lot of code, it's because there are a lot of trivial examples that can't really be useful for others as-is. So I'm happy to stick with -BY license versions. Maybe this doesn't work for everyone.