
Comments on Should we disallow ChatGPT-User crawler (and others) from scraping Software Codidact?

Parent

Should we disallow ChatGPT-User crawler (and others) from scraping Software Codidact?

+10

−2

Stack Overflow has recently announced OverflowAI, and I think this video summarises it pretty well. The main drawback is that users are less incentivized to put effort into answering questions if that effort just gets fed into SO's LLM.

I am wondering if it makes sense to start blocking known agents that feed LLMs. I am thinking specifically about ChatGPT-User. This seems to be very simple to do, as shown in this article: disallow the ChatGPT-User user agent in the site's robots.txt.
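For illustration, here is a minimal sketch of what such a rule could look like and how a well-behaved crawler would interpret it. The robots.txt contents, the `is_allowed` helper, and the `example.com` URL are assumptions made up for this example, not the site's actual configuration. The check uses Python's standard `urllib.robotparser`, so it only shows what a compliant crawler would do; nothing here enforces the rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain used only for this sketch.
    for agent in ("ChatGPT-User", "Googlebot"):
        print(agent, is_allowed(agent, "https://example.com/posts/1"))
```

With these rules, the named agent is refused for every path while other agents fall through to the wildcard entry. Any additional crawlers would each need their own `User-agent` block.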

I know the community is small now, but it is growing, especially due to SO's decline, and one day it might become interesting for ChatGPT and others.

discussion

posted almost 2 years ago

CC BY-SA 4.0



Post

+4

−1

robots.txt amounts to politely asking people to please not crawl you, so I wouldn't expect it to do much.

At the same time, you might as well ask politely, to avoid giving the impression that you do want them to crawl you. Letting Google crawl you is fine because Google actually sends traffic back. But AFAIK ChatGPT doesn't send back much traffic. Their LLM even avoids prompts about specific websites.

I wish the terms of the site could be changed, to make it specifically forbidden to use this data for model training without permission. I doubt that would do much either, unless you have money set aside to actually sue, but at least it would make OpenAI's lawyers a bit nervous.

posted almost 2 years ago

CC BY-SA 4.0

matthewsnyder‭


1 comment thread

Model training vs. Creative Commons (5 comments)

Karl Knechtel‭ wrote almost 2 years ago

In re "I wish the terms of the site could be changed, to make it specifically forbidden to use this data for model training without permission."

Adding this sort of restriction explicitly is not possible using Creative Commons licenses, and the necessary legal work to create modified versions is not reasonable (or at least, anyone can draft whatever license they like, but there's nothing like a guarantee that any court would respect it).

However, by their nature, LLMs are generally abysmal at attributing their output, and depending on the input they can end up plagiarizing quite blatantly. This should be all that is needed for a legal cause of action if they are trained on any kind of CC BY-licensed data. They also don't attempt to license their output at all, so it seems to me that they cannot honour -SA licenses.

Disclaimer: I am not a lawyer and this is not legal advice.

matthewsnyder‭ wrote almost 2 years ago

That's a good point - CC BY would "catch" the LLMs on attribution. However, it has the side effect of also requiring humans to do it. Personally speaking, I don't care about making people do the busywork of citing my posts just for some trivial copy and paste from them. If I cared to take credit I'd publish an article or book with proper copyright. I post them truly intending them to be a "common resource", so let them "steal" away. The fame I get from an online post is tiny, and requiring attribution would destroy the convenience the posts provide.

On the other hand, I don't like the idea that people use my posts, and many other people's, en masse to train models which then compete with the site that fomented those posts in the first place. It's cannibalistic. So what I would really like is a "do wtf you want, except training models" license.

matthewsnyder‭ wrote almost 2 years ago

As for the legalities - a lot of people act like copyright is some magic spell that works automatically, but it doesn't. There is no copyright police that will come stop me automatically as soon as I use your work without a proper license, the way they would stop me from stealing your car even if you didn't call to report it. Copyright only matters when you go to court, and by extension via people's fear of being taken to court. The fear is often irrational and grounded in fantasy because few lay people seem willing to actually learn copyright law. In this case, we all know CD is very unlikely to have resources to go up against OpenAI's highly paid legal team, so we should instead appeal to the irrational fear.

matthewsnyder‭ wrote almost 2 years ago

There also seems to be a strange but widespread cargo cult about contracts, where people think just because professional lawyers write very obtuse and wordy contracts, that if they write one in simple language it's somehow not valid. In reality there is nothing wrong with writing a contract in your own words. Lawyers use the forms they do because (1) they don't want clients asking why they're getting billed thousands for something they can do themselves and (2) lawyers actually do go to court frequently, over millions of dollars, against offenders who themselves have lawyers employing many loopholes, and in that case the "legalese" does matter.

In everyday, simple interactions, a plain language statement has every bit of legal force that any contract would. So there is nothing wrong with modifying existing contracts, again unless you're trying to protect something worth millions of dollars.