
Welcome to Software Development on Codidact!

Will you help us build our independent community of developers helping developers? We're small and trying to grow. We welcome questions about all aspects of software development, from design to code to QA and more. Got questions? Got answers? Got code you'd like someone to review? Please join us.

Should we disallow ChatGPT-User crawler (and others) from scraping Software Codidact?

+10
−2

Stack Overflow has recently announced OverflowAI, and I think this video summarises it pretty well. The main drawback is that users are less incentivized to put effort into answering questions when that effort ends up being fed into SO's LLM.

I am wondering if it makes sense to start blocking known agents that feed LLMs. I am thinking specifically about ChatGPT-User. This seems to be very simple to do, as shown in this article: disallow the ChatGPT-User user agent in the site's robots.txt.
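To make the idea concrete, here is a minimal sketch in Python that generates such a robots.txt and checks it with the standard library's `urllib.robotparser`. The helper names `build_robots_txt` and `is_allowed` and the URL `https://example.org/posts/1` are placeholders for illustration, not anything the site actually uses, and the check only shows what a well-behaved crawler would conclude from the file.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents):
    """Return robots.txt text that disallows each listed agent and allows all others.

    `blocked_agents` is a hypothetical list of user-agent tokens to refuse.
    """
    lines = []
    for agent in blocked_agents:
        # One group per blocked agent, separated by a blank line.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # Catch-all group: every other crawler may fetch everything.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, agent, url):
    """Ask the standard-library parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ROBOTS_TXT = build_robots_txt(["ChatGPT-User"])
PAGE = "https://example.org/posts/1"  # placeholder URL, not a real page

if __name__ == "__main__":
    print(ROBOTS_TXT)
    print("ChatGPT-User allowed:", is_allowed(ROBOTS_TXT, "ChatGPT-User", PAGE))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", PAGE))
```

More agents can be blocked by adding their tokens to the list passed to `build_robots_txt`. The file is only a request: it has an effect only on crawlers that choose to honour it.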

I know the community is small now, but it is growing, especially due to SO's decline, and one day it might become interesting for ChatGPT and others.


1 comment thread

Official link (2 comments)

2 answers


+4
−1

robots.txt amounts to politely asking people to please not crawl you, so I wouldn't expect it to do much.

At the same time, you might as well ask politely, to avoid giving the impression that you do want them to crawl you. Letting Google crawl you is fine because Google actually sends traffic back. But AFAIK ChatGPT doesn't send back much traffic. Their LLM even avoids prompts about specific websites.

I wish the terms of the site could be changed, to make it specifically forbidden to use this data for model training without permission. I doubt that would do much either, unless you have money set aside to actually sue, but at least it would make OpenAI's lawyers a bit nervous.


1 comment thread

Model training vs. Creative Commons (5 comments)

+7
−1

A few thoughts from my side (as an ML researcher, without experience in LLMs):

  1. I am not sure if it is really useful to block ChatGPT specifically.

    ChatGPT is only one of many LLMs out there. Blocking only ChatGPT will probably not prevent the data on Codidact from being fed into other models. OpenAI (the creator/owner of the ChatGPT model) is notoriously opaque about the data they use for training their models. Therefore, I wouldn't necessarily count on the robots.txt being respected. I would also rather have my content processed by other companies, but there are enough other (maybe even worse) companies that could still use the data.

  2. I doubt it is feasible to block this site from appearing in any dataset used for training AI-powered tools.

    A more sensible approach would be to block all LLMs and only allow those data collection efforts that respect licensing and copyright. This would be the ideal scenario, but I am afraid that this is just not going to work any time soon. On one side, every data collection effort can make its own rules for being excluded, requiring extensive effort to create/maintain robots.txt files. On the other side, there are enough examples of licenses and/or copyright issues being ignored when collecting data. It also seems that companies (like OpenAI) can just get away with things by not disclosing any details about what they are doing. Therefore, even if technically possible, it feels like it might be a measure for nothing.

  3. I fear that sites that actively try to ban these tools might "lose" in the end.

    People are already using AI-powered tools to assist them with coding tasks. This might eventually lead to a paradigm shift where people just rely on these tools instead of traditional search. By preventing the content of this website from being used for training LLMs, it might be that people will be unable to find this resource of information. Instead, the models will provide information from other sources (and maybe even copycat sites that steal content from this site).

    Currently, there is probably not enough content on the site for it to be crucial for training LLMs. If AI-powered tools take over from traditional search sooner rather than later, that may well remain the case. Until there is enough content, I do not think the site would benefit from blocking LLMs from training on the data that is generated here.

  4. Instead of fighting the idea of LLMs being trained on the content on this website, it might be wiser to lean into it.

    By providing the data via an API or even as ready-to-use datasets (as a collection of question-answer pairs with easy-to-parse meta-information, or in any form that might help solve some sort of moderation task), we could actually take control of how the data on this site is being used to train AI-powered tools. This way, the site would not become irrelevant under foreseeable future changes. Also, by having people sign an agreement before getting access to the data/API, there would be some leverage against possible abuse. Moreover, by providing data to the community, the community might provide useful tools in return.
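As a rough illustration of what "question-answer pairs with easy-to-parse meta-information" could look like, here is a small Python sketch that serialises pairs as JSON Lines (one JSON object per line). Everything in it is an assumption made for the example: the `QAPair` record, its field names, the sample values, and the licence string are hypothetical and do not describe any existing Codidact export or API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class QAPair:
    """Hypothetical record layout for one question-answer pair."""
    question_id: int
    title: str
    question: str
    answer: str
    answer_score: int
    tags: list
    license: str


def to_jsonl(pairs):
    """Serialise pairs as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


def from_jsonl(text):
    """Parse JSON Lines back into QAPair records, skipping blank lines."""
    return [QAPair(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Made-up sample data, purely for illustration.
SAMPLE = [
    QAPair(
        question_id=1,
        title="Example question title",
        question="Example question body.",
        answer="Example answer body.",
        answer_score=3,
        tags=["example-tag"],
        license="CC BY-SA 4.0",  # assumed licence string, a placeholder
    )
]

if __name__ == "__main__":
    dump = to_jsonl(SAMPLE)
    print(dump)
    print(from_jsonl(dump) == SAMPLE)
```

Carrying the licence in every record is one way to keep the terms attached to the data wherever it ends up, which fits the idea of having some leverage against abuse.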


0 comment threads
