Welcome to Software Development on Codidact!

Should we disallow ChatGPT-User crawler (and others) from scraping Software Codidact?

+10

−2

Stack Overflow has recently announced OverflowAI, and I think this video summarises it pretty well. The main drawback is that users are less incentivized to put effort into answering questions when that effort is fed into SO's LLM.

I am wondering if it makes sense to start blocking known agents that feed LLMs. I am thinking specifically about ChatGPT-User. This seems to be very simple to do, as shown in this article: disallow the ChatGPT-User user agent in the site's robots.txt.
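To make the idea concrete, here is a minimal sketch of what such a rule could look like and how a well-behaved crawler would interpret it. The robots.txt content, the `is_allowed` helper, and the example URL are assumptions for illustration, not the site's actual configuration. It uses Python's standard `urllib.robotparser` only to check how the rules are read.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not the site's real file):
# block the ChatGPT-User agent everywhere, leave other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honours robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative URL; the path is made up.
    url = "https://software.codidact.com/posts/1"
    for agent in ("ChatGPT-User", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Note that this only models a crawler that chooses to consult robots.txt; nothing in the file enforces the rule.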

I know the community is small now, but it is growing, especially due to SO's decline, and one day it might become interesting for ChatGPT and others.

discussion

posted almost 2 years ago

CC BY-SA 4.0

2y ago

Alexei‭

5082 reputation 115 102 702 499


1 comment thread

Official link (2 comments)

2 answers

−2

A few thoughts from my side (as an ML researcher, without experience in LLMs):

I am not sure if it is really useful to block ChatGPT specifically.

ChatGPT is only one of many LLMs out there. Blocking only ChatGPT will probably not prevent the data on Codidact from being fed into other models. OpenAI (the creator/owner of the ChatGPT model) is notoriously opaque about the data they use for training their models. Therefore, I wouldn't necessarily count on the robots.txt being respected. I would also rather have my content processed by other companies, but there are enough other (maybe even worse) companies that could still use the data.

I doubt it is feasible to block this site from appearing in any dataset used for training AI-powered tools.

A more sensible approach would be to block all LLMs and only allow those data collection efforts that respect licensing and copyright. This would be the ideal scenario, but I am afraid that this is just not going to work any time soon. On the one hand, every data collection effort can make its own rules for being excluded, requiring extensive effort to create/maintain robots.txt files. On the other hand, there are enough examples of licenses and/or copyright issues being ignored when collecting data. It also seems that companies (like OpenAI) can just get away with things by not disclosing any details about what they are doing. Therefore, even if technically possible, it feels like it might be a futile measure.

I fear that sites that actively try to ban these tools might "lose" in the end.

People are already using AI-powered tools to assist them with coding tasks. This might eventually lead to a paradigm shift where people just rely on these tools instead of traditional search. If the content of this website is kept out of LLM training data, people might be unable to find this source of information. Instead, the models will provide information from other sources (and maybe even copycat sites that steal content from this site).

Currently, there is probably not enough content on the site for it to be crucial for training LLMs. If AI-powered tools take over traditional search sooner rather than later, it might just stay that way. Until there is enough content, I do not think the site would benefit from blocking LLMs from training on the data that is generated here.

Instead of fighting the idea of LLMs being trained on the content on this website, it might be wiser to lean into it.

By providing the data via an API or even as ready-to-use datasets (as a collection of question-answer pairs with easy-to-parse meta-information, or in any form that might help solve some sort of moderation task), we could actually take control of how the data on this site is used to train AI-powered tools. This way, the site would not be made irrelevant by foreseeable future changes. Also, by having people sign an agreement before getting access to the data/API, there would be some leverage against possible abuse. Moreover, by providing data to the community, the community might provide useful tools in return.
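As a rough illustration of what "question-answer pairs with easy-to-parse meta-information" could mean in practice, the sketch below serialises such pairs as JSON Lines. The `QAPair` record, its field names, and the `to_jsonl` helper are hypothetical; they do not describe any existing Codidact export or API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class QAPair:
    """One question-answer pair plus attribution metadata (hypothetical schema)."""
    question: str
    answer: str
    answer_author: str
    license: str
    source_url: str


def to_jsonl(pairs: Iterable[QAPair]) -> str:
    """Serialise the pairs as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


if __name__ == "__main__":
    # Made-up sample record; the URL path is a placeholder.
    sample = [
        QAPair(
            question="Should we disallow the ChatGPT-User crawler?",
            answer="robots.txt amounts to politely asking.",
            answer_author="matthewsnyder",
            license="CC BY-SA 4.0",
            source_url="https://software.codidact.com/posts/0",
        )
    ]
    print(to_jsonl(sample))
```

Carrying the license and author on every record is one way to keep the attribution requirements visible to whoever consumes the dataset.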

posted almost 2 years ago

CC BY-SA 4.0

2y ago

mr Tsjolder‭

516 reputation 6 15 60 6


0 comment threads

−1

robots.txt amounts to politely asking people to please not crawl you, so I wouldn't expect it to do much.

At the same time, you might as well ask politely, to avoid giving the impression that you do want them to crawl you. Letting Google crawl you is fine because Google actually sends traffic back. But AFAIK ChatGPT doesn't send back much traffic. Their LLM even avoids prompts about specific websites.

I wish the terms of the site could be changed to explicitly forbid using this data for model training without permission. I doubt that would do much either, unless you have money set aside to actually sue, but at least it would make OpenAI's lawyers a bit nervous.

posted almost 2 years ago

CC BY-SA 4.0

matthewsnyder‭

2285 reputation 52 61 267 93


1 comment thread

Model training vs. Creative Commons (5 comments)
