Should we disallow ChatGPT-User crawler (and others) from scraping Software Codidact?
Stack Overflow has recently announced OverflowAI, and I think this video summarises it pretty well. The main drawback is that users have less incentive to put effort into answering questions when that effort is simply fed into SO's LLM.
I am wondering if it makes sense to start blocking known agents that feed LLMs. I am thinking specifically about ChatGPT-User. This seems to be very simple to do, as shown in this article: disallow the ChatGPT-User user agent in the site's robots.txt.
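For reference, here is a minimal sketch of what such a rule could look like, checked with Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URL, and the helper name `is_allowed` are assumptions for illustration, not the site's actual configuration. `GPTBot` is included on the assumption that one would also want to list the separate user agent OpenAI documents for its training crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block two OpenAI user agents, allow everyone else.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/posts/1"  # placeholder URL
    for agent in ("ChatGPT-User", "GPTBot", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

This only shows how a crawler that chooses to honor robots.txt would interpret the file; it does not enforce anything on the server side.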
I know the community is small now, but it is growing, especially due to SO's decline, and one day it might become interesting for ChatGPT and others.
2 answers
A few thoughts from my side (as an ML researcher, without experience in LLMs):
- I am not sure if it is really useful to block ChatGPT specifically.
  ChatGPT is only one of many LLMs out there. Blocking only ChatGPT will probably not prevent the data on Codidact from being fed into other models. OpenAI (the creator/owner of the ChatGPT model) is notoriously opaque about the data they use for training their models. Therefore, I wouldn't necessarily count on the robots.txt being respected. I would also prefer that my content be processed by other companies, but there are plenty of other (maybe even worse) companies that could still use the data.
- I doubt it is feasible to block this site from appearing in any dataset used for training AI-powered tools.
  A more sensible approach would be to block all LLMs and only allow those data collection efforts that respect licensing and copyright. This would be the ideal scenario, but I am afraid that this is just not going to work any time soon. On the one hand, every data collection effort can make its own rules for being excluded, so creating and maintaining robots.txt files would take extensive effort. On the other hand, there are plenty of examples of licenses and copyright being ignored when collecting data. It also seems that companies (like OpenAI) can just get away with things by not disclosing any details about what they are doing. Therefore, even if it is technically possible, it feels like it might be a futile measure.
- I fear that sites that actively try to ban these tools might "lose" in the end.
  People are already using AI-powered tools to assist them with coding tasks. This might eventually lead to a paradigm shift where people just rely on these tools instead of traditional search. If the content of this website is kept out of LLM training data, people might be unable to find this resource through those tools. Instead, the models will provide information from other sources (and maybe even copycat sites that steal content from this site).
  Currently, there is probably not enough content on the site for it to be crucial for training LLMs. If AI-powered tools take over traditional search sooner rather than later, it might stay that way. Until there is enough content, I do not think the site would benefit from blocking LLMs from training on the data that is generated here.
- Instead of fighting the idea of LLMs being trained on the content on this website, it might be wiser to lean into it.
  By providing the data via an API or even as ready-to-use datasets (as a collection of question-answer pairs with easy-to-parse meta-information, or in any form that might help solve some sort of moderation task), we could actually take control of how the data on this site is being used to train AI-powered tools. This way, the site would not be made irrelevant by foreseeable future changes. Also, by having people sign an agreement before getting access to the data/API, there would be some leverage against possible abuse. Moreover, by providing data to the community, the community might provide useful tools in return.
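To make the dataset idea in the last point a bit more concrete, here is a rough sketch of question-answer pairs exported as JSON Lines. Everything in it is hypothetical: the `QAPair` fields, the `to_jsonl` helper, and the sample record (including its license string) are placeholders, not an existing Codidact API or export format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class QAPair:
    """One question paired with one answer (all field names are hypothetical)."""
    question_id: int
    title: str
    question: str
    answer: str
    answer_score: int
    license: str


def to_jsonl(pairs: Iterable[QAPair]) -> str:
    """Serialize pairs as JSON Lines: one JSON object per line."""
    return "\n".join(
        json.dumps(asdict(pair), ensure_ascii=False) for pair in pairs
    )


# Placeholder record, not real site data.
sample = [
    QAPair(
        question_id=1,
        title="Example question",
        question="How do I do X?",
        answer="One way to do X is ...",
        answer_score=3,
        license="CC BY-SA 4.0",
    )
]

if __name__ == "__main__":
    print(to_jsonl(sample))
```

Carrying the license in every record is one way to keep the terms attached to the data wherever it ends up, which fits the idea of having some leverage against abuse.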
0 comment threads
robots.txt amounts to politely asking people to please not crawl you, so I wouldn't expect it to do much.
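A small sketch of why this is only a request: the check happens in the crawler's own code, so a crawler that skips it fetches the page anyway. The names `polite_fetch` and `fake_fetch`, the robots.txt content, and the URL are made up for illustration, and `fake_fetch` stands in for a real HTTP request.

```python
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks ChatGPT-User to stay away.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /
"""


def polite_fetch(
    url: str,
    user_agent: str,
    robots_txt: str,
    fetch: Callable[[str], str],
) -> Optional[str]:
    """Fetch url only if robots_txt allows user_agent; otherwise return None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        return None  # the crawler voluntarily backs off
    return fetch(url)


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request, so the sketch needs no network access."""
    return f"<html>contents of {url}</html>"


url = "https://example.com/posts/1"  # placeholder URL

# A well-behaved crawler consults robots.txt first and gets nothing.
polite_result = polite_fetch(url, "ChatGPT-User", ROBOTS_TXT, fake_fetch)

# A crawler that ignores robots.txt simply calls fetch directly.
rude_result = fake_fetch(url)
```

Nothing on the server side distinguishes the two cases unless the site also blocks requests by other means.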
At the same time, you might as well ask politely, to avoid giving the impression that you do want them to crawl you. Letting Google crawl you is fine because Google actually sends traffic back. But AFAIK ChatGPT doesn't send back much traffic. Their LLM even avoids prompts about specific websites.
I wish the terms of the site could be changed to explicitly forbid using this data for model training without permission. I doubt that would do much either, unless you have money set aside to actually sue, but at least it would make OpenAI's lawyers a bit nervous.