Welcome to Software Development on Codidact!

Review Suggested Edit

A few thoughts from my side (as an ML researcher, **without** experience in LLMs):
1. I am not sure if it is really useful to block ChatGPT specifically.
ChatGPT is only one of many LLMs out there.
Blocking only ChatGPT will probably not prevent the data on Codidact from being fed into other models.
OpenAI (the creator/owner of the ChatGPT model) is notoriously opaque about the data they use for training their models.
Therefore, I wouldn't necessarily count on the `robots.txt` being respected.
I, too, would rather have my content processed by companies other than OpenAI, but plenty of other (maybe even worse) companies could still use the data.
2. I doubt it is feasible to block this site from appearing in any dataset used for training AI-powered tools.
A more sensible approach would be to block all LLMs and only allow those data collection efforts that respect licensing and copyright.
This would be the ideal scenario, but I am afraid that this is just not going to work any time soon.
On the one hand, every data collection effort can define its own opt-out rules, so creating and maintaining `robots.txt` files takes extensive effort.
On the other hand, there are plenty of examples of licenses and/or copyright being ignored when collecting data.
It also seems that companies (like OpenAI) can just get away with things by not disclosing any details about what they are doing.
Therefore, even if it is technically possible, it feels like the measure might achieve nothing.
3. I fear that sites that actively try to ban these tools might "lose" in the end.
People are already using AI-powered tools to assist them with coding tasks.
This might eventually lead to a paradigm shift where people just rely on these tools instead of traditional search.
If the content of this website is kept out of LLM training data, people relying on those tools might be unable to find this site as a source of information.
Instead, the models will provide information from other sources (and maybe even copycat sites that steal content from this site).
Currently, there is probably not enough content on the site for it to be crucial for training LLMs.
If AI-powered tools replace traditional search sooner rather than later, it might stay that way.
Until there is enough content, I do not think the site would benefit from blocking LLMs from training on the data that is generated here.
4. Instead of fighting the idea of LLMs being trained on the content on this website, it might be wiser to lean into it.
By providing the data via an API or even as ready-to-use dataset(s) (as a collection of question-answer pairs with easy-to-parse meta-information or in any form that might help solve some sort of moderation task), we could actually take control of how the data on this site is being used to train AI-powered tools.
This way, the site would not become irrelevant as these foreseeable changes unfold.
Also, by having people sign an agreement before getting access to the data/API, there would be some leverage against possible abuse.
Moreover, by providing data to the community, the community might provide useful tools in return.
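
To make points 1 and 2 concrete, here is a minimal sketch of what a per-crawler `robots.txt` rule does and does not cover. The file contents, the `is_allowed` helper, the example URL, and the bot names are assumptions chosen for illustration (`GPTBot` stands in for whatever token a given crawler announces). The check uses Python's standard `urllib.robotparser`, and it only models a crawler that chooses to honor the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one crawler by name, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a well-behaved crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.org/posts/1"  # placeholder URL
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Only the crawler named in the file is excluded; any other bot falls through to the wildcard rule, which is why blocking a single tool requires one rule per crawler and still depends on each crawler respecting the file.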
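
The ready-to-use dataset from point 4 could look roughly like the sketch below: one JSON object per line, each holding a question-answer pair plus meta-information. The `QAPair` record, its field names, the sample post, and the license string are all hypothetical; nothing here reflects an existing Codidact export or API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class QAPair:
    """Hypothetical record layout for one question-answer pair."""
    question_id: int
    question: str
    answer: str
    score: int
    tags: list[str]
    license: str  # carried per record so downstream users see the terms


def to_jsonl(pairs: Iterable[QAPair]) -> str:
    """Serialize records as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


if __name__ == "__main__":
    sample = [
        QAPair(
            question_id=1,
            question="How do I reverse a list in Python?",
            answer="Use reversed(xs) or xs[::-1].",
            score=3,
            tags=["python", "list"],
            license="CC BY-SA 4.0",  # placeholder value
        )
    ]
    print(to_jsonl(sample))
```

Keeping the license and attribution fields inside every record is one way to make the usage terms travel with the data, which fits the idea of requiring an agreement before granting access.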
