Post History

75% · +7 −1
Meta: Should we disallow ChatGPT-User crawler (and others) from scraping Software Codidact?

A few thoughts from my side (as an ML researcher, without experience in LLMs): I am not sure if it is really useful to block ChatGPT specifically. ChatGPT is only one of many LLMs out there. ...

posted 1y ago by mr Tsjolder · edited 1y ago by mr Tsjolder

Answer
#3: Post edited by mr Tsjolder · 2023-08-02T17:56:44Z (over 1 year ago)

A few thoughts from my side (as an ML researcher, **without** experience in LLMs):

1. I am not sure if it is really useful to block ChatGPT specifically.

   ChatGPT is only one of many LLMs out there.
   Blocking only ChatGPT will probably not prevent the data on Codidact from being fed into other models.
   OpenAI (the creator/owner of the ChatGPT model) is notoriously opaque about the data they use for training their models.
   Therefore, I wouldn't necessarily count on the `robots.txt` being respected.
   I would also rather not have my content processed by other companies, but there are enough other (maybe even worse) companies that could still use the data.

2. I doubt it is feasible to block this site from appearing in any dataset used for training AI-powered tools.

   A more sensible approach would be to block all LLMs and only allow those data collection efforts that respect licensing and copyright.
   This would be the ideal scenario, but I am afraid that this is just not going to work any time soon.
   On one side, every data collection effort can make its own rules for being excluded, requiring extensive effort to create/maintain `robots.txt` files.
   On the other side, there are enough examples of licenses and/or copyright issues being ignored when collecting data.
   It also seems that companies (like OpenAI) can just get away with things by not disclosing any details about what they are doing.
   Therefore, even if technically possible, it feels like it might be a measure for nothing.

3. I fear that sites that actively try to ban these tools might "lose" in the end.

   People are already using AI-powered tools to assist them with coding tasks.
   This might eventually lead to a paradigm shift where people just rely on these tools instead of traditional search.
   By preventing the content of this website from being used for training LLMs, we might leave people unable to find this resource of information.
   Instead, the models will provide information from other sources (and maybe even copycat sites that steal content from this site).

   Currently, there is probably not enough content on the site for it to be crucial for training LLMs.
   If AI-powered tools take over traditional search sooner rather than later, this might just remain the case.
   Until there is enough content, I do not think the site would benefit from blocking LLMs from training on the data that is generated here.

4. Instead of fighting the idea of LLMs being trained on the content on this website, it might be wiser to lean into it.

   By providing the data via an API, or even as ready-to-use datasets (e.g. a collection of question-answer pairs with easy-to-parse meta-information, or in any form that might help solve some sort of moderation task), we could actually take control of how the data on this site is being used to train AI-powered tools.
   This way, the site would not become irrelevant in the face of foreseeable future changes.
   Also, by having people sign an agreement before getting access to the data/API, there would be some leverage against possible abuse.
   Moreover, by providing data to the community, the community might provide useful tools in return.
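The answer's first two points lean on `robots.txt`. Purely as an illustration (not a deployed configuration), opting the site out of OpenAI's documented crawlers would look roughly like the sketch below; `GPTBot` is the crawler OpenAI describes for gathering training data and `ChatGPT-User` is the agent used for browsing on behalf of users, and, as the answer stresses, honouring the file is entirely up to the crawler.

```
# Illustrative robots.txt sketch only; the blanket "block everything" scope is an assumption.
# GPTBot: OpenAI's training-data crawler. ChatGPT-User: browsing requests made for users.
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
```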
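Likewise, point 4's "ready-to-use dataset" idea can be made concrete with a small sketch. This is only a hypothetical format, assuming nothing about Codidact's actual schema: the field names, tags, license tag, and URL below are placeholders, not an existing export or API.

```python
# Hypothetical question-answer dataset export in JSON Lines format.
# All field names and values are illustrative placeholders.
import json

posts = [
    {
        "question": "Should we disallow ChatGPT-User (and others) from scraping Software Codidact?",
        "answers": ["A few thoughts from my side (as an ML researcher) ..."],
        "tags": ["discussion", "scraping"],  # assumed tags, for illustration
        "license": "CC BY-SA 4.0",           # assumed license tag
        "url": "https://software.codidact.com/...",  # placeholder URL
    },
]

# One JSON object per line keeps the meta-information easy to parse
# and the dump easy to stream, diff, and extend.
with open("codidact_qa.jsonl", "w", encoding="utf-8") as f:
    for post in posts:
        f.write(json.dumps(post, ensure_ascii=False) + "\n")
```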
#2: Post edited by matthewsnyder · 2023-08-02T17:56:10Z (over 1 year ago)
I think it would be easier to discuss individual points if they were numbers rather than bullets.
#1: Initial revision by mr Tsjolder · 2023-07-31T19:09:09Z (over 1 year ago)
A few thoughts from my side (as an ML researcher, **without** experience in LLMs):
 - I am not sure if it is really useful to block ChatGPT specifically.

   ChatGPT is only one of many LLMs out there.
   Blocking only ChatGPT will probably not prevent the data on Codidact from being fed into other models.
   OpenAI (the creator/owner of the ChatGPT model) is notoriously opaque about the data they use for training their models.
   Therefore, I wouldn't necessarily count on the `robots.txt` being respected.
   I would also rather not have my content processed by other companies, but there are enough other (maybe even worse) companies that could still use the data.
 - I doubt it is feasible to block this site from appearing in any dataset used for training AI-powered tools.

   A more sensible approach would be to block all LLMs and only allow those data collection efforts that respect licensing and copyright.
   This would be the ideal scenario, but I am afraid that this is just not going to work any time soon.
   On one side, every data collection effort can make its own rules for being excluded, requiring extensive effort to create/maintain `robots.txt` files.
   On the other side, there are enough examples of licenses and/or copyright issues being ignored when collecting data.
   It also seems that companies (like OpenAI) can just get away with things by not disclosing any details about what they are doing.
   Therefore, even if technically possible, it feels like it might be a measure for nothing.
 - I fear that sites that actively try to ban these tools might "lose" in the end.

   People are already using AI-powered tools to assist them with coding tasks.
   This might eventually lead to a paradigm shift where people just rely on these tools instead of traditional search.
   By preventing the content of this website from being used for training LLMs, we might leave people unable to find this resource of information.
   Instead, the models will provide information from other sources (and maybe even copycat sites that steal content from this site).
   
   Currently, there is probably not enough content on the site for it to be crucial for training LLMs.
   If AI-powered tools take over traditional search sooner rather than later, this might just remain the case.
   Until there is enough content, I do not think the site would benefit from blocking LLMs from training on the data that is generated here.
 - Instead of fighting the idea of LLMs being trained on the content on this website, it might be wiser to lean into it.

   By providing the data via an API, or even as ready-to-use datasets (e.g. a collection of question-answer pairs with easy-to-parse meta-information, or in any form that might help solve some sort of moderation task), we could actually take control of how the data on this site is being used to train AI-powered tools.
   This way, the site would not become irrelevant in the face of foreseeable future changes.
   Also, by having people sign an agreement before getting access to the data/API, there would be some leverage against possible abuse.
   Moreover, by providing data to the community, the community might provide useful tools in return.