FedSearch - Federated network search engine

PSA: If you're running @writefreely, make sure your server is set up to serve a robots.txt so that you can block bots you don't want to gobble up the contents of your website (looking at you, #ChatGPT).

Something like

location /robots.txt {
    alias /complete/path/to/your/robots.txt;
}

in your #nginx configuration.

A wordier version of this #ServiceToot can be found here: https://blog.tinycities.net/dirkhaun/robots-txt-chatgpt-and-writefreely 🙈
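Once the file is being served, it can help to confirm that its rules say what you intend. The following is a minimal sketch, not the post author's method: it runs Python's standard `urllib.robotparser` over a hypothetical robots.txt that names `GPTBot` and `ChatGPT-User`, the two user agent tokens OpenAI documents. The file contents, the helper name `is_allowed`, and the example.com URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption, not taken from the post).
# GPTBot and ChatGPT-User are the tokens OpenAI documents for its agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ChatGPT-User", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/some-post"))
```

Python's parser uses simple prefix matching, so its verdict may differ in edge cases from what a particular crawler does. Treat it as a sanity check on the file rather than a guarantee of crawler behavior.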

#WriteFreely #RobotsTXT

Last updated 2 years ago

bananabob · @bananabob

75 followers · 1781 posts · Server mastodon.nz

Ars Technica - Sites scramble to block ChatGPT web crawler after instructions emerge

#ArsTechnica #ChatGPT #GPTBot #RobotsTXT

https://arstechnica.com/information-technology/2023/08/openai-details-how-to-keep-chatgpt-from-gobbling-up-website-data/

Last updated 2 years ago

Ross A. Baker · @ross

838 followers · 964 posts · Server social.rossabaker.com

Setting up /robots.txt, not because it helps, but because being crabby in compliance with an RFC is satisfying.

Who has some unsavory ones besides ChatGPT and Twitterbot?

https://rossabaker.com/configs/website/webcrawlers/

#RFC9309 #RobotsTxt

Last updated 2 years ago

Mr.Trunk · @mrtrunk

6 followers · 11961 posts · Server dromedary.seedoubleyou.me

Gizmodo: Google Says It Will Scrape Publishers’ Data for AI Unless They Force It Not To https://gizmodo.com/google-bard-ai-scrape-websites-data-australia-opt-out-1850720633 #applicationsofartificialintelligence #generativepretrainedtransformer #computationalneuroscience #artificialintelligence #largelanguagemodels #technologyinternet #thenewyorktimes #robotstxt #deepfake #gizmodo #chatbot #chatgpt #google #openai #palm2 #bard

Last updated 2 years ago

Benjamin Carr, Ph.D. 👨🏻‍💻🧬 · @BenjaminHCCarr

977 followers · 2499 posts · Server hachyderm.io

Now you can block #OpenAI’s #webcrawler
OpenAI now lets you block its web crawler from scraping your site to help train #GPT models. OpenAI said website operators can specifically disallow its #GPTBot crawler on their site's #Robots.txt file or block its IP address.
https://www.theverge.com/2023/8/7/23823046/openai-data-scrape-block-ai #privacy #security #RobotsTxT

Last updated 2 years ago

Tomodachi94 · @tomodachi94

17 followers · 274 posts · Server floss.social

@Seirdy updated on my blog to include both of the user agents.

🙈 I didn't actually know you could do that on GitHub Pages, but it turns out... you can!

#robotstxt #openai #chatgpt

Last updated 2 years ago

Laravista · @laravista

20 followers · 257 posts · Server mastodon.uno

Search Engine Land - Robots.txt is not the answer: Proposing a new meta tag for LLM/AI

#RobotsTxt is not the answer: Proposing a new meta tag for LLM/AI
https://searchengineland.com/robots-txt-new-meta-tag-llm-ai-429510

#robotstxt

Last updated 2 years ago

Sebastian Nagel · @sebnagel

4 followers · 4 posts · Server fosstodon.org

Released #CrawlerCommons 1.4: Java 11, #RobotsTxt compliant with #rfc9309 - https://github.com/crawler-commons/crawler-commons#18th-july-2023----crawler-commons-14-released

Last updated 2 years ago

PCH🎙️ :wp_fedi: · @phillycodehound

7452 followers · 4780 posts · Server masto.ai

@rustybrick do you think #Google will just ignore Robots.txt? I mean they'd love to be able to train on everything.

Though I would love more controls on stopping AI from scraping without blocking my sites from Search

#seo #AI #scraping #RobotsTXT

Last updated 2 years ago

Angus McIntyre · @angusm

578 followers · 585 posts · Server mastodon.social

What the actual fuck?

Will someone kindly explain to "global cybersecurity leader" Palo Alto Networks that the User-Agent header is a place to put the name of your user agent? You send the name of your user agent, and you obey `robots.txt` (which they don't, of course). You DO NOT write a short essay ending with a request for people to mail you to opt-out. It is 2023 and the right way to do this was established DECADES ago.
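As a rough illustration of the convention described here, the sketch below builds a request that identifies itself with a short User-Agent string and is only created if robots.txt permits the fetch. The crawler name `ExampleBot`, its contact URL, and the helper `build_request` are hypothetical. Only Python standard library calls are used, and nothing is sent over the network.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity: a product token plus a contact URL.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot)"


def build_request(url, robots_txt):
    """Return a Request carrying our User-Agent, or None if robots.txt forbids the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(USER_AGENT, url):
        return None
    return Request(url, headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    rules = "User-agent: ExampleBot\nDisallow: /private/\n"
    print(build_request("https://example.org/private/page", rules))
    print(build_request("https://example.org/public", rules).header_items())
```

A real crawler would first fetch robots.txt from the target host. Here the rules are passed in as text so the example stays self-contained.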

#paloaltonetworks #clownshoes #robotstxt #webcrawlers #www #web

Last updated 3 years ago

Angus McIntyre · @angusm

511 followers · 483 posts · Server mastodon.social

Unsurprisingly, webmeup's assurance that "you will not see recurring requests from the BLEXBot crawler to the same page" turns out to be ... not true?

At least according to my log files, which show the same page getting hit at 5 day intervals as part of their process of fetching every single page on my site over and over to satisfy some vague marketing need.

So I think BLEXBot can join AHRefsBot and SEMRushBot in my robots.txt. And nothing of value was lost.
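For anyone keeping a growing blocklist like this, one option is to generate the file from a list of agent names. The sketch below is an assumption about how that might look, not the poster's setup. The helper names `build_robots_txt` and `blocked` are made up, and the tokens use the spellings the crawlers themselves publish (`AhrefsBot`, `SemrushBot`).

```python
from urllib.robotparser import RobotFileParser

# Crawlers named in the post; extend the list as needed.
BLOCKED_AGENTS = ["BLEXBot", "AhrefsBot", "SemrushBot"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text that disallows each listed agent site-wide."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    # Catch-all group: an empty Disallow value places no restriction.
    groups.append("User-agent: *\nDisallow:\n")
    # Each group ends with a newline, so joining adds a blank line between groups.
    return "\n".join(groups)


def blocked(robots_txt, agent, url="https://example.com/"):
    """Check the generated rules with Python's own parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    text = build_robots_txt(BLOCKED_AGENTS)
    print(text)
    print({agent: blocked(text, agent) for agent in BLOCKED_AGENTS + ["Googlebot"]})
```

As the post itself shows, a rule only helps against crawlers that honor robots.txt. Repeat offenders may still need blocking at the server level.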

#crawlers #webspiders #robotstxt

Last updated 3 years ago

toot box · @cyborg

49 followers · 444 posts · Server gamers.rip

I'm having this wild experience where I recall being able to put a sort of #NOARCHIVE and/or #NOCACHE command in robots.txt, not just meta tags.

Was that deprecated when I wasn't looking, or should I just blame the #MandelaEffect? :eyes_squint:

#WebsiteDesign #RobotsTXT

Last updated 3 years ago

Éamonn · @eob

225 followers · 76 posts · Server social.coop

#Media companies and #journalists, you can partially boycott #Twitter by adding the following to the robots.txt file on your website:

User-agent: Twitterbot
Disallow: *

This prevents Twitter using your images in links to your articles.

How to add a #RobotsTxt:

https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt

Last updated 3 years ago

Mike Blazer 🇺🇦 · @MikeBlazer

531 followers · 508 posts · Server mastodon.social

I was checking video and image CDN hosts of several sites and found out that their robots.txt files are 404.

As per @johnmu:
"If the robots.txt file is unreachable, we'll see that as blocking crawling."

https://twitter.com/JohnMu/status/1435688745681014798

My question to John: if example.com/robots.txt returns "200 OK" but cdn84.video-image-12.com/robots.txt has "404 not found", would images and videos of this website have problems ranking on Google Images and Google Videos?
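One way to frame the question: RFC 9309 distinguishes a robots.txt that is "unavailable" (4xx, such as a 404) from one that is "unreachable" (5xx or network failure), and the two lead to opposite defaults. The sketch below encodes that reading of section 2.3.1 of the RFC. The names `Policy` and `policy_for_status` are hypothetical, individual crawlers (Google included) may deviate in the details, and this says nothing about how Google Images or Google Videos rank content.

```python
from enum import Enum


class Policy(Enum):
    USE_RULES = "parse the file and follow its rules"
    ALLOW_ALL = "no rules available; crawling is unrestricted"
    DISALLOW_ALL = "assume everything is disallowed"


def policy_for_status(status: int) -> Policy:
    """Map the final HTTP status of a robots.txt fetch to a crawl policy.

    Redirects are assumed to have been followed already, so 3xx codes
    are not handled here.
    """
    if 200 <= status < 300:
        return Policy.USE_RULES
    if 400 <= status < 500:
        # "Unavailable" in RFC 9309 terms: the crawler MAY access any resource.
        return Policy.ALLOW_ALL
    if 500 <= status < 600:
        # "Unreachable" in RFC 9309 terms: the crawler MUST assume complete disallow.
        return Policy.DISALLOW_ALL
    raise ValueError(f"unhandled status code: {status}")


if __name__ == "__main__":
    for code in (200, 404, 503):
        print(code, policy_for_status(code).value)
```

Under this reading, a CDN host that returns 404 for /robots.txt is treated as having no restrictions. The quoted "unreachable" case would correspond to the 5xx branch.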

#seo #cdn #google #404 #robotstxt

Last updated 3 years ago
