Mayobrot · @Mayobrot
104 followers · 913 posts · Server zirk.us

What stuff should I put in my robots.txt? Feel free to suggest overly restrictive entries.

#robotstxt

Last updated 1 year ago

Markus Feilner :verified: · @mfeilner
696 followers · 4778 posts · Server mastodon.cloud

The question is:

Is scraping for AI systems of ALT-right US corporations "fair use"?

I have a strong opinion. But I am not with the IP corps either.


New York Times, CNN and Australia’s ABC block OpenAI’s GPTBot web crawler from accessing content | Artificial intelligence (AI) | The Guardian theguardian.com/technology/202

#chatgpt #robotstxt

Last updated 1 year ago

🐙 Compañero Allende · @morpheo
383 followers · 17853 posts · Server kolektiva.social

I have a question. It's really dumb, but here goes: Why would any bad actor (e.g. Shitter) care to respect robots.txt?

#robotstxt #RhetoricalQuestions

Last updated 1 year ago

Dirk Haun · @dirkhaun
102 followers · 640 posts · Server tinycities.net

PSA: If you're running @writefreely, make sure your server is set up to serve a robots.txt so that you can block bots you don't want to gobble up the contents of your website (looking at you, ChatGPT).

Something like

location /robots.txt {
    alias /complete/path/to/your/robots.txt;
}

in your nginx configuration.

A wordier version of this can be found here: blog.tinycities.net/dirkhaun/r 🙈

#chatgpt #nginx #servicetoot #writefreely #robotstxt

Last updated 1 year ago

Ross A. Baker · @ross
838 followers · 964 posts · Server social.rossabaker.com

Setting up /robots.txt, not because it helps, but because being crabby in compliance with an RFC is satisfying.

Who has some unsavory ones besides ChatGPT and Twitterbot?

rossabaker.com/configs/website

#rfc9309 #robotstxt

Last updated 1 year ago

Mr.Trunk · @mrtrunk
6 followers · 11961 posts · Server dromedary.seedoubleyou.me

Now you can block OpenAI's web crawler
OpenAI now lets you block its web crawler from scraping your site to help train models. OpenAI said website operators can specifically disallow its crawler in their site's robots.txt file or block its IP address.
theverge.com/2023/8/7/23823046
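
A minimal sketch of what such a block looks like in practice, using Python's standard `urllib.robotparser` to check the rules offline. The `GPTBot` token is the crawler name from the Guardian headline quoted earlier in this feed; the robots.txt content, the `is_allowed` helper, `SomeOtherBot`, and the example.com URL are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: shut out GPTBot entirely, leave all other crawlers alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


gptbot_ok = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article")
print(gptbot_ok, other_ok)
```

This only tells you what a compliant crawler would do with these rules; it does nothing to stop a crawler that ignores robots.txt.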

#openai #webcrawler #gpt #gptbot #robots #privacy #security #robotstxt

Last updated 1 year ago

Tomodachi94 · @tomodachi94
17 followers · 274 posts · Server floss.social

@Seirdy updated robots.txt on my blog to include both of the user agents.

🙈 I didn't actually know you could do that on GitHub Pages, but it turns out... you can!

#robotstxt #openai #chatgpt

Last updated 1 year ago

Laravista · @laravista
20 followers · 257 posts · Server mastodon.uno

Robots.txt is not the answer: Proposing a new meta tag for LLM/AI
searchengineland.com/robots-tx

#robotstxt

Last updated 1 year ago

PCH🎙️ :wp_fedi: · @phillycodehound
7452 followers · 4780 posts · Server masto.ai

@rustybrick do you think Google will just ignore robots.txt? I mean they'd love to be able to train on everything.

Though I would love more controls on stopping AI from scraping without blocking my sites from Search

#google #seo #ai #scraping #robotstxt

Last updated 1 year ago

Angus McIntyre · @angusm
578 followers · 585 posts · Server mastodon.social

What the actual fuck?

Will someone kindly explain to "global cybersecurity leader" Palo Alto Networks that the User-Agent header is a place to put the name of your user agent? You send the name of your user agent, and you obey `robots.txt` (which they don't, of course). You DO NOT write a short essay ending with a request for people to mail you to opt-out. It is 2023 and the right way to do this was established DECADES ago.

#paloaltonetworks #clownshoes #robotstxt #webcrawlers #www #web

Last updated 2 years ago

Angus McIntyre · @angusm
511 followers · 483 posts · Server mastodon.social

Unsurprisingly, webmeup's assurance that "you will not see recurring requests from the BLEXBot crawler to the same page" turns out to be ... not true?

At least according to my log files, which show the same page getting hit at 5-day intervals as part of their process of fetching every single page on my site over and over to satisfy some vague marketing need.

So I think BLEXBot can join AHRefsBot and SEMRushBot in my robots.txt. And nothing of value was lost.
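
A small sketch of generating those entries rather than typing them by hand. The `build_robots_txt` helper is hypothetical, and the exact user-agent spellings below are an assumption; check each crawler's own documentation before relying on them.

```python
def build_robots_txt(blocked_agents: list[str]) -> str:
    """Return robots.txt text with one fully disallowed group per agent."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    # A blank line between groups keeps the file readable.
    return "\n".join(groups)


# Assumed spellings of the three crawler tokens named in the post.
robots_txt = build_robots_txt(["BLEXBot", "AhrefsBot", "SemrushBot"])
print(robots_txt)
```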

#crawlers #webspiders #robotstxt

Last updated 2 years ago

toot box · @cyborg
49 followers · 444 posts · Server gamers.rip

I'm having this wild experience where I recall being able to put a sort of nocache and/or noarchive command in robots.txt, not just meta tags.

Was that deprecated when I wasn't looking, or should I just blame the Mandela effect? :eyes_squint:

#robotstxt #websitedesign #mandelaeffect #nocache #noarchive

Last updated 2 years ago

Éamonn · @eob
225 followers · 76 posts · Server social.coop

Media companies and journalists, you can partially boycott Twitter by adding the following to the robots.txt file on your website:

User-agent: Twitterbot
Disallow: *

This prevents Twitter using your images in links to your articles.

How to add a robots.txt:

developers.google.com/search/d

#media #journalists #twitter #robotstxt

Last updated 2 years ago

Mike Blazer 🇺🇦 · @MikeBlazer
531 followers · 508 posts · Server mastodon.social

I was checking video and image CDN hosts of several sites and found out that their robots.txt files are 404.

As per @johnmu:
"If the robots.txt file is unreachable, we'll see that as blocking crawling."

twitter.com/JohnMu/status/1435

My question to John: if example.com/robots.txt returns "200 OK" but cdn84.video-image-12.com/robots.txt has "404 not found", would images and videos of this website have problems ranking on Google Images and Google Videos?
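
For reference, a sketch of how RFC 9309 (tagged elsewhere in this feed) treats the two cases, as I read it: a 4xx response makes robots.txt "unavailable" and a crawler may fetch anything, while a 5xx response makes it "unreachable" and a crawler must assume everything is disallowed. The `crawl_policy` function and its return labels are made up for illustration, and this says nothing about how Google actually ranks images or videos.

```python
def crawl_policy(status_code: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl policy,
    following one reading of RFC 9309 (labels are illustrative)."""
    if 200 <= status_code < 300:
        return "parse"          # use the rules in the file
    if 400 <= status_code < 500:
        return "allow-all"      # "unavailable": crawler may access any resource
    if 500 <= status_code < 600:
        return "disallow-all"   # "unreachable": assume complete disallow
    return "other"              # redirects etc. are not handled in this sketch


print(crawl_policy(404), crawl_policy(503))
```

Under that reading, a 404 on the CDN host is not the same thing as the "unreachable" case in the quote.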

#404

#seo #cdn #google #robotstxt

Last updated 2 years ago
