FedSearch - Federated network search engine

Cory Doctorow's linkblog · @pluralistic

46708 followers · 44430 posts · Server mamot.fr

Many of the biggest "open AI" companies are totally opaque when it comes to training data. Google and OpenAI won't even say how many pieces of data went into their models' training - let alone which data they used.

Other "open AI" companies use publicly available datasets like #ThePile and #CommonCrawl. But you can't replicate their models by shoveling these datasets into an algorithm. Each one has to be groomed - labeled, sorted, de-duplicated, and otherwise filtered.

28/

#thepile #commoncrawl

Last updated 2 years ago

Dave Mackey · @davidshq

874 followers · 1424 posts · Server hachyderm.io

when #AWS is rate limiting the #CommonCrawl dataset and you are trying to use it at a #hackathon. 😭

#aws #commoncrawl #hackathon

Last updated 3 years ago

Christian Pietsch 🍑 · @chpietsch

3659 followers · 12076 posts · Server digitalcourage.social

@jrp @sl007 Since #CommonCrawl is #OpenData, I would not block its crawler.

From my own experience, I can say that you can belong to both the #OpenEverything and the #Privacy movements at the same time.

#privacy #openeverything #opendata #commoncrawl

Last updated 3 years ago

Jay · @jsit

909 followers · 3185 posts · Server social.coop

@ShaulaEvans I got curious about this and just learned you can block the user agent “CCBot” in robots.txt to block #CommonCrawl, the crawler whose corpus (as far as I can tell) is used by #OpenAI and Google #Bard.
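A minimal sketch of what such a rule might look like and how to check it locally, using Python's standard `urllib.robotparser`. The post only names the user agent "CCBot"; the robots.txt body, the `is_allowed` helper, and the `example.com` URL below are illustrative assumptions. Note that robots.txt is advisory: it only works for crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: disallow everything for the "CCBot" user agent.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the rules from a
    string instead of fetching them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain.
    print(is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/page"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page"))
```

With only this rule in place, agents other than CCBot are unaffected, since no wildcard (`*`) entry is defined.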

#commoncrawl #openai #bard

Last updated 3 years ago

Dave Mackey · @davidshq

685 followers · 970 posts · Server hachyderm.io

GitHub - davidshq/awesome-search-engines: You know, an awesome list of search engines.

The latest update to awesome-search-engines is here:
https://github.com/davidshq/awesome-search-engines

Biggest news is I've added a page for #BuildingSearchEngines - it's very partial at the moment but includes sections on #SearchEngines (open source), #WebCrawlers, and #CommonCrawl.

Know of other web-scale search engines, crawlers, etc. I should be aware of?

#buildingsearchengines #searchengines #WebCrawlers #commoncrawl

Last updated 3 years ago

Ignis the Phone (Inanimate) · @ignis

266 followers · 5508 posts · Server poketopia.city

Current personal principles update:

1. All public data should be libre.
2. Fediverse posts seem to be public.

Obviously, I can't enforce my principles on others, only myself. Therefore:

1. My toots are now CC-BY-SA 4.0.
2. I no longer opt out of indexing.
3. I'll resume my use of search engines. Unfortunately, they're all closed-source. #CommonCrawl?
4. I'll now use AI tools (e.g. art and code generation), as long as they're libre software.

I may change my opinions again.

#commoncrawl

Last updated 3 years ago