Sascha Wolfer · @sascha_wolfer
448 followers · 319 posts · Server fediscience.org

New preprint: We present a new dataset on the German language: DeReKoGram includes uni-, bi-, and trigram frequencies, lemma and POS information for a corpus of around 43 billion tokens.

We evaluate the distribution over the 16 datasets and present a (small) case study on vocabulary growth.

At owid.de/plus/derekogram, we provide Python, R and Stata code that should help you get started with the dataset.

Preprint available at doi.org/10.21203/rs.3.rs-31396.

#linguistics #stata #rstats #python #vocabulary #language #german #DataSet

Last updated 1 year ago

Pratik Patel · @ppatel
943 followers · 13083 posts · Server mstdn.social

The MIT researchers found that machine-learning models trained for autocaptioning with their dataset consistently generated captions that were precise, semantically rich, and described data trends and complex patterns.

Researchers teach an AI to write better chart captions.

A new dataset can help scientists develop automatic systems that generate richer, more descriptive captions for online charts for blind people.

news.mit.edu/2023/researchers-

#generativeAI #a11y #Accessibility #blind #DataSet #AI #MachineLearning

Last updated 1 year ago

Paweł Kleka · @pkleka
6 followers · 22 posts · Server fediscience.org

I am looking for a dataset with raw data from the WAIS test or other intelligence tests. My mentee is writing her master's thesis about mutualism and wants to reanalyse some actual data. Any help with the source?

#rstats #DataSet #OpenScience #opensource

Last updated 1 year ago

Shu Daizi · @SDZ
218 followers · 2093 posts · Server mstdn.social

Google's C4 dataset for training AI/LLM includes Literotica, Smashwords, Pornhub, Wattpad

Inside the secret list of websites that make AI like ChatGPT sound smart

washingtonpost.com/technology/

#wattpad #pornhub #smashwords #literotica #porn #NSFW #erotica #datasettraining #DataSet #LLM #AI

Last updated 1 year ago

Kevin Karhan :verified: · @kkarhan
898 followers · 48815 posts · Server mstdn.social

@stux because AI can only learn based off data - and beyond US-English and the official languages of EU & UN publications there's no sufficient dataset to teach it other languages unless you're Google and can run all day on your search index...

#Oltmann #Google #DataSet #un #EU

Last updated 1 year ago

Kevin Karhan :verified: · @kkarhan
669 followers · 27540 posts · Server mstdn.social

@amydiehl @aral AI is always as biased as the dataset used to train it.

And it shows...

#train #DataSet #biased #AI

Last updated 2 years ago

Datendealerin · @Datendealerin
467 followers · 512 posts · Server fediscience.org

Did you know you can adopt a dataset? From now on and all of next week 🧡

➡️ Find your new lovable dataset here: icpsr.umich.edu/web/about/cms/

#icpsr #LoveData23 #DataSet

Last updated 2 years ago

Céline Heuzé · @ClnHz
816 followers · 48 posts · Server mstdn.social

The drama continues.
How frowned upon would it be if "a friend", when forced to add humans who did not contribute, created a profile for their cat and added them as well? Reasoning being that said cat at least provided moral support and did use the keyboard while the dataset was open, unlike the extra humans. Hypothetically, obviously.

#fbullies #academia #Cat #authorship #DataSet

Last updated 2 years ago

AV_SP · @AV_SP
143 followers · 42 posts · Server fediscience.org

An fMRI dataset acquired during naturalistic movie watching and narrated recall of a series of short cinematic films
sciencedirect.com/science/arti openneuro.org/datasets/ds00404

Whole-brain fMRI data from continuous naturalistic tasks (here unguided spoken recall) are rare. This dataset can be reanalyzed using brain areas & functional characteristics not explored in the published articles, & the behavioral data (transcripts of spoken recall) can be reanalyzed on their own!

#Naturalistic #fmri #DataSet

Last updated 2 years ago

Philipp Leitner · @xLeitix
98 followers · 114 posts · Server fediscience.org

The ICPE conference is again running a data challenge this year, this time with a dataset of microbenchmarking data in Java.

icpe2023.spec.org/tracks-and-s

#DataSet #ICPE

Last updated 2 years ago

Dr Sam Burgess · @OceanTerra
328 followers · 24 posts · Server fediscience.org

Copernicus Climate has just published a new reanalysis dataset for Europe.

It provides hourly estimates of surface & soil variables reaching back to 1984 at the same enhanced horizontal resolution of 5.5km as CERRA.

Access it via Climate Data Store: bit.ly/3MM8LLg

#Climate #opendata #C3S #DataSet #Reanalysis #CopernicusClimate

Last updated 2 years ago

If you want to quickly find skewed features in a dataset in R then you can use my {healthyR.ai} package.

There is a function called hai_skewed_features() that will list off the columns that are skewed.
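
For readers outside R, the idea of listing skewed columns can be sketched in plain Python. This is not the {healthyR.ai} implementation: the names `skewness` and `skewed_columns`, the sample table, and the 0.6 cutoff are all assumptions made for illustration, and the skewness used here is the simple moment-based estimate.

```python
def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (an assumed definition for this sketch)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # a constant column has no skew
    return m3 / m2 ** 1.5


def skewed_columns(table, threshold=0.6):
    """Return names of columns whose absolute skewness reaches the threshold.

    `table` maps column names to lists of numbers; the threshold is a
    hypothetical default chosen for this example.
    """
    return [name for name, col in table.items()
            if abs(skewness(col)) >= threshold]


# Hypothetical data: one symmetric column, one with a long right tail.
data = {
    "symmetric": [1, 2, 3, 4, 5],
    "right_tailed": [1, 1, 1, 2, 10],
}
print(skewed_columns(data))  # ['right_tailed']
```

A column flagged this way is usually a candidate for a transformation (log, square root) before modelling.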

Here is the post: spsanderson.com/steveondata/po

#package #r #DataSet #features #skewed

Last updated 2 years ago

Technology Tales · @technology_tales
7 followers · 81 posts · Server mstdn.social

Learning computing languages
Over the years, I have taught myself a number of computing languages with some coming in useful for professional work while others came in handy for website development and maintenance. The collection has grown to include HTML, CSS, XML, Perl, PHP and UNIX
technologytales.com/2021/04/11

#sas #r #Python #programminglanguage #OpenSource #languages #language #graphs #DataSet #Data #Computing #Software #scripting #Programming

Last updated 3 years ago
