Chandrasekar argues that #LLM #developers are violating Stack Overflow's terms of service: users own the content they post on the platform, but that content is published under a #CreativeCommons (CC) license that requires attribution. #AI companies sell their models to customers but cannot attribute each community member whose questions and answers were used to train the model, which breaches the CC license.
#StackOverflow will charge #AIGiants for #TrainingData | #GenerativeAI | WIRED
Just catching up on recent topics after passing my kidney stone. I heard about the AI / ML #Drake + #TheWeeknd song on the #DailyTechHeadlines podcast, and #LewLater also covered it in a vlog. We're past the point of no return (hmm... I think that's a Nu Shooz song).
#AI #ML #TrainingData #AITrainingData #MLTrainingData #AIArt #AIArtwork #AIMusic #Art #Artwork #Music #Copyright #Violation #PointofNoReturn
“In #AI development, the dominant paradigm is that the more #TrainingData, the better. #OpenAI’s GPT-2 model had a data set consisting of 40 gigabytes of text. GPT-3, which ChatGPT is based on, was trained on 570 GB of data. OpenAI has not shared how big the #DataSet for its latest model, GPT-4, is.
But that hunger for larger models is now coming back to bite the company. In the past few weeks, several Western data protection authorities have started investigations into how OpenAI collects and processes the data powering #ChatGPT. They believe it has scraped people’s personal data, such as names or email addresses, and used it without their consent.”
#EU / #DataProtection
Reading this very interesting paper on #TrainingData sizes for #machinelearning applications like #LLM and #ComputerVision. Speaks to so many things that have been on my mind about the quality, size, curation and modeling of data. But one preliminary thought that jumps out is how this insistence on massive amounts of training data essentially focuses on the #Anglophone and perhaps #Western world, or how it perpetuates differences of #class and access in other places.
@jantic I might be wrong, but AFAIR from a discussion a few years ago with my uni's lawyers, intellectual property rights extend to forms of distribution that could recreate the original work. So making the word frequencies of a text available over an API was okay, but distributing the work itself was not. I wonder whether this #AI #TrainingData issue, if it ever ends up in court, might play out along similar lines.
Raises so many questions on the level of granularity at which individual creative and legal agency operates.