@johnleonard Yeah. Introducing deliberate structure and shape into the training data is certainly going to improve #AI performance.
Granularising the input data is also necessary to protect the privacy of any personal details it contains.
But the #intellectual #property issue remains. Running the original works of art through the #preprocess step doesn’t change the fact that they contribute to the final data.
This is simply marketing. No data scientist worth their salt would make such claims! 🤨
#intellectual #property #preprocess #ai
In #bioinformatics, and #scientific #computing generally, raw #data is never ever in a usable form.
This is one of the many things #Hollywood persistently gets wrong: the #scientist / #hacker / omni-computer-geek opens up the file, stares at a bunch of #numbers or #symbols, and says, “Ah hah! If I #compile the #HTML to reverse the #polarity on the #IP #gateway, I can #deconvolve the #DNA sequence to #backpropagate the #cellular #metabolism of the #alien #plague! Oh, and make #dinosaurs if you want, but that’s extra.”
Bonus points if the screen projects on said scientist’s face and reflects from the inevitable chunky-framed glasses. Scribbling equations backward on a transparent whiteboard may also be involved.
#Scientists, as I have said many times before and no doubt will need to say many times again, are people. We’re pretty good with numbers, yes, as a rule. But what we’re good at doing with those numbers is not reading and understanding them. It’s using them as the raw materials for a product that makes sense to the human brain. Words, pictures, and a MUCH SMALLER number of numbers are our goal. Also continued #funding, which is about the kind of numbers everyone understands.
Before we process the numbers, we need to “#preprocess” them. There are several intermediate steps between the really raw data and the cover story for next week’s issue of Nature. Preprocessing is where we turn the glowing symbols projected onto our faces into something that kinda-sorta makes sense. It’s still not really readable, but people looking at it, who know what they’re looking at, can tell what it represents.
Usually this is in the form of one or more #tables: for a familiar example, think of an #Excel workbook with several large #spreadsheets. (In reality, storing data in Excel is a terrible idea, but I’ll stick with that metaphor.) Nobody’s going to read and digest everything in the workbook. You can look at the headers and a few of the values and at least have an idea where to start. Preprocessing gets you to that point.
For most types of data, preprocessing is fairly standardized. You don’t have to write your own code: someone else has already done that work for you. Just pick a #software #package, run the raw data through it, and glance at the output to make sure nothing went horribly wrong. Now you’re ready to write the code only you can write, to discover the Secrets of Life Itself. Now is the time for SCIENCE.
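To make the idea concrete, here’s a toy sketch of what a preprocessing step does in spirit. Everything in it is illustrative, not any real package’s pipeline: the filtering threshold and the log2(x + 1) transform are just common defaults for count-style data.

```python
import csv
import io
import math

def preprocess_counts(raw_tsv, min_present=0.5):
    """Toy 'preprocess' step: read a TSV of raw counts, drop rows with
    too many missing values, and log-transform what survives."""
    reader = csv.reader(io.StringIO(raw_tsv), delimiter="\t")
    header = next(reader)
    kept = []
    for row in reader:
        name, values = row[0], row[1:]
        present = [v for v in values if v != ""]
        # Drop features measured in fewer than min_present of the samples.
        if len(present) / len(values) < min_present:
            continue
        # log2(x + 1) is a common default transform for count data.
        kept.append([name] + [round(math.log2(float(v) + 1), 3) for v in present])
    return header, kept

raw = "gene\ts1\ts2\ts3\nGENE_A\t7\t15\t31\nGENE_B\t\t\t3\n"
header, table = preprocess_counts(raw)
# GENE_B is mostly missing, so it gets filtered out before any real analysis.
```

The output is exactly the kind of table described below: not really readable, but someone who knows what they’re looking at can tell what it represents.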
Or Nature. Or The Journal Of Obscure Subfield Ten People In The World Know Exists. Or a tech report. You know, whatever.
Careful readers will have noticed the word “fairly” above. In fact there are multiple #algorithms to choose from, and multiple packages implementing those algorithms, and #documentation written at 3:00 AM by an exhausted #postdoc who really just wanted to check the #cell cultures one last time and grab the remaining half a chicken salad sandwich from the break room fridge and go home and crawl into bed for a few hours’ sleep before dragging ass back in tomorrow. Shower optional.
Other exhausted postdocs and their harassed #principal #investigators, who get somewhat more sleep and a somewhat finer grade of chicken salad but are much more worried about upcoming funding application deadlines, may or may not bother to write down which package they use to preprocess their data. Or what specific parameters they tuned. Or if they even know how they’re supposed to use the damned thing: there’s a really good chance they just ran the data through on the default settings, got something that looked reasonable, and called it a day.
Amazingly, most of the time this doesn’t really matter. Data has a life of its own. The bigger the data set gets, and these days nearly all data are “big data,” the more likely it is that any reasonable method will produce similar results. Good thing too, otherwise science (and Science) would grind to a screeching, shuddering, smoking halt.
Sometimes it matters a lot. Careful scientists check, just in case. I try to be one of those, and when I’m not, my coworkers pick up the slack. Luckily for me, for most of my career I’ve found myself in the company of those who live up to that standard, and I can mostly convince myself I do the same. Another item on Hollywood’s long list of sins: science is not a solo enterprise. In fact it’s deeply social, which is one of several reasons why the stereotype of scientists as loners is a load of crap. But I digress.
In case you’re wondering if this has a point, yes it does, and here it is: all the above is why my boss recently sent me a message saying, “Woah yeah ok so maybe you do need to process from raw after all. B/c idk wtf that is.”
Without any irony at all: I love my job.
Until now I was racking my brain maintaining the Firefox and Chrome versions of my extension. Even though they’re based on the same API, there are sometimes a few differences. So, with heavy use of gulp and gulp-preprocess, and after writing a small script to generate the manifest specific to each one, I finally have a single common source for both extensions. Development should be faster as a result. #gulp #watch #preprocess #fork #mastoshare
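A setup like this might look roughly as follows. This is a sketch, not the author’s actual gulpfile: the paths, task names, and the BROWSER context variable are all assumptions; gulp-preprocess passes its `context` option to the underlying preprocess package, whose `// @if` / `// @endif` directives gate browser-specific lines in the shared sources.

```javascript
// gulpfile.js — sketch of a single-source, two-target extension build.
const gulp = require('gulp');
const preprocess = require('gulp-preprocess');

function buildFor(browser) {
  return () =>
    gulp.src('src/**/*.js')
      // In the shared sources, browser-specific code is wrapped in
      // directives like: // @if BROWSER='firefox' ... // @endif
      .pipe(preprocess({ context: { BROWSER: browser } }))
      .pipe(gulp.dest(`dist/${browser}`));
}

exports.firefox = buildFor('firefox');
exports.chrome = buildFor('chrome');
// A separate script, as the post describes, would generate each
// dist/<browser>/manifest.json from a common template.
```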