RT @BreitingerC
If you've ever extracted information from PDFs, you've probably used a tool like #GROBID, #CERMINE or #ScienceParse
But which tool is best for this job?
My colleague @MeuschkeN ran the tests and is presenting his results at #iConf23. @iconf
Paper 📰 https://arxiv.org/abs/2303.09957 https://twitter.com/MeuschkeN/status/1640262099644432385
@osma Cool, thank you! I had a quick glance and will definitely take a closer look. Do you happen to know whether your GPT-3 model was pretrained on (presumably small volumes of) Finnish text? Either way, it seems to confirm our intuition that recognition and parsing of such data in texts could be quite good, hopefully better than what #Grobid or #AnyStyle presently achieve.
You are of course warmly invited to consider joining us in one way or another. 😃
Has anyone used large language models for extracting (#bibliographic style, e.g. #DublinCore) #metadata from full-text (PDF) documents? I tried this with a fine-tuned #OpenAI #GPT3 Curie model and the results were outrageously good, at least for doctoral theses. Much better than traditional NLP methods like #GROBID.