FedSearch - Federated network search engine

Angelo Salatino · @angelosalatino

55 followers · 118 posts · Server fediscience.org

arXiv.org - A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents

RT @BreitingerC
If you've ever extracted information from PDFs, you've probably used a tool like #GROBID, #CERMINE, or #ScienceParse.

But which tool is best for this job?

My colleague @MeuschkeN ran the tests and is presenting his results at #iConf23. @iconf

Paper 📰 https://arxiv.org/abs/2303.09957 https://twitter.com/MeuschkeN/status/1640262099644432385


Last updated 3 years ago


Andreas Wagner · @anwagnerdreas

665 followers · 1161 posts · Server hcommons.social

@osma Cool, thank you! I had a quick glance and will definitely take a closer look. Do you happen to know whether your GPT-3 model was pretrained on (presumably small volumes of) Finnish text? Either way, it seems to confirm our intuition that recognition and parsing of such data in texts could be quite good, hopefully better than what #Grobid or #AnyStyle presently achieve.

You are of course warmly invited to consider joining us in one way or another. 😃


Last updated 3 years ago


Osma Suominen · @osma

140 followers · 155 posts · Server sigmoid.social

Has anyone used large language models for extracting #metadata (#bibliographic style, e.g. #DublinCore) from full-text (PDF) documents? I tried this with a fine-tuned #OpenAI #GPT3 Curie model, and the results were outrageously good, at least for doctoral theses. Much better than traditional NLP methods like #GROBID.

#AI #machinelearning #LLM


Last updated 3 years ago
