Doc Edward Morbius ⭕​ · @dredmorbius
2081 followers · 14668 posts · Server toot.cat

@thornAvery My own approaches are:

  • Find LITERALLY ANY FORMAT OTHER THAN PDF. HTML, text, ePub, etc., if possible.

  • Try pdftotext, part of Poppler utils: poppler.freedesktop.org/ This is available for most Linux distros, for macOS via Homebrew, or as source via Git.

If I can get something vaguely reasonable, that's usually sufficient.

  • OCR is an option. I've never had good luck with that, and there's such a tremendous amount of tedious correcting that retyping is frequently preferable. That said, I operate at fairly low scale.

  • Retype by hand. Since I'm usually reading the work, this actually turns out to be a pretty good reading method for content-retention.
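
For the pdftotext option, here is a minimal Python sketch of how the call might be wrapped. It assumes pdftotext from Poppler utils is already on your PATH. The function names are made up for illustration; the flags used (-enc, -layout, and a trailing "-" for stdout) are standard pdftotext options.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Return the argument list for a pdftotext call that writes to stdout."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout,
        # which often helps with multi-column pages and tables.
        command.append("-layout")
    # A final "-" tells pdftotext to write the text to stdout
    # instead of creating a .txt file next to the PDF.
    command += [str(pdf_path), "-"]
    return command


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext on one file and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install Poppler utils first")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("paper.pdf") (a hypothetical file name) returns whatever text layer the PDF has. If the result looks scrambled, trying keep_layout=False is a cheap second attempt.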

PDF itself is a container around a bunch of other formats. Asking how to convert a PDF is a bit like asking how to cook a bag full of groceries. It really depends on what's in it, and what you're hoping to get.

#pdf #pdfConversion #kfc #docfs #webfs

Last updated 3 years ago

Doc Edward Morbius ⭕​ · @dredmorbius
2081 followers · 14668 posts · Server toot.cat

@thornAvery I'm trying to find what I thought I remembered as an excellent HN comment discussing how to do this at scale.

It turns out to be really complicated.

That said, maybe tell us what it is you're trying to do, specifically:

  • How many documents.
  • How large.
  • What languages / character sets.
  • What budget (if any).
  • What end-use.

#webfs #docfs #kfc #pdfConversion #pdf

Last updated 3 years ago

Doc Edward Morbius ⭕​ · @dredmorbius
2081 followers · 14668 posts · Server toot.cat

@thornAvery There's no such creature that will cover all cases. You may get lucky in many instances with easier options.

Your best bet is to find another form of the document that's closer to text. For many published documents there are good odds of this.

If the PDF is actually rendered from a text source, pdftotext is pretty good at extracting the actual text.

If it's not ... you're left with a much more challenging job. I find with rather startling frequency that simply re-typing the document from scratch is often the best option.
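
One rough way to tell the two cases apart is to look at what pdftotext gives back. The sketch below is only a heuristic, not part of any tool: the function name and both thresholds are assumptions and would need tuning for real documents. It treats output with too few characters, or too few letters among them, as a sign that the PDF is probably scanned images rather than rendered text.

```python
def looks_like_real_text(text, min_chars=200, min_letter_ratio=0.6):
    """Guess whether extracted text is usable prose.

    Both thresholds are arbitrary starting points (assumptions),
    not values taken from pdftotext or any other tool.
    """
    # Drop all whitespace, including the form feeds pdftotext
    # emits between pages, so empty pages count as nothing.
    stripped = "".join(text.split())
    if len(stripped) < min_chars:
        return False
    # str.isalpha() is Unicode-aware, so non-Latin scripts count as letters.
    letters = sum(1 for ch in stripped if ch.isalpha())
    return letters / len(stripped) >= min_letter_ratio
```

If this returns False for a document, that is the point at which OCR or retyping is likely what's left.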

1/

#pdf #pdfConversion #kfc #docfs #webfs

Last updated 3 years ago

Swapnil · @thinkfree
47 followers · 141 posts · Server fosstodon.org

manual.calibre-ebook.com/conve

Reading PDFs on ebook readers and smartphones is a pain.
I hope content creators get to use a better format (EPUB, single-file HTML?).

PDF format is perfect for many use cases, but not for everything.

#ebookreader #fileformat #pdfConversion #ebookformatting

Last updated 4 years ago