@thornAvery My own approaches are:
Find LITERALLY ANY FORMAT OTHER THAN PDF. HTML, text, ePub, etc., if possible.
Try pdftotext, part of Poppler utils: https://poppler.freedesktop.org/ This is available for most Linux distros, for macOS under Homebrew, or as source via Git.
If I can get something vaguely reasonable, that's usually sufficient.
OCR is an option. I've never had good luck with that, and there's such a tremendous amount of tedious correcting that retyping is frequently preferable. That said, I operate at fairly low scale.
Retype by hand. Since I'm usually reading the work, this actually turns out to be a pretty good reading method for content-retention.
PDF itself is a container around a bunch of other formats. Asking how to convert a PDF is a bit like asking how to cook a bag full of groceries. It really depends on what's in it, and what you're hoping to get.
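The pdftotext step above can be sketched in Python as follows. This is a minimal sketch, assuming Poppler's pdftotext is installed and on the PATH. The helper names (pdftotext_command, extract_text) are made up for illustration, and the -layout and -enc options are just one reasonable choice, not the only way to call the tool.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # Ask pdftotext to keep the physical page layout as far as it can.
        cmd.append("-layout")
    # A trailing "-" tells pdftotext to write the text to stdout.
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path, keep_layout=False):
    """Return the extracted text, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None  # Poppler utils not found on this machine
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("paper.pdf") (a hypothetical file name) returns whatever text pdftotext could pull out. Whether that is "vaguely reasonable" still needs a human look.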
#pdf #pdfConversion #kfc #docfs #webfs
@thornAvery I'm trying to find what I thought I remembered as an excellent HN comment discussing how to do this at scale.
It turns out to be really complicated.
That said, maybe tell us what it is you're trying to do, specifically:
#webfs #docfs #kfc #pdfConversion #pdf
@thornAvery There's no such creature that will cover all cases. You may get lucky in many instances with easier options.
Your best bet is to find another form of the document that's closer to text. For many published documents there are good odds of this.
If the PDF is actually rendered from a text source, pdftotext is pretty good at extracting the actual text.
If it's not ... you're left with a much more challenging job. I find with rather startling frequency that simply re-typing the document from scratch is often the best option.
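One rough way to tell the two cases apart is to look at what the extraction tool returns. The sketch below is a heuristic only, with assumptions: the function name looks_like_real_text and both thresholds are invented for illustration, and the word pattern only recognises plain ASCII words, so it would misjudge non-English documents.

```python
import re

# A "word-like" token: ASCII letters, optional apostrophes or hyphens,
# optionally followed by one punctuation mark.
WORDLIKE = re.compile(r"[A-Za-z][A-Za-z'\-]*[.,;:!?]?")


def looks_like_real_text(text, min_words=20, min_wordlike_ratio=0.6):
    """Guess whether extracted text came from a genuine text layer."""
    tokens = text.split()
    if len(tokens) < min_words:
        # Scanned PDFs often yield nothing, or only a few stray characters.
        return False
    wordlike = sum(1 for token in tokens if WORDLIKE.fullmatch(token))
    return wordlike / len(tokens) >= min_wordlike_ratio
```

If this returns False for the output of a text extractor, the PDF is probably a scan or otherwise lacks a usable text layer, and you are in the harder case of OCR or retyping.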
#pdf #PDFConversion #kfc #docfs #webfs
1/
#pdf #pdfConversion #kfc #docfs #webfs
https://manual.calibre-ebook.com/conversion.html#pdfconversion
Reading PDFs on ebook readers and smart phones is a pain.
I hope content creators get to use a better format (EPUB, single-file HTML?).
PDF format is perfect for many use cases, but not for everything.
#ebookreader #fileformat #pdfConversion #ebookformatting