Stumbled upon Trafilatura just now. An amazingly efficient Python lib/tool to extract text from HTML-based pages.
Especially welcome, since Newspaper3k has been abandoned for almost 3 years.
#Python #TextExtraction #Trafilatura