https://bugs.kde.org/show_bug.cgi?id=490127

--- Comment #3 from Volker Krause <vkra...@kde.org> ---
The main challenge with browser-generated PDFs is that line and page breaks
don't tend to be stable, resulting in many more variations that need to be
handled. That doesn't mean extracting from them is impossible, but you can
usually only rely on the textual content, not on its structure or layout.
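
As a rough illustration of matching on textual content only, the sketch below
collapses all whitespace (including line and page breaks) before applying a
pattern, so the same text matches regardless of where the browser wrapped it.
The `normalize` and `find_reference` helpers and the "Booking reference"
pattern are made up for this example; they are not part of any actual
extractor.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration; real documents will use other wording.
BOOKING_RE = re.compile(r"Booking reference: (?P<ref>[A-Z0-9]{6})")


def normalize(text: str) -> str:
    """Collapse every whitespace run (spaces, newlines, form feeds) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def find_reference(pdf_text: str) -> Optional[str]:
    """Return the booking reference, ignoring how the text was wrapped."""
    match = BOOKING_RE.search(normalize(pdf_text))
    return match.group("ref") if match else None
```

With this, text wrapped as "Booking\nreference:\n  ABC123" and text on a
single line yield the same result, at the cost of discarding any layout
information.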

For extracting other PDFs we tend to use a mix of textual and structural
approaches. Metadata beyond that (e.g. the author/creator fields) is rare in
PDFs. When it is properly set, we use it e.g. for determining the correct
extractor scripts, which is more efficient than doing that based on the
content. The best case is (large) 2D barcodes in PDFs, like those found on
flight or train tickets, but those should already work in PDFs printed from
websites as well.
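
A minimal sketch of the metadata-first selection described above, assuming the
PDF metadata is available as a plain dictionary: look up the creator field
first, and only scan the text when that field is missing or unknown. The
registry contents, the "ExampleRail" names and `select_extractor` are
hypothetical and only serve the illustration.

```python
from typing import Optional

# Hypothetical registries; the names are invented for this example.
EXTRACTORS_BY_CREATOR = {"ExampleRail Ticketing": "examplerail"}
EXTRACTORS_BY_CONTENT = [("ExampleRail", "examplerail")]


def select_extractor(metadata: dict, text: str) -> Optional[str]:
    """Pick an extractor name, preferring the cheap metadata lookup."""
    creator = (metadata.get("creator") or "").strip()
    if creator in EXTRACTORS_BY_CREATOR:
        return EXTRACTORS_BY_CREATOR[creator]
    # Fallback: the creator field is missing or unknown, so scan the content.
    for needle, name in EXTRACTORS_BY_CONTENT:
        if needle in text:
            return name
    return None
```

The metadata path is a single dictionary lookup, while the fallback has to
search the whole document text for every known extractor.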
