[Bug 151577] Writer PDF import filter should default to producing paragraphs of text, not drawing objects

bugzilla-daemon Tue, 01 Jul 2025 04:11:41 -0700

https://bugs.documentfoundation.org/show_bug.cgi?id=151577


Dave Gilbert <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |RESOLVED
         Resolution|---                         |DUPLICATE

--- Comment #8 from Dave Gilbert <[email protected]> ---
(In reply to V Stuart Foote from comment #6)
> Sorry, it is a dupe of bug 32249, clear and simple. Filter functions needed to
> render PDF text spans back as Paragraph objects would be the same across all
> LO modules. 

The poppler import code does have an abstraction of which module it's
targeting, so it _could_ do something different for writer than draw;
however...

> 
> Comment 0 was opened against a Writer originated ODF document, but there is
> no distinction made in the export filter(s) (PDF has no "paragraph" object
> keeping text spans together as sentences, even words might be broken apart).
> And this *enhancement* is not about the LO Hybrid PDF that attaches the ODF
> source document into the PDF and selectively LO will open that attachment on
> import--bypassing the PDF facsimile. But that already functions as an export
> option.
> 
> For bug 32249 and bug 118370 Justin L. completed *one* reasonable approach
> working with the poppler -> cairo extracted sd text box objects from the PDF
> BT/ET spans, of "consolidating" a selection of the generated text boxes into
> a single text box object.
> 
> An alternative was proposed at
> https://bugs.documentfoundation.org/show_bug.cgi?id=32249#c19 of a process
> taking the extracted strings (still poppler -> cairo based) and reflowing
> them into lexically correct full sentences or full paragraph objects, and
> assembling those into an ODF-ready object available to style, spell
> check, etc. Focus would be less on the layout of the PDF and more on
> extracting a lexicographically correct representation of a page.
> 
> So, this bz issue could be that additional work. More fully scoped here. Or,
> we could set it back to the dupe it is, as bug 32249 was left open after the
> work on bug 118370 but scope was not expanded to all PDF import filters.
> 
> Added the devs with insight, for their opinions, but coin flip set it again
> as the dupe it is.

Yeh, the hard part is deciding how to assemble the chunks of text; once you
have those, spitting them out as a paragraph object for writer feels
relatively easy. There are some recent separate non-LO tools that try various
heuristics for it which look pretty neat, so while it's never going to be
perfect, something better should be doable.
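
To give a rough idea of the sort of heuristic meant here, a minimal sketch
follows. It is not the poppler or LO import code and not taken from any of
those tools; the Chunk type, the function names and the tolerance values are
all made up for illustration. It assumes each extracted chunk carries a
position, a width and a font size: chunks on nearly the same y are merged
into a line, and lines whose vertical gap is small are merged into a
paragraph.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical extracted text span (not a poppler or LO type)."""
    x: float      # left edge
    y: float      # distance from top of page; larger y is lower down
    width: float  # rendered width of the text
    size: float   # font size, used to scale the tolerances
    text: str


def group_into_lines(chunks, y_tol=0.3):
    """Merge chunks whose y differs by at most y_tol * font size."""
    lines = []
    for c in sorted(chunks, key=lambda c: (c.y, c.x)):
        if lines and abs(lines[-1][0].y - c.y) <= y_tol * c.size:
            lines[-1].append(c)
        else:
            lines.append([c])
    # Order each line left to right.
    return [sorted(line, key=lambda c: c.x) for line in lines]


def line_text(line, space_tol=0.2):
    """Join chunks, adding a space only where the horizontal gap is wide."""
    parts = [line[0].text]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > space_tol * cur.size:
            parts.append(" ")
        parts.append(cur.text)
    return "".join(parts)


def group_into_paragraphs(chunks, para_gap=1.5):
    """Start a new paragraph when the line spacing exceeds para_gap * size."""
    paragraphs = []
    prev_y = None
    for line in group_into_lines(chunks):
        y, size = line[0].y, line[0].size
        text = line_text(line)
        if prev_y is not None and (y - prev_y) <= para_gap * size:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        prev_y = y
    return paragraphs
```

A real version would also have to cope with columns, indents, hyphenation
at line ends and mixed font sizes, which is where it stops being easy.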

Duping as suggested.

If you want to repeatedly edit through a PDF you create from LO, tick the
hybrid box - that's what it's for!

*** This bug has been marked as a duplicate of bug 32249 ***

-- 
You are receiving this mail because:
You are the assignee for the bug.
