Bug#894068: ocrmypdf: New dependency on PyMuPDF for v6.0.0

James R Barlow Fri, 30 Mar 2018 21:27:29 -0700

Hello Sean,

As promised, ocrmypdf v6.1.2 makes pymupdf optional but recommended. My
continuous integration tests run both with and without pymupdf.

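For illustration only, here is a minimal sketch of the usual Python pattern for an optional dependency. It is not taken from ocrmypdf's source; the helper names `module_available`, `HAVE_PYMUPDF` and `describe_pdf_backend` are hypothetical. The one fact relied on is that PyMuPDF is imported under the module name `fitz`.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


# PyMuPDF installs a module called "fitz"; probe for it without importing it.
HAVE_PYMUPDF = module_available("fitz")


def describe_pdf_backend() -> str:
    """Hypothetical helper: report which code path would be used."""
    if HAVE_PYMUPDF:
        return "pymupdf"
    return "fallback"


print("PDF backend:", describe_pdf_backend())
```

A test suite can then exercise both branches, once in an environment with PyMuPDF installed and once without it.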

The only major regression without pymupdf occurs when all of the
following apply:
1) the input file contains a mix of scanned and born-digital pages
2) --skip-text is used (not default)
3) --output-type pdf is used (not default)
In that case the output file can grow extremely large compared to the
input. Past versions of ocrmypdf have had this issue for a long time; it
will now produce a warning.
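
The combination of conditions above can be sketched as a small check. This is an assumption about how such a warning might be wired up, not ocrmypdf's actual implementation; the function names, parameters and warning text are all hypothetical.

```python
import warnings


def large_output_risk(have_pymupdf: bool, skip_text: bool,
                      output_type: str, mixed_input: bool = True) -> bool:
    """Return True when every condition from the list above holds.

    mixed_input stands in for "scanned and born-digital pages in one
    file"; it defaults to True because a caller may not know in advance.
    """
    return bool(
        not have_pymupdf
        and mixed_input
        and skip_text
        and output_type == "pdf"
    )


def warn_if_large_output(have_pymupdf: bool, skip_text: bool,
                         output_type: str) -> bool:
    """Emit a warning (hypothetical wording) if the output may balloon."""
    if large_output_risk(have_pymupdf, skip_text, output_type):
        warnings.warn(
            "--skip-text with --output-type pdf and no pymupdf: "
            "the output file may be much larger than the input"
        )
        return True
    return False
```

With pymupdf installed, or with either option left at its default, the check returns False and no warning is raised.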

So it should be ready for Debian.

Thanks.


On Mon, 26 Mar 2018 at 14:30 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> On Mon, Mar 26 2018, James R Barlow wrote:
>
> > Thanks for the information. That's a worryingly high wall to climb and
> > I'm concerned about implications for other platforms as well.
> >
> > I would appreciate if you can see about getting an exception, but I
> > think I will change PyMuPDF to an optional but recommended dependency
> > fairly soon.
>
> That would be great in the meantime.
>
> > I haven't made a major investment in it as yet with new code, but it
> > does provide some powerful features that would be a major engineering
> > effort to replicate and are likely not going to materialize in another
> > open source library anytime soon. (Specifically: incremental updates,
> > safe editing of PDF/A, PDF object garbage collection, fast
> > rasterizing, robust text extraction.) The most commonly used Python
> > PDF library, PyPDF2, is essentially unmaintained and in poor shape.
>
> Having thought some more, I think our best bet will be to try to get
> pymupdf to support linking against the static version of mupdf.  We have
> techniques in Debian to deal with security updates in that case (called
> binNMUs if you want to look them up).
>
> --
> Sean Whitton
>
