Re: [MediaWiki-l] [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu

David Cuenca Wed, 17 Jul 2013 13:12:51 -0700

Now that you mention it...
http://linux.slashdot.org/story/13/07/15/2316219/kernel-dev-tells-linus-torvalds-to-stop-using-abusive-language


Micru

On Wed, Jul 17, 2013 at 11:36 AM, Brion Vibber <[email protected]> wrote:

> I'm not sure his attitude will encourage people to work with him to his
> specifications.
>
> -- brion
>
>
>
>
> On Wed, Jul 17, 2013 at 8:12 AM, David Cuenca <[email protected]> wrote:
>
> > I'm forwarding this message by George Orwell III on en-ws [1]. I think it
> > is extremely important as it offers an insight about what is wrong with
> > Djvu handling on Wikisource.
> >
> >
> > "We/you are losing the X-min, Y-min, X-Max & Y-max (mapping coordinates)
> > because the original PHP contributing a-hole for the DjVu routine on our
> > servers never bothered to finish the part where the internal DjVu text
> > layer is converted to a (coordinate rich) XML file using the existing
> > DjVuLibre software package because, at the time, the software had issues.
> >
> > "That faulty DjVuLibre version was the equivalent of 4,317 versions ago
> > and the issue has been long fixed now EXCEPT that the .DTD file needed to
> > base the plain-text to XML conversion on still has the wrong 'folder
> > path' on local DjVuLibre installs (if this is true on server installs as
> > well, I cannot say for sure). Once I copied the folder to the [wrong]
> > folder path, I was able to generate the XMLs all day long. These XMLs are
> > just like the ones IA generates during their process (in addition to the
> > XML that ABBYY generates for them).
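
For readers unfamiliar with the coordinate-rich XML described above, here is a
minimal Python sketch of reading word boxes out of such a file. The sample
markup, the WORD element name and the coords attribute are assumptions modeled
on the DjVuXML that tools such as DjVuLibre's djvutoxml produce; they are not
taken from the message, so check them against a real file. Because the order
of the four numbers is also an assumption, the sketch normalizes them with
min/max instead of relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample for illustration; real files are far larger.
SAMPLE = """<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE>
<WORD coords="100,250,180,220">Hello</WORD>
<WORD coords="190,250,260,220">world</WORD>
</LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>"""


def word_boxes(xml_text):
    """Return (text, (x_min, y_min, x_max, y_max)) for every WORD element."""
    root = ET.fromstring(xml_text)
    boxes = []
    for word in root.iter("WORD"):
        nums = [int(n) for n in word.get("coords", "").split(",") if n.strip()]
        if len(nums) < 4:
            continue  # skip words that carry no usable box
        x1, y1, x2, y2 = nums[:4]
        # Normalize so the result does not depend on the corner order.
        box = (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
        boxes.append((word.text or "", box))
    return boxes


if __name__ == "__main__":
    for text, box in word_boxes(SAMPLE):
        print(text, box)
```

Keeping these boxes next to the text is what the plain-text dump workaround
throws away.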
> >
> > "So it's not that we as a community decided not to follow through with
> > (coordinate rich) XML generation but got stuck with the plain-text dump
> > workaround due to a DjVuLibre problem that no longer exists. Plus, the
> > guy who created the beginnings of this fabulous disaster was like a tick
> > with an attention span deficit and moved on to conjuring up some other
> > blasted thing or another instead of following up on his own workaround &
> > finishing the XML coding portion once the DjVuLibre glitch was fixed."
> > -- 15:16, 15 July 2013 (UTC)
> >
> >
> > [1]
> > http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
> >
> > On Wed, Jul 17, 2013 at 6:57 AM, Alex Brollo <[email protected]>
> > wrote:
> >
> > > Just a brief comment about the djvu text layer, using IA files to dig
> > > deeper into the topic.
> > >
> > > FineReader OCR stores incredibly detailed information in a proprietary
> > > format; various FineReader versions then export part of this extremely
> > > rich set of information into different outputs, one of them being the
> > > djvu text layer. It's worth noting that even if any information stored
> > > in the djvu text layer can be extracted and used, the set of
> > > information wrapped into the djvu text layer (in either lisp-like or
> > > xml format) is only a minor subset of the original OCR information.
> > >
> > > If someone is interested in getting much more information, they can
> > > find it in the abbyy.xml output; Internet Archive provides it as
> > > abbyy.gz in the list of exportable files. It's a very heavy and complex
> > > xml structure, but it is possible to parse it and to extract from it any
> > > information wrapped into the djvu text layer and much more - most
> > > interestingly, wordPenalty, that is, word by word, a summary of the
> > > degree of uncertainty of the OCR recognition of the whole word.
> > >
> > > We (I and Aarti) are digging into this mess, with fast preliminary
> > > results; you can see in [[it:w:Utente:Alex brollo/Sandbox]] some brief
> > > pieces of text extracted from abbyy.gz, where doubtful words (in the
> > > opinion of the OCR software) are red. They can be easily managed by
> > > VisualEditor, since the marking comes from a simple span tag.
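
The span marking described above could be sketched in Python roughly as
follows. This is only an illustration under stated assumptions, not Alex's
actual script: the sample markup is hypothetical, the charParams element with
its wordStart and wordPenalty attributes is assumed to follow the ABBYY
FineReader XML layout, and the threshold of 10 and the red inline style are
arbitrary choices.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample; a real abbyy.xml is namespaced and far richer.
SAMPLE = """<line>
<charParams wordStart="true" wordPenalty="0">T</charParams>
<charParams wordStart="false" wordPenalty="0">h</charParams>
<charParams wordStart="false" wordPenalty="0">e</charParams>
<charParams wordStart="false"> </charParams>
<charParams wordStart="true" wordPenalty="40">c</charParams>
<charParams wordStart="false" wordPenalty="40">a</charParams>
<charParams wordStart="false" wordPenalty="40">t</charParams>
</line>"""


def mark_doubtful(xml_text, threshold=10):
    """Rebuild the text, wrapping words whose penalty exceeds threshold."""
    out = []
    word, penalty = "", 0

    def flush():
        # Emit the word collected so far, marked red if it looks doubtful.
        nonlocal word, penalty
        if word:
            text = escape(word)
            if penalty > threshold:
                text = '<span style="color:red">' + text + "</span>"
            out.append(text)
        word, penalty = "", 0

    for el in ET.fromstring(xml_text).iter():
        # Compare local names so a namespaced document also works.
        if el.tag.rsplit("}", 1)[-1] != "charParams":
            continue
        ch = el.text or " "
        if ch.isspace():
            flush()
            continue
        if el.get("wordStart") == "true":
            flush()
        word += ch
        penalty = max(penalty, int(el.get("wordPenalty", "0")))
    flush()
    return " ".join(out)


if __name__ == "__main__":
    print(mark_doubtful(SAMPLE))
```

A bot could run something like this over the abbyy.gz of an IA work and upload
the result as the OCR text of each page.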
> > >
> > > Now, I'm waiting for Aarti's work; as soon as a VisualEditor for nsPage
> > > runs, it will be possible to extract text by bot from abbyy.gz (if the
> > > work comes from IA) and to upload such text as OCR.
> > >
> > > Alex
> > >
> > >
> > >
> > > 2013/7/16 David Cuenca <[email protected]>
> > >
> > >> Hi Aubrey,
> > >> Thanks for the heads-up, I have CC'ed Sébastien from fr-ws, he worked
> > >> on the djvu text extraction/merging and he was interested in
> > >> following-up on that. Maybe he has some fresh ideas about it.
> > >>
> > >> Micru
> > >>
> > >> On Tue, Jul 16, 2013 at 10:24 AM, Andrea Zanni <[email protected]> wrote:
> > >>
> > >>> Hi David, Aarti, thibaud and Tpt,
> > >>> please look at this thread:
> > >>>
> > >>> http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
> > >>> especially the last message.
> > >>>
> > >>> It seems George Orwell III knows his stuff about Djvu and the Proofread
> > >>> extension, and it's probably worth digging into this "layer text" djvu
> > >>> thing.
> > >>>
> > >>> Even if I might dream of an ideal solution (a "layered structure" for
> > >>> wikisource, in which text can be marked up several times in different
> > >>> layers), that is probably very far away.
> > >>>
> > >>> But it's still important to pave the way for further improvements, I
> > >>> guess: losing all the information from a formatted, mapped IA djvu is
> > >>> not a good thing to do, IMHO.
> > >>> And the Visual Editor could help us, in the future, to keep some of
> > >>> that information (italics, bold, etc.)
> > >>>
> > >>> I know Aarti spoke with Alex about abbyy.xml: is it possible to do
> > >>> something with it?
> > >>>
> > >>> Aubrey
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Etiamsi omnes, ego non
> > >> _______________________________________________
> > >> Wikisource-l mailing list
> > >> [email protected]
> > >> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
> > >>
> > >>
> > >
> >
> >
> > --
> > Etiamsi omnes, ego non
> > _______________________________________________
> > MediaWiki-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
> >



-- 
Etiamsi omnes, ego non