Re: [MediaWiki-l] [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu

Brion Vibber Wed, 17 Jul 2013 13:23:04 -0700

Yeah, Linus is kind of an asshole too. I don't see that as something to
emulate.


-- brion


On Wed, Jul 17, 2013 at 1:10 PM, David Cuenca <[email protected]> wrote:

> Now that you mention it...
>
> http://linux.slashdot.org/story/13/07/15/2316219/kernel-dev-tells-linus-torvalds-to-stop-using-abusive-language
>
> Micru
>
> On Wed, Jul 17, 2013 at 11:36 AM, Brion Vibber <[email protected]> wrote:
>
> > I'm not sure his attitude will encourage people to work with him to his
> > specifications.
> >
> > -- brion
> >
> >
> >
> >
> > On Wed, Jul 17, 2013 at 8:12 AM, David Cuenca <[email protected]> wrote:
> >
> > > I'm forwarding this message by George Orwell III on en-ws [1]. I think it
> > > is extremely important as it offers an insight about what is wrong with
> > > Djvu handling on Wikisource.
> > >
> > >
> > > "We/you are losing the X-min, Y-min, X-Max & Y-max (mapping coordinates)
> > > because the original PHP contributing a-hole for the DjVu routine on our
> > > servers never bothered to finish the part where the internal DjVu text
> > > layer is converted to a (coordinate rich) XML file using the existing
> > > DjVuLibre software package because, at the time, the software had issues.
> > >
> > > "That faulty DjVuLibre version was the equivalent of 4,317 versions ago and
> > > the issue has long been fixed now EXCEPT that the .DTD file needed to base
> > > the plain-text to XML conversion on still has the wrong 'folder path' on
> > > local DjVuLibre installs (if this is true on server installs as well, I
> > > cannot say for sure). Once I copied the folder to the [wrong] folder path,
> > > I was able to generate the XMLs all day long. These XMLs are just like the
> > > ones IA generates during their process (in addition to the XML that ABBYY
> > > generates for them).
> > >
> > > "So it's not that we as a community decided not to follow through with
> > > (coordinate rich) XML generation, but that we got stuck with the plain-text
> > > dump workaround due to a DjVuLibre problem that no longer exists. Plus, the
> > > guy who created the beginnings of this fabulous disaster was like a tick with
> > > an attention span deficit and moved on to conjuring up some other blasted
> > > thing or another instead of following up on his own workaround & finishing
> > > the XML coding portion once the DjVuLibre glitch was fixed." -- 15:16, 15 July
> > > 2013 (UTC)
> > >
> > >
> > > [1] http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
> > >
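To make the point about the lost X-min, Y-min, X-max and Y-max values concrete, here is a minimal sketch of reading word boxes from the lisp-like text layer that DjVuLibre's `djvused` prints with `print-txt`. It assumes the usual `(word xmin ymin xmax ymax "text")` shape; the `Word` type, the `parse_words` helper and the sample data are hypothetical names made up for illustration, and only simple backslash escapes are handled.

```python
import re
from typing import List, NamedTuple


class Word(NamedTuple):
    """One word of the hidden text layer with its bounding box (hypothetical type)."""
    xmin: int
    ymin: int
    xmax: int
    ymax: int
    text: str


# Matches: (word xmin ymin xmax ymax "text"), allowing \" and \\ inside the string.
WORD_RE = re.compile(
    r'\(word\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"((?:[^"\\]|\\.)*)"\s*\)'
)


def parse_words(sexpr: str) -> List[Word]:
    """Collect every word box found in a djvused-style text dump."""
    words = []
    for m in WORD_RE.finditer(sexpr):
        xmin, ymin, xmax, ymax = (int(g) for g in m.groups()[:4])
        # Undo only the two simplest escapes; octal escapes are left untouched.
        text = re.sub(r'\\(["\\])', r'\1', m.group(5))
        words.append(Word(xmin, ymin, xmax, ymax, text))
    return words


# Made-up sample in the assumed layout, not taken from a real file.
SAMPLE = '''(page 0 0 2550 3300
 (line 260 2927 980 2978
  (word 260 2927 456 2978 "Hello")
  (word 480 2927 980 2978 "world")))'''

if __name__ == "__main__":
    for w in parse_words(SAMPLE):
        print(w.text, (w.xmin, w.ymin, w.xmax, w.ymax))
```

A plain-text dump keeps only the `text` field of each entry; the four integers are exactly what the message above says is being thrown away.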
> > > On Wed, Jul 17, 2013 at 6:57 AM, Alex Brollo <[email protected]>
> > > wrote:
> > >
> > > > Just a brief comment about the djvu text layer, using IA files to dig
> > > > deeper into the topic.
> > > >
> > > > FineReader OCR stores incredibly detailed information in a proprietary
> > > > format; then, various FineReader versions export part of this
> > > > extremely rich set of information into different outputs, one of them
> > > > being the djvu text layer. It's worth noting that even if any information
> > > > stored in the djvu text layer can be extracted and used, the set of
> > > > information wrapped into the djvu text layer (in either lisp-like or
> > > > xml format) is only a minor subset of the original OCR information.
> > > >
> > > > Anyone interested in getting much more information can find it in the
> > > > abbyy.xml output; Internet Archive provides it as abbyy.gz in the list
> > > > of exportable files. It's a very heavy and complex xml structure, but it
> > > > is possible to parse it and to extract from it any information wrapped
> > > > into the djvu text layer and much more; most interestingly, wordPenalty,
> > > > that is, a word-by-word summary of the degree of uncertainty of the OCR
> > > > recognition of the whole word.
> > > >
> > > > We (Aarti and I) are digging into this mess, with fast preliminary
> > > > results; in [[it:w:Utente:Alex brollo/Sandbox]] you can see some brief
> > > > pieces of text extracted from abbyy.gz, where doubtful words (in the
> > > > opinion of the OCR software) are red. They can be easily managed by
> > > > VisualEditor, since they come from a simple span tag.
> > > >
> > > > Now I'm waiting for Aarti's work; as soon as a VisualEditor for nsPage
> > > > runs, it will be possible to extract text by bot from abbyy.gz (if the
> > > > work comes from IA) and to upload such text as OCR.
> > > >
> > > > Alex
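The approach Alex describes (read the ABBYY XML, rebuild words, mark the doubtful ones with a span) could look roughly like the sketch below. It assumes each character sits in a `charParams` element carrying `wordStart` and `wordPenalty` attributes; the helper names, the threshold of 30, the placeholder namespace and the sample XML are all hypothetical and are not taken from Alex's or Aarti's actual code.

```python
import html
import xml.etree.ElementTree as ET
from typing import List, Tuple


def words_with_penalty(xml_text: str) -> List[Tuple[str, int]]:
    """Rebuild words from per-character elements, keeping the highest penalty seen."""
    root = ET.fromstring(xml_text)
    words: List[Tuple[str, int]] = []
    current, penalty = "", 0
    for el in root.iter():
        # Compare the tail of the tag so any XML namespace prefix is ignored.
        if not el.tag.endswith("charParams"):
            continue
        ch = el.text or ""
        # A word ends at whitespace or when the next character starts a new word.
        if el.get("wordStart") == "true" or ch.isspace():
            if current:
                words.append((current, penalty))
            current, penalty = "", 0
        if not ch or ch.isspace():
            continue
        current += ch
        penalty = max(penalty, int(el.get("wordPenalty", "0")))
    if current:
        words.append((current, penalty))
    return words


def mark_doubtful(words: List[Tuple[str, int]], threshold: int = 30) -> str:
    """Wrap words whose penalty exceeds the (arbitrary) threshold in a red span."""
    out = []
    for text, penalty in words:
        safe = html.escape(text)
        if penalty > threshold:
            safe = '<span style="color:red">' + safe + "</span>"
        out.append(safe)
    return " ".join(out)


# Made-up sample; "urn:example:abbyy" is a placeholder namespace, not the real one.
SAMPLE = """<document xmlns="urn:example:abbyy">
<page><block><text><par><line>
<charParams wordStart="true" wordPenalty="0">T</charParams>
<charParams wordStart="false" wordPenalty="0">o</charParams>
<charParams wordStart="false" wordPenalty="0"> </charParams>
<charParams wordStart="true" wordPenalty="36">b</charParams>
<charParams wordStart="false" wordPenalty="36">c</charParams>
</line></par></text></block></page></document>"""

if __name__ == "__main__":
    print(mark_doubtful(words_with_penalty(SAMPLE)))
```

For a real IA download, the text could be read with `gzip.open(path, "rt", encoding="utf-8").read()` before being passed in. Files of that size would probably be better served by `ET.iterparse`, which this sketch skips for brevity.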
> > > >
> > > >
> > > >
> > > > 2013/7/16 David Cuenca <[email protected]>
> > > >
> > > >> Hi Aubrey,
> > > > >> Thanks for the heads-up, I have CC'ed Sébastien from fr-ws, he worked on
> > > > >> the djvu text extraction/merging and he was interested in following-up on
> > > > >> that. Maybe he has some fresh ideas about it.
> > > >>
> > > >> Micru
> > > >>
> > > > >> On Tue, Jul 16, 2013 at 10:24 AM, Andrea Zanni <[email protected]> wrote:
> > > >>
> > > >>> Hi David, Aarti, thibaud and Tpt,
> > > >>> please look at this thread:
> > > >>>
> > > >>>
> > >
> >
> http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
> > > >>> especially the last message.
> > > >>>
> > > >>> It seems George Orwell III knows his stuff about Djvu and Proofread
> > > >>> extension,
> > > >>> and it's probably worth digging into this "layer text" djvu thing.
> > > >>>
> > > > >>> Even if I might dream of an ideal solution (a "layered structure" for
> > > > >>> wikisource, in which text can be marked up several times in different
> > > > >>> layers), that is probably very far away.
> > > >>>
> > > > >>> But it's still important to pave the way for further improvements, I
> > > > >>> guess:
> > > > >>> losing all the information from a formatted, mapped IA djvu is not a
> > > > >>> good thing to do, IMHO.
> > > > >>> And the Visual Editor could help us, in the future, to keep some of that
> > > > >>> information (italics, bold, etc.)
> > > >>>
> > > >>> I know Aarti spoke with Alex about abbyy.xml: is it possible to do
> > > >>> something with it?
> > > >>>
> > > >>> Aubrey
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Etiamsi omnes, ego non
> > > >> _______________________________________________
> > > >> Wikisource-l mailing list
> > > >> [email protected]
> > > >> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
> > > >>
> > > >>
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Etiamsi omnes, ego non
> > > _______________________________________________
> > > MediaWiki-l mailing list
> > > [email protected]
> > > https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
> > >
> >
>
>
>
> --
> Etiamsi omnes, ego non
>
_______________________________________________
MediaWiki-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
