Package: referencer
Version: 1.0.2-1
Severity: normal

I have a number of PDFs in which a DOI appears in the text, but
Referencer cannot extract it correctly.  There is no true metadata in the
PDF, so Referencer falls back to extracting text from the page body.  The
complete BT/ET block containing the DOI is at the end of this message, but
the key part is this:

[(doi:10.1016/)14.5(S)-95.3(0)]TJ
6.3307 0 TD
0.0983 Tc
[(010-0277\(02\)00)-6.3(235-4)]TJ
ET

This causes libpoppler to feed this text to BibData::guessDoi():

  doi:10.1016/S 0 0 1 0 - 0 2 7 7 ( 0 2 ) 0 0 2 3 5 - 4\n

"10.1016/S" is what Referencer records as the DOI.  The correct DOI is the above
string with the "doi:" prefix and all the spaces removed, i.e.
10.1016/S0010-0277(02)00235-4 .
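
To show that the glyph data in the content stream is intact and that the
spaces only appear during layout-based text extraction, here is a minimal
Python sketch that concatenates the string operands of the two TJ arrays
quoted above and drops the numeric kerning adjustments.  It is not how
libpoppler extracts text; tj_strings() is a hypothetical helper, and it
handles only the \( \) \\ escapes that occur in this example.

```python
import re

# One PDF literal string "( ... )", allowing backslash escapes such as \( and \).
# Assumption: no unescaped nested parentheses, which PDF permits but this
# sketch does not handle.
LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")


def tj_strings(array_src):
    """Join the string operands of a TJ array, ignoring kerning numbers."""
    parts = []
    for m in LITERAL.finditer(array_src):
        body = m.group(0)[1:-1]  # strip the enclosing parentheses
        # Unescape \( \) and \\ only; octal and other escapes are left alone.
        parts.append(re.sub(r"\\([()\\])", r"\1", body))
    return "".join(parts)


# The two TJ arrays from the BT/ET block in this report.
first = r"[(doi:10.1016/)14.5(S)-95.3(0)]TJ"
second = r"[(010-0277\(02\)00)-6.3(235-4)]TJ"
joined = tj_strings(first) + tj_strings(second)
print(joined)
```

The joined string contains no spaces, so the spaces in the text handed to
guessDoi() are introduced by the extraction step, presumably from the
-95.3 adjustment and the 0.0983 Tc character spacing.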

Unfortunately, I don't have any concrete suggestion for how guessDoi() could
do a better job in this case without also screwing up other situations (where
random text appears immediately after the DOI, separated only by a space).
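
One possible heuristic, offered only as a sketch and not as a tested fix:
ordinary text following a DOI consists of multi-character words, whereas
the broken case is a run of single characters each separated by one space.
The Python below is an illustration, not Referencer's C++ code; the
function name guess_doi, the DOI pattern, and the minimum run length of 3
are all assumptions.

```python
import re

# Assumed DOI shape: "10.", a registrant code of 4+ digits, "/", then a suffix.
DOI_PREFIX = re.compile(r"10\.\d{4,}/\S+")

# A run of at least 3 " c" pairs, where each c is a single non-space
# character followed by whitespace or end of text.  The threshold of 3 is an
# arbitrary guard against a lone one-letter word such as "A" after the DOI.
SPACED_RUN = re.compile(r"(?: \S(?=\s|$)){3,}")


def guess_doi(text):
    """Return a DOI guess, rejoining a letter-spaced tail if one follows."""
    m = DOI_PREFIX.search(text)
    if m is None:
        return None
    doi = m.group(0)
    tail = SPACED_RUN.match(text[m.end():])
    if tail:
        doi += tail.group(0).replace(" ", "")
    return doi


print(guess_doi("doi:10.1016/S 0 0 1 0 - 0 2 7 7 ( 0 2 ) 0 0 2 3 5 - 4\n"))
print(guess_doi("doi:10.1000/abc123 Published by Foo"))
```

This can still go wrong if a DOI is followed by three or more genuine
one-character tokens (for example initials), and it does nothing for DOIs
whose first part is itself letter-spaced.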

-- System Information:
Debian Release: lenny/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 2.6.18-4-686 (SMP w/2 CPU cores)
Locale: LANG=en_US, LC_CTYPE=en_US (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages referencer depends on:
ii  libart-2.0-2               2.3.19-3      Library of functions for 2D graphi
ii  libatk1.0-0                1.18.0-2      The ATK accessibility toolkit
ii  libbonobo2-0               2.18.0-2      Bonobo CORBA interfaces library
ii  libbonoboui2-0             2.18.0-5      The Bonobo UI library
ii  libboost-regex1.33.1       1.33.1-10     regular expression library for C++
ii  libc6                      2.5-7         GNU C Library: Shared libraries
ii  libcairo2                  1.4.6-1       The Cairo 2D vector graphics libra
ii  libfontconfig1             2.4.2-1.2     generic font configuration library
ii  libgcc1                    1:4.1.2-6     GCC support library
ii  libgconf2-4                2.18.0.1-3    GNOME configuration database syste
ii  libgconfmm-2.6-1c2         2.14.2-1      C++ wrappers for GConf (shared lib
ii  libglade2-0                1:2.6.0-4     library to load .glade files at ru
ii  libglademm-2.4-1c2a        2.6.2-2       C++ wrappers for libglade2 (shared
ii  libglib2.0-0               2.12.12-1     The GLib library of C routines
ii  libglibmm-2.4-1c2a         2.12.7-1      C++ wrapper for the GLib toolkit (
ii  libgnome-keyring0          0.8.1-2       GNOME keyring services library
ii  libgnome-vfsmm-2.6-1c2a    2.14.0-1      C++ wrappers for GnomeVFS (shared 
ii  libgnome2-0                2.18.0-4      The GNOME 2 library - runtime file
ii  libgnomecanvas2-0          2.14.0-2      A powerful object-oriented display
ii  libgnomecanvasmm-2.6-1c2a  2.14.0-1      C++ wrappers for libgnomecanvas2 (
ii  libgnomemm-2.6-1c2         2.14.0-1      C++ wrappers for libgnome (shared 
ii  libgnomeui-0               2.18.1-2      The GNOME 2 libraries (User Interf
ii  libgnomeuimm-2.6-1c2a      2.14.0-1      C++ wrappers for libgnomeui (share
ii  libgnomevfs2-0             1:2.18.1-2    GNOME Virtual File System (runtime
ii  libgtk2.0-0                2.10.12-1     The GTK+ graphical user interface 
ii  libgtkmm-2.4-1c2a          1:2.8.8-1     C++ wrappers for GTK+ 2.4 (shared 
ii  libice6                    1:1.0.3-2     X11 Inter-Client Exchange library
ii  liborbit2                  1:2.14.7-0.1  libraries for ORBit2 - a CORBA ORB
ii  libpango1.0-0              1.16.4-1      Layout and rendering of internatio
ii  libpoppler0c2              0.4.5-5.1     PDF rendering library
ii  libpopt0                   1.10-3        lib for parsing cmdline parameters
ii  libsigc++-2.0-0c2a         2.0.17-2      type-safe Signal Framework for C++
ii  libsm6                     1:1.0.2-2     X11 Session Management library
ii  libstdc++6                 4.1.2-6       The GNU Standard C++ Library v3
ii  libx11-6                   2:1.0.3-7     X11 client-side library
ii  libxcursor1                1:1.1.8-2     X cursor management library
ii  libxext6                   1:1.0.3-2     X11 miscellaneous extension librar
ii  libxfixes3                 1:4.0.3-2     X11 miscellaneous 'fixes' extensio
ii  libxi6                     1:1.0.1-4     X11 Input extension library
ii  libxinerama1               1:1.0.2-1     X11 Xinerama extension library
ii  libxml2                    2.6.28.dfsg-1 GNOME XML library
ii  libxrandr2                 2:1.2.1-1     X11 RandR extension library
ii  libxrender1                1:0.9.2-1     X Rendering Extension client libra

referencer recommends no packages.

-- no debconf information

BT
7.9702 0 0 7.9702 340.5542 597.3164 Tm
[(www.elsev)11.4(ier.com/locate/co)8.9(gnit)]TJ
-32.0589 -63.7337 TD
[(0010-0277)15.5(/03/$)-299.5(-)-300.1(see)-293(front)-300.7(matter)]TJ
/F4 1 Tf
13.9915 0 TD
(\001)Tj
/F1 1 Tf
1.1666 0 TD
[(2003)-297.5(Elsevier)-289.8(Science)-293.2(B.V.)-299.7(All)-299.1(rights)-294.9(reserved.)]TJ
-15.1581 -1.2448 TD
[(doi:10.1016/)14.5(S)-95.3(0)]TJ
6.3307 0 TD
0.0983 Tc
[(010-0277\(02\)00)-6.3(235-4)]TJ
ET

