Bug#423586: referencer: PDF-scraping for DOIs sometimes cuts them off in the middle

Michael Banck Thu, 19 Jul 2007 12:01:48 -0700

Hi Zack,

On Sat, May 12, 2007 at 08:00:56PM -0700, Zack Weinberg wrote:
> I have a number of PDFs with DOIs appearing in the text, but that
> Referencer cannot properly scrape out.  There is no true metadata in the
> PDF, so it's going for text extraction from the page body.  The complete
> BT/ET block containing the DOI is at the end of this message, but the
> key bit is this:
> 
> [(doi:10.1016/)14.5(S)-95.3(0)]TJ
> 6.3307 0 TD
> 0.0983 Tc
> [(010-0277\(02\)00)-6.3(235-4)]TJ
> ET
> 
> This causes libpoppler to feed this text to BibData::guessDoi():
> 
>   doi:10.1016/S 0 0 1 0 - 0 2 7 7 ( 0 2 ) 0 0 2 3 5 - 4\n
> 
> "10.1016/S" is what Referencer records as the DOI.  The correct DOI is the
> above string with all the spaces taken out, i.e. 10.1016/S0010-0277(02)00235-4 .
> 
> Unfortunately, I don't have any concrete suggestion for how guessDoi() could
> do a better job in this case without also screwing up other situations (where
> random text appears immediately after the DOI, separated only by a space).


Are you still using Referencer?  Can you verify that this issue is still
present?  If so, I would like to forward it to the author so that he can
look into it.

Sorry for the late reply,

Michael

