date field type problem

2009-09-02 Thread Peter Kiraly

Hi Solr users,

I have a lot of dates from a library catalog in a format that is
not compatible with solr.DateField. I wrote a new
definition inside the solrconfig.xml, which creates
e.g. 1991-01-01T00:00:01Z from the input string '[c1991.]'.
It works fine when I try it with typical values
in http://localhost:8983/solr/admin/analysis.jsp,
but it always throws an exception when I try to index
the records.


 
   
   
   
   
   
   
   
   
 


It is more than possible that I misunderstand something. What I
would like to do is to 'normalize' the input data somehow, and I thought
that it would be more effective to do this on the Solr side than in the client.

Do you have any advice on how I may continue?

Péter



Re: date field type problem

2009-09-02 Thread Peter Kiraly

Hi,

the exception I received:

SEVERE: org.apache.solr.common.SolrException: Error while creating field 'date_df{type=trickyDate,properties=indexed,stored,omitNorms,omitTf,multiValued,sortMissingLast}' from value 'c1991.'
   at org.apache.solr.schema.FieldType.createField(FieldType.java:190)
   at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
   at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:244)
   at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:59)
   at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
   at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
   at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
   at org.mortbay.jetty.Server.handle(Server.java:285)
   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
   at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
   at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
   at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.solr.common.SolrException: Invalid Date String:'c1991.'
   at org.apache.solr.schema.DateField.parseMath(DateField.java:167)
   at org.apache.solr.schema.DateField.toInternal(DateField.java:138)
   at org.apache.solr.schema.FieldType.createField(FieldType.java:188)
   ... 27 more

My expectation is that a field type behaves like this:
0) I give a field type as the storage type
1) I give it a string
2) with tokenizers and filters I parse it into a given form
3) Solr handles it as the given type

for example:
0) I set the field type as "solr.DateField"
1) input string is "1991."
2) the analyzer creates "1991-01-01T00:00:00Z"
3) and as it is the normal input form of the date type, Solr
  indexes it.

It seems that the input string ("1991.") must match
solr.DateField's expectation, and not the output
("1991-01-01T00:00:00Z").

So the question is: is there a solution in which I can
"preprocess" the inputs, or is it only doable on the client
side?
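
The trace above shows DateField.toInternal being called with the raw
value 'c1991.', so one way to proceed is to normalize the value before
it is sent to Solr. A minimal client-side sketch in Python follows. It
assumes that every usable catalog date contains exactly one four-digit
year and that midnight on January 1st is an acceptable default; the
function name normalize_catalog_date is hypothetical and not part of
Solr.

```python
import re
from typing import Optional

# Assumption: the first standalone four-digit number in the string is the year.
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def normalize_catalog_date(raw: str) -> Optional[str]:
    """Turn a catalog date such as '[c1991.]' into '1991-01-01T00:00:00Z'.

    Returns None when no four-digit year can be found, so the caller can
    decide whether to skip the field or log the record.
    """
    match = YEAR_RE.search(raw)
    if match is None:
        return None
    # Month, day and time are defaults chosen for this sketch only.
    return f"{match.group(1)}-01-01T00:00:00Z"


if __name__ == "__main__":
    for value in ["[c1991.]", "c1991.", "1991.", "n.d."]:
        print(repr(value), "->", normalize_catalog_date(value))
```

Returning None for unparseable values keeps the decision about bad
records (skip, log, or index without a date) in the indexing client.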

Péter


From: "Grant Ingersoll" 
Subject: Re: date field type problem




What's the exception?


On Sep 2, 2009, at 3:00 AM, Peter Kiraly wrote:



--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search



Re: How to deal with hyphens in PDF documents?

2009-05-27 Thread Peter Kiraly

Hi,

My solution to this problem in Lucene was to modify Lucene's
parser. There was a file in Lucene, not written in Java
(StandardTokenizer.jj), which defines what a token is, and the
types of tokens. My rule was that a soft or hard hyphen at the end
of a line denotes a word which continues at the beginning of the
next line. I used iText instead of PDFBox, because PDFBox was
ignoring hyphens at the end of the line. That was years ago. Now the
file is called StandardTokenizerImpl.jflex. I don't know how to solve
it in Solr, because it incorporates several Lucene jars, and it is not
clear to me how to hack only one jar.
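
The rule described above (a soft or hard hyphen at the end of a line
means the word continues on the next line) can also be applied to the
extracted text before it reaches the tokenizer. The following is only a
sketch in Python, not the original grammar change; it assumes the
extractor preserves line breaks and hyphens, and the function name is
hypothetical.

```python
import re

# A hard hyphen (U+002D) or soft hyphen (U+00AD) that directly follows a
# word character, ends the line, and is followed by a word character on
# the next line.
LINE_END_HYPHEN = re.compile(r"(?<=\w)[-\u00AD][ \t]*\r?\n[ \t]*(?=\w)")


def join_hyphenated_lines(text: str) -> str:
    """Remove line-end hyphens so that 'fo-\\ncus' becomes 'focus'."""
    return LINE_END_HYPHEN.sub("", text)


if __name__ == "__main__":
    print(join_hyphenated_lines("de fo-\ncus ligt"))
    print(join_hyphenated_lines("de fo\u00ad\ncus ligt"))
```

Note that this rule also joins genuine compounds that happen to break
at their hyphen, so a word like "well-\nknown" would lose its hyphen.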

Király Péter
http://extensiblecatalog.org
http://tesuji.eu


- Original Message - 
From: "Bauke Jan Douma" 

To: 
Sent: Wednesday, May 27, 2009 1:55 AM
Subject: Re: How to deal with hyphens in PDF documents?



Otis Gospodnetic wrote on 05/26/2009 11:06 PM:

Hello,

You really want to fix this before indexing, so you don't index garbage.
One way to fix this is to make use of dictionaries while looking at two
tokens at a time (current + next).  Then you might see that neither "fo"
nor "cus" is in the dictionary, but that "focus" is, so you might
concatenate the tokens and output just one "focus" token.  You'd do
something similar with "fo-" and "cus".
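
The two-token lookahead described above could look roughly like the
following Python sketch. It assumes the dictionary is available as an
in-memory set of lowercase words; the function name and the sample
dictionary are hypothetical.

```python
from typing import Iterable, List, Set


def merge_split_tokens(tokens: Iterable[str], dictionary: Set[str]) -> List[str]:
    """Join adjacent tokens when neither is a known word but their
    concatenation is. A trailing hard or soft hyphen on the first token
    is dropped before the lookup."""
    toks = list(tokens)
    out: List[str] = []
    i = 0
    while i < len(toks):
        current = toks[i]
        if i + 1 < len(toks):
            nxt = toks[i + 1]
            head = current.rstrip("-\u00ad")
            joined = head + nxt
            if (
                head.lower() not in dictionary
                and nxt.lower() not in dictionary
                and joined.lower() in dictionary
            ):
                out.append(joined)
                i += 2  # both tokens consumed
                continue
        out.append(current)
        i += 1
    return out


if __name__ == "__main__":
    words = {"de", "focus", "ligt"}  # toy dictionary for illustration
    print(merge_split_tokens(["de", "fo", "cus", "ligt"], words))
    print(merge_split_tokens(["de", "fo-", "cus", "ligt"], words))
```

Requiring that both halves are unknown keeps the check conservative:
two real words that happen to form a third one are left alone.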


 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 

From: Bauke Jan Douma 
To: solr-user@lucene.apache.org
Sent: Tuesday, May 26, 2009 4:42:39 PM
Subject: How to deal with hyphens in PDF documents?

Good day, fellow solr users,

Fair warning:
-------------
I am utterly new to solr and to this mailing list (and to lucene for
that matter).
I have been playing with solr for about two weeks.


Goal:
-----
I would like to index several thousand OCR'd newspaper articles, stored
as PDF documents. I have also been fiddling with PDFBox (tika), and with
pdftotext in that regard.
Ultimately, I would like to present search results having a URL to the
original PDF, which when clicked, opens up the PDF with the search terms
highlighted.


Problem: hyphens (using PDFBox):
--------------------------------
Said newspaper articles are in Dutch. Now that language has the
peculiarity that hyphenated words at EOL are a very common occurrence.

The OCR'ed PDFs contain both soft and hard hyphens. Let's take the word
'focus' for example (focus in English), which is hyphenated as
'fo - cus', neither part of which is a Dutch word by the way.

Currently, in the XML search-results, using tika PDFBox, this can occur
as:

fo- cus (when the original PDF has a hard hyphen here, U+002D)
fo cus  (when the original PDF has a soft hyphen here, U+00AD)

The problem is that neither of these would be found with a search term
of 'focus'.
I've been googling for this for the past few days, but haven't seen this
issue addressed anywhere. I must be overlooking something very obvious.


Alternative? (using pdftotext):
-------------------------------
I was thinking of an alternative: using pdftotext to extract the
content, run it through some custom filter to unhyphenate hyphenated
words, and index these separately, besides the indexed original text.
That way a search for those terms would yield results.

With my limited knowledge and experience with solr however, presently I
see that as shifting the same problem more or less, namely to where I
want to present a clickable URL into the original PDF, with a
search-string obtained from the solr search results (to highlight the
term in the PDF).
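
One possible way to keep the link between the unhyphenated text and the
original is to record, for every character of the cleaned text, its
position in the extracted text. The Python sketch below illustrates
that idea only; it assumes line breaks and hyphens survive extraction,
it does not address how a PDF viewer performs its own search, and all
names are hypothetical.

```python
import re
from typing import List, Tuple

# Hard (U+002D) or soft (U+00AD) hyphen at the end of a line, inside a word.
LINE_END_HYPHEN = re.compile(r"(?<=\w)[-\u00AD][ \t]*\r?\n[ \t]*(?=\w)")


def dehyphenate_with_offsets(text: str) -> Tuple[str, List[int]]:
    """Return the cleaned text and, per cleaned character, its index in `text`."""
    chars: List[str] = []
    offsets: List[int] = []
    pos = 0
    for match in LINE_END_HYPHEN.finditer(text):
        for i in range(pos, match.start()):
            chars.append(text[i])
            offsets.append(i)
        pos = match.end()  # skip the hyphen and the line break
    for i in range(pos, len(text)):
        chars.append(text[i])
        offsets.append(i)
    return "".join(chars), offsets


def original_span(offsets: List[int], start: int, end: int) -> Tuple[int, int]:
    """Map the half-open range [start, end) of the cleaned text back to `text`."""
    return offsets[start], offsets[end - 1] + 1


if __name__ == "__main__":
    raw = "de fo-\ncus ligt"
    clean, offsets = dehyphenate_with_offsets(raw)
    hit = clean.index("focus")
    begin, stop = original_span(offsets, hit, hit + len("focus"))
    print(repr(raw[begin:stop]))
```

With such a map, a hit found in the cleaned text can be translated back
to the exact hyphenated fragment as it appears in the extracted text.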


Any thoughts or pointers would be appreciated.
Thanks all in advance for your time.

Regards,
Bauke Jan Douma






Hello Otis,

Understood. But wouldn't that lead to the problem that, when using the
search result (taking it from the highlighting result in solr -- forgot
to mention), that fragment will not be found in the PDF, since the PDF
contains the hyphenated word?

Oops.  Just now I discovered that searching multiple-word strings that
cross multiple lines in a PDF doesn't even work to begin with, even when
there are no hyphens (evince on Ubuntu -- don't know if that works in
Adobe Acrobat).  That looks like an unsolved problem.

Thank you for your input.

Bauke Jan