Using SolrJ with Tika

2009-09-02 Thread Angel Ice
Hi everybody.

I hope this is the right place for questions; apologies if not.

I'm trying to index rich documents (PDF, MS Office documents, etc.) in
Solr/Lucene.
I have seen a few examples explaining how to use Tika to do this, but most
of them use curl to send documents to Solr, or an HTML form POST with a file
input.
I'd like to do it entirely in Java.
Is there a way to use SolrJ to index the documents with Solr's
ExtractingRequestHandler, or at least to get the extracted XML back
(with the extract.only option)?

Many thanks.

Laurent.



  

Re : Using SolrJ with Tika

2009-09-02 Thread Angel Ice
Hi Rajan.

As mentioned in my message, I don't want to use curl to post documents, and I
can't use an HTTP POST (the document has already been posted to my JEE webapp
for other purposes). All I can use is Java.

In fact, I'd like the user to post the document to my webapp with an HTML POST
(it's a Struts 2 webapp).  --This is OK.
Then my webapp uses the document for its own purposes. --This is OK.
And finally the webapp sends the document to Solr in order to index it.  --This
is not OK.

That's what I am doing with the other content I index that involves no rich
document, just some simple text fields, like daily articles.
In this case, once my webapp has finished its job on the article (creating,
saving ...), I index the title/author/... like this:
SolrInputDocument doc = new SolrInputDocument();
doc.addField("art_title", "foo");
...
solrServer.add(doc);
solrServer.commit();

I'm looking for a way to do the same thing for rich documents, once my webapp
has finished its job with the document.

Regards,

Laurent






From: rajan chandi
To: solr-user@lucene.apache.org
Sent: Wednesday, 2 September 2009, 16:13:22
Subject: Re: Using SolrJ with Tika

Laurent,

Check out Solr 1.4.

You can download the trunk and build it on your box.

Solr 1.4 does this out of the box. No configuration required.

You can use HTTP POST to post the document using a command-line utility like
curl, and the PDF/Word/RTF/PPT/XLS etc. will be indexed. We tested this last
week.

Tika is already included in Solr 1.4.

Cheers
Rajan
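
Editor's note: the POST that curl performs can also be issued from server-side
Java code, which may be closer to what Laurent asked for. The sketch below uses
only the JDK. The handler URL, the parameter names (literal.id, commit) and the
class and method names are assumptions for illustration; parameter names in
particular have changed between trunk builds and releases, so check them
against the Solr version in use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Map;

/** Sketch: send a document's bytes to Solr's extracting handler from plain Java. */
final class ExtractPost {

    /** URL-encodes one query-string component as UTF-8. */
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Appends the given parameters, in iteration order, to the handler URL. */
    static String buildUrl(String handlerUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(handlerUrl);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator).append(enc(p.getKey()))
               .append('=').append(enc(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    /** Streams the document as the raw POST body and returns the HTTP status code. */
    static int post(String url, String contentType, InputStream document)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", contentType);
            OutputStream out = conn.getOutputStream();
            try {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = document.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            } finally {
                out.close();
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

A webapp could call buildUrl with, for example,
"http://localhost:8983/solr/update/extract" and a map holding literal.id and
commit, then pass the uploaded file's InputStream to post with a content type
such as "application/pdf". SolrJ 1.4 also provides a ContentStreamUpdateRequest
class for sending files to a handler, which may be a better fit when a
SolrServer instance is already available.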




  

Re : Using SolrJ with Tika

2009-09-03 Thread Angel Ice
Hi

This is the solution I was testing.
I had some difficulties with AutoDetectParser, but I think it's the solution I
will use in the end.


Thanks for the advice anyway :)

Regards,

Laurent





From: Abdullah Shaikh
To: solr-user@lucene.apache.org
Sent: Thursday, 3 September 2009, 14:31:10
Subject: Re: Using SolrJ with Tika

Hi Laurent,

I am not sure if this is what you need, but you can extract the content from
the uploaded document (MS Office documents, PDF, etc.) using Tika and then send
it to Solr for indexing.

String CONTENT = extract the content using Tika (you can use
AutoDetectParser)

and then,

SolrInputDocument doc = new SolrInputDocument();
doc.addField("DOC_CONTENT", CONTENT);

solrServer.add(doc);
solrServer.commit();





  

wildcard searches

2009-10-05 Thread Angel Ice
Hi everyone,

I have a little question regarding the search engine when a wildcard character 
is used in the query.
Take the following example:

- I have indexed the word Hésitation (with an accent on the "e").
- The filters applied to the field that handles this word result in "esit"
being indexed: the silent H is removed (home-made filter), the accent is
removed too (IsoLatin1Filter), and the SnowballPorterFilter strips the "ation".

When I search for "hesitation", "esitation", "ésitation", etc., all is OK: the
document is returned.
But as soon as I use a wildcard, like "hésita*", the document is not returned.
In fact, I have to write the wildcard query so that its prefix matches the
indexed term exactly (for example "esi*").

Does the search engine apply the filters to the prefix that precedes the
wildcard? Or does it use this prefix verbatim?

Thanks for your help.

Laurent



  

Re : wildcard searches

2009-10-06 Thread Angel Ice
Hi.

Thanks for your answers, Christian and Avlesh.

But I don't understand what you mean by:
"If you want to enable wildcard queries, preserving the original token (while 
processing each token in your filter) might work."

Could you explain this point, please?

Laurent






From: Avlesh Singh
To: solr-user@lucene.apache.org
Sent: Monday, 5 October 2009, 20:30:54
Subject: Re: wildcard searches

Zambrano is right, Laurent. The analyzers for a field are not invoked for
wildcard queries. Your custom filter is not even getting executed at
query time.
If you want to enable wildcard queries, preserving the original token (while
processing each token in your filter) might work.

Cheers
Avlesh
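
Editor's note: the behaviour described here can be reproduced with a small
stand-alone simulation. This is a toy model, not Solr or Lucene code: analyze()
is a hypothetical stand-in for the filter chain from the original question
(lowercasing is assumed, and stripping "ation" stands in for the Snowball
stemmer), and the class and method names are invented for illustration.
Non-ASCII characters are written as Unicode escapes so the source compiles
regardless of file encoding.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Toy model of analyzed term queries versus unanalyzed wildcard queries. */
final class WildcardDemo {

    /** Lowercase, drop a leading 'h', strip accents, then strip a trailing "ation". */
    static String analyze(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.startsWith("h")) {
            t = t.substring(1); // stand-in for the home-made silent-H filter
        }
        // Decompose accented letters, then delete the combining marks.
        t = Normalizer.normalize(t, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        if (t.endsWith("ation")) {
            // Crude stand-in for the stemmer; real stemming is far more involved.
            t = t.substring(0, t.length() - "ation".length());
        }
        return t;
    }

    /** Ordinary term query: the query text is analyzed like the indexed text. */
    static boolean termMatches(String indexedTerm, String queryText) {
        return indexedTerm.equals(analyze(queryText));
    }

    /** Wildcard (prefix) query: the prefix is compared verbatim, with no analysis. */
    static boolean prefixMatches(String indexedTerm, String rawPrefix) {
        return indexedTerm.startsWith(rawPrefix);
    }
}
```

With the indexed term produced by analyze("H\u00e9sitation"), termMatches
succeeds for "hesitation", "esitation" and "\u00e9sitation", while
prefixMatches fails for the prefix "h\u00e9sita" and succeeds for "esi", which
mirrors what Laurent observed.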




  

Re : Re : wildcard searches

2009-10-06 Thread Angel Ice
Ah yes, got it.
But I'm not sure this will solve my problem,
because I'm also using the IsoLatin1 filter, which removes accented
characters.
So I will have the same problem with accented characters, because the original
token is not stored with this filter.

Laurent







From: Avlesh Singh
To: solr-user@lucene.apache.org
Sent: Tuesday, 6 October 2009, 10:41:56
Subject: Re: Re : wildcard searches

You are processing your tokens in the filter that you wrote. I am assuming
it is the first filter being applied and removes the character 'h' from
tokens. When you are doing that, you can preserve the original token in the
same field as well. Because as of now, you are simply removing the
character. Subsequent filters don't even know that there was an 'h'
character in the original token.

Since wildcard queries are not analyzed, the 'h' character in the query
"hésita*" does NOT get removed at query time. This means that unless the
original token was preserved in the field, it wouldn't find any matches.

Does this help?

Cheers
Avlesh
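
Editor's note: the idea of preserving the original token can be sketched in
the same toy style. This is not a Lucene TokenFilter; indexTerms() is a
hypothetical helper that emits the lowercased original next to the processed
form, and analyze() is an assumed stand-in for the filter chain from the
original question. Non-ASCII characters are written as Unicode escapes.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Toy model of a field that keeps the original token next to the processed one. */
final class PreserveOriginalDemo {

    /** Lowercase, drop a leading 'h', strip accents, then strip a trailing "ation". */
    static String analyze(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.startsWith("h")) {
            t = t.substring(1);
        }
        t = Normalizer.normalize(t, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        if (t.endsWith("ation")) {
            t = t.substring(0, t.length() - "ation".length());
        }
        return t;
    }

    /** Index time: emit the lowercased original token and the processed token. */
    static Set<String> indexTerms(String token) {
        Set<String> terms = new LinkedHashSet<String>();
        terms.add(token.toLowerCase(Locale.ROOT)); // preserved original
        terms.add(analyze(token));                 // fully processed form
        return terms;
    }

    /** Wildcard (prefix) query: the raw prefix is tested against every indexed term. */
    static boolean prefixMatchesAny(Set<String> indexedTerms, String rawPrefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(rawPrefix)) {
                return true;
            }
        }
        return false;
    }
}
```

For indexTerms("H\u00e9sitation"), the prefix "h\u00e9sita" now matches through
the preserved token, and "esi" still matches through the processed one. The
unaccented prefix "hesita" matches neither, which is the limitation Laurent
points out in his reply: the preserved token keeps its accent, and the
processed token has lost its 'h'.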
