indexing from Java

2006-04-10 Thread Erik Hatcher
I'm refactoring a command-line application that indexes RDF data to  
index into Solr rather than directly into a Lucene index.


In looking at the code, it looks like it'd be difficult to do this  
without actually doing an HTTP POST.  Or is there a clean way to  
leverage Solr's infrastructure without POSTing into it?  I don't even  
need Solr running while indexing.  Is this possible easily?  If so  
what API should I be using?


Thanks,
Erik



Re: Unique id field support in Solr

2006-04-10 Thread Yonik Seeley
On 4/10/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : Is there any way to explicitly add a record using Solr, such that the
> : add will fail if a record already exists with the same value in the
> : unique ID field (as specified by the schema's uniqueKey field)?
>
> I was going to say <add allowDups="false" overwritePending="false" overwriteCommitted="false"> should do what you want,
>
>
> http://wiki.apache.org/solr/UpdateXmlMessages#head-3dfbf90fbc69f168ab6f3389daf68571ad614bef
>
> ...but that combination of options doesn't seem to be supported -- when i
> tried to write a test case for it, it failed with this exception...
>
> SEVERE: org.apache.solr.core.SolrException: unsupported param combo:add:,allowDups=false,overwritePending=false,overwriteCommitted=false
>     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:197)
>
> ...i'm not sure how hard it would be to add that, the UpdateHandler is
> already checking for the duplicate, i'm not sure why it can't just throw
> the new doc away when it finds one instead of deleting the old

It's not checking for a duplicate at the time the document is added to
the index though... it removes duplicates later when the <commit/>
command is given.

That combination of parameters could be supported by
DirectUpdateHandler2 (the default) by doing a lookup in the index in
addition to checking the local state (uncommitted documents).  Feel
free to file a JIRA bug, or, even better, provide a patch.

-Yonik
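
The check Yonik describes (look in the uncommitted documents and in the committed index before accepting an add) can be sketched with a toy model. This is not Solr's DirectUpdateHandler2; ToyUpdateHandler, addIfAbsent and the two in-memory maps are hypothetical names invented for illustration, with a HashMap standing in for the Lucene index.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: the maps stand in for Solr's real index and pending state.
class ToyUpdateHandler {
    private final Map<String, String> committed = new HashMap<>(); // stands in for the index
    private final Map<String, String> pending = new HashMap<>();   // uncommitted documents

    /** Adds the document only if its id is unknown; returns false if it was rejected. */
    boolean addIfAbsent(String id, String body) {
        // Check the local (uncommitted) state first, then do the extra lookup in the "index".
        if (pending.containsKey(id) || committed.containsKey(id)) {
            return false;
        }
        pending.put(id, body);
        return true;
    }

    /** Makes pending documents visible, loosely like a commit. */
    void commit() {
        committed.putAll(pending);
        pending.clear();
    }

    /** Returns the committed body for an id, or null if none is committed. */
    String get(String id) {
        return committed.get(id);
    }
}

class ToyUpdateHandlerDemo {
    public static void main(String[] args) {
        ToyUpdateHandler handler = new ToyUpdateHandler();
        System.out.println(handler.addIfAbsent("1", "first"));  // true
        System.out.println(handler.addIfAbsent("1", "second")); // false: duplicate is still pending
        handler.commit();
        System.out.println(handler.addIfAbsent("1", "third"));  // false: duplicate is committed
        System.out.println(handler.get("1"));                   // first
    }
}
```

The point of the sketch is only that the duplicate is rejected at add time rather than cleaned up at commit time; a real implementation would have to do the index lookup against Lucene.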


Re: indexing from Java

2006-04-10 Thread Yonik Seeley
On 4/10/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> I'm refactoring a command-line application that indexes RDF data to
> index into Solr rather than directly into a Lucene index.
>
> In looking at the code, it looks like it'd be difficult to do this
> without actually doing an HTTP POST.  Or is there a clean way to
> leverage Solr's infrastructure without POSTing into it?  I don't even
> need Solr running while indexing.  Is this possible easily?  If so
> what API should I be using?

See the DocumentBuilder class for creating lucene Document objects
using the Solr schema.

You could add these Document objects to a Lucene index yourself, or
instantiate a SolrCore, and get its UpdateHandler.

Hmmm, it looks like there isn't a getUpdateHandler() on the SolrCore...
So two options:
  - add a SolrCore.getUpdateHandler() method
  - add a SolrCore.update(UpdateCommand cmd) method

-Yonik


Re: indexing from Java

2006-04-10 Thread Chris Hostetter

: In looking at the code, it looks like it'd be difficult to do this
: without actually doing an HTTP POST.  Or is there a clean way to
: leverage Solr's infrastructure without POSTing into it?  I don't even
: need Solr running while indexing.  Is this possible easily?  If so
: what API should I be using?

To answer your implicit question: in the past the only
recommended/supported way to get documents into Solr was by POSTing
<add> document messages.

I know Yonik has wanted to make it easier; having a client
library that uses the SolrCore directly seems like a good start.   I
would think it should even be relatively easy to make it take in
org.apache.lucene.document.Document objects right? ... they're mainly just
beans (that sometimes have Readers).



-Hoss



Re: indexing from Java

2006-04-10 Thread Yonik Seeley
On 4/10/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> : In looking at the code, it looks like it'd be difficult to do this
> : without actually doing an HTTP POST.  Or is there a clean way to
> : leverage Solr's infrastructure without POSTing into it?  I don't even
> : need Solr running while indexing.  Is this possible easily?  If so
> : what API should I be using?
>
> To answer your implicit question: in the past the only
> recommended/supported way to get documents into Solr was by POSTing
> <add> document messages.

Yes, updating a running Solr server is the recommended (and more
flexible) approach.

> I know Yonik has wanted to make it easier; having a client
> library that uses the SolrCore directly seems like a good start.

To me, running Solr outside the webapp is "expert" level use, and not
something I'd want to generally recommend.

For an official client library, I'd rather have something that managed
HTTP communications & serialization to a running Solr server:
  - It's not an exclusive client, can coexist with other updaters/searchers
  - can be used for rebuilding whole index as well as incremental updates
  - could possibly be used as an alternate replication strategy... do
the updates to multiple Solr instances.

-Yonik
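
A minimal sketch of what the serialization half of such an HTTP client might look like, using only the JDK. SolrXmlPoster and its methods are hypothetical names, not part of Solr; the message shape follows the <add><doc><field name="..."> format of Solr's XML update messages, and the update URL is whatever your server exposes (http://localhost:8983/solr/update is assumed in the usage note).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper: builds an XML add message and POSTs it to a running Solr.
class SolrXmlPoster {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Serializes one document (field name -> value) as an add message. */
    static String toAddMessage(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue())).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    /** POSTs the message to the given update URL and returns the HTTP status code. */
    static int post(String updateUrl, String xml) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(updateUrl).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }
}
```

Usage would be along the lines of SolrXmlPoster.post("http://localhost:8983/solr/update", SolrXmlPoster.toAddMessage(fields)), followed by a separate commit message. Because it only speaks HTTP, a client like this can coexist with other updaters and searchers, which is the first point in the list above.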


Re: indexing from Java

2006-04-10 Thread Erik Hatcher
Sure, I realize I could create the Lucene index myself, but I like the
higher-level abstraction that Solr is providing.


The disadvantage to creating the Lucene index myself, or even with a
Solr capability to take a Lucene Document, is that I'd lose the schema.xml
goodness of things like <copyField> and would have to do that bit
myself.  Ideally I'm looking for something that I could pass an
object that represented the <doc> from the XML POST and it'd create
the Document internally and index it.


For now I'll use HTTP POST, but I'm glad to hear there is some
consideration of some type of (even expert-level) client API that
could go direct without the HTTP overhead.


Erik






Re: indexing from Java

2006-04-10 Thread Chris Hostetter

: Sure, I realize I could create the Lucene index myself, but like the
: higher level abstraction that Solr is providing.

Sorry, i guess i didn't explain myself very well.

My point was that it should be possible to write a Java API which updated
a Solr index (using SolrCore and all of the built-in schema support) that
either ran in standalone mode by creating the SolrCore directly, or
interacted with Solr via HTTP POST, and was implemented in such a way that
the "input" was Lucene Documents, so that legacy Lucene applications
wouldn't need to be rewritten much.  Instead of an IndexWriter constructed
with a path, you'd construct a "SolrIndexWriter" (maybe using a schema and a
config and a data dir, or maybe using a URL depending on how you want it
to work), and while the SolrIndexWriter may not be a subclass of
lucene's IndexWriter, it could support the same addDocument(Document)
method that is typically used, so that 90% (i'm guessing) of the existing
code in the client wouldn't need to change.



-Hoss
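
The shape of that idea can be sketched as one interface with two interchangeable writers. Nothing here exists in Solr or Lucene: SketchIndexWriter, EmbeddedSketchWriter, RemoteSketchWriter and LegacyIndexer are hypothetical names, a Map<String, String> stands in for a Lucene Document, a plain list stands in for the in-process index, and the transport is any callback that ships a string (for example an HTTP POST).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical common interface; a Map stands in for a Lucene Document. */
interface SketchIndexWriter {
    void addDocument(Map<String, String> doc);
}

/** "Standalone" flavour: hands documents to an in-process index (here, a list). */
class EmbeddedSketchWriter implements SketchIndexWriter {
    final List<Map<String, String>> index = new ArrayList<>();

    public void addDocument(Map<String, String> doc) {
        index.add(doc);
    }
}

/** "Remote" flavour: serializes each document and passes it to a transport callback. */
class RemoteSketchWriter implements SketchIndexWriter {
    private final Consumer<String> transport;

    RemoteSketchWriter(Consumer<String> transport) {
        this.transport = transport;
    }

    public void addDocument(Map<String, String> doc) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        doc.forEach((name, value) -> sb.append("<field name=\"").append(esc(name))
                .append("\">").append(esc(value)).append("</field>"));
        transport.accept(sb.append("</doc></add>").toString());
    }

    /** Escapes XML special characters in names and values. */
    private static String esc(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}

class LegacyIndexer {
    /** Existing client code only calls addDocument, so it works with either writer. */
    static void indexAll(SketchIndexWriter writer, List<Map<String, String>> docs) {
        for (Map<String, String> doc : docs) {
            writer.addDocument(doc);
        }
    }
}
```

The design choice being illustrated is that only the construction of the writer differs (data dir versus URL); the indexing loop in LegacyIndexer stays the same for both.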



Re: indexing from Java

2006-04-10 Thread Erik Hatcher
Now *that* is an interesting idea!   It is similar to the approach I
want to take with a Solr Ruby library, such that all the HTTP stuff is
hidden behind a clean domain-specific language of sorts.


Erik






selective field updating

2006-04-10 Thread Erik Hatcher
While I know the answer to this when it comes to direct Lucene usage,  
I'm curious if Solr could eventually get to the point of allowing  
clever updating of a document such that the client could only pass a  
specific field to add or update in an existing document and Solr  
could do what it takes to make this happen with Lucene.


This is mighty tricky with Lucene, unfortunately, but it would be a  
great feature to be able to simply update one field leaving the rest  
as-is.  Otherwise the client has to retrieve the document, and send  
the whole thing back with only the one field modified, right?


Erik



Re: selective field updating

2006-04-10 Thread Chris Hostetter
:
: While I know the answer to this when it comes to direct Lucene usage,
: I'm curious if Solr could eventually get to the point of allowing
: clever updating of a document such that the client could only pass a
: specific field to add or update in an existing document and Solr
: could do what it takes to make this happen with Lucene.

Solr could ... when Lucene can :)

: as-is.  Otherwise the client has to retrieve the document, and send
: the whole thing back with only the one field modified, right?

Seriously though, Solr suffers from the same roadblock that Lucene does --
Fields which can be indexed but not stored.  There's not much magic that
can be done at the "Solr" level to work around that problem  (at least
nothing i can think of)

If there was a generic utility that operated directly at the index level
to "clone" a document already in the index, with non-stored indexed fields
and all, and only modify/remove/add changes then that utility could
certainly be used in Solr to support some new command with syntax
something like...
   foo
  bar
   


-Hoss
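
To make the roadblock concrete, here is a toy model of the "retrieve the document and send the whole thing back" approach. ToyDoc and FieldUpdateSketch are hypothetical names and do not reflect Lucene's data structures; the only assumption carried over from the thread is that a field can be indexed (searchable) without being stored (readable back).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy document: tracks which fields are searchable and which values can be read back.
class ToyDoc {
    final Map<String, String> stored = new HashMap<>(); // values that can be retrieved
    final Set<String> indexedFields = new HashSet<>();  // fields that are searchable

    void add(String name, String value, boolean store) {
        indexedFields.add(name);
        if (store) {
            stored.put(name, value);
        }
    }
}

class FieldUpdateSketch {
    /** Rebuilds a document from its stored fields only, with one field changed. */
    static ToyDoc updateField(ToyDoc old, String name, String newValue) {
        ToyDoc fresh = new ToyDoc();
        for (Map.Entry<String, String> e : old.stored.entrySet()) {
            if (!e.getKey().equals(name)) {
                fresh.add(e.getKey(), e.getValue(), true);
            }
        }
        fresh.add(name, newValue, true);
        return fresh; // any field that was indexed but not stored is gone
    }
}
```

If the old document had a "body" field that was indexed but not stored, the rebuilt document is no longer searchable on "body", because its original value was never available to copy. That is why the update would need a utility working at the index level, as described above.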



Re: selective field updating

2006-04-10 Thread jason rutherglen
I think this could be done, and then an SQL-style update command could also be
enabled.
