indexing from Java
I'm refactoring a command-line application that indexes RDF data to index into Solr rather than directly into a Lucene index.

In looking at the code, it looks like it'd be difficult to do this without actually doing an HTTP POST. Or is there a clean way to leverage Solr's infrastructure without POSTing into it? I don't even need Solr running while indexing. Is this possible easily? If so what API should I be using?

Thanks,
Erik
Re: Unique id field support in Solr
On 4/10/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> : Is there any way to explicitly add a record using Solr, such that the
> : add will fail if a record already exists with the same value in the
> : unique ID field (as specified by the schema's uniqueKey field)?
>
> I was going to say overwriteCommitted="false"> should do what you want,
>
> http://wiki.apache.org/solr/UpdateXmlMessages#head-3dfbf90fbc69f168ab6f3389daf68571ad614bef
>
> ...but that combination of options doesn't seem to be supported -- when i
> tried to write a test case for it, it failed with this exception...
>
> SEVERE: org.apache.solr.core.SolrException: unsupported param combo:add:,allowDups=false,overwritePending=false,overwriteCommitted=false
>     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:197)
>
> ...i'm not sure how hard it would be to add that, the UpdateHandler is
> already checking for the duplicate, i'm not sure why it can't just throw
> the new doc away when it finds one instead of deleting the old

It's not checking for a duplicate at the time the document is added to the index though... it removes duplicates later when the command is given.

That combination of parameters could be supported by DirectUpdateHandler2 (the default) by doing a lookup in the index in addition to checking the local state (uncommitted documents).

Feel free to file a JIRA bug, or, even better, provide a patch.

-Yonik
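As a rough illustration of the check Yonik describes (look in the committed index as well as in the uncommitted documents before accepting an add), here is a toy in-memory model in Java. It is not Solr code: the class StrictAddIndex and its methods are hypothetical names, and two sets of ids stand in for the real index and the pending documents.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of an "add must fail on duplicate" check; hypothetical, not Solr code. */
final class StrictAddIndex {
    private final Set<String> committed = new HashSet<>(); // ids already in the index
    private final Set<String> pending = new HashSet<>();   // ids added since the last commit

    /** Returns false (and adds nothing) if the id exists in either place. */
    boolean addIfAbsent(String id) {
        if (pending.contains(id) || committed.contains(id)) {
            return false;
        }
        pending.add(id);
        return true;
    }

    /** Moves the pending ids into the committed state. */
    void commit() {
        committed.addAll(pending);
        pending.clear();
    }
}
```

The point of the sketch is only that both lookups happen at add time, rather than duplicates being cleaned up later.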
Re: indexing from Java
On 4/10/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> I'm refactoring a command-line application that indexes RDF data to
> index into Solr rather than directly into a Lucene index.
>
> In looking at the code, it looks like it'd be difficult to do this
> without actually doing an HTTP POST. Or is there a clean way to
> leverage Solr's infrastructure without POSTing into it? I don't even
> need Solr running while indexing. Is this possible easily? If so
> what API should I be using?

See the DocumentBuilder class for creating Lucene Document objects using the Solr schema. You could add these Document objects to a Lucene index yourself, or instantiate a SolrCore and get its UpdateHandler.

Hmmm, it looks like there isn't a getUpdateHandler() on the SolrCore...
So two options:
- add a SolrCore.getUpdateHandler() method
- add a SolrCore.update(UpdateCommand cmd) method

-Yonik
Re: indexing from Java
: In looking at the code, it looks like it'd be difficult to do this
: without actually doing an HTTP POST. Or is there a clean way to
: leverage Solr's infrastructure without POSTing into it? I don't even
: need Solr running while indexing. Is this possible easily? If so
: what API should I be using?

To answer your implicit question: in the past the only recommended/supported way to get documents into Solr was by POSTing document messages.

I know Yonik has wanted to make it easier, having a client library that uses the SolrCore directly seems like a good start. I would think it should even be relatively easy to make it take in org.apache.lucene.document.Document objects, right? ... they're mainly just beans (that sometimes have Readers).

-Hoss
Re: indexing from Java
On 4/10/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> : In looking at the code, it looks like it'd be difficult to do this
> : without actually doing an HTTP POST. Or is there a clean way to
> : leverage Solr's infrastructure without POSTing into it? I don't even
> : need Solr running while indexing. Is this possible easily? If so
> : what API should I be using?
>
> To answer your implicit question: in the past the only
> recommended/supported way to get documents into Solr was by POSTing
> document messages.

Yes, updating a running Solr server is the recommended (and more flexible) approach.

> I know Yonik has wanted to make it easier, having a client
> library that uses the SolrCore directly seems like a good start.

To me, running Solr outside the webapp is "expert" level use, and not something I'd want to generally recommend.

For an official client library, I'd rather have something that managed HTTP communications & serialization to a running Solr server:
- It's not an exclusive client, can coexist with other updaters/searchers
- can be used for rebuilding whole index as well as incremental updates
- could possibly be used as an alternate replication strategy... do the updates to multiple Solr instances.

-Yonik
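For a sense of what a client that "managed HTTP communications & serialization" would have to do, here is a minimal Java sketch. The class SolrXmlPoster and its method names are made up for illustration and are not part of Solr; the message shape (an add element wrapping a doc with named field elements) follows the update XML message format referenced earlier in the thread, and documents are simplified to one string value per field.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical helper: serializes a document to an add message and POSTs it. */
final class SolrXmlPoster {
    /** Escapes the characters that are significant in XML text and attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Serializes one document (field name to value) as an add message. */
    static String toAddMessage(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(f.getKey())).append("\">")
               .append(escape(f.getValue())).append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    /** POSTs a message to the given update URL and returns the HTTP status code. */
    static int post(String updateUrl, String message) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(updateUrl).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(message.getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }
}
```

Usage might look like SolrXmlPoster.post("http://localhost:8983/solr/update", SolrXmlPoster.toAddMessage(fields)), where the URL is an assumption about a local install. A commit message would still have to be sent before the added documents become searchable.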
Re: indexing from Java
Sure, I realize I could create the Lucene index myself, but I like the higher level abstraction that Solr is providing. The disadvantage to creating the Lucene index myself, or even with a Solr capability to take a Lucene Document, is I'd lose the schema.xml goodness of things like and would have to do that bit myself.

Ideally I'm looking for something that I could pass an object that represented the from the XML POST and it'd create the Document internally and index it.

For now I'll use HTTP POST, but glad to hear there is some consideration to some type of (even expert level) client API that could go direct without the HTTP overhead.

Erik

On Apr 10, 2006, at 1:14 PM, Chris Hostetter wrote:
> : In looking at the code, it looks like it'd be difficult to do this
> : without actually doing an HTTP POST. Or is there a clean way to
> : leverage Solr's infrastructure without POSTing into it? I don't even
> : need Solr running while indexing. Is this possible easily? If so
> : what API should I be using?
>
> To answer your implicit question: in the past the only
> recommended/supported way to get documents into Solr was by POSTing
> document messages.
>
> I know Yonik has wanted to make it easier, having a client
> library that uses the SolrCore directly seems like a good start. I
> would think it should even be relatively easy to make it take in
> org.apache.lucene.document.Document objects, right? ... they're mainly
> just beans (that sometimes have Readers).
>
> -Hoss
Re: indexing from Java
: Sure, I realize I could create the Lucene index myself, but I like the
: higher level abstraction that Solr is providing.

Sorry, i guess i didn't explain myself very well.

My point was that it should be possible to write a Java API which updated a Solr index (using SolrCore and all of the built in schema support) that either ran in standalone mode by creating the SolrCore directly, or interacted with Solr via HTTP POST, and was implemented in such a way that the "input" was Lucene Documents, so that legacy Lucene applications wouldn't need to be rewritten much.

Instead of an IndexWriter constructed with a path, you'd construct a "SolrIndexWriter" (maybe using a schema and a config and a data dir, or maybe using a URL depending on how you want it to work), and while the SolrIndexWriter may not be a subclass of lucene's IndexWriter, it could support the same addDocument(Document) method that is typically used, so that 90% (i'm guessing) of the existing code in the client wouldn't need to change.

-Hoss
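The shape of that idea can be sketched in Java as one addDocument method with interchangeable backends. Everything here is hypothetical: none of these types exist in Solr or Lucene, the type parameter D stands in for Lucene's Document class so the sketch compiles on its own, and the "remote" backend takes a serializer and a transport as plain functions rather than doing real HTTP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical facade; D stands in for Lucene's Document class. */
interface DocumentWriter<D> {
    void addDocument(D doc);
}

/** Backend that serializes each document and hands it to a transport (for example an HTTP POST). */
final class RemoteDocumentWriter<D> implements DocumentWriter<D> {
    private final Function<D, String> serializer;
    private final Consumer<String> transport;

    RemoteDocumentWriter(Function<D, String> serializer, Consumer<String> transport) {
        this.serializer = serializer;
        this.transport = transport;
    }

    @Override
    public void addDocument(D doc) {
        transport.accept(serializer.apply(doc));
    }
}

/** Backend that keeps documents in process, standing in for a standalone core. */
final class LocalDocumentWriter<D> implements DocumentWriter<D> {
    final List<D> added = new ArrayList<>();

    @Override
    public void addDocument(D doc) {
        added.add(doc);
    }
}

/** Existing indexing code only calls addDocument, so the backend can be swapped. */
final class LegacyIndexer {
    static <D> void indexAll(DocumentWriter<D> writer, List<D> docs) {
        for (D doc : docs) {
            writer.addDocument(doc);
        }
    }
}
```

The indexing loop in LegacyIndexer is the part that would stay unchanged in an existing Lucene application; only the construction of the writer differs between the standalone and the HTTP case.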
Re: indexing from Java
Now *that* is an interesting idea! It is similar to the approach I want to go with a Solr Ruby library, such that all the HTTP stuff is hidden behind a clean domain-specific language of sorts.

Erik

On Apr 10, 2006, at 4:10 PM, Chris Hostetter wrote:
> : Sure, I realize I could create the Lucene index myself, but I like the
> : higher level abstraction that Solr is providing.
>
> Sorry, i guess i didn't explain myself very well.
>
> My point was that it should be possible to write a Java API which
> updated a Solr index (using SolrCore and all of the built in schema
> support) that either ran in standalone mode by creating the SolrCore
> directly, or interacted with Solr via HTTP POST, and was implemented
> in such a way that the "input" was Lucene Documents, so that legacy
> Lucene applications wouldn't need to be rewritten much.
>
> Instead of an IndexWriter constructed with a path, you'd construct a
> "SolrIndexWriter" (maybe using a schema and a config and a data dir,
> or maybe using a URL depending on how you want it to work), and while
> the SolrIndexWriter may not be a subclass of lucene's IndexWriter, it
> could support the same addDocument(Document) method that is typically
> used, so that 90% (i'm guessing) of the existing code in the client
> wouldn't need to change.
>
> -Hoss
selective field updating
While I know the answer to this when it comes to direct Lucene usage, I'm curious if Solr could eventually get to the point of allowing clever updating of a document such that the client could only pass a specific field to add or update in an existing document and Solr could do what it takes to make this happen with Lucene.

This is mighty tricky with Lucene, unfortunately, but it would be a great feature to be able to simply update one field leaving the rest as-is. Otherwise the client has to retrieve the document, and send the whole thing back with only the one field modified, right?

Erik
Re: selective field updating
: While I know the answer to this when it comes to direct Lucene usage,
: I'm curious if Solr could eventually get to the point of allowing
: clever updating of a document such that the client could only pass a
: specific field to add or update in an existing document and Solr
: could do what it takes to make this happen with Lucene.

Solr could ... when Lucene can :)

: as-is. Otherwise the client has to retrieve the document, and send
: the whole thing back with only the one field modified, right?

Seriously though, Solr suffers from the same roadblock that Lucene does -- fields which can be indexed but not stored. There's not much magic that can be done at the "Solr" level to work around that problem (at least nothing i can think of).

If there was a generic utility that operated directly at the index level to "clone" a document already in the index, with non-stored indexed fields and all, and only modify/remove/add changes, then that utility could certainly be used in Solr to support some new command with syntax something like...

foo bar

-Hoss
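To make the roadblock concrete, here is a small Java sketch of the retrieve, modify and re-send workaround. It is a toy model rather than Solr or Lucene code: the class ReaddWithChange and its methods are hypothetical, documents are simplified to maps with one string value per field, and the set of indexed field names is passed in by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper modelling "fetch the stored fields, change one, send it all back". */
final class ReaddWithChange {
    /**
     * Builds the replacement document from what can be read back (the stored
     * fields) plus the one changed field. The input map is left untouched.
     */
    static Map<String, String> rebuild(Map<String, String> storedFields,
                                       String field, String newValue) {
        Map<String, String> replacement = new LinkedHashMap<>(storedFields);
        replacement.put(field, newValue);
        return replacement;
    }

    /** Names of indexed fields that are not stored and so would be lost on re-add. */
    static Set<String> lostFields(Set<String> indexedFields,
                                  Map<String, String> storedFields) {
        Set<String> lost = new TreeSet<>(indexedFields);
        lost.removeAll(storedFields.keySet());
        return lost;
    }
}
```

Any field name returned by lostFields is content the client never gets back from a search result, which is why the workaround only holds up when every indexed field is also stored.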
Re: selective field updating
I think this could be done, and then an SQL-styled update command could also be enabled.

----- Original Message -----
From: Erik Hatcher <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, April 10, 2006 1:52:04 PM
Subject: selective field updating

While I know the answer to this when it comes to direct Lucene usage, I'm curious if Solr could eventually get to the point of allowing clever updating of a document such that the client could only pass a specific field to add or update in an existing document and Solr could do what it takes to make this happen with Lucene.

This is mighty tricky with Lucene, unfortunately, but it would be a great feature to be able to simply update one field leaving the rest as-is. Otherwise the client has to retrieve the document, and send the whole thing back with only the one field modified, right?

Erik