Re: processing documents in solr

2013-07-29 Thread Joe Zhang
I'll try reindexing the timestamp. The id-creation approach suggested by Erick sounds attractive, but the nutch/solr integration seems rather tight. I don't know where to break in to insert the id into solr. On Mon, Jul 29, 2013 at 4:11 AM, Erick Erickson wrote: > No SolrJ doesn't provide this autom

Re: processing documents in solr

2013-07-29 Thread Erick Erickson
No, SolrJ doesn't provide this automatically. You'd be providing the counter by inserting it into the document as you created new docs. You could do this with any kind of document creation you are using. Best, Erick On Mon, Jul 29, 2013 at 2:51 AM, Aditya wrote: > Hi, > > The easiest solution wou

Re: processing documents in solr

2013-07-28 Thread Aditya
Hi, The easiest solution would be to have the timestamp indexed. Is there any issue in doing re-indexing? If you want to process records in batches then you need an ordered list and a bookmark. You require a field to sort on, and you maintain a counter / last id as the bookmark. This is mandatory to solve your probl
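
The "ordered list plus bookmark" idea above can be sketched roughly as follows. This is a minimal illustration, not anything posted in the thread: the helper names `batch_params` and `advance` and the field name `counter` are hypothetical, and it assumes the sort field is indexed, single-valued, and unique per document. It only builds the request parameters; sending them to Solr is left out.

```python
from urllib.parse import urlencode

def batch_params(sort_field, bookmark=None, rows=500):
    """Build Solr select parameters for the next batch after `bookmark` (hypothetical helper)."""
    lower = "*" if bookmark is None else str(bookmark)
    # '{' makes the lower bound exclusive, so the bookmarked document is not read again.
    opener = "[" if bookmark is None else "{"
    return {
        "q": "*:*",
        "fq": f"{sort_field}:{opener}{lower} TO *]",
        "sort": f"{sort_field} asc",
        "rows": rows,
    }

def advance(bookmark, docs, sort_field):
    """Return the new bookmark after a batch; `docs` must already be sorted ascending."""
    return docs[-1][sort_field] if docs else bookmark

# Example: query string for the batch that follows counter value 41.
query_string = urlencode(batch_params("counter", 41, rows=2))
```

If the sort field is not unique, the exclusive lower bound can skip documents that share the bookmarked value, which is why the sketch assumes a unique field.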

Re: processing documents in solr

2013-07-28 Thread Joe Zhang
Basically, I was thinking about running a range query like Shawn suggested on the tstamp field, but unfortunately it was not indexed. Range queries only work on indexed fields, right? On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang wrote: > I've been thinking about tstamp solution in the past few d

Re: processing documents in solr

2013-07-28 Thread Joe Zhang
I've been thinking about the tstamp solution in the past few days, but too bad, the field is available but not indexed... I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the counter value. If yes, that would be equivalent to an autoincrement id. I'm indexing from Nutch though; don'

Re: processing documents in solr

2013-07-28 Thread Erick Erickson
Why wouldn't a simple timestamp work for the ordering? Although I guess "simple timestamp" isn't really simple if the time settings change. So how about a simple counter field in your documents? Assuming you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc. Take the counter f
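
A rough sketch of the counter idea, under stated assumptions: the message is cut off here, so this only illustrates the part that is visible (query sorted by `counter desc`) together with Erick's later remark that the indexing client inserts the counter into new documents itself. The function name `assign_counters` is hypothetical, and the documents are plain dicts rather than SolrJ objects.

```python
from urllib.parse import urlencode

# Query string that returns the document with the highest counter first.
HIGHEST_COUNTER_QUERY = urlencode({"q": "*:*", "sort": "counter desc", "rows": 1})

def assign_counters(new_docs, last_counter):
    """Return copies of new_docs with consecutive counter values after last_counter."""
    return [
        dict(doc, counter=last_counter + i)
        for i, doc in enumerate(new_docs, start=1)
    ]

# Example: the highest counter currently in the index is assumed to be 10.
docs = assign_counters([{"id": "a"}, {"id": "b"}], 10)
```

This only works if a single indexing process hands out the counter values; concurrent indexers would need to coordinate.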

Re: processing documents in solr

2013-07-27 Thread Maurizio Cucchiara
In both cases, for better performance, I'd first load just the IDs, and then load each document during processing. As for the incremental requirement, it should not be difficult to write a hash function which maps a non-numerical ID to a value. On Jul 27, 2013 7:03 AM, "Joe Zhang

Re: processing documents in solr

2013-07-27 Thread Roman Chyla
On Sat, Jul 27, 2013 at 4:17 PM, Shawn Heisey wrote: > On 7/27/2013 11:38 AM, Joe Zhang wrote: > > I have a constantly growing index, so not updating the index can't be > > practical... > > > > Going back to the beginning of this thread: when we use the vanilla > > "*:*"+pagination approach, woul

Re: processing documents in solr

2013-07-27 Thread Shawn Heisey
On 7/27/2013 11:38 AM, Joe Zhang wrote: > I have a constantly growing index, so not updating the index can't be > practical... > > Going back to the beginning of this thread: when we use the vanilla > "*:*"+pagination approach, would the ordering of documents remain stable? > the index is dyn

Re: processing documents in solr

2013-07-27 Thread Joe Zhang
I have a constantly growing index, so not updating the index can't be practical... Going back to the beginning of this thread: when we use the vanilla "*:*"+pagination approach, would the ordering of documents remain stable? the index is dynamic: update/insertion only, no deletion. On Sat,

Re: processing documents in solr

2013-07-27 Thread Shawn Heisey
On 7/27/2013 11:17 AM, Joe Zhang wrote: > Thanks for sharing, Roman. I'll look into your code. > > One more thought on your suggestion, Shawn. In fact, for the id, we need > more than "unique" and "rangeable"; we also need some sense of atomic > values. Your approach might run into risk with a tex

Re: processing documents in solr

2013-07-27 Thread Joe Zhang
Thanks for sharing, Roman. I'll look into your code. One more thought on your suggestion, Shawn. In fact, for the id, we need more than "unique" and "rangeable"; we also need some sense of atomic values. Your approach might run into risk with a text-based id field, say: the id/key has values 'a',

Re: processing documents in solr

2013-07-27 Thread Roman Chyla
Dear list, I've written a special processor exactly for this kind of operation https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch This is how we use it http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch It is capable of

Re: processing documents in solr

2013-07-26 Thread Joe Zhang
Thanks. On Fri, Jul 26, 2013 at 11:34 PM, Shawn Heisey wrote: > On 7/27/2013 12:30 AM, Joe Zhang wrote: > > ==> so a "url" field would work fine? > > As long as it's guaranteed unique on every document (especially if it is > your uniqueKey) and goes into the index as a single token, that should

Re: processing documents in solr

2013-07-26 Thread Shawn Heisey
On 7/27/2013 12:30 AM, Joe Zhang wrote: > ==> so a "url" field would work fine? As long as it's guaranteed unique on every document (especially if it is your uniqueKey) and goes into the index as a single token, that should work just fine for the range queries I've described. Thanks, Shawn

Re: processing documents in solr

2013-07-26 Thread Joe Zhang
On Fri, Jul 26, 2013 at 11:18 PM, Shawn Heisey wrote: > On 7/26/2013 11:50 PM, Joe Zhang wrote: > > ==> Essentially we are doing pagination here, right? If performance is > not > > the concern, given that the index is dynamic, does the order of > > entries remain stable over time? > > Yes, it's

Re: processing documents in solr

2013-07-26 Thread Shawn Heisey
On 7/26/2013 11:50 PM, Joe Zhang wrote: > ==> Essentially we are doing pagination here, right? If performance is not > the concern, given that the index is dynamic, does the order of > entries remain stable over time? Yes, it's pagination. Just like the other method that I've described in detail

Re: processing documents in solr

2013-07-26 Thread Joe Zhang
On a related note, inspired by what you said, Shawn, an auto-increment id seems perfect here. Yet I found there is no such support in solr. The UUID only guarantees uniqueness. On Fri, Jul 26, 2013 at 10:50 PM, Joe Zhang wrote: > Thanks for your kind reply, Shawn. > > On Fri, Jul 26, 2013 at 10:27 P

Re: processing documents in solr

2013-07-26 Thread Joe Zhang
Thanks for your kind reply, Shawn. On Fri, Jul 26, 2013 at 10:27 PM, Shawn Heisey wrote: > On 7/26/2013 11:02 PM, Joe Zhang wrote: > > I have an ever-growing solr repository, and I need to process every > single > > document to extract statistics. What would be a reasonable process that > > sati

Re: processing documents in solr

2013-07-26 Thread Shawn Heisey
On 7/26/2013 11:02 PM, Joe Zhang wrote: > I have an ever-growing solr repository, and I need to process every single > document to extract statistics. What would be a reasonable process that > satisfies the following properties: > > - Exhaustive: I have to traverse every single document > - Increme

processing documents in solr

2013-07-26 Thread Joe Zhang
Dear list: I have an ever-growing solr repository, and I need to process every single document to extract statistics. What would be a reasonable process that satisfies the following properties: - Exhaustive: I have to traverse every single document - Incremental: in other words, it has to allow me