Well, the first results are ready. I have implemented a custom update processor following your suggestion, using a low-level index reader and TermDocs.
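In outline the processor does something like this (a simplified sketch rather than my exact code - the class name is made up, it assumes the Lucene 3.x TermDocs API with a string uniqueKey field named "id", the factory that registers it is omitted, and note it only sees committed documents):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.util.RefCounted;

public class SkipExistingDocProcessor extends UpdateRequestProcessor {
  private final SolrQueryRequest req;

  public SkipExistingDocProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
    super(next);
    this.req = req;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    String id = (String) cmd.getSolrInputDocument().getFieldValue("id");
    RefCounted<SolrIndexSearcher> searcher = req.getCore().getSearcher();
    try {
      IndexReader reader = searcher.get().getIndexReader();
      TermDocs termDocs = reader.termDocs(new Term("id", id));
      try {
        if (termDocs.next()) {
          return; // id already indexed: drop the doc and skip the rest of the chain
        }
      } finally {
        termDocs.close();
      }
    } finally {
      searcher.decref();
    }
    super.processAdd(cmd); // not found: index as usual
  }
}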
I launched scripts which add about 10,000 docs. Indexing took about 1 minute including the commit, which is quite good for me. I don't have larger datasets, so I won't be able to check under heavier conditions. If someone is interested, I can send over my jar file with my update processor. As I said, I am ready to contribute it to Solr, but I will get back to it in the New Year, after 10 Jan.

Thanks everybody.

Best Regards
Alexander Aristov

On 29 December 2011 18:12, Erick Erickson <erickerick...@gmail.com> wrote:

I'd guess it would be much faster, assuming that the search savings aren't swamped by the additional transmission time over the wire and the cost of parsing the request (although SolrJ uses a binary format, so parsing the request probably isn't all that expensive).

You could even do a hybrid approach: pack up all of the IDs you are about to update, send them to your special *request* handler, and have your request handler respond with the documents that are already in the index...

Hmmm, scratch all that. Start with just stringing together a long set of <uniqueKey> values and searching for them. Something like:

q=id:(1 2 47 09873............)&fl=id

The response should be a minimal set of data (just the IDs). Then you can remove each returned document ID from your next update. No custom Solr components required.

Solr defaults to a maxBooleanClauses count of 1024, so your packets should have fewer IDs than this, or you should bump that config setting.

This should pretty much do what I was thinking of doing with custom code, without having to write anything.

Best
Erick

On Thu, Dec 29, 2011 at 8:15 AM, Alexander Aristov <alexander.aris...@gmail.com> wrote:

I have never developed for Solr yet and don't know much about its internals, but today I tried one approach with a searcher.

In my update processor I get a searcher and search for the ID. It works, but I need to load test it. Will index traversal be faster (less resource consuming) than search?

Best Regards
Alexander Aristov

On 29 December 2011 17:03, Erick Erickson <erickerick...@gmail.com> wrote:

Hmmm, we're not communicating <G>...

The update processor wouldn't search in the classic sense. It would just use lower-level index traversal to determine whether the doc (identified by your unique key) is already in the index, and skip indexing that document if it is. No real *searching* involved (see TermDocs.seek for one approach).

The price would be that you are transmitting the document over to the Solr instance and then throwing it away.

Best
Erick

On Thu, Dec 29, 2011 at 12:52 AM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:

Alexander,

I have two ideas for how to implement fast dedupe externally, assuming your PKs don't fit in a java.util.Map:

- your crawler can use an in-process RDBMS (Derby, H2) to track dupes;
- if your crawler is stateless, i.e. it doesn't track which PKs have already been crawled, you can retrieve them from Solr via http://wiki.apache.org/solr/TermsComponent. That's blazingly fast, but there might be a problem with removed documents (I'm not sure), and it can also lead to an OutOfMemoryError (if you have too many PKs).

Let me know if you need a workaround for one of these problems.
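For example, pulling the PKs with SolrJ could look roughly like this (an untested sketch: it assumes the TermsComponent is exposed through a /terms request handler as in the example solrconfig.xml, that the uniqueKey field is "id", and note it may also return terms from deleted documents until their segments are merged away):

import java.util.HashSet;
import java.util.Set;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class KnownIdFetcher {
  // fetch every indexed PK once, so the crawler can skip docs it already sent
  public static Set<String> fetchKnownIds(SolrServer server) throws Exception {
    SolrQuery query = new SolrQuery();
    query.setQueryType("/terms");   // route the request to the /terms handler
    query.set("terms", true);
    query.set("terms.fl", "id");    // the uniqueKey field
    query.set("terms.limit", -1);   // all terms; this is where memory can become a problem
    QueryResponse rsp = server.query(query);
    Set<String> ids = new HashSet<String>();
    for (TermsResponse.Term t : rsp.getTermsResponse().getTerms("id")) {
      ids.add(t.getTerm());
    }
    return ids;
  }

  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    System.out.println(fetchKnownIds(server).size() + " ids already indexed");
  }
}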
If you choose internal dedupe (an UpdateProcessor), please let me know if querying one-by-one turns out to be too slow for you and you need to do it page-by-page. I have done some paging like that and will do something similar soon, so I'm interested in it.

Regards

On Thu, Dec 29, 2011 at 9:34 AM, Alexander Aristov <alexander.aris...@gmail.com> wrote:

Unfortunately I have a lot of duplicates, and given that searching might suffer, I will try implementing an update processor.

But your idea is interesting and I will consider it, thanks.

Best Regards
Alexander Aristov

On 28 December 2011 19:12, Tanguy Moal <tanguy.m...@gmail.com> wrote:

Hello Alexander,

I don't know much about your requirements in terms of size and performance, but I've had a similar use case and found a pretty simple workaround.

If your duplicate rate is not too high, you can have the SignatureUpdateProcessor generate a fingerprint of each document (you already did that). Simply turn off overwriting of duplicates; you can then rely on Solr's grouping / field collapsing to group your search results by fingerprint. You'll then have one document group per "real" document. You can use group.sort to sort each group by indexing date ascending, and group.limit=1 to keep only the oldest document. You can even use group.format=simple to serve results as if no collapsing occurred, and use group.ngroups (/!\ could be expensive /!\) to get the real number of deduplicated documents.

Of course the index will be larger; as I said, I made no assumptions regarding your operating requirements. And search can be a bit slower, depending on the average rate of duplicated documents. But your issue gets addressed by configuration tuning only... Depending on your project's sizing, that could save time.

The advantage is that you keep the precious information of what content is duplicated from where :-)
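For example, the search side of the workaround could look something like this (a rough sketch; "signature" and "timestamp" stand for whatever fields hold your fingerprint and your indexing date):

q=*:*&group=true&group.field=signature&group.sort=timestamp asc&group.limit=1&group.format=simple&group.ngroups=true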
Hope this helps,

--
Tanguy

On 28/12/2011 15:45, Alexander Aristov wrote:

Thanks Erick,

That gives me a direction. I will write a new plugin, get back to the dev forum with the results, and then we can decide on next steps.

Best Regards
Alexander Aristov

On 28 December 2011 18:08, Erick Erickson <erickerick...@gmail.com> wrote:

Well, the short answer is that nobody else has
1> had a similar requirement, AND
2> not found a suitable workaround, AND
3> implemented the change and contributed it back.

So, if you'd like to volunteer <G>.....

Seriously: if you think this would be valuable and are willing to work on it, hop on over to the dev list and discuss it, open a JIRA, and make it work. I'd start by opening a discussion on the dev list before opening a JIRA, just to get a sense of where the snags would be in changing the Solr code, but that's optional.

That said, writing your own update request processor that detects this case isn't very difficult: extend UpdateRequestProcessorFactory/UpdateRequestProcessor and use it as a plugin.

Best
Erick

On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov <alexander.aris...@gmail.com> wrote:

The problem with dedupe (SignatureUpdateProcessor) is that it REPLACES old docs. I have tried it already.

Best Regards
Alexander Aristov

On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:

The SignatureUpdateProcessor is for exactly this problem:

http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication

On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov <alexander.aris...@gmail.com> wrote:

I get docs from external sources, and the only place I keep them is the Solr index. I don't have a database or other means to track indexed docs (my personal opinion is that it might be a huge headache).

Some docs might change slightly in their original sources, but I don't need those changes. In fact I need the original data only.

So I have no other way but to either check whether a document is already in the index before I put it into the SolrJ array (read: query Solr), or develop my own update chain processor, implement the ID check there, and skip such docs.

Maybe this is the wrong place to argue, and it has probably been discussed before, but I wonder why the simple overwrite parameter doesn't work here. In my opinion it suits perfectly. In combination with a unique ID it can cover all possible variants.

Cases:

1. overwrite=true and the uniqueID exists: the newer doc should overwrite the old one.

2. overwrite=false and the uniqueID exists: the newer doc must be skipped, since the old one exists.

3. The uniqueID doesn't exist: the newer doc just gets added.
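(For reference, this is roughly how my indexing code sends docs with overwrite=false - a trimmed SolrJ sketch with made-up field names:)

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithoutOverwrite {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    UpdateRequest update = new UpdateRequest();
    update.setParam("overwrite", "false"); // what I expect to behave like case 2

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "42");
    doc.addField("content", "...");
    update.add(doc);

    update.process(server); // instead of skipping the doc, this gives me a duplicate
    server.commit();
  }
}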
Best Regards
Alexander Aristov

On 27 December 2011 22:53, Erick Erickson <erickerick...@gmail.com> wrote:

Mikhail is right as far as I know; the assumption built into Solr is that duplicate IDs (when <uniqueKey> is defined) should trigger the old document to be replaced.

What is your system-of-record? By that I mean: what does your SolrJ program do to send data to Solr? Is there any way you could just *not* send documents that are already in the Solr index, based on, for instance, a timestamp associated with your system-of-record and the last time you did an incremental index?

Best
Erick

On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov <alexander.aris...@gmail.com> wrote:

Hi,

I am not using a database. All the data I need is in the Solr index; that's why I want to skip excessive checks.

I will check DIH, but I'm not sure it helps.

I am fluent with Java, and it's not a problem for me to write a class or so, but I want to check first whether there are any ways (workarounds) to make this work without coding, just by playing around with configuration and params. I don't want to move away from the default Solr implementation.

Best Regards
Alexander Aristov

On 27 December 2011 09:33, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote (replying inline):

On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <alexander.aris...@gmail.com> wrote:

> Hi people,
>
> I urgently need your help!
>
> I have Solr 3.3 configured and running. I do incremental indexing 4 times a day using bulk updates. Some documents are identical to some extent, and I wish to skip them, not index them. But here is the problem: I could not find a way to tell Solr to ignore new duplicate docs and keep the old indexed docs. I don't care that a doc is newer.
> Just determine by ID that such a document is already in the index - that's it.
>
> I use SolrJ for indexing. I have tried setting overwrite=false and the dedupe approach, but nothing helped me: either a newer doc overwrites the old one, or I get a duplicate.
>
> I think it's a very simple and basic feature, and it must exist. What did I do wrong, or what did I miss?

I guess that's because the mainstream approach is delta-import, where you have "updated" timestamps in your DB and a "last-import" timestamp stored somewhere. You can check how it works in DIH.

> Tried Google, but I couldn't find a solution there, although many people have encountered this problem.

It can definitely be done by overriding o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest starting with your own http://wiki.apache.org/solr/UpdateRequestProcessor - search for the PK and bypass the chain call if it's found. Then, if you hit performance issues querying your PKs one by one (but only after that), you can batch your searches; there are a couple of optimization techniques for huge disjunction queries like PK:(2 OR 4 OR 5 OR 6).
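A rough sketch of such a batched check with SolrJ (untested; "id" stands for your PK field, "server" is your SolrServer, and each batch has to stay under maxBooleanClauses - 1024 by default):

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;

public class BatchExistenceCheck {
  // returns the subset of candidate PKs that are already in the index
  public static Set<String> findExisting(SolrServer server, Collection<String> batch)
      throws Exception {
    Set<String> existing = new HashSet<String>();
    if (batch.isEmpty()) {
      return existing;
    }
    StringBuilder q = new StringBuilder("id:(");
    for (String id : batch) {
      q.append(ClientUtils.escapeQueryChars(id)).append(' ');
    }
    q.append(')');

    SolrQuery query = new SolrQuery(q.toString());
    query.setFields("id");        // fl=id, a minimal response
    query.setRows(batch.size());

    QueryResponse rsp = server.query(query);
    for (SolrDocument doc : rsp.getResults()) {
      existing.add((String) doc.getFieldValue("id"));
    }
    return existing; // skip everything in this set when building the next update
  }
}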
> I'm starting to think that I must query the index to check whether each doc to be added is already there, and not add it to the array if so, but I have so many docs that I'm afraid it's not a good solution.
>
> Best Regards
> Alexander Aristov

--
Lance Norskog
goks...@gmail.com

--
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics
<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>