I have never developed for Solr yet and don't know much about its internals,
but today I tried one approach with a searcher. In my update processor I get
the searcher and search for the ID. It works, but I need to load-test it.
Will index traversal be faster (less resource-consuming) than a search?
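For reference, here is roughly what that processor looks like - a minimal
sketch only, so treat it with care (I'm assuming Solr 3.x APIs, that the
uniqueKey field is called "id", and the class name is made up). It is
registered as the first entry of an updateRequestProcessorChain in
solrconfig.xml:

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SkipExistingProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(final SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        // "id" is the uniqueKey field of my schema
        String id = (String) cmd.solrDoc.getFieldValue("id");
        // getFirstMatch returns an internal docid, or -1 if no doc has the term
        if (req.getSearcher().getFirstMatch(new Term("id", id)) == -1) {
          super.processAdd(cmd); // not indexed yet -> pass it down the chain
        }
        // otherwise silently drop the document
      }
    };
  }
}

One thing I already see: the searcher only covers committed documents, so
duplicates arriving within one uncommitted batch would still slip through.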
Best Regards
Alexander Aristov

On 29 December 2011 17:03, Erick Erickson <erickerick...@gmail.com> wrote:
> Hmmm, we're not communicating <G>...
>
> The update processor wouldn't search in the
> classic sense. It would just use lower-level
> index traversal to determine if the doc (identified
> by your unique key) was already in the index
> and skip indexing that document if it was. No real
> *searching* involved (see TermDocs.seek for one
> approach).
>
> The price would be that you are transmitting the
> document over to the Solr instance and then
> throwing it away.
>
> Best
> Erick
>
> On Thu, Dec 29, 2011 at 12:52 AM, Mikhail Khludnev
> <mkhlud...@griddynamics.com> wrote:
> > Alexander,
> >
> > I have two ideas for how to implement fast dedupe externally, assuming
> > your PKs don't fit into a java.util.*Map:
> >
> >   - your crawler can use an in-process RDBMS (Derby, H2) to track dupes;
> >   - if your crawler is stateless - i.e. it doesn't track which PKs have
> >   already been crawled - you can retrieve them from Solr via
> >   http://wiki.apache.org/solr/TermsComponent . That's blazingly fast, but
> >   there might be a problem with removed documents (I'm not sure). It can
> >   also lead to an OOMException (if you have too many PKs). Let me know if
> >   you need a workaround for one of these problems.
> >
> > If you choose internal dedupe (UpdateProcessor), please let me know if
> > querying one-by-one is too slow for you and you need to do it
> > page-by-page. I did some of this paging, and will do something similar
> > soon, so I'm interested in it.
> >
> > Regards
> >
> > On Thu, Dec 29, 2011 at 9:34 AM, Alexander Aristov <
> > alexander.aris...@gmail.com> wrote:
> >
> >> Unfortunately I have a lot of duplicates, and given that searching might
> >> suffer, I will try implementing an update processor.
> >>
> >> But your idea is interesting and I will consider it, thanks.
> >>
> >> Best Regards
> >> Alexander Aristov
> >>
> >>
> >> On 28 December 2011 19:12, Tanguy Moal <tanguy.m...@gmail.com> wrote:
> >>
> >> > Hello Alexander,
> >> >
> >> > I don't know much about your requirements in terms of size and
> >> > performance, but I've had a similar use case and found a pretty simple
> >> > workaround.
> >> > If your duplicate rate is not too high, you can have the
> >> > SignatureUpdateProcessor generate a fingerprint for each document (you
> >> > already did that).
> >> >
> >> > Simply turn off overwriting of duplicates; you can then rely on Solr's
> >> > grouping / field collapsing to group your search results by
> >> > fingerprint. You'll then have one document group per "real" document.
> >> > You can use group.sort to sort your groups by indexing date ascending,
> >> > and group.limit=1 to keep only the oldest one.
> >> > You can even use group.format=simple to serve results as if no
> >> > collapsing occurred, and use group.ngroups (/!\ could be expensive /!\)
> >> > to get the real number of deduplicated documents.
> >> >
> >> > Of course the index will be larger; as I said, I made no assumptions
> >> > regarding your operating requirements. And search can be a bit slower,
> >> > depending on the average rate of duplicated documents.
> >> > But you'd have your issue addressed by configuration tuning only...
> >> > Depending on your project's sizing, it could be time-saving.
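> >> >
> >> > Here is a sketch of such a request with solrj (I'm assuming the
> >> > fingerprint field is named "signature" and that you index a timestamp
> >> > in an "indexed_at" field - adjust the names to your schema):
> >> >
> >> > import org.apache.solr.client.solrj.SolrQuery;
> >> >
> >> > SolrQuery q = new SolrQuery("your query");
> >> > q.set("group", true);                  // enable grouping / field collapsing
> >> > q.set("group.field", "signature");     // one group per fingerprint
> >> > q.set("group.sort", "indexed_at asc"); // oldest document first in each group
> >> > q.set("group.limit", 1);               // keep only that oldest document
> >> > q.set("group.format", "simple");       // flat list, as if no collapsing occurred
> >> > q.set("group.ngroups", true);          // real count of deduplicated docs (expensive!)
> >> > // then run it through your usual SolrServer instance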
> >> >
> >> > The advantage is that you have the precious information of what
> >> > content is duplicated from where :-)
> >> >
> >> > Hope this helps,
> >> >
> >> > --
> >> > Tanguy
> >> >
> >> > Le 28/12/2011 15:45, Alexander Aristov a écrit :
> >> >
> >> >> Thanks Erick,
> >> >>
> >> >> It sets me a direction. I will be writing a new plugin and will get
> >> >> back to the dev forum with results, and then we will decide on next
> >> >> steps.
> >> >>
> >> >> Best Regards
> >> >> Alexander Aristov
> >> >>
> >> >>
> >> >> On 28 December 2011 18:08, Erick Erickson <erickerick...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> Well, the short answer is that nobody else has
> >> >>> 1> had a similar requirement
> >> >>> AND
> >> >>> 2> not found a suitable workaround
> >> >>> AND
> >> >>> 3> implemented the change and contributed it back.
> >> >>>
> >> >>> So, if you'd like to volunteer <G>.....
> >> >>>
> >> >>> Seriously. If you think this would be valuable and are
> >> >>> willing to work on it, hop on over to the dev list and
> >> >>> discuss it, open a JIRA and make it work. I'd start
> >> >>> by opening a discussion on the dev list before
> >> >>> opening a JIRA, just to get a sense of where the
> >> >>> snags would be in changing the Solr code, but that's
> >> >>> optional.
> >> >>>
> >> >>> That said, writing your own update request handler
> >> >>> that detects this case isn't very difficult:
> >> >>> extend UpdateRequestProcessorFactory/UpdateRequestProcessor
> >> >>> and use it as a plugin.
> >> >>>
> >> >>> Best
> >> >>> Erick
> >> >>>
> >> >>> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
> >> >>> <alexander.aris...@gmail.com> wrote:
> >> >>>
> >> >>>> The problem with dedupe (SignatureUpdateProcessor) is that it
> >> >>>> REPLACES old docs. I have tried it already.
> >> >>>>
> >> >>>> Best Regards
> >> >>>> Alexander Aristov
> >> >>>>
> >> >>>>
> >> >>>> On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:
> >> >>>>
> >> >>>>> The SignatureUpdateProcessor is for exactly this problem:
> >> >>>>>
> >> >>>>> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
> >> >>>>>
> >> >>>>> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
> >> >>>>> <alexander.aris...@gmail.com> wrote:
> >> >>>>>
> >> >>>>>> I get docs from external sources and the only place I keep them is
> >> >>>>>> the solr index. I have no database or other means to track indexed
> >> >>>>>> docs (my personal opinion is that it might be a huge headache).
> >> >>>>>>
> >> >>>>>> Some docs might change slightly in their original sources, but I
> >> >>>>>> don't need those changes. In fact I need the original data only.
> >> >>>>>>
> >> >>>>>> So I have no other way but to either check whether a document is
> >> >>>>>> already in the index before I put it into the solrj array (read:
> >> >>>>>> query solr), or develop my own update chain processor and
> >> >>>>>> implement an ID check there and skip such docs.
> >> >>>>>>
> >> >>>>>> Maybe it's the wrong place to argue, and probably it's been
> >> >>>>>> discussed before, but I wonder why the simple overwrite parameter
> >> >>>>>> doesn't work here.
> >> >>>>>>
> >> >>>>>> My opinion is that it perfectly suits this case. In combination
> >> >>>>>> with a unique ID it can cover all possible variants.
> >> >>>>>>
> >> >>>>>> Cases:
> >> >>>>>>
> >> >>>>>> 1. overwrite=true and uniqueID exists: the newer doc should
> >> >>>>>> overwrite the old one.
> >> >>>>>>
> >> >>>>>> 2. overwrite=false and uniqueID exists: the newer doc must be
> >> >>>>>> skipped since the old one exists.
> >> >>>>>>
> >> >>>>>> 3. uniqueID doesn't exist: the newer doc just gets added,
> >> >>>>>> regardless of whether an old one exists or not.
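> >> >>>>>>
> >> >>>>>> With solrj, case 2 would then look something like this (just a
> >> >>>>>> sketch of the behaviour I'd expect; "docs" and "server" stand for
> >> >>>>>> my usual batch and connection):
> >> >>>>>>
> >> >>>>>> import org.apache.solr.client.solrj.request.UpdateRequest;
> >> >>>>>>
> >> >>>>>> UpdateRequest up = new UpdateRequest();
> >> >>>>>> up.add(docs);                      // docs: Collection<SolrInputDocument>
> >> >>>>>> up.setParam("overwrite", "false"); // case 2: keep the old doc
> >> >>>>>> up.process(server);                // today this adds a duplicate instead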
> >> >>>>>>
> >> >>>>>> Best Regards
> >> >>>>>> Alexander Aristov
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On 27 December 2011 22:53, Erick Erickson
> >> >>>>>> <erickerick...@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>>> Mikhail is right, as far as I know; the assumption built into
> >> >>>>>>> Solr is that duplicate IDs (when <uniqueKey> is defined) should
> >> >>>>>>> trigger the old document to be replaced.
> >> >>>>>>>
> >> >>>>>>> What is your system-of-record? By that I mean: what does your
> >> >>>>>>> SolrJ program do to send data to Solr? Is there any way you could
> >> >>>>>>> just *not* send documents that are already in the Solr index
> >> >>>>>>> based on, for instance, any timestamp associated with your
> >> >>>>>>> system-of-record and the last time you did an incremental index?
> >> >>>>>>>
> >> >>>>>>> Best
> >> >>>>>>> Erick
> >> >>>>>>>
> >> >>>>>>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
> >> >>>>>>> <alexander.aris...@gmail.com> wrote:
> >> >>>>>>>
> >> >>>>>>>> Hi
> >> >>>>>>>>
> >> >>>>>>>> I am not using a database. All needed data is in the solr index;
> >> >>>>>>>> that's why I want to skip excessive checks.
> >> >>>>>>>>
> >> >>>>>>>> I will check DIH, but I'm not sure it helps.
> >> >>>>>>>>
> >> >>>>>>>> I am fluent in Java and it's not a problem for me to write a
> >> >>>>>>>> class or so, but I want to check first whether there are any
> >> >>>>>>>> ways (workarounds) to make it work without coding, just by
> >> >>>>>>>> playing around with configuration and params. I don't want to go
> >> >>>>>>>> away from the default solr implementation.
> >> >>>>>>>>
> >> >>>>>>>> Best Regards
> >> >>>>>>>> Alexander Aristov
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On 27 December 2011 09:33, Mikhail Khludnev <
> >> >>>>>>>> mkhlud...@griddynamics.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>>> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
> >> >>>>>>>>> alexander.aris...@gmail.com> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hi people,
> >> >>>>>>>>>>
> >> >>>>>>>>>> I urgently need your help!
> >> >>>>>>>>>>
> >> >>>>>>>>>> I have solr 3.3 configured and running. I do incremental
> >> >>>>>>>>>> indexing 4 times a day using bulk updates. Some documents are
> >> >>>>>>>>>> identical to some extent, and I wish to skip them, not index
> >> >>>>>>>>>> them.
> >> >>>>>>>>>> But here is the problem: I could not find a way to tell solr
> >> >>>>>>>>>> to ignore new duplicate docs and keep the old indexed docs. I
> >> >>>>>>>>>> don't care that it's newer. Just determine by ID that such a
> >> >>>>>>>>>> document is in the index already, and that's it.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I use solrj for indexing. I have tried setting overwrite=false
> >> >>>>>>>>>> and the dedupe approach, but nothing helped me. I either have
> >> >>>>>>>>>> a newer doc overwrite the old one, or I get a duplicate.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I think it's a very simple and basic feature and it must
> >> >>>>>>>>>> exist. What did I do wrong, or fail to do?
> >> >>>>>>>>>>
> >> >>>>>>>>> I guess because the mainstream approach is delta-import, when
> >> >>>>>>>>> you have "updated" timestamps in your DB and a "last-import"
> >> >>>>>>>>> timestamp stored somewhere. You can check how it works in DIH.
> >> >>>>>>>>>
> >> >>>>>>>>>> Tried Google but I couldn't find a solution there, although
> >> >>>>>>>>>> many people have encountered such a problem.
> >> >>>>>>>>>>
> >> >>>>>>>>> It can definitely be done by overriding
> >> >>>>>>>>> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but
> >> >>>>>>>>> I suggest starting by implementing your own
> >> >>>>>>>>> http://wiki.apache.org/solr/UpdateRequestProcessor - search for
> >> >>>>>>>>> the PK, and bypass the chain call if it's found. Then, if you
> >> >>>>>>>>> meet performance issues querying your PKs one by one (but only
> >> >>>>>>>>> after that), you can batch your searches; there are a couple of
> >> >>>>>>>>> optimization techniques for huge disjunction queries like
> >> >>>>>>>>> PK:(2 OR 4 OR 5 OR 6).
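> >> >>>>>>>>>
> >> >>>>>>>>> A rough sketch of such a batched check with solrj (the field
> >> >>>>>>>>> name "PK" and the page size are assumptions):
> >> >>>>>>>>>
> >> >>>>>>>>> import org.apache.solr.client.solrj.SolrQuery;
> >> >>>>>>>>> import org.apache.solr.client.solrj.util.ClientUtils;
> >> >>>>>>>>>
> >> >>>>>>>>> // ids: the next page of incoming PKs, e.g. 100 at a time
> >> >>>>>>>>> StringBuilder dj = new StringBuilder("PK:(");
> >> >>>>>>>>> for (String id : ids) {
> >> >>>>>>>>>   dj.append(ClientUtils.escapeQueryChars(id)).append(" OR ");
> >> >>>>>>>>> }
> >> >>>>>>>>> dj.setLength(dj.length() - 4); // drop the trailing " OR "
> >> >>>>>>>>> SolrQuery q = new SolrQuery(dj.append(")").toString());
> >> >>>>>>>>> q.setFields("PK");     // we only need the keys back
> >> >>>>>>>>> q.setRows(ids.size()); // every PK returned already exists -> skip it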
> >> >>>>>>>>>
> >> >>>>>>>>>> I'm starting to consider that I must query the index to check
> >> >>>>>>>>>> whether a doc to be added is already in the index, and not add
> >> >>>>>>>>>> it to the array, but I have so many docs that I'm afraid it's
> >> >>>>>>>>>> not a good solution.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Best Regards
> >> >>>>>>>>>> Alexander Aristov
> >> >>>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Sincerely yours
> >> >>>>>>>>> Mikhail Khludnev
> >> >>>>>>>>> Lucid Certified
> >> >>>>>>>>> Apache Lucene/Solr Developer
> >> >>>>>>>>> Grid Dynamics
> >> >>>>>
> >> >>>>> --
> >> >>>>> Lance Norskog
> >> >>>>> goks...@gmail.com
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Lucid Certified
> > Apache Lucene/Solr Developer
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> > <mkhlud...@griddynamics.com>