Unfortunately I have a lot of duplicates, and since searching might suffer from that, I will try implementing an update processor.
But your idea is interesting and I will consider it, thanks.

Best Regards
Alexander Aristov

On 28 December 2011 19:12, Tanguy Moal <tanguy.m...@gmail.com> wrote:

> Hello Alexander,
>
> I don't know much about your requirements in terms of size and
> performance, but I've had a similar use case and found a pretty simple
> workaround.
> If your duplicate rate is not too high, you can have the
> SignatureUpdateProcessor generate a fingerprint for each document (you
> already did that).
>
> Simply turn off overwriting of duplicates; you can then rely on Solr's
> grouping / field collapsing to group your search results by fingerprint.
> You'll then have one document group per "real" document. You can use
> group.sort to sort your groups by indexing date ascending, and
> group.limit=1 to keep only the oldest one.
> You can even use group.format=simple to serve results as if no
> collapsing occurred, and use group.ngroups (/!\ could be expensive /!\)
> to get the real number of deduplicated documents.
>
> Of course the index will be larger; as I said, I made no assumptions
> regarding your operating requirements. And search can be a bit slower,
> depending on the average rate of duplicated documents.
> But you've got your issue addressed by configuration tuning only...
> Depending on your project's sizing, it could save you time.
>
> The advantage is that you keep the precious information of what content
> is duplicated from where :-)
>
> Hope this helps,
>
> --
> Tanguy
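For reference, here is a minimal SolrJ sketch of the grouped query Tanguy describes. It assumes the dedupe chain writes its fingerprint to a field named "signature" and that documents carry an "indexed_at" timestamp; both field names, the query, and the URL are illustrative, not taken from the thread.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupedDedupeQuery {
    public static void main(String[] args) throws Exception {
        // Solr 3.x-era SolrJ client; the URL is illustrative.
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("text:something");
        // Collapse results on the fingerprint written by the dedupe chain.
        q.set("group", true);
        q.set("group.field", "signature");
        // Within each group, keep only the oldest document.
        q.set("group.sort", "indexed_at asc");
        q.set("group.limit", 1);
        // Serve results as a flat list, as if no collapsing had occurred.
        q.set("group.format", "simple");
        // Optional, and potentially expensive: the deduplicated total.
        q.set("group.ngroups", true);

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResponse());
    }
}

The same parameters can of course be appended to a plain /select URL; nothing here requires SolrJ.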
> On 28/12/2011 15:45, Alexander Aristov wrote:
>
>> Thanks Erick,
>>
>> it gives me a direction. I will write a new plugin and will get back to
>> the dev forum with results, and then we will decide on next steps.
>>
>> Best Regards
>> Alexander Aristov
>>
>> On 28 December 2011 18:08, Erick Erickson <erickerickson@gmail.com> wrote:
>>
>>> Well, the short answer is that nobody else has
>>> 1> had a similar requirement
>>> AND
>>> 2> not found a suitable workaround
>>> AND
>>> 3> implemented the change and contributed it back.
>>>
>>> So, if you'd like to volunteer <G>.....
>>>
>>> Seriously. If you think this would be valuable and are
>>> willing to work on it, hop on over to the dev list and
>>> discuss it, open a JIRA and make it work. I'd start
>>> by opening a discussion on the dev list before
>>> opening a JIRA, just to get a sense of where the
>>> snags would be in changing the Solr code, but that's
>>> optional.
>>>
>>> That said, writing your own update request processor
>>> that detects this case isn't very difficult:
>>> extend UpdateRequestProcessorFactory/UpdateRequestProcessor
>>> and use it as a plugin.
>>>
>>> Best
>>> Erick
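A minimal sketch of the plugin Erick suggests here (and Mikhail suggests further down): an update processor that looks up the document's unique key and silently drops the add when a matching document is already indexed. This is an untested illustration against the Solr 3.x plugin API, not a finished implementation; the class names are made up, and the factory would still need to be registered in an updateRequestProcessorChain in solrconfig.xml.

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Hypothetical "skip if already indexed" processor, per Erick's suggestion.
public class SkipExistingProcessorFactory extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new SkipExistingProcessor(req, next);
    }

    static class SkipExistingProcessor extends UpdateRequestProcessor {
        private final SolrQueryRequest req;

        SkipExistingProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
            super(next);
            this.req = req;
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            String keyField = req.getSchema().getUniqueKeyField().getName();
            Object id = doc.getFieldValue(keyField);
            // If a committed document already carries this ID, swallow the
            // add so the old document is kept; otherwise continue the chain.
            if (id != null && req.getSearcher()
                    .getFirstMatch(new Term(keyField, id.toString())) != -1) {
                return;
            }
            super.processAdd(cmd);
        }
    }
}

One caveat: the searcher only sees committed documents, so duplicates arriving within a single uncommitted batch would still need separate handling.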
>>> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
>>> <alexander.aris...@gmail.com> wrote:
>>>
>>>> the problem with dedupe (SignatureUpdateProcessor) is that it REPLACES
>>>> old docs. I have tried it already.
>>>>
>>>> Best Regards
>>>> Alexander Aristov
>>>>
>>>> On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:
>>>>
>>>>> The SignatureUpdateProcessor is for exactly this problem:
>>>>> http://wiki.apache.org/solr/Deduplication
>>>>>
>>>>> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
>>>>> <alexander.aris...@gmail.com> wrote:
>>>>>
>>>>>> I get docs from external sources and the only place I keep them is
>>>>>> the solr index. I have no database or other means of tracking indexed
>>>>>> docs (my personal opinion is that it might be a huge headache).
>>>>>>
>>>>>> Some docs might change slightly in their original sources, but I
>>>>>> don't need those changes. In fact I need the original data only.
>>>>>>
>>>>>> So I have no other way but to either check whether a document is
>>>>>> already in the index before I put it into the solrj array (read:
>>>>>> query solr), or develop my own update chain processor and implement
>>>>>> an ID check there and skip such docs.
>>>>>>
>>>>>> Maybe this is the wrong place to argue, and probably it's been
>>>>>> discussed before, but I wonder why the simple overwrite parameter
>>>>>> doesn't work here.
>>>>>>
>>>>>> My opinion is that it suits here perfectly. In combination with the
>>>>>> unique ID it can cover all possible variants.
>>>>>>
>>>>>> Cases:
>>>>>>
>>>>>> 1. overwrite=true and uniqueID exists: the newer doc should overwrite
>>>>>> the old one.
>>>>>>
>>>>>> 2. overwrite=false and uniqueID exists: the newer doc must be skipped
>>>>>> since the old one exists.
>>>>>>
>>>>>> 3. uniqueID doesn't exist: the newer doc just gets added, regardless
>>>>>> of whether an old one exists or not.
>>>>>>
>>>>>> Best Regards
>>>>>> Alexander Aristov
>>>>>>
>>>>>> On 27 December 2011 22:53, Erick Erickson <erickerickson@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Mikhail is right as far as I know; the assumption built into Solr is
>>>>>>> that duplicate IDs (when <uniqueKey> is defined) should trigger the
>>>>>>> old document to be replaced.
>>>>>>>
>>>>>>> What is your system-of-record? By that I mean, what does your SolrJ
>>>>>>> program do to send data to Solr? Is there any way you could just
>>>>>>> *not* send documents that are already in the Solr index based on,
>>>>>>> for instance, any timestamp associated with your system-of-record
>>>>>>> and the last time you did an incremental index?
>>>>>>>
>>>>>>> Best
>>>>>>> Erick
>>>>>>>
>>>>>>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
>>>>>>> <alexander.aris...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I am not using a database. All the needed data is in the solr
>>>>>>>> index; that's why I want to skip excessive checks.
>>>>>>>>
>>>>>>>> I will check DIH, but I am not sure it helps.
>>>>>>>>
>>>>>>>> I am fluent in Java and it's not a problem for me to write a class
>>>>>>>> or so, but I want to check first whether there are ways
>>>>>>>> (workarounds) to make it work without coding, just by playing
>>>>>>>> around with configuration and params. I don't want to go away from
>>>>>>>> the default solr implementation.
>>>>>>>>
>>>>>>>> Best Regards
>>>>>>>> Alexander Aristov
>>>>>>>>
>>>>>>>> On 27 December 2011 09:33, Mikhail Khludnev
>>>>>>>> <mkhlud...@griddynamics.com> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov
>>>>>>>>> <alexander.aris...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi people,
>>>>>>>>>>
>>>>>>>>>> I urgently need your help!
>>>>>>>>>>
>>>>>>>>>> I have solr 3.3 configured and running. I do incremental indexing
>>>>>>>>>> 4 times a day using bulk updates. Some documents are identical to
>>>>>>>>>> some extent and I wish to skip them, not index them.
>>>>>>>>>> But here is the problem: I could not find a way to tell solr to
>>>>>>>>>> ignore new duplicate docs and keep the old indexed docs. I don't
>>>>>>>>>> care that a doc is newer. Just determine by ID that such a
>>>>>>>>>> document is in the index already, and that's it.
>>>>>>>>>> I use solrj for indexing. I have tried setting overwrite=false
>>>>>>>>>> and the dedupe approach, but nothing helped me. Either a newer
>>>>>>>>>> doc overwrites the old one or I get a duplicate.
>>>>>>>>>>
>>>>>>>>>> I think it's a very simple and basic feature and it must exist.
>>>>>>>>>> What did I do wrong or fail to do?
>>>>>>>>>>
>>>>>>>>> I guess because the mainstream approach is delta-import, where you
>>>>>>>>> have "updated" timestamps in your DB and a "last-import" timestamp
>>>>>>>>> stored somewhere. You can check how it works in DIH.
>>>>>>>>>
>>>>>>>>>> Tried google, but I couldn't find a solution there although many
>>>>>>>>>> people have encountered this problem.
>>>>>>>>>>
>>>>>>>>> It can definitely be done by overriding
>>>>>>>>> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I
>>>>>>>>> suggest starting with your own
>>>>>>>>> http://wiki.apache.org/solr/UpdateRequestProcessor - search for
>>>>>>>>> the PK and bypass the chain call if it's found. Then, if you meet
>>>>>>>>> performance issues querying your PKs one by one (but only after
>>>>>>>>> that), you can batch your searches; there are a couple of
>>>>>>>>> optimization techniques for huge disjunction queries like
>>>>>>>>> PK:(2 OR 4 OR 5 OR 6).
>>>>>>>>>
>>>>>>>>>> I am starting to consider that I must query the index to check
>>>>>>>>>> whether a doc to be added is already there, and not add it to the
>>>>>>>>>> array, but I have so many docs that I am afraid it's not a good
>>>>>>>>>> solution.
>>>>>>>>>>
>>>>>>>>>> Best Regards
>>>>>>>>>> Alexander Aristov
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sincerely yours
>>>>>>>>> Mikhail Khludnev
>>>>>>>>> Lucid Certified
>>>>>>>>> Apache Lucene/Solr Developer
>>>>>>>>> Grid Dynamics
>>>>>
>>>>> --
>>>>> Lance Norskog
>>>>> goks...@gmail.com
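And a sketch of the client-side batching Mikhail mentions at the bottom of the thread: check a batch of candidate IDs with one disjunction query instead of one query per document. The helper below is hypothetical, assuming the unique key field is named "id"; IDs containing query syntax would need escaping.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;

public class ExistingIdFilter {

    /** Returns the subset of candidate IDs that are already in the index. */
    public static Set<String> findExisting(SolrServer solr, List<String> ids)
            throws Exception {
        Set<String> existing = new HashSet<String>();
        final int batchSize = 100; // keep each disjunction well under maxBooleanClauses
        for (int i = 0; i < ids.size(); i += batchSize) {
            List<String> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
            // Build a PK:(a OR b OR c) style query over this batch.
            StringBuilder q = new StringBuilder("id:(");
            for (int j = 0; j < batch.size(); j++) {
                if (j > 0) q.append(" OR ");
                q.append('"').append(batch.get(j)).append('"');
            }
            q.append(')');
            SolrQuery query = new SolrQuery(q.toString());
            query.setFields("id");       // only the key is needed back
            query.setRows(batch.size());
            for (SolrDocument d : solr.query(query).getResults()) {
                existing.add((String) d.getFieldValue("id"));
            }
        }
        return existing;
    }
}

Documents whose IDs come back in the returned set are simply never added to the SolrJ array, which matches the "query solr before adding" option Alexander describes above.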