The problem with dedupe (SignatureUpdateProcessor) is that it REPLACES old docs. I have tried it already.
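
For reference, here is a minimal sketch of the client-side alternative raised in this thread: look up the incoming IDs in batches with a disjunction query such as PK:(2 OR 4 OR 5 OR 6), then drop the docs that are already indexed before sending the rest. The class and method names are hypothetical, and the actual Solr lookup is left as a caller-supplied function (findExistingIds) that would wrap a SolrJ query, so nothing here is Solr's own API. Treat it as an illustration under those assumptions, not a tested solution.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: filters out docs whose ID is already in the index.
final class SkipExistingDocs {

    /** Builds queries like idField:("2" OR "4" OR "5"), at most batchSize IDs each. */
    static List<String> buildIdQueries(String idField, List<String> ids, int batchSize) {
        List<String> queries = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            List<String> batch = ids.subList(start, Math.min(start + batchSize, ids.size()));
            StringBuilder q = new StringBuilder(idField).append(":(");
            for (int i = 0; i < batch.size(); i++) {
                if (i > 0) {
                    q.append(" OR ");
                }
                // Quote each ID as a phrase; escape backslashes and double quotes.
                String escaped = batch.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
                q.append('"').append(escaped).append('"');
            }
            queries.add(q.append(')').toString());
        }
        return queries;
    }

    /**
     * Returns the docs whose IDs were not reported as existing, in insertion order.
     * findExistingIds is an assumption: it takes a query string and returns the
     * IDs found in the index (for example by running the query through SolrJ).
     */
    static <D> List<D> keepOnlyNew(String idField, Map<String, D> docsById, int batchSize,
                                   Function<String, Set<String>> findExistingIds) {
        Map<String, D> remaining = new LinkedHashMap<>(docsById);
        List<String> ids = new ArrayList<>(docsById.keySet());
        for (String query : buildIdQueries(idField, ids, batchSize)) {
            remaining.keySet().removeAll(findExistingIds.apply(query));
        }
        return new ArrayList<>(remaining.values());
    }
}
```

Only the docs returned by keepOnlyNew would then be added to the solrj array. The batch size should stay below the maxBooleanClauses setting in solrconfig.xml, and the lookup only sees docs that have already been committed, so duplicates inside one uncommitted batch would still have to be handled separately.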

Best Regards
Alexander Aristov


On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:

> The SignatureUpdateProcessor is for exactly this problem:
>
> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
>
> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
> <alexander.aris...@gmail.com> wrote:
> > I get docs from external sources and the only place I keep them is the Solr
> > index. I have no database or other means to track indexed docs (my
> > personal opinion is that it might be a huge headache).
> >
> > Some docs might change slightly in their original sources but I don't need
> > those changes. In fact I need the original data only.
> >
> > So I have no other way but to either check if a document is already in
> > the index before I put it into the solrj array (read: query Solr) or develop
> > my own update chain processor and implement the ID check there and skip such docs.
> >
> > Maybe it's the wrong place to argue and probably it's been discussed before, but
> > I wonder why simply the overwrite parameter doesn't work here.
> >
> > In my opinion it suits perfectly here. In combination with a unique ID it can
> > cover all possible variants.
> >
> > cases:
> >
> > 1. overwrite=true and uniqueID exists then newer doc should overwrite the
> > old one.
> >
> > 2. overwrite=false and uniqueID exists then newer doc must be skipped since
> > old exists.
> >
> > 3. uniqueID doesn't exist then newer doc just gets added regardless if old
> > exists or not.
> >
> > Best Regards
> > Alexander Aristov
> >
> > On 27 December 2011 22:53, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> Mikhail is right as far as I know, the assumption built into Solr is that
> >> duplicate IDs (when <uniqueKey> is defined) should trigger the old
> >> document to be replaced.
> >>
> >> what is your system-of-record? By that I mean what does your SolrJ
> >> program do to send data to Solr?
> >> Is there any way you could just
> >> *not* send documents that are already in the Solr index based on,
> >> for instance, any timestamp associated with your system-of-record
> >> and the last time you did an incremental index?
> >>
> >> Best
> >> Erick
> >>
> >> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
> >> <alexander.aris...@gmail.com> wrote:
> >> > Hi
> >> >
> >> > I am not using a database. All needed data is in the Solr index, that's why I
> >> > want to skip excessive checks.
> >> >
> >> > I will check DIH but not sure if it helps.
> >> >
> >> > I am fluent with Java and it's not a problem for me to write a class or so
> >> > but I want to check first whether there are any ways (workarounds) to make
> >> > it work without coding, just by playing around with configuration and
> >> > params. I don't want to go away from the default Solr implementation.
> >> >
> >> > Best Regards
> >> > Alexander Aristov
> >> >
> >> > On 27 December 2011 09:33, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
> >> >
> >> >> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
> >> >> alexander.aris...@gmail.com> wrote:
> >> >>
> >> >> > Hi people,
> >> >> >
> >> >> > I urgently need your help!
> >> >> >
> >> >> > I have solr 3.3 configured and running. I do incremental indexing 4 times a
> >> >> > day using bulk updates. Some documents are identical to some extent and I
> >> >> > wish to skip them, not to index.
> >> >> > But here is the problem as I could not find a way to tell solr to ignore new
> >> >> > duplicate docs and keep old indexed docs. I don't care that it's new.
> >> >> > Just determine by ID that such document is in the index already and that's it.
> >> >> >
> >> >> > I use solrj for indexing. I have tried setting overwrite=false and the dedupe
> >> >> > approach but nothing helped me. I either have that a newer doc overwrites
> >> >> > the old one or I get a duplicate.
> >> >> >
> >> >> > I think it's a very simple and basic feature and it must exist. What did I
> >> >> > do wrong or fail to do?
> >> >> >
> >> >>
> >> >> I guess, because the mainstream approach is delta-import, when you have
> >> >> "updated" timestamps in your DB and "last-import" timestamp stored
> >> >> somewhere. You can check how it works in DIH.
> >> >>
> >> >> > Tried google but I couldn't find a solution there although many people
> >> >> > encountered such problem.
> >> >> >
> >> >>
> >> >> it definitely can be done by overriding
> >> >> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest
> >> >> to start from implementing your own
> >> >> http://wiki.apache.org/solr/UpdateRequestProcessor - search for PK, bypass
> >> >> chain call if it's found. Then if you meet performance issues on querying
> >> >> your PKs one by one, (but only after that) you can batch your searches,
> >> >> there are couple of optimization techniques for huge disjunction queries
> >> >> like PK:(2 OR 4 OR 5 OR 6).
> >> >>
> >> >> > I start considering that I must query index to check if a doc to be added
> >> >> > is in the index already and do not add it to array but I have so many docs
> >> >> > that I am afraid it's not a good solution.
> >> >> >
> >> >> > Best Regards
> >> >> > Alexander Aristov
> >> >> >
> >> >>
> >> >> --
> >> >> Sincerely yours
> >> >> Mikhail Khludnev
> >> >> Lucid Certified
> >> >> Apache Lucene/Solr Developer
> >> >> Grid Dynamics
> >> >>
> >> >
> >
>
> --
> Lance Norskog
> goks...@gmail.com