Hi,

I am not using a database. All the needed data is in the Solr index; that's why I want to skip excessive checks.
I will check DIH, but I am not sure it will help. I am fluent in Java, and writing a class or so is not a problem for me, but first I want to check whether there are any ways (workarounds) to make it work without coding, just by playing around with configuration and params. I don't want to move away from the default Solr implementation.

Best Regards
Alexander Aristov

On 27 December 2011 09:33, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:

> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <alexander.aris...@gmail.com> wrote:
>
> > Hi people,
> >
> > I urgently need your help!
> >
> > I have Solr 3.3 configured and running. I do incremental indexing 4 times a day using bulk updates. Some documents are identical to some extent and I wish to skip them, not to index.
> > But here is the problem: I could not find a way to tell Solr to ignore new duplicate docs and keep the old indexed docs. I don't care that it's new. Just determine by ID that such a document is in the index already and that's it.
> >
> > I use solrj for indexing. I have tried setting overwrite=false and the dedupe approach, but nothing helped me. Either a newer doc overwrites the old one or I get a duplicate.
> >
> > I think it's a very simple and basic feature and it must exist. What did I do wrong or fail to do?
>
> I guess it's because the mainstream approach is delta-import, when you have "updated" timestamps in your DB and a "last-import" timestamp stored somewhere. You can check how it works in DIH.
>
> > Tried Google but I couldn't find a solution there, although many people have encountered this problem.
>
> It definitely can be done by overriding o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest starting by implementing your own http://wiki.apache.org/solr/UpdateRequestProcessor - search for the PK, and bypass the chain call if it's found. Then, if you meet performance issues on querying your PKs one by one (but only after that), you can batch your searches; there are a couple of optimization techniques for huge disjunction queries like PK:(2 OR 4 OR 5 OR 6).
>
> > I am starting to think that I must query the index to check whether a doc to be added is in the index already and not add it to the array, but I have so many docs that I am afraid it's not a good solution.
> >
> > Best Regards
> > Alexander Aristov
>
> --
> Sincerely yours
> Mikhail Khludnev
> Lucid Certified
> Apache Lucene/Solr Developer
> Grid Dynamics
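
For illustration, the batching idea from Mikhail's reply (one disjunction query per batch of IDs, then skipping documents that are already indexed) might look roughly like the sketch below. It is a standalone model in plain Java and does not call Solr. The class name `DedupeSketch`, the methods `buildIdQuery` and `keepNew`, and the representation of a document as a `Map` are all hypothetical. In a real setup, the set of existing IDs would come from running the generated query against the index, and the filtering could live either in the indexing client or in a custom UpdateRequestProcessor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical helper, not part of Solr or SolrJ.
final class DedupeSketch {

    /** Builds one disjunction query, e.g. PK:("2" OR "4"), for a batch of IDs. */
    static String buildIdQuery(String idField, Collection<String> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("need at least one id");
        }
        StringJoiner joiner = new StringJoiner(" OR ", idField + ":(", ")");
        for (String id : ids) {
            // Quote each value so IDs containing spaces or operators stay literal.
            joiner.add("\"" + id.replace("\\", "\\\\").replace("\"", "\\\"") + "\"");
        }
        return joiner.toString();
    }

    /**
     * Returns the docs whose ID is not in existingIds, keeping the first
     * occurrence when the same ID appears more than once in the batch.
     */
    static List<Map<String, Object>> keepNew(String idField,
                                             List<Map<String, Object>> docs,
                                             Set<String> existingIds) {
        Set<String> seen = new HashSet<>(existingIds);
        List<Map<String, Object>> fresh = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            String id = String.valueOf(doc.get(idField));
            if (seen.add(id)) { // add() returns false when the ID is already known
                fresh.add(doc);
            }
        }
        return fresh;
    }

    /** Small convenience for building an example document. */
    static Map<String, Object> doc(String id, String title) {
        Map<String, Object> d = new LinkedHashMap<>();
        d.put("PK", id);
        d.put("title", title);
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> batch = Arrays.asList(
                doc("2", "old"), doc("7", "new"), doc("4", "old"));

        List<String> ids = new ArrayList<>();
        for (Map<String, Object> d : batch) {
            ids.add(String.valueOf(d.get("PK")));
        }
        // This query string would be sent to the index to learn which IDs exist.
        System.out.println(buildIdQuery("PK", ids));

        // Assumption: the index reported that IDs 2 and 4 are already present.
        Set<String> existing = new HashSet<>(Arrays.asList("2", "4"));
        System.out.println(keepNew("PK", batch, existing));
    }
}
```

Only the documents returned by `keepNew` would then be sent for indexing, so already indexed documents are never overwritten. The batch size for the lookup query is a tuning choice this sketch does not address.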