I have never developed for Solr yet and don't know much about its internals,
but today I tried one approach with a searcher. In my update processor I get
the searcher and search for the ID. It works, but I need to load-test it.
Will index traversal be faster (less resource-consuming) than a search?
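For reference, here is roughly what that processor looks like - a minimal
sketch only, so treat it with care (I'm assuming Solr 3.x APIs, that the
uniqueKey field is called "id", and the class name is made up). It is
registered as the first entry of an updateRequestProcessorChain in
solrconfig.xml:

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SkipExistingProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(final SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        // "id" is the uniqueKey field of my schema
        String id = (String) cmd.solrDoc.getFieldValue("id");
        // getFirstMatch returns an internal docid, or -1 if no doc has the term
        if (req.getSearcher().getFirstMatch(new Term("id", id)) == -1) {
          super.processAdd(cmd); // not indexed yet -> pass it down the chain
        }
        // otherwise silently drop the document
      }
    };
  }
}

One thing I already see: the searcher only covers committed documents, so
duplicates arriving within one uncommitted batch would still slip through.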
Best Regards
Alexander Aristov

On 29 December 2011 17:03, Erick Erickson <erickerick...@gmail.com> wrote:
> Hmmm, we're not communicating <G>...
>
> The update processor wouldn't search in the
> classic sense. It would just use lower-level
> index traversal to determine if the doc (identified
> by your unique key) was already in the index
> and skip indexing that document if it was. No real
> *searching* involved (see TermDocs.seek for one
> approach).
>
> The price would be that you are transmitting the
> document over to the Solr instance and then
> throwing it away.
>
> Best
> Erick
>
> On Thu, Dec 29, 2011 at 12:52 AM, Mikhail Khludnev
> <mkhlud...@griddynamics.com> wrote:
> > Alexander,
> >
> > I have two ideas for how to implement fast dedupe externally, assuming
> > your PKs don't fit into a java.util.*Map:
> >
> >   - your crawler can use an in-process RDBMS (Derby, H2) to track dupes;
> >   - if your crawler is stateless - i.e. it doesn't track which PKs have
> >   already been crawled - you can retrieve them from Solr via
> >   http://wiki.apache.org/solr/TermsComponent . That's blazingly fast, but
> >   there might be a problem with removed documents (I'm not sure). It can
> >   also lead to an OOMException (if you have too many PKs). Let me know if
> >   you need a workaround for one of these problems.
> >
> > If you choose internal dedupe (UpdateProcessor), please let me know if
> > querying one-by-one is too slow for you and you need to do it
> > page-by-page. I did some of this paging, and will do something similar
> > soon, so I'm interested in it.
> >
> > Regards
> >
> > On Thu, Dec 29, 2011 at 9:34 AM, Alexander Aristov <
> > alexander.aris...@gmail.com> wrote:
> >
> >> Unfortunately I have a lot of duplicates, and given that searching might
> >> suffer, I will try implementing an update processor.
> >>
> >> But your idea is interesting and I will consider it, thanks.
> >>
> >> Best Regards
> >> Alexander Aristov
> >>
> >>
> >> On 28 December 2011 19:12, Tanguy Moal <tanguy.m...@gmail.com> wrote:
> >>
> >> > Hello Alexander,
> >> >
> >> > I don't know much about your requirements in terms of size and
> >> > performance, but I've had a similar use case and found a pretty simple
> >> > workaround.
> >> > If your duplicate rate is not too high, you can have the
> >> > SignatureUpdateProcessor generate a fingerprint for each document (you
> >> > already did that).
> >> >
> >> > Simply turn off overwriting of duplicates; you can then rely on Solr's
> >> > grouping / field collapsing to group your search results by
> >> > fingerprint. You'll then have one document group per "real" document.
> >> > You can use group.sort to sort your groups by indexing date ascending,
> >> > and group.limit=1 to keep only the oldest one.
> >> > You can even use group.format=simple to serve results as if no
> >> > collapsing occurred, and use group.ngroups (/!\ could be expensive /!\)
> >> > to get the real number of deduplicated documents.
> >> >
> >> > Of course the index will be larger; as I said, I made no assumptions
> >> > regarding your operating requirements. And search can be a bit slower,
> >> > depending on the average rate of duplicated documents.
> >> > But you'd have your issue addressed by configuration tuning only...
> >> > Depending on your project's sizing, it could be time-saving.
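> >> >
> >> > Here is a sketch of such a request with solrj (I'm assuming the
> >> > fingerprint field is named "signature" and that you index a timestamp
> >> > in an "indexed_at" field - adjust the names to your schema):
> >> >
> >> > import org.apache.solr.client.solrj.SolrQuery;
> >> >
> >> > SolrQuery q = new SolrQuery("your query");
> >> > q.set("group", true);                  // enable grouping / field collapsing
> >> > q.set("group.field", "signature");     // one group per fingerprint
> >> > q.set("group.sort", "indexed_at asc"); // oldest document first in each group
> >> > q.set("group.limit", 1);               // keep only that oldest document
> >> > q.set("group.format", "simple");       // flat list, as if no collapsing occurred
> >> > q.set("group.ngroups", true);          // real count of deduplicated docs (expensive!)
> >> > // then run it through your usual SolrServer instance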
> >> >
> >> > The advantage is that you have the precious information of what
> >> > content is duplicated from where :-)
> >> >
> >> > Hope this helps,
> >> >
> >> > --
> >> > Tanguy
> >> >
> >> > Le 28/12/2011 15:45, Alexander Aristov a écrit :
> >> >
> >> >> Thanks Erick,
> >> >>
> >> >> It sets me a direction. I will be writing a new plugin and will get
> >> >> back to the dev forum with results, and then we will decide on next
> >> >> steps.
> >> >>
> >> >> Best Regards
> >> >> Alexander Aristov
> >> >>
> >> >>
> >> >> On 28 December 2011 18:08, Erick Erickson <erickerick...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> Well, the short answer is that nobody else has
> >> >>> 1> had a similar requirement
> >> >>> AND
> >> >>> 2> not found a suitable workaround
> >> >>> AND
> >> >>> 3> implemented the change and contributed it back.
> >> >>>
> >> >>> So, if you'd like to volunteer <G>.....
> >> >>>
> >> >>> Seriously. If you think this would be valuable and are
> >> >>> willing to work on it, hop on over to the dev list and
> >> >>> discuss it, open a JIRA and make it work. I'd start
> >> >>> by opening a discussion on the dev list before
> >> >>> opening a JIRA, just to get a sense of where the
> >> >>> snags would be in changing the Solr code, but that's
> >> >>> optional.
> >> >>>
> >> >>> That said, writing your own update request handler
> >> >>> that detects this case isn't very difficult:
> >> >>> extend UpdateRequestProcessorFactory/UpdateRequestProcessor
> >> >>> and use it as a plugin.
> >> >>>
> >> >>> Best
> >> >>> Erick
> >> >>>
> >> >>> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
> >> >>> <alexander.aris...@gmail.com> wrote:
> >> >>>
> >> >>>> The problem with dedupe (SignatureUpdateProcessor) is that it
> >> >>>> REPLACES old docs. I have tried it already.
> >> >>>>
> >> >>>> Best Regards
> >> >>>> Alexander Aristov
> >> >>>>
> >> >>>>
> >> >>>> On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:
> >> >>>>
> >> >>>>> The SignatureUpdateProcessor is for exactly this problem:
> >> >>>>>
> >> >>>>> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
> >> >>>>>
> >> >>>>> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
> >> >>>>> <alexander.aris...@gmail.com> wrote:
> >> >>>>>
> >> >>>>>> I get docs from external sources and the only place I keep them is
> >> >>>>>> the solr index. I have no database or other means to track indexed
> >> >>>>>> docs (my personal opinion is that it might be a huge headache).
> >> >>>>>>
> >> >>>>>> Some docs might change slightly in their original sources, but I
> >> >>>>>> don't need those changes. In fact I need the original data only.
> >> >>>>>>
> >> >>>>>> So I have no other way but to either check whether a document is
> >> >>>>>> already in the index before I put it into the solrj array (read:
> >> >>>>>> query solr), or develop my own update chain processor and
> >> >>>>>> implement an ID check there and skip such docs.
> >> >>>>>>
> >> >>>>>> Maybe it's the wrong place to argue, and probably it's been
> >> >>>>>> discussed before, but I wonder why the simple overwrite parameter
> >> >>>>>> doesn't work here.
> >> >>>>>>
> >> >>>>>> My opinion is that it perfectly suits this case. In combination
> >> >>>>>> with a unique ID it can cover all possible variants.
> >> >>>>>>
> >> >>>>>> Cases:
> >> >>>>>>
> >> >>>>>> 1. overwrite=true and uniqueID exists: the newer doc should
> >> >>>>>> overwrite the old one.
> >> >>>>>>
> >> >>>>>> 2. overwrite=false and uniqueID exists: the newer doc must be
> >> >>>>>> skipped since the old one exists.
> >> >>>>>>
> >> >>>>>> 3. uniqueID doesn't exist: the newer doc just gets added,
> >> >>>>>> regardless of whether an old one exists or not.
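> >> >>>>>>
> >> >>>>>> With solrj, case 2 would then look something like this (just a
> >> >>>>>> sketch of the behaviour I'd expect; "docs" and "server" stand for
> >> >>>>>> my usual batch and connection):
> >> >>>>>>
> >> >>>>>> import org.apache.solr.client.solrj.request.UpdateRequest;
> >> >>>>>>
> >> >>>>>> UpdateRequest up = new UpdateRequest();
> >> >>>>>> up.add(docs);                      // docs: Collection<SolrInputDocument>
> >> >>>>>> up.setParam("overwrite", "false"); // case 2: keep the old doc
> >> >>>>>> up.process(server);                // today this adds a duplicate instead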
> >> >>>>>>
> >> >>>>>> Best Regards
> >> >>>>>> Alexander Aristov
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On 27 December 2011 22:53, Erick Erickson
> >> >>>>>> <erickerick...@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>>> Mikhail is right, as far as I know; the assumption built into
> >> >>>>>>> Solr is that duplicate IDs (when <uniqueKey> is defined) should
> >> >>>>>>> trigger the old document to be replaced.
> >> >>>>>>>
> >> >>>>>>> What is your system-of-record? By that I mean: what does your
> >> >>>>>>> SolrJ program do to send data to Solr? Is there any way you could
> >> >>>>>>> just *not* send documents that are already in the Solr index
> >> >>>>>>> based on, for instance, any timestamp associated with your
> >> >>>>>>> system-of-record and the last time you did an incremental index?
> >> >>>>>>>
> >> >>>>>>> Best
> >> >>>>>>> Erick
> >> >>>>>>>
> >> >>>>>>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
> >> >>>>>>> <alexander.aris...@gmail.com> wrote:
> >> >>>>>>>
> >> >>>>>>>> Hi
> >> >>>>>>>>
> >> >>>>>>>> I am not using a database. All needed data is in the solr index;
> >> >>>>>>>> that's why I want to skip excessive checks.
> >> >>>>>>>>
> >> >>>>>>>> I will check DIH, but I'm not sure it helps.
> >> >>>>>>>>
> >> >>>>>>>> I am fluent in Java and it's not a problem for me to write a
> >> >>>>>>>> class or so, but I want to check first whether there are any
> >> >>>>>>>> ways (workarounds) to make it work without coding, just by
> >> >>>>>>>> playing around with configuration and params. I don't want to go
> >> >>>>>>>> away from the default solr implementation.
> >> >>>>>>>>
> >> >>>>>>>> Best Regards
> >> >>>>>>>> Alexander Aristov
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On 27 December 2011 09:33, Mikhail Khludnev <
> >> >>>>>>>> mkhlud...@griddynamics.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>>> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
> >> >>>>>>>>> alexander.aris...@gmail.com> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hi people,
> >> >>>>>>>>>>
> >> >>>>>>>>>> I urgently need your help!
> >> >>>>>>>>>>
> >> >>>>>>>>>> I have solr 3.3 configured and running. I do incremental
> >> >>>>>>>>>> indexing 4 times a day using bulk updates. Some documents are
> >> >>>>>>>>>> identical to some extent, and I wish to skip them, not index
> >> >>>>>>>>>> them.
> >> >>>>>>>>>> But here is the problem: I could not find a way to tell solr
> >> >>>>>>>>>> to ignore new duplicate docs and keep the old indexed docs. I
> >> >>>>>>>>>> don't care that it's newer. Just determine by ID that such a
> >> >>>>>>>>>> document is in the index already, and that's it.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I use solrj for indexing. I have tried setting overwrite=false
> >> >>>>>>>>>> and the dedupe approach, but nothing helped me. I either have
> >> >>>>>>>>>> a newer doc overwrite the old one, or I get a duplicate.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I think it's a very simple and basic feature and it must
> >> >>>>>>>>>> exist. What did I do wrong, or fail to do?
> >> >>>>>>>>>>
> >> >>>>>>>>> I guess because the mainstream approach is delta-import, when
> >> >>>>>>>>> you have "updated" timestamps in your DB and a "last-import"
> >> >>>>>>>>> timestamp stored somewhere. You can check how it works in DIH.
> >> >>>>>>>>>
> >> >>>>>>>>>> Tried Google but I couldn't find a solution there, although
> >> >>>>>>>>>> many people have encountered such a problem.
> >> >>>>>>>>>>
> >> >>>>>>>>> It can definitely be done by overriding
> >> >>>>>>>>> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but
> >> >>>>>>>>> I suggest starting by implementing your own
> >> >>>>>>>>> http://wiki.apache.org/solr/UpdateRequestProcessor - search for
> >> >>>>>>>>> the PK, and bypass the chain call if it's found. Then, if you
> >> >>>>>>>>> meet performance issues querying your PKs one by one (but only
> >> >>>>>>>>> after that), you can batch your searches; there are a couple of
> >> >>>>>>>>> optimization techniques for huge disjunction queries like
> >> >>>>>>>>> PK:(2 OR 4 OR 5 OR 6).
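> >> >>>>>>>>>
> >> >>>>>>>>> A rough sketch of such a batched check with solrj (the field
> >> >>>>>>>>> name "PK" and the page size are assumptions):
> >> >>>>>>>>>
> >> >>>>>>>>> import org.apache.solr.client.solrj.SolrQuery;
> >> >>>>>>>>> import org.apache.solr.client.solrj.util.ClientUtils;
> >> >>>>>>>>>
> >> >>>>>>>>> // ids: the next page of incoming PKs, e.g. 100 at a time
> >> >>>>>>>>> StringBuilder dj = new StringBuilder("PK:(");
> >> >>>>>>>>> for (String id : ids) {
> >> >>>>>>>>>   dj.append(ClientUtils.escapeQueryChars(id)).append(" OR ");
> >> >>>>>>>>> }
> >> >>>>>>>>> dj.setLength(dj.length() - 4); // drop the trailing " OR "
> >> >>>>>>>>> SolrQuery q = new SolrQuery(dj.append(")").toString());
> >> >>>>>>>>> q.setFields("PK");     // we only need the keys back
> >> >>>>>>>>> q.setRows(ids.size()); // every PK returned already exists -> skip it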
> >> >>>>>>>>>
> >> >>>>>>>>>> I'm starting to consider that I must query the index to check
> >> >>>>>>>>>> whether a doc to be added is already in the index, and not add
> >> >>>>>>>>>> it to the array, but I have so many docs that I'm afraid it's
> >> >>>>>>>>>> not a good solution.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Best Regards
> >> >>>>>>>>>> Alexander Aristov
> >> >>>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Sincerely yours
> >> >>>>>>>>> Mikhail Khludnev
> >> >>>>>>>>> Lucid Certified
> >> >>>>>>>>> Apache Lucene/Solr Developer
> >> >>>>>>>>> Grid Dynamics
> >> >>>>>
> >> >>>>> --
> >> >>>>> Lance Norskog
> >> >>>>> goks...@gmail.com
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Lucid Certified
> > Apache Lucene/Solr Developer
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> > <mkhlud...@griddynamics.com>