Well, the first results are ready. I have implemented a custom update
processor following your suggestion, using a low-level index reader and
TermDocs.
I launched scripts which add about 10,000 docs. Indexing took about 1 minute
including the commit, which is quite good for me. I don't have larger datasets so
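A minimal sketch of what such a processor could look like, assuming
Solr/Lucene 3.x APIs and a uniqueKey field named "id" (the class and
package names are invented for illustration; this is not the actual code
from the thread):

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SkipExistingUpdateProcessorFactory
    extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new SkipExistingProcessor(req, next);
  }

  static class SkipExistingProcessor extends UpdateRequestProcessor {
    private final SolrQueryRequest req;

    SkipExistingProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
      super(next);
      this.req = req;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      String id = (String) cmd.solrDoc.getFieldValue("id"); // assumed uniqueKey
      SolrIndexSearcher searcher = req.getSearcher();
      // Low-level existence check: seek the uniqueKey term directly, with no
      // query parsing or scoring. TermDocs skips deleted docs automatically.
      // Caveat: docs added since the last commit are not visible here.
      TermDocs termDocs = searcher.getIndexReader().termDocs(new Term("id", id));
      try {
        if (termDocs.next()) {
          return;                      // already indexed: silently skip the add
        }
      } finally {
        termDocs.close();
      }
      super.processAdd(cmd);           // not found: pass the add down the chain
    }
  }
}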
I'd guess it would be much faster, assuming that the search savings
wouldn't be swamped by the additional transmission time over the wire and
parsing the request (although SolrJ uses a binary format, so parsing the
request probably isn't all that expensive).
You could even do a hybrid approach. Pack u
I have never developed for Solr before and don't know much about its
internals, but today I tried one approach with the searcher.
In my update processor I get a searcher and search for the ID. It works, but
I need to load test it. Will index traversal be faster (less resource
consuming) than search?
Best Regards
Alexander Aristov
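For comparison, the search-based check described here might look like the
following (a sketch only, Lucene/Solr 3.x APIs; the "id" field name and
class name are assumptions):

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.solr.search.SolrIndexSearcher;

public class SearchBasedCheck {
  // Goes through the normal query machinery (weight/scorer creation),
  // whereas a raw TermDocs lookup just seeks one term in the dictionary.
  public static boolean exists(SolrIndexSearcher searcher, String id)
      throws IOException {
    TotalHitCountCollector collector = new TotalHitCountCollector();
    searcher.search(new TermQuery(new Term("id", id)), collector);
    return collector.getTotalHits() > 0;
  }
}

Both variants end up reading the same term dictionary, so the traversal
mainly saves the query setup overhead; load testing is the right way to see
whether the difference matters for your data.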
Hmmm, we're not communicating ...
The update processor wouldn't search in the
classic sense. It would just use lower-level
index traversal to determine if the doc (identified
by your unique key) was already in the index
and skip indexing that document if it was. No real
*searching* involved (see T
Alexander,
I have two ideas for how to implement fast dedupe externally, assuming your
PKs don't fit into a java.util.*Map:
- your crawler can use an in-process RDBMS (Derby, H2) to track dupes
(sketched below);
- if your crawler is stateless - it doesn't track PKs that have been
already crawled - you can retrieve it
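A sketch of the first idea: an in-process H2 table whose primary key
constraint does the duplicate detection (the table, class, and method names
are invented for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SeenTracker {
  private final Connection conn;
  private final PreparedStatement insert;

  public SeenTracker(String dbPath) throws Exception {
    Class.forName("org.h2.Driver");            // H2 jar must be on the classpath
    conn = DriverManager.getConnection("jdbc:h2:" + dbPath);
    conn.createStatement().execute(
        "CREATE TABLE IF NOT EXISTS seen(pk VARCHAR(256) PRIMARY KEY)");
    insert = conn.prepareStatement("INSERT INTO seen(pk) VALUES (?)");
  }

  // Returns true the first time a PK is seen, false for a duplicate.
  public boolean firstTime(String pk) throws SQLException {
    insert.setString(1, pk);
    try {
      insert.executeUpdate();
      return true;
    } catch (SQLException duplicateKey) {      // PK violation: already crawled
      return false;
    }
  }
}

The crawler would call firstTime(pk) for each document and only send it to
Solr when the call returns true.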
Yes, I have been warned that querying the index each time before adding a
doc might be resource consuming. I will check it.
As for the overwrite parameter, I think the name is not the best then.
People outside the "business", like me, misuse it and assume what I wrote.
Overwrite should mean what it says.
Unfortunately I have a lot of duplicates, and given that searching might
suffer, I will try implementing an update processor.
But your idea is interesting and I will consider it, thanks.
Best Regards
Alexander Aristov
On 28 December 2011 19:12, Tanguy Moal wrote:
: That said, writing your own update request handler
: that detected this case isn't very difficult,
: extend UpdateRequestProcessorFactory/UpdateRequestProcessor
: and use it as a plugin.
I can't find the thread at the moment, but the general issue that has
caused people headaches with this typ
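Registering such a factory as a plugin is then a solrconfig.xml change
along these lines (the chain and class names here are examples, not from
the thread):

<updateRequestProcessorChain name="skipexisting">
  <processor class="com.example.SkipExistingUpdateProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The chain can be made the default (default="true") or selected per request
with the update.chain parameter (older releases used update.processor).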
Hello Alexander,
I don't know much about your requirements in terms of size and
performance, but I've had a similar use case and found a pretty simple
workaround.
If your duplicate rate is not too high, you can have the
SignatureProcessor generate a fingerprint of documents (you already did
Thanks Erick,
that sets my direction. I will write the new plugin and get back to the
dev forum with results, and then we will decide on next steps.
Best Regards
Alexander Aristov
On 28 December 2011 18:08, Erick Erickson wrote:
Well, the short answer is that nobody else has
1> had a similar requirement
AND
2> not found a suitable workaround
AND
3> implemented the change and contributed it back.
So, if you'd like to volunteer ...
Seriously. If you think this would be valuable and are
willing to work on it, hop on over
The problem with dedupe (SignatureUpdateProcessor) is that it REPLACES old
docs. I have tried it already.
Best Regards
Alexander Aristov
On 28 December 2011 13:04, Lance Norskog wrote:
The SignatureUpdateProcessor is for exactly this problem:
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
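The configuration described on that wiki page looks roughly like this (the
field choices here are illustrative):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

With overwriteDupes=true the newer document wins, which is exactly the
replacing behaviour Alexander objects to above; overwriteDupes=false only
stores the signature, so duplicates can instead be handled at query time.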
On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
wrote:
I get docs from external sources and the only place I keep them is the Solr
index. I have no database or other means to track indexed docs (my
personal opinion is that it would be a huge headache).
Some docs might change slightly in their original sources, but I don't need
those changes. In fact I ne
Mikhail is right as far as I know: the assumption built into Solr is that
duplicate IDs (when uniqueKey is defined) should trigger the old
document being replaced.
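That replace-on-duplicate-id behaviour is easy to demonstrate from SolrJ
(a sketch, SolrJ 3.x API; the URL and field names are just examples):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OverwriteDemo {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument first = new SolrInputDocument();
    first.addField("id", "42");               // "id" assumed to be the uniqueKey
    first.addField("title", "first version");
    server.add(first);

    SolrInputDocument second = new SolrInputDocument();
    second.addField("id", "42");              // same key: the old doc is replaced
    second.addField("title", "second version");
    server.add(second);

    server.commit();  // the index now holds only "second version"
  }
}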
What is your system-of-record? By that I mean, what does your SolrJ
program do to send data to Solr? Is there any way you could just
*not* send
Hi,
I am not using a database. All the needed data is in the Solr index; that's
why I want to skip excessive checks.
I will check DIH but I am not sure if it helps.
I am fluent in Java and it's not a problem for me to write a class or so,
but I want to check first whether there are any ways (workarounds) to make
On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
alexander.aris...@gmail.com> wrote:
> Hi people,
>
> I urgently need your help!
>
> I have Solr 3.3 configured and running. I do incremental indexing 4 times a
> day using bulk updates. Some documents are identical to some extent and I
> wish t