Hello Alexander,
I don't know much about your requirements in terms of size and
performance, but I've had a similar use case and found a pretty simple
workaround.
If your duplicate rate is not too high, you can have the
SignatureUpdateProcessor generate a fingerprint for each document (you
already did that).
Simply turn off overwriting of duplicates; you can then rely on Solr's
grouping / field collapsing to group your search results by
fingerprint. You'll then have one document group per "real" document.
You can use group.sort to sort your groups by indexing date ascending,
and group.limit=1 to keep only the oldest one.
You can even use group.format=simple to serve the results as if no
collapsing occurred, and use group.ngroups (/!\ can be expensive /!\)
to get the real number of deduplicated documents.
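For illustration, here is a minimal SolrJ sketch of that grouping query. It
assumes the dedupe chain is configured with overwriteDupes=false, that the
fingerprint field is named "signature", and that an indexing-date field named
"indexed_at" exists; all three names are assumptions to adapt to your schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupedSearch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("your query here");
        q.set("group", true);                   // enable grouping / field collapsing
        q.set("group.field", "signature");      // fingerprint field filled by the signature processor
        q.set("group.sort", "indexed_at asc");  // oldest document first within each group
        q.set("group.limit", 1);                // keep only the oldest document per group
        q.set("group.format", "simple");        // flat result list, as if no collapsing occurred
        q.set("group.ngroups", true);           // real number of deduplicated docs (can be expensive)

        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResponse());  // raw response; parse the grouped section as needed
    }
}

The same parameters can of course be set directly on the request URL if you are
not querying through SolrJ.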
Of course the index will be larger; as I said, I made no assumptions
regarding your operating requirements. And search can be a bit slower,
depending on the average rate of duplicated documents.
But your issue gets addressed by configuration tuning only...
Depending on your project's sizing, that could save you time.
The added advantage is that you keep the precious information of which
content is duplicated from where :-)
Hope this helps,
--
Tanguy
On 28/12/2011 15:45, Alexander Aristov wrote:
Thanks Erick,
that gives me a direction. I will write a new plugin and get back to the
dev forum with the results, and then we will decide on next steps.
Best Regards
Alexander Aristov
On 28 December 2011 18:08, Erick Erickson <erickerick...@gmail.com> wrote:
Well, the short answer is that nobody else has
1> had a similar requirement
AND
2> not found a suitable workaround
AND
3> implemented the change and contributed it back.
So, if you'd like to volunteer <G>.....
Seriously. If you think this would be valuable and are
willing to work on it, hop on over to the dev list and
discuss it, open a JIRA and make it work. I'd start
by opening a discussion on the dev list before
opening a JIRA, just to get a sense of where the
snags would be to changing the Solr code, but that's
optional.
That said, writing your own update request processor that detects this
case isn't very difficult:
extend UpdateRequestProcessorFactory/UpdateRequestProcessor
and use it as a plugin.
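As a rough sketch (not a drop-in implementation), such a processor could look
like the following. The class names are made up, it targets the Solr 3.x API,
and it assumes your unique key is a simple string field whose indexed form
equals its stored value:

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SkipExistingDocumentsProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new SkipExistingDocumentsProcessor(req, next);
  }

  static class SkipExistingDocumentsProcessor extends UpdateRequestProcessor {
    private final SolrQueryRequest req;

    SkipExistingDocumentsProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
      super(next);
      this.req = req;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SchemaField keyField = req.getSchema().getUniqueKeyField();
      Object id = cmd.solrDoc.getFieldValue(keyField.getName());
      if (id != null) {
        // Look up the indexed form of the id; if a document with this key
        // already exists, drop the add and keep the old document.
        String indexed = keyField.getType().toInternal(id.toString());
        if (req.getSearcher().getFirstMatch(new Term(keyField.getName(), indexed)) >= 0) {
          return; // skip: super.processAdd is never called for this doc
        }
      }
      super.processAdd(cmd); // unseen id: let the chain continue as usual
    }
  }
}

You would then register the factory in an updateRequestProcessorChain in
solrconfig.xml and point your update handler at that chain.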
Best
Erick
On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:
the problem with dedupe (SignatureUpdateProcessor) is that it REPLACES the
old docs. I have tried it already.
Best Regards
Alexander Aristov
On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:
The SignatureUpdateProcessor is for exactly this problem:
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:
I get docs from external sources and the only place I keep them is the Solr
index. I have no database or other means to track indexed docs (my
personal opinion is that it would be a huge headache).
Some docs might change slightly in their original sources, but I don't need
those changes. In fact I need the original data only.
So I have no other way but to either check whether a document is already in
the index before I put it into the solrj array (read: query Solr), or develop
my own update chain processor, implement an ID check there, and skip such
docs.
Maybe this is the wrong place to argue, and probably it's been discussed
before, but I wonder why the simple overwrite parameter doesn't work here.
In my opinion it suits perfectly here. In combination with the unique ID it
can cover all possible variants.
Cases:
1. overwrite=true and uniqueID exists: the newer doc should overwrite the
old one.
2. overwrite=false and uniqueID exists: the newer doc must be skipped, since
the old one exists.
3. uniqueID doesn't exist: the newer doc just gets added, regardless of
whether an old one exists.
Best Regards
Alexander Aristov
On 27 December 2011 22:53, Erick Erickson <erickerick...@gmail.com> wrote:
Mikhail is right as far as I know: the assumption built into Solr is that
duplicate IDs (when <uniqueKey> is defined) should trigger the old
document to be replaced.
What is your system-of-record? By that I mean, what does your SolrJ
program do to send data to Solr? Is there any way you could just
*not* send documents that are already in the Solr index, based on,
for instance, a timestamp associated with your system-of-record
and the last time you did an incremental index?
Best
Erick
On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:
Hi
I am not using a database. All the needed data is in the Solr index; that's
why I want to skip excessive checks.
I will check DIH, but I'm not sure it helps.
I am fluent in Java and it's not a problem for me to write a class or so,
but I want to check first whether there are any ways (workarounds) to make
it work without coding, just by playing around with configuration and
params. I don't want to go away from the default Solr implementation.
Best Regards
Alexander Aristov
On 27 December 2011 09:33, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:
Hi people,
I urgently need your help!
I have Solr 3.3 configured and running. I do incremental indexing 4 times a
day using bulk updates. Some documents are identical to some extent, and I
wish to skip them, not index them.
But here is the problem: I could not find a way to tell Solr to ignore new
duplicate docs and keep the old indexed docs. I don't care that a doc is
newer; just determine by ID that such a document is already in the index,
and that's it.
I use solrj for indexing. I have tried setting overwrite=false and the
dedupe approach, but nothing helped me: either a newer doc overwrites the
old one or I get a duplicate.
I think it's a very simple and basic feature and it must exist. What did I
do wrong, or what didn't I do?
I guess it's because the mainstream approach is delta-import, where you
have "updated" timestamps in your DB and a "last-import" timestamp stored
somewhere. You can check how it works in DIH.
I tried Google but couldn't find a solution there, although many people
have encountered this problem.
It can definitely be done by overriding
o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest
starting by implementing your own
http://wiki.apache.org/solr/UpdateRequestProcessor : search for the PK and
bypass the chain call if it's found. Then, if you run into performance
issues querying your PKs one by one (but only after that), you can batch
your searches; there are a couple of optimization techniques for huge
disjunction queries like PK:(2 OR 4 OR 5 OR 6).
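To illustrate the batched variant, here is a hedged client-side sketch in
SolrJ: it collects the ids of a batch, runs one disjunction query against the
index, and sends only the documents that are not already there. The field name
"id", the helper names, and the assumption that ids need no query-syntax
escaping are all illustrative, not a definitive implementation:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class BatchedDedup {

  /** Returns the subset of ids already present in the index, using one OR query per batch. */
  static Set<String> findExisting(SolrServer server, List<String> ids) throws Exception {
    StringBuilder q = new StringBuilder("id:(");
    for (int i = 0; i < ids.size(); i++) {
      if (i > 0) q.append(" OR ");
      q.append('"').append(ids.get(i)).append('"');
    }
    q.append(')');

    SolrQuery query = new SolrQuery(q.toString());
    query.setFields("id");          // only the key is needed
    query.setRows(ids.size());      // make sure every match is returned

    Set<String> existing = new HashSet<String>();
    for (SolrDocument doc : server.query(query).getResults()) {
      existing.add((String) doc.getFieldValue("id"));
    }
    return existing;
  }

  /** Sends only the documents whose ids are not yet indexed. */
  static void addNewDocsOnly(SolrServer server, List<SolrInputDocument> docs) throws Exception {
    List<String> ids = new ArrayList<String>();
    for (SolrInputDocument d : docs) {
      ids.add((String) d.getFieldValue("id"));
    }
    Set<String> existing = findExisting(server, ids);

    List<SolrInputDocument> fresh = new ArrayList<SolrInputDocument>();
    for (SolrInputDocument d : docs) {
      if (!existing.contains((String) d.getFieldValue("id"))) {
        fresh.add(d);
      }
    }
    if (!fresh.isEmpty()) {
      server.add(fresh);   // duplicates never reach the index
    }
  }
}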
I am starting to think that I must query the index to check whether a doc
to be added is already there and then not add it to the array, but I have
so many docs that I'm afraid it's not a good solution.
Best Regards
Alexander Aristov
--
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics
--
Lance Norskog
goks...@gmail.com