Hello Alexander,
I don't know much about your requirements in terms of size and
performance, but I've had a similar use case and found a pretty simple
workaround.
If your duplicate rate is not too high, you can have the
SignatureUpdateProcessor generate a fingerprint for each document (you
already did that).
Simply turn off overwriting of duplicates; you can then rely on Solr's
grouping / field collapsing to group your search results by
fingerprint. You'll then have one document group per "real" document.
You can use group.sort to sort your groups by indexing date ascending,
and group.limit=1 to keep only the oldest one.
You can even use group.format=simple to serve the results as if no
collapsing occurred, and use group.ngroups (/!\ can be expensive /!\)
to get the real number of deduplicated documents.
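For illustration, here is a minimal SolrJ sketch of that grouping query. It
assumes the dedupe chain is configured with overwriteDupes=false, that the
fingerprint field is named "signature", and that an indexing-date field named
"indexed_at" exists; all three names are assumptions to adapt to your schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupedSearch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("your query here");
        q.set("group", true);                   // enable grouping / field collapsing
        q.set("group.field", "signature");      // fingerprint field filled by the signature processor
        q.set("group.sort", "indexed_at asc");  // oldest document first within each group
        q.set("group.limit", 1);                // keep only the oldest document per group
        q.set("group.format", "simple");        // flat result list, as if no collapsing occurred
        q.set("group.ngroups", true);           // real number of deduplicated docs (can be expensive)

        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResponse());  // raw response; parse the grouped section as needed
    }
}

The same parameters can of course be set directly on the request URL if you are
not querying through SolrJ.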
Of course the index will be larger; as I said, I made no assumptions
regarding your operating requirements. And search can be a bit slower,
depending on the average rate of duplicated documents.
But your issue gets addressed by configuration tuning only...
Depending on your project's sizing, that could save you time.
The added advantage is that you keep the precious information of which
content is duplicated from where :-)
Hope this helps,
--
Tanguy
On 28/12/2011 15:45, Alexander Aristov wrote:
Thanks Erick,
that gives me a direction. I will write a new plugin and get back to the
dev forum with the results, and then we will decide on next steps.
Best Regards
Alexander Aristov
On 28 December 2011 18:08, Erick Erickson <erickerick...@gmail.com> wrote:
Well, the short answer is that nobody else has
1> had a similar requirement
AND
2> not found a suitable workaround
AND
3> implemented the change and contributed it back.
So, if you'd like to volunteer <G>.....
Seriously. If you think this would be valuable and are
willing to work on it, hop on over to the dev list and
discuss it, open a JIRA and make it work. I'd start
by opening a discussion on the dev list before
opening a JIRA, just to get a sense of where the
snags would be to changing the Solr code, but that's
optional.
That said, writing your own update request processor that detects this
case isn't very difficult:
extend UpdateRequestProcessorFactory/UpdateRequestProcessor
and use it as a plugin.
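As a rough sketch (not a drop-in implementation), such a processor could look
like the following. The class names are made up, it targets the Solr 3.x API,
and it assumes your unique key is a simple string field whose indexed form
equals its stored value:

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SkipExistingDocumentsProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new SkipExistingDocumentsProcessor(req, next);
  }

  static class SkipExistingDocumentsProcessor extends UpdateRequestProcessor {
    private final SolrQueryRequest req;

    SkipExistingDocumentsProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
      super(next);
      this.req = req;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SchemaField keyField = req.getSchema().getUniqueKeyField();
      Object id = cmd.solrDoc.getFieldValue(keyField.getName());
      if (id != null) {
        // Look up the indexed form of the id; if a document with this key
        // already exists, drop the add and keep the old document.
        String indexed = keyField.getType().toInternal(id.toString());
        if (req.getSearcher().getFirstMatch(new Term(keyField.getName(), indexed)) >= 0) {
          return; // skip: super.processAdd is never called for this doc
        }
      }
      super.processAdd(cmd); // unseen id: let the chain continue as usual
    }
  }
}

You would then register the factory in an updateRequestProcessorChain in
solrconfig.xml and point your update handler at that chain.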
Best
Erick
On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:
the problem with dedupe (SignatureUpdateProcessor) is that it REPLACES the
old docs. I have tried it already.
Best Regards
Alexander Aristov
On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:
The SignatureUpdateProcessor is for exactly this problem:
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:
I get docs from external sources and the only place I keep them is the Solr
index. I have no database or other means to track indexed docs (my
personal opinion is that it would be a huge headache).
Some docs might change slightly in their original sources, but I don't need
those changes. In fact I need the original data only.
So I have no other way but to either check whether a document is already in
the index before I put it into the solrj array (read: query Solr), or develop
my own update chain processor, implement an ID check there, and skip such
docs.
Maybe this is the wrong place to argue, and probably it's been discussed
before, but I wonder why the simple overwrite parameter doesn't work here.
In my opinion it suits perfectly here. In combination with the unique ID it
can cover all possible variants.
Cases:
1. overwrite=true and uniqueID exists: the newer doc should overwrite the
old one.
2. overwrite=false and uniqueID exists: the newer doc must be skipped, since
the old one exists.
3. uniqueID doesn't exist: the newer doc just gets added, regardless of
whether an old one exists.
Best Regards
Alexander Aristov
On 27 December 2011 22:53, Erick Erickson <erickerick...@gmail.com> wrote:
Mikhail is right as far as I know: the assumption built into Solr is that
duplicate IDs (when <uniqueKey> is defined) should trigger the old
document to be replaced.
What is your system-of-record? By that I mean, what does your SolrJ
program do to send data to Solr? Is there any way you could just
*not* send documents that are already in the Solr index, based on,
for instance, a timestamp associated with your system-of-record
and the last time you did an incremental index?
Best
Erick
On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:
Hi
I am not using a database. All the needed data is in the Solr index; that's
why I want to skip excessive checks.
I will check DIH, but I'm not sure it helps.
I am fluent in Java and it's not a problem for me to write a class or so,
but I want to check first whether there are any ways (workarounds) to make
it work without coding, just by playing around with configuration and
params. I don't want to go away from the default Solr implementation.
Best Regards
Alexander Aristov
On 27 December 2011 09:33, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:
Hi people,
I urgently need your help!
I have Solr 3.3 configured and running. I do incremental indexing 4 times a
day using bulk updates. Some documents are identical to some extent, and I
wish to skip them, not index them.
But here is the problem: I could not find a way to tell Solr to ignore new
duplicate docs and keep the old indexed docs. I don't care that a doc is
newer; just determine by ID that such a document is already in the index,
and that's it.
I use solrj for indexing. I have tried setting overwrite=false and the
dedupe approach, but nothing helped me: either a newer doc overwrites the
old one or I get a duplicate.
I think it's a very simple and basic feature and it must exist. What did I
do wrong, or what didn't I do?
I guess it's because the mainstream approach is delta-import, where you
have "updated" timestamps in your DB and a "last-import" timestamp stored
somewhere. You can check how it works in DIH.
I tried Google but couldn't find a solution there, although many people
have encountered this problem.
It can definitely be done by overriding
o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest
starting by implementing your own
http://wiki.apache.org/solr/UpdateRequestProcessor : search for the PK and
bypass the chain call if it's found. Then, if you run into performance
issues querying your PKs one by one (but only after that), you can batch
your searches; there are a couple of optimization techniques for huge
disjunction queries like PK:(2 OR 4 OR 5 OR 6).
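To illustrate the batched variant, here is a hedged client-side sketch in
SolrJ: it collects the ids of a batch, runs one disjunction query against the
index, and sends only the documents that are not already there. The field name
"id", the helper names, and the assumption that ids need no query-syntax
escaping are all illustrative, not a definitive implementation:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class BatchedDedup {

  /** Returns the subset of ids already present in the index, using one OR query per batch. */
  static Set<String> findExisting(SolrServer server, List<String> ids) throws Exception {
    StringBuilder q = new StringBuilder("id:(");
    for (int i = 0; i < ids.size(); i++) {
      if (i > 0) q.append(" OR ");
      q.append('"').append(ids.get(i)).append('"');
    }
    q.append(')');

    SolrQuery query = new SolrQuery(q.toString());
    query.setFields("id");          // only the key is needed
    query.setRows(ids.size());      // make sure every match is returned

    Set<String> existing = new HashSet<String>();
    for (SolrDocument doc : server.query(query).getResults()) {
      existing.add((String) doc.getFieldValue("id"));
    }
    return existing;
  }

  /** Sends only the documents whose ids are not yet indexed. */
  static void addNewDocsOnly(SolrServer server, List<SolrInputDocument> docs) throws Exception {
    List<String> ids = new ArrayList<String>();
    for (SolrInputDocument d : docs) {
      ids.add((String) d.getFieldValue("id"));
    }
    Set<String> existing = findExisting(server, ids);

    List<SolrInputDocument> fresh = new ArrayList<SolrInputDocument>();
    for (SolrInputDocument d : docs) {
      if (!existing.contains((String) d.getFieldValue("id"))) {
        fresh.add(d);
      }
    }
    if (!fresh.isEmpty()) {
      server.add(fresh);   // duplicates never reach the index
    }
  }
}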
I am starting to think that I must query the index to check whether a doc
to be added is already there and then not add it to the array, but I have
so many docs that I'm afraid it's not a good solution.
Best Regards
Alexander Aristov
--
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics
--
Lance Norskog
goks...@gmail.com