Not sure that will work. Say we have a SearchComponent which does this in its
process() method:
1. DocList docs = rb.getResults().docList;
2. Go over docs and for each doc do steps 3-5.
3. Construct a query which gets all docs which are not equal to the current one
   and are from a different host (we deal with web pages there):
     BooleanQuery q = new BooleanQuery();
     q.add(new TermQuery(new Term("host", host)), BooleanClause.Occur.MUST_NOT);
     q.add(new TermQuery(new Term("id", name)), BooleanClause.Occur.MUST_NOT);
     // TODO: how to set a proper limit instead of the hard-coded 1000
     DocListAndSet sim = searcher.getDocListAndSet(q, (TermQuery) null, null, 0, 1000);
4. For all docs in sim, calculate similarity to the current doc (from #2).
5. Count all similar documents and add a new field:
     FieldType ft = new FieldType();
     ft.setStored(true);
     ft.setIndexed(true);
     Field f = new IntField("similarCount", ds.size(), ft);
     d.add(f);
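The filtering and counting in steps 3 to 5 can be sketched in plain Java,
independent of Solr. This is only an illustration under assumptions: the Page
class, countSimilar, the Jaccard measure over token sets, and the threshold
parameter are all hypothetical stand-ins, since nothing above says how
similarity is actually calculated.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java sketch of steps 3-5; no Solr classes are used. */
final class SimilarCountSketch {

    /** Hypothetical stand-in for an indexed web page. */
    static final class Page {
        final String id;
        final String host;
        final Set<String> tokens;

        Page(String id, String host, String... tokens) {
            this.id = id;
            this.host = host;
            this.tokens = new HashSet<>(Arrays.asList(tokens));
        }
    }

    /** Assumed similarity measure: Jaccard overlap of the two token sets. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        int unionSize = a.size() + b.size() - common.size();
        return (double) common.size() / unionSize;
    }

    /**
     * Counts pages that differ from the current one in both id and host
     * (mirroring the two MUST_NOT clauses) and whose similarity to the
     * current page reaches the assumed threshold.
     */
    static int countSimilar(Page current, List<Page> all, double threshold) {
        int count = 0;
        for (Page other : all) {
            boolean excluded = other.id.equals(current.id)
                    || other.host.equals(current.host);
            if (!excluded && jaccard(current.tokens, other.tokens) >= threshold) {
                count++;
            }
        }
        return count;
    }
}
```

The value returned by countSimilar corresponds to what would be stored in the
similarCount field for the current doc.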
Now the problem is with #1: the results come in already sorted. That is, if I
call Solr with q=*&sort=similarityCount, the sort is applied before the last
component, which does all the steps defined above, is called. If I add the
component to first-components instead, the call in #1 returns null.
A completely different approach would be to calculate the aggregate values on
update via an UpdateRequestProcessor. But then I need to be able to run searches
in the update processor (step #3), and in that case docs are available to the
searcher only after a commit. I'd expect the following to work, but the search
always returns 0 results:
public void processCommit(CommitUpdateCommand cmd) throws IOException {
  TopDocs topDocs = searcher.search(new MatchAllDocsQuery(), 100);
  DocListAndSet sim = searcher.getDocListAndSet(
      new MatchAllDocsQuery(), (TermQuery) null, null, 0, 10);
  DocList docs = sim.docList; // <---- is always empty
}
(I tried placing it after solr.RunUpdateProcessorFactory in the update chain;
no change.)
Even if the searcher worked, this looks bad, because I would need to update not
only the incoming document but also all the documents which are similar to it
(that is, if A is similar to B and C, then B and C are similar to A, and the
similarCount field has to be increased in B and C as well).
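The bookkeeping described above can be sketched in plain Java. This is a
minimal illustration under assumptions: the counts map stands in for the stored
similarCount values, and addDocument is a hypothetical helper, not a Solr API.
In Solr, every entry touched here would mean re-indexing that document.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Plain-Java sketch of keeping similarCount symmetric on update. */
final class SymmetricCountSketch {

    /** Hypothetical stand-in for the similarCount field, keyed by doc id. */
    private final Map<String, Integer> counts = new HashMap<>();

    /**
     * Registers a new document together with the ids of documents found to be
     * similar to it. Assumes the document is new, and that similarIds holds
     * distinct ids and does not contain the document's own id.
     */
    void addDocument(String id, Collection<String> similarIds) {
        // The incoming document gets the number of its similar documents.
        counts.put(id, similarIds.size());
        // Similarity is symmetric, so each similar document gains one as well.
        for (String other : similarIds) {
            counts.merge(other, 1, Integer::sum);
        }
    }

    /** Returns the current count, or 0 for an unknown id. */
    int similarCount(String id) {
        return counts.getOrDefault(id, 0);
    }
}
```

For example, after adding B and C with no similar docs and then adding A with
similar ids B and C, A has a count of 2 while B and C each have a count of 1.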
________________________________
From: Koji Sekiguchi <[email protected]>
To: [email protected]
Sent: Thursday, July 18, 2013 4:29 PM
Subject: Re: Sort by document similarity counts
> I have tried doing this via custom SearchComponent, where I can find all
> similar documents for each document in current search result, then add a new
> field into document hoping to use sort parameter (q=*&sort=similarityCount).
I don't understand this part very well, but:
> But this will not work because sort is done before handling my custom search
> component, if added via last-components. Can't add it via first-components,
> because then I will have no access to query results. And I do not want to
> override QueryComponent because I need to have all the functionality it
> covers: grouping, facets, etc.
You may want to put your custom SearchComponent in last-components and inject a
SortSpec in your prepare(), so that QueryComponent can sort the results
according to your SortSpec?
koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html