: I index 1000 docs; 5 of them are 95% the same (for example, copy-pasted
: blog articles from different sources with slight changes such as the
: author name).
: But they do have differences.
: *Now I would like to see 1 doc in my result set, and the other 4 should be
: marked as similar.*

Do you actually want all 1000 docs in your index, or do you want to prevent 
4 of the 5 copies of the doc from being indexed?

Either way, if the TextProfileSignature is doing a good job of 
identifying the 5 similar docs, then use that at index time.

If you want to keep 4 of the 5 out of the index, then use the Deduplication 
features to prevent the duplicates from being indexed and you're done.

If you want all docs in the index, then you have to decide how you want to 
"mark" docs as similar ... do you want only one of those docs to appear 
in your results, or do you want all of them in the results but with an 
indication that there are other similar docs?  If the former: take a look 
at "Grouping" and group on your signature field.  If the latter: use the 
MLT component to find similar docs based on the signature field 
(i.e.: mlt.fl=signature_t)

https://wiki.apache.org/solr/FieldCollapsing
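
As a rough sketch of the two query-time options above, here is how the 
request URLs might be built in Python.  The host, core name ("mycore") and 
the helper function names are made up for illustration; signature_t is 
just the field name from the mlt.fl example, and lowering mlt.mintf / 
mlt.mindf is my assumption about what a signature field would need, so 
check it against your own schema.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust for your own setup.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"
# Field holding the signature, as in the mlt.fl example above.
SIGNATURE_FIELD = "signature_t"


def grouped_query_url(user_query, base=SOLR_SELECT, field=SIGNATURE_FIELD):
    """Option 1: collapse near-duplicates, one doc shown per signature."""
    params = {
        "q": user_query,
        "group": "true",        # turn on result grouping
        "group.field": field,   # group docs sharing a signature
        "group.limit": 1,       # return one representative doc per group
        "wt": "json",
    }
    return base + "?" + urlencode(params)


def mlt_query_url(user_query, base=SOLR_SELECT, field=SIGNATURE_FIELD):
    """Option 2: keep every doc, attach similar docs via the MLT component."""
    params = {
        "q": user_query,
        "mlt": "true",          # turn on the MoreLikeThis component
        "mlt.fl": field,        # compute similarity on the signature field
        "mlt.mintf": 1,         # assumption: a signature occurs once per doc
        "mlt.mindf": 1,         # assumption: allow rarely shared signatures
        "mlt.count": 4,         # up to 4 similar docs per result
        "wt": "json",
    }
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(grouped_query_url("title:solr"))
    print(mlt_query_url("title:solr"))
```

With grouping, each group in the response also reports how many docs 
matched it (numFound), which can serve as the "4 others are similar" 
indicator.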

-Hoss
