: I index 1000 docs, 5 of them are 95% the same (for example: copy-pasted
: blog articles from different sources, with slight changes (author name,
: etc.)).
: But they have differences.
: *Now I would like to see 1 doc in my result set and the other 4 should be
: marked as similar.*

Do you actually want all 1000 docs in your index, or do you want to prevent 4 of the 5 copies of the doc from being indexed?

Either way, if the TextProfileSignature is doing a good job of identifying the 5 similar docs, then use that at index time.

If you want to keep 4/5 out of the index, then use the Deduplication features to prevent the duplicates from being indexed and you're done.

If you want all docs in the index, then you have to decide how you want to "mark" docs as similar ... do you want only one of those docs to appear in any given result set, or do you want all of them in the results but with an indication that there are other similar docs?

If the former: take a look at "Grouping" and group on your signature field.

If the latter: use the MLT component to find similar docs based on the signature field (ie: mlt.fl=signature_t)

https://wiki.apache.org/solr/FieldCollapsing

-Hoss
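
To make the two query-time options concrete, here is a rough sketch of what the request URLs might look like. The host, core name ("mycore"), and helper function names are made up for illustration; only the parameter names (group, group.field, mlt, mlt.fl) and the signature_t field come from the discussion above. It assumes signature_t is populated at index time and is a field type that your Solr version allows grouping on.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: adjust host, port, and core name to your setup.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def grouped_query_url(q, signature_field="signature_t"):
    """Option 1: collapse near-duplicates so each signature yields one group."""
    params = {
        "q": q,
        "group": "true",                 # turn on result grouping
        "group.field": signature_field,  # docs sharing a signature share a group
    }
    return SOLR_SELECT + "?" + urlencode(params)


def mlt_query_url(q, signature_field="signature_t"):
    """Option 2: return all docs, plus similar docs found via the MLT component."""
    params = {
        "q": q,
        "mlt": "true",             # enable the MoreLikeThis component
        "mlt.fl": signature_field, # field used to judge similarity
    }
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    print(grouped_query_url("blog"))
    print(mlt_query_url("blog"))
```

With grouping, each group's top doc is the one you show and the remaining members are the ones "marked as similar"; with MLT, every doc stays in the main result list and the similar docs come back in a separate section of the response.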