Re: Assign rich-text document's title name from clustering results

2015-06-10 Thread Alessandro Benedetti
I agree with Upayavira: title extraction is an activity independent of Solr. Furthermore, I would say it's easy to extract the title before the Solr indexing stage. By the time the content arrives at the Solr update processors, it is already a String. If you want to do some clever title extraction, fo

Re: Assign rich-text document's title name from clustering results

2015-06-10 Thread Upayavira
It depends a lot on what the documents are. Some document formats have metadata that stores a title. Perhaps you can just extract that. If not, once you've extracted the content, perhaps you could just have a special field that is the first n words (followed by an ellipsis). If you use a clusteri
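
A minimal sketch of the "first n words followed by an ellipsis" idea mentioned above, run on the client side before the document is sent to Solr. The function name `fallback_title`, the default of 10 words, and the field names in the usage note are assumptions made for illustration; none of them come from the thread or from Solr itself.

```python
def fallback_title(text: str, n: int = 10) -> str:
    """Build a stand-in title from the first n words of the extracted content.

    An ellipsis is appended only when the content was actually cut short.
    """
    words = text.split()  # split on any run of whitespace
    if len(words) <= n:
        # Short content: use it whole, with no ellipsis.
        return " ".join(words)
    return " ".join(words[:n]) + "..."


if __name__ == "__main__":
    content = "Solr clustering groups similar documents and labels each group"
    print(fallback_title(content, 4))
```

The result could then be placed in whatever title field the schema defines (for example a hypothetical `title` field next to `content`) when building the document to index. It would only be a fallback for documents whose format carries no title metadata.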

Re: Assign rich-text document's title name from clustering results

2015-06-10 Thread Zheng Lin Edwin Yeo
The main objective here is actually to assign a title to the documents as they are being indexed. We actually found that the cluster labels provide good information on the key points of the documents, but I'm not sure whether we can get good cluster labels from a single document. Besides getting

Re: Assign rich-text document's title name from clustering results

2015-06-10 Thread Alessandro Benedetti
Hi Edwin, let's do this step by step. Clustering is a problem solved by unsupervised machine learning algorithms. The goal of clustering is to group a corpus of documents by similarity, trying to produce groups that are meaningful to a human being. Solr currently provides different approaches for *Query Time