I agree with Upayavira: title extraction is an activity independent of Solr. Furthermore, I would say it's easy to extract the title before the Solr indexing stage.
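As a rough illustration of doing this before indexing, here is a minimal Python sketch, not a definitive implementation. It assumes your indexing app already holds the extracted metadata as a dict and the body as plain text; the function name `choose_title`, the `title` metadata key and the `Untitled` placeholder are all hypothetical. It prefers a title found in the metadata and otherwise falls back to the first n words followed by an ellipsis, as Upayavira suggests in the quoted message below.

```python
def choose_title(metadata, content, max_words=8):
    """Pick a title for a document before it is sent to Solr."""
    # Prefer a non-empty title from the extracted metadata.
    # The "title" key is an assumption; real extractors may name it differently.
    title = (metadata.get("title") or "").strip()
    if title:
        return title

    # Fall back to the first max_words words of the body text.
    words = content.split()
    if not words:
        return "Untitled"  # hypothetical placeholder for empty documents
    snippet = " ".join(words[:max_words])

    # Append an ellipsis only when the text was actually truncated.
    return snippet + "..." if len(words) > max_words else snippet


# Build the document in the indexing app, then send it to Solr as usual.
doc = {
    "id": "doc-1",
    "content": "Quarterly sales rose sharply in the northern region after the launch",
}
doc["title"] = choose_title({}, doc["content"], max_words=5)
print(doc["title"])
```

The resulting `title` field would then travel to Solr together with the rest of the document, so nothing title-related needs to happen inside Solr itself.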
By the time the content arrives at the Solr update processors, it is already a String. If you want to do some clever title extraction, the formatting of your original document definitely helps, and it is lost at that point.
A nice fit for title extraction is your indexing app, or Apache Tika if you would like to add a particular customisation.
Remember that Apache Tika is integrated in Solr to provide content extraction from rich text documents.

Cheers

2015-06-10 11:57 GMT+01:00 Upayavira <u...@odoko.co.uk>:

> It depends a lot on what the documents are. Some document formats have
> metadata that stores a title. Perhaps you can just extract that.
>
> If not, once you've extracted the content, perhaps you could just have a
> special field that is the first n words (followed by an ellipsis).
>
> If you use a clustering algorithm that makes a guess at a name for a
> cluster, you will get a list of names or categories, not something that
> most people would think of as a title.
>
> This really doesn't strike me (yet) as a Solr problem. The problem is
> what info there is in these documents and how you can derive a title (or
> some form of summary?) from them.
>
> If they are all Word documents, do they start with a "Heading" style? In
> which case you could extract that. As I say, most likely this will have
> to be done outside of Solr.
>
> Upayavira
>
> On Wed, Jun 10, 2015, at 10:31 AM, Zheng Lin Edwin Yeo wrote:
> > The main objective here is actually to assign a title to the documents as
> > they are being indexed.
> >
> > We actually found that the cluster labels provides a good information on
> > the key points of the documents, but I'm not sure if we can get a good
> > cluster labels with a single documents.
> >
> > Besides getting from cluster labels, is there other methods which we can
> > use to assign a title?
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 10 June 2015 at 17:16, Alessandro Benedetti
> > <benedetti.ale...@gmail.com>
> > wrote:
> >
> > > Hi Edwin,
> > > let's do this step by step.
> > >
> > > Clustering is problem solved by unsupervised machine learning algorithms.
> > > The scope of clustering is to group per similarity a corpus of documents,
> > > trying to have meaningful groups for a human being.
> > > Solr currently provides different approaches for *Query Time Clustering* (
> > > also known Online Clustering).
> > > There's an out of the box integration that allows you to use clustering at
> > > query time on the query results.
> > > Different algorithms can be selected, mainly provided by Carrots2 .
> > >
> > > This algorithms also provide a guess for the cluster name.
> > >
> > > Given this introduction let me see your problem.
> > >
> > > 1) The first part can be solved with a custom UpdateProcessor that will
> > > process the document and add the automatic new title.
> > > Now the problem is, how we want to extract this new title ?
> > > Honestly I can not understand how clustering can fit here …
> > >
> > > 2) Index time clustering is not yet provided in Solr ( I remember there was
> > > only an interface ready, but no implementation) .
> > > You should cluster the content before indexing it in Solr using a machine
> > > Learning library.
> > > Indexing time clustering is delicate. What will happen to the next re-Index
> > > ? Should we cluster everything again ?
> > > This topic must be investigated more.
> > >
> > > Anyway, let me know as the original problem maybe does not require the
> > > clustering.
> > >
> > > Cheers
> > >
> > >
> > > 2015-06-10 4:13 GMT+01:00 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
> > >
> > > > Hi,
> > > >
> > > > I'm currently using Solr 5.1, and I'm thinking of ways to allow the system
> > > > to automatically give the rich-text documents that are being indexed a
> > > > title automatically, instead of user entering it in manually, as we might
> > > > have to index a whole folder of documents together, so it is not wise for
> > > > the user to enter the title one by one.
> > > >
> > > > I would like to check, if it's possible to run the clustering, get the
> > > > results, and use the top score label to be the title of the document?
> > > > Apparently, we need to run the clustering prior to the indexing, so I'm not
> > > > sure if that is possible.
> > > >
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > >
> > >
> > >
> > > --
> > > --------------------------
> > >
> > > Benedetti Alessandro
> > > Visiting card : http://about.me/alessandro_benedetti
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> >

--
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti