Re: Assign rich-text document's title name from clustering results

Upayavira Wed, 10 Jun 2015 03:57:51 -0700

It depends a lot on what the documents are. Some document formats have
metadata that stores a title. Perhaps you can just extract that.
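
As a rough sketch of the metadata route, assuming the files are .docx (a zip package whose core properties conventionally live in docProps/core.xml, with the title in a Dublin Core dc:title element). The function name docx_metadata_title is made up for illustration, and other formats (PDF, ODF, old .doc) would need their own extraction:

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used for <dc:title> in docProps/core.xml.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def docx_metadata_title(source):
    """Return the title stored in a .docx file's metadata, or None.

    `source` may be a file path or a binary file-like object.
    The name of this helper is hypothetical.
    """
    with zipfile.ZipFile(source) as archive:
        try:
            core_xml = archive.read("docProps/core.xml")
        except KeyError:
            return None  # package has no core properties part at this path
    title = ET.fromstring(core_xml).findtext(DC_NS + "title")
    # Treat a missing or whitespace-only title as "no title".
    return title.strip() if title and title.strip() else None
```

Many documents have an empty or stale title here, so a None result (or an obviously useless one) should fall through to some other strategy.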


If not, once you've extracted the content, perhaps you could just have a
special field that is the first n words (followed by an ellipsis).
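
A minimal sketch of that fallback, assuming the content has already been extracted to plain text. The function name and the default of 10 words are arbitrary choices for illustration:

```python
def first_words_title(text, n=10):
    """Build a title from the first n whitespace-separated words of text.

    An ellipsis is appended only when the text was actually truncated.
    """
    words = text.split()
    if len(words) <= n:
        return " ".join(words)
    return " ".join(words[:n]) + "..."
```

The result would be computed before the document is sent to Solr and stored in its own field alongside the full content.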

If you use a clustering algorithm that makes a guess at a name for a
cluster, you will get a list of names or categories, not something that
most people would think of as a title.

This really doesn't strike me (yet) as a Solr problem. The problem is
what info there is in these documents and how you can derive a title (or
some form of summary?) from them. 

If they are all Word documents, do they start with a "Heading" style? In
which case you could extract that. As I say, most likely this will have
to be done outside of Solr.
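
For the Word case, something along these lines might work outside of Solr. It assumes .docx files whose paragraph style IDs start with "Title" or "Heading" (common in English-language Word, but style IDs can differ by locale or template), and the helper name first_heading_text is hypothetical:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def first_heading_text(source, style_prefixes=("Title", "Heading")):
    """Return the text of the first heading-styled paragraph, or None.

    `source` may be a file path or a binary file-like object.
    `style_prefixes` is an assumption about how heading styles are named.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for para in root.iter(W_NS + "p"):
        style = para.find(W_NS + "pPr/" + W_NS + "pStyle")
        if style is None:
            continue  # paragraph uses the default style
        style_id = style.get(W_NS + "val", "")
        if style_id.startswith(style_prefixes):
            # A paragraph's text is split across one or more <w:t> runs.
            text = "".join(t.text or "" for t in para.iter(W_NS + "t")).strip()
            if text:
                return text
    return None
```

The returned string could then be sent to Solr as the title field, with the first-n-words approach as a fallback when no heading is found.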
 
Upayavira

On Wed, Jun 10, 2015, at 10:31 AM, Zheng Lin Edwin Yeo wrote:
> The main objective here is actually to assign a title to the documents as
> they are being indexed.
> 
> We actually found that the cluster labels provide good information on
> the key points of the documents, but I'm not sure if we can get good
> cluster labels from a single document.
> 
> Besides getting it from cluster labels, are there other methods which we
> can use to assign a title?
> 
> 
> Regards,
> Edwin
> 
> 
> On 10 June 2015 at 17:16, Alessandro Benedetti <benedetti.ale...@gmail.com> wrote:
> 
> > Hi Edwin,
> > let's do this step by step.
> >
> > Clustering is a problem solved by unsupervised machine learning algorithms.
> > The goal of clustering is to group a corpus of documents by similarity,
> > trying to produce groups that are meaningful to a human being.
> > Solr currently provides different approaches for *Query Time Clustering*
> > (also known as Online Clustering).
> > There's an out-of-the-box integration that allows you to use clustering at
> > query time on the query results.
> > Different algorithms can be selected, mainly provided by Carrot2.
> >
> > These algorithms also provide a guess for the cluster name.
> >
> > Given this introduction, let me look at your problem.
> >
> > 1) The first part can be solved with a custom UpdateProcessor that will
> > process the document and add the new, automatically generated title.
> > Now the problem is: how do we want to extract this new title?
> > Honestly, I cannot see how clustering fits here …
> >
> > 2) Index time clustering is not yet provided in Solr (I remember there was
> > only an interface ready, but no implementation).
> > You should cluster the content before indexing it in Solr, using a machine
> > learning library.
> > Index time clustering is delicate. What will happen at the next re-index?
> > Should we cluster everything again?
> > This topic needs more investigation.
> >
> > Anyway, let me know, as the original problem may not require
> > clustering.
> >
> > Cheers
> >
> >
> > 2015-06-10 4:13 GMT+01:00 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
> >
> > > Hi,
> > >
> > > I'm currently using Solr 5.1, and I'm thinking of ways to allow the
> > > system to give the rich-text documents that are being indexed a
> > > title automatically, instead of the user entering it manually, as we
> > > might have to index a whole folder of documents together, so it is not
> > > practical for the user to enter the titles one by one.
> > >
> > > I would like to check if it's possible to run the clustering, get the
> > > results, and use the top-scoring label as the title of the document?
> > > Apparently, we need to run the clustering prior to the indexing, so I'm
> > > not sure if that is possible.
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> >
> >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
