Re: Assign rich-text document's title name from clustering results

Alessandro Benedetti Wed, 10 Jun 2015 08:29:15 -0700

I agree with Upayavira:
title extraction is an activity independent of Solr.
Furthermore, I would say it's easy to extract the title before the Solr
indexing stage.


By the time the content arrives at the Solr update processors, it is already a
String.
If you want to do some clever title extraction, the formatting of your original
document definitely helps, and it is lost at that point.
A good place for title extraction is your indexing app, or Apache Tika if you
would like to add a particular customisation.

Remember Apache Tika is integrated in Solr to provide Content Extraction
from rich text documents.
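
As a rough illustration of doing this in the indexing app, here is a minimal
Python sketch (it is not part of Solr or Tika). It assumes your extractor hands
you a metadata dictionary and the plain text of the document. The function name
derive_title, the metadata keys "dc:title" and "title", and the "Untitled"
fallback are assumptions for illustration only; check what your extractor
actually emits. It prefers a title stored in the metadata and otherwise falls
back to the first n words followed by an ellipsis, as Upayavira suggested.

```python
def derive_title(metadata, text, max_words=8, title_keys=("dc:title", "title")):
    """Return a title for a document before it is sent to Solr.

    metadata   -- dict of metadata from the extraction step (assumed shape)
    text       -- extracted plain-text content
    max_words  -- number of leading words used for the fallback title
    title_keys -- metadata keys to try, in order (assumed names)
    """
    # 1) Prefer a title that the document format already stores as metadata.
    for key in title_keys:
        candidate = str(metadata.get(key) or "").strip()
        if candidate:
            return candidate

    # 2) Otherwise fall back to the first max_words words of the content.
    words = text.split()
    if not words:
        return "Untitled"  # hypothetical placeholder for empty documents
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + "..."
```

The returned string would then be sent to Solr as an ordinary field of the
document, alongside the extracted content.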

Cheers

2015-06-10 11:57 GMT+01:00 Upayavira <u...@odoko.co.uk>:

> It depends a lot on what the documents are. Some document formats have
> metadata that stores a title. Perhaps you can just extract that.
>
> If not, once you've extracted the content, perhaps you could just have a
> special field that is the first n words (followed by an ellipsis).
>
> If you use a clustering algorithm that makes a guess at a name for a
> cluster, you will get a list of names or categories, not something that
> most people would think of as a title.
>
> This really doesn't strike me (yet) as a Solr problem. The problem is
> what info there is in these documents and how you can derive a title (or
> some form of summary?) from them.
>
> If they are all Word documents, do they start with a "Heading" style? In
> which case you could extract that. As I say, most likely this will have
> to be done outside of Solr.
>
> Upayavira
>
> On Wed, Jun 10, 2015, at 10:31 AM, Zheng Lin Edwin Yeo wrote:
> > The main objective here is actually to assign a title to the documents as
> > they are being indexed.
> >
> > We actually found that the cluster labels provide good information on
> > the key points of the documents, but I'm not sure if we can get good
> > cluster labels from a single document.
> >
> > Besides getting it from cluster labels, are there other methods which we
> > can use to assign a title?
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 10 June 2015 at 17:16, Alessandro Benedetti
> > <benedetti.ale...@gmail.com>
> > wrote:
> >
> > > Hi Edwin,
> > > let's do this step by step.
> > >
> > > Clustering is a problem solved by unsupervised machine learning
> > > algorithms.
> > > The scope of clustering is to group a corpus of documents by similarity,
> > > trying to produce groups that are meaningful to a human being.
> > > Solr currently provides different approaches for *Query Time Clustering*
> > > (also known as Online Clustering).
> > > There's an out-of-the-box integration that allows you to use clustering
> > > at query time on the query results.
> > > Different algorithms can be selected, mainly provided by Carrot2.
> > >
> > > These algorithms also provide a guess for the cluster name.
> > >
> > > Given this introduction let me see your problem.
> > >
> > > 1) The first part can be solved with a custom UpdateProcessor that will
> > > process the document and add the automatically generated title.
> > > Now the problem is: how do we want to extract this new title?
> > > Honestly, I cannot understand how clustering can fit here …
> > >
> > > 2) Index-time clustering is not yet provided in Solr (I remember there
> > > was only an interface ready, but no implementation).
> > > You should cluster the content before indexing it in Solr, using a
> > > machine learning library.
> > > Index-time clustering is delicate. What will happen at the next
> > > re-index? Should we cluster everything again?
> > > This topic must be investigated more.
> > >
> > > Anyway, let me know, as the original problem may not require
> > > clustering.
> > >
> > > Cheers
> > >
> > >
> > > 2015-06-10 4:13 GMT+01:00 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
> > >
> > > > Hi,
> > > >
> > > > I'm currently using Solr 5.1, and I'm thinking of ways to allow the
> > > > system to give the rich-text documents that are being indexed a title
> > > > automatically, instead of the user entering it manually, as we might
> > > > have to index a whole folder of documents together, so it is not wise
> > > > for the user to enter the titles one by one.
> > > >
> > > > I would like to check if it's possible to run the clustering, get the
> > > > results, and use the top-scoring label as the title of the document.
> > > > Apparently, we need to run the clustering prior to the indexing, so
> > > > I'm not sure if that is possible.
> > > >
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > >
> > >
> > >
> > > --
> > > --------------------------
> > >
> > > Benedetti Alessandro
> > > Visiting card : http://about.me/alessandro_benedetti
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
