Re: MoreLikeThisHandler with multiple input documents

Szűcs Roland Tue, 29 Sep 2015 07:46:13 -0700

Hello Upayavira,

The main goal is this: when a customer is on the product page of an e-book
and does not like it for some reason, I want to immediately offer her/him
alternative e-books on the same topic. If I expect the customer to click on
a button like "similar e-books", I lose half of them, as they are too lazy
to click anywhere. So I would like to present the alternatives on the
product pages without requiring a click.


I assumed the best idea was to calculate the similar e-books for each e-book
against all the others (n*(n-1) similarity calculations) and present only
the top 5. I planned to do it when our server is not busy. At this point I
found the description of mlt as a search component, which seemed to be a
good candidate, as it calculates the similar documents for the entire result
set of the query. So if I say q=*:* and the mlt component is enabled, I get
similar documents for my entire document set. The only problem with this
approach was that the mlt search component does not give back the
interesting terms for my tag cloud calculation.
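
A minimal sketch of the bulk approach described above, in Python. It assumes
the default /select handler has the mlt search component available, that the
core URL is the one from the example request later in this thread, and that
json.nl=map is passed so the moreLikeThis section comes back as a JSON object
keyed by document id. The function names and the rows value are hypothetical.

```python
from urllib.parse import urlencode

# Assumed core URL, based on the example request quoted later in the thread.
SOLR_SELECT = "http://localhost:8983/solr/bandwhu/select"


def build_bulk_mlt_url(base=SOLR_SELECT, rows=3000, top_k=5):
    """Build a /select URL that runs the mlt search component over all docs."""
    params = {
        "q": "*:*",                 # match the entire document set
        "fl": "id",                 # only ids are needed for the result docs
        "rows": rows,               # assumed: large enough to cover every doc
        "mlt": "true",              # enable the MoreLikeThis search component
        "mlt.fl": "title,content",  # similarity fields from the handler config
        "mlt.count": top_k,         # similar docs to return per result doc
        "wt": "json",
        "json.nl": "map",           # assumption: render named lists as objects
    }
    return base + "?" + urlencode(params)


def top_similar(response, top_k=5):
    """Map each source doc id to the ids of its top_k similar docs."""
    similar = {}
    for doc_id, hits in response.get("moreLikeThis", {}).items():
        similar[doc_id] = [doc["id"] for doc in hits.get("docs", [])][:top_k]
    return similar
```

The parsed JSON body of the response can be passed straight to top_similar(),
and the resulting mapping stored for the product pages. As noted above, this
route does not yield the interesting terms.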

That's why I tried to mix the flexibility of the mlt component (multiple
docs accepted as input) with the robustness of the MoreLikeThisHandler
(which returns the interesting terms).

If there is no solution, I will use the mlt component and solve the tag
cloud calculation another way. By the way, if I am not mistaken, the 5.3.1
version takes the union of the feature sets of the mlt component and the
handler.
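
One possible "other way" for the tag cloud is to compute tf-idf weights
outside Solr. The sketch below is plain textbook tf-idf in Python, not the
scoring that Solr or Lucene applies internally to pick interesting terms. It
assumes the documents are already tokenized; the function name and the input
shape are hypothetical.

```python
import math
from collections import Counter


def tfidf_tag_clouds(docs, top_n=10):
    """docs maps doc id -> list of tokens; returns doc id -> top_n (term, weight)."""
    n_docs = len(docs)

    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    clouds = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # Weight = raw term frequency * log(N / document frequency).
        weights = {
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest weight first; ties broken alphabetically for stable output.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        clouds[doc_id] = ranked[:top_n]
    return clouds
```

Terms that occur in every document get weight 0 and sink to the bottom of
each cloud, which is usually what a tag cloud wants.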

Best Regards,
Roland



2015-09-29 14:38 GMT+02:00 Upayavira <u...@odoko.co.uk>:

> Let's take a step back. So, you have 3000 or so docs, and you want to
> know which documents are similar to these.
>
> Why do you want to know this? What feature do you need to build that
> will use that information? Knowing this may help us to arrive at the
> right technology for you.
>
> For example, you might want to investigate offline clustering algorithms
> (e.g. [1], which might be a bit dense to follow). A good book on machine
> learning if you are okay with Python is "Programming Collective
> Intelligence" as it explains the usual algorithms with simple for loops
> making it very clear.
>
> Or, you could do searches, and then cluster the results at search time
> (so if you search for 100 docs, it will identify clusters within those
> 100 matching documents). That might get you there. See [2]
>
> So, if you let us know what the end-goal is, perhaps we can suggest an
> alternative approach, rather than burying ourselves neck-deep in MLT
> problems.
>
> Upayavira
>
> [1] http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> [2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
>
> On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > Hello Upayavira,
> >
> > Thanks for dealing with my issue. I have already applied termVectors=true
> > to all fields involved in the more like this calculation. I have just
> > 3,000 documents, each of them represented by a relatively big term vector
> > with more than 20,000 unique terms. If I run the more like this handler
> > for a Solr doc, it takes close to 1 sec to get back the first 10 similar
> > documents. After this I have to pass the doc ids to my other application,
> > which finds the cover of the e-book and other metadata and puts them on
> > the web. The end-to-end process takes too much time from the customer's
> > perspective, which is why I tried to find a solution for an offline more
> > like this calculation. But if my app has to call the MoreLikeThisHandler
> > for each doc, it adds overhead to the offline calculation.
> >
> > Best Regards,
> > Roland
> >
> > 2015-09-29 13:01 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> >
> > > If MoreLikeThis is slow for large documents that are indexed, have you
> > > enabled term vectors on the similarity fields?
> > >
> > > Basically, what more like this does is this:
> > >
> > > * decide on what terms in the source doc are "interesting", and pick
> > > the 25 most interesting ones
> > > * build and execute a boolean query using these interesting terms.
> > >
> > > Looking at the first phase of this in more detail:
> > >
> > > If you pass in a document using stream.body, it will analyse this
> > > document into terms, and then calculate the most interesting terms from
> > > that.
> > >
> > > If you reference a document in your index with a field that is stored,
> > > it will take the stored version, analyse it, and identify the
> > > interesting terms from there.
> > >
> > > If, however, you have stored term vectors against that field, this work
> > > is not needed. You have already done much of the work, and the
> > > identification of your "interesting terms" will be much faster.
> > >
> > > Thus, on the content field of your documents, add termVectors="true" in
> > > your schema, and re-index. Then you could well find MLT becoming a lot
> > > more efficient.
> > >
> > > Upayavira
> > >
> > > On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > > > Hi Alessandro,
> > > >
> > > > My original goal was to get offline suggestions based on content
> > > > similarity for every e-book we have. We wanted to run a bulk more like
> > > > this calculation in the evening, when the usage of our site is low and
> > > > we submit a new e-book. Real-time more like this can take a while, as
> > > > we typically have long documents (2-5 MB of text) with all the content
> > > > indexed.
> > > >
> > > > When we upload a new document, we wanted to recalculate the more like
> > > > this suggestions and the tf-idf based tag clouds. Both of them are
> > > > delivered by the MoreLikeThisHandler, but only for one document, as you
> > > > wrote.
> > > >
> > > > The text input is not good for us because we need the similar doc list
> > > > for each of the matched documents. If I put together the text of 10
> > > > documents, I cannot tell which suggestion relates to which matched
> > > > document, and the tag cloud will also belong to the mixed text.
> > > >
> > > > Most likely we will call the MoreLikeThisHandler for each of the
> > > > documents, parse the JSON response, and store the result in a DQL
> > > > database.
> > > >
> > > > Thanks for your help.
> > > >
> > > > 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:
> > > >
> > > > > Hi Roland,
> > > > > what is your exact requirement?
> > > > > Do you basically want to build a "description" for a set of documents
> > > > > and then find documents in the index similar to this description?
> > > > >
> > > > > By default, based on my experience (and on the code), this is the
> > > > > entry point for the Lucene More Like This:
> > > > >
> > > > > > org.apache.lucene.queries.mlt.MoreLikeThis
> > > > > >
> > > > > > /**
> > > > > >  * Return a query that will return docs like the passed lucene
> > > > > >  * document ID.
> > > > > >  *
> > > > > >  * @param docNum the documentID of the lucene doc to generate the
> > > > > >  *        'More Like This" query for.
> > > > > >  * @return a query that will return docs like the passed lucene
> > > > > >  *         document ID.
> > > > > >  */
> > > > > > public Query like(int docNum) throws IOException {
> > > > > >   if (fieldNames == null) {
> > > > > >     // gather list of valid fields from lucene
> > > > > >     Collection<String> fields = MultiFields.getIndexedFields(ir);
> > > > > >     fieldNames = fields.toArray(new String[fields.size()]);
> > > > > >   }
> > > > > >   return createQuery(retrieveTerms(docNum));
> > > > > > }
> > > > >
> > > > > It means that, talking about "documents", you can feed only one Solr
> > > > > doc.
> > > > >
> > > > > But you can also feed the MLT with simple text.
> > > > >
> > > > > So you should study your use case better and understand which option
> > > > > fits better:
> > > > >
> > > > > 1) customising the MLT component starting from Lucene
> > > > >
> > > > > 2) doing some processing client side and using the "text" similarity
> > > > > feature.
> > > > >
> > > > >
> > > > > Cheers
> > > > >
> > > > >
> > > > > 2015-09-29 10:05 GMT+01:00 Roland Szűcs <roland.sz...@bookandwalk.com>:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Is it possible to feed multiple Solr ids to a MoreLikeThisHandler?
> > > > > >
> > > > > > <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
> > > > > >   <lst name="defaults">
> > > > > >     <str name="mlt.match.include">false</str>
> > > > > >     <str name="mlt.interestingTerms">details</str>
> > > > > >     <str name="mlt.fl">title,content</str>
> > > > > >     <str name="mlt.minwl">4</str>
> > > > > >     <str name="mlt.qf">title^12 content^1</str>
> > > > > >     <str name="mlt.mintf">2</str>
> > > > > >     <int name="mlt.count">10</int>
> > > > > >     <str name="mlt.boost">true</str>
> > > > > >     <str name="wt">json</str>
> > > > > >     <str name="indent">true</str>
> > > > > >   </lst>
> > > > > > </requestHandler>
> > > > > >
> > > > > > When I call this:
> > > > > > http://localhost:8983/solr/bandwhu/mlt?q=id:8&fl=id
> > > > > > it works fine. Is there any way to make a kind of "bulk" call to the
> > > > > > more like this handler? I need the interesting terms as well, and as
> > > > > > far as I know, if I use more like this as a search component it does
> > > > > > not return them, so it is not an alternative.
> > > > > >
> > > > > > Thanks in advance,
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Roland Szűcs <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > Connect with me on Linkedin <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > CEO <https://bookandwalk.hu/>
> > > > > > Phone: +36 1 210 81 13
> > > > > > Bookandwalk.hu <https://bokandwalk.hu/>
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > --------------------------
> > > > >
> > > > > Benedetti Alessandro
> > > > > Visiting card - http://about.me/alessandro_benedetti
> > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > > >
> > > > > "Tyger, tyger burning bright
> > > > > In the forests of the night,
> > > > > What immortal hand or eye
> > > > > Could frame thy fearful symmetry?"
> > > > >
> > > > > William Blake - Songs of Experience -1794 England
> > > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
>



-- 
Szűcs Roland <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
Connect with me on LinkedIn <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
CEO <https://bookandwalk.hu/>
Phone: +36 1 210 81 13
Bookandwalk.hu <https://bokandwalk.hu/>
