Re: MoreLikeThisHandler with multiple input documents

Szűcs Roland Wed, 30 Sep 2015 01:11:49 -0700

Hi Alessandro,

You are right, I forgot to mention one important factor. For 3,000 Hungarian
e-books the approach you mentioned is absolutely fine, as the response time
is about 0.7 sec. But when I use the same MLT for 5,600 Polish e-books, the
response time is 7 sec, which is definitely not acceptable for the users.
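
One way to keep the single-document request but take its latency off the
product page is to run it offline for every ID and cache the results. The
sketch below is only an illustration under stated assumptions: the core URL is
the one from the example request quoted further down, the function names are
hypothetical, the handler is assumed to honour the standard `rows` parameter,
and `interestingTerms` is assumed to arrive either as a flat
[term, score, ...] list or as a map, depending on the JSON settings. Writing
the results to a database is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: core URL copied from the example request quoted in this thread.
SOLR_MLT_URL = "http://localhost:8983/solr/bandwhu/mlt"


def build_mlt_url(doc_id, rows=5, base_url=SOLR_MLT_URL):
    """Build a single-document MLT request URL for one Solr id."""
    params = {
        "q": f"id:{doc_id}",
        "fl": "id",
        "rows": rows,  # assumed to limit the number of similar docs returned
        "mlt.interestingTerms": "details",
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def parse_mlt_response(payload):
    """Return (similar doc ids, {term: score}) from one handler response."""
    docs = payload.get("response", {}).get("docs", [])
    similar = [doc["id"] for doc in docs]
    raw = payload.get("interestingTerms", [])
    if isinstance(raw, dict):
        terms = dict(raw)
    else:
        # Assumed flat layout: [term, score, term, score, ...]
        terms = dict(zip(raw[0::2], raw[1::2]))
    return similar, terms


def precompute(doc_ids, fetch=fetch_json, rows=5):
    """Call the MLT handler once per id and collect the results in a dict."""
    results = {}
    for doc_id in doc_ids:
        payload = fetch(build_mlt_url(doc_id, rows))
        similar, terms = parse_mlt_response(payload)
        results[doc_id] = {"similar": similar, "terms": terms}
    return results
```

Run from a nightly job, `precompute(all_ids)` would give the top 5 neighbours
and the tag-cloud terms per e-book, so the product page only reads stored
rows. The `fetch` argument is there so the loop can be tried against canned
responses before pointing it at a live core.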


Regards,
Roland

2015-09-29 17:19 GMT+02:00 Alessandro Benedetti <[email protected]>:

> Hi Roland,
> you said "The main goal is that when a customer is on the product page".
> But if you are on a product page, I guess you have the product ID.
> If you have the product ID, you can simply execute the MLT request with
> the single doc ID as input.
>
> Why do you need to calculate beforehand?
>
> Cheers
>
> 2015-09-29 15:44 GMT+01:00 Szűcs Roland <[email protected]>:
>
> > Hello Upayavira,
> >
> > The main goal is that when a customer is on the product page of an e-book
> > and does not like it for some reason, I want to immediately offer her/him
> > alternative e-books on the same topic. If I expect the customer to click
> > a button like "similar e-books", I lose half of them, as they are too
> > lazy to click anywhere. So I would like to present the alternatives on
> > the product pages without any clicking.
> >
> > I assumed the best idea was to calculate the similarity of each e-book to
> > all the others (n*(n-1) similarity calculations) and present only the top
> > 5. I planned to do this when our server is not busy. At this point I
> > found the description of MLT as a search component, which seemed to be a
> > good candidate, as it calculates the similar documents for the whole
> > result set of the query. So if I say q=*:* and the MLT component is
> > enabled, I get similar documents for my entire document set. The only
> > problem with this approach is that the MLT search component does not
> > return the interesting terms for my tag cloud calculation.
> >
> > That's why I tried to mix the flexibility of the MLT component (multiple
> > docs accepted as input) with the robustness of MoreLikeThisHandler
> > (having interesting terms).
> >
> > If there is no solution, I will use the MLT component and solve the tag
> > cloud calculation another way. By the way, if I am not mistaken, the
> > 5.3.1 version takes the union of the feature sets of the MLT component
> > and handler.
> >
> > Best Regards,
> > Roland
> >
> >
> >
> > 2015-09-29 14:38 GMT+02:00 Upayavira <[email protected]>:
> >
> > > Let's take a step back. So, you have 3000 or so docs, and you want to
> > > know which documents are similar to these.
> > >
> > > Why do you want to know this? What feature do you need to build that
> > > will use that information? Knowing this may help us to arrive at the
> > > right technology for you.
> > >
> > > For example, you might want to investigate offline clustering
> > > algorithms (e.g. [1], which might be a bit dense to follow). A good
> > > book on machine learning, if you are okay with Python, is "Programming
> > > Collective Intelligence", as it explains the usual algorithms with
> > > simple for loops, making it very clear.
> > >
> > > Or, you could do searches, and then cluster the results at search time
> > > (so if you search for 100 docs, it will identify clusters within those
> > > 100 matching documents). That might get you there. See [2]
> > >
> > > So, if you let us know what the end-goal is, perhaps we can suggest an
> > > alternative approach, rather than burying ourselves neck-deep in MLT
> > > problems.
> > >
> > > Upayavira
> > >
> > > [1] http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> > > [2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
> > >
> > > On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > > > Hello Upayavira,
> > > >
> > > > Thanks for dealing with my issue. I have already applied
> > > > termVectors=true to all fields involved in the more-like-this
> > > > calculation. I have just 3,000 documents, each of them represented by
> > > > a relatively big term vector with more than 20,000 unique terms. If I
> > > > run the MoreLikeThisHandler for a Solr doc, it takes close to 1 sec
> > > > to get back the first 10 similar documents. After this I have to pass
> > > > the doc IDs to my other application, which finds the cover of the
> > > > e-book and other metadata and puts it on the web. The end-to-end
> > > > process takes too much time from the customer's perspective, which is
> > > > why I tried to find a solution for offline more-like-this
> > > > calculation. But if my app has to call the MoreLikeThisHandler for
> > > > each doc, it adds overhead to the offline calculation.
> > > >
> > > > Best Regards,
> > > > Roland
> > > >
> > > > 2015-09-29 13:01 GMT+02:00 Upayavira <[email protected]>:
> > > >
> > > > > If MoreLikeThis is slow for large documents that are indexed, have
> > > > > you enabled term vectors on the similarity fields?
> > > > >
> > > > > Basically, what more like this does is this:
> > > > >
> > > > > * decide on what terms in the source doc are "interesting", and
> > > > >   pick the 25 most interesting ones
> > > > > * build and execute a boolean query using these interesting terms.
> > > > >
> > > > > Looking at the first phase of this in more detail:
> > > > >
> > > > > If you pass in a document using stream.body, it will analyse this
> > > > > document into terms, and then calculate the most interesting terms
> > > > > from that.
> > > > >
> > > > > If you reference a document in your index with a field that is
> > > > > stored, it will take the stored version, analyse it, and identify
> > > > > the interesting terms from there.
> > > > >
> > > > > If, however, you have stored term vectors against that field, this
> > > > > work is not needed. You have already done much of the work, and the
> > > > > identification of your "interesting terms" will be much faster.
> > > > >
> > > > > Thus, on the content field of your documents, add
> > > > > termVectors="true" in your schema, and re-index. Then you could
> > > > > well find MLT becoming a lot more efficient.
> > > > >
> > > > > Upayavira
> > > > >
> > > > > On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > > > > > Hi Alessandro,
> > > > > >
> > > > > > My original goal was to get offline suggestions on content-based
> > > > > > similarity for every e-book we have. We wanted to run a bulk
> > > > > > more-like-this calculation in the evening, when the usage of our
> > > > > > site is low and we submit a new e-book. Real-time more like this
> > > > > > can take a while, as we typically have long documents (2-5 MB of
> > > > > > text) with all the content indexed.
> > > > > >
> > > > > > When we upload a new document, we wanted to recalculate the
> > > > > > more-like-this suggestions and the tf-idf based tag clouds. Both
> > > > > > are delivered by the MoreLikeThisHandler, but only for one
> > > > > > document, as you wrote.
> > > > > >
> > > > > > The text input is not good for us, because we need the similar
> > > > > > doc list for each of the matched documents. If I put together the
> > > > > > text of 10 documents, I cannot separate which suggestion relates
> > > > > > to which matched document, and the tag cloud will also belong to
> > > > > > the mixed text.
> > > > > >
> > > > > > Most likely we will call the MoreLikeThisHandler for each of the
> > > > > > documents, parse the JSON response, and store the result in an
> > > > > > SQL database.
> > > > > >
> > > > > > Thanks for your help.
> > > > > >
> > > > > > 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti <[email protected]>:
> > > > > >
> > > > > > > Hi Roland,
> > > > > > > what is your exact requirement?
> > > > > > > Do you basically want to build a "description" for a set of
> > > > > > > documents and then find documents in the index similar to this
> > > > > > > description?
> > > > > > >
> > > > > > > By default, based on my experience (and on the code), this is
> > > > > > > the entry point for the Lucene More Like This:
> > > > > > >
> > > > > > > org.apache.lucene.queries.mlt.MoreLikeThis
> > > > > > >
> > > > > > >   /**
> > > > > > >    * Return a query that will return docs like the passed lucene
> > > > > > >    * document ID.
> > > > > > >    *
> > > > > > >    * @param docNum the documentID of the lucene doc to generate
> > > > > > >    *               the 'More Like This" query for.
> > > > > > >    * @return a query that will return docs like the passed lucene
> > > > > > >    *         document ID.
> > > > > > >    */
> > > > > > >   public Query like(int docNum) throws IOException {
> > > > > > >     if (fieldNames == null) {
> > > > > > >       // gather list of valid fields from lucene
> > > > > > >       Collection<String> fields = MultiFields.getIndexedFields(ir);
> > > > > > >       fieldNames = fields.toArray(new String[fields.size()]);
> > > > > > >     }
> > > > > > >     return createQuery(retrieveTerms(docNum));
> > > > > > >   }
> > > > > > >
> > > > > > > It means that, talking about "documents", you can feed only one
> > > > > > > Solr doc.
> > > > > > >
> > > > > > > But you can also feed the MLT with simple text.
> > > > > > >
> > > > > > > So you should study your use case better and understand which
> > > > > > > option fits better:
> > > > > > >
> > > > > > > 1) customising the MLT component starting from Lucene
> > > > > > >
> > > > > > > 2) doing some processing client side and using the "text"
> > > > > > >    similarity feature.
> > > > > > >
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > >
> > > > > > > 2015-09-29 10:05 GMT+01:00 Roland Szűcs <[email protected]>:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Is it possible to feed multiple Solr IDs to a
> > > > > > > > MoreLikeThisHandler?
> > > > > > > >
> > > > > > > > <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
> > > > > > > >   <lst name="defaults">
> > > > > > > >     <str name="mlt.match.include">false</str>
> > > > > > > >     <str name="mlt.interestingTerms">details</str>
> > > > > > > >     <str name="mlt.fl">title,content</str>
> > > > > > > >     <str name="mlt.minwl">4</str>
> > > > > > > >     <str name="mlt.qf">title^12 content^1</str>
> > > > > > > >     <str name="mlt.mintf">2</str>
> > > > > > > >     <int name="mlt.count">10</int>
> > > > > > > >     <str name="mlt.boost">true</str>
> > > > > > > >     <str name="wt">json</str>
> > > > > > > >     <str name="indent">true</str>
> > > > > > > >   </lst>
> > > > > > > > </requestHandler>
> > > > > > > >
> > > > > > > > When I call this:
> > > > > > > > http://localhost:8983/solr/bandwhu/mlt?q=id:8&fl=id
> > > > > > > > it works fine. Is there any way to make a kind of "bulk" call
> > > > > > > > to the MoreLikeThisHandler? I need the interesting terms as
> > > > > > > > well, and as far as I know, if I use more like this as a
> > > > > > > > search component it does not return them, so it is not an
> > > > > > > > alternative.
> > > > > > > >
> > > > > > > > Thanks in advance,
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Roland Szűcs
> > > > > > > > CEO, Bookandwalk.hu <https://bookandwalk.hu/>
> > > > > > > > Phone: +36 1 210 81 13
> > > > > > > > Connect with me on LinkedIn:
> > > > > > > > <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > --------------------------
> > > > > > >
> > > > > > > Benedetti Alessandro
> > > > > > > Visiting card - http://about.me/alessandro_benedetti
> > > > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > > > > >
> > > > > > > "Tyger, tyger burning bright
> > > > > > > In the forests of the night,
> > > > > > > What immortal hand or eye
> > > > > > > Could frame thy fearful symmetry?"
> > > > > > >
> > > > > > > William Blake - Songs of Experience -1794 England
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
>
>


