Re: MoreLikeThisHandler with multiple input documents

Alessandro Benedetti Wed, 30 Sep 2015 01:17:45 -0700

I still don't see why the number of documents matters here.
You may have 5600 Polish books, but you only run MLT when a user lands on
the page of one specific book.
So I think I am still missing the point!
Does MLT on a single Polish book really take 7 seconds?



2015-09-30 9:10 GMT+01:00 Szűcs Roland <szucs.rol...@bookandwalk.hu>:

> Hi Alessandro,
>
> You are right. I forgot to mention one important factor. For 3000 Hungarian
> e-books the approach you mentioned is absolutely fine, as the response time
> is about 0.7 sec. But when I use the same mlt for 5600 Polish e-books the
> response time is 7 sec, which is definitely not acceptable for the users.
>
> Regards,
> Roland
>
> 2015-09-29 17:19 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:
>
> > Hi Roland,
> > you said "The main goal is that when a customer is on the product page".
> > But if you are on a product page, I guess you have the product id.
> > If you have the product id, you can simply execute the MLT request with
> > that single doc id as input.
> >
> > Why do you need to calculate beforehand?
> >
> > Cheers
> >
> > 2015-09-29 15:44 GMT+01:00 Szűcs Roland <szucs.rol...@bookandwalk.hu>:
> >
> > > Hello Upayavira,
> > >
> > > The main goal is that when a customer is on the product page of an
> > > e-book and does not like it for some reason, I want to immediately offer
> > > her/him alternative e-books on the same topic. If I expect the customer
> > > to click on a button like "similar e-books", I lose half of them, as they
> > > are too lazy to click anywhere. So I would like to present the
> > > alternatives on the product pages without any clicking.
> > >
> > > I assumed the best idea was to calculate the similar e-books for each
> > > e-book against all the others (n*(n-1) similarity calculations) and
> > > present only the top 5. I planned to do it when our server is not busy.
> > > At this point I found the description of mlt as a search component,
> > > which seemed to be a good candidate as it calculates the similar
> > > documents for the whole result set of the query. So if I say q=*:* and
> > > the mlt component is enabled, I get similar documents for my entire
> > > document set. The only problem with this approach was that the mlt
> > > search component does not give back the interesting terms for my tag
> > > cloud calculation.
> > >
> > > That's why I tried to mix the flexibility of the mlt component (multiple
> > > docs accepted as input) with the robustness of MoreLikeThisHandler
> > > (which returns the interesting terms).
> > >
> > > If there is no solution, I will use the mlt component and solve the tag
> > > cloud calculation some other way. By the way, if I am not mistaken, the
> > > 5.3.1 version takes the union of the feature sets of the mlt component
> > > and the handler.
> > >
> > > Best Regards,
> > > Roland
> > >
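A rough feasibility check of the offline plan discussed in this thread can be sketched as follows. It assumes one sequential handler request per book and that the quoted response times (0.7 sec for the Hungarian set, 7 sec for the Polish set) hold for every book; the function names are made up for illustration.

```python
def batch_hours(num_docs, seconds_per_request):
    """Wall-clock hours for one sequential MLT request per document."""
    return num_docs * seconds_per_request / 3600


def pairwise_comparisons(num_docs):
    """Ordered pairs in a naive all-against-all pass: n*(n-1)."""
    return num_docs * (num_docs - 1)


if __name__ == "__main__":
    # Figures quoted in the thread: 3000 Hungarian and 5600 Polish e-books.
    for label, n, secs in (("Hungarian", 3000, 0.7), ("Polish", 5600, 7.0)):
        print(label, round(batch_hours(n, secs), 2), "hours,",
              pairwise_comparisons(n), "ordered pairs")
```

Under those assumptions the Hungarian set needs about 35 minutes of handler time per full recalculation and the Polish set about 10.9 hours, so the per-request latency matters for the nightly batch as well as for the live page.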
> > >
> > >
> > > 2015-09-29 14:38 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> > >
> > > > Let's take a step back. So, you have 3000 or so docs, and you want to
> > > > know which documents are similar to these.
> > > >
> > > > Why do you want to know this? What feature do you need to build that
> > > > will use that information? Knowing this may help us to arrive at the
> > > > right technology for you.
> > > >
> > > > For example, you might want to investigate offline clustering
> > > > algorithms (e.g. [1], which might be a bit dense to follow). A good
> > > > book on machine learning, if you are okay with Python, is "Programming
> > > > Collective Intelligence", as it explains the usual algorithms with
> > > > simple for loops, making it very clear.
> > > >
> > > > Or, you could do searches, and then cluster the results at search time
> > > > (so if you search for 100 docs, it will identify clusters within those
> > > > 100 matching documents). That might get you there. See [2].
> > > >
> > > > So, if you let us know what the end goal is, perhaps we can suggest an
> > > > alternative approach, rather than burying ourselves neck-deep in MLT
> > > > problems.
> > > >
> > > > Upayavira
> > > >
> > > > [1] http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> > > > [2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
> > > >
> > > > On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > > > > Hello Upayavira,
> > > > >
> > > > > Thanks for dealing with my issue. I have already applied
> > > > > termVectors=true to all fields involved in the more like this
> > > > > calculation. I have just 3 000 documents, each of them represented by
> > > > > a relatively big term vector with more than 20 000 unique terms. If I
> > > > > run the more like this handler for a Solr doc, it takes close to
> > > > > 1 sec to get back the first 10 similar documents. After this I have
> > > > > to pass the doc ids to my other application, which finds the cover of
> > > > > the e-book and other metadata and puts them on the web. The
> > > > > end-to-end process takes too much time from the customer's
> > > > > perspective, which is why I tried to find a solution for offline more
> > > > > like this calculation. But if my app has to call the
> > > > > MoreLikeThisHandler for each doc, it adds overhead to the offline
> > > > > calculation.
> > > > >
> > > > > Best Regards,
> > > > > Roland
> > > > >
> > > > > 2015-09-29 13:01 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> > > > >
> > > > > > If MoreLikeThis is slow for large documents that are indexed, have
> > > > > > you enabled term vectors on the similarity fields?
> > > > > >
> > > > > > Basically, what more like this does is this:
> > > > > >
> > > > > > * decide on what terms in the source doc are "interesting", and
> > > > > >   pick the 25 most interesting ones
> > > > > > * build and execute a boolean query using these interesting terms.
> > > > > >
> > > > > > Looking at the first phase of this in more detail:
> > > > > >
> > > > > > If you pass in a document using stream.body, it will analyse this
> > > > > > document into terms, and then calculate the most interesting terms
> > > > > > from that.
> > > > > >
> > > > > > If you reference a document in your index with a field that is
> > > > > > stored, it will take the stored version, analyse it and identify
> > > > > > the interesting terms from there.
> > > > > >
> > > > > > If, however, you have stored term vectors against that field, this
> > > > > > work is not needed. You have already done much of the work, and the
> > > > > > identification of your "interesting terms" will be much faster.
> > > > > >
> > > > > > Thus, on the content field of your documents, add
> > > > > > termVectors="true" in your schema, and re-index. Then you could
> > > > > > well find MLT becoming a lot more efficient.
> > > > > >
> > > > > > Upayavira
> > > > > >
> > > > > > On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > > > > > > Hi Alessandro,
> > > > > > >
> > > > > > > My original goal was to get offline suggestions based on content
> > > > > > > similarity for every e-book we have. We wanted to run a bulk more
> > > > > > > like this calculation in the evening, when the usage of our site
> > > > > > > is low, after we submit a new e-book. Real-time more like this
> > > > > > > can take a while, as we typically have long documents (2-5 MB of
> > > > > > > text) with all the content indexed.
> > > > > > >
> > > > > > > When we upload a new document, we wanted to recalculate the more
> > > > > > > like this suggestions and the tf-idf based tag clouds. Both of
> > > > > > > them are delivered by the MoreLikeThisHandler, but only for one
> > > > > > > document, as you wrote.
> > > > > > >
> > > > > > > The text input is not good for us because we need the similar doc
> > > > > > > list for each of the matched documents. If I put together the
> > > > > > > text of 10 documents, I cannot separate which suggestion relates
> > > > > > > to which matched document, and the tag cloud will also belong to
> > > > > > > the mixed text.
> > > > > > >
> > > > > > > Most likely we will use the MoreLikeThisHandler for each of the
> > > > > > > documents, parse the JSON response and store the result in an
> > > > > > > SQL database.
> > > > > > >
> > > > > > > Thanks for your help.
> > > > > > >
> > > > > > > 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:
> > > > > > >
> > > > > > > > Hi Roland,
> > > > > > > > what is your exact requirement?
> > > > > > > > Do you basically want to build a "description" for a set of
> > > > > > > > documents and then find documents in the index similar to this
> > > > > > > > description?
> > > > > > > >
> > > > > > > > By default, based on my experience (and on the code), this is the
> > > > > > > > entry point for the Lucene More Like This:
> > > > > > > >
> > > > > > > >   org.apache.lucene.queries.mlt.MoreLikeThis
> > > > > > > >
> > > > > > > >   /**
> > > > > > > >    * Return a query that will return docs like the passed lucene document ID.
> > > > > > > >    *
> > > > > > > >    * @param docNum the documentID of the lucene doc to generate the 'More Like This" query for.
> > > > > > > >    * @return a query that will return docs like the passed lucene document ID.
> > > > > > > >    */
> > > > > > > >   public Query like(int docNum) throws IOException {
> > > > > > > >     if (fieldNames == null) {
> > > > > > > >       // gather list of valid fields from lucene
> > > > > > > >       Collection<String> fields = MultiFields.getIndexedFields(ir);
> > > > > > > >       fieldNames = fields.toArray(new String[fields.size()]);
> > > > > > > >     }
> > > > > > > >     return createQuery(retrieveTerms(docNum));
> > > > > > > >   }
> > > > > > > >
> > > > > > > > It means that, when talking about "documents", you can feed in
> > > > > > > > only one Solr doc.
> > > > > > > >
> > > > > > > > But you can also feed the MLT with simple text.
> > > > > > > >
> > > > > > > > So you should study your use case more closely and work out
> > > > > > > > which option fits better:
> > > > > > > >
> > > > > > > > 1) customising the MLT component, starting from Lucene
> > > > > > > >
> > > > > > > > 2) doing some processing client side and using the "text"
> > > > > > > >    similarity feature.
> > > > > > > >
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > >
> > > > > > > >
> > > > > > > > 2015-09-29 10:05 GMT+01:00 Roland Szűcs <roland.sz...@bookandwalk.com>:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Is it possible to feed multiple Solr ids to a
> > > > > > > > > MoreLikeThisHandler?
> > > > > > > > >
> > > > > > > > > <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
> > > > > > > > >   <lst name="defaults">
> > > > > > > > >     <str name="mlt.match.include">false</str>
> > > > > > > > >     <str name="mlt.interestingTerms">details</str>
> > > > > > > > >     <str name="mlt.fl">title,content</str>
> > > > > > > > >     <str name="mlt.minwl">4</str>
> > > > > > > > >     <str name="mlt.qf">title^12 content^1</str>
> > > > > > > > >     <str name="mlt.mintf">2</str>
> > > > > > > > >     <int name="mlt.count">10</int>
> > > > > > > > >     <str name="mlt.boost">true</str>
> > > > > > > > >     <str name="wt">json</str>
> > > > > > > > >     <str name="indent">true</str>
> > > > > > > > >   </lst>
> > > > > > > > > </requestHandler>
> > > > > > > > >
> > > > > > > > > When I call this:
> > > > > > > > > http://localhost:8983/solr/bandwhu/mlt?q=id:8&fl=id
> > > > > > > > > it works fine. Is there any way to make a kind of "bulk" call
> > > > > > > > > to the more like this handler? I need the interesting terms as
> > > > > > > > > well, and as far as I know, if I use more like this as a search
> > > > > > > > > component it does not return them, so it is not an alternative.
> > > > > > > > >
> > > > > > > > > Thanks in advance,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Roland Szűcs <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > > > > Connect with me on Linkedin <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > > > > CEO
> > > > > > > > > Phone: +36 1 210 81 13
> > > > > > > > > Bookandwalk.hu <https://bookandwalk.hu/>
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > --------------------------
> > > > > > > >
> > > > > > > > Benedetti Alessandro
> > > > > > > > Visiting card - http://about.me/alessandro_benedetti
> > > > > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > > > > > >
> > > > > > > > "Tyger, tyger burning bright
> > > > > > > > In the forests of the night,
> > > > > > > > What immortal hand or eye
> > > > > > > > Could frame thy fearful symmetry?"
> > > > > > > >
> > > > > > > > William Blake - Songs of Experience -1794 England
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> >
> >
>
>



-- 
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
