Hi Alessandro,

You are right. I forgot to mention one important factor. For 3000 Hungarian e-books the approach you mentioned is absolutely fine, as the response time is about 0.7 sec. But when I use the same mlt for 5600 Polish e-books, the response time is 7 sec, which is definitely not acceptable for the users.

Regards,
Roland

2015-09-29 17:19 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:

> Hi Roland,
> you said "The main goal is that when a customer is on the product page".
> But if you are on a product page, I guess you have the product id.
> If you have the product id, you can simply execute the MLT request with
> the single doc id as input.
>
> Why do you need to calculate beforehand?
>
> Cheers
>
> 2015-09-29 15:44 GMT+01:00 Szűcs Roland <szucs.rol...@bookandwalk.hu>:
>
> > Hello Upayavira,
> >
> > The main goal is that when a customer is on the product page of an
> > e-book and somehow does not like it, I want to immediately offer
> > her/him alternative e-books on the same topic. If I expect the customer
> > to click on a button like "similar e-books", I lose half of them, as
> > they are too lazy to click anywhere. So I would like to present the
> > alternative e-books on the product pages without any clicking.
> >
> > I assumed the best idea was to calculate the similar e-books for each
> > e-book against all the others (n*(n-1) similarity calculations) and
> > present only the top 5. I planned to do it when our server is not busy.
> > At this point I found the description of mlt as a search component,
> > which seemed to be a good candidate, as it calculates the similar
> > documents for the whole result set of the query. So if I say q=*:* and
> > the mlt component is enabled, I get similar documents for my entire
> > document set. The only problem with this approach was that the mlt
> > search component does not give back the interesting terms for my tag
> > cloud calculation.
> >
> > That's why I tried to mix the flexibility of the mlt component
> > (multiple docs accepted as input) with the robustness of
> > MoreLikeThisHandler (having interesting terms).
> >
> > If there is no solution, I will use the mlt component and solve the tag
> > cloud calculation some other way.
> > By the way, if I am not mistaken, the 5.3.1 version takes the union of
> > the feature sets of the mlt component and the handler.
> >
> > Best Regards,
> > Roland
> >
> > 2015-09-29 14:38 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> >
> > > Let's take a step back. So, you have 3000 or so docs, and you want to
> > > know which documents are similar to these.
> > >
> > > Why do you want to know this? What feature do you need to build that
> > > will use that information? Knowing this may help us to arrive at the
> > > right technology for you.
> > >
> > > For example, you might want to investigate offline clustering
> > > algorithms (e.g. [1], which might be a bit dense to follow). A good
> > > book on machine learning, if you are okay with Python, is
> > > "Programming Collective Intelligence", as it explains the usual
> > > algorithms with simple for loops, making it very clear.
> > >
> > > Or, you could do searches, and then cluster the results at search
> > > time (so if you search for 100 docs, it will identify clusters within
> > > those 100 matching documents). That might get you there. See [2]
> > >
> > > So, if you let us know what the end-goal is, perhaps we can suggest
> > > an alternative approach, rather than burying ourselves neck-deep in
> > > MLT problems.
> > >
> > > Upayavira
> > >
> > > [1] http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> > > [2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
> > >
> > > On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > > > Hello Upayavira,
> > > >
> > > > Thanks for dealing with my issue. I have already applied
> > > > termVectors=true to all fields involved in the more like this
> > > > calculation. I have just 3 000 documents, each of them represented
> > > > by a relatively big term vector with more than 20 000 unique terms.
> > > > If I run the more like this handler for a solr doc, it takes close
> > > > to 1 sec to get back the first 10 similar documents. After this I
> > > > have to pass the doc ids to my other application, which finds the
> > > > cover of the e-book and other metadata and puts it on the web. The
> > > > end-to-end process takes too much time from the customer's
> > > > perspective; that is why I tried to find a solution for offline
> > > > more like this calculation. But if my app has to call the
> > > > MoreLikeThisHandler for each doc, it adds overhead to the offline
> > > > calculation.
> > > >
> > > > Best Regards,
> > > > Roland
> > > >
> > > > 2015-09-29 13:01 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> > > >
> > > > > If MoreLikeThis is slow for large documents that are indexed,
> > > > > have you enabled term vectors on the similarity fields?
> > > > >
> > > > > Basically, what more like this does is this:
> > > > >
> > > > > * decide on what terms in the source doc are "interesting", and
> > > > >   pick the 25 most interesting ones
> > > > > * build and execute a boolean query using these interesting terms.
> > > > >
> > > > > Looking at the first phase of this in more detail:
> > > > >
> > > > > If you pass in a document using stream.body, it will analyse this
> > > > > document into terms, and then calculate the most interesting
> > > > > terms from that.
> > > > >
> > > > > If you reference a document in your index with a field that is
> > > > > stored, it will take the stored version, analyse it and identify
> > > > > the interesting terms from there.
> > > > >
> > > > > If, however, you have stored term vectors against that field,
> > > > > this work is not needed. You have already done much of the work,
> > > > > and the identification of your "interesting terms" will be much
> > > > > faster.
> > > > >
> > > > > Thus, on the content field of your documents, add
> > > > > termVectors="true" in your schema, and re-index.
> > > > > Then you could well find MLT becoming a lot more efficient.
> > > > >
> > > > > Upayavira
> > > > >
> > > > > On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > > > > > Hi Alessandro,
> > > > > >
> > > > > > My original goal was to get offline suggestions on content-based
> > > > > > similarity for every e-book we have. We wanted to run a bulk
> > > > > > more like this calculation in the evening, when the usage of
> > > > > > our site is low and we submit a new e-book. Real-time more like
> > > > > > this can take a while, as we typically have long documents
> > > > > > (2-5 MB of text) with all the content indexed.
> > > > > >
> > > > > > When we upload a new document, we wanted to recalculate the
> > > > > > more like this suggestions and a tf-idf based tag cloud. Both
> > > > > > of them are delivered by the MoreLikeThisHandler, but only for
> > > > > > one document, as you wrote.
> > > > > >
> > > > > > The text input is not good for us because we need the similar
> > > > > > doc list for each of the matched documents. If I put together
> > > > > > the text of 10 documents, I cannot separate which suggestion
> > > > > > relates to which matched document, and the tag cloud will also
> > > > > > belong to the mixed text.
> > > > > >
> > > > > > Most likely we will use the MoreLikeThisHandler for each of the
> > > > > > documents, parse the JSON response and store the result in a
> > > > > > DQL database.
> > > > > >
> > > > > > Thanks for your help.
> > > > > >
> > > > > > 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:
> > > > > >
> > > > > > > Hi Roland,
> > > > > > > what is your exact requirement?
> > > > > > > Do you want to basically build a "description" for a set of
> > > > > > > documents and then find documents in the index similar to
> > > > > > > this description?
> > > > > > >
> > > > > > > By default, based on my experience (and on the code), this is
> > > > > > > the entry point for the Lucene More Like This:
> > > > > > >
> > > > > > >     org.apache.lucene.queries.mlt.MoreLikeThis
> > > > > > >
> > > > > > >     /**
> > > > > > >      * Return a query that will return docs like the passed lucene document ID.
> > > > > > >      *
> > > > > > >      * @param docNum the documentID of the lucene doc to generate the 'More Like This" query for.
> > > > > > >      * @return a query that will return docs like the passed lucene document ID.
> > > > > > >      */
> > > > > > >     public Query like(int docNum) throws IOException {
> > > > > > >       if (fieldNames == null) {
> > > > > > >         // gather list of valid fields from lucene
> > > > > > >         Collection<String> fields = MultiFields.getIndexedFields(ir);
> > > > > > >         fieldNames = fields.toArray(new String[fields.size()]);
> > > > > > >       }
> > > > > > >       return createQuery(retrieveTerms(docNum));
> > > > > > >     }
> > > > > > >
> > > > > > > It means that, talking about "documents", you can feed only
> > > > > > > one Solr doc.
> > > > > > >
> > > > > > > But you can also feed the MLT with simple text.
> > > > > > >
> > > > > > > So you should study your use case better and understand which
> > > > > > > option fits better:
> > > > > > >
> > > > > > > 1) customising the MLT component starting from Lucene
> > > > > > >
> > > > > > > 2) doing some processing client side and using the "text"
> > > > > > > similarity feature.
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > > 2015-09-29 10:05 GMT+01:00 Roland Szűcs <roland.sz...@bookandwalk.com>:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Is it possible to feed multiple solr ids to a
> > > > > > > > MoreLikeThisHandler?
> > > > > > > >
> > > > > > > > <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
> > > > > > > >   <lst name="defaults">
> > > > > > > >     <str name="mlt.match.include">false</str>
> > > > > > > >     <str name="mlt.interestingTerms">details</str>
> > > > > > > >     <str name="mlt.fl">title,content</str>
> > > > > > > >     <str name="mlt.minwl">4</str>
> > > > > > > >     <str name="mlt.qf">title^12 content^1</str>
> > > > > > > >     <str name="mlt.mintf">2</str>
> > > > > > > >     <int name="mlt.count">10</int>
> > > > > > > >     <str name="mlt.boost">true</str>
> > > > > > > >     <str name="wt">json</str>
> > > > > > > >     <str name="indent">true</str>
> > > > > > > >   </lst>
> > > > > > > > </requestHandler>
> > > > > > > >
> > > > > > > > When I call this:
> > > > > > > > http://localhost:8983/solr/bandwhu/mlt?q=id:8&fl=id
> > > > > > > > it works fine. Is there any way to have a kind of "bulk"
> > > > > > > > call of the more like this handler? I need the interesting
> > > > > > > > terms as well, and as far as I know, if I use more like
> > > > > > > > this as a search component it does not return them, so it
> > > > > > > > is not an alternative.
> > > > > > > >
> > > > > > > > Thanks in advance,
> > > > > > > >
> > > > > > > > --
> > > > > > > > Roland Szűcs <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > > > Connect with me on Linkedin
> > > > > > > > CEO <https://bookandwalk.hu/>
> > > > > > > > Phone: +36 1 210 81 13
> > > > > > > > Bookandwalk.hu <https://bokandwalk.hu/>
> > > > > > >
> > > > > > > --
> > > > > > > --------------------------
> > > > > > >
> > > > > > > Benedetti Alessandro
> > > > > > > Visiting card - http://about.me/alessandro_benedetti
> > > > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > > > > >
> > > > > > > "Tyger, tyger burning bright
> > > > > > > In the forests of the night,
> > > > > > > What immortal hand or eye
> > > > > > > Could frame thy fearful symmetry?"
> > > > > > >
> > > > > > > William Blake - Songs of Experience - 1794 England