I still don't see why the number of documents matters. If you have 5600 Polish books but you only run the MLT when a user lands on the page of a specific book, each request is for a single document. I think I am still missing the point! Does MLT on one Polish book take 7 seconds?
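To make the question concrete: the per-page lookup I have in mind is one handler request keyed by the book's id, like the `/mlt?q=id:8&fl=id` call quoted further down. A minimal sketch, assuming the `bandwhu` core and the `/mlt` handler from the original post; the helper name `mlt_url` and the `rows` value of 5 are only illustrative, not something taken from your setup:

```python
from urllib.parse import urlencode

# Base URL taken from the call quoted in this thread; adjust host/core as needed.
SOLR_MLT = "http://localhost:8983/solr/bandwhu/mlt"

def mlt_url(doc_id, rows=5, base=SOLR_MLT):
    """Build a MoreLikeThisHandler request for one source document (hypothetical helper)."""
    params = {
        "q": f"id:{doc_id}",  # the single source document
        "fl": "id",           # only ids are needed to look up covers and metadata
        "rows": rows,         # assumed: number of similar books to show on the page
    }
    return f"{base}?{urlencode(params)}"

print(mlt_url(8))
```

Timing this one request for a Hungarian id and for a Polish id would show whether the 7 seconds really belongs to a single-document call.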
2015-09-30 9:10 GMT+01:00 Szűcs Roland <szucs.rol...@bookandwalk.hu>:

> Hi Alessandro,
>
> You are right. I forgot to mention one important factor. For 3000 Hungarian
> e-books the approach you mentioned is absolutely fine, as the response time
> is some 0.7 sec. But when I use the same mlt for 5600 Polish e-books the
> response time is 7 sec, which is definitely not acceptable for the users.
>
> Regards,
> Roland
>
> 2015-09-29 17:19 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:
>
> > Hi Roland,
> > you said "The main goal is that when a customer is on the product page".
> > But if you are on a product page, I guess you have the product id.
> > If you have the product id, you can simply execute the MLT request with
> > the single doc id as input.
> >
> > Why do you need to calculate beforehand?
> >
> > Cheers
> >
> > 2015-09-29 15:44 GMT+01:00 Szűcs Roland <szucs.rol...@bookandwalk.hu>:
> >
> > > Hello Upayavira,
> > >
> > > The main goal is that when a customer is on the product page of an
> > > e-book and does not like it for some reason, I want to immediately offer
> > > her/him alternative e-books on the same topic. If I expect the customer
> > > to click on a button like "similar e-books", I lose half of them, as they
> > > are too lazy to click anywhere. So I would like to present the
> > > alternatives on the product pages without any clicking.
> > >
> > > I assumed the best idea was to calculate the similar e-books for all the
> > > others (n*(n-1) similarity calculations) and present only the top 5. I
> > > planned to do it when our server is not busy. At this point I found the
> > > description of mlt as a search component, which seemed to be a good
> > > candidate as it calculates the similar documents for the whole result
> > > set of the query. So if I say q=*:* and the mlt component is enabled, I
> > > get similar documents for my entire document set.
> > > The only problem with this approach was that the mlt search component
> > > does not give back the interesting terms for my tag cloud calculation.
> > >
> > > That's why I tried to mix the flexibility of the mlt component (multiple
> > > docs accepted as input) with the robustness of the MoreLikeThisHandler
> > > (having interesting terms).
> > >
> > > If there is no solution, I will use the mlt component and solve the tag
> > > cloud calculation another way. By the way, if I am not mistaken, the
> > > 5.3.1 version takes the union of the feature sets of the mlt component
> > > and the handler.
> > >
> > > Best Regards,
> > > Roland
> > >
> > > 2015-09-29 14:38 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> > >
> > > > Let's take a step back. So, you have 3000 or so docs, and you want to
> > > > know which documents are similar to these.
> > > >
> > > > Why do you want to know this? What feature do you need to build that
> > > > will use that information? Knowing this may help us to arrive at the
> > > > right technology for you.
> > > >
> > > > For example, you might want to investigate offline clustering
> > > > algorithms (e.g. [1], which might be a bit dense to follow). A good
> > > > book on machine learning, if you are okay with Python, is "Programming
> > > > Collective Intelligence", as it explains the usual algorithms with
> > > > simple for loops, making it very clear.
> > > >
> > > > Or, you could do searches, and then cluster the results at search time
> > > > (so if you search for 100 docs, it will identify clusters within those
> > > > 100 matching documents). That might get you there. See [2].
> > > >
> > > > So, if you let us know what the end-goal is, perhaps we can suggest an
> > > > alternative approach, rather than burying ourselves neck-deep in MLT
> > > > problems.
> > > >
> > > > Upayavira
> > > >
> > > > [1] http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> > > > [2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
> > > >
> > > > On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > > > > Hello Upayavira,
> > > > >
> > > > > Thanks for dealing with my issue. I have already applied
> > > > > termVectors=true to all fields involved in the more like this
> > > > > calculation. I have just 3000 documents, each of them represented by
> > > > > a relatively big term vector with more than 20 000 unique terms. If I
> > > > > run the more like this handler for a Solr doc, it takes close to
> > > > > 1 sec to get back the first 10 similar documents. After this I have
> > > > > to pass the doc ids to my other application, which finds the cover
> > > > > of the e-book and other metadata and puts it on the web. The
> > > > > end-to-end process takes too much time from the customer's
> > > > > perspective; that is why I tried to find a solution for offline more
> > > > > like this calculation. But if my app has to call the
> > > > > MoreLikeThisHandler for each doc, it adds overhead to the offline
> > > > > calculation.
> > > > >
> > > > > Best Regards,
> > > > > Roland
> > > > >
> > > > > 2015-09-29 13:01 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> > > > >
> > > > > > If MoreLikeThis is slow for large documents that are indexed, have
> > > > > > you enabled term vectors on the similarity fields?
> > > > > >
> > > > > > Basically, what more like this does is this:
> > > > > >
> > > > > > * decide on what terms in the source doc are "interesting", and
> > > > > >   pick the 25 most interesting ones
> > > > > > * build and execute a boolean query using these interesting terms.
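The two phases in the list above can be sketched in a few lines. This is only a toy illustration, not Lucene's actual code: the tf*idf formula, the function names and the sample frequencies are all assumptions made up for the example.

```python
import math
from collections import Counter

def interesting_terms(doc_tokens, doc_freq, num_docs, max_terms=25, min_tf=2):
    """Rank a document's terms by a simple tf*idf score (illustrative formula)."""
    tf = Counter(doc_tokens)
    scored = []
    for term, freq in tf.items():
        if freq < min_tf:
            continue  # same idea as mlt.mintf: ignore terms that are rare in the source doc
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0))) + 1.0
        scored.append((freq * idf, term))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_terms]]

def boolean_query(field, terms):
    """Join the chosen terms into an OR query string for one field."""
    return " OR ".join(f"{field}:{t}" for t in terms)

# Made-up token list and document frequencies for a 3000-doc index.
tokens = "dragon castle dragon knight castle dragon the the the".split()
doc_freq = {"the": 3000, "dragon": 12, "castle": 40, "knight": 25}
terms = interesting_terms(tokens, doc_freq, num_docs=3000, max_terms=2)
print(boolean_query("content", terms))
```

The point of the sketch is that phase one needs per-term frequencies for the whole source document, which is the part that gets expensive for very long texts.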
> > > > > >
> > > > > > Looking at the first phase of this in more detail:
> > > > > >
> > > > > > If you pass in a document using stream.body, it will analyse this
> > > > > > document into terms, and then calculate the most interesting terms
> > > > > > from that.
> > > > > >
> > > > > > If you reference a document in your index with a field that is
> > > > > > stored, it will take the stored version, analyse it, and identify
> > > > > > the interesting terms from there.
> > > > > >
> > > > > > If, however, you have stored term vectors against that field, this
> > > > > > work is not needed. You have already done much of the work, and the
> > > > > > identification of your "interesting terms" will be much faster.
> > > > > >
> > > > > > Thus, on the content field of your documents, add
> > > > > > termVectors="true" in your schema, and re-index. Then you could
> > > > > > well find MLT becoming a lot more efficient.
> > > > > >
> > > > > > Upayavira
> > > > > >
> > > > > > On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > > > > > > Hi Alessandro,
> > > > > > >
> > > > > > > My original goal was to get offline suggestions on content-based
> > > > > > > similarity for every e-book we have. We wanted to run a bulk more
> > > > > > > like this calculation in the evening, when the usage of our site
> > > > > > > is low and we submit a new e-book. Real-time more like this can
> > > > > > > take a while, as we typically have long documents (2-5 MB of
> > > > > > > text) with all the content indexed.
> > > > > > >
> > > > > > > When we upload a new document we wanted to recalculate the more
> > > > > > > like this suggestions and the tf-idf based tag clouds. Both of
> > > > > > > them are delivered by the MoreLikeThisHandler, but only for one
> > > > > > > document, as you wrote.
> > > > > > >
> > > > > > > The text input is not good for us because we need the similar
> > > > > > > doc list for each of the matched documents. If I put together the
> > > > > > > text of 10 documents, I cannot separate which suggestion relates
> > > > > > > to which matched document, and the tag cloud will also belong to
> > > > > > > the mixed text.
> > > > > > >
> > > > > > > Most likely we will use the MoreLikeThisHandler for each of the
> > > > > > > documents, parse the JSON response, and store the result in a
> > > > > > > DQL database.
> > > > > > >
> > > > > > > Thanks for your help.
> > > > > > >
> > > > > > > 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:
> > > > > > >
> > > > > > > > Hi Roland,
> > > > > > > > what is your exact requirement?
> > > > > > > > Do you basically want to build a "description" for a set of
> > > > > > > > documents and then find documents in the index similar to this
> > > > > > > > description?
> > > > > > > >
> > > > > > > > By default, based on my experience (and on the code), this is
> > > > > > > > the entry point for the Lucene More Like This:
> > > > > > > >
> > > > > > > > > org.apache.lucene.queries.mlt.MoreLikeThis
> > > > > > > > >
> > > > > > > > > /**
> > > > > > > > >  * Return a query that will return docs like the passed lucene document ID.
> > > > > > > > >  *
> > > > > > > > >  * @param docNum the documentID of the lucene doc to generate the 'More Like This" query for.
> > > > > > > > >  * @return a query that will return docs like the passed lucene document ID.
> > > > > > > > >  */
> > > > > > > > > public Query like(int docNum) throws IOException {
> > > > > > > > >   if (fieldNames == null) {
> > > > > > > > >     // gather list of valid fields from lucene
> > > > > > > > >     Collection<String> fields = MultiFields.getIndexedFields(ir);
> > > > > > > > >     fieldNames = fields.toArray(new String[fields.size()]);
> > > > > > > > >   }
> > > > > > > > >   return createQuery(retrieveTerms(docNum));
> > > > > > > > > }
> > > > > > > >
> > > > > > > > It means that, talking about "documents", you can feed only one
> > > > > > > > Solr doc.
> > > > > > > >
> > > > > > > > But you can also feed the MLT with simple text.
> > > > > > > >
> > > > > > > > So you should study your use case better and understand which
> > > > > > > > option fits better:
> > > > > > > >
> > > > > > > > 1) customising the MLT component starting from Lucene
> > > > > > > >
> > > > > > > > 2) doing some processing client side and using the "text"
> > > > > > > >    similarity feature.
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > >
> > > > > > > > 2015-09-29 10:05 GMT+01:00 Roland Szűcs <roland.sz...@bookandwalk.com>:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Is it possible to feed multiple Solr ids to a
> > > > > > > > > MoreLikeThisHandler?
> > > > > > > > >
> > > > > > > > > <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
> > > > > > > > >   <lst name="defaults">
> > > > > > > > >     <str name="mlt.match.include">false</str>
> > > > > > > > >     <str name="mlt.interestingTerms">details</str>
> > > > > > > > >     <str name="mlt.fl">title,content</str>
> > > > > > > > >     <str name="mlt.minwl">4</str>
> > > > > > > > >     <str name="mlt.qf">title^12 content^1</str>
> > > > > > > > >     <str name="mlt.mintf">2</str>
> > > > > > > > >     <int name="mlt.count">10</int>
> > > > > > > > >     <str name="mlt.boost">true</str>
> > > > > > > > >     <str name="wt">json</str>
> > > > > > > > >     <str name="indent">true</str>
> > > > > > > > >   </lst>
> > > > > > > > > </requestHandler>
> > > > > > > > >
> > > > > > > > > When I call this:
> > > > > > > > > http://localhost:8983/solr/bandwhu/mlt?q=id:8&fl=id
> > > > > > > > > it works fine. Is there any way to make a kind of "bulk" call
> > > > > > > > > to the more like this handler? I need the interesting terms
> > > > > > > > > as well, and as far as I know, if I use more like this as a
> > > > > > > > > search component it does not return them, so it is not an
> > > > > > > > > alternative.
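Since the handler takes one source document per request, a "bulk" run ends up being a client-side loop over the ids. A rough sketch of that loop follows. The response layout is an assumption (a `response.docs` list plus a flat term/score list under `interestingTerms`), so check it against what your own Solr returns; `fetch` is a hypothetical caller-supplied function that does the HTTP call and returns the decoded JSON.

```python
def parse_mlt_response(payload):
    """Extract similar doc ids and interesting terms from one handler response.

    Assumed shape (verify against your own Solr output):
      payload["response"]["docs"]  -> list of {"id": ...}
      payload["interestingTerms"]  -> flat [term, score, term, score, ...]
    """
    similar = [doc["id"] for doc in payload.get("response", {}).get("docs", [])]
    flat = payload.get("interestingTerms", [])
    terms = dict(zip(flat[0::2], flat[1::2]))  # pair each term with its score
    return similar, terms

def bulk_mlt(doc_ids, fetch):
    """Call the handler once per id and collect (similar ids, terms) per document."""
    return {doc_id: parse_mlt_response(fetch(doc_id)) for doc_id in doc_ids}

# Canned response standing in for a real request, so the sketch runs offline.
def fake_fetch(doc_id):
    return {
        "response": {"docs": [{"id": "12"}, {"id": "31"}]},
        "interestingTerms": ["content:dragon", 1.0, "content:castle", 0.6],
    }

results = bulk_mlt(["8"], fake_fetch)
print(results)
```

Each entry keeps the suggestions and the tag cloud terms tied to their own source document, which is the separation the combined-text approach cannot give you.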
> > > > > > > > >
> > > > > > > > > Thanks in advance,
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Roland Szűcs <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > > > > Connect with me on Linkedin <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > > > > <https://bookandwalk.hu/>
> > > > > > > > > CEO
> > > > > > > > > Phone: +36 1 210 81 13
> > > > > > > > > Bookandwalk.hu <https://bokandwalk.hu/>
> > > > > > > >
> > > > > > > > --
> > > > > > > > --------------------------
> > > > > > > >
> > > > > > > > Benedetti Alessandro
> > > > > > > > Visiting card - http://about.me/alessandro_benedetti
> > > > > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > > > > > >
> > > > > > > > "Tyger, tyger burning bright
> > > > > > > > In the forests of the night,
> > > > > > > > What immortal hand or eye
> > > > > > > > Could frame thy fearful symmetry?"
> > > > > > > >
> > > > > > > > William Blake - Songs of Experience -1794 England

--
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England