Support loading queries from external files in QuerySenderListener
Hi all! I can't load my custom queries from an external file, as described here: https://issues.apache.org/jira/browse/SOLR-784 This option seems not to be implemented in the current version of Solr, 1.4.1. Was it removed, or does it only come with a newer version? regards, Stanislaw
Multiple sorting on text fields
Hi all! I found some strange behavior in Solr. If I sort by two text fields in a chain, I receive some results doubled. Both text fields are not multivalued; one of them is a string, the other a custom type based on a text field with a keyword analyzer. I do this:

CommonsHttpSolrServer server = SolrServer.getInstance().getServer();
SolrQuery query = new SolrQuery();
query.setQuery(suchstring);
query.addSortField("type", SolrQuery.ORDER.asc);     // string field - it's only one letter
query.addSortField("sortName", SolrQuery.ORDER.asc); // text field, not tokenized

QueryResponse rsp = new QueryResponse();
rsp = server.query(query);

After that I extract the results as a list of Entity objects. Most of them are unique, but some of them are doubled and even tripled in this list. (Each object has a unique id and appears only once in the index.) If I sort by only one text field, I receive "normal" results without problems. Where could I have made a mistake, or is it a bug? Best regards, Stanislaw
Re: Multiple sorting on text fields
Hi Dennis, thanks for the reply. Please explain which filter you mean. I'm searching on only one field with names: query.setQuery(suchstring); then I add two sorts on other fields: query.addSortField("type", SolrQuery.ORDER.asc); query.addSortField("sortName", SolrQuery.ORDER.asc); The results should be sorted first by 'type' (only one letter, 'A' or 'B') and then by names. How can I define 'OR' or 'AND' relations here? Best regards, Stanislaw 2010/9/13 Dennis Gearon > My guess is two things are happening: > 1/ Your combination of filters is in parallel,or an OR expression. This I > think for sure maybe, seen next. > 2/ To get 3 duplicate results, your custom filter AND the OR expression > above have to be working togther, or it's possible that your customer filter > is the WHOLE problem, supplying the duplicates and the triplicates. > > A first guess nothing more :-) > Dennis Gearon > > Signature Warning > > EARTH has a Right To Life, > otherwise we all die. > > Read 'Hot, Flat, and Crowded' > Laugh at http://www.yert.com/film.php > > > --- On Mon, 9/13/10, Stanislaw wrote: > > > From: Stanislaw > > Subject: Multiple sorting on text fields > > To: solr-user@lucene.apache.org > > Date: Monday, September 13, 2010, 12:12 AM > > Hi all! > > > > i found some strange behavior of solr. If I do sorting by 2 > > text fields in > > chain, I do receive some results doubled. > > The both text fields are not multivalued, one of them is > > string, the other > > custom type based on text field and keyword analyzer. > > > > I do this: > > > > *CommonsHttpSolrServer server > > = > > SolrServer.getInstance().getServer(); > > SolrQuery query = new > > SolrQuery(); > > query.setQuery(suchstring); > > query.addSortField("type", > > SolrQuery.ORDER.asc); > > //String field- it's only one letter > > query.addSortField("sortName", > > SolrQuery.ORDER.asc); //text > > field, not tokenized > > > > QueryResponse rsp = new > > QueryResponse(); > > rsp = server.query(query);* > > > > after that I extract results as a list Entity objects, the > > most of them are > > unique, but some of them are doubled and even tripled in > > this list. > > (Each object has a unique id and there is only one time in > > index) > > If I'm sorting only by one text field, I'm receiving > > "normal" results w/o > > problems. > > Where could I do a mistake, or is it a bug? > > > > Best regards, > > Stanislaw > > >
Re: Parsing cluster result's docs
Hi, > I have a Solr instance using the clustering component (with the Lingo > algorithm) working perfectly. However when I get back the cluster results > only the ID's of these come back with it. What is the easiest way to > retrieve full documents instead? Should I parse these IDs into a new query > to Solr, or is there some configuration I am missing to return full docs > instead of IDs? > > If it matters, I am using Solr 4.10. > Clustering results are attached to the regular Solr response (the text of the documents), much like shown in the docs: https://cwiki.apache.org/confluence/display/solr/Result+Clustering, so with the default configuration you should be getting both clusters and document content. If that's not the case, please post your solrconfig.xml and the URL you're using to initiate the search/clustering. Staszek
Re: Number of clustering labels to show
Hi, The number of clusters primarily depends on the parameters of the specific clustering algorithm. If you're using the default Lingo algorithm, the number of clusters is governed by the LingoClusteringAlgorithm.desiredClusterCountBase parameter. Take a look at the documentation ( https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-TweakingAlgorithmSettings) for some more details (the "Tweaking at Query-Time" section shows how to pass the specific parameters at request time). A complete overview of the Lingo clustering algorithm parameters is here: http://doc.carrot2.org/#section.component.lingo. Stanislaw -- Stanislaw Osinski, stanislaw.osin...@carrotsearch.com http://carrotsearch.com On Fri, May 29, 2015 at 4:29 AM, Zheng Lin Edwin Yeo wrote: > Hi, > > I'm trying to increase the number of cluster result to be shown during the > search. I tried to set carrot.fragSize=20 but only 15 cluster labels is > shown. Even when I tried to set carrot.fragSize=5, there's also 15 labels > shown. > > Is this the correct way to do this? I understand that setting it to 20 > might not necessary mean 20 lables will be shown, as the setting is for > maximum number. But when I set this to 5, it should reduce the number of > labels to 5? > > I'm using Solr 5.1. > > > Regards, > Edwin >
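For reference, here is a minimal SolrJ sketch of passing that attribute at query time; the /clustering handler name and the Solr 5.x SolrClient usage are assumptions, not something taken from this thread:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ClusterCountExample {
    // Ask Lingo for more clusters by raising desiredClusterCountBase at request time.
    public static QueryResponse clusteredSearch(SolrClient solr, String terms) throws Exception {
        SolrQuery query = new SolrQuery(terms);
        query.setRequestHandler("/clustering");  // assumed handler name from the example config
        query.setRows(100);                      // clustering only sees the rows you request
        query.set("LingoClusteringAlgorithm.desiredClusterCountBase", 30);
        return solr.query(query);
    }
}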
Re: [Clustering] Full-Index Offline cluster
> > Thats weird. As far as I know there is no such thing. There is > classification stuff but I haven't heard of clustering. > > http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html I think the wording on the wiki page needs some clarification -- Solr contains an internal API for full-index clustering, but that interface has not been implemented yet, so the only clustering mode available out of the box is currently search results clustering (based on the Carrot2 library). Staszek
Re: [Clustering] Full-Index Offline cluster
> Thank you Ahmet, Staszek and Tomnaso ;) > so the only way to obtain offline Clustering is to move to a customisation > ! > I will take a look to the interface of the API ( If you can give me a link > to the class, it will be appreciated, If not I will find it by myself . > The API stub is the org.apache.solr.handler.clustering.DocumentClusteringEngine class in contrib/clustering. The API has not been implemented yet, so you may want to adapt it to suit the way you'd like to arrange your full-index clustering code. S.
Re: Is it possible to cluster on search results but return only clusters?
Hi Sebastián, Looking quickly through the code of the clustering component, there's currently no way to output only clusters. Let me see if this can be easily implemented. Stanislaw -- Stanislaw Osinski, stanislaw.osin...@carrotsearch.com http://carrotsearch.com On Tue, May 6, 2014 at 6:48 PM, Paul Libbrecht wrote: > put rows to zero? > Exploit the facets as "clusters" ? > > paul > > > Le 6 mai 2014 à 16:42, Sebastián Ramírez > a écrit : > > > I have this query / URL > > > > > http://example.com:8983/solr/collection1/clustering?q=%28title:%22+Atlantis%22+~100+OR+content:%22+Atlantis%22+~100%29&rows=3001&carrot.snippet=content&carrot.title=title&wt=xml&indent=true&sort=date+DESC&; > > > > With that, I get the results and also the clustering of those results. > What > > I want is just the clusters of the results, not the results, because > > returning the results is consuming too much bandwidth. > > > > I know I can write a "proxy" script that gets the response from Solr and > > then filters out the results and returns the clusters, but I first wanna > > check if it's possible with just the parameters of Solr or Carrot. > > > > Thanks in advance, > > > > > > *Sebastián Ramírez* > > Diseñador de Algoritmos > > > > <http://www.senseta.com> > > > > Tel: (+571) 795 7950 ext: 1012 > > Cel: (+57) 300 370 77 10 > > Calle 99 No. 14 - 76 Piso 5 > > Email: sebastian.rami...@senseta.com > > www.senseta.com > > > > -- > > ** > > *This e-mail transmission, including any attachments, is intended only > for > > the named recipient(s) and may contain information that is privileged, > > confidential and/or exempt from disclosure under applicable law. If you > > have received this transmission in error, or are not the named > > recipient(s), please notify Senseta immediately by return e-mail and > > permanently delete this transmission, including any attachments.* > >
Re: solrconfig.xml carrot2 params
Hi, Out of curiosity -- what would you like to achieve by changing Tokenizer.documentFields? If you want to have clustering applied to more than one document field, you can provide a comma-separated list of fields in the carrot.title and/or carrot.snippet parameters. Thanks, Staszek -- Stanislaw Osinski, stanislaw.osin...@carrotsearch.com http://carrotsearch.com On Thu, Oct 17, 2013 at 11:49 PM, youknow...@heroicefforts.net < youknow...@heroicefforts.net> wrote: > Would someone help me out with the syntax for setting > Tokenizer.documentFields in the ClusteringComponent engine definition in > solrconfig.xml? Carrot2 is expecting a Collection of Strings. There's no > schema definition for this XML file and a big TODO on the Wiki wrt init > params. Every permutation I have tried results in an error stating: > Cannot set java.until.Collection field ... to java.lang.String. > -- > Sent from my Android phone with K-9 Mail. Please excuse my brevity.
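To illustrate the multi-field option mentioned above, a small sketch of the request-side parameters; the field names "headline", "summary" and "body" are made up, and the same parameters can equally be set as defaults in the clustering engine definition in solrconfig.xml:

import org.apache.solr.client.solrj.SolrQuery;

public class MultiFieldClusteringParams {
    // Cluster over several stored fields by listing them comma-separated.
    public static SolrQuery withClusteringFields(String q) {
        SolrQuery query = new SolrQuery(q);
        query.setRequestHandler("/clustering");       // assumed handler name
        query.set("carrot.title", "headline");
        query.set("carrot.snippet", "summary,body");  // comma-separated list of fields
        return query;
    }
}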
Re: solrconfig.xml carrot2 params
> Thanks, I'm new to the clustering libraries. I finally made this > connection when I started browsing through the carrot2 source. I had > pulled down a smaller MM document collection from our test environment. It > was not ideal as it was mostly structured, but small. I foolishly thought > I could cluster on the text copy field before realizing that it was index > only. Doh! > That is correct -- for the time being the clustering can only be applied to stored Solr fields. > Our documents are indexed in SolrCloud, but stored in HBase. I want to > allow users to page through Solr hits, but would like to cluster on all (or > at least several thousand) of the top search hits. Now I'm puzzling over > how to efficiently cluster over possibly several thousand Solr hits when > the documents are in HBase. I thought an HBase coprocessor, but carrot2 > isn't designed for distributed computation. Mahout, in the Hadoop M/R > context, seems slow and heavy handed for this scale; maybe, I just need to > dig deeper into their library. Or I could just be missing something > fundamental? :) > Carrot2 algorithms were not designed to be distributed, but you can still use them in a single-threaded scenario. To do this, you'd probably need to write a bit of code that gets the text of your documents from your HBase and runs Carrot2 clustering on it. If you use the STC clustering algorithm, you should be able to process several thousands of documents in a reasonable time (order of seconds). The clustering side of the code should be a matter of a few lines of code ( http://download.carrot2.org/stable/javadoc/overview-summary.html#clustering-documents). The tricky bit of the setup may be efficiently getting the text for clustering -- it can happen that fetching can take longer than the actual clustering. S.
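To make the "few lines of code" part concrete, here is a rough sketch of clustering externally fetched documents with the Carrot2 Java API; the HBase lookup is stubbed out, and only the clustering calls are meant to reflect the actual Carrot2 API:

import java.util.ArrayList;
import java.util.List;

import org.carrot2.clustering.stc.STCClusteringAlgorithm;
import org.carrot2.core.Cluster;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.Document;
import org.carrot2.core.ProcessingResult;

public class OfflineClusteringSketch {
    public static void main(String[] args) {
        // Hypothetical fetch step: this is where the HBase lookup for the
        // top N Solr hits would go.
        List<Document> documents = new ArrayList<Document>();
        documents.add(new Document("First document title", "Full text of the first document"));
        documents.add(new Document("Second document title", "Full text of the second document"));

        // A simple (non-pooling) controller is enough for a one-off, single-threaded run.
        Controller controller = ControllerFactory.createSimple();

        // STC copes well with a few thousand documents in a matter of seconds.
        ProcessingResult result = controller.process(documents, null, STCClusteringAlgorithm.class);
        for (Cluster cluster : result.getClusters()) {
            System.out.println(cluster.getLabel() + " (" + cluster.getAllDocuments().size() + " docs)");
        }
    }
}

ControllerFactory.createPooling() is the variant to reach for if the same controller has to serve concurrent requests.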
Re: Clustering and FieldType
Hi, You're right -- currently Carrot2 clustering ignores the Solr analysis chain and uses its own pipeline. It is possible to integrate with Solr's analysis components to some extent, see the discussion here: https://issues.apache.org/jira/browse/SOLR-2917. Staszek > > Hi > > Trying to use carrot2 for clustering search results. I have it setup > except it seems to treat the field as regular text instead of applying some > custom filters I have. > > > > So my schema says something like > > omitNorms="true"/> > > compressed="true"/> > > > > ic_text is our internal fieldtype with some custom analysers that strip > out certain special characters from the text. > > > > My solrconfig has something like this setup in our default search > handler. > > true > > default > > true > > > > title > > > > content > > > > In my search results, I see clusters but the labels on these clusters > have the special characters in them - which means that the clustering must > be running on raw text and not on the "ic_text" field. > > Can someone let me know if this is the default setup and if there is a > way to fix this ? > > Thanks ! > > Geetu > > >
Re: Weird docs-id clustering output in Solr 1.4.1
Hi, It looks like some serialization issue related to writing integer ids to the output. I've just tried a similar configuration on Solr 3.5 and the integer identifiers looked fine. Can you try the same configuration on Solr 3.5? Thanks, Staszek On Tue, Nov 29, 2011 at 12:03, Vadim Kisselmann wrote: > Hi folks, > i've installed the clustering component in solr 1.4.1 and it works, but not > really:) > > You can see what the doc id is corrupt. > > > > Euro-Krise > > ½Íџ > ¾ͽ > ¿)ై > ˆ > > > my fields: > > required="true"/> > required="true"/> > multiValued="true" compressed="true"/> > > and my config-snippets: > title > id > > text > > i changed my config snippets (carrot.url=id, url, title..) but the > result is the same. > anyone an idea? > > best regards and thanks > vadim >
Re: Weird docs-id clustering output in Solr 1.4.1
> > But my actual live system works on solr 1.4.1. i can only change my > solrconfig.xml and integrate new packages... > i check the possibility to upgrade from 1.4.1 to 3.5 with the same index > (without reinidex) with luceneMatchVersion 2.9. > i hope it works... > Another option would be to check out the Solr 1.4.1 source code, fix the issue and recompile the clustering component. The quick and dirty way would be to convert all identifiers to strings in the clustering component, before they are returned for serialization (I can send you a patch that does this). The proper way would be to fix the root cause of the problem, but I'd need to dig deeper into the code to find this. Staszek
Re: Weird docs-id clustering output in Solr 1.4.1
Hi Vadim, I've had limited connectivity, so I couldn't check out the complete 1.4.1 code and test the changes. Here's what you can try: In this file: http://svn.apache.org/viewvc/lucene/solr/tags/release-1.4.1/contrib/clustering/src/main/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngine.java?revision=957515&view=markup around line 216 you will see: for (Document doc : docs) { docList.add(doc.getField("solrId")); } You need to change this to: for (Document doc : docs) { docList.add(doc.getField("solrId").toString()); } Let me know if this did the trick. Cheers, S. On Thu, Dec 1, 2011 at 10:43, Vadim Kisselmann wrote: > Hi Stanislaw, > did you already have time to create a patch? > If not, can you tell me please which lines in which class in source code > are relevant? > Thanks and regards > Vadim Kisselmann > > > > 2011/11/29 Vadim Kisselmann > > > Hi, > > the quick and dirty way sound good:) > > It would be great if you can send me a patch for 1.4.1. > > > > > > By the way, i tested Solr. 3.5 with my 1.4.1 test index. > > I can search and optimize, but clustering doesn't work (java.lang.Integer > > cannot be cast to java.lang.String) > > My uniqieKey for my docs it the "id"(sint). > > These here was the error message: > > > > > > Problem accessing /solr/select/. Reason: > > > >Carrot2 clustering failed > > > > org.apache.solr.common.SolrException: Carrot2 clustering failed > >at > > > org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:217) > >at > > > org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91) > >at > > > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194) > >at > > > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) > >at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372) > >at > > > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) > >at > > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) > >at > > > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > >at > > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > >at > > > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > >at > > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > >at > > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) > >at > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) > >at > > > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) > >at > > > org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) > >at > > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) > >at org.mortbay.jetty.Server.handle(Server.java:326) > >at > > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) > >at > > > org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) > >at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) > >at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) > >at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) > >at > > > org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228) > >at > > > org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) > > Caused by: java.lang.ClassCastException: 
java.lang.Integer cannot be cast > > to java.lang.String > >at > > > org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.getDocuments(CarrotClusteringEngine.java:364) > >at > > > org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:201) > >... 23 more > > > > It this case it's better for me to upgrade/patch the 1.4.1 version. > > > > Best regards > > Vadim > > > > > > > > > > 2011/11/29 Stanislaw Osinski > > > >> > > >> > But my actual live system works on solr 1.4.1. i can only change my > >> > solrconfig.xml and integrate new packages... > >> > i check the possibility to upgrade from 1.4.1 to 3.5 with the same > index > >> > (without reinidex) with luceneMatchVersion 2.9. > >> > i hope it works... > >> > > >> > >> Another option would be to check out Solr 1.4.1 source code, fix the > issue > >> and recompile the clustering component. The quick and dirty way would be > >> to > >> convert all identifiers to strings in the clustering component, before > the > >> they are returned for serialization (I can send you a patch that does > >> this). The proper way would be to fix the root cause of the problem, but > >> I'd need to dig deeper into the code to find this. > >> > >> Staszek > >> > > > > >
Re: Solr 3.5.0 can't find Carrot classes
Hi, Can you paste the logs from the second run? Thanks, Staszek On Wed, Jan 25, 2012 at 00:12, Christopher J. Bottaro wrote: > On Tuesday, January 24, 2012 at 3:07 PM, Christopher J. Bottaro wrote: > > SEVERE: java.lang.NoClassDefFoundError: > org/carrot2/core/ControllerFactory > > at > org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.<init>(CarrotClusteringEngine.java:102) > > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > > at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown > Source) > > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) > > at java.lang.reflect.Constructor.newInstance(Unknown Source) > > at java.lang.Class.newInstance0(Unknown Source) > > at java.lang.Class.newInstance(Unknown Source) > > > > … > > > > I'm starting Solr with -Dsolr.clustering.enabled=true and I can see that > the Carrot jars in contrib are getting loaded. > > > > Full log file is here: > http://onespot-development.s3.amazonaws.com/solr.log > > > > Any ideas? Thanks for the help. > > > Ok, got a little further. Seems that Solr doesn't like it if you include > jars more than once (I had a lib dir and also <lib> directives in the > solrconfig which ended up loading the same jars twice). > > But now I'm getting these errors: java.lang.NoClassDefFoundError: > org/apache/solr/handler/clustering/SearchClusteringEngine > > Any help? Thanks.
Re: Clustering results limit?
Hi, I am attempting to cluster a query. It kinda works, but where my > (regular) query returns 500 results the cluster only shows 1-10 hits for > each cluster (5 clusters). Never more than 10 docs and I know its not > right. What could be happening here? It should be showing dozens of > documents per cluster. > Just to clarify -- how many documents do you see in the response (the results section)? Clustering is performed on the search results (in real time), so if you request 10 results, clustering will apply only to those 10 results. To get a larger number of clusters you'd need to request more results, e.g. 50, 100, 200 etc. Obviously, the trade-off here is that it will take longer to fetch the documents from the index, and clustering time will also increase. For some guidance on choosing the clustering algorithm, you can take a look at the following section of the Carrot2 manual: http://download.carrot2.org/stable/manual/#section.advanced-topics.fine-tuning.choosing-algorithm . Cheers, Staszek
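If you query through SolrJ, a quick way to check what clustering actually saw is to compare the returned rows with the total hit count -- a sketch only, with the /clustering handler name assumed:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ClusteringScopeCheck {
    // Clustering works on the rows you request, not on all matches in the index.
    public static void check(SolrServer server, String q) throws Exception {
        SolrQuery query = new SolrQuery(q);
        query.setRequestHandler("/clustering");  // assumed handler name
        query.setRows(200);                      // raise this to cluster more of the top hits
        QueryResponse rsp = server.query(query);
        System.out.println("clustered documents: " + rsp.getResults().size());
        System.out.println("total matches:       " + rsp.getResults().getNumFound());
        System.out.println("clusters present:    " + (rsp.getResponse().get("clusters") != null));
    }
}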
Re: Clustering results limit?
Hi, In my SolrJ, I used ModifiableSolrParams and I set ("rows",50) but it > still returns less than 10 for each cluster. > Oh, the number of documents per cluster very much depends on the characteristics of your documents, it often happens that the algorithms create larger numbers of smaller clusters. However, all returned documents should get assigned to some cluster(s), the Other Topics one in the worst case. Does that hold in your case? If you'd like to tune clustering a bit, you can try Carrot2 tools: http://download.carrot2.org/stable/manual/#section.getting-started.solr and then: http://download.carrot2.org/stable/manual/#chapter.tuning Cheers, S.
Re: clustering component
Hi Matt, I'm attempting to get the carrot based clustering component (in trunk) to > work. I see that the clustering contrib has been disabled for the time > being. Does anyone know if this will be re-enabled soon, or even better, > know how I could get it working as it is? > I've recently created a patch to update the clustering algorithms in branch_3x: https://issues.apache.org/jira/browse/SOLR-1804 The patch should also work with trunk, but I haven't verified it yet. S.
Re: clustering component
> The patch should also work with trunk, but I haven't verified it yet. > I've just added a patch against solr trunk to https://issues.apache.org/jira/browse/SOLR-1804. S.
Re: specifying the doc id in clustering component
Hi Tommy, I'm using the clustering component with solr 1.4. > > The response is given by the id field in the doc array like: >"labels":["Devices"], >"docs":["200066", > "195650", > "204850", > Is there a way to change the doc label to be another field? > > i couldn't this option in http://wiki.apache.org/solr/ClusteringComponent I'm not sure if I get you right. The "labels" field is generated by the clustering engine, it's a description of the group (cluster) of documents. The description is usually a phrase or a number of phrases. The "docs" field lists the ids of documents that the algorithm assigned to the cluster. Can you give an example of the input and output you'd expect? Thanks! Stanislaw
Re: specifying the doc id in clustering component
> The solr schema has the fields, id, name and desc. > > I would like to get docs:["name Field here" ] instead of the doc Id > field as in > "docs":["200066", "195650", > The idea behind using the document ids was that based on them you could access the individual documents' content, including the other fields, right from the "response" field. Using ids limits duplication in the response text as a whole. Is it possible to use this approach in your application? Staszek
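If you do want the names next to the labels on the client side, the join is only a few lines. A sketch assuming SolrJ and that "id" and "name" are stored fields returned in the response; the exact shape of the "clusters" section can vary between versions, so treat the casts as illustrative:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.util.NamedList;

public class ClusterNameResolver {
    // Resolve the ids listed under each cluster against the regular "response" docs.
    @SuppressWarnings("unchecked")
    public static void printClustersWithNames(QueryResponse rsp) {
        Map<String, String> nameById = new HashMap<String, String>();
        for (SolrDocument doc : rsp.getResults()) {
            nameById.put(String.valueOf(doc.getFieldValue("id")),
                         String.valueOf(doc.getFieldValue("name")));
        }
        // The clustering component adds a top-level "clusters" section to the response.
        List<NamedList<Object>> clusters = (List<NamedList<Object>>) rsp.getResponse().get("clusters");
        if (clusters == null) {
            return;
        }
        for (NamedList<Object> cluster : clusters) {
            Object labels = cluster.get("labels");
            List<Object> docIds = (List<Object>) cluster.get("docs");
            StringBuilder names = new StringBuilder();
            for (Object id : docIds) {
                names.append(nameById.get(String.valueOf(id))).append("; ");
            }
            System.out.println(labels + " -> " + names);
        }
    }
}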
Re: News clustering
One of our clients uses Solr's search results clustering for grouping news. Instead of the default Carrot2 algorithm that ships with Solr they use a commercial one, but Carrot2 should give you decent clusters too. Here's an example clustering result: http://imagebin.org/238001 Staszek -- Stanislaw Osinski http://carrotsearch.com On Fri, Nov 30, 2012 at 4:44 PM, Jorge Luis Betancourt Gonzalez < jlbetanco...@uci.cu> wrote: > Hi all: > > I'm thinking on using nutch combined with solr to index some news sites in > an intranet. And I was wondering how effective could be using the > clustering component to cluster the search results? Any success history on > using solr clustering component for news clustering? Any existing solution > for clustering/classification on index time? > > Greetings! > 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS > INFORMATICAS... > CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION > > http://www.uci.cu > http://www.facebook.com/universidad.uci > http://www.flickr.com/photos/universidad_uci >
Re: News clustering
> Was the picture generated using Lingo 3G algorihtms? > I saw some sub-clusters inside it. > Nice pic :) > That is correct. I am interested to learn it. > How long is the Lingo 3G trial period? > I'll send you the details in a private e-mail in a second. > Is there any way to programmatically measure the performance of Carrot2 > clustering algorithm? > I'm not sure what you mean by performance. Measuring clustering time is pretty straightforward, measuring the quality of clusters is not, a lot depends on your specific data and application. Staszek
Re: News clustering
> I mean measuring the similarity between the document in each cluster. > Also, difference between document on one cluster with another cluster. > > I saw the sample code ClusteringQualityBencmark.java > However, I do not know how to make use of it for assessing my Solr > Clustering performance. > You'd need to write your own code for this; here are the most common clustering quality measures of the kind you mentioned: http://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_of_clustering_results These are meant for the general case (numeric attributes); to apply them to texts, you'd need to use the vector representation of the documents. On a more general note, synthetic measures test only the document-cluster assignments, but none of them take the quality of labels into account (this is really hard to measure objectively). Staszek
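As a starting point, here is a minimal sketch of one of the standard measures from that page (purity). It works on plain id/label collections, so it is independent of how the clustering was produced; it treats overlapping cluster assignments naively, and it requires a manually labelled ground truth:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterPurity {
    // Purity: for each cluster count its most frequent ground-truth class,
    // sum over clusters and divide by the total number of clustered documents.
    public static double purity(List<List<String>> clusters, Map<String, String> goldLabelByDocId) {
        int total = 0;
        int majoritySum = 0;
        for (List<String> docIds : clusters) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (String id : docIds) {
                String gold = goldLabelByDocId.get(id);
                Integer c = counts.get(gold);
                counts.put(gold, c == null ? 1 : c + 1);
            }
            int majority = 0;
            for (int c : counts.values()) {
                majority = Math.max(majority, c);
            }
            majoritySum += majority;
            total += docIds.size();
        }
        return total == 0 ? 0.0 : (double) majoritySum / total;
    }
}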
Re: Old Google Guava library needs updating (r05)
Hi Nick, Which version of Solr do you have in mind? The official 3.x line or 4.0? The quick and dirty fix to try would be to just replace Guava r05 with the latest version, chances are it will work (we did that in the past though the version number difference was smaller). The proper fix would be for us to make a point release of Carrot2 with dependencies updated and update Carrot2 in Solr. And this brings us to the question about the version of Solr you use. Upgrading Carrot2 in 4.0 shouldn't be an issue, but when it comes to 3.x I'd need to check. Staszek On Mon, Mar 26, 2012 at 13:10, Erick Erickson wrote: > Hmmm, near as I can tell, guava is only used in the Carrot2 contrib, so > maybe > ask over at: http://project.carrot2.org/? > > Best > Erick > > On Sat, Mar 24, 2012 at 3:31 PM, Nicholas Ball > wrote: > > > > Hey all, > > > > Working on a plugin, which uses the Curator library (ZooKeeper client). > > Curator depends on the very latest Google Guava library which > unfortunately > > clashes with Solr's outdated r05 of Guava. > > Think it's safe to say that Solr should be using the very latest Guava > > library (11.0.1) too right? > > Shall I open up a JIRA issue for someone to update it? > > > > Cheers, > > Nick >
Re: Old Google Guava library needs updating (r05)
I've filed an issue for myself as a reminder. Guava r05 is pretty old indeed, time to upgrade. S. On Mon, Mar 26, 2012 at 23:12, Nicholas Ball wrote: > > Hey Staszek, > > Thanks for the reply. Yep using 4.x and that was exactly what I ended up > doing, a quick replace :) > Just thought I'd document it somewhere for a proper fix to be done in the > 4.0 release. > > No issues arose for me but then again Erick mentions it's only used in > Carrot2 contrib which I'm not using in my deployment. > > Thanks for the help! > Nick > > On Mon, 26 Mar 2012 22:40:14 +0200, Stanislaw Osinski > wrote: > > Hi Nick, > > > > Which version of Solr do you have in mind? The official 3.x line or 4.0? > > > > The quick and dirty fix to try would be to just replace Guava r05 with > the > > latest version, chances are it will work (we did that in the past though > > the version number difference was smaller). > > > > The proper fix would be for us to make a point release of Carrot2 with > > dependencies updated and update Carrot2 in Solr. And this brings us to > the > > question about the version of Solr you use. Upgrading Carrot2 in 4.0 > > shouldn't be an issue, but when it comes to 3.x I'd need to check. > > > > Staszek > > > > On Mon, Mar 26, 2012 at 13:10, Erick Erickson > > wrote: > > > >> Hmmm, near as I can tell, guava is only used in the Carrot2 contrib, so > >> maybe > >> ask over at: http://project.carrot2.org/? > >> > >> Best > >> Erick > >> > >> On Sat, Mar 24, 2012 at 3:31 PM, Nicholas Ball > >> wrote: > >> > > >> > Hey all, > >> > > >> > Working on a plugin, which uses the Curator library (ZooKeeper > client). > >> > Curator depends on the very latest Google Guava library which > >> unfortunately > >> > clashes with Solr's outdated r05 of Guava. > >> > Think it's safe to say that Solr should be using the very latest > Guava > >> > library (11.0.1) too right? > >> > Shall I open up a JIRA issue for someone to update it? > >> > > >> > Cheers, > >> > Nick > >> >
Re: using Carrot2 custom ITokenizerFactory
Hi Koji, You're right, the current code overwrites the custom tokenizer though it shouldn't. LuceneCarrot2TokenizerFactory is there to avoid circular dependencies (Carrot2 default tokenizer depends on Lucene), but it shouldn't be an issue with custom tokenizers. I'll try to commit a fix later today. Meanwhile, if you have a chance to recompile the code, a temporary solution would be to hardcode your tokenizer class into the fragment you pasted: BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes) .stemmerFactory(LuceneCarrot2StemmerFactory.class) .tokenizerFactory(YourCustomTokenizer.class) .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class); Staszek On Sun, May 20, 2012 at 9:40 AM, Koji Sekiguchi wrote: > Hello, > > As I'd like to use custom ITokenizerFactory, I set the following Carrot2 > key > in solrconfig.xml: > > enable="${solr.clustering.enabled:true}" > class="solr.clustering.ClusteringComponent" > > > default > : > name="PreprocessingPipeline.tokenizerFactory">my.own.TokenizerFactory > > > > But seems that CarrotClusteringEngine overwrites it with > LuceneCarrot2TokenizerFactory > in init() method: > >BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes) >.stemmerFactory(LuceneCarrot2StemmerFactory.class) >.tokenizerFactory(LuceneCarrot2TokenizerFactory.class) >.lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class); > > Am I missing something? > > koji > -- > Query Log Visualizer for Apache Solr > http://soleami.com/ >
Re: Newbie with Carrot2?
Hi Bruno, Here's the wiki documentation for Solr's clustering component: http://wiki.apache.org/solr/ClusteringComponent For configuration examples, take a look at the Configuration section: http://wiki.apache.org/solr/ClusteringComponent#Configuration. If you hit any problems, let me know. Staszek On Sun, May 20, 2012 at 11:38 AM, Bruno Mannina wrote: > Dear all, > > I use Solr 3.6.0 and I indexed some documents (around 12000). > Each documents contains a Abstract-en field (and some other fields). > > Is it possible to use Carrot2 to create cluster (classes) with the > Abstract-en field? > > What must I configure in the schema.xml ? or in other files? > > Sorry for my newbie question, but I found only documentation for Workbench > tool. > > Bruno >
Re: using Carrot2 custom ITokenizerFactory
Hi Koji, It's fixed in trunk and 3.6.1 branch now. If you hit any other issues with this, let me know. Staszek On Sun, May 20, 2012 at 1:02 PM, Koji Sekiguchi wrote: > Hi Staszek, > > I'll wait your fix. Thank you! > > Koji Sekiguchi from iPad2 > > On 2012/05/20, at 18:18, Stanislaw Osinski wrote: > > > Hi Koji, > > > > You're right, the current code overwrites the custom tokenizer though it > > shouldn't. LuceneCarrot2TokenizerFactory is there to avoid circular > > dependencies (Carrot2 default tokenizer depends on Lucene), but it > > shouldn't be an issue with custom tokenizers. > > > > I'll try to commit a fix later today. Meanwhile, if you have a chance to > > recompile the code, a temporary solution would be to hardcode your > > tokenizer class into the fragment you pasted: > > > > BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes) > > .stemmerFactory(LuceneCarrot2StemmerFactory.class) > > .tokenizerFactory(YourCustomTokenizer.class) > > .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class); > > > > Staszek > > > > On Sun, May 20, 2012 at 9:40 AM, Koji Sekiguchi > wrote: > > > >> Hello, > >> > >> As I'd like to use custom ITokenizerFactory, I set the following Carrot2 > >> key > >> in solrconfig.xml: > >> > >> >> enable="${solr.clustering.enabled:true}" > >> class="solr.clustering.ClusteringComponent" > > >> > >> default > >>: > >> >> > name="PreprocessingPipeline.tokenizerFactory">my.own.TokenizerFactory > >> > >> > >> > >> But seems that CarrotClusteringEngine overwrites it with > >> LuceneCarrot2TokenizerFactory > >> in init() method: > >> > >> BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes) > >> .stemmerFactory(LuceneCarrot2StemmerFactory.class) > >> .tokenizerFactory(LuceneCarrot2TokenizerFactory.class) > >> .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class); > >> > >> Am I missing something? > >> > >> koji > >> -- > >> Query Log Visualizer for Apache Solr > >> http://soleami.com/ > >> >
Re: using Carrot2 custom ITokenizerFactory
.carrot2.core.Controller.**process(Controller.java:240) >at org.apache.solr.handler.**clustering.carrot2.** > CarrotClusteringEngine.**cluster(**CarrotClusteringEngine.java:**220) >... 24 more > Caused by: org.carrot2.util.attribute.**AttributeBindingException: Could > not assign field org.carrot2.text.**preprocessing.pipeline.** > CompletePreprocessingPipeline#**tokenizerFactory with value > org.apache.solr.handler.**clustering.carrot2.** > LuceneCarrot2TokenizerFactory >at org.carrot2.util.attribute.**AttributeBinder$** > AttributeBinderActionBind.**performAction(AttributeBinder.**java:614) >at org.carrot2.util.attribute.**AttributeBinder.bind(** > AttributeBinder.java:311) >at org.carrot2.util.attribute.**AttributeBinder.bind(** > AttributeBinder.java:349) >at org.carrot2.util.attribute.**AttributeBinder.bind(** > AttributeBinder.java:219) >at org.carrot2.util.attribute.**AttributeBinder.set(** > AttributeBinder.java:149) >at org.carrot2.util.attribute.**AttributeBinder.set(** > AttributeBinder.java:129) >at org.carrot2.core.**ControllerUtils.init(** > ControllerUtils.java:50) >at org.carrot2.core.**PoolingProcessingComponentMana**ger$** > ComponentInstantiationListener**.objectInstantiated(** > PoolingProcessingComponentMana**ger.java:189) >... 30 more > Caused by: java.lang.**IllegalArgumentException: Can not set > org.carrot2.text.linguistic.**ITokenizerFactory field org.carrot2.text.** > preprocessing.pipeline.**BasicPreprocessingPipeline.**tokenizerFactory to > java.lang.String >at sun.reflect.**UnsafeFieldAccessorImpl.** > throwSetIllegalArgumentExcepti**on(UnsafeFieldAccessorImpl.**java:146) >at sun.reflect.**UnsafeFieldAccessorImpl.** > throwSetIllegalArgumentExcepti**on(UnsafeFieldAccessorImpl.**java:150) >at sun.reflect.**UnsafeObjectFieldAccessorImpl.**set(** > UnsafeObjectFieldAccessorImpl.**java:63) >at java.lang.reflect.Field.set(**Field.java:657) >at org.carrot2.util.attribute.**AttributeBinder$** > AttributeBinderActionBind.**performAction(AttributeBinder.**java:610) >... 37 more > > > I should dig in, but if you have any clue, it would be appreciated. I'm > using 3.6 branch. > > > koji > -- > Query Log Visualizer for Apache Solr > http://soleami.com/ > > (12/05/20 21:11), Stanislaw Osinski wrote: > >> Hi Koji, >> >> It's fixed in trunk and 3.6.1 branch now. If you hit any other issues with >> this, let me know. >> >> Staszek >> >> On Sun, May 20, 2012 at 1:02 PM, Koji Sekiguchi >> wrote: >> >> Hi Staszek, >>> >>> I'll wait your fix. Thank you! >>> >>> Koji Sekiguchi from iPad2 >>> >>> On 2012/05/20, at 18:18, Stanislaw Osinski >>> wrote: >>> >>> Hi Koji, >>>> >>>> You're right, the current code overwrites the custom tokenizer though it >>>> shouldn't. LuceneCarrot2TokenizerFactory is there to avoid circular >>>> dependencies (Carrot2 default tokenizer depends on Lucene), but it >>>> shouldn't be an issue with custom tokenizers. >>>> >>>> I'll try to commit a fix later today. 
Meanwhile, if you have a chance to >>>> recompile the code, a temporary solution would be to hardcode your >>>> tokenizer class into the fragment you pasted: >>>> >>>> BasicPreprocessingPipelineDesc**riptor.attributeBuilder(** >>>> initAttributes) >>>> .stemmerFactory(**LuceneCarrot2StemmerFactory.**class) >>>> .tokenizerFactory(**YourCustomTokenizer.class) >>>> .lexicalDataFactory(**SolrStopwordsCarrot2LexicalDat** >>>> aFactory.class); >>>> >>>> Staszek >>>> >>>> On Sun, May 20, 2012 at 9:40 AM, Koji Sekiguchi >>>> >>> wrote: >>> >>>> >>>> Hello, >>>>> >>>>> As I'd like to use custom ITokenizerFactory, I set the following >>>>> Carrot2 >>>>> key >>>>> in solrconfig.xml: >>>>> >>>>> >>>> enable="${solr.clustering.**enabled:true}" >>>>> class="solr.clustering.**ClusteringComponent"> >>>>> >>>>> default >>>>>: >>>>> >>>> >>>>> name="PreprocessingPipeline.**tokenizerFactory">my.own.** >>> TokenizerFactory >>> >>>> >>>>> >>>>> >>>>> But seems that CarrotClusteringEngine overwrites it with >>>>> LuceneCarrot2TokenizerFactory >>>>> in init() method: >>>>> >>>>> BasicPreprocessingPipelineDesc**riptor.attributeBuilder(** >>>>> initAttributes) >>>>> .stemmerFactory(**LuceneCarrot2StemmerFactory.**class) >>>>> .tokenizerFactory(**LuceneCarrot2TokenizerFactory.**class) >>>>> .lexicalDataFactory(**SolrStopwordsCarrot2LexicalDat** >>>>> aFactory.class); >>>>> >>>>> Am I missing something? >>>>> >>>>> koji >>>>> -- >>>>> Query Log Visualizer for Apache Solr >>>>> http://soleami.com/ >>>>> >>>>> >>> >> >
Re: using Carrot2 custom ITokenizerFactory
After a bit of digging: the error message in the exception is a bit misleading, but what really happens is that the code cannot load the org.apache.solr.handler.clustering.carrot2.LuceneCarrot2TokenizerFactory class. The class is being loaded by Carrot2 code ( https://github.com/carrot2/carrot2/blob/master/core/carrot2-util-common/src/org/carrot2/util/ReflectionUtils.java#L47), which doesn't seem to play well with how Solr loads classes. We'll be looking for ways to properly fix it, any hints would be helpful. Meanwhile, a quick and dirty way of fixing the config would be to make the clustering component and Carrot2 JARs available to the context classloader by copying them to WEB-INF/lib of the WAR. Staszek On Sun, May 20, 2012 at 6:16 PM, Stanislaw Osinski < stanislaw.osin...@carrotsearch.com> wrote: > Interesting... let me investigate. > > S. > > > On Sun, May 20, 2012 at 5:15 PM, Koji Sekiguchi wrote: > >> Hi Staszek, >> >> Thank you for the fix so quickly! >> >> As a trial, I set: >> >> org.apache.** >> solr.handler.clustering.**carrot2.**LuceneCarrot2TokenizerFactory<**/str> >> >> then I could start Solr without error. But when I make a request: >> >> http://localhost:8983/solr/**clustering?q=*%3A*&version=2.** >> 2&start=0&rows=10&indent=on&**wt=json&fl=id&carrot.**produceSummary=false<http://localhost:8983/solr/clustering?q=*%3A*&version=2.2&start=0&rows=10&indent=on&wt=json&fl=id&carrot.produceSummary=false> >> >> I got an exception: >> >> org.apache.solr.common.**SolrException: Carrot2 clustering failed >>at org.apache.solr.handler.**clustering.carrot2.** >> CarrotClusteringEngine.**cluster(**CarrotClusteringEngine.java:**224) >>at org.apache.solr.handler.**clustering.** >> ClusteringComponent.process(**ClusteringComponent.java:91) >>at org.apache.solr.handler.**component.SearchHandler.** >> handleRequestBody(**SearchHandler.java:186) >>at org.apache.solr.handler.**RequestHandlerBase.**handleRequest(** >> RequestHandlerBase.java:129) >>at org.apache.solr.core.**RequestHandlers$** >> LazyRequestHandlerWrapper.**handleRequest(RequestHandlers.**java:244) >>at org.apache.solr.core.SolrCore.**execute(SolrCore.java:1376) >>at org.apache.solr.servlet.**SolrDispatchFilter.execute(** >> SolrDispatchFilter.java:365) >>at org.apache.solr.servlet.**SolrDispatchFilter.doFilter(** >> SolrDispatchFilter.java:260) >>at org.mortbay.jetty.servlet.**ServletHandler$CachedChain.** >> doFilter(ServletHandler.java:**1212) >>at org.mortbay.jetty.servlet.**ServletHandler.handle(** >> ServletHandler.java:399) >>at org.mortbay.jetty.security.**SecurityHandler.handle(** >> SecurityHandler.java:216) >>at org.mortbay.jetty.servlet.**SessionHandler.handle(** >> SessionHandler.java:182) >>at org.mortbay.jetty.handler.**ContextHandler.handle(** >> ContextHandler.java:766) >>at org.mortbay.jetty.webapp.**WebAppContext.handle(** >> WebAppContext.java:450) >>at org.mortbay.jetty.handler.**ContextHandlerCollection.**handle(* >> *ContextHandlerCollection.java:**230) >>at org.mortbay.jetty.handler.**HandlerCollection.handle(** >> HandlerCollection.java:114) >>at org.mortbay.jetty.handler.**HandlerWrapper.handle(** >> HandlerWrapper.java:152) >>at org.mortbay.jetty.Server.**handle(Server.java:326) >>at org.mortbay.jetty.**HttpConnection.handleRequest(** >> HttpConnection.java:542) >>at org.mortbay.jetty.**HttpConnection$RequestHandler.** >> headerComplete(HttpConnection.**java:928) >>at org.mortbay.jetty.HttpParser.**parseNext(HttpParser.java:549) >>at org.mortbay.jetty.HttpParser.**parseAvailable(HttpParser.** >> 
java:212) >>at org.mortbay.jetty.**HttpConnection.handle(** >> HttpConnection.java:404) >>at org.mortbay.jetty.bio.**SocketConnector$Connection.** >> run(SocketConnector.java:228) >>at org.mortbay.thread.**QueuedThreadPool$PoolThread.** >> run(QueuedThreadPool.java:582) >> Caused by: org.carrot2.core.**ComponentInitializationExcepti**on: >> org.carrot2.util.attribute.**AttributeBindingException: Could not assign >> field org.carrot2.text.**preprocessing.pipeline.** >> CompletePreprocessingPipeline#**tokenizerFactory with value >> org.apache.solr.handler.**clustering.carrot2.** >> LuceneCarrot2TokenizerFactory >>at sun.reflect.**NativeConstructorAccessorImpl.**newInstance0(Native >> Method) &
Re: using Carrot2 custom ITokenizerFactory
process(Controller.java:333) >at org.carrot2.core.Controller.**process(Controller.java:240) >at org.apache.solr.handler.**clustering.carrot2.** > CarrotClusteringEngine.**cluster(**CarrotClusteringEngine.java:**220) >... 24 more > Caused by: org.carrot2.util.attribute.**AttributeBindingException: Could > not assign field org.carrot2.text.**preprocessing.pipeline.** > CompletePreprocessingPipeline#**tokenizerFactory with value > org.apache.solr.handler.**clustering.carrot2.** > LuceneCarrot2TokenizerFactory >at org.carrot2.util.attribute.**AttributeBinder$** > AttributeBinderActionBind.**performAction(AttributeBinder.**java:614) >at org.carrot2.util.attribute.**AttributeBinder.bind(** > AttributeBinder.java:311) >at org.carrot2.util.attribute.**AttributeBinder.bind(** > AttributeBinder.java:349) >at org.carrot2.util.attribute.**AttributeBinder.bind(** > AttributeBinder.java:219) >at org.carrot2.util.attribute.**AttributeBinder.set(** > AttributeBinder.java:149) >at org.carrot2.util.attribute.**AttributeBinder.set(** > AttributeBinder.java:129) >at org.carrot2.core.**ControllerUtils.init(** > ControllerUtils.java:50) >at org.carrot2.core.**PoolingProcessingComponentMana**ger$** > ComponentInstantiationListener**.objectInstantiated(** > PoolingProcessingComponentMana**ger.java:189) >... 30 more > Caused by: java.lang.**IllegalArgumentException: Can not set > org.carrot2.text.linguistic.**ITokenizerFactory field org.carrot2.text.** > preprocessing.pipeline.**BasicPreprocessingPipeline.**tokenizerFactory to > java.lang.String >at sun.reflect.**UnsafeFieldAccessorImpl.** > throwSetIllegalArgumentExcepti**on(UnsafeFieldAccessorImpl.**java:146) >at sun.reflect.**UnsafeFieldAccessorImpl.** > throwSetIllegalArgumentExcepti**on(UnsafeFieldAccessorImpl.**java:150) >at sun.reflect.**UnsafeObjectFieldAccessorImpl.**set(** > UnsafeObjectFieldAccessorImpl.**java:63) >at java.lang.reflect.Field.set(**Field.java:657) >at org.carrot2.util.attribute.**AttributeBinder$** > AttributeBinderActionBind.**performAction(AttributeBinder.**java:610) >... 37 more > > > I should dig in, but if you have any clue, it would be appreciated. I'm > using 3.6 branch. > > > koji > -- > Query Log Visualizer for Apache Solr > http://soleami.com/ > > (12/05/20 21:11), Stanislaw Osinski wrote: > >> Hi Koji, >> >> It's fixed in trunk and 3.6.1 branch now. If you hit any other issues with >> this, let me know. >> >> Staszek >> >> On Sun, May 20, 2012 at 1:02 PM, Koji Sekiguchi >> wrote: >> >> Hi Staszek, >>> >>> I'll wait your fix. Thank you! >>> >>> Koji Sekiguchi from iPad2 >>> >>> On 2012/05/20, at 18:18, Stanislaw Osinski >>> wrote: >>> >>> Hi Koji, >>>> >>>> You're right, the current code overwrites the custom tokenizer though it >>>> shouldn't. LuceneCarrot2TokenizerFactory is there to avoid circular >>>> dependencies (Carrot2 default tokenizer depends on Lucene), but it >>>> shouldn't be an issue with custom tokenizers. >>>> >>>> I'll try to commit a fix later today. 
Meanwhile, if you have a chance to >>>> recompile the code, a temporary solution would be to hardcode your >>>> tokenizer class into the fragment you pasted: >>>> >>>> BasicPreprocessingPipelineDesc**riptor.attributeBuilder(** >>>> initAttributes) >>>> .stemmerFactory(**LuceneCarrot2StemmerFactory.**class) >>>> .tokenizerFactory(**YourCustomTokenizer.class) >>>> .lexicalDataFactory(**SolrStopwordsCarrot2LexicalDat** >>>> aFactory.class); >>>> >>>> Staszek >>>> >>>> On Sun, May 20, 2012 at 9:40 AM, Koji Sekiguchi >>>> >>> wrote: >>> >>>> >>>> Hello, >>>>> >>>>> As I'd like to use custom ITokenizerFactory, I set the following >>>>> Carrot2 >>>>> key >>>>> in solrconfig.xml: >>>>> >>>>> >>>> enable="${solr.clustering.**enabled:true}" >>>>> class="solr.clustering.**ClusteringComponent"> >>>>> >>>>> default >>>>>: >>>>> >>>> >>>>> name="PreprocessingPipeline.**tokenizerFactory">my.own.** >>> TokenizerFactory >>> >>>> >>>>> >>>>> >>>>> But seems that CarrotClusteringEngine overwrites it with >>>>> LuceneCarrot2TokenizerFactory >>>>> in init() method: >>>>> >>>>> BasicPreprocessingPipelineDesc**riptor.attributeBuilder(** >>>>> initAttributes) >>>>> .stemmerFactory(**LuceneCarrot2StemmerFactory.**class) >>>>> .tokenizerFactory(**LuceneCarrot2TokenizerFactory.**class) >>>>> .lexicalDataFactory(**SolrStopwordsCarrot2LexicalDat** >>>>> aFactory.class); >>>>> >>>>> Am I missing something? >>>>> >>>>> koji >>>>> -- >>>>> Query Log Visualizer for Apache Solr >>>>> http://soleami.com/ >>>>> >>>>> >>> >> >
Re: Newbie with Carrot2?
Hi Bruno, Just to confirm -- are you seeing the clusters array in the result at all ()? To get reasonable clusters, you should request at least 30-50 documents (rows), but even with smaller values, you should see an empty clusters array. Staszek On Sun, May 20, 2012 at 9:20 PM, Bruno Mannina wrote: > Le 20/05/2012 11:43, Stanislaw Osinski a écrit : > > Hi Bruno, >> >> Here's the wiki documentation for Solr's clustering component: >> >> http://wiki.apache.org/solr/**ClusteringComponent<http://wiki.apache.org/solr/ClusteringComponent> >> >> For configuration examples, take a look at the Configuration section: >> http://wiki.apache.org/solr/**ClusteringComponent#**Configuration<http://wiki.apache.org/solr/ClusteringComponent#Configuration> >> . >> >> If you hit any problems, let me know. >> >> Staszek >> >> On Sun, May 20, 2012 at 11:38 AM, Bruno Mannina wrote: >> >> Dear all, >>> >>> I use Solr 3.6.0 and I indexed some documents (around 12000). >>> Each documents contains a Abstract-en field (and some other fields). >>> >>> Is it possible to use Carrot2 to create cluster (classes) with the >>> Abstract-en field? >>> >>> What must I configure in the schema.xml ? or in other files? >>> >>> Sorry for my newbie question, but I found only documentation for >>> Workbench >>> tool. >>> >>> Bruno >>> >>> Thx for this link but I have a problem to configure my solrconfig.xml > in the section: > (note I run java -Dsolr.clustering.enabled=**true) > > I have a field named abstract-en, and I would like to use only this field. > > I would like to know if my requestHandler is good? > I have a doubt with the content of : carrot.title, carrot.url > > and also the latest field > abstract-en > edismax > > abstract-en^1.0 > > *:* > 10 > *,score > > because the result when I do a request is exactly like a search request > (without more information) > > > My entire requestHandler is: > > enable="${solr.clustering.**enabled:false}" class="solr.SearchHandler"> > > true > **default > **true > > name > id > > **abstract-en > > **true > > > > false > abstract-en > edismax > > abstract-en^1.0 > > *:* > 10 > *,score > > > clustering > > > >
Re: System requirements in my case?
> > 3) Measure the size of the index folder, multiply with 8 to get a clue of >> total index size >> > With 12 000 docs my index folder size is: 33 MB > ps: I use "solr.clustering.enabled=true" Clustering is performed at search time; it doesn't affect the size of the index (but it obviously does affect the search response times). Staszek
Re: Carrot2 using rawtext of field for clustering
> > Is there any workaround in Solr/Carrot2 So that we could pass tokens that'd > been filtered with customer tokenizer/filters instead of rawtext that it > currently > uses for clustering ? > > I read an issue in following link too . > > https://issues.apache.org/jira/browse/SOLR-2917 > > > Is writing our own parsers to filter text documents before indexing to SOLR > could be only the right approach currently ? please let me know if anyone > have come across this issue and have other better suggestions? > Until SOLR-2917 is resolved, this solution seems the easiest to implement. Alternatively, you could provide a custom implementation of Carrot2's tokenizer ( http://download.carrot2.org/stable/javadoc/org/carrot2/text/analysis/ITokenizer.html) through the appropriate factory attribute ( http://doc.carrot2.org/#section.attribute.lingo.PreprocessingPipeline.tokenizerFactory). The custom implementation would need to apply the required filtering. Regardless of the approach, one thing to keep in mind is that Carrot2 draws labels from the input text, so if your filtered stream omits e.g. prepositions, the labels will be less readable. Staszek
Re: Carrot2 clustering component
Hi, I think the exception is caused by the fact that you're trying to use the latest version of Carrot2 with Solr 1.4.x. There are two alternative solutions here: * as described in http://wiki.apache.org/solr/ClusteringComponent, invoke "ant get-libraries" to get the compatible JAR files. or * use the latest version of Carrot2 with Solr 1.4.x by installing the compatibility package, documentation is here: http://download.carrot2.org/stable/manual/#section.solr Cheers, Staszek On Tue, Jan 18, 2011 at 13:36, Isha Garg wrote: > Hi, >Can anyone help me to solve the error: > Class org.carrot2.util.pool.SoftUnboundedPool does not implement the > requested interface org.carrot2.util.pool.IParameterizedPool >at > org.carrot2.core.PoolingProcessingComponentManager.(PoolingProcessingComponentManager.java:77) >at > org.carrot2.core.PoolingProcessingComponentManager.(PoolingProcessingComponentManager.java:62) >at org.carrot2.core.ControllerFactory.create(ControllerFactory.java:158) >at > org.carrot2.core.ControllerFactory.createPooling(ControllerFactory.java:71) >at > org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.(CarrotClusteringEngine.java:61) >at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) >at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) >at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) >at java.lang.reflect.Constructor.newInstance(Constructor.java:513) >at java.lang.Class.newInstance0(Class.java:355) >at java.lang.Class.newInstance(Class.java:308) >at > org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:396) >at > org.apache.solr.handler.clustering.ClusteringComponent.inform(ClusteringComponent.java:121) >at > org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:486) >at org.apache.solr.core.SolrCore.(SolrCore.java:588) >at > org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137) >at > org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83) >at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99) >at > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) >at > org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594) >at org.mortbay.jetty.servlet.Context.startContext(Context.java:139) >at > org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218) >at > org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500) >at > org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448) >at > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) >at > org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147) >at > org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161) >at > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) >at > org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147) >at > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) >at > org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117) >at org.mortbay.jetty.Server.doStart(Server.java:210) >at > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) >at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929) >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >at java.lang.reflect.Method.invoke(Method.java:597) >at org.mortbay.start.Main.invokeMain(Main.java:183) >at org.mortbay.start.Main.start(Main.java:497) >at org.mortbay.start.Main.main(Main.java:115) > 18 Jan, 2011 6:03:30 PM org.apache.solr.common.SolrException log > SEVERE: java.lang.IncompatibleClassChangeError: Class > org.carrot2.util.pool.SoftUnboundedPool does not implement the requested > interface org.carrot2.util.pool.IParameterizedPool >at > org.carrot2.core.PoolingProcessingComponentManager.(PoolingProcessingComponentManager.java:77) >at > org.carrot2.core.PoolingProcessingComponentManager.(PoolingProcessingComponentManager.java:62) >at org.carrot2.core.ControllerFactory.create(ControllerFactory.java:158) >at > org.carrot2.core.ControllerFactory.createPooling(ControllerFactory.java:71) >at > org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.(CarrotClusteringEngine.java:61) >at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) >at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) >at > sun.refl
Re: assist with the Clustering component in Solr/Lucene
Hi Ramdev, Both of the clustering algorithms that ship with Solr (Lingo and STC) are designed to allow one document to appear in more than one cluster, which actually does make sense in many scenarios. There's no easy way to force them to produce hard clusterings because this would require a complete change in the way the algorithms work. If you need each document to belong to exactly one cluster, you'd have to post-process the clusters to remove the redundant document assignments. Alternatively, in case of the Lingo algorithm, you can try lowering the "LingoClusteringAlgorithm.clusterMergingThreshold" to some value in the range of 0.2--0.5. If you do that, clusters containing overlapping documents will get merged. For more information about this attribute, see here: http://download.carrot2.org/stable/manual/#section.attribute.LingoClusteringAlgorithm.clusterMergingThreshold . Cheers, Staszek On Wed, Mar 30, 2011 at 18:21, Markus Jelsma wrote: > Yes, you can set engine specific parameters. Check the comments in your > snippety. > > > Hi: > > I recently included the CLustering component into Solr and updated the > > requestHandler accordingly (in solrconfig.xml). Snippet of the Config for > > the CLuserting: > > > >> name="clusteringComponent" > > enable="${solr.clustering.enabled:false}" > > class="org.apache.solr.handler.clustering.ClusteringComponent" > > > > > > > > > default > > > >> > name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgori > > thm > >name="LingoClusteringAlgorithm.desiredClusterCountBase">20 > > > > > > stc > >> > name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm< > > /str> > > > > > > snippet of the Config for requestHandler > >> default="true"> > > > >explicit > > > >true > >default > >true > > > >headline > >pi > > > >headline > > > >true > > > > > > > >false > > > > > > clusteringComponent > > > > > > > > > > When I perform a search, I see that the Cluster section within the Solr > > results shows me results that are not quite consistent. There are two > > documents that are reported in two different documents > > > > Are there parameters that can be set that will prevent this from > happening > > ? > > > > > > Thanks much > > > > Ramdev >
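For reference, the threshold can be set on the clustering engine in solrconfig.xml. The fragment below is only a sketch modelled on the standard clustering example shipped with Solr; apart from the clusterMergingThreshold attribute itself, the element names and the 0.3 value are assumptions to be adapted to your setup:

  <searchComponent name="clusteringComponent"
                   enable="${solr.clustering.enabled:false}"
                   class="org.apache.solr.handler.clustering.ClusteringComponent">
    <lst name="engine">
      <str name="name">default</str>
      <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
      <!-- lower values merge clusters with overlapping documents more aggressively -->
      <double name="LingoClusteringAlgorithm.clusterMergingThreshold">0.3</double>
    </lst>
  </searchComponent>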
Re: assist with the Clustering component in Solr/Lucene
> Both of the clustering algorithms that ship with Solr (Lingo and STC) are > designed to allow one document to appear in more than one cluster, which > actually does make sense in many scenarios. There's no easy way to force > them to produce hard clusterings because this would require a complete > change in the way the algorithms work. If you need each document to belong > to exactly one cluster, you'd have to post-process the clusters to remove > the redundant document assignments. > On second thought, I have a simple implementation of k-means clustering that could do hard clustering for you. It's not available yet; it will most probably be part of the next major release of Carrot2 (the package that does the clustering). Please watch this issue http://issues.carrot2.org/browse/CARROT-791 to get updates on this. Cheers, S.
Re: assist with the Clustering component in Solr/Lucene
> I added the parameter as you suggested. > (LingoClusteringAlgorithm.clusterMergingThreshold) into the searchComponent > section that describes the Clustering module > Changing the value of the parameter did not have any effect on my search > results. > > However, when I used the Carrot2 workbench, I could see the effect of > changing the value. (from 6 clusters it went down to 2 clusters) > Interesting... Can you, for the sake of debugging, append &LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request URL? S.
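For example, with a stock example setup the full request could look like the URL below (host, core path and query are placeholders). Carrot2 attributes passed as request parameters override the values configured in solrconfig.xml for that single request, which makes them convenient for this kind of debugging:

  http://localhost:8983/solr/select?q=test&rows=100&clustering=true&LingoClusteringAlgorithm.clusterMergingThreshold=0.0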
Re: assist with the Clustering component in Solr/Lucene
Thanks for the confirmation, I'll take a look at the issue. S. On Thu, Mar 31, 2011 at 17:24, wrote: > That did make a difference, I now see the exact number of cluster i see > from the workbench. > I am of course interested in why the config changes did not have much > effect. However, I am happy that by adding the threshold to my request URL > produces the desired results > > let me know if I can do any more tests and I will do so. Thanks much > > Ramdev > > > > On Mar 31, 2011, at 10:18 AM, Stanislaw Osinski wrote: > > > I added the parameter as you suggested. >> (LingoClusteringAlgorithm.clusterMergingThreshold) into the searchComponent >> section that describes the Clustering module >> Changing the value of the parameter did not have any effect on my search >> results. >> >> However, when I used the Carrot2 workbench, I could see the effect of >> changing the value. (from 6 clusters it went down to 2 clusters) >> > > Interesting... Can you, for the sake of debugging, append > &LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request URL? > > S. > > >
Re: assist with the Clustering component in Solr/Lucene
> > Both of the clustering algorithms that ship with Solr (Lingo and STC) are >> designed to allow one document to appear in more than one cluster, which >> actually does make sense in many scenarios. There's no easy way to force >> them to produce hard clusterings because this would require a complete >> change in the way the algorithms work. If you need each document to belong >> to exactly one cluster, you'd have to post-process the clusters to remove >> the redundant document assignments. >> > > On second thought, I have a simple implementation of k-means clustering > that could do hard clustering for you. It's not available yet; it will most > probably be part of the next major release of Carrot2 (the package that does > the clustering). Please watch this issue > http://issues.carrot2.org/browse/CARROT-791 to get updates on this. > Just to let you know: Carrot2 3.5.0 has landed in Solr trunk and branch_3x, so you can use the bisecting k-means clustering algorithm (org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm) which will produce non-overlapping clusters for you. The downside of this simple implementation of k-means is that, for the time being, it produces one-word cluster labels rather than phrases as Lingo and STC do. Cheers, S.
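To try it out, an additional engine entry along these lines can be added to the clustering component in solrconfig.xml (a sketch; the engine name "kmeans" is arbitrary) and then selected per request with clustering.engine=kmeans:

  <lst name="engine">
    <str name="name">kmeans</str>
    <str name="carrot.algorithm">org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm</str>
  </lst>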
Re: solr 3.1 java.lang.NoClassDefFoundError org/carrot2/core/ControllerFactory
Hi Bryan, You'll also need to make sure the your ${solr.home}/contrib/clustering/lib directory is in the classpath; that directory contains the Carrot2 JARs that provide the classes you're missing. I think the example solrconfig.xml has the relevant declarations. Cheers, S. On Tue, Jun 7, 2011 at 13:48, bryan rasmussen wrote: > As per the subject I am getting java.lang.NoClassDEfFoundError > org/carrot2/core/ControllerFactory > when I try to run clustering. > > I am using Solr 3.1: > > I get the following error: > > java.lang.NoClassDefFoundError: org/carrot2/core/ControllerFactory >at > org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.(CarrotClusteringEngine.java:74) >at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) >at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown > Source) >at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown > Source) >at java.lang.reflect.Constructor.newInstance(Unknown Source) >at java.lang.Class.newInstance0(Unknown Source) >at java.lang.Class.newInstance(Unknown Source) >at > org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:412) >at > org.apache.solr.handler.clustering.ClusteringComponent.inform(ClusteringComponent.java:203) >at > org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:522) >at org.apache.solr.core.SolrCore.(SolrCore.java:594) >at org.apache.solr.core.CoreContainer.create(CoreContainer.java:458) >at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316) >at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207) >at > org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130) >at > org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94) >at > org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97) >at > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) >at > org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713) >at org.mortbay.jetty.servlet.Context.startContext(Context.java:140) >at > org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282) >at > org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518) >at > org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499) >at > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) >at > org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152) >at > org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156) >at > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) >at > org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152) >at > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) >at > org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130) >at org.mortbay.jetty.Server.doStart(Server.java:224) >at > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) >at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985) >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) >at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) >at java.lang.reflect.Method.invoke(Unknown Source) >at org.mortbay.start.Main.invokeMain(Main.java:194) >at org.mortbay.start.Main.start(Main.java:534) >at org.mortbay.start.Main.start(Main.java:441) >at 
org.mortbay.start.Main.main(Main.java:119) > Caused by: java.lang.ClassNotFoundException: > org.carrot2.core.ControllerFactory >at java.net.URLClassLoader$1.run(Unknown Source) >at java.security.AccessController.doPrivileged(Native Method) >at java.net.URLClassLoader.findClass(Unknown Source) >at java.lang.ClassLoader.loadClass(Unknown Source) >at java.net.FactoryURLClassLoader.loadClass(Unknown Source) > > using the following configuration > > > class="org.apache.solr.handler.clustering.ClusteringComponent" > name="clustering"> > >default > name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm > > >20 > > > class="org.apache.solr.handler.component.SearchHandler"> > > explicit > > >title >all_text >all_text title > > 150 > > >clustering > > > > > > with the following command to start solr > java -Dsolr.clustering.enabled=true > -Dsolr.solr.home="C:\projects\solrexample\solr" -jar start.jar > > Any idea as to why crusty is not working? > > Thanks, > Bryan Rasmussen >
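For reference, the declarations in the example solrconfig.xml look roughly like the lines below. The dir values are resolved relative to the core's instanceDir, and the exact paths and regex may differ between Solr versions, so treat this as a sketch to adapt to your layout:

  <lib dir="../../contrib/clustering/lib/" />
  <lib dir="../../dist/" regex="apache-solr-clustering-\d.*\.jar" />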
Re: Mahout & Solr
> > Is it possible to use the clustering component to use predefined clusters > generated by Mahout? Actually, the existing Solr ClusteringComponent's API has been designed to deal with both search results clustering (implemented by Carrot2) and off-line clustering of the whole index. The latter has not yet been implemented, so the API is very likely to change depending on the specific design decisions (should clustering be triggered through Solr or externally? should the clusters be stored in Solr? how should new documents be handled? how should the clusters be used at search time?). I can also imagine a simpler approach based on a search results clustering "algorithm" that would simply fetch Mahout's predefined clusters for each document returned in the search results. Getting this to work is a matter of implementing a dedicated SearchClusteringEngine (http://lucene.apache.org/solr/api/org/apache/solr/handler/clustering/SearchClusteringEngine.html) and should be fairly straightforward, at least in terms of interaction with Solr. Staszek
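To illustrate the second idea: assuming the Mahout cluster assignments have been exported to a simple docId-to-label lookup, the per-request work of such an engine boils down to grouping the returned documents by their pre-computed cluster. The sketch below only shows that grouping step; all names are made up and the actual hook into Solr would be your SearchClusteringEngine implementation:

  import java.util.*;

  public class OfflineClusterGrouping {
    // Group the ids of the returned search results by a cluster label that was
    // computed offline (e.g. by Mahout) and loaded into the clusterOf map.
    public static Map<String, List<String>> groupByCluster(
        List<String> returnedDocIds, Map<String, String> clusterOf) {
      Map<String, List<String>> clusters = new LinkedHashMap<String, List<String>>();
      for (String id : returnedDocIds) {
        String label = clusterOf.containsKey(id) ? clusterOf.get(id) : "Other Topics";
        List<String> members = clusters.get(label);
        if (members == null) {
          members = new ArrayList<String>();
          clusters.put(label, members);
        }
        members.add(id);
      }
      return clusters;
    }
  }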
Re: Solr Clustering For Multiple Pages
Hi, Currently, only the clustering of search results is implemented in Solr; clustering of the whole index is not possible out of the box. In other words, clustering applies only to the records you fetch during searching. For example, if you set rows=10, only the 10 returned documents will be clustered. You can try setting larger rows values (e.g. 100, 200, 500) to get more clusters. Staszek On Mon, Jun 20, 2011 at 11:36, nilay@gmail.com wrote: > Hi > > How can i create cluster for all records. > Currently i am sending clustering=true param to solr and it give the > cluster in response , > but it give for 10 rows because rows=10 . So please suggest me how can i > get the cluster for all records . > > How can i search with in cluster . > > e.g cluster created > Model(20) > Test(10) > > if i click on Model the i should get 20 records by filter so please give > me > idea about this . > > > Please help me to resolve this problem > > Regards > Nilay Tiwari > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Solr-Clustering-For-Multiple-Pages-tp3085507p3085507.html > Sent from the Solr - User mailing list archive at Nabble.com. >
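For example (host, core path and query are placeholders):

  http://localhost:8983/solr/select?q=your+query&rows=200&clustering=true

Keep in mind that all 200 documents have to be fetched and clustered on every request, so larger rows values will make each query noticeably slower.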
Re: Solr Clustering For Multiple Pages
I don't quite follow, I must admit. Maybe it's faceting you're after? http://wiki.apache.org/solr/SolrFacetingOverview Staszek On Wed, Jun 22, 2011 at 08:40, nilay@gmail.com wrote: > Can you please tell me how can i apply filter in cluster data in Solr ? > > Currently i storing docid and topic name in Map and get the ids by topic > from Map and then pass into solr separating by OR condition > > Is there any other way to do this > > > > - > Regards > Nilay Tiwari > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Solr-Clustering-For-Multiple-Pages-tp3085507p3094390.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: what is solr clustering component
> > and my second question is does clustering effect indexes. > No, it doesn't. Clustering is performed only on the search results produced by Solr, it doesn't change anything in the index. Cheers, Staszek
Re: Multicore clustering setup problem
Hi, Can you post the full stack trace? I'd need to know if it's really org.apache.solr.handler.clustering.ClusteringComponent that's missing or some other class ClusteringComponent depends on. Cheers, Staszek On Thu, Jun 30, 2011 at 04:19, Walter Closenfleight < walter.p.closenflei...@gmail.com> wrote: > I had set up the clusteringComponent in solrconfig.xml for my first core. > It > has been working fine and now I want to get my next core working. I set up > the second core with the clustering component so that I could use it, use > solritas properly, etc. but Solr did not like the solrconfig.xml changes > for > the second core. I'm getting this error when Solr is started or when I hit > a > Solr related URL: > > SEVERE: org.apache.solr.common.SolrException: Error loading class > 'org.apache.solr.handler.clustering.ClusteringComponent' > > Should the clusteringComponent be set up in a shared configuration file > somehow or is there something else I am doing wrong? > > Thanks in advance! >
Re: Multicore clustering setup problem
It looks like the whole clustering component JAR is not on the classpath. I remember that I once dealt with a similar issue in Solr 1.4: the cause was the relative path in the <lib> tag being resolved against the core's instanceDir, which made the path incorrect when it was copied directly from the single-core configuration. Try correcting the relative paths or replacing them with absolute ones; that should solve the problem. Cheers, Staszek
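For example, the same relative path resolves to a different location for each core, so something like the first declaration below may need to become an absolute path (directories are hypothetical):

  <lib dir="../../contrib/clustering/lib/" />      <!-- relative: resolved against this core's instanceDir -->
  <lib dir="/opt/solr/contrib/clustering/lib/" />  <!-- absolute: core-independent -->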
Re: Solr Clustering For Multiple Pages
> > I am asking about the filter after clustering . Faceting is based on the > single field so,if we need to filter we can search in related field . But > in clustering it is created by multiple field then how can we create a > filter for that. > > Example > > after clusetring you get the following > > Model(20) > System(15) > Other Topics(5) > > if i will click on Model then i should get record associated with Model > I'm not sure what you mean by "filter" -- ids of documents belonging to each cluster are part of the response, see the "docs" array inside the cluster (see http://wiki.apache.org/solr/ClusteringComponent#Quick_Start for example output). When the user clicks a cluster, you just need to show the documents with ids specified inside the cluster the user clicked. Cheers, Staszek
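In other words, the "filter" for a clicked cluster can simply be a query on the unique key field built from that cluster's docs array. For example, assuming the unique key field is called id and the clicked cluster's docs array lists 101 and 102 (hypothetical values):

  http://localhost:8983/solr/select?q=*:*&fq=id:(101 OR 102)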
Re: Multicore clustering setup problem
Hi Walter, That makes sense, but this has always been a multi-core setup, so the paths > have not changed, and the clustering component worked fine for core0. The > only thing new is I have fine tuned core1 (to begin implementing it). > Previously the solrconfig.xml file was very basic. I replaced it with > core0's solrconfig.xml and made very minor changes to it (unrelated to > clustering) - it's a nearly identical solrconfig.xml file so I'm surprised > it doesn't work for core1. > I'd probably need to take a look at the whole Solr dir you're working with; clearly there's something wrong with the classpath of core1. Again, I'm wondering if perhaps since both cores have the clustering > component, if it should have a shared configuration in a different file > used > by both cores(?). Perhaps the duplicate clusteringComponent configuration > for both cores is the problem? > I'm not an expert on Solr's internals related to core management, but I once did configure two cores with search results clustering, where the clustering configuration and <lib> entries were specified for each core separately, so this is unlikely to be a problem. Another approach would be to put all the JARs required for clustering in a common directory and point Solr to that directory using the sharedLib attribute of the <solr> tag in solr.xml: http://wiki.apache.org/solr/CoreAdmin#solr. But it really should work both ways. If you can somehow e-mail me (off-list) the listing of your Solr directory and the contents of your configuration XMLs, I may be able to trace the problem for you. Cheers, Staszek
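For the sharedLib option, the sketch below shows the general shape of solr.xml for that era of Solr (core names and the lib directory are placeholders). With this layout the clustering JARs would go into the lib/ directory next to solr.xml and be visible to both cores:

  <solr persistent="true" sharedLib="lib">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="core0" />
      <core name="core1" instanceDir="core1" />
    </cores>
  </solr>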
Re: How to use solr clustering to show in search results
The "docs" array contained in each cluster contains ids of documents belonging to the cluster, so for each id you need to look up the document's content, which comes earlier in the response (in the response/docs array). Cheers, Staszek On Thu, Jun 30, 2011 at 11:50, Romi wrote: > wanted to use clustering in my search results, i configured solr for > clustering and i got following json for clusters. But i am not getting how > to use it to show in search results. as corresponding to one doc i have > number of fields and up till now i am showing name, description and id. now > in clusters i have labels and doc id. then how to use my docs in clusters, > i > am really confused what to do Please reply. > > * > "clusters":[ > >{ > "labels":[ > "Complement any Business Casual or Semi-formal > Attire" >], > "docs":[ >"7799", >"7801" >] > }, >{ > "labels":[ >"Design" >], > "docs":[ >"8252", >"7885" >] > }, >{ > "labels":[ >"Elegant Ring has an Akoya Cultured Pearl" >], > "docs":[ >"8142", >"8139" >] > }, >{ > "labels":[ >"Feel Amazing in these Scintillating Earrings > Perfect" >], > "docs":[ >"12250", >"12254" >] > }, >{ > "labels":[ >"Formal Evening Attire" >], > "docs":[ >"8151", >"8004" >] > }, >{ > "labels":[ >"Pave Set" >], > "docs":[ >"7788", >"8169" >] > }, >{ > "labels":[ >"Subtle Look or Layer it or Attach" >], > "docs":[ >"8014", >"8012" >] > }, > { > "labels":[ >"Three-stone Setting is Elegant and Fun" >], > "docs":[ >"8335", >"8337" >] > }, >{ > "labels":[ >"Other Topics" >], > "docs":[ >"8038", >"7850", >"7795", >"7989", >"7797" >] > { >]* > > > - > Thanks & Regards > Romi > -- > View this message in context: > http://lucene.472066.n3.nabble.com/How-to-use-solr-clustering-to-show-in-search-results-tp3125149p3125149.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: Clustering not working when using 'text' field as snippet.
Hi Pablo, The reason clustering doesn't work with the "text" field is that the field is not stored: For clustering to work, you'll need to keep your documents' titles and content in stored fields. Staszek On Fri, Aug 12, 2011 at 10:28, Pablo Queixalos wrote: > Hi, > > > > > > I am using solr-3.3.0 and carrot² clustering which works fine out of the > box with the examples doc and default solr configuration (the 'features' > Field is used as snippet). > > > > I indexed my own documents using the embed ExtractingRequestHandler wich by > default stores contents in the 'text' Field. When configuring clustering on > 'text' as snippet, carrot doesn't work fine and only shows 'Other topics' > with all the documents within. It looks like carrot doesn't get the 'text' > Field stored content. > > > > > > If I store the documents content in the 'features' field and get back to > the original configuration clustering works fine. > > > > The only difference I see between 'text' and 'features' Fields in > schema.xml is that some CopyFields are defined for 'text'. > > > > > > I didn't debug solr.clustering.ClusteringComponent nor > CarrotClusteringEngine yet, but am I misunderstanding something about the > 'text' Field ? > > > > > > Thanks, > > > > Pablo. > >
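For example, in schema.xml the fields you point carrot.title / carrot.snippet at would need stored="true"; the field and type names below are only an illustration, not Pablo's actual schema:

  <field name="title"   type="text_general" indexed="true" stored="true" />
  <field name="content" type="text_general" indexed="true" stored="true" />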
Re: document clustering or tagging
Stanislaw Osinski, stanislaw.osin...@carrotsearch.com http://carrotsearch.com I have very huge solr index. I want to tag all documents with terms that > better represent that document like this > < > http://search.carrotsearch.com/carrot2-webapp/search?source=web&view=folders&skin=fancy-compact&query=rugby+in+london&results=100&algorithm=lingo3g&EToolsDocumentSource.country=ALL&EToolsDocumentSource.language=ALL&EToolsDocumentSource.safeSearch=false > > > . Does this type of clustering results is also come under document tagging? > No, this type of clustering will not solve your problem because it's suited for small/medium collections of documents (search results) rather than the whole index. For your specific problem I'd recommend some keyword / keyphrase extractor, which would generate tags for each document separately. Staszek
Re: Solr 1.4 Clustering / mlt AS search?
Hi, On Tue, Aug 11, 2009 at 22:19, Mark Bennett wrote: Carrot2 has several pluggable algorithms to choose from, though I have no > evidence that they're "better" than Lucene's. Where TF/IDF is sort of a > one > step algebraic calculation, some clustering algorithms use iterative > approaches, etc. I'm not sure if I completely follow the way in which you'd like to use Carrot2 for scoring -- would you cluster the whole index? Carrot2 was designed to be a post-retrieval clustering algorithm and optimized to cluster small sets of documents (up to ~1000) in real time. All processing is performed in-memory, which limits Carrot2's applicability to really large sets of documents. S.
Re: Solr 1.4 Clustering / mlt AS search?
Hi, On Thu, Aug 13, 2009 at 19:29, Mark Bennett wrote: There are comments in the Solr materials about having an option to cluster > based on the entire document set, and some warning about this being > atypical > and possibly slow. And from what you're saying, for a big enough docset, > it > might go from "slow" to "impossible", I'm not sure. For Carrot2, it would go to "impossible" I'd say. But as Grant mentioned earlier, Mahout is developing clustering algorithms that should be able to handle the whole-index types of docsets. And so my question was, *if* you were willing to spend that much time and > effort to cluster all the text of all the documents (and if it were even > possible), would the result perform better than the standard TF/IDF > techniques? Depends on the algorithm, really. In case of Carrot2, we don't do re-ranking of documents within clusters, we simply use whatever document order we got on input. As far as I'm aware, most clustering algorithms do pretty much the same: they concentrate on finding groups of documents and don't delve much into the issues of ranking documents within clusters. > In the application I'm considering, the queries tend to be longer than > average, more like full sentences or more. And they tend to be of a > question and answer nature. I've seen references in several search engines > that QandA search sometimes benefits from alternative search techniques. > And, from a separate email, the IDF part of the standard similarity may be > causing a problem, so I'm casting a wide net for other ideas. Just > brainstorming here... :-) Because of what I described above, clustering the whole index may not give you the best results. But you can try something different. You could try fetching a bunch (100--500) of more or less relevant documents for the question (MLT should be fine to start with), add your question as an extra document, perform clustering and see where the question-document ends up. If it doesn't end up in the Other Topics cluster, you could examine if the other documents from the cluster give an answer to the question. In this scenario, Carrot2 should be fine, at least performance-wise. I've not followed the QA literature very closely, so it's hard to say what the results would be quality-wise, but it should be very quick to try. Carrot2 Clustering Workbench [1][2] may come in handy for the experiments too. S. [1] http://download.carrot2.org/head/manual/#section.workbench [2] http://download.carrot2.org/head/manual/#section.getting-started.xml-files
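If you want to prototype this outside Solr first, a bare-bones sketch with the Carrot2 3.x Java API could look like the following; document titles/snippets and the question text are placeholders, nothing is tuned, and the point is only to see which cluster the question-document lands in:

  import java.util.ArrayList;
  import java.util.List;
  import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
  import org.carrot2.core.Cluster;
  import org.carrot2.core.Controller;
  import org.carrot2.core.ControllerFactory;
  import org.carrot2.core.Document;
  import org.carrot2.core.ProcessingResult;

  public class QuestionClusteringSketch {
    public static void main(String[] args) {
      // Titles/snippets fetched from Solr or MLT for the question (placeholders)
      List<Document> docs = new ArrayList<Document>();
      docs.add(new Document("Result title 1", "Result snippet 1"));
      docs.add(new Document("Result title 2", "Result snippet 2"));
      // The question itself, added as one more "document"
      docs.add(new Document("QUESTION", "How do I configure X to do Y?"));

      Controller controller = ControllerFactory.createSimple();
      ProcessingResult result =
          controller.process(docs, null, LingoClusteringAlgorithm.class);

      // Report which cluster(s) the question-document ended up in
      for (Cluster cluster : result.getClusters()) {
        for (Document doc : cluster.getAllDocuments()) {
          if ("QUESTION".equals(doc.getTitle())) {
            System.out.println("Question landed in: " + cluster.getLabel()
                + " (" + cluster.getAllDocuments().size() + " docs)");
          }
        }
      }
    }
  }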
Re: SOLR-769 clustering
Hi there, I try to apply the stoplabels with the instructions that you given in the > solr clustering Wiki. But it didn't work. > > I am runing the patched solr on tomcat. So to enable the stop label. I add > "-cp " in to my system's CATALINA_OPTS. I > tried to change the file name from stoplabels.txt to stoplabel.en also . It > didn't work too. > > Then I also find out that in carrot manual page > ( > > http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words > ). > It suggested to edit the stopwords files inside the carrot2-core.jar. I > tried this but it didn't work too. > > I am not sure what is wrong with my set up. will it be caused by any sort > of > caching? > A quick and dirty hack would be to simply replace the corresponding files (stoplabels.*) in carrot2-mini.jar. I know the packaging of the clustering contrib has changed a bit, so let me see how it currently works and correct the wiki if needed. Thanks, Staszek
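For the record, the files can be replaced inside the JAR with the standard jar tool, run from the directory containing your modified copies (the JAR name should match the one shipped with your Solr version, and the files need to sit at the same path inside the JAR as the originals):

  jar uf carrot2-mini.jar stopwords.en stoplabels.en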
Re: SOLR-769 clustering
Hi, It seems like the problem can be on two layers: 1) getting the right contents of stop* files for Carrot2, 2) making sure Solr picks up the changes. I tried your quick and dirty hack too. It didn't work also. phase like > "Carbon Atoms in the Group" with "in" still appear in my clustering labels. > Here most probably layer 1) applies: if you add "in" to stopwords, the Lingo algorithm (Carrot2's default) will still create labels with "in" inside, but will not create labels starting / ending in "in". If you'd like to eliminate "in" completely, you'd need to put an appropriate regexp in stoplabels.*. For more details, please see Carrot2 manual: http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps The easiest way to tune the stopwords and see their impact on clusters is to use Carrot2 Document Clustering Workbench (see http://wiki.apache.org/solr/ClusteringComponent). > What i did is, > > 1. use "java uf carrot2-mini.jar stoplabels.en" command to replace the > stoplabel.en file. > 2. apply clustering patch. re-complie the solr with the new > carrot2-mini.jar. > 3. deploy the new apache-solr-1.4-dev.war to tomcat. > Once you make sure the changes to stopwords.* and stoplabels.* have the desired effect on clusters, the above procedure should do the trick. You can also put the modified files in WEB-INF/classes of the WAR, if that's any easier. For your reference, I've updated http://wiki.apache.org/solr/ClusteringComponent to contain a procedure working with the Jetty starter distributed in Solr's examples folder. > class="org.apache.solr.handler.clustering.ClusteringComponent" > name="clustering"> > >default > > name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm >20 >0.150 > name="carrot.lingo.threshold.candidateClusterThreshold">0.775 > Not really related to your issue, but the above file looks a little outdated -- the two parameters:"carrot.lingo.threshold.clusterAssignment" and "carrot.lingo.threshold.candidateClusterThreshold" are not there anymore (but there are many others: http://download.carrot2.org/stable/manual/#section.component.lingo). For most up to date examples, please see http://wiki.apache.org/solr/ClusteringComponent and solrconfig.xml in contrib\clustering\example\conf. Cheers, Staszek
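For illustration, a stoplabels.en entry is a Java regular expression matched against candidate cluster labels, one expression per line; something along the lines below would suppress any label containing "in" as a separate word (the expression is only an example, adjust it to the labels you actually want to remove):

  # discard any cluster label that contains "in" as a standalone word
  (?i).*\bin\b.*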
[ANN] Carrot2 version 3.1.0 released
Dear All, [Apologies for cross-posting.] This is just to let you know that we've released version 3.1.0 of Carrot2 Search Results Clustering Engine. The 3.1.0 release comes with: * Experimental support for clustering Chinese Simplified content (based on Lucene's Smart Chinese Analyzer) * Document Clustering Workbench usability improvements * Suffix Tree Clustering algorithm rewritten for better performance and clustering quality * Apache Solr clustering plugin (to be available in Solr 1.4, Grant's blog post: http://www.lucidimagination.com/blog/2009/09/28/solrs-new-clustering-capabilities/ ) Release notes: http://project.carrot2.org/release-3.1.0-notes.html On-line demo: http://search.carrot2.org Download: http://download.carrot2.org Project website: http://project.carrot2.org Thanks, Staszek -- Stanislaw Osinski, http://carrot2.org
Re: SOLR-769 clustering
Hi Antonio, - is there anyway to have minimum number of labels per cluster? The current search results clustering algorithms (from Carrot2) by design generate one label per cluster, so there is no way to force them to create more. What is the reason you'd like to have more labels per cluster? I'd leave the other two Solr-related questions to answer by a more competent person (Grant?). Cheers, Staszek
Re: SOLR-769 clustering
Hi Antonio, > To answer your question in terms of minimum term is, I am working with > "joke text" very short in length so the clusters are not so meaning full.. I > mean lot of adverbs and nouns, I thought increasing it might give me less > cluster but bit more meaningful (maybe not). Clustering this type of content (jokes, blogs) is tricky for Carrot2 algorithms, mostly because such input contains relatively little "informative" words (nouns, noun phrases) which are good for cluster labels, and more narrative ones (verbs, adjectives), which usually don't lead to meaningful labels / clusters. So I think the way to go would be to tune the clustering algorithm's stop words / stop label dictionaries to exclude the labels you don't like. I can't guarantee you can get decent clusters with this technique, but it's worth giving a try. Here's how to do that: 1. Download Carrot2 Clustering Workbench from: http://project.carrot2.org/download.html 2. Attach your Solr instance as a document source: http://download.carrot2.org/head/manual/#section.getting-started.solr 3. Try tuning the stop words / labels to get more meaningful labels: http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning For more advice you may want to post your questions on Carrot2 forum: http://project.carrot2.org/forum.html. Hope that helps. Cheers, Staszek
Re: SOLR-769 clustering
> > How would we enable people via SOLR-769 to do this? Good point, Grant! To apply the modified stopwords.* and stoplabels.* files to Solr, simply make them available in the classpath. For the example Solr runner scripts that would be something like: java -cp <dir-with-your-modified-resource-files> -Dsolr.solr.home=./clustering/solr -jar start.jar I've documented the whole tuning procedure on the Wiki: http://wiki.apache.org/solr/ClusteringComponent Cheers, S.
Re: clustering SOLR-769
Hi. > I built Solr from SVN today morning. I am using Clustering example. I > have added my own schema.xml. > > The problem is the even though I change carrot.snippet field from > features to filecontent the clustering results are not changed a bit. > Please note features field is also there in my document. > > name > > features > id > > Why I get the same cluster even though I have changed the > carrot.snippet. Whether there is some problem with my understarnding? If you get back to the clustering dir in examples and change features to manu do you see any change in clusters? Cheers, Staszek -- http://carrot2.org
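For reference, the field mapping lives in the engine configuration in solrconfig.xml and should look something like the lines below (the field names here come from the example schema; replace them with your own fields, which have to be stored). Also note the config is only re-read when Solr is restarted or the core is reloaded:

  <str name="carrot.title">name</str>
  <str name="carrot.snippet">features</str>
  <str name="carrot.url">id</str>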
Re: clustering SOLR-769
Hi there, > Is it possbile to specify more than one snippet field or should I use copy > field to copy copy two or three field into single field and specify it in > snippet field. Currently, you can specify only one snippet field, so you'd need to use copyField. Cheers, S.
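A copyField sketch for schema.xml (source and destination field names are placeholders; the destination field must be stored so clustering can see its content), after which carrot.snippet would point at clustering_text:

  <field name="clustering_text" type="text" indexed="true" stored="true" multiValued="true" />
  <copyField source="title"   dest="clustering_text" />
  <copyField source="summary" dest="clustering_text" />
  <copyField source="body"    dest="clustering_text" />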
Re: questions about Clustering
> > 1. if q=*:* is requested, Carrot2 will receive "MatchAllDocsQuery" >> via attributes. Is it OK? >> > > Yes, it only clusters on the Doc List, not the Doc Set (in other words, > it's your rows that matter) Just to add to that: Carrot2 should be able to cluster up to ~1000 search results, but by design it won't be able to process significantly more documents than that. The reason is that Carrot2 is a search results clustering engine and performs all processing in-memory. 2. I'd like to use it on an environment other than English, e.g. Japanese. >> I've implemented Carrot2JapaneseAnalyzer (w/ Payload/ITokenType) >> for this purpose. >> It worked well with ClusteringDocumentList example, but didn't >> work with CarrotClusteringEngine. >> >> What I did is that I inserted the following lines(+) to >> CarrotClusteringEngine: >> >> attributes.put(AttributeNames.QUERY, query.toString()); >> + attributes.put(AttributeUtils.getKey(Tokenizer.class, "analyzer"), >> + Carrot2JapaneseAnalyzer.class); >> >> There is no runtime errors, but Carrot2 didn't use my analyzer, >> it just ignored and used ExtendedWhitespaceAnalyzer (confirmed via >> debugger). >> Is it classloader problem? I placed my jar in ${solr.solr.home}/lib . >> > > > Hmmm, I'm not sure if the Carrot guys are on this list (they are on dev). > Can you share a simple example on the JIRA issue and we can discuss there? Yep, we're here too :-) The catch with analyzer is that this specific attribute is an initialization-time attribute, so you need to add it to the initAttributes map in the init() method of CarrotClusteringEngine. Please let me know if this solves the problem. If not, I'll investigate further. Cheers, Staszek
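In other words, the put() call would move from the per-request attributes map into CarrotClusteringEngine.init(). A sketch of the change is below; the initAttributes variable name is an assumption based on the Solr 1.4-era sources, so adjust it to whatever map is passed to the controller's init there:

  // Inside CarrotClusteringEngine.init(), before the Carrot2 controller is initialized:
  initAttributes.put(
      AttributeUtils.getKey(Tokenizer.class, "analyzer"),
      Carrot2JapaneseAnalyzer.class);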
Re: questions about Clustering
> > Hmm, I saw the comment in ClusteringDocumentList.java of Carrot2: > > /* > * If you know what query generated the documents you're about to cluster, > pass > * the query to the algorithm, which will usually increase clustering > quality. > */ > attributes.put(AttributeNames.QUERY, "data mining"); > > So I'm worried about clustering quality when Carrot2 got string > "MatchAllDocsQuery". The query is just a hint; without the query you should still be able to get decent clusters (at least for English; we've not tested Carrot2 much with Japanese). Cheers, Staszek
Re: Faceting on text fields
Hi, Sorry for being late to the party, let me try to clear some doubts about Carrot2. Do you know under what circumstances or application should we cluster the > whole corpus of documents vs just the search results? I think it depends on what you're trying to achieve. If you'd like to give the users some alternative way of exploring the search results by organizing them into semantically related groups (search results clustering), Carrot2 would be the appropriate tool. Its algorithms are designed to work with small input (up to ~1000 results) and try to provide meaningful labels for each cluster. Currently, Carrot2 has two algorithms: an implementation of Suffix Tree Clustering (STC, a classic in search results clustering research, designed by O. Zamir, implemented by Dawid Weiss) and Lingo (designed and implemented by myself). STC is very fast compared to Lingo, but the latter will usually get you better clusters. Some comparison of the algorithms is here: http://project.carrot2.org/algorithms.html, but ultimately, I'd encourage you to experiment (e.g. using Clustering Workbench). For best results, I'd recommend feeding the algorithms with contextual snippets generated based on the user's query. If the summary could consist of complete sentence(s) containing the query (as opposed to individual words delimited by "..."), you should be getting even nicer labels. One important thing for search results clustering is that it is done on-line, so it will add extra time to each search query your server handles. Plus, to get reasonable clusters, you'd need to fetch at least 50 documents from your index, which may put more load on the disks as well (sometimes clustering time may only be a fraction of the time required to get the documents from the index). Finally, to compare search results clustering with facets: UI-wise they may look similar, but I'd say they're two different things that complement each other. While the list of facets and their values is fairly static (brand names etc.), clusters are less "stable" -- they're generated dynamically for each search and will vary across queries. Plus, as for any other unsupervised machine learning technique, your clusters will never be 100% correct (as opposed to facets). Almost always you'll be getting one or two clusters that don't make much sense. When it comes to clustering the whole collection, it might be useful in a couple of scenarios: a) if you wanted to get some high level overview of what's in your collection, b) if you wanted to e.g. use clusters to re-rank the search results presented to the user (implicit clustering: showing a few documents from each cluster), c) if you wanted to distribute your index based on the semantics of the documents (wild guess, I'm not sure if anyone tried that in practice). In general, I feel clustering the whole index is much harder than search results clustering not only because of the different scale, but also because you'd need to tune the algorithm for your specific needs and data. For example, in scenario a) and a collection of 1M documents: how many top level clusters do you generate? 10? 1,000? If it's 10, the clusters may end up too general / meaningless; it might be hard to describe them concisely. If it's 1,000, clusters are likely to be more focused, but hard to browse... I must admit I haven't followed Mahout too closely; maybe there is some nice way of resolving these problems. If you have any other questions about Carrot2, I'll try to answer them here. 
Alternatively, feel free to join Carrot2 mailing lists. Thanks, Staszek -- http://www.carrot2.org
Re: Has anyone got Carrot2 working with Solr without using ant?
> You need, in addition to the ones shipped: > http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar > http://download.carrot2.org/maven2/org/carrot2/nni/1.0.0/nni-1.0.0.jar > > http://mirrors.ibiblio.org/pub/mirrors/maven2/org/simpleframework/simple-xml/1.7.3/simple-xml-1.7.3.jar > http://repo1.maven.org/maven2/pcj/pcj/1.2/pcj-1.2.jar > > These all go in the contrib/clustering/lib/downloads directory. From > there, the example should just work. > A quick heads up: our development server is down for maintenance today, so the temporary location of the NNI JAR is here: http://www.carrot2.org/download/maven2/org/carrot2/nni/1.0.0/nni-1.0.0.jar Apologies for the problem, S.
[ANN] Carrot2 3.2.0 released
Dear All, I'm happy to announce three releases from the Carrot Search team: Carrot2 v3.2.0, Lingo3G v1.3.1 and Carrot Search Labs. Carrot2 is an open source search results clustering engine. Version v3.2.0 introduces: * experimental support for clustering Korean and Arabic content, * a command-line batch processing application, * significant updates to the Flash-based cluster visualization. As of version 3.2.0, Carrot2 is free of LGPL-licensed dependencies. Release notes: http://project.carrot2.org/release-3.2.0-notes.html Download: http://project.carrot2.org/download.html Lingo3G is a real-time document clustering engine from Carrot Search. Version 1.3.1 introduces support for clustering Arabic, Danish, Finnish, Hungarian, Korean, Romanian, Swedish and Turkish content, a command-line application and a number of minor improvements. Please contact us at i...@carrotsearch.com for details. Carrot Search Labs shares some small pieces of software we created when working on Carrot2 and Lingo3G. Please see http://labs.carrotsearch.com for details and downloads. Thanks! Dawid Weiss, Stanislaw Osinski Carrot Search, i...@carrot-search.com
Re: Clustering from anlayzed text instead of raw input
Hi Joan, I'm trying to use carrot2 (now I started with the workbench) and I can > cluster any field, but, the text used for clustering is the original raw > text, the one that was indexed, without any of the processing performed by > the tokenizer or filters. > So I get stop words. > The easiest way to fix this is to update the stop words list used by Carrot2; see http://wiki.apache.org/solr/ClusteringComponent, "Tuning Carrot2 clustering" section at the bottom. If you want to get readable cluster labels, it's best to feed the raw text for clustering (cluster labels are phrases taken from the input text; if you remove stopwords and stem everything, the phrases will become unreadable). Cheers, Staszek
Re: Clustering from anlayzed text instead of raw input
> I'll give a try to stopwords treatbment, but the problem is that we > perform > POS tagging and then use payloads to keep only Nouns and Adjectives, and we > thought that could be interesting to perform clustering only with these > elements, to avoid senseless words. > POS tagging could help a lot in clustering (not yet implemented in Carrot2 though), but ideally, we'd need to have POS tags attached to the original tokenized text (so each token would be a tuple along the lines of: raw_text + stemmed + POS). If we have just nouns and adjectives, cluster labels will be most likely harder to read (e.g. because of missing prepositions). I'm not too familiar with Solr internals, but I'm assuming this type of representation should be possible to implement using payloads? Then, we could refactor Carrot2 a bit to work either on raw text or on the tokenized/augmented representation. Cheers, S.
Re: Clustering Search taking 4sec for 100 results
Hi, It might be also interesting to add some logging of clustering time (just filed: https://issues.apache.org/jira/browse/SOLR-1809) to see what the index search vs clustering proportions are. Cheers, S. On Fri, Mar 5, 2010 at 03:26, Erick Erickson wrote: > Search time is only partially dependent on the > number of results returned. Far more important > is the number of docs in the index, the > complexity of the query, any sorting you do, etc. > > So your question isn't really very answerable, > you need to provide many more details. Things > like your index size, the machine you're operating > on etc. > > Are you firing off warmup queries? Also, using > debugQuery=on on your URL will provide > significant timing output, that would help us > diagnose your issues. > > HTH > Erick > > > > On Thu, Mar 4, 2010 at 9:02 PM, Allahbaksh Asadullah < > allahbaks...@gmail.com > > wrote: > > > Hi, > > I am using Solr for clustering. I am have set number of row as 100 and I > am > > using clustering handler. The problem is that I am getting the search > time > > for clustering search roughly 4sec. I have set -Xmx1024m. What is the > best > > way to reduce the time. > > Regards, > > allahbaksh > > >
[ANN] Carrot2 3.3.0 released
Dear All, We're pleased to announce the 3.3.0 release of Carrot2 which significantly improves the scalability of the clustering algorithms (up to 7x times faster clustering in case of the STC algorithm) and fixes a number of minor issues. Release notes: http://project.carrot2.org/release-3.3.0-notes.html Download: http://download.carrot2.org JIRA issues: http://issues.carrot2.org/secure/IssueNavigator.jspa?jqlQuery=project+%3D+CARROT+AND+fixVersion+%3D+%223.3.0%22+ORDER+BY+priority+DESC%2C+key+DESC Similar improvements are available in Lingo3G, the real-time document clustering engine from Carrot Search. Thanks! Dawid Weiss, Stanislaw Osinski Carrot Search, i...@carrot-search.com
Re: solr + carrot2
> > Has anyone looked into using carrot2 clustering with solr? > > I know this is integrated with nutch: > > http://lucene.apache.org/nutch/apidocs/org/apache/nutch/clustering/carrot2/Clusterer.html > > It looks like carrot has support to read results from a solr index: > > http://demo.carrot2.org/head/api/org/carrot2/input/solr/package-summary.html > > But I'm hoping for something that returns clustered results from solr. > > Carrot also has something to read lucene indexes: > > http://demo.carrot2.org/head/api/org/carrot2/input/lucene/package-summary.html > > Any pointers or experience before I (may) delve into this? > First of all, apologies for a delayed response. I'm one of Carrot2 developers and indeed we did some Solr integration, but from Carrot2's perspective, which I guess will not be directly useful in this case. If you have any ideas for integration, questions or requests for changes/patches, feel free to post on Carrot2 mailing list or file an issue for us. Thanks, Staszek
[release announcement] Carrot2 version 2.1 released
Hi All, A bit of self-promotion again :) I hope you don't find it out of topic, after all, some folks are using Carrot2 with Lucene and Solr, and Nutch has a Carrot2-based clustering plugin. Staszek [EMAIL PROTECTED] Carrot2 Search Results Clustering Engine version 2.1 released Version 2.1 of the Java-based Open Source Search Results Clustering Engine called Carrot2 has been released. Carrot2 can fetch search results from a variety of sources and automatically organize (cluster) them into thematic categories using one of its specialized search results clustering algorithms. The 2.1 release comes with the Document Clustering Server that exposes Carrot2 clustering as an XML-RPC or REST service with convenient XML or JSON data formats enabling e.g. quick PHP, .NET or Ruby integration. The new release also adds new search results sources and many other improvements ( http://project.carrot2.org/release-2.1-notes.html). At the same time Carrot Search, the Carrot2 spin-off company, released version 1.2 of Lingo3G -- a high-performance document clustering engine offering hierarchical clustering, synonyms, label filtering and advanced tuning capabilities. For more information, please check Carrot2 live demo -- http://www.carrot2.org Carrot2 project website -- http://project.carrot2.org Release 2.1 notes -- http://project.carrot2.org/release-2.1-notes.html Carrot Search -- http://www.carrot-search.com
Re: solr + carrot2
Hi All, I've just filed an issue for us related to this: http://issues.carrot2.org/browse/CARROT-106 I'll try to find some spare cycles to look into it, hopefully in some not too distant future. Meanwhile, feel free to post your thoughts and concerns on this either here or on our JIRA. Thanks, Stanislaw -- Stanislaw Osinski, [EMAIL PROTECTED] http://www.carrot-search.com On 17/08/07, Pieter Berkel <[EMAIL PROTECTED]> wrote: > > Any updates on this? It certainly would be quite interesting to see how > well carrot2 clustering can be integrated with solr, I suppose it's a > fairly > similar concept to simple faceting (maybe another candidate for SOLR-281 > component?). > > One concern I have is that the additional processing required at query > time > would make the whole operation significant slower (which is something I'd > like to avoid). I've been wondering if it might be possible to calculate > (and store) clustering information at index time > however since carrot2 seems to use the query term & result set to create > clustering info this doesn't appear to be a practical approach. > > In a similar vein, I'm also looking at methods of term extraction and > automatic keyword generation from indexed documents. I've been > experimenting with MoreLikeThis and values returned by the " > mlt.interestingTerms" parameter, which has potential but needs a bit of > refinement before it can be truely useful. Has anybody else discovered > clever or useful methods of term extraction using solr? > > Piete > > > > On 02/08/07, Burkamp, Christian <[EMAIL PROTECTED]> wrote: > > > > Hi, > > > > In my opinion the results from carrot2 clustering could be used in the > > same way that facet results are used. > > That's the way I'm planning to use them. > > The user of the search application can narrow the search by selecting > one > > of the facets presented in the search result presentation. These facets > > could come from metadata (classic facets) or from dynamically computed > > categories which are results from carrot2. > > > > From this point of view it would be most convenient to have the > > integration for carrot2 directly in the StandardRequestHandler. This > leaves > > questions open like "how should filters for categories from carrot2 be > > formulated". > > > > Is anybody already using carrot2 with solr? > > > > -- Christian > > > > -Ursprüngliche Nachricht- > > Von: [EMAIL PROTECTED] [mailto: [EMAIL PROTECTED] Im Auftrag von > > Stanislaw Osinski > > Gesendet: Mittwoch, 1. August 2007 14:01 > > An: solr-user@lucene.apache.org > > Betreff: Re: solr + carrot2 > > > > > > > > > > Has anyone looked into using carrot2 clustering with solr? > > > > > > I know this is integrated with nutch: > > > > > > http://lucene.apache.org/nutch/apidocs/org/apache/nutch/clustering/car > > > rot2/Clusterer.html > > > > > > It looks like carrot has support to read results from a solr index: > > > > > > http://demo.carrot2.org/head/api/org/carrot2/input/solr/package-summar > > > y.html > > > > > > But I'm hoping for something that returns clustered results from solr. > > > > > > Carrot also has something to read lucene indexes: > > > > > > http://demo.carrot2.org/head/api/org/carrot2/input/lucene/package-summ > > > ary.html > > > > > > Any pointers or experience before I (may) delve into this? > > > > > > > First of all, apologies for a delayed response. I'm one of Carrot2 > > developers and indeed we did some Solr integration, but from Carrot2's > > perspective, which I guess will not be directly useful in this case. 
If > you > > have any ideas for integration, questions or requests for > changes/patches, > > feel free to post on Carrot2 mailing list or file an issue for us. > > > > Thanks, > > > > Staszek > > >
Re: solr + carrot2
Hi Lance, The Lucene interface is cool, but not many people put their indexes on > machines with Swing access. > > I just did a Solr integration by copying the eTools.ch implementation. > This > took several edits. As long as we're making requests, please do a > general-pupose implementation by cloning the Lucene implementation. I'm not sure if I'm getting you right here... By "implementation" do you mean adding to the Swing application an option for pulling data from Solr (with a configuration dialog for Solr URL etc.)? Thanks, Stanislaw
Re: solr + carrot2
Hi Lance and all, I've just implemented a configuration UI for Solr, similar to the one we have for Lucene. The new UI is available in the HEAD version of the browser: http://demo.carrot2.org/head/dist/carrot2-demo-browser-head.zip or through WebStart: http://demo.carrot2.org/head/webstart/ Please let us know if the new UI works for you. Thanks, Staszek On 20/08/07, Lance Norskog <[EMAIL PROTECTED]> wrote: > > Exactly! The Lucene version requires direct access to the file. Our > indexes > are on servers which do not have graphics (VNC) configured. > > A generic Solr access UI would be great. > > Lance > > -Original Message- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Stanislaw > Osinski > Sent: Saturday, August 18, 2007 2:23 AM > To: solr-user@lucene.apache.org > Subject: Re: solr + carrot2 > > Hi Lance, > > The Lucene interface is cool, but not many people put their indexes on > > machines with Swing access. > > > > I just did a Solr integration by copying the eTools.ch implementation. > > This > > took several edits. As long as we're making requests, please do a > > general-pupose implementation by cloning the Lucene implementation. > > > I'm not sure if I'm getting you right here... By "implementation" do you > mean adding to the Swing application an option for pulling data from Solr > (with a configuration dialog for Solr URL etc.)? > > Thanks, > > Stanislaw > >