RE: problems with arabic search
I'm developing a java application using solr, this application is working with English search Yes, I have tried querying solr directly for Arabic and it's working Any suggestions ?? -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 10, 2007 5:50 AM To: solr-user@lucene.apache.org Subject: Re: problems with arabic search FYI: you don't need to resend your question just because you didn't get a reply within a day, either people haven't had a chance to reply, or they don't know the answer. : XML Parsing Error: mismatched tag. Expected: . : : Location: http://localhost:8080/solrServlet/searchServlet?query=%D9%85%D8%AD%D9%85 %D8%AF&cmdSearch=Search%21 this doesn't look like a query error .. and that doesn't look like a solr URL, this looks something you have in front of Solr. : HTTP Status 400 - Query parsing error: Cannot parse : '': '*' or '?' not allowed as first character in that looks like a Solr error. i'm guessing that your app isn't dealing with the UTF8 correctly, something is substituting "?" characters in place of any character it doesn't understand - and Solr thinks you are trying to do a wildcard query. have you tried querying solr directly (in your browser or using curl) for your arabic word? -Hoss
Problems with mySolr Wiki
Hi Solr-Users, I'm trying to follow the instructions [1] from the Solr wiki to build my custom Solr server. First I created the directory structure:

mySolr
--solr
--conf
--schema.xml
--solrconfig.xml
--solr.xml <-- Where can I find this file?
--build.xml <-- copy & paste from the wiki

Then I ran the command:

$ ant mysolr.dist
Buildfile: build.xml

BUILD FAILED
/root/buildouts/mySolr/build.xml:10: Cannot find ${env.SOLR_HOME}/build.xml imported from /root/buildouts/mySolr/build.xml

Total time: 0 seconds

Maybe someone has a tip for me? Thanks for your help. Christian [1] http://wiki.apache.org/solr/mySolr
unlockOnStartup does not work in embedded solr?
Hi *, I use Solr as an embedded solution. I have set unlockOnStartup to "true" in my solrconfig.xml, but it seems that this option is ignored by embedded Solr. Any ideas? Thanks in advance, Alexey
Manage multiple indexes with Solr
Hi guys ! Is it possible to configure Solr to manage different indexes depending on the added documents ? For example: * document "1", with uniq ID "ui1" will be indexed in the "indexA" * document "2", with uniq ID "ui2" will be indexed in the "indexB" * document "3", with uniq ID "ui1" will be indexed in the "indexA" Thus documents "1" and "3" are stored in index "indexA" and document "2" in index "indexB". In this case "indexA" and "indexB" are completely separate indexes on disk. Thanks in advance cheers Y.
Re: Manage multiple indexes with Solr
Sorry, there's a mistake in my previous example. Please read this: * document "1", with uniq ID "ui1" will be indexed in the "indexA" * document "2", with uniq ID "ui2" will be indexed in the "indexB" * document "3", with uniq ID "ui3" will be indexed in the "indexA" Thanks cheers Y. Message d'origine >De: [EMAIL PROTECTED] >A: solr-user@lucene.apache.org >Sujet: Manage multiple indexes with Solr >Date: Wed, 10 Oct 2007 11:18:02 +0200 > >Hi guys ! > >Is it possible to configure Solr to manage different indexes depending on the >added documents ? > >For example: >* document "1", with uniq ID "ui1" will be indexed in the "indexA" >* document "2", with uniq ID "ui2" will be indexed in the "indexB" >* document "3", with uniq ID "ui1" will be indexed in the "indexA" > >Thus documents "1" and "3" are stored in index "indexA" and document "2" in >index "indexB". >In this case "indexA" and "indexB" are completely separate indexes on disk. > >Thanks in advance > >cheers >Y. >
Re: Manage multiple indexes with Solr
i would be interested to know in both the cases : Case 1 : * document "1", with uniq ID "ui1" will be indexed in the "indexA" * document "2", with uniq ID "ui2" will be indexed in the "indexB" * document "3", with uniq ID "ui3" will be indexed in the "indexA" Case 2 : * document "1", with uniq ID "ui1" will be indexed in the "indexA" * document "2", with uniq ID "ui2" will be indexed in the "indexB" * document "3", with uniq ID "ui1" will be indexed in the "indexA" -vEnKAt
Different search results for (german) singular/plural searches - looking for a solution
Hello, with our application we have the issue that we get different results for singular and plural searches (German language). E.g. for "hose" we get 1,000 documents back, but for "hosen" we get 10,000 docs. The same applies to "t-shirt" vs. "t-shirts", or e.g. "hut" and "hüte" - lots of cases :) This is absolutely correct according to the schema.xml, as right now we do not have any stemming or synonyms included. Now we want to have similar search results for these singular/plural searches. I'm thinking about a solution for this and want to ask what your experiences with this are. Basically I see two options: stemming and the usage of synonyms. Are there others? My concern with stemming is that it might produce unexpected results, so that docs are found that do not match the query from the user's point of view. I assume that this needs a lot of testing with different data. The issue with synonyms is that we would have to create a file containing all synonyms, so we would have to figure out all cases, in contrast to a solution that is based on an algorithm. The advantage of this approach is IMHO that it is very predictable which results will be returned for a certain query. Some background information: Our documents contain products (id, name, brand, category, producttype, description, color etc). The singular/plural issue basically applies to the fields name, category and producttype, so we would like to restrict the solution to these fields. Do you have suggestions on how to handle this? Thanx in advance for sharing your experiences, cheers, Martin
Re: Different search results for (german) singular/plural searches - looking for a solution
in short: use stemming Try the SnowballPorterFilterFactory with German2 as language attribute first and use synonyms for combined words i.e. "Herrenhose" => "Herren", "Hose". By using stemming you will maybe have some "interesting" results, but it is much better living with them than having no or much less results ;o) Find more infos on the Snowball stemming algorithms here: http://snowball.tartarus.org/ Also have a look at the StopFilterFactory, here is a sample stopwordlist for the german language: http://snowball.tartarus.org/algorithms/german/stop.txt Good luck, Tom Martin Grotzke schrieb: Hello, with our application we have the issue, that we get different results for singular and plural searches (german language). E.g. for "hose" we get 1.000 documents back, but for "hosen" we get 10.000 docs. The same applies to "t-shirt" or "t-shirts", of e.g. "hut" and "hüte" - lots of cases :) This is absolutely correct according to the schema.xml, as right now we do not have any stemming or synonyms included. Now we want to have similar search results for these singular/plural searches. I'm thinking of a solution for this, and want to ask, what are your experiences with this. Basically I see two options: stemming and the usage of synonyms. Are there others? My concern with stemming is, that it might produce unexpected results, so that docs are found that do not match the query from the users point of view. I asume that this needs a lot of testing with different data. The issue with synonyms is, that we would have to create a file containing all synonyms, so we would have to figure out all cases, in contrast to a solutions that is based on an algorithm. The advantage of this approach is IMHO, that it is very predictable which results will be returned for a certain query. Some background information: Our documents contain products (id, name, brand, category, producttype, description, color etc). The singular/plural issue basically applied to the fields name, category and producttype, so we would like to restrict the solution to these fields. Do you have suggestions how to handle this? Thanx in advance for sharing your experiences, cheers, Martin
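For a quick feel of what the German2 stemmer actually does to singular/plural pairs, the Snowball classes bundled with Lucene/Solr can be called directly. This is only a rough sketch (it assumes the org.tartarus.snowball classes are on the classpath, and whether German2 really folds your own pairs to a common stem is worth verifying against your data):

import org.tartarus.snowball.SnowballProgram;
import org.tartarus.snowball.ext.German2Stemmer;

public class StemDemo {
    public static void main(String[] args) {
        SnowballProgram stemmer = new German2Stemmer();
        // print the stem of each word so singular/plural pairs can be compared
        for (String word : new String[] { "hose", "hosen", "hut", "hüte" }) {
            stemmer.setCurrent(word.toLowerCase());
            stemmer.stem();
            System.out.println(word + " -> " + stemmer.getCurrent());
        }
    }
}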
showing results per facet-value efficiently
First of all, I just wanted to say that I just started working with Solr and really like the results I'm getting from Solr (in terms of performance, flexibility) as well as the good responses I'm getting from this group. Hopefully I will be able to contribute in one way or another to this wonderful application in the future! The current issue that I'm having is the following (I tried not to be long-winded, but somehow that didn't work out :-) ): I'm extending StandardRequestHandler to not only show the counts per facet-value but also the top-N results per facet-value (where N is configurable). (See http://www.nabble.com/Result-grouping-options-tf4522284.html#a12900630 for where I got the idea from). I quickly implemented this by fetching a doclist for each of my facet-values and appending these to the result as suggested in the referred post, no problems there. However, I realized that for calculating the count for each of the facet-values, the original StandardRequestHandler already loops the doclist to check for matches. Therefore my implementation actually does double work, since it gets doclists for each of the facet-values again. My question: is there a way to get to the already calculated doclist per facet-value from a subclassed StandardRequestHandler, and so get a nice speedup? This facet calculation seems to go deep into the core of Solr (SimpleFacets.getFacetTermEnumCounts) and seems not very sensible to alter for just this requirement. Opinions appreciated. Some additional info: I have a requirement to be able to limit the result to explicitly specified facet-values. For that I do something like:

select?qt=toplist
  &q=name:A OR name:B OR name:C
  &sort=sortfield asc
  &facet=true
  &facet.field=name
  &facet.limit=1
  &rows=2

This all works okay and results in a faceting/grouping by field 'name', where for each facet-value (A, B, C) 2 results are shown (ordered by sortfield). The relevant code from the subclassed StandardRequestHandler is below. As can be seen, I alter the query by adding the facet-value to FQ (which is almost guaranteed to already exist in FQ btw.). Therefore a second question is: will there be a noticeable speedup when pursuing the above, since the request that is done per facet-value is nothing more than giving the ordered result of the intersection of the overall query (which is in the querycache) and the facet-value itself (which is almost certainly in the filtercache). As a last and somewhat related question: is there a way to explicitly specify facet-values that I want to include in the faceting without (ab)using Q? This is relevant for me since the perfect solution would be to have the ability to orthogonally get multiple toplists in 1 query. Given the current implementation, this orthogonality is now 'corrupted' as injection of a fieldvalue in Q for one facetfield influences the outcome of another facetfield. kind regards, Geert-Jan

---
if(true) { //TODO: this needs facetinfo as a precondition.
    NamedList facetFieldList = ((NamedList)facetInfo.get("facet_fields"));
    for(int i = 0; i < facetFieldList.size(); i++) {
        NamedList facetValList = (NamedList)facetFieldList.getVal(i);
        for(int j = 0; j < facetValList.size(); j++) {
            NamedList facetValue = new SimpleOrderedMap();
            // facetValue.add("count", valList.getVal(j));
            DocListAndSet resultList = new DocListAndSet();
            Query facetq = QueryParsing.parseQuery(
                facetFieldList.getName(i) + ":" + facetValList.getName(j),
                req.getSchema());
            resultList.docList = s.getDocList(query, facetq, sort,
                p.getInt(CommonParams.START, 0),
                p.getInt(CommonParams.ROWS, 3));
            facetValue.add("results", resultList.docList);
            facetValList.setVal(j, facetValue);
        }
    }
    rsp.add("facet_results", facetFieldList);
}

-- View this message in context: http://www.nabble.com/showing-results-per-facet-value-efficiently-tf4600154.html#a13133815 Sent from the Solr - User mailing list archive at Nabble.com.
RE: Solr and KStem
Hi Piete, Good idea. Thanks. One other change that should probably be made is to change the package statement from org.oclc.solr.analysis to org.apache.solr.analysis. Thanks again. Cheers! harry -Original Message- From: Pieter Berkel [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 09, 2007 9:10 PM To: solr-user@lucene.apache.org Subject: Re: Solr and KStem Hi Harry, I re-discovered this thread last week and have made some minor changes to the code (remove deprication warnings) so that it compiles with trunk. I think it would be quite useful to get this stemmer into Solr once all the legal / licensing issues are resolved. If there are no objections, I'll open a JIRA ticket and upload my changes so we can make sure we're all working with the same code. cheers, Piete On 11/09/2007, Wagner,Harry <[EMAIL PROTECTED]> wrote: > > Bill, > Currently it is a plug-in. Put the lower case filter ahead of kstem, > just as for porter (example below). You can use it with porter, but I > can't imagine why you would want to. At least not in the same analyzer. > Hope this helps. > > > > > words="stopwords.txt"/> > generateWordParts="1" generateNumberParts="1" catenateWords="1" > catenateNumbers="1" catenateAll="0"/> > > cacheSize="2"/> > > > > > synonyms="synonyms.txt" ignoreCase="true" expand="true"/> > words="stopwords.txt"/> > generateWordParts="1" generateNumberParts="1" catenateWords="0" > catenateNumbers="0" catenateAll="0"/> > > cacheSize="2"/> > > > > > Cheers... harry > >
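For reference, the plug-in being discussed is essentially a small factory class on Solr's side; a minimal sketch of what it might look like after the suggested package rename is below. KStemFilter here is assumed to be the filter class from the OCLC contribution mentioned in this thread, not a stock Lucene class, so this will not compile without that code:

package org.apache.solr.analysis;

import org.apache.lucene.analysis.TokenStream;

// Hypothetical factory wrapping the (assumed) KStemFilter from the OCLC contribution.
public class KStemFilterFactory extends BaseTokenFilterFactory {
    public TokenStream create(TokenStream input) {
        return new KStemFilter(input); // constructor signature assumed
    }
}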
Re: problems with arabic search
Can you give more detail about what you have done? What character encoding do you have your browser set to? In Firefox, do View -> Character Encoding to see what it is set to when you are on the input page? Internet Explorer and other browsers have other options. Are you sending the query directly to Solr or is it going through some other servlet? If you are doing this, and _IF_ I recall correctly, I believe you need to tell your servlet the input is UTF-8 before doing anything else with the request. See http://kickjava.com/src/filters/ SetCharacterEncodingFilter.java.htm for a Servlet Filter that does this (it's even Apache licensed!) You will need to hook it up in your web.xml. On Oct 10, 2007, at 2:59 AM, Heba Farouk wrote: I'm developing a java application using solr, this application is working with English search Yes, I have tried querying solr directly for Arabic and it's working Any suggestions ?? -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 10, 2007 5:50 AM To: solr-user@lucene.apache.org Subject: Re: problems with arabic search FYI: you don't need to resend your question just because you didn't get a reply within a day, either people haven't had a chance to reply, or they don't know the answer. : XML Parsing Error: mismatched tag. Expected: . : : Location: http://localhost:8080/solrServlet/searchServlet?query=%D9%85%D8%AD% D9%85 %D8%AF&cmdSearch=Search%21 this doesn't look like a query error .. and that doesn't look like a solr URL, this looks something you have in front of Solr. : HTTP Status 400 - Query parsing error: Cannot parse : '': '*' or '?' not allowed as first character in that looks like a Solr error. i'm guessing that your app isn't dealing with the UTF8 correctly, something is substituting "?" characters in place of any character it doesn't understand - and Solr thinks you are trying to do a wildcard query. have you tried querying solr directly (in your browser or using curl) for your arabic word? -Hoss -- Grant Ingersoll http://lucene.grantingersoll.com Lucene Boot Camp Training: ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http:// www.apachecon.com
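The essential part of such a filter is a single call made before any request parameter is read; a minimal sketch (leaving out the init-param handling that the linked SetCharacterEncodingFilter adds) might look like this:

import java.io.IOException;
import javax.servlet.*;

public class ForceUtf8Filter implements Filter {
    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Must run before request.getParameter() is called anywhere, otherwise the
        // container has already decoded the parameters with its default charset.
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding("UTF-8");
        }
        chain.doFilter(request, response);
    }

    public void destroy() {}
}

It still needs a filter / filter-mapping entry in web.xml, as noted above.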
RE: problems with arabic search
In firefox, character encoding is set to UTF-8 Yes, I'm sending the query directly to solr using apache httpclient and I set the http request header content type to : Content-Type="text/html; charset=UTF-8" Any suggestions Thanks in advance -Original Message- From: Grant Ingersoll [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 10, 2007 2:43 PM To: solr-user@lucene.apache.org Subject: Re: problems with arabic search Can you give more detail about what you have done? What character encoding do you have your browser set to? In Firefox, do View -> Character Encoding to see what it is set to when you are on the input page? Internet Explorer and other browsers have other options. Are you sending the query directly to Solr or is it going through some other servlet? If you are doing this, and _IF_ I recall correctly, I believe you need to tell your servlet the input is UTF-8 before doing anything else with the request. See http://kickjava.com/src/filters/ SetCharacterEncodingFilter.java.htm for a Servlet Filter that does this (it's even Apache licensed!) You will need to hook it up in your web.xml. On Oct 10, 2007, at 2:59 AM, Heba Farouk wrote: > I'm developing a java application using solr, this application is > working with English search > > Yes, I have tried querying solr directly for Arabic and it's working > > Any suggestions ?? > > -Original Message- > From: Chris Hostetter [mailto:[EMAIL PROTECTED] > Sent: Wednesday, October 10, 2007 5:50 AM > To: solr-user@lucene.apache.org > Subject: Re: problems with arabic search > > > FYI: you don't need to resend your question just because you didn't > get > a > reply within a day, either people haven't had a chance to reply, or > they > > don't know the answer. > > : XML Parsing Error: mismatched tag. Expected: . > : > : Location: > http://localhost:8080/solrServlet/searchServlet?query=%D9%85%D8%AD% > D9%85 > %D8%AF&cmdSearch=Search%21 > > this doesn't look like a query error .. and that doesn't look like a > solr > URL, this looks something you have in front of Solr. > > : HTTP Status 400 - Query parsing error: Cannot parse > : '': '*' or '?' not allowed as first character in > > that looks like a Solr error. i'm guessing that your app isn't > dealing > with the UTF8 correctly, something is substituting "?" characters in > place > of any character it doesn't understand - and Solr thinks you are > trying > to > do a wildcard query. > > have you tried querying solr directly (in your browser or using curl) > for > your arabic word? > > > -Hoss > -- Grant Ingersoll http://lucene.grantingersoll.com Lucene Boot Camp Training: ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http:// www.apachecon.com
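One thing worth checking on the client side is how the query term itself is put on the URL; if HttpClient (or anything between it and Solr) re-encodes the term with a non-UTF-8 charset, the '?' substitution described in Hoss's quoted reply is exactly what you get. A rough sketch with Commons HttpClient 3.x, where the URL, port and field name are placeholders only:

import java.net.URLEncoder;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class ArabicQueryTest {
    public static void main(String[] args) throws Exception {
        String word = "\u0645\u062D\u0645\u062F"; // the Arabic term from the failing URL
        // Percent-encode the term as UTF-8 ourselves, so nothing downstream gets a
        // chance to substitute '?' for characters it cannot represent.
        String url = "http://localhost:8983/solr/select?q=text:"
                + URLEncoder.encode(word, "UTF-8");
        HttpClient client = new HttpClient();
        GetMethod get = new GetMethod(url);
        try {
            System.out.println(client.executeMethod(get));
            System.out.println(get.getResponseBodyAsString());
        } finally {
            get.releaseConnection();
        }
    }
}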
Re: problems with arabic search
Hmmm, by the looks of your query, it doesn't seem like it is a Solr query, but I admit I don't have all the parameters memorized. What request handler, etc. are you using? Have you tried debugging? And you say you have tried a query with the Solr Admin query page, right? And that works? So what is the difference between that page (form.jsp in the Solr source) and your page? Please give more details about your application. -Grant On Oct 10, 2007, at 8:49 AM, Heba Farouk wrote: In firefox, character encoding is set to UTF-8 Yes, I'm sending the query directly to solr using apache httpclient and I set the http request header content type to : Content-Type="text/ html; charset=UTF-8" Any suggestions Thanks in advance -Original Message- From: Grant Ingersoll [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 10, 2007 2:43 PM To: solr-user@lucene.apache.org Subject: Re: problems with arabic search Can you give more detail about what you have done? What character encoding do you have your browser set to? In Firefox, do View -> Character Encoding to see what it is set to when you are on the input page? Internet Explorer and other browsers have other options. Are you sending the query directly to Solr or is it going through some other servlet? If you are doing this, and _IF_ I recall correctly, I believe you need to tell your servlet the input is UTF-8 before doing anything else with the request. See http://kickjava.com/src/filters/ SetCharacterEncodingFilter.java.htm for a Servlet Filter that does this (it's even Apache licensed!) You will need to hook it up in your web.xml. On Oct 10, 2007, at 2:59 AM, Heba Farouk wrote: I'm developing a java application using solr, this application is working with English search Yes, I have tried querying solr directly for Arabic and it's working Any suggestions ?? -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 10, 2007 5:50 AM To: solr-user@lucene.apache.org Subject: Re: problems with arabic search FYI: you don't need to resend your question just because you didn't get a reply within a day, either people haven't had a chance to reply, or they don't know the answer. : XML Parsing Error: mismatched tag. Expected: . : : Location: http://localhost:8080/solrServlet/searchServlet?query=%D9%85%D8%AD% D9%85 %D8%AF&cmdSearch=Search%21 this doesn't look like a query error .. and that doesn't look like a solr URL, this looks something you have in front of Solr. : HTTP Status 400 - Query parsing error: Cannot parse : '': '*' or '?' not allowed as first character in that looks like a Solr error. i'm guessing that your app isn't dealing with the UTF8 correctly, something is substituting "?" characters in place of any character it doesn't understand - and Solr thinks you are trying to do a wildcard query. have you tried querying solr directly (in your browser or using curl) for your arabic word? -Hoss -- Grant Ingersoll http://lucene.grantingersoll.com Lucene Boot Camp Training: ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http:// www.apachecon.com -- Grant Ingersoll http://www.grantingersoll.com/ http://lucene.grantingersoll.com http://www.paperoftheweek.com/
Re: Availability Issues
Hi, - Original Message From: David Whalen <[EMAIL PROTECTED]> On that note -- I've read that Jetty isn't the best servlet container to use in these situations, is that your experience? OG: In which situations? Jetty is great, actually! (the pretty high traffic site in my sig runs Jetty) Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share > -Original Message- > From: Chris Hostetter [mailto:[EMAIL PROTECTED] > Sent: Monday, October 08, 2007 11:20 PM > To: solr-user > Subject: RE: Availability Issues > > > : My logs don't look anything like that. They look like HTTP > : requests. Am I looking in the wrong place? > > what servlet container are you using? > > every servlet container handles applications logs differently > -- it's especially tricky becuse even the format can be > changed, the examples i gave before are in the default format > you get if you use the jetty setup in the solr example (which > logs to stdout), but many servlet containers won't include > that much detail by default (they typically leave out the > classname and method name). there's also typically a setting > that controls the verbosity -- so in some configurations only > the SEVERE messages are logged and in others the INFO > messages are logged ... you're going to want at least the > INFO level to debug stuff. > > grep all the log files you can find for "Solr home set to" > ... that's one of the first messages Solr logs. if you can > find that, you'll find the other messages i was talking about. > > > -Hoss > > >
getting number of stored documents via rest api
Hi for some tests I need to know how many documents are stored in the index - is there a fast & easy way to retrieve this number (instead of searching for "*:*" and counting the results)? I already took a look at the stats.jsp code - but there the number of documents is retrieved via an api call to SolrInfoRegistry and not the webservice. thanks - stefan
Re: getting number of stored documents via rest api
I think search for "*:*" is the optimal code to do it. I don't think you can do anything faster. On 10/11/07, Stefan Rinner <[EMAIL PROTECTED]> wrote: > > Hi > > for some tests I need to know how many documents are stored in the > index - is there a fast & easy way to retrieve this number (instead > of searching for "*:*" and counting the results)? > I already took a look at the stats.jsp code - but there the number of > documents is retrieved via an api call to SolrInfoRegistry and not > the webservice. > > thanks > > - stefan > -- Regards, Cuong Hoang
Re: Problems with mySolr Wiki
i'm not very familiar with that wiki, but note the line in the example ant script... ... : --solr.xml <-- Where i can find this file? according to the wiki page... > First we will setup a basic directory structure (assuming we only want to > change some fields) and copy the attached build.xml and solr.xml: ...i assume it is referring to what you named build.xml ... if you name it solr.xml i think you would need to run "ant -f solr.xml mysolr.dist" ... but like i said, i'm not too familiar with that wiki, so i could be wrong. -Hoss
Re: getting number of stored documents via rest api
: there a fast & easy way to retrieve this number (instead of searching for : "*:*" and counting the results)? NOTE: you don't have to count the results to know the total number of docs matching any query ... just use the numFound attribute of the block. : I already took a look at the stats.jsp code - but there the number of : documents is retrieved via an api call to SolrInfoRegistry and not the : webservice. stats.jsp returns welformed xml (not HTML) so why not just hit that to extract the numDocs ? -Hoss
Re: getting number of stored documents via rest api
: I think search for "*:*" is the optimal code to do it. I don't think you can : do anything faster. FYI: getting the data from the xml returned by stats.jsp is definitely faster in the case where you really want all docs. if you want the total number from some other query however, don't "count" them yourself in the client ... use numFound. -Hoss
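As a sketch of the two approaches compared in this thread (reading numFound from a rows=0 query versus pulling numDocs out of stats.jsp), assuming a stock example server on localhost; the XML handling here is deliberately crude:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DocCount {
    static String fetch(String url) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (String line; (line = in.readLine()) != null; ) sb.append(line).append('\n');
        in.close();
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Option 1: match everything but ask for no rows; the total is in numFound.
        String xml = fetch("http://localhost:8983/solr/select?q=*:*&rows=0");
        Matcher m = Pattern.compile("numFound=\"(\\d+)\"").matcher(xml);
        if (m.find()) System.out.println("numFound = " + m.group(1));

        // Option 2: stats.jsp is well-formed XML, so numDocs can be parsed out of it too.
        String stats = fetch("http://localhost:8983/solr/admin/stats.jsp");
        System.out.println("stats.jsp is " + stats.length() + " chars; grep it for numDocs");
    }
}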
WebException (ServerProtocolViolation) with SolrSharp
Hello, I am trying to run SolrSharp's example application but am getting a WebException with a ServerProtocolViolation status message. After some debugging I found out this is happening with a call to: http://localhost:8080/solr/update/ And using fiddler[1] found out that solr is actually throwing the following exception: org.apache.solr.core.SolrException: Error while creating field 'weight{type=sfloat,properties=indexed,stored,omitNorms,sortMissingLast}' from value '1,234' at org.apache.solr.schema.FieldType.createField(FieldType.java:173) at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94) at org.apache.solr.update.DocumentBuilder.addSingleField(DocumentBuilder.java:57) at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:73) at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:83) at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:77) at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:339) at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162) at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77) at org.apache.solr.core.SolrCore.execute(SolrCore.java:658) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:584) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) at java.lang.Thread.run(Unknown Source) Caused by: java.lang.NumberFormatException: For input string: "1,234" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Float.parseFloat(Unknown Source) at org.apache.solr.util.NumberUtils.float2sortableStr(NumberUtils.java:80) at org.apache.solr.schema.SortableFloatField.toInternal(SortableFloatField.java:50) at org.apache.solr.schema.FieldType.createField(FieldType.java:171) ... 24 more type Status report message Error while creating field 'weight{type=sfloat,properties=indexed,stored,omitNorms,sortMissingLast}' from value '1,234' I am just starting to try Solr, and might be missing some configurations, but I have no clue where to begin to investigate this further without digging into Solr's source, which I would really like to avoid for now. Any thoughts? thank you in advance, Filipe Correia [1] http://www.fiddlertool.com/
RE: Facets and running out of Heap Space
It looks now like I can't use facets the way I was hoping to because the memory requirements are impractical. So, as an alternative I was thinking I could get counts by doing rows=0 and using filter queries. Is there a reason to think that this might perform better? Or, am I simply moving the problem to another step in the process? DW > -Original Message- > From: Stu Hood [mailto:[EMAIL PROTECTED] > Sent: Tuesday, October 09, 2007 10:53 PM > To: solr-user@lucene.apache.org > Subject: Re: Facets and running out of Heap Space > > > Using the filter cache method on the things like media type and > > location; this will occupy ~2.3MB of memory _per unique value_ > > Mike, how did you calculate that value? I'm trying to tune my > caches, and any equations that could be used to determine > some balanced settings would be extremely helpful. I'm in a > memory limited environment, so I can't afford to throw a ton > of cache at the problem. > > (I don't want to thread-jack, but I'm also wondering whether > anyone has any notes on how to tune cache sizes for the > filterCache, queryResultCache and documentCache). > > Thanks, > Stu > > > -Original Message- > From: Mike Klaas <[EMAIL PROTECTED]> > Sent: Tuesday, October 9, 2007 9:30pm > To: solr-user@lucene.apache.org > Subject: Re: Facets and running out of Heap Space > > On 9-Oct-07, at 12:36 PM, David Whalen wrote: > > >(snip) > > I'm sure we could stop storing many of these columns, > especially if > >someone told me that would make a big difference. > > I don't think that it would make a difference in memory > consumption, but storage is certainly not necessary for > faceting. Extra stored fields can slow down search if they > are large (in terms of bytes), but don't really occupy extra > memory, unless they are polluting the doc cache. Does 'text' > need to be stored? > > > >> what does the LukeReqeust Handler tell you about the # of distinct > >> terms in each field that you facet on? > > > > Where would I find that? I could probably estimate that > myself on a > > per-column basis. it ranges from 4 distinct values for > media_type to > > 30-ish for location to 200-ish for country_code to almost > 10,000 for > > site_id to almost 100,000 for journalist_id. > > Using the filter cache method on the things like media type > and location; this will occupy ~2.3MB of memory _per unique > value_, so it should be a net win for those (although quite > close in space requirements for a 30-ary field on your index size). > > -Mike > >
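For what it's worth, the rows=0 plus filter-query approach being described would look roughly like the sketch below, one request per value; field and value names are invented for illustration, and it assumes the solrj client mentioned elsewhere on this list:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class FilterCounts {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // One request per value: each fq should land in the filterCache,
        // and rows=0 avoids fetching any stored fields.
        for (String mt : new String[] { "blog", "news", "video", "forum" }) {
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("media_type:" + mt);
            q.setRows(0);
            long count = server.query(q).getResults().getNumFound();
            System.out.println(mt + ": " + count);
        }
    }
}

This still means one round trip per unique value, which is why the reply later in this thread suggests it is unlikely to beat a tuned facet request.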
start tag not allowed in epilog
Hi, I've got an xml update document that I'm sending to solr's update handler with deletes and adds in it. For example: 12345678 And I'm getting the following exception in the catalina.out log: Oct 10, 2007 12:58:22 PM org.apache.solr.common.SolrException log SEVERE: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,4003] Message: start tag not allowed in epilog but got a at com.bea.xml.stream.MXParser.parseEpilog(MXParser.java:2112) at com.bea.xml.stream.MXParser.nextImpl(MXParser.java:1945) at com.bea.xml.stream.MXParser.next(MXParser.java:1333) at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:148) at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:78) at org.apache.solr.core.SolrCore.execute(SolrCore.java:807) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:206) Is this not allowed? I looked at the code in XmlUpdateRequestHandler.java and it's dying on this line (148): int event = parser.next(); Does anyone know how to correct this? Is it not possible to have multiple different top-level tags in the same update xml file? It seems to me like it should work, but perhaps there's something inherently bad about this from the XMLStreamReader's point of view. Thanks, Brendan -- View this message in context: http://www.nabble.com/start-tag-not-allowed-in-epilog-tf4602869.html#a13142631 Sent from the Solr - User mailing list archive at Nabble.com.
Re: start tag not allowed in epilog
: Does anyone know how to correct this? Is it not possible to have multiple : different top-level tags in the same update xml file? It seems to me like it : should work, but perhaps there's something inherently bad about this from : the XMLStreamReader's point of view. it's inherently bad from the XML spec's point of view -- as in the XML spec says you can only have one "top level" tag per "XML document". incidentally: what is your use case for even trying this? you know that with a uniqueKey declaration you don't need to delete a doc before adding a new one with the same uniqueKey right? ... you can just add them all. we may/should allow you to specify multiple <id> or <query> blocks inside of a delete, but i don't imagine anyone is planning on adding syntax to support <add> and <delete> ops in the same update command ... they are radically different. -Hoss
Re: start tag not allowed in epilog
We simply process a queue of updates from a database table. Some of the updates are deletes, some are adds. Sometimes you can have many deletes in a row, sometimes many adds in a row, and sometimes a mixture of deletes and adds. We're trying to batch our updates and were hoping to avoid having to manage separate files for adds and deletes. Perhaps a single top-level wrapper tag could contain deletes and adds in the same document? Thanks, Brendan hossman wrote: > > > it's inherently bad from the XML spec's point of view -- as in the XML > spec says you can only have one "top level" tag per "XML document". > > incidentally: what is your use case for even trying this? you know that with > a uniqueKey declaration you don't need to delete a doc before adding a new > one with the same uniqueKey right? ... you can just add them all. > > we may/should allow you to specify multiple <id> or <query> blocks inside > of a delete, but i don't imagine anyone is planning on adding syntax to > support <add> and <delete> ops in the same update command ... they are > radically different. > > > -Hoss > > > -- View this message in context: http://www.nabble.com/start-tag-not-allowed-in-epilog-tf4602869.html#a13143348 Sent from the Solr - User mailing list archive at Nabble.com.
Re: WebException (ServerProtocolViolation) with SolrSharp
Hi Felipe - The issue you're encountering is a problem with the data format being passed to the solr server. If you follow the stack trace that you posted, you'll notice that the solr field is looking for a value that's a float, but the passed value is "1,234". I'm guessing this is caused by one of two possibilities: (1) there's a typo in your example code, where "1,234" should actually be " 1.234", or (2) there's a culture settings difference on your server that's converting " 1.234" to "1,234" Assuming it's the latter, add this line in the ExampleIndexDocument constructor: CultureInfo MyCulture = new CultureInfo("en-US"); Please let me know if this fixes the issue, I've been looking at this previously and would like to confirm it. thanks, jeff r. On 10/10/07, Filipe Correia <[EMAIL PROTECTED]> wrote: > > Hello, > > I am trying to run SolrSharp's example application but am getting a > WebException with a ServerProtocolViolation status message. > > After some debugging I found out this is happening with a call to: > http://localhost:8080/solr/update/ > > And using fiddler[1] found out that solr is actually throwing the > following exception: > org.apache.solr.core.SolrException: Error while creating field > 'weight{type=sfloat,properties=indexed,stored,omitNorms,sortMissingLast}' > from value '1,234' > at org.apache.solr.schema.FieldType.createField(FieldType.java > :173) > at org.apache.solr.schema.SchemaField.createField(SchemaField.java > :94) > at org.apache.solr.update.DocumentBuilder.addSingleField( > DocumentBuilder.java:57) > at org.apache.solr.update.DocumentBuilder.addField( > DocumentBuilder.java:73) > at org.apache.solr.update.DocumentBuilder.addField( > DocumentBuilder.java:83) > at org.apache.solr.update.DocumentBuilder.addField( > DocumentBuilder.java:77) > at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc( > XmlUpdateRequestHandler.java:339) > at org.apache.solr.handler.XmlUpdateRequestHandler.update( > XmlUpdateRequestHandler.java:162) > at > org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody( > XmlUpdateRequestHandler.java:84) > at org.apache.solr.handler.RequestHandlerBase.handleRequest( > RequestHandlerBase.java:77) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:658) > at org.apache.solr.servlet.SolrDispatchFilter.execute( > SolrDispatchFilter.java:191) > at org.apache.solr.servlet.SolrDispatchFilter.doFilter( > SolrDispatchFilter.java:159) > at > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter( > ApplicationFilterChain.java:235) > at org.apache.catalina.core.ApplicationFilterChain.doFilter( > ApplicationFilterChain.java:206) > at org.apache.catalina.core.StandardWrapperValve.invoke( > StandardWrapperValve.java:233) > at org.apache.catalina.core.StandardContextValve.invoke( > StandardContextValve.java:175) > at org.apache.catalina.core.StandardHostValve.invoke( > StandardHostValve.java:128) > at org.apache.catalina.valves.ErrorReportValve.invoke( > ErrorReportValve.java:102) > at org.apache.catalina.core.StandardEngineValve.invoke( > StandardEngineValve.java:109) > at org.apache.catalina.connector.CoyoteAdapter.service( > CoyoteAdapter.java:263) > at org.apache.coyote.http11.Http11Processor.process( > Http11Processor.java:844) > at > org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process( > Http11Protocol.java:584) > at org.apache.tomcat.util.net.JIoEndpoint$Worker.run( > JIoEndpoint.java:447) > at java.lang.Thread.run(Unknown Source) > Caused by: java.lang.NumberFormatException: For input string: > 
"1,234" > at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) > at java.lang.Float.parseFloat(Unknown Source) > at org.apache.solr.util.NumberUtils.float2sortableStr( > NumberUtils.java:80) > at org.apache.solr.schema.SortableFloatField.toInternal( > SortableFloatField.java:50) > at org.apache.solr.schema.FieldType.createField(FieldType.java > :171) > ... 24 more > type Status report > message Error while creating field > 'weight{type=sfloat,properties=indexed,stored,omitNorms,sortMissingLast}' > from value '1,234' > > I am just starting to try Solr, and might be missing some > configurations, but I have no clue where to begin to investigate this > further without digging into Solr's source, which I would really like > to avoid for now. Any thoughts? > > thank you in advance, > Filipe Correia > > [1] http://www.fiddlertool.com/ >
quick allowDups questions
Normally this is the type of thing I'd just scour through the online docs or the source code for, but I'm under the gun a bit. Anyway, I need to update some docs in my index because my client program wasn't accurately putting these docs in (values for one of the fields was missing). I'm hoping I won't have to write additional code to go through and delete each existing doc before I add the new one, and I think setting allowDups on the add command to false will allow me to do this. I seem to recall something in the update handler code that goes through and deletes all but the last copy of the doc if allowDups is false - does that sound accurate? If so, I just need to make sure that solrj properly sets that flag, which leads me to my next question. Does solrj default allowDups to false? If not, what do I need to do to make sure allowDups is set to false when I'm adding these docs?
Re: Facets and running out of Heap Space
On 10-Oct-07, at 12:19 PM, David Whalen wrote: It looks now like I can't use facets the way I was hoping to because the memory requirements are impractical. I can't remember if this has been mentioned, but upping the HashDocSet size is one way to reduce memory consumption. Whether this will work well depends greatly on the cardinality of your facet sets. facet.enum.cache.minDf set high is another option (will not generate a bitset for any value whose facet set is less that this value). Both options have performance implications. So, as an alternative I was thinking I could get counts by doing rows=0 and using filter queries. Is there a reason to think that this might perform better? Or, am I simply moving the problem to another step in the process? Running one query per unique facet value seems impractical, if that is what you are suggesting. Setting minDf to a very high value should always outperform such an approach. -Mike DW -Original Message- From: Stu Hood [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 09, 2007 10:53 PM To: solr-user@lucene.apache.org Subject: Re: Facets and running out of Heap Space Using the filter cache method on the things like media type and location; this will occupy ~2.3MB of memory _per unique value_ Mike, how did you calculate that value? I'm trying to tune my caches, and any equations that could be used to determine some balanced settings would be extremely helpful. I'm in a memory limited environment, so I can't afford to throw a ton of cache at the problem. (I don't want to thread-jack, but I'm also wondering whether anyone has any notes on how to tune cache sizes for the filterCache, queryResultCache and documentCache). Thanks, Stu -Original Message- From: Mike Klaas <[EMAIL PROTECTED]> Sent: Tuesday, October 9, 2007 9:30pm To: solr-user@lucene.apache.org Subject: Re: Facets and running out of Heap Space On 9-Oct-07, at 12:36 PM, David Whalen wrote: (snip) I'm sure we could stop storing many of these columns, especially if someone told me that would make a big difference. I don't think that it would make a difference in memory consumption, but storage is certainly not necessary for faceting. Extra stored fields can slow down search if they are large (in terms of bytes), but don't really occupy extra memory, unless they are polluting the doc cache. Does 'text' need to be stored? what does the LukeReqeust Handler tell you about the # of distinct terms in each field that you facet on? Where would I find that? I could probably estimate that myself on a per-column basis. it ranges from 4 distinct values for media_type to 30-ish for location to 200-ish for country_code to almost 10,000 for site_id to almost 100,000 for journalist_id. Using the filter cache method on the things like media type and location; this will occupy ~2.3MB of memory _per unique value_, so it should be a net win for those (although quite close in space requirements for a 30-ary field on your index size). -Mike
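As a concrete sketch of the knobs mentioned here, a facet request with facet.enum.cache.minDf set might be built like this (parameter values are arbitrary, the solrj client is assumed, and the same parameters can equally go on the request URL or into solrconfig.xml defaults):

import org.apache.solr.client.solrj.SolrQuery;

public class FacetTuning {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        q.setFacet(true);
        q.addFacetField("media_type", "location", "country_code");
        // Only cache a filter bitset for terms matching at least this many docs;
        // rarer terms are counted without building a bitset. Note this only applies
        // when the term-enumeration faceting code path is used (see the follow-up
        // later in this thread about plain string fields using the FieldCache method).
        q.set("facet.enum.cache.minDf", "25");
        q.setFacetMinCount(0);
        return q;
    }
}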
Re: quick allowDups questions
On 10-Oct-07, at 1:11 PM, Charlie Jackson wrote: Anyway, I need to update some docs in my index because my client program wasn't accurately putting these docs in (values for one of the fields was missing). I'm hoping I won't have to write additional code to go through and delete each existing doc before I add the new one, and I think setting allowDups on the add command to false will allow me to do this. I seem to recall something in the update handler code that goes through and deletes all but the last copy of the doc if allowDups is false - does that sound accurate? Yes. But you need to define a uniqueKey in schema and make sure it is the same for docs you want overwritten. This is how solr detects "dups". If so, I just need to make sure that solrj properly sets that flag, which leads me to my next question. Does solrj default allowDups to false? If not, what do I need to do to make sure allowDups is set to false when I'm adding these docs? It is the normal mode of operation for Solr, so I'd be surprised if it wasn't the default in solrj (but I don't actually know). -Mike
Re: start tag not allowed in epilog
On 10-Oct-07, at 12:49 PM, BrendanD wrote: We simply process a queue of updates from a database table. Some of the updates are deletes, some are adds. Sometimes you can have many deletes in a row, sometimes many adds in a row, and sometimes a mixture of deletes and adds. We're trying to batch our updates and were hoping to avoid having to manage separate files for adds and deletes. Perhaps a single top-level wrapper tag could contain deletes and adds in the same document? This would be very complicated from a standpoint of returning errors to the client. Keep in mind the <delete> command can never be batched, regardless. The only command that supports batching is <add> (and it is 1 <add> with multiple <doc>s, not multiple <add>s). If you keep a persistent connection open to solr, I don't see why one command per request should be limiting. Note also that you can batch on your end. If the deletes are doc ids, then you can collect a bunch at once and do <delete><query>id:xxx id:yyy id:zzz id:aaa id:bbb</query></delete> to perform them all at once. -Mike
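In client code the same batching can be done without hand-building the XML; a rough sketch with the solrj client (placeholder ids, and it assumes id is the uniqueKey field) might be:

import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class BatchedDeletes {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        List<String> ids = Arrays.asList("xxx", "yyy", "zzz", "aaa", "bbb");
        // Collapse many pending deletes into a single delete-by-query request.
        StringBuilder q = new StringBuilder("id:(");
        for (String id : ids) q.append(id).append(' ');
        q.append(')');
        server.deleteByQuery(q.toString());
        // Adds batch natively: one add request can carry a whole collection of docs,
        // e.g. server.add(docs) where docs is a Collection of SolrInputDocument.
        server.commit();
    }
}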
RE: quick allowDups questions
Thanks for the response, Mike. A quick test using the example app confirms your statement. As for Solrj, you're probably right, but I'm not going to take any chances for the time being. The server.add method has an optional Boolean flag named "overwrite" that defaults to true. Without knowing for sure what it does, I'm not going to mess with it. For the purposes of my problem, I've got an upper and lower bound of affected docs, so I'm just going to delete them all and then initiate a re-index of those specific ids from my source. Thanks again for the help! -Original Message- From: Mike Klaas [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 10, 2007 3:58 PM To: solr-user@lucene.apache.org Subject: Re: quick allowDups questions On 10-Oct-07, at 1:11 PM, Charlie Jackson wrote: > Anyway, I need to update some docs in my index because my client > program > wasn't accurately putting these docs in (values for one of the fields > was missing). I'm hoping I won't have to write additional code to go > through and delete each existing doc before I add the new one, and I > think setting allowDups on the add command to false will allow me > to do > this. I seem to recall something in the update handler code that goes > through and deletes all but the last copy of the doc if allowDups is > false - does that sound accurate? Yes. But you need to define a uniqueKey in schema and make sure it is the same for docs you want overwritten. This is how solr detects "dups". > > If so, I just need to make sure that solrj properly sets that flag, > which leads me to my next question. Does solrj default allowDups to > false? If not, what do I need to do to make sure allowDups is set to > false when I'm adding these docs? It is the normal mode of operation for Solr, so I'd be surprised if it wasn't the default in solrj (but I don't actually know). -Mike
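For reference, a minimal solrj add under the default overwrite behaviour looks roughly like this; the field names and URL are placeholders for the scenario described above:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ReAddDoc {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "12345");                     // the uniqueKey field
        doc.addField("missing_field", "now populated");  // the value that was missing before
        // With the default overwrite=true (i.e. allowDups=false), re-adding a doc with
        // the same uniqueKey leaves only the newest copy after commit.
        server.add(doc);
        server.commit();
    }
}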
RE: Facets and running out of Heap Space
Accoriding to Yonik I can't use minDf because I'm faceting on a string field. I'm thinking of changing it to a tokenized type so that I can utilize this setting, but then I'll have to rebuild my entire index. Unless there's some way around that? > -Original Message- > From: Mike Klaas [mailto:[EMAIL PROTECTED] > Sent: Wednesday, October 10, 2007 4:56 PM > To: solr-user@lucene.apache.org > Cc: stuhood > Subject: Re: Facets and running out of Heap Space > > On 10-Oct-07, at 12:19 PM, David Whalen wrote: > > > It looks now like I can't use facets the way I was hoping > to because > > the memory requirements are impractical. > > I can't remember if this has been mentioned, but upping the > HashDocSet size is one way to reduce memory consumption. > Whether this will work well depends greatly on the > cardinality of your facet sets. facet.enum.cache.minDf set > high is another option (will not generate a bitset for any > value whose facet set is less that this value). > > Both options have performance implications. > > > So, as an alternative I was thinking I could get counts by doing > > rows=0 and using filter queries. > > > > Is there a reason to think that this might perform better? > > Or, am I simply moving the problem to another step in the process? > > Running one query per unique facet value seems impractical, > if that is what you are suggesting. Setting minDf to a very > high value should always outperform such an approach. > > -Mike > > > DW > > > > > > > >> -Original Message- > >> From: Stu Hood [mailto:[EMAIL PROTECTED] > >> Sent: Tuesday, October 09, 2007 10:53 PM > >> To: solr-user@lucene.apache.org > >> Subject: Re: Facets and running out of Heap Space > >> > >>> Using the filter cache method on the things like media type and > >>> location; this will occupy ~2.3MB of memory _per unique value_ > >> > >> Mike, how did you calculate that value? I'm trying to tune > my caches, > >> and any equations that could be used to determine some balanced > >> settings would be extremely helpful. I'm in a memory limited > >> environment, so I can't afford to throw a ton of cache at the > >> problem. > >> > >> (I don't want to thread-jack, but I'm also wondering > whether anyone > >> has any notes on how to tune cache sizes for the filterCache, > >> queryResultCache and documentCache). > >> > >> Thanks, > >> Stu > >> > >> > >> -Original Message- > >> From: Mike Klaas <[EMAIL PROTECTED]> > >> Sent: Tuesday, October 9, 2007 9:30pm > >> To: solr-user@lucene.apache.org > >> Subject: Re: Facets and running out of Heap Space > >> > >> On 9-Oct-07, at 12:36 PM, David Whalen wrote: > >> > >>> (snip) > >>> I'm sure we could stop storing many of these columns, > >> especially if > >>> someone told me that would make a big difference. > >> > >> I don't think that it would make a difference in memory > consumption, > >> but storage is certainly not necessary for faceting. Extra stored > >> fields can slow down search if they are large (in terms of bytes), > >> but don't really occupy extra memory, unless they are > polluting the > >> doc cache. Does 'text' > >> need to be stored? > >>> > what does the LukeReqeust Handler tell you about the # > of distinct > terms in each field that you facet on? > >>> > >>> Where would I find that? I could probably estimate that > >> myself on a > >>> per-column basis. it ranges from 4 distinct values for > >> media_type to > >>> 30-ish for location to 200-ish for country_code to almost > >> 10,000 for > >>> site_id to almost 100,000 for journalist_id. 
> >> > >> Using the filter cache method on the things like media type and > >> location; this will occupy ~2.3MB of memory _per unique > value_, so it > >> should be a net win for those (although quite close in space > >> requirements for a 30-ary field on your index size). > >> > >> -Mike > >> > >> > > >
Re: Different search results for (german) singular/plural searches - looking for a solution
On Wednesday 10 October 2007 12:00, Martin Grotzke wrote: > Basically I see two options: stemming and the usage of synonyms. Are > there others? A large list of German words and their forms is available from a Windows software called Morphy (http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). You can use it for mapping fullforms to base forms (Häuser -> Haus). You can also have a look at www.languagetool.org which uses this data in a Java software. LanguageTool also comes with jWordSplitter, which can find a compound's parts (Autowäsche -> Auto + Wäsche). Regards Daniel -- http://www.danielnaber.de
Re: start tag not allowed in epilog
I've re-written the code to generate separate files. One for adds and one for deletes. And this is working well for us now. Thanks. Mike Klaas wrote: > > > This would be very complicated from a standpoint of returning errors > to the client. > > Keep in mind the can never be batched, regardless. The > only command that supports batching is (and it is 1 > with multiple s, not multiple s). > > If you keep a persistent connection open to solr, I don't see why one > command per request should be limiting. > > Note also that you can batch on your end. If the deletes are doc > ids, then you can collect a bunch at once and do > id:xxx id:yyy id:zzz id:aaa id:bbb to > perform them all at once. > > -Mike > > -- View this message in context: http://www.nabble.com/start-tag-not-allowed-in-epilog-tf4602869.html#a13145693 Sent from the Solr - User mailing list archive at Nabble.com.
Internal Server Error and waitSearcher="false" for commit/optimize
Hello,

We're using solr 1.2 and a nightly build of the solrj client code. We very occasionally see things like this:

org.apache.solr.client.solrj.SolrServerException: Error executing query
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:86)
        at org.apache.solr.client.solrj.impl.BaseSolrServer.query(BaseSolrServer.java:99)
        ...
Caused by: org.apache.solr.common.SolrException: Internal Server Error

I'm guessing that this might be due to solr being in the middle of a commit or optimize. Could solr throw an exception like that in this case? We also occasionally see solr taking too long to respond.

We currently make our commit/optimize calls without any arguments. I'm wondering whether setting waitSearcher="false" might allow search queries to be served while a commit/optimize is being run. I found this in an old message from this list:

> Yes, it looks like there is no difference... the code to make commit
> totally asynchronous was never put in (so you can't really get commit
> to return instantly, it will always wait until the IndexWriter is closed).

This isn't a problem for us as the thread making the commit/optimize call is separate from thread(s) making queries. Is waitSearcher="false" designed to allow queries to be processed while a commit/optimize is being run? Are there any negative side effects to this setting (other than a query being slightly out-of-date :)?

Thanks,
Jason
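For anyone trying this, the options are attributes on the XML commit/optimize messages. A sketch of the XML form, assuming the standard update URL (the host and port are placeholders):

    curl http://localhost:8983/solr/update -H 'Content-type: text/xml; charset=utf-8' \
      --data-binary '<commit waitFlush="false" waitSearcher="false"/>'

    curl http://localhost:8983/solr/update -H 'Content-type: text/xml; charset=utf-8' \
      --data-binary '<optimize waitFlush="false" waitSearcher="false"/>'

With waitSearcher="false" the call returns before the new searcher has been warmed and registered, so incoming queries keep being served by the old searcher and simply see slightly stale results until the swap happens.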
Re: Facets and running out of Heap Space
On 10-Oct-07, at 2:40 PM, David Whalen wrote:

> According to Yonik I can't use minDf because I'm faceting on a string
> field. I'm thinking of changing it to a tokenized type so that I can
> utilize this setting, but then I'll have to rebuild my entire index.
>
> Unless there's some way around that?

For the fields that matter (many unique values), this is likely to result in a performance regression.

It might be better to try storing less unique data. For instance, faceting on the blog_url field, or create_date in your schema, would cause problems (they probably have millions of unique values).

It would be helpful to know which field is causing the problem. One way would be to do a sorted query on a quiescent index for each field, and see if there are any suspiciously large jumps in memory usage.

-Mike
Re: quick allowDups questions
the default solrj implementation should do what you need.

As for Solrj, you're probably right, but I'm not going to take any chances for the time being. The server.add method has an optional Boolean flag named "overwrite" that defaults to true. Without knowing for sure what it does, I'm not going to mess with it.

direct solr update allows a few extra fields: allowDups, overwritePending, overwriteCommited -- the future of overwritePending, overwriteCommited is in doubt (SOLR-60), so i did not want to bake that into the solrj API.

internally,
  allowDups = !overwrite;   (the one field you can set)
  overwritePending = !allowDups;
  overwriteCommited = !allowDups;

ryan
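Put another way, this is a sketch of the mapping ryan describes (my own illustration, not code from the thread; the id value is a placeholder): leaving the solrj "overwrite" flag at its default of true corresponds to posting an XML add with duplicates disallowed.

    <!-- solrj overwrite=true (the default)  ->  allowDups=false on the XML <add> -->
    <add allowDups="false">
      <doc>
        <field name="id">doc-1</field>
      </doc>
    </add>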
RE: Facets and running out of Heap Space
I'll see what I can do about that. Truthfully, the most important facet we need is the one on media_type, which has only 4 unique values. The second most important one to us is location, which has about 30 unique values. So, it would seem like we actually need a counter-intuitive solution.

That's why I thought Field Queries might be the solution.

Is there some reason to avoid setting multiValued to true here? It sounds like it would be the true cure-all.

Thanks again!

dave
Re: Facets and running out of Heap Space
On 10-Oct-07, at 3:46 PM, David Whalen wrote:

> I'll see what I can do about that. Truthfully, the most important facet
> we need is the one on media_type, which has only 4 unique values. The
> second most important one to us is location, which has about 30 unique
> values. So, it would seem like we actually need a counter-intuitive
> solution.
>
> That's why I thought Field Queries might be the solution.
>
> Is there some reason to avoid setting multiValued to true here? It
> sounds like it would be the true cure-all.

Should work. It would cost about 100 MB on a 25m corpus for those two fields.

Have you tried setting multivalued=true without reindexing? I'm not sure, but I think it will work.

-Mike
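A rough back-of-the-envelope check on these numbers (my own arithmetic, not from the thread, assuming the filter-cache faceting method keeps one bitset per unique facet value at one bit per document):

    25,000,000 docs / 8 bits per byte              ≈ 3.1 MB per unique value
    (4 media_type + 30 location values) x 3.1 MB   ≈ 106 MB, i.e. "about 100 MB"

The ~2.3 MB-per-value figure quoted earlier in the thread presumably reflects the document count at the time that estimate was made.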
Re: [ADMIN] - Spam problems?
: Around Sept. 20 I started getting Japanese spam to this account. This is
: a special account I only use for the Solr and Lucene user mailing
: lists. Did anybody else get these, starting around 9/20?

Note that many mailing list archives leave the sender emails in plain text (which results in easy harvesting by spam bots). Even the archives that strip or obfuscate email addresses in headers frequently do nothing about email addrs in the body of the message. ie: when someone replies to a post and their mail client does something like...

>> On Wed, 10 Oct 2007, Norskog, Lance <[EMAIL PROTECTED]> wrote:

or...

>> Date: Wed, 10 Oct 2007 15:50:34 -0400
>> From: "Norskog, Lance" <[EMAIL PROTECTED]>

one solution is to configure your spam filter to automatically reject mail where your address is in a recipient header (as opposed to the list addresses). the trade-off being you'll never see private replies, but hey: this is open source, everything should be discussed in the open :)

-Hoss
Syntax for newSearcher query
Hi,

The examples that I've found in the solrconfig.xml file and on this site are fairly basic for pre-warming specific queries. I have some rather complex looking queries that I'm not quite sure how to specify in my solrconfig.xml file in the newSearcher section. Here's an example of 3 queries that I'd like to pre-warm. The category ids will change with each query (there are 977 different category_ids):

rows=20&start=0&facet.query=attribute_id:1003278&facet.query=attribute_id:1003928&sort=merchant_count+desc&facet=true&facet.field=min_price_cad_rounded_to_tens&facet.field=manufacturer_id&facet.field=merchant_id&facet.field=has_coupon&facet.field=has_bundle&facet.field=has_sale_price&facet.field=has_promo&fq=product_is_active:true&fq=product_status_code:complete&fq=category_id:"1001143"&qt=sti_dismax_en&f.min_price_cad_rounded_to_tens.facet.limit=-1

rows=0&start=0&sort=merchant_count+desc&f.attribute_id_decimal_value_pair.facet.limit=-1&facet=true&facet.field=attribute_id_decimal_value_pair&fq=product_is_active:true&fq=product_status_code:complete&fq=category_id:"1001143"&qt=sti_dismax_en&f.attribute_id_decimal_value_pair.facet.prefix=1003278

rows=0&start=0&sort=merchant_count+desc&f.attribute_id_value_en_pair.facet.prefix=1003928&facet=true&f.attribute_id_value_en_pair.facet.limit=-1&facet.field=attribute_id_value_en_pair&fq=product_is_active:true&fq=product_status_code:complete&fq=category_id:"1001143"&qt=sti_dismax_en

I'm not sure if it's necessary to have all those parameters in my query for pre-warming, but those are just the queries I see in my catalina.out file when the user clicks on a specific category. I'd like to pre-warm the first page of results from all of my categories.

Thanks,
Brendan

--
View this message in context: http://www.nabble.com/Syntax-for-newSearcher-query-tf4604487.html#a13147569
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Syntax for newSearcher query
: looking queries that I'm not quite sure how to specify in my solrconfig.xml
: file in the newSearcher section.

: rows=20&start=0&facet.query=attribute_id:1003278&facet.query=attribute_id:1003928&sort=merchant_count+desc&facet=true&facet.field=min_price_cad_rounded_to_tens&facet.field=manufacturer_id&facet.field=merchant_id&facet.field=has_coupon&facet.field=has_bundle&facet.field=has_sale_price&facet.field=has_promo&fq=product_is_active:true&fq=product_status_code:complete&fq=category_id:"1001143"&qt=sti_dismax_en&f.min_price_cad_rounded_to_tens.facet.limit=-1

all you have to do is put each key=val pair as a <str name="key">val</str>

it doesn't matter what the param is, or if it's a param that has multiple values, just list each of them the same way...

  <lst>
    <str name="rows">20</str>
    <str name="start">0</str>
    <str name="facet.query">attribute_id:1003278</str>
    <str name="facet.query">attribute_id:1003928</str>
    ...
  </lst>
  ...

-Hoss
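Spelled out a little further, such an entry sits inside the newSearcher listener in solrconfig.xml. This is a sketch based on the stock example config (the listener element and class are standard; only the first of Brendan's queries is shown, and only some of its parameters):

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="rows">20</str>
          <str name="start">0</str>
          <str name="qt">sti_dismax_en</str>
          <str name="facet">true</str>
          <str name="facet.query">attribute_id:1003278</str>
          <str name="facet.query">attribute_id:1003928</str>
          <str name="facet.field">min_price_cad_rounded_to_tens</str>
          <str name="fq">category_id:"1001143"</str>
          <!-- ...remaining parameters follow the same pattern... -->
        </lst>
        <!-- one <lst> per query to pre-warm -->
      </arr>
    </listener>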
Re: Syntax for newSearcher query
Awesome! Thanks!

--
View this message in context: http://www.nabble.com/Syntax-for-newSearcher-query-tf4604487.html#a13148914
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Facets and running out of Heap Space
On 10/10/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> Have you tried setting multivalued=true without reindexing? I'm not
> sure, but I think it will work.

Yes, that will work fine. One thing that will change is the response format for stored fields:

  <arr name="..."><str>val1</str></arr>

instead of

  <str name="...">val1</str>

Hopefully in the future we can specify a faceting method w/o having to change the schema.

-Yonik
Re: Spell Check Handler
Hoss,

I had a feeling someone would be quoting Yonik's Law of Patches! ;-)

For now, this is done. I created the changes, created JavaDoc comments on the various settings and their expected output, created a JUnit test for the SpellCheckerRequestHandler which tests various components of the handler, and I also created the supporting configuration files for the JUnit tests (schema and solrconfig files). I attached the patch to the JIRA issue, so now we just have to wait until it gets added back in to the main code stream.

For anyone who is interested, here is a link to the JIRA:
https://issues.apache.org/jira/browse/SOLR-375

Could someone please drop me a hint on how to update the wiki or any other documentation that could benefit from being updated; I'd like to help out as much as possible, but first I need to know "how". ;-)

When these changes do get committed back in to the daily build, please review the generated JavaDoc for information on how to utilize these new features. If anyone has any questions or comments, please do not hesitate to ask.

As a general note of self-critique on these changes, I am not 100% sure of the way I implemented the "nested" structure when the "multiWords" parameter is used. My interest is that it should work smoothly with some other technology such as Prototype using the JSon output type. Unfortunately, I will not be getting a chance to start on that coding until next week, so it is up in the air as to whether this structure will be conducive or not. I am planning on providing more details in the documentation as far as how to utilize these modifications in Prototype and Ajax when I get a chance (even provide links to a production site so you can see it in action and view the source if interested). So stay tuned...

Thanks for everyone's time,

Scott Tabar

Chris Hostetter <[EMAIL PROTECTED]> wrote:

: If you like, I can post the source code changes that I made to the
: SpellCheckerRequestHandler, but at this time I am not ready to open a
: JIRA issue and submit the changes back through the subversion. I will
: need to do a little more testing, documentation, and create some unit
: tests to cover all of these changes, but what I have been able to
: perform, it is working very well.

Keep in mind "Yonik's Law Of Patches" ...

"A half-baked patch in Jira, with no documentation, no tests and no backwards compatibility is better than no patch at all."

http://wiki.apache.org/solr/HowToContribute

...even if you don't think the code is "solid" yet, if you want to eventually make it available to people, making a "rough" version available to people early gives other people the opportunity to help you make it solid (by writing unit tests, fixing bugs, and adding documentation).

-Hoss