Question mark glyphs in indexed content
Hello,

I am using the latest SolrJ to index content. When I look at that content in the Solr Admin web utility I see weird characters like this: http://brockwine.com/images/solrglyphs.png

When I look at the text in the MySQL DB those characters appear to be plain hyphens. The MySQL table character set is utf8 and the collation is utf8.

Environment:

OS X 10.5.8
java version "1.5.0_19"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02-304)
Java HotSpot(TM) Client VM (build 1.5.0_19-137, mixed mode, sharing)
Solr Specification Version: 1.3.0
Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47
Lucene Specification Version: 2.4-dev
Lucene Implementation Version: 2.4-dev 691741 - 2008-09-03 15:25:16
Jetty 6.1.3

Any thoughts?

Thanks
/Rupert
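P.S. In case it matters, the indexer reads the text over JDBC, roughly like the sketch below. Host, credentials, table and column names are placeholders, not our real ones, and I am assuming Connector/J's useUnicode/characterEncoding URL parameters are the right way to force UTF-8 on the driver side. The loop just dumps the code point of every non-ASCII character so a real dash (e.g. an en dash, U+2013) can be told apart from an already-mangled byte:

    import java.sql.*;

    public class DumpNonAscii {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            // Ask the driver to decode the utf8 column as UTF-8 instead of the
            // platform default charset (placeholder connection details).
            String url = "jdbc:mysql://localhost/content"
                    + "?useUnicode=true&characterEncoding=UTF-8";
            Connection conn = DriverManager.getConnection(url, "user", "pass");
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SELECT body FROM articles LIMIT 10");
            while (rs.next()) {
                String text = rs.getString(1);
                for (int i = 0; i < text.length(); i++) {
                    char c = text.charAt(i);
                    if (c > 127) {
                        // A real en dash prints as U+2013; a plain '?' here means the
                        // character was already lost before it reached Java.
                        System.out.printf("U+%04X %c%n", (int) c, c);
                    }
                }
            }
            conn.close();
        }
    }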
Responses getting truncated
I am seeing our responses getting truncated if and only if I search on our main text field.

E.g. if I just do a basic query like

  title_t:arthritis

then I get a valid document back. But if I add in our larger text field:

  title_t:arthritis OR text_t:arthritis

then the resulting document is NOT valid XML (if using wt=xml) or Ruby (using wt=ruby). If I run these through curl on the command line the output is truncated, and if I run the search through the web-based admin panel then I get an XML parse error.

This appears to have just started recently and the only thing we have done is change our indexer from a PHP one to a Java one, but functionally they are identical.

Any thoughts? Thanks in advance.

- Rupert
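P.S. In case anyone wants to reproduce the check, this is roughly what I run to see whether a response is well-formed XML. The URL is a placeholder for our host/port and the default /select handler; a truncated body shows up as a premature end-of-document error from the parser:

    import java.net.URL;
    import java.net.URLEncoder;
    import javax.xml.parsers.DocumentBuilderFactory;

    public class CheckResponse {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; wt=xml so the body can be fed straight to an XML parser.
            String url = "http://localhost:8983/solr/select?wt=xml&rows=10&q="
                    + URLEncoder.encode("title_t:arthritis OR text_t:arthritis", "UTF-8");
            try {
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(new URL(url).openStream());
                System.out.println("well-formed XML");
            } catch (org.xml.sax.SAXException e) {
                System.out.println("NOT well-formed: " + e.getMessage());
            }
        }
    }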
Re: Responses getting truncated
Using wt=json also yields an invalid document. So after more investigation it appears that I can always "break" the response by pulling back a specific field via the "fl" parameter. If I leave that field off, the response is valid; if I include it, Solr yields an invalid - truncated - document. This happens in any response format (xml, json, ruby).

I am using the SolrJ client to add documents to my index. The field is a normal "text" field type and the text itself is the first 1000 characters of an article.

> It can very well be an issue with the data itself. For example, if the data
> contains un-escaped characters which invalidates the response

When I look at the document using wt=xml, all XML entities are escaped. When I look at it under wt=ruby, all single quotes are escaped, same for json, so it appears that all escaping is taking place. The core problem seems to be that the document is just truncated - it just plain ends. Jetty's log says it's sending back an HTTP 200, so as far as it is concerned all is well.

Any ideas on how I can dig deeper?

Thanks
-Rupert

On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness wrote: > It can very well be an issue with the data itself. For example, if the data > contains un-escaped characters which invalidates the response. I don't know > much about ruby, but what do you get with wt=json? > > Rupert Fiasco wrote: >> >> I am seeing our responses getting truncated if and only if I search on >> our main text field. >> >> E.g. I just do some basic like >> >> title_t:arthritis >> >> Then I get a valid document back. But if I add in our larger text field: >> >> title_t:arthritis OR text_t:arthritis >> >> then the resultant document is NOT valid XML (if using wt=xml) or Ruby >> (using wt=ruby). If I run these through curl on the command its >> truncated and if I run the search through the web-based admin panel >> then I get an XML parse error. >> >> This appears to have just started recently and the only thing we have >> done is change our indexer from a PHP one to a Java one, but >> functionally they are identical. >> >> Any thoughts? Thanks in advance. >> >> - Rupert >> >> >
Re: Responses getting truncated
The text file at http://brockwine.com/solr.txt represents one of these truncated responses (this one in XML). It starts out great, then look at the bottom - boom, game over. :)

I found this document by first running our bigger search, which breaks, and then zeroing in on a specific broken document using the rows/start parameters (roughly the paging check sketched at the end of this message). But there are an unknown number of these "broken" documents - a lot, I presume.

-Rupert

On Tue, Aug 25, 2009 at 9:40 AM, Avlesh Singh wrote: > Can you copy-paste the source data indexed in this field which causes the > error? > > Cheers > Avlesh > > On Tue, Aug 25, 2009 at 10:01 PM, Rupert Fiasco wrote: > >> Using wt=json also yields an invalid document. So after more >> investigation it appears that I can always "break" the response by >> pulling back a specific field via the "fl" parameter. If I leave off a >> field then the response is valid, if I include it then Solr yields an >> invalid document - a truncated document. This happens in any response >> format (xml, json, ruby). >> >> I am using the SolrJ client to add documents to in my index. My field >> is a normal "text" field type and the text itself is the first 1000 >> characters of an article. >> >> > It can very well be an issue with the data itself. For example, if the >> data >> > contains un-escaped characters which invalidates the response >> >> When I look at the document in using wt=xml then all XML entities are >> escaped. When I look at it under wt=ruby then all single quotes are >> escaped, same for json, so it appears that all escaping it taking >> place. The core problem seems to be that the document is just >> truncated - it just plain end of files. Jetty's log says its sending >> back an HTTP 200 so all is well. >> >> Any ideas on how I can dig deeper? >> >> Thanks >> -Rupert >> >> >> On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness wrote: >> > It can very well be an issue with the data itself. For example, if the >> data >> > contains un-escaped characters which invalidates the response. I don't >> know >> > much about ruby, but what do you get with wt=json? >> > >> > Rupert Fiasco wrote: >> >> >> >> I am seeing our responses getting truncated if and only if I search on >> >> our main text field. >> >> >> >> E.g. I just do some basic like >> >> >> >> title_t:arthritis >> >> >> >> Then I get a valid document back. But if I add in our larger text field: >> >> >> >> title_t:arthritis OR text_t:arthritis >> >> >> >> then the resultant document is NOT valid XML (if using wt=xml) or Ruby >> >> (using wt=ruby). If I run these through curl on the command its >> >> truncated and if I run the search through the web-based admin panel >> >> then I get an XML parse error. >> >> >> >> This appears to have just started recently and the only thing we have >> >> done is change our indexer from a PHP one to a Java one, but >> >> functionally they are identical. >> >> >> >> Any thoughts? Thanks in advance. >> >> >> >> - Rupert >> >> >> >> >> > >> >
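P.S. The paging check mentioned above is essentially the loop below - it walks the result set one row at a time and reports which offsets fail to parse. Host, query and the upper bound are the same sort of placeholders as before:

    import java.net.URL;
    import java.net.URLEncoder;
    import javax.xml.parsers.DocumentBuilderFactory;

    public class FindBrokenDocs {
        public static void main(String[] args) throws Exception {
            String q = URLEncoder.encode("title_t:arthritis OR text_t:arthritis", "UTF-8");
            for (int start = 0; start < 100; start++) {
                // rows=1 so each request returns a single document.
                String url = "http://localhost:8983/solr/select?wt=xml&rows=1&start="
                        + start + "&q=" + q;
                try {
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .parse(new URL(url).openStream());
                } catch (org.xml.sax.SAXException e) {
                    System.out.println("start=" + start + " is broken: " + e.getMessage());
                }
            }
        }
    }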
Re: Responses getting truncated
So I whipped up a quick SolrJ client and ran it against the document that I referenced earlier. When I retrieve the doc and just print its field/value pairs to stdout it ends like this: http://brockwine.com/images/output1.png It appears to be some kind of garbage characters. -Rupert On Tue, Aug 25, 2009 at 12:19 PM, Uri Boness wrote: > Hi, > > This is a very strange behavior and the fact that it is cause by one > specific field, again, leads me to believe it's still a data issue. Did you > try using SolrJ to query the data as well? If the same thing happens when > using the binary protocol, then it's probably not a data issue. On the other > hand, if it works fine, then at least you can inspect the data to see where > things go wrong. Sorry for insisting on that, but I cannot think of anything > else that can cause this problem. > > If anyone else have a better idea, I'm actually very curious to hear about > it. > > Uri > > Rupert Fiasco wrote: >> >> The text file at: >> >> http://brockwine.com/solr.txt >> >> Represents one of these truncated responses (this one in XML). It >> starts out great, then look at the bottom, boom, game over. :) >> >> I found this document by first running our bigger search which breaks >> and then zeroing in a specific broken document by using the rows/start >> parameters. But there are any unknown number of these "broken" >> documents - a lot I presume. >> >> -Rupert >> >> On Tue, Aug 25, 2009 at 9:40 AM, Avlesh Singh wrote: >> >>> >>> Can you copy-paste the source data indexed in this field which causes the >>> error? >>> >>> Cheers >>> Avlesh >>> >>> On Tue, Aug 25, 2009 at 10:01 PM, Rupert Fiasco >>> wrote: >>> >>> >>>> >>>> Using wt=json also yields an invalid document. So after more >>>> investigation it appears that I can always "break" the response by >>>> pulling back a specific field via the "fl" parameter. If I leave off a >>>> field then the response is valid, if I include it then Solr yields an >>>> invalid document - a truncated document. This happens in any response >>>> format (xml, json, ruby). >>>> >>>> I am using the SolrJ client to add documents to in my index. My field >>>> is a normal "text" field type and the text itself is the first 1000 >>>> characters of an article. >>>> >>>> >>>>> >>>>> It can very well be an issue with the data itself. For example, if the >>>>> >>>> >>>> data >>>> >>>>> >>>>> contains un-escaped characters which invalidates the response >>>>> >>>> >>>> When I look at the document in using wt=xml then all XML entities are >>>> escaped. When I look at it under wt=ruby then all single quotes are >>>> escaped, same for json, so it appears that all escaping it taking >>>> place. The core problem seems to be that the document is just >>>> truncated - it just plain end of files. Jetty's log says its sending >>>> back an HTTP 200 so all is well. >>>> >>>> Any ideas on how I can dig deeper? >>>> >>>> Thanks >>>> -Rupert >>>> >>>> >>>> On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness wrote: >>>> >>>>> >>>>> It can very well be an issue with the data itself. For example, if the >>>>> >>>> >>>> data >>>> >>>>> >>>>> contains un-escaped characters which invalidates the response. I don't >>>>> >>>> >>>> know >>>> >>>>> >>>>> much about ruby, but what do you get with wt=json? >>>>> >>>>> Rupert Fiasco wrote: >>>>> >>>>>> >>>>>> I am seeing our responses getting truncated if and only if I search on >>>>>> our main text field. >>>>>> >>>>>> E.g. 
I just do some basic like >>>>>> >>>>>> title_t:arthritis >>>>>> >>>>>> Then I get a valid document back. But if I add in our larger text >>>>>> field: >>>>>> >>>>>> title_t:arthritis OR text_t:arthritis >>>>>> >>>>>> then the resultant document is NOT valid XML (if using wt=xml) or Ruby >>>>>> (using wt=ruby). If I run these through curl on the command its >>>>>> truncated and if I run the search through the web-based admin panel >>>>>> then I get an XML parse error. >>>>>> >>>>>> This appears to have just started recently and the only thing we have >>>>>> done is change our indexer from a PHP one to a Java one, but >>>>>> functionally they are identical. >>>>>> >>>>>> Any thoughts? Thanks in advance. >>>>>> >>>>>> - Rupert >>>>>> >>>>>> >>>>>> >> >> >
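For reference, the "quick SolrJ client" mentioned at the top of this message is essentially the sketch below. The Solr URL and document id are placeholders, and I am assuming the plain CommonsHttpSolrServer client here; it just dumps every stored field so the tail of the big text field is visible:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class PrintDoc {
        public static void main(String[] args) throws Exception {
            // Placeholder Solr URL and document id.
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("id:12345");
            query.setRows(1);
            QueryResponse rsp = server.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                for (String field : doc.getFieldNames()) {
                    System.out.println(field + " => " + doc.getFieldValue(field));
                }
            }
        }
    }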
Re: Responses getting truncated
> 1. Exactly which version of Solr / SolrJ are you using?

Solr Specification Version: 1.3.0
Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47

Plus the latest SolrJ, which I downloaded a couple of days ago.

> Can you put the original (pre solr, pre solrj, raw untouched, etc...)
> file that this solr doc came from online somewhere?

We are running an instance of MediaWiki, so the text goes through a couple of transformations: wiki markup -> html -> plain text. It's at this last step that I take a "snippet" and insert that into Solr. My snippet code is:

    // article.java
    public String getSnippet(int maxlen) {
        int length = getPlainText().length() >= maxlen ? maxlen : getPlainText().length();
        return getPlainText().substring(0, length);
    }

    // ... later on, when adding to Solr
    doc.addField("text_snippet_t", article.getSnippet(1000));

So in theory I am getting the whole article if it's less than 1K chars, and a maximum of 1K chars if it's bigger. I initialized this String from the DB by using the String constructor where I pass in the charset:

    text = new String(textFromDB, "UTF-8");

So to the best of my knowledge, taking a substring of a String that was decoded from UTF-8 should not break up a code point. Is that an incorrect assumption? If so, what is the best way to break up such a string and get approximately that many characters (see the sketch at the end of this message)? Exactness is not a requirement.

-Rupert

On Tue, Aug 25, 2009 at 5:37 PM, Chris Hostetter wrote: > > 1. Exactly which version of Solr / SolrJ are you using? > > 2. ... > > : I am using the SolrJ client to add documents to in my index. My field > : is a normal "text" field type and the text itself is the first 1000 > : characters of an article. > > Can you put the original (pre solr, pre solrj, raw untouched, etc...) > file that this solr doc came from online somewhere? > > What does your *indexing* code look like? ... Can you add some debugging to > the SolrJ client when you *add* this doc to print out exactly what those > 1000 characters are? > > My hunch: when you are extracting the first 1000 characters, you're > getting only the first half of a character ...or... you are getting docs > with less than 1000 characters and winding up with a buffer (char[]?) that > has garbage at the end; SolrJ isn't complaining on the way in, but > something farther down (maybe before indexing, maybe after) is seeing that > garbage and cutting the field off at that point. > > > > -Hoss > >
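For what it's worth, the kind of guard I had in mind for the truncation is the sketch below - just an idea, not what we currently run. Since String.substring() works on UTF-16 code units rather than code points, it backs up one position if the cut would land in the middle of a surrogate pair:

    public class Snippets {
        /** Return at most maxLen chars, never ending on half of a surrogate pair. */
        public static String safeSnippet(String text, int maxLen) {
            if (text.length() <= maxLen) {
                return text;
            }
            int end = maxLen;
            // If the last kept char is a high surrogate, its low surrogate would be
            // cut off; back up one so the pair stays intact.
            if (Character.isHighSurrogate(text.charAt(end - 1))) {
                end--;
            }
            return text.substring(0, end);
        }

        public static void main(String[] args) {
            String s = "leg bones \uD835\uDC00 nose"; // contains one surrogate pair
            System.out.println(safeSnippet(s, 11));   // cuts before the pair, not through it
        }
    }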
Re: Responses getting truncated
Firstly, to everyone who has been helping me, thank you very much. All this feedback is helping me narrow down these issues.

I deleted the index and re-indexed all the data from scratch, and for a couple of days we were OK, but now it seems to be erring again. It happens on different input documents, so what was broken before now works (documents that were having issues before are OK now, after a fresh re-index).

An issue we are seeing now is that an XML response from Solr will contain the "tail" of an earlier response, for example:

http://brockwine.com/solr2.txt

That is a response we are getting from Solr. Using the web interface for Solr in Firefox, Firefox freaks out because it tries to parse it and of course it's invalid XML, but I can retrieve it via curl.

Has anyone seen this before?

In regards to earlier questions:

> i assume you are correct, but you listed several steps of transformation
> above, are you certian they all work correctly and produce valid UTF-8?

Yes, I have looked at the source and contacted the author of the conversion library we are using, and have verified that if UTF-8 goes in then UTF-8 will come out - and UTF-8 is definitely going in.

I don't think sending over an actual input document would help, because the failing document seems to change. Plus, this latest issue appears to be more an issue of the last response buffer not clearing, or something like that. What's strange is that if I wait a few minutes and reload, then the buffer is cleared and I get back a valid response. It's intermittent, but appears to be happening frequently.

If it matters, we started using LucidGaze for Solr about 10 days ago, approximately when these issues started happening (but it's hard to say whether that is related, because at the same time we switched from a PHP to a Java indexing client).

Thanks for your patience

-Rupert

On Tue, Aug 25, 2009 at 8:33 PM, Chris Hostetter wrote: > > : We are running an instance of MediaWiki so the text goes through a > : couple of transformations: wiki markup -> html -> plain text. > : Its at this last step that I take a "snippet" and insert that into Solr. > ... > : doc.addField("text_snippet_t", article.getSnippet(1000)); > > ok, well first off: that's not the field where you are having problems, > is it? if i remember correctly from your previous posts, wasn't the > response getting aborted in the middle of the Contents field? > > : and a maximum of 1K chars if its bigger. I initialized this String > : from the DB by using the String constructor where I pass in the > : charset/collation > : > : text = new String(textFromDB, "UTF-8"); > : > : So to the best of my knowledge, accessing a substring of a UTF-8 > : encoded string should not break up the UTF-8 code point. Is that an > > i assume you are correct, but you listed several steps of transformation > above, are you certian they all work correctly and produce valid UTF-8? > > this leads back to my suggestion before > > : > Can you put the orriginal (pre solr, pre solrj, raw untouched, etc...) > : > file that this solr doc came from online somewhere? > : > > : > What does your *indexing* code look like? ... Can you add some debuging to > : > the SolrJ client when you *add* this doc to print out exactly what those > : > 1000 characters are? > > > -Hoss >
Re: Responses getting truncated
I know in my last message I said I was having issues with "extra content" at the start of a response, resulting in an invalid document. I am also still having issues with documents getting truncated (yes, I have problems galore).

I will elaborate on why it's so difficult to track down an actual document which causes the failure - if I could find the document I could post it to the group. I will just document the steps:

1) I have a query which results in a bogus, truncated document. This query pulls back all fields. If I take that same query and remove the "text_t" field from the returned field list, then all is well. This indicates to me that it's a problem with the text_t field. This query uses the default of 10 returned rows.

2) So far so good. My next step is to find the document. So I take my original query and remove text_t from the field list to get my result set.

3) I run a new query that JUST selects one document, based on its doc ID (which I have from the first query). My thinking is that my "broken" document HAS to be in that set, so I can just select each document by ID and then validate the response.

This is where it breaks down: I know one or more broken documents is in my set, but if I iterate over each doc ID and pull it out individually, its response is valid. It's only broken when I pull it out in the first query. It's NOT broken when I pull it out by ID, even though I am also pulling out the same "broken" field.

If you can read Ruby, my script is here: http://brockwine.com/solr_fetch.txt

In the first net/http call, if I include the "text_t" field in the "fl" list then it breaks. If I remove it, get the doc IDs and then iterate over each one and fetch it back from Solr (including the supposedly broken field "text_t"), then it works just fine - the exception is never raised. But it is raised in the first call if I include it.

To me this makes absolutely no sense.

Thanks
-Rupert

On Fri, Aug 28, 2009 at 2:14 PM, Joe Calderon wrote: > I had a similar issue with text from past requests showing up, this was on > 1.3 nightly. I switched to using the Lucid build of 1.3 and the problem went > away. I'm using a nightly of 1.4 right now, also without problems. Then again > your mileage may vary, as I also made a bunch of schema changes that might > have had some effect; it wouldn't hurt to try though. > > > On 08/28/2009 02:04 PM, Rupert Fiasco wrote: >> >> Firstly, to everyone who has been helping me, thank you very much. All >> this feedback is helping me narrow down these issues. >> >> I deleted the index and re-indexed all the data from scratch and for a >> couple of days we were OK, but now it seems to be erring again. >> >> It happens on different input documents so what was broken before now >> works (documents that were having issues before are OK now, after a >> fresh re-index). >> >> An issue we are seeing now is that an XML response from Solr will >> contain the "tail" of an earlier response, for an example: >> >> http://brockwine.com/solr2.txt >> >> That is a response we are getting from Solr - using the web interface >> for Solr in Firefox, Firefox freaks out because it tries to parse >> that, and of course, its invalid XML, but I can retrieve that via >> curl. >> >> Anyone seeing this before? >> >> In regards to earlier questions: >> >> >>> >>> i assume you are correct, but you listed several steps of transformation >>> above, are you certian they all work correctly and produce valid UTF-8?
>>> >> >> Yes, I have looked at the source and contacted the author of the >> conversion library we are using and have verified that if UTF8 goes in >> then UTF8 will come out and UTF8 is definitely going in. >> >> I dont think sending over an actual input document would help because >> it seems to change. Plus, this latest issue appears to be more an >> issue of the last response buffer not clearing or something. >> >> Whats strange is that if I wait a few minutes and reload, then the >> buffer is cleared and I get back a valid response, its intermittent, >> but appears to be happening frequently. >> >> If it matters, we started using LucidGaze for Solr about 10 days ago, >> approximately when these issues started happening (but its hard to say >> if thats an issue because at this same time we switched from a PHP to >> Java indexing client). >> >> Thanks for your patience >> >> -Rupert >> >> On Tue, Aug 25, 2009 at 8:33 PM, Chris >> Hostetter wrote: >> >>> >>> : We are running an instance of MediaWiki so t
Re: Responses getting truncated
Yes, I am hitting the Solr server directly (medsolr1.colo:9007).

Versions / architectures:

Jetty(6.1.3)

o...@medsolr1 ~ $ uname -a
Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009 x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux

o...@medsolr1 ~ $ java -version
java version "1.6.0_11"
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)

I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try.

-Rupert

On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley wrote: > On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco wrote: >> If I run these through curl on the command its >> truncated and if I run the search through the web-based admin panel >> then I get an XML parse error. > > Are you running curl directly against the solr server, or going > through a load balancer? Cutting out the middle-men using curl was a > great idea - just make sure to go all the way. > > At first I thought it could possibly be a FastWriter bug (internal > Solr class), but that's only used on the TextWriter (JSON, Python, > Ruby) based formats, not on the original XML format. > > It really looks like you're hitting a lower-level IO buffering bug > (esp when you see a response starting off with the tail of another > response). That doesn't look like it could be a Solr bug... but > rather smells like a thread safety bug in the servlet container. > > What type of machine are you running on? What JVM? > You could try upgrading your version of Jetty, the JVM, or try > switching to Tomcat. > > -Yonik > http://www.lucidimagination.com > > >> This appears to have just started recently and the only thing we have >> done is change our indexer from a PHP one to a Java one, but >> functionally they are identical. >> >> Any thoughts? Thanks in advance. >> >> - Rupert >> >
Re: Responses getting truncated
I deployed LucidWorks with my existing solrconfig / schema and re-indexed my data into it and pushed it out to production, we'll see how it stacks up over the weekend. Already queries that were breaking on the prior Jetty/stock Solr setup are now working - but I have seen it before where upon an initial re-index things work OK then a couple of days later they break. Keep y'all posted. Thanks -Rupert On Fri, Aug 28, 2009 at 3:12 PM, Rupert Fiasco wrote: > Yes, I am hitting the Solr server directly (medsolr1.colo:9007) > > Versions / architectures: > > Jetty(6.1.3) > > o...@medsolr1 ~ $ uname -a > Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009 > x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux > > o...@medsolr1 ~ $ java -version > java version "1.6.0_11" > Java(TM) SE Runtime Environment (build 1.6.0_11-b03) > Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode) > > > I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try. > > -Rupert > > On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley wrote: >> On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco wrote: >>> If I run these through curl on the command its >>> truncated and if I run the search through the web-based admin panel >>> then I get an XML parse error. >> >> Are you running curl directly against the solr server, or going >> through a load balancer? Cutting out the middle-men using curl was a >> great idea - just make sure to go all the way. >> >> At first I thought it could possibly be a FastWriter bug (internal >> Solr class), but that's only used on the TextWriter (JSON, Python, >> Ruby) based formats, not on the original XML format. >> >> It really looks like you're hitting a lower-level IO buffering bug >> (esp when you see a response starting off with the tail of another >> response). That doesn't look like it could be a Solr bug... but >> rather smells like a thread safety bug in the servlet container. >> >> What type of machine are you running on? What JVM? >> You could try upgrading your version of Jetty, the JVM, or try >> switching to Tomcat. >> >> -Yonik >> http://www.lucidimagination.com >> >> >>> This appears to have just started recently and the only thing we have >>> done is change our indexer from a PHP one to a Java one, but >>> functionally they are identical. >>> >>> Any thoughts? Thanks in advance. >>> >>> - Rupert >>> >> >
Re: Responses getting truncated
So we have been running LucidWorks for Solr for about a week now and have seen no problems, so I believe it was due to that buffering issue in Jetty 6.1.3, as suggested here:

>>> It really looks like you're hitting a lower-level IO buffering bug
>>> (esp when you see a response starting off with the tail of another
>>> response). That doesn't look like it could be a Solr bug... but
>>> rather smells like a thread safety bug in the servlet container.

Thanks for everyone's help and input. LucidWorks For The Win.

-Rupert

On Fri, Aug 28, 2009 at 4:07 PM, Rupert Fiasco wrote: > I deployed LucidWorks with my existing solrconfig / schema and > re-indexed my data into it and pushed it out to production, we'll see > how it stacks up over the weekend. Already queries that were breaking > on the prior Jetty/stock Solr setup are now working - but I have seen > it before where upon an initial re-index things work OK then a couple > of days later they break. > > Keep y'all posted. > > Thanks > -Rupert > > On Fri, Aug 28, 2009 at 3:12 PM, Rupert Fiasco wrote: >> Yes, I am hitting the Solr server directly (medsolr1.colo:9007) >> >> Versions / architectures: >> >> Jetty(6.1.3) >> >> o...@medsolr1 ~ $ uname -a >> Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009 >> x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux >> >> o...@medsolr1 ~ $ java -version >> java version "1.6.0_11" >> Java(TM) SE Runtime Environment (build 1.6.0_11-b03) >> Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode) >> >> >> I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try. >> >> -Rupert >> >> On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley wrote: >>> On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco wrote: >>>> If I run these through curl on the command its >>>> truncated and if I run the search through the web-based admin panel >>>> then I get an XML parse error. >>> >>> Are you running curl directly against the solr server, or going >>> through a load balancer? Cutting out the middle-men using curl was a >>> great idea - just make sure to go all the way. >>> >>> At first I thought it could possibly be a FastWriter bug (internal >>> Solr class), but that's only used on the TextWriter (JSON, Python, >>> Ruby) based formats, not on the original XML format. >>> >>> It really looks like you're hitting a lower-level IO buffering bug >>> (esp when you see a response starting off with the tail of another >>> response). That doesn't look like it could be a Solr bug... but >>> rather smells like a thread safety bug in the servlet container. >>> >>> What type of machine are you running on? What JVM? >>> You could try upgrading your version of Jetty, the JVM, or try >>> switching to Tomcat. >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>>> This appears to have just started recently and the only thing we have >>>> done is change our indexer from a PHP one to a Java one, but >>>> functionally they are identical. >>>> >>>> Any thoughts? Thanks in advance. >>>> >>>> - Rupert >>>> >>> >> >
Specifying multiple documents in DataImportHandler dataConfig
I am using the DataImportHandler with a JDBC dataSource. From my understanding of DIH, for each of my "content types" - e.g. blog posts, Mesh categories, etc. - I would construct a series of document/entity sets: one document element per content type, each wrapping its own entity (the first for blog_entries, the second for mesh_categories).

Solr parses this just fine and allows me to issue a /dataimport?command=full-import and it runs, but it only runs against the "first" document (blog_entries). It doesn't run against the 2nd document (mesh_categories).

If I remove the 2 document elements and wrap both entity sets in just one document tag, then both sets get indexed, which seemingly achieves my goal (see the sketch after this message). This just doesn't make sense from my understanding of how DIH works. My 2 content types are indeed separate, so they logically represent two document types, not one.

Is this correct? What am I missing here?

Thanks
-Rupert
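For reference, the single-document layout that actually indexed both sets looks roughly like this. The connection details, queries and field mappings are placeholders, not our real ones:

    <dataConfig>
      <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/content" user="user" password="pass"/>
      <document>
        <!-- Both root entities live under the one document element. -->
        <entity name="blog_entries" query="SELECT id, title, body FROM blog_entries">
          <field column="id" name="id"/>
          <field column="title" name="title_t"/>
          <field column="body" name="text_t"/>
        </entity>
        <entity name="mesh_categories" query="SELECT id, name FROM mesh_categories">
          <field column="id" name="id"/>
          <field column="name" name="title_t"/>
        </entity>
      </document>
    </dataConfig>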
Re: Specifying multiple documents in DataImportHandler dataConfig
Maybe I should be more clear: I have multiple tables in my DB that I need to save to my Solr index. In my app code I have logic to persist each table (each maps to an application model) to Solr. This is fine. I am just trying to speed up indexing time by using DIH instead of going through my application.

From what I understand of DIH, I can specify one dataSource element and then a series of document/entity sets, one for each of my models. But like I said before, DIH only appears to want to index the first document declared under the dataSource tag.

-Rupert

On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco wrote: > I am using the DataImportHandler with a JDBC datasource. From my > understanding of DIH, for each of my "content types" e.g. Blog posts, > Mesh Categories, etc I would construct a series of document/entity > sets, like > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Solr parses this just fine and allows me to issue a > /dataimport?command=full-import and it runs, but it only runs against > the "first" document (blog_entries). It doesnt run against the 2nd > document (mesh_categories). > > If I remove the 2 document elements and wrap both entity sets in just > one document tag, then both sets get indexed, which seemingly achieves > my goal. This just doesnt make sense from my understanding of how DIH > works. My 2 content types are indeed separate so they logically > represent two document types, not one. > > Is this correct? What am I missing here? > > Thanks > -Rupert >
Understanding prefix query searching
So I tried to look on Google for an answer to this before I posted here. Basically I am trying to understand how prefix searching works.

I have a dynamic text field (indexed and stored), "full_name_t", and I have some data in my index, specifically a record with full_name_t = "Robert P Page".

A search on

  full_name_t:Robert

yields that document; however, a search on

  full_name_t:Robert*

yields nothing. Why?

To get around this I am doing something like (full_name_t:Robert OR full_name_t:Robert*), but I would like to understand why the wildcard doesn't work - shouldn't it match anything after the first characters of "Robert"?

Thanks
-Rupert
Spell checking not returning "full" terms
We are using Solr 1.3 and trying to get spell checking functionality.

FYI, our index contains a lot of medical terms (which might or might not make a difference, as they are not English-y words, if that makes any sense?).

If I specify a spellcheck query of "spellcheck.q=diabtes" I get suggestions of:

  diabet
  diabetogen
  dilat
  diamet
  diatom
  diastol
  diactin
  dialect

If I instead mis-spell diabetes as "q=diabets" then I get no suggestions.

So, first off, two things:

1) Why would leaving out one "e" rather than the other affect the spelling suggestions so substantially?
2) In the former list of suggestions, notice the first suggestion is "diabet", which isn't all that helpful; it should return something like "diabetes" or maybe even "diabetic".

Note that if I do a normal search against "diabetes" then I get a ton of results; in other words, our index is filled with the term "diabetes".

My relevant solrconfig is:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">text</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">text_t</str>
      <str name="spellcheckIndexDir">./spellchecker1</str>
      <str name="accuracy">0.1</str>
    </lst>
    <lst name="spellchecker">
      <str name="name">jarowinkler</str>
      <str name="field">text_t</str>
      <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
      <str name="spellcheckIndexDir">./spellchecker2</str>
      <str name="accuracy">0.1</str>
    </lst>
  </searchComponent>

and I have spellcheck.count = 8.

Notice that I bumped the "accuracy" setting way down to get more results. Bumping it up higher yields fewer results (I am not sure what the setting really means, so I don't know in what direction I want to change it - I am guessing that a lower value allows for more mis-spellings, i.e. it is more promiscuous).

Our "text" and "text_t" fields are defined in schema.xml as standard analyzed text fields (indexed, stored, multiValued).

Any help would be appreciated.

Thanks
-Rupert
Re: Spell checking not returning "full" terms
Awesome! After reading up on the links you sent me I got it all working. Thanks!

FYI - I had previously come across one of the links you sent over:

http://wiki.apache.org/solr/SpellCheckerRequestHandler

But what threw me off is that when I started reading it yesterday, the first paragraph says that the handler is deprecated and to use SpellCheckComponent - so at that point I stopped reading and went over to the component page. If I had kept reading I would have encountered all of the gritty details that I in fact needed to get this working. The wiki entry makes it seem old, deprecated and no longer relevant, but it certainly is still relevant.

-Rupert

On Wed, Feb 4, 2009 at 11:57 AM, Grant Ingersoll wrote: > I'm guessing the field you are checking against is being stemmed. The field > you spell check against should have minimal analysis done to it, i.e. > tokenization and probably downcasing. See > http://wiki.apache.org/solr/SpellCheckComponent and > http://wiki.apache.org/solr/SpellCheckerRequestHandler for tips on how to > handle analysis for spelling. > > On Feb 4, 2009, at 2:33 PM, Rupert Fiasco wrote: > >> We are using Solr 1.3 and trying to get spell checking functionality. >> >> FYI, our index contains a lot of medical terms (which might or might >> not make a difference as they are not English-y words, if that makes >> any sense?) >> >> If I specify a spellcheck query of "spellcheck.q=diabtes" >> >> I get suggestions of: >> >> diabet >> diabetogen >> dilat >> diamet >> diatom >> diastol >> diactin >> dialect >> >> If I re-mis-spell Diabetes to "q=diabets" then I go no suggestions. >> >> So first off two things: >> >> 1) Why would leaving out one "e" over the other affect the spelling >> suggestions so substantially? >> 2) In the former list of suggestions, notice the first suggestion is >> "diabet", which isnt all that helpful, it should return something like >> "diabetes" or maybe even "diabetic". >> >> Note that if I do a normal search against "diabetes" then I get a ton >> of results, in other words, our index is filled with terms of >> "diabetes". >> >> My relevant solrconfig is: >> >> >> text >> >> >> default >> text_t >> ./spellchecker1 >> 0.1 >> >> >> >> jarowinkler >> text_t >> >> > name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance >> ./spellchecker2 >> 0.1 >> >> >> >> and I have >> >> spellcheck.count = 8 >> >> Notice that I severely bumped down the "accuracy" setting to get more >> results. Bumping it up higher yields less results (not sure what >> setting really meant so I dont know in what direction I want to change >> that value - I am guessing that a lower value allows for more >> mis-spellings, e.g. its more promiscuous). >> >> Our "text" and "text_t" fields are defined in schema.xml as: >> >> > multiValued="true"/> >> and >> > stored="true" multiValued="true" /> >> >> Any help would be appreciated. >> >> Thanks >> -Rupert > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > > > > > > > > >
Issuing just a spell check query
The docs for the SpellCheckComponent say:

"The SpellCheckComponent is designed to provide inline spell checking of queries without having to issue separate requests."

I would like to issue just a spell check query; I don't care about it being inline and piggy-backing off a normal search query. How would I achieve this?

I tried monkeying with making a new requestHandler, but using class="solr.SearchHandler" always tries to do a normal search. I succeeded in adding inline spell checking to the default request handler by *adding* the spellcheck component to its requestHandler config - I would like to *remove* the default search component, maybe by making a new request handler which just does spell checking. Is something like this possible - a handler whose component list is just "spellcheck", with defaults like rows=5 and the "default" dictionary?

Now, I can sort of achieve what I want by running a normal search but using a dummy value for my "q" parameter (for me "00" works); then I get no search docs back, but I do get the spell suggestions I want, driven by the "spellcheck.q" parameter (a sketch of that workaround is below). But this seems very hacky, and Solr still has to run a search against my dummy value.

A roundabout way of asking: how can I fire off *just* a spell check query?

Thanks in advance
-Rupert
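For reference, the dummy-query workaround from SolrJ looks roughly like the sketch below. The Solr URL is a placeholder, and rows=0 is my own guess at avoiding fetching any documents at all - it is not something I pulled from the docs:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.util.NamedList;

    public class SpellOnly {
        public static void main(String[] args) throws Exception {
            // Placeholder Solr URL.
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("00");   // dummy q value, as described above
            query.setRows(0);                        // assumption: skip fetching any docs
            query.set("spellcheck", "true");
            query.set("spellcheck.q", "diabtes");
            query.set("spellcheck.count", "8");
            QueryResponse rsp = server.query(query);
            // Pull the raw spellcheck section out of the response rather than
            // relying on any typed accessors.
            NamedList spellcheck = (NamedList) rsp.getResponse().get("spellcheck");
            System.out.println(spellcheck);
        }
    }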
Re: Issuing just a spell check query
But its deprecated (??) -Rupert On Fri, Feb 6, 2009 at 11:51 AM, Otis Gospodnetic wrote: > Rupert, > > You could use the SpellCheck*Handler* to achieve this. > > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > > ________ > From: Rupert Fiasco > To: solr-user@lucene.apache.org > Sent: Friday, February 6, 2009 2:47:19 PM > Subject: Issuing just a spell check query > > The docs for the SpellCheckComponent say > > "The SpellCheckComponent is designed to provide inline spell checking > of queries without having to issue separate requests." > > I would like to issue just a spell check query, I dont care about it > being inline and piggy-backing off a normal search query. > > How would I achieve this? > > I tried monkeying with making a new requestHandler but using class = > "solr.SearchHandler" always tries to do a normal search. > > I succeeded in adding inline spell checking to the default request > handler by *adding* > > > spellcheck > > > to its requestHandler config - I would like to *remove* the default > search component - maybe by making a new request handler which just > does spell checking? > > Is something like this possible? > > > > > > >5 > > > > spellcheck > > > > > default > > > > > > > > > Now, I can sort of achieve what I want by in fact a normal search but > then using a dummy value for my "q" parameter (for me "00" > works) and then I get no search docs back, but I do get the spell > suggestions I want, driven by the "spellcheck.q" parameter. > > But this seems very hacky and Solr is still having to run a search > against my dummy value. > > A roundabout way of asking: how can I fire off *just* a spell check query? > > Thanks in advance > -Rupert >
search returns matches for non-starting wildcard prefix queries
(I think I have a horrible subject line, but I wasn't sure how to properly explain myself.)

I have a text field that I store last names in (and everything is lowercased prior to insertion, not sure if that matters). The field is a regular analyzed text field.

When running a query such as

  last_name:m*

I get data back like:

  Pashman, Md
  Maldonado
  Manolidis
  Fleisher, M.D., D.Ht., D.A.B.F.M.
  Merino
  Monroe
  McLay
  Maltsberger
  McMurtray
  Murphy Md
  Loeb Md

As you can see most are perfect matches, but there are some that *don't* start with the letter "M" but do have "M" at the beginning of another "word" in the field.

Wouldn't the query "m*" just match records where the first letter of the whole field is "M", and not an "M" inside another "word" in that field?

Do I need to make another field to store last names and not perform any analysis on that field (akin to a spell check field)? (See the sketch below.)

Thanks in advance.
-Rupert
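If a separate unanalyzed copy is the right answer, I am guessing at something like the following in schema.xml - a plain string field populated by copyField, so a prefix query only ever sees the whole (already lowercased) name as a single term. The field names are just placeholders:

    <!-- Untokenized copy of the last name; "string" is the stock solr.StrField type. -->
    <field name="last_name_exact" type="string" indexed="true" stored="false"/>
    <copyField source="last_name" dest="last_name_exact"/>

With that in place, a query like last_name_exact:m* would presumably only match documents whose whole value starts with "m".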
Indexing issue with XML control characters
During indexing I will often get this error:

SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 3)) at [row,col {unknown-source}]: [2,1]
    at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)

By looking at this list and elsewhere I know that I need to filter out most control characters, so I have been stripping anything matching this regex (the Java version is sketched at the end of this message):

  /[\x00-\x08\x0B\x0C\x0E-\x1F]/

But I still get the error. What is strange is that if I re-run my indexing process after a failure, it will work on the previously failed node and then error out on another node some time later. That is, it is not deterministic.

If I look at the text that is being indexed, it is about as pure as you can get (a bunch of medical keywords like "leg bones" and "nose").

Any ideas would be greatly appreciated. The platform is:

Solr implementation version: 1.3.0 694707
Lucene implementation version: 2.4-dev 691741
Mac OS X 10.5.7
JVM 1.5.0_19-b02-304

Thanks
/Rupert
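The stripping currently happens roughly like this (just a sketch of the approach - the Pattern is the same character class as the regex above, applied to every field value before it goes into the SolrInputDocument):

    import java.util.regex.Pattern;

    public class ControlChars {
        // Same class as the regex above: C0 control chars except tab, LF and CR.
        private static final Pattern CONTROL_CHARS =
                Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

        public static String strip(String in) {
            return CONTROL_CHARS.matcher(in).replaceAll("");
        }

        public static void main(String[] args) {
            String dirty = "leg bones\u0003 and nose";
            System.out.println(strip(dirty)); // prints "leg bones and nose"
        }
    }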