I know in my last message I said I was having issues with "extra content" at the start of a response, resulting in an invalid document. I am still having issues with documents getting truncated as well (yes, I have problems galore).

I will elaborate on why it is so difficult to track down an actual document that causes the invalid / truncated response (if I could find the document, I could post it to the group). The steps are:

1) I have a query that results in a bogus, truncated document. This query pulls back all fields. If I take that same query and remove the "text_t" field from the returned field list, then all is well. That indicates to me that the problem is with the text_t field. This query uses the default of 10 returned rows.

2) So far so good. My next step is to find the broken document, so I take my original query, remove text_t from the field list, and get my result set of doc IDs.

3) I then run a new query that selects just one document at a time by its doc ID (which I have from the first query). My thinking is that my "broken" document HAS to be in that set, so I can just select it by ID and then validate the response. This is where it breaks down: I know one or more broken documents is in my set, but if I iterate over each doc ID and pull the document out individually, its response is valid. It is only broken when I pull it out in the first query. It is NOT broken when I pull it out by ID, even though I am also pulling out the same "broken" field.

If you can read Ruby, my script is here: http://brockwine.com/solr_fetch.txt

In the first net/http call, if I include the "text_t" field in the "fl" list, then it breaks.
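In case that link goes stale, here is roughly what the first call boils down to. This is only a sketch, not the actual script: the host/port/core path, the query string, and the uniqueKey field name ("id" below) are placeholders for my setup.

    require 'net/http'
    require 'uri'
    require 'cgi'
    require 'rexml/document'

    # Placeholders -- the real values are in solr_fetch.txt
    solr  = 'http://localhost:8983/solr/select'
    query = 'the query that reproduces the problem'

    # First call: default 10 rows, WITH text_t in the field list.
    url  = "#{solr}?q=#{CGI.escape(query)}&fl=id,text_t"
    body = Net::HTTP.get_response(URI.parse(url)).body

    begin
      REXML::Document.new(body)   # this is what raises when text_t is included
    rescue REXML::ParseException => e
      puts "bulk response is not valid XML: #{e.message}"
    end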
If instead I remove text_t from the "fl" list, get the doc IDs back, and then fetch each document from Solr one at a time (including the supposedly broken field "text_t"), it works just fine: the exception is never raised. It is only raised in the first call, when text_t is in the field list.
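And the second pattern, again only a sketch with the same placeholders: pull back just the doc IDs, then re-fetch each document individually with text_t included.

    require 'net/http'
    require 'uri'
    require 'cgi'
    require 'rexml/document'

    solr  = 'http://localhost:8983/solr/select'        # placeholder
    query = 'the query that reproduces the problem'    # placeholder

    # Get just the ids (no text_t), assuming the uniqueKey is a string
    # field called "id" in the response.
    ids_body = Net::HTTP.get_response(
      URI.parse("#{solr}?q=#{CGI.escape(query)}&fl=id")).body
    ids = REXML::XPath.match(REXML::Document.new(ids_body),
                             '//doc/str[@name="id"]').map { |el| el.text }

    # Re-fetch each doc by id, this time including text_t.
    ids.each do |id|
      doc_body = Net::HTTP.get_response(
        URI.parse("#{solr}?q=#{CGI.escape("id:#{id}")}&fl=id,text_t")).body
      begin
        REXML::Document.new(doc_body)   # valid every single time
      rescue REXML::ParseException => e
        puts "doc #{id} came back broken: #{e.message}"
      end
    end

Both patterns hit the same Solr instance and ask for the same stored field; the only real difference is whether the documents come back ten at a time or one at a time.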
To me this makes absolutely no sense.

Thanks

-Rupert

On Fri, Aug 28, 2009 at 2:14 PM, Joe Calderon <calderon....@gmail.com> wrote:
> i had a similar issue with text from past requests showing up, this was on
> 1.3 nightly. i switched to using the lucid build of 1.3 and the problem went
> away. i'm using a nightly of 1.4 right now, also without problems. then again,
> your mileage may vary, as i also made a bunch of schema changes that might
> have had some effect. it wouldn't hurt to try, though.
>
>
> On 08/28/2009 02:04 PM, Rupert Fiasco wrote:
>>
>> Firstly, to everyone who has been helping me, thank you very much. All
>> this feedback is helping me narrow down these issues.
>>
>> I deleted the index and re-indexed all the data from scratch, and for a
>> couple of days we were OK, but now it seems to be erring again.
>>
>> It happens on different input documents, so what was broken before now
>> works (documents that were having issues before are OK now, after a
>> fresh re-index).
>>
>> An issue we are seeing now is that an XML response from Solr will
>> contain the "tail" of an earlier response, for example:
>>
>> http://brockwine.com/solr2.txt
>>
>> That is a response we are getting from Solr. Using the web interface
>> for Solr in Firefox, Firefox freaks out because it tries to parse
>> that and, of course, it's invalid XML, but I can retrieve it via
>> curl.
>>
>> Has anyone seen this before?
>>
>> In regards to earlier questions:
>>
>>> i assume you are correct, but you listed several steps of transformation
>>> above, are you certain they all work correctly and produce valid UTF-8?
>>
>> Yes, I have looked at the source and contacted the author of the
>> conversion library we are using and have verified that if UTF-8 goes in
>> then UTF-8 will come out, and UTF-8 is definitely going in.
>>
>> I don't think sending over an actual input document would help because
>> it seems to change. Plus, this latest issue appears to be more an
>> issue of the last response buffer not clearing, or something.
>>
>> What's strange is that if I wait a few minutes and reload, then the
>> buffer is cleared and I get back a valid response. It's intermittent,
>> but appears to be happening frequently.
>>
>> If it matters, we started using LucidGaze for Solr about 10 days ago,
>> approximately when these issues started happening (but it's hard to say
>> if that's an issue, because at this same time we switched from a PHP to
>> a Java indexing client).
>>
>> Thanks for your patience
>>
>> -Rupert
>>
>> On Tue, Aug 25, 2009 at 8:33 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>>>
>>> : We are running an instance of MediaWiki so the text goes through a
>>> : couple of transformations: wiki markup -> html -> plain text.
>>> : Its at this last step that I take a "snippet" and insert that into Solr.
>>> ...
>>> : doc.addField("text_snippet_t", article.getSnippet(1000));
>>>
>>> ok, well first off: that's not the field where you are having problems,
>>> is it? if i remember correctly from your previous posts, wasn't the
>>> response getting aborted in the middle of the Contents field?
>>>
>>> : and a maximum of 1K chars if its bigger. I initialized this String
>>> : from the DB by using the String constructor where I pass in the
>>> : charset/collation
>>> :
>>> : text = new String(textFromDB, "UTF-8");
>>> :
>>> : So to the best of my knowledge, accessing a substring of a UTF-8
>>> : encoded string should not break up the UTF-8 code point. Is that an
>>>
>>> i assume you are correct, but you listed several steps of transformation
>>> above, are you certain they all work correctly and produce valid UTF-8?
>>>
>>> this leads back to my suggestion before....
>>>
>>> :> Can you put the original (pre solr, pre solrj, raw untouched,
>>> :> etc...) file that this solr doc came from online somewhere?
>>> :>
>>> :> What does your *indexing* code look like? ... Can you add some
>>> :> debugging to the SolrJ client when you *add* this doc to print out
>>> :> exactly what those 1000 characters are?
>>>
>>> -Hoss