I know in my last message I said I was having issues with "extra
content" at the start of a response, resulting in an invalid document.
I still am having issues with documents getting truncated (yes, I have
problems galore).

I will elaborate on why it's so difficult to track down an actual
document that is causing the failure, i.e. producing an invalid /
truncated response (if I could find the document I could post it to
the group).

I will just document the steps:

1) I have a query which results in a bogus, truncated response. This
query pulls back all fields. If I take that same query and remove the
"text_t" field from the returned field list, then all is well. This
indicates to me that it's a problem with the text_t field. This query
uses the default of 10 returned rows.

2) So far so good. My next step is to find the document. So I take my
original query and remove text_t from the field list to get my result
set.

3) I run a new query that JUST selects that document based on its Doc
ID (which I have from the first query). My thinking is that my
"broken" document HAS to be in that set, so I can just select it by ID
and then validate the response. (The shape of these queries is
sketched just below.)
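
To make that concrete, the queries look roughly like this (a sketch
only; the host, core path, query string and id field name below are
placeholders, not my real setup):

  require 'cgi'

  solr = 'http://localhost:8983/solr/select'      # placeholder host/core
  q    = CGI.escape('my real query goes here')    # placeholder query string

  # Step 1: the same query, with and without the suspect field in fl
  all_fields = "#{solr}?q=#{q}&rows=10"           # response comes back truncated/invalid
  no_text_t  = "#{solr}?q=#{q}&rows=10&fl=id"     # fine once text_t is left out of fl

  # Step 3: select a single document by its Doc ID (assuming the unique key field is "id")
  by_id = "#{solr}?q=id:#{CGI.escape('SOME_DOC_ID')}"   # valid, even though text_t is returned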

This is where it breaks down: I know one or more broken documents are
in my set, but if I iterate over each doc id and pull each one out
individually, the response is valid. It's only broken when I pull it
out with the first query. It's NOT broken when I pull it out by ID,
even though I am also pulling out the same "broken" field.

If you can read Ruby, my script is here:

http://brockwine.com/solr_fetch.txt

In the first net/http call, if I include the "text_t" field in the
"fl" list then it breaks. If I remove it, get the doc ids, and then
iterate over each one and fetch it back from Solr (including the
supposedly broken field "text_t"), then it works just fine - the
exception is never raised. But it is raised in the first call if I
include text_t.
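
In other words, the script boils down to roughly this (a sketch, not
the actual script; the host, query string, id field name and XPath are
assumptions on my part, and REXML here just stands in for the
validation step that raises the exception):

  require 'net/http'
  require 'uri'
  require 'cgi'
  require 'rexml/document'

  solr = 'http://localhost:8983/solr/select'      # placeholder host/core
  q    = CGI.escape('my real query goes here')    # placeholder query string

  # First call: with text_t in the fl list this body fails to parse;
  # with text_t removed, it parses fine.
  body = Net::HTTP.get_response(URI("#{solr}?q=#{q}&fl=id")).body
  ids  = REXML::Document.new(body).elements.to_a('//doc/str[@name="id"]').map { |e| e.text }

  # Second pass: fetch each doc individually, text_t included; every
  # one of these parses cleanly, so the exception is never raised here.
  ids.each do |id|
    single = Net::HTTP.get_response(URI("#{solr}?q=id:#{CGI.escape(id)}")).body
    REXML::Document.new(single)   # raises REXML::ParseException if truncated/invalid
  end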

To me this makes absolutely no sense.

Thanks
-Rupert

On Fri, Aug 28, 2009 at 2:14 PM, Joe Calderon<calderon....@gmail.com> wrote:
> I had a similar issue with text from past requests showing up; this was on
> a 1.3 nightly. I switched to using the Lucid build of 1.3 and the problem went
> away. I'm using a nightly of 1.4 right now, also without problems. Then again,
> your mileage may vary, as I also made a bunch of schema changes that might
> have had some effect. It wouldn't hurt to try, though.
>
>
> On 08/28/2009 02:04 PM, Rupert Fiasco wrote:
>>
>> Firstly, to everyone who has been helping me, thank you very much. All
>> this feedback is helping me narrow down these issues.
>>
>> I deleted the index and re-indexed all the data from scratch, and for a
>> couple of days we were OK, but now it seems to be failing again.
>>
>> It happens on different input documents, so what was broken before now
>> works (documents that were having issues before are OK now, after a
>> fresh re-index).
>>
>> An issue we are seeing now is that an XML response from Solr will
>> contain the "tail" of an earlier response; for example:
>>
>> http://brockwine.com/solr2.txt
>>
>> That is a response we are getting from Solr. Using the Solr web
>> interface in Firefox, Firefox freaks out because it tries to parse
>> that and, of course, it's invalid XML, but I can retrieve it via
>> curl.
>>
>> Has anyone seen this before?
>>
>> In regards to earlier questions:
>>
>>
>>>
>>> I assume you are correct, but you listed several steps of transformation
>>> above; are you certain they all work correctly and produce valid UTF-8?
>>>
>>
>> Yes, I have looked at the source and contacted the author of the
>> conversion library we are using, and have verified that if UTF-8 goes in
>> then UTF-8 will come out, and UTF-8 is definitely going in.
>>
>> I don't think sending over an actual input document would help because
>> it seems to change. Plus, this latest issue appears to be more of an
>> issue of the last response buffer not clearing, or something.
>>
>> What's strange is that if I wait a few minutes and reload, then the
>> buffer is cleared and I get back a valid response. It's intermittent,
>> but appears to be happening frequently.
>>
>> If it matters, we started using LucidGaze for Solr about 10 days ago,
>> approximately when these issues started happening (but it's hard to say
>> whether that's the issue, because at the same time we switched from a PHP
>> to a Java indexing client).
>>
>> Thanks for your patience
>>
>> -Rupert
>>
>> On Tue, Aug 25, 2009 at 8:33 PM, Chris
>> Hostetter<hossman_luc...@fucit.org>  wrote:
>>
>>>
>>> : We are running an instance of MediaWiki so the text goes through a
>>> : couple of transformations: wiki markup ->  html ->  plain text.
>>> : It's at this last step that I take a "snippet" and insert that into
>>> : Solr.
>>>        ...
>>> : doc.addField("text_snippet_t", article.getSnippet(1000));
>>>
>>> OK, well first off: that's not the field where you are having problems,
>>> is it?  If I remember correctly from your previous posts, wasn't the
>>> response getting aborted in the middle of the Contents field?
>>>
>>> : and a maximum of 1K chars if it's bigger. I initialized this String
>>> : from the DB by using the String constructor where I pass in the
>>> : charset/collation
>>> :
>>> : text = new String(textFromDB, "UTF-8");
>>> :
>>> : So to the best of my knowledge, accessing a substring of a UTF-8
>>> : encoded string should not break up the UTF-8 code point. Is that an
>>>
>>> I assume you are correct, but you listed several steps of transformation
>>> above; are you certain they all work correctly and produce valid UTF-8?
>>>
>>> This leads back to my suggestion before....
>>>
>>> :>  Can you put the original (pre Solr, pre SolrJ, raw untouched,
>>> :>  etc...) file that this Solr doc came from online somewhere?
>>> :>
>>> :>  What does your *indexing* code look like? ... Can you add some
>>> :>  debugging to the SolrJ client when you *add* this doc to print out
>>> :>  exactly what those 1000 characters are?
>>>
>>>
>>> -Hoss
>>>
>>>
>
>
