Question mark glyphs in indexed content
Hello,

I am using the latest SolrJ to index content. When I look at that content in the Solr Admin web utility I see weird characters like this: http://brockwine.com/images/solrglyphs.png

When I look at the text in the MySQL DB those characters appear to be plain hyphens. The MySQL table character set is utf8 and the collation is utf8.

Environment:

OS X 10.5.8
java version "1.5.0_19"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02-304)
Java HotSpot(TM) Client VM (build 1.5.0_19-137, mixed mode, sharing)
Solr Specification Version: 1.3.0
Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47
Lucene Specification Version: 2.4-dev
Lucene Implementation Version: 2.4-dev 691741 - 2008-09-03 15:25:16
Jetty 6.1.3

Any thoughts?

Thanks
/Rupert
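P.S. In case it matters, the indexer reads the text over JDBC, roughly like the sketch below. Host, credentials, table and column names are placeholders, not our real ones, and I am assuming Connector/J's useUnicode/characterEncoding URL parameters are the right way to force UTF-8 on the driver side. The loop just dumps the code point of every non-ASCII character so a real dash (e.g. an en dash, U+2013) can be told apart from an already-mangled byte:

    import java.sql.*;

    public class DumpNonAscii {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            // Ask the driver to decode the utf8 column as UTF-8 instead of the
            // platform default charset (placeholder connection details).
            String url = "jdbc:mysql://localhost/content"
                    + "?useUnicode=true&characterEncoding=UTF-8";
            Connection conn = DriverManager.getConnection(url, "user", "pass");
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SELECT body FROM articles LIMIT 10");
            while (rs.next()) {
                String text = rs.getString(1);
                for (int i = 0; i < text.length(); i++) {
                    char c = text.charAt(i);
                    if (c > 127) {
                        // A real en dash prints as U+2013; a plain '?' here means the
                        // character was already lost before it reached Java.
                        System.out.printf("U+%04X %c%n", (int) c, c);
                    }
                }
            }
            conn.close();
        }
    }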
Responses getting truncated
I am seeing our responses getting truncated if and only if I search on our main text field.

E.g. if I just do a basic query like

  title_t:arthritis

then I get a valid document back. But if I add in our larger text field:

  title_t:arthritis OR text_t:arthritis

then the resulting document is NOT valid XML (if using wt=xml) or Ruby (using wt=ruby). If I run these through curl on the command line the output is truncated, and if I run the search through the web-based admin panel then I get an XML parse error.

This appears to have just started recently and the only thing we have done is change our indexer from a PHP one to a Java one, but functionally they are identical.

Any thoughts? Thanks in advance.

- Rupert
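P.S. In case anyone wants to reproduce the check, this is roughly what I run to see whether a response is well-formed XML. The URL is a placeholder for our host/port and the default /select handler; a truncated body shows up as a premature end-of-document error from the parser:

    import java.net.URL;
    import java.net.URLEncoder;
    import javax.xml.parsers.DocumentBuilderFactory;

    public class CheckResponse {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; wt=xml so the body can be fed straight to an XML parser.
            String url = "http://localhost:8983/solr/select?wt=xml&rows=10&q="
                    + URLEncoder.encode("title_t:arthritis OR text_t:arthritis", "UTF-8");
            try {
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(new URL(url).openStream());
                System.out.println("well-formed XML");
            } catch (org.xml.sax.SAXException e) {
                System.out.println("NOT well-formed: " + e.getMessage());
            }
        }
    }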
Re: Responses getting truncated
Using wt=json also yields an invalid document. So after more investigation it appears that I can always "break" the response by pulling back a specific field via the "fl" parameter. If I leave that field off, the response is valid; if I include it, Solr yields an invalid - truncated - document. This happens in any response format (xml, json, ruby).

I am using the SolrJ client to add documents to my index. The field is a normal "text" field type and the text itself is the first 1000 characters of an article.

> It can very well be an issue with the data itself. For example, if the data
> contains un-escaped characters which invalidates the response

When I look at the document using wt=xml, all XML entities are escaped. When I look at it under wt=ruby, all single quotes are escaped, same for json, so it appears that all escaping is taking place. The core problem seems to be that the document is just truncated - it just plain ends. Jetty's log says it's sending back an HTTP 200, so as far as it is concerned all is well.

Any ideas on how I can dig deeper?

Thanks
-Rupert

On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness wrote: > It can very well be an issue with the data itself. For example, if the data > contains un-escaped characters which invalidates the response. I don't know > much about ruby, but what do you get with wt=json? > > Rupert Fiasco wrote: >> >> I am seeing our responses getting truncated if and only if I search on >> our main text field. >> >> E.g. I just do some basic like >> >> title_t:arthritis >> >> Then I get a valid document back. But if I add in our larger text field: >> >> title_t:arthritis OR text_t:arthritis >> >> then the resultant document is NOT valid XML (if using wt=xml) or Ruby >> (using wt=ruby). If I run these through curl on the command its >> truncated and if I run the search through the web-based admin panel >> then I get an XML parse error. >> >> This appears to have just started recently and the only thing we have >> done is change our indexer from a PHP one to a Java one, but >> functionally they are identical. >> >> Any thoughts? Thanks in advance. >> >> - Rupert >> >> >
Re: Responses getting truncated
The text file at http://brockwine.com/solr.txt represents one of these truncated responses (this one in XML). It starts out great, then look at the bottom - boom, game over. :)

I found this document by first running our bigger search, which breaks, and then zeroing in on a specific broken document using the rows/start parameters (roughly the paging check sketched at the end of this message). But there are an unknown number of these "broken" documents - a lot, I presume.

-Rupert

On Tue, Aug 25, 2009 at 9:40 AM, Avlesh Singh wrote: > Can you copy-paste the source data indexed in this field which causes the > error? > > Cheers > Avlesh > > On Tue, Aug 25, 2009 at 10:01 PM, Rupert Fiasco wrote: > >> Using wt=json also yields an invalid document. So after more >> investigation it appears that I can always "break" the response by >> pulling back a specific field via the "fl" parameter. If I leave off a >> field then the response is valid, if I include it then Solr yields an >> invalid document - a truncated document. This happens in any response >> format (xml, json, ruby). >> >> I am using the SolrJ client to add documents to in my index. My field >> is a normal "text" field type and the text itself is the first 1000 >> characters of an article. >> >> > It can very well be an issue with the data itself. For example, if the >> data >> > contains un-escaped characters which invalidates the response >> >> When I look at the document in using wt=xml then all XML entities are >> escaped. When I look at it under wt=ruby then all single quotes are >> escaped, same for json, so it appears that all escaping it taking >> place. The core problem seems to be that the document is just >> truncated - it just plain end of files. Jetty's log says its sending >> back an HTTP 200 so all is well. >> >> Any ideas on how I can dig deeper? >> >> Thanks >> -Rupert >> >> >> On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness wrote: >> > It can very well be an issue with the data itself. For example, if the >> data >> > contains un-escaped characters which invalidates the response. I don't >> know >> > much about ruby, but what do you get with wt=json? >> > >> > Rupert Fiasco wrote: >> >> >> >> I am seeing our responses getting truncated if and only if I search on >> >> our main text field. >> >> >> >> E.g. I just do some basic like >> >> >> >> title_t:arthritis >> >> >> >> Then I get a valid document back. But if I add in our larger text field: >> >> >> >> title_t:arthritis OR text_t:arthritis >> >> >> >> then the resultant document is NOT valid XML (if using wt=xml) or Ruby >> >> (using wt=ruby). If I run these through curl on the command its >> >> truncated and if I run the search through the web-based admin panel >> >> then I get an XML parse error. >> >> >> >> This appears to have just started recently and the only thing we have >> >> done is change our indexer from a PHP one to a Java one, but >> >> functionally they are identical. >> >> >> >> Any thoughts? Thanks in advance. >> >> >> >> - Rupert >> >> >> >> >> > >> >
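P.S. The paging check mentioned above is essentially the loop below - it walks the result set one row at a time and reports which offsets fail to parse. Host, query and the upper bound are the same sort of placeholders as before:

    import java.net.URL;
    import java.net.URLEncoder;
    import javax.xml.parsers.DocumentBuilderFactory;

    public class FindBrokenDocs {
        public static void main(String[] args) throws Exception {
            String q = URLEncoder.encode("title_t:arthritis OR text_t:arthritis", "UTF-8");
            for (int start = 0; start < 100; start++) {
                // rows=1 so each request returns a single document.
                String url = "http://localhost:8983/solr/select?wt=xml&rows=1&start="
                        + start + "&q=" + q;
                try {
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .parse(new URL(url).openStream());
                } catch (org.xml.sax.SAXException e) {
                    System.out.println("start=" + start + " is broken: " + e.getMessage());
                }
            }
        }
    }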
Re: Responses getting truncated
So I whipped up a quick SolrJ client and ran it against the document that I referenced earlier. When I retrieve the doc and just print its field/value pairs to stdout it ends like this: http://brockwine.com/images/output1.png It appears to be some kind of garbage characters. -Rupert On Tue, Aug 25, 2009 at 12:19 PM, Uri Boness wrote: > Hi, > > This is a very strange behavior and the fact that it is cause by one > specific field, again, leads me to believe it's still a data issue. Did you > try using SolrJ to query the data as well? If the same thing happens when > using the binary protocol, then it's probably not a data issue. On the other > hand, if it works fine, then at least you can inspect the data to see where > things go wrong. Sorry for insisting on that, but I cannot think of anything > else that can cause this problem. > > If anyone else have a better idea, I'm actually very curious to hear about > it. > > Uri > > Rupert Fiasco wrote: >> >> The text file at: >> >> http://brockwine.com/solr.txt >> >> Represents one of these truncated responses (this one in XML). It >> starts out great, then look at the bottom, boom, game over. :) >> >> I found this document by first running our bigger search which breaks >> and then zeroing in a specific broken document by using the rows/start >> parameters. But there are any unknown number of these "broken" >> documents - a lot I presume. >> >> -Rupert >> >> On Tue, Aug 25, 2009 at 9:40 AM, Avlesh Singh wrote: >> >>> >>> Can you copy-paste the source data indexed in this field which causes the >>> error? >>> >>> Cheers >>> Avlesh >>> >>> On Tue, Aug 25, 2009 at 10:01 PM, Rupert Fiasco >>> wrote: >>> >>> >>>> >>>> Using wt=json also yields an invalid document. So after more >>>> investigation it appears that I can always "break" the response by >>>> pulling back a specific field via the "fl" parameter. If I leave off a >>>> field then the response is valid, if I include it then Solr yields an >>>> invalid document - a truncated document. This happens in any response >>>> format (xml, json, ruby). >>>> >>>> I am using the SolrJ client to add documents to in my index. My field >>>> is a normal "text" field type and the text itself is the first 1000 >>>> characters of an article. >>>> >>>> >>>>> >>>>> It can very well be an issue with the data itself. For example, if the >>>>> >>>> >>>> data >>>> >>>>> >>>>> contains un-escaped characters which invalidates the response >>>>> >>>> >>>> When I look at the document in using wt=xml then all XML entities are >>>> escaped. When I look at it under wt=ruby then all single quotes are >>>> escaped, same for json, so it appears that all escaping it taking >>>> place. The core problem seems to be that the document is just >>>> truncated - it just plain end of files. Jetty's log says its sending >>>> back an HTTP 200 so all is well. >>>> >>>> Any ideas on how I can dig deeper? >>>> >>>> Thanks >>>> -Rupert >>>> >>>> >>>> On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness wrote: >>>> >>>>> >>>>> It can very well be an issue with the data itself. For example, if the >>>>> >>>> >>>> data >>>> >>>>> >>>>> contains un-escaped characters which invalidates the response. I don't >>>>> >>>> >>>> know >>>> >>>>> >>>>> much about ruby, but what do you get with wt=json? >>>>> >>>>> Rupert Fiasco wrote: >>>>> >>>>>> >>>>>> I am seeing our responses getting truncated if and only if I search on >>>>>> our main text field. >>>>>> >>>>>> E.g. 
I just do some basic like >>>>>> >>>>>> title_t:arthritis >>>>>> >>>>>> Then I get a valid document back. But if I add in our larger text >>>>>> field: >>>>>> >>>>>> title_t:arthritis OR text_t:arthritis >>>>>> >>>>>> then the resultant document is NOT valid XML (if using wt=xml) or Ruby >>>>>> (using wt=ruby). If I run these through curl on the command its >>>>>> truncated and if I run the search through the web-based admin panel >>>>>> then I get an XML parse error. >>>>>> >>>>>> This appears to have just started recently and the only thing we have >>>>>> done is change our indexer from a PHP one to a Java one, but >>>>>> functionally they are identical. >>>>>> >>>>>> Any thoughts? Thanks in advance. >>>>>> >>>>>> - Rupert >>>>>> >>>>>> >>>>>> >> >> >
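For reference, the "quick SolrJ client" mentioned at the top of this message is essentially the sketch below. The Solr URL and document id are placeholders, and I am assuming the plain CommonsHttpSolrServer client here; it just dumps every stored field so the tail of the big text field is visible:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class PrintDoc {
        public static void main(String[] args) throws Exception {
            // Placeholder Solr URL and document id.
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("id:12345");
            query.setRows(1);
            QueryResponse rsp = server.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                for (String field : doc.getFieldNames()) {
                    System.out.println(field + " => " + doc.getFieldValue(field));
                }
            }
        }
    }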
Re: Responses getting truncated
> 1. Exactly which version of Solr / SolrJ are you using?

Solr Specification Version: 1.3.0
Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47

Plus the latest SolrJ, which I downloaded a couple of days ago.

> Can you put the original (pre solr, pre solrj, raw untouched, etc...)
> file that this solr doc came from online somewhere?

We are running an instance of MediaWiki, so the text goes through a couple of transformations: wiki markup -> html -> plain text. It's at this last step that I take a "snippet" and insert that into Solr. My snippet code is:

    // article.java
    public String getSnippet(int maxlen) {
        int length = getPlainText().length() >= maxlen ? maxlen : getPlainText().length();
        return getPlainText().substring(0, length);
    }

    // ... later on, when adding to Solr
    doc.addField("text_snippet_t", article.getSnippet(1000));

So in theory I am getting the whole article if it's less than 1K chars, and a maximum of 1K chars if it's bigger. I initialized this String from the DB by using the String constructor where I pass in the charset:

    text = new String(textFromDB, "UTF-8");

So to the best of my knowledge, taking a substring of a String that was decoded from UTF-8 should not break up a code point. Is that an incorrect assumption? If so, what is the best way to break up such a string and get approximately that many characters (see the sketch at the end of this message)? Exactness is not a requirement.

-Rupert

On Tue, Aug 25, 2009 at 5:37 PM, Chris Hostetter wrote: > > 1. Exactly which version of Solr / SolrJ are you using? > > 2. ... > > : I am using the SolrJ client to add documents to in my index. My field > : is a normal "text" field type and the text itself is the first 1000 > : characters of an article. > > Can you put the original (pre solr, pre solrj, raw untouched, etc...) > file that this solr doc came from online somewhere? > > What does your *indexing* code look like? ... Can you add some debugging to > the SolrJ client when you *add* this doc to print out exactly what those > 1000 characters are? > > My hunch: when you are extracting the first 1000 characters, you're > getting only the first half of a character ...or... you are getting docs > with less than 1000 characters and winding up with a buffer (char[]?) that > has garbage at the end; SolrJ isn't complaining on the way in, but > something farther down (maybe before indexing, maybe after) is seeing that > garbage and cutting the field off at that point. > > > > -Hoss > >
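For what it's worth, the kind of guard I had in mind for the truncation is the sketch below - just an idea, not what we currently run. Since String.substring() works on UTF-16 code units rather than code points, it backs up one position if the cut would land in the middle of a surrogate pair:

    public class Snippets {
        /** Return at most maxLen chars, never ending on half of a surrogate pair. */
        public static String safeSnippet(String text, int maxLen) {
            if (text.length() <= maxLen) {
                return text;
            }
            int end = maxLen;
            // If the last kept char is a high surrogate, its low surrogate would be
            // cut off; back up one so the pair stays intact.
            if (Character.isHighSurrogate(text.charAt(end - 1))) {
                end--;
            }
            return text.substring(0, end);
        }

        public static void main(String[] args) {
            String s = "leg bones \uD835\uDC00 nose"; // contains one surrogate pair
            System.out.println(safeSnippet(s, 11));   // cuts before the pair, not through it
        }
    }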
Re: Responses getting truncated
Firstly, to everyone who has been helping me, thank you very much. All this feedback is helping me narrow down these issues.

I deleted the index and re-indexed all the data from scratch, and for a couple of days we were OK, but now it seems to be erring again. It happens on different input documents, so what was broken before now works (documents that were having issues before are OK now, after a fresh re-index).

An issue we are seeing now is that an XML response from Solr will contain the "tail" of an earlier response, for example:

http://brockwine.com/solr2.txt

That is a response we are getting from Solr. Using the web interface for Solr in Firefox, Firefox freaks out because it tries to parse it and of course it's invalid XML, but I can retrieve it via curl.

Has anyone seen this before?

In regards to earlier questions:

> i assume you are correct, but you listed several steps of transformation
> above, are you certian they all work correctly and produce valid UTF-8?

Yes, I have looked at the source and contacted the author of the conversion library we are using, and have verified that if UTF-8 goes in then UTF-8 will come out - and UTF-8 is definitely going in.

I don't think sending over an actual input document would help, because the failing document seems to change. Plus, this latest issue appears to be more an issue of the last response buffer not clearing, or something like that. What's strange is that if I wait a few minutes and reload, then the buffer is cleared and I get back a valid response. It's intermittent, but appears to be happening frequently.

If it matters, we started using LucidGaze for Solr about 10 days ago, approximately when these issues started happening (but it's hard to say whether that is related, because at the same time we switched from a PHP to a Java indexing client).

Thanks for your patience

-Rupert

On Tue, Aug 25, 2009 at 8:33 PM, Chris Hostetter wrote: > > : We are running an instance of MediaWiki so the text goes through a > : couple of transformations: wiki markup -> html -> plain text. > : Its at this last step that I take a "snippet" and insert that into Solr. > ... > : doc.addField("text_snippet_t", article.getSnippet(1000)); > > ok, well first off: that's not the field where you are having problems, > is it? if i remember correctly from your previous posts, wasn't the > response getting aborted in the middle of the Contents field? > > : and a maximum of 1K chars if its bigger. I initialized this String > : from the DB by using the String constructor where I pass in the > : charset/collation > : > : text = new String(textFromDB, "UTF-8"); > : > : So to the best of my knowledge, accessing a substring of a UTF-8 > : encoded string should not break up the UTF-8 code point. Is that an > > i assume you are correct, but you listed several steps of transformation > above, are you certian they all work correctly and produce valid UTF-8? > > this leads back to my suggestion before > > : > Can you put the orriginal (pre solr, pre solrj, raw untouched, etc...) > : > file that this solr doc came from online somewhere? > : > > : > What does your *indexing* code look like? ... Can you add some debuging to > : > the SolrJ client when you *add* this doc to print out exactly what those > : > 1000 characters are? > > > -Hoss >
Re: Responses getting truncated
I know in my last message I said I was having issues with "extra content" at the start of a response, resulting in an invalid document. I am also still having issues with documents getting truncated (yes, I have problems galore).

I will elaborate on why it's so difficult to track down an actual document which causes the failure - if I could find the document I could post it to the group. I will just document the steps:

1) I have a query which results in a bogus, truncated document. This query pulls back all fields. If I take that same query and remove the "text_t" field from the returned field list, then all is well. This indicates to me that it's a problem with the text_t field. This query uses the default of 10 returned rows.

2) So far so good. My next step is to find the document. So I take my original query and remove text_t from the field list to get my result set.

3) I run a new query that JUST selects one document, based on its doc ID (which I have from the first query). My thinking is that my "broken" document HAS to be in that set, so I can just select each document by ID and then validate the response.

This is where it breaks down: I know one or more broken documents is in my set, but if I iterate over each doc ID and pull it out individually, its response is valid. It's only broken when I pull it out in the first query. It's NOT broken when I pull it out by ID, even though I am also pulling out the same "broken" field.

If you can read Ruby, my script is here: http://brockwine.com/solr_fetch.txt

In the first net/http call, if I include the "text_t" field in the "fl" list then it breaks. If I remove it, get the doc IDs and then iterate over each one and fetch it back from Solr (including the supposedly broken field "text_t"), then it works just fine - the exception is never raised. But it is raised in the first call if I include it.

To me this makes absolutely no sense.

Thanks
-Rupert

On Fri, Aug 28, 2009 at 2:14 PM, Joe Calderon wrote: > I had a similar issue with text from past requests showing up, this was on > 1.3 nightly. I switched to using the Lucid build of 1.3 and the problem went > away. I'm using a nightly of 1.4 right now, also without problems. Then again > your mileage may vary, as I also made a bunch of schema changes that might > have had some effect; it wouldn't hurt to try though. > > > On 08/28/2009 02:04 PM, Rupert Fiasco wrote: >> >> Firstly, to everyone who has been helping me, thank you very much. All >> this feedback is helping me narrow down these issues. >> >> I deleted the index and re-indexed all the data from scratch and for a >> couple of days we were OK, but now it seems to be erring again. >> >> It happens on different input documents so what was broken before now >> works (documents that were having issues before are OK now, after a >> fresh re-index). >> >> An issue we are seeing now is that an XML response from Solr will >> contain the "tail" of an earlier response, for an example: >> >> http://brockwine.com/solr2.txt >> >> That is a response we are getting from Solr - using the web interface >> for Solr in Firefox, Firefox freaks out because it tries to parse >> that, and of course, its invalid XML, but I can retrieve that via >> curl. >> >> Anyone seeing this before? >> >> In regards to earlier questions: >> >> >>> >>> i assume you are correct, but you listed several steps of transformation >>> above, are you certian they all work correctly and produce valid UTF-8?
>>> >> >> Yes, I have looked at the source and contacted the author of the >> conversion library we are using and have verified that if UTF8 goes in >> then UTF8 will come out and UTF8 is definitely going in. >> >> I dont think sending over an actual input document would help because >> it seems to change. Plus, this latest issue appears to be more an >> issue of the last response buffer not clearing or something. >> >> Whats strange is that if I wait a few minutes and reload, then the >> buffer is cleared and I get back a valid response, its intermittent, >> but appears to be happening frequently. >> >> If it matters, we started using LucidGaze for Solr about 10 days ago, >> approximately when these issues started happening (but its hard to say >> if thats an issue because at this same time we switched from a PHP to >> Java indexing client). >> >> Thanks for your patience >> >> -Rupert >> >> On Tue, Aug 25, 2009 at 8:33 PM, Chris >> Hostetter wrote: >> >>> >>> : We are running an instance of MediaWiki so t
Re: Responses getting truncated
Yes, I am hitting the Solr server directly (medsolr1.colo:9007).

Versions / architectures:

Jetty(6.1.3)

o...@medsolr1 ~ $ uname -a
Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009 x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux

o...@medsolr1 ~ $ java -version
java version "1.6.0_11"
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)

I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try.

-Rupert

On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley wrote: > On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco wrote: >> If I run these through curl on the command its >> truncated and if I run the search through the web-based admin panel >> then I get an XML parse error. > > Are you running curl directly against the solr server, or going > through a load balancer? Cutting out the middle-men using curl was a > great idea - just make sure to go all the way. > > At first I thought it could possibly be a FastWriter bug (internal > Solr class), but that's only used on the TextWriter (JSON, Python, > Ruby) based formats, not on the original XML format. > > It really looks like you're hitting a lower-level IO buffering bug > (esp when you see a response starting off with the tail of another > response). That doesn't look like it could be a Solr bug... but > rather smells like a thread safety bug in the servlet container. > > What type of machine are you running on? What JVM? > You could try upgrading your version of Jetty, the JVM, or try > switching to Tomcat. > > -Yonik > http://www.lucidimagination.com > > >> This appears to have just started recently and the only thing we have >> done is change our indexer from a PHP one to a Java one, but >> functionally they are identical. >> >> Any thoughts? Thanks in advance. >> >> - Rupert >> >
Re: Responses getting truncated
I deployed LucidWorks with my existing solrconfig / schema and re-indexed my data into it and pushed it out to production, we'll see how it stacks up over the weekend. Already queries that were breaking on the prior Jetty/stock Solr setup are now working - but I have seen it before where upon an initial re-index things work OK then a couple of days later they break. Keep y'all posted. Thanks -Rupert On Fri, Aug 28, 2009 at 3:12 PM, Rupert Fiasco wrote: > Yes, I am hitting the Solr server directly (medsolr1.colo:9007) > > Versions / architectures: > > Jetty(6.1.3) > > o...@medsolr1 ~ $ uname -a > Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009 > x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux > > o...@medsolr1 ~ $ java -version > java version "1.6.0_11" > Java(TM) SE Runtime Environment (build 1.6.0_11-b03) > Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode) > > > I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try. > > -Rupert > > On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley wrote: >> On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco wrote: >>> If I run these through curl on the command its >>> truncated and if I run the search through the web-based admin panel >>> then I get an XML parse error. >> >> Are you running curl directly against the solr server, or going >> through a load balancer? Cutting out the middle-men using curl was a >> great idea - just make sure to go all the way. >> >> At first I thought it could possibly be a FastWriter bug (internal >> Solr class), but that's only used on the TextWriter (JSON, Python, >> Ruby) based formats, not on the original XML format. >> >> It really looks like you're hitting a lower-level IO buffering bug >> (esp when you see a response starting off with the tail of another >> response). That doesn't look like it could be a Solr bug... but >> rather smells like a thread safety bug in the servlet container. >> >> What type of machine are you running on? What JVM? >> You could try upgrading your version of Jetty, the JVM, or try >> switching to Tomcat. >> >> -Yonik >> http://www.lucidimagination.com >> >> >>> This appears to have just started recently and the only thing we have >>> done is change our indexer from a PHP one to a Java one, but >>> functionally they are identical. >>> >>> Any thoughts? Thanks in advance. >>> >>> - Rupert >>> >> >
Re: Responses getting truncated
So we have been running LucidWorks for Solr for about a week now and have seen no problems, so I believe it was due to that buffering issue in Jetty 6.1.3, as suggested here:

>>> It really looks like you're hitting a lower-level IO buffering bug
>>> (esp when you see a response starting off with the tail of another
>>> response). That doesn't look like it could be a Solr bug... but
>>> rather smells like a thread safety bug in the servlet container.

Thanks for everyone's help and input. LucidWorks For The Win.

-Rupert

On Fri, Aug 28, 2009 at 4:07 PM, Rupert Fiasco wrote: > I deployed LucidWorks with my existing solrconfig / schema and > re-indexed my data into it and pushed it out to production, we'll see > how it stacks up over the weekend. Already queries that were breaking > on the prior Jetty/stock Solr setup are now working - but I have seen > it before where upon an initial re-index things work OK then a couple > of days later they break. > > Keep y'all posted. > > Thanks > -Rupert > > On Fri, Aug 28, 2009 at 3:12 PM, Rupert Fiasco wrote: >> Yes, I am hitting the Solr server directly (medsolr1.colo:9007) >> >> Versions / architectures: >> >> Jetty(6.1.3) >> >> o...@medsolr1 ~ $ uname -a >> Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009 >> x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux >> >> o...@medsolr1 ~ $ java -version >> java version "1.6.0_11" >> Java(TM) SE Runtime Environment (build 1.6.0_11-b03) >> Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode) >> >> >> I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try. >> >> -Rupert >> >> On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley wrote: >>> On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco wrote: >>>> If I run these through curl on the command its >>>> truncated and if I run the search through the web-based admin panel >>>> then I get an XML parse error. >>> >>> Are you running curl directly against the solr server, or going >>> through a load balancer? Cutting out the middle-men using curl was a >>> great idea - just make sure to go all the way. >>> >>> At first I thought it could possibly be a FastWriter bug (internal >>> Solr class), but that's only used on the TextWriter (JSON, Python, >>> Ruby) based formats, not on the original XML format. >>> >>> It really looks like you're hitting a lower-level IO buffering bug >>> (esp when you see a response starting off with the tail of another >>> response). That doesn't look like it could be a Solr bug... but >>> rather smells like a thread safety bug in the servlet container. >>> >>> What type of machine are you running on? What JVM? >>> You could try upgrading your version of Jetty, the JVM, or try >>> switching to Tomcat. >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>>> This appears to have just started recently and the only thing we have >>>> done is change our indexer from a PHP one to a Java one, but >>>> functionally they are identical. >>>> >>>> Any thoughts? Thanks in advance. >>>> >>>> - Rupert >>>> >>> >> >
Specifying multiple documents in DataImportHandler dataConfig
I am using the DataImportHandler with a JDBC dataSource. From my understanding of DIH, for each of my "content types" - e.g. blog posts, Mesh categories, etc. - I would construct a series of document/entity sets: one document element per content type, each wrapping its own entity (the first for blog_entries, the second for mesh_categories).

Solr parses this just fine and allows me to issue a /dataimport?command=full-import and it runs, but it only runs against the "first" document (blog_entries). It doesn't run against the 2nd document (mesh_categories).

If I remove the 2 document elements and wrap both entity sets in just one document tag, then both sets get indexed, which seemingly achieves my goal (see the sketch after this message). This just doesn't make sense from my understanding of how DIH works. My 2 content types are indeed separate, so they logically represent two document types, not one.

Is this correct? What am I missing here?

Thanks
-Rupert
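For reference, the single-document layout that actually indexed both sets looks roughly like this. The connection details, queries and field mappings are placeholders, not our real ones:

    <dataConfig>
      <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/content" user="user" password="pass"/>
      <document>
        <!-- Both root entities live under the one document element. -->
        <entity name="blog_entries" query="SELECT id, title, body FROM blog_entries">
          <field column="id" name="id"/>
          <field column="title" name="title_t"/>
          <field column="body" name="text_t"/>
        </entity>
        <entity name="mesh_categories" query="SELECT id, name FROM mesh_categories">
          <field column="id" name="id"/>
          <field column="name" name="title_t"/>
        </entity>
      </document>
    </dataConfig>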
Re: Specifying multiple documents in DataImportHandler dataConfig
Maybe I should be more clear: I have multiple tables in my DB that I need to save to my Solr index. In my app code I have logic to persist each table (each maps to an application model) to Solr. This is fine. I am just trying to speed up indexing time by using DIH instead of going through my application.

From what I understand of DIH, I can specify one dataSource element and then a series of document/entity sets, one for each of my models. But like I said before, DIH only appears to want to index the first document declared under the dataSource tag.

-Rupert

On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco wrote: > I am using the DataImportHandler with a JDBC datasource. From my > understanding of DIH, for each of my "content types" e.g. Blog posts, > Mesh Categories, etc I would construct a series of document/entity > sets, like > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Solr parses this just fine and allows me to issue a > /dataimport?command=full-import and it runs, but it only runs against > the "first" document (blog_entries). It doesnt run against the 2nd > document (mesh_categories). > > If I remove the 2 document elements and wrap both entity sets in just > one document tag, then both sets get indexed, which seemingly achieves > my goal. This just doesnt make sense from my understanding of how DIH > works. My 2 content types are indeed separate so they logically > represent two document types, not one. > > Is this correct? What am I missing here? > > Thanks > -Rupert >
Understanding prefix query searching
So I tried to look on Google for an answer to this before I posted here. Basically I am trying to understand how prefix searching works.

I have a dynamic text field (indexed and stored), "full_name_t", and I have some data in my index, specifically a record with full_name_t = "Robert P Page".

A search on

  full_name_t:Robert

yields that document; however, a search on

  full_name_t:Robert*

yields nothing. Why?

To get around this I am doing something like (full_name_t:Robert OR full_name_t:Robert*), but I would like to understand why the wildcard doesn't work - shouldn't it match anything after the first characters of "Robert"?

Thanks
-Rupert
Spell checking not returning "full" terms
We are using Solr 1.3 and trying to get spell checking functionality.

FYI, our index contains a lot of medical terms (which might or might not make a difference, as they are not English-y words, if that makes any sense?).

If I specify a spellcheck query of "spellcheck.q=diabtes" I get suggestions of:

  diabet
  diabetogen
  dilat
  diamet
  diatom
  diastol
  diactin
  dialect

If I instead mis-spell diabetes as "q=diabets" then I get no suggestions.

So, first off, two things:

1) Why would leaving out one "e" rather than the other affect the spelling suggestions so substantially?
2) In the former list of suggestions, notice the first suggestion is "diabet", which isn't all that helpful; it should return something like "diabetes" or maybe even "diabetic".

Note that if I do a normal search against "diabetes" then I get a ton of results; in other words, our index is filled with the term "diabetes".

My relevant solrconfig is:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">text</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">text_t</str>
      <str name="spellcheckIndexDir">./spellchecker1</str>
      <str name="accuracy">0.1</str>
    </lst>
    <lst name="spellchecker">
      <str name="name">jarowinkler</str>
      <str name="field">text_t</str>
      <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
      <str name="spellcheckIndexDir">./spellchecker2</str>
      <str name="accuracy">0.1</str>
    </lst>
  </searchComponent>

and I have spellcheck.count = 8.

Notice that I bumped the "accuracy" setting way down to get more results. Bumping it up higher yields fewer results (I am not sure what the setting really means, so I don't know in what direction I want to change it - I am guessing that a lower value allows for more mis-spellings, i.e. it is more promiscuous).

Our "text" and "text_t" fields are defined in schema.xml as standard analyzed text fields (indexed, stored, multiValued).

Any help would be appreciated.

Thanks
-Rupert
Re: Spell checking not returning "full" terms
Awesome! After reading up on the links you sent me I got it all working. Thanks!

FYI - I had previously come across one of the links you sent over:

http://wiki.apache.org/solr/SpellCheckerRequestHandler

But what threw me off is that when I started reading it yesterday, the first paragraph says that the handler is deprecated and to use SpellCheckComponent - so at that point I stopped reading and went over to the component page. If I had kept reading I would have encountered all of the gritty details that I in fact needed to get this working. The wiki entry makes it seem old, deprecated and no longer relevant, but it certainly is still relevant.

-Rupert

On Wed, Feb 4, 2009 at 11:57 AM, Grant Ingersoll wrote: > I'm guessing the field you are checking against is being stemmed. The field > you spell check against should have minimal analysis done to it, i.e. > tokenization and probably downcasing. See > http://wiki.apache.org/solr/SpellCheckComponent and > http://wiki.apache.org/solr/SpellCheckerRequestHandler for tips on how to > handle analysis for spelling. > > On Feb 4, 2009, at 2:33 PM, Rupert Fiasco wrote: > >> We are using Solr 1.3 and trying to get spell checking functionality. >> >> FYI, our index contains a lot of medical terms (which might or might >> not make a difference as they are not English-y words, if that makes >> any sense?) >> >> If I specify a spellcheck query of "spellcheck.q=diabtes" >> >> I get suggestions of: >> >> diabet >> diabetogen >> dilat >> diamet >> diatom >> diastol >> diactin >> dialect >> >> If I re-mis-spell Diabetes to "q=diabets" then I go no suggestions. >> >> So first off two things: >> >> 1) Why would leaving out one "e" over the other affect the spelling >> suggestions so substantially? >> 2) In the former list of suggestions, notice the first suggestion is >> "diabet", which isnt all that helpful, it should return something like >> "diabetes" or maybe even "diabetic". >> >> Note that if I do a normal search against "diabetes" then I get a ton >> of results, in other words, our index is filled with terms of >> "diabetes". >> >> My relevant solrconfig is: >> >> >> text >> >> >> default >> text_t >> ./spellchecker1 >> 0.1 >> >> >> >> jarowinkler >> text_t >> >> > name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance >> ./spellchecker2 >> 0.1 >> >> >> >> and I have >> >> spellcheck.count = 8 >> >> Notice that I severely bumped down the "accuracy" setting to get more >> results. Bumping it up higher yields less results (not sure what >> setting really meant so I dont know in what direction I want to change >> that value - I am guessing that a lower value allows for more >> mis-spellings, e.g. its more promiscuous). >> >> Our "text" and "text_t" fields are defined in schema.xml as: >> >> > multiValued="true"/> >> and >> > stored="true" multiValued="true" /> >> >> Any help would be appreciated. >> >> Thanks >> -Rupert > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > > > > > > > > >
Issuing just a spell check query
The docs for the SpellCheckComponent say:

"The SpellCheckComponent is designed to provide inline spell checking of queries without having to issue separate requests."

I would like to issue just a spell check query; I don't care about it being inline and piggy-backing off a normal search query. How would I achieve this?

I tried monkeying with making a new requestHandler, but using class="solr.SearchHandler" always tries to do a normal search. I succeeded in adding inline spell checking to the default request handler by *adding* the spellcheck component to its requestHandler config - I would like to *remove* the default search component, maybe by making a new request handler which just does spell checking. Is something like this possible - a handler whose component list is just "spellcheck", with defaults like rows=5 and the "default" dictionary?

Now, I can sort of achieve what I want by running a normal search but using a dummy value for my "q" parameter (for me "00" works); then I get no search docs back, but I do get the spell suggestions I want, driven by the "spellcheck.q" parameter (a sketch of that workaround is below). But this seems very hacky, and Solr still has to run a search against my dummy value.

A roundabout way of asking: how can I fire off *just* a spell check query?

Thanks in advance
-Rupert
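For reference, the dummy-query workaround from SolrJ looks roughly like the sketch below. The Solr URL is a placeholder, and rows=0 is my own guess at avoiding fetching any documents at all - it is not something I pulled from the docs:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.util.NamedList;

    public class SpellOnly {
        public static void main(String[] args) throws Exception {
            // Placeholder Solr URL.
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("00");   // dummy q value, as described above
            query.setRows(0);                        // assumption: skip fetching any docs
            query.set("spellcheck", "true");
            query.set("spellcheck.q", "diabtes");
            query.set("spellcheck.count", "8");
            QueryResponse rsp = server.query(query);
            // Pull the raw spellcheck section out of the response rather than
            // relying on any typed accessors.
            NamedList spellcheck = (NamedList) rsp.getResponse().get("spellcheck");
            System.out.println(spellcheck);
        }
    }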
Re: Issuing just a spell check query
But its deprecated (??) -Rupert On Fri, Feb 6, 2009 at 11:51 AM, Otis Gospodnetic wrote: > Rupert, > > You could use the SpellCheck*Handler* to achieve this. > > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > > ________ > From: Rupert Fiasco > To: solr-user@lucene.apache.org > Sent: Friday, February 6, 2009 2:47:19 PM > Subject: Issuing just a spell check query > > The docs for the SpellCheckComponent say > > "The SpellCheckComponent is designed to provide inline spell checking > of queries without having to issue separate requests." > > I would like to issue just a spell check query, I dont care about it > being inline and piggy-backing off a normal search query. > > How would I achieve this? > > I tried monkeying with making a new requestHandler but using class = > "solr.SearchHandler" always tries to do a normal search. > > I succeeded in adding inline spell checking to the default request > handler by *adding* > > > spellcheck > > > to its requestHandler config - I would like to *remove* the default > search component - maybe by making a new request handler which just > does spell checking? > > Is something like this possible? > > > > > > >5 > > > > spellcheck > > > > > default > > > > > > > > > Now, I can sort of achieve what I want by in fact a normal search but > then using a dummy value for my "q" parameter (for me "00" > works) and then I get no search docs back, but I do get the spell > suggestions I want, driven by the "spellcheck.q" parameter. > > But this seems very hacky and Solr is still having to run a search > against my dummy value. > > A roundabout way of asking: how can I fire off *just* a spell check query? > > Thanks in advance > -Rupert >
search returns matches for non-starting wildcard prefix queries
(I think I have a horrible subject line, but I wasn't sure how to properly explain myself.)

I have a text field that I store last names in (and everything is lowercased prior to insertion, not sure if that matters). The field is a regular analyzed text field.

When running a query such as

  last_name:m*

I get data back like:

  Pashman, Md
  Maldonado
  Manolidis
  Fleisher, M.D., D.Ht., D.A.B.F.M.
  Merino
  Monroe
  McLay
  Maltsberger
  McMurtray
  Murphy Md
  Loeb Md

As you can see most are perfect matches, but there are some that *don't* start with the letter "M" but do have "M" at the beginning of another "word" in the field.

Wouldn't the query "m*" just match records where the first letter of the whole field is "M", and not an "M" inside another "word" in that field?

Do I need to make another field to store last names and not perform any analysis on that field (akin to a spell check field)? (See the sketch below.)

Thanks in advance.
-Rupert
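If a separate unanalyzed copy is the right answer, I am guessing at something like the following in schema.xml - a plain string field populated by copyField, so a prefix query only ever sees the whole (already lowercased) name as a single term. The field names are just placeholders:

    <!-- Untokenized copy of the last name; "string" is the stock solr.StrField type. -->
    <field name="last_name_exact" type="string" indexed="true" stored="false"/>
    <copyField source="last_name" dest="last_name_exact"/>

With that in place, a query like last_name_exact:m* would presumably only match documents whose whole value starts with "m".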
Indexing issue with XML control characters
During indexing I will often get this error:

SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 3)) at [row,col {unknown-source}]: [2,1]
    at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)

By looking at this list and elsewhere I know that I need to filter out most control characters, so I have been stripping anything matching this regex (the Java version is sketched at the end of this message):

  /[\x00-\x08\x0B\x0C\x0E-\x1F]/

But I still get the error. What is strange is that if I re-run my indexing process after a failure, it will work on the previously failed node and then error out on another node some time later. That is, it is not deterministic.

If I look at the text that is being indexed, it is about as pure as you can get (a bunch of medical keywords like "leg bones" and "nose").

Any ideas would be greatly appreciated. The platform is:

Solr implementation version: 1.3.0 694707
Lucene implementation version: 2.4-dev 691741
Mac OS X 10.5.7
JVM 1.5.0_19-b02-304

Thanks
/Rupert
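The stripping currently happens roughly like this (just a sketch of the approach - the Pattern is the same character class as the regex above, applied to every field value before it goes into the SolrInputDocument):

    import java.util.regex.Pattern;

    public class ControlChars {
        // Same class as the regex above: C0 control chars except tab, LF and CR.
        private static final Pattern CONTROL_CHARS =
                Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

        public static String strip(String in) {
            return CONTROL_CHARS.matcher(in).replaceAll("");
        }

        public static void main(String[] args) {
            String dirty = "leg bones\u0003 and nose";
            System.out.println(strip(dirty)); // prints "leg bones and nose"
        }
    }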