Re: Split one string into many fields

2007-01-22 Thread Bertrand Delacretaz

On 1/22/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

...When we get to it, I'd like to hear why it (things like PDF parsing)
should be inside Solr rather than outside using our update interfaces


Same here.

I haven't had time to follow the recent (rich) design discussions
about this stuff, but if I was designing this, I'd put all the
document processing code in a separate module (separate servlet?) and
keep the Solr core lean and mean, with as thin an interface as
possible.

-Bertrand


Re: Using HTTP-Post for Queries

2007-01-22 Thread Maximilian Hütter
Thank you for the answers. The idea was to use Solr with REST. Is there
an XMLQueryParser yet? I didn't find it in the source.

Max

Erik Hatcher schrieb:
> Also consider that I expect Solr to support the XMLQueryParser at some
> point in the near future, which would be POSTed in a body for a search
> request.  Being RESTful is something I strive for all too often myself,
> and using HTTP verbs appropriately.  But pragmatically speaking, a POST
> to Solr is not a big deal.  Solr is designed to be hidden and not
> crawled anyway, so whatever front-end would have the responsibility for
> dealing with how the world sees the search interface.
> 
> Erik
> 
> 
> On Jan 19, 2007, at 9:16 PM, Yonik Seeley wrote:
> 
>> On 1/19/07, Brian Lucas <[EMAIL PROTECTED]> wrote:
>>> Walter Underwood wrote:
>>> > Use GET unless it really, really, really doesn't work. POST is
>>> > the wrong HTTP semantic for fetching information. Long query
>>> > strings are not a good enough reason. HTTP puts no limit on the
>>> > length of a URL.
>>> >
>>>
>>> Walter, while your above statement may be true, some java app servers
>>> have an issue with the length of URLs and truncate after a certain
>>> point. I ran into this issue back in April.
>>
>> Yep, and given that even Apache has limits (which most people use to
>> front dynamic content), using really large URLs is asking for trouble.
>>
>> http://www.boutell.com/newfaq/misc/urllength.html
>>
>> -Yonik
> 
> 
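
The trade-off discussed above (GET as the right semantic, POST as the pragmatic fallback for very long queries) can be sketched as follows. This is only an illustration: the endpoint URL and the 2000-character threshold are assumptions, not values from this thread, and it presumes the server accepts form-encoded parameters in a POST body.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed values for illustration only; real limits depend on whatever
# server or proxy sits in front of Solr.
MAX_URL_LENGTH = 2000
SELECT_URL = "http://localhost:8983/solr/select"  # hypothetical endpoint


def build_search_request(params, base_url=SELECT_URL, max_len=MAX_URL_LENGTH):
    """Return a GET request when the URL is short enough, otherwise a POST."""
    query = urlencode(params)
    url = f"{base_url}?{query}"
    if len(url) <= max_len:
        return Request(url, method="GET")
    # Too long for a URL: send the same form-encoded parameters in the body.
    return Request(
        base_url,
        data=query.encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

The request object is only built here, not sent, so the choice of verb can be inspected or logged before anything goes over the wire.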



Re: Admin page went down

2007-01-22 Thread Bertrand Delacretaz

On 10/31/06, Yu-Hui Jin <[EMAIL PROTECTED]> wrote:


I kept the solr jetty server running on my box for a couple of days. Today I
found I can no longer access the admin page. It gives the following error
page:
HTTP ERROR: 404...


I've seen the same thing today on one of my Solr instances. I've
entered the info in https://issues.apache.org/jira/browse/SOLR-118 so
that we can correlate if other people see this.

-Bertrand


Re: Using HTTP-Post for Queries

2007-01-22 Thread Erik Hatcher


On Jan 22, 2007, at 4:09 AM, Maximilian Hütter wrote:

Is there
an XMLQueryParser yet? I didn't find it in the source.


Yes - it's part of Lucene's contrib area:

	


For now, you'll have to build the JAR, put it into Solr's WAR, and
write a new request handler to make use of it.


Erik



Re: Using HTTP-Post for Queries

2007-01-22 Thread Erik Hatcher


On Jan 21, 2007, at 11:12 PM, Yonik Seeley wrote:

On 1/21/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:

Yes, I think different syntaxes in different places would be useful.
For example, a user enters a full-text search query that is suitable
to use with Solr's QueryParser, and then the user facets a bit.  The
facet "queries" really aren't queries at all, but rather terms that
don't need to be parsed.  Building up a string to be parsed like
"#{params[:field]}:#{params[:value]}" is tricky because of escaping
syntax (like colons).  So fq wouldn't need to be parsed at all,
except to pull the field name out to build a TermQuery
straightforwardly.


fq params are often queries too (think price ranges, etc).


Oh, I realize that quite well :)

However, for basic faceted browsing, fq parameters are generally just  
plain terms and any type of meta-QueryParser syntax in the terms  
could get in the way when QueryParser is being used.
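
To illustrate why building a field:value string for the query parser is fiddly, here is a minimal sketch of the escaping a client would have to do. The character list is an assumption based on the classic Lucene query parser syntax and should be checked against the parser version in use; the function names are made up for this example.

```python
# Characters with special meaning in the classic Lucene query parser syntax.
# This list is an assumption for illustration, not taken from this thread.
LUCENE_SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\')


def escape_term(value):
    """Backslash-escape every special character in a raw term value."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL_CHARS else ch for ch in value)


def facet_filter(field, value):
    """Build a field:value filter string from unparsed facet input."""
    return f"{field}:{escape_term(value)}"
```

Note that this does nothing about whitespace: a facet value containing spaces would still be split into several terms by the parser, which is one more reason a "treat this as a single raw term" syntax is attractive.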



For syntax, what about

myfield:my unescaped value all as a single term

Or

my unescaped value all as a single term


This second option allows for parameters to be passed to the parser,  
which is a nice point of extensibility.



Of course, while that prefix looks decent in bare text, adding it to
an XML config file would look ugly since < would need escaping.


No biggie there.  It's not often that it'd be in a config file.  And  
there is already escaping ugliness for the ping and warmup queries in  
solrconfig.xml.



Another syntax option... something like

!term:myfield:my unescaped value all as a single term
#term:
%term:
@term:

(basically, find a prefix that would be unlikely to appear as an
actual term or wildcard in lucene queryparser syntax)


I like the  syntax above best of the ones  
you've mentioned because of the ability to provide parameters cleanly.


Erik



Re: Using HTTP-Post for Queries

2007-01-22 Thread Yonik Seeley

On 1/22/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:


On Jan 21, 2007, at 11:12 PM, Yonik Seeley wrote:
> On 1/21/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>> Yes, I think different syntaxes in different places would be useful.
>> For example, a user enters a full-text search query that is suitable
>> to use with Solr's QueryParser, and then the user facets a bit.  The
>> facet "queries" really aren't queries at all, but rather terms that
>> don't need to be parsed.  Building up a string to be parsed like
>> "#{params[:field]}:#{params[:value]}" is tricky because of escaping
>> syntax (like colons).  So fq wouldn't need to be parsed at all,
>> except to pull the field name out to build a TermQuery
>> straightforwardly.
>
> fq params are often queries too (think price ranges, etc).

Oh, I realize that quite well :)

However, for basic faceted browsing, fq parameters are generally just
plain terms and any type of meta-QueryParser syntax in the terms
could get in the way when QueryParser is being used.

> For syntax, what about
>
> myfield:my unescaped value all as a single term
>
> Or
>
> my unescaped value all as a single term

This second option allows for parameters to be passed to the parser,
which is a nice point of extensibility.

> Of course, while that prefix looks decent in bare text, adding it to
> an XML config file would look ugly since < would need escaping.

No biggie there.  It's not often that it'd be in a config file.  And
there is already escaping ugliness for the ping and warmup queries in
solrconfig.xml.


But it could be coming back in an XML response, or seen in an access log.


> Another syntax option... something like
>
> !term:myfield:my unescaped value all as a single term
> #term:
> %term:
> @term:
>
> (basically, find a prefix that would be unlikely to appear as an
> actual term or wildcard in lucene queryparser syntax)

I like the  syntax above best of the ones
you've mentioned because of the ability to provide parameters cleanly.


I had plans for providing params for all of the syntaxes... I just
left it for later since I didn't know if it would be controversial.  I
wanted to keep it simple at first to avoid scaring people :-)

|qp|+a +b                    #lucene query parser syntax
|qp(field='myfield')|+a +b   #providing a different default field
 OR
|qp field='myfield'|+a +b    #providing a different default field
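
As a rough sketch of how a server might split such a prefixed value into a parser name, parameters, and body: the regular expressions and the function name below are invented for illustration and are not Solr code. Falling back to a default parser when no prefix is present is also an assumption.

```python
import re

# Matches |name|body, |name(k='v')|body, |name k='v'|body and |name,k='v'|body.
PREFIX_RE = re.compile(
    r"^\|(?P<name>\w+)[\s(,]*(?P<args>[^|]*?)\)?\|(?P<body>.*)$", re.S
)
# One key='value' pair inside the prefix.
ARG_RE = re.compile(r"(\w+)='([^']*)'")


def parse_prefixed(raw, default_parser="qp"):
    """Split a parameter value into (parser name, params, body)."""
    match = PREFIX_RE.match(raw)
    if match is None:
        # No prefix: hand the whole string to the default parser.
        return default_parser, {}, raw
    params = dict(ARG_RE.findall(match.group("args")))
    return match.group("name"), params, match.group("body")
```

For example, `parse_prefixed("|qp(field='myfield')|+a +b")` yields the parser name `qp`, the parameters `{"field": "myfield"}`, and the body `+a +b`, leaving the body untouched for whichever parser is selected.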


I don't know if it's important enough to consider, but there is also
URL escaping to consider (readability of raw access logs, etc).  Here
are alternatives sent from firefox to netcat:

http://localhost:5000/foo?q=<!term fieldname='foo'>bar
GET /foo?q=%3C!term%20fieldname='foo'%3Ebar HTTP/1.1

http://localhost:5000/foo?q=|term fieldname='foo'|bar
GET /foo?q=|term%20fieldname='foo'|bar HTTP/1.1

http://localhost:5000/foo?q=|term,fieldname='foo'|bar
GET /foo?q=|term,fieldname='foo'|bar HTTP/1.1
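
The same comparison can be reproduced offline. The sketch below is an illustration only: the set of characters treated as "safe" is an assumption chosen to mimic the logged requests above, not a statement about how any browser encodes URLs in general. The strict column shows what a client that escapes everything would put in the log instead.

```python
from urllib.parse import quote, unquote

# Characters the logged requests left unescaped; treating them as safe is an
# assumption made here to mimic that behaviour.
BROWSER_LIKE_SAFE = "!='|,"

candidates = [
    unquote("%3C!term%20fieldname='foo'%3Ebar"),  # decoded from the first logged request
    "|term fieldname='foo'|bar",
    "|term,fieldname='foo'|bar",
]


def encoded_forms(value):
    """Return (lenient, strict) percent-encodings of one candidate syntax."""
    return quote(value, safe=BROWSER_LIKE_SAFE), quote(value, safe="")


for value in candidates:
    lenient, strict = encoded_forms(value)
    print(f"{value}\n  lenient: {lenient}\n  strict:  {strict}")
```

Under the lenient setting only the comma-separated pipe form survives completely unescaped; under strict encoding every variant picks up escapes, so log readability depends on the client as much as on the syntax.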

 anyone ;-) ;-)
OR
|facet,field='foo',offset='50',limit='10',mincount='1'|
OR
[!facet field='foo' offset='50' limit='10', mincount=1]


-Yonik


Re: Split one string into many fields

2007-01-22 Thread Chris Hostetter

: > ...When we get to it, I'd like to hear why it (things like PDF parsing)
: > should be inside Solr rather than outside using our update interfaces
:
: Same here.

I wouldn't say that i think it *should* be inside of Solr, just that it
*could* be inside of Solr.  the use case i imagine is when you run an
operation with multiple clients that all want to index PDF files
according to some custom rules that map pieces of the files to fields in
your schema ... if they have to send Solr XML data listing all the
field=value pairs, then they all have to not only load the same PDF
parsing library, but also share the same biz logic for deciding what
kinds of Solr XML documents to produce and send to the server.

If you let people write their own PDFMUpdateHandler then all of those
clients can POST (or upload, refer via URL) the raw PDF file, and the
extraction logic is in one place.
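
A minimal sketch of that idea, with extraction kept on the server so clients only send raw bytes. Everything here is hypothetical: the registry, the function names, and the plain-text extractor that stands in for real PDF parsing.

```python
from typing import Callable, Dict

# Hypothetical registry: content type -> function turning raw bytes into schema fields.
Extractor = Callable[[bytes], Dict[str, str]]
EXTRACTORS: Dict[str, Extractor] = {}


def register(content_type: str):
    """Decorator that registers an extractor for one content type."""
    def wrap(func: Extractor) -> Extractor:
        EXTRACTORS[content_type] = func
        return func
    return wrap


@register("text/plain")
def extract_text(raw: bytes) -> Dict[str, str]:
    # Stand-in for real PDF parsing: first line is the title, the rest is the body.
    title, _, body = raw.decode("utf-8").partition("\n")
    return {"title": title.strip(), "body": body.strip()}


def handle_upload(content_type: str, raw: bytes) -> Dict[str, str]:
    """Server-side entry point: clients send raw bytes, never field=value pairs."""
    try:
        extractor = EXTRACTORS[content_type]
    except KeyError:
        raise ValueError(f"no extractor registered for {content_type}") from None
    return extractor(raw)
```

With this shape, changing the mapping rules means redeploying one extractor rather than updating every client.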


At this point though, I can't for the life of me remember what Ryan said to
convince me that it made sense to have a DocumentParser concept that
UpdateHandlers could delegate to -- as opposed to the UpdateHandler doing
it directly :)

: I haven't had time to follow the recent (rich) design discussions
: about this stuff, but if I was designing this, I'd put all the
: document processing code in a separate module (separate servlet?) and

never fret ... i too want to keep Solr lean.  The idea (in my mind anyway)
is that there are very few out of the box UpdateHandlers (one for XML, one
for CSV, probably one for JDBC) but that there could be lots of contrib
style Updaters that know how to deal with different exotic document types,
which users could load if they wanted to.




-Hoss



Re: Split one string into many fields

2007-01-22 Thread Ryan McKinley

looks like we won't save the discussion for later :)




At this point though, I can't for the life of me remember what Ryan said to
convince me that it made sense to have a DocumentParser concept that
UpdateHandlers could delegate to -- as opposed to the UpdateHandler doing
it directly :)



We were discussing a handler that crawls an svn repository and another
that may accept a single file.  They should be able to share the logic
of parsing a single ContentStream into a Document.
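
As a rough sketch of that sharing, with one parsing step used by both handlers. The names and the trivial "parser" are placeholders for illustration; they are not the SOLR-104 API.

```python
import io
from typing import BinaryIO, Dict, Iterable, List, Tuple

Document = Dict[str, str]


def parse_stream(name: str, stream: BinaryIO) -> Document:
    """Shared step: turn one content stream into one document."""
    text = stream.read().decode("utf-8", errors="replace")
    return {"id": name, "text": text}


def handle_single_file(name: str, stream: BinaryIO) -> List[Document]:
    """Handler that accepts one uploaded file."""
    return [parse_stream(name, stream)]


def handle_repository(entries: Iterable[Tuple[str, BinaryIO]]) -> List[Document]:
    """Handler that walks many (name, stream) pairs, e.g. from a repository crawl."""
    return [parse_stream(name, stream) for name, stream in entries]
```

Both handlers produce identical documents for identical input because neither contains any parsing logic of its own.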

Essentially, I was suggesting making a standard DocumentHandler
framework (like the one in LIA that gets pointed to at least once a
week for people wondering how to parse XML/PDF/TXT/etc into a Lucene
Document).

With SOLR-104, this will be straightforward to implement.  I totally
agree it probably belongs in a 'tools' or 'plugins' directory along
with other things that are useful, but not the focus of Solr.