Re: Using HTTP-Post for Queries

2007-01-21 Thread Chris Hostetter

: >  4) now it's easy to subclass StandardRequestHandler as
: > XmlQuerySyntaxRequestHandler or SurroundSyntaxRequestHandler
: > just by overriding getQueryParser.
: Blech.  All the other ideas are good, and rather orthogonal to
: declaring the syntax in the query itself.

yeah ... i didn't say i was particularly fond of this idea in light of
this discussion ... just that it was what i had been thinking before, and
would punt on these issues.

: Think about it... if we only had *one* base query and no others,
: wouldn't it make sense to just add a single parameter called q.syntax?
:  That's the way I would go.  Except we have many queries all over the
: place.

maybe .. except that, as you said, every different query parser you can
imagine might have a bunch of complex options you would need to set on it
-- forget all the bq and fq params that the DisMaxHandler supports, assume
it was only going to deal with a single "q" -- even if
StandardRequestHandler had supported an easy way of replacing the
QueryParser with a param back when i wrote dismax, it still would have made
sense to write a new request handler in order to pass in the qf, pf,
ps, mm, and tie options the DisMaxQueryParser needs just to make sense of
the "q" param


-Hoss



Re: Using HTTP-Post for Queries

2007-01-21 Thread Erik Hatcher


On Jan 20, 2007, at 9:23 PM, Chris Hostetter wrote:

: > In the back of my mind, I've been thinking about *how* to support
: > multiple query syntaxes.
: > trying to add new parameters everywhere specifying the type doesn't
: > seem like a great idea (way too many places).

: Good point.  Yeah, it does make sense for the query type to be part
: of the query string itself.  There are lots of places that a
: QueryParser expression can currently be used (&q=, &fq= with standard
: requests, and other places with dismax).

are you guys concerned that people will want to use different syntaxes for
different places where a query is expressed, ie: use the stock syntax in
one "fq", the xml syntax in a "q", and the surround syntax in a second
"fq" ?


Yes, I think different syntaxes in different places would be useful.
For example, a user enters a full-text search query that is suitable
to use with Solr's QueryParser, and then the user facets a bit.  The
facet "queries" really aren't queries at all, but rather terms that
don't need to be parsed.  Building up a string to be parsed like
"#{params[:field]}:#{params[:value]}" is tricky because of escaping
syntax (like colons).  So fq wouldn't need to be parsed at all,
except to pull the field name out to build a TermQuery
straightforwardly.
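
To make the escaping problem concrete, here is a minimal sketch of what a client has to do today when it glues a field name and a raw facet value into a parsed query string. Lucene's QueryParser has a static escape method for this; the hand-rolled version below is only a stand-in so the sketch has no dependencies, and the class name, method names and exact character list are assumptions for illustration. Whitespace in the value is not handled here.

```java
// Hypothetical helper: backslash-escapes characters that the classic
// Lucene query syntax treats as operators, so a raw facet value can be
// embedded after "field:" without being reinterpreted.
final class TermEscaper {
    // Assumed set of special characters: \ + - ! ( ) : ^ [ ] " { } ~ * ? | &
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // prefix each special character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds the "field:value" string that would then be sent as an fq param.
    static String fieldQuery(String field, String rawValue) {
        return field + ":" + escape(rawValue);
    }
}
```

With an unparsed term syntax for fq, none of this would be needed: the client would send the field name and the raw value as they are.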


Also, I think many applications using Solr would do well to parse
their own query strings and hand it to Solr pre-digested, perhaps
using the XMLQueryParser, yet using a simplified facet query syntax
as above.



putting a prefix on every query string to identify the syntax seems
cumbersome ... but i guess it wouldn't be too bad if it was done as an
"override" for a single q.syntax option that influenced all "query
strings" type params that didn't start with the special markup.


Yeah, a default would be necessary to keep things clean for the cases  
where the syntax can be assumed to be the same for all, though maybe  
the default choice of syntax would be parameter-specific also, so q  
could be QueryParser, and fq could be, say, a new SimpleTermFacetParser.
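
A rough sketch of that lookup, assuming a global default (what a q.syntax option would set) plus optional per-parameter defaults. The class name and the syntax names "lucene" and "term" are made up for illustration; nothing like this exists in Solr as of this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for default query syntaxes: one global default,
// optionally overridden per request parameter (q, fq, ...).
final class SyntaxDefaults {
    private final Map<String, String> perParam = new HashMap<>();
    private final String globalDefault;

    SyntaxDefaults(String globalDefault) {
        this.globalDefault = globalDefault;
    }

    // Registers a parameter-specific default and returns this for chaining.
    SyntaxDefaults with(String param, String syntax) {
        perParam.put(param, syntax);
        return this;
    }

    // Parameter-specific default if one is registered, else the global one.
    String syntaxFor(String param) {
        String syntax = perParam.get(param);
        return syntax != null ? syntax : globalDefault;
    }
}
```

So `new SyntaxDefaults("lucene").with("fq", "term")` would leave q on the stock parser while fq values are treated as plain terms, unless a given string carries its own syntax marker.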


Erik





Split one string into many fields

2007-01-21 Thread Ryan McKinley

Is there any easy way to split a string into a multi-field on the server:

given:

subject1; subject2; subject- 3


I would like:

subject1
subject2
subject- 3


Thanks for any pointers

ryan


Re: Split one string into many fields

2007-01-21 Thread Yonik Seeley

On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

Is there any easy way to split a string into a multi-field on the server:



From an indexing perspective, yes... just assign a tokenizer that splits on ';'

I don't think we currently have such a configurable Tokenizer though.
The (hypothetical) tokenizer could even add a positionIncrement,
emulating multiple fields exactly from the indexing perspective.  Then
you could follow it with the newly added TrimFilter to trim
whitespace.


From the stored field perspective, you get back what you put in.


To be nice and general, perhaps it could be regex based like String.split()
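
As a plain-Java illustration of the behaviour being described (regex split followed by a trim), and not an actual Lucene Tokenizer, something like the following would turn the example string into the three desired values. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split on a delimiter regex, then trim each piece,
// mirroring "tokenizer that splits on ';'" followed by a trim step.
final class SubjectSplitter {
    static List<String> split(String raw, String delimiterRegex) {
        List<String> values = new ArrayList<>();
        for (String piece : raw.split(delimiterRegex)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) { // skip empty pieces such as "a;;b" would produce
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        for (String value : split("subject1; subject2; subject- 3", ";")) {
            System.out.println(value);
        }
    }
}
```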

-Yonik


given:

 subject1; subject2; subject- 3


I would like:

 subject1
 subject2
 subject- 3


Thanks for any pointers

ryan



Re: Split one string into many fields

2007-01-21 Thread Ryan McKinley

Are you suggesting something like this:


   
 
   
   
 
 
   ...
 
   



On 1/21/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> Is there any easy way to split a string into a multi-field on the server:

From an indexing perspective, yes... just assign a tokenizer that splits on ';'
I don't think we currently have such a configurable Tokenizer though.
The (hypothetical) tokenizer could even add a positionIncrement,
emulating multiple fields exactly from the indexing perspective.  Then
you could follow it with the newly added TrimFilter to trim
whitespace.

From the stored field perspective, you get back what you put in.

To be nice and general, perhaps it could be regex based like String.split()

-Yonik

> given:
> 
>  subject1; subject2; subject- 3
> 
>
> I would like:
> 
>  subject1
>  subject2
>  subject- 3
> 
>
> Thanks for any pointers
>
> ryan
>



Re: Split one string into many fields

2007-01-21 Thread Yonik Seeley

On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

Are you suggesting something like this:



  


  
  
...
  



Exactly, except for that  bit... what's that?

-Yonik


Re: Split one string into many fields

2007-01-21 Thread Ryan McKinley

On 1/21/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> Are you suggesting something like this:
>
>
>  sortMissingLast="true" omitNorms="true">
>   
> 
> 
>   
>   
> ...
>   
> 

Exactly, except for that  bit... what's that?



Maybe the name is wrong, but it is something to tell the updateHandler
to use the tokenizer and filters (normally used for analysis) to
convert the single field into many fields.

I want something that is equivalent to splitting the string on the
client side and filling multiple *fields* not just tokens.

or are you suggesting:


 
 

 
   ...
 
   


Re: Split one string into many fields

2007-01-21 Thread Yonik Seeley

On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

Maybe the name is wrong, but it is something to tell the updateHandler
to use the tokenizer and filters (normally used for analysis) to
convert the single field into many fields.

I want something that is equivalent to splitting the string on the
client side and filling multiple *fields* not just tokens.


Oh, I was talking about indexing only.

Why is it that multiple fields are needed?  Multiple tokens are
indistinguishable from multiple fields during search.

Actually splitting things into different fields normally happens in
the client (outside Solr), or in a specialized handler (like CSV, SQL,
etc).
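
For the client-side route, a minimal sketch: once the string has been split, the client sends one field element per value in the usual XML add message. The "id" field, the class name and the method names are assumptions for illustration.

```java
import java.util.List;

// Hypothetical client-side helper: builds a Solr XML <add> message with
// one <field> element per value, so the stored field comes back as
// multiple values instead of one delimited string.
final class MultiValueDoc {
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    static String toAddXml(String id, String fieldName, List<String> values) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        // "id" as the unique key field name is an assumption about the schema.
        sb.append("<field name=\"id\">").append(escapeXml(id)).append("</field>");
        for (String value : values) {
            sb.append("<field name=\"").append(escapeXml(fieldName)).append("\">")
              .append(escapeXml(value))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }
}
```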

-Yonik


Re: Split one string into many fields

2007-01-21 Thread Ryan McKinley

>
> I want something that is equivalent to splitting the string on the
> client side and filling multiple *fields* not just tokens.

Oh, I was talking about indexing only.



aaah.


Why is it that multiple fields are needed?  Multiple tokens are
indistinguishable from multiple fields during search.



When the app displays search results, it shows a list of subjects.
(from the returned doc list).  That should be split properly.
(Ideally without knowledge of the schema)



Actually splitting things into different fields normally happens in
the client (outside Solr), or in a specialized handler (like CSV, SQL,
etc).



In the case I'm looking at, it would be cleaner and more safe to have
it on the server side...

I guess i have to wait for the "Update Plugins" discussion to wind down!


Re: Split one string into many fields

2007-01-21 Thread Yonik Seeley

On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

> >
> > I want something that is equivalent to splitting the string on the
> > client side and filling multiple *fields* not just tokens.
>
> Oh, I was talking about indexing only.
>

aaah.

> Why is it that multiple fields are needed?  Multiple tokens are
> indistinguishable from multiple fields during search.
>

When the app displays search results, it shows a list of subjects.
(from the returned doc list).  That should be split properly.
(Ideally without knowledge of the schema)


> Actually splitting things into different fields normally happens in
> the client (outside Solr), or in a specialized handler (like CSV, SQL,
> etc).
>

In the case I'm looking at, it would be cleaner and more safe to have
it on the server side...


Safer? It precludes adding a subject with a ';' in it...

Solr currently assumes your data is structured.  Lucene does too... an
analyzer in lucene can't create more fields or take info from one
field and add it to another.

An aside: your need sounds like it's part of that much bigger issue of
processing documents and splitting them up into multiple fields, or at
least processing certain fields in a way that can add other fields.
I'm not sure what a general solution would look like in that case.
For example, you might have a field called "mail-headers", and want
that split up into multiple fields.

Another longer term thing to keep our eye on is UIMA (added to the
Apache incubator not that long ago).

-Yonik


Re: Split one string into many fields

2007-01-21 Thread Ryan McKinley

>
> In the case I'm looking at, it would be cleaner and more safe to have
> it on the server side...

Safer? It precludes adding a subject with a ';' in it...



well, in *this* case it is :)



An aside: your need sounds like it's part of that much bigger issue of
processing documents and splitting them up into multiple fields, or at
least processing certain fields in a way that can add other fields.


Yes, it is.  I'm working with data that is almost structured, but I'd
like to have some level of validation and reprocessing before sticking
it in solr.  I'll use SOLR-104 as that seems like the right thing.



I'm not sure what a general solution would look like in that case.
For example, you might have a field called "mail-headers", and want
that split up into multiple fields.

Another longer term thing to keep our eye on is UIMA (added to the
Apache incubator not that long ago).



Deep within the "Update Plugin" discussion, Hoss and I agreed that
adding an interface and registry for DocumentParsers is a good idea:

interface SolrDocumentParser
{
  Document parse(ContentStream content);
}

SolrDocumentParser parser = core.getDocumentParse( "text/html");

This would let update plugins share (pluggable) logic for how to
convert a single stream into a single document...  this is more than
we are talking about doing now, but something (else) to keep in mind.
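
A self-contained sketch of what such a registry could look like, keyed by content type. To stay runnable without Solr or Lucene on the classpath it uses a String in place of ContentStream and a Map in place of Document; those stand-ins and all names below are assumptions, not an existing API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Stand-in for the proposed SolrDocumentParser: String replaces
// ContentStream and Map<String, String> replaces Document.
interface SketchDocumentParser {
    Map<String, String> parse(String content);
}

// Hypothetical registry: looks up a parser by content type, case-insensitively.
final class ParserRegistry {
    private final Map<String, SketchDocumentParser> byType = new HashMap<>();

    void register(String contentType, SketchDocumentParser parser) {
        byType.put(contentType.toLowerCase(Locale.ROOT), parser);
    }

    SketchDocumentParser get(String contentType) {
        SketchDocumentParser parser = byType.get(contentType.toLowerCase(Locale.ROOT));
        if (parser == null) {
            throw new IllegalArgumentException("no parser registered for " + contentType);
        }
        return parser;
    }
}
```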


Re: Using HTTP-Post for Queries

2007-01-21 Thread Yonik Seeley

On 1/21/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:

On Jan 20, 2007, at 9:23 PM, Chris Hostetter wrote:
> : > In the back of my mind, I've been thinking about *how* to support
> : > multiple query syntaxes.
> : > trying to add new parameters everywhere specifying the type doesn't
> : > seem like a great idea (way too many places).
>
> : Good point.  Yeah, it does make sense for the query type to be part
> : of the query string itself.  There are lots of places that a
> : QueryParser expression can currently be used (&q=, &fq= with standard
> : requests, and other places with dismax).
>
> are you guys concerned that people will want to use different syntaxes for
> different places where a query is expressed, ie: use the stock syntax in
> one "fq", the xml syntax in a "q", and the surround syntax in a second
> "fq" ?

Yes, I think different syntaxes in different places would be useful.
For example, a user enters a full-text search query that is suitable
to use with Solr's QueryParser, and then the user facets a bit.  The
facet "queries" really aren't queries at all, but rather terms that
don't need to be parsed.  Building up a string to be parsed like
"#{params[:field]}:#{params[:value]}" is tricky because of escaping
syntax (like colons).  So fq wouldn't need to be parsed at all,
except to pull the field name out to build a TermQuery
straightforwardly.


fq params are often queries too (think price ranges, etc).

For syntax, what about

myfield:my unescaped value all as a single term

Or

my unescaped value all as a single term

Of course, while that prefix looks decent in bare text, adding it to
an XML config file would look ugly since < would need escaping.

Another syntax option... something like

!term:myfield:my unescaped value all as a single term
#term:
%term:
@term:

(basically, find a prefix that would be unlikely to appear as an
actual term or wildcard in lucene queryparser syntax)

-Yonik


Re: Using HTTP-Post for Queries

2007-01-21 Thread Yonik Seeley

On 1/21/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

Another syntax option... something like

!term:myfield:my unescaped value all as a single term
#term:
%term:
@term:

(basically, find a prefix that would be unlikely to appear as an
actual term or wildcard in lucene queryparser syntax)


More brainstorming... use same char on both sides for simplicity, and
avoid the ':' char because it's so common in lucene syntax (just for
aesthetics... it's not ambiguous).

!term!myfield:foo
|term|myfield:foo
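
A minimal sketch of how the "!name!" form could be recognized, assuming the syntax name is letters only and everything after the closing '!' is the body. For the term syntax, the body is then cut at the first ':' and the rest is taken as one unescaped term. All names here are hypothetical.

```java
// Hypothetical parser for a "!syntax!body" prefix on a query string.
final class SyntaxPrefix {
    final String syntax; // e.g. "term", or the default when no prefix is present
    final String body;   // the query string with the prefix removed

    private SyntaxPrefix(String syntax, String body) {
        this.syntax = syntax;
        this.body = body;
    }

    static SyntaxPrefix parse(String raw, String defaultSyntax) {
        if (raw.startsWith("!")) {
            int end = raw.indexOf('!', 1);
            if (end > 1 && isName(raw.substring(1, end))) {
                return new SyntaxPrefix(raw.substring(1, end), raw.substring(end + 1));
            }
        }
        // No recognizable prefix: leave the string alone and use the default.
        return new SyntaxPrefix(defaultSyntax, raw);
    }

    // Letters only, so "!a !b" (two negated terms) is not taken for a prefix.
    private static boolean isName(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetter(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // For the "term" syntax: split at the first ':' only, so later colons
    // stay part of the unescaped value.
    static String[] fieldAndValue(String body) {
        int colon = body.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("expected field:value, got: " + body);
        }
        return new String[] { body.substring(0, colon), body.substring(colon + 1) };
    }
}
```

A stock-syntax query like "!a!b" would still be read as a prefix by this sketch, which is the kind of ambiguity the choice of marker character has to weigh.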

-Yonik


Re: Split one string into many fields

2007-01-21 Thread Yonik Seeley

On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

Deep within the "Update Plugin" discussion, Hoss and I agreed that
adding an interface and registry for DocumentParsers is a good idea:

interface SolrDocumentParser
{
   Document parse(ContentStream content);
}

SolrDocumentParser parser = core.getDocumentParse( "text/html");

This would let update plugins share (pluggable) logic for how to
convert a single stream into a single document...  this is more than
we are talking about doing now, but something (else) to keep in mind.


Yes, please, for another day... ;-)

It would be interesting to explore what we could share with Nutch
too... they're in the business of doc parsing.

When we get to it, I'd like to hear why it (things like PDF parsing)
should be inside Solr rather than outside using our update interfaces.

-Yonik