Re: Using HTTP-Post for Queries
: > 4) now it's easy to subclass StandardRequestHandler as
: > XmlQuerySyntaxRequestHandler or SurroundSyntaxRequestHandler
: > just by overriding getQueryParser.

: Blech. All the other ideas are good, and rather orthogonal to
: declaring the syntax in the query itself.

yeah ... i didn't say i was particularly fond of this idea in light of this discussion ... just that it was what i had been thinking before, and would punt on these issues.

: Think about it... if we only had *one* base query and no others,
: wouldn't it make sense to just add a single parameter called q.syntax?
: That's the way I would go. Except we have many queries all over the
: place.

maybe ... except that as you said: every different query parser you can imagine might have a bunch of complex options you would need to set on it -- forget all the bq and fq params that the DisMaxHandler supports, assume it was only going to deal with a single "q" -- even if StandardRequestHandler had supported an easy way of replacing the QueryParser with a param back when i wrote dismax, it still would have made sense to write a new request handler in order to pass in the qf, pf, ps, mm, and tie options the DisMaxQueryParser needs just to make sense of the "q" param

-Hoss
Re: Using HTTP-Post for Queries
On Jan 20, 2007, at 9:23 PM, Chris Hostetter wrote:
> : > In the back of my mind, I've been thinking about *how* to support
> : > multiple query syntaxes.
> : > trying to add new parameters everywhere specifying the type doesn't
> : > seem like a great idea (way too many places).
>
> : Good point. Yeah, it does make sense for the query type to be part
> : of the query string itself. There are lots of places that a
> : QueryParser expression can currently be used (&q=, &fq= with standard
> : requests, and other places with dismax).
>
> are you guys concerned that people will want to use different syntaxes for
> different places where a query is expressed, ie: use the stock syntax in
> one "fq", the xml syntax in a "q", and the surround syntax in a second
> "fq" ?

Yes, I think different syntaxes in different places would be useful. For example, a user enters a full-text search query that is suitable to use with Solr's QueryParser, and then the user facets a bit. The facet "queries" really aren't queries at all, but rather terms that don't need to be parsed. Building up a string to be parsed like "#{params[:field]}:#{params[:value]}" is tricky because of escaping syntax (like colons). So fq wouldn't need to be parsed at all, except to pull the field name out to build a TermQuery straightforwardly.

Also, I think many applications using Solr would do well to parse their own query strings and hand them to Solr pre-digested, perhaps using the XMLQueryParser, yet using a simplified facet query syntax as above.

> putting a prefix on every query string to identify the syntax seems
> cumbersome ... but i guess it wouldn't be too bad if it was done as an
> "override" for a single q.syntax option that influenced all "query
> strings" type params that didn't start with the special markup.

Yeah, a default would be necessary to keep things clean for the cases where the syntax can be assumed to be the same for all, though maybe the default choice of syntax would be parameter-specific also, so q could be QueryParser, and fq could be, say, a new SimpleTermFacetParser. Erik
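To make the escaping point concrete, here is a minimal sketch of what a parser along the lines of the SimpleTermFacetParser floated above might do. Everything in it is an assumption for illustration: the class and method names are hypothetical, it uses no Solr or Lucene classes, and it only shows the "split at the first colon, leave the rest untouched" idea rather than any actual Solr behaviour.

```java
import java.util.Objects;

/** Hypothetical helper: splits "field:raw value" at the first colon only. */
final class SimpleTermFacetSketch {

    /** Field name plus the term text, kept exactly as the client sent it. */
    static final class FieldTerm {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = Objects.requireNonNull(field);
            this.text = Objects.requireNonNull(text);
        }
    }

    /** No query parsing and no unescaping: everything after the first ':' is the term. */
    static FieldTerm parse(String fq) {
        int colon = fq.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("expected field:value but got: " + fq);
        }
        return new FieldTerm(fq.substring(0, colon), fq.substring(colon + 1));
    }

    public static void main(String[] args) {
        // Colons and parentheses in the value need no escaping by the client.
        FieldTerm t = parse("title:Solr: the (un)official guide");
        System.out.println(t.field + " -> [" + t.text + "]");
    }
}
```

The resulting field and text pair is what would be handed to a term query directly, so the client never has to escape the value for the full query parser.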
Split one string into many fields
Is there any easy way to split a string into a multi-field on the server:

given:

subject1; subject2; subject- 3

I would like:

subject1
subject2
subject- 3

Thanks for any pointers

ryan
Re: Split one string into many fields
On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> Is there any easy way to split a string into a multi-field on the server:

From an indexing perspective, yes... just assign a tokenizer that splits on ';'. I don't think we currently have such a configurable Tokenizer though.

The (hypothetical) tokenizer could even add a positionIncrement, emulating multiple fields exactly from the indexing perspective. Then you could follow it with the newly added TrimFilter to trim whitespace.

From the stored field perspective, you get back what you put in.

To be nice and general, perhaps it could be regex based like String.split()

-Yonik

> given:
>
> subject1; subject2; subject- 3
>
> I would like:
>
> subject1
> subject2
> subject- 3
>
> Thanks for any pointers
>
> ryan
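A rough sketch of the splitting idea, in plain Java rather than against the Lucene Tokenizer API. The Token class, the tokenize method and the gap value are all hypothetical stand-ins. It only illustrates the three steps suggested above: a regex split, a trim, and a larger position increment between pieces so that they behave like separate values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a regex-splitting tokenizer followed by a trim step. */
final class SplitTokenizerSketch {

    /** Minimal token: its text and its distance from the previous token. */
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Splits value on regex, trims each piece and drops empty ones.
     * The first token gets increment 1; later tokens get 1 + gap,
     * so phrase queries cannot match across two pieces.
     */
    static List<Token> tokenize(String value, String regex, int gap) {
        List<Token> tokens = new ArrayList<>();
        for (String piece : Pattern.compile(regex).split(value)) {
            String trimmed = piece.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int increment = tokens.isEmpty() ? 1 : 1 + gap;
            tokens.add(new Token(trimmed, increment));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The gap of 100 is an arbitrary value chosen for the example.
        for (Token t : tokenize("subject1; subject2; subject- 3", ";", 100)) {
            System.out.println(t.positionIncrement + "\t" + t.text);
        }
    }
}
```

This only affects what would be indexed. As noted above, the stored value would still be the original unsplit string.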
Re: Split one string into many fields
Are you suggesting something like this:

...

On 1/21/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> > Is there any easy way to split a string into a multi-field on the server:
>
> From an indexing perspective, yes... just assign a tokenizer that
> splits on ';'. I don't think we currently have such a configurable
> Tokenizer though.
>
> The (hypothetical) tokenizer could even add a positionIncrement,
> emulating multiple fields exactly from the indexing perspective. Then
> you could follow it with the newly added TrimFilter to trim whitespace.
>
> From the stored field perspective, you get back what you put in.
>
> To be nice and general, perhaps it could be regex based like String.split()
>
> -Yonik
>
> > given:
> >
> > subject1; subject2; subject- 3
> >
> > I would like:
> >
> > subject1
> > subject2
> > subject- 3
> >
> > Thanks for any pointers
> >
> > ryan
Re: Split one string into many fields
On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: Are you suggesting something like this: ... Exactly, except for that bit... what's that? -Yonik
Re: Split one string into many fields
On 1/21/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: > Are you suggesting something like this: > > > sortMissingLast="true" omitNorms="true"> > > > > > > ... > > Exactly, except for that bit... what's that? Maybe the name is wrong, but it is something to tell the updateHandler to use the tokenizer and filters (normally used for analysis) to convert the single field into many fields. I want something that is equivalent to splitting the string on the client side and filling multiple *fields* not just tokens. or are you suggesting: ...
Re: Split one string into many fields
On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> Maybe the name is wrong, but it is something to tell the updateHandler
> to use the tokenizer and filters (normally used for analysis) to
> convert the single field into many fields.
>
> I want something that is equivalent to splitting the string on the
> client side and filling multiple *fields* not just tokens.

Oh, I was talking about indexing only.

Why is it that multiple fields are needed? Multiple tokens are indistinguishable from multiple fields during search.

Actually splitting things into different fields normally happens in the client (outside Solr), or in a specialized handler (like CSV, SQL, etc).

-Yonik
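For comparison, a minimal sketch of the client-side route: split the raw string before posting and emit one field element per piece in Solr's XML add message. The class name, the helper names and the "id" field are assumptions made up for this example. Only the add/doc/field message shape is Solr's.

```java
/** Hypothetical client-side helper that turns one delimited string into many field values. */
final class ClientSideSplitSketch {

    /** Escapes the characters that are unsafe in XML text and attribute values. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds an add message with one field element per ';'-separated piece. */
    static String toAddXml(String id, String fieldName, String rawValue) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        // "id" as the unique key field name is an assumption about the schema.
        xml.append("<field name=\"id\">").append(escapeXml(id)).append("</field>");
        for (String piece : rawValue.split(";")) {
            String value = piece.trim();
            if (!value.isEmpty()) {
                xml.append("<field name=\"").append(escapeXml(fieldName)).append("\">")
                   .append(escapeXml(value))
                   .append("</field>");
            }
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toAddXml("1", "subject", "subject1; subject2; subject- 3"));
    }
}
```

With a multiValued field in the schema, each piece comes back as its own stored value, which covers the display case as well as search.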
Re: Split one string into many fields
> > I want something that is equivalent to splitting the string on the
> > client side and filling multiple *fields* not just tokens.
>
> Oh, I was talking about indexing only.

aaah.

> Why is it that multiple fields are needed? Multiple tokens are
> indistinguishable from multiple fields during search.

When the app displays search results, it shows a list of subjects (from the returned doc list). That should be split properly. (Ideally without knowledge of the schema)

> Actually splitting things into different fields normally happens in
> the client (outside Solr), or in a specialized handler (like CSV, SQL,
> etc).

In the case I'm looking at, it would be cleaner and more safe to have it on the server side...

I guess i have to wait for the "Update Plugins" discussion to wind down!
Re: Split one string into many fields
On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> > > I want something that is equivalent to splitting the string on the
> > > client side and filling multiple *fields* not just tokens.
> >
> > Oh, I was talking about indexing only.
>
> aaah.
>
> > Why is it that multiple fields are needed? Multiple tokens are
> > indistinguishable from multiple fields during search.
>
> When the app displays search results, it shows a list of subjects
> (from the returned doc list). That should be split properly. (Ideally
> without knowledge of the schema)
>
> > Actually splitting things into different fields normally happens in
> > the client (outside Solr), or in a specialized handler (like CSV, SQL,
> > etc).
>
> In the case I'm looking at, it would be cleaner and more safe to have
> it on the server side...

Safer? It precludes adding a subject with a ';' in it...

Solr currently assumes your data is structured. Lucene does too... an analyzer in lucene can't create more fields or take info from one field and add it to another.

An aside: your need sounds like it's part of that much bigger issue of processing documents and splitting them up into multiple fields, or at least processing certain fields in a way that can add other fields.

I'm not sure what a general solution would look like in that case. For example, you might have a field called "mail-headers", and want that split up into multiple fields. Another longer term thing to keep our eye on is UIMA (added to the Apache incubator not that long ago).

-Yonik
Re: Split one string into many fields
> > In the case I'm looking at, it would be cleaner and more safe to have
> > it on the server side...
>
> Safer? It precludes adding a subject with a ';' in it...

well, in *this* case it is :)

> An aside: your need sounds like it's part of that much bigger issue of
> processing documents and splitting them up into multiple fields, or at
> least processing certain fields in a way that can add other fields.

Yes, it is. I'm working with data that is almost structured, but I'd like to have some level of validation and reprocessing before sticking it in solr. I'll use SOLR-104 as that seems like the right thing.

> I'm not sure what a general solution would look like in that case.
> For example, you might have a field called "mail-headers", and want
> that split up into multiple fields. Another longer term thing to keep
> our eye on is UIMA (added to the Apache incubator not that long ago).

Deep within the "Update Plugin" discussion, Hoss and I agreed that adding an interface and registry for DocumentParsers is a good idea:

    interface SolrDocumentParser {
      Document parse(ContentStream content);
    }

    SolrDocumentParser parser = core.getDocumentParse( "text/html");

This would let update plugins share (pluggable) logic for how to convert a single stream into a single document... this is more than we are talking about doing now, but something (else) to keep in mind.
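A minimal sketch of how such a registry could hang together, assuming a simple lookup keyed by content type. None of this is existing Solr code. The class and method names are hypothetical, a plain String stands in for ContentStream, and a map of field names to values stands in for Document, so that the example runs without Solr or Lucene on the classpath.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical registry mapping a content type to a pluggable document parser. */
final class DocumentParserRegistrySketch {

    /** Stand-in for the proposed interface: String for the stream, Map for the document. */
    interface DocumentParser {
        Map<String, List<String>> parse(String content);
    }

    private final Map<String, DocumentParser> parsers = new HashMap<>();

    /** Registers a parser; content types are matched case-insensitively. */
    void register(String contentType, DocumentParser parser) {
        parsers.put(contentType.toLowerCase(Locale.ROOT), parser);
    }

    /** Looks up the parser for a content type, failing loudly if none is registered. */
    DocumentParser getDocumentParser(String contentType) {
        DocumentParser parser = parsers.get(contentType.toLowerCase(Locale.ROOT));
        if (parser == null) {
            throw new IllegalArgumentException("no parser registered for " + contentType);
        }
        return parser;
    }

    public static void main(String[] args) {
        DocumentParserRegistrySketch registry = new DocumentParserRegistrySketch();
        // Toy parser: the whole body becomes a single "body" field.
        registry.register("text/plain", content -> Map.of("body", List.of(content.trim())));
        System.out.println(registry.getDocumentParser("text/plain").parse(" hello "));
    }
}
```

An update plugin would then ask the registry for a parser by the content type of the incoming stream instead of hard-coding the conversion itself.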
Re: Using HTTP-Post for Queries
On 1/21/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> On Jan 20, 2007, at 9:23 PM, Chris Hostetter wrote:
> > : > In the back of my mind, I've been thinking about *how* to support
> > : > multiple query syntaxes.
> > : > trying to add new parameters everywhere specifying the type doesn't
> > : > seem like a great idea (way too many places).
> >
> > : Good point. Yeah, it does make sense for the query type to be part
> > : of the query string itself. There are lots of places that a
> > : QueryParser expression can currently be used (&q=, &fq= with standard
> > : requests, and other places with dismax).
> >
> > are you guys concerned that people will want to use different syntaxes for
> > different places where a query is expressed, ie: use the stock syntax in
> > one "fq", the xml syntax in a "q", and the surround syntax in a second
> > "fq" ?
>
> Yes, I think different syntaxes in different places would be useful.
> For example, a user enters a full-text search query that is suitable
> to use with Solr's QueryParser, and then the user facets a bit. The
> facet "queries" really aren't queries at all, but rather terms that
> don't need to be parsed. Building up a string to be parsed like
> "#{params[:field]}:#{params[:value]}" is tricky because of escaping
> syntax (like colons). So fq wouldn't need to be parsed at all, except
> to pull the field name out to build a TermQuery straightforwardly.

fq params are often queries too (think price ranges, etc).

For syntax, what about myfield:my unescaped value all as a single term Or my unescaped value all as a single term

Of course, while that prefix looks decent in bare text, adding it to an XML config file would look ugly since < would need escaping.

Another syntax option... something like

!term:myfield:my unescaped value all as a single term
#term:
%term:
@term:

(basically, find a prefix that would be unlikely to appear as an actual term or wildcard in lucene queryparser syntax)

-Yonik
Re: Using HTTP-Post for Queries
On 1/21/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Another syntax option... something like
>
> !term:myfield:my unescaped value all as a single term
> #term:
> %term:
> @term:
>
> (basically, find a prefix that would be unlikely to appear as an
> actual term or wildcard in lucene queryparser syntax)

More brainstorming... use the same char on both sides for simplicity, and avoid the ':' char because it's so common in lucene syntax (just for aesthetics... it's not ambiguous).

!term!myfield:foo
|term|myfield:foo

-Yonik
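A small sketch of how the !term! style of markup could be recognised, falling back to a default syntax when no prefix is present. The class name, the "lucene" default name and the rule that a syntax name is letters only are all assumptions made for the example. Nothing here reflects a decided Solr syntax.

```java
/** Hypothetical detector for a "!name!" syntax prefix at the start of a query string. */
final class SyntaxPrefixSketch {

    /** The chosen syntax name and the remaining, unparsed query text. */
    static final class Parsed {
        final String syntax;
        final String body;

        Parsed(String syntax, String body) {
            this.syntax = syntax;
            this.body = body;
        }
    }

    /** Returns the prefixed syntax if raw starts with "!name!", else the default syntax. */
    static Parsed parse(String raw, String defaultSyntax) {
        if (raw.length() > 2 && raw.charAt(0) == '!') {
            int close = raw.indexOf('!', 1);
            if (close > 1 && isLetters(raw, 1, close)) {
                return new Parsed(raw.substring(1, close), raw.substring(close + 1));
            }
        }
        // No well-formed prefix: treat the whole string as a query in the default syntax.
        return new Parsed(defaultSyntax, raw);
    }

    private static boolean isLetters(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            if (!Character.isLetter(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Parsed p = parse("!term!myfield:foo", "lucene");
        System.out.println(p.syntax + " / " + p.body);
    }
}
```

Requiring a closing delimiter and a letters-only name means a query such as "!foo", where '!' is used as the query parser's NOT operator, still falls through to the default syntax.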
Re: Split one string into many fields
On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> Deep within the "Update Plugin" discussion, Hoss and I agreed that
> adding an interface and registry for DocumentParsers is a good idea:
>
>     interface SolrDocumentParser {
>       Document parse(ContentStream content);
>     }
>
>     SolrDocumentParser parser = core.getDocumentParse( "text/html");
>
> This would let update plugins share (pluggable) logic for how to
> convert a single stream into a single document... this is more than we
> are talking about doing now, but something (else) to keep in mind.

Yes, please, for another day... ;-)

It would be interesting to explore what we could share with Nutch too... they're in the business of doc parsing.

When we get to it, I'd like to hear why it (things like PDF parsing) should be inside Solr rather than outside using our update interfaces.

-Yonik