dtSearch parser & Introduction
Hello, all!

I'm a BloombergBNA employee and need to obtain/write a dtSearch parser for Solr (and probably a bunch of other things a little later).

I've looked at the available parsers and thought that the surround parser may do the trick, but it apparently doesn't like nested N or W subqueries. I looked at XmlQueryParser and I'm most impressed with it from a functionality perspective. I liked the SpanQueries, but I either don't understand SpanNot or it has a bug for the exclude.

At the end of the day, we will need to continue to support dtSearch syntax. I may as well just bite the bullet and write the dtSearch parser and include it as a patch for Solr.

Here are my immediate issues:

- I don't know the best path forward on making the parser (I saw something in the HowToContribute page at the bottom about JFlex).
- Can someone please take pity on me and help me get started down this path? I probably won't need a lot of help.
- I'm great at .NET, not so much Java--yet. I've not yet been able to build a trunk and "deploy" it (I can build it and run tests, but not run it--I'm sure I'm just missing an elusive documentation link on how to do that).
- I downloaded and got the Solr trunk in Eclipse. I'm not sure of the best way of adding unit tests for my stuff--do I add it to an existing subdirectory or create a new package?

I think it'd be great if I could get a bare-bones example of a parser so that I can modify it--perhaps even keeping it in a separate Java project.

Don't feel like you have to answer all of my questions--an answer to any of them would be quite helpful.

Thank you guys and God bless!
Re: SpanQuery - How to wrap a NOT subquery
Thank you, Timothy. I have support for and am using SpanNotQuery elsewhere. Maybe there is another use for it that I'm not considering. I'm wondering if there's a clever way of reusing it in order to satisfy the requirements of proximity NOTs, too.

dtSearch allows a user to have NOTs embedded in proximity searches. For example, let's say you have an index whose ID has been converted to English phrases, so that 1001 would be "One thousand one":

    "one thousand one hundred" pre/0 (thirty and not (six or seven))

Returns: 1130, 1131, 1132, 1133, 1134, 1135, 1138, 1139

Perhaps I've been staring at the screen too long and the obvious answer is hiding from me.

Here's how I'm trying to implement it, but it's incorrect... It's giving me 1130..1139 without excluding anything.

    public Query visitNot_expr(Not_exprContext ctx) {
        //ProximityNotSupportedFor("NOT");
        Query subquery = visit(ctx.expr());
        BooleanQuery.Builder query = new BooleanQuery.Builder();
        query.add(subquery, BooleanClause.Occur.MUST_NOT);
        // TODO: Consolidate this so that we don't use MatchAllDocsQuery, but using the other query, to increase performance
        query.add(new MatchAllDocsQuery(), BooleanClause.Occur.SHOULD);
        if (currentlyInASpanQuery) {
            SpanQuery matchAllDocs = getSpanWildcardQuery(new Term(defaultFieldName, "*"));
            SpanNotQuery snq = new SpanNotQuery(matchAllDocs, (SpanQuery) subquery, Integer.MAX_VALUE, Integer.MAX_VALUE);
            return snq;
        } else {
            return query.build();
        }
    }

    protected SpanQuery getSpanWildcardQuery(Term term) {
        WildcardQuery wq = new WildcardQuery(term);
        SpanQuery swq = new SpanMultiTermQueryWrapper<>(wq);
        return swq;
    }

On Mon, Jun 20, 2016 at 2:53 PM, Allison, Timothy B. wrote:

> Bouncing over to user’s list.
>
> As you’ve found, spans are different from regular queries. MUST_NOT at
> the BooleanQuery level means that the term must not appear anywhere in the
> document, whereas spans focus on terms near each other.
>
> Have you tried SpanNotQuery? This would allow you at least to do
> something like:
>
> termA but not if zyx or yyy appears X words before or Y words after
>
> *From:* Brandon Miller [mailto:computerengineer.bran...@gmail.com]
> *Sent:* Monday, June 20, 2016 2:36 PM
> *To:* d...@lucene.apache.org
> *Subject:* SpanQuery - How to wrap a NOT subquery
>
> Greetings!
>
> I'm wanting to support this:
>
> TermA within_N_terms_of (abc and cba or xyz and not zyx or not yyy)
>
> Focusing on the sub-query:
>
> I have ANDs and ORs figured out (special tricks playing with slops and
> such).
>
> I'm having the hardest time figuring out how to wrap a NOT.
>
> Outside of SpanQuery, I'm using a BooleanQuery with a MUST_NOT clause.
> That's fine (if you know another way, I'd like to hear that, too, but this
> appears to work dandy).
>
> However, SpanQuery requires queries that are also of type SpanQuery;
> SpanMultiTermQueryWrapper will allow you to throw in anything derived from
> MultiTermQuery (which includes AutomatonQuery).
>
> Right now, I'm at a loss. We have huge, complex, nested boolean queries
> inside proximity operators with our current solution.
>
> If I need to write a custom solution, then that's what I need to hear and
> perhaps a couple of pointers.
>
> Thanks a bunch and God bless!
>
> Brandon
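The distinction Timothy draws between a document-level MUST_NOT and a span-level exclusion can be pictured without Lucene at all. The following is a toy sketch, not Lucene code: documents are plain token lists, `docLevelNot` mimics a MUST_NOT clause (reject if the excluded term occurs anywhere), and `positionalNot` mimics a span-style exclusion (reject only if the excluded term falls within a window right after the included term). The class name, method names, and sample documents are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model (not Lucene): document-level NOT versus position-aware NOT. */
final class NotSemanticsSketch {

    /** MUST_NOT-style check: match if include occurs and exclude occurs nowhere in the document. */
    static boolean docLevelNot(List<String> tokens, String include, String exclude) {
        return tokens.contains(include) && !tokens.contains(exclude);
    }

    /**
     * Span-style check: match if some occurrence of include has no occurrence of
     * exclude in the `window` token positions immediately after it.
     */
    static boolean positionalNot(List<String> tokens, String include, String exclude, int window) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(include)) {
                continue;
            }
            boolean blocked = false;
            for (int j = i + 1; j <= i + window && j < tokens.size(); j++) {
                if (tokens.get(j).equals(exclude)) {
                    blocked = true;
                }
            }
            if (!blocked) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 6130: "six" is in the document, but nowhere near "thirty".
        List<String> d6130 = Arrays.asList("six", "thousand", "one", "hundred", "thirty");
        // 1136: "six" directly follows "thirty".
        List<String> d1136 = Arrays.asList("one", "thousand", "one", "hundred", "thirty", "six");

        System.out.println(docLevelNot(d6130, "thirty", "six"));      // false: "six" appears somewhere
        System.out.println(positionalNot(d6130, "thirty", "six", 1)); // true: no "six" right after "thirty"
        System.out.println(docLevelNot(d1136, "thirty", "six"));      // false
        System.out.println(positionalNot(d1136, "thirty", "six", 1)); // false: "six" is right after "thirty"
    }
}
```

The two checks disagree only on the first document, which is the case that matters for proximity NOTs: the excluded word exists in the document but not near the included one.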
Re: SpanQuery - How to wrap a NOT subquery
I saw the second post--the first post was new to me. We plan on connecting with those people later on, but right now, I'm trying to write a stop-gap dtSearch compiler until we can at least secure the funding we need to employ their help.

Right now, I have a very functional query parser, with just a few holes needing to be patched.

I rewrote my AND NOT and OR NOT queries. Now I'm perplexed why this query is not working as expected:

    spanNear([
      spanNear([field:one, field:thousand, field:one, field:hundred], 0, true),
      spanNot(field:thirty, spanOr([field:six, field:seven]), 2147483647, 2147483647)
    ], 0, true)

is returning 1130..1139.

    expected:<[1130, 1131, 1132, 1133, 1134, 1135, 1138, 1139]>
    but was:<[1130, 1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139]>

I would've expected 1136 and 1137 to have been excluded.

Original dtSearch string:

    "one thousand one hundred" pre/0 (thirty and not (six or seven))

I even tried it with pre/5 to see if there was something funny going on with that, but it gave the same results: 1130..1139.

If you can tell me what it should look like when the SpanQuery is converted to a string, I should be able to figure out the rest. Perhaps I'm misunderstanding the pre/post parameters?

Thank you for any help!

On Tue, Jun 21, 2016 at 9:46 AM, Allison, Timothy B. wrote:

> > dtSearch allows a user to have NOTs embedded in proximity searches.
>
> And, if you're heading down the path of building your own queryparser to
> handle dtSearch's syntax, please read and heed Charlie Hull's post:
>
> http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/
>
> See also:
>
> http://www.flax.co.uk/blog/2012/04/24/dtsolr-an-open-source-replacement-for-the-dtsearch-closed-source-search-engine/
Re: SpanQuery - How to wrap a NOT subquery
Awesome, 0 pre and 1 post works!

I replaced pre with Integer.MAX_VALUE and post with Integer.MAX_VALUE - 5 and it works! If I replace post with Integer.MAX_VALUE - 4 (or -3, -2, -1, -0), it fails. But, if it's -(5+), it appears to work.

Thank you guys for suffering through my inexperience with Solr.

*NOTE*: In case someone would find it helpful to follow my reasoning before I discovered the work-around above:

I don't understand why 0,1 works and Integer.MAX_VALUE, Integer.MAX_VALUE doesn't. I mean I know that six and seven both come one word after thirty *in this case*. This is case-dependent. What if I wanted to match thirty, but exclude if six or seven are included anywhere in the document? How will I know what numbers to plug into pre and post when they could be anywhere in the document?

In this case, those numbers worked. Why didn't the big numbers work? After all, six and seven were unique throughout the whole number (i.e. six and seven were only at the end of the document).

I also tried 0 pre and 0 post, but that gave me the same as I had when I had pre and post as really large numbers.

On Tue, Jun 21, 2016 at 12:50 PM, Allison, Timothy B. wrote:

> >Perhaps I'm misunderstanding the pre/post parameters?
>
> Pre/post parameters: "'six' or 'seven' should not appear $pre tokens
> before 'thirty' or $post tokens after 'thirty'"
>
> Maybe something like this:
>
> spanNear([
>   spanNear([field:one, field:thousand, field:one, field:hundred], 0, true),
>   spanNot(field:thirty, spanOr([field:six, field:seven]), 0, 1)
> ], 0, true)
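One possible explanation for the Integer.MAX_VALUE behaviour described above is plain 32-bit integer overflow. The sketch below is not Lucene's code. It assumes, purely for illustration, that the exclusion test has the shape "keep the include span when includeEnd + post <= excludeStart", evaluated in int arithmetic, with end positions exclusive. Under that assumption, "thirty" at position 4 has end position 5, so any post larger than Integer.MAX_VALUE - 5 makes the sum wrap to a negative number. The class and method names are hypothetical.

```java
/** Toy model (not Lucene): how a post window near Integer.MAX_VALUE could wrap around. */
final class SpanNotWindowSketch {

    /**
     * Assumed shape of the check: the include span is kept when the exclude span
     * starts at or beyond includeEnd + post. Uses int arithmetic, so the sum can overflow.
     */
    static boolean keptWithIntMath(int includeEnd, int post, int excludeStart) {
        return includeEnd + post <= excludeStart;
    }

    /** Same check with the sum widened to long, so it cannot wrap. */
    static boolean keptWithLongMath(int includeEnd, int post, int excludeStart) {
        return (long) includeEnd + post <= excludeStart;
    }

    public static void main(String[] args) {
        // "one thousand one hundred thirty six": thirty is at position 4 (end 5), six starts at 5.
        int includeEnd = 5;
        int excludeStart = 5;
        int[] posts = {0, 1, Integer.MAX_VALUE - 5, Integer.MAX_VALUE - 4, Integer.MAX_VALUE};
        for (int post : posts) {
            System.out.println("post=" + post
                + " kept(int)=" + keptWithIntMath(includeEnd, post, excludeStart)
                + " kept(long)=" + keptWithLongMath(includeEnd, post, excludeStart));
        }
    }
}
```

In this model, post values of 1 and Integer.MAX_VALUE - 5 drop the match, while Integer.MAX_VALUE - 4 and Integer.MAX_VALUE keep it under int arithmetic, which lines up with the results reported in this message. A post of 0 keeps the match under both variants, because an adjacent token does not fall inside a zero-width window. The pre side is not modeled here.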
Re: SpanQuery - How to wrap a NOT subquery
Thank you!! Okay, I think I have that all squared away.

*SpanLastQuery*: I need something like SpanFirstQuery, except that it would be SpanLastQuery. Is there a way to get that to work?

*Proximity weighting getting ignored*: I also need to get span term boosting working. Here's my query:

    "one thousand two hundred thirty" pre/5 (seven:5.32 or three:3 or two:2.9)

Here's the resulting Solr query:

    spanNear([
      spanNear([field:one, field:thousand, field:two, field:hundred, field:thirty], 0, true),
      spanOr([spanOr([(field:seven)^5.32, (field:three)^3.0]), (field:two)^2.9])
    ], 5, true)

It's returning [1232, 1233, 1237]
Expected: [1237, 1233, 1232]

Here's what the scoreDocs has to say about this search's results:

    [doc=1232 score=5.903808 shardIndex=0,
     doc=1233 score=5.903808 shardIndex=0,
     doc=1237 score=5.903808 shardIndex=0]

Notice that the scores were all the exact same. Why don't the boosts appear to be working?

Thank you!

On Tue, Jun 21, 2016 at 1:40 PM, Allison, Timothy B. wrote:

> >Awesome, 0 pre and 1 post works!
>
> Great!
>
> > What if I wanted to match thirty, but exclude if six or seven are
> > included anywhere in the document?
>
> Any time you need "anywhere in the document", use a "regular" query (not
> SpanQuery). As you wrote initially, you can construct a BooleanQuery that
> includes a complex SpanQuery and another Query that is
> BooleanClause.Occur.MUST_NOT.
>
> > I also tried 0 pre and 0 post
>
> You'd use those if you wanted to find something that didn't contain
> something else:
>
> ["William Clinton"~2 Jefferson]!~0,0
>
> Find 'william' within two words of 'clinton', but not if 'jefferson'
> appears between them.
>
> > I replaced pre with Integer.MAX_VALUE and post with Integer.MAX_VALUE -
> > 5 and it works!
>
> I'll have to think about this one...
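Timothy's 0,0 example can be pictured with a toy overlap check. This is a sketch under stated assumptions, not Lucene code: spans are treated as half-open position ranges [start, end), and with pre = post = 0 an include span is dropped only when the excluded token's range overlaps it. The class and method names are hypothetical.

```java
/** Toy model (not Lucene): with pre = post = 0, only overlapping excludes remove a match. */
final class ZeroWindowSketch {

    /** True when half-open ranges [aStart, aEnd) and [bStart, bEnd) share at least one position. */
    static boolean overlaps(int aStart, int aEnd, int bStart, int bEnd) {
        return aStart < bEnd && bStart < aEnd;
    }

    /** Keep the include span unless the single-token exclude at excludePos falls inside it. */
    static boolean kept(int includeStart, int includeEnd, int excludePos) {
        return !overlaps(includeStart, includeEnd, excludePos, excludePos + 1);
    }

    public static void main(String[] args) {
        // "william jefferson clinton": the span william..clinton covers [0, 3), jefferson is at 1.
        System.out.println(kept(0, 3, 1)); // false: jefferson sits between the two names
        // "william clinton jefferson": the span covers [0, 2), jefferson is at 2, just outside.
        System.out.println(kept(0, 2, 2)); // true: jefferson is adjacent, not inside
    }
}
```

By the same logic, a "six" directly after "thirty" is adjacent rather than overlapping, which would be consistent with 0 pre and 0 post not excluding 1136 and 1137 earlier in the thread.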