Join Query Behavior
We're attempting to upgrade from Solr 4.2 to 4.5 but are finding that 4.5 is not "honoring" this join query:

...
&
fq={!join from=project_id_i to=project_id_im}user_id_i:65615 -role_id_i:18 type:UserRole
&

On our Solr 4.2 instance adding/removing that query gives us different (and expected) results, while the query doesn't affect the results at all in 4.5. Are there any known join query behavior differences/fixes between 4.2 and 4.5 that might explain this, or should I be looking at other factors?

Thanks,
Andy Pickler
Re: Join Query Behavior
If it helps to clarify, here's the full query:

/select ? q=*:* & fq=type:ProjectGroup & fq={!join from=project_id_i to=project_id_im}user_id_i:65615 -role_id_i:18 type:UserRole

We have two Solr servers that were indexed from the same database. One of the servers is running Solr 4.2, while the other (test server) is running 4.5.

Solr 4.2:

Solr 4.5.1:

Solr 4.2 returns the expected result with the project IDs "filtered" out from the join query, while the 4.5 query shows *all* results (2642 records). I can leave off the join query in 4.5 and get the same results, which tells me it is having no effect. Is there a change to the join query behavior between these releases, or could I have configured something differently in my 4.5.1 install?

Thanks,
Andy Pickler

On Thu, Oct 24, 2013 at 2:42 PM, Andy Pickler wrote:
> We're attempting to upgrade from Solr 4.2 to 4.5 but are finding that 4.5
> is not "honoring" this join query:
>
> ...
> &
> fq={!join from=project_id_i to=project_id_im}user_id_i:65615 -role_id_i:18 type:UserRole
> &
>
> On our Solr 4.2 instance adding/removing that query gives us different
> (and expected) results, while the query doesn't affect the results at all
> in 4.5. Are there any known join query behavior differences/fixes between
> 4.2 and 4.5 that might explain this, or should I be looking at other
> factors?
>
> Thanks,
> Andy Pickler
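One way to narrow this down is to issue the same request with and without the join filter against both servers and compare the `numFound` values. The sketch below only builds the two URLs; the host, port, and core name (`localhost:8983`, `collection1`) are assumptions, and `rows=0` and `wt=json` are added here purely to keep the comparison responses small.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the real host and core name.
BASE_URL = "http://localhost:8983/solr/collection1/select"

# The join filter exactly as posted in the thread.
JOIN_FQ = ("{!join from=project_id_i to=project_id_im}"
           "user_id_i:65615 -role_id_i:18 type:UserRole")


def build_select_url(include_join, base_url=BASE_URL):
    """Return the /select URL from the thread, with or without the join filter."""
    params = [("q", "*:*"), ("fq", "type:ProjectGroup")]
    if include_join:
        params.append(("fq", JOIN_FQ))
    # rows=0 and wt=json are additions for comparing counts only.
    params += [("rows", "0"), ("wt", "json")]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_select_url(include_join=True))
    print(build_select_url(include_join=False))
```

If both URLs return the same `numFound` on one server but not on the other, the join filter is the part behaving differently, which matches what is described above.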
Highlight: simple.pre/post not being applied always
Solr: 4.5.1

I'm sending in a query of "july" and getting back the results and highlighting I expect, with one exception:

@@@hl@@@Julie@@@endhl@@@ A #Month:July

The simple.pre of @@@hl@@@ and simple.post of @@@endhl@@@ are not being applied to the one case of the field "#Month:July", even though it's included in the highlighting section. I've tried changing various highlighting parameters to no avail. Could someone point me to where to look for why the pre/post aren't being applied?

Thanks,
Andy Pickler
DIH: HTMLStripTransformer in sub-entities?
Solr 4.1.0

We've been using the DIH to pull data in from a MySQL database for quite some time now. We now want to strip all the HTML content out of many fields using the HTMLStripTransformer (http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer). Unfortunately, while it seems to be working fine for "top-level" entities, we can't seem to get it to work for sub-entities:

(not exact schema, reduced for example purposes)

*THIS WORKS!*

*THIS DOESN'T WORK!*

We've tried several different permutations of putting the sub-entity column in different nest levels of the XML to no avail. I'm curious if we're trying something that is just not supported or whether we are just trying the wrong things.

Thanks,
Andy Pickler
Re: DIH: HTMLStripTransformer in sub-entities?
Thanks for the quick reply. Unfortunately, I don't believe my company would want me sharing our exact production schema in a public forum, although I realize it makes it harder to diagnose the problem. The sub-entity is a multi-valued field that does indeed have a relationship to the outer entity. I just left off the 'where' clause from the sub-entity, as I didn't believe it was helpful in the context of this problem. We use the convention of...

SELECT dbColumnName AS solrFieldName

...so that we can relate the database column name to what we want it to be named in the Solr index. I don't think any of this helps you identify my problem, but I tried to address your questions.

Thanks,
Andy

On Tue, Jul 2, 2013 at 9:14 AM, Gora Mohanty wrote:
> On 2 July 2013 20:29, Andy Pickler wrote:
> > Solr 4.1.0
> >
> > We've been using the DIH to pull data in from a MySQL database for quite
> > some time now. We're now wanting to strip all the HTML content out of many
> > fields using the HTMLStripTransformer (
> > http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer).
> > Unfortunately, while it seems to be working fine for "top-level" entities,
> > we can't seem to get it to work for sub-entities:
> >
> > (not exact schema, reduced for example purposes)
>
> Please do not do that. This DIH configuration file does
> not make sense (please see comments below), and we
> are left guessing in the dark. If the file is too large,
> you can share it on something like pastebin.com
>
> > transformer="HTMLStripTransformer" query="
> > SELECT
> > id as blockId,
> > name as blockTitle,
> > content as content
> > FROM engagement_block
> > ">
> > *THIS WORKS!*
> > transformer="HTMLStripTransformer" query="
> > SELECT
> > br.other_content AS replyContent
> > FROM block_reply
> > ">
> > *THIS DOESN'T WORK!*
> [...]
>
> (a) You SELECT replyContent, but the column attribute
> in the field is named "other_content". Nothing should
> be getting indexed into the field.
> (b) Why are your entities nested if the inner entity has no
> relationship to the outer one?
>
> Regards,
> Gora
Re: DIH: HTMLStripTransformer in sub-entities?
That's exactly what turned out to be the problem. We thought we had already tried that permutation but apparently hadn't. I know it's obvious in retrospect. Thanks for the suggestion.

Thanks,
Andy Pickler

On Wed, Jul 3, 2013 at 2:38 PM, Alexandre Rafalovitch wrote:
> On Tue, Jul 2, 2013 at 10:59 AM, Andy Pickler wrote:
>
> > SELECT
> > br.other_content AS replyContent
> > FROM block_reply
> > ">
> > *THIS DOESN'T WORK!*
>
> shouldn't it be
> column="replyContent"
> since you are renaming it in SELECT?
>
> Regards,
> Alex.
>
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
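The fix discussed in this thread (the field's column attribute must name the SELECT alias, not the underlying database column) can be checked mechanically. The following is a rough sketch, not part of DIH: the entity names (`block`, `reply`), the `br` table alias, and the exact layout of the config are illustrative assumptions, since the real schema was not posted. `stripHTML="true"` is the field attribute that HTMLStripTransformer acts on.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative DIH fragment; entity names and the "br" table alias are
# assumptions, only the column/alias relationship comes from the thread.
DATA_CONFIG = """
<entity name="block" transformer="HTMLStripTransformer"
        query="SELECT id AS blockId, name AS blockTitle, content AS content FROM engagement_block">
  <field column="content" stripHTML="true"/>
  <entity name="reply" transformer="HTMLStripTransformer"
          query="SELECT br.other_content AS replyContent FROM block_reply br">
    <field column="replyContent" stripHTML="true"/>
  </entity>
</entity>
"""

# Matches "AS alias" in a SELECT list. Columns selected without an explicit
# alias are not recognized by this simple pattern.
ALIAS_RE = re.compile(r"\bAS\s+(\w+)", re.IGNORECASE)


def mismatched_columns(config_xml):
    """Return (entity name, column) pairs whose column is not a SELECT alias."""
    root = ET.fromstring(config_xml)
    problems = []
    for entity in root.iter("entity"):
        aliases = set(ALIAS_RE.findall(entity.get("query", "")))
        # Only this entity's own fields, not those of nested entities.
        for field in entity.findall("field"):
            if field.get("column") not in aliases:
                problems.append((entity.get("name"), field.get("column")))
    return problems


if __name__ == "__main__":
    print(mismatched_columns(DATA_CONFIG))
```

With `column="other_content"` in the inner entity, as in the original configuration, the check reports that field, because the query renames the column to `replyContent`.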
Top 10 Terms in Index (by date)
Our company has an application that is "Facebook-like" for usage by enterprise customers. We'd like to do a report of "top 10 terms entered by users over (some time period)". With that in mind I'm using the DataImportHandler to put all the relevant data from our database into a Solr 'content' field:

Along with the content is the 'dateCreated' for that content:

I'm struggling with the TermVectorComponent documentation to understand how I can put together a query that answers the 'report' mentioned above. For each document I need each term counted however many times it is entered (content of "I think what I think" would report 'think' as used twice). Does anyone have any insight as to whether I'm headed in the right direction and then what my query would be?

Thanks,
Andy Pickler
Re: Top 10 Terms in Index (by date)
I need "total number of occurrences" across all documents for each term. Imagine this...

Post #1: "I think, therefore I am like you"
Reply #1: "You think too much"
Reply #2: "I think that I think much as you"

Each of those "documents" is put into 'content'. Pretending I don't have stop words, the top term query (not considering dateCreated in this example) would result in something like...

"think": 4
"I": 4
"you": 3
"much": 2
...

Thus, just a "number of documents" approach doesn't work, because if a word occurs more than one time in a document it needs to be counted that many times. That seemed to rule out faceting like you mentioned as well as the TermsComponent (which as I understand also only counts "documents").

Thanks,
Andy Pickler

On Mon, Apr 1, 2013 at 4:31 PM, Tomás Fernández Löbbe wrote:
> So you have one document per user comment? Why not use faceting plus
> filtering on the "dateCreated" field? That would count "number of
> documents" for each term (so, in your case, if a term is used twice in one
> comment it would only count once). Is that what you are looking for?
>
> Tomás
>
> On Mon, Apr 1, 2013 at 6:32 PM, Andy Pickler wrote:
>
> > Our company has an application that is "Facebook-like" for usage by
> > enterprise customers. We'd like to do a report of "top 10 terms entered by
> > users over (some time period)". With that in mind I'm using the
> > DataImportHandler to put all the relevant data from our database into a
> > Solr 'content' field:
> >
> > multiValued="false" required="true" termVectors="true"/>
> >
> > Along with the content is the 'dateCreated' for that content:
> >
> > multiValued="false" required="true"/>
> >
> > I'm struggling with the TermVectorComponent documentation to understand how
> > I can put together a query that answers the 'report' mentioned above. For
> > each document I need each term counted however many times it is entered
> > (content of "I think what I think" would report 'think' as used twice).
> > Does anyone have any insight as to whether I'm headed in the right
> > direction and then what my query would be?
> >
> > Thanks,
> > Andy Pickler
Re: Top 10 Terms in Index (by date)
A key problem with those approaches as well as Lucene's HighFreqTerms class (http://lucene.apache.org/core/4_2_0/misc/org/apache/lucene/misc/HighFreqTerms.html) is that none of them seem to have the ability to combine with a date range query...which is key in my scenario. I'm kinda thinking that what I'm asking to do just isn't supported by Lucene or Solr, and that I'll have to pursue another avenue. If anyone has any other suggestions, I'm all ears. I'm starting to wonder if I need to have some nightly batch job that executes against my database and builds up "that day's top terms" in a table or something.

Thanks,
Andy Pickler

On Tue, Apr 2, 2013 at 7:16 AM, Tomás Fernández Löbbe wrote:
> Oh, I see, essentially you want to get the sum of the term frequencies for
> every term in a subset of documents (instead of the document frequency as
> the FacetComponent would give you). I don't know of an easy/out of the box
> solution for this. I know the TermVectorComponent will give you the tf for
> every term in a document, but I'm not sure if you can filter or sort on it.
> Maybe you can do something like:
> https://issues.apache.org/jira/browse/LUCENE-2393
> or what's suggested here:
> http://search-lucene.com/m/of5Fn1PUOHU/
> but I have never used something like that.
>
> Tomás
>
> On Mon, Apr 1, 2013 at 9:58 PM, Andy Pickler wrote:
>
> > I need "total number of occurrences" across all documents for each term.
> > Imagine this...
> >
> > Post #1: "I think, therefore I am like you"
> > Reply #1: "You think too much"
> > Reply #2: "I think that I think much as you"
> >
> > Each of those "documents" is put into 'content'. Pretending I don't have
> > stop words, the top term query (not considering dateCreated in this
> > example) would result in something like...
> >
> > "think": 4
> > "I": 4
> > "you": 3
> > "much": 2
> > ...
> >
> > Thus, just a "number of documents" approach doesn't work, because if a word
> > occurs more than one time in a document it needs to be counted that many
> > times. That seemed to rule out faceting like you mentioned as well as the
> > TermsComponent (which as I understand also only counts "documents").
> >
> > Thanks,
> > Andy Pickler
> >
> > On Mon, Apr 1, 2013 at 4:31 PM, Tomás Fernández Löbbe <tomasflo...@gmail.com> wrote:
> >
> > > So you have one document per user comment? Why not use faceting plus
> > > filtering on the "dateCreated" field? That would count "number of
> > > documents" for each term (so, in your case, if a term is used twice in one
> > > comment it would only count once). Is that what you are looking for?
> > >
> > > Tomás
> > >
> > > On Mon, Apr 1, 2013 at 6:32 PM, Andy Pickler wrote:
> > >
> > > > Our company has an application that is "Facebook-like" for usage by
> > > > enterprise customers. We'd like to do a report of "top 10 terms entered by
> > > > users over (some time period)". With that in mind I'm using the
> > > > DataImportHandler to put all the relevant data from our database into a
> > > > Solr 'content' field:
> > > >
> > > > stored="false"
> > > > multiValued="false" required="true" termVectors="true"/>
> > > >
> > > > Along with the content is the 'dateCreated' for that content:
> > > >
> > > > multiValued="false" required="true"/>
> > > >
> > > > I'm struggling with the TermVectorComponent documentation to understand how
> > > > I can put together a query that answers the 'report' mentioned above. For
> > > > each document I need each term counted however many times it is entered
> > > > (content of "I think what I think" would report 'think' as used twice).
> > > > Does anyone have any insight as to whether I'm headed in the right
> > > > direction and then what my query would be?
> > > >
> > > > Thanks,
> > > > Andy Pickler
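The "nightly batch job" idea from this thread can be sketched outside Solr: sum term occurrences (not document counts) over the documents whose creation date falls in a range. This is only an illustration under stated assumptions: the dates are made up, and the lowercase `\w+` tokenizer is a stand-in for whatever analysis chain the real 'content' field uses.

```python
import re
from collections import Counter
from datetime import date


def top_terms(docs, start, end, n=10):
    """Sum term occurrences over docs created within [start, end].

    docs is an iterable of (date_created, content) pairs. A term that
    appears twice in one document is counted twice.
    """
    counts = Counter()
    for created, content in docs:
        if start <= created <= end:
            # Simplistic tokenizer (assumption): lowercase word characters.
            counts.update(re.findall(r"\w+", content.lower()))
    return counts.most_common(n)


# The three example "documents" from the thread, with hypothetical dates.
DOCS = [
    (date(2013, 4, 1), "I think, therefore I am like you"),
    (date(2013, 4, 1), "You think too much"),
    (date(2013, 4, 2), "I think that I think much as you"),
]

if __name__ == "__main__":
    for term, count in top_terms(DOCS, date(2013, 4, 1), date(2013, 4, 2), n=4):
        print(term, count)
```

Over the full date range this reproduces the totals listed in the thread (think 4, I 4, you 3, much 2); narrowing the range to a single day restricts the counts to that day's documents.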
MoreLikeThis - No Results
I'm developing a recommendation feature in our app using the MoreLikeThisHandler <http://wiki.apache.org/solr/MoreLikeThisHandler>, and so far it is doing a great job. We're using a user's "competency keywords" as the MLT field list and the user's corresponding document in Solr as the "comparison document". I have found that for one user I'm not receiving any recommendations, and I'm not sure why.

Solr: 4.1.0

*relevant schema*:

*user's values*:

Healthcare Cost Trends

Is it possible that among all the ~40,000 users in this index (about 500 of which have the same competency keywords), the words "healthcare", "cost" and "trends" are just judged by Lucene to not be "significant"? I realize that I may not understand how the MLT Handler is doing things under the covers...I've only been guessing until now based on the (otherwise excellent) results I've been seeing.

Thanks,
Andy Pickler

P.S. For some additional information, the following query:

/mlt?q=objectId:user91813&mlt.fl=competencyKeywords&mlt.interestingTerms=details&debugQuery=true&mlt.match.include=false

...produces the following results...

0
2

objectId:user91813
objectId:user91813
Re: MoreLikeThis - No Results
Answered my own question...

mlt.mintf: Minimum Term Frequency - the frequency below which terms will be ignored in the source doc

Our "source doc" is a set of limited terms...not a large content field. So in our case I need to set that value to 1 (rather than the default of 2). Now I'm getting results...and they indeed are relevant.

Thanks,
Andy Pickler

On Wed, May 22, 2013 at 12:20 PM, Andy Pickler wrote:
> I'm developing a recommendation feature in our app using the
> MoreLikeThisHandler <http://wiki.apache.org/solr/MoreLikeThisHandler>,
> and so far it is doing a great job. We're using a user's "competency
> keywords" as the MLT field list and the user's corresponding document in
> Solr as the "comparison document". I have found that for one user I'm not
> receiving any recommendations, and I'm not sure why.
>
> Solr: 4.1.0
>
> *relevant schema*:
>
> stored="true" multiValued="true" termVectors="true"/>
>
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>
> *user's values*:
>
> Healthcare Cost Trends
>
> Is it possible that among all the ~40,000 users in this index (about 500
> of which have the same competency keywords), that the words "healthcare",
> "cost" and "trends" are just judged by Lucene to not be "significant". I
> realize that I may not understand how the MLT Handler is doing things under
> the covers...I've only been guessing until now based on the (otherwise
> excellent) results I've been seeing.
>
> Thanks,
> Andy Pickler
>
> P.S. For some additional information, the following query:
>
> /mlt?q=objectId:user91813&mlt.fl=competencyKeywords&mlt.interestingTerms=details&debugQuery=true&mlt.match.include=false
>
> ...produces the following results...
>
> 0
> 2
>
> objectId:user91813
> objectId:user91813
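For reference, the resolution above amounts to adding `mlt.mintf=1` to the original request. A small sketch that builds that URL follows; the host, port, and core name are assumptions, and the remaining parameters are the ones from the query in the thread (with `debugQuery` left out).

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the real host and core name.
BASE_URL = "http://localhost:8983/solr/collection1/mlt"


def build_mlt_url(object_id, min_term_freq=1, base_url=BASE_URL):
    """Build the MoreLikeThis request from the thread with an explicit mlt.mintf."""
    params = [
        ("q", "objectId:" + object_id),
        ("mlt.fl", "competencyKeywords"),
        # Keyword fields hold each term only once or twice, so the
        # default minimum term frequency of 2 can drop every term.
        ("mlt.mintf", str(min_term_freq)),
        ("mlt.interestingTerms", "details"),
        ("mlt.match.include", "false"),
    ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_mlt_url("user91813"))
```

Keeping `mlt.interestingTerms=details` in the request makes it easy to confirm that terms from the source document are actually being selected once the threshold is lowered.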