Re: Multiple Words in String
I managed to find both documents with your two input queries.

Add this filter to the query part of your analyzer:

<filter class="solr.PositionFilterFactory" />

=> The main problem is that your query "microsoft" is transformed into one single PhraseQuery, which cannot match the document containing "micro soft". The PositionFilterFactory will transform the query into multiple queries. You can activate the debug mode to see the differences.

You can find more information here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory

Ludovic.

-
Jouve
France.
--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Words-in-String-tp2767964p2770713.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Difference between Solr and Lucidworks distribution
How can they require payment for something that was developed under the apache license?
does overwrite=false work with json
I'm doing some performance benchmarking of Solr and I started with a single big JSON file containing all the docs that I'm sending via curl. The results are fantastic - I'm achieving an indexing rate of about 44,000 docs/sec using this method (these are really small test docs).

In the past I have used CSV, and adding overwrite=false to the URL increased performance when doing a fresh reindex where I know all the document ids are unique. I tried this with the JSON upload, and nothing seemed to change. Is this supposed to work with the JSON update handler?

Anyway, Solr is doing spectacularly against the competition so far. Keep up the great work!

--Dave
Re: Difference between Solr and Lucidworks distribution
Take "Lucidworks for Solr", it's free.

Regards,
Wolfram

-----Original Message-----
From: yehosef [mailto:yeho...@gmail.com]
Sent: Sunday, April 3, 2011 15:57
To: solr-user@lucene.apache.org
Subject: Re: Difference between Solr and Lucidworks distribution

How can they require payment for something that was developed under the apache license?
Re: Difference between Solr and Lucidworks distribution
On Apr 3, 2011, at 6:56am, yehosef wrote:

> How can they require payment for something that was developed under the
> apache license?

It's the difference between free speech and free beer :)

See http://en.wikipedia.org/wiki/Gratis_versus_libre

-- Ken

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
Faceting on multivalued field
Hi,

My index contains a root entity "Post" and a child entity "Comments". Each post can have multiple comments.

data-config.xml:

The schema has all columns of "comment" entity as "MultiValued" fields and all fields are indexed & stored. My requirement is to count the number of comments for each post. Approach I'm taking is to query on "*:*" and faceting the result on "comment_post_id" so that it gives the count of comment occurred for that post.

But I'm getting incorrect result e.g. if a post has 2 comments, the multivalued fields are populated alright but the facet count is coming as 1 (for that post_id). What else do I need to do?

Thanks,
Kaushik
Re: Using EmbeddedSolrServer with static documents
Well, what is "a document on the filesystem"? Solr deals with well-formed XML documents of a specific format. You can't just stream a random file to Solr. Specifically documents look like: value for field . . . perhaps with an .

There are ways for structured documents to be added using the Tika libraries etc. But before we go there, what is it you want to do? What is the nature of your document?

Best
Erick

On Sat, Apr 2, 2011 at 12:35 PM, michael.i wrote:

> Hi,
> I am new to solr so please excuse me if my question sounds basic.
>
> I would like to use the EmbeddedSolrServer.
> It happens that all examples I've found on the web use documents that have
> been generated dynamically such as:
>
> SolrServer solrServer = new EmbeddedSolrServer(container, "core");
> SolrInputDocument doc = new SolrInputDocument();
> doc.addField("docText", "This is a sample file");
> solrServer.add(doc);
> solrServer.commit();
>
> I would like to be able to load a document that is stored on the
> filesystem.
> Ideally, I would have liked to do something such as:
> SolrInputDocument doc = new SolrInputDocument("path/myDoc.txt");
> solrServer.add(doc);
> solrServer.commit();
>
> It does not seem possible to do such thing. Am I missing something? Are
> there some best practices with regards to referring to a document on the
> filesystem?
>
> Thanx!
> Michael.
Re: Using EmbeddedSolrServer with static documents
Hi Erick,
thanx for getting back to me.

"Well, what is "a document on the filesystem"? Solr deals with well-formed XML documents of a specific format."

I would like to index all kinds of documents. For a start I'll be happy to be able to work with xml and html documents.
Re: Multiple Words in String
Is this a general question or specific? You can handle specific ones by using synonyms.

But the general case, that is treating any two pairs of tokens as a single pair seems fraught with unintended consequences, but you know your problem space better than I do.

Best
Erick

On Sat, Apr 2, 2011 at 2:21 PM, Chris Fauerbach wrote:

> Good afternoon everyone!
> I am stumped, and I would love some help. I'm new to solr/lucene,
> but I have thrown myself into it, so I think I have a solid
> understanding. Using the analysis tool in the admin interface, I see
> these words stemmed and processed as I assume they would be, so I'm
> stuck.
>
> In my index, I have two documents, each with a text field, and here
> are example values
>
> 1) microsoft.com
> 2) micro soft
>
> I want to do a search using microsoft or "micro soft" and find both.
> I'm using the dismax interface, the fields are properly listed in the
> config, and I can find both records, but never at the same time.
> Here's my schema.xml for my text field, any thoughts on what I can do
> to find these together?
>
> positionIncrementGap="100">
>
> words="stopwords.txt" enablePositionIncrements="true"/>
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> preserveOriginal="1"/>
> synonyms="syn/index_synonyms.txt" ignoreCase="true" expand="true"/>
> maxGramSize="15" side="front"/>
> maxGramSize="15" side="back"/>
> language="English" protected="protwords.txt"/>
>
> maxGramSize="15" side="front"/>
> maxGramSize="15" side="back"/>
> words="stopwords.txt" enablePositionIncrements="true"/>
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> preserveOriginal="1"/>
> language="English" protected="protwords.txt"/>
Re: Faceting on multivalued field
Hmmm, I think you're misunderstanding faceting. It's counting the number of documents that have a particular value. So if you're faceting on "comment_post_id", there is one and only one document with that value (assuming that the comment_post_ids are unique). Which is what's being reported. This will be quite expensive on a large corpus, BTW.

Is your task to show the totals for *every* document in your corpus or just the ones in a display page? Because if the latter, your app could just count up the number of elements in the XML returned for the multiValued comments field.

If that's not relevant, could you explain a bit more why you need this count?

Best
Erick

On Sun, Apr 3, 2011 at 2:31 PM, Kaushik Chakraborty wrote:

> Hi,
>
> My index contains a root entity "Post" and a child entity "Comments". Each
> post can have multiple comments. data-config.xml:
>
> dataSource="jdbc" query="">
>
> The schema has all columns of "comment" entity as "MultiValued" fields and
> all fields are indexed & stored. My requirement is to count the number of
> comments for each post. Approach I'm taking is to query on "*:*" and
> faceting the result on "comment_post_id" so that it gives the count of
> comment occurred for that post.
>
> But I'm getting incorrect result e.g. if a post has 2 comments, the
> multivalued fields are populated alright but the facet count is coming as 1
> (for that post_id). What else do I need to do?
>
> Thanks,
> Kaushik
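The suggestion above of counting the elements of the multiValued field on the application side can be sketched roughly as follows. This is only an illustration, not SolrJ code: returned documents are modeled as plain maps, and the field name "post_id" is an assumption ("comment_post_id" is the field named in the thread).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CommentCounts {
    // Returns, per document, how many values the given multiValued field holds.
    // Documents are modeled as plain maps here (an assumption for this sketch).
    static Map<String, Integer> countValues(List<Map<String, Object>> docs,
                                            String idField,
                                            String multiValuedField) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map<String, Object> doc : docs) {
            Object value = doc.get(multiValuedField);
            int n;
            if (value instanceof List<?>) {
                n = ((List<?>) value).size();   // multiValued: one entry per comment
            } else {
                n = (value == null) ? 0 : 1;    // missing field or a single value
            }
            counts.put(String.valueOf(doc.get(idField)), n);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical document mirroring the example in the thread: post 46 with 2 comments.
        Map<String, Object> post = new LinkedHashMap<>();
        post.put("post_id", 46);
        post.put("comment_post_id", List.of(46, 46));
        System.out.println(countValues(List.of(post), "post_id", "comment_post_id")); // prints {46=2}
    }
}
```

This only covers the documents actually returned for a page of results, which matches the "display page" case in the message; it does not produce totals for the whole corpus.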
Re: Using EmbeddedSolrServer with static documents
OK, you're still not quite on the right track. You can't just index XML documents without transforming them into valid Solr XML documents. Ditto for HTML.

Take a look at the ExtractingRequestHandler documentation at:
http://wiki.apache.org/solr/ExtractingRequestHandler

Here's some more documentation that might help.
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika

But at root, you have to extract the relevant info from the file in question and form your own valid Solr document and send *that* to Solr if you want to do it by hand. Or you can use the ExtractingRequestHandler to do it for you, but then you need to be aware that it'll do the best it can at putting meta-data information into the appropriate fields in your schema, but you don't have total control over that.

Oh, and why are you using embedded Solr? The normal HTTP request process is recommended, which you can connect to easily with SolrJ.

FWIW
Erick

On Sun, Apr 3, 2011 at 6:48 PM, michael.i wrote:

> Hi Erick,
> thanx for getting back to me.
>
> "Well, what is "a document on the filesystem"? Solr deals
> with well-formed XML documents of a specific format."
>
> I would like to index all kinds of documents. For a start I'll be happy to
> be able to work with xml and html documents.
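For the "do it by hand" route described above, a minimal sketch might look like the following. It wraps a file's text in Solr's XML update format (`<add><doc><field name="...">`). The field name "docText" comes from the thread, while "id" and the class and method names are assumptions made up for this example; the schema in use would have to define matching fields. Sending the result to Solr is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class SolrXmlDoc {
    // Escapes the five characters that are special in XML text and attributes.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Builds an <add> message with a single document holding an id and the file's text.
    static String toAddXml(String id, String text) {
        return "<add><doc>"
             + "<field name=\"id\">" + escapeXml(id) + "</field>"
             + "<field name=\"docText\">" + escapeXml(text) + "</field>"
             + "</doc></add>";
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, read that file; otherwise fall back to a sample string.
        String text = (args.length > 0)
            ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
            : "This is a sample file";
        String id = (args.length > 0) ? args[0] : "sample";
        System.out.println(toAddXml(id, text));
    }
}
```

Note that this indexes the raw file contents as one text field. For HTML or other rich formats, the markup would end up in the field unless it is stripped first, which is the job the ExtractingRequestHandler takes over.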
Re: admin/index.jsp double submit on IE
Jeffrey:

It's perfectly appropriate to raise a JIRA for something like this. If you could add the steps to make this happen, that'd be great. see: http://wiki.apache.org/solr/HowToContribute#Contributing_your_work.

If you can add a patch, that'd be even better (instructions on that page too). You'll find the Solr committers are quite willing to work with you on the patch.

Thanks for finding this and digging into the underlying reason!

Best
Erick

On Sat, Apr 2, 2011 at 12:39 PM, Jeffrey Chang wrote:

> Hi,
>
> I noticed /admin/index.jsp could issue a double submit on IE causing Jetty
> to error out.
>
> Fixed by modifying index.jsp's javascript submit to return false.
>
> ... queryForm.submit(); return false; ...
>
> Not sure if I should log a defect for this or not.
>
> - Jeff
Re: Multiple Words in String
It's not a specific case only ( e.g. microsoft.com), but it's really a multi word issue.

carwash, bookkeeper etc...

I'm ultimately looking for a schema for search and retrieve that's heavily focused on 'names'.. these are peoples names, business names etc.. not content like large text fields, web sites or anything like that, but business data that I'm very successfully receiving using dataimport handlers... it's these special cases that are really tripping me up .. my business folks keep coming up with them!

Chris Fauerbach
chrisfauerb...@gmail.com

On Sun, Apr 3, 2011 at 6:51 PM, Erick Erickson wrote:

> Is this a general question or specific? You can handle specific ones by
> using synonyms.
>
> But the general case, that is treating any two pairs of tokens as
> a single pair seems fraught with unintended consequences, but
> you know your problem space better than I do.
>
> Best
> Erick
>
> On Sat, Apr 2, 2011 at 2:21 PM, Chris Fauerbach wrote:
>
> > Good afternoon everyone!
> > I am stumped, and I would love some help. I'm new to solr/lucene,
> > but I have thrown myself into it, so I think I have a solid
> > understanding. Using the analysis tool in the admin interface, I see
> > these words stemmed and processed as I assume they would be, so I'm
> > stuck.
> >
> > In my index, I have two documents, each with a text field, and here
> > are example values
> >
> > 1) microsoft.com
> > 2) micro soft
> >
> > I want to do a search using microsoft or "micro soft" and find both.
> > I'm using the dismax interface, the fields are properly listed in the
> > config, and I can find both records, but never at the same time.
> > Here's my schema.xml for my text field, any thoughts on what I can do
> > to find these together?
> >
> > positionIncrementGap="100">
> >
> > words="stopwords.txt" enablePositionIncrements="true"/>
> > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> > preserveOriginal="1"/>
> > synonyms="syn/index_synonyms.txt" ignoreCase="true" expand="true"/>
> > minGramSize="2"
> > maxGramSize="15" side="front"/>
> > minGramSize="2"
> > maxGramSize="15" side="back"/>
> > language="English" protected="protwords.txt"/>
> >
> > minGramSize="2"
> > maxGramSize="15" side="front"/>
> > minGramSize="2"
> > maxGramSize="15" side="back"/>
> > words="stopwords.txt" enablePositionIncrements="true"/>
> > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> > preserveOriginal="1"/>
> > language="English" protected="protwords.txt"/>
Re: Multiple Words in String
Short form: I think you're going down a rabbit-hole and should just use synonyms and forget about it. I'm particularly thinking that a general-purpose solution that somehow breaks up or combines adjacent tokens will have consequences that pop out other places that you don't want and you'll have to fix *that*. I can't think of a way to do this that wouldn't run that danger.

Long form, think of it as a sermon, it's Sunday after all.

This is the point, in my experience, where you have to ask your business people "what's it worth to you"? You can handle any cases that come up similar to the examples you've shown by adding them to your synonyms file - compressing any pair into its joined form (as a synonym) and be done with it. This is a very straight-forward approach that has predictable consequences.

Or you can mess around, possibly for quite some time, trying to find a general purpose solution that will almost inevitably lead to unanticipated behavior that you'll then spend lots of time trying to chase down, time you could have spent putting in features that your users will actually notice.

Here's a test. Ask your business people to create a list of all the pairs they want to see treated like this. If their response is any variant of "we don't have time to do that" then even *they* must not think it's very important. And if they do, put it in your synonyms file and be a hero.

Evil thoughts aside, I'm dead serious. This is the kind of rabbit-hole that development efforts go down that, in all probability, add almost zero *value* to the product. There's a way to handle 95% of the cases that's very easy to implement. It's already there in Solr.

Historically, we in the programming field have done a very poor job of making it clear to the business folks that every such request has not only an implementation cost (and we all too often don't include debugging/maintenance in that cost) but an opportunity cost. We owe it to the business folks *and ourselves* to clearly explain to them the cost and let them make the decision whether it's worth it. A decision based on information.

And understand that I'm not knocking the business folks here. We haven't given them the consequences to weigh, so how can we fault their decisions?

OK, sermon over. I've just too often said "yes, we can do that" without thinking to add "and it'll cost 3 weeks of development effort". Eventually I figured out that adding the estimate and letting the business folks know what I wouldn't be able to get to because of that time spent led to "Oh, never mind".

Best
Erick

P.S. Ok, it's late Sunday night and I feel like writing long, involved responses that aren't entirely on-topic.

On Sun, Apr 3, 2011 at 9:04 PM, Chris Fauerbach wrote:

> It's not a specific case only ( e.g. microsoft.com), but it's really a
> multi word issue.
>
> carwash, bookkeeper etc...
>
> I'm ultimately looking for a schema for search and retrieve that's heavily
> focused on 'names'.. these are peoples names, business names etc.. not
> content like large text fields, web sites or anything like that, but
> business data that I'm very successfully receiving using dataimport
> handlers... it's these special cases that are really tripping me up .. my
> business folks keep coming up with them!
>
> Chris Fauerbach
> chrisfauerb...@gmail.com
>
> On Sun, Apr 3, 2011 at 6:51 PM, Erick Erickson wrote:
>
> > Is this a general question or specific? You can handle specific ones by
> > using synonyms.
> >
> > But the general case, that is treating any two pairs of tokens as
> > a single pair seems fraught with unintended consequences, but
> > you know your problem space better than I do.
> >
> > Best
> > Erick
> >
> > On Sat, Apr 2, 2011 at 2:21 PM, Chris Fauerbach <chrisfauerb...@gmail.com> wrote:
> >
> > > Good afternoon everyone!
> > > I am stumped, and I would love some help. I'm new to solr/lucene,
> > > but I have thrown myself into it, so I think I have a solid
> > > understanding. Using the analysis tool in the admin interface, I see
> > > these words stemmed and processed as I assume they would be, so I'm
> > > stuck.
> > >
> > > In my index, I have two documents, each with a text field, and here
> > > are example values
> > >
> > > 1) microsoft.com
> > > 2) micro soft
> > >
> > > I want to do a search using microsoft or "micro soft" and find both.
> > > I'm using the dismax interface, the fields are properly listed in the
> > > config, and I can find both records, but never at the same time.
> > > Here's my schema.xml for my text field, any thoughts on what I can do
> > > to find these together?
> > >
> > > positionIncrementGap="100">
> > >
> > > words="stopwords.txt" enablePositionIncrements="true"/>
> > > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> > > preserveOriginal="1"/>
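If the business folks do hand over a list of pairs, turning it into synonym entries is mechanical. The sketch below is one possible way to do it, assuming the comma-separated equivalence format used in Solr's synonyms.txt; the class and method names are made up for this example, and the pairs are the ones mentioned in the thread.

```java
import java.util.ArrayList;
import java.util.List;

class SynonymLines {
    // Builds one synonyms.txt line that lists the split form and the joined form
    // as equivalents, e.g. "micro soft, microsoft".
    static String toLine(String first, String second) {
        return first + " " + second + ", " + first + second;
    }

    // Converts a list of two-element pairs into synonym lines, in order.
    static List<String> toLines(List<String[]> pairs) {
        List<String> lines = new ArrayList<>();
        for (String[] pair : pairs) {
            lines.add(toLine(pair[0], pair[1]));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> pairs = List.of(
            new String[] {"micro", "soft"},
            new String[] {"car", "wash"},
            new String[] {"book", "keeper"});
        for (String line : toLines(pairs)) {
            System.out.println(line);
        }
    }
}
```

Multi-word synonyms can behave differently at index time and at query time, so it is worth checking the result with the analysis tool in the admin interface before relying on it.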
Re: Faceting on multivalued field
Ok. My expectation was since "comment_post_id" is a MultiValued field hence it would appear multiple times (i.e. for each comment). And hence when I would facet with that field it would also give me the count of those many documents where comment_post_id appears.

My requirement is getting total for every document i.e. finding number of comments per post in the whole corpus. To explain it more clearly, I'm getting a result xml something like this

46
Hello World
20

9
10

19
2

46
46

Hello - from World
Hi

*1*

I need the count to be 2 as the post 46 has 2 comments.

What other way can I approach?

Thanks,
Kaushik

On Mon, Apr 4, 2011 at 4:29 AM, Erick Erickson wrote:

> Hmmm, I think you're misunderstanding faceting. It's counting the
> number of documents that have a particular value. So if you're
> faceting on "comment_post_id", there is one and only one document
> with that value (assuming that the comment_post_ids are unique).
> Which is what's being reported. This will be quite expensive on a
> large corpus, BTW.
>
> Is your task to show the totals for *every* document in your corpus or
> just the ones in a display page? Because if the latter, your app could
> just count up the number of elements in the XML returned for the
> multiValued comments field.
>
> If that's not relevant, could you explain a bit more why you need this
> count?
>
> Best
> Erick
>
> On Sun, Apr 3, 2011 at 2:31 PM, Kaushik Chakraborty wrote:
>
> > Hi,
> >
> > My index contains a root entity "Post" and a child entity "Comments". Each
> > post can have multiple comments. data-config.xml:
> >
> > dataSource="jdbc" query="">
> >
> > The schema has all columns of "comment" entity as "MultiValued" fields and
> > all fields are indexed & stored. My requirement is to count the number of
> > comments for each post. Approach I'm taking is to query on "*:*" and
> > faceting the result on "comment_post_id" so that it gives the count of
> > comment occurred for that post.
> >
> > But I'm getting incorrect result e.g. if a post has 2 comments, the
> > multivalued fields are populated alright but the facet count is coming as 1
> > (for that post_id). What else do I need to do?
> >
> > Thanks,
> > Kaushik
Re: Faceting on multivalued field
Wouldn't you want to extract your original data format from the index and then 'count' the comments for each post? I don't think facets are appropriate.

On Apr 3, 2011, at 22:10, Kaushik Chakraborty wrote:

> Ok. My expectation was since "comment_post_id" is a MultiValued field hence
> it would appear multiple times (i.e. for each comment). And hence when I
> would facet with that field it would also give me the count of those many
> documents where comment_post_id appears.
>
> My requirement is getting total for every document i.e. finding number of
> comments per post in the whole corpus. To explain it more clearly, I'm
> getting a result xml something like this
>
> 46
> Hello World
> 20
>
> 9
> 10
>
> 19
> 2
>
> 46
> 46
>
> Hello - from World
> Hi
>
> *1*
>
> I need the count to be 2 as the post 46 has 2 comments.
>
> What other way can I approach?
>
> Thanks,
> Kaushik
>
> On Mon, Apr 4, 2011 at 4:29 AM, Erick Erickson wrote:
>
>> Hmmm, I think you're misunderstanding faceting. It's counting the
>> number of documents that have a particular value. So if you're
>> faceting on "comment_post_id", there is one and only one document
>> with that value (assuming that the comment_post_ids are unique).
>> Which is what's being reported. This will be quite expensive on a
>> large corpus, BTW.
>>
>> Is your task to show the totals for *every* document in your corpus or
>> just the ones in a display page? Because if the latter, your app could
>> just count up the number of elements in the XML returned for the
>> multiValued comments field.
>>
>> If that's not relevant, could you explain a bit more why you need this
>> count?
>>
>> Best
>> Erick
>>
>> On Sun, Apr 3, 2011 at 2:31 PM, Kaushik Chakraborty wrote:
>>
>>> Hi,
>>>
>>> My index contains a root entity "Post" and a child entity "Comments". Each
>>> post can have multiple comments. data-config.xml:
>>>
>>> dataSource="jdbc" query="">
>>>
>>> The schema has all columns of "comment" entity as "MultiValued" fields and
>>> all fields are indexed & stored. My requirement is to count the number of
>>> comments for each post. Approach I'm taking is to query on "*:*" and
>>> faceting the result on "comment_post_id" so that it gives the count of
>>> comment occurred for that post.
>>>
>>> But I'm getting incorrect result e.g. if a post has 2 comments, the
>>> multivalued fields are populated alright but the facet count is coming as 1
>>> (for that post_id). What else do I need to do?
>>>
>>> Thanks,
>>> Kaushik
Re: Faceting on multivalued field
Why not count them on the way in and just store that number along with the original e-mail?

Best
Erick

On Sun, Apr 3, 2011 at 10:10 PM, Kaushik Chakraborty wrote:

> Ok. My expectation was since "comment_post_id" is a MultiValued field hence
> it would appear multiple times (i.e. for each comment). And hence when I
> would facet with that field it would also give me the count of those many
> documents where comment_post_id appears.
>
> My requirement is getting total for every document i.e. finding number of
> comments per post in the whole corpus. To explain it more clearly, I'm
> getting a result xml something like this
>
> 46
> Hello World
> 20
>
> 9
> 10
>
> 19
> 2
>
> 46
> 46
>
> Hello - from World
> Hi
>
> *1*
>
> I need the count to be 2 as the post 46 has 2 comments.
>
> What other way can I approach?
>
> Thanks,
> Kaushik
>
> On Mon, Apr 4, 2011 at 4:29 AM, Erick Erickson wrote:
>
> > Hmmm, I think you're misunderstanding faceting. It's counting the
> > number of documents that have a particular value. So if you're
> > faceting on "comment_post_id", there is one and only one document
> > with that value (assuming that the comment_post_ids are unique).
> > Which is what's being reported. This will be quite expensive on a
> > large corpus, BTW.
> >
> > Is your task to show the totals for *every* document in your corpus or
> > just the ones in a display page? Because if the latter, your app could
> > just count up the number of elements in the XML returned for the
> > multiValued comments field.
> >
> > If that's not relevant, could you explain a bit more why you need this
> > count?
> >
> > Best
> > Erick
> >
> > On Sun, Apr 3, 2011 at 2:31 PM, Kaushik Chakraborty wrote:
> >
> > > Hi,
> > >
> > > My index contains a root entity "Post" and a child entity "Comments". Each
> > > post can have multiple comments. data-config.xml:
> > >
> > > dataSource="jdbc" query="">
> > >
> > > The schema has all columns of "comment" entity as "MultiValued" fields and
> > > all fields are indexed & stored. My requirement is to count the number of
> > > comments for each post. Approach I'm taking is to query on "*:*" and
> > > faceting the result on "comment_post_id" so that it gives the count of
> > > comment occurred for that post.
> > >
> > > But I'm getting incorrect result e.g. if a post has 2 comments, the
> > > multivalued fields are populated alright but the facet count is coming as 1
> > > (for that post_id). What else do I need to do?
> > >
> > > Thanks,
> > > Kaushik
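Counting "on the way in" could look something like the sketch below: before indexing, tally the comment rows per post and store the tally in an ordinary single-valued field on the post document. The input shape (a list holding the post id of each comment row) and the field name "comment_count" are assumptions for illustration; how the rows are fetched and how the field is added depends on the import setup.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CommentCountAtIndexTime {
    // commentPostIds holds the post id of every comment row, one entry per comment.
    // Returns post id -> number of comments, sorted by post id.
    static Map<Integer, Integer> countByPost(List<Integer> commentPostIds) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Integer postId : commentPostIds) {
            counts.merge(postId, 1, Integer::sum);  // increment the tally for this post
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical rows: two comments on post 46, one on post 47.
        Map<Integer, Integer> counts = countByPost(List.of(46, 46, 47));
        // Each value would then be stored in a field such as "comment_count" (assumed name).
        System.out.println(counts); // prints {46=2, 47=1}
    }
}
```

With the count stored on each post, it comes back with every search result and can be sorted or filtered on, without any faceting.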