Allow custom overrides
I need to implement a search engine that will allow users to override pieces of data and then search against or view that data. For example, consider a doc that has the following values:

DocId  Fulltext             Meta1  Meta2  Meta3
1      The quick brown fox  foo    foo    foo

Now say a user overrides Meta2:

DocId  Fulltext             Meta1  Meta2  Meta3
1      The quick brown fox  foo    bar    foo

For that user, if they search for Meta2:bar, I need to hit, but no other user should hit on it. Likewise, if that user searches for Meta2:foo, it should not hit. Also, any searches against that document for that user should return the value 'bar' for Meta2, but should return 'foo' for other users.

I'm not sure of the best way to implement this. Maybe I could do this with field collapsing somehow? Or with payloads? A custom analyzer? Any help would be appreciated.

- Charlie
Status of Solr in the cloud?
There seem to be a few parallel efforts at putting Solr in a cloud configuration. See http://wiki.apache.org/solr/KattaIntegration, which is based off of https://issues.apache.org/jira/browse/SOLR-1395. Also http://wiki.apache.org/solr/SolrCloud, which is https://issues.apache.org/jira/browse/SOLR-1873. And another JIRA: https://issues.apache.org/jira/browse/SOLR-1301.

These all seem aimed at the same goal, correct? I'm interested in evaluating one of these solutions for my company. Which is the most stable, or most likely to eventually be part of the Solr distribution?

Thanks,
Charlie
RE: How to extend IndexSchema and SchemaField
Have you already explored the idea of using a custom analyzer for your field? Depending on your use case, that might work for you. - Charlie
Field compression
I know I'm late to the party, but I recently learned that field compression was removed as of Solr 1.4.1. I think a lot of sites were relying on that feature, so I'm curious what people are doing now that it's gone. Specifically, what are people doing to efficiently store *and highlight* large fulltext fields? I can think of ways to store the text efficiently (compress it myself), or highlight it (leave it uncompressed), but not both at the same time. Also, is anyone working on anything to restore compression to Solr? I understand it was removed because Lucene removed support for it, but I was hoping to upgrade my site to 3.1 soon and we rely on that feature. - Charlie
Sorting/paging problem
I've run into a strange issue with my Solr installation. I'm running queries that sort by a DateField field, but from time to time I'm seeing individual records very much out of order. What's more, they appear on multiple pages of my result set.

Let me give an example. Starting with a basic query, I sort on the date that the document was added to the index and see these rows on the first page (I'm just showing the date field here):

2009-09-23T19:24:47.419Z
2009-09-23T19:25:03.229Z
2009-09-23T19:25:03.400Z
2009-09-23T19:25:19.951
2009-09-23T20:10:07.919Z

Note how the last document's date jumps a bit. Not necessarily a problem, but the next page looks like this:

2009-09-23T19:26:16.022Z
2009-09-23T19:26:32.547Z
2009-09-23T19:27:45.470Z
2009-09-23T19:27:45.592Z
2009-09-23T20:10:07.919Z

So, not only is the date sorting wrong, but the exact same document shows up on the next page, also still out of date order. I've seen the same document show up in 4-5 pages in some cases. It's always the last record on the page, too. If I change the page size, the problem seems to disappear for a while, but then starts up again later. Also, running the same query/queries later on doesn't show the same behavior. Could it be some sort of page boundary issue with the cache? Has anyone else run into a problem like this? I'm using the Sept 22 nightly build.

- Charlie
RE: Sorting/paging problem
Oops, the missing trailing Z was probably just a cut and paste error. It might be tough to come up with a case that can reproduce it -- it's a sticky issue. I'll post it if I can, though.

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Tuesday, September 29, 2009 6:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Sorting/paging problem

: 2009-09-23T19:25:03.400Z
:
: 2009-09-23T19:25:19.951
:
: 2009-09-23T20:10:07.919Z

is that a cut/paste error, or did you really get a date back from Solr w/o the trailing "Z" ?!?!?!
...
: So, not only is the date sorting wrong, but the exact same document
: shows up on the next page, also still out of date order. I've seen the
: same document show up in 4-5 pages in some cases. It's always the last
: record on the page, too. If I change the page size, the problem seems to

that is really freaking weird. can you reproduce this in a simple example? maybe an index that's small enough (and doesn't contain confidential information) that you could zip up and post online?

-Hoss
NGram query failing
I have a requirement to be able to find hits within words in a free-form id field. The field can have any type of alphanumeric data - it's as likely to be something like "123456" as it is to be "SUN-123-ABC". I thought of using NGrams to accomplish the task, but I'm having a problem.

I set up an NGram-based field for this, and after indexing, the analysis page indicates my queries should work. If I give it a sample field value of "ABC-123456-SUN" and a query value of "45", it shows hits in several places, which is what I expected. However, when I actually query the field with something like "45", I get no hits back.

Looking at the debugQuery output, it looks like it's taking my analyzed query text and putting it into a phrase query. So a query of "45" turns into a phrase query of "4 5 45", which then doesn't hit on anything in my index. What am I missing to make this work?

- Charlie
RE: NGram query failing
Well, I fixed my own problem in the end. For the record, I ended up going with a bigram setup in my schema. I could have left it a trigram, but went with a bigram because, with this setup, I can get queries to properly hit as long as the min/max gram size is met. In other words, for any queries two or more characters long, this works for me. Less than two characters and it fails. I don't know exactly why that is, but I'll take it anyway!

- Charlie

-Original Message-
From: Charlie Jackson [mailto:charlie.jack...@cision.com]
Sent: Friday, October 23, 2009 10:00 AM
To: solr-user@lucene.apache.org
Subject: NGram query failing

I have a requirement to be able to find hits within words in a free-form id field. The field can have any type of alphanumeric data - it's as likely to be something like "123456" as it is to be "SUN-123-ABC". I thought of using NGrams to accomplish the task, but I'm having a problem.

I set up an NGram-based field for this, and after indexing, the analysis page indicates my queries should work. If I give it a sample field value of "ABC-123456-SUN" and a query value of "45", it shows hits in several places, which is what I expected. However, when I actually query the field with something like "45", I get no hits back.

Looking at the debugQuery output, it looks like it's taking my analyzed query text and putting it into a phrase query. So a query of "45" turns into a phrase query of "4 5 45", which then doesn't hit on anything in my index. What am I missing to make this work?

- Charlie
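A field type along the following lines would be consistent with the bigram behavior described above. This is only a sketch under assumptions: the type name, tokenizer, and filter choices are guesses, not the schema actually used in this thread.

<!-- Hypothetical bigram field type. With minGramSize=2 and maxGramSize=2,
     a one-character query produces no tokens at all, which matches the
     two-character minimum described above. -->
<fieldType name="text_bigram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
  </analyzer>
</fieldType>

With the same analyzer at index and query time, a query like "45" becomes the single bigram "45", which matches one of the bigrams produced from "ABC-123456-SUN" at index time.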
Entity extraction?
During a recent sales pitch to my company by FAST, they mentioned entity extraction. I'd never heard of it before, but they described it as basically recognizing people/places/things in documents being indexed and then being able to do faceting on this data at query time. Does anything like this already exist in SOLR? If not, I'm not opposed to developing it myself, but I could use some pointers on where to start. Thanks, - Charlie
RE: Entity extraction?
Thanks for the replies, guys, that gives me a good place to start looking. - Charlie -Original Message- From: Rogerio Pereira [mailto:[EMAIL PROTECTED] Sent: Friday, October 24, 2008 5:14 PM To: solr-user@lucene.apache.org Subject: Re: Entity extraction? You can find more about this topic in this book availabe at amazon: http://www.amazon.com/Building-Search-Applications-Lucene-Lingpipe/dp/0615204252/ 2008/10/24 Rafael Rossini <[EMAIL PROTECTED]> > Solr can do a simple facet seach like FAST, but the entity extraction > demands other tecnologies. I do not know how FAST does it but at the > company > I´m working on (www.cortex-intelligence.com), we use a mix of statistical > and language-specific tasks to recognize and categorize entities in the > text. Ling Pipe is another tool (free) that does that too. In case you > would > like to see a simple demo: http://www.cortex-intelligence.com/tech/ > > Rossini > > > On Fri, Oct 24, 2008 at 6:18 PM, Charlie Jackson < > [EMAIL PROTECTED] > > wrote: > > > During a recent sales pitch to my company by FAST, they mentioned entity > > extraction. I'd never heard of it before, but they described it as > > basically recognizing people/places/things in documents being indexed > > and then being able to do faceting on this data at query time. Does > > anything like this already exist in SOLR? If not, I'm not opposed to > > developing it myself, but I could use some pointers on where to start. > > > > > > > > Thanks, > > > > - Charlie > > > > > -- Regards, Rogério (_rogerio_) [Blog: http://faces.eti.br] [Sandbox: http://bmobile.dyndns.org] [Twitter: http://twitter.com/ararog] "Faça a diferença! Ajude o seu país a crescer, não retenha conhecimento, distribua e aprenda mais." (http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)
RE: Entity extraction?
True, though I may be able to convince the powers that be that it's worth the investment. There are a number of open source or free tools listed on the Wikipedia entry for entity extraction (http://en.wikipedia.org/wiki/Named_entity_recognition#Open_source_or_free) -- does anyone have any experience with any of these? Charlie Jackson 312-873-6537 [EMAIL PROTECTED] -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Monday, October 27, 2008 10:23 AM To: solr-user@lucene.apache.org Subject: Re: Entity extraction? For the record, LingPipe is not free. It's good, but it's not free. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Rafael Rossini <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Friday, October 24, 2008 6:08:14 PM > Subject: Re: Entity extraction? > > Solr can do a simple facet seach like FAST, but the entity extraction > demands other tecnologies. I do not know how FAST does it but at the company > I´m working on (www.cortex-intelligence.com), we use a mix of statistical > and language-specific tasks to recognize and categorize entities in the > text. Ling Pipe is another tool (free) that does that too. In case you would > like to see a simple demo: http://www.cortex-intelligence.com/tech/ > > Rossini > > > On Fri, Oct 24, 2008 at 6:18 PM, Charlie Jackson > > wrote: > > > During a recent sales pitch to my company by FAST, they mentioned entity > > extraction. I'd never heard of it before, but they described it as > > basically recognizing people/places/things in documents being indexed > > and then being able to do faceting on this data at query time. Does > > anything like this already exist in SOLR? If not, I'm not opposed to > > developing it myself, but I could use some pointers on where to start. > > > > > > > > Thanks, > > > > - Charlie > > > >
RE: Entity extraction?
Yeah, when they first mentioned it, my initial thought was "cool, but we don't need it." However, some of the higher ups in the company are saying we might want it at some point, so I've been asked to look into it. I'll be sure to let them know about the flaws in the concept, thanks for that info. ________ Charlie Jackson [EMAIL PROTECTED] -Original Message- From: Walter Underwood [mailto:[EMAIL PROTECTED] Sent: Monday, October 27, 2008 11:17 AM To: solr-user@lucene.apache.org Subject: Re: Entity extraction? The vendor mentioned entity extraction, but that doesn't mean you need it. Entity extraction is a pretty specific technology, and it has been a money-losing product at many companies for many years, going back to Xerox ThingFinder well over ten years ago. My guess is that very few people really need entity extraction. Using EE for automatic taxonomy generation is even harder to get right. At best, that is a way to get a starter set of categories that you can edit. You will not get a production quality taxonomy automatically. wunder On 10/27/08 8:31 AM, "Charlie Jackson" <[EMAIL PROTECTED]> wrote: > True, though I may be able to convince the powers that be that it's worth the > investment. > > There are a number of open source or free tools listed on the Wikipedia entry > for entity extraction > (http://en.wikipedia.org/wiki/Named_entity_recognition#Open_source_or_free) -- > does anyone have any experience with any of these? > > > Charlie Jackson > 312-873-6537 > [EMAIL PROTECTED] > > -Original Message- > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] > Sent: Monday, October 27, 2008 10:23 AM > To: solr-user@lucene.apache.org > Subject: Re: Entity extraction? > > For the record, LingPipe is not free. It's good, but it's not free. > > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message >> From: Rafael Rossini <[EMAIL PROTECTED]> >> To: solr-user@lucene.apache.org >> Sent: Friday, October 24, 2008 6:08:14 PM >> Subject: Re: Entity extraction? >> >> Solr can do a simple facet seach like FAST, but the entity extraction >> demands other tecnologies. I do not know how FAST does it but at the company >> I´m working on (www.cortex-intelligence.com), we use a mix of statistical >> and language-specific tasks to recognize and categorize entities in the >> text. Ling Pipe is another tool (free) that does that too. In case you would >> like to see a simple demo: http://www.cortex-intelligence.com/tech/ >> >> Rossini >> >> >> On Fri, Oct 24, 2008 at 6:18 PM, Charlie Jackson >>> wrote: >> >>> During a recent sales pitch to my company by FAST, they mentioned entity >>> extraction. I'd never heard of it before, but they described it as >>> basically recognizing people/places/things in documents being indexed >>> and then being able to do faceting on this data at query time. Does >>> anything like this already exist in SOLR? If not, I'm not opposed to >>> developing it myself, but I could use some pointers on where to start. >>> >>> >>> >>> Thanks, >>> >>> - Charlie >>> >>> > > >
Availability during merge
The wiki page for merging solr cores (http://wiki.apache.org/solr/MergingSolrIndexes) mentions that the cores being merged cannot be indexed to during the merge. What about the core being merged *to*? In terms of the example on the wiki page, I'm asking if core0 can add docs while core1 and core2 are being merged into it. Thanks, - Charlie
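For context, the wiki example referenced above boils down to a CoreAdmin call along these lines (host, port, and paths are the wiki's example values, not a recommendation):

http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&indexDir=/opt/solr/core1/data/index&indexDir=/opt/solr/core2/data/index

The question here is whether core0, the target of the merge, can keep accepting adds while that call runs.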
Rounding dates on sort and filter
I've got a legacy date field that I'd like to round for sorting and filtering. Right now, the index is large enough that sorting or filtering on a date field takes 10-20 seconds (unless it's cached). I know this is because the date field's precision is down to the millisecond, and I don't really need that level of precision for most of my searches. So, is it possible to round my field at query time without having to reindex the field or add a second one? I already tried the function sorting in 1.5-dev, but my field isn't a TrieDate field so I can't use the ms() function (which seems to allow date math unlike the other functions). Thanks, Charlie
RE: Rounding dates on sort and filter
Good point. So it doesn't sound like there's a way to do this without adding a new field or reindexing. Thanks anyway. - Charlie -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Tuesday, January 19, 2010 2:04 PM To: solr-user@lucene.apache.org Subject: Re: Rounding dates on sort and filter Charlie, Query-time terms/tokens need to match what's in your index, and my guess is that if you just altered query-time date field analysis, you'd get a mismatch. Easy enough to check through Solr Admin Analysis page. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message ---- > From: Charlie Jackson > To: solr-user@lucene.apache.org > Sent: Tue, January 19, 2010 1:20:02 PM > Subject: Rounding dates on sort and filter > > I've got a legacy date field that I'd like to round for sorting and > filtering. Right now, the index is large enough that sorting or > filtering on a date field takes 10-20 seconds (unless it's cached). I > know this is because the date field's precision is down to the > millisecond, and I don't really need that level of precision for most of > my searches. So, is it possible to round my field at query time without > having to reindex the field or add a second one? > > > > I already tried the function sorting in 1.5-dev, but my field isn't a > TrieDate field so I can't use the ms() function (which seems to allow > date math unlike the other functions). > > > > Thanks, > > Charlie
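If the second-field route were taken, a sketch of what it might look like in schema.xml follows. The field name is hypothetical, and note that copyField copies values verbatim, so the rounding itself would have to happen in the indexing client:

<!-- Hypothetical day-rounded companion field for fast sorting/filtering.
     The indexing client would populate it with the legacy date truncated
     to midnight; the name is a placeholder. -->
<field name="added_date_day" type="date" indexed="true" stored="false"/>

A field with far fewer distinct terms is much cheaper to build a sort cache for, which is where the 10-20 second cost described above comes from.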
HTTP caching and distributed search
Currently, I've got a Solr setup in which we're distributing searches across two cores on a machine, say core1 and core2. I'm toying with the notion of enabling Solr's HTTP caching on our system, but I noticed an oddity when using it in combination with distributed searching. Say, for example, I have this query:

http://localhost:8080/solr/core1/select/?q=google&start=0&rows=10&shards=localhost:8080/solr/core1,localhost:8080/solr/core2

Both cores have HTTP caching enabled, and it seems to be working. The first time I run the query through Squid, it correctly sees it doesn't have this cached and so requests it from Solr. The second time I request it, it hits the Squid cache. That part works fine.

Here's the problem. If I commit to core1, it changes the ETag value of the request, which will invalidate the cache, as it should. But committing to core2 doesn't, so I get the cached version back, even though core2 has changed and the cache is stale. I'm guessing this is because the request is going against core1, hence using core1's cache values, but in a distributed search, it seems like it should be using cache values from all cores in the shards parameter. Is this a known issue, and if so, is there a patch for it?

Thanks,
Charlie
RE: HTTP caching and distributed search
I tried your suggestion, Hoss, but committing to the new coordinator core doesn't change the indexVersion, and therefore the ETag value isn't changed. I opened a new JIRA issue for this: http://issues.apache.org/jira/browse/SOLR-1765

Thanks,
Charlie

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Thursday, February 04, 2010 2:16 PM
To: solr-user@lucene.apache.org
Subject: Re: HTTP caching and distributed search

: > http://localhost:8080/solr/core1/select/?q=google&start=0&rows=10&shards=localhost:8080/solr/core1,localhost:8080/solr/core2

: You are right, etag is calculated using the searcher on core1 only and it
: does not take other shards into account. Can you open a Jira issue?

...as a possible work arround i would suggest creating a seperate "coordinator" core that is neither core1 nor core2 ... it doesn't have to have any docs in it, it just has to have consistent schemas with the other two cores. That way you can use distinct settings on the coordinator core (perhaps never304="true" but with an explicit setting? ... or lastModifiedFrom="openTime") and then you could send an explicit "commit" to the (empty) coordinator core anytime you modify one of the shards.

-Hoss
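For readers unfamiliar with the setting Hoss mentions: HTTP cache validators are controlled by the <httpCaching> element in solrconfig.xml. A sketch of the coordinator core's configuration under his suggestion (the max-age value is an arbitrary placeholder, not from this thread):

<!-- In the coordinator core's solrconfig.xml. lastModifiedFrom="openTime"
     derives the Last-Modified header from the time the searcher was
     opened, so a commit to the (empty) coordinator core refreshes it. -->
<httpCaching lastModifiedFrom="openTime">
  <cacheControl>max-age=30</cacheControl>
</httpCaching>

As noted above, though, this workaround did not pan out in practice, because the ETag is derived from the indexVersion, which a commit with no documents doesn't change.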
Odd query result
I've got an odd scenario with a query a user's running. The user is searching for the term "I-Car". It will hit if the document contains the term "I-CAR" (all caps) but not if it's "I-Car". When I throw the terms into the analysis page, the resulting tokens look identical, and my "I-Car" tokens hit on either term. Here's the definition of the field:

I'm pretty sure this has to do with the settings on the WordDelimiterFilterFactory, but I must be missing something, because I don't see anything that would cause the behavior I'm seeing.
RE: Odd query result
I'll take another look and see if it makes sense to have the index and query time parameters the same or different.

As far as the initial issue, I think you're right, Tom, it is hitting on both. I think what threw me off was the highlighting -- in one of my matching documents, the term "I-CAR" is highlighted, but I think it actually hit on the term "ISHIN-I (car" which is also in the document. The debug output for my query is:

ft:I-Car
ft:I-Car
+MultiPhraseQuery(ft:"i (car icar)")
+ft:"i (car icar)"

Thanks!

-Original Message-
From: Tom Hill [mailto:solr-l...@worldware.com]
Sent: Tuesday, April 20, 2010 2:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Odd query result

I agree that, if they are the same, you want to merge them.

In this case, I don't think you want them to be the same. In particular, you usually don't want to catenateWords and catenateNumbers both at index time AND at query time. You generate the permutations on one, or the other, but you don't need to do it for both. I usually do it at index time.

Tom

On Tue, Apr 20, 2010 at 11:29 AM, MitchK wrote:
>
> It has nothing to do with your problem, since it seems to work when Tom
> tested it.
> However, it seems like you are using the same configurations on query- and
> index-type analyzer.
> If you did not hide anything from us (for example, your own
> filter implementations) because you don't want to confuse us, you can just
> delete the definitions "type=index" and "type=query". If you do so, the
> whole fieldType-filter-configuration will be applied on both: index- and
> query-time. There is no need to specify two equal ones.
>
> I think this would be easier to maintain in future :).
>
> Kind regards
> - Mitch
>
> -->
>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>         ignoreCase="true" expand="true"/>
> <filter class="solr.StopFilterFactory"
>         ignoreCase="true"
>         words="stopwords.txt"
>         enablePositionIncrements="true"
>         />
> <filter class="solr.WordDelimiterFilterFactory"
>         generateWordParts="1" generateNumberParts="1" catenateWords="1"
>         catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>
> --
> View this message in context:
> http://n3.nabble.com/Odd-query-result-tp732958p733095.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
Solrsharp highlighting
Trying to use Solrsharp (which is a great tool, BTW) to get some results in a C# application. I see the HighlightFields method of the QueryBuilder object and I've set it to my highlight field, but how do I get at the results? I don't see anything in the SearchResults code that does anything with the highlight results XML. Did I miss something? Thanks, Charlie
RE: Solrsharp highlighting
Also, are there any examples out there of how to use Solrsharp's faceting capabilities? Charlie Jackson 312-873-6537 [EMAIL PROTECTED] -Original Message- From: Charlie Jackson [mailto:[EMAIL PROTECTED] Sent: Friday, August 10, 2007 3:51 PM To: solr-user@lucene.apache.org Subject: Solrsharp highlighting Trying to use Solrsharp (which is a great tool, BTW) to get some results in a C# application. I see the HighlightFields method of the QueryBuilder object and I've set it to my highlight field, but how do I get at the results? I don't see anything in the SearchResults code that does anything with the highlight results XML. Did I miss something? Thanks, Charlie
RE: Solrsharp highlighting
Thanks for adding in those facet examples. That should help me out a great deal. As for the highlighting, did you have any ideas about a good way to go about it? I was thinking about taking a stab at it, but I want to get your input first.

Thanks,
Charlie

-Original Message-
From: Jeff Rodenburg [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 14, 2007 1:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Solrsharp highlighting

Pull down the latest example code from http://solrstuff.org/svn/solrsharp which includes adding facets to search results. It's really short and simple to add facets; the example application implements one form of it. The nice thing about the facet support is that it utilizes generics to allow you to have strongly typed name/value pairs for the fieldname/count data.

Hope this helps.

-- jeff r.

On 8/10/07, Charlie Jackson <[EMAIL PROTECTED]> wrote:
>
> Also, are there any examples out there of how to use Solrsharp's
> faceting capabilities?
>
> ________
> Charlie Jackson
> 312-873-6537
> [EMAIL PROTECTED]
>
> -Original Message-
> From: Charlie Jackson [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 10, 2007 3:51 PM
> To: solr-user@lucene.apache.org
> Subject: Solrsharp highlighting
>
> Trying to use Solrsharp (which is a great tool, BTW) to get some results
> in a C# application. I see the HighlightFields method of the
> QueryBuilder object and I've set it to my highlight field, but how do I
> get at the results? I don't see anything in the SearchResults code that
> does anything with the highlight results XML. Did I miss something?
>
> Thanks,
>
> Charlie
>
RE: UTF-8 encoding problem on one of two Solr setups
You might want to check out this page: http://wiki.apache.org/solr/SolrTomcat

Tomcat needs a small config change out of the box to properly support UTF-8.

Thanks,
Charlie

-Original Message-
From: Mario Knezovic [mailto:[EMAIL PROTECTED]
Sent: Friday, August 17, 2007 12:58 PM
To: solr-user@lucene.apache.org
Subject: UTF-8 encoding problem on one of two Solr setups

Hi all,

I have set up an identical Solr 1.1 on two different machines. One works fine, the other one has a UTF-8 encoding problem.

#1 is my local Windows XP machine. Solr is running basically in a configuration like in the tutorial example with Jetty/5.1.11RC0 (Windows XP/5.1 x86 java/1.6.0). Everything works fine here as expected.

#2 is a Linux machine with Solr running inside Tomcat 6. The problem happens here. This is the place where Solr will be running finally.

To rule out all problems in my PHP and Java code, I tested the problem with the Solr admin page and it happens there as well. (Tested with Firefox 2 with the site's char encoding set to UTF-8.)

When entering an arbitrary search string containing UTF-8 chars I get a correct response from the local Windows Solr setup:

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="start">0</str>
    <str name="q">München</str>   <-- sample string containing a German umlaut-u
    <str name="rows">10</str>
    <str name="version">2.2</str>
  </lst>
</lst>
[...]

When I do exactly the same, just on the admin page of the other Solr setup (but from exactly the same browser), I get the following response:

[...]
<str name="q">item$searchstring_de:MÃ¼nchen</str>
[...]

Obviously the umlaut-u UTF-8 bytes 0xC3 0xBC had been interpreted as two 8-bit chars instead of one UTF-8 char.

Unfortunately I am pretty new to Solr, Tomcat and related topics, so I was not able to find the problem yet. My guess is that it is outside of Solr, maybe in the Tomcat configuration, but so far I spent the entire day without a further clue.

But apart from that Solr really rocks. Indexing tons of content and searching works just fine and fast and it was pretty easy to get into everything. Now I am changing all data to UTF-8 and ran into my first serious obstacle... after a few weeks of Solr usage!

Any hint/help appreciated. Thank you very much.

Mario
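The config change the SolrTomcat wiki describes is setting URIEncoding on Tomcat's HTTP connector in conf/server.xml; without it, Tomcat decodes GET parameters as ISO-8859-1, which produces exactly the double-decoding shown above. A sketch (the port and other connector attributes are placeholders):

<!-- In Tomcat's conf/server.xml: decode query-string parameters as UTF-8 -->
<Connector port="8080" maxThreads="150" URIEncoding="UTF-8"/>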
RE: dataset parameters suitable for lucene application
My experiences so far with this level of data have been good.

Number of records: Maxed out at 8.8 million
Database size: friggin huge (100+ GB)
Index size: ~24 GB

1) It took me about a day to index 8 million docs using a non-optimized program I wrote. It's non-optimized in the sense that it's not multi-threaded. It batched together groups of about 5,000 docs at a time to be indexed.

2) Search times for a basic search are almost always sub-second. If we toss in some faceting, it takes a little longer, but I've hardly ever seen it go above 1-2 seconds even with the most advanced queries.

Hope that helps.

Charlie

-Original Message-
From: Law, John [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 26, 2007 9:28 AM
To: solr-user@lucene.apache.org
Subject: dataset parameters suitable for lucene application

I am new to the list and new to lucene and solr. I am considering Lucene for a potential new application and need to know how well it scales.

Following are the parameters of the dataset.

Number of records: 7+ million
Database size: 13.3 GB
Index Size: 10.9 GB

My questions are simply:

1) Approximately how long would it take Lucene to index these documents?
2) What would the approximate retrieval time be (i.e. search response time)?

Can someone provide me with some informed guidance in this regard?

Thanks in advance,
John

__
John Law
Director, Platform Management
ProQuest
789 Eisenhower Parkway
Ann Arbor, MI 48106
734-997-4877
[EMAIL PROTECTED]
www.proquest.com
www.csa.com

ProQuest... Start here.
RE: dataset parameters suitable for lucene application
Sorry, I meant that it maxed out in the sense that my maxDoc field on the stats page was 8.8 million, which indicates that the most docs it has ever had was around 8.8 million. It's down to about 7.8 million currently. I have seen no signs of a "maximum" number of docs Solr can handle. -Original Message- From: Chris Harris [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 26, 2007 11:49 AM To: solr-user@lucene.apache.org Subject: Re: dataset parameters suitable for lucene application By "maxed out" do you mean that Solr's performance became unacceptable beyond 8.8M records, or that you only had 8.8M records to index? If the former, can you share the particular symptoms? On 9/26/07, Charlie Jackson <[EMAIL PROTECTED]> wrote: > My experiences so far with this level of data have been good. > > Number of records: Maxed out at 8.8 million > Database size: friggin huge (100+ GB) > Index size: ~24 GB > > 1) It took me about a day to index 8 million docs using a non-optimized > program I wrote. It's non-optimized in the sense that it's not > multi-threaded. It batched together groups of about 5,000 docs at a time > to be indexed. > > 2) Search times for a basic search are almost always sub-second. If we > toss in some faceting, it takes a little longer, but I've hardly ever > seen it go above 1-2 seconds even with the most advanced queries. > > Hope that helps. > > > Charlie > > > > -Original Message- > From: Law, John [mailto:[EMAIL PROTECTED] > Sent: Wednesday, September 26, 2007 9:28 AM > To: solr-user@lucene.apache.org > Subject: dataset parameters suitable for lucene application > > I am new to the list and new to lucene and solr. I am considering Lucene > for a potential new application and need to know how well it scales. > > Following are the parameters of the dataset. > > Number of records: 7+ million > Database size: 13.3 GB > Index Size: 10.9 GB > > My questions are simply: > > 1) Approximately how long would it take Lucene to index these documents? > 2) What would the approximate retrieval time be (i.e. search response > time)? > > Can someone provide me with some informed guidance in this regard? > > Thanks in advance, > John > > __ > John Law > Director, Platform Management > ProQuest > 789 Eisenhower Parkway > Ann Arbor, MI 48106 > 734-997-4877 > [EMAIL PROTECTED] > www.proquest.com > www.csa.com > > ProQuest... Start here. > > > >
quick allowDups questions
Normally this is the type of thing I'd just scour through the online docs or the source code for, but I'm under the gun a bit. Anyway, I need to update some docs in my index because my client program wasn't accurately putting these docs in (values for one of the fields was missing). I'm hoping I won't have to write additional code to go through and delete each existing doc before I add the new one, and I think setting allowDups on the add command to false will allow me to do this. I seem to recall something in the update handler code that goes through and deletes all but the last copy of the doc if allowDups is false - does that sound accurate? If so, I just need to make sure that solrj properly sets that flag, which leads me to my next question. Does solrj default allowDups to false? If not, what do I need to do to make sure allowDups is set to false when I'm adding these docs?
RE: quick allowDups questions
Thanks for the response, Mike. A quick test using the example app confirms your statement. As for Solrj, you're probably right, but I'm not going to take any chances for the time being. The server.add method has an optional Boolean flag named "overwrite" that defaults to true. Without knowing for sure what it does, I'm not going to mess with it. For the purposes of my problem, I've got an upper and lower bound of affected docs, so I'm just going to delete them all and then initiate a re-index of those specific ids from my source. Thanks again for the help! -Original Message- From: Mike Klaas [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 10, 2007 3:58 PM To: solr-user@lucene.apache.org Subject: Re: quick allowDups questions On 10-Oct-07, at 1:11 PM, Charlie Jackson wrote: > Anyway, I need to update some docs in my index because my client > program > wasn't accurately putting these docs in (values for one of the fields > was missing). I'm hoping I won't have to write additional code to go > through and delete each existing doc before I add the new one, and I > think setting allowDups on the add command to false will allow me > to do > this. I seem to recall something in the update handler code that goes > through and deletes all but the last copy of the doc if allowDups is > false - does that sound accurate? Yes. But you need to define a uniqueKey in schema and make sure it is the same for docs you want overwritten. This is how solr detects "dups". > > If so, I just need to make sure that solrj properly sets that flag, > which leads me to my next question. Does solrj default allowDups to > false? If not, what do I need to do to make sure allowDups is set to > false when I'm adding these docs? It is the normal mode of operation for Solr, so I'd be surprised if it wasn't the default in solrj (but I don't actually know). -Mike
RE: quick allowDups questions
Cool, thanks for the clarification, Ryan. -Original Message- From: Ryan McKinley [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 10, 2007 5:28 PM To: solr-user@lucene.apache.org Subject: Re: quick allowDups questions the default solrj implementation should do what you need. > > As for Solrj, you're probably right, but I'm not going to take any > chances for the time being. The server.add method has an optional > Boolean flag named "overwrite" that defaults to true. Without knowing > for sure what it does, I'm not going to mess with it. > direct solr update allows a few extra fields allowDups, overwritePending, overwriteCommited -- the future of overwritePending, overwriteCommited is in doubt (SOLR-60), so i did not want to bake that into the solrj API. internally, allowDups = !overwrite; (the one field you can set) overwritePending = !allowDups; overwriteCommited = !allowDups; ryan
RE: Timeout Settings
The CommonsHttpSolrServer has a setConnectionTimeout method. For my import, which was on a similar scale as yours, I had to set it up to 1000 (1 second). I think messing with this setting may take care of your timeout problem.

-Original Message-
From: Daniel Clark [mailto:[EMAIL PROTECTED]
Sent: Monday, October 22, 2007 6:59 PM
To: solr-user@lucene.apache.org
Subject: Timeout Settings

I'm indexing about 10,000,000 docs and I'm getting the following error at the optimize stage. I'm using Tomcat 6. I believe it's timing out due to the size of the index. How can I increase the timeout setting while it's optimizing? Any help would be greatly appreciated.

java.lang.Exception:
    at org.apache.solr.client.SolrClient.update(SolrClient.java:660)
    at org.apache.solr.client.SolrClient.update(SolrClient.java:620)
    at org.apache.solr.client.SolrClient.addDocuments(SolrClient.java:580)
    at org.apache.solr.client.SolrClient.addDocuments(SolrClient.java:595)
    at com.aol.music.search.indexer2.MusicIndexer$SolrUpdateTask.call(MusicIndexer.java:244)
    at com.aol.music.search.indexer2.MusicIndexer$SolrUpdateTask.call(MusicIndexer.java:214)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
    at java.util.concurrent.FutureTask.run(FutureTask.java:123)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
    at java.lang.Thread.run(Thread.java:595)
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
    at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
    at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
    at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
    at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
    at org.apache.solr.client.SolrClient.update(SolrClient.java:637)
    ... 10 more

~
Daniel Clark, President
DAC Systems, Inc.
(703) 403-0340
~
RE: Forced Top Document
Do you know which document you want at the top? If so, I believe you could just add an "OR" clause to your query to boost that document very high, such as ?q=foo OR id:bar^1000 Tried this on my installation and it did, indeed push the document specified to the top. -Original Message- From: Matthew Runo [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 24, 2007 10:17 AM To: solr-user@lucene.apache.org Subject: Re: Forced Top Document I'd love to know this, as I just got a development request for this very feature. I'd rather not spend time on it if it already exists. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Oct 23, 2007, at 10:12 PM, mark angelillo wrote: > Hi all, > > Is there a way to get a specific document to appear on top of > search results even if a sorting parameter would push it further down? > > Thanks in advance, > Mark > > mark angelillo > snooth inc. > o: 646.723.4328 > c: 484.437.9915 > [EMAIL PROTECTED] > snooth -- 1.8 million ratings and counting... > >
RE: Forced Top Document
Yes, this will only work if the results are sorted by score (the default). One thing I thought of after I sent this out was that this will include the specified document even if it doesn't match your search criteria, which may not be what you want. -Original Message- From: mark angelillo [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 24, 2007 12:44 PM To: solr-user@lucene.apache.org Subject: Re: Forced Top Document Charlie, That's interesting. I did try something like this. Did you try your query with a sorting parameter? What I've read suggests that all the results are returned based on the query specified, but then resorted as specified. Boosting (which modifies the document's score) should not change the order unless the results are sorted by score. Mark On Oct 24, 2007, at 1:05 PM, Charlie Jackson wrote: > Do you know which document you want at the top? If so, I believe you > could just add an "OR" clause to your query to boost that document > very > high, such as > > ?q=foo OR id:bar^1000 > > Tried this on my installation and it did, indeed push the document > specified to the top. > > > > -Original Message- > From: Matthew Runo [mailto:[EMAIL PROTECTED] > Sent: Wednesday, October 24, 2007 10:17 AM > To: solr-user@lucene.apache.org > Subject: Re: Forced Top Document > > I'd love to know this, as I just got a development request for this > very feature. I'd rather not spend time on it if it already exists. > > ++ > | Matthew Runo > | Zappos Development > | [EMAIL PROTECTED] > | 702-943-7833 > ++ > > > On Oct 23, 2007, at 10:12 PM, mark angelillo wrote: > >> Hi all, >> >> Is there a way to get a specific document to appear on top of >> search results even if a sorting parameter would push it further >> down? >> >> Thanks in advance, >> Mark >> >> mark angelillo >> snooth inc. >> o: 646.723.4328 >> c: 484.437.9915 >> [EMAIL PROTECTED] >> snooth -- 1.8 million ratings and counting... >> >> > mark angelillo snooth inc. o: 646.723.4328 c: 484.437.9915 [EMAIL PROTECTED] snooth -- 1.8 million ratings and counting...
RE: Forced Top Document
Took the words right out my mouth! That second method would be particularly effective but will only work if you can identify these docs at index time. -Original Message- From: Kyle Banerjee [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 24, 2007 1:31 PM To: solr-user@lucene.apache.org Subject: Re: Forced Top Document This method Charlie suggested will work just fine with a minor tweak. For relevancy sorting ?q=foo OR (foo AND id:bar) For nonrelevancy sorting, all you need is a multilevel sort. Just add a bogus field that only the important document contains. Then sort by bogus field in descending order before any other sorting criteria are applied. Either way, the document only appears when it matches the search criteria, and it will always be on top. kyle On 10/24/07, Charlie Jackson <[EMAIL PROTECTED]> wrote: > Yes, this will only work if the results are sorted by score (the > default). > > One thing I thought of after I sent this out was that this will include > the specified document even if it doesn't match your search criteria, > which may not be what you want. > > > -Original Message- > From: mark angelillo [mailto:[EMAIL PROTECTED] > Sent: Wednesday, October 24, 2007 12:44 PM > To: solr-user@lucene.apache.org > Subject: Re: Forced Top Document > > Charlie, > > That's interesting. I did try something like this. Did you try your > query with a sorting parameter? > > What I've read suggests that all the results are returned based on > the query specified, but then resorted as specified. Boosting (which > modifies the document's score) should not change the order unless the > results are sorted by score. > > Mark > > On Oct 24, 2007, at 1:05 PM, Charlie Jackson wrote: > > > Do you know which document you want at the top? If so, I believe you > > could just add an "OR" clause to your query to boost that document > > very > > high, such as > > > > ?q=foo OR id:bar^1000 > > > > Tried this on my installation and it did, indeed push the document > > specified to the top. > > > > > > > > -Original Message- > > From: Matthew Runo [mailto:[EMAIL PROTECTED] > > Sent: Wednesday, October 24, 2007 10:17 AM > > To: solr-user@lucene.apache.org > > Subject: Re: Forced Top Document > > > > I'd love to know this, as I just got a development request for this > > very feature. I'd rather not spend time on it if it already exists. > > > > ++ > > | Matthew Runo > > | Zappos Development > > | [EMAIL PROTECTED] > > | 702-943-7833 > > ++ > > > > > > On Oct 23, 2007, at 10:12 PM, mark angelillo wrote: > > > >> Hi all, > >> > >> Is there a way to get a specific document to appear on top of > >> search results even if a sorting parameter would push it further > >> down? > >> > >> Thanks in advance, > >> Mark > >> > >> mark angelillo > >> snooth inc. > >> o: 646.723.4328 > >> c: 484.437.9915 > >> [EMAIL PROTECTED] > >> snooth -- 1.8 million ratings and counting... > >> > >> > > > > mark angelillo > snooth inc. > o: 646.723.4328 > c: 484.437.9915 > [EMAIL PROTECTED] > snooth -- 1.8 million ratings and counting... > > > -- -- Kyle Banerjee Digital Services Program Manager Orbis Cascade Alliance [EMAIL PROTECTED] / 541.359.9599
RE: Tomcat6?
$CATALINA_HOME/conf/Catalina/localhost doesn't exist by default, but you can create it and it will work exactly the same way it did in Tomcat 5. It's not created by default because it's not needed by the manager webapp anymore.

-Original Message-
From: Matthew Runo [mailto:[EMAIL PROTECTED]
Sent: Monday, December 03, 2007 10:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Tomcat6?

In context.xml, I added.. I think that's all I did to get it working in Tomcat 6.

--Matthew Runo

On Dec 3, 2007, at 7:58 AM, Jörg Kiegeland wrote:

> In the Solr wiki, it is not described how to install Solr on Tomcat 6,
> and I have not managed it myself :(
> In the chapter "Configuring Solr Home with JNDI" there is mentioned
> the directory $CATALINA_HOME/conf/Catalina/localhost, which does not
> exist with TOMCAT 6.
>
> Alternatively I tried the folder $CATALINA_HOME/work/Catalina/localhost,
> but with no success.. (I can query the top level page, but the "Solr
> Admin" link then does not work).
>
> Can anybody help?
>
> --
> Dipl.-Inf. Jörg Kiegeland
> ikv++ technologies ag
> Bernburger Strasse 24-25, D-10963 Berlin
> e-mail: [EMAIL PROTECTED], web: http://www.ikv.de
> phone: +49 30 34 80 77 18, fax: +49 30 34 80 78 0
> =
> Handelsregister HRB 81096; Amtsgericht Berlin-Charlottenburg
> board of directors: Dr. Olaf Kath (CEO); Dr. Marc Born (CTO)
> supervising board: Prof. Dr. Bernd Mahr (chairman)
> _
>
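The exact snippet Matthew added isn't shown above. Per the SolrTomcat wiki, a typical context fragment declaring the Solr home over JNDI looks like the following; the paths are placeholders, not Matthew's actual values:

<!-- Dropped in as $CATALINA_HOME/conf/Catalina/localhost/solr.xml;
     docBase and the solr/home value are placeholders. -->
<Context docBase="/path/to/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/path/to/solr/home" override="true"/>
</Context>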
RE: Successful project based on SOLR
Congratulations! > It uses an custom hibernate-SOLR bridge which allows transparent persistence of entities on different SOLR servers. Any chance of this code making its way back to the SOLR community? Or, if not, can you give me an idea how you did it? This seamless integration of Hibernate and Solr is something I'm interested in. -- Charlie -Original Message- From: Marius Hanganu [mailto:[EMAIL PROTECTED] Sent: Thursday, December 20, 2007 10:43 AM To: solr-user@lucene.apache.org Subject: Successful project based on SOLR Hi guys, I just wanted to let you know our company has successfully launched a new high traffic website based on a powerful CMS built on top of SOLR. The website - http://www.hotnews.ro - serves up to 80k users per day with an average 400K pages per day. It uses an custom hibernate-SOLR bridge which allows transparent persistence of entities on different SOLR servers. The CMS behind it also uses an in house developed API for querying SOLR. It was and will be a pleasure to use SOLR in this project. It has many advantages that you're all probably aware of, but the most impressing thing was its reliability. You start your SOLR server and simply forget about it. Once again, congratulations! Marius Hanganu, Director, Tremend Software Consulting www.tremend.ro
RE: Successful project based on SOLR
That's the first I've seen of Hibernate Search. Looks interesting, but I think it's a little different than what I was looking for. Since it indexes into Lucene, it's close, but I wouldn't have a bunch of my favorite Solr features, such as remote indexing and field-level analysis at index and query time. Ultimately, what I'd like is something like Hibernate Search or like Compass GPS (http://www.opensymphony.com/compass/content/about.html) but leveraging Solr's features. That ability to transition back and forth between object and index record would be really elegant but I need those extras that Solr brings to a Lucene index. - Charlie -Original Message- From: Jonathan Ariel [mailto:[EMAIL PROTECTED] Sent: Thursday, December 20, 2007 12:49 PM To: solr-user@lucene.apache.org Subject: Re: Successful project based on SOLR What's the difference with that and Hibernate Search<http://www.hibernate.org/410.html> ? On Dec 20, 2007 2:09 PM, Charlie Jackson <[EMAIL PROTECTED]> wrote: > Congratulations! > > > It uses an custom hibernate-SOLR > bridge which allows transparent persistence of entities on different > SOLR servers. > > Any chance of this code making its way back to the SOLR community? Or, > if not, can you give me an idea how you did it? This seamless > integration of Hibernate and Solr is something I'm interested in. > > -- Charlie > > -Original Message- > From: Marius Hanganu [mailto:[EMAIL PROTECTED] > Sent: Thursday, December 20, 2007 10:43 AM > To: solr-user@lucene.apache.org > Subject: Successful project based on SOLR > > Hi guys, > > I just wanted to let you know our company has successfully launched a > new high traffic website based on a powerful CMS built on top of SOLR. > > The website - http://www.hotnews.ro - serves up to 80k users per day > with an average 400K pages per day. It uses an custom hibernate-SOLR > bridge which allows transparent persistence of entities on different > SOLR servers. The CMS behind it also uses an in house developed API for > querying SOLR. > > It was and will be a pleasure to use SOLR in this project. It has many > advantages that you're all probably aware of, but the most impressing > thing was its reliability. You start your SOLR server and simply forget > about it. > > Once again, congratulations! > Marius Hanganu, > Director, Tremend Software Consulting > www.tremend.ro >
RE: Successful project based on SOLR
Yeah I remember seeing that at one point when I was first looking at the solrj client. I had plans to build on it but I got pulled away on something else. Maybe it's time to take another look and see what I can do with it. As Jonathan said, it's a good project to work on. -Original Message- From: Ryan McKinley [mailto:[EMAIL PROTECTED] Sent: Thursday, December 20, 2007 3:01 PM To: solr-user@lucene.apache.org Subject: Re: Successful project based on SOLR > > Ultimately, what I'd like is something like Hibernate Search or like > Compass GPS (http://www.opensymphony.com/compass/content/about.html) but > leveraging Solr's features. That ability to transition back and forth > between object and index record would be really elegant but I > need those extras that Solr brings to a Lucene index. > Here is an old verison of solrj that connects to Hibernate similar to Compass GPS... http://solrstuff.org/svn/solrj-hibernate/ It is out of date and references some classes that did not make it into the official solrj release, but it is worth a look. ryan
RE: Backup of a Solr index
Solr indexes are file-based, so there's no need to "dump" the index to a file. In terms of how to create backups and move those backups to other servers, check out this page http://wiki.apache.org/solr/CollectionDistribution. Hope that helps. -Original Message- From: Jörg Kiegeland [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 02, 2008 3:17 AM To: solr-user@lucene.apache.org Subject: Backup of a Solr index Is there a standard way to dump the Solr index to a file or to a directory as backup, and to import a such saved index to another Solr index later? Another question I have, is whether one is allowed to copy the /data/index folder while the Solr server is still running, as easy alternative to do a backup (may this conflict with Solr holding open files?)? Happy new year, Jörg
RE: Backup of a Solr index
> But however one has first to shutdown the Solr server before copying the index folder? If you want to copy the hard files from the data/index directory, yes, you'll probably want to shut down the server first. You may be able to get away with leaving the server up but stopping any index/commit operations, but I could be wrong. > It notes a script "abc", but I cannot find it in my Solr distribution (nightly build)? All of the collection distribution scripts can be found in src/scripts in the nightly build if they aren't in the bin directory of the example solr directory. > Run those scripts on Windows XP? No, unfortunately the Collection Distribution scripts won't work in Windows because they use Unix filesystem trickery to operate. -Original Message- From: Jörg Kiegeland [mailto:[EMAIL PROTECTED] Sent: Thursday, January 03, 2008 11:00 AM To: solr-user@lucene.apache.org Subject: Re: Backup of a Solr index Charlie Jackson wrote: > Solr indexes are file-based, so there's no need to "dump" the index to a > file. > But however one has first to shutdown the Solr server before copying the index folder? > In terms of how to create backups and move those backups to other servers, > check out this page http://wiki.apache.org/solr/CollectionDistribution. > It notes a script "abc", but I cannot find it in my Solr distribution (nightly build)? Run those scripts on Windows XP?
RE: highlighting marks wrong words
I believe changing the "AND id: etc etc " part of the query to it's on filter query will take care of your highlighting problem. In other words, try a query like this: q=(auto)&fq=id:(100 OR 1 OR 2 OR 3 OR 5 OR 6)&fl=score&hl.fl=content&hl=true&hl.fragsize=200&hl.snippets=2&hl.simpl e.pre=%3Cb%3E&hl.simple.post=%3C%2Fb%3E&start=0&rows=10 This could also get you a performance boost if you're querying against this set of ids often. -Original Message- From: Alexey Shakov [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 15, 2008 6:54 AM To: solr-user@lucene.apache.org Subject: highlighting marks wrong words Hi all, I have a query like this: q=(auto) AND id:(100 OR 1 OR 2 OR 3 OR 5 OR 6)&fl=score&hl.fl=content&hl=true&hl.fragsize=200&hl.snippets=2&hl.simpl e.pre=%3Cb%3E&hl.simple.post=%3C%2Fb%3E&start=0&rows=10 Default field is content. So, I expect, that only occurrencies of "auto" will be marked. BUT: the occurrencies of id (100, 1, 2, ..), which occasionally also present in content field, are marked as well... The result looks like: North American International Auto Show 2007 - Celebrating 100 years Any ideas? Thanx in advance!
RE: Boosting a token with space at the end
If you haven't explicitly set the sort parameter, Solr will default to ordering by score. Information about Lucene scoring can be found here:

http://lucene.apache.org/java/docs/scoring.html

And, specifically, the score formula can be found here:

http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html

I'm curious, though, what are you basing your "expected" order on? If it's based on some other data in your domain (such as company size or location or something), you can explicitly set your sort parameter accordingly.

- Charlie

-Original Message-
From: Yerraguntla [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 13, 2008 1:50 PM
To: solr-user@lucene.apache.org
Subject: Boosting a token with space at the end

Hi,

I have to search on Company Names, which contain multiple words. Some of the examples are Micro Image Systems, Microsoft Corp, Sun Microsystems, Advanced Micro systems.

For the above example, when the search is for micro, the expected results order is:

Micro Image Systems
Advanced Micro systems
Microsoft Corp
Sun Microsystems

What needs to be done, both for field type, indexing, and query? There are a bunch of company names for each of the company name examples I mentioned. I have been trying a couple of ways with multiple queries, but I am not able to retrieve Micro Image Systems at the top at all.

Appreciate any hints and help.

--Yerra

--
View this message in context:
http://www.nabble.com/Bossting-a-token-with-space-at-the-end-tp15465726p15465726.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Shared index base
How do you handle commits to the index? By that, I mean that Solr recreates its searcher when you issue a commit, but only for the system that does the commit. Wouldn't you be left with searchers on the other machines that are stale? - Charlie -Original Message- From: Matthew Runo [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 26, 2008 12:18 PM To: solr-user@lucene.apache.org Subject: Re: Shared index base I hope so. I've found that every once in a while Solr 1.2 replication will die, from a temp-index file that seems to ham it up. Removing that file on all the servers fixes the issue though. We'd like to be able to point all the servers at an NFS location for their index files, and use a single server to update it. Thanks! Matthew Runo Software Developer Zappos.com 702.943.7833 On Feb 26, 2008, at 9:39 AM, Alok Dhir wrote: > Are you saying all the servers will use the same 'data' dir? Is > that a supported config? > > On Feb 26, 2008, at 12:29 PM, Matthew Runo wrote: > >> We're about to do the same thing here, but have not tried yet. We >> currently run Solr with replication across several servers. So long >> as only one server is doing updates to the index, I think it should >> work fine. >> >> >> Thanks! >> >> Matthew Runo >> Software Developer >> Zappos.com >> 702.943.7833 >> >> On Feb 26, 2008, at 7:51 AM, Evgeniy Strokin wrote: >> >>> I know there was such discussions about the subject, but I want to >>> ask again if somebody could share more information. >>> We are planning to have several separate servers for our search >>> engine. One of them will be index/search server, and all others >>> are search only. >>> We want to use SAN (BTW: should we consider something else?) and >>> give access to it from all servers. So all servers will use the >>> same index base, without any replication, same files. >>> Is this a good practice? Did somebody do the same? Any problems >>> noticed? Or any suggestions, even about different configurations >>> are highly appreciated. >>> >>> Thanks, >>> Gene >> >
NullPointerException (not schema related)
Hello, I'm evaluating Solr for potential use in an application I'm working on, and it sounds like a really great fit. I'm having trouble getting the Collection Distribution part set up, though.

Initially, I had problems setting up the postCommit listener. I first used this XML to configure the listener:

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">snapshooter</str>
  <str name="dir">/usr/local/Production/solr/solr/bin/</str>
  <bool name="wait">true</bool>
</listener>

This is what came in the solrconfig.xml file, with just a minor tweak to the directory. However, when I committed data to the index, I was getting "No such file or directory" errors from the Runtime.exec call. I verified all of the permissions, etc., with the user I was trying to use. In the end, I wrote up a little test program to see if it was a problem with the Runtime.exec call, and I think it is. I'm running this on CentOS 4.4, and Runtime.exec seems to have a hard time directly executing bash scripts. For example, if I called Runtime.exec with a command of "test_program" (which is a bash script), it failed. If I called Runtime.exec with a command of "/bin/bash test_program", it worked. So, with this knowledge in hand, I modified the solrconfig.xml file again to this:

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">/bin/bash</str>
  <str name="dir">/usr/local/Production/solr/solr/bin/</str>
  <bool name="wait">true</bool>
  <str name="args">snapshooter</str>
</listener>

When I commit data now, however, I get a NullPointerException. I'm including the stack trace here:

SEVERE: java.lang.NullPointerException
        at org.apache.solr.core.SolrCore.update(SolrCore.java:716)
        at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:53)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:269)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:210)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:870)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:685)
        at java.lang.Thread.run(Thread.java:619)

I know this has something to do with my config change (the problem goes away if I turn off the postCommit listener), but I don't know what! BTW, I'm using solr-1.1.0-incubating. Thanks in advance for any help!

Charlie
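In case it helps anyone reproduce this, here is roughly the kind of test program I mean, a minimal sketch with a hypothetical script name:

public class ExecTest {
    public static void main(String[] args) throws Exception {
        // Failed for me: executing the bash script directly.
        // Process p = Runtime.getRuntime().exec("test_program");

        // Worked: invoking the shell explicitly with the script as an argument.
        Process p = Runtime.getRuntime().exec(
                new String[] { "/bin/bash", "test_program" });
        System.out.println("exit code: " + p.waitFor());
    }
}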
RE: NullPointerException (not schema related)
Nevermind this... it looks like my problem was tagging the "args" as an <str> node instead of an <arr> node. Thanks anyway!

Charlie

-----Original Message-----
From: Charlie Jackson [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 01, 2007 12:02 PM
To: solr-user@lucene.apache.org
Subject: NullPointerException (not schema related)

[original message snipped; see the previous post]
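For the archives, the working config ends up looking something like this (same values as my earlier post; the key change is the <arr> wrapper around the args):

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">/bin/bash</str>
  <str name="dir">/usr/local/Production/solr/solr/bin/</str>
  <bool name="wait">true</bool>
  <arr name="args"> <str>snapshooter</str> </arr>
</listener>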
RE: NullPointerException (not schema related)
I went with the first approach, which got me up and running. Your other example config (using ./snapshooter) made me realize how foolish my original problem was! Anyway, I've got the whole thing up and running, and it looks pretty awesome!

One quick question, though. As stated in the wiki, one of the benefits of distributing the indexes is to load balance the queries. Is there a built-in Solr mechanism for performing this query load balancing? I suspect there is not, and I haven't seen anything about it in the wiki, but I wanted to check because I know I'm going to be asked.

Thanks,
Charlie

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 01, 2007 3:20 PM
To: solr-user@lucene.apache.org
Subject: RE: NullPointerException (not schema related)

: <listener event="postCommit" class="solr.RunExecutableListener">
:   <str name="exe">snapshooter</str>
:   <str name="dir">/usr/local/Production/solr/solr/bin/</str>
:   <bool name="wait">true</bool>
:
: the directory. However, when I committed data to the index, I was
: getting "No such file or directory" errors from the Runtime.exec call. I
: verified all of the permissions, etc, with the user I was trying to use.
: In the end, I wrote up a little test program to see if it was a problem
: with the Runtime.exec call and I think it is. I'm running this on CentOS
: 4.4 and Runtime.exec seems to have a hard time directly executing bash
: scripts. For example, if I called Runtime.exec with a command of
: "test_program" (which is a bash script), it failed. If I called
: Runtime.exec with a command of "/bin/bash test_program" it worked.

this initial problem you were having may be a result of path issues. dir doesn't need to be the directory where your script lives, it's the directory where you want your script to run (the "cwd" of the process). it's possible that the error you were getting was because "." isn't in the PATH that was being used. you should try something like this...

  <str name="exe">/usr/local/Production/solr/solr/bin/snapshooter</str>
  <str name="dir">/usr/local/Production/solr/solr/bin/</str>
  <bool name="wait">true</bool>

...or maybe even...

  <str name="exe">./snapshooter</str>
  <str name="dir">/usr/local/Production/solr/solr/bin/</str>
  <bool name="wait">true</bool>

-Hoss
RE: NullPointerException (not schema related)
Otis,

Thanks for the response, that list should be very useful!

Charlie

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 02, 2007 11:13 AM
To: solr-user@lucene.apache.org
Subject: Re: NullPointerException (not schema related)

Charlie,

There is nothing built into Solr for that. But you can use any of the numerous free proxies/load balancers. Here is a collection that I've got: http://www.simpy.com/user/otis/search/load%2Bbalance+OR+proxy

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message ----
From: Charlie Jackson <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, May 1, 2007 5:31:13 PM
Subject: RE: NullPointerException (not schema related)

[earlier messages in the thread snipped; see the previous posts]
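If a separate proxy feels like overkill, another option, just a sketch of the idea rather than anything from this thread (hostnames hypothetical), is to rotate through the slave URLs in the client code:

import java.util.concurrent.atomic.AtomicInteger;

// Minimal client-side round-robin over a fixed list of search slaves.
public class RoundRobinSolr {
    private final String[] baseUrls;
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinSolr(String... baseUrls) {
        this.baseUrls = baseUrls;
    }

    // Each query goes to the next slave in rotation.
    public String nextBaseUrl() {
        int i = Math.floorMod(next.getAndIncrement(), baseUrls.length);
        return baseUrls[i];
    }

    public static void main(String[] args) {
        RoundRobinSolr lb = new RoundRobinSolr(
                "http://solr-slave1:8983/solr", "http://solr-slave2:8983/solr");
        System.out.println(lb.nextBaseUrl()); // solr-slave1
        System.out.println(lb.nextBaseUrl()); // solr-slave2
    }
}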
Index corruptions?
I have a couple of questions regarding index corruptions.

1) Has anyone using Solr in a production environment ever experienced an index corruption? If so, how frequently do they occur?

2) It seems like the CollectionDistribution setup would be a good way to put in place a recovery plan for (or at least have some viable backups of) the index. However, I have a small concern that if the index gets corrupted on the master server, the corruption would propagate down to the slave servers as well. Is this concern unfounded? Also, each of the snapshots taken by snapshooter is a viable full index, correct? If so, that means I'd have a backup of the index each and every time a commit (or optimize, for that matter) is done, which would be awesome.

One of our biggest requirements for the indexing process is to have a good backup/recovery strategy in place, and I want to make sure Solr will be able to provide that. Thanks in advance!

Charlie
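For what it's worth, the restore path I have in mind would look something like this (paths and snapshot name hypothetical; snapshooter makes hard-link copies, so each snapshot.* directory is a complete standalone index):

# Stop the servlet container first so nothing has the index open.
cd /usr/local/Production/solr/solr/data
mv index index.corrupt
cp -rp snapshot.20070510093000 index
# Restart the servlet container and verify searches against the restored index.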
RE: fast update handlers
What about issuing separate commits to the index on a regularly scheduled basis? For example, you add documents to the index every 2 seconds, or however often, but these operations don't commit. Instead, you have a cron'd script or something that just issues a commit every 5 or 10 minutes, or whatever interval you'd like. I had to do something similar when I was running a re-index of my entire dataset. My program wasn't issuing commits, so I just cron'd a commit every half hour so it didn't overload the server.

Thanks,
Charlie

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Thursday, May 10, 2007 9:07 AM
To: solr-user@lucene.apache.org
Subject: Re: fast update handlers

On 5/10/07, Will Johnson <[EMAIL PROTECTED]> wrote:
> I guess I was more concerned with doing the frequent commits and how
> that would affect the caches. Say I have 2M docs in my main index but I
> want to add docs every 2 seconds, all while doing queries. If I do
> commits every 2 seconds, I basically lose any caching advantage, and my
> faceting performance goes down the tube. If, however, I were to add
> things to a smaller index and then roll it into the larger one every ~30
> minutes, then I only take the hit of computing the larger filter caches
> on that interval. Further, if my smaller index were based on a
> RAMDirectory instead of an FSDirectory, I assume computing the filter
> sets for the smaller index should be fast enough, even every 2 seconds.

There isn't currently any support for incrementally updating filters.

-Yonik
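Concretely, the cron'd commit can be as simple as something like this crontab entry (URL and interval hypothetical; it posts an empty <commit/> to the update handler):

# Commit every 10 minutes; indexing clients never block on the commit.
*/10 * * * * curl -s http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>' > /dev/null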