Re: Phrase Highlighter + Surround Query Parser
Picking up this thread again... When you said 'stock one', did you mean the built-in surround query parser or the customized one? We already use usePhraseHighlighter=true. On Mon, Aug 4, 2014 at 10:38 AM, Ahmet Arslan wrote: > Hi, > > You are using a customized surround query parser, right? > > Did you check/try with the stock one? I recall correctly > usePhrasehighlighter=true was working in the past for surround. > > Ahmet > > > > On Monday, August 4, 2014 8:25 AM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > Anyone? > > > On Fri, Aug 1, 2014 at 12:31 PM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > We are having an issue in Phrase highlighter with Surround Query Parser > > e.g. *"first thing" w/100 "you must" *brings correct results but also > > highlights individual words of the phrase - first, thing are highlighted > > where they come separately as well. > > > > Any idea how this can be fixed? > > > > > > -- > > Regards, > > > > Salman Akram > > > > > > > > > > -- > Regards, > > Salman Akram > > -- Regards, Salman Akram
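For readers of this thread, a minimal SolrJ 4.x sketch of the kind of request being discussed; the host, core, field name ("Contents") and query string are illustrative only, and the stock surround parser uses its own W/N operators (e.g. 3w(first, thing)) rather than the customized w/100 syntax quoted above:

// SolrJ 4.x fragment (org.apache.solr.client.solrj.*); assumes a running core on the classpath.
HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
SolrQuery q = new SolrQuery("{!surround}Contents:3w(first, thing)"); // stock surround prefix syntax
q.setHighlight(true);                    // hl=true
q.addHighlightField("Contents");         // hl.fl=Contents
q.set("hl.usePhraseHighlighter", true);  // the parameter discussed in this thread
QueryResponse rsp = server.query(q);
System.out.println(rsp.getHighlighting()); // per-document snippets keyed by uniqueKey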
Re: Partial Counts in SOLR
The issue with timeAllowed is that you never know whether it will return a minimum number of docs or not. I do want docs to be sorted by date, but it seems it's not possible for Solr to start searching from the most recent docs and stop after finding a certain number of docs... any other tweak? Thanks On Saturday, March 8, 2014, Chris Hostetter wrote: > > : Reason: In an index with millions of documents I don't want to know that > a > : certain query matched 1 million docs (of course it will take time to > : calculate that). Why don't just stop looking for more results lets say > : after it finds 100 docs? Possible?? > > but if you care about sorting, ie: you want the top 100 documents sorted > by score, or sorted by date, you still have to "collect" all 1 million > matches in order to know what the first 100 are. > > if you really don't care about sorting, you can use the "timAllowed" > option to tell the seraching method to do the best job it can in an > (approximated) limited amount of time, and then pretend that the docs > collected so far represent the total number of matches... > > > https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-ThetimeAllowedParameter > > > -Hoss > http://www.lucidworks.com/ > -- Regards, Salman Akram Project Manager - Intelligize NorthBay Solutions 410-G4 Johar Town, Lahore Off: +92-42-35290152 Cell: +92-302-8495621
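For reference, a minimal SolrJ 4.x sketch of the timeAllowed behaviour Hoss describes; the host, core, field names and the 2000 ms budget are illustrative assumptions, not settings from this thread:

// SolrJ 4.x fragment (org.apache.solr.client.solrj.*).
HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
SolrQuery q = new SolrQuery("Contents:\"first thing\"");
q.addSort("date", SolrQuery.ORDER.desc); // hypothetical date field
q.setRows(100);
q.setTimeAllowed(2000);                  // stop searching after roughly 2 seconds
QueryResponse rsp = server.query(q);
// If the budget ran out, the response header carries partialResults=true and
// numFound only reflects what was collected before the cut-off.
Boolean partial = (Boolean) rsp.getResponseHeader().get("partialResults");
long numFound = rsp.getResults().getNumFound();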
Re: Partial Counts in SOLR
It's a long video and I will definitely go through it, but it seems this is not possible with SOLR as it is? I just thought it would be quite a common issue; I mean, generally for search engines it's more important to show the first page of results quickly, rather than using timeAllowed, which might not even return a single result. Thanks! -- Regards, Salman Akram
Re: Partial Counts in SOLR
Well some of the searches take minutes. Below are some stats about this particular index that I am talking about: Index size = 400GB (Using CommonGrams so without that the index is around 180GB) Position File = 280GB Total Docs = 170 million (just indexed for searching - for highlighting contents are stored in another index) Avg Doc Size = Few hundred KBs RAM = 384GB (it has other indexes too but still OS cache can have 60-80% of the total index cached) Phrase queries run pretty fast with CG but complex versions of wildcard and proximity queries can be really slow. I know using CG will make them slow but they just take too long. By default sorting is on date but users have few other parameters too on which they can sort. I wanted to avoid creating multiple indexes (maybe based on years) but seems that to search on partial data that's the only feasible way. On Wed, Mar 12, 2014 at 2:47 PM, Dmitry Kan wrote: > As Hoss pointed out above, different projects have different requirements. > Some want to sort by date of ingestion reverse, which means that having > posting lists organized in a reverse order with the early termination is > the way to go (no such feature in Solr directly). Some other projects want > to collect all docs matching a query, and then sort by rank, but you cannot > guarantee, that the most recently inserted document is the most relevant in > terms of your ranking. > > > Do your current searches take too long? > > > On Tue, Mar 11, 2014 at 11:51 AM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > Its a long video and I will definitely go through it but it seems this is > > not possible with SOLR as it is? > > > > I just thought it would be quite a common issue; I mean generally for > > search engines its more important to show the first page results, rather > > than using timeAllowed which might not even return a single result. > > > > Thanks! > > > > > > -- > > Regards, > > > > Salman Akram > > > > > > -- > Dmitry > Blog: http://dmitrykan.blogspot.com > Twitter: http://twitter.com/dmitrykan > -- Regards, Salman Akram
Re: Partial Counts in SOLR
1- SOLR 4.6 2- We do but right now I am talking about plain keyword queries just sorted by date. Once this is better will start looking into caches which we already changed a little. 3- As I said the contents are not stored in this index. Some other metadata fields are but with normal queries its super fast so I guess even if I change there it will be a minor difference. We have SSD and quite fast too. 4- That's something we need to do but even in low workload those queries take a lot of time 5- Every 10 mins and currently no auto warming as user queries are rarely same and also once its fully warmed those queries are still slow. 6- Nops. On Thu, Mar 13, 2014 at 5:38 PM, Dmitry Kan wrote: > 1. What is your solr version? In 4.x family the proximity searches have > been optimized among other query types. > 2. Do you use the filter queries? What is the situation with the cache > utilization ratios? Optimize (= i.e. bump up the respective cache sizes) if > you have low hitratios and many evictions. > 3. Can you avoid storing some fields and only index them? When the field is > stored and it is retrieved in the result, there are couple of disk seeks > per field=> search slows down. Consider SSD disks. > 4. Do you monitor your system in terms of RAM / cache stats / GC? Do you > observe STW GC pauses? > 5. How often do you commit & do you have the autowarming / external warming > configured? > 6. If you use faceting, consider storing DocValues for facet fields. > > some solr wiki docs: > > https://wiki.apache.org/solr/SolrPerformanceProblems?highlight=%28%28SolrPerformanceFactors%29%29 > > > > > > On Thu, Mar 13, 2014 at 8:52 AM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > Well some of the searches take minutes. > > > > Below are some stats about this particular index that I am talking about: > > > > Index size = 400GB (Using CommonGrams so without that the index is around > > 180GB) > > Position File = 280GB > > Total Docs = 170 million (just indexed for searching - for highlighting > > contents are stored in another index) > > Avg Doc Size = Few hundred KBs > > RAM = 384GB (it has other indexes too but still OS cache can have 60-80% > of > > the total index cached) > > > > Phrase queries run pretty fast with CG but complex versions of wildcard > and > > proximity queries can be really slow. I know using CG will make them slow > > but they just take too long. By default sorting is on date but users have > > few other parameters too on which they can sort. > > > > I wanted to avoid creating multiple indexes (maybe based on years) but > > seems that to search on partial data that's the only feasible way. > > > > > > > > > > On Wed, Mar 12, 2014 at 2:47 PM, Dmitry Kan > wrote: > > > > > As Hoss pointed out above, different projects have different > > requirements. > > > Some want to sort by date of ingestion reverse, which means that having > > > posting lists organized in a reverse order with the early termination > is > > > the way to go (no such feature in Solr directly). Some other projects > > want > > > to collect all docs matching a query, and then sort by rank, but you > > cannot > > > guarantee, that the most recently inserted document is the most > relevant > > in > > > terms of your ranking. > > > > > > > > > Do your current searches take too long? 
> > > > > > > > > On Tue, Mar 11, 2014 at 11:51 AM, Salman Akram < > > > salman.ak...@northbaysolutions.net> wrote: > > > > > > > Its a long video and I will definitely go through it but it seems > this > > is > > > > not possible with SOLR as it is? > > > > > > > > I just thought it would be quite a common issue; I mean generally for > > > > search engines its more important to show the first page results, > > rather > > > > than using timeAllowed which might not even return a single result. > > > > > > > > Thanks! > > > > > > > > > > > > -- > > > > Regards, > > > > > > > > Salman Akram > > > > > > > > > > > > > > > > -- > > > Dmitry > > > Blog: http://dmitrykan.blogspot.com > > > Twitter: http://twitter.com/dmitrykan > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > > > > -- > Dmitry > Blog: http://dmitrykan.blogspot.com > Twitter: http://twitter.com/dmitrykan > -- Regards, Salman Akram
Re: Partial Counts in SOLR
Below is one of the sample slow query that takes mins! ((stock or share*) w/10 (sale or sell* or sold or bought or buy* or purchase* or repurchase*)) w/10 (executive or director) If a filter is used it comes in fq but what can be done about plain keyword search? On Sun, Mar 16, 2014 at 4:37 AM, Erick Erickson wrote: > What are our complex queries? You > say that your app will very rarely see the > same query thus you aren't using caches... > But, if you can move some of your > clauses to fq clauses, then the filterCache > might well be used to good effect. > > > > On Thu, Mar 13, 2014 at 7:22 AM, Salman Akram > wrote: > > 1- SOLR 4.6 > > 2- We do but right now I am talking about plain keyword queries just > sorted > > by date. Once this is better will start looking into caches which we > > already changed a little. > > 3- As I said the contents are not stored in this index. Some other > metadata > > fields are but with normal queries its super fast so I guess even if I > > change there it will be a minor difference. We have SSD and quite fast > too. > > 4- That's something we need to do but even in low workload those queries > > take a lot of time > > 5- Every 10 mins and currently no auto warming as user queries are rarely > > same and also once its fully warmed those queries are still slow. > > 6- Nops. > > > > On Thu, Mar 13, 2014 at 5:38 PM, Dmitry Kan > wrote: > > > >> 1. What is your solr version? In 4.x family the proximity searches have > >> been optimized among other query types. > >> 2. Do you use the filter queries? What is the situation with the cache > >> utilization ratios? Optimize (= i.e. bump up the respective cache > sizes) if > >> you have low hitratios and many evictions. > >> 3. Can you avoid storing some fields and only index them? When the > field is > >> stored and it is retrieved in the result, there are couple of disk seeks > >> per field=> search slows down. Consider SSD disks. > >> 4. Do you monitor your system in terms of RAM / cache stats / GC? Do you > >> observe STW GC pauses? > >> 5. How often do you commit & do you have the autowarming / external > warming > >> configured? > >> 6. If you use faceting, consider storing DocValues for facet fields. > >> > >> some solr wiki docs: > >> > >> > https://wiki.apache.org/solr/SolrPerformanceProblems?highlight=%28%28SolrPerformanceFactors%29%29 > >> > >> > >> > >> > >> > >> On Thu, Mar 13, 2014 at 8:52 AM, Salman Akram < > >> salman.ak...@northbaysolutions.net> wrote: > >> > >> > Well some of the searches take minutes. > >> > > >> > Below are some stats about this particular index that I am talking > about: > >> > > >> > Index size = 400GB (Using CommonGrams so without that the index is > around > >> > 180GB) > >> > Position File = 280GB > >> > Total Docs = 170 million (just indexed for searching - for > highlighting > >> > contents are stored in another index) > >> > Avg Doc Size = Few hundred KBs > >> > RAM = 384GB (it has other indexes too but still OS cache can have > 60-80% > >> of > >> > the total index cached) > >> > > >> > Phrase queries run pretty fast with CG but complex versions of > wildcard > >> and > >> > proximity queries can be really slow. I know using CG will make them > slow > >> > but they just take too long. By default sorting is on date but users > have > >> > few other parameters too on which they can sort. > >> > > >> > I wanted to avoid creating multiple indexes (maybe based on years) but > >> > seems that to search on partial data that's the only feasible way. 
> >> > > >> > > >> > > >> > > >> > On Wed, Mar 12, 2014 at 2:47 PM, Dmitry Kan > >> wrote: > >> > > >> > > As Hoss pointed out above, different projects have different > >> > requirements. > >> > > Some want to sort by date of ingestion reverse, which means that > having > >> > > posting lists organized in a reverse order with the early > termination > >> is > >> > > the way to go (no such feature in Solr directly). Some other > projects > >> > want > >> > > to collect all docs matching a query, and then sort by rank, but y
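To make Erick's filterCache suggestion concrete, a hedged SolrJ sketch (the metadata fields and values below are hypothetical): clauses that repeat across many searches go into fq, where each one is cached in filterCache and reused regardless of what the scoring/proximity part in q looks like.

SolrQuery q = new SolrQuery("{!surround}Contents:10n(stock, sale)"); // the expensive proximity part stays in q
q.addFilterQuery("date:[2013-01-01T00:00:00Z TO *]");                // hypothetical date filter, cached
q.addFilterQuery("doctype:filing");                                  // hypothetical metadata filter, cached
q.addSort("date", SolrQuery.ORDER.desc);
QueryResponse rsp = server.query(q); // 'server' as in the earlier sketches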
Best SSD block size for large SOLR indexes
All, Is there a rule of thumb for ideal block size for SSDs for large indexes (in hundreds of GBs)? Read performance is of top importance for us and we can sacrifice the space a little... This is the one we just got and wanted to see if there are any test results out there http://www.storagereview.com/micron_p420m_enterprise_pcie_ssd_review -- Regards, Salman Akram
Re: Best SSD block size for large SOLR indexes
This SSD default size seems to be 4K not 16K (as can be seen below). Bytes Per Sector : 512 Bytes Per Physical Sector : 4096 Bytes Per Cluster : 4096 Bytes Per FileRecord Segment: 1024 I will go through the articles you sent. Thanks On Tue, Mar 18, 2014 at 6:31 PM, Shawn Heisey wrote: > On 3/18/2014 7:12 AM, Salman Akram wrote: > > Is there a rule of thumb for ideal block size for SSDs for large indexes > > (in hundreds of GBs)? Read performance is of top importance for us and we > > can sacrifice the space a little... > > > > This is the one we just got and wanted to see if there are any test > results > > out there > > http://www.storagereview.com/micron_p420m_enterprise_pcie_ssd_review > > The best filesystem block size to use for SSDs is dictated more by the > characteristics of the SSD itself than what data you put on it. > > Here's an awesome series of articles about SSDs that I heard about from > Shalin Shekhar Mangar: > > > http://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/ > > With the page size of most large SSDs at 16KB, you might want to go with > a multiple of that, like 64KB, and learn about the proper use of parted > to align partition boundaries. > > As for whether there are Solr settings that can improve the I/O > characteristics when reading/writing, that I do not know. > > Thanks, > Shawn > > -- Regards, Salman Akram
Re: Best SSD block size for large SOLR indexes
Thanks for the info. The articles were really useful but still seems I have to do my own testing to find the right page size? I thought for large indexes there would already be some tests done in SOLR community. Side note: We are heavily using Microsoft technology (.NET etc) for development so by looking at all the pros/cons decided to stick with Windows. Wasn't rude ;) On Tue, Mar 18, 2014 at 7:22 PM, Shawn Heisey wrote: > On 3/18/2014 7:39 AM, Salman Akram wrote: > > This SSD default size seems to be 4K not 16K (as can be seen below). > > > > Bytes Per Sector : 512 > > Bytes Per Physical Sector : 4096 > > Bytes Per Cluster : 4096 > > Bytes Per FileRecord Segment: 1024 > > The *sector* size on a typical SSD is 4KB, but the *page* size is a > lower level detail, and is more likely to be 16KB, especially on a very > large SSD. > > The Micron P420m is actually mentioned specifically in the SSD article I > linked, and a table in part 2 states that its page size is 16KB, with a > block size of 8MB. > > Possibly rude side note: Windows? Really? > > Thanks, > Shawn > > -- Regards, Salman Akram
Re: Best SSD block size for large SOLR indexes
We do have couple of commodity SSDs already and they perform good. However, our user queries are very complex and quite a few of them go above a minute so we really had to do something about it. Using this beast vs putting the whole index to RAM, the beast still seemed a better option. Also we are using some top notch servers already. On Wed, Mar 19, 2014 at 1:52 AM, Toke Eskildsen wrote: > Salman Akram [salman.ak...@northbaysolutions.net] wrote: > > [Hundreds of GB index] > > > http://www.storagereview.com/micron_p420m_enterprise_pcie_ssd_review > > May I ask why you have chosen a drive with such a high speed and matching > cost? > > We have some years of experience with using SSDs for search at work and it > is our experience that commodity SSDs performs very well (one test showed > something like 80% of RAM speed, YMMW). It seems to me that more servers > with commodity SSDs could very well be cheaper and give better throughput > than the beast(s) you're using. Are you trying to minimize latency "at all > cost"? > > Regards, > Toke Eskildsen -- Regards, Salman Akram
Re: Partial Counts in SOLR
Anyone? On Mon, Mar 17, 2014 at 12:03 PM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > Below is one of the sample slow query that takes mins! > > ((stock or share*) w/10 (sale or sell* or sold or bought or buy* or > purchase* or repurchase*)) w/10 (executive or director) > > If a filter is used it comes in fq but what can be done about plain > keyword search? > > > On Sun, Mar 16, 2014 at 4:37 AM, Erick Erickson > wrote: > >> What are our complex queries? You >> say that your app will very rarely see the >> same query thus you aren't using caches... >> But, if you can move some of your >> clauses to fq clauses, then the filterCache >> might well be used to good effect. >> >> >> >> On Thu, Mar 13, 2014 at 7:22 AM, Salman Akram >> wrote: >> > 1- SOLR 4.6 >> > 2- We do but right now I am talking about plain keyword queries just >> sorted >> > by date. Once this is better will start looking into caches which we >> > already changed a little. >> > 3- As I said the contents are not stored in this index. Some other >> metadata >> > fields are but with normal queries its super fast so I guess even if I >> > change there it will be a minor difference. We have SSD and quite fast >> too. >> > 4- That's something we need to do but even in low workload those queries >> > take a lot of time >> > 5- Every 10 mins and currently no auto warming as user queries are >> rarely >> > same and also once its fully warmed those queries are still slow. >> > 6- Nops. >> > >> > On Thu, Mar 13, 2014 at 5:38 PM, Dmitry Kan >> wrote: >> > >> >> 1. What is your solr version? In 4.x family the proximity searches have >> >> been optimized among other query types. >> >> 2. Do you use the filter queries? What is the situation with the cache >> >> utilization ratios? Optimize (= i.e. bump up the respective cache >> sizes) if >> >> you have low hitratios and many evictions. >> >> 3. Can you avoid storing some fields and only index them? When the >> field is >> >> stored and it is retrieved in the result, there are couple of disk >> seeks >> >> per field=> search slows down. Consider SSD disks. >> >> 4. Do you monitor your system in terms of RAM / cache stats / GC? Do >> you >> >> observe STW GC pauses? >> >> 5. How often do you commit & do you have the autowarming / external >> warming >> >> configured? >> >> 6. If you use faceting, consider storing DocValues for facet fields. >> >> >> >> some solr wiki docs: >> >> >> >> >> https://wiki.apache.org/solr/SolrPerformanceProblems?highlight=%28%28SolrPerformanceFactors%29%29 >> >> >> >> >> >> >> >> >> >> >> >> On Thu, Mar 13, 2014 at 8:52 AM, Salman Akram < >> >> salman.ak...@northbaysolutions.net> wrote: >> >> >> >> > Well some of the searches take minutes. >> >> > >> >> > Below are some stats about this particular index that I am talking >> about: >> >> > >> >> > Index size = 400GB (Using CommonGrams so without that the index is >> around >> >> > 180GB) >> >> > Position File = 280GB >> >> > Total Docs = 170 million (just indexed for searching - for >> highlighting >> >> > contents are stored in another index) >> >> > Avg Doc Size = Few hundred KBs >> >> > RAM = 384GB (it has other indexes too but still OS cache can have >> 60-80% >> >> of >> >> > the total index cached) >> >> > >> >> > Phrase queries run pretty fast with CG but complex versions of >> wildcard >> >> and >> >> > proximity queries can be really slow. I know using CG will make them >> slow >> >> > but they just take too long. 
By default sorting is on date but users >> have >> >> > few other parameters too on which they can sort. >> >> > >> >> > I wanted to avoid creating multiple indexes (maybe based on years) >> but >> >> > seems that to search on partial data that's the only feasible way. >> >> > >> >> > >> >> > >> >> > >> >> > On Wed, Mar 12, 2014 at 2:47 PM, Dmitry Kan >> >> wrote: >> >> > >> >&g
Re: Partial Counts in SOLR
This was one example. Users can even add phrase searches with wildcards/proximity etc so can't really use stemming. Sharding is definitely something we are already looking into. On Wed, Mar 19, 2014 at 6:59 PM, Erick Erickson wrote: > Yes, that'll be slow. Wildcards are, at best, interesting and at worst > resource consumptive. Especially when you're doing this kind of > positioning information as well. > > Consider looking at the problem sideways. That is, what is your > purpose in searching for, say, "buy*"? You want to find buy, buying, > buyers, etc? Would you get bette results if you just stemmed and > omitted the wildcards? > > Do you have a restricted vocabulary that would allow you to define > synonyms for the "important" words and all their variants at index > time and use that? > > Finally, of course, you could shard your index (or add more shards if > you're already sharding) if you really _must_ support these kinds of > queries and can't work around the problem. > > Best, > Erick > > On Tue, Mar 18, 2014 at 11:21 PM, Salman Akram > wrote: > > Anyone? > > > > > > On Mon, Mar 17, 2014 at 12:03 PM, Salman Akram < > > salman.ak...@northbaysolutions.net> wrote: > > > >> Below is one of the sample slow query that takes mins! > >> > >> ((stock or share*) w/10 (sale or sell* or sold or bought or buy* or > >> purchase* or repurchase*)) w/10 (executive or director) > >> > >> If a filter is used it comes in fq but what can be done about plain > >> keyword search? > >> > >> > >> On Sun, Mar 16, 2014 at 4:37 AM, Erick Erickson < > erickerick...@gmail.com>wrote: > >> > >>> What are our complex queries? You > >>> say that your app will very rarely see the > >>> same query thus you aren't using caches... > >>> But, if you can move some of your > >>> clauses to fq clauses, then the filterCache > >>> might well be used to good effect. > >>> > >>> > >>> > >>> On Thu, Mar 13, 2014 at 7:22 AM, Salman Akram > >>> wrote: > >>> > 1- SOLR 4.6 > >>> > 2- We do but right now I am talking about plain keyword queries just > >>> sorted > >>> > by date. Once this is better will start looking into caches which we > >>> > already changed a little. > >>> > 3- As I said the contents are not stored in this index. Some other > >>> metadata > >>> > fields are but with normal queries its super fast so I guess even if > I > >>> > change there it will be a minor difference. We have SSD and quite > fast > >>> too. > >>> > 4- That's something we need to do but even in low workload those > queries > >>> > take a lot of time > >>> > 5- Every 10 mins and currently no auto warming as user queries are > >>> rarely > >>> > same and also once its fully warmed those queries are still slow. > >>> > 6- Nops. > >>> > > >>> > On Thu, Mar 13, 2014 at 5:38 PM, Dmitry Kan > >>> wrote: > >>> > > >>> >> 1. What is your solr version? In 4.x family the proximity searches > have > >>> >> been optimized among other query types. > >>> >> 2. Do you use the filter queries? What is the situation with the > cache > >>> >> utilization ratios? Optimize (= i.e. bump up the respective cache > >>> sizes) if > >>> >> you have low hitratios and many evictions. > >>> >> 3. Can you avoid storing some fields and only index them? When the > >>> field is > >>> >> stored and it is retrieved in the result, there are couple of disk > >>> seeks > >>> >> per field=> search slows down. Consider SSD disks. > >>> >> 4. Do you monitor your system in terms of RAM / cache stats / GC? Do > >>> you > >>> >> observe STW GC pauses? > >>> >> 5. 
How often do you commit & do you have the autowarming / external > >>> warming > >>> >> configured? > >>> >> 6. If you use faceting, consider storing DocValues for facet fields. > >>> >> > >>> >> some solr wiki docs: > >>> >> > >>> >> > >>> > https://wiki.apache.org/solr/SolrPerformanceProblems?highlight=%28%28SolrPerformanceFactors%29%29 > >>> >> >
Re: w/10 ? [was: Partial Counts in SOLR]
Yup! On Thu, Mar 20, 2014 at 5:13 AM, Otis Gospodnetic < otis.gospodne...@gmail.com> wrote: > Hi, > > Guessing it's surround query parser's support for "within" backed by span > queries. > > Otis > Solr & ElasticSearch Support > http://sematext.com/ > On Mar 19, 2014 4:44 PM, "T. Kuro Kurosaka" wrote: > > > In the thread "Partial Counts in SOLR", Salman gave us this sample query: > > > > ((stock or share*) w/10 (sale or sell* or sold or bought or buy* or > >> purchase* or repurchase*)) w/10 (executive or director) > >> > > > > I'm not familiar with this w/10 notation. What does this mean, > > and what parser(s) supports this syntax? > > > > Kuro > > > > > -- Regards, Salman Akram
Re: Best SSD block size for large SOLR indexes
For now I am going with 64kb and results seem good. Thanks for the useful feedback. On Wed, Mar 19, 2014 at 9:30 PM, Shawn Heisey wrote: > On 3/19/2014 12:09 AM, Salman Akram wrote: > >> Thanks for the info. The articles were really useful but still seems I >> have >> to do my own testing to find the right page size? I thought for large >> indexes there would already be some tests done in SOLR community. >> >> Side note: We are heavily using Microsoft technology (.NET etc) for >> development so by looking at all the pros/cons decided to stick with >> Windows. Wasn't rude ;) >> > > Assuming you are only going to be putting Solr data on it, or anything > else you put on it will also consist of large files, I would probably go > with a cluster size at least 64KB for an NTFS volume, and I might consider > 128KB or 256KB. There *ARE* a few small files in a Solr index, but not > enough of them for the wasted space to become a problem. > > The easiest way to configure Solr to use a different location than the > program directory is to change the solr home. > > Thanks, > Shawn > > -- Regards, Salman Akram
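For completeness, since the thread settled on 64KB: on Windows the NTFS cluster size is fixed when the volume is formatted, so moving from the default 4KB to 64KB means reformatting the volume, e.g. format <drive>: /FS:NTFS /A:64K /Q (the drive letter and the quick-format flag are just for illustration), and the resulting value can be rechecked afterwards with fsutil fsinfo ntfsinfo <drive>: , which is where the "Bytes Per Cluster" figures quoted earlier in this thread come from.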
Re: w/10 ? [was: Partial Counts in SOLR]
Basically we just created this syntax for the ease of users, otherwise on back end it uses W or N operators. On Tue, Mar 25, 2014 at 4:21 AM, Ahmet Arslan wrote: > Hi, > > There is no w/ syntax in surround. > /* Query language operators: OR, AND, NOT, W, N, (, ), ^, *, ?, " and > comma */ > > Ahmet > > > > On Monday, March 24, 2014 9:46 PM, T. Kuro Kurosaka > wrote: > On 3/19/14 5:13 PM, Otis Gospodnetic wrote:> Hi, > > > > Guessing it's surround query parser's support for "within" backed by span > > queries. > > > > Otis > > You mean this? > http://wiki.apache.org/solr/SurroundQueryParser > > I guess this parser needs improvement in documentation area. > It doesn't explain or have an example of the w/ syntax at all. > (Is this the infix notation of W?) > An example would help explaining difference between W and N; > some readers may not understand what "ordered" and "unordered" > in this context mean. > > > Kuro > -- Regards, Salman Akram
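For anyone trying these queries against the stock parser: the w/10 notation in these mails is an application-side convenience, and the exact rewrite is specific to Salman's front end, but it maps roughly onto the stock surround operators (prefix notation shown; N is unordered within a distance, W is ordered; the Contents field name is illustrative):

user input: "stock" w/10 "sale"
stock surround form: {!surround} Contents: 10n(stock, sale) (or 10w(stock, sale) if order matters)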
More Robust Search Timeouts (to Kill Zombie Queries)?
With reference to this thread <http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E>, I wanted to know whether there was ever any response to it; if Chris Harris himself can comment on what he ended up doing, that would be great! -- Regards, Salman Akram
Re: More Robust Search Timeouts (to Kill Zombie Queries)?
Anyone? On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > With reference to this > thread<http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E>I > wanted to know if there was any response to that or if Chris Harris > himself can comment on what he ended up doing, that would be great! > > > -- > Regards, > > Salman Akram > > -- Regards, Salman Akram
Re: More Robust Search Timeouts (to Kill Zombie Queries)?
So you too never got any response... On Mon, Mar 31, 2014 at 6:57 PM, Luis Lebolo wrote: > Hi Salman, > > I was interested in something similar, take a look at the following thread: > > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201401.mbox/%3CCADSoL-i04aYrsOo2%3DGcaFqsQ3mViF%2Bhn24ArDtT%3D7kpALtVHzA%40mail.gmail.com%3E#archives > > I never followed through, however. > > -Luis > > > On Mon, Mar 31, 2014 at 6:24 AM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > Anyone? > > > > > > On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram < > > salman.ak...@northbaysolutions.net> wrote: > > > > > With reference to this thread< > > > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E > >I > > wanted to know if there was any response to that or if Chris Harris > > > himself can comment on what he ended up doing, that would be great! > > > > > > > > > -- > > > Regards, > > > > > > Salman Akram > > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
Re: More Robust Search Timeouts (to Kill Zombie Queries)?
Looking at this, sharding seems to be best and simple option to handle such queries. On Wed, Apr 2, 2014 at 1:26 AM, Mikhail Khludnev wrote: > Hello Salman, > Let's me drop few thoughts on > > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E > > There two aspects of this question: > 1. dealing with long running processing (thread divergence actions > http://docs.oracle.com/javase/specs/jls/se5.0/html/memory.html#65310) and > 2. an actual time checking. > "terminating" or "aborting" thread (2.) are just a way to tracking time > externally, and send interrupt() which the thread should react on, which > they don't do now, and we returning to the core issue (1.) > > Solr's time allowed is to the proper way to handle this things, the only > problem is that expect that the only core search is long running, but in > your case rewriting MultiTermQuery-s takes a huge time. > Let's consider this problem. First of all MultiTermQuery.rewrite() is the > nearly design issue, after heavy rewrite occurs, it's thrown away, after > search is done. I think the most straightforward way is to address this > issue by caching these expensive queries. Solr does it well > http://wiki.apache.org/solr/CommonQueryParameters#fq However, only for > http://en.wikipedia.org/wiki/Conjunctive_normal_form like queries, there > is > a workaround allows to cache disjunction legs see > http://blog.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html > If you still want to run expensively rewritten queries you need to > implement timeout check (similar to TimeLimitingCollector) for TermsEnum > returned from MultiTermQuery.getTermsEnum(), wrapping an actual TermsEnums > is the good way, to apply queries injecting time limiting wrapper > TermsEnum, you might consider override methods like > SolrQueryParserBase.newWildcardQuery(Term) or post process the query three > after parsing. > > > > On Mon, Mar 31, 2014 at 2:24 PM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > Anyone? > > > > > > On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram < > > salman.ak...@northbaysolutions.net> wrote: > > > > > With reference to this thread< > > > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E > >I > > wanted to know if there was any response to that or if Chris Harris > > > himself can comment on what he ended up doing, that would be great! > > > > > > > > > -- > > > Regards, > > > > > > Salman Akram > > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > > > > -- > Sincerely yours > Mikhail Khludnev > Principal Engineer, > Grid Dynamics > > <http://www.griddynamics.com> > > -- Regards, Salman Akram
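As a concrete starting point for the "actual time checking" Mikhail describes, Lucene already ships TimeLimitingCollector. A minimal Lucene 4.x sketch follows; searcher, query and the 2000-tick budget are assumptions, and note that this only bounds the collect phase, so an expensive MultiTermQuery rewrite can still run unbounded, which is exactly the gap discussed above:

import org.apache.lucene.search.TimeLimitingCollector;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.util.Counter;

Counter clock = Counter.newCounter(true);
TimeLimitingCollector.TimerThread timer = new TimeLimitingCollector.TimerThread(clock); // advances the clock
timer.start();

TopScoreDocCollector top = TopScoreDocCollector.create(100, true);
TimeLimitingCollector limited = new TimeLimitingCollector(top, clock, 2000); // budget in clock ticks (~ms)
try {
    searcher.search(query, limited);          // throws IOException as usual
} catch (TimeLimitingCollector.TimeExceededException e) {
    // time ran out: keep whatever 'top' collected so far as partial results
}
timer.stopTimer();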
Re: timeAllowed in not honoring
I had this issue too. timeAllowed only works for a certain phase of the query. I think that's the 'process' part. However, if the query is taking time in 'prepare' phase (e.g. I think for wildcards to get all the possible combinations before running the query) it won't have any impact on that. You can debug your query and confirm that. On Wed, Apr 30, 2014 at 10:43 AM, Aman Tandon wrote: > Shawn this is the first time i raised this problem. > > My heap size is 14GB and i am not using solr cloud currently, 40GB index > is replicated from master to two slaves. > > I read somewhere that it return the partial results which is computed by > the query in that specified amount of time which is defined by this > timeAllowed parameter, but it doesn't seems to happen. > > Here is the link : > http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed > > *The time allowed for a search to finish. This value only applies to the > search and not to requests in general. Time is in milliseconds. Values <= 0 > mean no time restriction. Partial results may be returned (if there are > any). * > > > > With Regards > Aman Tandon > > > On Wed, Apr 30, 2014 at 10:05 AM, Shawn Heisey wrote: > > > On 4/29/2014 10:05 PM, Aman Tandon wrote: > > > I am using solr 4.2 with the index size of 40GB, while querying to my > > index > > > there are some queries which is taking the significant amount of time > of > > > about 22 seconds *in the case of minmatch of 50%*. So i added a > parameter > > > timeAllowed = 2000 in my query but this doesn't seems to be work. > Please > > > help me out. > > > > I remember reading that timeAllowed has some limitations about which > > stages of a query it can limit, particularly in the distributed case. > > These limitations mean that it cannot always limit the total time for a > > query. I do not remember precisely what those limitations are, and I > > cannot find whatever it was that I was reading. > > > > When I looked through my local list archive to see if you had ever > > mentioned how much RAM you have and what the size of your Solr heap is, > > there didn't seem to be anything. There's not enough information for me > > to know whether that 40GB is the amount of index data on a single > > SolrCloud server, or whether it's the total size of the index across all > > servers. > > > > If we leave timeAllowed alone for a moment and treat this purely as a > > performance problem, usually my questions revolve around figuring out > > whether you have enough RAM. Here's where that conversation ends up: > > > > http://wiki.apache.org/solr/SolrPerformanceProblems > > > > I think I've probably mentioned this to you before on another thread. > > > > Thanks, > > Shawn > > > > > -- Regards, Salman Akram
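A quick way to do the confirmation suggested above, continuing the assumptions of the earlier SolrJ fragments (a SolrQuery q and an HttpSolrServer server):

q.set("debugQuery", true);   // or q.set("debug", "timing") on Solr 4.x for just the timing block
QueryResponse rsp = server.query(q);
System.out.println(rsp.getDebugMap().get("timing")); // prepare vs. process time per search component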
Re: Solr vs ElasticSearch
This is quite an old discussion. I wanted to check whether there are any newer comparisons since SOLR 4, especially with regard to performance/scalability/throughput? On Tue, Jul 26, 2011 at 7:33 PM, Peter wrote: > Have a look: > > > http://stackoverflow.com/questions/2271600/elasticsearch-sphinx-lucene-solr-xapian-which-fits-for-which-usage > > http://karussell.wordpress.com/2011/05/12/elasticsearch-vs-solr-lucene/ > > Regards, > Peter. > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Solr-vs-ElasticSearch-tp3009181p3200492.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Regards, Salman Akram
Re: Solr vs ElasticSearch
I did see that earlier. My main concern is search performance/scalability/throughput which unfortunately that article didn't address. Any benchmarks or comments about that? We are already using SOLR but there has been a push to check elasticsearch. All the benchmarks I have seen are at least few years old. On Fri, Aug 1, 2014 at 4:59 AM, Otis Gospodnetic wrote: > Not super fresh, but more recent than the 2 links you sent: > http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/ > > Otis > -- > Performance Monitoring * Log Analytics * Search Analytics > Solr & Elasticsearch Support * http://sematext.com/ > > > On Thu, Jul 31, 2014 at 10:33 PM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > This is quite an old discussion. Wanted to check any new comparisons > after > > SOLR 4 especially with regards to performance/scalability/throughput? > > > > > > On Tue, Jul 26, 2011 at 7:33 PM, Peter wrote: > > > > > Have a look: > > > > > > > > > > > > http://stackoverflow.com/questions/2271600/elasticsearch-sphinx-lucene-solr-xapian-which-fits-for-which-usage > > > > > > > http://karussell.wordpress.com/2011/05/12/elasticsearch-vs-solr-lucene/ > > > > > > Regards, > > > Peter. > > > > > > -- > > > View this message in context: > > > > > > http://lucene.472066.n3.nabble.com/Solr-vs-ElasticSearch-tp3009181p3200492.html > > > Sent from the Solr - User mailing list archive at Nabble.com. > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
Phrase Highlighter + Surround Query Parser
We are having an issue with the phrase highlighter and the Surround Query Parser, e.g. "first thing" w/100 "you must" brings correct results but also highlights individual words of the phrase - "first" and "thing" are highlighted where they appear separately as well. Any idea how this can be fixed? -- Regards, Salman Akram
Re: Phrase Highlighter + Surround Query Parser
Anyone? On Fri, Aug 1, 2014 at 12:31 PM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > We are having an issue in Phrase highlighter with Surround Query Parser > e.g. *"first thing" w/100 "you must" *brings correct results but also > highlights individual words of the phrase - first, thing are highlighted > where they come separately as well. > > Any idea how this can be fixed? > > > -- > Regards, > > Salman Akram > > -- Regards, Salman Akram
Re: Solr vs ElasticSearch
Thanks everyone!! This has been really helpful discussion and in short based on this we have taken the decision to stick to SOLR. On Mon, Aug 4, 2014 at 6:17 PM, Jack Krupansky wrote: > And neither project supports the Lucene faceting module, correct? > > And the ES web site says: "WARNING: Facets are deprecated and will be > removed in a future release. You are encouraged to migrate to aggregations > instead." > > That makes it more of an apples/oranges comparison. > > -- Jack Krupansky > > -Original Message- From: Toke Eskildsen > Sent: Monday, August 4, 2014 3:33 AM > To: solr-user@lucene.apache.org > > Subject: Re: Solr vs ElasticSearch > > On Mon, 2014-08-04 at 08:31 +0200, Harald Kirsch wrote: > >> As for performance, I would expect that it is very hard to find one of >> the two technologies to be generally ahead. Except for plain blunders >> that may be lurking in the code, I would think the inner loops, the >> stuff that really burns CPU cycles, all happens in Lucene, which is the >> same for both. >> > > Faceting/Aggregation is implemented independently and with different > designs for Solr and Elasticsearch. I would be surprised if memory > overhead and performance were about the same for this functionality. > > - Toke Eskildsen, State and University Library, Denmark > > -- Regards, Salman Akram
SOLR 4 not utilizing multi CPU cores
Hi, We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the performance went down for large phrase queries. On some analysis we have seen that 1.4.1 utilized multiple CPU cores for such queries but SOLR 4.6 is only utilizing a single CPU core. Any idea on what the reason could be? Note: We are not using SOLR sharding. -- Regards, Salman Akram
Re: SOLR 4 not utilizing multi CPU cores
I missed one important piece of info. Due to the large size we have indexed the data with Common Grams. All of the words in the slow search are in the common grams list, and when I debug it, the query is built properly with common grams. In debug, all of the time is shown under the process query time. Let me know what other info you need. Thanks On Thu, Dec 5, 2013 at 11:38 AM, Andrea Gazzarini wrote: > Hi, I did moreless the same but didn't get that behaviour...could you give > us more details > > Best, > Gazza > On 5 Dec 2013 06:54, "Salman Akram" > wrote: > > > Hi, > > > > We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the performance > > went down for large phrase queries. On some analysis we have seen that > > 1.4.1 utilized multiple cpu cores for such queries but SOLR 4.6 is only > > utilizing single cpu core. Any idea on what could be the reason? > > > > Note: We are not using SOLR Sharding. > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
Re: SOLR 4 not utilizing multi CPU cores
More info on Cpu consumption: We have a server with 32 physical cores. Same search when executed on SOLR 4.6 takes quite long and throughout only uses 3% cpu (1 core). Same search when executed on SOLR 1.4.1 takes much less time and on average uses around 40-50% cpu. On Thu, Dec 5, 2013 at 2:05 PM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > I missed one imp piece of info. Due to large size we have indexed the date > with Common Grams. All of the words in slow search are in common grams and > when I debug it, they query is made properly with common grams. > > In debug all of the time is shown in process query time. > > Let me know what other info you need? Thanks > > > On Thu, Dec 5, 2013 at 11:38 AM, Andrea Gazzarini > wrote: > >> Hi, I did moreless the same but didn't get that behaviour...could you give >> us more details >> >> Best, >> Gazza >> On 5 Dec 2013 06:54, "Salman Akram" >> wrote: >> >> > Hi, >> > >> > We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the >> performance >> > went down for large phrase queries. On some analysis we have seen that >> > 1.4.1 utilized multiple cpu cores for such queries but SOLR 4.6 is only >> > utilizing single cpu core. Any idea on what could be the reason? >> > >> > Note: We are not using SOLR Sharding. >> > >> > -- >> > Regards, >> > >> > Salman Akram >> > >> > > > > -- > Regards, > > Salman Akram > > -- Regards, Salman Akram
Re: SOLR 4 not utilizing multi CPU cores
So I think I found one issue that somewhat explains the time difference but not sure why this is happening. We are using Surround Query Parser. Below is a two words query, both of them are in Common Grams list. Query = "only be" Here is what debug shows. I have highlighted the red part which is different in both versions i.e. SOLR 4.6 is making it a multiphrasequery. I am going to look into Surround Query Parser but not sure if it's an issue with it or something else. *SOLR 4.6 (takes 20 secs)* {!surround} {!surround} MultiPhraseQuery(Contents:"(only only_be) be") Contents:"(only only_be) be" *SOLR 1.4.1 (takes 1 sec)* {!surround} {!surround} Contents:only_be Contents:only_be P.S: The other issue still remains there that why is it not utilizing multiple cpu cores. On Thu, Dec 5, 2013 at 2:11 PM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > More info on Cpu consumption: We have a server with 32 physical cores. > > Same search when executed on SOLR 4.6 takes quite long and throughout only > uses 3% cpu (1 core). > > Same search when executed on SOLR 1.4.1 takes much less time and on > average uses around 40-50% cpu. > > > On Thu, Dec 5, 2013 at 2:05 PM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > >> I missed one imp piece of info. Due to large size we have indexed the >> date with Common Grams. All of the words in slow search are in common grams >> and when I debug it, they query is made properly with common grams. >> >> In debug all of the time is shown in process query time. >> >> Let me know what other info you need? Thanks >> >> >> On Thu, Dec 5, 2013 at 11:38 AM, Andrea Gazzarini >> wrote: >> >>> Hi, I did moreless the same but didn't get that behaviour...could you >>> give >>> us more details >>> >>> Best, >>> Gazza >>> On 5 Dec 2013 06:54, "Salman Akram" >>> wrote: >>> >>> > Hi, >>> > >>> > We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the >>> performance >>> > went down for large phrase queries. On some analysis we have seen that >>> > 1.4.1 utilized multiple cpu cores for such queries but SOLR 4.6 is only >>> > utilizing single cpu core. Any idea on what could be the reason? >>> > >>> > Note: We are not using SOLR Sharding. >>> > >>> > -- >>> > Regards, >>> > >>> > Salman Akram >>> > >>> >> >> >> >> -- >> Regards, >> >> Salman Akram >> >> > > > -- > Regards, > > Salman Akram > > -- Regards, Salman Akram
Re: SOLR 4 not utilizing multi CPU cores
I am not using Shards. I gave more info in a previous mail but I know its a single index and what you are saying makes sense but from what I could see in 1.4.1 that it was better 'utilizing' the hardware resources available. I mean if the CPU is free then why not do multi threading (if possible of course). Not sure if that was a bug with 1.4.1 or was just a better resource utilization. However, the main issue seems to be what I referred in my last mail...Thanks! On Thu, Dec 5, 2013 at 2:24 PM, Daniel Collins wrote: > Not sure if you are really stating the problem here. > > If you don't use Solr sharding, (I also assume you aren't using SolrCloud), > and I'm guessing you are a single core (but can you confirm). > > As I understand Solr's logic, for a single query on a single core, that > will only use 1 thread (ignoring updates, background merges, etc). A > Lucene index (with multiple segments) has each segment read sequentially, > so a search must scan all the segments and that inherently is a > single-threaded activity. > > The fact that the search uses less CPU is not really the issue (it might > actually be a GOOD thing, it could mean the code is more efficient!), so I > would consider that a red herring. The real issue is that the search takes > longer in elapsed time. > > The usual questions apply: > > 1) how did you upgrade, did you port your config, or start from a fresh > Solr 4 config and add your custom stuff to it. > 2) Is your new index comparable to your old one, does it have more > segments, how did you fill it (bulk import or upgrade of old 1.4.1 index), > and what is your merge policy for the index? > > Upgrades from such an old version of Solr have been asked before on the > list, the consensus is that you probably need to re-tune your configuration > (starting with a Solr 4 basic config) since Solr 4 is so different under > the hood from 1.x > > > On 5 December 2013 09:11, Salman Akram > wrote: > > > More info on Cpu consumption: We have a server with 32 physical cores. > > > > Same search when executed on SOLR 4.6 takes quite long and throughout > only > > uses 3% cpu (1 core). > > > > Same search when executed on SOLR 1.4.1 takes much less time and on > average > > uses around 40-50% cpu. > > > > > > On Thu, Dec 5, 2013 at 2:05 PM, Salman Akram < > > salman.ak...@northbaysolutions.net> wrote: > > > > > I missed one imp piece of info. Due to large size we have indexed the > > date > > > with Common Grams. All of the words in slow search are in common grams > > and > > > when I debug it, they query is made properly with common grams. > > > > > > In debug all of the time is shown in process query time. > > > > > > Let me know what other info you need? Thanks > > > > > > > > > On Thu, Dec 5, 2013 at 11:38 AM, Andrea Gazzarini < > agazzar...@apache.org > > >wrote: > > > > > >> Hi, I did moreless the same but didn't get that behaviour...could you > > give > > >> us more details > > >> > > >> Best, > > >> Gazza > > >> On 5 Dec 2013 06:54, "Salman Akram" < > salman.ak...@northbaysolutions.net > > > > > >> wrote: > > >> > > >> > Hi, > > >> > > > >> > We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the > > >> performance > > >> > went down for large phrase queries. On some analysis we have seen > that > > >> > 1.4.1 utilized multiple cpu cores for such queries but SOLR 4.6 is > > only > > >> > utilizing single cpu core. Any idea on what could be the reason? > > >> > > > >> > Note: We are not using SOLR Sharding. 
> > >> > > > >> > -- > > >> > Regards, > > >> > > > >> > Salman Akram > > >> > > > >> > > > > > > > > > > > > -- > > > Regards, > > > > > > Salman Akram > > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
Re: SOLR 4 not utilizing multi CPU cores
Here is the response to your 2 questions: 1- Started from fresh Solr 4 config and modified custom stuff. 2- Index is same and optimized. However, as I said in a previous mail the issue seems to be Surround Query Parser which is parsing the query in a different format. On Thu, Dec 5, 2013 at 2:24 PM, Daniel Collins wrote: > Not sure if you are really stating the problem here. > > If you don't use Solr sharding, (I also assume you aren't using SolrCloud), > and I'm guessing you are a single core (but can you confirm). > > As I understand Solr's logic, for a single query on a single core, that > will only use 1 thread (ignoring updates, background merges, etc). A > Lucene index (with multiple segments) has each segment read sequentially, > so a search must scan all the segments and that inherently is a > single-threaded activity. > > The fact that the search uses less CPU is not really the issue (it might > actually be a GOOD thing, it could mean the code is more efficient!), so I > would consider that a red herring. The real issue is that the search takes > longer in elapsed time. > > The usual questions apply: > > 1) how did you upgrade, did you port your config, or start from a fresh > Solr 4 config and add your custom stuff to it. > 2) Is your new index comparable to your old one, does it have more > segments, how did you fill it (bulk import or upgrade of old 1.4.1 index), > and what is your merge policy for the index? > > Upgrades from such an old version of Solr have been asked before on the > list, the consensus is that you probably need to re-tune your configuration > (starting with a Solr 4 basic config) since Solr 4 is so different under > the hood from 1.x > > > On 5 December 2013 09:11, Salman Akram > wrote: > > > More info on Cpu consumption: We have a server with 32 physical cores. > > > > Same search when executed on SOLR 4.6 takes quite long and throughout > only > > uses 3% cpu (1 core). > > > > Same search when executed on SOLR 1.4.1 takes much less time and on > average > > uses around 40-50% cpu. > > > > > > On Thu, Dec 5, 2013 at 2:05 PM, Salman Akram < > > salman.ak...@northbaysolutions.net> wrote: > > > > > I missed one imp piece of info. Due to large size we have indexed the > > date > > > with Common Grams. All of the words in slow search are in common grams > > and > > > when I debug it, they query is made properly with common grams. > > > > > > In debug all of the time is shown in process query time. > > > > > > Let me know what other info you need? Thanks > > > > > > > > > On Thu, Dec 5, 2013 at 11:38 AM, Andrea Gazzarini < > agazzar...@apache.org > > >wrote: > > > > > >> Hi, I did moreless the same but didn't get that behaviour...could you > > give > > >> us more details > > >> > > >> Best, > > >> Gazza > > >> On 5 Dec 2013 06:54, "Salman Akram" < > salman.ak...@northbaysolutions.net > > > > > >> wrote: > > >> > > >> > Hi, > > >> > > > >> > We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the > > >> performance > > >> > went down for large phrase queries. On some analysis we have seen > that > > >> > 1.4.1 utilized multiple cpu cores for such queries but SOLR 4.6 is > > only > > >> > utilizing single cpu core. Any idea on what could be the reason? > > >> > > > >> > Note: We are not using SOLR Sharding. > > >> > > > >> > -- > > >> > Regards, > > >> > > > >> > Salman Akram > > >> > > > >> > > > > > > > > > > > > -- > > > Regards, > > > > > > Salman Akram > > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
Re: SOLR 4 not utilizing multi CPU cores
After debugging it seems that Query Parser code in Surround Parser is giving an issue in queries with Common Words. Has anyone tried Surround and Common Grams with SOLR 4? On Thu, Dec 5, 2013 at 7:00 PM, Daniel Collins wrote: > Fair enough, I'm not famiilar with Surround parser, but it does look like > some logic has changed there. > > > On 5 December 2013 12:38, Salman Akram > wrote: > > > Here is the response to your 2 questions: > > > > 1- Started from fresh Solr 4 config and modified custom stuff. > > > > 2- Index is same and optimized. > > > > However, as I said in a previous mail the issue seems to be Surround > Query > > Parser which is parsing the query in a different format. > > > > > > On Thu, Dec 5, 2013 at 2:24 PM, Daniel Collins > >wrote: > > > > > Not sure if you are really stating the problem here. > > > > > > If you don't use Solr sharding, (I also assume you aren't using > > SolrCloud), > > > and I'm guessing you are a single core (but can you confirm). > > > > > > As I understand Solr's logic, for a single query on a single core, that > > > will only use 1 thread (ignoring updates, background merges, etc). A > > > Lucene index (with multiple segments) has each segment read > sequentially, > > > so a search must scan all the segments and that inherently is a > > > single-threaded activity. > > > > > > The fact that the search uses less CPU is not really the issue (it > might > > > actually be a GOOD thing, it could mean the code is more efficient!), > so > > I > > > would consider that a red herring. The real issue is that the search > > takes > > > longer in elapsed time. > > > > > > The usual questions apply: > > > > > > 1) how did you upgrade, did you port your config, or start from a > fresh > > > Solr 4 config and add your custom stuff to it. > > > 2) Is your new index comparable to your old one, does it have more > > > segments, how did you fill it (bulk import or upgrade of old 1.4.1 > > index), > > > and what is your merge policy for the index? > > > > > > Upgrades from such an old version of Solr have been asked before on the > > > list, the consensus is that you probably need to re-tune your > > configuration > > > (starting with a Solr 4 basic config) since Solr 4 is so different > under > > > the hood from 1.x > > > > > > > > > On 5 December 2013 09:11, Salman Akram > > > wrote: > > > > > > > More info on Cpu consumption: We have a server with 32 physical > cores. > > > > > > > > Same search when executed on SOLR 4.6 takes quite long and throughout > > > only > > > > uses 3% cpu (1 core). > > > > > > > > Same search when executed on SOLR 1.4.1 takes much less time and on > > > average > > > > uses around 40-50% cpu. > > > > > > > > > > > > On Thu, Dec 5, 2013 at 2:05 PM, Salman Akram < > > > > salman.ak...@northbaysolutions.net> wrote: > > > > > > > > > I missed one imp piece of info. Due to large size we have indexed > the > > > > date > > > > > with Common Grams. All of the words in slow search are in common > > grams > > > > and > > > > > when I debug it, they query is made properly with common grams. > > > > > > > > > > In debug all of the time is shown in process query time. > > > > > > > > > > Let me know what other info you need? 
Thanks > > > > > > > > > > > > > > > On Thu, Dec 5, 2013 at 11:38 AM, Andrea Gazzarini < > > > agazzar...@apache.org > > > > >wrote: > > > > > > > > > >> Hi, I did moreless the same but didn't get that behaviour...could > > you > > > > give > > > > >> us more details > > > > >> > > > > >> Best, > > > > >> Gazza > > > > >> On 5 Dec 2013 06:54, "Salman Akram" < > > > salman.ak...@northbaysolutions.net > > > > > > > > > >> wrote: > > > > >> > > > > >> > Hi, > > > > >> > > > > > >> > We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the > > > > >> performance > > > > >> > went down for large phrase queries. On some analysis we have > seen > > > that > > > > >> > 1.4.1 utilized multiple cpu cores for such queries but SOLR 4.6 > is > > > > only > > > > >> > utilizing single cpu core. Any idea on what could be the reason? > > > > >> > > > > > >> > Note: We are not using SOLR Sharding. > > > > >> > > > > > >> > -- > > > > >> > Regards, > > > > >> > > > > > >> > Salman Akram > > > > >> > > > > > >> > > > > > > > > > > > > > > > > > > > > -- > > > > > Regards, > > > > > > > > > > Salman Akram > > > > > > > > > > > > > > > > > > > > > > -- > > > > Regards, > > > > > > > > Salman Akram > > > > > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
SOLR 4 - Query Issue in Common Grams with Surround Query Parser
All, I posted this sub-issue with another issue a few days back but maybe it was not obvious, so I am posting it on a separate thread. We recently migrated to SOLR 4.6. We use Common Grams but queries with words in the CG list have slowed down. On debugging we found that for CG words the parser is adding the individual tokens of those words to the query too, which ends up slowing it down. Below is an example:

Query = "only be"

Here is what debug shows. The part that differs between the two versions is that SOLR 4.6 is making it a MultiPhraseQuery and adding individual tokens too. Can someone help?

SOLR 4.6 (takes 20 secs)
{!surround}
{!surround}
MultiPhraseQuery(Contents:"(only only_be) be")
Contents:"(only only_be) be"

SOLR 1.4.1 (takes 1 sec)
{!surround}
{!surround}
Contents:only_be
Contents:only_be

-- Regards, Salman Akram
Re: SOLR 4 - Query Issue in Common Grams with Surround Query Parser
Yup on debugging I found that its coming in Analyzer. We are using Standard Analyzer. It seems to be a SOLR 4 issue with Common Grams. Not sure if its a bug or I am missing some config. On Mon, Dec 9, 2013 at 2:03 PM, Ahmet Arslan wrote: > Hi Salman, > I am confused because with surround no analysis is applied at query time. > I suspect that surround query parser is not kicking in. You should see > SrndQuery or something like at parser query section. > > > > On Monday, December 9, 2013 6:24 AM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > All, > > I posted this sub-issue with another issue few days back but maybe it was > not obvious so posting it on a separate thread. > > We recently migrated to SOLR 4.6. We use Common Grams but queries with > words in the CG list have slowed down. On debugging we found that for CG > words the parser is adding individual tokens of those words in the query > too which ends up slowing it. Below is an example: > > Query = "only be" > > Here is what debug shows. I have highlighted the red part which is > different in both versions i.e. SOLR 4.6 is making it a multiphrasequery > and adding individual tokens too. Can someone help? > > SOLR 4.6 (takes 20 secs) > {!surround} > {!surround} > MultiPhraseQuery(Contents:"(only only_be) > be") > Contents:"(only only_be) be" > > SOLR 1.4.1 (takes 1 sec) > {!surround} > {!surround} > Contents:only_be > Contents:only_be-- > > > Regards, > > Salman Akram > -- Regards, Salman Akram
Re: SOLR 4 - Query Issue in Common Grams with Surround Query Parser
We used that syntax in 1.4.1 when Surround was not part of SOLR and has to register it. Didn't know that it is now part of SOLR. Any ways this is a red herring since I have totally removed Surround and the issue remains there. Below is the debug info when I give a simple phrase query having common words with default Query Parser. What I don't understand is that why is it including single tokens as well? I have also included the relevant config part below. "rawquerystring": "Contents:\"only be\"", "querystring": "Contents:\"only be\"", "parsedquery": "MultiPhraseQuery(Contents:\"(only only_be) be\")", "parsedquery_toString": "Contents:\"(only only_be) be\"", "QParser": "LuceneQParser", = On Mon, Dec 9, 2013 at 7:46 PM, Erik Hatcher wrote: > But again, as Ahmet mentioned… it doesn't look like the surround query > parser is actually being used. The debug output also mentioned the query > parser used, but that part wasn't provided below. One thing to note here, > the surround query parser is not available in 1.4.1. It also looks like > you're surrounding your query with angle brackets, as it says query string > is {!surround}, which is not correct syntax. And one > of the most important things to note here is that the surround query parser > does NOT use the analysis chain of the field, see < > http://wiki.apache.org/solr/SurroundQueryParser#Limitations>. In short, > you're going to have to do some work to get common grams factored into a > surround query (such as maybe calling to the analysis request hander to > "parse" the query before sending it to the surround query parser). > > Erik > > > On Dec 9, 2013, at 9:36 AM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > Yup on debugging I found that its coming in Analyzer. We are using > Standard > > Analyzer. It seems to be a SOLR 4 issue with Common Grams. Not sure if > its > > a bug or I am missing some config. > > > > > > On Mon, Dec 9, 2013 at 2:03 PM, Ahmet Arslan wrote: > > > >> Hi Salman, > >> I am confused because with surround no analysis is applied at query > time. > >> I suspect that surround query parser is not kicking in. You should see > >> SrndQuery or something like at parser query section. > >> > >> > >> > >> On Monday, December 9, 2013 6:24 AM, Salman Akram < > >> salman.ak...@northbaysolutions.net> wrote: > >> > >> All, > >> > >> I posted this sub-issue with another issue few days back but maybe it > was > >> not obvious so posting it on a separate thread. > >> > >> We recently migrated to SOLR 4.6. We use Common Grams but queries with > >> words in the CG list have slowed down. On debugging we found that for CG > >> words the parser is adding individual tokens of those words in the query > >> too which ends up slowing it. Below is an example: > >> > >> Query = "only be" > >> > >> Here is what debug shows. I have highlighted the red part which is > >> different in both versions i.e. SOLR 4.6 is making it a multiphrasequery > >> and adding individual tokens too. Can someone help? > >> > >> SOLR 4.6 (takes 20 secs) > >> {!surround} > >> {!surround} > >> MultiPhraseQuery(Contents:"(only only_be) > >> be") > >> Contents:"(only only_be) be" > >> > >> SOLR 1.4.1 (takes 1 sec) > >> {!surround} > >> {!surround} > >> Contents:only_be > >> Contents:only_be-- > >> > >> > >> Regards, > >> > >> Salman Akram > >> > > > > > > > > -- > > Regards, > > > > Salman Akram > > -- Regards, Salman Akram
Re: SOLR 4 - Query Issue in Common Grams with Surround Query Parser
Thanks!! Using CommonGramsQueryFilter resolved the issue. This was not there in 1.4.1 and also for some reason was not there in SOLR 4 Release Notes that we studied before upgrading. On Tue, Dec 10, 2013 at 9:55 AM, Ahmet Arslan wrote: > Hi Salman, > > I never used commons gram filer but I remember there are two classes in > this family. CommonGramsFilter and CommonGramsQueryFilter. It seems that > CommonsGramsQueryFilter is what you are after. > > > http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsQueryFilter.html > > > http://khaidoan.wikidot.com/solr-common-gram-filter > > > > > > On Tuesday, December 10, 2013 6:43 AM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > We used that syntax in 1.4.1 when Surround was not part of SOLR and has to > register it. Didn't know that it is now part of SOLR. Any ways this is a > red herring since I have totally removed Surround and the issue remains > there. > > Below is the debug info when I give a simple phrase query having common > words with default Query Parser. What I don't understand is that why is it > including single tokens as well? I have also included the relevant config > part below. > > "rawquerystring": "Contents:\"only be\"", > "querystring": "Contents:\"only be\"", > "parsedquery": "MultiPhraseQuery(Contents:\"(only only_be) be\")", > "parsedquery_toString": "Contents:\"(only only_be) be\"", > > "QParser": "LuceneQParser", > > = > > > > > > > ignoreCase="true"/> > > > > > > On Mon, Dec 9, 2013 at 7:46 PM, Erik Hatcher > wrote: > > > But again, as Ahmet mentioned… it doesn't look like the surround query > > parser is actually being used. The debug output also mentioned the > query > > parser used, but that part wasn't provided below. One thing to note > here, > > the surround query parser is not available in 1.4.1. It also looks like > > you're surrounding your query with angle brackets, as it says query > string > > is {!surround}, which is not correct syntax. And one > > of the most important things to note here is that the surround query > parser > > does NOT use the analysis chain of the field, see < > > http://wiki.apache.org/solr/SurroundQueryParser#Limitations>. In short, > > you're going to have to do some work to get common grams factored into a > > surround query (such as maybe calling to the analysis request hander to > > "parse" the query before sending it to the surround query parser). > > > > Erik > > > > > > On Dec 9, 2013, at 9:36 AM, Salman Akram < > > salman.ak...@northbaysolutions.net> wrote: > > > > > Yup on debugging I found that its coming in Analyzer. We are using > > Standard > > > Analyzer. It seems to be a SOLR 4 issue with Common Grams. Not sure if > > its > > > a bug or I am missing some config. > > > > > > > > > On Mon, Dec 9, 2013 at 2:03 PM, Ahmet Arslan > wrote: > > > > > >> Hi Salman, > > >> I am confused because with surround no analysis is applied at query > > time. > > >> I suspect that surround query parser is not kicking in. You should see > > >> SrndQuery or something like at parser query section. > > >> > > >> > > >> > > >> On Monday, December 9, 2013 6:24 AM, Salman Akram < > > >> salman.ak...@northbaysolutions.net> wrote: > > >> > > >> All, > > >> > > >> I posted this sub-issue with another issue few days back but maybe it > > was > > >> not obvious so posting it on a separate thread. > > >> > > >> We recently migrated to SOLR 4.6. We use Common Grams but queries with > > >> words in the CG list have slowed down. 
On debugging we found that for > CG > > >> words the parser is adding individual tokens of those words in the > query > > >> too which ends up slowing it. Below is an example: > > >> > > >> Query = "only be" > > >> > > >> Here is what debug shows. I have highlighted the red part which is > > >> different in both versions i.e. SOLR 4.6 is making it a > multiphrasequery > > >> and adding individual tokens too. Can someone help? > > >> > > >> SOLR 4.6 (takes 20 secs) > > >> {!surround} > > >> {!surround} > > >> MultiPhraseQuery(Contents:"(only only_be) > > >> be") > > >> Contents:"(only only_be) be" > > >> > > >> SOLR 1.4.1 (takes 1 sec) > > >> {!surround} > > >> {!surround} > > >> Contents:only_be > > >> Contents:only_be-- > > >> > > >> > > >> Regards, > > > >> > > >> Salman Akram > > >> > > > > > > > > > > > > -- > > > Regards, > > > > > > Salman Akram > > > > > > > -- > Regards, > > Salman Akram > -- Regards, Salman Akram
Optimizing index on Slave
All, I know that normally the index should be optimized on the master and then replicated to the slaves, but we have an issue with network bandwidth. We optimize indexes weekly (total size is around 1.5TB). We have a few slaves set up on the local network, so replicating the whole indexes there is not a big issue. However, we also have one slave in another city (on a backup network) which of course gets replicated over the internet, which is quite slow and expensive. We want to avoid copying the complete indexes every week after optimization and were wondering if it's possible to optimize independently on the slave so that there is no delta between master and slave. We tried to do it but the slave still replicated from the master. -- Regards, Salman Akram
Re: Optimizing index on Slave
We do. We have a lot of updates/deletes every day and a weekly optimization definitely gives a considerable improvement, so we don't see a downside to it except the complete replication part, which is not an issue on the local network.
Re: SOLR 4 - Query Issue in Common Grams with Surround Query Parser
Apologies for the late response as this mail was lost somewhere in filters. The issue was that CommonGramsQueryFilterFactory should be used for searching and CommonGramsFilterFactory for indexing. We were using CommonGramsFilterFactory for both, which is why single tokens were not being dropped for common grams in a phrase query. I will go through the link you sent and see if anything needs further explanation. Thanks!
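For anyone hitting the same thing, this is roughly what the working setup looks like in schema.xml (just a sketch; the field type name and words file are placeholders for whatever your schema uses):

<fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- index time: emit the single words plus the word_word grams -->
    <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- query time: keep only the grams for common-word pairs so phrase queries drop the single tokens -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commongrams.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

With this in place a phrase like "only be" parses to Contents:only_be instead of the MultiPhraseQuery shown earlier in the thread.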
Re: Optimizing index on Slave
Unfortunately we can't do sharding right now. If we optimize on the master and slave separately, the file names and sizes are the same; I think it's just the version number that is different. Maybe if there were a way to copy the master's version to the slave, that would resolve this issue?
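In the meantime the slave's pulls can at least be controlled by hitting the replication handler directly. This doesn't make independently optimized copies line up (the index versions still differ), but it avoids an unwanted full pull at a bad time. A sketch, with host and core names as placeholders:

http://remote-slave:8983/solr/core/replication?command=disablepoll   (stop polling before the weekly optimize)
http://remote-slave:8983/solr/core/replication?command=details      (see which index version/generation the slave has)
http://remote-slave:8983/solr/core/replication?command=enablepoll   (let it poll again when the window is convenient)
http://remote-slave:8983/solr/core/replication?command=fetchindex   (or force a single pull explicitly)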
Partial Counts in SOLR
All, Is it possible to get partial counts in SOLR? The idea is to get the count, but if it's above a certain limit then just return that limit. Reason: In an index with millions of documents I don't want to know that a certain query matched 1 million docs (of course it will take time to calculate that). Why not just stop looking for more results, let's say after it finds 100 docs? Possible?? e.g. Something similar to what we can do in MySQL:

SELECT COUNT(*) FROM (SELECT * FROM table WHERE 1 = 1 LIMIT 100) Alias

-- Regards, Salman Akram
Re: Partial Counts in SOLR
I know about numFound. That's where the issue is. On a complex query that takes minutes I think a major chunk of that time is spent calculating "numFound", whereas I don't need it. Let's say I just need the first 100 docs and then want SOLR to STOP looking further to populate "numFound". Let's say I just don't want SOLR to return numFound at all. Is that possible? Also, would it really help performance? In MySQL you can simply stop it from looking beyond a certain count for the "total count" and that gives a considerable improvement for complex queries, but that's not an inverted index so I am not sure how it works in SOLR... On Fri, Mar 7, 2014 at 3:17 PM, Gora Mohanty wrote: > On 7 March 2014 15:18, Salman Akram > wrote: > > All, > > > > Is it possible to get partial counts in SOLR? The idea is to get the > count > > but if its above a certain limit than just return that limit. > > > > Reason: In an index with millions of documents I don't want to know that > a > > certain query matched 1 million docs (of course it will take time to > > calculate that). Why don't just stop looking for more results lets say > > after it finds 100 docs? Possible?? > > > > e.g. Something similar that we can do in MySQL: > > > > SELECT COUNT(*) FROM ( (SELECT * FROM table where 1 = 1) LIMIT 100) Alias > > The response to the /select Solr URL has a "numFound" attribute that > is the number > of matches. > > Regards, > Gora > -- Regards, Salman Akram
Master - Master / Upgrading a slave to master
We have a redundant data center in case the primary goes down. Currently we have 1 master and multiple slaves in the primary data center. This master also replicates to a slave in the secondary data center, so if the primary goes down at least the read-only part works. However, now we want writes to work on the secondary data center too when the primary goes down.

- Is it possible in SOLR to have Master - Master?
- If not, then what's the best strategy to upgrade a slave to master?
- Naturally there would be some latency due to the data centers being in different geographical locations, so what are the normal data issues and best practices in case the primary goes down? We would also like to shift back to the primary as soon as it's back.

Thanks! -- Regards, Salman Akram
Re: Master - Master / Upgrading a slave to master
You mean 3 'data centers' or 'nodes'? I am thinking if we have 2 nodes on primary and 1 in secondary and we normally keep the secondary down would that work? Basically secondary network is just for redundancy and won't be as fast so normally we won't like to shift traffic there. So can we just have nodes for redundancy and NOT load balancing i.e. it has 3 nodes but update is only on one of them? Similarly for the slave replicas can we limit the searches to a certain slave or it will be auto balanced? Also apart from SOLR cloud is it possible to have multiple master in SOLR or a good guide to upgrade a slave to master? Thanks On Tue, Sep 9, 2014 at 5:40 PM, Shawn Heisey wrote: > On 9/8/2014 9:54 PM, Salman Akram wrote: > > We have a redundant data center in case the primary goes down. Currently > we > > have 1 master and multiple slaves on primary data center. This master > also > > replicates to a slave in secondary data center. So if the primary goes > down > > at least the read only part works. However, now we want writes to work on > > secondary data center too when primary goes down. > > > > - Is it possible in SOLR to have Master - Master? > > - If not then what's the best strategy to upgrade a slave to master? > > - Naturally there would be some latency due to data centers being in > > different geographical locations so what are the normal data issues and > > best practices in case primary goes down? We would also like to shift > back > > to primary as soon as its back. > > SolrCloud would work, but only if you have *three* datacenters. Two of > them would need to remain fully operational. SolrCloud is a true > cluster -- there is no master. Each of the shards in a collection has > one or more replicas. One of the replicas gets elected to be leader, > but the leader designation can change. > > The reason that you need three is because of zookeeper, which is the > software that actually maintains the cluster and handles leader > elections. A majority of zookeeper nodes (more than half of them) must > be operational for zookeeper to maintain quorum. That means that the > minimum number of zookeepers is three, and in a three-node system, one > can go down without disrupting operation. > > One thing that SolrCloud doesn't yet have is rack/datacenter awareness. > Requests get load balanced across the entire cluster, regardless of > where they are located. It's something that will eventually come, but I > don't have any kind of estimate for when. > > Thanks, > Shawn > > -- Regards, Salman Akram
Re: Master - Master / Upgrading a slave to master
So realistically speaking you cannot have SolrCloud work for 2 data centers as a redundant solution because no matter how many nodes you add you still would need at least 1 node in the 2nd center working too. So that just leaves with non-SolrCloud solutions. "1) Change the replication config to redefine the master and reload the core or restart Solr." That of course is a simple way but the real issue is about the possible issues and some good practices e.g. normally the scenario would be that primary data center goes down for few hours and till then we upgrade one of the slaves in secondary to a master. Now - IF there is no lag there won't be any issue in secondary at least but what if there is lag and one of the files is not completely replicated? That file would be discarded or there is a possibility that whole index is not usable? - Once the primary comes back how would we now copy the delta from secondary? Make it a slave of secondary first, replicate the delta and then set it as a master again? In other words is there a good guide out there for this with possible issues and solutions? Definitely before SolrCloud people would be doing this and even now SolrCloud doesn't seem practical in quite a few situations. Thanks again!! On Tue, Sep 9, 2014 at 8:02 PM, Shawn Heisey wrote: > On 9/9/2014 8:46 AM, Salman Akram wrote: > > You mean 3 'data centers' or 'nodes'? I am thinking if we have 2 nodes on > > primary and 1 in secondary and we normally keep the secondary down would > > that work? Basically secondary network is just for redundancy and won't > be > > as fast so normally we won't like to shift traffic there. > > > > So can we just have nodes for redundancy and NOT load balancing i.e. it > has > > 3 nodes but update is only on one of them? Similarly for the slave > replicas > > can we limit the searches to a certain slave or it will be auto balanced? > > > > Also apart from SOLR cloud is it possible to have multiple master in SOLR > > or a good guide to upgrade a slave to master? > > You must have three zookeeper nodes for a redundant setup. If you only > have two data centers, then you must put at least two of those nodes in > one data center. If the data center with two zookeeper nodes goes down, > zookeeper cannot function, which means SolrCloud will not work > correctly. There is no way to maintain SolrCloud redundancy with only > two data centers. You might think to add a fourth ZK node and split > them between the data centers ... except that in that situation, at > least three nodes must be functional. Two out of four nodes is not enough. > > A minimal fault-tolerant SolrCloud install is three physical machines. > Two of them run ZK and Solr, one of them runs ZK only. > > If you don't use SolrCloud, then you have two choices to switch masters: > > 1) Change the replication config to redefine the master and reload the > core or restart Solr. > 2) Write scripts that manually use the replication HTTP API to do all > your replication, rather than let Solr handle it automatically. You can > choose the master for every replication with HTTP calls. > > https://wiki.apache.org/solr/SolrReplication#HTTP_API > > Thanks, > Shawn > > -- Regards, Salman Akram
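For reference, "redefine the master" in option 1 comes down to editing the replication handler in solrconfig.xml on the node being promoted (and on any remaining slaves) and then reloading the core. A rough sketch; the host names are placeholders:

<!-- on the slave being promoted: start acting as master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
  </lst>
</requestHandler>

<!-- on the other slaves: point at the new master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://new-master:8983/solr/core/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>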
Re: Master - Master / Upgrading a slave to master
Anyone? On Tue, Sep 9, 2014 at 8:20 PM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > So realistically speaking you cannot have SolrCloud work for 2 data > centers as a redundant solution because no matter how many nodes you add > you still would need at least 1 node in the 2nd center working too. > > So that just leaves with non-SolrCloud solutions. > > "1) Change the replication config to redefine the master and reload the core > or restart Solr." > > That of course is a simple way but the real issue is about the possible > issues and some good practices e.g. normally the scenario would be that > primary data center goes down for few hours and till then we upgrade one of > the slaves in secondary to a master. Now > > - IF there is no lag there won't be any issue in secondary at least but > what if there is lag and one of the files is not completely replicated? > That file would be discarded or there is a possibility that whole index is > not usable? > > - Once the primary comes back how would we now copy the delta from > secondary? Make it a slave of secondary first, replicate the delta and then > set it as a master again? > > In other words is there a good guide out there for this with possible > issues and solutions? Definitely before SolrCloud people would be doing > this and even now SolrCloud doesn't seem practical in quite a few > situations. > > Thanks again!! > > On Tue, Sep 9, 2014 at 8:02 PM, Shawn Heisey wrote: > >> On 9/9/2014 8:46 AM, Salman Akram wrote: >> > You mean 3 'data centers' or 'nodes'? I am thinking if we have 2 nodes >> on >> > primary and 1 in secondary and we normally keep the secondary down would >> > that work? Basically secondary network is just for redundancy and won't >> be >> > as fast so normally we won't like to shift traffic there. >> > >> > So can we just have nodes for redundancy and NOT load balancing i.e. it >> has >> > 3 nodes but update is only on one of them? Similarly for the slave >> replicas >> > can we limit the searches to a certain slave or it will be auto >> balanced? >> > >> > Also apart from SOLR cloud is it possible to have multiple master in >> SOLR >> > or a good guide to upgrade a slave to master? >> >> You must have three zookeeper nodes for a redundant setup. If you only >> have two data centers, then you must put at least two of those nodes in >> one data center. If the data center with two zookeeper nodes goes down, >> zookeeper cannot function, which means SolrCloud will not work >> correctly. There is no way to maintain SolrCloud redundancy with only >> two data centers. You might think to add a fourth ZK node and split >> them between the data centers ... except that in that situation, at >> least three nodes must be functional. Two out of four nodes is not >> enough. >> >> A minimal fault-tolerant SolrCloud install is three physical machines. >> Two of them run ZK and Solr, one of them runs ZK only. >> >> If you don't use SolrCloud, then you have two choices to switch masters: >> >> 1) Change the replication config to redefine the master and reload the >> core or restart Solr. >> 2) Write scripts that manually use the replication HTTP API to do all >> your replication, rather than let Solr handle it automatically. You can >> choose the master for every replication with HTTP calls. >> >> https://wiki.apache.org/solr/SolrReplication#HTTP_API >> >> Thanks, >> Shawn >> >> > > > -- > Regards, > > Salman Akram > > -- Regards, Salman Akram
Recovering from Out of Mem
I know there are some suggestions to avoid the OOM issue, e.g. setting an appropriate max heap size, etc. However, what's the best way to recover from it once Solr goes into a non-responding state? We are using Tomcat on the back end. The scenario is that once we face an OOM issue it keeps on taking queries (doesn't give any error) but they just time out. So even though we have a failover system implemented, we don't have a way to distinguish whether these are genuinely timed-out queries or whether they are due to OOM. -- Regards, Salman Akram
Re: Recovering from Out of Mem
I know this might sound weird but any easy way to do it in Windows? On Tue, Oct 14, 2014 at 7:51 PM, Boogie Shafer wrote: > yago, > > you can put more complex restart logic as shown in the examples below or > just do something similar to the java_oom.sh i posted earlier where you > just spit out an email alert and deal with service restarts and > troubleshooting manually > > > e.g. something like the following for a java_error.sh will drop an email > with a timestamp > > > > echo `date` | mail -s "Java Error: General - $HOSTNAME" not...@domain.com > > > > From: Tim Potter > Sent: Tuesday, October 14, 2014 07:35 > To: solr-user@lucene.apache.org > Subject: Re: Recovering from Out of Mem > > jfyi - the bin/solr script does the following: > > -XX:OnOutOfMemoryError="$SOLR_TIP/bin/oom_solr.sh $SOLR_PORT" where > $SOLR_PORT is the port Solr is bound to, e.g. 8983 > > The oom_solr.sh script looks like: > > SOLR_PORT=$1 > > SOLR_PID=`ps waux | grep start.jar | grep $SOLR_PORT | grep -v grep | awk > '{print $2}' | sort -r` > > if [ "$SOLR_PID" == "" ]; then > > echo "Couldn't find Solr process running on port $SOLR_PORT!" > > exit > > fi > > NOW=$(date +"%F%T") > > ( > > echo "Running OOM killer script for process $SOLR_PID for Solr on port > $SOLR_PORT" > > kill -9 $SOLR_PID > > echo "Killed process $SOLR_PID" > > ) | tee solr_oom_killer-$SOLR_PORT-$NOW.log > > > I usually run Solr behind a supervisor type process (supervisord or > upstart) that will restart it if the process dies. > > > On Tue, Oct 14, 2014 at 8:09 AM, Markus Jelsma > wrote: > > > This will do: > > kill -9 `ps aux | grep -v grep | grep tomcat6 | awk '{print $2}'` > > > > pkill should also work > > > > On Tuesday 14 October 2014 07:02:03 Yago Riveiro wrote: > > > Boogie, > > > > > > > > > > > > > > > Any example for java_error.sh script? > > > > > > > > > — > > > /Yago Riveiro > > > > > > On Tue, Oct 14, 2014 at 2:48 PM, Boogie Shafer < > > boogie.sha...@proquest.com> > > > > > > wrote: > > > > a really simple approach is to have the OOM generate an email > > > > e.g. > > > > 1) create a simple script (call it java_oom.sh) and drop it in your > > tomcat > > > > bin dir echo `date` | mail -s "Java Error: OutOfMemory - $HOSTNAME" > > > > not...@domain.com 2) configure your java options (in setenv.sh or > > > > similar) to trigger heap dump and the email script when OOM occurs # > > > > config error behaviors > > > > CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError > > > > -XX:HeapDumpPath=$TOMCAT_DIR/temp/tomcat-dump.hprof > > > > -XX:OnError=$TOMCAT_DIR/bin/java_error.sh > > > > -XX:OnOutOfMemoryError=$TOMCAT_DIR/bin/java_oom.sh > > > > -XX:ErrorFile=$TOMCAT_DIR/temp/java_error%p.log" > > > > > > > > From: Mark Miller > > > > Sent: Tuesday, October 14, 2014 06:30 > > > > To: solr-user@lucene.apache.org > > > > Subject: Re: Recovering from Out of Mem > > > > Best is to pass the Java cmd line option that kills the process on > OOM > > and > > > > setup a supervisor on the process to restart it. You need a somewhat > > > > recent release for this to work properly though. - Mark > > > > > > > >> On Oct 14, 2014, at 9:06 AM, Salman Akram > > > >> wrote: > > > >> > > > >> I know there are some suggestions to avoid OOM issue e.g. setting > > > >> appropriate Max Heap size etc. However, what's the best way to > recover > > > >> from > > > >> it as it goes into non-responding state? We are using Tomcat on back > > end. 
> > > >> > > > >> The scenario is that once we face OOM issue it keeps on taking > queries > > > >> (doesn't give any error) but they just time out. So even though we > > have a > > > >> fail over system implemented but we don't have a way to distinguish > if > > > >> these are real time out queries OR due to OOM. > > > >> > > > >> -- > > > >> Regards, > > > >> > > > >> Salman Akram > > > > > -- Regards, Salman Akram
Re: Recovering from Out of Mem
I assume you will have to write a script to restart the service as well? On Fri, Oct 17, 2014 at 7:17 PM, Tim Potter wrote: > You'd still want to kill it ... so you'll need to register a cmd script > with the JVM using -XX:OnOutOfMemoryError=kill.cmd and then you could > either > > 1) trap the PID at startup using something like: > > title SolrCloud > > for /F "tokens=2 delims= " %%A in ('TASKLIST /FI ^"WINDOWTITLE eq > SolrCloud^" /NH') do ( > > set /A SOLR_PID=%%A > > echo !SOLR_PID!>solr.pid > > > or > > > 2) if you keep track of the port (which all my Windows scripts do), then > you can do: > > > For /f "tokens=5" %%j in ('netstat -aon ^| find /i "listening" ^| find > ":%SOLR_PORT%"') do ( > > taskkill /t /f /pid %%j > nul 2>&1 > > ) > > > On Fri, Oct 17, 2014 at 1:11 AM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > I know this might sound weird but any easy way to do it in Windows? > > > > On Tue, Oct 14, 2014 at 7:51 PM, Boogie Shafer < > boogie.sha...@proquest.com > > > > > wrote: > > > > > yago, > > > > > > you can put more complex restart logic as shown in the examples below > or > > > just do something similar to the java_oom.sh i posted earlier where you > > > just spit out an email alert and deal with service restarts and > > > troubleshooting manually > > > > > > > > > e.g. something like the following for a java_error.sh will drop an > email > > > with a timestamp > > > > > > > > > > > > echo `date` | mail -s "Java Error: General - $HOSTNAME" > > not...@domain.com > > > > > > > > > > > > From: Tim Potter > > > Sent: Tuesday, October 14, 2014 07:35 > > > To: solr-user@lucene.apache.org > > > Subject: Re: Recovering from Out of Mem > > > > > > jfyi - the bin/solr script does the following: > > > > > > -XX:OnOutOfMemoryError="$SOLR_TIP/bin/oom_solr.sh $SOLR_PORT" where > > > $SOLR_PORT is the port Solr is bound to, e.g. 8983 > > > > > > The oom_solr.sh script looks like: > > > > > > SOLR_PORT=$1 > > > > > > SOLR_PID=`ps waux | grep start.jar | grep $SOLR_PORT | grep -v grep | > awk > > > '{print $2}' | sort -r` > > > > > > if [ "$SOLR_PID" == "" ]; then > > > > > > echo "Couldn't find Solr process running on port $SOLR_PORT!" > > > > > > exit > > > > > > fi > > > > > > NOW=$(date +"%F%T") > > > > > > ( > > > > > > echo "Running OOM killer script for process $SOLR_PID for Solr on port > > > $SOLR_PORT" > > > > > > kill -9 $SOLR_PID > > > > > > echo "Killed process $SOLR_PID" > > > > > > ) | tee solr_oom_killer-$SOLR_PORT-$NOW.log > > > > > > > > > I usually run Solr behind a supervisor type process (supervisord or > > > upstart) that will restart it if the process dies. > > > > > > > > > On Tue, Oct 14, 2014 at 8:09 AM, Markus Jelsma > > > wrote: > > > > > > > This will do: > > > > kill -9 `ps aux | grep -v grep | grep tomcat6 | awk '{print $2}'` > > > > > > > > pkill should also work > > > > > > > > On Tuesday 14 October 2014 07:02:03 Yago Riveiro wrote: > > > > > Boogie, > > > > > > > > > > > > > > > > > > > > > > > > > Any example for java_error.sh script? > > > > > > > > > > > > > > > — > > > > > /Yago Riveiro > > > > > > > > > > On Tue, Oct 14, 2014 at 2:48 PM, Boogie Shafer < > > > > boogie.sha...@proquest.com> > > > > > > > > > > wrote: > > > > > > a really simple approach is to have the OOM generate an email > > > > > > e.g. 
> > > > > > 1) create a simple script (call it java_oom.sh) and drop it in > your > > > > tomcat > > > > > > bin dir echo `date` | mail -s "Java Error: OutOfMemory - > $HOSTNAME" > > > > > > not...@domain.com 2) configure your java options (in setenv.sh > or > &
Re: Recovering from Out of Mem
" That's why it is considered better to crash the program and restart it for OOME." In the end aren't you also saying the same thing or I misunderstood something? We don't get this issue on master server (indexing). Our real concern is slave where sometimes (rare) so not an obvious heap config issue but when it happens our failover doesn't even work (moving to another slave) as there is no error so I just want a good way to know if there is an OOM and shift to a failover or just have that server restarted. On Mon, Oct 20, 2014 at 7:25 PM, Shawn Heisey wrote: > On 10/19/2014 11:32 PM, Ramzi Alqrainy wrote: > > You can create a script to ping on Solr every 10 sec. if no response, > then > > restart it (Kill process id and run Solr again). > > This is the fastest and easiest way to do that on windows. > > I wouldn't do this myself. Any temporary problem that results in a long > query time might result in a true outage while Solr restarts. If OOME > is a problem, then you can deal with that by providing a program for > Java to call when OOME occurs. > > Sending notification when ping times get excessive is a good idea, but I > wouldn't make it automatically restart, unless you've got a threshold > for that action so it only happens when the ping time is *REALLY* high. > > The real fix for OOME is to make the heap larger or to reduce the heap > requirements by changing how Solr is configured or used. > > http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap > > Writing a program that has deterministic behavior in an out of memory > condition is very difficult. The Lucene devs *have* done this hard work > in the lower levels of IndexWriter and the specific Directory > implementations, so that OOME doesn't cause *index corruption*. > > In general, once OOME happens, program operation (and in some cases the > status of the most recently indexed documents) is completely > undetermined. We can be sure that the data which has already been > written to disk will be correct, but nothing beyond that. That's why it > is considered better to crash the program and restart it for OOME. > > Thanks, > Shawn > > -- Regards, Salman Akram
Re: Recovering from Out of Mem
Yes so the most imp thing is what's the best way to 'know' that there is OOM? Some script of a ping with 1-2 mins time? The reason I want auto restart or at least some error (so that it can switch to another slave) is I want to have a good sleep if something goes wrong at night so that the systems keep on working and can look into details in the morning. That's the whole purpose of having a fail over implemented. On a side node the instance where we had this OOM didn't have an explicit Xmx set (on 64 bit Windows) so in that case is there some default max? There was ample mem available so why would it throw OOM? On Mon, Oct 20, 2014 at 9:00 PM, Boogie Shafer wrote: > > i think we can agree that the basic requirement of *knowing* when the OOM > occurs is the minimal requirement, triggering an alert (email, etc) would > be the first thing to get into your script > > once you know when the OOM conditions are occuring you can start to get to > the root cause or remedy (adjust heap sizes, or adjust the input side that > is triggering the OOM). the correct remedy will obviously require some more > deeper investigation into the actual solr usage at the point of OOM and the > gc logs (you have these being generated too i hope). just bumping the Xmx > because you hit an OOM during an abusive query is no guarantee of a fix and > is likely going to cost you OS cache memory space which you want to leave > available for holding the actual index data. the real fix would be cleaning > up the query (if that is possible) > > fundamentally, its a preference thing, but i'm personally not a fan of > auto restarts as the problem that triggered the original OOM (say an > expensive poorly constructed query) may just come back and you get into an > oscillating situation of restart after restart. i generally want a human > involved when error conditions which should be outliers (like OOM) are > happening > > > > From: Salman Akram > Sent: Monday, October 20, 2014 08:47 > To: Solr Group > Subject: Re: Recovering from Out of Mem > > " That's why it is considered better to crash the program and restart it > for OOME." > > In the end aren't you also saying the same thing or I misunderstood > something? > > We don't get this issue on master server (indexing). Our real concern is > slave where sometimes (rare) so not an obvious heap config issue but when > it happens our failover doesn't even work (moving to another slave) as > there is no error so I just want a good way to know if there is an OOM and > shift to a failover or just have that server restarted. > > > > > On Mon, Oct 20, 2014 at 7:25 PM, Shawn Heisey wrote: > > > On 10/19/2014 11:32 PM, Ramzi Alqrainy wrote: > > > You can create a script to ping on Solr every 10 sec. if no response, > > then > > > restart it (Kill process id and run Solr again). > > > This is the fastest and easiest way to do that on windows. > > > > I wouldn't do this myself. Any temporary problem that results in a long > > query time might result in a true outage while Solr restarts. If OOME > > is a problem, then you can deal with that by providing a program for > > Java to call when OOME occurs. > > > > Sending notification when ping times get excessive is a good idea, but I > > wouldn't make it automatically restart, unless you've got a threshold > > for that action so it only happens when the ping time is *REALLY* high. > > > > The real fix for OOME is to make the heap larger or to reduce the heap > > requirements by changing how Solr is configured or used. 
> > > > http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap > > > > Writing a program that has deterministic behavior in an out of memory > > condition is very difficult. The Lucene devs *have* done this hard work > > in the lower levels of IndexWriter and the specific Directory > > implementations, so that OOME doesn't cause *index corruption*. > > > > In general, once OOME happens, program operation (and in some cases the > > status of the most recently indexed documents) is completely > > undetermined. We can be sure that the data which has already been > > written to disk will be correct, but nothing beyond that. That's why it > > is considered better to crash the program and restart it for OOME. > > > > Thanks, > > Shawn > > > > > > > -- > Regards, > > Salman Akram > -- Regards, Salman Akram
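For the Windows side of this, a rough equivalent of the oom_solr.sh approach is a small .cmd hung off the JVM options. This is only a sketch; the service name (Tomcat7) and the paths are assumptions for our setup:

rem added to the Tomcat JVM options (tomcat7w.exe or CATALINA_OPTS)
rem   -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=C:\solr\logs
rem   -XX:OnOutOfMemoryError="cmd /c C:\solr\bin\oom_restart.cmd"

rem C:\solr\bin\oom_restart.cmd
echo %date% %time% OutOfMemoryError on %COMPUTERNAME% >> C:\solr\logs\oom.log
rem drop a marker file the failover check can watch for
echo down > C:\solr\solr_down.flag
net stop Tomcat7
net start Tomcat7

The marker file (or the log entry) gives the failover something unambiguous to look at instead of guessing from query timeouts.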
Common Grams Highlighting
Hi, I have gone through a lot of posts about highlighting issues with Common Grams but I am still a little confused. Below are my requirements:

- Highlighting needs to work properly with Common Grams
- Phrase highlighting needs to work
- Wildcard highlighting needs to work

Is this possible with the Phrase Highlighter (with some patch)? e.g. https://issues.apache.org/jira/browse/LUCENE-1489 (everything works fine for me except the issue mentioned in this link). Is this possible with the Fast Vector Highlighter (wildcard/phrase highlighting needs to work too)? Is this possible with the new Postings Highlighter (https://issues.apache.org/jira/browse/LUCENE-4290)?

If the answer is NO for all of the above, then what if I index the same field twice, once 'indexed' with common grams for search and once just 'stored' without common grams for highlighting, as sketched below? Will this work and would it have any impact on performance/size? -- Regards, Salman Akram
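On that last option, the usual shape in schema.xml is a copyField into a second field that skips CommonGrams. Purely illustrative; the field and type names below are made up:

<!-- searched field: CommonGrams analysis, nothing stored -->
<field name="Contents" type="text_cg" indexed="true" stored="false"/>

<!-- highlighting field: plain analysis, stored (term vectors optional, for FVH) -->
<field name="ContentsHL" type="text_plain" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<copyField source="Contents" dest="ContentsHL"/>

Queries would then run against Contents while highlighting asks for the other field with hl.fl=ContentsHL; the obvious cost is storing (and optionally term-vectoring) the content a second time.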
MaxRows and disabling sort
Hi, I want to limit my SOLR results so that it stops further searching once it finds a certain number of records (just like 'limit' in MySQL). I know it has the timeAllowed property, but is there anything like MaxRows? I am NOT talking about the 'rows' attribute, which returns a specific no. of rows to the client. This seems a very nice way to stop SOLR from traversing the complete index, but I am not sure if there is anything like this. Also, I guess default sorting is on score and sorting can only be done once it has the scores of all matches, so then limiting it to the max rows becomes useless. So is there a way to disable sorting? e.g. it returns the rows as it finds them, without any order? Thanks! -- Regards, Salman Akram Cell: +92-321-4391210
Re: MaxRows and disabling sort
In some cases my search takes too long. Now I want to show user partial matches if its taking too long. The problem with timeAllowed is that lets say I set its value to 10 secs then for some queries it would be fine and will at least return few hundred rows but in really worse scenarios it might not even return few records in that time (even 0 is highly possible) so the user would think nothing matched though there were many matches. Telling SOLR to return first 20/50 records would ensure that it will at least return user the first page even if it takes more time. On Sat, Jan 15, 2011 at 3:11 AM, Erick Erickson wrote: > Why do you want to do this? That is, what problem do you think would > be solved by this? Because there are other problems if you're trying to, > say, return all rows that match > > But no, there's nothing that I know of that would do what you want (of > course that doesn't mean there isn't). > > Best > Erick > > On Fri, Jan 14, 2011 at 12:17 PM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > Hi, > > > > I want to limit my SOLR results so that it stops further searching once > it > > founds a certain number of records (just like 'limit' in MySQL). > > > > I know it has timeAllowed property but is there anything like MaxRows? I > am > > NOT talking about 'rows' attribute which returns a specific no. of rows > to > > client. This seems a very nice way to stop SOLR from traversing through > the > > complete index but I am not sure if there is anything like this. > > > > Also I guess default sorting is on Scoring and sorting can only be done > > once > > it has the scores of all matches so then limiting it to the max rows > > becomes > > useless. So if there a way to disable sorting? e.g. it returns the rows > as > > it finds without any order? > > > > Thanks! > > > > > > -- > > Regards, > > > > Salman Akram > > Cell: +92-321-4391210 > > > -- Regards, Salman Akram Senior Software Engineer - Tech Lead 80-A, Abu Bakar Block, Garden Town, Pakistan Cell: +92-321-4391210
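To make that concrete, the kind of request I mean is below (values are only illustrative). When the budget runs out Solr should flag the response header with partialResults=true, but there is no guarantee how many rows have been collected by then, possibly zero:

http://localhost:8983/solr/select?q=Contents:"first thing"&rows=50&timeAllowed=10000&sort=date desc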
Re: MaxRows and disabling sort
It still returns me the (total) 'numFound', which means it's scanning all records. So it seems that, except for timeAllowed, there is no way to tell SOLR to stop searching through all records? Thanks! On Sat, Jan 15, 2011 at 7:03 AM, Chris Hostetter wrote: > > : Also I guess default sorting is on Scoring and sorting can only be done > once > : it has the scores of all matches so then limiting it to the max rows > becomes > : useless. So if there a way to disable sorting? e.g. it returns the rows > as > : it finds without any order? > > http://wiki.apache.org/solr/CommonQueryParameters#sort > "You can sort by index id using sort=_docid_ asc or sort=_docid_ desc" > > if you specify _docid_ asc then solr should return as soon as it finds the > first N matching results w/o scoring all docs (because no score will be > computed) > > if you use any complex features however (faceting or what not) then it > will still most likely need to scan all docs. > > > -Hoss > -- Regards, Salman Akram
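For the record, the request Hoss is describing looks like this (a plain sketch; field and host are placeholders). It skips score computation, though as noted the total numFound still reflects a full scan:

http://localhost:8983/solr/select?q=Contents:"a b"&rows=100&fl=id&sort=_docid_ asc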
Lucene 2.9.x vs 3.x
Hi, SOLR 1.4.1 uses Lucene 2.9.3 by default (I think so). I have a few questions:

- Are there any major performance (or other) improvements in Lucene 3.0.3/Lucene 2.9.4?
- Does 3.x have major compatibility issues moving from 2.9.x?
- Will a SOLR 1.4.1 build work fine with Lucene 3.0.3?

Thanks! -- Regards, Salman Akram Senior Software Engineer - Tech Lead 80-A, Abu Bakar Block, Garden Town, Pakistan Cell: +92-321-4391210
TVF file
Hi, From my understanding the TVF file stores the Term Vectors (positions/offsets), so if no field has Field.TermVector set (the default is NO) it shouldn't be created, right? I have an index created through SOLR in which no field had any value for TermVectors, so by default they shouldn't be saved. All the fields are either String or Text. All fields have just the indexed and stored attributes set to true. String fields have omitNorms = true as well. Even Luke doesn't show the V (Term Vector) flag, yet I have a big TVF file in my index. It's almost 30% of the total index (around 60% is the PRX positions file). Also, Luke shows the 'f' (omitTF) flag for strings but not for text fields. Any ideas what's going on? Thanks! -- Regards, Salman Akram Senior Software Engineer - Tech Lead 80-A, Abu Bakar Block, Garden Town, Pakistan Cell: +92-321-4391210
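For comparison, term vectors are only written when a field explicitly asks for them in schema.xml with attributes like the ones below (illustrative only; none of our fields have these set), which is why the large .tvf is surprising:

<field name="Contents" type="text" indexed="true" stored="false"
       termVectors="true" termPositions="true" termOffsets="true"/>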
Re: TVF file
Some more info I copied it from Luke and below is what it says for... Text Fields --> stored/uncompressed,indexed,tokenized String Fields --> stored/uncompressed,indexed,omitTermFreqAndPositions The main contents field is not stored so it doesn't show up on Luke but that is Analyzed and Tokenized for searching. On Sun, Jan 16, 2011 at 3:50 PM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > Hi, > > From my understanding TVF file stores the Term Vectors (Positions/Offset) > so if no field has Field.TermVector set (default is NO) so it shouldn't be > created, right? > > I have an index created through SOLR on which no field had any value for > TermVectors so by default it shouldn't be saved. All the fields are either > String or Text. All fields have just indexed and stored attributes set to > True. String fields have omitNorms = true as well. > > Even in Luke it doesn't show V (Term Vector) flag but I have a big TVF file > in my index. Its almost 30% of the total index (around 60% is the PRX > positions file). > > Also in Luke it shows 'f' (omitTF) flag for strings but not for text > fields. > > Any ideas what's going on? Thanks! > > -- > Regards, > > Salman Akram > Senior Software Engineer - Tech Lead > 80-A, Abu Bakar Block, Garden Town, Pakistan > Cell: +92-321-4391210 > -- Regards, Salman Akram Senior Software Engineer - Tech Lead 80-A, Abu Bakar Block, Garden Town, Pakistan Cell: +92-321-4391210
Re: TVF file
Nops. I optimized it with Standard File Format and cleaned up Index dir through Luke. It adds upto to the total size when I optimized it with Compound File Format. On Sun, Jan 16, 2011 at 5:46 PM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > Is it possible that the tvf file you are looking at is old (i.e. not part > of > your active index)? > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > ----- Original Message > > From: Salman Akram > > To: solr-user@lucene.apache.org > > Sent: Sun, January 16, 2011 6:17:23 AM > > Subject: Re: TVF file > > > > Some more info I copied it from Luke and below is what it says for... > > > > Text Fields --> stored/uncompressed,indexed,tokenized > > String Fields --> stored/uncompressed,indexed,omitTermFreqAndPositions > > > > The main contents field is not stored so it doesn't show up on Luke but > that > > is Analyzed and Tokenized for searching. > > > > On Sun, Jan 16, 2011 at 3:50 PM, Salman Akram < > > salman.ak...@northbaysolutions.net> wrote: > > > > > Hi, > > > > > > From my understanding TVF file stores the Term Vectors > (Positions/Offset) > > > so if no field has Field.TermVector set (default is NO) so it > shouldn't be > > > created, right? > > > > > > I have an index created through SOLR on which no field had any value > for > > > TermVectors so by default it shouldn't be saved. All the fields are > either > > > String or Text. All fields have just indexed and stored attributes set > to > > > True. String fields have omitNorms = true as well. > > > > > > Even in Luke it doesn't show V (Term Vector) flag but I have a big TVF > file > > > in my index. Its almost 30% of the total index (around 60% is the PRX > > > positions file). > > > > > > Also in Luke it shows 'f' (omitTF) flag for strings but not for text > > > fields. > > > > > > Any ideas what's going on? Thanks! > > > > > > -- > > > Regards, > > > > > > Salman Akram > > > Senior Software Engineer - Tech Lead > > > 80-A, Abu Bakar Block, Garden Town, Pakistan > > > Cell: +92-321-4391210 > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > > > -- Regards, Salman Akram
Re: TVF file
Please see below the dir listing and relevant part of schema file (I have removed the name part from fields for obvious reasons). Also regarding .frq file why exactly is it needed? Is it required in phrase searching (I am not using highlighting or MoreLikeThis on this index file) too? and this is not made if all fields are using omitTF? Thanks alot! --Dir Listing-- 01/16/2011 06:05 AM . 01/16/2011 06:05 AM .. 01/15/2011 03:58 PM log 04/22/2010 12:42 AM 549 luke.jnlp 01/16/2011 04:58 AM20 segments.gen 01/16/2011 04:58 AM 287 segments_5hl 01/16/2011 02:17 AM 4,760,716,827 _36w.fdt 01/16/2011 02:17 AM 107,732,836 _36w.fdx 01/16/2011 02:15 AM 4,032 _36w.fnm 01/16/2011 04:36 AM25,221,109,245 _36w.frq 01/16/2011 04:38 AM 4,457,445,928 _36w.nrm 01/16/2011 04:36 AM 126,866,227,056 _36w.prx 01/16/2011 04:36 AM22,510,915 _36w.tii 01/16/2011 04:36 AM 1,635,096,862 _36w.tis 01/16/2011 04:58 AM18,341,750 _36w.tvd 01/16/2011 04:58 AM78,450,397,739 _36w.tvf 01/16/2011 04:58 AM 215,465,668 _36w.tvx 14 File(s) 241,755,049,714 bytes 3 Dir(s) 1,072,112,025,600 bytes free -Schema File-- F:\IndexingAppsRealTime\index> --> On Sun, Jan 16, 2011 at 6:52 PM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > Hm, want to email the index dir listing (ls -lah) + the field type and > field > definitions from your schema.xml? > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > ----- Original Message > > From: Salman Akram > > To: solr-user@lucene.apache.org > > Sent: Sun, January 16, 2011 7:51:15 AM > > Subject: Re: TVF file > > > > Nops. I optimized it with Standard File Format and cleaned up Index dir > > through Luke. It adds upto to the total size when I optimized it with > > Compound File Format. > > > > On Sun, Jan 16, 2011 at 5:46 PM, Otis Gospodnetic < > > otis_gospodne...@yahoo.com> wrote: > > > > > Is it possible that the tvf file you are looking at is old (i.e. not > part > > > of > > > your active index)? > > > > > > Otis > > > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > > > Lucene ecosystem search :: http://search-lucene.com/ > > > > > > > > > > > > - Original Message > > > > From: Salman Akram > > > > To: solr-user@lucene.apache.org > > > > Sent: Sun, January 16, 2011 6:17:23 AM > > > > Subject: Re: TVF file > > > > > > > > Some more info I copied it from Luke and below is what it says > for... > > > > > > > > Text Fields --> stored/uncompressed,indexed,tokenized > > > > String Fields --> > stored/uncompressed,indexed,omitTermFreqAndPositions > > > > > > > > The main contents field is not stored so it doesn't show up on Luke > but > > > that > > > > is Analyzed and Tokenized for searching. > > > > > > > > On Sun, Jan 16, 2011 at 3:50 PM, Salman Akram < > > > > salman.ak...@northbaysolutions.net> wrote: > > > > > > > > > Hi, > > > > > > > > > > From my understanding TVF file stores the Term Vectors > > > (Positions/Offset) > > > > > so if no field has Field.TermVector set (default is NO) so it > > > shouldn't be > > > > > created, right? > > > > > > > > > > I have an index created through SOLR on which no field had any > value > > > for > > > > > TermVectors so by default it shouldn't be saved. All the fields > are > > > either > > > > > String or Text. All fields have just indexed and stored > attributes set > > > to > > > > > True. String fields have omitNorms = true as well. 
> > > > > > > > > > Even in Luke it doesn't show V (Term Vector) flag but I have a > big > TVF > > > file > > > > > in my index. Its almost 30% of the total index (around 60% is the > PRX > > > > > positions file). > > > > > > > > > > Also in Luke it shows 'f' (omitTF) flag for strings but not for > text > > > > > fields. > > > > > > > > > > Any ideas what's going on? Thanks! > > > > > > > > > > -- > > > > > Regards, > > > > > > > > > > Salman Akram > > > > > Senior Software Engineer - Tech Lead > > > > > 80-A, Abu Bakar Block, Garden Town, Pakistan > > > > > Cell: +92-321-4391210 > > > > > > > > > > > > > > > > > > > > > -- > > > > Regards, > > > > > > > > Salman Akram > > > > > > > > > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
Re: TVF file
Well anyways thanks for the help. Also can you please reply to this about .frq file (since that's quite big too). "Also regarding .frq file why exactly is it needed? Is it required in phrase searching (I am not using highlighting or MoreLikeThis on this index file) too? and this is not made if all fields are using omitTF?" On Mon, Jan 17, 2011 at 10:18 AM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > Hm, this is a mystery to me - I don't see anything that would turn on Term > Vectors... > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message > > From: Salman Akram > > To: solr-user@lucene.apache.org > > Sent: Sun, January 16, 2011 2:26:53 PM > > Subject: Re: TVF file > > > > Please see below the dir listing and relevant part of schema file (I > have > > removed the name part from fields for obvious reasons). > > > > Also regarding .frq file why exactly is it needed? Is it required in > phrase > > searching (I am not using highlighting or MoreLikeThis on this index > file) > > too? and this is not made if all fields are using omitTF? > > > > Thanks alot! > > > > --Dir Listing-- > > 01/16/2011 06:05 AM . > > 01/16/2011 06:05 AM .. > > 01/15/2011 03:58 PM log > > 04/22/2010 12:42 AM549 luke.jnlp > > 01/16/2011 04:58 AM 20 segments.gen > > 01/16/2011 04:58 AM287 segments_5hl > > 01/16/2011 02:17 AM 4,760,716,827 _36w.fdt > > 01/16/2011 02:17 AM107,732,836 _36w.fdx > > 01/16/2011 02:15 AM 4,032 _36w.fnm > > 01/16/2011 04:36 AM 25,221,109,245 _36w.frq > > 01/16/2011 04:38 AM 4,457,445,928 _36w.nrm > > 01/16/2011 04:36 AM 126,866,227,056 _36w.prx > > 01/16/2011 04:36 AM22,510,915 _36w.tii > > 01/16/2011 04:36 AM 1,635,096,862 _36w.tis > > 01/16/2011 04:58 AM18,341,750 _36w.tvd > > 01/16/2011 04:58 AM78,450,397,739 _36w.tvf > > 01/16/2011 04:58 AM 215,465,668 _36w.tvx > > 14 File(s) 241,755,049,714 bytes > >3 Dir(s) 1,072,112,025,600 bytes free > > > > > > -Schema File-- > > > > F:\IndexingAppsRealTime\index> > > > > > > > > sortMissingLast="true" > > omitNorms="true"/> > > > > > >> luceneMatchVersion="LUCENE_29"/> > > > > > >--> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Sun, Jan 16, 2011 at 6:52 PM, Otis Gospodnetic < > > otis_gospodne...@yahoo.com> wrote: > > > > > Hm, want to email the index dir listing (ls -lah) + the field type and > > > field > > > definitions from your schema.xml? > > > > > > Otis > > > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > > > Lucene ecosystem search :: http://search-lucene.com/ > > > > > > > > > > > > - Original Message > > > > From: Salman Akram > > > > To: solr-user@lucene.apache.org > > > > Sent: Sun, January 16, 2011 7:51:15 AM > > > > Subject: Re: TVF file > > > > > > > > Nops. I optimized it with Standard File Format and cleaned up Index > dir > > > > through Luke. It adds upto to the total size when I optimized it > with > > > > Compound File Format. > > > > >
CommonGrams phrase query
Hi, I have made an index using CommonGrams. Now when I query "a b" and explain it, SOLR turns it into +MultiPhraseQuery(Contents:"(a a_b) b"). Shouldn't it just be searching for "a_b"? I am asking this because even though I am using CommonGrams it's much slower than the normal index, which just searches on "a b". Note: Both words are in the words list of CommonGrams. -- Regards, Salman Akram Senior Software Engineer - Tech Lead 80-A, Abu Bakar Block, Garden Town, Pakistan Cell: +92-321-4391210
Re: CommonGrams phrase query
Ok sorry it was my fault. I wasn't using CommonGramsQueryFilter for query, just had Filter for indexing. The query seems fine now. On Mon, Jan 17, 2011 at 1:44 PM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > Hi, > > I have made an index using CommonGrams. Now when I query "a b" and explain > it, SOLR makes it +MultiPhraseQuery(Contents:"(a a_b) b"). > > Shouldn't it just be searching "a_b"? I am asking this coz even though I am > using CommonGrams it's much slower than normal index which just searches on > "a b". > > Note: Both words are in the words list of CommonGrams. > > -- > Regards, > > Salman Akram > -- Regards, Salman Akram
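For reference, a minimal sketch of the analyzer split being described, with CommonGramsFilterFactory on the index side and CommonGramsQueryFilterFactory on the query side (the field type name, tokenizer and words file are placeholders, not taken from the thread):
<fieldType name="text_cg" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
With the query-side filter in place, a phrase of two listed words such as "a b" should collapse to the single token a_b instead of the MultiPhraseQuery shown above.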
Re: sort problem
Yes. On Mon, Jan 17, 2011 at 2:44 PM, Philippe VINCENT-ROYOL < vincent.ro...@gmail.com> wrote: > Le 17/01/11 10:32, Grijesh a écrit : > > Use Lowercase filter to lowering your data at both index time and search >> time >> it will make case insensitive >> >> - >> Thanx: >> Grijesh >> > Thanks, > so tell me if i m wrong... i need to modify my schema.xml to add lowercase > filter and reindex my content? > > > -- Regards, Salman Akram Senior Software Engineer - Tech Lead 80-A, Abu Bakar Block, Garden Town, Pakistan Cell: +92-321-4391210
Re: FilterQuery reaching maxBooleanClauses, alternatives?
You can index a field which holds the user type, e.g. UserType (possible values can be TypeA, TypeB and so on...), and then you can just do ?q=name:Stefan&fq=UserType:TypeB BTW you can even increase maxBooleanClauses, but in this case that is definitely not a good idea. You would also hit the max length of an HTTP GET request, so you would have to change it to POST. Better to handle it with a new field. On Mon, Jan 17, 2011 at 5:57 PM, Stefan Matheis < matheis.ste...@googlemail.com> wrote: > Hi List, > > we are sometimes reaching the maxBooleanClauses Limit (which is 1024, per > default). So, the used query looks like: > > ?q=name:Stefan&fq=5 10 12 15 16 [...] > > where the values are ids of users, which the current user is allowed to see > - so long, nothing special. sometimes the filter-query includes user-ids > from an different Type of User (let's say we have TypeA and TypeB) where > TypeB contains more then 2k users. Then we hit the given Limit. > > Now the Question is .. is it possible to enable an Filter/Function/Feature > in Solr, which it makes possible, that we don't need to send over alle the > user ids from TypeB Users? Just to tell Solr "include all TypeB Users in > the > (given) FilterQuery" (or something in that direction)? > > If so, what's the Name of this Filter/Function/Feature? :) > > Don't hesitate to ask, if my question/description is weird! > > Thanks > Stefan > -- Regards, Salman Akram
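A minimal sketch of what that could look like, with illustrative field names (the real schema is not shown in the thread):
<field name="UserType" type="string" indexed="true" stored="false"/>
q=name:Stefan&fq=UserType:TypeB
If a handful of extra users still need to be allowed individually, the two can be combined in one filter, e.g. fq=UserType:TypeB OR UserID:(5 OR 10 OR 12).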
Re: FilterQuery reaching maxBooleanClauses, alternatives?
You are welcome. By new field I meant if you don't have a field for UserType already. On Mon, Jan 17, 2011 at 6:22 PM, Stefan Matheis < matheis.ste...@googlemail.com> wrote: > Thanks Salman, > > talking with others about problems really helps. Adding another FilterQuery > is a bit too much - but combining both is working fine! > > not seen the wood for the trees =) > Thanks, Stefan > > > On Mon, Jan 17, 2011 at 2:07 PM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > You can index a field which can the User types e.g. UserType (possible > > values can be TypeA,TypeB and so on...) and then you can just do > > > > ?q=name:Stefan&fq=UserType:TypeB > > > > BTW you can even increase the size of maxBooleanClauses but in this case > > definitely this is not a good idea. Also you would hit the max limit of > > HTTP > > GET so you will have to change it to POST. Better handle it with a new > > field. > > > > On Mon, Jan 17, 2011 at 5:57 PM, Stefan Matheis < > > matheis.ste...@googlemail.com> wrote: > > > > > Hi List, > > > > > > we are sometimes reaching the maxBooleanClauses Limit (which is 1024, > per > > > default). So, the used query looks like: > > > > > > ?q=name:Stefan&fq=5 10 12 15 16 [...] > > > > > > where the values are ids of users, which the current user is allowed to > > see > > > - so long, nothing special. sometimes the filter-query includes > user-ids > > > from an different Type of User (let's say we have TypeA and TypeB) > where > > > TypeB contains more then 2k users. Then we hit the given Limit. > > > > > > Now the Question is .. is it possible to enable an > > Filter/Function/Feature > > > in Solr, which it makes possible, that we don't need to send over alle > > the > > > user ids from TypeB Users? Just to tell Solr "include all TypeB Users > in > > > the > > > (given) FilterQuery" (or something in that direction)? > > > > > > If so, what's the Name of this Filter/Function/Feature? :) > > > > > > Don't hesitate to ask, if my question/description is weird! > > > > > > Thanks > > > Stefan > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
CommonGrams and SOLR - 1604
Hi, I am trying to use CommonGrams with the SOLR-1604 patch but it doesn't seem to work. If I don't add {!complexphrase} it uses CommonGramsQueryFilterFactory and proper bi-grams are made, but of course it doesn't use this patch. If I add {!complexphrase} it simply does it the old way, i.e. it ignores CommonGrams. Does anyone know how to combine these two features? Also, once they are combined (hopefully they can be), would phrase proximity search work fine? Thanks -- Regards, Salman Akram
Re: CommonGrams and SOLR - 1604
Anyone? On Mon, Jan 17, 2011 at 7:48 PM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > Hi, > > I am trying to use CommonGrams with SOLR - 1604 patch but doesn't seem to > work. > > If I don't add {!complexphrase} it uses CommonGramsQueryFilterFactory and > proper bi-grams are made but of course doesn't use this patch. > > If I add {!complexphrase} it simply does it the old way i.e. ignore > CommonGrams. > > Does anyone know how to combine both these features? > > Also once they are combined (hopefully they will be) would phrase proximity > search work fine? > > Thanks > > -- > Regards, > > Salman Akram > > -- Regards, Salman Akram
Mem allocation - SOLR vs OS
Hi, I know this is a subjective topic, but from what I have read it seems more RAM should be spared for OS caching and much less for SOLR/Tomcat, even on a dedicated SOLR server. Can someone give me an idea of the theoretically ideal proportion between them for a dedicated Windows server with 32GB RAM? Also, the index is updated every hour. -- Regards, Salman Akram
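As a rough illustration of that split (the numbers are placeholders to be tuned, not a recommendation from this thread): cap the JVM heap for Tomcat/Solr at a fixed size and leave the remainder of the 32GB to the OS page cache, e.g. in Tomcat's JAVA_OPTS on Windows:
set JAVA_OPTS=-Xms4096m -Xmx4096m
Everything above the heap is then left to Windows for caching the index files.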
Re: Mem allocation - SOLR vs OS
In case it helps, there are two SOLR indexes (160GB and 700GB) on the machine. Also, these are separate indexes and not shards, so would it help to put them on two separate Tomcat servers on the same machine? That way, I think, one index won't be affecting the other's cache. On Wed, Jan 19, 2011 at 12:00 PM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > Hi, > > I know this is a subjective topic but from what I have read it seems more > RAM should be spared for OS caching and much less for SOLR/Tomcat even on a > dedicated SOLR server. > > Can someone give me an idea about the theoretically ideal proportion b/w > them for a dedicated Windows server with 32GB RAM? Also the index is updated > every hour. > > -- > Regards, > > Salman Akram > > -- Regards, Salman Akram
Re: Mem allocation - SOLR vs OS
Actually we don't have much load on the server (like the usage currently is quite low) but user queries are very complex e.g. long phrases/multiple proximity/wildcard etc so I know these values need to be tried out but I wanted to see whats the right 'start' so that I am not way off. Also regarding Solr cores just to clarify they are totally different indexes (not 2 parts of one index) so the queries on them are separate do you still think its better to keep them on two cores? Thanks a lot! On Wed, Jan 19, 2011 at 9:43 PM, Erick Erickson wrote: > You're better off using two cores on the same Solr instance rather than two > instances of Tomcat, that way you avoid some overhead. > > The usual advice is to monitor the Solr caches, particularly for evictions > and > size the Solr caches accordingly. You can see these from the admin/stats > page > and also by mining the logs, looking particularly for cache evictions. > Since > cache > usage is so dependent on the particular installation and usage pattern > (particularly > sorting and faceting), "general" advice is hard to give. > > Hope this helps > Erick > > On Wed, Jan 19, 2011 at 2:25 AM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > In case it helps there are two SOLR indexes (160GB and 700GB) on the > > machine. > > > > Also these are separate indexes and not shards so would it help to put > them > > on two separate Tomcat servers on same machine? This way I think one > index > > won't be affecting others cache. > > > > On Wed, Jan 19, 2011 at 12:00 PM, Salman Akram < > > salman.ak...@northbaysolutions.net> wrote: > > > > > Hi, > > > > > > I know this is a subjective topic but from what I have read it seems > more > > > RAM should be spared for OS caching and much less for SOLR/Tomcat even > on > > a > > > dedicated SOLR server. > > > > > > Can someone give me an idea about the theoretically ideal proportion > b/w > > > them for a dedicated Windows server with 32GB RAM? Also the index is > > updated > > > every hour. > > > > > > -- > > > Regards, > > > > > > Salman Akram > > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
Re: Mem allocation - SOLR vs OS
We do have sorting but not faceting. OK so I guess there is no 'hard and fast rule' as such so I will play with it and see. Thanks for the help On Wed, Jan 19, 2011 at 11:48 PM, Markus Jelsma wrote: > You only need so much for Solr so it can do its thing. Faceting can take > quite > some memory on a large index but sorting can be a really big RAM consumer. > > As Erick pointed out, inspect and tune the cache settings and adjust RAM > allocated to the JVM if required. Using tools like JConsole you can monitor > various things via JMX including RAM consumption. > > > Hi, > > > > I know this is a subjective topic but from what I have read it seems more > > RAM should be spared for OS caching and much less for SOLR/Tomcat even on > a > > dedicated SOLR server. > > > > Can someone give me an idea about the theoretically ideal proportion b/w > > them for a dedicated Windows server with 32GB RAM? Also the index is > > updated every hour. > -- Regards, Salman Akram
Re: Mem allocation - SOLR vs OS
I will be looking into JConsole. One more question regarding caching: when we talk about warm-up queries, does that mean that some of the complex queries (especially those which require high I/O, e.g. phrase queries) will really be very slow (on, let's say, an index of 200GB) if they are not cached? I am talking about a difference of more than a few seconds... Also, regarding the cache settings, I wanted to get some more advice: when you talk about evictions do you mean the cumulative or the current ones? I know ideally the hit rate should be high and evictions low, but can you please look at the stats below for the documentCache of both indexes and see what they are really 'saying'? Should the size be increased/decreased/kept the same, or is this data not enough to judge and should I collect at least a few days of data? Note: Tomcat was restarted a day back and, as I said, there isn't much workload but the queries are complex, and the index is updated every hour (we currently haven't implemented replication). Max Size and InitialSize for both is 4096 lookups : 9849 hits : 5144 hitratio : 0.52 inserts : 4705 evictions : 609 size : 4096 warmupTime : 0 cumulative_lookups : 82492 cumulative_hits : 52059 cumulative_hitratio : 0.63 cumulative_inserts : 30433 cumulative_evictions : 685 -- lookups : 5539 hits : 3765 hitratio : 0.67 inserts : 1774 evictions : 0 size : 1774 warmupTime : 0 cumulative_lookups : 29062 cumulative_hits : 20568 cumulative_hitratio : 0.70 cumulative_inserts : 8494 cumulative_evictions : 0 On Wed, Jan 19, 2011 at 11:48 PM, Markus Jelsma wrote: > You only need so much for Solr so it can do its thing. Faceting can take > quite > some memory on a large index but sorting can be a really big RAM consumer. > > As Erick pointed out, inspect and tune the cache settings and adjust RAM > allocated to the JVM if required. Using tools like JConsole you can monitor > various things via JMX including RAM consumption. > > > Hi, > > > > I know this is a subjective topic but from what I have read it seems more > > RAM should be spared for OS caching and much less for SOLR/Tomcat even on > a > > dedicated SOLR server. > > > > Can someone give me an idea about the theoretically ideal proportion b/w > > them for a dedicated Windows server with 32GB RAM? Also the index is > > updated every hour. > -- Regards, Salman Akram
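For context, the cache being discussed is declared in solrconfig.xml; a sketch matching the sizes in the stats above (documentCache is normally left with autowarmCount="0", since it is keyed on internal Lucene document IDs that change with every new searcher):
<documentCache class="solr.LRUCache" size="4096" initialSize="4096" autowarmCount="0"/>
The first set of stats above shows evictions with the size pinned at 4096 (the cache is full), while the second shows none; evictions versus hit ratio is the kind of signal the sizing advice in this thread suggests watching.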
Wildcard, OR's inside Phrases (SOLR - 1604) & SurroundQueryParser
Hi, To get phrase search with proximity working properly I am planning to integrate SurroundQueryParser. However, I wanted to know whether the functionality provided in SOLR-1604 (i.e. wildcards and OR's inside phrases) would work fine with it or not? If not, what's the alternative, as I need both features? Thanks! -- Regards, Salman Akram
Re: Wildcard, OR's inside Phrases (SOLR - 1604) & SurroundQueryParser
It seems SurroundQueryParser is in Lucene NOT Solr. So does this mean I will have to integrate it in Lucene and update that jar file in SOLR? Thanks On Fri, Jan 21, 2011 at 11:33 PM, Ahmet Arslan wrote: > > --- On Fri, 1/21/11, Salman Akram > wrote: > > > From: Salman Akram > > Subject: Wildcard, OR's inside Phrases (SOLR - 1604) & > SurroundQueryParser > > To: solr-user@lucene.apache.org > > Date: Friday, January 21, 2011, 7:18 PM > > Hi, > > > > To get phrase search with proximity work fine I am planning > > to integrate > > SurroundQueryParser. However, I wanted to know whether the > > functionality > > provided in SOLR 1604 (i.e. Wildcard, OR's inside Phrases) > > would work fine > > with it or not? > > surround is somehow superset of complexphrase. But their syntax is > different. > > "a* b*"~10 => a* 10n b* => 10n(a*,b*) > > Lucene in Action second edition book (chapter 9.6) talks about surround. > > > > -- Regards, Salman Akram
Highlighting with/without Term Vectors
Hi, Does anyone have any benchmarks on how much highlighting speeds up with Term Vectors (compared to without them)? e.g. if highlighting on 20 documents takes 1 sec with Term Vectors, any idea how long it will take without them? I need to know since the index used for highlighting has a TVF file of around 450GB (approx 65% of the total index size), so I am trying to see whether decreasing the index size by dropping TVF would be more helpful for performance (less RAM, and it should be good for I/O too I guess) or whether keeping it is still better. I know the best way is to try it out, but indexing takes a very long time so I am trying to see whether it's even worth it or not. -- Regards, Salman Akram
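For reference, term vectors are switched on per field in schema.xml; a sketch of the variant usually enabled for highlighting (the field name is a placeholder):
<field name="Contents" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
These attributes are what produce the .tvx/.tvd/.tvf files; if they are dropped and the data reindexed, the standard highlighter falls back to re-analyzing the stored text of each document being highlighted, which is the trade-off being weighed in this thread.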
Re: Highlighting with/without Term Vectors
Just to add one thing, in case it makes a difference. Max document size on which highlighting needs to be done is few hundred kb's (in file system). In index its compressed so should be much smaller. Total documents are more than 100 million. On Tue, Jan 25, 2011 at 12:42 AM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > Hi, > > Does anyone have any benchmarks how much highlighting speeds up with Term > Vectors (compared to without it)? e.g. if highlighting on 20 documents take > 1 sec with Term Vectors any idea how long it will take without them? > > I need to know since the index used for highlighting has a TVF file of > around 450GB (approx 65% of total index size) so I am trying to see whether > the decreasing the index size by dropping TVF would be more helpful for > performance (less RAM, should be good for I/O too I guess) or keeping it is > still better? > > I know the best way is try it out but indexing takes a very long time so > trying to see whether its even worthy or not. > > -- > Regards, > > Salman Akram > > -- Regards, Salman Akram
Performance optimization of Proximity/Wildcard searches
Hi, I am facing performance issues with three types of queries (and their combinations). Some of the queries take more than 2-3 mins. Index size is around 150GB. - Wildcard - Proximity - Phrases (with common words) I know CommonGrams and stop words are a good way to resolve such issues, but they don't fulfill our functional requirements (CommonGrams seems to have issues with phrase proximity, stop words have issues with exact match, etc.). Sharding is an option too, but that also comes with limitations, so I want to keep it as a last resort; I think there must be other things, because 150GB is not too big for one drive/server with 32GB RAM. Cache warming is a good option too, but the index gets updated every hour so I am not sure how much that would help. What are the other main tips that can help in performance optimization of the above queries? Thanks -- Regards, Salman Akram
Re: Performance optimization of Proximity/Wildcard searches
By warmed index you only mean warming the SOLR cache or OS cache? As I said our index is updated every hour so I am not sure how much SOLR cache would be helpful but OS cache should still be helpful, right? I haven't compared the results with a proper script but from manual testing here are some of the observations. 'Recent' queries which are in cache of course return immediately (only if they are exactly same - even if they took 3-4 mins first time). I will need to test how many recent queries stay in cache but still this would work only for very common queries. User can run different queries and I want at least them to be at 'acceptable' level (5-10 secs) even if not very fast. Our warm up script currently executes all distinct queries in our logs having count > 5. It was run yesterday (with all the indexing update every hour after that) and today when I executed some of the same queries again their time seemed a little less (around 15-20%), I am not sure if this means anything. However, still their time is not acceptable. What do you think is the best way to compare results? First run all the warm up queries and then execute same randomly and compare? We are using Windows server, would it make a big difference if we move to Linux? Our load is not high but some queries are really complex. Also I was hoping to move to SSD in last after trying out all software options. Is that an agreed fact that on large indexes (which don't fit in RAM) proximity/wildcard/phrase queries (on common words) would be slow and it can be only improved by cache warm up and better hardware? Otherwise with an index of around 150GB such queries will take more than a min? If that's the case I know this question is very subjective but if a single query takes 2 min on SAS 10K RPM what would its approx time be on a good SSD (everything else same)? Thanks! On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen wrote: > On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote: > > Cache warming is a good option too but the index get updated every hour > so > > not sure how much would that help. > > What is the time difference between queries with a warmed index and a > cold one? If the warmed index performs satisfactory, then one answer is > to upgrade your underlying storage. As always for IO-caused performance > problem in Lucene/Solr-land, SSD is the answer. > > -- Regards, Salman Akram
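For completeness, the usual place to hang warm-up queries is a newSearcher listener in solrconfig.xml, so they run automatically after every commit instead of from an external script; a minimal sketch with placeholder queries (the real warm-up set from the logs is not shown in the thread):
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">Contents:"sample common phrase"</str></lst>
    <lst><str name="q">Contents:appli*</str></lst>
  </arr>
</listener>
A handful of representative slow queries is generally enough to pull the hot parts of the term dictionary and postings back into the OS cache after each hourly update.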
Re: Highlighting with/without Term Vectors
Anyone? On Tue, Jan 25, 2011 at 12:57 AM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > Just to add one thing, in case it makes a difference. > > Max document size on which highlighting needs to be done is few hundred > kb's (in file system). In index its compressed so should be much smaller. > Total documents are more than 100 million. > > > On Tue, Jan 25, 2011 at 12:42 AM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > >> Hi, >> >> Does anyone have any benchmarks how much highlighting speeds up with Term >> Vectors (compared to without it)? e.g. if highlighting on 20 documents take >> 1 sec with Term Vectors any idea how long it will take without them? >> >> I need to know since the index used for highlighting has a TVF file of >> around 450GB (approx 65% of total index size) so I am trying to see whether >> the decreasing the index size by dropping TVF would be more helpful for >> performance (less RAM, should be good for I/O too I guess) or keeping it is >> still better? >> >> I know the best way is try it out but indexing takes a very long time so >> trying to see whether its even worthy or not. >> >> -- >> Regards, >> >> Salman Akram >> >> > > > -- > Regards, > > Salman Akram > -- Regards, Salman Akram
Re: Highlighting with/without Term Vectors
Basically Term Vectors are only on one main field i.e. Contents. Average size of each document would be few KB's but there are around 130 million documents so what do you suggest now? On Fri, Feb 4, 2011 at 5:24 PM, Otis Gospodnetic wrote: > Salman, > > It also depends on the size of your documents. Re-analyzing 20 fields of > 500 > bytes each will be a lot faster than re-analyzing 20 fields with 50 KB > each. > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message > > From: Grant Ingersoll > > To: solr-user@lucene.apache.org > > Sent: Wed, January 26, 2011 10:44:09 AM > > Subject: Re: Highlighting with/without Term Vectors > > > > > > On Jan 24, 2011, at 2:42 PM, Salman Akram wrote: > > > > > Hi, > > > > > > Does anyone have any benchmarks how much highlighting speeds up with > Term > > > Vectors (compared to without it)? e.g. if highlighting on 20 documents > take > > > 1 sec with Term Vectors any idea how long it will take without them? > > > > > > I need to know since the index used for highlighting has a TVF file of > > > around 450GB (approx 65% of total index size) so I am trying to see > whether > > > the decreasing the index size by dropping TVF would be more helpful > for > > > performance (less RAM, should be good for I/O too I guess) or keeping > it is > > > still better? > > > > > > I know the best way is try it out but indexing takes a very long time > so > > > trying to see whether its even worthy or not. > > > > > > Try testing on a smaller set. In general, you are saving the process of > >re-analyzing the content, so, to some extent it is going to be dependent > on how > >fast your analyzer chain is. At the size you are at, I don't know if > storing > >TVs is worth it. > -- Regards, Salman Akram
Re: Performance optimization of Proximity/Wildcard searches
I know so we are not really using it for regular warm-ups (in any case index is updated on hourly basis). Just tried few times to compare results. The issue is I am not even sure if warming up is useful for such regular updates. On Fri, Feb 4, 2011 at 5:16 PM, Otis Gospodnetic wrote: > Salman, > > I only skimmed your email, but wanted to say that this part sounds a little > suspicious: > > > Our warm up script currently executes all distinct queries in our logs > > having count > 5. It was run yesterday (with all the indexing update > every > > It sounds like this will make warmup take a long time, assuming you > have > more than a handful distinct queries in your logs. > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message > > From: Salman Akram > > To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk > > Sent: Tue, January 25, 2011 6:32:48 AM > > Subject: Re: Performance optimization of Proximity/Wildcard searches > > > > By warmed index you only mean warming the SOLR cache or OS cache? As I > said > > our index is updated every hour so I am not sure how much SOLR cache > would > > be helpful but OS cache should still be helpful, right? > > > > I haven't compared the results with a proper script but from manual > testing > > here are some of the observations. > > > > 'Recent' queries which are in cache of course return immediately (only > if > > they are exactly same - even if they took 3-4 mins first time). I will > need > > to test how many recent queries stay in cache but still this would work > only > > for very common queries. User can run different queries and I want at > least > > them to be at 'acceptable' level (5-10 secs) even if not very fast. > > > > Our warm up script currently executes all distinct queries in our logs > > having count > 5. It was run yesterday (with all the indexing update > every > > hour after that) and today when I executed some of the same queries > again > > their time seemed a little less (around 15-20%), I am not sure if this > means > > anything. However, still their time is not acceptable. > > > > What do you think is the best way to compare results? First run all the > warm > > up queries and then execute same randomly and compare? > > > > We are using Windows server, would it make a big difference if we move > to > > Linux? Our load is not high but some queries are really complex. > > > > Also I was hoping to move to SSD in last after trying out all software > > options. Is that an agreed fact that on large indexes (which don't fit > in > > RAM) proximity/wildcard/phrase queries (on common words) would be slow > and > > it can be only improved by cache warm up and better hardware? Otherwise > with > > an index of around 150GB such queries will take more than a min? > > > > If that's the case I know this question is very subjective but if a > single > > query takes 2 min on SAS 10K RPM what would its approx time be on a good > SSD > > (everything else same)? > > > > Thanks! > > > > > > On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen > wrote: > > > > > On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote: > > > > Cache warming is a good option too but the index get updated every > hour > > > so > > > > not sure how much would that help. > > > > > > What is the time difference between queries with a warmed index and a > > > cold one? If the warmed index performs satisfactory, then one answer > is > > > to upgrade your underlying storage. 
As always for IO-caused > performance > > > problem in Lucene/Solr-land, SSD is the answer. > > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
Re: Performance optimization of Proximity/Wildcard searches
Well I assume many people out there would have indexes larger than 100GB and I don't think so normally you will have more RAM than 32GB or 64! As I mentioned the queries are mostly phrase, proximity, wildcard and combination of these. What exactly do you mean by distribution of documents? On this index our documents are not more than few hundred KB's on average (file system size) and there are around 14 million documents. 80% of the index size is taken up by position file. I am not sure if this is what you asked? On Fri, Feb 4, 2011 at 5:19 PM, Otis Gospodnetic wrote: > Hi, > > > > Sharding is an option too but that too comes with limitations so want to > > keep that as a last resort but I think there must be other things coz > 150GB > > is not too big for one drive/server with 32GB Ram. > > Hmm what makes you think 32 GB is enough for your 150 GB index? > It depends on queries and distribution of matching documents, for example. > What's yours like? > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message > > From: Salman Akram > > To: solr-user@lucene.apache.org > > Sent: Tue, January 25, 2011 4:20:34 AM > > Subject: Performance optimization of Proximity/Wildcard searches > > > > Hi, > > > > I am facing performance issues in three types of queries (and their > > combination). Some of the queries take more than 2-3 mins. Index size is > > around 150GB. > > > > > >- Wildcard > >- Proximity > >- Phrases (with common words) > > > > I know CommonGrams and Stop words are a good way to resolve such issues > but > > they don't fulfill our functional requirements (Common Grams seem to > have > > issues with phrase proximity, stop words have issues with exact match > etc). > > > > Sharding is an option too but that too comes with limitations so want to > > keep that as a last resort but I think there must be other things coz > 150GB > > is not too big for one drive/server with 32GB Ram. > > > > Cache warming is a good option too but the index get updated every hour > so > > not sure how much would that help. > > > > What are the other main tips that can help in performance optimization > of > > the above queries? > > > > Thanks > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
Re: Performance optimization of Proximity/Wildcard searches
Correct me if I am wrong: a commit flushes the SOLR caches, but of course the OS cache would still be useful? If an index is updated every hour, then a warm-up that takes less than 5 mins should be more than enough, right? On Sat, Feb 5, 2011 at 7:42 AM, Otis Gospodnetic wrote: > Salman, > > Warming up may be useful if your caches are getting decent hit ratios. > Plus, you > are warming up the OS cache when you warm up. > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > ----- Original Message > > From: Salman Akram > > To: solr-user@lucene.apache.org > > Sent: Fri, February 4, 2011 3:33:41 PM > > Subject: Re: Performance optimization of Proximity/Wildcard searches > > > > I know so we are not really using it for regular warm-ups (in any case > index > > is updated on hourly basis). Just tried few times to compare results. > The > > issue is I am not even sure if warming up is useful for such regular > > updates. > > > > > > > > On Fri, Feb 4, 2011 at 5:16 PM, Otis Gospodnetic < > otis_gospodne...@yahoo.com > > > wrote: > > > > > Salman, > > > > > > I only skimmed your email, but wanted to say that this part sounds a > little > > > suspicious: > > > > > > > Our warm up script currently executes all distinct queries in our > logs > > > > having count > 5. It was run yesterday (with all the indexing > update > > > every > > > > > > It sounds like this will make warmup take a long time, assuming > you > > > have > > > more than a handful distinct queries in your logs. > > > > > > Otis > > > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > > > Lucene ecosystem search :: http://search-lucene.com/ > > > > > > > > > > > > - Original Message > > > > From: Salman Akram > > > > To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk > > > > Sent: Tue, January 25, 2011 6:32:48 AM > > > > Subject: Re: Performance optimization of Proximity/Wildcard searches > > > > > > > > By warmed index you only mean warming the SOLR cache or OS cache? As > I > > > said > > > > our index is updated every hour so I am not sure how much SOLR cache > > > would > > > > be helpful but OS cache should still be helpful, right? 
Our load is not high but some queries are really complex. > > > > > > > > Also I was hoping to move to SSD in last after trying out all > software > > > > options. Is that an agreed fact that on large indexes (which don't > fit > > > in > > > > RAM) proximity/wildcard/phrase queries (on common words) would be > slow > > > and > > > > it can be only improved by cache warm up and better hardware? &g
Re: Performance optimization of Proximity/Wildcard searches
Since all queries return total count as well so on average a query matches 10% of the total documents. The index I am talking about is around 13 million so that means around 1.3 million documents match on average. Of course all of them won't be overlapping so I am guessing that around 30-50% documents do match the daily queries. I tried to find out a lot if you can tell SOLR to stop searching after a certain count - I don't mean no. of rows but just like MySQL limit so that it doesn't have to spend time calculating the total count whereas its only returning few rows to UI and we are OK in showing count as 1000+ (if its more than 1000) but couldn't find any way. On Sat, Feb 5, 2011 at 7:45 AM, Otis Gospodnetic wrote: > Heh, I'm not sure if this is valid thinking. :) > > By *matching* doc distribution I meant: what proportion of your millions of > documents actually ever get matched and then how many of those make it to > the > UI. > If you have 1000 queries in a day and they all end up matching only 3 of > your > docs, the system will need less RAM than a system where 1000 queries match > 5 > different docs. > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message > > From: Salman Akram > > To: solr-user@lucene.apache.org > > Sent: Fri, February 4, 2011 3:38:55 PM > > Subject: Re: Performance optimization of Proximity/Wildcard searches > > > > Well I assume many people out there would have indexes larger than 100GB > and > > I don't think so normally you will have more RAM than 32GB or 64! > > > > As I mentioned the queries are mostly phrase, proximity, wildcard and > > combination of these. > > > > What exactly do you mean by distribution of documents? On this index our > > documents are not more than few hundred KB's on average (file system > size) > > and there are around 14 million documents. 80% of the index size is > taken up > > by position file. I am not sure if this is what you asked? > > > > On Fri, Feb 4, 2011 at 5:19 PM, Otis Gospodnetic < > otis_gospodne...@yahoo.com > > > wrote: > > > > > Hi, > > > > > > > > > > Sharding is an option too but that too comes with limitations so > want to > > > > keep that as a last resort but I think there must be other things > coz > > > 150GB > > > > is not too big for one drive/server with 32GB Ram. > > > > > > Hmm what makes you think 32 GB is enough for your 150 GB index? > > > It depends on queries and distribution of matching documents, for > example. > > > What's yours like? > > > > > > Otis > > > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > > > Lucene ecosystem search :: http://search-lucene.com/ > > > > > > > > > > > > - Original Message > > > > From: Salman Akram > > > > To: solr-user@lucene.apache.org > > > > Sent: Tue, January 25, 2011 4:20:34 AM > > > > Subject: Performance optimization of Proximity/Wildcard searches > > > > > > > > Hi, > > > > > > > > I am facing performance issues in three types of queries (and their > > > > combination). Some of the queries take more than 2-3 mins. Index > size is > > > > around 150GB. > > > > > > > > > > > >- Wildcard > > > > - Proximity > > > >- Phrases (with common words) > > > > > > > > I know CommonGrams and Stop words are a good way to resolve such > issues > > > but > > > > they don't fulfill our functional requirements (Common Grams seem > to > > > have > > > > issues with phrase proximity, stop words have issues with exact > match > > > etc). 
> > > > > > > > Sharding is an option too but that too comes with limitations so > want to > > > > keep that as a last resort but I think there must be other things > coz > > > 150GB > > > > is not too big for one drive/server with 32GB Ram. > > > > > > > > Cache warming is a good option too but the index get updated every > hour > > > so > > > > not sure how much would that help. > > > > > > > > What are the other main tips that can help in performance > optimization > > > of > > > > the above queries? > > > > > > > > Thanks > > > > > > > > -- > > > > Regards, > > > > > > > > Salman Akram > > > > > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
Re: Highlighting with/without Term Vectors
Yea I was going to reply to that thread but then it just slipped out of my mind. :) Actually we have two indexes. One that is used for searching and other for highlighting. Their structure is different too like the 1st one has all the metadata + document contents indexed (just for searching). This has around 13 million rows. In 2nd one we have mainly the document PAGE contents indexed/stored with Terms Vectors. This has around 130 million rows (since each row is a page). What we do is search on the 1st index (around 150GB) and get document ID's based on the page size (20/50/100) and then just search on these document ID's on 2nd index (but on pages - as we need to show results based on page no's) with text for highlighting as well. The 2nd index is around 700GB (which has that 450GB TVF file I was talking about) but since its only referred for small no. of documents mostly that is not an issue (in some queries that's slow too but its size is the main issue). On average more than 90% of the query time is taken by 1st index file in searching (and total count as well). The confusion that I had was on the 1st index file which didn't have Term Vectors in any of the fields in SOLR schema file but still had a TVF file. The reason in the end turned out to be Lucene indexing. Some of the initial documents were indexed through Lucene and there one of the field did had Term Vectors! Sorry for that... *Keeping in mind the above description any other ideas you would like to suggest? Thanks!!* On Sat, Feb 5, 2011 at 7:40 AM, Otis Gospodnetic wrote: > Hi Salman, > > Ah, so in the end you *did* have TV enabled on one of your fields! :) (I > think > this was a problem we were trying to solve a few weeks ago here) > > How many docs you have in the index doesn't matter here - only N > docs/fields > that you need to display on a page with N results need to be reanalyzed for > highlighting purposes, so follow Grant's advice, make a small index without > TV, > and compare highlighting speed with and without TV. > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message > > From: Salman Akram > > To: solr-user@lucene.apache.org > > Sent: Fri, February 4, 2011 8:03:06 AM > > Subject: Re: Highlighting with/without Term Vectors > > > > Basically Term Vectors are only on one main field i.e. Contents. Average > > size of each document would be few KB's but there are around 130 million > > documents so what do you suggest now? > > > > On Fri, Feb 4, 2011 at 5:24 PM, Otis Gospodnetic < > otis_gospodne...@yahoo.com > > > wrote: > > > > > Salman, > > > > > > It also depends on the size of your documents. Re-analyzing 20 fields > of > > > 500 > > > bytes each will be a lot faster than re-analyzing 20 fields with 50 KB > > > each. > > > > > > Otis > > > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > > > Lucene ecosystem search :: http://search-lucene.com/ > > > > > > > > > > > > - Original Message > > > > From: Grant Ingersoll > > > > To: solr-user@lucene.apache.org > > > > Sent: Wed, January 26, 2011 10:44:09 AM > > > > Subject: Re: Highlighting with/without Term Vectors > > > > > > > > > > > > On Jan 24, 2011, at 2:42 PM, Salman Akram wrote: > > > > > > > > > Hi, > > > > > > > > > > Does anyone have any benchmarks how much highlighting speeds up > with > > > Term > > > > > Vectors (compared to without it)? e.g. 
if highlighting on 20 > documents > > > take > > > > > 1 sec with Term Vectors any idea how long it will take without > them? > > > > > > > > > > I need to know since the index used for highlighting has a TVF > file of > > > > > around 450GB (approx 65% of total index size) so I am trying to > see > > > whether > > > > > the decreasing the index size by dropping TVF would be more > helpful > > > for > > > > > performance (less RAM, should be good for I/O too I guess) or > keeping > > > it is > > > > > still better? > > > > > > > > > > I know the best way is try it out but indexing takes a very long > time > > > so > > > > > trying to see whether its even worthy or not. > > > > > > > > > > > > Try testing on a smaller set. In general, you are saving the > process of > > > >re-analyzing the content, so, to some extent it is going to be > dependent > > > on how > > > >fast your analyzer chain is. At the size you are at, I don't know > if > > > storing > > > >TVs is worth it. > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
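A sketch of the two-step flow described above, with placeholder core, field and ID names (the real ones are not given in the thread):
1) Search core (no term vectors): q=Contents:"some phrase"&fl=DocID&rows=20
2) Highlighting core (term vectors, one row per page): q=Contents:"some phrase"&fq=DocID:(17 OR 42 OR 93)&hl=true&hl.fl=Contents
Only the page-worth of document IDs returned by the first query is ever looked up in the 700GB term-vector index.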
Re: Performance optimization of Proximity/Wildcard searches
Only couple of thousand documents are added daily so the old OS cache should still be useful since old documents remain same, right? Also can you please comment on my other thread related to Term Vectors? Thanks! On Sat, Feb 5, 2011 at 8:40 PM, Otis Gospodnetic wrote: > Yes, OS cache mostly remains (obviously index files that are no longer > around > are going to remain the OS cache for a while, but will be useless and > gradually > replaced by new index files). > How long warmup takes is not relevant here, but what queries you use to > warm up > the index and how much you auto-warm the caches. > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message > > From: Salman Akram > > To: solr-user@lucene.apache.org > > Sent: Sat, February 5, 2011 4:06:54 AM > > Subject: Re: Performance optimization of Proximity/Wildcard searches > > > > Correct me if I am wrong. > > > > Commit in index flushes SOLR cache but of course OS cache would still be > > useful? If a an index is updated every hour then a warm up that takes > less > > than 5 mins should be more than enough, right? > > > > On Sat, Feb 5, 2011 at 7:42 AM, Otis Gospodnetic < > otis_gospodne...@yahoo.com > > > wrote: > > > > > Salman, > > > > > > Warming up may be useful if your caches are getting decent hit ratios. > > > Plus, you > > > are warming up the OS cache when you warm up. > > > > > > Otis > > > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > > > Lucene ecosystem search :: http://search-lucene.com/ > > > > > > > > > > > > - Original Message > > > > From: Salman Akram > > > > To: solr-user@lucene.apache.org > > > > Sent: Fri, February 4, 2011 3:33:41 PM > > > > Subject: Re: Performance optimization of Proximity/Wildcard searches > > > > > > > > I know so we are not really using it for regular warm-ups (in any > case > > > index > > > > is updated on hourly basis). Just tried few times to compare > results. > > > The > > > > issue is I am not even sure if warming up is useful for such > regular > > > > updates. > > > > > > > > > > > > > > > > On Fri, Feb 4, 2011 at 5:16 PM, Otis Gospodnetic < > > > otis_gospodne...@yahoo.com > > > > > wrote: > > > > > > > > > Salman, > > > > > > > > > > I only skimmed your email, but wanted to say that this part > sounds a > > > little > > > > > suspicious: > > > > > > > > > > > Our warm up script currently executes all distinct queries in > our > > > logs > > > > > > having count > 5. It was run yesterday (with all the indexing > > > update > > > > > every > > > > > > > > > > It sounds like this will make warmup take a long time, > assuming > > > you > > > > > have > > > > > more than a handful distinct queries in your logs. > > > > > > > > > > Otis > > > > > > > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > > > > > Lucene ecosystem search :: http://search-lucene.com/ > > > > > > > > > > > > > > > > > > > > - Original Message > > > > > > From: Salman Akram > > > > > > To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk > > > > > > Sent: Tue, January 25, 2011 6:32:48 AM > > > > > > Subject: Re: Performance optimization of Proximity/Wildcard > searches > > > > > > > > > > > > By warmed index you only mean warming the SOLR cache or OS > cache? As > > > I > > > > > said > > > > > > our index is updated every hour so I am not sure how much SOLR > cache > > > > > would > > > > > > be helpful but OS cache should still be helpful, right? 
> > > > > > > > > > > > I haven't compared the results with a proper script but from > manual > > > > > testing > > > > > > here are some of the observations. > > > > > > > > > > > > &
Filter Query
Hi, I know Filter Query is really useful due to caching, but I am confused about how it filters results. Let's say I have the following criteria: Text: "Abc def" Date: 24th Feb, 2011 Now "abc def" might occur in almost every document, but if SOLR first filters based on date it will have to search only a few documents (instead of millions). If I put the Date parameter in fq, would it first filter on date and then do the text search, or would both of them be filtered separately and then intersected? If they are filtered separately, the issue would be that, let's say, "abc def" takes 20 secs on all documents (without any filters - due to the large # of documents) and it will still take the same time, whereas if it is done only on the few documents from that specific date it would be super fast. If fq doesn't give what I am looking for, is there any other parameter? There should be a way, as this is a very common scenario. -- Regards, Salman Akram
Re: Filter Query
Yea I had an idea about that... Now logically speaking main text search should be in the Query filter so there is no way to first filter based on meta data and then do text search on that limited data set? Thanks! On Thu, Feb 24, 2011 at 5:24 PM, Stefan Matheis < matheis.ste...@googlemail.com> wrote: > Salman, > > afaik, the Query is executed first and afterwards FilterQuery steps in > Place .. so it's only an additional Filter on your Results. > > Recommended Wiki-Pages on FilterQuery: > * http://wiki.apache.org/solr/CommonQueryParameters#fq > * http://wiki.apache.org/solr/FilterQueryGuidance > > Regards > Stefan > > On Thu, Feb 24, 2011 at 12:46 PM, Salman Akram > wrote: > > Hi, > > > > I know Filter Query is really useful due to caching but I am confused > about > > how it filter results. > > > > Lets say I have following criteria > > > > Text:: "Abc def" > > Date: 24th Feb, 2011 > > > > Now "abc def" might be coming in almost every document but if SOLR first > > filters based on date it will have to do search only on few documents > > (instead of millions) > > > > If I put Date parameter in fq would it be first filtering on date and > then > > doing text search or both of them would be filtered separately and then > > intersection? If its filtered separately the issue would be that lets say > > "abd def" takes 20 secs on all documents (without any filters - due to > large > > # of documents) and it will be still taking same time but if its done > only > > on few documents on that specific date it would be super fast. > > > > If fq doesn't give what I am looking for, is there any other parameter? > > There should be a way as this is a very common scenario. > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
Re: Filter Query
So you are agreeing that it does what I want? So in my example "Abc def" would only be searched in the 24th Feb 2011 documents? When you say 'last with the filters', does that mean it first filters with the Filter Query and then applies the Query to that set? On Thu, Feb 24, 2011 at 9:29 PM, Yonik Seeley wrote: > On Thu, Feb 24, 2011 at 6:46 AM, Salman Akram > wrote: > > Hi, > > > > I know Filter Query is really useful due to caching but I am confused > about > > how it filter results. > > > > Lets say I have following criteria > > > > Text:: "Abc def" > > Date: 24th Feb, 2011 > > > > Now "abc def" might be coming in almost every document but if SOLR first > > filters based on date it will have to do search only on few documents > > (instead of millions) > > Yes, this is the way Solr works. The filters are executed separately, > but the query is executed last with the filters (i.e. it will be > faster if the filter cuts down the number of documents). > > -Yonik > http://lucidimagination.com > -- Regards, Salman Akram
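In concrete terms, the example from this thread would look something like this (field names and the date format are illustrative, assuming a Solr date field):
q=Text:"Abc def"&fq=Date:[2011-02-24T00:00:00Z TO 2011-02-25T00:00:00Z]
The fq produces (and caches) its set of matching documents independently, and the main query is then evaluated against that set, so the expensive phrase query only has to score documents that already pass the date filter.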
CommonGrams indexing very slow!
All, We have created an index with CommonGrams and the final size is around 370GB. Everything is working fine, but now when we add more documents to the index it takes forever (almost 12 hours)... it seems to rewrite all the segment files in a commit. The same commit used to take a few mins with the normal index. Any idea what's going on? -- Regards, Salman Akram Principal Software Engineer - Tech Lead NorthBay Solutions 410-G4 Johar Town, Lahore Off: +92-42-35290152 Cell: +92-321-4391210 -- +92-300-4009941
Re: CommonGrams indexing very slow!
No, we are not. It just does this while committing. Also, before this, when we merged multiple small indexes without optimization - as was done in the past - it again took around 12 hours and made around 20 CFS files (that never happened before). On Wed, Apr 27, 2011 at 8:21 PM, Erick Erickson wrote: > Are you by any chance optimizing? > > Best > Erick > > On Wed, Apr 27, 2011 at 11:04 AM, Salman Akram > wrote: > > All, > > > > We have created index with CommonGrams and the final size is around > 370GB. > > Everything is working fine but now when we add more documents into index > it > > takes forever (almost 12 hours)...seems to change all the segments file > in a > > commit. > > > > The same commit used to take few mins with normal index. > > > > Any idea whats going on? > > > > -- > > Regards, > > > > Salman Akram > > Principal Software Engineer - Tech Lead > > NorthBay Solutions > > 410-G4 Johar Town, Lahore > > Off: +92-42-35290152 > > > > Cell: +92-321-4391210 -- +92-300-4009941 > > > -- Regards, Salman Akram
Re: CommonGrams indexing very slow!
Thanks for the response. We got it resolved! We made small indexes in bulk using SOLR with the Standard File Format and then merged them with a Lucene app, which for some reason made the result CFS. Then, when we started adding real-time documents using SOLR (with Compound File Format set to false), it was merging with every commit! We just set the CFF to true and now it's back to normal. Weird, but that's how it got resolved. BTW, any idea why this was happening, and if we now optimize it using SFF, should it be fine in future with CFF=false? P.S: Increasing the mergeFactor didn't even help. On Wed, Apr 27, 2011 at 10:09 PM, Burton-West, Tom wrote: > Hi Salman, > > Sounds like somehow you are triggering merges or optimizes. What is your > mergeFactor? > > Have you turned on the IndexWriter log? > > In solrconfig.xml > true > > In our case we feed the directory name as a Java property in our java > startup script , but you can also hard code where you want the log written > like in the current example Solr config: > > false > > That should provide some clues. For example you can see how many segments > of each level there are just before you do the commit that triggers the > problem. My first guess is that you have enough segments so that adding > the documents and committing triggers a cascading merge. (But this is a WAG > without seeing what's in your indexwriter log) > > Can you also send your solrconfig so we can see your mergeFactor and > ramBufferSizeMB settings? > > Tom > > > > All, > > > > > > We have created index with CommonGrams and the final size is around > > 370GB. > > > Everything is working fine but now when we add more documents into > index > > it > > > takes forever (almost 12 hours)...seems to change all the segments file > > in a > > > commit. > > > > > > The same commit used to take few mins with normal index. > > > > > > Any idea whats going on? > > > > > > -- > > > Regards, > > > > > > Salman Akram > > > Principal Software Engineer - Tech Lead > > > NorthBay Solutions > > > 410-G4 Johar Town, Lahore > > > Off: +92-42-35290152 > > > > > > Cell: +92-321-4391210 -- +92-300-4009941 > > > > > > > > > -- > Regards, > > Salman Akram > -- Regards, Salman Akram
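For reference, the settings involved live in the <indexDefaults>/<mainIndex> section of solrconfig.xml; a sketch with typical values (not the exact values from this setup, which are not shown in the thread):
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>128</ramBufferSizeMB>
The fix described above amounts to keeping the compound-file setting consistent between the external Lucene merge tool and the Solr instance that later commits into the same index.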
JRockit with SOLR3.4/3.5
We used JRockit with SOLR 1.4 as the default JVM had memory issues (not only was it consuming more memory, it also didn't stay within the max memory allocated to Tomcat - JRockit did stay within the max). However, JRockit gives an error when used with SOLR 3.4/3.5. Any ideas why? *** This Message Has Been Sent Using BlackBerry Internet Service from Mobilink ***
Re: JRockit with SOLR3.4/3.5
19) at org.apache.catalina.util.LifecycleBase.fireLifecycleEvent( LifecycleBase.java:90) at org.apache.catalina.util.LifecycleBase.setStateInternal( LifecycleBase.java:401) at org.apache.catalina.util.LifecycleBase.init(LifecycleBase.java:110) at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:139) at org.apache.catalina.core.ContainerBase.addChildInternal( ContainerBase.java:866) at org.apache.catalina.core.ContainerBase.addChild( ContainerBase.java:842) at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:615) at org.apache.catalina.startup.HostConfig.deployDirectory( HostConfig.java:1095) at org.apache.catalina.startup.HostConfig$DeployDirectory. run(HostConfig.java:1617) at java.util.concurrent.Executors$RunnableAdapter. call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker. runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run( ThreadPoolExecutor.java:908) ... 1 more Jun 7, 2012 4:03:04 AM org.apache.coyote.AbstractProtocol start On Mon, Jul 16, 2012 at 2:20 AM, Michael Della Bitta < michael.della.bi...@appinions.com> wrote: > Hello, Salman, > > It would probably be helpful if you included the text/stack trace of > the error you're encountering, plus any other pertinent system > information you can think of. > > One thing to remember is the memory usage you tune with Xmx is only > the maximum size of the heap, and there are other types of memory > usage by the JVM that don't fall under that (Permgen space, memory > mapped files, etc). > > Michael Della Bitta > > > Appinions, Inc. -- Where Influence Isn’t a Game. > http://www.appinions.com > > > On Sun, Jul 15, 2012 at 3:19 PM, Salman Akram > wrote: > > We used JRockit with SOLR1.4 as default JVM had mem issues (not only it > was consuming more mem but didn't restrict to the max mem allocated to > tomcat - jrockit did restrict to max mem). However, JRockit gives an error > while using it with SOLR3.4/3.5. Any ideas, why? > > > > *** This Message Has Been Sent Using BlackBerry Internet Service from > Mobilink *** > -- Regards, Salman Akram Project Manager - Intelligize NorthBay Solutions 410-G4 Johar Town, Lahore Off: +92-42-35290152 Cell: +92-302-8495621