Re: Phrase Highlighter + Surround Query Parser

2015-06-11 Thread Salman Akram
Picking up this thread again...

When you said 'stock one', did you mean the built-in Surround query parser or
the customized one? We already use usePhraseHighlighter=true.


On Mon, Aug 4, 2014 at 10:38 AM, Ahmet Arslan 
wrote:

> Hi,
>
> You are using a customized surround query parser, right?
>
> Did you check/try with the stock one? If I recall correctly
> usePhraseHighlighter=true was working in the past for surround.
>
> Ahmet
>
>
>
> On Monday, August 4, 2014 8:25 AM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
> Anyone?
>
>
> On Fri, Aug 1, 2014 at 12:31 PM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> > We are having an issue in Phrase highlighter with Surround Query Parser
> > e.g. *"first thing" w/100 "you must" *brings correct results but also
> > highlights individual words of the phrase - first, thing are highlighted
> > where they come separately as well.
> >
> > Any idea how this can be fixed?
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
>
>
>
> >
> >
>
>
> --
> Regards,
>
> Salman Akram
>
>


-- 
Regards,

Salman Akram


Re: Partial Counts in SOLR

2014-03-08 Thread Salman Akram
The issue with timeAllowed is that you never know whether it will return a
minimum number of docs or not.

I do want docs sorted by date, but it seems it's not possible for Solr to
start searching from the most recent docs and stop after finding a certain
no. of docs... any other tweak?
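
Just to illustrate what I mean (field names are ours and only illustrative),
the request is roughly:

q=Contents:(...)&sort=DocDate desc&rows=100&timeAllowed=5000

Even with timeAllowed set, Solr still has to collect every match it finds
within that time to know which 100 are the newest, so there is no way to tell
it "walk the newest docs first and stop at 100".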

Thanks


On Saturday, March 8, 2014, Chris Hostetter 
wrote:

>
> : Reason: In an index with millions of documents I don't want to know that
> a
> : certain query matched 1 million docs (of course it will take time to
> : calculate that). Why don't just stop looking for more results lets say
> : after it finds 100 docs? Possible??
>
> but if you care about sorting, ie: you want the top 100 documents sorted
> by score, or sorted by date, you still have to "collect" all 1 million
> matches in order to know what the first 100 are.
>
> if you really don't care about sorting, you can use the "timeAllowed"
> option to tell the searching method to do the best job it can in an
> (approximated) limited amount of time, and then pretend that the docs
> collected so far represent the total number of matches...
>
>
> https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-ThetimeAllowedParameter
>
>
> -Hoss
> http://www.lucidworks.com/
>


-- 
Regards,

Salman Akram
Project Manager - Intelligize
NorthBay Solutions
410-G4 Johar Town, Lahore
Off: +92-42-35290152

Cell: +92-302-8495621


Re: Partial Counts in SOLR

2014-03-11 Thread Salman Akram
It's a long video and I will definitely go through it, but it seems this is
not possible with SOLR as it is?

I just thought it would be quite a common issue; I mean, generally for
search engines it's more important to show the first-page results, rather
than using timeAllowed, which might not even return a single result.

Thanks!


-- 
Regards,

Salman Akram


Re: Partial Counts in SOLR

2014-03-12 Thread Salman Akram
Well some of the searches take minutes.

Below are some stats about this particular index that I am talking about:

Index size = 400GB (Using CommonGrams so without that the index is around
180GB)
Position File = 280GB
Total Docs = 170 million (just indexed for searching - for highlighting
contents are stored in another index)
Avg Doc Size = Few hundred KBs
RAM = 384GB (it has other indexes too but still OS cache can have 60-80% of
the total index cached)

Phrase queries run pretty fast with CG, but complex variants with wildcards and
proximity can be really slow. I know they will still be slow even with CG, but
they just take too long. By default sorting is on date, but users have a few
other parameters they can sort on too.
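
To illustrate the CG behaviour (token output below is only illustrative, and
assumes "you", "must", "only" and "be" are all in our common-words list):

Index time (CommonGramsFilter): you, you_must, must, must_only, only, only_be, be
Phrase query "only be" (CommonGramsQueryFilter): only_be -> a single term lookup
Wildcard/proximity, e.g. buy* w/10 director: the wildcard expansions cannot be
pre-combined into grams, so positions still have to be read for every expanded
term.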

I wanted to avoid creating multiple indexes (maybe split by year), but it seems
that's the only feasible way to search on partial data.




On Wed, Mar 12, 2014 at 2:47 PM, Dmitry Kan  wrote:

> As Hoss pointed out above, different projects have different requirements.
> Some want to sort by date of ingestion reverse, which means that having
> posting lists organized in a reverse order with the early termination is
> the way to go (no such feature in Solr directly). Some other projects want
> to collect all docs matching a query, and then sort by rank, but you cannot
> guarantee, that the most recently inserted document is the most relevant in
> terms of your ranking.
>
>
> Do your current searches take too long?
>
>
> On Tue, Mar 11, 2014 at 11:51 AM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> > Its a long video and I will definitely go through it but it seems this is
> > not possible with SOLR as it is?
> >
> > I just thought it would be quite a common issue; I mean generally for
> > search engines its more important to show the first page results, rather
> > than using timeAllowed which might not even return a single result.
> >
> > Thanks!
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>
>
>
> --
> Dmitry
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
>



-- 
Regards,

Salman Akram


Re: Partial Counts in SOLR

2014-03-13 Thread Salman Akram
1- SOLR 4.6
2- We do, but right now I am talking about plain keyword queries just sorted
by date. Once this is better, we will start looking into the caches, which we
have already changed a little.
3- As I said, the contents are not stored in this index. Some other metadata
fields are, but with normal queries it's super fast, so I guess even if I
change that it will make only a minor difference. We have SSDs and they are
quite fast too.
4- That's something we need to do, but even under low workload those queries
take a lot of time.
5- Every 10 mins, and currently no auto-warming as user queries are rarely
the same; also, once it's fully warmed those queries are still slow.
6- Nope.

On Thu, Mar 13, 2014 at 5:38 PM, Dmitry Kan  wrote:

> 1. What is your solr version? In 4.x family the proximity searches have
> been optimized among other query types.
> 2. Do you use the filter queries? What is the situation with the cache
> utilization ratios? Optimize (= i.e. bump up the respective cache sizes) if
> you have low hitratios and many evictions.
> 3. Can you avoid storing some fields and only index them? When the field is
> stored and it is retrieved in the result, there are couple of disk seeks
> per field=> search slows down. Consider SSD disks.
> 4. Do you monitor your system in terms of RAM / cache stats / GC? Do you
> observe STW GC pauses?
> 5. How often do you commit & do you have the autowarming / external warming
> configured?
> 6. If you use faceting, consider storing DocValues for facet fields.
>
> some solr wiki docs:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems?highlight=%28%28SolrPerformanceFactors%29%29
>
>
>
>
>
> On Thu, Mar 13, 2014 at 8:52 AM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> > Well some of the searches take minutes.
> >
> > Below are some stats about this particular index that I am talking about:
> >
> > Index size = 400GB (Using CommonGrams so without that the index is around
> > 180GB)
> > Position File = 280GB
> > Total Docs = 170 million (just indexed for searching - for highlighting
> > contents are stored in another index)
> > Avg Doc Size = Few hundred KBs
> > RAM = 384GB (it has other indexes too but still OS cache can have 60-80%
> of
> > the total index cached)
> >
> > Phrase queries run pretty fast with CG but complex versions of wildcard
> and
> > proximity queries can be really slow. I know using CG will make them slow
> > but they just take too long. By default sorting is on date but users have
> > few other parameters too on which they can sort.
> >
> > I wanted to avoid creating multiple indexes (maybe based on years) but
> > seems that to search on partial data that's the only feasible way.
> >
> >
> >
> >
> > On Wed, Mar 12, 2014 at 2:47 PM, Dmitry Kan 
> wrote:
> >
> > > As Hoss pointed out above, different projects have different
> > requirements.
> > > Some want to sort by date of ingestion reverse, which means that having
> > > posting lists organized in a reverse order with the early termination
> is
> > > the way to go (no such feature in Solr directly). Some other projects
> > want
> > > to collect all docs matching a query, and then sort by rank, but you
> > cannot
> > > guarantee, that the most recently inserted document is the most
> relevant
> > in
> > > terms of your ranking.
> > >
> > >
> > > Do your current searches take too long?
> > >
> > >
> > > On Tue, Mar 11, 2014 at 11:51 AM, Salman Akram <
> > > salman.ak...@northbaysolutions.net> wrote:
> > >
> > > > Its a long video and I will definitely go through it but it seems
> this
> > is
> > > > not possible with SOLR as it is?
> > > >
> > > > I just thought it would be quite a common issue; I mean generally for
> > > > search engines its more important to show the first page results,
> > rather
> > > > than using timeAllowed which might not even return a single result.
> > > >
> > > > Thanks!
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Salman Akram
> > > >
> > >
> > >
> > >
> > > --
> > > Dmitry
> > > Blog: http://dmitrykan.blogspot.com
> > > Twitter: http://twitter.com/dmitrykan
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>
>
>
> --
> Dmitry
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
>



-- 
Regards,

Salman Akram


Re: Partial Counts in SOLR

2014-03-17 Thread Salman Akram
Below is a sample of one of the slow queries that takes minutes!

((stock or share*) w/10 (sale or sell* or sold or bought or buy* or
purchase* or repurchase*)) w/10 (executive or director)

If a filter is used it goes in fq, but what can be done about the plain keyword
search?
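
For example (field names and values are illustrative), the parts users pick as
filters go into fq while the keyword/proximity part stays in q:

q={!surround}Contents:(... the proximity query above ...)
fq=DocDate:[2013-01-01T00:00:00Z TO *]
fq=DocType:agreement
sort=DocDate desc

The fq clauses are reusable via the filterCache, but the expensive
proximity/wildcard part still has to run in q every time.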


On Sun, Mar 16, 2014 at 4:37 AM, Erick Erickson wrote:

> What are our complex queries? You
> say that your app will very rarely see the
> same query thus you aren't using caches...
> But, if you can move some of your
> clauses to fq clauses, then the filterCache
> might well be used to good effect.
>
>
>
> On Thu, Mar 13, 2014 at 7:22 AM, Salman Akram
>  wrote:
> > 1- SOLR 4.6
> > 2- We do but right now I am talking about plain keyword queries just
> sorted
> > by date. Once this is better will start looking into caches which we
> > already changed a little.
> > 3- As I said the contents are not stored in this index. Some other
> metadata
> > fields are but with normal queries its super fast so I guess even if I
> > change there it will be a minor difference. We have SSD and quite fast
> too.
> > 4- That's something we need to do but even in low workload those queries
> > take a lot of time
> > 5- Every 10 mins and currently no auto warming as user queries are rarely
> > same and also once its fully warmed those queries are still slow.
> > 6- Nops.
> >
> > On Thu, Mar 13, 2014 at 5:38 PM, Dmitry Kan 
> wrote:
> >
> >> 1. What is your solr version? In 4.x family the proximity searches have
> >> been optimized among other query types.
> >> 2. Do you use the filter queries? What is the situation with the cache
> >> utilization ratios? Optimize (= i.e. bump up the respective cache
> sizes) if
> >> you have low hitratios and many evictions.
> >> 3. Can you avoid storing some fields and only index them? When the
> field is
> >> stored and it is retrieved in the result, there are couple of disk seeks
> >> per field=> search slows down. Consider SSD disks.
> >> 4. Do you monitor your system in terms of RAM / cache stats / GC? Do you
> >> observe STW GC pauses?
> >> 5. How often do you commit & do you have the autowarming / external
> warming
> >> configured?
> >> 6. If you use faceting, consider storing DocValues for facet fields.
> >>
> >> some solr wiki docs:
> >>
> >>
> https://wiki.apache.org/solr/SolrPerformanceProblems?highlight=%28%28SolrPerformanceFactors%29%29
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Mar 13, 2014 at 8:52 AM, Salman Akram <
> >> salman.ak...@northbaysolutions.net> wrote:
> >>
> >> > Well some of the searches take minutes.
> >> >
> >> > Below are some stats about this particular index that I am talking
> about:
> >> >
> >> > Index size = 400GB (Using CommonGrams so without that the index is
> around
> >> > 180GB)
> >> > Position File = 280GB
> >> > Total Docs = 170 million (just indexed for searching - for
> highlighting
> >> > contents are stored in another index)
> >> > Avg Doc Size = Few hundred KBs
> >> > RAM = 384GB (it has other indexes too but still OS cache can have
> 60-80%
> >> of
> >> > the total index cached)
> >> >
> >> > Phrase queries run pretty fast with CG but complex versions of
> wildcard
> >> and
> >> > proximity queries can be really slow. I know using CG will make them
> slow
> >> > but they just take too long. By default sorting is on date but users
> have
> >> > few other parameters too on which they can sort.
> >> >
> >> > I wanted to avoid creating multiple indexes (maybe based on years) but
> >> > seems that to search on partial data that's the only feasible way.
> >> >
> >> >
> >> >
> >> >
> >> > On Wed, Mar 12, 2014 at 2:47 PM, Dmitry Kan 
> >> wrote:
> >> >
> >> > > As Hoss pointed out above, different projects have different
> >> > requirements.
> >> > > Some want to sort by date of ingestion reverse, which means that
> having
> >> > > posting lists organized in a reverse order with the early
> termination
> >> is
> >> > > the way to go (no such feature in Solr directly). Some other
> projects
> >> > want
> >> > > to collect all docs matching a query, and then sort by rank, but y

Best SSD block size for large SOLR indexes

2014-03-18 Thread Salman Akram
All,

Is there a rule of thumb for ideal block size for SSDs for large indexes
(in hundreds of GBs)? Read performance is of top importance for us and we
can sacrifice the space a little...

This is the one we just got and wanted to see if there are any test results
out there
http://www.storagereview.com/micron_p420m_enterprise_pcie_ssd_review

-- 
Regards,

Salman Akram


Re: Best SSD block size for large SOLR indexes

2014-03-18 Thread Salman Akram
This SSD's default size seems to be 4K, not 16K (as can be seen below).

Bytes Per Sector  :   512
Bytes Per Physical Sector :   4096
Bytes Per Cluster :   4096
Bytes Per FileRecord Segment: 1024
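
(The above is the NTFS info for the drive, i.e. what "fsutil fsinfo ntfsinfo
<drive>:" reports.)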

I will go through the articles you sent. Thanks


On Tue, Mar 18, 2014 at 6:31 PM, Shawn Heisey  wrote:

> On 3/18/2014 7:12 AM, Salman Akram wrote:
> > Is there a rule of thumb for ideal block size for SSDs for large indexes
> > (in hundreds of GBs)? Read performance is of top importance for us and we
> > can sacrifice the space a little...
> >
> > This is the one we just got and wanted to see if there are any test
> results
> > out there
> > http://www.storagereview.com/micron_p420m_enterprise_pcie_ssd_review
>
> The best filesystem block size to use for SSDs is dictated more by the
> characteristics of the SSD itself than what data you put on it.
>
> Here's an awesome series of articles about SSDs that I heard about from
> Shalin Shekhar Mangar:
>
>
> http://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/
>
> With the page size of most large SSDs at 16KB, you might want to go with
> a multiple of that, like 64KB, and learn about the proper use of parted
> to align partition boundaries.
>
> As for whether there are Solr settings that can improve the I/O
> characteristics when reading/writing, that I do not know.
>
> Thanks,
> Shawn
>
>


-- 
Regards,

Salman Akram


Re: Best SSD block size for large SOLR indexes

2014-03-18 Thread Salman Akram
Thanks for the info. The articles were really useful, but it still seems I have
to do my own testing to find the right page size? I thought for large indexes
some tests would already have been done in the SOLR community.

Side note: We are heavily using Microsoft technology (.NET etc.) for
development, so after looking at all the pros/cons we decided to stick with
Windows. It wasn't rude ;)


On Tue, Mar 18, 2014 at 7:22 PM, Shawn Heisey  wrote:

> On 3/18/2014 7:39 AM, Salman Akram wrote:
> > This SSD default size seems to be 4K not 16K (as can be seen below).
> >
> > Bytes Per Sector  :   512
> > Bytes Per Physical Sector :   4096
> > Bytes Per Cluster :   4096
> > Bytes Per FileRecord Segment: 1024
>
> The *sector* size on a typical SSD is 4KB, but the *page* size is a
> lower level detail, and is more likely to be 16KB, especially on a very
> large SSD.
>
> The Micron P420m is actually mentioned specifically in the SSD article I
> linked, and a table in part 2 states that its page size is 16KB, with a
> block size of 8MB.
>
> Possibly rude side note: Windows? Really?
>
> Thanks,
> Shawn
>
>


-- 
Regards,

Salman Akram


Re: Best SSD block size for large SOLR indexes

2014-03-18 Thread Salman Akram
We do have a couple of commodity SSDs already and they perform well. However,
our user queries are very complex and quite a few of them take over a minute,
so we really had to do something about it.

Comparing this beast vs. putting the whole index in RAM, the beast still seemed
the better option. Also, we are already using some top-notch servers.


On Wed, Mar 19, 2014 at 1:52 AM, Toke Eskildsen wrote:

> Salman Akram [salman.ak...@northbaysolutions.net] wrote:
>
> [Hundreds of GB index]
>
> > http://www.storagereview.com/micron_p420m_enterprise_pcie_ssd_review
>
> May I ask why you have chosen a drive with such a high speed and matching
> cost?
>
> We have some years of experience with using SSDs for search at work and it
> is our experience that commodity SSDs performs very well (one test showed
> something like 80% of RAM speed, YMMW). It seems to me that more servers
> with commodity SSDs could very well be cheaper and give better throughput
> than the beast(s) you're using. Are you trying to minimize latency "at all
> cost"?
>
> Regards,
> Toke Eskildsen




-- 
Regards,

Salman Akram


Re: Partial Counts in SOLR

2014-03-18 Thread Salman Akram
Anyone?


On Mon, Mar 17, 2014 at 12:03 PM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> Below is one of the sample slow query that takes mins!
>
> ((stock or share*) w/10 (sale or sell* or sold or bought or buy* or
> purchase* or repurchase*)) w/10 (executive or director)
>
> If a filter is used it comes in fq but what can be done about plain
> keyword search?
>
>
> On Sun, Mar 16, 2014 at 4:37 AM, Erick Erickson 
> wrote:
>
>> What are our complex queries? You
>> say that your app will very rarely see the
>> same query thus you aren't using caches...
>> But, if you can move some of your
>> clauses to fq clauses, then the filterCache
>> might well be used to good effect.
>>
>>
>>
>> On Thu, Mar 13, 2014 at 7:22 AM, Salman Akram
>>  wrote:
>> > 1- SOLR 4.6
>> > 2- We do but right now I am talking about plain keyword queries just
>> sorted
>> > by date. Once this is better will start looking into caches which we
>> > already changed a little.
>> > 3- As I said the contents are not stored in this index. Some other
>> metadata
>> > fields are but with normal queries its super fast so I guess even if I
>> > change there it will be a minor difference. We have SSD and quite fast
>> too.
>> > 4- That's something we need to do but even in low workload those queries
>> > take a lot of time
>> > 5- Every 10 mins and currently no auto warming as user queries are
>> rarely
>> > same and also once its fully warmed those queries are still slow.
>> > 6- Nops.
>> >
>> > On Thu, Mar 13, 2014 at 5:38 PM, Dmitry Kan 
>> wrote:
>> >
>> >> 1. What is your solr version? In 4.x family the proximity searches have
>> >> been optimized among other query types.
>> >> 2. Do you use the filter queries? What is the situation with the cache
>> >> utilization ratios? Optimize (= i.e. bump up the respective cache
>> sizes) if
>> >> you have low hitratios and many evictions.
>> >> 3. Can you avoid storing some fields and only index them? When the
>> field is
>> >> stored and it is retrieved in the result, there are couple of disk
>> seeks
>> >> per field=> search slows down. Consider SSD disks.
>> >> 4. Do you monitor your system in terms of RAM / cache stats / GC? Do
>> you
>> >> observe STW GC pauses?
>> >> 5. How often do you commit & do you have the autowarming / external
>> warming
>> >> configured?
>> >> 6. If you use faceting, consider storing DocValues for facet fields.
>> >>
>> >> some solr wiki docs:
>> >>
>> >>
>> https://wiki.apache.org/solr/SolrPerformanceProblems?highlight=%28%28SolrPerformanceFactors%29%29
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Thu, Mar 13, 2014 at 8:52 AM, Salman Akram <
>> >> salman.ak...@northbaysolutions.net> wrote:
>> >>
>> >> > Well some of the searches take minutes.
>> >> >
>> >> > Below are some stats about this particular index that I am talking
>> about:
>> >> >
>> >> > Index size = 400GB (Using CommonGrams so without that the index is
>> around
>> >> > 180GB)
>> >> > Position File = 280GB
>> >> > Total Docs = 170 million (just indexed for searching - for
>> highlighting
>> >> > contents are stored in another index)
>> >> > Avg Doc Size = Few hundred KBs
>> >> > RAM = 384GB (it has other indexes too but still OS cache can have
>> 60-80%
>> >> of
>> >> > the total index cached)
>> >> >
>> >> > Phrase queries run pretty fast with CG but complex versions of
>> wildcard
>> >> and
>> >> > proximity queries can be really slow. I know using CG will make them
>> slow
>> >> > but they just take too long. By default sorting is on date but users
>> have
>> >> > few other parameters too on which they can sort.
>> >> >
>> >> > I wanted to avoid creating multiple indexes (maybe based on years)
>> but
>> >> > seems that to search on partial data that's the only feasible way.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Mar 12, 2014 at 2:47 PM, Dmitry Kan 
>> >> wrote:
>> >> >
> >> >

Re: Partial Counts in SOLR

2014-03-19 Thread Salman Akram
This was one example. Users can even add phrase searches with
wildcards/proximity etc., so we can't really use stemming.

Sharding is definitely something we are already looking into.


On Wed, Mar 19, 2014 at 6:59 PM, Erick Erickson wrote:

> Yes, that'll be slow. Wildcards are, at best, interesting and at worst
> resource consumptive. Especially when you're doing this kind of
> positioning information as well.
>
> Consider looking at the problem sideways. That is, what is your
> purpose in searching for, say, "buy*"? You want to find buy, buying,
> buyers, etc? Would you get bette results if you just stemmed and
> omitted the wildcards?
>
> Do you have a restricted vocabulary that would allow you to define
> synonyms for the "important" words and all their variants at index
> time and use that?
>
> Finally, of course, you could shard your index (or add more shards if
> you're already sharding) if you really _must_ support these kinds of
> queries and can't work around the problem.
>
> Best,
> Erick
>
> On Tue, Mar 18, 2014 at 11:21 PM, Salman Akram
>  wrote:
> > Anyone?
> >
> >
> > On Mon, Mar 17, 2014 at 12:03 PM, Salman Akram <
> > salman.ak...@northbaysolutions.net> wrote:
> >
> >> Below is one of the sample slow query that takes mins!
> >>
> >> ((stock or share*) w/10 (sale or sell* or sold or bought or buy* or
> >> purchase* or repurchase*)) w/10 (executive or director)
> >>
> >> If a filter is used it comes in fq but what can be done about plain
> >> keyword search?
> >>
> >>
> >> On Sun, Mar 16, 2014 at 4:37 AM, Erick Erickson <
> erickerick...@gmail.com>wrote:
> >>
> >>> What are our complex queries? You
> >>> say that your app will very rarely see the
> >>> same query thus you aren't using caches...
> >>> But, if you can move some of your
> >>> clauses to fq clauses, then the filterCache
> >>> might well be used to good effect.
> >>>
> >>>
> >>>
> >>> On Thu, Mar 13, 2014 at 7:22 AM, Salman Akram
> >>>  wrote:
> >>> > 1- SOLR 4.6
> >>> > 2- We do but right now I am talking about plain keyword queries just
> >>> sorted
> >>> > by date. Once this is better will start looking into caches which we
> >>> > already changed a little.
> >>> > 3- As I said the contents are not stored in this index. Some other
> >>> metadata
> >>> > fields are but with normal queries its super fast so I guess even if
> I
> >>> > change there it will be a minor difference. We have SSD and quite
> fast
> >>> too.
> >>> > 4- That's something we need to do but even in low workload those
> queries
> >>> > take a lot of time
> >>> > 5- Every 10 mins and currently no auto warming as user queries are
> >>> rarely
> >>> > same and also once its fully warmed those queries are still slow.
> >>> > 6- Nops.
> >>> >
> >>> > On Thu, Mar 13, 2014 at 5:38 PM, Dmitry Kan 
> >>> wrote:
> >>> >
> >>> >> 1. What is your solr version? In 4.x family the proximity searches
> have
> >>> >> been optimized among other query types.
> >>> >> 2. Do you use the filter queries? What is the situation with the
> cache
> >>> >> utilization ratios? Optimize (= i.e. bump up the respective cache
> >>> sizes) if
> >>> >> you have low hitratios and many evictions.
> >>> >> 3. Can you avoid storing some fields and only index them? When the
> >>> field is
> >>> >> stored and it is retrieved in the result, there are couple of disk
> >>> seeks
> >>> >> per field=> search slows down. Consider SSD disks.
> >>> >> 4. Do you monitor your system in terms of RAM / cache stats / GC? Do
> >>> you
> >>> >> observe STW GC pauses?
> >>> >> 5. How often do you commit & do you have the autowarming / external
> >>> warming
> >>> >> configured?
> >>> >> 6. If you use faceting, consider storing DocValues for facet fields.
> >>> >>
> >>> >> some solr wiki docs:
> >>> >>
> >>> >>
> >>>
> https://wiki.apache.org/solr/SolrPerformanceProblems?highlight=%28%28SolrPerformanceFactors%29%29
> >>> >>
> 

Re: w/10 ? [was: Partial Counts in SOLR]

2014-03-19 Thread Salman Akram
Yup!


On Thu, Mar 20, 2014 at 5:13 AM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi,
>
> Guessing it's surround query parser's support for "within" backed by span
> queries.
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On Mar 19, 2014 4:44 PM, "T. Kuro Kurosaka"  wrote:
>
> > In the thread "Partial Counts in SOLR", Salman gave us this sample query:
> >
> >  ((stock or share*) w/10 (sale or sell* or sold or bought or buy* or
> >> purchase* or repurchase*)) w/10 (executive or director)
> >>
> >
> > I'm not familiar with this w/10 notation. What does this mean,
> > and what parser(s) supports this syntax?
> >
> > Kuro
> >
> >
>



-- 
Regards,

Salman Akram


Re: Best SSD block size for large SOLR indexes

2014-03-21 Thread Salman Akram
For now I am going with 64KB and the results seem good. Thanks for the useful
feedback.
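
(For anyone else trying this on Windows/NTFS: the cluster size is set when
formatting the volume, e.g. "format E: /FS:NTFS /A:64K" - the drive letter is
just illustrative.)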


On Wed, Mar 19, 2014 at 9:30 PM, Shawn Heisey  wrote:

> On 3/19/2014 12:09 AM, Salman Akram wrote:
>
>> Thanks for the info. The articles were really useful but still seems I
>> have
>> to do my own testing to find the right page size? I thought for large
>> indexes there would already be some tests done in SOLR community.
>>
>> Side note: We are heavily using Microsoft technology (.NET etc) for
>> development so by looking at all the pros/cons decided to stick with
>> Windows. Wasn't rude ;)
>>
>
> Assuming you are only going to be putting Solr data on it, or anything
> else you put on it will also consist of large files, I would probably go
> with a cluster size at least 64KB for an NTFS volume, and I might consider
> 128KB or 256KB.  There *ARE* a few small files in a Solr index, but not
> enough of them for the wasted space to become a problem.
>
> The easiest way to configure Solr to use a different location than the
> program directory is to change the solr home.
>
> Thanks,
> Shawn
>
>


-- 
Regards,

Salman Akram


Re: w/10 ? [was: Partial Counts in SOLR]

2014-03-24 Thread Salman Akram
Basically we just created this syntax for the users' convenience; on the
back end it uses the W or N operators.
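
For example (this mapping is only illustrative of what our front end does, not
an official surround syntax), a user query like

(stock or share*) w/10 (executive or director)

gets rewritten to roughly the surround prefix form

10N(OR(stock, share*), OR(executive, director))

with N for unordered and W for ordered proximity.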


On Tue, Mar 25, 2014 at 4:21 AM, Ahmet Arslan  wrote:

> Hi,
>
> There is no w/ syntax in surround.
> /* Query language operators: OR, AND, NOT, W, N, (, ), ^, *, ?, " and
> comma */
>
> Ahmet
>
>
>
> On Monday, March 24, 2014 9:46 PM, T. Kuro Kurosaka 
> wrote:
> On 3/19/14 5:13 PM, Otis Gospodnetic wrote:> Hi,
> >
> > Guessing it's surround query parser's support for "within" backed by span
> > queries.
> >
> > Otis
>
> You mean this?
> http://wiki.apache.org/solr/SurroundQueryParser
>
> I guess this parser needs improvement in documentation area.
> It doesn't explain or have an example of the w/ syntax at all.
> (Is this the infix notation of W?)
> An example would help explaining difference between W and N;
> some readers may not understand what "ordered" and "unordered"
> in this context mean.
>
>
> Kuro
>



-- 
Regards,

Salman Akram


More Robust Search Timeouts (to Kill Zombie Queries)?

2014-03-26 Thread Salman Akram
With reference to this thread
<http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E>,
I wanted to know if there was any response to it, or if Chris Harris himself
can comment on what he ended up doing. That would be great!


-- 
Regards,

Salman Akram


Re: More Robust Search Timeouts (to Kill Zombie Queries)?

2014-03-31 Thread Salman Akram
Anyone?


On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> With reference to this 
> thread<http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E>I
>  wanted to know if there was any response to that or if Chris Harris
> himself can comment on what he ended up doing, that would be great!
>
>
> --
> Regards,
>
> Salman Akram
>
>


-- 
Regards,

Salman Akram


Re: More Robust Search Timeouts (to Kill Zombie Queries)?

2014-04-01 Thread Salman Akram
So you too never got any response...


On Mon, Mar 31, 2014 at 6:57 PM, Luis Lebolo  wrote:

> Hi Salman,
>
> I was interested in something similar, take a look at the following thread:
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201401.mbox/%3CCADSoL-i04aYrsOo2%3DGcaFqsQ3mViF%2Bhn24ArDtT%3D7kpALtVHzA%40mail.gmail.com%3E#archives
>
> I never followed through, however.
>
> -Luis
>
>
> On Mon, Mar 31, 2014 at 6:24 AM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> > Anyone?
> >
> >
> > On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram <
> > salman.ak...@northbaysolutions.net> wrote:
> >
> > > With reference to this thread<
> >
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E
> >I
> > wanted to know if there was any response to that or if Chris Harris
> > > himself can comment on what he ended up doing, that would be great!
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Salman Akram
> > >
> > >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Re: More Robust Search Timeouts (to Kill Zombie Queries)?

2014-04-15 Thread Salman Akram
Looking at this, sharding seems to be the best and simplest option to handle
such queries.
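
For the record, my rough understanding of the TermsEnum-wrapping idea is
something like the sketch below (the class name is mine and only illustrative,
assuming the Lucene 4.x FilteredTermsEnum API; it would still need to be
plugged in via something like overriding SolrQueryParserBase.newWildcardQuery,
which is more work than we want right now):

import java.io.IOException;
import org.apache.lucene.index.FilteredTermsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Sketch only: stops a MultiTermQuery expansion once a deadline passes, so the
// rewrite is built from whatever terms were accepted up to that point.
public class TimeLimitedTermsEnum extends FilteredTermsEnum {
  private final long deadlineMillis;

  public TimeLimitedTermsEnum(TermsEnum wrapped, long timeAllowedMillis) {
    super(wrapped, false); // false = just iterate the wrapped enum, no initial seek
    this.deadlineMillis = System.currentTimeMillis() + timeAllowedMillis;
  }

  @Override
  protected AcceptStatus accept(BytesRef term) throws IOException {
    if (System.currentTimeMillis() > deadlineMillis) {
      return AcceptStatus.END; // stop enumerating further terms
    }
    return AcceptStatus.YES;
  }
}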


On Wed, Apr 2, 2014 at 1:26 AM, Mikhail Khludnev  wrote:

> Hello Salman,
> Let's me drop few thoughts on
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E
>
> There two aspects of this question:
> 1. dealing with long running processing (thread divergence actions
> http://docs.oracle.com/javase/specs/jls/se5.0/html/memory.html#65310) and
> 2. an actual time checking.
> "terminating" or "aborting" thread (2.) are just a way to tracking time
> externally, and send interrupt() which the thread should react on, which
> they don't do now, and we returning to the core issue (1.)
>
> Solr's time allowed is to the proper way to handle this things, the only
> problem is that expect that the only core search is long running, but in
> your case rewriting MultiTermQuery-s takes a huge time.
> Let's consider this problem. First of all MultiTermQuery.rewrite() is the
> nearly design issue, after heavy rewrite occurs, it's thrown away, after
> search is done. I think the most straightforward way is to address this
> issue by caching these expensive queries. Solr does it well
> http://wiki.apache.org/solr/CommonQueryParameters#fq However, only for
> http://en.wikipedia.org/wiki/Conjunctive_normal_form like queries, there
> is
> a workaround allows to cache disjunction legs see
> http://blog.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html
> If you still want to run expensively rewritten queries you need to
> implement timeout check (similar to TimeLimitingCollector) for TermsEnum
> returned from MultiTermQuery.getTermsEnum(), wrapping an actual TermsEnums
> is the good way, to apply queries injecting time limiting wrapper
> TermsEnum, you might consider override methods like
> SolrQueryParserBase.newWildcardQuery(Term) or post process the query three
> after parsing.
>
>
>
> On Mon, Mar 31, 2014 at 2:24 PM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> > Anyone?
> >
> >
> > On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram <
> > salman.ak...@northbaysolutions.net> wrote:
> >
> > > With reference to this thread<
> >
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E
> >I
> > wanted to know if there was any response to that or if Chris Harris
> > > himself can comment on what he ended up doing, that would be great!
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Salman Akram
> > >
> > >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>
>



-- 
Regards,

Salman Akram


Re: timeAllowed in not honoring

2014-04-29 Thread Salman Akram
I had this issue too. timeAllowed only applies to a certain phase of the
query; I think that's the 'process' part. However, if the query is spending
its time in the 'prepare' phase (e.g., I think for wildcards, expanding all
the possible terms before running the query), it won't have any impact there.
You can debug your query and confirm that.
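
For example, with debug=timing (or debugQuery=true) the response gets a timing
section split into prepare and process; the numbers below are only
illustrative:

"timing": {
  "prepare": { "time": 21500.0, "query": { "time": 21480.0 }, ... },
  "process": { "time": 600.0, ... } }

If almost all of the time shows up under prepare, then as far as I know
timeAllowed will not help, since it only kicks in during collection.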


On Wed, Apr 30, 2014 at 10:43 AM, Aman Tandon wrote:

> Shawn this is the first time i raised this problem.
>
> My heap size is 14GB and  i am not using solr cloud currently, 40GB index
> is replicated from master to two slaves.
>
> I read somewhere that it return the partial results which is computed by
> the query in that specified amount of time which is defined by this
> timeAllowed parameter, but it doesn't seems to happen.
>
> Here is the link :
> http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed
>
>  *The time allowed for a search to finish. This value only applies to the
> search and not to requests in general. Time is in milliseconds. Values <= 0
> mean no time restriction. Partial results may be returned (if there are
> any). *
>
>
>
> With Regards
> Aman Tandon
>
>
> On Wed, Apr 30, 2014 at 10:05 AM, Shawn Heisey  wrote:
>
> > On 4/29/2014 10:05 PM, Aman Tandon wrote:
> > > I am using solr 4.2 with the index size of 40GB, while querying to my
> > index
> > > there are some queries which is taking the significant amount of time
> of
> > > about 22 seconds *in the case of minmatch of 50%*. So i added a
> parameter
> > > timeAllowed = 2000 in my query but this doesn't seems to be work.
> Please
> > > help me out.
> >
> > I remember reading that timeAllowed has some limitations about which
> > stages of a query it can limit, particularly in the distributed case.
> > These limitations mean that it cannot always limit the total time for a
> > query.  I do not remember precisely what those limitations are, and I
> > cannot find whatever it was that I was reading.
> >
> > When I looked through my local list archive to see if you had ever
> > mentioned how much RAM you have and what the size of your Solr heap is,
> > there didn't seem to be anything.  There's not enough information for me
> > to know whether that 40GB is the amount of index data on a single
> > SolrCloud server, or whether it's the total size of the index across all
> > servers.
> >
> > If we leave timeAllowed alone for a moment and treat this purely as a
> > performance problem, usually my questions revolve around figuring out
> > whether you have enough RAM.  Here's where that conversation ends up:
> >
> > http://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > I think I've probably mentioned this to you before on another thread.
> >
> > Thanks,
> > Shawn
> >
> >
>



-- 
Regards,

Salman Akram


Re: Solr vs ElasticSearch

2014-07-31 Thread Salman Akram
This is quite an old discussion. I wanted to check whether there are any newer
comparisons since SOLR 4, especially with regard to
performance/scalability/throughput?


On Tue, Jul 26, 2011 at 7:33 PM, Peter  wrote:

> Have a look:
>
>
> http://stackoverflow.com/questions/2271600/elasticsearch-sphinx-lucene-solr-xapian-which-fits-for-which-usage
>
> http://karussell.wordpress.com/2011/05/12/elasticsearch-vs-solr-lucene/
>
> Regards,
> Peter.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-vs-ElasticSearch-tp3009181p3200492.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Regards,

Salman Akram


Re: Solr vs ElasticSearch

2014-07-31 Thread Salman Akram
I did see that earlier. My main concern is search
performance/scalability/throughput, which unfortunately that article didn't
address. Any benchmarks or comments about that?

We are already using SOLR, but there has been a push to evaluate Elasticsearch.
All the benchmarks I have seen are at least a few years old.


On Fri, Aug 1, 2014 at 4:59 AM, Otis Gospodnetic  wrote:

> Not super fresh, but more recent than the 2 links you sent:
> http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Thu, Jul 31, 2014 at 10:33 PM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> > This is quite an old discussion. Wanted to check any new comparisons
> after
> > SOLR 4 especially with regards to performance/scalability/throughput?
> >
> >
> > On Tue, Jul 26, 2011 at 7:33 PM, Peter  wrote:
> >
> > > Have a look:
> > >
> > >
> > >
> >
> http://stackoverflow.com/questions/2271600/elasticsearch-sphinx-lucene-solr-xapian-which-fits-for-which-usage
> > >
> > >
> http://karussell.wordpress.com/2011/05/12/elasticsearch-vs-solr-lucene/
> > >
> > > Regards,
> > > Peter.
> > >
> > > --
> > > View this message in context:
> > >
> >
> http://lucene.472066.n3.nabble.com/Solr-vs-ElasticSearch-tp3009181p3200492.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Phrase Highlighter + Surround Query Parser

2014-08-01 Thread Salman Akram
We are having an issue with the phrase highlighter and the Surround Query
Parser: e.g. *"first thing" w/100 "you must"* brings correct results, but it
also highlights the individual words of the phrases; "first" and "thing" are
highlighted where they appear separately as well.

Any idea how this can be fixed?


-- 
Regards,

Salman Akram


Re: Phrase Highlighter + Surround Query Parser

2014-08-03 Thread Salman Akram
Anyone?


On Fri, Aug 1, 2014 at 12:31 PM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> We are having an issue in Phrase highlighter with Surround Query Parser
> e.g. *"first thing" w/100 "you must" *brings correct results but also
> highlights individual words of the phrase - first, thing are highlighted
> where they come separately as well.
>
> Any idea how this can be fixed?
>
>
> --
> Regards,
>
> Salman Akram
>
>


-- 
Regards,

Salman Akram


Re: Solr vs ElasticSearch

2014-08-06 Thread Salman Akram
Thanks everyone!! This has been a really helpful discussion and, in short,
based on it we have decided to stick with SOLR.


On Mon, Aug 4, 2014 at 6:17 PM, Jack Krupansky 
wrote:

> And neither project supports the Lucene faceting module, correct?
>
> And the ES web site says: "WARNING: Facets are deprecated and will be
> removed in a future release. You are encouraged to migrate to aggregations
> instead."
>
> That makes it more of an apples/oranges comparison.
>
> -- Jack Krupansky
>
> -Original Message- From: Toke Eskildsen
> Sent: Monday, August 4, 2014 3:33 AM
> To: solr-user@lucene.apache.org
>
> Subject: Re: Solr vs ElasticSearch
>
> On Mon, 2014-08-04 at 08:31 +0200, Harald Kirsch wrote:
>
>> As for performance, I would expect that it is very hard to find one of
>> the two technologies to be generally ahead. Except for plain blunders
>> that may be lurking in the code, I would think the inner loops, the
>> stuff that really burns CPU cycles, all happens in Lucene, which is the
>> same for both.
>>
>
> Faceting/Aggregation is implemented independently and with different
> designs for Solr and Elasticsearch. I would be surprised if memory
> overhead and performance were about the same for this functionality.
>
> - Toke Eskildsen, State and University Library, Denmark
>
>


-- 
Regards,

Salman Akram


SOLR 4 not utilizing multi CPU cores

2013-12-04 Thread Salman Akram
Hi,

We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the performance
went down for large phrase queries. On some analysis we have seen that
1.4.1 utilized multiple CPU cores for such queries, but SOLR 4.6 is only
utilizing a single CPU core. Any idea what could be the reason?

Note: We are not using SOLR Sharding.

-- 
Regards,

Salman Akram


Re: SOLR 4 not utilizing multi CPU cores

2013-12-05 Thread Salman Akram
I missed one important piece of info. Due to the large size we have indexed the
data with Common Grams. All of the words in the slow search are in the common
grams list, and when I debug it, the query is built properly with common grams.

In debug, all of the time is shown under the 'process' query time.

Let me know what other info you need. Thanks


On Thu, Dec 5, 2013 at 11:38 AM, Andrea Gazzarini wrote:

> Hi, I did moreless the same but didn't get that behaviour...could you give
> us more details
>
> Best,
> Gazza
> On 5 Dec 2013 06:54, "Salman Akram" 
> wrote:
>
> > Hi,
> >
> > We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the performance
> > went down for large phrase queries. On some analysis we have seen that
> > 1.4.1 utilized multiple cpu cores for such queries but SOLR 4.6 is only
> > utilizing single cpu core. Any idea on what could be the reason?
> >
> > Note: We are not using SOLR Sharding.
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Re: SOLR 4 not utilizing multi CPU cores

2013-12-05 Thread Salman Akram
More info on CPU consumption: we have a server with 32 physical cores.

The same search, when executed on SOLR 4.6, takes quite long and throughout
only uses 3% CPU (1 core).

The same search, when executed on SOLR 1.4.1, takes much less time and on
average uses around 40-50% CPU.


On Thu, Dec 5, 2013 at 2:05 PM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> I missed one imp piece of info. Due to large size we have indexed the date
> with Common Grams. All of the words in slow search are in common grams and
> when I debug it, they query is made properly with common grams.
>
> In debug all of the time is shown in process query time.
>
> Let me know what other info you need? Thanks
>
>
> On Thu, Dec 5, 2013 at 11:38 AM, Andrea Gazzarini 
> wrote:
>
>> Hi, I did moreless the same but didn't get that behaviour...could you give
>> us more details
>>
>> Best,
>> Gazza
>> On 5 Dec 2013 06:54, "Salman Akram" 
>> wrote:
>>
>> > Hi,
>> >
>> > We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the
>> performance
>> > went down for large phrase queries. On some analysis we have seen that
>> > 1.4.1 utilized multiple cpu cores for such queries but SOLR 4.6 is only
>> > utilizing single cpu core. Any idea on what could be the reason?
>> >
>> > Note: We are not using SOLR Sharding.
>> >
>> > --
>> > Regards,
>> >
>> > Salman Akram
>> >
>>
>
>
>
> --
> Regards,
>
> Salman Akram
>
>


-- 
Regards,

Salman Akram


Re: SOLR 4 not utilizing multi CPU cores

2013-12-05 Thread Salman Akram
So I think I found one issue that somewhat explains the time difference, but I
am not sure why it is happening. We are using the Surround Query Parser. Below
is a two-word query; both words are in the Common Grams list.

Query = "only be"

Here is what debug shows; the part that differs between the two versions is
that SOLR 4.6 is producing a MultiPhraseQuery. I am going to look into the
Surround Query Parser, but I am not sure whether the issue is with it or
something else.

*SOLR 4.6 (takes 20 secs)*
{!surround}
{!surround}
MultiPhraseQuery(Contents:"(only only_be) be")
Contents:"(only only_be) be"

*SOLR 1.4.1 (takes 1 sec)*
{!surround}
{!surround}
Contents:only_be
Contents:only_be


P.S.: The other issue still remains: why is it not utilizing multiple CPU
cores?


On Thu, Dec 5, 2013 at 2:11 PM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> More info on Cpu consumption: We have a server with 32 physical cores.
>
> Same search when executed on SOLR 4.6 takes quite long and throughout only
> uses 3% cpu (1 core).
>
> Same search when executed on SOLR 1.4.1 takes much less time and on
> average uses around 40-50% cpu.
>
>
> On Thu, Dec 5, 2013 at 2:05 PM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
>> I missed one imp piece of info. Due to large size we have indexed the
>> date with Common Grams. All of the words in slow search are in common grams
>> and when I debug it, they query is made properly with common grams.
>>
>> In debug all of the time is shown in process query time.
>>
>> Let me know what other info you need? Thanks
>>
>>
>> On Thu, Dec 5, 2013 at 11:38 AM, Andrea Gazzarini 
>> wrote:
>>
>>> Hi, I did moreless the same but didn't get that behaviour...could you
>>> give
>>> us more details
>>>
>>> Best,
>>> Gazza
>>> On 5 Dec 2013 06:54, "Salman Akram" 
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the
>>> performance
>>> > went down for large phrase queries. On some analysis we have seen that
>>> > 1.4.1 utilized multiple cpu cores for such queries but SOLR 4.6 is only
>>> > utilizing single cpu core. Any idea on what could be the reason?
>>> >
>>> > Note: We are not using SOLR Sharding.
>>> >
>>> > --
>>> > Regards,
>>> >
>>> > Salman Akram
>>> >
>>>
>>
>>
>>
>> --
>> Regards,
>>
>> Salman Akram
>>
>>
>
>
> --
> Regards,
>
> Salman Akram
>
>


-- 
Regards,

Salman Akram


Re: SOLR 4 not utilizing multi CPU cores

2013-12-05 Thread Salman Akram
I am not using Shards.

I gave more info in a previous mail. I know it's a single index and what you
are saying makes sense, but from what I could see, 1.4.1 was better at
'utilizing' the hardware resources available. I mean, if the CPU is free, why
not do multi-threading (if possible, of course)? Not sure if that was a bug in
1.4.1 or just better resource utilization.

However, the main issue seems to be the one I referred to in my last
mail... Thanks!


On Thu, Dec 5, 2013 at 2:24 PM, Daniel Collins wrote:

> Not sure if you are really stating the problem here.
>
> If you don't use Solr sharding, (I also assume you aren't using SolrCloud),
> and I'm guessing you are a single core (but can you confirm).
>
> As I understand Solr's logic, for a single query on a single core, that
> will only use 1 thread (ignoring updates, background merges, etc).  A
> Lucene index (with multiple segments) has each segment read sequentially,
> so a search must scan all the segments and that inherently is a
> single-threaded activity.
>
> The fact that the search uses less CPU is not really the issue (it might
> actually be a GOOD thing, it could mean the code is more efficient!), so I
> would consider that a red herring.  The real issue is that the search takes
> longer in elapsed time.
>
> The usual questions apply:
>
> 1)  how did you upgrade, did you port your config, or start from a fresh
> Solr 4 config and add your custom stuff to it.
> 2)  Is your new index comparable to your old one, does it have more
> segments, how did you fill it (bulk import or upgrade of old 1.4.1 index),
> and what is your merge policy for the index?
>
> Upgrades from such an old version of Solr have been asked before on the
> list, the consensus is that you probably need to re-tune your configuration
> (starting with a Solr 4 basic config) since Solr 4 is so different under
> the hood from 1.x
>
>
> On 5 December 2013 09:11, Salman Akram
> wrote:
>
> > More info on Cpu consumption: We have a server with 32 physical cores.
> >
> > Same search when executed on SOLR 4.6 takes quite long and throughout
> only
> > uses 3% cpu (1 core).
> >
> > Same search when executed on SOLR 1.4.1 takes much less time and on
> average
> > uses around 40-50% cpu.
> >
> >
> > On Thu, Dec 5, 2013 at 2:05 PM, Salman Akram <
> > salman.ak...@northbaysolutions.net> wrote:
> >
> > > I missed one imp piece of info. Due to large size we have indexed the
> > date
> > > with Common Grams. All of the words in slow search are in common grams
> > and
> > > when I debug it, they query is made properly with common grams.
> > >
> > > In debug all of the time is shown in process query time.
> > >
> > > Let me know what other info you need? Thanks
> > >
> > >
> > > On Thu, Dec 5, 2013 at 11:38 AM, Andrea Gazzarini <
> agazzar...@apache.org
> > >wrote:
> > >
> > >> Hi, I did moreless the same but didn't get that behaviour...could you
> > give
> > >> us more details
> > >>
> > >> Best,
> > >> Gazza
> > >> On 5 Dec 2013 06:54, "Salman Akram" <
> salman.ak...@northbaysolutions.net
> > >
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the
> > >> performance
> > >> > went down for large phrase queries. On some analysis we have seen
> that
> > >> > 1.4.1 utilized multiple cpu cores for such queries but SOLR 4.6 is
> > only
> > >> > utilizing single cpu core. Any idea on what could be the reason?
> > >> >
> > >> > Note: We are not using SOLR Sharding.
> > >> >
> > >> > --
> > >> > Regards,
> > >> >
> > >> > Salman Akram
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Salman Akram
> > >
> > >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Re: SOLR 4 not utilizing multi CPU cores

2013-12-05 Thread Salman Akram
Here are the responses to your 2 questions:

1- Started from fresh Solr 4 config and modified custom stuff.

2- Index is same and optimized.

However, as I said in a previous mail, the issue seems to be the Surround Query
Parser, which is parsing the query into a different form.

On Thu, Dec 5, 2013 at 2:24 PM, Daniel Collins wrote:

> Not sure if you are really stating the problem here.
>
> If you don't use Solr sharding, (I also assume you aren't using SolrCloud),
> and I'm guessing you are a single core (but can you confirm).
>
> As I understand Solr's logic, for a single query on a single core, that
> will only use 1 thread (ignoring updates, background merges, etc).  A
> Lucene index (with multiple segments) has each segment read sequentially,
> so a search must scan all the segments and that inherently is a
> single-threaded activity.
>
> The fact that the search uses less CPU is not really the issue (it might
> actually be a GOOD thing, it could mean the code is more efficient!), so I
> would consider that a red herring.  The real issue is that the search takes
> longer in elapsed time.
>
> The usual questions apply:
>
> 1)  how did you upgrade, did you port your config, or start from a fresh
> Solr 4 config and add your custom stuff to it.
> 2)  Is your new index comparable to your old one, does it have more
> segments, how did you fill it (bulk import or upgrade of old 1.4.1 index),
> and what is your merge policy for the index?
>
> Upgrades from such an old version of Solr have been asked before on the
> list, the consensus is that you probably need to re-tune your configuration
> (starting with a Solr 4 basic config) since Solr 4 is so different under
> the hood from 1.x
>
>
> On 5 December 2013 09:11, Salman Akram
> wrote:
>
> > More info on Cpu consumption: We have a server with 32 physical cores.
> >
> > Same search when executed on SOLR 4.6 takes quite long and throughout
> only
> > uses 3% cpu (1 core).
> >
> > Same search when executed on SOLR 1.4.1 takes much less time and on
> average
> > uses around 40-50% cpu.
> >
> >
> > On Thu, Dec 5, 2013 at 2:05 PM, Salman Akram <
> > salman.ak...@northbaysolutions.net> wrote:
> >
> > > I missed one imp piece of info. Due to large size we have indexed the
> > date
> > > with Common Grams. All of the words in slow search are in common grams
> > and
> > > when I debug it, they query is made properly with common grams.
> > >
> > > In debug all of the time is shown in process query time.
> > >
> > > Let me know what other info you need? Thanks
> > >
> > >
> > > On Thu, Dec 5, 2013 at 11:38 AM, Andrea Gazzarini <
> agazzar...@apache.org
> > >wrote:
> > >
> > >> Hi, I did moreless the same but didn't get that behaviour...could you
> > give
> > >> us more details
> > >>
> > >> Best,
> > >> Gazza
> > >> On 5 Dec 2013 06:54, "Salman Akram" <
> salman.ak...@northbaysolutions.net
> > >
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the
> > >> performance
> > >> > went down for large phrase queries. On some analysis we have seen
> that
> > >> > 1.4.1 utilized multiple cpu cores for such queries but SOLR 4.6 is
> > only
> > >> > utilizing single cpu core. Any idea on what could be the reason?
> > >> >
> > >> > Note: We are not using SOLR Sharding.
> > >> >
> > >> > --
> > >> > Regards,
> > >> >
> > >> > Salman Akram
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Salman Akram
> > >
> > >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Re: SOLR 4 not utilizing multi CPU cores

2013-12-05 Thread Salman Akram
After debugging, it seems the query parser code in the Surround parser is
causing an issue for queries with common words. Has anyone tried Surround and
Common Grams with SOLR 4?


On Thu, Dec 5, 2013 at 7:00 PM, Daniel Collins wrote:

> Fair enough, I'm not famiilar with Surround parser, but it does look like
> some logic has changed there.
>
>
> On 5 December 2013 12:38, Salman Akram
> wrote:
>
> > Here is the response to your 2 questions:
> >
> > 1- Started from fresh Solr 4 config and modified custom stuff.
> >
> > 2- Index is same and optimized.
> >
> > However, as I said in a previous mail the issue seems to be Surround
> Query
> > Parser which is parsing the query in a different format.
> >
> >
> > On Thu, Dec 5, 2013 at 2:24 PM, Daniel Collins  > >wrote:
> >
> > > Not sure if you are really stating the problem here.
> > >
> > > If you don't use Solr sharding, (I also assume you aren't using
> > SolrCloud),
> > > and I'm guessing you are a single core (but can you confirm).
> > >
> > > As I understand Solr's logic, for a single query on a single core, that
> > > will only use 1 thread (ignoring updates, background merges, etc).  A
> > > Lucene index (with multiple segments) has each segment read
> sequentially,
> > > so a search must scan all the segments and that inherently is a
> > > single-threaded activity.
> > >
> > > The fact that the search uses less CPU is not really the issue (it
> might
> > > actually be a GOOD thing, it could mean the code is more efficient!),
> so
> > I
> > > would consider that a red herring.  The real issue is that the search
> > takes
> > > longer in elapsed time.
> > >
> > > The usual questions apply:
> > >
> > > 1)  how did you upgrade, did you port your config, or start from a
> fresh
> > > Solr 4 config and add your custom stuff to it.
> > > 2)  Is your new index comparable to your old one, does it have more
> > > segments, how did you fill it (bulk import or upgrade of old 1.4.1
> > index),
> > > and what is your merge policy for the index?
> > >
> > > Upgrades from such an old version of Solr have been asked before on the
> > > list, the consensus is that you probably need to re-tune your
> > configuration
> > > (starting with a Solr 4 basic config) since Solr 4 is so different
> under
> > > the hood from 1.x
> > >
> > >
> > > On 5 December 2013 09:11, Salman Akram
> > > wrote:
> > >
> > > > More info on Cpu consumption: We have a server with 32 physical
> cores.
> > > >
> > > > Same search when executed on SOLR 4.6 takes quite long and throughout
> > > only
> > > > uses 3% cpu (1 core).
> > > >
> > > > Same search when executed on SOLR 1.4.1 takes much less time and on
> > > average
> > > > uses around 40-50% cpu.
> > > >
> > > >
> > > > On Thu, Dec 5, 2013 at 2:05 PM, Salman Akram <
> > > > salman.ak...@northbaysolutions.net> wrote:
> > > >
> > > > > I missed one imp piece of info. Due to large size we have indexed
> the
> > > > date
> > > > > with Common Grams. All of the words in slow search are in common
> > grams
> > > > and
> > > > > when I debug it, they query is made properly with common grams.
> > > > >
> > > > > In debug all of the time is shown in process query time.
> > > > >
> > > > > Let me know what other info you need? Thanks
> > > > >
> > > > >
> > > > > On Thu, Dec 5, 2013 at 11:38 AM, Andrea Gazzarini <
> > > agazzar...@apache.org
> > > > >wrote:
> > > > >
> > > > >> Hi, I did moreless the same but didn't get that behaviour...could
> > you
> > > > give
> > > > >> us more details
> > > > >>
> > > > >> Best,
> > > > >> Gazza
> > > > >> On 5 Dec 2013 06:54, "Salman Akram" <
> > > salman.ak...@northbaysolutions.net
> > > > >
> > > > >> wrote:
> > > > >>
> > > > >> > Hi,
> > > > >> >
> > > > >> > We recently upgraded to SOLR 4.6 from SOLR 1.4.1. Overall the
> > > > >> performance
> > > > >> > went down for large phrase queries. On some analysis we have
> seen
> > > that
> > > > >> > 1.4.1 utilized multiple cpu cores for such queries but SOLR 4.6
> is
> > > > only
> > > > >> > utilizing single cpu core. Any idea on what could be the reason?
> > > > >> >
> > > > >> > Note: We are not using SOLR Sharding.
> > > > >> >
> > > > >> > --
> > > > >> > Regards,
> > > > >> >
> > > > >> > Salman Akram
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > Salman Akram
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Salman Akram
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


SOLR 4 - Query Issue in Common Grams with Surround Query Parser

2013-12-08 Thread Salman Akram
All,

I posted this sub-issue along with another issue a few days back but maybe it
was not obvious, so I am posting it on a separate thread.

We recently migrated to SOLR 4.6. We use Common Grams, but queries with
words in the CG list have slowed down. On debugging we found that for CG
words the parser is adding the individual tokens of those words to the query
too, which ends up slowing it down. Below is an example:

Query = "only be"

Here is what debug shows. The part that differs between the two versions is
that SOLR 4.6 is turning it into a MultiPhraseQuery and adding the individual
tokens too. Can someone help?

SOLR 4.6 (takes 20 secs)
{!surround}
{!surround}
MultiPhraseQuery(Contents:"(only only_be) be")
Contents:"(only only_be) be"

SOLR 1.4.1 (takes 1 sec)
{!surround}
{!surround}
Contents:only_be
Contents:only_be


-- 
Regards,

Salman Akram


Re: SOLR 4 - Query Issue in Common Grams with Surround Query Parser

2013-12-09 Thread Salman Akram
Yup, on debugging I found that it's coming from the Analyzer. We are using the
Standard Analyzer. It seems to be a SOLR 4 issue with Common Grams. Not sure if
it's a bug or I am missing some config.


On Mon, Dec 9, 2013 at 2:03 PM, Ahmet Arslan  wrote:

> Hi Salman,
> I am confused because with surround no analysis is applied at query time.
> I suspect that surround query parser is not kicking in. You should see
> SrndQuery or something like at parser query section.
>
>
>
> On Monday, December 9, 2013 6:24 AM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> All,
>
> I posted this sub-issue with another issue few days back but maybe it was
> not obvious so posting it on a separate thread.
>
> We recently migrated to SOLR 4.6. We use Common Grams but queries with
> words in the CG list have slowed down. On debugging we found that for CG
> words the parser is adding individual tokens of those words in the query
> too which ends up slowing it. Below is an example:
>
> Query = "only be"
>
> Here is what debug shows. I have highlighted the red part which is
> different in both versions i.e. SOLR 4.6 is making it a multiphrasequery
> and adding individual tokens too. Can someone help?
>
> SOLR 4.6 (takes 20 secs)
> {!surround}
> {!surround}
> MultiPhraseQuery(Contents:"(only only_be)
> be")
> Contents:"(only only_be) be"
>
> SOLR 1.4.1 (takes 1 sec)
> {!surround}
> {!surround}
> Contents:only_be
> Contents:only_be--
>
>
> Regards,
>
> Salman Akram
>



-- 
Regards,

Salman Akram


Re: SOLR 4 - Query Issue in Common Grams with Surround Query Parser

2013-12-09 Thread Salman Akram
We used that syntax in 1.4.1, when Surround was not part of SOLR and we had to
register it. I didn't know that it is now part of SOLR. Anyway, this is a
red herring since I have totally removed Surround and the issue remains
there.

Below is the debug info when I give a simple phrase query containing common
words with the default Query Parser. What I don't understand is why it is
including the single tokens as well. I have also included the relevant config
part below.

"rawquerystring": "Contents:\"only be\"",
"querystring": "Contents:\"only be\"",
"parsedquery": "MultiPhraseQuery(Contents:\"(only only_be) be\")",
"parsedquery_toString": "Contents:\"(only only_be) be\"",

"QParser": "LuceneQParser",

=



 
 
 
 





On Mon, Dec 9, 2013 at 7:46 PM, Erik Hatcher  wrote:

> But again, as Ahmet mentioned… it doesn't look like the surround query
> parser is actually being used.   The debug output also mentioned the query
> parser used, but that part wasn't provided below.  One thing to note here,
> the surround query parser is not available in 1.4.1.   It also looks like
> you're surrounding your query with angle brackets, as it says query string
> is {!surround}, which is not correct syntax.  And one
> of the most important things to note here is that the surround query parser
> does NOT use the analysis chain of the field, see <
> http://wiki.apache.org/solr/SurroundQueryParser#Limitations>.  In short,
> you're going to have to do some work to get common grams factored into a
> surround query (such as maybe calling to the analysis request hander to
> "parse" the query before sending it to the surround query parser).
>
> Erik
>
>
> On Dec 9, 2013, at 9:36 AM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> > Yup on debugging I found that its coming in Analyzer. We are using
> Standard
> > Analyzer. It seems to be a SOLR 4 issue with Common Grams. Not sure if
> its
> > a bug or I am missing some config.
> >
> >
> > On Mon, Dec 9, 2013 at 2:03 PM, Ahmet Arslan  wrote:
> >
> >> Hi Salman,
> >> I am confused because with surround no analysis is applied at query
> time.
> >> I suspect that surround query parser is not kicking in. You should see
> >> SrndQuery or something like at parser query section.
> >>
> >>
> >>
> >> On Monday, December 9, 2013 6:24 AM, Salman Akram <
> >> salman.ak...@northbaysolutions.net> wrote:
> >>
> >> All,
> >>
> >> I posted this sub-issue with another issue few days back but maybe it
> was
> >> not obvious so posting it on a separate thread.
> >>
> >> We recently migrated to SOLR 4.6. We use Common Grams but queries with
> >> words in the CG list have slowed down. On debugging we found that for CG
> >> words the parser is adding individual tokens of those words in the query
> >> too which ends up slowing it. Below is an example:
> >>
> >> Query = "only be"
> >>
> >> Here is what debug shows. I have highlighted the red part which is
> >> different in both versions i.e. SOLR 4.6 is making it a multiphrasequery
> >> and adding individual tokens too. Can someone help?
> >>
> >> SOLR 4.6 (takes 20 secs)
> >> {!surround}
> >> {!surround}
> >> MultiPhraseQuery(Contents:"(only only_be)
> >> be")
> >> Contents:"(only only_be) be"
> >>
> >> SOLR 1.4.1 (takes 1 sec)
> >> {!surround}
> >> {!surround}
> >> Contents:only_be
> >> Contents:only_be--
> >>
> >>
> >> Regards,
> >>
> >> Salman Akram
> >>
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
>
>


-- 
Regards,

Salman Akram


Re: SOLR 4 - Query Issue in Common Grams with Surround Query Parser

2013-12-10 Thread Salman Akram
Thanks!! Using CommonGramsQueryFilter resolved the issue.

This was not there in 1.4.1 and for some reason was also not mentioned in the
SOLR 4 release notes that we studied before upgrading.


On Tue, Dec 10, 2013 at 9:55 AM, Ahmet Arslan  wrote:

> Hi Salman,
>
> I never used commons gram filer but I remember there are two classes in
> this family. CommonGramsFilter and CommonGramsQueryFilter. It seems that
> CommonsGramsQueryFilter is what you are after.
>
>
> http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsQueryFilter.html
>
>
> http://khaidoan.wikidot.com/solr-common-gram-filter
>
>
>
>
>
> On Tuesday, December 10, 2013 6:43 AM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
> We used that syntax in 1.4.1 when Surround was not part of SOLR and has to
> register it. Didn't know that it is now part of SOLR. Any ways this is a
> red herring since I have totally removed Surround and the issue remains
> there.
>
> Below is the debug info when I give a simple phrase query having common
> words with default Query Parser. What I don't understand is that why is it
> including single tokens as well? I have also included the relevant config
> part below.
>
> "rawquerystring": "Contents:\"only be\"",
> "querystring": "Contents:\"only be\"",
> "parsedquery": "MultiPhraseQuery(Contents:\"(only only_be) be\")",
> "parsedquery_toString": "Contents:\"(only only_be) be\"",
>
> "QParser": "LuceneQParser",
>
> =
>
> 
> 
> 
> 
> 
>  ignoreCase="true"/>
> 
> 
>
>
>
> On Mon, Dec 9, 2013 at 7:46 PM, Erik Hatcher 
> wrote:
>
> > But again, as Ahmet mentioned… it doesn't look like the surround query
> > parser is actually being used.   The debug output also mentioned the
> query
> > parser used, but that part wasn't provided below.  One thing to note
> here,
> > the surround query parser is not available in 1.4.1.   It also looks like
> > you're surrounding your query with angle brackets, as it says query
> string
> > is {!surround}, which is not correct syntax.  And one
> > of the most important things to note here is that the surround query
> parser
> > does NOT use the analysis chain of the field, see <
> > http://wiki.apache.org/solr/SurroundQueryParser#Limitations>.  In short,
> > you're going to have to do some work to get common grams factored into a
> > surround query (such as maybe calling to the analysis request hander to
> > "parse" the query before sending it to the surround query parser).
> >
> > Erik
> >
> >
> > On Dec 9, 2013, at 9:36 AM, Salman Akram <
> > salman.ak...@northbaysolutions.net> wrote:
> >
> > > Yup on debugging I found that its coming in Analyzer. We are using
> > Standard
> > > Analyzer. It seems to be a SOLR 4 issue with Common Grams. Not sure if
> > its
> > > a bug or I am missing some config.
> > >
> > >
> > > On Mon, Dec 9, 2013 at 2:03 PM, Ahmet Arslan 
> wrote:
> > >
> > >> Hi Salman,
> > >> I am confused because with surround no analysis is applied at query
> > time.
> > >> I suspect that surround query parser is not kicking in. You should see
> > >> SrndQuery or something like at parser query section.
> > >>
> > >>
> > >>
> > >> On Monday, December 9, 2013 6:24 AM, Salman Akram <
> > >> salman.ak...@northbaysolutions.net> wrote:
> > >>
> > >> All,
> > >>
> > >> I posted this sub-issue with another issue few days back but maybe it
> > was
> > >> not obvious so posting it on a separate thread.
> > >>
> > >> We recently migrated to SOLR 4.6. We use Common Grams but queries with
> > >> words in the CG list have slowed down. On debugging we found that for
> CG
> > >> words the parser is adding individual tokens of those words in the
> query
> > >> too which ends up slowing it. Below is an example:
> > >>
> > >> Query = "only be"
> > >>
> > >> Here is what debug shows. I have highlighted the red part which is
> > >> different in both versions i.e. SOLR 4.6 is making it a
> multiphrasequery
> > >> and adding individual tokens too. Can someone help?
> > >>
> > >> SOLR 4.6 (takes 20 secs)
> > >> {!surround}
> > >> {!surround}
> > >> MultiPhraseQuery(Contents:"(only only_be)
> > >> be")
> > >> Contents:"(only only_be) be"
> > >>
> > >> SOLR 1.4.1 (takes 1 sec)
> > >> {!surround}
> > >> {!surround}
> > >> Contents:only_be
> > >> Contents:only_be--
> > >>
> > >>
> > >> Regards,
>
> > >>
> > >> Salman Akram
> > >>
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Salman Akram
> >
> >
>
>
> --
> Regards,
>
> Salman Akram
>



-- 
Regards,

Salman Akram


Optimizing index on Slave

2014-01-20 Thread Salman Akram
All,

I know that normally the index should be optimized on the master and then
replicated to the slaves, but we have an issue with network bandwidth.

We optimize the indexes weekly (total size is around 1.5TB). We have a few
slaves set up on the local network, so replicating the whole index there is
not a big issue.

However, we also have one slave in another city (on a backup network) which of
course gets replicated over the internet, which is quite slow and expensive.
We want to avoid copying the complete indexes every week after optimization
and were wondering whether it is possible to optimize independently on that
slave so that there is no delta between master and slave. We tried to do it,
but the slave still replicated from the master.
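
For reference, the setup I am describing is the standard ReplicationHandler;
a minimal sketch of the master side (values illustrative, not our exact
config):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- an optimize rewrites the segment files, so slaves end up fetching
         (almost) the whole index afterwards -->
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>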


-- 
Regards,

Salman Akram


Re: Optimizing index on Slave

2014-01-22 Thread Salman Akram
We do. We have a lot of updates/deletes every day and a weekly optimization
definitely gives a considerable improvement, so we don't see a downside to it
except the complete-replication part, which is not an issue on the local
network.


Re: SOLR 4 - Query Issue in Common Grams with Surround Query Parser

2014-01-22 Thread Salman Akram
Apologies for the late response as this mail was lost somewhere in filters.

The issue was that CommonGramsQueryFilterFactory should be used for searching
and CommonGramsFilterFactory for indexing. We were using
CommonGramsFilterFactory for both, which is why it was not dropping the single
tokens for common grams in a phrase query.
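
For anyone who hits the same thing, the relevant part of the field type now
looks roughly like this (the type name, words file and surrounding filters are
just examples, not our exact schema):

<fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
  <!-- index side: CommonGramsFilter emits the single terms plus the
       common-gram pairs -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt"
            ignoreCase="true"/>
  </analyzer>
  <!-- query side: CommonGramsQueryFilter keeps the pairs and drops the
       single common-word tokens they cover -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="commongrams.txt"
            ignoreCase="true"/>
  </analyzer>
</fieldType>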

I will go through the link you sent and see if it needs any explanation.
Thanks!


Re: Optimizing index on Slave

2014-01-22 Thread Salman Akram
Unfortunately we can't do sharding right now.

If we optimize on the master and the slave separately, the file names and
sizes are the same. I think it's just the version number that is different.
Maybe if there was a way to copy the master version to the slave, that would
resolve this issue?


Partial Counts in SOLR

2014-03-07 Thread Salman Akram
All,

Is it possible to get partial counts in SOLR? The idea is to get the count,
but if it's above a certain limit then just return that limit.

Reason: In an index with millions of documents I don't want to know that a
certain query matched 1 million docs (of course it will take time to
calculate that). Why not just stop looking for more results, let's say
after it finds 100 docs? Possible??

e.g. something similar to what we can do in MySQL:

SELECT COUNT(*) FROM (SELECT * FROM table WHERE 1 = 1 LIMIT 100) Alias


-- 
Regards,

Salman Akram


Re: Partial Counts in SOLR

2014-03-07 Thread Salman Akram
I know about numFound. That's where the issue is.

On a complex query that takes minutes I think a major chunk of that time is
spent calculating "numFound", whereas I don't need it. Let's say I just need
the first 100 docs and then want SOLR to STOP looking further to populate
"numFound".

Let's say I just don't want SOLR to return numFound at all. Is that possible?
Also, would it really help performance?

In MySQL you can simply stop it from looking beyond a certain count for the
"total count", and that gives a considerable improvement for complex queries,
but that's not an inverted index so I'm not sure how it works in SOLR...
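
The only SOLR-side knob I am aware of for this is timeAllowed, e.g. as a
request handler default (handler name and values below are only an
illustration):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <int name="rows">100</int>
    <!-- stop collecting after ~10 seconds and return whatever matched so far -->
    <int name="timeAllowed">10000</int>
  </lst>
</requestHandler>

But that bounds time, not the number of hits, which is exactly the gap I am
asking about.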


On Fri, Mar 7, 2014 at 3:17 PM, Gora Mohanty  wrote:

> On 7 March 2014 15:18, Salman Akram 
> wrote:
> > All,
> >
> > Is it possible to get partial counts in SOLR? The idea is to get the
> count
> > but if its above a certain limit than just return that limit.
> >
> > Reason: In an index with millions of documents I don't want to know that
> a
> > certain query matched 1 million docs (of course it will take time to
> > calculate that). Why don't just stop looking for more results lets say
> > after it finds 100 docs? Possible??
> >
> > e.g. Something similar that we can do in MySQL:
> >
> > SELECT COUNT(*) FROM ( (SELECT * FROM table where 1 = 1) LIMIT 100) Alias
>
> The response to the /select Solr URL has a "numFound" attribute that
> is the number
> of matches.
>
> Regards,
> Gora
>



-- 
Regards,

Salman Akram


Master - Master / Upgrading a slave to master

2014-09-08 Thread Salman Akram
We have a redundant data center in case the primary goes down. Currently we
have 1 master and multiple slaves in the primary data center. This master also
replicates to a slave in the secondary data center, so if the primary goes
down at least the read-only part works. However, now we want writes to work in
the secondary data center too when the primary goes down.

- Is it possible in SOLR to have Master - Master?
- If not, then what's the best strategy to upgrade a slave to master?
- Naturally there would be some latency due to the data centers being in
different geographical locations, so what are the normal data issues and
best practices in case the primary goes down? We would also like to shift back
to the primary as soon as it's back.


Thanks!

-- 
Regards,

Salman Akram


Re: Master - Master / Upgrading a slave to master

2014-09-09 Thread Salman Akram
You mean 3 'data centers' or 'nodes'? I am thinking: if we have 2 nodes in the
primary and 1 in the secondary, and we normally keep the secondary down, would
that work? Basically the secondary network is just for redundancy and won't be
as fast, so normally we wouldn't want to shift traffic there.

So can we just have nodes for redundancy and NOT load balancing, i.e. it has
3 nodes but updates go to only one of them? Similarly, for the slave replicas,
can we limit the searches to a certain slave, or will they be auto-balanced?

Also, apart from SolrCloud, is it possible to have multiple masters in SOLR,
or is there a good guide to upgrading a slave to master?

Thanks

On Tue, Sep 9, 2014 at 5:40 PM, Shawn Heisey  wrote:

> On 9/8/2014 9:54 PM, Salman Akram wrote:
> > We have a redundant data center in case the primary goes down. Currently
> we
> > have 1 master and multiple slaves on primary data center. This master
> also
> > replicates to a slave in secondary data center. So if the primary goes
> down
> > at least the read only part works. However, now we want writes to work on
> > secondary data center too when primary goes down.
> >
> > - Is it possible in SOLR to have Master - Master?
> > - If not then what's the best strategy to upgrade a slave to master?
> > - Naturally there would be some latency due to data centers being in
> > different geographical locations so what are the normal data issues and
> > best practices in case primary goes down? We would also like to shift
> back
> > to primary as soon as its back.
>
> SolrCloud would work, but only if you have *three* datacenters.  Two of
> them would need to remain fully operational.  SolrCloud is a true
> cluster -- there is no master.  Each of the shards in a collection has
> one or more replicas.  One of the replicas gets elected to be leader,
> but the leader designation can change.
>
> The reason that you need three is because of zookeeper, which is the
> software that actually maintains the cluster and handles leader
> elections.  A majority of zookeeper nodes (more than half of them) must
> be operational for zookeeper to maintain quorum.  That means that the
> minimum number of zookeepers is three, and in a three-node system, one
> can go down without disrupting operation.
>
> One thing that SolrCloud doesn't yet have is rack/datacenter awareness.
>  Requests get load balanced across the entire cluster, regardless of
> where they are located.  It's something that will eventually come, but I
> don't have any kind of estimate for when.
>
> Thanks,
> Shawn
>
>


-- 
Regards,

Salman Akram


Re: Master - Master / Upgrading a slave to master

2014-09-09 Thread Salman Akram
So realistically speaking you cannot have SolrCloud work for 2 data centers
as a redundant solution because no matter how many nodes you add you still
would need at least 1 node in the 2nd center working too.

So that just leaves with non-SolrCloud solutions.

"1) Change the replication config to redefine the master and reload the core
or restart Solr."

That of course is a simple way, but the real question is about the possible
issues and good practices, e.g. normally the scenario would be that the
primary data center goes down for a few hours and in the meantime we promote
one of the slaves in the secondary to a master. Now:

- IF there is no lag there won't be any issue on the secondary at least, but
what if there is lag and one of the files is not completely replicated? Would
that file be discarded, or is there a possibility that the whole index is not
usable?

- Once the primary comes back, how would we copy the delta from the secondary?
Make it a slave of the secondary first, replicate the delta, and then set it
as a master again?

In other words, is there a good guide out there for this, with the possible
issues and solutions? People must have been doing this before SolrCloud, and
even now SolrCloud doesn't seem practical in quite a few situations.
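
For concreteness, as I understand it, "redefining the master" in option 1
mostly comes down to editing the slave section of the replication handler on
each node and reloading the core (URL and interval below are illustrative):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- point this at whichever node is currently acting as master,
         then reload the core / restart Solr -->
    <str name="masterUrl">http://new-master-host:8983/solr/core1/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>

(and on the promoted node itself, enabling the master section with
replicateAfter=commit).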

Thanks again!!

On Tue, Sep 9, 2014 at 8:02 PM, Shawn Heisey  wrote:

> On 9/9/2014 8:46 AM, Salman Akram wrote:
> > You mean 3 'data centers' or 'nodes'? I am thinking if we have 2 nodes on
> > primary and 1 in secondary and we normally keep the secondary down would
> > that work? Basically secondary network is just for redundancy and won't
> be
> > as fast so normally we won't like to shift traffic there.
> >
> > So can we just have nodes for redundancy and NOT load balancing i.e. it
> has
> > 3 nodes but update is only on one of them? Similarly for the slave
> replicas
> > can we limit the searches to a certain slave or it will be auto balanced?
> >
> > Also apart from SOLR cloud is it possible to have multiple master in SOLR
> > or a good guide to upgrade a slave to master?
>
> You must have three zookeeper nodes for a redundant setup.  If you only
> have two data centers, then you must put at least two of those nodes in
> one data center.  If the data center with two zookeeper nodes goes down,
> zookeeper cannot function, which means SolrCloud will not work
> correctly.  There is no way to maintain SolrCloud redundancy with only
> two data centers.  You might think to add a fourth ZK node and split
> them between the data centers ... except that in that situation, at
> least three nodes must be functional.  Two out of four nodes is not enough.
>
> A minimal fault-tolerant SolrCloud install is three physical machines.
> Two of them run ZK and Solr, one of them runs ZK only.
>
> If you don't use SolrCloud, then you have two choices to switch masters:
>
> 1) Change the replication config to redefine the master and reload the
> core or restart Solr.
> 2) Write scripts that manually use the replication HTTP API to do all
> your replication, rather than let Solr handle it automatically.  You can
> choose the master for every replication with HTTP calls.
>
> https://wiki.apache.org/solr/SolrReplication#HTTP_API
>
> Thanks,
> Shawn
>
>


-- 
Regards,

Salman Akram


Re: Master - Master / Upgrading a slave to master

2014-09-11 Thread Salman Akram
Anyone?

On Tue, Sep 9, 2014 at 8:20 PM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> So realistically speaking you cannot have SolrCloud work for 2 data
> centers as a redundant solution because no matter how many nodes you add
> you still would need at least 1 node in the 2nd center working too.
>
> So that just leaves with non-SolrCloud solutions.
>
> "1) Change the replication config to redefine the master and reload the core
> or restart Solr."
>
> That of course is a simple way but the real issue is about the possible
> issues and some good practices e.g. normally the scenario would be that
> primary data center goes down for few hours and till then we upgrade one of
> the slaves in secondary to a master. Now
>
> - IF there is no lag there won't be any issue in secondary at least but
> what if there is lag and one of the files is not completely replicated?
> That file would be discarded or there is a possibility that whole index is
> not usable?
>
> - Once the primary comes back how would we now copy the delta from
> secondary? Make it a slave of secondary first, replicate the delta and then
> set it as a master again?
>
> In other words is there a good guide out there for this with possible
> issues and solutions? Definitely before SolrCloud people would be doing
> this and even now SolrCloud doesn't seem practical in quite a few
> situations.
>
> Thanks again!!
>
> On Tue, Sep 9, 2014 at 8:02 PM, Shawn Heisey  wrote:
>
>> On 9/9/2014 8:46 AM, Salman Akram wrote:
>> > You mean 3 'data centers' or 'nodes'? I am thinking if we have 2 nodes
>> on
>> > primary and 1 in secondary and we normally keep the secondary down would
>> > that work? Basically secondary network is just for redundancy and won't
>> be
>> > as fast so normally we won't like to shift traffic there.
>> >
>> > So can we just have nodes for redundancy and NOT load balancing i.e. it
>> has
>> > 3 nodes but update is only on one of them? Similarly for the slave
>> replicas
>> > can we limit the searches to a certain slave or it will be auto
>> balanced?
>> >
>> > Also apart from SOLR cloud is it possible to have multiple master in
>> SOLR
>> > or a good guide to upgrade a slave to master?
>>
>> You must have three zookeeper nodes for a redundant setup.  If you only
>> have two data centers, then you must put at least two of those nodes in
>> one data center.  If the data center with two zookeeper nodes goes down,
>> zookeeper cannot function, which means SolrCloud will not work
>> correctly.  There is no way to maintain SolrCloud redundancy with only
>> two data centers.  You might think to add a fourth ZK node and split
>> them between the data centers ... except that in that situation, at
>> least three nodes must be functional.  Two out of four nodes is not
>> enough.
>>
>> A minimal fault-tolerant SolrCloud install is three physical machines.
>> Two of them run ZK and Solr, one of them runs ZK only.
>>
>> If you don't use SolrCloud, then you have two choices to switch masters:
>>
>> 1) Change the replication config to redefine the master and reload the
>> core or restart Solr.
>> 2) Write scripts that manually use the replication HTTP API to do all
>> your replication, rather than let Solr handle it automatically.  You can
>> choose the master for every replication with HTTP calls.
>>
>> https://wiki.apache.org/solr/SolrReplication#HTTP_API
>>
>> Thanks,
>> Shawn
>>
>>
>
>
> --
> Regards,
>
> Salman Akram
>
>


-- 
Regards,

Salman Akram


Recovering from Out of Mem

2014-10-14 Thread Salman Akram
I know there are some suggestions for avoiding the OOM issue, e.g. setting an
appropriate max heap size, etc. However, what's the best way to recover from
it, since Solr goes into a non-responding state? We are using Tomcat on the
back end.

The scenario is that once we hit the OOM issue it keeps accepting queries
(doesn't give any error) but they just time out. So even though we have a
failover system implemented, we don't have a way to distinguish whether these
are genuinely timed-out queries OR timeouts due to OOM.

-- 
Regards,

Salman Akram


Re: Recovering from Out of Mem

2014-10-17 Thread Salman Akram
I know this might sound weird, but is there any easy way to do this on Windows?

On Tue, Oct 14, 2014 at 7:51 PM, Boogie Shafer 
wrote:

> yago,
>
> you can put more complex restart logic as shown in the examples below or
> just do something similar to the java_oom.sh i posted earlier where you
> just spit out an email alert and deal with service restarts and
> troubleshooting manually
>
>
> e.g. something like the following for a java_error.sh will drop an email
> with a timestamp
>
>
>
> echo `date` | mail -s "Java Error: General - $HOSTNAME" not...@domain.com
>
>
> 
> From: Tim Potter 
> Sent: Tuesday, October 14, 2014 07:35
> To: solr-user@lucene.apache.org
> Subject: Re: Recovering from Out of Mem
>
> jfyi - the bin/solr script does the following:
>
> -XX:OnOutOfMemoryError="$SOLR_TIP/bin/oom_solr.sh $SOLR_PORT" where
> $SOLR_PORT is the port Solr is bound to, e.g. 8983
>
> The oom_solr.sh script looks like:
>
> SOLR_PORT=$1
>
> SOLR_PID=`ps waux | grep start.jar | grep $SOLR_PORT | grep -v grep | awk
> '{print $2}' | sort -r`
>
> if [ "$SOLR_PID" == "" ]; then
>
>   echo "Couldn't find Solr process running on port $SOLR_PORT!"
>
>   exit
>
> fi
>
> NOW=$(date +"%F%T")
>
> (
>
> echo "Running OOM killer script for process $SOLR_PID for Solr on port
> $SOLR_PORT"
>
> kill -9 $SOLR_PID
>
> echo "Killed process $SOLR_PID"
>
> ) | tee solr_oom_killer-$SOLR_PORT-$NOW.log
>
>
> I usually run Solr behind a supervisor type process (supervisord or
> upstart) that will restart it if the process dies.
>
>
> On Tue, Oct 14, 2014 at 8:09 AM, Markus Jelsma 
> wrote:
>
> > This will do:
> > kill -9 `ps aux | grep -v grep | grep tomcat6 | awk '{print $2}'`
> >
> > pkill should also work
> >
> > On Tuesday 14 October 2014 07:02:03 Yago Riveiro wrote:
> > > Boogie,
> > >
> > >
> > >
> > >
> > > Any example for java_error.sh script?
> > >
> > >
> > > —
> > > /Yago Riveiro
> > >
> > > On Tue, Oct 14, 2014 at 2:48 PM, Boogie Shafer <
> > boogie.sha...@proquest.com>
> > >
> > > wrote:
> > > > a really simple approach is to have the OOM generate an email
> > > > e.g.
> > > > 1) create a simple script (call it java_oom.sh) and drop it in your
> > tomcat
> > > > bin dir echo `date` | mail -s "Java Error: OutOfMemory - $HOSTNAME"
> > > > not...@domain.com 2) configure your java options (in setenv.sh or
> > > > similar) to trigger heap dump and the email script when OOM occurs #
> > > > config error behaviors
> > > > CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError
> > > > -XX:HeapDumpPath=$TOMCAT_DIR/temp/tomcat-dump.hprof
> > > > -XX:OnError=$TOMCAT_DIR/bin/java_error.sh
> > > > -XX:OnOutOfMemoryError=$TOMCAT_DIR/bin/java_oom.sh
> > > > -XX:ErrorFile=$TOMCAT_DIR/temp/java_error%p.log"
> > > > 
> > > > From: Mark Miller 
> > > > Sent: Tuesday, October 14, 2014 06:30
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Recovering from Out of Mem
> > > > Best is to pass the Java cmd line option that kills the process on
> OOM
> > and
> > > > setup a supervisor on the process to restart it.  You need a somewhat
> > > > recent release for this to work properly though. - Mark
> > > >
> > > >> On Oct 14, 2014, at 9:06 AM, Salman Akram
> > > >>  wrote:
> > > >>
> > > >> I know there are some suggestions to avoid OOM issue e.g. setting
> > > >> appropriate Max Heap size etc. However, what's the best way to
> recover
> > > >> from
> > > >> it as it goes into non-responding state? We are using Tomcat on back
> > end.
> > > >>
> > > >> The scenario is that once we face OOM issue it keeps on taking
> queries
> > > >> (doesn't give any error) but they just time out. So even though we
> > have a
> > > >> fail over system implemented but we don't have a way to distinguish
> if
> > > >> these are real time out queries OR due to OOM.
> > > >>
> > > >> --
> > > >> Regards,
> > > >>
> > > >> Salman Akram
> >
> >
>



-- 
Regards,

Salman Akram


Re: Recovering from Out of Mem

2014-10-19 Thread Salman Akram
I assume you will have to write a script to restart the service as well?

On Fri, Oct 17, 2014 at 7:17 PM, Tim Potter 
wrote:

> You'd still want to kill it ... so you'll need to register a cmd script
> with the JVM using -XX:OnOutOfMemoryError=kill.cmd and then you could
> either
>
> 1) trap the PID at startup using something like:
>
> title SolrCloud
>
> for /F "tokens=2 delims= " %%A in ('TASKLIST /FI ^"WINDOWTITLE eq
> SolrCloud^" /NH') do (
>
> set /A SOLR_PID=%%A
>
> echo !SOLR_PID!>solr.pid
>
>
> or
>
>
> 2) if you keep track of the port (which all my Windows scripts do), then
> you can do:
>
>
> For /f "tokens=5" %%j in ('netstat -aon ^| find /i "listening" ^| find
> ":%SOLR_PORT%"') do (
>
>   taskkill /t /f /pid %%j > nul 2>&1
>
> )
>
>
> On Fri, Oct 17, 2014 at 1:11 AM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> > I know this might sound weird but any easy way to do it in Windows?
> >
> > On Tue, Oct 14, 2014 at 7:51 PM, Boogie Shafer <
> boogie.sha...@proquest.com
> > >
> > wrote:
> >
> > > yago,
> > >
> > > you can put more complex restart logic as shown in the examples below
> or
> > > just do something similar to the java_oom.sh i posted earlier where you
> > > just spit out an email alert and deal with service restarts and
> > > troubleshooting manually
> > >
> > >
> > > e.g. something like the following for a java_error.sh will drop an
> email
> > > with a timestamp
> > >
> > >
> > >
> > > echo `date` | mail -s "Java Error: General - $HOSTNAME"
> > not...@domain.com
> > >
> > >
> > > 
> > > From: Tim Potter 
> > > Sent: Tuesday, October 14, 2014 07:35
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Recovering from Out of Mem
> > >
> > > jfyi - the bin/solr script does the following:
> > >
> > > -XX:OnOutOfMemoryError="$SOLR_TIP/bin/oom_solr.sh $SOLR_PORT" where
> > > $SOLR_PORT is the port Solr is bound to, e.g. 8983
> > >
> > > The oom_solr.sh script looks like:
> > >
> > > SOLR_PORT=$1
> > >
> > > SOLR_PID=`ps waux | grep start.jar | grep $SOLR_PORT | grep -v grep |
> awk
> > > '{print $2}' | sort -r`
> > >
> > > if [ "$SOLR_PID" == "" ]; then
> > >
> > >   echo "Couldn't find Solr process running on port $SOLR_PORT!"
> > >
> > >   exit
> > >
> > > fi
> > >
> > > NOW=$(date +"%F%T")
> > >
> > > (
> > >
> > > echo "Running OOM killer script for process $SOLR_PID for Solr on port
> > > $SOLR_PORT"
> > >
> > > kill -9 $SOLR_PID
> > >
> > > echo "Killed process $SOLR_PID"
> > >
> > > ) | tee solr_oom_killer-$SOLR_PORT-$NOW.log
> > >
> > >
> > > I usually run Solr behind a supervisor type process (supervisord or
> > > upstart) that will restart it if the process dies.
> > >
> > >
> > > On Tue, Oct 14, 2014 at 8:09 AM, Markus Jelsma 
> > > wrote:
> > >
> > > > This will do:
> > > > kill -9 `ps aux | grep -v grep | grep tomcat6 | awk '{print $2}'`
> > > >
> > > > pkill should also work
> > > >
> > > > On Tuesday 14 October 2014 07:02:03 Yago Riveiro wrote:
> > > > > Boogie,
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Any example for java_error.sh script?
> > > > >
> > > > >
> > > > > —
> > > > > /Yago Riveiro
> > > > >
> > > > > On Tue, Oct 14, 2014 at 2:48 PM, Boogie Shafer <
> > > > boogie.sha...@proquest.com>
> > > > >
> > > > > wrote:
> > > > > > a really simple approach is to have the OOM generate an email
> > > > > > e.g.
> > > > > > 1) create a simple script (call it java_oom.sh) and drop it in
> your
> > > > tomcat
> > > > > > bin dir echo `date` | mail -s "Java Error: OutOfMemory -
> $HOSTNAME"
> > > > > > not...@domain.com 2) configure your java options (in setenv.sh
> or
> &

Re: Recovering from Out of Mem

2014-10-20 Thread Salman Akram
" That's why it is considered better to crash the program and restart it
for OOME."

In the end aren't you also saying the same thing, or did I misunderstand
something?

We don't get this issue on the master server (indexing). Our real concern is
the slave, where it happens only rarely (so it's not an obvious heap config
issue), but when it does happen our failover (moving to another slave) doesn't
even work, as there is no error. So I just want a good way to know that there
has been an OOM, and then either shift to a failover or have that server
restarted.




On Mon, Oct 20, 2014 at 7:25 PM, Shawn Heisey  wrote:

> On 10/19/2014 11:32 PM, Ramzi Alqrainy wrote:
> > You can create a script to ping on Solr every 10 sec. if no response,
> then
> > restart it (Kill process id and run Solr again).
> > This is the fastest and easiest way to do that on windows.
>
> I wouldn't do this myself.  Any temporary problem that results in a long
> query time might result in a true outage while Solr restarts.  If OOME
> is a problem, then you can deal with that by providing a program for
> Java to call when OOME occurs.
>
> Sending notification when ping times get excessive is a good idea, but I
> wouldn't make it automatically restart, unless you've got a threshold
> for that action so it only happens when the ping time is *REALLY* high.
>
> The real fix for OOME is to make the heap larger or to reduce the heap
> requirements by changing how Solr is configured or used.
>
> http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
>
> Writing a program that has deterministic behavior in an out of memory
> condition is very difficult.  The Lucene devs *have* done this hard work
> in the lower levels of IndexWriter and the specific Directory
> implementations, so that OOME doesn't cause *index corruption*.
>
> In general, once OOME happens, program operation (and in some cases the
> status of the most recently indexed documents) is completely
> undetermined.  We can be sure that the data which has already been
> written to disk will be correct, but nothing beyond that.  That's why it
> is considered better to crash the program and restart it for OOME.
>
> Thanks,
> Shawn
>
>


-- 
Regards,

Salman Akram


Re: Recovering from Out of Mem

2014-10-21 Thread Salman Akram
Yes, so the most important thing is: what's the best way to 'know' that there
has been an OOM? Some script that pings it with a 1-2 minute timeout?

The reason I want an auto restart, or at least some error (so that it can
switch to another slave), is that I want to sleep well if something goes wrong
at night: the systems keep working and we can look into the details in the
morning. That's the whole purpose of having a failover implemented.

On a side note, the instance where we had this OOM didn't have an explicit Xmx
set (on 64-bit Windows), so is there some default max in that case? There was
ample memory available, so why would it throw OOM?

On Mon, Oct 20, 2014 at 9:00 PM, Boogie Shafer 
wrote:

>
> i think we can agree that the basic requirement of *knowing* when the OOM
> occurs is the minimal requirement, triggering an alert (email, etc) would
> be the first thing to get into your script
>
> once you know when the OOM conditions are occuring you can start to get to
> the root cause or remedy (adjust heap sizes, or adjust the input side that
> is triggering the OOM). the correct remedy will obviously require some more
> deeper investigation into the actual solr usage at the point of OOM and the
> gc logs (you have these being generated too i hope). just bumping the Xmx
> because you hit an OOM during an abusive query is no guarantee of a fix and
> is likely going to cost you OS cache memory space which you want to leave
> available for holding the actual index data. the real fix would be cleaning
> up the query (if that is possible)
>
> fundamentally, its a preference thing, but i'm personally not a fan of
> auto restarts as the problem that triggered the original OOM (say an
> expensive poorly constructed query) may just come back and you get into an
> oscillating situation of restart after restart. i generally want a human
> involved when error conditions which should be outliers (like OOM) are
> happening
>
>
> 
> From: Salman Akram 
> Sent: Monday, October 20, 2014 08:47
> To: Solr Group
> Subject: Re: Recovering from Out of Mem
>
> " That's why it is considered better to crash the program and restart it
> for OOME."
>
> In the end aren't you also saying the same thing or I misunderstood
> something?
>
> We don't get this issue on master server (indexing). Our real concern is
> slave where sometimes (rare) so not an obvious heap config issue but when
> it happens our failover doesn't even work (moving to another slave) as
> there is no error so I just want a good way to know if there is an OOM and
> shift to a failover or just have that server restarted.
>
>
>
>
> On Mon, Oct 20, 2014 at 7:25 PM, Shawn Heisey  wrote:
>
> > On 10/19/2014 11:32 PM, Ramzi Alqrainy wrote:
> > > You can create a script to ping on Solr every 10 sec. if no response,
> > then
> > > restart it (Kill process id and run Solr again).
> > > This is the fastest and easiest way to do that on windows.
> >
> > I wouldn't do this myself.  Any temporary problem that results in a long
> > query time might result in a true outage while Solr restarts.  If OOME
> > is a problem, then you can deal with that by providing a program for
> > Java to call when OOME occurs.
> >
> > Sending notification when ping times get excessive is a good idea, but I
> > wouldn't make it automatically restart, unless you've got a threshold
> > for that action so it only happens when the ping time is *REALLY* high.
> >
> > The real fix for OOME is to make the heap larger or to reduce the heap
> > requirements by changing how Solr is configured or used.
> >
> > http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
> >
> > Writing a program that has deterministic behavior in an out of memory
> > condition is very difficult.  The Lucene devs *have* done this hard work
> > in the lower levels of IndexWriter and the specific Directory
> > implementations, so that OOME doesn't cause *index corruption*.
> >
> > In general, once OOME happens, program operation (and in some cases the
> > status of the most recently indexed documents) is completely
> > undetermined.  We can be sure that the data which has already been
> > written to disk will be correct, but nothing beyond that.  That's why it
> > is considered better to crash the program and restart it for OOME.
> >
> > Thanks,
> > Shawn
> >
> >
>
>
> --
> Regards,
>
> Salman Akram
>



-- 
Regards,

Salman Akram


Common Grams Highlighting

2013-08-20 Thread Salman Akram
Hi,

I have gone through a lot of posts about highlighting issues with Common
Grams but I am still a little confused. Below are my requirements:

- Highlighting needs to work properly with Common Grams
- Phrase highlighting needs to work
- Wildcard highlighting needs to work

Is this possible with Phrase Highlighter (with some patch)? e.g.
https://issues.apache.org/jira/browse/LUCENE-1489 (everything works fine
for me except the issue mentioned in this link)

Is this possible with Fast Vector Highlighter (wildcard/phrase highlighting
needs to work too)?

Is this possible with new Postings Highlighter
https://issues.apache.org/jira/browse/LUCENE-4290

If the answer is NO for all of the above questions, then what if I index the
same field twice: once 'indexed' with common grams for search and once just
'stored' without common grams for highlighting? Will this work, and would it
have any impact on performance/size?
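
In schema terms I mean something like this (field and type names are only
placeholders):

<!-- searched field: common-grams analysis, not stored -->
<field name="Contents"   type="text_cg"    indexed="true" stored="false"/>

<!-- highlighting field: plain analysis, stored -->
<field name="ContentsHL" type="text_plain" indexed="true" stored="true"/>

<copyField source="Contents" dest="ContentsHL"/>

and then search against Contents while pointing hl.fl at ContentsHL.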


-- 
Regards,

Salman Akram


MaxRows and disabling sort

2011-01-14 Thread Salman Akram
Hi,

I want to limit my SOLR results so that it stops searching further once it
finds a certain number of records (just like 'limit' in MySQL).

I know it has the timeAllowed property, but is there anything like MaxRows? I
am NOT talking about the 'rows' attribute, which returns a specific number of
rows to the client. This seems a very nice way to stop SOLR from traversing
the complete index, but I am not sure if there is anything like this.

Also, I guess the default sorting is on score, and sorting can only be done
once it has the scores of all matches, so limiting it to the max rows becomes
useless. So is there a way to disable sorting, e.g. so that it returns rows as
it finds them, without any order?

Thanks!


-- 
Regards,

Salman Akram
Cell: +92-321-4391210


Re: MaxRows and disabling sort

2011-01-14 Thread Salman Akram
In some cases my search takes too long. I want to show the user partial
matches if it's taking too long.

The problem with timeAllowed is that, let's say I set its value to 10 secs:
for some queries that would be fine and it will at least return a few hundred
rows, but in really bad scenarios it might not even return a few records in
that time (even 0 is entirely possible), so the user would think nothing
matched even though there were many matches.

Telling SOLR to return the first 20/50 records would ensure that it at least
returns the user the first page, even if it takes more time.

On Sat, Jan 15, 2011 at 3:11 AM, Erick Erickson wrote:

> Why do you want to do this? That is, what problem do you think would
> be solved by this? Because there are other problems if you're trying to,
> say, return all rows that match
>
> But no, there's nothing that I know of that would do what you want (of
> course that doesn't mean there isn't).
>
> Best
> Erick
>
> On Fri, Jan 14, 2011 at 12:17 PM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> > Hi,
> >
> > I want to limit my SOLR results so that it stops further searching once
> it
> > founds a certain number of records (just like 'limit' in MySQL).
> >
> > I know it has timeAllowed property but is there anything like MaxRows? I
> am
> > NOT talking about 'rows' attribute which returns a specific no. of rows
> to
> > client. This seems a very nice way to stop SOLR from traversing through
> the
> > complete index but I am not sure if there is anything like this.
> >
> > Also I guess default sorting is on Scoring and sorting can only be done
> > once
> > it has the scores of all matches so then limiting it to the max rows
> > becomes
> > useless. So if there a way to disable sorting? e.g. it returns the rows
> as
> > it finds without any order?
> >
> > Thanks!
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> > Cell: +92-321-4391210
> >
>



-- 
Regards,

Salman Akram
Senior Software Engineer - Tech Lead
80-A, Abu Bakar Block, Garden Town, Pakistan
Cell: +92-321-4391210


Re: MaxRows and disabling sort

2011-01-15 Thread Salman Akram
It still returns me the (total) 'numFound', which means it's scanning all
records.

So it seems that, except for timeAllowed, there is no way to tell SOLR to stop
searching through all records?

Thanks!

On Sat, Jan 15, 2011 at 7:03 AM, Chris Hostetter
wrote:

>
> : Also I guess default sorting is on Scoring and sorting can only be done
> once
> : it has the scores of all matches so then limiting it to the max rows
> becomes
> : useless. So if there a way to disable sorting? e.g. it returns the rows
> as
> : it finds without any order?
>
> http://wiki.apache.org/solr/CommonQueryParameters#sort
> "You can sort by index id using sort=_docid_ asc or sort=_docid_ desc"
>
> if you specify _docid_ asc then solr should return as soon as it finds the
> first N matching results w/o scoring all docs (because no score will be
> computed)
>
> if you use any complex features however (faceting or what not) then it
> will still most likely need to scan all docs.
>
>
> -Hoss
>



-- 
Regards,

Salman Akram


Lucene 2.9.x vs 3.x

2011-01-15 Thread Salman Akram
Hi,

SOLR 1.4.1 uses Lucene 2.9.3 by default (I think so). I have a few questions:

Are there any major performance (or other) improvements in Lucene
3.0.3/Lucene 2.9.4?

Does 3.x have major compatibility issues moving from 2.9.x?

Will a SOLR 1.4.1 build work fine with Lucene 3.0.3?

Thanks!

-- 
Regards,

Salman Akram
Senior Software Engineer - Tech Lead
80-A, Abu Bakar Block, Garden Town, Pakistan
Cell: +92-321-4391210


TVF file

2011-01-16 Thread Salman Akram
Hi,

From my understanding the TVF file stores the term vectors
(positions/offsets), so if no field has Field.TermVector set (the default is
NO), it shouldn't be created, right?

I have an index created through SOLR in which no field had any value set for
termVectors, so by default they shouldn't be saved. All the fields are either
String or Text. All fields have just the indexed and stored attributes set to
true. String fields have omitNorms = true as well.
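
To be explicit, the field definitions are all of this shape, i.e. nothing sets
termVectors (the names here are placeholders):

<field name="SomeStringField" type="string" indexed="true" stored="true"
       omitNorms="true"/>
<field name="SomeTextField"   type="text"   indexed="true" stored="true"/>
<!-- termVectors defaults to false, so no .tvx/.tvd/.tvf data should be written -->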

Even in Luke it doesn't show the V (Term Vector) flag, but I have a big TVF
file in my index. It's almost 30% of the total index (around 60% is the PRX
positions file).

Also in Luke it shows 'f' (omitTF) flag for strings but not for text fields.

Any ideas what's going on? Thanks!

-- 
Regards,

Salman Akram
Senior Software Engineer - Tech Lead
80-A, Abu Bakar Block, Garden Town, Pakistan
Cell: +92-321-4391210


Re: TVF file

2011-01-16 Thread Salman Akram
Some more info: I copied this from Luke, and below is what it says for...

Text Fields --> stored/uncompressed,indexed,tokenized
String Fields --> stored/uncompressed,indexed,omitTermFreqAndPositions

The main contents field is not stored, so it doesn't show up in Luke, but it
is analyzed and tokenized for searching.

On Sun, Jan 16, 2011 at 3:50 PM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> Hi,
>
> From my understanding TVF file stores the Term Vectors (Positions/Offset)
> so if no field has Field.TermVector set (default is NO) so it shouldn't be
> created, right?
>
> I have an index created through SOLR on which no field had any value for
> TermVectors so by default it shouldn't be saved. All the fields are either
> String or Text. All fields have just indexed and stored attributes set to
> True. String fields have omitNorms = true as well.
>
> Even in Luke it doesn't show V (Term Vector) flag but I have a big TVF file
> in my index. Its almost 30% of the total index (around 60% is the PRX
> positions file).
>
> Also in Luke it shows 'f' (omitTF) flag for strings but not for text
> fields.
>
> Any ideas what's going on? Thanks!
>
> --
> Regards,
>
> Salman Akram
> Senior Software Engineer - Tech Lead
> 80-A, Abu Bakar Block, Garden Town, Pakistan
> Cell: +92-321-4391210
>



-- 
Regards,

Salman Akram
Senior Software Engineer - Tech Lead
80-A, Abu Bakar Block, Garden Town, Pakistan
Cell: +92-321-4391210


Re: TVF file

2011-01-16 Thread Salman Akram
Nope. I optimized it with the standard file format and cleaned up the index
dir through Luke. It adds up to the same total size when I optimize it with
the compound file format.

On Sun, Jan 16, 2011 at 5:46 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Is it possible that the tvf file you are looking at is old (i.e. not part
> of
> your active index)?
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> ----- Original Message 
> > From: Salman Akram 
> > To: solr-user@lucene.apache.org
> > Sent: Sun, January 16, 2011 6:17:23 AM
> > Subject: Re: TVF file
> >
> > Some more info I copied it from Luke and below is what it says  for...
> >
> > Text Fields --> stored/uncompressed,indexed,tokenized
> > String  Fields --> stored/uncompressed,indexed,omitTermFreqAndPositions
> >
> > The  main contents field is not stored so it doesn't show up on Luke but
> that
> > is  Analyzed and Tokenized for searching.
> >
> > On Sun, Jan 16, 2011 at 3:50 PM,  Salman Akram <
> > salman.ak...@northbaysolutions.net>  wrote:
> >
> > > Hi,
> > >
> > > From my understanding TVF file stores  the Term Vectors
> (Positions/Offset)
> > > so if no field has Field.TermVector  set (default is NO) so it
> shouldn't be
> > > created, right?
> > >
> > > I  have an index created through SOLR on which no field had any value
> for
> > >  TermVectors so by default it shouldn't be saved. All the fields are
>  either
> > > String or Text. All fields have just indexed and stored  attributes set
> to
> > > True. String fields have omitNorms = true as  well.
> > >
> > > Even in Luke it doesn't show V (Term Vector) flag but I  have a big TVF
> file
> > > in my index. Its almost 30% of the total index  (around 60% is the PRX
> > > positions file).
> > >
> > > Also in Luke it  shows 'f' (omitTF) flag for strings but not for text
> > >  fields.
> > >
> > > Any ideas what's going on? Thanks!
> > >
> > >  --
> > > Regards,
> > >
> > > Salman Akram
> > > Senior Software  Engineer - Tech Lead
> > > 80-A, Abu Bakar Block, Garden Town,  Pakistan
> > > Cell: +92-321-4391210
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
> >
>



-- 
Regards,

Salman Akram


Re: TVF file

2011-01-16 Thread Salman Akram
Please see below the dir listing and the relevant part of the schema file (I
have removed the name part from the fields for obvious reasons).

Also, regarding the .frq file: why exactly is it needed? Is it required for
phrase searching too (I am not using highlighting or MoreLikeThis on this
index)? And is it correct that it is not created if all fields use omitTF?

Thanks a lot!

--Dir Listing--
01/16/2011  06:05 AM  .
01/16/2011  06:05 AM  ..
01/15/2011  03:58 PM  log
04/22/2010  12:42 AM   549 luke.jnlp
01/16/2011  04:58 AM20 segments.gen
01/16/2011  04:58 AM   287 segments_5hl
01/16/2011  02:17 AM 4,760,716,827 _36w.fdt
01/16/2011  02:17 AM   107,732,836 _36w.fdx
01/16/2011  02:15 AM 4,032 _36w.fnm
01/16/2011  04:36 AM25,221,109,245 _36w.frq
01/16/2011  04:38 AM 4,457,445,928 _36w.nrm
01/16/2011  04:36 AM   126,866,227,056 _36w.prx
01/16/2011  04:36 AM22,510,915 _36w.tii
01/16/2011  04:36 AM 1,635,096,862 _36w.tis
01/16/2011  04:58 AM18,341,750 _36w.tvd
01/16/2011  04:58 AM78,450,397,739 _36w.tvf
01/16/2011  04:58 AM   215,465,668 _36w.tvx
  14 File(s) 241,755,049,714 bytes
   3 Dir(s)  1,072,112,025,600 bytes free


-Schema File--

[fieldType and field definitions omitted]



On Sun, Jan 16, 2011 at 6:52 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hm, want to email the index dir listing (ls -lah) + the field type and
> field
> definitions from your schema.xml?
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> ----- Original Message 
> > From: Salman Akram 
> > To: solr-user@lucene.apache.org
> > Sent: Sun, January 16, 2011 7:51:15 AM
> > Subject: Re: TVF file
> >
> > Nops. I optimized it with Standard File Format and cleaned up Index  dir
> > through Luke. It adds upto to the total size when I optimized it  with
> > Compound File Format.
> >
> > On Sun, Jan 16, 2011 at 5:46 PM, Otis  Gospodnetic <
> > otis_gospodne...@yahoo.com>  wrote:
> >
> > > Is it possible that the tvf file you are looking at is old  (i.e. not
> part
> > > of
> > > your active index)?
> > >
> > >  Otis
> > > 
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene  ecosystem search :: http://search-lucene.com/
> > >
> > >
> > >
> > > - Original  Message 
> > > > From: Salman Akram 
> > >  > To: solr-user@lucene.apache.org
> > >  > Sent: Sun, January 16, 2011 6:17:23 AM
> > > > Subject: Re: TVF  file
> > > >
> > > > Some more info I copied it from Luke and below is  what it says
>  for...
> > > >
> > > > Text Fields -->  stored/uncompressed,indexed,tokenized
> > > > String  Fields -->
>  stored/uncompressed,indexed,omitTermFreqAndPositions
> > > >
> > > >  The  main contents field is not stored so it doesn't show up on Luke
>  but
> > > that
> > > > is  Analyzed and Tokenized for  searching.
> > > >
> > > > On Sun, Jan 16, 2011 at 3:50 PM,   Salman Akram <
> > > > salman.ak...@northbaysolutions.net>   wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > >  From my understanding TVF file stores  the Term Vectors
> > >  (Positions/Offset)
> > > > > so if no field has Field.TermVector   set (default is NO) so it
> > > shouldn't be
> > > > > created,  right?
> > > > >
> > > > > I  have an index created through  SOLR on which no field had any
> value
> > > for
> > > > >   TermVectors so by default it shouldn't be saved. All the fields
>  are
> > >  either
> > > > > String or Text. All fields have just  indexed and stored
>  attributes set
> > > to
> > > > > True.  String fields have omitNorms = true as  well.
> > > > >
> > >  > > Even in Luke it doesn't show V (Term Vector) flag but I  have a
>  big
> TVF
> > > file
> > > > > in my index. Its almost 30% of the total  index  (around 60% is the
> PRX
> > > > > positions file).
> > >  > >
> > > > > Also in Luke it  shows 'f' (omitTF) flag for  strings but not for
> text
> > > > >  fields.
> > > >  >
> > > > > Any ideas what's going on? Thanks!
> > > >  >
> > > > >  --
> > > > > Regards,
> > > >  >
> > > > > Salman Akram
> > > > > Senior Software   Engineer - Tech Lead
> > > > > 80-A, Abu Bakar Block, Garden Town,   Pakistan
> > > > > Cell: +92-321-4391210
> > > > >
> > >  >
> > > >
> > > >
> > > > --
> > > > Regards,
> > >  >
> > > > Salman Akram
> > > >
> > >  >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Re: TVF file

2011-01-16 Thread Salman Akram
Well anyways thanks for the help.

Also can you please reply to this about .frq file (since that's quite big
too).

"Also regarding .frq file why exactly is it needed? Is it required in phrase
searching (I am not using highlighting or MoreLikeThis on this index  file)
too? and this is not made if all fields are using omitTF?"
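
(Just to be concrete about what "using omitTF" means here, the string fields are
declared along these lines - the field name is only an example:

<field name="SomeMetaField" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true"/>

while the main Contents text field does not carry that flag.)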

On Mon, Jan 17, 2011 at 10:18 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hm, this is a mystery to me - I don't see anything that would turn on Term
> Vectors...
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Salman Akram 
> > To: solr-user@lucene.apache.org
> > Sent: Sun, January 16, 2011 2:26:53 PM
> > Subject: Re: TVF file
> >
> > Please see below the dir listing and relevant part of schema file (I
>  have
> > removed the name part from fields for obvious reasons).
> >
> > Also  regarding .frq file why exactly is it needed? Is it required in
>  phrase
> > searching (I am not using highlighting or MoreLikeThis on this index
>  file)
> > too? and this is not made if all fields are using omitTF?
> >
> > Thanks  alot!
> >
> > --Dir Listing--
> > 01/16/2011   06:05 AM   .
> > 01/16/2011  06:05 AM   ..
> > 01/15/2011  03:58 PM   log
> > 04/22/2010  12:42 AM549 luke.jnlp
> > 01/16/2011  04:58 AM 20  segments.gen
> > 01/16/2011  04:58 AM287 segments_5hl
> > 01/16/2011  02:17 AM  4,760,716,827 _36w.fdt
> > 01/16/2011  02:17 AM107,732,836 _36w.fdx
> > 01/16/2011  02:15 AM  4,032 _36w.fnm
> > 01/16/2011  04:36 AM 25,221,109,245 _36w.frq
> > 01/16/2011  04:38 AM 4,457,445,928  _36w.nrm
> > 01/16/2011  04:36 AM   126,866,227,056  _36w.prx
> > 01/16/2011  04:36 AM22,510,915  _36w.tii
> > 01/16/2011  04:36 AM 1,635,096,862  _36w.tis
> > 01/16/2011  04:58 AM18,341,750  _36w.tvd
> > 01/16/2011  04:58 AM78,450,397,739  _36w.tvf
> > 01/16/2011  04:58 AM   215,465,668  _36w.tvx
> >   14 File(s)  241,755,049,714 bytes
> >3  Dir(s)  1,072,112,025,600 bytes free
> >
> >
> > -Schema  File--
> >
> > [schema.xml field type and field definitions were stripped by the archive;
> > only fragments such as sortMissingLast="true", omitNorms="true" and
> > luceneMatchVersion="LUCENE_29" survive]
> > 
> >
> >
> >
> > On Sun,  Jan 16, 2011 at 6:52 PM, Otis Gospodnetic <
> > otis_gospodne...@yahoo.com>  wrote:
> >
> > > Hm, want to email the index dir listing (ls -lah) + the field  type and
> > > field
> > > definitions from your schema.xml?
> > >
> > >  Otis
> > > 
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene  ecosystem search :: http://search-lucene.com/
> > >
> > >
> > >
> > > - Original  Message 
> > > > From: Salman Akram 
> > >  > To: solr-user@lucene.apache.org
> > >  > Sent: Sun, January 16, 2011 7:51:15 AM
> > > > Subject: Re: TVF  file
> > > >
> > > > Nops. I optimized it with Standard File Format  and cleaned up Index
>  dir
> > > > through Luke. It adds upto to the  total size when I optimized it
>  with
> > > > Compound File  Format.
> > > >
> 

CommonGrams phrase query

2011-01-17 Thread Salman Akram
Hi,

I have made an index using CommonGrams. Now when I query "a b" and explain
it, SOLR makes it +MultiPhraseQuery(Contents:"(a a_b) b").

Shouldn't it just be searching "a_b"? I am asking this because even though I am
using CommonGrams it's much slower than the normal index, which just searches
on "a b".

Note: Both words are in the words list of CommonGrams.

-- 
Regards,

Salman Akram
Senior Software Engineer - Tech Lead
80-A, Abu Bakar Block, Garden Town, Pakistan
Cell: +92-321-4391210


Re: CommonGrams phrase query

2011-01-17 Thread Salman Akram
Ok sorry it was my fault.

I wasn't using CommonGramsQueryFilter for query, just had Filter for
indexing. The query seems fine now.
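
For anyone else who hits this, a minimal field type along these lines (type
name and words file are just placeholders, not my exact config) uses the plain
filter at index time and the query variant at query time:

<fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="commongrams.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

With the query-side filter in place, the parsed query for "a b" comes out as a
single bigram term instead of the MultiPhraseQuery above.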

On Mon, Jan 17, 2011 at 1:44 PM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> Hi,
>
> I have made an index using CommonGrams. Now when I query "a b" and explain
> it, SOLR makes it +MultiPhraseQuery(Contents:"(a a_b) b").
>
> Shouldn't it just be searching "a_b"? I am asking this coz even though I am
> using CommonGrams it's much slower than normal index which just searches on
> "a b".
>
> Note: Both words are in the words list of CommonGrams.
>
> --
> Regards,
>
> Salman Akram
>


-- 
Regards,

Salman Akram


Re: sort problem

2011-01-17 Thread Salman Akram
Yes.

On Mon, Jan 17, 2011 at 2:44 PM, Philippe VINCENT-ROYOL <
vincent.ro...@gmail.com> wrote:

> Le 17/01/11 10:32, Grijesh a écrit :
>
>  Use Lowercase filter to lowering your data at both index time and search
>> time
>> it will make case insensitive
>>
>> -
>> Thanx:
>> Grijesh
>>
> Thanks,
> so tell me if i m wrong... i need to modify my schema.xml to add lowercase
> filter and reindex my content?
>
>
>


-- 
Regards,

Salman Akram
Senior Software Engineer - Tech Lead
80-A, Abu Bakar Block, Garden Town, Pakistan
Cell: +92-321-4391210


Re: FilterQuery reaching maxBooleanClauses, alternatives?

2011-01-17 Thread Salman Akram
You can index a field which holds the User type, e.g. UserType (possible
values can be TypeA, TypeB and so on...) and then you can just do

?q=name:Stefan&fq=UserType:TypeB

BTW you can even increase maxBooleanClauses, but in this case that is
definitely not a good idea. Also you would hit the maximum URL length of HTTP
GET so you would have to switch to POST. Better to handle it with a new
field.
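
For completeness, the pieces involved would be roughly these (names are just
examples) - the field in schema.xml:

<field name="UserType" type="string" indexed="true" stored="false"/>

and, only if you really had to raise the clause limit instead, the setting in
solrconfig.xml:

<maxBooleanClauses>2048</maxBooleanClauses>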

On Mon, Jan 17, 2011 at 5:57 PM, Stefan Matheis <
matheis.ste...@googlemail.com> wrote:

> Hi List,
>
> we are sometimes reaching the maxBooleanClauses Limit (which is 1024, per
> default). So, the used query looks like:
>
> ?q=name:Stefan&fq=5 10 12 15 16 [...]
>
> where the values are ids of users, which the current user is allowed to see
> - so long, nothing special. sometimes the filter-query includes user-ids
> from an different Type of User (let's say we have TypeA and TypeB) where
> TypeB contains more then 2k users. Then we hit the given Limit.
>
> Now the Question is .. is it possible to enable an Filter/Function/Feature
> in Solr, which it makes possible, that we don't need to send over alle the
> user ids from TypeB Users? Just to tell Solr "include all TypeB Users in
> the
> (given) FilterQuery" (or something in that direction)?
>
> If so, what's the Name of this Filter/Function/Feature? :)
>
> Don't hesitate to ask, if my question/description is weird!
>
> Thanks
> Stefan
>



-- 
Regards,

Salman Akram


Re: FilterQuery reaching maxBooleanClauses, alternatives?

2011-01-17 Thread Salman Akram
You are welcome.

By new field I meant if you don't have a field for UserType already.

On Mon, Jan 17, 2011 at 6:22 PM, Stefan Matheis <
matheis.ste...@googlemail.com> wrote:

> Thanks Salman,
>
> talking with others about problems really helps. Adding another FilterQuery
> is a bit too much - but combining both is working fine!
>
> not seen the wood for the trees =)
> Thanks, Stefan
>
>
> On Mon, Jan 17, 2011 at 2:07 PM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> > You can index a field which can the User types e.g. UserType (possible
> > values can be TypeA,TypeB and so on...) and then you can just do
> >
> > ?q=name:Stefan&fq=UserType:TypeB
> >
> > BTW you can even increase the size of maxBooleanClauses but in this case
> > definitely this is not a good idea. Also you would hit the max limit of
> > HTTP
> > GET so you will have to change it to POST. Better handle it with a new
> > field.
> >
> > On Mon, Jan 17, 2011 at 5:57 PM, Stefan Matheis <
> > matheis.ste...@googlemail.com> wrote:
> >
> > > Hi List,
> > >
> > > we are sometimes reaching the maxBooleanClauses Limit (which is 1024,
> per
> > > default). So, the used query looks like:
> > >
> > > ?q=name:Stefan&fq=5 10 12 15 16 [...]
> > >
> > > where the values are ids of users, which the current user is allowed to
> > see
> > > - so long, nothing special. sometimes the filter-query includes
> user-ids
> > > from an different Type of User (let's say we have TypeA and TypeB)
> where
> > > TypeB contains more then 2k users. Then we hit the given Limit.
> > >
> > > Now the Question is .. is it possible to enable an
> > Filter/Function/Feature
> > > in Solr, which it makes possible, that we don't need to send over alle
> > the
> > > user ids from TypeB Users? Just to tell Solr "include all TypeB Users
> in
> > > the
> > > (given) FilterQuery" (or something in that direction)?
> > >
> > > If so, what's the Name of this Filter/Function/Feature? :)
> > >
> > > Don't hesitate to ask, if my question/description is weird!
> > >
> > > Thanks
> > > Stefan
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


CommonGrams and SOLR - 1604

2011-01-17 Thread Salman Akram
Hi,

I am trying to use CommonGrams with the SOLR-1604 patch but it doesn't seem to
work.

If I don't add {!complexphrase} it uses CommonGramsQueryFilterFactory and
proper bi-grams are made but of course doesn't use this patch.

If I add {!complexphrase} it simply does it the old way i.e. ignore
CommonGrams.

Does anyone know how to combine both these features?

Also once they are combined (hopefully they will be) would phrase proximity
search work fine?

Thanks

-- 
Regards,

Salman Akram


Re: CommonGrams and SOLR - 1604

2011-01-18 Thread Salman Akram
Anyone?


On Mon, Jan 17, 2011 at 7:48 PM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> Hi,
>
> I am trying to use CommonGrams with SOLR - 1604 patch but doesn't seem to
> work.
>
> If I don't add {!complexphrase} it uses CommonGramsQueryFilterFactory and
> proper bi-grams are made but of course doesn't use this patch.
>
> If I add {!complexphrase} it simply does it the old way i.e. ignore
> CommonGrams.
>
> Does anyone know how to combine both these features?
>
> Also once they are combined (hopefully they will be) would phrase proximity
> search work fine?
>
> Thanks
>
> --
> Regards,
>
> Salman Akram
>
>


-- 
Regards,

Salman Akram


Mem allocation - SOLR vs OS

2011-01-18 Thread Salman Akram
Hi,

I know this is a subjective topic but from what I have read it seems more
RAM should be spared for OS caching and much less for SOLR/Tomcat even on a
dedicated SOLR server.

Can someone give me an idea about the theoretically ideal proportion b/w
them for a dedicated Windows server with 32GB RAM? Also the index is updated
every hour.
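
To make the question concrete: would something like

set CATALINA_OPTS=-Xms2g -Xmx6g

(i.e. roughly 6GB for Tomcat/SOLR and the remaining ~26GB left to Windows for
file caching) be a sane starting point, or should the heap be bigger? The
figures are just an example.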

-- 
Regards,

Salman Akram


Re: Mem allocation - SOLR vs OS

2011-01-18 Thread Salman Akram
In case it helps there are two SOLR indexes (160GB and 700GB) on the
machine.

Also these are separate indexes and not shards, so would it help to put them
on two separate Tomcat servers on the same machine? This way I think one index
won't be affecting the other's cache.

On Wed, Jan 19, 2011 at 12:00 PM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> Hi,
>
> I know this is a subjective topic but from what I have read it seems more
> RAM should be spared for OS caching and much less for SOLR/Tomcat even on a
> dedicated SOLR server.
>
> Can someone give me an idea about the theoretically ideal proportion b/w
> them for a dedicated Windows server with 32GB RAM? Also the index is updated
> every hour.
>
> --
> Regards,
>
> Salman Akram
>
>


-- 
Regards,

Salman Akram


Re: Mem allocation - SOLR vs OS

2011-01-19 Thread Salman Akram
Actually we don't have much load on the server (usage is currently quite low)
but user queries are very complex, e.g. long phrases/multiple
proximity/wildcard etc., so I know these values need to be tried out, but I
wanted to see what the right 'start' is so that I am not way off.

Also, regarding Solr cores, just to clarify: they are totally different indexes
(not 2 parts of one index) so the queries on them are separate - do you still
think it's better to keep them on two cores?

Thanks a lot!

On Wed, Jan 19, 2011 at 9:43 PM, Erick Erickson wrote:

> You're better off using two cores on the same Solr instance rather than two
> instances of Tomcat, that way you avoid some overhead.
>
> The usual advice is to monitor the Solr caches, particularly for evictions
> and
> size the Solr caches accordingly. You can see these from the admin/stats
> page
> and also by mining the logs, looking particularly for cache evictions.
> Since
> cache
> usage is so dependent on the particular installation and usage pattern
> (particularly
> sorting and faceting), "general" advice is hard to give.
>
> Hope this helps
> Erick
>
> On Wed, Jan 19, 2011 at 2:25 AM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> > In case it helps there are two SOLR indexes (160GB and 700GB) on the
> > machine.
> >
> > Also these are separate indexes and not shards so would it help to put
> them
> > on two separate Tomcat servers on same machine? This way I think one
> index
> > won't be affecting others cache.
> >
> > On Wed, Jan 19, 2011 at 12:00 PM, Salman Akram <
> > salman.ak...@northbaysolutions.net> wrote:
> >
> > > Hi,
> > >
> > > I know this is a subjective topic but from what I have read it seems
> more
> > > RAM should be spared for OS caching and much less for SOLR/Tomcat even
> on
> > a
> > > dedicated SOLR server.
> > >
> > > Can someone give me an idea about the theoretically ideal proportion
> b/w
> > > them for a dedicated Windows server with 32GB RAM? Also the index is
> > updated
> > > every hour.
> > >
> > > --
> > > Regards,
> > >
> > > Salman Akram
> > >
> > >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Re: Mem allocation - SOLR vs OS

2011-01-19 Thread Salman Akram
We do have sorting but not faceting. OK so I guess there is no 'hard and
fast rule' as such so I will play with it and see.

Thanks for the help

On Wed, Jan 19, 2011 at 11:48 PM, Markus Jelsma
wrote:

> You only need so much for Solr so it can do its thing. Faceting can take
> quite
> some memory on a large index but sorting can be a really big RAM consumer.
>
> As Erick pointed out, inspect and tune the cache settings and adjust RAM
> allocated to the JVM if required. Using tools like JConsole you can monitor
> various things via JMX including RAM consumption.
>
> > Hi,
> >
> > I know this is a subjective topic but from what I have read it seems more
> > RAM should be spared for OS caching and much less for SOLR/Tomcat even on
> a
> > dedicated SOLR server.
> >
> > Can someone give me an idea about the theoretically ideal proportion b/w
> > them for a dedicated Windows server with 32GB RAM? Also the index is
> > updated every hour.
>



-- 
Regards,

Salman Akram


Re: Mem allocation - SOLR vs OS

2011-01-20 Thread Salman Akram
I will be looking into JConsole.

One more question regarding caching. When we talk about warm-up queries, does
that mean that some of the complex queries (esp. those which require high I/O,
e.g. phrase queries) will really be very slow (on, let's say, an index of
200GB) if they are not cached? I am talking about a difference of more than a
few seconds...

Also, regarding the cache settings, I wanted another piece of advice: when you
talk about evictions do you mean the cumulative or the current ones? I know
ideally the hit rate should be high and evictions low, but can you please look
at the stats below for the documentCache of both indexes and see what they are
really 'saying'. Should the size be increased/decreased/kept the same, or is
this data not enough to judge and should I collect at least a few days of data?

Note: Tomcat was restarted a day back and as I said there isn't much
workload but complex queries and index is updated every hour (currently
haven't implemented replication).

Max Size and InitialSize for both is 4096

lookups : 9849
hits : 5144
hitratio : 0.52
inserts : 4705
evictions : 609
size : 4096
warmupTime : 0
cumulative_lookups : 82492
cumulative_hits : 52059
cumulative_hitratio : 0.63
cumulative_inserts : 30433
cumulative_evictions : 685
--
lookups : 5539
hits : 3765
hitratio : 0.67
inserts : 1774
evictions : 0
size : 1774
warmupTime : 0
cumulative_lookups : 29062
cumulative_hits : 20568
cumulative_hitratio : 0.70
cumulative_inserts : 8494
cumulative_evictions : 0
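
For reference, these two correspond to a solrconfig.xml entry along these lines
(one per core, sizes as above):

<documentCache class="solr.LRUCache" size="4096" initialSize="4096" autowarmCount="0"/>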



On Wed, Jan 19, 2011 at 11:48 PM, Markus Jelsma
wrote:

> You only need so much for Solr so it can do its thing. Faceting can take
> quite
> some memory on a large index but sorting can be a really big RAM consumer.
>
> As Erick pointed out, inspect and tune the cache settings and adjust RAM
> allocated to the JVM if required. Using tools like JConsole you can monitor
> various things via JMX including RAM consumption.
>
> > Hi,
> >
> > I know this is a subjective topic but from what I have read it seems more
> > RAM should be spared for OS caching and much less for SOLR/Tomcat even on
> a
> > dedicated SOLR server.
> >
> > Can someone give me an idea about the theoretically ideal proportion b/w
> > them for a dedicated Windows server with 32GB RAM? Also the index is
> > updated every hour.
>



-- 
Regards,

Salman Akram


Wildcard, OR's inside Phrases (SOLR - 1604) & SurroundQueryParser

2011-01-21 Thread Salman Akram
Hi,

To get phrase search with proximity work fine I am planning to integrate
SurroundQueryParser. However, I wanted to know whether the functionality
provided in SOLR 1604 (i.e. Wildcard, OR's inside Phrases) would work fine
with it or not?

If not, what's the alternative as I need both functionality?

Thanks!

-- 
Regards,

Salman Akram


Re: Wildcard, OR's inside Phrases (SOLR - 1604) & SurroundQueryParser

2011-01-21 Thread Salman Akram
It seems SurroundQueryParser is in Lucene NOT Solr. So does this mean I will
have to integrate it in Lucene and update that jar file in SOLR?

Thanks

On Fri, Jan 21, 2011 at 11:33 PM, Ahmet Arslan  wrote:

>
> --- On Fri, 1/21/11, Salman Akram 
> wrote:
>
> > From: Salman Akram 
> > Subject: Wildcard, OR's inside Phrases (SOLR - 1604) &
> SurroundQueryParser
> > To: solr-user@lucene.apache.org
> > Date: Friday, January 21, 2011, 7:18 PM
> > Hi,
> >
> > To get phrase search with proximity work fine I am planning
> > to integrate
> > SurroundQueryParser. However, I wanted to know whether the
> > functionality
> > provided in SOLR 1604 (i.e. Wildcard, OR's inside Phrases)
> > would work fine
> > with it or not?
>
> surround is somehow superset of complexphrase. But their syntax is
> different.
>
> "a* b*"~10 => a* 10n b* => 10n(a*,b*)
>
> Lucene in Action second edition book (chapter 9.6) talks about surround.
>
>
>
>


-- 
Regards,

Salman Akram


Highlighting with/without Term Vectors

2011-01-24 Thread Salman Akram
Hi,

Does anyone have any benchmarks on how much highlighting speeds up with Term
Vectors (compared to without them)? E.g. if highlighting 20 documents takes
1 sec with Term Vectors, any idea how long it would take without them?

I need to know since the index used for highlighting has a TVF file of
around 450GB (approx 65% of the total index size), so I am trying to see
whether decreasing the index size by dropping the TVF would be more helpful for
performance (less RAM, should be good for I/O too I guess) or whether keeping
it is still better.

I know the best way is to try it out, but indexing takes a very long time so I
am trying to see whether it is even worth it or not.
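
For context, the schema.xml difference in question is only whether the contents
field carries the term vector attributes - something like this (field and type
names are illustrative):

<field name="Contents" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

versus the same field without the three termVectors/termPositions/termOffsets
attributes, in which case the highlighter re-analyzes the stored text instead.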

-- 
Regards,

Salman Akram


Re: Highlighting with/without Term Vectors

2011-01-24 Thread Salman Akram
Just to add one thing, in case it makes a difference.

The max document size on which highlighting needs to be done is a few hundred
KBs (in the file system). In the index it's compressed so it should be much
smaller. Total documents are more than 100 million.

On Tue, Jan 25, 2011 at 12:42 AM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> Hi,
>
> Does anyone have any benchmarks how much highlighting speeds up with Term
> Vectors (compared to without it)? e.g. if highlighting on 20 documents take
> 1 sec with Term Vectors any idea how long it will take without them?
>
> I need to know since the index used for highlighting has a TVF file of
> around 450GB (approx 65% of total index size) so I am trying to see whether
> the decreasing the index size by dropping TVF would be more helpful for
> performance (less RAM, should be good for I/O too I guess) or keeping it is
> still better?
>
> I know the best way is try it out but indexing takes a very long time so
> trying to see whether its even worthy or not.
>
> --
> Regards,
>
> Salman Akram
>
>


-- 
Regards,

Salman Akram


Performance optimization of Proximity/Wildcard searches

2011-01-25 Thread Salman Akram
Hi,

I am facing performance issues in three types of queries (and their
combination). Some of the queries take more than 2-3 mins. Index size is
around 150GB.


   - Wildcard
   - Proximity
   - Phrases (with common words)

I know CommonGrams and Stop words are a good way to resolve such issues but
they don't fulfill our functional requirements (Common Grams seem to have
issues with phrase proximity, stop words have issues with exact match etc).

Sharding is an option too, but that also comes with limitations, so I want to
keep it as a last resort; I think there must be other options because 150GB
is not too big for one drive/server with 32GB RAM.

Cache warming is a good option too, but the index gets updated every hour so I
am not sure how much that would help.

What are the other main tips that can help in performance optimization of
the above queries?

Thanks

-- 
Regards,

Salman Akram


Re: Performance optimization of Proximity/Wildcard searches

2011-01-25 Thread Salman Akram
By warmed index you only mean warming the SOLR cache or OS cache? As I said
our index is updated every hour so I am not sure how much SOLR cache would
be helpful but OS cache should still be helpful, right?

I haven't compared the results with a proper script but from manual testing
here are some of the observations.

'Recent' queries which are in the cache of course return immediately (only if
they are exactly the same - even if they took 3-4 mins the first time). I will
need to test how many recent queries stay in the cache, but still this would
work only for very common queries. Users can run different queries and I want
at least those to be at an 'acceptable' level (5-10 secs) even if not very fast.

Our warm-up script currently executes all distinct queries in our logs
having count > 5. It was run yesterday (with the index still being updated
every hour after that) and today when I executed some of the same queries again
their time seemed a little lower (around 15-20%); I am not sure if this means
anything. However, their time is still not acceptable.
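
As a side note, the same warm-up can also be wired into solrconfig.xml as a
newSearcher listener so that it runs automatically after every commit -
something along these lines, with the queries taken from the log analysis:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">"some common phrase"</str><str name="rows">20</str></lst>
    <lst><str name="q">"another frequent query"</str><str name="rows">20</str></lst>
  </arr>
</listener>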

What do you think is the best way to compare results? First run all the warm
up queries and then execute same randomly and compare?

We are using a Windows server; would it make a big difference if we moved to
Linux? Our load is not high but some queries are really complex.

Also, I was hoping to move to SSD last, after trying out all software
options. Is it an agreed fact that on large indexes (which don't fit in
RAM) proximity/wildcard/phrase queries (on common words) will be slow and
can only be improved by cache warm-up and better hardware? Otherwise, will such
queries on an index of around 150GB take more than a minute?

If that's the case I know this question is very subjective but if a single
query takes 2 min on SAS 10K RPM what would its approx time be on a good SSD
(everything else same)?

Thanks!


On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen wrote:

> On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
> > Cache warming is a good option too but the index get updated every hour
> so
> > not sure how much would that help.
>
> What is the time difference between queries with a warmed index and a
> cold one? If the warmed index performs satisfactory, then one answer is
> to upgrade your underlying storage. As always for IO-caused performance
> problem in Lucene/Solr-land, SSD is the answer.
>
>


-- 
Regards,

Salman Akram


Re: Highlighting with/without Term Vectors

2011-01-25 Thread Salman Akram
Anyone?

On Tue, Jan 25, 2011 at 12:57 AM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> Just to add one thing, in case it makes a difference.
>
> Max document size on which highlighting needs to be done is few hundred
> kb's (in file system). In index its compressed so should be much smaller.
> Total documents are more than 100 million.
>
>
> On Tue, Jan 25, 2011 at 12:42 AM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
>> Hi,
>>
>> Does anyone have any benchmarks how much highlighting speeds up with Term
>> Vectors (compared to without it)? e.g. if highlighting on 20 documents take
>> 1 sec with Term Vectors any idea how long it will take without them?
>>
>> I need to know since the index used for highlighting has a TVF file of
>> around 450GB (approx 65% of total index size) so I am trying to see whether
>> the decreasing the index size by dropping TVF would be more helpful for
>> performance (less RAM, should be good for I/O too I guess) or keeping it is
>> still better?
>>
>> I know the best way is try it out but indexing takes a very long time so
>> trying to see whether its even worthy or not.
>>
>> --
>> Regards,
>>
>> Salman Akram
>>
>>
>
>
> --
> Regards,
>
> Salman Akram
>



-- 
Regards,

Salman Akram


Re: Highlighting with/without Term Vectors

2011-02-04 Thread Salman Akram
Basically Term Vectors are only on one main field, i.e. Contents. The average
size of each document would be a few KBs but there are around 130 million
documents, so what do you suggest now?

On Fri, Feb 4, 2011 at 5:24 PM, Otis Gospodnetic  wrote:

> Salman,
>
> It also depends on the size of your documents.  Re-analyzing 20 fields of
> 500
> bytes each will be a lot faster than re-analyzing 20 fields with 50 KB
> each.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Grant Ingersoll 
> > To: solr-user@lucene.apache.org
> > Sent: Wed, January 26, 2011 10:44:09 AM
> > Subject: Re: Highlighting with/without Term Vectors
> >
> >
> > On Jan 24, 2011, at 2:42 PM, Salman Akram wrote:
> >
> > > Hi,
> > >
> > > Does anyone have any benchmarks how much highlighting speeds up with
>  Term
> > > Vectors (compared to without it)? e.g. if highlighting on 20  documents
> take
> > > 1 sec with Term Vectors any idea how long it will take  without them?
> > >
> > > I need to know since the index used for  highlighting has a TVF file of
> > > around 450GB (approx 65% of total index  size) so I am trying to see
> whether
> > > the decreasing the index size by  dropping TVF would be more helpful
> for
> > > performance (less RAM, should be  good for I/O too I guess) or keeping
> it is
> > > still better?
> > >
> > > I know the best way is try it out but indexing takes a very long time
>  so
> > > trying to see whether its even worthy or not.
> >
> >
> > Try testing  on a smaller set.  In general, you are saving the process of
> >re-analyzing  the content, so, to some extent it is going to be dependent
> on how
> >fast your  analyzer chain is.  At the size you are at, I don't know if
> storing
> >TVs is  worth it.
>



-- 
Regards,

Salman Akram


Re: Performance optimization of Proximity/Wildcard searches

2011-02-04 Thread Salman Akram
I know, so we are not really using it for regular warm-ups (in any case the
index is updated on an hourly basis). I just tried it a few times to compare
results. The issue is I am not even sure if warming up is useful for such
regular updates.



On Fri, Feb 4, 2011 at 5:16 PM, Otis Gospodnetic  wrote:

> Salman,
>
> I only skimmed your email, but wanted to say that this part sounds a little
> suspicious:
>
> > Our warm up script currently  executes all distinct queries in our logs
> > having count > 5. It was run  yesterday (with all the indexing update
> every
>
> It sounds like this will make warmup take a long time, assuming you
> have
> more than a handful distinct queries in your logs.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Salman Akram 
> > To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
> > Sent: Tue, January 25, 2011 6:32:48 AM
> > Subject: Re: Performance optimization of Proximity/Wildcard searches
> >
> > By warmed index you only mean warming the SOLR cache or OS cache? As I
>  said
> > our index is updated every hour so I am not sure how much SOLR cache
>  would
> > be helpful but OS cache should still be helpful, right?
> >
> > I  haven't compared the results with a proper script but from manual
>  testing
> > here are some of the observations.
> >
> > 'Recent' queries which are  in cache of course return immediately (only
> if
> > they are exactly same - even  if they took 3-4 mins first time). I will
> need
> > to test how many recent  queries stay in cache but still this would work
> only
> > for very common queries.  User can run different queries and I want at
> least
> > them to be at 'acceptable'  level (5-10 secs) even if not very fast.
> >
> > Our warm up script currently  executes all distinct queries in our logs
> > having count > 5. It was run  yesterday (with all the indexing update
> every
> > hour after that) and today when  I executed some of the same queries
> again
> > their time seemed a little less  (around 15-20%), I am not sure if this
> means
> > anything. However, still their  time is not acceptable.
> >
> > What do you think is the best way to compare  results? First run all the
> warm
> > up queries and then execute same randomly and  compare?
> >
> > We are using Windows server, would it make a big difference if  we move
> to
> > Linux? Our load is not high but some queries are really  complex.
> >
> > Also I was hoping to move to SSD in last after trying out all  software
> > options. Is that an agreed fact that on large indexes (which don't  fit
> in
> > RAM) proximity/wildcard/phrase queries (on common words) would be slow
>  and
> > it can be only improved by cache warm up and better hardware? Otherwise
>  with
> > an index of around 150GB such queries will take more than a  min?
> >
> > If that's the case I know this question is very subjective but if a
>  single
> > query takes 2 min on SAS 10K RPM what would its approx time be on a  good
> SSD
> > (everything else same)?
> >
> > Thanks!
> >
> >
> > On Tue, Jan 25,  2011 at 3:44 PM, Toke Eskildsen
> wrote:
> >
> > >  On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
> > > > Cache  warming is a good option too but the index get updated every
> hour
> > >  so
> > > > not sure how much would that help.
> > >
> > > What is the  time difference between queries with a warmed index and a
> > > cold one? If  the warmed index performs satisfactory, then one answer
> is
> > > to upgrade  your underlying storage. As always for IO-caused
> performance
> > > problem in  Lucene/Solr-land, SSD is the answer.
> > >
> > >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Re: Performance optimization of Proximity/Wildcard searches

2011-02-04 Thread Salman Akram
Well, I assume many people out there have indexes larger than 100GB, and
I don't think you would normally have more RAM than 32GB or 64!

As I mentioned, the queries are mostly phrase, proximity, wildcard and
combinations of these.

What exactly do you mean by distribution of documents? On this index our
documents are not more than a few hundred KBs on average (file system size)
and there are around 14 million documents. 80% of the index size is taken up
by the position file. I am not sure if this is what you asked?

On Fri, Feb 4, 2011 at 5:19 PM, Otis Gospodnetic  wrote:

> Hi,
>
>
> > Sharding is an  option too but that too comes with limitations so want to
> > keep that as a last  resort but I think there must be other things coz
> 150GB
> > is not too big for  one drive/server with 32GB Ram.
>
> Hmm what makes you think 32 GB is enough for your 150 GB index?
> It depends on queries and distribution of matching documents, for example.
> What's yours like?
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Salman Akram 
> > To: solr-user@lucene.apache.org
> > Sent: Tue, January 25, 2011 4:20:34 AM
> > Subject: Performance optimization of Proximity/Wildcard searches
> >
> > Hi,
> >
> > I am facing performance issues in three types of queries (and  their
> > combination). Some of the queries take more than 2-3 mins. Index size  is
> > around 150GB.
> >
> >
> >- Wildcard
> >-  Proximity
> >- Phrases (with common words)
> >
> > I know CommonGrams and  Stop words are a good way to resolve such issues
> but
> > they don't fulfill our  functional requirements (Common Grams seem to
> have
> > issues with phrase  proximity, stop words have issues with exact match
> etc).
> >
> > Sharding is an  option too but that too comes with limitations so want to
> > keep that as a last  resort but I think there must be other things coz
> 150GB
> > is not too big for  one drive/server with 32GB Ram.
> >
> > Cache warming is a good option too but  the index get updated every hour
> so
> > not sure how much would that  help.
> >
> > What are the other main tips that can help in performance  optimization
> of
> > the above queries?
> >
> > Thanks
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Re: Performance optimization of Proximity/Wildcard searches

2011-02-05 Thread Salman Akram
Correct me if I am wrong.

A commit flushes the SOLR caches but of course the OS cache would still be
useful? If an index is updated every hour then a warm-up that takes less
than 5 mins should be more than enough, right?

On Sat, Feb 5, 2011 at 7:42 AM, Otis Gospodnetic  wrote:

> Salman,
>
> Warming up may be useful if your caches are getting decent hit ratios.
> Plus, you
> are warming up the OS cache when you warm up.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> ----- Original Message 
> > From: Salman Akram 
> > To: solr-user@lucene.apache.org
> > Sent: Fri, February 4, 2011 3:33:41 PM
> > Subject: Re: Performance optimization of Proximity/Wildcard searches
> >
> > I know so we are not really using it for regular warm-ups (in any case
>  index
> > is updated on hourly basis). Just tried few times to compare results.
>  The
> > issue is I am not even sure if warming up is useful for such  regular
> > updates.
> >
> >
> >
> > On Fri, Feb 4, 2011 at 5:16 PM, Otis  Gospodnetic <
> otis_gospodne...@yahoo.com
> > >  wrote:
> >
> > > Salman,
> > >
> > > I only skimmed your email, but wanted  to say that this part sounds a
> little
> > > suspicious:
> > >
> > > >  Our warm up script currently  executes all distinct queries in our
>  logs
> > > > having count > 5. It was run  yesterday (with all the  indexing
> update
> > > every
> > >
> > > It sounds like this will make  warmup take a long time, assuming
> you
> > > have
> > > more than a  handful distinct queries in your logs.
> > >
> > > Otis
> > > 
> > >  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem  search :: http://search-lucene.com/
> > >
> > >
> > >
> > > - Original  Message 
> > > > From: Salman Akram 
> > >  > To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
> > > >  Sent: Tue, January 25, 2011 6:32:48 AM
> > > > Subject: Re: Performance  optimization of Proximity/Wildcard searches
> > > >
> > > > By warmed  index you only mean warming the SOLR cache or OS cache? As
> I
> > >   said
> > > > our index is updated every hour so I am not sure how much SOLR  cache
> > >  would
> > > > be helpful but OS cache should still be  helpful, right?
> > > >
> > > > I  haven't compared the results  with a proper script but from manual
> > >  testing
> > > > here are  some of the observations.
> > > >
> > > > 'Recent' queries which  are  in cache of course return immediately
> (only
> > > if
> > > >  they are exactly same - even  if they took 3-4 mins first time). I
>  will
> > > need
> > > > to test how many recent  queries stay in  cache but still this would
> work
> > > only
> > > > for very common  queries.  User can run different queries and I want
> at
> > >  least
> > > > them to be at 'acceptable'  level (5-10 secs) even if  not very fast.
> > > >
> > > > Our warm up script currently   executes all distinct queries in our
> logs
> > > > having count > 5. It  was run  yesterday (with all the indexing
> update
> > > every
> > > >  hour after that) and today when  I executed some of the same
>  queries
> > > again
> > > > their time seemed a little less  (around  15-20%), I am not sure if
> this
> > > means
> > > > anything. However,  still their  time is not acceptable.
> > > >
> > > > What do you  think is the best way to compare  results? First run all
> the
> > >  warm
> > > > up queries and then execute same randomly and   compare?
> > > >
> > > > We are using Windows server, would it make a  big difference if  we
> move
> > > to
> > > > Linux? Our load is not  high but some queries are really  complex.
> > > >
> > > > Also I  was hoping to move to SSD in last after trying out all
>  software
> > >  > options. Is that an agreed fact that on large indexes (which don't
> fit
> > > in
> > > > RAM) proximity/wildcard/phrase queries (on common  words) would be
> slow
> > >  and
> > > > it can be only improved by  cache warm up and better hardware?

Re: Performance optimization of Proximity/Wildcard searches

2011-02-05 Thread Salman Akram
Since all queries return the total count as well, we know that on average a
query matches 10% of the total documents. The index I am talking about has
around 13 million documents, so that means around 1.3 million documents match
on average. Of course they won't all be overlapping, so I am guessing that
around 30-50% of the documents match the daily queries.

I tried hard to find out whether you can tell SOLR to stop searching after a
certain count - I don't mean the no. of rows, but something like a MySQL LIMIT,
so that it doesn't have to spend time calculating the total count when it is
only returning a few rows to the UI; we are OK with showing the count as 1000+
(if it's more than 1000) - but I couldn't find any way.

On Sat, Feb 5, 2011 at 7:45 AM, Otis Gospodnetic  wrote:

> Heh, I'm not sure if this is valid thinking. :)
>
> By *matching* doc distribution I meant: what proportion of your millions of
> documents actually ever get matched and then how many of those make it to
> the
> UI.
> If you have 1000 queries in a day and they all end up matching only 3 of
> your
> docs, the system will need less RAM than a system where 1000 queries match
> 5
> different docs.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Salman Akram 
> > To: solr-user@lucene.apache.org
> > Sent: Fri, February 4, 2011 3:38:55 PM
> > Subject: Re: Performance optimization of Proximity/Wildcard searches
> >
> > Well I assume many people out there would have indexes larger than 100GB
>  and
> > I don't think so normally you will have more RAM than 32GB or  64!
> >
> > As I mentioned the queries are mostly phrase, proximity, wildcard  and
> > combination of these.
> >
> > What exactly do you mean by distribution of  documents? On this index our
> > documents are not more than few hundred KB's on  average (file system
> size)
> > and there are around 14 million documents. 80% of  the index size is
> taken up
> > by position file. I am not sure if this is what  you asked?
> >
> > On Fri, Feb 4, 2011 at 5:19 PM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com
> > >  wrote:
> >
> > > Hi,
> > >
> > >
> > > > Sharding is an  option  too but that too comes with limitations so
> want to
> > > > keep that as a  last  resort but I think there must be other things
> coz
> > >  150GB
> > > > is not too big for  one drive/server with 32GB  Ram.
> > >
> > > Hmm what makes you think 32 GB is enough for your 150  GB index?
> > > It depends on queries and distribution of matching documents,  for
> example.
> > > What's yours like?
> > >
> > > Otis
> > >  
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem  search :: http://search-lucene.com/
> > >
> > >
> > >
> > > - Original  Message 
> > > > From: Salman Akram 
> > >  > To: solr-user@lucene.apache.org
> > >  > Sent: Tue, January 25, 2011 4:20:34 AM
> > > > Subject: Performance  optimization of Proximity/Wildcard searches
> > > >
> > > >  Hi,
> > > >
> > > > I am facing performance issues in three types of  queries (and  their
> > > > combination). Some of the queries take  more than 2-3 mins. Index
> size  is
> > > > around 150GB.
> > >  >
> > > >
> > > >- Wildcard
> > > > -  Proximity
> > > >- Phrases (with common  words)
> > > >
> > > > I know CommonGrams and  Stop words are a  good way to resolve such
> issues
> > > but
> > > > they don't fulfill  our  functional requirements (Common Grams seem
> to
> > > have
> > >  > issues with phrase  proximity, stop words have issues with exact
>  match
> > > etc).
> > > >
> > > > Sharding is an  option too  but that too comes with limitations so
> want to
> > > > keep that as a  last  resort but I think there must be other things
> coz
> > >  150GB
> > > > is not too big for  one drive/server with 32GB  Ram.
> > > >
> > > > Cache warming is a good option too but  the  index get updated every
> hour
> > > so
> > > > not sure how much would  that  help.
> > > >
> > > > What are the other main tips that can  help in performance
>  optimization
> > > of
> > > > the above  queries?
> > > >
> > > > Thanks
> > > >
> > > > --
> > >  > Regards,
> > > >
> > > > Salman Akram
> > >  >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Re: Highlighting with/without Term Vectors

2011-02-05 Thread Salman Akram
Yea I was going to reply to that thread but then it just slipped out of my
mind. :)

Actually we have two indexes: one that is used for searching and the other for
highlighting. Their structure is different too; the 1st one has all the
metadata + document contents indexed (just for searching). This has around
13 million rows. In the 2nd one we mainly have the document PAGE contents
indexed/stored with Term Vectors. This has around 130 million rows (since
each row is a page).

What we do is search on the 1st index (around 150GB) and get document ID's
based on the page size (20/50/100) and then just search on these document
ID's on 2nd index (but on pages - as we need to show results based on page
no's) with text for highlighting as well.
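
So the second call is roughly of this shape (field names are illustrative):

?q=Contents:"original user query"&fq=DocID:(101 OR 205 OR 987)&hl=true&hl.fl=Contents&rows=100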

The 2nd index is around 700GB (which has that 450GB TVF file I was talking
about), but since it is only hit for a small no. of documents that is mostly
not an issue (some queries are slow on it too, but its size is the main
issue).

On average more than 90% of the query time is taken by the 1st index in
searching (and computing the total count as well).

The confusion I had was about the 1st index, which didn't have Term
Vectors on any of the fields in the SOLR schema file but still had a TVF file.
The reason in the end turned out to be Lucene indexing: some of the initial
documents were indexed through Lucene, and there one of the fields did have
Term Vectors! Sorry for that...

*Keeping in mind the above description any other ideas you would like to
suggest? Thanks!!*

On Sat, Feb 5, 2011 at 7:40 AM, Otis Gospodnetic  wrote:

> Hi Salman,
>
> Ah, so in the end you *did* have TV enabled on one of your fields! :) (I
> think
> this was a problem we were trying to solve a few weeks ago here)
>
> How many docs you have in the index doesn't matter here - only N
> docs/fields
> that you need to display on a page with N results need to be reanalyzed for
> highlighting purposes, so follow Grant's advice, make a small index without
> TV,
> and compare highlighting speed with and without TV.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Salman Akram 
> > To: solr-user@lucene.apache.org
> > Sent: Fri, February 4, 2011 8:03:06 AM
> > Subject: Re: Highlighting with/without Term Vectors
> >
> > Basically Term Vectors are only on one main field i.e. Contents. Average
> > size  of each document would be few KB's but there are around 130 million
> > documents  so what do you suggest now?
> >
> > On Fri, Feb 4, 2011 at 5:24 PM, Otis  Gospodnetic <
> otis_gospodne...@yahoo.com
> > >  wrote:
> >
> > > Salman,
> > >
> > > It also depends on the size of your  documents.  Re-analyzing 20 fields
> of
> > > 500
> > > bytes each will  be a lot faster than re-analyzing 20 fields with 50 KB
> > >  each.
> > >
> > > Otis
> > > 
> > > Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
> > > Lucene ecosystem search :: http://search-lucene.com/
> > >
> > >
> > >
> > > - Original  Message 
> > > > From: Grant Ingersoll 
> > > > To: solr-user@lucene.apache.org
> > >  > Sent: Wed, January 26, 2011 10:44:09 AM
> > > > Subject: Re:  Highlighting with/without Term Vectors
> > > >
> > > >
> > > > On  Jan 24, 2011, at 2:42 PM, Salman Akram wrote:
> > > >
> > > > >  Hi,
> > > > >
> > > > > Does anyone have any benchmarks how much  highlighting speeds up
> with
> > >  Term
> > > > > Vectors  (compared to without it)? e.g. if highlighting on 20
>  documents
> > >  take
> > > > > 1 sec with Term Vectors any idea how long it will  take  without
> them?
> > > > >
> > > > > I need to know  since the index used for  highlighting has a TVF
> file of
> > > > >  around 450GB (approx 65% of total index  size) so I am trying to
>  see
> > > whether
> > > > > the decreasing the index size by   dropping TVF would be more
> helpful
> > > for
> > > > > performance  (less RAM, should be  good for I/O too I guess) or
> keeping
> > > it  is
> > > > > still better?
> > > > >
> > > > > I know  the best way is try it out but indexing takes a very long
> time
> > >   so
> > > > > trying to see whether its even worthy or not.
> > >  >
> > > >
> > > > Try testing  on a smaller set.  In  general, you are saving the
> process of
> > > >re-analyzing  the  content, so, to some extent it is going to be
> dependent
> > > on how
> > >  >fast your  analyzer chain is.  At the size you are at, I don't  know
> if
> > > storing
> > > >TVs is  worth  it.
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Re: Performance optimization of Proximity/Wildcard searches

2011-02-06 Thread Salman Akram
Only a couple of thousand documents are added daily so the old OS cache should
still be useful since the old documents remain the same, right?

Also can you please comment on my other thread related to Term Vectors?
Thanks!

On Sat, Feb 5, 2011 at 8:40 PM, Otis Gospodnetic  wrote:

> Yes, OS cache mostly remains (obviously index files that are no longer
> around
> are going to remain the OS cache for a while, but will be useless and
> gradually
> replaced by new index files).
> How long warmup takes is not relevant here, but what queries you use to
> warm up
> the index and how much you auto-warm the caches.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Salman Akram 
> > To: solr-user@lucene.apache.org
> > Sent: Sat, February 5, 2011 4:06:54 AM
> > Subject: Re: Performance optimization of Proximity/Wildcard searches
> >
> > Correct me if I am wrong.
> >
> > Commit in index flushes SOLR cache but of  course OS cache would still be
> > useful? If a an index is updated every hour  then a warm up that takes
> less
> > than 5 mins should be more than enough,  right?
> >
> > On Sat, Feb 5, 2011 at 7:42 AM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com
> > >  wrote:
> >
> > > Salman,
> > >
> > > Warming up may be useful if your  caches are getting decent hit ratios.
> > > Plus, you
> > > are warming up  the OS cache when you warm up.
> > >
> > > Otis
> > > 
> > >  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem  search :: http://search-lucene.com/
> > >
> > >
> > >
> > > - Original  Message 
> > > > From: Salman Akram 
> > >  > To: solr-user@lucene.apache.org
> > >  > Sent: Fri, February 4, 2011 3:33:41 PM
> > > > Subject: Re:  Performance optimization of Proximity/Wildcard searches
> > > >
> > >  > I know so we are not really using it for regular warm-ups (in any
>  case
> > >  index
> > > > is updated on hourly basis). Just tried  few times to compare
> results.
> > >  The
> > > > issue is I am not  even sure if warming up is useful for such
>  regular
> > > >  updates.
> > > >
> > > >
> > > >
> > > > On Fri, Feb 4, 2011  at 5:16 PM, Otis  Gospodnetic <
> > > otis_gospodne...@yahoo.com
> > >  > >  wrote:
> > > >
> > > > > Salman,
> > > >  >
> > > > > I only skimmed your email, but wanted  to say that  this part
> sounds a
> > > little
> > > > > suspicious:
> > > >  >
> > > > > >  Our warm up script currently  executes  all distinct queries in
> our
> > >  logs
> > > > > > having  count > 5. It was run  yesterday (with all the  indexing
> > >  update
> > > > > every
> > > > >
> > > > > It sounds  like this will make  warmup take a long time,
> assuming
> > >  you
> > > > > have
> > > > > more than a  handful distinct  queries in your logs.
> > > > >
> > > > > Otis
> > > > >  
> > > > >  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > >  > Lucene ecosystem  search :: http://search-lucene.com/
> > > > >
> > > >  >
> > > > >
> > > > > - Original  Message  
> > > > > > From: Salman Akram 
> > >  > >  > To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
> > > >  > >  Sent: Tue, January 25, 2011 6:32:48 AM
> > > > > >  Subject: Re: Performance  optimization of Proximity/Wildcard
> searches
> > > > > >
> > > > > > By warmed  index you  only mean warming the SOLR cache or OS
> cache? As
> > > I
> > > >  >   said
> > > > > > our index is updated every hour so I am  not sure how much SOLR
>  cache
> > > > >  would
> > > >  > > be helpful but OS cache should still be  helpful, right?
> > >  > > >
> > > > > > I  haven't compared the results   with a proper script but from
> manual
> > > > >  testing
> > >  > > > here are  some of the observations.
> > > > >  >

Filter Query

2011-02-24 Thread Salman Akram
Hi,

I know Filter Query is really useful due to caching but I am confused about
how it filters results.

Let's say I have the following criteria:

Text: "Abc def"
Date: 24th Feb, 2011

Now "abc def" might occur in almost every document, but if SOLR first
filters based on date it will only have to search a few documents
(instead of millions).

If I put the Date parameter in fq, would it first filter on date and then
do the text search, or would both of them be filtered separately and then
intersected? If they are filtered separately the issue would be that, let's
say, "abc def" takes 20 secs on all documents (without any filters - due to the
large # of documents) and it will still take the same time, whereas if it is
done only on the few documents of that specific date it would be super fast.

If fq doesn't give what I am looking for, is there any other parameter?
There should be a way as this is a very common scenario.
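
In other words, what I would like to be able to send is something like:

?q=Text:"Abc def"&fq=Date:[2011-02-24T00:00:00Z TO 2011-02-25T00:00:00Z]

and have the expensive phrase part evaluated only against the documents that
pass the date filter.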



-- 
Regards,

Salman Akram


Re: Filter Query

2011-02-24 Thread Salman Akram
Yea I had an idea about that...

Now, logically speaking, the main text search would go in the Query (q)
parameter, so is there no way to first filter based on meta data and then do
the text search on that limited data set?

Thanks!

On Thu, Feb 24, 2011 at 5:24 PM, Stefan Matheis <
matheis.ste...@googlemail.com> wrote:

> Salman,
>
> afaik, the Query is executed first and afterwards FilterQuery steps in
> Place .. so it's only an additional Filter on your Results.
>
> Recommended Wiki-Pages on FilterQuery:
> * http://wiki.apache.org/solr/CommonQueryParameters#fq
> * http://wiki.apache.org/solr/FilterQueryGuidance
>
> Regards
> Stefan
>
> On Thu, Feb 24, 2011 at 12:46 PM, Salman Akram
>  wrote:
> > Hi,
> >
> > I know Filter Query is really useful due to caching but I am confused
> about
> > how it filter results.
> >
> > Lets say I have following criteria
> >
> > Text:: "Abc def"
> > Date: 24th Feb, 2011
> >
> > Now "abc def" might be coming in almost every document but if SOLR first
> > filters based on date it will have to do search only on few documents
> > (instead of millions)
> >
> > If I put Date parameter in fq would it be first filtering on date and
> then
> > doing text search or both of them would be filtered separately and then
> > intersection? If its filtered separately the issue would be that lets say
> > "abd def" takes 20 secs on all documents (without any filters - due to
> large
> > # of documents) and it will be still taking same time but if its done
> only
> > on few documents on that specific date it would be super fast.
> >
> > If fq doesn't give what I am looking for, is there any other parameter?
> > There should be a way as this is a very common scenario.
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


Re: Filter Query

2011-02-24 Thread Salman Akram
So you are agreeing that it does what I want? So in my example "Abc def"
would only be searched on the 24th Feb 2011 documents?

When you say 'last with filters', does it mean it first filters with the
Filter Query and then applies the Query to it?

On Thu, Feb 24, 2011 at 9:29 PM, Yonik Seeley wrote:

> On Thu, Feb 24, 2011 at 6:46 AM, Salman Akram
>  wrote:
> > Hi,
> >
> > I know Filter Query is really useful due to caching but I am confused
> about
> > how it filter results.
> >
> > Lets say I have following criteria
> >
> > Text:: "Abc def"
> > Date: 24th Feb, 2011
> >
> > Now "abc def" might be coming in almost every document but if SOLR first
> > filters based on date it will have to do search only on few documents
> > (instead of millions)
>
> Yes, this is the way Solr works.  The filters are executed separately,
> but the query is executed last with the filters (i.e. it will be
> faster if the filter cuts down the number of documents).
>
> -Yonik
> http://lucidimagination.com
>



-- 
Regards,

Salman Akram


CommonGrams indexing very slow!

2011-04-27 Thread Salman Akram
All,

We have created an index with CommonGrams and the final size is around 370GB.
Everything is working fine, but now when we add more documents to the index it
takes forever (almost 12 hours)... it seems to rewrite all the segment files in
a commit.

The same commit used to take a few mins with the normal index.

Any idea whats going on?

-- 
Regards,

Salman Akram
Principal Software Engineer - Tech Lead
NorthBay Solutions
410-G4 Johar Town, Lahore
Off: +92-42-35290152

Cell: +92-321-4391210 -- +92-300-4009941


Re: CommonGrams indexing very slow!

2011-04-27 Thread Salman Akram
No way. It just does this while committing.

Also, before this, when we merged multiple small indexes without optimization
- the same way it was done in the past - it again took around 12 hours and
produced around 20 CFS files (that never happened before).

On Wed, Apr 27, 2011 at 8:21 PM, Erick Erickson wrote:

> Are you by any chance optimizing?
>
> Best
> Erick
>
> On Wed, Apr 27, 2011 at 11:04 AM, Salman Akram
>  wrote:
> > All,
> >
> > We have created index with CommonGrams and the final size is around
> 370GB.
> > Everything is working fine but now when we add more documents into index
> it
> > takes forever (almost 12 hours)...seems to change all the segments file
> in a
> > commit.
> >
> > The same commit used to take few mins with normal index.
> >
> > Any idea whats going on?
> >
> > --
> > Regards,
> >
> > Salman Akram
> > Principal Software Engineer - Tech Lead
> > NorthBay Solutions
> > 410-G4 Johar Town, Lahore
> > Off: +92-42-35290152
> >
> > Cell: +92-321-4391210 -- +92-300-4009941
> >
>



-- 
Regards,

Salman Akram


Re: CommonGrams indexing very slow!

2011-04-27 Thread Salman Akram
Thanks for the response. We got it resolved!

We built the small indexes in bulk using SOLR with the standard file format
(SFF) and then merged them with a Lucene app, which for some reason produced
CFS segments. Then, when we started adding real-time documents using SOLR
(with the compound file format, CFF, set to false), it was merging on every
commit!

We just set CFF to true and now it's back to normal. Weird, but that's how it
got resolved.

BTW, any idea why this was happening? And if we now optimize the index using
SFF, should it be fine in the future with CFF=false?

P.S.: Increasing the mergeFactor didn't help either.
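
For reference, the settings involved live in solrconfig.xml (Solr 1.4/3.x
style; a sketch with illustrative values, not our production config):

  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>32</ramBufferSizeMB>

With useCompoundFile=false every new segment is written as a set of separate
files rather than a single .cfs file.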

On Wed, Apr 27, 2011 at 10:09 PM, Burton-West, Tom wrote:

> Hi Salman,
>
> Sounds like somehow you are triggering merges or optimizes.  What is your
> mergeFactor?
>
> Have you turned on the IndexWriter log?
>
> In solrconfig.xml
> <infoStream file="...">true</infoStream>
>
>  In our case we feed the directory name as a Java property in our Java
> startup script, but you can also hard-code where you want the log written,
> like in the current example Solr config:
>
> <infoStream file="INFOSTREAM.txt">false</infoStream>
>
> That should provide some clues.  For example you can see how many segments
> of each level there are just before you do the commit that triggers the
> problem.   My first guess is that you have enough segments so that adding
> the documents and committing triggers a cascading merge. (But this is a WAG
> without seeing what's in your indexwriter log)
>
> Can you also send your solrconfig so we can see your mergeFactor and
> ramBufferSizeMB settings?
>
> Tom
>
> > > All,
> > >
> > > We have created index with CommonGrams and the final size is around
> > 370GB.
> > > Everything is working fine but now when we add more documents into
> index
> > it
> > > takes forever (almost 12 hours)...seems to change all the segments file
> > in a
> > > commit.
> > >
> > > The same commit used to take few mins with normal index.
> > >
> > > Any idea whats going on?
> > >
> > > --
> > > Regards,
> > >
> > > Salman Akram
> > > Principal Software Engineer - Tech Lead
> > > NorthBay Solutions
> > > 410-G4 Johar Town, Lahore
> > > Off: +92-42-35290152
> > >
> > > Cell: +92-321-4391210 -- +92-300-4009941
> > >
> >
>
>
>
> --
> Regards,
>
> Salman Akram
>



-- 
Regards,

Salman Akram


JRockit with SOLR3.4/3.5

2012-07-15 Thread Salman Akram
We used JRockit with SOLR 1.4 because the default JVM had memory issues (not
only was it consuming more memory, it also did not stay within the maximum
memory allocated to Tomcat, whereas JRockit did). However, JRockit gives an
error when we use it with SOLR 3.4/3.5. Any ideas why?



Re: JRockit with SOLR3.4/3.5

2012-07-16 Thread Salman Akram
19)
at org.apache.catalina.util.LifecycleBase.fireLifecycleEvent(LifecycleBase.java:90)
at org.apache.catalina.util.LifecycleBase.setStateInternal(LifecycleBase.java:401)
at org.apache.catalina.util.LifecycleBase.init(LifecycleBase.java:110)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:139)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:866)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:842)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:615)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1095)
at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1617)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
... 1 more
Jun 7, 2012 4:03:04 AM org.apache.coyote.AbstractProtocol start

On Mon, Jul 16, 2012 at 2:20 AM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:

> Hello, Salman,
>
> It would probably be helpful if you included the text/stack trace of
> the error you're encountering, plus any other pertinent system
> information you can think of.
>
> One thing to remember is the memory usage you tune with Xmx is only
> the maximum size of the heap, and there are other types of memory
> usage by the JVM that don't fall under that (Permgen space, memory
> mapped files, etc).
>
> Michael Della Bitta
>
> 
> Appinions, Inc. -- Where Influence Isn’t a Game.
> http://www.appinions.com
>
>
> On Sun, Jul 15, 2012 at 3:19 PM, Salman Akram
>  wrote:
> > We used JRockit with SOLR1.4 as default JVM had mem issues (not only it
> was consuming more mem but didn't restrict to the max mem allocated to
> tomcat - jrockit did restrict to max mem). However, JRockit gives an error
> while using it with SOLR3.4/3.5. Any ideas, why?
> >
> > *** This Message Has Been Sent Using BlackBerry Internet Service from
> Mobilink ***
>
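
For reference, the heap ceiling Michael mentions is typically set via the
Tomcat JVM options, e.g. (illustrative values only, not our actual settings):

  CATALINA_OPTS=-Xms2g -Xmx2g -XX:MaxPermSize=256m    (HotSpot)
  CATALINA_OPTS=-Xms2g -Xmx2g                         (JRockit; no PermGen)

PermGen, thread stacks and memory-mapped index files all live outside -Xmx, so
the process can legitimately grow beyond the configured heap.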



-- 
Regards,

Salman Akram
Project Manager - Intelligize
NorthBay Solutions
410-G4 Johar Town, Lahore
Off: +92-42-35290152

Cell: +92-302-8495621