Re: Solr Segments, Segment Merges,Optimize

2014-02-23 Thread KNitin
Commit parameters: the server does an auto commit every 30 seconds with
openSearcher=false. The pipeline does a hard commit only at the very end
of its run.
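
For reference, an autoCommit setup along those lines would look roughly like
this in solrconfig.xml (the values are illustrative, not necessarily the exact
configuration in use):

    <autoCommit>
      <!-- hard commit every 30 seconds -->
      <maxTime>30000</maxTime>
      <!-- do not open a new searcher on these automatic commits -->
      <openSearcher>false</openSearcher>
    </autoCommit>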

The high CPU issue I am seeing is only during the reads and not during the
writes. Right now I see a direct correlation between latencies and # of
segments, at least for a few large collections. Will post back if the theory
is invalidated.

Thanks
- Nitin


On Sat, Feb 22, 2014 at 10:01 PM, Erick Erickson wrote:

> Well, it's always possible. I wouldn't expect the search time/CPU
> utilization to increase with # segments, within reasonable limits.
> At some point, the important parts of the index get read into memory
> and the number of segments is pretty irrelevant. You do mention
> that you have a heavy ingestion pipeline, which leads me to wonder
> whether you're committing too often, what are your commit
> parameters?
>
> For % deleted docs, I'm really talking about deletedDocs/numDocs.
>
> I suppose the interesting question is whether the CPU utilization you're
> seeing is _always_ correlated with # segments or are you seeing certain
> machines always having the high CPU utilization. I suppose you could
> issue a commit and see what difference that made.
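
For reference, an explicit commit can be issued against the update handler;
the host and collection name here are placeholders:

    curl 'http://localhost:8983/solr/<collection>/update?commit=true'
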
>
> I rather doubt that the # of segments is the underlying issue, but that's
> nothing but a SWAG...
>
> Best,
> Erick
>
>
>
>
> On Sat, Feb 22, 2014 at 6:16 PM, KNitin  wrote:
>
> > Thanks, Erick.
> >
> > *2> There are, but you'll have to dig. *
> >
> >>> Any pointers on where to get started?
> >
> >
> >
> > *3> Well, I'd ask a counter-question. Are you seeing
> > unacceptable performance? If not, why worry? :)*
> >
> >>> When you say %, do you refer to deleted_docs/NumDocs or
> > deleted_docs/Max_docs? To answer your question, yes, I see some of our
> > shards taking 3x more time and 3x more CPU than other shards for the same
> > queries and the same number of hits (all shards have the exact same number
> > of docs, but I see a few shards having more deleted documents than the rest).
> >
> > My understanding is that search time/CPU would increase with the # of
> > segments? The core of my issue is that a few nodes are running with
> > extremely high CPU (90%+) and the rest are running under 30% CPU, and the
> > only difference between them is the # of segments in the shards on those
> > machines. The nodes running hot have shards with 30 segments, and the ones
> > running with lower CPU contain 20 segments and far fewer deleted documents.
> >
> > Is it possible that a difference of 10 segments could impact CPU/search
> > time?
> >
> > Thanks
> > - Nitin
> >
> >
> > On Sat, Feb 22, 2014 at 4:36 PM, Erick Erickson  > >wrote:
> >
> > > 1> It Depends. Soft commits will not add a new segment. Hard commits
> > > with openSearcher=true or false _will_ create a new segment.
> > > 2> There are, but you'll have to dig.
> > > 3> Well, I'd ask a counter-question. Are you seeing unacceptable
> > > performance? If not, why worry? :)
> > >
> > > A better answer is that 24-28 segments is not at all unusual.
> > >
> > > By and large, don't bother with optimize/force merge. What I would do
> is
> > > look at the admin screen and note the percentage of deleted documents.
> > > If it's above some arbitrary number (I typically use 15-20%) and
> _stays_
> > > there, consider optimizing.
> > >
> > > However! There is a parameter you can explicitly set in solrconfig.xml
> > > (sorry, which one escapes me now) that increases the "weight" of the %
> > > deleted documents when the merge policy decides which segments
> > > to merge. Upping this number will have the effect of more aggressively
> > > merging segments with a greater % of deleted docs. But these are already
> > > pretty heavily weighted for merging...
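
The parameter Erick is thinking of is most likely TieredMergePolicy's
reclaimDeletesWeight. A sketch of how it could be set in solrconfig.xml on
Solr 4.x (the value is only an example; the default is around 2.0):

    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
      <!-- higher values favor merging segments that carry more deleted docs -->
      <double name="reclaimDeletesWeight">3.0</double>
    </mergePolicy>
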
> > >
> > >
> > > Best,
> > > Erick
> > >
> > >
> > > On Sat, Feb 22, 2014 at 1:23 PM, KNitin  wrote:
> > >
> > > > Hi
> > > >
> > > >   I have the following questions
> > > >
> > > >
> > > >    1. I have a job that runs for 3-4 hours continuously committing data
> > > >    to a collection with an auto commit of 30 seconds. Does it mean that
> > > >    every 30 seconds I would get a new Solr segment?
> > > >    2. My current segment merge policy is set to 10. Will the merger
> > > >    always continue running in the background to reduce the segments? Is
> > > >    there a way to see metrics regarding segment merging from Solr
> > > >    (mbeans or any other way)?
> > > >    3. A few of my collections are very large, with around 24-28 segments
> > > >    per shard and around 16 shards. Is it bad to have this many segments
> > > >    per shard for a collection? Is it a good practice to optimize the
> > > >    index very often, or just rely on segment merges alone?
> > > >
> > > >
> > > >
> > > > Thanks for the help in advance
> > > > Nitin
> > > >
> > >
> >
>


Fwd: configuration for heavy system

2014-02-23 Thread Harish Reddy
Hi,
We are testing Solr.
We have a document with some 100 indexes and there are around 10 million
records. It is failing, either getting stuck or timing out on query.

Is this indexing job possible with Solr?
If yes, what should the hardware and Solr configuration be, and how many nodes
would be optimum?
Right now I am running Solr on four nodes with number of shards=2, on a 16 GB
machine.

If no, how many indexes/records can Solr handle without issues?


Re: configuration for heavy system

2014-02-23 Thread Erick Erickson
You haven't told us anything about _how_ you're
trying to index this document nor what its format
is. Nor what "100 indexes and around 10 million
records" means. 1B total records? 10M total records?

Solr easily handles tens of millions of records on a single decent-sized
node; I've seen between 50M and 300M.

Perhaps you should review:

http://wiki.apache.org/solr/UsingMailingLists

Best,
Erick


On Sat, Feb 22, 2014 at 10:17 PM, Harish Reddy  wrote:

> Hi,
> We are testing Solr.
> We have a document with some 100 indexes and there are around 10 million
> records. It is failing, either getting stuck or timing out on query.
>
> Is this indexing job possible with Solr?
> If yes, what should the hardware and Solr configuration be, and how many nodes
> would be optimum?
> Right now I am running Solr on four nodes with number of shards=2, on a 16 GB
> machine.
>
> If no, how many indexes/records can Solr handle without issues?
>


DistributedSearch: Skipping STAGE_GET_FIELDS?

2014-02-23 Thread Gregg Donovan
In most of our Solr use-cases, we fetch only fl= or
fl=,. I'd like to be able to do a distributed
search and skip STAGE_GET_FIELDS -- i.e. the stage where each shard is
queried for the documents found in the top ids -- as it seems like we
could be collecting this information earlier in the pipeline.

Is this possible out-of-the-box? If not, how would you recommend
implementing it?

Thanks!

--Gregg


Re: DistributedSearch: Skipping STAGE_GET_FIELDS?

2014-02-23 Thread Shalin Shekhar Mangar
What a coincidence - I was about to commit a patch which makes it
possible. It will be released with 4.8

See https://issues.apache.org/jira/browse/SOLR-1880

On Sun, Feb 23, 2014 at 11:27 PM, Gregg Donovan  wrote:
> In most of our Solr use-cases, we fetch only fl= or
> fl=,. I'd like to be able to do a distributed
> search and skip STAGE_GET_FIELDS -- i.e. the stage where each shard is
> queried for the documents found in the top ids -- as it seems like we
> could be collecting this information earlier in the pipeline.
>
> Is this possible out-of-the-box? If not, how would you recommend
> implementing it?
>
> Thanks!
>
> --Gregg



-- 
Regards,
Shalin Shekhar Mangar.


Re: DistributedSearch: Skipping STAGE_GET_FIELDS?

2014-02-23 Thread Shalin Shekhar Mangar
I should clarify though that this optimization only works with fl=id,score.

On Sun, Feb 23, 2014 at 11:34 PM, Shalin Shekhar Mangar
 wrote:
> What a coincidence - I was about to commit a patch which makes it
> possible. It will be released with 4.8
>
> See https://issues.apache.org/jira/browse/SOLR-1880
>
> On Sun, Feb 23, 2014 at 11:27 PM, Gregg Donovan  wrote:
>> In most of our Solr use-cases, we fetch only fl= or
>> fl=,. I'd like to be able to do a distributed
>> search and skip STAGE_GET_FIELDS -- i.e. the stage where each shard is
>> queried for the documents found in the top ids -- as it seems like we
>> could be collecting this information earlier in the pipeline.
>>
>> Is this possible out-of-the-box? If not, how would you recommend
>> implementing it?
>>
>> Thanks!
>>
>> --Gregg
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.



-- 
Regards,
Shalin Shekhar Mangar.


Re: DistributedSearch: Skipping STAGE_GET_FIELDS?

2014-02-23 Thread Yonik Seeley
On Sun, Feb 23, 2014 at 1:08 PM, Shalin Shekhar Mangar
 wrote:
> I should clarify though that this optimization only works with fl=id,score.

Although it seems like it should be relatively simple to make it work
with other fields as well, by passing down the complete "fl" requested
if some optional parameter is set (distrib.singlePass?)

-Yonik
http://heliosearch.org - native off-heap filters and fieldcache for solr


Re: DistributedSearch: Skipping STAGE_GET_FIELDS?

2014-02-23 Thread Shalin Shekhar Mangar
Yes that should be simple. But regardless of the parameter, the
fl=id,score use-case should be optimized by default. I think I'll
commit the patch as-is and open a new issue to add the
distrib.singlePass parameter.
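
For context, requests along these lines illustrate the two cases discussed
above (host, collection and field names are placeholders, and
distrib.singlePass is only the proposed parameter name at this point):

    # single pass when only the id and score are requested (SOLR-1880)
    http://localhost:8983/solr/collection1/select?q=foo&fl=id,score

    # proposed follow-up: force a single pass for an arbitrary field list
    http://localhost:8983/solr/collection1/select?q=foo&fl=id,name,score&distrib.singlePass=true
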

On Sun, Feb 23, 2014 at 11:49 PM, Yonik Seeley  wrote:
> On Sun, Feb 23, 2014 at 1:08 PM, Shalin Shekhar Mangar
>  wrote:
>> I should clarify though that this optimization only works with fl=id,score.
>
> Although it seems like it should be relatively simple to make it work
> with other fields as well, by passing down the complete "fl" requested
> if some optional parameter is set (distrib.singlePass?)
>
> -Yonik
> http://heliosearch.org - native off-heap filters and fieldcache for solr



-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr Segments, Segment Merges,Optimize

2014-02-23 Thread KNitin
I should also mention that apart from committing, the pipeline also does a
bunch of deletes for stale documents (based on a custom version field). The
# of deletes can be very significant, causing the % of deleted documents to
easily be 40-50% of the index itself.

Thanks
KNitin


On Sun, Feb 23, 2014 at 12:02 AM, KNitin  wrote:

> Commit parameters: the server does an auto commit every 30 seconds with
> openSearcher=false. The pipeline does a hard commit only at the very end
> of its run.
>
> The high CPU issue I am seeing is only during the reads and not during the
> writes. Right now I see a direct correlation between latencies and # of
> segments, at least for a few large collections. Will post back if the theory
> is invalidated.
>
> Thanks
> - Nitin
>
>
> On Sat, Feb 22, 2014 at 10:01 PM, Erick Erickson 
> wrote:
>
>> Well, it's always possible. I wouldn't expect the search time/CPU
>> utilization to increase with # segments, within reasonable limits.
>> At some point, the important parts of the index get read into memory
>> and the number of segments is pretty irrelevant. You do mention
>> that you have a heavy ingestion pipeline, which leads me to wonder
>> whether you're committing too often, what are your commit
>> parameters?
>>
>> For % deleted docs, I'm really talking about deletedDocs/numDocs.
>>
>> I suppose the interesting question is whether the CPU utilization you're
>> seeing is _always_ correlated with # segments or are you seeing certain
>> machines always having the high CPU utilization. I suppose you could
>> issue a commit and see what difference that made.
>>
>> I rather doubt that the # of segments is the underlying issue, but that's
>> nothing but a SWAG...
>>
>> Best,
>> Erick
>>
>>
>>
>>
>> On Sat, Feb 22, 2014 at 6:16 PM, KNitin  wrote:
>>
>> > Thanks, Erick.
>> >
>> > *2> There are, but you'll have to dig. *
>> >
>> >>> Any pointers on where to get started?
>> >
>> >
>> >
>> > *3> Well, I'd ask a counter-question. Are you seeing
>> > unacceptable performance? If not, why worry? :)*
>> >
>> >>> When you say %, do you refer to deleted_docs/NumDocs or
>> > deleted_docs/Max_docs? To answer your question, yes, I see some of our
>> > shards taking 3x more time and 3x more CPU than other shards for the same
>> > queries and the same number of hits (all shards have the exact same number
>> > of docs, but I see a few shards having more deleted documents than the rest).
>> >
>> > My understanding is that search time/CPU would increase with the # of
>> > segments? The core of my issue is that a few nodes are running with
>> > extremely high CPU (90%+) and the rest are running under 30% CPU, and the
>> > only difference between them is the # of segments in the shards on those
>> > machines. The nodes running hot have shards with 30 segments, and the ones
>> > running with lower CPU contain 20 segments and far fewer deleted documents.
>> >
>> > Is it possible that a difference of 10 segments could impact CPU/search
>> > time?
>> >
>> > Thanks
>> > - Nitin
>> >
>> >
>> > On Sat, Feb 22, 2014 at 4:36 PM, Erick Erickson <
>> erickerick...@gmail.com
>> > >wrote:
>> >
>> > > 1> It Depends. Soft commits will not add a new segment. Hard commits
>> > > with openSearcher=true or false _will_ create a new segment.
>> > > 2> There are, but you'll have to dig.
>> > > 3> Well, I'd ask a counter-question. Are you seeing unacceptable
>> > > performance? If not, why worry? :)
>> > >
>> > > A better answer is that 24-28 segments is not at all unusual.
>> > >
>> > > By and large, don't bother with optimize/force merge. What I would do
>> is
>> > > look at the admin screen and note the percentage of deleted documents.
>> > > If it's above some arbitrary number (I typically use 15-20%) and
>> _stays_
>> > > there, consider optimizing.
>> > >
>> > > However! There is a parameter you can explicitly set in solrconfig.xml
>> > > (sorry, which one escapes me now) that increases the "weight" of the %
>> > > deleted documents when the merge policy decides which segments
>> > > to merge. Upping this number will have the effect of more aggressively
>> > > merging segments with a greater % of deleted docs. But these are already
>> > > pretty heavily weighted for merging...
>> > >
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > >
>> > > On Sat, Feb 22, 2014 at 1:23 PM, KNitin  wrote:
>> > >
>> > > > Hi
>> > > >
>> > > >   I have the following questions
>> > > >
>> > > >
>> > > >    1. I have a job that runs for 3-4 hours continuously committing data
>> > > >    to a collection with an auto commit of 30 seconds. Does it mean that
>> > > >    every 30 seconds I would get a new Solr segment?
>> > > >    2. My current segment merge policy is set to 10. Will the merger
>> > > >    always continue running in the background to reduce the segments? Is
>> > > >    there a way to see metrics regarding segment merging from Solr
>> > > >    (mbeans or any other way)?
>> > > >    3.

Re: Wikipedia Data Cleaning at Solr

2014-02-23 Thread Furkan KAMACI
I've compared the results when using WikipediaTokenizer as the index-time
analyzer, but there is no difference?
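
For anyone trying the same thing, a minimal index-time analyzer using the
Wikipedia tokenizer might be wired up in schema.xml roughly as follows (the
field type name and filter choices are just an example):

    <fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WikipediaTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>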


2014-02-23 3:44 GMT+02:00 Ahmet Arslan :

> Hi Furkan,
>
> There is org.apache.lucene.analysis.wikipedia.WikipediaTokenizer
>
> Ahmet
>
>
> On Sunday, February 23, 2014 2:22 AM, Furkan KAMACI <
> furkankam...@gmail.com> wrote:
> Hi;
>
> I want to run an NLP algorithm on Wikipedia data. I used the dataimport
> handler for the dump data and everything is OK. However, there are some
> texts like:
>
> == Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur fakat taşımalı
> eğitimden yararlanılmaktadır.
>
> I think that it should be like that:
>
> Altyapı bilgileri Köyde, ilköğretim okulu yoktur fakat taşımalı eğitimden
> yararlanılmaktadır.
>
> On the other hand this should be removed:
>
> {| border="0" cellpadding="5" cellspacing="5" |- bgcolor="#aa"
> |'''Seçim Yılı''' |'''Muhtar''' |- bgcolor="#dd" |[[2009]] |kazım
> güngör |- bgcolor="#dd" | |Ömer Gungor |- bgcolor="#dd" | |Fazlı
> Uzun |- bgcolor="#dd" | |Cemal Özden |- bgcolor="#dd" | | |}
>
> Also, including titles like == Altyapı bilgileri == should be optional (I
> think they can be removed for some purposes).
>
> My question is: is there any analyzer combination to clean up
> Wikipedia data for Solr?
>
> Thanks;
> Furkan KAMACI
>


Issue with PHP urlencode and solr encoding

2014-02-23 Thread manju16832003
Hi,
I came across an issue with URL encoding between PHP and Solr.
I have a field indexed with the value *WBE(Honda Edix)* in Solr.

From PHP code, if I urlencode($string) and send it to Solr, I do not get
accurate results.
Here is the relevant part of the Solr query: *fq=model:WBE(Honda+Edix)*

However, if I send *fq=model:WBE\(Honda+Edix\)* directly to Solr, I get
accurate results.

I assume that the '(' and ')' are being treated as part of the Solr query
syntax.

How do I escape '(' and ')' from the client side?






Re: Issue with PHP urlencode and solr encoding

2014-02-23 Thread Shawn Heisey
On 2/23/2014 8:58 PM, manju16832003 wrote:
> I came across an issue with URL encoding between PHP and Solr.
> I have a field indexed with the value *WBE(Honda Edix)* in Solr.
> 
> From PHP code, if I urlencode($string) and send it to Solr, I do not get
> accurate results.
> Here is the relevant part of the Solr query: *fq=model:WBE(Honda+Edix)*
> 
> However, if I send *fq=model:WBE\(Honda+Edix\)* directly to Solr, I get
> accurate results.
> 
> I assume that the '(' and ')' are being treated as part of the Solr query
> syntax.
> 
> How do I escape '(' and ')' from the client side?

This reply got to be a lot longer than I intended.  Here's the novel:

URL encoding is only what needs to be done when you are constructing a
URL.  Those values will be decoded by Solr before they are passed to the
query parser.

The query parser has its own set of characters that are special.  If you
intend any of these characters to be literal, they must be escaped with
a backslash.

http://lucene.apache.org/core/4_6_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true#Escaping_Special_Characters

Although there is some overlap between URL encoding and query escaping,
they do have different lists of characters that require changing.
Escaping query characters must be done before URL encoding.

Another way to allow special characters in your query is to make it a
phrase query - enclose it in double quotes.  This would be your query
using this method, before URL encoding:

fq=model:"WBE(Honda Edix)"

Note that the phrase query method does not always produce the expected
results, and depending on your configuration, in some cases won't work
at all.

The PECL Solr library for PHP has a query escaping method similar to
what can be found in SolrJ.  Here's their documentation reference for it:

http://www.php.net/manual/en/solrutils.escapequerychars.php
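
A rough sketch of that order of operations in PHP, assuming the PECL Solr
extension is installed (the host, core, and field names are placeholders):

    <?php
    $model = 'WBE(Honda Edix)';

    // 1. Escape the query parser's special characters first.
    $escaped = SolrUtils::escapeQueryChars($model);

    // 2. URL-encode only when the request URL is assembled.
    $url = 'http://localhost:8983/solr/collection1/select?q=*:*&fq='
         . urlencode('model:' . $escaped);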

The Solarium library for PHP also says that it does escaping, but I
can't find the manual section that they mention about term escaping.
Here's a section that has an example of phrase escaping (putting the
value in double quotes):

http://wiki.solarium-project.org/index.php/V3:Escaping

There is a bug in the PECL library that makes it not work with Solr 4.x.
 I created a patch for this bug, but they haven't fixed it in any
downloadable version.

https://bugs.php.net/bug.php?id=62332

Thanks,
Shawn



Re: Issue with PHP urlencode and solr encoding

2014-02-23 Thread Rico P
On Mon, Feb 24, 2014 at 11:52 AM, Shawn Heisey  wrote:
>
>
> The Solarium library for PHP also says that it does escaping, but I
> can't find the manual section that they mention about term escaping.
> Here's a section that has an example of phrase escaping (putting the
> value in double quotes):
>
> http://wiki.solarium-project.org/index.php/V3:Escaping
>
> Thanks,
> Shawn
>
They do have it:
https://github.com/basdenooijer/solarium/blob/master/library/Solarium/Core/Query/Helper.php#L104

regards,

rico


Re: Issue with PHP urlencode and solr encoding

2014-02-23 Thread manju16832003
Hi Shawn and Rico,
Thank you for your suggestions; they are valuable :-).

If Phrase Query does not work as expected sometimes, I guess we could use
*TermQuery* instead.

http://blog.florian-hopf.de/2013/01/make-your-filters-match-faceting-in-solr.html

This worked fine *fq={!term%20f=model%20v="WBE(Honda%20Edix)"}*.

I agree with Shawn's comment that "Escaping query characters must be done
before URL encoding."

:-).
Thanks again for your replies.






Can not index raw binary data stored in Database in BLOB format.

2014-02-23 Thread Chandan khatua
Hi,

 

We have raw binary data stored in the database (not Word, Excel, XML, etc.
files) in a BLOB.

We are trying to index it using TikaEntityProcessor, but nothing seems to get
indexed.

But the same configuration works when XML/Word/Excel files are stored in the
BLOB field.

Below is our data-config.xml:

    [data-config.xml markup missing from the archive -- see the sketch at the
    end of this message]

Please suggest the changes required to index binary data.

 

Thanking you,

 

-Chandan
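
The data-config.xml markup in the message above did not survive in the
archive. For reference, the usual shape of a BLOB-through-Tika setup with the
DataImportHandler is roughly the following; the driver, connection details,
table, column, and field names are all placeholders:

    <dataConfig>
      <dataSource name="db" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
      <!-- streams a BLOB column to the nested Tika entity -->
      <dataSource name="blob" type="FieldStreamDataSource"/>
      <document>
        <entity name="docs" dataSource="db"
                query="SELECT id, content_blob FROM documents">
          <field column="id" name="id"/>
          <entity name="tika" processor="TikaEntityProcessor"
                  dataSource="blob" dataField="docs.content_blob" format="text">
            <field column="text" name="content"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

Note that Tika can only extract text from formats it recognizes; if the BLOB
holds a truly opaque binary payload there is nothing for Tika to parse, which
may be why Word/Excel/XML content indexes fine while the raw binary data does
not.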