Re: Solr Segments, Segment Merges, Optimize
Commit Parameters: The server does an auto commit every 30 seconds with openSearcher=false. The pipeline does a hard commit only at the very end of its run.

The high CPU issue I am seeing is only during the reads and not during the writes. Right now I see a direct correlation between latencies and # of segments, at least for a few large collections. Will post back if the theory is invalidated.

Thanks
- Nitin

On Sat, Feb 22, 2014 at 10:01 PM, Erick Erickson wrote:
> Well, it's always possible. I wouldn't expect the search time/CPU utilization to increase with # segments, within reasonable limits. At some point, the important parts of the index get read into memory and the number of segments is pretty irrelevant. You do mention that you have a heavy ingestion pipeline, which leads me to wonder whether you're committing too often. What are your commit parameters?
>
> For % deleted docs, I'm really talking about deletedDocs/numDocs.
>
> I suppose the interesting question is whether the CPU utilization you're seeing is _always_ correlated with # segments, or are you seeing certain machines always having the high CPU utilization. I suppose you could issue a commit and see what difference that made.
>
> I rather doubt that the # of segments is the underlying issue, but that's nothing but a SWAG...
>
> Best,
> Erick
>
> On Sat, Feb 22, 2014 at 6:16 PM, KNitin wrote:
> > Thanks, Erick.
> >
> > *2> There are, but you'll have to dig.*
> > >> Any pointers on where to get started?
> >
> > *3> Well, I'd ask a counter-question. Are you seeing unacceptable performance? If not, why worry? :)*
> > >> When you mean %, do you refer to deleted_docs/NumDocs or deleted_docs/Max_docs? To answer your question, yes, I see some of our shards taking 3x more time and 3x more CPU than other shards for the same queries and same number of hits (all shards have the exact same number of docs, but I see a few shards having more deleted documents than the rest).
> >
> > My understanding is that the search time/CPU would increase with the # of segments? The core of my issue is that a few nodes are running with extremely high CPU (90+) and the rest are running under 30% CPU, and the only difference between them is the # of segments in the shards on the machines. The nodes running hot have shards with 30 segments, and the ones running with lower CPU contain 20 segments and far fewer deleted documents.
> >
> > Is it possible that a difference of 10 segments could impact CPU/search time?
> >
> > Thanks
> > - Nitin
> >
> > On Sat, Feb 22, 2014 at 4:36 PM, Erick Erickson wrote:
> > > 1> It depends. Soft commits will not add a new segment. Hard commits with openSearcher=true or false _will_ create a new segment.
> > > 2> There are, but you'll have to dig.
> > > 3> Well, I'd ask a counter-question. Are you seeing unacceptable performance? If not, why worry? :)
> > >
> > > A better answer is that 24-28 segments is not at all unusual.
> > >
> > > By and large, don't bother with optimize/force merge. What I would do is look at the admin screen and note the percentage of deleted documents. If it's above some arbitrary number (I typically use 15-20%) and _stays_ there, consider optimizing.
> > >
> > > However! There is a parameter you can explicitly set in solrconfig.xml (sorry, which one escapes me now) that increases the "weight" of the % deleted documents when the merge policy decides which segments to merge. Upping this number will have the effect of more aggressively merging segments with a greater % of deleted docs. But these are already pretty heavily weighted for merging...
> > >
> > > Best,
> > > Erick
> > >
> > > On Sat, Feb 22, 2014 at 1:23 PM, KNitin wrote:
> > > > Hi
> > > >
> > > > I have the following questions:
> > > >
> > > > 1. I have a job that runs for 3-4 hours continuously committing data to a collection with an auto commit of 30 seconds. Does it mean that every 30 seconds I would get a new Solr segment?
> > > > 2. My current segment merge policy is set to 10. Will the merger always continue running in the background to reduce the segments? Is there a way to see metrics regarding segment merging from Solr (mbeans or any other way)?
> > > > 3. A few of my collections are very large, with around 24-28 segments per shard and around 16 shards. Is it bad to have this many segments for a shard of a collection? Is it a good practice to optimize the index very often, or just rely on segment merges alone?
> > > >
> > > > Thanks for the help in advance
> > > > Nitin
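[For reference: the merge-policy knob Erick can't recall is most likely reclaimDeletesWeight on TieredMergePolicy. Together with the 30-second autoCommit described above, the relevant solrconfig.xml pieces would look roughly like the sketch below; Solr 4.x syntax is assumed and the values shown are illustrative, not a recommendation.]

    <!-- Hard commit every 30 seconds without opening a new searcher,
         matching the setup Nitin describes -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>30000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
    </updateHandler>

    <indexConfig>
      <!-- "Merge policy set to 10" usually refers to these two settings -->
      <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
        <int name="maxMergeAtOnce">10</int>
        <int name="segmentsPerTier">10</int>
        <!-- Raising this biases merging toward segments with many deleted
             docs; the Lucene default is 2.0, 3.0 below is illustrative -->
        <double name="reclaimDeletesWeight">3.0</double>
      </mergePolicy>
    </indexConfig>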
Fwd: configuration for heavy system
Hi,

We are testing Solr. We have a document with some 100 indexes, and there are around 10 million records. It is failing; queries either get stuck or time out.

Is this indexing job possible with Solr? If yes, what should the hardware and Solr configuration be, and how many nodes would be optimum? Right now I am running Solr on four nodes with number of shards=2, on a 16 GB machine.

If no, how many indexes/records can Solr handle without issues?
Re: configuration for heavy system
You haven't told us anything about _how_ you're trying to index this document, nor what its format is. Nor what "100 indexes and around 10 million records" means. 1B total records? 10M total records? Solr easily handles tens of millions of records on a single decent-size node; I've seen between 50M and 300M.

Perhaps you should review: http://wiki.apache.org/solr/UsingMailingLists

Best,
Erick

On Sat, Feb 22, 2014 at 10:17 PM, Harish Reddy wrote:
> Hi,
>
> We are testing Solr. We have a document with some 100 indexes, and there are around 10 million records. It is failing; queries either get stuck or time out.
>
> Is this indexing job possible with Solr? If yes, what should the hardware and Solr configuration be, and how many nodes would be optimum? Right now I am running Solr on four nodes with number of shards=2, on a 16 GB machine.
>
> If no, how many indexes/records can Solr handle without issues?
DistributedSearch: Skipping STAGE_GET_FIELDS?
In most of our Solr use-cases, we fetch only fl=id or fl=id,score. I'd like to be able to do a distributed search and skip STAGE_GET_FIELDS -- i.e. the stage where each shard is queried for the documents found in the top ids -- as it seems like we could be collecting this information earlier in the pipeline.

Is this possible out-of-the-box? If not, how would you recommend implementing it?

Thanks!

--Gregg
Re: DistributedSearch: Skipping STAGE_GET_FIELDS?
What a coincidence - I was about to commit a patch which makes it possible. It will be released with 4.8.

See https://issues.apache.org/jira/browse/SOLR-1880

On Sun, Feb 23, 2014 at 11:27 PM, Gregg Donovan wrote:
> In most of our Solr use-cases, we fetch only fl=id or fl=id,score. I'd like to be able to do a distributed search and skip STAGE_GET_FIELDS -- i.e. the stage where each shard is queried for the documents found in the top ids -- as it seems like we could be collecting this information earlier in the pipeline.
>
> Is this possible out-of-the-box? If not, how would you recommend implementing it?
>
> Thanks!
>
> --Gregg

--
Regards,
Shalin Shekhar Mangar.
Re: DistributedSearch: Skipping STAGE_GET_FIELDS?
I should clarify, though, that this optimization only works with fl=id,score.

On Sun, Feb 23, 2014 at 11:34 PM, Shalin Shekhar Mangar wrote:
> What a coincidence - I was about to commit a patch which makes it possible. It will be released with 4.8.
>
> See https://issues.apache.org/jira/browse/SOLR-1880
>
> On Sun, Feb 23, 2014 at 11:27 PM, Gregg Donovan wrote:
>> In most of our Solr use-cases, we fetch only fl=id or fl=id,score. I'd like to be able to do a distributed search and skip STAGE_GET_FIELDS -- i.e. the stage where each shard is queried for the documents found in the top ids -- as it seems like we could be collecting this information earlier in the pipeline.
>>
>> Is this possible out-of-the-box? If not, how would you recommend implementing it?
>>
>> Thanks!
>>
>> --Gregg

--
Regards,
Shalin Shekhar Mangar.
Re: DistributedSearch: Skipping STAGE_GET_FIELDS?
On Sun, Feb 23, 2014 at 1:08 PM, Shalin Shekhar Mangar wrote:
> I should clarify though that this optimization only works with fl=id,score.

Although it seems like it should be relatively simple to make it work with other fields as well, by passing down the complete "fl" requested if some optional parameter is set (distrib.singlePass?).

-Yonik
http://heliosearch.org - native off-heap filters and fieldcache for solr
Re: DistributedSearch: Skipping STAGE_GET_FIELDS?
Yes, that should be simple. But regardless of the parameter, the fl=id,score use-case should be optimized by default. I think I'll commit the patch as-is and open a new issue to add the distrib.singlePass parameter.

On Sun, Feb 23, 2014 at 11:49 PM, Yonik Seeley wrote:
> On Sun, Feb 23, 2014 at 1:08 PM, Shalin Shekhar Mangar wrote:
>> I should clarify though that this optimization only works with fl=id,score.
>
> Although it seems like it should be relatively simple to make it work with other fields as well, by passing down the complete "fl" requested if some optional parameter is set (distrib.singlePass?).
>
> -Yonik
> http://heliosearch.org - native off-heap filters and fieldcache for solr

--
Regards,
Shalin Shekhar Mangar.
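[To make the two cases concrete, here is a sketch of the request shapes being discussed; host and collection name are illustrative, and distrib.singlePass is only a name Yonik proposes in this thread, not an existing parameter.]

    # Normal two-pass distributed search: after the first pass returns ids and
    # sort values, STAGE_GET_FIELDS issues a second request per shard for fields
    http://localhost:8983/solr/collection1/select?q=foo&fl=id,score&rows=10

    # With SOLR-1880, the fl=id,score case above is answered from the first
    # pass alone. The proposed follow-up would generalize that to other fields:
    http://localhost:8983/solr/collection1/select?q=foo&fl=id,title&rows=10&distrib.singlePass=true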
Re: Solr Segments, Segment Merges, Optimize
I should also mention that apart from committing, the pipeline also does a bunch of deletes for stale documents (based on a custom version field). The # of deletes can be very significant, causing the % of deleted documents to easily reach 40-50% of the index itself.

Thanks
KNitin

On Sun, Feb 23, 2014 at 12:02 AM, KNitin wrote:
> Commit Parameters: The server does an auto commit every 30 seconds with openSearcher=false. The pipeline does a hard commit only at the very end of its run.
>
> The high CPU issue I am seeing is only during the reads and not during the writes. Right now I see a direct correlation between latencies and # of segments, at least for a few large collections. Will post back if the theory is invalidated.
>
> Thanks
> - Nitin
Re: Wikipedia Data Cleaning at Solr
I've compared the results when using WikipediaTokenizer as the index-time analyzer, but there is no difference?

2014-02-23 3:44 GMT+02:00 Ahmet Arslan:
> Hi Furkan,
>
> There is org.apache.lucene.analysis.wikipedia.WikipediaTokenizer
>
> Ahmet
>
> On Sunday, February 23, 2014 2:22 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> Hi;
>
> I want to run an NLP algorithm on Wikipedia data. I used the dataimport handler for the dump data and everything is OK. However, there are some texts like:
>
> == Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur fakat taşımalı eğitimden yararlanılmaktadır.
>
> I think that it should be like this:
>
> Altyapı bilgileri Köyde, ilköğretim okulu yoktur fakat taşımalı eğitimden yararlanılmaktadır.
>
> On the other hand, this should be removed:
>
> {| border="0" cellpadding="5" cellspacing="5" |- bgcolor="#aa" |'''Seçim Yılı''' |'''Muhtar''' |- bgcolor="#dd" |[[2009]] |kazım güngör |- bgcolor="#dd" | |Ömer Gungor |- bgcolor="#dd" | |Fazlı Uzun |- bgcolor="#dd" | |Cemal Özden |- bgcolor="#dd" | | |}
>
> Also, including titles like == Altyapı bilgileri == should be optional (I think they can be removed for some purposes).
>
> My question is: is there any analyzer combination to clean up Wikipedia data for Solr?
>
> Thanks;
> Furkan KAMACI
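[For anyone trying to reproduce this, a field type using the Wikipedia tokenizer would typically be wired into the schema roughly as sketched below; the field-type and field names are illustrative. Note that an index-time analyzer only changes the tokens that get indexed; the stored text returned with search results is untouched, so comparing stored field values will show no difference.]

    <!-- schema.xml sketch: a field type that tokenizes MediaWiki markup
         with wiki-aware token types at index time (names are illustrative) -->
    <fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WikipediaTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="article_text" type="text_wiki" indexed="true" stored="true"/>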
Issue with PHP urlencode and solr encoding
Hi,

I have come across an issue with URL encoding between PHP and Solr. I have a field indexed with the value *WBE(Honda Edix)* in Solr.

From PHP code, if I urlencode($string) and send it to Solr, I do not get accurate results. Here is the relevant part of the Solr query: *fq=model:WBE(Honda+Edix)*

However, if I do it this way directly against Solr, *fq=model:WBE\(Honda+Edix\)*, I get accurate results.

I assume that '(' and ')' are part of the Solr query syntax. How do I escape '(' and ')' from the client side?
Re: Issue with PHP urlencode and solr encoding
On 2/23/2014 8:58 PM, manju16832003 wrote:
> I have come across an issue with URL encoding between PHP and Solr. I have a field indexed with the value *WBE(Honda Edix)* in Solr.
>
> From PHP code, if I urlencode($string) and send it to Solr, I do not get accurate results. Here is the relevant part of the Solr query: *fq=model:WBE(Honda+Edix)*
>
> However, if I do it this way directly against Solr, *fq=model:WBE\(Honda+Edix\)*, I get accurate results.
>
> I assume that '(' and ')' are part of the Solr query syntax. How do I escape '(' and ')' from the client side?

This reply got to be a lot longer than I intended. Here's the novel:

URL encoding is only what needs to be done when you are constructing a URL. Those values will be decoded by Solr before being passed to the query parser. The query parser has its own set of characters that are special. If you intend any of these characters to be literal, they must be escaped with a backslash.

http://lucene.apache.org/core/4_6_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true#Escaping_Special_Characters

Although there is some overlap between URL encoding and query escaping, they do have different lists of characters that require changing. Escaping query characters must be done before URL encoding.

Another way to allow special characters in your query is to make it a phrase query - enclose it in double quotes. This would be your query using this method, before URL encoding:

fq=model:"WBE(Honda Edix)"

Note that the phrase query method does not always produce the expected results, and depending on your configuration, in some cases won't work at all.

The PECL Solr library for PHP has a query escaping method similar to what can be found in SolrJ. Here's their documentation reference for it:

http://www.php.net/manual/en/solrutils.escapequerychars.php

The Solarium library for PHP also says that it does escaping, but I can't find the manual section that they mention about term escaping. Here's a section that has an example of phrase escaping (putting the value in double quotes):

http://wiki.solarium-project.org/index.php/V3:Escaping

There is a bug in the PECL library that makes it not work with Solr 4.x. I created a patch for this bug, but they haven't fixed it in any downloadable version.

https://bugs.php.net/bug.php?id=62332

Thanks,
Shawn
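[A minimal PHP sketch of the order of operations Shawn describes, assuming the PECL Solr extension is installed; host, core, and field names are illustrative.]

    <?php
    // Value that contains query-parser metacharacters
    $value = 'WBE(Honda Edix)';

    // 1. Escape Lucene/Solr query-parser special characters first
    //    (SolrUtils ships with the PECL Solr extension)
    $escaped = SolrUtils::escapeQueryChars($value);

    // 2. Build the raw parameter value, then URL-encode it only when
    //    constructing the request URL
    $fq  = 'model:' . $escaped;
    $url = 'http://localhost:8983/solr/collection1/select'
         . '?q=*:*&wt=json&fq=' . rawurlencode($fq);

    // $response = file_get_contents($url);

Doing the query escaping before rawurlencode() matters because the backslashes added by the escape step must themselves survive URL decoding intact.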
Re: Issue with PHP urlencode and solr encoding
On Mon, Feb 24, 2014 at 11:52 AM, Shawn Heisey wrote:
> The Solarium library for PHP also says that it does escaping, but I can't find the manual section that they mention about term escaping. Here's a section that has an example of phrase escaping (putting the value in double quotes):
>
> http://wiki.solarium-project.org/index.php/V3:Escaping
>
> Thanks,
> Shawn

They do have it:

https://github.com/basdenooijer/solarium/blob/master/library/Solarium/Core/Query/Helper.php#L104

regards,
rico
Re: Issue with PHP urlencode and solr encoding
Hi Shawn and Rico,

Thank you for your suggestions; those are valuable suggestions :-).

If a phrase query sometimes does not work as expected, I guess we could use the *term query parser* instead:

http://blog.florian-hopf.de/2013/01/make-your-filters-match-faceting-in-solr.html

This worked fine: *fq={!term%20f=model%20v="WBE(Honda%20Edix)"}*.

I agree with Shawn's comment that "Escaping query characters must be done before URL encoding." :-)

Thanks again for your replies.
Cannot index raw binary data stored in database in BLOB format.
Hi,

We have raw binary data stored in the database (not Word, Excel, XML, etc. files) in a BLOB. We are trying to index it using TikaEntityProcessor, but nothing seems to get indexed. However, the same configuration works when XML/Word/Excel files are stored in the BLOB field.

Below is our data-config.xml:

Please suggest the changes required to index binary data.

Thanking you,
-Chandan
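[For context, the usual way TikaEntityProcessor is combined with FieldStreamDataSource to read a BLOB column looks roughly like the sketch below; all driver, table, column, and field names are illustrative and not the poster's actual configuration.]

    <dataConfig>
      <!-- JDBC source for the parent entity (driver/url are placeholders) -->
      <dataSource name="db" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/docs" user="user" password="pass"/>
      <!-- Streams the BLOB column of the parent row into Tika -->
      <dataSource name="blob" type="FieldStreamDataSource"/>
      <document>
        <entity name="doc" dataSource="db"
                query="SELECT id, content_blob FROM documents">
          <field column="id" name="id"/>
          <entity name="tika" processor="TikaEntityProcessor"
                  dataSource="blob" dataField="doc.content_blob" format="text">
            <field column="text" name="content"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

Note that Tika can only extract text from formats it recognizes; an arbitrary raw binary payload that is not a known document format will typically yield no extractable text.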