Re: xml indexing
Thanks for your reply, but it only works when I get no response. As I said, I'm working on arrays. As soon as I get an array, it doesn't matter whether the array's length is 1 or 105, it returns what I described earlier.

#1 JSON response: "detailComment": [ "100.01", null, "102.01", null ]
#1 indexed as: "detailComment": [ "100.01", "102.01" ]

#2 JSON response: "detailComment": [ null ]
#2 indexed as: "detailComment": [ "0.0" ]

The result I want to see:

#3 JSON response: "detailComment": [ "100.01", null, "102.01", null ]
#3 indexed as: "detailComment": [ "100.01", "0.0", "102.01", "0.0" ]

(Attachment references stripped by the mail archive: detailComment, 0.0, dih-config.xml, upd.) -- View this message in context: http://lucene.472066.n3.nabble.com/xml-indexing-tp4344191p4344298.html Sent from the Solr - User mailing list archive at Nabble.com.
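For reference, one way to get the #3 output above is a small custom DIH transformer that keeps the null array elements and substitutes the default instead of dropping them. This is only a sketch under the assumption that a custom Java transformer is acceptable; the class name and the hard-coded column name are placeholders, and it would have to be registered on the entity in dih-config.xml (transformer="com.example.dih.NullToDefaultTransformer").

package com.example.dih;

import java.util.List;
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

// Replaces null elements of a multi-valued column with "0.0" so positions are preserved.
public class NullToDefaultTransformer extends Transformer {
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    Object value = row.get("detailComment");          // placeholder column name
    if (value instanceof List) {
      @SuppressWarnings("unchecked")
      List<Object> values = (List<Object>) value;
      for (int i = 0; i < values.size(); i++) {
        if (values.get(i) == null) {
          values.set(i, "0.0");                       // substitute the default, keep the position
        }
      }
    }
    return row;
  }
}

(This assumes the entity processor hands the transformer a mutable list; if not, build a new list and put it back into the row map.)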
Re: Strange boolean query behaviour on 5.5.4
On 04/07/17 18:10, Erick Erickson wrote: > I think you'll get what you expect by something like: > (*:* -someField:Foo) AND (otherField: (Bar OR Baz)) Yeah that's what I figured. It's not a big deal since we generate Solr syntax using a parser/generator on top of our own query syntax. Still a little strange! Thanks for the heads up, - Bram
Re: Did /export use to emit tuples and now does not?
Thanks, Joel. I just wanted to confirm, as I was having trouble tracking down when the change occurred. -R On 04/07/2017, 23:51, "Joel Bernstein" wrote: In the very early releases (5x) the /export handler had a different format than the /search handler. Later the /export handler was changed to have the same basic response format as the /search handler. This was done in anticipation of unifying /search and /export at a later date. The /export handler still powers the parallel relational algebra expressions. In Solr 7.0 there is a shuffle expression that always uses the /export handler to sort and partition result sets. In 6x the search expression can be used with the qt=/export param to use the /export handler. Joel Bernstein http://joelsolr.blogspot.com/ On Tue, Jul 4, 2017 at 11:38 AM, Ronald Wood wrote: > 9 months ago I did a proof of concept for Solr streaming using the /export > handler. At that time, I got tuples back. > > Now when I try 6.x, I get results in a format similar to /search > (including a count), instead of tuples (with an EOF). > > Did something change between 5.x and 6.x in this regard? > > I am trying to stream results in a non-cloud scenario, and I was under the > impression that /export was the primitive handler for the more advanced > streaming operations only possible under Solr Cloud. > > I am using official docker images for testing. I tried to retest under > 5.5.4 but I need to do some more work as docValues aren’t the default when > using the gettingstarted index. > > -Ronald Wood > >
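For anyone comparing the handlers, here is a rough SolrJ sketch of reading tuples from /export in 6.x (assuming a 6.4+ SolrJ client, ZooKeeper on localhost:9983 and an id field with docValues, since /export can only sort on and return docValues fields); the qt=/export parameter is what forces the export handler:

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ExportReadExample {
  public static void main(String[] args) throws Exception {
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", "*:*");
    params.set("fl", "id");
    params.set("sort", "id asc");
    params.set("qt", "/export");              // use the /export handler instead of /select

    CloudSolrStream stream = new CloudSolrStream("localhost:9983", "gettingstarted", params);
    try {
      stream.open();
      while (true) {
        Tuple tuple = stream.read();
        if (tuple.EOF) {                      // streams terminate with an EOF tuple
          break;
        }
        System.out.println(tuple.getString("id"));
      }
    } finally {
      stream.close();
    }
  }
}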
High disk write usage
Hi, We are implementing a SolrCloud cluster (version 6.6) with NRT requirements. We are indexing 600 docs/sec with peaks of 1500 docs/sec, and we are serving about 1500 qps. Our documents have 300 fields, some with docValues, and are about 4 KB each; we have 3 million documents. Hard commit is set to 15 minutes, but disk writing is about 15 mbps all the time (60 mbps at peaks), without higher disk write rates every 15 minutes... Is this the expected behaviour?
RE: Solr Prod Issue | KeeperErrorCode = ConnectionLoss for /overseer_elect/leader
Hi I'm not sure if any of you have had a chance to see this email yet. We had a reoccurrence of the Issue Today, and I'm attaching the Logs from today as well inline below. Please let me know if any of you have seen this issue before as this would really help me to get to the root of the problem to fix it. I'm a little lost here and not entirely sure what to do. Thanks, Rahat Bhalla 8696248 [qtp778720569-28] [ WARN] 2017-07-04 01:40:20 (HttpParser.java:parseNext:1391) - parse exception: java.lang.IllegalArgumentException: No Authority for HttpChannelOverHttp@30a86e14{r=0,c=false,a=IDLE,uri=null} java.lang.IllegalArgumentException: No Authority at org.eclipse.jetty.http.HostPortHttpField.(HostPortHttpField.java:43) at org.eclipse.jetty.http.HttpParser.parsedHeader(HttpParser.java:877) at org.eclipse.jetty.http.HttpParser.parseHeaders(HttpParser.java:1050) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1266) at org.eclipse.jetty.server.HttpConnection.parseRequestBuffer(HttpConnection.java:344) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:227) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572) at java.lang.Thread.run(Unknown Source) 8697308 [qtp778720569-21] [ WARN] 2017-07-04 01:40:21 (HttpParser.java:parseNext:1364) - bad HTTP parsed: 400 Bad URI for HttpChannelOverHttp@1276{r=16,c=false,a=IDLE,uri=/../../../../../../../../../../etc/passwd} 8697338 [qtp778720569-29] [ WARN] 2017-07-04 01:40:21 (HttpParser.java:parseNext:1364) - bad HTTP parsed: 400 No Host for HttpChannelOverHttp@50a994ce{r=29,c=false,a=IDLE,uri=null} 8697388 [qtp778720569-21] [ WARN] 2017-07-04 01:40:22 (HttpParser.java:parseNext:1364) - bad HTTP parsed: 400 Bad URI for HttpChannelOverHttp@19a624ec{r=1,c=false,a=IDLE,uri=//prod-solr-node01.healthplan.com:9080/solr/admin/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/etc/passwd} 8697401 [qtp778720569-27] [ WARN] 2017-07-04 01:40:22 (URIUtil.java:decodePath:348) - /solr/admin/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/etc/passwd org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! byte C0 in state 0 8697444 [qtp778720569-25] [ WARN] 2017-07-04 01:40:22 (URIUtil.java:decodePath:348) - /solr/admin/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/etc/passwd org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! 
byte 80 in state 4 8697475 [qtp778720569-26] [ WARN] 2017-07-04 01:40:22 (URIUtil.java:decodePath:348) - /solr/admin/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/etc/passwd org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! byte 80 in state 6 8697500 [qtp778720569-29] [ WARN] 2017-07-04 01:40:22 (URIUtil.java:decodePath:348) - /solr/admin/%f8%80%80%80%ae%f8%80%80%80%ae/%f8%80%80%80%ae%f8%80%80%80%ae/%f8%80%80%80%ae%f8%80%80%80%ae/%f8%80%80%80%ae%f8%80%80%80%ae/%f8%80%80%80%ae%f8%80%80%80%ae/%f8%80%80%80%ae%f8%80%80%80%ae/%f8%80%80%80%ae%f8%80%80%80%ae/%f8%80%80%80%ae%f8%80%80%80%ae/%f8%80%80%80%ae%f8%80%80%80%ae/%f8%80%80%80%ae%f8%80%80%80%ae/%f8%80%80%80%ae%f8%80%80%80%ae/%f8%80%80%80%ae%f8%80%80%80%ae/etc/passwd org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! byte F8 in state 0 8706641 [qtp778720569-27] [ WARN] 2017-07-04 01:40:31 (HttpParser.java:parseNext:1364) - bad HTTP parsed: 400 Unknown Version for HttpChannelOverHttp@7fcd594a{r=54,c=false,a=IDLE,uri=null} 8707033 [qtp778720569-20] [ WARN] 2017-07-04 01:40:31 (HttpParser.java:parseNext:1364) - bad HTTP parsed: 400 Unknown Version for HttpChannelOverHttp@66740d77{r=54,c=false,a=IDLE,uri=null} 8719390 [qtp778720569-23] [ WARN] 2017-07-04 01:40:44 (HttpParser.java::1740) - Illegal character 0xA in state=HEADER_IN_N
Re: Solr Prod Issue | KeeperErrorCode = ConnectionLoss for /overseer_elect/leader
From the fact that someone has tried to access /etc/passwd file via your Solr (see all those WARN messages), it seems you have it exposed to the world, unless of course it's a security scanner you use internally. Internet is a hostile place, and the very first thing I would do is shield Solr from external traffic. Even if it's your own security scanning, I wouldn't do it until you have the system stable. Doing the above you'll reduce noise in the logs and might be able to better identify the issue. Losing the Zookeeper connection is typically a Java garbage collection issue. If GC causes too long pauses, the connection may time out. So I would recommend you start by reading https://wiki.apache.org/solr/SolrPerformanceProblems and https://wiki.apache.org/solr/ShawnHeisey. Also make sure that Zookeeper's Java settings are good. --Ere Bhalla, Rahat kirjoitti 5.7.2017 klo 11.05: Hi I’m not sure if any of you have had a chance to see this email yet. We had a reoccurrence of the Issue Today, and I’m attaching the Logs from today as well inline below. Please let me know if any of you have seen this issue before as this would really help me to get to the root of the problem to fix it. I’m a little lost here and not entirely sure what to do. Thanks, Rahat Bhalla 8696248 [qtp778720569-28] [ WARN] 2017-07-04 01:40:20 (HttpParser.java:parseNext:1391) - parse exception: java.lang.IllegalArgumentException: No Authority for HttpChannelOverHttp@30a86e14{r=0,c=false,a=IDLE,uri=null} java.lang.IllegalArgumentException: No Authority at org.eclipse.jetty.http.HostPortHttpField.(HostPortHttpField.java:43) at org.eclipse.jetty.http.HttpParser.parsedHeader(HttpParser.java:877) at org.eclipse.jetty.http.HttpParser.parseHeaders(HttpParser.java:1050) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1266) at org.eclipse.jetty.server.HttpConnection.parseRequestBuffer(HttpConnection.java:344) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:227) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572) at java.lang.Thread.run(Unknown Source) 8697308 [qtp778720569-21] [ WARN] 2017-07-04 01:40:21 (HttpParser.java:parseNext:1364) - bad HTTP parsed: 400 Bad URI for HttpChannelOverHttp@1276{r=16,c=false,a=IDLE,uri=/../../../../../../../../../../etc/passwd} 8697338 [qtp778720569-29] [ WARN] 2017-07-04 01:40:21 (HttpParser.java:parseNext:1364) - bad HTTP parsed: 400 No Host for HttpChannelOverHttp@50a994ce{r=29,c=false,a=IDLE,uri=null} 8697388 [qtp778720569-21] [ WARN] 2017-07-04 01:40:22 (HttpParser.java:parseNext:1364) - bad HTTP parsed: 400 Bad URI for HttpChannelOverHttp@19a624ec{r=1,c=false,a=IDLE,uri=//prod-solr-node01.healthplan.com:9080/solr/admin/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/etc/passwd} 8697401 [qtp778720569-27] [ WARN] 2017-07-04 01:40:22 (URIUtil.java:decodePath:348) - 
/solr/admin/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/etc/passwd org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! byte C0 in state 0 8697444 [qtp778720569-25] [ WARN] 2017-07-04 01:40:22 (URIUtil.java:decodePath:348) - /solr/admin/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/etc/passwd org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! byte 80 in state 4 8697475 [qtp778720569-26] [ WARN] 2017-07-04 01:40:22 (URIUtil.java:decodePath:348) - /solr/admin/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/etc/passwd org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! byte 80 in state 6 8697500 [qtp778720569-29] [ WARN] 2017-07-04 01:40:22 (URIUtil.java:decodePath:348) - /solr/admin/%f8%80%80%80%
help on implicit routing
I am trying out the document routing feature in Solr 6.4.1. I am unable to comprehend the documentation where it states that “The 'implicit' router does not automatically route documents to different shards. Whichever shard you indicate on the indexing request (or within each document) will be used as the destination for those documents”. How do you specify the shard inside a document? E.g. if I have a basic collection with two shards called day_1 and day_2, what value should be populated in the router field to ensure the document is routed to the respective shard? Regards, Imran Sent from Mail for Windows 10
index new discovered fileds of different types
Hi, We are trying to index documents of different types. Documents have different fields; the fields are known at indexing time. We run a query on a database and we index what comes back, using the query variables as field names in Solr. Our current solution: we use dynamic fields with a prefix, for example feature_i_*. The issues with that: 1) we need to define the type of the dynamic field, and to cover the types of the discovered fields we define the following: feature_i_* for integers, feature_t_* for strings, feature_d_* for doubles; 1.a) this means we need to check the type of the discovered field and then put it in the corresponding dynamic field; 2) at search time, we need to know the right prefix. We are looking for help to find a way to ignore the prefix and the check of the type. regards, Thaer
Re: Solr dynamic "on the fly fields"
Thanks Erick for the answer. Function Queries are great, but for my use case what I really do is making aggregations (using Json Facet for example) with this functions. I have tried using Function Queries with Json Facet but it does not support it. Any other idea you can imagine? 2017-07-03 21:57 GMT-03:00 Erick Erickson : > I don't know how one would do this. But I would ask what the use-case > is. Creating such fields at index time just seems like it would be > inviting abuse by creating a zillion fields as you have no control > over what gets created. I'm assuming your tenants don't talk to each > other > > Have you thought about using function queries to pull this data out as > needed at _query_ time? See: > https://cwiki.apache.org/confluence/display/solr/Function+Queries > > Best, > Erick > > On Mon, Jul 3, 2017 at 12:06 PM, Pablo Anzorena > wrote: > > Thanks Erick, > > > > For my use case it's not possible any of those solutions. I have a > > multitenancy scheme in the most basic level, that is I have a single > > collection with fields (clientId, field1, field2, ..., field50) attending > > many clients. > > > > Clients can create custom fields based on arithmetic operations of any > > other field. > > > > So, is it possible to update let's say field49 with the follow operation: > > log(field39) + field25 on clientId=43? > > > > Do field39 and field25 need to be stored to accomplish this? Is there any > > other way to avoid storing them? > > > > Thanks! > > > > > > 2017-07-03 15:00 GMT-03:00 Erick Erickson : > > > >> There are two ways: > >> 1> define a dynamic field pattern, i.e. > >> > >> > >> > >> Now just add any field in the doc you want. If it ends in "_sum" and > >> no other explicit field matches you have a new field. > >> > >> 2> Use the managed schema to add these on the fly. I don't recommend > >> this from what I know of your use case, this is primarily intended for > >> front-ends to be able to modify the schema and/or "field guessing". > >> > >> I do caution you though that either way don't go over-the-top. If > >> you're thinking of thousands of different fields that can lead to > >> performance issues. > >> > >> You can either put stuff in the field on your indexing client or > >> create a custom update component, perhaps the simplest would be a > >> "StatelessScriptUpdateProcessorFactory: > >> > >> see: https://cwiki.apache.org/confluence/display/solr/ > >> Update+Request+Processors#UpdateRequestProcessors- > >> UpdateRequestProcessorFactories > >> > >> Best, > >> Erick > >> > >> On Mon, Jul 3, 2017 at 10:52 AM, Pablo Anzorena < > anzorena.f...@gmail.com> > >> wrote: > >> > Hey, > >> > > >> > I was wondering if there is some way to add fields "on the fly" based > on > >> > arithmetic operations on other fields. For example add a new field > >> > "custom_field" = log(field1) + field2 -5. > >> > > >> > Thanks. > >> >
Re: High disk write usage
Is the physical machine dedicated? Is it a dedicated VM on shared metal? Apart from these operational checks, I will assume the machine is dedicated. In Solr a write to the disk does not happen only on commit; I can think of other scenarios: 1) *Transaction log* [1] 2) RAM buffer flushes (see the follow-up message) 3) Spellcheck and SuggestComponent building (this depends on the config, in case you use them) 4) Memory swapping? 5) Merges (they are potentially triggered by a segment write or an explicit optimize call, and they can last a while) Maybe other edge cases, but I would first check this list! [1] https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/High-disk-write-usage-tp4344356p4344383.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: High disk write usage
Point 2 was the RAM buffer size: *ramBufferSizeMB* sets the amount of RAM that may be used by Lucene indexing for buffering added documents and deletions before they are flushed to the Directory. maxBufferedDocs sets a limit on the number of documents buffered before flushing. If both ramBufferSizeMB and maxBufferedDocs are set, then Lucene will flush based on whichever limit is hit first. <ramBufferSizeMB>100</ramBufferSizeMB> <maxBufferedDocs>1000</maxBufferedDocs> - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/High-disk-write-usage-tp4344356p4344386.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Unique() metrics not supported in Solr Streaming facet stream source
Does "uniq" expression sounds good to use for UniqueMetric class? Thanks, Susheel On Tue, Jul 4, 2017 at 5:45 PM, Susheel Kumar wrote: > Hello Joel, > > I tried to create a patch to add UniqueMetric and it works, but soon > realized, we have UniqueStream as well and can't load both of them (like > below) when required, since both uses "unique" keyword. > > Any advice how we can handle this. Come up with different keyword for > UniqueMetric or rename UniqueStream etc..? > >StreamFactory factory = new StreamFactory() > .withCollectionZkHost (...) >.withFunctionName("facet", FacetStream.class) > .withFunctionName("sum", SumMetric.class) > .withFunctionName("unique", UniqueStream.class) > .withFunctionName("unique", UniqueMetric.class) > > On Thu, Jun 29, 2017 at 9:32 AM, Joel Bernstein > wrote: > >> This is mainly due to focus on other things. It would great to support all >> the aggregate functions in facet, rollup and timeseries expressions. >> >> Joel Bernstein >> http://joelsolr.blogspot.com/ >> >> On Thu, Jun 29, 2017 at 8:23 AM, Zheng Lin Edwin Yeo < >> edwinye...@gmail.com> >> wrote: >> >> > Hi, >> > >> > We are working on the Solr Streaming expression, using the facet stream >> > source. >> > >> > As the underlying structure is using JSON Facet, would like to find out >> why >> > the unique() metrics is not supported? Currently, it only supports >> sum(col) >> > , avg(col), min(col), max(col), count(*) >> > >> > I'm using Solr 6.5.1 >> > >> > Regards, >> > Edwin >> > >> > >
Re: cursorMark / Deep Paging and SEO
On 6/30/2017 1:30 AM, Jacques du Rand wrote: > I'm not quite sure I understand the deep paging / cursorMark internals > > We have implemented it on our search pages like so: > > http://mysite.com/search?foobar&page=1 > http://mysite.com/search?foobar&page=2&cmark=djkldskljsdsa > http://mysite.com/search?foobar&page=3&cmark=uoieuwqjdlsa > > So if we reindex the data the cursorMark for search "foobar" and page2 will > the cursorMark value change ??? > > But google might have already index our page as " > http://mysite.com/search?foobar&page=2&cmark=djkldskljsdsa"; so that > cursorMark will keep changing ?? The cursorMark feature does not use page numbers, so your "page" parameter won't provide any useful information to Solr. Presumably you're using that for your own application. The string values used in cursorMark have a tendency to lose their usefulness the more you index after making the query, so they are not useful to have in Google's index. The cursorMark value points at a specific document ... if you index new documents or delete old documents, that specific document might end up on a completely different page number than where it was when you initially made the query. Thanks, Shawn
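To illustrate Shawn's point that cursor marks are meant to be consumed right away rather than stored in an SEO-visible URL, this is roughly how a cursor is walked with SolrJ (collection and field names are placeholders). The sort must include the uniqueKey as a tie-breaker, and iteration stops when the returned cursor mark no longer changes:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorWalk {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
    SolrQuery q = new SolrQuery("foobar");
    q.setRows(50);
    q.setSort("score", SolrQuery.ORDER.desc);
    q.addSort("id", SolrQuery.ORDER.asc);      // uniqueKey tie-break, required for cursors

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;   // "*"
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query(q);
      for (SolrDocument doc : rsp.getResults()) {
        System.out.println(doc.getFieldValue("id"));
      }
      String next = rsp.getNextCursorMark();
      if (cursorMark.equals(next)) {           // unchanged mark means the result set is exhausted
        break;
      }
      cursorMark = next;
    }
    client.close();
  }
}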
Re: High disk write usage
Thanks a lot Alessandro! Yes, we have very big physical dedicated machines, with a topology of 5 shards and 10 replicas per shard. 1. Transaction log files are increasing, but not at this rate. 2. We've tried values between 300 and 2000 MB... without any visible results. 3. We don't use those features. 4. No. 5. I've tried low and high merge factors and I think that is the point. With a low merge factor (around 4) we have a high disk write rate, as I said previously. With a merge factor of 20 the disk write rate decreases, but now, with high qps rates (over 1000 qps), the system is overloaded. I think that's the expected behaviour :( 2017-07-05 15:49 GMT+02:00 alessandro.benedetti : > Point 2 was the ram Buffer size : > > *ramBufferSizeMB* sets the amount of RAM that may be used by Lucene > indexing for buffering added documents and deletions before they > are > flushed to the Directory. > maxBufferedDocs sets a limit on the number of documents buffered > before flushing. > If both ramBufferSizeMB and maxBufferedDocs is set, then > Lucene will flush based on whichever limit is hit first. > > 100 > 1000 > > > > > - > --- > Alessandro Benedetti > Search Consultant, R&D Software Engineer, Director > Sease Ltd. - www.sease.io > -- > View this message in context: http://lucene.472066.n3. > nabble.com/High-disk-write-usage-tp4344356p4344386.html > Sent from the Solr - User mailing list archive at Nabble.com. >
RE: High disk write usage
Try mergeFactor of 10 (default) which should be fine in most cases. If you got an extreme case, either create more shards and consider better hardware (SSD's) -Original message- > From:Antonio De Miguel > Sent: Wednesday 5th July 2017 16:48 > To: solr-user@lucene.apache.org > Subject: Re: High disk write usage > > Thnaks a lot alessandro! > > Yes, we have very big physical dedicated machines, with a topology of 5 > shards and10 replicas each shard. > > > 1. transaction log files are increasing but not with this rate > > 2. we 've probed with values between 300 and 2000 MB... without any > visible results > > 3. We don't use those features > > 4. No. > > 5. I've probed with low and high mergefacors and i think that is the point. > > With low merge factor (over 4) we 've high write disk rate as i said > previously > > with merge factor of 20, writing disk rate is decreasing, but now, with > high qps rates (over 1000 qps) system is overloaded. > > i think that's the expected behaviour :( > > > > > 2017-07-05 15:49 GMT+02:00 alessandro.benedetti : > > > Point 2 was the ram Buffer size : > > > > *ramBufferSizeMB* sets the amount of RAM that may be used by Lucene > > indexing for buffering added documents and deletions before they > > are > > flushed to the Directory. > > maxBufferedDocs sets a limit on the number of documents buffered > > before flushing. > > If both ramBufferSizeMB and maxBufferedDocs is set, then > > Lucene will flush based on whichever limit is hit first. > > > > 100 > > 1000 > > > > > > > > > > - > > --- > > Alessandro Benedetti > > Search Consultant, R&D Software Engineer, Director > > Sease Ltd. - www.sease.io > > -- > > View this message in context: http://lucene.472066.n3. > > nabble.com/High-disk-write-usage-tp4344356p4344386.html > > Sent from the Solr - User mailing list archive at Nabble.com. > > >
Re: index new discovered fileds of different types
Hi Thaer, Do you use schemeless mode [1] ? Kind Regards, Furkan KAMACI [1] https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode On Wed, Jul 5, 2017 at 4:23 PM, Thaer Sammar wrote: > Hi, > We are trying to index documents of different types. Document have > different fields. fields are known at indexing time. We run a query on a > database and we index what comes using query variables as field names in > solr. Our current solution: we use dynamic fields with prefix, for example > feature_i_*, the issue with that > 1) we need to define the type of the dynamic field and to be able to cover > the type of discovered fields we define the following > feature_i_* for integers, feature_t_* for string, feature_d_* for double, > > 1.a) this means we need to check the type of the discovered field and then > put in the corresponding dynamic field > 2) at search time, we need to know the right prefix > We are looking for help to find away to ignore the prefix and check of the > type > > regards, > Thaer
Re: High disk write usage
thanks Markus! We already have SSD. About changing topology we probed yesterday with 10 shards, but system goes more inconsistent than with the current topology (5x10). I dont know why... too many traffic perhaps? About merge factor.. we set default configuration for some days... but when a merge occurs system overload. We probed with mergefactor of 4 to improbe query times and trying to have smaller merges. 2017-07-05 16:51 GMT+02:00 Markus Jelsma : > Try mergeFactor of 10 (default) which should be fine in most cases. If you > got an extreme case, either create more shards and consider better hardware > (SSD's) > > -Original message- > > From:Antonio De Miguel > > Sent: Wednesday 5th July 2017 16:48 > > To: solr-user@lucene.apache.org > > Subject: Re: High disk write usage > > > > Thnaks a lot alessandro! > > > > Yes, we have very big physical dedicated machines, with a topology of 5 > > shards and10 replicas each shard. > > > > > > 1. transaction log files are increasing but not with this rate > > > > 2. we 've probed with values between 300 and 2000 MB... without any > > visible results > > > > 3. We don't use those features > > > > 4. No. > > > > 5. I've probed with low and high mergefacors and i think that is the > point. > > > > With low merge factor (over 4) we 've high write disk rate as i said > > previously > > > > with merge factor of 20, writing disk rate is decreasing, but now, with > > high qps rates (over 1000 qps) system is overloaded. > > > > i think that's the expected behaviour :( > > > > > > > > > > 2017-07-05 15:49 GMT+02:00 alessandro.benedetti : > > > > > Point 2 was the ram Buffer size : > > > > > > *ramBufferSizeMB* sets the amount of RAM that may be used by Lucene > > > indexing for buffering added documents and deletions before > they > > > are > > > flushed to the Directory. > > > maxBufferedDocs sets a limit on the number of documents > buffered > > > before flushing. > > > If both ramBufferSizeMB and maxBufferedDocs is set, then > > > Lucene will flush based on whichever limit is hit first. > > > > > > 100 > > > 1000 > > > > > > > > > > > > > > > - > > > --- > > > Alessandro Benedetti > > > Search Consultant, R&D Software Engineer, Director > > > Sease Ltd. - www.sease.io > > > -- > > > View this message in context: http://lucene.472066.n3. > > > nabble.com/High-disk-write-usage-tp4344356p4344386.html > > > Sent from the Solr - User mailing list archive at Nabble.com. > > > > > >
Re: index new discovered fileds of different types
Hi Furkan, No, In the schema we also defined some static fields such as uri and geo field. On 5 July 2017 at 17:07, Furkan KAMACI wrote: > Hi Thaer, > > Do you use schemeless mode [1] ? > > Kind Regards, > Furkan KAMACI > > [1] https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode > > On Wed, Jul 5, 2017 at 4:23 PM, Thaer Sammar wrote: > > > Hi, > > We are trying to index documents of different types. Document have > > different fields. fields are known at indexing time. We run a query on a > > database and we index what comes using query variables as field names in > > solr. Our current solution: we use dynamic fields with prefix, for > example > > feature_i_*, the issue with that > > 1) we need to define the type of the dynamic field and to be able to > cover > > the type of discovered fields we define the following > > feature_i_* for integers, feature_t_* for string, feature_d_* for double, > > > > 1.a) this means we need to check the type of the discovered field and > then > > put in the corresponding dynamic field > > 2) at search time, we need to know the right prefix > > We are looking for help to find away to ignore the prefix and check of > the > > type > > > > regards, > > Thaer >
Re: index new discovered fileds of different types
I really have no idea what "to ignore the prefix and check of the type" means. When? How? Can you give an example of inputs and outputs? You might want to review: https://wiki.apache.org/solr/UsingMailingLists And to add to what Furkan mentioned, in addition to schemaless you can use "managed schema" which will allow you to add fields and types on the fly. Best, Erick On Wed, Jul 5, 2017 at 8:12 AM, Thaer Sammar wrote: > Hi Furkan, > > No, In the schema we also defined some static fields such as uri and geo > field. > > On 5 July 2017 at 17:07, Furkan KAMACI wrote: > >> Hi Thaer, >> >> Do you use schemeless mode [1] ? >> >> Kind Regards, >> Furkan KAMACI >> >> [1] https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode >> >> On Wed, Jul 5, 2017 at 4:23 PM, Thaer Sammar wrote: >> >> > Hi, >> > We are trying to index documents of different types. Document have >> > different fields. fields are known at indexing time. We run a query on a >> > database and we index what comes using query variables as field names in >> > solr. Our current solution: we use dynamic fields with prefix, for >> example >> > feature_i_*, the issue with that >> > 1) we need to define the type of the dynamic field and to be able to >> cover >> > the type of discovered fields we define the following >> > feature_i_* for integers, feature_t_* for string, feature_d_* for double, >> > >> > 1.a) this means we need to check the type of the discovered field and >> then >> > put in the corresponding dynamic field >> > 2) at search time, we need to know the right prefix >> > We are looking for help to find away to ignore the prefix and check of >> the >> > type >> > >> > regards, >> > Thaer >>
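If the managed schema route is taken, newly discovered fields can be added programmatically through the Schema API instead of relying on prefixed dynamic fields. A hedged SolrJ sketch (collection, field name and field type here are examples only, and this assumes the config set uses the managed schema):

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class AddDiscoveredField {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient.Builder().withZkHost("localhost:9983").build();
    client.setDefaultCollection("mycollection");

    // Describe the newly discovered field; the type is chosen from the inspected value.
    Map<String, Object> field = new LinkedHashMap<>();
    field.put("name", "feature_price");
    field.put("type", "tdouble");
    field.put("indexed", true);
    field.put("stored", true);

    new SchemaRequest.AddField(field).process(client);   // adds the field to the managed schema
    client.close();
  }
}

At search time the field is then addressed by its plain name, with no type prefix.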
Re: High disk write usage
What is your soft commit interval? That'll cause I/O as well. How much physical RAM and how much is dedicated to _all_ the JVMs on a machine? One cause here is that Lucene uses MMapDirectory which can be starved for OS memory if you use too much JVM, my rule of thumb is that _at least_ half of the physical memory should be reserved for the OS. Your transaction logs should fluctuate but even out. By that I mean they should increase in size but every hard commit should truncate some of them so I wouldn't expect them to grow indefinitely. One strategy is to put your tlogs on a separate drive exactly to reduce contention. You could disable them too at a cost of risking your data. That might be a quick experiment you could run though, disable tlogs and see what that changes. Of course I'd do this on my test system ;). But yeah, Solr will use a lot of I/O in the scenario you are outlining I'm afraid. Best, Erick On Wed, Jul 5, 2017 at 8:08 AM, Antonio De Miguel wrote: > thanks Markus! > > We already have SSD. > > About changing topology we probed yesterday with 10 shards, but system > goes more inconsistent than with the current topology (5x10). I dont know > why... too many traffic perhaps? > > About merge factor.. we set default configuration for some days... but when > a merge occurs system overload. We probed with mergefactor of 4 to improbe > query times and trying to have smaller merges. > > 2017-07-05 16:51 GMT+02:00 Markus Jelsma : > >> Try mergeFactor of 10 (default) which should be fine in most cases. If you >> got an extreme case, either create more shards and consider better hardware >> (SSD's) >> >> -Original message- >> > From:Antonio De Miguel >> > Sent: Wednesday 5th July 2017 16:48 >> > To: solr-user@lucene.apache.org >> > Subject: Re: High disk write usage >> > >> > Thnaks a lot alessandro! >> > >> > Yes, we have very big physical dedicated machines, with a topology of 5 >> > shards and10 replicas each shard. >> > >> > >> > 1. transaction log files are increasing but not with this rate >> > >> > 2. we 've probed with values between 300 and 2000 MB... without any >> > visible results >> > >> > 3. We don't use those features >> > >> > 4. No. >> > >> > 5. I've probed with low and high mergefacors and i think that is the >> point. >> > >> > With low merge factor (over 4) we 've high write disk rate as i said >> > previously >> > >> > with merge factor of 20, writing disk rate is decreasing, but now, with >> > high qps rates (over 1000 qps) system is overloaded. >> > >> > i think that's the expected behaviour :( >> > >> > >> > >> > >> > 2017-07-05 15:49 GMT+02:00 alessandro.benedetti : >> > >> > > Point 2 was the ram Buffer size : >> > > >> > > *ramBufferSizeMB* sets the amount of RAM that may be used by Lucene >> > > indexing for buffering added documents and deletions before >> they >> > > are >> > > flushed to the Directory. >> > > maxBufferedDocs sets a limit on the number of documents >> buffered >> > > before flushing. >> > > If both ramBufferSizeMB and maxBufferedDocs is set, then >> > > Lucene will flush based on whichever limit is hit first. >> > > >> > > 100 >> > > 1000 >> > > >> > > >> > > >> > > >> > > - >> > > --- >> > > Alessandro Benedetti >> > > Search Consultant, R&D Software Engineer, Director >> > > Sease Ltd. - www.sease.io >> > > -- >> > > View this message in context: http://lucene.472066.n3. >> > > nabble.com/High-disk-write-usage-tp4344356p4344386.html >> > > Sent from the Solr - User mailing list archive at Nabble.com. >> > > >> > >>
Optimization/Merging space
Hi all, I am curious to know what happens when Solr begins a merge/optimize operation but then runs out of physical disk space. I haven't had the chance to try this out yet, but I was wondering if anyone knows how the underlying code would respond to the situation if it happened. Thanks -David
Re: Solr dynamic "on the fly fields"
Some aggregations are supported by combining stats with pivot facets? See: https://lucidworks.com/2015/01/29/you-got-stats-in-my-facets/ Don't quite think that works for your use case though. the other thing that _might_ help is all the Streaming Expression/Streaming Aggregation work. Best, Erick On Wed, Jul 5, 2017 at 6:23 AM, Pablo Anzorena wrote: > Thanks Erick for the answer. Function Queries are great, but for my use > case what I really do is making aggregations (using Json Facet for example) > with this functions. > > I have tried using Function Queries with Json Facet but it does not support > it. > > Any other idea you can imagine? > > > > > > 2017-07-03 21:57 GMT-03:00 Erick Erickson : > >> I don't know how one would do this. But I would ask what the use-case >> is. Creating such fields at index time just seems like it would be >> inviting abuse by creating a zillion fields as you have no control >> over what gets created. I'm assuming your tenants don't talk to each >> other >> >> Have you thought about using function queries to pull this data out as >> needed at _query_ time? See: >> https://cwiki.apache.org/confluence/display/solr/Function+Queries >> >> Best, >> Erick >> >> On Mon, Jul 3, 2017 at 12:06 PM, Pablo Anzorena >> wrote: >> > Thanks Erick, >> > >> > For my use case it's not possible any of those solutions. I have a >> > multitenancy scheme in the most basic level, that is I have a single >> > collection with fields (clientId, field1, field2, ..., field50) attending >> > many clients. >> > >> > Clients can create custom fields based on arithmetic operations of any >> > other field. >> > >> > So, is it possible to update let's say field49 with the follow operation: >> > log(field39) + field25 on clientId=43? >> > >> > Do field39 and field25 need to be stored to accomplish this? Is there any >> > other way to avoid storing them? >> > >> > Thanks! >> > >> > >> > 2017-07-03 15:00 GMT-03:00 Erick Erickson : >> > >> >> There are two ways: >> >> 1> define a dynamic field pattern, i.e. >> >> >> >> >> >> >> >> Now just add any field in the doc you want. If it ends in "_sum" and >> >> no other explicit field matches you have a new field. >> >> >> >> 2> Use the managed schema to add these on the fly. I don't recommend >> >> this from what I know of your use case, this is primarily intended for >> >> front-ends to be able to modify the schema and/or "field guessing". >> >> >> >> I do caution you though that either way don't go over-the-top. If >> >> you're thinking of thousands of different fields that can lead to >> >> performance issues. >> >> >> >> You can either put stuff in the field on your indexing client or >> >> create a custom update component, perhaps the simplest would be a >> >> "StatelessScriptUpdateProcessorFactory: >> >> >> >> see: https://cwiki.apache.org/confluence/display/solr/ >> >> Update+Request+Processors#UpdateRequestProcessors- >> >> UpdateRequestProcessorFactories >> >> >> >> Best, >> >> Erick >> >> >> >> On Mon, Jul 3, 2017 at 10:52 AM, Pablo Anzorena < >> anzorena.f...@gmail.com> >> >> wrote: >> >> > Hey, >> >> > >> >> > I was wondering if there is some way to add fields "on the fly" based >> on >> >> > arithmetic operations on other fields. For example add a new field >> >> > "custom_field" = log(field1) + field2 -5. >> >> > >> >> > Thanks. >> >> >>
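For reference, the stats-in-pivot-facets combination looks roughly like this from SolrJ (field names follow the clientId/field1 placeholders used earlier in the thread); whether it stretches to the arithmetic/log() requirement would need checking against the version in use:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class StatsInPivot {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);
    q.setFacet(true);
    // Tag a stats component and hang it off a pivot facet: one stats block per clientId bucket.
    q.set("stats", "true");
    q.set("stats.field", "{!tag=piv sum=true mean=true}field1");
    q.set("facet.pivot", "{!stats=piv}clientId");

    QueryResponse rsp = client.query(q);
    System.out.println(rsp.getFacetPivot());   // each pivot bucket carries the attached stats
    client.close();
  }
}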
Re: Optimization/Merging space
Bad Things Can Happen. Solr (well, Lucene in this case) tries very hard to keep disk full operations from having repercussions., but it's kind of like OOMs. What happens next? It's not so much the merge/optimize, but what happens in the future when the _next_ segment is written... The merge or optimize goes something like this: 1> copy and merge all the segments you intend to 2> when all that is successful, update the segments file 3> delete the old segments. So theoretically if your disk fills up during <1> or <2> your old index is intact and usable. It isn't until the segments file has been successfully changed that the new snapshot of the index is active. Which, in a nutshell, is why you need to have at least as much free space on your disk as your index occupies since you can't control when a merge happens which may copy _all_ of your segments to new ones. Let's say that during <1> your disk fills up _and_ you're indexing new documents at the same time. Solr/Lucene can't guarantee that the new documents got written to disk in that case. So while your current snapshot is probably OK, your index may not be in the state you want. Meanwhile if you're using transaction logs Solr is trying to write tlogs to disk. It's unknown what happened to them (another good argument for putting them on a separate disk!). Best, Erick On Wed, Jul 5, 2017 at 9:07 AM, David Hastings wrote: > Hi all, I am curious to know what happens when solr begins a merge/optimize > operation, but then runs out of physical disk space. I havent had the > chance to try this out yet but I was wondering if anyone knows what the > underlying codes response to the situation would be if it happened. Thanks > -David
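As a rough way to act on the "at least as much free space as the index occupies" rule of thumb, a pre-flight check before triggering an optimize could look like this in plain Java (the index path is an assumption; point it at the core's data/index directory on the volume Solr writes to):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class OptimizeSpaceCheck {
  public static void main(String[] args) throws IOException {
    Path indexDir = Paths.get("/var/solr/data/collection1/data/index");  // placeholder path

    // Sum the size of all current segment files.
    long indexBytes = Files.walk(indexDir)
        .filter(Files::isRegularFile)
        .mapToLong(p -> p.toFile().length())
        .sum();

    // Free space on the file store backing the index directory.
    long freeBytes = Files.getFileStore(indexDir).getUsableSpace();

    System.out.printf("index=%d bytes, free=%d bytes%n", indexBytes, freeBytes);
    if (freeBytes < indexBytes) {
      System.out.println("Not enough headroom for a full merge/optimize.");
    }
  }
}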
Re: help on implicit routing
Use the _route_ field and put in "day_1" or "day_2". You've presumably named the shards (the "shard" parameter) when you added them with the CREATESHARD command so use the value you specified there. Best, Erick On Wed, Jul 5, 2017 at 6:15 PM, wrote: > I am trying out the document routing feature in Solr 6.4.1. I am unable to > comprehend the documentation where it states that > “The 'implicit' router does not > automatically route documents to different > shards. Whichever shard you indicate on the > indexing request (or within each document) will > be used as the destination for those documents” > > How do you specify the shard inside a document? E.g If I have basic > collection with two shards called day_1 and day_2. What value should be > populated in the router field that will ensure the document routing to the > respective shard? > > Regards, > Imran > > Sent from Mail for Windows 10 >
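A minimal SolrJ sketch of that, assuming the collection was created with router.name=implicit and shards named day_1 and day_2 (if a router.field was defined at collection creation, the shard name goes into that field instead of _route_):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ImplicitRoutingExample {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient.Builder().withZkHost("localhost:9983").build();
    client.setDefaultCollection("mycollection");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("_route_", "day_1");   // with the implicit router this names the destination shard
    client.add(doc);
    client.commit();
    client.close();
  }
}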
Re: High disk write usage
Hi Erik! thanks for your response! Our soft commit is 5 seconds. Why generates I/0 a softcommit? first notice. We have enough physical RAM to store full collection and 16Gb for each JVM. The collection is relatively small. I've tried (for testing purposes) disabling transactionlog (commenting )... but cluster does not go up. I'll try writing into separated drive, nice idea... 2017-07-05 18:04 GMT+02:00 Erick Erickson : > What is your soft commit interval? That'll cause I/O as well. > > How much physical RAM and how much is dedicated to _all_ the JVMs on a > machine? One cause here is that Lucene uses MMapDirectory which can be > starved for OS memory if you use too much JVM, my rule of thumb is > that _at least_ half of the physical memory should be reserved for the > OS. > > Your transaction logs should fluctuate but even out. By that I mean > they should increase in size but every hard commit should truncate > some of them so I wouldn't expect them to grow indefinitely. > > One strategy is to put your tlogs on a separate drive exactly to > reduce contention. You could disable them too at a cost of risking > your data. That might be a quick experiment you could run though, > disable tlogs and see what that changes. Of course I'd do this on my > test system ;). > > But yeah, Solr will use a lot of I/O in the scenario you are outlining > I'm afraid. > > Best, > Erick > > On Wed, Jul 5, 2017 at 8:08 AM, Antonio De Miguel > wrote: > > thanks Markus! > > > > We already have SSD. > > > > About changing topology we probed yesterday with 10 shards, but > system > > goes more inconsistent than with the current topology (5x10). I dont know > > why... too many traffic perhaps? > > > > About merge factor.. we set default configuration for some days... but > when > > a merge occurs system overload. We probed with mergefactor of 4 to > improbe > > query times and trying to have smaller merges. > > > > 2017-07-05 16:51 GMT+02:00 Markus Jelsma : > > > >> Try mergeFactor of 10 (default) which should be fine in most cases. If > you > >> got an extreme case, either create more shards and consider better > hardware > >> (SSD's) > >> > >> -Original message- > >> > From:Antonio De Miguel > >> > Sent: Wednesday 5th July 2017 16:48 > >> > To: solr-user@lucene.apache.org > >> > Subject: Re: High disk write usage > >> > > >> > Thnaks a lot alessandro! > >> > > >> > Yes, we have very big physical dedicated machines, with a topology of > 5 > >> > shards and10 replicas each shard. > >> > > >> > > >> > 1. transaction log files are increasing but not with this rate > >> > > >> > 2. we 've probed with values between 300 and 2000 MB... without any > >> > visible results > >> > > >> > 3. We don't use those features > >> > > >> > 4. No. > >> > > >> > 5. I've probed with low and high mergefacors and i think that is the > >> point. > >> > > >> > With low merge factor (over 4) we 've high write disk rate as i said > >> > previously > >> > > >> > with merge factor of 20, writing disk rate is decreasing, but now, > with > >> > high qps rates (over 1000 qps) system is overloaded. > >> > > >> > i think that's the expected behaviour :( > >> > > >> > > >> > > >> > > >> > 2017-07-05 15:49 GMT+02:00 alessandro.benedetti >: > >> > > >> > > Point 2 was the ram Buffer size : > >> > > > >> > > *ramBufferSizeMB* sets the amount of RAM that may be used by Lucene > >> > > indexing for buffering added documents and deletions before > >> they > >> > > are > >> > > flushed to the Directory. 
> >> > > maxBufferedDocs sets a limit on the number of documents > >> buffered > >> > > before flushing. > >> > > If both ramBufferSizeMB and maxBufferedDocs is set, then > >> > > Lucene will flush based on whichever limit is hit first. > >> > > > >> > > 100 > >> > > 1000 > >> > > > >> > > > >> > > > >> > > > >> > > - > >> > > --- > >> > > Alessandro Benedetti > >> > > Search Consultant, R&D Software Engineer, Director > >> > > Sease Ltd. - www.sease.io > >> > > -- > >> > > View this message in context: http://lucene.472066.n3. > >> > > nabble.com/High-disk-write-usage-tp4344356p4344386.html > >> > > Sent from the Solr - User mailing list archive at Nabble.com. > >> > > > >> > > >> >
Re: High disk write usage
bq: We have enough physical RAM to store full collection and 16Gb for each JVM. That's not quite what I was asking for. Lucene uses MMapDirectory to map part of the index into the OS memory space. If you've over-allocated the JVM space relative to your physical memory that space can start swapping. Frankly I'd expect your query performance to die if that was happening so this is a sanity check. How much physical memory does the machine have and how much memory is allocated to _all_ of the JVMs running on that machine? see: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Best, Erick On Wed, Jul 5, 2017 at 9:41 AM, Antonio De Miguel wrote: > Hi Erik! thanks for your response! > > Our soft commit is 5 seconds. Why generates I/0 a softcommit? first notice. > > > We have enough physical RAM to store full collection and 16Gb for each > JVM. The collection is relatively small. > > I've tried (for testing purposes) disabling transactionlog (commenting > )... but cluster does not go up. I'll try writing into separated > drive, nice idea... > > > > > > > > > 2017-07-05 18:04 GMT+02:00 Erick Erickson : > >> What is your soft commit interval? That'll cause I/O as well. >> >> How much physical RAM and how much is dedicated to _all_ the JVMs on a >> machine? One cause here is that Lucene uses MMapDirectory which can be >> starved for OS memory if you use too much JVM, my rule of thumb is >> that _at least_ half of the physical memory should be reserved for the >> OS. >> >> Your transaction logs should fluctuate but even out. By that I mean >> they should increase in size but every hard commit should truncate >> some of them so I wouldn't expect them to grow indefinitely. >> >> One strategy is to put your tlogs on a separate drive exactly to >> reduce contention. You could disable them too at a cost of risking >> your data. That might be a quick experiment you could run though, >> disable tlogs and see what that changes. Of course I'd do this on my >> test system ;). >> >> But yeah, Solr will use a lot of I/O in the scenario you are outlining >> I'm afraid. >> >> Best, >> Erick >> >> On Wed, Jul 5, 2017 at 8:08 AM, Antonio De Miguel >> wrote: >> > thanks Markus! >> > >> > We already have SSD. >> > >> > About changing topology we probed yesterday with 10 shards, but >> system >> > goes more inconsistent than with the current topology (5x10). I dont know >> > why... too many traffic perhaps? >> > >> > About merge factor.. we set default configuration for some days... but >> when >> > a merge occurs system overload. We probed with mergefactor of 4 to >> improbe >> > query times and trying to have smaller merges. >> > >> > 2017-07-05 16:51 GMT+02:00 Markus Jelsma : >> > >> >> Try mergeFactor of 10 (default) which should be fine in most cases. If >> you >> >> got an extreme case, either create more shards and consider better >> hardware >> >> (SSD's) >> >> >> >> -Original message- >> >> > From:Antonio De Miguel >> >> > Sent: Wednesday 5th July 2017 16:48 >> >> > To: solr-user@lucene.apache.org >> >> > Subject: Re: High disk write usage >> >> > >> >> > Thnaks a lot alessandro! >> >> > >> >> > Yes, we have very big physical dedicated machines, with a topology of >> 5 >> >> > shards and10 replicas each shard. >> >> > >> >> > >> >> > 1. transaction log files are increasing but not with this rate >> >> > >> >> > 2. we 've probed with values between 300 and 2000 MB... without any >> >> > visible results >> >> > >> >> > 3. We don't use those features >> >> > >> >> > 4. No. >> >> > >> >> > 5. 
I've probed with low and high mergefacors and i think that is the >> >> point. >> >> > >> >> > With low merge factor (over 4) we 've high write disk rate as i said >> >> > previously >> >> > >> >> > with merge factor of 20, writing disk rate is decreasing, but now, >> with >> >> > high qps rates (over 1000 qps) system is overloaded. >> >> > >> >> > i think that's the expected behaviour :( >> >> > >> >> > >> >> > >> >> > >> >> > 2017-07-05 15:49 GMT+02:00 alessandro.benedetti > >: >> >> > >> >> > > Point 2 was the ram Buffer size : >> >> > > >> >> > > *ramBufferSizeMB* sets the amount of RAM that may be used by Lucene >> >> > > indexing for buffering added documents and deletions before >> >> they >> >> > > are >> >> > > flushed to the Directory. >> >> > > maxBufferedDocs sets a limit on the number of documents >> >> buffered >> >> > > before flushing. >> >> > > If both ramBufferSizeMB and maxBufferedDocs is set, then >> >> > > Lucene will flush based on whichever limit is hit first. >> >> > > >> >> > > 100 >> >> > > 1000 >> >> > > >> >> > > >> >> > > >> >> > > >> >> > > - >> >> > > --- >> >> > > Alessandro Benedetti >> >> > > Search Consultant, R&D Software Engineer, Director >> >> > > Sease Ltd. - www.sease.io >> >> > > -- >> >> > > View this message in context: http://lucene.472066.n3. >> >> > > nabble.com/High-disk-write-usa
Best way to split text
We are working on a search application for large PDFs (~10 - 100 MB), which have been correctly indexed. However, we want to do some training in the pipeline, so we are implementing some Spark MLlib algorithms. But now, some requirements are to split documents into either paragraphs or pages. Some alternatives we found are to split via Tika/PDFBox or to make a custom processor to catch words. In terms of performance, which option is preferred? A custom Tika class that extracts just paragraphs, or filtering the paragraphs that match our vocabulary from the whole document? Thanks for your advice. -- View this message in context: http://lucene.472066.n3.nabble.com/Best-way-to-split-text-tp4344498.html Sent from the Solr - User mailing list archive at Nabble.com.
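If splitting by page turns out to be enough, PDFBox (which Tika uses underneath for PDFs) can do it directly before the text is handed to Solr or to the Spark pipeline. A rough sketch, with the file path as a placeholder:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfPageSplitter {
  public static void main(String[] args) throws Exception {
    List<String> pages = new ArrayList<>();
    try (PDDocument pdf = PDDocument.load(new File("/path/to/large.pdf"))) {
      PDFTextStripper stripper = new PDFTextStripper();
      for (int page = 1; page <= pdf.getNumberOfPages(); page++) {
        stripper.setStartPage(page);   // restrict extraction to a single page
        stripper.setEndPage(page);
        pages.add(stripper.getText(pdf));
      }
    }
    System.out.println("Extracted " + pages.size() + " page-sized chunks");
    // Each chunk could then become its own Solr document, e.g. with parentId and pageNumber fields.
  }
}

Paragraph-level splitting would need an extra pass over each page's text (for example splitting on blank lines), which is also where the vocabulary filtering could be applied.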
solr alias not working on streaming query search
Has anyone faced a similar issue? I have a collection named “solr_test”. I created an alias to it as “solr_alias”. This alias works well when I do a simple search: http://localhost:8983/solr/solr_alias/select?indent=on&q=*:*&wt=json But this will not work when used in a streaming expression: http://localhost:8983/solr/solr_alias/stream?expr=search(solr_alias, q=*:*, fl="p_PrimaryKey, p_name", qt="/select", sort="p_name asc") This gives me an error: "EXCEPTION": "java.lang.Exception: Collection not found:solr_alias" The same streaming query works when I use the actual collection name: “solr_test”. Is this a limitation of aliases in Solr? Or am I doing something wrong? Thanks, Lewin
Re: Unique() metrics not supported in Solr Streaming facet stream source
There are a number of functions that are currently being held up because of conflicting duplicate function names. We haven't come to an agreement yet on the best way forward for this yet. I think we should open a separate ticket to discuss how best to handle this issue. Joel Bernstein http://joelsolr.blogspot.com/ On Wed, Jul 5, 2017 at 10:04 AM, Susheel Kumar wrote: > Does "uniq" expression sounds good to use for UniqueMetric class? > > Thanks, > Susheel > > On Tue, Jul 4, 2017 at 5:45 PM, Susheel Kumar > wrote: > > > Hello Joel, > > > > I tried to create a patch to add UniqueMetric and it works, but soon > > realized, we have UniqueStream as well and can't load both of them (like > > below) when required, since both uses "unique" keyword. > > > > Any advice how we can handle this. Come up with different keyword for > > UniqueMetric or rename UniqueStream etc..? > > > >StreamFactory factory = new StreamFactory() > > .withCollectionZkHost (...) > >.withFunctionName("facet", FacetStream.class) > > .withFunctionName("sum", SumMetric.class) > > .withFunctionName("unique", UniqueStream.class) > > .withFunctionName("unique", UniqueMetric.class) > > > > On Thu, Jun 29, 2017 at 9:32 AM, Joel Bernstein > > wrote: > > > >> This is mainly due to focus on other things. It would great to support > all > >> the aggregate functions in facet, rollup and timeseries expressions. > >> > >> Joel Bernstein > >> http://joelsolr.blogspot.com/ > >> > >> On Thu, Jun 29, 2017 at 8:23 AM, Zheng Lin Edwin Yeo < > >> edwinye...@gmail.com> > >> wrote: > >> > >> > Hi, > >> > > >> > We are working on the Solr Streaming expression, using the facet > stream > >> > source. > >> > > >> > As the underlying structure is using JSON Facet, would like to find > out > >> why > >> > the unique() metrics is not supported? Currently, it only supports > >> sum(col) > >> > , avg(col), min(col), max(col), count(*) > >> > > >> > I'm using Solr 6.5.1 > >> > > >> > Regards, > >> > Edwin > >> > > >> > > > > >
Re: solr alias not working on streaming query search
This should be fixed in Solr 6.4: https://issues.apache.org/jira/browse/SOLR-9077 Joel Bernstein http://joelsolr.blogspot.com/ On Wed, Jul 5, 2017 at 2:40 PM, Lewin Joy (TMS) wrote: > ** PROTECTED 関係者外秘 > > Have anyone faced a similar issue? > > I have a collection named “solr_test”. I created an alias to it as > “solr_alias”. > This alias works well when I do a simple search: > http://localhost:8983/solr/solr_alias/select?indent=on&q=*:*&wt=json > > But, this will not work when used in a streaming expression: > > http://localhost:8983/solr/solr_alias/stream?expr=search(solr_alias, > q=*:*, fl="p_PrimaryKey, p_name", qt="/select", sort="p_name asc") > > This gives me an error: > "EXCEPTION": "java.lang.Exception: Collection not found:solr_alias" > > The same streaming query works when I use the actual collection name: > “solr_test” > > > Is this a limitation for aliases in solr? Or am I doing something wrong? > > Thanks, > Lewin >
Re: Allow Join over two sharded collection
How are you planning to do the manual routing? What key(s) are you thinking of using? Second, the link I shared was about collection aliasing, and if you use that, you will end up with multiple collections. I just want to clarify, as you said above "...manual routing and creating alias". Again, until the join feature is available across shards, you can still continue with one shard (and replicas if needed). 20M + 1M per month shouldn't be a big deal. Thanks, Susheel On Mon, Jul 3, 2017 at 11:16 PM, mganeshs wrote: > Hi Susheel, > > To make use of Joins only option is I should go for manual routing. If I go > for manual routing based on time, we miss the power of distributing the load > while indexing. It will end up with all indexing happens in newly created > shard, which we feel this will not be efficient approach and degrades the > performance of indexing as we have lot of jvms running, but still all > indexing going to one single shard for indexing and we are also expecting > 1M+ docs per month in coming days. > > For your question on whether we will query old aged document... ? Mostly we > won't query old aged documents. With querying pattern, it's clear we should > go for manual routing and creating alias. But when it comes to indexing, in > order to distribute the load of indexing, we felt default routing is the > best option, but Join will not work. And that's the reason for asking when > this feature will be in place ? > > Regards, > > > > -- > View this message in context: http://lucene.472066.n3. > nabble.com/Allow-Join-over-two-sharded-collection-tp4343443p4344098.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: Unique() metrics not supported in Solr Streaming facet stream source
Hello Joel, Opened the ticket https://issues.apache.org/jira/browse/SOLR-11017 Thanks, Susheel On Wed, Jul 5, 2017 at 2:46 PM, Joel Bernstein wrote: > There are a number of functions that are currently being held up because of > conflicting duplicate function names. We haven't come to an agreement yet > on the best way forward for this yet. I think we should open a separate > ticket to discuss how best to handle this issue. > > > Joel Bernstein > http://joelsolr.blogspot.com/ > > On Wed, Jul 5, 2017 at 10:04 AM, Susheel Kumar > wrote: > > > Does "uniq" expression sounds good to use for UniqueMetric class? > > > > Thanks, > > Susheel > > > > On Tue, Jul 4, 2017 at 5:45 PM, Susheel Kumar > > wrote: > > > > > Hello Joel, > > > > > > I tried to create a patch to add UniqueMetric and it works, but soon > > > realized, we have UniqueStream as well and can't load both of them > (like > > > below) when required, since both uses "unique" keyword. > > > > > > Any advice how we can handle this. Come up with different keyword for > > > UniqueMetric or rename UniqueStream etc..? > > > > > >StreamFactory factory = new StreamFactory() > > > .withCollectionZkHost (...) > > >.withFunctionName("facet", FacetStream.class) > > > .withFunctionName("sum", SumMetric.class) > > > .withFunctionName("unique", UniqueStream.class) > > > .withFunctionName("unique", UniqueMetric.class) > > > > > > On Thu, Jun 29, 2017 at 9:32 AM, Joel Bernstein > > > wrote: > > > > > >> This is mainly due to focus on other things. It would great to support > > all > > >> the aggregate functions in facet, rollup and timeseries expressions. > > >> > > >> Joel Bernstein > > >> http://joelsolr.blogspot.com/ > > >> > > >> On Thu, Jun 29, 2017 at 8:23 AM, Zheng Lin Edwin Yeo < > > >> edwinye...@gmail.com> > > >> wrote: > > >> > > >> > Hi, > > >> > > > >> > We are working on the Solr Streaming expression, using the facet > > stream > > >> > source. > > >> > > > >> > As the underlying structure is using JSON Facet, would like to find > > out > > >> why > > >> > the unique() metrics is not supported? Currently, it only supports > > >> sum(col) > > >> > , avg(col), min(col), max(col), count(*) > > >> > > > >> > I'm using Solr 6.5.1 > > >> > > > >> > Regards, > > >> > Edwin > > >> > > > >> > > > > > > > > >
Re: High disk write usage
Hi Erick. What I wanted to say is that we have enough memory for the shards and, on top of that, for the JVM heaps. The machine has 400 GB of RAM; I think we have enough. We have 10 JVMs running on the machine, each one using 16 GB. Shard size is about 8 GB. When we have query or indexing peaks our problems are CPU usage and disk I/O, but we have a lot of unused memory. On 5/7/2017 19:04, "Erick Erickson" wrote: > bq: We have enough physical RAM to store full collection and 16Gb for each > JVM. > > That's not quite what I was asking for. Lucene uses MMapDirectory to > map part of the index into the OS memory space. If you've > over-allocated the JVM space relative to your physical memory that > space can start swapping. Frankly I'd expect your query performance to > die if that was happening so this is a sanity check. > > How much physical memory does the machine have and how much memory is > allocated to _all_ of the JVMs running on that machine? > > see: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory- > on-64bit.html > > Best, > Erick > > > On Wed, Jul 5, 2017 at 9:41 AM, Antonio De Miguel > wrote: > > Hi Erick! Thanks for your response! > > > > Our soft commit is 5 seconds. Why does a soft commit generate I/O? First time I've heard that. > > > > > > We have enough physical RAM to store the full collection and 16Gb for each > > JVM. The collection is relatively small. > > > > I've tried (for testing purposes) disabling the transaction log (commenting > > )... but the cluster does not come up. I'll try writing into a > separate > > drive, nice idea... > > > > > > > > > > > > > > > > > > 2017-07-05 18:04 GMT+02:00 Erick Erickson : > > > >> What is your soft commit interval? That'll cause I/O as well. > >> > >> How much physical RAM and how much is dedicated to _all_ the JVMs on a > >> machine? One cause here is that Lucene uses MMapDirectory which can be > >> starved for OS memory if you use too much JVM, my rule of thumb is > >> that _at least_ half of the physical memory should be reserved for the > >> OS. > >> > >> Your transaction logs should fluctuate but even out. By that I mean > >> they should increase in size but every hard commit should truncate > >> some of them so I wouldn't expect them to grow indefinitely. > >> > >> One strategy is to put your tlogs on a separate drive exactly to > >> reduce contention. You could disable them too at a cost of risking > >> your data. That might be a quick experiment you could run though, > >> disable tlogs and see what that changes. Of course I'd do this on my > >> test system ;). > >> > >> But yeah, Solr will use a lot of I/O in the scenario you are outlining > >> I'm afraid. > >> > >> Best, > >> Erick > >> > >> On Wed, Jul 5, 2017 at 8:08 AM, Antonio De Miguel > >> wrote: > >> > Thanks Markus! > >> > > >> > We already have SSDs. > >> > > >> > About changing the topology: we tried yesterday with 10 shards, but the > >> system > >> > was more inconsistent than with the current topology (5x10). I don't know > >> > why... too much traffic perhaps? > >> > > >> > About the merge factor: we kept the default configuration for some days... but > >> when > >> > a merge occurs the system overloads. We tried a mergeFactor of 4 to > >> improve > >> > query times and to have smaller merges. > >> > > >> > 2017-07-05 16:51 GMT+02:00 Markus Jelsma >: > >> > > >> >> Try a mergeFactor of 10 (default), which should be fine in most cases.
> If > >> you > >> >> got an extreme case, either create more shards and consider better > >> hardware > >> >> (SSDs) > >> >> > >> >> -Original message- > >> >> > From:Antonio De Miguel > >> >> > Sent: Wednesday 5th July 2017 16:48 > >> >> > To: solr-user@lucene.apache.org > >> >> > Subject: Re: High disk write usage > >> >> > > >> >> > Thanks a lot, Alessandro! > >> >> > > >> >> > Yes, we have very big physical dedicated machines, with a topology > of > >> 5 > >> >> > shards and 10 replicas per shard. > >> >> > > >> >> > > >> >> > 1. transaction log files are increasing, but not at this rate > >> >> > > >> >> > 2. we've tried values between 300 and 2000 MB... without > any > >> >> > visible results > >> >> > > >> >> > 3. We don't use those features > >> >> > > >> >> > 4. No. > >> >> > > >> >> > 5. I've tried low and high merge factors and I think that is > the > >> >> point. > >> >> > > >> >> > With a low merge factor (around 4) we have a high disk write rate, as I > said > >> >> > previously > >> >> > > >> >> > with a merge factor of 20, the disk write rate decreases, but now, > >> with > >> >> > high qps rates (over 1000 qps) the system is overloaded. > >> >> > > >> >> > I think that's the expected behaviour :( > >> >> > > >> >> > > >> >> > > >> >> > > >> >> > 2017-07-05 15:49 GMT+02:00 alessandro.benedetti < > a.benede...@sease.io > >> >: > >> >> > > >> >> > > Point 2 was the RAM buffer size: > >> >> > > > >> >> > > *ramBufferSizeMB* sets the amount of RAM that may be used by > Lucene > >> >> > > indexing for buffering added docume
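For reference, the commit cadence described in this thread (15-minute hard commits, 5-second soft commits) corresponds roughly to the solrconfig.xml sketch below. The values are simply the ones mentioned above, not a recommendation, and the tlog directory override is only an example of the "separate drive for transaction logs" idea:

    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <!-- example only: point tlogs at a separate drive to reduce write contention -->
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
      <autoCommit>
        <maxTime>900000</maxTime>          <!-- 15 minutes: flushes segments, lets old tlogs be truncated -->
        <openSearcher>false</openSearcher> <!-- hard commits do not open a new searcher -->
      </autoCommit>
      <autoSoftCommit>
        <maxTime>5000</maxTime>            <!-- 5 seconds: opens a searcher for NRT visibility, still causes some I/O -->
      </autoSoftCommit>
    </updateHandler>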
Placing different collections on different hard disk/folder
Hi, I would like to check: how can we place the indexed files of different collections on different hard disks/folders, while they are on the same node? For example, I want collection1 to be placed on the C: drive, collection2 on the D: drive, and collection3 on the E: drive. I am using Solr 6.5.1. Regards, Edwin
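One way to do this (a sketch only, with made-up paths and core names, and assuming a standalone or single-core-per-collection setup) is to give each core its own dataDir, either in its core.properties file or at creation time; in SolrCloud the same dataDir property has to be supplied when the replica/core is created rather than edited afterwards:

    # core.properties of the core backing collection2 (restart or reload the core after editing)
    name=collection2_shard1_replica1
    dataDir=D:/solr-data/collection2

    # or set it when the core is created via the CoreAdmin API:
    http://localhost:8983/solr/admin/cores?action=CREATE&name=collection3_core&configSet=myconfig&dataDir=E:/solr-data/collection3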
Re: Unique() metrics not supported in Solr Streaming facet stream source
Thanks for your help, Joel and Susheel. Regards, Edwin On 6 July 2017 at 05:49, Susheel Kumar wrote: > Hello Joel, > > Opened the ticket > > https://issues.apache.org/jira/browse/SOLR-11017 > > Thanks, > Susheel > > On Wed, Jul 5, 2017 at 2:46 PM, Joel Bernstein wrote: > > > There are a number of functions that are currently being held up because > of > > conflicting duplicate function names. We haven't come to an agreement yet > > on the best way forward for this yet. I think we should open a separate > > ticket to discuss how best to handle this issue. > > > > > > Joel Bernstein > > http://joelsolr.blogspot.com/ > > > > On Wed, Jul 5, 2017 at 10:04 AM, Susheel Kumar > > wrote: > > > > > Does "uniq" expression sounds good to use for UniqueMetric class? > > > > > > Thanks, > > > Susheel > > > > > > On Tue, Jul 4, 2017 at 5:45 PM, Susheel Kumar > > > wrote: > > > > > > > Hello Joel, > > > > > > > > I tried to create a patch to add UniqueMetric and it works, but soon > > > > realized, we have UniqueStream as well and can't load both of them > > (like > > > > below) when required, since both uses "unique" keyword. > > > > > > > > Any advice how we can handle this. Come up with different keyword > for > > > > UniqueMetric or rename UniqueStream etc..? > > > > > > > >StreamFactory factory = new StreamFactory() > > > > .withCollectionZkHost (...) > > > >.withFunctionName("facet", FacetStream.class) > > > > .withFunctionName("sum", SumMetric.class) > > > > .withFunctionName("unique", UniqueStream.class) > > > > .withFunctionName("unique", UniqueMetric.class) > > > > > > > > On Thu, Jun 29, 2017 at 9:32 AM, Joel Bernstein > > > > wrote: > > > > > > > >> This is mainly due to focus on other things. It would great to > support > > > all > > > >> the aggregate functions in facet, rollup and timeseries expressions. > > > >> > > > >> Joel Bernstein > > > >> http://joelsolr.blogspot.com/ > > > >> > > > >> On Thu, Jun 29, 2017 at 8:23 AM, Zheng Lin Edwin Yeo < > > > >> edwinye...@gmail.com> > > > >> wrote: > > > >> > > > >> > Hi, > > > >> > > > > >> > We are working on the Solr Streaming expression, using the facet > > > stream > > > >> > source. > > > >> > > > > >> > As the underlying structure is using JSON Facet, would like to > find > > > out > > > >> why > > > >> > the unique() metrics is not supported? Currently, it only supports > > > >> sum(col) > > > >> > , avg(col), min(col), max(col), count(*) > > > >> > > > > >> > I'm using Solr 6.5.1 > > > >> > > > > >> > Regards, > > > >> > Edwin > > > >> > > > > >> > > > > > > > > > > > > > >
Joins in Parallel SQL?
Is it possible to join documents from different collections through Parallel SQL? In addition to the LIMIT feature in Parallel SQL, can we also use OFFSET to implement paging? Thanks, Imran Sent from Mail for Windows 10
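For context, a minimal Parallel SQL request in 6.x looks like the sketch below (the collection and field names are made up). LIMIT is supported as shown; as far as I know, the SQL interface at this point has no cross-collection JOIN and no OFFSET clause, so this only shows where such clauses would go:

    curl --data-urlencode 'stmt=SELECT id, price FROM products ORDER BY price DESC LIMIT 50' \
         'http://localhost:8983/solr/products/sql?aggregationMode=facet'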