both way synonyms with ManagedSynonymFilterFactory
Hi, one-way managed synonyms seem to work fine, but I cannot make both-way synonyms work. Steps to reproduce with Solr 5.4.1:

1. Create a core:
$ bin/solr create_core -c test -d server/solr/configsets/basic_configs

2. Edit schema.xml so fieldType text_general looks like this:

3. Reload the core:
$ curl -X GET "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"

4. Add synonyms, one one-way synonym, one two-way, then reload the core again:
$ curl -X PUT -H 'Content-type:application/json' --data-binary '{"mad":["angry","upset"]}' "http://localhost:8983/solr/test/schema/analysis/synonyms/english"
$ curl -X PUT -H 'Content-type:application/json' --data-binary '["mb","megabytes"]' "http://localhost:8983/solr/test/schema/analysis/synonyms/english"
$ curl -X GET "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"

5. List the synonyms:
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "synonymMappings":{
    "initArgs":{"ignoreCase":false},
    "initializedOn":"2016-02-11T09:00:50.354Z",
    "managedMap":{
      "mad":["angry", "upset"],
      "mb":["megabytes"],
      "megabytes":["mb"]}}}

6. Add two documents:
$ bin/post -c test -type 'application/json' -d '[{"id" : "1", "title_t" : "10 megabytes makes me mad" },{"id" : "2", "title_t" : "100 mb should be sufficient" }]'
$ bin/post -c test -type 'application/json' -d '[{"id" : "2", "title_t" : "100 mb should be sufficient" }]'

7. Search for the documents:
- all of these return the first document, so one-way synonyms work:
$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:angry&indent=true"
$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:upset&indent=true"
$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:mad&indent=true"
- this only returns the document with "mb":
$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:mb&indent=true"
- this only returns the document with "megabytes":
$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:megabytes&indent=true"

Any input on how to make this work would be appreciated.

Thanks,
Bjørn
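Whatever the text_general analyzer in step 2 looks like, the field analysis handler is a quick way to see how the managed map is applied at index time and at query time for the two-way entry. A minimal sketch, assuming Solr on localhost:8983 with the core "test" and field type "text_general" from the steps above:

# Show the index-time token stream for a document value and the query-time
# token stream for "megabytes" side by side.
curl -G "http://localhost:8983/solr/test/analysis/field" \
  --data-urlencode "analysis.fieldtype=text_general" \
  --data-urlencode "analysis.fieldvalue=100 mb should be sufficient" \
  --data-urlencode "analysis.query=megabytes" \
  --data-urlencode "wt=json" --data-urlencode "indent=true"

Comparing the two outputs shows whether the two-way mapping is expanded at index time, at query time, or not at all, which narrows down where it is being lost.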
Re: Running Solr on port 80
The script essentially automates what you would do manually, for the first time when starting up the system. It is no different from extracting the archive, setting permissions etc. yourself. So the next time you wanted to stop/ restart solr, you'll have to do it using the solr script. That being said, I see that you've included a -f option along with your command. Is that a typo? The script file doesn't have a -f option. On Thu, 11 Feb 2016, 13:09 Jeyaprakash Singarayar wrote: > That ok if I'm using it in local, but I'm doing it in a production based > on the below page > > https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production > > > > On Thu, Feb 11, 2016 at 12:58 PM, Binoy Dalal > wrote: > >> Why don't you directly run solr from the script provided in >> {SOLR_DIST}\bin >> ./solr start -p 8984 >> >> On Thu, 11 Feb 2016, 12:56 Jeyaprakash Singarayar > > >> wrote: >> >> > Hi, >> > >> > I'm trying to install solr 5.4.1 on CentOS. I know that while installing >> > Solr as a service in the Linux we can pass -p to shift the >> > app to host on that port. >> > >> > ./install_solr_service.sh solr-5.4.1.tgz -p 8984 -f >> > >> > but still it shows as it is hosted on 8983 and not on 8984. Any idea? >> > >> > Waiting up to 30 seconds to see Solr running on port 8983 [/] >> > Started Solr server on port 8983 (pid=33034). Happy searching! >> > >> > Found 1 Solr nodes: >> > >> > Solr process 33034 running on port 8983 >> > { >> > "solr_home":"/var/solr/data", >> > "version":"5.4.1 1725212 - jpountz - 2016-01-18 11:51:45", >> > "startTime":"2016-02-11T07:25:03.996Z", >> > "uptime":"0 days, 0 hours, 0 minutes, 11 seconds", >> > "memory":"68 MB (%13.9) of 490.7 MB"} >> > >> > Service solr installed. >> > >> -- >> Regards, >> Binoy Dalal >> > > -- Regards, Binoy Dalal
optimize requests that fetch 1000 rows
Hi, I'm trying to optimize a Solr application. The bottleneck is the queries that request 1000 rows from Solr. Unfortunately the application can't be modified at the moment; can you suggest what could be done on the Solr side to increase the performance? The bottleneck is just fetching the results, the query itself executes very fast. I suggested caching the .fdx and .fdt files in the file system cache. Anything else? Thanks
Re: optimize requests that fetch 1000 rows
If you're fetching large text fields, consider highlighting on them and just returning the snippets. I faced such a problem some time ago and highlighting sped things up nearly 10x for us. On Thu, 11 Feb 2016, 15:03 Matteo Grolla wrote: > Hi, > I'm trying to optimize a solr application. > The bottleneck are queries that request 1000 rows to solr. > Unfortunately the application can't be modified at the moment, can you > suggest me what could be done on the solr side to increase the performance? > The bottleneck is just on fetching the results, the query executes very > fast. > I suggested caching .fdx and .fdt files on the file system cache. > Anything else? > > Thanks > -- Regards, Binoy Dalal
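A rough sketch of the approach Binoy describes, assuming the large text sits in a field called body_t on a core named test: only short highlighted fragments come back instead of the full stored value.

# Return ids plus highlighted snippets rather than the complete text field.
curl "http://localhost:8983/solr/test/select" \
  --data-urlencode "q=body_t:megabytes" \
  --data-urlencode "rows=1000" \
  --data-urlencode "fl=id,score" \
  --data-urlencode "hl=true" \
  --data-urlencode "hl.fl=body_t" \
  --data-urlencode "hl.snippets=1" \
  --data-urlencode "hl.fragsize=100"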
Are fieldCache and/or DocValues used by Function Queries
Hi, I need to evaluate the performance of different boost solutions and I can't find any relevant documentation about it. Are the fieldCache and/or DocValues used by function queries?
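For context, a sketch of the kind of boost usually being measured here, written as an edismax boost function over a hypothetical numeric field named popularity; whether the values are read through the fieldCache or from docValues depends on how that field is declared in the schema.

# Boost by a function of a numeric field (core, field, and query are placeholders).
curl "http://localhost:8983/solr/test/select" \
  --data-urlencode "q=laptop" \
  --data-urlencode "defType=edismax" \
  --data-urlencode "qf=title_t" \
  --data-urlencode "bf=log(sum(popularity,1))"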
Re: optimize requests that fetch 1000 rows
On Thu, Feb 11, 2016, at 09:33 AM, Matteo Grolla wrote: > Hi, > I'm trying to optimize a solr application. > The bottleneck are queries that request 1000 rows to solr. > Unfortunately the application can't be modified at the moment, can you > suggest me what could be done on the solr side to increase the > performance? > The bottleneck is just on fetching the results, the query executes very > fast. > I suggested caching .fdx and .fdt files on the file system cache. > Anything else? The index files will automatically be cached in the OS disk cache without any intervention, so that can't be the issue. How are you sorting the results? Are you letting it calculate scores? 1000 rows shouldn't be particularly expensive, beyond the unavoidable network cost. Have you considered using the /export endpoint and the streaming API? I haven't used it myself, but it is intended for getting larger amounts of data out of a Solr index. Upayavira
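For reference, a sketch of the /export style of request mentioned above, on releases that ship the handler (the thread notes it is not available in 4.0); sort and fl are required and the fields involved need docValues enabled. Core and field names are assumptions.

# Stream the full sorted result set instead of paging through /select.
curl "http://localhost:8983/solr/test/export?q=title_t:mb&sort=id+asc&fl=id,title_t"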
Re: optimize requests that fetch 1000 rows
Hi Upayavira, I'm working with solr 4.0, sorting on score (default). I tried setting the document cache size to 2048, so all docs of a single request fit (2 requests fit actually) If I execute a query the first time it takes 24s I reexecute it, with all docs in the documentCache and it takes 15s execute it with rows = 400 and it takes 3s it seems that below rows = 400 times are acceptable, beyond they get slow 2016-02-11 11:27 GMT+01:00 Upayavira : > > > On Thu, Feb 11, 2016, at 09:33 AM, Matteo Grolla wrote: > > Hi, > > I'm trying to optimize a solr application. > > The bottleneck are queries that request 1000 rows to solr. > > Unfortunately the application can't be modified at the moment, can you > > suggest me what could be done on the solr side to increase the > > performance? > > The bottleneck is just on fetching the results, the query executes very > > fast. > > I suggested caching .fdx and .fdt files on the file system cache. > > Anything else? > > The index files will automatically be cached in the OS disk cache > without any intervention, so that can't be the issue. > > How are you sorting the results? Are you letting it calculate scores? > 1000 rows shouldn't be particularly expensive, beyond the unavoidable > network cost. > > Have you considered using the /export endpoint and the streaming API? I > haven't used it myself, but it is intended for getting larger amounts of > data out of a Solr index. > > Upayavira >
Re: optimize requests that fetch 1000 rows
On Thu, 2016-02-11 at 11:53 +0100, Matteo Grolla wrote: > I'm working with solr 4.0, sorting on score (default). > I tried setting the document cache size to 2048, so all docs of a single > request fit (2 requests fit actually) > If I execute a query the first time it takes 24s > I reexecute it, with all docs in the documentCache and it takes 15s > execute it with rows = 400 and it takes 3s Those are very long execution times. It sounds like you either have very complex queries or very large fields, as Binoy suggests. Can you provide us with a full sample request and tell us how large a single document is when returned? If you do not need all the fields in the returned documents, you should limit them with the fl-parameter. - Toke Eskildsen, State and University Library, Denmark
Re: optimize requests that fetch 1000 rows
Thanks Toke, yes, they are long times, and solr qtime (to execute the query) is a fraction of a second. The response in javabin format is around 300k. Currently I can't limit the rows requested or the fields requested, those are fixed for me. 2016-02-11 13:14 GMT+01:00 Toke Eskildsen : > On Thu, 2016-02-11 at 11:53 +0100, Matteo Grolla wrote: > > I'm working with solr 4.0, sorting on score (default). > > I tried setting the document cache size to 2048, so all docs of a single > > request fit (2 requests fit actually) > > If I execute a query the first time it takes 24s > > I reexecute it, with all docs in the documentCache and it takes 15s > > execute it with rows = 400 and it takes 3s > > Those are very long execution times. It sounds like you either have very > complex queries or very large fields, as Binoy suggests. Can you provide > us with a full sample request and tell us how large a single documnent > is when returned? If you do not need all the fields in the returned > documents, you should limit them with the fl-parameter. > > - Toke Eskildsen, State and University Library, Denmark > > >
Re: [More Like This] Query building
Hi Guys, is it possible to have any feedback ? Is there any process to speed up bug resolution / discussions ? just want to understand if the patch is not good enough, if I need to improve it or simply no-one took a look ... https://issues.apache.org/jira/browse/LUCENE-6954 Cheers On 11 January 2016 at 15:25, Alessandro Benedetti wrote: > Hi guys, > the patch seems fine to me. > I didn't spend much more time on the code but I checked the tests and the > pre-commit checks. > It seems fine to me. > Let me know , > > Cheers > > On 31 December 2015 at 18:40, Alessandro Benedetti > wrote: > >> https://issues.apache.org/jira/browse/LUCENE-6954 >> >> First draft patch available, I will check better the tests new year ! >> >> On 29 December 2015 at 13:43, Alessandro Benedetti > > wrote: >> >>> Sure, I will proceed tomorrow with the Jira and the simple patch + tests. >>> >>> In the meantime let's try to collect some additional feedback. >>> >>> Cheers >>> >>> On 29 December 2015 at 12:43, Anshum Gupta >>> wrote: >>> Feel free to create a JIRA and put up a patch if you can. On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti < abenede...@apache.org > wrote: > Hi guys, > While I was exploring the way we build the More Like This query, I > discovered a part I am not convinced of : > > > > Let's see how we build the query : > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int) > > 1) we extract the terms from the interesting fields, adding them to a map : > > Map termFreqMap = new HashMap<>(); > > *( we lose the relation field-> term, we don't know anymore where the term > was coming ! )* > > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue > > 2) we build the queue that will contain the query terms, at this point we > connect again there terms to some field, but : > > ... >> // go through all the fields and find the largest document frequency >> String topField = fieldNames[0]; >> int docFreq = 0; >> for (String fieldName : fieldNames) { >> int freq = ir.docFreq(new Term(fieldName, word)); >> topField = (freq > docFreq) ? fieldName : topField; >> docFreq = (freq > docFreq) ? freq : docFreq; >> } >> ... > > > We identify the topField as the field with the highest document frequency > for the term t . > Then we build the termQuery : > > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf)); > > In this way we lose a lot of precision. > Not sure why we do that. > I would prefer to keep the relation between terms and fields. > The MLT query can improve a lot the quality. > If i run the MLT on 2 fields : *description* and *facilities* for example. > It is likely I want to find documents with similar terms in the > description and similar terms in the facilities, without mixing up the > things and loosing the semantic of the terms. > > Let me know your opinion, > > Cheers > > > -- > -- > > Benedetti Alessandro > Visiting card : http://about.me/alessandro_benedetti > > "Tyger, tyger burning bright > In the forests of the night, > What immortal hand or eye > Could frame thy fearful symmetry?" > > William Blake - Songs of Experience -1794 England > -- Anshum Gupta >>> >>> >>> >>> -- >>> -- >>> >>> Benedetti Alessandro >>> Visiting card : http://about.me/alessandro_benedetti >>> >>> "Tyger, tyger burning bright >>> In the forests of the night, >>> What immortal hand or eye >>> Could frame thy fearful symmetry?" 
>>> >>> William Blake - Songs of Experience -1794 England >>> >> >> >> >> -- >> -- >> >> Benedetti Alessandro >> Visiting card : http://about.me/alessandro_benedetti >> >> "Tyger, tyger burning bright >> In the forests of the night, >> What immortal hand or eye >> Could frame thy fearful symmetry?" >> >> William Blake - Songs of Experience -1794 England >> > > > > -- > -- > > Benedetti Alessandro > Visiting card : http://about.me/alessandro_benedetti > > "Tyger, tyger burning bright > In the forests of the night, > What immortal hand or eye > Could frame thy fearful symmetry?" > > William Blake - Songs of Experience -1794 England > -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti "Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry?" William Blake - Songs of Experience -1794 England
Re: Solr architecture
Hi Mark, Nothing comes for free :) With doc per action, you will have to handle large number of docs. There is hard limit for number of docs per shard - it is ~4 billion (size of int) so sharding is mandatory. It is most likely that you will have to have more than one collection. Depending on your queries, different layouts can be applied. What will be these 320 qps? Will you do some filtering (by user, country,...), will you focus on the latest data, what is your data retention strategy... You should answer to such questions and decide setup that will handle important one in efficient way. With this amount of data you will most likely have to do some tradeoffs. When it comes to sending docs to Solr, sending bulks is mandatory. Regards, Emir On 10.02.2016 22:48, Mark Robinson wrote: Thanks everyone for your suggestions. Based on it I am planning to have one doc per event with sessionId common. So in this case hopefully indexing each doc as and when it comes would be okay? Or do we still need to batch and index to Solr? Also with 4M sessions a day with about 6000 docs (events) per session we can expect about 24Billion docs per day! Will Solr still hold good. If so could some one please recommend a sizing to cater to this levels of data. The queries per second is around 320 qps. Thanks! Mark On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic < emir.arnauto...@sematext.com> wrote: Hi Mark, Appending session actions just to be able to return more than one session without retrieving large number of results is not good tradeoff. Like Upayavira suggested, you should consider storing one action per doc and aggregate on read time or push to Solr once session ends and aggregate on some other layer. If you are thinking handling infrastructure might be too much, you may consider using some of logging services to hold data. One such service is Sematext's Logsene (http://sematext.com/logsene). Thanks, Emir -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/ On 10.02.2016 03:22, Mark Robinson wrote: Thanks for your replies and suggestions! Why I store all events related to a session under one doc? Each session can have about 500 total entries (events) corresponding to it. So when I try to retrieve a session's info it can back with around 500 records. If it is this compounded one doc per session, I can retrieve more sessions at a time with one doc per session. eg under a sessionId an array of eventA activities, eventB activities (using json). When an eventA activity again occurs, we will read all that data for that session, append this extra info to evenA data and push the whole session related data back (indexing) to Solr. Like this for many sessions parallely. Why NRT? Parallely many sessions are being written (4Million sessions hence 4Million docs per day). A person can do this querying any time. It is just a look up? Yes. We just need to retrieve all info for a session and pass it on to another system. We may even do some extra querying on some data like timestamps, pageurl etc in that info added to a session. Thinking of having the data separate from the actual Solr Instance and mention the loc of the dataDir in solrconfig. If Solr is not a good option could you please suggest something which will satisfy this use case with min response time while querying. Thanks! 
Mark On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins wrote: So as I understand your use case, its effectively logging actions within a user session, why do you have to do the update in NRT? Why not just log all the user session events (with some unique key, and ensuring the session Id is in the document somewhere), then when you want to do the query, you join on the session id, and that gives you all the data records for that session. I don't really follow why it has to be 1 document (which you continually update). If you really need that aggregation, couldn't that happen offline? I guess your 1 saving grace is that you query using the unique ID (in your scenario) so you could use the real-time get handler, since you aren't doing a complex query (strictly its not a search, its a raw key lookup). But I would still question your use case, if you go the Solr route for that kind of scale with querying and indexing that much, you're going to have to throw a lot of hardware at it, as Jack says probably in the order of hundreds of machines... On 9 February 2016 at 19:00, Upayavira wrote: Bear in mind that Lucene is optimised towards high read lower write. That is, it puts in a lot of effort at write time to make reading efficient. It sounds like you are going to be doing far more writing than reading, and I wonder whether you are necessarily choosing the right tool for the job. How would you later use this data, and what advantage is there to storing it in Solr? Upayavira On Tue, Feb 9, 2016, at 03:40 PM, Mark Robinson wrote: Hi, Thanks
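A minimal sketch of the bulk-indexing pattern suggested in this thread: one HTTP update request carrying a batch of per-event documents instead of one request per event. Collection and field names are made up for illustration.

# One request, many documents; commits are left to autoCommit settings.
curl -X POST "http://localhost:8983/solr/events/update" \
  -H 'Content-type:application/json' \
  --data-binary '[
    {"id":"sess1-evt1","sessionId_s":"sess1","event_s":"click","ts_dt":"2016-02-10T10:00:00Z"},
    {"id":"sess1-evt2","sessionId_s":"sess1","event_s":"pageview","ts_dt":"2016-02-10T10:00:05Z"},
    {"id":"sess2-evt1","sessionId_s":"sess2","event_s":"click","ts_dt":"2016-02-10T10:00:07Z"}
  ]'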
Re: Json faceting, aggregate numeric field by day?
On Wed, Feb 10, 2016 at 12:13 PM, Markus Jelsma wrote: > Hi Tom - thanks. But judging from the article and SOLR-6348 faceting stats > over ranges is not yet supported. More specifically, SOLR-6352 is what we > would need. > > [1]: https://issues.apache.org/jira/browse/SOLR-6348 > [2]: https://issues.apache.org/jira/browse/SOLR-6352 > > Thanks anyway, at least we found the tickets :) > No problem - as I was reading this I was thinking "But wait, I *know* we do this ourselves for average price vs month published". In fact, I was forgetting that we index the ranges that we will want to facet over as part of the document - so a document with a date_published of "2010-03-29T00:00:00Z" also has a date_published.month of "201003" (and a bunch of other ranges that we want to facet by). The frontend then converts those fields in to the appropriate values for display. This might be an acceptable solution for you guys too, depending on how many ranges that you require, and how much larger it would make your index. Cheers Tom
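A sketch of the pattern Tom describes, with made-up collection and field names: the document carries a precomputed bucket (here a month string) next to the raw date, and the aggregation runs over that bucket.

# Index the bucket alongside the raw date field...
curl -X POST "http://localhost:8983/solr/products/update?commit=true" \
  -H 'Content-type:application/json' \
  --data-binary '[{"id":"1","date_published_dt":"2010-03-29T00:00:00Z","date_published_month_s":"201003","price_f":9.99}]'

# ...then average the numeric field per bucket with the JSON Facet API.
curl "http://localhost:8983/solr/products/select" \
  --data-urlencode "q=*:*" --data-urlencode "rows=0" \
  --data-urlencode 'json.facet={by_month:{type:terms,field:date_published_month_s,facet:{avg_price:"avg(price_f)"}}}'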
Re: Size of logs are high
Can you check your log level? Probably log level of error would suffice for your purpose and it would most certainly reduce your log size(s). On Thu, Feb 11, 2016 at 12:53 PM, kshitij tyagi wrote: > Hi, > I have migrated to solr 5.2 and the size of logs are high. > > Can anyone help me out here how to control this? > -- Aditya Sundaram Software Engineer, Technology team AKR Tech park B Block, B1 047 +91-9844006866
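As a sketch of that check, the logging admin endpoint can list the active levels and raise them at runtime; the change is not persistent, so editing server/resources/log4j.properties is the way to make it permanent. Host and logger names here are assumptions.

# List loggers and their current levels
curl "http://localhost:8983/solr/admin/info/logging?wt=json"

# Raise the org.apache.solr hierarchy to WARN until the next restart
curl "http://localhost:8983/solr/admin/info/logging?set=org.apache.solr:WARN"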
Re: Json faceting, aggregate numeric field by day?
On Wed, Feb 10, 2016 at 5:21 AM, Markus Jelsma wrote: > Hi - if we assume the following simple documents: > > > 2015-01-01T00:00:00Z > 2 > > > 2015-01-01T00:00:00Z > 4 > > > 2015-01-02T00:00:00Z > 3 > > > 2015-01-02T00:00:00Z > 7 > > > Can i get a daily average for the field 'value' by day? e.g. > > > 3.0 > 5.0 > For the JSON Facet API, I guess this would be: json.facet= by_day : { type : range, start : ..., end : ..., gap : "+1DAY", facet : { x : "avg(value)" } } -Yonik
Re: optimize requests that fetch 1000 rows
Hi Matteo, as an addition to Upayavira observation, how is the memory assigned for that Solr Instance ? How much memory is assigned to Solr and how much left for the OS ? Is this a VM on top of a physical machine ? So it is the real physical memory used, or swapping could happen frequently ? Is there enough memory to allow the OS to cache the stored content index segments in memory ? As a first thing I would try to exclude I/O bottlenecks with the disk ( and apparently your document cache experiment should exclude them) Unfortunately the export request handler is not an option with 4.0 . Are you obtaining those timings in high load ( huge query load) or in low load timeframes ? What happens if you take the solr instance all for you an repeat the experiment? In an healthy memory mapped scenario i would not expect to half the time of a single query thanks to the document cache ( of course I would expect a benefit but it looks too much difference for something that should be already in RAM). In a dedicated instance , have you tried without the document cache, to repeat the query execution ? ( to trigger possibly memory mapping) But also that should be an alarming point, in a low load Solr, with the document fields in cache ( so java heap memory), it is impressive we take 14s to load the documents fields. I am curious to know updates on this, Cheers On 11 February 2016 at 12:45, Matteo Grolla wrote: > Thanks Toke, yes, they are long times, and solr qtime (to execute the > query) is a fraction of a second. > The response in javabin format is around 300k. > Currently I can't limit the rows requested or the fields requested, those > are fixed for me. > > 2016-02-11 13:14 GMT+01:00 Toke Eskildsen : > > > On Thu, 2016-02-11 at 11:53 +0100, Matteo Grolla wrote: > > > I'm working with solr 4.0, sorting on score (default). > > > I tried setting the document cache size to 2048, so all docs of a > single > > > request fit (2 requests fit actually) > > > If I execute a query the first time it takes 24s > > > I reexecute it, with all docs in the documentCache and it takes 15s > > > execute it with rows = 400 and it takes 3s > > > > Those are very long execution times. It sounds like you either have very > > complex queries or very large fields, as Binoy suggests. Can you provide > > us with a full sample request and tell us how large a single documnent > > is when returned? If you do not need all the fields in the returned > > documents, you should limit them with the fl-parameter. > > > > - Toke Eskildsen, State and University Library, Denmark > > > > > > > -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti "Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry?" William Blake - Songs of Experience -1794 England
Re: optimize requests that fetch 1000 rows
On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla wrote: > Thanks Toke, yes, they are long times, and solr qtime (to execute the > query) is a fraction of a second. > The response in javabin format is around 300k. OK, That tells us a lot. And if you actually tested so that all the docs would be in the cache (can you verify this by looking at the cache stats after you re-execute?) then it seems like the slowness is down to any of: a) serializing the response (it doesn't seem like a 300K response should take *that* long to serialize) b) reading/processing the response (how fast the client can do something with each doc is also a factor...) c) other (GC, network, etc) You can try taking client processing out of the equation by trying a curl request. -Yonik
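A minimal sketch of that kind of isolated measurement: fetch the same 1000-row javabin response with curl and discard the body, so client-side parsing is excluded and only Solr plus transfer time remains. Query, core, and host are placeholders.

# Wall-clock time for the raw HTTP round trip, body thrown away.
time curl -s -o /dev/null \
  "http://localhost:8983/solr/collection1/select?q=*:*&rows=1000&wt=javabin"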
Re: multiple but identical suggestions in autocomplete
Related this, I just created this : https://issues.apache.org/jira/browse/SOLR-8672 To be fair, I see no utility in returning duplicate suggestions ( if they have no different payload, they are un-distinguishable from a human perspective hence useless to have duplication) . I would like to hear some counter example. In my opinion we should contribute a way to avoid the duplicates directly in Solr. If there are valid counter examples, we could add an additional parameter for the solr.SuggestComponent like boolean . In a lot of scenarios I guess it could be a good fit. Cheers On 5 August 2015 at 12:06, Nutch Solr User wrote: > You will need to call this service from UI as you are calling suggester > component currently. (may be on every key-press event in text box). You > will > pass required parameters too. > > Service will internally form a solr suggester query and query Solr. From > the > returned response it will keep only unique suggestions from top N > suggestions and return suggestions to UI. > > > > - > Nutch Solr User > > "The ultimate search engine would basically understand everything in the > world, and it would always give you the right thing." > -- > View this message in context: > http://lucene.472066.n3.nabble.com/multiple-but-identical-suggestions-in-autocomplete-tp4220055p4220953.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti "Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry?" William Blake - Songs of Experience -1794 England
Re: optimize requests that fetch 1000 rows
Hi Yonic, after the first query I find 1000 docs in the document cache. I'm using curl to send the request and requesting javabin format to mimic the application. gc activity is low I managed to load the entire 50GB index in the filesystem cache, after that queries don't cause disk activity anymore. Time improves now queries that took ~30s take <10s. But I hoped better I'm going to use jvisualvm's sampler to analyze where time is spent 2016-02-11 15:25 GMT+01:00 Yonik Seeley : > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla > wrote: > > Thanks Toke, yes, they are long times, and solr qtime (to execute the > > query) is a fraction of a second. > > The response in javabin format is around 300k. > > OK, That tells us a lot. > And if you actually tested so that all the docs would be in the cache > (can you verify this by looking at the cache stats after you > re-execute?) then it seems like the slowness is down to any of: > a) serializing the response (it doesn't seem like a 300K response > should take *that* long to serialize) > b) reading/processing the response (how fast the client can do > something with each doc is also a factor...) > c) other (GC, network, etc) > > You can try taking client processing out of the equation by trying a > curl request. > > -Yonik >
Re: Solr architecture
Your biggest issue here is likely to be http connections. Making an HTTP connection to Solr is way more expensive than the ask of adding a single document to the index. If you are expecting to add 24 billion docs per day, I'd suggest that somehow merging those documents into batches before sending them to Solr will be necessary. To my previous question - what do you gain by using Solr that you don't get from other solutions? I'd suggest that to make this system really work, you are going to need a deep understanding of how Lucene works - segments, segment merges, deletions, and many other things because when you start to work at that scale, the implementation details behind Lucene really start to matter and impact upon your ability to succeed. I'd suggest that what you are undertaking can certainly be done, but is a substantial project. Upayavira On Wed, Feb 10, 2016, at 09:48 PM, Mark Robinson wrote: > Thanks everyone for your suggestions. > Based on it I am planning to have one doc per event with sessionId > common. > > So in this case hopefully indexing each doc as and when it comes would be > okay? Or do we still need to batch and index to Solr? > > Also with 4M sessions a day with about 6000 docs (events) per session we > can expect about 24Billion docs per day! > > Will Solr still hold good. If so could some one please recommend a sizing > to cater to this levels of data. > The queries per second is around 320 qps. > > Thanks! > Mark > > > On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic < > emir.arnauto...@sematext.com> wrote: > > > Hi Mark, > > Appending session actions just to be able to return more than one session > > without retrieving large number of results is not good tradeoff. Like > > Upayavira suggested, you should consider storing one action per doc and > > aggregate on read time or push to Solr once session ends and aggregate on > > some other layer. > > If you are thinking handling infrastructure might be too much, you may > > consider using some of logging services to hold data. One such service is > > Sematext's Logsene (http://sematext.com/logsene). > > > > Thanks, > > Emir > > > > -- > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management > > Solr & Elasticsearch Support * http://sematext.com/ > > > > > > > > On 10.02.2016 03:22, Mark Robinson wrote: > > > >> Thanks for your replies and suggestions! > >> > >> Why I store all events related to a session under one doc? > >> Each session can have about 500 total entries (events) corresponding to > >> it. > >> So when I try to retrieve a session's info it can back with around 500 > >> records. If it is this compounded one doc per session, I can retrieve more > >> sessions at a time with one doc per session. > >> eg under a sessionId an array of eventA activities, eventB activities > >> (using json). When an eventA activity again occurs, we will read all > >> that > >> data for that session, append this extra info to evenA data and push the > >> whole session related data back (indexing) to Solr. Like this for many > >> sessions parallely. > >> > >> > >> Why NRT? > >> Parallely many sessions are being written (4Million sessions hence > >> 4Million > >> docs per day). A person can do this querying any time. > >> > >> It is just a look up? > >> Yes. We just need to retrieve all info for a session and pass it on to > >> another system. We may even do some extra querying on some data like > >> timestamps, pageurl etc in that info added to a session. 
> >> > >> Thinking of having the data separate from the actual Solr Instance and > >> mention the loc of the dataDir in solrconfig. > >> > >> If Solr is not a good option could you please suggest something which will > >> satisfy this use case with min response time while querying. > >> > >> Thanks! > >> Mark > >> > >> On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins > >> wrote: > >> > >> So as I understand your use case, its effectively logging actions within a > >>> user session, why do you have to do the update in NRT? Why not just log > >>> all the user session events (with some unique key, and ensuring the > >>> session > >>> Id is in the document somewhere), then when you want to do the query, you > >>> join on the session id, and that gives you all the data records for that > >>> session. I don't really follow why it has to be 1 document (which you > >>> continually update). If you really need that aggregation, couldn't that > >>> happen offline? > >>> > >>> I guess your 1 saving grace is that you query using the unique ID (in > >>> your > >>> scenario) so you could use the real-time get handler, since you aren't > >>> doing a complex query (strictly its not a search, its a raw key lookup). > >>> > >>> But I would still question your use case, if you go the Solr route for > >>> that > >>> kind of scale with querying and indexing that much, you're going to have > >>> to > >>> throw a lot of hardware at it, as Jack says probably in the order of > >>> hundreds
Re: optimize requests that fetch 1000 rows
On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla wrote: > Hi Yonic, > after the first query I find 1000 docs in the document cache. > I'm using curl to send the request and requesting javabin format to mimic > the application. > gc activity is low > I managed to load the entire 50GB index in the filesystem cache, after that > queries don't cause disk activity anymore. > Time improves now queries that took ~30s take <10s. But I hoped better > I'm going to use jvisualvm's sampler to analyze where time is spent Thanks, please keep us posted... something is definitely strange. -Yonik
Re: optimize requests that fetch 1000 rows
I see a lot of time spent in splitOnTokens which is called by (last part of stack trace) BinaryResponseWriter$Resolver.writeResultsBody() ... solr.search.ReturnsField.wantsField() commons.io.FileNameUtils.wildcardmatch() commons.io.FileNameUtils.splitOnTokens() 2016-02-11 15:42 GMT+01:00 Matteo Grolla : > Hi Yonic, > after the first query I find 1000 docs in the document cache. > I'm using curl to send the request and requesting javabin format to mimic > the application. > gc activity is low > I managed to load the entire 50GB index in the filesystem cache, after > that queries don't cause disk activity anymore. > Time improves now queries that took ~30s take <10s. But I hoped better > I'm going to use jvisualvm's sampler to analyze where time is spent > > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley : > >> On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla >> wrote: >> > Thanks Toke, yes, they are long times, and solr qtime (to execute the >> > query) is a fraction of a second. >> > The response in javabin format is around 300k. >> >> OK, That tells us a lot. >> And if you actually tested so that all the docs would be in the cache >> (can you verify this by looking at the cache stats after you >> re-execute?) then it seems like the slowness is down to any of: >> a) serializing the response (it doesn't seem like a 300K response >> should take *that* long to serialize) >> b) reading/processing the response (how fast the client can do >> something with each doc is also a factor...) >> c) other (GC, network, etc) >> >> You can try taking client processing out of the equation by trying a >> curl request. >> >> -Yonik >> > >
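The frames above are the per-field wildcard matching that ReturnFields performs for glob-style entries in fl; if the application's field list contains patterns, one quick experiment is to repeat the request with an explicit field list, which skips that code path. Field names below are assumptions.

# Explicit fl instead of a glob such as fl=*_txt; only patterned entries
# go through the wildcard matcher seen in the sampler output.
curl -s -o /dev/null \
  "http://localhost:8983/solr/collection1/select?q=*:*&rows=1000&fl=id,title_t,price_f&wt=javabin"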
RE: Json faceting, aggregate numeric field by day?
Thanks. But this yields an error in FacetModule: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map at org.apache.solr.search.facet.FacetModule.prepare(FacetModule.java:100) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:247) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156) at org.apache.solr.core.SolrCore.execute(SolrCore.java:2073) ... Is it supposed to work? I also found open issues SOLR-6348 and SOLR-6352 which made me doubt is wat supported at all. Thanks, Markus [1]: https://issues.apache.org/jira/browse/SOLR-6348 [2]: https://issues.apache.org/jira/browse/SOLR-6352 -Original message- > From:Yonik Seeley > Sent: Thursday 11th February 2016 15:11 > To: solr-user@lucene.apache.org > Subject: Re: Json faceting, aggregate numeric field by day? > > On Wed, Feb 10, 2016 at 5:21 AM, Markus Jelsma > wrote: > > Hi - if we assume the following simple documents: > > > > > > 2015-01-01T00:00:00Z > > 2 > > > > > > 2015-01-01T00:00:00Z > > 4 > > > > > > 2015-01-02T00:00:00Z > > 3 > > > > > > 2015-01-02T00:00:00Z > > 7 > > > > > > Can i get a daily average for the field 'value' by day? e.g. > > > > > > 3.0 > > 5.0 > > > > For the JSON Facet API, I guess this would be: > > json.facet= > > by_day : { > type : range, > start : ..., > end : ..., > gap : "+1DAY", > facet : { > x : "avg(value)" > } > } > > > -Yonik >
Re: Json faceting, aggregate numeric field by day?
On Thu, Feb 11, 2016 at 10:04 AM, Markus Jelsma wrote: > Thanks. But this yields an error in FacetModule: > > java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map > at > org.apache.solr.search.facet.FacetModule.prepare(FacetModule.java:100) > at > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:247) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2073) > ... We don't have the best error reporting yet... can you show what you sent for json.facet? > Is it supposed to work? Yep, there are tests. Here's an example of calculating percentiles per range facet bucket (at the bottom): http://yonik.com/percentiles-for-solr-faceting/ > I also found open issues SOLR-6348 and SOLR-6352 which made me doubt is wat > supported at all. Those issues aren't related to the new facet module... you can tell by the syntax. -Yonik
Select distinct records
I am trying to select distinct records from a collection (I need the distinct name and the corresponding id). I have tried using grouping with group.format=simple, but that takes a long time to execute and sometimes runs into an out-of-memory exception. Another limitation seems to be that the total number of groups is not returned. Is there a faster and more efficient way to do this? Thank you
Re: optimize requests that fetch 1000 rows
[image: embedded screenshot of the profiler output, not reproduced in the archive] 2016-02-11 16:05 GMT+01:00 Matteo Grolla : > I see a lot of time spent in splitOnTokens > > which is called by (last part of stack trace) > > BinaryResponseWriter$Resolver.writeResultsBody() > ... > solr.search.ReturnsField.wantsField() > commons.io.FileNameUtils.wildcardmatch() > commons.io.FileNameUtils.splitOnTokens() > > > > 2016-02-11 15:42 GMT+01:00 Matteo Grolla : > >> Hi Yonic, >> after the first query I find 1000 docs in the document cache. >> I'm using curl to send the request and requesting javabin format to mimic >> the application. >> gc activity is low >> I managed to load the entire 50GB index in the filesystem cache, after >> that queries don't cause disk activity anymore. >> Time improves now queries that took ~30s take <10s. But I hoped better >> I'm going to use jvisualvm's sampler to analyze where time is spent >> >> >> 2016-02-11 15:25 GMT+01:00 Yonik Seeley : >> >>> On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla >>> wrote: >>> > Thanks Toke, yes, they are long times, and solr qtime (to execute the >>> > query) is a fraction of a second. >>> > The response in javabin format is around 300k. >>> >>> OK, That tells us a lot. >>> And if you actually tested so that all the docs would be in the cache >>> (can you verify this by looking at the cache stats after you >>> re-execute?) then it seems like the slowness is down to any of: >>> a) serializing the response (it doesn't seem like a 300K response >>> should take *that* long to serialize) >>> b) reading/processing the response (how fast the client can do >>> something with each doc is also a factor...) >>> c) other (GC, network, etc) >>> >>> You can try taking client processing out of the equation by trying a >>> curl request. >>> >>> -Yonik >>> >> >> >
Re: optimize requests that fetch 1000 rows
Is this a scenario that was working fine and suddenly deteriorated, or has it always been slow? -- Jack Krupansky On Thu, Feb 11, 2016 at 4:33 AM, Matteo Grolla wrote: > Hi, > I'm trying to optimize a solr application. > The bottleneck are queries that request 1000 rows to solr. > Unfortunately the application can't be modified at the moment, can you > suggest me what could be done on the solr side to increase the performance? > The bottleneck is just on fetching the results, the query executes very > fast. > I suggested caching .fdx and .fdt files on the file system cache. > Anything else? > > Thanks >
RE: Running Solr on port 80
You should edit the files installed by install_solr_service.sh - change the init.d script to pass the -p argument to ${SOLRINSTALLDIR}/bin/solr. By the way, my initscript is modified (a) to support the conventional /etc/sysconfig/ convention, and (b) to run solr as a different user than the user who owns the jars. In the end, I don't use install_solr_service.sh at all, because it expects to have root and what I run to provision doesn't have root. -Original Message- From: Jeyaprakash Singarayar [mailto:jpsingara...@gmail.com] Sent: Thursday, February 11, 2016 2:40 AM To: solr-user@lucene.apache.org; binoydala...@gmail.com Subject: Re: Running Solr on port 80 That ok if I'm using it in local, but I'm doing it in a production based on the below page https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production On Thu, Feb 11, 2016 at 12:58 PM, Binoy Dalal wrote: > Why don't you directly run solr from the script provided in > {SOLR_DIST}\bin ./solr start -p 8984 > > On Thu, 11 Feb 2016, 12:56 Jeyaprakash Singarayar > > wrote: > > > Hi, > > > > I'm trying to install solr 5.4.1 on CentOS. I know that while > > installing Solr as a service in the Linux we can pass -p > number> to shift the app to host on that port. > > > > ./install_solr_service.sh solr-5.4.1.tgz -p 8984 -f > > > > but still it shows as it is hosted on 8983 and not on 8984. Any idea? > > > > Waiting up to 30 seconds to see Solr running on port 8983 [/] > > Started Solr server on port 8983 (pid=33034). Happy searching! > > > > Found 1 Solr nodes: > > > > Solr process 33034 running on port 8983 { > > "solr_home":"/var/solr/data", > > "version":"5.4.1 1725212 - jpountz - 2016-01-18 11:51:45", > > "startTime":"2016-02-11T07:25:03.996Z", > > "uptime":"0 days, 0 hours, 0 minutes, 11 seconds", > > "memory":"68 MB (%13.9) of 490.7 MB"} > > > > Service solr installed. > > > -- > Regards, > Binoy Dalal >
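For the port itself, a sketch of the environment-include route, assuming the default layout that install_solr_service.sh sets up (the include file it installs is sourced by the init script on every start):

# Later assignments win when the include file is sourced, so appending works.
echo 'SOLR_PORT=8984' | sudo tee -a /etc/default/solr.in.sh
sudo service solr restart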
Re: Select distinct records
What version of Solr are you using? Have you taken a look at the Collapsing Query Parser. It basically performs the same functions as grouping but is much more efficient at doing it. Take a look here: https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi wrote: > I am trying to select distinct records from a collection. (I need distinct > name and corresponding id) > > I have tried using grouping and group format of simple but that takes a > long time to execute and sometimes runs into out of memory exception. > Another limitation seems to be that total number of groups are not > returned. > > Is there another faster and more efficient way to do this? > > Thank you > -- Regards, Binoy Dalal
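A sketch of what that looks like as a request, with assumed field names: collapsing on the name field keeps one representative document per distinct value, and numFound then reflects the number of distinct groups.

# One document per distinct name_s; numFound counts the collapsed groups.
curl "http://localhost:8983/solr/collection1/select" \
  --data-urlencode "q=*:*" \
  --data-urlencode "fq={!collapse field=name_s}" \
  --data-urlencode "fl=id,name_s" \
  --data-urlencode "rows=100"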
Re: optimize requests that fetch 1000 rows
Responses have always been slow but previously time was dominated by faceting. After few optimization this is my bottleneck. My suggestion has been to properly implement paging and reduce rows, unfortunately this is not possible at least not soon 2016-02-11 16:18 GMT+01:00 Jack Krupansky : > Is this a scenario that was working fine and suddenly deteriorated, or has > it always been slow? > > -- Jack Krupansky > > On Thu, Feb 11, 2016 at 4:33 AM, Matteo Grolla > wrote: > > > Hi, > > I'm trying to optimize a solr application. > > The bottleneck are queries that request 1000 rows to solr. > > Unfortunately the application can't be modified at the moment, can you > > suggest me what could be done on the solr side to increase the > performance? > > The bottleneck is just on fetching the results, the query executes very > > fast. > > I suggested caching .fdx and .fdt files on the file system cache. > > Anything else? > > > > Thanks > > >
Re: optimize requests that fetch 1000 rows
Are queries scaling linearly - does a query for 100 rows take 1/10th the time (1 sec vs. 10 sec or 3 sec vs. 30 sec)? Does the app need/expect exactly 1,000 documents for the query or is that just what this particular query happened to return? What does they query look like? Is it complex or use wildcards or function queries, or is it very simple keywords? How many operators? Have you used the debugQuery=true parameter to see which search components are taking the time? -- Jack Krupansky On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla wrote: > Hi Yonic, > after the first query I find 1000 docs in the document cache. > I'm using curl to send the request and requesting javabin format to mimic > the application. > gc activity is low > I managed to load the entire 50GB index in the filesystem cache, after that > queries don't cause disk activity anymore. > Time improves now queries that took ~30s take <10s. But I hoped better > I'm going to use jvisualvm's sampler to analyze where time is spent > > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley : > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla > > wrote: > > > Thanks Toke, yes, they are long times, and solr qtime (to execute the > > > query) is a fraction of a second. > > > The response in javabin format is around 300k. > > > > OK, That tells us a lot. > > And if you actually tested so that all the docs would be in the cache > > (can you verify this by looking at the cache stats after you > > re-execute?) then it seems like the slowness is down to any of: > > a) serializing the response (it doesn't seem like a 300K response > > should take *that* long to serialize) > > b) reading/processing the response (how fast the client can do > > something with each doc is also a factor...) > > c) other (GC, network, etc) > > > > You can try taking client processing out of the equation by trying a > > curl request. > > > > -Yonik > > >
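A sketch of the debug request suggested above; the timing section of the debug output breaks QTime down into prepare/process time per search component. Query, core, and host are placeholders.

# debugQuery=true adds a "timing" block per component to the response;
# debug=timing would return just that block.
curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=1000&debugQuery=true&wt=json&indent=true"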
Re: Select distinct records
I am using Solr 5.1.0 On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal wrote: > What version of Solr are you using? > Have you taken a look at the Collapsing Query Parser. It basically performs > the same functions as grouping but is much more efficient at doing it. > Take a look here: > > https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results > > On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi wrote: > > > I am trying to select distinct records from a collection. (I need > distinct > > name and corresponding id) > > > > I have tried using grouping and group format of simple but that takes a > > long time to execute and sometimes runs into out of memory exception. > > Another limitation seems to be that total number of groups are not > > returned. > > > > Is there another faster and more efficient way to do this? > > > > Thank you > > > -- > Regards, > Binoy Dalal >
Re: Knowing which doc failed to get added in solr during bulk addition in Solr 5.2
For my application, the solution I implemented is I log the chunk that failed into a file. This file is than post processed one record at a time. The ones that fail, are reported to the admin and never looked at again until the admin takes action. This is not the most efficient solution right now but I intend to refactor this code so that the failed chunk is itself re-processed in smaller chunks till the chunk with the failed record(s) is down to 1 record "chunk" that will fail. Like Debraj, I would love to hear from others how they handle such failures. Steve On Thu, Feb 11, 2016 at 2:29 AM, Debraj Manna wrote: > Thanks Erik. How do people handle this scenario? Right now the only option > I can think of is to replay the entire batch by doing add for every single > doc. Then this will give me error for all the docs which got added from the > batch. > > On Tue, Feb 9, 2016 at 10:57 PM, Erick Erickson > wrote: > > > This has been a long standing issue, Hoss is doing some current work on > it > > see: > > https://issues.apache.org/jira/browse/SOLR-445 > > > > But the short form is "no, not yet". > > > > Best, > > Erick > > > > On Tue, Feb 9, 2016 at 8:19 AM, Debraj Manna > > wrote: > > > Hi, > > > > > > > > > > > > I have a Document Centric Versioning Constraints added in solr schema:- > > > > > > > > > false > > > doc_version > > > > > > > > > I am adding multiple documents in solr in a single call using SolrJ > 5.2. > > > The code fragment looks something like below :- > > > > > > > > > try { > > > UpdateResponse resp = solrClient.add(docs.getDocCollection(), > > > 500); > > > if (resp.getStatus() != 0) { > > > throw new Exception(new StringBuilder( > > > "Failed to add docs in solr ").append(resp.toString()) > > > .toString()); > > > } > > > } catch (Exception e) { > > > logError("Adding docs to solr failed", e); > > > } > > > > > > > > > If one of the document is violating the versioning constraints then > Solr > > is > > > returning an exception with error message like "user version is not > high > > > enough: 1454587156" & the other documents are getting added perfectly. > Is > > > there a way I can know which document is violating the constraints > either > > > in Solr logs or from the Update response returned by Solr? > > > > > > Thanks > > >
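A rough sketch of that per-record replay, under the assumption that the failed chunk has been written out as one small JSON file per document (each file holding a single-element array, as the JSON update handler expects); core name and paths are made up.

# Re-send the failed chunk one document at a time; curl -f turns an HTTP
# 4xx/5xx from Solr into a non-zero exit code so the offender is recorded.
for doc in failed_chunk/*.json; do
  if ! curl -sf -X POST "http://localhost:8983/solr/mycore/update" \
       -H 'Content-type:application/json' --data-binary "@$doc" > /dev/null; then
    echo "$doc" >> rejected_docs.txt   # report these to the admin
  fi
done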
Re: optimize requests that fetch 1000 rows
Hi Jack, response time scale with rows. Relationship doens't seem linear but Below 400 rows times are much faster, I view query times from solr logs and they are fast the same query with rows = 1000 takes 8s with rows = 10 takes 0.2s 2016-02-11 16:22 GMT+01:00 Jack Krupansky : > Are queries scaling linearly - does a query for 100 rows take 1/10th the > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)? > > Does the app need/expect exactly 1,000 documents for the query or is that > just what this particular query happened to return? > > What does they query look like? Is it complex or use wildcards or function > queries, or is it very simple keywords? How many operators? > > Have you used the debugQuery=true parameter to see which search components > are taking the time? > > -- Jack Krupansky > > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla > wrote: > > > Hi Yonic, > > after the first query I find 1000 docs in the document cache. > > I'm using curl to send the request and requesting javabin format to mimic > > the application. > > gc activity is low > > I managed to load the entire 50GB index in the filesystem cache, after > that > > queries don't cause disk activity anymore. > > Time improves now queries that took ~30s take <10s. But I hoped better > > I'm going to use jvisualvm's sampler to analyze where time is spent > > > > > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley : > > > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla < > matteo.gro...@gmail.com> > > > wrote: > > > > Thanks Toke, yes, they are long times, and solr qtime (to execute the > > > > query) is a fraction of a second. > > > > The response in javabin format is around 300k. > > > > > > OK, That tells us a lot. > > > And if you actually tested so that all the docs would be in the cache > > > (can you verify this by looking at the cache stats after you > > > re-execute?) then it seems like the slowness is down to any of: > > > a) serializing the response (it doesn't seem like a 300K response > > > should take *that* long to serialize) > > > b) reading/processing the response (how fast the client can do > > > something with each doc is also a factor...) > > > c) other (GC, network, etc) > > > > > > You can try taking client processing out of the equation by trying a > > > curl request. > > > > > > -Yonik > > > > > >
RE: Json faceting, aggregate numeric field by day?
Hi - i was sending the following value for json.facet: json.facet=by_day:{type : range, start : NOW-30DAY/DAY, end : NOW/DAY, gap : "+1DAY", facet:{x : "avg(rank)"}} I now also notice i didn't include the time field. But adding it gives the same error: json.facet=by_day:{type : range, field : time, start : NOW-30DAY/DAY, end : NOW/DAY, gap : "+1DAY", facet:{x : "avg(rank)"}} I must be missing something completely :) Thanks, Markus -Original message- > From:Yonik Seeley > Sent: Thursday 11th February 2016 16:13 > To: solr-user@lucene.apache.org > Subject: Re: Json faceting, aggregate numeric field by day? > > On Thu, Feb 11, 2016 at 10:04 AM, Markus Jelsma > wrote: > > Thanks. But this yields an error in FacetModule: > > > > java.lang.ClassCastException: java.lang.String cannot be cast to > > java.util.Map > > at > > org.apache.solr.search.facet.FacetModule.prepare(FacetModule.java:100) > > at > > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:247) > > at > > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156) > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2073) > > ... > > We don't have the best error reporting yet... can you show what you > sent for json.facet? > > > Is it supposed to work? > > Yep, there are tests. Here's an example of calculating percentiles > per range facet bucket (at the bottom): > http://yonik.com/percentiles-for-solr-faceting/ > > > I also found open issues SOLR-6348 and SOLR-6352 which made me doubt is wat > > supported at all. > > Those issues aren't related to the new facet module... you can tell by > the syntax. > > -Yonik >
Re: Json faceting, aggregate numeric field by day?
On Thu, Feb 11, 2016 at 11:07 AM, Markus Jelsma wrote: > Hi - i was sending the following value for json.facet: > json.facet=by_day:{type : range, start : NOW-30DAY/DAY, end : NOW/DAY, gap : > "+1DAY", facet:{x : "avg(rank)"}} > > I now also notice i didn't include the time field. But adding it gives the > same error: > json.facet=by_day:{type : range, field : time, start : NOW-30DAY/DAY, end : > NOW/DAY, gap : "+1DAY", facet:{x : "avg(rank)"}} Hmmm, the whole thing is a JSON object, so it needs curly braces around the whole thing... json.facet={by_day: [...] } You may need quotes around the date specs as well (containing slashes, etc)... not sure if they will be parsed as a single string or not -Yonik
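Putting both corrections together, the request would look something like the sketch below (field names taken from the earlier messages in this thread; core and host are assumptions):

# The whole json.facet value is one JSON object; date-math strings are quoted.
curl "http://localhost:8983/solr/collection1/select" \
  --data-urlencode "q=*:*" --data-urlencode "rows=0" \
  --data-urlencode 'json.facet={
    by_day: {
      type:  range,
      field: time,
      start: "NOW-30DAY/DAY",
      end:   "NOW/DAY",
      gap:   "+1DAY",
      facet: { x: "avg(rank)" }
    }
  }'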
Re: Logging request times
On 2/10/2016 10:33 AM, McCallick, Paul wrote: > We’re trying to fine tune our query and ingestion performance and would like > to get more metrics out of SOLR around this. We are capturing the standard > logs as well as the jetty request logs. The standard logs get us QTime, > which is not a good indication of how long the actual request took to > process. The Jetty request logs only show requests between nodes. I can’t > seem to find the client requests in there. > > I’d like to start tracking: > > * each request to index a document (or batch of documents) and the time > it took. > * Each request to execute a query and the time it took. The Jetty request log will usually include the IP address of the client making the request. If IP addresses are included in your log and you aren't seeing anything from your client address(es), perhaps those requests are being sent to another node. Logging elapsed time is also something that the clients can do. If the client is using SolrJ, every response object has a "getElapsedTime" method (and also "getQTime") that would allow the client program to log the elapsed time without doing its own calculation. Or the client program could calculate the elapsed time using whatever facilities are available in the relevant language. Thanks, Shawn
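For ad-hoc checks that need no client changes at all, curl can report the wall-clock time of a request next to whatever QTime ends up in the Solr log; a minimal sketch with placeholder host, core, and query:

# time_total is the full client-side elapsed time; QTime in the log only
# covers query execution inside Solr.
curl -s -o /dev/null -w "total: %{time_total}s  first byte: %{time_starttransfer}s\n" \
  "http://localhost:8983/solr/collection1/select?q=*:*&rows=10"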
Re: optimize requests that fetch 1000 rows
Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but still relatively bad. Even 50ms for 10 rows would be considered barely okay. But... again it depends on query complexity - simple queries should be well under 50 ms for decent modern hardware. -- Jack Krupansky On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla wrote: > Hi Jack, > response time scale with rows. Relationship doens't seem linear but > Below 400 rows times are much faster, > I view query times from solr logs and they are fast > the same query with rows = 1000 takes 8s > with rows = 10 takes 0.2s > > > 2016-02-11 16:22 GMT+01:00 Jack Krupansky : > > > Are queries scaling linearly - does a query for 100 rows take 1/10th the > > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)? > > > > Does the app need/expect exactly 1,000 documents for the query or is that > > just what this particular query happened to return? > > > > What does they query look like? Is it complex or use wildcards or > function > > queries, or is it very simple keywords? How many operators? > > > > Have you used the debugQuery=true parameter to see which search > components > > are taking the time? > > > > -- Jack Krupansky > > > > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla > > wrote: > > > > > Hi Yonic, > > > after the first query I find 1000 docs in the document cache. > > > I'm using curl to send the request and requesting javabin format to > mimic > > > the application. > > > gc activity is low > > > I managed to load the entire 50GB index in the filesystem cache, after > > that > > > queries don't cause disk activity anymore. > > > Time improves now queries that took ~30s take <10s. But I hoped better > > > I'm going to use jvisualvm's sampler to analyze where time is spent > > > > > > > > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley : > > > > > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla < > > matteo.gro...@gmail.com> > > > > wrote: > > > > > Thanks Toke, yes, they are long times, and solr qtime (to execute > the > > > > > query) is a fraction of a second. > > > > > The response in javabin format is around 300k. > > > > > > > > OK, That tells us a lot. > > > > And if you actually tested so that all the docs would be in the cache > > > > (can you verify this by looking at the cache stats after you > > > > re-execute?) then it seems like the slowness is down to any of: > > > > a) serializing the response (it doesn't seem like a 300K response > > > > should take *that* long to serialize) > > > > b) reading/processing the response (how fast the client can do > > > > something with each doc is also a factor...) > > > > c) other (GC, network, etc) > > > > > > > > You can try taking client processing out of the equation by trying a > > > > curl request. > > > > > > > > -Yonik > > > > > > > > > >
Re: Select distinct records
Solr 6.0 supports SELECT DISTINCT (SQL) queries. You can even choose between a MapReduce implementation and a Json Facet implementation. The MapReduce Implementation supports extremely high cardinality for the distinct fields. Json Facet implementation supports lower cardinality but high QPS. Joel Bernstein http://joelsolr.blogspot.com/ On Thu, Feb 11, 2016 at 10:30 AM, Brian Narsi wrote: > I am using > > Solr 5.1.0 > > On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal > wrote: > > > What version of Solr are you using? > > Have you taken a look at the Collapsing Query Parser. It basically > performs > > the same functions as grouping but is much more efficient at doing it. > > Take a look here: > > > > > https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results > > > > On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi wrote: > > > > > I am trying to select distinct records from a collection. (I need > > distinct > > > name and corresponding id) > > > > > > I have tried using grouping and group format of simple but that takes a > > > long time to execute and sometimes runs into out of memory exception. > > > Another limitation seems to be that total number of groups are not > > > returned. > > > > > > Is there another faster and more efficient way to do this? > > > > > > Thank you > > > > > -- > > Regards, > > Binoy Dalal > > >
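For reference, a rough sketch of such a query against the Solr 6 /sql handler; the collection and field names are illustrative, and aggregationMode switches between the MapReduce and JSON Facet implementations:

    curl "http://localhost:8983/solr/mycollection/sql" \
         --data-urlencode 'stmt=SELECT DISTINCT name, id FROM mycollection ORDER BY name ASC, id ASC LIMIT 1000' \
         --data-urlencode 'aggregationMode=map_reduce'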
Re: optimize requests that fetch 1000 rows
virtual hardware, 200ms is taken on the client until response is written to disk qtime on solr is ~90ms not great but acceptable Is it possible that the method FilenameUtils.splitOnTokens is really so heavy when requesting a lot of rows on slow hardware? 2016-02-11 17:17 GMT+01:00 Jack Krupansky : > Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but still > relatively bad. Even 50ms for 10 rows would be considered barely okay. > But... again it depends on query complexity - simple queries should be well > under 50 ms for decent modern hardware. > > -- Jack Krupansky > > On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla > wrote: > > > Hi Jack, > > response time scale with rows. Relationship doens't seem linear but > > Below 400 rows times are much faster, > > I view query times from solr logs and they are fast > > the same query with rows = 1000 takes 8s > > with rows = 10 takes 0.2s > > > > > > 2016-02-11 16:22 GMT+01:00 Jack Krupansky : > > > > > Are queries scaling linearly - does a query for 100 rows take 1/10th > the > > > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)? > > > > > > Does the app need/expect exactly 1,000 documents for the query or is > that > > > just what this particular query happened to return? > > > > > > What does they query look like? Is it complex or use wildcards or > > function > > > queries, or is it very simple keywords? How many operators? > > > > > > Have you used the debugQuery=true parameter to see which search > > components > > > are taking the time? > > > > > > -- Jack Krupansky > > > > > > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla < > matteo.gro...@gmail.com> > > > wrote: > > > > > > > Hi Yonic, > > > > after the first query I find 1000 docs in the document cache. > > > > I'm using curl to send the request and requesting javabin format to > > mimic > > > > the application. > > > > gc activity is low > > > > I managed to load the entire 50GB index in the filesystem cache, > after > > > that > > > > queries don't cause disk activity anymore. > > > > Time improves now queries that took ~30s take <10s. But I hoped > better > > > > I'm going to use jvisualvm's sampler to analyze where time is spent > > > > > > > > > > > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley : > > > > > > > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla < > > > matteo.gro...@gmail.com> > > > > > wrote: > > > > > > Thanks Toke, yes, they are long times, and solr qtime (to execute > > the > > > > > > query) is a fraction of a second. > > > > > > The response in javabin format is around 300k. > > > > > > > > > > OK, That tells us a lot. > > > > > And if you actually tested so that all the docs would be in the > cache > > > > > (can you verify this by looking at the cache stats after you > > > > > re-execute?) then it seems like the slowness is down to any of: > > > > > a) serializing the response (it doesn't seem like a 300K response > > > > > should take *that* long to serialize) > > > > > b) reading/processing the response (how fast the client can do > > > > > something with each doc is also a factor...) > > > > > c) other (GC, network, etc) > > > > > > > > > > You can try taking client processing out of the equation by trying > a > > > > > curl request. > > > > > > > > > > -Yonik > > > > > > > > > > > > > > >
RE: Json faceting, aggregate numeric field by day?
Awesome! The surrounding braces did the thing. Fixed the quotes just before. Many thanks!! The remaining issue is that some source files in o.a.s.search.facet package are package protected or private. I can't implement a custom Agg using FacetContext and such. Created issue: https://issues.apache.org/jira/browse/SOLR-8673 Thanks again! Markus -Original message- > From:Yonik Seeley > Sent: Thursday 11th February 2016 17:12 > To: solr-user@lucene.apache.org > Subject: Re: Json faceting, aggregate numeric field by day? > > On Thu, Feb 11, 2016 at 11:07 AM, Markus Jelsma > wrote: > > Hi - i was sending the following value for json.facet: > > json.facet=by_day:{type : range, start : NOW-30DAY/DAY, end : NOW/DAY, gap > > : "+1DAY", facet:{x : "avg(rank)"}} > > > > I now also notice i didn't include the time field. But adding it gives the > > same error: > > json.facet=by_day:{type : range, field : time, start : NOW-30DAY/DAY, end : > > NOW/DAY, gap : "+1DAY", facet:{x : "avg(rank)"}} > > Hmmm, the whole thing is a JSON object, so it needs curly braces > around the whole thing... > json.facet={by_day: [...] } > > You may need quotes around the date specs as well (containing slashes, > etc)... not sure if they will be parsed as a single string or not > > -Yonik >
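Putting the two corrections together (surrounding braces plus quoted date specs), the working parameter looks roughly like this, using the field and aggregation from the thread:

    json.facet={by_day : {type : range, field : time,
                          start : "NOW-30DAY/DAY", end : "NOW/DAY", gap : "+1DAY",
                          facet : {x : "avg(rank)"}}}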
slave is getting full synced every polling
Hi Guys, I'm having a problem with master-slave syncing. I have two cores: a small core (holding only the frequently used data, for fast results) and a big core (for rare queries and for searching everything). Both cores use the same solrconfig file. The small core replicates fine, but the big core does a full sync every time it starts (every minute). I found this http://stackoverflow.com/questions/6435652/solr-replication-keeps-downloading-entire-index-from-master but it wasn't really useful. Solr version 5.2.0. The small core has about 10 million docs, around 10 to 15 GB. The big core has more than 100 million docs, around 25 to 35 GB. How can I stop the full sync? Thanks Novin
Re: optimize requests that fetch 1000 rows
Out of curiosity, have you tried to debug that solr version to see which text arrives to the splitOnTokens method ? In latest solr that part has changed completely. Would be curious to understand what it tries to tokenise by ? and * ! Cheers On 11 February 2016 at 16:33, Matteo Grolla wrote: > virtual hardware, 200ms is taken on the client until response is written to > disk > qtime on solr is ~90ms > not great but acceptable > > Is it possible that the method FilenameUtils.splitOnTokens is really so > heavy when requesting a lot of rows on slow hardware? > > 2016-02-11 17:17 GMT+01:00 Jack Krupansky : > > > Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but > still > > relatively bad. Even 50ms for 10 rows would be considered barely okay. > > But... again it depends on query complexity - simple queries should be > well > > under 50 ms for decent modern hardware. > > > > -- Jack Krupansky > > > > On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla > > > wrote: > > > > > Hi Jack, > > > response time scale with rows. Relationship doens't seem linear > but > > > Below 400 rows times are much faster, > > > I view query times from solr logs and they are fast > > > the same query with rows = 1000 takes 8s > > > with rows = 10 takes 0.2s > > > > > > > > > 2016-02-11 16:22 GMT+01:00 Jack Krupansky : > > > > > > > Are queries scaling linearly - does a query for 100 rows take 1/10th > > the > > > > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)? > > > > > > > > Does the app need/expect exactly 1,000 documents for the query or is > > that > > > > just what this particular query happened to return? > > > > > > > > What does they query look like? Is it complex or use wildcards or > > > function > > > > queries, or is it very simple keywords? How many operators? > > > > > > > > Have you used the debugQuery=true parameter to see which search > > > components > > > > are taking the time? > > > > > > > > -- Jack Krupansky > > > > > > > > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla < > > matteo.gro...@gmail.com> > > > > wrote: > > > > > > > > > Hi Yonic, > > > > > after the first query I find 1000 docs in the document cache. > > > > > I'm using curl to send the request and requesting javabin format to > > > mimic > > > > > the application. > > > > > gc activity is low > > > > > I managed to load the entire 50GB index in the filesystem cache, > > after > > > > that > > > > > queries don't cause disk activity anymore. > > > > > Time improves now queries that took ~30s take <10s. But I hoped > > better > > > > > I'm going to use jvisualvm's sampler to analyze where time is spent > > > > > > > > > > > > > > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley : > > > > > > > > > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla < > > > > matteo.gro...@gmail.com> > > > > > > wrote: > > > > > > > Thanks Toke, yes, they are long times, and solr qtime (to > execute > > > the > > > > > > > query) is a fraction of a second. > > > > > > > The response in javabin format is around 300k. > > > > > > > > > > > > OK, That tells us a lot. > > > > > > And if you actually tested so that all the docs would be in the > > cache > > > > > > (can you verify this by looking at the cache stats after you > > > > > > re-execute?) 
then it seems like the slowness is down to any of: > > > > > > a) serializing the response (it doesn't seem like a 300K response > > > > > > should take *that* long to serialize) > > > > > > b) reading/processing the response (how fast the client can do > > > > > > something with each doc is also a factor...) > > > > > > c) other (GC, network, etc) > > > > > > > > > > > > You can try taking client processing out of the equation by > trying > > a > > > > > > curl request. > > > > > > > > > > > > -Yonik > > > > > > > > > > > > > > > > > > > > > -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti "Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry?" William Blake - Songs of Experience -1794 England
Re: Select distinct records
I have tried to use the Collapsing feature but it appears that it leaves duplicated records in the result set. Is that expected? Or any suggestions on working around it? Thanks On Thu, Feb 11, 2016 at 9:30 AM, Brian Narsi wrote: > I am using > > Solr 5.1.0 > > On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal > wrote: > >> What version of Solr are you using? >> Have you taken a look at the Collapsing Query Parser. It basically >> performs >> the same functions as grouping but is much more efficient at doing it. >> Take a look here: >> >> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results >> >> On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi wrote: >> >> > I am trying to select distinct records from a collection. (I need >> distinct >> > name and corresponding id) >> > >> > I have tried using grouping and group format of simple but that takes a >> > long time to execute and sometimes runs into out of memory exception. >> > Another limitation seems to be that total number of groups are not >> > returned. >> > >> > Is there another faster and more efficient way to do this? >> > >> > Thank you >> > >> -- >> Regards, >> Binoy Dalal >> > >
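For reference, a minimal collapse request is just a filter query along these lines; the collection and field names are illustrative, and the collapse field must be single-valued:

    curl "http://localhost:8983/solr/collection1/select" \
         --data-urlencode "q=*:*" \
         --data-urlencode "fq={!collapse field=name}" \
         --data-urlencode "fl=id,name" \
         --data-urlencode "wt=json"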
Enforce client auth in Solr
Hello, I am trying to implement a Solr cluster with mutual authentication using client and server SSL certificates. I have both client and server certificates signed by a CA. The setup is working well; however, any client cert that chains up to the issuer CA is able to access the Solr cluster, without validation against the actual client cert that is added to the server's trust store. Is there any way to enforce validation of the client cert UID and DC on the Solr server, to ensure that only allowed client certs can access Solr? Solr version used - 4.10.3 and 5.4.1 Container used - jetty Thanks in advance. Regards, Gautham
Custom plugin to handle proprietary binary input stream
I'm looking for an option to write a Solr plugin which can deal with a custom binary input stream. Unfortunately Solr's javabin protocol is not an option for us. I already had a look at some possibilities like writing a custom request handler, but it seems like the classes/interfaces one would need to implement are not "generic" enough (e.g. the SolrRequestHandler#handleRequest() method expects objects of the classes SolrQueryRequest and SolrQueryResponse). It would be of great help if you could direct me to any "pluggable" solution which allows receiving and parsing a proprietary binary stream at a Solr server, so that we do not have to provide our own customized binary Solr server. Background: Our problem is that we use a proprietary protocol to transfer our Solr queries together with some other Java objects to our Solr server (at present 3.6). The reason for this is that we have some logic at the Solr server which heavily depends on these other Java objects. Unfortunately we cannot easily shift that logic to the client side. Thank you! Michael
Re: Select distinct records
The CollapsingQParserPlugin shouldn't have duplicates in the result set. Can you provide the details? Joel Bernstein http://joelsolr.blogspot.com/ On Thu, Feb 11, 2016 at 12:02 PM, Brian Narsi wrote: > I have tried to use the Collapsing feature but it appears that it leaves > duplicated records in the result set. > > Is that expected? Or any suggestions on working around it? > > Thanks > > On Thu, Feb 11, 2016 at 9:30 AM, Brian Narsi wrote: > > > I am using > > > > Solr 5.1.0 > > > > On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal > > wrote: > > > >> What version of Solr are you using? > >> Have you taken a look at the Collapsing Query Parser. It basically > >> performs > >> the same functions as grouping but is much more efficient at doing it. > >> Take a look here: > >> > >> > https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results > >> > >> On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi wrote: > >> > >> > I am trying to select distinct records from a collection. (I need > >> distinct > >> > name and corresponding id) > >> > > >> > I have tried using grouping and group format of simple but that takes > a > >> > long time to execute and sometimes runs into out of memory exception. > >> > Another limitation seems to be that total number of groups are not > >> > returned. > >> > > >> > Is there another faster and more efficient way to do this? > >> > > >> > Thank you > >> > > >> -- > >> Regards, > >> Binoy Dalal > >> > > > > >
Re: Select distinct records
Ok I see that Collapsing features requires documents to be co-located in the same shard in SolrCloud. Could that be a reason for duplication? On Thu, Feb 11, 2016 at 11:09 AM, Joel Bernstein wrote: > The CollapsingQParserPlugin shouldn't have duplicates in the result set. > Can you provide the details? > > Joel Bernstein > http://joelsolr.blogspot.com/ > > On Thu, Feb 11, 2016 at 12:02 PM, Brian Narsi wrote: > > > I have tried to use the Collapsing feature but it appears that it leaves > > duplicated records in the result set. > > > > Is that expected? Or any suggestions on working around it? > > > > Thanks > > > > On Thu, Feb 11, 2016 at 9:30 AM, Brian Narsi wrote: > > > > > I am using > > > > > > Solr 5.1.0 > > > > > > On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal > > > wrote: > > > > > >> What version of Solr are you using? > > >> Have you taken a look at the Collapsing Query Parser. It basically > > >> performs > > >> the same functions as grouping but is much more efficient at doing it. > > >> Take a look here: > > >> > > >> > > > https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results > > >> > > >> On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi > wrote: > > >> > > >> > I am trying to select distinct records from a collection. (I need > > >> distinct > > >> > name and corresponding id) > > >> > > > >> > I have tried using grouping and group format of simple but that > takes > > a > > >> > long time to execute and sometimes runs into out of memory > exception. > > >> > Another limitation seems to be that total number of groups are not > > >> > returned. > > >> > > > >> > Is there another faster and more efficient way to do this? > > >> > > > >> > Thank you > > >> > > > >> -- > > >> Regards, > > >> Binoy Dalal > > >> > > > > > > > > >
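If co-location turns out to be the issue, one common approach with the default compositeId router is to prefix the document id with the collapse key (e.g. name!id), so that all documents sharing a name hash to the same shard. A rough sketch with made-up ids and field values; existing documents would need to be re-indexed with the new ids:

    curl "http://localhost:8983/solr/collection1/update?commit=true" \
         -H 'Content-type:application/json' \
         --data-binary '[{"id" : "acme!1", "name" : "acme"},
                         {"id" : "acme!2", "name" : "acme"}]'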
dismax for bigrams and phrases
Hey Solr folks, Current dismax parser behavior is different for unigrams versus bigrams. For unigrams, it's MAX-ed across fields (so called dismax), but for bigrams, it's SUM-ed from Solr 4.10 (according to https://issues.apache.org/jira/browse/SOLR-6062). Given this inconsistency, the dilemma we are facing now is the following: for a query with three terms: [A B C] Relevant doc1: f1:[AB .. C] f2:[BC] // here AB in field1 and BC in field2 are bigrams, and C is a unigram Irrelevant doc2: f1:[AB .. C] f2:[AB] f3:[AB] // here only bigram AB is present in the doc, but in three different fields. (A B C here can be e.g. "light blue bag", and doc2 can talk about "light blue coat" a lot, while only mentioning a "bag" somewhere.) Without bigram level MAX across fields, there is no way to rank doc1 above doc2. (doc1 is preferred because it hits two different bigrams, while doc2 only hits one bigram in several different fields.) Also, being a sum makes the retrieval score difficult to bound, making it hard to combine the retrieval score with other document level signals (e.g. document quality), or to trade off between unigrams and bigrams. Are the problems clear? Can someone offer a solution other than dismax for bigrams/phrases? i.e. https://issues.apache.org/jira/browse/SOLR-6600 ? (SOLR-6600 seems to be misclassified as a duplicate of SOLR-6062, while they seem to be the exact opposite.) Thanks, Le PS cc'ing Jan who pointed me to the group.
Re: slave is getting full synced every polling
What is your replication configuration in solrconfig.xml on both master and slave? bq: big core is doing full sync every time wherever it start (every minute). Do you mean the Solr is restarting every minute or the polling interval is 60 seconds? The Solr logs should tell you something about what's going on there. Also, if you are for some reason optimizing the index that'll cause a full replication. Best, Erick On Thu, Feb 11, 2016 at 8:41 AM, Novin Novin wrote: > Hi Guys, > > I'm having a problem with master slave syncing. > > So I have two cores one is small core (just keep data use frequently for > fast results) and another is big core (for rare query and for search in > every thing). both core has same solrconfig file. But small core > replication is fine, other than this big core is doing full sync every time > wherever it start (every minute). > > I found this > http://stackoverflow.com/questions/6435652/solr-replication-keeps-downloading-entire-index-from-master > > But not really usefull. > > Solr verion 5.2.0 > Small core has doc 10 mil. size around 10 to 15 GB. > Big core has doc greater than 100 mil. size around 25 to 35 GB. > > How can I stop full sync. > > Thanks > Novin
Re: Knowing which doc failed to get added in solr during bulk addition in Solr 5.2
Steven's solution is a very common one, complete to the notion of re-chunking. Depending on the throughput requirements, simply resending the offending packet one at a time is often sufficient (but not _efficient). I can imagine fallback scenarios like "try chunking 100 at a time, for those chunks that fail do 10 at a time and for those do 1 at a time". That said, in a lot of situations, the number of failures is low enough that just falling back to one at a time while not elegant is sufficient It sure will be nice to have SOLR-445 done, if we can just keep Hoss from going crazy before he gets done. Best, Erick On Thu, Feb 11, 2016 at 7:39 AM, Steven White wrote: > For my application, the solution I implemented is I log the chunk that > failed into a file. This file is than post processed one record at a > time. The ones that fail, are reported to the admin and never looked at > again until the admin takes action. This is not the most efficient > solution right now but I intend to refactor this code so that the failed > chunk is itself re-processed in smaller chunks till the chunk with the > failed record(s) is down to 1 record "chunk" that will fail. > > Like Debraj, I would love to hear from others how they handle such failures. > > Steve > > > On Thu, Feb 11, 2016 at 2:29 AM, Debraj Manna > wrote: > >> Thanks Erik. How do people handle this scenario? Right now the only option >> I can think of is to replay the entire batch by doing add for every single >> doc. Then this will give me error for all the docs which got added from the >> batch. >> >> On Tue, Feb 9, 2016 at 10:57 PM, Erick Erickson >> wrote: >> >> > This has been a long standing issue, Hoss is doing some current work on >> it >> > see: >> > https://issues.apache.org/jira/browse/SOLR-445 >> > >> > But the short form is "no, not yet". >> > >> > Best, >> > Erick >> > >> > On Tue, Feb 9, 2016 at 8:19 AM, Debraj Manna >> > wrote: >> > > Hi, >> > > >> > > >> > > >> > > I have a Document Centric Versioning Constraints added in solr schema:- >> > > >> > > >> > > false >> > > doc_version >> > > >> > > >> > > I am adding multiple documents in solr in a single call using SolrJ >> 5.2. >> > > The code fragment looks something like below :- >> > > >> > > >> > > try { >> > > UpdateResponse resp = solrClient.add(docs.getDocCollection(), >> > > 500); >> > > if (resp.getStatus() != 0) { >> > > throw new Exception(new StringBuilder( >> > > "Failed to add docs in solr ").append(resp.toString()) >> > > .toString()); >> > > } >> > > } catch (Exception e) { >> > > logError("Adding docs to solr failed", e); >> > > } >> > > >> > > >> > > If one of the document is violating the versioning constraints then >> Solr >> > is >> > > returning an exception with error message like "user version is not >> high >> > > enough: 1454587156" & the other documents are getting added perfectly. >> Is >> > > there a way I can know which document is violating the constraints >> either >> > > in Solr logs or from the Update response returned by Solr? >> > > >> > > Thanks >> > >>
Re: Size of logs are high
You can also look at your log4j properties file and manipulate the max log size, how many old versions are retained, etc. If you're talking about the console log, people often just disable console logging (again in the logging properties file). Best, Erick On Thu, Feb 11, 2016 at 6:11 AM, Aditya Sundaram wrote: > Can you check your log level? Probably log level of error would suffice for > your purpose and it would most certainly reduce your log size(s). > > On Thu, Feb 11, 2016 at 12:53 PM, kshitij tyagi > wrote: > >> Hi, >> I have migrated to solr 5.2 and the size of logs are high. >> >> Can anyone help me out here how to control this? >> > > > > -- > Aditya Sundaram > Software Engineer, Technology team > AKR Tech park B Block, B1 047 > +91-9844006866
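As a rough sketch, the knobs in question live in server/resources/log4j.properties in Solr 5.x; the values below are illustrative:

    # drop CONSOLE here to disable console logging, and raise the threshold
    log4j.rootLogger=WARN, file
    # cap the size and number of rotated solr.log files
    log4j.appender.file.MaxFileSize=50MB
    log4j.appender.file.MaxBackupIndex=9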
Re: Need to move on SOlr cloud (help required)
bq: We want the hits on solr servers to be distributed True, this happens automatically in SolrCloud, but a simple load balancer in front of master/slave does the same thing. bq: what if master node fail what should be our fail over strategy ? This is, indeed one of the advantages for SolrCloud, you don't have to worry about this any more. Another benefit (and you haven't touched on whether this matters) is that in SolrCloud you do not have the latency of polling and replicating from master to slave, in other words it supports Near Real Time. This comes at some additional complexity however. If you have your master node failing often enough to be a problem, you have other issues ;)... And the recovery strategy if the master fails is straightforward: 1> pick one of the slaves to be the master. 2> update the other nodes to point to the new master 3> re-index the docs from before the old master failed to the new master. You can use system variables to not even have to manually edit all of the solrconfig files, just supply different -D parameters on startup. Best, Erick On Wed, Feb 10, 2016 at 10:39 PM, kshitij tyagi wrote: > @Jack > > Currently we have around 55,00,000 docs > > Its not about load on one node we have load on different nodes at different > times as our traffic is huge around 60k users at a given point of time > > We want the hits on solr servers to be distributed so we are planning to > move on solr cloud as it would be fault tolerant. > > > > On Thu, Feb 11, 2016 at 11:10 AM, Midas A wrote: > >> hi, >> what if master node fail what should be our fail over strategy ? >> >> On Wed, Feb 10, 2016 at 9:12 PM, Jack Krupansky >> wrote: >> >> > What exactly is your motivation? I mean, the primary benefit of SolrCloud >> > is better support for sharding, and you have only a single shard. If you >> > have no need for sharding and your master-slave replicated Solr has been >> > working fine, then stick with it. If only one machine is having a load >> > problem, then that one node should be replaced. There are indeed plenty >> of >> > good reasons to prefer SolrCloud over traditional master-slave >> replication, >> > but so far you haven't touched on any of them. >> > >> > How much data (number of documents) do you have? >> > >> > What is your typical query latency? >> > >> > >> > -- Jack Krupansky >> > >> > On Wed, Feb 10, 2016 at 2:15 AM, kshitij tyagi < >> > kshitij.shopcl...@gmail.com> >> > wrote: >> > >> > > Hi, >> > > >> > > We are currently using solr 5.2 and I need to move on solr cloud >> > > architecture. >> > > >> > > As of now we are using 5 machines : >> > > >> > > 1. I am using 1 master where we are indexing ourdata. >> > > 2. I replicate my data on other machines >> > > >> > > One or the other machine keeps on showing high load so I am planning to >> > > move on solr cloud. >> > > >> > > Need help on following : >> > > >> > > 1. What should be my architecture in case of 5 machines to keep >> > (zookeeper, >> > > shards, core). >> > > >> > > 2. How to add a node. >> > > >> > > 3. what are the exact steps/process I need to follow in order to change >> > to >> > > solr cloud. >> > > >> > > 4. How indexing will work in solr cloud as of now I am using mysql >> query >> > to >> > > get the data on master and then index the same (how I need to change >> this >> > > in case of solr cloud). >> > > >> > > Regards, >> > > Kshitij >> > > >> > >>
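A rough sketch of the system-variable approach Erick mentions, using property substitution in the ReplicationHandler slave config so the master can be chosen per node at startup; the property name repl.master.url and the URLs are made up for illustration:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <!-- supply -Drepl.master.url=http://master2:8983/solr/core1/replication at startup to repoint a slave -->
        <str name="masterUrl">${repl.master.url:http://master1:8983/solr/core1/replication}</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>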
Re: Tune Data Import Handler to retrieve maximum records
It's possible with JDBC settings (see the specific ones for your driver), but dangerous. What if the number of rows is 1B or something? You'll blow Solr's memory out of the water. Best, Erick On Wed, Feb 10, 2016 at 12:45 PM, Troy Edwards wrote: > Is it possible for the Data Import Handler to bring in maximum number of > records depending on available resources? If so, how should it be > configured? > > Thanks,
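For reference, the JDBC setting usually involved here is the fetch/batch size on the DIH data source. A rough sketch for MySQL, where batchSize="-1" makes the driver stream rows instead of buffering the whole result set in memory; the driver, URL and credentials are illustrative:

    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://dbhost:3306/mydb"
                  user="dbuser" password="dbpass"
                  batchSize="-1"/>
      <!-- entities/queries as before -->
    </dataConfig>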
Re: Knowing which doc failed to get added in solr during bulk addition in Solr 5.2
I first wrote the “fall back to one at a time” code for Solr 1.3. It is pretty easy if you plan for it. Make the batch size variable. When a batch fails, retry with a batch size of 1 for that particular batch. Then keep going or fail, either way, you have good logging on which one failed. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 11, 2016, at 10:06 AM, Erick Erickson wrote: > > Steven's solution is a very common one, complete to the > notion of re-chunking. Depending on the throughput requirements, > simply resending the offending packet one at a time is often > sufficient (but not _efficient). I can imagine fallback scenarios > like "try chunking 100 at a time, for those chunks that fail > do 10 at a time and for those do 1 at a time". > > That said, in a lot of situations, the number of failures is low > enough that just falling back to one at a time while not elegant > is sufficient > > It sure will be nice to have SOLR-445 done, if we can just keep > Hoss from going crazy before he gets done. > > Best, > Erick > > On Thu, Feb 11, 2016 at 7:39 AM, Steven White wrote: >> For my application, the solution I implemented is I log the chunk that >> failed into a file. This file is than post processed one record at a >> time. The ones that fail, are reported to the admin and never looked at >> again until the admin takes action. This is not the most efficient >> solution right now but I intend to refactor this code so that the failed >> chunk is itself re-processed in smaller chunks till the chunk with the >> failed record(s) is down to 1 record "chunk" that will fail. >> >> Like Debraj, I would love to hear from others how they handle such failures. >> >> Steve >> >> >> On Thu, Feb 11, 2016 at 2:29 AM, Debraj Manna >> wrote: >> >>> Thanks Erik. How do people handle this scenario? Right now the only option >>> I can think of is to replay the entire batch by doing add for every single >>> doc. Then this will give me error for all the docs which got added from the >>> batch. >>> >>> On Tue, Feb 9, 2016 at 10:57 PM, Erick Erickson >>> wrote: >>> This has been a long standing issue, Hoss is doing some current work on >>> it see: https://issues.apache.org/jira/browse/SOLR-445 But the short form is "no, not yet". Best, Erick On Tue, Feb 9, 2016 at 8:19 AM, Debraj Manna wrote: > Hi, > > > > I have a Document Centric Versioning Constraints added in solr schema:- > > > false > doc_version > > > I am adding multiple documents in solr in a single call using SolrJ >>> 5.2. > The code fragment looks something like below :- > > > try { >UpdateResponse resp = solrClient.add(docs.getDocCollection(), >500); >if (resp.getStatus() != 0) { >throw new Exception(new StringBuilder( >"Failed to add docs in solr ").append(resp.toString()) >.toString()); >} >} catch (Exception e) { >logError("Adding docs to solr failed", e); >} > > > If one of the document is violating the versioning constraints then >>> Solr is > returning an exception with error message like "user version is not >>> high > enough: 1454587156" & the other documents are getting added perfectly. >>> Is > there a way I can know which document is violating the constraints >>> either > in Solr logs or from the Update response returned by Solr? > > Thanks >>>
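A minimal SolrJ sketch of the pattern Walter describes; error handling is simplified and the "id" field name is an assumption:

    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class FallbackIndexer {
        public static void addWithFallback(SolrClient client, List<SolrInputDocument> batch) {
            try {
                client.add(batch);                      // normal path: the whole batch at once
            } catch (Exception batchFailure) {
                for (SolrInputDocument doc : batch) {   // retry path: one document at a time
                    try {
                        client.add(doc);
                    } catch (Exception perDoc) {
                        // now we know exactly which document failed
                        System.err.println("Failed doc id=" + doc.getFieldValue("id") + ": " + perDoc);
                    }
                }
            }
        }
    }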
ApacheCon NA 2016 - Important Dates!!!
Hello everyone! I hope this email finds you well. I hope everyone is as excited about ApacheCon as I am! I'd like to remind you all of a couple of important dates, as well as ask for your assistance in spreading the word! Please use your social media platform(s) to get the word out! The more visibility, the better ApacheCon will be for all!! :)

CFP Close: February 12, 2016
CFP Notifications: February 29, 2016
Schedule Announced: March 3, 2016

To submit a talk, please visit: http://events.linuxfoundation.org/events/apache-big-data-north-america/program/cfp

Link to the main site can be found here: http://events.linuxfoundation.org/events/apache-big-data-north-america

Apache: Big Data North America 2016 Registration Fees:
Attendee Registration Fee: US$599 through March 6, US$799 through April 10, US$999 thereafter
Committer Registration Fee: US$275 through April 10, US$375 thereafter
Student Registration Fee: US$275 through April 10, $375 thereafter

Planning to attend ApacheCon North America 2016 May 11 - 13, 2016? There is an add-on option on the registration form to join the conference for a discounted fee of US$399, available only to Apache: Big Data North America attendees.

So, please tweet away!! I look forward to seeing you in Vancouver! Have a groovy day!!

~Melissa, on behalf of the ApacheCon Team
SolrCloud shard marked as down and "reloading" collection doesnt restore it
Hi, I noticed while running an indexing job (2M docs but per doc size could be 2-3 MB) that one of the shards goes down just after the commit. (Not related to OOM or high cpu/load). This marks the shard as "down" in zk and even a reload of the collection does not recover the state. There are no exceptions in the logs and the stack trace indicates jetty threads in blocked state. The last few lines in the logs are as follows: trib=TOLEADER&wt=javabin&version=2} {add=[1552605 (1525453861590925312)]} 0 5 INFO - 2016-02-06 19:17:47.658; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} INFO - 2016-02-06 19:18:02.209; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2 INFO - 2016-02-06 19:18:02.209; org.apache.solr.core.SolrDeletionPolicy; newest commit generation = 6 INFO - 2016-02-06 19:18:02.233; org.apache.solr.search.SolrIndexSearcher; Opening Searcher@321a0cc9 main INFO - 2016-02-06 19:18:02.296; org.apache.solr.core.QuerySenderListener; QuerySenderListener sending requests to Searcher@321a0cc9 main{StandardDirectoryReader(segments_6:180:nrt _20(4.6):C15155/216:delGen=1 _w(4.6):C1538/63:delGen=2 _16(4.6):C279/20:delGen=2 _e(4.6):C11386/514:delGen=3 _g(4.6):C4434/204:delGen=3 _p(4.6):C418/5:delGen=1 _v(4.6):C1 _x(4.6):C17583/316:delGen=2 _y(4.6):C9783/112:delGen=2 _z(4.6):C4736/47:delGen=2 _12(4.6):C705/2:delGen=1 _13(4.6):C275/4:delGen=1 _1b(4.6):C619 _26(4.6):C318/13:delGen=1 _1e(4.6):C25356/763:delGen=3 _1f(4.6):C13024/426:delGen=2 _1g(4.6):C5368/142:delGen=2 _1j(4.6):C499/16:delGen=2 _1m(4.6):C448/23:delGen=2 _1p(4.6):C236/17:delGen=2 _1k(4.6):C173/5:delGen=1 _1s(4.6):C1082/78:delGen=2 _1t(4.6):C195/17:delGen=2 _1u(4.6):C2 _21(4.6):C16494/1278:delGen=1 _22(4.6):C5193/398:delGen=1 _23(4.6):C1361/102:delGen=1 _24(4.6):C475/36:delGen=1 _29(4.6):C126/11:delGen=1 _2d(4.6):C97/3:delGen=1 _27(4.6):C59/7:delGen=1 _28(4.6):C26/6:delGen=1 _2b(4.6):C40 _25(4.6):C39/1:delGen=1 _2c(4.6):C139/9:delGen=1 _2a(4.6):C26/6:delGen=1)} The only solution is to restart the cluster. Why does a reload not work and is this a known bug (for which there is a patch i can apply)? Any pointers are much appreciated Thanks! Nitin
outlook email file pst extraction problem
Hi, I am currently indexing individual Outlook messages and searching is working fine. I created the Solr core using the following command: ./solr create -c sreenimsg1 -d data_driven_schema_configs I am using the following command to index individual messages: curl " http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true"; -F "myfile=@/home/ec2-user/msg9.msg" This setup is working fine. But the new requirement is to extract the messages from an Outlook PST file. I tried the following command to extract messages from the PST file: curl " http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true"; -F "myfile=@/home/ec2-user/sateamc_0006.pst" This command extracts only the high-level tags and concatenates all messages into one document; I do not get the per-message tags that I get when extracting individual messages. Is the above command correct? Is the problem that no recursion is used? How do I add recursion to the above command? Is it a Tika library problem? Please help solve this problem. Thanks in advance. --sreenivasa kallu
Re: Custom auth plugin not loaded in SolrCloud
yes, runtime lib cannot be used for loading container level plugins yet. Eventually they must. You can open a ticket On Mon, Jan 4, 2016 at 1:07 AM, tine-2 wrote: > Hi, > > are there any news on this? Was anyone able to get it to work? > > Cheers, > > tine > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Custom-auth-plugin-not-loaded-in-SolrCloud-tp4245670p4248340.html > Sent from the Solr - User mailing list archive at Nabble.com. -- - Noble Paul
Re: Select distinct records
Yeah that would be the reason. If you want distributed unique capabilities, then you might want to start testing out 6.0. Aside from SELECT DISTINCT queries, you also have a much more mature Streaming Expression library which supports the unique operation. Joel Bernstein http://joelsolr.blogspot.com/ On Thu, Feb 11, 2016 at 12:28 PM, Brian Narsi wrote: > Ok I see that Collapsing features requires documents to be co-located in > the same shard in SolrCloud. > > Could that be a reason for duplication? > > On Thu, Feb 11, 2016 at 11:09 AM, Joel Bernstein > wrote: > > > The CollapsingQParserPlugin shouldn't have duplicates in the result set. > > Can you provide the details? > > > > Joel Bernstein > > http://joelsolr.blogspot.com/ > > > > On Thu, Feb 11, 2016 at 12:02 PM, Brian Narsi > wrote: > > > > > I have tried to use the Collapsing feature but it appears that it > leaves > > > duplicated records in the result set. > > > > > > Is that expected? Or any suggestions on working around it? > > > > > > Thanks > > > > > > On Thu, Feb 11, 2016 at 9:30 AM, Brian Narsi > wrote: > > > > > > > I am using > > > > > > > > Solr 5.1.0 > > > > > > > > On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal > > > > > wrote: > > > > > > > >> What version of Solr are you using? > > > >> Have you taken a look at the Collapsing Query Parser. It basically > > > >> performs > > > >> the same functions as grouping but is much more efficient at doing > it. > > > >> Take a look here: > > > >> > > > >> > > > > > > https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results > > > >> > > > >> On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi > > wrote: > > > >> > > > >> > I am trying to select distinct records from a collection. (I need > > > >> distinct > > > >> > name and corresponding id) > > > >> > > > > >> > I have tried using grouping and group format of simple but that > > takes > > > a > > > >> > long time to execute and sometimes runs into out of memory > > exception. > > > >> > Another limitation seems to be that total number of groups are not > > > >> > returned. > > > >> > > > > >> > Is there another faster and more efficient way to do this? > > > >> > > > > >> > Thank you > > > >> > > > > >> -- > > > >> Regards, > > > >> Binoy Dalal > > > >> > > > > > > > > > > > > > >
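For reference, a rough sketch of the unique streaming expression in Solr 6; the collection and field names are illustrative, and the inner stream must be sorted on the over field:

    curl "http://localhost:8983/solr/mycollection/stream" \
         --data-urlencode 'expr=unique(search(mycollection,
                                              q="*:*",
                                              fl="name,id",
                                              sort="name asc",
                                              qt="/export"),
                                       over="name")'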
RE: outlook email file pst extraction problem
Y, this looks like a Tika feature. If you run the tika-app.jar [1]on your file and you get the same output, then that's Tika's doing. Drop a note on the u...@tika.apache.org list if Tika isn't meeting your needs. -Original Message- From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.com] Sent: Thursday, February 11, 2016 1:43 PM To: solr-user@lucene.apache.org Subject: outlook email file pst extraction problem Hi , I am currently indexing individual outlook messages and searching is working fine. I have created solr core using following command. ./solr create -c sreenimsg1 -d data_driven_schema_configs I am using following command to index individual messages. curl " http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true"; -F "myfile=@/home/ec2-user/msg9.msg" This setup is working fine. But new requirement is extract messages using outlook pst file. I tried following command to extract messages from outlook pst file. curl " http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true"; -F "myfile=@/home/ec2-user/sateamc_0006.pst" This command extracting only high level tags and extracting all messages into one message. I am not getting all tags when extracted individual messgaes. is above command is correct? is it problem not using recursion? how to add recursion to above command ? is it tika library problem? Please help to solve above problem. Advanced Thanks. --sreenivasa kallu
Re: How is Tika used with Solr
I have found that when you deal with large amounts of all sort of files, in the end you find stuff (pdfs are typically nasty) that will hang tika. That is even worse that a crash or OOM. We used aperture instead of tika because at the time it provided a watchdog feature to kill what seemed like a hanged extracting thread. That feature is super important for a robust text extracting pipeline. Has Tika gained such feature already? xavier On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson wrote: > Timothy's points are absolutely spot-on. In production scenarios, if > you use the simple > "run Tika in a SolrJ program" approach you _must_ abort the program on > OOM errors > and the like and figure out what's going on with the offending > document(s). Or record the > name somewhere and skip it next time 'round. Or > > How much you have to build in here really depends on your use case. > For "small enough" > sets of documents or one-time indexing, you can get by with dealing > with errors one at a time. > For robust systems where you have to have indexing available at all > times and _especially_ > where you don't control the document corpus, you have to build > something far more > tolerant as per Tim's comments. > > FWIW, > Erick > > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. > wrote: > > I completely agree on the impulse, and for the vast majority of the time > (regular catchable exceptions), that'll work. And, by vast majority, aside > from oom on very large files, we aren't seeing these problems any more in > our 3 million doc corpus (y, I know, small by today's standards) from > govdocs1 and Common Crawl over on our Rackspace vm. > > > > Given my focus on Tika, I'm overly sensitive to the worst case > scenarios. I find it encouraging, Erick, that you haven't seen these types > of problems, that users aren't complaining too often about catastrophic > failures of Tika within Solr Cell, and that this thread is not yet swamped > with integrators agreeing with me. :) > > > > However, because oom can leave memory in a corrupted state (right?), > because you can't actually kill a thread for a permanent hang and because > Tika is a kitchen sink and we can't prevent memory leaks in our > dependencies, one needs to be aware that bad things can happen...if only > very, very rarely. For a fellow traveler who has run into these issues on > massive data sets, see also [0]. > > > > Configuring Hadoop to work around these types of problems is not too > difficult -- it has to be done with some thought, though. On conventional > single box setups, the ForkParser within Tika is one option, tika-batch is > another. Hand rolling your own parent/child process is non-trivial and is > not necessary for the vast majority of use cases. > > > > > > [0] > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/ > > > > > > > > -Original Message- > > From: Erick Erickson [mailto:erickerick...@gmail.com] > > Sent: Tuesday, February 09, 2016 10:05 PM > > To: solr-user > > Subject: Re: How is Tika used with Solr > > > > My impulse would be to _not_ run Tika in its own JVM, just catch any > exceptions in my code and "do the right thing". I'm not sure I see any real > benefit in yet another JVM. > > > > FWIW, > > Erick > > > > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. > wrote: > >> I have one answer here [0], but I'd be interested to hear what Solr > users/devs/integrators have experienced on this topic. 
> >> > >> [0] > >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1P > >> R09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlo > >> ok.com%3E > >> > >> -Original Message- > >> From: Steven White [mailto:swhite4...@gmail.com] > >> Sent: Tuesday, February 09, 2016 6:33 PM > >> To: solr-user@lucene.apache.org > >> Subject: Re: How is Tika used with Solr > >> > >> Thank you Erick and Alex. > >> > >> My main question is with a long running process using Tika in the same > JVM as my application. I'm running my file-system-crawler in its own JVM > (not Solr's). On Tika mailing list, it is suggested to run Tika's code in > it's own JVM and invoke it from my file-system-crawler using > Runtime.getRuntime().exec(). > >> > >> I fully understand from Alex suggestion and link provided by Erick to > use Tika outside Solr. But what about using Tika within the same JVM as my > file-system-crawler application or should I be making a system call to > invoke another JAR, that runs in its own JVM to extract the raw text? Are > there known issues with Tika when used in a long running process? > >> > >> Steve > >> > >> >
RE: How is Tika used with Solr
x-post to Tika user's Y and n. If you run tika app as: java -jar tika-app.jar It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302). This creates a parent and child process, if the child process notices a hung thread, it dies, and the parent restarts it. Or if your OS gets upset with the child process and kills it out of self preservation, the parent restarts the child, or if there's an OOM...and you can configure how often the child shuts itself down (with parental restarting) to mitigate memory leaks. So, y, if your use case allows , then we now have that in Tika. I've been wanting to add a similar watchdog to tika-server ... any interest in that? -Original Message- From: xavi jmlucjav [mailto:jmluc...@gmail.com] Sent: Thursday, February 11, 2016 2:16 PM To: solr-user Subject: Re: How is Tika used with Solr I have found that when you deal with large amounts of all sort of files, in the end you find stuff (pdfs are typically nasty) that will hang tika. That is even worse that a crash or OOM. We used aperture instead of tika because at the time it provided a watchdog feature to kill what seemed like a hanged extracting thread. That feature is super important for a robust text extracting pipeline. Has Tika gained such feature already? xavier On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson wrote: > Timothy's points are absolutely spot-on. In production scenarios, if > you use the simple "run Tika in a SolrJ program" approach you _must_ > abort the program on OOM errors and the like and figure out what's > going on with the offending document(s). Or record the name somewhere > and skip it next time 'round. Or > > How much you have to build in here really depends on your use case. > For "small enough" > sets of documents or one-time indexing, you can get by with dealing > with errors one at a time. > For robust systems where you have to have indexing available at all > times and _especially_ where you don't control the document corpus, > you have to build something far more tolerant as per Tim's comments. > > FWIW, > Erick > > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. > > wrote: > > I completely agree on the impulse, and for the vast majority of the > > time > (regular catchable exceptions), that'll work. And, by vast majority, > aside from oom on very large files, we aren't seeing these problems > any more in our 3 million doc corpus (y, I know, small by today's > standards) from > govdocs1 and Common Crawl over on our Rackspace vm. > > > > Given my focus on Tika, I'm overly sensitive to the worst case > scenarios. I find it encouraging, Erick, that you haven't seen these > types of problems, that users aren't complaining too often about > catastrophic failures of Tika within Solr Cell, and that this thread > is not yet swamped with integrators agreeing with me. :) > > > > However, because oom can leave memory in a corrupted state (right?), > because you can't actually kill a thread for a permanent hang and > because Tika is a kitchen sink and we can't prevent memory leaks in > our dependencies, one needs to be aware that bad things can > happen...if only very, very rarely. For a fellow traveler who has run > into these issues on massive data sets, see also [0]. > > > > Configuring Hadoop to work around these types of problems is not too > difficult -- it has to be done with some thought, though. On > conventional single box setups, the ForkParser within Tika is one > option, tika-batch is another. 
Hand rolling your own parent/child > process is non-trivial and is not necessary for the vast majority of use > cases. > > > > > > [0] > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w > eb-content-nanite/ > > > > > > > > -Original Message- > > From: Erick Erickson [mailto:erickerick...@gmail.com] > > Sent: Tuesday, February 09, 2016 10:05 PM > > To: solr-user > > Subject: Re: How is Tika used with Solr > > > > My impulse would be to _not_ run Tika in its own JVM, just catch any > exceptions in my code and "do the right thing". I'm not sure I see any > real benefit in yet another JVM. > > > > FWIW, > > Erick > > > > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. > > > wrote: > >> I have one answer here [0], but I'd be interested to hear what Solr > users/devs/integrators have experienced on this topic. > >> > >> [0] > >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CC > >> Y1P > >> R09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.ou > >> tlo > >> ok.com%3E > >> > >> -Original Message- > >> From: Steven White [mailto:swhite4...@gmail.com] > >> Sent: Tuesday, February 09, 2016 6:33 PM > >> To: solr-user@lucene.apache.org > >> Subject: Re: How is Tika used with Solr > >> > >> Thank you Erick and Alex. > >> > >> My main question is with a long running process using Tika in the > >> same > JVM as my application. I'm running my file-s
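For anyone trying this, tika-batch mode in tika-app is driven by input and output directory options along the lines below; the exact flags may vary by Tika version, so check the tool's usage output:

    java -jar tika-app.jar -i /path/to/input_dir -o /path/to/output_dir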
RE: outlook email file pst extraction problem
Should have looked at how we handle psts before earlier responsesorry. What you're seeing is Tika's default treatment of embedded documents, it concatenates them all into one string. It'll do the same thing for zip files and other container files. The default Tika format is xhtml, and we include tags that show you where the attachments are. If the tags are stripped, then you only get a big blob of text, which is often all that's necessary for search. Before SOLR-7189, you wouldn't have gotten any content, so that's progress...right? Some options for now: 1) use java-libpst as a preprocessing step to extract contents from your psts before you ingest them in Solr (feel free to borrow code from our OutlookPSTParser). 2) use tika from the commandline with the -J -t options to get a Json representation of the overall file, which includes a list of maps, where each map represents a single embedded file. Again, if you have any questions on this, head over to u...@tika.apache.org I think what you want is something along the lines of SOLR-7229, which would treat each embedded document as its own document. That issue is not resolved, and there's currently no way of doing this within DIH that I'm aware of. If others on this list have an interest in SOLR-7229, let me know, and I'll try to find some time. I'd need feedback on some design decisions. -Original Message- From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.com] Sent: Thursday, February 11, 2016 1:43 PM To: solr-user@lucene.apache.org Subject: outlook email file pst extraction problem Hi , I am currently indexing individual outlook messages and searching is working fine. I have created solr core using following command. ./solr create -c sreenimsg1 -d data_driven_schema_configs I am using following command to index individual messages. curl " http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true"; -F "myfile=@/home/ec2-user/msg9.msg" This setup is working fine. But new requirement is extract messages using outlook pst file. I tried following command to extract messages from outlook pst file. curl " http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true"; -F "myfile=@/home/ec2-user/sateamc_0006.pst" This command extracting only high level tags and extracting all messages into one message. I am not getting all tags when extracted individual messgaes. is above command is correct? is it problem not using recursion? how to add recursion to above command ? is it tika library problem? Please help to solve above problem. Advanced Thanks. --sreenivasa kallu
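A rough sketch of that -J -t command against the PST from this thread; the output path is illustrative:

    java -jar tika-app.jar -J -t /home/ec2-user/sateamc_0006.pst > sateamc_0006.json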
Re: slave is getting full synced every polling
Hi Erick, Below is master slave config: Master: commit optimize 2 Slave: http://master:8983/solr/big_core/replication 00:00:60 username password Do you mean the Solr is restarting every minute or the polling interval is 60 seconds? I meant polling is 60 minutes I didn't not see any suspicious in logs , and I'm not optimizing any thing with commit. Thanks Novin On Thu, 11 Feb 2016 at 18:02 Erick Erickson wrote: > What is your replication configuration in solrconfig.xml on both > master and slave? > > bq: big core is doing full sync every time wherever it start (every > minute). > > Do you mean the Solr is restarting every minute or the polling > interval is 60 seconds? > > The Solr logs should tell you something about what's going on there. > Also, if you are for > some reason optimizing the index that'll cause a full replication. > > Best, > Erick > > On Thu, Feb 11, 2016 at 8:41 AM, Novin Novin wrote: > > Hi Guys, > > > > I'm having a problem with master slave syncing. > > > > So I have two cores one is small core (just keep data use frequently for > > fast results) and another is big core (for rare query and for search in > > every thing). both core has same solrconfig file. But small core > > replication is fine, other than this big core is doing full sync every > time > > wherever it start (every minute). > > > > I found this > > > http://stackoverflow.com/questions/6435652/solr-replication-keeps-downloading-entire-index-from-master > > > > But not really usefull. > > > > Solr verion 5.2.0 > > Small core has doc 10 mil. size around 10 to 15 GB. > > Big core has doc greater than 100 mil. size around 25 to 35 GB. > > > > How can I stop full sync. > > > > Thanks > > Novin >
Re: slave is getting full synced every polling
Typo? That's 60 seconds, but that's not especially interesting either way. Do the actual segment's look identical after the polling? On Thu, Feb 11, 2016 at 1:16 PM, Novin Novin wrote: > Hi Erick, > > Below is master slave config: > > Master: > > > commit > optimize > > 2 > > > Slave: > > > > http://master:8983/solr/big_core/replication > > 00:00:60 > username > password > > > > > Do you mean the Solr is restarting every minute or the polling > interval is 60 seconds? > > I meant polling is 60 minutes > > I didn't not see any suspicious in logs , and I'm not optimizing any thing > with commit. > > Thanks > Novin > > On Thu, 11 Feb 2016 at 18:02 Erick Erickson wrote: > >> What is your replication configuration in solrconfig.xml on both >> master and slave? >> >> bq: big core is doing full sync every time wherever it start (every >> minute). >> >> Do you mean the Solr is restarting every minute or the polling >> interval is 60 seconds? >> >> The Solr logs should tell you something about what's going on there. >> Also, if you are for >> some reason optimizing the index that'll cause a full replication. >> >> Best, >> Erick >> >> On Thu, Feb 11, 2016 at 8:41 AM, Novin Novin wrote: >> > Hi Guys, >> > >> > I'm having a problem with master slave syncing. >> > >> > So I have two cores one is small core (just keep data use frequently for >> > fast results) and another is big core (for rare query and for search in >> > every thing). both core has same solrconfig file. But small core >> > replication is fine, other than this big core is doing full sync every >> time >> > wherever it start (every minute). >> > >> > I found this >> > >> http://stackoverflow.com/questions/6435652/solr-replication-keeps-downloading-entire-index-from-master >> > >> > But not really usefull. >> > >> > Solr verion 5.2.0 >> > Small core has doc 10 mil. size around 10 to 15 GB. >> > Big core has doc greater than 100 mil. size around 25 to 35 GB. >> > >> > How can I stop full sync. >> > >> > Thanks >> > Novin >>
edismax query parser - pf field question
Clarification needed on edismax query parser "pf" field. *SOLR Query:* /query?q=refrigerator water filter&qf=P_NAME^1.5 CategoryName&wt=xml&debugQuery=on&pf=P_NAME CategoryName&mm=2&fl=CategoryName P_NAME score&defType=edismax *Parsed Query from DebugQuery results:* (+((DisjunctionMaxQuery((P_NAME:refriger^1.5 | CategoryName:refrigerator)) DisjunctionMaxQuery((P_NAME:water^1.5 | CategoryName:water)) DisjunctionMaxQuery((P_NAME:filter^1.5 | CategoryName:filter)))~2) DisjunctionMaxQuery((P_NAME:"refriger water filter")))/no_coord In the SOLR query given above, I am asking for phrase matches on 2 fields: P_NAME and CategoryName. But If you notice ParsedQuery, I see Phrase match is applied only on P_NAME field but not on CategoryName field. Why? -- View this message in context: http://lucene.472066.n3.nabble.com/edismax-query-parser-pf-field-question-tp4256845.html Sent from the Solr - User mailing list archive at Nabble.com.
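One low-effort check before digging into the schema: send the same request with the space-separated parameters explicitly URL-encoded and debugQuery on, and see whether both fields then appear in the pf-generated phrase DisjunctionMaxQuery. A sketch only; the host, core name ("mycore") and handler path are placeholders, not taken from the original post:

curl -G "http://localhost:8983/solr/mycore/query" \
     --data-urlencode "q=refrigerator water filter" \
     --data-urlencode "defType=edismax" \
     --data-urlencode "qf=P_NAME^1.5 CategoryName" \
     --data-urlencode "pf=P_NAME CategoryName" \
     --data-urlencode "mm=2" \
     --data-urlencode "fl=CategoryName P_NAME score" \
     --data-urlencode "debugQuery=true" \
     --data-urlencode "wt=xml"

If CategoryName is still missing from the phrase clause, comparing the query-time analysis of the two fields in the admin Analysis screen is the next thing to look at, since edismax builds the pf phrase for each field from that field's own analysis chain.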
Re: How is Tika used with Solr
Tim, In my case, I have to use Tika as follows: java -jar tika-app.jar -t I will be invoking the above command from my Java app using Runtime.getRuntime().exec(). I will capture stdout and stderr to get back the raw text i need. My app use case will not allow me to use a , it is out of the question. Reading your summary, it looks like I won't get this watch-dog monitoring and thus I have to implement my own. Can you confirm? Thanks Steve On Thu, Feb 11, 2016 at 2:45 PM, Allison, Timothy B. wrote: > x-post to Tika user's > > Y and n. If you run tika app as: > > java -jar tika-app.jar > > It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302). This > creates a parent and child process, if the child process notices a hung > thread, it dies, and the parent restarts it. Or if your OS gets upset with > the child process and kills it out of self preservation, the parent > restarts the child, or if there's an OOM...and you can configure how often > the child shuts itself down (with parental restarting) to mitigate memory > leaks. > > So, y, if your use case allows , then we now have > that in Tika. > > I've been wanting to add a similar watchdog to tika-server ... any > interest in that? > > > -Original Message- > From: xavi jmlucjav [mailto:jmluc...@gmail.com] > Sent: Thursday, February 11, 2016 2:16 PM > To: solr-user > Subject: Re: How is Tika used with Solr > > I have found that when you deal with large amounts of all sort of files, > in the end you find stuff (pdfs are typically nasty) that will hang tika. > That is even worse that a crash or OOM. > We used aperture instead of tika because at the time it provided a > watchdog feature to kill what seemed like a hanged extracting thread. That > feature is super important for a robust text extracting pipeline. Has Tika > gained such feature already? > > xavier > > On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson > wrote: > > > Timothy's points are absolutely spot-on. In production scenarios, if > > you use the simple "run Tika in a SolrJ program" approach you _must_ > > abort the program on OOM errors and the like and figure out what's > > going on with the offending document(s). Or record the name somewhere > > and skip it next time 'round. Or > > > > How much you have to build in here really depends on your use case. > > For "small enough" > > sets of documents or one-time indexing, you can get by with dealing > > with errors one at a time. > > For robust systems where you have to have indexing available at all > > times and _especially_ where you don't control the document corpus, > > you have to build something far more tolerant as per Tim's comments. > > > > FWIW, > > Erick > > > > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. > > > > wrote: > > > I completely agree on the impulse, and for the vast majority of the > > > time > > (regular catchable exceptions), that'll work. And, by vast majority, > > aside from oom on very large files, we aren't seeing these problems > > any more in our 3 million doc corpus (y, I know, small by today's > > standards) from > > govdocs1 and Common Crawl over on our Rackspace vm. > > > > > > Given my focus on Tika, I'm overly sensitive to the worst case > > scenarios. I find it encouraging, Erick, that you haven't seen these > > types of problems, that users aren't complaining too often about > > catastrophic failures of Tika within Solr Cell, and that this thread > > is not yet swamped with integrators agreeing with me. 
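Since the plan here is to shell out to tika-app per file from Java, a home-grown watchdog can be as small as running the extraction in a child process and giving it a deadline. The following is only a sketch under assumptions (Java 8+, a placeholder jar path, a 120-second timeout, and UTF-8 output); it is not Tika's own watchdog, just the parent-kills-hung-child pattern Tim describes:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

public class TikaWatchdog {
    // Jar location and timeout are assumptions for this sketch.
    private static final String TIKA_APP_JAR = "/opt/tika/tika-app.jar";
    private static final long TIMEOUT_SECONDS = 120;

    /** Runs "java -jar tika-app.jar -t <file>" and returns the extracted text, or null on timeout/failure. */
    public static String extractText(Path file) throws IOException, InterruptedException {
        Path out = Files.createTempFile("tika-out", ".txt");
        Path err = Files.createTempFile("tika-err", ".txt");
        Process p = new ProcessBuilder("java", "-jar", TIKA_APP_JAR, "-t", file.toString())
                .redirectOutput(out.toFile())
                .redirectError(err.toFile())
                .start();
        try {
            if (!p.waitFor(TIMEOUT_SECONDS, TimeUnit.SECONDS)) {
                p.destroyForcibly();      // hung or runaway child: kill it and move on
                return null;
            }
            if (p.exitValue() != 0) {     // crash, OOM in the child, etc.
                return null;
            }
            // Assumes the child writes UTF-8; adjust to your environment.
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
            Files.deleteIfExists(err);
        }
    }
}

Redirecting stdout/stderr to temp files instead of reading the streams in the parent thread also avoids the classic Runtime.exec() deadlock when the child fills its output buffers.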
:) > > > > > > However, because oom can leave memory in a corrupted state (right?), > > because you can't actually kill a thread for a permanent hang and > > because Tika is a kitchen sink and we can't prevent memory leaks in > > our dependencies, one needs to be aware that bad things can > > happen...if only very, very rarely. For a fellow traveler who has run > > into these issues on massive data sets, see also [0]. > > > > > > Configuring Hadoop to work around these types of problems is not too > > difficult -- it has to be done with some thought, though. On > > conventional single box setups, the ForkParser within Tika is one > > option, tika-batch is another. Hand rolling your own parent/child > > process is non-trivial and is not necessary for the vast majority of use > cases. > > > > > > > > > [0] > > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w > > eb-content-nanite/ > > > > > > > > > > > > -Original Message- > > > From: Erick Erickson [mailto:erickerick...@gmail.com] > > > Sent: Tuesday, February 09, 2016 10:05 PM > > > To: solr-user > > > Subject: Re: How is Tika used with Solr > > > > > > My impulse would be to _not_ run Tika in its own JVM, just catch any > > exceptions in my code and "do the right thing". I'm not sure I see any > > real benefit in yet another JVM. > > > > > > FWIW, > > > Erick > > > > > > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. > > > > > wrote: > > >> I have one answ
Re: How is Tika used with Solr
For sure, if I need heavy duty text extraction again, Tika would be the obvious choice if it covers dealing with hangs. I never used tika-server myself (not sure if it existed at the time) just used tika from my own jvm. On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B. wrote: > x-post to Tika user's > > Y and n. If you run tika app as: > > java -jar tika-app.jar > > It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302). This > creates a parent and child process, if the child process notices a hung > thread, it dies, and the parent restarts it. Or if your OS gets upset with > the child process and kills it out of self preservation, the parent > restarts the child, or if there's an OOM...and you can configure how often > the child shuts itself down (with parental restarting) to mitigate memory > leaks. > > So, y, if your use case allows , then we now have > that in Tika. > > I've been wanting to add a similar watchdog to tika-server ... any > interest in that? > > > -Original Message- > From: xavi jmlucjav [mailto:jmluc...@gmail.com] > Sent: Thursday, February 11, 2016 2:16 PM > To: solr-user > Subject: Re: How is Tika used with Solr > > I have found that when you deal with large amounts of all sort of files, > in the end you find stuff (pdfs are typically nasty) that will hang tika. > That is even worse that a crash or OOM. > We used aperture instead of tika because at the time it provided a > watchdog feature to kill what seemed like a hanged extracting thread. That > feature is super important for a robust text extracting pipeline. Has Tika > gained such feature already? > > xavier > > On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson > wrote: > > > Timothy's points are absolutely spot-on. In production scenarios, if > > you use the simple "run Tika in a SolrJ program" approach you _must_ > > abort the program on OOM errors and the like and figure out what's > > going on with the offending document(s). Or record the name somewhere > > and skip it next time 'round. Or > > > > How much you have to build in here really depends on your use case. > > For "small enough" > > sets of documents or one-time indexing, you can get by with dealing > > with errors one at a time. > > For robust systems where you have to have indexing available at all > > times and _especially_ where you don't control the document corpus, > > you have to build something far more tolerant as per Tim's comments. > > > > FWIW, > > Erick > > > > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. > > > > wrote: > > > I completely agree on the impulse, and for the vast majority of the > > > time > > (regular catchable exceptions), that'll work. And, by vast majority, > > aside from oom on very large files, we aren't seeing these problems > > any more in our 3 million doc corpus (y, I know, small by today's > > standards) from > > govdocs1 and Common Crawl over on our Rackspace vm. > > > > > > Given my focus on Tika, I'm overly sensitive to the worst case > > scenarios. I find it encouraging, Erick, that you haven't seen these > > types of problems, that users aren't complaining too often about > > catastrophic failures of Tika within Solr Cell, and that this thread > > is not yet swamped with integrators agreeing with me. 
:) > > > > > > However, because oom can leave memory in a corrupted state (right?), > > because you can't actually kill a thread for a permanent hang and > > because Tika is a kitchen sink and we can't prevent memory leaks in > > our dependencies, one needs to be aware that bad things can > > happen...if only very, very rarely. For a fellow traveler who has run > > into these issues on massive data sets, see also [0]. > > > > > > Configuring Hadoop to work around these types of problems is not too > > difficult -- it has to be done with some thought, though. On > > conventional single box setups, the ForkParser within Tika is one > > option, tika-batch is another. Hand rolling your own parent/child > > process is non-trivial and is not necessary for the vast majority of use > cases. > > > > > > > > > [0] > > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w > > eb-content-nanite/ > > > > > > > > > > > > -Original Message- > > > From: Erick Erickson [mailto:erickerick...@gmail.com] > > > Sent: Tuesday, February 09, 2016 10:05 PM > > > To: solr-user > > > Subject: Re: How is Tika used with Solr > > > > > > My impulse would be to _not_ run Tika in its own JVM, just catch any > > exceptions in my code and "do the right thing". I'm not sure I see any > > real benefit in yet another JVM. > > > > > > FWIW, > > > Erick > > > > > > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. > > > > > wrote: > > >> I have one answer here [0], but I'd be interested to hear what Solr > > users/devs/integrators have experienced on this topic. > > >> > > >> [0] > > >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CC > > >> Y1P > > >> R09MB0795EAED9
Re: Select distinct records
In order to use the Collapsing feature I will need to use Document Routing to co-locate related documents in the same shard in SolrCloud. What are the advantages and disadvantages of Document Routing? Thanks, On Thu, Feb 11, 2016 at 12:54 PM, Joel Bernstein wrote: > Yeah that would be the reason. If you want distributed unique capabilities, > then you might want to start testing out 6.0. Aside from SELECT DISTINCT > queries, you also have a much more mature Streaming Expression library > which supports the unique operation. > > Joel Bernstein > http://joelsolr.blogspot.com/ > > On Thu, Feb 11, 2016 at 12:28 PM, Brian Narsi wrote: > > > Ok I see that Collapsing features requires documents to be co-located in > > the same shard in SolrCloud. > > > > Could that be a reason for duplication? > > > > On Thu, Feb 11, 2016 at 11:09 AM, Joel Bernstein > > wrote: > > > > > The CollapsingQParserPlugin shouldn't have duplicates in the result > set. > > > Can you provide the details? > > > > > > Joel Bernstein > > > http://joelsolr.blogspot.com/ > > > > > > On Thu, Feb 11, 2016 at 12:02 PM, Brian Narsi > > wrote: > > > > > > > I have tried to use the Collapsing feature but it appears that it > > leaves > > > > duplicated records in the result set. > > > > > > > > Is that expected? Or any suggestions on working around it? > > > > > > > > Thanks > > > > > > > > On Thu, Feb 11, 2016 at 9:30 AM, Brian Narsi > > wrote: > > > > > > > > > I am using > > > > > > > > > > Solr 5.1.0 > > > > > > > > > > On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal < > binoydala...@gmail.com > > > > > > > > wrote: > > > > > > > > > >> What version of Solr are you using? > > > > >> Have you taken a look at the Collapsing Query Parser. It basically > > > > >> performs > > > > >> the same functions as grouping but is much more efficient at doing > > it. > > > > >> Take a look here: > > > > >> > > > > >> > > > > > > > > > > https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results > > > > >> > > > > >> On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi > > > wrote: > > > > >> > > > > >> > I am trying to select distinct records from a collection. (I > need > > > > >> distinct > > > > >> > name and corresponding id) > > > > >> > > > > > >> > I have tried using grouping and group format of simple but that > > > takes > > > > a > > > > >> > long time to execute and sometimes runs into out of memory > > > exception. > > > > >> > Another limitation seems to be that total number of groups are > not > > > > >> > returned. > > > > >> > > > > > >> > Is there another faster and more efficient way to do this? > > > > >> > > > > > >> > Thank you > > > > >> > > > > > >> -- > > > > >> Regards, > > > > >> Binoy Dalal > > > > >> > > > > > > > > > > > > > > > > > > > >
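For anyone following along, co-location with the default compositeId router just means prefixing the uniqueKey with a shared route key (for collapsing, the value of the field being collapsed on, or something derived from it). A rough sketch with made-up collection, ids and field names:

# index: ids sharing the "acme!" prefix are routed to the same shard
curl -X POST -H 'Content-type:application/json' \
  'http://localhost:8983/solr/mycollection/update?commit=true' --data-binary '
[
  {"id": "acme!1", "name_s": "acme", "price_f": 10.0},
  {"id": "acme!2", "name_s": "acme", "price_f": 12.5}
]'

# query: optionally target only that route's shard(s) with _route_
curl -g 'http://localhost:8983/solr/mycollection/select?q=*:*&_route_=acme!&fq={!collapse field=name_s}'

The usual trade-off raised on the list: routing guarantees the co-location that collapsing (and joins) need, but if one route key holds far more documents than the others the shards become unevenly sized, so the key should be reasonably well distributed.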
Re: SolrCloud shard marked as down and "reloading" collection doesnt restore it
After more debugging, I figured out that it is related to this: https://issues.apache.org/jira/browse/SOLR-3274 Is there a recommended fix (apart from running a zk ensemble?) On Thu, Feb 11, 2016 at 10:29 AM, KNitin wrote: > Hi, > > I noticed while running an indexing job (2M docs but per doc size could > be 2-3 MB) that one of the shards goes down just after the commit. (Not > related to OOM or high cpu/load). This marks the shard as "down" in zk and > even a reload of the collection does not recover the state. > > There are no exceptions in the logs and the stack trace indicates jetty > threads in blocked state. > > The last few lines in the logs are as follows: > > trib=TOLEADER&wt=javabin&version=2} {add=[1552605 (1525453861590925312)]} > 0 5 > INFO - 2016-02-06 19:17:47.658; > org.apache.solr.update.DirectUpdateHandler2; start > commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} > INFO - 2016-02-06 19:18:02.209; org.apache.solr.core.SolrDeletionPolicy; > SolrDeletionPolicy.onCommit: commits: num=2 > INFO - 2016-02-06 19:18:02.209; org.apache.solr.core.SolrDeletionPolicy; > newest commit generation = 6 > INFO - 2016-02-06 19:18:02.233; org.apache.solr.search.SolrIndexSearcher; > Opening Searcher@321a0cc9 main > INFO - 2016-02-06 19:18:02.296; org.apache.solr.core.QuerySenderListener; > QuerySenderListener sending requests to Searcher@321a0cc9 > main{StandardDirectoryReader(segments_6:180:nrt > _20(4.6):C15155/216:delGen=1 _w(4.6):C1538/63:delGen=2 > _16(4.6):C279/20:delGen=2 _e(4.6):C11386/514:delGen=3 > _g(4.6):C4434/204:delGen=3 _p(4.6):C418/5:delGen=1 _v(4.6):C1 > _x(4.6):C17583/316:delGen=2 _y(4.6):C9783/112:delGen=2 > _z(4.6):C4736/47:delGen=2 _12(4.6):C705/2:delGen=1 _13(4.6):C275/4:delGen=1 > _1b(4.6):C619 _26(4.6):C318/13:delGen=1 _1e(4.6):C25356/763:delGen=3 > _1f(4.6):C13024/426:delGen=2 _1g(4.6):C5368/142:delGen=2 > _1j(4.6):C499/16:delGen=2 _1m(4.6):C448/23:delGen=2 > _1p(4.6):C236/17:delGen=2 _1k(4.6):C173/5:delGen=1 > _1s(4.6):C1082/78:delGen=2 _1t(4.6):C195/17:delGen=2 _1u(4.6):C2 > _21(4.6):C16494/1278:delGen=1 _22(4.6):C5193/398:delGen=1 > _23(4.6):C1361/102:delGen=1 _24(4.6):C475/36:delGen=1 > _29(4.6):C126/11:delGen=1 _2d(4.6):C97/3:delGen=1 _27(4.6):C59/7:delGen=1 > _28(4.6):C26/6:delGen=1 _2b(4.6):C40 _25(4.6):C39/1:delGen=1 > _2c(4.6):C139/9:delGen=1 _2a(4.6):C26/6:delGen=1)} > > > The only solution is to restart the cluster. Why does a reload not work > and is this a known bug (for which there is a patch i can apply)? > > Any pointers are much appreciated > > Thanks! > Nitin >
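Independent of that JIRA, when a shard gets marked down right around a big commit the usual suspect is a ZooKeeper session expiring during a long GC pause, and the mitigation most often suggested is raising the ZK client timeout (and taming GC). A sketch of where that knob lives; the 30-second value is illustrative, not a recommendation, and the ZooKeeper server's maxSessionTimeout has to be large enough to allow it:

<!-- solr.xml (new-style, Solr 4.4 and later) -->
<solrcloud>
  <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
</solrcloud>

# or, with the 5.x start scripts, in bin/solr.in.sh
ZK_CLIENT_TIMEOUT="30000"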
Re: How is Tika used with Solr
Well, I'd imagine you could spawn threads and monitor/kill them as necessary, although that doesn't deal with OOM errors FWIW, Erick On Thu, Feb 11, 2016 at 3:08 PM, xavi jmlucjav wrote: > For sure, if I need heavy duty text extraction again, Tika would be the > obvious choice if it covers dealing with hangs. I never used tika-server > myself (not sure if it existed at the time) just used tika from my own jvm. > > On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B. > wrote: > >> x-post to Tika user's >> >> Y and n. If you run tika app as: >> >> java -jar tika-app.jar >> >> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302). This >> creates a parent and child process, if the child process notices a hung >> thread, it dies, and the parent restarts it. Or if your OS gets upset with >> the child process and kills it out of self preservation, the parent >> restarts the child, or if there's an OOM...and you can configure how often >> the child shuts itself down (with parental restarting) to mitigate memory >> leaks. >> >> So, y, if your use case allows , then we now have >> that in Tika. >> >> I've been wanting to add a similar watchdog to tika-server ... any >> interest in that? >> >> >> -Original Message- >> From: xavi jmlucjav [mailto:jmluc...@gmail.com] >> Sent: Thursday, February 11, 2016 2:16 PM >> To: solr-user >> Subject: Re: How is Tika used with Solr >> >> I have found that when you deal with large amounts of all sort of files, >> in the end you find stuff (pdfs are typically nasty) that will hang tika. >> That is even worse that a crash or OOM. >> We used aperture instead of tika because at the time it provided a >> watchdog feature to kill what seemed like a hanged extracting thread. That >> feature is super important for a robust text extracting pipeline. Has Tika >> gained such feature already? >> >> xavier >> >> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson >> wrote: >> >> > Timothy's points are absolutely spot-on. In production scenarios, if >> > you use the simple "run Tika in a SolrJ program" approach you _must_ >> > abort the program on OOM errors and the like and figure out what's >> > going on with the offending document(s). Or record the name somewhere >> > and skip it next time 'round. Or >> > >> > How much you have to build in here really depends on your use case. >> > For "small enough" >> > sets of documents or one-time indexing, you can get by with dealing >> > with errors one at a time. >> > For robust systems where you have to have indexing available at all >> > times and _especially_ where you don't control the document corpus, >> > you have to build something far more tolerant as per Tim's comments. >> > >> > FWIW, >> > Erick >> > >> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. >> > >> > wrote: >> > > I completely agree on the impulse, and for the vast majority of the >> > > time >> > (regular catchable exceptions), that'll work. And, by vast majority, >> > aside from oom on very large files, we aren't seeing these problems >> > any more in our 3 million doc corpus (y, I know, small by today's >> > standards) from >> > govdocs1 and Common Crawl over on our Rackspace vm. >> > > >> > > Given my focus on Tika, I'm overly sensitive to the worst case >> > scenarios. I find it encouraging, Erick, that you haven't seen these >> > types of problems, that users aren't complaining too often about >> > catastrophic failures of Tika within Solr Cell, and that this thread >> > is not yet swamped with integrators agreeing with me. 
:) >> > > >> > > However, because oom can leave memory in a corrupted state (right?), >> > because you can't actually kill a thread for a permanent hang and >> > because Tika is a kitchen sink and we can't prevent memory leaks in >> > our dependencies, one needs to be aware that bad things can >> > happen...if only very, very rarely. For a fellow traveler who has run >> > into these issues on massive data sets, see also [0]. >> > > >> > > Configuring Hadoop to work around these types of problems is not too >> > difficult -- it has to be done with some thought, though. On >> > conventional single box setups, the ForkParser within Tika is one >> > option, tika-batch is another. Hand rolling your own parent/child >> > process is non-trivial and is not necessary for the vast majority of use >> cases. >> > > >> > > >> > > [0] >> > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w >> > eb-content-nanite/ >> > > >> > > >> > > >> > > -Original Message- >> > > From: Erick Erickson [mailto:erickerick...@gmail.com] >> > > Sent: Tuesday, February 09, 2016 10:05 PM >> > > To: solr-user >> > > Subject: Re: How is Tika used with Solr >> > > >> > > My impulse would be to _not_ run Tika in its own JVM, just catch any >> > exceptions in my code and "do the right thing". I'm not sure I see any >> > real benefit in yet another JVM. >> > > >> > > FWIW, >> > > Erick >> > > >> > > On Tue, Feb 9,
Re: edismax query parser - pf field question
Try comma instead of space delimiting? On Thu, Feb 11, 2016 at 2:33 PM, Senthil wrote: > Clarification needed on edismax query parser "pf" field. > > *SOLR Query:* > /query?q=refrigerator water filter&qf=P_NAME^1.5 > CategoryName&wt=xml&debugQuery=on&pf=P_NAME > CategoryName&mm=2&fl=CategoryName P_NAME score&defType=edismax > > *Parsed Query from DebugQuery results:* > (+((DisjunctionMaxQuery((P_NAME:refriger^1.5 | > CategoryName:refrigerator)) DisjunctionMaxQuery((P_NAME:water^1.5 | > CategoryName:water)) DisjunctionMaxQuery((P_NAME:filter^1.5 | > CategoryName:filter)))~2) DisjunctionMaxQuery((P_NAME:"refriger water > filter")))/no_coord > > In the SOLR query given above, I am asking for phrase matches on 2 fields: > P_NAME and CategoryName. > But If you notice ParsedQuery, I see Phrase match is applied only on P_NAME > field but not on CategoryName field. Why? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/edismax-query-parser-pf-field-question-tp4256845.html > Sent from the Solr - User mailing list archive at Nabble.com.
RE: How is Tika used with Solr
Y, and you can't actually kill a thread. You can ask nicely via Thread.interrupt(), but some of our dependencies don't bother to listen for that. So, you're pretty much left with a separate process as the only robust solution. So, we did the parent-child process thing for directory-> directory processing in tika-app via tika-batch. The next step is to harden tika-server and to kick that off in a child process in a similar way. For those who want to test their Tika harnesses (whether on single box, Hadoop/Spark etc), we added a MockParser that will do whatever you tell it when it hits an "application/xml+mock" file...full set of options: Nikolai Lobachevsky some content writing to System.out writing to System.err not another IOException -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, February 11, 2016 7:46 PM To: solr-user Subject: Re: How is Tika used with Solr Well, I'd imagine you could spawn threads and monitor/kill them as necessary, although that doesn't deal with OOM errors FWIW, Erick On Thu, Feb 11, 2016 at 3:08 PM, xavi jmlucjav wrote: > For sure, if I need heavy duty text extraction again, Tika would be > the obvious choice if it covers dealing with hangs. I never used > tika-server myself (not sure if it existed at the time) just used tika from > my own jvm. > > On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B. > > wrote: > >> x-post to Tika user's >> >> Y and n. If you run tika app as: >> >> java -jar tika-app.jar >> >> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302). >> This creates a parent and child process, if the child process notices >> a hung thread, it dies, and the parent restarts it. Or if your OS >> gets upset with the child process and kills it out of self >> preservation, the parent restarts the child, or if there's an >> OOM...and you can configure how often the child shuts itself down >> (with parental restarting) to mitigate memory leaks. >> >> So, y, if your use case allows , then we now >> have that in Tika. >> >> I've been wanting to add a similar watchdog to tika-server ... any >> interest in that? >> >> >> -Original Message- >> From: xavi jmlucjav [mailto:jmluc...@gmail.com] >> Sent: Thursday, February 11, 2016 2:16 PM >> To: solr-user >> Subject: Re: How is Tika used with Solr >> >> I have found that when you deal with large amounts of all sort of >> files, in the end you find stuff (pdfs are typically nasty) that will hang >> tika. >> That is even worse that a crash or OOM. >> We used aperture instead of tika because at the time it provided a >> watchdog feature to kill what seemed like a hanged extracting thread. >> That feature is super important for a robust text extracting >> pipeline. Has Tika gained such feature already? >> >> xavier >> >> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson >> >> wrote: >> >> > Timothy's points are absolutely spot-on. In production scenarios, >> > if you use the simple "run Tika in a SolrJ program" approach you >> > _must_ abort the program on OOM errors and the like and figure out >> > what's going on with the offending document(s). Or record the name >> > somewhere and skip it next time 'round. Or >> > >> > How much you have to build in here really depends on your use case. >> > For "small enough" >> > sets of documents or one-time indexing, you can get by with dealing >> > with errors one at a time. 
>> > For robust systems where you have to have indexing available at all >> > times and _especially_ where you don't control the document corpus, >> > you have to build something far more tolerant as per Tim's comments. >> > >> > FWIW, >> > Erick >> > >> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. >> > >> > wrote: >> > > I completely agree on the impulse, and for the vast majority of >> > > the time >> > (regular catchable exceptions), that'll work. And, by vast >> > majority, aside from oom on very large files, we aren't seeing >> > these problems any more in our 3 million doc corpus (y, I know, >> > small by today's >> > standards) from >> > govdocs1 and Common Crawl over on our Rackspace vm. >> > > >> > > Given my focus on Tika, I'm overly sensitive to the worst case >> > scenarios. I find it encouraging, Erick, that you haven't seen >> > these types of problems, that users aren't complaining too often >> > about catastrophic failures of Tika within Solr Cell, and that this >> > thread is not yet swamped with integrators agreeing with me. :) >> > > >> > > However, because oom can leave memory in a corrupted state >> > > (right?), >> > because you can't actually kill a thread for a permanent hang and >> > because Tika is a kitchen sink and we can't prevent memory leaks in >> > our dependencies, one needs to be aware that bad things can >> > happen...if only very, very rarely. For a fel
Re: optimize requests that fetch 1000 rows
Again, first things first... debugQuery=true and see which Solr search components are consuming the bulk of qtime. -- Jack Krupansky On Thu, Feb 11, 2016 at 11:33 AM, Matteo Grolla wrote: > virtual hardware, 200ms is taken on the client until response is written to > disk > qtime on solr is ~90ms > not great but acceptable > > Is it possible that the method FilenameUtils.splitOnTokens is really so > heavy when requesting a lot of rows on slow hardware? > > 2016-02-11 17:17 GMT+01:00 Jack Krupansky : > > > Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but > still > > relatively bad. Even 50ms for 10 rows would be considered barely okay. > > But... again it depends on query complexity - simple queries should be > well > > under 50 ms for decent modern hardware. > > > > -- Jack Krupansky > > > > On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla > > > wrote: > > > > > Hi Jack, > > > response time scale with rows. Relationship doens't seem linear > but > > > Below 400 rows times are much faster, > > > I view query times from solr logs and they are fast > > > the same query with rows = 1000 takes 8s > > > with rows = 10 takes 0.2s > > > > > > > > > 2016-02-11 16:22 GMT+01:00 Jack Krupansky : > > > > > > > Are queries scaling linearly - does a query for 100 rows take 1/10th > > the > > > > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)? > > > > > > > > Does the app need/expect exactly 1,000 documents for the query or is > > that > > > > just what this particular query happened to return? > > > > > > > > What does they query look like? Is it complex or use wildcards or > > > function > > > > queries, or is it very simple keywords? How many operators? > > > > > > > > Have you used the debugQuery=true parameter to see which search > > > components > > > > are taking the time? > > > > > > > > -- Jack Krupansky > > > > > > > > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla < > > matteo.gro...@gmail.com> > > > > wrote: > > > > > > > > > Hi Yonic, > > > > > after the first query I find 1000 docs in the document cache. > > > > > I'm using curl to send the request and requesting javabin format to > > > mimic > > > > > the application. > > > > > gc activity is low > > > > > I managed to load the entire 50GB index in the filesystem cache, > > after > > > > that > > > > > queries don't cause disk activity anymore. > > > > > Time improves now queries that took ~30s take <10s. But I hoped > > better > > > > > I'm going to use jvisualvm's sampler to analyze where time is spent > > > > > > > > > > > > > > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley : > > > > > > > > > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla < > > > > matteo.gro...@gmail.com> > > > > > > wrote: > > > > > > > Thanks Toke, yes, they are long times, and solr qtime (to > execute > > > the > > > > > > > query) is a fraction of a second. > > > > > > > The response in javabin format is around 300k. > > > > > > > > > > > > OK, That tells us a lot. > > > > > > And if you actually tested so that all the docs would be in the > > cache > > > > > > (can you verify this by looking at the cache stats after you > > > > > > re-execute?) then it seems like the slowness is down to any of: > > > > > > a) serializing the response (it doesn't seem like a 300K response > > > > > > should take *that* long to serialize) > > > > > > b) reading/processing the response (how fast the client can do > > > > > > something with each doc is also a factor...) 
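For the record, the timing section of the debug output is usually enough to tell whether the time is going to the search components at all. A sketch; the collection name, fields and row count are placeholders:

curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=1000&fl=id,title&wt=json&indent=true&debug=timing"

If the per-component times and QTime stay small while the client-side time stays large, as in Matteo's numbers, the cost is in reading stored fields and serializing/transferring the 1000 documents rather than in the query itself, and trimming fl to only the needed fields is a cheap experiment.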
> > > > > > c) other (GC, network, etc) > > > > > > > > > > > > You can try taking client processing out of the equation by > trying > > a > > > > > > curl request. > > > > > > > > > > > > -Yonik > > > > > > > > > > > > > > > > > > > > >
error
We upgraded our Solr version last night and are now getting the following error: org.apache.solr.common.SolrException: Bad content Type for search handler :application/octet-stream What should I do to get rid of this?
Re: Need to move on SOlr cloud (help required)
Erick , bq: We want the hits on solr servers to be distributed True, this happens automatically in SolrCloud, but a simple load balancer in front of master/slave does the same thing. Midas : in case of solrcloud architecture we need not to have load balancer ? . On Thu, Feb 11, 2016 at 11:42 PM, Erick Erickson wrote: > bq: We want the hits on solr servers to be distributed > > True, this happens automatically in SolrCloud, but a simple load > balancer in front of master/slave does the same thing. > > bq: what if master node fail what should be our fail over strategy ? > > This is, indeed one of the advantages for SolrCloud, you don't have > to worry about this any more. > > Another benefit (and you haven't touched on whether this matters) > is that in SolrCloud you do not have the latency of polling and > replicating from master to slave, in other words it supports Near Real > Time. > > This comes at some additional complexity however. If you have > your master node failing often enough to be a problem, you have > other issues ;)... > > And the recovery strategy if the master fails is straightforward: > 1> pick one of the slaves to be the master. > 2> update the other nodes to point to the new master > 3> re-index the docs from before the old master failed to the new master. > > You can use system variables to not even have to manually edit all of the > solrconfig files, just supply different -D parameters on startup. > > Best, > Erick > > On Wed, Feb 10, 2016 at 10:39 PM, kshitij tyagi > wrote: > > @Jack > > > > Currently we have around 55,00,000 docs > > > > Its not about load on one node we have load on different nodes at > different > > times as our traffic is huge around 60k users at a given point of time > > > > We want the hits on solr servers to be distributed so we are planning to > > move on solr cloud as it would be fault tolerant. > > > > > > > > On Thu, Feb 11, 2016 at 11:10 AM, Midas A wrote: > > > >> hi, > >> what if master node fail what should be our fail over strategy ? > >> > >> On Wed, Feb 10, 2016 at 9:12 PM, Jack Krupansky < > jack.krupan...@gmail.com> > >> wrote: > >> > >> > What exactly is your motivation? I mean, the primary benefit of > SolrCloud > >> > is better support for sharding, and you have only a single shard. If > you > >> > have no need for sharding and your master-slave replicated Solr has > been > >> > working fine, then stick with it. If only one machine is having a load > >> > problem, then that one node should be replaced. There are indeed > plenty > >> of > >> > good reasons to prefer SolrCloud over traditional master-slave > >> replication, > >> > but so far you haven't touched on any of them. > >> > > >> > How much data (number of documents) do you have? > >> > > >> > What is your typical query latency? > >> > > >> > > >> > -- Jack Krupansky > >> > > >> > On Wed, Feb 10, 2016 at 2:15 AM, kshitij tyagi < > >> > kshitij.shopcl...@gmail.com> > >> > wrote: > >> > > >> > > Hi, > >> > > > >> > > We are currently using solr 5.2 and I need to move on solr cloud > >> > > architecture. > >> > > > >> > > As of now we are using 5 machines : > >> > > > >> > > 1. I am using 1 master where we are indexing ourdata. > >> > > 2. I replicate my data on other machines > >> > > > >> > > One or the other machine keeps on showing high load so I am > planning to > >> > > move on solr cloud. > >> > > > >> > > Need help on following : > >> > > > >> > > 1. What should be my architecture in case of 5 machines to keep > >> > (zookeeper, > >> > > shards, core). 
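On the load-balancer part of the question: SolrJ's ZooKeeper-aware client discovers the cluster from ZooKeeper and spreads requests over the live nodes itself, so an external load balancer is mainly needed for clients that talk plain HTTP. A minimal sketch against the 5.x SolrJ API; the ZooKeeper connect string and collection name are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQueryExample {
    public static void main(String[] args) throws Exception {
        // The client reads cluster state from ZooKeeper and balances requests across live nodes.
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr")) {
            client.setDefaultCollection("mycollection");
            QueryResponse rsp = client.query(new SolrQuery("*:*"));
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}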
> >> > > > >> > > 2. How to add a node. > >> > > > >> > > 3. what are the exact steps/process I need to follow in order to > change > >> > to > >> > > solr cloud. > >> > > > >> > > 4. How indexing will work in solr cloud as of now I am using mysql > >> query > >> > to > >> > > get the data on master and then index the same (how I need to change > >> this > >> > > in case of solr cloud). > >> > > > >> > > Regards, > >> > > Kshitij > >> > > > >> > > >> >
Re: error
My log keeps growing. This is urgent. On Fri, Feb 12, 2016 at 10:43 AM, Midas A wrote: > we have upgraded solr version last night getting following error > > org.apache.solr.common.SolrException: Bad content Type for search handler > :application/octet-stream > > what i should do ? to remove this . >
Re: error
On 2/11/2016 10:13 PM, Midas A wrote: > we have upgraded solr version last night getting following error > > org.apache.solr.common.SolrException: Bad content Type for search handler > :application/octet-stream > > what i should do ? to remove this . What version did you upgrade from and what version did you upgrade to? How was the new version installed, and how are you starting it? What kind of software are you using for your clients? We also need to see all error messages in the solr logfile, including stacktraces. Having access to the entire logfile would be very helpful, but before sharing that, you might want to check it for sensitive information and redact it. Thanks, Shawn
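A quick way to pull the full stack trace Shawn is asking for; the log path depends on how Solr was installed, so both locations below are guesses:

grep -n -B 2 -A 40 "Bad content" /var/solr/logs/solr.log
# or, for a non-service install started with bin/solr:
grep -n -B 2 -A 40 "Bad content" server/logs/solr.log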
Re: error
solr 5.2.1 On Fri, Feb 12, 2016 at 12:59 PM, Shawn Heisey wrote: > On 2/11/2016 10:13 PM, Midas A wrote: > > we have upgraded solr version last night getting following error > > > > org.apache.solr.common.SolrException: Bad content Type for search handler > > :application/octet-stream > > > > what i should do ? to remove this . > > What version did you upgrade from and what version did you upgrade to? > How was the new version installed, and how are you starting it? What > kind of software are you using for your clients? > > We also need to see all error messages in the solr logfile, including > stacktraces. Having access to the entire logfile would be very helpful, > but before sharing that, you might want to check it for sensitive > information and redact it. > > Thanks, > Shawn > >