Question: index performance
I get an OutOfMemoryError when I index more than 10k records. Right now I index 10k records (about 5k per record); if I loop to index more data, it always fails with OutOfMemory. I used top to monitor it: after indexing finishes, free memory is 125m, and sometimes 218m, so Solr reports the index is finished but still seems to hold on to memory. How can I index more than 10k records without being stopped by OutOfMemory? I have given Tomcat 512m of memory. -- regards jl
Re: Results per user
I don't use Filters very much so this might be a dumb question, but I could overcome the main drawback by hooking into the filter and updating its bits without affecting the caching, right? I kind of think I have scaling issues no matter what. If you do it the post-processing way, then you may have to make repeated fetches to Solr in order to get enough results to display. I think I may have to dig a bit deeper into both approaches. On Apr 12, 2007, at 7:41 PM, Chris Hostetter wrote: : > results that are filtered on a per user basis, for instance to remove : > results that have already been viewed. I know I could post process : > the results from Solr to do this, but am wondering if a better : > solution is to implement my own request handler that takes in user id : > info and manages a cache of Filters that maintains the bit set info : > on the search side. Is this a good approach? : One issue with your approach would be scaling... if you have multiple : searchers, how do you communicate this user data between them? If the filtering logic can be implemented in a Filter class, you might just want to rely on the built-in filterCache (you'd still need a custom request handler that knows about your custom Filter). The plus side is you'd get all the benefits of Solr's filter cache (cached as long as the same searcher is used, autowarmed when a new searcher is opened). The down side is you'd get all the benefits of Solr's filter cache (cached as long as the same searcher is used -- so it wouldn't notice if you'd updated your datastore to remove a bunch of files from their filter). -Hoss
Re: Sort on multiple fields not working?
On 12 apr 2007, at 17.06, Yonik Seeley wrote: Sorting works on indexed tokens, and hence doesn't really work on analyzed fields that produce more than one token per document. I suspect your title field falls into that category. You could also index the title field into another field that is indexed as a string (non-tokenized), but that might take up a lot of memory if you have long titles. It just hit me (and I did not consider it any further) that perhaps one could store String.valueOf(theTitle.hashCode()) in an alternative field and sort by that instead? It will not be 100% accurate, but in most cases it will be. However, I'm not sure how negative values would be handled. If that would be a problem, one could convert the integer to an alphanumeric form. That should also save a bunch of memory. -- karl
Re: Question: index performance
On 4/13/07, James liu <[EMAIL PROTECTED]> wrote: I get OutOfMemory when I index more than 10k records, so now I index 10k records (5k per record) In one request? There's really no reason to put more than hundreds of documents in a single add request. If you are indexing using multiple requests, and always run into problems at 10k records, you are probably hitting memory issues with Lucene merging. If that's the case, try lowering the mergeFactor so fewer segments will be merged at the same time. Some other things to be careful of: - don't call commit after you add every batch of documents - don't set maxBufferedDocs too high if you don't have the memory -Yonik
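For reference, both of those knobs live in solrconfig.xml. A minimal sketch with illustrative values (not recommendations):

    <!-- solrconfig.xml, <indexDefaults> section -->
    <indexDefaults>
      <!-- lower mergeFactor = fewer segments merged at once = less peak memory -->
      <mergeFactor>10</mergeFactor>
      <!-- documents buffered in RAM before a segment is flushed to disk -->
      <maxBufferedDocs>1000</maxBufferedDocs>
    </indexDefaults>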
Re: Sort on multiple fields not working?
On 4/13/07, karl wettin <[EMAIL PROTECTED]> wrote: It just hit me (and I did not consider it any further) that perhaps one could store String.valueOf(theTitle.hashCode()) in an alternative field and sort by that instead? It will not be 100% accurate, but in most cases it will be. That would only mostly work for titles around 5 characters long, right? It seems like after that, the correlation between hashCode and sort order breaks down almost immediately, since you lose the leftmost hash bits. -Yonik
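For reference, Java's String.hashCode is specified as a running multiply-by-31 accumulation that silently wraps at 32 bits, which is why the leading characters' influence on ordering is lost for longer strings. A minimal sketch of the specified algorithm:

    // Equivalent to String.hashCode() as specified in the JDK:
    //   h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], with 32-bit wraparound
    public final class StringHashDemo {
        static int stringHash(String s) {
            int h = 0;
            for (int i = 0; i < s.length(); i++) {
                h = 31 * h + s.charAt(i); // wraps silently on int overflow
            }
            return h;
        }
        public static void main(String[] args) {
            String title = "a reasonably long title";
            // prints true: the loop above matches the JDK implementation
            System.out.println(stringHash(title) == title.hashCode());
        }
    }

Since 31^7 already exceeds 2^31, the first character's term wraps for strings of roughly seven or more characters, and hash order stops tracking lexicographic order.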
Re: Sort on multiple fields not working?
On 13 apr 2007, at 15.48, Yonik Seeley wrote: On 4/13/07, karl wettin <[EMAIL PROTECTED]> wrote: It just hit me (and I did not consider it any further) that perhaps one could store String.valueOf(theTitle.hashCode()) in an alternative field and sort by that instead? It will not be 100% accurate, but in most cases it will be. That would only mostly work for titles around 5 characters long, right? It seems like after that, the correlation between hashCode and sort order breaks down almost immediately, since you lose the leftmost hash bits. That might be true; as I said, I didn't really think about it too long. But some alternative hashCode could probably be implemented, one that uses all available bits in a string, rather than the 32-bit limitation of an integer. -- karl
Re: Question: index performance
Hi there, I'm building an index to which I'm sending a few hundred thousand entries. I pull them off the database in batches of 25k and send them to Solr, 100 documents at a time. I was doing a commit after each of those, but after what Yonik says I will remove it and commit only after each batch of 25k. Q1: I've got autocommit set to 1000 in solrconfig.xml; should I disable it in this scenario? Q2: To decide which of those 25k are going to be indexed, we need to do a query for each (this is the main reason to optimize before a new DB batch is indexed). Each of these 25k queries takes around 30ms, which is good enough for us, but I've observed that every ~30 queries, the time of one search goes up to 150ms or even 1200ms. Then it does another ~30, etc. I guess there is something happening inside the server regularly that causes it. Any clues what it can be and how I can minimize that time? Q3: The 25k searches are done without any cumulative effect on performance (avg/search is ~30ms from start to end), but if immediately afterwards I start posting documents to the index, Tomcat's CPU peaks. If I stop Tomcat and then post the 25k documents without doing those searches, they're very quick. Is there any reason why the searches would affect Tomcat enough to justify this? Just to clarify, searches are NOT done at the same time as indexing. My Tomcat is running with -server -Xmx512m -Xms512m Cheers, galo Yonik Seeley wrote: On 4/13/07, James liu <[EMAIL PROTECTED]> wrote: I get OutOfMemory when I index more than 10k records, so now I index 10k records (5k per record) In one request? There's really no reason to put more than hundreds of documents in a single add request. If you are indexing using multiple requests, and always run into problems at 10k records, you are probably hitting memory issues with Lucene merging. If that's the case, try lowering the mergeFactor so fewer segments will be merged at the same time. Some other things to be careful of: - don't call commit after you add every batch of documents - don't set maxBufferedDocs too high if you don't have the memory -Yonik
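Regarding Q1, the autocommit being discussed is the <autoCommit> block inside <updateHandler> in solrconfig.xml; a minimal sketch with illustrative values (removing or commenting out the block disables it; exact availability depends on your Solr build):

    <!-- solrconfig.xml, inside <updateHandler> -->
    <autoCommit>
      <!-- force a commit once this many documents are pending -->
      <maxDocs>1000</maxDocs>
    </autoCommit>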
Re: Schema validator/debugger
Yonik Seeley wrote: Oh wait... Andrew, were you always testing via "ping"? Check out what the ping query is configured as in solrconfig.xml: qt=dismax&q=solr&start=3&fq=id:[* TO *]&fq=cat:[* TO *] Perhaps we should change it to something simple by default??? "q=solr"? That solves the Jetty failure mystery... so it looks like you either have a Tomcat setup problem, or a Solr bug that only shows under Tomcat. Yes, this is the problem! Good catch :) I have been testing via ping. However, this still does not solve my original problem... I will dig a bit more and see what I can find. Thanks Andrew
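For reference, the ping query is set in the <admin> section of solrconfig.xml; a minimal sketch of swapping in a simpler default:

    <!-- solrconfig.xml, <admin> section: a deliberately simple healthcheck -->
    <admin>
      <pingQuery>q=solr</pingQuery>
    </admin>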
Embedding Solr vs Lucene, multiple Solr cores?
I'm trying to choose between embedding Lucene versus embedding Solr in one webapp. In Solr terms, functional requirements would more or less lead to multiple schema & conf (need CRUD/generation on those) and deployment constraints imply one webapp instance. The choice I'm trying to make is thus: -Embed Lucene and (attempt to) recode a lot of what Solr provides... (the straw man) -Embed Solr but refactor 'some' of its core, assuming it is correct to see one Solr core as the association of one schema & one conf. There have been a few threads about multiple indexes and/or multiple/reloading schemas. From what I gathered, one solution stems from the 'multiple webapp instances deployment' and implies 'extracting' the static instance (at least the SolrCore) & thus hosting multiple Solr cores in one webapp. Obviously, the operations (queries/add/delete doc) would need to carry which core they are targeting (one 'core' being set as the 'default' for compatibility purposes). What will be the other big hurdles, the ones that could even preclude the very idea? (caches handling, updater threads, HA features...) Are there any easier routes (class-loaders, 'provisional' schema...)? Any advice welcome. Thanks. Henri
Deploying Solr with Jetty
First off, I am aware that the bulk of this question has to do with Jetty, but please have kindness... My end goal is to have a handful of Solr instances running under Jetty, all accessible at /app1 /app2 ...etc... I have taken the .war file in the Solr dist/ directory, unpacked it, and added in a solr/ dir with the bin/ and conf/ sub-dirs, etc. I then zipped this back up with a .war extension and placed this .war in my Jetty webapps/. However, when I start Jetty it unpacks the war and then tries to start the app, but it complains about not finding solrconfig.xml and ends up not starting that app. So my question is, where do I place the solr/ directory so it will be found by the app? If I place it in my root Jetty dir, that config will apply to all my Solr instances, which is not what I want because they all need to have different indexes, etc. The "Multiple Solr instances for Jetty" page on the Solr wiki uses outdated syntax, but it is a step in the right direction because it specifies a JNDI entry for solr/home for each webapp. I guess I am just having basic Jetty config issues. Like I said, I know this question has more to do with Jetty than Solr, but could someone point me in the right direction (besides saying "join the Jetty list")? Thanks /cody
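Not an authoritative answer -- the exact syntax depends on the Jetty version -- but the general shape of that per-webapp JNDI entry is a context XML per app that sets solr/home; the class names, paths, and app names below are illustrative:

    <!-- e.g. contexts/app1.xml, Jetty 6-era syntax; check your version's docs -->
    <Configure class="org.mortbay.jetty.webapp.WebAppContext">
      <Set name="contextPath">/app1</Set>
      <Set name="war">/opt/jetty/webapps/app1.war</Set>
      <!-- each webapp gets its own solr home (conf/, data/, etc.) -->
      <New class="org.mortbay.jetty.plus.naming.EnvEntry">
        <Arg>solr/home</Arg>
        <Arg type="java.lang.String">/opt/solr/app1</Arg>
      </New>
    </Configure>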
Re: Results per user
: I don't use Filters very much so this might be a dumb question, but I : could overcome the main drawback by hooking into the filter and : updating its bits without affecting the caching, right? Not really ... Solr doesn't use Filters the same way CachingWrapperFilter does ... it builds DocSets out of them and caches those for the life of the IndexSearcher (or until the cache gets full and it needs to expunge something). When a new IndexSearcher is opened, it auto-warms the new filterCache by executing the existing Filters against the new IndexSearcher. : I kind of think I have scaling issues no matter what. If you do the : post processing way, then you may have to make repeated fetches to : Solr in order to get enough results to display. Anything you can do on the client side you can do in a custom request handler (assuming you can do it in Java), so that will at least save you the overhead of HTTP back and forth with the Solr server ... I was just trying to think of ways that existing features available to SolrRequestHandlers could help you more. -Hoss
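For context, the filterCache described above is configured in solrconfig.xml; the sizing below is illustrative. autowarmCount controls how many cached filters get re-executed against each new searcher:

    <!-- solrconfig.xml: cached filter DocSets live here, per searcher -->
    <filterCache
      class="solr.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="256"/>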
Re: Sort on multiple fields not working?
: That might be true, as I said, I didn't really think about it too : long. But some alternative hashCode could probably be implemented, : one that uses all available bits in a string, rather than the 32 bit : limitation of an integer. If you're going to use all the bits in the string, and not confine yourself to an integer, how is that different from sorting on the string itself? (Either way you still need a single value per doc per sort field -- and can't use a tokenized field, so you use copyField.) -Hoss
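A minimal schema.xml sketch of the copyField approach mentioned above (field names are illustrative):

    <!-- schema.xml: tokenized field for searching, untokenized copy for sorting -->
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="title_sort" type="string" indexed="true" stored="false"/>
    <copyField source="title" dest="title_sort"/>
    <!-- queries then sort on title_sort rather than title -->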
Re: Embedding Solr vs Lucene, multiple Solr cores?
On 4/13/07, Henrib <[EMAIL PROTECTED]> wrote: I'm trying to choose between embedding Lucene versus embedding Solr in one webapp. In Solr terms, functional requirements would more or less lead to multiple schema & conf (need CRUD/generation on those) and deployment constraints imply one webapp instance. Do you really need multiple schemas? Multiple indexes? This has been posted many times (I thought I needed it too!) - it turns out most cases can easily be taken care of by putting multiple document types in the same index and including a "type" field. You could have a single schema with names for common fields, or one that has a prefix for each type: either "title" or "typeA_title", "typeB_title" - the common-name approach can be easier because it makes it easier to search across types. -Embed Solr but refactor 'some' of its core, assuming it is correct to see one Solr core as the association of one schema & one conf. If you absolutely need multiple indexes, it will probably be easier to fudge the single-webapp requirement than to refactor Solr to remove the static singleton SolrCore.getSolrCore(). ryan
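A minimal sketch of the single-schema, common-field-names approach (field and type names are illustrative):

    <!-- schema.xml: one index, many document types discriminated by "type" -->
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="type" type="string" indexed="true" stored="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
    <!-- restrict a search to one document type with a filter query: fq=type:typeA -->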
Re: Embedding Solr vs Lucene, multiple Solr cores?
Hi - Of the various approaches that you could take, the one I'd work on first is: deployment constraints imply one webapp instance. In most environments, it's going to cost a lot less to change this than to try to roll your own, or extensively modify Solr. I know I'm sidestepping your stated requirements, but I'd take a long look at that one. BTW, we cut over from an embedded Lucene instance to Solr about 4 months ago, and are very happy that we did. Tom On 4/13/07, Henrib <[EMAIL PROTECTED]> wrote: I'm trying to choose between embedding Lucene versus embedding Solr in one webapp. In Solr terms, functional requirements would more or less lead to multiple schema & conf (need CRUD/generation on those) and deployment constraints imply one webapp instance. The choice I'm trying to make is thus: -Embed Lucene and (attempt to) recode a lot of what Solr provides... (the straw man) -Embed Solr but refactor 'some' of its core, assuming it is correct to see one Solr core as the association of one schema & one conf. There have been a few threads about multiple indexes and/or multiple/reloading schemas. From what I gathered, one solution stems from the 'multiple webapp instances deployment' and implies 'extracting' the static instance (at least the SolrCore) & thus hosting multiple Solr cores in one webapp. Obviously, the operations (queries/add/delete doc) would need to carry which core they are targeting (one 'core' being set as the 'default' for compatibility purposes). What will be the other big hurdles, the ones that could even preclude the very idea? (caches handling, updater threads, HA features...) Are there any easier routes (class-loaders, 'provisional' schema...)? Any advice welcome. Thanks. Henri
Re: Sort on multiple fields not working?
On 13 apr 2007, at 20.11, Chris Hostetter wrote: : That might be true, as I said, I didn't really think about it too : long. But some alternative hashCode could probably be implemented, : one that uses all available bits in a string, rather than the 32 bit : limitation of an integer. if you're going to use all the bits in the string, and not confine yourself to an integer, how is that different from sorting on the string itself? Smaller string values do not consume as much memory? I might not understand your question. -- karl
Re: Embedding Solr vs Lucene, multiple Solr cores?
Thank you both for your quick answers. The one-webapp constraint comes from the main 'embedding' application, so I don't have much leeway there. The direct approach was to map the main/hosting application document collections & types to one schema/conf. Since the host collections & types can be dynamically created, this seemed the natural route (albeit hard). The longer story is that in our typical customer environments, IT deploys & monitors webapps (provisions space et al., replicates for disaster recovery, etc.) but does not want to deal with the application itself, leaving the 'business users' side to administer it. Even if there is a dedicated Tomcat for the main app, IT will not let the 'business users' install other applications (scope of responsibility, code versus data, validation procedures, etc.). Thus the 'one application' constraint. Anyway, it seems a 'provisional' schema where most fields would be dynamic, plus some notational convention to map them, would be the easiest route, replacing the targeted different indexes with equivalent filters. I gather from your inputs that the potential functionality loss and/or performance hit is not something I should be afraid of. For the sake of completeness: instead of embedding Solr in that single instance, I thought about using several Solr instances running in different webapp instances & using them as 'coprocessors' for the main application; this would imply serializing/deserializing/redirecting queries & results between webapps, which is not the most efficient way on a single host/VM env (maybe Tomcat crossContext could help alleviate that). But this would also require dynamically deploying webapps for that purpose, which is a no-no from IT... For the sake of argument :-), besides the SolrCore singleton (which is easy to circumvent with a map of cores & at least a pointer from the instantiated schema to the core handling it), are there others that are hiding (Config.config, caches...) that would preclude the multiple-core track? Thanks again Henri Tom Hill-6 wrote: > > Hi - > > Of the various approaches that you could take, the one I'd work on first > is: > >> deployment constraints imply one webapp instance. > > In most environments, it's going to cost a lot less to change this than to > try to roll your own, or extensively modify Solr. > > I know I'm sidestepping your stated requirements, but I'd take a long look > at that one. > > BTW, we cut over from an embedded Lucene instance to Solr about 4 months > ago, and are very happy that we did. > > Tom > > On 4/13/07, Henrib <[EMAIL PROTECTED]> wrote: >> >> >> I'm trying to choose between embedding Lucene versus embedding Solr in >> one >> webapp. >> >> In Solr terms, functional requirements would more or less lead to >> multiple >> schema & conf (need CRUD/generation on those) and deployment constraints >> imply one webapp instance. The choice I'm trying to make is thus: >> -Embed Lucene and (attempt to) recode a lot of what Solr provides... (the >> straw man) >> -Embed Solr but refactor 'some' of its core, assuming it is correct to >> see >> one Solr core as the association of one schema & one conf. >> >> There have been a few threads about multiple indexes and/or >> multiple/reloading schemas. >> From what I gathered, one solution stems from the 'multiple webapp >> instances >> deployment' and implies 'extracting' the static instance (at least the >> SolrCore) & thus host multiple Solr cores in one webapp. 
>> >> Obviously, the operations (queries/add/delete doc) would need to carry >> which >> core they are targeting (one 'core' being set as the 'default' for >> compatibility purpose). >> What will be the other big hurdles, the ones that could even preclude the >> very idea? (caches handling, updater threads, HA features...) >> Are there any easier routes (class-loaders, 'provisional' schema...)? >> >> Any advice welcome. Thanks. >> Henri
Re: Results per user
I wrote the following after hurriedly reading Grant Ingersoll's question, and I completely missed the "to remove results that have already been viewed" bit. Which leads me to think what I wrote may have no bearing on this issue... but perhaps it may have bearing on someone else's issue? - J.J. - Under the assumption that there is an untokenized field, say UserAccess, with user names or IDs that for each document indicate which users can access them... If you could trust the requesting client to modify the request based on the user name or ID, it could either - Add an fq=UserAccess:userName argument to every request - Create a RequestHandler configuration for each user, putting such an fq (with a hardwired username, e.g. UserAccess:tony) in an 'appends' section, along with any other needed customization; see the sketch after this message. But if you cannot trust the requesting client and need to do the filtering on the SOLR side of the divide, then I think you can simply subclass and deploy org.apache.solr.servlet.SolrDispatchFilter, such that in the execute() method you take the user (e.g. from request.getRemoteUser() or some other means), format an fq argument as above, and explicitly add it to the params in the SolrQueryRequest. While users can add filters to their queries, they would not be able to remove the server-supplied filter query. Regardless of how the fq is specified, it would create a cached filter for each user. Obviously the filter cache size should be greater than the number of simultaneously active users plus the filters they use in their queries; otherwise inactive users' filters will be evicted and rebuilt the next time around. -
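A minimal sketch of such a per-user handler configuration, reusing the hardwired example above (the handler name, field, and user are illustrative):

    <!-- solrconfig.xml: this user's clients would be pointed at qt=foruser-tony -->
    <requestHandler name="foruser-tony" class="solr.StandardRequestHandler">
      <lst name="appends">
        <str name="fq">UserAccess:tony</str>
      </lst>
    </requestHandler>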
Re: Embedding Solr vs Lucene, multiple Solr cores?
: but does not want to deal with the application itself, leaving the 'business : users' side administer it. Even if there is a dedicated Tomcat for the main : app, IT will not let the 'business users' install other applications (scope : of responsibility, code versus data, validation procedures, etc). Thus the : 'one application' constraint. There tend to be a lot of devils in the details of policy discussions like this, but perhaps you could redefine the definition of an "application" from your ops/biz standpoint to be broader than the definition from a servlet-container standpoint (ie: let the "application" be the entire Tomcat setup running several webapps). Alternately, I've heard people mention in past discussions issues regarding service-provider-run servlet containers with self-serve WAR hot deployment, and the issues involved with only being able to change your WAR and not having any control over the container itself, and I've always wondered: how hard would it be to wrap Tomcat (or Jetty) so that it is a war that can run inside of another servlet container ... then you could have multiple wars embedded in that war and control the Tomcat configs to your heart's content -- treating the ISP's servlet container like an OS. : For the sake of argument :-), besides the SolrCore singleton (which is easy : to circumvent with a map of cores & at least a pointer from the instantiated : schema to the core handling it), are there others that are hiding : (Config.config, caches...) that would preclude the multiple-core track? There are lots of places in the code where class instances use static refs to find the Core/Config/IndexSchema, all of which would have to know about your Map and keys ... it would be a lot of non-trivial changes and refactoring, I believe. That said: if anyone is interested in tackling a patch to eliminate all of the static singletons, I (and many others, I suspect) would be extremely grateful ... both for how much it would improve the reusability of Solr in embedded situations like this, and for how it would (hopefully) make the code easier to follow for future developers. -Hoss
Re: Solr Scripts.conf Parsing Error
I think you're on to something; here was the output: # Licensed to the Apache Software Foundation (ASF) under one or more^M$ # contributor license agreements. See the NOTICE file distributed with^M$ # this work for additional information regarding copyright ownership.^M$ # The ASF licenses this file to You under the Apache License, Version 2.0^M$ # (the "License"); you may not use this file except in compliance with^M$ # the License. You may obtain a copy of the License at^M$ #^M$ # http://www.apache.org/licenses/LICENSE-2.0^M$ #^M$ # Unless required by applicable law or agreed to in writing, software^M$ # distributed under the License is distributed on an "AS IS" BASIS,^M$ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.^M$ # See the License for the specific language governing permissions and^M$ # limitations under the License.^M$ user=solr^M$ solr_hostname=localhost^M$ solr_port=8080^M$ rsyncd_port=18080^M$ data_dir=^M$ webapp_name=solr^M$ master_host=^M$ master_data_dir=^M$ master_status_dir=^M$ The question now is, what's the best solution for removing those characters? Dan Chris Hostetter wrote: > : all the debug output. Here is a snip of that. Note the "solr\r", yet in my : .conf file I have only "user=solr". If I run the script using this command > what does "cat -vet scripts.conf" tell you? > -Hoss
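The ^M$ at the end of each line is cat -vet's rendering of a DOS/Windows CRLF line ending. Any of the usual conversions should work; a sketch, assuming a plain-text file:

    # convert in place, if dos2unix is installed
    dos2unix scripts.conf
    # or strip the carriage returns with tr
    tr -d '\r' < scripts.conf > scripts.conf.unix && mv scripts.conf.unix scripts.conf
    # or with GNU sed
    sed -i 's/\r$//' scripts.conf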