Re: Solr feasibility with terabyte-scale data
Thanks Ken. I will take a look, be sure of that :)

Kindly

//Marcus

On Fri, May 9, 2008 at 10:26 PM, Ken Krugler <[EMAIL PROTECTED]> wrote:

> Hi Marcus,
>
>> It seems a lot of what you're describing is really similar to MapReduce, so I think Otis' suggestion to look at Hadoop is a good one: it might prevent a lot of headaches and they've already solved a lot of the tricky problems. There are a number of ridiculously sized projects using it to solve their scale problems, not least Yahoo...
>
> You should also look at a new project called Katta:
>
> http://katta.wiki.sourceforge.net/
>
> First code check-in should be happening this weekend, so I'd wait until Monday to take a look :)
>
> -- Ken
>
> On 9 May 2008, at 01:17, Marcus Herou wrote:
>
>> Cool.
>>
>> Since you must certainly already have a good partitioning scheme, could you elaborate on a high level how you set this up?
>>
>> I'm certain that I will shoot myself in the foot both once and twice before getting it right, but this is what I'm good at; to never stop trying :) However, it is nice to start playing at least on the right side of the football field, so a little push in the back would be really helpful.
>>
>> Kindly
>>
>> //Marcus
>>
>> On Fri, May 9, 2008 at 9:36 AM, James Brady <[EMAIL PROTECTED]> wrote:
>>
>>> Hi, we have an index of ~300GB, which is at least approaching the ballpark you're in.
>>>
>>> Lucky for us, to coin a phrase we have an 'embarrassingly partitionable' index, so we can just scale out horizontally across commodity hardware with no problems at all. We're also using the multicore features available in the development Solr version to reduce the granularity of core size by an order of magnitude: this makes for lots of small commits, rather than few long ones.
>>>
>>> There was mention somewhere in the thread of document collections: if you're going to be filtering by collection, I'd strongly recommend partitioning too. It makes scaling so much less painful!
>>>
>>> James
>>>
>>> On 8 May 2008, at 23:37, marcusherou wrote:
>>>
>>>> Hi.
>>>>
>>>> I will as well head into a path like yours within some months from now. Currently I have an index of ~10M docs and only store ids in the index for performance and distribution reasons. When we enter a new market I'm assuming we will soon hit 100M and quite soon after that 1G documents. Each document has on average about 3-5k of data.
>>>>
>>>> We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures as shared storage (think of it as a SAN or shared storage at least, one mount point). Hope this will be the right choice, only the future can tell.
>>>>
>>>> Since we are developing a search engine, I frankly don't think even having 100's of SOLR instances serving the index will cut it performance-wise if we have one big index. I totally agree with the others claiming that you most definitely will go OOM or hit some other constraints of SOLR if you must have the whole result in memory, sort it and create an xml response. I did hit such constraints when I couldn't afford the instances to have enough memory, and I had only 1M of docs back then. And think of it... Optimizing a TB index will take a long, long time and you really want to have an optimized index if you want to reduce search time.
>>>>
>>>> I am thinking of a sharding solution where I fragment the index over the disk(s) and let each SOLR instance only have a little piece of the total index. This will require a master database or namenode (or simpler, just a properties file in each index dir) of some sort to know which docs are located on which machine, or at least how many docs each shard has. This is to ensure that whenever you introduce a new SOLR instance with a new shard, the master indexer will know which shard to prioritize. This is probably not enough either, since all new docs will go to the new shard until it is filled (has the same size as the others); only then will all shards receive docs in a load-balanced fashion. So whenever you want to add a new indexer you probably need to initiate a "stealing" process where it steals docs from the others until it reaches some sort of threshold (10 servers = each shard should have 1/10 of the docs or such).
>>>>
>>>> I think this will cut it and enable us to grow with the data. I think doing a distributed reindexing will as well be a good thing when it comes to cutting both indexing and optimizing speed. Probably each indexer should buffer its shard locally on RAID1 SCSI disks, optimize it and then just copy it to
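To make the shard-prioritization idea above concrete: the "master database or namenode" only needs to track a per-shard document count, and the indexer can route every new document to the emptiest shard, so a newly added shard naturally catches up before writes become evenly balanced. The following Java sketch is purely illustrative (the class and method names are invented for this example, not part of Solr), assuming the shard counts are loaded from that central store at startup:

    import java.util.HashMap;
    import java.util.Map;

    /** Hypothetical routing helper: always sends new docs to the smallest shard. */
    public class ShardRouter {

        // shard name -> current document count, loaded from the "namenode"
        // database or the per-index properties files described above
        private final Map<String, Long> docCounts = new HashMap<String, Long>();

        public void registerShard(String shardName, long currentDocCount) {
            docCounts.put(shardName, currentDocCount);
        }

        /** Pick the shard with the fewest documents and account for the new doc. */
        public synchronized String pickShardForNewDoc() {
            String best = null;
            long bestCount = Long.MAX_VALUE;
            for (Map.Entry<String, Long> e : docCounts.entrySet()) {
                if (e.getValue() < bestCount) {
                    bestCount = e.getValue();
                    best = e.getKey();
                }
            }
            docCounts.put(best, bestCount + 1);
            return best;
        }
    }

The "stealing" process described in the mail is then the same bookkeeping run in reverse: shards holding more than 1/N of the total hand documents to shards holding less, until the counts converge.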
Re: Solr feasibility with terabyte-scale data
Hi Otis.

Thanks for the insights. Nice to get feedback from a Technorati guy. Nice to see that the snippet of yours is almost a copy of mine, gives me the right stomach feeling about this :)

I'm quite familiar with Hadoop, as you can see if you check out the code of my OS project AbstractCache -> http://dev.tailsweep.com/projects/abstractcache/. AbstractCache is a project which aims to create storage solutions based on the Map and SortedMap interfaces. I use it everywhere in Tailsweep.com and used it as well at my former employer Eniro.se (largest yellow pages site in Sweden). It has been in constant development for five years.

Since I'm a cluster freak of nature, I love a project named GlusterFS, where they have managed to create a system without master/slave[s] and NameNode. The advantage of this is that it is a lot more scalable; the drawback is that you can get into split-brain situations, which guys on the mailing list are complaining about. Anyway, I tend to try to solve this with JGroups membership, where the coordinator can be any machine in the cluster, but in the group joining process the first machine to join gets the privilege of becoming coordinator. But even with JGroups you can run into trouble with race conditions of all kinds (distributed locks, for example).

I've created an alternative to the Hadoop file system (mostly for fun) where you just add an object to the cluster and, based on what algorithm you choose, it is raided or striped across the cluster. Anyway, this was off topic, but I think my experience in building membership-aware clusters will help me in this particular case.

Kindly

//Marcus

On Fri, May 9, 2008 at 6:54 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> Marcus,
>
> You are headed in the right direction.
>
> We've built a system like this at Technorati (Lucene, not Solr) and had components like the "namenode" or "controller" that you mention. If you look at the Hadoop project, you will see something similar in concept (NameNode), though it deals with raw data blocks, their placement in the cluster, etc. As a matter of fact, I am currently running its "re-balancer" in order to move some of the blocks around in the cluster. That matches what you are describing for moving documents from one shard to the other. Of course, you can simplify things and just have this central piece be aware of any new servers and simply get it to place any new docs on the new servers and create a new shard there. Or you can get fancy and take into consideration the hardware resources - the CPU, the disk space, the memory - and use that to figure out how much each machine in your cluster can handle and maximize its use based on this knowledge. :)
>
> I think Solr and Nutch are in desperate need of this central component (must not be a SPOF!) for shard management.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: marcusherou <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Friday, May 9, 2008 2:37:19 AM
> > Subject: Re: Solr feasibility with terabyte-scale data
> >
> > Hi.
> >
> > I will as well head into a path like yours within some months from now. Currently I have an index of ~10M docs and only store ids in the index for performance and distribution reasons. When we enter a new market I'm assuming we will soon hit 100M and quite soon after that 1G documents. Each document has on average about 3-5k of data.
> >
> > We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures as shared storage (think of it as a SAN or shared storage at least, one mount point). Hope this will be the right choice, only the future can tell.
> >
> > Since we are developing a search engine, I frankly don't think even having 100's of SOLR instances serving the index will cut it performance-wise if we have one big index. I totally agree with the others claiming that you most definitely will go OOM or hit some other constraints of SOLR if you must have the whole result in memory, sort it and create an xml response. I did hit such constraints when I couldn't afford the instances to have enough memory, and I had only 1M of docs back then. And think of it... Optimizing a TB index will take a long, long time and you really want to have an optimized index if you want to reduce search time.
> >
> > I am thinking of a sharding solution where I fragment the index over the disk(s) and let each SOLR instance only have a little piece of the total index. This will require a master database or namenode (or simpler, just a properties file in each index dir) of some sort to know which docs are located on which machine, or at least how many docs each shard has. This is to ensure that whenever you introduce a new SOLR instance with a new shard, the master indexer will know which shard to prioritize. This is probably not eno
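On the JGroups point above: the "first machine to join becomes coordinator" convention falls out of the membership view, since JGroups lists members in join order. A minimal sketch against a reasonably recent JGroups API, assuming the default protocol stack and an invented cluster name (illustrative only, not Marcus's actual code):

    import org.jgroups.Address;
    import org.jgroups.JChannel;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    public class CoordinatorAwareNode extends ReceiverAdapter {

        private final JChannel channel;
        private volatile boolean coordinator;

        public CoordinatorAwareNode(String clusterName) throws Exception {
            channel = new JChannel();       // default protocol stack
            channel.setReceiver(this);
            channel.connect(clusterName);   // join the group
        }

        @Override
        public void viewAccepted(View view) {
            // Members are ordered by join time, so by convention the first
            // member of the view acts as the coordinator.
            Address coord = view.getMembers().get(0);
            coordinator = coord.equals(channel.getAddress());
        }

        public boolean isCoordinator() {
            return coordinator;
        }
    }

As the mail notes, this only elects a coordinator; it does not by itself protect against split-brain or the race conditions around distributed locks.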
Re: token concat filter?
: I guess not. I've been reading the wiki, but the trouble with wiki's always
: seems to be (for me) finding stuff. can you point it out?

just search the wiki for "synonyms" (admittedly, the fact that by default it only searches titles and you have to click "text" to search the full wiki is annoying) and you'll find this...

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=%28synonyms%29

...and if you scroll down you'll wind up here...

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter

-Hoss
Re: Multicore and SolrResourceLoader
: I've been digging around in multicore and I am curious as to how to force a
: reload of the sharedLib classloader. I can reload a given core, which
: instantiates a new SolrResourceLoader for that core, but I want to be able to
: reload the classloader for the sharedLib.

that seems really dangerous to me ... you could wind up changing class impls out from under a core ... which could give you serious incompatibilities.

The only safe way i can imagine doing this would be if we add a way to completely reinitialize the MultiCore (which would reload all the SolrCores).

-Hoss
Re: exceeded limit of maxWarmingSearchers
: On a solr instance where I am in the process of indexing moderately large
: number of documents (300K+). There is no querying of the index taking place
: at all.

: I don't understand what operations are causing new searchers to warm, or how
: to stop them from doing so. I'd be happy to provide more details of my
: configuration if necessary, I've made very few changes to the solrconfig.xml
: that comes with the sample application.

the one aspect that i didn't see mentioned in this thread so far is cache autowarming.

even if no querying is going on while you are doing all the indexing, if some querying took place at any point and your caches have some entries in them, every commit will cause autowarming of caches to happen (according to the autowarm settings on each cache), which results in queries getting executed on your "warming" searcher, and those queries keep cascading on to the subsequent warming searchers.

this is one of the reasons why it's generally a good idea to have the caches on your "master" boxes all use autowarm counts of "0". you can still use the caches in case you do inadvertently hit your master (you don't want it to fall over and die), but you don't want to waste a lot of time warming them on every commit until the end of time.

-Hoss
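To make the autowarm advice concrete, the counts Hoss mentions are set per cache in solrconfig.xml. A minimal illustration for a master/indexing box (the size values here are placeholders, not recommendations):

    <!-- solrconfig.xml: caches remain usable for stray queries, but nothing
         is re-executed against the new searcher on commit -->
    <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

On query-serving boxes you would instead raise autowarmCount so that new searchers come up with warm caches.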
Re: Missing content Stream
1) Posting the exact same question twice because you didn't get a reply in the first 8 hours isn't going to encourage people to reply faster. best case scenario: you waste people's time they could be spending reading another email; worst case scenario: you irk people and put them in a bad mood so they don't feel like being helpful.

2) In both instances of your question, the curl command you listed didn't include a space between the URL and the --data-binary option; perhaps that's your problem?

3) what *exactly* do you see in your shell when you run the command? you said the lines from books.csv get dumped to your console, but what appears above it? what appears below it? ... books.csv is only 10 lines, just paste it all from the line where you run the command to the next prompt you see.

4) FYI...

: Changed the following in solrconfig.xml,
:

a) You don't need to enableRemoteStreaming if you plan on posting your CSV files; that's only necessary if you want to be able to use the stream.file or stream.url options.

b) the /update/csv handler should have already been in the example solrconfig.xml exactly as you list it there ... what do you mean you changed it?

-Hoss
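For reference, a working invocation along the lines Hoss describes in point 2 would look something like this (host, port and file name are the stock example values; adjust them for your setup) - note the space before --data-binary:

    curl 'http://localhost:8983/solr/update/csv' --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'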
Re: JSON updates?
: I was wondering if xml is the only format used for updating Solr documents
: or can JSON or Ruby be used as well ?

Solr ships with two "RequestHandlers" for processing updates -- the XMLRequestHandler and the CSVRequestHandler. some others are in Jira (RichDocumentUpdateHandler and DataImportRequestHandler) or you could implement new ones as plugins.

-Hoss
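For anyone unfamiliar with the XML flavour, the update format that the XMLRequestHandler accepts looks like the snippet below (the field names are only illustrative and must match your schema):

    <add>
      <doc>
        <field name="id">example-doc-1</field>
        <field name="title">An example document</field>
      </doc>
    </add>

A separate <commit/> is still needed before the document becomes searchable.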
Re: Simple Solr POST using java
: please post a snippet of Java code to add a document to the Solr index that
: includes the URL reference as a String?

you mean like this one... :)

http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/util/SimplePostTool.java?view=markup

FWIW: if you want to talk to Solr from a Java app, the SolrJ client API is probably worth looking into rather than dealing with the HTTP connections and XML formatting directly...

http://wiki.apache.org/solr/Solrj

-Hoss
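To give a flavour of the SolrJ route, here is a rough sketch using the SolrJ API of that era (the URL and field names are placeholders; CommonsHttpSolrServer was SolrJ's HTTP client implementation at the time):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class AddDocExample {
        public static void main(String[] args) throws Exception {
            // point the client at the Solr instance (placeholder URL)
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");                     // unique key field
            doc.addField("url", "http://example.com/page");  // the URL stored as a String

            server.add(doc);   // send the document
            server.commit();   // make it searchable
        }
    }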
Field Filtering that Varies per Query
I have an interesting query filter problem. The application that I am working on contains a directory of user profiles that acts a lot like a social networking site. Each user profile is composed of a set of fields (first_name, last_name, bio, phone_number, etc). Every user profile field is associated with a privacy setting that optionally hides the data inside of the field from other users. The privacy settings allow people to show the field to nobody, only their contacts on the site, all logged-in users, or anyone.

This presents a problem while designing a search interface for the profiles. All of the filtering options I have seen allow for per-document filtering, but that is not sufficient. Since users have the option of selectively displaying portions of their profile to different users, we need to be able to remove individual fields from specific documents from consideration on a per-query basis.

The only idea we have had for resolving this is to construct an elaborate filter query to restrict the set of documents that the actual search is performed upon, but it has problems. We created a series of multivalue fields to store the user profile information:

first_name_contact
last_name_contact
etc...

In these fields we stored the privacy preference: anonymous, logged_in, or the set of contact ids that were allowed access. Then we created a query filter that was dynamic depending on the identity of the logged-in user, that looked something like this for a query for the term secret:

(first_name:secret AND (first_name_contact:anonymous OR first_name_contact:member)) OR
(last_name:secret AND (last_name_contact:anonymous OR last_name_contact:member)) OR
...

This was used as a filter query, with a standard query for the search term secret performed on the resulting filtered set of documents. This worked great if the search was a single word. However, if the user's search query contained multiple terms - for instance, 'my secret' - results might be inappropriately revealed. This is because matches might occur for one term in a public field while the other term might only exist in fields that are private to the user making the query. Because documents would be allowed into the filtered set of potential results in that case, they would be matched by the actual query. By executing a set of queries, a user could infer the contents of a protected document field even though they would be unable to view its contents.

We have been unable to think of a way to construct a query that overcomes this issue. Looking briefly at Lucene, there does not seem to be an obvious way to do the sort of field-based filtering that varies on a per-query basis that we need to do, even if we were willing to dig deeper and write some custom code.

Does anyone know of any tricks that we might use? Is it even possible to do this given how the low-level architecture of Lucene may or may not work? Any help would be greatly appreciated.

Thanks,

Nathan Woodhull
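One way to picture the dynamic filter query Nathan describes is as a small builder that expands the search term across each field and its *_contact privacy companion for the current viewer. The sketch below is purely illustrative (it follows the field-naming convention from the post, assumes the term is a single already-escaped token, and does not address the multi-term leakage problem; it only shows the construction):

    import java.util.Arrays;
    import java.util.List;

    public class PrivacyFilterBuilder {

        // profile fields that each have a parallel <field>_contact privacy field
        private static final List<String> FIELDS =
                Arrays.asList("first_name", "last_name", "bio", "phone_number");

        /** Build the filter query for one search term and one viewer. */
        public static String buildFilterQuery(String term, String viewerId) {
            StringBuilder fq = new StringBuilder();
            for (String field : FIELDS) {
                if (fq.length() > 0) {
                    fq.append(" OR ");
                }
                fq.append("(").append(field).append(":").append(term)
                  .append(" AND (").append(field).append("_contact:anonymous")
                  .append(" OR ").append(field).append("_contact:member")
                  // viewer-specific clause: the field is visible if the viewer's id
                  // is listed among the allowed contacts for this field
                  .append(" OR ").append(field).append("_contact:").append(viewerId)
                  .append("))");
            }
            return fq.toString();
        }

        public static void main(String[] args) {
            System.out.println(buildFilterQuery("secret", "user42"));
        }
    }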
Re: Simple Solr POST using java
Agree. I've been using Solrj on a production site for 9 months without any problems at all. You should probably give it a try instead of dealing with all those low-level details.

On Sun, May 11, 2008 at 4:14 AM, Chris Hostetter <[EMAIL PROTECTED]> wrote:

> : please post a snippet of Java code to add a document to the Solr index
> : that includes the URL reference as a String?
>
> you mean like this one... :)
>
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/util/SimplePostTool.java?view=markup
>
> FWIW: if you want to talk to Solr from a Java app, the SolrJ client API
> is probably worth looking into rather than dealing with the HTTP
> connections and XML formatting directly...
>
> http://wiki.apache.org/solr/Solrj
>
> -Hoss

--
Regards,

Cuong Hoang