Re: Solr Distributed Search vs Hadoop

2011-12-28 Thread Ted Dunning
This copying is a bit overstated here because of the way that small segments are merged into larger segments. Those larger segments are then copied much less often than the smaller ones. While you can wind up with lots of copying in certain extreme cases, it is quite rare. In particular, if you
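The point about merge cost can be illustrated with a toy model. The sketch below is a simplified level-based merge simulation, not Lucene's actual merge policy: it assumes every flush writes a segment of one unit, and that a hypothetical `merge_factor` of equally sized segments is always merged into one segment at the next level. The function and variable names are invented for this illustration.

```python
def simulate_merges(num_flushes, merge_factor=10):
    """Toy model: count merges per level and total units rewritten by merging.

    Assumption: a level-i segment holds merge_factor**i units, and a merge
    fires as soon as merge_factor segments accumulate at one level.
    """
    segments_at_level = [0]   # segments currently waiting at each level
    merges_at_level = []      # merges_at_level[i] = merges of level-i segments
    units_copied = 0          # total data rewritten by all merges

    for _ in range(num_flushes):
        segments_at_level[0] += 1
        level = 0
        # Cascade upward while a level has filled up.
        while segments_at_level[level] == merge_factor:
            segments_at_level[level] = 0
            if level + 1 == len(segments_at_level):
                segments_at_level.append(0)
                merges_at_level.append(0)
            segments_at_level[level + 1] += 1
            merges_at_level[level] += 1
            # The merge rewrites merge_factor segments of merge_factor**level units.
            units_copied += merge_factor ** (level + 1)
            level += 1

    return merges_at_level, units_copied


if __name__ == "__main__":
    merges, copied = simulate_merges(1000, merge_factor=10)
    print("merges per level:", merges)
    print("units copied per unit written:", copied / 1000)
```

In this model the small segments are merged often and the large ones rarely, so each unit of data is rewritten only about once per level, which grows with the logarithm of the index size rather than with the number of flushes.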

Re: Solr Distributed Search vs Hadoop

2011-12-28 Thread Lance Norskog
Here is an example of schema design: a PDF file of 5 MB might have maybe 50 KB of actual text. The Solr ExtractingRequestHandler will find that text and only index that. If you set the field to stored=true, the 5 MB will be saved. If stored=false, the PDF is not saved. Instead, you would store a link to

Re: Solr Distributed Search vs Hadoop

2011-12-23 Thread Nick Vincent
For data of this size you may want to look at something like Apache Cassandra, which is made specifically to handle data at this kind of scale across many machines. You can still use Hadoop to analyse and transform the data in a performant manner; however, it's probably best to do some research on

Re: Solr Distributed Search vs Hadoop

2011-12-20 Thread Ted Dunning
Well, that begins to not look so much like a Solr/Lucene problem. Overall data is moderately large (TBs to tens of TBs) for Lucene, and the individual user profiles are distinctly large to be storing in Lucene. If there is part of the profile that you might want to search, that would be appropria

Re: Solr Distributed Search vs Hadoop

2011-12-20 Thread Alireza Salimi
Well, actually we haven't started the actual project yet. But it will probably have to handle the data of millions of users, and a rough estimate for each user's data would be something around 5 MB. The other problem is that this data will change very often. I hope I answered your question
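As a back-of-the-envelope check on these figures, the sketch below multiplies a user count by the roughly 5 MB per user mentioned in this message. The user counts are illustrative assumptions ("millions" is all the message says), and the function name is invented here; the result covers raw data only, with no allowance for replication or index overhead.

```python
def total_data_tb(num_users, mb_per_user=5):
    """Raw data volume in decimal terabytes (1 TB = 1,000,000 MB)."""
    return num_users * mb_per_user / 1_000_000


if __name__ == "__main__":
    # Hypothetical user counts spanning "millions of users".
    for users in (1_000_000, 5_000_000, 10_000_000):
        print(f"{users:>10,} users -> {total_data_tb(users):.0f} TB")
```

One million users at 5 MB each comes to about 5 TB, and ten million to about 50 TB.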

Re: Solr Distributed Search vs Hadoop

2011-12-20 Thread Ted Dunning
You didn't mention how big your data is or how you create it. Hadoop would mostly be used in the preparation of the data or the off-line creation of indexes.

On Tue, Dec 20, 2011 at 12:28 PM, Alireza Salimi wrote:
> Hi,
>
> I have a basic question, let's say we're going to have a very very huge set

Solr Distributed Search vs Hadoop

2011-12-20 Thread Alireza Salimi
Hi, I have a basic question. Let's say we're going to have a very, very large set of data, such that we will certainly need many servers (tens or hundreds of servers). We will also need failover. Now the question is whether we should use Hadoop, or whether using Solr Distributed Search with shards would be en