Re: Sending Documents via SolrServer as MapReduce Jobs at Solrj

Otis Gospodnetic Fri, 05 Jul 2013 15:50:16 -0700

Furkan,

It's perfectly fine.  Some people have small indices and lots of
queries, some have large indices and very few queries, and lucky ones
have very large indices and lots of queries at the same time.


We once helped a client take their indexing down from many hours to a
couple of minutes by using Hadoop MapReduce.  It made experimentation
with indexing, relevance, etc. easy, while before that it was nearly
impossible.

What you are looking to do is perfectly fine.  Saw your comment in
JIRA - go for it, but I suggest you look at what's been done before
(beyond Solr!) and learn a bit first, get some ideas, and then
implement.  For example, I'd look at
https://github.com/elasticsearch/elasticsearch-hadoop and learn from
any good ideas you see there (or what look like bad decisions if you
see any!) and then implement this for Solr(Cloud).

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Fri, Jul 5, 2013 at 5:58 PM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> Ok, I know that it is really unnecessary to start with a complex design. On
> the other hand, if you have the resources and the need, and there is a
> bottleneck in your design, it would be a real failure not to plan a new design.
>
> We have more than terabytes of data and we have dedicated some developers
> to the Hadoop and Hbase side. There are many machines in our network
> architecture (currently we are making some tests and improvements). Data is
> stored in a distributed way and indexed in SolrCloud, also distributed.
>
> However, there is a bottleneck in this architecture. Taking data from
> Hbase and sending it to SolrCloud is not as fast as the other parts of the
> system. If we don't resolve that problem and keep the current architecture,
> I think that will be a design fault.
>
> That's why I asked this question, and it seems reasonable to me. I know
> that some people are storing Lucene indexes in Hbase and that is the
> correct design for them. Sending data via Solrj with MapReduce jobs may be
> another good option for our needs, and I think there may be some people in
> the community who have tried that or at least thought about it. Thanks
> for the answers.
>
> 2013/7/5 Roman Chyla <roman.ch...@gmail.com>
>
>> I don't want to sound negative, but I think it is a valid question to
>> consider - a lack of information and a certain mental rigidity may make
>> it sound bad - first of all, it is probably not for a few gigabytes of data,
>> and I can imagine that building indexes on the side where the data lives is
>> much faster/cheaper than sending data to solr - if we think the index is the
>> product of the map, then the 'reduce' part may be this
>> http://wiki.apache.org/solr/MergingSolrIndexes
>>
>> I don't really know enough about CloudSolrServer and how to fit the cloud
>> there
>>
>> roman
>>
>> On Fri, Jul 5, 2013 at 12:23 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>>
>> > Software developers are sometimes compensated based on the degree of
>> > complexity that they deal with.
>> >
>> > And managers are sometimes compensated based on the number of people they
>> > manage, as well as the degree of complexity of what they manage.
>> >
>> > And... training organizations can charge more and have a larger pool of
>> > eager customers when the subject matter has higher complexity.
>> >
>> > And... consultants and contractors will be in higher demand and able to
>> > charge more, based on the degree of complexity that they have mastered.
>> >
>> > So, more complexity results in greater opportunity for higher income!
>> >
>> > (Oh, and, writers and book authors have more to write about and readers
>> > are more eager to purchase those writings as well, especially if the
>> > subject matter is constantly changing.)
>> >
>> > Somebody please remind me I said this any time you catch me trying to
>> > argue for Solr to be made simpler and easier to use!
>> >
>> > -- Jack Krupansky
>> >
>> > -----Original Message----- From: Walter Underwood
>> > Sent: Friday, July 05, 2013 12:11 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Sending Documents via SolrServer as MapReduce Jobs at Solrj
>> >
>> >
>> > Why is it better to require another large software system (Hadoop), when
>> > it works fine without it?
>> >
>> > That just sounds like more stuff to configure, misconfigure, and cause
>> > problems with indexing.
>> >
>> > wunder
>> >
>> > On Jul 5, 2013, at 4:48 AM, Furkan KAMACI wrote:
>> >
>> >> We are using Nutch to crawl web sites and it stores documents in Hbase.
>> >> Nutch uses Solrj to send documents to be indexed. We have Hadoop in our
>> >> ecosystem as well. I think that there should be an implementation in Solrj
>> >> that sends documents (via CloudSolrServer or something like that) as
>> >> MapReduce jobs. Is there any implementation for it or is it not a good idea?
>> >>
>> >
>> >
>> >
>>
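
As an illustration of the idea discussed in this thread (not code from any of
the participants): a map task that sends documents to Solr would normally
buffer them and send them in batches rather than issue one request per
document. Below is a minimal Java sketch of that batching pattern only.
BatchingIndexer and its sink are hypothetical names; the sink stands in for a
real Solrj call, and no Hadoop or Solrj API is used here. Calling add() from a
Mapper's map() and flush() from its cleanup() is an assumption about how it
would be wired in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/**
 * Buffers documents produced by one task and hands them to a sink in
 * fixed-size batches. The sink is a placeholder (assumption) for whatever
 * actually sends a batch of documents to Solr.
 */
class BatchingIndexer {
    private final int batchSize;
    private final Consumer<List<Map<String, String>>> sink;
    private final List<Map<String, String>> buffer = new ArrayList<>();

    BatchingIndexer(int batchSize, Consumer<List<Map<String, String>>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Intended to be called once per input record. */
    void add(Map<String, String> doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Intended to be called at the end of the task to send a partial batch. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Copy the buffer so the sink owns its batch after we clear ours.
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

The batch size trades memory per task against the number of requests sent; the
right value depends on document size and cluster load and is not something
this thread settles.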
