Dave: Oh, I agree that a DB is a perfectly valid place to store the data, and you're absolutely right that it allows better interaction than flat files; you can ask questions of an RDBMS that you can't easily ask the disk ;). Storing to disk is simply an alternative if you're unwilling to deal with a DB.
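To make the "questions you can't easily ask the disk" point concrete, here is a minimal sketch using SQLite. The `listings` table and its columns are hypothetical, invented for illustration; the thread never specifies a schema.

```python
import sqlite3

# Hypothetical listings table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (id INTEGER PRIMARY KEY, make TEXT, year INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO listings (make, year, price) VALUES (?, ?, ?)",
    [("Honda", 2012, 8500.0), ("Honda", 2015, 12900.0), ("Ford", 2010, 6200.0)],
)

# "Average asking price per make" is awkward against flat files on disk,
# but one line of SQL against a database.
rows = conn.execute(
    "SELECT make, AVG(price) FROM listings GROUP BY make ORDER BY make"
).fetchall()
print(rows)  # [('Ford', 6200.0), ('Honda', 10700.0)]
```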
But the main point is you'll change your schema sometime and have to re-index. Having the data you're indexing stored locally in whatever form will allow much faster turn-around than re-crawling. Of course it'll result in out-of-date data, so you'll have to refresh somehow, sometime.

Erick

On Tue, Feb 21, 2017 at 6:07 PM, Dave <hastings.recurs...@gmail.com> wrote:
> Ha, I think I went to one of your training seminars in NYC maybe 4 years ago,
> Erick. I'm going to have to respectfully disagree about the RDBMS. It's such
> a well-known data format that you could hire a high school programmer to help
> with the DB end if you knew how to flatten it to Solr. Besides, it's easy to
> visualize and interact with the data before it goes to Solr. A JSON/NoSQL
> format would work just as well, but I really think a database has its place
> in a scenario like this.
>
>> On Feb 21, 2017, at 8:20 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> I'll add that I _guarantee_ you'll want to re-index the data as you
>> change your schema and the like. You'll be able to do that much more
>> quickly if the data is stored locally somehow.
>>
>> An RDBMS is not necessary, however. You could simply store the data on
>> disk in some format you could re-read and send to Solr.
>>
>> Best,
>> Erick
>>
>>> On Tue, Feb 21, 2017 at 5:17 PM, Dave <hastings.recurs...@gmail.com> wrote:
>>> B is the better option long term. Solr is meant for retrieving flat data,
>>> fast, not hierarchical. That's what a database is for, and trust me, you
>>> would rather have a real database at the end point. Each tool has a
>>> purpose: Solr can never replace a relational database, and a relational
>>> database could not replace Solr.
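Erick's "store the data locally in some re-readable form" suggestion can be sketched as simply as a JSON Lines file: one crawled listing per line. After a schema change, you re-read the file and re-send documents to Solr instead of re-crawling. The listing shape and field names here are made up for illustration, and the actual POST to Solr's /update handler is left out.

```python
import io
import json

def save_listings(listings, fh):
    """Append raw crawled listings to a local JSON Lines store."""
    for doc in listings:
        fh.write(json.dumps(doc) + "\n")

def reload_listings(fh):
    """Re-read the local store; each line becomes one Solr-ready document."""
    return [json.loads(line) for line in fh if line.strip()]

# Using an in-memory buffer here; on disk this would be an append-only file.
store = io.StringIO()
save_listings([{"id": "cl-123", "make": "Honda", "price": 8500}], store)
store.seek(0)
docs = reload_listings(store)
# `docs` could now be sent to Solr's /update handler as a JSON array.
print(docs[0]["id"])  # cl-123
```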
>>> Start with the slow model (database) for control/display and enhance with
>>> the fast model (Solr) for retrieval/search.
>>>
>>>> On Feb 21, 2017, at 7:57 PM, Robert Hume <rhum...@gmail.com> wrote:
>>>>
>>>> To learn how to properly use Solr, I'm building a little experimental
>>>> project with it to search for used car listings.
>>>>
>>>> Car listings appear in a variety of different places ... central places
>>>> like Craigslist and also many, many individual used-car dealership websites.
>>>>
>>>> I am wondering, should I:
>>>>
>>>> (a) deploy a Solr search engine and build individual indexers for every
>>>> type of website I want to find listings on?
>>>>
>>>> or
>>>>
>>>> (b) build my own database to store car listings, and then build services
>>>> that scrape data from different sites and feed entries into the database;
>>>> then point my Solr search at my database, one simple source of listings?
>>>>
>>>> My concerns are:
>>>>
>>>> With (a) ... I have to be smart enough to understand all those different
>>>> data sources and remove/update listings when they change; will this be
>>>> harder to do with custom Solr indexers than writing something from scratch?
>>>>
>>>> With (b) ... I'm maintaining a huge database of all my listings, which
>>>> seems redundant; Google doesn't make a *copy* of everything on the
>>>> internet, it just knows it's there. Is maintaining my own database a bad
>>>> design?
>>>>
>>>> Thanks for reading!
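Dave's point that "Solr is meant for retrieving flat data, not hierarchical" is the crux of option (b): whatever hierarchical shape a listing has in the database or scraped JSON, it has to be flattened into a single flat document before indexing. A minimal sketch of that flattening step, with a hypothetical nested listing and invented field names:

```python
def flatten_listing(listing, prefix=""):
    """Collapse nested dicts into flat underscore-joined keys,
    since a Solr document is a flat set of fields."""
    flat = {}
    for key, value in listing.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_listing(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

# Hypothetical scraped listing as it might sit in the database.
listing = {
    "id": "dealer-42",
    "car": {"make": "Ford", "year": 2010},
    "seller": {"name": "Acme Autos", "city": "Albany"},
}
flat = flatten_listing(listing)
print(flat)
# {'id': 'dealer-42', 'car_make': 'Ford', 'car_year': 2010,
#  'seller_name': 'Acme Autos', 'seller_city': 'Albany'}
```

The flat dict is what you would send to Solr; the nested original stays in the database as the source of truth, which is exactly the slow-model/fast-model split described above.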