Re: One index or multiple?

Billy Newman Sun, 07 Oct 2012 19:15:01 -0700

Erick,

Thanks for all the help, what a great community.


Unfortunately, the 2 data sets I want to use the DIH for change a ton and are 
changed by a web app accessible to a number of people, as well as a few other 
internal server applications. Since the data sets are small, I figured 
re-indexing them every so often would be easiest. Even with re-indexing, a SolrJ 
app may be best. 
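
A re-indexing pass that avoids blanket deletes, along the lines Erick
suggests below, could be planned roughly like this. This is only a sketch in
plain Python, not SolrJ: the function name and the idea of fetching the
currently indexed ids up front are assumptions for illustration. Since the
data has no timestamps, every source record is re-sent (documents with the
same uniqueKey are overwritten) and only ids that have vanished from the
source are deleted.

```python
def plan_sync(indexed_ids, source_ids):
    """Plan one re-index pass for a single dataset.

    indexed_ids: ids currently in the index for this dataset (hypothetical input)
    source_ids:  ids present in the freshly fetched source data
    Returns (ids_to_delete, ids_to_upsert), both sorted for stable output.
    """
    indexed = set(indexed_ids)
    source = set(source_ids)
    # Only documents whose source record has disappeared need deleting.
    ids_to_delete = sorted(indexed - source)
    # No timestamps to diff on, so re-send everything; adds overwrite by key.
    ids_to_upsert = sorted(source)
    return ids_to_delete, ids_to_upsert
```

With this plan there is never a moment where the whole dataset has been
cleared, so a commit issued by another application exposes at most a
slightly stale view rather than an empty one.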

Thanks again for all the help and advice!

Billy

Sent from my iPhone

On Oct 7, 2012, at 10:27 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> My personal approach would be to take DIH
> out of the mix entirely and do the whole thing in SolrJ
> where you can exercise control to whatever degree
> you want. DIH is a fine tool, but sometimes it's wrong
> for a particular situation.
> 
> Here's some code to get you started if you want to
> go that route.
> 
> http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
> 
> Because you're entirely right. If the commits are
> interleaved with the deletes, you'll have partial data showing.
> 
> But the root of your issue is that you're doing
> blanket deletes of a type. You only have to do this
> if
> 1> you have deletes in your DB
> 2> you have no way of deleting _just_ those
>    documents that correspond to the rows
>    you've deleted
> Otherwise, if you could delete only the docs that
> correspond to the deleted rows in your DB, then
> your data view would be consistent...
> 
> Best
> Erick
> 
> On Sun, Oct 7, 2012 at 10:00 AM, Billy Newman <newman...@gmail.com> wrote:
>> Walter,
>> 
>> Thanks!  You bring up a very important 'commit' problem which I had
>> not thought about.  So I am running a DIH that is wiping out part of
>> the index (ie all animals), then re-indexing/re-importing.  I have
>> another DIH that is wiping out part of the index (minerals), then
>> re-indexing/re-importing.
>> 
>> I see this problem (which I think you already realized):
>> 1. Index is full and people are querying.
>> 2. DIH for animals starts running and wipes out all animals
>> 3. DIH for minerals starts running and wipes out all minerals.
>> 4. DIH for animals finishes, and commits.
>> 5. User queries for minerals, which might return 0 or a subset of
>> results, because the animals DIH 'committed' the changes made by the
>> mineral DIH (let's assume only the clear had happened in the mineral DIH
>> when the animal DIH committed).
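
The sequence above can be reproduced with a toy model. This is not Solr
code: `ToyIndex` and `interleaved_run` are made-up names, and the model
simply assumes, as Walter describes, that a commit publishes every pending
change in the index regardless of which importer made it.

```python
class ToyIndex:
    """Toy stand-in for one index with a single shared set of pending changes."""

    def __init__(self, docs):
        self.visible = dict(docs)  # what searchers currently see
        self.pending = dict(docs)  # uncommitted working state, shared by all writers

    def delete_type(self, doc_type):
        # Models a blanket delete-by-query on the type field.
        self.pending = {k: v for k, v in self.pending.items() if v != doc_type}

    def add(self, doc_id, doc_type):
        self.pending[doc_id] = doc_type

    def commit(self):
        # A commit publishes ALL pending changes, not just the caller's.
        self.visible = dict(self.pending)

    def search(self, doc_type):
        return sorted(k for k, v in self.visible.items() if v == doc_type)


def interleaved_run():
    index = ToyIndex({"a1": "animal", "m1": "mineral"})  # step 1: index is full
    index.delete_type("animal")    # step 2: animals import clears animals
    index.delete_type("mineral")   # step 3: minerals import clears minerals
    index.add("a1", "animal")      # animals import reloads its data
    index.commit()                 # step 4: animals import commits
    # step 5: minerals were cleared but not yet reloaded when the commit hit
    return index.search("animal"), index.search("mineral")
```

Running the importers one after another, or committing once after all of
them finish, avoids the window in which the mineral search comes back empty.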
>> 
>> To further complicate things I have a third SolrJ application that
>> will be processing another dataset and updating/committing to the
>> index.  Is there a recommended way to handle multiple applications
>> that are wiping out and writing to part of the index, such that the
>> commits do not commit at an inopportune time (ie commit by one
>> application right after another application just wiped part of the
>> index before repopulating it)?
>> 
>> I need to update the index every so often (~30 minutes).  I could
>> write an app that chains the other 'indexer' apps (DIH1, DIH2,
>> SolrJApp1) together such that they run serially and then do one commit
>> at the end.  Not too bad, but wondering if there is anything I can
>> take advantage of in Solr that would help with this problem.  I am
>> using Solr 4.0-BETA if that makes a difference.
>> 
>> Thanks again!
>> 
>> Billy
>> 
>> On Sat, Oct 6, 2012 at 6:05 PM, Walter Underwood <wun...@wunderwood.org> 
>> wrote:
>>> Right. You define three update handlers, something like /update-animal, 
>>> /update-mineral, and /update-vegetable. Each one has a separate DIH config. 
>>> Each config deletes documents of that type and loads documents of that type.
>>> 
>>> You will not want to run them at the same time, because a commit in one 
>>> will commit all the pending changes from any other one. It would be much 
>>> less confusing to run them separately.
>>> 
>>> wunder
>>> 
>>> On Oct 6, 2012, at 2:30 PM, Erick Erickson wrote:
>>> 
>>>> Sure, you need to define the appropriate delete query for each DIH entry.
>>>> 
>>>> Best
>>>> Erick
>>>> 
>>>> On Fri, Oct 5, 2012 at 5:40 PM, Billy Newman <newman...@gmail.com> wrote:
>>>>> Does DIH support only deleting/re-indexing docs of a certain type?
>>>>> 
>>>>> I.E. can I have a DIH for type:vegetable and another for type:mineral
>>>>> and each only deletes/recreates the right types?
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> On Fri, Oct 5, 2012 at 1:04 PM, Walter Underwood <wun...@wunderwood.org> 
>>>>> wrote:
>>>>>> Using the same unique key doesn't handle documents which disappear from 
>>>>>> one indexing to the next.
>>>>>> 
>>>>>> Instead, add a field for the type of item, like type:animal, 
>>>>>> type:vegetable, or type:mineral. Then the query used to clean up before 
>>>>>> indexing can delete all items of that type.
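
As a rough illustration, the clean-up query for one type can be sent as
Solr's XML delete-by-query update message. The helper name is made up, and
the field name `type` is taken from the example values above; a real value
containing query syntax characters would need escaping first.

```python
import xml.etree.ElementTree as ET


def delete_by_type_message(doc_type, field="type"):
    """Build a delete-by-query update message for every doc of one type."""
    root = ET.Element("delete")
    # The query is a normal Solr query string, e.g. type:animal
    ET.SubElement(root, "query").text = f"{field}:{doc_type}"
    return ET.tostring(root, encoding="unicode")
```

The resulting string would be posted to the update handler before
re-adding that type's documents, with the commit held back until the adds
are done.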
>>>>>> 
>>>>>> wunder
>>>>>> 
>>>>>> On Oct 5, 2012, at 12:00 PM, Erick Erickson wrote:
>>>>>> 
>>>>>>> DIH always gives me indigestion.....
>>>>>>> 
>>>>>>> Couple of things:
>>>>>>> See the 'clean' parameter here for full import:
>>>>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>>>> it defaults to true. I think if you set it to "false"
>>>>>>> _and_ assuming that your <uniqueKey> is
>>>>>>> defined, it should work OK.
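
For reference, a full import with 'clean' switched off is requested by
passing the parameter on the handler URL. The sketch below only builds that
URL; the host, port, and handler path in the usage note are placeholders
and will differ per installation.

```python
from urllib.parse import urlencode


def full_import_url(handler_url, clean=True):
    """Build a DIH full-import request URL with an explicit 'clean' setting."""
    params = {
        "command": "full-import",
        # clean=true (the default) wipes the index before importing.
        "clean": "true" if clean else "false",
    }
    return f"{handler_url}?{urlencode(params)}"
```

For example, `full_import_url("http://localhost:8983/solr/dataimport", clean=False)`
gives a URL ending in `?command=full-import&clean=false`.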
>>>>>>> 
>>>>>>> The other approach would be to control the
>>>>>>> indexing of your XML from, say, a SolrJ program
>>>>>>> combined with a cron job....
>>>>>>> 
>>>>>>> Does that work?
>>>>>>> Erick
>>>>>>> 
>>>>>>> On Fri, Oct 5, 2012 at 2:39 PM, Billy Newman <newman...@gmail.com> 
>>>>>>> wrote:
>>>>>>>> Erick,
>>>>>>>> 
>>>>>>>> I did mention using the DIH to index the first two datasets, that is
>>>>>>>> where the root of my problem lies.
>>>>>>>> 
>>>>>>>> I do see the benefit of one index.  However the question still
>>>>>>>> remains, can I use the DIH to index xml from data set 1 and 2, every
>>>>>>>> 15 minutes or so (full index) without wiping out all the indexed data
>>>>>>>> in the index from data set 3?
>>>>>>>> 
>>>>>>>> I.E. From a couple of quick tests the DIH full import destroys all
>>>>>>>> data in the index before it repopulates it.  Not sure I can just have
>>>>>>>> it destroy/re-index data of a certain type.  Basically DIH full-import
>>>>>>>> on my_index for type 'dataset1', and DIH full-import on my-index for
>>>>>>>> type 'dataset2'.  Both full-imports leaving alone the type 'dataset3'
>>>>>>>> data in the index.
>>>>>>>> 
>>>>>>>> Any ideas?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Billy
>>>>>>>> 
>>>>>>>> On Fri, Oct 5, 2012 at 10:42 AM, Erick Erickson 
>>>>>>>> <erickerick...@gmail.com> wrote:
>>>>>>>>> The very first question is "what form are your XML docs in?"
>>>>>>>>> Solr does NOT index arbitrary XML, so I'm guessing
>>>>>>>>> you're using DIH and some of the xml stuff there. Do note
>>>>>>>>> that the XSLT is a subset of the full capabilities....
>>>>>>>>> 
>>>>>>>>> Second, I'd recommend you just put it all in a single index, it'll be
>>>>>>>>> simpler. Index a field indicating which of your three sources
>>>>>>>>> the doc belongs to. Then you can group (aka Field Collapse) by
>>>>>>>>> source and your result sets will contain the top N docs from each
>>>>>>>>> type and you can do whatever you want with them at the app
>>>>>>>>> level. See: http://wiki.apache.org/solr/FieldCollapsing
>>>>>>>>> 
>>>>>>>>> By including a type, you can also do nifty things like delete all the
>>>>>>>>> records for a particular type by query.
>>>>>>>>> 
>>>>>>>>> Best
>>>>>>>>> Erick
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, Oct 5, 2012 at 11:22 AM, Billy Newman <newman...@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>>> I am looking into Solr to index a few of my data sets, 3 to be exact.
>>>>>>>>>> 
>>>>>>>>>> The first 2 are really small xml docs retrieved via url, ~300 records
>>>>>>>>>> each.  The data behind both of these changes very frequently (every
>>>>>>>>>> ~5 minutes).  The data itself does not have timestamps so delta-import
>>>>>>>>>> using DIH would not work (at least I don't think it would work).  I am
>>>>>>>>>> thinking about just re-indexing these 2 data sources every 15 minutes
>>>>>>>>>> or so to keep the indexes up to date.
>>>>>>>>>> 
>>>>>>>>>> The 3rd data set is a lot more complicated, so I will probably
>>>>>>>>>> have to use SolrJ and write some custom code to handle
>>>>>>>>>> inserts/updates/deletes.
>>>>>>>>>> 
>>>>>>>>>> I need to be able to search all the data sets once they are indexed
>>>>>>>>>> in one search.
>>>>>>>>>> 
>>>>>>>>>> A couple options:
>>>>>>>>>> 
>>>>>>>>>> 1.  Store the data from all 3 datasets in different indexes, allowing
>>>>>>>>>> the DIH import handler to re-index datasets 1 and 2 without affecting
>>>>>>>>>> indexed data from data set 3.   I am not sure this is advisable, or
>>>>>>>>>> whether it is even possible to search across multiple cores.
>>>>>>>>>> 
>>>>>>>>>> 2. Store all the data from all 3 datasets in the same index.  Yet this
>>>>>>>>>> brings the question of how to re-index datasets 1 and 2 using a DIH
>>>>>>>>>> full-import and not lose indexed data from data set 3.
>>>>>>>>>> 
>>>>>>>>>> Just starting with Solr so please go easy ;).  Thanks in advance.
>>>>>>>>>> 
>>>>>>>>>> Billy
>>>>>> 
>>>>>> --
>>>>>> Walter Underwood
>>>>>> wun...@wunderwood.org
>>> 
>>> --
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> 
>>> 
>>>
