Re: One index or multiple?

Walter Underwood Sat, 06 Oct 2012 17:05:30 -0700

Right. You define three update handlers, something like /update-animal, 
/update-mineral, and /update-vegetable. Each one has a separate DIH config. 
Each config deletes documents of that type and loads documents of that type.
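
As a rough sketch of that layout (not a tested setup), the Python below maps each type to its handler path and delete query and builds the full-import request for it. The handler paths are the ones named above; the host, port, and core name my_index are assumptions, as are the helper names. The delete query itself would live in each DIH config: the entity attribute preImportDeleteQuery is what DIH uses in place of *:* when clean is true.

```python
from urllib.parse import urlencode

# Assumed location of the Solr core; adjust host, port, and core name.
SOLR_CORE_URL = "http://localhost:8983/solr/my_index"

# The three document types discussed in this thread.
DOC_TYPES = ("animal", "mineral", "vegetable")


def handler_path(doc_type):
    """Handler path for one type, e.g. /update-animal (hypothetical naming)."""
    return "/update-" + doc_type


def delete_query(doc_type):
    """Query that matches only documents of this type, e.g. type:animal."""
    return "type:" + doc_type


def full_import_url(doc_type, base_url=SOLR_CORE_URL):
    """Build the DIH full-import request for one type.

    clean=true asks DIH to delete before loading; the delete query it
    uses is assumed to be set per entity in that handler's DIH config.
    """
    params = urlencode({"command": "full-import", "clean": "true", "commit": "true"})
    return base_url + handler_path(doc_type) + "?" + params


if __name__ == "__main__":
    for doc_type in DOC_TYPES:
        print(doc_type, delete_query(doc_type), full_import_url(doc_type))
```

Each handler's config would then carry the matching delete query, so a full import of one type leaves the other types alone.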


You will not want to run them at the same time, because a commit in one will 
commit all the pending changes from any other one. It would be much less 
confusing to run them separately.
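
One way to keep the imports from overlapping is to start each one only after the previous one has gone idle. This is a minimal sketch under assumptions: the handler URLs are hypothetical, and it assumes the DIH status command returns JSON with a "status" field that reads "busy" while an import is running. The fetch function is injectable so the loop can be tried without a live Solr.

```python
import json
import time
from urllib.request import urlopen


def http_get_json(url):
    """Fetch a URL and parse the body as JSON."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_imports_in_turn(handler_urls, fetch=http_get_json,
                        poll_seconds=5.0, sleep=time.sleep):
    """Run one full import per handler, strictly one after another.

    Assumes each handler answers command=status with a JSON body whose
    "status" field is "busy" while an import is still in progress.
    """
    finished = []
    for url in handler_urls:
        # Start the import for this type.
        fetch(url + "?command=full-import&clean=true&commit=true&wt=json")
        # Wait until this handler reports it is no longer busy.
        while fetch(url + "?command=status&wt=json").get("status") == "busy":
            sleep(poll_seconds)
        finished.append(url)
    return finished
```

A cron job could call run_imports_in_turn with the three handler URLs every 15 minutes or so; because each import waits for the one before it, a commit from one never lands in the middle of another.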

wunder

On Oct 6, 2012, at 2:30 PM, Erick Erickson wrote:

> Sure, you need to define the appropriate delete query for each DIH entry.
> 
> Best
> Erick
> 
> On Fri, Oct 5, 2012 at 5:40 PM, Billy Newman <newman...@gmail.com> wrote:
>> Does DIH support only deleting/re-indexing docs of a certain type?
>> 
>> I.E. can I have a DIH for type:vegetable and another for type:mineral
>> and each only deletes/recreates the right types?
>> 
>> Thanks.
>> 
>> On Fri, Oct 5, 2012 at 1:04 PM, Walter Underwood <wun...@wunderwood.org> 
>> wrote:
>>> Using the same unique key doesn't handle documents which disappear from one 
>>> indexing to the next.
>>> 
>>> Instead, add a field for the type of item, like type:animal, 
>>> type:vegetable, or type:mineral. Then the query used to clean up before 
>>> indexing can delete all items of that type.
>>> 
>>> wunder
>>> 
>>> On Oct 5, 2012, at 12:00 PM, Erick Erickson wrote:
>>> 
>>>> DIH always gives me indigestion.....
>>>> 
>>>> Couple of things:
>>>> See the 'clean' parameter here for full import:
>>>> http://wiki.apache.org/solr/DataImportHandler
>>>> it defaults to true. I think if you set it to "false"
>>>> _and_ assuming that your <uniqueKey> is
>>>> defined, it should work OK.
>>>> 
>>>> The other approach would be to control the
>>>> indexing of your XML from, say, a SolrJ program
>>>> combined with a cron job....
>>>> 
>>>> Does that work?
>>>> Erick
>>>> 
>>>> On Fri, Oct 5, 2012 at 2:39 PM, Billy Newman <newman...@gmail.com> wrote:
>>>>> Erick,
>>>>> 
>>>>> I did mention using the DIH to index the first two datasets, that is
>>>>> where the root of my problem lies.
>>>>> 
>>>>> I do see the benefit of one index.  However the question still
>>>>> remains, can I use the DIH to index xml from data set 1 and 2, every
>>>>> 15 minutes or so (full index) without wiping out all the indexed data
>>>>> in the index from data set 3?
>>>>> 
>>>>> I.E. From a couple of quick tests the DIH full import destroys all
>>>>> data in the index before it repopulates it.  Not sure I can just have
>>>>> it destroy/re-index data of a certain type.  Basically DIH full-import
>>>>> on my_index for type 'dataset1', and DIH full-import on my_index for
>>>>> type 'dataset2'.  Both full-imports leaving alone the type 'dataset3'
>>>>> data in the index.
>>>>> 
>>>>> Any ideas?
>>>>> 
>>>>> Thanks,
>>>>> Billy
>>>>> 
>>>>> On Fri, Oct 5, 2012 at 10:42 AM, Erick Erickson <erickerick...@gmail.com> 
>>>>> wrote:
>>>>>> The very first question is "what form are your XML docs in?"
>>>>>> Solr does NOT index arbitrary XML, so I'm guessing
>>>>>> you're using DIH and some of the xml stuff there. Do note
>>>>>> that the XPath support there is a subset of the full capabilities....
>>>>>> 
>>>>>> Second, I'd recommend you just put it all in a single index, it'll be
>>>>>> simpler. Index a field indicating which of your three sources
>>>>>> the doc belongs to. Then you can group (aka Field Collapse) by
>>>>>> source and your result sets will contain the top N docs from each
>>>>>> type and you can do whatever you want with them at the app
>>>>>> level. See: http://wiki.apache.org/solr/FieldCollapsing
>>>>>> 
>>>>>> By including a type, you can also do nifty things like delete all the
>>>>>> records for a particular type by query.
>>>>>> 
>>>>>> Best
>>>>>> Erick
>>>>>> 
>>>>>> 
>>>>>> On Fri, Oct 5, 2012 at 11:22 AM, Billy Newman <newman...@gmail.com> 
>>>>>> wrote:
>>>>>>> I am looking into Solr to index a few of my data sets, 3 to be exact.
>>>>>>> 
>>>>>>> The first 2 are really small xml docs retrieved via url, ~300 records
>>>>>>> each.  The data behind both of these changes very frequently, about
>>>>>>> every 5 minutes.  The data itself does not have timestamps, so delta-import
>>>>>>> using DIH would not work (at least I don't think it would work).  I am
>>>>>>> thinking about just re-indexing these 2 data sources every 15 minutes
>>>>>>> or so to keep the indexes up to date.
>>>>>>> 
>>>>>>> The 3rd data set is a lot more complicated; I will probably have to
>>>>>>> use SolrJ and write some custom code to handle
>>>>>>> inserts/updates/deletes.
>>>>>>> 
>>>>>>> I need to be able to search all the data sets once they are indexed in
>>>>>>> one search.
>>>>>>> 
>>>>>>> A couple options:
>>>>>>> 
>>>>>>> 1.  Store the data from all 3 datasets in different indexes, allowing
>>>>>>> the DIH import handler to re-index datasets 1 and 2 without affecting
>>>>>>> indexed data from data set 3.  I am not sure this is advisable, or
>>>>>>> whether it is even possible to search across multiple cores.
>>>>>>> 
>>>>>>> 2. Store all the data from all 3 datasets in the same index.  This
>>>>>>> raises the question of how to re-index datasets 1 and 2 using a DIH
>>>>>>> full-import and not lose indexed data from data set 3.
>>>>>>> 
>>>>>>> Just starting with Solr so please go easy ;).  Thanks in advance.
>>>>>>> 
>>>>>>> Billy
>>> 
>>> --
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> 
>>> 
>>> 

--
Walter Underwood
wun...@wunderwood.org
