Erick,

I did mention using the DIH to index the first two datasets; that is
where the root of my problem lies.

I do see the benefit of one index.  However, the question still
remains: can I use the DIH to index XML from data sets 1 and 2 every
15 minutes or so (a full-import) without wiping out all the indexed data
in the index from data set 3?

That is, from a couple of quick tests, the DIH full-import deletes all
data in the index before it repopulates it.  I am not sure whether I
can have it delete and re-index only data of a certain type.  Basically,
I want a DIH full-import on my_index for type 'dataset1' and a DIH
full-import on my_index for type 'dataset2', with both full-imports
leaving the type 'dataset3' data in the index alone.
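
A rough, untested sketch of the per-type flow described above, only to
make the goal concrete.  It builds a delete-by-query message for one
type and a DIH full-import URL with clean=false so the handler does not
wipe the whole index first.  The host, the /dataimport handler path, the
field name "type", and the assumption that each DIH entity is named
after its type are all made up for illustration; whether clean=false
behaves this way on a given setup would need checking.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

# Assumed location of the core; adjust to the real host and core name.
SOLR_BASE = "http://localhost:8983/solr/my_index"


def delete_by_type_xml(doc_type):
    """Build a Solr XML delete-by-query message for one value of the
    (assumed) 'type' field.  It would be POSTed to SOLR_BASE + '/update'."""
    return "<delete><query>type:{}</query></delete>".format(escape(doc_type))


def full_import_url(entity, clean=False):
    """Build a DIH full-import URL for a single entity.

    clean=false asks DIH not to delete existing documents before the
    import; the entity name matching the type is an assumption.
    """
    params = {
        "command": "full-import",
        "entity": entity,
        "clean": str(clean).lower(),
        "commit": "true",
    }
    return SOLR_BASE + "/dataimport?" + urlencode(params)


if __name__ == "__main__":
    # Refresh dataset1 and dataset2 only; dataset3 is never touched.
    for doc_type in ("dataset1", "dataset2"):
        print(delete_by_type_xml(doc_type))
        print(full_import_url(doc_type))
```

The script only prints what would be sent; actually posting the delete
and triggering the import is left out on purpose.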

Any ideas?

Thanks,
Billy

On Fri, Oct 5, 2012 at 10:42 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> The very first question is "what form are your XML docs in?"
> Solr does NOT index arbitrary XML, so I'm guessing
> you're using DIH and some of the xml stuff there. Do note
> that the XSLT is a subset of the full capabilities....
>
> Second, I'd recommend you just put it all in a single index, it'll be
> simpler. Index a field indicating which of your three sources
> the doc belongs to. Then you can group (aka Field Collapse) by
> source and your result sets will contain the top N docs from each
> type and you can do whatever you want with them at the app
> level. See: http://wiki.apache.org/solr/FieldCollapsing
>
> By including a type, you can also do nifty things like delete all the
> records for a particular type by query.
>
> Best
> Erick
>
>
> On Fri, Oct 5, 2012 at 11:22 AM, Billy Newman <newman...@gmail.com> wrote:
>> I am looking into Solr to index a few of my data sets, 3 to be exact.
>>
>> The first 2 are really small xml docs retrieved via url, ~300 records
>> each.  The data behind both of these changes very frequently, roughly every 5
>> minutes.  The data itself does not have timestamps so delta-import
>> using DIH would not work (at least I don't think it would work).  I am
>> thinking about just re-indexing these 2 data sources every 15 minutes
>> or so to keep the indexes up to date.
>>
>> The 3rd data set is a lot more complicated in which I will probably
>> have to use SolrJ and write some custom code to handle
>> inserts/updates/deletes.
>>
>> I need to be able to search all the data sets once they are indexed in
>> one search.
>>
>> A couple options:
>>
>> 1.  Store the data from all 3 datasets in different indexes, allowing
>> the DIH to re-index datasets 1 and 2 without affecting
>> indexed data from data set 3.  I am not sure this is advisable, or
>> whether it is even possible to search across multiple cores.
>>
>> 2. Store all the data from all 3 datasets in the same index.  Yet this
>> brings the question of how to re-index datasets 1 and 2 using a DIH
>> full-import and not lose indexed data from data set 3.
>>
>> Just starting with Solr so please go easy ;).  Thanks in advance.
>>
>> Billy
