Erick,

Thanks for all the help, what a great community.

Unfortunately the 2 data sets I want to use the DIH for change a ton and are changed by a web app accessible to a number of people, as well as a few other internal server applications. Since the data sets are small, I figured re-indexing them every so often would be easiest. Even with re-indexing, a SolrJ app may be best.

Thanks again for all the help and advice!

Billy

Sent from my iPhone

On Oct 7, 2012, at 10:27 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> My personal approach would be to take DIH
> out of the mix entirely and do the whole thing in SolrJ
> where you can exercise control to whatever degree
> you want. DIH is a fine tool, but sometimes it's wrong
> for a particular situation.
>
> Here's some code to get you started if you want to
> go that route.
>
> http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
>
> Because you're entirely right. If the commits are
> interleaved with the deletes, you'll have partial data showing.
>
> But the root of your issue is that you're doing
> blanket deletes of a type. You only have to do this
> if
> 1> you have deletes in your DB
> 2> you have no way of deleting _just_ those
>    documents that correspond to the rows
>    you've deleted
> Otherwise, if you could delete only the docs that
> correspond to the deleted rows in your DB, then
> your data view would be consistent...
>
> Best
> Erick
>
> On Sun, Oct 7, 2012 at 10:00 AM, Billy Newman <newman...@gmail.com> wrote:
>> Walter,
>>
>> Thanks! You bring up a very important 'commit' problem which I had
>> not thought about. So I am running a DIH that is wiping out part of
>> the index (i.e. all animals), then re-indexing/re-importing. I have
>> another DIH that is wiping out part of the index (minerals), then
>> re-indexing/re-importing.
>>
>> I see this problem (which I think you already realized):
>> 1. Index is full and people are querying.
>> 2. DIH for animals starts running and wipes out all animals.
>> 3. DIH for minerals starts running and wipes out all minerals.
>> 4. DIH for animals finishes, and commits.
>> 5. User queries for minerals which might return 0 or a subset of
>> results, because the animals DIH 'committed' the changes made by the
>> mineral DIH (let's assume only the clear happened in the mineral DIH
>> when the animal DIH committed).
>>
>> To further complicate things I have a third SolrJ application that
>> will be processing another dataset and updating/committing to the
>> index. Is there a recommended way to handle multiple applications
>> that are wiping out and writing to part of the index, such that the
>> commits do not happen at an inopportune time (i.e. a commit by one
>> application right after another application just wiped part of the
>> index before repopulating it)?
>>
>> I need to update the index every so often (~30 minutes). I could
>> write an app that chains the other 'indexer' apps (DIH1, DIH2,
>> SolrJApp1) together such that they run serially and then do one commit
>> at the end. Not too bad, but wondering if there is anything I can
>> take advantage of in Solr that would help with this problem. I am
>> using Solr 4.0-BETA if that makes a difference.
>>
>> Thanks again!
>>
>> Billy
>>
>> On Sat, Oct 6, 2012 at 6:05 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>> Right. You define three update handlers, something like /update-animal,
>>> /update-mineral, and /update-vegetable. Each one has a separate DIH config.
>>> Each config deletes documents of that type and loads documents of that type.
>>>
>>> You will not want to run them at the same time, because a commit in one
>>> will commit all the pending changes from any other one. It would be much
>>> less confusing to run them separately.
>>>
>>> wunder
>>>
>>> On Oct 6, 2012, at 2:30 PM, Erick Erickson wrote:
>>>
>>>> Sure, you need to define the appropriate delete query for each DIH entry.
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Fri, Oct 5, 2012 at 5:40 PM, Billy Newman <newman...@gmail.com> wrote:
>>>>> Does DIH support only deleting/re-indexing docs of a certain type?
>>>>>
>>>>> I.e. can I have a DIH for type:vegetable and another for type:mineral
>>>>> and each only deletes/recreates the right types?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Fri, Oct 5, 2012 at 1:04 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>>>> Using the same unique key doesn't handle documents which disappear from
>>>>>> one indexing to the next.
>>>>>>
>>>>>> Instead, add a field for the type of item, like type:animal,
>>>>>> type:vegetable, or type:mineral. Then the query used to clean up before
>>>>>> indexing can delete all items of that type.
>>>>>>
>>>>>> wunder
>>>>>>
>>>>>> On Oct 5, 2012, at 12:00 PM, Erick Erickson wrote:
>>>>>>
>>>>>>> DIH always gives me indigestion.....
>>>>>>>
>>>>>>> Couple of things:
>>>>>>> See the 'clean' parameter here for full import:
>>>>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>>>> it defaults to true. I think if you set it to "false"
>>>>>>> _and_ assuming that your <uniqueKey> is
>>>>>>> defined, it should work OK.
>>>>>>>
>>>>>>> The other approach would be to control the
>>>>>>> indexing of your XML from, say, a SolrJ program
>>>>>>> combined with a cron job....
>>>>>>>
>>>>>>> Does that work?
>>>>>>> Erick
>>>>>>>
>>>>>>> On Fri, Oct 5, 2012 at 2:39 PM, Billy Newman <newman...@gmail.com> wrote:
>>>>>>>> Erick,
>>>>>>>>
>>>>>>>> I did mention using the DIH to index the first two datasets; that is
>>>>>>>> where the root of my problem lies.
>>>>>>>>
>>>>>>>> I do see the benefit of one index. However the question still
>>>>>>>> remains: can I use the DIH to index xml from data sets 1 and 2, every
>>>>>>>> 15 minutes or so (full index), without wiping out all the indexed data
>>>>>>>> in the index from data set 3?
>>>>>>>>
>>>>>>>> I.e. from a couple of quick tests the DIH full import destroys all
>>>>>>>> data in the index before it repopulates it. Not sure I can just have
>>>>>>>> it destroy/re-index data of a certain type. Basically DIH full-import
>>>>>>>> on my_index for type 'dataset1', and DIH full-import on my_index for
>>>>>>>> type 'dataset2'. Both full-imports leaving alone the type 'dataset3'
>>>>>>>> data in the index.
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Billy
>>>>>>>>
>>>>>>>> On Fri, Oct 5, 2012 at 10:42 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>>> The very first question is "what form are your XML docs in?"
>>>>>>>>> Solr does NOT index arbitrary XML, so I'm guessing
>>>>>>>>> you're using DIH and some of the xml stuff there. Do note
>>>>>>>>> that the XSLT is a subset of the full capabilities....
>>>>>>>>>
>>>>>>>>> Second, I'd recommend you just put it all in a single index, it'll be
>>>>>>>>> simpler. Index a field indicating which of your three sources
>>>>>>>>> the doc belongs to. Then you can group (aka Field Collapse) by
>>>>>>>>> source and your result sets will contain the top N docs from each
>>>>>>>>> type and you can do whatever you want with them at the app
>>>>>>>>> level. See: http://wiki.apache.org/solr/FieldCollapsing
>>>>>>>>>
>>>>>>>>> By including a type, you can also do nifty things like delete all the
>>>>>>>>> records for a particular type by query.
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Erick
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Oct 5, 2012 at 11:22 AM, Billy Newman <newman...@gmail.com> wrote:
>>>>>>>>>> I am looking into Solr to index a few of my data sets, 3 to be exact.
>>>>>>>>>>
>>>>>>>>>> The first 2 are really small xml docs retrieved via url, ~300 records
>>>>>>>>>> each. The data behind both of these changes very frequently (~5
>>>>>>>>>> minutes). The data itself does not have timestamps so delta-import
>>>>>>>>>> using DIH would not work (at least I don't think it would work).
>>>>>>>>>> I am thinking about just re-indexing these 2 data sources every 15 minutes
>>>>>>>>>> or so to keep the indexes up to date.
>>>>>>>>>>
>>>>>>>>>> The 3rd data set is a lot more complicated, in which I will probably
>>>>>>>>>> have to use SolrJ and write some custom code to handle
>>>>>>>>>> inserts/updates/deletes.
>>>>>>>>>>
>>>>>>>>>> I need to be able to search all the data sets once they are indexed
>>>>>>>>>> in one search.
>>>>>>>>>>
>>>>>>>>>> A couple options:
>>>>>>>>>>
>>>>>>>>>> 1. Store the data from all 3 datasets in different indexes, allowing
>>>>>>>>>> the DIH import handler to re-index datasets 1 and 2 without affecting
>>>>>>>>>> indexed data from data set 3. Not sure this is advised as I am not
>>>>>>>>>> sure it is a good idea, or even possible, to search multiple cores.
>>>>>>>>>>
>>>>>>>>>> 2. Store all the data from all 3 datasets in the same index. Yet this
>>>>>>>>>> brings the question of how to re-index datasets 1 and 2 using a DIH
>>>>>>>>>> full-import and not lose indexed data from data set 3.
>>>>>>>>>>
>>>>>>>>>> Just starting with Solr so please go easy ;). Thanks in advance.
>>>>>>>>>>
>>>>>>>>>> Billy
>>>>>>
>>>>>> --
>>>>>> Walter Underwood
>>>>>> wun...@wunderwood.org
>>>
>>> --
>>> Walter Underwood
>>> wun...@wunderwood.org
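
Erick's point about deleting only the documents whose source records have disappeared, rather than wiping a whole type, can be sketched in plain Java. This is only an illustration under assumptions that are not in the thread: the names `StaleIds` and `staleIds` and the sample IDs are hypothetical, and both ID sets are presumed to have been collected already (for example, one from a query restricted to a single type, the other from the freshly fetched XML). It makes no Solr calls; it only works out which IDs would need deleting.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Solr or SolrJ.
final class StaleIds {

    /**
     * Returns the IDs that are present in the index but absent from the
     * current source data, i.e. the only documents that need deleting.
     */
    public static Set<String> staleIds(Set<String> indexedIds, Set<String> sourceIds) {
        // Copy first so the caller's set is never modified.
        Set<String> stale = new TreeSet<>(indexedIds);
        stale.removeAll(sourceIds);
        return stale;
    }

    public static void main(String[] args) {
        // Sample data (assumed): animal-1 was removed at the source,
        // animal-4 is new and would simply be added.
        Set<String> indexed = Set.of("animal-1", "animal-2", "animal-3");
        Set<String> source = Set.of("animal-2", "animal-3", "animal-4");

        System.out.println(staleIds(indexed, source)); // prints [animal-1]
    }
}
```

Deleting just those IDs and then re-adding the current records (which should overwrite existing documents that share a uniqueKey) would avoid the window in which an entire type is missing from the index, whichever application happens to issue the next commit.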