Erick

I used to, but now I find I must have commented it out in a fit of rage ;-)

This could be the whole problem. I have verified via the admin schema browser that the field is ORDER_ID, and I will double-check that I refer to it in upper case in the appropriate places in the Solr config and schema.

Curiously, the admin schema browser display for ORDER_ID says "hasDeletions: false" - which seems the opposite of what I want. I want to be able to delete duplicates. Or am I interpreting this field wrong?

To check for duplicates, I am using the admin browser to enter the following in the Make A Query box:

TABLE_ID:1 AND ORDER_ID:674659

When I click search and view the results, 2 records are displayed. One has the original values, one has the changed values. I haven't examined the xml (via view source) too closely, and the next time I run I will look for something indicating one of the records is inactive.

When you say "change your schema" do you mean via a delta import or by modifying the config files or both?

FWIW, I am deleting the index on the file system, doing a full import, modifying the data in the database and then doing a delta import. I am not restarting Solr at all in this process.

I understand Solr does not perform key management. You described exactly what I meant. Sorry for any confusion.

Mark

On Thu, Jul 7, 2011 at 10:52 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Let me re-state a few things to see if I've got it right:
>
> > your schema.xml file has an entry like <uniqueKey>order_id</uniqueKey>, right?
>
> > given this definition, any document added with an order_id that already exists in the
> Solr index will be replaced. i.e. you should have one and only one document with a
> given order_id.
>
> > case matters. Check via the admin page ("schema browser") to see if you have
> two fields, order_id and ORDER_ID.
>
> > How are you checking that your docs are duplicates? If you do a search on
> order_id, you should get back one and only one document (assuming the
> definition above).
> A document that's deleted will just be marked as deleted,
> the data won't be purged from the index. It won't show in search results, but
> it will show if you use lower-level ways to access the data.
>
> > Whenever you change your schema, it's best to clean the index, restart the server and
> re-index from scratch. Solr won't retroactively remove duplicate <uniqueKey> entries.
>
> > On the stats admin/stats page you should see maxDocs and numDocs. The difference
> between these should be the number of deleted documents.
>
> > Solr doesn't "manage" unique keys. All that happens is Solr will replace any
> pre-existing documents where *you've* defined the <uniqueKey> when a
> new doc is added...
>
> Hope this helps
> Erick
>
> On Thu, Jul 7, 2011 at 10:16 AM, Mark juszczec <mark.juszc...@gmail.com> wrote:
> > Bob
> >
> > No, I don't. Let me look into that and post my results.
> >
> > Mark
> >
> >
> > On Thu, Jul 7, 2011 at 10:14 AM, Bob Sandiford <bob.sandif...@sirsidynix.com> wrote:
> >
> >> Hi, Mark.
> >>
> >> I haven't used DIH myself - so I'll need to leave comments on your set up
> >> to others who have done so.
> >>
> >> Another question - after your initial index create (and after each delta),
> >> do you run a 'commit'? Do you run an 'optimize'? (Without the optimize,
> >> 'deleted' records still show up in query results...)
> >>
> >> Bob Sandiford | Lead Software Engineer | SirsiDynix
> >> P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
> >> www.sirsidynix.com
> >>
> >>
> >> > -----Original Message-----
> >> > From: Mark juszczec [mailto:mark.juszc...@gmail.com]
> >> > Sent: Thursday, July 07, 2011 10:04 AM
> >> > To: solr-user@lucene.apache.org
> >> > Subject: Re: updating existing data in index vs inserting new data in index
> >> >
> >> > Bob
> >> >
> >> > Thanks very much for the reply!
> >> >
> >> > I am using a unique integer called order_id as the Solr index key.
> >> >
> >> > My query, deltaQuery and deltaImportQuery are below:
> >> >
> >> > <entity name="item1"
> >> > pk="ORDER_ID"
> >> > query="select 1 as TABLE_ID , orders.order_id, orders.order_booked_ind,
> >> > orders.order_dt, orders.cancel_dt, orders.account_manager_id,
> >> > orders.of_header_id, orders.order_status_lov_id, orders.order_type_id,
> >> > orders.approved_discount_pct, orders.campaign_nm,
> >> > orders.approved_by_cd,orders.advertiser_id, orders.agency_id from orders"
> >> >
> >> > deltaImportQuery="select 1 as TABLE_ID, orders.order_id,
> >> > orders.order_booked_ind, orders.order_dt, orders.cancel_dt,
> >> > orders.account_manager_id, orders.of_header_id, orders.order_status_lov_id,
> >> > orders.order_type_id, orders.approved_discount_pct, orders.campaign_nm,
> >> > orders.approved_by_cd,orders.advertiser_id, orders.agency_id from orders
> >> > where orders.order_id = '${dataimporter.delta.ORDER_ID}'"
> >> >
> >> > deltaQuery="select orders.order_id from orders where orders.change_dt >
> >> > to_date('${dataimporter.last_index_time}','YYYY-MM-DD HH24:MI:SS')" >
> >> > </entity>
> >> >
> >> > The test I am running is two part:
> >> >
> >> > 1. After I do a full import of the index, I insert a brand new record (with
> >> > a never existed before order_id) in the database. The delta import picks
> >> > this up just fine.
> >> >
> >> > 2. After the full import, I modify a record with an order_id that already
> >> > shows up in the index. I have verified there is only one record with this
> >> > order_id in both the index and the db before I do the delta update.
> >> >
> >> > I guess the question is, am I screwing myself up by defining my own Solr
> >> > index key? I want to, ultimately, be able to search on ORDER_ID in the Solr
> >> > index. However, the docs say (I think) a field does not have to be the Solr
> >> > primary key in order to be searchable.
> >> > Would I be better off letting Solr manage the keys?
> >> >
> >> > Mark
> >> >
> >> > On Thu, Jul 7, 2011 at 9:24 AM, Bob Sandiford <bob.sandif...@sirsidynix.com> wrote:
> >> >
> >> > > What are you using as the unique id in your Solr index? It sounds like you
> >> > > may have one value as your Solr index unique id, which bears no resemblance
> >> > > to a unique[1] id derived from your data...
> >> > >
> >> > > Or - another way to put it - what is it that makes these two records in
> >> > > your Solr index 'the same', and what are the unique id's for those two
> >> > > entries in the Solr index? How are those id's related to your original
> >> > > data?
> >> > >
> >> > > [1] not only unique, but immutable. I.E. if you update a row in your
> >> > > database, the unique id derived from that row has to be the same as it would
> >> > > have been before the update. Otherwise, there's nothing for Solr to
> >> > > recognize as a duplicate entry, and do a 'delete' and 'insert' instead of
> >> > > just an 'insert'.
> >> > >
> >> > > Bob Sandiford | Lead Software Engineer | SirsiDynix
> >> > > P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
> >> > > www.sirsidynix.com
> >> > >
> >> > >
> >> > > > -----Original Message-----
> >> > > > From: Mark juszczec [mailto:mark.juszc...@gmail.com]
> >> > > > Sent: Thursday, July 07, 2011 9:15 AM
> >> > > > To: solr-user@lucene.apache.org
> >> > > > Subject: updating existing data in index vs inserting new data in index
> >> > > >
> >> > > > Hello all
> >> > > >
> >> > > > I'm using Solr 3.2 and am confused about updating existing data in an
> >> > > > index.
> >> > > >
> >> > > > According to the DataImportHandler Wiki:
> >> > > >
> >> > > > *"delta-import* : For incremental imports and change detection run the
> >> > > > command `http://<host>:<port>/solr/dataimport?command=delta-import .
> >> > > > It supports the same clean, commit, optimize and debug parameters as
> >> > > > full-import command."
> >> > > >
> >> > > > I know delta-import will find new data in the database and insert it into
> >> > > > the index. My problem is how it handles updates where I've got a record
> >> > > > that exists in the index and the database, the database record is changed
> >> > > > and I want to incorporate those changes in the existing record in the
> >> > > > index. IOW I don't want to insert it again.
> >> > > >
> >> > > > I've tried this and wound up with 2 records with the same key in the index.
> >> > > > The first contains the original db values found when the index was created,
> >> > > > the 2nd contains the db values after the record was changed.
> >> > > >
> >> > > > I've also found this
> >> > > > http://search.lucidimagination.com/search/out?u=http%3A%2F%2Flucene.472066.n3.nabble.com%2FDelta-import-with-solrj-client-tp1085763p1086173.html
> >> > > > the subject is 'Delta-import with solrj client'
> >> > > >
> >> > > > "Greetings. I have a *solrj* client for fetching data from database. I am
> >> > > > using *delta*-*import* for fetching data. If a column is changed in database
> >> > > > using timestamp with *delta*-*import* i get the latest column indexed but
> >> > > > there are *duplicate* values in the index similar to the column but the data
> >> > > > is older. This works with cleaning the index but i want to update the index
> >> > > > without cleaning it. Is there a way to just update the index with the
> >> > > > updated column without having *duplicate* values. Appreciate for any
> >> > > > feedback.
> >> > > >
> >> > > > Hando"
> >> > > >
> >> > > > There are 2 responses:
> >> > > >
> >> > > > "Short answer is no, there isn't a way. *Solr* doesn't have the concept of
> >> > > > 'Update' to an indexed document. You need to add the full document (all
> >> > > > 'columns') each time any one field changes. If doing that in your
> >> > > > DataImportHandler logic is difficult you may need to write a separate Update
> >> > > > Service that does:
> >> > > >
> >> > > > 1) Read UniqueID, UpdatedColumn(s) from database
> >> > > > 2) Using UniqueID Retrieve document from *Solr*
> >> > > > 3) Add/Update field(s) with updated column(s)
> >> > > > 4) Add document back to *Solr*
> >> > > >
> >> > > > Although, if you use DIH to do a full *import*, using the same query in
> >> > > > your *Delta*-*Import* to get the whole document shouldn't be that
> >> > > > difficult."
> >> > > >
> >> > > > and
> >> > > >
> >> > > > "Hi,
> >> > > >
> >> > > > Make sure you use a proper "ID" field, which does *not* change even if the
> >> > > > content in the database changes. In this way, when your *delta*-*import*
> >> > > > fetches changed rows to index, they will update the existing rows in your
> >> > > > index."
> >> > > >
> >> > > > I have an ID field that doesn't change. It is the primary key field from
> >> > > > the database table I am trying to index and I have verified it is unique.
> >> > > >
> >> > > > So, does Solr allow updates (not inserts) of existing records? Is anyone
> >> > > > able to do this?
> >> > > >
> >> > > > Mark
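
For reference, a minimal sketch of the schema.xml entries this thread keeps coming back to, assuming the field really is named ORDER_ID in upper case as the schema browser reports. The field type and attribute values below are assumptions for illustration (the actual schema is not shown anywhere in the thread); "string" is used only because it is a common choice for key fields.

```xml
<!-- schema.xml (sketch only; type and attribute values are assumed, not taken from the thread) -->
<fields>
  <!-- The name is case-sensitive: ORDER_ID and order_id would be two different fields -->
  <field name="ORDER_ID" type="string" indexed="true" stored="true" required="true"/>
</fields>

<!-- Must be present and NOT commented out. With it in place, adding a document
     whose ORDER_ID already exists replaces the old document rather than
     adding a second one next to it. -->
<uniqueKey>ORDER_ID</uniqueKey>
```

Per Erick's note above, Solr won't retroactively remove duplicates, so after restoring an entry like this the index would need to be cleaned and fully re-imported before the delta import is tested again.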