Re: updating existing data in index vs inserting new data in index

Mark juszczec Thu, 07 Jul 2011 08:12:48 -0700

Erick

I used to, but now I find I must have commented it out in a fit of rage ;-)


This could be the whole problem.

I have verified via the admin schema browser that the field is ORDER_ID and
will double check that I refer to it in upper case in the appropriate places
in the Solr schema and config files.

Curiously, the admin schema browser display for ORDER_ID says "hasDeletions:
false"  - which seems the opposite of what I want.  I want to be able to
delete duplicates.  Or am I interpreting this field wrong?

To check for duplicates, I am using the admin browser to enter the following
in the Make A Query box:

TABLE_ID:1 AND ORDER_ID:674659

When I click search and view the results, 2 records are displayed.  One has
the original values, one has the changed values.  I haven't examined the xml
(via view source) too closely and the next time I run I will look for
something indicating one of the records is inactive.
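
A minimal sketch of the same duplicate check done from a script rather than
the admin page. It assumes Solr answers at http://localhost:8983/solr (a
hypothetical host and port), that the select handler is asked for JSON via
wt=json, and that the fields are the upper-case TABLE_ID and ORDER_ID shown in
the schema browser. The helper names are made up for illustration.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: hypothetical base URL; adjust host, port and core path.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(table_id, order_id, base=SOLR_BASE):
    """Build the same query that is typed into the 'Make A Query' box."""
    params = {
        "q": "TABLE_ID:%s AND ORDER_ID:%s" % (table_id, order_id),
        "wt": "json",  # ask for JSON instead of the default XML
        "rows": 10,
    }
    return base + "/select?" + urlencode(params)


def duplicate_keys(docs, key_field="ORDER_ID"):
    """Return the key values that appear in more than one document."""
    counts = Counter(doc[key_field] for doc in docs if key_field in doc)
    return sorted(key for key, n in counts.items() if n > 1)


def fetch_docs(url):
    """Run the query against a live Solr and return the matching documents."""
    with urlopen(url) as response:
        return json.load(response)["response"]["docs"]
```

If the uniqueKey is defined and matches the field name exactly (case
included), duplicate_keys(fetch_docs(build_select_url(1, 674659))) would be
expected to come back empty.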

When you say "change your schema" do you mean via a delta import or by
modifying the config files or both?  FWIW, I am deleting the index on the
file system, doing a full import, modifying the data in the database and
then doing a delta import.

I am not restarting Solr at all in this process.
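
For reference, the import part of that sequence boils down to two
DataImportHandler requests, sketched here. The host and port are hypothetical;
the command, clean and commit parameters are the ones named in the DIH wiki
text quoted further down. Passing clean=false on the delta run is an
assumption about what is wanted, so that the delta does not wipe the index
first.

```python
from urllib.parse import urlencode

# Assumption: hypothetical host and port; the /dataimport path is the one
# shown in the DIH wiki quote.
DIH_URL = "http://localhost:8983/solr/dataimport"


def dih_command_url(command, clean, commit=True, base=DIH_URL):
    """Build a DataImportHandler request URL for the given command."""
    params = [
        ("command", command),
        ("clean", str(clean).lower()),
        ("commit", str(commit).lower()),
    ]
    return base + "?" + urlencode(params)


# Step 1: rebuild the whole index from the database.
full_import = dih_command_url("full-import", clean=True)
# Step 2: after rows change in the database, pick up only the changes.
delta_import = dih_command_url("delta-import", clean=False)

print(full_import)
print(delta_import)
```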

I understand Solr does not perform key management.  You described exactly
what I meant.  Sorry for any confusion.

Mark

On Thu, Jul 7, 2011 at 10:52 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Let me re-state a few things to see if I've got it right:
>
> > your schema.xml file has an entry like <uniqueKey>order_id</uniqueKey>,
>   right?
>
> > given this definition, any document added with an order_id that already
>   exists in the Solr index will be replaced. i.e. you should have one and
>   only one document with a given order_id.
>
> > case matters. Check via the admin page ("schema browser") to see if you
>   have two fields, order_id and ORDER_ID.
>
> > How are you checking that your docs are duplicates? If you do a search on
>   order_id, you should get back one and only one document (assuming the
>   definition above). A document that's deleted will just be marked as
>   deleted; the data won't be purged from the index. It won't show in search
>   results, but it will show if you use lower-level ways to access the data.
>
> > Whenever you change your schema, it's best to clean the index, restart
>   the server and re-index from scratch. Solr won't retroactively remove
>   duplicate <uniqueKey> entries.
>
> > On the admin/stats page you should see maxDocs and numDocs. The
>   difference between these should be the number of deleted documents.
>
> > Solr doesn't "manage" unique keys. All that happens is Solr will replace
>   any pre-existing documents where *you've* defined the <uniqueKey> when a
>   new doc is added...
>
> Hope this helps
> Erick
>
> On Thu, Jul 7, 2011 at 10:16 AM, Mark juszczec <mark.juszc...@gmail.com>
> wrote:
> > Bob
> >
> > No, I don't.  Let me look into that and post my results.
> >
> > Mark
> >
> >
> > On Thu, Jul 7, 2011 at 10:14 AM, Bob Sandiford
> > <bob.sandif...@sirsidynix.com> wrote:
> >
> >> Hi, Mark.
> >>
> >> I haven't used DIH myself - so I'll need to leave comments on your
> >> setup to others who have done so.
> >>
> >> Another question - after your initial index create (and after each
> >> delta), do you run a 'commit'?  Do you run an 'optimize'?  (Without the
> >> optimize, 'deleted' records still show up in query results...)
> >>
> >> Bob Sandiford | Lead Software Engineer | SirsiDynix
> >> P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
> >> www.sirsidynix.com
> >>
> >>
> >> > -----Original Message-----
> >> > From: Mark juszczec [mailto:mark.juszc...@gmail.com]
> >> > Sent: Thursday, July 07, 2011 10:04 AM
> >> > To: solr-user@lucene.apache.org
> >> > Subject: Re: updating existing data in index vs inserting new data in
> >> > index
> >> >
> >> > Bob
> >> >
> >> > Thanks very much for the reply!
> >> >
> >> > I am using a unique integer called order_id as the Solr index key.
> >> >
> >> > My query, deltaQuery and deltaImportQuery are below:
> >> >
> >> > <entity name="item1"
> >> >   pk="ORDER_ID"
> >> >   query="select 1 as TABLE_ID, orders.order_id, orders.order_booked_ind,
> >> >     orders.order_dt, orders.cancel_dt, orders.account_manager_id,
> >> >     orders.of_header_id, orders.order_status_lov_id, orders.order_type_id,
> >> >     orders.approved_discount_pct, orders.campaign_nm,
> >> >     orders.approved_by_cd, orders.advertiser_id, orders.agency_id
> >> >     from orders"
> >> >
> >> >   deltaImportQuery="select 1 as TABLE_ID, orders.order_id,
> >> >     orders.order_booked_ind, orders.order_dt, orders.cancel_dt,
> >> >     orders.account_manager_id, orders.of_header_id,
> >> >     orders.order_status_lov_id, orders.order_type_id,
> >> >     orders.approved_discount_pct, orders.campaign_nm,
> >> >     orders.approved_by_cd, orders.advertiser_id, orders.agency_id
> >> >     from orders
> >> >     where orders.order_id = '${dataimporter.delta.ORDER_ID}'"
> >> >
> >> >   deltaQuery="select orders.order_id from orders
> >> >     where orders.change_dt >
> >> >     to_date('${dataimporter.last_index_time}','YYYY-MM-DD HH24:MI:SS')">
> >> > </entity>
> >> >
> >> > The test I am running is two part:
> >> >
> >> > 1.  After I do a full import of the index, I insert a brand new record
> >> > (with a never existed before order_id) in the database.  The delta
> >> > import picks this up just fine.
> >> >
> >> > 2.  After the full import, I modify a record with an order_id that
> >> > already shows up in the index.  I have verified there is only one
> >> > record with this order_id in both the index and the db before I do the
> >> > delta update.
> >> >
> >> > I guess the question is, am I screwing myself up by defining my own
> >> > Solr index key?  I want to, ultimately, be able to search on ORDER_ID
> >> > in the Solr index.  However, the docs say (I think) a field does not
> >> > have to be the Solr primary key in order to be searchable.  Would I be
> >> > better off letting Solr manage the keys?
> >> >
> >> > Mark
> >> >
> >> > On Thu, Jul 7, 2011 at 9:24 AM, Bob Sandiford
> >> > <bob.sandif...@sirsidynix.com> wrote:
> >> >
> >> > > What are you using as the unique id in your Solr index?  It sounds
> >> > > like you may have one value as your Solr index unique id, which bears
> >> > > no resemblance to a unique[1] id derived from your data...
> >> > >
> >> > > Or - another way to put it - what is it that makes these two records
> >> > > in your Solr index 'the same', and what are the unique id's for those
> >> > > two entries in the Solr index?  How are those id's related to your
> >> > > original data?
> >> > >
> >> > > [1] not only unique, but immutable.  I.E. if you update a row in your
> >> > > database, the unique id derived from that row has to be the same as
> >> > > it would have been before the update.  Otherwise, there's nothing for
> >> > > Solr to recognize as a duplicate entry, and do a 'delete' and
> >> > > 'insert' instead of just an 'insert'.
> >> > >
> >> > > Bob Sandiford | Lead Software Engineer | SirsiDynix
> >> > > P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
> >> > > www.sirsidynix.com
> >> > >
> >> > >
> >> > > > -----Original Message-----
> >> > > > From: Mark juszczec [mailto:mark.juszc...@gmail.com]
> >> > > > Sent: Thursday, July 07, 2011 9:15 AM
> >> > > > To: solr-user@lucene.apache.org
> >> > > > Subject: updating existing data in index vs inserting new data in
> >> > index
> >> > > >
> >> > > > Hello all
> >> > > >
> >> > > > I'm using Solr 3.2 and am confused about updating existing data in
> >> > > > an index.
> >> > > >
> >> > > > According to the DataImportHandler Wiki:
> >> > > >
> >> > > > *"delta-import* : For incremental imports and change detection run
> >> > > > the command
> >> > > > `http://<host>:<port>/solr/dataimport?command=delta-import` . It
> >> > > > supports the same clean, commit, optimize and debug parameters as
> >> > > > full-import command."
> >> > > >
> >> > > > I know delta-import will find new data in the database and insert
> >> > > > it into the index.  My problem is how it handles updates where I've
> >> > > > got a record that exists in the index and the database, the
> >> > > > database record is changed and I want to incorporate those changes
> >> > > > in the existing record in the index.  IOW I don't want to insert it
> >> > > > again.
> >> > > >
> >> > > > I've tried this and wound up with 2 records with the same key in
> >> > > > the index.  The first contains the original db values found when
> >> > > > the index was created, the 2nd contains the db values after the
> >> > > > record was changed.
> >> > > >
> >> > > > I've also found this:
> >> > > >
> >> > > > http://search.lucidimagination.com/search/out?u=http%3A%2F%2Flucene.472066.n3.nabble.com%2FDelta-import-with-solrj-client-tp1085763p1086173.html
> >> > > >
> >> > > > The subject is 'Delta-import with solrj client':
> >> > > >
> >> > > > "Greetings. I have a *solrj* client for fetching data from
> >> > > > database. I am using *delta*-*import* for fetching data. If a
> >> > > > column is changed in database using timestamp with
> >> > > > *delta*-*import* i get the latest column indexed but there are
> >> > > > *duplicate* values in the index similar to the column but the data
> >> > > > is older. This works with cleaning the index but i want to update
> >> > > > the index without cleaning it. Is there a way to just update the
> >> > > > index with the updated column without having *duplicate* values.
> >> > > > Appreciate for any feedback.
> >> > > >
> >> > > > Hando"
> >> > > >
> >> > > > There are 2 responses:
> >> > > >
> >> > > > "Short answer is no, there isn't a way. *Solr* doesn't have the
> >> > > > concept of 'Update' to an indexed document. You need to add the
> >> > > > full document (all 'columns') each time any one field changes. If
> >> > > > doing that in your DataImportHandler logic is difficult you may
> >> > > > need to write a separate Update Service that does:
> >> > > >
> >> > > > 1) Read UniqueID, UpdatedColumn(s)  from database
> >> > > > 2) Using UniqueID Retrieve document from *Solr*
> >> > > > 3) Add/Update field(s) with updated column(s)
> >> > > > 4) Add document back to *Solr*
> >> > > >
> >> > > > Although, if you use DIH to do a full *import*, using the same
> >> > > > query in your *Delta*-*Import* to get the whole document shouldn't
> >> > > > be that difficult."
> >> > > >
> >> > > > and
> >> > > >
> >> > > > "Hi,
> >> > > >
> >> > > > Make sure you use a proper "ID" field, which does *not* change
> >> > > > even if the content in the database changes. In this way, when
> >> > > > your *delta*-*import* fetches changed rows to index, they will
> >> > > > update the existing rows in your index."
> >> > > >
> >> > > > I have an ID field that doesn't change.  It is the primary key
> >> > > > field from the database table I am trying to index and I have
> >> > > > verified it is unique.
> >> > > >
> >> > > > So, does Solr allow updates (not inserts) of existing records?  Is
> >> > > > anyone able to do this?
> >> > > >
> >> > > > Mark
> >> > >
> >> > >
> >>
> >>
> >
>
