On Oct 3, 2008, at 12:34 PM, Shalin Shekhar Mangar wrote:

On Fri, Oct 3, 2008 at 9:20 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:


Now, my question. Let's say I have an initial set of ratings for a feed. I then do a full import of the articles on that feed. Everything is peachy so far. Then, I get a new rating for an existing article that I've already indexed, thus the child entity (named "rating") has a delta. However, when I run the delta-import, it doesn't pick up any changes, since, I believe, the parent hasn't changed. Either that, or I am doing something wrong. It seems like it is akin to the parentDeltaQuery problem, but, of course, there is no parent query since there is no parent table, in the DB sense, at least not how I see it. The relevant logs are in [3].

Is this case handled? If not, any suggestions for alternatives? Any help would be appreciated.


XPathEntityProcessor does not support delta imports. It might be possible to enhance it to accept an xpath condition for joining child to parent, but that seems pointless because we'd need to parse the whole XML anyway for each changed child row (Joel Spolsky's words echo in my mind!). If the XML data is small, we can also have a cached implementation like the CachedSqlEntityProcessor.

What about somehow using the fact that the variable resolver needs to resolve solrFeed.link and then go get all entries from Solr to get those values, such that the child entity can then be tested?



The easiest workaround here is to reverse the parent-child relationship. Make the DB the parent and join on the child, which will let you do delta imports; however, full imports may be expensive. Depending on the size of the XML, you may be better off always doing a full import.

I thought of reversing the parent-child, but I don't see how it works, since there isn't necessarily a DB entry for every article. How would you associate the two to make sure you get all articles?

Also, the current approach seems more intuitive, since the RSS feed is the authoritative content.

Essentially, what I am interested in is a join across data sources. I realize that is non-trivial, but boy would it be powerful.
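
A rough sketch of what such a cross-source join amounts to, written in plain Python rather than as DIH configuration. The function name, the dictionary layout, and the example.org article are assumptions made up for illustration; only the feed URL and its 4.9 rating come from this thread. The feed stays authoritative, so an article with no DB row is kept and simply gets no rating.

```python
def join_feed_with_ratings(feed_entries, ratings_by_link):
    """Left join: every feed entry is kept; the rating is None if the DB has no row."""
    joined = []
    for entry in feed_entries:
        doc = dict(entry)  # copy so the original feed entry is not mutated
        doc["rating"] = ratings_by_link.get(entry["link"])
        joined.append(doc)
    return joined


# Hypothetical data: one rated article (from the thread) and one unrated article.
feed_entries = [
    {"link": "http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/"},
    {"link": "http://example.org/unrated-article/"},
]
ratings_by_link = {
    "http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/": 4.9,
}

docs = join_feed_with_ratings(feed_entries, ratings_by_link)
```

Keying the lookup on the link mirrors the ${solrFeed.link} join used by the child entity.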



Another thing I noticed from your logs: the ModifiedRowKey count is 0. Are you sure the timestamp column is getting updated? IIRC, you need a stored
proc to do this for postgres.

INFO: Completed ModifiedRowKey for Entity: rating rows obtained : 0

Yeah, that bothers me, too.  My dataimport.properties contains:
[EMAIL PROTECTED] cat dataimport.properties
#Fri Oct 03 12:09:29 EDT 2008
last_index_time=2008-10-03 12\:09\:28

And, when querying by hand my DB shows:

select * from feeds where last_modified > '10 / 03 / 2008 ';
                                   feed                                    | rating |    last_modified
---------------------------------------------------------------------------+--------+---------------------
 http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/ |    4.9 | 2008-10-04 11:04:00
(1 row)

So, I am reasonably certain there is a change.
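
As a sanity check on that claim, here is a minimal Python sketch of the comparison, assuming the delta query selects rows whose last_modified is later than last_index_time. The helper names are hypothetical; the values are the ones shown above. Note that the properties file escapes each ':' as '\:', so the value has to be unescaped before parsing.

```python
from datetime import datetime


def parse_last_index_time(properties_text):
    """Pull last_index_time out of dataimport.properties-style text."""
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the timestamp comment
        key, _, value = line.partition("=")
        if key.strip() == "last_index_time":
            # Java .properties files write ':' as '\:'; undo that escaping.
            unescaped = value.strip().replace("\\:", ":")
            return datetime.strptime(unescaped, "%Y-%m-%d %H:%M:%S")
    raise KeyError("last_index_time")


def is_delta_candidate(last_modified, last_index_time):
    """True if the row changed after the last import (assumed delta condition)."""
    return last_modified > last_index_time


PROPERTIES = "#Fri Oct 03 12:09:29 EDT 2008\nlast_index_time=2008-10-03 12\\:09\\:28\n"

last_index = parse_last_index_time(PROPERTIES)
row_modified = datetime(2008, 10, 4, 11, 4, 0)  # last_modified from the query result
changed = is_delta_candidate(row_modified, last_index)
```

With these values the row qualifies, which suggests the zero count comes from how the delta query is resolved rather than from the data.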

I think the reason, if you notice further down in the log, is that it processes the entities separately. In other words, is the DB entity even getting resolved in the context of the parent entity? Or is it not resolving the ${solrFeed.link} clause of the delta query?

-Grant

