On Oct 3, 2008, at 12:34 PM, Shalin Shekhar Mangar wrote:
On Fri, Oct 3, 2008 at 9:20 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
Now, my question. Let's say I have an initial set of ratings for a feed.
I then do a full import of the articles on that feed. Everything is
peachy so far. Then, I get a new rating for an existing article that
I've already indexed, thus the child entity (named "rating") has a
delta. However, when I run the delta-import, it doesn't pick up any
changes, since, I believe, the parent hasn't changed. Either that, or I
am doing something wrong. It seems like it is akin to the
parentDeltaQuery problem, but, of course, there is no parent query since
there is no parent table, in the DB sense, at least not how I see it.
The relevant logs are in [3].
Is this case handled? If not, any suggestions for alternatives? Any help
would be appreciated.
XPathEntityProcessor does not support delta imports. It might be
possible to enhance it to accept an XPath condition for joining child to
parent, but it seems pointless because we'd need to parse the whole XML
anyway for each changed child row (Joel Spolsky's words echo in my
mind!). If the XML data is small, we can also have a cached
implementation like the CachedSqlEntityProcessor.
What about somehow using the fact that the variable resolver needs to
resolve solrFeed.link and then go get all entries from Solr to get
those values, such that the child entity can then be tested?
The easiest workaround here is to reverse the parent-child. Make the DB
the parent and join on the child, which will let you do delta imports;
however, full imports may be expensive. Depending on the size of the
XML, you may be better off always doing a full import.
I thought of reversing the parent-child, but I don't see how it works,
since there isn't necessarily a DB entry for every article. How would
you associate the two to make sure you get all articles?
Also, the current approach seems more intuitive, since the RSS feed is
the authoritative content.
Essentially, what I am interested in is a join across data sources. I
realize that is non-trivial, but boy would it be powerful.
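
As a rough illustration only (plain Python, not DataImportHandler code),
the difference between the two orderings can be sketched as below. The
function names and sample data are hypothetical; the point is just that
driving from the feed keeps every article, while driving from the DB
only yields articles that already have a rating row.

```python
def feed_driven(articles, ratings):
    """Feed is the parent: keep every article, attach a rating if one exists."""
    rating_by_link = {r["feed"]: r["rating"] for r in ratings}
    return [
        {**article, "rating": rating_by_link.get(article["link"])}
        for article in articles
    ]


def db_driven(articles, ratings):
    """DB is the parent: only articles that have a rating row are produced."""
    article_by_link = {a["link"]: a for a in articles}
    return [
        {**article_by_link[r["feed"]], "rating": r["rating"]}
        for r in ratings
        if r["feed"] in article_by_link
    ]


# Hypothetical sample data: two articles in the feed, one rating in the DB.
articles = [
    {"link": "http://example.com/a", "title": "A"},
    {"link": "http://example.com/b", "title": "B"},
]
ratings = [{"feed": "http://example.com/a", "rating": 4.9}]

from_feed = feed_driven(articles, ratings)
from_db = db_driven(articles, ratings)
```

With this sample data, the feed-driven join returns both articles (the
unrated one with a rating of None), while the DB-driven join returns only
the rated one.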
Another thing I noticed from your logs: the ModifiedRowKey count is 0.
Are you sure the timestamp column is getting updated? IIRC, you need a
stored proc to do this for postgres.
INFO: Completed ModifiedRowKey for Entity: rating rows obtained : 0
Yeah, that bothers me, too. My dataimport.properties contains:
[EMAIL PROTECTED] cat dataimport.properties
#Fri Oct 03 12:09:29 EDT 2008
last_index_time=2008-10-03 12\:09\:28
And, when querying by hand my DB shows:
select * from feeds where last_modified > '10/03/2008';
                                   feed                                    | rating |    last_modified
---------------------------------------------------------------------------+--------+---------------------
 http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/ |    4.9 | 2008-10-04 11:04:00
(1 row)
So, I am reasonably certain there is a change.
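
The same comparison can be checked outside the DB with a small Python
sketch. It assumes the delta condition amounts to "last_modified is
later than last_index_time", which may not match exactly what the delta
query does; the helper name is hypothetical, and the values are the ones
shown above.

```python
from datetime import datetime


def parse_last_index_time(properties_text):
    """Read last_index_time from dataimport.properties-style text.

    Java properties files escape ':' as '\\:', so the escape is undone
    before parsing the timestamp.
    """
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "last_index_time":
            cleaned = value.strip().replace("\\:", ":")
            return datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S")
    return None


properties_text = (
    "#Fri Oct 03 12:09:29 EDT 2008\n"
    "last_index_time=2008-10-03 12\\:09\\:28\n"
)
last_index_time = parse_last_index_time(properties_text)

# last_modified value of the row returned by the query above
last_modified = datetime(2008, 10, 4, 11, 4, 0)
row_is_newer = last_modified > last_index_time
```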
I think the reason, if you look further down in the log, is that it
processes the entities separately. In other words, is the DB entity
even getting resolved in the context of the parent entity? Or is it
not resolving the ${solrFeed.link} clause of the delta query?
-Grant