GitHub user javrasya opened a pull request:
https://github.com/apache/incubator-predictionio/pull/356
EventServer - eventId post parameter support
PredictionIO's event API is append-only. Even when an event refers to an
entity with an existing entity ID, it still creates a new record in the event
data source. This is good for tracking the history of entities in
PredictionIO, but in some cases appending is not needed. At the end of each
day, we push all products and users into PredictionIO regardless of
duplication. To be able to do that, there should be a mechanism that upserts
by an identifier. While reading the code, I saw that PredictionIO already
tries to do this in `HBEventsUtil`:
```scala
def eventToPut(event: Event, appId: Int): (Put, RowKey) = {
  // generate new rowKey if eventId is None
  val rowKey = event.eventId.map { id =>
    RowKey(id) // create rowKey from eventId
  }.getOrElse {
    // TODO: use real UUID. not pseudo random
    val uuidLow: Long = UUID.randomUUID().getLeastSignificantBits
    RowKey(
      entityType = event.entityType,
      entityId = event.entityId,
      millis = event.eventTime.getMillis,
      uuidLow = uuidLow
    )
  }
  // ... (remainder of the method omitted in this excerpt)
}
```
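To illustrate why this branching implies upsert semantics, here is a minimal, self-contained sketch. It is not PredictionIO code: `Event`, `InMemoryEventStore`, and `UpsertDemo` are hypothetical simplifications, and a plain mutable map stands in for the HBase table. The assumption is only that a write to an existing row key replaces the stored row.

```scala
import java.util.UUID
import scala.collection.mutable

// Hypothetical, simplified stand-in for PredictionIO's Event class.
final case class Event(
  eventId: Option[String],
  entityType: String,
  entityId: String,
  eventTimeMillis: Long,
  properties: Map[String, String]
)

// Toy event store keyed by row key; a mutable map stands in for the HBase table.
final class InMemoryEventStore {
  private val rows = mutable.Map.empty[String, Event]

  // Same branching as eventToPut: reuse the given id, otherwise generate a key.
  def put(event: Event): String = {
    val rowKey = event.eventId.getOrElse {
      val uuidLow = UUID.randomUUID().getLeastSignificantBits
      s"${event.entityType}-${event.entityId}-${event.eventTimeMillis}-$uuidLow"
    }
    rows.update(rowKey, event) // an existing key is overwritten, a new key adds a row
    rowKey
  }

  def size: Int = rows.size
}

object UpsertDemo {
  // Returns (row count when an eventId is sent, row count when it is not).
  def run(): (Int, Int) = {
    val event = Event(Some("fixed-id"), "item", "i1", 0L, Map("categories" -> "c1"))

    val withId = new InMemoryEventStore
    withId.put(event)
    withId.put(event.copy(properties = Map("categories" -> "c2")))

    val withoutId = new InMemoryEventStore
    val anonymous = event.copy(eventId = None)
    withoutId.put(anonymous)
    withoutId.put(anonymous)

    (withId.size, withoutId.size)
  }
}
```

In this sketch, sending the same `eventId` twice leaves one row, while sending the event without an `eventId` twice leaves two rows, because each write gets a freshly generated key.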
I didn't find this in the documentation, but after seeing it in the code I
tried sending an `eventId`:
```bash
curl -i -X POST \
  "http://localhost:7070/events.json?accessKey=gMC4E73ZZ76NrRBjFxHp3FmY7KCt-OmokBkbvtgidpLXQZzOV_G9dIu_7-gc5X1U" \
  -H "Content-Type: application/json" \
  -d '{
    "eventId": "KpjNMVrQzY2s0TZhYB3vsAAAAVOFSkM1kLoZgQnOA1EB",
    "event": "$set",
    "entityType": "item",
    "entityId": "i1",
    "properties": {
      "categories": ["c1", "c2", "c3"]
    }
  }'
```
I expected this not to create a new record in my event data store (HBase),
because I sent an `eventId`. But it did.
Although PredictionIO appears to upsert an event into the event data source
by its event ID, the `events.json` controller does not use it, so we cannot
send an event ID to update an existing record for the same entity. The cause
is that the serializer `EventJson4sSupport` does not handle the `eventId`
field.
What made me think this is a bug is the first piece of code I shared, because
it already tries to do this, as I mentioned.
After fixing it, I got it working: PredictionIO now performs an `upsert` when
an `eventId` is sent, instead of always doing an `insert`.
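The idea behind the fix can be sketched as reading `eventId` as an optional field during deserialization. The sketch below is not the actual diff of this pull request and does not use the json4s API: `ParsedEvent` and `EventFieldReader` are hypothetical names, and a `Map[String, String]` stands in for the top-level string fields of the posted JSON body.

```scala
// Hypothetical, simplified event type; not PredictionIO's actual Event class.
final case class ParsedEvent(
  eventId: Option[String],
  event: String,
  entityType: String,
  entityId: String
)

object EventFieldReader {
  // `fields` stands in for the top-level string fields of the posted JSON body.
  def read(fields: Map[String, String]): Either[String, ParsedEvent] =
    for {
      event      <- required(fields, "event")
      entityType <- required(fields, "entityType")
      entityId   <- required(fields, "entityId")
    } yield ParsedEvent(
      // Optional: when absent or empty, the storage layer generates a row key.
      eventId = fields.get("eventId").filter(_.nonEmpty),
      event = event,
      entityType = entityType,
      entityId = entityId
    )

  // Returns the field value, or an error message if the field is missing.
  private def required(fields: Map[String, String], name: String): Either[String, String] =
    fields.get(name).toRight(s"field $name is required")
}
```

Keeping `eventId` optional preserves the existing append behavior for clients that do not send it, while clients that do send it get a stable row key.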
I saw that PredictionIO aggregates this data to deduplicate it, so this bug
may cause a performance problem in that aggregation phase, because every
night we push all users and products into PredictionIO again. Imagine
millions of users and products being inserted each time: after a while there
will be billions of records, so the training phase may get slower every day.
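The growth can be estimated with simple arithmetic. The numbers below are purely hypothetical (5 million entities, one full re-send per night for a year), and `EventGrowth` is an illustrative helper, not part of PredictionIO. The upsert case assumes each entity is always sent with the same `eventId`.

```scala
object EventGrowth {
  // Rows stored after `days` nightly full re-sends when every send appends.
  def appendOnly(entities: Long, days: Long): Long = entities * days

  // Rows stored when each entity is re-sent with a stable eventId (upsert).
  def upsert(entities: Long, days: Long): Long = if (days > 0) entities else 0L
}

object EventGrowthDemo {
  val entities = 5000000L // hypothetical number of users plus products
  val days = 365L         // hypothetical one year of nightly re-sends

  val appended: Long = EventGrowth.appendOnly(entities, days) // 1825000000
  val upserted: Long = EventGrowth.upsert(entities, days)     // 5000000
}
```

Under these assumptions the append-only store holds about 1.8 billion rows after a year, while the upsert store stays at 5 million, which is the number of rows the aggregation phase would have to scan.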
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/BoypHolding/incubator-predictionio community
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-predictionio/pull/356.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #356
----
commit 48a4b0cac22dadf131527c8eed78d08c1eb2e185
Author: Ahmet DAL <[email protected]>
Date: 2017-03-02T10:58:15Z
Even it seems that prediction io tries to upsert an event by its event id,
events.json controller doesn't support sending eventId.
----