GitHub user javrasya opened a pull request:

    https://github.com/apache/incubator-predictionio/pull/356

    EventServer - eventId post parameter support

    PredictionIO's event API is append-only: even when an event refers to an
entity with the same entity ID, it still creates a new record in the event data
source. This is useful for tracking the history of entities in PredictionIO,
but in some cases appending is not needed. At the end of each day, we push all
products and users into PredictionIO regardless of duplication. To do that
without accumulating duplicates, there should be a mechanism that upserts by an
identifier. While reading the code, I noticed that PredictionIO actually tries
to do this in `HBEventsUtil`:
    
    ```scala
      def eventToPut(event: Event, appId: Int): (Put, RowKey) = {
        // generate new rowKey if eventId is None
        val rowKey = event.eventId.map { id =>
          RowKey(id) // create rowKey from eventId
        }.getOrElse {
          // TOOD: use real UUID. not pseudo random
          val uuidLow: Long = UUID.randomUUID().getLeastSignificantBits
          RowKey(
            entityType = event.entityType,
            entityId = event.entityId,
            millis = event.eventTime.getMillis,
            uuidLow = uuidLow
          )
        }
    ```
    
    I didn't see this in the documentation, but after finding it in the code I
tried sending `eventId`:
    ```bash
    curl -i -X POST "http://localhost:7070/events.json?accessKey=gMC4E73ZZ76NrRBjFxHp3FmY7KCt-OmokBkbvtgidpLXQZzOV_G9dIu_7-gc5X1U" \
    -H "Content-Type: application/json" \
    -d '{
      "eventId": "KpjNMVrQzY2s0TZhYB3vsAAAAVOFSkM1kLoZgQnOA1EB",
      "event" : "$set",
      "entityType" : "item",
      "entityId" : "i1",
      "properties" : {
        "categories" : ["c1", "c2", "c3"]
      }
    }'
    ```
    
    Because I sent `eventId`, I expected this not to create a new record in my
event data store, which is HBase. But it did.
    
    Although PredictionIO appears to try to upsert an event into the event data
source by its event ID, the `events.json` controller doesn't use it, which
means we can't resend an event ID for the same entity. The reason is that the
serializer, `EventJson4sSupport`, doesn't handle the `eventId` field.
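    
    To illustrate what "handling the `eventId` field" means, here is a minimal
sketch of a reader that treats `eventId` as optional. It is not the code from
`EventJson4sSupport` and not the change in this pull request: `ParsedEvent`,
`EventReader`, and the `Map[String, String]` standing in for a parsed JSON body
are all hypothetical names chosen for the example.
    
    ```scala
    // Hypothetical, simplified event type; the real Event class has more fields.
    final case class ParsedEvent(
      event: String,
      entityType: String,
      entityId: String,
      eventId: Option[String]
    )
    
    object EventReader {
      // `fields` is an assumed stand-in for an already-parsed JSON object.
      // Returns None when a required field is missing.
      def read(fields: Map[String, String]): Option[ParsedEvent] =
        for {
          event      <- fields.get("event")
          entityType <- fields.get("entityType")
          entityId   <- fields.get("entityId")
        } yield ParsedEvent(
          event,
          entityType,
          entityId,
          fields.get("eventId") // optional: kept when the client sends it
        )
    }
    ```
    
    If a reader ignores the field instead, `eventId` is always `None` by the
time the event reaches the storage layer, whatever the client sent.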
    
    What made me think this is a bug was the first piece of code I shared,
because it already tries to upsert by event ID, as I mentioned.
    
    After my fix, it works: PredictionIO now performs an `upsert` when
`eventId` is supplied, instead of an `insert` every time.
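    
    The difference between the two behaviours can be sketched with an in-memory
store keyed by row key. This is only an illustration of the idea behind
`eventToPut`, not PredictionIO code: `Event` is reduced to a few fields, and
`InMemoryEventStore` and its string row keys are assumptions made for the
example.
    
    ```scala
    import scala.collection.mutable
    
    // Hypothetical, simplified stand-in for the real Event type.
    final case class Event(
      entityType: String,
      entityId: String,
      eventTimeMillis: Long,
      eventId: Option[String] = None
    )
    
    // In-memory stand-in for the HBase table: row key -> event.
    final class InMemoryEventStore {
      private val rows = mutable.LinkedHashMap.empty[String, Event]
    
      // Reuse eventId as the row key when present (upsert); otherwise
      // generate a fresh key, so the write is always an append.
      def put(event: Event): String = {
        val rowKey = event.eventId.getOrElse {
          val uuidLow = java.util.UUID.randomUUID().getLeastSignificantBits
          s"${event.entityType}-${event.entityId}-${event.eventTimeMillis}-$uuidLow"
        }
        rows.update(rowKey, event.copy(eventId = Some(rowKey)))
        rowKey
      }
    
      def size: Int = rows.size
    }
    
    object UpsertDemo {
      def main(args: Array[String]): Unit = {
        val store = new InMemoryEventStore
        // Without eventId: two writes for the same entity create two rows.
        store.put(Event("item", "i1", 1000L))
        store.put(Event("item", "i1", 1000L))
        // With the same eventId: the second write overwrites the first.
        store.put(Event("item", "i1", 1000L, Some("fixed-id")))
        store.put(Event("item", "i1", 2000L, Some("fixed-id")))
        println(store.size) // prints 3
      }
    }
    ```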
    
    I saw that PredictionIO aggregates this data to deduplicate it, so the bug
I spotted may cause a performance problem in that aggregation phase, because
every night we push all products and users into PredictionIO again. Imagine
millions of users and products being inserted each night: after a while there
will be billions of records, and the training phase may get slower every day.
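    
    As a rough back-of-the-envelope check of that growth, the sketch below
compares row counts under append and upsert. The figures (5 million entities,
200 nightly loads) are hypothetical and chosen only to make the arithmetic
concrete; they are not measurements from our system.
    
    ```scala
    object GrowthEstimate {
      // Append-only: every nightly load adds one row per entity.
      def rowsWithAppend(entities: Long, nights: Long): Long =
        entities * nights
    
      // Upsert by a stable eventId: each entity keeps a single row.
      def rowsWithUpsert(entities: Long, nights: Long): Long =
        if (nights > 0) entities else 0L
    
      def main(args: Array[String]): Unit = {
        val entities = 5000000L // hypothetical catalogue size
        val nights   = 200L     // hypothetical number of nightly loads
        println(rowsWithAppend(entities, nights)) // prints 1000000000
        println(rowsWithUpsert(entities, nights)) // prints 5000000
      }
    }
    ```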

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BoypHolding/incubator-predictionio community

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-predictionio/pull/356.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #356
    
----
commit 48a4b0cac22dadf131527c8eed78d08c1eb2e185
Author: Ahmet DAL <[email protected]>
Date:   2017-03-02T10:58:15Z

    Even it seems that prediction io tries to upsert an event by its event id, 
events.json controller doesn't support sending eventId.

----

