[ 
https://issues.apache.org/jira/browse/PIO-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876893#comment-15876893
 ] 

Pat Ferrel commented on PIO-45:
-------------------------------

I've updated the Template here: https://github.com/actionml/db-cleaner.git

It has a simplified integration test and a file that has the expected results. 
Neither the trim + deduplicate nor the property aggregation passes. The trim 
drops some events; the aggregation fails as described above.

I'm not sure why the trim + dedup fails. The dropped events are 2 differently 
named events for 2 different users. They are accepted by the EventServer and 
then the train task is started. The thing in common is that both are 
duplicates of events that should be trimmed. In other words, the event times 
for each set of duplicate events span the cutoff for trim. This obviously 
should not happen. Trim should be considered independent of dedup. Just because 
one of the duplicate events is too old does not mean the rest are (and in this 
case they aren't).

I could suggest 2 algorithm changes for the trim if the above is not enough: 
either trim before dedup, or dedup to the most recent timestamp and then trim. 
Either should leave only one event inside the duration, the one with the most 
recent timestamp, unless all dups are outside the duration (a case which seems 
to work already).
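
A minimal sketch of the two orderings suggested above, not the actual 
SelfCleaningDataSource code. It assumes events are plain dicts with 
hypothetical "user", "event", "item" and "time" fields, and that duplicates 
are events sharing the same user, event name and item. The sample data is 
made up for illustration.

```python
from datetime import datetime, timedelta


def trim(events, cutoff):
    """Keep only events whose time is at or after the cutoff."""
    return [e for e in events if e["time"] >= cutoff]


def dedup_latest(events):
    """Collapse duplicates to the single event with the most recent time.

    Assumption: two events are duplicates when user, event name and item match.
    """
    latest = {}
    for e in events:
        key = (e["user"], e["event"], e["item"])
        if key not in latest or e["time"] > latest[key]["time"]:
            latest[key] = e
    return list(latest.values())


def trim_then_dedup(events, cutoff):
    """Option 1: drop old events first, then collapse what is left."""
    return dedup_latest(trim(events, cutoff))


def dedup_then_trim(events, cutoff):
    """Option 2: collapse to the most recent event first, then drop old ones."""
    return trim(dedup_latest(events), cutoff)


# Hypothetical data: u1's duplicates span the cutoff, u2's are all too old.
now = datetime(2017, 2, 21)
cutoff = now - timedelta(days=30)
events = [
    {"user": "u1", "event": "purchase", "item": "i1", "time": datetime(2017, 1, 1)},
    {"user": "u1", "event": "purchase", "item": "i1", "time": datetime(2017, 2, 1)},
    {"user": "u1", "event": "purchase", "item": "i1", "time": datetime(2017, 2, 10)},
    {"user": "u2", "event": "view", "item": "i2", "time": datetime(2016, 12, 1)},
    {"user": "u2", "event": "view", "item": "i2", "time": datetime(2016, 12, 15)},
]

result_a = trim_then_dedup(events, cutoff)
result_b = dedup_then_trim(events, cutoff)
```

With either ordering, a set of duplicates that spans the cutoff keeps its 
most recent event, and a set that lies entirely outside the duration is 
removed.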

Just pull the most recent template and run the integration test. You'll see the 
before, after, and expected output in data/. Let me know if you have any 
questions.

> SelfCleaningDatasource erases all data
> --------------------------------------
>
>                 Key: PIO-45
>                 URL: https://issues.apache.org/jira/browse/PIO-45
>             Project: PredictionIO
>          Issue Type: Bug
>    Affects Versions: 0.10.0-incubating
>            Reporter: Pat Ferrel
>            Assignee: Alexander  Merritt
>            Priority: Blocker
>             Fix For: 0.11.0
>
>         Attachments: import_handmade_simple.py, 
> sample-time-window-and-downsample-data.txt
>
>
> As integrated into the UR, the SelfCleaningDataset erases all data in the 
> integration test. This feature works fine in the AML version of PIO.
> Although this has not been tested, one could assume the same would be true 
> with any other Datasource in other templates.
> [~emergentorder] can you check whether the PIO merge was done correctly?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
