[ 
https://issues.apache.org/jira/browse/PIO-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850734#comment-15850734
 ] 

Pat Ferrel commented on PIO-45:
-------------------------------

[~emergentorder] This has gone through several fixes but, as of Feb 2 2017, it 
still does not work. 

Attached is an input dataset. Import it into an app with the supplied python 
import script. The script writes data to the EventServer in reverse 
chronological order, so the first events are the most recent and the last are 
furthest in the past. In the input file, the top line is therefore the most 
recent.
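
As a rough illustration of that ordering (not the attached script itself), the 
sketch below turns input lines into event dicts whose eventTime decreases from 
the top of the file down. The helper name build_events, the "item" entityType, 
the one-second step, and the "name:value:value" parsing of the third column are 
all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def build_events(lines, newest=None, step=timedelta(seconds=1)):
    """Turn 'entityId,event,name:v1:v2' lines into event dicts.

    Hypothetical helper: line 0 gets the newest eventTime and each
    following line is one `step` older, mirroring the backwards order
    described above.
    """
    newest = newest or datetime.now(timezone.utc)
    events = []
    for i, line in enumerate(lines):
        entity_id, event, prop = line.strip().split(",", 2)
        name, *values = prop.split(":")
        events.append({
            "event": event,
            "entityType": "item",  # assumption, not taken from the dataset
            "entityId": entity_id,
            "properties": {name: values},
            "eventTime": (newest - i * step).isoformat(),
        })
    return events
```

Sending these dicts to the EventServer is left out on purpose; only the 
timestamp assignment is shown.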

Run a template with the SelfCleaningDataSource enabled, like the UR. The line 
enabling it is commented out; just remove the comment.

I then created an eventWindow in engine.json with the following:

"eventWindow": {
        "duration": "3650 days",
        "removeDuplicates": true,
        "compressProperties": false
}

This passed, so de-dup seems to work. 

Then I tried:

"eventWindow": {
        "duration": "3650 days",
        "removeDuplicates": true,
        "compressProperties": true
}

But property compression does not work. What should happen is that the newest 
value of each named property becomes the value after compression. In fact, 
compression should never change the properties as they are returned when 
aggregated. But if you export the app data after running compression, a simple 
failure case is Galaxy, which has this input:

Galaxy,$set,categories:Phones:Electronics:Samsung
Galaxy,$set,categories:Phones:Electronics
Galaxy,$set,categories:Phones:Electronics:Samsung

Since the top line is the most recent, this should obviously yield: 
categories:Phones:Electronics:Samsung. 
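
To make the expected behavior concrete, here is a minimal sketch of 
"newest $set wins" aggregation. It is not the SelfCleaningDataSource 
implementation; the function name compress_properties and the timestamps are 
made up for illustration, and it assumes all eventTime strings share one ISO 
format so they sort chronologically.

```python
def compress_properties(events):
    """Fold $set events per entity; the newest value of each property wins."""
    compressed = {}
    # Sort oldest first so that newer $set events overwrite older ones.
    for ev in sorted(events, key=lambda e: e["eventTime"]):
        if ev["event"] != "$set":
            continue
        compressed.setdefault(ev["entityId"], {}).update(ev["properties"])
    return compressed


# The three Galaxy events from the input above, newest first
# (timestamps are illustrative, not from the dataset).
galaxy_events = [
    {"entityId": "Galaxy", "event": "$set",
     "eventTime": "2017-02-02T00:00:03Z",
     "properties": {"categories": ["Phones", "Electronics", "Samsung"]}},
    {"entityId": "Galaxy", "event": "$set",
     "eventTime": "2017-02-02T00:00:02Z",
     "properties": {"categories": ["Phones", "Electronics"]}},
    {"entityId": "Galaxy", "event": "$set",
     "eventTime": "2017-02-02T00:00:01Z",
     "properties": {"categories": ["Phones", "Electronics", "Samsung"]}},
]

result = compress_properties(galaxy_events)
```

Under these assumptions, result["Galaxy"]["categories"] keeps "Samsung", which 
is what the exported data should contain after compression.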

Without the SelfCleaningDataSource, I checked the model the UR creates, and 
this is the value written to the model:

"categories": [
                  "Phones",
                  "Electronics",
                  "Samsung"
               ],

After enabling property compression by setting "compressProperties": true in 
engine.json, the data dumped from the app in the EventServer has:

"Galaxy","properties":"categories":["Phones","Electronics"]

There appear to be several other errors in property compression but this should 
suffice as an illustration.

This seems pretty severe since the properties will never get back in sync.


> SelfCleaningDatasource erases all data
> --------------------------------------
>
>                 Key: PIO-45
>                 URL: https://issues.apache.org/jira/browse/PIO-45
>             Project: PredictionIO
>          Issue Type: Bug
>    Affects Versions: 0.10.0-incubating
>            Reporter: Pat Ferrel
>            Assignee: Alexander  Merritt
>            Priority: Critical
>             Fix For: 0.11.0
>
>
> As integrated into the UR, in the integration-test, the SelfCleaningDataSource 
> erases all data. This feature works fine in the AML version of PIO.
> Although not tested, one could assume that this would be true with any other 
> DataSource in other templates.
> [~emergentorder] can you check to see if the PIO merge was done correctly?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
