[
https://issues.apache.org/jira/browse/PIO-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850734#comment-15850734
]
Pat Ferrel commented on PIO-45:
-------------------------------
[~emergentorder] This has gone through several fixes but, as of Feb 2, 2017,
still does not work.
Attached is an input dataset. Import it into an app with the supplied Python
import script. The script writes data to the EventServer in reverse
chronological order, so the first events are the most recent and the last are
the furthest in the past. In the input file, the top line is therefore the
most recent.
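The ordering described above can be sketched in Python as follows. This is not
the attached import script: the function name assign_event_times, the
one-second spacing, and the starting timestamp are all assumptions made for
illustration.

```python
from datetime import datetime, timedelta, timezone


def assign_event_times(lines, newest, step=timedelta(seconds=1)):
    """Pair each input line with an event time; the first line is the newest.

    Hypothetical helper: the spacing between events is an assumption.
    """
    return [(line, newest - i * step) for i, line in enumerate(lines)]


# The three Galaxy lines quoted in this comment, in file order (top first).
lines = [
    "Galaxy,$set,categories:Phones:Electronics:Samsung",
    "Galaxy,$set,categories:Phones:Electronics",
    "Galaxy,$set,categories:Phones:Electronics:Samsung",
]

# Assumed starting timestamp for the newest event.
newest = datetime(2017, 2, 2, tzinfo=timezone.utc)
timed = assign_event_times(lines, newest)
```

Each later line in the file gets an earlier event time, so the top of the file
stays the most recent event once imported.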
Run a template with the SelfCleaningDataSource enabled, such as the UR. The
line enabling it is commented out; just remove the comment marker.
I then created an eventWindow in engine.json with the following:
"eventWindow": {
  "duration": "3650 days",
  "removeDuplicates": true,
  "compressProperties": false
}
This passed, so de-duplication seems to work.
Then I tried:
"eventWindow": {
  "duration": "3650 days",
  "removeDuplicates": true,
  "compressProperties": true
}
But property compression does not work. After compression, the value of each
named property should be the one from its newest $set. In fact, compression
should never change the properties as they are returned when aggregated. But
if you export the app data after running compression, a simple failure case
is Galaxy, which has this input:
Galaxy,$set,categories:Phones:Electronics:Samsung
Galaxy,$set,categories:Phones:Electronics
Galaxy,$set,categories:Phones:Electronics:Samsung
It obviously should end up with categories:Phones:Electronics:Samsung.
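The expected result can be illustrated with a minimal Python sketch of
"newest $set wins" aggregation. This is not PredictionIO's implementation; the
helper names parse_line and compress_set_events are hypothetical, and the
"entity,event,name:value:value" line format is assumed from the three lines
quoted here.

```python
def parse_line(line):
    """Split 'entity,event,name:v1:v2' into its parts (format is assumed)."""
    entity, event, prop = line.split(",", 2)
    name, *values = prop.split(":")
    return entity, event, name, values


def compress_set_events(lines_newest_first):
    """Fold $set events oldest to newest so the newest value of each property wins."""
    props = {}
    for line in reversed(lines_newest_first):
        entity, event, name, values = parse_line(line)
        if event == "$set":
            props.setdefault(entity, {})[name] = values
    return props


# The Galaxy input from this comment, newest first (top of the input file).
galaxy_lines = [
    "Galaxy,$set,categories:Phones:Electronics:Samsung",
    "Galaxy,$set,categories:Phones:Electronics",
    "Galaxy,$set,categories:Phones:Electronics:Samsung",
]

compressed = compress_set_events(galaxy_lines)
```

Under these assumptions, compressed holds categories as Phones, Electronics,
Samsung for Galaxy, which is the value compression should preserve.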
Without the SelfCleaningDataSource, I checked the model the UR creates, and
this is the value written to it:
"categories": [
"Phones",
"Electronics",
"Samsung"
],
After enabling property compression by setting "compressProperties" to true in
engine.json, the data exported from the app in the EventServer has:
"Galaxy","properties":"categories":["Phones","Electronics"]
There appear to be several other errors in property compression but this should
suffice as an illustration.
This seems pretty severe since the properties will never get back in sync.
> SelfCleaningDatasource erases all data
> --------------------------------------
>
> Key: PIO-45
> URL: https://issues.apache.org/jira/browse/PIO-45
> Project: PredictionIO
> Issue Type: Bug
> Affects Versions: 0.10.0-incubating
> Reporter: Pat Ferrel
> Assignee: Alexander Merritt
> Priority: Critical
> Fix For: 0.11.0
>
>
> As integrated into the UR, in the integration-test, the SelfCleaningDataSource
> erases all data. This feature works fine in the AML version of PIO.
> Although not tested, one could assume that this would be true with any other
> DataSource in other templates.
> [~emergentorder] can you check to see if the PIO merge was done correctly?