Ok so let me restart this discussion...

After a successful first release as an ASF incubating project, we started
discussing what to do with the main dependency, the
signal_processing_algorithms repo. The driving motivation is that it is
rather central to what Otava is doing, and long term it will be better for
development if we can easily make changes in both halves.

The guidance from our mentors was that the repo is too large to just copy
paste into Otava (2600 lines). For such additions, ASF usually prefers to
receive a copyright transfer/donation in writing from the original author /
copyright holder.

So the guidance was that someone from the Otava IPMC (me) should contact
MongoDB to find out whether they would be open to such a transfer. For
context, we were already in contact with MongoDB when we drafted the
project proposal a year ago. While they were mostly enthusiastic, in
hindsight it seems formally joining Apache Otava (incubating) was never
enough of a priority for it to actually happen. So chances are the same
dialogue would play out again: general excitement, but a high risk that the
legal department has other priorities, and in the end we would just have
wasted time on talking instead of programming.



So, I looked deeper into what we have in front of us, and have discussed
this off-list with Alex.

While the whole codebase in the signal_processing_algorithms repo is indeed
over 2k lines, most of that is code we don't use in Otava, or don't need.
Also, the number is inflated, because the repo contains multiple different
implementations, all doing exactly the same thing.

In particular:

* Piotr already replaced the significance test (which is roughly the latter
half of what e-divisive does) with a Student's t-test.
* Also for the main part of the algorithm, Piotr introduced the windowing
approach, which is novel and not in the MongoDB code.

While Piotr's implementation kind of wraps around the original MongoDB
e-divisive implementation, it could have been done more elegantly and
efficiently if it had modified the e-divisive code directly. So that's
where we get into the discussion about why we don't just do that, then.


* Finally, in Otava we have my optimization from last year, the incremental
e-divisive implementation, which is also novel; the MongoDB code has nothing
like that. However, it still uses the very core part of the e-divisive
algorithm, so from a "code coverage" perspective, it does not further reduce
the number of lines that we depend on in the signal processing repo.

When all the above is accounted for, there are about 100 lines of code that
execute the very heart of e-divisive: pairwise comparison of the data
points in a time series. This could be rewritten by someone just by
implementing line by line the math from the Matteson & James (2013) paper
(formulas 5 and 6). Given that Piotr's and my work also greatly reduces the
amount of computation needed, for a first version we don't need to
implement this in C, nor use fancy numpy functions. It could just be the
double for loop that you get when implementing the \sum ... \sum (xi-xj)^2
from the paper.
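
To make the size of the task concrete, here is a minimal sketch of what such
a double for loop could look like in plain Python. It is not taken from
Otava or from signal_processing_algorithms. The function names
(pairwise_sum, q_statistic, best_split), the alpha parameter and its default,
and the minimum segment length of two points are all assumptions made for
illustration. The way the three sums are combined is likewise an assumption
that should be checked against formulas 5 and 6 before anyone relies on it.

```python
def pairwise_sum(xs, ys, alpha=1.0):
    """Plain double for loop: sum of |x - y|**alpha over all pairs (x, y)."""
    total = 0.0
    for x in xs:
        for y in ys:
            total += abs(x - y) ** alpha
    return total


def q_statistic(series, tau, alpha=1.0):
    """Scaled divergence between series[:tau] and series[tau:].

    Assumed form: between-segment mean distance (times two) minus the
    within-segment mean distances, scaled by n*m/(n+m).
    """
    left, right = series[:tau], series[tau:]
    n, m = len(left), len(right)
    if n < 2 or m < 2:
        raise ValueError("each segment needs at least two points")
    between = 2.0 / (n * m) * pairwise_sum(left, right, alpha)
    # pairwise_sum(a, a) counts every unordered pair twice (and i == j adds 0),
    # so dividing by n*(n-1) equals dividing the i<k sum by "n choose 2".
    within_left = pairwise_sum(left, left, alpha) / (n * (n - 1))
    within_right = pairwise_sum(right, right, alpha) / (m * (m - 1))
    e_hat = between - within_left - within_right
    return (n * m) / (n + m) * e_hat


def best_split(series, alpha=1.0):
    """Try every split point and return (tau, q) with the largest q."""
    best_tau, best_q = None, float("-inf")
    for tau in range(2, len(series) - 1):
        q = q_statistic(series, tau, alpha)
        if q > best_q:
            best_tau, best_q = tau, q
    return best_tau, best_q
```

Calling best_split([0, 0, 0, 0, 10, 10, 10, 10]) picks the split at index 4,
where the step is. Passing alpha=2 gives the squared difference as written
above. The sketch is quadratic per candidate split and reuses nothing between
splits, which is where the windowing and incremental work would come in.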


There aren't a lot of drawbacks with this idea. Realistically, we drop
support for the --orig-edivisive mode, as that by definition depends on the
original signal_processing_algorithms code.



Let me know what you think
henrik





-- 
*nyrkio.com <http://nyrkio.com/>* ~ *git blame for performance*

Henrik Ingo, CEO
[email protected]                               LinkedIn:
www.linkedin.com/in/heingo
+358 40 569 7354                                 Twitter: twitter.com/h_ingo
