[Numpy-discussion] Error in Covariance and Variance calculation
Dear Programmers,

This is Gunter Meissner. I am currently writing a book on Forecasting and derived the regression coefficient with Numpy:

    import numpy as np
    X = [1, 2, 3, 4]
    Y = [1, 8000, 5000, 1000]
    print(np.cov(X, Y))
    print(np.var(X))
    Beta1 = np.cov(X, Y) / np.var(X)
    print(Beta1)

However, Numpy is using the SAMPLE covariance Cov(X, Y) = sum((Xi - Xbar) * (Yi - Ybar)) / (n - 1) (which divides by n-1) and the POPULATION variance VarX = sum((Xi - Xbar)^2) / n (which divides by n). Therefore the regression coefficient BETA1 is not correct.

The solution is easy: please use the population approach (dividing by n) for BOTH covariance and variance, or use the sample approach (dividing by n-1) for BOTH covariance and variance. You may also allow the user to choose, as in Excel, where the user can choose between Var.S and Var.P and between Cov.S and Cov.P.

Thanks!!!
Gunter

Gunter Meissner, PhD
University of Hawaii
Adjunct Professor of MathFinance at Columbia University and NYU
President of Derivatives Software, www.dersoft.com
CEO Cassandra Capital Management, www.cassandracm.com
CV: www.dersoft.com/cv.pdf
Email: meiss...@hawaii.edu
Tel: USA (808) 779 3660

From: NumPy-Discussion On Behalf Of Ralf Gommers
Sent: Wednesday, March 18, 2020 5:16 AM
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System

On Tue, Mar 17, 2020 at 9:03 PM Sebastian Berg <sebast...@sipsolutions.net> wrote:

Hi all,

in the spirit of trying to keep this moving, can I assume that the main reason for little discussion is that the actual changes proposed are not very far reaching as of now? Or is the reason that this is a fairly complex topic that you need more time to think about it?

Probably (a) it's a long NEP on a complex topic, (b) the past week has been a very weird week for everyone (in the extra-news-reading-time I could easily have re-reviewed the NEP), and (c) the amount of feedback one expects to get on a NEP is roughly inversely proportional to the scope and complexity of the NEP contents.

Today I re-read the parts I commented on before. This version is a big improvement over the previous ones. Thanks in particular for adding clear examples and the diagram, it helps a lot.

If it is the latter, is there some way I can help with it? I tried to minimize how much is part of this initial NEP.

If there is not much need for discussion, I would like to officially accept the NEP very soon, sending out an official one week notice in the next days.

I agree. I think I would like to keep the option open though to come back to the NEP later to improve the clarity of the text about motivation/plan/examples/scope, given that this will be the reference for a major amount of work for a long time to come.

To summarize one more time, the main point is that:

This point seems fine, and I'm +1 for going ahead with the described parts of the technical design.

Cheers,
Ralf

type(np.dtype(np.float64)) will be `np.dtype[float64]`, a subclass of dtype, so that issubclass(np.dtype[float64], np.dtype) is true. This means that we will have one class for every current type number: `dtype.num`. The implementation of these subclasses will be a C-written (extension) MetaClass; all details of this class are supposed to remain experimental and in flux at this time.
Cheers,

Sebastian

On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:
> Hi all,
>
> I am pleased to propose NEP 41: First step towards a new Datatype
> System: https://numpy.org/neps/nep-0041-improved-dtype-support.html
>
> This NEP motivates the larger restructure of the datatype machinery in
> NumPy and defines a few fundamental design aspects. The long term user
> impact will be allowing easier and more rich-featured user-defined
> datatypes.
>
> As this is a large restructure, the NEP represents only the first steps,
> with some additional information in further NEPs being drafted [1]
> (this may be helpful to look at depending on the level of detail you
> are interested in).
>
> The NEP itself does not propose to add significant new public API.
> Instead it proposes to move forward with an incremental internal
> refactor and lays the foundation for this process.
>
> The main user facing change at this time is that datatypes will become
> classes (e.g. ``type(np.dtype("float64"))`` will be a float64-specific
> class).
>
> For most users, the main impact should be many new datatypes in the
> long run (see the user impact section). However, for those interested
> in API design within NumPy or with respect to implementing new
> datatypes, this and the following NEPs are important decisions in the
> future
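As a rough illustration of the dtype-class behaviour summarized above (a sketch, not part of the thread; at the time of these emails the NEP was only a proposal, so exactly what `type(np.dtype("float64"))` prints depends on the NumPy version):

    import numpy as np

    f64 = np.dtype(np.float64)

    # On NumPy versions predating the NEP, type(f64) is np.dtype itself;
    # on later versions it is a float64-specific subclass.  The checks
    # below hold either way.
    print(type(f64))
    print(issubclass(type(f64), np.dtype))   # True
    print(isinstance(f64, np.dtype))         # True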
Re: [Numpy-discussion] Error in Covariance and Variance calculation
Thanks Warren! Worked like a charm 😊 Will mention you in the book...

Gunter Meissner, PhD
University of Hawaii
Adjunct Professor of MathFinance at Columbia University and NYU
President of Derivatives Software, www.dersoft.com
CEO Cassandra Capital Management, www.cassandracm.com
CV: www.dersoft.com/cv.pdf
Email: meiss...@hawaii.edu
Tel: USA (808) 779 3660

-----Original Message-----
From: NumPy-Discussion On Behalf Of Warren Weckesser
Sent: Friday, March 20, 2020 8:45 AM
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] Error in Covariance and Variance calculation

On 3/20/20, Gunter Meissner wrote:
> Dear Programmers,
>
> This is Gunter Meissner. I am currently writing a book on Forecasting
> and derived the regression coefficient with Numpy:
>
> import numpy as np
> X = [1, 2, 3, 4]
> Y = [1, 8000, 5000, 1000]
> print(np.cov(X, Y))
> print(np.var(X))
> Beta1 = np.cov(X, Y) / np.var(X)
> print(Beta1)
>
> However, Numpy is using the SAMPLE covariance
> Cov(X, Y) = sum((Xi - Xbar) * (Yi - Ybar)) / (n - 1) (which divides by n-1)
> and the POPULATION variance VarX = sum((Xi - Xbar)^2) / n (which divides
> by n). Therefore the regression coefficient BETA1 is not correct.
>
> The solution is easy: please use the population approach (dividing by
> n) for BOTH covariance and variance, or use the sample approach
> (dividing by n-1) for BOTH covariance and variance. You may also allow
> the user to choose, as in Excel, where the user can choose between
> Var.S and Var.P and between Cov.S and Cov.P.
>
> Thanks!!!
>
> Gunter

Gunter,

This is an unfortunate discrepancy in the API: `var` uses the default
`ddof=0`, while `cov` uses, in effect, `ddof=1` by default.

You can get the consistent behavior you want by using `ddof=1` in both
functions. E.g.

    Beta1 = np.cov(X, Y, ddof=1) / np.var(X, ddof=1)

Using `ddof=1` in `np.cov` is redundant, but in this context, it is
probably useful to make explicit to the reader of the code that both
functions are using the same convention.

Changing the default in either function breaks backwards compatibility.
That would require a long and potentially painful deprecation process.

Warren

> Gunter Meissner, PhD
> University of Hawaii
> Adjunct Professor of MathFinance at Columbia University and NYU
> President of Derivatives Software, www.dersoft.com
> CEO Cassandra Capital Management, www.cassandracm.com
> CV: www.dersoft.com/cv.pdf
> Email: meiss...@hawaii.edu
> Tel: USA (808) 779 3660
>
> From: NumPy-Discussion On Behalf Of Ralf Gommers
> Sent: Wednesday, March 18, 2020 5:16 AM
> To: Discussion of Numerical Python
> Subject: Re: [Numpy-discussion] Proposal: NEP 41 -- First step towards
> a new Datatype System
>
> On Tue, Mar 17, 2020 at 9:03 PM Sebastian Berg
> <sebast...@sipsolutions.net> wrote:
>
> Hi all,
>
> in the spirit of trying to keep this moving, can I assume that the
> main reason for little discussion is that the actual changes proposed
> are not very far reaching as of now? Or is the reason that this is a
> fairly complex topic that you need more time to think about it?
>
> Probably (a) it's a long NEP on a complex topic, (b) the past week has
> been a very weird week for everyone (in the extra-news-reading-time I
> could easily have re-reviewed the NEP), and (c) the amount of feedback
> one expects to get on a NEP is roughly inversely proportional to the
> scope and complexity of the NEP contents.
> Today I re-read the parts I commented on before. This version is a big
> improvement over the previous ones. Thanks in particular for adding
> clear examples and the diagram, it helps a lot.
>
> If it is the latter, is there some way I can help with it? I tried to
> minimize how much is part of this initial NEP.
>
> If there is not much need for discussion, I would like to officially
> accept the NEP very soon, sending out an official one week notice in
> the next days.
>
> I agree. I think I would like to keep the option open though to come
> back to the NEP later to improve the clarity of the text about
> motivation/plan/examples/scope, given that this will be the reference
> for a major amount of work for a long time to come.
>
> To summarize one more time, the main point is that:
>
> This point seems fine, and I'm +1 for going ahead with the described
> parts of the technical design.
>
> Cheers,
> Ralf
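As an aside to the covariance/variance exchange above: a minimal sketch (not from the original messages) of the ddof-consistent slope calculation Warren suggests, using Gunter's data and checked against a plain least-squares fit. Note that np.cov returns the full 2x2 covariance matrix, so the off-diagonal element is the covariance of X and Y:

    import numpy as np

    X = [1, 2, 3, 4]
    Y = [1, 8000, 5000, 1000]

    # Use the same ddof in both calls so the normalisations cancel out.
    cov_xy = np.cov(X, Y, ddof=1)[0, 1]   # sample covariance of X and Y
    var_x = np.var(X, ddof=1)             # sample variance of X
    beta1 = cov_xy / var_x                # regression (OLS) slope estimate

    # Sanity check against a direct least-squares fit of Y on X.
    slope, intercept = np.polyfit(X, Y, 1)
    print(beta1, slope)                   # the two slope estimates agree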
Re: [Numpy-discussion] Unreliable crash when converting using numpy.asarray via C buffer interface
Aloha Numpy Community,

I am just writing a book on "How to Cheat in Statistics - And get Away with It". I noticed there is no built-in function for the 'Adjusted R-squared' in any library (do correct me if I am wrong). I think it would be a good idea to program it. The math is straightforward; I can provide it if desired.

Thank you,
Gunter

On Mon, Feb 15, 2021 at 5:56 AM Sebastian Berg wrote:
> On Mon, 2021-02-15 at 10:12 +0100, Friedrich Romstedt wrote:
> > Hi,
> >
> > On Thu, 4 Feb 2021 at 09:07, Friedrich Romstedt wrote:
> > > On Mon, 1 Feb 2021 at 09:46, Matti Picus <matti.pi...@gmail.com> wrote:
> > > > Typically, one would create a complete example and then point to
> > > > the code (as repo or pastebin, not as an attachment to a mail here).
> > >
> > > https://github.com/friedrichromstedt/bughunting-01
> >
> > Last week I updated my example code to be more slim. There now exists
> > a single-file extension module:
> > https://github.com/friedrichromstedt/bughunting-01/blob/master/lib/bughuntingfrmod/bughuntingfrmod.cpp
> > The corresponding test program
> > https://github.com/friedrichromstedt/bughunting-01/blob/master/test/2021-02-11_0909.py
> > crashes "properly" both on Windows 10 (Python 3.8.2, numpy 1.19.2) as
> > well as on Arch Linux (Python 3.9.1, numpy 1.20.0), when the ``print``
> > statement contained in the test file is commented out.
> >
> > My hope to be able to fix my error myself by reducing the code to
> > reproduce the problem has not been fulfilled. I feel that the
> > abovementioned test code is short enough to ask for help with it here.
> > Any hint on how I could solve my problem would be appreciated very
> > much.
>
> I have tried it out, and can confirm that using debugging tools (namely
> valgrind) will allow you to track down the issue (valgrind reports it
> from within python; running a python without debug symbols may
> obfuscate the actual problem; if that is limiting you, I can post my
> valgrind output).
> Since you are running a linux system, I am confident that you can run
> it in valgrind to find it yourself. (There may be other ways.)
>
> Just remember to run valgrind with `PYTHONMALLOC=malloc valgrind` and
> ignore some errors e.g. when importing NumPy.
>
> Cheers,
>
> Sebastian
>
> > There are some points which were not clarified yet; I am citing them
> > below.
> >
> > So far,
> > Friedrich
> >
> > > > - There are tools out there to analyze refcount problems. Python
> > > > has some built-in tools for switching allocation strategies.
> > >
> > > Can you give me some pointer about this?
> > >
> > > > - numpy.asarray has a number of strategies to convert instances,
> > > > which one is it using?
> > >
> > > I've tried to read about this, but couldn't find anything. What are
> > > these different strategies?
--
Gunter Meissner, PhD
University of Hawaii
Adjunct Professor of MathFinance at Columbia University and NYU
President of Derivatives Software, www.dersoft.com
CEO Cassandra Capital Management, www.cassandracm.com
CV: www.dersoft.com/cv.pdf
Email: meiss...@hawaii.edu
Tel: USA (808) 779 3660
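Returning to the adjusted R-squared suggestion at the top of this last message: NumPy itself does not ship such a helper, but the statistic is a short function on top of the ordinary R-squared. A minimal sketch (an illustration only, not an existing NumPy API; the function name and the example data are made up):

    import numpy as np

    def adjusted_r_squared(y_true, y_pred, p):
        """Adjusted R-squared for a fit with p predictors (not counting the intercept)."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        n = y_true.size
        ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
        r2 = 1.0 - ss_res / ss_tot                       # ordinary R-squared
        return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)  # penalize for p predictors

    # Example: adjusted R-squared of a simple linear fit (one predictor).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
    slope, intercept = np.polyfit(x, y, 1)
    print(adjusted_r_squared(y, slope * x + intercept, p=1))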