Hi Kevin

As I wrote in my original mail, BLAS has no routine for this operation; you need at least two BLAS (or NumPy) operations. I'd like to avoid the drawbacks of those solutions because this is a very performance-critical and memory-bound operation (X can be very large).
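To make the drawback concrete, here is a minimal sketch of that two-operation baseline (the function name is mine, purely illustrative):

```python
import numpy as np

def sandwich_naive(X, w):
    """Compute X.T @ diag(w) @ X via two operations.

    Step 1 allocates a scaled copy the same size as X; step 2 is a
    plain GEMM that cannot exploit the symmetry of the result.
    """
    scaled = X.T * w[None, :]   # extra O(n * p) memory
    return scaled @ X           # general matmul, symmetry unused

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4))
w = rng.random(1000)
XtWX = sandwich_naive(X, w)
```

For large X the `scaled` temporary is exactly the doubling of memory discussed in this thread.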

The undertaking of implementing a GEMM-like sandwich product that exploits the symmetry, SIMD and thread parallelism is difficult, but doable, as demonstrated, e.g., here: https://github.com/Quantco/tabmat/blob/main/src/tabmat/ext/dense_helpers-tmpl.cpp.

The question remains: Is this something that numpy could provide?

Best
Christian

On 04.09.2025 14:19, Kevin Sheppard wrote:
A lighter alternative would be for a downstream project to use Cython + SciPy to create a small binary module that mimics the numba code but uses LAPACK directly. It isn't particularly difficult to write the equivalent code, and we use a lot of this in statsmodels. You would then only need to distribute a relatively small binary, and the rest would already be included in the install.
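As a hedged illustration of that route, SciPy already exposes the relevant BLAS routine as scipy.linalg.blas.dsyrk, so a pure-Python sketch is possible before reaching for Cython. The function name below is mine, and it assumes nonnegative weights so that sqrt(w) can be folded into the rows of X; it exploits the symmetry via syrk, though it still allocates one scaled copy of X:

```python
import numpy as np
from scipy.linalg.blas import dsyrk

def sandwich_syrk(X, w):
    # Fold the weights into the rows of X; assumes w >= 0.
    Z = np.sqrt(w)[:, None] * X   # one temporary, same size as X
    # syrk with trans=1 computes only the upper triangle of Z.T @ Z,
    # roughly halving the flops compared to a general GEMM.
    C = dsyrk(1.0, Z, trans=1)
    # Mirror the upper triangle to obtain the full symmetric result.
    return np.triu(C) + np.triu(C, 1).T

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
w = rng.random(500)
res = sandwich_syrk(X, w)
```

This only addresses the symmetry half of the problem; the memory allocation for Z remains.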

Kevin


On Tue, Sep 2, 2025 at 9:02 PM Christian Lorentzen <[email protected]> wrote:

    As nice as numba is, it’s a very heavy dependency that many high
    profile libraries want to avoid.

    Best
    Christian

    Am 18.08.2025 um 14:25 schrieb Kevin Sheppard
    <[email protected]>:

    
    This is a good place for numba if performance and memory use are
    a concern.

    Good NumPy
    125 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    Naive numba (memory efficient)
    386 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    Parallel numba (memory efficient, CPU bound)
    46.8 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops
    each)

    Here is a gist that produced the timings:
    https://gist.github.com/bashtage/39c50b223580b9bc3f763a3b1be3e478

    All timings were on a 12-core Ryzen.

    Kevin

    On Mon, Aug 18, 2025 at 12:35 PM Christian Lorentzen via
    NumPy-Discussion <[email protected]> wrote:

        If W is just a vector, i.e. my main use case, then what you
        write is what I wanted to imply and is actually how I
        implemented it in several cases.
        The point is that X.T * weights[None, :] still doubles memory
        (same size as X), and this solution doesn't exploit the symmetry.
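For completeness, a np.einsum sketch that contracts X.T @ diag(w) @ X in a single call, so no scaled copy of X is materialized; note that einsum typically does not reach GEMM speed and still ignores the symmetry, so this is an illustration, not a recommendation:

```python
import numpy as np

def sandwich_einsum(X, w):
    # sum_j X[j, i] * w[j] * X[j, k] -> out[i, k]; no (n, p) temporary
    # is allocated, but this is generally slower than a tuned GEMM
    # and the symmetry of the result is still unused.
    return np.einsum('ji,j,jk->ik', X, w, X)

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 6))
w = rng.random(200)
out = sandwich_einsum(X, w)
```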

        Best
        Christian
        On 17.08.2025 22:13, Kevin Jacobs wrote:
        Why not:

        XTWX = (X.T * weights[None, :]) @ X

        if `weights` is a vector? Allocating `diag(weights)` is the
        big memory and CPU hog, at least for my generally long and
        skinny X matrices.

        -Kevin


        On Fri, Aug 15, 2025 at 10:08 AM Christian Lorentzen via
        NumPy-Discussion <[email protected]> wrote:

            Dear numpy community

            I would like to propose a new feature in np.linalg to
            compute the weighted Gram matrix (sandwich product), see
            https://github.com/numpy/numpy/issues/29559.


                  Proposed new feature or change:

            The solvers for many important statistical models like
            Generalized Linear Models often need to compute X.T @ W
            @ X, where X is a 2-dimensional array of features
            (features in columns, observations in rows) and W is
            very often a diagonal array of (current) weights. This
            computation is usually the main computational bottleneck.

            It would be great if numpy could provide an efficient
            implementation of it, e.g.
            np.linalg.sandwich_product(X, weight=w).

            Why numpy? It seems even more unrealistic to me to get
            it into BLAS implementations. Numpy has support for SIMD
            via Highway.


                    Computational alternatives

            Drawbacks of (X.T * diag(W)) @ X:

              * Additional memory allocation to compute X.T *
                diag(W), same size as the (usually large) X.
              * The result is symmetric, but this fact is not used,
                so a speedup of at least a factor of 2 is possible.
                Note that without weights, X.T @ X uses BLAS syrk.
                But the weights are crucial.

            Drawbacks of Z = np.sqrt(diag(W))[:, None] * X followed
            by Z.T @ Z:

              * Additional memory allocation for Z, same size as the
                (usually large) X.
              * Taking square roots of the possibly large diag(W).


                  Additional information

            https://github.com/Quantco/tabmat has an implementation
            of it with XSIMD.


            Best
            Christian Lorentzen

            _______________________________________________
            NumPy-Discussion mailing list -- [email protected]
            To unsubscribe send an email to
            [email protected]
            https://mail.python.org/mailman3//lists/numpy-discussion.python.org
            Member address: [email protected]


