rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752101786
btw, another crazy avenue to possibly explore here another day, since we seem bottlenecked on integer multiply: do the multiply on the FPU instead. We could try it on ARM too. It is faster than the current binary code on my machine at least, but not by much, so it needs tweaking. We also need to look at vector size limits, since a float can only represent whole integers exactly up to some size (2^24 for a 32-bit float). Maybe the accumulator has to be integer; needs some thought.

```
// optimized 256/512 bit implementation, uses FPU
int upperBound = PREFERRED_BYTE_SPECIES.loopBound(a.length - 3*PREFERRED_BYTE_SPECIES.length());
FloatVector acc1 = FloatVector.zero(FloatVector.SPECIES_PREFERRED);
FloatVector acc2 = FloatVector.zero(FloatVector.SPECIES_PREFERRED);
FloatVector acc3 = FloatVector.zero(FloatVector.SPECIES_PREFERRED);
FloatVector acc4 = FloatVector.zero(FloatVector.SPECIES_PREFERRED);
for (; i < upperBound; i += 4*PREFERRED_BYTE_SPECIES.length()) {
  // 1st block: widen bytes to floats, multiply, accumulate
  ByteVector va8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, a, i);
  ByteVector vb8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, b, i);
  Vector<Float> va32 = va8.convertShape(VectorOperators.B2F, FloatVector.SPECIES_PREFERRED, 0);
  Vector<Float> vb32 = vb8.convertShape(VectorOperators.B2F, FloatVector.SPECIES_PREFERRED, 0);
  acc1 = acc1.add(va32.mul(vb32));

  // 2nd block
  ByteVector vc8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, a, i + PREFERRED_BYTE_SPECIES.length());
  ByteVector vd8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, b, i + PREFERRED_BYTE_SPECIES.length());
  Vector<Float> vc32 = vc8.convertShape(VectorOperators.B2F, FloatVector.SPECIES_PREFERRED, 0);
  Vector<Float> vd32 = vd8.convertShape(VectorOperators.B2F, FloatVector.SPECIES_PREFERRED, 0);
  acc2 = acc2.add(vc32.mul(vd32));

  // 3rd block
  ByteVector ve8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, a, i + 2*PREFERRED_BYTE_SPECIES.length());
  ByteVector vf8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, b, i + 2*PREFERRED_BYTE_SPECIES.length());
  Vector<Float> ve32 = ve8.convertShape(VectorOperators.B2F, FloatVector.SPECIES_PREFERRED, 0);
  Vector<Float> vf32 = vf8.convertShape(VectorOperators.B2F, FloatVector.SPECIES_PREFERRED, 0);
  acc3 = acc3.add(ve32.mul(vf32));

  // 4th block
  ByteVector vg8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, a, i + 3*PREFERRED_BYTE_SPECIES.length());
  ByteVector vh8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, b, i + 3*PREFERRED_BYTE_SPECIES.length());
  Vector<Float> vg32 = vg8.convertShape(VectorOperators.B2F, FloatVector.SPECIES_PREFERRED, 0);
  Vector<Float> vh32 = vh8.convertShape(VectorOperators.B2F, FloatVector.SPECIES_PREFERRED, 0);
  acc4 = acc4.add(vg32.mul(vh32));
}
// TODO: vector tail opto
// reduce
FloatVector res1 = acc1.add(acc2);
FloatVector res2 = acc3.add(acc4);
res += (int) res1.add(res2).reduceLanes(VectorOperators.ADD);
```
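To make the exactness concern concrete, here is a minimal scalar sketch in plain Java (no Vector API) of what a single float accumulator lane does with byte products. The class and method names are hypothetical and not part of the PR. It relies only on the fact that a 32-bit float has a 24-bit significand, so integer sums are guaranteed exact only while they stay within 2^24. It models one lane only and says nothing about performance.

```java
final class FloatAccumulatorLimit {
  /** Every integer in [0, 2^24] is exactly representable as a 32-bit float. */
  static final int MAX_EXACT_FLOAT_INT = 1 << 24;

  /** Largest magnitude of a product of two signed bytes: (-128) * (-128) = 2^14. */
  static final int MAX_BYTE_PRODUCT = 128 * 128;

  /** Number of byte products one lane can always sum exactly, whatever the inputs. */
  static int guaranteedExactProductsPerLane() {
    return MAX_EXACT_FLOAT_INT / MAX_BYTE_PRODUCT; // 2^24 / 2^14
  }

  /** Scalar model of one float lane: sum n copies of the odd product 127 * 127. */
  static float floatLaneSum(int n) {
    float acc = 0f;
    for (int i = 0; i < n; i++) {
      acc += (float) 127 * (float) 127;
    }
    return acc;
  }

  /** The same sum computed with integer arithmetic. */
  static long exactLaneSum(int n) {
    return (long) n * 127 * 127;
  }

  /** True if the float lane still matches the integer result after n products. */
  static boolean isExact(int n) {
    // Cast the float to long so the comparison is not itself rounded to float.
    return (long) floatLaneSum(n) == exactLaneSum(n);
  }

  public static void main(String[] args) {
    System.out.println("guaranteed exact products per lane: " + guaranteedExactProductsPerLane());
    System.out.println("exact after 1040 products: " + isExact(1040));
    System.out.println("exact after 1041 products: " + isExact(1041));
  }
}
```

With the 127 * 127 input, the lane stays exact through 1040 products and drifts at 1041, once the running sum passes 2^24. The same limit applies to the final `add` and `reduceLanes` steps if they stay in float, which is one argument for converting to an integer accumulator before reducing.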