BUG: core: fix ilp64 blas dot/vdot/... for strides > int32 max#17121
Conversation
There was a problem hiding this comment.
My machine requires more than this for the allocation.
EDIT: Needs 32 GiB
There was a problem hiding this comment.
I think you should leave out the multiplication by itemsize here.
There was a problem hiding this comment.
Maybe use int32_max + 2 instead,
There was a problem hiding this comment.
I think no: the BLAS incx = stride//itemsize, and that needs to overflow to trigger this bug.
I can use itemsize*(int32_max + 2) but the test does fail also with the +1.
Ah crap, obviously you're right. No idea what I was thinking.
Fix overlooked int cast when HAVE_BLAS_ILP64 is defined. It was supposed to cast to CBLAS_INT, not int. Also add a regression test. Move blas_stride() to npy_cblas.h Replace npy_is_aligned by modulo; we're going to call BLAS so no need to micro-optimize integer division here.
|
Updated. Obvious mistakes fixed, and moved blas_strides to npy_cblas.h. |
|
|
||
| /* | ||
| * Define a chunksize for CBLAS. CBLAS counts in integers. | ||
| */ |
There was a problem hiding this comment.
Should this definition move file too, since it's also blas-related?
There was a problem hiding this comment.
Looks like its used in exactly one file, and that file uses CBLAS_INT to store the result.
There was a problem hiding this comment.
Not sure. I thought the chunking was mostly related to our *_dot loops, so I left it here.
There was a problem hiding this comment.
I think we should move it, if you're happy to make another update.
Uses: https://github.com/numpy/numpy/search?q=NPY_CBLAS_CHUNK&unscoped_q=NPY_CBLAS_CHUNK
Note CBLAS_INT on the same line in all cases - so the callers are already including both headers (and the headers are internal not public)
There was a problem hiding this comment.
Moved + fixed the definition. No need to have it differ from intp max except when that's larger than CBLAS_INT max
There was a problem hiding this comment.
So just to be clear - this is a change for when HAVE_BLAS_ILP64 is defined and NPY_MAX_INTP > INT_MAX.
- Previously:
NPY_CBLAS_CHUNK = (NPY_MAX_INT64 / 2 + 1) - Now:
NPY_CBLAS_CHUNK = NPY_MAX_INTP
I don't have any understanding of the purpose of the /2 + 1, which makes me worry about removing it.
There was a problem hiding this comment.
Yes, it changes it to depend only on the CBLAS_INT maximum value. Switching to the smaller threshold for INTP>INT with ilp64 I think is a bug (but with no consequences, nobody has arrays of 2^62 elements). With 32-bit BLAS, there is no change here.
I suspect the INT_MAX/2 + 1 is chosen instead of INT_MAX because some 32-bit BLAS implementations have bugs that makes them fail if the number of elements gets too close to the maximum integer. However, for 64-bit blas, this is a theoretical concern as nobody can run BLAS with that large n.
There was a problem hiding this comment.
Actually, 46dd681 seems to clarify it's because that makes the chunksize a power of two.
So I think this is OK. (Restored the comment saying so back.)
There was a problem hiding this comment.
I think is a bug (but with no consequences, nobody has arrays of 2^62 elements)
However, for 64-bit blas, this is a theoretical concern as nobody can run BLAS with that large
n.
I think this was what I was missing. I'd add that the compiler can optimize out the comparison if we use NPY_MAX_INTP, which is an argument for leaving the bug in.
If you could pull in the power of two comment from that commit, that would be great. Either way, I'm no longer worried by this.
|
The error is unrelated. @mattip We have been getting web errors trying to fetch pypy, could you take a look? |
|
@charris: seems to have been a temporary outage. |
|
Thanks @pv |
Fix overlooked int cast when HAVE_BLAS_ILP64 is defined.
It was supposed to cast to CBLAS_INT, not int.
Also add a regression test.
Fixes #17111