Anyone happen to have historical data comparing how different #BLAS packages implement Extended #BLAS, and how those implementations perform?
I'm back on the grid, re-reading the spec (https://netlib.org/blas/blast-forum/chapter4.pdf), and I *think* there's nothing stopping me from writing an #EGEMM routine that uses the same underlying techniques @enp1s0 pointed out in their (his?) recent paper.
Partly because @steve gave a subtle nod of "it's not insane", I think it might work out well?