* vulkan: Optimize soft_max
Large soft_max could already saturate memory, but small/medium sizes were
pretty slow. The bulk of the gains for them comes from using a smaller
workgroup size, and making the workgroup size match the subgroup size also
makes the barriers much cheaper.
Cache some values in locals to avoid refetching/recomputing. And stamp
out a few "template instantiations" so smaller cases will fully unroll.
Add a missing early return for OOB rows. This happens when there are more
than 512 rows and the dispatch is 512 x H.
* vulkan: Further soft_max optimizations
Restore the workgroup size of 512 case, use it for >1024.
Use unrollable loops for more iteration counts.
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses
the B loads across the rows and also reuses some addressing calculations.
This required manually partially unrolling the loop, since the compiler
is less willing to unroll outer loops.
Add bounds-checking on the last iteration of the loop. I think this was at
least partly broken before.
Optimize the Q4_K shader to vectorize most loads and reduce the number of
bit twiddling instructions.