llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-14 14:59:52 +00:00

Author	SHA1	Message	Date
Johannes Gäßler	963552903f	CUDA: fix broken oob check for FA vec f32 kernel (#7904)	2024-06-12 17:41:51 +02:00
Johannes Gäßler	e141ce624a	Fix FlashAttention debug test, FP32 assert (#7684)	2024-06-01 23:26:10 +02:00
Johannes Gäßler	750f60c03e	CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (#7681)	2024-06-01 15:47:04 +02:00
Johannes Gäßler	9b596417af	CUDA: quantized KV support for FA vec (#7527) * CUDA: quantized KV support for FA vec * try CI fix * fix commented-out kernel variants * add q8_0 q4_0 tests * fix nwarps > batch size * split fattn compile via extern templates * fix flake8 * fix metal tests * fix cmake * make generate_cu_files.py executable * add autogenerated .cu files * fix AMD * error if type_v != FP16 and not flash_attn * remove obsolete code	2024-06-01 08:44:14 +02:00
Johannes Gäßler	dc685be466	CUDA: add FP32 FlashAttention vector kernel (#7188) * CUDA: add FP32 FlashAttention vector kernel * fixup! CUDA: add FP32 FlashAttention vector kernel * fixup! fixup! CUDA: add FP32 FlashAttention vector kernel * fixup! fixup! fixup! CUDA: add FP32 FlashAttention vector kernel	2024-05-12 19:40:45 +02:00