* build: support ppc64le build for make and CMake
* build: keep __POWER9_VECTOR__ ifdef and extend with __powerpc64__
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add detection code for avx
* Only check hardware when option is ON
* Modify per code review sugguestions
* Build locally will detect CPU
* Fixes CMake style to use lowercase like everywhere else
* cleanup
* fix merge
* linux/gcc version for testing
* msvc combines avx2 and fma into /arch:AVX2 so check for both
* cleanup
* msvc only version
* style
* Update FindSIMD.cmake
---------
Co-authored-by: Howard Su <howard0su@gmail.com>
Co-authored-by: Jeremy Dunn <jeremydunn123@gmail.com>
I wrote the mat*mat shaders from scratch so I understand them better but
they are currently not faster than just multiply-invoking the mat*vec
shaders, by a significant degree - so, except for f32 which needed a new
shader, revert to the m*v ones here.
* cmake : fix build when .git does not exist
* cmake : simplify BUILD_INFO target
* cmake : add missing dependencies on BUILD_INFO
* build : link against build info instead of compiling against it
* zig : make build info a .cpp source instead of a header
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* cmake : revert change to CMP0115
---------
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
* cmake : add helper for faster CUDA builds
* batched : add NGL arg
* ggml : skip nops in compute_forward
* cuda : minor indentation
* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)
* Apply suggestions from code review
These changes plus:
```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```
are needed to compile with ROCM. I haven't done performance testing, but it seems to work.
I couldn't figure out how to propose a change for lines outside what the pull changed, also this is the first time trying to create a multi-part review so please forgive me if I mess something up.
* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define
* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases
* cuda : reduce mallocs in cublasGemmBatchedEx branch
* cuda : add TODO for calling cublas from kernel + using mem pool
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
* fix LLAMA_NATIVE
* syntax
* alternate implementation
* my eyes must be getting bad...
* set cmake LLAMA_NATIVE=ON by default
* march=native doesn't work for ios/tvos, so disable for those targets. also see what happens if we use it on msvc
* revert 8283237 and only allow LLAMA_NATIVE on x86 like the Makefile
* remove -DLLAMA_MPI=ON
---------
Co-authored-by: netrunnereve <netrunnereve@users.noreply.github.com>
* Keep static libs and headers with install
* Add logic to generate Config package
* Use proper build info
* Add llama as import library
* Prefix target with package name
* Add example project using CMake package
* Update README
* Update README
* Remove trailing whitespace