Marko Tasic
e4124c2477
readme : add JavaScript/Wasm repo ( #5415 )
2024-02-09 12:17:00 +02:00
Ebey Abraham
8c933b70c2
fix typo in readme ( #5399 )
...
Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
2024-02-07 22:11:30 +01:00
Kamil Tomšík
b906596bb7
Add Ava in the list of llama.cpp UIs ( #4362 )
2024-02-07 13:44:52 -05:00
Eve
ed0bf32290
readme : modernize ( #5379 )
...
* first cleanup, update everything to Llama 2 and remove outdated content
* Delete SHA256SUMS
* make build instructions generic
* recommend Q4_K_M quantization method
* Update README.md
2024-02-07 08:21:30 +02:00
Ben Williams
9a697d842b
readme : update ui list ( #5354 )
2024-02-07 08:16:48 +02:00
Kawrakow
b08f22c882
Update README.md ( #5366 )
...
Add some links to quantization related PRs
2024-02-06 19:00:16 +02:00
BarfingLemurs
2e9c0bd6b3
readme : add phi, orion 14b, internlm2, and yi-VL to readme ( #5362 )
2024-02-06 16:06:48 +02:00
Johannes Gäßler
78b00dda6c
README: updated introduction ( #5343 )
...
* README: updated introduction
* readme : update
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-05 15:55:10 +01:00
chiranko
5d55b0cd82
readme : add CodeShell models to the supported models list ( #5330 )
2024-02-05 09:41:38 +02:00
BADR
6a66c5071a
readme : add tenere in the ui tools list ( #5284 )
2024-02-03 13:20:26 +02:00
Xuan Son Nguyen
6b91b1e0a9
docker : add build for SYCL, Vulkan + update readme ( #5228 )
...
* add vulkan dockerfile
* intel dockerfile: compile sycl by default
* fix vulkan dockerfile
* add docs for vulkan
* docs: sycl build in docker
* docs: remove trailing spaces
* docs: sycl: add docker section
* docs: clarify install vulkan SDK outside docker
* sycl: use intel/oneapi-basekit docker image
* docs: correct TOC
* docs: correct docker image for Intel oneMKL
2024-02-02 09:56:31 +02:00
Georgi Gerganov
5cb04dbc16
llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD ( #5240 )
...
* llama : remove LLAMA_MAX_DEVICES from llama.h
ggml-ci
* Update llama.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* server : remove LLAMA_MAX_DEVICES
ggml-ci
* llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD
ggml-ci
* train : remove LLAMA_SUPPORTS_GPU_OFFLOAD
* readme : add deprecation notice
* readme : change deprecation notice to "remove" and fix url
* llama : remove gpu includes from llama.h
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-01-31 17:30:17 +02:00
Neo Zhang Jianyu
01684139c3
support SYCL backend windows build ( #5208 )
...
* support SYCL backend windows build
* add windows build in CI
* add for win build CI
* correct install oneMKL
* fix install issue
* fix ci
* fix install cmd
* fix install cmd
* fix install cmd
* fix install cmd
* fix install cmd
* fix win build
* fix win build
* fix win build
* restore other CI part
* restore as base
* rm no new line
* fix no new line issue, add -j
* fix grammer issue
* allow to trigger manually, fix format issue
* fix format
* add newline
* fix format
* fix format
* fix format issuse
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-01-31 08:08:07 +05:30
Romain Neutron
5589921ef8
readme : minor ( #5204 )
...
This is about tuning the code formatting of the README file
2024-01-30 11:16:38 +02:00
Georgi Gerganov
49f44b5c55
readme : update hot topics
2024-01-30 11:14:44 +02:00
Abhilash Majumder
0f648573dd
ggml : add unified SYCL backend for Intel GPUs ( #2690 )
...
* first update for migration
* update init_cublas
* add debug functio, commit all help code
* step 1
* step 2
* step3 add fp16, slower 31->28
* add GGML_LIST_DEVICE function
* step 5 format device and print
* step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue
* support main device is non-zero
* step7 add debug for code path, rm log
* step 8, rename all macro & func from cuda by sycl
* fix error of select non-zero device, format device list
* ren ggml-sycl.hpp -> ggml-sycl.h
* clear CMAKE to rm unused lib and options
* correct queue: rm dtct:get_queue
* add print tensor function to debug
* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481
* summary dpct definition in one header file to replace folder:dpct
* refactor device log
* mv dpct definition from folder dpct to ggml-sycl.h
* update readme, refactor build script
* fix build with sycl
* set nthread=1 when sycl, increase performance
* add run script, comment debug code
* add ls-sycl-device tool
* add ls-sycl-device, rm unused files
* rm rear space
* dos2unix
* Update README_sycl.md
* fix return type
* remove sycl version from include path
* restore rm code to fix hang issue
* add syc and link for sycl readme
* rm original sycl code before refactor
* fix code err
* add know issue for pvc hang issue
* enable SYCL_F16 support
* align pr4766
* check for sycl blas, better performance
* cleanup 1
* remove extra endif
* add build&run script, clean CMakefile, update guide by review comments
* rename macro to intel hardware
* editor config format
* format fixes
* format fixes
* editor format fix
* Remove unused headers
* skip build sycl tool for other code path
* replace tab by space
* fix blas matmul function
* fix mac build
* restore hip dependency
* fix conflict
* ren as review comments
* mv internal function to .cpp file
* export funciton print_sycl_devices(), mv class dpct definition to source file
* update CI/action for sycl code, fix CI error of repeat/dup
* fix action ID format issue
* rm unused strategy
* enable llama_f16 in ci
* fix conflict
* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml
* fix ci cases for unsupported data type
* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL
* revert hip cmake changes
* fix indent
* add prefix in func name
* revert no mmq
* rm cpu blas duplicate
* fix no_new_line
* fix src1->type==F16 bug.
* pass batch offset for F16 src1
* fix batch error
* fix wrong code
* revert sycl checking in test-sampling
* pass void as arguments of ggml_backend_sycl_print_sycl_devices
* remove extra blank line in test-sampling
* revert setting n_threads in sycl
* implement std::isinf for icpx with fast math.
* Update ci/run.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add copyright and MIT license declare
* update the cmd example
---------
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 17:56:23 +02:00
Marcus Dunn
af4980bfed
readme : add link to rust bindings ( #5148 )
...
* added link to another set of rust bindings with brief note on differences.
* fixed link name
2024-01-28 10:30:44 +02:00
Kyle Mistele
39baaf55a1
docker : add server-first container images ( #5157 )
...
* feat: add Dockerfiles for each platform that user ./server instead of ./main
* feat: update .github/workflows/docker.yml to build server-first docker containers
* doc: add information about running the server with Docker to README.md
* doc: add information about running with docker to the server README
* doc: update n-gpu-layers to show correct GPU usage
* fix(doc): update container tag from `server` to `server-cuda` for README example on running server container with CUDA
2024-01-28 09:55:31 +02:00
Georgi Gerganov
aad0b01d73
readme : update hot topics
2024-01-26 10:52:33 +02:00
XiaotaoChen
fe54033b69
readme : add MobileVLM 1.7B/3B to the supported models list ( #5107 )
...
Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
2024-01-25 22:14:32 +02:00
adel boussaken
48e2b13372
Add a dart/flutter binding to README.md ( #4882 )
2024-01-20 03:05:43 -05:00
iohub
18adb4e9bb
readme : add 3rd party collama reference to UI list ( #4840 )
...
Add a VSCode extension for llama.cpp reference to UI list
2024-01-09 18:45:54 +02:00
Georgi Gerganov
a9a8c5de3d
readme : add link to SOTA models
2024-01-08 20:25:17 +02:00
Lars Grammel
b7e7982953
readme : add lgrammel/modelfusion JS/TS client for llama.cpp ( #4814 )
2024-01-07 22:24:11 +02:00
automaticcat
24a447e20a
ggml : add ggml_cpu_has_avx_vnni() ( #4589 )
...
* feat: add avx_vnni based on intel documents
* ggml: add avx vnni based on intel document
* llama: add avx vnni information display
* docs: add more details about using oneMKL and oneAPI for intel processors
* docs: add more details about using oneMKL and oneAPI for intel processors
* docs: add more details about using oneMKL and oneAPI for intel processors
* docs: add more details about using oneMKL and oneAPI for intel processors
* docs: add more details about using oneMKL and oneAPI for intel processors
* Update ggml.c
Fix indentation upgate
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-30 10:07:48 +02:00
manikbhandari
ea5497df5d
gpt2 : Add gpt2 architecture integration ( #4555 )
2023-12-28 15:03:57 +01:00
Paul Tsochantaris
a206137f92
Adding Emeltal reference to UI list ( #4629 )
2023-12-25 18:09:53 +02:00
Shintarou Okada
753be377b6
llama : add PLaMo model ( #3557 )
...
* add plamo mock
* add tensor loading
* plamo convert
* update norm
* able to compile
* fix norm_rms_eps hparam
* runnable
* use inp_pos
* seems ok
* update kqv code
* remove develop code
* update README
* shuffle attn_q.weight and attn_output.weight for broadcasting
* remove plamo_llm_build_kqv and use llm_build_kqv
* fix style
* update
* llama : remove obsolete KQ_scale
* plamo : fix tensor names for correct GPU offload
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-24 15:35:49 +02:00
FantasyGmm
a55876955b
cuda : fix jetson compile error ( #4560 )
...
* fix old jetson compile error
* Update Makefile
* update jetson detect and cuda version detect
* update cuda marco define
* update makefile and cuda,fix some issue
* Update README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update Makefile
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-22 17:11:12 +02:00
Michael Kesper
28cb35a0ec
make : add LLAMA_HIP_UMA option ( #4587 )
...
NB: LLAMA_HIP_UMA=1 (or any value) adds MK_CPPFLAG -DGGML_HIP_UMA
2023-12-22 10:03:25 +02:00
Deins
2bb98279c5
readme : add zig bindings ( #4581 )
2023-12-22 08:49:54 +02:00
Erik Garrison
0f630fbc92
cuda : ROCm AMD Unified Memory Architecture (UMA) handling ( #4449 )
...
* AMD ROCm: handle UMA memory VRAM expansions
This resolves #2797 by allowing ROCm AMD GPU users with a UMA to
dynamically expand the VRAM allocated to the GPU.
Without this, AMD ROCm users with shared CPU/GPU memory usually are
stuck with the BIOS-set (or fixed) framebuffer VRAM, making it
impossible to load more than 1-2 layers.
Note that the model is duplicated in RAM because it's loaded once for
the CPU and then copied into a second set of allocations that are
managed by the HIP UMA system. We can fix this later.
* clarify build process for ROCm on linux with cmake
* avoid using deprecated ROCm hipMallocHost
* keep simplifying the change required for UMA
* cmake: enable UMA-compatible allocation when LLAMA_HIP_UMA=ON
2023-12-21 21:45:32 +02:00
Georgi Gerganov
c083718c89
readme : update coding guidelines
2023-12-21 19:27:14 +02:00
Georgi Gerganov
b1306c4394
readme : update hot topics
2023-12-17 20:16:23 +02:00
BarfingLemurs
0353a18401
readme : update supported model list ( #4457 )
2023-12-14 09:38:49 +02:00
Georgi Gerganov
113f9942fc
readme : update hot topics
2023-12-13 14:05:38 +02:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache ( #4309 )
...
* per-layer KV
* remove unnecessary copies
* less code duplication, offload k and v separately
* llama : offload KV cache per-layer
* llama : offload K shift tensors
* llama : offload for rest of the model arches
* llama : enable offload debug temporarily
* llama : keep the KV related layers on the device
* llama : remove mirrors, perform Device -> Host when partial offload
* common : add command-line arg to disable KV cache offloading
* llama : update session save/load
* llama : support quantum K cache (#4312 )
* llama : support quantum K cache (wip)
* metal : add F32 -> Q8_0 copy kernel
* cuda : add F32 -> Q8_0 copy kernel
ggml-ci
* cuda : use mmv kernel for quantum cache ops
* llama : pass KV cache type through API
* llama : fix build
ggml-ci
* metal : add F32 -> Q4_0 copy kernel
* metal : add F32 -> Q4_1 copy kernel
* cuda : wip
* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels
* llama-bench : support type_k/type_v
* metal : use mm kernel only for quantum KV cache
* cuda : add comment
* llama : remove memory_f16 and kv_f16 flags
---------
Co-authored-by: slaren <slarengh@gmail.com>
* readme : add API change notice
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
vodkaslime
524907aa76
readme : fix ( #4135 )
...
* fix: readme
* chore: resolve comments
* chore: resolve comments
2023-11-30 23:49:21 +02:00
Dawid Wysocki
74daabae69
readme : fix typo ( #4253 )
...
llama.cpp uses GitHub Actions, not Gitlab Actions.
2023-11-30 23:43:32 +02:00
Peter Sugihara
4fea3420ee
readme : add FreeChat ( #4248 )
2023-11-29 09:16:34 +02:00
Kasumi
0dab8cd7cc
readme : add Amica to UI list ( #4230 )
2023-11-27 19:39:42 +02:00
Georgi Gerganov
9656026b53
readme : update hot topics
2023-11-26 20:42:51 +02:00
Georgi Gerganov
04814e718e
readme : update hot topics
2023-11-25 12:02:13 +02:00
Aaryaman Vasishta
b35f3d0def
readme : use PATH for Windows ROCm ( #4195 )
...
* Update README.md to use PATH for Windows ROCm
* Update README.md
* Update README.md
2023-11-24 09:52:39 +02:00
Georgi Gerganov
d103d935c0
readme : update hot topics
2023-11-23 13:51:22 +02:00
Aaryaman Vasishta
dfc7cd48b1
readme : update ROCm Windows instructions ( #4122 )
...
* Update README.md
* Update README.md
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2023-11-20 17:02:46 +02:00
Galunid
36eed0c42c
stablelm : StableLM support ( #3586 )
...
* Add support for stablelm-3b-4e1t
* Supports GPU offloading of (n-1) layers
2023-11-14 11:17:12 +01:00
Georgi Gerganov
c049b37d7b
readme : update hot topics
2023-11-13 14:18:08 +02:00
Richard Kiss
532dd74e38
Fix some documentation typos/grammar mistakes ( #4032 )
...
* typos
* Update examples/parallel/README.md
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-11-11 23:04:58 -07:00
Georgi Gerganov
224e7d5b14
readme : add notice about #3912
2023-11-02 20:44:12 +02:00
Ian Scrivener
5a42a5f8e8
readme : remove unsupported node.js library ( #3703 )
...
- https://github.com/Atome-FE/llama-node is quite out of date
- doesn't support recent/current llama.cpp functionality
2023-10-22 21:16:43 +03:00
Georgi Gerganov
d1031cf49c
sampling : refactor init to use llama_sampling_params ( #3696 )
...
* sampling : refactor init to use llama_sampling_params
* llama : combine repetition, frequency and presence penalties in 1 call
* examples : remove embd-input and gptneox-wip
* sampling : rename penalty params + reduce size of "prev" vector
* sampling : add llama_sampling_print helper
* sampling : hide prev behind API and apply #3661
ggml-ci
2023-10-20 21:07:23 +03:00
Georgi Gerganov
004797f6ac
readme : update hot topics
2023-10-18 21:44:43 +03:00
BarfingLemurs
8402566a7c
readme : update hot-topics & models, detail windows release in usage ( #3615 )
...
* Update README.md
* Update README.md
* Update README.md
* move "Running on Windows" section below "Prepare data and run"
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-17 21:13:21 +03:00
ldwang
5fe268a4d9
readme : add Aquila2 links ( #3610 )
...
Signed-off-by: ldwang <ftgreat@gmail.com>
Co-authored-by: ldwang <ftgreat@gmail.com>
2023-10-17 18:52:33 +03:00
Ian Scrivener
f3040beaab
typo : it is --n-gpu-layers
not --gpu-layers
( #3592 )
...
fixed a typo in the MacOS Metal run doco
2023-10-12 14:10:50 +03:00
Galunid
9f6ede19f3
Add MPT model to supported models in README.md ( #3574 )
2023-10-10 19:02:49 -04:00
Xingchen Song(宋星辰)
c5b49360d0
readme : add bloom ( #3570 )
2023-10-10 19:28:50 +03:00
BarfingLemurs
1faaae8c2b
readme : update models, cuda + ppl instructions ( #3510 )
2023-10-06 22:13:36 +03:00
Georgi Gerganov
beabc8cfb0
readme : add project status link
2023-10-04 16:50:44 +03:00
slaren
40e07a60f9
llama.cpp : add documentation about rope_freq_base and scale values ( #3401 )
...
* llama.cpp : add documentation about rope_freq_base and scale values
* add notice to hot topics
2023-09-29 18:42:32 +02:00
BarfingLemurs
0a4a4a0982
readme : update hot topics + model links ( #3399 )
2023-09-29 15:50:35 +03:00
Andrew Duffy
569550df20
readme : add link to grammars app ( #3388 )
...
* Add link to grammars app per @ggernagov suggestion
Adding a sentence in the Grammars section of README to point to grammar app, per https://github.com/ggerganov/llama.cpp/discussions/2494#discussioncomment-7138211
* Update README.md
2023-09-29 14:15:57 +03:00
Pierre Alexandre SCHEMBRI
4aea3b846e
readme : add Mistral AI release 0.1 ( #3362 )
2023-09-28 15:13:37 +03:00
BarfingLemurs
ffe88a36a9
readme : add some recent perplexity and bpw measurements to READMES, link for k-quants ( #3340 )
...
* Update README.md
* Update README.md
* Update README.md with k-quants bpw measurements
2023-09-27 18:30:36 +03:00
2f38b454
1726f9626f
docs: Fix typo CLBlast_DIR var. ( #3330 )
2023-09-25 20:24:52 +02:00
Lee Drake
bc9d3e3971
Update README.md ( #3289 )
...
* Update README.md
* Update README.md
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-09-21 21:00:24 +02:00
Georgi Gerganov
7eb41179ed
readme : update hot topics
2023-09-20 20:48:22 +03:00
Johannes Gäßler
111163e246
CUDA: enable peer access between devices ( #2470 )
2023-09-17 16:37:53 +02:00
dylan
980ab41afb
docker : add gpu image CI builds ( #3103 )
...
Enables the GPU enabled container images to be built and pushed
alongside the CPU containers.
Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
2023-09-14 19:47:00 +03:00
Ikko Eltociear Ashimine
7d99aca759
readme : fix typo ( #3043 )
...
* readme : fix typo
acceleation -> acceleration
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-08 19:04:32 +03:00
Georgi Gerganov
94f10b91ed
readme : update hot tpoics
2023-09-08 18:18:04 +03:00
Yui
6ff712a6d1
Update deprecated GGML TheBloke links to GGUF ( #3079 )
2023-09-08 12:32:55 +02:00
Georgi Gerganov
e36ecdccc8
build : on Mac OS enable Metal by default ( #2901 )
...
* build : on Mac OS enable Metal by default
* make : try to fix build on Linux
* make : move targets back to the top
* make : fix target clean
* llama : enable GPU inference by default with Metal
* llama : fix vocab_only logic when GPU is enabled
* common : better `n_gpu_layers` assignment
* readme : update Metal instructions
* make : fix merge conflict remnants
* gitignore : metal
2023-09-04 22:26:24 +03:00
Ido S
340af42f09
docs : add catai
to README.md
( #2967 )
2023-09-03 08:50:51 +03:00
bandoti
52315a4216
readme : update clblast instructions ( #2903 )
...
* Update Windows CLBlast instructions
* Update Windows CLBlast instructions
* Remove trailing whitespace
2023-09-02 15:53:18 +03:00
Konstantin Herud
49bb9cbe0f
docs : add java-llama.cpp to README.md ( #2935 )
2023-09-01 16:36:14 +03:00
Gilad S
35092fb547
docs : add node-llama-cpp
to README.md
( #2885 )
2023-08-30 11:40:12 +03:00
slaren
c03a243abf
remove outdated references to -eps and -gqa from README ( #2881 )
2023-08-29 23:17:34 +02:00
Jhen-Jie Hong
74e0caeb82
readme : add react-native binding ( #2869 )
2023-08-29 12:30:10 +03:00
Georgi Gerganov
da7455d046
readme : fix headings
2023-08-27 15:52:34 +03:00
Georgi Gerganov
c48c5bb0b0
readme : update hot topics
2023-08-27 14:44:35 +03:00
Henri Vasserman
6bbc598a63
ROCm Port ( #1087 )
...
* use hipblas based on cublas
* Update Makefile for the Cuda kernels
* Expand arch list and make it overrideable
* Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5 )
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non CUDA/HIP
---------
Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>
2023-08-25 12:09:42 +03:00
Georgi Gerganov
44d5462b5c
readme : fix link
2023-08-23 23:44:19 +03:00
Georgi Gerganov
c7868b0753
minor : fix trailing whitespace
2023-08-23 23:43:00 +03:00
Georgi Gerganov
79da24b58c
readme : update hot topics
2023-08-23 23:41:16 +03:00
Evan Jones
f5fe98d11b
docs : add grammar docs ( #2701 )
...
* docs : add grammar docs
* tweaks to grammar guide
* rework GBNF example to be a commented grammar
2023-08-22 21:01:57 -04:00
Georgi Gerganov
6381d4e110
gguf : new file format with flexible meta data (beta) ( #2398 )
...
* gguf : first API pass
* gguf : read header + meta data
* gguf : read tensor info
* gguf : initial model loading - not tested
* gguf : add gguf_get_tensor_name()
* gguf : do not support passing existing ggml_context to gguf_init
* gguf : simplify gguf_get_val
* gguf : gguf.c is now part of ggml.c
* gguf : read / write sample models
* gguf : add comments
* refactor : reduce code duplication and better API (#2415 )
* gguf : expose the gguf_type enum through the API for now
* gguf : add array support
* gguf.py : some code style changes
* convert.py : start a new simplified implementation by removing old stuff
* convert.py : remove GGML vocab + other obsolete stuff
* GGUF : write tensor (#2426 )
* WIP: Write tensor
* GGUF : Support writing tensors in Python
* refactor : rm unused import and upd todos
* fix : fix errors upd writing example
* rm example.gguf
* gitignore *.gguf
* undo formatting
* gguf : add gguf_find_key (#2438 )
* gguf.cpp : find key example
* ggml.h : add gguf_find_key
* ggml.c : add gguf_find_key
* gguf : fix writing tensors
* gguf : do not hardcode tensor names to read
* gguf : write sample tensors to read
* gguf : add tokenization constants
* quick and dirty conversion example
* gguf : fix writing gguf arrays
* gguf : write tensors one by one and code reuse
* gguf : fix writing gguf arrays
* gguf : write tensors one by one
* gguf : write tensors one by one
* gguf : write tokenizer data
* gguf : upd gguf conversion script
* Update convert-llama-h5-to-gguf.py
* gguf : handle already encoded string
* ggml.h : get array str and f32
* ggml.c : get arr str and f32
* gguf.py : support any type
* Update convert-llama-h5-to-gguf.py
* gguf : fix set is not subscriptable
* gguf : update convert-llama-h5-to-gguf.py
* constants.py : add layer norm eps
* gguf.py : add layer norm eps and merges
* ggml.h : increase GGML_MAX_NAME to 64
* ggml.c : add gguf_get_arr_n
* Update convert-llama-h5-to-gguf.py
* add gptneox gguf example
* Makefile : add gptneox gguf example
* Update convert-llama-h5-to-gguf.py
* add gptneox gguf example
* Update convert-llama-h5-to-gguf.py
* Update convert-gptneox-h5-to-gguf.py
* Update convert-gptneox-h5-to-gguf.py
* Update convert-llama-h5-to-gguf.py
* gguf : support custom alignment value
* gguf : fix typo in function call
* gguf : mmap tensor data example
* fix : update convert-llama-h5-to-gguf.py
* Update convert-llama-h5-to-gguf.py
* convert-gptneox-h5-to-gguf.py : Special tokens
* gptneox-main.cpp : special tokens
* Update gptneox-main.cpp
* constants.py : special tokens
* gguf.py : accumulate kv and tensor info data + special tokens
* convert-gptneox-h5-to-gguf.py : accumulate kv and ti + special tokens
* gguf : gguf counterpart of llama-util.h
* gguf-util.h : update note
* convert-llama-h5-to-gguf.py : accumulate kv / ti + special tokens
* convert-llama-h5-to-gguf.py : special tokens
* Delete gptneox-common.cpp
* Delete gptneox-common.h
* convert-gptneox-h5-to-gguf.py : gpt2bpe tokenizer
* gptneox-main.cpp : gpt2 bpe tokenizer
* gpt2 bpe tokenizer (handles merges and unicode)
* Makefile : remove gptneox-common
* gguf.py : bytesarray for gpt2bpe tokenizer
* cmpnct_gpt2bpe.hpp : comments
* gguf.py : use custom alignment if present
* gguf : minor stuff
* Update gptneox-main.cpp
* map tensor names
* convert-gptneox-h5-to-gguf.py : map tensor names
* convert-llama-h5-to-gguf.py : map tensor names
* gptneox-main.cpp : map tensor names
* gguf : start implementing libllama in GGUF (WIP)
* gguf : start implementing libllama in GGUF (WIP)
* rm binary commited by mistake
* upd .gitignore
* gguf : calculate n_mult
* gguf : inference with 7B model working (WIP)
* gguf : rm deprecated function
* gguf : start implementing gguf_file_saver (WIP)
* gguf : start implementing gguf_file_saver (WIP)
* gguf : start implementing gguf_file_saver (WIP)
* gguf : add gguf_get_kv_type
* gguf : add gguf_get_kv_type
* gguf : write metadata in gguf_file_saver (WIP)
* gguf : write metadata in gguf_file_saver (WIP)
* gguf : write metadata in gguf_file_saver
* gguf : rm references to old file formats
* gguf : shorter name for member variable
* gguf : rm redundant method
* gguf : get rid of n_mult, read n_ff from file
* Update gguf_tensor_map.py
* Update gptneox-main.cpp
* gguf : rm references to old file magics
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : quantization is working
* gguf : roper closing of file
* gguf.py : no need to convert tensors twice
* convert-gptneox-h5-to-gguf.py : no need to convert tensors twice
* convert-llama-h5-to-gguf.py : no need to convert tensors twice
* convert-gptneox-h5-to-gguf.py : simplify nbytes
* convert-llama-h5-to-gguf.py : simplify nbytes
* gptneox-main.cpp : n_layer --> n_block
* constants.py : n_layer --> n_block
* gguf.py : n_layer --> n_block
* convert-gptneox-h5-to-gguf.py : n_layer --> n_block
* convert-llama-h5-to-gguf.py : n_layer --> n_block
* gptneox-main.cpp : n_layer --> n_block
* Update gguf_tensor_map.py
* convert-gptneox-h5-to-gguf.py : load model in parts to save memory
* convert-llama-h5-to-gguf.py : load model in parts to save memory
* convert : write more metadata for LLaMA
* convert : rm quantization version
* convert-gptneox-h5-to-gguf.py : add file_type key
* gptneox-main.cpp : add file_type key
* fix conflicts
* gguf : add todos and comments
* convert-gptneox-h5-to-gguf.py : tensor name map changes
* Create gguf_namemap.py : tensor name map changes
* Delete gguf_tensor_map.py
* gptneox-main.cpp : tensor name map changes
* convert-llama-h5-to-gguf.py : fixes
* gguf.py : dont add empty strings
* simple : minor style changes
* gguf : use UNIX line ending
* Create convert-llama-7b-pth-to-gguf.py
* llama : sync gguf-llama.cpp with latest llama.cpp (#2608 )
* llama : sync gguf-llama.cpp with latest llama.cpp
* minor : indentation + assert
* llama : refactor gguf_buffer and gguf_ctx_buffer
* llama : minor
* gitignore : add gptneox-main
* llama : tokenizer fixes (#2549 )
* Merge tokenizer fixes into the gguf branch.
* Add test vocabularies
* convert : update convert-new.py with tokenizer fixes (#2614 )
* Merge tokenizer fixes into the gguf branch.
* Add test vocabularies
* Adapt convert-new.py (and fix a clang-cl compiler error on windows)
* llama : sync gguf-llama with llama (#2613 )
* llama : sync gguf-llama with llama
* tests : fix build + warnings (test-tokenizer-1 still fails)
* tests : fix wstring_convert
* convert : fix layer names
* llama : sync gguf-llama.cpp
* convert : update HF converter to new tokenizer voodoo magics
* llama : update tokenizer style
* convert-llama-h5-to-gguf.py : add token types
* constants.py : add token types
* gguf.py : add token types
* convert-llama-7b-pth-to-gguf.py : add token types
* gguf-llama.cpp : fix n_head_kv
* convert-llama-h5-to-gguf.py : add 70b gqa support
* gguf.py : add tensor data layout
* convert-llama-h5-to-gguf.py : add tensor data layout
* convert-llama-7b-pth-to-gguf.py : add tensor data layout
* gptneox-main.cpp : add tensor data layout
* convert-llama-h5-to-gguf.py : clarify the reverse permute
* llama : refactor model loading code (#2620 )
* llama : style formatting + remove helper methods
* llama : fix quantization using gguf tool
* llama : simplify gguf_file_saver
* llama : fix method names
* llama : simplify write_header()
* llama : no need to pass full file loader to the file saver
just gguf_ctx
* llama : gguf_file_saver write I32
* llama : refactor tensor names (#2622 )
* gguf: update tensor names searched in quantization
* gguf : define tensor names as constants
* gguf : initial write API (not tested yet)
* gguf : write to file API (not tested)
* gguf : initial write API ready + example
* gguf : fix header write
* gguf : fixes + simplify example + add ggml_nbytes_pad()
* gguf : minor
* llama : replace gguf_file_saver with new gguf write API
* gguf : streaming support when writing files
* gguf : remove oboslete write methods
* gguf : remove obosolete gguf_get_arr_xxx API
* llama : simplify gguf_file_loader
* llama : move hparams and vocab from gguf_file_loader to llama_model_loader
* llama : merge gguf-util.h in llama.cpp
* llama : reorder definitions in .cpp to match .h
* llama : minor simplifications
* llama : refactor llama_model_loader (WIP)
wip : remove ggml_ctx from llama_model_loader
wip : merge gguf_file_loader in llama_model_loader
* llama : fix shape prints
* llama : fix Windows build + fix norm_rms_eps key
* llama : throw error on missing KV paris in model meta data
* llama : improve printing + log meta data
* llama : switch print order of meta data
---------
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
* gguf : deduplicate (#2629 )
* gguf : better type names
* dedup : CPU + Metal is working
* ggml : fix warnings about unused results
* llama.cpp : fix line feed and compiler warning
* llama : fix strncpy warning + note token_to_str does not write null
* llama : restore the original load/save session implementation
Will migrate this to GGUF in the future
* convert-llama-h5-to-gguf.py : support alt ctx param name
* ggml : assert when using ggml_mul with non-F32 src1
* examples : dedup simple
---------
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
* gguf.py : merge all files in gguf.py
* convert-new.py : pick #2427 for HF 70B support
* examples/gguf : no need to keep q option for quantization any more
* llama.cpp : print actual model size
* llama.cpp : use ggml_elements()
* convert-new.py : output gguf (#2635 )
* convert-new.py : output gguf (WIP)
* convert-new.py : add gguf key-value pairs
* llama : add hparams.ctx_train + no longer print ftype
* convert-new.py : minor fixes
* convert-new.py : vocab-only option should work now
* llama : fix tokenizer to use llama_char_to_byte
* tests : add new ggml-vocab-llama.gguf
* convert-new.py : tensor name mapping
* convert-new.py : add map for skipping tensor serialization
* convert-new.py : convert script now works
* gguf.py : pick some of the refactoring from #2644
* convert-new.py : minor fixes
* convert.py : update to support GGUF output
* Revert "ci : disable CI temporary to not waste energy"
This reverts commit 7e82d25f40
.
* convert.py : n_head_kv optional and .gguf file extension
* convert.py : better always have n_head_kv and default it to n_head
* llama : sync with recent PRs on master
* editorconfig : ignore models folder
ggml-ci
* ci : update ".bin" to ".gguf" extension
ggml-ci
* llama : fix llama_model_loader memory leak
* gptneox : move as a WIP example
* llama : fix lambda capture
ggml-ci
* ggml : fix bug in gguf_set_kv
ggml-ci
* common.h : .bin --> .gguf
* quantize-stats.cpp : .bin --> .gguf
* convert.py : fix HF tensor permuting / unpacking
ggml-ci
* llama.cpp : typo
* llama : throw error if gguf fails to init from file
ggml-ci
* llama : fix tensor name grepping during quantization
ggml-ci
* gguf.py : write tensors in a single pass (#2644 )
* gguf : single pass for writing tensors + refactoring writer
* gguf : single pass for writing tensors + refactoring writer
* gguf : single pass for writing tensors + refactoring writer
* gguf : style fixes in simple conversion script
* gguf : refactor gptneox conversion script
* gguf : rename h5 to hf (for HuggingFace)
* gguf : refactor pth to gguf conversion script
* gguf : rm file_type key and method
* gguf.py : fix vertical alignment
* gguf.py : indentation
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* convert-gptneox-hf-to-gguf.py : fixes
* gguf.py : gptneox mapping
* convert-llama-hf-to-gguf.py : fixes
* convert-llama-7b-pth-to-gguf.py : fixes
* ggml.h : reverse GGUF_MAGIC
* gguf.py : reverse GGUF_MAGIC
* test-tokenizer-0.cpp : fix warning
* llama.cpp : print kv general.name
* llama.cpp : get special token kv and linefeed token id
* llama : print number of tensors per type + print arch + style
* tests : update vocab file with new magic
* editorconfig : fix whitespaces
* llama : re-order functions
* llama : remove C++ API + reorganize common source in /common dir
* llama : minor API updates
* llama : avoid hardcoded special tokens
* llama : fix MPI build
ggml-ci
* llama : introduce enum llama_vocab_type + remove hardcoded string constants
* convert-falcon-hf-to-gguf.py : falcon HF --> gguf conversion, not tested
* falcon-main.cpp : falcon inference example
* convert-falcon-hf-to-gguf.py : remove extra kv
* convert-gptneox-hf-to-gguf.py : remove extra kv
* convert-llama-7b-pth-to-gguf.py : remove extra kv
* convert-llama-hf-to-gguf.py : remove extra kv
* gguf.py : fix for falcon 40b
* falcon-main.cpp : fix for falcon 40b
* convert-falcon-hf-to-gguf.py : update ref
* convert-falcon-hf-to-gguf.py : add tensor data layout
* cmpnct_gpt2bpe.hpp : fixes
* falcon-main.cpp : fixes
* gptneox-main.cpp : fixes
* cmpnct_gpt2bpe.hpp : remove non-general stuff
* Update examples/server/README.md
Co-authored-by: slaren <slarengh@gmail.com>
* cmpnct_gpt2bpe.hpp : cleanup
* convert-llama-hf-to-gguf.py : special tokens
* convert-llama-7b-pth-to-gguf.py : special tokens
* convert-permute-debug.py : permute debug print
* convert-permute-debug-master.py : permute debug for master
* convert-permute-debug.py : change permute type of attn_q
* convert.py : 70b model working (change attn_q permute)
* Delete convert-permute-debug-master.py
* Delete convert-permute-debug.py
* convert-llama-hf-to-gguf.py : fix attn_q permute
* gguf.py : fix rope scale kv
* convert-llama-hf-to-gguf.py : rope scale and added tokens
* convert-llama-7b-pth-to-gguf.py : rope scale and added tokens
* llama.cpp : use rope scale kv
* convert-llama-7b-pth-to-gguf.py : rope scale fix
* convert-llama-hf-to-gguf.py : rope scale fix
* py : fix whitespace
* gguf : add Python script to convert GGMLv3 LLaMA models to GGUF (#2682 )
* First pass at converting GGMLv3 LLaMA models to GGUF
* Cleanups, better output during conversion
* Fix vocab space conversion logic
* More vocab conversion fixes
* Add description to converted GGUF files
* Improve help text, expand warning
* Allow specifying name and description for output GGUF
* Allow overriding vocab and hyperparams from original model metadata
* Use correct params override var name
* Fix wrong type size for Q8_K
Better handling of original style metadata
* Set default value for gguf add_tensor raw_shape KW arg
* llama : improve token type support (#2668 )
* Merge tokenizer fixes into the gguf branch.
* Add test vocabularies
* Adapt convert-new.py (and fix a clang-cl compiler error on windows)
* Improved tokenizer test
But does it work on MacOS?
* Improve token type support
- Added @klosax code to convert.py
- Improved token type support in vocabulary
* Exclude platform dependent tests
* More sentencepiece compatibility by eliminating magic numbers
* Restored accidentally removed comment
* llama : add API for token type
ggml-ci
* tests : use new tokenizer type API (#2692 )
* Merge tokenizer fixes into the gguf branch.
* Add test vocabularies
* Adapt convert-new.py (and fix a clang-cl compiler error on windows)
* Improved tokenizer test
But does it work on MacOS?
* Improve token type support
- Added @klosax code to convert.py
- Improved token type support in vocabulary
* Exclude platform dependent tests
* More sentencepiece compatibility by eliminating magic numbers
* Restored accidentally removed comment
* Improve commentary
* Use token type API in test-tokenizer-1.cpp
* py : cosmetics
* readme : add notice about new file format
ggml-ci
---------
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: goerch <jhr.walter@t-online.de>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-08-21 23:07:43 +03:00
Adrian
2d8b76a110
Add link to clojure bindings to Readme. ( #2659 )
2023-08-18 21:39:22 +02:00
Georgi Gerganov
7af633aec3
readme : incoming BREAKING CHANGE
2023-08-18 17:48:31 +03:00
mdrokz
eaf98c2649
readme : add link to Rust bindings ( #2656 )
2023-08-18 13:17:58 +03:00
Johannes Gäßler
0992a7b8b1
README: fix LLAMA_CUDA_MMV_Y documentation ( #2647 )
2023-08-17 23:57:59 +02:00
Henri Vasserman
6ddeefad9b
[Zig] Fixing Zig build and improvements ( #2554 )
...
* Fix zig after console.o was split
* Better include and flag management
* Change LTO to option
2023-08-17 23:11:18 +03:00
Johannes Gäßler
25d43e0eb5
CUDA: tuned mul_mat_q kernels ( #2546 )
2023-08-09 09:42:34 +02:00
ldwang
220d931864
readme : add Aquila-7B model series to supported models ( #2487 )
...
* support bpe tokenizer in convert
Signed-off-by: ldwang <ftgreat@gmail.com>
* support bpe tokenizer in convert
Signed-off-by: ldwang <ftgreat@gmail.com>
* support bpe tokenizer in convert, fix
Signed-off-by: ldwang <ftgreat@gmail.com>
* Add Aquila-7B models in README.md
Signed-off-by: ldwang <ftgreat@gmail.com>
* Up Aquila-7B models in README.md
Signed-off-by: ldwang <ftgreat@gmail.com>
---------
Signed-off-by: ldwang <ftgreat@gmail.com>
Co-authored-by: ldwang <ftgreat@gmail.com>
2023-08-02 11:21:11 +03:00
Yiming Cui
a312193e18
readme : Add Chinese LLaMA-2 / Alpaca-2 to supported models ( #2475 )
...
* add support for chinese llama-2 / alpaca-2
* remove white spaces
2023-08-02 09:18:31 +03:00
Johannes Gäßler
0728c5a8b9
CUDA: mmq CLI option, fixed mmq build issues ( #2453 )
2023-07-31 15:44:35 +02:00
Johannes Gäßler
11f3ca06b8
CUDA: Quantized matrix matrix multiplication ( #2160 )
...
* mmq implementation for non k-quants
* q6_K
* q2_K
* q3_k
* q4_K
* vdr
* q5_K
* faster q8_1 loading
* loop unrolling
* add __restrict__
* q2_K sc_high
* GGML_CUDA_MMQ_Y
* Updated Makefile
* Update Makefile
* DMMV_F16 -> F16
* Updated README, CMakeLists
* Fix CMakeLists.txt
* Fix CMakeLists.txt
* Fix multi GPU out-of-bounds
2023-07-29 23:04:44 +02:00
niansa/tuxifan
edcc7ae7d2
Obtaining LLaMA 2 instructions ( #2308 )
...
* Obtaining LLaMA 2 instructions
* Removed sharing warning for LLaMA 2
* Linked TheBloke's GGML repos
* Add LLaMA 2 to list of supported models
* Added LLaMA 2 usage instructions
* Added links to LLaMA 2 70B models
2023-07-28 03:14:11 +02:00
Johannes Gäßler
70d26ac388
Fix __dp4a documentation ( #2348 )
2023-07-23 17:49:06 +02:00
Jose Maldonado
91171b8072
make : fix CLBLAST compile support in FreeBSD ( #2331 )
...
* Fix Makefile for CLBLAST compile support and instructions for compile llama.cpp FreeBSD
* More general use-case for CLBLAST support (Linux and FreeBSD)
2023-07-23 14:52:08 +03:00
wzy
78a3d13424
flake : remove intel mkl from flake.nix due to missing files ( #2277 )
...
NixOS's mkl misses some libraries like mkl-sdl.pc. See #2261
Currently NixOS doesn't have intel C compiler (icx, icpx). See https://discourse.nixos.org/t/packaging-intel-math-kernel-libraries-mkl/975
So remove it from flake.nix
Some minor changes:
- Change pkgs.python310 to pkgs.python3 to keep latest
- Add pkgconfig to devShells.default
- Remove installPhase because we have `cmake --install` from #2256
2023-07-21 13:26:34 +03:00
wzy
45a1b07e9b
flake : update flake.nix ( #2270 )
...
When `isx86_32 || isx86_64`, it will use mkl, else openblas
According to
https://discourse.nixos.org/t/rpath-of-binary-contains-a-forbidden-reference-to-build/12200/3 ,
add -DCMAKE_SKIP_BUILD_RPATH=ON
Fix #2261 , Nix doesn't provide mkl-sdl.pc.
When we build with -DBUILD_SHARED_LIBS=ON, -DLLAMA_BLAS_VENDOR=Intel10_lp64
replace mkl-sdl.pc by mkl-dynamic-lp64-iomp.pc
2023-07-19 10:01:55 +03:00
Jiří Podivín
27ab66e437
py : turn verify-checksum-models.py into executable ( #2245 )
...
README.md was adjusted to reflect the change.
Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2023-07-16 22:54:47 +03:00
Chad Brewbaker
917831c63a
readme : fix zig build instructions ( #2171 )
2023-07-11 19:03:06 +03:00
Evan Miller
5656d10599
mpi : add support for distributed inference via MPI ( #2099 )
...
* MPI support, first cut
* fix warnings, update README
* fixes
* wrap includes
* PR comments
* Update CMakeLists.txt
* Add GH workflow, fix test
* Add info to README
* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099 )
* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()
* mpi : move all MPI logic into ggml-mpi
Not tested yet
* mpi : various fixes - communication now works but results are wrong
* mpi : fix output tensor after MPI compute (still not working)
* mpi : fix inference
* mpi : minor
* Add OpenMPI to GH action
* [mpi] continue-on-error: true
* mpi : fix after master merge
* [mpi] Link MPI C++ libraries to fix OpenMPI
* tests : fix new llama_backend API
* [mpi] use MPI_INT32_T
* mpi : factor out recv / send in functions and reuse
* mpi : extend API to allow usage with outer backends (e.g. Metal)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-10 18:49:56 +03:00
JackJollimore
18780e0a5e
readme : update Termux instructions ( #2147 )
...
The file pathing is significant when running models inside of Termux on Android devices. llama.cpp performance is improved with loading a .bin from the $HOME directory.
2023-07-09 11:20:43 +03:00
rankaiyx
2492a53fd0
readme : add more docs indexes ( #2127 )
...
* Update README.md to add more docs indexes
* Update README.md to add more docs indexes
2023-07-09 10:38:42 +03:00
dylan
84525e7962
docker : add support for CUDA in docker ( #1461 )
...
Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-07 21:25:25 +03:00
Judd
36680f6e40
convert : update for baichuan ( #2081 )
...
1. guess n_layers;
2. relax warnings on context size;
3. add a note that its derivations are also supported.
Co-authored-by: Judd <foldl@boxvest.com>
2023-07-06 19:23:49 +03:00
Johannes Gäßler
924dd22fd3
Quantized dot products for CUDA mul mat vec ( #2067 )
2023-07-05 14:19:42 +02:00
Georgi Gerganov
b472f3fca5
readme : add link web chat PR
2023-07-04 22:25:22 +03:00
Judd
471aab6e4c
convert : add support of baichuan-7b ( #2055 )
...
Co-authored-by: Judd <foldl@boxvest.com>
2023-07-01 20:00:25 +03:00
Roman Parykin
d38e451578
readme : add Scala 3 bindings repo ( #2010 )
2023-06-26 22:47:59 +03:00
Gustavo Rocha Dias
aa777abbb7
readme : LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux ( #2007 )
...
* docs - Alternative way to build at Android, with CLBlast.
* doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux.
* doc- fix typo
2023-06-26 22:34:45 +03:00
Georgi Gerganov
412c60e473
readme : add link to new k-quants for visibility
2023-06-26 19:45:09 +03:00
Georgi Gerganov
447ccbe8c3
readme : add new roadmap + manifesto
2023-06-25 16:08:12 +03:00
Georgi Gerganov
66a2555ba6
readme : add Azure CI discussion link
2023-06-25 09:07:03 +03:00
Georgi Gerganov
11da1a85cd
readme : fix whitespaces
2023-06-24 13:38:18 +03:00
Alberto
235b610d65
readme : fixed termux instructions ( #1973 )
2023-06-24 13:32:13 +03:00
eiery
d7b7484f74
Add OpenLLaMA instructions to the README ( #1954 )
...
* add openllama to readme
2023-06-23 10:38:01 +02:00
Rahul Vivek Nair
fb98254f99
Fix typo in README.md ( #1961 )
2023-06-21 23:48:43 +02:00
Georgi Gerganov
049aa16b8c
readme : add link to p1
2023-06-20 19:05:54 +03:00
Xiake Sun
2322ec223a
Fix typo ( #1949 )
2023-06-20 15:42:40 +03:00
Johannes Gäßler
16b9cd1939
Convert vector to f16 for dequantize mul mat vec ( #1913 )
...
* Convert vector to f16 for dmmv
* compile option
* Added compilation option description to README
* Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"
2023-06-19 10:23:56 +02:00
Mike
e1886cf4fe
readme : update Android build instructions ( #1922 )
...
Add steps for using termux on android devices to prevent common errors.
2023-06-18 11:28:26 +03:00
Johannes Gäßler
2c9380dd2f
Only one CUDA stream per device for async compute ( #1898 )
2023-06-17 19:15:02 +02:00
Gustavo Rocha Dias
bac19927c3
readme : alternative way to build for Android with CLBlast. ( #1828 )
2023-06-17 12:01:06 +03:00
Aisuko
059e99066d
doc : fix wrong address of BLIS.md ( #1772 )
...
Signed-off-by: Aisuko <urakiny@gmail.com>
2023-06-10 17:08:11 +03:00
Georgi Gerganov
4dc62c545d
readme : add June roadmap
2023-06-07 07:15:08 +03:00
Yuval Peled
f4c55d3bd7
docs : add performance troubleshoot + example benchmark documentation ( #1674 )
...
* test anchor link
* test table
* add benchmarks
* Add performance troubleshoot & benchmark
* add benchmarks
* remove unneeded line
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-05 23:32:36 +03:00
Foul-Tarnished
f1465624c2
readme : fix typo ( #1700 )
...
Fix a typo in a command in README.md
2023-06-05 23:28:37 +03:00
Georgi Gerganov
827f5eda91
readme : update hot topics
2023-06-04 23:38:19 +03:00
Georgi Gerganov
ecb217db4f
llama : Metal inference ( #1642 )
...
* mtl : export the LLaMA computation graph
* ci : disable temporary
* mtl : adapt the MNIST example as starter
* mtl : no need for mtl-export tool, add cli arg for main instead
* mtl : export just a small part of the graph for now to make it easier
* mtl : move MSL code into separate file for easy editing
* mtl : initial get_rows_q4_0 kernel
* mtl : confirmed get_rows_q4_0 is working correctly
* mtl : add rms_norm kernel + confirm working
* mtl : add mul kernel + confirm working
* mtl : initial mul_mat Q4 kernel (wrong results)
* mtl : mul_mat fixes (still wrong)
* mtl : another mul_mat Q4 (still does not work)
* mtl : working mul_mat q4
* ggml : fix handling of "view" ops in ggml_graph_import()
* mtl : add rope kernel
* mtl : add reshape and transpose handling
* ggml : store offset as opt arg for ggml_view_xd() operators
* mtl : add cpy kernel + handle view ops
* mtl : confirm f16 x f32 attention mul mat
* mtl : add scale kernel
* mtl : add diag_mask_inf kernel
* mtl : fix soft_max kernel
* ggml : update ggml_nbytes() to handle non-contiguous tensors
* mtl : verify V tensor contents
* mtl : add f32 -> f32 cpy kernel
* mtl : add silu kernel
* mtl : add non-broadcast mul kernel
* mtl : full GPU inference of the computation graph
* mtl : optimize rms_norm and soft_max kernels
* mtl : add f16 mat x f32 vec multiplication kernel
* mtl : fix bug in f16 x f32 mul mat + speed-up computation
* mtl : faster mul_mat_q4_0_f32 kernel
* mtl : fix kernel signature + roll inner loop
* mtl : more threads for rms_norm + better timing
* mtl : remove printfs from inner loop
* mtl : simplify implementation
* mtl : add save/load vocab to ggml file
* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)
* mtl : make it work with main example
Lots of hacks but at least now it generates text
* mtl : preparing for merge
* mtl : clean-up ggml mtl interface + suport scratch / inplace
* mtl : remove temp / debug code
* metal : final refactoring and simplification
* Revert "ci : disable temporary"
This reverts commit 98c267fc77
.
* metal : add comments
* metal : clean-up stuff, fix typos
* readme : add Metal instructions
* readme : add example for main
2023-06-04 23:34:30 +03:00
Henri Vasserman
d8bd0013e8
Add info about CUDA_VISIBLE_DEVICES ( #1682 )
2023-06-03 16:35:20 +03:00
Henri Vasserman
97c9b77c4f
Add documentation about CLBlast ( #1604 )
...
Installing, compiling and using.
2023-05-27 18:47:55 +03:00
Evan Jones
c31bbe934b
readme : add docs for chat-persistent.sh ( #1568 )
...
* readme : add docs for chat-persistent.sh
* Update README.md
2023-05-24 09:24:01 +03:00
Zenix
b8ee340abe
feature : support blis and other blas implementation ( #1536 )
...
* feature: add blis support
* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927
* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake
* Fix typo in INTEGER
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix: blas changes on ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-20 17:58:31 +03:00
Georgi Gerganov
ea600071cb
Revert "feature : add blis and other BLAS implementation support ( #1502 )"
...
This reverts commit 07e9ace0f9
.
2023-05-20 12:03:48 +03:00
Zenix
07e9ace0f9
feature : add blis and other BLAS implementation support ( #1502 )
...
* feature: add blis support
* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927
* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake
* Fix typo in INTEGER
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-20 12:02:48 +03:00
Georgi Gerganov
2d5db48371
ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 ( #1508 )
...
* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0
* llama : bump LLAMA_FILE_VERSION to 3
* cuda : update Q4 and Q8 dequantize kernels
* ggml : fix AVX dot products
* readme : update performance table + hot topics
2023-05-19 22:17:18 +03:00
David Kennedy
79e3efb0e9
readme : adds WizardLM to the list of supported models ( #1485 )
2023-05-19 20:16:30 +03:00
Georgi Gerganov
cdd5350892
readme : update Q4_0 perplexities
...
I think these were affected by the removal of the `round` during quantization
2023-05-13 09:12:44 +03:00
Rinne
089b1c93ba
readme : add C#/.NET bindings repo ( #1409 )
2023-05-12 08:39:40 +03:00
Georgi Gerganov
b9fd7eee57
ggml : remove bit shuffling ( #1405 )
...
* ggml : remove Q4_0 bit shufling (ARM NEON)
* ggml : remove Q4_1 bit shuffling (ARM NEON + reference)
* ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON)
* ggml : remove Q4_2 bit shuffling (WIP, BROKEN)
* ggml : remove Q5_0 bit shuffling (ARM NEON)
* ggml : 2x faster scalar implementations
* ggml : remove Q5_1 bit shuffling (ARM NEON + scalar)
* ggml : simplify scalar dot
* ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit
* ggml : fix Q4_1 quantization
* ggml : update cuBLAS + normalize variable names
* ggml : remove Q4_2 mode
* ggml : minor formatting
* ggml : fix Q5_0 quantization
* scripts : add script for measuring the time per token
* AVX implementations (#1370 )
* ggml : uniform 5th bit extraction
* llama : produce error upon loading old model files
* llama : fix model magic/version write
* ggml : speed-up Q5_0 + Q5_1 at 4 threads
* ggml : preserve old Q4 and Q5 formats
* ggml : simplify Q8_1 - no need for low / high sums anymore
* ggml : fix Q8_0 and Q8_1 rounding
* Revert "AVX implementations (#1370 )"
This reverts commit 948d124837
.
* ggml : fix AVX2 implementation
* sha : update hashes for 7B and 13B
* readme : update timings + remove warning banner
* llama : update v2 PR number to 1405
* ggml : fix WASM comments
* ggml : back to original bit order
* readme : add note that Q4 and Q5 have been changed
* llama : fix return for unknown version
---------
Co-authored-by: Stephan Walter <stephan@walter.name>
2023-05-12 00:23:08 +03:00
Georgi Gerganov
56551bc11f
readme : add notice about upcoming breaking change
2023-05-08 22:52:18 +03:00
AlpinDale
fe60904eef
readme : add TOC and Pygmalion instructions ( #1359 )
2023-05-08 19:33:30 +03:00
Georgi Gerganov
f9a6364912
llama : require first token to be BOS ( #1303 )
...
* llama : require first token to be BOS
* scripts : add ppl-run-all.sh
* perplexity : add BOS for each chunk
* readme : update perplexity values after BOS fix
* perplexity : add clarifying comments
2023-05-08 17:41:54 +03:00
Johannes Gäßler
1f48b0abcf
Documented CUDA reproducibility, added warning ( #1346 )
2023-05-08 02:42:01 +02:00
DaniAndTheWeb
173d0e6419
makefile: automatic Arch Linux detection ( #1332 )
...
This commit is a port of a detection method used in koboldcpp's Makefile in order to automatically set the -lcblas option on Arch Linux
2023-05-05 23:57:14 +02:00