llama.cpp

Mirror of https://github.com/ggerganov/llama.cpp.git, synced 2024-12-26 11:24:35 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	c14f72db9c	readme : update hot topics	2024-02-21 15:39:54 +02:00
postmasters	580111d42b	llama : add `gemma` model (#5631) There are couple things in this architecture: 1. Shared input and output embedding parameters. 2. Key length and value length are not derived from `n_embd`. More information about the models can be found at https://ai.google.dev/gemma. GGUFs can be downloaded from https://huggingface.co/google.	2024-02-21 15:08:22 +02:00
Dane Madsen	5207b3fbc5	readme : update UI list (#5605) * Add maid to ui list * Specify licence	2024-02-20 12:00:23 +02:00
Mirko185	769a716e30	readme : update (#5572) Added 1.5-bit on README.md	2024-02-19 09:39:31 +02:00
Georgi Gerganov	b1de96824b	ci : fix wikitext url + compile warnings (#5569) ggml-ci	2024-02-18 22:39:30 +02:00
Rune	594fca3fef	readme : fix typo (#5490) executabhle -> executable	2024-02-14 17:15:49 +02:00
Marko Tasic	e4124c2477	readme : add JavaScript/Wasm repo (#5415)	2024-02-09 12:17:00 +02:00
Ebey Abraham	8c933b70c2	fix typo in readme (#5399) Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>	2024-02-07 22:11:30 +01:00
Kamil Tomšík	b906596bb7	Add Ava in the list of llama.cpp UIs (#4362)	2024-02-07 13:44:52 -05:00
Eve	ed0bf32290	readme : modernize (#5379) * first cleanup, update everything to Llama 2 and remove outdated content * Delete SHA256SUMS * make build instructions generic * recommend Q4_K_M quantization method * Update README.md	2024-02-07 08:21:30 +02:00
Ben Williams	9a697d842b	readme : update ui list (#5354)	2024-02-07 08:16:48 +02:00
Kawrakow	b08f22c882	Update README.md (#5366) Add some links to quantization related PRs	2024-02-06 19:00:16 +02:00
BarfingLemurs	2e9c0bd6b3	readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)	2024-02-06 16:06:48 +02:00
Johannes Gäßler	78b00dda6c	README: updated introduction (#5343) * README: updated introduction * readme : update --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-05 15:55:10 +01:00
chiranko	5d55b0cd82	readme : add CodeShell models to the supported models list (#5330)	2024-02-05 09:41:38 +02:00
BADR	6a66c5071a	readme : add tenere in the ui tools list (#5284)	2024-02-03 13:20:26 +02:00
Xuan Son Nguyen	6b91b1e0a9	docker : add build for SYCL, Vulkan + update readme (#5228) * add vulkan dockerfile * intel dockerfile: compile sycl by default * fix vulkan dockerfile * add docs for vulkan * docs: sycl build in docker * docs: remove trailing spaces * docs: sycl: add docker section * docs: clarify install vulkan SDK outside docker * sycl: use intel/oneapi-basekit docker image * docs: correct TOC * docs: correct docker image for Intel oneMKL	2024-02-02 09:56:31 +02:00
Georgi Gerganov	5cb04dbc16	llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240) * llama : remove LLAMA_MAX_DEVICES from llama.h ggml-ci * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> * server : remove LLAMA_MAX_DEVICES ggml-ci * llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD ggml-ci * train : remove LLAMA_SUPPORTS_GPU_OFFLOAD * readme : add deprecation notice * readme : change deprecation notice to "remove" and fix url * llama : remove gpu includes from llama.h ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-01-31 17:30:17 +02:00
Neo Zhang Jianyu	01684139c3	support SYCL backend windows build (#5208) * support SYCL backend windows build * add windows build in CI * add for win build CI * correct install oneMKL * fix install issue * fix ci * fix install cmd * fix install cmd * fix install cmd * fix install cmd * fix install cmd * fix win build * fix win build * fix win build * restore other CI part * restore as base * rm no new line * fix no new line issue, add -j * fix grammer issue * allow to trigger manually, fix format issue * fix format * add newline * fix format * fix format * fix format issuse --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-01-31 08:08:07 +05:30
Romain Neutron	5589921ef8	readme : minor (#5204) This is about tuning the code formatting of the README file	2024-01-30 11:16:38 +02:00
Georgi Gerganov	49f44b5c55	readme : update hot topics	2024-01-30 11:14:44 +02:00
Abhilash Majumder	0f648573dd	ggml : add unified SYCL backend for Intel GPUs (#2690) * first update for migration * update init_cublas * add debug functio, commit all help code * step 1 * step 2 * step3 add fp16, slower 31->28 * add GGML_LIST_DEVICE function * step 5 format device and print * step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue * support main device is non-zero * step7 add debug for code path, rm log * step 8, rename all macro & func from cuda by sycl * fix error of select non-zero device, format device list * ren ggml-sycl.hpp -> ggml-sycl.h * clear CMAKE to rm unused lib and options * correct queue: rm dtct:get_queue * add print tensor function to debug * fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481 * summary dpct definition in one header file to replace folder:dpct * refactor device log * mv dpct definition from folder dpct to ggml-sycl.h * update readme, refactor build script * fix build with sycl * set nthread=1 when sycl, increase performance * add run script, comment debug code * add ls-sycl-device tool * add ls-sycl-device, rm unused files * rm rear space * dos2unix * Update README_sycl.md * fix return type * remove sycl version from include path * restore rm code to fix hang issue * add syc and link for sycl readme * rm original sycl code before refactor * fix code err * add know issue for pvc hang issue * enable SYCL_F16 support * align pr4766 * check for sycl blas, better performance * cleanup 1 * remove extra endif * add build&run script, clean CMakefile, update guide by review comments * rename macro to intel hardware * editor config format * format fixes * format fixes * editor format fix * Remove unused headers * skip build sycl tool for other code path * replace tab by space * fix blas matmul function * fix mac build * restore hip dependency * fix conflict * ren as review comments * mv internal function to .cpp file * export funciton print_sycl_devices(), mv class dpct definition to source file * update CI/action for sycl code, fix CI error of repeat/dup * fix action ID format issue * rm unused strategy * enable llama_f16 in ci * fix conflict * fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml * fix ci cases for unsupported data type * revert unrelated changed in cuda cmake remove useless nommq fix typo of GGML_USE_CLBLAS_SYCL * revert hip cmake changes * fix indent * add prefix in func name * revert no mmq * rm cpu blas duplicate * fix no_new_line * fix src1->type==F16 bug. * pass batch offset for F16 src1 * fix batch error * fix wrong code * revert sycl checking in test-sampling * pass void as arguments of ggml_backend_sycl_print_sycl_devices * remove extra blank line in test-sampling * revert setting n_threads in sycl * implement std::isinf for icpx with fast math. * Update ci/run.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add copyright and MIT license declare * update the cmd example --------- Co-authored-by: jianyuzh <jianyu.zhang@intel.com> Co-authored-by: luoyu-intel <yu.luo@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-28 17:56:23 +02:00
Marcus Dunn	af4980bfed	readme : add link to rust bindings (#5148) * added link to another set of rust bindings with brief note on differences. * fixed link name	2024-01-28 10:30:44 +02:00
Kyle Mistele	39baaf55a1	docker : add server-first container images (#5157) * feat: add Dockerfiles for each platform that user ./server instead of ./main * feat: update .github/workflows/docker.yml to build server-first docker containers * doc: add information about running the server with Docker to README.md * doc: add information about running with docker to the server README * doc: update n-gpu-layers to show correct GPU usage * fix(doc): update container tag from `server` to `server-cuda` for README example on running server container with CUDA	2024-01-28 09:55:31 +02:00
Georgi Gerganov	aad0b01d73	readme : update hot topics	2024-01-26 10:52:33 +02:00
XiaotaoChen	fe54033b69	readme : add MobileVLM 1.7B/3B to the supported models list (#5107) Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>	2024-01-25 22:14:32 +02:00
adel boussaken	48e2b13372	Add a dart/flutter binding to README.md (#4882)	2024-01-20 03:05:43 -05:00
iohub	18adb4e9bb	readme : add 3rd party collama reference to UI list (#4840) Add a VSCode extension for llama.cpp reference to UI list	2024-01-09 18:45:54 +02:00
Georgi Gerganov	a9a8c5de3d	readme : add link to SOTA models	2024-01-08 20:25:17 +02:00
Lars Grammel	b7e7982953	readme : add lgrammel/modelfusion JS/TS client for llama.cpp (#4814)	2024-01-07 22:24:11 +02:00
automaticcat	24a447e20a	ggml : add ggml_cpu_has_avx_vnni() (#4589) * feat: add avx_vnni based on intel documents * ggml: add avx vnni based on intel document * llama: add avx vnni information display * docs: add more details about using oneMKL and oneAPI for intel processors * docs: add more details about using oneMKL and oneAPI for intel processors * docs: add more details about using oneMKL and oneAPI for intel processors * docs: add more details about using oneMKL and oneAPI for intel processors * docs: add more details about using oneMKL and oneAPI for intel processors * Update ggml.c Fix indentation upgate Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-30 10:07:48 +02:00
manikbhandari	ea5497df5d	gpt2 : Add gpt2 architecture integration (#4555)	2023-12-28 15:03:57 +01:00
Paul Tsochantaris	a206137f92	Adding Emeltal reference to UI list (#4629)	2023-12-25 18:09:53 +02:00
Shintarou Okada	753be377b6	llama : add PLaMo model (#3557) * add plamo mock * add tensor loading * plamo convert * update norm * able to compile * fix norm_rms_eps hparam * runnable * use inp_pos * seems ok * update kqv code * remove develop code * update README * shuffle attn_q.weight and attn_output.weight for broadcasting * remove plamo_llm_build_kqv and use llm_build_kqv * fix style * update * llama : remove obsolete KQ_scale * plamo : fix tensor names for correct GPU offload --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-24 15:35:49 +02:00
FantasyGmm	a55876955b	cuda : fix jetson compile error (#4560) * fix old jetson compile error * Update Makefile * update jetson detect and cuda version detect * update cuda marco define * update makefile and cuda,fix some issue * Update README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update Makefile * Update README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-22 17:11:12 +02:00
Michael Kesper	28cb35a0ec	make : add LLAMA_HIP_UMA option (#4587) NB: LLAMA_HIP_UMA=1 (or any value) adds MK_CPPFLAG -DGGML_HIP_UMA	2023-12-22 10:03:25 +02:00
Deins	2bb98279c5	readme : add zig bindings (#4581)	2023-12-22 08:49:54 +02:00
Erik Garrison	0f630fbc92	cuda : ROCm AMD Unified Memory Architecture (UMA) handling (#4449) * AMD ROCm: handle UMA memory VRAM expansions This resolves #2797 by allowing ROCm AMD GPU users with a UMA to dynamically expand the VRAM allocated to the GPU. Without this, AMD ROCm users with shared CPU/GPU memory usually are stuck with the BIOS-set (or fixed) framebuffer VRAM, making it impossible to load more than 1-2 layers. Note that the model is duplicated in RAM because it's loaded once for the CPU and then copied into a second set of allocations that are managed by the HIP UMA system. We can fix this later. * clarify build process for ROCm on linux with cmake * avoid using deprecated ROCm hipMallocHost * keep simplifying the change required for UMA * cmake: enable UMA-compatible allocation when LLAMA_HIP_UMA=ON	2023-12-21 21:45:32 +02:00
Georgi Gerganov	c083718c89	readme : update coding guidelines	2023-12-21 19:27:14 +02:00
Georgi Gerganov	b1306c4394	readme : update hot topics	2023-12-17 20:16:23 +02:00
BarfingLemurs	0353a18401	readme : update supported model list (#4457)	2023-12-14 09:38:49 +02:00
Georgi Gerganov	113f9942fc	readme : update hot topics	2023-12-13 14:05:38 +02:00
Georgi Gerganov	bcc0eb4591	llama : per-layer KV cache + quantum K cache (#4309) * per-layer KV * remove unnecessary copies * less code duplication, offload k and v separately * llama : offload KV cache per-layer * llama : offload K shift tensors * llama : offload for rest of the model arches * llama : enable offload debug temporarily * llama : keep the KV related layers on the device * llama : remove mirrors, perform Device -> Host when partial offload * common : add command-line arg to disable KV cache offloading * llama : update session save/load * llama : support quantum K cache (#4312) * llama : support quantum K cache (wip) * metal : add F32 -> Q8_0 copy kernel * cuda : add F32 -> Q8_0 copy kernel ggml-ci * cuda : use mmv kernel for quantum cache ops * llama : pass KV cache type through API * llama : fix build ggml-ci * metal : add F32 -> Q4_0 copy kernel * metal : add F32 -> Q4_1 copy kernel * cuda : wip * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels * llama-bench : support type_k/type_v * metal : use mm kernel only for quantum KV cache * cuda : add comment * llama : remove memory_f16 and kv_f16 flags --------- Co-authored-by: slaren <slarengh@gmail.com> * readme : add API change notice --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-07 13:03:17 +02:00
vodkaslime	524907aa76	readme : fix (#4135) * fix: readme * chore: resolve comments * chore: resolve comments	2023-11-30 23:49:21 +02:00
Dawid Wysocki	74daabae69	readme : fix typo (#4253) llama.cpp uses GitHub Actions, not Gitlab Actions.	2023-11-30 23:43:32 +02:00
Peter Sugihara	4fea3420ee	readme : add FreeChat (#4248)	2023-11-29 09:16:34 +02:00
Kasumi	0dab8cd7cc	readme : add Amica to UI list (#4230)	2023-11-27 19:39:42 +02:00
Georgi Gerganov	9656026b53	readme : update hot topics	2023-11-26 20:42:51 +02:00
Georgi Gerganov	04814e718e	readme : update hot topics	2023-11-25 12:02:13 +02:00
Aaryaman Vasishta	b35f3d0def	readme : use PATH for Windows ROCm (#4195) * Update README.md to use PATH for Windows ROCm * Update README.md * Update README.md	2023-11-24 09:52:39 +02:00
Georgi Gerganov	d103d935c0	readme : update hot topics	2023-11-23 13:51:22 +02:00
Aaryaman Vasishta	dfc7cd48b1	readme : update ROCm Windows instructions (#4122) * Update README.md * Update README.md Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2023-11-20 17:02:46 +02:00
Galunid	36eed0c42c	stablelm : StableLM support (#3586) * Add support for stablelm-3b-4e1t * Supports GPU offloading of (n-1) layers	2023-11-14 11:17:12 +01:00
Georgi Gerganov	c049b37d7b	readme : update hot topics	2023-11-13 14:18:08 +02:00
Richard Kiss	532dd74e38	Fix some documentation typos/grammar mistakes (#4032) * typos * Update examples/parallel/README.md Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com> --------- Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>	2023-11-11 23:04:58 -07:00
Georgi Gerganov	224e7d5b14	readme : add notice about #3912	2023-11-02 20:44:12 +02:00
Ian Scrivener	5a42a5f8e8	readme : remove unsupported node.js library (#3703) - https://github.com/Atome-FE/llama-node is quite out of date - doesn't support recent/current llama.cpp functionality	2023-10-22 21:16:43 +03:00
Georgi Gerganov	d1031cf49c	sampling : refactor init to use llama_sampling_params (#3696) * sampling : refactor init to use llama_sampling_params * llama : combine repetition, frequency and presence penalties in 1 call * examples : remove embd-input and gptneox-wip * sampling : rename penalty params + reduce size of "prev" vector * sampling : add llama_sampling_print helper * sampling : hide prev behind API and apply #3661 ggml-ci	2023-10-20 21:07:23 +03:00
Georgi Gerganov	004797f6ac	readme : update hot topics	2023-10-18 21:44:43 +03:00
BarfingLemurs	8402566a7c	readme : update hot-topics & models, detail windows release in usage (#3615) * Update README.md * Update README.md * Update README.md * move "Running on Windows" section below "Prepare data and run" --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-17 21:13:21 +03:00
ldwang	5fe268a4d9	readme : add Aquila2 links (#3610) Signed-off-by: ldwang <ftgreat@gmail.com> Co-authored-by: ldwang <ftgreat@gmail.com>	2023-10-17 18:52:33 +03:00
Ian Scrivener	f3040beaab	typo : it is `--n-gpu-layers` not `--gpu-layers` (#3592) fixed a typo in the MacOS Metal run doco	2023-10-12 14:10:50 +03:00
Galunid	9f6ede19f3	Add MPT model to supported models in README.md (#3574)	2023-10-10 19:02:49 -04:00
Xingchen Song(宋星辰)	c5b49360d0	readme : add bloom (#3570)	2023-10-10 19:28:50 +03:00
BarfingLemurs	1faaae8c2b	readme : update models, cuda + ppl instructions (#3510)	2023-10-06 22:13:36 +03:00
Georgi Gerganov	beabc8cfb0	readme : add project status link	2023-10-04 16:50:44 +03:00
slaren	40e07a60f9	llama.cpp : add documentation about rope_freq_base and scale values (#3401) * llama.cpp : add documentation about rope_freq_base and scale values * add notice to hot topics	2023-09-29 18:42:32 +02:00
BarfingLemurs	0a4a4a0982	readme : update hot topics + model links (#3399)	2023-09-29 15:50:35 +03:00
Andrew Duffy	569550df20	readme : add link to grammars app (#3388) * Add link to grammars app per @ggernagov suggestion Adding a sentence in the Grammars section of README to point to grammar app, per https://github.com/ggerganov/llama.cpp/discussions/2494#discussioncomment-7138211 * Update README.md	2023-09-29 14:15:57 +03:00
Pierre Alexandre SCHEMBRI	4aea3b846e	readme : add Mistral AI release 0.1 (#3362)	2023-09-28 15:13:37 +03:00
BarfingLemurs	ffe88a36a9	readme : add some recent perplexity and bpw measurements to READMES, link for k-quants (#3340) * Update README.md * Update README.md * Update README.md with k-quants bpw measurements	2023-09-27 18:30:36 +03:00
2f38b454	1726f9626f	docs: Fix typo CLBlast_DIR var. (#3330)	2023-09-25 20:24:52 +02:00
Lee Drake	bc9d3e3971	Update README.md (#3289) * Update README.md * Update README.md Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-09-21 21:00:24 +02:00
Georgi Gerganov	7eb41179ed	readme : update hot topics	2023-09-20 20:48:22 +03:00
Johannes Gäßler	111163e246	CUDA: enable peer access between devices (#2470)	2023-09-17 16:37:53 +02:00
dylan	980ab41afb	docker : add gpu image CI builds (#3103) Enables the GPU enabled container images to be built and pushed alongside the CPU containers. Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>	2023-09-14 19:47:00 +03:00
Ikko Eltociear Ashimine	7d99aca759	readme : fix typo (#3043) * readme : fix typo acceleation -> acceleration * Update README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-09-08 19:04:32 +03:00
Georgi Gerganov	94f10b91ed	readme : update hot tpoics	2023-09-08 18:18:04 +03:00
Yui	6ff712a6d1	Update deprecated GGML TheBloke links to GGUF (#3079)	2023-09-08 12:32:55 +02:00
Georgi Gerganov	e36ecdccc8	build : on Mac OS enable Metal by default (#2901) * build : on Mac OS enable Metal by default * make : try to fix build on Linux * make : move targets back to the top * make : fix target clean * llama : enable GPU inference by default with Metal * llama : fix vocab_only logic when GPU is enabled * common : better `n_gpu_layers` assignment * readme : update Metal instructions * make : fix merge conflict remnants * gitignore : metal	2023-09-04 22:26:24 +03:00
Ido S	340af42f09	docs : add `catai` to `README.md` (#2967)	2023-09-03 08:50:51 +03:00
bandoti	52315a4216	readme : update clblast instructions (#2903) * Update Windows CLBlast instructions * Update Windows CLBlast instructions * Remove trailing whitespace	2023-09-02 15:53:18 +03:00
Konstantin Herud	49bb9cbe0f	docs : add java-llama.cpp to README.md (#2935)	2023-09-01 16:36:14 +03:00
Gilad S	35092fb547	docs : add `node-llama-cpp` to `README.md` (#2885)	2023-08-30 11:40:12 +03:00
slaren	c03a243abf	remove outdated references to -eps and -gqa from README (#2881)	2023-08-29 23:17:34 +02:00
Jhen-Jie Hong	74e0caeb82	readme : add react-native binding (#2869)	2023-08-29 12:30:10 +03:00
Georgi Gerganov	da7455d046	readme : fix headings	2023-08-27 15:52:34 +03:00
Georgi Gerganov	c48c5bb0b0	readme : update hot topics	2023-08-27 14:44:35 +03:00
Henri Vasserman	6bbc598a63	ROCm Port (#1087) * use hipblas based on cublas * Update Makefile for the Cuda kernels * Expand arch list and make it overrideable * Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5) * add hipBLAS to README * new build arg LLAMA_CUDA_MMQ_Y * fix half2 decomposition * Add intrinsics polyfills for AMD * AMD assembly optimized __dp4a * Allow overriding CC_TURING * use "ROCm" instead of "CUDA" * ignore all build dirs * Add Dockerfiles * fix llama-bench * fix -nommq help for non CUDA/HIP --------- Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com> Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com> Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com> Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com> Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com> Co-authored-by: jammm <2500920+jammm@users.noreply.github.com> Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>	2023-08-25 12:09:42 +03:00
Georgi Gerganov	44d5462b5c	readme : fix link	2023-08-23 23:44:19 +03:00
Georgi Gerganov	c7868b0753	minor : fix trailing whitespace	2023-08-23 23:43:00 +03:00
Georgi Gerganov	79da24b58c	readme : update hot topics	2023-08-23 23:41:16 +03:00
Evan Jones	f5fe98d11b	docs : add grammar docs (#2701) * docs : add grammar docs * tweaks to grammar guide * rework GBNF example to be a commented grammar	2023-08-22 21:01:57 -04:00
Georgi Gerganov	6381d4e110	gguf : new file format with flexible meta data (beta) (#2398 ) * gguf : first API pass * gguf : read header + meta data * gguf : read tensor info * gguf : initial model loading - not tested * gguf : add gguf_get_tensor_name() * gguf : do not support passing existing ggml_context to gguf_init * gguf : simplify gguf_get_val * gguf : gguf.c is now part of ggml.c * gguf : read / write sample models * gguf : add comments * refactor : reduce code duplication and better API (#2415) * gguf : expose the gguf_type enum through the API for now * gguf : add array support * gguf.py : some code style changes * convert.py : start a new simplified implementation by removing old stuff * convert.py : remove GGML vocab + other obsolete stuff * GGUF : write tensor (#2426) * WIP: Write tensor * GGUF : Support writing tensors in Python * refactor : rm unused import and upd todos * fix : fix errors upd writing example * rm example.gguf * gitignore .gguf undo formatting * gguf : add gguf_find_key (#2438) * gguf.cpp : find key example * ggml.h : add gguf_find_key * ggml.c : add gguf_find_key * gguf : fix writing tensors * gguf : do not hardcode tensor names to read * gguf : write sample tensors to read * gguf : add tokenization constants * quick and dirty conversion example * gguf : fix writing gguf arrays * gguf : write tensors one by one and code reuse * gguf : fix writing gguf arrays * gguf : write tensors one by one * gguf : write tensors one by one * gguf : write tokenizer data * gguf : upd gguf conversion script * Update convert-llama-h5-to-gguf.py * gguf : handle already encoded string * ggml.h : get array str and f32 * ggml.c : get arr str and f32 * gguf.py : support any type * Update convert-llama-h5-to-gguf.py * gguf : fix set is not subscriptable * gguf : update convert-llama-h5-to-gguf.py * constants.py : add layer norm eps * gguf.py : add layer norm eps and merges * ggml.h : increase GGML_MAX_NAME to 64 * ggml.c : add gguf_get_arr_n * Update 
convert-llama-h5-to-gguf.py * add gptneox gguf example * Makefile : add gptneox gguf example * Update convert-llama-h5-to-gguf.py * add gptneox gguf example * Update convert-llama-h5-to-gguf.py * Update convert-gptneox-h5-to-gguf.py * Update convert-gptneox-h5-to-gguf.py * Update convert-llama-h5-to-gguf.py * gguf : support custom alignment value * gguf : fix typo in function call * gguf : mmap tensor data example * fix : update convert-llama-h5-to-gguf.py * Update convert-llama-h5-to-gguf.py * convert-gptneox-h5-to-gguf.py : Special tokens * gptneox-main.cpp : special tokens * Update gptneox-main.cpp * constants.py : special tokens * gguf.py : accumulate kv and tensor info data + special tokens * convert-gptneox-h5-to-gguf.py : accumulate kv and ti + special tokens * gguf : gguf counterpart of llama-util.h * gguf-util.h : update note * convert-llama-h5-to-gguf.py : accumulate kv / ti + special tokens * convert-llama-h5-to-gguf.py : special tokens * Delete gptneox-common.cpp * Delete gptneox-common.h * convert-gptneox-h5-to-gguf.py : gpt2bpe tokenizer * gptneox-main.cpp : gpt2 bpe tokenizer * gpt2 bpe tokenizer (handles merges and unicode) * Makefile : remove gptneox-common * gguf.py : bytesarray for gpt2bpe tokenizer * cmpnct_gpt2bpe.hpp : comments * gguf.py : use custom alignment if present * gguf : minor stuff * Update gptneox-main.cpp * map tensor names * convert-gptneox-h5-to-gguf.py : map tensor names * convert-llama-h5-to-gguf.py : map tensor names * gptneox-main.cpp : map tensor names * gguf : start implementing libllama in GGUF (WIP) * gguf : start implementing libllama in GGUF (WIP) * rm binary commited by mistake * upd .gitignore * gguf : calculate n_mult * gguf : inference with 7B model working (WIP) * gguf : rm deprecated function * gguf : start implementing gguf_file_saver (WIP) * gguf : start implementing gguf_file_saver (WIP) * gguf : start implementing gguf_file_saver (WIP) * gguf : add gguf_get_kv_type * gguf : add gguf_get_kv_type * gguf : write 
metadata in gguf_file_saver (WIP) * gguf : write metadata in gguf_file_saver (WIP) * gguf : write metadata in gguf_file_saver * gguf : rm references to old file formats * gguf : shorter name for member variable * gguf : rm redundant method * gguf : get rid of n_mult, read n_ff from file * Update gguf_tensor_map.py * Update gptneox-main.cpp * gguf : rm references to old file magics * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : quantization is working * gguf : roper closing of file * gguf.py : no need to convert tensors twice * convert-gptneox-h5-to-gguf.py : no need to convert tensors twice * convert-llama-h5-to-gguf.py : no need to convert tensors twice * convert-gptneox-h5-to-gguf.py : simplify nbytes * convert-llama-h5-to-gguf.py : simplify nbytes * gptneox-main.cpp : n_layer --> n_block * constants.py : n_layer --> n_block * gguf.py : n_layer --> n_block * convert-gptneox-h5-to-gguf.py : n_layer --> n_block * convert-llama-h5-to-gguf.py : n_layer --> n_block * gptneox-main.cpp : n_layer --> n_block * Update gguf_tensor_map.py * convert-gptneox-h5-to-gguf.py : load model in parts to save memory * convert-llama-h5-to-gguf.py : load model in parts to save memory * convert : write more metadata for LLaMA * convert : rm quantization version * convert-gptneox-h5-to-gguf.py : add file_type key * gptneox-main.cpp : add file_type key * fix conflicts * gguf : add todos and comments * convert-gptneox-h5-to-gguf.py : tensor name map changes * Create gguf_namemap.py : tensor name map changes * Delete gguf_tensor_map.py * gptneox-main.cpp : tensor name map changes * convert-llama-h5-to-gguf.py : fixes * gguf.py : dont add empty strings * simple : minor style changes * gguf : use UNIX line ending * Create convert-llama-7b-pth-to-gguf.py 
* llama : sync gguf-llama.cpp with latest llama.cpp (#2608) * llama : sync gguf-llama.cpp with latest llama.cpp * minor : indentation + assert * llama : refactor gguf_buffer and gguf_ctx_buffer * llama : minor * gitignore : add gptneox-main * llama : tokenizer fixes (#2549) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * convert : update convert-new.py with tokenizer fixes (#2614) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * llama : sync gguf-llama with llama (#2613) * llama : sync gguf-llama with llama * tests : fix build + warnings (test-tokenizer-1 still fails) * tests : fix wstring_convert * convert : fix layer names * llama : sync gguf-llama.cpp * convert : update HF converter to new tokenizer voodoo magics * llama : update tokenizer style * convert-llama-h5-to-gguf.py : add token types * constants.py : add token types * gguf.py : add token types * convert-llama-7b-pth-to-gguf.py : add token types * gguf-llama.cpp : fix n_head_kv * convert-llama-h5-to-gguf.py : add 70b gqa support * gguf.py : add tensor data layout * convert-llama-h5-to-gguf.py : add tensor data layout * convert-llama-7b-pth-to-gguf.py : add tensor data layout * gptneox-main.cpp : add tensor data layout * convert-llama-h5-to-gguf.py : clarify the reverse permute * llama : refactor model loading code (#2620) * llama : style formatting + remove helper methods * llama : fix quantization using gguf tool * llama : simplify gguf_file_saver * llama : fix method names * llama : simplify write_header() * llama : no need to pass full file loader to the file saver just gguf_ctx * llama : gguf_file_saver write I32 * llama : refactor tensor names (#2622) * gguf: update tensor names searched in quantization * gguf : define tensor names as constants * gguf : initial write API (not tested yet) * gguf : write to file API (not tested) * gguf : initial write API ready + example * gguf : fix 
header write * gguf : fixes + simplify example + add ggml_nbytes_pad() * gguf : minor * llama : replace gguf_file_saver with new gguf write API * gguf : streaming support when writing files * gguf : remove oboslete write methods * gguf : remove obosolete gguf_get_arr_xxx API * llama : simplify gguf_file_loader * llama : move hparams and vocab from gguf_file_loader to llama_model_loader * llama : merge gguf-util.h in llama.cpp * llama : reorder definitions in .cpp to match .h * llama : minor simplifications * llama : refactor llama_model_loader (WIP) wip : remove ggml_ctx from llama_model_loader wip : merge gguf_file_loader in llama_model_loader * llama : fix shape prints * llama : fix Windows build + fix norm_rms_eps key * llama : throw error on missing KV paris in model meta data * llama : improve printing + log meta data * llama : switch print order of meta data --------- Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com> * gguf : deduplicate (#2629) * gguf : better type names * dedup : CPU + Metal is working * ggml : fix warnings about unused results * llama.cpp : fix line feed and compiler warning * llama : fix strncpy warning + note token_to_str does not write null * llama : restore the original load/save session implementation Will migrate this to GGUF in the future * convert-llama-h5-to-gguf.py : support alt ctx param name * ggml : assert when using ggml_mul with non-F32 src1 * examples : dedup simple --------- Co-authored-by: klosax <131523366+klosax@users.noreply.github.com> * gguf.py : merge all files in gguf.py * convert-new.py : pick #2427 for HF 70B support * examples/gguf : no need to keep q option for quantization any more * llama.cpp : print actual model size * llama.cpp : use ggml_elements() * convert-new.py : output gguf (#2635) * convert-new.py : output gguf (WIP) * convert-new.py : add gguf key-value pairs * llama : add hparams.ctx_train + no longer print ftype * convert-new.py : minor fixes * convert-new.py : vocab-only option should 
work now * llama : fix tokenizer to use llama_char_to_byte * tests : add new ggml-vocab-llama.gguf * convert-new.py : tensor name mapping * convert-new.py : add map for skipping tensor serialization * convert-new.py : convert script now works * gguf.py : pick some of the refactoring from #2644 * convert-new.py : minor fixes * convert.py : update to support GGUF output * Revert "ci : disable CI temporary to not waste energy" This reverts commit `7e82d25f40`. * convert.py : n_head_kv optional and .gguf file extension * convert.py : better always have n_head_kv and default it to n_head * llama : sync with recent PRs on master * editorconfig : ignore models folder ggml-ci * ci : update ".bin" to ".gguf" extension ggml-ci * llama : fix llama_model_loader memory leak * gptneox : move as a WIP example * llama : fix lambda capture ggml-ci * ggml : fix bug in gguf_set_kv ggml-ci * common.h : .bin --> .gguf * quantize-stats.cpp : .bin --> .gguf * convert.py : fix HF tensor permuting / unpacking ggml-ci * llama.cpp : typo * llama : throw error if gguf fails to init from file ggml-ci * llama : fix tensor name grepping during quantization ggml-ci * gguf.py : write tensors in a single pass (#2644) * gguf : single pass for writing tensors + refactoring writer * gguf : single pass for writing tensors + refactoring writer * gguf : single pass for writing tensors + refactoring writer * gguf : style fixes in simple conversion script * gguf : refactor gptneox conversion script * gguf : rename h5 to hf (for HuggingFace) * gguf : refactor pth to gguf conversion script * gguf : rm file_type key and method * gguf.py : fix vertical alignment * gguf.py : indentation --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * convert-gptneox-hf-to-gguf.py : fixes * gguf.py : gptneox mapping * convert-llama-hf-to-gguf.py : fixes * convert-llama-7b-pth-to-gguf.py : fixes * ggml.h : reverse GGUF_MAGIC * gguf.py : reverse GGUF_MAGIC * test-tokenizer-0.cpp : fix warning * llama.cpp : print 
kv general.name * llama.cpp : get special token kv and linefeed token id * llama : print number of tensors per type + print arch + style * tests : update vocab file with new magic * editorconfig : fix whitespaces * llama : re-order functions * llama : remove C++ API + reorganize common source in /common dir * llama : minor API updates * llama : avoid hardcoded special tokens * llama : fix MPI build ggml-ci * llama : introduce enum llama_vocab_type + remove hardcoded string constants * convert-falcon-hf-to-gguf.py : falcon HF --> gguf conversion, not tested * falcon-main.cpp : falcon inference example * convert-falcon-hf-to-gguf.py : remove extra kv * convert-gptneox-hf-to-gguf.py : remove extra kv * convert-llama-7b-pth-to-gguf.py : remove extra kv * convert-llama-hf-to-gguf.py : remove extra kv * gguf.py : fix for falcon 40b * falcon-main.cpp : fix for falcon 40b * convert-falcon-hf-to-gguf.py : update ref * convert-falcon-hf-to-gguf.py : add tensor data layout * cmpnct_gpt2bpe.hpp : fixes * falcon-main.cpp : fixes * gptneox-main.cpp : fixes * cmpnct_gpt2bpe.hpp : remove non-general stuff * Update examples/server/README.md Co-authored-by: slaren <slarengh@gmail.com> * cmpnct_gpt2bpe.hpp : cleanup * convert-llama-hf-to-gguf.py : special tokens * convert-llama-7b-pth-to-gguf.py : special tokens * convert-permute-debug.py : permute debug print * convert-permute-debug-master.py : permute debug for master * convert-permute-debug.py : change permute type of attn_q * convert.py : 70b model working (change attn_q permute) * Delete convert-permute-debug-master.py * Delete convert-permute-debug.py * convert-llama-hf-to-gguf.py : fix attn_q permute * gguf.py : fix rope scale kv * convert-llama-hf-to-gguf.py : rope scale and added tokens * convert-llama-7b-pth-to-gguf.py : rope scale and added tokens * llama.cpp : use rope scale kv * convert-llama-7b-pth-to-gguf.py : rope scale fix * convert-llama-hf-to-gguf.py : rope scale fix * py : fix whitespace * gguf : add Python script 
to convert GGMLv3 LLaMA models to GGUF (#2682) * First pass at converting GGMLv3 LLaMA models to GGUF * Cleanups, better output during conversion * Fix vocab space conversion logic * More vocab conversion fixes * Add description to converted GGUF files * Improve help text, expand warning * Allow specifying name and description for output GGUF * Allow overriding vocab and hyperparams from original model metadata * Use correct params override var name * Fix wrong type size for Q8_K Better handling of original style metadata * Set default value for gguf add_tensor raw_shape KW arg * llama : improve token type support (#2668) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * Improved tokenizer test But does it work on MacOS? * Improve token type support - Added @klosax code to convert.py - Improved token type support in vocabulary * Exclude platform dependent tests * More sentencepiece compatibility by eliminating magic numbers * Restored accidentally removed comment * llama : add API for token type ggml-ci * tests : use new tokenizer type API (#2692) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * Improved tokenizer test But does it work on MacOS? * Improve token type support - Added @klosax code to convert.py - Improved token type support in vocabulary * Exclude platform dependent tests * More sentencepiece compatibility by eliminating magic numbers * Restored accidentally removed comment * Improve commentary * Use token type API in test-tokenizer-1.cpp * py : cosmetics * readme : add notice about new file format ggml-ci --------- Co-authored-by: M. 
Yusuf Sarıgöz <yusufsarigoz@gmail.com> Co-authored-by: klosax <131523366+klosax@users.noreply.github.com> Co-authored-by: goerch <jhr.walter@t-online.de> Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>	2023-08-21 23:07:43 +03:00
Adrian	2d8b76a110	Add link to clojure bindings to Readme. (#2659 )	2023-08-18 21:39:22 +02:00
Georgi Gerganov	7af633aec3	readme : incoming BREAKING CHANGE	2023-08-18 17:48:31 +03:00
mdrokz	eaf98c2649	readme : add link to Rust bindings (#2656 )	2023-08-18 13:17:58 +03:00
Johannes Gäßler	0992a7b8b1	README: fix LLAMA_CUDA_MMV_Y documentation (#2647 )	2023-08-17 23:57:59 +02:00
Henri Vasserman	6ddeefad9b	[Zig] Fixing Zig build and improvements (#2554 ) * Fix zig after console.o was split * Better include and flag management * Change LTO to option	2023-08-17 23:11:18 +03:00
Johannes Gäßler	25d43e0eb5	CUDA: tuned mul_mat_q kernels (#2546 )	2023-08-09 09:42:34 +02:00
ldwang	220d931864	readme : add Aquila-7B model series to supported models (#2487 ) * support bpe tokenizer in convert Signed-off-by: ldwang <ftgreat@gmail.com> * support bpe tokenizer in convert Signed-off-by: ldwang <ftgreat@gmail.com> * support bpe tokenizer in convert, fix Signed-off-by: ldwang <ftgreat@gmail.com> * Add Aquila-7B models in README.md Signed-off-by: ldwang <ftgreat@gmail.com> * Up Aquila-7B models in README.md Signed-off-by: ldwang <ftgreat@gmail.com> --------- Signed-off-by: ldwang <ftgreat@gmail.com> Co-authored-by: ldwang <ftgreat@gmail.com>	2023-08-02 11:21:11 +03:00
Yiming Cui	a312193e18	readme : Add Chinese LLaMA-2 / Alpaca-2 to supported models (#2475 ) * add support for chinese llama-2 / alpaca-2 * remove white spaces	2023-08-02 09:18:31 +03:00
Johannes Gäßler	0728c5a8b9	CUDA: mmq CLI option, fixed mmq build issues (#2453 )	2023-07-31 15:44:35 +02:00
Johannes Gäßler	11f3ca06b8	CUDA: Quantized matrix matrix multiplication (#2160 ) * mmq implementation for non k-quants * q6_K * q2_K * q3_k * q4_K * vdr * q5_K * faster q8_1 loading * loop unrolling * add __restrict__ * q2_K sc_high * GGML_CUDA_MMQ_Y * Updated Makefile * Update Makefile * DMMV_F16 -> F16 * Updated README, CMakeLists * Fix CMakeLists.txt * Fix CMakeLists.txt * Fix multi GPU out-of-bounds	2023-07-29 23:04:44 +02:00
niansa/tuxifan	edcc7ae7d2	Obtaining LLaMA 2 instructions (#2308 ) * Obtaining LLaMA 2 instructions * Removed sharing warning for LLaMA 2 * Linked TheBloke's GGML repos * Add LLaMA 2 to list of supported models * Added LLaMA 2 usage instructions * Added links to LLaMA 2 70B models	2023-07-28 03:14:11 +02:00
Johannes Gäßler	70d26ac388	Fix __dp4a documentation (#2348 )	2023-07-23 17:49:06 +02:00
Jose Maldonado	91171b8072	make : fix CLBLAST compile support in FreeBSD (#2331 ) * Fix Makefile for CLBLAST compile support and instructions for compiling llama.cpp on FreeBSD * More general use-case for CLBLAST support (Linux and FreeBSD)	2023-07-23 14:52:08 +03:00
wzy	78a3d13424	flake : remove intel mkl from flake.nix due to missing files (#2277 ) NixOS's mkl misses some libraries like mkl-sdl.pc. See #2261 Currently NixOS doesn't have intel C compiler (icx, icpx). See https://discourse.nixos.org/t/packaging-intel-math-kernel-libraries-mkl/975 So remove it from flake.nix Some minor changes: - Change pkgs.python310 to pkgs.python3 to keep latest - Add pkgconfig to devShells.default - Remove installPhase because we have `cmake --install` from #2256	2023-07-21 13:26:34 +03:00
wzy	45a1b07e9b	flake : update flake.nix (#2270 ) When `isx86_32 \|\| isx86_64`, it will use mkl, else openblas According to https://discourse.nixos.org/t/rpath-of-binary-contains-a-forbidden-reference-to-build/12200/3, add -DCMAKE_SKIP_BUILD_RPATH=ON Fix #2261, Nix doesn't provide mkl-sdl.pc. When we build with -DBUILD_SHARED_LIBS=ON, -DLLAMA_BLAS_VENDOR=Intel10_lp64 replace mkl-sdl.pc by mkl-dynamic-lp64-iomp.pc	2023-07-19 10:01:55 +03:00
Jiří Podivín	27ab66e437	py : turn verify-checksum-models.py into executable (#2245 ) README.md was adjusted to reflect the change. Signed-off-by: Jiri Podivin <jpodivin@gmail.com>	2023-07-16 22:54:47 +03:00
Chad Brewbaker	917831c63a	readme : fix zig build instructions (#2171 )	2023-07-11 19:03:06 +03:00
Evan Miller	5656d10599	mpi : add support for distributed inference via MPI (#2099 ) * MPI support, first cut * fix warnings, update README * fixes * wrap includes * PR comments * Update CMakeLists.txt * Add GH workflow, fix test * Add info to README * mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099) * mpi : add names for layer inputs + prep ggml_mpi_graph_compute() * mpi : move all MPI logic into ggml-mpi Not tested yet * mpi : various fixes - communication now works but results are wrong * mpi : fix output tensor after MPI compute (still not working) * mpi : fix inference * mpi : minor * Add OpenMPI to GH action * [mpi] continue-on-error: true * mpi : fix after master merge * [mpi] Link MPI C++ libraries to fix OpenMPI * tests : fix new llama_backend API * [mpi] use MPI_INT32_T * mpi : factor out recv / send in functions and reuse * mpi : extend API to allow usage with outer backends (e.g. Metal) --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-10 18:49:56 +03:00
JackJollimore	18780e0a5e	readme : update Termux instructions (#2147 ) The file pathing is significant when running models inside of Termux on Android devices. llama.cpp performance is improved with loading a .bin from the $HOME directory.	2023-07-09 11:20:43 +03:00
rankaiyx	2492a53fd0	readme : add more docs indexes (#2127 ) * Update README.md to add more docs indexes * Update README.md to add more docs indexes	2023-07-09 10:38:42 +03:00
dylan	84525e7962	docker : add support for CUDA in docker (#1461 ) Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-07 21:25:25 +03:00
Judd	36680f6e40	convert : update for baichuan (#2081 ) 1. guess n_layers; 2. relax warnings on context size; 3. add a note that its derivations are also supported. Co-authored-by: Judd <foldl@boxvest.com>	2023-07-06 19:23:49 +03:00
Johannes Gäßler	924dd22fd3	Quantized dot products for CUDA mul mat vec (#2067 )	2023-07-05 14:19:42 +02:00
Georgi Gerganov	b472f3fca5	readme : add link web chat PR	2023-07-04 22:25:22 +03:00
Judd	471aab6e4c	convert : add support of baichuan-7b (#2055 ) Co-authored-by: Judd <foldl@boxvest.com>	2023-07-01 20:00:25 +03:00
Roman Parykin	d38e451578	readme : add Scala 3 bindings repo (#2010 )	2023-06-26 22:47:59 +03:00
Gustavo Rocha Dias	aa777abbb7	readme : LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux (#2007 ) * docs - Alternative way to build at Android, with CLBlast. * doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux. * doc- fix typo	2023-06-26 22:34:45 +03:00
Georgi Gerganov	412c60e473	readme : add link to new k-quants for visibility	2023-06-26 19:45:09 +03:00
Georgi Gerganov	447ccbe8c3	readme : add new roadmap + manifesto	2023-06-25 16:08:12 +03:00
Georgi Gerganov	66a2555ba6	readme : add Azure CI discussion link	2023-06-25 09:07:03 +03:00
Georgi Gerganov	11da1a85cd	readme : fix whitespaces	2023-06-24 13:38:18 +03:00
Alberto	235b610d65	readme : fixed termux instructions (#1973 )	2023-06-24 13:32:13 +03:00
eiery	d7b7484f74	Add OpenLLaMA instructions to the README (#1954 ) * add openllama to readme	2023-06-23 10:38:01 +02:00
Rahul Vivek Nair	fb98254f99	Fix typo in README.md (#1961 )	2023-06-21 23:48:43 +02:00
Georgi Gerganov	049aa16b8c	readme : add link to p1	2023-06-20 19:05:54 +03:00
Xiake Sun	2322ec223a	Fix typo (#1949 )	2023-06-20 15:42:40 +03:00
Johannes Gäßler	16b9cd1939	Convert vector to f16 for dequantize mul mat vec (#1913 ) * Convert vector to f16 for dmmv * compile option * Added compilation option description to README * Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"	2023-06-19 10:23:56 +02:00
Mike	e1886cf4fe	readme : update Android build instructions (#1922 ) Add steps for using termux on android devices to prevent common errors.	2023-06-18 11:28:26 +03:00
Johannes Gäßler	2c9380dd2f	Only one CUDA stream per device for async compute (#1898 )	2023-06-17 19:15:02 +02:00
Gustavo Rocha Dias	bac19927c3	readme : alternative way to build for Android with CLBlast. (#1828 )	2023-06-17 12:01:06 +03:00
Aisuko	059e99066d	doc : fix wrong address of BLIS.md (#1772 ) Signed-off-by: Aisuko <urakiny@gmail.com>	2023-06-10 17:08:11 +03:00
Georgi Gerganov	4dc62c545d	readme : add June roadmap	2023-06-07 07:15:08 +03:00
Yuval Peled	f4c55d3bd7	docs : add performance troubleshoot + example benchmark documentation (#1674 ) * test anchor link * test table * add benchmarks * Add performance troubleshoot & benchmark * add benchmarks * remove unneeded line --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-05 23:32:36 +03:00
Foul-Tarnished	f1465624c2	readme : fix typo (#1700 ) Fix a typo in a command in README.md	2023-06-05 23:28:37 +03:00
Georgi Gerganov	827f5eda91	readme : update hot topics	2023-06-04 23:38:19 +03:00
Georgi Gerganov	ecb217db4f	llama : Metal inference (#1642 ) * mtl : export the LLaMA computation graph * ci : disable temporary * mtl : adapt the MNIST example as starter * mtl : no need for mtl-export tool, add cli arg for main instead * mtl : export just a small part of the graph for now to make it easier * mtl : move MSL code into separate file for easy editing * mtl : initial get_rows_q4_0 kernel * mtl : confirmed get_rows_q4_0 is working correctly * mtl : add rms_norm kernel + confirm working * mtl : add mul kernel + confirm working * mtl : initial mul_mat Q4 kernel (wrong results) * mtl : mul_mat fixes (still wrong) * mtl : another mul_mat Q4 (still does not work) * mtl : working mul_mat q4 * ggml : fix handling of "view" ops in ggml_graph_import() * mtl : add rope kernel * mtl : add reshape and transpose handling * ggml : store offset as opt arg for ggml_view_xd() operators * mtl : add cpy kernel + handle view ops * mtl : confirm f16 x f32 attention mul mat * mtl : add scale kernel * mtl : add diag_mask_inf kernel * mtl : fix soft_max kernel * ggml : update ggml_nbytes() to handle non-contiguous tensors * mtl : verify V tensor contents * mtl : add f32 -> f32 cpy kernel * mtl : add silu kernel * mtl : add non-broadcast mul kernel * mtl : full GPU inference of the computation graph * mtl : optimize rms_norm and soft_max kernels * mtl : add f16 mat x f32 vec multiplication kernel * mtl : fix bug in f16 x f32 mul mat + speed-up computation * mtl : faster mul_mat_q4_0_f32 kernel * mtl : fix kernel signature + roll inner loop * mtl : more threads for rms_norm + better timing * mtl : remove printfs from inner loop * mtl : simplify implementation * mtl : add save/load vocab to ggml file * mtl : plug Metal inference into llama.cpp (very quick-n-dirty) * mtl : make it work with main example Lots of hacks but at least now it generates text * mtl : preparing for merge * mtl : clean-up ggml mtl interface + support scratch / inplace * mtl : remove temp / debug code * metal
: final refactoring and simplification * Revert "ci : disable temporary" This reverts commit `98c267fc77`. * metal : add comments * metal : clean-up stuff, fix typos * readme : add Metal instructions * readme : add example for main	2023-06-04 23:34:30 +03:00
Henri Vasserman	d8bd0013e8	Add info about CUDA_VISIBLE_DEVICES (#1682 )	2023-06-03 16:35:20 +03:00
Henri Vasserman	97c9b77c4f	Add documentation about CLBlast (#1604 ) Installing, compiling and using.	2023-05-27 18:47:55 +03:00
Evan Jones	c31bbe934b	readme : add docs for chat-persistent.sh (#1568 ) * readme : add docs for chat-persistent.sh * Update README.md	2023-05-24 09:24:01 +03:00
Zenix	b8ee340abe	feature : support blis and other blas implementation (#1536 ) * feature: add blis support * feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927 * fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake * Fix typo in INTEGER Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Fix: blas changes on ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-05-20 17:58:31 +03:00
Georgi Gerganov	ea600071cb	Revert "feature : add blis and other BLAS implementation support (#1502 )" This reverts commit `07e9ace0f9`.	2023-05-20 12:03:48 +03:00
Zenix	07e9ace0f9	feature : add blis and other BLAS implementation support (#1502 ) * feature: add blis support * feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927 * fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake * Fix typo in INTEGER Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-05-20 12:02:48 +03:00
Georgi Gerganov	2d5db48371	ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508 ) * ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0 * llama : bump LLAMA_FILE_VERSION to 3 * cuda : update Q4 and Q8 dequantize kernels * ggml : fix AVX dot products * readme : update performance table + hot topics	2023-05-19 22:17:18 +03:00
David Kennedy	79e3efb0e9	readme : adds WizardLM to the list of supported models (#1485 )	2023-05-19 20:16:30 +03:00
Georgi Gerganov	cdd5350892	readme : update Q4_0 perplexities I think these were affected by the removal of the `round` during quantization	2023-05-13 09:12:44 +03:00
Rinne	089b1c93ba	readme : add C#/.NET bindings repo (#1409 )	2023-05-12 08:39:40 +03:00
Georgi Gerganov	b9fd7eee57	ggml : remove bit shuffling (#1405 ) * ggml : remove Q4_0 bit shuffling (ARM NEON) * ggml : remove Q4_1 bit shuffling (ARM NEON + reference) * ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON) * ggml : remove Q4_2 bit shuffling (WIP, BROKEN) * ggml : remove Q5_0 bit shuffling (ARM NEON) * ggml : 2x faster scalar implementations * ggml : remove Q5_1 bit shuffling (ARM NEON + scalar) * ggml : simplify scalar dot * ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit * ggml : fix Q4_1 quantization * ggml : update cuBLAS + normalize variable names * ggml : remove Q4_2 mode * ggml : minor formatting * ggml : fix Q5_0 quantization * scripts : add script for measuring the time per token * AVX implementations (#1370) * ggml : uniform 5th bit extraction * llama : produce error upon loading old model files * llama : fix model magic/version write * ggml : speed-up Q5_0 + Q5_1 at 4 threads * ggml : preserve old Q4 and Q5 formats * ggml : simplify Q8_1 - no need for low / high sums anymore * ggml : fix Q8_0 and Q8_1 rounding * Revert "AVX implementations (#1370)" This reverts commit `948d124837`. * ggml : fix AVX2 implementation * sha : update hashes for 7B and 13B * readme : update timings + remove warning banner * llama : update v2 PR number to 1405 * ggml : fix WASM comments * ggml : back to original bit order * readme : add note that Q4 and Q5 have been changed * llama : fix return for unknown version --------- Co-authored-by: Stephan Walter <stephan@walter.name>	2023-05-12 00:23:08 +03:00
Georgi Gerganov	56551bc11f	readme : add notice about upcoming breaking change	2023-05-08 22:52:18 +03:00
AlpinDale	fe60904eef	readme : add TOC and Pygmalion instructions (#1359 )	2023-05-08 19:33:30 +03:00
Georgi Gerganov	f9a6364912	llama : require first token to be BOS (#1303 ) * llama : require first token to be BOS * scripts : add ppl-run-all.sh * perplexity : add BOS for each chunk * readme : update perplexity values after BOS fix * perplexity : add clarifying comments	2023-05-08 17:41:54 +03:00
Johannes Gäßler	1f48b0abcf	Documented CUDA reproducibility, added warning (#1346 )	2023-05-08 02:42:01 +02:00
DaniAndTheWeb	173d0e6419	makefile: automatic Arch Linux detection (#1332 ) This commit is a port of a detection method used in koboldcpp's Makefile in order to automatically set the -lcblas option on Arch Linux	2023-05-05 23:57:14 +02:00
Pavol Rusnak	921dcee00a	readme: add missing info (#1324 )	2023-05-05 16:43:36 +02:00
44670	360cfe5bec	readme : add OpenBuddy link (#1321 )	2023-05-04 19:33:31 +03:00
Georgi Gerganov	bca9ad938a	minor : fix whitespaces (#1302 )	2023-05-03 20:09:42 +03:00
KASR	b0c71c7b6d	scripts : platform independent script to verify sha256 checksums (#1203 ) * python script to verify the checksum of the llama models Added Python script for verifying SHA256 checksums of files in a directory, which can run on multiple platforms. Improved the formatting of the output results for better readability. * Update README.md update to the readme for improved readability and to explain the usage of the python checksum verification script * update the verification script I've extended the script based on suggestions by @prusnak The script now checks the available RAM, if there is enough to check the file at once, it will do so. If not the file is read in chunks. * minor improvement small change so that the available ram is checked and not the total ram * remove the part of the code that reads the file at once if enough ram is available based on suggestions from @prusnak i removed the part of the code that checks whether the user had enough ram to read the entire model at once. the file is now always read in chunks. * Update verify-checksum-models.py quick fix to pass the git check	2023-05-03 18:31:28 +03:00
Stephan Walter	36d19a603b	Remove Q4_3 which is no better than Q5 (#1218 )	2023-04-28 23:10:43 +00:00
Georgi Gerganov	7f15c5c477	readme : update hot topics	2023-04-28 21:32:52 +03:00
Folko-Ven	78ec543733	Correcting link to w64devkit (#1214 ) Correcting link to w64devkit (change seeto to skeeto).	2023-04-28 16:22:48 +02:00
Georgi Gerganov	f9be42add0	readme : add quantization info	2023-04-26 23:24:42 +03:00
DaniAndTheWeb	ea3ad7eb60	Updating build instructions to include BLAS support (#1183 ) * Updated build information First update to the build instructions to include BLAS. * Update README.md * Update information about BLAS * Better BLAS explanation Adding a clearer BLAS explanation and adding a link to download the CUDA toolkit. * Better BLAS explanation * BLAS for Mac Specifying that BLAS is already supported on Macs using the Accelerate Framework. * Clarify the effect of BLAS * Windows Make instructions Added the instructions to build with Make on Windows * Fixing typo * Fix trailing whitespace	2023-04-26 22:03:03 +02:00
Pavol Rusnak	859fee6dfb	quantize : use `map` to assign quantization type from `string` (#1191 ) instead of `int` (while `int` option still being supported) This allows the following usage: `./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0` instead of: `./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`	2023-04-26 18:43:27 +02:00
mgroeber9110	9b0a4d4214	examples/main README improvements and some light refactoring (#1131 )	2023-04-24 15:45:32 +00:00
Pavol Rusnak	c6524f46eb	readme : update gpt4all instructions (#980 )	2023-04-23 10:21:26 +02:00
CRD716	834695fe3a	Minor: Readme fixed grammar, spelling, and misc updates (#1071 )	2023-04-19 19:52:14 +00:00
Georgi Gerganov	7cd5c4a3e9	readme : add warning about Q4_2 and Q4_3	2023-04-19 19:07:54 +03:00
Georgi Gerganov	7faa7460f0	readme : update hot topics about new LoRA functionality	2023-04-18 20:10:26 +03:00
Atsushi Tatsuma	e9298af389	readme : add Ruby bindings (#1029 )	2023-04-17 22:34:35 +03:00
comex	723dac55fa	py : new conversion script (#545 ) Current status: Working, except for the latest GPTQ-for-LLaMa format that includes `g_idx`. This turns out to require changes to GGML, so for now it only works if you use the `--outtype` option to dequantize it back to f16 (which is pointless except for debugging). I also included some cleanup for the C++ code. This script is meant to replace all the existing conversion scripts (including the ones that convert from older GGML formats), while also adding support for some new formats. Specifically, I've tested with: - [x] `LLaMA` (original) - [x] `llama-65b-4bit` - [x] `alpaca-native` - [x] `alpaca-native-4bit` - [x] LLaMA converted to 'transformers' format using `convert_llama_weights_to_hf.py` - [x] `alpaca-native` quantized with `--true-sequential --act-order --groupsize 128` (dequantized only) - [x] same as above plus `--save_safetensors` - [x] GPT4All - [x] stock unversioned ggml - [x] ggmh There's enough overlap in the logic needed to handle these different cases that it seemed best to move to a single script. I haven't tried this with Alpaca-LoRA because I don't know where to find it. Useful features: - Uses multiple threads for a speedup in some cases (though the Python GIL limits the gain, and sometimes it's disk-bound anyway). - Combines split models into a single file (both the intra-tensor split of the original and the inter-tensor split of 'transformers' format files). Single files are more convenient to work with and more friendly to future changes to use memory mapping on the C++ side. To accomplish this without increasing memory requirements, it has some custom loading code which avoids loading whole input files into memory at once. - Because of the custom loading code, it no longer depends on PyTorch, which might make installing dependencies slightly easier or faster... although it still depends on NumPy and sentencepiece, so I don't know if there's any meaningful difference. 
In any case, I also added a requirements.txt file to lock the dependency versions in case of any future breaking changes. - Type annotations checked with mypy. - Some attempts to be extra user-friendly: - The script tries to be forgiving with arguments, e.g. you can specify either the model file itself or the directory containing it. - The script doesn't depend on config.json / params.json, just in case the user downloaded files individually and doesn't have those handy. But you still need tokenizer.model and, for Alpaca, added_tokens.json. - The script tries to give a helpful error message if added_tokens.json is missing.	2023-04-14 10:03:03 +03:00
CRD716	ec29272175	readme : remove python 3.10 warning (#929 )	2023-04-13 16:59:53 +03:00
Genkagaku.GPT	7e941b95eb	readme : llama node binding (#911 ) * chore: add nodejs binding * chore: add nodejs binding	2023-04-13 16:54:27 +03:00
Judd	4579af95e8	zig : update build.zig (#872 ) * update * update readme * minimize the changes. --------- Co-authored-by: zjli2019 <zhengji.li@ingchips.com>	2023-04-13 16:43:22 +03:00
Georgi Gerganov	f76cb3a34d	readme : change "GPU support" link to discussion	2023-04-12 14:48:57 +03:00
Georgi Gerganov	782438070f	readme : update hot topics with link to "GPU support" issue	2023-04-12 14:31:12 +03:00
Nicolai Weitkemper	4dbbd40750	readme: link to sha256sums file (#902 ) This is to emphasize that these do not need to be obtained from elsewhere.	2023-04-12 08:46:20 +02:00
Pavol Rusnak	8b679987cd	Fix whitespace, add .editorconfig, add GitHub workflow (#883 )	2023-04-11 19:45:44 +00:00
qouoq	a0caa34b16	Add BAIR's Koala to supported models (#877 )	2023-04-10 22:41:53 +02:00
Pavol Rusnak	d2beca95dc	Make docker instructions more explicit (#785 )	2023-04-06 08:56:58 +02:00
Georgi Gerganov	3416298929	Update README.md	2023-04-05 19:54:30 +03:00
Georgi Gerganov	8d10406d6e	readme : change logo + add bindings + add uis + add wiki	2023-04-05 18:56:20 +03:00
Adithya Balaji	594cc95fab	readme : update with CMake and windows example (#748 ) * README: Update with CMake and windows example * README: update with code-review for cmake build	2023-04-05 17:36:12 +03:00
Thatcher Chamberlin	d8d4e865cd	Add a missing step to the gpt4all instructions (#690 ) `migrate-ggml-2023-03-30-pr613.py` is needed to get gpt4all running.	2023-04-02 12:48:57 +02:00
rimoliga	d0a7f742e7	readme: replace termux links with homepage, play store is deprecated (#680 )	2023-04-01 16:57:30 +02:00
Pavol Rusnak	9733104be5	drop quantize.py (now that models are using a single file)	2023-03-31 01:07:32 +02:00
Georgi Gerganov	3df890aef4	readme : update supported models	2023-03-30 22:31:54 +03:00
Georgi Gerganov	b467702b87	readme : fix typos	2023-03-29 19:38:31 +03:00
Georgi Gerganov	516d88e75c	readme : add GPT4All instructions (close #588 )	2023-03-29 19:37:20 +03:00
Stephan Walter	b391579db9	Update README and comments for standalone perplexity tool (#525 )	2023-03-26 16:14:01 +03:00
Georgi Gerganov	348d6926ee	Add logo to README.md	2023-03-26 10:20:49 +03:00
Georgi Gerganov	55ad42af84	Move chat scripts into "./examples"	2023-03-25 20:37:09 +02:00
Georgi Gerganov	4a7129acd2	Remove obsolete information from README	2023-03-25 16:30:32 +02:00
Gary Mulder	f4f5362edb	Update README.md (#444 ) Added explicit bolded instructions clarifying that people need to request access to models from Facebook and never through this repo.	2023-03-24 15:23:09 +00:00
Georgi Gerganov	b6b268d441	Add link to Roadmap discussion	2023-03-24 09:13:35 +02:00
Stephan Walter	a50e39c6fe	Revert "Delete SHA256SUMS for now" (#429 ) * Revert "Delete SHA256SUMS for now (#416)" This reverts commit `8eea5ae0e5`. * Remove ggml files until they can be verified * Remove alpaca json * Add also model/tokenizer.model to SHA256SUMS + update README --------- Co-authored-by: Pavol Rusnak <pavol@rusnak.io>	2023-03-23 15:15:48 +01:00
Gary Mulder	8a3e5ef801	Move model section from issue template to README.md (#421 ) * Update custom.md * Removed Model section as it is better placed in README.md * Updates to README.md model section * Inserted text that was removed from issue template about obtaining models from FB and links to papers describing the various models * Removed IPF down links for the Alpaca 7B models as these look to be in the old data format and probably shouldn't be directly linked to, anyway * Updated the perplexity section to point at Perplexity scores #406 discussion	2023-03-23 11:30:40 +00:00
Georgi Gerganov	93208cfb92	Adjust repetition penalty ..	2023-03-23 10:46:58 +02:00

407 Commits