Commit Graph

  • 6b83a3e16f try to make CL run w/o tuning, but -ngl gets stuck with no output; had to add a task runner and profile id, many changes, see the code mqy 2023-06-16 20:32:12 +0800
  • 5342dc075f tuning: support k_quants; disabled rope shapes (workaround); made cache thread-safe; fixed shape comparison mqy 2023-06-15 21:34:34 +0800
  • 21e9379707 tuning: add f16; TODO: f32 failed with CL mqy 2023-06-15 15:57:31 +0800
  • 7c05049f8b tuning: check GPU offloading before loading the model mqy 2023-06-15 14:06:11 +0800
  • bb590f1482 Workaround to set node->backend mqy 2023-06-15 08:28:39 +0800
  • 9106232260 threading test: on GitHub, Windows can take more than 20 seconds to start 15 threads. Silently ignore the result when two adjacent runs are slow (see the threading sketch after this list). mqy 2023-06-15 07:19:00 +0800
  • 48016f685c bulk-refactored task profiles to support complete fallback; enabled tuning by default for ease of development mqy 2023-06-15 06:43:08 +0800
  • 1b041d7737 threading test: improve readability of both the code and the output mqy 2023-06-14 21:17:14 +0800
  • 213f133701 initial mqy 2023-06-14 18:33:14 +0800
  • ce2c7d72e2 metal : handle buffers larger than device's maxBufferLength (#1826) master-ce2c7d7 Georgi Gerganov 2023-06-18 09:09:47 +0300
  • adaad10e97 lower synchronization overhead zrm 2023-06-18 02:03:41 -0400
  • 670390f915 typo fix when calculating the memory address for dest_t in ggml_compute_forward_add_q_f32 (see the stride sketch after this list) l3utterfly 2023-06-18 13:31:49 +0800
  • 57cd69460f cmake : add CUDA_ARCHITECTURES to new target ggml_static (#1917) master-57cd694 Howard Su 2023-06-18 12:29:47 +0800
  • 80f654631e Update README.md John 2023-06-18 05:57:19 +0200
  • 76b41830b9 Added CUDA integration from JohannesGaessler's branch - disabled offload of non-layer tensors for now (not working yet) - corrected the tensor size calculation for VRAM - added more detailed VRAM reporting John 2023-06-18 05:46:12 +0200
  • 18e6221ade rewrite ternary expressions Evan Jones 2023-06-17 21:58:17 -0400
  • 4eaf4b6bbb Add CUDA_ARCHITECTURES to new target ggml_static Howard Su 2023-06-18 09:51:48 +0800
  • ce95172904 k_quants : add AVX support katsu560 2023-06-18 09:45:18 +0900
  • 5ecd645bce minor verbose messages John 2023-06-18 02:10:26 +0200
  • a8bb0fe358 WIP full GPU acceleration JohannesGaessler 2023-06-18 00:11:20 +0200
  • 74b01eff55 Merge branch 'master' into x0rsh1ft X0RSH1FT 2023-06-17 17:51:52 -0400
  • 31a9fada9c Added Windows PowerShell and BAT scripts for building and testing the llama executable. Also includes an environment-configuration PowerShell script that can be used to adjust build and test parameters. X0RSH1FT 2023-06-17 17:44:37 -0400
  • 67e229b7ca Merge 'origin/master' into hipblas Henri Vasserman 2023-06-18 00:36:54 +0300
  • 8f81cab1bc WIP full GPU acceleration JohannesGaessler 2023-06-17 23:04:31 +0200
  • 0c916d2357 Offload weights JohannesGaessler 2023-06-17 22:27:55 +0200
  • f75125615a Update README.md John 2023-06-17 18:57:40 +0200
  • 2797754843 Update README.md John 2023-06-17 16:51:34 +0200
  • f9118b0ca5 Update README.md John 2023-06-17 16:42:23 +0200
  • 6ae8567a30 Update README.md John 2023-06-17 16:23:40 +0200
  • 9d4d26554a Update README.md John 2023-06-17 16:23:01 +0200
  • d0c460629d Update README.md John 2023-06-17 16:20:02 +0200
  • ab509ad9e2 added the tensor size calculation routines John 2023-06-17 16:40:57 +0200
  • ea70881941 Made option --memory-f32 enabled by default, since ggml_repeat2 currently only has an F32 implementation. Made memory allocation for ctx and KV memory accurate. Moved model.memory_k/model.memory_v to kv_self.k/kv_self.v and their initialization into kv_cache_init, to be more like llama.cpp (see the KV-cache sketch after this list). Jan Ploski 2023-06-17 04:48:40 +0200
  • c3e9c88d71 Fixed segfault during context swap introduced by commit 3d6ed185 Jan Ploski 2023-06-17 04:00:07 +0200
  • 5ec0d12652 Correction to 4a37251a: since we did not insert the BOS token, we do not need to attempt to rescue it during context swap Jan Ploski 2023-06-16 19:53:38 +0200
  • db0083f7b7 Fixed the BOS/EOS token (both are 11 according to config.json of Falcon-7B/40B). Also: do not auto-insert a space or BOS/EOS at the beginning of the prompt (this seems to be LLaMA-specific). Jan Ploski 2023-06-16 19:36:27 +0200
  • ed4ad057b2 Went back to the original size calculation for now, though it appears not to matter. John 2023-06-16 20:20:30 +0200
  • fee7da163b Work in progress. Added falcon main and library based on llama.cpp. CPU inference works (getting ~260 ms/token on 16-bit Falcon-7B). Tested with 7B 16-bit and the two Shakespeare models (both in 16-bit precision only). John 2023-06-16 16:31:02 +0200
  • cbb31807a3 Update README.md John 2023-06-17 21:34:24 +0200
  • b71dfe637f the recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement (see the NUMA sketch after this list) zrm 2023-06-17 15:11:05 -0400
  • bf83dcb279 make --numa a param zrm 2023-06-17 15:03:14 -0400
  • b2416493ab make : do not print help for simple example master-b241649 Georgi Gerganov 2023-06-17 20:55:03 +0300
  • 4adb6f1441 metal : minimize view overlap to try to utilize device memory better Georgi Gerganov 2023-06-17 20:45:31 +0300
  • 961eee6968 Merge branch 'master' into fix-metal-size Georgi Gerganov 2023-06-17 20:45:26 +0300
  • 4f9c43e3bd minor : warning fixes master-4f9c43e Georgi Gerganov 2023-06-17 20:24:11 +0300
  • 2c9380dd2f Only one CUDA stream per device for async compute (#1898) master-2c9380d Johannes Gäßler 2023-06-17 19:15:02 +0200
  • f89c7592eb Update README.md John 2023-06-17 18:57:40 +0200
  • 051e1b0e6a llama : fix kv_cache n init (close #1903) master-051e1b0 Georgi Gerganov 2023-06-17 19:30:22 +0300
  • 86c7571864 make : update for latest Arch (#1701) master-86c7571 DaniAndTheWeb 2023-06-17 18:17:22 +0200
  • 8a93a05a84 Only one CUDA stream per device for async compute JohannesGaessler 2023-06-16 18:31:52 +0200
  • 3d59ec5935 ggml : fix warnings under MSVC (#1908) master-3d59ec5 Howard Su 2023-06-17 23:46:15 +0800
  • dc3472eb58 Merge branch 'master' into concedo_experimental Concedo 2023-06-17 23:10:05 +0800
  • dbd11ddd60 up ver Concedo 2023-06-17 23:08:14 +0800
  • c72bc02695 Update README.md John 2023-06-17 16:51:34 +0200
  • 6e137abe56 Update README.md John 2023-06-17 16:42:23 +0200
  • abc77a7496 Merge branch 'master' of https://github.com/cmp-nct/ggllm.cpp John 2023-06-17 16:41:08 +0200
  • 588ca709fb added the tensor size calculation routines John 2023-06-17 16:40:57 +0200
  • 0711a5f6dc metal : add norm, cpy f16->f16, alibi kernels (#1823) Aaron Miller 2023-06-17 07:37:49 -0700
  • 8bc4143e14 Merge branch 'concedo' into concedo_experimental Concedo 2023-06-17 22:29:38 +0800
  • 7c5f607287 Update README.md John 2023-06-17 16:23:40 +0200
  • d4b9423560 Update README.md John 2023-06-17 16:23:01 +0200
  • 0ed97e529f Update README.md John 2023-06-17 16:20:02 +0200
  • 6f7c15637a Merge 'origin/master' into hipblas Henri Vasserman 2023-06-17 16:53:22 +0300
  • 46490c7ad7 Merge remote-tracking branch 'origin/master' into embd_inp ningshanwutuobang 2023-06-17 21:42:56 +0800
  • dd3d346f7a Merge branch 'master' of https://github.com/cmp-nct/ggllm.cpp John 2023-06-17 14:39:28 +0200
  • fc45a81bc6 exposed modules so that they can be invoked by nix run github:ggerganov/llama.cpp#server etc (#1863) Faez Shakil 2023-06-17 17:13:05 +0500
  • 9f8e2f8a18 Merge branch 'master' into concedo_experimental Concedo 2023-06-17 20:02:32 +0800
  • 795b35546b updated lite Concedo 2023-06-17 19:57:09 +0800
  • 971fe9f007 add tokens per second output (#246) YellowRoseCx 2023-06-17 06:54:29 -0500
  • 794db3e7b9 Server Example Refactor and Improvements (#1570) master-794db3e Randall Fitzgerald 2023-06-17 07:53:04 -0400
  • cd5da5ebd1 Fix warnings under MSVC Howard Su 2023-06-17 19:37:45 +0800
  • 5ddf7ea1fb hooks : setting up flake8 and pre-commit hooks (#1681) Jiří Podivín 2023-06-17 12:32:48 +0200
  • bac19927c3 readme : alternative way to build for Android with CLBlast. (#1828) Gustavo Rocha Dias 2023-06-17 06:01:06 -0300
  • 93c57a0571 add README for llava.py ningshanwutuobang 2023-06-17 16:43:36 +0800
  • 4f1aa3cc76 add README for llava.py ningshanwutuobang 2023-06-17 16:41:37 +0800
  • 1b4b93a227 Merge branch 'ggerganov:master' into master Randall Fitzgerald 2023-06-17 04:21:46 -0400
  • b4c6f46f17 Allow cmake to build ggml as a library (#1896) master-b4c6f46 Kerfuffle 2023-06-17 01:49:42 -0600
  • 92f20d9942 train : get raw text instead of page with html (#1905) David Yang 2023-06-17 14:51:54 +0800
  • 2a2f39ef45 #1869 Fix null reference errors when training from scratch with CUDA build Robyn 2023-06-17 15:55:57 +1000
  • 583daaee43 Made option --memory-f32 enabled by default, since ggml_repeat2 currently only has an F32 implementation. Made memory allocation for ctx and KV memory accurate. Moved model.memory_k/model.memory_v to kv_self.k/kv_self.v and their initialization into kv_cache_init (to be more like llama.cpp). Jan Ploski 2023-06-17 04:48:40 +0200
  • 04bd2e408e Fixed segfault during context swap introduced by commit 3d6ed185 Jan Ploski 2023-06-17 04:00:07 +0200
  • 3440e30cb5 Get raw text instead of page with html David Yang 2023-06-17 08:15:05 +0800
  • 5f2c9ce21e Code refactor and optimization using reserve (see the reserve sketch after this list) German Semenov 2023-06-17 02:33:57 +0300
  • 8ec0d382d4 Fix uninitialized var causing crash on Windows using MSVC. DAN™ 2023-06-16 19:16:30 -0400
  • 200892a3a5 Pass pointer to params in llama_init_from_file mudler 2023-06-16 23:43:36 +0200
  • 274f3782a4 metal kernels: add norm, cpy f16->f16, alibi Aaron Miller 2023-06-11 10:10:53 -0700
  • a482750590 Build lib versions of ggml separately KerfuffleV2 2023-06-16 14:19:22 -0600
  • d411968e99 opencl : support k-quants (#1836) master-d411968 0cc4m 2023-06-16 20:59:49 +0200
  • b41b4cad6f examples : add "simple" (#1840) master-b41b4ca SuperUserNameMan 2023-06-16 20:58:09 +0200
  • 6804100fe2 Merge branch 'master' into minimalist_example Georgi Gerganov 2023-06-16 21:57:40 +0300
  • 13fe9d2d84 cmake : add auto detection of BLAS_INCLUDE_DIRS (#1886) master-13fe9d2 Zenix 2023-06-17 03:53:04 +0900
  • ac3b886953 llama : fix embd when offloading non-repeating layers (#1891) master-ac3b886 Johannes Gäßler 2023-06-16 20:25:51 +0200
  • 5b9ccaf104 Fixed possible macro redefinition (#1892) master-5b9ccaf FrankHB 2023-06-17 02:25:01 +0800
  • 9cbf50c041 build : fix and ignore MSVC warnings (#1889) master-9cbf50c Borislav Stanimirov 2023-06-16 21:23:53 +0300
  • f143d0e6bf Went back to the original size calculation for now, though it appears not to matter. John 2023-06-16 20:20:30 +0200
  • 5005d07f45 Merge pull request #2 from jploski/master John 2023-06-16 19:56:00 +0200
  • 3d6ed18542 Correction to 4a37251a: since we did not insert the BOS token, we do not need to attempt to rescue it during context swap Jan Ploski 2023-06-16 19:53:38 +0200
  • 4a37251a18 Fixed the BOS/EOS token (both are 11 according to config.json of Falcon-7B/40B). Also: do not auto-insert a space or BOS/EOS at the beginning of the prompt (this seems to be LLaMA-specific). Jan Ploski 2023-06-16 19:36:27 +0200
  • 3d01122610 CUDA : faster k-quant dot kernels (#1862) master-3d01122 Kawrakow 2023-06-16 20:08:44 +0300
  • 0dc0b6995f PR comments Iwan Kawrakow 2023-06-16 19:36:17 +0300
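
Threading sketch. Commit 9106232260 above tolerates CI slowness by skipping, rather than failing, a timing check when two adjacent runs are both slow. A minimal sketch of that heuristic; the run count, thread count, and per-run budget are assumptions for illustration, not the repository's actual test code:

    // Hedged sketch of the tolerance heuristic in commit 9106232260.
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        using clock = std::chrono::steady_clock;
        const int    n_runs       = 5;
        const int    n_threads    = 15;    // the count the commit message mentions
        const double threshold_ms = 100.0; // hypothetical per-run budget

        std::vector<double> elapsed_ms(n_runs);
        for (int i = 0; i < n_runs; i++) {
            const auto t0 = clock::now();
            std::vector<std::thread> workers;
            for (int j = 0; j < n_threads; j++) {
                workers.emplace_back([] { /* no-op worker */ });
            }
            for (auto & w : workers) {
                w.join();
            }
            elapsed_ms[i] = std::chrono::duration<double, std::milli>(clock::now() - t0).count();
        }

        // Two adjacent slow runs look like a congested CI host (e.g. Windows on
        // GitHub runners), so pass silently instead of reporting a regression.
        for (int i = 1; i < n_runs; i++) {
            if (elapsed_ms[i - 1] > threshold_ms && elapsed_ms[i] > threshold_ms) {
                return 0;
            }
        }
        for (double t : elapsed_ms) {
            if (t > threshold_ms) {
                fprintf(stderr, "isolated slow run: %.1f ms\n", t);
                return 1;
            }
        }
        return 0;
    }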
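
Stride sketch. Commit 670390f915 fixes address arithmetic in ggml_compute_forward_add_q_f32. In ggml, tensor strides nb[0..3] are byte counts, so a destination row is found by offsetting a char* from the data pointer; using the wrong stride index (or counting in elements instead of bytes) silently writes to the wrong row. A generic sketch of the convention, not the actual patched line; tensor_view and row_ptr are hypothetical names:

    // Generic illustration of ggml's byte-stride row addressing.
    #include <cstddef>

    struct tensor_view {
        void * data;
        size_t nb[4]; // strides in bytes per dimension, as in ggml tensors
    };

    // Address of row (i1, i2, i3); dimension 0 is the contiguous row itself.
    // The cast to char* is essential: the strides are in bytes, not elements.
    static inline void * row_ptr(const tensor_view & t, size_t i1, size_t i2, size_t i3) {
        return (char *) t.data + i1 * t.nb[1] + i2 * t.nb[2] + i3 * t.nb[3];
    }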
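
KV-cache sketch. Commits ea70881941/583daaee43 make --memory-f32 the default because ggml_repeat2 only has an F32 implementation; F32 doubles the per-element KV-cache cost relative to F16, which is why accurate allocation matters. A back-of-the-envelope sizing sketch under assumed LLaMA-7B-style dimensions (Falcon's multi-query attention shrinks the real cache considerably):

    // Illustrative KV-cache sizing only; the dimensions below are
    // LLaMA-7B-style assumptions, not values read from this repository.
    #include <cstdint>
    #include <cstdio>

    int main() {
        const int64_t n_embd   = 4096; // hidden size
        const int64_t n_layer  = 32;
        const int64_t n_ctx    = 2048;
        const int64_t elt_size = 4;    // 4 bytes with --memory-f32, 2 with F16

        // one tensor for K and one for V, each n_layer*n_ctx*n_embd elements
        const int64_t bytes = 2 * n_layer * n_ctx * n_embd * elt_size;
        printf("kv cache: %.0f MiB\n", bytes / (1024.0 * 1024.0)); // 2048 MiB at F32, half at F16
        return 0;
    }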
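
NUMA sketch. Commit b71dfe637f relaxes the even-split rule: that n_threads be a multiple of n_nodes is a performance recommendation, not a correctness requirement. A sketch of demoting a hard check to a warning; the function name and message are hypothetical:

    // Hedged sketch: advise on an uneven NUMA split instead of aborting.
    #include <cstdio>

    void check_numa_split(int n_threads, int n_nodes) {
        if (n_nodes > 0 && n_threads % n_nodes != 0) {
            // before: a hard assert/abort; after: warn and continue
            fprintf(stderr,
                    "warning: n_threads (%d) is not a multiple of n_nodes (%d); "
                    "threads will be distributed unevenly across NUMA nodes\n",
                    n_threads, n_nodes);
        }
    }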
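
Reserve sketch. Commit 5f2c9ce21e names a standard C++ optimization: when a vector's final size is known up front, std::vector::reserve replaces geometric growth, with its repeated reallocations and element moves, by a single allocation. A minimal, self-contained example; copy_tokens is a hypothetical function, not one from the patch:

    // Minimal illustration of the reserve optimization.
    #include <string>
    #include <vector>

    std::vector<std::string> copy_tokens(const std::vector<std::string> & src) {
        std::vector<std::string> out;
        out.reserve(src.size()); // single allocation; avoids reallocs + moves
        for (const auto & s : src) {
            out.push_back(s);
        }
        return out;
    }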