llama.cpp/README_sycl.md

# llama.cpp for SYCL

[Background](#background)

[OS](#os)

[Intel GPU](#intel-gpu)

[Linux](#linux)

[Environment Variable](#environment-variable)

[Known Issue](#known-issue)

[Todo](#todo)

## Background

SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators—such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.

oneAPI is a specification that is open and standards-based, supporting multiple architecture types including but not limited to GPU, CPU, and FPGA. The spec has both direct programming and API-based programming paradigms.

Intel uses the SYCL as direct programming language to support CPU, GPUs and FPGAs.

To avoid to re-invent the wheel, this code refer other code paths in llama.cpp (like OpenBLAS, cuBLAS, CLBlast). We use a open-source tool [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) (Commercial release [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) migrate to SYCL.

The llama.cpp for SYCL is used to support Intel GPUs.

For Intel CPU, recommend to use llama.cpp for X86 (Intel MKL building).

## OS

|OS|Status|Verified|
|-|-|-|
|Linux|Support|Ubuntu 22.04|
|Windows|Ongoing| |


## Intel GPU

|Intel GPU| Status | Verified Model|
|-|-|-|
|Intel Data Center Max Series| Support| Max 1550|
|Intel Data Center Flex Series| Support| Flex 170|
|Intel Arc Series| Support| Arc 770|
|Intel built-in Arc GPU| Support| built-in Arc GPU in Meteor Lake|
|Intel iGPU| Support| iGPU in i5-1250P, i7-1165G7|


## Linux

### Setup Environment

1. Install Intel GPU driver.

a. Please install Intel GPU driver by official guide: [Install GPU Drivers](https://dgpu-docs.intel.com/driver/installation.html).

Note: for iGPU, please install the client GPU driver.

b. Add user to group: video, render.

```
sudo usermod -aG render username
sudo usermod -aG video username
```

Note: re-login to enable it.

c. Check

```
sudo apt install clinfo
sudo clinfo -l
```

Output (example):

```
Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A770 Graphics


Platform #0: Intel(R) OpenCL HD Graphics
 `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
```

2. Install Intel® oneAPI Base toolkit.


a. Please follow the procedure in [Get the Intel® oneAPI Base Toolkit ](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).

Recommend to install to default folder: **/opt/intel/oneapi**.

Following guide use the default folder as example. If you use other folder, please modify the following guide info with your folder.

b. Check

```
source /opt/intel/oneapi/setvars.sh

sycl-ls
```

There should be one or more level-zero devices. Like **[ext_oneapi_level_zero:gpu:0]**.

Output (example):
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]

```

2. Build locally:

```
mkdir -p build
cd build
source /opt/intel/oneapi/setvars.sh

#for FP16
#cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON # faster for long-prompt inference

#for FP32
cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx

#build example/main only
#cmake --build . --config Release --target main

#build all binary
cmake --build . --config Release -v

```

or

```
./examples/sycl/build.sh
```

Note:

- By default, it will build for all binary files. It will take more time. To reduce the time, we recommend to build for **example/main** only.

### Run

1. Put model file to folder **models**

2. Enable oneAPI running environment

```
source /opt/intel/oneapi/setvars.sh
```

3. List device ID

Run without parameter:

```
./build/bin/ls-sycl-device

or

./build/bin/main
```

Check the ID in startup log, like:

```
found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics,	compute capability 1.3,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136
  Device 1: Intel(R) FPGA Emulation Device,	compute capability 1.2,
    max compute_units 24,	max work group size 67108864,	max sub group size 64,	global mem size 67065057280
  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K,	compute capability 3.0,
    max compute_units 24,	max work group size 8192,	max sub group size 64,	global mem size 67065057280
  Device 3: Intel(R) Arc(TM) A770 Graphics,	compute capability 3.0,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136

```

|Attribute|Note|
|-|-|
|compute capability 1.3|Level-zero running time, recommended |
|compute capability 3.0|OpenCL running time, slower than level-zero in most cases|

4. Set device ID and execute llama.cpp

Set device ID = 0 by **GGML_SYCL_DEVICE=0**

```
GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
```
or run by script:

```
./examples/sycl/run_llama2.sh
```

Note:

- By default, mmap is used to read model file. In some cases, it leads to the hang issue. Recommend to use parameter **--no-mmap** to disable mmap() to skip this issue.


5. Check the device ID in output

Like：
```
Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
```


## Environment Variable

#### Build

|Name|Value|Function|
|-|-|-|
|LLAMA_SYCL|ON (mandatory)|Enable build with SYCL code path. <br>For FP32/FP16, LLAMA_SYCL=ON is mandatory.|
|LLAMA_SYCL_F16|ON (optional)|Enable FP16 build with SYCL code path. Faster for long-prompt inference. <br>For FP32, not set it.|
|CMAKE_C_COMPILER|icx|Use icx compiler for SYCL code path|
|CMAKE_CXX_COMPILER|icpx|use icpx for SYCL code path|

#### Running


|Name|Value|Function|
|-|-|-|
|GGML_SYCL_DEVICE|0 (default) or 1|Set the device id used. Check the device ids by default running output|
|GGML_SYCL_DEBUG|0 (default) or 1|Enable log function by macro: GGML_SYCL_DEBUG|

## Known Issue

- Error:  `error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory`.

  Miss to enable oneAPI running environment.

  Install oneAPI base toolkit and enable it by: `source /opt/intel/oneapi/setvars.sh`.


- Hang during startup

  llama.cpp use mmap as default way to read model file and copy to GPU. In some system, memcpy will be abnormal and block.

  Solution: add **--no-mmap**.

## Todo

- Support to build in Windows.

- Support multiple cards.
-												ggml : add unified SYCL backend for Intel GPUs (#2690)

* first update for migration

* update init_cublas

* add debug functio, commit all help code

* step 1

* step 2

* step3 add fp16, slower 31->28

* add GGML_LIST_DEVICE function

* step 5 format device and print

* step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue

* support main device is non-zero

* step7 add debug for code path, rm log

* step 8, rename all macro & func from cuda by sycl

* fix error of select non-zero device, format device list

* ren ggml-sycl.hpp -> ggml-sycl.h

* clear CMAKE to rm unused lib and options

* correct queue: rm dtct:get_queue

* add print tensor function to debug

* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481

* summary dpct definition in one header file to replace folder:dpct

* refactor device log

* mv dpct definition from folder dpct to ggml-sycl.h

* update readme, refactor build script

* fix build with sycl

* set nthread=1 when sycl, increase performance

* add run script, comment debug code

* add ls-sycl-device tool

* add ls-sycl-device, rm unused files

* rm rear space

* dos2unix

* Update README_sycl.md

* fix return type

* remove sycl version from include path

* restore rm code to fix hang issue

* add syc and link for sycl readme

* rm original sycl code before refactor

* fix code err

* add know issue for pvc hang issue

* enable SYCL_F16 support

* align pr4766

* check for sycl blas, better performance

* cleanup 1

* remove extra endif

* add build&run script, clean CMakefile, update guide by review comments

* rename macro to intel hardware

* editor config format

* format fixes

* format fixes

* editor format fix

* Remove unused headers

* skip build sycl tool for other code path

* replace tab by space

* fix blas matmul function

* fix mac build

* restore hip dependency

* fix conflict

* ren as review comments

* mv internal function to .cpp file

* export funciton print_sycl_devices(), mv class dpct definition to source file

* update CI/action for sycl code, fix CI error of repeat/dup

* fix action ID format issue

* rm unused strategy

* enable llama_f16 in ci

* fix conflict

* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml

* fix ci cases for unsupported data type

* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL

* revert hip cmake changes

* fix indent

* add prefix in func name

* revert no mmq

* rm cpu blas duplicate

* fix no_new_line

* fix src1->type==F16 bug.

* pass batch offset for F16 src1

* fix batch error

* fix wrong code

* revert sycl checking in test-sampling

* pass void as arguments of ggml_backend_sycl_print_sycl_devices

* remove extra blank line in test-sampling

* revert setting n_threads in sycl

* implement std::isinf for icpx with fast math.

* Update ci/run.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add copyright and MIT license declare

* update the cmd example

---------

Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
											
										
										
											2024-01-28 15:56:23 +00:00
+								# llama.cpp for SYCL
 								[Background](#background)
 								[OS](#os)
 								[Intel GPU](#intel-gpu)
 								[Linux](#linux)
 								[Environment Variable](#environment-variable)
 								[Known Issue](#known-issue)
 								[Todo](#todo)
 								## Background
 								SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators—such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.
 								oneAPI is a specification that is open and standards-based, supporting multiple architecture types including but not limited to GPU, CPU, and FPGA. The spec has both direct programming and API-based programming paradigms.
 								Intel uses the SYCL as direct programming language to support CPU, GPUs and FPGAs.
 								To avoid to re-invent the wheel, this code refer other code paths in llama.cpp (like OpenBLAS, cuBLAS, CLBlast). We use a open-source tool [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) (Commercial release [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) migrate to SYCL.
 								The llama.cpp for SYCL is used to support Intel GPUs.
 								For Intel CPU, recommend to use llama.cpp for X86 (Intel MKL building).
 								## OS
 								|OS|Status|Verified|
 								|-|-|-|
 								|Linux|Support|Ubuntu 22.04|
 								|Windows|Ongoing| |
 								## Intel GPU
 								|Intel GPU| Status | Verified Model|
 								|-|-|-|
 								|Intel Data Center Max Series| Support| Max 1550|
 								|Intel Data Center Flex Series| Support| Flex 170|
 								|Intel Arc Series| Support| Arc 770|
 								|Intel built-in Arc GPU| Support| built-in Arc GPU in Meteor Lake|
 								|Intel iGPU| Support| iGPU in i5-1250P, i7-1165G7|
 								## Linux
 								### Setup Environment
 . Install Intel GPU driver.
 								a. Please install Intel GPU driver by official guide: [Install GPU Drivers](https://dgpu-docs.intel.com/driver/installation.html).
 								Note: for iGPU, please install the client GPU driver.
 								b. Add user to group: video, render.
 								```
 								sudo usermod -aG render username
 								sudo usermod -aG video username
 								```
 								Note: re-login to enable it.
 								c. Check
 								```
 								sudo apt install clinfo
 								sudo clinfo -l
 								```
 								Output (example):
 								```
 								Platform #0: Intel(R) OpenCL Graphics
 								 `-- Device #0: Intel(R) Arc(TM) A770 Graphics
 								Platform #0: Intel(R) OpenCL HD Graphics
 								 `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
 								```
 . Install Intel® oneAPI Base toolkit.
 								a. Please follow the procedure in [Get the Intel® oneAPI Base Toolkit ](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).
 								Recommend to install to default folder: **/opt/intel/oneapi**.
 								Following guide use the default folder as example. If you use other folder, please modify the following guide info with your folder.
 								b. Check
 								```
 								source /opt/intel/oneapi/setvars.sh
 								sycl-ls
 								```
 								There should be one or more level-zero devices. Like **[ext_oneapi_level_zero:gpu:0]**.
 								Output (example):
 								```
 								[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
 								[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
 								[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [23.30.26918.50]
 								[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]
 								```
 . Build locally:
 								```
 								mkdir -p build
 								cd build
 								source /opt/intel/oneapi/setvars.sh
 								#for FP16
 								#cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON # faster for long-prompt inference
 								#for FP32
 								cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
 								#build example/main only
 								#cmake --build . --config Release --target main
 								#build all binary
 								cmake --build . --config Release -v
 								```
 								or
 								```
 								./examples/sycl/build.sh
 								```
 								Note:
 								- By default, it will build for all binary files. It will take more time. To reduce the time, we recommend to build for **example/main** only.
 								### Run
 . Put model file to folder **models**
 . Enable oneAPI running environment
 								```
 								source /opt/intel/oneapi/setvars.sh
 								```
 . List device ID
 								Run without parameter:
 								```
 								./build/bin/ls-sycl-device
 								or
 								./build/bin/main
 								```
 								Check the ID in startup log, like:
 								```
 								found 4 SYCL devices:
 								  Device 0: Intel(R) Arc(TM) A770 Graphics,	compute capability 1.3,
 								    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136
 								  Device 1: Intel(R) FPGA Emulation Device,	compute capability 1.2,
 								    max compute_units 24,	max work group size 67108864,	max sub group size 64,	global mem size 67065057280
 								  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K,	compute capability 3.0,
 								    max compute_units 24,	max work group size 8192,	max sub group size 64,	global mem size 67065057280
 								  Device 3: Intel(R) Arc(TM) A770 Graphics,	compute capability 3.0,
 								    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136
 								```
 								|Attribute|Note|
 								|-|-|
 								|compute capability 1.3|Level-zero running time, recommended |
 								|compute capability 3.0|OpenCL running time, slower than level-zero in most cases|
 . Set device ID and execute llama.cpp
 								Set device ID = 0 by **GGML_SYCL_DEVICE=0**
 								```
 								GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
 								```
 								or run by script:
 								```
 								./examples/sycl/run_llama2.sh
 								```
 								Note:
 								- By default, mmap is used to read model file. In some cases, it leads to the hang issue. Recommend to use parameter **--no-mmap** to disable mmap() to skip this issue.
 . Check the device ID in output
 								Like：
 								```
 								Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
 								```
 								## Environment Variable
 								#### Build
 								|Name|Value|Function|
 								|-|-|-|
 								|LLAMA_SYCL|ON (mandatory)|Enable build with SYCL code path. <br>For FP32/FP16, LLAMA_SYCL=ON is mandatory.|
 								|LLAMA_SYCL_F16|ON (optional)|Enable FP16 build with SYCL code path. Faster for long-prompt inference. <br>For FP32, not set it.|
 								|CMAKE_C_COMPILER|icx|Use icx compiler for SYCL code path|
 								|CMAKE_CXX_COMPILER|icpx|use icpx for SYCL code path|
 								#### Running
 								|Name|Value|Function|
 								|-|-|-|
 								|GGML_SYCL_DEVICE|0 (default) or 1|Set the device id used. Check the device ids by default running output|
 								|GGML_SYCL_DEBUG|0 (default) or 1|Enable log function by macro: GGML_SYCL_DEBUG|
 								## Known Issue
 								- Error:  `error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory`.
 								  Miss to enable oneAPI running environment.
 								  Install oneAPI base toolkit and enable it by: `source /opt/intel/oneapi/setvars.sh`.
 								- Hang during startup
 								  llama.cpp use mmap as default way to read model file and copy to GPU. In some system, memcpy will be abnormal and block.
 								  Solution: add **--no-mmap**.
 								## Todo
 								- Support to build in Windows.
 								- Support multiple cards.