mirror of
https://github.com/ggerganov/llama.cpp.git
synced 2025-01-10 10:41:47 +00:00
958367bf53
* server : refactor slot input data, move tokenizer to HTTP thread
* move prompt_tokens.empty() check
* fix incorrect if branch
* fix infinite generation loop
* bring back infill validation
* add infill test
* try fixing format_infill
* fix test
* remove redundant code
* rename completion to inference
* update docs
* use llama_tokens everywhere
37 lines
1.5 KiB
Gherkin
@llama.cpp
@infill
Feature: llama.cpp server

  # The current model is made by adding FIM tokens to the existing stories260K
  # We may want to use a better model in the future, maybe something like SmolLM 360M

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K-infill.gguf from HF repo ggml-org/models
    And a model file test-model-infill.gguf
    And a model alias tinyllama-infill
    And 42 as server seed
    And 1024 as batch size
    And 1024 as ubatch size
    And 2048 KV cache size
    And 64 max tokens to predict
    And 0.0 temperature
    Then the server is starting
    Then the server is healthy

  Scenario: Infill without input_extra
    Given a prompt "Complete this"
    And an infill input extra none none
    And an infill input prefix "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_"
    And an infill input suffix "}\n"
    And an infill request with no api error
    Then 64 tokens are predicted matching One|day|she|saw|big|scary|bird

  Scenario: Infill with input_extra
    Given a prompt "Complete this"
    And an infill input extra "llama.h" "LLAMA_API int32_t llama_n_threads();\n"
    And an infill input prefix "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_"
    And an infill input suffix "}\n"
    And an infill request with no api error
    Then 64 tokens are predicted matching cuts|Jimmy|mom|came|into|the|room
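
  # For orientation only: a rough sketch of the JSON body that the
  # "Infill with input_extra" scenario might send to the server's /infill
  # endpoint. The field names (prompt, input_prefix, input_suffix, input_extra)
  # are an assumption based on the server's documented request options; the
  # step definitions decide the real payload, and how the Background settings
  # (seed, max tokens, temperature) are passed is not shown here.
  #
  #   {
  #     "prompt": "Complete this",
  #     "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_",
  #     "input_suffix": "}\n",
  #     "input_extra": [
  #       {"filename": "llama.h", "text": "LLAMA_API int32_t llama_n_threads();\n"}
  #     ]
  #   }
  #
  # In the "Infill without input_extra" scenario, the input_extra entry would
  # presumably be omitted or left empty.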