llama.cpp/examples/server/tests/features/results.feature

@llama.cpp
@results
Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And   a model file test-model-00001-of-00003.gguf
    And   128 as batch size
    And   1024 KV cache size
    And   128 max tokens to predict
    And   continuous batching

  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    And   1.0 temperature
    Then  the server is starting
    Then  the server is healthy

    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42

    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And  all slots are idle
    Then all predictions are equal
    Examples:
      | n_slots |
      | 1       |
      # FIXME: unified KV cache nondeterminism
      # | 2       |

  Scenario Outline: different results with different seed
    Given <n_slots> slots
    And   1.0 temperature
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45

    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And  all slots are idle
    Then all predictions are different
    Examples:
      | n_slots |
      | 1       |
      | 2       |

  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And   <temp> temperature
    # And   0 as draft
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And  all slots are idle

    Then all predictions are equal
    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 1          | 1.0  |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 2          | 0.0  |
      # | 4          | 0.0  |
      # | 2          | 1.0  |
      # | 4          | 1.0  |

  Scenario Outline: consistent token probs with same seed and prompt
    Given <n_slots> slots
    And   <n_kv> KV cache size
    And   1.0 temperature
    And   <n_predict> max tokens to predict
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And  all slots are idle

    Then all token probabilities are equal
    Examples:
      | n_slots | n_kv | n_predict | n_parallel |
      | 4       | 1024 | 1         | 1          |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 4       | 1024 | 1         | 4          |
      # | 4       | 1024 | 100       | 1          |
      # This test still fails even the above patches; the first token probabilities are already different.
      # | 4       | 1024 | 100       | 4          |
Server: fix seed for multiple slots (#6835) * Server: add tests for consistent results * sampling: separate rng per sampling context 2024-04-24 09:08:36 +00:00			`@llama.cpp`
			`@results`
			`Feature: Results`

			`Background: Server startup`
			`Given a server listening on localhost:8080`
			`And a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models`
			`And a model file test-model-00001-of-00003.gguf`
			`And 128 as batch size`
Server: add tests for batch size, different seeds (#6950) 2024-05-01 15:52:55 +00:00			`And 1024 KV cache size`
Server: fix seed for multiple slots (#6835) * Server: add tests for consistent results * sampling: separate rng per sampling context 2024-04-24 09:08:36 +00:00			`And 128 max tokens to predict`
Server: add tests for batch size, different seeds (#6950) 2024-05-01 15:52:55 +00:00			`And continuous batching`
Server: fix seed for multiple slots (#6835) * Server: add tests for consistent results * sampling: separate rng per sampling context 2024-04-24 09:08:36 +00:00
Server: add tests for batch size, different seeds (#6950) 2024-05-01 15:52:55 +00:00			`Scenario Outline: consistent results with same seed`
Server: fix seed for multiple slots (#6835) * Server: add tests for consistent results * sampling: separate rng per sampling context 2024-04-24 09:08:36 +00:00			`Given <n_slots> slots`
server : fix temperature + disable some tests (#7409) * server : fix temperature * server : disable tests relying on parallel determinism * ci : change server Debug -> RelWithDebInfo 2024-05-20 12:10:03 +00:00			`And 1.0 temperature`
Server: fix seed for multiple slots (#6835) * Server: add tests for consistent results * sampling: separate rng per sampling context 2024-04-24 09:08:36 +00:00			`Then the server is starting`
			`Then the server is healthy`

Server: add tests for batch size, different seeds (#6950) 2024-05-01 15:52:55 +00:00			`Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42`
Server: fix seed for multiple slots (#6835) * Server: add tests for consistent results * sampling: separate rng per sampling context 2024-04-24 09:08:36 +00:00
			`Given concurrent completion requests`
			`Then the server is busy`
			`Then the server is idle`
			`And all slots are idle`
			`Then all predictions are equal`
			`Examples:`
			`\| n_slots \|`
			`\| 1 \|`
server : fix temperature + disable some tests (#7409) * server : fix temperature * server : disable tests relying on parallel determinism * ci : change server Debug -> RelWithDebInfo 2024-05-20 12:10:03 +00:00			`# FIXME: unified KV cache nondeterminism`
			`# \| 2 \|`
Server: add tests for batch size, different seeds (#6950) 2024-05-01 15:52:55 +00:00
			`Scenario Outline: different results with different seed`
			`Given <n_slots> slots`
server : tuning tests (#7388) * server : don't pass temperature as string * server : increase timeout * tests : fix the fix 0.8f -> 0.8 ggml-ci * tests : set explicit temperature 2024-05-20 07:16:41 +00:00			`And 1.0 temperature`
Server: add tests for batch size, different seeds (#6950) 2024-05-01 15:52:55 +00:00			`Then the server is starting`
			`Then the server is healthy`

			`Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42`
			`Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43`
			`Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44`
			`Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45`

			`Given concurrent completion requests`
			`Then the server is busy`
			`Then the server is idle`
			`And all slots are idle`
			`Then all predictions are different`
			`Examples:`
			`\| n_slots \|`
			`\| 1 \|`
			`\| 2 \|`

			`Scenario Outline: consistent results with same seed and varying batch size`
			`Given 4 slots`
			`And <temp> temperature`
			`# And 0 as draft`
			`Then the server is starting`
			`Then the server is healthy`

			`Given 1 prompts "Write a very long story about AI." with seed 42`
			`And concurrent completion requests`
			`# Then the server is busy # Not all slots will be utilized.`
			`Then the server is idle`
			`And all slots are idle`

			`Given <n_parallel> prompts "Write a very long story about AI." with seed 42`
			`And concurrent completion requests`
			`# Then the server is busy # Not all slots will be utilized.`
			`Then the server is idle`
			`And all slots are idle`

			`Then all predictions are equal`
			`Examples:`
			`\| n_parallel \| temp \|`
server: add test for token probs (#7347) 2024-05-19 14:26:02 +00:00			`\| 1 \| 0.0 \|`
			`\| 1 \| 1.0 \|`
server : fix temperature + disable some tests (#7409) * server : fix temperature * server : disable tests relying on parallel determinism * ci : change server Debug -> RelWithDebInfo 2024-05-20 12:10:03 +00:00			`# FIXME: unified KV cache nondeterminism`
Server: add tests for batch size, different seeds (#6950) 2024-05-01 15:52:55 +00:00			`# See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227`
server: add test for token probs (#7347) 2024-05-19 14:26:02 +00:00			`# and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574`
			`# and https://github.com/ggerganov/llama.cpp/pull/7347 .`
server : fix temperature + disable some tests (#7409) * server : fix temperature * server : disable tests relying on parallel determinism * ci : change server Debug -> RelWithDebInfo 2024-05-20 12:10:03 +00:00			`# \| 2 \| 0.0 \|`
			`# \| 4 \| 0.0 \|`
server: add test for token probs (#7347) 2024-05-19 14:26:02 +00:00			`# \| 2 \| 1.0 \|`
			`# \| 4 \| 1.0 \|`

			`Scenario Outline: consistent token probs with same seed and prompt`
			`Given <n_slots> slots`
			`And <n_kv> KV cache size`
			`And 1.0 temperature`
			`And <n_predict> max tokens to predict`
			`Then the server is starting`
			`Then the server is healthy`

			`Given 1 prompts "The meaning of life is" with seed 42`
			`And concurrent completion requests`
			`# Then the server is busy # Not all slots will be utilized.`
			`Then the server is idle`
			`And all slots are idle`

			`Given <n_parallel> prompts "The meaning of life is" with seed 42`
			`And concurrent completion requests`
			`# Then the server is busy # Not all slots will be utilized.`
			`Then the server is idle`
			`And all slots are idle`

			`Then all token probabilities are equal`
			`Examples:`
			`\| n_slots \| n_kv \| n_predict \| n_parallel \|`
			`\| 4 \| 1024 \| 1 \| 1 \|`
server : fix temperature + disable some tests (#7409) * server : fix temperature * server : disable tests relying on parallel determinism * ci : change server Debug -> RelWithDebInfo 2024-05-20 12:10:03 +00:00			`# FIXME: unified KV cache nondeterminism`
server: add test for token probs (#7347) 2024-05-19 14:26:02 +00:00			`# See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227`
			`# and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574`
			`# and https://github.com/ggerganov/llama.cpp/pull/7347 .`
server : fix temperature + disable some tests (#7409) * server : fix temperature * server : disable tests relying on parallel determinism * ci : change server Debug -> RelWithDebInfo 2024-05-20 12:10:03 +00:00			`# \| 4 \| 1024 \| 1 \| 4 \|`
server: add test for token probs (#7347) 2024-05-19 14:26:02 +00:00			`# \| 4 \| 1024 \| 100 \| 1 \|`
			`# This test still fails even the above patches; the first token probabilities are already different.`
			`# \| 4 \| 1024 \| 100 \| 4 \|`