* avoid to get prompt in infill mode and embedding mode
* remove embedding mode
* refactor format
---------
Co-authored-by: wudexiang <wudexiang@bytedance.com>
* common : gpt_params_parse do not print usage
* common : rework usage print (wip)
* common : valign
* common : rework print_usage
* infill : remove cfg support
* common : reorder args
* server : deduplicate parameters
ggml-ci
* common : add missing header
ggml-ci
* common : remote --random-prompt usages
ggml-ci
* examples : migrate to gpt_params
ggml-ci
* batched-bench : migrate to gpt_params
* retrieval : migrate to gpt_params
* common : change defaults for escape and n_ctx
* common : remove chatml and instruct params
ggml-ci
* common : passkey use gpt_params
* ic
* migrate my eary work
* add the belonging stuff: css,favicon etc
* de prompts
* chore: Update HTML meta tags in index.html file
* add api-key css classes
* some necessary fixes
* Add API key CSS classes and update styling in style.css
* clean the code
* move API to the top, rearrange param sliders. update css
* add tooltips to the parameters with comprehensible explanations
* fix FloatField and BoolField tooltips
* fix grammar field width
* use template literales for promptFormats.js
* update const ModelGenerationInfo
* remove ms per token, since not relevant for most webui users and use cases
* add phi-3 prompt template
* add phi3 to dropdown
* add css class
* update forgotten css theme
* add user message suffix
* fix chatml & add llama3 format
* fix llama3 prompt template
* more prompt format fixes
* add more comon stop tokens
* add missing char
* do not separate with new line or comma
* move prompt style
* add hacky llama2 prompt solution, reduce redundancy in promptFormats.js
* fix toggle state localstorage
* add cmd-r prompt et reduce redundancy
* set default prompt to empty
* move files, clean code
* fix css path
* add a button to the new ui
* move new ui to "/public" due to otherwise problematic CORS behaviour
* include new ui in cpp
* fix wrong link to old ui
* renaming to ensure consistency
* fix typos "prompt-format" -> "prompt-formats"
* use correct indent
* add new ui files to makefile
* fix typo
* SimpleChat:DU:BringIn local helper js modules using importmap
Use it to bring in a simple trim garbage at end logic, which is
used to trim received response.
Also given that importmap assumes esm / standard js modules, so
also global variables arent implicitly available outside the
modules. So add it has a member of document for now
* SimpleChat:DU: Add trim garbage at end in loop helper
* SimpleChat:DU:TrimGarbage if unable try skip char and retry
* SimpleChat:DU: Try trim using histogram based info
TODO: May have to add max number of uniq chars in histogram at
end of learning phase.
* SimpleChat:DU: Switch trim garbage hist based to maxUniq simple
Instead of blindly building histogram for specified substring
length, and then checking if any new char within specified min
garbage length limit, NOW exit learn state when specified maxUniq
chars are found. Inturn there should be no new chars with in
the specified min garbage length required limit.
TODO: Need to track char classes like alphabets, numerals and
special/other chars.
* SimpleChat:DU: Bring in maxType to the mix along with maxUniq
Allow for more uniq chars, but then ensure that a given type of
char ie numerals or alphabets or other types dont cross the
specified maxType limit. This allows intermixed text garbage
to be identified and trimmed.
* SimpleChat:DU: Cleanup debug log messages
* SimpleChat:UI: Move html ui base helpers into its own module
* SimpleChat:DU:Avoid setting frequence/Presence penalty
Some models like llama3 found to try to be over intelligent by
repeating garbage still, but by tweaking the garbage a bit so that
it is not exactly same. So avoid setting these penalties and let
the model's default behaviour work out, as is.
Also the simple minded histogram based garbage trimming from end,
works to an extent, when the garbage is more predictable and
repeatative.
* SimpleChat:UI: Add and use a para-create-append helper
Also update the config params dump to indicate that now one needs
to use document to get hold of gMe global object, this is bcas of
moving to module type js.
Also add ui.mjs to importmap
* SimpleChat:UI: Helper to create bool button and use it wrt settings
* SimpleChat:UI: Add Select helper and use it wrt ChatHistoryInCtxt
* SimpleChat:UI:Select: dict-name-value, value wrt default, change
Take a dict/object of name-value pairs instead of just names.
Inturn specify the actual value wrt default, rather than the
string representing that value.
Trap the needed change event rather than click wrt select.
* SimpleChat:UI: Add Div wrapped label+element helpers
Move settings related elements to use the new div wrapped ones.
* SimpleChat:UI:Add settings button and bring in settings ui
* SimpleChat:UI:Settings make boolean button text show meaning
* SimpleChat: Update a bit wrt readme and notes in du
* SimpleChat: GarbageTrim enable/disable, show trimmed part ifany
* SimpleChat: highlight trim, garbage trimming bitmore aggressive
Make it easy for end user to identified the trimmed text.
Make garbage trimming logic, consider a longer repeat garbage
substring.
* SimpleChat: Cleanup a bit wrt Api end point related flow
Consolidate many of the Api end point related basic meta data into
ApiEP class.
Remove the hardcoded ApiEP/Mode settings from html+js, instead use
the generic select helper logic, inturn in the settings block.
Move helper to generate the appropriate request json string based
on ApiEP into SimpleChat class itself.
* SimpleChat:Move extracting assistant response to SimpleChat class
so also the trimming of garbage.
* SimpleChat:DU: Bring in both trim garbage logics to try trim
* SimpleChat: Cleanup readme a bit, add one more chathistory length
* SimpleChat:Stream:Initial handshake skeleton
Parse the got stream responses and try extract the data from it.
It allows for a part read to get a single data line or multiple
data line. Inturn extract the json body and inturn the delta
content/message in it.
* SimpleChat: Move handling oneshot mode server response
Move handling of the oneshot mode server response into SimpleChat.
Also add plumbing for moving multipart server response into same.
* SimpleChat: Move multi part server response handling in
* SimpleChat: Add MultiPart Response handling, common trimming
Add logic to call into multipart/stream server response handling.
Move trimming of garbage at the end into the common handle_response
helper.
Add new global flag to control between oneshot and multipart/stream
mode of fetching response. Allow same to be controlled by user.
If in multipart/stream mode, send the stream flag to the server.
* SimpleChat: show streamed generative text as it becomes available
Now that the extracting of streamed generated text is implemented,
add logic to show the same on the screen.
* SimpleChat:DU: Add NewLines helper class
To work with an array of new lines. Allow adding, appending,
shifting, ...
* SimpleChat:DU: Make NewLines shift more robust and flexible
* SimpleChat:HandleResponseMultiPart using NewLines helper
Make handle_response_multipart logic better and cleaner. Now it
allows for working with the situation, where the delta data line
got from server in stream mode, could be split up when recving,
but still the logic will handle it appropriately.
ALERT: Rather except (for now) for last data line wrt a request's
response.
* SimpleChat: Disable console debug by default by making it dummy
Parallely save a reference to the original func.
* SimpleChat:MultiPart/Stream flow cleanup
Dont try utf8-decode and newlines-add_append if no data to work on.
If there is no more data to get (ie done is set), then let NewLines
instance return line without newline at end, So that we dont miss
out on any last-data-line without newline kind of scenario.
Pass stream flag wrt utf-8 decode, so that if any multi-byte char
is only partly present in the passed buffer, it can be accounted
for along with subsequent buffer. At sametime, bcas of utf-8's
characteristics there shouldnt be any unaccounted bytes at end,
for valid block of utf8 data split across chunks, so not bothering
calling with stream set to false at end. LATER: Look at TextDecoder's
implementation, for any over intelligence, it may be doing..
If needed, one can use done flag to account wrt both cases.
* SimpleChat: Move baseUrl to Me and inturn gMe
This should allow easy updating of the base url at runtime by the
end user.
* SimpleChat:UI: Add input element helper
* SimpleChat: Add support for changing the base url
This ensures that if the user is running the server with a
different port or wants to try connect to server on a different
machine, then this can be used.
* SimpleChat: Move request headers into Me and gMe
Inturn allow Authorization to be sent, if not empty.
* SimpleChat: Rather need to use append to insert headers
* SimpleChat: Allow Authorization header to be set by end user
* SimpleChat:UI+: Return div and element wrt creatediv helpers
use it to set placeholder wrt Authorization header.
Also fix copy-paste oversight.
* SimpleChat: readme wrt authorization, maybe minimal openai testing
* SimpleChat: model request field for openai/equivalent compat
May help testing with openai/equivalent web services, if they
require this field.
* SimpleChat: readme stream-utf-8 trim-english deps, exception2error
* Readme: Add a entry for simplechat in the http server section
* SimpleChat:WIP:Collate internally, Stream mode Trap exceptions
This can help ensure that data fetched till that point, can be
made use of, rather than losing it.
On some platforms, the time taken wrt generating a long response,
may lead to the network connection being broken when it enters
some user-no-interaction related power saving mode.
* SimpleChat:theResp-origMsg: Undo a prev change to fix non trim
When the response handling was moved into SimpleChat, I had changed
a flow bit unnecessarily and carelessly, which resulted in the non
trim flow, missing out on retaining the ai assistant response.
This has been fixed now.
* SimpleChat: Save message internally in handle_response itself
This ensures that throwing the caught exception again for higher
up logic, doesnt lose the response collated till that time.
Go through theResp.assistant in catch block, just to keep simple
consistency wrt backtracing just in case.
Update the readme file.
* SimpleChat:Cleanup: Add spacing wrt shown req-options
* SimpleChat:UI: CreateDiv Divs map to GridX2 class
This allows the settings ui to be cleaner structured.
* SimpleChat: Show Non SettingsUI config field by default
* SimpleChat: Allow for multiline system prompt
Convert SystemPrompt into a textarea with 2 rows. Reduce
user-input-textarea to 2 rows from 3, so that overall
vertical space usage remains same.
Shorten usage messages a bit, cleanup to sync with settings ui.
* SimpleChat: Add basic skeleton for saving and loading chat
Inturn when ever a chat message (system/user/model) is added,
the chat will be saved into browser's localStorage.
* SimpleChat:ODS: Add a prefix to chatid wrt ondiskstorage key
* SimpleChat:ODS:WIP:TMP: Add UI to load previously saved chat
This is a temporary flow
* SimpleChat:ODS:Move restore/load saved chat btn setup to Me
This also allows being able to set the common system prompt
ui element to loaded chat's system prompt.
* SimpleChat:Readme updated wrt save and restore chat session info
* SimpleChat:Show chat session restore button, only if saved session
* SimpleChat: AutoCreate ChatRequestOptions settings to an extent
* SimpleChat: Update main README wrt usage with server
* SimpleChat: A placeholder system prompt, Use usage msg in code
Just have a alert msg wrt needing javascript enabled in html. And
have usage message from js file. Update the usage message a bit.
So also enable switch session wrt setup_ui call.
Add a possible system prompt as a placeholder for the system-input.
* SimpleChat:CompletionMode: Allow control of Role: prefix
* SimpleChat:Completion: Avoid Role: prefix; Newline only in between
In completion mode
* avoid inserting Role: prefix before each role's message
* avoid inserting newline at the begin and end of the prompt
message. However if there are multiple role messages, then
insert newline when going from one role's message to the
next role's message.
* SimpleChat:CompletionMode: Update readme/usage, trim textarea newline
Readme update wrt completion mode behavior.
Usage help updated wrt completion mode behavior.
When changing from input to textarea elment wrt user input, the last
newline at the end of the user input wrt textarea, was forgotten to be
filtered, this is fixed now. However if user wants to have a explicit
newline they can using shift+enter to insert a newline, that wont be
removed. The extra newline removal logic uses substring and keyup to
keep things simple and avoid some previously noted bugs wrt other
events in the key path as well as IME composition etal.
* SimpleChat:SC: Ensure proper clearing/reseting
previous logic would have cleared/reset the xchat, without doing
the same wrt iLastSys, thus leading to it pointing to a now non
existent role-content entry.
So if a user set a system prompt and used completion mode, it would
have done the half stupid clear, after the model response was got.
Inturn when user tries to send a new completion query, it would
inturn lead to handle_user_submit trying to add/update system prompt
if any, which will fail, bcas iLastSys will be still pointing to a
non existant entry.
This is fixed now, by having a proper clear helper wrt SC class.
* SimpleChat: Update usage note and readme a bit
* SimpleChat:Completion: clear any prev chat history at begining
Previously any chat history including model response to a completion
query would have got cleared, after showing the same to the user,
at the end of handle_user_submit, rather than at the begining.
This gave the flexibility that user could switch from chat mode
to completion mode and have the chat history till then sent to
the ai model, as part of the completion query. However this flow
also had the issue that, if user switches between different chat
sessions, after getting a completion response, they can no longer
see the completion query and its response that they had just got.
The new flow changes the clearing of chat history wrt completion
mode to the begining of handle_user_submit, so that user doesnt
lose the last completion mode query and response, till a new
completion mode query is sent to the model, even if they were to
switch between the chat sessions. At the same time the loss of
flexibility wrt converting previous chat history into being part
of the completion query implicitly doesnt matter, because now
the end user can enter multiline queries.
* SimpleChat:Try read json early, if available
For later
the server flow doesnt seem to be sending back data early, atleast
for the request (inc options) that is currently sent.
if able to read json data early on in future, as and when ai model
is generating data, then this helper needs to indirectly update
the chat div with the recieved data, without waiting for the
overall data to be available.
* SimpleChat: Rename the half asleep mis-spelled global var
* SimpleChat: Common chat request options from a global object
* SimpleChat: Update title, usage and readme a bit
Keep the title simple so that print file name doesnt have chars
that need to be removed.
Update readme wrt some of the new helpers and options.
Change Usage list to a list of lists, add few items and style it
to reduce the margin wrt lists.
* SimpleChat:ChatRequestOptions: max_tokens
As some times based on the query from the user, the ai model may get
into a run away kind of generation with repeatations etal, so adding
max_tokens to try and limit this run away behaviour, if possible.
* SimpleChat: Reduce max_tokens to be small but still sufficient
* SimpleChat: Consolidate global vars into gMe, Display to user
This allows the end user to see the settings used by the logic,
as well as allows users to change/update the settings if they
want to by using devel-tools/console
* SimpleChat:SlidingWindow: iRecentUserMsgCnt to limit context load
This is disabled by default. However if enabled, then in addition
to latest system message, only the last N user messages, after the
latest system message and its reponses from the ai model will be sent
to the ai-model, when querying for a new response.
This specified N also includes the latest user query.
* SimpleChat: placeholder based usage hint for user-in textarea
* SimpleChat: Try make user experience better, if possible
Reduce chat history context sent to the server/ai-model to be
just the system-prompt, prev-user-request-and-ai-response and
cur-user-request, instead of the previous full chat history.
This way if there is any response with garbage/repeatation, it
doesnt mess with things beyond the next question, in some ways.
Increase max_tokens to 1024, so that a relatively large previous
reponse doesnt eat up the space available wrt next query-response.
However dont forget that the server when started should also
be started with a model context size of 1k or more, to be on
safe side.
Add frequency and presence penalty fields set to 1.2 to the set
of fields sent to server along with the user query. So that
the model is partly set to try avoid repeating text in its
response.
* SimpleChat:Add n_predict (equiv max_tokens) for llamacpp server
The /completions endpoint of examples/server doesnt take max_tokens,
instead it takes the internal n_predict, for now add the same on
the client side, maybe later add max_tokens to /completions endpoint
handling.
* SimpleChat: Note about trying to keep things simple yet flexible
* main : don't print special tokens with --grammar
The CLI interface was recently changed to print special control tokens
like the </s> stop message one. This token shouldn't be printed if the
grammar flag was passed, unless the grammar specifies it, because that
breaks shell-scriptability.
* main: use seperate stream for control characters
* main: use dprintf and add --ctrl-token-no-out and --ctrl-token-fd-out
* main: dprintf isn't part of the IEEE POSIX standard. Just use write().
* main: remove --ctrl-token-fd-out in favor for fcntl() based detection
* common.cpp: accidentally removed --interactive-first
* main: only merge stdout and control token if not in conversation or grammar mode
* main: rejig control token descriptor handling
* main: must check pipe status on very top of program
* main: renamed --no-special from --ctrl-token-no-out and other refactoring
* main: refactor ctrl_token_no_out --> no_special
* llama: rename llama_token_is_control_token() to llama_token_is_control()
* main: remove special token file descriptor feature (#5)
---------
Co-authored-by: Brian <mofosyne@gmail.com>
* Make tokenizer.cpp CLI tool nicer.
Before this commit, tokenize was a simple CLI tool like this:
tokenize MODEL_FILENAME PROMPT [--ids]
This simple tool loads the model, takes the prompt, and shows the tokens
llama.cpp is interpreting.
This changeset makes the tokenize more sophisticated, and more useful
for debugging and troubleshooting:
tokenize [-m, --model MODEL_FILENAME]
[--ids]
[--stdin]
[--prompt]
[-f, --file]
[--no-bos]
[--log-disable]
It also behaves nicer on Windows now, interpreting and rendering Unicode
from command line arguments and pipes no matter what code page the user
has set on their terminal.
* style fix: strlen(str) == 0 --> *str == 0
* Simplify tokenize.cpp; by getting rid of handling positional style arguments.
It must now be invoked with long --model, --prompt etc. arguments only.
Shortens the code.
* tokenize.cpp: iostream header no longer required
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: brian khuu <mofosyne@gmail.com>
* SimpleChat: Add a skeletal html page
Contains a div placeholder for showing chat messages till now
a text-input for allowing user to enter next chat message/query
to the model.
a submit button to allow sending of the user entered message and
chat till now to the model.
* SimpleChat: A js skeleton with SimpleChat class
Allows maintaining an array of chat message.
Allows adding chat message (from any of the roles be it system,
user, assistant, ...)
Allows showing chat messages till now, in a given div element.
* SimpleChat: request_json, globals, startme
* SimpleChatJS: Roles Class, submitClick
Define Role class with static members corresponding to the roles.
Update startme to
* Get hold of the ui elements.
* Attach a click handler to submit button, which adds the user input
to xchats array and shows the chat messages till now in chat div
element.
Trap DOMContentLoaded to trigger startme
* SimpleChat:HTML: Bring in the js file
* SimpleChat: Rather value wrt input text element
* SimpleChat: Also add completions related prompt
* SimpleChat: Use common helper logic wrt json data
* SimpleChat: Move handling of submit request into its own func
* SimpleChat: Try handshake with llm over its web service endpoint
* SimpleChat:JS: Extract model response and show to user
* SimpleChat:JS: Messages/Prompt, indicate working to end user
* SimpleChat: Try keep input element in view
* SimpleChat: Diff user/assistant msgs, Make input wider
Also show a default message to user
Also add some metas
* SimpleChat: Move into its own sub directory to avoid confusion
* SimpleChat:sh: Add simple shell script to run python3 http.server
So one needs to run the llm server locally
then run this script and access it using a local browser
* SimpleChat:JS: Try trap enter key press wrt input text field
So user can either press submit button or press enter key
* SimpleChat: Allow user to select chat or completion mode
* SimpleChat: Dont submit if already submitted and waiting
Also make chat the default selection wrt mode
* SimpleChat:JS: Handle difference in response
Try read the assistance response from appropriate field in the
response got.
Also examples/server seems to return the response in a slightly
different field, so try account for that also.
* SimpleChat:JS: Force completion mode be single message by default
* SimpleChat: Add a simple readme file
* SimpleChat:HTML: Cleanup/structure UI a bit, Add input for system
* SimpleChat:Allow system prompt to be set, if provided before user
* SimpleChat: Ignore empty user input, without trimming
* SimpleChat:Alert user if they provide sysprompt late or change it
* SimpleChat: Move handling systemprompt into its own func
* SimpleChat:HTML: Add a style for system role message
* SimpleChat: Update the readme file
* SimpleChat:CSS: Move style info into its own css file
To keep it simple, clean and seperate so that things are not
unnecessarily cluttered.
* SimpleChat:CSS: Allow for chat div to be scrollable
* SimpleChat:JS: Try ensure the last entry in chat is visible
Needed because now only the chat div is scrollable and not the full
page.
In last commit the chat div size was fixed to 75% vertical height,
so the full page no longer scrolls, so the old bring user-input
element to view wont work, instead now the last element in the
chat div should be brought into view.
* SimpleChat:JS: bottom of element visible, Set focus to user input
As the generated text could be multiple lines and occupy more space
that the full scrollable div's vertical space, make the bottom of
the last element (which can be such a generated text) in the div
visible by scrolling.
Ensure that the user input box has focus
* SimpleChat: Update notes a bit. Try keep browser happy
Avoid browser quirk mode with DOCTYPE.
Help with accessibility a bit by specifying the language explicitly.
Specify the char encoding explicitly, inturn utf-8 is a safe bet,
even with intermixing of languages if reqd in future.
Add a cache-control http-equiv meta tag, which in all probability
will be ignored.
Defer js loading and execution, just for fun and future, not that
critical here as it stands now.
* SimpleChat:HTML:Group user input+btn together; Note about multichat
* SimpleChat:JS: Allow for changing system prompt anytime for future
* SimpleChat:Readme: Note about handle_systemprompt begin/anytime
* SimpleChat:HTML: Add viewport meta for better mobile friendliness
Without this the page content may look too small.
* SimpleChat:HtmlCss: Cleanup UI flow
set margin wrt vmin rather than vw or vh so portrait/landscape ok.
Use flex and flex-grow to put things on the same line as well as
distribute available space as needed. Given two main elements/line
so it remains simple.
In each line have one element with grows and one sits with a basic
comfortably fixed size.
* SimpleChat: textarea for multiline user chat, inturn shift+enter 4 enter
* SimpleChat: Make vertical layout better responsive (flex based)
Also needed to make things cleaner and properly usable whether
landscape or portrait, after changing to multiline textarea rather
than single line user input.
Avoid hardcoding the chat-till-now display area height, instead
make it a flex-growable within a flex column of ui elements within
a fixed vertical area.
* SimpleChat: Rename simplechat.html to index.html, update readme
Instead of providing a seperate shell script, update the readme wrt
how to run/use this web front end.
* SimpleChat: Screen fixed view and scrolling, Printing full
* SimpleChat:JS:CI: Avoid space at end of jsdoc param line
* SimpleChat:JS: MultiChat initial skeleton
Will help maintain multiple independent chats in future
* SimpleChat:JS: Move system prompt begin/anytime into SimpleChat
* SimpleChat:JS:Keep MultiChatUI simple for now
Worry about different chats with different servers for later.
* SimpleChat:JS: Move handle submit into MultiChat, build on same
Create an instance of MultiChatUI and inturn a instance of chat
session, which is what the UI will inturn work on.
* SimpleChat:JS: Move to dictionary of SimpleChat, instead of array
* SimpleChat: Move ui elements into MultiChatUI, Update el IDs
Move ui elements into MultiChatUI, so that current handleUserSubmit
doesnt need to take the element arguments. Also in future, when
user is allowed to switch between different chat sessions, the
UI can be updated as needed by using the elements in UI already
known to MultiChatUI instance.
Rename the element ids' so that they follow a common convention,
as well as one can identify what the element represents in a more
consistant manner.
* SimpleChat:MCUI:Show available chat sessions, try switch btw them
Previous commits brought in / consolidated existing logic into
MultiChatUI class.
Now start adding logic towards multichat support
* show buttons indicating available chat sessions
* on sessin button click, try switch to that session
* SimpleChat:MCUI: Store and use current chat session id
Also
allow to switch chat session optionally, wrt some of the related
helpers.
setup for two chat sessions by default.
* SimpleChat:MCUI: Delay enabling user-input to avoid race
Re-enable user-input, only after response to a user query has been
updated to the chat-div. This ensures that if user tries to switch
chat session, it wont be allowed till chat-request-response flow is
done.
* SimpleChat: Take care of system prompt
Helper to get the latest system prompt and inturn use same to
set the system prompt ui, when switching.
Ensure that system prompt is set if and when enter key is pressed.
* SimpleChat:GetSystemLatest, fix a oversight.
* SimpleChat:MCUI: Allow selected chat-session btn to be highlighted
Also have a general helper for setting class of children.
* SimpleChat:Cleanup corners
Show system prompt in chat space, when it is set by pressing enter,
as a feedback to user.
Alert user, if they try to switch chat session in the middle of
waiting for a response from the ai model.
* SimpleChat:MCUI: Ensure req-resp failure doesnt lock up things
* SimpleChat:MCUI: Support for new chat sessions
Also a general create button helper.
* SimpleChat:MCUI: CreateSessionBtn helper, use wrt NewChat
Also fix a oversight wrt using stale data wrt the list of chat
sessions.
* SimpleChat:MCUI: NewChat btn first before existing chat sessions
* SimpleChat:MCUI:CornerCases:Skip new chat, show only if current
Skip NewChat if user cancels or if one waiting for response from
the ai model.
Dont show a chat with newly got ai model response, if current chat
session has changed, some how. Chat session shouldnt be allowed to
change, if there is a pending response, but still as a additional
sanity check.
* SimpleChat: Update readme, title, show usage if no chat to show
* SimpleChat: Cleanup the log/dialog messages a bit
* phi3 : duplicate rope factors in each layer
phi3 : set phi-3 model type as 14B
model loader : simplify the process for duplicating model tensors
llama-bench : remove default pg test
* replace bool parameters in llama_model_loader with named flags
* add phi3 128k support in convert-hf-to-gguf
* add phi3 128k support in cuda
* address build warnings on llama.cpp
* adjust index value in cuda long rope freq factors
* add long rope support in ggml cpu backend
* make freq factors only depend on ctx size
* remove unused rope scaling type 'su' frin gguf converter
* fix flint warnings on convert-hf-to-gguf.py
* set to the short freq factor when context size is small than trained context size
* add one line of comments
* metal : support rope freq_factors
* ggml : update ggml_rope_ext API to support freq. factors
* backends : add dev messages to support rope freq. factors
* minor : style
* tests : update to use new rope API
* backends : fix pragma semicolons
* minor : cleanup
* llama : move rope factors from KV header to tensors
* llama : remove tmp assert
* cuda : fix compile warning
* convert : read/write n_head_kv
* llama : fix uninitialized tensors
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* examples: cache hf model when --model not provided
* examples: cache hf model when --model not provided
* examples: cache hf model when --model not provided
* examples: cache hf model when --model not provided
* examples: cache hf model when --model not provided
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
* Update brute force test: special tokens
* Fix added tokens
- Try to read 'added_tokens.json'.
- Try to read 'tokenizer_config.json'.
- Try to read 'tokenizer.json'.
* Fix special tokens rtrim
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
* android : use "ci-android" branch for CI
* ggml : disable SIMD exp and silu for 32-bit ARM
ggml-ci
* android : do not fetch, use add_subdirectory instead
* cmake : provide binary dir
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size
* [server] Cleanup a memory leak on exit
There are a couple memory leaks on exit of the server. This hides others.
After cleaning this up, you can see leaks on slots. But that is another
patch to be sent after this.
* make tab into spaces
* feat: first things to do
* feat: create tensors for Jina architecture
* fix: use other tensors
* feat: embedding gets results
* fix: fix usage of ALIBI
* fix: clean prints
* fix: do some cleanup unused vars
* fix: revert changes to Makefile and CMakeLists
* fix: revert some changes
* fix: fix small detail
* fix: fix convert formatting
* fix: fix linting and editor
* feat: set proper vocab settings
* fix: JinaBertForMaskedLM registration
* feat: support q_normalization and k_normalization in Jina arch
* feat: handle gpt2 tokenizer with Jina architecture
* feat: example comments in embedding
* feat: rename Jina Bert to Jina Bert V2
* fix: add some changes as per review
* feat: proper KQ_pos for Jina embeddings
* feat: add capacity to load models ES and DE for Spanish
* llama : fix pre-tokenizers
* ggml : full ALiBi support
* ggml : update ggml_soft_max_ext() CUDA, SYCL
* ggml : ggml_flash_attn_ext() support ALiBi (CPU)
* ggml : ggml_flash_attn_ext() support ALiBi (Metal)
* ggml : fix warning
* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)
ggml-ci
* minor : clean-up
* embedding : add warning about missing SEP
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The llama.cpp grammar parser had a bug where forgetting to add a closing
quotation mark to strings would cause parsing to crash. Anyone running a
server on a public endpoint is advised to upgrade. To reproduce this bug
./llamafile -m foo.gguf -p bar --grammar 'root::="'
Credit for discovering and reporting this issue goes to Eclypsium
Security Researcher Richard Johnson <Richard.johnson@eclypsium.com>.
* Revert "Revert "llava : add support for moondream vision language model (#6899)""
This reverts commit 9da243b36a.
* Fix num_positions and embeddings initialization
* convert-hf : begin refactoring write_tensor
* convert : upgrade to sentencepiece v0.2.0
* convert-hf : remove unused n_dims in extra_*_tensors
* convert-hf : simplify MoE weights stacking
* convert-hf : flake8 linter doesn't like semicolons
* convert-hf : allow unusual model part names
For example, loading `model-00001-of-00001.safetensors` now works.
* convert-hf : fix stacking MoE expert tensors
`torch.stack` and `torch.cat` don't do the same thing.
* convert-hf : fix Mamba conversion
Tested to work even with a SentencePiece-based tokenizer.
* convert : use a string for the SentencePiece tokenizer path
* convert-hf : display tensor shape
* convert-hf : convert norms to f32 by default
* convert-hf : sort model part names
`os.listdir` is said to list files in arbitrary order.
Sorting the file names should let "model-00009-of-00042.safetensors"
be loaded before "model-00010-of-00042.safetensors".
* convert-hf : use an ABC for Model again
It seems Protocol can't be used as a statically type-checked ABC,
because its subclasses also can't be instantiated. (why did it seem to work?)
At least there's still a way to throw an error when forgetting to define
the `model_arch` property of any registered Model subclasses.
* convert-hf : use a plain class for Model, and forbid direct instantiation
There are no abstract methods used anyway,
so using ABC isn't really necessary.
* convert-hf : more consistent formatting of cmdline args
* convert-hf : align the message logged for converted tensors
* convert-hf : fix Refact conversion
* convert-hf : save memory with lazy evaluation
* convert-hf : flake8 doesn't like lowercase L as a variable name
* convert-hf : remove einops requirement for InternLM2
* convert-hf : faster model parts loading
Instead of pre-loading them all into a dict, iterate on the tensors
in the model parts progressively as needed in Model.write_tensors
Conversion for some architectures relies on checking for the presence
of specific tensor names, so for multi-part models, the weight map is read
from the relevant json file to quickly get these names up-front.
* convert-hf : minor changes for consistency
* gguf-py : add tqdm as a dependency
It's small, and used for a progress bar
in GGUFWriter.write_tensors_to_file
* Added themes support with two sample themes and a favicon.
* Newline
* Newline
* Newline
* Trailing whitespace
* Increased opacity for contrast
* Increase opacity.
Check actions cancelled for some other priority job and I can't seem to manually re-run them, so MOAR OPACITY
* Opacity action trigger.
Trying to re-trigger the cancelled action.
* One more opacity adjustment
This Actions pipeline is failing for random issues.
* Delete examples/server/themes/buttons_top/completion.js
This will be served from the static string built-in to server.
* Delete examples/server/themes/buttons_top/index.js
This will be served from the static string built-in to server.
* Delete examples/server/themes/wild/completion.js
This will be served from the static string built-in to server.
* Delete examples/server/themes/buttons_top/json-schema-to-grammar.mjs
This will be served from the static string built-in to server.
* Delete examples/server/themes/wild/index.js
This will be served from the static string built-in to server.
* Delete examples/server/themes/wild/json-schema-to-grammar.mjs
This will be served from the static string built-in to server.
* Replaced underscore.
* Introduce bfloat16 support
Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.
┌sign
│
│ ┌exponent
│ │
│ │ ┌mantissa
│ │ │
│┌──┴───┐┌─┴───┐
0b0000000000000000 brain16
This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.
┌sign
│
│ ┌exponent
│ │
│ │ ┌mantissa
│ │ │
│┌──┴───┐┌─┴───────────────────┐
0b00000000000000000000000000000000 IEEE binary32
The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16
which in practice ends up being 99.71% of Mistral 7b v0.2's weights
however there is currently no way other than fp32 to get the others
┌sign
│
│ ┌exponent
│ │
│ │ ┌mantissa
│ │ │
│┌─┴─┐┌─┴──────┐
0b0000000000000000 IEEE binary16
This change fixes that, by adding a bf16 data type to GGML. Support
for CPU inference has been implemented along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves somewhere around -0.0024 to -0.0046 compared to using fp16
* Remove GGML code that's not needed
* Minimize the GGML API surface area for BF16
* Remove bf16 luts
* Make the GGML header look nicer
* Fix documentation
* Apply ggerganov's fixes for test-backend-ops
* Add BF16 code for new ggml_validate_row_data() function
* Fixed save_imatrix to match old behaviour for MoE
This fix is simple and clear, but unnecessarily doubles the memory overhead..
* Fixed missing idx variable
* Unconditionally increment ncall
Co-authored-by: slaren <slarengh@gmail.com>
* Fixed 2 bugs in save_imatrix()
- Fixed segfault bug because the counts vector needed to be created.
- Fixed pre-existing bug didn't actually add to the counts for "--combine" option.
* ncall needs summing too
* Trailing whitespace
---------
Co-authored-by: slaren <slarengh@gmail.com>
* Update log text (EOS to EOG)
The log text "found EOS" is no longer always correct, here, because there is now an is-EOG check that also returns true for EOT.
* Improve log msg. further by using "an" instead of "some".
As suggested, to avoid misunderstanding (no multiple EOG tokens found, just one).
This will reproduce the issue in llama13b
{
'prompt': 'Q: hello world \nA: ',
'stop': ['\n'],
'temperature': 0.0,
'n_predict': 10,
'cache_prompt': True,
'n_probs': 10
}
* ggml : add ggml_flash_attn_ext API
* ggml : fix GQA support in ggml_flash_attn_ext
* ggml : online attention (CPU)
* metal : initial implementation
* metal : f16 precision
* metal : reduce branches
* metal : specialize for head size
* wip : 8 rows per simd group
* wip : 4 rows per simd group
* wip : template for rows per warp
* metal : parallelize across KV size
* metal : parallel reduce across heads
* metal : efficient flash_attn_f16 implementation
* metal : avoid redundant loads of the attention
* metal : scale and mask in matrix form
* metal : fix comment
* llama : avoid ggml_cast, use F32 query
* metal : add parallel reduce version (disabled)
* metal : move output into local memory + optimize
- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments
* metal : add tests, fix scaling, support C > 32
* metal : improve precision
* ggml : fix f16 mad
* metal : minor
* metal : support Q > 8
* tests : add ATTN tests
* metal : disable buffer allocation logs
* tests : more
* metal : faster inner loop for C == 32
* metal : fix array initialization
* tests : ifdef
* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext
* ggml : fix ggml_soft_max mask requirement
* cuda : fix soft_max to use correct mask size
* cuda : add flash_attn kernel (wip)
* metal : optimize softmax for C > 32
* metal : optimize softmax
* tests : minor fix
* cuda : avoid zeroing fragments
* tests : update dims
* cuda : fix __hisinf() result check
* cuda : avoid warp_reduce for smax
* cuda : use int instead of int64_t
Noticeably improves performance (thanks to Johannes)
* cuda : make loops use the same loop values
Thanks Johannes again for the tip
* cuda : unroll some of the loops
* cuda : avoid __hisinf branches
* cuda : use half2 in softmax
* cuda : switch to 1 warp for bs > 16
* cuda : speed-up reduce part of the kernel
* cuda : unroll Q*K^T loop
* cuda : fix -INF block check
* cuda : simplify softmax
* cuda : fix matrix names
* cuda : minor
* llama : adapt to F16 KQ_pos
* llama : adapt new models to F16 KQ_mask
* ggml : fix F16 store (ARM NEON)
* llama : fix type of KQ_mask and KQ_pos
* ggml : fix CPU soft_max
* tests : add hs=256
* cuda : fix build
* metal : improve perf via smaller int registers
* cuda : adapt soft_max to F16 mask and pos
* CUDA: faster FlashAttention, kernel for bs == 1
* 16 cols for Phi-2
* no vec for hs, no hs==256 ncols==32 for Volta
* adjust kernel selection logic
* 4 warps, 256 stride for all D
* no ncols == 64
* Multiple parallel blocks for batch size 1
* fix compile warnings
* fix excessive KQ_b loads
* fix cmake build
* fix KV cache padding, NaN from INFINITY (#6438)
* llama : flash_attn cparam + fix defrag
* server: support flash_attn param
* server: bench: enable flash_attn param
* CUDA: refactor host code, dyn. par. blocks
* fix flash_attn_vec_f16 race condition
* flush softmax exp below threshold to 0
* store temp KQ in registers
* Calculate KQ as FP32 if KQV has GGML_PREC_F32
* Add __hgt2_mask implementation for CUDA 11
* fix KQ FP32 precision fpr parallel_blocks > 1
* llama-bench : add -fa,--flash-attn arg
* metal : add BS=1 kernel for flash attention (#6508)
* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel
* metal : use F32 attention accumulators
* batched-bench : add fattn arg
* llama : simplify llama_build_kv_store
ggml-ci
* llama : adapt build_olmo to changes
* ggml : fix arm fp16 store on windows
* metal : clean-up
* metal : clean-up kernel code
* metal : minor
* tests : remove benchmarks
ggml-ci
* ggml : fix avx512 const correctness
ggml-ci
* ggml : fix soft_max with bias on CPU
ggml-ci
* common : print --flash-attn in help
* ggml : fix num dimensions in ggml_flash_attn_ext
* llama : force disable flash attention for incompatible models
* ggml : ggml_soft_max support F16/F32 mask/pos
ggml-ci
* cuda : uint -> uint32_t
* cuda : "constexpr dim3" -> "const dim3"
ggml-ci
* cuda : try to fix __hgt2_mask
ggml-ci
* ggml : add TODO's for F16/F32 mask/pos support in other backends
* llama : replace bool need_kq_pos with use_alibi
* llama : prep ALiBi support for BERT models
ggml-ci
* llama : fix n_batch requirements
ggml-ci
* cont
* server : add help for --flash-attn arg
* llama : disable FA for AMD
* tests : remove TMP_ATTN_BENCH
ggml-ci
* llama : support save/load state with FA enabled
ggml-ci
* ci : add CUDA save-load-state tests
ggml-ci
* llama : llama_kv_cache_clear zeroes data + fix save-load seq
ggml-ci
* llama : fix copy-paste errors, add TODO
* llama : disallow incompatible states
* llama : update llama_state_get_size after v_trans field
* metal : remove tmp log
* llama : add static reminder for llama_state_get_size
* metal : fix max nsg
ggml-ci
* ci : fix arg order
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
* imatrix: save the dataset file used in the output file
* llama: support kv overrides type string string
* common: factorize KV Overrides parsing between common and server
* quantize: add imatrix n entries and dataset KV metadata
quantize: factorize KV Overrides parsing between common
#6656
* llama: remove kv override str_value initialization as it does not compile on some toolchain
* quantize: add imatrix m_last_call as `quantize.imatrix.chunks_count`
* quantize: add imatrix filename in KV
* llama: add llama_model_kv_override_free
* common: add llama_model_kv_override_free
common: free kv override if used after model loading
* llama: finally move the string KV override value to the stack
* llama : minor
* no need to add a NUL to the std::vector, std::string can be initialized from a pair of iterators.
Co-authored-by: slaren <slarengh@gmail.com>
* kv override: ensure string termination
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
* server: cap n_predict if not set to n_ctx_train
* server: fix infinite loop
* server: infinite loop, move in process_token
server: infinite loop: set stop limit to true
* minor: spaces
* minor: spaces
* server: include prompt tokens in the EOS limit
* add support for moondream vision language model
This required making the following changes to the CLIP model:
1. Support for patch embedding bias.
2. Make class embedding and pre-layernorm optional.
3. Add support for post-layernorm.
* Update examples/llava/clip.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit renamesthe lerp (linear interpolation) function in clip.cpp
to avoid a conflict with the lerp function in the <cmath> standard C++
library when using c++20.
The motivation for this change is to enable projects that use c++20 to
be able to compile clip.cpp without having to resort to patching it. The
lerp function was added to cmath in version C++20 (202002L) and is why
this is not causing any issue at the moment as C++11/C++17 is currently
used by llama.cpp.
I realize that llama.cpp uses either C++11 (or C++17 in the case for
SYCL) but wanted to ask if this would be an acceptable change just the
same.
Refs: https://en.cppreference.com/w/cpp/numeric/lerp
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Implement '--keep-split' to quantize model into several shards
* Add test script
* Update examples/quantize/quantize.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Split model correctly even if tensor id is out-of-order
* Update llama_model_quantize_params
* Fix preci failures
---------
Co-authored-by: z5269887 <z5269887@unsw.edu.au>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix: revert showing control tokens by default
* feat: revert changes to default behavior of llama_token_to_piece; provide overridden declaration to receive "bool special" param to toggle showing control tokens
* feat: use the overridden declaration of llama_token_to_piece from common/common.cpp to specify "false" so that control tokens are not shown in chat completion responses"
* common : simplify
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* `build`: generate hex dumps of server assets on the fly
* build: workaround lack of -n on gnu xxd
* build: don't use xxd in cmake
* build: don't call xxd from build.zig
* build: more idiomatic hexing
* build: don't use xxd in Makefile (od hackery instead)
* build: avoid exceeding max cmd line limit in makefile hex dump
* build: hex dump assets at cmake build time (not config time)
* Support Llama 3 conversion
The tokenizer is BPE.
* style
* Accept suggestion
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
* llama : add llama_token_is_eog()
ggml-ci
* llama : auto-detect more EOT tokens when missing in KV data
* convert : replacing EOS token is a hack
* llama : fix codegemma EOT token + add TODOs
* llama : fix model type string for 8B model
---------
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy
* cuda : fix bin bcast with non-cont src0
* test-backend-ops : only run all mul mat tests for base types
* llama : disable moe offloading with SYCL
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.
This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.
On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.
This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
This commit updates the hf.sh script usage to include the --outdir option
and specifies the models directory as the output directory.
The motivation for this is to avoid cluttering the root directory with
model files.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Fix --split-max-size
Byte size calculation was done on int and overflowed.
* add tests.sh
* add examples test scripts to ci run
Will autodiscover examples/*/tests.sh scripts and run them.
* move WORK_PATH to a subdirectory
* clean up before and after test
* explicitly define which scripts to run
* add --split-max-size to readme
* disable mmap to fix memcpy crash, add missed cmd in guide, fix softmax
* refactor to disable mmap for SYCL backend
* fix compile error in other os
* refactor the solution, use host buf to fix it, instead of disable mmap
* keep to support mmap()
* use host buff to reduce malloc times
* revert to malloc/free solution, for threaad safe
* infill : add download instructions for model
This commit adds instructions on how to download a CodeLlama model
using the `hf.sh` script. This will download the model and place it
in the `models` directory which is the same model use later by the
infill example.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* squash! infill : add download instructions for model
Clarify the reason for using CodeLlama.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* gguf-debug: Example how to use ggml callback for debugging
* gguf-debug: no mutex, verify type, fix stride.
* llama: cv eval: move cb eval field in common gpt_params
* ggml_debug: use common gpt_params to pass cb eval.
Fix get tensor SIGV random.
* ggml_debug: ci: add tests
* ggml_debug: EOL in CMakeLists.txt
* ggml_debug: Remove unused param n_batch, no batching here
* ggml_debug: fix trailing spaces
* ggml_debug: fix trailing spaces
* common: fix cb_eval and user data not initialized
* ci: build revert label
* ggml_debug: add main test label
* doc: add a model: add a link to ggml-debug
* ggml-debug: add to make toolchain
* ggml-debug: tests add the main label
* ggml-debug: ci add test curl label
* common: allow the warmup to be disabled in llama_init_from_gpt_params
* ci: add curl test
* ggml-debug: better tensor type support
* gitignore : ggml-debug
* ggml-debug: printing also the sum of each tensor
* ggml-debug: remove block size
* eval-callback: renamed from ggml-debug
* eval-callback: fix make toolchain
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit adds an option to the gguf example to not check the tensor
data.
The motivation for this is that it can be nice to use the gguf tool to
read other .gguf files that were not created by the gguf tool.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Key changes:
* BERT conversion: fix abuse of LlamaHfVocab, do not set BOS or EOS
* Nomic Embed conversion: pad vocab instead of slicing embedding tensor
* llama_tokenize: handle added special tokens like HF does
* llama : save and restore kv cache for single seq id
* remove trailing whitespace
* respond error in case there's no space in the kv cache
* add kv seq save restore to test case
* add --slot-save-path arg to enable save restore and restrict save location
* Returning 0 for some cases, instead of asserting.
* cleanup error cases
* rename sequence state functions
* rename state get set functions
* add previous function names back in with DEPRECATED notice
* update doc
* adjust endpoints to preferred style
* fix restoring zero cell count
* handle seq rm return value
* unused param
* keep in the size check
* fix return types
* add server test case for slot save restore
* cleanup
* add cake
* cleanup style
* add special
* removing a whole sequence never fails
* move sequence state file functionality from server to llama to match session api and add version tags
* catch exceptions on save as well
* error log messages
* check types for stricter restore
* update server doc
* readme : update API changes date
* strict filename validation
* move include, reject bom as well
* also reject empty filename
* reject whitespace and trailing dot
---------
Co-authored-by: Martin Evans <martindevans@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ci: bench: support sse and fix prompt processing time
server: add tokens usage in stream mode
* ci: bench: README.md EOL
* ci: bench: remove total pp and tg as it is not accurate
* ci: bench: fix case when there is no token generated
* ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics
* ci: bench: fix finish reason rate
* ci: bench: change trigger path to not spawn on each PR
* ci: bench: add more file type for phi-2: q8_0 and f16.
- do not show the comment by default
* ci: bench: add seed parameter in k6 script
* ci: bench: artefact name perf job
* Add iteration in the commit status, reduce again the autocomment
* ci: bench: add per slot metric in the commit status
* Fix trailing spaces
* Typo fix to server's README.md
Fix minor typo ("tonen") in server README.
* server readme grammar/style fixes.
Quickly went through this file to look for inconsistencies in
presentation of defaults, flag options, and looked for typos
and grammar issues.
Not perfect, but hopefully improved.
* Update README.md
Remove an extra space before newline.
* ggml : update mul_mat_id to use the same tensor for all the experts
* update cuda
* minor
* update metal
* update test-backend-ops
* fix cuda
* Update ggml-metal.m
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update convert.py
* update convert-hf-to-gguf.py
* update convert.py for mixtral hf models
* Update convert-hf-to-gguf.py
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* cuda : support non-pow-2 number of experts
* allow quantize to work for split and merged experts models in the same way
* cleanup + disable mmap automatically with split tensors models
* update imatrix
* test-backend-ops : test qwen argsort
* update grok model loading
* llama : add merged experts tensors to the grok tensor map
* minor
* gguf : bump version
* fix quantizing of merged experts
* convert-hf-to-gguf.py : update grok (untested)
* make linter happy
* cuda/argsort : use shared memory instead of pool memory
* convert : fix grok tensor names
* metal : add support for non-pow-2 argsort
* llama : more loader cleanup, better error checking
* cuda : fix warning
* llama : still use mmap for loading old models, but copy the data to a host buffer
* add review note
* llama : remove ffn tensor counting + add sanity check
ggml-ci
* convert : fix handling of n_experts == None
ggml-ci
* imatrix : fix ncall counters
* llama : produce error if imatrix size does not match
* quantize : terminate on errors + trace logs
ggml-ci
* metal : pad shared memory to 16 bytes
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* embedding : show full embedding for single prompt
To support the use case of creating an embedding for a given prompt, the entire embedding and not just the first part needed to be printed.
Also, show cosine similarity matrix only if there is more than one prompt, as the cosine similarity matrix for a single prompt is always `1.00`.
* Update examples/embedding/embedding.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : greatly reduce logits memory usage
* llama : more compact state saving and reloading
* llama : fix lctx.n_outputs not being set before building graph
* perplexity : adapt to the logits API changes
* perplexity : fix Winogrande, use correct logits for second choice start
The first logits used to evaluate the second choice were not from
the end of the common prefix; instead, they were the logits from the end
of the first choice. This has been corrected.
The previous implementation sometimes had outliers in the scores of
choices for some tasks, and the logic to skip choices words
in the log-likelihood evaluation probably was an attempt to reduce those,
but it was complex and didn't quite seem to be the right thing.
This is simpler now, and the outlier scores aren't there anymore.
* perplexity : normalize spaces and punctuation in Winogrande sentences
* llama : fix embedding conditions
* llama : fix llama_get_embeddings_ith when the resulting id is 0
* llama : fix wrong n_outputs in llama_set_inputs
A mismatch happened when using a smaller n_ubatch than n_batch and then using
llama_batch_get_one(). The decision of what n_outputs should be now almost
fully depends on how lctx.n_outputs is set in llama_decode_internal.
The conditions are simpler this way.
* llama : when saving the state, recalculate n_outputs
This ensures the correct number of outputs for the entire previous batch
is stored in the session file, even when n_ubatch is smaller than n_batch.
* llama : fix not-skipping outputs of non-causal models
* llama : fix running a batch with n_outputs == 0
It previously worked because lctx.inp_out_ids was not initialized,
so it pointed to some garbage address which was somehow still valid when I
ran my tests.
* llama : keep same graph topology even when n_outputs == 0
* ggml : saner ggml_can_repeat with empty tensors
* ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1
* ggml : do not multi-thread ops returning empty tensors
* ggml : make ggml_is_empty public and work with views
* llama : use a vector for ctx->output_ids
* llama : rework reallocation logic for llama_output_reserve
Now comparing the actual size with the new total size of the output buffer
to allow more efficient enabling and disabling of the embeddings
and/or logits output in the future.
* ggml : skip empty tensors in all backends
* llama : fix llama_output_reserve nullptr deref when new_size is 0
* perplexity : make Winogrande work as it does on master
The problems with the Winogrande implementation will
need to be fixed in a separate PR to ease review.
* llama : clearer error messages for invalid logits or embeddings ids
* llama : assert all models that can have inp_out_ids
Since the graph topology is now constant, this presence check
can be done even when there are no outputs.
* llama : assert logits and embd buffers exist before writing to them
* llama : handle errors from llama_output_reserve at call sites
* perplexity : make hellaswag and multiple-choice outputs identical to master
Due to how the KV cache is updated, the logprobs for tokens in a batch
are very slightly affected by the other tokens present in the batch,
so to make hellaswag and multiple-choice return exactly the same results
as on master, the last token of each sequence needs to be evaluated
even though its output is not used at all.
This will probably be changed back in the future to make these benchmarks
a tiny bit faster.
* perplexity : fix division by zero when using less than 100 multiple-choice tasks
* llama : allow loading state saved with a different ctx size
When loading a session file, the context size is now only required to be
at least enough to load the KV cells contained in that session file,
instead of requiring to use exactly the same context size as when saving.
Doing this enables the use-case of extending or shrinking the context size
of a saved session.
This breaks existing session files because the meaning of kv_buf_size
is slightly changed (previously it was the size of the whole KV cache,
now it's only the size of the saved part of it). This allows for
finer-grained sanity checks when loading in an effort to keep kv_buf_size
useful even when the kv_size is changed.
* llama : minor
ggml-ci
* readme : update recent API changes, and warn about Vulkan
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* iq1_m: basics
* iq1_m: basics-2
* iq1_m: CUDA dequantize works
Very 1st shot I get PPL = 9.76 for LLaMA-v2-7B.
* iq1_m: separate shifts for each group of 8 in a block
We get
PPL(LLaMA-v2-7B ) = 9.2810
PPL(LLaMA-v2-13B) = 6.8105
Not bad, but slightly higher than
sqrt(PPL(IQ1_S) * PPL(IQ2_XXS))
which is the expected outcome given that IQ1_M is
halfway between IQ1_S and IQ2_XXS in terms of bpw.
From this, we would expect
PPL = 9.14 for LLaMA-v2-7B
PPL = 6.63 for LLaMA-v2-13B
* iq1_m: go to 3-bit scales
There is slight increase in PPL, but the 0.0625 bpw reduction
in size is totally worth it.
We now have
PPL(LLaMA-v2-7B ) = 9.4469 at 1.96 bpw
PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw
PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw
* iq1_m: scalar dot product
* iq1_m: AVX2 dot product
* iq1_m: very slightly faster AVX2 dot product
* iq1_m: ARM_NEON dot product
Works, but very slow (10.5 t/s)
* iq1_m: Metal - dequantize works, dot product does not
* iq1_m: Metal now works
About the same performance as iq1_s.
* iq1_m: minor
* iq1_m: checking pure iq1_m quantization
It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight
with Q4_K.
* iiq1_m: slightly faster ARM_NEON dot product
10.5 t/s -> 11.65 t/s
* iq1_m: faster ARM_NEON dot product
11.65 t/s -> 14.9 t/s
* iq1_m: another minor ARM_NEON dot product improvement
14.9 -> 15.0 t/s
* iq1_m: small PPL improvement via super-block scale adjustment
After quantizing block scales redo the super-block scale fit.
PPL(LLaMA-v2-7B ) = 9.3346
PPL(LLaMA-v2-13B) = 6.8419
PPL(LLaMA-v2-70B) = 4.8294
PPL(Mistral-7B ) = 8.1624
* iq1_m: adapt to CUDA refactoring
* iq1_m: remove unused variable
We have progressed to warnings being errors.
* iq1_m: add to backend-ops tests
* iq1_m: fix Windows ARM
* iq1_m: use common definition of iq1m_scale_t
* cuda: assert -> NO_DEVICE_CODE
* iq1_M: PR comments
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* quantize: be able to override metadata by key
* minor : spacing
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* embedding: assign `n_ubatch` value, print error on `n_batch` overflow
* Update examples/embedding/embedding.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* use %ld instead of %lld
* Revert "use %ld instead of %lld"
This reverts commit ea753ede90.
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* sampling: remove duplicated code for probability distribution access
* free original_logits
* fix original_logits allocation
* fixes based on review @cebtenzzre
* change function name to `llama_sampling_prepare`
* llama: llama_split_prefix fix strncpy does not include string termination
common: llama_load_model_from_url:
- fix header name case sensitive
- support downloading additional split in parallel
- hide password in url
* common: EOL EOF
* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition
* common: change max url max length
* common: minor comment
* server: support HF URL options
* llama: llama_model_loader fix log
* common: use a constant for max url length
* common: clean up curl if file cannot be loaded in gguf
* server: tests: add split tests, and HF options params
* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda
* server: tests: enable back Release test on PR
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* quantize: be able to specify the output tensor type
* quantize: be able to specify the token embedding tensor type
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* split: support in llama_model_loader
* avoid copying the entire vector
Co-authored-by: slaren <slarengh@gmail.com>
* split: move llama_tensor_offset to llama_model_loader
* llama_model_loader: PR feedbacks:
- use only one gguf_context for metadata only
- store all ggml_context in a vector as the files and mappings
- store all weights in a vector along with the source tensor
- rename ctx_gguf to meta
- rename ctx_meta to contexts
* avoid copying the entire vector
* Simplify this by making these optional, switch some layer creation tensor optional
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Handle optional tensors
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama_model_loader: fail if backend cannot allocate buffer
* fix mmap buffer management
* llama_model_loader: map file to backend buffer if the allocation succeeds only
* llama_model_loader: only map tensors included in the context
* llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast
* llama_model_loader: fail if any of backend buffer cannot be allocated
* spacing
Co-authored-by: slaren <slarengh@gmail.com>
* fix loop over pointer
Co-authored-by: slaren <slarengh@gmail.com>
* llama_model_loader: if n_tensors declared not equals to loaded tensors in split, throw an exception instead of asserting
* llama_model_loader: ensure mappings vector has the expected size
* llama_model_loader: use at instead of operator[] if this should never add to the map.
* llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size.
* llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer
* llama_model_loader: fix map -> unordered map
* llama_split_prefix: use a clearer version, not pass split path len but dest max len.
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* llama : minor
ggml-ci
* llama : introduce some typedef helpers
* docs: add model shard in hot topic
* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated
Co-authored-by: slaren <slarengh@gmail.com>
* fix llama_split_prefix
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* k_cache: be able to use Q5_0
* k_cache: be able to use Q5_1 on CODA
* k_cache: be able to use Q5_0 on Metal
* k_cache: be able to use Q5_1 on Metal
* k_cache: be able to use IQ4_NL - just CUDA for now
* k_cache: be able to use IQ4_NL on Metal
* k_cache: add newly added supported types to llama-bench and CUDA supports_op
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* server tests : remove seemingly redundant newlines in print()
* server tests : use built-in subprocess features, not os.kill and psutil
* server tests : do not catch e.g. SystemExit; use print_exc
* server tests: handle TimeoutExpired exception
* server tests: fix connect on dual-stack systems
* server: tests: add new tokens regex on windows generated following new repeat penalties default changed in (#6127)
* server: tests: remove the hack on windows since now we get the good socket family
* server: tests: add new tokens regex following new repeat penalties default changed in (#6127)
* server: tests: add new tokens regex following new repeat penalties default changed in (#6127)
---------
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
* gguf-split: split and merge gguf files per tensor
* gguf-split: build with make toolchain
* gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split
* split : minor style + fix compile warnings
* gguf-split: remove --upload not implemented
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* backend : offload large batches to GPU
* fix hip
* code cleanup
* fix CUDA split buffers
* Update ggml-backend-impl.h
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cuda : fix memset without set_device
* imatrix : remove sched affix from weight names
* sched : add a new split if the current one has too many inputs
reduce max inputs per split
more cleanup
* update backends
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>