# SimpleChat

by Humans for All.

## overview

This simple web frontend allows triggering/testing the server's /completions or /chat/completions endpoints
in a simple way, with minimal code from a common code base. In turn it additionally tries to allow single or
multiple independent back and forth chats, to an extent, with the ai llm model at a basic level, each with
its own system prompt.

This allows seeing the generated text / ai-model response in oneshot at the end, after it is fully generated,
or potentially as it is being generated, in a streamed manner from the server/ai-model.

It auto saves the chat session locally as the chat progresses. When you open SimpleChat at a later time,
an option is provided to restore the old chat session, if a matching one exists.

The UI follows a responsive web design, so that the layout can adapt to the available display space in a
usable enough manner, in general.

It allows the developer/end-user to control some of the behaviour by updating gMe members from the browser's
devel-tool console. In parallel, some of the settings directly useful to the end-user can also be changed
using the provided settings ui.

NOTE: The current web service api doesn't expose the model context length directly, so the client logic doesn't
provide any adaptive culling of old messages, nor replacing them with a summary of their content, etc. However
there is an optional sliding window based chat logic, which provides a simple minded culling of old messages
from the chat history before sending to the ai model.

NOTE: Wrt options sent with the request, it mainly sets temperature, max_tokens and optionally stream for now.
However if someone wants, they can update the js file or the equivalent member in gMe as needed.

NOTE: One may be able to use this to chat with the openai api web-service /chat/completions endpoint, in a very
limited / minimal way. One will need to set the model, openai url and authorization bearer key in the settings ui.

## usage

One could run this web frontend directly using the server itself. Or, if anyone is thinking of adding a built in
web frontend to configure the server over http(s) or so, then run this web frontend using something like
python's http module.

### running using examples/server

bin/server -m path/model.gguf --path ../examples/server/public_simplechat [--port PORT]

### running using python3's server module

first run examples/server
* bin/server -m path/model.gguf

next run this web front end in examples/server/public_simplechat
* cd ../examples/server/public_simplechat
* python3 -m http.server PORT

### using the front end

Open this simple web front end from your local browser

* http://127.0.0.1:PORT/index.html

Once inside

* If you want to, you can change many of the default global settings
  * the base url (ie ip addr / domain name, port)
  * chat (default) vs completion mode
  * try trim garbage in response or not
  * amount of chat history in the context sent to server/ai-model
  * oneshot or streamed mode.

* In completion mode
  * one normally doesn't use a system prompt in completion mode.
  * the logic by default doesn't insert any role specific "ROLE: " prefix wrt each role's message.
    If the model requires any prefix wrt user role messages, then the end user has to
    explicitly add the needed prefix, when they enter their chat message.
    Similarly if the model requires any prefix to trigger the assistant/ai-model response,
    then the end user needs to enter the same.
    This keeps the logic simple, while still giving flexibility to the end user to
    manage any templating/tagging requirement wrt their messages to the model.
  * the logic doesn't insert a newline at the beginning or end of the generated prompt message.
    However if the chat being sent to the /completions endpoint has more than one role's message,
    then a newline is inserted when moving from one role's message to the next role's message, so
    that they can be clearly identified/distinguished.
  * given that the /completions endpoint normally doesn't add additional chat-templating of its
    own, the above ensures that the end user can create a custom single/multi message combo with
    any tags/special-tokens related chat templating to test out the model handshake. Or the end user
    can use it just for a normal completion related/based query.

* If you want to provide a system prompt, then ideally enter it first, before entering any user query.
  Normally Completion mode doesn't need a system prompt, while Chat mode can generate better/interesting
  responses with a suitable system prompt.
  * if chat.add_system_begin is used
    * you can't change the system prompt, after it has been submitted once along with a user query.
    * you can't set a system prompt, after you have submitted any user query
  * if chat.add_system_anytime is used
    * one can change the system prompt any time during the chat, by changing the contents of system prompt.
    * in turn the updated/changed system prompt will be inserted into the chat session.
    * this allows the subsequent user chatting to be driven by the new system prompt set above.

* Enter your query and either press enter or click on the submit button.
  If you want to insert a newline (\n) as part of your chat/query to the ai model, use shift+enter.

* Wait for the logic to communicate with the server and get the response.
  * the user is not allowed to enter any fresh query during this time.
  * the user input box will be disabled and a working message will be shown in it.
  * if trim garbage is enabled, the logic will try to trim repeating text kind of garbage to some extent.

* just refresh the page, to reset the chat history and/or system prompt and start afresh.

* Using NewChat one can start independent chat sessions.
  * two independent chat sessions are set up by default.

* When you want to print, switch ChatHistoryInCtxt to Full and click on the chat session button of
  interest. This will display the full chat history of that session till then, ready for printing.

## Devel note

### Reason behind this

The idea is to be easy enough to use for basic purposes, while also being simple and easily discernible
by developers who may not be from a web frontend background (so in turn may not be familiar with template /
end-use-specific-language-extensions driven flows), so that they can use it to explore/experiment with things.

And given that the idea is also to help developers explore/experiment, some flexibility is provided
to change behaviour easily using the devel-tools/console or the provided minimal settings ui (wrt a few aspects).
Skeletal logic has been implemented to explore some of the end points and ideas/implications around them.

### General

Me/gMe consolidates the settings which control the behaviour into one object.
One can see the current settings, as well as change/update them, using the browser's devel-tool/console.
It is attached to the document object. Some of these can also be updated using the Settings UI.

baseURL - the domain-name/ip-address and in turn the port to send the request to.

bStream - control between oneshot-at-end and live-stream-as-its-generated collating and showing
of the generated response.

the logic assumes that the text sent from the server follows utf-8 encoding.

in streaming mode - if there is any exception, the logic traps the same and tries to ensure
that the text generated till then is not lost.

if a very long text is being generated, which leads to no user interaction for some time and
in turn the machine goes into a power saving mode or so, the platform may stop the network connection,
leading to an exception.
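
The streamed flow described above can be sketched roughly as below. This is only an illustration, not SimpleChat's actual code: `LineBuffer`, `extractDelta` and `collateStream` are hypothetical names, and the wire format (lines of the form `data: {json}` carrying `choices[0].delta.content`, ending with `data: [DONE]`) is an assumption based on OpenAI style streaming. `TextDecoder` with `stream: true` is the standard web api that holds back a partly received multi-byte utf-8 char until the next chunk arrives.

```javascript
// Hypothetical helper: collects decoded text and hands back complete lines.
class LineBuffer {
    constructor() {
        this.pending = "";
    }

    // Append freshly decoded text to whatever partial line is pending.
    add(text) {
        this.pending += text;
    }

    // Return the next complete line (without its newline), or null.
    // If flush is true, a trailing line with no newline is returned too.
    shift(flush = false) {
        const idx = this.pending.indexOf("\n");
        if (idx === -1) {
            if (!flush || this.pending.length === 0) return null;
            const last = this.pending;
            this.pending = "";
            return last;
        }
        const line = this.pending.slice(0, idx);
        this.pending = this.pending.slice(idx + 1);
        return line;
    }
}

// Assumed wire format: 'data: {"choices":[{"delta":{"content":"Hi"}}]}'.
function extractDelta(line) {
    if (!line.startsWith("data:")) return "";
    const body = line.slice(5).trim();
    if (body === "" || body === "[DONE]") return "";
    const json = JSON.parse(body);
    return json.choices?.[0]?.delta?.content ?? "";
}

// Feed raw byte chunks, in arrival order, and collate the generated text.
function collateStream(chunks) {
    const decoder = new TextDecoder("utf-8");
    const lines = new LineBuffer();
    let text = "";
    chunks.forEach((chunk, i) => {
        // stream:true keeps an incomplete multi-byte char for the next chunk.
        lines.add(decoder.decode(chunk, { stream: true }));
        const done = i === chunks.length - 1;
        let line;
        while ((line = lines.shift(done)) !== null) {
            text += extractDelta(line);
        }
    });
    return text;
}
```

In a browser the chunks would come one at a time from the response body reader rather than from an array. Wrapping each read in a try/catch and keeping `text` outside it is one way to retain what was collated if the connection drops midway.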

apiEP - select between the /completions and /chat/completions endpoints provided by the server/ai-model.

bCompletionFreshChatAlways - whether Completion mode collates the complete/sliding-window history when
communicating with the server, or only sends the latest user query/message.

bCompletionInsertStandardRolePrefix - whether Completion mode inserts a role related prefix wrt the
messages that get inserted into the prompt field wrt the /completions endpoint.
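
As a rough illustration of how the prompt field for /completions could be put together from the chat messages: the sketch below is hypothetical (the function name, the message shape and the exact "role: " prefix text are assumptions) and it assumes consecutive messages come from different roles. It adds no newline at the beginning or end, only between messages.

```javascript
// messages: [{ role: "system" | "user" | "assistant", content: string }, ...]
// Hypothetical helper, not the actual SimpleChat implementation.
function buildCompletionPrompt(messages, bInsertStandardRolePrefix = false) {
    return messages
        // Optionally prefix each message with its role, eg "user: ".
        .map((msg) => (bInsertStandardRolePrefix ? `${msg.role}: ` : "") + msg.content)
        // Newline only between messages, none at the start or the end.
        .join("\n");
}
```

With the prefix disabled, which the usage notes describe as the default, the end user stays in full control of any tags or special tokens that go into the prompt.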

bTrimGarbage - whether garbage repetition at the end of the generated ai response should be
trimmed or left as is. If enabled, it will be trimmed so that it won't be sent back as part of
the subsequent chat history. At the same time the actual trimmed text is shown to the user once,
when it is generated, so the user can check if there was any useful info/data in the response.

One may be able to request the ai-model to continue (wrt the last response) (if chat-history
is enabled as part of the chat-history-in-context setting), and chances are the ai-model will
continue starting from the trimmed part, thus allowing a long response to be recovered/continued
indirectly, in many cases.

The histogram/freq based trimming logic is currently tuned for the english language wrt its
is-it-an-alphabetic|numeral-char regex match logic.
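
To give a feel for trimming of repeating garbage at the end of a response, here is a deliberately simpler stand-in: a repeated-suffix check, not the histogram/freq based logic mentioned above. The function name and the `maxUnitLen` / `minRepeats` parameters are hypothetical. It returns both the kept text and the trimmed part, so the latter could still be shown to the user.

```javascript
// Hypothetical sketch: trim a tail made of the same short unit repeated
// back to back at least minRepeats times.
function trimRepeatedTail(text, maxUnitLen = 16, minRepeats = 4) {
    for (let unitLen = 1; unitLen <= maxUnitLen; unitLen++) {
        // Not enough text left to hold minRepeats copies of this unit size.
        if (text.length < unitLen * minRepeats) break;
        const unit = text.slice(text.length - unitLen);
        // Walk backwards while the preceding block equals the unit.
        let start = text.length - unitLen;
        let repeats = 1;
        while (start - unitLen >= 0 && text.slice(start - unitLen, start) === unit) {
            start -= unitLen;
            repeats++;
        }
        if (repeats >= minRepeats) {
            return { trimmed: text.slice(0, start), garbage: text.slice(start) };
        }
    }
    return { trimmed: text, garbage: "" };
}
```

A check this simple also trims legitimate repeats such as a row of dots, so it only makes sense as an optional, user controllable step.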

chatRequestOptions - maintains the list of options/fields to send along with the chat request,
irrespective of whether the /chat/completions or /completions endpoint is used.

If you want to add additional options/fields to send to the server/ai-model, and/or
modify the existing options' values or remove them, for now you can update this global var
using the browser's development-tools/console.

For string and numeric fields in chatRequestOptions, including even those added by a user
at runtime by directly modifying gMe.chatRequestOptions, settings ui entries will be auto
created.

headers - maintains the list of http headers sent when a request is made to the server. By default
Content-Type is set to application/json. Additionally an Authorization entry is provided, which can
be set if needed using the settings ui.
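
How these members could come together into one request is sketched below. It is an assumption laden illustration: the `me` object only mimics the gMe members named above, its values (port 8080, temperature 0.7) are illustrative, and `buildChatRequest` is a hypothetical helper. Empty header values are skipped, so Authorization is sent only once the user has set it.

```javascript
// Hypothetical stand-in for the relevant gMe members; values are illustrative.
const me = {
    baseURL: "http://127.0.0.1:8080",
    bStream: false,
    headers: { "Content-Type": "application/json", "Authorization": "" },
    chatRequestOptions: { temperature: 0.7, max_tokens: 1024 },
};

// Build the url and fetch options for a /chat/completions request.
function buildChatRequest(me, messages) {
    const headers = {};
    for (const [name, value] of Object.entries(me.headers)) {
        // Skip empty entries, so an unset Authorization is not sent.
        if (value.trim() !== "") headers[name] = value;
    }
    // Every field in chatRequestOptions goes into the request body as is.
    const body = { ...me.chatRequestOptions, messages };
    if (me.bStream) body.stream = true;
    return {
        url: `${me.baseURL}/chat/completions`,
        options: { method: "POST", headers, body: JSON.stringify(body) },
    };
}
```

The result can be handed to the browser's `fetch(url, options)`. Any extra field added to `chatRequestOptions` at runtime ends up in the body without further code changes.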

iRecentUserMsgCnt - a simple minded SlidingWindow to limit the context window load at the Ai Model end.
This is disabled by default. However if enabled, then in addition to the latest system message, only
the last/latest iRecentUserMsgCnt user messages after the latest system prompt, and their responses
from the ai model, will be sent to the ai-model when querying for a new response. IE if enabled,
only user messages after the latest system message/prompt will be considered.

This specified sliding window user message count also includes the latest user query.

* <0 : Send entire chat history to server
* 0 : Send only the system message if any to the server
* >0 : Send the latest chat history from the latest system prompt, limited to specified cnt.
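
The three cases of iRecentUserMsgCnt listed above can be sketched as a small filter over the chat history. The function name and the `{ role, content }` message shape are assumptions for illustration, not the actual SimpleChat code.

```javascript
// Hypothetical sketch of the sliding window culling of old messages.
// messages: [{ role: "system" | "user" | "assistant", content: string }, ...]
function recentChat(messages, iRecentUserMsgCnt) {
    // <0 : send the entire chat history.
    if (iRecentUserMsgCnt < 0) return messages.slice();

    // Locate the latest system message, if any.
    let sysIdx = -1;
    for (let i = messages.length - 1; i >= 0; i--) {
        if (messages[i].role === "system") { sysIdx = i; break; }
    }
    const result = sysIdx >= 0 ? [messages[sysIdx]] : [];
    // 0 : send only the system message, if any.
    if (iRecentUserMsgCnt === 0) return result;

    // >0 : walk back from the end until the Nth most recent user message.
    const after = messages.slice(sysIdx + 1);
    let start = after.length;
    let userCnt = 0;
    for (let i = after.length - 1; i >= 0; i--) {
        if (after[i].role === "user") {
            userCnt++;
            start = i;
            if (userCnt === iRecentUserMsgCnt) break;
        }
    }
    return result.concat(after.slice(start));
}
```

With a count of 2 this keeps the system prompt, the previous user request with its response, and the current user request.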

By using gMe's iRecentUserMsgCnt and chatRequestOptions.max_tokens, one can try to control, to
some extent and in a simple crude way, how much the chat history loads the ai-model's context
window and in turn how that affects the chat response. You may also want to control the context
size enabled when the server loads the ai-model, on the server end.

Sometimes the browser may be stubborn with caching of the files, so your updates to html/css/js
may not be visible. Also remember that just refreshing/reloading the page in the browser, or for that
matter clearing site data, doesn't directly override site caching in all cases. Worst case you may
have to change the port. Or in the dev tools of the browser, you may be able to disable caching fully.

Currently the server to communicate with is maintained globally and not as part of a specific
chat session. So if one changes the server ip/url in settings, then all chat sessions will auto
switch to this new server, when you try using those sessions.

By switching between chat.add_system_begin/anytime, one can control whether one can change
the system prompt anytime during the conversation or only at the beginning.

### Default setup

By default things are set up to try and make the user experience a bit better, if possible.
However a developer, when testing the server or ai-model, may want to change these values.

iRecentUserMsgCnt is used to reduce the chat history context sent to the server/ai-model to
just the system-prompt, prev-user-request-and-ai-response and cur-user-request, instead of the
full chat history. This way if there is any response with garbage/repetition, it doesn't
mess with things beyond the next question/request/query, in some ways. The trim garbage
option also tries to help avoid issues with garbage in the context to an extent.

max_tokens is set to 1024, so that a relatively large previous response doesn't eat up the space
available wrt the next query-response. However don't forget that the server, when started, should
also be started with a model context size of 1k or more, to be on the safe side.

The /completions endpoint of examples/server doesn't take max_tokens, instead it takes the
internal n_predict. So for now the same is added here on the client side; maybe later max_tokens
can be added to the /completions endpoint handling code on the server side.

NOTE: One may want to experiment with the frequency/presence penalty fields in chatRequestOptions,
wrt the set of fields sent to the server along with the user query, to check how the model behaves
wrt repetitions in general in the generated text response.

An end-user can change this behaviour by editing gMe from the browser's devel-tool/console or by
using the provided settings ui.

### OpenAi / Equivalent API WebService

One may be able to handshake with an OpenAI/Equivalent api web service's /chat/completions endpoint
for minimal chatting experimentation by setting the below.

* the baseUrl in settings ui
  * https://api.openai.com/v1 or similar

* Wrt request body - gMe.chatRequestOptions
  * model (settings ui)
  * any additional fields if required in future

* Wrt request headers - gMe.headers
  * Authorization (available through settings ui)
    * Bearer THE_OPENAI_API_KEY
  * any additional optional header entries like "OpenAI-Organization", "OpenAI-Project" or so

NOTE: Not tested, as there is no free tier api testing available. However logically this might
work.

## At the end

Also a thank you to all open source and open model developers, who strive for the common good.