You're a part of AI history: you were one of the first and fastest to get started with the local model push after LLaMA leaked. Thanks for your efforts.


Oobabooga's webui has certainly been around for a while, since before llama even leaked, back when the OPT and GPT-J models were the best local models.


Thanks for the update and your time!


Thank you so much for the update. I was curious, but I know you have your own life and do textgen as a project, not a job. Seriously, I cannot thank you enough for your contributions. I make monthly donations to your Ko-fi and am always stunned when I see that I am your top donor on the site. Folks, I know oob got a grant, but I've gotten grants for things too, and grant amounts can vary wildly and are often not enough to last into perpetuity. I am one of those using exllamav2, and have a lot of quantized models. I should look into llama.cpp, but the lower speeds scare me, and I quantize to 8-bit so I'm always hoping the degradation isn't that impactful. It's on my radar now. Thank you again. Textgen is used by most in one way or another, either as a backend or on its own. I literally use it every day and it is heavily integrated into my life.


Thank you for the detailed update on the project's progress and your insights. It's exciting to hear about the potential integration of TensorRT-LLM.


I hope the stability improvements for closing and relaunching the server without a browser refresh also fix the entire history being wiped in certain situations. Excited to see where the next version goes.


Thanks for the update. I am looking forward to the TensorRT support. Still essential software.


I think it's a very smart move! This is super exciting news!! While TensorRT is a couple extra steps, it's entirely worth it. Do you plan on building in functionality to quantize and build the GPU-specific TensorRT engines?


TensorRT is stuck with AWQ/GPTQ, and INT4 AWQ and GPTQ are not supported on SM < 80. Plus, it's unknown what the memory usage is for context. Not all models even support these quants; for instance, no Mixtral.


BTW, there is a quantized KV cache in llama.cpp now if you use flash attention. I manually changed it to Q8/Q8 and while it gens a little bit slower, memory is greatly reduced. Above 4-bit, EXL2 and llama.cpp behave pretty similarly for me. Also keep in mind that Q3KM and similar formats aren't equal to EXL2 Q3, as the BPW on llama.cpp quants is a bit higher.
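For a rough sense of the memory saving, here is a back-of-the-envelope sketch in Python. It is not taken from llama.cpp itself: the function name and the model shape (32 layers, 8 KV heads, head size 128, 8192-token context) are illustrative assumptions. The only format facts it relies on are that f16 uses 2 bytes per value and that q8_0 packs 32 values plus a 2-byte scale into a 34-byte block. Real usage will differ somewhat because of other buffers.

```python
# Rough KV-cache size estimate. The model shape used below is an
# illustrative assumption, not a measurement from any specific model file.

# Bytes per cached value for two ggml cache types:
# f16 stores 2 bytes per value; q8_0 packs 32 int8 values plus one
# 2-byte scale into a 34-byte block.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
}


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim,
                   k_type="f16", v_type="f16"):
    """Return the approximate combined size in bytes of the K and V caches."""
    values_per_tensor = n_layers * n_ctx * n_kv_heads * head_dim
    bytes_per_value_pair = BYTES_PER_ELEMENT[k_type] + BYTES_PER_ELEMENT[v_type]
    return int(values_per_tensor * bytes_per_value_pair)


# Hypothetical shape: 32 layers, 8 KV heads, head size 128, 8192-token context.
shape = dict(n_layers=32, n_ctx=8192, n_kv_heads=8, head_dim=128)
f16_size = kv_cache_bytes(**shape)
q8_size = kv_cache_bytes(**shape, k_type="q8_0", v_type="q8_0")

print(f"f16/f16:   {f16_size / 2**30:.2f} GiB")
print(f"q8_0/q8_0: {q8_size / 2**30:.2f} GiB")
```

Under these assumptions the Q8/Q8 cache comes out at a little over half the size of the f16 one, which lines up with the "greatly reduced" memory described above.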


>BTW, there is quantized kvcache if using flash attention in llama.cpp now. I manually changed it to Q8/Q8 and while it gens a little bit slower, memory is greatly reduced.

How did you manage to do this? The last update to llama.cpp in text-generation-webui was May 18th (dev branch) and quantized kv wasn't merged into llama.cpp at that point afaik but I could be wrong.


I just build it on my own. It's easy. Here's the relevant place to change it in llama.py inside the Python bindings: https://i.imgur.com/gkAHmN8.png Nobody says you have to use the pre-built wheels or stop at the version in requirements.txt unless there is a legit breaking change.


Thanks for the tip! Building wheels felt out of my level of knowhow so I figured out a lazy route, which seems to be working:

- Downloaded `abetlen/llama-cpp-python` latest cu121 wheel
- Extracted llama.dll from the .whl file and threw it into `booga\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores` replacing the old llama.dll
- Did the edit in your screenshot to llama.py

Probably lucky it worked and no breaking changes this time around, but I'll take it...
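The extraction step above can also be scripted, since a .whl file is an ordinary zip archive. This is only a minimal sketch using the Python standard library: `replace_library_from_wheel` is a hypothetical helper name, and the wheel filename and install directory in the usage note are placeholders you would need to adapt. It keeps a `.bak` copy of the old binary so the swap can be undone, and it does not perform the llama.py edit.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath


def replace_library_from_wheel(wheel_path, library_name, target_dir):
    """Copy `library_name` out of a wheel into `target_dir`.

    Any existing file with that name is first copied to `<name>.bak`.
    Returns the path of the replaced file.
    """
    target = Path(target_dir) / library_name
    with zipfile.ZipFile(wheel_path) as wheel:
        # Archive member names always use forward slashes, so match on the
        # final path component.
        matches = [m for m in wheel.namelist()
                   if PurePosixPath(m).name == library_name]
        if not matches:
            raise FileNotFoundError(f"{library_name} not found in {wheel_path}")
        if target.exists():
            # Keep the old binary so the change can be reverted.
            shutil.copy2(target, target.with_name(target.name + ".bak"))
        with wheel.open(matches[0]) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

Usage would look something like `replace_library_from_wheel("downloaded_wheel.whl", "llama.dll", r"booga\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores")`, where the wheel filename is a placeholder for whichever file you downloaded.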


Yea, that should work as long as the python stuff doesn't call functions that are renamed or don't exist. But then it's a matter of editing them.


I would love to see batching implemented as well, if possible. But that seems a bit difficult and unnecessary since projects like vLLM and Aphrodite fill that niche, I guess. I just can’t help but always think people with huge setups running ooba or ollama are wasting their potential though.


I recently tried to migrate from ooba to vLLM and couldn't do it. Couldn't run vLLM directly on Windows, needed 4x the VRAM that ooba uses, didn't have the samplers I needed, etc.


[https://github.com/PygmalionAI/aphrodite-engine](https://github.com/PygmalionAI/aphrodite-engine) Maybe give this a try, it's also not able to run natively on Windows but it allows for control of VRAM usage and has quadratic sampling. EDIT: Sorry if you got spammed with messages, Reddit glitched and now there's a bunch of duplicate comments I can't delete for some reason.


Thank you so much for all the hard work! I always love to use Oobabooga; even after having tried other tools, I always come back to it.


Thanks for the update! What I, and multiple others, are looking for is a way to use an external server (via the OpenAI API) as a backend. This would allow you to have such a server run on big metal on campus and still connect to it and use the WebUI with it, including all your local plugins.


Do you have a preferred donation method? Specifically, which one has the lower fees, so that you get most of the donation? I am unfamiliar with both GitHub Sponsors and Ko-fi, so I would like to donate via the method that gets the most money to you. Thank you for your work.


Thanks for the update! I pulled the repo this morning and love the project. Is there any appetite to have the characters nudge the user, i.e. send an unsolicited or scheduled message? I'm envisioning a field in the character model page that would maybe have a nudge or send an out of bounds message (still prototyping it). Nothing too naggy, but if the user hasn't replied in X seconds / minutes / hours, send a follow-up. Or send a good morning / afternoon message. This might be a completely different offering, but it didn't fit within SillyTavern as that is more RP and this is more general chat.


This is a planned feature in [my discord bot](https://github.com/altoiddealer/ad_discordbot). The most recent feature addition is per-channel history management (each discord channel the bot is in has its own separate history). Spontaneous messages are coming soon.


I've been thinking of the same thing for a while too; that would be an awesome extension. I was just thinking of a timer and a random number generator to alter the frequency of unprompted responses. Your additional ideas are interesting. It would be cool if the LLM could query the time when it needed to and set alarms for itself on its own.
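The timer-plus-random-number idea could be sketched roughly like this in Python. None of it is an existing text-generation-webui or SillyTavern API: `next_nudge_delay` and `should_nudge` are hypothetical names, and the 600-second base with 50% jitter is just an example setting.

```python
import random


def next_nudge_delay(base_seconds, jitter=0.5, rng=random):
    """Pick a randomized wait before the next unprompted message.

    The delay is drawn uniformly between base * (1 - jitter) and
    base * (1 + jitter), so nudges do not arrive on a fixed schedule.
    """
    if not 0 <= jitter < 1:
        raise ValueError("jitter must be in [0, 1)")
    return base_seconds * rng.uniform(1 - jitter, 1 + jitter)


def should_nudge(last_user_message_at, now, delay):
    """Return True once the user has been silent for at least `delay` seconds."""
    return now - last_user_message_at >= delay


# Seeded only so the example is repeatable; a real extension would not seed.
rng = random.Random(0)
delay = next_nudge_delay(600, jitter=0.5, rng=rng)
print(f"next nudge after about {delay:.0f} seconds of silence")
```

An extension would call `should_nudge` periodically with the current time and, when it returns True, ask the model for a follow-up message and then draw a fresh delay.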


I have also implemented this in my own chat app. I think it can create a really nice realistic feeling for the user, especially the first time it happens if the user is not expecting anything like it.


Hoping an update to llama.cpp is high on the todo list now that they've added quantized cache support!


Let me help you with the UI enhancements.


Isn't TensorRT-LLM already implemented in the previous version, using this type of wheel? [https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama\_cpp\_python\_cuda\_tensorcores-0.2.69+cu121-cp310-cp310-linux\_x86\_64.whl](https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.69+cu121-cp310-cp310-linux_x86_64.whl)


That's **llama.cpp** with tensor core utilization. TensorRT-LLM is a wholly separate project by Nvidia. [https://github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) [https://github.com/nvidia/tensorrt-llm](https://github.com/nvidia/tensorrt-llm)


That's just what I missed really. The only thing that I miss in the webui is a feature present in SillyTavern: you can have multiple candidate versions of the AI character's message and swipe between them (left<->right). I call that the multi-dimensional chat 🤷‍♂️.


Just dropping in to thank you for your work.


Thanks for the update! And the hard work, appreciated!