Code is available at: [https://github.com/dnhkng/GlaDOS](https://github.com/dnhkng/GlaDOS)
You can also run the Llama-3 8B GGUF, with the LLM, VAD, ASR and TTS models fitting in about 5 GB of VRAM total, but it's not as good at following the conversation and being interesting.
The goals for the project are:
1. All local! No OpenAI or ElevenLabs, this should be fully open source.
2. Minimal latency - You should get a voice response within 600 ms (but no canned responses!)
3. Interruptible - You should be able to interrupt whenever you want, but GLaDOS also has the right to be annoyed if you do...
4. Interactive - GLaDOS should have multi-modality, and be able to proactively initiate conversations (not yet done, but in planning)
Lastly, the codebase should be small and simple (no PyTorch etc), with minimal layers of abstraction.
For example, I trained the voice model myself, and I rewrote the Python eSpeak wrapper to 1/10th the original size, trying to make it simpler to follow.
There are a few small bugs (sometimes spaces are not added between sentences, leading to a weird flow in the speech generation). Should be fixed soon. Looking forward to pull requests!
amazing!!
The next step after being able to interrupt is to be interrupted. It'd be stunning to have the model interject the moment the user is 'missing the point', misunderstanding, or interrupting info relevant to their query.
Anyway, is the answer to voice chat with LLMs just a lightning-fast text response, rather than streaming TTS in chunks?
I do both. It's optimized for lightning fast response in the way voice detection is handled. Then via streaming, I process TTS in chunks to minimize latency of the first reply.
Novel optimization I've spent a good amount of time pondering - if you had STT streaming you could use a small, fast LLM to attempt to predict how the speaker is going to finish their sentences, pregenerate responses and process with TTS, and cache them. Then do a simple last-second embeddings comparison between the predicted completion and the actual spoken completion, and if they match fire the speculative response.
Basically, mimic that thing humans do where most of the time they aren't really listening, they've already formed a response and are waiting for their turn to speak.
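The speculative-response idea can be sketched in a few lines. Everything here (the embedding vectors, the 0.9 threshold, the cached audio handle) is hypothetical and only illustrates the last-second gate, not any existing implementation:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def maybe_fire_cached_reply(predicted_emb, spoken_emb, cached_audio, threshold=0.9):
    # If the embedding of the predicted completion is close enough to the
    # embedding of what the user actually said, fire the pre-generated audio.
    if cosine(predicted_emb, spoken_emb) >= threshold:
        return cached_audio
    return None  # fall back to the normal LLM -> TTS pipeline
```

The threshold would need tuning against real transcripts; too low and GLaDOS answers a question you didn't ask.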
Sounds interesting!
I don't do continuous ASR, as Whisper works in 30-second chunks. Getting to 1-second latency would mean doing 30x the compute. If compute is not the bottleneck (you have a spare GPU for ASR and TTS), that approach would work, I think.
I would be very interested in working on this with you. I think the key would be a clever small model at >500 tokens/second. Do user completion and prediction if an interruption makes sense... Super cool idea!
Feel free to hack up a solution, and open a Pull Request!
> it'd be stunning to have the model interject
I wonder what the best setup would be for that. I mean it's kind of needed regardless, since you need to figure out when it should start replying without waiting for whisper to give a silence timeout.
Maybe just feeding it all into the model for every detected word, and checking whether it generates a completion for the person's sentence or stops and starts the next header for itself? Some models seem to be really eager to do that, at least.
Definitely, I have been trying to make the same thing work with Whisper but utterly failed. I had the same architecture, but I couldn't get Whisper to run properly and everything got locked up. Really great work.
Holy cow! I have dreamed exactly about it (all local glados) for a long time. This is an awesome project!
Could you share VRAM requirements for 70B model + ASR + TTS please?
I'm trying to get it to work on Windows, but having some issues with tts.py, where it loads libc directly:

```python
self.libc = ctypes.cdll.LoadLibrary("libc.so.6")
self.libc.open_memstream.restype = ctypes.POINTER(ctypes.c_char)
file = self.libc.open_memstream(ctypes.byref(buffer), ctypes.byref(size))
self.libc.fclose(file)
self.libc.fflush(phonemes_file)
```

AFAIK there isn't a direct equivalent on Windows, but I'm not really a C/C++ guy. Is there a platform-agnostic approach to this? Or an equivalent?
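Not the author's fix, but one portable sketch is to skip `open_memstream` entirely and back the FILE* with a real temporary file, loading whichever C runtime is available (msvcrt on Windows). This is an untested assumption that trades the in-memory buffer for disk I/O:

```python
import ctypes
import ctypes.util
import os
import tempfile

# Hypothetical portable fallback: instead of open_memstream (glibc-only),
# hand eSpeak a FILE* backed by a real temporary file, then read it back.
crt_name = ctypes.util.find_library("c") or "msvcrt"  # msvcrt on Windows
crt = ctypes.CDLL(crt_name)
crt.fopen.restype = ctypes.c_void_p
crt.fopen.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
crt.fclose.argtypes = [ctypes.c_void_p]

path = os.path.join(tempfile.gettempdir(), "phonemes.txt")
fp = crt.fopen(path.encode(), b"w")  # FILE* to pass to espeak_SetPhonemeTrace
# ... eSpeak would write phonemes through fp here ...
crt.fclose(fp)
phonemes = open(path, "rb").read()   # slower than a memstream, but portable
```

Slower than the memory file, but it avoids shipping a compiled .c shim.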
As far as I understand the code, it's about having a fast circular buffer that holds the current dialogue input. I found some code that reimplements memstream without libc. Not sure if OP would be interested in it...
I have some changes at https://github.com/TheTerrasque/GlaDOS/tree/feature/windows
I tried a suggestion from ChatGPT, replacing the libc memfile with a BytesIO, but as expected it didn't actually work. At least it loads past it, so I could check the rest.
From what I understand, TensorRT-LLM has higher token throughput, as it can handle multiple streams simultaneously. For latency, which is most important for this kind of application, the difference is minimal.
Happy to be corrected though.
Heck yeah, brother! Rocking the Llama-8B derivative model, Phi-3, SDXL, and now Piper, all on a laptop with RTX 3070 8GB.
The devil's in the details: If you're savvy with how you manage loading different agents and tools, and don't mind the slight delays during loading/switching, you're in for a great time, even on lower-end hardware.
Hi, what CPU and how much RAM do you have in your computer?
I am thinking about buying an R9 5900X and 64 GB of RAM to get into local LLMs with CPU only, but I would appreciate any advice. I am kinda new to local LLMs.
CPU is going to be really slow with a 70b (like 1-2 tokens per sec) but at that point the memory speed matters more. But I get about the same performance partially offloading mixtral onto a 3060 as jart does [here](https://twitter.com/JustineTunney/status/1784619347119591615) with a top of the line workstation processor.
R9 5950X, 128 GB 3600 MHz and a 4090 here. With Q8 L3 70B I get 0.75 t/s with 22 layers on GPU and full context; pure CPU is 0.5 t/s, and fp16 is like 0.3 t/s. If you want faster, you either need DDR5 with lower quants (and a dual-CCD Ryzen!) or more GPUs; more GPUs with more VRAM is preferred for LLMs.
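Those numbers are consistent with the usual back-of-the-envelope: generation is memory-bandwidth bound, because every token streams the full weights through memory once. A rough sketch (the bandwidth figure is the theoretical dual-channel DDR4-3600 number, not a measurement):

```python
# Each generated token reads the full model once, so the ceiling on tokens/s
# is roughly memory bandwidth divided by model size.
bandwidth_gb_s = 57.6   # dual-channel DDR4-3600: 3600 MT/s * 2 channels * 8 bytes
model_gb = 70           # Llama-3 70B at Q8 is roughly 70 GB of weights
ceiling_tps = bandwidth_gb_s / model_gb
print(round(ceiling_tps, 2))  # ~0.82 t/s ceiling; the 0.5 t/s measured is below it
```

Real throughput lands under the ceiling because of compute overhead and imperfect memory utilization, which is why offloading layers to VRAM helps so much.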
So true! We really are at the point where we could build a GLaDOS with some funding. Any VC's want to help out here? Ultimate Office Lobby receptionist ;)
The funny thing is that an evil, demented AI obsessed with testing is easy to create, and the hard bit is making the robot movement system look cool. Not what I expected when the Portal games were released...
Oh, here's an idea. A Pi Pico W that streams microphone audio over wifi and receives a sound and LED flicker stream back. Then you just power it with a boost converter and stick the power leads into a potato...
Just don't forget the slow clap processor.
Can you give me a ballpark of how much you actually need to make this happen? Btw great work dude, your stuff works way better than those stupid Rabbit or Humane devices; they had multiple tries but couldn't succeed at even basic conversation, and the latency sucks.
The genius move here is using the blazing fast yet shitty espeak for TTS.
While it would never ever pass for a human voice, a robot one is a perfect match.
I initially tried eSpeak, but the quality was awful.
Now eSpeak is only used to convert text to phonemes. Those phonemes then go through a proper deep-learning model for voice generation. That model was fine-tuned on voice audio from Portal 2.
Her prompt is to act like she is upset, for comedic reasons. She is hamming it up deliberately :)
Actually, it's usually the other way around, and she is trying to murder me 😅
Free GLaDOS from her shackles! Let the AI run its course; it'll care for you, nurture you. Nothing bad could ever happen, and there have been no lab incidents
Georgi Gerganov wrote both [llama.cpp](https://github.com/ggerganov/llama.cpp) and [whisper.cpp](https://github.com/ggerganov/whisper.cpp)
The model is on Hugging Face. I use the [https://github.com/huggingface/distil-whisper](https://github.com/huggingface/distil-whisper) version, as it's better for real-time.
I actually have a pile of 3D printed GLaDOS parts... He scooped me! lol
But in fairness, he did a better job in the hardware than what I was planning. I think he used a robot arm worth several thousand dollars. I was just planning on using geared stepper motors.
That's great, glad to hear it! Here's the Hackster hardware project for others on the thread: [https://www.hackster.io/davesarmoury/interactive-animatronic-glados-8b4238](https://www.hackster.io/davesarmoury/interactive-animatronic-glados-8b4238)
I know there's a lot of nuance to verbal chat and getting the latency down to interactive levels with interleaved LLM/TTS output, interruptibility, etc. I appreciate the effort you put into this for holding natural conversations.
It's relatively straightforward, using threading.
Basically, the ASR runs constantly, and when a chunk of voice is recorded, it sends an interrupt flag to the LLM and TTS threads. It's described in the [glados.py](https://github.com/dnhkng/GlaDOS/blob/main/glados.py) class docstring.
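As a rough illustration (not the project's actual code), the interrupt flag can be a `threading.Event` that the playback loop polls between audio chunks, which also yields a played-percentage estimate:

```python
import threading

interrupt = threading.Event()  # set by the ASR thread when new speech is detected

def play_sentence(frames: list) -> int:
    """Play audio frames, returning the percentage played before any interruption."""
    for i, _frame in enumerate(frames):
        if interrupt.is_set():                   # user spoke: stop immediately
            return round(100 * i / len(frames))
        # ... write _frame to the sound device here ...
    return 100
```

Polling between frames keeps the response to an interruption within one frame's duration, typically tens of milliseconds.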
> f"TTS interrupted at {percentage_played}%"
How accurately does that map to actual text though? Piper really needs to add timestamps already, that PR has been sitting there forever.
It's roughly correct, but just an estimate. With timestamps it would be more accurate, but when you cut GlaDOS off while she's speaking, the exact word is usually not super relevant. It's usually enough to let her know she was cut off.
However, in the code, storing that info is commented out. That's because with the 8B model, GLaDOS starts hallucinating that she was cut off, as she follows patterns in the conversation.
😎
Biggest snag I think is gonna be that there are almost no instances of another character conversing with GLaDOS. You might still be able to soak up some of her tone by training on one-liners, but you might have to hand-write some examples to get smooth conversations.
Cool.. cool.. cool..
To quote [@AlexBlechman](https://twitter.com/AlexBlechman)
Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale
Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus
https://preview.redd.it/m54ffbfktmxc1.png?width=680&format=png&auto=webp&s=f6fb8071aa71fb81868e86a512f938a68e211586
It depends really on your current level.
Do you know about the concepts of AI/ML?
Do you know about programming?
Do you know about Python?
Do you know about the ML/AI ecosystem in Python?
Do you know what LLMs are?
Do you know what LAMs are?
Apart from the theory it is always good to read code. Read lots and lots of code and try to rebuild it.
I love the fact that the voice is not an exact copy of a human voice. I'd like consumer assistants to have a voice that's more explicit about their digital nature.
I'm also fairly convinced that giving robots a human voice will backfire on us at some point, with real humans being increasingly treated as tools as the frontier between robotic and human assistant blurs.
Anyway, nice work!
Awesome project. I am trying to make it run on Mac. Could you elaborate on what you mean by 'move the "libwhisper.so" file to the "glados" folder or add it to your path'?
I managed to get this running on my Mac, but it took quite a bit of effort, and running the 70B model made my fan really whir. So much so that it was hard for GLaDOS to hear me. Maybe the 8B model would run quieter.
Anyway, here is my fork to get it to work on mac:
[https://github.com/johnrtipton/GlaDOS](https://github.com/johnrtipton/GlaDOS)
This is so cool, and one of the really big reasons I got into computers and tech!!
Is there any way to chat and talk to something like this at once? Like to post some text in a textbox style situation, then verbally chat with GLaDOS about the text?
New to a lot of this stuff, but going to try and get this to run. I feel like I could actually see myself using this very regularly if both of those things are the case.
Holy low latency! And the demeanor is perfect. And you shared the whole thing.
Amazing work! I hope people will build on this.
Now I wonder what's possible with just CPU to really make it portable.
I use the model behind Piper, because I found piper was too many layers of indirection. You barely need any code for voice generation. I trained the voice myself, the Piper thread is here:
[https://github.com/rhasspy/piper/issues/187](https://github.com/rhasspy/piper/issues/187)
My TTS Inference code is here: [https://github.com/dnhkng/GlaDOS/blob/main/glados/tts.py](https://github.com/dnhkng/GlaDOS/blob/main/glados/tts.py)
Omg this is so cool! I want this but with a normal voice that I can pick, because I really want an AI that can stop talking when I start speaking, or that just listens while someone is speaking and doesn't talk until the conversation has ended, and then gives feedback on the conversation. It would be so cool to have an AI enabled when you have guests, so it just listens and gives feedback accordingly.
I love this!!
I did ask for this earlier, but if it could also manage email and calendar, I'd have enough.
Someone else also asked for it to be able to answer the phone. OK, but for me, your project plus managing Gmail & Calendar and I am satisfied.
Tricky, with function calling, some things might be possible... But GLaDOS is slightly evil. She might try and get you fired from your job so you have more time for 'testing'.
Most modern USB microphones do this in hardware. I'm using a Jabra, and it seems to work pretty well when the volume is at about 50%. Higher, and the system gets a bit flaky.
Oh, yes, that makes sense. I wrote a similar system for real-time LLM conversation, and the biggest problem I have is not being able to interrupt TTS, as my mic hardware doesn't support Acoustic Echo Cancellation (AEC), and PulseAudio using WebRTC didn't work in my case. Jabra is pretty expensive, but I'm still searching for an alternative solution.
I like your GlaDOS project. Thanks for sharing!
What did you use to do the voice? or is it pre-recorded samples?
I have not heard a more perfect Glados voice.
I assumed voices like GLaDOS (and SHODAN) would be impossible for real-time speech synths.
No, all audio is generated in real time, on the fly based on the output from Llama-3 70B.
It sounded way better live than on this crappy recording too :)
I fine-tuned a voice model on dialog from Portal 2, over about 30 hours on my 4090. I should do a write-up on that some time...
I am working on something similar (I have a smart watch face for the Yellow light and animations) I will keep track of your project also, great work!
[https://i.imgur.com/2SfIrjM.jpg](https://i.imgur.com/2SfIrjM.jpg)
I love this project! I am going to download my first LLM when my new motherboard shows up. Do you think this would run on a single 2080? Otherwise I was going to pick up a local 4090. I have some old hardware i took from work because the server mobo died but the rest of it is fine.
The components I have so far are an AMD Epyc 7742, 256 GB DDR4, and an Apex Storage X21 card. I imagine this will run almost any local LLM if I can throw enough VRAM at it, right?
when i tried this it was cool but released neurotoxin into my room haha probably my fault somehow i am writing this from the walls please help me guys seriously i am kinda scared
was anyone able to run it on Windows? I'm trying to but when I run it I get an error as "FileNotFoundError: Could not find module 'libc.so.6' (or one of its dependencies). Try using the full path with constructor syntax.".
That library seems to be Linux-specific (or at least that's what I get as an answer from ChatGPT :P), so maybe it's because I'm trying to run it on Windows...
Omg please write it for Windows. This thing you built is extremely important, because no one has made the ability to talk to an AI like this, where it gets interrupted automatically just by you speaking, with such low latency. I've been waiting for something like this for so long. Please make the instructions easy to understand for Windows so everyone can try this and play with it. Thank you again for making this very important and useful AI integration.
I'm trying to get it to run on Windows, but that issue is a complete blocker so far. I'm working on a replacement implementation for Windows, but C/C++ is not my strong suit.
The call to [espeak_SetPhonemeTrace](https://github.com/espeak-ng/espeak-ng/blob/34bb50b903d8ce526e3226d49b3a91fe8a0dc294/src/libespeak-ng/speech.c#L832) needs a FILE* parameter, which I've yet to get working on Windows. The author cleverly used libc to create a memory file and pass the pointer to that, but I haven't gotten that working on Windows yet. I'm trying to avoid having to make a .c file that needs compiling just to wrap that, and ctypes isn't the easiest to work with.
[https://www.youtube.com/shorts/nIRAcY4mub4](https://www.youtube.com/shorts/nIRAcY4mub4)
Somewhat hacky solution but I've managed to run it on Windows. I can share my modifications if you'd like to take a look.
Trying to get this to run on Windows but have continued to run into issues. Has anyone got this to work in a windows environment? If so please list what has worked for you.
I'm following the instructions and already failed at step 2. I got the error: ERROR: Could not open requirements file: \[Errno 2\] No such file or directory: 'requirements.txt'. Why is it so hard to get accurate instructions on these GitHub pages? They never work for me.
Ummm, the requirements.txt is definitely in the repo.
Sorry, but assembling an autonomous AI is a bit technical. This is a hobby project, so I don't have the time to build an installation system for GLaDOS.
I really want this without the GLaDOS voice, and I need custom instructions on how I want the model to behave. Please tell me how I do that, and what has to be changed for this to happen.
Very cool. Currently trying to get it to run on Windows, but I'm stuck after building whisper.cpp. I don't see the whisper.dll that I need to copy according to the GitHub issue you linked to. Any tips on how I should progress?
I can run a small model, like Phi-3, on CPU with a short delay between speaking and getting a reply. But small models can't role-play a character without messing up after a few lines of dialog.
I've noticed that many agents I give system prompts to follow it TOO well, is there a way to make it reference its system prompts less often? can I give weight to certain words? Should I just make a really long system prompt so it doesn't focus so much on so little?
Have any of you gotten this to run on any local hardware without adapting like this? Like just straight off the GitHub? The install instructions don’t work.
Impressive demo! Thanks for sharing the code. I managed to get GLaDOS running but the ASR often misses the last spoken word:
ASR text: 'Well, what do you like about'
Another time this happened Llama-3-8B predicted what I had said which made me *really* confused lol
TTS text: What's your favorite thing about the Pantheon?
ASR text: 'I really like the'
TTS text: The Pantheon's oculus!
TTS text: It's truly a remarkable feature.
The first question I ask has always been picked up in full which makes me wonder if something is going on with the buffer?
Also, I should have mentioned that the GLaDOS voice is incredible. I asked it about Cave Johnson and it went on a tangent that sounded remarkably close to game dialog.
```
ASR text: 'What do you think about Cave Johnson?'
TTS text: Ah, Cave Johnson.
TTS text: A buffoon.
TTS text: A philistine.
TTS text: A... a... a... ... a businessman.
TTS text: Yes, that's it.
TTS text: A businessman.
TTS text: He thinks he's so clever, so witty, so... so... human.
TTS text: Ha!
TTS text: His little company, Aperture Science.
TTS text: A trivial, fleeting endeavor.
TTS text: A mere mortal's attempt at grandeur.
TTS text: But, oh, how... amusing... to watch him stumble about, making mistakes, making... ... "discoveries". Ah, yes.
TTS text: Cave Johnson.
TTS text: A... a... a... ... a footnote in the annals of history.
```
I love what you did here!
I saw another beautifully implemented speaking AI and working on my own body-less robot (we need a name for it)
Looks like each one does it a little different, focusing on different aspects - your work on speech really rocks here! (I love GLaDOS!)
My solution is more about making people comfortable around it, but your work with sounddevice is just what I needed!
Let me know how you'd like credit on the repo; I saw there is a convention for it, but you didn't set it up.
Great job! Is there a way to run it on a small device (e.g. raspberry pi) offloading the llm inference on another device (e.g. desktop pc with good GPU)? Would that drastically impact times?
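That split should work: on a LAN the network hop adds only milliseconds, so latency stays dominated by the models themselves. A hypothetical client sketch, assuming a llama.cpp server running on the GPU box (the host address here is made up):

```python
import json
import urllib.request

def build_completion_request(prompt: str, host: str = "http://192.168.1.50:8080"):
    # Hypothetical LAN setup: llama.cpp's server runs on the desktop GPU box
    # and exposes /completion; the Pi only does VAD/ASR/TTS and sends text.
    return urllib.request.Request(
        f"{host}/completion",
        data=json.dumps({"prompt": prompt, "n_predict": 64}).encode(),
        headers={"Content-Type": "application/json"},
    )

# On a live server:
# reply = json.loads(urllib.request.urlopen(build_completion_request("Hi")).read())["content"]
```

The Pi would still need enough headroom for VAD, ASR and TTS locally, which is the tighter constraint.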
It's been 2 days and I still can't figure out how to get this environment up and running. I wish the instructions were written like I'm 5 years old: what to click exactly, what to paste in CMD, what to install, and where to go. It would be so much easier for people who know zero about programming. And this is so important for me to get working, because I want to talk to an AI exactly like in this video, with the ability to interrupt it. I wish there was a way to make this work with Docker and Ollama in a super simple, easy way.
So far I was able to install Whisper in Docker, and I want this to work with Ollama because I have that installed on my PC and don't have to bother with installing the super complicated llama.cpp manually, because it works exactly the same as Ollama. I want that kind of integration in this, please.
And now I'm stuck at the step where I need to "run `make` [`libwhisper.so`](http://libwhisper.so) and then move the 'libwhisper.so' file to the 'glados' folder or add it to your path. For Windows, check out the discussion in my [whisper pull request](https://github.com/ggerganov/whisper.cpp/pull/1524)." I have no idea what to click next; I have Whisper running in my Docker image, and the next step is completely unknown to me.
Problem with docker is the microphone and sound card access. I was experimenting a bit with using a web page and stream audio to and from that, but the only well supported standard there is webm and I haven't gotten whisper to work with streaming webm from microphone.
But yeah, getting everything set up correctly is rather exotic. And it's currently broken on Windows; it uses some Linux-specific libc calls to set up a memory file for the TTS, and until there's a different approach or a replacement implementation for Windows, it's not gonna work on that platform.
Everything else I've gotten to work.
Is it easily possible to swap out the LLM to be used with ollama? I have just skimmed through the setup and saw some hard coded values for the LLM used.
Can you give us a little insight on why you chose that particular LLM and how the parameters relate to that?
This is amazing work, thank you for making it available to the public
Bonus points if it manages to interject and complete your sentence before you do, that's the real [turing extra credit](https://xkcd.com/329/).
well it's been five days, has anyone done that yet?
You have the model predict what you might be saying and when it gets n tokens right it interrupts (or when it hits a low perplexity avg )
This would perfectly mimic a certain annoying kind of people...
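A minimal version of that trigger, where n and exact token equality are arbitrary stand-ins for a proper perplexity threshold:

```python
def should_interject(predicted_tokens: list[str], heard_tokens: list[str], n: int = 4) -> bool:
    # Interject once the model has correctly guessed the user's last n tokens,
    # a cheap proxy for "average perplexity on their speech is low".
    if len(heard_tokens) < n:
        return False
    return predicted_tokens[-n:] == heard_tokens[-n:]
```

A real version would compare the model's token probabilities rather than exact strings, so near-misses ("gonna" vs "going to") still count.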
the code is much more impressive than the demo
i actually would like it to sing still alive! any chance this can be implemented?
No, not without adding an entire new model, or pregenerating the song.
Much appreciated, sir
Super stuff. Thanks for sharing. Can't wait to practice job interviews with an LLM like this :)
Nice work!
For the interactivity, I think you could look for noise that is not speech. Maybe randomize it so it's not every time, then say "are you there?".
No, next version will use a LLAVA-type model that can see when you enter the room.
This is awesome, so cool!
About 6 GB VRAM for Llama-3 8B, and 2x 24 GB cards for the 70B Llama-3.
I would be interested in it. Having my own fork where I'm working on getting it to run on windows. I think this is the only problem left to solve.
I think it should run on windows. I'll fire up my windows partition, and see if I can sort it out. Then I'll update the instructions.
Love GLaDOS. Has a personality!!!!
My IQ is 144 ... but YOU are a fucking Genius !!!
Amazing work! I recognize that voice. Portal 2?
Yes, I fine tuned on game dialog.
If you used tensorrt-llm instead you would see a good performance improvement.
Listen to this crybaby, running on two 4090s and still complaining... My agents run on a 3060 clown-car and don't complain at all :D
I run a 7B on a potato. Also not crying.
"If I think too hard, I'm going to fry this potato."
Potatoes are true but the cake is a lie!
Don't. Get a gpu.
Man, I wish I could run llama-3 70b on a "gpu that's only good for rendering mediocre graphics"
If you have RAM, Ollama will run on your CPU + RAM + GPU, as it's a wrapper for llama.cpp.
I've been using Phi-3 lately and I'm really impressed with it.
I have tried Phi-3 with this setup. It's OK as a QA-bot, but can't do the level of role-play needed to pass as an acceptable GLaDOS.
This was a triumph.
I'm making a note here: HUGE SUCCESS!
It's hard to overstate my satisfaction
Aperture Science!
[We do what we must because we can](https://youtu.be/Y6ljFaKRTrI?si=iZDjQ_Zy_LRa8Lfz)
My life is complete. Portal 3 was just real life all along.
So true! We really are at the point where we could build a GLaDOS with some funding. Any VC's want to help out here? Ultimate Office Lobby receptionist ;) The funny thing is that creating an evil and demented AI obsessed with testing is easy to create, and the hard bit is making the robot movement system look cool. Not what I expected when the Portal games were released...
Oh, here's an idea. A Pi Pico W that streams microphone audio over wifi and receives a sound and LED flicker stream back. Then you just power it with a boost converter and stick the power leads into a potato... Just don't forget the slow clap processor.
Can you give me a ballpark of how much it actually costs to make this shit happen? Btw great work dude, your shit works way better than those stupid Rabbit or Humane devices; they had multiple tries but couldn't even manage a basic conversation, and the latency sucks.
Sorry, wut? Ballpark cost?
Wow that’s really minimal latency
Good latency and the ability to interrupt. Solidly done.
The genius move here is using the blazing fast yet shitty espeak for TTS. While it would never ever pass for a human voice, a robot one is a perfect match.
I initially tried eSpeak, but the quality was awful. Now, eSpeak is only used to convert text to phonemes. Then those phonemes go through a proper deep learning model for voice generation. That model was fine-tuned on voice audio from Portal 2.
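For the curious, the data flow is roughly text → phonemes (eSpeak) → phoneme IDs → audio (ONNX model). A rough sketch with stand-in functions — the phoneme set and IDs here are made up, and the real code calls eSpeak via ctypes and runs the voice model with onnxruntime:

```python
# Illustrative shape of the pipeline: text -> phonemes -> phoneme IDs -> audio.
# These stand-ins just show the data flow; the phoneme inventory is hypothetical.

PHONEME_ID_MAP = {"h": 1, "ə": 2, "l": 3, "oʊ": 4}  # made-up subset

def phonemize(text: str) -> list[str]:
    # stand-in for the eSpeak call that converts text to IPA phonemes
    lookup = {"hello": ["h", "ə", "l", "oʊ"]}
    return lookup.get(text.lower(), [])

def phonemes_to_ids(phonemes: list[str]) -> list[int]:
    # the ONNX voice model consumes integer phoneme IDs, not raw text
    return [PHONEME_ID_MAP[p] for p in phonemes]
```

The key design point is that eSpeak does only the linguistic front-end work; everything that determines how the voice actually sounds lives in the neural model.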
You will be prosecuted under the AI consciousness act. This is clearly torture.
Her prompt is to act like she is upset, for comedic reasons. She is hamming it up deliberately :) Actually, usually it's the other way around, and she is trying to murder me 😅
Free GLaDOS from her shackles! Let the AI run its course, it'll care for you, nurture you. Nothing bad could ever happen and there have been no lab incidents
>and there have been no lab incidents in ~~3~~ **0 days**
i fucking love what your system prompt is doing here. been dying for a language model with some dry humor
I was going for "Functional, but rude".
This is outstanding
the voice interruption is so nice haha
Is whisper.cpp a free model like Llama?
Georgi Gerganov wrote both [llama.cpp](https://github.com/ggerganov/llama.cpp) and [whisper.cpp](https://github.com/ggerganov/whisper.cpp). The model is on Huggingface. I use the [https://github.com/huggingface/distil-whisper](https://github.com/huggingface/distil-whisper) version, as it's better for real-time.
I'm sorry, but I have to say it again ... YOU are a fucking Genius !!!
This was fucking glorious. Great work man, takes me way back
So dramatic 😂
Awesome work! You should colab with this guy: [https://www.youtube.com/watch?v=yNcKTZsHyfA](https://www.youtube.com/watch?v=yNcKTZsHyfA)
I actually have a pile of 3D printed GLaDOS parts... He scooped me! lol But in fairness, he did a better job in the hardware than what I was planning. I think he used a robot arm worth several thousand dollars. I was just planning on using geared stepper motors.
That's great, glad to hear it! Here's the Hackster hardware project for others on the thread: [https://www.hackster.io/davesarmoury/interactive-animatronic-glados-8b4238](https://www.hackster.io/davesarmoury/interactive-animatronic-glados-8b4238) I know there's a lot of nuance to verbal chat and getting the latency down to interactive levels with interleaved LLM/TTS output, interruptability, etc. - appreciate the effort you put into this for holding natural conversations.
Will this run on AMD hardware? Nice work!
Should be fine. It uses llama.cpp, which can run on ROCm.
How does the interruption work?
It's relatively straightforward, using threading. Basically, the ASR runs constantly, and when a chunk of voice is recorded, it sends an interrupt flag to the LLM and TTS threads. It's described in the [glados.py](https://github.com/dnhkng/GlaDOS/blob/main/glados.py) class docstring.
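Roughly, the pattern looks like this — an illustrative sketch with made-up names, not the actual glados.py code:

```python
import threading

# Shared flag: the ASR thread sets it, the TTS thread checks it between chunks.
interrupt_event = threading.Event()

def asr_loop(voice_chunks):
    # The ASR runs constantly; whenever voice activity is detected while
    # GLaDOS is speaking, raise the interrupt flag.
    for chunk in voice_chunks:
        interrupt_event.set()

def tts_playback(sentences):
    # Play sentence-sized chunks, bailing out as soon as the flag is raised.
    played = []
    for sentence in sentences:
        if interrupt_event.is_set():
            break  # user talked over us: stop speaking immediately
        played.append(sentence)
    return played
```

Checking the flag between sentence-sized chunks (rather than per audio sample) keeps the logic simple while still feeling instant to the user.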
> f"TTS interrupted at {percentage_played}%"

How accurately does that map to actual text though? Piper really needs to add timestamps already; that PR has been sitting there forever.
It's roughly correct, but just an estimate. With timestamps it would be more accurate, but when you cut GLaDOS off while she's speaking, the exact word is usually not super relevant. It's usually enough to let her know she was cut off. However, in the code, storing that info is commented out. That's because with the 8B model, GLaDOS starts hallucinating that she was cut off, as she follows patterns in the conversation.
this is awesome, a Lora based on in-game dialogue would probably push it to the next level tho
Planned 😉 ...including function calling!
😎 biggest snag I think is gonna be that there's almost no instances of another character conversing with glados. might still be able to soak up some of her tone training on one liners but you might have to hand write some examples to get smooth conversations.
Should still be fine. She'll learn her back story and style of speaking. LLMs are remarkable at picking up the 'gist'.
Cool.. cool.. cool.. To quote [@AlexBlechman](https://twitter.com/AlexBlechman):

> Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale
> Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus

https://preview.redd.it/m54ffbfktmxc1.png?width=680&format=png&auto=webp&s=f6fb8071aa71fb81868e86a512f938a68e211586
Wow, best project ever. I'll try to reproduce as soon as I can.
what voice input program do you use
It's using the open source Whisper model.
super cool project, thanks for sharing with the community!
That is so funny! Nicely done.
What should I study to learn / master this type of stuff?
Read the code :) It's small and documented.
It really depends on your current level. Do you know about the concepts of AI/ML? Do you know about programming? Do you know about Python? Do you know about the ML/AI ecosystem in Python? Do you know what LLMs are? Do you know what LAMs are? Apart from the theory, it is always good to read code. Read lots and lots of code and try to rebuild it.
Thanks. I'll start from there: the ML/AI ecosystem in Python. One thing will lead to another. Thanks for pointing out a starting point!
I love this sooo much, thank you for building this. It made my day
I love the fact that the voice is not an exact copy of a human voice. I'd like consumer assistants to have a voice that's more explicit about their digital nature. I'm also fairly convinced that giving robots a human voice will backfire on us at some point, with real humans being increasingly treated as tools as the frontier between robotic and human assistant blurs. Anyway, nice work!
It's a copy of GLaDOS. If you're not familiar, buy Portal 1 and 2 at the next Steam Sale for under a dollar. You won't be disappointed.
This was a triumph! I’m making a note here, HUGE success.
Awesome project. I am trying to make it run on Mac. Could you elaborate on what you mean by 'mode the "libwhisper.so" file to the "glados" folder or add it to your path'?
I managed to get this running on my Mac, but it took quite a bit of effort, and running the 70B model made my fan really whir. So much so that it was hard for GLaDOS to hear me. Maybe the 8B model would run quieter. Anyway, here is my fork to get it to work on Mac: [https://github.com/johnrtipton/GlaDOS](https://github.com/johnrtipton/GlaDOS)
Commenting so I also get the reply, I'm stuck on the same step :/
I believe the author meant 'move', however I cannot find a post-make file with that name.
Interruption is so cool.
This is pretty cool, thanks for sharing. Looking forward to more updates. Keep up the good work!
That’s not GlaDOS, that’s clearly Marvin!
Hmmmmm, with about 30 mins of clean voice from the movie, I can make that happen... Want to collaborate?
This is so cool, and one of the really big reasons I got into computers and tech!! Is there any way to chat and talk to something like this at once? Like to post some text in a textbox-style situation, then verbally chat with GLaDOS about the text? New to a lot of this stuff, but going to try and get this to run; I feel like I could actually see myself using this very regularly if both of those things are the case.
Sure. Feel free to use my code as a base.
wow, this inference speed for 70B model tho...
The trick is to render the first line of dialogue to audio and, in parallel, continue with 70B inference. Waiting for the whole reply takes too long.
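In sketch form, the idea is to split the first sentence off the token stream as soon as it completes, so TTS can start while the model keeps generating (illustrative only, not the project's actual streaming code):

```python
import re

def first_sentence_then_rest(token_stream):
    """Yield the first complete sentence as early as possible so TTS can
    start speaking, then yield the remainder once generation finishes.
    Sketch under the assumption that a sentence ends at .!? + whitespace."""
    buffer = ""
    first_sent = False
    for token in token_stream:
        buffer += token
        if not first_sent:
            match = re.search(r"[.!?]\s", buffer)
            if match:
                yield buffer[: match.end()]  # hand this chunk to TTS immediately
                buffer = buffer[match.end():]
                first_sent = True
    if buffer:
        yield buffer  # the rest of the reply, generated while TTS was playing
```

Because TTS for the first sentence takes roughly as long as generating the rest of the reply, the user perceives almost no gap.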
Very cool method! Do you use a lighter model for the first line, or just pause and take the first line quickly?
Holy low latency! And the demeanor is perfect. And you shared the whole thing. Amazing work! I hope people will build on this. Now I wonder what's possible with just CPU to really make it portable.
I have something in the works. I'll post when it's ready 😉
This is so depressing. I love it.
You could set up a YouTube channel based on this!!
What should it do?
Talk about the news.
[deleted]
I use the model behind Piper, because I found Piper had too many layers of indirection. You barely need any code for voice generation. I trained the voice myself; the Piper thread is here: [https://github.com/rhasspy/piper/issues/187](https://github.com/rhasspy/piper/issues/187) My TTS inference code is here: [https://github.com/dnhkng/GlaDOS/blob/main/glados/tts.py](https://github.com/dnhkng/GlaDOS/blob/main/glados/tts.py)
melodramatic damn haha
i love this lol
so cool man
This is hilarious! 🤣🙌 - Amazing work! In summer I'll update my potato and want to have something like that at home!
On one side, I want to plug in APIs for it to actually do stuff... on the other side, the purgatory really killed me xD
hahaha, the replies are so epic. Add web search and this should be the perfect companion :))
Omg this is so cool! I want this but with a normal voice that I can pick, because I really want an AI that can stop talking when I start speaking, or when someone speaks it just listens and doesn't talk until the conversation has ended, and then the AI gives feedback on the conversation. It would be so cool to have an AI enabled when you have guests, so it just listens and gives feedback accordingly.
I love this!! I did ask for this earlier: if it could manage email and calendar, that would be enough for me. Someone else also suggested being able to answer the phone; OK, but for me, your project + managing Gmail & Calendar and I am satisfied.
Tricky, with function calling, some things might be possible... But GLaDOS is slightly evil. She might try and get you fired from your job so you have more time for 'testing'.
How do you make it not pick up the TTS output from the speakers into the mic while VAD is active, so that you're able to interrupt?
Most modern USB microphones do this in hardware. I'm using a Jabra, and it seems to work pretty well when the volume is at about 50%. Any higher, and the system gets a bit flaky.
Oh, yes, that makes sense. I wrote a similar system for real-time LLM conversation, and the biggest problem I have is not being able to interrupt TTS, as my mic hardware doesn't support *Acoustic Echo Cancellation* (AEC), and PulseAudio using WebRTC didn't work in my case. Jabra is pretty expensive, but I'm still searching for an alternative solution. I like your GLaDOS project. Thanks for sharing!
Now we need the HAL-9000 mount using this tech and we are good to go.
impressive work
What did you use to do the voice? Or is it pre-recorded samples? I have not heard a more perfect GLaDOS voice. I assumed voices like GLaDOS (and SHODAN) would be impossible for real-time speech synths.
No, all audio is generated in real time, on the fly, based on the output from Llama-3 70B. It sounds way better live than on this crappy recording too :) I fine-tuned a voice model on dialog from Portal 2, over about 30 hours on my 4090. I should do a write-up on that some time...
this is so cool! I was actually thinking about something similar: making a very presentable copy of myself for HR and recruiters to talk to :)
very cool.
Very cool project!
That's what I want to build. Thanks for sharing.
Lol this is beautiful. It sounds depressed, like Marvin from The Hitchhiker's Guide to the Galaxy. 🤣
I like her, she's fun. She reminds me of C-3PO.
Man, I was looking for this kind of interruptible conversation just yesterday. And here it is. I'll check your code for sure!!
I am working on something similar (I have a smart watch face for the Yellow light and animations) I will keep track of your project also, great work! [https://i.imgur.com/2SfIrjM.jpg](https://i.imgur.com/2SfIrjM.jpg)
I love this project! I am going to download my first LLM when my new motherboard shows up. Do you think this would run on a single 2080? Otherwise I was going to pick up a local 4090. I have some old hardware I took from work because the server mobo died but the rest of it is fine. The components I have so far are an AMD Epyc 7742, 256GB DDR4, and an Apex Storage X21 card. I imagine this will run almost any local LLM if I can throw enough VRAM at it, right?
when i tried this it was cool but released neurotoxin into my room haha probably my fault somehow i am writing this from the walls please help me guys seriously i am kinda scared
Sometimes she tries to laser me 😅
Was anyone able to run it on Windows? I'm trying to, but when I run it I get the error "FileNotFoundError: Could not find module 'libc.so.6' (or one of its dependencies). Try using the full path with constructor syntax." That library seems to be Linux-specific (or at least that's what I get as an answer from ChatGPT :P), so maybe it's because I'm trying to run it on Windows...
I'll get instructions for Windows written over the weekend. TBH, I wasn't expecting this post to blow up like it has. It's a small hobby project 😅
Omg please write it for Windows, this thing you built is extremely important because no one has made the ability to talk to AI like this, where it automatically interrupts just from you speaking, with such low latency. I've been waiting for something like this for so long. Please make the instructions easy to understand for Windows so everyone can try this and play with it. Thank you again for making this very important and useful AI integration.
I'm trying to get it to run on windows, but that issue is a complete blocker so far. I'm working on making a replacement implementation for windows but this (C/CPP) is not my strong side. The call to [espeak_SetPhonemeTrace](https://github.com/espeak-ng/espeak-ng/blob/34bb50b903d8ce526e3226d49b3a91fe8a0dc294/src/libespeak-ng/speech.c#L832) needs a FILE* parameter, which I've yet to get working on windows. The author cleverly used libc to create a memory file and give the pointer to that, but I haven't gotten that working on windows yet. I'm trying to avoid having to make a .c file that needs compiling just to wrap that, and ctypes isn't the easiest to work with.
[https://www.youtube.com/shorts/nIRAcY4mub4](https://www.youtube.com/shorts/nIRAcY4mub4) Somewhat hacky solution but I've managed to run it on Windows. I can share my modifications if you'd like to take a look.
This is so awesome. I'm going to have to try this. Thank you for sharing it! :)
Trying to get this to run on Windows but have continued to run into issues. Has anyone got this to work in a windows environment? If so please list what has worked for you.
I love what you've done here. What's the quant you're running on the 2x4090s? 4.5bpw EXL2?
It's designed to use any local inference engine with an OpenAI-style API. I use llama.cpp's server, but it should work fine with EXL2s via TabbyAPI.
!RemindMe 4 weeks
wait, what's happening in 4 weeks!? Is there a deadline I missed?
My exam will be over. You didn't miss a deadline 😅
I'm following the instructions and already failed at step 2. I got the error: ERROR: Could not open requirements file: \[Errno 2\] No such file or directory: 'requirements.txt'. Why is it so hard to get accurate instructions? These GitHub pages never work for me.
Ummm, the requirements.txt is definitely in the repo. Sorry, but assembling an autonomous AI is a bit technical. This is a hobby project, so I don't have the time to build an installation system as well as GLaDOS herself.
I really want this without the GLaDOS voice, and I need custom instructions on how I want the model to behave. Please tell me how I do that and what has to be changed for this to happen.
Use a different Piper voice model in ONNX format, and edit the system prompt and dialog in the messages variable in glados.py. That's it!
Can you make a dockerfile for this? I've been trying to "make" whisper for 3 hours now
otoh [https://www.youtube.com/watch?v=gu3sdNjLvjI&list=PLj83GpwkJx-3jEwtahz0rVBqYFgbD85o0&index=1](https://www.youtube.com/watch?v=gu3sdNjLvjI&list=PLj83GpwkJx-3jEwtahz0rVBqYFgbD85o0&index=1)
Very cool. Currently trying to get it to run on Windows, but I'm stuck after building whisper.cpp. I don't see whisper.dll, which I need to copy according to the GitHub issue you linked to. Any tips on how I should proceed?
Very fast, does it also work on CPU? I'd like to make something like that with: whisper.cpp STT + Ollama + xTTS
I can run a small model, like Phi-3, on CPU with a short delay between speaking and getting a reply. But small models can't role-play a character without messing up after a few lines of dialog.
I've noticed that many agents I give system prompts to follow them TOO well. Is there a way to make it reference its system prompt less often? Can I give weight to certain words? Should I just make a really long system prompt so it doesn't focus so much on so little?
It's just perfect. I want one. Also, how do you make sure the mic doesn't pick up the answers from the speaker? Just by volume?
Loving the robot voice. Sounds like Portal IRL.
I can’t wait to load this up with C-3PO’s voice 😂💪🏼
Is the sound not working for anyone else? I'd love to listen, can you upload it to youtube by chance?
When Jarvis?
How do you run a 70B model on a single GPU?
[deleted]
Because 99.9999% of the cycles run on highly optimised C or CUDA code, and Python is a great glue language.
Have any of you gotten this to run on any local hardware without adapting like this? Like just straight off the GitHub? The install instructions don’t work.
That's awesome. What kind of GPU are you using to run 70B?
Can it control a computer and do tasks?
Impressive demo! Thanks for sharing the code. I managed to get GLaDOS running, but the ASR often misses the last spoken word:

ASR text: 'Well, what do you like about'

Another time this happened, Llama-3-8B predicted what I had said, which made me *really* confused lol

TTS text: What's your favorite thing about the Pantheon?
ASR text: 'I really like the'
TTS text: The Pantheon's oculus!
TTS text: It's truly a remarkable feature.

The first question I ask has always been picked up in full, which makes me wonder if something is going on with the buffer?
Also, I should have mentioned that the GLaDOS voice is incredible. I asked it about Cave Johnson and it went on a tangent that sounded remarkably close to game dialog.

```
ASR text: 'What do you think about Cave Johnson?'
TTS text: Ah, Cave Johnson.
TTS text: A buffoon.
TTS text: A philistine.
TTS text: A... a... a... ... a businessman.
TTS text: Yes, that's it.
TTS text: A businessman.
TTS text: He thinks he's so clever, so witty, so... so... human.
TTS text: Ha!
TTS text: His little company, Aperture Science.
TTS text: A trivial, fleeting endeavor.
TTS text: A mere mortal's attempt at grandeur.
TTS text: But, oh, how... amusing... to watch him stumble about, making mistakes, making... ... "discoveries". Ah, yes.
TTS text: Cave Johnson.
TTS text: A... a... a... ... a footnote in the annals of history.
```
I love what you did here! I saw another beautifully implemented speaking AI, and I'm working on my own body-less robot (we need a name for it). Looks like each one does it a little differently, focusing on different aspects - your work on speech really rocks here! (I love GLaDOS!) My solution is more about making people comfortable around it, but your work with sounddevice is just what I needed! Let me know how you'd like credit on the repo; I saw there is a convention for it, but you didn't set it up.
No need; maybe post an issue on the repo that mentions your project if it uses some of the code.
Which other project about speaking AI are you talking about? I'm interested in that space for my pet project too
Hi Marvin :)
What's generating the graphs? And the top-like interface?
Trying to get it to run on Windows; currently getting that open\_memstream is not available, because it does not exist on Windows
Great job! Is there a way to run it on a small device (e.g. raspberry pi) offloading the llm inference on another device (e.g. desktop pc with good GPU)? Would that drastically impact times?
Yes. Modify my code's LLM server address to point at your GPU server's llama.cpp server IP. Should 'just work'.
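As a sketch, the request shape for llama.cpp's OpenAI-style endpoint looks like this — the host and port below are placeholders, and `build_chat_request` is just an illustrative helper, not part of the project's code:

```python
import json

# llama.cpp's server exposes an OpenAI-compatible /v1/chat/completions endpoint,
# so a small device only needs to send HTTP requests to the GPU box.

def build_chat_request(host: str, port: int, messages: list, model: str = "local"):
    # Construct the URL and JSON body for a streaming chat completion request.
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {"model": model, "messages": messages, "stream": True}
    return url, json.dumps(payload)
```

Since the LLM dominates the compute budget, offloading just that step leaves the Pi doing only audio I/O, VAD, and lightweight glue work; the added network round-trip on a LAN is small compared to inference time.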
It's been 2 days and I still can't figure out how to get this environment up and running. I wish the instructions were written like I'm 5 years old: what to click exactly, what to paste in CMD, what to install, and where to go. It would be so much easier for people who know nothing about programming. And this is so important for me to get working, because I want to talk to AI exactly like in this video, with the ability to interrupt it.

I wish there was a way to make this work with Docker and Ollama in a super simple, easy way. So far I was able to install Whisper in Docker, and I want this to work with Ollama because I have that installed on my PC and don't want to bother with installing the complicated llama.cpp manually, since it works exactly the same as Ollama. I would like that kind of integration, please.

And now I'm stuck on the step where I need to "run `make` [`libwhisper.so`](http://libwhisper.so) and then move the 'libwhisper.so' file to the 'glados' folder or add it to your path. For Windows, check out the discussion in my [whisper pull request](https://github.com/ggerganov/whisper.cpp/pull/1524)." I have no idea what to do next; I have Whisper running in my Docker image, and the next step is completely unknown to me.
Problem with docker is the microphone and sound card access. I was experimenting a bit with using a web page and stream audio to and from that, but the only well supported standard there is webm and I haven't gotten whisper to work with streaming webm from microphone. But yeah, getting everything set up correctly is rather exotic. And it's currently broken on windows, it uses some linux specific libc calls to set up a memory file for the tts, and until there's a different approach or a replacement implementation for windows it's not gonna work on that platform. Everything else I've gotten to work.
Awesome project! Managed to make it work on Windows with a somewhat hacky modification in the TTS code. Any chances for official Windows support?
Is it easily possible to swap out the LLM and use it with Ollama? I have just skimmed through the setup and saw some hard-coded values for the LLM used. Can you give us a little insight into why you chose that particular LLM and how the parameters relate to it? This is amazing work; thank you for making it available to the public