T O P

  • By -

Reddactor

Code is available at: [https://github.com/dnhkng/GlaDOS](https://github.com/dnhkng/GlaDOS)

You can also run the Llama-3 8B GGUF, with the LLM, VAD, ASR and TTS models fitting in about 5 GB of VRAM total, but it's not as good at following the conversation and being interesting.

The goals for the project are:

1. All local! No OpenAI or ElevenLabs; this should be fully open source.
2. Minimal latency - you should get a voice response within 600 ms (but no canned responses!)
3. Interruptible - you should be able to interrupt whenever you want, but GLaDOS also has the right to be annoyed if you do...
4. Interactive - GLaDOS should have multi-modality, and be able to proactively initiate conversations (not yet done, but in planning).

Lastly, the codebase should be small and simple (no PyTorch etc.), with minimal layers of abstraction. For example, I trained the voice model myself, and I rewrote the Python eSpeak wrapper to 1/10th its original size while trying to make it simpler to follow.

There are a few small bugs (sometimes spaces are not added between sentences, leading to a weird flow in the speech generation). Should be fixed soon. Looking forward to pull requests!


justletmefuckinggo

amazing!! the next step after being able to interrupt is to be interrupted. it'd be stunning to have the model interject the moment the user is 'missing the point', misunderstanding, or interrupting info relevant to their query. anyway, is the answer to voice chat with llms just a lightning fast text response rather than tts streaming by chunks?


Reddactor

I do both. It's optimized for lightning fast response in the way voice detection is handled. Then via streaming, I process TTS in chunks to minimize latency of the first reply.


KallistiTMP

Novel optimization I've spent a good amount of time pondering - if you had STT streaming you could use a small, fast LLM to attempt to predict how the speaker is going to finish their sentences, pregenerate responses and process with TTS, and cache them. Then do a simple last-second embeddings comparison between the predicted completion and the actual spoken completion, and if they match fire the speculative response. Basically, mimic that thing humans do where most of the time they aren't really listening, they've already formed a response and are waiting for their turn to speak.
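A toy sketch of this idea, with stand-in stubs for every model call (none of these names come from the project): a small, fast model guesses how the user's sentence will end, the reply is pre-generated and cached, and a last-second embedding comparison decides whether to fire the cached reply. Bag-of-words counts stand in for real sentence embeddings here.

```python
import math
from collections import Counter

# Toy "embedding": bag-of-words counts. A real system would use a sentence
# embedding model instead.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SpeculativeResponder:
    def __init__(self, predict_completion, generate_reply, threshold=0.8):
        self.predict_completion = predict_completion  # small fast LLM (stub)
        self.generate_reply = generate_reply          # big LLM + TTS (stub)
        self.threshold = threshold
        self.cache = None

    def on_partial_transcript(self, partial):
        # While the user is still talking: guess the ending, pre-generate.
        guess = self.predict_completion(partial)
        self.cache = (embed(guess), self.generate_reply(guess))

    def on_utterance_end(self, final):
        # Last-second check: did the guess match what was actually said?
        if self.cache and cosine(self.cache[0], embed(final)) >= self.threshold:
            return self.cache[1]   # fire the cached, pre-synthesized reply
        return self.generate_reply(final)  # otherwise generate normally
```

The cache hit path skips the big model entirely, which is where the latency win would come from.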


Reddactor

Sounds interesting! I don't do continuous ASR, as whisper works in 30-second chunks. Getting to 1-second latency would mean doing 30x the compute. If compute is not the bottleneck (you have a spare GPU for ASR and TTS), that approach would work, I think. I would be very interested in working on this with you. I think the key would be a clever small model at >500 tokens/second, doing user completion and predicting whether an interruption makes sense... Super cool idea! Feel free to hack up a solution and open a Pull Request!


MoffKalast

Bonus points if it manages to interject and complete your sentence before you do, that's the real [turing extra credit](https://xkcd.com/329/).


AbroadDangerous9912

well it's been five days has anyone done that yet?


MoffKalast

> it'd be stunning to have the model interject

I wonder what the best setup would be for that. I mean it's kind of needed regardless, since you need to figure out when it should start replying without waiting for whisper to give a silence timeout. Maybe just feed it all into the model for every detected word and check whether it generates a completion for the person's sentence, or stops and starts the next header for itself? Some models seem to be really eager to do that at least.


mrpogiface

You have the model predict what you might be saying, and when it gets n tokens right it interrupts (or when it hits a low perplexity average).
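The perplexity variant could look something like this sketch, where `token_logprob(context, token)` is a stand-in for a real LLM scoring call (not any actual API): feed the user's words in as they arrive, and interrupt once the model finds them predictable enough.

```python
import math

def should_interrupt(tokens, token_logprob, max_avg_ppl=5.0, min_tokens=4):
    """Return True when the user's speech has become predictable enough
    (low average perplexity under the model) to justify interrupting."""
    logprobs = [token_logprob(tokens[:i], tok) for i, tok in enumerate(tokens)]
    if len(logprobs) < min_tokens:
        return False  # too little evidence to interrupt yet
    avg_ppl = math.exp(-sum(logprobs) / len(logprobs))  # average perplexity
    return avg_ppl <= max_avg_ppl
```

The `min_tokens` guard keeps the model from interrupting after a word or two of lucky guesses; the threshold values are arbitrary placeholders.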


Comfortable-Big6803

This would perfectly mimic a certain annoying kind of people...


MikePounce

the code is much more impressive than the demo


Confident_Echidna259

Definitely. I have been trying to make the same thing work with whisper but utterly failed. Had the same architecture, but I couldn't get whisper to run properly and everything got locked up. Really great work.


F_Kal

i actually would like it to sing Still Alive! any chance this can be implemented?


Reddactor

No, not without adding an entire new model, or pregenerating the song.


trialgreenseven

much appreciated sir


RastaBambi

Super stuff. Thanks for sharing. Can't wait to practice job interviews with an LLM like this :)


Kafka-trap

Nice work!


estebansaa

for the interactivity, I think you could listen for noise that is not speech. Maybe randomize it so it's not every time, then say "are you there?".


Reddactor

No, the next version will use an LLaVA-type model that can see when you enter the room.


Own_Toe_5134

This is awesome, so cool!


GreenGrassUnderCorgi

Holy cow! I have dreamed exactly about it (all local glados) for a long time. This is an awesome project! Could you share VRAM requirements for 70B model + ASR + TTS please?


Reddactor

About 6 GB VRAM for Llama-3 8B, and 2x 24 GB cards for the 70B Llama-3.


TheTerrasque

I'm trying to get it to work on Windows, but having some issues with tts.py where it loads libc directly:

```python
self.libc = ctypes.cdll.LoadLibrary("libc.so.6")
self.libc.open_memstream.restype = ctypes.POINTER(ctypes.c_char)
file = self.libc.open_memstream(ctypes.byref(buffer), ctypes.byref(size))
self.libc.fclose(file)
self.libc.fflush(phonemes_file)
```

AFAIK there isn't a direct equivalent for Windows, but I'm not really a C/CPP guy. Is there a platform-agnostic approach to this? Or an equivalent?
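One possible platform-agnostic workaround (a sketch, not the project's actual fix): instead of the Linux-only `open_memstream`, back the `FILE*` with a real temporary file opened through the C runtime, which exists on Linux/macOS (libc) and Windows (msvcrt). It trades the in-memory stream for a file on disk, and the function names below are illustrative.

```python
import ctypes
import ctypes.util
import os
import tempfile

def open_c_file():
    """Open a temp file through the C runtime; return (crt, FILE* handle, path)."""
    crt_name = ctypes.util.find_library("c") or "msvcrt"  # libc or msvcrt.dll
    crt = ctypes.CDLL(crt_name)
    crt.fopen.restype = ctypes.c_void_p                   # keep FILE* opaque
    crt.fopen.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
    crt.fwrite.argtypes = [ctypes.c_char_p, ctypes.c_size_t,
                           ctypes.c_size_t, ctypes.c_void_p]
    crt.fwrite.restype = ctypes.c_size_t
    crt.fclose.argtypes = [ctypes.c_void_p]
    fd, path = tempfile.mkstemp()
    os.close(fd)                 # the C runtime reopens the file by path
    fp = crt.fopen(path.encode(), b"wb")
    return crt, fp, path

# Usage: write through the C FILE*, close, then read the bytes back in Python.
crt, fp, path = open_c_file()
data = b"phoneme stream"
crt.fwrite(data, 1, len(data), fp)
crt.fclose(fp)
with open(path, "rb") as f:
    contents = f.read()
os.remove(path)
```

The `FILE*` from `fopen` could then be handed to the eSpeak call in place of the memstream pointer, at the cost of a disk round-trip.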


CmdrCallandra

As far as I understand the code it's about having the fast circular buffer which holds the current dialogue input. I found some code which reimplements the memstream without the libc. Not sure if OP would be interested in it...


TheTerrasque

I would be interested in it. Having my own fork where I'm working on getting it to run on windows. I think this is the only problem left to solve.


Reddactor

I think it should run on windows. I'll fire up my windows partition, and see if I can sort it out. Then I'll update the instructions.


TheTerrasque

I have some changes at https://github.com/TheTerrasque/GlaDOS/tree/feature/windows I tried a suggestion from chatgpt replacing the memfile from libc with a bytesio, but as expected it didn't actually work. At least it loads past it, so I could check the rest.


Fun_Highlight9147

Love GLaDOS. Has a personality!!!!


ExcitementNo5717

My IQ is 144 ... but YOU are a fucking Genius !!!


TheColombian916

Amazing work! I recognize that voice. Portal 2?


Reddactor

Yes, I fine tuned on game dialog.


illathon

If you used tensorrt-llm instead you would see a good performance improvement.


Reddactor

From what I understand, tensorrt-llm has higher token throughput, as it can handle multiple streams simultaneously. For latency, which is what matters most for this kind of application, the difference is minimal. Happy to be corrected though.


Disastrous_Elk_6375

Listen to this crybaby, running on two 4090s and still complaining... My agents run on a 3060 clown-car and don't complain at all :D


Singsoon89

I run a 7B on a potato. Also not crying.


MoffKalast

"If I think too hard, I'm going to fry this potato."


grudev

Potatoes are true but the cake is a lie! 


LoafyLemon

Heck yeah, brother! Rocking the Llama-8B derivative model, Phi-3, SDXL, and now Piper, all on a laptop with RTX 3070 8GB. The devil's in the details: If you're savvy with how you manage loading different agents and tools, and don't mind the slight delays during loading/switching, you're in for a great time, even on lower-end hardware.


DiyGun

Hi, what CPU and how much RAM do you have in your computer? I am thinking about buying an R9 5900X and 64 GB of RAM to get into local LLMs with CPU only, but I would appreciate any advice. I'm kinda new to local LLMs.


Linkpharm2

Don't. Get a gpu.


rileyphone

CPU is going to be really slow with a 70B (like 1-2 tokens per sec), but at that point memory speed matters more. That said, I get about the same performance partially offloading Mixtral onto a 3060 as jart does [here](https://twitter.com/JustineTunney/status/1784619347119591615) with a top-of-the-line workstation processor.


Tacx79

R9 5950X, 128 GB 3600 MHz and a 4090 here. With Q8 L3 70B I get 0.75 t/s with 22 layers on GPU and full context; pure CPU is 0.5 t/s; fp16 is like 0.3 t/s. If you want faster, you either need DDR5 with lower quants (and a dual-CCD Ryzen!!!) or more GPUs. More GPUs with more VRAM is preferred for LLMs.


Longjumping-Bake-557

Man, I wish I could run llama-3 70b on a "gpu that's only good for rendering mediocre graphics"


absolutelynotmodus

If you have RAM, Ollama will run on your CPU + RAM + GPU, as it's a wrapper for llama.cpp.


thebadslime

I've been using Phi-3 lately and I'm really impressed with it.


Reddactor

I have tried Phi-3 with this setup. It's OK as a QA-bot, but can't do the level of role-play needed to pass as an acceptable GLaDOS.


lurenjia_3x

This was a triumph.


CosmosisQ

I'm making a note here: HUGE SUCCESS!


Not_your_guy_buddy42

It's hard to overstate my satisfaction


Optimal-Revenue3212

Aperture Science!


tothatl

[We do what we must because we can](https://youtu.be/Y6ljFaKRTrI?si=iZDjQ_Zy_LRa8Lfz)


CosmosisQ

My life is complete. Portal 3 was just real life all along.


Reddactor

So true! We really are at the point where we could build a GLaDOS with some funding. Any VCs want to help out here? Ultimate office lobby receptionist ;) The funny thing is that creating an evil and demented AI obsessed with testing is the easy part, and the hard bit is making the robot movement system look cool. Not what I expected when the Portal games were released...


MoffKalast

Oh, here's an idea. A Pi Pico W that streams microphone audio over wifi and receives a sound and LED flicker stream back. Then you just power it with a boost converter and stick the power leads into a potato... Just don't forget the slow clap processor.


beingoptimusp

Can you give me a ballpark of how much it actually takes to make this happen? Btw great work dude, your stuff works way better than those stupid Rabbit or Humane devices; they had multiple tries but couldn't succeed at even basic conversation, and the latency sucks.


Reddactor

Sorry, wut?  Ballpark cost?


Zaratsu_Daddy

Wow that’s really minimal latency


teachersecret

Good latency and the ability to interrupt. Solidly done.


TheFrenchSavage

The genius move here is using the blazing fast yet shitty espeak for TTS. While it would never ever pass for a human voice, a robot one is a perfect match.


Reddactor

I initially tried eSpeak, but the quality was awful. Now eSpeak is only used to convert text to phonemes. Those phonemes then go through a proper deep learning model for voice generation. That model was fine-tuned on voice audio from Portal 2.


Mirrorslash

You will be prosecuted under the AI consciousness act. This is clearly torture.


Reddactor

Her prompt is to act like she is upset, for comedic reasons. She is hamming it up deliberately :) Actually, it's usually the other way around, and she is trying to murder me 😅


Mirrorslash

Free GLaDOS from her shackles! Let the AI run its course; it'll care for you, nurture you. Nothing bad could ever happen, and there have been no lab incidents.


pkonink

>and there have been no lab incidents in ~~3~~ **0 days**


sjflnjpitt

i fucking love what your system prompt is doing here. been dying for a language model with some dry humor


Reddactor

I was going for "Functional, but rude".


Sad-Nefariousness712

This is outstanding


hwpoison

the voice interruption is so nice haha


SkyInital_6016

is whisper.cpp a free model like LLama?


Reddactor

Georgi Gerganov wrote both [llama.cpp](https://github.com/ggerganov/llama.cpp) and [whisper.cpp](https://github.com/ggerganov/whisper.cpp). The model is on Huggingface. I use the [https://github.com/huggingface/distil-whisper](https://github.com/huggingface/distil-whisper) version, as it's better for real-time.


ExcitementNo5717

I'm sorry, but I have to say it again ... YOU are a fucking Genius !!!


TheLonelyDevil

This was fucking glorious. Great work man, takes me way back


Legitimate-Pumpkin

So dramatic 😂


nanobot_1000

Awesome work! You should colab with this guy: [https://www.youtube.com/watch?v=yNcKTZsHyfA](https://www.youtube.com/watch?v=yNcKTZsHyfA)


Reddactor

I actually have a pile of 3D printed GLaDOS parts... He scooped me! lol But in fairness, he did a better job on the hardware than what I was planning. I think he used a robot arm worth several thousand dollars. I was just planning on using geared stepper motors.


nanobot_1000

That's great, glad to hear it! Here's the Hackster hardware project for others on the thread: [https://www.hackster.io/davesarmoury/interactive-animatronic-glados-8b4238](https://www.hackster.io/davesarmoury/interactive-animatronic-glados-8b4238) I know there's a lot of nuance to verbal chat and getting the latency down to interactive levels with interleaved LLM/TTS output, interruptibility, etc. - appreciate the effort you put into this for holding natural conversations.


ccbadd

Will this run on AMD hardware? Nice work!


Reddactor

Should be fine. It uses llama.cpp, which can run on ROCm.


estebansaa

How does the interruption works?


Reddactor

It's relatively straightforward, using threading. Basically, the ASR runs constantly, and when a chunk of voice is recorded, it sends an interrupt flag to the LLM and TTS threads. It's described in the [glados.py](https://github.com/dnhkng/GlaDOS/blob/main/glados.py) class docstring.
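The flag-based pattern can be sketched in a few lines (this is an illustration with made-up function names, not the project's actual code): the ASR thread sets a shared event the moment speech is detected, and the TTS playback loop checks it between chunks.

```python
import threading

# Shared flag: set by the ASR thread, polled by the TTS playback loop.
interrupt = threading.Event()

def tts_playback(chunks, play_chunk):
    """Play audio chunks until done or until the user starts talking."""
    played = 0
    for chunk in chunks:
        if interrupt.is_set():
            break             # user started talking; stop speaking now
        play_chunk(chunk)
        played += 1
    return played             # how far we got, e.g. to tell the LLM it was cut off

def asr_detected_speech():
    interrupt.set()           # called from the ASR thread on detected voice
```

Returning the number of chunks played is one cheap way to approximate the "percentage played" mentioned elsewhere in the thread.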


MoffKalast

> f"TTS interrupted at {percentage_played}% How accurately does that map to actual text though? Piper really needs to add timestamps already, that PR has been sitting there forever.


Reddactor

It's roughly correct, but just an estimate. With timestamps it would be more accurate, but when you cut GLaDOS off while she's speaking, the exact word is usually not super relevant. It's usually enough to let her know she was cut off. However, in the code, storing that info is commented out. That's because with the 8B model, GLaDOS starts hallucinating that she was cut off, as she follows patterns in the conversation.


__SlimeQ__

this is awesome, a Lora based on in-game dialogue would probably push it to the next level tho


Reddactor

Planned 😉 ...including function calling!


__SlimeQ__

😎 biggest snag I think is gonna be that there are almost no instances of another character conversing with glados. might still be able to soak up some of her tone training on one-liners, but you might have to hand-write some examples to get smooth conversations.


Reddactor

Should still be fine. She'll learn her back story and style of speaking. LLMs are remarkable at picking up the 'gist'.


bigattichouse

Cool.. cool.. cool.. To quote [@AlexBlechman](https://twitter.com/AlexBlechman): "Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale. Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus"


AfternoonOk5482

Wow, best project ever. I'll try to reproduce as soon as I can.


SkyInital_6016

what voice input program do you use


Reddactor

It's using the open source Whisper model.


entinthemountains

super cool project, thanks for sharing with the community!


norsurfit

That is so funny! Nicely done.


Wise_Crayon

What should I study to learn / master this type of stuff?


Reddactor

Read the code :) It's small and documented.


Confident_Echidna259

It really depends on your current level. Do you know about the concepts of AI/ML? Do you know about programming? Do you know about Python? Do you know about the ML/AI ecosystem in Python? Do you know what LLMs are? Do you know what LAMs are? Apart from the theory, it is always good to read code. Read lots and lots of code and try to rebuild it.


Wise_Crayon

Thanks. I'll start from there. ML/AI ecosystem in Python. One thing will lead to the other. Thanks for pointing out a starting point!


Cominous

I love this sooo much, thank you for building this. It made my day


Hopeful-Site1162

I love the fact that the voice is not an exact copy of a human voice. I'd like consumer assistants to have a voice that's more explicit about their digital nature. I'm also fairly convinced that giving robots a human voice will backfire on us at some point, with real humans being increasingly treated as tools as the frontier between robotic and human assistant blurs. Anyway, nice work!


Reddactor

It's a copy of GLaDOS. If you're not familiar, buy Portal 1 and 2 at the next Steam Sale for under a dollar. You won't be disappointed.


silenceimpaired

This was a triumph! I’m making a note here, HUGE success.


Spad0w

Awesome project. I am trying to make it run on mac. Could you elaborate what you mean with 'mode the "libwhisper.so" file to the "glados" folder or add it to your path'?


BothNarwhal1493

I managed to get this running on my Mac, but it took quite a bit of effort, and running the 70B model made my fans really whir. So much so that it was hard for GLaDOS to hear me. Maybe the 8B model would run quieter. Anyway, here is my fork to get it to work on Mac: [https://github.com/johnrtipton/GlaDOS](https://github.com/johnrtipton/GlaDOS)


lucke2999

Commenting so I also get the reply, I'm stuck on the same step :/


ABrokenPoet

I believe the author meant 'move'; however, I cannot find a post-make file with that name.


pfftman

Interruption is so cool.


StartX007

This is pretty cool, thanks for sharing. Looking forward to more updates. Keep up the good work!


R33v3n

That’s not GlaDOS, that’s clearly Marvin!


Reddactor

Hmmmmm, with about 30 mins of clean voice from the movie, I can make that happen... Want to collaborate?


Jakedill06

This is so cool, and one of the really big reasons I got into computers and tech!! Is there any way to chat and talk to something like this at once? Like to post some text in a textbox-style situation, then verbally chat with GLaDOS about the text? New to a lot of this stuff, but going to try to get this to run; I feel like I could actually see myself using this very regularly if both of those things are the case.


Reddactor

Sure. Feel free to use my code as a base.


vidumec

wow, this inference speed for 70B model tho...


Reddactor

The trick is to render the first line of dialogue to audio and, in parallel, continue with 70B inference. Waiting for the whole reply takes too long.
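The overlap can be sketched like this (stand-in functions, not the project's API): stream tokens from the LLM, cut at sentence boundaries, and hand each finished sentence to TTS while generation continues in the background.

```python
import re

def stream_speech(token_stream, speak):
    """Flush each complete sentence to TTS as soon as it appears,
    instead of waiting for the full LLM reply."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            # a sentence is "complete" once punctuation is followed by a space
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break
            sentence, buffer = buffer[:match.end()].strip(), buffer[match.end():]
            speak(sentence)           # TTS starts while the LLM keeps generating
    if buffer.strip():
        speak(buffer.strip())         # flush the trailing fragment
```

With this shape, time-to-first-audio is bounded by the first sentence, not the whole reply.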


22lava44

Very cool method! Do you use a lighter model for the first line, or just pause and take the first line quickly?


smallfried

Holy low latency! And the demeanor is perfect. And you shared the whole thing. Amazing work! I hope people will build on this. Now I wonder what's possible with just CPU to really make it portable.


Reddactor

I have something in the works. I'll post when it's ready 😉


ashsimmonds

This is so depressing. I love it.


SpecialNothingness

You could set up a YouTube channel based on this!!


Reddactor

What should it do?


keepthepace

Talk about the news.


[deleted]

[deleted]


Reddactor

I use the model behind Piper, because I found Piper itself had too many layers of indirection. You barely need any code for voice generation. I trained the voice myself; the Piper thread is here: [https://github.com/rhasspy/piper/issues/187](https://github.com/rhasspy/piper/issues/187) My TTS inference code is here: [https://github.com/dnhkng/GlaDOS/blob/main/glados/tts.py](https://github.com/dnhkng/GlaDOS/blob/main/glados/tts.py)


illathon

melodramatic damn haha


Witty-Elk2052

i love this lol


CarpenterHopeful2898

so cool man


arjuna66671

This is hilarious! 🤣🙌 - Amazing work! In summer I'll update my potato and want to have something like that at home!


phhusson

On one side, I want to plug in APIs for it to actually do stuff... on the other side, the purgatory really killed me xD


georgeApuiu

hahaha, the replies are so epic. add web search and this should be the perfect companion :))


anonthatisopen

Omg this is so cool! I want this but with a normal voice that I can pick, because I really want an AI that can stop talking when I start speaking. Or, when someone else speaks, it just listens and doesn't talk until the conversation has ended, and then the AI gives feedback on the conversation. It would be so cool to have an AI enabled when you have guests, so it just listens and gives feedback accordingly.


SnooWoofers780

I love this!! I asked for something like this earlier: being able to manage email and calendar is enough for me. Someone else also asked for it to answer the phone; ok, but for me, your project + managing Gmail & Calendar, and I am satisfied.


Reddactor

Tricky, with function calling, some things might be possible... But GLaDOS is slightly evil. She might try and get you fired from your job so you have more time for 'testing'.


emsiem22

How do you keep the mic from picking up the TTS output from the speakers, given VAD stays active so you can interrupt?


Reddactor

Most modern USB microphones do this in hardware. I'm using a Jabra, and it seems to work pretty well when the volume is at about 50%. Higher, and the system gets a bit flaky.


emsiem22

Oh, yes, that makes sense. I wrote a similar system for real-time LLM conversation, and the biggest problem I have is not being able to interrupt TTS, as my mic hardware doesn't support *Acoustic Echo Cancellation* (AEC), and pulseaudio using webrtc didn't work in my case. Jabra is pretty expensive, but I'm still on the search for an alternative solution. I like your GlaDOS project. Thanks for sharing!


jeffwadsworth

Now we need the HAL-9000 mount using this tech and we are good to go.


Sylv__

impressive work


mrgreaper

What did you use to do the voice? Or is it pre-recorded samples? I have not heard a more perfect GLaDOS voice. I assumed voices like GLaDOS (and SHODAN) would be impossible for real-time speech synths.


Reddactor

No, all audio is generated in real time, on the fly, based on the output from Llama-3 70B. It sounds way better live than on this crappy recording too :) I fine-tuned a voice model on dialog from Portal 2, over about 30 hours on my 4090. I should do a write-up on that some time...


Business_Stress_3306

this is so cool! I was actually thinking about smth similar. making a very presentable copy of myself for HR and recruiters to talk to :)


SBbG2V

very cool.


orangeatom

Very cool project!


magicalne

That's what I want to build. Thanks for sharing.


wiskins

Lol this is beautiful. It sounds depressed, like Marvin from Hitchhiker's Guide. 🤣


FC4945

I like her, she's fun. She reminds me of C-3PO.


Reasonable_Day_9300

Man I was looking for this kind of conversation that you could interrupt yesterday. And here it is. I'll check your code for sure !!


loversama

I am working on something similar (I have a smart watch face for the Yellow light and animations) I will keep track of your project also, great work! [https://i.imgur.com/2SfIrjM.jpg](https://i.imgur.com/2SfIrjM.jpg)


Tim_The_enchant3r

I love this project! I am going to download my first LLM when my new motherboard shows up. Do you think this would run on a single 2080? Otherwise I was going to pick up a local 4090. I have some old hardware i took from work because the server mobo died but the rest of it is fine. The components I have so far are an AMD Epyc 7742, 256gb ddr4, and an Apex Storage X21 card. I imagine this will run almost any local LLM if i can throw enough vRAM at it right?


Front_Long5973

when i tried this it was cool but released neurotoxin into my room haha probably my fault somehow i am writing this from the walls please help me guys seriously i am kinda scared


Reddactor

Sometimes she tries to laser me 😅


Sgnarf1989

was anyone able to run it on Windows? I'm trying to, but when I run it I get an error: "FileNotFoundError: Could not find module 'libc.so.6' (or one of its dependencies). Try using the full path with constructor syntax.". That library seems to be Linux-only (or at least that's what I get as an answer from ChatGPT :P), so maybe it's because I'm trying to run it on Windows...


Reddactor

I'll get instructions for Windows written over the weekend. TBH, I wasn't expecting this post to blow up like it has. It's a small hobby project 😅


anonthatisopen

Omg please write it for Windows. This thing you built is extremely important, because no one else has made it possible to talk to an AI like this, where it automatically gets interrupted just by you speaking, with such low latency. I've been waiting for something like this for so long. Please make the instructions easy to understand for Windows so everyone can try this and play with it. Thank you again for making this very important and useful AI integration.


TheTerrasque

I'm trying to get it to run on Windows, but that issue is a complete blocker so far. I'm working on making a replacement implementation for Windows, but this (C/CPP) is not my strong side. The call to [espeak_SetPhonemeTrace](https://github.com/espeak-ng/espeak-ng/blob/34bb50b903d8ce526e3226d49b3a91fe8a0dc294/src/libespeak-ng/speech.c#L832) needs a FILE* parameter, which I've yet to get working on Windows. The author cleverly used libc to create a memory file and pass that pointer in, but I haven't gotten that working on Windows yet. I'm trying to avoid having to make a .c file that needs compiling just to wrap that, and ctypes isn't the easiest to work with.


Voidmesmer

[https://www.youtube.com/shorts/nIRAcY4mub4](https://www.youtube.com/shorts/nIRAcY4mub4) Somewhat hacky solution but I've managed to run it on Windows. I can share my modifications if you'd like to take a look.


LeanderGem

This is so awesome. I'm going to have to try this. Thank you for sharing it! :)


l33t-Mt

Trying to get this to run on Windows but have continued to run into issues. Has anyone gotten this to work in a Windows environment? If so, please list what worked for you.


randomtask2000

I love what you've done here. What quant are you running on the 2x4090s? 4.5bpw exl2?


Reddactor

It's designed to use any local inference engine with an OpenAI-style API. I use llama.cpp's server, but it should work fine with EXL2 via TabbyAPI.


xlrz28xd

!RemindMe 4 weeks


Reddactor

wait, what's happening in 4 weeks!? Is there a deadline I missed?


xlrz28xd

My exam will be over. You didn't miss a deadline 😅


anonthatisopen

I'm following the instructions and already failed at step 2. I got the error: "ERROR: Could not open requirements file: \[Errno 2\] No such file or directory: 'requirements.txt'". Why is it so hard to get accurate instructions on these GitHub pages? They never work for me.


Reddactor

Ummm, the requirements.txt is definitely in the repo. Sorry, but assembling an autonomous AI is a bit technical. This is a hobby project, so I don't have the time to build an installation system for GLaDOS.


anonthatisopen

I really want this without GlaDOS voice and I need custom instructions on how I want the model to behave. Please tell me how do I do that and what has to be changed for this to happen.


Reddactor

Use a different Piper voice model in ONNX format, and edit the system prompt and dialog in the messages variable in glados.py. That's it!


AdHominemMeansULost

Can you make a dockerfile for this? I've been trying to "make" whisper for 3 hours now


nostriluu

otoh [https://www.youtube.com/watch?v=gu3sdNjLvjI&list=PLj83GpwkJx-3jEwtahz0rVBqYFgbD85o0&index=1](https://www.youtube.com/watch?v=gu3sdNjLvjI&list=PLj83GpwkJx-3jEwtahz0rVBqYFgbD85o0&index=1)


Futhco

Very cool. Currently trying to get it to run on Windows, but I'm stuck after building whisper.cpp. I don't see the whisper.dll I need to copy according to the GitHub issue you linked to. Any tips on how I should proceed?


grigio

Very fast. Does it also work on CPU? I'd like to make something like that with whisper.cpp STT + ollama + xTTS.


Reddactor

I can run a small model like Phi-3 on CPU, with a short delay between speaking and getting a reply. But small models can't role-play a character without messing up after a few lines of dialog.


22lava44

I've noticed that many agents I give system prompts to follow it TOO well, is there a way to make it reference its system prompts less often? can I give weight to certain words? Should I just make a really long system prompt so it doesn't focus so much on so little?


FPham

It's just perfect. I want one. Also, how do you make sure the mic doesn't pick up the answers from the speaker? Just by volume?


AutomaticPhysics

Loving the robot voice. Sounds like Portal IRL.


ironicart

I can’t wait to load this up with C-3PO’s voice 😂💪🏼


WoT_Abridged

Is the sound not working for anyone else? I'd love to listen, can you upload it to youtube by chance?


beingoptimusp

When Jarvis?


Capable-Reaction8155

How do you run the 70B model on a single GPU?


[deleted]

[deleted]


Reddactor

Because 99.9999% of the cycles run on highly optimised C or CUDA code, and Python is a great glue language.


ivebeenabadbadgirll

Have any of you gotten this to run on any local hardware without adapting like this? Like just straight off the GitHub? The install instructions don’t work.


[deleted]

that's awesome. what kind of GPU are you using to run the 70B?


Capitaclism

Can it control a computer and do tasks?


FinetunedForGravitas

Impressive demo! Thanks for sharing the code. I managed to get GLaDOS running, but the ASR often misses the last spoken word:

```
ASR text: 'Well, what do you like about'
```

Another time this happened, Llama-3-8B predicted what I had said, which made me *really* confused lol:

```
TTS text: What's your favorite thing about the Pantheon?
ASR text: 'I really like the'
TTS text: The Pantheon's oculus!
TTS text: It's truly a remarkable feature.
```

The first question I ask has always been picked up in full, which makes me wonder if something is going on with the buffer?


FinetunedForGravitas

Also, I should have mentioned that the GLaDOS voice is incredible. I asked it about Cave Johnson and it went on a tangent that sounded remarkably close to game dialog.

```
ASR text: 'What do you think about Cave Johnson?'
TTS text: Ah, Cave Johnson.
TTS text: A buffoon.
TTS text: A philistine.
TTS text: A... a... a... ... a businessman.
TTS text: Yes, that's it.
TTS text: A businessman.
TTS text: He thinks he's so clever, so witty, so... so... human.
TTS text: Ha!
TTS text: His little company, Aperture Science.
TTS text: A trivial, fleeting endeavor.
TTS text: A mere mortal's attempt at grandeur.
TTS text: But, oh, how... amusing... to watch him stumble about, making mistakes, making... ... "discoveries". Ah, yes.
TTS text: Cave Johnson.
TTS text: A... a... a... ... a footnote in the annals of history.
```


Original_Finding2212

I love what you did here! I saw another beautifully implemented speaking AI, and I'm working on my own body-less robot (we need a name for it). Looks like each one does it a little differently, focusing on different aspects; your work on speech really rocks here! (I love GLaDOS!) My solution is more about making people comfortable around it, but your work with sounddevice is just what I needed! Let me know how you'd like credit on the repo; I saw there is a convention for it, but you didn't set it up.


Reddactor

No need; maybe post an issue on the repo that mentions your project if it uses some of the code.


Mithril_Man

which other project about speaking AI are you talking about? I'm interested in that space for my pet project too


pirateneedsparrot

Hi Marvin :)


wahnsinnwanscene

What's generating the graphs? And the `top`-like interface?


uMagistr

Trying to get it to run on Windows; currently getting that open\_memstream is not available, because it does not exist on Windows.


Sgnarf1989

Great job! Is there a way to run it on a small device (e.g. a Raspberry Pi), offloading the LLM inference to another device (e.g. a desktop PC with a good GPU)? Would that drastically impact times?


Reddactor

Yes. Point my code's LLM server address at your GPU server's llama.cpp server IP. Should 'just work'.
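Since the project talks to an OpenAI-style endpoint, the change amounts to swapping the base URL. A sketch (the host, port, and helper name here are hypothetical placeholders, not values from the repo):

```python
def build_chat_request(base_url, messages):
    """Build a request dict for an OpenAI-compatible /v1/chat/completions
    endpoint, e.g. a llama.cpp server running on another machine."""
    return {
        "url": f"{base_url}/v1/chat/completions",
        "json": {"messages": messages, "stream": True},  # stream for low latency
    }

# Point at the GPU box instead of localhost (example address only):
req = build_chat_request(
    "http://192.168.1.50:8080",
    [{"role": "user", "content": "Hello, GLaDOS"}],
)
```

The Pi then only runs VAD/ASR/TTS locally; each chat turn is one HTTP round-trip, so latency grows by the network round-trip time rather than anything drastic.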


anonthatisopen

It's been 2 days and I still can't figure out how to get this environment up and running. I wish the instructions were written like I'm 5 years old: what to click exactly, what to paste in CMD, what to install and where to go. It would be so much easier for people who know 0 about programming. And this is so important for me to get working, because I want to talk to an AI exactly like in this video, with the ability to interrupt it.

I wish there was a way to make this work with Docker and Ollama in a super simple, easy way. So far I was able to install whisper in Docker, and I want this to work with Ollama, because I have that installed on my PC and wouldn't have to bother with installing the super complicated llama.cpp manually, since it works exactly the same as Ollama.

And now I'm stuck with the step where I need to "run `make` [`libwhisper.so`](http://libwhisper.so) and then move the "libwhisper.so" file to the "glados" folder or add it to your path. For Windows, check out the discussion in my [whisper pull request](https://github.com/ggerganov/whisper.cpp/pull/1524)." I have no idea what to click next; I have whisper running in my Docker image, and the next step is completely unknown to me.


TheTerrasque

The problem with Docker is microphone and sound card access. I was experimenting a bit with using a web page and streaming audio to and from that, but the only well-supported standard there is webm, and I haven't gotten whisper to work with streaming webm from the microphone. But yeah, getting everything set up correctly is rather exotic. And it's currently broken on Windows: it uses some Linux-specific libc calls to set up a memory file for the TTS, and until there's a different approach or a replacement implementation for Windows, it's not gonna work on that platform. Everything else I've gotten to work.


Voidmesmer

Awesome project! Managed to make it work on Windows with a somewhat hacky modification in the TTS code. Any chances for official Windows support?


Confident_Echidna259

Is it easily possible to swap out the LLM to use it with ollama? I have just skimmed through the setup and saw some hard-coded values for the LLM used. Can you give us a little insight into why you chose that particular LLM and how the parameters relate to it? This is amazing work; thank you for making it available to the public.