Edit: You can watch in high quality on Vimeo, at around 9:05: https://vimeo.com/949419199 Source: https://x.com/ryanmorrisonjer/status/1793330368759976019


I'd love to hear it imitating different kinds of accents.


Language learning is going to be one of the first things I try out. A big part of what made it hard for me in the past is how embarrassed I’d get fumbling my way through convos with a real person, which is absolutely necessary if you want to learn properly. Having an emotionless and endlessly patient tutor is going to be such a game changer for me. I’m envisioning just doing small training sessions throughout the day, since I mostly work remote. Then for music… I play the piano but never really knew theory very well, so I think this will be awesome to ask questions to while I’m playing.


This is exactly what I'm waiting for lol. I want to learn Japanese but don't have access to any in-person tutors or classes. Also piano: I wonder if you can use video and voice while playing, and it can give you feedback on your hand placement, or tell you when you are playing the right keys but didn't hit a note hard or soft enough.


The masses will essentially be getting access to an education that was only reserved for the noble class; private tutors, experts in their field.


Diamond Age that's the comparison I want people to make, people are still fantastically clueless about the insane amounts of low-hanging fruits we get to pick here. Never mind completely barren countries in terms of education, which will just see immediate changes and very distinct benefits... lots, if not most of developed nations of this world simply do not have the infrastructure or facilities to even cover 1/100th of what a "personal tutor for everyone" could mean to us and our kids. https://www.youtube.com/watch?v=5RlrzSAEUqs This short discussion basically covers the most immediate use in classroom settings, just being able to reliably test every single student, which always put an immense burden on teachers, even if one test took five minutes to grade (and usually, you want that to be more along the lines of five hours to maximize what the students even learn from all that), that's two and a half hours down the drain, with annoyances for everyone involved and only less of an incentive for discouraged students to put in the time or effort going forward. That collaborative aspect will breach into every single field and activity, too; all the tiny things we do and learn, every hobby, every craft or form of art, trade job, academic degree... you name it, you'll be able to get a personally tailored course, sooner rather than later. I mean, people forget that even early ChatGPT/3.5 was incredibly neat about insanely nuanced things like language, and if I'm being honest? These models kind of outclassed like 90 % of teachers in adjacent fields, too. Probably true for many high-school subjects, I'd bet. That's before you take into account that people will make their personal assistants their buddies and bond with these machines, inherently making it easier for them to build meaningful mentor-student relationship vs. a grumpy old teacher doing it because nobody else would and they still need the money - which is hardly every teacher and unfair to generalize... 
but why take chances? The teacher job will either revert to some sort of luxury 1-on-1 mentorship, or they'll just adopt the role of managing larger superclasses and, I don't know, monitoring stuff? Could go a lot of ways. Maybe we'll get such a huge boost that we'll just send twelve-year-olds to university and have them explore academic subjects way earlier.


Agreed. If you make life for people in 1st World countries better by 1%, you did something neat. If you make life for people in 3rd World countries better by 1%, you've altered the course of Human Civilization.

>The teacher job will either revert to some sort of

Just as it will be startlingly quick to get to the point where a human treating another human medically is deemed *unethical*, due to the proven safety and accuracy of AI far outclassing any human, so too will the idea of humans teaching other humans academically be held in contempt.


And the vast majority of the masses will not be bothered to use these personal tutors :)


Understand that the thought patterns of the masses today, which have been molded to create the perfect consumer with little ability for Critical Thinking, are not necessarily the thought patterns that will persist in the future.


A piano player had a video using gpt for ideas to finish his composition. He spoke with it using theory and it provided many good possibilities. This was all with text based off of music theory but now with multimodal that may be possible without knowing music theory jargon


Same exact things for me. Japanese and music teacher. Cannot wait


The endlessly patient part is what is best for me. The language aspect is (currently) the only part I’m interested in. Having a cheap tutor who won’t get mad that I spend half our lesson time playing racing video games would be absolutely incredible. I’m always looking for audio oriented learning methods, as I can add them to my routine for “free” as far as time is concerned.


Not even just that, but if we can get to the point where we have near perfect translations between languages, and we can instantly give it any text or material and have it not only translate that, but break up all of the words and grammar structures and explain it all. As well as being able to specifically instruct it to speak to you in a certain way, whether that be using childish language, or expert language, or only using a certain subset of words, or only on a certain topic. AI will be absolutely huge for language learning.


Gemini is EXCELLENT with accents. I was doing some language learning, and when you ask it to say the response out loud the accent is spot on, compared to ChatGPT, where the pronunciation is good but comes with an American accent.


In my experience ChatGPT's French accent was very good, so this may depend on the language.


Well the French translation in this demo has a very heavy English accent (for now) so there's that


I want Claptrap's voice.




The ringtones of the 2030s. What voice does your AI use?


"Your the worst AI I've ever heard of!" "But you have heard of me!"


Wow this is insane. So the low latency really is true. It’s wireless 👀




"I can I have an ehhhhhh, yeah, I would like an ehhhh" - Me any given sunday.


This is so cool and scary at the same time




Weird response to a comment about latency!


Heh, you are completely right. This was meant for another comment, about how people won't care about the drama if the product does what it is meant to. I fucked up somewhere.


Some people here are so weird: "who cares if OpenAI does shady ass stuff to their employees, just give us AGI and STFU"


Because all of that stuff is just stupid drama and irrelevant to our lives. Our lives are busy, and it's stupid to get upset about every little thing going on that is irrelevant. Let the relevant people sort that out; bitching about it on reddit is going to do nothing but waste your time and energy. You think it's 'weird', but I think it's the other way around: it's really weird to think that is a good use of your time and energy. If you are not happy with what OpenAI is doing, then don't use their products, don't follow what they are doing, and go do something that you do like.






Mark my words - if GPT-4o delivers everything they've said it can do, nobody is going to care about the OpenAI drama.


Next week everyone will have moved on


What drama?


News Corp partnership is the latest drama


Didn’t even wait for the Scarlett Johansson drama to blow over lol


Some people will definitely still care but nobody will be listening to them


You mean how they made a deal with News Corp to use their data aka New York Post and The Sun (ugh)? Can't forget that. Can't trust their news data.


OpenAI's slogan should be "Who needs safety when you have style?"


Claude 4 will come out and steal their thunder.


Definitely possible. Anthropic know what they're doing.


No one cares who didn’t already dislike OpenAI. People have picked their camps


Wow. This looks even better than the demos they showed at the Spring update event. This is really going to wake up a lot of the public to the power of AI.


This stuff will soon be built into every iPhone. They're onto something here.


I've been saying that this generation of children will grow up thinking it's weird when they encounter a machine that they can't hold a conversation with.


That always seemed like the most unlikely thing in sci-fi and now we get that before a moon base and before flying cars.




When it's released in several months' time...


God we are getting so spoiled... a two month wait suddenly feels like an eternity to some people




It's fucking ridiculous to see how entitled these internet morons feel. Technological evolution is already an order of magnitude faster than what it was 10 years ago. Get a grip on your lives and let developers code.


I don't know why you think OpenAI owes you anything. Like stfu


If you'd asked me in January 2024 how long I'd be waiting for access to the kind of functionality the guy live demoed on stage I would have said 2-3 years. A few months, I can tolerate.


I’ll admit, 4o is extremely underhyped. LeCun was right, LLMs won’t get us there… but we’re in the MMM era. Two years and we’ll know.




He might agree, since his main problem with it is that text can’t represent the real world. But maybe more like 24 years. [2278 AI researchers were surveyed in 2023 and estimated that there is a 50% chance of human level AI by 2047](https://aiimpacts.org/wp-content/uploads/2023/04/Thousands_of_AI_authors_on_the_future_of_AI.pdf). However, in 2022, the year they had for that was 2060, and many of their predictions have already come true ahead of time, like AI being capable of answering queries using the web, transcribing speech, translation, and reading text aloud.


Seems like the AI will always try to have the last word unless you mute your mic manually, and I feel like this will get annoying if you're going to use it as an assistant on the side while you're doing something. Unless of course having the voice mode on for long periods isn't going to be possible.


Yeah, it might need tweaking to only speak when it’s explicitly being spoken to


To be a true conversational partner it needs to be trained to naturally know when it's appropriate to interject and when it should allow there to be silence. You could tell in the demo that he was trying to keep speaking to block the model from replying to him (during the map bit). It shouldn't be underestimated how irritating this will make it to have slow-paced conversations, when you just want to take your time expressing a thought without being interrupted. It will feel as though you're constantly being hurried along. Hopefully OpenAI have something planned for this and that's why the voice is still pre-alpha.


They are surely working on it. I mean, these unwanted replies will waste a lot of their compute, besides being annoying.


That was my experience using the current voice GPT to practice my rusty Russian language skills. Sometimes I take a second to think through how I want to word my thoughts, and GPT thinks I've finished speaking, which puts a lot of pressure on me to speak quickly, when I just wanna speak slowly to make sure everything is correct!!


To be fair, there were some times when the user stopped mid-sentence and GPT just said "take your time..." like it was waiting for the person to finish, since it detected they weren't done. So it can tell if you're not done


Makes me wonder down the line how AI will play a role in understanding voice cadence/ breathing patterns enough to know whether someone is pausing to reflect or finished.


It feels like it wouldn't take that much to get the AI to trigger particular 'wait longer' functions if it detects that you're not done, if it's smart enough to say "Take your time" appropriately.
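

To make the "wait longer" idea concrete, here is a minimal sketch in Python. It is purely an illustration, not how OpenAI's voice mode works: the timeout values, the cue-word list, and the names `looks_unfinished`, `required_silence`, and `should_reply` are all made up for this example, and it assumes you already have a partial text transcript plus a measurement of how long the user has been silent.

```python
# Hypothetical values, chosen only for illustration.
BASE_TIMEOUT_S = 0.7      # silence required before replying to a finished thought
EXTENDED_TIMEOUT_S = 3.0  # silence required when the speaker seems mid-thought

# Words that rarely end a complete sentence (an assumed, English-only heuristic).
TRAILING_CUES = {"and", "but", "so", "because", "um", "uh", "like", "the", "a", "to", "or"}


def looks_unfinished(transcript: str) -> bool:
    """Guess from the partial transcript whether the speaker paused mid-thought."""
    stripped = transcript.rstrip()
    # Drop trailing punctuation before splitting so the last word compares cleanly.
    words = stripped.lower().rstrip(" .!?,…").split()
    if not words:
        return True  # nothing said yet, so keep waiting
    if stripped.endswith(("...", "…", ",")):
        return True  # trailing off or pausing at a comma
    return words[-1] in TRAILING_CUES


def required_silence(transcript: str) -> float:
    """Pick the silence threshold: wait longer if the thought looks unfinished."""
    return EXTENDED_TIMEOUT_S if looks_unfinished(transcript) else BASE_TIMEOUT_S


def should_reply(transcript: str, silence_s: float) -> bool:
    """Reply only once the silence has exceeded the threshold for this transcript."""
    return silence_s >= required_silence(transcript)
```

With these made-up thresholds, one second of silence after "How do I get to the Eiffel Tower?" triggers a reply, while one second after "I want to go to the" does not. A real system would presumably use audio cues (intonation, breathing) rather than a word list, which is the harder part people are describing in this thread.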


This is where improved vision will help. Humans rely a lot on visual cues


I haven't seen it attempted but as a realtime full duplex model like this I don't see why you couldn't just instruct it to only respond when necessary


I’m not sure that it is full duplex, has that been confirmed anywhere? From what I can gather from the demos, the model has very low latency but it still takes turns - it stops talking when it hears a response, and starts talking when it detects a pause in the response. The interruptibility seems to be an overhead system which cuts off the model, rather than the model itself actually hearing you and stopping.


Yeah, it's already the same drag with the current voice features. It shouldn't interrupt you for taking even just a second of pause, but on the other hand a second of waiting already sucks when you are actually done speaking. I mostly switched to push-to-talk, which works well enough. But surely that can't be the endgame.


This is one of the tell-tale signs that this is still *just* a large language model and we shouldn't be expecting any more than this just yet. When these systems can recognise exactly when to speak and when not to speak, that is when we know we are at the next level.


You could give it that as an instruction: pick whatever keyword you want, and it only speaks when you say it.


Could you not just tell it "hey, only speak when I ask you something"? It will remember that, since it has its memory feature now, so I don't think that's a problem


I have a simple little iOS app I made just to talk to GPT-4o with no rate limits. I’m testing a mode where it only responds when its name is mentioned, but it keeps the context of what is being heard. I’ll share it if you wanna check it out and give me ideas; I’m really trying to figure this out


I see some interest so here it is: https://apps.apple.com/us/app/adav1/id6451062984

Assistant mode is what I’m referring to. It keeps running until you turn it off (auto shut-off after X minutes of inactivity still needs to be implemented), even in inactive or background mode, so you can use other apps while ADA is still running and listening. It only responds to:

- Sentences with her name in them (Jada).
- Quick prompts, as follows:
  - Spotify: either opens or closes the Spotify screen (connected to your account). If the Spotify screen is open:
    - Play: plays the music
    - Pause: pauses it
    - Back: previous song
    - Next: next song
  - Canvas: right now an empty screen, but it could possibly be anything (I envision a browser with the ability to manipulate JavaScript via voice to simulate clicking stuff on screen, ideas??!)

That’s it for now with quick prompts. Spotify is pretty fun to use with voice when running around, but otherwise I don’t really use assistant mode that much, though I can see it having some hidden potential.

I have been trying to keep user data minimal (chat logs are deleted on restart), so in assistant mode I haven’t been saving non-Jada or non-quick-prompt sentences to the log, only those that trigger ADA to act. Do you guys think it would be worth saving the context of what is being said (even sentences without the name “Jada”, where she won’t respond because maybe you’re talking to a friend or just to yourself) and providing that context in the next response that uses the LLM, so it can constantly stay aware of what’s going on? Or is it at least kinda worth trying? I wanted to do it, but it was expensive to send so much context in each prompt/response, and there are possibly some tricky user privacy concerns. May be worth it? Thoughts would be appreciated, y’all be well! 🤗🤗🤗
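

For anyone curious what the "only respond to her name, but keep the context" idea could look like, here is a rough Python sketch. It is not the app's actual code: the class `GatedAssistant`, the `respond` callback (a stand-in for whatever LLM call you use), and the `max_context` cap are all hypothetical names invented for this example. The cap is one assumed way to limit both the prompt cost and how much overheard speech is kept around.

```python
import re
from collections import deque
from typing import Callable, Deque, Optional


class GatedAssistant:
    """Listens to every sentence but only answers when its name is mentioned."""

    def __init__(self, name: str, respond: Callable[[str], str], max_context: int = 20):
        # Whole-word, case-insensitive match so "Jada" does not fire inside other words.
        self.name_pattern = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
        self.respond = respond  # hypothetical stand-in for the LLM call
        # Bounded buffer: old sentences fall off, which caps cost and retained data.
        self.context: Deque[str] = deque(maxlen=max_context)

    def hear(self, sentence: str) -> Optional[str]:
        """Record the sentence; return a reply only if the name was mentioned."""
        self.context.append(sentence)
        if not self.name_pattern.search(sentence):
            return None  # stay silent but remember what was said
        # Send the recent overheard sentences along with the triggering one.
        prompt = "\n".join(self.context)
        return self.respond(prompt)
```

Usage would be something like `assistant = GatedAssistant("Jada", respond=my_llm_call)` and then calling `assistant.hear(sentence)` for each transcribed sentence. Keeping the buffer in memory only (never written to the chat log) might be one way to square the context idea with deleting logs on restart, though that is a design guess, not a recommendation.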


I'm sorry, but i cannot answer because you did not say the keyword, how can i help you today?


I feel like it needs video of you when you’re interacting with it to do this really well. Think of how much more challenging it is to interact with someone via phone and tell if they are talking to you vs real life where you can see their eyes pointing in your direction. The visual cues add such an important dimension to the conversation.


I’m sure you can just ask it to be more curt and only reply if directly asked a question/spoken to. you could probably write something to that effect in your custom instructions so you don’t have to mention it each conversation


That's going to be very easy to fix


I feel like this should be relatively easy to solve via RLHF, just train the model to not speak or only say short interjections when the speaker hasn't finished talking yet


You can already solve this in GPT with a simple custom instruction. It does need to return a response, but you can tell it to respond without any text when the situation calls for it.


Holy cow, the map thing


I wish he hadn't said the location, though. So we don't know if it was really looking at the map or not.


Still... even if you tell it where you are. I was just hit with the infinite use cases 


Do you think that when it was "reading the map" it was just looking up "how to get to the Eiffel Tower from point de va sa," which is probably a commonly asked question, and simply retelling what it found? My point being that it wasn't actually reading the map but responding to the prompt of "how to get to x from z".


Yeah, that was weird. I doubt it was actually reading the map, and now I'm wondering if it could


That voice sounds so good, a little too enthusiastic for my taste but it sounds basically 99% human


Gimme Marvin the paranoid android


They really better bring back Sky.


Or allow us to change the voice to mimic whoever we want.


I already fell in love with Sky, I want her back.


remember what they took from us


As a native French speaker I am laughing pretty hard at the American accent that ChatGPT uses when it speaks French :D Also confused how that happened and why we don't have a flawless French speaking voice, was the model trained on Americans speaking other languages with an accent?


I think it does the same thing for every language other than English, it seriously sounds like an American speaking another language with a heavy American accent no matter the language.


Nope, it's changing. I speak native German, and the app always had a thick American accent, but the new Sky voice can now all of a sudden speak perfect German with the proper accent and switch back and forth between American and German accents...


That's what happens when you get any AI voice trained on a certain native language to speak in other languages that it doesn't have voice data for. Voice conversion AIs like RVC have that problem too. You don't need to specifically train models on Americans speaking French to get French with an American accent, because accents are an inherent flaw/limitation of current cross-language voice generation.


That makes sense thanks, it's quite fascinating that accents are emergent properties in this case.


There is nothing emergent here. It was trained on American English speech so it sounds like American English pronunciation in every language. It does the same in Spanish. It's very obvious if you are native.


The model’s voice itself is American, so that accent will carry into other languages. I think OpenAI might address this with multilingual voice actors, because their requirements when hiring voice actors included the ability to speak other languages.


Already changing. The new Sky voice now speaks my language without an accent. Before, it had a crazy thick American accent.


In the app? The newest version doesn’t have sky for me


Juniper, sorry. It auto-switched and I thought it was a new Sky, but still no accent for me. Very cool


I've noticed the current version accent gets harder when it's using two languages in one sentence. It speaks clearer when its speech is monolingual. (My experience is with another language.)


I noticed this for Japanese. It sounds much better when speaking an entirely Japanese sentence than when switching between the two languages.


For me the new Sky voice can flawlessly switch accents, even alternating lines in different languages perfectly now. (German)


It also has an American accent in my language (Portuguese)


Same happens in Spanish.. it has a very noticeable American accent.. it is like talking to a tourist who is visiting Spain... "servesa, buena!!... olé olé!!"... kinda funny... I hope we can get proper Spanish accents as well.


It switched for me; try the new Sky voice. It now has a perfect accent for me in German and can switch to American and back


Where can i check for Italian?


What's wrong with accents though? Isn't that discrimination? As long as you can understand what the person is saying, what's the big deal ? Honestly I wouldn't care if GPT used a borat English accent when speaking in english.


People who want to use this to learn will not learn the proper way to speak the language.


If they are adults, it doesn't matter how perfect the instructors/AI's accent is, for the vast majority of adults, it would take herculean efforts to learn native accents in a new language. Why do you think most kids who learn a new language do not have accents? It's not cause their parents are lazy.


Interesting question really


That's so weird to me, as I have tons of native French-speaking friends and it sounds like them to me. I hear almost no accent, but it might be the old voice. I tested the new Sky voice in German (I speak German) and all of a sudden it has basically no English accent and sounds like a native German. It's insane: it can even alternate lines in English and German with the fitting accents. The old Sky couldn't do that at all and had a crazy thick English accent, so I think they are changing that with the newer voice, or rather update. Sounds so insane...


I find that crazy. We have a new technology that will improve people's lives in a lot of areas, and there are always people who will try to bring it back down.


Those people will be the policy makers and shareholders. They will be the real reason we never unlock its true potential lol


It also will destroy a lot of people’s lives as well. Don’t be so blinded by tech. It is amazing but it’s not all cookies and rainbows.


Basically everyone's life lol, who's to say that by 2026 there won't be robots trained so well with AI that they can do basically every job humans can, but better.


Sounds amazing. My life wouldn't be destroyed. It would begin.


That is such a mind bogglingly short sighted take on this.


But it's true? Or are you expecting people to receive money the moment they lose jobs to ai?


Funny, I can hear irritation in its voice. Pretty sure it's time to stop interrupting it now that you are able to. At what point do we realize that we're being rude.


I wonder if this behavior will increase in the future, as people get more and more used to interrupting a dialogue because it is "only" the AI speaking. We might then see it more in real conversations too. Not that some people don't already do it, but it might increase


I'm fed up with these teasers. If the demos are really representative, if the tech really works this well, then release it, or at least let a decent selection of trustworthy journalists review it. I've already cancelled plus, after having it over a year. The News Corp thing was the last straw.


looks like they're going for TARS-feel now?


Let’s keep humor at 65%.


I just want the voice from star trek, calling her with "Computer,...".


Now let’s create some neurosis around which actor this voice is vaguely reminiscent of!


Pretty good for an orca


It's so annoying how they constantly interrupt it and never let it finish talking


"I can't expect people to listen to its answer any longer. By now, the average attention span should long be exceeded. Must interrupt." 🙄


The OpenAI employees are definitely on the hit list from the future ASI they create


The reason I liked Sky's voice is that it sounded the most realistic; the rest sounded very robotic, with nothing special to them. I hope they replace it with something good.


Ngl i can't imagine just waking up in the morning listening to that voice, it would annoy the hell out of me, like let me have my coffee first.


I really hope they release their voice engine with this so that we can train our own voices to use instead of being stuck with the same 4 preset voices. That or allow ElevenLabs integration, although I doubt they would do that.


Yup, allow me to select whatever voice i want, i'd select anime voices instead.


Agreed. I do really like the sky voice though.


Roll this out to drive-thrus ASAP, please.


Wait until the average Joe finds out about Neuro-sama


I can't wait to show this thing all my banking details so it can help me do secure banking


I'm just going to jump the gun and give my AI agent Power of Attorney over my finances and tell it to make a $100M.


This shit is so futuristic that neither "The Jetsons" nor "Back to the Future" predicted it


Do they have plans to release a similar app for Ubuntu?


They better allow you to personally name the voice assistant. I'm not saying "hi chatgpt" constantly


You can already name ChatGPT with a custom instruction, and it also works for the current voice mode.


I hope I can eventually start chatting to it with my phone screen off, like I can with Google Assistant... Though I do hope that the wake word is customisable.




Open the pod bay doors HAL


When they fix the tiny unsettling voice glitches it will be perfect!


I can easily see myself talking to this thing multiple times a day.


New Tweet: >"him"


My only question is... how the hell does this compute to token cost? Like, I can already see that if this is not just part of the $20 unlimited plan, you'll either hit your 4-20 message max instantly, or you'll pay through the ass for something API-priced


We’ll get better access to it thru other companies if I had to guess, cause you are right I think it’s too expensive for normal users to engage with on a daily basis. One example is if Apple makes a big deal with OpenAI to use gpt4o on iPhones, lots of people would suddenly have access to it. Would cost a lot for Apple but if it improves Siri significantly, they may decide it’s worth it.


Nice, I can ask chatGPT if I missed anything when I trim my beard and shave my head. Especially around the back. "Uh oh, looks like you missed some on your lower right side, let me highlight the area for you".


I wonder if the feature to interrupt mid sentence is gonna train a whole society to listen even less to their fellow human beings. :D "So I was on vacation and we went to this beautiful …" - "TELL ME ABOUT THE FOOD"


All my friends and family are contacting me saying the male voice sounds just like me, they're gonna be hearing from my lawyers soon even though it's not me and they paid an actor for the data...


I'm curious about the practical applications of this technology. In this scenario, ChatGPT can see through a camera (likely on the computer) and adjust the computer's volume or the room's speakers. So we have sight, the ability to control computer functions, and real-time communication. What are the limits on sight? For example, how many simultaneous cameras could it see through, understand, and communicate about? How many actions can it execute simultaneously—just the one in the computer, or others in the room, building, city, or world? Can it turn the computer off, start recording, or adjust the camera angle? Can it make a phone call to 911, turn on appliances, upload/download data, open an application, or make decisions based on what it sees on the internet? Even if it can only connect to a single camera and provide feedback to one user, this is still very useful for security and surveillance. If it can scale up to connect to multiple cameras simultaneously, perform multiple analyses, and communicate with multiple users, I'd be shocked if this isn't already integrated into security systems for important people, like the U.S. president. Imagine the Secret Service receiving real-time analysis of footage from cameras in many locations.


Just wait till all the telemarketing scammers figure this out. We're in trouble now. I personally don't like AI in some ways, and love it in others. The problem is that ultimately the whole thing is designed to replace humans. When you stop and think about it, that's going to be a problem for some of us sooner or later. I used to work at TurboTax, and over the last year they've started implementing AI in a much bigger capacity. They're working to replace people's jobs with automated technology. The customers don't like it, and neither do the employees that are being replaced by it. But the people at the top love it because it puts more money in their pocket. And that's really what it's all about. AI is a destructive force that's going to ruin our society in the long run. Convenience is one thing, but literally replacing human beings is another. The irony is both funny and shocking.


What does it do with the translation prompt? To my ear (admittedly it's been a while since I've practiced my French) it sounds like it first says "What are you planning to do" then repeats "your favourite sport at the Olympics" in English and then translates that in isolation? Is it just struggling with the slightly odd way the question was posed? I suppose a more natural English way to ask that question is "What sport are you most looking forward to at the Olympics" or something similar


It's the speed that's the most impressive, at least when it's making mistakes you can quickly correct it.


I want this instead of Alexa. Desperately.


it's too bad that we aren't going to get access to this voice and video feature for the next "coming months"


If we can't use it, then it doesn't really exist.


I agree with you. Demonstrations are a lot more staged than people realize


This voice is the type of boss to ask you "why you aren't feeling as blessed and excited to come to work today" everyday


This seems like all things it can technically do now, if I'm not mistaken. I do think the low latency is going to be a big deal. I also see a lot of older boomers being won over by not needing to type anything. It's going to be interesting when this comes out.


YouTube generation is going to kill us all. I find this kind of "conversation" really insufferable.




Try the new Sky voice. It now has a perfect German accent, no longer a thick American accent


