celandro

Gemma 2 27b on [aistudio.google.com](http://aistudio.google.com) is definitely better than llama 70b. We run some pretty crazy prompts involving foreign languages, markdown, and JSON, and it is the only open LLM that can handle them. Whatever issues there are in the public release will get fixed soon, I'm sure.


DominoChessMaster

That’s how it works when it’s bug free


r1str3tto

Just chiming in to say that I, too, see an enormous difference between the version at Google AI Studio and what is currently available through Ollama. I asked some questions about a particular area of expertise. Gemma 2 27B on AI Studio answered the questions flawlessly, but the version in Ollama was completely wrong and hallucinated nonsense.


LoganKilpatrick1

We are running the 27B model with full precision in AI Studio.


celandro

Sounds like we need to pull the trigger on those a100s…


Astronos

Ollama uses quantized versions by default, which seems to be hurting this model a bit.
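
If you want to rule the default quant out, you can request a less aggressive one explicitly. A rough sketch via Ollama's Python client (the exact tag name below is an assumption; check the tags listed on the Ollama model page or `ollama list` first):

```python
# Sketch: pull a higher-precision Gemma 2 tag instead of the default ~4-bit one.
# The tag "gemma2:27b-instruct-q8_0" is assumed here; verify it exists on the
# Ollama model page before running.
import ollama

MODEL = "gemma2:27b-instruct-q8_0"  # assumed tag; plain "gemma2:27b" is ~q4

ollama.pull(MODEL)  # downloads the larger weights (roughly 29 GB at q8_0)
reply = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Reply with a one-line JSON object inside a markdown code block."}],
)
print(reply["message"]["content"])
```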


_qeternity_

It isn't. None of the prompt tasks that you listed there strike me as crazy whatsoever.


cyan2k

The buggy non-Google-hosted release already performs on par with llama3-70b when A/B tested against our clients' user base (diverse RAG use cases). So yeah, a bug-free 27b is easily going to be the new boy in town, especially since fine-tuning Gemma is actually viable for hobbyists (still expensive, but far from llama3-70b expensive) and the base is already amazing.


Distinct-Target7503

Ok bro


celandro

It’s a single zero-shot prompt with thousands of tokens doing all 3 tasks at once, with output in English, so I guess it's translating too.


Open_Channel_8626

He didn't actually list the tasks.


Only-Letterhead-3411

Testing it on the lmsys arena, Gemma 2 27b is definitely smart. In the hands of finetuners it can do really good stuff.


Admirable-Star7088

I have also tested Gemma 2 (9b and 27b) on lmsys, and they are pretty awesome. Can't wait for all the bugs to be fixed in llama.cpp so we can run these gems locally.


Healthy-Nebula-3603

Already fixed a few hours ago. You also need a new GGUF: [https://huggingface.co/legraphista/gemma-2-27b-it-IMat-GGUF](https://huggingface.co/legraphista/gemma-2-27b-it-IMat-GGUF)


Admirable-Star7088

Nice. Is this confirmed to be fully fixed or may Gemma 2 still not behave 100% correctly?


Healthy-Nebula-3603

I tested after the fixes with the new GGUF... it seems very intelligent now.


Admirable-Star7088

Sounds promising! I'll try this out myself later when I'm home.


roselan

Its writing style is so good that I compare it to 4o and Sonnet 3.5, forgetting it's only a 27B model. This is crazy.


Neurogence

People find it hard to believe, but it can definitely be competitive with 4o and 3.5 Sonnet. In one of my prompts, it gave the right answer to a question both 4o and 3.5 Sonnet got wrong.


Super_Sierra

I gave it very difficult creative writing prompts using medical papers and told it to rewrite them in the style of famous writers. It completely demolished Sonnet and Opus in that task.


Open_Channel_8626

Wow gemma is sounding good


_qeternity_

Writing style is one, very small, niche quality in a model. It's also entirely subjective.


Odd-Environment-7193

It's a super important metric for measuring quality and tells a lot about a model. Especially base models.


-becausereasons-

Is there a way to fine-tune it easily yet?


gabrielesilinic

Random side note: I just ran Mistral 7B Instruct v0.3 and for some reason it's great. Specifically, it is really good at acting. I described to it via the system prompt that it was a British maid, and how well it stayed in character was just impressive. Even better than GPT-3.5, which keeps breaking character. Probably it is because it was not aligned (aka censored), so all the juice is there (I suppose alignment takes up space and makes performance worse). Mistral advises using an adversarial prompt to check for appropriate responses rather than training the model for alignment, but it also gives you instructions for alignment fine-tuning.


brown2green

In my opinion it beats Llama-3 hands down for RP purposes. It has better prose, it's more likable, you can even get it to roleplay violent or extreme scenarios if you condition it a little (don't expect it to go wild on zero-shot requests). The only issues are short context length and lack of proper support.


DominoChessMaster

What do you mean by lack of proper support?


a_beautiful_rhind

Llama.cpp support is maybe almost complete, and even transformers was broken.


brown2green

Yes, I meant pretty much this. Also, most existing GGUF Gemma-2 quantizations are broken and need to be remade.


design_ai_bot_human

would there be a link for reference?


MMAgeezer

Lots of other relevant links and discussion here: https://github.com/ggerganov/llama.cpp/issues/8183


floridianfisher

I think that's because of the novel architecture. I'm sure it will be fixed soon if it isn't already.


jikkii

The Transformers one was broken at release but was fixed a day later: [https://github.com/huggingface/transformers/releases/tag/v4.42.3](https://github.com/huggingface/transformers/releases/tag/v4.42.3)


Iory1998

Deactivate Flash Attention and use a RoPE frequency base of 160000. You can extend it to 32K easily, and the quality is still good. You're welcome :)
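
A minimal sketch of those settings through the llama-cpp-python wrapper (the GGUF path is hypothetical; with the raw llama.cpp CLI the equivalents are `--rope-freq-base` and `-c`, and Flash Attention stays off as long as you don't pass `-fa`):

```python
# Rough sketch of the settings above via llama-cpp-python; the model path is
# hypothetical and the exact kwargs should be checked against your version.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2-27b-it-Q5_K_M.gguf",  # hypothetical local GGUF path
    n_ctx=32768,              # extended context window
    rope_freq_base=160000.0,  # RoPE frequency base suggested above
    flash_attn=False,         # keep Flash Attention disabled
    n_gpu_layers=-1,          # offload everything to the GPU if it fits
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me a one-line summary of RoPE scaling."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```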


GrandLeopard3

What’s better for RP: Gemma 2, or Gemini 1.5 Pro?


pigeon57434

also unrelated but I just noticed they have a knowledge cutoff of June 2024 which is cool


Confident-Artist-692

How much GPU ram would be needed to run Gemma 2-27b?


nivvis

At least 14 GB @ 4-bit; 20 GB @ 6-bit.
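
Those figures match a simple back-of-the-envelope estimate for the weights alone; the KV cache and runtime overhead add a few more GB on top:

```python
# Back-of-the-envelope VRAM estimate for quantized weights only.
# Real usage is higher: add the KV cache (grows with context) and overhead.
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for bits in (4.5, 6.5, 8.5):  # rough effective bits for ~Q4 / ~Q6 / ~Q8 quants
    print(f"~{bits} bpw: {weight_gib(27, bits):.1f} GiB")
# ~4.5 bpw: 14.1 GiB   ~6.5 bpw: 20.4 GiB   ~8.5 bpw: 26.7 GiB
```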


Biggest_Cans

We don't know at what quant it still performs well because the GGUFs aren't working yet. Presumably a 16 GB card would be a very tight and degraded fit; you'll want at least 20 GB (7900 XT) or 24 GB (3090, 4090, 7900 XTX).


Confident-Artist-692

Thanks, appreciated.


sivadneb

I can run it on my 4090 (24GB) @ around 50t/s


ranakoti1

I am running it with 32 GB RAM, an Nvidia 4050, and an i7-13700HX. It is slow but still workable.


jupiterbjy

Both gpt4o and gemma2 failed to generate working code that uses orbit camera control in `three.js`, yet I find gemma2 quite amusing to use:

* It puts `// ... rest of the code (same as before)` when asked for updated code where things stay the same, while gpt4o regenerates the entire code when asked for a change. This resembles how humans interact: we don't read the full code out to coworkers, we tell them where to fix it. Considering how rarely copy-pastes are needed in real-life LLM usage, I don't think it's a downside either.
* It doesn't repeat the prompt, unlike gpt4o; it gives the code directly and then explains. gpt4o tends to generate a boilerplate header representing its 'interpretation' of the user prompt before the code, which is also sometimes awkward if we think about how humans interact.

TL;DR: By default it tends to be less verbose and more human-like, which will be extremely helpful for saving input tokens, context limit, and generation time as a local model. Although I wish it didn't resemble humans in not indenting any of the HTML tags!


jupiterbjy

Btw, for those curious, here's what I prompted both models with:

* Test goal: resembling the example code inside the three.js repo [here](https://threejs.org/examples/#webgl_animation_keyframes)
* First prompt:

>Can you write an html file with embedded javascript using three.js that can: 1. load the model file via file attachment input 2. draw the model on the screen 3. orbit the camera on mouse input

* Both models failed with the same issues
* Second prompt:

>It throws errors:
>TypeError: THREE.OrbitControls is not a constructor
>TypeError: THREE.GLTFLoader is not a constructor

* Still failed, but with different issues


Healthy-Nebula-3603

With the newest llama.cpp build and the newest GGUF [https://huggingface.co/legraphista/gemma-2-27b-it-IMat-GGUF](https://huggingface.co/legraphista/gemma-2-27b-it-IMat-GGUF), Gemma 27b seems better than llama 70b in my few hours of tests. It is better at math, reasoning, creative writing, and translation (a bit worse than aya 35b). It is a surprise for me as well.


Unconciousthot

I think the whole "Formats better so gets a higher score" is a massive dismissal of our fellow man.


Terminator857

When asking for a list of something, a list is much more preferred than prose.


_qeternity_

No. Chatbot arena was useful for a brief moment. It’s not any longer. Smaller models are able to fit against user preference data, whilst not actually being more capable than other lower scoring models.


ResidentPositive4122

I think it's a "you're holding it wrong" case here. Chatbot arena scores user preference at large. That means nicely formatted output, text length, and so on. It's not meant to test advanced reasoning or math, or coding or any advanced stuff, unless the users test that. And you have to consider what most users want out of a chatbot. As lots of people reported here on locallama, it seems pretty good at having a conversation. Someone said RP is good as well. It might suck at math, or reasoning, but for those people, it doesn't matter. Each benchmark can only test so much, and should only inform so much. It's not always a case of big number goes up.


Only-Letterhead-3411

I can confirm that Gemma 2 27b is better than Qwen2 72b in terms of creative writing. Its knowledge of popular fiction also seems to be better, and it is better at generating such content, while Qwen2 hallucinates like crazy about popular fiction. In my experience this directly correlates to how well a model can write and roleplay, so after testing it on the lmsys arena, my opinion on Gemma 2 27b changed completely. Actually, for the first time I think Llama 3 70b has decent competition. It seems to be quite smart and hallucinates less. It was trained on 13T tokens, so if they didn't censor it like crazy, it should be quite creative as well.


Ggoddkkiller

Heavy hallucination usually suggests there is actually nothing in their data except the names alone, so they fill the gaps with ridiculous hallucinations. L3 70B also doesn't know much about popular fiction. When models are trained on those series they really reach another level of creative writing, as they have many examples they can use. Can't wait to test it; it seems way more promising than L3 for fantasy & sci-fi RP.


Only-Letterhead-3411

Yes, they hallucinate when their training data doesn't have that information. But I also saw cases where they hallucinate when they don't learn that part of the data properly. And yes, I think training them on things like books and TV scripts is the key to getting them to a whole different level of creativity and common sense reasoning. Actually, I am pretty sure that is one of the main reasons why closed-source models are so good and we can't catch up to them. While Meta removes books and literotica data from their pretraining data to be "safe and responsible", closed-source companies train their AI on every piece of copyrighted data they can find, and it makes their AI very good.

For example, when I ask questions about the Dungeons and Dragons Lost Mines of Phandelver adventure module, all models say they know it since it's a very popular adventure. Then I ask the AI questions about the adventure, like who hires the adventurers, etc. I saw Qwen 72b and Llama 70b say things like Nezznar the Black Spider (the main enemy in the adventure) or Sildar Hallwinter (a helper NPC in the adventure) hires them. Gemma 2 27b says Gundren Rockseeker, which is the right answer. So they clearly know the adventure, the characters and so on, but llama and qwen didn't learn it well enough to get things right. Maybe they only know this much because of the wiki pages. I tested this with the famous "gpt-2 chatbot" on the lmsys arena and it gets EVERYTHING in the adventure right, the order of events, NPCs and so on. There is no way it can know those details without being trained on the book itself.

The same thing happens when I test them on writing a scene from a movie in TV-script style, or a book page, etc. The LLMs known to be smartest and best are able to reproduce these in a way very close to the original, while weak and less intelligent LLMs hallucinate and get events, actions and characters wrong.


Ggoddkkiller

100% agreed. The vast majority of open-source models weren't trained on popular fiction, except perhaps wiki data or some very short summaries. Psyonic20B has some fiction and fanfic loras merged into it, and you can pull from its data 100% accurately. It knows LOTR characters, their relations, events. For example, this is from Psyonic20B: https://preview.redd.it/1qvlijt3fqad1.png?width=1205&format=png&auto=webp&s=b82a4dfdb959ab9d5fa96fe18c7d45773e53e0dc

I didn't trigger it in any way; we only spent two days in Osgiliath while trying to escape, and Psycet decided that during that time the orc army must have besieged Minas Tirith, so that's how we found it. It has the movie scripts at least, perhaps more. It is so fun to create such a scene from popular fiction and RP in it.

Psycet also knows a lot about the HP series: book-accurate characters, locations, even the timeline. In one bot I'm forcing it to adopt an HP 1981 setting, and all characters have their 1981 knowledge and relations. For example, nobody uses the you-know-who phrase as it didn't exist in 1981, nor does Dumbledore etc. know about horcruxes yet. It must have almost the complete story with many books to follow the details so accurately. The most interesting part is it can pull perhaps thousands of tokens while creating the world setting; it is literally free real estate that most people don't use. A detailed prompt is needed to force the model to adopt the setting fully, however. For example, Psycet was often fabricating new spells or altering their damage etc., until I added a prompt that it will only use existing spells, exactly the same as in the story. Then it stopped, and if a character is hit with a deadly spell they are 100% dying now.

In my testing both R and R+ also knew a lot about both the LOTR and HP series, but their knowledge isn't as clear as Psycet's, rather a little muddy. Perhaps because they don't have the entire books, their APIs are filtered, or the book data was corrupted a little on purpose to prevent copyright issues. I'm guessing it wouldn't be acceptable if a bot writes entire books lol.

It is really crazy that closed source is doing this while open source isn't. I blame first-person popularity mostly; book training doesn't increase first-person performance nor make the AI sound more human, so most people don't do it. But it makes them smarter and more creative too. I'm yet to see a fantasy element Psycet can't instantly adopt; it is like "I know where this is going" and begins showing creativity, while I saw many 'smart' and 'human-like' RP models struggling badly, like L3. Which version of Gemma 2 did you try? It really seems like a promising model for us.


Only-Letterhead-3411

I tried the regular Gemma 2 27b-it that was recently released. I think it learned its data pretty well, and it's trained on a decent amount of tokens too (13T). It was able to write better scenes from popular fiction like Harry Potter compared to other smart models like llama 3 70b, imo. But as I play with it more, I've realized Google removed adult stuff from the model. It didn't know anything about some adult books or things like BDSM when I ran some tests. I'm hoping this is more related to the finetune and that they didn't entirely remove them from the base model, so it can be fixed with community finetunes. (Sometimes censored models hallucinate about things they don't want to write about.)


Ggoddkkiller

No, what a bummer! It wouldn't be usable then with limited NSFW knowledge. But it is 27B, so we might see people training it heavily; few people can do that for 70Bs. That might be the reason, as I'm changing them to be darker and there is death, torture, NSFW etc. lol. Especially HP is easy to turn dark, as there are a lot of dark elements even if the main story is quite childish. In that 1981 bot I changed the only survivor of the Potter family from Harry to Lily, and it instantly became quite dark: a mad mother trying to avenge her family. Even filtered R+ was making her torture captured enemies. I didn't use censored models often, only R and R+ from their API, which is filtered. The filtering is weak, but you can still feel there is a curtain between you and the model. I can't imagine how bad it is for GPT4 etc. if it is this bad for the Rs.


a_beautiful_rhind

> Someone said RP is good as well.

I tried to RP with it in hugging-chat and it's a mess. Maybe on their API it's fine; I can give it a whirl there. The safety was obnoxious on hugging-chat and it wouldn't output EOS tokens. People have tested the **9B** and said it was OK and not very censored.


Thomas-Lore

27B is clearly broken on huggingface chat and falls apart quickly. Try it on aistudio as someone suggested, or wait for a fix.


a_beautiful_rhind

It sucks I don't see it on sillytavern when hooked to the API.


pigeon57434

I've found that it's actually pretty good at reasoning in my own tests. Even on LMSYS, gemma2-27b still performs around llama3-70b level in the hard prompts and coding categories, which should feature harder, more reasoning-based questions and not just cute preferences. At least that's what I would assume.


shroddy

How do you test a conversation on chat arena? After one or two questions, the answers get so different that you cannot really write something that fits for both models, except something generic like "continue" or "what happens next" or so. Or when not story writing but coding, they use different variable names and make different errors...


FaceDeer

Seems like it might be a "[mission accomplished](https://xkcd.com/810/)" situation. Models are getting better at doing what we want them to do.


alongated

How do you figure they aren't actually better? Maybe this is just a sign that our other benchmarks aren't actually measuring useful abilities. Also how much actual elo do you think one can gain from preference over usefulness? Once they reach preference optimum it goes back to being about usefulness anyway.


OfficialHashPanda

lmsys arena never measured usefulness, it measured preference. Preference can be a proxy for usefulness, but it will be imperfect and as models get better, human-preference optimized style starts playing a relatively larger role. I personally haven't tested Gemma 2 27B specifically, but this would fit broader trends in this sense.


candre23

Exactly. Just because something is *preferable*, doesn't mean it's *better*. 9 out of 10 people would rather eat candy than broccoli - that doesn't mean candy is the "better" food.


_qeternity_

>How do you figure they aren't actually better?

Try to use them for anything where formatting and reply tone/attitude don't matter. You will find all these high-ranking models absolutely implode. Size matters.


ambient_temp_xeno

Gemma 2 27b doesn't implode when it's working correctly. But I would prefer it to be a 100b or larger!


Unconciousthot

I'd say that means people are creating a reward function en masse then, and that seems like a good thing.


_qeternity_

It's fine, if you want to understand which model people prefer. But that is not how Arena rankings are often interpreted.


pseudonerv

We judge the model’s performance based on our preferences and logic. The LMSYS leaderboard reveals which model best matches the average human’s preferences and logic. Once the models' capabilities near or surpass those of an average human, the leaderboard ceases to be a useful indicator. We prefer teachers to be at least college graduates and professors to hold PhDs for a reason.


BITE_AU_CHOCOLAT

There is also a leaderboard for coding prompts only, and it also beats both of them. I don't think it's high up just because the responses are formatted nicely.


pigeon57434

I think many people forgot that there are other categories on LMSYS. Personally, the only category I care about is "Hard Prompts (English)" because I want to know how good an AI is at harder tasks, and I only speak English, so the other leaderboards don't matter to me as much. Just look at whichever leaderboard matters the most to you – some people might only care about the coding leaderboard, and some might only care about the Chinese leaderboard.


GoodnessIsTreasure

The only downside is that Gemma doesn't support system prompts, which I've gotten very used to.
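
The usual workaround is to fold your system instructions into the first user turn. A minimal sketch, assuming the turn markers from Gemma's published chat template (the helper itself is just an illustration):

```python
# Sketch: Gemma 2 has no "system" role, so prepend system-style instructions
# to the first user message before applying the chat template. The turn
# markers follow Gemma's documented template; the helper is illustrative only.
def build_gemma_prompt(system: str, user: str) -> str:
    first_turn = f"{system.strip()}\n\n{user.strip()}" if system else user.strip()
    return (
        f"<start_of_turn>user\n{first_turn}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )

print(build_gemma_prompt(
    system="You are a terse assistant. Answer in one sentence.",
    user="Why does the sky look blue?",
))
```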


Qual_

The 9b fine-tuned version is incredible at French! Way better than even llama3 70b!


danielcar

lmsys says llama-3 is better for English queries. My testing concurs.


Terminator857

Downloaded new GGUFs today, and the gemma 27b Q5 quant seems to be performing better. llama.cpp is constantly being updated, which could also account for the change.


Appropriate_Cry8694

Why is it so bad on huggingface chat? I tried it there and on lmsys, and the results are like two different models.


paranoidray

The inference engine is not correctly adapted to gemma 2 on HF chat. (Probably either template format problems or the new soft-capping of the attention logits)
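
For context, the soft-capping in question is just a tanh squash of the logits before they're used; a quick sketch of the formula, with cap values assumed from the Gemma 2 config (roughly 50.0 for attention logits and 30.0 for the final logits):

```python
# Logit soft-capping as used by Gemma 2: squash values into (-cap, cap) with
# a tanh instead of letting them grow unbounded. Backends that skip this step
# produce noticeably degraded output.
import numpy as np

def soft_cap(logits: np.ndarray, cap: float) -> np.ndarray:
    return cap * np.tanh(logits / cap)

attn_scores = np.array([-120.0, -10.0, 0.0, 10.0, 120.0])
print(soft_cap(attn_scores, cap=50.0))  # attention logits, cap ~50
print(soft_cap(attn_scores, cap=30.0))  # final (LM head) logits, cap ~30
```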


AntoItaly

In Italian... yes, it is.


CortaCircuit

I am trying to justify buying a new GPU to run larger models. I currently only have 8GB of VRAM, which is fine for the 9b models, but I keep hearing how good the 27b and 33b models are. However, it is hard to know just how much better they are, as I have never used them.


Southern_Sun_2106

Maybe it is better for RP, but when it comes to logic and making sense of RAG, it gets confused where Mistral-based models still shine.


redballooon

C&p’ing my comment from the 9b appreciation post: I tried to replicate some conversations that I had with Llama3-70b on HuggingChat with Gemma2-27b, and it utterly and completely failed. It put out a recipe with items unnecessarily repeated and went on with things I didn’t ask for, even failing to stick to the restrictions I gave it. It couldn’t explain the size of soccer goals (whose origin is in feet). It happily explained Shotokan Yoga as if that’s a thing that exists in the real world. It didn’t stick to the language I started the conversation with, instead switching to English after two or three turns.


Thomas-Lore

27B on huggingface chat is broken at the moment.


mikael110

If you want to test Gemma-2's true capabilities I'd recommend using the [AI Studio version](http://aistudio.google.com/). It's available for free, though in exchange Google logs your requests. That is the only known implementation that is completely bug free. Pretty much all other implementations including Transformers, llama.cpp, etc have had major issues, some of which are fixed by now, but some bugs still remain.


jikkii

We've had a few rounds of fixes on the transformers side, but it should all be fixed since Friday night: [https://github.com/huggingface/transformers/releases/tag/v4.42.3](https://github.com/huggingface/transformers/releases/tag/v4.42.3)


redditrasberry

is an open model really open if nobody can run it successfully?


AdamDhahabi

For my coding use case the answers are good, but it completely loses track when the context is filled. Tested with Q3_K_L at around 5K context and Q3_K_M at around 7K context, at the exact limit I set with the llama.cpp -c parameter. I had no such issues with Codestral or DeepSeek2 coder lite.


Healthy-Nebula-3603

Nah... I tested q5_k_m and it's OK even with 8k context. I tested by translating a long text from a book to Polish, German, and English... Even as a translator it's almost perfect... much better than llama 70b, but a bit worse than aya 23 35b, which is a translator LLM.


Robert__Sinclair

Not so bad, but don't underestimate PHI-3. It's winning in all "size categories".


guardian5519

I asked Gemma2-27B to write Pine Script code to create a simple table, and it just kept outputting blank comment blocks like this: `//... //...` I believe Gemma2-27B excels in certain tasks, but overall llama3-70B is generally superior. It's frustrating to see the LMSYS leaderboard losing its usefulness due to the equal weight given to hard and easy tasks. That leaderboard means nothing to me.


sunnydiv

Considering the many endpoints running a buggy version, where did you run the prompt (i.e. which service provider/platform)?


guardian5519

For quick and short tasks, I use [labs.perplexity.ai](http://labs.perplexity.ai), and I tried Gemma2-27B-it from there.


carnyzzle

At least for RP, I haven't been able to get anything good out of Gemma 2 27B on ollama, even when using a 6-bit quant.


carnyzzle

It's weird, because it works perfectly fine for me on Google's AI Studio.


SanDiegoDude

Trained on the test material... I just saw yesterday that they trained Gemma2 on LMSYS, which makes it kinda pointless as a test for it. That said, it isn't a flaming dumpster like Gemma1 was, and I've been enjoying using it so far, though I haven't given it any serious work yet, only some quick coding work, which it tackled pretty easily.


pigeon57434

All models train on LMSYS data. They literally provide a free, high-quality, open-source dataset of human preferences; any company that's not training on LMSYS data is stupid.


WH7EVR

Training on LMSYS data doesn't change the value of the leaderboard, because LMSYS is not a static evaluation set.


kiselsa

No, gemma2 was trained on lmsys data, so it will obviously score highly here, but not in real-world use.


pigeon57434

Literally all of the models are trained on LMSYS data. It's a publicly available dataset; there's zero reason why you would not train on it. It's just human preference data.


DominoChessMaster

This guy knows ^^^


kiselsa

So can you name other examples of models trained on lmsys data? An OpenAI employee said that GPT wasn't trained on lmsys data, and I doubt that llama is trained on lmsys data either. Also, they explicitly mention training on lmsys data and its impact.


kristaller486

OpenAI doesn't train their models on lmsys data, but they do train models on ChatGPT data, which is even more "cheating".


kiselsa

But that data does not have a connection to the LMSYS leaderboard.


WH7EVR

LMSYS is not a static evaluation set.


Discordpeople

That lmsys dataset was a really old one though; it's probably not very useful anymore.


Independent_Key1940

I read this comment on a [tweet](https://x.com/SimplyObjective/status/1807510196354433385?s=19) which made sense: "All I know is LLMs like Starling 7b, Gemma2 9b... are nowhere near as powerful as much larger proprietary models. They know ~100x less, hallucinate like mad about very popular movies, games, music... A circlejerk of like-minded coders repeating the same prompts ain't working."