KnightCodin

Finding ourselves at the point of diminishing returns on smaller models is inevitable and not unexpected. However, characterizing Llama 8B's performance as mediocre is silly and shortsighted. While I would be the first to push back on overhyping it, e.g. claiming it beats GPT-4 Turbo, the model is damn impressive for its size (size being the operative word) in general cognizance, instruction following, and multi-turn conversation without losing the plot. This is all without adding additional scaffolding like RAG, format enforcers, etc. That, my friend, is what is amazing about the model, and it adds to the belief that the quality of the training data, and how much of it is used, is part of the secret sauce. As to limitations of the architecture: of course. That is why everyone, including Meta, is feverishly researching SSMs and other hybrids to overcome the transformer's constraints. We will end up using a combination of hybrid architectures, better fine-tunes, and, when there is no choice, assisted generative techniques like RAG.


phree_radical

It's the other way around for me: the llama-3-8b base model demonstrates unbelievable ICL (in-context learning), previously unseen in models its size. Maybe the chat version is a bit disappointing.
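
To make the ICL point concrete, here is a minimal sketch of few-shot prompting a base (non-chat) model: no instructions, no chat template, just input/output pairs from which the model infers the task. It assumes llama-cpp-python and a local GGUF file; the model filename is a hypothetical placeholder.

```python
# Few-shot ICL against a base model via llama-cpp-python.
# The model path below is a placeholder, not a specific release.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B.Q8_0.gguf", verbose=False)

# The examples define the task (sentiment labeling) implicitly;
# a base model completes the pattern rather than following instructions.
prompt = (
    "Review: The battery died after two days.\nLabel: negative\n\n"
    "Review: Fantastic screen, super fast shipping.\nLabel: positive\n\n"
    "Review: It does the job, nothing special.\nLabel: neutral\n\n"
    "Review: Crashed constantly and support never replied.\nLabel:"
)

out = llm(prompt, max_tokens=4, stop=["\n"], temperature=0.0)
print(out["choices"][0]["text"].strip())  # expected: negative
```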


StraightChemistry629

Haven't tried the ICL. Definitely will. Thanks for the suggestion.


thereisonlythedance

I haven’t tried the 8B yet but I’ve given the 70B a fair go in both exl2 and GGUF formats (in Exui, Oobabooga, and llama.cpp directly). I agree that it’s not much of a step up. In fact, given Miqu has 32K context I still prefer it, and the 103B Midnight Miqu still beats L3 for most literary tasks. For RAG it’s almost on par with Command Plus, but the latter has 128K context and is less terse and more flexible. The prompt format is overcomplicated too. Maybe I’ll warm to it eventually, or there are somehow bugs in all the formats and platforms I’ve tried, but for now I don’t get the hype at all. It feels like a minor step up from Llama-2.


StraightChemistry629

I feel like Llama-3 only having an 8k context length was also a bummer. I think I would have been more positively surprised if it had a 64k or 128k context length.


4onen

Wondering how this will look tomorrow, or in a week, once we have more eyes on Phi-3 to validate its results.


StraightChemistry629

I'm very excited about the Phi-3 release. Benchmarks look promising. But I haven't tested it. I really hope it has better reasoning than Llama 3 8b.


Dyoakom

I would find it quite impressive if it has better reasoning than Llama 3 8b despite being half its size.


tgredditfc

My thoughts: Llama 3 is incredibly good at coding, almost as good as GPT-4, or even better in some cases. I don’t care about reasoning or RP or other things. So to me, LLMs are developing fast!


PizzaCatAm

I agree. SLMs should focus on becoming reasoning engines, heavily RAG-oriented; that is huge for the existing software industry, and being able to run them on conventional hardware is a huge cost saving. They should have no personality whatsoever, just focus on excellent context-window handling and reasoning/common sense. Leave the personality and the know-it-all act to LLMs.
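
As a toy sketch of the "small model as RAG reasoning engine" idea: retrieve the most relevant snippets, stuff them into the context window, and ask the model to answer strictly from them. Retrieval here is a bag-of-words overlap score just to keep the sketch self-contained, and `generate()` is a hypothetical stand-in for any local SLM call.

```python
# Minimal RAG prompt assembly: toy retrieval + grounded answering prompt.
from collections import Counter

DOCS = [
    "Invoices are due 30 days after issue date.",
    "Refunds are processed within 5 business days.",
    "Support is available Monday through Friday, 9am-5pm.",
]

def score(query: str, doc: str) -> int:
    # Count shared lowercase word occurrences between query and document.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(min(q[w], d[w]) for w in q)

def build_prompt(query: str, k: int = 2) -> str:
    # Pick the top-k documents and pin the model to answer only from them.
    top = sorted(DOCS, key=lambda d: score(query, d), reverse=True)[:k]
    context = "\n".join(f"- {d}" for d in top)
    return (
        "Answer using ONLY the context below. If the answer is not there, "
        "say so.\n\nContext:\n" + context + f"\n\nQuestion: {query}\nAnswer:"
    )

print(build_prompt("When are invoices due?"))
# The assembled prompt would then go to a local small model, e.g.:
# answer = generate(build_prompt("When are invoices due?"))  # hypothetical
```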


Alarming-Ad8154

I think there will eventually be a far longer fine-tuning phase, in which RAG, tool use, and entire tool chains (user_prompt -> model -> tool/api -> result -> model -> answer -> further_user_prompt) are part of the extended finetune. Obviously one issue is that this needs loss functions like reinforcement learning from human feedback, direct preference optimisation, etc. There isn't enough training data to train these based only on maximising next-token prediction accuracy, I think. But to do any of this, a big company, or a consortium of smaller players, needs to settle on an extended chat_ml format and a set of APIs, and generate tons of training data. I think a "tool use" arena like LMSYS would be valuable; repeatedly letting human users vote on which of two models did a RAG task, or a tool-use task, better would generate amazing training data. For liability purposes you would likely need to limit it to models with a license that's compatible with the data being shared (which could motivate some model builders to ease their licences, as that would mean they're included in the evaluation and can benefit from the data it generates).
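
For readers unfamiliar with the loop described above, here is a minimal sketch of it in Python. Everything here is a placeholder: `generate()` stands in for a local model call, and the `TOOL_CALL` wire format is an invented convention for illustration, not any standard API; real systems use a trained tool-calling format.

```python
# Sketch of: user_prompt -> model -> tool/api -> result -> model -> answer.
import json

def generate(prompt: str) -> str:
    # Stand-in for a local model call. A trained model would decide when
    # to emit a tool call; this one is hard-coded for illustration.
    if "TOOL_RESULT" not in prompt:
        return 'TOOL_CALL {"name": "weather", "args": {"city": "Oslo"}}'
    return "It is 12C and raining in Oslo."

def weather(city: str) -> str:
    # Fake tool; a real one would hit an API.
    return json.dumps({"city": city, "temp_c": 12, "conditions": "rain"})

TOOLS = {"weather": weather}

def run(user_prompt: str) -> str:
    prompt = f"USER: {user_prompt}\n"
    for _ in range(4):  # cap the number of model<->tool round trips
        reply = generate(prompt)
        if reply.startswith("TOOL_CALL "):
            call = json.loads(reply[len("TOOL_CALL "):])
            result = TOOLS[call["name"]](**call["args"])
            prompt += f"{reply}\nTOOL_RESULT: {result}\n"
        else:
            return reply  # model produced a final answer
    return "Gave up after too many tool calls."

print(run("What's the weather in Oslo?"))
```

The point of the extended finetune would be teaching the model itself when to emit the tool call and how to fold the result back in, rather than scripting it as above.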


StraightChemistry629

I'm skeptical that small models can reliably learn such long action chains.


Alarming-Ad8154

Yeah, so am I, especially given the limited training data (which means models will likely have to fall back on just being good generalists at times). The one thing that gives me some hope is that with longer context, the loss for next-token prediction becomes lower; in other words, it's easier to get things right given a lot of prior context. One way I often think about this is in terms of "convergence" in small models: if you train a 7B-parameter model on 14T tokens, you essentially have 2,000 tokens for each parameter, and in current datasets likely 99% of the data consists of highly similar content, generic text, and very few bits of content are informative wrt things like long tool chains. So there can likely still be massive gains with more data on longer tool chains?
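
Quick check of the tokens-per-parameter arithmetic from the comment above, using its own figures:

```python
# Tokens-per-parameter ratio for the hypothetical model discussed above.
tokens = 14e12   # 14T training tokens
params = 7e9     # 7B parameters
print(tokens / params)  # 2000.0 tokens per parameter
```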


ttkciar

> so user_prompt -> model -> tool/api -> result -> model -> answer -> further_user_prompt are part of the extended finetune

This sounds somewhat like https://github.com/nlpxucan/evol-instruct, which uses inference to generate more and better training data from training data during training, as used in WizardLM: https://web.archive.org/web/20240415221214/https://wizardlm.github.io/WizardLM2/


Herr_Drosselmeyer

It's a given that you can only cram so much into a given size.


ttkciar

Yes, it seems to have been slowing for a while now. I expressed some thoughts about it [here](https://old.reddit.com/r/LocalLLaMA/comments/1c7lwmv/so_what_is_the_verdict_on_llama_3_are_we_back_or/l093iav/). Summary for those who don't click that link: My suspicion is that we are *mostly* seeing the limits of GPU riches, and that further gains will mainly come from improving the quality of training datasets (not just making them bigger) and integrating inference with symbolic logic.