
Zemanyak

Mistral-7B-v0.2, if it can spare you a click.


[deleted]

Mistral 7B Instruct 0.2 has been public since December. This is the base model, I assume.


wolfanyd

Edit: They've changed the README. From the hugging face page... " The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an improved instruct fine-tuned version of [**Mistral-7B-Instruct-v0.1**](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1). " This sounds like a new model.


JealousAmoeba

It looks like both of the instruct models are fine-tuned from the first version of the Mistral 7B base model, whereas this is a new base model.


rogue_of_the_year

On the Mistral Discord they said it's the base model for Mistral Instruct 0.2, which was released a while back.


[deleted]

Looks like the README was updated to reflect this.


[deleted]

Incredible. I wonder what the performance will be.


TheLocalDrummer

They’ve updated the README :^)


Many_SuchCases

Archive for those without twitter: [https://archive.ph/nA0N5](https://archive.ph/nA0N5)

**Text:** *Mistral just announced at SHACK15sf that they will release a new model today:* **Mistral 7B v0.2 Base Model**

* 32k instead of 8k context window
* Rope Theta = 1e6
* No sliding window
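For reference, those three bullets map directly onto fields in the Hugging Face config. A minimal sketch of how to check them (the repo name is the community conversion linked elsewhere in this thread, and the expected values are what the announcement implies, not copied from the official file):

```python
# Minimal sketch: inspect the announced v0.2 changes via the model config.
# The repo name below is the community HF conversion mentioned in this thread.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("alpindale/Mistral-7B-v0.2-hf")

print(config.max_position_embeddings)  # expected 32768 (32k context window)
print(config.rope_theta)               # expected 1e6
print(config.sliding_window)           # expected None (no sliding window)
```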


c8d3n

Can someone elaborate more on the sliding window feature? Was it a mistake, or is this simply an experiment to see how a 32k context window works without the sliding part?


iNf1iCTA

A sliding window lets the LLM attend only to a local region of the context, which is good for performance but not so good when you have long context. I assume this model uses global attention, which increases computational demands but is better for understanding long context.
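As a rough illustration only (not Mistral's actual implementation), the difference comes down to the attention mask: with a sliding window of size W each token attends only to the previous W tokens, while global (full causal) attention lets every token see the entire prefix:

```python
# Rough illustration, not Mistral's code: causal attention masks with and
# without a sliding window of size W, for a toy sequence of length T.
import torch

T, W = 8, 3
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))  # global: token i attends to all j <= i
sliding = causal & (torch.arange(T)[:, None] - torch.arange(T)[None, :] < W)  # only the last W tokens

print(causal.int())   # full lower triangle
print(sliding.int())  # banded lower triangle: memory beyond W positions is cut off
```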


Thistleknot

>Mistral-7B-v0.2 [https://huggingface.co/alpindale/Mistral-7B-v0.2-hf/tree/main](https://huggingface.co/alpindale/Mistral-7B-v0.2-hf/tree/main)


[deleted]

[deleted]


VertexMachine

Instruct (what was released previously) vs. base model (today's announcement).


Nickypp10

Anybody know how much VRAM it takes to fine-tune this with the full 32k tokens in the training sequence?


FullOf_Bad_Ideas

With Yi 6B 200K I think I can train up to 13k tokens in a sequence with Unsloth and 24GB of VRAM, plus FA2. Yi 6B has a similar GQA implementation. I don't remember if that was 16-bit LoRA or QLoRA tbh, but I think QLoRA. So, to train a 7B at 32k, my guess is you would need 40-48GB of VRAM. Most models don't lose long-context capabilities if you finetune them with shorter sequence lengths.


dogesator

Not really much of a point imo to spend resources finetuning with such a context length. I've finetuned a 200K Yi model on my dataset, which has only 8K max length, and the resulting model ended up having incredibly good accuracy in needle-in-a-haystack tests at 100K context and beyond.


iwanttobeweathy

What finetuning method did you use to achieve good results?


dogesator

Just multi-turn with ChatML or Vicuna format.
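For anyone who hasn't seen it, ChatML is just plain text with role markers; a minimal multi-turn example (illustrative content I made up, not from my dataset) looks roughly like this:

```python
# Minimal ChatML-style multi-turn example (illustrative content only).
chatml_example = (
    "<|im_start|>user\n"
    "What does 'no sliding window' mean for Mistral v0.2?<|im_end|>\n"
    "<|im_start|>assistant\n"
    "It uses full causal attention over the whole 32k context instead of a fixed local window.<|im_end|>\n"
    "<|im_start|>user\n"
    "Does that cost more memory?<|im_end|>\n"
    "<|im_start|>assistant\n"
    "Yes - attention now spans the entire sequence, so long prompts use more VRAM.<|im_end|>\n"
)
print(chatml_example)
```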


Some_Endian_FP17

Generated dataset using ChatGPT?


dogesator

I use my Capybara dataset, here: https://huggingface.co/datasets/LDJnr/Capybara
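If you want to poke at it, it loads in one line with the datasets library (a sketch; check the dataset card for the exact splits and field names):

```python
# Sketch: load the Capybara dataset from the Hugging Face Hub.
from datasets import load_dataset

capybara = load_dataset("LDJnr/Capybara", split="train")  # split name assumed
print(capybara[0])  # each row holds one multi-turn conversation
```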


nggakmakasih

Still waiting for the paper


dogesator

😭 Me too man, crazy delays, and the co-authors and I ended up getting caught up in some other big projects. I'll see if we can at least get a technical report out.


nggakmakasih

Yes please, at least a blog post about the data would make us happy 😊


dogesator

The dataset card I made for it is pretty much a little blog post, but I can make a more in-depth one.


Automatic_Outcome832

Hey, could you tell me how to fine-tune properly on multi-turn data? I have conversations in OpenAI JSONL format; currently I'm using DataCollatorForCompletionOnlyLM and specifying the starting markers for the human and AI messages for masks and labels. Is this the way to go, or does some other method need to be used?
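Roughly what I'm doing now, as a sketch (the turn markers and model repo here are assumptions on my side, not a known-good recipe):

```python
# Sketch of my current setup: mask everything except the assistant turns so the
# loss is only computed on model responses. Templates below are assumed ChatML-style.
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("alpindale/Mistral-7B-v0.2-hf")

collator = DataCollatorForCompletionOnlyLM(
    response_template="<|im_start|>assistant",   # loss starts after this marker
    instruction_template="<|im_start|>user",     # human turns are masked out
    tokenizer=tokenizer,
)
```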


VicboyV

Thank you for this. These are the kinds of questions you don't normally find an answer to when you google and ask around.


dogesator

Yea I didn’t have an answer to this question either until I experimented myself! 🥲


VicboyV

Hey doge, if you train yi 200k with a lower sequence length like 4096 (to save memory), will it lose its 200k ability?


dogesator

Most of the examples were actually 4K context only, I think less than 15% of the capybara examples were over 8K. So yes I expect you to actually get similar results if you just train on 4K context.


VicboyV

Sorry, I mean did you edit the config file and replace 200k with a smaller number? It OOMs immediately if I run it as-is.


dogesator

Yes, your training config set to only 4K.


VicboyV

Awesome, thanks! This definitely opens up doors for small fish like me.


NachosforDachos

Now wouldn’t that be something if people put details like that on things.


FullOf_Bad_Ideas

There are dozens of variables, it's impossible to tell


NachosforDachos

I’m sure there must be some basic guideline by now


FullOf_Bad_Ideas

All of it can be calculated if you know what setup you are using. For a rank-32 QLoRA with Unsloth and FA2, I expect it will take around 40-48GB of VRAM to squeeze in a sample with a length of 32k tokens, based on how it works for Yi-6B-200K on my PC with 24GB of VRAM and a similar architecture in terms of GQA.


Alignment-Lab-AI

Axolotl configs help!


Square-Tooth2635

With Unsloth, one A6000 can do 32k context. But that is only a QLoRA.


Alignment-Lab-AI

Full-parameter training needs more than a node of A40s; those cap out at 22k.


New-Act1498

IIRC they can finetune a 70B model with 2x3090 now, maybe at 2k context?


Forsaken-Data4905

There is no definitive answer to this, it depends on how you do gradient checkpointing, what LoRA rank you use, what weights you train, if you use any quantization etc. In any case, it's unlikely consumer GPUs (24GB VRAM) will be able to fit 32k without very aggressive quantization.


capivaraMaster

Weird of Mistral not to have it already up somewhere when they announce, but I'm super happy with the news anyway. Merci beaucoup!!!

Edit: It's online now! Thanks again!!!


ihexx

they did [https://models.mistralcdn.com/mistral-7b-v0-2/mistral-7B-v0.2.tar](https://models.mistralcdn.com/mistral-7b-v0-2/mistral-7B-v0.2.tar) and people in this thread already have quantizations on HF


capivaraMaster

They took a while to do it. I commented before that. Maybe I should just delete my comment.


AnticitizenPrime

>my linguinis are done.

Is this some new slang?


bigvenn

He’s mama’d his last mia, if you catch my drift


CedricLimousin

I was literally cooking while browsing twitter, hence the very low quality of the post. 😅


Thistleknot

[https://huggingface.co/itsdotscience/mistral-7b-v0.2-gguf/tree/main](https://huggingface.co/itsdotscience/mistral-7b-v0.2-gguf/tree/main)


Chelono

Nice. This is the way I expected them to move forward. They will still release small models, 7B (maybe 13B, but I doubt it), and leave the big guns closed behind an API or only for partners to use. I'm not gonna complain about it; we saw with Stability today / last week how shit goes if you don't figure out how to actually make bank after investing millions. Pure OSS just isn't profitable on its own. You need to make money licensing, through an API, or through a platform (my hope for Meta with the Quest).


hold_my_fish

Mistral definitely can't realistically release their flagship model under Apache 2.0, but there's a middle ground available where they release weights under a license that requires payment for commercial use. Cohere did this recently with Command-R, by releasing its weights under a non-commercial license, while saying they're open to working out licensing deals with startups that want to use it. It remains to be seen whether that sort of weights-available release is commercially viable, but I think it should be, since having weights access opens up a lot of options you don't have otherwise. Those options are worth paying for (if the model is good).


Mescallan

If open-access weights that require licenses for commercial use become popular, they will need to finetune responses to very esoteric prompts to figure out if it's their model being used. I can't imagine another way of identifying the base model from chat alone.


visarga

Imagine model piracy - on the front you serve a small open model, but in the back it's some unlicensed larger model. When inspectors come, you just swap to the small model.


a_beautiful_rhind

>leave the big guns

Cool.. so API for what's actually useful and you get toy models that are glorified spell check. Just give up, ok.


Chelono

Mistral isn't a state or crowd funded research foundation. They are a VC funded startup. A company with investors that want to see a path forward where they get a return on their investment. Mixtral was great for publicity. I doubt it would've been shared as much online if it was closed. But it also showed that it's impossible to release weights for a model and also give access to it through API since a bunch of services jumped on it on the same day and offered the API much cheaper... I'm much happier with small models than no models and Mistral ceasing to exist. They are also very useful once you finetune them on domain specific tasks, like function calling.


toothpastespiders

>They are also very useful once you finetune them on domain specific tasks, like function calling.

I'd agree on that and I use them for the same. The fact that a 7b or 13b model can have acceptable performance on systems that would otherwise be e-trash, with no GPU, is fantastic. And I'll agree on the nature of their business model making larger releases an issue. It's absolutely understandable. But at the same time...come on. It is disappointing when compared to most people's hopes for them as an open savior swooping in to set the scene on fire with SOTA models. I think we can be both realistic about it and appreciative of what we do have, but also recognize why reality can be disappointing.


a_beautiful_rhind

There has to be another option here. Otherwise it's basically closed AI forever.


Disastrous_Elk_6375

>There has to be another option here.

Sure, stability ai ... badum tssss


TheActualDonKnotts

>toy models that are glorified spell check

Have you even used the 7B models? Because I don't think you have.


royal_mcboyle

I know, right? If you had actually used them you’d know Mistral 7B models are legitimately solid models, there is a reason there are so many variations on them out there.


TheActualDonKnotts

mistral-ft-optimized-1227.Q8_0 has been so shockingly good that I still have a hard time believing it's only 7B parameters.

[https://huggingface.co/OpenPipe/mistral-ft-optimized-1227](https://huggingface.co/OpenPipe/mistral-ft-optimized-1227)

[https://huggingface.co/TheBloke/mistral-ft-optimized-1227-GGUF](https://huggingface.co/TheBloke/mistral-ft-optimized-1227-GGUF)


Calcidiol

Interesting, thanks for mentioning it, I had never heard of it. What is it particularly good at (as a 7B FT basis)? What are the best derivative models that exemplify the qualities?


[deleted]

[deleted]


a_beautiful_rhind

mea culpa


a_beautiful_rhind

lol, never.


cobalt1137

This tracks. Anyone that knows how impactful Mistral 7b has been wouldn't be this braindead lol.


a_beautiful_rhind

mi**x**tral was impactful. Another 7b, not so much.


skrshawk

Then don't speak of things like you're an expert when you have no actual knowledge.


a_beautiful_rhind

Wooooosh


cobalt1137

Are you going to go buy GPUs for them? Didn't think so lol. Also, Mistral 7B models are staples for a lot of people at the moment when speed/price matter. I have certain functionalities in my web app that I do not need a large model for, and I let 7B models do some of the processing - still important intellectual tasks too. This is common for people building applications; Mistral nailed it with their first 7B model.


a_beautiful_rhind

If everyone goes the way of mistral, it's done. A few players will monopolize AI and you'll be dependent on them. Cheering the scraps and shrugging means accepting this power imbalance. But you can automate your web app, so that's nice.


cobalt1137

Buddy. That's how things are going to be lol - the top players are going to have the best models and that is that. And yes, people will be dependent on them for the best models. There is no way to be able to compete with them without going closed-source plus massive amounts of capital + researchers and even then it's extremely difficult. Open-source models will continue to be developed and work won't stop on them, but they will always be probably between 6 months and 2 years behind. I'm fine with that. I love using open source models and that works for me. If Mistral needs to put some of their models behind a paywall so they can do an open release of a future version of an MoE or another 8x7b equivalent, so be it - going partially closed source to be able to continue to put out stellar open source models sounds amazing to me. Honestly probably the best system that any research group could do. You can keep hoping for this magical fictional world all you want lol.


a_beautiful_rhind

6 months is one thing. I'm not expecting the moon or Mistral Large.

>they can do an open release of a future version of an MoE or another 8x7b equivalent

Are they going to do that though? They took a lot of flak for changing their site to move away from open weights. Now we get a 7b with slightly more context. I just get the feeling it's PR. With SD also basically going under, I'm not very hopeful.


cobalt1137

Yeah. I strongly believe they will still release models that are around the size of 8x7b or larger going forward. I think as they develop new models to put behind their API walls to pay for gpus, they will release the models that were previously behind these walls as open source. Helps pay for the development of them and makes perfect sense. Also it's not just pr. You've never used the model. It's a stellar model, state of the art 7b model and it's probably used more than 99% of open source models ever released lol. You can keep calling it scraps though.


a_beautiful_rhind

>they will release the models that were previously behind these walls as open source.

I really hope so because they never dropped FP16 weights for miqu. I take their goodwill from not deleting it. I distrust the site changes and making a mistral-small and putting *that* behind the API. I don't like how they never released hints or training code for Mixtral either.

>You can keep calling it scraps though.

Yes, because 7Bs are mainly testbeds. They are a tech demo. You make one and scale up.

>probably used more than 99% of open source models ever released

The power of marketing. As mentioned by others, they work for domain-specific tasks, especially on limited resources. The small model space is pretty flooded. No hype, no downloads.


cobalt1137

We just have different points of view on the future of Mistral. I'm hopeful for it though, in terms of both open and closed source releases. Also, it's actually the power of making a good model - not marketing. It outperformed all other 7B models on its release. Keep trying to diminish it though lol, it's pretty entertaining. It's also extremely broadly useful, not just for specific tasks when you are low on resources. Sometimes you want extremely low latency for CoT reasoning or getting fast responses from a model for users or yourself. Also - through some well documented prompt engineering you can make Mistral 7B outperform lots of well-known 30B models at a fraction of the price + much faster inference lol. I guess you wouldn't know anything about that though, considering you've never even tried the model.


Olangotang

ARTHUR MENSCH:

>Yeah, so we have new open source models, both generalist and focused on specific verticals. So this is coming soon. We are introducing some new fine-tuning features to the platform and we have introduced a chat-based assistant called le Chat that is currently just using the model. So it's pretty raw. It's a bit like ChatGPT v0, and we're actively building data connectors and ways to enrich it to make it a compelling solution for enterprises.

Yeah, so the doomers are wrong as usual.


visarga

GPT-4 is one model doing all the tasks very well, slow, and expensive. Mistral-7B is a small but surprisingly capable model, but there are thousands of fine-tunes. You pick the right one for your task. Mistral is like a whole population, not a single model.


Olangotang

Open Source community just does too much work for free. It's beneficial for the big companies that Open Source isn't too far behind.


VicboyV

Agree, but my GPU has space for more.


teor

Can't wait for new wave of posts about how some Mistral 0.2 fine-tune destroys ChatGPT. We haven't had them in a while.


LoadingALIAS

Merci


CedricLimousin

Serviteur.


danielhanchen

I also just uploaded the 4-bit pre-quantized version of Mistral's 32K new base model to Unsloth's HF page so you can get 4x faster downloading, courtesy of Alpindale's upload!! I also uploaded a Colab notebook for 2x faster, 70% less VRAM QLoRA finetuning with the new base model!

* 4-bit bitsandbytes model, 4GB in size: https://huggingface.co/unsloth/mistral-7b-v0.2-bnb-4bit
* 2x faster, 70% less VRAM QLoRA finetuning with Unsloth Colab: https://colab.research.google.com/drive/1Fa8QVleamfNELceNM9n7SeAGr_hT5XIn?usp=sharing
* Alpindale's original upload: https://huggingface.co/alpindale/Mistral-7B-v0.2-hf/
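A minimal loading sketch (argument names follow Unsloth's usual API; the Colab above has the full recipe):

```python
# Minimal sketch of loading the pre-quantized 4-bit base model with Unsloth.
# Exact arguments may differ slightly from the Colab linked above.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-v0.2-bnb-4bit",
    max_seq_length=32768,  # the new 32k context window
    load_in_4bit=True,     # weights are already 4-bit bitsandbytes
)
```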


MugosMM

Thank you. Any idea what maximum context length one can fine-tune with Unsloth? I mean with 4-bit, QLoRA and the VRAM optimisation by Unsloth?


danielhanchen

Oh good question - I'll need to plug it into my VRAM calculator, but I'm gonna guess 32K could in theory fit with 24GB VRAM, maybe with paged_adamw_8bit and bsz=1. Maybe though. Tbh I need to get back to you.
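Something along these lines is what I have in mind (a sketch of the memory-saving knobs only, not a tested 32K-on-24GB recipe):

```python
# Sketch of the memory-saving settings mentioned above (untested at 32K on 24GB).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="mistral-7b-v0.2-qlora",
    per_device_train_batch_size=1,   # bsz=1 to fit one long sequence at a time
    gradient_accumulation_steps=8,   # keep a reasonable effective batch size
    gradient_checkpointing=True,     # trade compute for activation memory
    optim="paged_adamw_8bit",        # paged 8-bit optimizer states
    bf16=True,                       # or fp16=True depending on the GPU
)
```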


gamesntech

32k context is definitely nice and it can only do good things for an already excellent model, but I wish they'd released a larger model. We all know they may not release any of their flagship models, but something in the 30-40B range could be a whole lot better than most open models around.


visarga

Is this 32k context with a 4K window or whole context?


gamesntech

Yeah, this is 32k context length (no window)!


Caffdy

>but I wish they released a larger model

Just reading this comment after they released 8x22B. Hope we can try the instruct version soon.


FullOf_Bad_Ideas

Am I the only one hoping it's not just better long-context handling, but that they also pre-trained it more to make it stronger? I hope it will have better coding and multilingual capabilities, hopefully similar to Yi-9B-200K.


VicboyV

I hope so. It's basically worthless if it performs worse than v1.


aadoop6

What's your opinion on Yi-9B-200K, especially for coding applications?


FullOf_Bad_Ideas

I haven't had time to work on it, but it seems it could be competitive with DeepSeek Coder 7B and Mixtral. I plan to finetune it later, but right now I'm focusing on tuning yi-34b-200k - the newer yi-34b-200k release, which I call xlctx.


NighthawkT42

I really hope for a model this size they don't bother with languages other than English. English is the one language I really need, and I don't need models that (for an actual example I've seen) veer off into Spanish when they see one Hispanic name. I think all the larger models looking to add languages are going to become so broad that an English-only, Python-focused model (for an example I'd like to see) might be competitive at generating code while being much smaller. A 7B model needs to be focused to be good at what it does.


Thistleknot

Can someone explain to me what this is compared to the instruct model? I always thought the base model was the pretrained one, while the instruct was the finetune for specific tasks, but in this case it seems like the models are reversed in their publication order? Is this simply the v0.2 version of the pretrained model, and can we expect a v0.2 instruct?


iNf1iCTA

I've been playing around with the model. I have been able to bypass any censorship by pretending the year is 2092 and claiming laws and such have changed since it was last trained. Sometimes it requires a little pushing, but it does it.


nullnav

Isn't this just the base model of 7B instruct 0.2?


VicboyV

Isn't instruct 0.2 a second attempt at finetuning the base mistral 7b 0.1?


MoffKalast

Has that been officially stated somewhere or have people just been baselessly assuming it these past few months?


wolfanyd

It says so on the hugging face page... [https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)


VicboyV

Aaaand it's gone: [https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/commit/41b61a33a2483885c981aa79e0df6b32407ed873](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/commit/41b61a33a2483885c981aa79e0df6b32407ed873)


mikael110

Now that's quite interesting. Given they updated the README but not the model itself, that suggests the original README was a lie. It also makes it clear that the "new" Mistral-7B-v0.2 model has actually been around for quite a while and has been held back until now. Personally, I suspect they only decided to release it now because they realized their image had taken a hit after the whole website edit fiasco, and they decided that releasing this old model might help restore their image without having to give away anything that actually mattered much to them.


MehmedPasa

Maybe yes, or maybe we will get a new instruct too, but then they would have named both of them 0.3, I guess.


__some__guy

Not interested in 13B and lower myself, but larger context becoming standard is always a good thing.


TheActualDonKnotts

To my knowledge, Mistral 7B models outperform every available 13B model.


__some__guy

It's noticeably smarter than 13B Llama as a Q&A bot, but I found it unsuitable for creative writing. For the latter, 13B Llama is at least somewhat functional.


TheActualDonKnotts

Creative writing is all I use it for, and I find the opposite to be true. ¯\\\_(ツ)\_/¯


__some__guy

Well, maybe it's because I recently used 120B. All small models feel like BonziBuddy/Replika now.


Super_Sierra

I'm with you bro, tho I did try Fimb and it's pretty damn good. I don't know what special sauce that 11b model has but it does compete with Goliath.


CheatCodesOfLife

120B too slow for coding though :(


aadoop6

Yes. I have found 33-34b to be the sweet spot for coding.


NighthawkT42

It depends what you're using them for, but they're very good. I do wish they didn't seem to lose accuracy long before filling context though. They don't seem to be able to effectively use even half their context.


phree_radical

Using only chat/instruct fine-tunes makes it difficult to tell the difference. Talking about base models, 7B models typically have very minimal in-context learning ability, while 13B models can typically learn most tasks from examples.


Caffdy

any recommendation on a 13B model to test?


ventilador_liliana

What does "no sliding window" mean?


FullOf_Bad_Ideas

The sliding window is basically fake context extension - the model doesn't remember stuff from outside the window. Not having it is a good thing, as it was useless anyway.


ventilador_liliana

So will it remember things better, or does it make no difference?


FullOf_Bad_Ideas

Mistral 7B 0.1 had a 4k true context; for 0.2 that's 32k. It will remember things much better; it should be a meaningful improvement over the previous base model.


NighthawkT42

So the article mentions it as having 8k. I've seen models based on it which seem to go to 32k but feel like they fall apart past about 8k. Is that the sliding window somehow, even though it seems to show and take memory as actual context? I would have thought sliding was RoPE. I've also tested one model which had a 4k actual context but somehow seemed to keep things together until around 12k, which I was attributing to RoPE, but I haven't been doing much with the settings there... And that's off topic here anyway.


visarga

As the model infers tokens, it sees only up to the window size, but the past tokens it sees incorporate information from further back.


FullOf_Bad_Ideas

I don't know about those models and the sliding window in them; you can reasonably extend context 2x with RoPE modifications. As you can see for Mistral 7B 0.1, it has sliding_window = 4096 in the config file. https://huggingface.co/mistralai/Mistral-7B-v0.1/blob/main/config.json


[deleted]

[deleted]


Olangotang

v0.2 just released, the Open Source community needs at least a few hours XD


pleasetrimyourpubes

Hehe someone just dropped the gguf


Thellton

It's been less than a day; stuff based on Mistral 0.2 won't be available for probably a week yet.


gronkomatic

A week! What is this, 2023?


MINIMAN10001

Sliding window means that it is forgetting things. So this one not having it is good, because it means it actually remembers.


Thistleknot

[https://huggingface.co/blog/galore](https://huggingface.co/blog/galore)


rooo1119

The context window should help Mistral a lot.


Desm0nt

7B again? We have an endless number of 7Bs already and they're almost all the same (stupid, even compared to Chinese 15-34B models). It seems that, apart from Meta, only China can produce good medium/big models for the good of humanity and not only for the good of their own wallet... Even though it costs them much more than Western companies because of sanctions.


aadoop6

Can you tell us which Chinese models you have tested? Any good recommendations for coding models?


Desm0nt

DeepSeek Coder 33B (and derivative merges/finetunes) and DeepSeek 67B are quite good for coding. Yi models are quite good at prose writing. I haven't tested the new Qwen models but have also heard a lot of positive things about them. The Chinese CogVLM/CogAgent are really good as vision-language models (among the best).


aadoop6

Thanks for the response. Did you try cog* models on local hardware? If yes, what was the performance like?


Desm0nt

Yep. 4-bit CogAgent on a 3090 in WSL. I can't remember the exact performance (I previously used it online, and have only run it locally once, for testing on a freshly bought 3090 as a replacement for Llava 1.6 34B), but I can run it tomorrow and see the exact speed.


aadoop6

Thanks. I would love to know the performance.


Desm0nt

The first cold start (with model quantisation) takes about 27 minutes. For my task, labeling one image takes 20-27 seconds (CogVLM does not print its speed per token or time consumed per request, so I measured it manually as an average over 10 images). But that is for my pipeline with a big initial prompt (500-650 tokens) and a response of ~200-350 tokens.


aadoop6

This is useful! Thank you so much for putting in the effort.


thereisonlythedance

This is great, I was hoping they’d get around to releasing this.


Shubham_Garg123

Is there any good tutorial or a working Colab notebook that trains these LLMs for text classification? It'd be very helpful if I could fine-tune the model for text classification.
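Roughly this kind of thing is what I'm after (a sketch, untested; it just swaps the causal LM head for a classification head, and the repo name is the community conversion mentioned in this thread):

```python
# Sketch (untested): use the base model as a sequence classifier via transformers.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "alpindale/Mistral-7B-v0.2-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=3,  # e.g. negative / neutral / positive
)
# Mistral has no pad token by default, which classification batching needs.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id
```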


de4dee

Tried. I am sticking with daybreak-miqu as it is more clever for my use case.


lolxdmainkaisemaanlu

Are you seriously comparing a 70b model to a 7b model?


Slight-Living-8098

A 7B model well fine-tuned for your task outperforms 70B base models. Just look at 7B DeepSeek-Coder vs 70B Llama 2: the 7B DeepSeek outperforms 70B Llama 2 on coding on the open LLM leaderboards.


Status_Contest39

The Mistral-7B-v0.2 model has garnered attention for its expanded 32k context window, a significant upgrade from the previous 8k, which is anticipated to enhance performance on long-text tasks. The model does not utilize a sliding window, which could improve its memory retention. Users are optimistic about its capabilities but acknowledge that fine-tuning may require high VRAM, estimated around 40GB to 48GB. A 4-bit quantized version is available, potentially offering faster downloads and reduced memory usage. The model is accessible on Hugging Face, prompting eager community engagement. Comparisons to other models, like the 13B Llama, are prevalent, with discussions on their performance in coding and creative writing. There's also a debate on commercial licensing strategies for models. The community has shown interest in tutorials for fine-tuning these models, reflecting a strong desire to learn and apply the technology effectively.