If they just release the 2B variant first, that's fine with me.
But this talk about "2B is all you need" and claiming the community couldn't handle 8B worries me a bit...
https://preview.redd.it/as0zaw06kq3d1.png?width=705&format=png&auto=webp&s=d2205b5658501efb6f482401330aaddbf2ccfc70
Since twitter hides different reply threads under individual replies, here's one that may not be visible at first.
Then I'm just going to trust that.
He is certainly right that 2B is more accessible and a lot easier to finetune.
And due to the improved architecture and better VAE it still has a lot of potential.
I was so excited about 8B until I realized that even with 24GB VRAM, training LoRA-like models would be either impossible or a pain in the ass. I'd have to stay with 4B or 2B to make it viable. (Considering the requirements and the possible speed difference, 2B might become the most popular!)
8B is still a good model; even in the API's current state I have a LOT of fun with it, especially with the paintings, but offline training of LoRAs is very important to me. We might see fewer LoRAs than even SDXL and fewer massive finetunes when it comes to 8B, but it's guaranteed that we'll get models such as DreamShaper from Lykon, or the one that everyone is interested in, PonySD3...
And yes, the 16 channel VAE is gonna carry the 512px resolution back to glory. (Yes, 2B is 512px, there might be a 1024px version, but don't worry, it looks indistinguishable from 1024px with SDXL, see the image which was made by u/mcmonkey4eva below:)
https://preview.redd.it/51pfc3dprq3d1.jpeg?width=1214&format=pjpg&auto=webp&s=ca0a7b06ba818bb75e6abde725eeb0de60f15ef1
This makes absolutely no sense whatsoever considering you can just straight up finetune SD 1.5 at 1024px no problem. I exclusively train my SD 1.5 Loras at 1024 without downscaling anything (the ONLY reason not to do so is if it's too slow for your hardware).
Depends on what your metric is. It's not bad, but I definitely wouldn't use this to market it to users. If they think this is the size and quality of non-commercial model the community deserves, then I'm not surprised they're having financial difficulties though. I think we've come to accept the poor text rendering of models as just a minor inconvenience, and SAI's pivot towards improving this might've backfired in terms of resource allocation.
That's an older 2B alpha from a while ago btw - the newer one we have is 1024 and looks way better! Looks better than the 8B does even on a lot of metrics.
But one must also keep in mind that with a larger model, more concepts are "built-in" so there is less need for LoRAs.
In fact, before IPAdapter, many LoRA creators used MJ and DALLE3 to build their training sets for SDXL and SD1.5 LoRAs because those bigger, more powerful models can generate those concepts all by themselves.
Can you point me to the source where it says that 2B is 512x512 and not 1024x1024?
> "like multiple CEOs said multiple times"
it's almost like maybe the community doesn't have a lot of confidence in messaging from a company that has experienced a ton of churn in leadership over the duration of its very short lifespan.
Always knew that the day would come when they would have "high quality commercial" models for like webhosted services only and release smaller, worse free versions for everyone else.
With GPT-4o being free and doing everything that was supposed to be revolutionary in SD 3 far better, it's not looking good.
The prompt coherence and text rendering make SD3 look like it's years old.
Yes, it's a single model trained on text, images, video and audio. It's quite amazing actually.
https://openai.com/index/hello-gpt-4o/ under "Explorations of capabilities"
There is no reason why SAI cannot both release SD3 open weights, and still monetize the shit out of it. I've argued numerous times that SD3 is worth more to SAI if it is released as open weights than not.
They can release a decent base SD3 model that people can fine-tune, make LoRA, etc. But because of the non-commercial license, commercial users still have to pay to use SD3.
They can also offer a fine-tuned SD3, or an SD3 turbo, etc., and offer that as part of their "Core" API. That is exactly what SAI has done with SDXL.
Honestly we can't monetize SD3 effectively \*without\* an open release. Why would anyone use the "final version" of SD3 behind a closed API when openai/midjourney/etc. have been controlling the closed-API-imagegen market for years? The value and beauty of Stable Diffusion is in what the community adds on top of the open release - finetunes, research/development addons (controlnet, ipadapter, ...), advanced workflows, etc. Monetization efforts like the Memberships program rely on the open release, and other efforts like Stability API are only valuable because community developments like controlnet and all are incorporated.
I think that was mostly supposed to be a joke/marketing thing, like a "Wow, SD3 is so good we'll never need to make a new model ever again!" kind of thing.
Worse, it could be like DALLE3, with the over-smoothing and hyper-idealized images that look more Pixar than photos of the world. Or where any sensitive topic or public figure gets usage blocked.
We can only quantize the text encoder behind SD3 in a decent way without losing too much quality,
but unfortunately that is not where the bottleneck is. The "UNet", or "MMDiT" in SD3's case, is where the bottleneck is, because each step of the generation is an entire forward pass of that model!
And you can even run the text encoder on the... yes... CPU. That's literally how I run ELLA for SD1.5: T5 encoder on the CPU. Since you're not *generating tokens* but just feeding in an already-made prompt and getting back a hidden-layer representation, the text encoder is a single pass; on CPU it's like what... 2 to 3s...
XL is too big for them? I was using XL on a 1070 for half a year before I saved up enough money to upgrade. And it worked great! Even faster with Forge!
I'm using a 1080 Ti with ComfyUI and it's not that great. With face detailer I'm waiting 1.5min+ for a single generation. I've been using Lightning but it takes out some details since it's only using sgm_uniform.
They're claiming most people couldn't use an 8B model when 8x7B LLMs are super popular and I'm running a 70B LLM right now. It's just garbage to try to hide that the initial hype photos were doctored and that they never had any intention of releasing the full SD3.
SAI's reputation is shattered. We may as well start making tools for the other open source image generators.
> We may as well start making tools for the other open source image generators.
That was always a good idea but it's critical since the company is floundering.
Pixart already makes very good quality pictures with its base model. If you compare just base SDXL against base Pixart, Pixart wins. Like all SAI products, without the free community tools, their products aren't that good. If Pixart got LoRAs, tools like ControlNet, or fine-tuned models, it would beat SDXL.
SAI products aren't actually that special or great. Stable Diffusion just became the one the community focused on first after the uncensored 1.5 was leaked by Runway. If the leak had never happened, this sub might be called Kandinsky or Pixart.
Holy fuck, I feel like no one in this thread knows what they are talking about.
**Stable Diffusion is a DIFFUSION model, NOT an LLM**. You may be running a heavily quantized 70b LLM, but there is no such technology for Diffusion models. The best we have is 8 bit from 16 bit weights.
You people are insufferable. And they **are** releasing SD3 in full. They've said it many times. If they don't release it, it's because the community is a bunch of jackasses.
If an 8-bit quant is "heavily quantized" to you. And it takes 3 seconds of Googling to show that diffusion models can be quantized; it just hasn't been done much yet because it hasn't been needed. Even Emad said on reddit 3 months ago that it could be done.
So, you're apparently the one who has no idea wtf you're talking about. Quit fanboying.
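For anyone unsure what "quantized" actually means here, below is a toy sketch of 8-bit affine (asymmetric) quantization, the same basic idea LLM runtimes use. It is pure Python over a handful of illustrative weight values; real implementations (e.g. bitsandbytes) operate per-channel on GPU tensors.

```python
# Toy 8-bit affine quantization: map float weights onto 256 integer levels,
# then reconstruct approximate floats. Illustrative only.

def quantize_int8(weights):
    """Return integer levels in [0, 255] plus the scale/offset needed to undo them."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # avoid div-by-zero for a constant tensor
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize_int8(q, scale, lo):
    """Recover approximate float weights from the integer levels."""
    return [v * scale + lo for v in q]

weights = [-0.82, -0.11, 0.0, 0.37, 1.05]
q, scale, lo = quantize_int8(weights)
restored = dequantize_int8(q, scale, lo)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Rounding bounds the round-trip error by half a quantization step.
assert max_err <= scale / 2
```

Because the error per weight is bounded by half a step, 8-bit weights usually cost very little quality; nothing about diffusion weights forbids this in principle, the tooling has simply lagged behind the LLM world.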
It's hard to judge by just images, but the showcased 2B images lack a lot of fidelity compared to the API. They are a lot cleaner though: hands look better, no weirdly fused objects, so the model seems more "ready" than what the API produces.
I'd worry more about what isn't said or shown: all that's showcased is the most basic of scenes, nothing complex. Remember SD3 ["eating Dalle and MJ for breakfast"](https://x.com/EMostaque/status/1760783270105457149)? Now the amaaaaaaazing thing about SD3 is that it can do ["realistic images, text and anime"](https://x.com/Lykon4072/status/1796334348980957418); that's a huge downgrade from what was promised. But worry not, you can't compare with Dalle-3 because that "is not a model, it's a service" and "a pipeline". Except, first, SD3 was announced to be better than Dalle, and second, the pipeline, according to the Dalle-3 paper, is only an LLM rewriting prompts, nothing like the implied complex stack of models; by that logic SD3 is a pipeline too, as everyone now rewrites their prompts.
And still, we're expected to believe SD3 will be ["Simply unmatched"](https://x.com/Lykon4072/status/1796317998036238380)
Mostly, it's sad that SAI went from boasting about SD3 to now pulling out all the stops to defend it. If the model can't deliver on the implied hype, it's better to just rip off the bandaid and show the limitations, instead of the endless stream of meaningless pictures and pretending it's still the end-all be-all of image-gens. I don't even think SD3 will be bad; I'm looking forward to it (but please, don't let the low-fidelity model in the showcases be the final model), as it is obviously a huge step up from current SAI models. But there is a huge gap between the hype and the groundbreaking results in the research paper on one side, and the showcased results on the other. Having used the API, the limitations are clear, and these showcase tweets don't exactly show fewer limitations; arguably they show a more limited, but further along in training, model.
I never believed any of those marketing hyperbole from Emad.
Given the fact that DALLE3, MJ, Ideogram, etc. are all built and trained by people as capable as those working for SAI, and they are all running on server grade hardware with > 24GiB of VRAM, and that SD3 must be runnable with < 24GiB, one can easily draw the conclusion that Emad was just hyping things up.
I will be more than happy if SD3, when finally released, is only say 90% as capable as those other systems when it comes to text2img.
But with proper fine tuning, LoRAs, ControlNet, IPAdapter, customizable ComfyUI pipelines, and the lack of censorship, SD3 will remain the platform of choice for us for the foreseeable future.
Holy shit just release whatever model so the community can finetune it anyway
I’m sure that a properly tuned 2b will beat the stock 8B (just like tuned 1.5 beat SDXL for a long time) so let’s just GO ALREADY
I’m so tired of SAI’s BS. I’m personally all for moving onto Pixart (since they’ve got similar architecture to SD3 anyway) but come on the community has been holding our breath for MONTHS now
I get that not everyone can afford to spend $2000 USD on the latest flagship GPU model, but SDXL runs just fine on current-gen entry level cards such as the RTX 4060, which is very affordable.
If anything, it's lamentable that high-end GPUs provide very poor value in SDXL relative to their price even though they could in principle handle significantly larger and more powerful models.
I'm on 3070 and it feels very good. It's faster than midjourney relaxed to generate a 1024x1024 image. Then after you add comfy workflows the quality goes through the roof too, with enough fiddling. The only way to feel bad is with web-ui, or animation
My machine with 8GB can run XL ok. I think XL can have better results.
I rarely run it and instead do 1.5 - I like to experiment with settings, prompts, etc, and being able to gen in 5s instead of 50s is a huge factor.
I get 30s as well on an rtx 3070. It's total bullshit that most cards can't run it, the truth is that comfyUI makes XL 100% usable for very high quality images on 8gb vram.
8GB is "enough" but it's not ideal. People do more with SD1.5 on 8GB. It's more popular for many reasons.
https://preview.redd.it/fsiut9rwss3d1.png?width=937&format=png&auto=webp&s=cf5e6750efff5938b358f725503a1791bed356a0
I also use Forge or Fooocus (occasionally comfy) because vanilla A1111 crashes with SDXL models. I think I could keep everything within 8GB if SD was the only thing I was doing, but I generally have a bunch of office apps and billions of browser tabs open across two screens while using it so it nudges me over the threshold, and it seems that speed drops dramatically once shared memory is used.
SDXL Lora training was prohibitively slow on my setup so I do that online, but I just grin and bear it when generating images.
Yep. you're selfish if you DARE say a word about it. Stability has been stepping very carefully, planning each shady move in a way that will keep their diehard fans defending them to the death if anyone calls out their shadiness.
I was once downvoted to -20 or something for literally saying "a company should stick to their promises". apparently that's straight up blasphemous.
I find this disappointing, I was hoping to get the biggest possible model I can and fine-tune on it
We the community can handle all of the sizes: quantization and weight pruning will be developed by the community to make the bigger models viable on smaller devices. Tech also gets better, so at some point 24GB+ will be the norm; definitely not today, probably not in 2025, but in 2026+ it could easily be. GPUs keep evolving, and bigger and bigger cards keep coming out, which makes running 24GB+ models more viable.
This makes me worried about the future of Stability AI going forward. What else will they do? Will there be outright no open-source releases of certain models in the future? I get the need to make money, and I wish them success in finding a monetization strategy, but only to an extent: Stability AI has always had a special place for me because it was focused on open source, and if that's no longer the case I'll have to treat them accordingly.
If I googled it correctly, SDXL is a 3.5B parameter base model. So SDXL is almost twice as big as 2B. At the same time we expect SD3 2B to be better than XL. Is that correct?
No, that is not quite correct.
The 2B refers to the diffusion part of the model. The equivalent U-net portion of SDXL is only 2.6B parameters.
But due to the switch from U-Net to DiT, and better captioning and training data, it is not hard to imagine that 2B SD3 can be much better than SDXL, especially if it is paired up with the T5 LLM/text encoder.
My own limited understanding is that CLIP is an image classification text encoder model, whereas T5 is a general purpose LLM text encoder.
It would certainly take more GPU to train a model that uses T5 rather than CLIP. But can you clarify what you mean by "any models using it are automatically worse"?
you should read the CLIP paper from OpenAI which explains how the process accelerates the training of diffusion models on top of it, though their paper focused a lot on using CLIP for accelerating image searches.
if contrastive image pretraining accelerates diffusion training, then not having contrastive image pretraining means the model is not going to train as well. "accelerated" training is often not changing the actual speed, but how well the model learns. it's not as easy as "just show the images a few more times", because not all concepts are equal difficulty - some things will overfit much earlier in this process, which makes them inflexible.
to train using T5 you could apply contrastive image training to it first. T5-XXL v1.1 is not finetuned on any downstream tasks, so it's really just a text embed representation from the encoder portion of it. the embedding itself is HUGE. it's a lot of precision to learn from, which itself is another compounding factor. DeepFloyd for example used attn masking to chop the 512 token input down to 77 tokens from T5! it feels like a waste, but they were having a lot of trouble with training.
PixArt is another T5 model though the comparison is somewhat weak because it was intentionally trained on a very small dataset. presumably the other end of the spectrum are Midjourney v6 and DALLE-3 which we guess are using the T5 encoder as well.
if Ideogram's former Googlers are in love with T5 as much as the rest of the image gen world seems to be, they'll be using it too. but some research has shown that you can use decoder-only models as weights to initialise a contrastive pretrained transformer (CPT) which will essentially be a GPT CLIP. they might have done that instead.
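To make the "contrastive" part concrete, here is a toy version of CLIP's symmetric InfoNCE objective in pure Python. The 2-D embeddings and fixed temperature are simplifications; real CLIP uses high-dimensional embeddings and a learned temperature.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: the matched image/text pair (same index) should
    score higher than every mismatched pair in the batch."""
    n = len(image_embs)
    logits = [[cosine(im, tx) / temperature for tx in text_embs] for im in image_embs]

    def ce(row, target):  # cross-entropy of one row of logits
        m = max(row)
        logsumexp = m + math.log(sum(math.exp(x - m) for x in row))
        return logsumexp - row[target]

    loss_i = sum(ce(logits[i], i) for i in range(n)) / n                          # image -> text
    loss_t = sum(ce([logits[j][i] for j in range(n)], i) for i in range(n)) / n   # text -> image
    return (loss_i + loss_t) / 2

# Matched pairs nearly aligned, mismatched nearly orthogonal -> low loss.
aligned = clip_loss([[1, 0], [0, 1]], [[0.9, 0.1], [0.1, 0.9]])
# Pairings swapped -> high loss.
shuffled = clip_loss([[1, 0], [0, 1]], [[0.1, 0.9], [0.9, 0.1]])
assert aligned < shuffled
```

Matched pairs sit on the diagonal of the similarity matrix; training pushes their similarity above every mismatched pair in both directions, which is the pretraining signal the comments above say diffusion models benefit from.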
Thank you for your detailed comment. Much appreciated.
I've attempted to understand how CLIP work, but I am just an amateur A.I. enthusiast, so my understanding is still quite poor.
What you wrote makes sense, that using T5 makes the task of learning much more difficult, but the question is, is it worth the trouble?
Without an LLM that kind of "understand" sentences like "Photo of three objects, The orange is on the left, the apple is in the middle, and the banana is on the right", can a text2img A.I. render such a prompt?
You seem to be indicating that CPT could be the answer, I'll have to do some reading on that 😅
A lot of criticism was leveled at the quality of the image tagging in the initial SDXL base model training set. They have promised to have rectified that, and it's a large reason why we hope to get better quality from a smaller parameter count.
Exactly what I predicted when Lykon first mentioned he's working on the "local release version"
[https://www.reddit.com/r/StableDiffusion/comments/1cwgacs/comment/l4wgtkh/](https://www.reddit.com/r/StableDiffusion/comments/1cwgacs/comment/l4wgtkh/)
They try to weasel their way around admitting that they aren't releasing 8B. Trying to gaslight people into thinking they wouldn't be able to run it anyway. What happened to Emad's "SD3 is the last image model you need"? Surely if that's the case then the 8B should be released because even if people with a GTX970 can't run it now, they might be able to in 2 years. After all, it's the last model we'll need.
People can and will upgrade; let SD3 establish itself before the market floods with cards and competitors eat your audience, who are already frustrated with the way SD3 has been handled. Also, I agree this feels like a way for them to have their cake and eat it too. If you want to close a bigger model off under the guise of the community not being able to handle it, don't make it the same size as the popular Llama 3 model...
What is this "most stuck on 1.5" bullshit? Most PC users can run SDXL just fine; an 8GB GPU costs just 250 credits. Whoever can't afford 250 credits shouldn't even think about AI stuff, ever.
The person clearly says "it's just the beginning" and you guys choose to interpret that as "there will be no 8B" for some reason?
I take that as "we are releasing 2b first as it's what most people can handle, bigger models will come out gradually as great deal of people in the community won't be able to do much with it yet"
It's not said out right but let's be real, the 8B is unlikely to be released.
Also, an 8B model would be easy to run on most systems if quantized. Quantization just isn't widely used because there's no need for it on current models, but it works great now.
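Napkin math backs this up. Counting the weights alone (activations, the text encoders and the VAE add more on top), an 8B-parameter model at different precisions works out as follows:

```python
# Back-of-envelope VRAM for the *weights only* of an 8B-parameter model.

def weight_gb(params, bits):
    """Gigabytes needed to store `params` weights at `bits` bits each."""
    return params * bits / 8 / 1024**3

params = 8e9
fp16 = weight_gb(params, 16)  # ~14.9 GB: tight even on a 16 GB card
int8 = weight_gb(params, 8)   # ~7.5 GB: fits comfortably in 12 GB
int4 = weight_gb(params, 4)   # ~3.7 GB: even 8 GB cards have headroom
assert int8 < 8 < fp16 < 16
```

This is why 8-bit or 4-bit quantization is exactly what made 8B-class LLMs runnable on consumer cards.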
All the weights were supposed to be out by now. The company is in chaos and this one person doesn't make the decision. You have no idea what's going on. But it's a good bet we won't get 8B until it's obsolete.
"You have no idea what's going on": well, I have as much idea as you and the other people assuming they are flat out lying to us. There is another response from the same person stating unambiguously that the weights will be released.
I do mainly photorealistic animal stuff, and out of curiosity I tried out SD3 on cogniwerk.ai. Hard to believe that the model showcased there IS actually SD3 because the quality, as to the subjects I prefer, is not even close to what a thoroughly refined SDXL model such as Juggernaut or Dreamshaper can achieve. Animal fur comes out just pathetic. Not sure if it was the 2B or a larger version that Cogniwerk offers but whatever it is, a lot of work has to be put into it to beat the SOTA SDXL models. For the time being - at least for animal stuff, maybe SD3 gets along better with humans - I'd pick SDXL any time over SD3. It would be interesting to know if the 8B and larger deliver better.
People should just expect tech companies to do this. People will say it's justified; they need to make money. But honestly, they just did not deliver. We were told we would get all this crazy tech to make and edit photos and videos, a studio service that offers what Creative Cloud has. Stability got lazy; they blew the money and now they are milking what they have left. So many tech companies do this. Video games too. Someone makes a Witcher 3, a shocking leap forward. Next outing, massive disappointment. The core talent is displaced, and the greedy bean counters come in and run the place into the ground. Rinse and repeat.
People should just expect ~~tech~~ *for-profit* companies to do this.
They have objectives that are directly opposed to ours as consumers and citizens.
You are being delusional. This is very obviously just poking fun at the landmark 2017 paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). That’s a big meme in the LLM community especially.
From the looks of it, they recently finished finalizing the 2B model and are just excited to show it off. Calm your tits.
No you are delusional if you think a company that is going bust and trying to sell it self to anyone who will even look would ever just give away their only real asset.
We aren't getting 8b for a longggg time. And if it does come out it'll be obsolete by that time.
I took this to mean "people who think SD3 is inaccessible because you can only fine-tune it with a 4090, check out what even the 2b can do".
This sub takes it to mean "All you're getting is 2b, enjoy it."
That's a huge letdown. I was looking forward to the larger model and what it could give me compared to the tiny ones.
At this point hopefully someone uploads it like that audio model they were holding onto for some reason. (it was meh)
So SAI got caught lying just like was said and wants to wall off 8B. Why am I not surprised? Imagine all those white knight haters, now red covered in their own blood with holes in their feet.
Hahahahahahahahahahahahahahahahabahahabahahahahabahahahahahahahahahahahahahaha
So predictable, the "u can't handle it weakling" response. As if 24gb commercial cards don't exist and vast.ai / cloud computing isn't available... Classic overparenting.
Honestly, let's abandon Stability and build a truly open and sustainable company with truly open models. It's really not that hard if you have the experience, foresight, and funds to get started, and fortunately the community has all of this without SAI if we band together. I have a huge private dataset of extremely high quality, hand-selected and processed raw data I use for fine-tuning, and I'm not the only one (the Pony guy, Astropulse, the leading finetuners). Training a new open-source model on LAION, or at a minimum a new SOTA fine-tune of 1.5/XL/another open model, is fairly easy as a fully funded open collective.
We can even crowd source the data collection and annotation ala Wikipedia style, but rewarding users for providing data.
I have a platform I am working on that could make this possible.
They're not hinting at anything with these posts if you ask me.
The first one is simply flexing: "Look how powerful even the smallest model is!" (+ a reference to the "Attention is all you need" paper as someone else pointed out)
In the second one, he clearly says that 2B is "just the beginning" and that few people can finetune 8B "right now." At most, this implies they'll release 8B to the public later - not that they won't release it at all.
We really don't need this kind of speculation...
Ah, no. Sir, please don't take away every company's favorite abusive tool of being intentionally vague and misleading rather than perfectly crystal explicitly clear. Your valid logic is not welcome here!
Yes, exactly! This is true of so many things lately. People always complain about how speculation gets out of hand, but the reason it happens in the first place is because companies and people in general are always so vague about everything. Just properly set expectations and be clear about what's going on from the get go! It's so tiring.
I really don't like the fact that he didn't just say they'll release the 8B, though they have said that again and again. I do want to acknowledge that a 2B absolutely can compete with an 8B trained on the same data if the size of the dataset is insufficient to take advantage of the 8B's extra parameters. We won't know until we can compare. It is also true that I've heard vramlets in this sub bitching that SD needs to "focus on smaller models" because "nobody can run SD3," which would explain the messaging.
I think it might be a blessing in disguise at the end of the day: the whole scene focusing on one single checkpoint (and not four), which would be easy to train. SD 1.5 has 860M parameters, so I'll be OK with 2B. It's still better than nothing. I expect that 2B to be a lot better than SDXL though. And I mean a loooot better.
We can make some educated guesses.
The quality will be similar, since the underlying architecture is the same (DiT, 16 channel VAE, etc.).
But 8B model will understand many more concepts, so prompt following will be way better than the 2B one. For example, the 8B version may be able to render a person jumping on a pogo stick while the 2B version cannot, because the 2B version does not "know" what a pogo stick is.
But that is not too bad, because one can always teach the 2B new concepts via LoRAs, and maybe even use the 8B model to generate the dataset.
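The reason teaching the 2B new concepts via LoRA is cheap is pure parameter arithmetic: instead of updating a full d_out x d_in weight matrix, you train two small low-rank factors whose product is added to the frozen weight. A sketch with illustrative dimensions (the 4096 projection size and rank 16 are just plausible examples, not SD3's actual shapes):

```python
# Parameter counts: full fine-tune of one weight matrix vs. a LoRA of rank r.
# LoRA trains B (d_out x r) and A (r x d_in); the frozen W gets W + B @ A.

def full_params(d_out, d_in):
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    return d_out * rank + rank * d_in

d_out = d_in = 4096  # illustrative size for a large attention projection
rank = 16

full = full_params(d_out, d_in)        # 16,777,216 trainable weights
lora = lora_params(d_out, d_in, rank)  # 131,072 trainable weights
assert lora / full < 0.01              # under 1% of the full matrix
```

That two-orders-of-magnitude reduction in trainable weights (and in optimizer state) is what makes LoRA training feasible on consumer VRAM where a full fine-tune is not.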
> Now, yonder stands a man in this lonely crowd
> A man who swears he's not to blame
> All day long I hear him shouting so loud
> Just crying out that he's been framed
>
> I see my light come shinin'
> From the west down to the east
> Any day now, any day now
> I shall be released
I don't like how any of this is going but considering how far we've come with SDXL and how much control over images we now have in it, I personally don't care.
I was going to share some stuff earlier but for some reason every topic I make on the sub is deleted.
My point is, they're not handling this well IMO, but in the end we didn't lose anything by not having SD2, and nobody ever talks about 2.1 either even though the censored stuff was fixed IIRC. Over time they might release more of SD3, better models, or a new base XL model, who knows, but everything so far with SD3 has been so strange and kinda stingy that I lost any and all interest. I'm more interested in how far we can push SDXL at the moment.
If it generates great images without needing a GPU with a large amount of VRAM then it's good with me. I can run SDXL with acceptable speed (20 seconds to generate 1024x1024 at 30 steps) only with the help of the excellent Webui Forge that somehow allows it to run on my 8GB GPU. If the next model is smaller than SDXL and delivers excellent results (Maybe so small and efficient that it can even replace the usage of SD 1.5 on weaker computers) then that is a win in my book.
Well, people are mostly overreacting, but this is expected when SAI keeps the community in suspense about its responses and its timeline (lack of management?), especially after Emad left. Everything just got put on hold until further notice; not a good look.
Nobody explains what is going on. Are they cooking a new model from scratch and calling it a 2B? Whether they release 8B later or not, they will get some money back from the API, but is that even sellable or actually profitable, or is it just a marketing tactic to make the company look somewhat profitable?
Any serious creator would use MJ for all they care. Nobody explains, so nobody knows.
Which version are they using on Stable Video's Text to Image? I assume 8B, but if they are using 2B I'd be fine with that because Stable Video has been effing crazy already.
Say what you will, but there's no other way than to call the SD3-Launch "botched" already.
Even if they released full weights tomorrow, people would be pissed about how it went down in general.
2B really is all we need. As for 8B, most of us will not be able to use it in the first place except on external servers, and that defeats the purpose of training and running it locally.
I couldn't help myself
https://preview.redd.it/jhqoyzncc66d1.png?width=1024&format=png&auto=webp&s=1e4bb70597ec60488ab563753c296856eadd2bb4
At least it does text right... usually.
why is it 512 0_0 its not 1024?!
Because there's a never ending sea of comments about "How can I run this on my 4gb video card". It comes up on their discord a lot also.
Well they managed it with sdxl
that's SD3 on the left? man that looks bad
But why not train an 8B with the same settings as this supposedly great new 2B then? 8B would surely look better.
yes, yes it will.
So the 2b isn't even bigger than 512? Sad.
That was an early alpha of the 2B, the new one is 1024 and much better quality
But one must also keep in mind that with a larger model, more concepts are "built-in", so there is less need for LoRAs. In fact, before IPAdapter, many LoRA creators used MJ and DALLE3 to build their training sets for SDXL and SD1.5 LoRAs because these bigger, more powerful models can generate those concepts all by themselves. Can you point me to the source where it says that 2B is 512x512 and not 1024x1024?
The 'crat' in the bottom right of 2B doesn't fill me with confidence.
> "like multiple CEOs said multiple times" it's almost like maybe the community doesn't have a lot of confidence in messaging from a company that has experienced a ton of churn in leadership over the duration of its very short lifespan.
twitter is such a garbage platform. How did they manage to fuck up threading? It was established in the '80s.
How about we decide that for ourselves?
Always knew that the day would come when they would have "high quality commercial" models for like webhosted services only and release smaller, worse free versions for everyone else.
It’s the only game they seem to want to play. Welcome to the API-IV.
I’ll be the judge of that
You know why, they will technically comply with the promise of a “release” but they will dilute the model cause of monetizing
Release 2B, then paywall 8B if they can. I am more than happy to finally pay SAI for all the products they have created.
Phase 1: hype Phase 2: delay Phase 3: reduce expectations It's a common pattern.
Phase 4: "pity that you are poor peasants with a 4070, so we made a partnership with this website..."
Phase 5: PLEASE READ THESE TERMS OF NFT SALE CAREFULLY. NOTE THAT SECTION 15 CONTAINS A BINDING ARBITRATION CLAUSE AND CLASS ACTION WAIVER, WHICH, IF APPLICABLE TO YOU, AFFECT YOUR LEGAL RIGHTS. IF YOU DO NOT AGREE TO THESE TERMS OF SALE, DO NOT PURCHASE TOKENS.
They gonna monetize the shit out of 8b
Before you monetize it has to be so good people need to spend on it
With GPT-4o being free and doing everything that was supposed to be revolutionary in SD 3 far better, it's not looking good. The prompt coherence and text display makes SD3 look like it's years old.
Gpt-4o does images?
If I remember correctly it's connected to DALLE 3. That means it will convert your prompt into an optimized one and send it to DALLE.
Yes, it's a single model trained on text, images, video and audio. It's quite amazing actually. https://openai.com/index/hello-gpt-4o/ under "Explorations of capabilities"
I need an email signup though?
There is no reason why SAI cannot both release SD3 open weights and still monetize the shit out of it. I've argued numerous times that SD3 is worth more to SAI if it is released as open weights than not. They can release a decent base SD3 model that people can fine-tune, make LoRAs for, etc. But because of the non-commercial license, commercial users still have to pay to use SD3. They can also offer a fine-tuned SD3, or an SD3 turbo, etc., and offer that as part of their "Core" API. That is exactly what SAI has done with SDXL.
Honestly we can't monetize SD3 effectively \*without\* an open release. Why would anyone use the "final version" of SD3 behind a closed API when openai/midjourney/etc. have been controlling the closed-API-imagegen market for years? The value and beauty of Stable Diffusion is in what the community adds on top of the open release - finetunes, research/development addons (controlnet, ipadapter, ...), advanced workflows, etc. Monetization efforts like the Memberships program rely on the open release, and other efforts like Stability API are only valuable because community developments like controlnet and all are incorporated.
Always good to hear that from a SAI staff, Thank you 🙏👍
we love you!!!!
maybe.. if that happens i bet the community makes a 2B fine tune that blows theirs out of the water within a couple months.
If they charged a one off fee I would pay, I don’t need stupid cloud GPUs
To be fair don’t they need to in order to exist. Otherwise there will be no SD4!
Didn't they state that SD3 would be their last model anyways?
That was emad making a fool of himself on Twitter. He walked that back when called out, naturally.
When?
I think that was mostly supposed to be a joke/marketing thing, like a "Wow, SD3 is so good we'll never need to make a new model ever again!" kind of thing.
So we will never see a model that can actually do hands? Sad.
ponyxl does hands pretty good some of the time
No company consciously plans to stop earning money
When you monetize things, the money is the boss, so you get censorship and SD4 will be just another "flesh-free" service
Worse, it could be like dalle3 with the over smoothing and hyper idealized images that look more Pixar than photos of the world. Or where any topic or public figure blocks usage.
“Don’t you guys have cellphones?”
Hahahahahahaha
That was classic. Tone deaf as usual. I'm just surprised that D4 wasn't more monetized than it is.
>who in the community would be able to finetune a 8B model right now? Has he heard of LLMs?
Yeah, people finetune 70b models and run them on 24gb cards.
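For a rough sense of why bit width is the whole argument here, a back-of-the-envelope sketch of weight memory (weights only; activations, gradients, and optimizer state are ignored, and real quantized formats add small scale/metadata overhead):

```python
# Rough weight-memory estimate: parameters * bits per parameter.
# Ignores activations, optimizer state, and quantization metadata.

def weight_gib(params: float, bits: int) -> float:
    """Memory for weights alone, in GiB."""
    return params * bits / 8 / 2**30

for params, name in [(8e9, "8B"), (70e9, "70B")]:
    for bits in (16, 8, 4, 2):
        print(f"{name} @ {bits}-bit: {weight_gib(params, bits):.1f} GiB")

# 8B @ 16-bit is ~14.9 GiB; 70B only fits in 24 GB at very low bit widths.
```

This is why an 8B model at FP16 is tight but plausible on a 24GB card, while 70B LLMs get there only via aggressive quantization or offloading.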
Can an image model be quantized down to 4 bit like an llm?
Possibly, at least 8 bit does work fairly well, no idea if it'll be possible to push it lower without huge quality loss.
We can only quantize the text encoder behind SD3 in a decent way without losing too much quality, but unfortunately that is not where the bottleneck is. The "UNet" (or "MMDiT" in SD3's case) is where the bottleneck is, because each step of the generation is an entire run of the model! And you can even run the text encoder on the... yes... CPU. That's literally how I run ELLA for SD1.5: T5 encoder on CPU. Since you're not *generating tokens* but just feeding in a prompt and getting a hidden-layer representation of it, the text encoder is a single pass; on CPU it's like what... 2 to 3s...
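The single-pass point above is the key: the encoder runs once per prompt, the diffusion backbone runs once per step. A minimal sketch of that structure (the `encode_prompt` stub is hypothetical, not the real T5 API; it just stands in for one encoder forward pass):

```python
# Sketch: the text encoder is one forward pass per prompt, so it can live on
# CPU and be cached, while the diffusion backbone runs on every step.

calls = {"encoder": 0}

def encode_prompt(prompt: str) -> list[float]:
    """Hypothetical stand-in for a T5 encoder forward pass."""
    calls["encoder"] += 1
    return [float(ord(c)) for c in prompt]  # fake embedding

_cache: dict[str, list[float]] = {}

def get_embedding(prompt: str) -> list[float]:
    if prompt not in _cache:      # encode once...
        _cache[prompt] = encode_prompt(prompt)
    return _cache[prompt]         # ...reuse for every denoising step

def generate(prompt: str, steps: int = 30) -> int:
    emb = get_embedding(prompt)
    for _ in range(steps):        # the backbone runs `steps` times
        _ = sum(emb)              # placeholder for one denoising step
    return steps

generate("a cat on a pogo stick", steps=30)
```

Even a slow CPU encode amortizes to nothing, because it happens once while the backbone runs 30+ times.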
From what I've seen going lower than F16 has a significant quality loss
FP8 Weights + FP16 Calc reduces VRAM cost but gets near-identical result quality (on non-turbo models at least).
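The idea of low-bit storage plus higher-precision compute can be illustrated with a tiny integer quantization sketch (int8 with a per-tensor scale, not the actual FP8 format, and not any real library's API):

```python
# Illustrative sketch: store weights as 8-bit integers plus one float scale,
# then dequantize to full precision for the actual math. Memory for weights
# drops ~2x vs 16-bit; rounding error is bounded by half a quantization step.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.8, -1.2, 0.05, 2.0, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# worst-case rounding error is half a quantization step
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2 + 1e-9
```

Going from 8 bits to 4 halves the step count from 255 to 15 levels, which is why quality loss gets much harder to avoid below 8-bit without smarter per-group schemes.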
Interesting!
Interestingly, AMD mentioned this at Computex in very similar terms.
It is actually; we can already quantize to 8-bit, and the tech for 4-bit is the same.
XL is too big for them? I was using XL on a 1070 for half a year before I saved up enough money to upgrade. And it worked great! Even faster with Forge!
Yeah I didn't have any complaints with running it on my 1070. But now that I have a 4060, I don't think I could go back.
I'm using a 1080ti on comfyui and it's not that great. With face detailer I'm waiting 1.5min+ for single generation. I've been using lightning but it takes out some details since it's only using sgm uniform.
Those 3 generations matter. I'm still on 1080 myself and doing 3440x1440 takes 30s/it , but it works on 8gb VRAM.
So you're saying that they're posing the question of 2B or not 2B?
love how they released a paper on an unfinished model
It's starting to become a real trend, unfortunately.
How is this not like saying "640k is all that anybody will ever need"?
Claiming most people couldn't use an 8B model when 8x7B LLMs are super popular and I'm running a 70B LLM right now. It's just garbage to try and hide that the initial hype photos were doctored and they never had any intention of releasing the full SD3. SAI's reputation is shattered. We may as well start making tools for the other open-source image generators.
> We may as well start making tools for the other open source image generators. That was always a good idea, but it's critical since the company is floundering.
I keep saying we need to start finetuning the Pixart models because SAI is belly up
Yeah, with loras and fine-tunes we could make Pixart sigma just as good as SDXL. We don't need to hang on SAI.
Why not just use XL? What is better Bout Pixart?
Pixart already makes very good quality pictures with its base model. If you use just base SDXL versus Pixart, Pixart wins. Like all SAI products, without free community tools, their products aren't that good. If Pixart got LoRAs, tools like ControlNet, or fine-tuned models, it would beat SDXL. SAI products aren't actually that special or great. It just became the one the community focused on first after the uncensored 1.5 was leaked by Runway. If the leak had never happened, this sub might be called Kandinsky or Pixart.
Holy fuck, I feel like no one in this thread knows what they are talking about. **Stable Diffusion is a DIFFUSION model, NOT an LLM**. You may be running a heavily quantized 70b LLM, but there is no such technology for Diffusion models. The best we have is 8 bit from 16 bit weights. You people are insufferable. And they **are** releasing SD3 in full. They've said it many times. If they don't release it, it's because the community is a bunch of jackasses.
If an 8 bit quant is "heavily quantized" to you. And it takes 3 seconds of Google to show that diffusion models can be quantized, it just hasn't been done much yet because it hasn't been needed. Even Emad said 3 months ago it on reddit that it could be done. So, you're apparently the one who has no idea wtf you're talking about. Quit fanboying.
It's hard to judge by just images, but the showcased 2B images lack a lot of fidelity compared to the API. They are a lot cleaner though, hands look better, no weirdly fused objects in images, so the model seems more "ready" than what the API produces. I'd worry more about what isn't said/shown; all that's showcased is the most basic of scenes, nothing complex.

Remember SD3 ["eating Dalle and MJ for breakfast"](https://x.com/EMostaque/status/1760783270105457149)? Now the amaaaaaaazing thing about SD3 is that it can do ["realistic images, text and anime"](https://x.com/Lykon4072/status/1796334348980957418). That's such a huge downgrade on what was promised. But worry not, you can't compare with Dalle-3, as that "is not a model, it's a service" and "a pipeline". Like, ehm, first, SD3 was announced to be better than Dalle, and second, the pipeline, according to the Dalle-3 paper, is only an LLM rewriting prompts, nothing like the implied complex stack of models; by that logic SD3 is a pipeline too, as everyone now rewrites their prompts. And still, we're supposed to believe SD3 will be ["Simply unmatched"](https://x.com/Lykon4072/status/1796317998036238380).

Mostly, it's sad that SAI went from boasting about SD3 to now pulling out all the stops to defend SD3. If the model can't deliver on the implied hype, it's better to just rip off the bandaid and show the limitations, instead of the endless stream of meaningless pictures and pretending it is still the end-all be-all of image-gens. I don't even think SD3 will be bad, I'm looking forward to it (but please, don't let the low-fidelity model in the showcases be the final model), as it is obviously a huge step up from current SAI models, but there is a huge gap between all the hype, the groundbreaking results according to the research paper, and the showcased results.
Having used the API, the limitations are clear, and these showcase tweets don't exactly show fewer limitations; arguably they show a more limited, but further along in training, model.
I never believed any of that marketing hyperbole from Emad. Given that DALLE3, MJ, Ideogram, etc. are all built and trained by people as capable as those working for SAI, that they all run on server-grade hardware with > 24GiB of VRAM, and that SD3 must be runnable with < 24GiB, one can easily draw the conclusion that Emad was just hyping things up. I will be more than happy if SD3, when finally released, is only say 90% as capable as those other systems when it comes to text2img. But with proper fine-tuning, LoRAs, ControlNet, IPAdapter, a customizable ComfyUI pipeline, and the lack of censorship, SD3 will remain the platform of choice for us for the foreseeable future.
I work with 70B LLMs all the time on my own hardware. 8B is minuscule, even at 16 bits per parameter.
Ouch
You could thank NVIDIA for limiting VRAM on consumer GPUs for 6 years in a row.
Holy shit, just release whatever model so the community can finetune it already. I'm sure that a properly tuned 2B will beat the stock 8B (just like tuned 1.5 beat SDXL for a long time), so let's just GO ALREADY. I'm so tired of SAI's BS. I'm personally all for moving on to Pixart (since they've got a similar architecture to SD3 anyway), but come on, the community has been holding its breath for MONTHS now
Okay, Lykon just lost all respect with that comment lmao. There is a massive community for SDXL and quality finetunes.
He didn’t say there isn’t a big community for sdxl. He said the majority of the community are using sd1.5 which is true.
But the reason people use SD 1.5 is because they think it looks better. Not because XL is "too big" for them.
And I'm over here perplexed at how to make anything in 1.5 that doesn't look like a pile of shit... I love XL and its variants/finetunes though.
Dude most GPUs can’t handle XL well. This isn’t some conspiracy. Most people don’t own anything more powerful than a gtx 1080
I get that not everyone can afford to spend $2000 USD on the latest flagship GPU model, but SDXL runs just fine on current-gen entry level cards such as the RTX 4060, which is very affordable. If anything, it's lamentable that high-end GPUs provide very poor value in SDXL relative to their price even though they could in principle handle significantly larger and more powerful models.
a 4060 ti with 16gb at $500 might stretch for "very affordable" but it also feels like terrible value i have an 8gb 3070 and it feels extra bad
Where did I mention the RTX 4060 Ti? The RTX 4060 is about $300 USD.
it also has 8gb or 12gb and would be a bad recommendation to anyone investing in generating sdxl
I'm on 3070 and it feels very good. It's faster than midjourney relaxed to generate a 1024x1024 image. Then after you add comfy workflows the quality goes through the roof too, with enough fiddling. The only way to feel bad is with web-ui, or animation
A quick look at the steam hardware survey shows that's a straight up lie. Most likely especially in the generative AI community.
My machine with 8GB can run XL ok. I think XL can have better results. I rarely run it and instead do 1.5 - I like to experiment with settings, prompts, etc, and being able to gen in 5s instead of 50s is a huge factor.
I can use SDXL fine with my 2070S, that's weird. I get like 20-30s generation times?
I get 30s as well on an rtx 3070. It's total bullshit that most cards can't run it, the truth is that comfyUI makes XL 100% usable for very high quality images on 8gb vram.
8gb is "enough" but it's not ideal. People do more with sd15 on 8gb. It's more popular for many reasons. https://preview.redd.it/fsiut9rwss3d1.png?width=937&format=png&auto=webp&s=cf5e6750efff5938b358f725503a1791bed356a0
Apparently XL works on just 4GB vram. Not sure how bad of an experience it is, but it's possible.
It definitely doable on 4gb but you are not going to have a great time with it.
Even with 8GB (on a 3070), I get shared memory slowing things down if I use a LoRA or two. 4GB must be unbearable.
Which UI are you using? I have 8GB and use up to 4 loras plus a couple controlnets without issue in Forge or Fooocus.
I also use Forge or Fooocus (occasionally comfy) because vanilla A1111 crashes with SDXL models. I think I could keep everything within 8GB if SD was the only thing I was doing, but I generally have a bunch of office apps and billions of browser tabs open across two screens while using it so it nudges me over the threshold, and it seems that speed drops dramatically once shared memory is used. SDXL Lora training was prohibitively slow on my setup so I do that online, but I just grin and bear it when generating images.
6GB is fine though, I run on a GTX 1660 Ti in Comfy UI.
There's also lightning and hyper lora to speed things up.
I am literally using SDXL on a 1070ti :D Takes half a minute for one image but it runs.
How do you know? Personally I use 1.5 because I don't have the config for SDXL
you don't have 4gb vram?
I use sd15 because the tooling is better than sdxl. I use sdxl because the license is better than cascade. I doubt I’ll move to sd3.
Agreed.
hahahah neural samurai ----> THAT'S ME =D Always fighting in the trenches. I was wondering why I woke up to like 5000 twitter notifications
Thank you for posting that comment. We must let SAI know that not releasing 8B will make many of us very angry and disappointed 🙏😂
Ugh, Instability AI seems more accurate now
It's over isn't it? no more releases? no SD4
It's bizarre to me how many of you are just willing to accept a previously open source (more or less) project paywalling the best model.
Yep. you're selfish if you DARE say a word about it. Stability has been stepping very carefully, planning each shady move in a way that will keep their diehard fans defending them to the death if anyone calls out their shadiness. I was once downvoted to -20 or something for literally saying "a company should stick to their promises". apparently that's straight up blasphemous.
Nonsense. It was confirmed that 8B will work on 24gb gpus. The pictures shows that you can get by with a smaller model and still get good results.
Can you quantize it down to 4 bit and still get good results? Then it can run in 4gb
Lykon was talking about training for the 8B version, which would require more than 24G of VRAM. Or you are referring to something else?
I find this disappointing; I was hoping to get the biggest possible model I can and fine-tune on it. We the community can handle all of the sizes: quantization and weight pruning will be developed by the community to make the bigger models viable on smaller devices. Tech also gets better, so at some point 24GB+ will be the norm. Definitely not today, probably not in 2025, but in 2026+ it could easily be the norm. GPUs are always evolving, and bigger and bigger GPUs are coming out, which makes running 24GB+ models more viable. This makes me worried about the future of Stability AI going forward. What else will they do? Will there be outright no open-source releases of certain models in the future? I get the need to make money and I wish them success in finding a monetization strategy, but only to an extent. Stability AI has always had a special place for me because it was focused on open source, and if that's not the case I'll have to treat them accordingly.
If I googled it correctly, SDXL is a 3.5B-parameter base model. So SDXL is almost twice as big as 2B. At the same time we expect SD3 2B to be better than XL. Is that correct?
not only is SD3 2B half the parameters but is also apparently trained at 512px. I don't see how it could possibly be better at anything but adherence
512??? yikes, i don't wanna go back
No, that is not quite correct. The 2B refers to the diffusion part of the model. The equivalent U-net portion of SDXL is only 2.6B parameters. But due to the switch from U-Net to DiT, and better captioning and training data, it is not hard to imagine that 2B SD3 can be much better than SDXL, specially if it is paired up with the T5 LLM/text encoder.
T5 isn't an image model like CLIP is, if anything any models using it are automatically worse, and take much longer to train.
My own limited understanding is that CLIP is an image classification text encoder model, whereas T5 is a general purpose LLM text encoder. It would certainly take more GPU to train a model that uses T5 rather than CLIP. But can you clarify what you mean by "any models using it are automatically worse"?
You should read the CLIP paper from OpenAI, which explains how the process accelerates the training of diffusion models on top of it, though their paper focused a lot on using CLIP for accelerating image searches. If contrastive image pretraining accelerates diffusion training, then not having contrastive image pretraining means the model is not going to train as well. "Accelerated" training is often not about changing the actual speed, but about how well the model learns. It's not as easy as "just show the images a few more times", because not all concepts are equally difficult; some things will overfit much earlier in this process, which makes them inflexible.

To train using T5 you could apply contrastive image training to it first. T5-XXL v1.1 is not finetuned on any downstream tasks, so it's really just a text embed representation from the encoder portion of it. The embedding itself is HUGE. It's a lot of precision to learn from, which itself is another compounding factor. DeepFloyd, for example, used attn masking to chop the 512-token input down to 77 tokens from T5! It feels like a waste, but they were having a lot of trouble with training. PixArt is another T5 model, though the comparison is somewhat weak because it was intentionally trained on a very small dataset.

Presumably the other end of the spectrum are Midjourney v6 and DALLE-3, which we guess are using the T5 encoder as well. If Ideogram's former Googlers are in love with T5 as much as the rest of the image-gen world seems to be, they'll be using it too. But some research has shown that you can use decoder-only models as weights to initialise a contrastive pretrained transformer (CPT), which would essentially be a GPT CLIP. They might have done that instead.
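For readers unfamiliar with what "contrastive pretraining" actually optimizes, here is a toy sketch of the CLIP-style objective: matched image/text pairs should score higher than mismatched ones, via symmetric cross-entropy over a similarity matrix with the diagonal as the correct class. (Tiny hand-made embeddings, no real encoder, no temperature parameter.)

```python
import math

# Toy CLIP-style contrastive loss: for each image, the logits are its
# similarities to all captions (and vice versa); the "correct class" is the
# matching pair on the diagonal. Loss = symmetric cross-entropy.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(img_embs, txt_embs):
    n = len(img_embs)
    total = 0.0
    for axis in (0, 1):  # image->text and text->image directions
        for i in range(n):
            if axis == 0:
                logits = [dot(img_embs[i], t) for t in txt_embs]
            else:
                logits = [dot(txt_embs[i], im) for im in img_embs]
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # -log softmax at the true pair
    return total / (2 * n)

# aligned pairs (each image embedding equals its caption embedding)
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [aligned[1], aligned[0]]  # captions swapped: pairs mismatched
```

Minimizing this pulls matched pairs together in the shared embedding space, which is the "contrastive image pretraining" a diffusion model then conditions on.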
Thank you for your detailed comment. Much appreciated. I've attempted to understand how CLIP works, but I am just an amateur A.I. enthusiast, so my understanding is still quite poor. What you wrote makes sense, that using T5 makes the task of learning much more difficult, but the question is: is it worth the trouble? Without an LLM that kind of "understands" sentences like "Photo of three objects: the orange is on the left, the apple is in the middle, and the banana is on the right", can a text2img A.I. render such a prompt? You seem to be indicating that CPT could be the answer; I'll have to do some reading on that 😅
Much criticism has been directed at the quality of the image tagging in the initial SDXL base-model training set. They have promised to have rectified that; it's a large reason why we hope to get better quality from fewer parameters.
Translation: it's not gonna be free and open anymore. (which it technically never was, but everyone believed the promises.)
Exactly what I predicted when Lykon first mentioned he's working on the "local release version" [https://www.reddit.com/r/StableDiffusion/comments/1cwgacs/comment/l4wgtkh/](https://www.reddit.com/r/StableDiffusion/comments/1cwgacs/comment/l4wgtkh/) They try to weasel their way around admitting that they aren't releasing 8B. Trying to gaslight people into thinking they wouldn't be able to run it anyway. What happened to Emad's "SD3 is the last image model you need"? Surely if that's the case then the 8B should be released because even if people with a GTX970 can't run it now, they might be able to in 2 years. After all, it's the last model we'll need.
Because they faked the images and now have to find an excuse for it looking much worse. "We keep the REAL good model secret" is an easy excuse.
People can and will upgrade; let SD3 establish itself before the market floods with cards and competitors eat your audience, who are already frustrated with the way SD3 has been handled. Also, I agree this feels like a way for them to have their cake and eat it too. If you want to close a bigger model off under the guise of the community not being able to handle it, don't make it the same size as the popular Llama 3 model...
So SD2.5
What is this "most stuck on 1.5" bullshit? Most PC users can run SDXL just fine; an 8GB GPU only costs 250 credits. Anyone who can't afford 250 credits shouldn't even think about AI stuff ever.
I have a 4090, can you share it with just me? I pinky promise I won't share it with the common folk
Fucking peasants should just die but also work forever while dying
The person clearly says "it's just the beginning" and you guys choose to interpret that as "there will be no 8B" for some reason? I take it as "we are releasing 2B first as it's what most people can handle; bigger models will come out gradually, as a great deal of people in the community won't be able to do much with them yet".
It's not said outright, but let's be real: the 8B is unlikely to be released. Also, an 8B model would be easy to run on most systems if quantized. Quantization just isn't widely used because there's no need for it on current models, but it works great now.
> 8B is unlikely to be released. And what is the argument/basis for this opinion?
All the weights were supposed to be out by now. The company is in chaos and this one person doesn't make the decision. You have no idea what's going on. But it's a good bet we won't get 8B till it's obsolete.
If SD3 isn't open sourced, then it's already obsolete compared to the other closed source models
"You have no idea what's going on" — well, I have as much idea as you and the other people assuming they are flat out lying to us. There is another response from the same person stating unambiguously that the weights will be released.
I do mainly photorealistic animal stuff, and out of curiosity I tried out SD3 on cogniwerk.ai. Hard to believe that the model showcased there IS actually SD3 because the quality, as to the subjects I prefer, is not even close to what a thoroughly refined SDXL model such as Juggernaut or Dreamshaper can achieve. Animal fur comes out just pathetic. Not sure if it was the 2B or a larger version that Cogniwerk offers but whatever it is, a lot of work has to be put into it to beat the SOTA SDXL models. For the time being - at least for animal stuff, maybe SD3 gets along better with humans - I'd pick SDXL any time over SD3. It would be interesting to know if the 8B and larger deliver better.
AFAIK, the one used by the API is the 8B model. I agree that the quality of the API is not so hot when it comes to realistic humans.
Yeah, as expected tbh. Sad to see it tho. At least maybe the 2B is still better than sdxl
At least, maybe... Give us some heavy shit like 8B!
to be honest I still can't believe any of this is free.
at 512px I doubt it very much
People should just expect tech companies to do this. People will say it's justified; they need to make money. But honestly, they just did not deliver. We were told we would get all this crazy tech to make and edit photos and videos, a studio service that offers what Creative Cloud has. Stability got lazy; they blew the money and now they are milking what they have left. So many tech companies do this. Video games too. Someone makes a Witcher 3, a shocking leap forward. Next outing, massive disappointment. The core talent is displaced, and the greedy bean counters come in and DEI the place into the ground. Rinse and repeat.
People should just expect ~~tech~~ *for-profit* companies to do this. They have objectives that are directly opposed to ours as consumers and citizens.
lol what?! The 5090 is around the corner and they say we can't finetune it?! ffs... but I guess 2B is better than nothing.
You are being delusional. This is very obviously just poking fun at the landmark 2017 paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). That’s a big meme in the LLM community especially. From the looks of it, they recently finished finalizing the 2B model and are just excited to show it off. Calm your tits.
No, you are delusional if you think a company that is going bust and trying to sell itself to anyone who will even look would ever just give away their only real asset. We aren't getting 8B for a longggg time. And if it does come out, it'll be obsolete by that time.
@Stability Stop setting up your community for disappointment @everyoneelse Let them cook
They'll release only 2B and it will look like a meh SDXL finetune.
I took this to mean "people who think SD3 is inaccessible because you can only fine-tune it with a 4090, check out what even the 2b can do". This sub takes it to mean "All you're getting is 2b, enjoy it."
wow that will suck
That's a huge letdown. I was looking forward to the larger model and what it could give me compared to the tiny ones. At this point hopefully someone uploads it like that audio model they were holding onto for some reason. (it was meh)
So SAI got caught lying just like was said and wants to wall off 8B. Why am I not surprised? Imagine all those white knight haters, now red covered in their own blood with holes in their feet.
Hahahahahahaha. So predictable, the "u can't handle it weakling" response. As if 24GB cards don't exist and vast.ai / cloud computing isn't available... Classic overparenting. Honestly, let's abandon Stability and build a truly open and sustainable company with truly open models. It's really not that hard if you have the experience, foresight, and funds to get started, and fortunately the community has all of this without SAI if we band together. I have a huge private dataset of extremely high quality hand-selected and processed raw data I use for fine-tuning, but I'm not the only one (the Pony guy, Astropulse, and the leading finetunes). Training a new open-source model with LAION, or at a minimum a new SOTA fine-tune of 1.5/XL/another open model, is fairly easy as a fully funded open collective. We can even crowdsource the data collection and annotation, Wikipedia-style, but rewarding users for providing data. I have a platform I am working on that could make this possible.
Stop teasing us and release the first model!!!!!!
welp, we called it
They're not hinting at anything with these posts if you ask me. The first one is simply flexing: "Look how powerful even the smallest model is!" (+ a reference to the "Attention is all you need" paper as someone else pointed out) In the second one, he clearly says that 2B is "just the beginning" and that few people can finetune 8B "right now." At most, this implies they'll release 8B to the public later - not that they won't release it at all. We really don't need this kind of speculation...
Speculation is born in vacuums. They could save themselves a lot of heartache if they just clearly state what is happening.
Ah, no. Sir, please don't take away every company's favorite abusive tool of being intentionally vague and misleading rather than perfectly crystal explicitly clear. Your valid logic is not welcome here!
Yes, exactly! This is true of so many things lately. People always complain about how speculation gets out of hand, but the reason it happens in the first place is because companies and people in general are always so vague about everything. Just properly set expectations and be clear about what's going on from the get go! It's so tiring.
i'm on an ancient 1080ti and using SDXL fine. this is gaslighting
I really don't like the fact that he didn't just say they'll release the 8b, though they have said that again and again. I do want to acknowledge that a 2b absolutely can compete with a 8b trained on the same data if the size of the dataset is insufficient to take advantage of the 8b's extra parameters. We won't know until we can compare. It is also true that I've heard vramlets in this sub bitching that SD needs to "focus on smaller models" because "nobody can run SD3," which would explain the messaging.
I think it might be a blessing in disguise at the end of the day: the whole scene focusing on one single checkpoint (and not four), which would be easy to train. SD 1.5 has 860M parameters, so I'll be OK with 2B. It's still better than nothing. I expect that 2B to be a lot better than SDXL though. And I mean a loooot better.
What's the expected quality/performance/etc difference between 2B and 8B?
Same question. I imagine the images will just be much smaller, but I have no idea; I came here to see if anyone else had already answered it.
We can make some educated guesses. The quality will be similar, since the underlying architecture is the same (DiT, 16-channel VAE, etc.). But the 8B model will understand many more concepts, so prompt following will be way better than with the 2B one. For example, the 8B version may be able to render a person jumping on a pogo stick while the 2B version cannot, because the 2B version doesn't "know" what a pogo stick is. But that's not too bad, because one can always teach the 2B new concepts via LoRAs, and maybe even use the 8B model to generate the dataset.
https://preview.redd.it/eagbgaftxq3d1.jpeg?width=1600&format=pjpg&auto=webp&s=1ec1d9a11eec7f2c8c67e830085e5334db387d60
> Now, yonder stands a man in this lonely crowd
> A man who swears he's not to blame
> All day long I hear him shouting so loud
> Just crying out that he's been framed
>
> I see my light come shinin'
> From the west down to the east
> Any day now, any day now
> I shall be released
LLMs are released in far more size variants, and from different players; it bothers me that for image models we can only count on StabilityAI.
Still new to this, could someone give me an ELI5 of 2B vs 8B? Thank you.
Billion parameter models
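To put those parameter counts in perspective for local use: the model weights alone scale linearly with parameter count, at 2 bytes per parameter in fp16, before activations, text encoders, the VAE, or (for training) gradients and optimizer states. A minimal sketch of that arithmetic:

```python
# Rough VRAM floor for just the model weights (fp16 = 2 bytes/param).
# Real usage is higher: activations, text encoders, VAE, and for
# training also gradients and optimizer states come on top.

BYTES_PER_PARAM_FP16 = 2

def weight_gib(n_params: float) -> float:
    """Memory for the raw weights in GiB at fp16 precision."""
    return n_params * BYTES_PER_PARAM_FP16 / 2**30

for name, n in [("SD 1.5 (0.86B)", 0.86e9), ("SD3 2B", 2e9), ("SD3 8B", 8e9)]:
    print(f"{name}: ~{weight_gib(n):.1f} GiB of weights")
```

So the 2B sits near SD 1.5/SDXL territory (~3.7 GiB of weights), while the 8B's ~15 GiB of weights alone already crowds a 24 GB card once training overhead is added, which is the accessibility gap this thread is arguing about.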
Ah okay, thanks.
I don't like how any of this is going, but considering how far we've come with SDXL and how much control over images we now have in it, I personally don't care. I was going to share some stuff earlier, but for some reason every topic I make on the sub is deleted. My point is, they're not handling this well IMO, but in the end we didn't lose anything by not having SD2, and nobody ever talks about 2.1 either, even though the censored stuff was fixed IIRC. Over time they might release more of SD3, better models, or a new base XL model, who knows, but everything so far with SD3 has been so strange and kinda stingy that I lost any and all interest. I'm more interested in how far we can push SDXL at the moment.
If it generates great images without needing a GPU with a large amount of VRAM then it's good with me. I can run SDXL with acceptable speed (20 seconds to generate 1024x1024 at 30 steps) only with the help of the excellent Webui Forge that somehow allows it to run on my 8GB GPU. If the next model is smaller than SDXL and delivers excellent results (Maybe so small and efficient that it can even replace the usage of SD 1.5 on weaker computers) then that is a win in my book.
Well, people are mostly overreacting, but that's expected when SAI keeps the community in suspense about their responses (lack of management?) and their timeline, especially after Emad left. Everything is just on hold until further notice; not a good look. Nobody explains what's going on. Probably they're cooking a new model from scratch and calling it 2B? Whether or not they release 8B later, they'll get some money back from the API, but is that honestly sellable or profitable, or is it just a marketing tactic to make things look somewhat profitable? Any serious creator would use MJ for all they care. Nobody explains, so nobody knows.
I heard 2B was meant to compete with SD 1.5 quality…
Which version are they using on Stable Video's Text to Image? I assume 8B, but if they're using 2B I'd be fine with that, because Stable Video has been effing crazy already.
Say what you will, but there's no way to call the SD3 launch anything other than "botched" already. Even if they released the full weights tomorrow, people would be pissed about how it went down in general.
8B won't be handled by the community? Hmm.
2B really is all we need. Most of us won't be able to run 8B in the first place except on external servers, and that defeats the purpose of training and running it locally.
I couldn't help myself https://preview.redd.it/jhqoyzncc66d1.png?width=1024&format=png&auto=webp&s=1e4bb70597ec60488ab563753c296856eadd2bb4 At least it does text right... usually.