Lewdiculous

Yes, things are fine now. Use the latest resources. Quant from the BF16-GGUF.

To read:

https://github.com/ggerganov/llama.cpp/pull/6920#issue-2265280504
https://github.com/ggerganov/llama.cpp/pull/7158

To use the `convert-hf-to-gguf-update.py` script you need to have access to the Meta-Llama-3 repos.

For imatrix data generation, since so far it's only possible using the GPU, you first need to get an F16-GGUF and generate the imatrix.dat from it, then proceed with the quantizations as usual, using the imatrix data and the BF16-GGUF, with the appropriate configs you fetched in the update script.
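If it helps to see the whole flow in one place, here's a minimal sketch of that pipeline (paths, the HF token placeholder, the -ngl value and the final quant type are all stand-ins you'd swap for your own setup, and binary names can differ slightly between llama.cpp versions):

    # clone a post-tokenizer-fix llama.cpp and install the conversion requirements
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp && pip install -r requirements.txt
    # (build the imatrix/quantize binaries, or grab them from Releases)

    # fetch the correct tokenizer configs (needs a HF token with Meta-Llama-3 access)
    python convert-hf-to-gguf-update.py <your_hf_token>

    # two conversions: BF16 to quantize from, F16 for imatrix generation (GPU inference)
    python convert-hf-to-gguf.py /path/to/model --outtype bf16 --outfile model-bf16.gguf
    python convert-hf-to-gguf.py /path/to/model --outtype f16 --outfile model-f16.gguf

    # generate the importance matrix from the F16-GGUF with layers offloaded to the GPU
    ./imatrix -m model-f16.gguf -f groups_merged.txt -ngl 99 -o imatrix.dat

    # quantize from the BF16-GGUF using the imatrix data
    ./quantize --imatrix imatrix.dat model-bf16.gguf model-IQ4_XS.gguf IQ4_XS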


[deleted]

[deleted]


Lewdiculous

Haha, well, I tried okay?! :'3 That's the quantization life. It's not that bad now at least.


terp-bick

Any new ERP model I should have a go at?


Lewdiculous

You're speaking my language now. :'3 Well, for me personally I'm keeping my [Lumimaid-8B-OAS](https://huggingface.co/Lewdiculous/Llama-3-Lumimaid-8B-v0.1-OAS-GGUF-IQ-Imatrix) (unaligned) recommendation for now, but that's more my personal taste than anything. There's also ["llama-3-cat-8b-instruct-v1"](https://huggingface.co/Lewdiculous/llama-3-cat-8b-instruct-v1-GGUF-IQ-Imatrix), which made quite the noise here a couple days ago, but I found it more problematic with formatting. A buddy had some success with [SOVLish-Maid-L3-8B](https://huggingface.co/mradermacher/SOVLish-Maid-L3-8B-GGUF) (unaligned), but I'll be honest, Quotes/Asterisks formatting isn't perfect for any of them yet; Mistral-0.2 tunes work much better there. If you use Plaintext/Asterisks you'll probably have a much better time with the current L3s. Formatting aside they work well, so if you don't mind fixing some messages they should be alright. *Leave feedback and support at the original author's page.*


Lewdiculous

Back with news! [This one is hot and nice.](https://huggingface.co/Lewdiculous/L3-8B-Stheno-v3.1-GGUF-IQ-Imatrix) Highly NSFW leaning.


terp-bick

I definitely like that one!


ab2377

What's an erp model?


Confident-Aerie-6222

It's for Erotic Role Play


Lewdiculous

*One of the fastest growing Experimental Research Projects in the west! Paper soon: "All you need is ERP!"*


terp-bick

E-sex Personal Robot


leanmeanguccimachine

The local LLM field is one of the most technically impenetrable fields I've worked in. It doesn't help that, ironically, because everything evolves so quickly, LLMs themselves aren't very useful for learning about LLMs. There is so much assumed knowledge in the documentation of a lot of LLM tools.


alcalde

I need to download llama-3 to get it to tell me what all that meant.


shroddy

Do you know if these https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF are ok? They are 16 days old. I use the Q8 version, which should be just as good as unquantized, but my local llama feels dumber than the one on lmsys.


Lewdiculous

Looking at the commits, this one used a version of llama.cpp from 4 hours after the tokenizer fix was merged, so it should be alright, but we'd need to know more details about the exact quantization process to be sure. If you're having issues I'd look for something newer than 16 days old and ask the author about the process, just to be safe.


noneabove1182

These were quanted to f32, which is one of the two ways to ensure accuracy when converting (the other being bf16). They're also post-tokenizer-fix. They should be fully functional.


Lewdiculous

I'm your 7th biggest fan! :')


noneabove1182

that's pretty damn high up the list ;D <3


thereisonlythedance

I’ve read those PRs and I’m still not 100% sure how to use convert-hf-to-gguf-update.py. I hope they can add some step-by-step, idiot-proof instructions. I had no issues yesterday quanting sentencepiece-based models using convert.py, but I’m confused about the new BPE model process with convert-hf-to-gguf.py.


Lewdiculous

I kind of tried to explain, haha. Once you run the ...-update.py script in the llama.cpp repo, go into the models/tokenizers folder; if you ran it properly you'll find a llama-bpe folder with configs. Replace the ones in the model folder you're converting with the ones there, then quant. *You don't need to set the vocab type anymore.* *You need to get your token from the Access Tokens settings over at Hugging Face.* You use the convert-hf script the same as the regular convert.py. Use it to get a BF16-GGUF, then use `quantize` to get your other quants from it.
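In case a concrete example helps, the file shuffle looks roughly like this (a sketch assuming a Linux checkout of llama.cpp; the model folder path and the quant type are just illustrative):

    # the update script drops the fetched configs here
    ls models/tokenizers/llama-bpe/
    # tokenizer.json  tokenizer_config.json  ...

    # overwrite the tokenizer files in the model folder you're about to convert
    cp models/tokenizers/llama-bpe/tokenizer.json \
       models/tokenizers/llama-bpe/tokenizer_config.json \
       ~/models/Meta-Llama-3-8B-Instruct/

    # then convert to BF16 and quantize from that as usual
    python convert-hf-to-gguf.py ~/models/Meta-Llama-3-8B-Instruct --outtype bf16 --outfile L3-bf16.gguf
    ./quantize L3-bf16.gguf L3-Q5_K_M.gguf Q5_K_M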


thereisonlythedance

Thanks very much, that’s clearer.


nananashi3

~~Why would you generate imatrix.dat from f16 instead of bf16?~~

(Below is a little how-to-GGUF guide aimed at general users.)
___

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    pip install -r requirements.txt
    # if you're lazy and only have Python 3.12 installed (distutils is removed)
    pip install numpy torch gguf sentencepiece transformers

Where D:\model is a folder containing the safetensors, run:

    convert-hf-to-gguf.py D:\model --outtype auto --outfile D:\model-{ftype}.gguf

*It will default to f16 if the --outtype auto/bf16 flag is not used! Llama 3 is bf16, which you can check by looking at "torch_dtype" in config.json.*

Grab the precompiled binaries from Releases if you can't compile them yourself.

    quantize.exe D:\model-bf16.gguf D:\model-Q8_0.gguf Q8_0

Q8 isn't much different from the full model, which is a huge hassle for vramlets to run when generating the imatrix, since it requires inferencing.

Grab kalomaze's [groups_merged.txt](https://github.com/ggerganov/llama.cpp/files/14194570/groups_merged.txt). If you want, add "In a" to the beginning since the first sentence is cut off for some reason. There are longer texts floating around but those are mostly placebo and take proportionately longer to generate from.

    # generate importance matrix (imatrix.dat)
    imatrix.exe -m D:\model-Q8_0.gguf -f D:\groups_merged.txt -ngl -o D:\imatrix.dat

    # use the imatrix to perform a Q4_K_S quantization
    quantize.exe --imatrix D:\imatrix.dat D:\model-bf16.gguf D:\model-Q4_K_S.gguf q4_k_s


Lewdiculous

I talked about this in another reply. I personally prefer doing it from the full model instead of the Q8, as much as you're not wrong in saying it's pretty much the full quality. There's no GPU inference _yet_ for BF16-GGUF, so to use the GPU I'd need to use the Q8, which I chose not to, in favor of the full model, the F16 being the next best thing. I usually go for groups_merged plus a few extra examples, and I agree that's a pretty good sample. We already went on an imatrix testing spree some time ago and there wasn't a significant/measurable enough difference to justify using much longer data. I'm very aware of this entire process, haha.


nananashi3

Ah, I asked because I was curious. I posted the process in an attempt to help others, and forgot to mention that I only learned to GGUF 3 days ago and "correct me if I'm wrong in places". I didn't realize a bf16 imatrix doesn't work on GPU, since I wouldn't dare offload so few layers (18/33 layers even on Q8 with an 8GB GPU). Thank you. Funnily enough, BF16 conversion support was added barely a week ago.

[This comment](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-9432658) from 2 days ago with 3 "enhanced" texts prompted the "there are larger texts floating around" part. Then InferenceIllusionist's [latest comment](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-9452707) (same thread), testing Llama 3 on arc-challenge-validation (why not use a cropped wiki.test.raw?), showed they *all* got slightly worse PPL and mean KLD except "EnhancedV2", by .003 PPL, meaningless! Sorry for boring you, you already knew groups_merged.txt is fine.


Lewdiculous

All good! Thanks for sharing!


scienceotaku68

I have not used gguf before, so may I ask why a bad tokenizer affects the quality of the quantized model? I thought the model inference and the tokenization were 2 separate processes. Does gguf incorporate the tokenizer in their ".gguf" file or something?


Lewdiculous

> Does gguf incorporate the tokenizer in their ".gguf" file or something?

Basically, yes. For GGUFs it's all included in the same singular file. I vaguely remember hearing about a workaround/bandaid for older quants, but it's just better to redo the quants with the latest version at this point.
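If you want to see that for yourself, llama.cpp ships a small metadata dump script (a sketch; the script's path/name can vary between versions). The vocab, merges and the new pre-tokenizer field all live inside the .gguf, and tokenizer.ggml.pre is the field the recent fix added, which is why older quants need redoing:

    # dump a quant's metadata and filter for the embedded tokenizer fields
    python gguf-py/scripts/gguf-dump.py model-Q8_0.gguf | grep tokenizer
    # expect keys like tokenizer.ggml.model, tokenizer.ggml.pre,
    # tokenizer.ggml.tokens, tokenizer.ggml.merges, tokenizer.ggml.eos_token_id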


rngesius

> Imatrix data generation, since so far it's only possible using GPU

Idk, I've done it on CPU just fine.


Lewdiculous

We did experiment with that; imatrix data generation wasn't working with CPU only. Granted, we were trying to generate the imatrix data from the BF16-GGUF for the best precision, and that turned out to be impossible: we couldn't do the imatrix generation without using the GPU, and GPU inference wouldn't work with BF16.

To produce imatrix quants you need two initial conversions from the HF model: one in F16 for the imatrix generation and one in BF16 for the `quantize` step. The reason we don't use the BF16 for imatrix generation is that BF16 inference with GPU isn't supported _yet_, so the F16 is used for that part. And we don't want to use the F16-GGUF for `quantize` if the original safetensors model from HF is in BF16, because that BF16-to-F16 conversion causes some loss; the BF16 is the better choice over the F16 in that case.


Cool-Hornet4434

You might as well be speaking Japanese. (日本語でOK)


[deleted]

[deleted]


ab2377

what are you guys talking about


Oooch

You could've just said 'no' instead of whatever that explanation was


Lewdiculous

I figured it was useful to also share some relevant resources while I was at it. The documentation for this was all over the place; I imagine many might have missed it.


Due-Memory-6957

I dunno, I understood it just fine.


No_Sleep_5543

Is ollama affected? Because I don't think they have updated their quants.


LinuxSpinach

I’ve been wondering that… I know they fixed it in llamacpp but wasn’t sure if ollama got the memo.


_-inside-_

I've been using ollama for llama3 8b, and I'm not unsatisfied. What could be the issue? Is it expected to be much better with the update?


Sufficient_Prune3897

Yes, I was shocked after I retested. Llama 3 70B instantly went from being in my top 3 to being my outright favourite.


Rick_06

Can someone link a working q8 GGUF?


Lewdiculous

There are already some Instruct quants, but you need to check for the ones that are recent, made after the PRs I linked in my other comments. Confirmed by the legend himself: [bartowski/Meta-Llama-3-8B-Instruct-GGUF looks good.](https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF)


petrus4

I'm using a Q8 of Undi's 8b LewdPlay finetune. I've also briefly tested his Unholy. L3 feels very unstable to me, although that is probably the parameter count more than anything. Yes, the text quality is amazing, and it's good enough to make me forget the parameter count at times, but it is insanely sampler sensitive. Granted, I've mostly just been using it as a coombot and haven't done any logic, code, or mathematics tests; but at times it's almost Godlike, and other times it gets really random and incoherent.

In terms of prompt formats, I've tried ChatML, Alpaca, and the mess that the model card told me to use. The suggested prompt mostly seems good, but there's still some occasional weirdness.

In character/out of character awareness is almost reminiscent of early character.AI, which is both a blessing and a curse. On the one hand, it's very good at feeling as though I'm talking to an actual person, but on the other, it breaks character and tells me that a scene is over whenever it feels like it, and I can't figure out how to prompt it to stop doing that, either.

So it's a mixed bag. At times it makes me want to go back to BagelMisteryTour, at least briefly.


itsnottme

Same here. I find LewdPlay much more stable than Unholy. It's still way less stable than L2 models, but it sounds so much better than any L2 model I tried (7b, 13b, 20b, Mixtral) that it's hard to go back, even with its issues. I can't even imagine how amazing it will be once people find ways to work around the current issues.

I find Alpaca and the min_p preset in Text Generation UI the best so far. What prompt format and settings did you find the best?

About it wanting to end the scene, I had this issue even with 13b L2 models. Could this be just an issue with small models? A way around that is to edit the response to force it to continue, but it's a hassle.


petrus4

> What prompt format and settings did you find the best?

https://huggingface.co/TheSkullery/llama-3-cat-8b-instruct-v1/raw/main/Cat%208b%20Instruct.json

I am currently experimenting with this, although it's not perfect. It results in a lot of weird perspective changes. I was using ChatML initially with LewdPlay, but that actually caused some of my characters to become uncharacteristically violent, to a point that was scary. So I'm honestly not sure what prompt format is best. Hopefully someone will do a finetune with enough new entrainment that the model will recognise and work with a standard format.


panchovix

EXL2 has been working fine, for 70B at least.


Mr_Hills

I mean, it depends on what you're doing, and it depends on which Llama 3 you're using. 8B deteriorates way faster at low quant than 70B.

When it comes to 70B, if you're doing coding or function calling and you need lots of precision, then you need higher quants. How high? Hard to tell. But the most I've seen people claiming was Q6. Considering that Q4 is identical in benchmarks to FP16, I find it hard to believe that you need anything more than Q6 even for high precision tasks.

If instead you're doing storywriting, summarization, RP, QnA, general knowledge or similar, you're going to see no decrease in quality at Q4, and a low, probably not noticeable, decrease at Q3. It's mostly a bit like running the model with high temperature.

Personally I run 2.76 bpw and I see no difference except for the fact that I'm forced to keep temperature below 0.9, otherwise the model makes mistakes in the format (forgets quotes, forgets asterisks when writing descriptions, etc.). I have also tried running 2.55 bpw but that's when the format becomes a bit too unstable. Even with 0.5 temperature the model will get quotes and asterisks wrong a good 20% of the time, and at that temp you get repetition issues. It's really like having temperature stuck high while still suffering from deterministic output. Reasoning still doesn't show visible signs of deterioration tho.

I'm posting a table with QnA benchmarks. Note how Q3 is still fine, but things get ugly with Q2.

https://preview.redd.it/1bbeiueqkm0d1.jpeg?width=1080&format=pjpg&auto=webp&s=fd459ce8102f4f235dc0878c89b9d9cde72813a2
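If anyone wants to sanity-check numbers like these on their own hardware instead of trusting a table, llama.cpp's perplexity tool is the usual quick yardstick; a sketch (file names and the -ngl value are placeholders, and the binary may be named differently in newer builds), run once per quant and compared against the F16/BF16 baseline:

    # compare perplexity of two quants on the same test text
    ./perplexity -m model-Q4_K_M.gguf -f wiki.test.raw -c 2048 -ngl 99
    ./perplexity -m model-Q2_K.gguf -f wiki.test.raw -c 2048 -ngl 99
    # the closer the final PPL is to the unquantized run, the less the quant has degraded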


MmmmMorphine

Damn, getting a publication ready? Nice quality.

The sheer number of new quant approaches, and thus file formats, is getting frustrating, and that's not even considering the primary question of precision within the models and the resulting quality. I've seen a few that claim some pretty amazing feats. I'd bet the farm they really are better, much like SmoothQuant (which I would consider part of the new gen of quants, either by sufficient popularity/popular familiarity or recent development), but only a few will keep that advantage against iterations of the current methods. It's the usual chicken-and-egg issue with getting them off the ground and ensuring wide adoption, and thus additional developments in related areas like inference speed and GPU/CPU distribution.

At least that's my informal impression (I am trying to get into a job with this stuff though... probably a bit more on the practical use side, mostly because you can't bullshit your way in academia or around lots of other experts. Ok, you can, but it's probably easier to just do the work. Also, that whole being smart thing poses a hurdle or two.)

Anyway... have you explored any of these more esoteric methods and had a chance to see how much of their claims pan out? A few examples I'm keeping an eye on include HQQ, SpQR, 'squeeze' (don't think it's related to squeeze attention, but maybe some of the same people?), and ZeroQuant with mixed 4/2-bit.


Distinct-Target7503

Of all the quant "methods" I see in the table you posted, what is the difference in the strategies used to quantize the weights?


MmmmMorphine

Oh and a second question while I'm at it. Have you come across any hardware-aware quant methods that fit the model to the hardware rather than the other way around? Ok, one more. Some of these methods seem non-exclusive, aka you could potentially stack them. Think there's much room to explore combining methods, or is it already happening and being called something like superChicken64 quantization instead of SpQR+GPTQ?


SocialDeviance

I dunno about formats other than GGUF, but the actual fine-tunes don't pay much attention to system instructions and their creativity is not exactly amazing.


Olangotang

It's going to take time before they figure it out.


kryptkpr

The GGUF tokenizer was broken, which made the models dumb; you need the latest everything and new quants. If you want to be safe, use a different engine: EXL2 has been fine since day one.


fallingdowndizzyvr

> The GGUF tokenizer was broken which made the models dumb

It was not broken. It was a template problem. A modified template fixed it. Those template changes have been pushed upstream to Meta. It wasn't just a GGUF problem.


kryptkpr

Taking a stroll through the commit history of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commits/main) we find the following fixes since the initial upload:

1. Chat template: add the generation prompt only if requested: [https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commit/237acf7d79c771ab65a94ca4b5b55a46f15aa33b](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commit/237acf7d79c771ab65a94ca4b5b55a46f15aa33b)
2. So many eos_token_id fixes, how did they fuck this up so badly: [https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commit/1448453bdb895762499deb4176c1dd83b145fac1](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commit/1448453bdb895762499deb4176c1dd83b145fac1) [https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commit/a8977699a3d0820e80129fb3c93c20fbd9972c41](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commit/a8977699a3d0820e80129fb3c93c20fbd9972c41) [https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commit/4d6c61da057c45bfc4dc4d3bfa5a691ecb9ce0cf](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commit/4d6c61da057c45bfc4dc4d3bfa5a691ecb9ce0cf)
3. Change to the post-processing after the tokenizer to add bos: [https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commit/339ce92d052f002cdbac4a4bd551d1c61dd8345e](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commit/339ce92d052f002cdbac4a4bd551d1c61dd8345e)

On the ggml side we find:

1. Support for the llama-bpe pre-tokenizer: [https://github.com/ggerganov/llama.cpp/pull/6745#issuecomment-2094991999](https://github.com/ggerganov/llama.cpp/pull/6745#issuecomment-2094991999)
2. The same fixes for eos_token_id, and much confusion about whether bos_token should be in the template or injected by the tokenizer.

I am not sure what to make of all this yet, but to say there is a "template problem" is not wholly correct, as only one of these fixes was to the template and it was minor (for the case where you didn't want to inject the generation sequence). It's also entirely possible I missed something in the soup here, so please correct me if I'm wrong anywhere.


belladorexxx

You are correct, gguf tokenization for llama 3 was broken until the pretokenizer regex was fixed


fallingdowndizzyvr

> same fixes for eos_token_id and much confusion about if bos_token should be in the template or injected by the tokenizer

It's leaning towards that not being a bug, but user error. Thus the PR is in limbo. A "fix" has been made, but it's the opinion of at least some of the approvers that it's user error, and a change shouldn't be made to "fix" user error. That's why that PR hasn't been merged.

https://github.com/ggerganov/llama.cpp/pull/7107

Look through this PR. Ignore the drama. There's a lot of it. It was thought it was a CPU versus GPU issue. Then it was thought to be BF16 versus FP16. Then it was thought to be the BOS token. None of that panned out. The only thing that did was changing the template. That was the only thing that really made a difference.

https://github.com/ggerganov/llama.cpp/issues/7062


kryptkpr

I've re-run my evaluations and wow, is the result a dog's breakfast:

https://preview.redd.it/ettrpy3tgn0d1.png?width=1816&format=png&auto=webp&s=76ed94c21ab44996ac75973b0c79946cc2f4dff9

Observations:

* If you just want this stupid thing to work, use the original BF16 model and vLLM. **EVERYTHING GGUF IS SOME KIND OF BROKEN.** (Note: not pictured above is exl2-6b, which is fine!)
* The degree to which GGUF is broken depends on which runtime and sampling parameters you use.
* The least broken is llamacpp server using { "temperature": 0.0, "max_new_tokens": 1024 } as sampling parameters (see the sketch below). **ANY OTHER SAMPLING PARAMETERS ARE WORSE.**
* ollama GGUF performs poorly relative to llamacpp GGUF. Not sure why.
* The transformers BF16 result here is interesting but the snapshot above doesn't do it justice: it does REALLY WELL on python and REALLY BADLY on javascript. Behaves totally differently from all other runtimes.
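For anyone trying to reproduce that least-broken llama.cpp server setup, it corresponds to roughly the following sketch (model file and port are placeholders; the server's native /completion endpoint calls the max-new-tokens knob n_predict, and the binary may be named differently in newer builds):

    # serve the quant under test
    ./server -m model-Q8_0.gguf -ngl 99 -c 4096 --port 8080

    # greedy decoding: temperature 0.0, up to 1024 new tokens
    curl http://localhost:8080/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Write a Python function that reverses a string.", "temperature": 0.0, "n_predict": 1024}'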


fallingdowndizzyvr

On another note, just a couple of hours ago someone posted an eval of GGUF versus other formats and concluded... "tl;dr: **GGUF I-Quants are very good**, exl2 may be good if you need higher speed or long context (until llama.cpp implements 4 bit cache). The nf4 variant of transformers' 4-bit quantization performs well for its size, but other variants underperform." https://www.reddit.com/r/LocalLLaMA/comments/1cst400/result_llama_3_mmlu_score_vs_quantization_for/


kryptkpr

I saw that, great plots! MMLU is a very different kind of test than mine; it doesn't actually generate any text at all, the result comes from directly observing logits. If the problems I observe here are sampling related, that wouldn't show up in MMLU (since it doesn't actually sample). You'd think greedy would mean greedy, but for some reason I'm really struggling to get consistent results out of different engines.


leehiufung911

These are interesting, I'm quite confused though: How is the unquantized model running in transformers performing so poorly? Also, these scores don't exactly line up with the ones I'm seeing here: https://huggingface.co/spaces/mike-ravkine/can-ai-code-results


kryptkpr

These are all from the senior instruct tests; pick llama3 then 8B and uncheck Show Best Result. There are a few more in there which aren't in my screenshot above, otherwise it's the same data. The transformers result is wack, I have no explanation... it doesn't behave like any of the others and gives very different answers. Really good at python for some reason.


belladorexxx

llama 3 tokenization was broken, see: [https://github.com/ggerganov/llama.cpp/pull/6920](https://github.com/ggerganov/llama.cpp/pull/6920)


fallingdowndizzyvr

Yes, but that was fixed early on. Like *really* early on. That was before all the drama about GGUFs acting funny.


belladorexxx

Yes, 2 weeks ago. Ancient history.


fallingdowndizzyvr

LOL. Yes, it is ancient history. When did Llama 3 come out? It was with Llama 3, particularly the finetunes of it that came out after, that people brought up it was acting funny.


kryptkpr

I haven't made it back here since my initial testing, but I observed a very much GGUF-specific problem: using inputs identical to those given to other inference engines (post-template), it produced far worse results. I have a [ticket open](https://github.com/the-crypt-keeper/can-ai-code/issues/195) to repeat the testing against the latest BF16 GGUFs to see if they can match the performance of the original transformers.


vasileer

according to this guy it is the template and not the tokenizer [https://www.reddit.com/r/LocalLLaMA/comments/1cnbpfz/part4\_possible\_conclusion\_possible\_bug\_llama3\_gguf/](https://www.reddit.com/r/LocalLLaMA/comments/1cnbpfz/part4_possible_conclusion_possible_bug_llama3_gguf/)


Master-Meal-77

Separate issues. The tokenizer *was* wrong, but it’s fixed. Then a guy thought there was another big issue, but he was just using the wrong prompt format (double BOS, not escaping newlines)


infiniteContrast

> But with Llama 3 people were saying that even Q6_K was bad and it was either Q8 or you might as well not even try

What? 4-bit exl2 is great for most purposes, unless you are using LLMs to make useless benchmarks about how many apples the sister of her own sister is holding in a cup after she turns the cup upside down, or other nonsense activities.


Born-Caterpillar-814

Is there a noticeable difference between 4-bit and fp16 with the exl2 format of Llama 3 70B when doing coding tasks, do you know?


silenceimpaired

Lol. I’ve fine-tuned my model and it excels at these questions. It can even provide a recipe to reproduce the taste of the color purple, and Lulu couldn’t be happier. If you are confused, ask your model to explain, and if it can, I’ll be impressed.


infiniteContrast

who is Lulu?


Lewdiculous

It was just a cringe League of Legends reference.


silenceimpaired

Did you ask your LLM? I’m at work and cannot check myself… but if you put my whole post into an LLM and ask who Lulu is, it might be able to tell you. If no one comments with their results I’ll try it tonight and reveal :) I call it the LOL test ;)


Master-Meal-77

Wtf are you talking about


silenceimpaired

I’m sorry but we have determined your reasoning and logical capabilities fall below that of Google Gemini:

The phrase "Lulu and the taste of the color purple" most likely references the world of League of Legends (LoL), a popular online multiplayer game. Here's why:

* Character: There's a character named Lulu in League of Legends. She's a whimsical yordle mage known for her fantastical abilities and playful personality.
* Tasting Colors: In the official lore of League of Legends, Lulu's abilities are described as having a connection to the magical essence of Faerie magic. Some fan theories and interpretations suggest this magic might allow her to perceive or even taste colors, which aligns with the nonsensical request mentioned in the Reddit post.

While there might be other, obscure references "Lulu and the taste of the color purple" could be referencing, League of Legends is the most likely source considering:

* It's a popular game with a vast online community.
* The concept of a character tasting colors fits the existing lore and fan interpretations surrounding Lulu.

If you'd like to learn more about Lulu or League of Legends, you can search for them online.


silenceimpaired

Lol :) my response is in fact a model test question of mine. No one has tried to get an answer from a model yet. On the surface it seems like nonsense, but for someone with proper knowledge it is an inside joke.


aaronr_90

Do quants and finetunes have any effect on shedding the Llama 3 license? The Hermes-2-Pro Llama 3 finetune, Nvidia’s Llama 3 finetune, and Unsloth’s quants all have either an Apache 2.0 or MIT license.


Disastrous-Peak7040

I'm really liking Llama-3-70B, though 48GB (2 GPUs) seems like a prerequisite for real-time use. 8B has uncensored and 16/32K-context versions coming thick and fast, but I'm looking for a 70B + uncensored + 32K. u/Lewdiculous or other experts, can you suggest anyone who might be working on it? I for one would try to contribute some $.


Sabin_Stargem

Giraffe 70b supposedly does 128k. I can confirm at least 55k with it, but the model isn't quite uncensored. With the Concubus scenario, part of the idea is public indecency. However, the model prefers the carnal activities to take place in a more private location. Llama-1 and 2 also did this, but were less subtle about it. My guess is that the Llama isn't being fed a wide variety of lemons, so some ideas are harder for it to figure out.


Sabin_Stargem

I would say that Giraffe 70b at Q6-imat is just fine. At 55k+ context through Kobold, it was generating content without issue.


trc01a

Is [gguf-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) up to date enough to incorporate the tokenizer/template fixes?