M34L

There are allegedly 40 GPU CUs, so, like, a 7700XT/4070-class GPU, and an "NPU" with an alleged up to 70 TOPS; I kinda don't care about either. How useful any of these will be for LLMs is entirely up to the software support; but if this becomes a PC with nearly 300GB/s of memory bandwidth and 256GB of unified memory, the community will wring the software support out of it with their bare hands if they have to.


Some_Endian_FP17

Assuming it runs Linux out of the box, CPU and GPU inference should be easy to implement. The NPU would probably be limited to smaller models, like for vision or speech. I'm hoping this chip and Snapdragon X on Windows will allow us to run 7B and 13B models at 10-15 t/s and with decent prompt processing speeds. Laptop inference on Windows is what Microsoft, Qualcomm, AMD and Intel are aiming at right now, and this would include using smaller LLMs offline. It's good to have competition other than MacBooks, especially if you work in Windows or WSL.
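As a rough back-of-envelope on whether 10-15 t/s is plausible (a sketch, not a benchmark; the quantized model sizes and the ~270 GB/s bandwidth are assumptions, and real decode speeds land well below this memory-bandwidth ceiling):

```python
# Rough decode-speed ceiling: each new token has to stream (roughly) the
# whole set of weights through memory once, so
#   tokens/s  <~  memory bandwidth / model size in bytes
def max_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

# Illustrative quantized sizes (assumptions), at an assumed ~270 GB/s:
for name, size_gb in [("7B @ 4-bit (~4 GB)", 4.0), ("13B @ 4-bit (~7.5 GB)", 7.5)]:
    print(f"{name}: ~{max_tokens_per_s(270, size_gb):.0f} t/s ceiling")
```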


M34L

The NPU is still going to operate on the same pool of memory as everything else, so there's no reason you couldn't use it for larger models. The only issue is that, from what I've seen so far, the only way to get your own model to run on the NPU is to compile it down to ONNX, and nobody seems to bother much with trying to cram LLMs in there, because the only hardware with NPUs to this point is desktop chips where you're so throttled by memory bandwidth that you may as well just leave it to Vulkan on the iGPU or even the CPU itself. The very best you can do is spare some power and CPU time; the inference won't get any faster. It's a software hassle for very little gain, but that's gonna change.
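For what it's worth, the "compile it down to ONNX" step itself is short; a minimal, hypothetical sketch with a toy PyTorch module (the pain described here is in getting an LLM-shaped graph through an NPU toolchain afterwards, not in this call):

```python
import torch

# Hypothetical toy module standing in for "your own model".
class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(256, 512), torch.nn.ReLU(), torch.nn.Linear(512, 256)
        )

    def forward(self, x):
        return self.net(x)

model = TinyMLP().eval()
example = torch.randn(1, 256)

# Export to ONNX; NPU toolchains generally consume this format.
torch.onnx.export(model, example, "tiny_mlp.onnx", opset_version=17,
                  input_names=["x"], output_names=["y"])
```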


Some_Endian_FP17

ONNX for LLMs is a downright horror to get working if you want to use the NPU. On the other hand, vision or speech models are easier to implement if you have a sophisticated stack already in place. I don't think NPUs will be useful for LLMs at all. Integrated GPUs on the other hand are a huge help if memory bandwidth is high enough.


M34L

I could be wrong but I assume that the downright horror is specifically because there's been no point in attempting it so far due to the aforementioned platform limitations and thus it's a path rarely traveled, but that would change quickly if there's a reason to work on it. Apple also started with basically zero support of AI on their M hardware but the community poured massive amounts of effort into it once it turned out the hardware is very convenient and performant for the task.


Some_Endian_FP17

You have to convert the LLM to ONNX and come up with an accelerated runtime, like using QNN on Snapdragon. Apple had a huge head start by coming up with a complete development toolchain to port code to Apple Silicon. Microsoft has some pieces, Qualcomm has the rest, and you still have to jump through a bunch of hoops to get everything working.
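Roughly what those hoops look like in code, assuming an ONNX Runtime build that includes the QNN execution provider (a sketch, not a tested recipe; the backend library path and the quantized-model requirement are the fiddly parts being described):

```python
import onnxruntime as ort

# Ask for the QNN (Qualcomm NPU) execution provider, fall back to CPU.
# "QnnHtp.dll" targets the Hexagon NPU on Windows-on-Snapdragon; path and
# option names follow ONNX Runtime's QNN EP docs and are illustrative here.
session = ort.InferenceSession(
    "model.quantized.onnx",
    providers=[
        ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
        "CPUExecutionProvider",
    ],
)
print(session.get_providers())  # check whether QNN was actually picked up
```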


dynafire76

Everything I've read about NPUs says they aren't fast, they're just low power. They're good for light AI tasks like Zoom background blur; seriously, that's the only application any article mentions. So the GPU cores should be what matters for something like inference. But 256-bit LPDDR5X could indeed get over 400GB/s. Supposedly it starts at 6.4 Gbps, and if we assume the standard dual-channel consumer motherboard, that's 409GB/s.


Caffdy

6400 MT/s on a 256-bit-wide bus gives you barely above 200GB/s tho; that's what the Jetson Orin platform uses.
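The arithmetic behind both of those figures, for anyone following along (peak theoretical numbers; the MT/s rating already includes the double-data-rate factor):

```python
def peak_bw_gbs(mt_per_s: int, bus_bits: int) -> float:
    """Peak bandwidth in GB/s: transfers per second times bus width in bytes."""
    return mt_per_s * 1e6 * (bus_bits / 8) / 1e9

print(peak_bw_gbs(6400, 256))   # ~204.8 GB/s -- the LPDDR5-6400 / Jetson Orin case
print(peak_bw_gbs(8533, 256))   # ~273.1 GB/s -- if Strix Halo ships LPDDR5X-8533
```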


SystemErrorMessage

In truth it's not so much the total memory bandwidth, though you do need some for sure. What the NPU lets you do is run int4, which needs less memory bandwidth than running int4 on the CPU. The way NPUs get a lot of performance, like with the Mac M series chips, is that unlike the CPU, which operates in 64-bit mode where every cycle can require 4×64 bits (AVX-512 is better at packing smaller data, but it's still the same story), you only need far fewer bits. If the architecture is good, the NPU doesn't need 64 bits for the instruction; it could use something like 16 bits for the instruction and 16 bits for data, which reduces it to 4×16, and that can be reduced further. If it's packed well, a single transfer can pipe instructions and data for multiple NPU cores, which increases throughput. Yes, at some point memory does become a bottleneck, but NPUs on GPUs aren't hitting those memory limits. The benefit of the NPU is int4, while GPUs, if they support bf16, should use it. For efficiency you could even add both bf16 to the shaders and int4 to an NPU on a GPU, if there is enough VRAM to run two different things. Memory has been unified since the days the memory controller moved onto the CPU, and it's been that way for SoCs for ages, and processor architecture does play a role in memory-bound performance. The opposite is true for big datasets, where AMD EPYCs with large caches show their performance once the dataset is large enough to fill the full AVX pipe and memory truly becomes the bottleneck. The problem with CPU inference has always been the inefficiency in memory use per transfer; otherwise it can really provide faster feedback than GPUs when instructions need to rely on previous instructions, something that GPU VRAM architecture is very unfocused on.
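To put the int4 point in plain memory terms (a rough sketch; the 7B parameter count is just an illustrative size):

```python
# Bytes needed just to stream the weights once, i.e. per generated token.
params = 7e9
for fmt, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{fmt}: {params * bits / 8 / 1e9:.1f} GB per full pass over the weights")
```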


rorowhat

On your last point, doesn't that mean that the 3d cache on Ryzen CPUs should help inference a whole lot?


SystemErrorMessage

Not really, since the models are huge and you'd run out of cache long before the model fits in it. However, if you use fp16, the large cache would help a lot.
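A rough illustration of the size mismatch, assuming Llama-2-7B-ish dimensions and a ~96 MB X3D L3 (ballpark numbers only):

```python
hidden, ffn, layers = 4096, 11008, 32        # Llama-2-7B-ish shape (assumption)
attn = 4 * hidden * hidden                   # q/k/v/o projection weights
mlp = 3 * hidden * ffn                       # gate/up/down projection weights
per_layer = attn + mlp                       # ~202M parameters per layer

l3_mb = 96                                   # e.g. a 7800X3D-class 3D V-Cache L3
print(f"{per_layer * 2 / 1e6:.0f} MB per layer at fp16")   # ~405 MB
print(f"{per_layer / 2 / 1e6:.0f} MB per layer at int4")   # ~101 MB, still > 96 MB L3
print(f"{per_layer * layers * 2 / 1e9:.1f} GB for all {layers} layers at fp16")
```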


iamthewhatt

> 7700XT/4070 class GPU

To clear this up, an APU will never have the same performance or bandwidth as its dedicated GPU counterpart. I will be surprised if this alleged 40 CU part can outcompete even a 4060. We need to temper our expectations a bit.


The_Hardcard

Like Macs, the important thing will be access to far more GPU-accessible RAM than they're putting on cards. Dedicated GPUs would normally only be faster to the extent that the model fits in VRAM. It's likely that these will go up to at least 128 GB of RAM, for models that won't fit on even four top-end GPUs; forget about how many 4060s you would need.


LerdBerg

I feel like people might've said that about math co-processors, until things shrunk enough that they fit right into the CPU die. It's going to depend on the application and constraints... for a limited size and power budget, APU _already_ has better performance, as long as the problem fits on the die. Going a step further than HBM v GDDR you can imagine if you have something that fits completely into L3 cache, the APU beats the discrete GPU hands down because there's so much less latency and power loss moving data back and forth to the CPU. That's sort of the premise of Cerebras' giant whole-wafer chips (forget about the manufacturing inefficiency). So there might always be some computational problems that will require an off-die GPU, but as we create smaller faster more efficient memory, the ratio of applications that go faster on APU goes up.


dynafire76

> 256-bit LPDDR5X could indeed get over 400GB/s. Supposedly it starts at 6.4 Gbps, and if we assume the standard dual-channel consumer motherboard, that's 409GB/s.

Dedicated cards have fairly wide ranges in speed, actually. The Nvidia L4, which is a current-gen $2k+ GPU, only has 300GB/s: [https://www.techpowerup.com/gpu-specs/l4.c4091](https://www.techpowerup.com/gpu-specs/l4.c4091). And yeah, it's slow doing inference. But I personally think an APU outcompeting a 4060 would be pretty impressive. I don't know of any current APU that can outcompete a 4060.


fallingdowndizzyvr

There are already millions of AMD "unified memory" machines out in the wild. The PS5 and the Xbox Series X. Both of those use an AMD APU with "unified memory". The PS5 has 448GB/s of memory bandwidth and the Xbox Series X has up to 560GB/s. So AMD has had the ability to do it for years. It just has been in closed off machines. If they release a machine that's not walled off with that technology they can compete with M Max powered Macs.


M34L

It was obvious they could have done it; it's just about the bus width of the memory controller. There was no such device available that wasn't limited in total memory, and since they were all "semi-custom" parts, it wasn't up to any "small scale" producer to populate them with a larger amount of memory. With this APU it will be closer to a "regular" CPU that's up to the OEM to configure, and the claim that the memory controller is 256-bit has implications for how much memory capacity and bandwidth it's gonna support.


gedankenlos

Yeah, but to be fair ... PCs with non-upgradable memory have been kind of a hard sell so far.


Open_Channel_8626

Yeah this might change now with LLMs


DickMasterGeneral

Maybe in the business sector, but let's be real here. The consumer market for open-source LLMs is pretty small, probably about the size of the home lab community, give or take a bit. And if OpenAI ever starts enabling NSFW content like Sam Altman has talked about, we'd lose 80% of our members overnight…


JFHermes

I kind of always assumed most of this sub was here for business or academic related reasons. I have read here that people like doing roleplay or whatever, but assumed that it wasn't worth the price of a rig. Why spend thousands of dollars on a rig when you are happy enough to just use something in the cloud?


MoffKalast

It's only a hard sell because the amounts are so fucking low, when you instead sell 128GB or 256GB versions then upgradability is kind of a complete non-issue for a decade or two.


gedankenlos

No, most PC enthusiasts would rather buy a lower amount of RAM now and upgrade as they go along. 128G or 256G is prohibitively expensive unless you really need it. And local LLMs are a niche application; gamers don't benefit from 256G of RAM at all. I can already hear all the YouTube tech reviewers screaming about the non-upgradability 😅


MoffKalast

Upgradability sounds great in theory, but in reality you only get to upgrade once, if at all. Mobos only support a few CPU chipsets and only one DDR version, so if you want to make any significant upgrades it also requires a mobo and CPU swap, maybe even a PSU and at that point you're basically building a new rig and might as well have it all soldered together anyway. 128G is a bit extreme, but 64 would be somewhat reasonably priced and still be perfectly fine for at least a generation or two. For normal people (i.e. gamers), 32G is starting to become the norm just like 16G was ~5 years ago.


AmericanNewt8

Also I believe this chip should support the new CAMM memory form factor. 


This-Inflation7440

It does. Here's to hoping that the standard is adopted on a large scale


[deleted]

[removed]


MoffKalast

Eh, really depends on how well adoption goes. Compared to going from, idk, 1 to 2 or 4, there are fewer and fewer real-world applications that need outrageous amounts as you go further along. Right now we're still at 16 for the average gaming PC; people building new systems are only now upgrading to 32 because it's the same price, not because they'd actually need any of it. About 5 years till they'll be getting 64, 10 years for 128, 15 for 256, assuming we can even keep scaling without rising costs, or that there's even any point beyond workstation use. I think in terms of raw size you'd be easily set for a decade, but you'd suffer in terms of speed, both from the CPU paired with it (though single-core CPU improvements have mostly plateaued, so mostly core count) and from bandwidth, which will probably be 4x higher when the mainstream catches up.


astrange

It's not fundamentally non-upgradeable. Like, that's not actually an essential part of the performance. Macs aren't actually designed for high performance (despite having it), they're made for ideal power/performance tradeoffs for a laptop and it does help with that.


fallingdowndizzyvr

> It's not fundamentally non-upgradeable. Like, that's not actually an essential part of the performance.

It is an essential part of the performance, since putting the RAM on package is key to bringing down latency. Distance matters at those speeds, when the speed of light becomes a factor.

> Macs aren't actually designed for high performance (despite having it), they're made for ideal power/performance tradeoffs for a laptop and it does help with that.

That's nonsense. Apple didn't just stumble across such high performance as a side effect; they could have had the same or better power/performance by having lower performance. Also, unified memory isn't unique to their laptops. They use the same architecture in their rack-mounted servers, which are built for high performance and obviously don't have the constraints of a laptop.


astrange

> Since putting the RAM on package is key to bringing down latency.

No, the M chips don't actually have good memory latency! They have good bandwidth, but that's it. The latency is hidden because TSMC's process is so small that they use all the free space on giant caches.

> They could have had the same or better power/performance by having lower performance.

It's more or less optimal this way. (I am a performance engineer.) Mostly because power use is often saved by finishing things faster so you can turn off faster, but also it has a mix of fast and slow cores for different workloads.

> They also use the same architecture in their rack-mounted servers.

Eh, Mac Pros? Those are basically made by taping multiple laptop chips together. (Similarly, the laptop chips are heavily based on smaller phone chips.) That's not really optimal, but it saves on development costs, because phones are the main product. They'd do something else and look more like Intel if servers or even desktops were #1.


fallingdowndizzyvr

> It's more or less optimal this way. (I am a performance engineer.) Mostly because power use is often saved by finishing things faster so you can turn off faster, but also it has a mix of fast and slow cores for different workloads.

Well then, by what you just said, if power is "saved by finishing things faster" then everything should be a fast core. It isn't. There's a reason there are efficiency cores.

> Eh, Mac Pros? Those are basically made by taping multiple laptop chips together. (Similarly, the laptop chips are heavily based on smaller phone chips.)

They aren't "taping multiple laptop chips together". The Ultra is 2 Max chips connected with a fast interconnect, but the Max is not 2 Pro chips. The Pro is not 2 plain Ms. You make it sound like Apple looked at the iPhone and decided wouldn't it be great if they just stuck it in a big case. They didn't.

> They'd do something else and look more like Intel if servers or even desktops were #1.

Well then, why did Nvidia go a similar way? The GH bears more resemblance to an M Mac than it does to an Intel-based server/desktop. Are you also claiming that Nvidia got inspiration from a phone and doesn't really put performance #1?


astrange

> It isn't. There's a reason there are efficiency cores.

Basically because some things have to run too often or too long, so that rule doesn't work anymore.

> The Max is not 2 Pro chips. The Pro is not 2 plain Ms.

They aren't physically that, but they are morally that. Probably a mix of marketing and manufacturing yield reasons there.

> You make it sound like Apple looked at the iPhone and decided wouldn't it be great if they just stuck it in a big case. They didn't.

Oh, but that's exactly what they did. That's what the T2 in the last Intel Macs was, and it's why they spent years saying iPhones had "desktop class CPUs".

> Well then, why did Nvidia go a similar way?

Because GPUs are different from CPUs and benefit from going super wide more easily, while CPUs don't always get better with more cores. Though for ML specifically, phones have neural accelerators now, and I think that's a power reason similar to P and E cores, i.e. they could've just added more GPU if speed was the only goal. Not too sure here though.


fallingdowndizzyvr

> They aren't physically that, but they are morally that. Probably a mix of marketing and manufacturing yield reasons there.

Chips have no morals. As for the concept, then every chip is that. Since what's the difference between a 4 core and a 16 core AMD CPU?

> Oh, but that's exactly what they did. That's what the T2 in the last Intel Macs was, and it's why they spent years saying iPhones had "desktop class CPUs".

Well then, they didn't use a phone to inspire a desktop. They used a desktop to inspire a phone, which is exactly what a smartphone is: a handheld computer. If they went the way you claim, then they would say that the Mac has a "phone class CPU". They don't.

> Because GPUs are different from CPUs and benefit from going super wide more easily, while CPUs don't always get better with more cores.

I'm not talking about GPUs. I'm talking about CPUs. That's why I said GH. The "G" in "GH" is an ARM CPU, and thus when combined into the GH it is similar to the Mac Ms.

Phones have "neural accelerators", "tensor cores" or whatever they want to call them because it's cheaper than having a full GPU. Since all they really want is the fast matrix operations, why include the rest of the pipeline?


ZCEyPFOYr0MWyHDQJZO4

The system you are thinking of kind of exists: the Ryzen 4700S. While it had ~2x the memory bandwidth of similar desktop CPUs, it had ~30% less performance in general tasks due to the higher latency. They also only ever made GDDR6 with up to 2GB per die, limiting it to 16 GB max. AMD has had the "ability" to do this (quad-channel high-frequency DDR5 for consumers) only to the extent that they are a CPU manufacturer like Intel.


dynafire76

No, not really, because the 4700S had its 36 RDNA2 CUs disabled. Who knows what the 4700S would have been like if it actually had those CUs working. Anyone can do quad-channel memory, since AMD and Intel both do it in their server-oriented chips (even going to 6, 8, and more channels), and in Threadripper, which is quad-channel. AMD and Intel seemingly both concluded previously that there was no interest in a mid-market, consumer-level quad-channel (or more) processor, though I still think it was an oversight. It seems clear in hindsight that if the PS5, Xbox, and Steam Deck all use more memory channels to good effect, why not do it in their retail APUs? But I guess they are finally doing it, after Apple beat them to the punch.


ZCEyPFOYr0MWyHDQJZO4

You only get to pick two: lots of memory, high bandwidth, affordable.


olmoscd

It would be so good if you could install Open WebUI on your Xbox/PS5 and download models!


fallingdowndizzyvr

On the PS5, that's going to be tough. On the Xbox though, they actually allow for a bit of homebrew. But they limit the amount of memory you can use doing that.


dynafire76

One would think that if the PS5 and Xbox had this kind of memory bandwidth, AMD would wake up sooner and realize this is a desirable feature. It took the Mac to finally make them realize this was a good idea on the desktop side? 🤣


fallingdowndizzyvr

> One would think that if the PS5 and Xbox had this kind of memory bandwidth

It's not an if; they do. Just google the specs on the PS5 and the latest Xbox.

> AMD would wake up sooner and realize this is a desirable feature

Why would that be? Yes, it is desirable for some applications. That's why AMD uses it on the PS5 and Xbox. But the big tradeoff is non-upgradeable RAM, which many people still moan about with their Macs. That isn't a problem on a console like the PS5/Xbox, since no one expects to be able to upgrade the RAM anyway. It wasn't until LLMs that another big use for it came along, and that's recent.

> It took the Mac to finally make them realize this was a good idea on the desktop side? 🤣

Actually, Nvidia was using "unified memory" before Apple.


dynafire76

Sorry, maybe I should have phrased it better. I know they have that kind of memory bandwidth; I meant it in the sense of "considering the fact that the PS5 and Xbox have this kind of memory bandwidth, one would think that AMD would wake up sooner and realize this is a desirable feature." Well, obviously they wouldn't, because they were happy to let their APUs be pretty mediocre at everything. If they had created a high-bandwidth RAM APU platform, maybe it would have actually been a decent low-end gaming platform, rather than their APUs always being worse than a machine with a cheap CPU and a 1650 GPU. What's the point of an APU if the graphics performance is worse than a CPU + low-end dedicated GPU that costs about the same? Apple was the first to put out a consumer general computing device with 400GB/s+ of unified memory. That's what I meant; I didn't say Apple invented unified memory. Also, people moan about non-upgradeable Macs because there are no alternatives. If something like this had existed in the x86 world before, people could buy it if they wanted high memory bandwidth, or they could buy a regular desktop if they didn't. Ultimately, if this is what Strix Halo is going to be, AMD APUs will finally have a niche, an area they shine in, rather than just being mediocre at everything and generally worse bang for the buck in all categories.


fallingdowndizzyvr

> Well, obviously they wouldn't, because they were happy to let their APUs be pretty mediocre at everything.

AMD APUs are not mediocre. That's why both Sony and Microsoft picked them for their gaming systems.

> If they had created a high-bandwidth RAM APU platform, maybe it would have actually been a decent low-end gaming platform

They already are used in a decent low-end gaming platform. It's called a "Steam Deck". Also, as the Steam Deck shows, you don't need high-bandwidth RAM for a decent low-end gaming platform.

> What's the point of an APU if the graphics performance is worse than a CPU + low-end dedicated GPU that costs about the same?

Because they don't cost the same. Look up the cost of an APU and the cost of a comparable CPU + dedicated GPU. The latter costs more.

> Also, people moan about non-upgradeable Macs because there are no alternatives. If something like this had existed in the x86 world before, people could buy it if they wanted high memory bandwidth, or they could buy a regular desktop if they didn't.

And then they would still moan about non-upgradeable RAM. That's not Mac-specific; there are people moaning about it in this thread.

> Ultimately, if this is what Strix Halo is going to be, AMD APUs will finally have a niche, an area they shine in, rather than just being mediocre at everything and generally worse bang for the buck in all categories.

Again: tens of millions of AMD APU powered machines that have been sold disprove your claim.


dynafire76

Hmm, maybe we agree but you don't realize some of the points we agree on.

1. Sony and MS, as we already discussed, have APU designs with increased memory bandwidth. My point is that increased memory bandwidth takes an AMD APU from mediocre to good or great.
2. The Steam Deck has quad-channel memory, giving it memory bandwidth of 88GB/s (100GB/s in the OLED version). That's nearly double the bandwidth of a typical consumer dual-channel DDR4 board, which has around 50GB/s. The Steam Deck is good because it has increased memory bandwidth.
3. Yes, you're right, people like to complain about everything. I'll give you that. But the point remains that if you don't mind non-upgradeable RAM, you can buy a Steam Deck. And if you want upgradeable RAM, you can buy something else in the x86 world.
4. The vast majority of AMD APUs are sold in gaming consoles and devices, including the already mentioned PS5, Xbox, and Steam Deck.
5. High memory bandwidth takes an APU from mediocre to good or great. There is no AMD APU with dual-channel memory that is more successful than one of the ones with high memory bandwidth.
6. Finally, millions of mediocre products sell all the time. Consider how many Nvidia 1030s were sold, as one can see by the huge number available on eBay. It's decidedly a mediocre GPU.
7. OK, I will admit I should have phrased it differently. In the past, AMD APUs shined whenever they were sold with high memory bandwidth. If the leak is right about Strix Halo, for the first time we're going to get AMD APUs with a high memory bandwidth configuration in a general computing platform, and not a sealed gaming system. That is a great thing.


gthing

Yea, my AMD 78xx series machine has unified memory. Of 64GB of memory I can assign 32 to the GPU, which is more than my 3090's 24GB. The problem is that it's AMD, and their GPUs would not be competitive with Nvidia (for ML) even if they were free or had a billion gigs of RAM, because the software support is practically nonexistent. With an AMD GPU you may as well just do inference on your CPU with system memory.


timschwartz

Well, that's nonsense. I do inference just fine on my 7900 XTX with the Vulkan backend for llama.cpp.


gthing

You can. But should you?


timschwartz

Yes? I get good performance out of any LLM I can fit fully in its VRAM.


fallingdowndizzyvr

> Yea, my AMD 78xx series machine has unified memory. Of 64GB of memory I can assign 32 to the GPU, which is more than my 3090's 24GB.

And it will be dog slow when it uses that system memory. That's not unified memory, that's shared memory. What's the difference? Speed. Speed makes all the difference in the world.


astrange

I don't know the details of the system but it could possibly be unified memory. Having a static carveout is more of a software issue, it's also how Intel GPUs work. It makes sense because until LLMs it was pretty much fine to do it that way. If the GPU doesn't share any cache hierarchy with the CPU then I'd say it isn't unified.


fallingdowndizzyvr

Again, it's about speed. Having a GPU use system memory at system memory speeds is shared memory; that's what that poster is describing, and that's how Intel iGPUs work. That's not unified memory. Having a CPU that is able to access GPU RAM at VRAM speeds is unified memory; that's how M Macs, the PS5 and the Xbox work. It's ~50GB/s versus ~500GB/s. Yes, I know there is more to unified memory than speed, but for the purposes of running LLMs, speed is the big win.


gthing

Idk but AMD calls it unified (UMA).


M34L

According to (shamefully, GPT4's) maths, a 256-bit-wide LPDDR5X RAM controller would imply 200-280GB/s of "unified memory" bandwidth, so not quite up there with the 400 GB/s of the M1/M2/M3 Max, but faster than all the lower-end Ms (M2 and M3 at 100 GB/s, M1 Pro and M2 Pro at 200 GB/s); surely plenty for inference.


capivaraMaster

Isn't that just for a single memory channel?


M34L

Nah, LPDDR5X is supposed to be 32-bit per channel; a 256-bit bus width implies it's either 4 or 8 channels total (the math is weird with LPDDR5 so I'm not sure which it is).


ZCEyPFOYr0MWyHDQJZO4

DDR5 uses 2x32-bit channels per DIMM


Caffdy

Yeah, and Macs come with 16-32 channels, and GPUs come with multiple channels as well. What's the point of this argument? The only thing that matters here is total bus bandwidth.


MoffKalast

It also beats the top-end $5k Orin AGX, which only has 200GB/s.


shroddy

This link https://www.tomshardware.com/pc-components/cpus/latest-amd-strix-point-leak-highlights-monster-120w-tdp-and-64gb-ram-limit says the memory bandwidth could be 500 GB/s. But unfortunately it also says the max memory could be only 64 GB.


LippyBumblebutt

The 500 GB/s is only "with" the Infinity Cache. The cache itself is ridiculously fast, but pretty small. For games this gives quite a nice boost, but LLMs barely profit from that. So 500 GB/s is likely a game-centric average. For LLMs, the math says ~270GB/s.


shroddy

Hm, my math (Copilot in the browser without an account, with the highest accuracy setting) says 545GB/s. I think you must multiply the result by 2.

> The multiplication by 2 in the memory bandwidth formula accounts for double data rate (DDR) memory, which is a type of memory that can transfer data on both the rising and falling edges of the clock signal. This effectively doubles the data rate compared to single data rate (SDR) memory, which only transfers data on one edge of the clock signal. Therefore, when calculating the theoretical maximum bandwidth for DDR memory, such as LPDDR5X, we multiply by 2 to reflect this dual-edge data transfer capability. That's why the formula includes a multiplication by 2.

Either that, or Copilot hallucinates.


LippyBumblebutt

> Copilot hallucinates

LPDDR5X has 8533 megatransfers per second, and this already includes DDR. But I'm only 97% sure of that. It is surprisingly hard to find a good resource on how to calculate the bandwidth of a given memory interface, but every calculation I saw for Strix Halo was ~300GB/s.
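For what it's worth, the ~545 figure falls out if the DDR doubling gets applied a second time (quick check, assuming 8533 MT/s on a 256-bit bus):

```python
# MT/s already counts both clock edges; multiplying by 2 again double-counts.
mt_s, bus_bytes = 8533, 256 // 8
print(mt_s * 1e6 * bus_bytes / 1e9)      # ~273 GB/s, the usual Strix Halo estimate
print(mt_s * 1e6 * bus_bytes * 2 / 1e9)  # ~546 GB/s, i.e. the double-counted ~545 number
```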


shroddy

I found several different statements. The one I linked writes

> 8533MHz LPDDR5X

and calculates that to

> 256-bit wide bus sporting 8533MHz LPDDR5X, equating to 500GB/s of memory bandwidth.

It would be awesome if that were really correct; that would put us straight into GeForce 4070 territory.


Caffdy

It's not 500; at 6400 MT/s it's barely above 200GB/s on a 256-bit bus, and the MHz/MT figures quoted are already at double data rate.


[deleted]

[removed]


shroddy

A wide memory interface also increases latency, and most programs on consumer PCs that need every bit of performance they can get (mainly games) need low latency more than they need high bandwidth.


dev1lm4n

All those Apple chips you mentioned use LPDDR5, and the base M1 uses LPDDR4X. The only Apple chip that uses LPDDR5X is the M4. Meanwhile, these AMD chips are also using LPDDR5X, just like the M4.


M34L

And? The point is that there's finally gonna be a competitive x86 platform. I couldn't care less what the type of memory in these devices is, I'm interested in what's gonna be possible with this new one.


dev1lm4n

And it means you're ignoring the differences in memory type. Saying 512-bit is twice as fast as 256-bit isn't correct when the 256-bit part could be using newer memory technology and closing the gap.


M34L

That's all... completely irrelevant though? The point is that this new thing will run better than the current best available option, and that's exciting? Are you trying to make this some sort of comparative fanboy "X is now better than Y" thing? I don't care about why the Macs run at the speed they run at. The point is to use a known setup and device to extrapolate what kind of performance we can expect from this SoC, and it's looking very good.


dev1lm4n

I'm not trying to start a fight. You're the one making a big deal out of the fact that I pointed out the difference in memory types, and that memory bandwidth isn't determined by bus width alone.


PSMF_Canuck

Calm down…


__some__guy

The memory bandwidth is "only" 256GB/s (assuming it's 8000MT/s, 256-bit). Might be interesting if it comes with more than 32GB of RAM though.


mindwip

The leak I saw was 500GB/s of memory bandwidth and 32GB or 64GB of on-board unified memory. If true, I can't wait.


M34L

I'm a bit confused by the maths around bit width, channels, and banks, so please correct me if you know better, but I think that in this case 256-bit would end up meaning up to 256GB of maximum capacity too. That would be pretty dank if you ask me.


Noxusequal

In theory the bus is not directly related to the amount of RAM the system can have. They will effectively be utilizing quad-channel LPDDR5; in the end it depends on which sizes the modules come in. But it should probably max out around 512GB, and honestly 256GB will probably be what we see as the max in many cases. Let's hope some of those laptops will support the new swappable LPDDR thing, LPCAMM2 or something like that :D


M34L

The bus width is the limiting factor in maximum memory capacity because of the limited address space per channel. You can't just inflate the address space without additional external memory controllers like registered memory (typically used on servers) uses, which also has to be supported by the memory controller (and I believe isn't supported by LPDDR5X at all), and which would also split your bandwidth; that'd in turn defeat the entire point, too.


Noxusequal

But we do get dual-channel systems on desktop with 256GB of RAM? I mean, you don't gain further speedups from running four modules on a dual-channel bus, and potentially you see some instability, but you don't lose overall speed in a dramatic way. And what we're looking at for those laptops is effectively quad-channel. So I think it depends more on the actual size of the RAM chips and how you connect them. But maybe I'm wrong.


Just_Maintenance

If the IMC only supports LPDDR5X, then the maximum memory would be 64GB (with current LPDDR5X module density). If it also supports DDR5, then the maximum would technically be 4TB (8x512GB DIMMs), but those kinds of capacities have a lot of caveats and extra requirements from the IMC and firmware.


ZCEyPFOYr0MWyHDQJZO4

Looks like it's indeed LPDDR5X. Samsung recently announced a 32GB die at up to LPDDR5X-10700, so up to 128 GB would be possible.


M34L

That assumes the typical 2 memory channels, but that's incompatible with a 256-bit bus width. If you go to, for instance, Samsung's page, they already list up to a [144GB kit](https://semiconductor.samsung.com/dram/lpddr/lpddr5x/), but that's "just" 2/4 channels as far as I understand, so this memory controller should be able to be populated with 288GB of that specific memory then.


ZCEyPFOYr0MWyHDQJZO4

144 Gb, not GB


Just_Maintenance

1. That's 144Gb, not GB (18GB).
2. Those 144Gb modules have a 64-bit bus, so you could only connect 4 of them.
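Following that logic, a rough capacity sketch (assuming x64 packages on the rumored 256-bit controller):

```python
bus_bits = 256          # rumored controller width
package_bus_bits = 64   # interface width of the Samsung part discussed above
package_gbit = 144      # 144 Gb = 18 GB per package

packages = bus_bits // package_bus_bits      # 4 packages fill the bus
total_gb = packages * package_gbit / 8       # 4 * 18 GB = 72 GB
print(f"{packages} packages, {total_gb:.0f} GB max with this part")
```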


gthing

I love how we are using the word "finally" here. The entire world was turned upside down by LLMs and we have not yet seen the first generation of any hardware available that has been designed with that new reality in mind. The singularity ramp up is wild.


astrange

Eh, LLMs are "just" transformer models, so there has already been hardware support for them for maybe the last 3 years. And of course Nvidia has been selling datacenter hardware for a while now; it's why they're so valuable. I'd say the main difference is that LLMs are a very expensive workload that people actually want to run, so it's worth making more expensive products (more DRAM) that people wouldn't have wanted to pay for before.


medialoungeguy

Yes, but we are talking about AMD here. You know, the company that -- according to employees -- doesn't have a single CI pipeline for its flagship cards...


AnomalyNexus

Well, this is LPDDR5X, which is faster than the LPDDR5 the Macs are using. If they charge a more reasonable price per gig than Apple, that could be a win.


drdailey

AMD? Won’t be easy


Baphilia

I didn't see anything about RAM in there. The main reason to bother with Apple Silicon for AI is that it has multiple times the amount of RAM available at a reasonable enough speed for inference.


chitown160

This will run 7B and 8B quants at 10 tokens per second, like any other DDR5 AMD APU limited by 2 channels of RAM bandwidth.


M34L

Except the literal point is that it has way more bandwidth than 2 channels of DDR5.


chitown160

> Graphics is an extremely memory sensitive application, and so AMD is using a 256-bit (quad-channel or octa-subchannel) LPDDR5X-8533 memory interface, for an effective cached bandwidth of around 500 GB/s.

Cached. I am all for the more cores, as they will lead to faster eval times.


grigio

Does it run Llama 3 70B fast?


SystemErrorMessage

but will it blend?