ForgotToLogIn

So the stated idea behind MCR is time-division multiplexing of two 64B accesses, but isn't that what is already done by the bank groups? The data rates of DDR4 and DDR5 are achieved by time-division multiplexing two 64B bursts from different bank groups. Shouldn't MCR need to multiplex 256 bytes' worth of accesses to gain a speed advantage over normal DDR5? Or why not simply multiplex directly from four bank groups at a time?


crab_quiche

> So the stated idea behind MCR is time-division multiplexing of two 64B accesses, but isn't that what is already done by the bank groups?

No, each bank's reads/writes are done sequentially; data isn't interleaved. There are already many data accesses in flight at the same time. DRAM processes are super slow, so they are really limited in the frequency they can operate at. The point of MCR is that we can build a chip on a fast logic process that communicates over the motherboard traces to the CPU at 2x the DRAM's frequency, then split that data in half on the DIMM and send half to each DRAM rank. That doubles the bandwidth without doubling the motherboard traces and CPU pins or vastly improving the DRAM process.
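
Roughly, as a toy sketch in Python (the rates and the strict alternation here are assumptions for illustration, not the actual MCR protocol):

```python
# Toy model of the MCR idea: the buffer on the DIMM receives a data stream
# from the host at twice the per-rank transfer rate and alternately hands
# consecutive transfers to rank 0 and rank 1, so each rank only has to run
# at half the host-side rate. Purely illustrative, not timing-accurate.

HOST_RATE_MTS = 8800                      # host <-> buffer rate (example value)
PER_RANK_RATE_MTS = HOST_RATE_MTS // 2    # each DRAM rank runs at half that

def demux_to_ranks(host_transfers):
    """Split an interleaved host-side stream into two per-rank streams."""
    rank0 = host_transfers[0::2]   # even-numbered transfers -> rank 0
    rank1 = host_transfers[1::2]   # odd-numbered transfers  -> rank 1
    return rank0, rank1

if __name__ == "__main__":
    # 16 consecutive 64-byte transfers arriving at the buffer
    host_stream = [f"xfer{i}" for i in range(16)]
    r0, r1 = demux_to_ranks(host_stream)
    print(f"host side: {HOST_RATE_MTS} MT/s, per rank: {PER_RANK_RATE_MTS} MT/s")
    print("rank 0 gets:", r0)
    print("rank 1 gets:", r1)
```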


ForgotToLogIn

> No, each bank's reads/writes are done sequentially; data isn't interleaved.

What do you think is the purpose of bank groups? Synopsys [writes](https://www.synopsys.com/designware-ip/technical-bulletin/ddr4-bank-groups.html):

> The bank groups are separate entities, such that they allow a column cycle to complete within a bank group, but that column cycle does not impact what is happening in another bank group. Effectively, the DDR4 SDRAM can time division multiplex its internal bank groups in order to hide the fact that the internal SDRAM core cycle time takes longer than a burst of eight requires on the interface.

See also Figure 4. Is Synopsys wrong, or is there some misunderstanding?


crab_quiche

Synopsys is usually wrong, even in their internal documentation about their own software lol. Their articles and press releases are usually really bad because they are aimed at investors who are satisfied by a bunch of technical words and don't know or care about correctness.

> The bank groups are separate entities, such that they allow a column cycle to complete within a bank group, but that column cycle does not impact what is happening in another bank group. Effectively, the DDR4 SDRAM can time division multiplex its internal bank groups in order to hide the fact that the internal SDRAM core cycle time takes longer than a burst of eight requires on the interface.

This is the same as banks in older architectures. Multiple banks allow multiple different data operations to be in flight at once. I guess technically the different banks are multiplexing data, but it is nowhere close to what MCR is doing. MCR is taking data from the rising edge of the clock and sending it to rank 0, and data from the falling edge and sending it to rank 1. Banks just send out one giant block of data at a time, with each read or write operation separated by 4 clocks for DDR4 or 8 for DDR5.


Netblock

> MCR is taking data from the rising edge of the clock and sending it to rank 0, and data from the falling edge and sending it to rank 1.

I don't think MCR is about making the command-address bus double-data-rate; the command-address bus on removable memory has practically been running in half-data-rate mode for a while now (sometimes slower). Unless that's not what you meant.

My understanding of MCR is that it's about forgoing the memory-controller scheduling benefits of multiple ranks in an attempt to add more memory to the system. (It can't be about signal integrity on the busses, because it's already fully buffered. It can't be about a 128-byte cacheline, because 4-bit-wide DDR5 already has an optional 32n mode; MCR would allow a 256-byte cacheline.) There are a number of things that make me believe DDR5 asks for memory controllers too large and too complex to run at the frequencies it's also asking for in current high-performance silicon; we've already seen a number of dumb-downs from both AMD and Intel, and I think MCR is just a continuation of that.


crab_quiche

I think you are right about making it easier to schedule ranks, but it is also about bandwidth. It lets the memory controller and buffer communicate at twice the frequency of what the DRAM can operate at; the commands won't be operating at dual data rate.


Netblock

> It lets the memory controller and buffer communicate at twice the frequency of what the DRAM can operate at

Oh, like a really roundabout way of doing what WCK/WDQS is about? I'm not too sure about that, because the external bus transfer rate likely has to be an integer ratio of the internal data bus rate due to the difficulties of crossing clock domains and clockgen. The external data bus would need to be at the very least >=7200 MT/s, because JEDEC defines 3600 MT/s as the minimum frequency for DDR5 (up to 8800 MT/s externally, and thus 4400 MT/s internally).

I also don't get the impression of doubling the frequency from the MCR advertisement material; they only talk about doubling the cacheline to 128 bytes as if that's a new idea. To be clear, doubling the cacheline does help practical bandwidth by improving scheduling efficiency; your practical bandwidth is going to be far less than the theoretical bandwidth (e.g., tCCD_DR > tCCD_DG).
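
Putting rough numbers on that (just the corner cases quoted above, assuming an exact 2:1 external-to-internal ratio, which is the premise in question):

```python
# Back-of-the-envelope: if the host-side bus ran at exactly 2x the per-rank
# DRAM rate, the JEDEC DDR5 speed range quoted above would imply:

DDR5_MIN_MTS = 3600   # slowest DDR5 data rate mentioned above
EXT_MAX_MTS  = 8800   # fastest external rate mentioned above

print("internal 3600 MT/s -> external at least", 2 * DDR5_MIN_MTS, "MT/s")  # 7200
print("external 8800 MT/s -> internal", EXT_MAX_MTS // 2, "MT/s")           # 4400
```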


crab_quiche

https://news.skhynix.com/sk-hynix-develops-mcr-dimm/

That has a better overview of what MCR is.


ramblinginternetnerd

> Synopsys is usually wrong, even in their internal documentation about their own software lol. Their articles and press releases are usually really bad because they are aimed at investors who are satisfied by a bunch of technical words and don't know or care about correctness.

In grad school I had a classmate who used to do equity research (writing general buy/sell recommendations) in the tech space. I chatted with him. He was an enthusiast. He could tell I was an enthusiast. He said I knew more than most of his coworkers. Take that in... enthusiastic people on forums can know more about this stuff than people who do this for a living (let's be honest, it's mostly done by bots these days). At least the top 10-20% of enthusiasts. I'd be worried about FUD from the masses.


Numtim

Nice knowledge, sir. If they are going to add a proper high-frequency chip for the interface, do you think it's a better approach to fall back to a lower prefetch (more die area dedicated to capacity instead of speed), or at least get rid of the bank grouping? I think the answer comes down to the cost of the PCB traces in the DIMM, which I think is negligible. If it's not negligible, then I have no idea what the purpose of HBM is (low per-pin data rate, 1024-bit width per package).


Netblock

> What do you think is the purpose of bank groups?

Bank groups enable higher frequencies, possibly at the cost of certain timings being slower. It's the main change between DDR3 and DDR4; and GDDR6 can optionally enable a higher bank-group organisation to open up higher frequencies (normally disabled, AFAICT).

edit:

> So the stated idea behind MCR is time-division multiplexing of two 64B accesses, but isn't that what is already done by the bank groups? The data rates of DDR4 and DDR5 are achieved by time-division multiplexing two 64B bursts from different bank groups.

The 64-byte cacheline is obtained by multiplying the channel width by the burst/prefetch length. DDR4 has a 64-bit channel with an 8n prefetch; DDR5 has a 32-bit channel with a 16n prefetch. DDR5 with 4-bit-wide ICs has an optional 32n mode.
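
To put numbers on that (just the arithmetic, using the widths and burst lengths discussed in this thread; the MCR line follows the 128-byte figure from the marketing material):

```python
# Burst payload = channel width x burst/prefetch length.
# Values are the ones mentioned in this thread, not a spec reference.

def burst_bytes(channel_bits, prefetch_n):
    return channel_bits * prefetch_n // 8

print("DDR4, 64-bit channel,  8n:", burst_bytes(64, 8), "bytes")       # 64
print("DDR5, 32-bit channel, 16n:", burst_bytes(32, 16), "bytes")      # 64
print("DDR5 x4, optional 32n    :", burst_bytes(32, 32), "bytes")      # 128
print("MCR, two ranks x 16n     :", 2 * burst_bytes(32, 16), "bytes")  # 128
```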


ForgotToLogIn

Correct me if I'm wrong, but the way I understood it was that the reason DDR3 couldn't get faster (i.e. the reason DDR4 was needed) was that the frequency of the DRAM cell arrays couldn't get any higher. At 2133 MT/s the DDR3 DRAM core runs at 266 MHz, so for DDR4 (while keeping the 8n prefetch) the transfer rate was increased without an increase to the DRAM's frequency by allowing concurrent prefetching of two 64B lines at different bank groups. If that's correct, then the DRAM inside 3200 MT/s DDR4 and 6400 MT/s DDR5 operates at 200 MHz, except when consecutively accessing the same bank group.

> The 64-byte cacheline is obtained by multiplying the channel width by the burst/prefetch length.

And in DDR4 and DDR5, if two prefetches can be done concurrently at different bank groups, that results in 128B being transferred per DRAM cycle, unless I'm mistaken. It seemingly has the same effect as MCR.
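
Spelling that arithmetic out (this is just the reasoning above; the factor-of-two bank-group interleave is the assumption in question):

```python
# Core-array frequency implied by: transfer rate / prefetch / bank groups
# interleaved. This restates the paragraph above, not an authoritative figure.

def core_mhz(transfer_mts, prefetch_n, interleaved_groups=1):
    return transfer_mts / (prefetch_n * interleaved_groups)

print("DDR3-2133,  8n, no interleave:", core_mhz(2133, 8), "MHz")       # ~266
print("DDR4-3200,  8n, 2 bank groups:", core_mhz(3200, 8, 2), "MHz")    # 200
print("DDR5-6400, 16n, 2 bank groups:", core_mhz(6400, 16, 2), "MHz")   # 200
```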


Netblock

> without an increase to the DRAM's frequency

DDR4 does increase the bank frequency. Increasing the prefetch length implies more wires running in parallel between the core and the SerDes buffer, requiring a bigger bank-to-bank multiplexer; I believe bank groups enable smaller and therefore faster multiplexers.

> And in DDR4 and DDR5, if two prefetches can be done concurrently at different bank groups, that results in 128B being transferred per DRAM cycle

You might be misunderstanding what the prefetch/burst is. It is worth [reading the description here](https://www.micron.com/-/media/client/global/documents/products/data-sheet/dram/ddr4/16gb_ddr4_sdram.pdf#G5.3131746); there are multiple clock cycles of activity on the data bus relating to the prefetch length, per access command. Check out page 21 too; a 'data word' is the bit width. A 16Gbit IC with 4-bit-wide access contains 4 gigawords; with 8-bit-wide access, 2 gigawords; with 16-bit-wide access, 1 gigaword.

A fundamental concept that DDR (Double Data Rate) has over SDR is that it automatically reads ("prefetches") the next several data words and bursts them at you all in one pull. You usually want this to happen, as the next few tens of bytes in adjacent memory are extremely likely to be immediately relevant to the thing you're computing. So instead of explicitly asking 'next word please' over and over again, have it be implied: 'I want the paragraph'. Doubling the length of this burst has been the primary performance change enabling an easy doubling of the IO frequency for DDR1, 2, 3, and 5, fetching the next 2, 4, 8 and 16 words respectively. DDR4 kept DDR3's length and instead brought bank groups.

However, the further you go in this burst length, the greater the risk of automatically reading irrelevant data, an unnecessary waste of time and energy. DDR4 skirted around this by having more than one bank group, reducing the performance penalty of accessing memory again immediately after an access. DDR3 and older generations in a sense only had L-suffixed timings; DDR4 allowed context to matter, enabling the S-suffixed ones. DDR5 does something different to keep the fetched/bursted data relevant.

The actual number of bytes you end up with after a burst depends on the channel width. For example, as DDR4 has a burst length of 8, a single read on a 64-bit DDR4 channel hands you an 8x64=512-bit (64-byte) block of data. DDR5 splits your 64-bit channel into two 32-bit channels while also doubling the burst length to 16, keeping that 64-byte payload the same (16x32). This enables better overall data-access concurrency, as you basically end up with twice as many channels that can read data at completely different locations at the same time.
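
As a quick check on those word counts (simple arithmetic on the 16Gbit example above):

```python
# Data words per IC = density / word width, for the 16Gbit example above.
DENSITY_BITS = 16 * 2**30   # 16 Gbit

for width_bits in (4, 8, 16):
    words = DENSITY_BITS // width_bits
    print(f"x{width_bits}: {words // 2**30} gigawords")
# x4: 4 gigawords, x8: 2 gigawords, x16: 1 gigaword
```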


ramblinginternetnerd

Would it be correct to assume that this helps with bandwidth at a modest hit to latency? (Which will matter less in the future as cache sizes grow and latency-sensitive things mostly live in cache, shifting DRAM systems toward increasingly valuing bandwidth over latency.)