For the past three years, AMD has managed to please just about everyone. Fans of the red camp are happy with the performance, while the blue camp benefits from lower prices and the competition that has finally returned. Since the first Zen-based products, Lisa Su's team has refined its architecture to the point where the new Ryzen 5000 series launched at higher prices than its predecessors. That is something you do only when you are completely confident in your product. Whether that confidence is justified we will find out at the end of testing; for now, let's see what the new Zen 3 architecture brings.
Even though the architecture already underwent dramatic changes in the Zen 2 iteration, AMD did not slow down: within a year it managed another overhaul without changing the process node. Zen 3 is a genuinely big step forward, delivering a claimed 19% IPC uplift along with a number of other improvements.
Looking at the illustration above, you can see that serious work has been done in many areas to satisfy all users and, in particular, to please gamers. The optimizations touch every stage of the pipeline: the branch target buffer (BTB) has doubled, branch-predictor bandwidth has improved significantly, fetch "bubbles" have been reduced, the branch misprediction penalty has been lowered, op-cache fetch sequencing is faster, and op-cache pipeline switching is finer-grained. The execution, load, and store units also received a number of improvements, but more on that later.
CCD and cache hierarchy
Probably the main trump card of the Zen 3 microarchitecture is the reworked CCD (Core Complex Die) structure. The die is now effectively monolithic: instead of two four-core CCX (Core Complex) modules communicating over Infinity Fabric, each CCD contains a single eight-core complex.
Thanks to the new layout, every core can now access the full L3 cache directly, without crossing Infinity Fabric, which significantly reduces data-access latency. L3 latency itself, however, has risen from 39 to 46 cycles. The main reason is the size of the now-unified L3 slice; a secondary reason is the increased L3 clock frequency, which, fortunately, partially compensates for the extra cycles in absolute time. The L3 fill policy is victim-based: prefetching does not allocate into it, and it is filled only by lines evicted from L2. The L3 cache is therefore predominantly exclusive.
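The victim-cache behavior described above can be sketched with a toy two-level model: L3 is filled only by lines evicted from L2, never allocated directly on a miss. This is a simplified illustration of the policy, not a model of AMD's actual hardware (the sizes and LRU replacement here are arbitrary assumptions):

```python
from collections import OrderedDict

class VictimCacheModel:
    """Toy model: L3 is filled only by L2 evictions (victim / mostly-exclusive)."""
    def __init__(self, l2_lines=4, l3_lines=8):
        self.l2 = OrderedDict()          # LRU order: oldest entry first
        self.l3 = OrderedDict()
        self.l2_lines, self.l3_lines = l2_lines, l3_lines

    def access(self, line):
        if line in self.l2:              # L2 hit
            self.l2.move_to_end(line)
            return "L2"
        hit = "L3" if line in self.l3 else "DRAM"
        self.l3.pop(line, None)          # exclusive: promote the line out of L3
        self.l2[line] = True             # fill goes straight into L2
        if len(self.l2) > self.l2_lines: # L2 eviction -> victim lands in L3
            victim, _ = self.l2.popitem(last=False)
            self.l3[victim] = True
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)
        return hit

cache = VictimCacheModel()
for addr in [0, 1, 2, 3, 4]:             # the fifth line evicts line 0 into L3
    cache.access(addr)
print(cache.access(0))                   # prints "L3": line 0 now hits in L3
```

Note that line 0 never entered L3 on its original fill; it arrived there only as a victim of an L2 eviction, which is exactly what "mostly exclusive" means.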
As for the rest of the cache hierarchy, little has changed.
The L1-I and L1-D caches are still 32 KB each with 8-way associativity, and the L2 cache is 512 KB, also 8-way. The micro-op cache is likewise unchanged at 4K ops.
Per core, Zen 3 supports 64 outstanding misses from L2 to L3, and 192 from L3 to memory.
The instruction cache (L1-I) can deliver 32-byte fetches each cycle, while the data cache (L1-D) sustains three 32-byte loads and two 32-byte stores per cycle. The store queue has grown from 48 to 64 entries.
It is worth noting that this split suits most demanding workloads well, since they ultimately issue more loads than stores.
Also, Zen 3 can perform two 256-bit loads and one 256-bit store per cycle, provided different L1-D cache banks are accessed.
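The per-cycle widths quoted above translate directly into theoretical peak L1-D bandwidth. As a back-of-the-envelope check (the 4.9 GHz clock here is an assumed example figure, not a specification):

```python
# Theoretical peak L1-D bandwidth per core from the per-cycle access widths.
LOADS_PER_CYCLE = 3                        # 32-byte loads per cycle
STORES_PER_CYCLE = 2                       # 32-byte stores per cycle
BYTES_PER_ACCESS = 32
clock_ghz = 4.9                            # assumed boost clock, for illustration

load_gbps = LOADS_PER_CYCLE * BYTES_PER_ACCESS * clock_ghz   # GB/s
store_gbps = STORES_PER_CYCLE * BYTES_PER_ACCESS * clock_ghz
print(f"loads: {load_gbps:.0f} GB/s, stores: {store_gbps:.0f} GB/s")
```

At the assumed clock this works out to roughly 470 GB/s of loads against about 314 GB/s of stores per core, mirroring the 3:2 load/store asymmetry discussed above.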
The data translation lookaside buffer (DTLB) is unchanged at 2K entries. AMD also notes that prefetching across page boundaries has been improved.
Instruction fetch and branch prediction block
According to AMD, the instruction fetch and branch prediction unit received a number of optimizations that speed up fetching, especially in branchy code and large code footprints. This is undeniably good news for programmers and most game engines, since the processor takes on the extra optimization work instead of the developers.
The branch target buffers have been reworked slightly. The first-level table in Zen 3 holds 1024 entries instead of 512, while the second level holds 6.5K entries instead of 7K. The indirect target array has grown from 1K to 1.5K entries. Together, these changes should let the processor recover the execution pipeline from a mispredicted branch more quickly.
The micro-op queue is fed with four decoded instructions per cycle from the decoders, or eight instructions per cycle from the micro-op cache. From there, dispatch can send up to six micro-ops per cycle to the schedulers.
Another interesting innovation is the fight against "bubbles" in prediction. All Ryzen processors use the program counter (PC) to track the instruction currently being fetched in the pipeline. When an instruction stalls at the decode stage, the PC value and the instructions already in the fetch stage are held in place until the instruction causing the conflict clears the execute stage. Such an event is commonly called a bubble, by analogy with an air bubble in a heating system or your custom water-cooling loop.

In some microarchitectures, a predicted-taken branch creates a bubble in the fetch pipeline even when the prediction is correct. With multi-level predictors, the second level can override the first level's prediction; the final prediction is correct, but a small bubble is still introduced to redirect fetch. With enough buffering this effect is usually negligible. A branch also consumes branch-predictor resources, potentially hurting the predictability of other branches (for example, by occupying a BTB entry that would otherwise not have been evicted, or by using up bits of the global history).
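The cost of these bubbles can be illustrated with a toy fetch-pipeline model in which even a correctly predicted taken branch inserts a one-cycle redirect bubble, while a misprediction flushes the pipeline. This is a simplified sketch with made-up penalty numbers, not a model of Zen 3's actual pipeline depths:

```python
def fetch_cycles(trace, bubble_on_taken=1, mispredict_penalty=12):
    """Count fetch cycles for a branch trace.

    trace: list of (predicted_taken, actually_taken) pairs.  Each branch
    costs one base fetch cycle, plus a redirect bubble when taken, plus a
    full flush penalty when mispredicted.  Penalties are illustrative.
    """
    cycles = 0
    for predicted, actual in trace:
        cycles += 1                          # base fetch cycle
        if predicted != actual:
            cycles += mispredict_penalty     # pipeline flush and refill
        elif actual:
            cycles += bubble_on_taken        # correct, but fetch redirected
    return cycles

# Eight correctly predicted taken branches vs. eight not-taken ones:
taken = [(True, True)] * 8
not_taken = [(False, False)] * 8
print(fetch_cycles(taken), fetch_cycles(not_taken))  # taken path is still slower
```

Even with perfect prediction, the taken path costs twice the fetch cycles of the not-taken path in this model, which is exactly the kind of loss AMD's "zero-bubble" optimizations aim to eliminate.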
Zen 3 also changes how the schedulers work. The four integer schedulers are now unified, each with a 24-entry queue serving both an ALU and an AGU (instead of Zen 2's four 16-entry ALU schedulers and one 28-entry AGU scheduler).
The integer schedulers can accept up to six micro-ops per clock, which feed a larger 256-entry reorder buffer (up from 224 in Zen 2).
The integer block has eight execution ports, connecting four ALUs (arithmetic logic units), three AGUs (address generation units), and one BRU (branch unit, which resolves jumps, procedure calls, and other control-flow decisions).
The AGUs can still deliver three address calculations per clock. The general-purpose register file has grown slightly, from 180 to 192 entries.
The floating-point (FP) unit has not been overlooked either. According to AMD, the structures feeding the execution units gained more parallelism, while internal latencies have dropped (this applies to the integer block as well).
The fused multiply-accumulate (FMAC) operation is now one cycle faster. Two new pipelines, F2I and STORE F2I, have also been added; they convert floating-point values to integers and store the results. Scheduling has therefore become somewhat more complex, and the scheduler somewhat larger. Unfortunately, AMD does not give specific figures, but this should nevertheless improve performance in applications that use AVX2.
IOD and RAM overclocking
The IOD is perhaps the only die with no architectural changes at all.
With a single-CCD configuration, write speed is half of read speed (just as with the Zen 2 microarchitecture). Recall that this is not a drawback, since most applications perform far more reads than writes (specialized benchmarks being the exception). The most attentive readers may notice that this mirrors the cache hierarchy, with its three loads and two stores per clock; the architecture is balanced throughout.
Dual-CCD configurations (Ryzen 9 5900X and Ryzen 9 5950X) retain full write speed. Nothing has changed there either.
Thanks to higher core frequencies and the CCX reorganization, DRAM access latency has dropped by almost 10 ns and in most cases comes to 55 ns with DDR4-3800 memory at CAS 16. There is a disappointment, though: so far none of our samples has conquered an FCLK above 1900 MHz. The memory controller itself was left untouched.
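The relationship between these numbers is easy to verify with simple arithmetic. Keep in mind this is only the CAS component; total DRAM latency also includes fabric and controller hops, which is why the measured figure is 55 ns rather than single digits:

```python
# DDR4-3800 CL16: convert CAS latency from cycles to nanoseconds,
# and derive the 1:1 FCLK from the transfer rate.
transfer_rate_mts = 3800          # mega-transfers per second (DDR4-3800)
cas_cycles = 16                   # CAS latency (CL)

mem_clock_mhz = transfer_rate_mts / 2        # DDR: two transfers per clock
fclk_mhz = mem_clock_mhz                     # 1:1 coupled mode -> FCLK 1900 MHz
cas_ns = cas_cycles * 1000 / mem_clock_mhz   # 16 cycles at 1900 MHz, in ns

print(f"FCLK: {fclk_mhz:.0f} MHz, CAS: {cas_ns:.2f} ns")
```

This also makes it clear why the 1900 MHz FCLK wall matters: running memory above DDR4-3800 breaks the 1:1 FCLK:MCLK coupling, and the resulting fabric divider adds latency that usually outweighs the bandwidth gain.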
Data security has always been a priority for AMD, and as many of you know, Ryzen processors are among those that have avoided security patches carrying performance penalties. AMD nevertheless continues to improve security and introduces CET (Control-flow Enforcement Technology) support in all processors built on the Zen 3 microarchitecture.
CET is designed to protect against the misuse of legitimate code through control-flow hijacking attacks, techniques widely used in large classes of malware. CET offers software developers two key capabilities for defending against such malware: indirect branch tracking and the shadow stack. Indirect branch tracking protects indirect branches against jump-oriented and call-oriented programming (JOP/COP) attacks. The shadow stack protects return addresses against return-oriented programming (ROP) attacks.
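The shadow-stack idea can be sketched conceptually: hardware keeps a second, protected copy of every return address, and a mismatch on return means the on-stack address was tampered with. This is a toy model of the mechanism, not AMD's hardware implementation (the addresses and class names are invented for illustration):

```python
class ShadowStackViolation(Exception):
    """Raised when a return address disagrees with the shadow copy."""

class ShadowStack:
    """Toy CET shadow-stack model: calls push to both stacks, returns compare."""
    def __init__(self):
        self.program_stack = []    # attacker-writable in a real exploit
        self.shadow_stack = []     # hardware-protected copy

    def call(self, return_addr):
        self.program_stack.append(return_addr)
        self.shadow_stack.append(return_addr)

    def ret(self):
        addr = self.program_stack.pop()
        if addr != self.shadow_stack.pop():
            raise ShadowStackViolation(f"return to {addr:#x} blocked")
        return addr

ss = ShadowStack()
ss.call(0x401000)
ss.call(0x402000)
print(hex(ss.ret()))                # legitimate return: stacks agree
ss.program_stack[-1] = 0xDEADBEEF   # simulated stack smash of the return address
try:
    ss.ret()
except ShadowStackViolation:
    print("blocked")                # ROP-style redirect is caught
```

A real ROP chain works by overwriting exactly that last return address; with the shadow copy in place, the mismatch is detected before control transfers.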
Precision Boost 2
As practice with CTR (ClockTuner for Ryzen) has shown, AMD's Precision Boost 2 delivered little real efficiency even on Zen 2 processors, since there was no intelligent approach to trading frequency against voltage; as a result, the user got a veritable stove. Nothing has changed here, and the press materials make no mention of an updated auto-overclocking mechanism or a new factory method for identifying the strongest cores best suited to low-threaded boost. We will definitely return to this in future articles, since assessing core quality takes a lot of time and requires a decent sample size to draw correct conclusions.
As for the stock boost, it is currently excellent, and in most cases the observed values even exceed the marketing figures.
Safe voltages and other recommendations
Every day users worry about what the voltage should be and what values are safe, so here is a little cheat sheet. In games and light applications, the CPU VID can fluctuate up to 1.5 V, and this is normal. Under sustained heavy load, the maximum CPU VID should not exceed 1.35 V. Idle values may fluctuate in the 0.2-1.45 V range, corresponding to the P2 and P1 P-states (light boost).
Moreover, a reading of 1.45 V does not mean that this voltage is applied to all cores; most of them may be in the C6 sleep state or another C-state. Another important point is the update interval of the effective frequency and VID values, which is about 1 ms; a running HWiNFO or Ryzen Master therefore "catches" only about 1/1000 of what is actually happening in the system.
Note that after installing the chipset drivers (don't forget to install them!), you no longer need to activate the Ryzen Balanced or Ryzen Performance power profiles: they are simply gone, as AMD has retired them.
Processor power management now relies on the standard Windows power plans. Moreover, the cores of Zen 3 Ryzen processors sleep more, and the spontaneous temperature spikes are gone. Windows must be on the May 2020 Update (version 2004); otherwise you may run into problems with boost or with power saving at idle.
The maximum allowable temperature is now 95 °C for processors with a rated package TDP of 65 W, and 90 °C for 105 W processors.