Overclocking Matisse or in search of the limit. Zen 2 Architecture Overview


Quite a lot of time has passed from the first teaser to the actual announcement — we have been teased with next-gen AMD processors for over a year now. The new chiplet design was hailed as a breakthrough in processor performance and scalability, especially since with each generation and each new architecture it becomes harder and harder to build a large, high-frequency chip on a smaller process node.


This bold move is expected to have an impact on the industry as a whole. Today I will try to roll out the long-awaited guide-review for you, which will include a comparison of all Zen architectures, a guide on overclocking the different processor generations and, of course, a look at what RAM overclocking gives us on Zen 2. I want to warn you right away: you will not find an unboxing-style "contents of the box" review in this article.




To begin with, Zen 2 is a member of the Zen family, not a completely new architecture or a new x86 instruction-processing paradigm, and at the top level the core looks pretty much the same as Zen/Zen+. Key features of the Zen 2 architecture include a new L2 branch predictor (the TAGE predictor), a doubled micro-op cache, a doubled L3 cache, increased integer resources, increased load/store resources, and single-op 256-bit AVX (AVX2) support with no frequency penalty for AVX2.


I have already mentioned above that AMD has made a breakthrough in processor engineering by applying a multi-chip design. However, Zen 2 CCX complexes are composed of cores similar to previous generations. One CCX block combines 4 cores and 16 MB of shared L3 cache.

A pair of CCXs sits on a single 7nm die and forms a processor chiplet, which received the abbreviation CCD (Core Complex Die). In addition to the cores and cache, the CCD chiplet also includes an Infinity Fabric bus controller, through which the CCD connects to the 12nm I/O die that is mandatory for any Ryzen 3000 — a bus we already got a feel for back in Zen+.


The input/output (I/O) chiplet of Zen 2 generation processors houses the so-called uncore components, as well as elements of the north bridge and SoC. Among other things, it contains the memory controller and the PCI Express 4.0 bus controller. The I/O die also implements the two Infinity Fabric links required for connection to the CCD chiplets.

Depending on which processor of the Ryzen 3000 family we are talking about, it can consist of either two or three chiplets.

Processors with eight cores or fewer use only one CCD chiplet and one I/O die.


Processors with more than eight cores carry two CCD chiplets. However, you need to understand that the processor still remains a single entity: since in any Ryzen 3000 the memory controller sits in the I/O die and there is only one of it, any core can access any memory region — there are no NUMA configurations here.

Of course, the chiplet design creates certain difficulties for the interaction of various CPU components and requires competent implementation of a specialized bus, which is Infinity Fabric. However, AMD managed to successfully cope with this task, and we have the opportunity to feel it in practice.

For example, the Infinity Fabric bus handles writes at 16 bytes per clock, not 32 as shown on the slides — an interesting detail that did not make it onto them. You can see the resulting write dip in RAM benchmarks such as AIDA64. There is no point in getting upset about this, because write speed does not really matter in most tasks: x86 code typically has a 2:1 read/write ratio, and many newer instructions even reach 3:1 (a = a + b + c).

Zen 2 also gained a dedicated store AGU to help resolve write-back addresses faster, so each core can now issue writes faster than before, despite the total write bandwidth between chiplet and memory being halved. This also helps games, which in some cases generate more writes than reads.

Infinity Fabric

With the transition to Zen 2, we also move to the second generation of Infinity Fabric. One of the major updates in IF2 is the increase in bus width from 256 to 512 bits, which means a twofold increase in throughput and the ability to transfer 32 bytes per clock in each direction. AMD did this primarily because of PCI Express 4.0 support in the Ryzen 3000, and secondly to raise system performance in scenarios where bus bandwidth was a bottleneck due to low RAM clocks (for example, when the user bought cheap memory).
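To put these per-clock figures in perspective, here is a back-of-the-envelope calculation of my own — the 1800 MHz FCLK is an assumed figure (typical for DDR4-3600 in 1:1 mode), combined with the 32 B/clk read and 16 B/clk write widths discussed in this article:

```python
# Back-of-the-envelope Infinity Fabric 2 bandwidth estimate per CCD link.
# Assumptions (mine, for illustration): 32 B/clk read path, 16 B/clk
# write path, FCLK of 1800 MHz as with a typical DDR4-3600 kit in 1:1 mode.

def if2_bandwidth_gbps(bytes_per_clock: int, fclk_mhz: float) -> float:
    """Theoretical bandwidth in GB/s for one IF2 link direction."""
    return bytes_per_clock * fclk_mhz * 1e6 / 1e9

read_bw = if2_bandwidth_gbps(32, 1800)   # 57.6 GB/s theoretical read
write_bw = if2_bandwidth_gbps(16, 1800)  # 28.8 GB/s theoretical write
print(f"read: {read_bw:.1f} GB/s, write: {write_bw:.1f} GB/s")
```

The halved write path explains the write "dip" visible in AIDA64-style benchmarks: theoretical write bandwidth is exactly half of read at the same fabric clock.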


According to AMD, the overall efficiency of IF2 increased by 27%, resulting in lower power per bit transferred. Many-chiplet HEDT parts await us in the future, and they of course urgently need the new interconnect, but we will talk about that in the fall.

One of the features of IF2 is that the memory controller received a second mode in which its frequency is half the real DRAM frequency, that is, UCLK = 1/2 MEMCLK. This was done to cover the extreme memory overclocking needs of enthusiasts, and so that even with a poorly binned I/O die the user can still overclock the RAM without being held back by IF2 and the memory controller. In practice, however, even the worst specimens work perfectly at a UCLK of 1800 MHz, so the 2:1 mode remains the preserve of enthusiasts and extreme overclockers.

Clock synchronization is available as 1:1 or 2:1 on Zen 2; the Zen and Zen+ generations support only 1:1.

I also noticed that there are quite a few questions on Reddit related to FCLK (this is a new option in UEFI), in particular how to configure it so that the system has maximum performance. The ideal option for Zen 2 remains the mode when FCLK = UCLK = MEMCLK, in this case there are no “penalties” of synchronization of these three domains.


As for AMD's recommendations, everything is quite simple: if you have no desire to bother with timing tuning, select the 1:1 mode (actually, you don't even need to select it — it is enabled by default), that is, everything works as before. But if you are an enthusiast and familiar with my guide on overclocking and tuning RAM, nothing prevents you from squeezing the maximum out of either mode.
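To make the three clock domains concrete, here is a tiny sketch — a hypothetical helper of my own, not a real tool — that derives them from a DDR4 kit's transfer rate, assuming the ideal FCLK = MEMCLK setting described above:

```python
# Sketch of Zen 2 clock domains for a DDR4 kit (illustrative only).
# MEMCLK is the real DRAM clock (half the DDR transfer rate);
# in 1:1 mode UCLK = MEMCLK, in 2:1 mode UCLK = MEMCLK / 2.
# FCLK is set independently in UEFI; here I assume the ideal FCLK = MEMCLK.

def clock_domains(ddr_rate: int, mode: str = "1:1") -> dict:
    """Return MEMCLK/UCLK/FCLK in MHz for a DDR4-<ddr_rate> kit."""
    memclk = ddr_rate / 2                      # DDR: two transfers per clock
    uclk = memclk if mode == "1:1" else memclk / 2
    fclk = memclk                              # "no penalty" configuration
    return {"MEMCLK": memclk, "UCLK": uclk, "FCLK": fclk}

print(clock_domains(3600))        # DDR4-3600, 1:1: all three at 1800 MHz
print(clock_domains(3600, "2:1")) # 2:1 mode: UCLK drops to 900 MHz
```

For a DDR4-3600 kit in 1:1 mode all three domains land at 1800 MHz — exactly the FCLK = UCLK = MEMCLK state with no synchronization penalties.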


The cache system has received major changes. The most notable is the L1 instruction cache, which has been reduced from 64 to 32 KB, while its associativity has increased from 4-way to 8-way.


This change allowed AMD to double the micro-op cache from 2K to 4K entries and achieve higher L1-I utilization. According to AMD, this gave the best balance of energy efficiency and performance in modern applications, which are rarely shining examples of optimization yet dominate the software market.

The L1-D cache is still 32 KB with 8-way associativity, and the L2 cache is 512 KB with 8-way associativity. The L3 cache has doubled per complex (CCX) to a full 16 MB, meaning one chiplet (CCD) has a whopping 32 MB of L3 at its disposal. Latency for the first two levels has not changed — 4 cycles for L1 and 12 cycles for L2 — but L3 holds a little surprise: its latency has grown from 35 cycles to 40, which is typical for larger caches and nothing terrible.
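For a feel of what those cycle counts mean in wall-clock time, a quick conversion of my own — the 4.2 GHz core clock is an assumed figure for illustration:

```python
# Convert the quoted cache latencies from cycles to nanoseconds.
# The 4.2 GHz core clock is an assumption for illustration only.

def cycles_to_ns(cycles: int, core_ghz: float = 4.2) -> float:
    """Latency in nanoseconds at the given core clock."""
    return cycles / core_ghz

for level, cyc in [("L1", 4), ("L2", 12), ("L3", 40)]:
    print(f"{level}: {cyc} cycles = {cycles_to_ns(cyc):.2f} ns")
```

At roughly a quarter of a nanosecond per cycle, the L3 change from 35 to 40 cycles costs on the order of one extra nanosecond — small next to a trip to DRAM.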

AMD also said that it has increased the size of the queues that handle L1 and L2 misses, but did not specify by how much.

A nice extra touch: the cache can now service two 256-bit reads and one 256-bit write per clock at the L1 level, and one 256-bit read and one 256-bit write per clock at the L2 level, which makes a huge contribution to AVX execution speed.

Floating-Point Calculations

The main floating-point performance improvement is full-rate support for AVX2. AMD widened the execution units from 128 to 256 bits, which allows an AVX2 operation to complete as a single micro-op rather than being split into two micro-ops over two cycles; consequently, Zen 2 can be expected to double throughput on AVX2 code.

The set of execution units in the FPU is otherwise unchanged. In addition, in Zen 2 AMD ensured that AVX2 instructions can be processed without any reduction in clock frequency, as happens on Intel processors. That said, the frequency can still drop to respect the stock limits (temperature and power), but this happens automatically and regardless of the instructions in use. I must add the caveat that the user can raise these limits at will or disable them altogether, thereby shifting all responsibility onto the cooling system and their own shoulders.
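As a rough illustration of what the 256-bit units buy, here is a peak-throughput estimate of my own — the two FMA pipes per core is the commonly quoted Zen 2 configuration, assumed here rather than taken from this article:

```python
# Peak single-precision FLOP/s per core with native 256-bit AVX2.
# Assumptions (mine): two 256-bit FMA pipes per core, 8 SP lanes per
# 256-bit vector, and 2 FLOPs (multiply + add) per FMA per lane.

def peak_sp_gflops(core_ghz: float, fma_pipes: int = 2,
                   lanes: int = 8, flops_per_fma: int = 2) -> float:
    """Theoretical peak single-precision GFLOP/s for one core."""
    return core_ghz * fma_pipes * lanes * flops_per_fma

print(peak_sp_gflops(4.0))  # 128.0 GFLOP/s per core at 4 GHz
```

Under these assumptions a core doing the same work with 128-bit halves (4 lanes per op) would top out at exactly half that figure, which is the doubling the text describes.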


In the floating-point module, the queues accept up to four micro-ops per clock from the dispatch unit, which feed a 160-entry physical register file. From there the work moves to four execution pipes, which can be supplied with 256-bit data by the load/store engine.

Beyond the doubled width, other changes were made to the FMA units — AMD says its engineers increased raw performance in memory allocation, physics simulation (compute) and some audio-processing techniques.

Another key upgrade is the reduction of FP multiplication latency from 4 to 3 cycles — a pretty significant improvement. AMD promised to share more details at Hot Chips in August.


The main claimed improvement is the use of a TAGE predictor, although it is only used for non-L1 fetches. AMD still uses a hashed-perceptron predictor for L1 fetches, which handles as many predictions as possible, while the TAGE L2 branch predictor uses additional tags to enable a longer branch history for better prediction. This matters more for L2 and beyond; for short L1 lookups the hashed perceptron is preferred on power grounds.


We also get larger BTBs on the front end to keep track of instruction branches and cache requests. The L1 BTB doubled from 256 to 512 entries, and the L2 BTB grew from 4K to 7K, almost double. The L0 BTB remains at 16 entries, but the indirect target array goes up to 1K entries. Overall, according to AMD, these changes reduce the misprediction rate by 30%, thereby saving power.


For the decode stage, the main benefit is the micro-op cache. By doubling its size from 2K to 4K entries, it holds more decoded operations than before, which means they can be reused more often. To make that reuse easier, AMD increased the dispatch rate from the micro-op cache to the buffers to 8 fused instructions per cycle.


The decoders in Zen 2 remain the same — we still have four complex decoders — and decoded instructions are cached in the micro-op cache as well as sent to the micro-op queue.

Beyond the decoders, the micro-op queue and dispatch unit can issue six micro-ops per clock to the schedulers. However, this is slightly unbalanced, because AMD has independent integer and floating-point schedulers: the integer scheduler can take six micro-ops per clock, while the floating-point scheduler can only take four. Micro-ops can, however, be dispatched to both at the same time.

Execution of instructions


The integer schedulers can accept up to six micro-ops per clock, which feed into a 224-entry reorder buffer (versus 192 before). Technically the integer module has seven execution ports, consisting of four ALUs (arithmetic logic units) and three AGUs (address generation units).

The schedulers consist of four 16-entry ALU queues and a 28-entry AGU queue; the AGUs can feed three micro-ops per clock to the register file. The AGU queue grew as a result of AMD modeling instruction distributions in typical software. These queues feed a general-purpose register file with 180 entries (up from 168), which also tracks specific ALU operations to prevent potential stalls.

The three AGUs feed the load/store unit, which can sustain two 256-bit reads and one 256-bit write per clock. Not all three AGUs are equal: AGU2 can only handle stores, while AGU0 and AGU1 can handle both loads and stores.

Loads and Stores


Zen 2 also improves the L2 TLB (translation lookaside buffer). In the first generation of Zen processors this table held 1.5K entries; it has now grown to 2K. The L2 TLB also now supports 1 GB pages, which previous versions of the microarchitecture did not.

Another key metric here is load/store bandwidth: the core can now sustain 32 bytes per clock instead of 16.

Along the way, the store queue was also increased from 44 to 48 entries.

QoS control of cache and memory bandwidth


In most new x86 microarchitectures there is a race to improve performance with new instructions, as well as a push for parity between vendors on which instructions are supported. With Zen 2, AMD is in no hurry to "please" Intel by adding exotic instruction sets to its offspring. The company is adding new instructions in three different areas.

CLWB has been seen previously in Intel processors in connection with non-volatile memory. This instruction lets a program write data back to non-volatile memory so that it is not lost if the system receives a halt command. There are other instructions related to protecting data in non-volatile memory systems, but AMD did not go into detail. Perhaps the company intends to improve support for non-volatile memory hardware and data structures in future products, especially its EPYC processors, and does not want to show its trump card ahead of time.

The second caching instruction, WBNOINVD, is relatively new; it builds on similar instructions such as WBINVD and is currently exclusive to the AMD platform. The idea is to anticipate when certain portions of the cache will be needed in the future and write them back in advance, ready to speed up upcoming calculations. If the required cache line is not ready, the flush would otherwise be processed right before the operation that needs it, adding latency — by starting the cache-line write-back early, while the latency-critical instruction is still in flight, the pipeline speeds up its final execution.

The third set of QoS instructions actually refers to how cache and memory are prioritized.

When a cloud CPU is split into different containers or VMs for different tenants, the level of performance is not always the same as performance can be limited depending on what the other VM is doing on the system. This is known as the “noisy neighbor” problem: if someone else is consuming all the core-to-memory bandwidth or L3 cache, it can be very difficult for another VM on the system to access what it needs. As a result of this noisy neighbor, the other VM will have a very variable latency while processing its workload. Alternatively, if a critical VM is on the system and another VM continues to request resources, the critical VM may end up missing its targets because it doesn’t have all the resources it needs to access.

Noisy neighbors are difficult to deal with, short of giving a single user exclusive access to the hardware. Most cloud providers won't even tell you whether you have neighbors, and with live virtual machine migrations these neighbors can change very often, so there is no guarantee of stable performance at any given time.

As with Intel’s implementation, when a series of virtual machines are hosted on top of a hypervisor, the hypervisor can control the amount of memory bandwidth and cache that each virtual machine has access to.

Intel enables this feature only on its Xeon Scalable processors; AMD, however, will enable and expand it across the Zen 2 family for both consumer and enterprise users.


Another aspect of Zen 2 is AMD's approach to the increased security requirements of today's processors. As previously reported, a significant number of recent side-channel exploits do not affect AMD processors, primarily due to the way AMD manages its TLBs, which have always required additional security checks before most of these issues could become a problem. For the issues it is vulnerable to, however, AMD has implemented a full hardware security platform.


The change here concerns Speculative Store Bypass, known as Spectre v4: the new processors now carry a hardware mitigation that works in conjunction with the OS or virtual machine managers such as hypervisors. The company does not expect any performance impact from these updates. Newer issues such as Foreshadow and ZombieLoad do not affect AMD processors.

