[FPGA]

What is Metastability?

Neo Park 2019. 7. 8. 13:18


Source: https://m.blog.naver.com/PostView.nhn?blogId=eom913&logNo=128079723&proxyReferer=https%3A%2F%2Fwww.google.com%2F



1. Metastability


When signals cross between asynchronous clock domains, the usual advice is to pass them through two flip-flops in series; this is supposed to avoid metastability. Simply put, metastability is a condition in which a signal stays at a level that is neither low nor high. It should not be confused with an unknown value: unknown means we do not know whether the value is low or high, but it is definitely one of the two, whereas metastability means the signal actually holds an intermediate level, and that is the problem.

Why is this a problem? A CMOS gate consists of a pull-up network of PMOS transistors that connects the output to VDD and a pull-down network of NMOS transistors that connects the output to ground. For a solid low or high input, only the pull-up or only the pull-down is on, which is the situation we want; only at the instant the input switches are both networks partially on, so a brief burst of current flows from VDD to ground. The input therefore needs to settle clearly at low or high. If it lingers at an intermediate level for a long time, the result is not just a power problem but possibly a damaged (burned-out) circuit, so it must be avoided. When signals are exchanged between asynchronous clocks, however, a flip-flop running on the other clock can capture the value exactly while it is changing, and the intermediate level can then persist.


An output stuck at an intermediate level is in fact a very unstable condition. A bit of noise, or the fact that the circuit is not perfectly balanced, will eventually push it to either low or high. The problem is that this resolution does not happen instantly; it proceeds gradually. That is why a second flip-flop is added: it buys one clock cycle of time. The output of the first flip-flop has a full clock period to settle to a definite low or high before it is captured as the input of the second flip-flop. Do not misread this as meaning that a flip-flop always produces a clean low or high output; if that were true, there would be no reason to use two of them.
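As a rough behavioral sketch of this two flip-flop synchronizer (Python rather than HDL, with invented names, and with metastable resolution modeled crudely as a coin flip), the idea looks like this:

```python
import random

# Behavioral sketch of a two flip-flop synchronizer. If the first stage
# samples the asynchronous input mid-transition, its output is modeled as
# resolving randomly to 0 or 1 before the next clock edge; the second
# stage then captures a settled value. Illustrative only, not HDL.
class TwoFlopSync:
    def __init__(self):
        self.stage1 = 0
        self.stage2 = 0

    def clock(self, async_in, mid_transition=False):
        self.stage2 = self.stage1                # second stage sees a settled value
        if mid_transition:
            self.stage1 = random.choice([0, 1])  # metastable; resolves either way
        else:
            self.stage1 = async_in
        return self.stage2                       # safe to use in the new clock domain
```

Note the point made above: the first stage can still capture garbage; the second stage simply gives it a full cycle to resolve.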


 


2. Multi-bit


When transferring a multi-bit signal, however, you must not simply put two flip-flops on each bit. With a multi-bit signal the bits do not all change at the same moment, and even when they do, each bit may independently resolve to low or high, so a completely bogus value can be delivered. For example, if a signal changes from 0000 to 1111 and every bit goes metastable, the receiver might end up with 0101 or 0011, which is not at all what we want. Because the capture happened right at the transition, receiving either 0000 or 1111 would be acceptable, but 0101 and 0011 are values that never actually occurred. We therefore need a scheme that avoids this problem.
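The failure mode is easy to demonstrate with a small Python thought experiment (not hardware, and the independent bit-resolution model is an assumption): if every bit of a 0000 to 1111 transition resolves independently, any 4-bit value can come out.

```python
from itertools import product

# If a 0000 -> 1111 transition is captured mid-change and each bit resolves
# to 0 or 1 independently, any 4-bit value can be delivered, including
# values such as 0101 or 0011 that never existed on the source bus.
before, after = 0b0000, 0b1111
changing = [i for i in range(4) if (before >> i) & 1 != (after >> i) & 1]

outcomes = set()
for bits in product([0, 1], repeat=len(changing)):
    value = before
    for pos, b in zip(changing, bits):
        value = (value & ~(1 << pos)) | (b << pos)
    outcomes.add(value)

print(sorted(f"{v:04b}" for v in outcomes))  # all 16 possible values
```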


There are broadly two methods: the handshake method and the Gray code method. In a handshake, the sender signals that it is sending data and the receiver signals back that it has received it, guaranteeing that the data never changes in the middle of the exchange. This is usually implemented by exchanging req and ack signals.
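A minimal sketch of the req/ack idea, again in Python with invented names rather than real HDL: the sender holds the data stable until the receiver acknowledges it, so the value can never change in the middle of the exchange.

```python
# Minimal four-phase req/ack handshake model. The sender keeps the data
# stable while req is high and only moves on after ack comes back, so the
# receiver never samples a changing value. Illustrative sketch only.
class Sender:
    def __init__(self):
        self.req, self.data = 0, None

    def send(self, value):
        self.data, self.req = value, 1      # drive data, then assert req

    def on_ack(self):
        self.req = 0                        # drop req once ack is seen

class Receiver:
    def __init__(self):
        self.ack, self.received = 0, []

    def poll(self, sender):
        if sender.req and not self.ack:
            self.received.append(sender.data)  # data is guaranteed stable here
            self.ack = 1
        elif not sender.req and self.ack:
            self.ack = 0                       # handshake complete

tx, rx = Sender(), Receiver()
tx.send(0b1111)
rx.poll(tx)      # receiver captures 0b1111 and asserts ack
tx.on_ack()
rx.poll(tx)      # ack drops; ready for the next transfer
```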


A Gray code is a code constructed so that only one bit changes at a time, for example the sequence 0000, 0001, 0011, 0010, and so on. With a Gray code we still use the two flip-flop structure, but now it causes no problem. Suppose the capture happens right as the value changes from 0001 to 0011: only the second bit, which is changing from 0 to 1, can go metastable. The other bits did not change, so the output of the second flip-flop ends up as either 0001 or 0011, and either value is fine.
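The one-bit-at-a-time property is easy to check in a few lines of Python; the sketch below assumes the standard binary-reflected Gray code, though any code with this property works.

```python
# Binary-reflected Gray code: each step changes exactly one bit, so a
# synchronizer that captures a transition sees either the old value or
# the new one, never an invented third value.
def bin_to_gray(n: int) -> int:
    return n ^ (n >> 1)

codes = [bin_to_gray(i) for i in range(16)]
print([f"{c:04b}" for c in codes[:4]])  # ['0000', '0001', '0011', '0010']

for a, b in zip(codes, codes[1:]):
    assert bin(a ^ b).count("1") == 1   # exactly one bit differs per step
```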



//////////////////////////////////////////////////////////////////////////////////////////////////

 

FPGA Clock Schemes

One of the most important steps in the design process is to identify how many different clocks to use and how to route them. This article tells you how to use routing resources efficiently.

Field-programmable gate arrays (FPGAs) have long had a prominent position in the world of signal processing and embedded systems. By implementing calculations in hardware, they can process data at rates much higher than the fastest microprocessors. The increasing flexibility and affordability of FPGAs have also made them attractive alternatives to application-specific integrated circuits (ASICs), particularly in small-to-medium volume applications.


As larger designs are implemented in FPGAs, it is likely that many of them will have multiple data paths running on multiple clocks. An FPGA design that contains multiple clocks requires special attention. Issues to focus on are maximum clock rates and skew, maximum number of clocks, asynchronous clock design, and clock/data relationships.


Time bandits

The first step in any FPGA design is to decide what clock speed is needed within the FPGA. The fastest clock in the design will determine the clock rate that the FPGA must be able to handle. The maximum clock rate is determined by the propagation time, P, of a signal between two flip-flops in the design. If P is greater than the clock period, T, then when the signal changes at one flip-flop, it doesn't change at the next stage of logic until two clock cycles later. Figure 1 shows this.



Figure 1: An undesirable propagation effect


The propagation time is the sum of the clock-to-output delay of the first flip-flop (the time it takes the signal to change at its output after the clock edge), plus the delay of any combinatorial logic between stages, plus the routing delay between stages, plus the setup time of the flip-flop at the second stage. Later in this article, we will look at methods to increase the maximum clock rate supported by a design.
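As a rough illustration of that arithmetic, here is a small Python sketch; all of the delay values are made-up placeholders, not figures from any particular data sheet.

```python
# Rough maximum clock rate from the path delays described above.
# All delay values are hypothetical, for illustration only.
t_co = 1.5e-9     # clock-to-output delay of the first flip-flop (s)
t_logic = 4.0e-9  # combinatorial logic delay between stages (s)
t_route = 2.5e-9  # routing delay between stages (s)
t_setup = 1.0e-9  # setup time of the second flip-flop (s)

propagation = t_co + t_logic + t_route + t_setup  # P in the text
f_max = 1.0 / propagation                         # clock period T must exceed P
print(f"P = {propagation * 1e9:.1f} ns, so f_max is roughly {f_max / 1e6:.0f} MHz")
```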


Each clock used in an FPGA design, no matter the rate of the clock, must have low skew. The skew, S, is the maximum difference in the clock's arrival time at the clock inputs of any two flip-flops. For the circuit to work properly, the skew must be less than the propagation time between the two flip-flops.



Figure 2: An undesirable skew effect


Figure 2 shows what will happen if the skew is greater than the propagation time (S > P). The intent of the circuit is to delay Signal 1 by two rising edges of the clock. However, a change in Signal 1 could propagate through to Signal 3 on the same clock cycle that caused the change at Signal 2. Because S > P, the circuit behaves incorrectly.


Notice that the clock rate does not play into this at all. Even a seemingly innocuous 100 bit/s clock can exhibit skew problems. This means that while FPGA vendors release chips with lower propagation times and higher clock rates, skew problems could potentially get worse, even in designs that do not run at anywhere near the maximum clock rate.


Fortunately, FPGA vendors are well aware of the problems caused by clock skew and offer low-skew routing resources within their chips. These are special routes that offer a guaranteed maximum skew between any two flip-flops within a certain domain inside the chip. Some of these low-skew resources cover the entire chip, while others might only cover a particular row or column of logic blocks within the FPGA. For a design that requires many different clock sources, these low-skew routing resources can become precious commodities.


One of the most serious problems associated with multiple clock designs is when two stages of logic are combined using asynchronous clocks. Asynchronous logic can create metastable states that can seriously degrade the performance of the design or completely destroy its functionality. A metastable state is created when the flip-flop's timing requirements (setup and hold times) are violated. The resulting output of the flip-flop is unknown and can make the entire design nondeterministic. If one stage of logic asynchronously feeds data to another, it is difficult, if not impossible, to meet the setup and hold-time requirements of the flip-flop, as shown in Figure 3.



Figure 3: Metastability


To clarify, if the design contains asynchronous logic, metastability is likely to occur. Care must be taken when dealing with asynchronous sources, as they can cause some of the most difficult problems to find once they make their way into a system.


Multiclock designs

For the remainder of this discussion, let's consider an E3 mux/demux design commonly used in the telecommunications industry. Figure 4 shows the design. The multiplexer receives 16 independent E1 channels, each running at 2.048MHz, from a set of independent line interface chips. These E1 streams are then combined into four E2 streams, each running at 8.448MHz. The E2 streams are then combined into an E3 stream that's transmitted serially at a rate of 34.368Mbps.



Figure 4: E3 multiplexer/demultiplexer


At the receiving end, the demultiplexer performs the inverse operation by extracting the four E2 data streams from the E3 stream and then the 16 E1 streams from the E2 streams. Finally, the E1 streams are transmitted to the receiving line interface chips.


The E1 line interfaces operate independently for both transmit and receive. As a result, the corresponding 2.048MHz clocks could be running at slightly different rates (+/- 20ppm). Also, since most systems transmit and receive data simultaneously, the separate multiplexer and demultiplexer will be accommodating two independent E3 streams (transmit and receive). Consequently, both of the 34.368MHz clocks could be running at slightly different rates.


Since the E2 streams are created on-chip, the E2 muxes may all share the same 8.448MHz clock. However, since the received data rate is independent of the board we are designing (and we cannot assume we are getting data from a source that uses the same clock for all of the E2 muxes), the E2 demux clocks must be allowed to operate at slightly different rates.


Furthermore, let's assume that we have a separate serial peripheral interface (SPI) bus for status and control via a processor that operates with a 1MHz clock.


In all, we'll have thirty-two 2.048MHz clocks, five 8.448MHz clocks, two 34.368MHz clocks, and one 1MHz clock for a grand total of 40 clocks in the design. Clearly, this design will serve as an excellent test case for some of the issues that are covered in the remainder of this article.
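For the record, the tally works out as follows (a trivial Python check of the numbers above):

```python
# Clock count for the example design (transmit and receive paths).
e1_clocks = 16 * 2   # 16 E1 line interfaces, each with independent TX and RX clocks
e2_clocks = 1 + 4    # one shared E2 mux clock plus four independent E2 demux clocks
e3_clocks = 2        # independent TX and RX E3 clocks
spi_clock = 1
print(e1_clocks + e2_clocks + e3_clocks + spi_clock)  # 40
```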


The fastest clock in our example design is the 34.368MHz E3 clock, which determines the minimum speed rating of the FPGA. The maximum clock rate for an FPGA is not trivial to determine, since differences in designs will affect the maximum. However, you can usually find parameters called "Global Clock Setup and Hold Time" and "Clock to CLB Outputs." Summing the two largest values found in the data sheet and adding 25% will result in roughly the minimum allowable clock period. It is good practice to allow for at least a 10% overhead on the maximum clock rate to guarantee operation over temperature. Therefore, in our example, we set our minimum allowable speed rating to 40MHz. Most newer FPGAs can easily support this frequency. In fact, FPGA vendors are advertising rates in excess of 300MHz.
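To make the rule of thumb concrete, here is a hedged back-of-the-envelope calculation in Python; the two data-sheet numbers are placeholders, not values for any specific part.

```python
# Back-of-the-envelope minimum clock period, per the rule of thumb above.
# The two data-sheet parameters below are hypothetical placeholders.
global_clock_setup_and_hold = 3.0e-9  # "Global Clock Setup and Hold Time" (s)
clock_to_clb_outputs = 4.0e-9         # "Clock to CLB Outputs" (s)

min_period = (global_clock_setup_and_hold + clock_to_clb_outputs) * 1.25  # add 25%
est_max_rate = 1.0 / min_period

required = 34.368e6 * 1.10  # E3 clock plus ~10% margin for temperature
print(f"Estimated f_max ~ {est_max_rate / 1e6:.0f} MHz; need at least {required / 1e6:.1f} MHz")
```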


Once you have settled on an FPGA that meets your maximum frequency requirement, you must ensure that the device has enough room to fit your design comfortably. If the FPGA is almost full, there may not be enough routing resources available to meet the timing constraints of the design. Be advised that the advertised rate is the best-case scenario rate. FPGA vendors typically suggest that the FPGA logic can go to about 80% full before the routing performance starts to degrade noticeably. Choosing to err on the side of caution, I recommend starting most new designs with the FPGA logic about 50% full. This allows slop for any errors made in calculating the initial design size and gives room for the inevitable "feature creep" that occurs after a design is started. If the final design occupies less than 50% of the FPGA's real estate, you can then achieve a cost reduction by going with a smaller FPGA in the same family.


The routing of the fastest clock in the design can be improved by specifying the rates on the slower clocks via timing constraints. In our multiplexer example, if you tell the FPGA routing tool that the SPI bus clock is 1MHz while the E3 clock is 40MHz, then it will attempt to lay the circuit out such that all the logic blocks that use the E3 clock are close together. If any circuits need to be spread out due to space limitations, it will first spread out the SPI logic since that can handle longer propagation delays. All FPGA vendor routing tools have methods for specifying these slower clock rates.


Local or global

As mentioned previously, all FPGAs have special low-skew resources for routing clocks. Some devices have dedicated clock inputs that feed directly to the low-skew resources, while others have low-skew buffers inside the FPGA that can be sourced from anywhere. Some clock resources are global and can feed logic to any part of the device. Other resources are local and can only distribute a clock in a certain area of the device. It is important to make sure that clocks that drive many gates are on global nets. A clock that drives only a small amount of logic, like the SPI clock in our example, can go on a local net.


Despite its importance in choosing the right device, it is not always easy to tell how many clock resources are on a particular chip from a perfunctory examination of its data sheet. Clock resource information is usually found under "Clock Distribution" or "Global Clock Buffers" on FPGA data sheets. The Xilinx Spartan II 100, for example, has four global clock resources and 24 local resources. The Altera APEX 20K 100 contains eight global clock resources and no local resources. The newer Xilinx Virtex II 70 contains 16 global clocks and no locals, while by comparison the Altera APEX II 70 contains eight global clocks and no locals. Some special resources such as DLLs (delay-locked loops, a Xilinx term), PLLs (phase-locked loops, an Altera term), and clock multipliers use up some of the low-skew resources, so keep this in mind if you need to use these types of functions.


Upon inspection of the FPGA market, it appears there is no device with adequate clock resources to support the 40 clocks needed for our mux/demux design. So we'll need to use some tricks to reduce the number of clocks used in our design.


Reducing clocks

Let's begin by looking at the clocks for the E2 and E3 multiplexers. We have already stated the acceptability of running all four E2 multiplexers with the same clock; the E3 multiplexer, running at a much higher rate, must have a different clock. However, what if we derive the E2 clocks from the E3 clock? Since we know when the E3 multiplexer wants data from each of the E2 tributaries, we can simply send clock pulses to each E2 multiplexer when we want it to give us data. We have not yet eliminated any clocks, but the E2 clocks are now based on the E3 clock.


What if we also use the same clock on all the multiplexers and just use an enable signal to tell the E2 multiplexers when to operate? If the E3 mux creates the enable signals from its 34.368MHz clock, skew on these enable signals is not any more of a problem than it is for any other synchronous logic in the FPGA. Therefore, the enable signals can use normal (high-skew) routing, and we have just eliminated the need for the separate 8.448MHz multiplexer clock! The same concept can be used for reading the data out of the E1 data buffers. In other words, if the E2 multiplexer requires data, it can activate the enable signal to that particular buffer. The clock to the buffer itself can retain the 34.368MHz clock used by the E3 multiplexer, as seen in Figure 5.
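As a loose illustration of the enable-signal idea (a behavioral Python toy, not HDL, with invented names and data), every register runs on the one fast clock and simply ignores the cycles on which its enable is low:

```python
# Toy model of clock-enable behavior: every register is clocked by the same
# fast clock, but only updates on cycles where its enable is asserted.
class EnabledRegister:
    def __init__(self):
        self.q = 0

    def tick(self, d, enable):
        """One rising edge of the shared 34.368 MHz clock."""
        if enable:
            self.q = d
        return self.q

e2_reg = EnabledRegister()
data_in = [1, 0, 1, 1, 0, 1]
enables = [1, 0, 0, 0, 1, 0]  # the E3 mux asserts enable when it wants E2 data

outputs = [e2_reg.tick(d, en) for d, en in zip(data_in, enables)]
print(outputs)  # the register only changes on enabled cycles
```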



Figure 5: E3 MUX clock usage


Finally we examine the 16 E1 clocks entering the FPGA from the line interface chips. These clocks present several problems. First, 16 clocks would consume too much of the available clock routing resources. Second, with 16 asynchronous clocks driving flip-flops (in close proximity) within the same FPGA, one may experience noise problems due to ground bounce, crosstalk, and other effects. For example, a positive edge flip-flop can change state on the falling edge due to double clocking caused by noise. Such a problem is very difficult to isolate.


As a possible solution, we recommend using one fast clock to sample the 16 E1 clocks. To clarify, we know that each of the 16 input clocks will be close to 2.048MHz, and we have a 34.368MHz system clock. We can therefore sample each of the 16 E1 clocks with the 34.368MHz clock and store the results into a 16-bit register (one bit per E1 clock). Then we can employ an algorithm to detect low-to-high transitions on the E1 clocks, create an enable signal for each of the E1 data signals, and register the data (with the 34.368MHz clock) on the next cycle.


For this scheme to work, we must know the clock-to-data relationship ahead of time, to ensure that we never sample the data while it may be changing, as illustrated in Figure 6. Note that two flip-flops are used in the first stage of the clock sampling circuit in order to tolerate metastability (more on this later). Also note that the data must be delayed by the same number of clock cycles as the clock.



Figure 6: Clock oversampling in the multiplexer
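The sampling-and-edge-detect scheme can also be sketched in a few lines of Python. This is a behavioral model with invented names and test vectors, showing one E1 channel only; it is not the article's actual implementation.

```python
# Behavioral sketch of oversampling one slow E1 clock with the fast system
# clock: double-register the sampled clock, detect a low-to-high transition,
# and use that as a one-cycle enable to register the (equally delayed) data.
def oversample_e1(clk_samples, data_samples):
    sync = [0, 0]        # two flip-flops to tolerate metastability
    prev = 0             # previous synchronized clock value
    data_pipe = [0, 0]   # delay the data by the same two cycles as the clock
    captured = []
    for clk_s, dat_s in zip(clk_samples, data_samples):
        if sync[1] == 1 and prev == 0:     # rising edge detected on the E1 clock
            captured.append(data_pipe[1])  # enable: register the E1 data
        prev = sync[1]
        sync = [clk_s, sync[0]]            # shift the synchronizer stages
        data_pipe = [dat_s, data_pipe[0]]  # matching delay on the data path
    return captured
```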


We have successfully reduced the mux side to merely one clock. Can something similar be done on the demux side? The input clock for the E3 demultiplexer must be an external input, because the same off-chip device that drives the E3 input data sources this clock. Since the E3 demultiplexer knows when it is sending data to the E2 demultiplexers, it can create enable signals for each one, and the four E2 demultiplexers can all run on the same clock as the E3 demultiplexer. Likewise, the E2 demultiplexers can create enable signals for the individual E1 streams.


If we assume that the line interface chip can accept gapped clocks (many of them can), we need only send a clock pulse to the line interface whenever the E1 enable signal is asserted. However, it is simpler just to send the enable signal itself to the interface chip without worrying about creating a new clock. Since the data sent to the interface will change on the falling edge of the enable signal (refer to Figure 5), we need to make sure that the interface will sample on the rising edge of the clock. Since the enable signal is used as a clock only at the line interface chip and not within the FPGA, there is no need to route it on a low-skew resource. Note that this only works when you know you will never have a case where you will want to send data to the same tributary on successive pulses of the master clock.


The 1MHz SPI clock cannot be easily eliminated, but we have now reduced our original requirement of 40 clocks to just three by using enable signals and clock oversampling techniques. We can now implement our design on a much wider range of devices.


Asynchronicity

Before creating any logic with asynchronous clocking, you should exhaustively consider the alternatives. Combining logic stages with asynchronous clocks is a dominant source of problems. Again, when a flip-flop's setup and hold time constraints are violated, its output becomes unpredictable for a short amount of time and will eventually settle to a "1" or "0". Which state it will settle in is impossible to predict.


In the July and August 2001 issues of Embedded Systems Programming, Jack Ganssle wrote a pair of columns on asynchronicity and metastability that should be required reading for any logic designer. Please refer to those articles for an in-depth look at metastability.


Fortunately, there are some solutions to the problems of metastability. The solution illustrated in Figure 6 is the double-registering technique. Data coming into the first flip-flop is asynchronous with the clock, so the first flip-flop will almost certainly go metastable. However, the second flip-flop will never go metastable as long as the length of metastability is less than the period of the clock. (Unfortunately, FPGA vendors rarely publish metastability times, though they are typically less than the sum of the set-up and hold time of the flip-flop.)


If the clock is not too fast to meet normal timing constraints, it is probably not going to propagate metastable states in a circuit such as Figure 6. Even though the output of the first flip-flop can be used as long as all of the paths out go to flip-flops clocked by the same clock, it is generally good practice to use a circuit such as that in Figure 6 to isolate metastability to one short line. That way, it is less likely that a future change to the circuit will unintentionally use the metastable line in nonclocked logic.


If the data being read is a counter, such as read or write addresses from an asynchronous FIFO, you should consider the following caveat. A traditional 3-bit counter can have one, two, or three bits changing between states. For example, if the read occurs at the instance when the counter is changing from "011" to "100," then the state of all three bits is unknown, and the read value can be any of the eight possible states.


Table 1: A 3-bit Gray code

000
001
011
010
110
111
101
100


If the counter is built using a Gray code, such as the one in Table 1, then only one bit can change from one state to the next. If the read occurs at the instance that the counter is changing, then only one bit will be in question, and there are only two possible outcomes to the read operation. Furthermore, the two possible values will be the value of the counter just before the read and the value of the counter just after the read. Since the read occurred at a time when the counter was in transition, it is impossible to say with certainty that one value is correct while the other is not. In other words, either value should be considered valid.


Another way to avoid problems with asynchronous clocks is to ignore the slower clock and sample the data with the faster clock (if you know which one is faster). This requires that the data have special framing characters (a preamble, for example) to define the data boundary. This is a common practice and can be found in nearly every embedded system in the form of a UART. A very fast clock, say 16 times the data symbol rate, will sample until 15 consecutive samples of the start bit are found. The design then declares that the next 16 (or so) samples correspond to the first bit sent, the next 16 (or so) samples are the next bit, and so on.
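A hedged sketch of that oversampling idea in Python (the helper name, parameters, and majority vote are my own illustration, not a real UART):

```python
# Oversample an asynchronous serial input with a clock ~16x the symbol rate:
# hunt for a run of start-bit samples, then treat each following group of 16
# samples as one data bit (decided here by majority vote).
def recover_bits(samples, oversample=16, start_run=15, n_bits=8):
    i = run = 0
    while i < len(samples) and run < start_run:  # hunt for the start bit (low)
        run = run + 1 if samples[i] == 0 else 0
        i += 1
    bits = []
    for _ in range(n_bits):                      # then slice bit periods
        window = samples[i:i + oversample]
        if len(window) < oversample:
            break
        bits.append(1 if sum(window) > oversample // 2 else 0)
        i += oversample
    return bits
```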


Countdown

As FPGAs increase in density and become more affordable, the number of embedded products that we see using FPGAs will continue to grow rapidly. As these designs get more complex, a lot of design issues stemming from multiple clocks are sure to arise. Paying special consideration to the maximum clock rate, the number of clocks, and asynchronous clocks in the initial design phase can alleviate many problems. These same factors can then be used to pick an FPGA that suits all the needs of the design.


Tim Behne has worked in embedded systems for nearly a decade. He has a BSEE from Texas A&M University and an MSCE from the University of Houston--Clear Lake. Tim is currently the manager of software and signal processing at Microwave Networks. You may contact him at timothyb@microwavenetworks.com.


Note

Tim would like to thank Tom Cassaro for his contributions to this article.