miniPHY

Project Logs

YGMII-7
Yann Guidon / YGDES • 3 days ago • 0 comments
I think I have found a decent compromise that uses only 7 signals and "some amount of TMPI".
- 2 wires provide the 4-phase clock in ascending or descending Gray code (the difference is simply which wire is connected where). The direction (given by the value of one wire during the first transition) is a flag for "data valid", one of the 4 states in the QSID protocol. When it's in Data state, the payload is the 20-bit data, otherwise it contains other information (such as Q/S/I or even config data for drive/frequency/whatever).
- 5 wires transmit 18 data bits and 2 TMPI bits. This means that the TMPI width is 9 bits.
- The data are TMPI-AC coded so any invalid clock sequence will reset the AC register.
And now, let's make a Popcount9 circuit:

That was almost easy.
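To give an idea of what the Popcount9 feeds, here is a minimal software sketch. It assumes that TMPI behaves like classic bus-invert coding applied to one 9-bit group: the group is sent inverted whenever that toggles fewer wires than sending it as-is, and one flag bit tells the receiver which polarity was used. The names `tmpi_encode`, `tmpi_decode` and `transitions`, and the way the flag wire is counted, are hypothetical and only serve the illustration; the real circuit may differ.

```python
MASK9 = 0x1FF  # nine data bits per TMPI group (assumption taken from the text)

def popcount9(x: int) -> int:
    """Number of set bits among the low 9 bits."""
    return bin(x & MASK9).count("1")

def tmpi_encode(data: int, prev_wires: int = 0, prev_flag: int = 0):
    """Return (wires, flag), choosing the polarity that toggles the fewest wires.

    prev_wires/prev_flag model the AC register; (0, 0) is the cleared state.
    """
    plain = data & MASK9
    inverted = plain ^ MASK9
    # Count toggles on the 9 data wires plus the flag wire itself.
    cost_plain = popcount9(plain ^ prev_wires) + (prev_flag != 0)
    cost_inverted = popcount9(inverted ^ prev_wires) + (prev_flag != 1)
    if cost_inverted < cost_plain:
        return inverted, 1
    return plain, 0

def tmpi_decode(wires: int, flag: int) -> int:
    """Undo the optional inversion."""
    return (wires ^ MASK9) if flag else (wires & MASK9)

def transitions(wires: int, flag: int, prev_wires: int, prev_flag: int) -> int:
    """Total number of wires (data + flag) that change level."""
    return popcount9(wires ^ prev_wires) + (flag != prev_flag)
```

With this model the two polarities always cost 10 toggles together, so the chosen one never toggles more than 5 of the 10 wires.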
Less obvious now is how to make an asynchronous quadrature decoder.
Now, this deserves a proper individual project.
YGMII (draft)
Yann Guidon / YGDES • 4 days ago • 0 comments
The TT IHP26a experience has taught valuable lessons...

One of them is the need to transmit data reliably with few pins at high speed and efficiently yet easily/simply.

I'd like to have a 6-pin "link" interface à la SHARC but it does not look possible or at least practical, 6 is too tight. But 7 or 8 looks good.

For comparison, RGMII uses 6 signals in each direction (source clocking included) plus 2 config pins (MDC/MDIO) for a total of 14.

But I'm also throwing some TMPI in the lot to reduce power draw and EMI so that's one more pin.

I have come up with this 5-bit popcount

The argument being :
- 18 bits of data (16+C/D+parity)
- 2 bits of link status (Quiet/Synching/Idle/Data)
that's 20 bits = 5×4

5 being odd, it is ruled out as the number of words; 4, however, is perfect.

So we get 5 bits of data, plus 1 bit of TMPI and 2 bits of staggered/4-phase clock, to keep transitions low.

There would be a TMPI-AC: it's a time-differential transition minimisation so we need to "clear" the initial value. This is done by the clock signals that do not complete a full 4-phase cycle, just doing a pulse with all-0s (or something).

.

(state machine diagram here)

00 => 01 => 00 : clear the ACcumulator

00 => 01 => 11 => 10 => 00 : data without inversion

00 => 10 => 11 => 01 => 00 : data with inversion
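The three sequences above can be sketched as a small software classifier. This is only a behavioural model under stated assumptions: the clock pair is given as 2-character strings, a sequence is considered finished when the pair returns to `00`, and anything that matches none of the three paths is reported as `invalid`. The names `PATTERNS` and `decode_clock` are hypothetical.

```python
# The three legal paths of the 2-wire clock, as listed in the text.
PATTERNS = {
    ("00", "01", "00"): "clear",
    ("00", "01", "11", "10", "00"): "data",
    ("00", "10", "11", "01", "00"): "data_inverted",
}

def decode_clock(states):
    """Turn a stream of sampled clock pairs into a list of events.

    Repeated samples of the same pair are ignored (no transition).
    Each return to "00" closes the current path and classifies it.
    """
    events = []
    path = []
    for s in states:
        if path and s == path[-1]:
            continue  # same level as before, nothing happened
        path.append(s)
        if s == "00" and len(path) > 1:
            events.append(PATTERNS.get(tuple(path), "invalid"))
            path = ["00"]  # start the next path from rest
    return events
```

For example, `decode_clock(["00", "01", "00", "01", "11", "10", "00"])` yields a `clear` followed by a `data` event.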

.

With 4-phase signals, there is the potential of transmitting another bit by changing the phase / swapping the signal that sends the first rising edge => this can encode a 2nd-level TMPS bit, for extra-lower power draw.

So it is actually transmitting 21 bits over 8 wires in 4 cycles, which is 21/32 of efficiency but with much fewer transitions.

Now there is the challenge of designing a 20-bit popcount.

Maybe I could even drop the 5-bit popcount that adds another wire but it's too easy/alluring to pass... Saving one pin is great anyway, I'll have to check how it reduces power draw.

.

Just for the record here is the 6-bit TMPS circuit

...

Anyway, for that MAJ20 circuit...

I can easily have four 3-bit partial sums and they must be added together. The circuit must detect a value of 10 or more, requiring 4 bits. No need to compute the 5th bit if a custom adder is used.
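As a reference model for checking such a circuit, here is a minimal sketch. It assumes that the 20 bits are split into four groups of 5 bits (one per clock phase), each giving a 3-bit partial sum, and that the output is simply "10 or more bits set". The names `popcount5` and `maj20` are hypothetical, and the custom adder that skips the 5th bit is not modelled.

```python
def popcount5(x: int) -> int:
    """3-bit partial sum (0..5) of the low 5 bits."""
    return bin(x & 0x1F).count("1")

def maj20(word: int) -> bool:
    """True when at least 10 of the 20 bits are set."""
    partial_sums = [popcount5(word >> (5 * i)) for i in range(4)]
    return sum(partial_sums) >= 10
```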

...

I get this circuit

I have not found how to combine both the word-level and quibble level TMPI. There are now 2 possibilities :
- use 7 bits per quibble with a word-level TMPI. The transition gains are low but you can't make the interface smaller.
- use 8 bits per quibble with a dedicated TMPI per quibble, providing less violent bursts of close transitions. required when EMI gets difficult.
The most attractive version at first glance would be the 7-bit version but the 8-bit version would be required for extra performance. Combining both would be rad though.
Interfacing: the links
Yann Guidon / YGDES • 03/03/2026 at 22:53 • 0 comments
The QSDE state machine is replaced by QSLD :
- Quiet (no signal on the line, disconnected or at rest, or request to close the line)
- Synching (trying to advertise presence, or presence detected)
- Locked (acknowledge that PLLs are synced, ready to send data, or data not ready)
- Data (data transmitted or received)
Still 4 simple states, still 2 bits.

|   | b1 | b0 |
|---|----|----|
| Q | 0  | 0  |
| S | 0  | 1  |
| L | 1  | 0  |
| D | 1  | 1  |

The interesting part is that this pair of bits is both a state and a command, so it's present at the input and output of the units (MAC, PHY ...)

(insert state machine diagram)

Errors go back to Synching, and Locked is used when no data is present but the link remains active (sort of Idle but not to be confused with Quiet).

MAC-PHY link

Due to the usual high cost/limitation of pins, the link uses a dual-edged clock and keeps the meta/control information to the minimum. All communication uses the QSLD state, which needs only one wire in DDR.

For the PHY-MAC link, we get 5 wires :
- CLK active high then low
- QSLD sampled 2× : CLK rising => b0, falling => b1
- Data[2:0] expanded to 6 bits, CLK rising => Data[2:0], falling => Data[5:3]
Notable feature : the QSLD wire toggles only in Sync or Locked states. If tied to 0 it's quiet/disabled, if 1 then data is flowing (the desired state).
3 cycles are required to transmit 18 bits. There is no clear start (no room for a framing bit) so it is implied by going from the Locked state to the Data state. This resets a free-running 3-state counter, incremented for each new Data cycle. Going back from D to L when the counter is not 0 is an error and the peer will reply with Sync.
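The framing rule can be sketched as a behavioural model. Several things here are assumptions for illustration only: the least-significant 6 bits are sent first, any transition into the Data state (not just from Locked) restarts the counter, and the names `Link`, `qsld_wire`, `serialise` and `Receiver` are hypothetical.

```python
from enum import IntEnum

class Link(IntEnum):
    Q = 0b00  # Quiet
    S = 0b01  # Synching
    L = 0b10  # Locked
    D = 0b11  # Data

def qsld_wire(state):
    """(rising, falling) samples of the single QSLD wire: b0 then b1."""
    return state & 1, (state >> 1) & 1

def serialise(word18):
    """Split an 18-bit word into 3 cycles of (rising, falling) 3-bit groups.

    Assumption: the least-significant 6 bits are sent in the first cycle.
    """
    cycles = []
    for i in range(3):
        six = (word18 >> (6 * i)) & 0x3F
        cycles.append((six & 0b111, six >> 3))
    return cycles

class Receiver:
    """Rebuilds 18-bit words with the 3-state counter described in the text."""

    def __init__(self):
        self.prev = Link.Q
        self.count = 0
        self.acc = 0
        self.words = []
        self.error = False

    def clock(self, state, data_rise=0, data_fall=0):
        """Process one full CLK cycle (both edges)."""
        if state == Link.D:
            if self.prev != Link.D:
                self.count, self.acc = 0, 0  # entering Data restarts the frame
            six = (data_fall << 3) | data_rise
            self.acc |= six << (6 * self.count)
            self.count += 1
            if self.count == 3:
                self.words.append(self.acc)
                self.count, self.acc = 0, 0
        else:
            if self.prev == Link.D and self.count != 0:
                self.error = True  # word cut short: the peer would reply with Sync
            self.count, self.acc = 0, 0
        self.prev = state
```

`qsld_wire` also shows the notable feature mentioned above: the wire stays at 0 for Q, stays at 1 for D, and only toggles within a cycle for S and L.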

When in states Q, S or L, the data bits are "don't care" though they could be used to send configuration or auxiliary data, one day.

.

MAC-host link

Same constraints so a DDR interface is used as well. But due to the larger and slower bus, the data path is 8 bits. An extra bit works as Control/Data flag during one phase but is not used in the other phase (for now).

Total: 11 wires
- CLK active high then low
- QSLD sampled 2× : CLK rising => b0, falling => b1
- Data[7:0] expanded to 16 bits, CLK rising => Data[7:0], falling => Data[15:8]
- C/D : sampled at rising edge, added to the 16 data bits. The falling edge value would be used only during the first tests to emulate the parity bit.
A whole word is transmitted every cycle. There is no counter to take care of. The QSLD state is per-word.

Since everything is "clock sourced" and each device in the chain has its own local oscillator, the overall system is considered asynchronous.

Normally the transmitter should do its best to keep the link in Data mode, resorting to the Control type words to fill the bandwidth with padding control words. The Locked state may bypass the scrambler and prolonged L state would affect EMI & BLW. But for now, let's consider we link two MAC together without going through a PHY (which is actually the current situation).

.

The "parallel" interface amounts to 9 bits per word, the "nibble" interface to 3 bits, so there is a 3× speed difference. The circuits need (double-)buffering, probably a "Hold" signal...

.
Decoder architecture
Yann Guidon / YGDES • 12/12/2025 at 06:47 • 0 comments
These days, the ideas are
- combine the running sum with the error detection/correction bits
- borrow some design principles from TCM
- Implement the Viterbi-like decoder using an array of asynch gates, a sort of physical trellis
- move to 3 or 4 bits per sample instead of a simple 3-level input
- let the asynch gates trellis handle clock resynch
The size of the codes are still undetermined. The ratio could at worst be 1B1T, at best 3B2T or so.

.

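|          | 1T (3) | 2T (9) | 3T (27) | 4T (81) | 5T (243) | 6T (729) |
|----------|--------|--------|---------|---------|----------|----------|
| 1B (2)   |        | X      | X       | X       | X        | X        |
| 2B (4)   | X      |        | X       | X       | X        | X        |
| 3B (8)   | X      |        |         | X       | X        | X        |
| 4B (16)  | X      | X      |         |         | X        | X        |
| 5B (32)  | X      | X      | X       |         |          | X        |
| 6B (64)  | X      | X      | X       |         |          |          |
| 7B (128) | X      | X      | X       | X       |          |          |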

.

There is an "upper diagonal" (quite arbitrary) where B=T, and a lower slope forced by 3^T >= 2^B (the trits must provide at least as many codes as the bits).
If possible the words should be short to keep the circuit small but this reduces the encoding efficiency.

The remaining candidates are
- 1B1T : huh...
- 2B2T : like above.
- 3B2T : 9/8=1.125 => very little margin for ECC and no way to fight baseline wander
- 4B3T : 27/16=1.6875 => first efficient BLW code but no room left for ECC
- 4B4T : 81/16=5.0625 => one added trit for ECC but the internal state is getting very large !
- 5B4T : 81/32=2.53125 => looks interesting, there is a bit more than 1 bit of extra data
- 5B5T : 243/32=7.59 looks overkill, almost 3 bits of added data
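The ratios above are easy to recompute, and taking their log2 gives the number of "spare" bits each code leaves for ECC and BLW. A minimal sketch (the names `CANDIDATES` and `spare` are hypothetical):

```python
from math import log2

# (bits, trits) pairs from the list of remaining candidates
CANDIDATES = [(1, 1), (2, 2), (3, 2), (4, 3), (4, 4), (5, 4), (5, 5)]

def spare(bits: int, trits: int):
    """Return (ratio, spare_bits): codes available per data value, and its log2."""
    ratio = 3 ** trits / 2 ** bits
    return ratio, log2(ratio)

if __name__ == "__main__":
    for b, t in CANDIDATES:
        ratio, extra = spare(b, t)
        print(f"{b}B{t}T: ratio {ratio:.4f}, spare {extra:.2f} bits")
```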
So how many bits do we need for both ECC and BLW? => this sets the ratio between B and T
And what could/should the ADC resolution be? => number of pins, 4 pins / 16 values looks like the maximum, but also 4 pairs makes a 5-value Flash ADC, that's 8 pins already...
Ternary Viterbi
Yann Guidon / YGDES • 12/02/2025 at 04:53 • 0 comments
Over at the #miniMAC project, there was the log "108. Error correction" and the realisation that the channel coding should be done directly in ternary. The encoder is pretty straightforward (see above log) but the decoder is a different beast. Still, I know it's possible since GbEthernet uses it (TCM) over 4 simultaneous channels so there should be a way, right ?

http://pl91.ddns.net/viterbi/algrthms2.html has some good ideas for the binary case and I don't want the system to get out of hand (complexity, size, latency). But I'm somehow glad that others have already studied the subject of conversion to ternary.
- 2023 https://www.ijfmr.com/papers/2023/2/1757.pdf "Design of Ternary Convolutional Code Using Reconfigurable Architecture"
- 2017-2021 : https://dspace.library.uvic.ca/server/api/core/bitstreams/faafcf78-7e00-4fe4-b599-95f97f9cafa5/content by Bharath Rao Madela
- https://arxiv.org/pdf/2209.01360 Henri Mertens and Marc Van Droogenbroeck: "Error-rate in Viterbi decoding of a duobinary signal in presence of noise and distortions: theory and simulation", 2022
- 2002-2014 Khmaies Ouahada: several publications starting from the thesis, university work and more later. Found also on ResearchGate, great reading, that seems to converge to several of my conclusions.
- https://www.researchgate.net/publication/266389989_Viterbi_Decoding_of_Ternary_Line_Codes
- https://www.vodafone-chair.org/pbls/legacy/gerhard-fettweis/High-rate_Viterbi_processor_a_systolic_array_solution.pdf (not ternary though)
So the subject is not novel (which is both sad and great). I have actual data to crunch. Note that several studies just do the FPGA compilation and simulation tests but real field tests are lacking, I would love to see the actually measured waveforms. Where are the oscilloscopes ?

.

"Soft-decision Viterbi" seems to point to using more input bits, like a 3- or 4-bit flash ADC but this becomes impractical. Actually, that's where TCM leads us. And with 2 input bits, that's a 4-bit ADC. If the 2 differential inputs are used as 4 single-ended ones, that's a 4-bit, 16-level Flash ADC.

.

So far the idea is to do the DSP part in ternary though I doubt I could achieve Viterbi decoding at 50M trits/second (equivalent to about 60Mbps of useful bandwidth). Some parallelism becomes necessary. And other tricks too.

.

Anyway, using a convolution code solves quite a few things and the Viterbi decoder is somehow simplified (a tiny bit) because the binary-to-ternary (3B2T) conversion leaves one code unused (8 out of 9 codes). However the mechanism against baseline wander goes out of the window. Unless I can make a wander-reduction mechanism that also acts as a convolution code ? Then the extra data works for 2 effects ?
Gearbox
Yann Guidon / YGDES • 11/30/2025 at 21:01 • 0 comments

The latest log over at #miniMAC - Not an Ethernet Transceiver, "108. Error correction", brings an interesting idea for the adjustment of transmission parameters : using a LUT with different GCR patterns, such that the speed is not the only adjustment knob to turn.

The LUT can be hardwired or reprogrammable and contains basic binary patterns (such as tweaked Manchester codes) or ternary (MLT3, 4B3T, 3B2T...)
A link is established with the basic code and frequency increases until the error rate becomes significant, then the GCR LUT is changed to a more efficient one as long as the peer supports it.
I'm not sure yet how to classify the FEC in the pipeline, it would be an extra optional layer...
4B3T
Yann Guidon / YGDES • 07/10/2025 at 19:35 • 0 comments
Baseline wander (BLW) is a very concerning and underrated problem, covered in the last log, "4B4T: An extended ternary Manchester code and its implications".

MLT-3 has two compelling aspects:
- the spectrum is shifted toward lower frequencies (EMI compliance) and
- the coder/decoder looks quite simple. Roughly.
The MLT-3 encoder is very simple but the decoder is very sophisticated due, indeed, to BLW. So the spectrum spread and the AFE complexity are linked: higher code efficiency and reduced EMI require more high-speed DSP effort.

This project's focus is on low cost, ease of implementation with very affordable and simple parts, not absolute speed, and the receiver's AFE must be kept as simple as possible, to keep the BOM very low. It's not possible to do much analog magic, like active filters. One of the first logs (AGC) shows that the AGC could be done almost with passive parts (and a few diodes) but I'm not sure this feat can be repeated for other functions.
If needed, the clock speed can be reduced to conform to EMI rules. But it is important that overall noise and BLW remain low, low enough that analog filtering is almost unnecessary. I'd say that the level of BLW must be below 1/2 the difference between 0 and +.
This is why I should evaluate the performance of the 4B3T FoMoT code and see if it fits my requirements. The 4-level Running Digital Sum (RDS) (actually -3/+3) looks promising.
I also consider a protocol (passive/implicit or active/explicit) to "drift" the clock and adapt to the line's capacitance/inductance. Transmission would start at a standard, low frequency (10MHz ?) and slowly increase as long as both receivers see some "margin".
4B4T: An extended ternary Manchester code and its implications
Yann Guidon / YGDES • 06/04/2025 at 23:56 • 0 comments
So the last log re-emphasised the importance of BaseLine Wander on the design of the AFE.

Modern designs have sophisticated hyper-fast ADCs and perform complex DSP to compensate for many line effects including droop and BLW. This is totally out of the realm of possibility, the miniPHY must be very simple.

On the other end of the spectrum, 10Mbps Ethernet uses Manchester code which is very inefficient: 2baud/bit, the bit value is followed by its inverse. However it has a wonderful property: there is no space for BLW, as each code is "neutral" by definition.

The hybrid ternary code (HTC) has an intriguing and very simple encoding scheme with 1bit/baud. Not great, not bad, it's a baseline.

The 3B2T code (9 symbols) is pretty efficient (density/packing=1.5bit/baud) but the balance/neutrality is data dependent. Trying to preprocess the data to prevent unwanted patterns is hard, expensive, ... The hardware overhead is significant (it adds latency and bloats the circuit) and the packing density is still unclear: adding one bitrit worth of information (8 symbols) to 7 bitrits would reduce BLW to "a certain amount" but it's still too data-dependent and has insufficient effect/leverage. 1/7th overhead (14%) can not ensure DC balance in all cases.

4B3T has a slightly worse density (1.3b/baud) but can ensure DC balance, using a 3-bit running disparity counter, a reasonably-sized LUT and a pretty simple decoder. It links consecutive words/nibbles but it looks like it's the smallest such scheme, simpler than the 2-LUT 8b/10b system.
Some interesting analysis can be found at Block_Coding_with_4B3T_Codes
Let's say "it's interesting"...
...

But what if we don't want to link nibbles ? We end up with needing a scheme where all the codes are DC balanced, just like Manchester. HTC (see above) also has to link consecutive bits to work. In ternary, we can also get the equivalent of Manchester with a triplet of codes : +- / -+ / 00 But then the long runs of 0s must be prevented. So it's basically Manchester (2baud/bit), with S code.

Going to 3 trits, we get 6 non-null codes: +0- / -0+ / 0+- / 0-+ / +-0 / -+0 which amounts to roughly 2.5b/3T. Not great.

Four trits gets interesting though : 9 non-zero invertible codes (18 total) gives something like 4B/4T:
```
00+-  +00-  +-00  /  00-+  -00+  -+00
0+0-  0+-0  +0-0  /  0-0+  0-+0  -0+0
++--  +--+  +-+-  /  --++  -++-  -+-+
```
This gives 16 data codes, 2 control codes and one "quiet/silent/same" marker. This almost looks like something!
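The code counts for 2, 3 and 4 trits can be checked by brute force. A minimal sketch that enumerates every word whose levels sum to zero and leaves out the all-zero word (the name `balanced_codes` is hypothetical):

```python
from itertools import product

SYMBOL = {1: "+", 0: "0", -1: "-"}

def balanced_codes(n: int):
    """All n-trit words with a zero level sum, excluding the all-zero word."""
    return [
        "".join(SYMBOL[t] for t in word)
        for word in product((1, 0, -1), repeat=n)
        if sum(word) == 0 and any(word)
    ]

if __name__ == "__main__":
    for n in (2, 3, 4):
        codes = balanced_codes(n)
        print(n, "trits:", len(codes), "codes:", " ".join(codes))
```

The all-zero word is kept apart because it plays the role of the silent/same marker.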

Packing-wise, it has 33% overhead compared to 4B3T so the data bandwidth drops by 25%. It is stateless though and the LUT is smaller.

But compared to HTC, the density is almost the same: 1b/baud ! The control codes are nice but not a significant bandwidth concern and HTC is way simpler.

I intend the miniPHY to have various (incompatible) versions so it is good to start with the simplest possible code. HTC does not have a "Same/Silence" code though that helps with the signalling and protocol so let's skip it.

So the development course would be :
1. Start with 4B4T, simple/easy/low bandwidth which can be implemented in either 2send/2receive or 1send/3receive if bandwidth matters, and see how it works in practice.
2. Increase the bandwidth usage with 4B3T, as a simple upgrade on the FPGA side
3. Meanwhile, see if I can figure out a balancing scheme to retrofit into 3B2T with a smaller overhead than 4B3T.
This whole analysis has brought a lower bound of coding overhead to bring DC balance. Looking at 8B6T, it does not look like this packing ratio can be easily improved.

From there, if a line frequency of 30MHz can not be exceeded, and 1MHz=2baud,
1. 4B4T will bring about 60Mbits per lane (hypothetically and unlikely, let's say 20Mbps)
2. 4B3T increases to 80Mbps (ok let's say 25 or 30Mbps)
3. 3B2T could reach 90 (25 to 33Mbps in good conditions)
The cool thing with a custom miniPHY is that the clock frequency could be adjusted according to the line's characteristics (length, capacitance...) and we could add lanes...
Reverse antibias
Yann Guidon / YGDES • 06/04/2025 at 01:23 • 0 comments

The last log has shown that the running disparity of a whole word can be computed in parallel but at high costs.

Wouldn't it be better to compute only one word's disparity then deduce the correction ?

That's what the previous systems (NRZ and MLT3) enabled, with simple parity as well as mod4. Yet it was still not satisfying.

The baseline wander can be attributed to a "random walk" with no limit on the excursion, and the limit requires extra coding, which I'd like to minimise. This is the territory of 4B3T and its cousin 8B/6T with a very short range disparity, very low excursion, hence high overhead. I'd like to keep it at or below 3 bits/8 codes per 20-bit word so the idea of tweaking the data from the source is pretty interesting.

I don't know why but what I imagine right now is the bias evaluation starting from the middle of the word, going in both directions, seeing how the wander evolves, then at 1/4, 1/2 and 3/4, "swap" something to invert the bias slope. Thus the disparity counter can get higher values but clumps can be broken up. I think. Aaaand it looks a bit (from afar) like Knuth's idea. (D.E. Knuth, “Efficient Balanced Codes”, IEEE transactions on Information Theory, vol it-32, no.1, January 1986)
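For reference, the core of Knuth's scheme fits in a few lines. This is a sketch of the plain binary version only, not of the ternary variant imagined above: for a word of even length, complementing the first k bits changes the weight by one at each step, so some k always lands on a perfectly balanced word, and only k needs to be sent along. The names `knuth_balance` and `knuth_restore` are hypothetical.

```python
def knuth_balance(bits):
    """Return (k, balanced) where the first k bits of `bits` are complemented.

    `bits` is a list of 0/1 values of even length.
    """
    n = len(bits)
    assert n % 2 == 0, "the argument only holds for an even number of bits"
    for k in range(n + 1):
        candidate = [1 - b for b in bits[:k]] + bits[k:]
        if sum(candidate) == n // 2:
            return k, candidate
    raise AssertionError("unreachable: a balancing k always exists")

def knuth_restore(k, balanced):
    """Undo the balancing by complementing the first k bits again."""
    return [1 - b for b in balanced[:k]] + balanced[k:]
```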

And there are 7 trits so it's not as easy.

------------------

3B2T is very efficient, 4B3T is less dense but provides relatively easy and very short-term DC balance, which would be good for the analog front-end. There is a tension/compromise between coding efficiency (bandwidth usage) and BLW resilience...

BLW and code disparity are fortunately studied at length. Howard Johnson has an interesting analysis at https://sigcon.com/vault/pdf/7_09_addenda.pdf

https://imapsource.org/api/v1/articles/57229-line-coding-methods-for-high-speed-serial-links.pdf

But one thing I have not yet seen covered is "clipping". Applying a clipping with a pair of diodes adds some non-linearity and some hysteresis but it reduces the absolute excursion. Another trick is to use the midpoint tap of the transformer. Absolute levels don't seem to really matter, but the amplitude and direction of the pulses count the most.
For now, the emphasis is on the simplicity of the analog front-end, where a good portion of the manufacturing complexity and costs lies. In fact at this stage, even Manchester coding (like 10Base-T) would be nice, though would it work at a higher speed (at, say, 30MHz), and how is it possible to apply this principle to ternary coding ?
Drift/Bias evaluation
Yann Guidon / YGDES • 05/25/2025 at 19:04 • 0 comments
The constellation has a nice property that has been already highlighted:
```
encoding:
bits  trits  weight
000    - 0   -  \____NOR2
001    0 -   -  /
010    + +   ++ ---NOR+AND
011    - -   -- ---ANDN+AND
100    + 0   +  \____ANDN
101    0 +   +  /
110    - +   0
111    + -   0
```
The net sum of the levels does not need a lot of gates to evaluate: the circuit takes about 4 gates.
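The table and its weight column can be cross-checked in software. In this minimal sketch the table is copied from above; `weight_logic` is a hypothetical behavioural rewrite of the bit tests (b2/b1 select the magnitude, b0 the sign of the ±2 case), not the actual gate netlist.

```python
# 3B2T constellation, copied from the table above
ENCODE = {
    0b000: "-0", 0b001: "0-", 0b010: "++", 0b011: "--",
    0b100: "+0", 0b101: "0+", 0b110: "-+", 0b111: "+-",
}
LEVEL = {"-": -1, "0": 0, "+": 1}

def weight(tribit: int) -> int:
    """Net sum of the two transmitted levels for a 3-bit value."""
    return sum(LEVEL[c] for c in ENCODE[tribit])

def weight_logic(tribit: int) -> int:
    """Same result, derived from the bits only (behavioural, not gate-accurate)."""
    b2, b1, b0 = (tribit >> 2) & 1, (tribit >> 1) & 1, tribit & 1
    if b2 and b1:
        return 0                  # 11x: the two trits cancel
    if not b1:
        return 1 if b2 else -1    # 10x -> +1, 00x -> -1
    return -2 if b0 else 2        # 011 -> -2, 010 -> +2
```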

Of course, let us not forget the activation/enable, coming from the circuit we have already designed in the last log (2. The "Same" circuit). Since 11x totals 0, then we can just OR the result on b2 and b1 as in the circuit below:

Now the goal is to evaluate the total bias of the encoded 20-bit word, for each of 8 "fumblings" of the 7 tribits. Initially I imagined an incrementer but there is something much simpler than that: XOR each tribit with the output of a 3-bit counter. Then the winning count gets encoded with the others. The cost is one more layer of XOR2 at the input of the circuit:

And from there we can simply add a popcount7 for each of the 7 outputs and combine them in a 32-bit "weight". But even though it's already "done", we can already simplify it a bit by noticing that neighbours can cancel each other. So let's introduce another new circuit: the reduction. To simplify it, I need to add a "zero" output to the bias decoder. And then, things took a weird turn. Here is the new circuit that combines 2 tribits:

Now there is a big binary encoder and the 4 bits require about 7 gates of propagation. The output is a signed number so no need to process negative and positive values separately.

This circuit uses 52 gates to process 2 bitrits, which simply amounts to a 64×4 ROM, and it must be replicated 3.5× so it's not very compact, and the rest of the adders require even more gates.

Furthermore, the running disparity must also be injected somehow : that's the 8th value to add, since there are 7 bitrits and the adder tree would be unbalanced. So the running disparity accumulator from the last word is added with the 7th bitrit.

The circuit needs to be run (pipelined) 8 times, for each of the 8 possible counter values, while the serialiser outputs 8 bitrits (7 data, 1 counter), so the phases could overlap but the evaluation must be complete before serialising can start: it's a pipeline (eval, serialise) with 8 sub-cycles
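What the 8 sub-cycles compute can be modelled in a few lines. This is a behavioural sketch under stated assumptions: the per-tribit weights come from the encoding table of this log, the sum covers the 7 data bitrits plus the running disparity (the counter bitrit's own weight is left out, as in the adder tree described above), and ties go to the lowest counter value. The names `WEIGHT`, `best_fumble` and `unfumble` are hypothetical.

```python
# Net level sum of each 3-bit value once encoded as a pair of trits
WEIGHT = [-1, -1, 2, -2, 1, 1, 0, 0]

def best_fumble(tribits, running_disparity=0):
    """Try the 8 counter values and keep the one with the smallest |bias|.

    Returns (k, scrambled_tribits, new_disparity).
    """
    best = None
    for k in range(8):  # one pipeline sub-cycle per counter value
        scrambled = [t ^ k for t in tribits]
        total = running_disparity + sum(WEIGHT[t] for t in scrambled)
        if best is None or abs(total) < abs(best[2]):
            best = (k, scrambled, total)
    return best

def unfumble(k, scrambled):
    """Receiver side: XOR with the transmitted counter value again."""
    return [t ^ k for t in scrambled]
```

For instance, seven copies of `0b010` would give a bias of +14 as-is, while `k = 0b100` turns them all into `0b110` and brings the bias to 0.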

....

Reducing the bias apparently requires a lot of effort. More than would be reasonable, probably.

Modern links rely on the scrambler to even things out, methods like 8b/10b have been out of fashion for a decade now.

Better DSP front-ends can digitally handle the droops and wanders... I can't afford that though.

Adding a bitrit expands the words to 16 bauds, to transmit 16 bits : the ternary recoding allows the 50% expansion. And there is still one unused bit.

I'd like to avoid the above circuit but I know my AFE is lousy and would need some serious help, but is the expense/complexity/latency justified ? Is there a simpler method ?