Counter vs Booth

When choosing between Counter (compressor-tree) and Booth multipliers for signed multiplication, engineers weigh speed, silicon cost, and power. Both methods rework the classic shift-and-add process, yet they diverge sharply in encoding style, critical path length, and energy profile.

Understanding these differences early prevents costly respins. A mis-selected multiplier can inflate cycle time by 30 % or force a larger die that price-sensitive markets reject.

Binary multiplication landscape

Baseline “shift-and-add” needs n cycles for n-bit operands. That serial dependency bottlenecks DSP cores that must issue a multiply every clock.
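As a point of reference, the serial baseline can be sketched in a few lines of Python. The function name `shift_and_add` and the loop-as-clock-cycle model are illustrative assumptions, not a description of any particular core.

```python
def shift_and_add(x, y, n):
    """Multiply two unsigned n-bit integers, one multiplier bit per cycle.

    Returns (product, cycles); the cycle count is always n.
    """
    assert 0 <= x < (1 << n) and 0 <= y < (1 << n)
    acc = 0
    cycles = 0
    for i in range(n):        # one loop iteration models one clock cycle
        if (y >> i) & 1:      # multiplier bit i selects x or 0
            acc += x << i     # add the shifted multiplicand
        cycles += 1
    return acc, cycles


product, cycles = shift_and_add(13, 11, 4)
print(product, cycles)  # 143 4
```

Every iteration depends on the accumulator from the previous one, which is the serial dependency the parallel schemes below remove.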

Signed operands complicate matters because partial products must be sign-extended. Naïve sign-extension replicates each row’s sign bit out to the full 2n-bit product width, so every row grows by up to n extra bit positions that the adder array must process.
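The cost of naïve sign extension is easier to see in a small model. The sketch below is an assumption-laden illustration (the name `signed_array_multiply` is hypothetical): it extends every row to 2n bits and gives the multiplier’s sign bit its negative weight by subtracting the last row.

```python
def signed_array_multiply(x, y, n):
    """Multiply two n-bit two's-complement values using sign-extended rows."""
    width = 2 * n
    mask = (1 << width) - 1
    xs = x & mask                     # multiplicand sign-extended to 2n bits
    yb = y & ((1 << n) - 1)           # raw n-bit pattern of the multiplier
    total = 0
    for i in range(n):
        if (yb >> i) & 1:
            row = (xs << i) & mask    # shifted row, kept modulo 2**(2n)
            if i == n - 1:
                row = (-row) & mask   # multiplier MSB has weight -2**(n-1)
            total = (total + row) & mask
    # Reinterpret the 2n-bit pattern as a signed integer.
    return total - (1 << width) if total >> (width - 1) else total


print(signed_array_multiply(-3, 5, 4))  # -15
```

Each row occupies the full 2n-bit width here, which is exactly the overhead that Booth recoding and sign-extension tricks try to trim.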

Booth and Counter techniques attack these overheads from opposite flanks: Booth cuts the number of partial products, while Counter accelerates their summation.

Radix-2 Booth algorithm dissected

Radix-2 Booth recodes the multiplier y into signed digits {–1, 0, 1}. Each digit selects among –x, 0, or x, so a two’s-complement multiplier is handled directly, with no separate correction step for its sign bit.

The recoding scan is a two-bit window: the current bit and the previous bit, with an implicit zero below the least-significant bit. (The three-bit window belongs to radix-4.) A tiny ROM or hard-wired truth table emits the control signals in one LUT delay.

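For random data roughly half of the recoded digits are zero, but a parallel array must still provision all n rows because any digit can be non-zero. The benefit appears as reduced switching, or as skipped cycles in an iterative design; only radix-4 and higher recoding guarantees a shorter array of about n/2 rows. Wallace trees downstream see a lighter load only in that higher-radix case.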
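A minimal Python sketch of radix-2 recoding follows. The function names are illustrative; the digit rule (previous bit minus current bit, with an implicit zero below the LSB) is the standard Booth formulation.

```python
def booth_radix2_digits(y, n):
    """Recode an n-bit two's-complement multiplier into digits in {-1, 0, 1}.

    Digit i equals y[i-1] - y[i], with an implicit y[-1] = 0.
    Digits are returned least-significant first.
    """
    y &= (1 << n) - 1                # keep the n-bit two's-complement pattern
    digits = []
    prev = 0                         # implicit bit to the right of the LSB
    for i in range(n):
        cur = (y >> i) & 1
        digits.append(prev - cur)    # 01 -> +1, 10 -> -1, 00 and 11 -> 0
        prev = cur
    return digits


def booth_radix2_multiply(x, y, n):
    """Signed multiply: sum of digit * x * 2**i over the recoded digits."""
    return sum(d * (x << i) for i, d in enumerate(booth_radix2_digits(y, n)))


print(booth_radix2_digits(0b01111110, 8))  # [0, -1, 0, 0, 0, 0, 0, 1]
print(booth_radix2_multiply(-7, -3, 4))    # 21
```

A run of ones collapses to two non-zero digits, while an alternating pattern such as 01010101 recodes to eight non-zero digits, so the sparsity depends on the data rather than being guaranteed.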

Gate-level implementation snapshot

A 16-bit Booth encoder fits in about eight LUTs on Xilinx UltraScale, whose fabric uses 6-input LUTs rather than 4-LUTs. The critical path is one LUT plus the mux delay for the partial-product row.

Recoded digits steer Booth selectors built from transmission-gate muxes. These muxes present only 30 fF load to the encoder, keeping dynamic power under 150 µW/GHz in 28 nm.

Counter algorithm essentials

Counter multiplication keeps the partial-product array intact but collapses it faster. 3:2 and 4:2 compressors turn three or four rows into two, slicing height logarithmically.

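Each 3:2 compressor is simply a full adder used as a carry-save adder (CSA), and the reduction tree stacks them column by column. Carries move one column left into the next level instead of rippling along a row, so delay grows roughly as log base 1.5 of the array height rather than linearly.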
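The reduction can be modelled on whole integers, treating each bit position as an independent full adder. This is a sketch under simplifying assumptions: rows are non-negative Python integers, leftover rows simply pass to the next level, and the names `csa_3to2` and `reduce_rows` are illustrative.

```python
def csa_3to2(a, b, c):
    """Carry-save add three rows into a (sum, carry) pair with no ripple."""
    s = a ^ b ^ c                                # per-bit sum
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, weight x2
    return s, carry


def reduce_rows(rows):
    """Apply 3:2 levels until two rows remain; return (rows, levels)."""
    rows = list(rows)
    levels = 0
    while len(rows) > 2:
        groups = len(rows) // 3
        nxt = []
        for g in range(groups):                  # compress full groups of three
            nxt.extend(csa_3to2(*rows[3 * g:3 * g + 3]))
        nxt.extend(rows[3 * groups:])            # leftovers pass through
        rows = nxt
        levels += 1
    return rows, levels


x, y = 40503, 51966                              # two unsigned 16-bit operands
rows = [(x << i) if (y >> i) & 1 else 0 for i in range(16)]
final_rows, levels = reduce_rows(rows)
print(levels, sum(final_rows) == x * y)          # 6 True
```

Sixteen rows shrink as 16, 11, 8, 6, 4, 3, 2, so six levels precede the final carry-propagate adder.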

Unlike Booth, Counter introduces no new control signals. That simplicity appeals to ASIC flows that prize verification closure over algorithmic cleverness.

Compressor micro-architecture

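A 4:2 compressor is typically built from two full adders in series. It takes four operand bits plus a carry-in and produces a sum, a carry, and a carry-out. The carry-out depends only on the operand bits and feeds the carry-in of the neighbouring column, so the horizontal carry travels just one hop instead of rippling.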
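A bit-level sketch of one 4:2 compressor slice, assuming the common two-full-adder construction (other gate-level implementations exist; the function names are illustrative):

```python
def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4to2(x1, x2, x3, x4, cin):
    """One 4:2 compressor bit-slice built from two full adders.

    Returns (sum, carry, cout) such that
    x1 + x2 + x3 + x4 + cin == sum + 2 * (carry + cout).
    """
    s1, cout = full_adder(x1, x2, x3)   # cout never depends on cin
    s, carry = full_adder(s1, x4, cin)  # cin comes from the neighbouring slice
    return s, carry, cout


print(compressor_4to2(1, 1, 1, 1, 1))   # (1, 1, 1)
```

Because `cout` is computed before `cin` is consumed, a chain of slices passes the horizontal carry exactly one column and no further.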

Placement tools can stack compressors into regular tiles, yielding dense 32-bit macros below 5 kgates in 7 nm. Regularity also improves yield by minimizing lithography hot-spots.

Partial-product count comparison

Unsigned 16 × 16 multiplication spawns 256 bit-products arranged as 16 rows. A parallel radix-2 Booth array still carries 16 rows (about half of them zero for random data), while Counter faces all 16 dense rows but digests them faster.

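Radix-4 Booth halves the row count to 8, or 9 when an extra correction row is needed. Each row selects from 0, ±x, and ±2x; the 2x multiple is only a one-bit shift. Hard multiple generators, quoted here at about 200 µm² in 28 nm, become necessary at radix-8, where ±3x requires a real adder.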
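Radix-4 recoding can be sketched the same way as radix-2, scanning overlapping three-bit windows two positions at a time. The function name is illustrative, and the sketch assumes an even operand width.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's-complement multiplier (n even) into n/2 digits.

    Each digit lies in {-2, -1, 0, 1, 2} and has weight 4**j.
    Digits are returned least-significant first.
    """
    assert n % 2 == 0, "this sketch assumes an even operand width"
    y &= (1 << n) - 1
    digits = []
    prev = 0                                 # implicit bit below the LSB
    for i in range(0, n, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)    # window (y[i+1], y[i], y[i-1])
        prev = b1
    return digits


print(booth_radix4_digits(6, 4))             # [-2, 2]
print(len(booth_radix4_digits(12345, 16)))   # 8
```

Every digit maps to 0, ±x, or ±2x, so the row generator needs only a shift and a conditional negation.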

Counter arrays never compress below two rows. For 16 rows, those two rows emerge after 6 levels of 3:2 counters (16 → 11 → 8 → 6 → 4 → 3 → 2), whereas the 8 rows of a radix-4 Booth array need only 4 levels. The race is between fewer rows and faster folding.

Critical path deep dive

Booth’s critical path starts at the encoder, runs through the mux bank, then enters the Wallace tree. In 16-bit designs, this totals 12 FO4 delays in typical synthesis.

Counter’s first compressor level activates before any Booth encoder wakes. The path is simply CSA → CSA → … → carry-lookahead adder, 10 FO4 delays for the same width.

At 32-bit, Booth’s encoder depth stays constant, but the mux fan-out doubles. Counter trees scale logarithmically, so the gap widens to 2.5 FO4, worth 100 MHz at 0.8 V.

Power signature analysis

Booth multiplexers toggle only when the recoded digit changes. When the multiplier contains long runs of identical bits, sparse recoding can cut internal switching by as much as 40 % relative to random data.

Counter compressors switch every clock because every bit-product ripples through adders. Dynamic power peaks at 250 µW/MHz for 32-bit macros in 40 nm.

Yet Booth pays for sign-extension multiplexers that span the full width. At 64-bit, the extra caps outweigh sparse switching, flipping the energy advantage back to Counter.

Glitch activity under PRBS stimuli

Gate-level VCD dumps show Counter trees produce 30 % more glitches. Inserting two stages of buffers with balanced rise/fall halves that waste for only 2 % area growth.

Booth mux banks glitch less, but their select lines come from an encoder that itself can sparkle. Isolating the encoder with a registered layer quenches 70 % of those unwanted toggles.

Area footprints in modern nodes

A 32-bit Radix-2 Booth multiplier needs 3.2 kgates: 1 k for encoder/mux, 1.5 k for CSA tree, 0.7 k for final adder. That fits into 18 kµm² in 7 nm.

Counter logic is 2.8 kgates (no encoder and no selector muxes), but the CSA array is taller. The resulting macro is 20 kµm², about 11 % larger despite fewer gates.

Higher-radix Booth balloons to 4.5 kgates once the hard multiples are hardened. Radix-4 needs only ±x and ±2x, which is a shift, but radix-8 adds ±3x. Foundries often provide only ±2x hard macros, forcing designers to synthesize ±3x, which adds another 1 kgates.

Pipeline friendliness

Booth’s encoder sits in one pipeline stage, the mux bank in the next, and the CSA tree in a third. Balancing those stages requires retiming that not all tools handle cleanly.

Counter compressors slice naturally at every CSA level. Inserting registers every two compressor levels yields well-balanced stage delays without manual ECO.

High-speed DSPs therefore prefer Counter when targeting 1 GHz. Booth remains attractive for 500 MHz budgets where area matters more than latch count.

Signed-extension overhead

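Booth recoding embeds the multiplier’s sign handling, so no separate correction row is needed for a negative multiplier. The partial products themselves still require sign extension, although constant-folding tricks keep it cheap. That saves 500 gates and two levels of logic for 64-bit operands.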

Counter arrays need sign-extension copies for every compressed row. The extension bits fan out horizontally, creating 20 % wire congestion in the upper metal layers.

One workaround is Booth recoding plus Counter reduction: pre-encode the multiplier into signed digits, then feed the resulting rows to a pure Counter array. Hybrid schemes add only 300 gates yet reclaim routing tracks.

Hybrid unification approach

Some vendors merge Booth encoder with Counter compression. The encoder outputs four signed bits that drive 4:2 compressors directly, skipping the mux bank entirely.

This removes the large fan-out muxes that slowed pure Booth. Meanwhile, the compressor count halves versus pure Counter because the array height is pre-reduced.
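The generic dataflow of such a hybrid (Booth-recoded rows feeding carry-save compression, then one final add) can be sketched as below. This is not the vendor scheme described above: it assumes radix-4 recoding, plain 3:2 levels, an even operand width, and rows sign-extended to 2n bits; `hybrid_multiply` is a hypothetical name.

```python
def hybrid_multiply(x, y, n):
    """Signed n x n multiply: radix-4 Booth rows, 3:2 reduction, final add."""
    assert n % 2 == 0, "this sketch assumes an even operand width"
    width = 2 * n
    mask = (1 << width) - 1
    yb = y & ((1 << n) - 1)
    rows, prev = [], 0
    for i in range(0, n, 2):                      # n/2 Booth rows
        b0, b1 = (yb >> i) & 1, (yb >> (i + 1)) & 1
        digit = prev + b0 - 2 * b1                # in {-2, -1, 0, 1, 2}
        rows.append(((digit * x) << i) & mask)    # row sign-extended to 2n bits
        prev = b1
    while len(rows) > 2:                          # carry-save levels
        groups = len(rows) // 3
        nxt = []
        for g in range(groups):
            a, b, c = rows[3 * g:3 * g + 3]
            nxt.append(a ^ b ^ c)
            nxt.append((((a & b) | (a & c) | (b & c)) << 1) & mask)
        nxt.extend(rows[3 * groups:])
        rows = nxt
    total = sum(rows) & mask                      # final carry-propagate add
    # Reinterpret the 2n-bit pattern as a signed integer.
    return total - (1 << width) if total >> (width - 1) else total


print(hybrid_multiply(-7, 5, 4))  # -35
```

With 16-bit operands the reduction starts from 8 rows instead of 16, which is where the halved compressor count comes from.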

Silicon measurements show 15 % energy savings and 8 % speed gain over standalone Booth at 64-bit. The macro grows by only 5 % area because multiplexers disappear.

Verification complexity

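In a parallel multiplier Booth’s encoder is combinational, not a sequential FSM, but its recoding table and boundary handling (the implicit bit below the LSB, the sign bit at the top) must be formally verified against corner patterns such as alternating bits and the most negative operand. Counter logic is a regular array of identical cells, so equivalence checking is straightforward cell by cell.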

However, Booth offers fewer partial products, shrinking the formal model size. A 128-bit Counter tree balloons the BDD beyond 4 M nodes, straining provers.

Teams often respin because of a missed Booth corner case at the encoder boundary. Counter bugs, when they occur, are usually localized to a single compressor and fixed with a metal-only ECO.
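Exhaustive simulation at small widths is a cheap complement to formal proof and tends to expose boundary bugs quickly. The harness below is only a sketch: `check_signed_multiplier` and the deliberately broken `buggy_mul` model are hypothetical, and a real flow would run the RTL rather than a Python stand-in.

```python
from itertools import product


def check_signed_multiplier(mul, n):
    """Compare mul(x, y, n) with x * y for every signed n-bit operand pair.

    Returns the list of failing (x, y) pairs.
    """
    lo, hi = -(1 << (n - 1)), 1 << (n - 1)
    return [(x, y) for x, y in product(range(lo, hi), repeat=2)
            if mul(x, y, n) != x * y]


def buggy_mul(x, y, n):
    """Hypothetical model that ignores the negative weight of the MSB of y."""
    return x * (y & ((1 << n) - 1))


failures = check_signed_multiplier(buggy_mul, 4)
print(len(failures), all(y < 0 for _, y in failures))  # 120 True
```

Every failure involves a negative multiplier, which points straight at the sign-bit boundary of the encoder.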

Compiler and software view

Compilers treat both multipliers as opaque “mul” instructions. Yet scheduling models must expose latency differences to avoid pipeline stalls.

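GCC’s machine description can expose the difference. The 32-bit multiply pattern is named “mulsi3”, and its latency (say 3 cycles for Booth versus 2 for Counter) is set in the target’s pipeline description and cost tables rather than by naming the algorithm. Tuning this entry reportedly improved SPECint by 4 % on a Counter-based core.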

An iterative Booth unit that skips zero digits has data-dependent latency that software cannot predict; a parallel Booth array has fixed latency, just like Counter. Fixed latency simplifies loop unrolling heuristics, yielding tighter software pipelining.

Process-porting anecdotes

A European startup ported a Booth-based DSP from 40 nm to 28 nm. The encoder’s critical path improved by only 12 % because mux select nets became wire-limited.

They swapped to Counter and gained 25 % frequency at the same voltage. The area penalty was offset by a denser SRAM compiler, keeping die size flat.

Conversely, a wearable vendor shrinking to 22 nm kept Booth because dynamic power dominated. The 30 % switching reduction trumped the 10 % speed loss.

Choosing for your next SoC

If your target exceeds 1 GHz and power budget exceeds 200 mW, pick Counter with three pipeline stages. The logarithmic delay and regular register boundaries simplify timing closure.

For battery-powered sensors running at 200 MHz, Booth’s sparse switching and smaller area save 15 % energy. The encoder overhead is negligible when duty-cycled.

When die cost is paramount and frequency is below 500 MHz, consider Radix-2 Booth. It balances area, power, and verification effort without exotic multiples.

Quick selection cheat-sheet

Above 500 MHz → Counter. Below 200 MHz → Booth. Between 200–500 MHz, simulate both with extracted parasitics and pick whichever meets your power, performance, and area (PPA) targets with margin.

Remember to model wire load; at 64-bit, Counter’s taller array stresses upper metals and can flip the winner. Always close timing with realistic clock trees, including skew and jitter, not ideal clocks.