The division instruction is one of the most expensive instructions in your CPU. Thus optimizing compilers often compile divisions by known constants down to a multiplication followed by a shift. However, in some lucky cases, the compiler does not even need a shift. I call the corresponding divisors ideal. For the math geeks, they are related to Fermat numbers.

For 32-bit unsigned integers, we have two such divisors (641 and 6700417). For 64-bit unsigned integers, we have two different ones (274177 and 67280421310721). They are the prime factors of 2^32 + 1 and 2^64 + 1, respectively.
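As a quick sanity check (a small C++ snippet of my own, relying on the __uint128_t extension available in GCC and Clang), the two pairs indeed multiply back to 2^32 + 1 and 2^64 + 1:

```cpp
#include <cassert>
#include <cstdint>

int main() {
  // 641 * 6700417 == 2^32 + 1
  assert(641ULL * 6700417ULL == (1ULL << 32) + 1);
  // 274177 * 67280421310721 == 2^64 + 1 (needs a 128-bit product)
  assert((__uint128_t)274177 * 67280421310721ULL ==
         ((__uint128_t)1 << 64) + 1);
  return 0;
}
```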

So you have that

`n/274177 = ( n * 67280421310721 ) >> 64`

and

`n/67280421310721 = ( n * 274177 ) >> 64`.

In these expressions, the multiplication is the full multiplication (producing a 128-bit result). It looks like there is still a ‘shift’ by 64 bits, but since the shift merely selects the upper 64-bit word of the product, it disappears in practice after compilation.
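In C++, assuming a compiler that provides the __uint128_t extension (GCC and Clang do), a minimal sketch of the two divisions looks like this; the shift by 64 merely picks out the upper word of the full product:

```cpp
#include <cstdint>

uint64_t div_by_274177(uint64_t n) {
  // full 64x64->128 multiplication, keep the upper 64 bits
  return (uint64_t)(((__uint128_t)n * 67280421310721ULL) >> 64);
}

uint64_t div_by_67280421310721(uint64_t n) {
  return (uint64_t)(((__uint128_t)n * 274177ULL) >> 64);
}
```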

Of course, not all compilers may be able to pull this trick, but many do. Here is the assembly code produced by GCC when compiling `n/274177` and `n/67280421310721`, respectively, for an x64 target.

```
movabs rdx, 67280421310721
mov rax, rdi
mul rdx
mov rax, rdx
ret
```

```
mov rax, rdi
mov edx, 274177
mul rdx
mov rax, rdx
ret
```

You get similar results with ARM. It looks like ARM works hard to build the constant with a sequence of movk instructions, but that is mostly a distraction: the division itself is still a single umulh (multiply-high) instruction.

```
mov x1, 53505
movk x1, 0xf19c, lsl 16
movk x1, 0x3d30, lsl 32
umulh x0, x0, x1
ret
```

```
mov x1, 12033
movk x1, 0x4, lsl 16
umulh x0, x0, x1
ret
```

What about remainders?

What a good compiler will do is first compute the quotient, and then do a multiplication and a subtraction to derive the remainder. That is the general strategy. Thus, maybe surprisingly, it is more expensive to compute a remainder than a quotient in many cases!
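With the ideal divisor 274177, that strategy amounts to something like the following sketch (my illustration of what the compiler generates, not actual compiler output):

```cpp
#include <cstdint>

uint64_t remainder_via_quotient(uint64_t n) {
  // quotient via multiply-high, as above
  uint64_t q = (uint64_t)(((__uint128_t)n * 67280421310721ULL) >> 64);
  // multiply and subtract to get the remainder
  return n - q * 274177;
}
```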

You can do a bit better in some cases. There is a trick from our Faster Remainder by Direct Computation paper that compilers do not know about. You can compute the remainder directly, using exactly two multiplications (and a few move instructions):

`n % 274177 = (uint64_t( n * 67280421310721 ) * 274177) >> 64`

and

`n % 67280421310721 = (uint64_t( n * 274177 ) * 67280421310721) >> 64`.

In other words, the following two C++ functions are strictly equivalent:

```cpp
// computes n % 274177
uint64_t div1(uint64_t n) {
  return n % 274177;
}

// computes n % 274177
uint64_t div2(uint64_t n) {
  return (uint64_t( n * 67280421310721 ) * __uint128_t(274177)) >> 64;
}
```

Though the second function is more verbose and uglier, it will typically compile to more efficient code involving just two multiplications, back to back. Two multiplications may seem like a lot, but it is likely better than what the compiler will do with the plain remainder.

In any case, if you are asked to pick a prime number and you expect to have to divide by it, you might consider these ideal divisors.

**Further reading**. Integer Division by Constants: Optimal Bounds

Are there similar numbers for 128- and 256-bit unsigned integers? And are these useful in cryptographic contexts? Obviously not as a secret prime, but as something to speed up big-number arithmetic.

Yes.

For 128 bits: 59649589127497217 and 5704689200685129054721

For 256 bits: 1238926361552897 and 93461639715357977769163558199606896584051237541638188580280321

Over the years I’ve noticed that trying to write a lot of numbers to a text log can seriously slow down the sections doing the logging. It seems that printf(“%d”,…) is quite the little delay statement. I figure this is down to the necessity to divide by 10 (with remainder) for every digit printed.

Given how often we print human-readable numbers, 10 must be one of the most common divisors encountered. So you’d think that CPU designers would give us a dedicated, optimized divide-by-10 instruction. I’ll admit I’m pretty out of touch with current instruction sets, but given how sluggish printf %d still is, I’d guess they haven’t.

You are right that calls to printf are slow. I do not expect that the conversion from integers to strings is what slows you down in such a case. For some fast decimal-output ideas, have a look at the Linux kernel binary-to-decimal conversion code. It uses a number of tricks, including smaller constants for restricted ranges and a 200-byte mod-100-to-2-digits lookup table.
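To illustrate the two-digits-at-a-time idea, here is a simplified sketch (my own code, not the kernel’s): a 200-byte table maps each value in [0, 100) to its two ASCII digits, so only one division by 100 is needed per pair of digits.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

static const char digit_pairs[201] =
    "00010203040506070809101112131415161718192021222324"
    "25262728293031323334353637383940414243444546474849"
    "50515253545556575859606162636465666768697071727374"
    "75767778798081828384858687888990919293949596979899";

// Writes the decimal representation of n into buf (at least 21 bytes)
// and returns the number of characters written.
size_t u64_to_dec(uint64_t n, char* buf) {
  char tmp[20];
  char* p = tmp + sizeof(tmp);
  while (n >= 100) {
    uint64_t r = n % 100;
    n /= 100;
    p -= 2;
    memcpy(p, digit_pairs + 2 * r, 2);
  }
  if (n >= 10) {
    p -= 2;
    memcpy(p, digit_pairs + 2 * n, 2);
  } else {
    *--p = (char)('0' + n);
  }
  size_t len = (size_t)(tmp + sizeof(tmp) - p);
  memcpy(buf, p, len);
  buf[len] = '\0';
  return len;
}

int main() {
  char buf[21];
  u64_to_dec(18446744073709551615ULL, buf);
  printf("%s\n", buf);  // 18446744073709551615
  return 0;
}
```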

I think it is worth noting that this is an instance of strength reduction (see https://en.wikipedia.org/wiki/Strength_reduction), and also that compilers have a cost model to pick when to apply this reduction.

Some HN comments on this post correctly note that the relative cost of division and multiplication varies over time and across target architectures.

One problem with the direct computation of a remainder is the need for (using the notation of the 2019 paper) a multiplication with an N+L-bit input, where the numerator is N bits long, and the computation needs L guard bits. (arXiv:2012.12369 defines N differently.)

This means that the algorithm requires the input range to be restricted. Yes, we can hash 48-bit pointers on 64-bit machines, but it would be nice to find something that could handle 64-bit inputs directly.

Unfortunately, just dropping the least significant L bits of the intermediate product fails immediately; just computing n/3 requires L=1, and when n = 1,

`(n*c>>L) * d >> L`

returns 0, while 1 % 3 = 1. Okay, I think: what if I round the division by 2^L up? (Adding

`(1 << L) - 1`

before shifting down by L.) Well, testing with N=16, that fails on 65416 % 670: L=8, c = 0x61d1, n*c = 0x61a32608, and 0x61 = 97 is the correct quotient, but 0xa327 * 670 = 0x1ab0012, while the correct remainder is 0x1aa.
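For reference, a minimal sketch reproducing this counterexample (my reconstruction, assuming c = ceil(2^(N+L) / d)):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const uint32_t N = 16, L = 8, d = 670;
  const uint32_t c = (1u << (N + L)) / d + 1;           // ceil(2^24 / 670) = 0x61d1
  const uint32_t n = 65416;
  const uint32_t frac = (n * c) & ((1u << (N + L)) - 1);
  // drop the L guard bits, rounding the division by 2^L up
  const uint32_t approx = (((frac + (1u << L) - 1) >> L) * d) >> N;
  printf("approx = %u, true remainder = %u\n", approx, n % d);  // 427 vs 426
  return 0;
}
```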

But I wonder if it’s possible in general to find some additive constant 0 < b < 2^L that would work…

Nope. 65443 % 670 has n*c = 0x61ad7713 and requires b <= 0xec, while 16 % 670 has n*c = 0x61d10 and requires b >= 0xf0.

The number of bits needed is divisor-dependent (as this blog post demonstrates). You are correct that it does not work for all divisors and word sizes. But, at least for compile-time optimizations, it does not need to.

You might enjoy this software implementation:

https://github.com/lemire/fast_division

The division technique is more useful because it can usually be done with a multiply-high and a right shift. And even the exceptions can be handled with a small number of additional simple instructions.

I was lamenting that the direct remainder computation technique isn’t more useful, because it almost always requires a multiply with an input range wider than the modulo operation it’s emulating. I was (foolishly?) hoping someone might be able to find an improvement.

One application I’m thinking about is the remainder of a bignum modulo a number of small primes, as the first trial-division step in an isprime() test. This generally starts by choosing a single-word divisor d = p1 * p2 * p3 * …, and finding the 1-word remainder, then checking for divisibility by the individual primes. Each step consists of finding r = (r << N | limb[i]) % d. N can be chosen to match available operation sizes (64 bits for x86 native divide, 32 bits for emulation via 64-bit multiply), and the primes can be partitioned into divisors d in a variety of ways.

The figure of merit depends on the cost of the remainder and the probability of finding a factor (and therefore obviating future tests). So a larger number of smaller divisors can be advantageous if the remainder computation is sped up enough.
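As an illustration of the folding step described above, here is a hedged sketch (my own code) assuming 32-bit limbs stored most significant first and a divisor d that fits in 32 bits, so the intermediate value fits in 64 bits:

```cpp
#include <cstdint>
#include <vector>

// r = (r << 32 | limb[i]) % d, one limb at a time
uint32_t bignum_mod(const std::vector<uint32_t>& limbs, uint32_t d) {
  uint64_t r = 0;
  for (uint32_t limb : limbs) {
    r = ((r << 32) | limb) % d;
  }
  return (uint32_t)r;
}
```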

If you are lamenting the speed of 64-bit remainder calculation for bignums, take a look at the paper by E. W. Mayer (arXiv:1303.0328). I did an implementation on the Skylake architecture with < 6 cycles per limb of the given bignum.

For decomposing the 64-bit remainder into divisibility by the individual primes, you don’t need the reduced remainder value, just an indicator of whether it is 0 or not. I think that can be done with one mullo and one comparison (if my math doesn’t fail me).
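Presumably the trick alluded to is the classic one for an odd divisor p: n is divisible by p exactly when n times the modular inverse of p (mod 2^64) wraps to a value no larger than floor((2^64 - 1) / p). A hedged sketch, with helper names of my own and 641 as the example divisor:

```cpp
#include <cstdint>

// modular inverse of an odd d modulo 2^64, via Newton's iteration
// (each step doubles the number of correct low bits: 3, 6, 12, 24, 48, 96)
constexpr uint64_t mul_inverse(uint64_t d) {
  uint64_t x = d;  // correct to 3 bits for odd d
  for (int i = 0; i < 5; i++) x *= 2 - d * x;
  return x;
}

// one mullo and one comparison
bool divisible_by_641(uint64_t n) {
  constexpr uint64_t inv = mul_inverse(641);
  constexpr uint64_t limit = UINT64_MAX / 641;
  return n * inv <= limit;
}
```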

Ooh, interesting, thank you! Actually, as soon as I heard the title, I said “oh, that’s right, you probably can…”.

More specifically, when testing for divisibility by multiple primes, I don’t need the true remainder r; s = 2^k r is fine, as s ≡ 0 iff r ≡ 0 (mod p).

If k is fixed, then any tests for particular non-zero values of the remainder can also be simply translated.

BTW, that “2kr” was supposed to have a superscript k (2^k r), but despite the fact that it’s documented that HTML can be used in markdown, “2<sup>k</sup>r” renders as “2kr”.

I think there are still corners to improve in remainder range-check optimization in modern compilers. For instance, I found out by brute force that x % 7 < 3, where x is uint8_t, can be replaced by (uint16_t)(9363 * x) <= 18906. (Compilers seem to emit a dozen-instruction sequence for this.) Similar solutions can be found for other variations; I just searched for these by brute force.

… and with (uint16_t)(9363 * x) < 28087, x can go up to 13109. 9363 is simply ceil(2^16 / 7) and 28087 is round(2^16 * (3/7)). Is there a good reason why such optimizations are not generated? (Of course they should be correct for the whole input range…)

Well, of course it should be 28088 instead of 28087, and 3 * ceil(2^16 / 7). Nonetheless, I’m surprised that remainder range-check strength reductions using the “next higher word size” aren’t implemented in compilers.
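For what it is worth, a small brute-force harness (my own sketch) that checks the original rewrite of x % 7 < 3 over the full uint8_t range:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  for (int x = 0; x < 256; x++) {
    bool slow = (x % 7) < 3;
    bool fast = (uint16_t)(9363 * x) <= 18906;
    if (slow != fast) {
      printf("mismatch at x = %d\n", x);
      return 1;
    }
  }
  printf("all 256 inputs agree\n");
  return 0;
}
```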