In article <1991Nov20.101023.26008@rdg.dec.com> goodwin@oyez.enet.dec.com (Pete Goodwin) writes:
>[stuff about !PVray]
>
>Got ARM code for square roots? 8)
I haven't any code, but it shouldn't be too difficult for you to write some. The algorithm I'd use would be a Newton-Raphson iteration scheme designed to solve x^2=z. (Whether this is the fastest method I don't know)
The way this works is you make a guess for sqrt(z), say x_0. (The guess can be very rough: I'd find out how many bits are in z (say there were 23), halve that (11 or 12), and then guess x_0 = 2^11 or 2^12.)
One then makes successive approximations via
x_{n+1} = x_{n} - (x_{n}^2 - z)/2/x_{n}
This scheme will converge very rapidly to sqrt(z) if your initial guess is at all close to sqrt(z) (give or take a factor of 2 or 4 is close enough). You might want to rewrite the iteration in the algebraically equivalent form x_{n+1} = (x_{n} + z/x_{n})/2; it might be quicker (I've no idea).
As an example: if z = 529020918 and we guess x0 = 16384, then we find

  x1 = 24336.43719
  x2 = 23037.12504
  x3 = 23000.48392
  x4 = 23000.45473
  x5 = 23000.45473
  x6 = 23000.45473

so you can see 4 or maybe 5 iterations at most is enough. Each iteration (if you arrange things *carefully*) will need only a 16x16 bit multiply and a 32x16 bit divide.
Graeme Williams
>In article <1991Nov20.101023.26008@rdg.dec.com> goodwin@oyez.enet.dec.com (Pete Goodwin) writes:
>>[stuff about !PVray]
>>
>>Got ARM code for square roots? 8)
>
>I haven't any code, but it shouldn't be too difficult for you to write some.
>The algorithm I'd use would be a Newton-Raphson iteration scheme designed
>to solve x^2=z. (Whether this is the fastest method I don't know)
>
>The way this works is you make a guess for sqrt(z), say x_0. (The guess
>can be very rough, I'd do something like find out how many bits in z
>(say there were 23) half that (11 or 12) and then guess x_0 (2^11 or 2^12.)
>
>One then makes successive approximations via
>
> x_{n+1} = x_{n} - (x_{n}^2 - z)/2/x_{n}
>
>This scheme will converge very rapidly to sqrt(z) if your initial guess
>is at all close to z...
The Newton-Raphson technique is a good one when you've got a fast division - e.g. if you've got division in hardware (and haven't got square root in hardware :-). However, if you're going to do division by a software routine (as you must on an ARM), there is a "long square root" algorithm which is very similar to the "long division" algorithms used to do division in software. It takes only a bit more time than one division *in total*, compared with a division, an addition and a shift *for each of several iterations* in Newton-Raphson. I would therefore recommend it for ARM square root routines when speed is important.
I won't supply code, since (a) I don't know what type of numbers you want to take square roots of, how accurate a result you want, etc. - these details will obviously affect the code; (b) I don't have time to write code, whereas I have a written description of the algorithm on file which I can simply include here!
I'll illustrate the "long square root" algorithm in base 10 - suppose we want to take the square roots of 2 and 20:
(A) First write down the number you want to take the square root of, split into groups of two digits, with one split at the decimal point. In these particular cases, this gives us:
  V 02. 00 00 00 ...        V 20. 00 00 00 ...
(Note that the two numbers chosen have the same pattern of digits, but different splits because of the different positions of the decimal point. This is the only reason they produce different answers.)
(B) Initialise by finding the largest square which is not greater than the first pair of digits: in the first case the first pair of digits is 02 = 2, so this square is 1^2 = 1, and in the second the first pair of digits is 20, so this square is 4^2 = 16. Subtract the square from the first pair of digits, and put its square root in as the first digit of the result:

     1.                          4.
    ----------------            ----------------
  V 02. 00 00 00 ...          V 20. 00 00 00 ...
  -  1                        - 16
    ---                         ----
     1                           4

(C) Now repeat the following for each further digit of the result: bring down the next pair of digits to extend the remainder, then find the largest digit y such that y times (twice the answer so far, with y appended as a final digit) does not exceed this remainder. That y is the next digit of the answer, and the product is subtracted from the remainder. (For this purpose the decimal point in the answer is ignored: an answer so far of 1.4 is treated as 14.) For the first repetition, the remainders are 1 00 = 100 and 4 00 = 400 respectively:

  Twice answer so far is 2*1=2.     Twice answer so far is 2*4=8.
   0*20 =   0, less than 100         0*80 =   0, less than 400
   1*21 =  21, less than 100         1*81 =  81, less than 400
   2*22 =  44, less than 100         2*82 = 164, less than 400
   3*23 =  69, less than 100         3*83 = 249, less than 400
   4*24 =  96, less than 100         4*84 = 336, less than 400
   5*25 = 125, more than 100         5*85 = 425, more than 400
  So the next digit of the answer   So the next digit of the answer
  is 4, and we must subtract 96.    is 4, and we must subtract 336.

For the second repetition, the remainders are 4 00 = 400 and 64 00 = 6400:

  Twice answer so far is 2*14=28.   Twice answer so far is 2*44=88.
   0*280 =   0, less than 400        0*880 =    0, less than 6400
   1*281 = 281, less than 400        1*881 =  881, less than 6400
   2*282 = 564, more than 400              :
  So the next digit of the answer    7*887 = 6209, less than 6400
  is 1, and we must subtract 281.    8*888 = 7104, more than 6400
                                    So the next digit of the answer
                                    is 7, and we must subtract 6209.
Note that at every stage, the square of the answer so far plus the remainder (scaled appropriately) gives back the original argument:

   2 = 1^2    + 1            20 = 4^2    + 4
     = 1.4^2  + 0.04            = 4.4^2  + 0.64
     = 1.41^2 + 0.0119          = 4.47^2 + 0.0191
     = ...                      = ...
[ Aside: this fact yields another useful property of the algorithm, which is that (as in long division) your answer is exactly right if the remainder becomes 0 and there are no further non-zeros to bring down. This can be very helpful e.g. for calculating square roots in floating point systems that conform to IEEE standard 754. ]
I won't prove this here, but it depends on the identity:
(10x+y)^2 - (10x)^2 = 20xy + y^2 = (20x + y) * y
The quantities (20x + y) * y are what the digit calculation in part (C) is actually determining.
One important observation: this algorithm simplifies immensely in binary! I'll run through it, illustrating it for the case of the square root of two again:
(A) Write down the argument, split into pairs aligned with the binary point:
V 10. 00 00 00 00 ...
(B) If the first pair is 00, the first digit of the answer is 0; otherwise it is 1. Subtract the square of this first digit from the first pair and bring down the next pair:
     1.
    --------------------
  V 10. 00 00 00 00 ...
  -  1
    ------
     1 00
(C) There are only two possible next digits, namely 0 and 1. But 0 * (twice the answer so far concatenated with 0) is always 0, which must be less than or equal to the remainder. So all that matters is whether 1 * (twice the answer so far concatenated with 1) is less than or equal to the remainder: if it is, the next result digit is 1; if it is greater than the remainder, the next result digit is 0. One more thing: twice the answer so far concatenated with 1 is the answer so far concatenated with 01.
So the result digit selection procedure becomes: concatenate the answer so far with 01, and compare with the remainder. If greater, leave the remainder unchanged and the result digit is 0; if less than or equal, subtract it from the remainder and the result digit is 1.
To continue the example (after step (B) the answer so far is 1 and the remainder, with the next pair brought down, is 1 00):

Result so far concatenated with 01 is 101, which is greater than 100, so the result digit is 0 and we subtract 0. The answer so far is 1.0; bringing down the next pair makes the remainder 1 00 00.

Result so far concatenated with 01 is 1001, which is less than 10000, so the result digit is 1 and we subtract 1001. The answer so far is 1.01; the remainder becomes 111, or 1 11 00 after bringing down the next pair.

Result so far concatenated with 01 is 10101, which is less than 11100, so the result digit is 1 and we subtract 10101. The answer so far is 1.011; the remainder becomes 111, or 1 11 00 after bringing down the next pair.

Result so far concatenated with 01 is 101101, which is greater than 11100, so the result digit is 0 and we subtract 0. The answer so far is 1.0110, and so on: in binary, sqrt(2) = 1.0110101...
Hope this helps.
David Seal
dseal@armltd.co.uk
>In article <1991Nov20.101023.26008@rdg.dec.com> goodwin@oyez.enet.dec.com (Pete Goodwin) writes:
>>[stuff about !PVray]
>>
>>Got ARM code for square roots? 8)
>I haven't any code, but it shouldn't be too difficult for you to write some.
>The algorithm I'd use would be a Newton-Raphson iteration scheme designed
>to solve x^2=z. (Whether this is the fastest method I don't know)
[Method deleted]
>so you can see 4 or maybe 5 iterations at most is enough. Each iteration
>(if you arrange things *carefully*) will need only a 16x16 bit multiply
>and a 32x16 bit divide.
>Graeme Williams
There is a *much* faster way. If your number is of the form a*2^n, where 0.5 <= a < 1.0, then depending on whether the exponent is even or odd:
sqrt(a*2^(2n)) = sqrt(a)*2^n
sqrt(a*2^(2n+1)) = sqrt(a/2)*2^(n+1)
This reduces the problem to finding the square root of a number a with 0.25 <= a < 1.0, which can be done by the method below at a cost comparable to 2 or 3 divisions.
    { 0.25 <= a < 1.0 }
    x := 4*(a-0.25);
    r := 1;
    n := 1;
    while (n<number-of-bits) and (x<>0) do begin
      { invariant: r^2+x = 4^n*a, x<2*r+1 }
      n := n+1;
      t := 4*r+1;
      x := 4*x;
      r := 2*r;
      if x >= t then begin
        x := x-t; r := r+1
      end
    end
Note that only additions, subtractions and multiplications by 2 or 4 are used in each pass of the loop. At the end a single shift of r, by (number-of-bits - n) places, is done. Note that r and t are integers, while x is a real.
The (x<>0) test allows you to exit the loop early if an exact root is found, but it can be omitted without causing error (so the loop can be unrolled).
We can reformulate the algorithm to keep x of the order of 1.0 throughout the algorithm:
    { 0.25 <= a < 1.0 }
    x := a-0.25;
    r := 1;
    for n := 2 to number-of-bits do begin
      { invariant: r^2+2^(n+1)*x = 4^n*a, x<(2*r+1)*2^-(n+1) }
      t := 4*r+1;
      x := 2*x;
      r := 2*r;
      if x >= t/2^(n+1) then begin
        x := x-t/2^(n+1); r := r+1
      end
    end
Assuming we have a 32 bit mantissa which is stored with the most significant bit representing 0.5, we would get the following ARM code, which uses 213 cycles to get a 32 bit result. A test for x=0 could be added at every iteration. This would make the minimum time shorter, but unless the input contains a large proportion of exact squares it would make the average time longer.
\ on entry Ra holds a
\ on exit Rr holds sqrt(a)
\ uses Rx and Rt (either can be equal to Ra)
        SUB   Rx,Ra,#2^30       \ x := a-0.25;
        MOV   Rr,#1             \ r := 1;

        MOV   Rt,Rr,ASL #2
        ADD   Rt,Rt,#1          \ t := 4*r+1;
        MOV   Rx,Rx,ASL #1      \ x := 2*x;
        MOV   Rr,Rr,ASL #1      \ r := 2*r;
        SUBS  Rt,Rx,Rt,ASL #29  \ if x >= t/2^(n+1) then begin
        MOVPL Rx,Rt             \   x := x-t/2^(n+1);
        ADDPL Rr,Rr,#1          \   r := r+1
        ...
        MOV   Rt,Rr,ASL #2
        ADD   Rt,Rt,#1          \ t := 4*r+1;
        MOV   Rx,Rx,ASL #1      \ x := 2*x;
        MOV   Rr,Rr,ASL #1      \ r := 2*r;
        SUBS  Rt,Rx,Rt,ASL #0   \ if x >= t/2^(n+1) then begin
        MOVPL Rx,Rt             \   x := x-t/2^(n+1);
        ADDPL Rr,Rr,#1          \   r := r+1
\ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa
We can actually do better than this by avoiding the doubling of r in each iteration. This way r and t are no longer integers, but the code is better anyway:
    { 0.25 <= a < 1.0 }
    x := a-0.25;
    r := 0.5;
    for n := 2 to number-of-bits do begin
      { invariant: r^2+x/2^(n-1) = a, x<r+2^-(n+1) }
      t := r+2^-(n+2);
      x := 2*x;
      if x >= t then begin
        x := x-t; r := r+2^-(n+1)
      end
    end
In ARM this uses 149 cycles to get a 32 bit result. Try to better that!
\ on entry Ra holds a
\ on exit Rr holds sqrt(a)
\ uses Rx and Rt (either can be equal to Ra)
SUB Rx,Ra,#2^30 \ x := a-0.25;
MOV Rr,#2^31 \ r := 0.5;
ADD Rt,Rr,#2^28 \ t := r+2^-(n+2);
MOV Rx,Rx ASL #1 \ x := 2*x;
SUBS Rt,Rx,Rt \ if x >= t then begin
MOVPL Rx,Rt \ x := x-t;
ADDPL Rr,Rr,#2^29 \ r := r+2^-(n+1)
...
ADD Rt,Rr,#2^0 \ t := r+2^-(n+2);
MOV Rx,Rx ASL #1 \ x := 2*x;
SUBS Rt,Rx,Rt \ if x >= t then begin
MOVPL Rx,Rt \ x := x-t;
ADDPL Rr,Rr,#2^1 \ r := r+2^-(n+1)
\ t := r+2^-(n+2);
\ { 2^-(n+2) is less than precision }
\ x := 2*x;
RSBS Rt,Rr,Rx ASL #1 \ if x >= t then begin
\ x := x-t; { x not needed }
ADDPL Rr,Rr,#2^0 \ r := r+2^-(n+1)
\ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa
Note: I haven't tried this code on my Archimedes, as it is at home and
I am at work. I have, however, tried the pseudo-Pascal code.
[ lots deleted ]
>We can actually get this better by avoiding the doubling of r in
>each iteration. This way r and t are no longer integers, but the code
>is better anyway:
> { 0.25 <= a < 1.0 }
> x := a-0.25;
> r := 0.5;
> for n := 2 to number-of-bits do begin
> { invariant: r^2+x/2^(n-1) = a, x<r+2^-(n+1) }
> t := r+2^-(n+2);
> x := 2*x
> if x >= t then begin
>      x := x-t; r := r+2^-(n+1)
>    end
>  end
I made a small error in this. The correct code is:
    { 0.25 <= a < 1.0 }
    x := a-0.25;
    r := 0.5;
    for n := 2 to number-of-bits do begin
      { invariant: r^2+x/2^(n-2) = a, x<r+2^-(n+1) }
      t := r+2^-(n+1);
      x := 2*x;
      if x >= t then begin
        x := x-t; r := r+2^-n
      end
    end
There is, however, still a small problem: x might become greater than 1 after the doubling. We can solve this by delaying the doubling:
    { 0.25 <= a < 1.0 }
    x := a-0.25;
    r := 0.5;
    for n := 2 to number-of-bits do begin
      { invariant: r^2+x/2^(n-2) = a, x<r+2^-(n+1) }
      t := r+2^-(n+1);
      if x >= t/2 then begin
        x := x-t/2; r := r+2^-n
      end;
      x := 2*x
    end
Also, my ARM code used unsigned numbers but compared them as if they were signed (using PL); comparison of unsigned numbers requires HS. The code is still untested, so there might be further bugs. Note also that you can't write the constant 2^31 in BASIC assembler: use 1<<31 instead. The number of cycles has increased to 154, as an extra pass through the loop is required to get 32 bit precision:
\ on entry Ra holds a
\ on exit Rr holds sqrt(a)
\ uses Rx and Rt (either can be equal to Ra)
SUB Rx,Ra,#2^30 \ x := a-0.25;
MOV Rr,#2^31 \ r := 0.5;
ADD Rt,Rr,#2^29 \ t := r+2^-(n+1);
SUBS Rt,Rx,Rt,LSR #1 \ if x >= t/2 then begin
MOVHS Rx,Rt \ x := x-t/2;
ADDHS Rr,Rr,#2^30 \ r := r+2^-n
MOV Rx,Rx,ASL #1 \ x := 2*x;
...
ADD Rt,Rr,#2^0 \ t := r+2^-(n+1);
SUBS Rt,Rx,Rt,LSR #1 \ if x >= t/2 then begin
MOVHS Rx,Rt \ x := x-t/2;
ADDHS Rr,Rr,#2^1 \ r := r+2^-n
MOV Rx,Rx,ASL #1 \ x := 2*x;
\ t := r+2^-(n+1);
\ { 2^-(n+1) is less than precision }
SUBS Rt,Rx,Rr,LSR #1 \ if x >= t/2 then begin
\ x := x-t/2; { x not needed }
ADDHS Rr,Rr,#2^0 \ r := r+2^-n
\ x := 2*x; { x not needed }
\ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa
Torben Mogensen (torbenm@diku.dk)