In article <1991Nov20.101023.26008@rdg.dec.com> goodwin@oyez.enet.dec.com (Pete Goodwin) writes:
>[stuff about !PVray]
>
>Got ARM code for square roots? 8)
I haven't any code, but it shouldn't be too difficult for you to write some. The algorithm I'd use would be a Newton-Raphson iteration scheme designed to solve x^2=z. (Whether this is the fastest method I don't know)
The way this works is you make a guess for sqrt(z), say x_0. (The guess can be very rough: I'd find out how many bits are in z (say there were 23), halve that (11 or 12), and then guess x_0 = 2^11 or 2^12.)
One then makes successive approximations via
x_{n+1} = x_{n} - (x_{n}^2 - z)/2/x_{n}
This scheme will converge very rapidly to sqrt(z) if your initial guess is at all close to sqrt(z) (give or take a factor of 2 or 4 is close enough). You might want to rewrite the iteration in the algebraically equivalent form x_{n+1} = (x_{n} + z/x_{n})/2; it might be quicker (I've no idea).
As an example: if z = 529020918 and we guess x0 = 16384, then we find

  x1 = 24336.43719
  x2 = 23037.12504
  x3 = 23000.48392
  x4 = 23000.45473
  x5 = 23000.45473
  x6 = 23000.45473

so you can see 4 or maybe 5 iterations at most is enough. Each iteration (if you arrange things *carefully*) will need only a 16x16 bit multiply and a 32x16 bit divide.
Graeme Williams
>In article <1991Nov20.101023.26008@rdg.dec.com> goodwin@oyez.enet.dec.com (Pete Goodwin) writes:
>>[stuff about !PVray]
>>
>>Got ARM code for square roots? 8)
>
>I haven't any code, but it shouldn't be too difficult for you to write some.
>The algorithm I'd use would be a Newton-Raphson iteration scheme designed
>to solve x^2=z. (Whether this is the fastest method I don't know)
>
>The way this works is you make a guess for sqrt(z), say x_0. (The guess
>can be very rough, I'd do something like find out how many bits in z
>(say there were 23) half that (11 or 12) and then guess x_0 (2^11 or 2^12.)
>
>One then makes successive approximations via
>
> x_{n+1} = x_{n} - (x_{n}^2 - z)/2/x_{n}
>
>This scheme will converge very rapidly to sqrt(z) if your initial guess
>is at all close to z...
The Newton-Raphson technique is a good one when you've got a fast division - e.g. if you've got division in hardware (and haven't got square root in hardware :-). However, if you're going to do division by a software routine (as you must on an ARM), there is a "long square root" algorithm which is very similar to the "long division" algorithms used to do division in software. It takes only a bit more time than one division *in total*, compared with a division, an addition and a shift *for each of several iterations* in Newton-Raphson. I would therefore recommend it for ARM square root routines when speed is important.
I won't supply code, since (a) I don't know what type of numbers you want to take square roots of, how accurate a result you want, etc. - these details will obviously affect the code; (b) I don't have time to write code, whereas I have a written description of the algorithm on file which I can simply include here!
I'll illustrate the "long square root" algorithm in base 10 - suppose we want to take the square roots of 2 and 20:
(A) First write down the number you want to take the square root of, split into groups of two digits, with one split at the decimal point. In these particular cases, this gives us:
  V 02. 00 00 00 ...        V 20. 00 00 00 ...
(Note that the two numbers chosen have the same pattern of digits, but different splits because of the different positions of the decimal point. This is the only reason they produce different answers.)
(B) Initialise by finding the largest square which is not greater than the first pair of digits: in the first case the first pair of digits is 02 = 2, so this square is 1^2 = 1, and in the second the first pair of digits is 20, so this square is 4^2 = 16. Subtract the square from the first pair of digits, and put its square root in as the first digit of the result:

     1.                          4.
    ----------------            ----------------
  V 02. 00 00 00 ...          V 20. 00 00 00 ...
  -  1                        - 16
    ---                         ----
     1                           4

(C) Now repeat the following for each further digit of the result: bring down the next pair of digits to extend the remainder, then find the largest digit y such that y times (twice the answer so far, with y appended as a final digit) does not exceed this remainder. That y is the next digit of the answer, and the product is subtracted from the remainder. (For this purpose the decimal point in the answer is ignored: an answer so far of 1.4 is treated as 14.) For the first repetition, the remainders are 1 00 = 100 and 4 00 = 400 respectively:

  Twice answer so far is 2*1=2.     Twice answer so far is 2*4=8.
   0*20 =   0, less than 100         0*80 =   0, less than 400
   1*21 =  21, less than 100         1*81 =  81, less than 400
   2*22 =  44, less than 100         2*82 = 164, less than 400
   3*23 =  69, less than 100         3*83 = 249, less than 400
   4*24 =  96, less than 100         4*84 = 336, less than 400
   5*25 = 125, more than 100         5*85 = 425, more than 400
  So the next digit of the answer   So the next digit of the answer
  is 4, and we must subtract 96.    is 4, and we must subtract 336.

For the second repetition, the remainders are 4 00 = 400 and 64 00 = 6400:

  Twice answer so far is 2*14=28.   Twice answer so far is 2*44=88.
   0*280 =   0, less than 400        0*880 =    0, less than 6400
   1*281 = 281, less than 400        1*881 =  881, less than 6400
   2*282 = 564, more than 400              :
  So the next digit of the answer    7*887 = 6209, less than 6400
  is 1, and we must subtract 281.    8*888 = 7104, more than 6400
                                    So the next digit of the answer
                                    is 7, and we must subtract 6209.
Note that at every stage, the square of the answer so far plus the remainder (scaled appropriately) gives back the original argument:

   2 = 1^2    + 1            20 = 4^2    + 4
     = 1.4^2  + 0.04            = 4.4^2  + 0.64
     = 1.41^2 + 0.0119          = 4.47^2 + 0.0191
     = ...                      = ...
[ Aside: this fact yields another useful property of the algorithm, which is that (as in long division) your answer is exactly right if the remainder becomes 0 and there are no further non-zeros to bring down. This can be very helpful e.g. for calculating square roots in floating point systems that conform to IEEE standard 754. ]
I won't prove this here, but it depends on the identity:
(10x+y)^2 - (10x)^2 = 20xy + y^2 = (20x + y) * y
The quantities (20x + y) * y are what the digit calculation in part (C) is actually determining.
One important observation: this algorithm simplifies immensely in binary! I'll run through it, illustrating it for the case of the square root of two again:
(A) Write down the argument, split into pairs aligned with the binary point:
V 10. 00 00 00 00 ...
(B) If the first pair is 00, the first digit of the answer is 0; otherwise it is 1. Subtract the square of this first digit from the first pair and bring down the next pair:
     1.
    --------------------
  V 10. 00 00 00 00 ...
  -  1
    ------
     1 00
(C) There are only two possible next digits, namely 0 and 1. But 0 * (twice the answer so far concatenated with 0) is always 0, which must be less than or equal to the remainder. So all that matters is whether 1 * (twice the answer so far concatenated with 1) is less than or equal to the remainder: if it is, the next result digit is 1; if it is greater than the remainder, the next result digit is 0. One more thing: twice the answer so far concatenated with 1 is the answer so far concatenated with 01.
So the result digit selection procedure becomes: concatenate the answer so far with 01, and compare with the remainder. If greater, leave the remainder unchanged and the result digit is 0; if less than or equal, subtract it from the remainder and the result digit is 1.
To continue the example (after step (B) the answer so far is 1 and the remainder, with the next pair brought down, is 1 00):

Result so far concatenated with 01 is 101, which is greater than 100, so the result digit is 0 and we subtract 0. The answer so far is 1.0; bringing down the next pair makes the remainder 1 00 00.

Result so far concatenated with 01 is 1001, which is less than 10000, so the result digit is 1 and we subtract 1001. The answer so far is 1.01; the remainder becomes 111, or 1 11 00 after bringing down the next pair.

Result so far concatenated with 01 is 10101, which is less than 11100, so the result digit is 1 and we subtract 10101. The answer so far is 1.011; the remainder becomes 111, or 1 11 00 after bringing down the next pair.

Result so far concatenated with 01 is 101101, which is greater than 11100, so the result digit is 0 and we subtract 0. The answer so far is 1.0110, and so on: in binary, sqrt(2) = 1.0110101...
Hope this helps.
David Seal
dseal@armltd.co.uk
>In article <1991Nov20.101023.26008@rdg.dec.com> goodwin@oyez.enet.dec.com (Pete Goodwin) writes:
>>[stuff about !PVray]
>>
>>Got ARM code for square roots? 8)
>I haven't any code, but it shouldn't be too difficult for you to write some.
>The algorithm I'd use would be a Newton-Raphson iteration scheme designed
>to solve x^2=z. (Whether this is the fastest method I don't know)
[Method deleted]
>so you can see 4 or maybe 5 iterations at most is enough. Each iteration
>(if you arrange things *carefully*) will need only a 16x16 bit multiply
>and a 32x16 bit divide.
>Graeme Williams
There is a *much* faster way. If your number is of the form a*2^n, where 0.5 <= a < 1.0, then depending on whether the exponent is even or odd:
sqrt(a*2^(2n)) = sqrt(a)*2^n
sqrt(a*2^(2n+1)) = sqrt(a/2)*2^(n+1)
This reduces the problem to finding the square root of a number a with 0.25 <= a < 1.0, which can be done by the method below at a cost comparable to 2 or 3 divisions.
    { 0.25 <= a < 1.0 }
    x := 4*(a-0.25);
    r := 1;
    n := 1;
    while (n<number-of-bits) and (x<>0) do begin
      { invariant: r^2+x = 4^n*a, x<2*r+1 }
      n := n+1;
      t := 4*r+1;
      x := 4*x;
      r := 2*r;
      if x >= t then begin
        x := x-t; r := r+1
      end
    end
Note that only additions, subtractions and multiplications by 2 or 4 are used in each pass of the loop. At the end a single shift of r, by (number-of-bits - n) places, is done. Note that r and t are integers, while x is a real.
The (x<>0) test allows you to exit the loop early if an exact root is found, but it can be omitted without causing error (so the loop can be unrolled).
We can reformulate the algorithm to keep x of the order of 1.0 throughout the algorithm:
    { 0.25 <= a < 1.0 }
    x := a-0.25;
    r := 1;
    for n := 2 to number-of-bits do begin
      { invariant: r^2+2^(n+1)*x = 4^n*a, x<(2*r+1)*2^-(n+1) }
      t := 4*r+1;
      x := 2*x;
      r := 2*r;
      if x >= t/2^(n+1) then begin
        x := x-t/2^(n+1); r := r+1
      end
    end
Assuming we have a 32 bit mantissa which is stored with the most significant bit representing 0.5, we would get the following ARM code, which uses 213 cycles to get a 32 bit result. A test for x=0 could be added at every iteration. This would make the minimum time shorter, but unless the input contains a large proportion of exact squares it would make the average time longer.
\ on entry Ra holds a
\ on exit Rr holds sqrt(a)
\ uses Rx and Rt (either can be equal to Ra)
        SUB   Rx,Ra,#2^30       \ x := a-0.25;
        MOV   Rr,#1             \ r := 1;

        MOV   Rt,Rr,ASL #2
        ADD   Rt,Rt,#1          \ t := 4*r+1;
        MOV   Rx,Rx,ASL #1      \ x := 2*x;
        MOV   Rr,Rr,ASL #1      \ r := 2*r;
        SUBS  Rt,Rx,Rt,ASL #29  \ if x >= t/2^(n+1) then begin
        MOVPL Rx,Rt             \   x := x-t/2^(n+1);
        ADDPL Rr,Rr,#1          \   r := r+1
        ...
        MOV   Rt,Rr,ASL #2
        ADD   Rt,Rt,#1          \ t := 4*r+1;
        MOV   Rx,Rx,ASL #1      \ x := 2*x;
        MOV   Rr,Rr,ASL #1      \ r := 2*r;
        SUBS  Rt,Rx,Rt,ASL #0   \ if x >= t/2^(n+1) then begin
        MOVPL Rx,Rt             \   x := x-t/2^(n+1);
        ADDPL Rr,Rr,#1          \   r := r+1
\ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa
We can actually do better than this by avoiding the doubling of r in each iteration. This way r and t are no longer integers, but the code is better anyway:
    { 0.25 <= a < 1.0 }
    x := a-0.25;
    r := 0.5;
    for n := 2 to number-of-bits do begin
      { invariant: r^2+x/2^(n-1) = a, x<r+2^-(n+1) }
      t := r+2^-(n+2);
      x := 2*x;
      if x >= t then begin
        x := x-t; r := r+2^-(n+1)
      end
    end
In ARM this uses 149 cycles to get a 32 bit result. Try to better that!
\ on entry Ra holds a
\ on exit Rr holds sqrt(a)
\ uses Rx and Rt (either can be equal to Ra)
SUB Rx,Ra,#2^30 \ x := a-0.25;
MOV Rr,#2^31 \ r := 0.5;
ADD Rt,Rr,#2^28 \ t := r+2^-(n+2);
MOV Rx,Rx ASL #1 \ x := 2*x;
SUBS Rt,Rx,Rt \ if x >= t then begin
MOVPL Rx,Rt \ x := x-t;
ADDPL Rr,Rr,#2^29 \ r := r+2^-(n+1)
...
ADD Rt,Rr,#2^0 \ t := r+2^-(n+2);
MOV Rx,Rx ASL #1 \ x := 2*x;
SUBS Rt,Rx,Rt \ if x >= t then begin
MOVPL Rx,Rt \ x := x-t;
ADDPL Rr,Rr,#2^1 \ r := r+2^-(n+1)
\ t := r+2^-(n+2);
\ { 2^-(n+2) is less than precision }
\ x := 2*x;
RSBS Rt,Rr,Rx ASL #1 \ if x >= t then begin
\ x := x-t; { x not needed }
ADDPL Rr,Rr,#2^0 \ r := r+2^-(n+1)
\ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa
Note: I haven't tried this code on my Archimedes, as it is at home and
I am at work. I have, however, tried the pseudo-Pascal code.
[ lots deleted ]
>We can actually get this better by avoiding the doubling of r in
>each iteration. This way r and t are no longer integers, but the code
>is better anyway:
> { 0.25 <= a < 1.0 }
> x := a-0.25;
> r := 0.5;
> for n := 2 to number-of-bits do begin
> { invariant: r^2+x/2^(n-1) = a, x<r+2^-(n+1) }
> t := r+2^-(n+2);
> x := 2*x
> if x >= t then begin
>      x := x-t; r := r+2^-(n+1)
>    end
>  end
I made a small error in this. The correct code is:
    { 0.25 <= a < 1.0 }
    x := a-0.25;
    r := 0.5;
    for n := 2 to number-of-bits do begin
      { invariant: r^2+x/2^(n-2) = a, x<r+2^-(n+1) }
      t := r+2^-(n+1);
      x := 2*x;
      if x >= t then begin
        x := x-t; r := r+2^-n
      end
    end
There is, however, still a small problem: x might become greater than 1 after the doubling. We can solve this by delaying the doubling:
    { 0.25 <= a < 1.0 }
    x := a-0.25;
    r := 0.5;
    for n := 2 to number-of-bits do begin
      { invariant: r^2+x/2^(n-2) = a, x<r+2^-(n+1) }
      t := r+2^-(n+1);
      if x >= t/2 then begin
        x := x-t/2; r := r+2^-n
      end;
      x := 2*x
    end
Also, my ARM code used unsigned numbers but compared them as if they were signed (using PL); comparison of unsigned numbers requires HS. The code is still untested, so there might be further bugs. Note also that you can't write the constant 2^31 in BASIC assembler: use 1<<31 instead. The number of cycles has increased to 154, as an extra pass through the loop is required to get 32 bit precision:
\ on entry Ra holds a
\ on exit Rr holds sqrt(a)
\ uses Rx and Rt (either can be equal to Ra)
SUB Rx,Ra,#2^30 \ x := a-0.25;
MOV Rr,#2^31 \ r := 0.5;
ADD Rt,Rr,#2^29 \ t := r+2^-(n+1);
SUBS Rt,Rx,Rt,LSR #1 \ if x >= t/2 then begin
MOVHS Rx,Rt \ x := x-t/2;
ADDHS Rr,Rr,#2^30 \ r := r+2^-n
MOV Rx,Rx,ASL #1 \ x := 2*x;
...
ADD Rt,Rr,#2^0 \ t := r+2^-(n+1);
SUBS Rt,Rx,Rt,LSR #1 \ if x >= t/2 then begin
MOVHS Rx,Rt \ x := x-t/2;
ADDHS Rr,Rr,#2^1 \ r := r+2^-n
MOV Rx,Rx,ASL #1 \ x := 2*x;
\ t := r+2^-(n+1);
\ { 2^-(n+1) is less than precision }
SUBS Rt,Rx,Rr,LSR #1 \ if x >= t/2 then begin
\ x := x-t/2; { x not needed }
ADDHS Rr,Rr,#2^0 \ r := r+2^-n
\ x := 2*x; { x not needed }
\ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa
Torben Mogensen (torbenm@diku.dk)