Project | Game of Life bit-parallel algorithm

« Back to project details Sort by:

Article in GLMF
03/01/2018 at 01:27 • 0 comments

The article is out !

I have about 14 pages in GNU/Linux Magazine France #213 where I go through all the optimisation process and go even further.

This article goes beyond the methods shown so far and I will try to develop them. The AVX SIMD extension brings some significant advantages and require a deep redesign of the algorithm but the basic ideas remain the same, summarised in the following graph:

I'll have to update the project, or maybe create a new one. Meanwhile, I'm pretty proud :
So it works.
11/29/2016 at 05:07 • 0 comments
I just uploaded another version of the C file, this time with a copy-paste-adjust-warnings block from the JS version, and it worked right away.
For the record, you have to press ENTER to compute the next iteration. Add to this the cost of displaying data through the Xterm and you understand that it's far from the true speed that this algo can reach. But I'll optimise speed for #AMBAP: A Modest Bitslice Architecture Proposal later.
For now, I rejoice that it worked just like I expected :-)
```
                       #
                     ####
            ##     #### ##         ##
          #  #   #   ## ###        ##
 ##      #       #   ## ##
 ##      #      #   #####
         #       ###   #
          #  #
            ##

                         #
                          ##
                         ##

              ####
             # ## #
             ##  ##
             ##  ##
             # ## #             # #
              ####               ##
                                 #
```

Handling data

11/29/2016 at 03:59 • 0 comments

I've just uploaded a C file that works with 64 bits (though it can be anything) and provides a skeleton for the data shuffling that reduces memory usage.

There is no double-buffering, the cells are computated "in place". To prevent time-distorsions (the result being directly used for the next iteration), the current result is written to the WriteBuffer variable. This variable is then written back to memory after the past value is read (again).

    L1=world[i-1];          // make sure we read
    world[i-1]=WriteBuffer; // the value of t-1, not the
                            // result computed in the last iteration
    L2=world[i  ];   // I know, a rotating register barrel
    L3=world[i+1];   // would work better but not for AMBAP

To emulate "some sort of expanding computation", I have used the following simple formula :

WriteBuffer = L1+L2+L3;

This helps me check that the ranges of indices can be dynamically adjusted. For example, with the following initial configuration:

start: 13

                         #
                       # ##
             ##      ## ## #       ##
            # ###    # #  ##       #  #
 ##        #  # # #   # #  #       ####
 #  #      #   # ### # ####        ####
 ####       ## #### # # #  #       #  #
 ####      ## ### #### #  ##       ##
 #  #       ## #### ### ## #
 ##        #   # ###   # ##
           #  # # #      #
            # ###
             ##
               ##
              # # #
             #  # ##
            #    # ##
            #  # # # #
            #### ## ##
            #### ## ##
            #  # # # #
            #    # ##
             #  # ##
              # # #
               ##
max=40

After a few cycles, the range has grown :

start: 10

                         #
                       #  ##
             ##      ### # # #     ##
            #### #    ####   ##     #  #
 ##        #####  ## ###  ## # #     ####
  #  #       ##### # ##### # ###   ###    #
   ####    # ####   ##   # #    #  #    ###
 ###    #   #   # ## #    ##    #       #  #
 #    ###   #  # # # ##    #    #       #  #
      #  #  # # ### # # # ##    #  #    ###
      #  # # ### ###      ##    #  ###    #
 #    ###   # # ### # ### ## ###     ####
 ###    #   #  # # # ##  ### # #    #  #
   ####     #  # ### # #  #  ##    ##
  #  #     # # ### ## # ## # #
 ##           ###   # ##  ##
           # ##    ####  #
            ##     #####
              # ##  # ###
            #  ##   #### #
            ###### #      #
             # # ##### #  #
             # # ##### #  #
            ###### #      #
            #  ##   #### #
             ### #  # ###
              # # # ####
            ### ## # ##
             #### #  #
              ###  #
               ##
max=43

It is important to keep an empty line before first line and after the last line, so more cells can grow there.

The adjustment of the last line is done with the variable run: it is first cleared to 0 then set to the line number where the line is not empty. This value is then assigned at the end of the scan:

  MaxLine=run;

MaxLine will be the upper limit for the next time step, but it should preserve the previously mentioned "margin" for the growth of new cells beyond the last line. That's why run is set with a little offset, inside the loop:

    if (WriteBuffer)
      run=i+2;  // push the end of the work range

Dealing with the start index is a bit more tricky but run provides us with a way to handle it nicely. It remains stuck to 0 while empty lines are found so we can increase MinLine as long as run is cleared. The complete code is simply:

    if (WriteBuffer)
      run=i+2;  // push the end of the work range
    else
      if (run==0) // no life found yet
        MinLine=i-1; // push the start of the work range

Overall, it's a bit delicate but not totally arcane, since it's only a 1D array. I keep it simple because a 2D array has some very tricky side effects. The AMBAP is limited to 16-bits wide data anyway, for computation AND display.

The parallel boolean computation

11/28/2016 at 20:36 • 0 comments

I uploaded GoL.js and GoL.html. They don't "work" yet but they allowed me to test the following, functional code:

    L2=world[i];
    L1=world[i-1];
    L3=world[i+1];

    // half adder:
    S20 = L1 ^ L3;
    S21 = L1 & L3;
    // full adder:
    S31 = L1 | L3;
    S30 = S20 ^ L2;
    S31 = S31 & L2;
    S31 = S31 | S21;

    // Sum LSB
    S0 = (S30>>>1) ^ S20;
    S0 = (S30<< 1) ^ S0;

    // LSB Carry
    S0C = (S30<< 1) | S20;
    S20_= (S30<< 1) & S20;
    S0C = (S30>>>1) & S0C;
    S0C = S0C | S20_;

    // Sum middle bit
    S1 = S31>>>1 ^ S0C;
    S1 = S31<<1  ^ S1;
    S1 = S21     ^ S1;

// Inhibit
    INH = S31<<1 & S21;
    S1_ = S31<<1 ^ S21;
    C2 = S31>>1 & S0C;
    INH = INH | C2;
    S2_ = S31>>1 ^ S0C;
    S2_ = S2_ & S1_;
    INH = INH | S2_;

    X2 = S0 | L2;
    X2 = X2 & S1;
    X2 = X2 & ~INH; // The result is now in X2

This is the contents of the inner computational loop, made of a series of half-adder and full-adder "blocks" that perform the bit 2D counting (popcount). This will evolve, as well as the surrounding code, to keep the number of needed registers as low as possible (4 for AMBAP, plus 2 mapped to memory).

Game of Life bit-parallel algorithm

Article in GLMF

So it works.

Handling data

The parallel boolean computation