The pthread-enabled parallel scanner is operational!
[yg@localhost pscan]$ gcc -Wall -Os -DWIDTH=16 -lpthread -D_REENTRANT -DTHREADS=4 pscan_04.c -o pscan
[yg@localhost pscan]$ /usr/bin/time ./pscan
Scanning w16 from X=1 (included) to 65536 (excluded)
Spawning 4 threads.
log.p7k done.
3.07user 0.00system 0:00.79elapsed 388%CPU (0avgtext+0avgdata 2616maxresident)k
0inputs+3328outputs (0major+309minor)pagefaults 0swaps
For now, the number of threads is chosen at compile time and the structures are statically allocated, though fewer threads will be spawned if there is not enough work for all of them. You can get the source files at pscan_20211224.tgz to play at home; the -DTHREADS=4 define should be tuned to your platform.
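To make the compile-time approach concrete, here is a minimal sketch (not the actual pscan_04.c code; worker() and job_count are made-up names) of statically allocated thread handles with the spawn count clamped to the available work:

#include <pthread.h>
#include <stdio.h>

#ifndef THREADS
#define THREADS 4            /* chosen at compile time, e.g. -DTHREADS=4 */
#endif

static pthread_t tid[THREADS];   /* statically allocated thread handles */

static void *worker(void *arg) {
    /* ... scan the slice of the range assigned to this thread ... */
    (void)arg;
    return NULL;
}

int main(void) {
    int job_count = 3;           /* however many work items are available */
    /* spawn at most THREADS workers, fewer if there is not enough work */
    int n = job_count < THREADS ? job_count : THREADS;
    printf("Spawning %d threads.\n", n);
    for (int i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

Built the same way as above (gcc ... -lpthread), this prints "Spawning 3 threads." when only three work items exist, which is the behaviour described.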
$ for i in 16 17 18 19 20; do gcc -Wall -Os -DWIDTH=$i -lpthread -D_REENTRANT -DTHREADS=4 pscan_04.c -o pscan ; /usr/bin/time ./pscan ; mv log.p7k log.$i.p7k ; done
Scanning w16 from X=1 (included) to 65536 (excluded)
3.06user 0.00system 0:00.79elapsed 387%CPU
Scanning w17 from X=1 (included) to 131072 (excluded)
14.42user 0.00system 0:03.65elapsed 394%CPU
Scanning w18 from X=1 (included) to 262144 (excluded)
57.39user 0.01system 0:14.88elapsed 385%CPU
Scanning w19 from X=1 (included) to 524288 (excluded)
229.89user 0.03system 0:59.25elapsed 388%CPU
Scanning w20 from X=1 (included) to 1048576 (excluded)
1051.04user 0.25system 4:31.12elapsed 387%CPU
The runtime scales pretty linearly with the amount of work. And compared to the script presented in 95. New packing at work, job management is much easier: CTRL-C affects all the threads at once, no killall required. This comes at the price of maybe 150 lines of C overall, but it works and the results look coherent. Since it's now a single program, I can measure the performance precisely and easily, which was harder with the script. Better than a script, the workload is dynamically spread over the worker threads, so the total time is barely affected if one core is also busy with other work (doing the I/O, for example).
Here is the runtime for 1 to 9 threads, on the old 4-core i7:
$ for i in 1 2 3 4 5 6 7 8 9; do gcc -Wall -Os -DWIDTH=18 -lpthread -D_REENTRANT -DTHREADS=$i pscan_04.c -o pscan ; /usr/bin/time ./pscan ; done
Spawning 1 threads: 40.94user 0.01system 0:41.05elapsed 99%CPU
Spawning 2 threads: 43.60user 0.01system 0:21.86elapsed 199%CPU
Spawning 3 threads: 59.75user 0.02system 0:20.10elapsed 297%CPU
Spawning 4 threads: 69.85user 0.03system 0:18.46elapsed 378%CPU
Spawning 5 threads: 71.62user 0.03system 0:18.61elapsed 384%CPU
Spawning 6 threads: 73.25user 0.03system 0:19.21elapsed 381%CPU
Spawning 7 threads: 74.68user 0.02system 0:19.14elapsed 390%CPU
Spawning 8 threads: 74.90user 0.03system 0:19.54elapsed 383%CPU
Spawning 9 threads: 75.72user 0.02system 0:19.51elapsed 388%CPU
Do I need to plot that data? Obviously there is no point in spawning more threads than cores, but the penalty for doing so is very small. On this processor, however, going from 2 to 4 threads only brings about 18% more throughput (0:21.86 vs 0:18.46 elapsed). Recent i7s scale better.
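A possible refinement, not in the current source and shown here only as a sketch, would be to query the core count at run time instead of hard-coding it with -DTHREADS:

#include <unistd.h>
#include <stdio.h>

int main(void) {
    /* number of logical CPUs currently online (POSIX) */
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpu < 1)
        ncpu = 1;            /* fall back to a single worker */
    printf("Would spawn %ld threads.\n", ncpu);
    return 0;
}

Note that this counts logical CPUs, so on a hyper-threaded part it may report twice the number of physical cores.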
The program uses a pair of FIFOs to distribute the jobs (the workload) and collect the results. I didn't use elaborate semaphores and implemented a rough polling scheme instead: HZ is an integer define, the frequency at which the main thread polls the result FIFO, and FIFOLOG2 sets the depth of the queues (2^FIFOLOG2 entries). Each "worker" thread ends as soon as its input FIFO is empty, so the total run/range could be incomplete if HZ and FIFOLOG2 are inappropriate. HZ will decrease as WIDTH increases, and FIFOLOG2 might decrease too... For W32 I suppose I'll poll at 1 Hz and keep FIFOLOG2 at 10, which means a 1024-deep queue.
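To illustrate the scheme, here is a minimal, self-contained sketch of a job FIFO and a result FIFO with polling, assuming mutex-protected ring buffers (the real pscan code likely differs in the details; all names below are illustrative and the x*x payload is a stand-in for the actual scan):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#ifndef THREADS
#define THREADS  4
#endif
#ifndef FIFOLOG2
#define FIFOLOG2 10                     /* 2^10 = 1024-deep queue */
#endif
#ifndef HZ
#define HZ       10                     /* result-FIFO polling frequency */
#endif
#define FIFOSIZE (1u << FIFOLOG2)
#define FIFOMASK (FIFOSIZE - 1)

typedef struct {
    unsigned head, tail;
    long data[FIFOSIZE];
    pthread_mutex_t lock;
} fifo_t;

static fifo_t jobs    = { .lock = PTHREAD_MUTEX_INITIALIZER };
static fifo_t results = { .lock = PTHREAD_MUTEX_INITIALIZER };

static int fifo_pop(fifo_t *f, long *v) {   /* returns 0 when empty */
    int ok = 0;
    pthread_mutex_lock(&f->lock);
    if (f->head != f->tail) {
        *v = f->data[f->tail++ & FIFOMASK];
        ok = 1;
    }
    pthread_mutex_unlock(&f->lock);
    return ok;
}

static void fifo_push(fifo_t *f, long v) {
    pthread_mutex_lock(&f->lock);
    f->data[f->head++ & FIFOMASK] = v;      /* caller must not overfill */
    pthread_mutex_unlock(&f->lock);
}

static void *worker(void *arg) {
    long x;
    (void)arg;
    while (fifo_pop(&jobs, &x))             /* worker ends when its input FIFO is empty */
        fifo_push(&results, x * x);         /* stand-in for the real scan of X */
    return NULL;
}

int main(void) {
    pthread_t tid[THREADS];
    long x, collected = 0, total = 1000;

    for (x = 1; x <= total; x++)            /* fill the job FIFO up front */
        fifo_push(&jobs, x);
    for (int i = 0; i < THREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

    while (collected < total) {             /* main thread polls at HZ */
        long r;
        while (fifo_pop(&results, &r))
            collected++;                    /* here: write r to the log file */
        usleep(1000000 / HZ);
    }
    for (int i = 0; i < THREADS; i++)
        pthread_join(tid[i], NULL);
    printf("collected %ld results\n", collected);
    return 0;
}

The 2^FIFOLOG2 sizing and the 1/HZ sleep in the main loop mirror the two defines mentioned above: if the polling rate is too low relative to the production rate, a result FIFO of that depth can overflow and results get lost, which is why those defines must be tuned per WIDTH.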
0.79s is not bad, considering it includes all the I/O (the output log is 1703910 bytes long). It's not a raw 4× speedup but it's a convincing result. An even faster run is possible with SIMD code: that's the next step...
Happy new year and merry Christmas everybody!
______________________________________________________
Edit:
First, notice that the total run time is rounded up a little because the main thread only wakes up at the polling interval, which adds some granularity. The other thing is that, due to the now much shorter runtime, w16 is no longer a reliable benchmark; it is replaced by w17 at 3.81s. The following runs shared the CPU with other programs (see the 332% CPU figure) but the scaling is pretty good and indicative of what's to come.
Scanning w16 from X=0 to 65536 => 0.00system 0:00.80elapsed 378%CPU
Scanning w17 from X=0 to 131072 => 0.00system 0:03.81elapsed 374%CPU
Scanning w18 from X=0 to 262144 => 0.01system 0:14.96elapsed 384%CPU
Scanning w19 from X=0 to 524288 => 0.11system 1:14.96elapsed 344%CPU
Scanning w20 from X=0 to 1048576 => 0.65system 6:24.30elapsed 332%CPU
Scanning w21 from X=0 to 2097152 => 2.52system 24:39.46elapsed 345%CPU
Scanning w22 from X=0 to 4194304 => 6.08system 1:29:33 elapsed 365%CPU
System time also becomes noticeable.
Each new width quadruples the runtime so we can extrapolate the durations:
- 6h for w23 (7h13m at 316%CPU)
- 1 day for w24 (5h13m at 1198%CPU, at about 3.7GHz, about 1000 crossings/second)
- 4 days for w25 (20h58m at 500 crossings/s on 12 threads)
- 16 days for w26 (still better than 2 months and consistently 4× faster than the original algo)
And this is with 4 threads so we can expect w26 in 5 or 6 days with 12 threads. The corresponding log will be about 1.7GB...
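As a sanity check of the 4× rule, the figures in the list can be reproduced from the measured w22 time (1:29:33, roughly 5373 seconds) with a small throwaway program (just an illustration of the arithmetic, not part of pscan):

#include <stdio.h>

int main(void) {
    double t = 5373.0;               /* measured w22 runtime in seconds (1:29:33) */
    for (int w = 23; w <= 26; w++) {
        t *= 4.0;                    /* each extra bit of WIDTH quadruples the runtime */
        printf("w%d: ~%.1f hours (~%.1f days)\n", w, t / 3600.0, t / 86400.0);
    }
    return 0;
}

It prints roughly 6 h for w23, 24 h for w24, 4 days for w25 and 16 days for w26, matching the list above.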
The irony is that rewriting the code took much longer than the original 2 months of computation.
Discussions
Wed Dec 29 23:39:29 CET 2021:
I just started w26 on the 12-thread i7 and it should end in 4 to 5 days...
The scan rate is about 230 to 240 crossings per second.
That would translate to only 4 crossings per second for w32, requiring one billion seconds to complete...
Computation should end tomorrow and I'm not even close to starting the fusion program...