# What every scientist should know about computer architecture

Important: these are instructor notes; remove this file before showing the materials to the students. The notes can be added back after the lecture, of course.

## Introduction
- Puzzle: how swapping two nested for-loops accounts for a >27× slowdown (a minimal sketch of the effect follows this list)
- Let students play around with the notebook and try to find the "bug"
- A more thorough benchmark using the same code is here
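The real code lives in `puzzle.ipynb`; purely as a sketch of the kind of asymmetry at play (the array size, loop bodies, and the exact slowdown factor here are illustrative assumptions, not the notebook's code):

```python
import time

import numpy as np

n = 8_000
a = np.zeros((n, n))  # numpy's default layout is C (row-major) order

# "Good" version: each slice a[i, :] is contiguous in memory.
t0 = time.perf_counter()
for i in range(n):
    a[i, :] += 1.0
print(f"row-wise access:    {time.perf_counter() - t0:.3f} s")

# "Bad" version with the indices swapped: each slice a[:, j] steps
# n * 8 bytes between consecutive elements, so almost every access
# misses the cache.
t0 = time.perf_counter()
for j in range(n):
    a[:, j] += 1.0
print(f"column-wise access: {time.perf_counter() - t0:.3f} s")
```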
## A digression into CPU architecture and the memory hierarchy
- The need for hierarchical access to data by the CPU should be clear now ➔ the "starving" CPU problem
- Have a look at the historical evolution of the speeds of different components in a computer:
  - the CPU clock rate
  - the memory (RAM) bandwidth, latency, and clock rate
  - the storage media access rates
- Measure sizes and timings for the memory hierarchy on my machine with a low-level C benchmark (a rough Python approximation follows this list)
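The actual measurement is done in C (see `benchmark_low_level`); as a rough Python approximation of the same idea (the chosen array sizes are assumptions, and where the jumps land is machine-specific):

```python
import time

import numpy as np

# Random gathers from arrays of growing size: once the working set
# stops fitting into a cache level, the time per access jumps.
n_access = 2_000_000
rng = np.random.default_rng(0)
for size_kib in (16, 64, 256, 1024, 4096, 16384, 65536, 262144):
    n = size_kib * 1024 // 8                  # number of float64 elements
    a = rng.random(n)
    idx = rng.integers(0, n, size=n_access)
    t0 = time.perf_counter()
    a[idx].sum()                              # gather runs in compiled code
    dt = time.perf_counter() - t0
    print(f"{size_kib:>7} KiB: {dt / n_access * 1e9:6.2f} ns per access")
```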
## Back to the Python benchmark (second try)
- can we explain what is happening?
- it must have to do with the good (or bad) use of cache properties
- but how are numpy arrays laid out in memory?
## Anatomy of a numpy array
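A quick way to inspect that layout with standard numpy attributes:

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)  # C (row-major) order, the default
f = np.asfortranarray(a)                           # same values, F (column-major) order

# shape is the logical view; strides say how many bytes to step per axis
print(a.shape, a.strides)  # (3, 4) (32, 8)  -> rows are contiguous
print(f.shape, f.strides)  # (3, 4) (8, 24)  -> columns are contiguous
print(a.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # True True
```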
## Back to the Python benchmark (third try)
- can we explain what is happening now? Yes, more or less ;-)
- quick fix for the puzzle: try adding `order='F'` in the "bad" snippet and see that it "fixes" the bug ➔ why? (see the sketch after this list)
## Notes on the Python benchmark

- while running it pinned to the P-core (`cpu0`; one way to pin is sketched after this list), the P-core was running under a constant load of 100% (almost completely user time) and at a fixed frequency of 3.8 GHz, where the theoretical max would be 5.2 GHz
- while running it pinned to the E-core (`cpu10`), the E-core was running under a constant load of 100% (almost completely user time) and at a fixed frequency of 2.5 GHz, where the theoretical max would be 3.9 GHz
- ... ➔ the CPU does not "starve" because it scales its speed down to match the memory throughput? Or am I misinterpreting this? This problem, which at first sight should be perfectly memory-bound, becomes CPU-bound, or actually, exactly balanced? ;-)
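For reference, one way to reproduce the pinned runs from within Python on Linux is `os.sched_setaffinity` (the runs above may well have used `taskset` instead; which ids are P-cores or E-cores is machine-specific):

```python
import os

# Pin this process to one logical CPU before running the benchmark
# (Linux-only).
os.sched_setaffinity(0, {0})    # pid 0 = the calling process; {0} = allowed CPUs
print(os.sched_getaffinity(0))  # -> {0}
```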
## Excerpts of parallel Python

- The dangers and joys of automatic parallelization (like in numpy linear algebra routines) and the use of clusters/schedulers (but also on your laptop); a sketch follows
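A minimal way to see, and tame, that automatic parallelization, assuming the `threadpoolctl` package is available (setting `OMP_NUM_THREADS=1` in the environment before starting Python has a similar effect):

```python
import time

import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

a = np.random.rand(4000, 4000)
b = np.random.rand(4000, 4000)

# By default, numpy's BLAS backend may spread this matmul over all cores...
t0 = time.perf_counter()
a @ b
print(f"all threads: {time.perf_counter() - t0:.2f} s")

# ...which is exactly what to avoid when a scheduler granted you a single
# core: cap the BLAS thread pool explicitly.
with threadpool_limits(limits=1):
    t0 = time.perf_counter()
    a @ b
    print(f"one thread:  {time.perf_counter() - t0:.2f} s")
```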
## Concluding remarks
- how is all of this relevant for the users of a computing cluster?