What every scientist should know about computer architecture

Introduction

  • Puzzle (read-only rendered notebook)
  • Question: why does swapping the dimensions in a for-loop cause such a huge slowdown?
  • Let students play around with the notebook and try to find the "bug"
  • A more thorough benchmark
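To make the question concrete, here is a minimal sketch of the kind of experiment the puzzle is about. It is not the code from the notebook: the function names, the array size and the use of slice sums are assumptions chosen for illustration. Both functions compute the same sum, but one walks the array row by row and the other column by column.

```python
import timeit

import numpy as np


def sum_by_rows(a):
    """Sum the array one row at a time (a[i, :] is contiguous in C order)."""
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i, :].sum()
    return total


def sum_by_columns(a):
    """Sum the array one column at a time (a[:, j] is strided in C order)."""
    total = 0.0
    for j in range(a.shape[1]):
        total += a[:, j].sum()
    return total


# Hypothetical problem size; increase it to make the effect more visible.
n = 2000
a = np.random.default_rng(0).random((n, n))

# Take the best of three runs to reduce noise from other processes.
t_rows = min(timeit.repeat(lambda: sum_by_rows(a), number=1, repeat=3))
t_cols = min(timeit.repeat(lambda: sum_by_columns(a), number=1, repeat=3))
print(f"row-wise: {t_rows:.4f} s, column-wise: {t_cols:.4f} s")
```

The absolute timings and the ratio between them depend on the machine and on the array size, so no expected output is given here.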

A digression in CPU architecture and the memory hierarchy

Analog programming

Two exercises to activate the body and the mind

The common goal of both exercises is to sort a deck of tarot cards by value

First experiment: human sorting

Setup:

  • 1 volunteer to keep the time spent sorting
  • each person picks up a tarot card from the randomly shuffled deck on the table
  • moving around and speaking is allowed until the tarot cards are displayed sorted on the table

Second experiment: machine sorting

Setup:

  • 2 volunteers to keep the time:
    • one volunteer keeps the time spent programming
    • one volunteer keeps the time spent executing the program
  • 2 volunteers to be the programmers:
    • can use the whiteboard
    • can and should think aloud, talk to each other, and ask for help
  • 2 volunteers to be two CPUs:
    • only understand the instructions:
      • fetch a value from a memory address into register N ➔ returns 0 if succeeded else 1
      • push the value from register N to a memory address ➔ returns 0 if succeeded else 1
      • compare var0 and var1 ➔ returns 0 if var0 ≥ var1 else 1
  • 4 volunteers to be CPU registers:
    • each register has a tag: R1, R2, R3, R4
    • a value fetched from memory is kept in short-term memory by the registers
    • the result value of an operation is stored in one register
  • everyone else stays seated and represents RAM:
    • they own a value, i.e. they hold a tarot card
    • they have an address based on their seating order: 0th seat, 1st seat, 2nd seat, 3rd seat, 4th seat, etc…
    • when fetched, they walk to the corresponding register and hand over their value (card)
    • when pushed, they walk to the corresponding register and collect their new value (card)
  • as an initialization step, each RAM address comes and picks up a random tarot card
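The machine-sorting setup can also be simulated in a few lines of Python. The sketch below is one possible reading of the rules, not part of the exercise material: the `Machine` class, the choice of bubble sort as the "program", and the convention that `compare` stores its result in R3 are all assumptions. The loops play the role of the programmers, who decide which instruction comes next.

```python
class Machine:
    """Toy model of the exercise: RAM seats, four registers, three instructions."""

    def __init__(self, cards):
        self.ram = list(cards)  # one card per seat, address = seat index
        self.reg = {"R1": None, "R2": None, "R3": None, "R4": None}
        self.counts = {"fetch": 0, "push": 0, "compare": 0}

    def fetch(self, address, register):
        """Copy the value at a memory address into a register; 0 on success, else 1."""
        self.counts["fetch"] += 1
        if 0 <= address < len(self.ram) and register in self.reg:
            self.reg[register] = self.ram[address]
            return 0
        return 1

    def push(self, register, address):
        """Copy the value of a register to a memory address; 0 on success, else 1."""
        self.counts["push"] += 1
        if 0 <= address < len(self.ram) and self.reg.get(register) is not None:
            self.ram[address] = self.reg[register]
            return 0
        return 1

    def compare(self, reg0, reg1):
        """Return 0 if reg0 >= reg1 else 1 (assumption: result is also kept in R3)."""
        self.counts["compare"] += 1
        result = 0 if self.reg[reg0] >= self.reg[reg1] else 1
        self.reg["R3"] = result
        return result


def bubble_sort(machine):
    """Sort RAM in ascending order using only fetch, push and compare."""
    n = len(machine.ram)
    for end in range(n - 1, 0, -1):
        for address in range(end):
            machine.fetch(address, "R1")
            machine.fetch(address + 1, "R2")
            if machine.compare("R1", "R2") == 0:  # left >= right: swap the two seats
                machine.push("R2", address)
                machine.push("R1", address + 1)


deck = [13, 2, 21, 7, 1, 10]  # hypothetical card values
machine = Machine(deck)
bubble_sort(machine)
print(machine.ram)
print(machine.counts)
```

The instruction counters give a rough feeling for how much walking between RAM and the registers the "program" costs compared to the number of comparisons.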

Back to the Python benchmark (second try)

  • can we explain what is happening?
  • it must have to do with the good (or bad) use of cache properties
  • but how are numpy arrays laid out in memory?

Anatomy of a numpy array
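A short sketch of how to inspect the memory layout of an array from Python, using only standard numpy attributes (`strides`, `flags`, `itemsize`). The example array and its shape are arbitrary choices for illustration.

```python
import numpy as np

# A 3x4 array of 8-byte integers in the default C (row-major) order.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# Strides: how many bytes to skip to move by one element along each axis.
print(a.itemsize)               # 8 bytes per element
print(a.strides)                # (32, 8): next row is 4 * 8 bytes away
print(a.flags["C_CONTIGUOUS"])  # True

# The same values stored in Fortran (column-major) order.
f = np.asfortranarray(a)
print(f.strides)                # (8, 24): next row is 8 bytes away, next column 3 * 8
print(f.flags["F_CONTIGUOUS"])  # True

# Transposing does not move any data, it only swaps the strides.
t = a.T
print(t.strides)                # (8, 32)
print(np.shares_memory(a, t))   # True
```

The element with the smallest stride is the one that sits next to its neighbour in memory, so looping over that axis in the innermost loop touches memory sequentially.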

Back to the Python benchmark (third try)

  • can we explain what is happening now? Yes, more or less ;-)
  • quick fix for the puzzle: try adding order='F' to the "bad" snippet and see that it "fixes" the bug ➔ why?
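The effect of order='F' can be sketched as follows. This is not the snippet from the notebook: `fill_columns` and the array size are assumptions, standing in for any loop that accesses the array column by column.

```python
import timeit

import numpy as np


def fill_columns(a):
    """Write to the array one column at a time (the "bad" access pattern in C order)."""
    for j in range(a.shape[1]):
        a[:, j] = j


n = 2000  # hypothetical size
c_arr = np.zeros((n, n), order="C")  # rows are contiguous
f_arr = np.zeros((n, n), order="F")  # columns are contiguous

t_c = min(timeit.repeat(lambda: fill_columns(c_arr), number=1, repeat=3))
t_f = min(timeit.repeat(lambda: fill_columns(f_arr), number=1, repeat=3))
print(f"order='C': {t_c:.4f} s, order='F': {t_f:.4f} s")

# The access pattern is identical; only the distance in bytes between
# consecutive elements of a column differs.
print(c_arr[:, 0].strides, f_arr[:, 0].strides)
```

With order='F' the loop itself is unchanged, but each column now occupies one contiguous stretch of memory, so the same code walks memory sequentially.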

Notes on the Python benchmark:

  • while the benchmark ran pinned to the P-core (cpu0), that core was under a constant load of 100% (almost entirely user time) and ran at a fixed frequency of 3.8 GHz, whereas the theoretical maximum would be 5.2 GHz

    ➔ the CPU does not "starve" because it scales its speed down to match the memory throughput? Or am I misinterpreting this? A problem that at first sight should be purely memory-bound becomes CPU-bound, or actually exactly balanced? From the Intel documentation:

    Energy Efficient Turbo

    When Energy Efficient Turbo is enabled, the CPUs optimal turbo frequency will be tuned dynamically based on CPU utilization. The actual turbo frequency the CPU is set to is proportionally adjusted based on the duration of the turbo request. Memory usage of the OS is also monitored. If the OS is using memory heavily and the CPU core performance is limited by the available memory resources, the turbo frequency will be reduced until more memory load dissipates, and more memory resources become available. The power/performance bias setting also influences energy efficient turbo. Energy Efficient Turbo is best used when attempting to maximize power consumption over performance.
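For anyone who wants to repeat this observation, the sketch below pins the current process to one logical CPU and reads the frequency reported by the kernel. It is Linux-only and rests on assumptions: the helper names are made up, the cpufreq sysfs file may not exist on every system, and which CPU number corresponds to a P-core depends on the machine.

```python
import os
from pathlib import Path


def pin_to_cpu(cpu):
    """Pin the current process to one logical CPU; return True on success."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # not available on this platform (e.g. macOS, Windows)
    try:
        if cpu not in os.sched_getaffinity(0):
            return False  # this CPU is not available to the process
        os.sched_setaffinity(0, {cpu})
    except OSError:
        return False
    return True


def current_freq_khz(cpu):
    """Return the current frequency of a CPU in kHz, or None if it cannot be read."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_cur_freq")
    try:
        return int(path.read_text())
    except (OSError, ValueError):
        return None


cpu = 0  # assumption: cpu0 is a P-core, as on the machine described above
pinned = pin_to_cpu(cpu)
freq = current_freq_khz(cpu)
print(f"pinned to cpu{cpu}: {pinned}, current frequency: {freq} kHz")
```

Reading the frequency repeatedly while the benchmark runs, for example from a second terminal, shows whether the core stays at a fixed value as described in the note.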

Concluding remarks

  • how is all of this relevant for the users of a computing cluster?