Low Level Memory Benchmark

These are the results of a low-level memory benchmark (written in C) run on my laptop.

Summary plots (details below)

[Plot: Memory Bandwidth, P-core (bandwidth-t14.svg)]
[Plot: Memory Latency, P-core (latency-t14.svg)]

Benchmark details:

  • Bandwidth (read), bw_mem_rd. Allocate the specified amount of memory, zero it, and then time reading that memory as a series of integer loads and adds: each 4-byte integer is loaded and added to an accumulator (a minimal C sketch of both bandwidth kernels follows this list).

    Results (block size in MB, bandwidth in MB/s)

  • Bandwidth (write), bw_mem. Allocate twice the specified amount of memory, zero it, and then time copying the first half into the second half.

    Results (block size in MB, bandwidth in MB/s)

  • Latency (sequential access), lat_mem_rd. Run two nested loops: the outer loop is over the stride size (here a fixed 128 bytes), the inner loop is over the block size. For each block size, create a ring of pointers in which each location points backward one stride. Traverse the block with p = (char **)*p in a for loop and time the load latency over the block (a C sketch of this pointer chase also follows this list).

    Results (block size in MB, latency in ns)

  • Latency (random access). Like above, but with a stride size of 16 bytes.

    Results (block size in MB, latency in ns)
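
Below is a minimal C sketch of the two bandwidth kernels described above. It is not the lmbench source, and the block size, repetition, and timing are illustrative only; the real benchmark is more careful (loop unrolling, repeated measurements).

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Read kernel: load each 4-byte integer and add it to an accumulator,
     * as in bw_mem_rd. */
    static int read_kernel(const int *buf, size_t n)
    {
        int sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
        return sum;   /* returned so the compiler cannot drop the loop */
    }

    /* Copy kernel: copy the first half of the buffer into the second half,
     * as in the bw_mem write/copy case (the allocation is twice the
     * "specified amount", which here is half of the buffer). */
    static void copy_kernel(int *buf, size_t n)
    {
        memcpy(buf + n / 2, buf, (n / 2) * sizeof(int));
    }

    int main(void)
    {
        size_t bytes = 64 * 1024 * 1024;      /* illustrative block size: 64 MB */
        size_t n = bytes / sizeof(int);
        int *buf = calloc(n, sizeof(int));    /* allocate and zero the block */
        if (!buf) return 1;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        volatile int sink = read_kernel(buf, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("read: %.1f MB/s\n", bytes / s / 1e6);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        copy_kernel(buf, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("copy: %.1f MB/s\n", (bytes / 2) / s / 1e6);

        free(buf);
        return 0;
    }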
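
And a minimal C sketch of the pointer-chasing idea behind lat_mem_rd. The 32 MB block, the 128-byte stride, and the iteration count are illustrative; the real benchmark sweeps block sizes and strides and repeats the measurement.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        size_t block  = 32 * 1024 * 1024;  /* illustrative block size: 32 MB */
        size_t stride = 128;               /* 128-byte stride (sequential test) */
        char *mem = malloc(block);
        if (!mem) return 1;

        /* Build a ring of pointers: each location points one stride backwards,
         * and the first location wraps around to the end of the block. */
        size_t nsteps = block / stride;
        for (size_t i = stride; i < block; i += stride)
            *(char **)(mem + i) = mem + i - stride;
        *(char **)mem = mem + (nsteps - 1) * stride;

        /* Chase the pointers: every load depends on the previous one, so the
         * time per iteration is the load latency for this block size. */
        size_t iters = 100 * nsteps;       /* illustrative repetition count */
        char **p = (char **)mem;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++)
            p = (char **)*p;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* Print p so the chase cannot be optimized away. */
        printf("latency: %.2f ns per load (p=%p)\n", ns / iters, (void *)p);

        free(mem);
        return 0;
    }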

Running the benchmarks on Linux:

  • You need lmbench and cpuset (which provides the cset command used below)
  • All commands must be run as root, after killing as many processes/services as possible so that the CPUs are almost idle
  • Disable address space randomization:
    echo 0 > /proc/sys/kernel/randomize_va_space
    
  • Set scaling governor to performance for CPU0:
    echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    
  • Reserve CPU 0 for our benchmark, i.e. kick out (almost) all other processes:
    cset shield --cpu 0 --kthread=on
    
  • If you are on Intel and CPU0 is part of an SMT pair (hyperthreading), disable its sibling:
    echo 0 > /sys/devices/system/cpu/cpu1/online
    
  • Disable turbo mode on Intel:
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
    
  • Run the configuration script for lmbench. Select only the HARDWARE set of benchmarks and set the maximum amount of memory to something like 1024 MB:
    cd /usr/lib/lmbench/scripts
    # the following command will create the configuration file /usr/lib/lmbench/bin/x86_64-linux-gnu/CONFIG.<hostname>
    cset shield --exec -- ./config-run
    # run the benchmark
    cset shield --exec -- /usr/bin/lmbench-run
    # results are in /var/lib/lmbench/results/x86_64-linux-gnu/<hostname>