Hello,
I am running a simple single-threaded memory benchmark that measures the
time it takes to copy an array (https://github.com/BTone/cagbench). I run
the benchmark in SE mode with a single thread on a single CPU, configured
to match the setup used in gem5-Skylake (
https://github.com/darchr/gem5-skylake-config), with 32 kB L1I and L1D
caches, a 256 kB L2, and an 8 MB LLC.
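To make clear what is being measured, here is a rough Python sketch of a
copy-throughput measurement. It is not the cagbench code; the function name
and the best-of-N timing are assumptions for illustration. It also shows why
the working set is twice the array size: there is one source and one
destination buffer.

```python
import time

def copy_throughput_mb_s(array_bytes, repeats=5):
    """Copy a buffer of array_bytes and return the best throughput in MB/s.

    Hypothetical helper, not part of cagbench. MB is taken as 10^6 bytes.
    """
    src = bytearray(array_bytes)   # source array
    dst = bytearray(array_bytes)   # destination array (working set = 2x)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src               # the copy being timed
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)         # guard against a zero timer reading
    return array_bytes / best / 1e6

if __name__ == "__main__":
    for mib in (8, 16):
        size = mib * 1024 * 1024
        print(f"{mib} MB array: {copy_throughput_mb_s(size):.0f} MB/s")
```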
On a real Intel Skylake (i7-6700K) with DDR4-2400:
With an array size of 8 MB (total working set of 16 MB), the throughput is
~11,000 MB/s, and with an array size of 16 MB (total working set of 32 MB)
the throughput is ~9,500 MB/s.
In gem5 (darchr/gem5-skylake-config):
With an array size of 8 MB (total working set of 16 MB), the throughput is
~6,000 MB/s. However, with an array size of 16 MB (total working set of
32 MB), the throughput drops to ~700 MB/s.
The performance when the workload mostly fits in the cache hierarchy is
reasonable, but ~700 MB/s is more than 13x slower than the real system at
the same array size, which seems far out of proportion.
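To put numbers on the gap, here is a quick back-of-the-envelope calculation
using the figures above. It is only a sketch: the 64-byte cache line size is
an assumption, the helper names are made up for illustration, and MB is
treated as 10^6 bytes.

```python
# Throughput figures quoted above, in MB/s, keyed by array size.
REAL_MB_S = {"8MB": 11000, "16MB": 9500}
GEM5_MB_S = {"8MB": 6000, "16MB": 700}

def slowdown(real_mb_s, sim_mb_s):
    """How many times slower the simulated run is than real hardware."""
    return real_mb_s / sim_mb_s

def ns_per_line(mb_per_s, line_bytes=64):
    """Average time to copy one cache line, in ns (1 MB = 10^6 bytes)."""
    return line_bytes * 1e3 / mb_per_s

if __name__ == "__main__":
    for size in REAL_MB_S:
        ratio = slowdown(REAL_MB_S[size], GEM5_MB_S[size])
        print(f"{size}: gem5 is {ratio:.1f}x slower, "
              f"{ns_per_line(GEM5_MB_S[size]):.1f} ns per line in gem5")
```

This gives about 1.8x for the 8 MB array and about 13.6x for the 16 MB
array, and roughly 91 ns per 64-byte line at 700 MB/s.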
I think this has something to do with the memory system past the last-level
cache, but I am having trouble determining what exactly the issue is.
Just for reference, this is how I have the cache hierarchy configured (I
reduced the tag/data/response latencies to rule out the caches as the
source of the problem):
Both L1I and L1D caches:
size = '32kB'
assoc = 8
tag_latency = 1
data_latency = 1
response_latency = 1
mshrs = 128
tgts_per_mshr = 16
write_buffers = 56
demand_mshr_reserve = 96
L2 Cache:
size = '256kB'
assoc = 4
tag_latency = 1
data_latency = 1
response_latency = 1
mshrs = 256
tgts_per_mshr = 16
write_buffers = 256
L3 cache:
size = '8MB'
assoc = 16
tag_latency = 1
data_latency = 1
response_latency = 1
mshrs = 256
tgts_per_mshr = 20
write_buffers = 256
clusivity = 'mostly_excl'
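For concreteness, the settings above correspond roughly to the following
gem5 Python classes. This is a minimal sketch assuming the stock classic
`Cache` SimObject from `m5.objects`; the class names are placeholders and
may not match the ones used in gem5-skylake-config.

```python
# Fallback so this sketch can be read and imported outside gem5;
# inside gem5, Cache is the stock classic-cache SimObject.
try:
    from m5.objects import Cache
except ImportError:
    Cache = object

class L1Cache(Cache):
    """Used for both L1I and L1D (hypothetical class name)."""
    size = '32kB'
    assoc = 8
    tag_latency = 1
    data_latency = 1
    response_latency = 1
    mshrs = 128
    tgts_per_mshr = 16
    write_buffers = 56
    demand_mshr_reserve = 96

class L2Cache(Cache):
    """Unified L2 (hypothetical class name)."""
    size = '256kB'
    assoc = 4
    tag_latency = 1
    data_latency = 1
    response_latency = 1
    mshrs = 256
    tgts_per_mshr = 16
    write_buffers = 256

class L3Cache(Cache):
    """Last-level cache (hypothetical class name)."""
    size = '8MB'
    assoc = 16
    tag_latency = 1
    data_latency = 1
    response_latency = 1
    mshrs = 256
    tgts_per_mshr = 20
    write_buffers = 256
    clusivity = 'mostly_excl'
```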
Any suggestions would be greatly appreciated.