Hi Javed,
It seems there is a bug in handling CleanUnique requests. From the code
(src/mem/ruby/protocol/chi/CHI-cache-transitions.sm):
    transition({I, SC, UC, SD, UD, RU, RSC, RSD, RUSD, RUSC,
                SC_RSC, SD_RSD, SD_RSC, UC_RSC, UC_RU, UD_RU, UD_RSD, UD_RSC},
               CleanUnique, BUSY_BLKD) {
      Initiate_Request;
      Initiate_CleanUnique;
      Pop_ReqRdyQueue;
      ProcessNextState;
    }
Neither Profile_Miss nor Profile_Hit is called in this transition, so the
hit/miss stats are not incremented for a CleanUnique arriving at the L3.
Could you create a JIRA ticket to track this bug?
Also note that some requests that miss in the L2 never go to the L3. For
example, if the line is UC/UD in the L1 of one of the other cores, the access
always counts as a miss in the L2 because the copy has to be fetched from that
other core's L1, but no request is generated to the L3.
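
To make the counting concrete, here is a toy Python sketch of the behaviour
described above. It is not gem5 code: the Counters class, the l2_demand_access
helper and its boolean arguments are hypothetical names invented for this
illustration, and the model ignores everything except the three cases
mentioned (L2 hit, miss served by a peer L1, miss forwarded to the L3).

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hypothetical counters, loosely mirroring the stats being compared."""
    l2_accesses: int = 0
    l2_misses: int = 0
    l3_requests: int = 0


def l2_demand_access(counters, l2_has_copy, peer_l1_is_unique):
    """Toy model of one demand access arriving at a shared L2.

    peer_l1_is_unique: another core's L1 holds the line in UC/UD.
    l2_has_copy:       the L2 itself holds a usable copy of the line.
    """
    counters.l2_accesses += 1
    if peer_l1_is_unique:
        # Counted as an L2 miss, but the data comes from snooping the
        # other core's L1, so nothing is sent to the L3.
        counters.l2_misses += 1
        return "snoop_peer_l1"
    if l2_has_copy:
        return "l2_hit"
    # Ordinary miss: the request is forwarded to the L3.
    counters.l2_misses += 1
    counters.l3_requests += 1
    return "l3_request"


counters = Counters()
l2_demand_access(counters, l2_has_copy=True, peer_l1_is_unique=False)
l2_demand_access(counters, l2_has_copy=False, peer_l1_is_unique=True)
l2_demand_access(counters, l2_has_copy=False, peer_l1_is_unique=False)
print(counters)
```

In this sketch two of the three accesses count as L2 misses, yet only one
request reaches the L3, which is the kind of gap seen in the stats.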
Thanks,
Tiago
________________________________
From: Javed Osmany <[email protected]>
Sent: Friday, July 29, 2022 5:22 AM
To: gem5 users mailing list <[email protected]>
Cc: Javed Osmany <[email protected]>
Subject: [gem5-users] CHI protocol - Adding an intermediate L3$ between L2$ and
LLC (in HNF)
Hello
I am modelling the following system:
a) Three clusters – big (1 x CPU), Middle (3 x CPU), Little (4 x CPU)
b) All CPUs have private L1I and L1D caches.
c) Each cluster has a shared and unified L2$.
d) Model a shared and unified L3$, shared between [middle, little]
clusters. The L3$ is modelled as a CHI_Node.
e) 4 x HNF/LLC/Directory
f) 1 x SNF
I am using gem5-21.2.1.0.
An example of the command used to run the lu_ncb benchmark:
./build/ARM/gem5.opt
--outdir=m5out_parsec_lu_ncb_134_8rnf_1snf_4hnf_3_clust_all_shr_l2_sincl_sincl_mincl_debug_ruby_cache
--debug-flag=RubyCache configs/example/se_kirin_custom.py --ruby
--topology=Crossbar --cpu-type=m1 --num-cpus=8 --num-dirs=1 --num-llc-caches=4
--num-cpu-bigclust=1 --num-cpu-middleclust=3 --num-cpu-littleclust=4
--num-clusters=3 --cpu-type-bigclust=m1 --cpu-type-middleclust=m1
--cpu-type-littleclust=a76 --bigclust-l2cache=shared
--middleclust-l2cache=shared --littleclust-l2cache=shared --l1i-size-big=64kB
--l1d-size-big=64kB --l1i-assoc-big=4 --l1d-assoc-big=4 --l1i-size-middle=64kB
--l1d-size-middle=64kB --l1i-assoc-middle=4 --l1d-assoc-middle=4
--l1i-size-little=64kB --l1d-size-little=64kB --l1i-assoc-little=4
--l1d-assoc-little=4 --l2-size-big=2048kB --l2-assoc-big=8
--l2-size-middle=8192kB --l2-assoc-middle=16 --l2-size-little=8192kB
--l2-assoc-little=16 --l3-size=2048kB --l3-assoc=16 --num-bigclust-subclust=1
--num-middleclust-subclust=1 --num-littleclust-subclust=1
--num-cpu-bigclust-subclust2=1 --num-cpu-middleclust-subclust2=1
--num-cpu-littleclust-subclust2=1 --bp-type-littleclust=LTAGE
--bp-type-middleclust=LTAGE --bp-type-bigclust=LTAGE --l2-big-clusivity=sincl
--l2-middle-clusivity=sincl --l2-little-clusivity=sincl --l3-clusivity=sincl
--l2-big-data-latency=12 --l2-middle-data-latency=12
--l2-little-data-latency=12 --l2-big-tag-latency=5 --l2-middle-tag-latency=5
--l2-little-tag-latency=5 --sc-size=1024kB --sc-assoc=16 --l3-data-latency=45
--l3-tag-latency=10 --sc-data-latency=60 --sc-tag-latency=20
--sc-clusivity=mincl --little-mid-clust-add-l3=true --big-cpu-clock=3GHz
--middle-cpu-clock=2.6GHz --little-cpu-clock=2GHz --sys-clock=1.1GHz
--ruby-clock=2GHz --cacheline_size=64 --verbose=true
--cmd=tests/parsec/splash2/lu_ncb/splash2x.lu_ncb.hooks -o ' -p4 -n512 -b16'
I am running the Parsec/Splash2 benchmark suite.
Extracting the stats from the stats.txt file, I have the following:
                                         Blackscholes      Canneal    Swaptions     Cholesky          FFT          Fmm
(1) Demand L2$ misses, little cluster            7019      9605353         7656      2724902      2930037      1365976
(2) Demand L2$ accesses, little cluster         13506     33101031      1207307      6206252      3511657      3199668
(3) Demand L3$ accesses, total                   7165     10359847         9992      2686126      2929728      1321580

                                                Lu_cb       Lu_ncb     Raytrace      Volrend     Water_sq     Water_sp
(1) Demand L2$ misses, little cluster           58955      1026556       594351        93401        24063        11435
(2) Demand L2$ accesses, little cluster        794479      4665754      2471593      1039411       393792       166955
(3) Demand L3$ accesses, total                  54026        51745       131095        22744        12840         8843
If I compare row 1 and row 3, the number of demand L3$ accesses is lower for the
Splash2 benchmarks (in some cases considerably lower) than the number of demand
L2$ misses for the little cluster (the little cluster is the main compute
cluster).
Question: Why don't all the L2$ misses make their way to the L3$?
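
The comparison can be made explicit with a short Python sketch that divides
the total demand L3$ accesses by the little-cluster demand L2$ misses for each
benchmark. The numbers are copied from the stats listed above; the variable
names are only illustrative. Since the L3$ count also includes traffic from
the middle cluster, a ratio below 1 is a conservative indication that some
little-cluster L2$ misses never reach the L3$.

```python
benchmarks = [
    "Blackscholes", "Canneal", "Swaptions", "Cholesky", "FFT", "Fmm",
    "Lu_cb", "Lu_ncb", "Raytrace", "Volrend", "Water_sq", "Water_sp",
]

# Demand L2$ misses, little cluster
l2_misses_little = [
    7019, 9605353, 7656, 2724902, 2930037, 1365976,
    58955, 1026556, 594351, 93401, 24063, 11435,
]

# Demand L3$ accesses, total (middle + little clusters)
l3_accesses_total = [
    7165, 10359847, 9992, 2686126, 2929728, 1321580,
    54026, 51745, 131095, 22744, 12840, 8843,
]

# Ratio of L3$ accesses to little-cluster L2$ misses, per benchmark.
ratios = {
    name: l3 / l2
    for name, l2, l3 in zip(benchmarks, l2_misses_little, l3_accesses_total)
}

# Benchmarks where fewer requests reach the L3$ than miss in the L2$.
fewer_l3 = [name for name, ratio in ratios.items() if ratio < 1.0]

for name, ratio in ratios.items():
    print(f"{name:12s} L3 accesses / L2 misses = {ratio:.3f}")
print("L3 accesses below L2 misses:", ", ".join(fewer_l3))
```

The ratio is below 1 for every Splash2 benchmark in the list, and the gap is
largest for Lu_ncb, where only about 5% as many requests reach the L3$ as miss
in the little-cluster L2$.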
In the attachment, I have included my versions of CHI.py, CHI_config.py,
config.ini, stats.txt.
Any insight greatly appreciated.
Best regards
JO