A modular platform for computer-system architecture research
chsgcxy
Basic introduction
Why the name gem5
One of the most popular
Modular platform
Compiling gem5
How to use
Statistics
FS mode and SE mode
Checkpoint
Core mechanism
Event-driven
Python && C++
Clock system
Instruction frameworks
Memory system
Ruby subsystem
Why the name gem5
gem5 = GEMS + m5, merged in 2011
m5: a full-system simulator from the University of Michigan, originally built for modeling networked systems (2006)
GEMS: the General Execution-driven Multiprocessor Simulator from the University of Wisconsin (2005)
An open-source computer architecture simulator used in both academia and industry
gem5 is used by many industrial research labs including ARM Research, AMD Research,
Google, Micron, Metempsy, HP, Samsung, and others.
Multiple interchangeable CPU models
Multiple ISA support (selected at build time)
Dynamically configurable (branch predictor, CPU, DRAM, prefetcher, replacement policy, ……)
Configurable memory hierarchy
How gem5 is used for computer architecture research
Can run different workloads:

#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("Hello world!\n");
    return 0;
}
./build/ARM/gem5.debug configs/example/se.py -c tests/test-progs/hello/bin/arm/linux/hello

configs/example/se.py: a Python configuration script built on the gem5 Python library
m5out/config.dot / config.ini: a dump of the simulated system configuration, somewhat similar to a DTB
m5out/stats.txt: the statistics log
Statistics Detail
configs/example/se.py:
./build/ARM/gem5.debug configs/example/se.py
--cpu-type=ArmAtomicSimpleCPU --caches --l2cache
--mem-type=DDR4_2400_8x8 --mem-size=2GB
--l1d_size=64kB --l1i_size=32kB --l2_size=512kB -n 4
-c tests/test-progs/threads/src/threads

configs/example/fs.py:
./build/ARM/gem5.debug configs/example/fs.py
--caches
--kernel=vmlinux.arm64
--disk-image=ubuntu-18.04-arm64-docker.img

Booting the kernel takes about 40 minutes on my desktop.
Running with 4 cores currently hangs and is still to be resolved (x86 is OK).
Checkpoints are essentially snapshots of a simulation.

./build/ARM/gem5.debug configs/example/fs.py
--caches
--kernel=vmlinux.arm64
--disk-image=ubuntu-18.04-arm64-docker.img
--take-checkpoints=100000000,100000

--take-checkpoints takes a "when,period" pair: write the first checkpoint at tick 100000000, then one every 100000 ticks.
build/ARM/dev/arm/rv_ctrl.cc:176: warn: SCReg: Access to unknown device dcc0:site0:pos0:fn7:dev0
Writing checkpoint
build/ARM/sim/simulate.cc:194: info: Entering event queue @ 100000000. Starting simulation...
Writing checkpoint
build/ARM/sim/simulate.cc:194: info: Entering event queue @ 100100000. Starting simulation...
Writing checkpoint
build/ARM/sim/simulate.cc:194: info: Entering event queue @ 100200000. Starting simulation...
gem5 sorts the cpt.xxx directories and restores the one selected by the given index:
./build/ARM/gem5.debug configs/example/fs.py
--caches
--kernel=vmlinux.arm64
--disk-image=ubuntu-18.04-arm64-docker.img
--checkpoint-restore=2
class Event
{
    Event *nextBin;
    Event *nextInBin;

    Tick _when;          //!< timestamp when event should be processed
    Priority _priority;  //!< event priority

    virtual void process() = 0;
};
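For illustration, a minimal sketch of a concrete event (MyCpu is hypothetical; headers and namespaces are simplified and vary between gem5 versions): anything that must happen at a future tick either derives from Event and implements process(), or wraps a callback in an EventFunctionWrapper.

#include "sim/eventq.hh"

// Hypothetical CPU model, used only for this sketch.
class MyCpu { public: void tick(); };

class TickEvent : public Event
{
    MyCpu *cpu;
  public:
    // CPU_Tick_Pri is one of the predefined event priorities.
    TickEvent(MyCpu *c) : Event(CPU_Tick_Pri), cpu(c) {}
    // Called by the event queue once simulated time reaches the tick
    // this event was scheduled for.
    void process() override { cpu->tick(); }
};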
Diagram: the event queue is a two-level linked list. Events are grouped into bins keyed by (when, priority), e.g. (when=500, priority=50), (when=500, priority=64), (when=1000, priority=50), (when=1000, priority=90). Bins are chained through nextBin, and events with the same when and priority are chained within a bin through nextInBin.
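A self-contained sketch of the insertion idea behind this structure (not gem5's actual EventQueue code; the type, field and function names here are invented for illustration):

#include <cstdint>

struct Ev
{
    uint64_t when;
    int priority;
    Ev *nextBin = nullptr;    // link to the next (when, priority) bin
    Ev *nextInBin = nullptr;  // link within the same bin
};

// True if a's (when, priority) key sorts strictly before b's.
static bool before(const Ev *a, const Ev *b)
{
    return a->when < b->when ||
          (a->when == b->when && a->priority < b->priority);
}

// Insert ev keeping bins sorted by key; returns the (possibly new) head.
Ev *insert(Ev *head, Ev *ev)
{
    Ev *prev = nullptr, *curr = head;
    while (curr && before(curr, ev)) {   // skip bins scheduled earlier
        prev = curr;
        curr = curr->nextBin;
    }
    if (curr && !before(ev, curr)) {
        // Same (when, priority): push ev on top of curr's bin.
        ev->nextInBin = curr;
        ev->nextBin = curr->nextBin;
    } else {
        // Key not present yet: ev starts a new bin between prev and curr.
        ev->nextBin = curr;
    }
    if (prev) {
        prev->nextBin = ev;
        return head;
    }
    return ev;                           // ev is the new queue head
}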
0: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 executed @ 0
0: system.cpu.icache.mem_side_port-MemSidePort.wrapped_function_event: EventFunctionWrapped 71 scheduled @ 1000
0: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 scheduled @ 500
500: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 executed @ 500
500: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 scheduled @ 1000
1000: system.cpu.icache.mem_side_port-MemSidePort.wrapped_function_event: EventFunctionWrapped 71 executed @ 1000
1000: system.l2.mem_side_port-MemSidePort.wrapped_function_event: EventFunctionWrapped 76 scheduled @ 11500
1000: system.tol2bus.reqLayer0.wrapped_function_event: EventFunctionWrapped 88 scheduled @ 1500
1000: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 executed @ 1000
1000: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 scheduled @ 1500
1500: system.tol2bus.reqLayer0.wrapped_function_event: EventFunctionWrapped 88 executed @ 1500
1500: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 executed @ 1500
Event *EventQueue::serviceOne()
{
    Event *event = head;
    Event *next = head->nextInBin;

    if (next) {
        // The next event in this bin becomes the bin's head; refresh its
        // (possibly stale) pointer to the following bin.
        next->nextBin = head->nextBin;
        head = next;
    } else {
        // The bin is now empty, move on to the next bin.
        head = head->nextBin;
    }

    // Advance simulated time to the event's timestamp, then run it.
    setCurTick(event->when());
    event->process();
    return NULL;
}
GlobalSimLoopExitEvent *simulate(Tick num_cycles)
{
    // Simplified: the main loop just keeps servicing events until an
    // exit event (m5_exit, tick limit, ...) terminates the simulation.
    while (true) {
        Event *exit_event = mainEventQueue[0]->serviceOne();
        if (exit_event)
            return dynamic_cast<GlobalSimLoopExitEvent *>(exit_event);
    }
}
class SimObject
{
    static std::vector<SimObject *> simObjectList;

    virtual void init();
    virtual void startup();
};
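How the two sides meet, as a minimal sketch in the spirit of the learning_gem5 HelloObject tutorial (the class and parameter names are assumptions, a matching Python SimObject declaration is required, and constructor/header details differ between gem5 versions): a SimObject schedules events on itself via the inherited schedule() helper.

#include "params/HelloObject.hh"   // generated from the Python declaration
#include "sim/sim_object.hh"

class HelloObject : public SimObject
{
    // Wraps a member-function callback so it can sit in the event queue.
    EventFunctionWrapper event;
    const Tick latency = 100;

    void processEvent()
    {
        // Do some work, then re-schedule ourselves 'latency' ticks ahead.
        schedule(event, curTick() + latency);
    }

  public:
    HelloObject(const HelloObjectParams &params) :
        SimObject(params), event([this]{ processEvent(); }, name())
    {}

    // startup() is called once, right before simulation begins: the
    // natural place to schedule the first event.
    void startup() override
    {
        schedule(event, latency);
    }
};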
class StaticInst
{
    std::bitset<Num_Flags> flags;
    OpClass _opClass;
    uint8_t _numSrcRegs = 0;
    uint8_t _numDestRegs = 0;
    RegIdArrayPtr _srcRegIdxPtr = nullptr;
    RegIdArrayPtr _destRegIdxPtr = nullptr;

    virtual std::string generateDisassembly(
            Addr pc, const loader::SymbolTable *symtab) const = 0;
    virtual Fault execute(ExecContext *xc, Trace::InstRecord *traceData) const = 0;
    virtual void advancePC(PCStateBase &pc_state) const = 0;
    virtual std::unique_ptr<PCStateBase> branchTarget(
            const PCStateBase &pc) const;
};

enum OpClass
{
    No_OpClass = 0,
    IntAlu = 1,
    IntMult = 2,
    IntDiv = 3,
    FloatAdd = 4,
    FloatCmp = 5,
    FloatCvt = 6,
    FloatMult = 7,
    FloatMultAcc = 8,
    FloatDiv = 9,
    ……
};

enum Flags
{
    IsNop = 0,
    IsInteger = 1,
    IsFloating = 2,
    IsVector = 3,
    IsVectorElem = 4,
    IsLoad = 5,
    IsStore = 6,
    IsAtomic = 7,
    IsStoreConditional = 8,
    IsInstPrefetch = 9,
    IsDataPrefetch = 10,
    IsControl = 11,
    IsDirectControl = 12,
    IsIndirectControl = 13,
    ……
};

ArmStaticInst, RiscvStaticInst and X86StaticInst inherit from StaticInst; the dynamic instruction types DynInst (O3CPU) and MinorDynInst hold a reference to a StaticInst.
Because every instruction carries its Flags and OpClass, the CPU models can work purely with this metadata and no longer need to care which specific instruction it is.
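For example, an issue/dispatch stage can steer any decoded instruction from this metadata alone; a sketch (the steer() helper is hypothetical, the accessors are StaticInst's):

#include "cpu/static_inst.hh"

// Hypothetical helper: decide where an instruction goes using only its
// StaticInst metadata, never its concrete ISA class.
void steer(const StaticInstPtr &inst)
{
    if (inst->isMemRef()) {
        // IsLoad / IsStore: send to the load/store queue.
    } else if (inst->isControl()) {
        // IsControl: involves the branch predictor / fetch redirect.
    } else if (inst->opClass() == IntMultOp || inst->opClass() == IntDivOp) {
        // Long-latency integer ops: route to the multiply/divide unit.
    } else {
        // Default: the plain integer ALU pipeline.
    }
}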
decoder .isa (arch/xxx/isa), the decode tree that binds opcode bits to instruction definitions:

0x0e: decode FUNCT3 {
    format ROp {
        0x0: decode FUNCT7 {
            0x0: addw({{
                Rd_sd = Rs1_sw + Rs2_sw;
            }});
            0x1: mulw({{
                Rd_sd = (int32_t)(Rs1_sw*Rs2_sw);
            }}, IntMultOp);
            ……
        }
    }
}

formats/xxx.isa, where a format turns each decode entry into declaration, constructor, decode and execute code:

def format ROp(code, *opt_flags) {{
    iop = InstObjParams(name, Name, 'RegOp', code, opt_flags)
    header_output = BasicDeclare.subst(iop)
    decoder_output = BasicConstructor.subst(iop)
    decode_block = BasicDecode.subst(iop)
    exec_output = BasicExecute.subst(iop)
}};

templates/xxx.isa, the templates being substituted:

def template BasicDeclare {{
    class %(class_name)s : public %(base_class)s
    {
      public:
        Fault execute(ExecContext *, Trace::InstRecord *) const;
        using %(base_class)s::generateDisassembly;
    };
}};

isa_parser.py reads these .isa files and emits C++ into build/XXX/arch/XXX/generated. The generated classes build on the hand-written base classes in arch/xxx/insts (branch.cc branch.hh xxx.cc xxx.hh) and ultimately inherit from StaticInst. The .isa framework thus provides a convenient and flexible way to generate instruction classes.
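Putting it together, for the addw entry above BasicDeclare expands into roughly the following generated header (an approximation, not the literal parser output; the matching .cc gets the constructor and execute() body from BasicConstructor and BasicExecute):

// Approximate shape of the generated declaration for addw
// (class_name = Addw, base_class = RegOp as passed by the ROp format).
class Addw : public RegOp
{
  public:
    Addw(MachInst machInst);   // emitted from BasicConstructor
    Fault execute(ExecContext *, Trace::InstRecord *) const;
    using RegOp::generateDisassembly;
};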
class ResponsePort : public Port
{
    void bind(Port &peer) override {}
    bool sendTimingResp(PacketPtr pkt);
};

class RequestPort : public Port
{
    void bind(Port &peer) override;
    bool sendTimingReq(PacketPtr pkt);
};
Every memory object has to have at least one port to be useful.
Diagram: CPU1 and CPU2 each connect a dcache_port (RequestPort) to the cpuSidePort (ResponsePort) of DCache1 and DCache2; each cache's memSidePort (RequestPort) then connects to the Coherent Bus.
A timing access end to end:
LSQUnit::trySendPacket()
    dcachePort->sendTimingReq(data_pkt)
        Cache::recvTimingReq()
        ... cache latency (a miss goes further down, eventually to Simple Memory) ...
        cpuSidePort.sendTimingResp(pkt)
    LSQUnit::recvTimingResp(pkt)
Request ports send requests and receive responses, whereas response ports receive requests and send responses. Because of the coherence protocol, a response port can also send snoop requests and receive snoop responses, with the request port offering the mirrored interface.
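A sketch of how the request side of this timing protocol is typically implemented in a port subclass (the class name, sendPacket() and blockedPacket are assumptions; only sendTimingReq/recvTimingResp/recvReqRetry are the real hooks, and constructor signatures vary between gem5 versions):

#include "mem/port.hh"

class MyMemSidePort : public RequestPort
{
    PacketPtr blockedPacket = nullptr;   // packet waiting for a retry

  public:
    // Constructor signature varies across gem5 versions.
    MyMemSidePort(const std::string &name, SimObject *owner) :
        RequestPort(name, owner)
    {}

    // Try to send; if the peer is busy, keep the packet for later.
    void sendPacket(PacketPtr pkt)
    {
        if (!sendTimingReq(pkt))
            blockedPacket = pkt;         // peer will call recvReqRetry()
    }

    // The peer response port can accept requests again: retry.
    void recvReqRetry() override
    {
        PacketPtr pkt = blockedPacket;
        blockedPacket = nullptr;
        sendPacket(pkt);
    }

    // The response for an earlier request has arrived.
    bool recvTimingResp(PacketPtr pkt) override
    {
        // Hand the packet back to the owning object; return true to accept.
        return true;
    }
};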
Ruby models inclusive/exclusive cache hierarchies with various replacement policies, coherence protocol implementations, interconnection networks, DMA and memory controllers, and various sequencers that initiate memory requests and handle responses.
SLICC stands for Specification Language for Implementing Cache Coherence
MI_example: example protocol, 1-level cache.
MESI_Two_Level: single chip, 2-level caches, strictly-inclusive hierarchy.
MOESI_CMP_directory: multiple chips, 2-level caches, non-inclusive (neither strictly inclusive nor exclusive) hierarchy.
MOESI_CMP_token: 2-level caches. TODO.
MOESI_hammer: single chip, 2-level private caches, strictly-exclusive hierarchy.
Garnet_standalone: protocol to run the Garnet network in a standalone manner.
MESI_Three_Level: 3-level caches, strictly-inclusive hierarchy. Based on MESI_Two_Level with an extra L0 cache.
CHI: flexible protocol that implements Arm's AMBA 5 CHI transactions. Supports a configurable cache hierarchy with both MESI and MOESI coherency.
QA
Thanks