A modular platform for computer-system architecture research
chsgcxy
Basic introduction
Why the name gem5
One of the most popular
Modular platform
Compiling gem5
How to use
Statistics
FS mode and SE mode
Checkpoint
Core mechanism
Event-driven
Python && C++
Clock system
Instruction frameworks
Memory system
Ruby subsystem
Why the name gem5
gem5 = GEMS + m5, merged in 2011
m5: a full-system simulator from the University of Michigan, originally built for modeling networked systems (2006)
GEMS: the General Execution-driven Multiprocessor Simulator from the University of Wisconsin (2005)
An open-source computer architecture simulator used in both academia and industry
gem5 is used by many industrial research labs including ARM Research, AMD Research,
Google, Micron, Metempsy, HP, Samsung, and others.
Multiple interchangeable CPU models
Multiple ISA support (selected at build time)
Dynamically configurable (branch predictor, CPU, DRAM, prefetcher, replacement policy, ……)
Configurable memory hierarchy
How gem5 is used for computer architecture research
Can run different workloads:

#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("Hello world!\n");
    return 0;
}
./build/ARM/gem5.debug configs/example/se.py -c tests/test-progs/hello/bin/arm/linux/hello

configs/example/se.py: a Python configuration script built on the gem5 Python library
m5out/config.dot / config.ini: a dump of the simulated system configuration, somewhat similar to a DTB
m5out/stats.txt: the statistics log
Statistics Detail
configs/example/se.py:
./build/ARM/gem5.debug configs/example/se.py
--cpu-type=ArmAtomicSimpleCPU --caches --l2cache
--mem-type=DDR4_2400_8x8 --mem-size=2GB
--l1d_size=64kB --l1i_size=32kB --l2_size=512kB -n 4
-c tests/test-progs/threads/src/threads

configs/example/fs.py:
./build/ARM/gem5.debug configs/example/fs.py
--caches
--kernel=vmlinux.arm64
--disk-image=ubuntu-18.04-arm64-docker.img

Booting the kernel takes about 40 minutes on my desktop.
Running with 4 cores currently hangs and is still to be resolved (x86 is OK).
Checkpoints are essentially snapshots of a simulation.

./build/ARM/gem5.debug configs/example/fs.py
--caches
--kernel=vmlinux.arm64
--disk-image=ubuntu-18.04-arm64-docker.img
--take-checkpoints=100000000,100000

--take-checkpoints takes a "when,period" pair: write the first checkpoint at tick 100000000, then one every 100000 ticks.
build/ARM/dev/arm/rv_ctrl.cc:176: warn: SCReg: Access to unknown device dcc0:site0:pos0:fn7:dev0
Writing checkpoint
build/ARM/sim/simulate.cc:194: info: Entering event queue @ 100000000. Starting simulation...
Writing checkpoint
build/ARM/sim/simulate.cc:194: info: Entering event queue @ 100100000. Starting simulation...
Writing checkpoint
build/ARM/sim/simulate.cc:194: info: Entering event queue @ 100200000. Starting simulation...
gem5 sorts the cpt.xxx directories and restores the one selected by the given index:
./build/ARM/gem5.debug configs/example/fs.py
--caches
--kernel=vmlinux.arm64
--disk-image=ubuntu-18.04-arm64-docker.img
--checkpoint-restore=2
class Event
{
    Event *nextBin;
    Event *nextInBin;

    Tick _when;          //!< timestamp when event should be processed
    Priority _priority;  //!< event priority

    virtual void process() = 0;
};
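For illustration, a minimal sketch of a concrete event (MyCpu is hypothetical; headers and namespaces are simplified and vary between gem5 versions): anything that must happen at a future tick either derives from Event and implements process(), or wraps a callback in an EventFunctionWrapper.

#include "sim/eventq.hh"

// Hypothetical CPU model, used only for this sketch.
class MyCpu { public: void tick(); };

class TickEvent : public Event
{
    MyCpu *cpu;
  public:
    // CPU_Tick_Pri is one of the predefined event priorities.
    TickEvent(MyCpu *c) : Event(CPU_Tick_Pri), cpu(c) {}
    // Called by the event queue once simulated time reaches the tick
    // this event was scheduled for.
    void process() override { cpu->tick(); }
};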
Diagram: the event queue is a two-level linked list. Events are grouped into bins keyed by (when, priority), e.g. (when=500, priority=50), (when=500, priority=64), (when=1000, priority=50), (when=1000, priority=90). Bins are chained through nextBin, and events with the same when and priority are chained within a bin through nextInBin.
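A self-contained sketch of the insertion idea behind this structure (not gem5's actual EventQueue code; the type, field and function names here are invented for illustration):

#include <cstdint>

struct Ev
{
    uint64_t when;
    int priority;
    Ev *nextBin = nullptr;    // link to the next (when, priority) bin
    Ev *nextInBin = nullptr;  // link within the same bin
};

// True if a's (when, priority) key sorts strictly before b's.
static bool before(const Ev *a, const Ev *b)
{
    return a->when < b->when ||
          (a->when == b->when && a->priority < b->priority);
}

// Insert ev keeping bins sorted by key; returns the (possibly new) head.
Ev *insert(Ev *head, Ev *ev)
{
    Ev *prev = nullptr, *curr = head;
    while (curr && before(curr, ev)) {   // skip bins scheduled earlier
        prev = curr;
        curr = curr->nextBin;
    }
    if (curr && !before(ev, curr)) {
        // Same (when, priority): push ev on top of curr's bin.
        ev->nextInBin = curr;
        ev->nextBin = curr->nextBin;
    } else {
        // Key not present yet: ev starts a new bin between prev and curr.
        ev->nextBin = curr;
    }
    if (prev) {
        prev->nextBin = ev;
        return head;
    }
    return ev;                           // ev is the new queue head
}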
0: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 executed @ 0
0: system.cpu.icache.mem_side_port-MemSidePort.wrapped_function_event: EventFunctionWrapped 71 scheduled @ 1000
0: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 scheduled @ 500
500: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 executed @ 500
500: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 scheduled @ 1000
1000: system.cpu.icache.mem_side_port-MemSidePort.wrapped_function_event: EventFunctionWrapped 71 executed @ 1000
1000: system.l2.mem_side_port-MemSidePort.wrapped_function_event: EventFunctionWrapped 76 scheduled @ 11500
1000: system.tol2bus.reqLayer0.wrapped_function_event: EventFunctionWrapped 88 scheduled @ 1500
1000: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 executed @ 1000
1000: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 scheduled @ 1500
1500: system.tol2bus.reqLayer0.wrapped_function_event: EventFunctionWrapped 88 executed @ 1500
1500: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 executed @ 1500
Event *EventQueue::serviceOne()
{
    Event *event = head;
    Event *next = head->nextInBin;

    if (next) {
        // The next event in this bin becomes the bin's head; refresh its
        // (possibly stale) pointer to the following bin.
        next->nextBin = head->nextBin;
        head = next;
    } else {
        // The bin is now empty, move on to the next bin.
        head = head->nextBin;
    }

    // Advance simulated time to the event's timestamp, then run it.
    setCurTick(event->when());
    event->process();
    return NULL;
}
GlobalSimLoopExitEvent *simulate(Tick num_cycles)
{
    // Simplified: the main loop just keeps servicing events until an
    // exit event (m5_exit, tick limit, ...) terminates the simulation.
    while (true) {
        Event *exit_event = mainEventQueue[0]->serviceOne();
        if (exit_event)
            return dynamic_cast<GlobalSimLoopExitEvent *>(exit_event);
    }
}
class SimObject
{
    static std::vector<SimObject *> simObjectList;

    virtual void init();
    virtual void startup();
};
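How the two sides meet, as a minimal sketch in the spirit of the learning_gem5 HelloObject tutorial (the class and parameter names are assumptions, a matching Python SimObject declaration is required, and constructor/header details differ between gem5 versions): a SimObject schedules events on itself via the inherited schedule() helper.

#include "params/HelloObject.hh"   // generated from the Python declaration
#include "sim/sim_object.hh"

class HelloObject : public SimObject
{
    // Wraps a member-function callback so it can sit in the event queue.
    EventFunctionWrapper event;
    const Tick latency = 100;

    void processEvent()
    {
        // Do some work, then re-schedule ourselves 'latency' ticks ahead.
        schedule(event, curTick() + latency);
    }

  public:
    HelloObject(const HelloObjectParams &params) :
        SimObject(params), event([this]{ processEvent(); }, name())
    {}

    // startup() is called once, right before simulation begins: the
    // natural place to schedule the first event.
    void startup() override
    {
        schedule(event, latency);
    }
};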
class StaticInst
{
    std::bitset<Num_Flags> flags;
    OpClass _opClass;
    uint8_t _numSrcRegs = 0;
    uint8_t _numDestRegs = 0;
    RegIdArrayPtr _srcRegIdxPtr = nullptr;
    RegIdArrayPtr _destRegIdxPtr = nullptr;

    virtual std::string generateDisassembly(
            Addr pc, const loader::SymbolTable *symtab) const = 0;
    virtual Fault execute(ExecContext *xc, Trace::InstRecord *traceData) const = 0;
    virtual void advancePC(PCStateBase &pc_state) const = 0;
    virtual std::unique_ptr<PCStateBase> branchTarget(
            const PCStateBase &pc) const;
};

enum OpClass
{
    No_OpClass = 0,
    IntAlu = 1,
    IntMult = 2,
    IntDiv = 3,
    FloatAdd = 4,
    FloatCmp = 5,
    FloatCvt = 6,
    FloatMult = 7,
    FloatMultAcc = 8,
    FloatDiv = 9,
    ……
};

enum Flags
{
    IsNop = 0,
    IsInteger = 1,
    IsFloating = 2,
    IsVector = 3,
    IsVectorElem = 4,
    IsLoad = 5,
    IsStore = 6,
    IsAtomic = 7,
    IsStoreConditional = 8,
    IsInstPrefetch = 9,
    IsDataPrefetch = 10,
    IsControl = 11,
    IsDirectControl = 12,
    IsIndirectControl = 13,
    ……
};

ArmStaticInst, RiscvStaticInst and X86StaticInst inherit from StaticInst; the dynamic instruction types DynInst (O3CPU) and MinorDynInst hold a reference to a StaticInst.
Because every instruction carries its Flags and OpClass, the CPU models can work purely with this metadata and no longer need to care which specific instruction it is.
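For example, an issue/dispatch stage can steer any decoded instruction from this metadata alone; a sketch (the steer() helper is hypothetical, the accessors are StaticInst's):

#include "cpu/static_inst.hh"

// Hypothetical helper: decide where an instruction goes using only its
// StaticInst metadata, never its concrete ISA class.
void steer(const StaticInstPtr &inst)
{
    if (inst->isMemRef()) {
        // IsLoad / IsStore: send to the load/store queue.
    } else if (inst->isControl()) {
        // IsControl: involves the branch predictor / fetch redirect.
    } else if (inst->opClass() == IntMultOp || inst->opClass() == IntDivOp) {
        // Long-latency integer ops: route to the multiply/divide unit.
    } else {
        // Default: the plain integer ALU pipeline.
    }
}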
decoder .isa (arch/xxx/isa), the decode tree that binds opcode bits to instruction definitions:

0x0e: decode FUNCT3 {
    format ROp {
        0x0: decode FUNCT7 {
            0x0: addw({{
                Rd_sd = Rs1_sw + Rs2_sw;
            }});
            0x1: mulw({{
                Rd_sd = (int32_t)(Rs1_sw*Rs2_sw);
            }}, IntMultOp);
            ……
        }
    }
}

formats/xxx.isa, where a format turns each decode entry into declaration, constructor, decode and execute code:

def format ROp(code, *opt_flags) {{
    iop = InstObjParams(name, Name, 'RegOp', code, opt_flags)
    header_output = BasicDeclare.subst(iop)
    decoder_output = BasicConstructor.subst(iop)
    decode_block = BasicDecode.subst(iop)
    exec_output = BasicExecute.subst(iop)
}};

templates/xxx.isa, the templates being substituted:

def template BasicDeclare {{
    class %(class_name)s : public %(base_class)s
    {
      public:
        Fault execute(ExecContext *, Trace::InstRecord *) const;
        using %(base_class)s::generateDisassembly;
    };
}};

isa_parser.py reads these .isa files and emits C++ into build/XXX/arch/XXX/generated. The generated classes build on the hand-written base classes in arch/xxx/insts (branch.cc branch.hh xxx.cc xxx.hh) and ultimately inherit from StaticInst. The .isa framework thus provides a convenient and flexible way to generate instruction classes.
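Putting it together, for the addw entry above BasicDeclare expands into roughly the following generated header (an approximation, not the literal parser output; the matching .cc gets the constructor and execute() body from BasicConstructor and BasicExecute):

// Approximate shape of the generated declaration for addw
// (class_name = Addw, base_class = RegOp as passed by the ROp format).
class Addw : public RegOp
{
  public:
    Addw(MachInst machInst);   // emitted from BasicConstructor
    Fault execute(ExecContext *, Trace::InstRecord *) const;
    using RegOp::generateDisassembly;
};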
class ResponsePort : public Port
{
    void bind(Port &peer) override {}
    bool sendTimingResp(PacketPtr pkt);
};

class RequestPort : public Port
{
    void bind(Port &peer) override;
    bool sendTimingReq(PacketPtr pkt);
};
Every memory object has to have at least one port to be useful.
Diagram: CPU1 and CPU2 each connect a dcache_port (RequestPort) to the cpuSidePort (ResponsePort) of DCache1 and DCache2; each cache's memSidePort (RequestPort) then connects to the Coherent Bus.
A timing access end to end:
LSQUnit::trySendPacket()
    dcachePort->sendTimingReq(data_pkt)
        Cache::recvTimingReq()
        ... cache latency (a miss goes further down, eventually to Simple Memory) ...
        cpuSidePort.sendTimingResp(pkt)
    LSQUnit::recvTimingResp(pkt)
Request ports send requests and receive responses, whereas response ports receive requests and send responses. Because of the coherence protocol, a response port can also send snoop requests and receive snoop responses, with the request port offering the mirrored interface.
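A sketch of how the request side of this timing protocol is typically implemented in a port subclass (the class name, sendPacket() and blockedPacket are assumptions; only sendTimingReq/recvTimingResp/recvReqRetry are the real hooks, and constructor signatures vary between gem5 versions):

#include "mem/port.hh"

class MyMemSidePort : public RequestPort
{
    PacketPtr blockedPacket = nullptr;   // packet waiting for a retry

  public:
    // Constructor signature varies across gem5 versions.
    MyMemSidePort(const std::string &name, SimObject *owner) :
        RequestPort(name, owner)
    {}

    // Try to send; if the peer is busy, keep the packet for later.
    void sendPacket(PacketPtr pkt)
    {
        if (!sendTimingReq(pkt))
            blockedPacket = pkt;         // peer will call recvReqRetry()
    }

    // The peer response port can accept requests again: retry.
    void recvReqRetry() override
    {
        PacketPtr pkt = blockedPacket;
        blockedPacket = nullptr;
        sendPacket(pkt);
    }

    // The response for an earlier request has arrived.
    bool recvTimingResp(PacketPtr pkt) override
    {
        // Hand the packet back to the owning object; return true to accept.
        return true;
    }
};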
Ruby models inclusive/exclusive cache hierarchies with various replacement policies, coherence protocol implementations, interconnection networks, DMA and memory controllers, and various sequencers that initiate memory requests and handle responses.
SLICC stands for Specification Language for Implementing Cache Coherence
MI_example: example protocol, 1-level cache.
MESI_Two_Level: single chip, 2-level caches, strictly-inclusive hierarchy.
MOESI_CMP_directory: multiple chips, 2-level caches, non-inclusive (neither strictly inclusive nor exclusive) hierarchy.
MOESI_CMP_token: 2-level caches. TODO.
MOESI_hammer: single chip, 2-level private caches, strictly-exclusive hierarchy.
Garnet_standalone: protocol to run the Garnet network in a standalone manner.
MESI_Three_Level: 3-level caches, strictly-inclusive hierarchy. Based on MESI_Two_Level with an extra L0 cache.
CHI: flexible protocol that implements Arm's AMBA 5 CHI transactions. Supports a configurable cache hierarchy with both MESI and MOESI coherency.
QA
Thanks