gem5, Part 1

A modular platform for computer-system architecture research


chsgcxy



Compiling gem5


How gem5 is used for computer architecture research


Can run different workloads

#include <stdio.h>

int main(int argc, char* argv[])
{
    printf("Hello world!\n");
    return 0;
}


config.dot


./build/ARM/gem5.debug configs/example/se.py -c tests/test-progs/hello/bin/arm/linux/hello


Somewhat similar to a DTB


gem5 lib Python script

log m5out/stats.txt


Statistics Detail
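As a rough illustration of how the statistics file can be consumed, the sketch below parses lines laid out as "name, value, # description". The sample text and its numbers are invented for the example; only the general layout of stats.txt is assumed.

```python
import re

# Hypothetical sample in the "name  value  # description" layout;
# the numbers are invented for illustration, not taken from a real run.
SAMPLE = """\
---------- Begin Simulation Statistics ----------
simSeconds                                   0.000057   # Number of seconds simulated
simTicks                                     57000000   # Number of ticks simulated
hostSeconds                                      0.12   # Real time elapsed on the host

---------- End Simulation Statistics   ----------
"""

LINE_RE = re.compile(r"^(\S+)\s+(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s+#\s*(.*)$")

def parse_stats(text):
    """Return {stat_name: (value, description)} for simple scalar lines."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:  # separators, blank lines and non-scalar rows are skipped
            name, value, desc = match.groups()
            stats[name] = (float(value), desc.strip())
    return stats

stats = parse_stats(SAMPLE)
```

Rows that are not plain scalars (for example distributions) are simply ignored by this sketch.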




fs (full system) mode & se (system call emulation) mode


se mode: configs/example/se.py

./build/ARM/gem5.debug configs/example/se.py \
    --cpu-type=ArmAtomicSimpleCPU --caches --l2cache \
    --mem-type=DDR4_2400_8x8 --mem-size=2GB \
    --l1d_size=64kB --l1i_size=32kB --l2_size=512kB -n 4 \
    -c tests/test-progs/threads/src/threads

This hangs when run with 4 cores; to be resolved (x86 is OK).

fs mode: configs/example/fs.py

./build/ARM/gem5.debug configs/example/fs.py \
    --caches \
    --kernel=vmlinux.arm64 \
    --disk-image=ubuntu-18.04-arm64-docker.img

Booting the kernel takes about 40 minutes on my desktop.



checkpoint


Create checkpoint command line:

./build/ARM/gem5.debug configs/example/fs.py \
    --caches \
    --kernel=vmlinux.arm64 \
    --disk-image=ubuntu-18.04-arm64-docker.img \
    --take-checkpoints=100000000,100000

The two values of --take-checkpoints are <when>,<period>.


File tree

Checkpoints are essentially snapshots of a simulation.


build/ARM/dev/arm/rv_ctrl.cc:176: warn: SCReg: Access to unknown device dcc0:site0:pos0:fn7:dev0

Writing checkpoint

build/ARM/sim/simulate.cc:194: info: Entering event queue @ 100000000. Starting simulation...

Writing checkpoint

build/ARM/sim/simulate.cc:194: info: Entering event queue @ 100100000. Starting simulation...

Writing checkpoint

build/ARM/sim/simulate.cc:194: info: Entering event queue @ 100200000. Starting simulation...

log
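The ticks in the log follow directly from the when,period pair. A minimal sketch of that arithmetic, using the values from the command line above (the function name is made up for the example):

```python
from itertools import islice

def checkpoint_ticks(when, period):
    """Yield the ticks at which checkpoints are written: when, when+period, ..."""
    tick = when
    while True:
        yield tick
        tick += period

# Values from --take-checkpoints=100000000,100000
first_three = list(islice(checkpoint_ticks(100_000_000, 100_000), 3))
print(first_three)  # [100000000, 100100000, 100200000]
```

These are the same ticks at which the log reports "Entering event queue" after each checkpoint.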


Restore command line:

./build/ARM/gem5.debug configs/example/fs.py \
    --caches \
    --kernel=vmlinux.arm64 \
    --disk-image=ubuntu-18.04-arm64-docker.img \
    --checkpoint-restore=2

gem5 sorts the cpt.xxx directories and restores the one selected by the given index.
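The index lookup can be pictured roughly as follows. This is a sketch, not the gem5 script itself: it assumes the directories are named cpt.<tick>, that sorting is numeric by tick, and that the index is 1-based; the function name is hypothetical.

```python
import re

def pick_checkpoint(dir_names, index):
    """Sort cpt.<tick> names by tick and return the one at a 1-based index."""
    pattern = re.compile(r"cpt\.(\d+)$")
    ticks = sorted(
        int(m.group(1)) for m in map(pattern.match, dir_names) if m
    )
    if not 1 <= index <= len(ticks):
        raise IndexError(f"checkpoint {index} not found")
    return f"cpt.{ticks[index - 1]}"

# Directory listing in arbitrary order, with one unrelated file mixed in.
names = ["cpt.100200000", "config.ini", "cpt.100000000", "cpt.100100000"]
chosen = pick_checkpoint(names, 2)
```

Under these assumptions, index 2 selects the second-oldest checkpoint, cpt.100100000.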




Event-driven


class Event
{
    Event *nextBin;
    Event *nextInBin;

    Tick _when;          //!< timestamp when event should be processed
    Priority _priority;  //!< event priority

    virtual void process() = 0;
};

[Figure: the event queue as a linked list of bins. Each bin holds the events that share the same when and priority; bins are linked through nextBin and ordered as (when=500, priority=50), (when=500, priority=64), (when=1000, priority=50), (when=1000, priority=90). Events inside a bin (Event0 to Event6 in the figure) are linked through nextInBin.]
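The ordering rule can be sketched with a small priority queue. This is a toy in Python that uses a heap rather than gem5's linked list of bins; the tie order among events with identical when and priority is a choice of the sketch, not a statement about gem5.

```python
import heapq
import itertools

class EventQueue:
    """Toy queue ordering events by (when, priority); not gem5's linked-list code."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker; insertion order is this sketch's choice

    def schedule(self, name, when, priority):
        heapq.heappush(self._heap, (when, priority, next(self._seq), name))

    def service_one(self):
        when, priority, _, name = heapq.heappop(self._heap)
        return when, priority, name

    def empty(self):
        return not self._heap

q = EventQueue()
q.schedule("D", 1000, 90)
q.schedule("B", 500, 64)
q.schedule("C", 1000, 50)
q.schedule("A", 500, 50)

order = []
while not q.empty():
    order.append(q.service_one())
```

Events come out sorted by when first and by the numerically smaller priority second, matching the bin order in the figure.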


0: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 executed @ 0

0: system.cpu.icache.mem_side_port-MemSidePort.wrapped_function_event: EventFunctionWrapped 71 scheduled @ 1000

0: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 scheduled @ 500

500: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 executed @ 500

500: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 scheduled @ 1000

1000: system.cpu.icache.mem_side_port-MemSidePort.wrapped_function_event: EventFunctionWrapped 71 executed @ 1000

1000: system.l2.mem_side_port-MemSidePort.wrapped_function_event: EventFunctionWrapped 76 scheduled @ 11500

1000: system.tol2bus.reqLayer0.wrapped_function_event: EventFunctionWrapped 88 scheduled @ 1500

1000: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 executed @ 1000

1000: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 scheduled @ 1500

1500: system.tol2bus.reqLayer0.wrapped_function_event: EventFunctionWrapped 88 executed @ 1500

1500: O3CPU tick.wrapped_function_event: EventFunctionWrapped 61 executed @ 1500



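Event *EventQueue::serviceOne()
{
    Event *event = head;            // earliest event in the queue
    Event *next = head->nextInBin;  // next event with the same when and priority

    if (next) {
        // Promote the next event of this bin to the head of the queue.
        next->nextBin = head->nextBin;
        head = next;
    } else {
        // The bin is now empty, so move on to the next bin.
        head = head->nextBin;
    }

    setCurTick(event->when());  // advance simulated time to the event
    event->process();

    return NULL;
}

GlobalSimLoopExitEvent *simulate(Tick num_cycles)
{
    while (1) {
        Event *exit_event = mainEventQueue[0]->serviceOne();
    }
}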
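To make the loop concrete, here is a toy version in Python: service one event, advance the current tick, and stop when an event returns an exit marker. The class and function names are made up, and the 500-tick period and the exit at tick 1200 are arbitrary values for the example.

```python
import heapq

class Simulator:
    """Toy main loop: pop the earliest event, advance the current tick, process it."""

    def __init__(self):
        self.cur_tick = 0
        self._heap = []
        self._seq = 0
        self.trace = []

    def schedule(self, when, callback):
        heapq.heappush(self._heap, (when, self._seq, callback))
        self._seq += 1

    def service_one(self):
        when, _, callback = heapq.heappop(self._heap)
        self.cur_tick = when   # analogous to setCurTick(event->when())
        return callback(self)  # analogous to event->process()

    def simulate(self):
        while self._heap:
            exit_event = self.service_one()
            if exit_event is not None:
                return exit_event
        return None

def cpu_tick(sim, period=500, limit=1500):
    """A tick event that records the time and reschedules itself."""
    sim.trace.append(sim.cur_tick)
    if sim.cur_tick < limit:
        sim.schedule(sim.cur_tick + period, cpu_tick)
    return None

def exit_at(sim):
    """An event whose return value ends the simulation loop."""
    return f"exit @ {sim.cur_tick}"

sim = Simulator()
sim.schedule(0, cpu_tick)
sim.schedule(1200, exit_at)
result = sim.simulate()
```

The self-rescheduling cpu_tick mirrors the pattern in the trace above, where the CPU tick event is executed and then scheduled again 500 ticks later.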

class SimObject
{
    static std::vector<SimObject *> simObjectList;

    virtual void init();
    virtual void startup();
};
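A rough picture of how a global object list and the init()/startup() hooks fit together, written as a Python toy. The two-phase order (all init() calls before any startup() call) is an assumption of this sketch, and the Cache subclass and instantiate() helper are hypothetical.

```python
class SimObject:
    """Toy registry mirroring the simObjectList / init() / startup() idea."""

    sim_object_list = []  # class-level list shared by all instances

    def __init__(self, name):
        self.name = name
        self.log = []
        SimObject.sim_object_list.append(self)

    def init(self):
        self.log.append("init")

    def startup(self):
        self.log.append("startup")

class Cache(SimObject):
    def startup(self):
        super().startup()
        self.log.append("schedule first event")  # hypothetical startup work

def instantiate():
    # Phase 1: every object is initialised before any object starts up.
    for obj in SimObject.sim_object_list:
        obj.init()
    # Phase 2: objects may now do work that depends on their peers being ready.
    for obj in SimObject.sim_object_list:
        obj.startup()

cpu = SimObject("cpu")
dcache = Cache("dcache")
instantiate()
```

Because every object registers itself on construction, the framework can walk one list instead of knowing each component individually.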

Python & C++


pybind11 (bridges the Python configuration layer and the C++ simulator)


clock system






Instruction frameworks

class StaticInst
{
    std::bitset<Num_Flags> flags;
    OpClass _opClass;

    uint8_t _numSrcRegs = 0;
    uint8_t _numDestRegs = 0;

    RegIdArrayPtr _srcRegIdxPtr = nullptr;
    RegIdArrayPtr _destRegIdxPtr = nullptr;

    virtual std::string generateDisassembly(
        Addr pc, const loader::SymbolTable *symtab) const = 0;

    virtual Fault execute(ExecContext *xc, Trace::InstRecord *traceData) const = 0;
    virtual void advancePC(PCStateBase &pc_state) const = 0;
    virtual std::unique_ptr<PCStateBase> branchTarget(
        const PCStateBase &pc) const;
};

enum OpClass
{
    No_OpClass = 0,
    IntAlu = 1,
    IntMult = 2,
    IntDiv = 3,
    FloatAdd = 4,
    FloatCmp = 5,
    FloatCvt = 6,
    FloatMult = 7,
    FloatMultAcc = 8,
    FloatDiv = 9,
    ……
};

enum Flags
{
    IsNop = 0,
    IsInteger = 1,
    IsFloating = 2,
    IsVector = 3,
    IsVectorElem = 4,
    IsLoad = 5,
    IsStore = 6,
    IsAtomic = 7,
    IsStoreConditional = 8,
    IsInstPrefetch = 9,
    IsDataPrefetch = 10,
    IsControl = 11,
    IsDirectControl = 12,
    IsIndirectControl = 13,
    ……
};

[Figure: class diagram. X86StaticInst, RiscvStaticInst and ArmStaticInst inherit from StaticInst; DynInst (O3CPU) and MinorDynInst reference StaticInst.]


The CPU model only needs to handle Flags and OpClass; it no longer needs to care which specific instruction it is executing.
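A small Python sketch of that idea: the issue logic looks only at flags and op class, never at the mnemonic. The enum members reuse names from the listings, but their numeric values, the latencies, and the unit names ("lsq", "branch_unit", "fu_pool") are hypothetical.

```python
from enum import Enum, Flag, auto

class OpClass(Enum):
    # Names follow the listing; the values here are arbitrary, not gem5's.
    IntAlu = auto()
    IntMult = auto()
    MemRead = auto()

class Flags(Flag):
    IsInteger = auto()
    IsLoad = auto()
    IsControl = auto()

class StaticInst:
    def __init__(self, mnemonic, op_class, flags):
        self.mnemonic = mnemonic
        self.op_class = op_class
        self.flags = flags

# Hypothetical latencies, chosen only for the example.
LATENCY = {OpClass.IntAlu: 1, OpClass.IntMult: 3, OpClass.MemRead: 2}

def issue(inst):
    """Pick a pipeline path using only flags and op class, never the mnemonic."""
    if Flags.IsLoad in inst.flags:
        return ("lsq", LATENCY[inst.op_class])
    if Flags.IsControl in inst.flags:
        return ("branch_unit", LATENCY[inst.op_class])
    return ("fu_pool", LATENCY[inst.op_class])

addw = StaticInst("addw", OpClass.IntAlu, Flags.IsInteger)
mulw = StaticInst("mulw", OpClass.IntMult, Flags.IsInteger)
lw = StaticInst("lw", OpClass.MemRead, Flags.IsInteger | Flags.IsLoad)
```

Adding a new instruction then means supplying the right flags and op class; the issue() logic stays unchanged.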


Instruction frameworks

0x0e: decode FUNCT3 {
    format ROp {
        0x0: decode FUNCT7 {
            0x0: addw({{
                Rd_sd = Rs1_sw + Rs2_sw;
            }});
            0x1: mulw({{
                Rd_sd = (int32_t)(Rs1_sw*Rs2_sw);
            }}, IntMultOp);

The .isa framework provides a convenient and flexible way to generate instruction classes

Decode/xxx.isa
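For readers unfamiliar with the operand suffixes, the intended effect of the two code snippets can be worked through in Python. This sketch assumes _sw denotes a signed 32-bit view and _sd a signed 64-bit view of a register, and models the usual RV64 behaviour of wrapping to 32 bits and sign-extending; the helper names are made up.

```python
def sext32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def to_u64(value):
    """Represent a signed result as the 64-bit register bit pattern."""
    return value & 0xFFFFFFFFFFFFFFFF

def addw(rs1, rs2):
    # Rd_sd = Rs1_sw + Rs2_sw, with the 32-bit sum sign-extended to 64 bits
    return to_u64(sext32(sext32(rs1) + sext32(rs2)))

def mulw(rs1, rs2):
    # Rd_sd = (int32_t)(Rs1_sw * Rs2_sw)
    return to_u64(sext32(sext32(rs1) * sext32(rs2)))

overflowed = addw(0x7FFFFFFF, 1)   # wraps to the most negative 32-bit value
truncated = mulw(0x10000, 0x10000)  # product 2**32 has zero low 32 bits
```

Here overflowed is 0xFFFFFFFF80000000 and truncated is 0, which shows the sign extension and the 32-bit truncation respectively.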


build/XXX/arch/XXX/generated


def format ROp(code, *opt_flags) {{
    iop = InstObjParams(name, Name, 'RegOp', code, opt_flags)
    header_output = BasicDeclare.subst(iop)
    decoder_output = BasicConstructor.subst(iop)
    decode_block = BasicDecode.subst(iop)
    exec_output = BasicExecute.subst(iop)
}};

isa_parser.py

arch/xxx/insts: branch.cc, branch.hh, xxx.cc, xxx.hh

formats/xxx.isa


def template BasicDeclare {{
    class %(class_name)s : public %(base_class)s
    {
      public:
        Fault execute(ExecContext *, Trace::InstRecord *);
        using %(base_class)s::generateDisassembly;
    };
}};

The generated instruction classes inherit from StaticInst
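The %(name)s placeholders are ordinary Python string-formatting keys, so the substitution step can be imitated in a few lines. This is a sketch of the idea only: the subst() helper and the parameter values ("Addw", "RegOp") are illustrative stand-ins, not the isa_parser.py implementation.

```python
BASIC_DECLARE = """\
class %(class_name)s : public %(base_class)s
{
  public:
    Fault execute(ExecContext *, Trace::InstRecord *);
    using %(base_class)s::generateDisassembly;
};
"""

def subst(template, params):
    """Fill %(key)s placeholders from a dict, as Python's % operator does."""
    return template % params

# Parameter values here are illustrative stand-ins for what InstObjParams carries.
generated = subst(BASIC_DECLARE, {"class_name": "Addw", "base_class": "RegOp"})
print(generated)
```

The output is a C++ class declaration for one instruction, which is the kind of text that ends up in the generated directory.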


templates/xxx.isa


memory system


class ResponsePort : public Port
{
    void bind(Port &peer) override {}
    bool sendTimingResp(PacketPtr pkt);
};

class RequestPort : public Port
{
    void bind(Port &peer) override;
    bool sendTimingReq(PacketPtr pkt);
};

Every memory object has to have at least one port to be useful.


[Figure: port connections. CPU1 and CPU2 each have a dcache_port (RequestPort) bound to the cpuSidePort (ResponsePort) of DCache1 and DCache2 respectively; each cache's memSidePort (RequestPort) faces the Coherent Bus, followed by Simple Memory.]

[Figure: timing flow. LSQUnit::trySendPacket() calls dcachePort->sendTimingReq(data_pkt), which is handled by Cache::recvTimingReq(); after the latency, cpuSidePort.sendTimingResp(pkt) delivers the response to LSQUnit::recvTimingResp(pkt).]



Request ports can send requests and receive responses, whereas response ports receive requests and send responses. Due to the coherence protocol, a response port can also send snoop requests and receive snoop responses, with the request port having the mirrored interface.
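The request/response pairing can be sketched as a Python toy. It assumes a single CPU bound to a single cache, uses a dict as the packet, and sends the response immediately where a real model would schedule it after a latency; snooping is left out, and all class names are hypothetical.

```python
class ResponsePort:
    def __init__(self, owner):
        self.owner = owner
        self.peer = None

    def send_timing_resp(self, pkt):
        return self.peer.owner.recv_timing_resp(pkt)

class RequestPort:
    def __init__(self, owner):
        self.owner = owner
        self.peer = None

    def bind(self, response_port):
        # Binding is symmetric: each side keeps a reference to its peer.
        self.peer = response_port
        response_port.peer = self

    def send_timing_req(self, pkt):
        return self.peer.owner.recv_timing_req(pkt)

class ToyCache:
    def __init__(self):
        self.cpu_side_port = ResponsePort(self)
        self.data = {0x40: 7}  # hypothetical contents

    def recv_timing_req(self, pkt):
        pkt["data"] = self.data.get(pkt["addr"], 0)
        # A real model would schedule the response after a latency.
        self.cpu_side_port.send_timing_resp(pkt)
        return True  # request accepted

class ToyCpu:
    def __init__(self):
        self.dcache_port = RequestPort(self)
        self.responses = []

    def recv_timing_resp(self, pkt):
        self.responses.append(pkt)
        return True

cpu, cache = ToyCpu(), ToyCache()
cpu.dcache_port.bind(cache.cpu_side_port)
accepted = cpu.dcache_port.send_timing_req({"addr": 0x40})
```

The CPU never calls the cache directly; it only talks to its own port, which is what lets components be rewired freely in the configuration script.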



Ruby Subsystem



It models inclusive/exclusive cache hierarchies with various replacement policies, coherence protocol implementations, interconnection networks, DMA and memory controllers, various sequencers that initiate memory requests and handle responses



SLICC stands for Specification Language for Implementing Cache Coherence.

  1. MI_example: example protocol, 1-level cache.

  2. MESI_Two_Level: single chip, 2-level caches, strictly-inclusive hierarchy.

  3. MOESI_CMP_directory: multiple chips, 2-level caches, non-inclusive (neither strictly inclusive nor exclusive) hierarchy.

  4. MOESI_CMP_token: 2-level caches. TODO.

  5. MOESI_hammer: single chip, 2-level private caches, strictly-exclusive hierarchy.

  6. Garnet_standalone: protocol to run the Garnet network in a standalone manner.

  7. MESI_Three_Level: 3-level caches, strictly-inclusive hierarchy. Based on MESI_Two_Level with an extra L0 cache.

  8. CHI: flexible protocol that implements Arm's AMBA5 CHI transactions. Supports configurable cache hierarchy with either MESI or MOESI coherency.
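Coherence controllers of this kind are state machines, so the core idea can be shown with a table-driven toy in Python. This is neither SLICC syntax nor any of the protocols listed above; the two states and the event and action names are made up for illustration.

```python
# (state, event) -> (next_state, action); a two-state toy, not a gem5 protocol.
TRANSITIONS = {
    ("I", "Load"): ("M", "fetch_line"),
    ("I", "Store"): ("M", "fetch_line"),
    ("M", "Load"): ("M", "hit"),
    ("M", "Store"): ("M", "hit"),
    ("M", "Replacement"): ("I", "writeback"),
    ("M", "Fwd_GetX"): ("I", "send_data_to_requestor"),
}

def run(events, state="I"):
    """Apply events in order and return the final state plus actions taken."""
    actions = []
    for event in events:
        state, action = TRANSITIONS[(state, event)]
        actions.append(action)
    return state, actions

final_state, actions = run(["Load", "Store", "Fwd_GetX", "Store"])
```

Any (state, event) pair missing from the table raises KeyError, which is a simple way to flag a transition the toy does not define.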

QA


Thanks