# 111403059_C.O.proj

# readme

This README covers Q3 through Q5:
- test results
- explanation of the results
- code modifications (with comments)

# Q3

## simulation

```bash=
./build/X86/gem5.opt configs/example/se.py -c ./quicksort --cpu-type=TimingSimpleCPU --caches --l2cache --l3cache --l3_assoc=? --l1i_size=32kB --l1d_size=32kB --l2_size=128kB --l3_size=128kB --mem-type=NVMainMemory --nvmain-config=../NVmain/Config/PCM_ISSCC_2012_4GB.config
```

- 2-way: replace `?` with 2
- fully associative: replace `?` with l3_size/64 = 2048 (one way per cache line)

### 2-way set associative cache

#### result (higher miss rate)

system.l3.overall_miss_rate::total 0.570359 # miss rate for overall accesses

#### explain
- Structure: the cache is divided into multiple sets, ==each of which holds two cache lines== (ways).
- Mapping: a given memory block can be placed in one of the two lines of exactly one set. The set is determined by a portion of the block's address bits, the ==index==.
- Replacement: when both lines in a set are occupied and a new block must be placed there, one of the ==existing blocks must be evicted==, according to a policy such as LRU (Least Recently Used).

### fully associative cache

#### result (lower miss rate)

system.l3.overall_miss_rate::total 0.521089 # miss rate for overall accesses

#### explain
- Structure: the cache has no sets; it is a single pool of cache lines.
- Mapping: a memory block can be placed in any cache line.
- Replacement: since any line can hold any block, the replacement policy (e.g., LRU) alone decides which line to evict when the cache is full.

## comparison
- Flexibility: a fully associative cache is more flexible because any memory block can go into any line, reducing the likelihood of conflicts. A 2-way set associative cache restricts each block to one specific set, increasing the chance of conflicts.
- Complexity: a fully associative cache requires more complex hardware to search all cache lines simultaneously for a match, whereas a 2-way set associative cache only needs to search within one set, simplifying the hardware.
- Performance: a fully associative cache typically has a lower miss rate than a 2-way set associative one because it ==reduces conflict misses==; the size of the gap depends on the workload and the cache size. The results above (0.570 vs. 0.521) match this expectation; the sketch below makes the mapping difference concrete.
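As referenced above, here is a minimal standalone sketch (plain C++, not gem5 code) that computes the number of sets for the two configurations, assuming the 128 kB L3 and 64 B cache lines from the command line. With assoc = 2048 there is a single set, so the index bits vanish and any block may occupy any line.

```clike=
#include <cstdint>
#include <cstdio>

// Hypothetical helper (not from gem5): sets = size / (line size * assoc).
static uint64_t numSets(uint64_t cacheSize, uint64_t lineSize, uint64_t assoc)
{
    return cacheSize / (lineSize * assoc);
}

int main()
{
    const uint64_t cacheSize = 128 * 1024; // --l3_size=128kB
    const uint64_t lineSize  = 64;         // gem5's default cache line size

    const uint64_t sets2way = numSets(cacheSize, lineSize, 2);    // 1024 sets
    const uint64_t setsFull = numSets(cacheSize, lineSize, 2048); // 1 set

    const uint64_t addr = 0x7ffd1240; // arbitrary example address
    // The index bits pick the set a block is confined to.
    printf("2-way: %llu sets, block confined to set %llu\n",
           (unsigned long long)sets2way,
           (unsigned long long)((addr / lineSize) % sets2way));
    printf("full : %llu set, block may use any of the 2048 lines\n",
           (unsigned long long)setsFull);
    return 0;
}
```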
# Q4

## modifications
- Because the assignment slides also require taking access order into account, LRU is blended into LFU here: frequency (refCount) picks the victim first, and recency (lastTouchTick) breaks ties.

### /configs/common/Caches.py

```python=
class L3Cache(Cache):
    assoc = 8
    tag_latency = 20
    data_latency = 20
    response_latency = 20
    mshrs = 20
    tgts_per_mshr = 12
    write_buffers = 8
    # added: enable the LFU replacement policy
    # (LFURP is exported via the `from m5.objects import *` at the top of Caches.py)
    replacement_policy = Param.BaseReplacementPolicy(LFURP(), "Replacement policy")
```

### lfu_rp.hh (header file)

```clike=
/** LFU-specific implementation of replacement data. */
struct LFUReplData : ReplacementData
{
    /** Number of references to this entry since it was reset. */
    unsigned refCount;

    /** added: records the last time the cache entry was accessed. */
    Tick lastTouchTick;

    /**
     * Default constructor. Invalidate data.
     */
    LFUReplData() : refCount(0), lastTouchTick(0) {} // added lastTouchTick(0)
};
```

### lfu_rp.cc (source file)

```clike=
#include "mem/cache/replacement_policies/lfu_rp.hh"

#include <cassert>
#include <memory>

#include "params/LFURP.hh"

LFURP::LFURP(const Params *p)
    : BaseReplacementPolicy(p)
{
}

// Resets both refCount and lastTouchTick to 0.
void
LFURP::invalidate(const std::shared_ptr<ReplacementData>& replacement_data)
const
{
    // Reset reference count
    std::static_pointer_cast<LFUReplData>(replacement_data)->refCount = 0;
    std::static_pointer_cast<LFUReplData>(
        replacement_data)->lastTouchTick = Tick(0);
}

// Increments refCount and updates lastTouchTick to the current tick.
void
LFURP::touch(const std::shared_ptr<ReplacementData>& replacement_data) const
{
    // Update reference count
    std::static_pointer_cast<LFUReplData>(replacement_data)->refCount++;
    // Update last touch timestamp
    std::static_pointer_cast<LFUReplData>(
        replacement_data)->lastTouchTick = curTick();
}

// Resets refCount to 1 and updates lastTouchTick to the current tick.
void
LFURP::reset(const std::shared_ptr<ReplacementData>& replacement_data) const
{
    // Reset reference count
    std::static_pointer_cast<LFUReplData>(replacement_data)->refCount = 1;
    std::static_pointer_cast<LFUReplData>(
        replacement_data)->lastTouchTick = curTick();
}

// Picks the candidate with the smallest refCount; if multiple candidates
// share that refCount, picks the one with the smallest lastTouchTick
// (i.e., the least recently accessed).
ReplaceableEntry*
LFURP::getVictim(const ReplacementCandidates& candidates) const
{
    // There must be at least one replacement candidate
    assert(candidates.size() > 0);

    // Visit all candidates to find victim
    ReplaceableEntry* victim = candidates[0];
    for (const auto& candidate : candidates) {
        // Update victim entry if necessary
        if (std::static_pointer_cast<LFUReplData>(
                candidate->replacementData)->refCount <
            std::static_pointer_cast<LFUReplData>(
                victim->replacementData)->refCount) {
            victim = candidate;
        } else if (std::static_pointer_cast<LFUReplData>(
                       candidate->replacementData)->refCount ==
                   std::static_pointer_cast<LFUReplData>(
                       victim->replacementData)->refCount) {
            // Tie on frequency: fall back to recency (LRU behavior)
            if (std::static_pointer_cast<LFUReplData>(
                    candidate->replacementData)->lastTouchTick <
                std::static_pointer_cast<LFUReplData>(
                    victim->replacementData)->lastTouchTick) {
                victim = candidate;
            }
        }
    }

    return victim;
}

std::shared_ptr<ReplacementData>
LFURP::instantiateEntry()
{
    return std::shared_ptr<ReplacementData>(new LFUReplData());
}

LFURP*
LFURPParams::create()
{
    return new LFURP(this);
}
```

### summary
By adding lastTouchTick, the modified policy balances frequency and recency: blocks are retained primarily according to how often they are referenced, and ties are broken in favor of blocks touched more recently. This can improve cache performance over pure LRU, as the replacement counts below show.

## simulation

```bash=
./build/X86/gem5.opt configs/example/se.py -c ./quicksort --cpu-type=TimingSimpleCPU --caches --l2cache --l3cache --l3_assoc=2 --l1i_size=32kB --l1d_size=32kB --l2_size=128kB --l3_size=128kB --mem-type=NVMainMemory --nvmain-config=../NVmain/Config/PCM_ISSCC_2012_4GB.config
```

- quicksort
- assoc = 4
- L3 = 128 kB

### old (LRU): higher

system.l3.replacements 40704 # number of replacements

### new (LFU with LRU tie-break): lower

system.l3.replacements 39939 # number of replacements
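To sanity-check the victim-selection logic, here is a small standalone sketch (plain C++, independent of gem5's classes; `Line` and `pickVictim` are hypothetical names) that mirrors getVictim's frequency-then-recency comparison:

```clike=
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for LFUReplData: a frequency counter plus a recency timestamp.
struct Line { unsigned refCount; uint64_t lastTouchTick; };

// Same ordering as getVictim above: lowest refCount wins; ties go to the
// entry with the smallest (oldest) lastTouchTick.
static const Line* pickVictim(const std::vector<Line>& set)
{
    assert(!set.empty());
    const Line* victim = &set[0];
    for (const Line& c : set) {
        if (c.refCount < victim->refCount ||
            (c.refCount == victim->refCount &&
             c.lastTouchTick < victim->lastTouchTick)) {
            victim = &c;
        }
    }
    return victim;
}

int main()
{
    // Two lines share the minimum frequency (refCount == 1);
    // the tie-break evicts the one touched longest ago (tick 100).
    std::vector<Line> set = {{3, 500}, {1, 100}, {1, 400}, {2, 250}};
    const Line* v = pickVictim(set);
    assert(v->refCount == 1 && v->lastTouchTick == 100);
    return 0;
}
```

Pure LFU would have to pick arbitrarily between the two refCount == 1 lines; the recency tie-break is what lets the policy keep the line that is still in active use.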
# Q5

## modifications

### gem5/src/mem/cache/base.cc

```clike=
if (blk->isWritable()) {
    // writecleanBlk generates a packet that writes the block's data back
    // to the level below, leaving the local copy clean
    PacketPtr writeclean_pkt = writecleanBlk(blk, pkt->req->getDest(),
                                             pkt->id);
    // add the writeclean packet to the writebacks list
    writebacks.push_back(writeclean_pkt);
}
```

### summary
Before: a block was written back only when it was dirty.
After: any block that is present and writable is written back regardless of its dirty state, effectively turning the cache into a write-through one.

## simulation

```bash=
./build/X86/gem5.opt configs/example/se.py -c ./mul --cpu-type=TimingSimpleCPU --caches --l2cache --l3cache --l3_assoc=2 --l1i_size=32kB --l1d_size=32kB --l2_size=128kB --l3_size=128kB --mem-type=NVMainMemory --nvmain-config=../NVmain/Config/PCM_ISSCC_2012_4GB.config
```

- multiply
- assoc = 4
- L3 = 128 kB

### old (default: write-back): lower

i0.defaultMemory.totalWriteRequests 16944

### new (write-through): higher

i0.defaultMemory.totalWriteRequests 15611534

# Key Concepts

## policy
- LFU (Least Frequently Used)
- LRU (Least Recently Used)

## gem5
An open-source ==simulator used for computer system architecture== research and education. It offers a ==flexible and modular platform for simulating a wide range of systems==, from simple microcontrollers to complex multicore architectures. Key features:
- Modularity: combine different components (CPUs, caches, memory) to simulate various architectures.
- Support for multiple ISAs: ARM, x86, SPARC, MIPS, RISC-V.
- Detailed and high-level simulation: cycle-level detailed simulations and faster functional simulations.
- Full-system and SoC simulation: run entire operating systems and complex workloads.
- Extensive configuration: customize CPU parameters, memory hierarchies, interconnects, and I/O devices.

## NVMain
A cycle-accurate ==memory simulator designed to model non-volatile memory (NVM)== technologies. It is used for research and development in ==memory system architecture==, particularly focusing on emerging NVM technologies such as phase-change memory (PCM), spin-transfer torque RAM (STT-RAM), and resistive RAM (ReRAM). NVMain can be integrated with system simulators like gem5 to ==provide a detailed simulation of memory behavior and performance in the context of a complete system==.

## packet
In gem5 and similar architecture simulators, a packet represents a unit of communication between components of the memory system, such as the CPU, caches, and main memory. It encapsulates the details of a memory transaction (reads, writes, and other operations) and carries data and control information throughout the memory hierarchy.

## WBWT

### WB
Data writing:
- In a write-back cache, data is written to the cache first and only written to main memory when the cache block is evicted, so main memory is not updated immediately on a cache write.
- A dirty bit is set for the cache line when it is modified. When the line is evicted, its data is written back to main memory only if the dirty bit is set.

### WT
Data writing:
- In a write-through cache, every write updates both the cache and main memory simultaneously, so main memory is always up to date with the cache.
- No dirty bit is needed.
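The following toy calculation (plain C++, not gem5) illustrates why the Q5 change inflates totalWriteRequests so dramatically, under the simplifying assumption that many writes land in one cached block before it is evicted:

```clike=
#include <cstdio>

// Toy model: N writes hit the same cached block, then the block is
// evicted once. Compare how many writes reach main memory.
int main()
{
    const int writesToBlock = 1000; // assumed workload behavior

    // Write-back: the dirty bit batches all N writes into a single
    // writeback at eviction time.
    const int wbMemWrites = 1;

    // Write-through: each write is propagated to memory immediately,
    // so memory sees all N writes (and no dirty bit is needed).
    const int wtMemWrites = writesToBlock;

    printf("write-back   : %d memory write(s)\n", wbMemWrites);
    printf("write-through: %d memory writes\n", wtMemWrites);
    // This asymmetry is why totalWriteRequests jumps from ~16 k to
    // ~15.6 M in the Q5 results above.
    return 0;
}
```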
## base.cc

```clike=
if (!blk) {
    if (pkt->writeThrough()) {
        // if this is a write-through packet, we don't try to
        // allocate if the block is not present
        return false;
    } else {
        // a writeback that misses needs to allocate a new block
        blk = allocateBlock(pkt->getAddr(), pkt->isSecure(), writebacks);
        if (!blk) {
            // no replaceable block available: give up, fwd to
            // next level.
            incMissCount(pkt);
            return false;
        }
        tags->insertBlock(pkt, blk);

        blk->status |= (BlkValid | BlkReadable);
    }
}

// at this point either this is a writeback or a write-through
// write clean operation and the block is already in this
// cache, we need to update the data and the block flags
assert(blk);
// TODO: the coherent cache can assert(!blk->isDirty());
if (!pkt->writeThrough()) {
    blk->status |= BlkDirty;
}
// nothing else to do; writeback doesn't expect response
assert(!pkt->needsResponse());
pkt->writeDataToBlock(blk->data, blkSize);
DPRINTF(Cache, "%s new state is %s\n", __func__, blk->print());
incHitCount(pkt);

// populate the time when the block will be ready to access.
blk->whenReady = clockEdge(fillLatency) + pkt->headerDelay +
    pkt->payloadDelay;

// if this is a write-through packet it will be sent to the cache
// below
return !pkt->writeThrough();
} else if (blk && (pkt->needsWritable() ? blk->isWritable() :
                   blk->isReadable())) {
    // OK to satisfy access
    incHitCount(pkt);
    satisfyRequest(pkt, blk);
    maintainClusivity(pkt->fromCache(), blk);

    return true;
}
```
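The `return !pkt->writeThrough();` at the end of this excerpt is the key line for Q5. The sketch below is a schematic only, using a hypothetical `Packet` type and `handleWriteClean` function (not gem5's real API), to show how that return value decides whether the write continues to the next level:

```clike=
#include <cstdio>

// Hypothetical stand-in for gem5's Packet; only the writeThrough()
// query matters for this illustration.
struct Packet {
    bool wt;
    bool writeThrough() const { return wt; }
};

// Mirrors the control flow at the end of the base.cc excerpt:
// returning true means the access is fully satisfied in this cache;
// returning false means the packet must still travel downstream.
static bool handleWriteClean(const Packet& pkt)
{
    // ... the local block and its flags are updated here, as above ...
    // A write-through packet is not done yet: it must also be sent to
    // the cache (or memory) below, so report "not satisfied locally".
    return !pkt.writeThrough();
}

int main()
{
    Packet writeback{false}, writethrough{true};
    printf("writeback     satisfied locally: %s\n",
           handleWriteClean(writeback) ? "yes" : "no");
    printf("write-through satisfied locally: %s\n",
           handleWriteClean(writethrough) ? "yes" : "no");
    return 0;
}
```

This is why every write-through packet eventually reaches NVMain as a write request, which is exactly what the Q5 totalWriteRequests counters measure.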