Mapping Flow Back to CPU Hierarchy
A Peer Cache can be:
- A peer CXL device with a cache
- A CPU cache in the local socket
- A CPU cache in a remote socket
[Figure: CPU cache hierarchy with CXL, showing example capacities per level — CPU L1 ~50 KB, L2 ~500 KB, L3 (aka LLC) ~10 MB, directly connected DDR ~10 GB, CXL.mem ~10 GB, CXL devices with ~1 MB caches and ~50 KB write caches over CXL.io/PCIe, the Home Agent, and coherent CPU-to-CPU symmetric links between sockets.]

Read Flow
Diagram to show message flows in time:
- X-axis: Agents
- Y-axis: Time

Cache Protocol Channels
- 3 channels in each direction: D2H vs. H2D
- Data and RSP channels are pre-allocated
- D2H Requests come from the device
- H2D Requests are snoops from the host
- Ordering: H2D Req (Snoop) pushes H2D RSP

Cache Protocol Summary
- A simple set of 15 cacheable reads and writes from the device to host memory
- Keeps the complexity of global coherence management in the host

CXL Cache Protocol

How are Peer Caches Managed?
- All peer caches are managed by the Home Agent within the cache level, hidden from the CXL device
- A "snoop" is the term for the Home Agent checking cache state; it may cause cache state changes
- CXL Snoops:
  - Snoop Invalidate (SnpInv): cache must degrade to I-state and return any Modified data
  - Snoop Data (SnpData): cache must degrade to S-state and return any Modified data
  - Snoop Current (SnpCurr): cache state does not change, but the cache must return any Modified data

2020 Storage Developer Conference
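The snoop semantics above can be sketched as a small state-transition function. This is a hypothetical model for illustration only — the function name and return shape are mine, not from the CXL specification:

```python
# Illustrative model: how each CXL snoop type affects a peer cache's
# MESI state, and whether Modified data must be returned to the Home Agent.

def apply_snoop(state, snoop):
    """Return (new_state, returns_modified_data) for one cacheline."""
    returns_data = (state == "M")   # every snoop must return any Modified data
    if snoop == "SnpInv":           # degrade to Invalid
        return "I", returns_data
    if snoop == "SnpData":          # degrade to Shared
        return "S", returns_data
    if snoop == "SnpCurr":          # cache state does not change
        return state, returns_data
    raise ValueError(f"unknown snoop type: {snoop}")

# A line held Modified is invalidated and its data returned:
assert apply_snoop("M", "SnpInv") == ("I", True)
# An Exclusive line degrades to Shared; memory is already up to date:
assert apply_snoop("E", "SnpData") == ("S", False)
```

The model reflects the slide's point that coherence complexity stays in the host: the device only has to answer snoops, not track its peers.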
Cache Coherence Protocol
Modern CPU caches and CXL are built on the M,E,S,I protocol/states:
- Modified: only in one cache; can be read or written; data is NOT up to date in memory
- Exclusive: only in one cache; can be read or written; data IS up to date in memory
- Shared: can be in many caches; can only be read; data IS up to date in memory
- Invalid: not in cache
M,E,S,I is tracked for each cacheline address in each cache; the cacheline address in CXL is the Addr field.
Notes: Each level of the CPU cache hierarchy follows MESI, and layers above must be consistent. Other extended states and flows are possible but not covered in the context of CXL.

Cache Consistency
How do we make sure updates in a cache are visible to other agents? Invalidate all peer caches prior to the update.
- This can be managed with software or hardware; CXL uses hardware coherence
- A point of Global Observation (aka GO) defines when new data from writes becomes visible
- Tracking granularity is a cacheline of data: 64 bytes for CXL
- All addresses are assumed to be Host Physical Addresses (HPA) in the CXL cache and memory protocols; translations use the existing Address Translation Services (ATS)

CPU Cache Hierarchy with CXL
Modern CPUs have 2 or more levels of coherent cache:
- Lower levels (L1) are smaller in capacity, with the lowest latency and highest bandwidth per source
- Higher levels (L3) have less bandwidth per source but much higher capacity and support more sources
- Device caches are expected to be up to 1 MB
[Figure: CPU sockets with L1 ~50 KB, L2 ~500 KB, L3 (aka LLC) ~10 MB, directly connected DDR ~10 GB, CXL.mem ~10 GB, CXL devices with ~1 MB caches and ~50 KB write caches over CXL.io/PCIe, the Home Agent, and coherent CPU-to-CPU symmetric links. Note: cache capacities are examples and not aligned to a specific product.]
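The M,E,S,I rules and the 64-byte tracking granularity described above can be summarized in a few lines of code. This is a minimal sketch for illustration; the table layout and helper names are mine, not from the CXL specification:

```python
# Which MESI states permit writes, and whether memory already holds the data.
MESI = {
    # state: (readable, writable, memory_up_to_date)
    "M": (True,  True,  False),  # only copy; newer than memory
    "E": (True,  True,  True),   # only copy; same as memory
    "S": (True,  False, True),   # possibly many copies; read-only
    "I": (False, False, True),   # line not present in this cache
}

def can_write(state):
    return MESI[state][1]

def needs_writeback_on_evict(state):
    return not MESI[state][2]    # only Modified data is newer than memory

# CXL tracks coherence at 64-byte cacheline granularity, so two Host
# Physical Addresses share coherence state iff they share the upper bits.
LINE_BYTES = 64

def cacheline_index(hpa):
    return hpa >> 6              # drop the 6 offset bits (64 = 2**6)

assert can_write("E") and not can_write("S")
assert cacheline_index(0x1000) == cacheline_index(0x103F)  # same line
assert cacheline_index(0x1000) != cacheline_index(0x1040)  # next line
```

Note how E and M both allow writes but only M forces a writeback: a cache can silently upgrade E to M on a store, which is why the host hands out Exclusive ownership before a device writes.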
Caching Overview
Caching temporarily brings data closer to the consumer, improving latency and bandwidth through prefetching and/or locality:
- Prefetching: loading data into the cache before it is required
- Spatial locality (locality in space): access address X, then X+n
- Temporal locality (locality in time): multiple accesses to the same data
[Figure: an accelerator reading data either from the host (access latency ~200 ns, shared bandwidth 100+ GB/s) or from a local data cache (access latency ~10 ns, dedicated bandwidth 100+ GB/s).]

Representative CXL Usages
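A back-of-envelope calculation shows why locality pays off with the example numbers above (~10 ns local cache hit vs. ~200 ns host access). The hit rate is an assumed workload parameter, not a figure from the talk:

```python
def avg_access_latency_ns(hit_rate, hit_ns=10.0, miss_ns=200.0):
    """Average access time: hits served from the local cache, misses from host."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns

# With strong locality (95% of accesses hitting the local cache), the
# average access latency drops to roughly a tenth of the host latency.
latency = avg_access_latency_ns(0.95)   # ~19.5 ns
```

The same arithmetic explains the bandwidth claim: every hit served from the accelerator's dedicated cache is a request that never contends for the shared host link.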