Mannarswamy, Sandya and Govindarajan, Ramaswamy (2011) Making STMs Cache Friendly with Compiler Transformations. In: International Conference on Parallel Architectures and Compilation Techniques, October, 2011, Galveston Island, TX, USA.
Software transactional memory (STM) is a promising programming paradigm for shared-memory multithreaded programs. For STMs to be widely adopted in performance-critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We find that STMs are not cache friendly, with data cache stall cycles contributing more than 50% of the execution cycles in a majority of the benchmarks. We find that, on average, misses occurring inside the STM account for 62% of the total data cache miss latency cycles experienced by the applications, and that cache performance is adversely affected by certain inherent characteristics of the STM itself. These observations motivate us to propose a set of specific compiler transformations targeted at making STMs cache friendly. We find that the STM's fine-grained, application-unaware locking is a major contributor to its poor cache behavior. Hence we propose selective Lock Data co-location (LDC) and Redundant Lock Access Removal (RLAR) to address the lock access misses. We find that even transactions that are completely disjoint-access parallel suffer from costly coherence misses caused by the centralized global time stamp updates, and hence we propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations are effective in improving the cache behavior of STM applications, reducing the data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the eight STAMP applications.
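The coherence-miss argument behind SPTS can be illustrated with a toy model. The sketch below is not the paper's implementation. It assumes that each time stamp counter sits on its own cache line, and that a write by a different thread than the previous writer forces that line to move between caches. The function name `ownership_transfers` and the commit trace are hypothetical, chosen only for illustration.

```python
def ownership_transfers(commits, clock_of):
    """Count writes to a time stamp counter made by a different thread
    than that counter's previous writer (a rough proxy for coherence misses).

    commits:  iterable of (thread_id, partition_id) pairs, in commit order
    clock_of: function mapping a partition_id to the counter a commit updates
    """
    last_writer = {}  # counter -> thread that last wrote it
    transfers = 0
    for thread_id, partition_id in commits:
        clock = clock_of(partition_id)
        previous = last_writer.get(clock)
        if previous is not None and previous != thread_id:
            transfers += 1
        last_writer[clock] = thread_id
    return transfers


# Hypothetical trace: four threads, each committing only to its own
# partition (disjoint-access parallel), interleaved round-robin.
commits = [(t, t) for t in list(range(4)) * 25]

# One centralized counter shared by every commit.
global_transfers = ownership_transfers(commits, lambda partition: "global")

# One counter per partition, so each counter has a single writer.
per_partition_transfers = ownership_transfers(commits, lambda partition: partition)

print(global_transfers, per_partition_transfers)  # prints "99 0"
```

In this model the transactions never share data, yet the single global counter changes writer on every commit after the first. With one counter per partition, no counter ever changes writer. Real hardware behavior depends on the coherence protocol and memory layout, which this sketch does not capture.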
|Item Type:||Conference Proceedings|
|Additional Information:||Copyright of this article belongs to the IEEE|
|Department/Centre:||Division of Information Sciences > Supercomputer Education & Research Centre|
|Date Deposited:||14 Aug 2012 08:35|
|Last Modified:||14 Aug 2012 08:35|