---
tags: cosc430 2020
---

# COSC430 discussion: Trinity

## Some take-away points about the graph database lecture

- It's worth thinking about what forms of query and data justify a specific graph database engine, versus data management that might be handled easily by a typical relational database management system.

# Discussion of the Trinity paper

## Breakout room 1

### Key problem under investigation?

Significant problems arise with the programming model and the infrastructure for processing large-scale graph databases. The issues generally fall into two categories:

1. Online query processing, which requires low latency and a high degree of random data access, and
2. offline analytics, which requires high throughput.

Random access to the data is difficult in large-scale graph processing.

### Key idea of the proposed solution?

Target both online query processing applications and offline graph analytics applications. Build the graph engine on top of a distributed memory storage infrastructure called the memory cloud, and allow users to define graph schemas, communication protocols and computation paradigms through TSL (the Trinity Specification Language).

### How does it solve the problem?

Trinity is designed as a general-purpose graph engine that addresses the above problems in several ways:

1. A distributed memory cloud with optimised memory management and network communication supports fast graph exploration and efficient parallel computing, i.e. it directly addresses the random data access problem.
2. A high-level specification language, TSL.
3. It does not come with comprehensive built-in computation modules; instead it focuses on flexible data and computation modelling capabilities, allowing more complex modules to be developed for the specific needs of each user.

### Evaluation?

1. People search query on a social graph: they performed people search queries in a social network to measure the performance of data-intensive, traversal-based online graph queries.
This experiment measures the query response time of searching for friends by name within 2 and 3 hops on synthetic social graphs.
2. Breadth-first search: BFS is a fundamental graph computation operation, and many graph algorithms are built on it. The Graph 500 benchmark adopts BFS as one of its two computation kernels.
3. Trinity vs. PBGL: PBGL is a generic C++ library for high-performance parallel and distributed graph computation. They ran BFS on RMAT graphs in a 16-machine cluster using PBGL and Trinity to compare execution time and memory usage. The node count varies from 1 million to 256 million, with average degrees of 4, 8, 16 and 32.

### Drawbacks?

1. No comprehensive built-in computation modules.
2. Does not follow ACID.
3. Performance is limited by network communication.
4. The addressing table cannot be implemented centrally due to bottleneck issues, but keeping a duplicate on each slave risks inconsistency.

## Breakout room 2

### Key problem under investigation?

Disk-based systems cannot provide the level of efficient random access required by graph computation, while memory-based approaches usually do not scale due to the capacity limit of single machines.

### Key idea of the proposed solution?

Trinity leverages graph access patterns in both online and offline computation to optimise memory and communication for best performance.

### How does it solve the problem?

Trinity supports efficient online query processing and offline analytics on large graphs with just a few commodity machines:

- Data is stored as key-value pairs in the memory cloud.
- Circular memory management and fast memory allocation.
- Trinity provides a high-level specification language called TSL for users to declare data schemas and communication protocols, which brings great ease of use for general-purpose graph management and computing.

### Evaluation?

- People search query on a social graph
- PageRank calculation on a web graph
- Breadth-first search
- Parallel speedup

### Drawbacks?
- Fault tolerance issues: fault recovery and shared addressing table maintenance
- No support for ACID properties (which might not be very relevant here)

## Breakout room 3

### Key problem under investigation?

Processing of large graphs:

1. Online query processing with low latency
2. Offline graph analytics with high throughput

These tasks are I/O intensive, with a high degree of random data access.

### Key idea of the proposed solution?

- Distributed parallel computation
- Keeping the graph (at least its topology) in memory

This gives high performance and works well for random access.

### How does it solve the problem?

- A distributed key-value store is used on top of the memory cloud.
- Data is divided into trunks, and each host can hold multiple trunks (for better parallelism).
- For global addressing, each host keeps a replica of the addressing table.
- Each key is a 64-bit globally unique id; each value is a blob (to keep the data size small).
- A high-level specification language called TSL.

### Evaluation?

Trinity performs well even on billion-node graphs. For example:

1. People search queries on a Facebook-like social graph are answered within 100 ms.
2. One iteration of PageRank on a billion-node graph completes in 1 minute on 8 machines.
3. Breadth-first search: for a billion-node graph, it takes about 1000 seconds on 8 machines.
4. Trinity vs. PBGL: Trinity runs 10x faster with a 10x smaller memory footprint.
5. Trinity vs. Giraph: Trinity runs faster by two orders of magnitude.
6. Parallel speedup.
7. More discussion of scalability: Trinity takes less memory and runs much faster than PBGL and Giraph because it is an in-memory system, with data loaded into memory before computation; this lets Trinity avoid the cache-miss penalties PBGL and Giraph incur.

### Drawbacks?
1. Shared addressing table maintenance: a centralised implementation is infeasible because of the performance bottleneck and the risk of a single point of failure.
2. Fault recovery.
3. As more machines are added, performance becomes limited by network communication.
4. No ACID transaction support, i.e., no serializability for concurrent threads.
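The global addressing scheme summarised by breakout room 3 (64-bit cell ids hashed into trunks, with a replicated trunk-to-machine addressing table) can be sketched in a few lines. This is a single-process illustration only; the trunk count, hash scheme, and all names (`MemoryCloud`, `machine_of`, etc.) are assumptions for the sketch, not Trinity's actual API.

```python
TRUNK_COUNT = 2 ** 8  # total memory trunks across the cluster (illustrative choice)

class MemoryCloud:
    def __init__(self, machines):
        # Addressing table: trunk id -> machine id. In Trinity every host keeps
        # a replica of this table instead of asking a central server, which is
        # exactly why its maintenance shows up under "drawbacks" above.
        self.addressing_table = {
            trunk: machines[trunk % len(machines)] for trunk in range(TRUNK_COUNT)
        }
        # Each trunk is a local key-value store; a host owns several trunks.
        self.trunks = {trunk: {} for trunk in range(TRUNK_COUNT)}

    def _trunk_of(self, cell_id: int) -> int:
        # Keys are 64-bit globally unique ids; hashing spreads them over trunks.
        return (cell_id & 0xFFFFFFFFFFFFFFFF) % TRUNK_COUNT

    def put(self, cell_id: int, blob: bytes) -> None:
        self.trunks[self._trunk_of(cell_id)][cell_id] = blob

    def get(self, cell_id: int) -> bytes:
        return self.trunks[self._trunk_of(cell_id)][cell_id]

    def machine_of(self, cell_id: int) -> int:
        # Any host can locate a cell's owner with two purely local lookups:
        # hash -> trunk, then addressing table -> machine. No network round trip.
        return self.addressing_table[self._trunk_of(cell_id)]

cloud = MemoryCloud(machines=list(range(8)))   # an 8-machine cluster
cloud.put(0xDEADBEEF, b"vertex payload")
assert cloud.get(0xDEADBEEF) == b"vertex payload"
```

The two-step lookup is what makes random access cheap: locating any cell never requires contacting a coordinator, at the cost of keeping every replica of the addressing table consistent.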
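BFS appears throughout the evaluations above (it is one of the two Graph 500 kernels, and the k-hop friend search is essentially a bounded BFS). A minimal level-synchronous BFS over an in-memory adjacency list, illustrating the access pattern rather than Trinity's distributed implementation:

```python
from collections import deque

def bfs_levels(adj, source):
    """Level-synchronous BFS: returns {vertex: hop distance from source}.

    `adj` maps each vertex to a list of its neighbours. The frontier-at-a-time
    structure is what engines like Trinity parallelise: every vertex in the
    current frontier can expand its neighbours independently, but each
    expansion is a random access into the graph - hence the memory cloud.
    """
    level = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adj.get(v, []):
            if w not in level:          # first visit fixes the hop count
                level[w] = level[v] + 1
                frontier.append(w)
    return level

# Tiny example graph: edges 0-1, 0-2, 1-3
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
assert bfs_levels(adj, 0) == {0: 0, 1: 1, 2: 1, 3: 2}
```

A 2- or 3-hop people search is this loop cut off once `level[v]` exceeds the hop bound, with a name filter applied to each visited vertex.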