# Note about Big Data

###### tags: `Final exam review topics`

> A short explanation of the topics to review

## :memo: Spark basics

### :rocket: What is Spark?

- **Concept**:
  * A compute engine with supporting libraries for parallel processing.
- **Characteristics**:
  * Speed: in-memory computation makes it faster than MapReduce, which works on disk; processing is parallel and distributed.
  * Real-time computation: designed for massive scalability, production deployments with 1000 nodes, in-memory computation (real time, low latency).
  * Polyglot: high-level APIs in several languages (Scala, Java, Python, R).
  * Multiple formats: JSON, text, RDBMS tables.
  * Hadoop integration: runs on top of an existing Hadoop cluster.
  * Lazy evaluation.
  * Machine learning.
- **Why use Spark**:
  * Parallel distributed processing
  * Scalability
  * Fault tolerance on commodity hardware
  * High-level API
  * In-memory computing :arrow_right: saves money and time

![](https://i.imgur.com/ZvJ94sl.png)

### :rocket: Spark Core?

- Provides the most basic functionality of Spark, such as:
  * Task scheduling
  * Memory management
  * Fault recovery
  * Interaction with storage systems
- Provides the API for defining RDDs (Resilient Distributed Datasets): collections of items distributed across the nodes of a cluster that can be processed in parallel.
- Can run on several cluster managers, such as Hadoop YARN and Apache Mesos, or on its own resource manager (the Standalone Scheduler).

### :rocket: Spark SQL?

Lets you query structured data with SQL statements. Spark SQL can work with many data sources, such as Hive tables, Parquet, and JSON.

### :rocket: Spark Streaming?

Provides an API for processing streaming data easily.

DStream (discretized stream): the fundamental stream unit, a series of RDDs.

### :rocket: Spark GraphX

Provides a graph-processing library.

### :rocket: Spark MLlib

- `spark.mllib`: the original API, built on RDDs; lower performance and less efficient.
- `spark.ml`: the newer API, which reworks the existing functionality on top of DataFrames, so it is compatible with other Spark components such as Spark SQL.

MLlib provides many machine learning algorithms, such as classification, regression, clustering, and collaborative filtering.

## :memo: RDD (Resilient Distributed Dataset)

### :rocket: Characteristics

- The fundamental building block of Spark.
- Immutable (cannot be changed), with lazy transformations: no intermediate values are computed until a final result is requested.
- ==Resilient==: fault tolerant, and can rebuild data on failure.
- ==Distributed==: data is distributed among multiple nodes in a cluster.
- ==Dataset==: a collection of partitioned data with values.

### :rocket: Each RDD has:

- A list of partitions (managed by multiple machines)
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a partitioner for key-value RDDs
- Optionally, a list of preferred locations on which to compute each split

### :rocket: Three key elements:

- Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
- A transformation applied to an RDD creates a new RDD.
- An action performs the computation and passes the result back to the driver.

https://viblo.asia/p/tong-quan-ve-apache-spark-cho-he-thong-big-data-RQqKLxR6K7z
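
The RDD properties listed above (partitions, a per-partition compute function, dependencies, lazy transformations, actions) can be illustrated with a toy model in plain Python. This is only a sketch for review purposes: `ToyRDD`, `parallelize`, and `lineage` here are hypothetical names written for this note, everything runs in a single process, and none of it is Spark's real API or implementation.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


class ToyRDD:
    """Single-process toy model of an RDD (hypothetical, not Spark's API)."""

    def __init__(self, name: str, num_partitions: int,
                 compute: Callable[[int], Iterator[Any]],
                 parent: Optional["ToyRDD"] = None) -> None:
        self.name = name
        self.num_partitions = num_partitions  # partitions, addressed by index
        self.compute = compute                # function that computes one partition
        self.parent = parent                  # dependency on another ToyRDD

    def map(self, f: Callable[[Any], Any]) -> "ToyRDD":
        # Transformation: returns a NEW ToyRDD; nothing is computed yet.
        return ToyRDD("map", self.num_partitions,
                      lambda i: (f(x) for x in self.compute(i)), self)

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        # Transformation: also lazy, also returns a new ToyRDD.
        return ToyRDD("filter", self.num_partitions,
                      lambda i: (x for x in self.compute(i) if pred(x)), self)

    def collect(self) -> List[Any]:
        # Action: evaluates every partition and hands the result to the caller
        # (the caller plays the role of the driver).
        out: List[Any] = []
        for i in range(self.num_partitions):
            out.extend(self.compute(i))
        return out

    def lineage(self) -> List[str]:
        # Walk the dependency chain back to the source dataset.
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


def parallelize(data: Iterable[Any], num_partitions: int) -> ToyRDD:
    # Split the data into contiguous chunks, one per partition.
    items = list(data)
    size = (len(items) + num_partitions - 1) // num_partitions
    chunks = [items[i * size:(i + 1) * size] for i in range(num_partitions)]
    return ToyRDD("parallelize", num_partitions, lambda i: iter(chunks[i]))


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)  # record each call so laziness can be observed
    return x * x


rdd = parallelize(range(1, 11), 3)
evens_squared = rdd.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # 0: the transformations have not run

result = evens_squared.collect()   # the action triggers the computation
print(calls_before_action, result, evens_squared.lineage())
```

Because each transformation only records its parent and a compute function, a lost partition could in principle be rebuilt by re-running that chain for one partition index, which is the idea behind the "resilient" part of the name.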