ADB FINAL PREP

--- tags: sem1 --- # ADB FINAL PREP ## Stuffs not in 2018 ### Multimedia Database ### Fagin / Fuzzy ### Big Data/Map Reduce ### ORDB ## 2018 ### Q1 Schema Type #### (10%) Given 4 different kinds of schemas below, please sort them based on the corresponding invention time and describe the usage of each schema. ![](https://i.imgur.com/LZDKTjL.png) Hierarchical model. An expandable solution employed by IBM to remove the need of full sequential scan over the whole database. Each node maintains pointers to children, and each node may only have at most one parent. ![](https://i.imgur.com/ZJIlWsB.png) Relational model. A solution proposed by Codd and is used in all RDBMS. Tables are related to each other by matching data fields. ![](https://i.imgur.com/404ROI7.png) ER model. It is used to model the requirements at a higher level, and formally denotes the interactions between entities. It can then be converted into relational model for actual implementation. ![](https://i.imgur.com/e9Jr3R0.png) Star schema is used to model multidimension data. It splits the data into fact table and dimension table. ### Q2 ER Model (15%) Recently, the domain of artificial intelligence (AI) is so popular that almost every student in Computer Science department would like to learn and apply the techniques to his/her research work. However, there is no shortcut to learn AI, it requires some preliminary knowledge. To help students overcome this problem, the department is going to develop an online-course guideline system coined as SuperCourse Map. The objective of this system is to provide the road map to learn some specific subject. For example, if we are going to learn deep learning, we may need to take some prerequisite courses, such as linear algebra and machine learning, where these courses could be offered by different universities. To understand the overview of the database design, Prof. Nehc asked her 5-year-old son to draw the ER model. However, there are some issues with this model. As the project assistant, you’re asked to improve a well-formed ER-model for this project based on Figure 1. (The Topics in Figure 1 represent the key concept for different courses, e.g. relational database is one of the key concepts in advanced database course.) ![](https://i.imgur.com/Z7dyYNE.png) #### a. (10%) Please revise the ER diagram and highlight the changes, you have to provide reasons for such revisions. ```graphviz digraph { student_id [label=<student_id>] student_name time_stamp is_passed course_id [label=<course_id>] course_name course_profs topics department_id [label=<department_id>] select_id [label=<select_id>] node[shape=square] students courses departments node[shape=Msquare] select node[shape=diamond] of by require department_id->departments school->departments student_id->students student_name->students select_id->select time_stamp->select is_passed->select course_id->courses course_name->courses course_profs->courses topics->courses courses->by by->departments students->of of->departments students->select select->courses courses->require require->courses } ``` #### b. (5%) What are the dis/advantages before and after your modifications. - Normalize the `departments` data - Better define the relationship of prerequisite courses ### Q3 SQL #### (20%) Alvin is a coupon guy. He loves to collect and compare coupons. Use SQL statements to answer Alvin’s questions. ![](https://i.imgur.com/sqKhGBN.png) ![](https://i.imgur.com/fNIrp12.png) ![](https://i.imgur.com/5D72sSw.png) #### a. (5%) Alvin wants to know the names of the companies whose total coupon discounts are larger than 1500. Write down the corresponding SQL statement. ```sql= SELECT s.Name FROM ( SELECT Name, SUM(Discount) as d FROM Coupon, Company WHERE Coupon.CompanyID = Company.CompanyID GROUP BY Name ) s WHERE s.d > 1500; ``` #### b. (8%) Alvin wants to know the ID of the store that is surrounded by the most stores within 500 meters. Write down the corresponding SQL statement. ```sql= SELECT s.StoreID FROM ( SELECT a.StoreID, count(1) FROM store a, store b WHERE ST_DWithin(a.locationgeom, b.locationgeom, 500) GROUP BY a.StoreID ) s ORDER BY count DESC LIMIT 1; ``` #### c. (7%) In the previous question, you must have used some spatial function to help Alvin. Spatial functions are a kind of user defined functions(UDF). What are the pros and cons of UDF? :::success PROS: turing complete, can compute anything; ecosystem to extend capability; reuse existing language instead of difficult PL/SQL etc. CONS: may contain bug that db cant handle; lacking table/tuple operators, need data conversion ::: ##### Pros A good implementation of UDF architecture may create a strong ecosystem of interoperable UDFs, greatly extending the functionality of the database beyond the capability of the core developer teams. It also allows in-house customization of the engine by users with different needs. Companies can capitalize on this and develop database features without writing their own database engine. Big money, create jobs, happy engineers. Minecraft is a classic example of third party extending the capabilities of the original product via mechanism analogous to UDF. ##### Cons 1 Converting data to format that ease development of UDF might cost extra computations; the lack of raw and full data access might forbid the use of efficient algorithms. Extra care is needed to maintain a good architecture and API for UDF developers. For safety reasons, database engine might have to impose sanity checks and data validation for results from UDFs to prevent total meltdown of the database engine, which might cause people losing their holidays to overwork during Christmas, that is soooo sad. ##### Cons 2 (or is this just ranting..?) UDF allows the integration of user-built functions which may be developed by team with competency, method and amount of resources different from the database engine developers[^udf-userbuilt], thus increasing the risk of logical defect (i.e. software bug) impacting the operation of the database. _Different wording:_ > Developers of UDF might be less experienced and lack the resource to properly validate the correctness and efficiency of the functions, increasing the risk of errors to occur during operation. [^udf-userbuilt]: Au contraire, throughout the history of software developments it has been frequently observed that developers from smaller companies or different culture are able to build solution that fits the current requirements better, due to their flexibility and velocity to adapt compared to megacorps. ### Q4 Query Strategy - Spatial, Multidim, NoSQL #### (10%) In database history, several types of databases have been designed for different uses. People also designed corresponding query strategies to handle these different uses. For example, in traditional relational databases, people want to efficiently look up precise data. Therefore, the query needs to be exact on selecting the correct data. What are the corresponding query strategies for spatial databases, multidimensional databases, and NoSQL databases? Explain the reasons behind these query strategies. - Spatial: Filter and refine - Multidimensional: Slice and Dice (column based); Drill down; Roll-up (aggregate) - Multimedia: Fuzzy; Human perception - NoSQL: Scan heavy ### Q5 OLAP #### (10%) There are three kinds of OLAP：MOLAP, ROLAP, HOLAP. What are the differences between them? What are the advantages and disadvantages of them? Multidimensional OLAP: Does not use relational DB, use special database engine; pros: optimized for MD queries; cons: expensive to build and maintain Relational OLAP: Use relational DB directly. pros: cheaper to build, efficient storage; cons: slower, limited by SQL Hybrid OLAP: Use a mix of them. ### Q6 Star Schema #### There are two types of tables in Star Schema. One needs normalization and one does not. Why? Please explain the reasons. :::info preview 8 ::: ### Q7 R-tree #### (10%) In Figure 2, given the initial R-tree structure on the right, consider the following insertion steps with ascending order as shown on the left. Please build an R-Tree with (m, M) = (2, 4) by inserting these objects one by one, following the ascending order. Please draw every step of the tree for each insertion. ##### Initial state ![](https://i.imgur.com/JdSCcDJ.png) ##### After insertion: ![](https://i.imgur.com/CXEiZuN.png) ![](https://i.imgur.com/H5Lqy92.png) ![](https://i.imgur.com/MgqNO38.png) ### Q8 XML Schema #### a. (10%) There are three XML files listed in the Appendix, which contains one xml data storage (.xml) file and two xml schemas (.xsd) files. However, TA intentionally made some mistakes among those files. In order to get scores from this question, please list all the mistakes and provide the correct solutions for them. Reminder: you might like to pay more attention to the words in bold. ##### books.xml ```xml= <?xml version="1.0"?> <library xmlns:ppl="http://dyomedea.com/ns/people" xmlns="http://dyomedea.com/ns/library" xsi:schemaLocation="library.xsd"> <book id="b0836217462"> <isbn> 0836217462 </isbn> <title> Being a Dog Is a Full-Time Job </title> <authors> <ppl:person id="CMS"> <ppl:name> Charles M Schulz </ppl:name> <ppl:born> 1922-11-26 </ppl:born> <ppl:dead> 2000-02-12 </ppl:dead> </ppl:person> </authors> <characters> <ppl:person id="Snoopy" species="animal"> <ppl:name> Snoopy </ppl:name> <ppl:born> 1950-10-04 </ppl:born> <ppl:qualification> extroverted beagle </ppl:qualification> </ppl:person> <ppl:person id="Schroeder" species="human"> <ppl:name> Schroeder </ppl:name> <ppl:born> 1951-05-30 </ppl:born> <ppl:qualification> brought classical music to the Peanuts strip </ppl:qualification> </ppl:person> </characters> </book> ``` - 少關一個 library ##### library.xsd ```xml= <?xml version="1.0"?> <xs:schema targetNamespace="http://dyomedea.com/ns/library" elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:ppl="http://dyomedea.com/ns/people" xmlns:lib="http://dyomedea.com/ns/library"> <xs:import namespace="http://dyomedea.com/ns/people" schemaLocation="people.xsd"/> <xs:element name="library"> <xs:complexType> <xs:sequence> <xs:element name="book" type="lib:bookType"/> </xs:sequence> </xs:complexType> </xs:element> <xs:complexType name="bookType"> <xs:sequence> <xs:element name="isbn" type="xs:NMTOKEN"/> <xs:element name="title" type="xs:string"/> <xs:element name="authors"> <xs:complexType> <xs:sequence> <xs:element ref="ppl:person"/> </xs:sequence> </xs:complexType> </xs:element> <xs:element name="characters"> <xs:complexType> <xs:sequence> <xs:element ref="ppl:person" maxOccurs="unbounded"/> <xs:element ref="person"> <xs:complexType> <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name="born" type="xs:date"/> <xs:element name="dead" type="xs:string" default="never"/> <xs:element name="qualification" type="xs:string" minOccurs="0"/> </xs:sequence> </xs:complexType> <xs:attribute name="id" type="xs:ID" use="required"/> <xs:attribute name="species" type="xs:string"> </xs:element> </xs:sequence> </xs:complexType> </xs:element> </xs:sequence> <xs:attribute name="id" type="xs:ID" use="required" default=820218/> <xs:attribute name="available" type="xs:string" use="required"/> </xs:complexType> </xs:schema> ``` - default 沒有 quote - attribute 要在 complexType 裡面 - 第二個 people 不是 ref 而是 name！ ##### people.xsd ```xml= <?xml version="1.0"?> <xs:schema targetNamespace="http://dyomedea.com/ns/people" elementFormDefault="qualified" attributeFormDefault="unqualified" mlns:ppl="http://dyomedea.com/ns/people" xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xs:element name="person"> <xs:complexType> <xs:sequence> <xs:element name="name" type="xs:string"> <xs:element name="born" type="xs:date"> <xs:element name="dead" type="xs:date" minOccurs="0"> <xs:element name="qualification" type="xs:string" minOccurs="0"> </xs:sequence> <xs:attribute name="id" type="xs:ID" use="required"> <xs:attribute name="species" type="xs:string" required="required" fixed="human"> </xs:complexType> </xs:element> </xs:schema> ``` - 都沒關！！！ #### b. (5%) Why do we need to introduce a new query language to query XML data? Illustrate with examples. ### Q9 NoSQL #### (10%) In a relational database, ACID summarizes the most important properties. However, such properties might not be prefered in recent years. The properties are no longer ACID in NoSQL databases. In NoSQL databases, people have different perspectives. What are these important properties of NoSQL databases? Explain them. ## Preview Quizes ### 1. Illustrate the differences between Relational Model and E-R model with examples. Why are they important? (8%) A relational model is a group of tables that are related to each other. An E-R model maps a logical entity with a number of attributes (e.g. Student) to a relational model (e.g. students, courses, contacts). The importance of it lies in appropriate mapping of E-R relationship which ensures smooth, correct and consistent operation of the system. (8%) Relation model 是用表格的形式組合資料。E-R model 則是在 Relation model 之前的階段，用 Entity(長方形)、Attribute(圓形)及Relationship (線)表達資料間的關係。 E-R Model 較為 high level，可用來規劃 Relational model ### 2. Explain the difference between DML and DDL, and when is each language used. List some examples of each. (7%) A DML (data modeling language) is used primarily to document the relationship and the structure of the data. DDL is a domain specific language to perform various operations on database. SQL is one example of DDL. :::danger Data Manipulation Language: To manipulate the data (insert, update, delete). Both DDL and DML perform operations on the database. SQL support both types, DDL and DML, not just DDL. Didn't give examples of each. ::: (9%)DML: Data Manipulation Language is used to query data we want e.g. SELECT * FROM Student s WHERE s.id=1; DDL: Data Definition Language is used to modify the structure of the relation(table) we want e.g CREATE table_name (ID integer, name varchar(50)); ### 3. Why “object-relational concepts” are important to be mentioned? :::success (10%) It addresses some of the issues of traditional RDBMS such as schema inflexibility, difficulty to access and manipulate group of data as a single logical entity (i.e. object). It also provides an ecosystem of plugins, data inheritance, etc for ease of development and maintenance. ::: :::success OR concepts- address schema inflexibility; access/manipulate group of data as logical entity; ecosystem of plugins; data inheritance; ease of development and maintenance ::: (6%) If we divided query and data complexity in a plane. No query + Simple data => File system Query + Simple data => ORDBMS No query + Complex data => OODBMS Query + Complex data => ORDBMS(object-relational database management system) With OR concepts, we can first have object properties(繼承、抽象、封裝) to deal with complex data. and second with relational properties, we can query data in database. ### 4. What will be the problems to use B-Tree for spatial data? Illustrate with examples. ::: success (10%) 1. B-tree 作為 index 會有 I/O 次數過多(速度慢)的問題，一個樹只有一個資料會導致樹高可能過高而在逐點存取時使速度變慢。 2. B-tree 穩定性較差，原因如上，有可能在樹高低的地方得到資料，但最差的情形可能會到 leave 才取得資料，導致穩定性低（時快時慢）。 3. B-tree 無法判斷資料間 closeness，而 Spatial data 不同於一般資料即是還有不同的 algebra(如 ROSE) 要執行，需要判別 closeness 的關係。 ::: -> Closeness + 2% sweet (8%) Spatial data has two (or even more) dimensions, thus using B-tree for spatial data requires either mapping the data to one dimension (which has its downside too) or sacrificing the ability to lookup with another dimension. (9%) Since B-tree provides one dimension indexing, it cannot fit the spatial database which is usually multi-dimension. For example, a spatial database include several regions that may overlap each other. Therefore, the nodes may duplicated. However, B-tree cannot handle this situation. :::success Unable to determine "closeness"; unable to lookup with multidimensional key ::: ### 5. Quadratic Split ![](https://i.imgur.com/GW1rY5D.png) :::success (10%) We are going to choose C and E because it is the largest space(MBR) in this graph. ::: :::success (10%) C and E. Since they are the farer two objects, which pair up to form the most space, it can be the leaders (roots) in two subtrees separately, which use the less space to form the R tree in the end. ::: :::danger (0%) A and D, as it splits the group in a more balanced manner, yielding a tree with lower height which will have faster access time. By slicing through the vertical plane, it has the possibility to build subtrees across the horizontal plane in a more balanced way. ::: ### 6. Why should XML documents be well-formed? What applications (scenarios) could employ XML documents? Name one and explain. :::success XML well form: compatibility and interoperability;application: share product list to other companies; ::: (8%) XML can be used to serialize collections of data/item for ease of exchange, e.g. define a list of products available on an e-store. It has to be well formed as the XML reader is not supposed to be able to resolve any ambiguity automatically when the document is malformed. An auto resolution will suppress the fact that the document is malformed and potentially propagate wrong data to the audience due to a glitch in the auto resolution. (8%) XML is mainly focus on transfer of data and unlike HTML,which has predefined tags and only focus on presentation of the data, XML has it own define tags, so XML need to be well-formed such as strictly for closed tag. For example, if you want to transfer a customer data, you can define `<customer>` tags by our own. ```xml <customer> <ID>081</ID> <name>Kevin</name> <phone>0912345678</phone> </customer> ``` ### 7. Why do we need to have another new query language XQuery forXML data? Pick two differences of XQuery as compared to SQL. :::success (10%) There are unique XML dom trees for XML data, so we use XQuery rather than XML to query the XML data. First, XQuery is used to find attributes and elements, while SQL is used to find indexes. Second, XQuery is used for XML data, while SQL is used for the data stored in the general database. ::: (9%) XQuery has an XPath notation to navigate within the XML nested structure, which is different from the flat table structure in SQL. It does not have the SELECT, FROM, WHERE format and instead uses another format that can achieve the same thing. ### 8. There are two types of tables in star schema. One should not be normalized. Why one should, and one should not? :::success (10%) Dimensional table should not be normalized and fact table should. Dimensional table can be used to scan and analyze the database more efficiently. Since the number of records in dimensional table is small and it's rarely be modified, it's okay to be unormalized. ::: (6%) There are fact table and dimension table in star schema. The dimension table should not be normalized to ensure the join depth is 1, i.e. each dimension has only one table. (8%) There are two types of tables in star schema: One is the fact table, and the other one is the dimensional table. The fact table should be normalized; however, the dimensional table should not. The reason dimensional tables are not normalized is that the number of keys in dimensional tables is much smaller than in the fact table. Also, the dimensional table is less needed to be changed. These two are the reasons dimensional tables are not normalized. ### 9. Multimedia data have quite many characteristics might affect the designs of the databases. Pick two characteristics and describe how they affect the designs of DB. :::success (10%) Human perception of the data is more important compared to the traditional data, therefore the DB has to support fuzzy query that matches data based on high level similarity. It is also complex and high dimensional, which make the indexing problem more challenging. The DB needs to understand many multimedia formats to process them appropriately. ::: (9%) First, there are multiple different features of multimedia data. In order to query their similar (or semi-structured) features, we design multimedia databases with fuzzy logic and emphasize semantic without only focusing on the features themselves. Second, in the query we need more human-centered instruction, so we use the Fagin algorithm to query the data we needed. ### 10. NoSQL DBs only support BASE. What is BASE? Why not ACID? Please explain. :::success (10%) The BASE model stands for three properties: 1. Basically Available: High availability and flexibility are critical to NoSQL database 2. Soft State: Database does not need to obligate that the dataset is consistent itself, instead obligating the responsibilities to the programmers. 3. Eventually Consistent: Do not need to be strictly consistent, but need to be consistent eventually. Since big data need parallel computing, high availability is the main concern. It means the database dealing with big data can not follow ACID properties, especially in C(Consistent) property. ::: (8%) BASE is an abbrevation coined to be the opposite of ACID, which lifts the hard requirements of a database such as consistency, and atomicity. It is not supported in favor of distributed computation and to lower the cost of synchronisation. For example, in BASE, only eventual consistency is required.