# Normalization The process of reorganizing a set of given relational schemas, to ensure the final design : - minimize redundancy, as well as storage usage - improve [data integrity](https://hackmd.io/DqKX0OsATO6zaKY-jxlu_A). - data dependencies make sense ### Terminology |term|description| |----|-----------| |Atomic|All attributes of an entity set do NOT have substructure (e.g. represented as another entity type), `atomic` here means indivisible unit.| |Normal forms (NF)|- a series of guidelines which ensure the relational schemas reach certain degree of normalization.<br>- higher-level NF has more restrictive criteria than lower-level NF. (e.g. the schema has to satisfy 2NF before testing with 3NF)<br>- Level of normalization required depends on specific needs of the application.<br>- Database design in practice pays particular attention to 3NF, BCNF, or 4NF.| |[Functional Dependency (FD)](https://en.wikipedia.org/wiki/Functional_dependency)| a constraint between 2 sets of attributes (say `X` and `Y`), one set `X` determines the value of another set `Y`, and `X` is a subset of `Y`.<br><br>e.g. in the relation `employee`<br> - `ssn → Employee-name`, the social security number `ssn` uniquely determines `Employee-name`.<br> - `Proj-id → {Proj-name , Proj-location}`, project ID determines the value of the project name<br> - `{Ssn , Proj-id} → Hours`<br>| |Decomposition|split a relation into several smaller relations, in order to achieve specific normal form.| |Multivalued Dependency|whenever an attribute `A` has [one-to-many relationship](https://hackmd.io/LCzkGjMuStGSgzI3_gfq2A?view#Terminology) with 2 other different attributes `B`, `C` in a single relation.| ||| ### Unnormalized Relation #### Redundancy can cuase update [anomaly](https://www.oxfordlearnersdictionaries.com/definition/english/anomaly) Consider the relation `employee-department` below, where `Dname` and `Dmgr_ssn` is redundant among the tuples: |`Ename`|`Ssn`|`Address`|`Dnumber`|`Dname`|`Dmgr_ssn`| |-----|---|-------|-------|-----|--------| |Smith John|123456789|731 Fondren, Houston, TX|5|Research|333445555| |Wong Franklin|333445555|638 Voss, Miami, FL|5|Research|333445555| |Zelaya Alicia|999887777|3321 Castle, Salt Lake, UT|4|Administration|987654321| |Wallace Jennifer|987654321|291 Raspberry, Los Angels CA|4|Administration|987654321| - When inserting a tuple for new employee in department `5`, the application has to ensure the corresponding `Dname` and `Dmgr_ssn` are correct ; such check is NOT necessary if `employee` and `departmnet` are in seperate relations. - When deleting the last employee working for a particular department, information of that department will be immediately lost, that might NOT be allowed in application requirement. - When updating any attribute (e.g. `Dname`) of a department, you have to update all tuples of employees that have the same `Dname` value #### Null in Tuples It should be avoid as much as possible (to save storage space), or sufficient logic for expection handling if unavoidable. ### Normalization Workflow - Analyze the given relational schemas with their **functional dependencies** and **primary keys**. - decompose the schema into smaller one if it doesn't satisfy the condition(s) for a normal form. ## First Normal Form (1NF) All attributes are atomic within an entity ; no single attribute can represent composite value (e.g. another entity). ### Example ![](https://hackmd.io/_uploads/HkcSvwqHn.png) Consider the example above, the research department entity has `Dlocation` attribute which contains a list of locations (which is NOT atomic). Techniques to achieve 1NF: - for each department, if maximum number of locations can be known in advance, create different attributes (`Dlocation1`, `Dlocation2` ... etc.) to store them. - not a good idea, it wastes storage space - decompose the relation `DEPARTMENT`, move `Dlocation`, and duplicate `Dnumber` to new relation `DEPT_LOCATION`, so there are two relations which satisfy 1NF - duplicate the same tuple according to different locations (see diagram below) - this causes redundant `Dnumber` in each new tuple, and violates the property which can be uniquely identified. - you can further move `Dlocation` from `DEPARTMENT` to another new relation. ![](https://hackmd.io/_uploads/SJgwuDqHn.png) ## Second Normal Form (2NF) Remove redundant information by NOT allowing any partial functional dependency. ### Full or Partial Functional Dependency Assume the set `X` functionally determines another set `Y` (denoted as `X -> Y`), now an attribute `A` is removed from `X` - if `X - {A}` still functionally determines `Y` --> partial FD - otherwise, full FD Given a relation, let `X` be **all attributes** of **a Super Key**, `Y` be **any non-prime attribute** of the same relation. - iteratively test whether each pair of `X` and `Y` is a partial FD, if yes, reorganize the relation. - if only one attribute is in the key, it should be full FD, and safe to pass the test - Remind that number of attributes in a candidate key has to be **minimal**. ### Example |`ssn`|`P-num`|`hours`|`E-name`|`P-credit`|`P-location`| |-|-|-|-|-|-| The relation `EMPLOYEE-PROJ` above may contain tuples whose `P-num` are the same but `ssn` values indicate to different people. The relation also has one key `{ssn, P-num}` and following 4 FDs. |label|`X`|`Y`|partial ?|unnecessary attributes| |----|----|----|----|----| |`FD1`|`{ssn, P-num}`|`{hours}`|No|`N/A`| |`FD2`|`{ssn, P-num}`|`{E-name}`|Yes|`{P-num}`| |`FD3`|`{ssn, P-num}`|`{P-credit}`|Yes|`{ssn}`| |`FD4`|`{ssn, P-num}`|`{P-location}`|Yes|`{ssn}`| For `FD3` and `FD4`, move the attributes`P-credit` and `P-location` to a new relation `PROJECT` with `P-num` as the only candidate key. |`P-num`|`P-credit`|`P-location`| |-|-|-| Similarly, for `FD2`, move the attribute`E-name` to another new relation `EMPLOYEE` with `ssn` as the only candidate key. |`ssn`|`E-name`| |-|-| Original relation `EMPLOYEE-PROJ` will look like : |`ssn`|`P-num`|`hours`| |-|-|-| Now all 3 relations satisfy 2NF. ## Third Normal Form (3NF) The following 3 pieces of description express the similar concept : - A Relation should NOT have a non-prime attribute functionally dependent on another non-prime attribute. - Any functional dependency has to either (1) begin with a Super Key, or (2) end with prime attribute. - after 2NF, Super key should be reduced to candidate key ? - **Transitive functional dependency** from a primary key to any non-prime attribute is NOT allowed. - [original definition from Ted Codd, 1971](https://en.wikipedia.org/wiki/Third_normal_form) ### Transitive Functional Dependency Assume a set `X` functionally determines another set `Y` (denoted as `X -> Y`), `X` and `Y` has [transitive](https://en.wikipedia.org/wiki/Transitive_relation) FD if there exists the 3rd set of attributes `Z` : - which doesn't [overlap](https://math.stackexchange.com/questions/2340403) both `X` and `Y` - both `X -> Z` and `Z -> Y` hold Given a relation, let `X` be **all attributes** of **primary key**, `Y` and `Z` be **any 2 different non-prime attributes**, if `Z` also determines the vlaue of `Y`, then transitive FD is found between `X` and `Y`. ### Example |`ssn`|`address`|`E-name`|`Dept-num`|`Dept-name`|`Dept-mgr-ssn`| |-|-|-|-|-|-| Assume `ssn` is the only attribute in the primary key, the relation `EMP-DEPT` above is in 2NF but not in 3NF because of the transitive FD listed below. |primary key|via|non-prime attribute| |----|----|----| |`{ssn}`|`{Dept-num}`|`{Dept-name, Dept-mgr-ssn}`| The last 3 non-prime attributes stores department information. To remove the transitive FD, move `Dept-name` and `Dept-mgr-ssn` to a new relation `DEPARTMENT`. |`Dept-num`|`Dept-name`|`Dept-mgr-ssn`| |-|-|-| `EMP-DEPT` will look like : |`ssn`|`address`|`E-name`|`Dept-num`| |-|-|-|-| ## Boyce-Codd Normal Form (BCNF) ### Improvement from 3NF The downside of 3NF is that it doesn't test FD from a non-prime attribute `X` to a prime attribute `Y` in a given relation. BCNF addresses that by stricter definition : - Any functional dependency has to begin with a Super Key. > In other words, all non-prime attributes have to be (directly) dependent ONLY on a super key. ### Example Consider a relation `TEACH` in a school, where `student` and `course` form a primary key : |`student`|`course`|`instructor`| |-|-|-| |Narayan|Microcontroller Lab|Mark| |Smith|Microcontroller Lab|Navathe| |Smith|Operating Systems|Ammar| |Smith|Computer Graphics|Schulman| |Wallace|Microcontroller Lab|Mark| |Wallace|Operating Systems|Ahamad| |Wong|Microcontroller Lab|Omiecinski| |Zelaya|Microcontroller Lab|Navathe| |Narayan|Operating Systems|Ammar| |||| One `student` who takes specific `course` determines the value of `instructor` attribute, and one `instructor` who teaches specific `course` determines the value of `course` attribute. |label|`X`|`Y`|3NF valid|BCNF valid| |----|----|----|----|----| |`FD1`|`{student, course}`|`{instructor}`|Yes|Yes| |`FD2`|`{instructor}`|`{course}`|Yes|No| In such a situation, the relation can be decomposed to : - 3 relations for each attribute `student`, `course`, `instructor`. - one additional relation for keeping information like which students takes which courses. Due to the many-to-many relationship on `course` attribute : - one `student` may take more than one `course`. - one `instructor` may teach more than one `course`. ## Fourth Normal Form (4NF) **Multivalued dependency** which begins with super key is NOT allowed. ### examples Consider the relation `EMPLOYEE` below, where all 3 attributes are in primary key. The relation satisfies 1NF, 2NF, 3NF, and BCNF. |`E-name`|`Proj-num`|`Dependent-name`| |---|---|---| |Alyssa|533|Amor| |Alyssa|533|Daniel| |Alyssa|158|Amor| |Alyssa|158|Daniel| The relation does not satisfy 4NF, due to the following one-to-many relationships : - `E-name` to `Proj-num` - `E-name` to `Dependent-name` Decomposition can be done by separating projects and dependents to following 2 relations. In each of the relations, the primary key includes both of the attributes. |`E-name`|`Proj-num`| |---|---| |Alyssa|533| |Alyssa|158| |`E-name`|`Dependent-name`| |---|---| |Alyssa|Amor| |Alyssa|Daniel| Now the relations are in 4NF. --- ## [Denormalization](https://en.wikipedia.org/wiki/Denormalization) One step back to previously-normalized relations, it is done mostly for performance improvement, some redundancy is expected under application's control. ## Reference - [Database normalization - Wikipedia](https://en.wikipedia.org/wiki/Database_normalization#Normal_forms) - [Normalize your database](https://completedeveloperpodcast.com/episode-135/) - [Difference between 3NF and BCNF in simple terms](https://stackoverflow.com/q/8437957/9853105) ###### tags: `database`