# Normalization
The process of reorganizing a set of given relational schemas, to ensure the final design :
- minimize redundancy, as well as storage usage
- improve [data integrity](https://hackmd.io/DqKX0OsATO6zaKY-jxlu_A).
- data dependencies make sense
### Terminology
|term|description|
|----|-----------|
|Atomic|All attributes of an entity set do NOT have substructure (e.g. represented as another entity type), `atomic` here means indivisible unit.|
|Normal forms (NF)|- a series of guidelines which ensure the relational schemas reach certain degree of normalization.<br>- higher-level NF has more restrictive criteria than lower-level NF. (e.g. the schema has to satisfy 2NF before testing with 3NF)<br>- Level of normalization required depends on specific needs of the application.<br>- Database design in practice pays particular attention to 3NF, BCNF, or 4NF.|
|[Functional Dependency (FD)](https://en.wikipedia.org/wiki/Functional_dependency)| a constraint between 2 sets of attributes (say `X` and `Y`), one set `X` determines the value of another set `Y`, and `X` is a subset of `Y`.<br><br>e.g. in the relation `employee`<br> - `ssn → Employee-name`, the social security number `ssn` uniquely determines `Employee-name`.<br> - `Proj-id → {Proj-name , Proj-location}`, project ID determines the value of the project name<br> - `{Ssn , Proj-id} → Hours`<br>|
|Decomposition|split a relation into several smaller relations, in order to achieve specific normal form.|
|Multivalued Dependency|whenever an attribute `A` has [one-to-many relationship](https://hackmd.io/LCzkGjMuStGSgzI3_gfq2A?view#Terminology) with 2 other different attributes `B`, `C` in a single relation.|
|||
### Unnormalized Relation
#### Redundancy can cuase update [anomaly](https://www.oxfordlearnersdictionaries.com/definition/english/anomaly)
Consider the relation `employee-department` below, where `Dname` and `Dmgr_ssn` is redundant among the tuples:
|`Ename`|`Ssn`|`Address`|`Dnumber`|`Dname`|`Dmgr_ssn`|
|-----|---|-------|-------|-----|--------|
|Smith John|123456789|731 Fondren, Houston, TX|5|Research|333445555|
|Wong Franklin|333445555|638 Voss, Miami, FL|5|Research|333445555|
|Zelaya Alicia|999887777|3321 Castle, Salt Lake, UT|4|Administration|987654321|
|Wallace Jennifer|987654321|291 Raspberry, Los Angels CA|4|Administration|987654321|
- When inserting a tuple for new employee in department `5`, the application has to ensure the corresponding `Dname` and `Dmgr_ssn` are correct ; such check is NOT necessary if `employee` and `departmnet` are in seperate relations.
- When deleting the last employee working for a particular department, information of that department will be immediately lost, that might NOT be allowed in application requirement.
- When updating any attribute (e.g. `Dname`) of a department, you have to update all tuples of employees that have the same `Dname` value
#### Null in Tuples
It should be avoid as much as possible (to save storage space), or sufficient logic for expection handling if unavoidable.
### Normalization Workflow
- Analyze the given relational schemas with their **functional dependencies** and **primary keys**.
- decompose the schema into smaller one if it doesn't satisfy the condition(s) for a normal form.
## First Normal Form (1NF)
All attributes are atomic within an entity ; no single attribute can represent composite value (e.g. another entity).
### Example

Consider the example above, the research department entity has `Dlocation` attribute which contains a list of locations (which is NOT atomic).
Techniques to achieve 1NF:
- for each department, if maximum number of locations can be known in advance, create different attributes (`Dlocation1`, `Dlocation2` ... etc.) to store them.
- not a good idea, it wastes storage space
- decompose the relation `DEPARTMENT`, move `Dlocation`, and duplicate `Dnumber` to new relation `DEPT_LOCATION`, so there are two relations which satisfy 1NF
- duplicate the same tuple according to different locations (see diagram below)
- this causes redundant `Dnumber` in each new tuple, and violates the property which can be uniquely identified.
- you can further move `Dlocation` from `DEPARTMENT` to another new relation.

## Second Normal Form (2NF)
Remove redundant information by NOT allowing any partial functional dependency.
### Full or Partial Functional Dependency
Assume the set `X` functionally determines another set `Y` (denoted as `X -> Y`), now an attribute `A` is removed from `X`
- if `X - {A}` still functionally determines `Y` --> partial FD
- otherwise, full FD
Given a relation, let `X` be **all attributes** of **a Super Key**, `Y` be **any non-prime attribute** of the same relation.
- iteratively test whether each pair of `X` and `Y` is a partial FD, if yes, reorganize the relation.
- if only one attribute is in the key, it should be full FD, and safe to pass the test
- Remind that number of attributes in a candidate key has to be **minimal**.
### Example
|`ssn`|`P-num`|`hours`|`E-name`|`P-credit`|`P-location`|
|-|-|-|-|-|-|
The relation `EMPLOYEE-PROJ` above may contain tuples whose `P-num` are the same but `ssn` values indicate to different people. The relation also has one key `{ssn, P-num}` and following 4 FDs.
|label|`X`|`Y`|partial ?|unnecessary attributes|
|----|----|----|----|----|
|`FD1`|`{ssn, P-num}`|`{hours}`|No|`N/A`|
|`FD2`|`{ssn, P-num}`|`{E-name}`|Yes|`{P-num}`|
|`FD3`|`{ssn, P-num}`|`{P-credit}`|Yes|`{ssn}`|
|`FD4`|`{ssn, P-num}`|`{P-location}`|Yes|`{ssn}`|
For `FD3` and `FD4`, move the attributes`P-credit` and `P-location` to a new relation `PROJECT` with `P-num` as the only candidate key.
|`P-num`|`P-credit`|`P-location`|
|-|-|-|
Similarly, for `FD2`, move the attribute`E-name` to another new relation `EMPLOYEE` with `ssn` as the only candidate key.
|`ssn`|`E-name`|
|-|-|
Original relation `EMPLOYEE-PROJ` will look like :
|`ssn`|`P-num`|`hours`|
|-|-|-|
Now all 3 relations satisfy 2NF.
## Third Normal Form (3NF)
The following 3 pieces of description express the similar concept :
- A Relation should NOT have a non-prime attribute functionally dependent on another non-prime attribute.
- Any functional dependency has to either (1) begin with a Super Key, or (2) end with prime attribute.
- after 2NF, Super key should be reduced to candidate key ?
- **Transitive functional dependency** from a primary key to any non-prime attribute is NOT allowed.
- [original definition from Ted Codd, 1971](https://en.wikipedia.org/wiki/Third_normal_form)
### Transitive Functional Dependency
Assume a set `X` functionally determines another set `Y` (denoted as `X -> Y`), `X` and `Y` has [transitive](https://en.wikipedia.org/wiki/Transitive_relation) FD if there exists the 3rd set of attributes `Z` :
- which doesn't [overlap](https://math.stackexchange.com/questions/2340403) both `X` and `Y`
- both `X -> Z` and `Z -> Y` hold
Given a relation, let `X` be **all attributes** of **primary key**, `Y` and `Z` be **any 2 different non-prime attributes**, if `Z` also determines the vlaue of `Y`, then transitive FD is found between `X` and `Y`.
### Example
|`ssn`|`address`|`E-name`|`Dept-num`|`Dept-name`|`Dept-mgr-ssn`|
|-|-|-|-|-|-|
Assume `ssn` is the only attribute in the primary key, the relation `EMP-DEPT` above is in 2NF but not in 3NF because of the transitive FD listed below.
|primary key|via|non-prime attribute|
|----|----|----|
|`{ssn}`|`{Dept-num}`|`{Dept-name, Dept-mgr-ssn}`|
The last 3 non-prime attributes stores department information. To remove the transitive FD, move `Dept-name` and `Dept-mgr-ssn` to a new relation `DEPARTMENT`.
|`Dept-num`|`Dept-name`|`Dept-mgr-ssn`|
|-|-|-|
`EMP-DEPT` will look like :
|`ssn`|`address`|`E-name`|`Dept-num`|
|-|-|-|-|
## Boyce-Codd Normal Form (BCNF)
### Improvement from 3NF
The downside of 3NF is that it doesn't test FD from a non-prime attribute `X` to a prime attribute `Y` in a given relation. BCNF addresses that by stricter definition :
- Any functional dependency has to begin with a Super Key.
> In other words, all non-prime attributes have to be (directly) dependent ONLY on a super key.
### Example
Consider a relation `TEACH` in a school, where `student` and `course` form a primary key :
|`student`|`course`|`instructor`|
|-|-|-|
|Narayan|Microcontroller Lab|Mark|
|Smith|Microcontroller Lab|Navathe|
|Smith|Operating Systems|Ammar|
|Smith|Computer Graphics|Schulman|
|Wallace|Microcontroller Lab|Mark|
|Wallace|Operating Systems|Ahamad|
|Wong|Microcontroller Lab|Omiecinski|
|Zelaya|Microcontroller Lab|Navathe|
|Narayan|Operating Systems|Ammar|
||||
One `student` who takes specific `course` determines the value of `instructor` attribute, and one `instructor` who teaches specific `course` determines the value of `course` attribute.
|label|`X`|`Y`|3NF valid|BCNF valid|
|----|----|----|----|----|
|`FD1`|`{student, course}`|`{instructor}`|Yes|Yes|
|`FD2`|`{instructor}`|`{course}`|Yes|No|
In such a situation, the relation can be decomposed to :
- 3 relations for each attribute `student`, `course`, `instructor`.
- one additional relation for keeping information like which students takes which courses.
Due to the many-to-many relationship on `course` attribute :
- one `student` may take more than one `course`.
- one `instructor` may teach more than one `course`.
## Fourth Normal Form (4NF)
**Multivalued dependency** which begins with super key is NOT allowed.
### examples
Consider the relation `EMPLOYEE` below, where all 3 attributes are in primary key. The relation satisfies 1NF, 2NF, 3NF, and BCNF.
|`E-name`|`Proj-num`|`Dependent-name`|
|---|---|---|
|Alyssa|533|Amor|
|Alyssa|533|Daniel|
|Alyssa|158|Amor|
|Alyssa|158|Daniel|
The relation does not satisfy 4NF, due to the following one-to-many relationships :
- `E-name` to `Proj-num`
- `E-name` to `Dependent-name`
Decomposition can be done by separating projects and dependents to following 2 relations.
In each of the relations, the primary key includes both of the attributes.
|`E-name`|`Proj-num`|
|---|---|
|Alyssa|533|
|Alyssa|158|
|`E-name`|`Dependent-name`|
|---|---|
|Alyssa|Amor|
|Alyssa|Daniel|
Now the relations are in 4NF.
---
## [Denormalization](https://en.wikipedia.org/wiki/Denormalization)
One step back to previously-normalized relations, it is done mostly for performance improvement, some redundancy is expected under application's control.
## Reference
- [Database normalization - Wikipedia](https://en.wikipedia.org/wiki/Database_normalization#Normal_forms)
- [Normalize your database](https://completedeveloperpodcast.com/episode-135/)
- [Difference between 3NF and BCNF in simple terms](https://stackoverflow.com/q/8437957/9853105)
###### tags: `database`