# Abstract Data Types vs Data Structures

An abstract data type (ADT) is a mathematical abstraction of a collection of objects and the operations done on these objects. An ADT provides a framework describing how we arrange the objects and how the objects are related to each other. With knowledge of ADTs, we can often introduce optimizations that would otherwise be very difficult to find.

On the other hand, a data structure is a specific implementation of an ADT. Often, multiple data structures exist for the same ADT. To compare implementations, we usually look at their performance when doing operations. Each implementation has its own merits over the others; depending on what kind of optimizations you are looking for, some data structures may serve you better than others.

One good real-life analogy would be what we call *computers* nowadays. The *computer* would be our 'ADT', and we can define it as a collection of electronics that stores information, gets input from its users, processes data, and produces some form of output for the user. Our 'data structures' would be desktop computers, laptops, cellphones, smart appliances, etc. Each of our 'data structures' is functionally the same but has different performance levels and specialized functions. Some are built to provide the maximum processing power needed, some are built so small you can carry them in one hand, and some are built with the bare minimum parts to be as cheap as possible.

# The Dynamic List ADT

One of the abstract data types we should have already encountered is the **dynamic list**. A *dynamic list* is a group of elements arranged one after another, forming a line. Each element in the list has exactly one element before it and exactly one element after it, except for the first element, which has no element before it, and the last element, which has no element after it.

Common operations on *dynamic lists* include:

1. *Access* - when we want to get the value of a particular item in the list.
2. *Insertion* - when we want to insert another item into the list.
3. *Deletion* - when we want to remove a particular item from the list.

## Dynamic Array Data Structure

The standard Python list is actually an implementation of a *dynamic list*. More specifically, it implements a **Dynamic Array** data structure. An *array*, in a general sense, is a group of elements that are physically next to each other in memory. Since they are physically adjacent, we only need to remember one memory location, and each element can be accessed through simple arithmetic. What makes a *dynamic array* dynamic is that it can resize itself to accommodate more elements if needed. This process will be shown later.

![picture](https://drive.google.com/uc?export=view&id=1_4OiNVRyzehgZXijHVSWiPQvMm2NW3Vf)

Accessing an element takes trivial time for this data structure: we take the known memory location of the first element and add $k \times (\textrm{memory per element})$ to get the memory location of the $k$th element. If we consider the addition and multiplication operations to run in constant time, then for a dynamic array of size $n$:

$$ T_{\rm access}(n) = O(1) $$
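To make the arithmetic concrete, here is a toy calculation. The base address and element size below are made-up numbers chosen only for illustration, not anything a real Python list exposes:

```python
# Toy model of array indexing; BASE_ADDRESS and ELEMENT_SIZE are
# hypothetical values picked purely for this example.
BASE_ADDRESS = 0x1000   # pretend location of element 0 in memory
ELEMENT_SIZE = 8        # pretend number of bytes per element

def element_address(k):
    # One multiplication and one addition, no matter how long the
    # array is: this is why access runs in constant time.
    return BASE_ADDRESS + k * ELEMENT_SIZE

print(hex(element_address(0)))   # 0x1000
print(hex(element_address(5)))   # 0x1028
```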
To *delete* an element from the list, we can simply remove it. However, the elements after the removed one must shift their places in the list, as described in the pseudocode below, where `k` is the index of the element we want to delete:

```
FUNCTION delete(k):
    let N be the current number of elements
    let c_list be the current list

    DELETE element AT k OF c_list
    FROM i = k + 1 TO i = N - 1:
        SWAP c_list elements at index i and index i - 1
```

To better visualize it, you can look at this simple animation:

![picture](https://drive.google.com/uc?export=view&id=1_jCHgKvdHUBtoxM_jUe4jTnNc5m5bbTm)

In the best case, we delete the last element and don't need to move any other elements. In the worst case, we delete the first element, and we would need to move all the remaining elements of the list. Thus:

$$ T_{\rm delete}(n) = O(n) $$

What really differentiates a *static* and a *dynamic* array is the insert operation. A fixed-size *static array* would simply return an error whenever it is at full capacity. For the resizable *dynamic array*, we do what is called *memory reallocation*. We can't simply grab whichever memory space is located next to the last element of the array because that would be rude, and we don't even know who owns that space in the first place. Instead, a new, bigger, and better memory space is requested from the operating system. The contents of the old memory space are transferred to this new one, one element at a time. Lastly, our program takes note of where the new memory space is, and the old space is returned to the operating system.

The `insert` function together with the reallocation algorithm is explained further in the pseudocode below:

```
FUNCTION insert(new_element, k):
    let C be the current capacity
    let N be the current number of elements
    let c_list be the current list

    IF N + 1 > C:
        create new list n_list with capacity (2 * C)
        copy each element in c_list to n_list
        assign n_list memory space to c_list

    FROM i = N TO i = k + 1:
        MOVE c_list element at index i - 1 to index i
    STORE new_element AT k OF c_list
```

If that's a little confusing, you can look at the following animation of an element being inserted normally:

![picture](https://drive.google.com/uc?export=view&id=1ZtEblU0sAM5ymGDJc3WOQ5qVjo8dnkMb)

And an animation with the reallocation (note that the amount of new memory requested is somewhat arbitrary, but it is usually twice the current capacity):

![picture](https://drive.google.com/uc?export=view&id=1EI1NBxPqGeinXfFEm_ynWayUaSUzNAXb)

Similar to deletion, the best case is when we insert at the end, assuming we don't need to reallocate. At worst, we need to shift all the elements in the list. Thus:

$$ T_{\rm insert}(n) = O(n) $$

Also note that reallocation takes a linear amount of time as well:

$$ T_{\rm reallocate}(n) = O(n) $$
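To tie these pieces together, here is a minimal Python sketch of a dynamic array. The class and method names are our own choices for illustration, and a real implementation (like the one behind Python's `list`) works on raw memory rather than on another list:

```python
# A minimal dynamic array sketch; a fixed-size Python list stands in
# for a raw block of memory so the bookkeeping stays visible.
class DynamicArray:
    def __init__(self):
        self.capacity = 4                   # assumed starting capacity
        self.count = 0                      # number of stored elements
        self.slots = [None] * self.capacity

    def access(self, k):
        return self.slots[k]                # O(1): plain address arithmetic

    def insert(self, new_element, k):
        if self.count + 1 > self.capacity:  # full: reallocate, O(n)
            self.capacity *= 2
            new_slots = [None] * self.capacity
            for i in range(self.count):     # copy one element at a time
                new_slots[i] = self.slots[i]
            self.slots = new_slots
        for i in range(self.count, k, -1):  # shift the tail right, O(n - k)
            self.slots[i] = self.slots[i - 1]
        self.slots[k] = new_element
        self.count += 1

    def delete(self, k):
        for i in range(k, self.count - 1):  # shift the tail left, O(n - k)
            self.slots[i] = self.slots[i + 1]
        self.slots[self.count - 1] = None
        self.count -= 1

arr = DynamicArray()
for x in (25, 11, 7, 20, 10):
    arr.insert(x, arr.count)  # the fifth insert triggers one reallocation
arr.delete(0)                 # worst case: every remaining element shifts left
print([arr.access(i) for i in range(arr.count)])  # [11, 7, 20, 10]
```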
## Linked List Data Structure

A *linked list* is a dynamic list implementation where each element is an independent node in memory that carries an additional *pointer* to the next node, as shown below:

![picture](https://drive.google.com/uc?export=view&id=1RbllSsRv7wzsFFgL_NxCOZGwN3a2AF-s)

To help even further, we can look at a simple implementation of a linked list in the Python code below:

```python
class SLLNode:
    def __init__(self, data=0, next=None):
        self.data = data
        self.next = next

class LinkedList:
    def __init__(self):
        self.head = SLLNode(0, None)  # sentinel node

    def access(self, k):
        temp = self.head
        while k >= 0 and temp.next != None:
            temp = temp.next
            k -= 1
        return temp

    def insert(self, new_element, k=0):
        temp = self.access(k - 1)     # node right before the insertion point
        new_node = SLLNode(new_element, next=temp.next)
        temp.next = new_node
        return

    def delete(self, k=0):
        temp = self.access(k - 1)     # node right before the one to delete
        to_delete = temp.next
        temp.next = to_delete.next
        del to_delete
        return

    def __str__(self):
        temp = self.head.next
        out = ""
        while temp != None:
            out = out + str(temp.data) + "->"
            temp = temp.next
        return out

LL = LinkedList()
LL.insert(25)
LL.insert(11)
LL.insert(7)
LL.insert(20)
LL.insert(10)
print(LL)
```

Note that we have a `head` node which points to the first element of the list. This kind of node is often referred to as a **sentinel node**, and it makes the implementation of the operations easier, especially when the list becomes empty.

We start our analysis of the data structure with the `access` operation. Notice that in order to access the `k`th element, we need to go through a loop that continuously moves to the next node until we reach the node of interest. Thus:

$$ T_{\rm access}(n) = O(n) $$

The `insert` operation is done by first looking for the $(k-1)$th element and then rearranging the pointers accordingly, as visualized here:

![picture](https://drive.google.com/uc?export=view&id=1Rbgfq-20BwQXJrt7lbbU5hvalKvKvm7a)

Without the `access(k-1)` call, the function would actually run in constant time. However, it is what it is:

$$ T_{\rm insert}(n) = O(n) $$

After seeing insert, you should have a pretty good idea of how `delete` would go. If not, below is a visualization of the operation:

![picture](https://drive.google.com/uc?export=view&id=1uOuJTLJ9A6Gfhc0V-LKN-ZdVMqNXlK3A)

Similarly, it is the `access(k-1)` call that really slows down the operation. Thus:

$$ T_{\rm delete}(n) = O(n) $$

Obviously, the worst-case scenario for all operations on linked lists is when we need to access the final element. A common workaround is the use of an additional pointer called the *tail*, which points to the last element, plus an additional pointer in each node pointing to the previous element in the list, as shown below:

![picture](https://drive.google.com/uc?export=view&id=1cKxK4uovUd2bax7D5NL3CkRMSzN5EnfF)

How do you think the code for a *Singly Linked List* like the one given above would change to accommodate this? How would the time complexity change? Are these the types of questions to expect in the weekly quiz? Absolutely. One possible direction is sketched below.
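Here is one possible (and deliberately partial) sketch, assuming each node gains a `prev` pointer and the list keeps a `tail` reference beside `head`. The names `DLLNode` and `DoublyLinkedList` and the two methods shown are illustrative choices, not the full quiz answer:

```python
# Partial doubly linked list sketch: only the end-of-list operations
# are shown, since those are the ones the tail pointer speeds up.
class DLLNode:
    def __init__(self, data=0, prev=None, next=None):
        self.data = data
        self.prev = prev
        self.next = next

class DoublyLinkedList:
    def __init__(self):
        self.head = DLLNode()   # sentinel node, as before
        self.tail = self.head   # the tail rests on the sentinel when empty

    def push_back(self, new_element):
        # No traversal needed: the tail pointer makes this O(1).
        new_node = DLLNode(new_element, prev=self.tail)
        self.tail.next = new_node
        self.tail = new_node

    def pop_back(self):
        # Also O(1) thanks to the prev pointer (assumes a non-empty list).
        old_tail = self.tail
        self.tail = old_tail.prev
        self.tail.next = None
        del old_tail
```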
## Comparing Data Structures

Let's list down the runtimes of each operation for the two data structures.

| | Access | Insert | Delete |
| --- | :-: | :-: | :-: |
| Dynamic Array | O(1) | O(n) | O(n) |
| Linked Lists | O(n) | O(n) | O(n) |

Hmm, as of now, there seems to be no situation where we would prefer linked lists, so why on earth would we use them? We have to take into account *when* the worst case for each operation occurs. Let's include in the table the running times for operating on elements at the start, middle ($k$th index), and end of our list:

| | Access | Insert (start) | Insert (k) | Insert (end) | Delete (start) | Delete (k) | Delete (end) |
| --- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Dynamic Array | O(1) | O(n) | O(n-k) | O(n) | O(n) | O(n-k) | O(1) |
| Singly Linked List | O(n) | O(1) | O(k) | O(n) | O(1) | O(k) | O(n) |
| Doubly Linked List | O(n) | O(1) | O(k) | O(1) | O(1) | O(k) | O(1) |

Looking at the new table, linked lists do have an advantage when it comes to inserting and deleting items at the start of the list (and also at the end for doubly linked lists). An easy observation is that when we need to access elements in the middle of a list multiple times (say, for a search operation), we would prefer dynamic arrays. If we often insert or delete items at the ends of the list, we usually go with linked lists.

# More Linear ADTs and Data Structures

There is sometimes a blurry line between ADTs and data structures. The following examples are actually ADTs, but the data structures implementing them usually go by the same name. To avoid confusion, `<ADT Name>` by itself refers to the actual ADT, while `<ADT Name> implementation` refers to the data structure or implementation of the ADT. This applies to week 6 activities and materials as a whole.

# The *Stack* ADT

The **stack** ADT is a variation of the dynamic list ADT where the operations are more limited. It follows the *last-in first-out* (**LIFO**) principle, where the latest element inserted is the only one that can be accessed or deleted. To differentiate the operations from those of a normal list ADT, we have the following:

1. *push* - puts an item on top of the stack.
2. *peek* - looks at the most recently inserted item on the stack.
3. *pop* - removes the most recently inserted item from the stack.

As the name suggests, this ADT tries to mimic stacks in real life. When we have a **stack of boxes**, for example, you can't really move the bottom boxes without removing the top ones first (unless you want the stack to collapse). A stack we should be familiar with is the **program stack**, which stores information about your function calls in a LIFO manner as well: the last function to be called must finish first before the earlier functions can terminate. A simple visualization of how a stack works is shown below:

![picture](https://drive.google.com/uc?export=view&id=1LlPz6AvR3p6ofDoJeZu3-lXCuQTKoDFD)

Since a stack is just a dynamic list with a reduced set of operations, we have two implementations based on the dynamic list data structures:

1. *Dynamic Array* - The `peek` operation obviously runs in constant time. The `pop` and `push` operations are just modified `delete` and `insert` operations that always operate on the final element of the list. These operations normally take constant time, unless a `push` forces a reallocation.
2. *Linked List* - For a linked list stack, we can always access, insert, and delete the first element in constant time. Unlike the dynamic array, where the element with the highest index is operated on, the linked list can operate on the first element without ever needing to reallocate memory.

As a summary, we have the table below (the dynamic array's `push` is O(n) because of the occasional reallocation):

| Stack Implementation | peek | push | pop |
| --- | :-: | :-: | :-: |
| Dynamic Array | O(1) | O(n) | O(1) |
| Linked Lists | O(1) | O(1) | O(1) |

Obviously, the linked list stack implementation has a better runtime in general.
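As a minimal sketch, here is a stack built on top of the `LinkedList` class from earlier. The name `LinkedListStack` and its thin wrapper methods are illustrative choices:

```python
# Every stack operation touches only the node right after the sentinel,
# so nothing here ever traverses the list or reallocates memory.
class LinkedListStack:
    def __init__(self):
        self.items = LinkedList()

    def push(self, new_element):
        self.items.insert(new_element)    # insert at index 0: O(1)

    def peek(self):
        # First node of the list (assumes a non-empty stack): O(1).
        return self.items.access(0).data

    def pop(self):
        top = self.peek()
        self.items.delete()               # delete at index 0: O(1)
        return top

S = LinkedListStack()
S.push(1); S.push(2); S.push(3)
print(S.pop(), S.pop(), S.peek())         # 3 2 1
```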
Can you think of a scenario where we would use an array instead?

# The *Queue* ADT

A **queue** is similar to a stack, but it follows a *first-in first-out* (**FIFO**) principle, much like a queue in real life. It has similar operations, but to differentiate between them we usually use the following terms:

1. *enqueue* - inserts an item into the queue.
2. *peek* - looks at the oldest item in the queue.
3. *dequeue* - removes the oldest item from the queue.

Queues are usually used in computers to manage which processes get served first. Queues are also used alongside some algorithms as a form of storage that retains the original sequencing of the data. Since we had a visualization of the stack ADT in the previous section, it would be unfair not to have one for the queue:

![picture](https://drive.google.com/uc?export=view&id=1gQcuFxVNtKcFy9irtyyaiJ6kz150VXFf)

Similarly, we have two queue implementations based on the dynamic list data structures:

1. *Dynamic Array* - The `peek` operation obviously runs in constant time. The `enqueue` and `dequeue` operations are just modified `insert` and `delete` operations that operate on the first and last elements of the list, respectively, so the oldest element always sits at the end of the array. This means a constant running time for the delete but an always-linear running time for the insert.
2. *Linked List* - For a linked list queue, it is better to use a doubly linked list so that accessing, inserting, and deleting the first and last elements all take constant time. With a singly linked list, either `enqueue` or `dequeue` would take linear time, depending on which end acts as the front.

As a summary, we have the table below:

| Queue Implementation | peek | enqueue | dequeue |
| --- | :-: | :-: | :-: |
| Dynamic Array | O(1) | O(n) | O(1) |
| Singly Linked Lists | O(1) | O(1) or O(n) | O(n) or O(1) |
| Doubly Linked Lists | O(1) | O(1) | O(1) |

There is a particular focus on enqueuing and dequeuing in queue applications, and the linked list implementations appear to be the better implementations for that. Still, whatever advantage an array-based implementation would have over a linked list-based implementation should remain true for queues.
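As a practical aside, Python's standard library already ships a structure with constant-time operations at both ends, `collections.deque` (its name foreshadows the next section), so in everyday code a FIFO queue is just two of its methods:

```python
from collections import deque

Q = deque()
Q.append("first")    # enqueue at the back: O(1)
Q.append("second")
Q.append("third")
print(Q[0])          # peek at the oldest item: prints "first"
print(Q.popleft())   # dequeue from the front: prints "first"
print(Q.popleft())   # prints "second"
```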
# The *Deque* ADT

The **Double-Ended QUEue**, or deque (pronounced /dɛk/), is, as the name suggests, like two queues with their elements meeting in the middle. In other words, it is a queue where you can insert and delete at either end. To explain it further, here are the standard operations on a deque:

1. *push_front* - inserts an element at the front of the queue.
2. *push_back* - inserts an element at the back of the queue.
3. *pop_front* - deletes the element at the front of the queue.
4. *pop_back* - deletes the element at the back of the queue.
5. *peek_front* - looks at the element at the front of the queue.
6. *peek_back* - looks at the element at the back of the queue.

An example of something utilizing a deque would be your web browser, which can go forwards and backwards through your history. Another similar use would be implementing an undo/redo mechanism in various editing software. Of course, this section would not be complete without a visualization of a deque in action:

![picture](https://drive.google.com/uc?export=view&id=14njuCVPjJgWqY9SBoyPBnAtXSJWV7mXZ)

Again, since a deque is another linear ADT, the dynamic list data structures can easily be modified to implement it. The `push` operations are obviously `insert` operations at either the start or the end of the list, whichever is considered the *front* and the *back*. The same could be said for the `pop` and `peek` functions with the `delete` and `access` operations, respectively. The table below summarizes the performance of the operations depending on the implementation used (assuming index `0` is the front):

| Deque Implementation | peek_front | peek_back | push_front | push_back | pop_front | pop_back |
| --- | :-: | :-: | :-: | :-: | :-: | :-: |
| Dynamic Array | O(1) | O(1) | O(n) | O(1) | O(n) | O(1) |
| Singly Linked Lists | O(1) | O(n) | O(1) | O(n) | O(1) | O(n) |
| Doubly Linked Lists | O(1) | O(1) | O(1) | O(1) | O(1) | O(1) |

Of course, the doubly linked list is best in all operations, which it should be, considering how much harder it is to set up.

# Making things *Interesting*

Obviously, a *dynamic list* could function as any of these ADTs and technically be more useful. Why do we limit ourselves to accessing only one or two elements with these other ADTs? Why do people still use stacks and queues if a deque can essentially function as either? The answer lies in the data structures and in which data structures can be used in your current setup. In high-level languages like Python, we can essentially implement whichever data structure we want through classes and object-oriented programming (OOP). In low-level languages, that may not be possible: there are certain internal mechanisms required to implement the more complex data structures, like the pointers for linked lists. Remember, we program computers, and computers are not just the desktops and the laptops anymore. Smart-[anything]s are everywhere nowadays, and inside each of them there's a special little computer that has to be very efficient with the limited space and capabilities it has.

To better understand that, let's look at a hypothetical problem you could encounter. One of the simplest data structures to implement in hardware is a stack, since you only ever need to track one memory location: the top of the stack (recall the program stack). Let's say there's a certain algorithm that runs in cubic $\left( n^3 \right)$ time with a stack, but if we could implement it with a queue, its running time would be the product of the running times of the queue operations, $T_{\rm access}(n) \times T_{\rm dequeue}(n) \times T_{\rm enqueue}(n)$. Knowing that there are at least two stacks available in hardware, is there a way for us to implement a good enough queue data structure that improves the running time?

Recall that a stack is LIFO and a queue is FIFO. Thus, the element at the very bottom of a stack, the last one we can pop, is the first element of the corresponding queue. We can reach it by emptying the stack and using the last element that was popped. However, we must have someplace to store the other popped elements, since they should still exist in the *queue*, and that's where the second stack comes in. To visualize:

![picture](https://drive.google.com/uc?export=view&id=19h92KOyMherUopgLyBXxK8pHNDXuyYot)

With this implementation, since we need to empty our original stack in order to reach the element to `dequeue` or `peek`, we expect the runtime for these two to be linear, assuming stack operations are all constant. `enqueue` is just a `push` operation. Thus, the algorithm's runtime would be:

$$ T(n) = O(n \times n \times 1) = O(n^2) $$

which is certainly an improvement over the cubic running time using a stack alone. Do you think we could improve this further? A direct transcription of this two-stack queue is sketched below.
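Below is a minimal transcription of the two-stack procedure above into Python, using plain lists as the stacks since their `append` and `pop` act as constant-time `push` and `pop`. The class name and the `stack1`/`stack2` fields are illustrative choices, and error handling for an empty queue is omitted:

```python
class TwoStackQueue:
    def __init__(self):
        self.stack1 = []   # holds the queue; the oldest element is at the bottom
        self.stack2 = []   # temporary storage while digging for the oldest

    def enqueue(self, new_element):
        self.stack1.append(new_element)       # a plain push: O(1)

    def dequeue(self):
        # Pop everything above the bottom element into the second stack: O(n).
        while len(self.stack1) > 1:
            self.stack2.append(self.stack1.pop())
        oldest = self.stack1.pop()            # the front of the queue
        # Pour the stored elements back so the queue stays intact: O(n).
        while self.stack2:
            self.stack1.append(self.stack2.pop())
        return oldest

Q = TwoStackQueue()
for x in (1, 2, 3):
    Q.enqueue(x)
print(Q.dequeue(), Q.dequeue(), Q.dequeue())  # 1 2 3
```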
# Final Notes

Depending on the reference you are using, there might be some variation in terms. The `peek` operation for stacks is sometimes referred to as `top`, since it accesses the *topmost* element. Similarly, the `peek` function for a queue is sometimes referred to as `front`, since it accesses the element at the front of the queue. Then, `peek_front` and `peek_back` are simply `front` and `back`, respectively, for deques. There are still others, but I'm sure we can deduce which term is which as long as we understand the basic principles of these ADTs.