# Week 2 (23-28/11/20) ## Data structures and Algorithms (MON, 23/11/20) ==**Big O Notationion**== Describes the limiting behaviour of a function when the ==argument== tends **towards infinity** * **Time complexity** of an algorithm quantifies the amount of time taken by an algorithm to run as a function of the length of the input. ==**Big O Orders**== * O(1)-constant * O(logn)-logarithmic * O(n)-linear * O(nlogn)-linearithmic * O(n**2)-quadratic * O(c**n)-exponential (c is constant) **Determine O()** ``` a, b=0, 0 n-1000 for i in range(n): a=a+random.randit(0, 10) for j in rnage(n): b=b+random.randit(0,10) ``` ---> O(n) ``` i=n while (i>0): print(i) i=i//2 ``` ---> O(logn) --- ``` a= [random.random() for _ in range(n)] for i in range (n-1): for j in range (i+1, n): if a[i]<a[j]: a[i], a[j] = a[j], a[i] ``` ---> O(n**2) --- ``` def search(arr, n, x): """If x is present then return its location, otherwise return -1 """ for i in range (n): if (arr[i] == x): return i; return -1; ``` ---> O(n) --- **==Searching & Sorting Algorithms==** **==Binary Search==** ![](https://i.imgur.com/BOyClaL.png) **==Recursion==** **Recusion** occurs where a *function* being defined is applied **within its own definition** ---> O(logn) is generally the time complexity of **binary search** Advantages of recursion: * neat to solve complex problem, easy to understand * Used in other algorithms (Binary Search) **Factorial** n!= n x (n-1)! e.g. factorial(n)=nxfactorial(n-1) factorial(n-1)=(n-1)xfactorial(n-2) factorial(1)=1 **The Base Case** ---> have to include base case to stop the recursion or else it goes on forever --- **==Advanced Sorting==** 1. **==Merge Sort==** * Average case= Worst case= O(nlogn) * Split everything first then start sorting only when merging process start 2. **==Quick Sort==** * Average case: O(nlogn) * Worst case: O(n^2) (when the list has already been sorted: grow tree on 1 side only) * Given an array, set an element x of array as ==**pivot**== bubble sort: O(n^2) merge sort: O(nlogn) for both worst and average case quick sort: O(nlogn)-average case, O(n^2)-worst case --- **==Object Oriented Programming(OOP)==** **Functional programming vs OOP** **FP**: create a function to solve problems, ie emphasizes on the evaluation of functions **OOP**: define a **class** then you have an object to fit into the appropriate class properties=**attributes** ie description of object **methods**=what kind of action the object can do ``` type(object) ---> bring out class of the object ``` --- ## Numpy (TUE, 24/11/20) ==**Numpy**==: a library that allows us to work with large, multi-dimensional arrays and matrices and perform high-level mathematical functions on these arrays. Python variable:`x=4` x is a pointer, pointing to a position in memory containing all the Python object information, including the bytes that contain the integer value 4 Numpy array **stores the exact value** (does not need a medium like Python) --->Numpy is **faster** than Python list **==Syntax==** ``` tmp = np.array([3.14, 4, 2, 3]) print(tmp) print(tmp.dtype) ``` Output: ``` [3.14 4. 2. 3. ] float64 #type is float stored in 64bits ``` --- * When put a mixture of type of input, numpy automatically changes all into 1 type `np.array([3.14, 4, 2, 3, 'a'])` Output: `array(['3.14', '4', '2', '3', 'a'], dtype='<U32')` ``` # if you want to explicitly set the data type np.array([1, 2, 3, 4], dtype='float32') ``` **==Basics==** Tensors: a collection of values ![](https://i.imgur.com/wEOiWed.png) Visualizing tensors: ``` x = np.array(6) #Scalar print ("x: ", x) # Number of dimensions print ("x ndim: ", x.ndim) # Shape print ("x shape:", x.shape) # Size of elements (how many numbers does x store) print ("x size: ", x.size) # Data type print ("x dtype: ", x.dtype) ``` --- ``` x = np.array([1.3 , 2.2]) #Vector (1 dimensional) ``` --- ``` x = np.array([ [1,2] , [3,4] ]) #Matrix (list of list, 2 dimensional ``` --- ``` x = np.array([ [[1,2], [3,4]], [[5,6], [7,8]] ]) #3D tensor, ``` **IMAGE** `imgage.shape` ---> `(427, 640, 3)` 427:height 640:width 3:channels (R,G,B) pixel: number between 0-255 **Functions**: to create tensors **quickly** ``` print ("a 2x2 matrix full of zeros:\n", np.zeros(shape=(5,2))) print ("a 2x2 matrix full of ones:\n", np.ones(shape=(2,2))) print ("a 2x2 identity matrix:\n", np.eye((2))) # identity matrix print ("a 2x2 matrix full of random numbers between 0 and 1:\n", np.random.random((2,2))) ``` --- `x= np.random.random((2,2,3)) #2 rows, 2 columns, 3 channels` ``` x[0,:,:] #first matrix x[1,:,:] #second matrix x[1,1,:] #second row of second matrix ``` **==Indexing==** ![](https://i.imgur.com/y2kMXdE.png) **Slicing** `x[[0,2],:] #first and third row` **Integer array indexing** ``` x = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) print(x) ``` ``` [[ 1 2 3 4] [ 5 6 7 8] [ 9 10 11 12]] ``` ``` print ("indexed values: ", x[[0,1,2], [0,2,1]]) # (0, 0), (1, 2), (2, 1) ``` Output: `indexed values: [ 1 7 10]` --- **Boolean array indexing** ``` x = np.array([[1, 2], [3, 4], [5, 6]]) print(x.shape) print ("x:\n", x) print ("x > 2:\n", x > 2) #compare every single value in matrix with the number 2 print ("x[x > 2]:\n", x[x > 2]) #filter out number > 2 ``` Output: ``` (3, 2) x: [[1 2] [3 4] [5 6]] x > 2: [[False False] [ True True] [ True True]] x[x > 2]: [3 4 5 6] ``` **==Arithmetic==** **Basic Maths** `np.add(x, y)) # or x + y` `np.subtract(x, y)) # or x - y` `or x * y (different from x.dot(y))` **Dot product** `c = a.dot(b)` OR `c = np.dot(a, b)` **Axis Operation** ``` x = np.array([[1,2],[3,4]]) print (x) print ("sum all: ", np.sum(x)) # adds all elements print ("sum axis=0: ", np.sum(x, axis=0)) # sum across rows meaning collapsing row together--> become 1 row print ("sum axis=1: ", np.sum(x, axis=1)) # sum across columns ie collapsing columns together become 1 column ``` Output: ``` [[1 2] [3 4]] sum all: 10 sum axis=0: [4 6] sum axis=1: [3 7] ``` **MIN/MAX** ``` x = np.array( [[1,2,3], [44,45,0]] ) print(x) print(x.shape) print ("min: ", x.min()) print ("max: ", x.max()) print ("min axis=0: ", x.min(axis=0)) print ("min axis=1: ", x.min(axis=1)) ``` Output: ``` [[ 1 2 3] [44 45 0]] (2, 3) min: 0 max: 45 min axis=0: [1 2 0] min axis=1: [1 0] ``` **==Axis operation==** axis 0=rows axis 1=columns [1] still a vector ---> 1 dimensional [[1]] turn into matrix of 1x1 --->2 dimensional **==Broadcasting==** --- ## Basic SQL (WED, 25/11/20) **==What is Database==** A **collectio** of data, can be stored+accessed electronically from a computer system **Database management system (DBMS)**: a software to manage the database 1. Non-relational database 2. Relational database:collection of tables(relations), each consisting a set of **rows and columns** - Table(Entity): a structured list of data - Column: table is made up of one or more columns, each of column have a specific **data type** - Row(record):table can have multiple row - **Primary key**: ==column== that **uniquely identifies each row** in a table --->Values in primary key should never be modified/updated - **Foreign key**: a referencing key used to link two tables together ==**Entity relationship diagram**== ![](https://i.imgur.com/lrAwrdD.png) ![](https://i.imgur.com/iSw3eKv.png) **Mandatory**:every single genre links to at least one track in the tracks table **Optional**:genre classical but have no tracks, **One-Many** (1 genre can have different tracks) ![](https://i.imgur.com/xN160Fp.png) mandatory many --------------------------------optional one **==Structured Query Language==** Language designed to query, manipulate, and transform data from a relational database **==Order of Execution==** ![](https://i.imgur.com/2xnsEcV.png) --- Print out how many artist there are in the database ``` pd.read_sql_query('SELECT COUNT(Name) FROM artists', conn) ``` ==**Queries with Constraint**== ``` SELECT column, another_column, ... FROM mytable WHERE conditions AND/OR another_condition AND/OR ...; ``` --- #Find all invoices where Total is less than or equal to 1 (dollar) `pd.read_sql_query('SELECT * FROM invoices WHERE Total<=1',conn)` --- #Find all invoice in 2011 ``` pd.read_sql_query('SELECT InvoiceDate FROM invoices WHERE strftime("%Y",InvoiceDate) = "2011"',conn) ``` OR ``` pd.read_sql_query('SELECT * FROM invoices WHERE InvoiceDate LIKE "2011%"',conn) ``` --- ==**Operators for **numerical** data**== ![](https://i.imgur.com/spG2Svp.png) ==**Operators form **text** data**== ![](https://i.imgur.com/UIH7mAI.png) --- ==**Filtering and Sorting query results**== * Discard rows that have a duplicate column value by using the DISTINCT keyword: ``` SELECT DISTINCT column, another_column, … FROM mytable WHERE condition(s); ``` * Sort result + LIMIT+ OFFSET ``` SELECT column, another_column, … FROM mytable WHERE condition(s) ORDER BY column ASC/DESC; LIMIT num_limit OFFSET num_offset; ``` ==**Multi-table queries with JOINS**== ``` SELECT column, another_table_column, … FROM mytable INNER JOIN another_table ON mytable.id = another_table.id WHERE condition(s) ORDER BY column, … ASC/DESC LIMIT num_limit OFFSET num_offset; ``` ==**LEFT/ RIGHT/ OUTER JOINS**== 1. Left join 2. Right join 3. Outer join 4. Inner join 5. Cross join ==**Queries with aggregates and GROUP BY**== ![](https://i.imgur.com/bM36u75.png) ``` SELECT group_by_column, AGG_FUNC(column_expression) AS aggregate_result_alias, … FROM mytable WHERE condition GROUP BY column ``` --- #Find the total number of tracks belong to each genre ``` pd.read_sql_query(''' SELECT genres.Name, COUNT(tracks.TrackId) as NumberOfTracks FROM tracks JOIN genres ON tracks.GenreId = genres.GenreId GROUP BY genres.GenreId ''', conn) ``` --- #Find the number of customers each employee has supported ``` pd.read_sql_query(''' SELECT e.FirstName, e.Title, COUNT(c.CustomerId) AS NumberOfCustomers FROM employees AS e LEFT JOIN customers AS c ON e.EmployeeId = c.SupportRepId GROUP BY e.EmployeeId; ''', conn) ``` --- #Find the customer who paid the most ``` pd.read_sql_query(''' SELECT c.CustomerId, c.FirstName, SUM(UnitPrice) AS TotalSpending FROM customers AS c JOIN invoices AS invoice ON c.CustomerId= invoice.CustomerId JOIN invoice_items AS items ON items.InvoiceId = invoice.InvoiceId GROUP BY c.CustomerId ORDER BY TotalSpending DESC LIMIT 1 ''', conn) ``` --- ## Advanced SQL (THU, 26/11/20) **Components of Relational Database:** ![](https://i.imgur.com/rjxJp31.png) * Tables * Columns (fields) * Rows (records) * Scheme **Relations:** * one to one * one to many * many to many (avoid this relation) ==**NoSQL**== * Non-relational database, each offers a unique query language * Data organization: stored in a dictionary, has few/no relations ![](https://i.imgur.com/W7hyuDp.png) * Advatage: fast look up * Disadvantage: difficult to modify information inside, unsuitable for complex query **Scaling** * SQL is suitable for vertical scaling * NoSQL is suitable for horizontal scaling ![](https://i.imgur.com/0Jsi68r.png) --- **==SQL vs NoSQL==** ![](https://i.imgur.com/9qxky0R.png) ==**Sub-queries**== ![](https://i.imgur.com/F5J9fpU.png) A Sub-query is a SQL query nested inside a larger query * Used to calculate summary value to use for a query * 2 types of sub query * Correlated: there's a relationship between value in first SELECT query and value in secon SELECT value * Uncorrelated ``` SELECT column_names FROM table_names WHERE value IN (SELECT column_names FROM other_table WHERE condition) ``` Example: ``` pd.read_sql_query(''' SELECT artists_mini.ArtistId, artists_mini.Name, albums_mini.AlbumID, albums_mini.Title FROM (SELECT * FROM artists LIMIT 5) as artists_mini INNER JOIN (SELECT * FROM albums LIMIT 5) as albums_mini ON albums_mini.ArtistId = artists_mini.ArtistId ''', conn) ``` ==**Common Table Expression**== CTE is a temporary result to use with other queries CTEs help organize complex queries ![](https://i.imgur.com/h2Soq9T.png) **==Conditionals==** * If-else statements in SQL * Used to map on value to another ![](https://i.imgur.com/njaob8s.png) ==**Set Operators**== Allow the results multiple queries to be combined together 1. UNION 2. UNION ALL 3. INTERSECT 4. EXCEPT --- ## TEST