# Data Wrangling

## Week 1
* Thomas went through the basic health and safety instructions.
* Class reps chosen by voting - gender equality (it's pretty good).
* Giulio discussed the importance of data and how it was used in the past, with Venice and the Library example.
* Thomas discussed the tools stack (R and Julia) along with the dedicated Ubuntu server.
* Assessment and group projects - questions from Zoom attendees.

### Lab 1- Installation of software stack

## Week 2
- R basics: syntax, variable assignment and vectors
- F1 key for help
- Functional part - we can create our own 'complex' functions composed of different existing R functions.
- The output of the same function call in R might not be the same every time, because some functions are probabilistic rather than deterministic. R was developed for statistical purposes.
- Instead of F1, use help() or ?

Break

- DataFrame (efficiently search and wrangle data / standard format with a lot of software support, e.g. a map is difficult to work with but in tabular form it's easier)
- Before going for DataFrames, it is better to understand the packages that are used with them
  - pipe (part of the tidyverse library - it helps us write a long sequence of operations in a much more readable way, e.g. sqrt of 2)
  - Flow... band_instruments example
- Functions - like Flows but for more complex scenarios; you can provide multiple input variables etc., e.g. data / target_name - multiple inputs were not working, so ... in Python you can but in R you couldn't.
- IRIS - plotting ..
- The working directory is the place where you save your work/packages/data files ...
- Data Frame and its dimensions (data set / atomic vector, e.g. 1 dimension, all numeric, kind of homogeneous) ... A list, compared to a vector, can hold multiple data types, e.g. numeric, string
- In subsequent lectures we are going to read about Matrix, where we could have different types of objects.
- Let's talk about DataFrame definitions, e.g. a
two-dimensional, array-like structure where each column holds one variable and each row contains a set of values, one from each column. Each row should be unique and each column has a name.
- Creating a DF: library(tidyverse) to do this, and tibble() is the function to create a DF .. view / structure str(DF) ...
- Try ggplot() along with filter() and mutate() ...

### Lab 2- Tidyverse and other R packages & their usage

## Week 3 - tidydata
- Bit of revision from the last lecture
  - What is tidy data ("happy families are all alike; every unhappy family is unhappy in its own way")
- Basic data types - revision...
- skim() vs glimpse() - columns / rows / values in the rows
- Usage of select; this includes usage of arguments ..
- descending/ascending order
- Data search .. filter/search()..
- Concatenation and vector usage...
- Fake news - data set given to us related to news...
- what do we like to read in news..
- every year, Pulitzer Prize

## Week 4 - Dataformats
- Data formats (long vs wide)
- Wide to long dataframe
- Note that there has been an update in tidyverse with respect to gather() and spread(): they are superseded by the functions pivot_longer() and pivot_wider(). Refer to the help for further information.
- Long to wide dataframe (did we get the same data frame back?) - the column order is not the same. Use the function (arrange columns / rows)

## Week 5 - data structure and how to join them
- joining data using primary keys
- types of joins and their usage

## Week 6 - Image processing
- talk about groups ***
- Image formats - not going to do the raw one..
  - pick an image format .. which image format is your current image in, and which do you want to transform it to..
- compression methods - lossy vs lossless
  - File size, details of the image data kept
- Common standard proprietary format
- Visual effects
- Applications
  - Basic principles of image processing .. why it's so important...
- "with the advent of sophisticated and affordable cameras, image processing is considered cost-effective, accurate, labor saving and reproducible... 1992.."
- Import image as an array !!!
- hex codes for color: https://htmlcolorcodes.com
- EXAMPLES... image processing..
- lab: how to import single or column image
- Image contrast (linearly interpolated for each pixel, if you want to change it)
- you can change the brightness of each pixel - if you want it darker then subtract a number ... and you can even combine brightness and contrast together .. or even adjust the dynamic range ..
- Image transformation - blurring for privacy, e.g. number plates, typically
- Image transformation to B/W
- Image edge tracking - it's a big research field .. where you get the edges ..
- We are going to broaden the edges .. edges can be computed with different operators ..
- There are algorithms which can detect the edges
  - Take the forest image ..
- jupyter lab .. boats_array_number example ..
- computations on the image array .. horizontal/vertical edges ...
- color channel vertical vs horizontal .. less effect horizontally as compared to vertically.
- *Negative Image* we can also compute the negative of an image. There are two ways to store a color channel: 0~1 / 0-255 ...

#### OUTPUT
- think about the compression algorithm, lossy vs lossless .. tradeoff between file size and color accuracy .. if you don't specify the quality, some sort of compression is applied automatically ..

Two usage applications (leading to Data project)
- computer - shadows of X-rays .. restructure a slater ..
- Clinical CT - shadow reconstruction ..
- Cricket bat - structural changes ..
- Data 601 project regarding topography.
- UC 3D Lidar ... surround photos .. pre-processing + image joining ..
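
The brightness, contrast, negative and edge remarks in the Week 6 notes can be sketched in a few lines. This is a minimal illustration only, written in plain Python rather than the course's R/Julia stack. It assumes a grayscale image stored as a list of rows of 0-255 integers, and it assumes the linear rule `new = contrast * old + brightness`, which is one common choice and not necessarily the one used in the lab. All function names and the tiny test image are hypothetical.

```python
# Plain-Python sketch of the pixel operations from the Week 6 notes.
# Assumption: a grayscale image is a list of rows of 0-255 integers.


def clamp(value, low=0, high=255):
    """Keep a pixel value inside the valid 0-255 range."""
    return max(low, min(high, value))


def adjust(image, contrast=1.0, brightness=0):
    """Apply new = contrast * old + brightness to every pixel (assumed rule)."""
    return [[clamp(round(contrast * p + brightness)) for p in row] for row in image]


def negative(image):
    """Invert every pixel: 0 becomes 255 and 255 becomes 0."""
    return [[255 - p for p in row] for row in image]


def to_unit_range(image):
    """Convert 0-255 storage to the 0-1 storage convention."""
    return [[p / 255 for p in row] for row in image]


def vertical_edges(image):
    """Absolute difference between left/right neighbours (finds vertical edges)."""
    return [[abs(row[x + 1] - row[x]) for x in range(len(row) - 1)] for row in image]


def horizontal_edges(image):
    """Absolute difference between up/down neighbours (finds horizontal edges)."""
    return [
        [abs(image[y + 1][x] - image[y][x]) for x in range(len(image[y]))]
        for y in range(len(image) - 1)
    ]


# Hypothetical 3x4 test image: dark left half, bright right half.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]

darker = adjust(image, brightness=-20)   # subtract a number to darken
stretched = adjust(image, contrast=1.5)  # increase contrast, clamped at 255
inverted = negative(image)
v_edges = vertical_edges(image)
h_edges = horizontal_edges(image)

print(darker[0])     # [0, 0, 180, 180]
print(stretched[0])  # [15, 15, 255, 255]
print(inverted[0])   # [245, 245, 55, 55]
print(v_edges[0])    # [0, 190, 0]
print(h_edges[0])    # [0, 0, 0, 0]
```

In this test image the only boundary runs vertically, so it shows up in `v_edges` while `h_edges` is all zeros. Real edge operators use larger neighbourhoods than this one-pixel difference.
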
## Week 7 - API usage
- how to read from a web URL
- exploit the rvest package
- tidytext

## Week 8 - Guest lecture Orbica
- GIS-based data and its importance
- GEO-AI - processing and ways to filter the data down into information
- Data processing pipeline and software being used

## Week 9 - Guest lecture - Network data
- Scraping data, efficiency R vs Julia
- Data Ventures does not want to handle, store or serve up unit record data.
- Repartition/merge/sort of data, which is a lot in terms of 1g ~ 5g.
- Coblin (Java flavor) / Julia & Scala..
- Data merge and obfuscation
- In the UK, they are handing over medical data in Excel sheets with 65K records per sheet.
- MongoDB is really fast to aggregate..
- Speed is a quality all of its own ..

## Week 10 - Guest lecture - Maori Data
- Maori data and how we are processing it
- Security of data and its implications
- Presence in different languages
- History of Maori data and its optimization