AI Day Research : Prototype : Learning
===

###### tags: `learning`

## "Swift for TensorFlow" & "Introduction to tf.keras and TensorFlow 2.0" - Paige Bailey - Google Brain

Paige Bailey is the product manager for TensorFlow core as well as Swift for TensorFlow. Prior to her role as a PM in Google's Research and Machine Intelligence org, Paige was a developer advocate for TensorFlow core; a senior software engineer and machine learning engineer in the office of the Microsoft Azure CTO; and a data scientist at Chevron. Her academic research focused on lunar ultraviolet, at the Laboratory for Atmospheric and Space Physics (LASP) in Boulder, CO, as well as the Southwest Research Institute (SwRI) in San Antonio, TX.

## "TensorFlow Lite: On-Device ML and the Model Optimization Toolkit" - Jason Zaman - Light

Machine learning at the edge is important for everything from user privacy to battery consumption. This talk will give an overview of the different strategies for optimizing models for on-device inference: pruning and integer quantization with the Model Optimization Toolkit. Then there will be a demo of all these techniques together, running a model on an EdgeTPU.

Jason is the community lead for TensorFlow SIG-Build and an ML-GDE. He works as a machine learning engineer at Light, doing computational photography in mobile cameras. Along with speaking regularly, he is also active in open source as a Gentoo Linux developer and maintainer of the SELinux Project.

## "Which image should we show? Neural Linear Bandit for Image Selection" - Sirinart Tangruamsub - Agoda

Sirinart is a data scientist at Agoda. Before joining Agoda, she was a postdoctoral researcher at the University of Goettingen. She has extensive experience in the fields of computer vision and natural language processing at various startups and corporations. Her current areas of interest include personalization and recommendation systems.
## "XLNet - The Latest in language models" - Martin Andrews - Red Dragon AI Martin is a Google Developer Expert in Machine Learning based in Singapore - and was doing Neural Networks before the last AI winter... He's an active contributor in the Singapore data science community and is the co-host of the Singapore TensorFlow and Deep Learning MeetUp (with now with 3700+ members in Singapore). ## "Deep Learning on Graphs for Conversational AI" Sam Witteveen - Red Dragon AI Sam is a Google Developer Expert for Machine Learning and is a co-founder of Red Dragon AI a deep tech company based in Singapore. He has extensive experience in startups and mobile applications and is helping developers and companies create smarter applications with machine learning. Sam is especially passionate about Deep Learning and AI in the fields of Natural Language and Conversational Agents and regularly shares his knowledge at events and trainings across the world, as well as being the co-organiser of the Singapore TensorFlow and Deep Learning group. ## "TensorFlow Extended (TFX): Real World Machine Learning in Production" - Robert Crowe - Google Brain A data scientist and TensorFlow addict, Robert has a passion for helping developers quickly learn what they need to be productive. He's used TensorFlow since the very early days and is excited about how it's evolving quickly to become even better than it already is. Before moving to data science Robert led software engineering teams for both large and small companies, always focusing on clean, elegant solutions to well-defined needs. 
You can find him on Twitter at @robert_crowe

-------

## Paige Bailey: tf.keras

webpaige@google.com / Twitter: @DynamicWebPaige

**internal Google training material for Google engineers**

AlphaFold paper

## The Internals of tf.keras

### Architecture

* Engine
  * base layer
  * base network - DAG of layers
  * model (network + training/eval loop)
  * Sequential

### Layers and Models

[Layers docs](https://keras.io/layers/about-keras-layers/)

Everything is a layer: models are just layers.

* computation from inputs -> outputs
* batchwise computation
* can't mix eager exec and static
* manages state - training and inference modes
* supports "type checking" - automatic compat checks
* frozen or unfrozen
* serialized and deserialized - soon
* mixed precision layers

DOES NOT DO:

* no device placement
* no datasets
* no non-batch computation
* no outputless or inputless processing

Gave an example of a canonical lazy layer - most layers you build look like that; you don't hardcode what the layer does yet.

`GradientTape()` - something like a training loop
`optimizers.[optimizer]()`

### Functional Models

[Models docs](https://keras.io/models/model/)

Similar to how layers work too!
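A framework-free sketch of the "everything is a layer" idea above - a layer is computation from inputs to outputs, a canonical lazy layer creates its weights only on first call, and a model is itself just a layer chaining sub-layers. All class and method names here are illustrative stand-ins, not the actual tf.keras APIs:

```python
class Layer:
    """A layer is computation from inputs to outputs; it may manage state."""
    def __call__(self, inputs):
        raise NotImplementedError

class Dense(Layer):
    """A canonical 'lazy' layer: weights are only created on first call,
    once the input size is known (mirroring tf.keras's deferred build())."""
    def __init__(self, units):
        self.units = units
        self.built = False
    def build(self, input_dim):
        # All-ones init keeps this sketch deterministic.
        self.w = [[1.0] * self.units for _ in range(input_dim)]
        self.built = True
    def __call__(self, inputs):
        if not self.built:
            self.build(len(inputs))
        return [sum(x * row[j] for x, row in zip(inputs, self.w))
                for j in range(self.units)]

class Model(Layer):
    """A model is itself just a layer that chains sub-layers."""
    def __init__(self, layers):
        self.layers = layers
    def __call__(self, inputs):
        for layer in self.layers:
            inputs = layer(inputs)
        return inputs

model = Model([Dense(units=2), Dense(units=1)])
print(model([1.0, 2.0, 3.0]))  # -> [12.0]
```

Because `Model` is itself a `Layer`, models nest inside other models for free - the same property the notes describe for tf.keras.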
Can nest also. Model is a layer, but also provides access to training, saving, and summary/model visualization.

* Layer: "layer" or "block" in the literature
* Model: "model" or "network"

Model compile: substantial perf hit for eager ops, because some parts of it are really not meant for eager exec.

* compile: build
* spec:
* fit: go through data

Functional API -> create DAGs of layers: build a directed acyclic graph -> shows the linkage of the model. The functional API is the connectivity of layers: declarative models, no logic - all logic is contained inside the layers. All debug is done at compile time - you merely define the thing.

* static input compat check
* model saving
* model plotting
* auto masking
* good for model debug

Can check the entire model history.

### Training and Inference

When you call `fit()` it runs an entire list of functions.

### Losses and Custom Metrics

custom metric - init, update state, add_metric

endpoint pattern (can also write your own loops)

## "Deep Learning on Graphs for Conversational AI" - Sam Witteveen - Red Dragon AI

DL: great for perception tasks, getting better at generative tasks - but how do we get _reasoning_?

GPT-2: seems to have dumped all the knowledge into weights - but knowledge as weights is inefficient. Maybe use graphs? Like the nodal representation: undirected, digraph, weighted.

Knowledge graphs: the 'knowledge panel': Freebase, Wikidata, Cyc, DBpedia, WordNet, Prospera, NELL, GeoNames, GDELT

The concept seems to be to train a model that does our Google searches for us.

Symbolic AI - adding of knowledge into conflict

Node - edge - node: (object-property-value), (subject-predicate-object) (RDF data)

Information retrieval:
- Right knowledge at the right time?
- Custom graphs?
- What happens if I have missing information?

DL for knowledge graphs: getting knowledge out is hard - extraction ++ prediction on graphs: node classification vs edge classification. What's the right graph classification? Node regression mode?
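The subject-predicate-object triples mentioned above can be sketched as a plain-Python triple store; the example facts and helper name below are illustrative, not from the talk:

```python
# Minimal RDF-style triple store: each fact is a (subject, predicate, object) tuple.
triples = [
    ("BarackObama", "spouse", "Michelle"),
    ("Michelle", "child", "Malia"),
    ("Michelle", "child", "Sasha"),
]

def lookup(subject, predicate):
    """Return all objects matching a (subject, predicate, ?) pattern."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(lookup("Michelle", "child"))  # -> ['Malia', 'Sasha']
```

Real knowledge graphs (Wikidata, DBpedia, etc.) are queried with languages like SPARQL, but the underlying data model is exactly this pattern-matching over triples.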
Facebook Ego Network: node value and regression

```graphviz
digraph hierarchy {
    nodesep=1.0 // increases the separation between nodes
    node [color=Red,fontname=Courier,shape=box] // All nodes will have this shape and colour
    edge [color=Blue, style=dashed] // All the lines look like this
    BarackObama->{Michelle}
    Michelle -> {Malia Sasha}
}
```

So you can look up the Barack Obama node and check out the other properties.

### Why are graphs hard?

* meaning and features are in the relationships, not the nodes
* no nice fixed position for each node
* edges can be directed
* non-Euclidean space

Inductive biases of DL: Euclidean space - fixed spaces and sequences - so there's a lack of breakthroughs for representing this kind of space.

How do we represent this input? Node+edge embeddings? Adjacency matrices - order n^2.

Deep walk - [node2vec](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf):
- random walks along the graph for N steps
- treat each walk as a sentence - use skip-grams on the walks
- we don't know if this is the best representation

[Graph-CNN](https://github.com/tkipf/keras-gcn): graph convolution:
- subgraphing of n nodes reachable from a given node
- assumes nodes being connected implies likelihood of similarity
- loss on known nodes only
- treat all nodes as undirected

Kegra / Kipf: keras-gcn

"Relational Inductive Biases, Deep Learning, and Graph Networks": edge-node-global updating - looks at edges - gets embedding values - then nodes - and then a global update.

Transfer learning with graphs? Graph-in-graph-out? Predict a new graph.

## TF eXtended: Robert Crowe

@robert_crowe

Real world ML in production: Configuration, Data Collection, Serving Infrastructure, Process Management Tools, Analysis Tools, ...
TFX exists to do this. [Ranking Tweets with TF](https://medium.com/tensorflow/ranking-tweets-with-tensorflow-932d449b7c4)

### Production Pipelining: TFX

Production ML:
- labelled data
- feature space coverage
- minimal dimensionality
- maximum predictive data
- fairness
- rare conditions
- data lifecycle management

Classic problems don't go away: the SE problems still sit around! ["Hidden Technical Debt in ML Systems"](https://bit.ly/ml-techdebt)

Data Ingestion -> Validation -> Feature Engineering -> Train Model -> Validate Model -> Push if good -> Serve Model

[Apache Beam](https://beam.apache.org/documentation/)

Component: the Driver does job exec - the Executor does the work - the Publisher updates ml.metadata (in a model validator). A configuration file pulls from and writes back to the metadata store - based on dependencies. Task-aware pipeline: transform-trainer (classic).

TFX: training data - input data -> Transform -> transformed data -> Trainer -> trained models

Metadata store: contains artifacts and properties
+ exec history of runs
+ metadata-powered functionality - remembers what was previously run (and what data it was run on)
+ carry-over states from previous model runs - caching of previously computed outputs

Beam: unified batch and stream distributed processing API - SDKs in multiple languages + sets of runners.

Your application lives for years - if you want to compare, you need the metadata to visualise what's happened.

The Evaluator lets you check individual slices in the dataset. If one user isn't being served well, they're having a bad experience.

The model objective is nearly always a proxy for your business objectives. The world doesn't stand still. Data is never what you wished you had.

ML Triangle: Business Realities - Bad Data - Model Needs Improvement (Demographics? Insights? Processes?)

What-If Tool - run inference on your model

TFX and Kubeflow Pipelines: the Kubeflow team takes TFX code and applies it to a Kubernetes environment.

## Sirinart: How do we pick photos for Agoda?
The multi-armed bandit: A/B testing but... extreme. We know that each arm has some expected reward - so we try everything! Exploration vs exploitation.

Thompson Sampling - updating a posterior distribution, with neural linear units to approximate the posterior distribution: Bayesian linear regression on the representation in the last layer of a neural network.

## XLNet: Martin Andrews, Red Dragon AI

### Transformer Architectures

Feed in tokens at the bottom - pass through layers - get the result at the top. Masked multi-head self-attention.

Sequential attention: turn the input into memory - "attend" to each portion - at each step the query is dot-producted to produce a score - check the match of the query - feed into a softmax to create an attention distribution.

q, k, v: queries (what), keys (why choose), values (what you get when you choose)

Transforming all of the stuff - modifies all of the input so that it's "more useful" for the stuff at the end -> take the input, generate q/k/v, score them by dot product, softmax, weight the values -> sum (this is each column). Learning the meaning of a word through its context in the sentence.

Token and position embeddings: take the words of the English language and zip-compress them into fragments - group words into each other, then form a vocabulary - a great way to form an infinite vocabulary. Positional embeddings - each position in the input stream comes with a kind of "sine-wave position" phase -> can compare positional differences by this phase difference.

Unsupervised training if I can: BERT: introspection - don't do one word at a time - mask out some of the words and have the model play fill-in-the-blanks - non-predictive but more analytical - feedback in all directions rather than predictive.

Reconfigurable output: don't need to retrain from scratch, but rather we can use it to understand text. None of the text is labelled - it's just live data.
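The q/k/v scoring described above - dot-product scores fed through a softmax to weight the values - and the "sine-wave position" phase idea can both be sketched in plain Python. This is a single head with no learned projections, purely illustrative:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: for each query, score every key,
    softmax the scores into an attention distribution, and take the
    weighted sum of the values ('what you get when you choose')."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def positional_encoding(pos, dim):
    """Sine/cosine 'phase' embedding: positions can then be compared
    by phase difference, as in the original Transformer."""
    return [math.sin(pos / 10000 ** (2 * (i // 2) / dim)) if i % 2 == 0
            else math.cos(pos / 10000 ** (2 * (i // 2) / dim))
            for i in range(dim)]

out = attention(queries=[[1.0, 0.0]],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[10.0, 0.0], [0.0, 10.0]])
# The query matches the first key more strongly, so the first value dominates.
```

In a real transformer the queries, keys, and values are learned linear projections of the same token embeddings, and many such heads run in parallel; this sketch only shows the scoring-and-mixing core.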
Stop the sentence and then get it to predict.

### What's new in XLNet

From the same team that did BERT:
- two streams of attention
- long memory, like Transformer-XL
- loads of compute => better results

Fixing the masking problem: multiple MASK entries CLASH in BERT - words in sentences aren't independent, but MASK appearances are treated as if they were - and MASK never appears in real data, so actual real data is different from the training data.

Better hiding: a permutation process to omit certain tokens - rely on the positional encoding to preserve order. Solution: split the streams.

XL memory: need to make sure that the positional encoding 'joins up'.

Train on whole words, not just tokens -> whole-word masking gives BERT better results. Abandon the "next-sentence-or-not" task.

XLNet-Large - similar size to BERT-Large. Heavy-compute word generators.

### 1-minute glosses

- Distil the model to a CNN -> use the model to train a model to get the answer
- Adapter modules - don't update the original transformer - add in extra trainable layers -> "fix up" (which is effective): "Parameter-Efficient Transfer Learning for NLP"
- Last layer "graph layer"
- Multimodal learning -> MASK technique to "fill in" text and photos: VideoBERT

## Swift for TensorFlow: Paige Bailey

A next-gen framework for ML: ML arXiv papers exceed Moore's Law; rapid accuracy improvements.

[The Swift Programming Language](https://swift.org): Python-like but with typing; intuitive. Swift for TensorFlow allows you to differentiate any function just by adding the `@differentiable` attribute. Functional approaches for Swift: declare and then use an optimizer to update. Syntactically similar to Kotlin; useful anywhere C++ can go. Typed APIs; static detection of errors.

Interoperability: no wrappers - import and then call; import the C or C++ directly -> using a Python wrapper limits you to Pythonic single-threading. Works for Python too: `import Python` and then use it.

Differentiable programming: language-integrated autodiff - any function in S4TF is differentiable.

### Performance

Speedy low-level perf: parity with C++.
Thread-level scalability: no GIL -> no bottleneck in the data ingestion process. Automatic graph extraction.

## TFLite: On-device ML and the Model Optimization Toolkit - Jason Zaman

@perfinion / jason@perfinion.com

Use ML on the phone because everyone has one: lots of data that you can use but don't want to send over to a server, and you get an immediate reply.

How it works: it converts a model into a TFLite model.

Model optimization toolkit:
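The integer quantization mentioned in the talk abstract maps float weights to 8-bit integers through a scale and zero-point. A framework-free sketch of that affine scheme (not TFLite's actual implementation, which also handles per-channel scales and calibration):

```python
def quantize(values, num_bits=8):
    """Affine quantization: real ≈ scale * (q - zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)      # range must include 0.0
    scale = max((hi - lo) / (qmax - qmin), 1e-12)  # guard a degenerate range
    zero_point = round(qmin - lo / scale)    # the integer that maps to real 0.0
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats; error is bounded by the scale (step size)."""
    return [scale * (qi - zero_point) for qi in q]

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, scale, zp = quantize(weights)
approx = dequantize(q, scale, zp)  # each entry within one step of the original
```

Requiring the range to include 0.0 is what makes real zero exactly representable - important in practice because zero-padding and ReLU outputs are so common.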