# AWS re:Invent 2021 - Swami Sivasubramanian Machine Learning Keynote

## Opening

Swami Sivasubramanian, VP of Amazon AI, could not emphasize enough how important data is to reinvention: it helps companies make better decisions and stimulates innovation.

![](https://i.imgur.com/StQ5rWP.png)

The concept of big data was proposed in 1987, and since then data has grown bigger than we ever imagined. Most of it is unstructured: images, documents, spatial data and more. ***In fact, 80% of data today is unstructured.*** At this pace, data grows faster than most companies can keep track of, and turning the data a company has into actual value remains a challenge.

![](https://i.imgur.com/FdX9Dix.jpg)

"AWS provides the most comprehensive set of services for the entire end-to-end data, analytics, and ML journey for all workloads and all types of data", he notes, while giving a short overview of the tools and services AWS offers today.

![](https://i.imgur.com/k26goND.png)

This year, Sivasubramanian presented the three elements of a modern end-to-end data strategy:

- **Modernize** your data infrastructure - AWS has "unmatched maturity and experience"
- **Unify** your data - connect everything together into a coherent system that is secure and well-governed
- **Innovate** with your data - use AI and ML to come up with new approaches!

![](https://i.imgur.com/LLe5axD.png)

## Modernize data infrastructure

Sivasubramanian describes how customers have suffered from the tedious, time-consuming, and expensive work of managing infrastructure. On top of that come further concerns: system status and updates, software, data security, compliance, and so on. In addition, many data stores rely on third-party database engines, which cost extra license fees.

![](https://i.imgur.com/N50k37q.png)

Customers can already move their databases to an AWS managed database solution, Amazon Aurora, to relieve the pain described above. Amazon Aurora provides many benefits:

- Supports MySQL and PostgreSQL, at lower cost than self-managed databases
- Many companies have adopted it, for example Airbnb
- No need to manage infrastructure
- No need to worry about backups, updates, or configuration file maintenance
- No fixed asset cost

![](https://i.imgur.com/dnwPpP9.png)

Yet even with Amazon Aurora, users still have another requirement: **"Automatically diagnosing database performance problems"**

![](https://i.imgur.com/1iqTbgx.png)

Many factors can cause database performance problems, debugging them takes expensive time, and they are not easy to cure. Users hope for a service that provides the following functions or solves the following challenges:

![](https://i.imgur.com/yftVdDj.png)

This year, these worries are gone with the announcement of Amazon DevOps Guru for RDS. The new service can automatically detect, diagnose, and resolve hard-to-find database-related performance issues in minutes, not days.

![](https://i.imgur.com/K2F9f2L.png)

Next, Sivasubramanian discusses how many users need their own migration tools to move databases to the cloud, because the data originally sat in several different types of storage services, such as Microsoft SharePoint. They therefore need to optimize costs, make the best use of resources, and integrate data to keep their essential business applications running.

![](https://i.imgur.com/7BaBNhV.png)

Sivasubramanian announced the launch of Amazon RDS Custom, which supports customization of the database environment and the underlying operating system. Amazon RDS Custom also supports various SQL Server applications, such as SharePoint, Power BI, and PolyBase.

![](https://i.imgur.com/jD7yJZw.png)

Sivasubramanian also raised the question of why one would use a relational database in the cloud at all. As users migrate their data to the cloud, they also think about building applications. Because these applications have to serve millions of online customers, they need fast responses with low latency. So, what should we do?

![](https://i.imgur.com/Ai5RC2R.png)

DynamoDB is currently a solid choice. It is a fast, serverless key-value database. Operators do not need to worry about managing its infrastructure, because it scales its performance automatically as required. As more and more users rely on DynamoDB to process nearly tens of millions of API requests per day at millisecond latency, they have been asking for ways to optimize its cost.

![](https://i.imgur.com/oHsFNDI.png)

Sivasubramanian also pointed out that not every piece of data in an application is accessed frequently. Data that is rarely accessed still needs to be highly available, and moving it to an RDS-based option would be too complicated and expensive for the requirement.

![](https://i.imgur.com/sG7MgFU.png)

Sivasubramanian explained that AWS recognized this problem and launched the Amazon DynamoDB Standard-Infrequent Access table class in response. The new table class stores infrequently accessed data at lower cost, helping users reduce what they spend on DynamoDB. Suitable scenarios include application logs, outdated community posts, past business orders, and past game records.

![](https://i.imgur.com/4yVF2sw.png)

We can't emphasize enough the importance of data, yet moving on-premises data to the cloud is still troublesome: schemas must be converted and application code changed to match the migrated database.
Although migration tools exist to simplify these steps, drawing up a migration plan remains a challenge.

![](https://i.imgur.com/5ICmWc9.png)

![](https://i.imgur.com/di4FSa2.png)

One challenge users often face is that more than one database engine is in use, so they need to evaluate which database solution fits which service.

![](https://i.imgur.com/Rg1wKg3.png)

The whole process requires multiple database experts who specialize in different fields, or third-party suppliers, to evaluate the required performance, schema integration, and other factors and find the most suitable solution.

![](https://i.imgur.com/gXN6ID5.png)

With the announcement of AWS Database Migration Service Fleet Advisor, Sivasubramanian said, those problems are a thing of the past. AWS Database Migration Service Fleet Advisor is a new AWS DMS feature that automates the creation of migration plans for multiple databases, helping users simplify these processes. It streams the on-premises data to Amazon S3, evaluates which database service the data is suited to, and then generates a customized migration plan.

![](https://i.imgur.com/jwYUSzG.jpg)

## Unify

Many companies and organizations draw insights from large amounts of data in order to profit from it. To do so, they need to put all their data in one unified place and manage it appropriately, so that it serves data analysts, machine learning developers, and many others. When data is needed for analysis or training, technical staff can then access it quickly and conveniently to drive visualizations and model inference, which in turn supports business decisions. A data lake is the solution many users have adopted.

![](https://i.imgur.com/PL5Lhus.png)

Many enterprises and users now build their data lakes on AWS, where Amazon S3 is a good choice:

- It accommodates a variety of data types, and the total capacity is unlimited
- High availability and high durability
- Meets stringent compliance requirements
- Uses AI to help users switch storage classes and reduce costs
- Supports cross-region replication and backup

![](https://i.imgur.com/B704c2Z.png)

Once the data is unified, the next step is to let the whole team access it. Many people see this as a trade-off: giving everyone access is a worry, especially when data accumulates quickly and permissions are not easy to control.

![](https://i.imgur.com/d6KOqpJ.png)

AWS Lake Formation addresses this. It lets users create a secure data lake within a few days and provides fine-grained permission control, such as row-level and column-level management.

![](https://i.imgur.com/8BsFRce.png)

With Amazon Athena for data access, results come back quickly with no complicated ETL process required. Built-in ML capabilities also let analyst teams provide customers with model-driven inferences without specialized tools or infrastructure.

![](https://i.imgur.com/xyodRGg.png)

Scenarios and needs still differ, so AWS provides a wide range of services to cover every kind of data analysis. If you need a distributed cluster to run frameworks like Spark, Hadoop, and Presto, you can use Amazon EMR. To analyze a system's daily log records, you can use Amazon OpenSearch Service; for streaming data, there is Amazon Kinesis; and for warehousing large amounts of data, you can use Amazon Redshift.

![](https://i.imgur.com/Y5tzMDd.png)

With Amazon Redshift Serverless, launched at this re:Invent, Redshift becomes even easier to use.

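
To make the Athena point above a little more concrete, here is a minimal sketch of submitting a SQL query against a data lake through boto3's `start_query_execution` call. The database name, table, query, and S3 results bucket are hypothetical placeholders, and the helper functions are illustrative only. The request is built separately from the call so it can be inspected without AWS credentials.

```python
from typing import Any, Dict


def build_athena_request(sql: str, database: str, output_s3_uri: str) -> Dict[str, Any]:
    """Build keyword arguments for Athena's StartQueryExecution API."""
    if not output_s3_uri.startswith("s3://"):
        raise ValueError("output_s3_uri must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


def submit_query(request: Dict[str, Any]) -> str:
    """Submit the query and return its execution ID.

    Requires boto3 and valid AWS credentials, so it is not called below.
    """
    import boto3  # imported lazily so the request builder works without the SDK

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical database, table, and bucket names, for illustration only.
request = build_athena_request(
    sql="SELECT status, COUNT(*) AS orders FROM orders GROUP BY status",
    database="sales_lake",
    output_s3_uri="s3://example-athena-results/queries/",
)
```

Calling `submit_query(request)` would start the query; the results are written to the S3 location given in `ResultConfiguration`, with no ETL job or cluster to set up first.
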
![](https://i.imgur.com/XL5hBSB.png)

In addition to data analysis, users also need visual charts to spot patterns and trends in the data. Amazon QuickSight is the top solution for this.

![](https://i.imgur.com/gWNheim.png)

Since the launch of Amazon QuickSight Q, organizations no longer need to train users in query syntax. ***Now we can directly ask QuickSight a question and get answers as graphs***

![](https://i.imgur.com/Rqx5Ptf.png)

## Innovate

Innovation through machine learning has become the most representative technology of this era. We have been using machine learning to improve customer experience, make our operations more efficient, and stimulate many innovative developments.

![](https://i.imgur.com/4MIJuop.jpg)

AI-related services today can be roughly divided into three levels (IaaS, PaaS, SaaS).

![](https://i.imgur.com/tO2BrRB.png)

All machine learning relies on excellent infrastructure to keep innovating, so optimizing cost and performance has become crucial.

![](https://i.imgur.com/9wrD5KQ.png)

For inference, AWS now supports GPU instances as well as chips developed by AWS itself.

![](https://i.imgur.com/y06CbMm.png)

For training models, AWS has been providing new EC2 instance types for each specific need.

![](https://i.imgur.com/CrMvK55.png)

More recently, integrated ML tools, services, and functions have made a one-stop ML process possible. Amazon SageMaker, for example, provides fast and low-cost model building, training, and deployment.

![](https://i.imgur.com/ifNI1f5.png)

### Data Preparation

Data is the fuel of ML, and preparing it is the most time-consuming part of the entire process. Data divides into two types: structured and unstructured.

![](https://i.imgur.com/YyhFtHK.png)

#### Labeling

For the roughly 20% of data that is structured, you can easily use SageMaker Data Wrangler for pre-processing. For the 80% that is unstructured, much time has to be spent on labeling.
How can this be solved quickly and cost-effectively? We can use Amazon SageMaker Ground Truth to create training data sets with high efficiency and quality.

![](https://i.imgur.com/v599zIe.png)

But for some kinds of data, such as audio, users need even more precise labeling before automatic labeling can work out how to do it.

![](https://i.imgur.com/AEoQ8nu.png)

This year AWS announced **Amazon SageMaker Ground Truth Plus**. Users only need to describe the data set and the labeling requirements and fill out an application form for the labeling project. AWS then arranges a conference call to understand the requirements; that information is all it needs. After the data is uploaded to S3, a dedicated team of human labelers does the labeling. The subsequent steps resemble the original Ground Truth: an ML model learns to imitate the experts' labeling, which speeds up the process while maintaining accuracy.

### Model Building

When users build a model in SageMaker Studio Notebooks, the first priorities are selecting the most valuable data, pre-processing it, and analyzing it. Different Notebook Instances are started for different purposes, and team members with different needs launch multiple Notebook Instances to reach different toolkits. A single Notebook Instance that covers all of these analysis and processing needs would simplify the work and speed up the overall model-building process.

![](https://i.imgur.com/6fV9VPQ.png)

Sivasubramanian announced the launch of SageMaker Studio Notebook, which uses a unified Notebook Instance for data processing, analysis, training, and the machine learning workflow. Users can now work with different files or switch programming languages for different steps. SageMaker Studio Notebook supports Spark, Hive, and Presto running on EMR clusters, as well as data lakes on Amazon S3, as data sources, and can start training models directly.

![](https://i.imgur.com/FXtm5ph.png)

![](https://i.imgur.com/bEbu038.png)

### Training & Deployment

With machine learning developing rapidly, growing more complex, and producing ever larger models, performance and cost must be optimized. AWS has therefore launched three updates for SageMaker.

![](https://i.imgur.com/bhFCdoI.png)

#### SageMaker Training Compiler

Data scientists typically write models in Python with standard ML frameworks such as TensorFlow or PyTorch, and that code has to be converted into mathematical operations that GPU kernels can process. Even though many tools can optimize computing power and accelerate training, the conversion still depends on the performance of the ML framework itself and takes expensive time.

Sivasubramanian announced that these difficulties are now past with the launch of SageMaker Training Compiler. It provides a compiler that converts the Python code into a GPU-usable form without other third-party tools, reducing the memory and GPU resources consumed during training and ultimately the time training requires.

> News source: [SageMaker Training compiler](https://aws.amazon.com/tw/about-aws/whats-new/2021/12/amazon-sagemaker-training-compiler-dl-model-training-50/)

#### SageMaker Inference Recommender

In the past, users had to choose the instance type that would serve a model before deploying it, but each instance type has different strengths. They first had to understand the differences between instance types, then select some, test them, and compare the results to learn which one suited the workload best. That testing process also took a lot of time and money.

To solve this difficulty, AWS launched SageMaker Inference Recommender, which helps select the most suitable compute instance, weighing both cost and performance, and recommends it to users.
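
The cost-versus-performance trade-off described above can be illustrated in a few lines. This is only a sketch of the general idea, not the service's actual algorithm or API: the `BenchmarkResult` type, the `pick_instance` helper, the instance labels, and all numbers are hypothetical. It picks the cheapest candidate whose measured latency stays within a target.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BenchmarkResult:
    """One load-test measurement for a candidate instance (hypothetical data)."""
    instance_type: str
    p95_latency_ms: float
    cost_per_hour_usd: float


def pick_instance(
    results: List[BenchmarkResult], max_latency_ms: float
) -> Optional[BenchmarkResult]:
    """Return the cheapest candidate that meets the latency target, if any."""
    eligible = [r for r in results if r.p95_latency_ms <= max_latency_ms]
    if not eligible:
        return None
    # Cheapest first; ties are broken in favor of lower latency.
    return min(eligible, key=lambda r: (r.cost_per_hour_usd, r.p95_latency_ms))


# Made-up labels and numbers, for illustration only.
results = [
    BenchmarkResult("small-cpu", p95_latency_ms=180.0, cost_per_hour_usd=0.10),
    BenchmarkResult("large-cpu", p95_latency_ms=90.0, cost_per_hour_usd=0.40),
    BenchmarkResult("gpu", p95_latency_ms=25.0, cost_per_hour_usd=1.20),
]

choice = pick_instance(results, max_latency_ms=100.0)
print(choice.instance_type if choice else "no candidate meets the target")
```

Doing this by hand means deploying and load-testing every candidate yourself, which is the time and money the paragraph above refers to.
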
> News source: [SageMaker Inference recommender](https://aws.amazon.com/tw/about-aws/whats-new/2021/12/amazon-sagemaker-inference-recommender/)

#### SageMaker Serverless Inference (Preview)

SageMaker Serverless Inference is a new option for deploying inference endpoints. In the past, when deploying a model to an inference endpoint, you had to select the endpoint's instance type and later adjust it yourself as performance requirements changed. With Serverless Inference, you can skip choosing and managing the size of the endpoint.

> News source: [SageMaker Serverless Inference](https://aws.amazon.com/tw/about-aws/whats-new/2021/12/amazon-sagemaker-serverless-inference/)

![](https://i.imgur.com/c3RmZ0E.jpg)

#### SageMaker Canvas

Models can be trained, deployed, and tested without writing code:

- Multiple datasets can be combined through a graphical interface
- Train multiple models yourself and find the most suitable one
- It can show which factor affects the model the most

![](https://i.imgur.com/fGUbrXJ.png)

### AI Services

Today we are extending many solutions through machine learning. In the field of AI services, AWS has released two updates.

<!-- ![](https://i.imgur.com/mnOQG2c.png) -->

- Amazon Kendra Experience Builder: Users can already create an enterprise search engine quickly with Kendra, but IT still had to build an application to offer it to end users. With Kendra Experience Builder, the search engine can be deployed promptly as an app.

![](https://i.imgur.com/xeOBZpG.jpg)

![](https://i.imgur.com/3uneGiP.png)

- Amazon Lex Automated Chatbot Designer: Using machine learning and existing conversation records, chatbots can be built within a few hours, including creating intents and messages. This accelerates chatbot design, reduces errors, and improves the customer experience.

![](https://i.imgur.com/UjVpuzj.jpg)

![](https://i.imgur.com/EfxOh97.png)

For ML technology to keep innovating and developing, we must also educate more developers. To do that, we must first lower the threshold for learning the technology and make the tools easier for everyone to use.

![](https://i.imgur.com/5OfNczf.png)

According to research, today's tools sit at two extremes: they are either too simple or too complicated for practicing machine learning effectively, which is why practicing in the cloud is attractive. Free notebook services do exist, but users cannot save their sessions, have to restart each time, and cannot quickly transfer their work. As a result, they cannot focus wholeheartedly on machine learning.

![](https://i.imgur.com/TGVr9N3.png)

AWS launched Amazon SageMaker Studio Lab to solve this problem, and it is free to use! You only need to register with an email address through a web browser. You can then develop in a Jupyter IDE, just as in Amazon SageMaker, and you get 15 GB of storage. You don't need to set up a development environment.

![](https://i.imgur.com/kBMfpWk.jpg)

### ML Training Program

AWS and Udacity have cooperated to launch the AWS AI & ML Scholarship Program, dedicated to training more people to become ML developers. It provides excellent development plans and learning resources so that ML technology can flourish in the new generation!

![](https://i.imgur.com/bsc8n3j.png)

## Conclusion

***"A data-driven organization is imperative for the future"*** - Swami Sivasubramanian

This message runs through the three topics of **Modernize**, **Unify**, and **Innovate**. Each topic builds on what came before, and more valuable solutions keep being released for the pain points that organizations have encountered, past and present, in becoming data-driven. Adopting them early helps companies accelerate the establishment of a data-driven organization and achieve corporate transformation.

![](https://i.imgur.com/JVw0UrK.png)

###### tags: re:invent 2021