These notes are based on Google's Generative AI course.
# Introduction to Generative AI
About AI, ML, and DL
:::spoiler
:::info
About AI, ML, and DL
1. AI is the theory and development of computer systems able to perform tasks normally requiring human intelligence.

2. ML has two types of models : **Unsupervised ML models** AND **Supervised ML models**

3. Deep learning is a subset of ML

4. Generative AI is a subset of DL

5. Large Language Models are also a subset of DL

6. Deep Learning Model Types : **Discriminative** AND **Generative**
>Discriminative
>>Used to classify or predict
>>Typically trained on a dataset of **labeled data**
>>Learns the **relationship** between the features of the data points and the labels
>>
>Generative
>>Generates new data that is **similar to data** it was trained on
>>**Understands distribution of data** and how likely a given example is
>>**Predict next word** in a sequence
>>
:::
GenAI
:::spoiler
-----------------------------
:::info
GenAI
1. Generative models can **generate new data** instances, while discriminative models discriminate between different kinds of data instances.

2. GenAI or not

>**NOT GenAI** if **y** is a
>>Number
>>Discrete
>>Class
>>Probability
>**IS GenAI** if **y** is a
>>Natural language
>>Image
>>Audio
3. What is GenAI
>1. GenAI creates new content based on what it has learned from existing content
>2. When given a prompt, GenAI uses a statistical model to predict an expected response, generating new content
4. GenAI model

5. Generative language models : LaMDA, PaLM, GPT, etc.
6. Types of GenAI Based on Data
```mermaid
graph LR
A[Input <br> Text]
A --> B[Output <br> Text]
A --> C[Output <br> Image]
A --> D[Output <br> Audio]
A --> E[Output <br> Decisions]
B --> F[1.Translation<br>2.Summarization<br>3.Question Answering<br>4.Grammar Correction]
C --> G[1.Image Generation<br>2.Video Generation]
D --> H[Text To speech]
E --> I[Play Games]
```
7. How GenAI Works
**Pre-Training**
>1. Large amount of Data
>2. Billions of parameters
>3. Unsupervised learning

8. Challenges
>1. The model is not trained on **enough data**
>2. The model is trained on **noisy or dirty data**
>3. The model is not given enough context
>4. The model is not given enough constraints
9. GenAI Model Types
>1. text-to-text
>>1. take a natural language input
>>2. produce text output
>>3. trained to learn the mapping between a pair of texts
>>4. Applications : Generation,Classification,Summarization,Translation,Search,Extraction,Clustering,Content editing/rewriting
>2. text-to-image
>>1. trained on a large set of images,each captioned with a short text description
>>2. Applications : Image generation,Image editing
>3. text-to-video/text-to-3D
>>1. generate a video representation **from text input**
>>2. input text can be anything from a **single sentence** to a **full script**
>>3. output is a video that corresponds to the input text
>>4. Applications : Video generation,Video editing,Game assets
>4. text-to-task
>>1. trained to perform a specific task or action based on text input
>>2. this task can be
>>>1. a wide range of actions such as answering a question
>>>2. performing a search
>>>3. making a prediction
>>>4. taking some sort of action
>>3. Applications : Software agents,Virtual assistants,Automation
>>
10. GenAI Application

:::
GenAI Application in Google
:::spoiler
:::success
GenAI Application in Google
## text-to-text : Bard
About Bard

Here are some differences from ChatGPT:

## text-to-Image : [Imagen](https://imagen.research.google/)
## text-to-Audio : [AudioLM](https://google-research.github.io/seanet/audiolm/examples/)
## text-to-Video : [Phenaki](https://phenaki.research.google/)
:::
---
# Large Language Models (LLMs)
:::spoiler
:::info
1. LLMs are a subset of ML

2. What are **large language models**
>1. **Large**,**general-purpose** language models can be **pre-trained** and then **fine-tuned** for specific purposes.
>2. LLMs are trained to **solve common language problems** , like Text classification , QA , Document summarization , Text generation.
>3. LLMs are also tailored to solve specific problems in different fields , like Retail , Finance , Entertainment . Trained with a relatively **small size of field datasets**
3. LLMs three major features
>1. Large
>>1. Large training dataset
>>2. Large number of parameters(hyperparameters)
>>3. parameters define the skill of models
>2. General purpose
>>1. means the models are sufficient to solve common problems
>>2. Commonality of human languages
>>3. Resource restriction
>3. Pre-trained and fine-tuned
>>1. meaning to pre-train an LLM for a general purpose with a large dataset
>>2. and then fine tune it for specific aims with a much smaller data set
4. Benefits of using LLMs
>1. A single model can be **used for different tasks**.
>2. The fine-tuning process requires **minimal** field data.
>3. The performance keeps **growing** with more data and parameters.
5. Pathways Language Model (PaLM)
>:::danger
>>1. Is a dense **decoder-only** transformer model
>>2. has **540 billion** parameters
>>3. Leverages the new Pathways system
>>
>Pathways is a new AI architecture that
>>1. will handle many tasks **at once**
>>2. learn new tasks **quickly**
>>3. reflect a better understanding of the world
>>4. Orchestrates distributed computation for accelerators
>>
>Transformer model two units :
>**Encoding Component** AND **Decoding Component**
>
>>1. **Encoder** encodes the input sequence and passes to the decoder
>>2. **Decoder** decodes the representations for a relevant task
>:::
6. LLM Development vs. Traditional Development
| LLM Development (using pre-trained APIs) | Traditional Development |
| -------- | -------- |
| **NO ML** expertise needed | **YES ML** expertise needed|
|**NO training** examples|**YES training** examples|
|**NO need** to train a model|**YES need** to train a model|
||**YES compute time + HW**|
|Thinks about prompt design|Thinks about minimizing a loss function|
7. Generative QA
>:::danger
>QA in Natural Language Processing
>>1. QA models are able to retrieve the answer to a question from a given text, useful for searching for an answer in a document.
>>2. Depending on the model used, the answer can be **directly extracted from the text** or **generated from scratch**.
>>3. QA requires **domain knowledge**
>>>1. **Big Tech** : provide Tech support to customers.
>>>2. **Consumer** : tailored messaging to individual consumers.
>>>3. **Education** : provide info on courses , tuition , policy.
>>>4. **Media** : provide info on subscriptions and services.
>>>5. **Pharma and Healthcare** : provide info for patient self-management.
>>>6. **Retail** : provide better chatbots ; product visualisation.
>>>7. **Supply Chain** : provide logistics info on inventory.
>:::
but in Generative QA :
>1. Generates free text directly based on the context.
>2. It leverages Text Generation models.
>3. No need for domain knowledge.
8. Prompts and Prompt Engineering
>**Prompt Design** : Prompts involve instructions and context passed to a language model to achieve a desired task.
>**Prompt Engineering** : Prompt engineering is the practice of developing and optimizing prompts to use language models efficiently for a variety of applications.
9. Three main kinds of LLM,each **needs prompting in a different way**.
>1. **Generic (or Raw) Language Models**
>>These **predict the next word**(technically token) based on the language in the training data.
>>

>2. **Instruction Tuned** language model
>>1. Trained to **predict a response** to the instructions given in the input.

>>2. Elements of the Prompt

>3. **Dialog Tuned**
>>1. Trained to have a **dialog by predicting the next response**.
>>2. **framed as questions** to a chat bot.
>>3. typically works better with **natural question-like phrasings**.
>>

>>4. Chain of thought reasoning :
>>>1. Models are better at getting the right answer when they **first output text that explains the reason** for the answer
>>>

>>>2. Asked directly, the model is less likely to get the correct answer; when it reasons first, the output is more likely to end with the **correct answer**.

>**Observation**
>>A model that can do everything has practical limitations.
>>Task-specific tuning can make LLMs more reliable.
10. Tuning
>The process of **adapting a model** to a new domain or **set of custom** use cases by training the model on new data.
11. Fine tuning
>Bring your own dataset and **retrain the model** by tuning every weight in the LLM.
>
>This requires a **big training job** and hosting your own fine-tuned model.
An example of a medical foundation model trained on healthcare data.

12. PaLM API & MakerSuite Simplify the Generative Development Cycle

>The suite includes a number of different tools
>**model-training tool** helps developers train ML models on their data using different algorithms
>
>**model-deployment tool** helps developers deploy ML models to production with a number of different deployment options
>
>**model-monitoring tool** helps developers monitor the performance of their ML models in production using a dashboard and a number of different metrics.
:::
---
# Introduction to Image Generation
:::spoiler
:::info

1. Image Generation Model Families
>Variational Autoencoders - VAEs
>>1. Encode images to a compressed size , then decode back to the original size , while learning the distribution of the data.
>Generative Adversarial Models - GANs
>>1. Pit two neural networks against each other.
>>2. One neural network, the generator, creates images,
>>3. and the other neural network, the discriminator, predicts if the image is real or fake.
>>4. The discriminator gets better and better at distinguishing real from fake, and the generator gets better and better at creating real-looking fakes (deepfakes).
>Autoregressive Models
>>1. Generate images by treating an image as a sequence of pixels.
>>2. the modern approach with autoregressive models actually draws much of its inspiration from how LLMs, or large language models, handle text.
2. Diffusion Model : New trend of generative model

>Diffusion models draw their inspiration from physics, specifically thermodynamics.
>While they were first introduced for image generation,
>they underpin many state-of-the-art image generation systems
>and show promise across a number of different use cases.
>
3. Diffusion Modes : What is it?
>1. The essential idea is to systematically and slowly destroy structure in a data distribution through an **iterative forward diffusion process**.
>2. We then **learn a reverse diffusion process** that restores structure in data, yielding a highly flexible and tractable generative model of the data.
>
>DDPMs (denoising diffusion probabilistic models) need to learn to de-noise.
>
>
>
>
DDPM Generation

>1. Start with pure, absolute noise.
>2. Send that noise through our trained model.
>3. Take the output, the predicted noise.
>4. Subtract it from the initial noise.
>5. Do that over and over again, and we end up with a generated image.
>
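The five steps above can be sketched as a toy loop. `noise_model` here is a hypothetical stand-in for the trained noise-prediction network; a real DDPM uses a learned U-Net and a proper noise schedule.

```python
import random

def noise_model(x):
    # hypothetical stand-in for the trained noise-prediction network:
    # it "predicts" half of the current value as noise
    return [0.5 * v for v in x]

random.seed(0)
x = [random.gauss(0, 1) for _ in range(8)]  # 1. start with pure noise
for _ in range(20):                         # 5. repeat over and over
    predicted_noise = noise_model(x)        # 2-3. run the model, take the predicted noise
    x = [xi - ni for xi, ni in zip(x, predicted_noise)]  # 4. subtract it
# each pass removes the predicted noise, converging toward the data mode (0 here)
```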

Combined with LLMs, diffusion models enable us to create context-aware, photorealistic images from a text prompt. An example is Imagen from Google Research.

What are some challenges of diffusion models?
>1. They can generate images that are not realistic.
>2. They can be difficult to control.
>3. They can be computationally expensive to train.

What is the goal of diffusion models?
>To learn the latent structure of a dataset by modeling the way in which data points diffuse through the latent space.
:::
---
# Encoder-Decoder Architecture Overview
:::spoiler
:::info
1. The Encoder-Decoder architecture is a sequence-to-sequence architecture.

2. The two stages

>These two stages can be implemented with different internal architectures, not only RNN.
3. Encoder

>1. Takes each token in the input sequence one at a time
>2. Produces a state representing this token as well as all the previously ingested tokens.

4. Decoder
>1. Takes the vector representation of the input sentence and produces an output sentence from that representation.

5. Training
>1. A dataset that is a collection of I/O pairs that you want your model to imitate.
>2. Feed that data set to the model
>3. Need a collection of input and output texts.
>4. Compute the error between what the decoder generates and the actual translation.
>5. Need to give the decoder the correct previous translated token as input to generate the next token.
>
6. Serving

>1. The start token needs to be represented by a vector using an embedding layer.
>2. Then the recurrent layer will update the previous state produced by the encoder into a new state.
>3. Passed to a dense softmax layer to produce the word probabilities.
>4. The word is generated by taking the highest-probability word with Greedy Search, or the highest-probability chunk with Beam Search.
>5. Repeat 1-4 until finished.
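The serving steps above can be sketched as a toy greedy-search loop; the `next_word_probs` table is hypothetical, standing in for the embedding, recurrent, and dense softmax layers.

```python
# Hypothetical next-word distributions standing in for the dense softmax layer.
next_word_probs = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the":     {"cat": 0.7, "dog": 0.3},
    "cat":     {"<end>": 0.9, "sat": 0.1},
}

def greedy_decode(start="<start>", max_len=10):
    word, out = start, []
    for _ in range(max_len):
        probs = next_word_probs.get(word, {"<end>": 1.0})
        word = max(probs, key=probs.get)  # greedy search: highest-probability word
        if word == "<end>":
            break
        out.append(word)
    return out
```

Beam Search would instead keep the several highest-probability partial sentences at each step.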
7. History

:::
---
# Attention Mechanism : Overview
:::spoiler
:::info
1. The Encoder-Decoder translates one word at a time, but sometimes words in the source language do not align with words in the target language.


>In this example, "black" is the first word in English, but "chat", which means "cat", is the first word in French.

>so we can add **Attention mechanism** in Encoder-Decoder to focus on **specific parts** of an input sequence.
>
2. Traditional RNN encoder-decoder

>1. The model takes one word at a time as input, updates the hidden state,
>and passes it on to the next time step.
>2. In the end, only the final hidden state is passed on to the decoder.
>3. The decoder works with the final hidden state for processing and translates it to the target language.
3. Attention model differs from a traditional model
>1. Passing more data from encoder to the decoder.
>>the encoder passes all the hidden states from each time step.
>>this gives the decoder more context beyond just the final hidden state.
>>
>>the decoder uses all the hidden state information to translate the sentence.
>>
>2. adding an extra step to the attention decoder before producing its output.
>>1. Look at the set of encoder hidden states that it received.
>>>each encoder **Hidden State** is associated with a certain word in the input sentence.
>>2. Give each hidden state a score.
>>3. Multiply each hidden state by its soft-maxed score.

>>>amplifying hidden states with the **highest scores** .
>>>downsizing hidden states with low scores.
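The extra attention step can be sketched in plain Python, using dot-product scores as a hypothetical scoring function (a real attention decoder learns how to score):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical encoder hidden states (one per source word) and a decoder state
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
decoder_state = [0.0, 1.0]

# step 2: give each hidden state a score (here, a dot product with the decoder state)
scores = [sum(h * d for h, d in zip(hs, decoder_state)) for hs in hidden_states]
# step 3: softmax the scores and weight each hidden state by its score
weights = softmax(scores)
context = [
    sum(w * hs[i] for w, hs in zip(weights, hidden_states))
    for i in range(len(decoder_state))
]
# high-scoring hidden states are amplified in the context; low-scoring ones are downsized
```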

:::
## Text Generation Example from the Course
:::spoiler
### Importing and inspecting the dataset for preprocessing
First, load the dataset. The dataset is one complete string of all of Shakespeare's plays; name it text.

Inspect the dataset.

Count the number of unique characters.
For example, for the input aabbcc the unique characters are the three letters a, b, c.

Looking at vocab in detail, besides the 26 English letters there are also other symbols, and uppercase and lowercase are distinct.
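The unique-character count and the character-to-ID mapping can be sketched in plain Python; the notebook uses tf.keras.layers.StringLookup, and this stand-in only illustrates the idea.

```python
# minimal stand-in for the notebook's StringLookup step
text = "aabbcc"
vocab = sorted(set(text))                       # unique characters
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

ids = [char_to_id[ch] for ch in text]
restored = "".join(id_to_char[i] for i in ids)  # reverse mapping: IDs -> text
```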

### Text processing: converting to IDs
1. First use tf.keras.layers.StringLookup, which converts each character into a numeric ID.
The text must first be split into tokens.

2. Then convert to IDs.
In the figure, b'a' maps to 40 and b'x' maps to 63.

3. The reverse mapping, from IDs back to characters.

### Building training examples and targets
For example, if the complete phrase is "Hello", the training example uses the leading "Hell" and the model's target is the trailing "ello".
1. Check the shape of the set.

2. Store the text in the dataset as IDs.

3. The first 15 IDs convert back to the text below.

4. Use batch to turn the characters into sequences of 100 character IDs, matching the input requirements.
You can see the text below: First_Citizen:\nBefore_we_........ forms a sequence.

Combine them and inspect the first 5 sequences with numpy.

5. Process each sequence into an input and a target.
Define a function and feed all the sequences into it:
input_text is the sequence with the last character dropped, [:-1];
target_text is the sequence with the first character dropped, [1:].

For example, feeding in "Tensorflow":
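The split described in step 5 can be sketched in plain Python, mirroring the notebook's split_input_target (note the target drops the first element, i.e. [1:]):

```python
def split_input_target(sequence):
    # input: drop the last element ([:-1]); target: drop the first ([1:])
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

inp, tgt = split_input_target(list("Tensorflow"))
# inp is "Tensorflo" as characters; tgt is "ensorflow" as characters
```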

6. Feed both kinds of sequences into the dataset.
The input sequences are built by truncating the last character of the original sequence.
The target sequences are built by truncating the first character of the original sequence.
Concretely, mapping the split_input_target function over the sequence dataset does this.

7. Turn the data into training batches.

### Model
1. First set the variables.

2. In class MyModel, build the required network layers and the connections between them.
First look at the whole model.

In detail, first build the layers:

>1. An embedding table, trained with the chosen embedding dimension.

>2. The RNN (GRU) layer.

>3. A dense layer that outputs the resulting logits.

Then build the connections between the layers in the call function:

>1. Build the embedding lookup for each ID.
Training relies on adjusting with the state fed back from earlier steps; when there is no earlier feedback an initial state is used instead, so an embedding is built for every ID.

>2. Build the recurrent layer, which adjusts based on the earlier feedback.

Wrap the model just defined in the variable model, and pass in the variables set at the beginning.

### Try the Model
First check each of the model's inputs.

Check input_example_batch and target_example_batch: because input and target each drop the first or last position from the original 65, the length is 64.

Earlier the characters were packed into groups of 100, i.e. sequence_length.

vocab_size, confirmed earlier, is 65.

**But here it is 66; presumably this is the extra [UNK] token that StringLookup adds.**
Check the model architecture with summary.


embedding 16,896 = vocab size 66 × the embedding_dim of 256 set earlier
gru 3,938,304 = the rnn_units of 1024 set earlier × 3,846
dense 67,650 = vocab size 66 × 1,025
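These counts can be checked by hand; a sketch assuming vocab_size = 66, embedding_dim = 256, and rnn_units = 1024 (the Keras GRU's default reset_after=True keeps two bias vectors per gate, hence the +2):

```python
vocab_size, embedding_dim, rnn_units = 66, 256, 1024

embedding_params = vocab_size * embedding_dim
# Keras GRU with the default reset_after=True has 2 bias vectors per gate:
gru_params = 3 * rnn_units * (embedding_dim + rnn_units + 2)
dense_params = (rnn_units + 1) * vocab_size  # 1024 weights + 1 bias per output unit
```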
==GRU is a simplified LSTM: it merges the forget gate and the input gate into an update gate, and merges the cell state with the hidden state.==
==The forget and input gates exist because a plain NN cannot use earlier data, so training overfits or skews toward certain data; gates let earlier data flow in and control whether it is forgotten.==
==GRU keeps the gate values in the range 0-1: the closer to 1, the more of the input is kept; the closer to 0, the more is forgotten. Merging the two gates above overcomes the LSTM's drawback of multiple control values.==
Draw random samples.


### Train the model
First set up a loss variable to monitor the loss.

Use the Adam optimizer.

Save the weights during training.

Train for 10 epochs.

**The trained model cannot be used directly; it needs a decoder to turn the IDs back into text.**

The key here is generate_one_step.

Given a starting word, the text that follows can be generated.


The same start can also be used to generate several different passages.

:::
# Transformer Models and BERT Model: Overview
:::spoiler
:::info
1. A transformer is an encoder-decoder model that can
>1. Take advantage of parallelization on GPUs/TPUs.
>2. Process much more data in the same amount of time.
>3. ==Process all tokens at once.==
2. Transformer models were built with the attention mechanism at their core, which helps ==improve the performance of machine translation applications.==

3. A Transformer model consists of an encoder and a decoder.

>The encoder encodes the input sequence and passes it to the decoder.
>The decoder decodes the representation for a relevant task.
4. The encoding component is a stack of identical encoders.

>The original Transformer stacks six encoders on top of each other; six is a hyperparameter.
5. Each encoder can be broken down into two sub layers.
>1. ==self attention==
>>helps the encoder look at relevant parts of the input words
>2. ==feedforward layer==
>>The output of the self attention layer is fed to the feedforward neural network.

6. The decoder has both the self attention and the feedforward layer,but between them is the ==encoder decoder attention layer==.
>1. encoder decoder attention layer helps a decoder focus on relevant parts of the input sentence.
7. After embedding the words in the input sequence, each embedding vector ==flows through the two layers of the encoder.==

>1. The word at each position passes through a self attention process.
>2. Then it passes through a feedforward neural network
>3. ==Dependencies exist between these paths in this self attention layer.==
8. In the self attention layer, the input embedding is broken up into ==query, key, and value vectors.==

>1. All of these computations happen in parallel in the model, in the form of matrix computation.
>2. Once we have the query, key, and value vectors, the next step is to ==multiply each value vector== by the ==softmax score== in preparation to ==sum them up.==
>3. **The intention here is to ==keep intact the values of the words you want to focus on== and ==drown out irrelevant words== by multiplying them by tiny numbers like 0.001.**
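The value-weighting described above can be sketched for a single query position; the query, key, and value vectors here are hypothetical toy numbers (real models derive them from learned weight matrices):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 2  # toy key dimension
q = [1.0, 0.0]                       # query for one word position
keys = [[1.0, 0.0], [0.0, 1.0]]      # keys of the two words in the sequence
values = [[10.0, 0.0], [0.0, 10.0]]  # their value vectors

# scaled dot-product scores, then softmax
scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
weights = softmax(scores)
# multiply each value vector by its softmaxed score and sum them up
z = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_k)]
# the word matching the query keeps its value largely intact; the other is scaled down
```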
9. Process of getting the final ==z== embeddings

>1. Input natural language sentence

>2. embed each word in the sentence.

>3. perform multi-headed attention (eight times in this case) and multiply each embedded word with the ==respective weight matrices.==

>4. calculate the attention using the resulting QKV matrices.

>5. concatenate the resulting matrices to produce the final output matrix ==z==

10. There's multiple variations of transformers out there now.

BERT Overview
1. BERT works by being trained on two different tasks.
>1. ==masked language modeling==
>>where parts of the ==sentences are masked==
>>and the model is ==trained to predict the masked words.==
>>The recommended percentage for masking is ==15%==.
>>==Too little== masking makes the training process extremely expensive,
>>while ==too much== masking removes the context the model requires.

>2. next sentence prediction (NSP)

>>the model is given two sets of sentences. BERT aims to ==learn the relationships== between sentences and predict the next sentence given the first one.
>>This is a ==binary classification task.==
>>This helps BERT perform at the sentence level.
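The masking setup can be sketched as follows; the sentence is a made-up example and the helper mask_tokens is illustrative (real BERT also sometimes replaces a chosen token with a random or unchanged word instead of [MASK]):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=42):
    # masked language modeling: hide ~15% of the tokens; the model would be
    # trained to predict the original word behind each [MASK]
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]
    return masked, positions

tokens = "the quick brown fox jumps over the lazy dog near the old red barn".split()
masked, positions = mask_tokens(tokens)
```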
2. BERT input embeddings

In order to train BERT, you need to ==feed three different kinds of embeddings== to the model for the input sentence.
>1. **token embeddings**
>>1. The token embeddings is a representation of each token as an embedding in the ==input sentence.==
>>2. The words are transformed into vector representations of certain dimensions.
>>3. BERT can solve NLP problems.
>2. **segment embeddings**
>>1. Segment embeddings let BERT distinguish the inputs in a given pair.
>>2. There is a special token represented by SEP that separates the two different splits of the sentence.
>3. **position embeddings**
>>1. This allows Bert to learn a vector representation for each position.
>>2. to let model ==learn the order of the words in the sentence.== The order of the input sequence is incorporated into the position embeddings.
>>3. Bert consists of a ==stack== of transformers , so it is designed to process ==input sequences== up to a ==length of 512.==
:::
# Create Image Captioning Models: Overview
:::spoiler
:::info
1. dataset

>1. there are a lot of pairs of images and text data
>2. **our goal** is to build and train a model that ==can generate these kinds of text captions based on images.==
2. Extract features using an image backbone.
It is a kind of encoder-decoder model, but in this case, the encoder and decoder ==handle different modalities of data==: image and text.

>1. pass images to ==encoder== at first
>2. it extracts information from the images
>3. creates some feature vectors.
>4. And then the vectors are passed to the ==decoder==
>5. build captions by generating words, one by one.
You can use any kinds of image backbone, like ResNet, EfficientNet, or Vision Transformer.
3. The decoder
this is the entire architecture of the decoder

>1. It gets words ==one by one==
>2. It ==merges the information of words and images==, which comes from the encoder output, and tries to predict the next word.
So the decoder itself is an iterative operation; by calling it again and again autoregressively, we can eventually generate text captions.
The words are ==passed to a GRU layer==.

The GRU output goes to the **attention layer**, which
>1. ==mixes the information of text and image.==
>2. pays attention to the image features from the text data.
>3. can ==calculate an attention score== by mixing both sources of information.

this attention layer takes two inputs-- ==gru_output== and ==encoder_output==.
>gru_output is used as the attention query, and encoder_output as the key and value.

Add layer and layer normalization layer.

>1. Add layer just adds two same-shaped vectors.
>2. gru_output is passed to the attention layer, as we discussed, and to this add layer directly.
This kind of architecture is called a ==skip connection== or ==residual connection==, used especially when you want to design a very deep neural network.
4. Inference Loop Overview

In the **inference phase**, we can actually generate captions for our images.
>1. Generate the GRU initial state and ==create a start token.==

1. at the beginning of each captioning, we explicitly initialize the gru_state with some value.
2. And at the same time, ==our decoder is an autoregressive function.==
3. since we haven't got any word prediction yet at the beginning of the inference, ==we pass the start token==.
>2. ==pass an input image to the encoder== and extract a feature vector.

>3. ==pass the vector to the decoder==, and generate caption words in a for loop
(until it returns the end token, or reaches a maximum length set by a hyperparameter, such as 64.)

# Image Captioning with Visual Attention
:::spoiler
:::info
## Goals
1. Learn how to create an image captioning model
2. Learn how to train and predict with a text generation model.
## Materials
1. data
The training dataset is the ==COCO large-scale object detection, segmentation, and captioning dataset.==
2. constants
>1. use a pretrained InceptionResNetV2 model from **tf.keras.applications** as a feature extractor, so some constants come from the InceptionResNetV2 model definition.
>2. **tf.keras.applications** is a ==pretrained model== repository like TensorFlow Hub, but tf.keras.applications only hosts popular and stable models ==for images==.
## Getting started
1. Load the dataset and set the variables.

2. Filter and Preprocess
>resize the image to (IMG_HEIGHT, IMG_WIDTH) shape
>rescale pixel values from [0, 255] to [0, 1]
>return an image (image_tensor) and captions (captions) dictionary.
```python
GCS_DIR = "gs://asl-public/data/tensorflow_datasets/"
BUFFER_SIZE = 1000


def get_image_label(example):
    caption = example["captions"]["text"][0]  # only the first caption per image
    img = example["image"]
    # resize image to (IMG_HEIGHT, IMG_WIDTH) shape
    img = tf.image.resize(img, (IMG_HEIGHT, IMG_WIDTH))
    # rescale pixel values from [0, 255] to [0, 1]
    img = img / 255
    # return image (image_tensor) and caption dictionary
    return {"image_tensor": img, "caption": caption}


trainds = tfds.load("coco_captions", split="train", data_dir=GCS_DIR)
trainds = trainds.map(
    get_image_label, num_parallel_calls=tf.data.AUTOTUNE
).shuffle(BUFFER_SIZE)
trainds = trainds.prefetch(buffer_size=tf.data.AUTOTUNE)
```
3. Run a few examples through the get_image_label() just defined.
```python
f, ax = plt.subplots(1, 4, figsize=(20, 5))
for idx, data in enumerate(trainds.take(4)):
    ax[idx].imshow(data["image_tensor"].numpy())
    caption = "\n".join(wrap(data["caption"].numpy().decode("utf-8"), 30))
    ax[idx].set_title(caption)
    ax[idx].axis("off")
```
The resulting plot:

You can see each data point contains an image and a text description.
# Text Preprocessing
==Add special tokens== to represent the starts (start) and the ends (end) of sentences.

# Preprocess and tokenize the captions
==transform the text captions into integer sequences== using the **TextVectorization** layer
1. Use adapt to iterate over all captions, split the captions into words, and compute a vocabulary of the top VOCAB_SIZE words.
2. ==Tokenize all captions== by mapping each word to its index in the vocabulary. All output sequences will be padded to the length MAX_CAPTION_LEN.
```python
MAX_CAPTION_LEN = 64


# We will override the default standardization of TextVectorization to preserve
# "<>" characters, so we keep the tokens for <start> and <end>.
def standardize(inputs):
    inputs = tf.strings.lower(inputs)
    return tf.strings.regex_replace(
        inputs, r"[!\"#$%&\(\)\*\+.,-/:;=?@\[\\\]^_`{|}~]?", ""
    )


# Build the tokenizer, mapping words such as "dog" and "cat" to indices.
# Choose the most frequent words from the vocabulary & remove punctuation etc.
tokenizer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize=standardize,
    output_sequence_length=MAX_CAPTION_LEN,
)
tokenizer.adapt(trainds.map(lambda x: x["caption"]))
```
Inspect a caption after it is turned into tokens.

It has 6 words, so the first 6 entries are non-zero.
Of course, the reverse mapping is also needed.

# Create a tf.data dataset for training
Now let's apply the adapted tokenization to all the examples and create a tf.data Dataset ==for training.==
==Create labels by shifting the texts of the feature captions.==
(If we have an input caption "(start) I love cats (end)", its label should be "I love cats (end) (padding)".
With that, our model can try to learn to predict "I" from "(start)".)
```python=
BATCH_SIZE = 32


def create_ds_fn(data):
    img_tensor = data["image_tensor"]  # img_tensor will go to the encoder
    caption = tokenizer(data["caption"])  # caption will go to the decoder
    # target (the label) is created by shifting the caption one word to the left
    target = tf.roll(caption, -1, 0)
    zeros = tf.zeros([1], dtype=tf.int64)
    target = tf.concat((target[:-1], zeros), axis=-1)
    return (img_tensor, caption), target


# map create_ds_fn over trainds and batch the result
batched_ds = (
    trainds.map(create_ds_fn)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)
```
Check the data.


# Model
## Image Encoder
Image Encoder extracts features through a pre-trained model and passes them to a fully connected layer
```python=
FEATURE_EXTRACTOR.trainable = False
# freeze most of this CNN;
# we extract the features from the convolutional layers of InceptionResNetV2,
# which gives us a vector of (Batch Size, 8, 8, 1536)
image_input = Input(shape=(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
image_features = FEATURE_EXTRACTOR(image_input)
x = Reshape((FEATURES_SHAPE[0] * FEATURES_SHAPE[1], FEATURES_SHAPE[2]))(
    image_features
)  # reshape the vector to (Batch Size, 64, 1536)
encoder_output = Dense(ATTENTION_DIM, activation="relu")(x)
# squash it to a length of ATTENTION_DIM with a Dense layer,
# returning (Batch Size, 64, ATTENTION_DIM);
# the Attention layer attends over the image to predict the next word
```
Check the encoder.

## Decoder

As shown in the figure, add the various layers.


# Loss Function
```python
# This is a classification problem: the decoder produces a distribution over
# the vocabulary at each position, but our data is padded, so the zero
# (padding) values must be masked out of the loss.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none"
)


def loss_function(real, pred):
    loss_ = loss_object(real, pred)
    # mask is 1 at word indices and 0 at padding (e.g. [1,1,1,1,1,0,0,0,...,0])
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    mask = tf.cast(mask, dtype=tf.int32)
    sentence_len = tf.reduce_sum(mask)
    loss_ = loss_[:sentence_len]
    return tf.reduce_mean(loss_, 1)


image_caption_train_model.compile(
    optimizer="adam",
    loss=loss_function,
)
```
Next, generate captions.

```python=
MINIMUM_SENTENCE_LENGTH = 5


## Probabilistic prediction using the trained model
def predict_caption(filename):
    gru_state = tf.zeros((1, ATTENTION_DIM))  # simply initialize with a zero vector
    img = tf.image.decode_jpeg(tf.io.read_file(filename), channels=IMG_CHANNELS)
    img = tf.image.resize(img, (IMG_HEIGHT, IMG_WIDTH))
    img = img / 255
    features = encoder(tf.expand_dims(img, axis=0))
    # feed <start> as the first word
    dec_input = tf.expand_dims([word_to_index("<start>")], 1)
    result = []
    # loop, generating one word at a time
    for i in range(MAX_CAPTION_LEN):
        # call the decoder, which returns probabilities over the vocabulary
        predictions, gru_state = decoder_pred_model(
            [dec_input, gru_state, features]
        )
        # draw from the distribution given by the predictions;
        # sampling among the top-10 words adds randomness to the caption
        top_probs, top_idxs = tf.math.top_k(
            input=predictions[0][0], k=10, sorted=False
        )
        chosen_id = tf.random.categorical([top_probs], 1)[0].numpy()
        predicted_id = top_idxs.numpy()[chosen_id][0]
        # convert the chosen id back into a word
        result.append(tokenizer.get_vocabulary()[predicted_id])
        if predicted_id == word_to_index("<end>"):
            return img, result
        dec_input = tf.expand_dims([predicted_id], 1)
    return img, result
```
Finally:


:::
# Introduction to Generative AI Studio
1. What does generative AI do?
It can generate content for you in multiple formats, including text, images, audio, and video.
It can also help you with many tasks, such as email generation, content extraction, code completion, and virtual assistance.
This tool is a valuable asset for any business or individual who is looking to create high-quality content quickly and easily.
2. How does generative AI generate new content?
It is trained on massive amounts of data, which results in a foundational model.
This model can then perform general tasks such as text summarization. It can also be further trained on new datasets, which leads to fine-tuned models that can be used for specific tasks in specific fields such as finance and healthcare.
3. What does Generative AI Studio do?
Generative AI Studio allows users to rapidly prototype and customize generative AI models with no code or low code and to use the generative AI capabilities in their applications.
4. What does Generative AI Studio currently support?
>Language
>Prompt design for different tasks
>Conversation creation
>Tuning and deployment of language models
>Image
>Image generation
>Image editing
>Speech
>Generation of text from speech
>Generation of speech from text
5. What are the main features of Generative AI Studio Language?
>Prompt design
>Conversation creation
>Model tuning
6. What is a prompt?
In the world of generative AI, a prompt is just a fancy name for the input text that you feed to your model.
You can feed your desired input text like questions and instructions to the model.
The model then provides a response based on how you structured your prompt; therefore, the answers you get depend on the questions you ask.
7. What is prompt design?
The process of finding and designing the best input text to get the desired response back from the model is called prompt design, which often involves a lot of experimentation.
8. What is the difference between zero-shot, one-shot, and few-shot prompting?
>Zero-shot prompting: Provides one single command with no examples.
>One-shot prompting: Provides one example of the task.
>Few-shot prompting: Provides a few examples of the task often with the description of the context.
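The three prompting styles can be illustrated by building the prompt strings directly; the sentiment task and the review texts are made-up examples:

```python
task = "Classify the sentiment of the review as positive or negative."

# zero-shot: the instruction alone, no worked examples
zero_shot = f"{task}\nReview: The battery dies in an hour.\nSentiment:"

# one-shot: one worked example before the real input
one_shot = (
    f"{task}\n"
    "Review: I love this phone.\nSentiment: positive\n"
    "Review: The battery dies in an hour.\nSentiment:"
)

# few-shot: several worked examples, giving the model more context for the task
few_shot = (
    f"{task}\n"
    "Review: I love this phone.\nSentiment: positive\n"
    "Review: The screen cracked on day one.\nSentiment: negative\n"
    "Review: The battery dies in an hour.\nSentiment:"
)
```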
9. What are the best practices for prompt design?
>Be concise
>Be specific and well-defined
>Ask one task at a time
>Turn generative tasks into classification tasks
>Improve response quality by including examples
10. What are the mode parameters that you can tune in Generative AI Studio Language to improve the response that fits your requirement?
>Model type
>Temperature
>Top K
>Top P
11. What does temperature mean as a tuning parameter?
Temperature is a number used to tune the degree of randomness. Low temperature means choosing the most likely and predictable words. For example, the word "flowers" in the sentence "The garden is full of beautiful__." High temperature means choosing the words that have low possibility and are more unusual. For example, the word "bugs" in the sentence "The garden is full of beautiful__. "
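Temperature, Top K, and Top P can be sketched over a hypothetical set of next-word logits (the words and numbers below are made up for illustration):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical next-word logits for "The garden is full of beautiful __"
logits = [4.0, 3.0, 2.0, 0.5]  # "flowers", "roses", "colors", "bugs"

cold = softmax_with_temperature(logits, temperature=0.2)  # predictable choices
hot = softmax_with_temperature(logits, temperature=5.0)   # unusual words get a chance

def top_k(probs, k):
    # keep only the k highest-probability words, then renormalize
    kept = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def top_p(probs, p):
    # nucleus sampling: smallest set of words with cumulative probability >= p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}
```

Low temperature concentrates probability on "flowers"; high temperature flattens the distribution so "bugs" becomes plausible.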
12. How do you tune a language model in Generative AI Studio?
You need to specify the tuning parameters, the tuning dataset, and the tuning objective.