# 5/31 Introduction of AI in Different Fields [Editing]
###### tags: `explore` `paper` `study-notes` `coarse-to-fine`
[toc]
----
:::danger
**How to read this note**: Pay attention to each paper's publication year. Some techniques described here as "state of the art" date from 2012, so they are already quite old, but they are classics.
:::
* reference:
https://github.com/machinelearningmindset/deep-learning-ocean#image-recognition
## <font color=red>Caption Generation</font>
### 2015 Mind's Eye: A Recurrent Visual Representation for Image Caption Generation:
:::success
* Critical to our approach is a recurrent neural network that attempts to dynamically build a visual representation of the scene as a caption is being generated or read
* Generating novel captions given an image
* Reconstructing visual features given an image description.
* sentence generation
* sentence retrieval and image retrieval
:::
:::warning
* The creation of a mental image may play a significant role in sentence comprehension in humans
* Could computer vision algorithms that comprehend and generate image captions take advantage of similar evolving visual representations?

* A bi-directional representation capable of generating both novel descriptions from images and visual representations from descriptions
* a novel representation that dynamically captures the visual aspects of the scene that have already been described. That is, as a word is generated or read the visual representation is updated to reflect the new information contained in the word
* Recurrent Neural Networks (RNNs)
* Model structure
* For learning, the Backpropagation Through Time (BPTT) algorithm is used (see the sketch after this box)
:::
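Below is a minimal sketch (not the authors' code) of the recurrent visual memory idea: a recurrent network reads the caption word by word and tries to reconstruct the image's CNN features from its hidden state. The GRU cell, the layer sizes, and the MSE reconstruction loss are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class RecurrentVisualMemory(nn.Module):
    """Reads a caption word by word and predicts the image's CNN features
    from its hidden state (all sizes here are illustrative)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, visual_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_visual = nn.Linear(hidden_dim, visual_dim)  # hidden state -> visual feature estimate

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) token indices of the caption
        states, _ = self.rnn(self.embed(word_ids))          # (batch, seq_len, hidden_dim)
        return self.to_visual(states)                       # a visual estimate after every word

# The estimate should approach the true CNN features of the image as more of
# the caption is read; gradients flow back through the recurrence (BPTT).
model = RecurrentVisualMemory(vocab_size=10000)
caption = torch.randint(0, 10000, (2, 7))    # two dummy captions, 7 tokens each
cnn_features = torch.randn(2, 4096)          # dummy image features
loss = nn.functional.mse_loss(model(caption)[:, -1, :], cnn_features)
loss.backward()
```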
:::info
* learning long-term interactions
* using a recurrent visual memory that learns to reconstruct the visual features as new words are read or generated.
* We demonstrate state-of-the-art results on the task of sentence generation, image retrieval and sentence retrieval on numerous datasets.
:::
:::danger
* Does not explore the use of LSTM models
:::
* [refer site](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Chen_Minds_Eye_A_2015_CVPR_paper.pdf)
---
### 2015 Deep Visual-Semantic Alignments for Generating Image Descriptions:
:::success
* We present a model that generates natural language descriptions of images and their regions
* Combination of:
* Convolutional Neural Networks over image regions
* Bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding
* Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions
:::
:::warning
* Generating dense descriptions of images

* Leverage these large image-sentence datasets by treating the sentences as weak labels
* Our approach is to infer these alignments and use them to learn a generative model of descriptions. Concretely, our contributions are twofold:
    * We develop a deep neural network model that infers the latent alignment between segments of sentences and the region of the image that they describe (a toy sketch of such a multimodal-embedding objective follows after this box).
    * We introduce a multimodal Recurrent Neural Network architecture that takes an input image and generates its description in text.

:::
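A toy sketch of the general idea behind a structured objective over a multimodal embedding: region and sentence vectors are projected into a shared space, and matching image-sentence pairs should score higher than mismatched ones by a margin. The dimensions, the dot-product score, and the hinge loss below are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn as nn

# Project CNN region features and sentence (BRNN) features into a shared space.
img_proj = nn.Linear(4096, 300)     # image side    -> 300-d shared embedding
txt_proj = nn.Linear(600, 300)      # sentence side -> 300-d shared embedding

img_feats = torch.randn(8, 4096)    # 8 dummy image feature vectors
txt_feats = torch.randn(8, 600)     # their 8 matching sentence feature vectors

v = img_proj(img_feats)             # (8, 300)
s = txt_proj(txt_feats)             # (8, 300)
scores = v @ s.t()                  # scores[i, j]: compatibility of image i and sentence j

# Margin ranking objective: the matching pair (the diagonal) should beat every
# mismatched pair in its row by at least a margin of 1.
pos = scores.diag().unsqueeze(1)
hinge = (1.0 + scores - pos).clamp(min=0)
loss = hinge.sum() - hinge.diag().sum()   # drop the trivial diagonal terms
loss.backward()
```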
:::info

* Generates natural language descriptions of image regions based on weak labels in the form of a dataset of images and sentences, and with very few hardcoded assumptions.
* image-sentence ranking experiments
* Multimodal Recurrent Neural Network architecture that generates descriptions of visual data
* We evaluated its performance on both full-frame and region-level experiments and showed that in both cases the Multimodal RNN outperforms retrieval baselines
:::
* [refer site](https://cs.stanford.edu/people/karpathy/cvpr2015.pdf)
---
### 2015 Show and Tell: A Neural Image Caption Generator:
:::success
:::
:::warning
:::
:::info
:::
* [refer site](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vinyals_Show_and_Tell_2015_CVPR_paper.pdf)
---
## <font color=red>Natural Language Processing</font>
### 2014 Sequence to Sequence Learning with Neural Networks:
:::success
* Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector
:::
:::warning
* Application of ==Deep Neural Network (DNN)==
* DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality
* A useful property of the LSTM is that it learns to map an input sentence of variable length into a fixed-dimensional vector representation (see the encoder-decoder sketch after this box)
* using the cased ==BLEU== score to evaluate the quality of our translations
* Result

:::
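A minimal encoder-decoder sketch of this idea with toy sizes, two LSTM layers, and teacher forcing; the paper's actual system uses four-layer LSTMs, reverses the source sentence, and decodes with beam search, all of which are omitted here.

```python
import torch
import torch.nn as nn

src_vocab, tgt_vocab, emb, hid = 8000, 8000, 256, 512

enc_embed = nn.Embedding(src_vocab, emb)
dec_embed = nn.Embedding(tgt_vocab, emb)
encoder = nn.LSTM(emb, hid, num_layers=2, batch_first=True)
decoder = nn.LSTM(emb, hid, num_layers=2, batch_first=True)
out_proj = nn.Linear(hid, tgt_vocab)

src = torch.randint(0, src_vocab, (4, 10))    # 4 dummy source sentences, 10 tokens each
tgt = torch.randint(0, tgt_vocab, (4, 12))    # their dummy target sentences

# Encoder: the variable-length source is summarized in the final (h, c) state,
# a fixed-dimensional vector regardless of input length.
_, state = encoder(enc_embed(src))

# Decoder: generates the target conditioned only on that fixed vector
# (teacher forcing: the gold previous token is fed at each step).
dec_out, _ = decoder(dec_embed(tgt[:, :-1]), state)
logits = out_proj(dec_out)
loss = nn.functional.cross_entropy(logits.reshape(-1, tgt_vocab), tgt[:, 1:].reshape(-1))
loss.backward()
```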
:::info
* We showed that a large deep LSTM with a limited vocabulary and almost no assumptions about problem structure can outperform a standard SMT-based system with an unlimited vocabulary on a large-scale MT task
:::
:::danger
* DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality
:::
* [refer site](https://arxiv.org/pdf/1409.3215.pdf)
---
### 2017 Attention Is All You Need:
:::success
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train
:::
:::warning

* Encoder and Decoder Stack
* Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network (a sketch of one such layer follows after this box).
* Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack
:::
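A sketch of one encoder layer assembled from standard PyTorch modules, using the base model's dimensions (d_model = 512, 8 heads, d_ff = 2048). Dropout, padding masks, and positional encodings are omitted, so this illustrates the layer structure rather than reproducing the paper's implementation.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_ff = 512, 8, 2048

self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

x = torch.randn(2, 20, d_model)     # (batch, sequence length, d_model)

# Sub-layer 1: multi-head self-attention, residual connection, layer norm.
attn_out, _ = self_attn(x, x, x)    # queries, keys and values all come from x
x = norm1(x + attn_out)

# Sub-layer 2: position-wise feed-forward network, residual, layer norm.
x = norm2(x + ffn(x))

# The encoder stacks N = 6 of these layers; a decoder layer adds a third
# sub-layer that attends over the encoder output.
```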
:::warning

* Why Self-Attention
* total computational complexity per layer
* amount of computation that can be parallelized, as measured by the minimum number of sequential operations required
* the path length between long-range dependencies in the network
:::
:::info
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.
:::
* [refer site](https://arxiv.org/pdf/1706.03762.pdf)
---
### 2014 Convolutional Neural Networks for Sentence Classification:
:::success
We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks
:::
:::warning
* Model

:::
:::warning
* Convolutional neural networks (CNN) utilize layers with convolving filters that are applied to local features
* Originally invented for computer vision, CNN models have subsequently been shown to be effective for NLP and have achieved excellent results in semantic parsing
* We train a simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model (see the sketch after this box).
* We finally describe a simple modification to the architecture to allow for the use of both pre-trained and task-specific vectors by having multiple channels
:::
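A sketch of the single-convolution-layer model with max-over-time pooling. The filter widths 3/4/5 and 100 feature maps follow the paper, but the randomly initialized embedding (instead of pre-trained word2vec vectors), the missing dropout, and the binary output are simplifications.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, n_classes = 20000, 300, 2

embed = nn.Embedding(vocab_size, embed_dim)
convs = nn.ModuleList([nn.Conv1d(embed_dim, 100, kernel_size=k) for k in (3, 4, 5)])
classifier = nn.Linear(3 * 100, n_classes)

tokens = torch.randint(0, vocab_size, (4, 40))   # 4 dummy sentences, 40 tokens each
x = embed(tokens).transpose(1, 2)                # (batch, embed_dim, seq_len)

# One convolution layer with several filter widths, each followed by
# max-over-time pooling, then a linear classifier over the concatenation.
pooled = [conv(x).relu().max(dim=2).values for conv in convs]
logits = classifier(torch.cat(pooled, dim=1))    # sentence-level prediction
```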
:::info
* Multichannel vs. Single Channel Models:
* The results are mixed, and further work on regularizing the fine-tuning process is warranted.
* Static vs. Non-static Representations
* Our results add to the well-established evidence that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP
:::
* [refer site](https://arxiv.org/pdf/1408.5882.pdf)
---
## <font color=red>Speech Technology</font>
### 2012 Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups:
:::success
* Most speech recognition systems:
* hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input
* feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output
* Using deep neural networks for acoustic modeling in speech recognition.
:::
:::warning
* Expectation-Maximization (EM)
* Develop speech recognition systems for real world tasks using the richness of Gaussian mixture models (GMM) to represent the relationship between HMM states and the acoustic input
<!-- * The recognition accuracy of a GMM-HMM system can be further improved if it is discriminatively fine-tuned after it has been generatively trained to maximize its probability of generating the observed data
* if the discriminative objective function used for training is closely related to the error rate on phones, words or sentences -->
* Two-stage training procedure for DNNs (an illustrative sketch follows after this box):
    1. First stage: layers of feature detectors are initialized one layer at a time, by fitting a stack of generative models, each of which has one layer of latent variables.
    2. Second stage: each generative model in the stack is used to initialize one layer of hidden units in a DNN, and the whole network is then discriminatively fine-tuned to predict the target HMM states. These targets are obtained by using a baseline GMM-HMM system to produce a forced alignment.

* The DNNs that worked well on TIMIT were then applied to five different large vocabulary, continuous speech recognition tasks by three different research groups whose results we also summarize.
* The DNNs worked well on all of these tasks when compared with highly-tuned GMM-HMM systems and on some of the tasks they outperformed the state-of-the-art by a large margin
:::
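An illustrative sketch of the two-stage recipe, substituting simple autoencoders for the generative stage (the groups in the paper mostly used stacked RBMs) and dummy forced-alignment targets; all sizes and the toy training loop are assumptions for illustration only.

```python
import torch
import torch.nn as nn

dims = [440, 1024, 1024, 1024]           # e.g. stacked acoustic frames -> hidden layers
layers = [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
frames = torch.randn(256, 440)           # a dummy minibatch of acoustic input

# Stage 1: initialize the feature detectors one layer at a time by fitting a
# generative model (here an autoencoder that reconstructs its own input).
h = frames
for layer in layers:
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.SGD(list(layer.parameters()) + list(decoder.parameters()), lr=0.01)
    for _ in range(10):                  # a few toy reconstruction steps
        recon = decoder(torch.sigmoid(layer(h)))
        loss = nn.functional.mse_loss(recon, h)
        opt.zero_grad()
        loss.backward()
        opt.step()
    h = torch.sigmoid(layer(h)).detach() # feed the learned features to the next layer

# Stage 2: stack the pre-trained layers, add a softmax over HMM states, and
# fine-tune the whole network discriminatively on forced-alignment targets.
n_hmm_states = 2000
dnn = nn.Sequential(*sum([[l, nn.Sigmoid()] for l in layers], []),
                    nn.Linear(dims[-1], n_hmm_states))
targets = torch.randint(0, n_hmm_states, (256,))   # dummy labels from a GMM-HMM alignment
loss = nn.functional.cross_entropy(dnn(frames), targets)
loss.backward()
```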
:::danger
* GMMs' shortcoming: they are statistically inefficient for modeling data that lie on or near a non-linear manifold in the data space.
:::
:::info
* When GMMs were first used for acoustic modeling they were trained as generative models using the EM algorithm and it was some time before researchers showed that significant gains could be achieved by a subsequent stage of discriminative training using an objective function more closely related to the ultimate goal of an ASR system
* When neural nets were first used they were trained discriminatively and it was only recently that researchers showed that significant gains could be achieved by adding an initial stage of generative pre-training that completely ignores the ultimate goal of the system
* The first method to be used for pre-training DNNs was to learn a stack of RBMs, one per hidden layer of the DNN. An RBM is an undirected generative model that uses binary latent variables, but training it by maximum likelihood is expensive, so a much faster, approximate method called contrastive divergence is used (a toy contrastive-divergence update is sketched after this box).
* Subsequent research showed that autoencoder networks with one layer of logistic hidden units also work well for pre-training, especially if they are regularized by adding noise to the inputs or by constraining the codes to be insensitive to small changes in the input.
:::
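A toy CD-1 weight update for a binary RBM, just to make the contrastive-divergence approximation concrete; biases, momentum, and weight decay are omitted and all sizes are arbitrary.

```python
import torch

n_visible, n_hidden, lr = 440, 1024, 0.01
W = torch.randn(n_visible, n_hidden) * 0.01
v0 = (torch.rand(64, n_visible) > 0.5).float()      # a dummy minibatch of binary data

h0_prob = torch.sigmoid(v0 @ W)                     # positive phase
h0 = (torch.rand_like(h0_prob) < h0_prob).float()   # sample the hidden units
v1_prob = torch.sigmoid(h0 @ W.t())                 # one step of Gibbs sampling
h1_prob = torch.sigmoid(v1_prob @ W)                # negative phase

# Approximate gradient of the log-likelihood: <v h>_data - <v h>_model
W += lr * (v0.t() @ h0_prob - v1_prob.t() @ h1_prob) / v0.shape[0]
```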
* [refer site](https://static.googleusercontent.com/media/research.google.com/zh-TW//pubs/archive/38131.pdf)
---
### 2013 Speech recognition with deep recurrent neural networks:
:::success
* deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs
:::
:::warning
* The combination of Long Short-Term Memory (LSTM), an RNN architecture with an improved memory, with end-to-end training has proved especially effective for cursive handwriting recognition.

:::
:::info

* The combination of deep, bidirectional Long Short-Term Memory RNNs with end-to-end training and weight noise gives state-of-the-art results in phoneme recognition on the TIMIT database (see the sketch after this box).
* Future direction: combine frequency-domain convolutional neural networks with deep LSTM.
:::
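A sketch of a deep bidirectional LSTM acoustic model trained end-to-end with a CTC objective, roughly in the spirit of the paper; the sizes are toy values, and the RNN-transducer variant and weight-noise regularization discussed in the paper are not shown.

```python
import torch
import torch.nn as nn

n_mel, hid, n_phonemes = 40, 256, 62             # e.g. 61 phoneme classes + 1 CTC blank

blstm = nn.LSTM(n_mel, hid, num_layers=3, bidirectional=True, batch_first=True)
out = nn.Linear(2 * hid, n_phonemes)             # forward + backward states -> phoneme scores
ctc = nn.CTCLoss(blank=0)

features = torch.randn(4, 200, n_mel)            # 4 dummy utterances, 200 frames each
labels = torch.randint(1, n_phonemes, (4, 30))   # dummy phoneme label sequences

log_probs = out(blstm(features)[0]).log_softmax(dim=2).transpose(0, 1)   # (T, batch, classes)
loss = ctc(log_probs, labels,
           input_lengths=torch.full((4,), 200, dtype=torch.long),
           target_lengths=torch.full((4,), 30, dtype=torch.long))
loss.backward()
```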
* [refer site](https://arxiv.org/pdf/1303.5778.pdf)
---
### 2015 Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition:
:::success
:::
:::warning
:::
:::info
:::
* [refer site](https://arxiv.org/pdf/1507.06947.pdf)
---
# [Paper Index](https://hackmd.io/DSTgKtaGQ1yjEzDInyRhoA?both)