
Discussion and future work

This work is a first exploration of end-to-end approaches to spoken question answering (SQA), and our experimental results demonstrate the feasibility of this research direction. Although the approach already achieves reasonable performance, there remains ample room for future research; we discuss the main open issues below.

The first issue to address is the use of word boundaries. Although it is reasonable to use an off-the-shelf ASR model as a segmenter under a supervised setting, it would be far more desirable if the boundaries could be provided by the end-to-end model itself. In conventional SLU tasks, it is possible to extract information from frame-level speech features for classification tasks such as slot filling. For SQA, however, using frame-level speech features is an enormous challenge, because the pointer network must predict positions directly over very long frame sequences. Since this work is a first step toward end-to-end SQA, we chose an easier setting that focuses on embedding learning and language model pre-training, solving SQA after the word boundaries have been obtained. Several alternatives could be explored in future work. The simplest is to segment the audio heuristically by voice intensity, as sketched below. More carefully, previous work on simultaneous speech translation~\cite{oda2014optimizing} proposed algorithms that learn segmentation strategies which directly maximize the performance of the downstream machine translation system, and a similar strategy might address the segmentation issue in our task. Joint learning of segmentation and speech segment embeddings, mutually enhanced through reinforcement learning~\cite{wang2018segmental}, is another promising approach, and it could be adapted to the text-and-speech cross-modal language model pre-training in our work.
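As a concrete illustration of the heuristic option, the following is a minimal sketch of intensity-based segmentation: frames whose short-time energy stays below a threshold for a sustained stretch are treated as pauses between word-like segments. It assumes a mono floating-point waveform, and the function name, threshold, and frame parameters are illustrative choices, not values from our system.

```python
import numpy as np

def segment_by_intensity(wav, sr, frame_ms=25, hop_ms=10,
                         threshold=0.02, min_silence=10):
    """Split a mono waveform into word-like segments: frames whose RMS
    energy stays below `threshold` for `min_silence` consecutive frames
    are treated as pauses, i.e. segment boundaries."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    rms = np.array([np.sqrt(np.mean(wav[i:i + frame] ** 2))
                    for i in range(0, len(wav) - frame, hop)])
    voiced = rms > threshold

    segments, start, silence = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i               # first voiced frame opens a segment
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence:  # sustained pause closes the segment
                segments.append((start * hop, (i - silence) * hop + frame))
                start = None
    if start is not None:               # trailing speech at the end of audio
        segments.append((start * hop, len(wav)))
    return segments                     # list of (start_sample, end_sample)
```

Such fixed energy thresholds are fragile under noise and fast speech, which is why the learned segmentation strategies cited above are the more promising long-term direction.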

The second direction for future research is cross-modal language model pre-training without labels for the speech corpus. While paired data was used in our pre-training stage, a purely unsupervised method would be more practical, since large unpaired corpora are much easier to collect. One potential way to achieve this is to apply Contrastive Predictive Coding (CPC)~\cite{oord2018representation} to word-level speech features, as sketched below.
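To make the CPC idea concrete, the following is a minimal sketch, assuming the speech features have already been pooled to one embedding vector per word; the class name, dimensions, and number of prediction steps are illustrative, not part of our system. A recurrent encoder summarizes the past words into a context vector, and an InfoNCE loss trains that context to identify the true future word embeddings against other positions in the same utterance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLevelCPC(nn.Module):
    """Sketch of CPC over word-level speech embeddings: a GRU summarizes
    past words into a context vector, and linear heads predict the
    embeddings k steps ahead, trained with an InfoNCE loss."""
    def __init__(self, dim=256, context_dim=256, k_steps=3):
        super().__init__()
        self.encoder = nn.GRU(dim, context_dim, batch_first=True)
        self.predictors = nn.ModuleList(
            [nn.Linear(context_dim, dim) for _ in range(k_steps)])

    def forward(self, z):                        # z: (batch, T, dim)
        c, _ = self.encoder(z)                   # context at every word
        loss = 0.0
        for k, head in enumerate(self.predictors, start=1):
            pred = head(c[:, :-k])               # predict z_{t+k} from c_t
            target = z[:, k:]                    # true future embeddings
            # InfoNCE: each prediction must identify its own future word
            # against the other word positions in the same utterance.
            logits = torch.einsum('btd,bsd->bts', pred, target)
            labels = torch.arange(logits.size(1), device=z.device)
            labels = labels.unsqueeze(0).expand(logits.size(0), -1)
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        return loss / len(self.predictors)
```

The segment embeddings learned this way could, in principle, replace the label-dependent embeddings in our pre-training stage, though verifying this empirically is left to future work.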
