# [PAPER] Generating Radiology Reports via Memory-driven Transformer

:::info
**Author** : Zhihong Chen, Yan Song, Tsung-Hui Chang, Xiang Wan
**Paper Link** : https://arxiv.org/abs/2010.16056
**Code** : https://github.com/zhjohnchan/R2Gen
:::

#### Abstract
---
* Medical imaging is frequently used in clinical practice and trials for diagnosis and treatment.
* Writing imaging reports is time-consuming and can be error-prone for inexperienced radiologists.
* Automatically generating radiology reports therefore helps reduce radiologists' workload and promote clinical automation.
* A memory-driven Transformer is designed, in which a relational memory records key information of the generation process and memory-driven conditional layer normalization incorporates it into the Transformer's decoder; radiology reports are generated with this model.
* This approach can generate long reports that contain the necessary medical terms as well as meaningful image-text mappings.
#### Information
---
1. We propose to generate radiology reports via a novel memory-driven transformer model.
2. We propose a relational memory (RM) to record the previous generation process and MCLN to incorporate the relational memory into the layers of the Transformer decoder.
3. Extensive experiments are performed and the results show that our proposed models outperform the baselines and existing models.
4. We conduct analyses to investigate the effect of our model with respect to different memory sizes and show that our model is able to generate long reports with necessary medical terms and meaningful image-text attention mappings.
* In practice, the biggest difficulty in writing radiology reports is that they are long narratives consisting of multiple sentences (an impression section and a findings section).
* Accuracy requirements and the sheer length of the text add to this difficulty.
* Conventional image captioning approaches are not suitable for report generation, because they briefly describe visual scenes with short sentences.
* Prior work has proposed to address the challenges of radiology report generation:
* Liu (2019): a simple retrieval-based method
* Li (2018): combined retrieval-based and generation-based methods with manually extracted templates
* However, retrieval-based approaches have limitations.
* In this paper, radiology reports are generated via a memory-driven Transformer: a relational memory (RM) is proposed to record the information from previous generation processes, and a novel memory-driven conditional layer normalization (MCLN) is designed to incorporate the relational memory into the Transformer.
* As a result, similar patterns across different medical reports can be implicitly modeled and memorized during generation, which facilitates the Transformer's decoding and yields long reports with informative content.
#### Method
- The model follows the sequence-to-sequence paradigm. The input from a radiology image is treated as the source sequence, where x_s are the patch features extracted by the visual extractor and d is the size of each feature vector.
- The target sequence is the report, where y_t denotes a generated token, T is the length of the generated sequence, and V is the vocabulary of all possible tokens.
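With this notation, the paper formalizes the entire generation process as a recursive application of the chain rule, and training maximizes the log-likelihood of the report given the image:

```latex
p(Y \mid \mathrm{Img}) = \prod_{t=1}^{T} p\left(y_t \mid y_1, \dots, y_{t-1}, \mathrm{Img}\right),
\qquad
\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{T} \log p\left(y_t \mid y_1, \dots, y_{t-1}, \mathrm{Img}; \theta\right)
```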
## Information
* RM (relational memory): records the information from previous generation processes (see the sketch after this list)
* MCLN (memory-driven conditional layer normalization): incorporates the relational memory into the layers of the Transformer decoder
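A minimal PyTorch sketch of how such a relational memory could update its slots at each decoding step. The class name, gate granularity, and hyperparameters here are illustrative assumptions; the gated, attention-based update follows the paper's description but simplifies details.

```python
import torch
import torch.nn as nn

class RelationalMemory(nn.Module):
    """Gated, attention-based memory update (simplified sketch).

    At step t, the memory M_{t-1} attends over itself concatenated with
    the embedding of the previous token y_{t-1}; the result is merged
    back through LSTM-like forget/input gates to produce M_t.
    """

    def __init__(self, num_slots: int, d_model: int, num_heads: int = 8):
        super().__init__()
        self.num_slots = num_slots
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.w_gate = nn.Linear(d_model, 2 * d_model)  # gates from y_{t-1}
        self.u_gate = nn.Linear(d_model, 2 * d_model)  # gates from M_{t-1}

    def forward(self, memory: torch.Tensor, prev_emb: torch.Tensor) -> torch.Tensor:
        # memory:   (B, num_slots, d_model) -- M_{t-1}
        # prev_emb: (B, d_model)            -- embedding of y_{t-1}
        kv = torch.cat([memory, prev_emb.unsqueeze(1)], dim=1)
        attended, _ = self.attn(memory, kv, kv)   # attend over [M_{t-1}; y_{t-1}]
        candidate = self.mlp(attended + memory) + attended + memory  # residual MLP
        gates = self.w_gate(prev_emb).unsqueeze(1) + self.u_gate(torch.tanh(memory))
        forget_gate, input_gate = gates.chunk(2, dim=-1)
        # M_t = sigmoid(G_f) * M_{t-1} + sigmoid(G_i) * tanh(candidate)
        return (torch.sigmoid(forget_gate) * memory
                + torch.sigmoid(input_gate) * torch.tanh(candidate))

# Usage (3 slots and d_model=512 are assumptions for illustration):
rm = RelationalMemory(num_slots=3, d_model=512)
M_next = rm(torch.randn(2, 3, 512), torch.randn(2, 512))  # (2, 3, 512)
```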
## Method
* The proposed model can be partitioned into three major components: the visual extractor, the encoder, and the decoder.
* The visual extractor is responsible for extracting visual features from a radiology image using pre-trained convolutional neural networks (CNN), such as VGG or ResNet. The encoded results are used as the source sequence for all subsequent modules.
* Pipeline: radiology image (input) → CNN (such as VGG or ResNet) → patch features. The process is formulated as {x1, x2, ..., xS} = f_v(Img), where f_v(·) represents the visual extractor and Img is the radiology image; a sketch follows below.
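A minimal sketch of such a visual extractor, assuming a recent torchvision; the ResNet101 backbone and 224x224 input size are assumptions for illustration, not necessarily the paper's exact configuration (see https://github.com/zhjohnchan/R2Gen for the official implementation).

```python
import torch
import torchvision.models as models

# Pre-trained ResNet with the classification head removed, so the last
# convolutional feature map is exposed.
resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
backbone.eval()

@torch.no_grad()
def f_v(img: torch.Tensor) -> torch.Tensor:
    """Map a radiology image to a sequence of patch features {x_1, ..., x_S}."""
    feat = backbone(img)                   # (B, d=2048, H', W') feature map
    B, d, H, W = feat.shape
    # Flatten the spatial grid into S = H' * W' patch features of size d.
    return feat.view(B, d, H * W).permute(0, 2, 1)   # (B, S, d)

# A 224x224 input yields S = 7 * 7 = 49 patch features with d = 2048.
x = f_v(torch.randn(1, 3, 224, 224))
print(x.shape)  # torch.Size([1, 49, 2048])
```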
* The encoder is the standard Transformer encoder. It takes the patch features extracted by the visual extractor and encodes them into hidden states: the output is the sequence of hidden states {h1, h2, ..., hS} computed from the input features {x1, x2, ..., xS}; a sketch follows below.
* {h1, h2, ..., hS} = f_e(x1, x2, ..., xS), where f_e(·) refers to the encoder.
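Since the encoder is the unmodified Transformer encoder, PyTorch's built-in module is enough for a sketch; the d_model, head, and layer counts below are illustrative assumptions.

```python
import torch.nn as nn

d_model, nhead, num_layers = 512, 8, 3   # assumed hyperparameters

# Project patch features x_s (d = 2048 from the ResNet sketch) to d_model.
proj = nn.Linear(2048, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

def f_e(x):
    """Encode patch features {x_1, ..., x_S} into hidden states {h_1, ..., h_S}."""
    return encoder(proj(x))   # (B, S, d_model)
```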
* The decoder is where the proposed memory and its integration into the Transformer are mainly performed. It generates the target sequence, the radiology report, token by token, conditioned on the visual features and the previously generated tokens.
* The proposed relational memory records the previous generation process and the patterns needed for long text generation; it is incorporated into the Transformer through the novel MCLN layer normalization mechanism.
* The decoder also uses multi-head self-attention to capture the dependencies between different parts of the generated report.
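A minimal PyTorch sketch of memory-driven conditional layer normalization, assuming the relational memory has been flattened to one vector per decoding step (the class and argument names are hypothetical): the memory predicts per-step changes to LayerNorm's scale and shift, so memorized patterns influence every decoder layer.

```python
import torch
import torch.nn as nn

class MCLN(nn.Module):
    """Memory-driven conditional layer normalization (sketch).

    Standard LayerNorm's scale (gamma) and shift (beta) are adjusted at
    every decoding step by deltas predicted from the relational memory.
    """

    def __init__(self, d_model: int, d_memory: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))   # base scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # base shift
        self.delta_gamma = nn.Linear(d_memory, d_model)  # memory -> Δgamma
        self.delta_beta = nn.Linear(d_memory, d_model)   # memory -> Δbeta

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # x:      (B, T, d_model)  decoder hidden states
        # memory: (B, T, d_memory) flattened relational memory per step
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        gamma = self.gamma + self.delta_gamma(memory)    # gamma_t = gamma + Δgamma
        beta = self.beta + self.delta_beta(memory)       # beta_t  = beta + Δbeta
        return gamma * (x - mean) / (std + self.eps) + beta

# Each MCLN replaces an ordinary LayerNorm inside a decoder layer, e.g. with
# 3 memory slots of size 512 flattened to a 1536-dimensional vector per step:
mcln = MCLN(d_model=512, d_memory=3 * 512)
out = mcln(torch.randn(2, 10, 512), torch.randn(2, 10, 3 * 512))  # (2, 10, 512)
```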