Jia-Liang Lu

@VitasLu

Joined on Oct 25, 2023

In this HackMD space, I mainly record: 1. Python syntax and programming techniques 2. Research notes in NLP, especially papers on text summarization

  • Foreword Before going into the details of model compression, I would like to open with a pair of images. ![higher-resolution-image](https://hackmd.io/_uploads/Hk6dpWGS6.jpg =90%x) Looking at the two photos above, everyone would say without hesitation: the sunflower on the right is sharp, while the sunflower on the left is blurry.
  • 5-1 Understanding Functions A function is <span class='red'>a piece of code with a specific purpose, or one meant to be reused, written as an independent program</span>. It is then given a name so it can be called later, which simplifies the program and improves readability. A function is also sometimes called a method, procedure, or subroutine. :::success Advantages of using functions: <span class='red'>reusability</span> makes the program more concise, and <span class='red'>readability improves</span> (easier to understand, which helps debugging). Disadvantage: the extra call step makes execution slightly slower than writing the statements inline. ::: 5-2 Defining Functions
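As a quick sketch of the idea above: define a function once, give it a name, then reuse it by calling that name (the function name `greet` is my own illustrative choice, not from the notes):

```python
# Define a function once, then reuse it by name.
def greet(name):
    """Return a greeting for the given name."""
    return "Hello, " + name + "!"

# Each call reuses the same statements, keeping the program concise.
print(greet("Alice"))  # Hello, Alice!
print(greet("Bob"))    # Hello, Bob!
```

Writing the greeting logic in one place also makes it easier to debug: a fix inside `greet` applies to every call site.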
  • Python Chapter 5: Functions Chapter 6: list, tuple, set and dict Chapter 8: File Access Chapter 9: Exception Handling Chapter 10: Object-Oriented Programming Chapter 11: Modules and Packages Python Third-Party Library Pytorch
  • 10-1 Understanding Object-Oriented Programming An advantage of the object-oriented approach is that <span class='red'>objects can be reused across different applications</span>. :::info Traditional procedural programming defines data separately from the functions that process the data, so it focuses on function design. The whole program is a sequence of statements; executing them step by step produces the result. EX: FORTRAN, ALGOL, BASIC, COBOL, Pascal, C, Ada, etc. ::: Common terms in object-oriented programming: object or instance: just like the physical things we see in everyday life
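A minimal sketch of these terms, using a hypothetical `Dog` class of my own: the class is the blueprint, and each object (instance) created from it is independent and reusable:

```python
# A class is a blueprint; an object (instance) is created from it.
class Dog:
    def __init__(self, name):
        self.name = name          # data stored on the instance

    def speak(self):              # behavior bundled with the data
        return self.name + " says woof"

# Two independent instances built from the same class.
a = Dog("Lucky")
b = Dog("Happy")
print(a.speak())  # Lucky says woof
print(b.speak())  # Happy says woof
```

In contrast to procedural style, the data (`name`) and the function that processes it (`speak`) live together in one reusable unit.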
  • 8-1 Understanding File Paths How files and folders are stored on a storage device depends on the file system. The file system locates where a file sits on the storage device and then reads its data. Different operating systems may use different file systems: the MS-DOS file system is FAT (File Allocation Table), while Windows 7/8/10 use FAT32, NTFS (New Technology File System), or exFAT (Extended File Allocation Table). 8-2 Writing Files 8-2-1 Creating a File Object In a Python program, whether reading or writing file data, <span class='red'>we must go through an intermediary file object</span>. We can use the open() function below to create the object; on success it returns a file object, and on failure an error is raised.
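A short sketch of 8-2-1, assuming a hypothetical file name `demo.txt`: open() returns the intermediary file object, which we then use to write and read the data:

```python
# open() returns a file object; "w" creates/overwrites, "r" reads.
with open("demo.txt", "w", encoding="utf-8") as f:
    f.write("Hello, file!")       # write through the file object

with open("demo.txt", "r", encoding="utf-8") as f:
    content = f.read()            # read back through a new file object

print(content)  # Hello, file!
```

Using `with` closes the file object automatically, even if an error occurs while reading or writing.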
  • :::danger Comments: Accepted at ICLR 2024 camera-ready Github: https://github.com/lilakk/BooookScore Keywords: LLM Evaluation, Long-term Evaluation ::: 1. Introduction Just two years ago, automatically-generated summaries were riddled with artifacts such as grammar errors, repetition, and hallucination. Nowadays, such artifacts have mostly disappeared. In fact, Pu et al. (2023b) find that summaries generated by large language models (LLMs) are preferred over those written by humans, leading them to pronounce the death of summarization research. However, as with most prior work on summarization, the input documents in their study are relatively short (<10K tokens).
  • 11-1 Modules A <span class='red'>module is a</span> <span class='red'>Python file</span> named ***.py that defines data, functions, and classes. To use the features (data, functions, and classes) a module provides, import it with the import statement # *** stands for the module name import *** Take Python's built-in calendar module as an example import calendar print(calendar.month(2024, 6))
  • ICLR Conference BooookScore: A systematic exploration of book-length summarization in the era of LLMs <!-- 2024/04/13 --> ACL Conference Element-aware Summarization with Large Language Model: Expert-aligned Evaluation and Chain-of-Thought Method <!-- 2023/05/22 --> SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization <!-- 2023/05/26 --> EMNLP Conference Toward Unifying Text Segmentation and Long Document Summarization <!-- 2022/10/28 -->
  • :::danger Comments: Accepted at ACL 2022 Github: https://github.com/ntunlp/SummaReranker Keywords: Mixture-of-Experts, Decoding Strategies ::: 1. Introduction In recent years, sequence-to-sequence neural models have enabled great progress in abstractive summarization. <span class='red'>In the news domain, they have surpassed the strong LEAD-3 extractive baseline.</span> With the rise of transfer learning since BERT, leading approaches typically fine-tune a base pre-trained model that either follows a general text generation training objective like T5, BART, ERNIE and ProphetNet, or an objective specifically tailored for summarization like in PEGASUS.
  • :::danger Comments: Accepted at EACL 2024 conference Github: https://github.com/flbbb/locost-summarization Keywords: State-space Model, Model with Unlimited Context Length. Python: 3.10 ::: 1. Introduction The introduction of transformer architectures indeed came as a major bump in performance and scalability for text generation. However, the quadratic complexity in the input length still restricts the application of large pre-trained models to long texts.
  • :::danger Comments: Accepted at ACL 2023 Github: https://github.com/Alsace08/SumCoT ::: 1. Introduction Existing studies commonly train or fine-tune language models on large-scale corpora. However, some standard datasets have been shown to be noise-enriched, mainly in terms of information redundancy and factual hallucination. Extensive experiments have shown that reference summaries in these standard datasets perform poorly on human assessment dimensions, especially coherence, consistency, and relevance. :::info
  • :::danger Comments: Accepted at EMNLP 2022 (Long Paper) Github: https://github.com/tencent-ailab/Lodoss ::: 1. Introduction One of the most effective ways to summarize a long document is to extract salient sentences. While abstractive strategies produce more condensed summaries, they suffer from hallucinations and factual errors, which pose a more difficult generation challenge. Extractive summaries have the potential to be highlighted on their source materials to facilitate viewing, e.g., Google’s browser allows text extracts to be highlighted on the webpage via a shareable link. <span class='red'>As a document grows in length, it becomes crucial to bring structure to it.</span> Examples include chapters, sections, paragraphs, headings and bulleted lists.
  • :::danger Comments: Accepted at COLING 2022 Github: https://github.com/SeungoneKim/SICK_Summarization ::: 1. Introduction Dialogue-to-document summarization suffers from the discrepancy between input and output forms, which makes learning their mapping patterns more challenging. Two key challenges make summarizing dialogues harder than summarizing documents: detecting unspoken intention is crucial for understanding an utterance, and there exists information that can only be understood when its hidden meaning is revealed. :::info
  • 6-1 list Definition: a list is made up of a series of items; the items are ordered, and a list is a mutable sequence. A list is enclosed in square brackets, its items are separated by commas, and the items may be of different types. # a list containing 5 elements [1, "Taipei", 3.14, "NTNU", -43]
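A small sketch of the mutability described above, reusing the five-element list (the variable name `items` is my own):

```python
# Items of different types in one ordered list.
items = [1, "Taipei", 3.14, "NTNU", -43]

# Lists are mutable: elements can be replaced or added in place.
items[1] = "Tainan"    # replace by index
items.append("new")    # add to the end
print(items)           # [1, 'Tainan', 3.14, 'NTNU', -43, 'new']
```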
  • 9-1 Understanding Exceptions Common error types: syntax errors, runtime errors, and logic errors. When an error occurs in a Python program, the system raises an exception. Traceback means the error message is traced back through the function calls where the error occurred. Types of exceptions
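A minimal sketch of handling a runtime error, using a hypothetical `safe_divide` helper: the ZeroDivisionError exception is caught with try/except instead of crashing the program:

```python
# Division by zero is a runtime error that raises an exception;
# try/except catches it so the program keeps running.
def safe_divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return None               # handled: no traceback is printed

print(safe_divide(10, 2))   # 5.0  (normal case)
print(safe_divide(10, 0))   # None (exception was caught)
```

Note that try/except only helps with runtime errors; syntax errors stop the program before it runs, and logic errors produce no exception at all.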
  • :::danger Keywords: Cyber-physical system (CPS), digital twin (DT), edge computing, fog computing, cloud computing, smart manufacturing. ::: 1. Introduction The advances in new generation information technologies (New IT), such as Internet of Things (IoT), big data, cloud computing, artificial intelligence (AI), etc., have had a profound impact on manufacturing. In this context, many countries have issued their advanced manufacturing strategies, e.g., Industry 4.0 in Germany, Industrial Internet in the USA, Made in China 2025, etc. Although each of these strategies was proposed under different circumstances, <span class='red'>one of the common purposes of these strategies is to achieve smart manufacturing</span>. However, <span class='red'>to implement smart manufacturing, one specific challenge is how to converge the physical and cyber worlds of manufacturing</span>. Cyber-physical integration has gained extensive attention from academia, industry, and government.
  • 1. Introduction LLM summaries are significantly preferred by human evaluators and also demonstrate higher factuality. After sampling and examining 100 summarization-related papers published at ACL, EMNLP, NAACL, and COLING in the past 3 years, we find that the main contribution of about 70% of the papers was to propose a summarization approach and validate its effectiveness on standard datasets. :::success We acknowledge existing challenges in the field (text summarization), such as: the need for high-quality reference datasets, application-oriented approaches, and improved evaluation methods
  • 1. Introduction Large language models (LLMs) have shown excellent performance and great potential across a wide range of tasks. However, deploying these models remains challenging: LLMs have an enormous number of parameters, which requires large memory capacity and high memory bandwidth. The following describes some techniques that address this problem: Quantization is a technique that reduces the numerical precision of a neural network to lower the computational cost of inference. INT8 quantization is the most common approach, but extreme activation values have limited its wider adoption. :::info Activation is an important concept in neural networks:
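A minimal sketch of symmetric INT8 quantization in plain Python (my own illustration, not the method of any specific paper): values are scaled by the maximum absolute value into [-127, 127] and rounded, and a single extreme activation outlier inflates the scale so that small values lose all precision:

```python
# Symmetric INT8 quantization: map floats to integers in [-127, 127].
def quantize(xs):
    scale = max(abs(x) for x in xs) / 127.0   # one scale for the whole tensor
    q = [round(x / scale) for x in xs]        # rounded integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

acts = [0.1, -0.2, 0.05, 0.3]
q, s = quantize(acts)
print(dequantize(q, s))           # close to the original values

# One extreme activation inflates the scale; the small values
# now round to 0 and are destroyed after dequantization.
q2, s2 = quantize(acts + [60.0])
print(dequantize(q2, s2))
```

This is exactly why activation outliers limit naive INT8 quantization: the per-tensor scale must cover the outlier, leaving too little resolution for the typical values.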
  • 1. Introduction Large language models (LMs) can be prompted to perform a range of natural language processing (NLP) tasks, given some examples of the task as input. However, these models often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following user instructions. This is because the language modeling objective used for many recent large LMs—predicting the next token on a webpage from the internet—is different from the objective “follow the user’s instructions helpfully and safely”. Thus, we say that the language modeling objective is misaligned. :::info Misaligned means that the content an LLM produces fails to solve, or effectively help with, the question a human has posed. ::: We make progress on aligning language models by training them to act in accordance with the user’s intention.
  • :::danger Github: https://github.com/the-anonymous-bs/FAVOR ::: 1. Introduction Text-based large language models (LLMs) have demonstrated remarkable performance on various natural language processing tasks, especially achieving human-level capabilities in reasoning and comprehension. :::info Instruction fine-tuning, where data is organised as pairs of user instruction (or prompt) and reference response, has emerged as a training paradigm that enables LLMs to perform various tasks by following open-ended natural language instructions from non-expert users. ::: Recently, there has been burgeoning research interest in equipping LLMs with visual and auditory perception abilities.