Overall comment on all negative reviews: BLD6 (4) did not read the paper carefully and misunderstood it. TPbw (3) is a first-time reviewer. kmTd (4) is confused about Table 3 and is willing to increase the score.

Notes for the AC response:
1. In many fields (e.g., law, medicine), an unreliable answer is not an answer; we care about reliable knowledge.
2. Our goal defines our metric, but some reviewers ignored our goal and metrics and evaluated our paper with their own imagined goals and metrics. They ignored the primary goal and premises of our work, which is disrespectful and unprofessional.
3. A 3-point review is a junior review; they don't understand novelty. Novelty should be based on the MAIN methods of the paper, not on shallow levels of similarity between methods (citations).
4. They used their own imagined goals and metrics to evaluate our paper. What improved is verification accuracy, not final answer accuracy; under OUR definition of accuracy, we improve SIGNIFICANTLY.

R: eliminate as many reasoning errors as we can (e.g., in a very long math proof), not guarantee absolute correctness (e.g., law, medicine). In many fields, not reliable = not an answer. We'd like to first clarify that our primary goal is *NOT* to improve the final answer accuracy of reasoning tasks. Our primary goal is to ensure the **validity of each reasoning step**, emphasizing LLM reasoning reliability over final answers.

To AC and SAC

> Dear AC, SAC and PC

We want to report Reviewers BLD6 and TPbw here. BLD6 did not read our paper carefully and does not understand deductive verification. Also, BLD6 is not familiar with the field (claiming "designing prompts is not novel") and does not understand self-consistency.
1. They are not knowledgeable enough to review our paper. Here is some evidence.
2. They misunderstood our paper.

---

Dear ACs and SACs,

Thanks a lot for your time and effort reviewing our paper! We wish to emphasize that our paper's core objective is to enhance the trustworthiness and reliability of LLM-generated reasoning by proposing an effective reasoning verification framework. This is vital in fields such as law, medicine, and scientific inquiry, where erroneous reasoning steps can lead to harmful outcomes, even if the final conclusions are accurate. As such, we emphasize how we can significantly **enhance the verification accuracy of LLM reasoning chains**, rather than the final answer accuracy of reasoning tasks.

We'd also like to express concerns regarding the original reviews posted by Reviewers `BLD6` and `TPbw`, *which appear unprofessional*:
- Reviewer BLD6 stated that "designing new prompts for a task (verification of rationales in this case) seems limited in terms of novelty". However, we'd like to note that two notable papers accepted by NeurIPS 2022 on LLMs primarily focused on prompting for step-by-step reasoning: "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" with 1010 citations as of Aug 9, and "Large Language Models are Zero-Shot Reasoners" with 485 citations as of Aug 9. We'd also like to note that our work goes beyond studying prompts for LLM reasoning chain verification. A crucial component of our verification framework involves simplifying the step-by-step verification process by only keeping the necessary premises and context for each reasoning step, which we believe is a valuable contribution.
- Reviewer TPbw performed a superficial comparison between our work and NLProofS and claimed that our approach lacks novelty. However, our work's central focus and central approach are significantly different from those of NLProofS.
- Reviewer TPbw disregarded the primary goal of our paper, which is to improve the *verification accuracy* of LLM reasoning, and failed to acknowledge the fact that our proposed verification framework significantly enhances the verification accuracy of reasoning chains (Table 4). Instead, Reviewer TPbw insisted on using final answer accuracy to judge our work, ignoring our central claims clearly stated in the paper (e.g., L62-64, L118-120 of the submission). We believe that Reviewer TPbw relied on unsubstantiated imagination during their review, which is disrespectful and lacks professionalism.

---

> **Common Response**

We sincerely thank all reviewers and ACs for their constructive feedback! We're glad that reviewers found our paper well-motivated (`LzaJ, BLD6, TPbw`), our verification framework and ideas plausible (`bmW6`) and good (`kmTd`), our observations interesting (`BLD6`), and our paper clear and easy to follow (`TPbw, bmW6`). We address common reviewer comments here and provide a revision of our tables.

> Primary goal of this work

We'd like to first clarify that our primary goal is *NOT* to improve the final answer accuracy of reasoning tasks. Our primary goal is to ensure the **validity of each reasoning step**, emphasizing LLM reasoning reliability over final answers. To this end, we first discover that naively verifying complex reasoning chains with LLMs yields low accuracy (see Table 2) because of intertwining reasoning-step dependencies that confound LLMs. We thus propose Natural Program-based verification, decomposing chains into steps and verifying each with *only* its necessary context and premises. This significantly improves verification accuracy (see Table 4), enhancing the reliability of LLM-generated reasoning and achieving our primary goal.

> Adding F1 metric to Tables 2, 4, 5

We appreciate Reviewer LzaJ's suggestions. We have added F1 scores to Tables 2, 4, and 5. When calculating F1 scores, for reasoning chains that are correct, a "true positive" is defined as the verification approach outputting "yes" for the final validity answer; for reasoning chains that are incorrect, a "true positive" is defined as the verification approach outputting "no" for the final validity answer.

**Table 2**: Two-shot reasoning chain verification accuracy for GPT-3.5-turbo (ChatGPT) with a naive verification approach, where we input an entire reasoning chain for verification, and premises of different reasoning steps are entangled together. Prompt in Appendix Table 10.

|Answer Correctness|GSM8K|AQuA|MATH|AddSub|Date|Last Letters|
|---|---|---|---|---|---|---|
|Correct|98.00%|96.00%|100.00%|92.00%|100.00%|96.00%|
|Incorrect|2.00%|4.00%|0.00%|6.00%|26.00%|6.00%|
|Average|50.00%|50.00%|50.00%|49.00%|63.00%|51.00%|
|F1|35.03%|36.58%|33.33%|37.43%|57.13%|38.56%|
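For concreteness, the reported F1 numbers are consistent with a macro-average of the two per-class F1 scores under the definition above. Below is a minimal sketch (not the paper's actual evaluation code) that reproduces the GSM8K entry of Table 2, assuming, for illustration, a balanced evaluation set of 50 correct and 50 incorrect reasoning chains:

```python
from typing import List

def class_f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1 for one class from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def verification_macro_f1(labels: List[bool], preds: List[bool]) -> float:
    """
    labels[i]: True if reasoning chain i is actually correct/valid.
    preds[i] : True if the verifier outputs "yes" (valid) for chain i.
    For correct chains the positive prediction is "yes"; for incorrect chains
    it is "no"; the two per-class F1 scores are averaged (macro F1).
    """
    # "valid" class: positive = verifier says "yes" on a correct chain
    tp_v = sum(l and p for l, p in zip(labels, preds))
    fp_v = sum((not l) and p for l, p in zip(labels, preds))
    fn_v = sum(l and (not p) for l, p in zip(labels, preds))
    # "invalid" class: positive = verifier says "no" on an incorrect chain
    tp_i = sum((not l) and (not p) for l, p in zip(labels, preds))
    fp_i = sum(l and (not p) for l, p in zip(labels, preds))
    fn_i = sum((not l) and p for l, p in zip(labels, preds))
    return 0.5 * (class_f1(tp_v, fp_v, fn_v) + class_f1(tp_i, fp_i, fn_i))

# Example: 50 correct chains with 98% "yes" and 50 incorrect chains with 2% "no"
# (the GSM8K column of Table 2) give a macro F1 of roughly 35.03%.
labels = [True] * 50 + [False] * 50
preds = [True] * 49 + [False] * 1 + [True] * 49 + [False] * 1
print(round(verification_macro_f1(labels, preds) * 100, 2))  # ~35.03
```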
**Table 4**: 1-shot deductive verification accuracy for GPT-3.5-turbo (ChatGPT) using our Natural Program-based approach with step-by-step reasoning chain decomposition. During the verification of each reasoning step, we consider two options: utilizing the complete premises $p_i = QC \cup S_{\le j}$, or keeping the minimal set of premises $\bar{p_i} \subseteq p_i$.

|Premise Context|Answer Correctness|GSM8K|AQuA|MATH|AddSub|Date|Last Letters|
|---|---|---|---|---|---|---|---|
|Full Premises|Correct|64.00%|54.00%|58.00%|95.00%|26.00%|96.00%|
||Incorrect|56.00%|68.00%|56.00%|24.00%|76.00%|5.00%|
||F1|59.94%|60.81%|56.70%|53.66%|47.73%|37.58%|
|Minimal Premises|Correct|84.00%|72.00%|70.00%|95.00%|90.00%|96.00%|
||Incorrect|84.00%|62.00%|76.00%|40.00%|56.00%|6.00%|
||F1|84.00%|66.92%|72.98%|64.84%|72.20%|38.56%|

**Table 5**: Ablation of different values of k′ on the verification accuracy of reasoning chains using our Unanimity-Plurality Voting strategy. Experiments are performed on AddSub using GPT-3.5-turbo (ChatGPT).

|Answer Correctness|k′=1|k′=3|k′=5|k′=10|
|---|---|---|---|---|
|Correct|86.00%|90.00%|90.00%|92.00%|
|Incorrect|38.00%|38.00%|38.00%|40.00%|
|F1|59.68%|61.39%|61.39%|63.53%|

> Revision of Table 3

**The major goal of this work is to improve the trustworthiness and reliability of reasoning. Table 3 aims to illustrate that, along this process, our deductive verification approach does not significantly compromise the accuracy of the final answers.** We update the Table 3 results here and will update our submission PDF in the future.

For the baselines (CoT and Faithful CoT), we adopt the implementations from their original repositories and report the results. These baselines utilize varying numbers of few-shot samples across datasets, exceeding our approach's usage (we use 2-shot for Date and 1-shot for the other datasets). Additionally, for AQuA, we observed an inconsistency in capitalization between the baseline CoT's official few-shot prompt and the AQuA answer choices. This discrepancy led the baseline CoT to generate answers with random capitalization. Our previous answer parser only considered capitalized answer choices as correct, so we adjusted it to accommodate different capitalizations, which improved the accuracies of the baselines.

We find that prompting language models to reason in our Natural Program format sometimes achieves better final answer accuracy than the baselines. Upon further applying our deductive verification approach to filter out invalid reasoning chains, we observe a slight decrease in final answer accuracy. **One major contributing factor to this decrease is the filtering out of reasoning chain candidates that provide correct answers but exhibit incorrect deductive reasoning.** An example is illustrated in Table 6 of our submission, where ChatGPT generates the correct final answer but assigns incorrect premise numbers to support the first reasoning step. We note that in many such cases, our verification approach effectively identifies these reasoning errors, thereby enhancing the rigor and reliability of the language models' reasoning processes, albeit with a slight negative impact on overall final answer correctness.

**Table 3**: Final answer accuracy comparison on GPT-3.5-turbo (ChatGPT). All approaches generate $k = 10$ reasoning chains for each problem before performing majority voting or reasoning chain filtering with our deductive verification approach.
|Methods|GSM8K|AQuA|MATH|AddSub|Date|Last Letters|
|---|---|---|---|---|---|---|
|CoT + Voting, few-shot|87.62%|70.18%|35.93%|92.36%|69.97%|81.60%|
|Faithful CoT + Voting, few-shot|81.20%|61.80%|31.78%|88.35%|73.50%|-|
|Ours (Natural Program (NP), No Verification), 1-2 shot|87.05%|70.34%|36.75%|93.67%|72.49%|92.98%|
|Ours (NP + Deductive Verification + UPV), 1-2 shot|86.01%|69.49%|36.48%|93.54%|71.45%|92.60%|

<!-- **Table 9**
|Methods|Results|GSM8K|AQuA|
|---|---|---|---|
|CoT|Correct|95.4%|93.6%|
||Incorrect|60.0%|43.2%|
|Natural Program|Correct|95.6%|58.6%|
||Incorrect|89.3%|49.8%|

GPT-4 with full premises: "correct" means the ratio $\frac{\text{GPT-4 thinks correct}}{\text{Answer correct}}$. -->

---

Reviewer LzaJ

> **Rating**: 6

> Response

We sincerely thank you for your constructive feedback! We address the comments and questions below.

> Q1: Include stronger baselines for comparison with "Natural Program".

We'd like to note that our primary goal is to improve the *trustworthiness and reliability* of reasoning via step-by-step deductive verification. As such, we don't primarily focus on comparing the final answer correctness of "Natural Program" with other established prompting techniques. Our choice of CoT and Faithful CoT as baselines in Table 3 stems from two reasons: 1) When we require the LLMs to generate output in the "Natural Program" format, we utilize few-shot prompting. Consequently, we included "CoT", which also employs step-by-step few-shot prompting for reasoning. 2) "Faithful CoT" and our proposed "Natural Program" both require LLMs to explicitly generate premises for each reasoning step. Unlike "Natural Program", which produces step-by-step answers in natural language, "Faithful CoT" generates a step-by-step code program as the answer and executes the program with an interpreter. We empirically find Faithful CoT to underperform Natural Program, suggesting that generating intricate code programs alongside reasoning chains could be challenging, while generating natural language-based reasoning could be comparatively simpler for LLMs.

> Q2: Why "Unanimity-Plurality Voting"?

We named our voting method "Unanimity-Plurality Voting" because there are two phases of majority voting: the "unanimity" phase and the "plurality" phase.
- The "unanimity" phase occurs when verifying a reasoning chain. During this phase, (1) to verify each reasoning step $s_i$ of a reasoning chain $S$, we sample a set of $k'$ validity predictions and perform majority voting to obtain the validity prediction for $s_i$; (2) for a reasoning chain $S$ to be valid, all of its reasoning steps have to be "unanimously" valid.
- The "plurality" phase occurs when we produce the final answer for problem solving. During this phase, we perform majority final-answer voting over the reasoning chains validated in the "unanimity" phase.
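To make the two phases concrete, here is a minimal sketch in Python (illustrative names only, not the paper's implementation), assuming each step's $k'$ sampled validity predictions are given as "yes"/"no" strings and each candidate chain carries its final answer:

```python
from collections import Counter
from typing import List

def verify_step(step_votes: List[str]) -> bool:
    """Unanimity phase, per step: majority-vote the k' sampled validity predictions."""
    majority, _ = Counter(step_votes).most_common(1)[0]
    return majority == "yes"

def verify_chain(chain_step_votes: List[List[str]]) -> bool:
    """A chain passes only if every one of its steps is judged valid ("unanimously" valid)."""
    return all(verify_step(votes) for votes in chain_step_votes)

def final_answer(chains: List[dict]) -> str:
    """
    Plurality phase: majority-vote final answers over the chains that pass verification.
    Each chain is a dict like {"answer": "42", "step_votes": [["yes", "yes"], ...]}.
    """
    valid_answers = [c["answer"] for c in chains if verify_chain(c["step_votes"])]
    # If every chain is filtered out, fall back to voting over all chains
    # (one possible choice; the paper may handle this case differently).
    pool = valid_answers if valid_answers else [c["answer"] for c in chains]
    return Counter(pool).most_common(1)[0][0]
```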
> Q3: Faithful CoT + Voting result update; why does "CoT + Voting" outperform "Faithful CoT + Voting"?

We have updated Table 3 so that the Faithful CoT numbers reflect majority voting results (please see "Common Response"). From the new Table 3, we still find that CoT sometimes outperforms Faithful CoT. This difference might stem from the nature of the methods: CoT provides natural language answers, whereas Faithful CoT generates code solutions. For challenging tasks such as MATH and GSM8K, the complexity of correct code programs could be substantial, potentially making natural language solutions simpler for LLMs to generate.

> Q4: Include final answer accuracy in the tables besides verification accuracy.

Thanks for your suggestion! We will position Tables 3 (final answer accuracy) and 4 (verification accuracy) on the same page. We'll also add final answer accuracies to Table 2.

> Q5: Report the F1 score of deductive verification.

Thanks for your suggestion! We have updated Tables 2, 4, and 5 to include F1 scores. Please see "Common Response" for further details.

> Q6: Deductive verification of "Faithful CoT".

Thanks for the suggestion! We've conducted an experiment on the GSM8K dataset to show the deductive verification performance of "Faithful CoT". In a manner similar to the verification evaluation pipeline for "Natural Program", we have manually chosen 100 reasoning chains (50 deductively valid and 50 with reasoning errors). We validated each reasoning step using full premises and minimal premises. Detailed results are presented in the table below.

**Verification results on GSM8K with Faithful CoT and Natural Program**

|Answer Correctness|Faithful - Full reasoning chain|Faithful - Step-by-step (Minimal premises)|NP - Step-by-step (Minimal premises)|
|---|---|---|---|
|Correct|96.00%|96.00%|84.00%|
|Incorrect|12.00%|18.00%|84.00%|
|Average|54.00%|57.00%|84.00%|
|F1|44.15%|49.28%|84.00%|

> Q7: Using UPV with CoT, without Natural Program (NP).

Thanks for the suggestion! Unfortunately, for reasoning chains generated with vanilla CoT, it is very challenging to extract the actual premises of a single reasoning step because of intertwining reasoning-step dependencies. In contrast, Natural Program allows individual reasoning steps and their minimal set of premises to be easily extracted. Therefore, step-by-step deductive verification is more compatible with answers generated through NP than with vanilla CoT. As for UPV, it also depends on step-by-step verification results, and thus is more applicable with NP.

> Limitations

Thanks for the suggestion! We'd like to point the reviewer to Sec. 6 and Tab. 7, where we showcase our approach's failures to detect reasoning-step errors when encountering ambiguous wording. We also included more success/failure analysis in Appendix D and Tables 12-14 of the Appendix.

<!-- > Q7: UPV performance comparison: full premises vs. minimal premises.

Thanks for the question. We present the performance of UPV with minimal premises in Table 3 as a reference, since it is expected to outperform UPV with full premises. When utilizing UPV, it initially eliminates reasoning chains that fail the deductive verification, then votes based on the final answer. A higher performance in deductive verification implies that more accurate reasoning chains are retained, while erroneous ones are filtered out, thereby improving the voting results. Moreover, the use of minimal premises proves to be more efficient than employing full premises, particularly when dealing with a lengthy reasoning chain and a sparse dependency graph (where each reasoning step relies on a small subset of the full premises). -->

<!-- The accuracy of correct and incorrect chains in Tables 4 and 5 will be moved to the appendix in the future. We will maintain the two accuracies in Table 2 to highlight the LLM's tendency to validate both accurate and inaccurate reasoning as "correct" when given the entire reasoning process. -->

<!-- Tables 2, 4, and 5 aim to present the performance of deductive verification. However, as Table 1 illustrates, even reasoning chains yielding correct answers can possess flawed reasoning.
This suggests that QA accuracy or the validity of the final answer may not be inherently linked to verification performance. Still, we acknowledge your insight about connecting deductive verification performance with QA accuracy. Consequently, we will position Tables 3 and 4 on the same page. This comparison should enhance our understanding of how deductive verification performance influences QA accuracy with UPV. -->

<!-- > However, it's important to note that the program's execution relies on an external interpreter, and the assignment of values to internal variables and the final answer occurs only upon successful execution with this interpreter. This reliance on an external interpreter can disturb the reasoning dynamics of the language model, resulting in an unfair comparison. The calculation or execution process itself is non-trivial, and may potentially induce hallucination and error accumulation in the subsequent reasoning. -->

---

Reviewer BLD6

> **Rating**: 4

We sincerely thank you for your constructive feedback! We address the comments and questions below.

> Circular nature of using an LLM for verification

We'd like to clarify that "the main hypothesis here is that we don't trust the generations sampled from LLM" is actually NOT our hypothesis. Not only do we seek when the LLM cannot be trusted, but also **when it can be trusted**. According to our experiments, when an LLM is used as a verifier, although it is not trustworthy when the logic chain to verify is complex and long, it is indeed highly trustworthy for short logic chains (e.g., one-step reasoning). This is the motivation for us to invent the Natural Program, which allows us to break complex and long logic chains into short ones so that we can trust the verification results of LLMs. In fact, this aligns well with how humans verify results -- recall that when we were elementary school students, teachers asked us to write down reasoning steps in great detail so that the logical flaws in our answers could be easily detected. These insights are validated by Table 2 (see "Common Response"), which shows poor verification accuracies when naively using the LLM to verify reasoning with entangled premises. Conversely, Table 4 (see "Common Response") illustrates how our Natural Program-based approach with decomposition and simplification leads to improved verification accuracy and fewer reasoning errors.

We also share the view that constructing comprehensive verification datasets and finetuning LLMs on these datasets can further enhance their verification accuracy. In Appendix Table 3, we finetuned Vicuna using our Natural Program verification framework, achieving better verification accuracies. Our Natural Program format and minimal-premise approach may inspire future verification datasets, improving the reliability of LLM-generated outputs.

> Intermediate steps and premise numbers for verification: ablations and design choices

Premise numbers serve to encourage LLMs to cite the *minimal* set of premises necessary for each generated reasoning step as they work through problem-solving. In this way, when LLMs verify each reasoning step, only the cited premises are utilized as the context for verification. Many irrelevant premises, which can easily confound and distract LLMs, are removed. This process is illustrated in Figure 1 of the paper. Without premise numbers, it is very challenging to remove irrelevant premises when verifying each reasoning step.
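As an illustration of this filtering, here is a simplified sketch (not the paper's code; the premise-citation format below is a stand-in for the actual Natural Program format in Fig. 1) of keeping only the cited premises as the verification context for one step:

```python
import re
from typing import Dict

def minimal_premise_context(step: str, premises: Dict[int, str]) -> str:
    """
    Keep only the premises whose numbers the reasoning step explicitly cites,
    and use them (plus the step itself) as the verification context.
    `premises` maps premise number -> premise statement, e.g. {1: "...", 2: "..."}.
    """
    cited = {int(n) for n in re.findall(r"#(\d+)", step)}
    kept = [f"#{n}: {premises[n]}" for n in sorted(cited) if n in premises]
    return "\n".join(kept + ["Step to verify: " + step])

# Illustrative usage (a simplification of the Natural Program format):
premises = {1: "Alice has 3 apples.", 2: "Bob gives Alice 2 apples.", 3: "Carol has 5 pears."}
step = "#4 (by #1 #2): Alice now has 3 + 2 = 5 apples."
print(minimal_premise_context(step, premises))  # keeps only #1 and #2 as context
```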
In Table 2 (refer to "Common Response"), we perform an ablation by excluding premise numbers from the reasoning generation prompt (prompt in Table 10 of the Appendix). Here, despite the LLM attempting to verify each reasoning step, irrelevant premises remain during step verification. In contrast to Table 4 (minimal premises, our approach), verification accuracy experiences a notable decline. Another ablation is in Table 4, where we keep the premise numbers in the reasoning chain generation prompt. However, during step verification, premise numbers are disregarded, and all preceding steps are used as verification context (e.g., verifying step "#6" includes all prior steps "#1...#5" as context). This is shown in the "Full Premises" entry, where verification accuracy markedly decreases.

> L240-241 - choice of using "yes" as the default verification result when the model does not provide a direct answer

Firstly, LLMs generally output "yes" or "no" following our few-shot examples and rarely output other answers (GSM8K: 0.04%; AQuA: 0.19%; MATH: 0.16%; AddSub: 0.00%; Date: 1.30%; Last Letters: 1.3%). Our voting strategy treats outliers as a separate category, further minimizing their impact on our approach's efficacy. Additionally, we manually checked the model's reasoning steps when no direct answer is provided, finding them correct in nearly all cases on our dataset, so the verification answer is "yes". This is reasonable, as LLMs can answer most problems in the datasets we use.

> Novelty of prompting

Firstly, we'd like to note that two notable papers accepted by NeurIPS 2022 on LLMs primarily focused on prompting for step-by-step reasoning: "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" with 1010 citations as of Aug 9, and "Large Language Models are Zero-Shot Reasoners" with 485 citations as of Aug 9. We'd also like to note that our work goes beyond studying prompts for LLM reasoning chain verification. A crucial component of our framework involves simplifying the step-by-step verification process by only keeping the necessary premises and context for each reasoning step.

> L111-112 too generic

We acknowledge that this statement is too generic. We'll revise it as "We observe that for many cases where LLMs produce erroneous final answers through a step-by-step reasoning process, there exists at least one mistake among the intermediate reasoning steps."

> Lines 205-207 and Table 5 clarification

We'd like to clarify that we are NOT "relying on crude statistical hope that the model will get it right most of the time", nor relying on extensive voting.
- While we can sample multiple yes/no values for verification and perform majority voting, Table 5 (see "Common Response") shows that while multiple votes offer some improvement, the gains are modest compared to using a single vote.
- Compared to voting, the premises used for verifying each reasoning step play a *much* larger role, as shown in Table 4. Removing irrelevant premises that confound LLMs significantly improves verification accuracy.

> Limitations of API calls

We recognize that including the verification step increases API calls compared to omitting it. However, in situations such as legal cases and answering scientific questions, the reliability of reasoning is often prioritized over economic factors. We'll add a discussion of this. Also, Table 5 shows that adding more votes for verification provides slightly more benefit, but does not differ much from using one vote.
<!-- As illustrated in Table 2, LLMs are inherently not good at verifying complex reasoning when juxtaposed with their problem-solving capacity. -->

<!-- Yes, to obtain the final validity prediction for $s_i$, we sample verification results multiple times and perform majority voting. However, through Table 5, we illustrate that while augmenting the value of k (more samples) generally leads to slightly improved accuracy in reasoning validation, a significant enhancement in verification accuracy was observed, reaching 62% with a minimal value of k=1, compared to the full premises. This table suggests that under our deductive verification with minimal premises, LLMs perform much better on verification tasks. -->

<!-- The prompt to verify each reasoning step $s_i$ includes (1) the extracted minimal set of premises for the current reasoning step; (2) the current reasoning step; (3) the statement in L186-187. An illustration is in the upper-right block of Figure 1. If component (1) does not remove irrelevant premises, then LLMs are *not* good at verifying $s_i$, as ablated in Table 4. By removing irrelevant premises from component (1) (which is our proposed approach), LLMs are much better at verifying $s_i$. -->

---

Reviewer TPbw

> **Rating**: 3

<!-- 1. method difference, the difference is non-trivial, simplifies premises, in-context learning -->

We sincerely thank you for your constructive feedback! We address the comments and questions below.

> Differences between NLProofS and our work

Thanks for your insightful question; we'll cite the "NLProofS" paper. We'd like to highlight several perspectives that distinguish our work from NLProofS:
- **The central approaches of the two papers differ.** Our approach leverages the in-context learning and step-by-step reasoning abilities of LLMs to perform the verification of each step in a reasoning chain, and does not require LLM finetuning. On the other hand, NLProofS finetunes LLMs on specific datasets to score proof tree nodes and search for proofs. Such a scoring process does not utilize in-context learning.
- **The focus of the two papers differs.** The primary goal of NLProofS is to propose "a novel method for stepwise proof generation", while our primary goal is to propose an effective framework to verify deductive reasoning chains. To this end, we explore different designs for our verification framework, ultimately leading us to develop the Natural Program reasoning format and adopt the minimal premise-based verification approach.
- While NLProofS employs a scoring model to assign a validity score to proof tree nodes, it does not provide explanations for **why** a proof step is invalid. In contrast, our Natural Program-based LLM verification approach not only identifies invalid reasoning steps, but can also provide explicit explanations for why they are invalid, detailing the specific reasoning errors involved.

> Have we achieved the goal of generating more trustworthy reasoning?

Throughout this work, **we have uncovered valuable insights and discovered a verification framework that *significantly* improves the verification accuracy of reasoning chains over naive approaches, and we believe we have achieved our primary objective of improving LLM reasoning reliability**. We are also NOT asserting that the deductive verification problem is completely solved, but we believe that our proposed framework is a promising avenue to explore.
The insights from our paper are as follows: naively using an LLM to verify complex reasoning chains results in very poor verification accuracies (see Table 2 in "Common Response") because of intertwining reasoning-step dependencies that confound LLMs. However, when we break down complex and long logic chains into short ones (e.g., one-step reasoning), and, more crucially, verify each reasoning step with *only* its necessary minimal context and premises, we can *significantly* improve verification accuracies (see Table 4), thereby enhancing the reliability of LLM-generated reasoning. Moreover, we experimentally demonstrate the potential to adapt our insights and our verification framework to construct comprehensive verification datasets and further improve LLM verification accuracy through finetuning. In Appendix Table 3, we finetuned Vicuna using our Natural Program verification framework, achieving better verification accuracies. Our Natural Program format and minimal-premise approach may inspire future verification datasets, improving the reliability of LLM-generated outputs.

> I'd still place Natural Programs as a means to achieving better final accuracy. Final answer accuracy results are mixed.

While the final answer accuracy results are mixed, we find that **one major contributing factor is the filtering out of reasoning chain candidates that provide correct answers but exhibit incorrect deductive reasoning.** An example is illustrated in Table 6 of the submission, where ChatGPT generates the correct final answer but assigns incorrect premise numbers to support the first reasoning step, and our verification approach effectively identifies many of these reasoning errors. Combined with the fact that our proposed verification approach has significantly improved verification accuracy over naive approaches, we believe we have achieved our primary goal.

> Final answer accuracy on Vicuna

Thanks for your suggestion. We have added a final answer accuracy comparison between "Natural Program" and "CoT" for Vicuna models to complement the verification accuracies in Appendix Tab. 3.

|Models|Methods|GSM8K|AQuA|MATH|AddSub|Date|Last Letters|
|---|---|---|---|---|---|---|---|
|Vicuna-7B|CoT|9.26%|25.03%|7.25%|28.33%|21.70%|0.02%|
||Natural Program|7.79%|23.93%|6.55%|38.10%|27.76%|0.00%|
|Vicuna-13B|CoT|16.54%|25.26%|7.30%|36.15%|19.79%|1.04%|
||Natural Program|34.16%|21.01%|7.09%|31.48%|33.88%|2.36%|

> Number of parameters for GPT-3.5 Turbo

Thanks for your suggestion. We will remove this number.

> L59-61 statement

Thanks for your attention. We'll revise it by replacing "nearly all" with "many".

> Faithful CoT GSM8K results

We reran Faithful CoT with GPT-3.5-turbo using their original repo and have updated Table 3. It achieves a better performance of 81.20% on GSM8K.

> Number of API calls per question for Natural Program and CoT

For reasoning chain generation, both Natural Program and CoT use 1 API call per question. This is because "Natural Program" and "CoT" both append few-shot examples before the query question.

> Calls per step in verification

For each reasoning step, we need one call for verification. We can also aggregate multiple queries into one single call by asking the model to answer multiple questions.

> Combine Faithful CoT with Natural Programs

Thanks for your question.
Table 3 shows that Faithful CoT does not perform as well as the CoT baseline or our Natural Program across most tasks, likely because Faithful CoT focuses on code generation, which doesn't provide much benefit for our current arithmetic-focused tasks. Due to limited rebuttal time, we'd like to leave further exploration for future work.

> Frequency of "Unknown" (L66)

The frequency of "unknown" is quite rare (GSM8K: 2.7%; AQuA: 5.0%; AddSub: 2.9%; others: 0%).

> Typo on page 7

Thanks for pointing it out! We'll fix it.

<!-- Thanks for your question. This strategy enhances user experience, although "Unknown" cases are treated as incorrect during evaluations for fair comparison. In addition, we expect the "Unknown" cases to diminish when samples increase with self-consistency. We present the ratio of "Unknown" appearances in our evaluation.
|Methods|GSM8K|AQuA|MATH|AddSub|Date|Last Letters|
|---|---|---|---|---|---|---|
|Unknown Rate|2.7%|5.00%|2.90%|0.00%|0.00%|0.00%| -->

<!-- Thank you for pointing this out. We have updated Table 3. For details, please refer to the "Response to Overlapping Review Comments" section. For all tasks except "Last Letters," methods perform similarly. Regarding "Last Letters," we believe that the original prompt in the CoT paper is not as effective as ours for "Natural Program", which is designed for deductive verification. -->

<!-- NLProofS employs a scoring model to generate a validity score for new node samples within a proof tree, which is then utilized to expand the proof graph with high-scoring node proposals. Unlike NLProofS, our approach centers on employing LLM verifiers to provide **clear and conclusive** validation outcomes for reasoning steps, definitively determining their validity (yes/no). Furthermore, in situations where a reasoning step is deemed invalid, LLM verifiers can elucidate **why** it is so by explicitly outlining the specific reasoning errors. NLProofS does not produce such explanations. -->

<!-- - Different roles of language models: In NLProofS, two **fine-tuned** models are utilized. One facilitates step-wise reasoning, generating new steps utilizing premises and intermediate steps. The second, a score model (verifier), acts as a heuristic function to assist in searches, but lacks the capability to confirm the deductive validity of reasoning. Notably, NLProofS achieves reasoning-step connections through sampling, devoid of language model reasoning. In contrast, Natural Program leverages the robust reasoning of LLMs to construct a step-by-step solution with premises in a single pass. Moreover, Natural Program is only a part of our work, which is designed for verification. The ensuing deductive verification not only confirms the deductive validity of a reasoning step but also provides the underlying reasons.
- In NLProofS, the LLM selects premises exclusively from a pool of "supporting facts". On the other hand, Natural Program is more flexible, especially for open-ended reasoning problems, such as MATH, which assumes an understanding of common mathematical concepts like basic identities or inequalities, and Date, which assumes an understanding of the definitions of "yesterday" and "tomorrow", but these concepts and definitions are not provided in the given context.
- Different targets: NLProofS focuses on enumerated-premises-based proof. These premises serve as initial points for the step-wise proof process and scoring, utilizing brute-force search for constructing natural language proofs.
However, NLProofS lacks the capacity to leverage external knowledge beyond those premises. Our approach, on the other hand, emphasizes validating LLM-generated reasoning with LLMs' knowledge, enabling a general problem-solving and **verifying** framework. NLProofS's reliance on premise enumeration limits its scalability to broader open-ended reasoning tasks that require **external knowledge and commonsense**. A simple case of this is "Marin eats 4 apples a day. How many apples does he eat in November?", where missing the hidden information about November likely causes NLProofS to fail. -->

<!-- 1) Solution provisioning: Both NLProofS and Natural Program offer step-by-step solutions. However, NLProofS necessitates the manual enumeration of premises and employs a fine-tuned stepwise prover for subsequent steps. In contrast, Natural Program leverages the capabilities of LLMs to produce a step-by-step solution with premises in a single pass. Furthermore, the reliance of NLProofS on enumerating premises challenges its scalability for broader open-ended reasoning tasks. 2) Target tasks: NLProofS is designed for generating natural language proofs. Our work, on the other hand, primarily emphasizes verifying the reasoning produced by LLMs. Natural Program is just a part of our work, which is designed for verification. 3) Verification: The NLProofS verifier is a score model fine-tuned from RoBERTa, acting as a heuristic function to assist in searches. Notably, it lacks the capability to confirm the deductive validity of reasoning. Conversely, our deductive verification algorithm not only establishes whether a reasoning step or chain is deductively valid but also provides the underlying reasons. -->

<!-- For UPV results, we expect no improvement in final answer accuracy, because, as shown in Table 3 in the Appendix, the verification performance of Vicuna models is even lower than that of GPT-3.5-turbo. -->

---

Reviewer kmTd

> **Rating**: 4

We sincerely thank you for your constructive feedback! We address the comments and questions below.

We'd like to first clarify that our primary goal is *NOT* to improve the final answer accuracy of reasoning tasks. When LLMs leverage CoT to carry out problem solving, hallucinations and accumulated errors can often occur in intermediate reasoning steps, even if the final answers are correct (examples are illustrated in Tab. 1 of the submission). This phenomenon limits the reliability and trustworthiness of LLM generations. As a result, we place significant emphasis on ensuring the **validity of every reasoning step**, *not the correctness of the final answer*, such that LLM reasoning outputs become more reliable. To this end, we first discover that naively using an LLM to verify complex reasoning chains results in very poor verification accuracies (see Table 2 in "Common Response") because of intertwining reasoning-step dependencies that confound LLMs. We therefore propose our Natural Program-based verification approach, which decomposes the verification of an entire reasoning chain into the verification of individual reasoning steps, and, more crucially, bases the verification of each reasoning step on *only* its necessary context and premises. As a result, we have *significantly improved the verification accuracy of reasoning chains* (see Table 4 in "Common Response"), thereby significantly improving the reliability of LLM-generated reasoning steps and achieving our primary goal.
> Clarification of Table 3 in our submission and the "Tab. 5.2" typo

Thank you for pointing it out. "Tab. 5.2" is a typo; it refers to Tab. 3 in our submission. We will revise it in the future. We have also updated Table 3 to address the submission-arXiv mismatch. For details, please refer to "Common Response".

> Does deductive verification improve QA/generation accuracy?

Thank you for raising this point. As shown in Table 3, we acknowledge a slight decrease in QA/generation (final answer) accuracy after applying deductive verification on Natural Program reasoning chains. One major contributing factor to this decrease is the filtering out of reasoning chain candidates that provide correct answers but exhibit incorrect deductive reasoning.

> Can our deductive verification methods be applied to a broader range of cases, such as other models or different formats?

Thank you for your insightful question. Our Natural Program-based verification approach harnesses the power of natural language, making it potentially applicable to a wide range of natural language reasoning tasks along with any LLM with in-context learning capabilities. For abstract reasoning datasets (e.g., Last Letters), our verification method can be applied when the reasoning processes can be articulated in natural language. For datasets centered around shape or object comprehension, new prompts can be devised to translate shapes and objects into symbols and words, enabling natural language reasoning and verification at the symbol level. For code-based datasets, if the natural language reasoning used for code generation is accessible, our approach can be used to verify this reasoning as well. Furthermore, our verification approach can be applied to fields that naturally demand verification of sequences, such as assessing human reasoning or analyzing arguments in law, which would be interesting for future work.

<!-- Another significant contributing factor is that the deductive verification accuracies of current LLMs are not very high, even though our proposed framework has already significantly improved over naive verification approaches. As shown in Table 4 (see "Common Comments"), a notable percentage (>15%) of verification results still exhibit false positives and false negatives. We believe this might be due to the lack of fine-tuning in current LLMs, including ChatGPT, specifically for reasoning verification. While prior work has largely focused on fine-tuning LLMs for specific reasoning tasks, deductive verification has not received as much attention. To improve the deductive verification accuracies of LLMs, an intuitive approach is to construct comprehensive verification datasets and finetune LLMs on these datasets to further enhance their verification accuracy. Encouragingly, we observe promising results with this approach. In Appendix Table 3, we present experiments where Vicuna is finetuned on a verification dataset generated using our Natural Program verification framework, leading to improved verification accuracies. Thus, as future LLMs exhibit even better performance on deductive reasoning verification, we are optimistic about leveraging verification frameworks to improve QA/generation accuracy. -->

---

Reviewer bmW6

> **Rating**: 6

We sincerely thank you for your constructive feedback and your appreciation of our work! We address the comments and questions below.

> Q1: Add experiments over more benchmarks like CLUTRR and StrategyQA.

Thanks for your suggestion. For StrategyQA, we sample 1000 instances from the dataset.
We use 7-shot samples for CoT and a 1-shot sample for Natural Program. For CLUTRR, we sample 200 instances from the dataset due to limited rebuttal time. We use 8-shot samples for CoT and 2-shot samples for Natural Program. Both datasets are shuffled with random seed 42.

**Final answer accuracy on StrategyQA and CLUTRR**

|Methods|StrategyQA|CLUTRR|
|---|---|---|
|CoT (few-shot)|72.45%|39.25%|
|Natural Program (1-2 shot)|69.95%|46.00%|

> Q2: Add experiments with advanced models like GPT-4.

Thanks for your suggestion! It would be a great idea to explore more advanced models like GPT-4. However, due to the high costs associated with GPT-4, particularly when combined with majority voting, we'd like to postpone this exploration to future work when GPT-4 becomes more cost-effective.

> Q3: Related work and comparison with the "Selection-Inference" family.

Thanks for bringing attention to these works; we'll cite "Selection-Inference". We'd also like to note that while both Selection-Inference and Natural Program have premise selection phases, their mechanisms are different. The selection model within the Selection-Inference framework selects facts exclusively from the given context, effectively preventing fact hallucination. On the other hand, Natural Program is more flexible for, e.g., instruction-based open-ended reasoning problems, such as Last Letters, which instructs LLMs to output the concatenation of the last letters of all words in a sequence.

> Q4: Evaluate Natural Program on ProofWriter.

Thank you for your suggestion. We randomly sample 100 examples with a depth of 3 from the development subset of the ProofWriter CWA dataset. Both Natural Program and CoT are evaluated using GPT-3.5-turbo (ChatGPT) with a one-shot example as the prompt. The results can be found in the table below.

**Final answer accuracy on ProofWriter**

|CoT|Natural Program|
|---|---|
|87.00%|94.00%|

> Q5: What is the meaning of "Correct/Wrong" in Table 2?

Thanks for your question. Yes, your understanding is correct. Results in Table 2 illustrate the percentage of verification validity outputs that align with the answer correctness ("Correct/Wrong") of reasoning chains.

> Q6: Results are particularly good on "AQuA".

Thanks for pointing this out. Upon analysis on AQuA, we found that baseline CoT performed worse than Natural Program mainly because its answer responses were not consistently capitalized (e.g., the output can be "A" or "a"), which in turn is because the few-shot prompt from the original CoT paper uses lowercase letters while the query questions use capitalized letters. Our previous answer parser only considered capitalized answer choices as correct. Therefore, we adjusted our answer parser to accommodate different capitalizations in answer choices, which improved the accuracies of the baselines from our previous Table 3. We have updated Table 3 in "Common Response". After the update, our Natural Program performs comparably to the CoT baseline on AQuA.

> Q7: How are minimal premises decided in Table 4?

Yes, your understanding is correct. "Minimal premises" means that when verifying each reasoning step from a Natural Program reasoning chain, we only keep the premises whose numbers are cited by that reasoning step as its context for verification. While we encourage LLMs to cite the minimal set of premises when generating reasoning chains in our Natural Program prompt, LLMs might still cite irrelevant premises during actual problem-solving processes, thereby introducing irrelevant premises for verification.
<!-- Occasionally, the premises for valid reasoning chains in the evaluation set might contain redundant information. This scenario aligns with situations encountered when using UPV, and the performance is expected to lie between "Full Premises" and true "Minimal Premises". -->

---

Thanks for your reply! We address your follow-up questions below:

> Comparing the formats of Natural Program vs. NLProofS, besides the two works' differences in central goal and central approach mentioned in our rebuttal.

We'd like to highlight several perspectives that distinguish the format of Natural Program from the format of NLProofS:
- In NLProofS, the initial premises (referred to as "supporting facts" in Fig. 1 of their paper) used for proof search are not extracted through in-context learning by LLMs. Rather, these premises are provided by the datasets NLProofS utilizes. On the other hand, in Natural Program, initial premises ("Question-Related Premises" in Fig. 1 of the submission) are automatically extracted by LLMs through in-context learning, which makes Natural Program more flexible towards diverse natural language reasoning tasks.
- When deriving a new reasoning step from previous premises, Natural Program is also more flexible, as it allows the use of commonsense knowledge not explicitly listed in the premises. For example, consider this problem: "Marin eats 4 apples a day. How many apples does he eat in November?" Even though "November has 30 days" is not explicitly listed in the premises, Natural Program permits the use of such common knowledge within a reasoning step. Our in-context verification process is also capable of handling these implicit premises (e.g., if the LLM outputs "November has 29 days" in a reasoning step, it will be marked as invalid).
- NLProofS is limited to tasks that have an explicit proof structure, while Natural Program is compatible with in-context abstract reasoning tasks such as Last Letters, where the LLM is instructed to output the concatenation of the last letters of all words in a sequence as the final answer.

> How our paper's results fit together to support the main claim of the paper.

If we are understanding your question correctly, we believe that it can be rephrased as follows: `Essentially, why does our verification approach significantly improve the verification accuracy of reasoning chains in Table 4 of the submission, but barely improve the final answer accuracy in Table 3 of the submission?` We answer this rephrased question below.

Consider the GSM8K dataset as an example (recall that the final answer for a problem is obtained through majority voting). Among all problems, 91.6% have `|number of votes received by the correct answer - largest number of votes received by a single wrong answer| > 2`, and their final answers are unlikely to be changed by our deductive verification approach. For the *rest of the cases (8.4%)*, where deductive verification is more likely to impact the final answers, we found that:
- Among all reasoning chains that arrive at correct answers (these correct-answer chains account for 49.4% of all reasoning chain candidates), 46.2% are filtered out by our verification process.
- Among the reasoning chains that arrive at the correct answer but are filtered out by our verification process, 76.3% indeed exhibit incorrect reasoning.
- Among the reasoning chains that arrive at the correct answer and are not filtered out by our verification process, 78.0% indeed have correct reasoning.
- Among the reasoning chains that do not arrive at the correct answer and exhibit incorrect reasoning (these account for 50.6% of all reasoning chain candidates), 40.6% are filtered out by our verification process.

The above statistics show that a significant portion of reasoning chains that arrive at correct answers but exhibit incorrect reasoning are successfully eliminated. Therefore, the reliability and trustworthiness of reasoning chains that arrive at the correct answers are significantly improved. Combined with the fact that a significant proportion of reasoning chains with incorrect answers are eliminated, and that our approach's verification accuracy significantly improves over naive verification approaches, our primary goal of improving LLM reasoning reliability is accomplished.

Nevertheless, the removal of many reasoning chains yielding correct answers (specifically, a significant 46.2% x 49.4% of all chains) has a notable impact. This even exceeds the removal of reasoning chains with incorrect reasoning and answers (40.6% x 50.6% of all chains). As a result, there are fewer votes for the correct answer when generating final answers through majority voting, which limits the final answer accuracy. In the future, we believe that when a greater proportion of incorrect reasoning chains with incorrect answers are filtered out, we can improve the final answer accuracy. We'll make the above statements clearer in our final draft.
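For concreteness, the two products mentioned above work out to

$$0.462 \times 0.494 \approx 22.8\% \qquad \text{vs.} \qquad 0.406 \times 0.506 \approx 20.5\%,$$

i.e., roughly 22.8% of all reasoning chain candidates are filtered correct-answer chains, versus roughly 20.5% that are filtered chains with incorrect reasoning and incorrect answers.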
<!-- (Side note: Table 2 and Table 4 report single-reasoning-step validation accuracy, not the validation accuracy of an entire reasoning chain) -->

<!-- (As a side note, Table 4 reports single-reasoning-step validation accuracy, not the validation accuracy of an entire reasoning chain) -->

<!-- As a recap on contribution #1, we post it here for convenience: `We propose a novel framework for rigorous deductive reasoning by introducing a "Natural Program" format (Fig. 1), which is suitable for verification and can be generated by just in-context learning.` We'd like to clarify that besides emphasizing Natural Program as a tool for our primary goal of proposing a novel frame -->

<!-- The major contributing factor is the filtering out of reasoning chain candidates that provide correct answers but exhibit incorrect deductive reasoning. Specifically, on GSM8K, **for the problems where |number of votes received by the correct answer - largest number of votes received by a single wrong answer| <= 2** (recall that our final answer is obtained through majority voting, and we chose the aforementioned cases for analysis as their final answer accuracies are most likely to improve through deductive verification; we also found that the rest of the cases are ) -->

---

Thanks a lot for your reply! We'll cite ROSCOE and ReCEval. We'd also like to highlight several perspectives that distinguish our work from ROSCOE and ReCEval:
- Our Natural Program-based approach leverages the in-context learning and step-by-step reasoning capabilities of off-the-shelf LLMs to extract individual reasoning steps and the minimal premises for each reasoning step, and to conduct step-wise verification. **Importantly, our approach achieves the entire process without the need for LLM finetuning.** On the other hand, both ROSCOE and ReCEval rely on finetuned language models for verification (ROSCOE relies on a SimCSE embedding model finetuned on specific datasets; ReCEval relies on a T5 model finetuned on gold reasoning chains of specific datasets).
- While both ROSCOE and ReCEval assign validity scores to reasoning chains, they do not provide explanations for **why** a reasoning step is invalid. In contrast, our Natural Program-based LLM verification approach not only identifies invalid reasoning steps, but can also provide explicit explanations for why they are invalid, detailing the specific reasoning errors involved.
- Our Natural Program-based reasoning and verification approach is compatible with in-context abstract reasoning tasks where reasoning steps do not possess proof-like entailment structures. For example, our approach is compatible with Last Letters, where the LLM is instructed to output the concatenation of the last letters of all words in a sequence as the final answer.
- Our Natural Program approach allows the use of commonsense knowledge not explicitly listed in the premises. For example, consider this problem: "Marin eats 4 apples a day. How many apples does he eat in November?" Even though "November has 30 days" is not explicitly listed in the premises, Natural Program permits the use of such common knowledge within a reasoning step. Our in-context verification process is also capable of handling these implicit premises (e.g., if the LLM outputs "November has 29 days" in a reasoning step, it will be marked as invalid).

> Comparison between using full premises and minimal premises.

Thanks for pointing this out. In our work, we also compare the deductive verification performance with all previous statements (full premises) to that with minimal premises. In Table 4, we find that on most datasets, using minimal premises improves the performance.

> ReCEval
