Revisions and Prior Reviews
===========================================================================
# Revisions
<!-- --------------------------------------------------------------------------- -->
We revised our paper and addressed the following concerns:
### Paper ignores recent approaches to grammar mining, which easily handle hierarchies over input processing functions.
AIFORE focuses on binary inputs, which differ from the grammar-based inputs targeted by those approaches.
Nevertheless, we have investigated more recent work on grammar mining and cite it in this version.
### Approach limited to a fixed set of field type patterns.
AIFORE identifies six types of fields, and these types cover most kinds of binary input formats, such as images, fonts, network packets, and compressed files.
In this version, we add more targets (13 formats and 15 programs in total) that parse such inputs, and the experiments show that AIFORE predicts field types correctly with an accuracy of over 80%.
### Provide enough details about the setup of the model training phase. The CNN model should be evaluated on a set of new formats and new programs.
We add Table 2 (the complete file input data set), which lists all the formats and programs (with parameters) used for evaluating each module of AIFORE; details are in Section 5 (Evaluation).
We redesign the evaluation of field type accuracy around three questions in Section 5.1.2 (Field Type Accuracy).
We evaluate the accuracy on unseen formats that are parsed by the trained programs (exercising new processing code) and on untrained programs, respectively.
### What is a pit template? Evaluate AIFORE against its base fuzzer to truly understand how much of the fuzzing performance gain comes from the approaches that are specific to AIFORE.
In this version, we explain the pit template in Section 2 (Motivation Example).
A pit file is a format template used by Peach, a generation-based testing framework, to generate well-formed inputs.
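To make this concrete, below is a minimal, purely illustrative pit fragment; the element names follow Peach's pit conventions (which AFLSmart also consumes), but the concrete fields and values are hypothetical and not taken from our generated pits.

```xml
<?xml version="1.0" encoding="utf-8"?>
<Peach>
  <!-- Illustrative fragment only: a magic number, a size field,
       and the payload whose length the size field describes. -->
  <DataModel name="ToyFormat">
    <Blob name="Magic" valueType="hex" value="7f 45 4c 46" token="true"/>
    <Number name="Size" size="32" endian="little">
      <Relation type="size" of="Payload"/>
    </Number>
    <Blob name="Payload"/>
  </DataModel>
</Peach>
```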
We also conduct an ablation study to evaluate how much each module contributes to fuzzing efficiency in Section 5.4 (Comparison of contributions to fuzzing efficiency).
### The generated grammar will have little coverage due to being derived from 5 inputs, which is a problem for fuzzing where many different formats are eventually explored in long runs.
In this version, we design a novel power scheduling algorithm, based on formats dynamically extracted during testing, to improve fuzzing efficiency. In long runs, or when the fuzzer gets stuck, we re-assign fuzzing power to seeds bound to less-mutated formats and skip seeds whose formats have already been mutated adequately.
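As a rough illustration of this scheduling idea (a simplified sketch, not the exact algorithm in the paper; the seed attributes and function name are hypothetical):

```python
def reassign_power(seeds, base_energy=100):
    """Sketch: concentrate fuzzing energy on seeds whose extracted format
    fields have been mutated the least, and skip seeds whose format has
    already been mutated adequately."""
    # mutation_ratio: fraction of a seed's format fields already mutated
    # (1.0 means the format has been exercised adequately).
    active = [s for s in seeds if s.mutation_ratio < 1.0]
    if not active:
        return {}  # every format is adequately mutated; nothing to schedule
    remaining = sum(1.0 - s.mutation_ratio for s in active)
    return {s.id: base_energy * (1.0 - s.mutation_ratio) / remaining
            for s in active}
```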
# Prior Reviews
<!-- --------------------------------------------------------------------------- -->
Review A
===========================================================================
Overall merit
-------------
2. Weak reject
Reviewer expertise in the topic domain
--------------------------------------
4. Expert (I've worked on this topic domain or a closely related topic
domain)
Reviewer confidence
-------------------
4. Confident
Paper summary
-------------
Generating test inputs (fuzzing) for a program is way more effective if the program's input format is known. Short of a specification, one can try to infer format features from program code or its execution, notably by tracing dynamic taints and analyzing how the input is being processed. The proposed approach improves over the state of the art by using a Convolutional Neural Network (CNN) to infer field types such as size or offset, and memory access patterns to infer relationships between fields. In its evaluation, language models (pits) produced by the AIFORE prototype are shown to yield significantly higher code coverage.
Strengths / Reasons to accept
-----------------------------
* Novel usage of CNNs to infer field types
* Novel program analysis to infer field relationships
* Clear separation of language inference and actual test generation
Weaknesses / Reasons to reject
------------------------------
* Paper ignores recent approaches to grammar mining, which easily handle hierarchies over input processing functions
* Approach limited to a fixed set of field type patterns
* Tool not available for comparison or replication
Constructive comments for author
--------------------------------
There's a lot to like about this paper. The authors identify weaknesses in current approaches to infer input formats, engineer a number of novel and pragmatic solutions, and demonstrate that their engineering is successful. The paper is well-written (at least for those working in the field), and does not leave too many questions open.
Yet, the paper is not without flaws, as listed below.
### Language-based testing
First, the paper totally ignores earlier work on language-based test generation, which is much older than fuzzing. Essential papers for §6.1 include:
```bibtex
@article{Burkhardt1967,
  title   = {Generating test programs from syntax},
  author  = {Burkhardt, W. H.},
  journal = {Computing},
  volume  = {2},
  number  = {1},
  pages   = {53--73},
  year    = {1967},
  doi     = {10.1007/BF02235512},
  url     = {https://doi.org/10.1007/BF02235512}
}

@article{Hanford1970,
  title   = {Automatic generation of test cases},
  author  = {Hanford, K. V.},
  journal = {IBM Systems Journal},
  volume  = {9},
  number  = {4},
  pages   = {242--257},
  year    = {1970},
  doi     = {10.1147/sj.94.0242}
}

@article{Purdom1972,
  title   = {A sentence generator for testing parsers},
  author  = {Purdom, Paul},
  journal = {BIT Numerical Mathematics},
  volume  = {12},
  number  = {3},
  pages   = {366--375},
  year    = {1972},
  doi     = {10.1007/BF01932308},
  url     = {https://doi.org/10.1007/BF01932308}
}
```
While these three cover context-free languages, the concept of language-based test generation (= format-aware fuzzing) should be cited and framed accordingly. It is also a surprise that the Peach fuzzer is not cited, although AIFORE produces (Peach?) pit templates as output.
### Field type boundary inference
A bigger problem is that finding block boundaries has already been addressed in works on reconstructing parse trees and mining grammars, and in a similar vein to the present paper.
The pioneering work when it comes to reconstructing input structures is "Deriving Input Syntactic Structure from Execution" by Lin and Zhang. This is listed in the bibliography as [34], but where is it cited and related to? This work uses dynamic taints to identify individual elements (fields) in the input and infer _hierarchies_ between these elements (= parse trees). They thus address the problem (§2) of an instruction like `byte_get()` parsing multiple fields by introducing a rule that is specific to each caller - say, a `header` element deriving into `byte_get`, and a `checksum` element also deriving into `byte_get`.
The more recent AUTOGRAM and MIMID approaches build on these insights to compose full-fledged _grammars_ from taints and induced parse trees, by introducing a rule that is specific for each caller, say
```
<header> ::= <byte_get> <byte_get>
<checksum> ::= <byte_get> <byte_get>
...
```
This makes the first contribution of AIFORE, namely to "learn which input bytes are processed by each basic block" (§1), much less novel; these works have to be related to, and the contribution has to be reframed in line with the state of the art. Note that these works focus on extraction of context-free grammars, and hence do not address the handling of context-sensitive features, such as field lengths, checksums, offsets, or relationships; however, besides fields, they also easily identify enumerations, magic numbers, and string literals. Also note that MIMID refines grammars by means of additional test generation, which might give it an additional advantage.
AUTOGRAM is published at:
```bibtex
@inproceedings{Hoschele2017,
  author    = {H{\"o}schele, Matthias and Zeller, Andreas},
  title     = {Mining Input Grammars with {AUTOGRAM}},
  booktitle = {2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C)},
  pages     = {31--34},
  year      = {2017},
  doi       = {10.1109/ICSE-C.2017.14}
}
```
An implementation is available at https://www.fuzzingbook.org/html/GrammarMiner.html
MIMID is published at:
```bibtex
@inproceedings{Gopinath2020,
  author    = {Gopinath, Rahul and Mathis, Bj{\"o}rn and Zeller, Andreas},
  title     = {Mining Input Grammars from Dynamic Control Flow},
  booktitle = {Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020)},
  pages     = {172--183},
  year      = {2020},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/3368089.3409679},
  url       = {https://doi.org/10.1145/3368089.3409679}
}
```
A replication package is available at https://giters.com/vrthra/mimid
Other notable works include Bastani's GLADE (PLDI 2017) and REINAM (ESEC/FSE 2019), as well as [Aschermann's NAUTILUS](https://www.ndss-symposium.org/ndss-paper/nautilus-fishing-for-deep-bugs-with-grammars/) (NDSS 2019).
An experimental comparison is not needed (again, we're talking context-free grammars vs. binary inputs here), but a _conceptual discussion_, also with regard to novelty, is. Given that the paper claims that its novel identification of field boundaries is one of its three central contributions, the paper cannot be accepted without re-evaluating that claim in the light of related work.
### Using CNNs to infer field types
Using ML techniques to infer field types is novel, and an original contribution of this work. Yet, let me point out some questions:
* Does the CNN only predict one of the six semantic types, as listed in §3.2.1, or does it also predict additional information (say, the actual offset for the `offset` field type, or the actual magic number for the `magic number` field type)?
* If a field does not match any of these types, what will be the consequence? Did this occur for the file formats studied? Can the approach be extended for other field types, and how would this impact performance?
### Analyzing field relations
The engineering choices made here are reasonable, but may be overly specific to the programs analyzed.
* To which extent did evaluation results influence the design of this analysis? Are there input formats that have been evaluated only after the design was complete and unchanged? Which ones?
* When you blame other approaches for having hard-coded constraints (Tupni in §2), how are your field types and relationships not hard-coded?
### Evaluation
In general, the evaluation does a good job. The additional coverage obtained shows that the additional input format information helps during fuzzing. Commenting on the actual reasons why particular bugs could be found by AIFORE is appreciated.
* Given the several engineering choices, it is not clear which of these have contributed to the outcome, and how. In lieu of a detailed analysis (which would not fit within the page limits), the best way to counter this issue is to make the tool and data (say, the resulting pits) available.
* Note that the [FormatFuzzer](https://github.com/uds-se/FormatFuzzer) project sports 010 Editor binary templates that have been manually extended for use in fuzzing; these might be useful as additional ground truth.
### Minor issues
* LaTeX math fonts are not designed for multi-letter identifiers (such as $Feature_{offset}$ in Algorithm 2); wrap them in `\textit{...}` (see the snippet after this list).
* §5.3: missing space after `ProFuzzer [49],`
* §5.3.1: AIFORE archives the best performance -> achieves
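A minimal sketch of the suggested identifier fix:

```latex
% Multi-letter identifiers set as text rather than as a product of symbols:
$\textit{Feature}_{\textit{offset}}$  % instead of $Feature_{offset}$
```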
Questions for authors’ response
-------------------------------
The paper needs to be revised, so responses alone do not suffice to sway me into recommending acceptance. The authors can, however, indicate how they plan to revise the paper, notably by addressing the concerns listed above.
Does the paper raise ethical concerns?
--------------------------------------
1. No
Review B
===========================================================================
Overall merit
-------------
2. Weak reject
Reviewer expertise in the topic domain
--------------------------------------
4. Expert (I've worked on this topic domain or a closely related topic
domain)
Reviewer confidence
-------------------
4. Confident
Paper summary
-------------
The authors present AIFORE, a fuzzer that uses a rule-based approach and a
machine-learning-based approach to identify key fields, their semantic types,
and their relations in program input. AIFORE is evaluated on a set of 14 formats
and demonstrates obvious improvement over existing fuzzers.
Strengths / Reasons to accept
-----------------------------
- The research problem is important.
- The solution is novel and non-trivial.
Weaknesses / Reasons to reject
------------------------------
- Some key details about the machine learning module are missing.
- Some data in the evaluation section seem unreasonable.
- The presentation can be improved.
Constructive comments for author
--------------------------------
While I like this paper and the approach that the authors present, I am a bit
disappointed at how the machine learning module is presented and how it is
evaluated. Specifically, there are a few key problems with the description of
the CNN model:
- It is unclear what the training set really is. The authors stated in the paper
that they "collect 10,582 field records." But, the authors did not explain
anywhere prior to this point what a field record means. Inferred from the
context, I believe a field record is a field accessed by the program under test
with a specific input case. If my understanding is correct, it is unknown to
readers how many formats, how many programs, or how many test cases were used to
generate these field records.
- I like the fact that the authors acknowledged the importance of generalization
of their trained model and attempted to evaluate the performance of their model
on formats and programs on which the model was not trained. Unfortunately, given
that the authors did not specify on what formats or what programs the training
set was created, it is difficult for me to evaluate or understand how the two
experiments mentioned on Page 8 were conducted. More specifically, I assume
Table 1 was the training set (or training set + validation set) for training the
CNN model. But this assumption does not make sense because (a) not all programs
in Table 3 appear in Table 1, which contradicts the authors' claim that
"we collect some new file formats supported by the programs in the training
set," and (b) some formats or programs in Table 4 do exist in Table 1, which
contradicts the authors' claim that "We apply the aforementioned trained model
to analyze unseen formats and untrained programs." Since the data presented in
various tables do not seem to match, I really cannot verify the authors' claims
about the CNN model.
- Ultimately, I believe a correct experiment that evaluates the performance of
the CNN model should be conducted on a set of new formats and new programs. This
test set should have a reasonable size (for example, 10 new formats + new
programs).
- Another problem is that the description in "Target File Formats and Programs"
is unclear. The authors mentioned 12 file formats and 2 network protocols. But I
only see 12 file formats in Table 1, and I am not sure what the two network
protocols were. Even worse, the authors did not even explicitly state that the
formats in Table 1 were the same 12 file formats mentioned earlier. I had to
make an educated guess (and due to data inconsistencies, I am not even sure my
guess is correct).
Other issues
--------------------------------
- The authors evaluated the performance of AIFORE against several other fuzzers.
However, it is unknown what the difference is between AIFORE and plain AFL (or
any other fuzzers). I think it is important to evaluate AIFORE against its base
fuzzer (AFLSmart) to truly understand how much fuzzing performance gain is
coming from the approaches that are specific to AIFORE.
- It is unclear how test programs were compiled (or were they compiled at all?
Was it binary-based fuzzing or source-based fuzzing?) or how seeds were picked
during the evaluation. The authors did not provide sufficient information in the
paper for readers to reproduce the experiment results.
- Does AIFORE require source code or does it work directly on binaries? If the
latter, I think the authors should evaluate AIFORE on some binary-only targets.
- For the magic number in the ELF header, why must those four bytes be
classified into one field? Why can't they be viewed as four separate fields? I
feel the paper lacks a clear definition of "input fields" in the context of
improving fuzzing performance at its beginning (see the sketch after this list).
- Page 1. What is a pit template? Please explain before using the term.
- Page 3. "The second model predicates" predicates -> predicts
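To illustrate the question about the ELF magic number above (a hypothetical sketch, not code from any of the evaluated programs): the same four bytes can be consumed by one 4-byte comparison or by four 1-byte comparisons, which is what determines whether they look like one field or four at basic-block granularity.

```python
ELF_MAGIC = b"\x7fELF"

def check_magic_single(data: bytes) -> bool:
    # One 4-byte comparison in a single basic block: offsets 0-3 are
    # used together, so they appear as one field.
    return data[:4] == ELF_MAGIC

def check_magic_bytewise(data: bytes) -> bool:
    # Four separate 1-byte comparisons: each byte could just as well be
    # treated as its own field.
    return (data[0] == 0x7F and data[1] == ord("E") and
            data[2] == ord("L") and data[3] == ord("F"))
```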
Questions for authors’ response
-------------------------------
- Provide enough details about the setup of the model training phase.
Does the paper raise ethical concerns?
--------------------------------------
1. No
Review C
===========================================================================
Overall merit
-------------
2. Weak reject
Reviewer expertise in the topic domain
--------------------------------------
4. Expert (I've worked on this topic domain or a closely related topic
domain)
Reviewer confidence
-------------------
4. Confident
Paper summary
-------------
The authors present AIFORE, an automatic input format reverse engineering approach. AIFORE has 3 modules. The first one computes field boundaries by using dynamic taint tracking to monitor how a program processes an input and identifies the offsets of the tainted input bytes used by each basic block to determine field boundaries. The second one uses a Convolutional Neural Network (CNN) model to identify field semantics. The third one identifies field relationships by examining memory accesses and comparisons between tainted fields. AIFORE produces a (pit) grammar file that is used as input for the existing AFLSmart grammar-based fuzzer. The authors evaluate AIFORE on 12 file formats (14 programs), measuring the accuracy. They compare format extraction with ProFuzzer, AFL-Analyze, and TIFF. They evaluate AIFORE for fuzzing and compare with other "format-aware" (TIFF, ProFuzzer, WEIZZ, Intriguer) and "non-format-aware" (AFL, AFLFast) fuzzers, showing improved coverage and discovering 10 bugs.
Strengths / Reasons to accept
-----------------------------
+ Input format reverse engineering is an important problem for a variety of security applications including fuzzing
+ Using a CNN for field semantics inference is an interesting idea
+ The paper demonstrates the use of their inferred format for fuzzing
Weaknesses / Reasons to reject
------------------------------
- Little contribution over existing work
- Limited format reversing scope: no support for hierarchical fields, text formats, bit-level fields, no description of how formats extracted for different inputs are merged, no message type field inference, no state-machine inference (for protocols), ...
- The generated grammar will have little coverage due to being derived from 5 inputs, which is a problem for fuzzing where many different formats are eventually explored in long runs
- Wrong evaluation: the accuracy comparison should be with input format reverse-engineering tools, not with format-aware fuzzers. The comparison for fuzzing should be between different grammars (e.g., AIFORE-extracted, Peach/AFLSmart) for the same fuzzer (AFLSmart).
Constructive comments for author
--------------------------------
I like the problem addressed and that the authors have built a system and performed extensive evaluation on it. However, I believe the evaluation is wrongly focused, as I will explain below; the contributions over prior work are not clear (your comparison with earlier work is inaccurate in multiple points); and the scope of the reversing is pretty limited. Overall, I do not think this paper is ready for publication.
**Fuzzing focus**: This paper is about input format reversing, not about smart fuzzing. What I mean is that the authors do not propose any fuzzing contributions. AIFORE recovers an input format grammar and then feeds it to the existing AFLSmart. This is the same setup already used by works like Prospex over a decade ago, which reversed the grammar and used Peach to demonstrate its value for fuzzing. I do not say that as a limitation, but rather because you do not position the paper properly, which leads to what I believe is a wrong evaluation. In contrast, smart fuzzers (partially) learn the input format during fuzzing and incorporate that information into the fuzzing loop to improve coverage and explore rare paths, but they do not output the inferred grammar. In fact, the point of ProFuzzer was that it did not need to have a previously available grammar.
**Wrong evaluation**: Your evaluation focuses on showing that smart fuzzers are worse than AIFORE, but their goal is different. They do not require a grammar to be available a priori, but they do integrate the input recovery with the fuzzing mutations to improve the fuzzing, which your work does not (the use of your recovered grammar is up to AFLSmart). Right now it seems that you want to show that smart fuzzing (meaning recovery and fuzzing integrated in the same process) is worse than grammar-based fuzzing (e.g., Peach-style) with a grammar obtained a priori. But that would be the contribution of Peach/Spike, not of your approach. Besides, AFLSmart's point was precisely the opposite, i.e., that smart fuzzing, due to its integration of the field format in the fuzzing, could outperform grammar-based fuzzers like Peach. Instead, you should compare with prior input format reverse engineering approaches and show that you can output a better grammar. Then, you should evaluate different grammars for the same fuzzer (e.g., AFLSmart) to show that your grammar is better (or at least good enough). Since AFLSmart already has grammars for most of the input formats you evaluate (e.g., GIF, PCAP, ELF, WAV), you should include those in the evaluation to show how your automatically generated grammars compare with the manually generated ones. Furthermore, you should also include programs that handle unknown inputs in the evaluation; otherwise it is hard to make the point of the need for your approach. This has already been done in many previous papers (e.g., protocol reversing on malware C&C protocols). If the input format specification is available and well documented, as is the case for many of the input formats you analyze, then it is hard (although possible) for an automated approach to beat a manually generated grammar such as those AFLSmart already has. Currently, your evaluation does not demonstrate to me that your approach indeed improves the fuzzing over the state of the art.
**Limited scope**: For a format reversing approach I find the scope quite limited. You do not address any of the issues that previous approaches cannot handle (e.g., bit-level fields), and you do not address many issues that previous works have already addressed. For example, Wondracek et al. already showed that the input format needs to be hierarchical to handle nested structures, but it does not seem your recovered format is hierarchical, i.e., handles field nesting. Similarly, Howard (NDSS 2011) showed how to analyze memory accesses and loops for array/string recovery, but you do not cover that. Autoformat proposed techniques for text-based formats, but you do not support text formats. Prospex showed how to combine formats recovered from multiple executions (possibly contradictory) and how to infer protocol state machines, but you do not cover that. Other works have shown how to handle encrypted protocols (Input Generation Via Decomposition and Re-Stitching: Finding Bugs in Malware). While I do not expect you to cover everything, incorporating some prior techniques to broaden the scope would strengthen the paper.
**Unclear contributions**: I fail to see any contributions in the field boundary analysis and field relation analysis. The one contribution I see is using a CNN for field semantics inference, but I would like a more focused evaluation on that aspect, one that really shows the improvements and that evaluates the minimum training set size needed. Regarding the field boundary identification, I do not find the intuition of using basic blocks very strong. The only case you mention where the intuition should work better is when a 4-byte comparison is broken into four comparisons. But Polyglot already handled such comparisons in a special way, and also handled separators. Are there any other cases where basic-block comparison helps? On the other hand, you already show cases where the BB-based approach fails, e.g., consecutive blocks, loops. In particular, loops have been leveraged before for input format reversing (e.g., Howard). Regarding relation analysis, prior work already identifies length/offset fields and their targets; what is new there? And the need for linear comparison is not clear (see next point).
**Integration with AFLSmart**: It would be important to highlight how AFLSmart benefits from the recovered information, i.e., how AFLSmart leverages the information in the pit grammar for the fuzzing. In particular, I do not know why you need to recover linear comparisons. How does AFLSmart leverage those?
**Protocol reversing**: If I understand correctly, your comparison for protocol reversing with PI and Polyglot is based on paper and pencil, i.e., comparing with the results presented in those papers without having re-implemented them. Is that so? In that case, I do not think it is a fair evaluation and recommend removing it from the paper.
**Static vs Dynamic**: You do not explain which steps in your approach are static. For example, you mention that you recover loops, but it is not clear whether you do that from the trace (which can be incomplete and, for example, miss loop exit points) or statically from the program binary. You mention using Angr for static analysis, but I do not think you have described any such analysis. You also mention obtaining format strings in §3.2.2; is that done statically or on the trace? Please clarify which steps (if any) are done statically.
**Related work**: I find that your comparison with prior work has multiple errors. For example, you seem to assume that Polyglot handles fields used by the same instruction in a static way (instruction-address based), but it actually does so in a dynamic/trace-based way (per instruction instance). More concretely, if the same instruction appears multiple times in the trace and each time it handles different fields (e.g., a get_byte function), each invocation is handled separately and the different fields are recovered. You need to re-read that work and completely change the references in the text. In general, I found many claims on related work too loose or inaccurate. You need to revisit related works and make the comparison stricter. Unfortunately, I cannot get into every sentence or inaccurate reference.
**Overtainting**: A well-known limitation of tainting approaches is over-tainting. Do you apply any steps to mitigate this?
**Other evaluation issues**: You should focus on Top-1 evaluation; Top-2 is kind of cheating, since it is unclear how AFLSmart can leverage a Top-2 result. Also, you should examine the evaluation errors (FPs, FNs) and the causes behind them. Otherwise, there is little to learn. Also, I find your reversing of the PCAP format confusing. You do not seem to be reversing the PCAP format per se (i.e., how the network traffic is encapsulated in the trace) but rather the format of the Ethernet/IP/TCP traffic stored in it. That does not make sense, since the pcap traces used as input could carry different protocols.
**Code release**: You do not mention code release, but for this kind of system it would support the paper nicely. Unfortunately, too many prior reverse engineering techniques for binaries are not publicly available.
**Editorial**:
- The citation in the paper for field type inference should be [35], which predates TIFF by many years.
- I find your terminology quite confusing. You introduce new terms that are not defined, such as "indivisible field", "long input field", and "program variable type".
Does the paper raise ethical concerns?
--------------------------------------
1. No