---
tags: decompiler
title: Other Errors
---

# Other errors

| Other Error | Count | Status |
| ----------- | ----- | ------ |
| Reading file error (throw) | 3 | :heavy_check_mark: |
| Corrupted files (throw) | 209 | :heavy_check_mark: |
| Tokenization errors (throw) | 129 | :warning: |
| Parse errors w/o .py (explicit) | 267 | :heavy_check_mark: |
| Internal grammar rule-bug (explicit) | 844 | :heavy_check_mark: |
| Python 3.9 | 21050 - (3045) - 840 | :warning: |

Breakdown of other errors:

1) **Reading file error** (3): Trivial; the files were deleted by antivirus on my end (my bad).
2) **Bytecode is corrupted** (209): These really are corrupted files, confirmed via `dis`.
3) **Conversion to their own tokens** (129): These files disassemble successfully with `dis`, but fail because the decompiler expects certain bytecode patterns. Theoretically fixable, but it would take a lot more time, since we would need to instrument the decompiler to find the failure locations. The positions in the code where it fails vary, but all are in the same tokenization stage.
4) **Parse errors without .py output** (267): Genuine parse errors that produce no `.py` output; only the filenames are printed. Relatively easier to fix than the previous category, since the filenames let us look up the source files.
5) **Internal grammar rule-bug** (844): These have `.py` output but fail while finalizing the final Python code. They effectively still don't point to where they fail, so they behave like implicit errors.
6) **Parse errors that we focus on** (11029): These are the ones we are working on; this is the final level of error.
7) **Python 3.9 code** (21050): These are also already being covered.

## Corrupted files

All files were run under `dis` to confirm whether they were corrupted. In addition, the headers were rewritten for different Python versions and `dis` was re-tried, to rule out a plain header (magic number) mismatch, but with no success.
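The corruption check is roughly the sketch below (a minimal version, assuming the 16-byte `.pyc` header layout of Python 3.7+; `check_pyc` and `rewrite_magic` are illustrative helper names, not part of uncompyle6 or the existing scripts):

```python
import dis
import importlib.util
import marshal

PYC_HEADER_SIZE = 16  # Python 3.7+: 4-byte magic, 4-byte flags, 8 bytes of source mtime/hash info


def check_pyc(pyc_path):
    """Return True if the .pyc body unmarshals to a code object that dis accepts."""
    with open(pyc_path, "rb") as f:
        data = f.read()
    try:
        code = marshal.loads(data[PYC_HEADER_SIZE:])
        dis.dis(code)  # raises if the bytecode itself is damaged
        return True
    except Exception as exc:  # ValueError/EOFError from marshal, bad opcodes, etc.
        print(f"{pyc_path}: {exc!r}")
        return False


def rewrite_magic(pyc_path, out_path, magic=importlib.util.MAGIC_NUMBER):
    """Write a copy of the .pyc with its 4-byte magic number replaced (by default
    the running interpreter's), so version-aware tools re-read it as that version."""
    with open(pyc_path, "rb") as f:
        data = f.read()
    with open(out_path, "wb") as f:
        f.write(magic + data[4:])
```

Note that `marshal` and `dis` always use the running interpreter's format, so the magic swap only changes how version-aware tools such as uncompyle6/xdis classify the file; confirming the bytecode under another version still means re-running `dis` with that interpreter.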
## `Tokenization errors`

### Key error (out of 126):

- 123 (all Python 3.8): `jump_back_index = self.offset2tok_index[jump_target] - 1` - [ref](https://github.com/rocky/python-uncompyle6/blob/c7ebdb344be0ceb938b764b6b81a6a3af9913f27/uncompyle6/scanners/scanner38.py#L108)
- 6 (all Python 2.7): `j = self.offset2inst_index[offset]` - [ref](https://github.com/rocky/python-uncompyle6/blob/f6f0e344d02925630e4b5c78a36ef8144dd78938/uncompyle6/scanners/scanner2.py#L400)

### Other tokenization errors:

- `AssertionError: at ifpoplaststmtl[2], expected 'c_stmts' node; got 'pass'`: 21
- `IndexError: list index out of range`: 11
- `AssertionError: set_comp_func (8)`: 2
- `IndexError: pop from empty list`: 2

All of these fall under code errors.

## `Internal grammar error`

### Common trends

- All errors involve a `break` statement.
- All occur in Python 3.8 bytecode.
- Total of 844.
- All have `.py` output.

### Summary trends

1. `while` loop is nested in an `elif` block and has a `break`
2. `while` loop has `1` (or any constant value) as its condition and has a `break`
3. Multiple conditions and nested `for` loops, with a `break` in the outer loop (does not lose any loops)
4. `for` loop is nested in an `elif` block and has a `break`
5. Large codebase with multiple loops and one `pass` in an `elif` block causes the decompiler to add a `break`
6. `for` loop with two `continue`s and one `break` in the `else` block of a `try/except/else`
7. Implicit error leads to pattern 1
8. Just multiple `for` loops in `elif`

### Examples

Example 1:

```python
def _task_get_stack(task, limit):
    if a:
        z = z  # frames.reverse()
    elif b:
        while tb is not None:
            if limit <= 0:
                break
```

Output:

```python
def _task_get_stack(task, limit):
    if a:
        z = z
    else:
        if b:
            if tb is not None:
                if limit <= 0:
                    break
# NOTE: have internal decompilation grammar errors.
# Use -t option to show full context.
# not in loop:
#     break
# L. 10        30  BREAK_LOOP       34  'to 34'
```

Solution: use a transformation keyword

```python
def _task_get_stack(task, limit):
    if a:
        z = z  # frames.reverse()
    elif b:
        while tb is not None:
            if limit <= 0:
                FET_break()
```

or break up the `elif`:

```python=
def _task_get_stack(task, limit):
    if a:
        z = z  # frames.reverse()
    if b and not a:
        while tb is not None:
            if limit <= 0:
                break
```

Example 2:

```python
def _fix_exception_context(new_exc, old_exc):
    # Context may not be correct, so find the end of the chain
    while 1:
        exc_context = new_exc.__context__
        if exc_context is old_exc:
            # Context is already set correctly (see issue 20317)
            return
        if exc_context is None or exc_context is frame_exc:
            break
        new_exc = exc_context

    # Change the end of the chain to point to the exception
    # we expect it to reference
    new_exc.__context__ = old_exc
```

Output:

```python
def _fix_exception_context(new_exc, old_exc):
    exc_context = new_exc.__context__
    if exc_context is old_exc:
        return
    else:
        if not exc_context is None:
            if exc_context is frame_exc:
                break
            new_exc = exc_context
    new_exc.__context__ = old_exc
# NOTE: have internal decompilation grammar errors.
# Use -t option to show full context.
# not in loop:
#     break
# L.  9        34  BREAK_LOOP       42  'to 42'
```

Solution:

```python
def _fix_exception_context(new_exc, old_exc):
    # Context may not be correct, so find the end of the chain
    tmp = 1
    while tmp:
        exc_context = new_exc.__context__
        if exc_context is old_exc:
            # Context is already set correctly (see issue 20317)
            return
        if exc_context is None or exc_context is frame_exc:
            break
        new_exc = exc_context

    # Change the end of the chain to point to the exception
    # we expect it to reference
    new_exc.__context__ = old_exc
```

> Note: This leads to another implicit error, but we will count that as a separate error.
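Since the reproducers above are small, a throwaway harness makes it easy to check which source-level rewrite avoids the grammar error. A sketch, under the assumptions that the `uncompyle6` CLI is on `PATH`, that it runs on the same Python 3.8 used to produce the bytecode, and that failures carry the `# NOTE: have internal decompilation grammar errors.` text shown in the outputs above (`triggers_grammar_error` is a hypothetical helper):

```python
import os
import py_compile
import subprocess
import tempfile
import textwrap


def triggers_grammar_error(source):
    """Compile `source` with the running interpreter, decompile the .pyc with
    uncompyle6, and report whether the output carries the grammar-error note."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "snippet.py")
        with open(src, "w") as f:
            f.write(textwrap.dedent(source))
        pyc = py_compile.compile(src, cfile=os.path.join(tmp, "snippet.pyc"), doraise=True)
        result = subprocess.run(["uncompyle6", pyc], capture_output=True, text=True)
        return "internal decompilation grammar errors" in (result.stdout + result.stderr)
```

Feeding the original and rewritten sources from these examples through this should separate the failing patterns from their workarounds without hand-running the decompiler each time.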
Example 3:

```python=
def parse_parts(self, parts):
    parsed = []
    sep = self.sep
    altsep = self.altsep
    drv = root = ''
    it = reversed(parts)
    for part in it:
        if not part:
            continue
        if altsep:
            part = part.replace(altsep, sep)
        drv, root, rel = self.splitroot(part)
        if sep in rel:
            for x in reversed(rel.split(sep)):
                if x and x != '.':
                    parsed.append(sys.intern(x))
        else:
            if rel and rel != '.':
                parsed.append(sys.intern(rel))
        if drv or root:
            if not drv:
                for part in it:
                    if not part:
                        continue
                    if altsep:
                        part = part.replace(altsep, sep)
                    drv = self.splitroot(part)[0]
                    if drv:
                        break
            break
    if drv or root:
        parsed.append(drv + root)
    parsed.reverse()
    return drv, root,
```

Example 4:

```python=
def find_module(self, fullname, path=None):
    if fullname in self.toc:
        z = z
    elif path is not None:
        z = z
        for p in path:
            if not p.startswith(SYS_PREFIX):
                continue
            p = p[SYS_PREFIXLEN:]
            parts = p.split(pyi_os_path.os_sep)
            if not parts:
                continue
            if entry_name in self.toc:
                break
    return module_loader
```

Output:

```python
def find_module(self, fullname, path=None):
    if fullname in self.toc:
        z = z
    else:
        if path is not None:
            z = z
            for p in path:
                if not p.startswith(SYS_PREFIX):
                    pass
                else:
                    p = p[SYS_PREFIXLEN:]
                    parts = p.split(pyi_os_path.os_sep)
                    if not parts:
                        pass
                    elif entry_name in self.toc:
                        break
    return module_loader
```

Solution:

```python
def find_module(self, fullname, path=None):
    if fullname in self.toc:
        z = z
    if path is not None and not fullname in self.toc:
        z = z
        for p in path:
            if not p.startswith(SYS_PREFIX):
                continue
            p = p[SYS_PREFIXLEN:]
            parts = p.split(pyi_os_path.os_sep)
            if not parts:
                continue
            if entry_name in self.toc:
                break
    return module_loader
```

Example 5: [Link](https://github.com/numpy/numpy/blob/main/numpy/lib/arraypad.py#L806)

Solution: No solution.

Example 6:

```python
def process_listeners(self, listener_type, argument, result):
    removed = []
    for i, listener in enumerate(self._listeners):
        if listener.type != listener_type:
            continue
        future = listener.future
        if future.cancelled():
            removed.append(i)
            continue
        try:
            passed = listener.predicate(argument)
        except Exception as exc:
            future.set_exception(exc)
            removed.append(i)
        else:
            if passed:
                future.set_result(result)
                removed.append(i)
                if listener.type == ListenerType.chunk:
                    break
```

Output =>

XXXX: anything loses information in the decompiler.
Example:

```python
def process_listeners(self, listener_type, argument, result):
    removed = []
    for i, listener in enumerate(self._listeners):
        if listener.type != listener_type:
            continue
        future = listener.future
        if future.cancelled():
            removed.append(i)
            continue
        try:
            passed = listener.predicate(argument)
        except Exception as exc:
            future.set_exception(exc)
            removed.append(i)
        else:
            if passed:
                future.set_result(result)
                removed.append(i)
                if listener.type == ListenerType.chunk:
                    tmp = 'break'
                    if tmp == 'break':
                        break
```

Example 7:

```python
def determineEncoding(self, chardet=True):
    # "likely" encoding
    charEncoding = lookupEncoding(self.likely_encoding), "tentative"
    if charEncoding[0] is not None:
        return charEncoding

    # Guess with chardet, if available
    if chardet:
        try:
            from chardet.universaldetector import UniversalDetector
        except ImportError:
            pass
        else:
            buffers = []
            detector = UniversalDetector()
            while not detector.done:
                buffer = self.rawStream.read(self.numBytesChardet)
                assert isinstance(buffer, bytes)
                if not buffer:
                    break
                buffers.append(buffer)
                detector.feed(buffer)
            detector.close()
            encoding = lookupEncoding(detector.result['encoding'])
            self.rawStream.seek(0)
            if encoding is not None:
                return encoding, "tentative"

    # Try the default encoding
```

converts to =>

```python
def determineEncoding(self, chardet=True):
    charEncoding = (lookupEncoding(self.likely_encoding), 'tentative')
    if charEncoding[0] is not None:
        return charEncoding
    elif chardet:
        try:
            from chardet.universaldetector import UniversalDetector
        except ImportError:
            pass
        else:
            buffers = []
            detector = UniversalDetector()
            if not detector.done:
                buffer = self.rawStream.read(self.numBytesChardet)
                assert isinstance(buffer, bytes)
                if not buffer:
                    break
                buffers.append(buffer)
                detector.feed(buffer)
            else:
                detector.close()
                encoding = lookupEncoding(detector.result['encoding'])
                self.rawStream.seek(0)
                if encoding is not None:
                    return (encoding, 'tentative')
```

The `elif chardet:` turns this into pattern 1, which causes the same issue.

Example 8:

```python
def tokens(self, event, next):
    kind, data, _ = event
    if kind == START:
        tag, attribs = data
        name = tag.localname
        namespace = tag.namespace
        converted_attribs = {}
        for k, v in attribs:
            if isinstance(k, QName):
                converted_attribs[(k.namespace, k.localname)] = v
            else:
                converted_attribs[(None, k)] = v

        if namespace == namespaces["html"] and name in voidElements:
            for token in self.emptyTag(namespace, name, converted_attribs,
                                       not next or next[0] != END or next[1] != tag):
                yield token
        else:
            yield self.startTag(namespace, name, converted_attribs)

    elif kind == END:
        name = data.localname
        namespace = data.namespace
        if namespace != namespaces["html"] or name not in voidElements:
            yield self.endTag(namespace, name)

    elif kind == COMMENT:
        yield self.comment(data)

    elif kind == TEXT:
        for token in self.text(data):
            yield token

    elif kind == DOCTYPE:
        yield self.doctype(*data)

    elif kind in (XML_NAMESPACE, DOCTYPE, START_NS, END_NS,
                  START_CDATA, END_CDATA, PI):
        pass

    else:
        yield self.unknown(kind)
```

Causes:

```python
def tokens(self, event, next):
    kind, data, _ = event
    if kind == START:
        tag, attribs = data
        name = tag.localname
        namespace = tag.namespace
        converted_attribs = {}
        for k, v in attribs:
            if isinstance(k, QName):
                converted_attribs[(k.namespace, k.localname)] = v
            else:
                converted_attribs[(None, k)] = v
        else:
            if namespace == namespaces['html'] and name in voidElements:
                for token in self.emptyTag(namespace, name, converted_attribs, not next or next[0] != END or next[1] != tag):
                    yield token
            else:
                yield self.startTag(namespace, name, converted_attribs)
    else:
        if kind == END:
            name = data.localname
            namespace = data.namespace
            if namespace != namespaces['html'] or name not in voidElements:
                yield self.endTag(namespace, name)
        else:
            if kind == COMMENT:
                yield self.comment(data)
            else:
                if kind == TEXT:
                    for token in self.text(data):
                        yield token
                else:
                    if kind == DOCTYPE:
                        yield (self.doctype)(*data)
                    else:
                        if kind in (XML_NAMESPACE, DOCTYPE, START_NS, END_NS, START_CDATA, END_CDATA, PI):
                            break
                        else:
                            yield self.unknown(kind)
# NOTE: have internal decompilation grammar errors.
# Use -t option to show full context.
# not in loop:
#     break
# L. 40       354  BREAK_LOOP       368  'to 368'
```
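Beyond the explicit grammar-error note, several of these outputs silently drop a loop (in Example 7 the `while not detector.done:` comes back as an `if`). A rough round-trip check could flag that class of loss automatically; the sketch below assumes we still hold the original code object and recompile the decompiled text with the same 3.8 interpreter, and `back_edges`/`loses_loops` are hypothetical helpers, not uncompyle6 API:

```python
import dis


def back_edges(code):
    """Count loop back-edges (jumps to an earlier offset), recursing into
    nested code objects (functions, comprehensions, lambdas)."""
    n = sum(1 for ins in dis.get_instructions(code)
            if ins.opname.startswith("JUMP")
            and isinstance(ins.argval, int)
            and ins.argval < ins.offset)
    for const in code.co_consts:
        if hasattr(const, "co_code"):
            n += back_edges(const)
    return n


def loses_loops(original_code, decompiled_source):
    """Heuristic: the decompiled text either no longer compiles (e.g. a stray
    'break' outside any loop) or compiles to fewer back-edges than the original."""
    try:
        recompiled = compile(decompiled_source, "<decompiled>", "exec")
    except SyntaxError:
        return True
    return back_edges(recompiled) < back_edges(original_code)
```

Because `continue` also compiles to a backward jump, the count is only a triage signal, not a proof of equivalence.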