---
title: PYFET overall
tags: decompiler
---

# Queries:
- Decompyle++ and unpyc37 report no offsets. How to tackle that?
    - State that root node
- Decompyle++ has no implementation for loops and try/except in Python 3.8/3.9
- Run experiment for instrumentation and removal
    - Run tonight
    - Add results tomorrow
- FETs applied does not include R0. How to deal with R0?
    - Renamed R0 -> R16 ✅(Chijung)
- Update blocks iterated.
- Adding case study from appendix.
- Feedback on root cause analysis.
- Feedback on correctness

# Tasks:
1. ✅(Chijung) **[Completed]**
   ![](https://i.imgur.com/0m4QVT3.png)
   too small
   Change: Average Type of => Average # of
   Change: Average Error Location in All Decompilers => Average # of Errors in Code Location
   fig1. font too small in 3 numbers -> consistency
   - [data](https://docs.google.com/spreadsheets/d/1i3dRGD0GWnQ9OlN7ajnxxfHSSFn5N_yVIgcSytjc3CA/edit#gid=1708920927)
2. ✅(Chijung) **[Completed]**
   ![](https://i.imgur.com/QixLqsG.png)
   Change: five popular => five
   - CJ: there is more text like this, as I remember.
3. ✅(Chijung) **[Completed]**
   ![](https://i.imgur.com/VgIa7d5.png)
   Make the transformed instructions red
4. ![](https://i.imgur.com/PZfIjio.png)
   Move this to main text (merge with root cause)
   Issue: Conflicts with "Logically Different Decompiled Source Code" study.
5. ✅(Chijung) **[Completed]**
   Remove superscript in eval table
6. How PyFET vs Decompiler:
   - Remove complexity of decompiler
   - Keep SLOC

   Fix bug:
   - did you try to fix a bug?
   - you need to try a few...
   - that's why I said that it will take time
   - did you have those?
   - it should have a high quality...
   - not just dumping useless numbers
   - what needs to be fixed
   - what are the concerns
   - what did you try
   - and what happened
   - how many iterations/errors caused by fixes...
7. ✅(Chijung) **[Completed]**
   ![](https://i.imgur.com/qxcENw0.png)
   font sizes are not consistent between figures...

---

Additional
- ✅ fig 12 font size
- ✅ text 5.4
- ✅ fig 13
- ✅ table 10: [data](https://docs.google.com/spreadsheets/d/1lWiTob6nIFrQFSZFpIHcUmtopqbEJNi0JVm1GPklqTQ/edit#gid=480676224)

---

# Message:
Hello everyone, we are planning to submit by the next submission deadline in response to the Major Revision (August 19). I plan to finish all core experiments by the end of July.

================================================================

The following new experiments are needed to meet the reviewers' requirements:
- ✅ Extending eval in Table 5 to other decompilers (Adding 3 new decompilers: pycdc, uncompyle2, and unpyc) - [Completed]
- ✅ Implicit errors on new decompilers - [July 21st]
- ✅ Ground truth study in Section 5 (use Python source and apply the transformation to argue differences in transformations) - July 21st
    - ✅ Download GitHub applications
    - ✅ Generate samples and transformations corresponding to each rule
    - ✅ Log stats from the experiment
- ✅ Root cause analysis in Section 1 (Categorizing errors by their nature in our dataset, e.g., which exact grammar causes the error) - July 28th (1 week)
- ✅ Study of the tradeoff between fixing the decompiler and using PyFET in Appendix - July 28th (1 week)

================================================================

In the first week of August, all of these experiment results will be added to the paper, which will then go through revisions.

Thanks.

# Tasks for next version of paper:
## Experiments [July end]
- 1. Extend eval:
    - [ ] Run uncompyle2 (Python 2.7), pycdc (all samples), unpyc (Python 3.3~3.4, 3.7) [Tuesday - Explicit errors]
        - [ ] uncompyle2
            - [x] Run
            - [x] explicit errors
            - [ ] implicit errors
        - [ ] unpyc37
            - [x] Run
            - [x] explicit errors
            - [ ] implicit errors
        - [ ] unpyc3.4
            - [ ] NA
        - [ ] pycdc
            - [x] Run
            - [x] explicit errors
                - [x] segmentation fault
                - [x] unsupported instructions
            - [ ] implicit errors
- 2.
  Ground truth study [End of this week]
    - [x] Download 10-100 Python applications
        - [ ] Study the diversity in code structure
            - [ ] Diversity in the code complexity (how diverse and large the code structure is)
            - [ ] Functions used
    - [x] Run decompiler to pinpoint errors
    - [ ] Resolve errors
    - [ ] Craft errors
        - [ ] Each rule should map to 30 (more than 50) source files
        - [ ] Include stats (# of lines of code changed, TP/FP, whether the error is localized (# of lines changed), # of nodes/edges changed in the CFGs, etc.)
        - [ ] Check for correctness, efficacy of regexes (more stats), and localization
- 3. Implicit error patterns
    - [ ] Add new implicit error patterns
    - [ ] Propagate changes to eval numbers
- 4. Run obfuscation methods
    - [ ] PjOrion
    - [ ] Adding
- 5. Root cause analysis of bugs
    - [ ] How deep?
    - [ ] Grammar level
- 6. Experiment for fixing decompiler vs running PyFET
    - [ ] 4-5 bugs story

## Changes in paper
- Update intro to incorporate changes in paper
    - Issues with decompiler - add detail and move decompiler study
    - Major root causes for errors (**see `5`**)
    - Fundamental design choices that make our tool better ("handling obfuscated bytecode within the decompiler would disrupt the current design of the decompiler significantly.") (**see `6`**)
- Add reasoning for choice of decompilers (Eval?)
- Update implicit error description to correct factual misunderstandings
    - not blindly replace errors
    - Type 4 issue
- Split eval of Py 3.9 binaries
    - Keep separate
- Justify not migrating Python 2.7~3.6 binaries to decompyle3
- Explain binary-level transformation for FET (i.e., the rules in the table are for presentation only)
- Discuss other obfuscation techniques
    - Add scope with PyFET not handling source-level obfuscation techniques
- Give breakdown for SETs/FETs
- Discussion with developers?
- Add future work for extensibility of PyFET
- Explain how FETs are derived
- Add discussion on errors introduced by the transformation
- Explicitly mention that we use different datasets for training and testing

![](https://i.imgur.com/m0avSxM.png)
![](https://i.imgur.com/4pJpar9.png)
![](https://i.imgur.com/e2aCKrx.png)

:::warning
:warning: Everything below is for the rebuttal period and can be ignored.
:::

## Obfuscators:
### Renaming variables
Renaming variable names to trick the decompiler
https://github.com/ZetaTwo/python-obfuscator/blob/master/obfuscator.py
Solution by uncompyle6: https://github.com/rocky/python-xdis/issues/58#issuecomment-609840742

# Tasks:
1. Yacc and lex, spark parser
   dying spark parser: https://github.com/rocky/python-uncompyle6/issues/188
3. summarize 1-2 sentences
   https://github.com/rocky/python-uncompyle6/wiki/Fixing-Issue-%23293-(Python-3.8-grammar-problem)
5. turnaround time for fix
   https://github.com/rocky/python-uncompyle6/issues/260 : June 10 2019 - Sept 5 2020
   https://github.com/rocky/python-uncompyle6/issues/213 : Jan 25 2019 - March 26 2019
   https://github.com/rocky/python-uncompyle6/issues/298 : 2 months
7. statistics on codebase
   uncompyle6:
   lines of code: 31389 (621 functions)
   parsers: 10260 (182 functions)
   decompyle3:
   lines of code: 22324 (455 functions)
   parser: 8849 (141 functions)

# Plan for review
1) Run preliminary experiments on PyPI packages with decompilation errors to strengthen our ground truth.
    - Get decompilation samples breakdown
    - Run rules
2) Cover more implicit errors to add to our evaluation.
    - CPython:
        - Get dis
        - Find patterns
        - Apply patterns
        - Get breakdown

# Review action items
- Impact of FET / side effects
    - Instructions impacted
    - Infer from blocks

# libraries
Python 3.4:
- 9 PyPI

Python 2.7:
- 18 CPython
- 91 PyPI

Python 3.6:
- 16 CPython
- 12 PyPI

Python 3.7:
- 151 CPython
- 102 PyPI

Python 3.8:
- 212 CPython (789)
- 390 PyPI (697)

Total CPython: 406 (983)
Total PyPI: 595 (911)

# Miscellaneous
- Obfuscated:
  obfuscator:
    - [pjorion](https://koreanrandom.com/forum/topic/15280-pjorion-%D1%80%D0%B5%D0%B4%D0%B0%D0%BA%D1%82%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D0%BD%D0%B8%D0%B5-%D0%BA%D0%BE%D0%BC%D0%BF%D0%B8%D0%BB%D1%8F%D1%86%D0%B8%D1%8F-%D0%B4%D0%B5%D0%BA%D0%BE%D0%BC%D0%BF%D0%B8%D0%BB%D1%8F%D1%86%D0%B8%D1%8F-%D0%BE%D0%B1%D1%84%D1%83%D1%81%D0%BA%D0%B0%D1%86%D0%B8%D1%8F-%D0%BC%D0%BE%D0%B4%D0%BE%D0%B2-%D0%B2%D0%B5%D1%80%D1%81%D0%B8%D1%8F-135-%D0%B4%D0%B0%D1%82%D0%B0-11082019/)
    - [bitboost](http://bitboost.com/#Python_obfuscator!&!)
- http://serge-sans-paille.github.io/talks/hack.lu-2014-10-21.html#/
- https://isc.sans.edu/forums/diary/Nicely+Obfuscated+Python+RAT/26680/
- https://www.mandiant.com/resources/deobfuscating-python

random read: https://media.defcon.org/DEF%20CON%2018/DEF%20CON%2018%20presentations/DEF%20CON%2018%20-%20RSmith-pyREtic.pdf

---

- Decompilers:
    - [Easy python decompiler](https://sourceforge.net/projects/easypythondecompiler/)
        - uses uncompyle2 and Decompyle++ and only supports up to Python 3.3
    - [Decompyle++](https://github.com/zrax/pycdc)
        - Supports everything up to Python 3.9
        - Python 2.7-3.3
        - Lightly maintained.
    - [uncompyle2](https://github.com/wibiti/uncompyle2)
        - Supports only Python 2.7
    - [unpyc3](https://code.google.com/archive/p/unpyc3/)
        - Tested only with Python 3.2
    - [unpyc3 fork](https://github.com/figment/unpyc3/)

---

Bugs and types of errors: https://github.com/rocky/python-uncompyle6/#known-bugsrestrictions

> his comment is not scientific, so try to think about a scientific reason rather than his implementation-oriented reasoning
> also there are his comments/blogs saying it is difficult to maintain the development of the decompiler, may need to summarize those and elaborate

---

C++/C direction

> for c/c++, honestly no one knows
> you just mention what are unique challenges,
> what are potential challenges that might be shared
> what are potential unknown challenges you didn't see in python but might be in c
> think in those directions

- The self-documenting part is gone in optimized release code. No variable names, no routine names, no class names, just addresses.
- In Python, these variable names and class names are preserved.

| Type | Python | C/C++ |
|------|--------|-------|
| Comments | Lost | Lost |
| Variable names | Preserved | Lost |
| Loops | Can be decompiled | Maybe unrolled |
| Functions | Preserved as code objects | Maybe re-arranged |
| Syntax | Recovered | Not recovered |

---

# Prof comments:
1. Focus on reviewer C
    a. root causes
    b. why two decompilers
    c. "- Lack of information from developers. Do you discuss with the developers about your findings (meaning, bugs in their software)? What's their response? Can they confirm and fix? If they fix, then why users still require your tool? In a nutshell, if minor implementation problems in decompilers account for the majority of failures, this work's overall contribution and technical difficulty may be compromised."
        - this is the critical point you need to focus on. it's not minor implementation problems. you may read the paper again and ask some questions if you are not sure how to get your answer.
    d.
       "This study does not convince me that the recommended strategy -- finding and patching decompiled outputs -- is the best course of action,"
        - we are not patching decompiled outputs...
    e. "Only two most well-known Python decompilers are taken into consideration."
        - list all the available decompilers
        - get an idea of how similar they are to each other
        - get an idea of their performance
        - compose the argument that the ones we chose are a superset of the others in terms of decompilation errors
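
The superset argument in the last item could be checked mechanically once each decompiler's failures are labeled by error category. The following is a minimal sketch, assuming hypothetical category labels and a hypothetical mapping from decompiler to categories; none of the data below comes from the experiments, and the helper names `uncovered_categories` and `is_superset` are made up for illustration.

```python
# Hypothetical error categories per decompiler. Replace with labels taken
# from the actual decompilation logs; nothing here is measured data.
errors_by_decompiler = {
    "uncompyle6": {"loop", "try_except", "with", "comprehension"},
    "decompyle3": {"loop", "try_except", "lambda"},
    "pycdc": {"loop", "try_except"},
    "uncompyle2": {"with", "comprehension"},
}


def uncovered_categories(errors, chosen):
    """For each non-chosen decompiler, return the categories the chosen set misses."""
    covered = set().union(*(errors[name] for name in chosen))
    return {
        name: cats - covered
        for name, cats in errors.items()
        if name not in chosen and cats - covered
    }


def is_superset(errors, chosen):
    """True if the chosen decompilers' errors cover every other decompiler's errors."""
    return not uncovered_categories(errors, chosen)


chosen = ["uncompyle6", "decompyle3"]
print(is_superset(errors_by_decompiler, chosen))
print(uncovered_categories(errors_by_decompiler, chosen))
```

Any non-empty entry returned by `uncovered_categories` names an error category that only shows up in a decompiler outside the chosen pair, which is exactly what a reviewer could raise as a counterexample.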