Thanks. We acknowledge that an LLM cannot fully replace human expertise in decompilation, much as GPT-4 has not achieved true human-like intelligence. However, the aim of our llm4decompile project is, in the same spirit as GPT-4, to offer assistance and enhance productivity in the decompilation process.
As for test suites, this is one of our project's main challenges: figuring out which functions satisfy the expectations of reverse engineers, how to autonomously produce high-coverage test suites, and how to objectively evaluate decompilation outcomes without relying solely on human judgment. Looking forward to your advice!
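One objective check in this spirit is re-executability: recompile the decompiled function and compare its behavior with the original over a test suite. A minimal sketch, where the "original" and the candidates are toy stand-ins for a compiled ground-truth function and recompiled decompiler outputs (not our actual harness):

```python
# Hedged sketch: judge a decompilation by behavior, not text similarity.
# The lambdas below are illustrative stand-ins for compiled functions.

def passes_test_suite(original, candidate, test_inputs):
    """True if the candidate matches the original on every test input."""
    return all(original(x) == candidate(x) for x in test_inputs)

original = lambda x: 2 * x + 1   # ground-truth behavior
good = lambda x: (x << 1) | 1    # different code, same behavior
bad = lambda x: 2 * x            # subtly wrong decompilation

print(passes_test_suite(original, good, range(100)))  # True
print(passes_test_suite(original, bad, range(100)))   # False
```

The catch, as noted above, is that the verdict is only as strong as the coverage of the test inputs.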
Thanks! The concern is how to uniformly uplift binary code from various architectures and configurations to a common IR such as RzIL. Is there a method to automate the disassembly process reliably across these different systems?
What do you mean? The Rizin code does all the hard parts: we add uplifting code for every architecture manually (you can see a list of supported architectures using `rz-asm -L` and check for the `I` letter, which means "IL"). You just need to call the necessary APIs; see, for example, how it's done in one of the integration tests [1]. As for using Rizin from Python, we have rz-bindgen [2][3].
Ideally, with a substantial dataset of obfuscated JavaScript and the corresponding original code, a language model could potentially make good predictions. The first key difficulty, however, is collecting such a large-scale dataset and setting up a system for automatic compilation and for segmenting out the binary-source pairs.
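The pairing step can be sketched simply once both sides are keyed by symbol name. A hypothetical sketch (real pipelines would extract symbols with tools like `nm` or `objdump`; here the inputs are given, and the byte strings are fabricated for illustration):

```python
# Hedged sketch: align source functions with compiled functions by symbol
# name, keeping only symbols present on both sides of the build.

def pair_by_symbol(source_funcs: dict, binary_funcs: dict):
    """Return (symbol, source, binary) triples for symbols in both maps."""
    shared = source_funcs.keys() & binary_funcs.keys()
    return [(name, source_funcs[name], binary_funcs[name])
            for name in sorted(shared)]

src = {"add": "int add(int a, int b) { return a + b; }",
       "sub": "int sub(int a, int b) { return a - b; }"}
bins = {"add": b"\x8d\x04\x37\xc3",   # fabricated bytes, for illustration
        "mul": b"\x0f\xaf\xc6\xc3"}

pairs = pair_by_symbol(src, bins)
print([name for name, _, _ in pairs])  # only "add" appears in both maps
```

Inlining, static functions, and stripped binaries are where this naive name-based alignment breaks down, which is part of why dataset construction is hard.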
Thanks! We're working on Ghidra/IDA Pro support. The problem we face is finding the right kind of data to test with and deciding how to evaluate it. It's like there's no "standard" benchmark/metrics that everyone uses for decompilation.
Thanks! But people want an all-in-one solution for decompilation. Given the vast array of architectures and compilation settings, and the fact that this information is usually not known in advance, finding a way to effectively navigate this complexity is quite difficult.
I don't have a toolchain; I am predicting research that will be able to detect the exact toolchain used from the binary alone. If you can detect the toolchain, then you can iterate to a fixed point (grind) until you recover a perfect copy of the source.
As others have said, the standardization of metrics is still debated, but at the same time this space has been explored by several top-tier papers that your paper does not cite. For example, DREAM [1] was evaluated using the classic metric of goto emission; Rev.ng [2] was evaluated using cyclomatic complexity and goto counts; and SAILR [3] was evaluated using the previous metrics plus a graph edit distance score over the structure of the code.
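For concreteness, two of those metrics are cheap to compute. A minimal sketch (the token-matching goto counter is deliberately crude, and the CFG edge/node counts are supplied by hand rather than extracted from a real decompiler):

```python
# Hedged sketch of two metrics mentioned above: goto count in decompiled C,
# and cyclomatic complexity M = E - N + 2P over a control-flow graph
# (E edges, N nodes, P connected components).

import re

def goto_count(c_source: str) -> int:
    # Crude token match; a real implementation would parse, not grep.
    return len(re.findall(r"\bgoto\b", c_source))

def cyclomatic_complexity(edges: int, nodes: int, components: int = 1) -> int:
    return edges - nodes + 2 * components

code = "if (x) goto fail; return 0; fail: goto cleanup; cleanup: return 1;"
print(goto_count(code))                         # 2
print(cyclomatic_complexity(edges=9, nodes=8))  # 3
```

Fewer gotos and lower complexity are the conventional proxies for "more readable" output in the papers above, which is exactly why dropping them needs an argument.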
I feel that dropping metrics established in prior peer-reviewed work without justification weakens your new metrics. However, I still think this is an interesting paper. It could just be made more legit by thoroughly reading and citing previous work in the area and building an argument for why you go against it.