Compilers

  • anyone interested in decompilation

    12 answers - 3308 bytes

    QuantumG wrote:
    Decompilation is the process of recovering human readable source code
    from a program executable. Many decompilers exist for Java and .NET as
    the program executables (class files) maintain much of the information
    found in the source code. This is not true for machine code
    executables however.
    In recent years decompilation for machine code has moved from the
    domain of crackpots and academic hopefuls to a number of real
    technologies that are available to the general public. Decompilers for
    machine code now exist which produce output that rivals disassemblers
    as a tool for analysing programs for security flaws, malware or just
    simply to see how something works. Full source code recovery that is
    economically attainable will soon be a reality.
    The legal challenges posed by this technology differ from country to
    country. As such, much research is being done in secret in countries
    that prohibit some uses of the technology, whereas some research is
    being done more publicly in countries that have laws which support the
    technology (Australia, for example).
    Boomerang is an open source decompiler written (primarily) by two
    Australian researchers. Open source projects need contributors. If
    you have an interest in decompilation, we'd like to hear from you.
    We're not only interested in talking to programmers. The project
    suffers from a lack of documentation, tutorials and community. There
    are many tasks that can be performed by users with minor technical
    knowledge.
    For more information on machine code decompilation see the Boomerang
    web site (). For interesting
    technical commentary on machine code decompilation, see my blog
    (http://quantumg.blotspot.com/).
    dcorbit@connx.com wrote:
    You want comp.compilers, I think. This comes up once or so per year.
    P.S.
    You can't turn the DNA of a dead cow back into a cow. That sort of
    thing only works on "Jurassic Park" movies.
    When you want another cow, the best way to get one is to get a momma
    cow and a daddy cow (sometimes known as 'bulls') and let them do their
    business.
    When you want to get your source code back, if you are using a compiled
    language, the best thing is to restore from backup or pull from CVS.
    I hope you succeed and make a workable decompiler, despite the known
    impossibility of the general solution.
    I also recommend that you stick to news:comp.compilers because that is
    the arena where this sort of thing has ardent admirers.
    Here, in comp.lang.c, we are not terribly interested in it. You
    might say, "It's written in C!" but so is Microsoft Word, and
    Microsoft Word is not topical here. You might say, "It outputs C
    target language!" Which would be doubly interesting if the input were
    a COBOL program but in any case, we don't care about that either.
    Once you have it all working properly, I promise to give it a look.
    Until then, don't go away mad -- just go away.
    [If you know that a program was compiled by a particular compiler, I gather
    it's possible to do pattern matching on the code idioms it uses to recover
    more source than one might expect. And debug symbols help a lot. -John]
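
    A minimal sketch of the idiom matching the moderator describes, in
    Java. It assumes 32-bit x86 code from a compiler that opens each
    function with the classic frame-pointer prologue "push ebp; mov ebp,
    esp" (bytes 55 89 E5, or 55 8B EC in the alternate encoding). The
    class and method names are illustrative and do not come from Boomerang
    or any other tool. A real decompiler would disassemble properly rather
    than scan raw bytes, since these patterns can also occur inside data
    or in the middle of other instructions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Boomerang or any tool named in the thread.
final class PrologueScanner {
    // Two encodings of "push ebp; mov ebp, esp" on 32-bit x86.
    private static final byte[][] PROLOGUES = {
        {(byte) 0x55, (byte) 0x89, (byte) 0xE5},
        {(byte) 0x55, (byte) 0x8B, (byte) 0xEC},
    };

    // Returns the offsets in code at which a known prologue begins.
    static List<Integer> findFunctionStarts(byte[] code) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < code.length; i++) {
            for (byte[] p : PROLOGUES) {
                if (matchesAt(code, i, p)) {
                    starts.add(i);
                    break;
                }
            }
        }
        return starts;
    }

    // True if pattern occurs in code starting exactly at offset.
    private static boolean matchesAt(byte[] code, int offset, byte[] pattern) {
        if (offset + pattern.length > code.length) {
            return false;
        }
        for (int j = 0; j < pattern.length; j++) {
            if (code[offset + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] code = {
            (byte) 0x55, (byte) 0x89, (byte) 0xE5,   // push ebp; mov ebp, esp
            (byte) 0x5D, (byte) 0xC3,                // pop ebp; ret
            (byte) 0x90,                             // nop (padding)
            (byte) 0x55, (byte) 0x8B, (byte) 0xEC,   // same prologue, other encoding
            (byte) 0x5D, (byte) 0xC3,                // pop ebp; ret
        };
        System.out.println(findFunctionStarts(code)); // prints [0, 6]
    }
}
```

    Knowing which compiler produced the binary is what makes such a table
    of patterns worth building: each compiler and optimisation level has
    its own set of idioms.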
  • No.1 | | 1732 bytes | |

    dcorbit@connx.com wrote:
    QuantumG wrote:
    You can't turn the DNA of a dead cow back into a cow. That sort of
    thing only works on "Jurassic Park" movies.

    When you want another cow, the best way to get one is to get a momma
    cow and a daddy cow (sometimes known as 'bulls') and let them do their
    business.

    [If you know that a program was compiled by a particular compiler, I gather
    it's possible to do pattern matching on the code idioms it uses to recover
    more source than one might expect. And debug symbols help a lot. -John]

    For a particular compiler (call it Machine-A), it would be possible to
    create another Machine-B, which can convert the output of Machine-A
    back to its input. Since Machine-A and its working are known a priori,
    Machine-B can first be trained (using techniques from machine learning
    or neural networks). In the machine-learning domain, such techniques are
    found to be non-trivially useful, even when Machine-A is
    non-deterministic (following some probability distribution function
    or pure learning history). For a deterministic machine like a compiler,
    it should be a piece of cake, given enough (practically feasible)
    learning of B from A.

    PS: The analogy of dead-cow is incorrect. Getting another living-cow
    from dead cow is equivalent to generating correct source code from a
    faulty machine-code. A dead cow is something which is not functional.
    [The translation from source to object is an extremely lossy one. You
    can reconstruct some source code, but it's tough in general to get
    source code you'd want to use. But see some following articles for
    success stories. -John]
  • No.2 | | 1056 bytes | |

    Thursday 03 Aug 2006 23:49, dcorbit@connx.com wrote:
    You can't turn the DNA of a dead cow back into a cow. That sort of
    thing only works on "Jurassic Park" movies.

    Reengineering of assembler code to a high level language is certainly
    possible
    (see )
    even for programs which were hand-written in assembler,
    and not compiled from a high level language in the first place!

    I have worked on several major commercial assembler to C migration projects:
    each involving over half a million lines of hand-written assembler that was
    migrated to efficient and maintainable C code. This paper is a good
    starting point:

    %A Martin Ward
    %T Pigs from Sausages? Reengineering from Assembler to C via
    FermaT Transformations
    %J Science of Computer Programming,
    Special Issue on Program Transformation
    %E R. Lammel
    %V 52
    %N 1
    %P 213-255
    %D 2004

    The core transformation engine used in these projects can be downloaded
    from my web site: http://www.cse.dmu.ac.uk/~mward/fermat.html
    or:
  • No.3 | | 1535 bytes | |

    dcorbit@connx.com wrote:

    QuantumG wrote:

    >>Decompilation is the process of recovering human readable source code
    >>from a program executable. Many decompilers exist for Java and .NET as
    >>the program executables (class files) maintain much of the information
    >>found in the source code. This is not true for machine code
    >>executables however.


    JVM isn't all that different from many machines. I think it is more
    the exception model of Java that prohibits many optimizations that
    otherwise might confuse decompilers.

    >>In recent years decompilation for machine code has moved from the
    >>domain of crackpots and academic hopefuls to a number of real
    >>technologies that are available to the general public.


    It is likely specific to a specific version of a compiler,
    and will depend on the optimizations that the compiler does.

    (snip)

    P.S.
    You can't turn the DNA of a dead cow back into a cow. That sort of
    thing only works on "Jurassic Park" movies.

    There are people working on the Woolly Mammoth, though the probability
    might not be so high. I would expect that some extinct organism will
    be brought back in the not too distant future. It depends a lot on
    the quality of the DNA, and finding a similar enough living organism
    as an egg source.
    -- glen

  • No.4 | | 1080 bytes | |

    glen herrmannsfeldt wrote:

    >>from a program executable. Many decompilers exist for Java and .NET as
    >>the program executables (class files) maintain much of the information
    >>found in the source code. This is not true for machine code
    >>executables however.

    >JVM isn't all that different from many machines. I think it is more
    >the exception model of Java that prohibits many optimizations that
    >otherwise might confuse decompilers.

    JVM _is_ different. JVM defines special fields for the line
    numbers of the source code in the byte code. That's what the
    OP was asking for. Some years ago I studied the JVM spec and
    after this I knew why the de-compiled/disassembled byte code
    had such a high quality and readability. There are tools for
    removing these "comments" from the byte code.

    JVM is also different in that it doesn't know the concept
    of a pointer. This disadvantage is essential when comparing
    .NET binaries and Java byte code.

  • No.5 | | 727 bytes | |

    In comp.compilers dcorbit@connx.com wrote:

    You can't turn the DNA of a dead cow back into a cow. That sort of
    thing only works on "Jurassic Park" movies.

    Yup.

    You have to also have one complete cow cell to show you how the DNA
    is interpreted.

    I think this goes down with informal theorems "you can't learn a dead
    language (only) from their stone tablets" and "you can't learn to
    speak from a book". A certain amount of peripheral info is sometimes
    kept outside the systems in question -- "in the ether", so to speak.

    Ahhh. Quantum info and the formals of distributed computing all very
    interesting.
    [Can we talk about compilers now? -John]

  • No.6 | | 553 bytes | |

    Juergen Kahrs wrote:

    JVM is also different in that it doesn't know the concept
    of a pointer.

    This is a pet hate of mine: the JVM /does/ "know the concept of a
    pointer". You can hardly move in the JVM without using pointers, since
    they're the only way of handling arrays or instances of class types.

    What the JVM lacks is a notion of /pointer arithmetic/.

    [It's handy to use `reference` to mean `pointer without
    arithmetic opportunities`, but then C++ uses it for
    something else, not unrelated, again ]
  • No.7 | | 1247 bytes | |

    J Kahrs wrote:

    (snip regarding Java, JVM, pointers, and references)

    The Java type "reference" misses not only pointer arithmetic but also
    type casting (of pointers) and the address operator (&). You may of
    course argue that such features are not desirable. The absence of
    such features makes it much harder to write compilers for a translation
    from ISO C to JVM.

    Well, if a pointer variable includes a reference and offset, it
    comes close to what is needed. Arithmetic operations only change
    the offset, and the offset is used when the pointer is dereferenced.
    (Note that C makes no guarantee as to the result of comparison
    other than equal/not equal for pointers to different objects.)
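
    The reference-plus-offset idea can be sketched in Java as follows.
    The class IntPtr and its methods are hypothetical names chosen for
    illustration, not the output of any actual C-to-JVM compiler.
    Arithmetic touches only the offset, and the offset is applied when
    the pointer is dereferenced.

```java
// Hypothetical emulation of a C "int *" on the JVM: an array reference plus an offset.
final class IntPtr {
    private final int[] base;   // the object pointed into
    private final int offset;   // element index within base

    IntPtr(int[] base, int offset) {
        this.base = base;
        this.offset = offset;
    }

    // p + n: only the offset changes; the referenced array stays the same.
    IntPtr plus(int n) {
        return new IntPtr(base, offset + n);
    }

    // *p: the offset is applied at dereference time (and bounds-checked by the JVM).
    int load() {
        return base[offset];
    }

    // *p = value
    void store(int value) {
        base[offset] = value;
    }

    // p - q is only meaningful within one object, mirroring the C restriction.
    int minus(IntPtr other) {
        if (base != other.base) {
            throw new IllegalArgumentException("pointers into different objects");
        }
        return offset - other.offset;
    }

    // p == q: same object and same offset.
    boolean sameAs(IntPtr other) {
        return base == other.base && offset == other.offset;
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30};
        IntPtr p = new IntPtr(a, 0);
        IntPtr q = p.plus(2);            // q = p + 2
        q.store(q.load() + 1);           // *q += 1
        System.out.println(a[2]);        // prints 31
        System.out.println(q.minus(p));  // prints 2
    }
}
```

    Allocating a new object for every p + n is the kind of overhead that
    makes such a translation possible but not cheap.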

    The & operator is different. As far as I know, you have to make
    all scalar variables into arrays dimensioned one. Then they can
    be referenced as arrays are. Another way is to allocate all
    scalar variables of a given type as one array with different
    offsets. That might be more C like!

    I have, for example, used an array dimensioned one with a Hashtable,
    and can then dereference it and increment it in one expression,
    after having tested that the entry exists.
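
    A sketch of the one-element-array trick in Java. The word-count use
    and all names here are assumptions made for illustration; the post
    does not say what the Hashtable held. The int[1] value acts as an
    addressable cell, so it can be fetched from the table and incremented
    in a single expression once the entry is known to exist.

```java
import java.util.Hashtable;

// Illustrative only: int[1] used as a stand-in for an addressable scalar.
final class CellDemo {
    // Counts occurrences, storing each count in an int[1] "cell".
    static Hashtable<String, int[]> count(String[] words) {
        Hashtable<String, int[]> counts = new Hashtable<>();
        for (String w : words) {
            if (!counts.containsKey(w)) {
                counts.put(w, new int[1]);   // fresh cell, initialised to 0
            }
            counts.get(w)[0]++;              // dereference and increment in one expression
        }
        return counts;
    }

    // Emulates passing &x to a function: the callee writes through the cell.
    static void setToFortyTwo(int[] cell) {
        cell[0] = 42;
    }

    public static void main(String[] args) {
        Hashtable<String, int[]> counts = count(new String[] {"a", "b", "a"});
        System.out.println(counts.get("a")[0]);  // prints 2

        int[] x = new int[1];      // scalar "int x" declared as an array of one
        setToFortyTwo(x);          // stands in for f(&x)
        System.out.println(x[0]);  // prints 42
    }
}
```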
    -- glen
  • No.8 | | 991 bytes | |

    J Kahrs wrote:

    The Java type "reference" misses not only pointer arithmetic but also
    type casting (of pointers) and the address operator (&). You may of
    course argue that such features are not desirable.

    At least such features are not required in a programming language, as
    Java demonstrates ;-)

    The absence of such features makes it much harder to write compilers
    for a translation from ISO C to JVM.

    Translation or translators between C and Java, or .NET, all suffer from
    essentially the same problem: the target system doesn't support all the
    features of the source system. In either direction.

    The CLR environment of .NET is different in this
    respect. And most other virtual machines treat the concept of a
    pointer as a natural ingredient of a virtual machine.

    Java references translate well to managed .NET code, whereas C pointers
    only translate to unmanaged code, with all known risks.

    DoDi
  • No.9 | | 2207 bytes | |

    Jürgen Kahrs wrote:

    Chris Dollin wrote:
    >
    >[It's handy to use `reference` to mean `pointer without arithmetic
    >opportunities`, but then C++ uses it for something else, not
    >unrelated, again ]
    >

    The Java type "reference" misses not only pointer arithmetic but also
    type casting (of pointers) and the address operator (&). You may of
    course argue that such features are not desirable.

    I don't wish to come over as a nit-picker [1], but Java /does/
    have type-casting of pointers. What it doesn't have is casting
    pointers into or out of non-pointers. The need for &address is
    much diminished because it would only be useful for primitive
    types (since any non-primitive is already accessed via pointers).
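
    A short Java illustration of the distinction: reference-to-reference
    casts exist and are checked at run time, while a reference cannot be
    converted to an integer address. The class name is illustrative only.

```java
// Illustrative only: shows a checked reference cast in Java.
final class CastDemo {
    // Downcasts a reference; returns null instead of throwing when the cast is invalid.
    static String asStringOrNull(Object o) {
        try {
            return (String) o;   // checked reference cast
        } catch (ClassCastException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Object a = "hello";
        Object b = Integer.valueOf(7);
        System.out.println(asStringOrNull(a));  // prints hello
        System.out.println(asStringOrNull(b));  // prints null

        String s = "hello";
        // long addr = (long) s;   // rejected by the compiler: a String reference
        //                         // cannot be converted to an integer, unlike (long)p in C
        System.out.println(s.length());         // prints 5
    }
}
```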

    Whether or not those features might be /desirable/, I don't know;
    I haven't missed pointer arithmetic in Java in general, although
    I'd dearly love it at the character-sequence level for lexical
    analysis, and would personally rather have multiple-valued expressions
    than address-of; both of those features make it, I think, rather harder
    to have type-safe implementations, and weren't such implementations
    a goal of Java? If I remember correctly -- and it's been a long time --
    .NET only allows address arithmetic in unmanaged (hence unsafe) code.

    The absence of
    such features makes it much harder to write compilers for a translation
    from ISO C to JVM.

    Well, yes. This surprises me no more than it being hard to ride
    a bicycle on water. The JVM wasn't designed as a general-purpose
    implementation environment.

    (In fact it /doesn't/ make it harder to write such compilers; it
    just makes it harder to write compilers that generate efficient
    code. Of course one hopes that C code will be efficient.)

    The CLR environment of .NET is different in this
    respect. And most other virtual machines treat the concept of a
    pointer as a natural ingredient of a virtual machine.

    Probably. I'd like to see the statistics, mind; they'd be
    interesting.

    [1] Although I probably am one.
  • No.10 | | 407 bytes | |

    Hans-Peter Diettrich wrote:

    (snip)

    Java references translate well to managed .NET code, whereas C pointers
    only translate to unmanaged code, with all known risks.

    In the usual implementations, C pointers translate to unmanaged
    code, but C doesn't require that. There are many restrictions
    in the C standard to allow for other representations.
    -- glen

  • No.11 | | 979 bytes | |

    Chris Dollin <chris.dollin@hp.com> wrote:

    I don't wish to come over as a nit-picker

    I'll risk it! DoDi also made the same CLI terminology slip-up.

    If I remember correctly -- and it's been a long time --
    .NET only allows address arithmetic in unmanaged (hence unsafe) code.

    The "official" CLI nomenclature for this distinction is verifiable (i.e.
    normal C#, Java) versus unverifiable (unsafe) code. Code that is written
    in CIL and running on the CLI abstract machine is still managed, even if
    it uses pointers, unsafe typecasts, or performs other unverifiable
    operations. For more info on verifiability, see ECMA 335 3rd ed 8.8, and
    throughout the spec.

    The managed / unmanaged distinction is between code running on the CIL
    versus code running on the underlying processor. One must use P/Invoke
    (marshalling implemented by the CLI runtime) to escape to unmanaged code
    (ECMA 335 3rd ed 15.5.2).
    -- Barry
  • No.12 | | 1094 bytes | |

    Glen,

    As far as I know, the languages that allow operator overloading
    use the same precedence in all cases for each operator.

    It depends what you mean by an operator. Algol 68 allowed people to
    define an identifier to act like an operator. The following example
    is copied from page 188 of "Informal introduction to Algol 68" by
    Lindsey et al.

    op min = (real a, b) real: (a < b | a | b),
    min = (int a, b) int: (a < b | a | b);

    a := x min y;

    Now the priority (what Algol 68 called precedence) of one of these
    'operators' could be specified.

    prio min = 9;

    All of this could occur in nested blocks potentially resulting in the
    case about which our moderator has lost neural connection to the
    appropriate memories.
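
    The effect of a priority declaration can be sketched with a small
    precedence-climbing evaluator in Java. This is an illustration, not
    Algol 68: the names PrioDemo and eval are made up, operands are
    integers only, tokens must be separated by spaces, and the table
    assumes Algol 68's standard priorities of 6 for + and - and 7 for *.
    Changing the priority given to min changes how the same expression
    groups.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative evaluator with a user-chosen priority for the "min" operator.
final class PrioDemo {
    private final Map<String, Integer> prio = new HashMap<>();
    private String[] tokens;
    private int pos;

    PrioDemo(int minPriority) {
        prio.put("+", 6);
        prio.put("-", 6);
        prio.put("*", 7);
        prio.put("min", minPriority);   // plays the role of "prio min = n"
    }

    // Evaluates a space-separated expression of integers and dyadic operators.
    int eval(String expr) {
        tokens = expr.trim().split("\\s+");
        pos = 0;
        return parse(1);
    }

    // Precedence climbing: consume operators whose priority is at least minPrio.
    private int parse(int minPrio) {
        int left = Integer.parseInt(tokens[pos++]);
        while (pos < tokens.length) {
            String op = tokens[pos];
            Integer p = prio.get(op);
            if (p == null) {
                throw new IllegalArgumentException("unknown operator: " + op);
            }
            if (p < minPrio) {
                break;
            }
            pos++;
            int right = parse(p + 1);   // p + 1 makes every operator left-associative
            left = apply(op, left, right);
        }
        return left;
    }

    private static int apply(String op, int a, int b) {
        switch (op) {
            case "+": return a + b;
            case "-": return a - b;
            case "*": return a * b;
            case "min": return Math.min(a, b);
            default: throw new IllegalArgumentException(op);
        }
    }

    public static void main(String[] args) {
        String expr = "1 + 5 min 2 * 3";
        System.out.println(new PrioDemo(9).eval(expr));  // 1 + ((5 min 2) * 3), prints 7
        System.out.println(new PrioDemo(5).eval(expr));  // (1 + 5) min (2 * 3), prints 6
    }
}
```

    Here the priority is fixed per evaluator instance. Allowing a
    different priority in each nested block would mean consulting a
    scoped table at parse time instead.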

    [I don't think I've ever seen a language where the operator precedence
    depended on the types of the operands, and I hope I never do. -John]

    What's the problem? Developer knowledge of operator precedence is so
    bad it probably wouldn't make much difference.
