Assignment: LEX

Changelog:

Your Task

Patterns you need to identify

“Tricky jump”

Find machine code corresponding to assembly like:

    pushq $AddressOfVirusFunction
    retq

    

(like you most likely inserted in the TRICKY assignment).
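To make the target byte sequence concrete, here is a minimal C sketch of the same check outside of flex. It assumes the assembler chose the 5-byte `push imm32` encoding (opcode 0x68 followed by a 4-byte immediate) and the 1-byte near return (opcode 0xc3). The function name `find_tricky_jump` is hypothetical and not part of the assignment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the offset of the first byte sequence that
 * looks like "pushq $imm32; retq", or -1 if there is none.
 * Assumption: push is encoded as 0x68 + 4 immediate bytes, retq as 0xc3. */
long find_tricky_jump(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i + 6 <= len; ++i) {
        /* buf[i+1..i+4] hold the pushed address and may be any bytes */
        if (buf[i] == 0x68 && buf[i + 5] == 0xc3)
            return (long) i;
    }
    return -1;
}
```

Because the four address bytes are unconstrained, any 0x68 that happens to be followed five bytes later by 0xc3 will match, so false positives in ordinary executables are possible.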

“Encrypted virus”

Identify a virus that attempts to evade detection by decrypting its code at runtime using machine code corresponding to assembly similar to:

    leaq <start location encoded using 4 bytes>(%rip), %rsi  /* load start location of encrypted */
    movl $<length>, %esp               /* place length in stack pointer ---
                                          has bonus effect of confusing debuggers */
loop:
    /* loop with *variable amount* of unrolling */
    xorl %esp, (%rsi)
    xorl %esp, <number encoded using 1 byte>(%rsi)
    xorl %esp, <number encoded using 1 byte>(%rsi)
    addq $<number encoded using 1 byte>, %rsi
    subl $<number encoded using 1 byte>, %esp
    jnz loop

    

where the start location and length vary, and the number of xors in the loop varies at random from 1 to 10. When the number of xors changes, the corresponding constants in the addq/subl change accordingly; for example, with 4 xors, the code would look like:

    leaq <start location encoded using 4 bytes>(%rip), %rsi  /* load start location of encrypted */
    movl $<length>, %esp               /* place length in stack pointer ---
                                          has bonus effect of confusing debuggers */
loop:
    /* loop with *variable amount* of unrolling */
    xorl %esp, (%rsi)
    xorl %esp, <number encoded using 1 byte>(%rsi)
    xorl %esp, <number encoded using 1 byte>(%rsi)
    xorl %esp, <number encoded using 1 byte>(%rsi)
    addq $<number encoded using 1 byte>, %rsi
    subl $<number encoded using 1 byte>, %esp
    jnz loop

    

Write a pattern that will detect most variants of this code including those in the example files we give.

(It is okay if your pattern does not check whether the jnz instruction actually jumps back to the top of the loop instead of to another location.)
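As an illustration of what the machine code for this stub might look like, the following C sketch checks whether a buffer begins with the decryptor. The byte values are assumptions about how an assembler would typically encode each instruction (for example, `31 26` for `xorl %esp, (%rsi)` and `31 66 <disp8>` for the displaced form); compare them against a disassembly of the sample files before relying on them. The function name `match_decryptor` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: if the decryptor stub starts at buf[0], returns its
 * length in bytes; otherwise returns 0. All encodings are assumed, not
 * taken from the assignment. Like the assignment allows, the jnz target
 * is not checked. */
size_t match_decryptor(const unsigned char *buf, size_t len)
{
    size_t i;

    /* leaq disp32(%rip), %rsi : 48 8d 35 + 4-byte displacement */
    if (len < 7 || buf[0] != 0x48 || buf[1] != 0x8d || buf[2] != 0x35)
        return 0;
    i = 7;

    /* movl $imm32, %esp : bc + 4-byte immediate */
    if (len - i < 5 || buf[i] != 0xbc)
        return 0;
    i += 5;

    /* first xor, xorl %esp, (%rsi) : 31 26, no displacement byte */
    if (len - i < 2 || buf[i] != 0x31 || buf[i + 1] != 0x26)
        return 0;
    i += 2;

    /* up to 9 more, xorl %esp, disp8(%rsi) : 31 66 + 1-byte displacement */
    for (int extra = 0; extra < 9; ++extra) {
        if (len - i < 3 || buf[i] != 0x31 || buf[i + 1] != 0x66)
            break;
        i += 3;
    }

    /* addq $imm8, %rsi : 48 83 c6 + 1-byte immediate */
    if (len - i < 4 || buf[i] != 0x48 || buf[i + 1] != 0x83 || buf[i + 2] != 0xc6)
        return 0;
    i += 4;

    /* subl $imm8, %esp : 83 ec + 1-byte immediate */
    if (len - i < 3 || buf[i] != 0x83 || buf[i + 1] != 0xec)
        return 0;
    i += 3;

    /* jnz rel8 : 75 + 1-byte offset */
    if (len - i < 2 || buf[i] != 0x75)
        return 0;
    return i + 2;
}
```

Under these assumed encodings, a stub with three xors is 29 bytes long (7 + 5 + 2 + 3 + 3 + 4 + 3 + 2). The loop over the displaced xors is the part that corresponds to a bounded repetition in a flex pattern.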

Sample files

Our archive of sample files has three subdirectories: n, t, and e.

In the t and e directories, the trivial subdirectories contain binary files that start with the offending pattern (with some extra bytes afterwards in the case of the “encrypted virus” pattern and nothing else in the case of the “tricky jump” pattern), while the other files have the pattern inserted into a normal executable or library. Note that the inserted “virus” code is not actually functional and may have been inserted in a way that prevents the executable from functioning normally.

If you downloaded lex-samples.tar.gz before 26 Feb around 3:30pm, then the 1.exe that was included in the n directory had a false positive for “tricky jump” due to an error on my part. I have since replaced it with a different executable (and changed the corresponding modified versions in the t and e directories).

Resources

Hints

Machine code format

  1. The x86-64 machine code for addresses computed using a register that is not %rip and a displacement (like 42(%rsp) (AT&T syntax) or [RSP + 42] (Intel syntax)) uses a variable number of bytes for the displacement depending on its size:

    • if the displacement is zero, assemblers usually omit it entirely (as in (%rsi))
    • if the displacement fits in a 1-byte signed number, it uses 1 byte
    • otherwise, if it fits in a 4-byte signed number, it uses 4 bytes (64-bit code has no 2-byte displacement form)
    • if it’s larger, then typically the instruction is not legal
  2. As a special case, RIP-relative addressing (e.g. 42(%rip) (AT&T syntax) or [RIP + 42] (Intel syntax)) always uses a 4-byte displacement value.
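The displacement size is selected by the top two bits (the "mod" field) of the ModRM byte that follows the opcode. The sketch below decodes that field for 64-bit code. The function name `modrm_disp_bytes` is hypothetical, and as a simplification it does not handle the SIB-byte case (low three bits equal to 4), where a SIB base of 5 with mod 0 also implies a 4-byte displacement.

```c
/* Hypothetical helper: number of displacement bytes implied by a ModRM byte
 * in 64-bit mode. Simplification: the SIB special case is not handled. */
int modrm_disp_bytes(unsigned char modrm)
{
    unsigned mod = modrm >> 6;  /* top two bits */
    unsigned rm  = modrm & 7;   /* low three bits */

    switch (mod) {
    case 0:  return rm == 5 ? 4 : 0;  /* rm == 5 is RIP-relative: always 4 bytes */
    case 1:  return 1;                /* signed 8-bit displacement */
    case 2:  return 4;                /* signed 32-bit displacement */
    default: return 0;                /* mod == 3: register operand, no memory access */
    }
}
```

For example, `modrm_disp_bytes(0x66)` returns 1 (a register-plus-1-byte-displacement operand), while `modrm_disp_bytes(0x35)` returns 4 because mod 0 with rm 5 means RIP-relative addressing.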

Flex usage

  1. In the code specified to run when a pattern is matched in flex, the variable yytext is a char array that points to the matched bytes and yyleng is the length of the matched bytes.

  2. For debugging, you can output the matched bytes separated by .s with code like:

        for (int i = 0; i < yyleng; ++i) {
            printf("%02x.", (unsigned char) yytext[i]);
        }
    

    (The cast to unsigned char makes sure that the numbers are all positive before being printed out.)

Credits

This assignment is based on an assignment from Jack Davidson’s version of this course.