Assignment: LEX

Changelog:

  • 10 Feb 2025: clarify that the start location/length in the encrypted virus pattern vary independently of the number of xors.
  • 10 Feb 2025: clarify that it’s okay if “encrypted virus” pattern does not check constants for addq/subl
  • 11 Feb 2025: describe how to disassemble raw binary files in hints
  • 12 Feb 2025: correct spelling of disassemble in command in hints

Your Task

Patterns you need to identify

“Tricky jump”

Find machine code corresponding to assembly like:

    pushq $AddressOfVirusFunction
    retq

    

(like you most likely inserted in the TRICKY assignment).
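As a rough illustration of what these bytes look like, the C sketch below scans a buffer for the encoding an assembler would typically produce for this sequence: the opcode byte 0x68 followed by a 4-byte immediate for the pushq, then the single byte 0xc3 for the retq. The function name find_tricky_jump is hypothetical, and the sketch assumes the 4-byte-immediate form of pushq is used. It is not a flex pattern, only a way to check your reading of the disassembly.

```c
#include <assert.h>
#include <stddef.h>

/* Return the offset of the first "pushq $imm32; retq" byte sequence in buf,
   or -1 if there is none.
   Assumption: the push is encoded as 0x68 followed by a 4-byte immediate,
   and the return is the 1-byte opcode 0xc3. */
long find_tricky_jump(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    /* The whole sequence is 6 bytes: 1 opcode + 4 immediate + 1 opcode. */
    for (size_t i = 0; i + 6 <= len; ++i) {
        if (buf[i] == 0x68 && buf[i + 5] == 0xc3) {
            return (long) i;
        }
    }
    return -1;
}
```

Note that the four immediate bytes can take any value, including 0x0a (newline) and 0x00, which is worth keeping in mind when writing the equivalent pattern.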

“Encrypted virus”

Identify a virus that attempts to evade detection by decrypting its code at runtime using machine code corresponding to assembly similar to:

    leaq <start location encoded using 4 bytes>(%rip), %rsi  /* load start location of encrypted code */
    movl $<length>, %esp               /* place length in stack pointer ---
                                          has bonus effect of confusing debuggers */
loop:
    /* loop with *variable amount* of unrolling */
    xorl %esp, (%rsi)
    xorl %esp, <number encoded using 1 byte>(%rsi)
    xorl %esp, <number encoded using 1 byte>(%rsi)
    addq $<number encoded using 1 byte>, %rsi
    subl $<number encoded using 1 byte>, %esp
    jnz loop

    

where the start location and length vary (independently of the number of xors), and the number of xors in the loop varies at random from 1 to 10. When the number of xors changes, the constants in the addq and subl change accordingly. For example, with 4 xors, the code would look like:

    leaq <start location encoded using 4 bytes>(%rip), %rsi  /* load start location of encrypted code */
    movl $<length>, %esp               /* place length in stack pointer ---
                                          has bonus effect of confusing debuggers */
loop:
    /* loop with *variable amount* of unrolling */
    xorl %esp, (%rsi)
    xorl %esp, <number encoded using 1 byte>(%rsi)
    xorl %esp, <number encoded using 1 byte>(%rsi)
    xorl %esp, <number encoded using 1 byte>(%rsi)
    addq $<number encoded using 1 byte>, %rsi
    subl $<number encoded using 1 byte>, %esp
    jnz loop

    

(where the numbers for the addq and subl will be different).

Write a pattern that will detect most variants of this code including those in the example files we give.

(It is okay if your pattern does not check whether the jnz instruction actually jumps back to the top of the loop instead of to another location. Also, you do not need to check the actual values of the numbers for the addq and subl, even though they will be determined by the loop.)

Sample files

Our archive of sample files has three subdirectories:

In the t and e directories, the trivial subdirectories contain binary files that start with the offending pattern (followed by some extra bytes in the case of the “encrypted virus” pattern, and by nothing else in the case of the “tricky jump” pattern), while the other files have the pattern inserted into a normal executable or library. Note that the “virus” code inserted is not actually functional and may have been inserted in a way that prevents the executable from functioning normally.

Resources

Hints

Disassembling raw files

  1. You can use a command like objdump --target binary --architecture i386:x86-64 --disassemble-all foo to treat a file foo as instructions and disassemble it. This might be handy for looking at the “trivial” examples.

  2. You could also potentially load these files into Ghidra as “raw” files.

Machine code format

  1. The x86-64 machine code for addresses computed using a register that is not %rip and a displacement (like 42(%rsp) (AT&T syntax) or [RSP + 42] (Intel syntax)) uses a variable number of bytes for the displacement depending on its size:

    • if the displacement fits in a 1-byte signed number, it uses 1 byte
    • otherwise, if it fits in a 4-byte signed number, it uses 4 bytes (64-bit code has no 2-byte displacement encoding)
    • if it’s larger, then typically the instruction is not legal
  2. As a special case, RIP-relative addressing (e.g. 42(%rip) (AT&T syntax) or [RIP + 42] (Intel syntax)) always uses a 4-byte displacement value.
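The displacement size is selected by the ModRM byte that follows the opcode. The C sketch below shows one way to read it off in 64-bit code. The function name disp_size and its two-argument interface are hypothetical, and the sketch ignores instruction prefixes that change the address size.

```c
#include <assert.h>

/* Number of displacement bytes implied by a ModRM byte in 64-bit code.
   sib is the byte after ModRM; it is only examined when ModRM says a SIB
   byte is present (rm == 4 with a memory operand).
   Assumption: no address-size override prefix is in effect. */
int disp_size(unsigned char modrm, unsigned char sib) {
    unsigned mod = modrm >> 6;   /* top two bits select the addressing form */
    unsigned rm = modrm & 7;     /* low three bits select the base register */

    assert(mod <= 3);
    switch (mod) {
    case 1:  return 1;           /* base register + 1-byte signed displacement */
    case 2:  return 4;           /* base register + 4-byte signed displacement */
    case 3:  return 0;           /* register operand, no memory address at all */
    default:                     /* mod == 0: usually no displacement */
        if (rm == 5) return 4;                   /* RIP-relative: always 4 bytes */
        if (rm == 4 && (sib & 7) == 5) return 4; /* SIB with no base register */
        return 0;
    }
}
```

For example, disp_size(0x35, 0) is 4 (0x35 is the ModRM byte of a leaq from a %rip-relative address into %rsi), while disp_size(0x66, 0) is 1 and disp_size(0x26, 0) is 0, matching the forms N(%rsi) and (%rsi) with %esp as the other operand.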

Flex usage

  1. In the code specified to run when a pattern is matched in flex, the variable yytext is a char array that points to the matched bytes and yyleng is the length of the matched bytes.

  2. For debugging, you can output the matched bytes separated by .s with code like:

        for (int i = 0; i < yyleng; ++i) {
            printf("%02x.", (unsigned char) yytext[i]);
        }
    

    (The cast to unsigned char makes sure that the numbers are all positive before being printed out.)
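If you would rather build the debugging string in memory (for example, to print it together with an offset), a helper along these lines may be useful. This is a minimal sketch; the name format_bytes and its parameters are hypothetical and not part of flex.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format up to len bytes as two-digit hex values, each followed by '.',
   into out, which is always NUL-terminated.
   Returns the number of bytes that fit and were formatted. */
int format_bytes(const char *bytes, int len, char *out, size_t out_size) {
    size_t used = 0;
    int i;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    /* Each byte needs 3 characters ("xx.") plus room for the final NUL. */
    for (i = 0; i < len && used + 4 <= out_size; ++i) {
        /* Cast to unsigned char so bytes >= 0x80 are not sign-extended. */
        snprintf(out + used, out_size - used, "%02x.", (unsigned char) bytes[i]);
        used += 3;
    }
    return i;
}
```

Inside a flex action this could be called as format_bytes(yytext, yyleng, buf, sizeof buf) with a local char buf[...]. Because matched machine code can contain zero bytes, the length should come from yyleng rather than from strlen(yytext).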

Credits

This assignment is based on an assignment from Jack Davidson’s version of this course.