Assignment: LEX
Contents
Changelog:
- 10 Feb 2025: clarify that the start location/length in the encrypted virus pattern vary independently of the number of xors.
- 10 Feb 2025: clarify that it’s okay if “encrypted virus” pattern does not check constants for addq/subl
- 11 Feb 2025: describe how to disassemble raw binary files in hints
- 12 Feb 2025: correct spelling of disassemble in command in hints
Your Task
-
Using the regular expression tool
flex
, write patterns to detect the malware (virus or standalone) snippets provided in the section “Patterns”.You are to submit a file
virus-patterns1.l
such that runningflex -o virus-patterns.c virus-patterns1.l gcc -Wall -g -O -o scanner virus-patterns1.c -lfl ./scanner < inputfile
will output nothing if
inputfile
does not match any of the patterns we specify and will output something with the format:WARNING! {NAME}: byte number: 0x{LOCATION}
where
{NAME}
is the name of the pattern (appearing in quotes on the list below) and{LOCATION}
is offset of the first byte of the matching code (as a hexadecimal number). -
Test your tool on some of the example files provided in this archive. Note that we will test on additional files when grading (to prevent students from just detecting what example files, or hard-coding all the values of variable parts of a virus that appear in the example files).
-
If you implemented the “Tricky jump” detection, then your program will have some false positives because the bytes corresponding to a “Tricky jump” appear in innocent contents. For example, this copy of the program xsetwacom from a Ubuntu 20.04 system matches the “Tricky jump” pattern at byte number 0x8e94.
This false positive is because the “tricky jump” bytes appear in the middle of another instruction:
8e92: 0f 84 68 ff ff ff je 8e00 <__sprintf_chk@plt+0x46d0> 8e98: 83 c3 01 add $0x1,%ebx
(The
push
opcode68
appears in the bytes that specify the target of theje
instruction, and theret
opcodec3
appears in one of the bytes specifying the operands to theadd
instruction.)One way to mitigate this issue is to add a patterns for the innocent instructions, so that that pattern is for the “Tricky jump” are not matched. Create a copy of
virus-patterns1.l
calledvirus-patterns2.l
that adds a pattern for theje
instruction (or uses some other technique to avoid this type of false positive).(We only require that you avoid false positives substantially similar to this one. So it is okay if you don’t recognize other types of instructions in which the “Tricky jump” bytes might appear or if the “Tricky jump” bytes appear in non-executable files. Presumably, practical software looking for virus patterns would need to mitigate these issues, by searching for a pattern much less likely to appear in an executable by accident and/or by having a more sophisticated way of parsing executables.)
-
Submit your
virus-patterns1.l
andvirus-patterns2.l
files to the submission site.
Patterns you need to identify
“Tricky jump”
Find machine code corresponding to assembly like:
pushq $AddressOfVirusFunction
retq
(like you most likely inserted in the TRICKY assingment).
“Encrypted virus”
Identify a virus that attempts to evade detection by decrypting its code at runtime using machine code corresponding to assembly similar to:
leaq <start location encoded using 4 bytes>(%rip), %rsi /* load start location of encrypted */
movl $<length>, %esp /* place length in stack pointer ---
has bonus effect of confusing debuggers */
loop:
/* loop with *variable amount* of unrolling */
xorl %esp, (%rsi)
xorl %esp, <number encoded using 1 byte>(%rsi)
xorl %esp, <number encoded using 1 byte>(%rsi)
addq $<number encoded using 1 byte>, %rsi
subl $<number encoded using 1 byte>, %esp
jnz loop
where the start locations and length vary; and the number of xor’s in the loop vary at random from 1 to 10. When the number of xors is changed, the corresponding constants in the addq/subq will change accordingly, for example with 4 xors, the code would look like:
leaq <start location encoded using 4 bytes>(%rip), %rsi /* load start location of encrypted */
movl $<length>, %esp /* place length in stack pointer ---
has bonus effect of confusing debuggers */
loop:
/* loop with *variable amount* of unrolling */
xorl %esp, (%rsi)
xorl %esp, <number encoded using 1 byte>(%rsi)
xorl %esp, <number encoded using 1 byte>(%rsi)
xorl %esp, <number encoded using 1 byte>(%rsi)
addq $<number encoded using 1 byte>, %rsi
subl $<number encoded using 1 byte>, %esp
jnz loop
(where the numbers for the addq and subl will be different).
Write a pattern that will detect most variants of this code including those in the example files we give.
(It is okay if your pattern does not check whether the jnz
instruction actually jumps back to the top
of the loop instead of to another location. Also, you do not need to check the actual values of
the numbers for the addq and subl, even though they will be determined by the loop.)
Sample files
Our archive of sample files has three subdirectories:
n
contains files which should have no positives;t
contains files which contain a “tricky jump”; ande
contains files which contain the “encrypted virus” pattern.
In the t
and e
directories, the trivial
subdirectories contain binary files
who start with the offending pattern (with some extra bytes afterwards in the case of the “encrypted virus” pattern
and nothing else in the case of the “tricky jump” pattern),
while other files have the pattern inserted into a normal executable or library.
Note that the “virus” code inserted is not actually functional and may have been inserted in a way
which prevents the executable from functioning normally.
Resources
-
The flex manual.
-
Example flex files:
- count_foos.l: counts bytes, newlines, foos in its input
- a_words.l: output words starting with
a
- states.l: example using non-exclusive lex states
- states2.l: example using exclusive lex states
You can build any of these flex files into an executable using:
flex -o file.c file.l gcc -Wall -g -O -o file.exe file.c -lfl
They all read input from stdin and produce output to stdout. If you type input to them manually, you can send end-of-file by pressing control-D. You can also try them with input from a file using
./file.exe <input.txt
.
Hints
Disasembling raw files
-
You can use a command like
objdump --target binary --architecture i386:x86-64 --disassemble-all foo
to treat a filefoo
as instructions and disassemble it. This might be handy for looking at the “trivial” examples. -
You could also potential load these files into Ghidra as “raw” files.
Machine code format
-
The x86 machine code for addresses computed using a register that is not
%rip
and a displacement (like42(%rsp)
(AT&T syntax) or[RSP + 42]
(Intel syntax)) uses a variable number of bytes for the displacement depending on its size:- if the displacement fits in a 1 byte signed number, it uses 1 byte
- if it fits in a 2-byte signed number, it uses 2 bytes
- if it fits in a 4-byte signed number, it uses 4 bytes
- if it’s larger, then typically the instruction is not legal
-
As a special case RIP-relative addressing (e.g.
42(%rip)
(AT&T syntax) or[RIP + 42]
(Intel syntax) always use a 4-byte displacement value.
Flex usage
-
In the code specified to run when a pattern is matched in
flex
, the variableyytext
is achar
array that points to the matched bytes andyyleng
is the length of the matched bytes. -
For debugging, you can output the matched bytes seperated by .s with code like:
for (int i = 0; i < yyleng; ++i) { printf("%02x.", (unsigned char) yytext[i]); }
(The cast to
unsigned char
makes sure that the numbers are all positive before being printed out.)
Credits
This assignment is based on an assignment from Jack Davidson’s version of this course.