Assignment: LEX
Contents
Changelog:
- 25 Feb 2021: link examples archive; avoid implying code to print out bytes needs to be used in the submission
- 26 Feb 2021 3:30pm: correct
lex-samples.tar.gz
to remove “non-match” which had false positive for tricky jump pattern - 28 Feb 2021: note that there’s some extra bytes in the encrypted trivial examples, contrary to the original description of the sample files
- 1 Mar 2021: mention
sudo apt install libfl-dev
if error about-lfl
missing - 1 Mar 2021: clarify that
{LOCATION}
should be replaced the location of the first byte, not the first byte itself.S - 3 Mar 2021: formatting fix in “Your Task”
- 4 Mar 2021: explicitly mention command to actually run the generated scanner in “Your Task” as well as “Resources”
- 5 Mar 2021: change “positive number” and “negative number” in Encrypted virus examples so two exmaples are consistent. (When grading, we’ll avoid doing anything that checks whether or not solutions enforce that these numbers are positive/negative.)
Your Task
-
Using the regular expression tool
flex
, write patterns to detect the malware (virus or standalone) snippets provided in the section “Patterns”.If you get an error like
cannot find -lfl
running the commands below, runsudo apt install libfl-dev
.You are to submit a file
virus-patterns1.l
such that runningflex -o virus-patterns.c virus-patterns1.l gcc -Wall -g -O -o scanner virus-patterns1.c -lfl ./scanner < inputfile
will output nothing if
inputfile
does not match any of the patterns we specify and will output something with the format:WARNING! {NAME}: byte number: 0x{LOCATION}
where
{NAME}
is the name of the pattern (appearing in quotes on the list below) and{LOCATION}
is offset of the first byte of the matching code (as a hexadecimal number). -
Test your tool on some of the example files provided in this archive [last updated 2021-02-26 around 3:30pm]. Note that we will test on additional files when grading (to prevent students from just detecting what example files, or hard-coding all the values of variable parts of a virus that appear in the example files).
-
If you implemented the “Tricky jump” detection, then your program will have some false positives because the bytes corresponding to a “Tricky jump” appear in innocent contents. For example, this copy of the program xsetwacom from a Ubuntu 20.04 system matches the “Tricky jump” pattern at byte number 0x8e94.
This false positives is because the “tricky jump” bytes appear in the middle of another instruction:
8e92: 0f 84 68 ff ff ff je 8e00 <__sprintf_chk@plt+0x46d0> 8e98: 83 c3 01 add $0x1,%ebx
(The
push
opcode68
appears in the bytes that specify the target of theje
instruction, and theret
opcodec3
appears in one of the bytes specifying the operands to theadd
instruction.)One way to mitigate this issue is to add a patterns for the innocent instructions, so that that pattern is for the “Tricky jump” are not matched. Create a copy of
virus-patterns1.l
calledvirus-patterns2.l
that adds a pattern for theje
instruction (or uses some other technique to avoid this type of false positive).(We only require that you avoid false positives substantially similar to this one. So it is okay if you don’t recognize other types of instructions in which the “Tricky jump” bytes might appear or if the “Tricky jump” bytes appear in non-executable files. Presumably, practical software looking for virus patterns would need to mitigate these issues, by searching for a pattern much less likely to appear in an executable by accident and/or by having a more sophisticated way of parsing executables.)
-
Submit your
virus-patterns1.l
andvirus-patterns2.l
files to the submission site.
Patterns you need to identify
“Tricky jump”
Find machine code corresponding to assembly like:
pushq $AddressOfVirusFunction
retq
(like you most likely inserted in the TRICKY assingment).
“Encrypted virus”
Identify a virus that attempts to evade detection by decrypting its code at runtime using machine code corresponding to assembly similar to:
leaq <start location encoded using 4 bytes>(%rip), %rsi /* load start location of encrypted */
movl $<length>, %esp /* place length in stack pointer ---
has bonus effect of confusing debuggers */
loop:
/* loop with *variable amount* of unrolling */
xorl %esp, (%rsi)
xorl %esp, <number encoded using 1 byte>(%rsi)
xorl %esp, <number encoded using 1 byte>(%rsi)
addq $<number encoded using 1 byte>, %rsi
subl $<number encoded using 1 byte>, %esp
jnz loop
where the start locations, length, number of xor’s in the loop vary at random from 1 to 10. When the number of xors is changed, the corresponding constants in the addq/subq change accordingly, for example with 4 xors, the code would look like:
leaq <start location encoded using 4 bytes>(%rip), %rsi /* load start location of encrypted */
movl $<length>, %esp /* place length in stack pointer ---
has bonus effect of confusing debuggers */
loop:
/* loop with *variable amount* of unrolling */
xorl %esp, (%rsi)
xorl %esp, <number encoded using 1 byte>(%rsi)
xorl %esp, <number encoded using 1 byte>(%rsi)
xorl %esp, <number encoded using 1 byte>(%rsi)
addq $<number encoded using 1 byte>, %rsi
subl $<number encoded using 1 byte>, %esp
jnz loop
Write a pattern that will detect most variants of this code including those in the example files we give.
(It is okay if your pattern does not check whether the jnz
instruction actually jumps back to the top
of the loop instead of to another location.)
Sample files
Our archive of sample files has three subdirectories:
n
contains files which should have no positives;t
contains files which contain a “tricky jump”; ande
contains files which contain the “encrypted virus” pattern.
In the t
and e
directories, the trivial
subdirectories contain binary files
who start with the offending pattern (with some extra bytes afterwards in the case of the “encrypted virus” pattern
and nothing else in the case of the “tricky jump” pattern),
while other files have the pattern inserted into a normal executable or library.
Note that the “virus” code inserted is not actually functional and may have been inserted in a way
which prevents the executable from functioning normally.
If you downloaded lex-samples.tar.gz
before 26 Feb around 3:30pm, then the 1.exe
that was included in
the n
directory had a false positive for “tricky jump” due to an error on my part. I since replaced
that with an different executable (and changes the corresponding modified verisons in the t
and e
directories).
Resources
-
The flex manual.
-
Example flex files:
- count_foos.l: counts bytes, newlines, foos in its input
- a_words.l: output words starting with
a
- states.l: example using non-exclusive lex states
- states2.l: example using exclusive lex states
You can build any of these flex files into an executable using:
flex -o file.c file.l gcc -Wall -g -O -o file.exe file.c -lfl
They all read input from stdin and produce output to stdout. If you type input to them manually, you can send end-of-file by pressing control-D. You can also try them with input from a file using
./file.exe <input.txt
.
Hints
Machine code format
-
The x86 machine code for addresses computed using a register that is not
%rip
and a displacement (like42(%rsp)
(AT&T syntax) or[RSP + 42]
(Intel syntax)) uses a variable number of bytes for the displacement depending on its size:- if the displacement fits in a 1 byte signed number, it uses 1 byte
- if it fits in a 2-byte signed number, it uses 2 bytes
- if it fits in a 4-byte signed number, it uses 4 bytes
- if it’s larger, then typically the instruction is not legal
-
As a special case RIP-relative addressing (e.g.
42(%rip)
(AT&T syntax) or[RIP + 42]
(Intel syntax) always use a 4-byte displacement value.
Flex usage
-
In the code specified to run when a pattern is matched in
flex
, the variableyytext
is achar
array that points to the matched bytes andyyleng
is the length of the matched bytes. -
For debugging, you can output the matched bytes seperated by .s with code like:
for (int i = 0; i < yyleng; ++i) { printf("%02x.", (unsigned char) yytext[i]); }
(The cast to
unsigned char
makes sure that the numbers are all positive before being printed out.)
Credits
This assignment is based on an assignment from Jack Davidson’s version of this course.