Python marshal module fuzzing
The marshal module provides a serialization mechanism for Python values. In other words, the module contains functions for writing/reading Python objects in a binary format. Unfortunately the format is undocumented, and Python maintainers may change the format in backward incompatible ways between Python version. The marshal module is used internally by other Python components, for example, for reading and writing .pyc files which contain pseudo-compiled Python code. But Python also has public API to access this serialization mechanism.
This post shows how the marshal module can be quickly tested with a simple dumb fuzzer, and why the module shouldn’t be used with untrusted data.
The marshal module is implemented in C, so the simplest goal of fuzzing here is just to look for typical issues in C code like buffer overflows, use-after-free, null-pointer dereferences, etc. AddressSanitizer (ASan) is a great memory checker which can help with identifying such issues. AddressSanitizer instruments code while compilation. The tool replaces malloc and free functions, and adds check for memory corruption issues. Then, at runtime it tries to detect memory corruptions, and report them immediately with lots of useful information. AddressSanitizer is part of GCC 4.8+ which can be used to build Python.
Building Python with AddressSanitizer
Python code (CPython) can be cloned with the following command:
hg clone https://hg.python.org/cpython
If you run ./configure --help
, you can see that it has --with-address-sanitizer
option which is supposed to enable AddressSanitizer. But for some reason it didn’t work for me, so I just used the following commands to build Python:
CFLAGS="-g -fsanitize=address -fno-omit-frame-pointer -O0" \
CPPFLAGS="-fsanitize=address -fno-omit-frame-pointer -O0" \
LDFLAGS="-fsanitize=address" \
./configure \
--prefix=/home/artem/projects/fuzzing/python/build/ \
--disable-ipv6
ASAN_OPTIONS="detect_leaks=0" make
ASAN_OPTIONS="detect_leaks=0" make install
Let me quickly explain what those options mean:
- CFLAGS, LDFLAGS, CPPFLAGS are standard enviroment variable which specify options for C/C++ compiler and linker.
-fsanitize=address
enables AddressSanitizer (it has to be passed to both compiler and linker)-g
makes GCC produce debugging information.-O0
turns off compiler optimizations (but slows down execution).-fno-omit-frame-pointer
is for nicer stack traces.- ASAN_OPTIONS is an environment variable which contains parameters for AddressSanitizer at runtime.
ASAN_OPTIONS="detect_leaks=0"
turns off memory leaks checker which is part of AddressSanitizer.--prefix
specifies a directory where it should put output binaries, libs, etc.--disable-ipv6
disables IPv6 (nothing surprising).
If the build runs smoothly, you can run python3.6 --version
as a smoke test.
Fuzzing Python marshal module
There are a lot of fuzzers. You can choose a simple dumb fuzzer like zzuf, or use something more intelligent like American Fuzzy Lop (ALF). Or, you can always invent a bicycle - here is a simple dumb fuzzer for the marshal module written in Python:
https://github.com/artem-smotrakov/python-marshal-fuzzer
In general, this fuzzer is very similar to zzuf. Here is a couple of words about how it works:
- DumbByteArrayFuzzer class is a simple dumb fuzzer for a byte array. It takes a byte array, and randomly modifies it depending on initial settings.
data
is an original byte array to fuzz.seed
parameter specifies a seed for pseudo-random generator.min_ratio
andmax_ratio
parameters specify min and max fraction of the byte array to be fuzzed.DumbByteArrayFuzzer
generates reproducible data (test case), andstart_test
parameter specifies a test case to start from.ignored_bytes
specifies symbols that should be ignored while fuzzing.- First,
fuzzer.py
parses command line options. - Next, it defines
value
object which is then serialized bymarshal.dumps()
method. - In the end of
fuzzer.py
, it creates an instance ofDumbByteArrayFuzzer
, and starts the main fuzzing loop - In the loop, it calls
next()
method to generate fuzzed data which is passed tomarshal.loads()
- The spec says that the following exception are expected:
EOFError
,ValueError
,TypeError
. The fuzzer just ignores them.
The fuzzer can be run with default parameters with the command like the following (no checks for memory leaks):
ASAN_OPTIONS="detect_leaks=0" \
/home/artem/projects/fuzzing/python/build/bin/python fuzzer.py
Segmentation fault in the marshal module
After some time, AddressSanitizer reported the following problem:
ASAN:SIGSEGV
=================================================================
==20296==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000008 (pc 0x000000582064 bp 0x7ffc9e581310 sp 0x7ffc9e5812f0 T0)
#0 0x582063 in PyObject_Hash Objects/object.c:769
#1 0x5a3662 in tuplehash Objects/tupleobject.c:358
#2 0x5820ae in PyObject_Hash Objects/object.c:771
#3 0x5a3662 in tuplehash Objects/tupleobject.c:358
#4 0x5820ae in PyObject_Hash Objects/object.c:771
#5 0x58fac8 in set_add_key Objects/setobject.c:422
#6 0x59a85c in PySet_Add Objects/setobject.c:2323
#7 0x760d9d in r_object Python/marshal.c:1310
#8 0x76029d in r_object Python/marshal.c:1223
#9 0x760015 in r_object Python/marshal.c:1195
#10 0x7621dc in read_object Python/marshal.c:1465
#11 0x7639be in marshal_loads Python/marshal.c:1767
#12 0x577ff3 in PyCFunction_Call Objects/methodobject.c:109
#13 0x708a05 in call_function Python/ceval.c:4744
#14 0x6fb5a7 in PyEval_EvalFrameEx Python/ceval.c:3256
#15 0x70276f in _PyEval_EvalCodeWithName Python/ceval.c:4050
#16 0x70299f in PyEval_EvalCodeEx Python/ceval.c:4071
#17 0x6e07d7 in PyEval_EvalCode Python/ceval.c:778
#18 0x432354 in run_mod Python/pythonrun.c:980
#19 0x431e5b in PyRun_FileExFlags Python/pythonrun.c:933
#20 0x42e929 in PyRun_SimpleFileExFlags Python/pythonrun.c:396
#21 0x42caba in PyRun_AnyFileExFlags Python/pythonrun.c:80
#22 0x45f995 in run_file Modules/main.c:319
#23 0x4619c8 in Py_Main Modules/main.c:777
#24 0x41d258 in main Programs/python.c:69
#25 0x7f374629babf in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20abf)
#26 0x41ce28 in _start (/home/artem/projects/fuzzing/python/build/bin/python3.6+0x41ce28)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV Objects/object.c:769 PyObject_Hash
==20296==ABORTING
Here is the original data structure which was used for fuzzing:
value = ( # tuple1
"this is a string", #string1
[
1, # int1
2, # int2
3, # int3
4 # int4
],
( #tuple2
"more tuples", #string2
1.0, # float1
2.3, # float2
4.5 # float3
),
"this is yet another string"
)
The fuzzer modified it with the following:
- First, it update type of
int2
item toTYPE_SET
. - As a result,
int3
item became a length of the set. - Then, it updated
float3
item toTYPE_REF
which points totuple1
item.
In other words, now it is a a recursive tuple. What happened when marshal.loads() tried to deserialize this fuzzed data:
int2
item is now a set of length 3.- First, It adds
int4
item to the set. - Next, it adds
tuple2
item: - When an object is added to a set, it calculates a hash of this object
- When it calculates a hash of a tuple, it calculates hashes of all items from this tuple.
- During calculating a hash of
tuple2
, it calculates a hash oftuple1
becausefloat3
now is aTYPE_REF
item which points totuple1
. - But
tuple1
is not complete yet. The length oftuple1
is 4, but onlystring1
has been added to it so far. tuplehash()
function reads a length of a tuple, and then callsPyObject_Hash()
fucntion for each item of the tuple.- But it doesn’t check if a tuple is complete, and all elements have been added to the tuple.
- As a result, a null-pointer dereference happens in
tuplehash()
function when it reads second item oftuple1
.
See https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l347 for details:
static Py_hash_t
tuplehash(PyTupleObject *v)
{
Py_uhash_t x; /* Unsigned for defined overflow behavior. */
Py_hash_t y;
Py_ssize_t len = Py_SIZE(v);
PyObject **p;
Py_uhash_t mult = _PyHASH_MULTIPLIER;
x = 0x345678UL;
p = v->ob_item;
while (--len >= 0) {
y = PyObject_Hash(*p++); <=
For tuple1
, Py_SIZE(v)
returns 4, but tuple1
contains only one element string1
. A null-pointer dereference happens in PyObject_Hash()
while reading second element.
Even if it doesn’t seem to be a serious security issue, the problem was originally reported to Python Security Response Team. They said they don’t consider crashes due to malicious marshal data to be security bugs. And documentation for the marshal module has a note about it:
Warning: The marshal module is not intended to be secure against erroneous or maliciously constructed data. Never unmarshal data received from an untrusted or unauthenticated source.
Then, the problem was reported to Python maintainers, but they decided not to fix it probably because of performance.
Conclusion
As they mentioned in documentation for the marshal module, it should not be used for unmarshaling data received from an untrusted party because the module is not intended to be secure against malicious data. Furthermore, some issues (like above) are not going to be fixed even if they are known.
The interesting thing is that at the moment of posting this article I found 59601 usages of marshal.loads() function on GitHub. I hope they know what they are doing.