Background
I reported a zip bomb vulnerability to the CPython community in 2019. Here are all the interesting resources and ideas.
- Decompression pitfall I wrote for official documentation
- PyCon Korea 2019 - Click Click Boom! Bombs Over Our Minds
zipfile analysis
According to research Cara Marie presented at Black Hat, there are defenses against zip bombs: limit the size of the block read at a time, and if data still remains to be decompressed after reading a block, treat the file as a possible zip bomb.
Below is Cara Marie's code, which begins:
```python
import zlib
```
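A minimal sketch of the rest of her strategy, assuming a zlib-wrapped input stream (the helper name is illustrative; the 102400-byte cap is the one she uses):

```python
import zlib

maxsize = 102400  # cap on decompressed output per call

def decompress_with_limit(compressed):
    # Illustrative helper; a zip member would need
    # zlib.decompressobj(-15) for raw DEFLATE instead.
    d = zlib.decompressobj()
    data = d.decompress(compressed, maxsize)
    if d.unconsumed_tail:
        # The cap was hit and compressed input is still left over:
        # this may be a zip bomb.
        raise ValueError("possible zip bomb")
    return data
```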
As you can see, the strategy for defeating a zip bomb is to limit the block size, in this case to a maximum of 102400 bytes. Next, let's take a look at the Python standard library module, zipfile.
Following Cara Marie's approach, we tried to figure out the difference between zipfile and zlib, and why zipfile cannot be used directly to prevent zip bombs, so we started studying the zipfile source code.
zipfile
Since I focus on the zip format, I picked the most commonly used algorithm, DEFLATE. Inside zipfile, we can locate where decompression is set up, starting at line 702: it obtains the zlib object and returns it.
```python
def _get_decompressor(compress_type):
```
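Its body dispatches on the compression type (excerpt based on the CPython 3.7-era source; minor details may differ between versions):

```python
def _get_decompressor(compress_type):
    if compress_type == ZIP_STORED:
        return None
    elif compress_type == ZIP_DEFLATED:
        return zlib.decompressobj(-15)  # negative wbits: raw DEFLATE, no header
    elif compress_type == ZIP_BZIP2:
        return bz2.BZ2Decompressor()
    elif compress_type == ZIP_LZMA:
        return LZMADecompressor()
    else:
        raise NotImplementedError("compression type %d" % (compress_type,))
```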
From the above code, we can see that zipfile is built on top of zlib, so we have to dive into what zlib does.
zlib
According to the zlib documentation, there are two ways to compress and decompress:
- .compress() and .decompress() load the entire payload into memory at once
- .compressobj() and .decompressobj() split the payload and compress/decompress it one block at a time
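For example, a minimal sketch of block-at-a-time decompression with the object-based API (assuming file-like src and dst and a zlib-wrapped stream):

```python
import zlib

CHUNK = 16384  # process 16 KiB of compressed input at a time

def decompress_stream(src, dst):
    """Decompress a zlib-wrapped stream block by block with decompressobj."""
    d = zlib.decompressobj()
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        dst.write(d.decompress(chunk))
    dst.write(d.flush())  # emit anything left in the internal buffer
```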
However, the official documentation does not clearly explain how to use this API to decompress files; the point of the object-based method is to take the file's data stream and decompress it through the low-level interface. We went back to the zipfile module, found that it already performs zlib decompression this way, and planned to patch zipfile first.
Since zipfile relies on decompressobj, our first idea was to accumulate chunks: if we can find where the chunks are decompressed, we can accumulate the output against a threshold and, once it is exceeded, treat the file as a possible zip bomb.
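A minimal sketch of that accumulate-and-threshold idea (the threshold value and function name are illustrative, not the actual patch):

```python
import zlib

THRESHOLD = 100 * 1024 * 1024  # illustrative cap on total decompressed bytes

def inflate_with_threshold(fp, chunk_size=4096):
    """Accumulate decompressed chunks and bail out past the threshold."""
    d = zlib.decompressobj(-15)  # raw DEFLATE, as zipfile uses for ZIP_DEFLATED
    total = 0
    parts = []
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        data = d.decompress(chunk)
        total += len(data)
        if total > THRESHOLD:
            raise OverflowError("possible zip bomb: output exceeds threshold")
        parts.append(data)
    parts.append(d.flush())
    return b''.join(parts)
```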
Back to zipfile
- Starting with the decompressor object:
```python
return zlib.decompressobj(-15)
```
This is where the zlib.decompressobj(-15) object is created; the resulting decompressor belongs to the ZipExtFile class:
```python
def __init__(self, fileobj, mode, zipinfo, decrypter=None, close_fileobj=False):
```
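The opening of this constructor (based on the CPython 3.7-era source; details may differ between versions) stores the member's metadata and creates the decompressor:

```python
def __init__(self, fileobj, mode, zipinfo, decrypter=None,
             close_fileobj=False):
    self._fileobj = fileobj
    self._decrypter = decrypter
    self._close_fileobj = close_fileobj

    self._compress_type = zipinfo.compress_type
    self._compress_left = zipinfo.compress_size
    self._left = zipinfo.file_size

    self._decompressor = _get_decompressor(self._compress_type)
```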
Let's find out what fileobj is:
```python
return ZipExtFile(zef_file, mode, zinfo, zd, True)
```
The ZipExtFile is constructed with zef_file as its fileobj, so we follow zef_file next.
zef_file is a _SharedFile, whose initializer is:
```python
def __init__(self, file, pos, close, lock, writing):
```
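Back in ZipFile.open, the _SharedFile is created with the member's header offset as its starting position (excerpt based on the CPython 3.7-era source; details may vary by version):

```python
# Inside ZipFile.open: start reading at this member's local file header
zef_file = _SharedFile(self.fp, zinfo.header_offset,
                       self._fpclose, self._lock, lambda: self._writing)
```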
Here we learn that zlib cannot start decompressing the stream directly: the reader first has to skip the zip headers that precede the compressed data, which is why _SharedFile tracks a starting position inside the archive.
The class _Tellable initializes a position indicator on top of the underlying file object:
```python
def __init__(self, fp):
```
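Its body (again based on the CPython source of that era) wraps a non-seekable file object and tracks the current offset:

```python
class _Tellable:
    def __init__(self, fp):
        self.fp = fp
        self.offset = 0

    def write(self, data):
        n = self.fp.write(data)
        self.offset += n
        return n

    def tell(self):
        return self.offset
```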
and then we reach ZipExtFile's read path.
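In the CPython 3.7-era source this is ZipExtFile._read2, which pulls compressed bytes from the archive in bounded blocks (excerpt; details may vary by version):

```python
def _read2(self, n):
    if self._compress_left <= 0:
        return b''

    n = max(n, self.MIN_READ_SIZE)   # read at least MIN_READ_SIZE bytes
    n = min(n, self._compress_left)  # but never past the member's end

    data = self._fileobj.read(n)
    self._compress_left -= len(data)
    if not data:
        raise EOFError

    if self._decrypter is not None:
        data = self._decrypter(data)
    return data
```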
We observed that in the read path used once the ZIP_DEFLATED compression algorithm is chosen, the read size n is computed with a max() call.
Key Point
```python
max(n, self.MIN_READ_SIZE)
```
When you use zlib.decompressobj block by block, how big is each block? self.MIN_READ_SIZE is preset to 4096 bytes, the size of a memory page in the operating system.
Cara Marie’s solution
```python
import zlib
```
Her code sets maxsize to 102400 bytes.
According to the official documentation:
Decompress.decompress(data, max_length=0)
Decompress data, returning a bytes object containing the uncompressed data corresponding to at least part of the data in the string. This data should be concatenated to the output produced by any preceding calls to the decompress() method. Some of the input data may be preserved in internal buffers for later processing.

If the optional parameter max_length is non-zero then the return value will be no longer than max_length. This may mean that not all of the compressed input can be processed, and unconsumed data will be stored in the attribute unconsumed_tail. This byte string must be passed to a subsequent call to decompress() if decompression is to continue. If max_length is zero then the whole input is decompressed, and unconsumed_tail is empty.
Changed in version 3.6: max_length can be used as a keyword argument.
max_length caps how much decompressed data is returned into memory at a time, and unconsumed_tail marks whether compressed data remains that still needs to be decompressed.
Therefore, her idea is: decompress at most 102400 bytes at a time, and if there is still unconsumed data left over, the file may be a zip bomb.
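A small demonstration of that check (a sketch; 10 MB of zeros stands in for a highly compressible payload):

```python
import zlib

payload = zlib.compress(b'\x00' * (10 * 1024 * 1024))  # tiny input, huge output

d = zlib.decompressobj()
out = d.decompress(payload, 102400)   # cap the returned output at 102400 bytes
print(len(out))                       # 102400
print(len(d.unconsumed_tail) > 0)     # True -> possible zip bomb
```

Because the output cap is hit long before the input is consumed, unconsumed_tail is non-empty, which is exactly the signal the defense keys on.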