PDF is Broken: a justCTF Challenge

Evan Sultanik
PRINCIPAL SECURITY ENGINEER
Feb 2, 2021

Trail of Bits sponsored the recent justCTF competition, and our engineers helped craft several of the challenges, including D0cker, Go-fs, Pinata, Oracles, and 25519. In this post we’re going to cover another of our challenges, titled PDF is broken, and so is this file. It demonstrates some of the PDF file format’s idiosyncrasies in a bit of an unusual steganographic puzzle. CTF challenges that amount to finding a steganographic needle in a haystack are rarely enlightening, let alone enjoyable. LiveOverflow recently published an excellent video on file format tricks that concludes with a similar sentiment. Therefore, we designed this challenge to teach justCTF participants some PDF tricks and to show how Trail of Bits’ open source tools can make easy work of these forensic challenges.

In which a PDF is a webserver, serving copies of itself

The PDF file in the challenge is in fact broken, but most PDF viewers will just render it as a blank page without complaint. The file command reports the challenge as just being “data.” Opening the file in a hex editor, we see that it looks like a Ruby script:

require 'json'
require 'cgi'
require 'socket'
=begin
%PDF-1.5
%ÐÔÅØ
% `file` sometimes lies
% and `readelf -p .note` might be useful later

The PDF header on line 5 is embedded within a Ruby multi-line comment that begins on line 4, but that’s not the part that’s broken! Almost all PDF viewers will ignore everything before the %PDF-1.5 header. Lines 7 and 8 are PDF comments affirming what we saw from the file command, as well as a readelf hint that we’ll get to later.

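The remainder of the Ruby script is embedded within a PDF stream object (the one that starts on the “9999 0 obj” line), whose stream can contain arbitrary data that PDF ignores. The =end on the first line of the stream closes the Ruby comment, so Ruby resumes executing from there. But what of the remainder of the PDF? How does that not affect the Ruby script?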

9999 0 obj
<< /Length 1680 >>^Fstream
=end
port = 8080
if ARGV.length > 0 then
  port = ARGV[0].to_i
end
html=DATA.read().encode('UTF-8', 'binary', :invalid => :replace, :undef => :replace).split(/<\/html>/)[0]+"\n"
v=TCPServer.new('',port)
print "Server running at http://localhost:#{port}/\nTo listen on a different port, re-run with the desired port as a command-line argument.\n\n"
⋮
__END__

Ruby has a feature where the lexer will halt on the __END__ keyword and effectively ignore everything thereafter. Sure enough, this curious PDF has such a marker, followed by the end of the encapsulating PDF stream object and the remainder of the PDF.

This is a Ruby/PDF polyglot, and you can turn any PDF into such a polyglot using a similar method. If your script is short enough, you don’t even need to embed it in a PDF stream object. You can just prepend all of it before the %PDF-1.5 header, although some PDF parsers will complain if the header is not found within the first 1024 bytes of the file.
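The simpler "prepend the script" variant can be sketched in a few lines of Ruby. This is a minimal illustration, not the generator used for the challenge: `make_polyglot` and `header_offset` are hypothetical helper names, and the PDF bytes are a stand-in rather than a complete document.

```ruby
# Hypothetical helpers (not from the challenge): prepend a short Ruby script
# to existing PDF bytes and check where the PDF header ends up.
PDF_MAGIC = "%PDF-".b
HEADER_WINDOW = 1024 # some parsers only search this many leading bytes

def make_polyglot(ruby_source, pdf_bytes)
  # Ruby stops lexing at a line consisting solely of the end marker,
  # so everything after it (the PDF) is never parsed as Ruby.
  ruby_source.b.chomp + "\n__END__\n".b + pdf_bytes.b
end

def header_offset(bytes)
  # Byte offset of the first PDF header, or nil if there is none.
  bytes.b.index(PDF_MAGIC)
end

script = %q(puts "hello from a PDF")
pdf = "%PDF-1.5\n%%EOF\n" # stand-in bytes, not a complete PDF
polyglot = make_polyglot(script, pdf)
offset = header_offset(polyglot)
puts "PDF header at byte #{offset}; inside the first #{HEADER_WINDOW} bytes: #{offset < HEADER_WINDOW}"
```

Writing `polyglot` to disk with `File.binwrite` and passing the file to `ruby` should run the script, while a tolerant PDF parser should skip ahead to the header.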

You didn’t think it would be that easy, did you?

So let’s be brave and try running the PDF as if it were a Ruby script. Sure enough, it runs a webserver that serves a webpage with a download link for “flag.zip.” Wow, that was easy, right? Inspect the Ruby script further and you’ll see that the download is the PDF file itself renamed as a .zip. Yes, in addition to being a Ruby script, this PDF is also a valid ZIP file. PoC||GTFO has used this trick for years, and the embedded archive can also be observed by running binwalk -e on the challenge PDF.
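One way to see why a file can be a PDF and a ZIP at once: ZIP readers generally locate an archive through the end-of-central-directory record near the end of the file, so leading data is tolerated. The sketch below only looks for that record and unpacks its fixed fields. `zip_trailer` is a hypothetical helper, and the sample bytes are synthetic rather than taken from the challenge.

```ruby
# Hypothetical helper: locate a ZIP end-of-central-directory (EOCD) record
# anywhere in a blob, such as a PDF that doubles as a ZIP.
EOCD_SIGNATURE = "PK\x05\x06".b
EOCD_MIN_SIZE = 22 # 4-byte signature plus 18 bytes of fixed fields

def zip_trailer(bytes)
  data = bytes.b
  pos = data.rindex(EOCD_SIGNATURE)
  return nil if pos.nil? || pos + EOCD_MIN_SIZE > data.bytesize

  # Fixed fields after the signature, all little-endian:
  # disk numbers (2+2), entry counts (2+2), directory size (4),
  # directory offset (4), comment length (2).
  _disk, _cd_disk, _entries_here, entries, cd_size, cd_offset, comment_len =
    data.byteslice(pos + 4, 18).unpack("v4V2v")

  { eocd_offset: pos, entries: entries, central_directory_size: cd_size,
    central_directory_offset: cd_offset, comment_length: comment_len }
end

# Synthetic sample: PDF-looking bytes followed by a bare EOCD record.
prefix = "%PDF-1.5\n%%EOF\n".b
sample = prefix + [EOCD_SIGNATURE, 0, 0, 2, 2, 100, 40, 0].pack("a4v4V2v")
trailer = zip_trailer(sample)
puts trailer.inspect
```

A real extractor would go on to walk the central directory; this sketch stops at confirming that a trailer exists and reporting what it claims.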

Unzipping the PDF produces two files: a MμPDF mutool binary and false_flag.md, the latter suggesting the player run the broken PDF through the mutool binary.

Clearly, this version of mutool was modified to render the broken PDF properly, despite whatever is “broken” about it. Is the CTF player supposed to reverse engineer the binary to figure out what was modified? If someone tried, or if they tried the readelf clue embedded as a PDF comment above, they might notice this:

The first thing you should do is: Open the PDF in a hex editor. You’ll probably need to “fix” the PDF so it can be parsed by a vanilla PDF reader. You could reverse this binary to figure out how to do that, but it’s probably easier to use it to render the PDF, follow the clues, and compare the raw PDF objects to those of a “regular” PDF. You might just be able to repair it with `bbe`!

The Binary Block Editor (bbe) is a sed-like utility for editing binary sequences. This implies that whatever is causing the PDF to render as a blank page can easily be fixed with a binary regex.

Deeper Down the Hole

When we use the modified version of mutool to render the PDF, it results in this ostensibly meaningless memetic montage:

The “broken” challenge PDF rendered using the modified version of mutool.

Searching Google for the LMGTFY string will take you to Didier Stevens’ excellent article describing the PDF stream format in detail, including how PDF objects are numbered and versioned. One important factor is that two PDF objects can have the same number but different versions (the PDF specification calls these generation numbers).
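To make the numbering concrete, here is a rough Ruby sketch that scans raw bytes for object headers of the form "number generation obj" and reports object numbers that appear with more than one generation. `generations_by_object` is a hypothetical helper, and the sample input is made up. A naive scan like this can also match bytes inside stream data, so treat its output as a hint rather than a parse.

```ruby
# Hypothetical helper: collect the generation numbers seen for each PDF
# object number by scanning for "<number> <generation> obj" headers.
OBJ_HEADER = /(?<![0-9])(\d+)[ \t]+(\d+)[ \t]+obj\b/n

def generations_by_object(bytes)
  seen = Hash.new { |hash, key| hash[key] = [] }
  bytes.b.scan(OBJ_HEADER) { |number, generation| seen[number.to_i] << generation.to_i }
  seen
end

# Made-up sample with one object number used by two generations.
sample = "4919 0 obj\n<< >>\nendobj\n4919 1 obj\n<< >>\nendobj\n9999 0 obj\n<< >>\nendobj\n"
objects = generations_by_object(sample)
reused = objects.select { |_number, generations| generations.size > 1 }
puts "object numbers with several generations: #{reused.inspect}"
```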

The first hint on the page identifies PDF object 1337, so that is probably important. The figures in Stevens’ article alone, juxtaposed with a hexdump of the broken PDF’s stream objects, provide a clear depiction of what was changed.

Didier Stevens’ annotated diagram of a PDF stream object.
A PDF stream object in the challenge PDF.

As the hints suggest, the PDF specification only allows for six whitespace characters: \0, \t, \n, \f, \r, and space. The version of mutool in the ZIP was modified to also allow ACK (0x06) to be used as a seventh whitespace character! Sure enough, on the twelfth line of the file we see:

>>^Fstream

That “^F” is an ACK character, where the PDF specification says there should be whitespace! All of the PDF stream objects are similarly broken. This can be fixed with:

bbe -e "s/\x06stream\n/\nstream\n/" -o challenge_fixed.pdf challenge.pdf
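For readers without bbe installed, the same substitution can be sketched in Ruby. `repair_stream_keywords` is a hypothetical helper name, and the sample bytes are synthetic. Because one byte is swapped for one byte, the file length and any byte offsets stored in the PDF stay the same.

```ruby
# Hypothetical helper mirroring the bbe one-liner: swap the ACK byte that
# precedes each "stream" keyword for a newline.
BROKEN_KEYWORD = "\x06stream\n".b
FIXED_KEYWORD = "\nstream\n".b

def repair_stream_keywords(bytes)
  # Work on a binary copy so that arbitrary stream bytes are left untouched.
  bytes.b.gsub(BROKEN_KEYWORD, FIXED_KEYWORD)
end

# Synthetic sample resembling the broken stream dictionaries in the post.
broken = "9999 0 obj\n<< /Length 3 >>\x06stream\nabc\nendstream\nendobj\n".b
fixed = repair_stream_keywords(broken)
puts "changed #{broken.bytes.zip(fixed.bytes).count { |a, b| a != b }} byte(s)"

# To apply this to files on disk (paths are the ones used in the post):
#   File.binwrite("challenge_fixed.pdf",
#                 repair_stream_keywords(File.binread("challenge.pdf")))
```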

Solving the Puzzle

Is fixing the file strictly necessary to solve the challenge? No, the flag may be found by locating PDF object 0x1337 (4919 in decimal) in a hex editor

4919 0 obj
<< /Length 100 /Filter /FlateDecode >>^Fstream
x<9c>^MËA^N@0^PFá}OñëÆÊÊ        <88>X;^Ba<9a>N<8c>N£#áöº~ßs<99>s^ONÅ6^Qd<95>/°<90>^[¤(öHû       }^L^V   k×E»d<85>fcM<8d>^[køôië<97><88>^N<98> ^G~}Õ\°L3^BßÅ^Z÷^CÛ<85>!Û
endstream
endobj
4919 1 obj
<< /Length 89827 /Filter [/FlateDecode /ASCIIHexDecode /DCTDecode] >>^Fstream
…
endstream
endobj

and manually decoding the stream contents. Binwalk will even automatically decode the first stream because it understands Flate compression. That stream contains:

pip3 install polyfile
Also check out the `--html` option!
But you’ll need to “fix” this PDF first!

Binwalk doesn’t automatically expand the second stream because it’s also encoded with the ASCIIHex and DCT PDF filters. A casual observer who had not followed all of the clues and wasn’t yet familiar with the PDF specification might not even realize that the second version of the PDF stream object 0x1337 existed! And that’s the one with the flag. Sure, it’s possible to comb through the dozens of files extracted by binwalk and decode the flag manually, or even to decode it directly from the stream content in a hex editor with a quick implementation of PDF’s decoders. But why do that when PolyFile can do it for you?
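For the curious, the "quick implementation of PDF’s decoders" might look roughly like the Ruby sketch below. It applies the filters in the order they are listed and stops at any filter it does not implement, so for a Flate, ASCIIHex, DCT chain what remains should be DCT (JPEG) data. The helper names and the round-trip sample are hypothetical, and extracting the raw stream bytes from the PDF is left out.

```ruby
require "zlib"

# Hypothetical helpers sketching two of the PDF stream filters.
def flate_decode(bytes)
  Zlib::Inflate.inflate(bytes)
end

def ascii_hex_decode(bytes)
  hex = bytes.b[/\A[^>]*/n]                 # '>' marks the end of the data
  hex = hex.gsub(/[\x00\t\n\f\r ]/n, "")    # PDF whitespace is ignored
  hex += "0" if hex.length.odd?             # a lone final digit is padded with 0
  [hex].pack("H*")
end

DECODERS = {
  "FlateDecode" => method(:flate_decode),
  "ASCIIHexDecode" => method(:ascii_hex_decode)
}.freeze

def apply_filters(bytes, filters)
  filters.reduce(bytes) do |data, name|
    decoder = DECODERS[name]
    break data if decoder.nil? # e.g. DCTDecode: leave the JPEG bytes as they are
    decoder.call(data)
  end
end

# Round trip on made-up bytes that merely start like a JPEG.
jpeg_like = "\xFF\xD8\xFF\xE0demo".b
encoded = Zlib::Deflate.deflate(jpeg_like.unpack1("H*") + ">")
decoded = apply_filters(encoded, %w[FlateDecode ASCIIHexDecode DCTDecode])
puts "round trip ok: #{decoded == jpeg_like}"
```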

polyfile challenge_fixed.pdf --html challenge_fixed.html
PolyFile’s HTML output for the challenge PDF.

Oh, hey, that’s a hierarchical representation of the PDF objects, with an interactive hex viewer! How about we go to object 0x1337’s stream?

We immediately see PDF object 0x1337.
PolyFile can automatically decode the objects.

And finally, let’s look at the second version of object 0x1337, containing the multi-encoded flag:

PolyFile automatically decodes the layered PDF filters to produce the flag.
The flag!

Conclusions

PDF is a very … flexible file format. Just because a PDF looks broken, it doesn’t mean it is. And just because a PDF is broken, it doesn’t mean PDF viewers will tell you it is. PDF is at its core a container format that lets you encode arbitrary binary blobs that don’t even have to contribute to the document’s rendering. And those blobs can be stacked with an arbitrary number of encodings, some of which are bespoke features of PDF. If this is interesting to you, check out our talk on The Treachery of Files, as well as our tools for taming them, such as PolyFile and PolyTracker.