Friday, May 10, 2013

Steganography in Python

Book Deal:

As a pre-article bonus, you can enter code dotd0511au during checkout to get 50% off of my book Redis in Action. If you want to keep seeing articles like this, please buy my book and give me positive reviews on Amazon ;)

See me speak:


LA Web Speed is hosting me for a talk at 7PM at Yahoo's offices in Santa Monica. Hit this page for meetup information: http://www.meetup.com/LAWebSpeed/events/115663212/ . I hope to see some of you there!

Steganography in Python


In this post I am going to talk about some code that I wrote to offer steganographic functionality in Python source files. This code was written for a lightning talk given at the April 1 Southern California Python Interest Group meetup (aka So-cal Piggies), hosted over at Mucker Labs in Santa Monica.

What is "steganography"?


Basically, steganography is the hiding of information within other information. There have been countless examples of steganography used over the centuries, and I'll leave it to you to read the Wikipedia article on steganography to learn about all of them.

Where did the idea come from?


At the February 28 So-cal Piggies meetup, a few of us were chatting about programming languages. I mentioned Whitespace, a programming language that only uses whitespace for expressing syntax. When it was announced that April 1 would be a meetup of lightning talks that would discuss things to do/not to do in Python, I thought about revisiting last year's revisit of Gotos in Python to make my case statement loopable and not reliant on the internal debugging mechanisms. Then I thought a bit more, and an idea sparked from our earlier discussion of Whitespace... Python's tabs and spaces.

How can you hide information in Python's tabs and spaces?


As most of you know, Python's syntax requires the use of tabs and/or spaces in order to define multi-line block structures. You can use all tabs, all spaces, or you can mix tabs and spaces, where a tab counts as 8 spaces (though mixing tabs and spaces is considered bad form). The idea was to use mixed tabs and spaces for indentation in order to hide information.

Basically, for each line with leading whitespace that is not inside a triple-quoted string, consider the indentation. Break the indentation into blocks 8 columns wide (discarding any partial block). If a block is a tab, it is considered a 1 bit. If the block is 8 spaces, it is considered a 0 bit.
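To make the block rule concrete, here is a minimal sketch of reading the bits out of a single line's indentation. It is an illustration, not the code from the talk: the name indent_bits is made up, it treats any block that ends in a tab as a 1 bit (an assumption for blocks like "3 spaces + tab"), and it leaves the triple-quoted string check to the caller.

```python
def indent_bits(line, block=8):
    """Return the bits encoded in one line's leading tabs and spaces.

    Hypothetical helper: the caller is expected to skip lines that sit
    inside triple-quoted strings.
    """
    # Isolate the leading run of tabs and spaces.
    stripped = line.lstrip(" \t")
    indent = line[:len(line) - len(stripped)]

    bits = []
    spaces = 0
    for ch in indent:
        if ch == "\t":
            # Assumption: a tab closes the current block and counts as a 1,
            # even if a few spaces came before it in the same block.
            bits.append(1)
            spaces = 0
        else:
            spaces += 1
            if spaces == block:
                # A full block of spaces is a 0.
                bits.append(0)
                spaces = 0
    # Any leftover partial block of spaces is discarded.
    return bits


if __name__ == "__main__":
    print(indent_bits("\t        \tx = 1"))
```

Walking every indented line of a file with a helper like this and concatenating the results gives the raw bit stream for the whole file.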

With this semantic, we can define a protocol for embedding information inside a Python source file fairly easily. The first 8 bits of the file will define how many 4-bit hexadecimal digits will be available in the file, with 8 '0' bits meaning 1 digit. So a file consisting of all spaces will return a single '0' as its data. This allows for up to 10-28 bits to be stored per file (on average from the stdlib).
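The framing described above can be sketched roughly as follows. This is an illustration under a few assumptions that the text does not spell out: bits are read most-significant first, a header value of n means n + 1 digits (which is what "8 '0' bits meaning 1 digit" suggests), and a truncated bit stream is an error. The function names are hypothetical.

```python
def encode_payload(hex_digits):
    """Turn a hex string into the bit list: 8-bit header, then 4 bits per digit."""
    if not 1 <= len(hex_digits) <= 256:
        raise ValueError("need between 1 and 256 hex digits")
    # Assumption: the header stores (digit count - 1), most-significant bit first.
    header = format(len(hex_digits) - 1, "08b")
    body = "".join(format(int(d, 16), "04b") for d in hex_digits)
    return [int(b) for b in header + body]


def decode_payload(bits):
    """Recover the hex string from a bit list produced by encode_payload."""
    if len(bits) < 8:
        raise ValueError("missing 8-bit header")
    count = int("".join(str(b) for b in bits[:8]), 2) + 1
    body = bits[8:8 + 4 * count]
    if len(body) < 4 * count:
        raise ValueError("bit stream is shorter than the header claims")
    digits = []
    for i in range(0, len(body), 4):
        nibble = "".join(str(b) for b in body[i:i + 4])
        digits.append(format(int(nibble, 2), "x"))
    return "".join(digits)


if __name__ == "__main__":
    print(decode_payload(encode_payload("beef")))
```

With this layout, an all-zero stream of at least 12 bits decodes to a single '0', which matches the all-spaces case described above.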

Expanding the idea


After building the above and calculating the number of bits that are available in a variety of source files, I discovered that using only indents wouldn't offer much in the way of data storage density. Knowing a bit about Unicode, and how there are characters that behave quite a bit like a space character, but aren't, I started looking at zero-width and standard-width non-printed characters to replace *all* space characters with other characters (which would offer many more bits). After finding about a dozen of them, and trying to inject them into Python source files, I discovered that Python is *very* strict about what it considers syntactically valid whitespace: tab, space, and form-feed.
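You can check this strictness yourself with a small experiment like the one below. It is only a sketch: it puts each candidate character between two tokens and asks compile() whether the result is valid, and the handful of characters listed are examples I picked, not the full dozen mentioned above. It is written for Python 3.

```python
def compiles(source):
    """Return True if Python accepts `source` as a module."""
    try:
        compile(source, "<whitespace-check>", "exec")
    except SyntaxError:
        return False
    return True


# A few sample characters; the Unicode ones look like (or vanish like) a space.
CANDIDATES = {
    "space": " ",
    "tab": "\t",
    "form feed": "\f",
    "no-break space": "\u00a0",
    "zero width space": "\u200b",
}

# Place each candidate between the '=' and the '1' of a tiny assignment.
results = {
    name: compiles("x ={}1\n".format(ws))
    for name, ws in CANDIDATES.items()
}

if __name__ == "__main__":
    for name, ok in results.items():
        print("{:<18} {}".format(name, "accepted" if ok else "rejected"))
```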

I briefly toyed with the idea of re-formatting Python code to be PEP 8 compliant (PEP 8 is the Python enhancement proposal that defines how properly formatted Python code should look), then using extra spaces in odd places (in violation of PEP 8, like putting a space between a function name and its opening parenthesis, as in foo (), for both function definitions and calls) in order to represent 1 bits (the lack of an oddly-placed space would be a 0). I also considered the use of LF vs. CRLF line endings to represent data, but CRLF pairs result in ^M being shown in 'diff' calls. I also considered adding a spare trailing space on lines, but most 'diff' calls show ugly red highlights for extra spaces.

One that I was very tempted to consider was one that inserted zero-width spaces conditionally in unicode strings (aka strings in Python 3.x). Because they don't print, you wouldn't notice anything strange about them, unless you output them on the web and inspected them, wrote them to a file, or tried to calculate a hash and saw the result wasn't what you expected. But, you could combine the string manipulation with an import hook with some ctypes hackery that examined all strings loaded from the module, and replaced the strings with ones that didn't have those zero-width spaces.
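As a rough illustration of the zero-width space idea (not something the module does), the sketch below hides one bit per visible character by optionally following it with U+200B. The placement rule and the function names are assumptions made for the example, and the import hook that would strip the markers at load time is not shown.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def hide_bits(text, bits):
    """Follow the i-th character of `text` with a ZWSP when bits[i] is 1."""
    if ZWSP in text:
        raise ValueError("text already contains zero-width spaces")
    if len(bits) > len(text):
        raise ValueError("not enough characters to carry the bits")
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i < len(bits) and bits[i]:
            out.append(ZWSP)
    return "".join(out)


def reveal_bits(text):
    """Return one bit per visible character: 1 if a ZWSP follows it."""
    bits = []
    for ch in text:
        if ch == ZWSP:
            if bits:
                bits[-1] = 1
        else:
            bits.append(0)
    return bits


def clean(text):
    """Strip the markers, giving back the string the program should see."""
    return text.replace(ZWSP, "")


if __name__ == "__main__":
    marked = hide_bits("hello", [1, 0, 1])
    print(len(marked), reveal_bits(marked), clean(marked))
```

The marked string prints the same as the original but has a different length and hash, which is exactly the kind of surprise described above.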

I may revisit these ideas in the future, but I found a simpler and easier way that neither mangles beautiful source code nor requires writing an import hook that involves ctypes: spaces in comments.

Because most code has at least a handful of comments, and because you can put just about anything between a # and the newline (assuming a proper BOM or coding: declaration), I started conditionally replacing space characters with a different space character. I considered optionally using zero-width spaces as well, or using one of a few different space characters (for more than one bit per space replaced), but simplicity is the name of the game. The more complicated it gets, the more likely your data will be discovered. For most source files, conditionally replacing spaces in comments resulted in a 3-5x increase in the number of bits that can be hidden in a Python source file, making my super-secret plan possible: store the SHA-1 or SHA-256 HMAC of a file *inside* the file itself for self-verifying code.
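Here is a minimal sketch of the comment trick, written for Python 3 and using the tokenize module to find the comments. It is not the code from my gist: the replacement character (a no-break space, U+00A0), the one-bit-per-space layout, and the function names are all assumptions chosen for the example.

```python
import io
import tokenize

NBSP = "\u00a0"  # assumed stand-in for the "different space character"


def comment_slots(source):
    """Return (line_index, column) for every space or NBSP inside a comment."""
    lines = source.split("\n")
    slots = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.COMMENT:
            continue
        row = tok.start[0] - 1
        # A comment always runs to the end of its line, so locate it from the right.
        start = len(lines[row].rstrip("\r")) - len(tok.string)
        for offset, ch in enumerate(tok.string):
            if ch in (" ", NBSP):
                slots.append((row, start + offset))
    return slots


def embed(source, bits):
    """Write `bits` into the comment spaces; unused slots are reset to 0."""
    slots = comment_slots(source)
    if len(bits) > len(slots):
        raise ValueError("not enough spaces in comments")
    lines = [list(line) for line in source.split("\n")]
    for i, (row, col) in enumerate(slots):
        bit = bits[i] if i < len(bits) else 0
        lines[row][col] = NBSP if bit else " "
    return "\n".join("".join(line) for line in lines)


def extract(source):
    """Read one bit per comment space: NBSP is 1, a plain space is 0."""
    lines = source.split("\n")
    return [1 if lines[row][col] == NBSP else 0
            for row, col in comment_slots(source)]


if __name__ == "__main__":
    src = 'x = 1  # set x to one\ns = "# not a comment"\n# a b\n'
    marked = embed(src, [1, 0, 1, 1])
    print(len(comment_slots(src)), extract(marked))
```

Calling embed(source, []) resets every slot to a plain space, which doubles as a "clean" operation. Because tokenize does the parsing, a # inside a string literal is not mistaken for a comment.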

The result


Ultimately, I ended up writing a module/command line tool that allows you to count the number of bits that can be hidden in a file, fetch stored bits from a file, clean the bits out of a file, write bits to a file, store the SHA-1 or SHA-256 HMAC in a file (you are prompted for a password), or verify that a file has its proper SHA-1 or SHA-256 HMAC embedded inside it (there's a password prompt here too). It uses the tokenize module to do all of the syntactical heavy lifting for reading and writing, and *may* only work in Python 2.6.
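The HMAC half of the self-verifying idea can be sketched with the standard hmac and hashlib modules. This is an assumption about how such a check could be wired up, not a description of the tool: it assumes the digest is computed over the cleaned source (all hidden bits removed) with the password as the key, and the function names are hypothetical.

```python
import hashlib
import hmac

DIGESTS = {"sha1": hashlib.sha1, "sha256": hashlib.sha256}


def source_hmac(cleaned_source, password, algorithm="sha256"):
    """Hex HMAC of the cleaned source text, keyed by the password."""
    return hmac.new(
        password.encode("utf-8"),
        cleaned_source.encode("utf-8"),
        DIGESTS[algorithm],
    ).hexdigest()


def verify_source(cleaned_source, stored_hex, password, algorithm="sha256"):
    """Compare the hex digits pulled out of the file with a fresh HMAC."""
    expected = source_hmac(cleaned_source, password, algorithm)
    # compare_digest avoids leaking how many leading digits matched.
    return hmac.compare_digest(expected, stored_hex)


if __name__ == "__main__":
    digest = source_hmac("x = 1\n", "correct horse")
    print(len(digest), verify_source("x = 1\n", digest, "correct horse"))
```

A SHA-1 HMAC is 40 hex digits (160 bits) and a SHA-256 HMAC is 64 hex digits (256 bits), so a file needs at least that many hidden bits, plus the 8-bit length header, to carry its own digest.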

This module uses both the comment space trick as well as the tabs/spaces indent trick. I'm not terribly proud of the quality of the code, as I only had 2 afternoons to work on it leading up to the talk. But it works, and could be reasonably expanded to support all of the tricks I mentioned above, and more.

If you'd like to take a look at the code, you can read and/or download it from this GitHub Gist.


One of my favorite ideas for storing data is actually to hide passwords inside your source files. More specifically, it is generally considered to be bad form to store passwords inside files stored in your source code repository, especially when that repository is hosted by a 3rd party. Well, you could store your password in some arbitrary source file in your repository (perhaps encrypted with the private key half of a public/private key pair), then when setting up your machine for the first time, download my gist, extract your password (decrypting with your public key, which could also be in your repository - it's public), then delete my gist. You then have your password available, and few people would think to look for it in a poorly-formatted Python source file.

This becomes really nefarious if you store the password in an initial commit as a sort of "committing what I've got, this is crap" (where formatting can sometimes be broken) along with a collection of other files with garbage data. Subsequent commits can clean up the bad formatting incrementally, and part of your "get password" mechanism can perform a checkout of the first commit of the repo before trying to extract the data (or a second commit, or looking for a specific comment inside a commit message, ...).


It's getting late, so I'm going to call this post done and head to bed. I hope that this has sparked some ideas for how you can hide and manipulate data.

Keep your eyes peeled: in the coming week or two I'll be posting about what is going on with rom, my Redis object mapper for Python. And don't forget to pick up your copy of Redis in Action, where you can learn how to use Redis in ways it was and was not intended.