Thursday, March 6, 2014

Simplifying Postgres permissions

If you are anything like me, you would prefer to deal with as few headaches as possible when developing software. One headache that revisits me occasionally is Postgres permissions. If you need a single user (that isn't the "postgres" user) with password authentication for both local and remote access, keep reading to find out how to keep your permissions and table ownership sane-ish. This post gathers together information from docs, blogs, and tech articles.

First, you need to install Postgres. I'll assume you've already done that.

Next, you need to create the user that will be performing all of your operations (schema updates, data manipulation, etc.).
$ sudo -u postgres psql
# create user DBUSER with password 'PASSWORD';
# create database DATABASE;
# grant all privileges on database DATABASE to DBUSER;
# \q
$ 
After creating your user and database, you need to update your Postgres configuration so that you can actually log in as that user.
$ sudo vim /etc/postgresql/9.1/main/pg_hba.conf

Scroll to the bottom of your configuration file. About 10 lines up (at least in Postgres 9.1), you should see a line that reads:

local   all             all                               peer

Change the last column to read "md5":

local   all             all                               md5

Then save and quit the editor.

After your configuration is updated, you need to restart Postgres:
$ sudo /etc/init.d/postgresql restart

From here, if you ever want to modify your schema, use data from a client, etc., you only need to log in with the user you created earlier, and everything will just work. You will be able to alter your schema as necessary, and select, update, insert, and delete data. This behaves a lot like a freshly-installed MySQL configuration after you've set a username/password for the root user (and only ever use the root user), except that the user *only* has access to manipulate the one database. This can be very useful when you've got one Postgres install for personal projects, but want to silo your different projects into different named databases.

If you've already got some tables created with the Postgres user, and you need to change ownership, the accepted answer over on StackOverflow can get you all fixed up.
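If you would rather script that fix, it boils down to issuing one ALTER ... OWNER TO statement per table. A hedged sketch (the table names and the helper function are my own illustrations; in practice you would pull the table list from pg_tables, and the naive string formatting here is only safe for trusted, known-good identifiers):

```python
# Generate ownership-transfer statements for a list of tables.
# Illustrative only: identifiers are interpolated directly, so use
# this solely with table/user names you control.
def ownership_statements(tables, new_owner):
    return ['ALTER TABLE %s OWNER TO %s;' % (t, new_owner) for t in tables]

for stmt in ownership_statements(['users', 'orders'], 'myapp'):
    print(stmt)
```

You would then feed each statement to psql (or your database driver) while connected as a superuser.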

Long-term, you may want to create additional read/write users (no DDL updates), read-only users, etc., but if you just want to get a new application going, this can get you there.

Please upvote on Reddit and Hacker News.

Friday, December 6, 2013

Properties on Python modules

Python properties let programmers abstract away get, set, and delete methods using simple attribute access without exposing getValue(), setValue(), and delValue() methods to the user. Normally, properties can only be added to classes as either an instance-level property, or a class-level property (few people use class-level properties; I've only used them once, and I had to build a ClassProperty object to have it). But in this post, you will find out how to create module-level properties, and where to get a package that offers transparent module-level property creation and use from Python 2.4 all the way up to early versions of 3.4.

Why do I want module-level properties in Python?


For many people, the desire to have module-level properties boils down to flexibility. In this case, I came up with this solution while talking with Giampaolo Rodola (author of pyftpdlib and psutil, current maintainer of asyncore and related Python libraries, ...) and hearing about psutil's use of module-level constants. Now, module-level constants aren't usually a big deal, but in this case, some of psutil's constants are actually relatively expensive to compute - expensive enough that Giampaolo was planning on deprecating the constants, deferring computation until the library user explicitly called the relevant compute_value() functions.

But with module properties, Giampaolo can defer calculation of those values until a program accesses the module's attributes, at which point the values can be computed (and cached as necessary). Even more useful is the fact that if you don't use those attributes, you don't pay to compute them, so most people will get a convenient (if unexpected) performance improvement any time they import psutil.

What doesn't work and why


The first time someone wants to use a module property, they will try to decorate a function in their module with the property() decorator, just as they are used to doing with methods on classes. Unfortunately, when trying to access that "property", they discover that none of the descriptor magic was applied, and they are left with a property object that doesn't do anything particularly useful.

The reason it doesn't do anything useful is that properties are attached to classes, not instances. During import/execution of a module, your definitions are executed in the context of the module's instance dictionary, with no substantive post-processing. A typical Python class definition, on the other hand, results in the body of the class being executed and the results passed to type() (via the type(name, bases, dict) form) for class creation.
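A minimal illustration of that difference (the names here are my own):

```python
import types

class C(object):
    @property
    def x(self):
        return 42

# Found on the type: the descriptor protocol fires and we get the value
assert C().x == 42

# Stored in an instance dict (which is all a module body gives you):
# no descriptor magic, we just get the property object back
mod = types.ModuleType('demo')
mod.prop = property(lambda m: 99)
assert isinstance(mod.prop, property)
```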

Making it work


Someone who knows a bit more about how Python's internals are put together knows that you can muck with the contents of sys.modules, and that doing so during module import will let you replace the module object itself. So that's what we are going to do. Along the way, we're going to be doing a bit of deep magic, so don't be scared if you see something that you don't quite understand.

There are 5 major steps to make module properties work:
  1. Define your property
  2. Create a new type to offer unique properties for the module
  3. Ensure that the replacement module has access to the module namespace
  4. Fix up the module namespace and handle property definitions
  5. Replace the module in sys.modules

Our first two steps are easy. We simply use the standard @property decorator (which we'll make work later) to create a property, and we define an empty class that subclasses object.

@property
def module_property(module):
    return "I work!", module

class Module(object):
    pass

Our third step is also easy, we just need to instantiate our replacement module and replace its __dict__ with the globals from the module we are replacing.

module = Module()
module.__dict__ = globals()

Our fourth step also isn't all that difficult: we just need to go through the module's globals and extract any properties that are defined. Generally speaking, we really want to pull out *any* descriptors, not just properties, but for this version, we'll extract only property instances.

for k, v in list(module.__dict__.items()):
    if isinstance(v, property):
        setattr(Module, k, v)
        del module.__dict__[k]

Note that when we move the properties from the module globals to the replacement module, we have to assign to the replacement module class, not the instance of the replacement module. Generally speaking, this kind of class-level function/descriptor assignment is frowned upon, but in some cases (like this), it is necessary in order to get the functionality that we want.

And our final step is actually pretty easy, but we have to remember to keep a reference to the original module, as standard module destruction includes the resetting of all values in the module to be equal to None.

module._module = sys.modules[module.__name__]
module._pmodule = module
sys.modules[module.__name__] = module

And that is it. If you copy and paste all of the above code into a module with all of your module properties defined before our fourth step executes, then after the module is imported you will be able to reference any of your defined properties as attributes of the module. Note that if you want to access the properties from within the module, you need to reference them from the _pmodule global we injected.

Where can I get a pre-packaged copy of this magic?


To save you (and me) from needing to copy/paste the above into every module we want module properties, I've gone ahead and built a Python package for module properties. You can find it on Github, or you can find it on the Python package index. How do you use it? Very similar to what I defined above:

@property
def module_property(module):
    return "I work!", module

# after all properties are defined (put this at the end of the file)
import mprop; mprop.init()

Alternatively, if you don't want to remember to throw an mprop.init() call at the end, I've got a property work-alike that handles all of the magic:

from mprop import mproperty

@mproperty
def module_property(module):
    return "I also work!", module

And that's it. Module properties in Python. Enjoy :)

Hacker news thread here. Reddit thread here.

Thursday, October 24, 2013

Multi-column (SQL-like) sorting in Redis

Recently, I received an email from a wayward Redis user asking about using Redis Sets and Sorted Sets to sort multiple columns of data, with as close to the same semantics as a traditional SQL-style "order by" clause. Well, it is possible, with limitations; keep reading to find out how.

What is Redis?


For those people who don't quite know what Redis is already, the TLDR version is: an in-memory data structure server that maps from string keys to one of 5 different data structures, providing high-speed remote access to shared data, and optional on-disk persistence. In a lot of ways, you can think of Redis like a version of Memcached where your data doesn't disappear if your machine restarts, and which supports a wider array of commands to store, retrieve, and manipulate data in different ways.

The setup


With that out of the way, our intrepid Redis user had come to me with a pretty reasonable problem to have; he needed to build an application to display a listing of businesses, sorted by several criteria. In his case, he had "price", "distance[1]", and "rating". Whether you are searching an individual retailer's site or restaurants on Yelp and similar applications, when searching for something in the physical world there are a few things you care about primarily, and these usually break down preferentially as lowest distance, lowest price, highest rating. In a relational database/SQL world, these fields would all be columns in a table (or spread out over several tables or calculated in real-time), so we are going to refer to them as "sort columns" from here on.

Now, depending on preferences, you can sometimes get column preference and ascending/descending changes, which is why we need to build a system that can support reordering columns *and* switching the order of each individual column. Say that we really want the highest rating, lowest distance, lowest price? We need to support that too, and we can.

The concept


Because we are dealing with sort orders, we have two options. We can either use the Redis SORT command, or we can use sorted sets. There are ways of building this using the SORT command, but it is much more complicated and requires quite a bit of precomputation, so we'll instead use sorted sets.

We will start by making sure that every business has an entry in each of 3 different sorted sets representing price, distance, and rating. If a business has an "id" of 5, has a price of 20, distance of 4, and a rating of 8, then at some point the commands "ZADD price 20 5", "ZADD distance 4 5", and "ZADD rating 8 5" will have been called.

Once all of our data is in Redis, we then need to determine the maximum value of each of our sort columns. If you have ranges that you know are fixed (say you know that all of your prices and distances will be 0 to 100, and your rating will always be 0 to 10), then you can save yourself a round-trip. We'll go ahead and build this assuming that you don't know your ranges in advance.

We are trying to gather our data range information in advance in order to carve up the 53 bits of precision [2] available in the floating-point doubles that are available in the sorted set scores. If we know our data ranges, then we know how many "bits" to dedicate to each column, and we know whether we can actually sort our data exactly, without losing precision.

If you remember our price, distance, and rating information, you can imagine that (borrowing our earlier data) if we have price=20, distance=4, rating=8, and we want to sort by price, distance, -rating, we want to construct a "score" that will sort the same as the "tuple" comparison (20, 4, -8). By gathering range information, we could (for example) translate that tuple into a score of "20042", which you can see is basically the concatenation of "20", "04", and "2" (that is, 10-8; we subtract from 10 here because the rating column is reversed, and it helps to understand how we got the values).

Note: because of our construction, scores that are not whole numbers may not produce completely correct sorts.
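To make the packing arithmetic concrete, here is a small sketch (the ranges and the multipliers of 100 and 11 are my own assumptions for illustration: prices and distances below 100, ratings 0 to 10):

```python
# Pack (price asc, distance asc, rating desc) into a single sortable score.
# Assumes price and distance are in [0, 100) and rating is in [0, 10].
def composite_score(price, distance, rating):
    return ((price * 100) + distance) * 11 + (10 - rating)

a = composite_score(20, 4, 8)    # the example business
b = composite_score(20, 4, 9)    # same price/distance, better rating
c = composite_score(19, 99, 0)   # cheaper price outranks everything else
assert c < b < a
```

The general-purpose code below does the same thing, but computes the per-column multipliers from the actual data ranges stored in Redis.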

The code


Stepping away from the abstract and into actual code, we are going to perform computationally what I just did above with string manipulation. We are going to numerically shift our data into columns, accounting for the magnitude of the data, as well as negative values in the columns (which won't affect our results). As a bonus, this method will even warn you if it believes that you could have a lower-quality sort because your data range is too wide[3].

import math
import warnings

def sort_zset_cols(conn, result_key, sort=('dist', 'price', '-score')):
    current_multiplier = 1
    keys = {result_key: 0}
    # reversed() returns a one-shot iterator; materialize it so that we
    # can iterate over the columns twice
    sort = list(reversed(sort))

    # Get the min/max scores in each sort column (strip the '-' that
    # marks a descending column to get the actual key name)
    pipe = conn.pipeline(True)
    for sort_col in sort:
        col = sort_col.lstrip('-')
        pipe.zrange(col, 0, 0, withscores=True)
        pipe.zrange(col, -1, -1, withscores=True)
    ranges = pipe.execute()

    for i, sort_col in enumerate(sort):
        # zrange(..., withscores=True) returns a list of (member, score)
        # pairs; pull out the low/high scores, auto-scaling for negative
        # values
        low = ranges[i*2][0][1]
        high = ranges[i*2+1][0][1]
        maxv = int(math.ceil(high - low if low < 0 else high))

        # Adjusts the weights based on the magnitude and sort order of the
        # column
        old_multiplier = current_multiplier
        desc = sort_col.startswith('-')
        sort_col = sort_col.lstrip('-')
        current_multiplier *= maxv

        # Assign the sort key a weight based on all of the lower-priority
        # sort columns
        keys[sort_col] = -old_multiplier if desc else old_multiplier

    if current_multiplier >= 2**53:
        warnings.warn("The total range of your values is outside the "
            "available score precision, and your sort may not be precise")

    # The sort results are available in the passed result_key
    return conn.zinterstore(result_key, keys)

If you prefer to check the code out at Github, here is the gist. Two notes about this code:
  • If the maximum or minimum values in any of the indexed columns become more extreme between the data range check and the actual query execution, some entries may have incorrect ordering (this can be fixed by translating the above to Lua, using the Lua scripting support available in Redis 2.6 and later)
  • If any of your data is missing in any of the indexes, then that entry will not appear in the results
Within the next few weeks, I'll be adding this functionality to rom, my Python Redis object mapper.

Interested in more tips and tricks with Redis? My book, Redis in Action (Amazon link), has dozens of other examples for new and seasoned users alike.


[1] For most applications, the distance criteria is something that would need to be computed on a per-query basis, and our questioning developer already built that part, so we'll assume that is available already.
[2] Except for certain types of extreme-valued doubles, you get 52 bits of actual precision, and 1 bit of implied precision. We'll operate under the assumption that we'll always be within the standard range, so we'll always get the full 53 bits.
[3] There are ways of adjusting the precision of certain columns of the data (by scaling values), but that can (and very likely would) result in scores with fractional components, which may break our sort order (as mentioned in the notes).

Friday, September 27, 2013

The spectacle of it all, and why you should just walk away

This post is off topic.

I'm going to talk briefly about something that has been pissing me off for quite a while now, but recent articles, interviews, and a Reddit AMA about "Romeo Rose" (aka "Sleepless in Austin") pushed me to the point of needing to say something.

Please, just stop it.

Look, I get it. Here is this physically unattractive, sexist, racist, douche-bag who thinks that he can pay someone $1500 to find him a girlfriend. You can't get a more perfect example of someone to lift up just to knock down. From the physical requirements of the woman he wants, the ethnicity of the woman, to his various forum postings all over the internet about learning how to pick up women (which helpful Redditors have scraped up out of the ether), ... the dude is a train wreck.

And it is so tempting to jump on the bandwagon to hate on the guy for any one of several reasons. But let me ask you one simple question: does it make the world a better place to tell your friends, to post on a forum, or to pass around the links as a way of saying that you don't like the guy? Seriously. Stop and consider it for just one or two moments. In the 5, 10, or even 60 minutes that you've been reading up on this guy, just so you would have ammunition to shit all over him, did you ever stop to wonder if it made you a better person? Did saying those things make you feel good about yourself?

We've got serious problems in this world, and thousands (maybe millions) of people are wasting their time talking about a guy whose claim to fame is an awful singles page. If the world was right and just, he might have been laughed at in a tiny corner of the internet for a day or two and forgotten. But no, here we are a couple days in, and Google gives me thousands of links to people talking about him.

Please, just stop it.

Every time anyone points out how awful of a person he is, how unattractive he is, or just to say that he's got bad teeth (the teeth and the way he looks are caused by a genetic condition that he has literally zero control over), all they are doing is bullying someone. You can validate it all you want, "he deserves it", "he is awful", or even "he brought it on himself". But just because you are standing with thousands of other bullies doesn't make it or you right, it just makes you an asshole with a bunch of other assholes.

So just stop it. Just walk away. Forget about Romeo Rose, and get on with your life. And when the next spectacle comes along, consider just walking away from that one too. Because it's the same thing over and over again. There will always be people that are held up in front of you to knock down, and every time you do it, you are a bully and an asshole.

And to top it off, you are wasting your time thinking about people you don't even like, who will never think about you or give you the time of day.

Make your life better. Just stop it.

Monday, June 3, 2013

What's going on with rom?

This post will talk about the open source Redis object mapper that I've released called rom. I will talk about what it is, why I wrote it, and what I'm planning on doing with it. I've posted several articles about Redis in the past, and you can buy my book, Redis in Action, now (hard copy will be available on/around June 10, 2013 - about a week from now) - enter the code dotd0601au at checkout for half off!

What is rom?


Rom is an "active record"-style object mapper intended as an interface between somewhat-intelligent Python objects and behavior, and data stored in Redis. Early versions (everything available now, and available for the coming few months) are purposefully simplified with respect to what is possible with Redis so that rom's capabilities can grow into what is necessary/desired, rather than trying to build functionality to support any/all possible use-cases up front.

An example use for storing users* can be seen below.

from hashlib import sha256
import os

import rom

def hash_pw(password, salt=None):
    salt = salt or os.urandom(16)
    hash = sha256(salt + password).digest()
    for i in xrange(32768):
        hash = sha256(hash + password).digest()
    return salt, hash

class User(rom.Model):
    name = rom.String(indexed=True)
    email = rom.String(required=True, unique=True, indexed=True)
    salt = rom.String()
    hash = rom.String()

    def update_password(self, password):
        self.salt, self.hash = hash_pw(password)

    def check_password(self, password):
        hash = hash_pw(password, self.salt)[1]
        pairs = zip(map(ord, hash), map(ord, self.hash or ''))
        p1 = sum(x ^ y for x, y in pairs)
        p2 = len(hash) ^ len(self.hash or '')
        if p1 | p2:
            raise Exception("passwords don't match")

    @property
    def contact(self):
        return '"%s" <%s>'%(self.name, self.email)

Aside from the coding overhead of handling the hashing of passwords in a somewhat secure manner, setting up attributes and adding simple behaviors on top of models is pretty much what you should expect in an object mapper used in Python.

Why did I write rom?


As is the case with many things I build, it was meant to scratch an itch. I was working on an as-of-yet unreleased personal project and I needed a database. This database would only ever need to hold a few megabytes of data (maybe up to the tens of megabytes), but it also may need to perform several thousand reads/second and several hundred writes/second on an under-powered machine, and would need to persist any updated data. Those requirements eliminated a standard database in a typical configuration. However, a relational database instructed to store data in-memory would work, except for the persisting to disk part (and Postgres with some fsync tuning wouldn't be unreasonable, except maybe for the read volume). Given that I have a lot of experience with Redis, and my requirements fit with many of the typical use-cases for Redis, my project seemed to be begging to use Redis.

But after deciding to use Redis for my project, then what? I've built ad-hoc data storage methods on Redis before (roughly 2 dozen different mechanisms that are or have run in production, never mind the dozens that I've advised others to use), and I feel bad every time I do it. So I started by looking through several of the object mappers for Redis in Python (some of which call themselves 'object Redis mappers' to stick with the 'orm' theme), but I didn't like the way they either exposed or hid the Redis internals. In particular, most of the time, users shouldn't care what some database is doing under the covers, and they shouldn't need to change the way they think about the world to get the job done.

Now, I know what you are thinking. You're thinking: "Josiah, Redis is a database that requires that you change the way you think about the world in order to make it work." Ahh, but that's where you are wrong. The purpose of rom is to abstract away about 90% of the strangeness of Redis to a new user (at least on the Python side). Almost everything works the way most users of SQLAlchemy, Django's ORM, or Appengine's datastore would expect. About the only thing that rom doesn't do that those other libraries offer on top of relational databases is: 1) composite indices and 2) ordered indices on string columns.

With Redis and the way rom handles its indices, there might be some advantages to offering composite indices on the performance side of things for certain queries. But those queries are very limited, and there is just under 64 bits of usable space in any index entry. Ordered indices on string columns is also tough, running into a limit of just under 64 bits to offer ordering there.

There is a method that would increase the limit beyond 64 bits, but that method would be incompatible with the other indices that rom already uses. So, long story short, don't expect composite indices, and don't expect ordered indices on strings that play nicely with other rom indices.

What is in rom's future?


If you've read everything above, you know that composite indices and sorted indices on strings that play nicely with the other indices is not going to happen. But what if you don't care about playing nicely with other indices? Well, that is a whole other story.

With ordered indices on strings, one really nice feature is that you can perform prefix lookups on strings - which makes autocomplete-like problems very easy. Expect that in the future.

At some point I'll also be switching to using Lua scripting to handle updating the data in Redis. That will offer fast and easy support for multiple unique index columns, while simultaneously offering point-in-time atomic updates without retries. All of the major logic would stay on the Python side, leaving simple updates to be done by Lua. I haven't done it yet because the performance and feature advantages aren't yet significant enough to necessitate it. With a little work, it would even be possible to implement "check and update" behavior to ensure that data hasn't been manipulated by other clients.

I've also been thinking about deferring attribute validation until just before data is serialized and sent to Redis, as it can be a significant performance advantage. The only reason I didn't do that from the start is that a TypeError on commit() can be a nightmare: hunting down the answer to "how did that data get there?" some time after the write occurred can be an exercise in futility. By performing the validation on attribute write (and on data load from Redis), you will at least be notified right away when you write the wrong data. As such, deferring validation until commit() may become a documented but discouraged feature in the future.

Redis-native structure access


I know that by now, some of you are looking at your screen asking, "When can I get raw access to lists, sets, hashes, and sorted sets? Those are really what my application needs!" And my answer to that is: at some point down the line. I know, that is a wishy-washy answer. And it's wishy-washy because there are three ways of building the functionality: 1) copy data out of Redis, manipulate in Python, then write changes back on commit(), 2) create objects that manipulate the in-Redis data directly, or 3) offer smart objects like #2, but write data back on commit().

If we keep with the database-style, then #1 is the right answer, as it allows us to perform all writes at commit() time. But if people have large lists, sets, hashes, or sorted sets in Redis, then #1 is definitely not the right answer, as applications might be copying out a lot of information. But with #2, updates to other structures step outside the somewhat expected commit() behavior (this entity and all related data have been written). Really, the right answer is #3. Direct access to reading, but masking write operations until commit().

Talking about building native access is easy, but actually building it to be robust is no small task. For now, you're probably better suited writing manual commands against your structures if you need native structure access.

How can you help?


Try rom out. Use it. If you find bugs, report them at Github. If you have fixes for bugs, post the bug with a pull request. If you have a feature request, ask. If your feature request is in the range of things that are reasonable to do and which fit in with what I want rom to be, I'll build it (possibly with a delay). Tell your colleagues about it. And if you are feeling really generous, buy my book: Redis in Action. If you are feeling really generous (or lost), I also do phone consultations.

Thank you for taking the time to read, and I hope that rom helps you.

* Whether or not you should store users in rom and Redis is a question related to whether Redis is suitable for storing such data in your use-case. This is used as a sample scenario that most people that have developed an application with users should be able to recognize (though not necessarily use).

Friday, May 10, 2013

Steganography in Python

Book Deal:

As a pre-article bonus, you can enter code dotd0511au during checkout to get 50% off of my book Redis in Action. If you want to keep seeing articles like this, please buy my book and give me positive reviews on Amazon ;)

See me speak:


LA Web Speed is hosting me for a talk at 7PM at Yahoo's offices in Santa Monica. Hit this page for meetup information: http://www.meetup.com/LAWebSpeed/events/115663212/ . I hope to see some of you there!

Steganography in Python


In this post I am going to talk about some code that I wrote to offer steganographic functionality in Python source files. This code was written for a lightning talk given at the April 1 Southern California Python Interest Group meetup (aka So-cal Piggies), hosted over at Mucker Labs in Santa Monica.

What is "steganography"?


Basically, steganography is the hiding of information within other information. There have been countless examples of steganography used over the centuries, and I'll leave it to you to click on that Wikipedia link to learn about all of them.

Where did the idea come from?


At the February 28 So-cal piggies meetup, a few of us were chatting about programming languages. I mentioned whitespace, a programming language that only uses whitespace for expressing syntax. When it was announced that April 1 would be a meetup of lightning talks that would discuss things to do/not to do in Python, I thought about revisiting last year's revisit of Gotos in Python to make my case statement loopable and not reliant on the internal debugging mechanisms. Then I thought a bit more, and an idea sparked from our earlier discussion of whitespace... Python's tabs and spaces.

How can you hide information in Python's tabs and spaces?


As most of you know, Python's syntax requires the use of tabs and/or spaces in order to define multi-line block structures. You can use all tabs, all spaces, or you can mix tabs and spaces, where a tab counts as 8 spaces (though mixing tabs and spaces is considered bad form). The idea was to use mixed tabs and spaces for indentation in order to hide information.

Basically, for each line with leading spaces that is not inside a triple-quoted string, consider the indentation. Break the indentation into blocks of 8 spaces (discarding any partial block). If a block is a tab, it is considered a 1 bit. If the block is 8 spaces, consider it a 0 bit.

With this semantic, we can define a protocol for embedding information inside a Python source file fairly easily. The first 8 bits of the file define how many 4-bit hexadecimal digits are available in the file, with 8 '0' bits meaning 1 digit. So a file consisting of all spaces will return a single '0' as its data. This allows for up to 1024 bits to be stored per file.
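A minimal decoder for that scheme could look like the following (my reconstruction for illustration, not the talk's actual code; it assumes the indentation is made up of whole 8-column blocks, discarding any partial block):

```python
def indent_bits(line):
    """Decode one line's indentation: a tab block is a 1 bit, an
    8-space block is a 0 bit; any partial block is discarded."""
    bits = []
    i = 0
    while i < len(line):
        if line[i] == '\t':
            bits.append(1)
            i += 1
        elif line[i:i+8] == ' ' * 8:
            bits.append(0)
            i += 8
        else:
            break
    return bits

assert indent_bits('\tx = 1\n') == [1]                    # one tab -> 1
assert indent_bits('        y = 2\n') == [0]              # 8 spaces -> 0
assert indent_bits('\t' + ' ' * 8 + '\tpass\n') == [1, 0, 1]
```

A full decoder would run this over every line not inside a triple-quoted string, then interpret the first 8 bits as the digit count described above.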

Expanding the idea


After building the above and calculating the amount of bits that are available in a variety of source files, I discovered that using only indents wouldn't offer much in the way of data storage density. Knowing a bit about Unicode, and how there are characters that behave quite a bit like a space character, but aren't, I started looking at zero-width and standard-width non-printed characters to replace *all* space characters with other characters (which would offer many more bits). After discovering about a dozen of them, and trying to inject them into Python source files, I discovered that Python is *very* strict about what it considers syntactically valid whitespace: tab, space, and form-feed.

I briefly toyed with the idea of re-formatting Python code to be PEP 8 compliant (PEP 8 is the Python enhancement proposal that defines how properly formatted Python code should look), then using extra spaces in odd places (in violation of PEP 8, such as a space between a function name and its opening parenthesis, like foo (), in both definitions and calls) to represent 1 bits (the lack of an oddly-placed space would be a 0). I also considered using LF vs. CRLF line endings to represent data, but CRLF pairs result in ^M being shown in 'diff' output. I also considered adding a spare trailing space to lines, but most 'diff' tools show ugly red highlights for trailing spaces.

One approach I was very tempted to pursue inserted zero-width spaces conditionally into unicode strings (aka strings in Python 3.x). Because they don't print, you wouldn't notice anything strange about them unless you output them on the web and inspected them, wrote them to a file, or calculated a hash and saw that the result wasn't what you expected. You could even combine the string manipulation with an import hook and some ctypes hackery that examined all strings loaded from the module, replacing them with versions that didn't have those zero-width spaces.

I may revisit these ideas in the future, but I found a simpler and easier way that neither mangles beautiful source code nor requires writing a ctypes-based import hook: spaces in comments.

Because most code has at least a handful of comments, and because you can put just about anything between a # and the newline (assuming a proper BOM or coding: declaration), I started conditionally replacing space characters with a different space character. I considered optionally using zero-width spaces as well, or choosing among a few different space characters (for more than one bit per space replaced), but simplicity is the name of the game: the more complicated it gets, the more likely your data will be discovered. For most source files, conditionally replacing spaces in comments resulted in a 3-5x increase in the number of bits that can be hidden in a Python source file, making my super-secret plan possible: store the sha1 or sha256 hmac of a file *inside* the file itself, for self-verifying code.
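The post doesn't say which substitute character the tool uses; assuming a no-break space (U+00A0) as the lookalike, the comment trick can be sketched like so:

```python
NBSP = "\u00a0"  # assumed substitute character; the actual tool may differ

def embed_in_comment(comment, bits):
    """Hide bits in a comment: each space carries one bit, staying a
    plain space for 0 or becoming the lookalike space for 1.  Returns
    the rewritten comment and any bits that didn't fit."""
    out, queue = [], list(bits)
    for ch in comment:
        if ch == " " and queue:
            out.append(NBSP if queue.pop(0) else " ")
        else:
            out.append(ch)
    return "".join(out), queue

def extract_from_comment(comment):
    """Recover one bit per space-like character: lookalike -> 1, plain -> 0."""
    return [1 if ch == NBSP else 0 for ch in comment if ch in (" ", NBSP)]
```

Note that unused spaces read back as trailing 0 bits, which is one reason a length header like the one described earlier is still needed.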

The result


Ultimately, I ended up writing a module/command-line tool that allows you to count the number of bits that can be hidden in a file, fetch stored bits from a file, clean the bits out of a file, write bits to a file, store the sha1 or sha256 hmac in a file (you are prompted for a password), or verify that a file has its proper sha1 or sha256 hmac embedded inside it (there's a password prompt here too). It uses the tokenize module to do all of the syntactical heavy lifting for reading and writing, and *may* only work in Python 2.6.
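The post doesn't show the tool's internals, but locating comment tokens with the standard tokenize module might look like this in Python 3 (the tool itself targeted 2.6, whose tokenize API differs slightly):

```python
import io
import tokenize

def iter_comments(source):
    """Yield (line_number, text) for each comment token in Python
    source, letting tokenize skip '#' characters inside strings."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.COMMENT:
            yield tok.start[0], tok.string
```

Using the tokenizer rather than scanning for '#' by hand is what keeps string literals containing '#' from being mistaken for comments.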

This module uses both the comment space trick as well as the tabs/spaces indent trick. I'm not terribly proud of the quality of the code, as I only had 2 afternoons to work on it leading up to the talk. But it works, and could be reasonably expanded to support all of the tricks I mentioned above, and more.

If you'd like to take a look at the code, you can read and/or download it from this GitHub Gist.


One of my favorite ideas for storing data is hiding passwords inside your source files. More specifically, it is generally considered bad form to store passwords in files in your source code repository, especially when that repository is hosted by a 3rd party. But you could store your password in some arbitrary source file in your repository (perhaps encrypted with the private key half of a public/private key pair), then when setting up your machine for the first time, download my gist, extract your password (decrypting with your public key, which could also be in your repository - it's public), then delete my gist. You then have your password available, and few people would think to look in a poorly-formatted Python source file.

This becomes really nefarious if you store the password in an initial commit as a sort of "committing what I've got, this is crap" commit (where formatting can sometimes be broken), along with a collection of other files with garbage data. Subsequent commits can clean up the bad formatting incrementally, and part of your "get password" mechanism can perform a checkout of the first commit of the repo before trying to extract the data (or a second commit, or looking for a specific comment inside a commit message, ...).


It's getting late, so I'm going to call this post done and head to bed. I hope that this has sparked some ideas for how you can hide and manipulate data.

Keep your eyes peeled: in the coming week or two I'll be posting about what is going on with rom, my Redis object mapper for Python. And don't forget to pick up your copy of Redis in Action, where you can learn how to use Redis in ways it was and was not intended.

Tuesday, April 30, 2013

Want to get a deal on Redis in Action?

As we come into the home stretch of the release of Redis in Action, a lot of work has been going on behind the scenes: easily 3-4 complete read-throughs over the last few months, a crew of editors poking and prodding, and countless major and minor edits. Those of you who have already downloaded version 11 of the MEAP (released last week) will notice that it has taken a pass through the typesetters. This pass has resulted in a level of spit and polish that I couldn't have imagined a year ago when the first MEAP went live.

A few issues have been reported in the week since version 11 of the MEAP was released, but they are being cleaned up as I type this.

To commemorate this final push and to give everyone a chance to learn more about their data, Manning is offering 50% off Redis in Action, Lucene in Action 2nd Edition, and Machine Learning in Action. How do you get 50% off? Follow those links and enter code dotd0501au at checkout.


When I find time in the coming weeks, you can look forward to two blog posts. The first on Steganography in Python, which I presented as a lightning talk on April 1 at the Southern California Python Interest Group meetup. And in the second, I'll talk about the whys, whats, and hows of building rom, a Redis Object Mapper in Python.

If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!