Originally written by Thomas Hunter as a homework assignment in college. The object of the assignment was to hide data inside a Word document.
The Microsoft Word version that I used is 2007 as that is the version that I own.
Several months ago I read an article about Microsoft’s attempt to turn its new .docx format into an open standard for documents. Doing this would give other companies the ability to create .docx files, thereby slightly reducing the need for Word, but having it accepted as a standard would also open other opportunities, such as easier document generation (the new format doesn’t use proprietary binary formats). While reading up on the new standard, I was surprised to find out that a .docx file is actually a .zip file containing XML documents in a standardized directory structure.
My first idea for hiding data in the file was simply to open the archive in a zip program, add some files, and save the document. This did add the files, but Word complained that the document had been corrupted. Word then asked whether I wanted to recover the file, and upon clicking “Yes”, the file opened without trouble (the hidden data file was destroyed in the process). I then tried hiding the file in one of the subdirectories of the archive, and that did not work either. I also tried changing the file extension to .xml, in case other file extensions were setting off an alarm, but that did not work either.
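This first attempt can be reproduced without a zip program. The following is only a sketch using Python's standard zipfile module; the function names and the file name secret.txt are made up for illustration, and as described above, Word is expected to report the resulting document as corrupted.

```python
import zipfile


def add_file_to_docx(docx_path, arcname, payload):
    """Append an extra member to the .docx ZIP archive in place."""
    # Mode "a" adds to the existing archive instead of overwriting it.
    with zipfile.ZipFile(docx_path, "a", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(arcname, payload)


def list_members(docx_path):
    """Return the names of all files stored inside the .docx archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()
```

Calling add_file_to_docx("example.docx", "secret.txt", b"hidden") and then list_members("example.docx") shows the new entry next to the usual parts such as word/document.xml.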
Finally I decided I would need to hide the data inside one of the XML files itself. XML uses the same comment syntax that HTML does:
<!-- COMMENT GOES HERE -->
Unfortunately, comments used in this manner would have to be hand typed and wouldn’t allow for binary data. When the file is being parsed by Word, some binary data within the comment could signal the end of the file, or the data could coincidentally contain the string “-->”, which would end the comment early.
The way I got around this problem was by using the base64 algorithm (I was familiar with it because I had to convert data using base64 for a particular web project). Base64 provides a very simple yet effective way of taking information with a wide range of characters (ASCII alone has 128 possible characters, and arbitrary binary data uses all 256 byte values) and converting it into a form that can be stored with a smaller range of characters (base64 output uses 64 characters, as the name implies, plus “=” for padding). The following is a before and after of a base64 encoded string. Notice how the output is longer, as it compensates for the smaller character range:
My name is Thomas Hunter!
TXkgbmFtZSBpcyBUaG9tYXMgSHVudGVyIQ==
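The example above can be reproduced with Python's standard base64 module. This is a small sketch rather than the tool originally used for the assignment; the variable names are arbitrary.

```python
import base64

message = b"My name is Thomas Hunter!"

# Encode: every 3 input bytes become 4 output characters, padded with "=".
encoded = base64.b64encode(message)
print(encoded.decode("ascii"))  # TXkgbmFtZSBpcyBUaG9tYXMgSHVudGVyIQ==

# Decode: the original bytes come back unchanged.
decoded = base64.b64decode(encoded)
print(decoded.decode("ascii"))  # My name is Thomas Hunter!

# 25 input bytes grow to 36 output characters.
print(len(message), len(encoded))  # 25 36
```

Because the base64 alphabet contains no “-” or “>” characters, the encoded text can never contain the sequence that closes an XML comment.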
By converting an MP3 file to base64, wrapping it in a comment, and pasting it into example.docx/word/settings.xml, I was able to hide a data file within a Word .docx file. The 5MB MP3 file increased the size of the document by 8MB; however, a less savvy computer user would not have noticed the difference.
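The whole procedure can be sketched in Python with the standard zipfile and base64 modules. This is an illustration under stated assumptions, not the exact steps used for the assignment: the function names are hypothetical, and the comment is appended after the root element of word/settings.xml (which XML allows), since the write-up does not say where in the file it was pasted. Whether a given Word version opens the result without complaint is not verified here.

```python
import base64
import re
import zipfile

SETTINGS = "word/settings.xml"  # the part named in the write-up


def hide_in_docx(src_docx, dst_docx, payload):
    """Copy src_docx to dst_docx, adding payload as a base64 XML comment."""
    comment = b"<!--" + base64.b64encode(payload) + b"-->"
    with zipfile.ZipFile(src_docx) as src, \
            zipfile.ZipFile(dst_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == SETTINGS:
                # Assumption: place the comment after the closing root tag.
                data += comment
            dst.writestr(item, data)


def extract_from_docx(docx_path):
    """Recover the payload hidden by hide_in_docx."""
    with zipfile.ZipFile(docx_path) as archive:
        xml = archive.read(SETTINGS)
    # Look for a trailing comment made up only of base64 characters.
    match = re.search(rb"<!--([A-Za-z0-9+/=]+)-->\s*$", xml)
    if match is None:
        raise ValueError("no hidden payload found")
    return base64.b64decode(match.group(1))
```

A call such as hide_in_docx("example.docx", "stego.docx", open("song.mp3", "rb").read()) writes the modified copy, and extract_from_docx("stego.docx") returns the original bytes. The file names here are placeholders.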