[OleFileIO_PL](http://www.decalage.info/python/olefileio) is a Python module to parse and read [Microsoft OLE2 files (also called Structured Storage, Compound File Binary Format or Compound Document File Format)](http://en.wikipedia.org/wiki/Compound_File_Binary_Format), such as Microsoft Office documents, Image Composer and FlashPix files, Outlook messages, StickyNotes, several Microscopy file formats ...
This is an improved version of the OleFileIO module from [PIL](http://www.pythonware.com/products/pil/index.htm), the excellent Python Imaging Library, created and maintained by Fredrik Lundh. The API is still compatible with PIL, but since 2005 I have improved the internal implementation significantly, with new features, bugfixes and a more robust design.
As far as I know, this module is now the most complete and robust Python implementation to read MS OLE2 files, portable on several operating systems. (please tell me if you know other similar Python modules)
OleFileIO_PL can be used as an independent module or with PIL. The goal is to have it integrated into [Pillow](http://python-imaging.github.io/), the friendly fork of PIL.
OleFileIO\_PL is mostly meant for developers. If you are looking for tools to analyze OLE files or to extract data, then please also check [python-oletools](http://www.decalage.info/python/oletools), which are built upon OleFileIO_PL.
- 2013-05-07 v0.24: new features to extract metadata (get\_metadata method and OleMetadata class), improved getproperties to convert timestamps to Python datetime
An OLE file can be seen as a mini file system or a Zip archive: It contains **streams** of data that look like files embedded within the OLE file. Each stream has a name. For example, the main stream of a MS Word document containing its text is named "WordDocument".
An OLE file can also contain **storages**. A storage is a folder that contains streams or other storages. For example, a MS Word document with VBA macros has a storage called "Macros".
Special streams can contain **properties**. A property is a specific value that can be used to store information such as the metadata of a document (title, author, creation date, etc). Property stream names usually start with the character '\x05'.
For example, a typical MS Word document may look like this:
As of version 0.30, the code has been changed to be compatible with Python 3.x. As a consequence, compatibility with Python 2.5 or older is not provided anymore. However, a copy of v0.26 is available as OleFileIO_PL2.py. If your application needs to be compatible with Python 2.5 or older, you may use the following code to load the old version when needed:
:::python
try:
import OleFileIO_PL
except:
import OleFileIO_PL2 as OleFileIO_PL
If you think OleFileIO_PL should stay compatible with Python 2.5 or older, please [contact me](http://decalage.info/contact).
### Test if a file is an OLE container ###
Use isOleFile to check if the first bytes of the file contain the Magic for OLE files, before opening it. isOleFile returns True if it is an OLE file, False otherwise (new in v0.16).
This is useful if the file is not on disk, e.g. already stored in a string or as a file-like object.
:::python
ole = OleFileIO_PL.OleFileIO(f)
For example the code below reads a file into a string, then uses BytesIO to turn it into a file-like object.
:::python
data = open('myfile.doc', 'rb').read()
f = io.BytesIO(data) # or StringIO.StringIO for Python 2.x
ole = OleFileIO_PL.OleFileIO(f)
### How to handle malformed OLE files ###
By default, the parser is configured to be as robust and permissive as possible, allowing to parse most malformed OLE files. Only fatal errors will raise an exception. It is possible to tell the parser to be more strict in order to raise exceptions for files that do not fully conform to the OLE specifications, using the raise_defect option (new in v0.14):
:::python
ole = OleFileIO_PL.OleFileIO('myfile.doc', raise_defects=DEFECT_INCORRECT)
When the parsing is done, the list of non-fatal issues detected is available as a list in the parsing_issues attribute of the OleFileIO object (new in 0.25):
:::python
print('Non-fatal issues raised during parsing:')
if ole.parsing_issues:
for exctype, msg in ole.parsing_issues:
print('- %s: %s' % (exctype.__name__, msg))
else:
print('None')
### Syntax for stream and storage path ###
Two different syntaxes are allowed for methods that need or return the path of streams and storages:
1) Either a **list of strings** including all the storages from the root up to the stream/storage name. For example a stream called "WordDocument" at the root will have ['WordDocument'] as full path. A stream called "ThisDocument" located in the storage "Macros/VBA" will be ['Macros', 'VBA', 'ThisDocument']. This is the original syntax from PIL. While hard to read and not very convenient, this syntax works in all cases.
2) Or a **single string with slashes** to separate storage and stream names (similar to the Unix path syntax). The previous examples would be 'WordDocument' and 'Macros/VBA/ThisDocument'. This syntax is easier, but may fail if a stream or storage name contains a slash. (new in v0.15)
Both are case-insensitive.
Switching between the two is easy:
:::python
slash_path = '/'.join(list_path)
list_path = slash_path.split('/')
### Get the list of streams ###
listdir() returns a list of all the streams contained in the OLE file, including those stored in storages. Each stream is listed itself as a list, as described above.
openstream(path) opens a stream as a file-like object.
The following example extracts the "Pictures" stream from a PPT file:
:::python
pics = ole.openstream('Pictures')
data = pics.read()
### Get information about a stream/storage ###
Several methods can provide the size, type and timestamps of a given stream/storage:
get_size(path) returns the size of a stream in bytes (new in v0.16):
:::python
s = ole.get_size('WordDocument')
get_type(path) returns the type of a stream/storage, as one of the following constants: STGTY\_STREAM for a stream, STGTY\_STORAGE for a storage, STGTY\_ROOT for the root entry, and False for a non existing path (new in v0.15).
:::python
t = ole.get_type('WordDocument')
get\_ctime(path) and get\_mtime(path) return the creation and modification timestamps of a stream/storage, as a Python datetime object with UTC timezone. Please note that these timestamps are only present if the application that created the OLE file explicitly stored them, which is rarely the case. When not present, these methods return None (new in v0.26).
:::python
c = ole.get_ctime('WordDocument')
m = ole.get_mtime('WordDocument')
The root storage is a special case: You can get its creation and modification timestamps using the OleFileIO.root attribute (new in v0.26):
:::python
c = ole.root.getctime()
m = ole.root.getmtime()
### Extract metadata ###
get_metadata() will check if standard property streams exist, parse all the properties they contain, and return an OleMetadata object with the found properties as attributes (new in v0.24).
See the source code of the OleMetadata class for more information.
### Parse a property stream ###
get\_properties(path) can be used to parse any property stream that is not handled by get\_metadata. It returns a dictionary indexed by integers. Each integer is the index of the property, pointing to its value. For example in the standard property stream '\x05SummaryInformation', the document title is property #2, and the subject is #3.
:::python
p = ole.getproperties('specialprops')
By default as in the original PIL version, timestamp properties are converted into a number of seconds since Jan 1,1601. With the option convert\_time, you can obtain more convenient Python datetime objects (UTC timezone). If some time properties should not be converted (such as total editing time in '\x05SummaryInformation'), the list of indexes can be passed as no_conversion (new in v0.25):
:::python
p = ole.getproperties('specialprops', convert_time=True, no_conversion=[10])
### Close the OLE file ###
Unless your application is a simple script that terminates after processing an OLE file, do not forget to close each OleFileIO object after parsing to close the file on disk. (new in v0.22)
See also [this paper](https://computer-forensics.sans.org/community/papers/gcfa/grow-forensic-tools-taxonomy-python-libraries-helpful-forensic-analysis_6879) about python tools for forensics, which features OleFileIO_PL.
About Python 2 and 3
--------------------
OleFileIO\_PL used to support only Python 2.x. As of version 0.30, the code has been changed to be compatible with Python 3.x. As a consequence, compatibility with Python 2.5 or older is not provided anymore. However, a copy of v0.26 is available as OleFileIO_PL2.py. See above the "import" section for a workaround.
If you think OleFileIO_PL should stay compatible with Python 2.5 or older, please [contact me](http://decalage.info/contact).
The code is available in [a Mercurial repository on bitbucket](https://bitbucket.org/decalage/olefileio_pl). You may use it to submit enhancements or to report any issue.
If you would like to help us improve this module, or simply provide feedback, please [contact me](http://decalage.info/contact). You can help in many ways:
To report a bug, for example a normal file which is not parsed correctly, please use the [issue reporting page](https://bitbucket.org/decalage/olefileio_pl/issues?status=new&status=open), or if you prefer to do it privately, use this [contact form](http://decalage.info/contact). Please provide all the information about the context and how to reproduce the bug.
By obtaining, using, and/or copying this software and/or its associated documentation, you agree that you have read, understood, and will comply with the following terms and conditions:
Permission to use, copy, modify, and distribute this software and its associated documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appears in all copies, and that both that copyright notice and this permission notice appear in supporting documentation, and that the name of Secret Labs AB or the author not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.
SECRET LABS AB AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL SECRET LABS AB OR THE AUTHOR BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.