diff --git a/PIL/OleFileIO.py b/PIL/OleFileIO.py index e5ecc6e56..57c9c349e 100644 --- a/PIL/OleFileIO.py +++ b/PIL/OleFileIO.py @@ -1,31 +1,29 @@ #!/usr/local/bin/python # -*- coding: latin-1 -*- -""" -OleFileIO_PL: -Module to read Microsoft OLE2 files (also called Structured Storage or -Microsoft Compound Document File Format), such as Microsoft Office -documents, Image Composer and FlashPix files, Outlook messages, ... -This version is compatible with Python 2.6+ and 3.x +## OleFileIO_PL: +## Module to read Microsoft OLE2 files (also called Structured Storage or +## Microsoft Compound Document File Format), such as Microsoft Office +## documents, Image Composer and FlashPix files, Outlook messages, ... +## This version is compatible with Python 2.6+ and 3.x -version 0.30 2014-02-04 Philippe Lagadec - http://www.decalage.info +## version 0.30 2014-02-04 Philippe Lagadec - http://www.decalage.info -Project website: http://www.decalage.info/python/olefileio +## Project website: http://www.decalage.info/python/olefileio -Improved version of the OleFileIO module from PIL library v1.1.6 -See: http://www.pythonware.com/products/pil/index.htm +## Improved version of the OleFileIO module from PIL library v1.1.6 +## See: http://www.pythonware.com/products/pil/index.htm -The Python Imaging Library (PIL) is +## The Python Imaging Library (PIL) is - Copyright (c) 1997-2005 by Secret Labs AB - - Copyright (c) 1995-2005 by Fredrik Lundh +## Copyright (c) 1997-2005 by Secret Labs AB +## Copyright (c) 1995-2005 by Fredrik Lundh -OleFileIO_PL changes are Copyright (c) 2005-2014 by Philippe Lagadec +## OleFileIO_PL changes are Copyright (c) 2005-2014 by Philippe Lagadec -See source code and LICENSE.txt for information on usage and redistribution. +## See source code and LICENSE.txt for information on usage and redistribution. + +## WARNING: THIS IS (STILL) WORK IN PROGRESS. -WARNING: THIS IS (STILL) WORK IN PROGRESS. -""" # Starting with OleFileIO_PL v0.30, only Python 2.6+ and 3.x is supported # This import enables print() as a function rather than a keyword diff --git a/docs/PIL.rst b/docs/PIL.rst index 5b429dc4b..8bf89c685 100644 --- a/docs/PIL.rst +++ b/docs/PIL.rst @@ -97,14 +97,6 @@ can be found here. :undoc-members: :show-inheritance: -:mod:`OleFileIO` Module ------------------------ - -.. automodule:: PIL.OleFileIO - :members: - :undoc-members: - :show-inheritance: - :mod:`PaletteFile` Module ------------------------- diff --git a/docs/reference/OleFileIO.rst b/docs/reference/OleFileIO.rst new file mode 100644 index 000000000..74c4b7b36 --- /dev/null +++ b/docs/reference/OleFileIO.rst @@ -0,0 +1,364 @@ +.. py:module:: PIL.OleFileIO +.. py:currentmodule:: PIL.OleFileIO + +:py:mod:`OleFileIO` Module +=========================== + +The :py:mod:`OleFileIO` module reads Microsoft OLE2 files (also called +Structured Storage or Microsoft Compound Document File Format), such +as Microsoft Office documents, Image Composer and FlashPix files, and +Outlook messages. + +This module is the `OleFileIO\_PL`_ project by Philippe Lagadec, v0.30, +merged back into Pillow. + +.. _OleFileIO\_PL: http://www.decalage.info/python/olefileio + +How to use this module +---------------------- + +For more information, see also the file **PIL/OleFileIO.py**, sample +code at the end of the module itself, and docstrings within the code. + +About the structure of OLE files +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +An OLE file can be seen as a mini file system or a Zip archive: It +contains **streams** of data that look like files embedded within the +OLE file. Each stream has a name. For example, the main stream of a MS +Word document containing its text is named "WordDocument". + +An OLE file can also contain **storages**. A storage is a folder that +contains streams or other storages. For example, a MS Word document with +VBA macros has a storage called "Macros". + +Special streams can contain **properties**. A property is a specific +value that can be used to store information such as the metadata of a +document (title, author, creation date, etc). Property stream names +usually start with the character '05'. + +For example, a typical MS Word document may look like this: + +:: + + \x05DocumentSummaryInformation (stream) + \x05SummaryInformation (stream) + WordDocument (stream) + Macros (storage) + PROJECT (stream) + PROJECTwm (stream) + VBA (storage) + Module1 (stream) + ThisDocument (stream) + _VBA_PROJECT (stream) + dir (stream) + ObjectPool (storage) + +Test if a file is an OLE container +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Use isOleFile to check if the first bytes of the file contain the Magic +for OLE files, before opening it. isOleFile returns True if it is an OLE +file, False otherwise. + +.. code-block:: python + + assert OleFileIO.isOleFile('myfile.doc') + +Open an OLE file from disk +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Create an OleFileIO object with the file path as parameter: + +.. code-block:: python + + ole = OleFileIO.OleFileIO('myfile.doc') + +Open an OLE file from a file-like object +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This is useful if the file is not on disk, e.g. already stored in a +string or as a file-like object. + +.. code-block:: python + + ole = OleFileIO.OleFileIO(f) + +For example the code below reads a file into a string, then uses BytesIO +to turn it into a file-like object. + +.. code-block:: python + + data = open('myfile.doc', 'rb').read() + f = io.BytesIO(data) # or StringIO.StringIO for Python 2.x + ole = OleFileIO.OleFileIO(f) + +How to handle malformed OLE files +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +By default, the parser is configured to be as robust and permissive as +possible, allowing to parse most malformed OLE files. Only fatal errors +will raise an exception. It is possible to tell the parser to be more +strict in order to raise exceptions for files that do not fully conform +to the OLE specifications, using the raise\_defect option: + +.. code-block:: python + + ole = OleFileIO.OleFileIO('myfile.doc', raise_defects=DEFECT_INCORRECT) + +When the parsing is done, the list of non-fatal issues detected is +available as a list in the parsing\_issues attribute of the OleFileIO +object: + +.. code-block:: python + + print('Non-fatal issues raised during parsing:') + if ole.parsing_issues: + for exctype, msg in ole.parsing_issues: + print('- %s: %s' % (exctype.__name__, msg)) + else: + print('None') + +Syntax for stream and storage path +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Two different syntaxes are allowed for methods that need or return the +path of streams and storages: + +1) Either a **list of strings** including all the storages from the root + up to the stream/storage name. For example a stream called + "WordDocument" at the root will have ['WordDocument'] as full path. A + stream called "ThisDocument" located in the storage "Macros/VBA" will + be ['Macros', 'VBA', 'ThisDocument']. This is the original syntax + from PIL. While hard to read and not very convenient, this syntax + works in all cases. + +2) Or a **single string with slashes** to separate storage and stream + names (similar to the Unix path syntax). The previous examples would + be 'WordDocument' and 'Macros/VBA/ThisDocument'. This syntax is + easier, but may fail if a stream or storage name contains a slash. + +Both are case-insensitive. + +Switching between the two is easy: + +.. code-block:: python + + slash_path = '/'.join(list_path) + list_path = slash_path.split('/') + +Get the list of streams +~~~~~~~~~~~~~~~~~~~~~~~ + +listdir() returns a list of all the streams contained in the OLE file, +including those stored in storages. Each stream is listed itself as a +list, as described above. + +.. code-block:: python + + print(ole.listdir()) + +Sample result: + +.. code-block:: python + + [['\x01CompObj'], ['\x05DocumentSummaryInformation'], ['\x05SummaryInformation'] + , ['1Table'], ['Macros', 'PROJECT'], ['Macros', 'PROJECTwm'], ['Macros', 'VBA', + 'Module1'], ['Macros', 'VBA', 'ThisDocument'], ['Macros', 'VBA', '_VBA_PROJECT'] + , ['Macros', 'VBA', 'dir'], ['ObjectPool'], ['WordDocument']] + +As an option it is possible to choose if storages should also be listed, +with or without streams: + +.. code-block:: python + + ole.listdir (streams=False, storages=True) + +Test if known streams/storages exist: +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +exists(path) checks if a given stream or storage exists in the OLE file. + +.. code-block:: python + + if ole.exists('worddocument'): + print("This is a Word document.") + if ole.exists('macros/vba'): + print("This document seems to contain VBA macros.") + +Read data from a stream +~~~~~~~~~~~~~~~~~~~~~~~ + +openstream(path) opens a stream as a file-like object. + +The following example extracts the "Pictures" stream from a PPT file: + +.. code-block:: python + + pics = ole.openstream('Pictures') + data = pics.read() + + +Get information about a stream/storage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Several methods can provide the size, type and timestamps of a given +stream/storage: + +get\_size(path) returns the size of a stream in bytes: + +.. code-block:: python + + s = ole.get_size('WordDocument') + +get\_type(path) returns the type of a stream/storage, as one of the +following constants: STGTY\_STREAM for a stream, STGTY\_STORAGE for a +storage, STGTY\_ROOT for the root entry, and False for a non existing +path. + +.. code-block:: python + + t = ole.get_type('WordDocument') + +get\_ctime(path) and get\_mtime(path) return the creation and +modification timestamps of a stream/storage, as a Python datetime object +with UTC timezone. Please note that these timestamps are only present if +the application that created the OLE file explicitly stored them, which +is rarely the case. When not present, these methods return None. + +.. code-block:: python + + c = ole.get_ctime('WordDocument') + m = ole.get_mtime('WordDocument') + +The root storage is a special case: You can get its creation and +modification timestamps using the OleFileIO.root attribute: + +.. code-block:: python + + c = ole.root.getctime() + m = ole.root.getmtime() + +Extract metadata +~~~~~~~~~~~~~~~~ + +get\_metadata() will check if standard property streams exist, parse all +the properties they contain, and return an OleMetadata object with the +found properties as attributes. + +.. code-block:: python + + meta = ole.get_metadata() + print('Author:', meta.author) + print('Title:', meta.title) + print('Creation date:', meta.create_time) + # print all metadata: + meta.dump() + +Available attributes include: + +:: + + codepage, title, subject, author, keywords, comments, template, + last_saved_by, revision_number, total_edit_time, last_printed, create_time, + last_saved_time, num_pages, num_words, num_chars, thumbnail, + creating_application, security, codepage_doc, category, presentation_target, + bytes, lines, paragraphs, slides, notes, hidden_slides, mm_clips, + scale_crop, heading_pairs, titles_of_parts, manager, company, links_dirty, + chars_with_spaces, unused, shared_doc, link_base, hlinks, hlinks_changed, + version, dig_sig, content_type, content_status, language, doc_version + +See the source code of the OleMetadata class for more information. + +Parse a property stream +~~~~~~~~~~~~~~~~~~~~~~~ + +get\_properties(path) can be used to parse any property stream that is +not handled by get\_metadata. It returns a dictionary indexed by +integers. Each integer is the index of the property, pointing to its +value. For example in the standard property stream +'05SummaryInformation', the document title is property #2, and the +subject is #3. + +.. code-block:: python + + p = ole.getproperties('specialprops') + +By default as in the original PIL version, timestamp properties are +converted into a number of seconds since Jan 1,1601. With the option +convert\_time, you can obtain more convenient Python datetime objects +(UTC timezone). If some time properties should not be converted (such as +total editing time in '05SummaryInformation'), the list of indexes can +be passed as no\_conversion: + +.. code-block:: python + + p = ole.getproperties('specialprops', convert_time=True, no_conversion=[10]) + +Close the OLE file +~~~~~~~~~~~~~~~~~~ + +Unless your application is a simple script that terminates after +processing an OLE file, do not forget to close each OleFileIO object +after parsing to close the file on disk. + +.. code-block:: python + + ole.close() + +Use OleFileIO as a script +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +OleFileIO can also be used as a script from the command-line to +display the structure of an OLE file and its metadata, for example: + +:: + + PIL/OleFileIO.py myfile.doc + +You can use the option -c to check that all streams can be read fully, +and -d to generate very verbose debugging information. + +How to contribute +----------------- + +The code is available in `a Mercurial repository on +bitbucket `_. You may use +it to submit enhancements or to report any issue. + +If you would like to help us improve this module, or simply provide +feedback, please `contact me `_. You can +help in many ways: + +- test this module on different platforms / Python versions +- find and report bugs +- improve documentation, code samples, docstrings +- write unittest test cases +- provide tricky malformed files + +How to report bugs +------------------ + +To report a bug, for example a normal file which is not parsed +correctly, please use the `issue reporting +page `_, +or if you prefer to do it privately, use this `contact +form `_. Please provide all the +information about the context and how to reproduce the bug. + +If possible please join the debugging output of OleFileIO. For this, +launch the following command : + +:: + + PIL/OleFileIO.py -d -c file >debug.txt + + +Classes and Methods +------------------- + +.. automodule:: PIL.OleFileIO + :members: + :undoc-members: + :show-inheritance: + :noindex: diff --git a/docs/reference/index.rst b/docs/reference/index.rst index d6baf216e..2f10b861d 100644 --- a/docs/reference/index.rst +++ b/docs/reference/index.rst @@ -26,5 +26,6 @@ Reference ImageTk ImageWin ExifTags + OleFileIO PSDraw ../PIL