MS Office (COM, OOXML)

The MS Office plug-in reads and writes metadata items in a large number of files that have a COM-based structure (older MS Office files, the default format up to Office 2003) as well as in all OpenXML formats. It consists of three files: ScavPISC.COM.dll, ScavPISC.zip.dll and ScavPI.MSOffice.dll.

File Extensions               COM           OpenXML       Description
DOC, XLS, PPT, PPS, PUB, MSP  Read & Write  (n/a)         Legacy MS Office application files and templates (Word, Excel, PowerPoint 97 to 2003); most MS Project files, most MS Publisher files, some MS Works files
DOCX, XLSX, PPTX              (n/a)         Read & Write  Current MS Office file format (2007 and newer) for Word, Excel, PowerPoint
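
The two format families in the table above can usually be told apart by their leading bytes rather than by file extension. Legacy files are typically stored as OLE2 compound files (COM structured storage), while OpenXML files are ZIP packages. The following is a minimal sketch of such a check, not the plug-in's own detection logic (which is not documented here); the function names and returned labels are illustrative assumptions.

```python
# Well-known container signatures (general file-format facts, not plug-in APIs).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file header
ZIP_MAGIC = b"PK\x03\x04"                      # first local file header of a ZIP archive


def detect_container(header: bytes) -> str:
    """Classify leading file bytes as 'COM', 'OpenXML', or 'unknown'."""
    if header.startswith(OLE_MAGIC):
        return "COM"
    if header.startswith(ZIP_MAGIC):
        # Any ZIP archive matches here; a real check would also look
        # for the OpenXML parts inside the package.
        return "OpenXML"
    return "unknown"


def detect_file(path: str) -> str:
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return detect_container(handle.read(8))
```

A signature check like this only identifies the container; it does not confirm that the file is a valid Office document.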

RawSourceElements in Office 97-2003 (COM-based)

The plug-in can process these Summary items, which are common across all document types (text, spreadsheet, presentation, etc.):

RSE DescriptorPath             Type                   Description
|COM|Summary|Title|            String                 The descriptive title of the document.
|COM|Summary|Subject|          String                 The subject of this document.
|COM|Summary|Author|           String                 The main author of the document.
|COM|Summary|Keywords|         String                 Keywords that characterize the document.
|COM|Summary|Comments|         String                 Comments on this document.
|COM|Summary|Template|         String (read-only)
|COM|Summary|LastEditAuthor|   String                 The editor who last saved this document.
|COM|Summary|RevisionN|        String
|COM|Summary|LastPrintedDate|  DateTime (read-only)   Date and time when this document was last printed.
|COM|Summary|LastSavedDate|    DateTime (read-only)   Date and time when this document was last saved.
|COM|Summary|CreationDate|     DateTime (read-only)   Date and time when this document was created.
|COM|Summary|TotalEditTime|    DateTime (read-only)   Total editing time.
|COM|Summary|PagesCount|       Numeric (read-only)    Number of pages in the document.
|COM|Summary|WordsCount|       Numeric (read-only)    Number of words in the document.
|COM|Summary|CharsCount|       Numeric (read-only)    Number of characters in the document.
|COM|Summary|Thumbnail|        Thumbnail (read-only)  Preview of the first page of the document.
|COM|Summary|AppName|          String                 Name of the application that created the document.

In addition, the following document-specific items are supported:

RSE DescriptorPath                   Type                   Description
|COM|DocSummary|Category|            String                 The category of the document.
|COM|DocSummary|Company|             String                 Company name.
|COM|DocSummary|Manager|             String                 Manager of the project.
|COM|DocSummary|Bytes|               Numeric (read-only)    Size of the document in bytes.
|COM|DocSummary|Lines|               Numeric (read-only)    Number of lines.
|COM|DocSummary|Paragraphs|          Numeric (read-only)    Number of paragraphs.
|COM|DocSummary|Slides|              Numeric (read-only)    Number of slides.
|COM|DocSummary|HiddenSlides|        Numeric (read-only)    Number of slides that are hidden.
|COM|DocSummary|MMClips|             Numeric (read-only)    Number of sound or video clips.
|COM|DocSummary|ScaleCrop|           Boolean (read-only)    Set to "true" if scaling of the thumbnails is desired. If not set, cropping is desired.
|COM|DocSummary|PresentationTarget|  String                 The intended output format of the presentation.
|COM|DocSummary|Notes|               Numeric (read-only)    Number of pages that contain notes.
|COM|DocSummary|LinksUptoDate|       Boolean (read-only)    Indicates if the custom links are hampered by excessive noise, for all applications.

Note that not all file formats include all of the document-specific items above; some are used only in Word, others only in PowerPoint or Excel. The plug-in also extracts the so-called TitlesofParts and HeadingPairs as well as custom-defined items (used heavily by Microsoft Project).
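
The descriptor paths listed above share a simple shape: three segments delimited by pipe characters, with a leading and a trailing pipe. As a small sketch of how such a path could be split and validated, the Python below parses one into its parts. The field names (container, group, item) and the function name are hypothetical labels chosen for this example, not terms defined by the plug-in, and custom-defined items may use a different depth.

```python
from typing import NamedTuple


class DescriptorPath(NamedTuple):
    container: str  # e.g. "COM" or "OpenXML" (hypothetical field name)
    group: str      # e.g. "Summary", "DocSummary", "Core", "App"
    item: str       # e.g. "Title"


def parse_descriptor_path(path: str) -> DescriptorPath:
    """Split a path such as '|COM|Summary|Title|' into its three segments."""
    if not (path.startswith("|") and path.endswith("|")):
        raise ValueError(f"descriptor path must start and end with '|': {path!r}")
    segments = path.strip("|").split("|")
    if len(segments) != 3 or not all(segments):
        raise ValueError(f"expected three non-empty segments: {path!r}")
    return DescriptorPath(*segments)
```

For example, `parse_descriptor_path("|COM|DocSummary|Company|").group` evaluates to `"DocSummary"`.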

RawSourceElements in Office 2007/2010 (OpenXML)

Excel 2007, Word 2007 and PowerPoint 2007 (and newer versions) now use the OpenXML format, which has its own set of metadata items. The plug-in can process the following items:

RSE DescriptorPath                  Type                   Description
|OpenXML|Core|Title|                String                 The descriptive title of the document.
|OpenXML|Core|Creator|              String
|OpenXML|Core|Description|          String
|OpenXML|Core|Created|              DateTime (read-only)
|OpenXML|Core|ContentStatus|        String                 The status of the document (e.g., Draft, Reviewed, Final).
|OpenXML|Core|Keywords|             String
|OpenXML|Core|Subject|              String
|OpenXML|Core|LastModifiedBy|       String
|OpenXML|Core|Modified|             DateTime (read-only)
|OpenXML|Core|Category|             String
|OpenXML|Core|Revision|             Numeric
|OpenXML|Core|LastPrinted|          DateTime (read-only)
|OpenXML|Core|Thumbnail|            Thumbnail (read-only)
|OpenXML|App|Application|           String
|OpenXML|App|AppVersion|            String                 Specifies the version of the application that created the document file.
|OpenXML|App|Template|              String (read-only)
|OpenXML|App|Company|               String
|OpenXML|App|Pages|                 Numeric (read-only)
|OpenXML|App|Words|                 Numeric (read-only)
|OpenXML|App|Paragraphs|            Numeric (read-only)
|OpenXML|App|Characters|            Numeric (read-only)
|OpenXML|App|CharactersWithSpaces|  Numeric (read-only)
|OpenXML|App|Lines|                 Numeric (read-only)
|OpenXML|App|DocSecurity|           Numeric (read-only)    Specifies the security level of a document as a simple numeric value.
|OpenXML|App|ScaleCrop|             Boolean
|OpenXML|App|HyperlinksChanged|     Boolean (read-only)
|OpenXML|App|LinksUpToDate|         Boolean (read-only)
|OpenXML|App|SharedDoc|             Boolean (read-only)    Indicates if this document is currently shared between multiple producers. If set to TRUE, producers should take care when updating the document.
|OpenXML|App|TotalTime|             Numeric (read-only)
|OpenXML|App|Manager|               String
|OpenXML|App|HyperlinkBase|         String
|OpenXML|App|PresentationFormat|    String
|OpenXML|App|Slides|                Numeric (read-only)
|OpenXML|App|Notes|                 Numeric (read-only)
|OpenXML|App|HiddenSlides|          Numeric (read-only)
|OpenXML|App|MMClips|               Numeric (read-only)

Custom-defined OpenXML items are also extracted. Note that not all file formats include all of the items above; some are used only in Word, others only in PowerPoint or Excel. More details can be found on the OOXML Wikipedia page.
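
To make the Core items above more concrete: an OpenXML file is a ZIP package, and the core properties conventionally live in a part named docProps/core.xml. The sketch below reads that part with the Python standard library and reports values under the descriptor paths used in this section. It is an illustration only, not the plug-in's implementation; the pairing of item names with XML elements is an assumption based on the element names of the core-properties schema, and a strict reader would locate the part through the package relationships instead of a fixed name.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Assumed pairing of item names (as listed above) with XML elements.
CORE_ITEMS = {
    "Title": "dc:title",
    "Creator": "dc:creator",
    "Description": "dc:description",
    "Subject": "dc:subject",
    "Keywords": "cp:keywords",
    "Category": "cp:category",
    "ContentStatus": "cp:contentStatus",
    "LastModifiedBy": "cp:lastModifiedBy",
    "Revision": "cp:revision",
    "Created": "dcterms:created",
    "Modified": "dcterms:modified",
    "LastPrinted": "cp:lastPrinted",
}


def read_core_items(source) -> dict:
    """Return {descriptor path: text} for the core properties that are present.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        try:
            xml_bytes = package.read("docProps/core.xml")
        except KeyError:
            return {}  # package has no core-properties part at the usual name
    root = ET.fromstring(xml_bytes)
    items = {}
    for name, tag in CORE_ITEMS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            items[f"|OpenXML|Core|{name}|"] = element.text
    return items
```

All values are returned as strings here; converting dates and revision numbers to typed values is left out to keep the sketch short.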

Synchronization of COM and OpenXML

MS Office 2007/2010 applications (Word, Excel, PowerPoint) use a new file format structure called OpenXML that also has an extended metadata schema. Corresponding metadata items in COM (MS Office 97-2003) and OpenXML may be mapped to a single InfoElement for consistency.
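
As a rough illustration of this kind of mapping, the sketch below collapses COM and OpenXML items into one value per InfoElement. The InfoElement names and the specific pairings are hypothetical, chosen from items that plausibly correspond in the tables above; the actual mapping used by the software may differ.

```python
# Hypothetical pairing of corresponding items; the real mapping may differ.
INFO_ELEMENT_SOURCES = {
    "Title": ("|COM|Summary|Title|", "|OpenXML|Core|Title|"),
    "Author": ("|COM|Summary|Author|", "|OpenXML|Core|Creator|"),
    "Subject": ("|COM|Summary|Subject|", "|OpenXML|Core|Subject|"),
    "Keywords": ("|COM|Summary|Keywords|", "|OpenXML|Core|Keywords|"),
    "Comments": ("|COM|Summary|Comments|", "|OpenXML|Core|Description|"),
    "LastEditor": ("|COM|Summary|LastEditAuthor|", "|OpenXML|Core|LastModifiedBy|"),
    "Company": ("|COM|DocSummary|Company|", "|OpenXML|App|Company|"),
}


def resolve_info_elements(raw_items: dict) -> dict:
    """Collapse raw source elements into one value per InfoElement name.

    `raw_items` maps descriptor paths to values; the first non-empty
    source path listed for each InfoElement wins.
    """
    resolved = {}
    for name, paths in INFO_ELEMENT_SOURCES.items():
        for path in paths:
            value = raw_items.get(path)
            if value:
                resolved[name] = value
                break
    return resolved
```

With this table, a legacy DOC file carrying `|COM|Summary|Author|` and a DOCX file carrying `|OpenXML|Core|Creator|` both surface the same "Author" element.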

Metadata Scrubbing

Scrubbing metadata is not yet implemented. Most metadata items in COM and OOXML files cannot be completely removed; they can only be overwritten with an empty value. In addition, older values of metadata fields may still be stored within the file structure. Other information inside the file (prior edits, additional internal metadata, metadata created by application plug-ins) will not be removed or overwritten.