Note: this code has been updated for the final Ecma schemas and the RTM version of Office. If you notice any mistakes or run into any issues, please report them to julien@chable.net
This post is a summary translation of articles by Julien Chable that are available (in French) on MSDN France.
Retrieve a document’s main part
public final static String NS_CORE_DOCUMENT = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
...
final String APP_ROOT = System.getProperty("user.dir") + File.separator;
ZipFile zipFile = null;
try {
    zipFile = new ZipFile(APP_ROOT + "sample.docx");
} catch (IOException e) {
    e.printStackTrace();
}
Package p = Package.open(zipFile, PackageAccess.Read);
// Retrieve the core part relationship by its type
PackageRelationship coreDocRelationship = p.getRelationshipsByType(
        PackageRelationshipConstants.NS_CORE_DOCUMENT).getRelationship(0);
// Get the content part from the relationship
PackagePart coreDocument = p.getPart(coreDocRelationship);
System.out.println(coreDocument.getUri() + " -> " + coreDocument.getContentType());
Listing 1
Listing 1 output for several types of documents :
- Word :
word/document.xml -> application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
- Excel :
xl/workbook.xml -> application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml
- PowerPoint :
ppt/presentation.xml -> application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml
Here are the extensions and the URI of the main part for several types of documents :
- WordProcessingML (.docx) : word/document.xml
- SpreadsheetML (.xlsx) : xl/workbook.xml
- PresentationML (.pptx) : ppt/presentation.xml
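To make the lookup above less magical, here is a minimal sketch that does the same thing with the JDK alone, without the article's framework. It assumes the package-level relationships are stored in the `_rels/.rels` entry of the ZIP file, as the Open Packaging Conventions specify; the class and method names (`RelationshipLookup`, `findTargetByType`) are hypothetical and are not part of the library used in the listings.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the article's framework.
class RelationshipLookup {
    // Namespace of the <Relationship> elements in a .rels part
    static final String NS_RELS =
            "http://schemas.openxmlformats.org/package/2006/relationships";
    // Relationship type of the main part (same value as in Listing 1)
    static final String NS_CORE_DOCUMENT =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    /** Returns the Target of the first relationship of the given type, or null if there is none. */
    static String findTargetByType(InputStream relsXml, String type) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(relsXml);
        NodeList rels = doc.getElementsByTagNameNS(NS_RELS, "Relationship");
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            if (type.equals(rel.getAttribute("Type"))) {
                return rel.getAttribute("Target");
            }
        }
        return null;
    }
}
```

With a real file, the stream would come from `java.util.zip.ZipFile`, for example `zip.getInputStream(zip.getEntry("_rels/.rels"))`.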
How to get a document’s properties
The following sample demonstrates how to get the core properties part of a document:
public final static String NS_CORE_PROPERTIES = "http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties";
...
Package p = Package.open(zipFile, PackageAccess.Read);
// Get the core properties part relationship
PackageRelationship corePropertiesRelationship = p.getRelationshipsByType(
        PackageRelationshipConstants.NS_CORE_PROPERTIES).getRelationship(0);
// Get the core properties part from the previous relationship
PackagePart coreDocument = p.getPart(corePropertiesRelationship);
System.out.println(coreDocument.getUri() + " -> " + coreDocument.getContentType());
Listing 2
The output displays :
docProps/core.xml -> application/vnd.openxmlformats-package.core-properties+xml
Only a few lines are needed to read the document’s properties:
...
OpenXMLDocument docx = new OpenXMLDocument(Package.open(zipFile, PackageAccess.Read));
System.out.println(docx.getCoreProperties().getCreator());
System.out.println(docx.getCoreProperties().getTitle());
System.out.println(docx.getCoreProperties().getSubject());
Listing 3
The output displays :
Julien CHABLE
Lorem Ipsum
Sample document
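For readers curious about what such getters have to do, the following sketch reads the same values straight from the XML of `docProps/core.xml` with the JDK's DOM parser. It relies on the fact that creator, title and subject are Dublin Core elements in that part; `CorePropertiesReader` and its methods are hypothetical names, and nothing here claims to mirror how the framework's `getCoreProperties()` is actually implemented.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the article's framework.
class CorePropertiesReader {
    // Dublin Core namespace used for creator, title, subject and description
    static final String NS_DC = "http://purl.org/dc/elements/1.1/";

    /** Parses the content of docProps/core.xml with a namespace-aware parser. */
    static Document parse(InputStream coreXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(coreXml);
    }

    /** Returns the text of the first dc:localName element, or null when it is absent. */
    static String dublinCore(Document core, String localName) {
        NodeList nodes = core.getElementsByTagNameNS(NS_DC, localName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }
}
```

Calling `dublinCore(doc, "creator")` on the sample document would be expected to return the same author name that Listing 3 prints.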
How to change document’s properties
Changing a property is as simple as reading one:
// Destination file
File destFile = new File(APP_ROOT + "sample_out.docx");
// Open the document
Package pack = Package.open(zipFile, PackageAccess.ReadWrite);
OpenXMLDocument docx = new OpenXMLDocument(pack);
CoreProperties coreProps = docx.getCoreProperties();
coreProps.setCreator("OpenXMLDeveloer.org powa");
coreProps.setDescription("A new description");
coreProps.setTitle("SampleListing4");
// Save document
docx.save(destFile);
Listing 4
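Saving a modified property ultimately means writing a new ZIP file in which one part has different content. The sketch below shows that idea with `java.util.zip` only: every entry is copied as-is except the one being replaced. It is an illustration under that assumption, not the framework's `save()` implementation, and `ZipEntryReplacer` is a hypothetical name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of the article's framework.
class ZipEntryReplacer {
    /**
     * Copies a ZIP archive from source to target, substituting newContent for the
     * entry called entryName. Entry order is preserved. Both streams are closed.
     * If no entry has that name, the archive is copied unchanged.
     */
    static void replaceEntry(InputStream source, OutputStream target,
                             String entryName, byte[] newContent) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(source);
             ZipOutputStream zout = new ZipOutputStream(target)) {
            byte[] buffer = new byte[4096];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                // A fresh ZipEntry avoids reusing the sizes and CRC of the old one
                zout.putNextEntry(new ZipEntry(entry.getName()));
                if (entry.getName().equals(entryName)) {
                    zout.write(newContent);
                } else {
                    int read;
                    while ((read = zin.read(buffer)) != -1) {
                        zout.write(buffer, 0, read);
                    }
                }
                zout.closeEntry();
            }
        }
    }
}
```

For a property change, `entryName` would be `docProps/core.xml` and `newContent` the re-serialized XML of that part.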
Extended properties
The small framework associated with this article doesn’t provide any class or method to access extended properties. As a result, in this sample, we need to use the DOM API to extract information from the extended properties part:
...
// Open the package
Package p = Package.open(..., PackageAccess.Read);
// Get the extended properties relationship
PackageRelationship extendedPropertiesRelationship = p
        .getRelationshipsByType(PackageRelationshipConstants.NS_EXTENDED_PROPERTIES)
        .getRelationship(0);
// Get the extended properties part from the previous relationship
PackagePart extPropsPart = p.getPart(extendedPropertiesRelationship);
System.out.println(extPropsPart.getUri() + " -> " + extPropsPart.getContentType());
// Extract content
try {
    InputStream inStream = extPropsPart.getInputStream();
    // Create DOM parser
    DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
    documentBuilderFactory.setNamespaceAware(true);
    documentBuilderFactory.setIgnoringElementContentWhitespace(true);
    DocumentBuilder documentBuilder;
    documentBuilder = documentBuilderFactory.newDocumentBuilder();
    // Parse XML content
    Document extPropsDoc = documentBuilder.parse(inStream);
    // Extract the name and the version of the Open XML file generator
    System.out.println("Document generated with "
            + extPropsDoc.getElementsByTagName("Application").item(0).getTextContent()
            + " vers. "
            + extPropsDoc.getElementsByTagName("AppVersion").item(0).getTextContent());
    // Extract statistics about this document
    System.out.println("This document contains "
            + extPropsDoc.getElementsByTagName("Words").item(0).getTextContent()
            + " words and is composed of "
            + extPropsDoc.getElementsByTagName("Characters").item(0).getTextContent()
            + " characters and "
            + extPropsDoc.getElementsByTagName("Lines").item(0).getTextContent()
            + " lines");
    inStream.close();
} catch (Exception ioe) {
    System.err.println("Failed to extract extended properties ! :(");
}
Listing 5
Output of Listing 5 :
docProps/app.xml -> application/vnd.openxmlformats-officedocument.extended-properties+xml
Document generated with Microsoft Office Word vers. 12.0000
This document contains 262 words and is composed of 1444 characters and 12 lines
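Listing 5 assumes that every element it asks for is present; `item(0)` returns null otherwise, which is likely for statistics such as Words or Lines in a presentation or a workbook. One possible way around this, sketched below with a hypothetical `ExtendedPropertiesReader` class, is to load all the simple (text-only) properties into a map first and then look up only the keys that exist.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the article's framework.
class ExtendedPropertiesReader {
    /**
     * Reads the content of docProps/app.xml and returns every direct child of the
     * root element that contains text only, keyed by its local name. Structured
     * properties (children that contain other elements) are skipped.
     */
    static Map<String, String> readSimpleProperties(InputStream appXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(appXml);
        Map<String, String> props = new LinkedHashMap<String, String>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element
                    && ((Element) child).getElementsByTagName("*").getLength() == 0) {
                props.put(child.getLocalName(), child.getTextContent());
            }
        }
        return props;
    }
}
```

A caller can then test `props.containsKey("Words")` before printing the statistics instead of catching a NullPointerException.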
Thumbnail
Many OpenXML documents, for example those saved by PowerPoint 2007, contain a thumbnail of the document. This part is identified by the following relationship type: http://schemas.openxmlformats.org/package/2006/relationships/metadata/thumbnail.
The following listing uses two methods, getThumbnails() and extractParts(), to extract the thumbnail of the document and put it into the ‘export’ directory:
final String APP_ROOT = System.getProperty("user.dir") + File.separator;
// Source file
ZipFile zipFile = null;
try {
    zipFile = new ZipFile(APP_ROOT + "sample.pptx");
} catch (IOException e) {
    ...
}
// Destination folder
File destFile = new File(APP_ROOT + "export");
// Open the package
OpenXMLDocument docx = OpenXMLDocument.open(zipFile, PackageAccess.Read);
// Extract thumbnails
docx.extractParts(docx.getThumbnails(), destFile);
Listing 6
Here are the details of the getThumbnails() and extractParts() methods :
public final static String NS_THUMBNAIL_PART = "http://schemas.openxmlformats.org/package/2006/relationships/metadata/thumbnail"; ...
// Retrieve all thumbnails contained in the document.
public ArrayList<PackagePart> getThumbnails() {
    return container.getPartByRelationshipType(
            PackageRelationshipConstants.NS_THUMBNAIL_PART);
}
Listing 6-1 (class OpenXMLDocument)
/**
 * Extract part content into the specified folder.
 *
 * @param parts
 *            Parts to extract.
 * @param destFolder
 *            Destination folder.
 */
public void extractParts(ArrayList<PackagePart> parts, File destFolder) {
    for (PackagePart part : parts) {
        String filename = PackageURIHelper.getFilename(part.getUri());
        try {
            InputStream ins = part.getInputStream();
            FileOutputStream fw = new FileOutputStream(
                    destFolder.getAbsolutePath() + File.separator + filename);
            byte[] buff = new byte[512];
            int read;
            // Write only the bytes actually read, so the last chunk is not padded with stale data
            while ((read = ins.read(buff)) != -1) {
                fw.write(buff, 0, read);
            }
            fw.close();
            ins.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Listing 6-2 (class OpenXMLDocument)
Listing 6 result :
Word document basic creation
To simplify this example, we’re going to create a document from a blank one by modifying its content; for the purposes of this article, this is simpler to understand and to do than creating a document from scratch. To add paragraphs to a document, the classes ParagraphBuilder, Paragraph and Run are very useful:
// Creation of a paragraph builder
ParagraphBuilder paraBuilder = new ParagraphBuilder();
paraBuilder.setAlignment(ParagraphAlignment.CENTER);
Listing 7-1
Once the ParagraphBuilder is ready, you can create a new paragraph by using the newParagraph() method:
// We create the first paragraph
Paragraph par1 = paraBuilder.newParagraph();
Listing 7-2
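It can help to picture the WordprocessingML that a centered paragraph with styled runs corresponds to. The sketch below builds that markup by hand as strings. The element names (w:p, w:pPr, w:jc, w:r, w:rPr, w:b, w:i, w:t) come from the WordprocessingML schema, but the `RunMarkup` class is hypothetical and the article does not show what ParagraphBuilder, Paragraph and Run actually emit, so treat this as an approximation.

```java
// Hypothetical helper, not part of the article's framework.
class RunMarkup {
    // Main WordprocessingML namespace of the final Ecma schemas
    static final String NS_W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Escapes the characters that are not allowed in XML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds one run; the run properties are emitted only when a style is requested. */
    static String run(String text, boolean bold, boolean italic) {
        StringBuilder sb = new StringBuilder("<w:r>");
        if (bold || italic) {
            sb.append("<w:rPr>");
            if (bold) sb.append("<w:b/>");
            if (italic) sb.append("<w:i/>");
            sb.append("</w:rPr>");
        }
        // xml:space="preserve" keeps leading spaces such as the one in " Office"
        sb.append("<w:t xml:space=\"preserve\">").append(escape(text)).append("</w:t></w:r>");
        return sb.toString();
    }

    /** Wraps already-built runs in a paragraph; alignment may be null. */
    static String paragraph(String alignment, String... runs) {
        StringBuilder sb = new StringBuilder("<w:p xmlns:w=\"" + NS_W + "\">");
        if (alignment != null) {
            sb.append("<w:pPr><w:jc w:val=\"").append(alignment).append("\"/></w:pPr>");
        }
        for (String r : runs) {
            sb.append(r);
        }
        return sb.append("</w:p>").toString();
    }
}
```

For instance, `paragraph("center", run("Hello", true, false), run(" Office", false, true))` yields a w:p element with a w:jc alignment followed by one bold and one italic run.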
The following example creates two paragraphs: ‘Hello Office Open XML’, with a different style on each word, and ‘www.openxmldeveloper.org’ in a large font size:
...
Package pack = Package.open(zipFile, PackageAccess.ReadWrite);
WordDocument docx = new WordDocument(pack);
// Creation of a paragraph builder
ParagraphBuilder paraBuilder = new ParagraphBuilder();
paraBuilder.setAlignment(ParagraphAlignment.CENTER);
// We create the first paragraph
Paragraph par1 = paraBuilder.newParagraph();
// Add runs to modify the style
Run r1 = new Run("Hello");
r1.setBold(true);
Run r2 = new Run(" Office");
r2.setItalic(true);
Run r3 = new Run(" Open");
r3.setUnderline(UnderlineStyle.SINGLE);
Run r4 = new Run(" XML");
r4.setVerticalAlignement(VerticalAlignment.SUPERSCRIPT);
// Add previous runs to the first paragraph
par1.addRun(r1);
par1.addRun(r2);
par1.addRun(r3);
par1.addRun(r4);
// Add the first paragraph in the document’s content
docx.appendParagraph(par1);
// Creation of a second paragraph
paraBuilder.setBold(true);
Paragraph par2 = paraBuilder.newParagraph();
Run r21 = new Run("www.openxmldeveloper.org");
r21.setFontSize(55);
par2.addRun(r21);
// Append the second paragraph to content
docx.appendParagraph(par2);
// Save the document
docx.save(destFile);
Listing 8
The result :

Convert a Word document to HTML
The OpenXML format is partly based on XML, so converting a document to HTML with XSLT is quite simple, at least for basic documents.
In this example, we’ll transform the simple document generated by the previous listing with the following XSLT stylesheet:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <xsl:output method="html" />

  <!-- Document root -->
  <xsl:template match="/w:document">
    <xsl:apply-templates select="w:body" />
  </xsl:template>

  <!-- Body and paragraphs -->
  <xsl:template match="w:body">
    <html>
      <body>
        <xsl:for-each select="w:p">
          <p>
            <xsl:apply-templates select="w:pPr" />
            <xsl:apply-templates select="w:r" />
          </p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>

  <!-- Paragraph properties -->
  <xsl:template match="w:pPr">
    <xsl:attribute name="style">
      <xsl:apply-templates />
    </xsl:attribute>
  </xsl:template>

  <!-- Text alignment -->
  <xsl:template match="w:jc">text-align:<xsl:value-of select="@w:val" />;</xsl:template>

  <!-- Run -->
  <xsl:template match="w:r">
    <span>
      <xsl:apply-templates select="w:rPr" />
      <xsl:value-of select="w:t" />
    </span>
  </xsl:template>

  <!-- Run properties -->
  <xsl:template match="w:rPr">
    <xsl:attribute name="style">
      <xsl:apply-templates />
    </xsl:attribute>
  </xsl:template>

  <!-- Font size (w:sz is expressed in half-points) -->
  <xsl:template match="w:sz">font-size:<xsl:value-of select="@w:val div 2" />pt;</xsl:template>

  <!-- Vertical alignment -->
  <xsl:template match="w:vertAlign">
    <xsl:variable name="jcVal" select="@w:val" />
    <xsl:if test="$jcVal = 'superscript'">font-size:33%;position:relative;bottom:0.5em;</xsl:if>
    <xsl:if test="$jcVal = 'subscript'">font-size:33%;position:relative;bottom:-0.5em;</xsl:if>
  </xsl:template>

  <!-- Bold -->
  <xsl:template match="w:b">font-weight:bold;</xsl:template>

  <!-- Italic -->
  <xsl:template match="w:i">font-style:italic;</xsl:template>

  <!-- Underline -->
  <xsl:template match="w:u">text-decoration:underline;</xsl:template>
</xsl:stylesheet>
We’ll use the class WordToHTMLTransformer and the associated method transform() to convert our OpenXML document to HTML :
WordDocument docx = new WordDocument(...);
...
WordToHTMLTransformer wt = new WordToHTMLTransformer();
InputStream transformStream = wt.transform(docx);
Listing 9-1
The complete example :
final String APP_ROOT = System.getProperty("user.dir") + File.separator;
// Source file
ZipFile zipFile = null;
try {
    zipFile = new ZipFile(APP_ROOT + "sample_out.docx");
} catch (IOException e) {
    e.printStackTrace();
}
// Destination of the output file
File destFile = new File(APP_ROOT + "output.html");
Package pack = Package.open(zipFile, PackageAccess.ReadWrite);
WordDocument docx = new WordDocument(pack);
WordToHTMLTransformer wt = new WordToHTMLTransformer();
try {
    InputStream transformStream = wt.transform(docx);
    BufferedWriter outStream = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(destFile)));
    BufferedReader br = new BufferedReader(new InputStreamReader(transformStream));
    String buff;
    while ((buff = br.readLine()) != null) {
        outStream.write(buff);
        // readLine() strips the line terminator, so put it back
        outStream.newLine();
    }
    outStream.close();
    br.close();
} catch (Exception e) {
    e.printStackTrace();
}
Listing 10
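The article does not show the inside of WordToHTMLTransformer. A plausible core for such a class, sketched here with the standard `javax.xml.transform` API only, is to feed the main part's XML and the stylesheet to an XSLT processor and collect the result. `XsltRunner` is a hypothetical name, and this is an assumption about the approach rather than the framework's actual code.

```java
import java.io.InputStream;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of the article's framework.
class XsltRunner {
    /**
     * Applies an XSLT stylesheet to an XML document and returns the output as text.
     * For the conversion above, xml would be the stream of word/document.xml.
     */
    static String transform(InputStream xml, InputStream stylesheet)
            throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(stylesheet));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(xml), new StreamResult(out));
        return out.toString();
    }
}
```

Returning a String keeps the sketch short; for large documents a `StreamResult` wrapping the destination file would avoid holding the whole page in memory.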
The HTML file generated by Listing 10 in Internet Explorer 7 :

Author
Julien Chable, a student at EFREI in France and a Microsoft Student Partner, writes articles about Java and .NET for several magazines and websites. He can be contacted via his website http://julien.chable.net or his blog http://blogs.developpeur.org/neodante/