Java Program to Convert PDF to Word Document

We often create or receive documents in PDF format and later need them as Word documents, for example because the recipient has no PDF viewer installed or because the content has to be edited. Desktop and online tools exist for this conversion, but Java has no built-in PDF-to-Word converter, so a Java program has to rely on a third-party library.

There are many situations in which you might want to convert a PDF document into a Word document. A common one is bulk conversion: when a whole set of files is saved in PDF format, converting them one at a time by hand is impractical, and a batch process does the job better.
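
A batch run is mostly file bookkeeping around whichever converter you choose. The sketch below shows one possible shape using only the standard library: it lists the PDF files in a folder, derives a .docx name for each, and hands both paths to a converter callback. BatchConverter, toDocxName, convertAll, and the callback are hypothetical names for illustration; the callback is where a real library call would go.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.BiConsumer;
import java.util.stream.Stream;

// Hypothetical helper: a library-agnostic batch driver for PDF-to-Word conversion.
final class BatchConverter {

    // Derive the output name, e.g. "report.pdf" becomes "report.docx".
    static String toDocxName(String pdfName) {
        boolean isPdf = pdfName.toLowerCase(Locale.ROOT).endsWith(".pdf");
        String base = isPdf ? pdfName.substring(0, pdfName.length() - 4) : pdfName;
        return base + ".docx";
    }

    // Run the converter on every *.pdf directly inside inputDir.
    // Returns the output paths in sorted input order.
    static List<Path> convertAll(Path inputDir, Path outputDir,
                                 BiConsumer<Path, Path> converter) throws IOException {
        Files.createDirectories(outputDir);
        List<Path> outputs = new ArrayList<>();
        try (Stream<Path> files = Files.list(inputDir)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString()
                               .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                 .sorted()
                 .forEach(pdf -> {
                     Path out = outputDir.resolve(toDocxName(pdf.getFileName().toString()));
                     converter.accept(pdf, out); // plug the real conversion call in here
                     outputs.add(out);
                 });
        }
        return outputs;
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults: current folder in, "converted" folder out.
        Path in = Paths.get(args.length > 0 ? args[0] : ".");
        Path out = Paths.get(args.length > 1 ? args[1] : "converted");
        // This callback only logs; it does not convert anything.
        convertAll(in, out, (pdf, docx) -> System.out.println(pdf + " -> " + docx));
    }
}
```

Keeping the converter behind a callback means the same loop can drive any of the libraries discussed in this article.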

Java has been with us for quite a while and has helped many organizations and individuals get their work done, and each new version adds features. This article explains how to convert PDF files to Word documents in Java with the help of several third-party libraries.

Convert PDF Files to MS Word Documents (DOC/DOCX) in Java

PDF is one of the most commonly used formats for sending documents to third parties. The reason behind this popularity is that a PDF looks the same on every platform, regardless of hardware or software. However, in some cases you will want to convert a PDF document into an editable format, and DOC or DOCX is usually the first choice. This article shows how to automate that conversion and convert PDF to Word programmatically in Java.

So in this article, you will get to know how to:

  • Convert PDF to DOC using Java.
  • Convert PDF to DOCX format using Java.
  • Customize PDF to Word (DOC/DOCX) conversion.

Java PDF to Word Converter Library

Aspose.PDF for Java is a PDF manipulation API that provides easy ways to convert PDF files to a variety of other formats, including DOC and DOCX. You can download the API’s JAR file and add it to your project, or reference it using the following Maven configuration:

Repository

<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>https://repository.aspose.com/repo/</url>
</repository>

Dependency

<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-pdf</artifactId>
    <version>19.12</version>
</dependency>

Convert PDF to DOC using Java

Once you have referenced Aspose.PDF for Java in your application, you can convert any PDF document to DOC format in a couple of lines of code. The following are the steps required to perform this conversion.

  • Create an instance of the Document class and initialize it with the input PDF file’s path.
  • Call the Document.save() method with the output DOC file’s name and SaveFormat.Doc as arguments.

The following code sample shows how to convert PDF to DOC in Java.

import com.aspose.pdf.Document;
import com.aspose.pdf.SaveFormat;

// Load source PDF file
Document doc = new Document("input.pdf");
// Save resultant DOC file
doc.save("output.doc", SaveFormat.Doc);

Convert PDF to DOCX using Java

DOCX is the current format for Word documents. In contrast to the binary DOC format, a DOCX file is a ZIP package of XML files. If you want to convert PDF to DOCX, pass SaveFormat.DocX to the Document.save() method.

The following code sample shows how to convert PDF to DOCX in Java.

import com.aspose.pdf.Document;
import com.aspose.pdf.SaveFormat;

// Load source PDF file
Document doc = new Document("input.pdf");
// Save resultant DOCX file
doc.save("output.docx", SaveFormat.DocX);
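
A DOCX file is a ZIP package and a legacy DOC file is an OLE2 compound file, so the first bytes of a file reveal which of the two a conversion actually produced. The sketch below checks those signatures with the standard library only. FormatSniffer and its method names are hypothetical, and a ZIP signature alone only proves "ZIP container", not that the package is a valid Word document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: guesses a container format from a file's leading bytes.
final class FormatSniffer {

    private static final byte[] PDF = {'%', 'P', 'D', 'F'};
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Classify an in-memory prefix of a file.
    static String sniffBytes(byte[] head) {
        if (startsWith(head, PDF)) return "PDF";
        if (startsWith(head, ZIP)) return "ZIP package (DOCX, XLSX, ...)";
        if (startsWith(head, OLE2)) return "OLE2 compound file (DOC, XLS, ...)";
        return "unknown";
    }

    // Read at most eight bytes from the file and classify them.
    static String sniffFile(Path file) throws IOException {
        byte[] head = new byte[OLE2.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while (filled < head.length
                    && (read = in.read(head, filled, head.length - filled)) > 0) {
                filled += read;
            }
        }
        return sniffBytes(Arrays.copyOf(head, filled));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more file names, e.g. output.doc output.docx
        for (String name : args) {
            System.out.println(name + ": " + sniffFile(Paths.get(name)));
        }
    }
}
```

A check like this is a cheap guard in a batch job: a file named output.docx that does not start with the ZIP signature was not written as DOCX.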

Convert PDF to Word with Additional Options

Aspose.PDF for Java also provides additional options for PDF to Word conversion, such as the output format, image resolution, and distance between text lines. The DocSaveOptions class is used for this purpose.

The following code sample shows how to use DocSaveOptions class in PDF to DOCX conversion using Java.

import com.aspose.pdf.DocSaveOptions;
import com.aspose.pdf.Document;

// Load source PDF file
Document doc = new Document("input.pdf");
// Instantiate DocSaveOptions instance
DocSaveOptions saveOptions = new DocSaveOptions();
// Set output format
saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
// Set the recognition mode as Flow
saveOptions.setMode(DocSaveOptions.RecognitionMode.Flow);
// Set the horizontal proximity as 2.5
saveOptions.setRelativeHorizontalProximity(2.5f);
// Enable bullets recognition during conversion process
saveOptions.setRecognizeBullets(true);
// Save resultant DOCX file
doc.save("resultant.docx", saveOptions);

The main benefit of converting PDFs to Word documents is the ability to edit the text directly within the file. This is especially helpful if you want to make significant changes to your PDF. If most of the data in your PDF is in tabular form, you can convert it to an Excel spreadsheet instead. The following sections show how to convert a searchable PDF to Word and Excel, and how to convert PDF pages to images, using Spire.PDF for Java.

Installing Spire.Pdf.jar

If you create a Maven project, you can import the JAR into your application using the following configuration. For non-Maven projects, download the JAR file and add it manually as a dependency in your application.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>4.1.2</version>
    </dependency>
</dependencies>

Convert PDF to DOC or DOCX

Conversion from PDF to Word or Excel is straightforward with this library. Create a PdfDocument object, load the original PDF document, and then call the saveToFile() method to save it in .doc, .docx, .xls, or .xlsx format.

import com.spire.pdf.FileFormat;
import com.spire.pdf.PdfDocument;

public class ConvertPdfToWord {
    public static void main(String[] args) {
        // Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();
        // Load a PDF file
        pdf.loadFromFile("C:\\Users\\Administrator\\Desktop\\original.pdf");
        // Save to .docx file
        pdf.saveToFile("ToWord.docx", FileFormat.DOCX);
        pdf.close();
    }
}

Convert PDF to XLS or XLSX

import com.spire.pdf.FileFormat;
import com.spire.pdf.PdfDocument;

public class ConvertPdfToExcel {
    public static void main(String[] args) {
        // Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();
        // Load a PDF file
        pdf.loadFromFile("C:\\Users\\Administrator\\Desktop\\original.pdf");
        // Save to .xlsx file
        pdf.saveToFile("ToExcel.xlsx", FileFormat.XLSX);
        pdf.close();
    }
}

Convert PDF to PNG

 Converting PDF to images requires a little more code, but it’s not complicated at all. After a PDF file is loaded, call saveAsImage() method to save the specific page as image data. Then, write the data into a .png file by using the ImageIO.write() method.

import com.spire.pdf.PdfDocument;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class ConvertPdfToImage {

    public static void main(String[] args) throws IOException {

        // Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();

        // Load a PDF file
        pdf.loadFromFile("C:\\Users\\Administrator\\Desktop\\original.pdf");

        // Declare a BufferedImage variable
        BufferedImage image;

        // Loop through the pages
        for (int i = 0; i < pdf.getPages().getCount(); i++) {

            // Save the specific page as image data
            image = pdf.saveAsImage(i);

            // Write image data to png file
            File file = new File(String.format("out/ToImage-%d.png", i));
            ImageIO.write(image, "PNG", file);
        }
        pdf.close();
    }
}
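
ImageIO.write() does not create missing folders, so writing to a path such as out/ToImage-0.png fails when the out folder does not exist yet. Below is a minimal sketch of a safer write step that uses only the standard library. PngWriter and writePage are hypothetical names, and the BufferedImage created in main merely stands in for a page rendered by a PDF library.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: writes rendered pages as PNG files into a folder.
final class PngWriter {

    // Write one page to <dir>/ToImage-<index>.png, creating <dir> if needed.
    static File writePage(BufferedImage page, Path dir, int index) throws IOException {
        Files.createDirectories(dir);
        File target = dir.resolve(String.format("ToImage-%d.png", index)).toFile();
        // ImageIO.write returns false when no writer supports the requested format
        if (!ImageIO.write(page, "png", target)) {
            throw new IOException("No PNG writer available for " + target);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder image; a real program would pass each rendered PDF page.
        BufferedImage blank = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        System.out.println("Wrote " + writePage(blank, Paths.get("out"), 0));
    }
}
```

Checking the boolean result matters because ImageIO.write() reports an unsupported format by returning false rather than by throwing.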

The next example converts a PDF document to both Word formats (.doc and .docx) using Spire.PDF for Java with a few lines of code.

Entire Code

import com.spire.pdf.*;

public class PdfToDoc {

    public static void main(String[] args) {

        // Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        // Load a sample PDF file
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\Introduction of Spire.PDF for Java.pdf");

        // Save as .doc file
        doc.saveToFile("output/ToDoc.doc", FileFormat.DOC);

        // Save as .docx file
        doc.saveToFile("output/ToDocx.docx", FileFormat.DOCX);
        doc.close();
    }
}

Import JAR Dependency (2 Methods)

● Download the free API (Free Spire.PDF for Java) and unzip it, then add the Spire.Pdf.jar file to your project as a dependency.

● Add the JAR dependency directly to a Maven project by adding the following configuration to the pom.xml.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf.free</artifactId>
        <version>3.9.0</version>
    </dependency>
</dependencies>

Maven Dependencies

The next library we’ll look at is Pdf2Dom. Let’s start with the Maven dependencies we need to add to our project:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.25</version>
</dependency>
<dependency>
    <groupId>net.sf.cssbox</groupId>
    <artifactId>pdf2dom</artifactId>
    <version>2.0.1</version>
</dependency>

We’re going to use the first dependency to load the selected PDF file. The second dependency is responsible for the conversion itself. The latest versions of pdfbox-tools and pdf2dom can be found on Maven Central.

What’s more, we’ll use iText to extract the text from a PDF file and POI to create the .docx document.

Let’s take a look at Maven dependencies that we need to include in our project:

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>com.itextpdf.tool</groupId>
    <artifactId>xmlworker</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.15</version>
</dependency>

The latest versions of iText and Apache POI can be found on Maven Central.

PDF and HTML Conversions

To work with HTML files we’ll use Pdf2Dom, a PDF parser that converts documents to an HTML DOM representation. The obtained DOM tree can then be serialized to an HTML file or processed further.

For the opposite direction, HTML to PDF, we need XMLWorker, a library provided by iText.

PDF to HTML

Let’s have a look at a simple conversion from PDF to HTML:

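import java.io.File;
import java.io.PrintWriter;
import java.io.Writer;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;

private void generateHTMLFromPDF(String filename) throws Exception {
    // Loading, parsing, and writing all throw checked exceptions
    try (PDDocument pdf = PDDocument.load(new File(filename));
         Writer output = new PrintWriter("src/output/pdf.html", "utf-8")) {
        // Parse the PDF and write the resulting HTML to the writer
        new PDFDomTree().writeText(pdf, output);
    }
}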

In the code snippet above we load the PDF file, using the load API from PDFBox. With the PDF loaded, we use the parser to parse the file and write to output specified by java.io.Writer.

Note that converting PDF to HTML is never a 100%, pixel-to-pixel result. The results depend on the complexity and the structure of the particular PDF file.
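
When the HTML needs further processing before it is written, a DOM tree can be serialized by hand with the JDK’s own Transformer. The sketch below uses only the standard library and builds a tiny org.w3c.dom.Document as a stand-in for a converted PDF. DomToHtml, toHtml, and sampleDom are hypothetical names, and obtaining the real tree from Pdf2Dom is deliberately left out.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: serializes a DOM tree as HTML text.
final class DomToHtml {

    // Serialize the tree with the identity transformer in HTML output mode.
    static String toHtml(Document dom) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    // Stand-in for a converted document: <html><body><p>Hello</p></body></html>
    static Document sampleDom() throws ParserConfigurationException {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = dom.createElement("html");
        Element body = dom.createElement("body");
        Element paragraph = dom.createElement("p");
        paragraph.setTextContent("Hello");
        body.appendChild(paragraph);
        html.appendChild(body);
        dom.appendChild(html);
        return dom;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toHtml(sampleDom()));
    }
}
```

Any edits to the tree, such as removing or rewriting elements, would go between building the Document and calling toHtml.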

Java API to Convert PDF to Word

I will be using the GroupDocs.Conversion for Java API to convert PDF to DOCX. This API adds file conversion to Java applications without requiring any external software to be installed. It supports conversions among popular business document formats such as PDF, HTML, Email, Word, Excel, PowerPoint, Project, Photoshop, CorelDraw, AutoCAD, raster image formats, and many more. It can convert the whole document or only part of it to speed up the process. The API supports the popular operating systems (Windows, Linux, macOS) that are capable of running a Java runtime.

Download and Configure

You can download the JAR of the API or add the following pom.xml configuration to your Maven-based Java application to try the code examples.

<repository>
    <id>GroupDocsJavaAPI</id>
    <name>GroupDocs Java API</name>
    <url>http://repository.groupdocs.com/repo/</url>
</repository>
<dependency>
    <groupId>com.groupdocs</groupId>
    <artifactId>groupdocs-conversion</artifactId>
    <version>21.7</version>
</dependency>

Convert PDF to Word using Java

You can convert PDF documents to Word by following the simple steps given below:

  1. Create an instance of the Converter class
  2. Provide the input file path
  3. Create an instance of WordProcessingConvertOptions
  4. Set the start page number
  5. Provide total pages to convert
  6. Set output file format
  7. Call the convert() method with the output file path and the convert options
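
Steps 4 and 5 select a window of pages by start page and page count. The sketch below shows only that arithmetic in plain Java; it is not GroupDocs.Conversion code. PageWindow is a hypothetical helper that assumes 1-based page numbers and clamps the window to the document's real page count before the values are handed to a converter.

```java
// Hypothetical helper: a clamped window of pages, assuming 1-based page numbers.
final class PageWindow {

    final int firstPage; // 1-based, inclusive
    final int count;     // number of pages in the window after clamping

    private PageWindow(int firstPage, int count) {
        this.firstPage = firstPage;
        this.count = count;
    }

    // Build a window that never runs past the end of the document.
    static PageWindow of(int startPage, int requestedCount, int totalPages) {
        if (totalPages < 1) {
            throw new IllegalArgumentException("document has no pages");
        }
        if (startPage < 1 || startPage > totalPages) {
            throw new IllegalArgumentException("start page out of range: " + startPage);
        }
        if (requestedCount < 1) {
            throw new IllegalArgumentException("page count must be positive: " + requestedCount);
        }
        int available = totalPages - startPage + 1;
        return new PageWindow(startPage, Math.min(requestedCount, available));
    }

    int lastPage() {
        return firstPage + count - 1;
    }

    public static void main(String[] args) {
        // Ask for 5 pages starting at page 2 of a 4-page document.
        PageWindow window = PageWindow.of(2, 5, 4);
        // prints "Convert pages 2 to 4"
        System.out.println("Convert pages " + window.firstPage + " to " + window.lastPage());
    }
}
```

Validating the window up front gives a clear error message instead of relying on how a particular library reacts to an out-of-range page request.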

Conclusion

A Java program can convert PDF to Word documents with the help of a third-party library. This article covered Aspose.PDF for Java, Spire.PDF for Java, Pdf2Dom together with iText and Apache POI, and GroupDocs.Conversion for Java. The conversion can be customized through each library's options and runs on most operating systems, such as Linux and Windows.

Java is an important programming language, and many software developers use it to build all kinds of applications. The conversion APIs shown here offer a range of services that are useful to those developers.
