PDF is a common format for the documents we create, but sometimes a PDF has to be converted to a Word document, for example because the recipient needs to edit it or has no PDF viewer installed on their Windows computer or other operating system. Desktop tools for converting PDF to Word exist, but doing the conversion from Java is less straightforward: the JDK has no built-in PDF converter, so a third-party library is required.
You have probably run into situations where you wanted to convert a PDF document into a Word document, and there are good reasons to automate the job. For example, when a whole set of files is saved in PDF format, converting them one at a time by hand is not the best option; a batch process can convert them all in one run.
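The batch scenario can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: the class BatchPdfToWord, the PdfConverter interface and the method names are hypothetical and exist only for illustration. The actual conversion is left as a hook, so any of the libraries discussed in this article could be plugged in.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BatchPdfToWord {

    /** Hypothetical hook: wrap whichever conversion library you use. */
    @FunctionalInterface
    public interface PdfConverter {
        void convert(Path inputPdf, Path outputDocx) throws IOException;
    }

    /** Maps e.g. report.pdf to report.docx inside the output directory. */
    public static Path targetFor(Path pdf, Path outputDir) {
        String name = pdf.getFileName().toString();
        String base = name.substring(0, name.length() - ".pdf".length());
        return outputDir.resolve(base + ".docx");
    }

    /** Converts every *.pdf file directly inside inputDir and returns the files written. */
    public static List<Path> convertAll(Path inputDir, Path outputDir, PdfConverter converter)
            throws IOException {
        // Make sure the output directory exists before writing into it
        Files.createDirectories(outputDir);

        // Collect the PDF files first so the directory stream can be closed early
        List<Path> pdfs;
        try (Stream<Path> entries = Files.list(inputDir)) {
            pdfs = entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }

        // Convert the files one by one through the supplied hook
        List<Path> written = new ArrayList<>();
        for (Path pdf : pdfs) {
            Path target = targetFor(pdf, outputDir);
            converter.convert(pdf, target);
            written.add(target);
        }
        return written;
    }
}
```

A call such as BatchPdfToWord.convertAll(inputDir, outputDir, myConverter) would then process the whole folder, where myConverter is a lambda that delegates to the library of your choice.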
Java has been around for a long time and is widely used by organizations and individual developers, and each new release adds features. Both directions of conversion, Word to PDF and PDF to Word, can be done from Java; this article covers PDF to Word.
The sections below collect Java programs for converting PDF documents into Microsoft Word documents, using several different libraries.
Convert PDF Files to MS Word Documents (DOC/DOCX) in Java
PDF is one of the most commonly used formats for sending the document out to third parties. The reason behind this popularity is PDF’s compatibility across multiple platforms regardless of any hardware/software requirements. However, in some cases, you would want to convert the PDF document into an editable document format. PDF to DOC or DOCX format could be the priority conversion option in such cases. In order to automate the conversion process, this article showcases how to convert PDF to Word programmatically in Java.
So in this article, you will get to know how to:
- Convert PDF to DOC using Java.
- Convert PDF to DOCX format using Java.
- Customize PDF to Word (DOC/DOCX) conversion.
Java PDF to Word Converter Library
Aspose.PDF for Java is a PDF manipulation API that provides easy ways to convert PDF files to a variety of other formats, including DOC and DOCX. You can download the API's JAR file and add it to your project, or reference it using the following Maven configuration:
Repository
<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>https://repository.aspose.com/repo/</url>
</repository>
Dependency
<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-pdf</artifactId>
    <version>19.12</version>
</dependency>
Convert PDF to DOC using Java
Once you have referenced Aspose.PDF for Java in your application, you can convert any PDF document to DOC format in a couple of lines of code. The following are the steps required to perform this conversion.
- Create an instance of the Document class and initialize it with the input PDF file’s path.
- Call Document.save() method with the output DOC file’s name and SaveFormat.Doc arguments.
The following code sample shows how to convert PDF to DOC in Java.
import com.aspose.pdf.Document;
import com.aspose.pdf.SaveFormat;

// Load source PDF file
Document doc = new Document("input.pdf");
// Save resultant DOC file
doc.save("output.doc", SaveFormat.Doc);
Convert PDF to DOCX using Java
DOCX is the well-known current format for Word documents. In contrast to the binary DOC format, a DOCX file is a ZIP package of XML files. If you want to convert PDF to DOCX, pass the SaveFormat.DocX argument to the Document.save() method.
The following code sample shows how to convert PDF to DOCX in Java.
import com.aspose.pdf.Document;
import com.aspose.pdf.SaveFormat;

// Load source PDF file
Document doc = new Document("input.pdf");
// Save resultant DOCX file
doc.save("output.docx", SaveFormat.DocX);
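Because a DOCX file is a ZIP package while a legacy DOC file is an OLE2 binary container, the two can be told apart by their first bytes. The following is a minimal sketch, using only the standard library, for sanity-checking what a converter actually wrote. The class WordFormatSniffer and its method names are hypothetical. Note that the check only identifies the container: any ZIP archive, not just DOCX, starts with the same bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class WordFormatSniffer {

    // First four bytes of a ZIP local file header ("PK" 0x03 0x04); DOCX is a ZIP package
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // First four bytes of an OLE2 compound file, the container used by binary DOC
    private static final byte[] OLE2_MAGIC = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0};

    /** Returns "zip" (DOCX-like), "ole2" (DOC-like) or "unknown" for the given leading bytes. */
    public static String sniff(byte[] header) {
        if (startsWith(header, ZIP_MAGIC)) {
            return "zip";
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return "ole2";
        }
        return "unknown";
    }

    /** Reads up to four bytes from the file and classifies them. */
    public static String sniff(Path file) throws IOException {
        byte[] header = new byte[4];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or end of file
            while (filled < header.length) {
                int read = in.read(header, filled, header.length - filled);
                if (read < 0) {
                    break;
                }
                filled += read;
            }
        }
        return sniff(Arrays.copyOf(header, filled));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

For a file saved with the DOCX format, sniff(path) should report "zip"; for one saved with the DOC format it should report "ole2".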
Convert PDF to Word with Additional Options
Aspose.PDF for Java also provides some additional options that you can use in PDF to Word conversion, such as the output format, image resolution, distance between text lines and so on. DocSaveOptions class is used for this purpose and the following is the list of options you can use:
- setFormat(int value) – To set the output format (Doc, Docx, etc.).
- setAddReturnToLineEnd(boolean value) – To add the paragraph or line breaks.
- setImageResolutionX(int value) – To set the X resolution for the images.
- setImageResolutionY(int value) – To set the Y resolution for the images.
- setMaxDistanceBetweenTextLines(float value) – To group text lines into paragraphs.
- setMode(int value) – To set recognition mode.
- setRecognizeBullets(boolean value) – To switch the recognition of bullets on.
- setRelativeHorizontalProximity(float value) – To set the width of space between different text elements in the input PDF file.
The following code sample shows how to use DocSaveOptions class in PDF to DOCX conversion using Java.
import com.aspose.pdf.DocSaveOptions;
import com.aspose.pdf.Document;

// Load source PDF file
Document doc = new Document("input.pdf");
// Instantiate DocSaveOptions instance
DocSaveOptions saveOptions = new DocSaveOptions();
// Set output format
saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
// Set the recognition mode as Flow
saveOptions.setMode(DocSaveOptions.RecognitionMode.Flow);
// Set the horizontal proximity as 2.5
saveOptions.setRelativeHorizontalProximity(2.5f);
// Enable bullets recognition during conversion process
saveOptions.setRecognizeBullets(true);
// Save resultant DOCX file
doc.save("resultant.docx", saveOptions);
The main benefit of converting PDFs to Word documents is the ability to edit the text directly within the file. This is especially helpful if you want to make significant changes to your PDF. If most of the data in your PDF is tabular, you can convert it to an Excel spreadsheet instead. The following sections show how to convert a searchable PDF to Word and Excel, and how to convert PDF to images, using Spire.PDF for Java.
Installing Spire.Pdf.jar
If you create a Maven project, you can import the jar into your application using the following configuration. For non-Maven projects, download the jar file and add it manually as a dependency in your application.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>4.1.2</version>
    </dependency>
</dependencies>
Convert PDF to DOC or DOCX
Conversion from PDF to Word or Excel is quite straightforward by using this library. Create a PdfDocument object to load the original PDF document, and then call saveToFile() method to save PDF in .doc, .docx, .xls, or .xlsx file format.
import com.spire.pdf.FileFormat;
import com.spire.pdf.PdfDocument;

public class ConvertPdfToWord {
    public static void main(String[] args) {
        //Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();
        //Load a PDF file
        pdf.loadFromFile("C:\\Users\\Administrator\\Desktop\\original.pdf");
        //Save to .docx file
        pdf.saveToFile("ToWord.docx", FileFormat.DOCX);
        pdf.close();
    }
}
Convert PDF to XLS or XLSX
import com.spire.pdf.FileFormat;
import com.spire.pdf.PdfDocument;

public class ConvertPdfToExcel {
    public static void main(String[] args) {
        //Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();
        //Load a PDF file
        pdf.loadFromFile("C:\\Users\\Administrator\\Desktop\\original.pdf");
        //Save to .xlsx file
        pdf.saveToFile("ToExcel.xlsx", FileFormat.XLSX);
        pdf.close();
    }
}
Convert PDF to PNG
Converting PDF to images requires a little more code, but it’s not complicated at all. After a PDF file is loaded, call saveAsImage() method to save the specific page as image data. Then, write the data into a .png file by using the ImageIO.write() method.
import com.spire.pdf.PdfDocument;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class ConvertPdfToImage {
    public static void main(String[] args) throws IOException {
        //Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();
        //Load a PDF file
        pdf.loadFromFile("C:\\Users\\Administrator\\Desktop\\original.pdf");
        //Declare a BufferedImage variable
        BufferedImage image;
        //Loop through the pages
        for (int i = 0; i < pdf.getPages().getCount(); i++) {
            //Save the specific page as image data
            image = pdf.saveAsImage(i);
            //Write image data to png file
            File file = new File(String.format("out/ToImage-%d.png", i));
            ImageIO.write(image, "PNG", file);
        }
        pdf.close();
    }
}
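One detail worth handling in the image example: the write fails if the out directory does not exist, and ImageIO.write() returns false rather than throwing when no writer is found for the requested format. The sketch below isolates the file-writing step and covers both cases with the standard library only. The class PageImageWriter and the method writePage are hypothetical names, and the BufferedImage stands in for whatever the PDF library returns for a page.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PageImageWriter {

    /** Writes one page image as ToImage-<pageIndex>.png, creating the directory if needed. */
    public static Path writePage(BufferedImage image, Path outputDir, int pageIndex)
            throws IOException {
        // Create the output directory (and any missing parents) before writing
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(String.format("ToImage-%d.png", pageIndex));

        // ImageIO.write returns false when no suitable writer is available
        if (!ImageIO.write(image, "PNG", target.toFile())) {
            throw new IOException("No PNG writer available for " + target);
        }
        return target;
    }
}
```

Inside the page loop, the two lines that build the File and call ImageIO.write() could then be replaced by a single call such as PageImageWriter.writePage(image, Paths.get("out"), i).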
The article demonstrates how to convert PDF documents to Word (.doc and .docx) documents using Spire.PDF for Java with a few lines of code.
Entire Code
import com.spire.pdf.*;

public class PdfToDoc {

    public static void main(String[] args) {

        //create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        //load a sample PDF file
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\Introduction of Spire.PDF for Java.pdf");

        //save as .doc file
        doc.saveToFile("output/ToDoc.doc", FileFormat.DOC);

        //save as .docx file
        doc.saveToFile("output/ToDocx.docx", FileFormat.DOCX);
        doc.close();
    }
}
Import JAR Dependency (2 Methods)
● Download the free API (Free Spire.PDF for Java), unzip it, and add the Spire.Pdf.jar file to your project as a dependency.
● Add the jar dependency directly to a Maven project by adding the following configuration to pom.xml.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf.free</artifactId>
        <version>3.9.0</version>
    </dependency>
</dependencies>
Code Snippet
import com.spire.pdf.*;

public class ConvertPDF {
    public static void main(String[] args) {
        //Create a PdfDocument object
        PdfDocument doc = new PdfDocument();
        //Load the sample PDF file
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\The Scarlet Letter.pdf");
        //Save as .doc file
        doc.saveToFile("output/ToDoc.doc", FileFormat.DOC);
        //Save as .docx file
        doc.saveToFile("output/ToDocx.docx", FileFormat.DOCX);
        doc.close();
    }
}
Maven Dependencies
The first library we’ll look at is Pdf2Dom. Let’s start with the Maven dependencies we need to add to our project:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.25</version>
</dependency>
<dependency>
    <groupId>net.sf.cssbox</groupId>
    <artifactId>pdf2dom</artifactId>
    <version>2.0.1</version>
</dependency>
We’re going to use the first dependency to load the selected PDF file. The second dependency is responsible for the conversion itself. The latest versions can be found here: pdfbox-tools and pdf2dom.
What’s more, we’ll use iText to extract the text from a PDF file and POI to create the .docx document.
Let’s take a look at Maven dependencies that we need to include in our project:
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>com.itextpdf.tool</groupId>
    <artifactId>xmlworker</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.15</version>
</dependency>
The latest versions of iText and Apache POI are available from Maven Central.
PDF and HTML Conversions
To work with HTML files we'll use Pdf2Dom, a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree can then be serialized to an HTML file or processed further.
To convert in the other direction, from HTML to PDF, we need XMLWorker, a library provided by iText.
PDF to HTML
Let’s have a look at a simple conversion from PDF to HTML:
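// Needs java.io.*, javax.xml.parsers.ParserConfigurationException,
// org.apache.pdfbox.pdmodel.PDDocument and org.fit.pdfdom.PDFDomTree
private void generateHTMLFromPDF(String filename) throws IOException, ParserConfigurationException {
    // try-with-resources closes both the loaded PDF and the output writer
    try (PDDocument pdf = PDDocument.load(new File(filename));
         Writer output = new PrintWriter("src/output/pdf.html", "utf-8")) {
        // Parse the PDF and write the resulting HTML to the output
        new PDFDomTree().writeText(pdf, output);
    }
}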
In the code snippet above we load the PDF file, using the load API from PDFBox. With the PDF loaded, we use the parser to parse the file and write to output specified by java.io.Writer.
Note that converting PDF to HTML never gives a 100% pixel-perfect result. The outcome depends on the complexity and structure of the particular PDF file.
Java API to Convert PDF to Word
I will be using the GroupDocs.Conversion for Java API to convert PDF to DOCX. This API adds fast, efficient, and reliable file conversion to Java applications without installing any external software. It supports conversions among all popular business document formats such as PDF, HTML, Email, Word, Excel, PowerPoint, Project, Photoshop, CorelDraw, AutoCAD, raster image file formats, and many more. It also lets you render the whole document, or only part of it to speed up the process. The API is compatible with all Java versions and supports the popular operating systems (Windows, Linux, macOS) that are capable of running the Java runtime.
Download and Configure
You can download the JAR of the API or just add the following pom.xml configuration in your Maven-based Java application to try the below-mentioned code examples.
<repository>
    <id>GroupDocsJavaAPI</id>
    <name>GroupDocs Java API</name>
    <url>http://repository.groupdocs.com/repo/</url>
</repository>
<dependency>
    <groupId>com.groupdocs</groupId>
    <artifactId>groupdocs-conversion</artifactId>
    <version>21.7</version>
</dependency>
Convert PDF to Word using Java
You can convert PDF documents to Word by following the simple steps given below:
- Create an instance of the Converter class
- Provide the input file path
- Create an instance of WordProcessingConvertOptions
- Set the start page number
- Provide total pages to convert
- Set output file format
- Call the convert() method with the output file path and the convert options
Conclusion
A Java program can convert PDF to Word documents using any of the libraries covered above: Aspose.PDF, Spire.PDF, iText with Apache POI, or GroupDocs.Conversion. Such a program can be run from the command line, is easy to customize, and works on most operating systems, including Linux and Windows.
Java is an important programming language, and many software developers use it to build all kinds of applications. The conversion APIs available for it offer a wide range of services that are useful to developers.