How to Use Python to Reduce PDF File Size

PDF files are a popular, quick way to share documents: nobody wants to retype a document or peel sticky notes off a printed copy. However, PDF files can be large and take up a lot of storage space. This article discusses several ways to compress PDF files with Python.

PDF is a popular electronic document format. It has many pros, but it has some cons too. For me, PDF is one of the best formats, but it’s not perfect: file size, and the trade-off between size and quality, are common problems.

Compressing a PDF reduces its file size as much as possible while preserving the quality of the media it contains. As a result, the file is easier to store and share.

PDF is one of the most popular formats for sharing documents and books worldwide. The format was created so that a document looks the same on every platform. To view a PDF file, you need a tool known as a PDF reader; with one, any platform can display the content. However, it’s worth knowing that PDFs can take up quite a lot of space on your hard drive, typically because of embedded images, fonts, and uncompressed data streams.

This article covers the following ways of using Python to reduce PDF file size:

PDFNetPython3
Compress & optimize PDF files in Python
Writing Docstrings

PDFNetPython3

PDFNetPython3 is a wrapper for PDFTron SDK. With PDFTron components, you can build reliable and fast applications that view, create, print, edit, and annotate PDFs across various operating systems. Developers use PDFTron SDK to read, write, and edit PDF documents compatible with all published versions of the PDF specification (including the latest ISO 32000).

PDFTron is not freeware. It offers two types of licenses depending on whether you’re developing an external/commercial product or an in-house solution.

We will use the free trial version of this SDK for this tutorial. The goal is to develop a lightweight command-line utility that compresses PDF files using Python modules, without relying on utilities outside the Python ecosystem (e.g., Ghostscript).

Note that this tutorial only works for compressing PDF files, not arbitrary files.
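
Since the utility only handles PDF files, it can help to check the input before compressing it. The following is a minimal sketch, not part of the original tutorial: the helper name looks_like_pdf is hypothetical, and it only checks that the file starts with the "%PDF-" marker rather than validating the whole document.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF magic bytes."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    # PDF files begin with the marker "%PDF-" followed by a version number
    with file_path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"
```

Checking the first bytes is more reliable than checking the file extension, because a file can be renamed to .pdf without being one.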

To get started, let’s install the Python wrapper using pip:

$ pip install PDFNetPython3==8.1.0

Open up a new Python file and import necessary modules:

# Import Libraries
import os
import sys
from PDFNetPython3.PDFNetPython import PDFDoc, Optimizer, SDFDoc, PDFNet

Next, let’s define a function that prints the file size in an appropriate format (borrowed from another tutorial):

def get_size_format(b, factor=1024, suffix="B"):
    """
    Scale bytes to its proper byte format
    e.g:
        1253656 => '1.20MB'
        1253656678 => '1.17GB'
    """
    for unit in ["", "K", "M", "G", "T", "P", "E", "Z"]:
        if b < factor:
            return f"{b:.2f}{unit}{suffix}"
        b /= factor
    return f"{b:.2f}Y{suffix}"

Now let’s define our core function:

def compress_file(input_file: str, output_file: str):
    """Compress PDF file"""
    if not output_file:
        output_file = input_file
    initial_size = os.path.getsize(input_file)
    doc = None
    try:
        # Initialize the library
        PDFNet.Initialize()
        doc = PDFDoc(input_file)
        # Initialize the document's security handler (needed for encrypted PDFs)
        doc.InitSecurityHandler()
        # Optimize with the default settings: reduce PDF size by removing
        # redundant information and compressing data streams
        Optimizer.Optimize(doc)
        doc.Save(output_file, SDFDoc.e_linearized)
        doc.Close()
    except Exception as e:
        print("Error compress_file=", e)
        # doc is still None if opening the file failed
        if doc is not None:
            doc.Close()
        return False
    compressed_size = os.path.getsize(output_file)
    ratio = 1 - (compressed_size / initial_size)
    summary = {
        "Input File": input_file, "Initial Size": get_size_format(initial_size),
        "Output File": output_file, f"Compressed Size": get_size_format(compressed_size),
        "Compression Ratio": "{0:.3%}.".format(ratio)
    }
    # Printing Summary
    print("## Summary ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("###################################################################")
    return True

This function compresses a PDF file by removing redundant information and compressing the data streams; it then prints a summary showing the compression ratio and the size of the file after compression. It takes the PDF input_file and produces the compressed PDF output_file.
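
The compression ratio in the summary is the fraction of the original size that was saved. As a small sketch of that arithmetic on its own (the helper name compression_ratio is hypothetical, and the two sizes are the KB figures from this tutorial's sample run):

```python
def compression_ratio(initial_size: float, compressed_size: float) -> float:
    """Return the fraction of the original size saved by compression."""
    if initial_size <= 0:
        raise ValueError("initial_size must be positive")
    return 1 - (compressed_size / initial_size)


# Sizes in KB taken from the sample run in this tutorial
ratio = compression_ratio(757.00, 498.33)
print("Compression Ratio: {0:.1%}".format(ratio))
```

A negative ratio means the output file is larger than the input, which can happen when a PDF is already well compressed.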

Now let’s define our main code:

if __name__ == "__main__":
    # Parsing command line arguments entered by user
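    input_file = sys.argv[1]
    # The output path is optional; compress_file() overwrites the input when it is missing
    output_file = sys.argv[2] if len(sys.argv) > 2 else None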
    compress_file(input_file, output_file)

We simply get the input and output files from the command-line arguments and then use our defined compress_file() function to compress the PDF file.
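
Reading sys.argv directly works, but it gives no help text and fails with an IndexError when an argument is missing. As an alternative sketch (not the tutorial's own code), the standard argparse module can parse the same two arguments; the function name parse_args and the help strings here are illustrative choices.

```python
import argparse


def parse_args(argv=None):
    """Parse the input path and an optional output path."""
    parser = argparse.ArgumentParser(description="Compress a PDF file")
    parser.add_argument("input_file", help="Path of the PDF to compress")
    parser.add_argument(
        "output_file", nargs="?", default=None,
        help="Path of the compressed PDF (omit to overwrite the input)",
    )
    # argv=None makes argparse read from sys.argv[1:]
    return parser.parse_args(argv)


args = parse_args(["bert-paper.pdf", "bert-paper-min.pdf"])
print(args.input_file, args.output_file)
```

The parsed values could then be passed on as compress_file(args.input_file, args.output_file).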

Let’s test it out:

$ python pdf_compressor.py bert-paper.pdf bert-paper-min.pdf

The following is the output:

PDFNet is running in demo mode.
Permission: read     
Permission: optimizer
Permission: write
## Summary ########################################################
Input File:bert-paper.pdf
Initial Size:757.00KB
Output File:bert-paper-min.pdf
Compressed Size:498.33KB
Compression Ratio:34.171%.
###################################################################

As you can see, the result is a new compressed PDF file of 498KB instead of 757KB.

An alternative approach is to wrap Ghostscript, which must be installed separately. To get started, create an app.py file and paste in the following code:

app.py

#!/usr/bin/env python3
# Author: Sylvain Carlioz
# 6/03/2017
# MIT license -- free to use as you want, cheers.

"""
Simple Python wrapper script that uses Ghostscript to compress PDF files.
Compression levels:
    0: default
    1: prepress
    2: printer
    3: ebook
    4: screen
Dependency: Ghostscript.
On MacOSX install via command line `brew install ghostscript`.
"""

import argparse
import subprocess
import os.path
import sys
from shutil import copyfile


def compress(input_file_path, output_file_path, power=0):
    """Function to compress PDF via Ghostscript command line interface"""
    quality = {
        0: '/default',
        1: '/prepress',
        2: '/printer',
        3: '/ebook',
        4: '/screen'
    }

    # Basic controls
    # Check if valid path
    if not os.path.isfile(input_file_path):
        print("Error: invalid path for input PDF file")
        sys.exit(1)

    # Check if file is a PDF by extension
    if input_file_path.split('.')[-1].lower() != 'pdf':
        print("Error: input file is not a PDF")
        sys.exit(1)

    print("Compress PDF...")
    initial_size = os.path.getsize(input_file_path)
    subprocess.call(['gs', '-sDEVICE=pdfwrite', '-dCompatibilityLevel=1.4',
                     '-dPDFSETTINGS={}'.format(quality[power]),
                     '-dNOPAUSE', '-dQUIET', '-dBATCH',
                     '-sOutputFile={}'.format(output_file_path),
                     input_file_path])
    final_size = os.path.getsize(output_file_path)
    ratio = 1 - (final_size / initial_size)
    print("Compression by {0:.0%}.".format(ratio))
    print("Final file size is {0:.1f}MB".format(final_size / 1000000))
    print("Done.")


def main():
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument('input', help='Relative or absolute path of the input PDF file')
    parser.add_argument('-o', '--out', help='Relative or absolute path of the output PDF file')
    parser.add_argument('-c', '--compress', type=int, help='Compression level from 0 to 4')
    parser.add_argument('-b', '--backup', action='store_true', help="Backup the old PDF file")
    parser.add_argument('--open', action='store_true', default=False,
                        help='Open PDF after compression')
    args = parser.parse_args()

    # In case no compression level is specified, default is 2 '/printer'
    # (compare with None so that an explicit level 0 is not overridden)
    if args.compress is None:
        args.compress = 2
    # In case no output file is specified, store in temp file
    if not args.out:
        args.out = 'temp.pdf'

    # Run
    compress(args.input, args.out, power=args.compress)

    # In case no output file is specified, erase original file
    if args.out == 'temp.pdf':
        if args.backup:
            copyfile(args.input, args.input.replace(".pdf", "_BACKUP.pdf"))
        copyfile(args.out, args.input)
        os.remove(args.out)

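    # In case we want to open the file after compression
    # (`open` is the macOS command for launching the default viewer)
    if args.open:
        # temp.pdf has already been removed; the result now lives at the input path
        if args.out == 'temp.pdf':
            subprocess.call(['open', args.input])
        else:
            subprocess.call(['open', args.out])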

if __name__ == '__main__':
    main()
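
The script above passes any integer level straight into the quality dictionary, so a value such as 5 fails with a KeyError. As a hedged sketch of one way to tighten that, the Ghostscript command can be built in a separate function that validates the level first. The names QUALITY, build_gs_command, and ghostscript_available are hypothetical; the Ghostscript flags are the ones the script already uses.

```python
import shutil

# Compression levels mapped to Ghostscript presets, as in the script above
QUALITY = {0: "/default", 1: "/prepress", 2: "/printer", 3: "/ebook", 4: "/screen"}


def build_gs_command(input_path: str, output_path: str, power: int = 2) -> list:
    """Build the Ghostscript argument list for the given compression level."""
    if power not in QUALITY:
        raise ValueError("Compression level must be between 0 and 4")
    return [
        "gs", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS={}".format(QUALITY[power]),
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        "-sOutputFile={}".format(output_path),
        input_path,
    ]


def ghostscript_available() -> bool:
    """Return True if a `gs` executable is found on the PATH."""
    return shutil.which("gs") is not None


print(build_gs_command("in.pdf", "out.pdf", power=3))
```

Keeping the command as a list, rather than one shell string, means file names containing spaces are passed to Ghostscript unchanged.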

Compress & optimize PDF files in Python

The following is sample Python code for using PDFTron SDK to reduce PDF file size by removing redundant information and compressing data streams using the latest in image compression technology.

To run this sample, get started with a free trial of PDFTron SDK.

#---------------------------------------------------------------------------------------
# Copyright (c) 2001-2021 by PDFTron Systems Inc. All Rights Reserved.
# Consult LICENSE.txt regarding license information.
#---------------------------------------------------------------------------------------

import site
site.addsitedir("../../../PDFNetC/Lib")
import sys
from PDFNetPython import *

sys.path.append("../../LicenseKey/PYTHON")
from LicenseKey import *

#---------------------------------------------------------------------------------------
# The following sample illustrates how to reduce PDF file size using 'pdftron.PDF.Optimizer'.
# The sample also shows how to simplify and optimize PDF documents for viewing on mobile devices 
# and on the Web using 'pdftron.PDF.Flattener'.
#
# @note Both 'Optimizer' and 'Flattener' are separately licensable add-on options to the core PDFNet license.
#
# ----
#
# 'pdftron.PDF.Optimizer' can be used to optimize PDF documents by reducing the file size, removing 
# redundant information, and compressing data streams using the latest in image compression technology. 
#
# PDF Optimizer can compress and shrink PDF file size with the following operations:
# - Remove duplicated fonts, images, ICC profiles, and any other data stream. 
# - Optionally convert high-quality or print-ready PDF files to small, efficient and web-ready PDF. 
# - Optionally down-sample large images to a given resolution. 
# - Optionally compress or recompress PDF images using JBIG2 and JPEG2000 compression formats. 
# - Compress uncompressed streams and remove unused PDF objects.
# ----
#
# 'pdftron.PDF.Flattener' can be used to speed-up PDF rendering on mobile devices and on the Web by 
# simplifying page content (e.g. flattening complex graphics into images) while maintaining vector text 
# whenever possible.
#
# Flattener can also be used to simplify process of writing custom converters from PDF to other formats. 
# In this case, Flattener can be used as first step in the conversion pipeline to reduce any PDF to a 
# very simple representation (e.g. vector text on top of a background image). 
#---------------------------------------------------------------------------------------

def main():
    
    # Relative path to the folder containing the test files.
    input_path = "../../TestFiles/"
    output_path = "../../TestFiles/Output/"
    input_filename = "newsletter"
    
    # The first step in every application using PDFNet is to initialize the 
    # library and set the path to common PDF resources. The library is usually 
    # initialized only once, but calling Initialize() multiple times is also fine.
    PDFNet.Initialize(LicenseKey)
    
    #--------------------------------------------------------------------------------
    # Example 1) Simple optimization of a pdf with default settings.
    
    doc = PDFDoc(input_path + input_filename + ".pdf")
    doc.InitSecurityHandler()
    Optimizer.Optimize(doc)
    
    doc.Save(output_path + input_filename + "_opt1.pdf", SDFDoc.e_linearized)
    doc.Close()
    
    #--------------------------------------------------------------------------------
    # Example 2) Reduce image quality and use jpeg compression for
    # non monochrome images. 
    doc = PDFDoc(input_path + input_filename + ".pdf")
    doc.InitSecurityHandler()
    image_settings = ImageSettings()
    
    # low quality jpeg compression
    image_settings.SetCompressionMode(ImageSettings.e_jpeg)
    image_settings.SetQuality(1)
    
    # Set the output dpi to be standard screen resolution
    image_settings.SetImageDPI(144,96)
    
    # this option will recompress images not compressed with
    # jpeg compression and use the result if the new image
    # is smaller.
    image_settings.ForceRecompression(True)
    
    # this option is not commonly used since it can 
    # potentially lead to larger files. It should be enabled
    # only if the output compression specified should be applied
    # to every image of a given type regardless of the output image size
    #image_settings.ForceChanges(True)

    opt_settings = OptimizerSettings()
    opt_settings.SetColorImageSettings(image_settings)
    opt_settings.SetGrayscaleImageSettings(image_settings)

    # use the same settings for both color and grayscale images
    Optimizer.Optimize(doc, opt_settings)
    
    doc.Save(output_path + input_filename + "_opt2.pdf", SDFDoc.e_linearized)
    doc.Close()
    
    #--------------------------------------------------------------------------------
    # Example 3) Use monochrome image settings and default settings
    # for color and grayscale images. 
    
    doc = PDFDoc(input_path + input_filename + ".pdf")
    doc.InitSecurityHandler()

    mono_image_settings = MonoImageSettings()
    
    mono_image_settings.SetCompressionMode(MonoImageSettings.e_jbig2)
    mono_image_settings.ForceRecompression(True)

    opt_settings = OptimizerSettings()
    opt_settings.SetMonoImageSettings(mono_image_settings)
    
    Optimizer.Optimize(doc, opt_settings)
    doc.Save(output_path + input_filename + "_opt3.pdf", SDFDoc.e_linearized)
    doc.Close()
	
    # ----------------------------------------------------------------------
    # Example 4) Use Flattener to simplify content in this document
    # using default settings
    
    doc = PDFDoc(input_path + "TigerText.pdf")
    doc.InitSecurityHandler()
    
    fl = Flattener()
    # The following lines can increase the resolution of background
    # images.
    #fl.SetDPI(300)
    #fl.SetMaximumImagePixels(5000000)

    # This line can be used to output Flate compressed background
    # images rather than DCTDecode compressed images which is the default
    #fl.SetPreferJPG(False)

    # In order to adjust thresholds for when text is Flattened
    # the following function can be used.
    #fl.SetThreshold(Flattener.e_threshold_keep_most)

    # We use e_fast option here since it is usually preferable
    # to avoid Flattening simple pages in terms of size and 
    # rendering speed. If the desire is to simplify the 
    # document for processing such that it contains only text and
    # a background image e_simple should be used instead.
    fl.Process(doc, Flattener.e_fast)
    doc.Save(output_path + "TigerText_flatten.pdf", SDFDoc.e_linearized)
    doc.Close()

    # ----------------------------------------------------------------------
    # Example 5) Optimize a PDF for viewing using SaveViewerOptimized.
    
    doc = PDFDoc(input_path + input_filename + ".pdf")
    doc.InitSecurityHandler()
    
    opts = ViewerOptimizedOptions()

    # set the maximum dimension (width or height) that thumbnails will have.
    opts.SetThumbnailSize(1500)

    # set thumbnail rendering threshold. A number from 0 (include all thumbnails) to 100 (include only the first thumbnail) 
    # representing the complexity at which SaveViewerOptimized would include the thumbnail. 
    # By default it only produces thumbnails on the first and complex pages. 
    # The following line will produce thumbnails on every page.
    # opts.SetThumbnailRenderingThreshold(0) 

    doc.SaveViewerOptimized(output_path + input_filename + "_SaveViewerOptimized.pdf", opts)
    doc.Close()
    PDFNet.Terminate()
    
if __name__ == '__main__':
    main()
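
The sample writes several output files with different settings, so a natural next step is to compare their sizes. The following is a small sketch using only the standard library; the helper name smallest_file is hypothetical and is not part of PDFTron SDK.

```python
import os


def smallest_file(paths):
    """Return (path, size_in_bytes) for the smallest existing file in paths."""
    # Skip paths that do not exist, e.g. when one example failed to run
    sizes = {p: os.path.getsize(p) for p in paths if os.path.isfile(p)}
    if not sizes:
        raise FileNotFoundError("None of the given files exist")
    best = min(sizes, key=sizes.get)
    return best, sizes[best]


# Example call, assuming the sample above has already been run:
# smallest_file([
#     "../../TestFiles/Output/newsletter_opt1.pdf",
#     "../../TestFiles/Output/newsletter_opt2.pdf",
#     "../../TestFiles/Output/newsletter_opt3.pdf",
# ])
```

The smallest file is not automatically the best choice: example 2 lowers image quality, so the result is worth checking visually as well.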

Writing Docstrings

Depending on the complexity of the function, method, or class being written, a one-line docstring may be perfectly appropriate. These are generally used for really obvious cases, such as:

def add(a, b):
    """Add two numbers and return the result."""
    return a + b

The docstring should describe the function in a way that is easy to understand. For simple cases like trivial functions and classes, simply embedding the function’s signature (i.e. add(a, b) -> result) in the docstring is unnecessary. This is because with Python’s inspect module, it is already quite easy to find this information if needed, and it is also readily available by reading the source code.
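
As a short illustration of that point, the inspect module from the standard library can report both the signature and the docstring at runtime, so neither needs to be repeated by hand:

```python
import inspect


def add(a, b):
    """Add two numbers and return the result."""
    return a + b


# The signature and the docstring are both available at runtime
print(inspect.signature(add))  # prints: (a, b)
print(inspect.getdoc(add))     # prints: Add two numbers and return the result.
```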

In larger or more complex projects, however, it is often a good idea to give more information about a function: what it does, any exceptions it may raise, what it returns, and relevant details about the parameters.

For more detailed documentation, a popular style is the one used by the NumPy project, often called NumPy style docstrings. While it can take up more lines than the previous example, it allows the developer to include a lot more information about a method, function, or class.

def random_number_generator(arg1, arg2):
    """
    Summary line.

    Extended description of function.

    Parameters
    ----------
    arg1 : int
        Description of arg1
    arg2 : str
        Description of arg2

    Returns
    -------
    int
        Description of return value

    """
    return 42

The sphinx.ext.napoleon plugin allows Sphinx to parse this style of docstrings, making it easy to incorporate NumPy style docstrings into your project.
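
Enabling the plugin is a matter of listing it in the Sphinx configuration file. The fragment below is a minimal sketch of a conf.py, assuming a project that also uses sphinx.ext.autodoc to pull docstrings from the code; the rest of the configuration is omitted.

```python
# conf.py (Sphinx configuration file)

# autodoc pulls docstrings from the code; napoleon parses their style
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
]

# Parse NumPy style docstrings only
napoleon_numpy_docstring = True
napoleon_google_docstring = False
```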

At the end of the day, it doesn’t really matter what style is used for writing docstrings; their purpose is to serve as documentation for anyone who may need to read or make changes to your code. As long as it is correct, understandable, and gets the relevant points across then it has done the job it was designed to do.

Conclusion

Many people upload their DOC and PDF files to the internet without password protection, where anyone with a browser can access them. This article showed how to reduce PDF file size using Python. Compressed files are much smaller than the originals, which makes them easier to upload and share.

Python has many uses: it is open source, easy to use, and suitable for developing many kinds of applications. It is also a good way to manipulate PDF documents while keeping file sizes small, which is useful for web development.
