Get the number of pages in a PDF document


Php Problem Overview


This question is for referencing and comparing. The solution is the accepted answer below.

I spent many hours searching for a fast and easy, but above all accurate, way to get the number of pages in a PDF document. I work for a graphic printing and reproduction company that handles a lot of PDFs, so the number of pages in a document must be known precisely before it is processed. The documents come from many different clients, so they aren't generated with the same application and don't all use the same compression method.

Here are some of the answers I found insufficient or simply NOT working:

Using Imagick (a PHP extension)

Imagick requires a lot of installation work and an Apache restart. When I finally had it working, it took amazingly long to process (2-3 minutes per document) and it always returned 1 page for every document (I haven't seen a working copy of Imagick so far), so I threw it away. That was with both the getNumberImages() and identifyImage() methods.

Using FPDI (a PHP library)

FPDI is easy to use and install (just extract files and call a PHP script), BUT many of the compression techniques are not supported by FPDI. It then returns an error:

> FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.

Opening a stream and searching with a regular expression

This opens the PDF file as a stream and searches for a string that contains the page count or something similar.

$f = "test1.pdf";
$stream = fopen($f, "rb");  // binary-safe read
if(!$stream)
    return 0;

$content = fread($stream, filesize($f));
fclose($stream);

if(!$content)
    return 0;

$count = 0;
// Regular Expressions found by Googling (all linked to SO answers):
$regex  = "/\/Count\s+(\d+)/";
$regex2 = "/\/Page\W*(\d+)/";
$regex3 = "/\/N\s+(\d+)/";

// $matches[1] holds the captured numbers; take the largest one
if(preg_match_all($regex, $content, $matches))
    $count = max(array_map('intval', $matches[1]));

return $count;
  • /\/Count\s+(\d+)/ (looks for /Count <number>) doesn't work because only a few documents contain a readable /Count entry, so most of the time it returns nothing.
  • /\/Page\W*(\d+)/ (looks for /Page<number>) doesn't return the number of pages; what it matches is mostly other data.
  • /\/N\s+(\d+)/ (looks for /N <number>) doesn't work either, as a document can contain multiple /N values, most of which, if not all, are not the page count.
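To make the first point concrete, here is a minimal sketch that runs the /Count pattern over two made-up fragments (both strings are invented for illustration, not taken from a real file, and findCountValues is a hypothetical helper name). When the page tree is stored uncompressed, the pattern can match several /Count entries, one per node of the page tree, so picking the maximum is only a guess. When the page tree sits inside a compressed object stream, there is nothing to match at all.

```php
<?php
// Hypothetical helper: collect every "/Count <n>" value found in raw PDF bytes.
function findCountValues(string $content): array
{
    preg_match_all('/\/Count\s+(\d+)/', $content, $matches);
    return array_map('intval', $matches[1]);
}

// Invented fragment: a page tree root with 13 pages and one intermediate node with 5.
$uncompressed = "<< /Type /Pages /Count 13 /Kids [2 0 R 9 0 R] >>\n"
              . "<< /Type /Pages /Count 5 /Kids [3 0 R 4 0 R] >>";

// Invented fragment: the page tree lives in a compressed object stream,
// so only opaque bytes are visible to the regular expression.
$compressed = "<< /Type /ObjStm /N 20 /Filter /FlateDecode >>\nstream\n\x78\x9c\x01\x02\nendstream";

$a = findCountValues($uncompressed); // several candidates, not one answer
$b = findCountValues($compressed);   // no candidates at all

echo count($a), " match(es), largest: ", max($a), "\n";
echo count($b), " match(es)\n";
```

Outline (bookmark) dictionaries also carry /Count entries, which is another reason the largest match is not guaranteed to be the page count.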

> ### So, what does work reliably and accurately?
>
> See the answer below.

Php Solutions


Solution 1 - Php

A simple command line executable called: pdfinfo.

It is downloadable for Linux and Windows. You download a compressed file containing several little PDF-related programs. Extract it somewhere.

One of those files is pdfinfo (or pdfinfo.exe for Windows). An example of data returned by running it on a PDF document:

Title:          test1.pdf
Author:         John Smith
Creator:        PScript5.dll Version 5.2.2
Producer:       Acrobat Distiller 9.2.0 (Windows)
CreationDate:   01/09/13 19:46:57
ModDate:        01/09/13 19:46:57
Tagged:         yes
Form:           none
Pages:          13    <-- This is what we need
Encrypted:      no
Page size:      2384 x 3370 pts (A0)
File size:      17569259 bytes
Optimized:      yes
PDF version:    1.6

I haven't yet seen a PDF document for which it returned a wrong page count. It is also really fast: even with big documents of 200+ MB, the response time is just a few seconds or less.

There is an easy way of extracting the pagecount from the output, here in PHP:

// Make a function for convenience 
function getPDFPages($document)
{
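    // Set the path for your platform (keep only one of these)
    $cmd = "/path/to/pdfinfo";              // Linux
    // $cmd = "C:\\path\\to\\pdfinfo.exe";  // Windows
    
    // Parse entire output
    // escapeshellarg() quotes the file name, so spaces and shell
    // metacharacters in it are handled safely
    exec($cmd . " " . escapeshellarg($document), $output);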

    // Iterate through lines
    $pagecount = 0;
    foreach($output as $op)
    {
        // Extract the number
        if(preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1)
        {
            $pagecount = intval($matches[1]);
            break;
        }
    }
    
    return $pagecount;
}

// Use the function
echo getPDFPages("test 1.pdf");  // Output: 13

Of course this command line tool can be used in other languages that can parse output from an external program, but I use it in PHP.

I know it's not pure PHP, but external programs are way better at PDF handling (as seen in the question).

I hope this can help people, because I have spent a whole lot of time trying to find the solution to this and I have seen a lot of questions about PDF pagecount in which I didn't find the answer I was looking for. That's why I made this question and answered it myself.

Security Notice: Use escapeshellarg on $document if document name is being fed from user input or file uploads.
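The security notice can be folded into the function itself. A minimal sketch, assuming pdfinfo can be found on the PATH; the names pagesFromPdfinfoOutput and getPdfPagesSafe are made up for this example. It escapes the file name, checks the exit status, and returns null instead of 0 when no page count could be read, so that a failure can be told apart from a real result.

```php
<?php
// Hypothetical helper: pull the page count out of pdfinfo's output lines.
function pagesFromPdfinfoOutput(array $lines): ?int
{
    foreach ($lines as $line) {
        // Anchored at the line start so "Page size:" is never mistaken for "Pages:"
        if (preg_match('/^Pages:\s*(\d+)/i', $line, $m) === 1) {
            return (int) $m[1];
        }
    }
    return null; // no "Pages:" line found
}

// Hypothetical wrapper; assumes the pdfinfo binary can be found on the PATH.
function getPdfPagesSafe(string $document, string $cmd = 'pdfinfo'): ?int
{
    $output = [];
    $status = 0;
    // escapeshellarg() quotes the file name, so spaces and shell
    // metacharacters in user-supplied names are passed through literally.
    exec($cmd . ' ' . escapeshellarg($document), $output, $status);

    // A non-zero exit status means the command failed (missing file, damaged PDF, ...).
    if ($status !== 0) {
        return null;
    }
    return pagesFromPdfinfoOutput($output);
}
```

Called as getPdfPagesSafe("test 1.pdf"), it should give the same number as the function above for a readable document, and null when the command fails.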

Solution 2 - Php

The simplest of all is to use ImageMagick.

Here is some sample code:

$image = new Imagick();
$image->pingImage('myPdfFile.pdf');
echo $image->getNumberImages();

Otherwise, you can also use PDF libraries for PHP such as MPDF or TCPDF.

Solution 3 - Php

You can use qpdf as shown below. For a file file_name.pdf that has 100 pages:

$ qpdf --show-npages file_name.pdf
100
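Since the question is about PHP, here is a rough sketch of calling qpdf from a script. The function names are hypothetical, and it assumes qpdf is installed and on the PATH. With --show-npages the tool prints just the number, so there is little to parse; the helper only checks that the output really is a single line of digits.

```php
<?php
// Hypothetical helper: accept the output only if it is a single line of digits.
function npagesFromQpdfOutput(array $lines): ?int
{
    if (count($lines) === 1 && preg_match('/^\d+$/', trim($lines[0])) === 1) {
        return (int) trim($lines[0]);
    }
    return null; // empty or unexpected output
}

// Hypothetical wrapper; assumes the qpdf binary is on the PATH.
function qpdfPageCount(string $file): ?int
{
    $output = [];
    // escapeshellarg() keeps file names with spaces or shell metacharacters safe
    exec('qpdf --show-npages ' . escapeshellarg($file), $output);
    return npagesFromQpdfOutput($output);
}
```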

Solution 4 - Php

Here is a simple example to get the number of pages in PDF with PHP.

<?php

function count_pdf_pages($pdfname) {
  $pdftext = file_get_contents($pdfname);
  $num = preg_match_all("/\/Page\W/", $pdftext, $dummy);

  return $num;
}

$pdfname = 'example.pdf'; // Put your PDF path
$pages = count_pdf_pages($pdfname);

echo $pages;

?>

Solution 5 - Php

If you can't install any additional packages, you can use this simple one-liner:

foundPages=$(strings < "$PDF_FILE" | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)

Solution 6 - Php

Since you're ok with using command line utilities, you can use cpdf (Microsoft Windows/Linux/Mac OS X). To obtain the number of pages in one PDF:

cpdf.exe -pages "my file.pdf"

Solution 7 - Php

I created a wrapper class for pdfinfo in case it's useful to anyone, based on Richard's answer:

/**
 * Wrapper for pdfinfo program, part of xpdf bundle
 * http://www.xpdfreader.com/about.html
 * 
 * this will put all pdfinfo output into keyed array, then make them accessible via getValue
 */
class PDFInfoWrapper {
	
	const PDFINFO_CMD = 'pdfinfo';
	
	/**
	 * keyed array to hold all the info
	 */
	protected $info = array();
	
	/**
	 * raw output in case we need it
	 */
	public $raw = "";
	
	/**
	 * Constructor
	 * @param string $filePath - path to file
	 */
	public function __construct($filePath) {
		exec(self::PDFINFO_CMD . ' ' . escapeshellarg($filePath), $output);
		
		//loop each line and split into key and value
		foreach($output as $line) {
			$colon = strpos($line, ':');
			if($colon) {
				$key = trim(substr($line, 0, $colon));
				$val = trim(substr($line, $colon + 1));
				
				//use strtolower to make case insensitive
				$this->info[strtolower($key)] = $val;
			}
		}
		
		//store the raw output
		$this->raw = implode("\n", $output);
		
	}
	
	/**
	 * get a value
	 * @param string $key - key name, case insensitive
	 * @return string|null value, or null if the key is not present
	 */
	public function getValue($key) {
		return $this->info[strtolower($key)] ?? null;
	}
	
	/**
	 * list all the keys
	 * @return array of key names
	 */
	public function getAllKeys() {
		return array_keys($this->info);
	}
	
}
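The core of the class is the split on the first colon. Below is a standalone sketch of that step (the function name is made up), run against lines copied from the sample pdfinfo output in Solution 1. Splitting on the first colon only is what keeps values such as dates, which contain colons themselves, intact.

```php
<?php
// Hypothetical standalone version of the parsing loop in the constructor.
function parsePdfinfoLines(array $lines): array
{
    $info = [];
    foreach ($lines as $line) {
        $colon = strpos($line, ':');
        if ($colon === false || $colon === 0) {
            continue; // not a "Key: value" line
        }
        // Lower-case keys make lookups case insensitive
        $key = strtolower(trim(substr($line, 0, $colon)));
        $info[$key] = trim(substr($line, $colon + 1));
    }
    return $info;
}

$info = parsePdfinfoLines([
    'Pages:          13',
    'CreationDate:   01/09/13 19:46:57',
    'Page size:      2384 x 3370 pts (A0)',
]);

echo (int) $info['pages'], "\n";   // the page count as an integer
echo $info['creationdate'], "\n";  // the value keeps its own colons
```

All values come back as strings, so the page count needs a cast before it is used in arithmetic.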

Solution 8 - Php

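This seems to work pretty well, without the need for PHP extensions or for parsing the command output. It relies on ImageMagick's identify command-line tool, which prints one line per page.

<?php

$target_pdf = "multi-page-test.pdf";
// escapeshellarg() keeps file names with spaces or shell metacharacters safe
$cmd = sprintf("identify %s", escapeshellarg($target_pdf));
exec($cmd, $output);
$pages = count($output);  // one output line per page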

Solution 9 - Php

This simple one-liner seems to do the job well:

strings "$path_to_pdf" | grep Kids | grep -o R | wc -l

There is a block in the PDF file that lists the pages in this funky string:

/Kids [3 0 R 4 0 R 5 0 R 6 0 R 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R 13 0 R 14 0 R 15 0 R 16 0 R 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R 22 0 R 23 0 R 24 0 R 25 0 R 26 0 R 27 0 R 28 0 R 29 0 R 30 0 R 31 0 R 32 0 R 33 0 R 34 0 R 35 0 R 36 0 R 37 0 R 38 0 R 39 0 R 40 0 R 41 0 R]

The number of 'R' characters is the number of pages.

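For comparison with the PHP approaches, here is the same idea as a small PHP sketch (the function name is invented for this example). It counts the indirect references of the form "obj gen R" inside the first /Kids array it finds. This only equals the page count when the page tree is flat and stored uncompressed; with nested page tree nodes, the first /Kids array lists intermediate nodes rather than pages.

```php
<?php
// Hypothetical helper: count "obj gen R" references in the first /Kids array.
function countKidsReferences(string $content): ?int
{
    if (preg_match('/\/Kids\s*\[([^\]]*)\]/', $content, $m) !== 1) {
        return null; // no readable /Kids array
    }
    // Counting whole references is stricter than counting the letter R
    return preg_match_all('/\d+\s+\d+\s+R\b/', $m[1]);
}

// Invented fragment with three page references.
$fragment = '<< /Type /Pages /Count 3 /Kids [3 0 R 4 0 R 5 0 R] >>';
echo countKidsReferences($fragment), "\n";
```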

Solution 10 - Php

You can use mutool.

mutool show FILE.pdf trailer/Root/Pages/Count

mutool is part of the MuPDF software package.

Solution 11 - Php

Here is an R function that reports the number of pages in a PDF file by using the pdfinfo command.

pdf.file.page.number <- function(fname) {
    # shQuote() protects file names that contain spaces or shell metacharacters
    a <- pipe(paste("pdfinfo", shQuote(fname), "| grep Pages | cut -d: -f2"))
    page.number <- as.numeric(readLines(a))
    close(a)
    page.number
}
if (F) {
    pdf.file.page.number("a.pdf")
}

Solution 12 - Php

Here is a Windows command script using Ghostscript that reports the number of pages in a PDF file:

@echo off
echo.
rem
rem this file: getlastpagenumber.cmd
rem version 0.1 from commander 2015-11-03
rem need Ghostscript e.g. download and install from http://www.ghostscript.com/download/
rem Install path "C:\prg\ghostscript" for using the script without changes \\ and have less problems with UAC
rem

:vars
  set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
  set __lastpagenumber__=1
  set __pdffile__="%~1"
  set __pdffilename__="%~n1"
  set __datetime__=%date%%time%
  set __datetime__=%__datetime__:.=%
  set __datetime__=%__datetime__::=%
  set __datetime__=%__datetime__:,=%
  set __datetime__=%__datetime__:/=% 
  set __datetime__=%__datetime__: =% 
  set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"

:check
  if %__pdffile__%=="" goto error1
  if not exist %__pdffile__% goto error2
  if not exist %__gs__% goto error3

:main
  %__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE  -sstdout=%__tmpfile__%  %__pdffile__%
  FOR /F " tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A
  set __lastpagenumber__=%__lastpagenumber__: =%
  if exist %__tmpfile__% del %__tmpfile__%
  
:output
  echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
  goto end
  
:error1
  echo no pdf file selected
  echo usage: %~n0 PDFFILE
  goto end

:error2
  echo no pdf file found
  echo usage: %~n0 PDFFILE
  goto end

:error3
  echo.can not find the ghostscript bin file
  echo.   %__gs__%
  echo.please download it from:
  echo.   http://www.ghostscript.com/download/
  echo.and install to "C:\prg\ghostscript"
  goto end
  
:end
  exit /b


Solution 13 - Php

The pdf_info() function from the R package pdftools provides information on the number of pages in a PDF.

library(pdftools)
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
info <- pdf_info(pdf_file)
nbpages <- info[2]
nbpages

$pages
[1] 65

Solution 14 - Php

If you have access to a shell, the simplest (but not usable on 100% of PDFs) approach is to use grep.

This should return just the number of pages:

grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' file.pdf

Example: https://regex101.com/r/BrUTKn/1

Switches description:

  • -m 1 is necessary, as some files can have more than one match of the regex pattern (volunteer needed to replace this with a regex that matches only the first occurrence)
  • -a is necessary to treat the binary file as text
  • -o to show only the match
  • -P to use Perl regular expression

Regex explanation:

  • starting "delimiter": (?<=\/N ) lookbehind of /N (nb. space character not seen here)
  • actual result: \d+ any number of digits
  • ending "delimiter": (?=\/) lookahead of /

Nota bene: if in some case match is not found, it's safe to assume only 1 page exists.
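The same pattern carries over to PHP's PCRE functions unchanged. A minimal sketch with an invented function name, tried on a made-up dictionary fragment of the shape the pattern expects; it returns null when nothing matches and leaves the choice of a fallback value to the caller.

```php
<?php
// Hypothetical helper: first number that follows "/N " and is followed by "/".
function firstNValue(string $content): ?int
{
    // preg_match stops at the first match, like grep -m 1 does per file here
    if (preg_match('/(?<=\/N )\d+(?=\/)/', $content, $m) === 1) {
        return (int) $m[0];
    }
    return null;
}

// Invented fragment for illustration only.
$fragment = '<</Linearized 1/L 17569259/O 15/N 13/T 17568000>>';
echo firstNValue($fragment), "\n";
```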

Solution 15 - Php

You often see the regex /\/Page\W/ suggested, but it doesn't work for me on several PDF files. So here is another regular expression that works for me.

$pdf = file_get_contents($path_pdf);
// Match against the file contents ($pdf), not the path
return preg_match_all("/[<|>][\r\n|\r|\n]*\/Type\s*\/Page\W/", $pdf, $dummy);

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Richard de Wit | View Question on Stackoverflow |
| Solution 1 - Php | Richard de Wit | View Answer on Stackoverflow |
| Solution 2 - Php | Kuldeep Dangi | View Answer on Stackoverflow |
| Solution 3 - Php | SuperNova | View Answer on Stackoverflow |
| Solution 4 - Php | Purvik Dhorajiya | View Answer on Stackoverflow |
| Solution 5 - Php | Muad'Dib | View Answer on Stackoverflow |
| Solution 6 - Php | Franck Dernoncourt | View Answer on Stackoverflow |
| Solution 7 - Php | james-geldart | View Answer on Stackoverflow |
| Solution 8 - Php | dhildreth | View Answer on Stackoverflow |
| Solution 9 - Php | dryliketoast | View Answer on Stackoverflow |
| Solution 10 - Php | ninfito | View Answer on Stackoverflow |
| Solution 11 - Php | Feiming Chen | View Answer on Stackoverflow |
| Solution 12 - Php | commander | View Answer on Stackoverflow |
| Solution 13 - Php | emeryville | View Answer on Stackoverflow |
| Solution 14 - Php | Saran | View Answer on Stackoverflow |
| Solution 15 - Php | hulky | View Answer on Stackoverflow |