I am generating PDFs dynamically. How can I check the number of pages in a PDF using a shell script?
10 Answers
Without any extra packages:
strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
| sort -rn | head -n 1
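For context, here is a sketch of what that pipeline does. Each node of a PDF page tree carries a /Count entry giving the number of pages beneath it, so the largest value found is taken as the document total. The sample lines below are hand-written stand-ins for `strings` output, and `max_count` is a hypothetical name used only for illustration. Because the approach scans raw bytes, it can miss the entry when the page tree is stored in compressed object streams.

```shell
#!/bin/sh
# Hypothetical helper: print the largest /Count value seen on stdin.
max_count() {
    # Extract the digits after "/Count" (an optional minus sign is skipped),
    # then sort numerically in descending order and keep the first value.
    sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1
}

# Hand-written sample (an assumption, not output from a real PDF):
# a root node with 5 pages and two child nodes with 2 and 3 pages.
printf '%s\n' \
    '<< /Type /Pages /Count 5 /Kids [ 2 0 R 3 0 R ] >>' \
    '<< /Type /Pages /Count 2 /Parent 1 0 R >>' \
    '<< /Type /Pages /Count 3 /Parent 1 0 R >>' \
    | max_count
```

With a real file, the equivalent call would be `strings < file.pdf | max_count`.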
Using pdfinfo:
pdfinfo file.pdf | awk '/^Pages:/ {print $2}'
Using pdftk:
pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'
You can also recursively sum the total number of pages across all PDFs with pdfinfo, like this:
find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
awk '/^Pages:/ {n += $2} END {print n}'
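The ImageMagick suite provides a tool called identify which, combined with counting the lines of its output, gets you what you are after. ImageMagick is easy to install on OS X using brew.
Here is a working bash script that captures the count in a shell variable and prints it back to the screen: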
#!/bin/bash
pdfFile=$1
echo "Processing $pdfFile"
numberOfPages=$(/usr/local/bin/identify "$pdfFile" 2>/dev/null | wc -l | tr -d ' ')
# identify prints one line of info per page; stderr is sent to /dev/null
# wc -l counts those lines of output
# tr trims the whitespace from the wc -l output
echo "The number of pages is: $numberOfPages"
Output from running it:
$ ./countPages.sh aSampleFile.pdf
Processing aSampleFile.pdf
The number of pages is: 2
$
The pdftotext utility converts a PDF file to text, inserting page breaks (form-feed characters, $'\f') between the pages:
NAME
pdftotext - Portable Document Format (PDF) to text converter.
SYNOPSIS
pdftotext [options] [PDF-file [text-file]]
DESCRIPTION
Pdftotext converts Portable Document Format (PDF) files to plain text.
Pdftotext reads the PDF file, PDF-file, and writes a text file, text-file. If text-file is
not specified, pdftotext converts file.pdf to file.txt. If text-file is '-', the text is
sent to stdout.
There are several ways to combine it with other tools to solve your problem; choose one of them:
1) pdftotext + grep:
$ pdftotext file.pdf - | grep -c $'\f'
2) pdftotext + awk (v1):
$ pdftotext file.pdf - | awk 'BEGIN{n=0} {if(index($0,"\f")){n++}} END{print n}'
3) pdftotext + awk (v2):
$ pdftotext sample.pdf - | awk 'BEGIN{ RS="\f" } END{ print NR }'
4) pdftotext + awk (v3):
$ pdftotext sample.pdf - | awk -v RS="\f" 'END{ print NR }'
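All four variants rely on the form feeds that pdftotext inserts at page breaks. As a sketch of the same idea that can be tried without pdftotext, the following counts form-feed bytes in any stream. `count_form_feeds` is a hypothetical name, and the printf input merely stands in for pdftotext output, assuming (as the commands above do) one form feed per page. Counting bytes rather than matching lines also covers the case where two form feeds end up on the same line, which could happen with an empty page.

```shell
#!/bin/sh
# Hypothetical helper: count form-feed characters arriving on stdin.
count_form_feeds() {
    # Delete every byte except form feed, count what is left,
    # and strip the padding some wc implementations add.
    tr -cd '\f' | wc -c | tr -d ' '
}

# Stand-in for `pdftotext file.pdf -` (an assumption, not real output):
# three "pages", the second one empty.
printf 'page one\n\f\fpage three\n\f' | count_form_feeds
```

With a real file this would be used as `pdftotext file.pdf - | count_form_feeds`.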
Hope it Helps!
Here is a version for the command line directly (based on pdfinfo):
for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done
Here is a total hack using pdftoppm, which comes preinstalled on Ubuntu (tested on Ubuntu 18.04 and 20.04 at least):
# for a pdf withOUT a password
pdftoppm mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'
# for a pdf WITH a password which is `1234`
pdftoppm -upw 1234 mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'
How does this work? Well, if you specify a first page that is larger than the number of pages in the PDF (I specify page number 1000000, which is too large for all known PDFs), it will print the following error to stderr:
Wrong page range given: the first page (1000000) can not be after the last page (142).
So I redirect that stderr message to stdout with 2>&1, then pipe it to grep to match the (142). part using the regular expression ([0-9]*)\.$, then pipe that to grep again using the regular expression [0-9]* to extract just the number, which is 142 in this case. That's it!
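The parsing step can be tried without a PDF by feeding the sample error message through essentially the same filter. This is a minimal sketch: the function name `last_page_from_error` is hypothetical, and the message text is the one quoted above, so other pdftoppm versions may word it differently. The second grep here requires at least one digit, a small variation on the original pattern.

```shell
#!/bin/sh
# Hypothetical helper: read pdftoppm's "Wrong page range" message on stdin
# and print the last page number it mentions.
last_page_from_error() {
    # The first grep keeps only the trailing "(142)." part;
    # the second keeps only the digits inside it.
    grep -o '([0-9]*)\.$' | grep -o '[0-9][0-9]*'
}

# Sample message copied from the text above (not produced by a real run here).
msg='Wrong page range given: the first page (1000000) can not be after the last page (142).'
printf '%s\n' "$msg" | last_page_from_error
```

Against a real file, the input would come from `pdftoppm mypdf.pdf -f 1000000 2>&1` instead of the sample string.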
Wrapper functions and speed testing
Here are a couple of wrapper functions to test these:
# get the total number of pages in a PDF; technique 1.
# See this ans here: https://stackoverflow.com/a/14736593/4561887
# Usage (works on ALL PDFs--whether password-protected or not!):
# num_pgs="$(getNumPgsInPdf "path/to/mypdf.pdf")"
# SUPER SLOW! Putting `time` just in front of the `strings` cmd shows it takes ~0.200 sec on a 142
# pg PDF!
getNumPgsInPdf() {
    _pdf="$1"
    _num_pgs="$(strings < "$_pdf" | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
        | sort -rn | head -n 1)"
    echo "$_num_pgs"
}
# get the total number of pages in a PDF; technique 2.
# See my ans here: https://stackoverflow.com/a/66963293/4561887
# Usage, where `pw` is some password, if the PDF is password-protected (leave this off for PDFs
# with no password):
# num_pgs="$(getNumPgsInPdf2 "path/to/mypdf.pdf" "pw")"
# SUPER FAST! Putting `time` just in front of the `pdftoppm` cmd shows it takes ~0.020 sec OR LESS
# on a 142 pg PDF!
getNumPgsInPdf2() {
    _pdf="$1"
    _password="$2"
    if [ -n "$_password" ]; then
        _password="-upw $_password"
    fi
    _num_pgs="$(pdftoppm $_password "$_pdf" -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
        | grep -o '[0-9]*')"
    echo "$_num_pgs"
}
Testing them with the time command in front shows that the strings one is extremely slow, taking ~0.200 sec on a 142 pg pdf, whereas the pdftoppm one is very fast, taking ~0.020 sec or less on the same pdf. The pdfinfo technique in Ocaso's answer below is also very fast--the same as the pdftoppm one.
See also
- These awesome answers by Ocaso Protal.
- These functions above will be used in my pdf2searchablepdf project here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.
I just dug out an old script (in ksh) that I found:
#!/usr/bin/env ksh
# Usage: pdfcount.sh file.pdf
#
# Optimally, this would be a mere:
# pdfinfo file.pdf | grep Pages | sed 's/[^0-9]*//'
[[ "$#" != "1" ]] && {
    printf "ERROR: No file specified\n"
    exit 1
}
numpages=0
while read line; do
    num=${line/*([[:print:]])+(Count )?(-)+({1,4}(\d))*([[:print:]])/\4}
    (( num > numpages)) && numpages=$num
done < <(strings "$@" | grep "/Count")
print $numpages
mupdf/mutool solution:
mutool info tmp.pdf | grep '^Pages' | cut -d ' ' -f 2
If you're on macOS you can query pdf metadata like this:
mdls -name kMDItemNumberOfPages -raw file.pdf
as seen here https://apple.stackexchange.com/questions/225175/get-number-of-pdf-pages-in-terminal
I made a small improvement to Marius Hofert's tip to sum the returned values.
for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done | awk '{s+=$1}END{print s}'
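Since awk can match the line itself, the grep stage is optional. Below is a sketch with a hypothetical helper, `sum_pages`, that reads pdfinfo-style output on stdin. The sample text imitates pdfinfo output and is not from a real run. Anchoring on `^Pages:` also avoids counting a line, such as a title, that merely contains the word Pages.

```shell
#!/bin/sh
# Hypothetical helper: sum the values of all "Pages:" lines on stdin.
sum_pages() {
    # "s + 0" makes the result print as 0 when no Pages: line was seen.
    awk '/^Pages:/ { s += $2 } END { print s + 0 }'
}

# Imitation of concatenated pdfinfo output for two files (an assumption).
printf '%s\n' \
    'Title:          Pages of History' \
    'Pages:          3' \
    'Page size:      612 x 792 pts (letter)' \
    'Title:          Other' \
    'Pages:          5' \
    | sum_pages
```

With real files this would be used as `for f in *.pdf; do pdfinfo "$f"; done | sum_pages`.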
To build on Marius Hofert's answer, this command uses a bash for loop to print the number of pages followed by the filename for each PDF, ignoring the case of the file extension.
for f in *.[pP][dD][fF]; do pdfinfo "$f" | grep Pages | awk '{printf "%s", $2}'; echo " $f"; done