java - Convert PDF to multipage TIFF (Group 4)

Question

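I am trying to convert a PDF, represented by the class org.apache.pdfbox.pdmodel.PDDocument, into a multipage TIFF with Group 4 compression at 300 dpi, using the icafe library (https://github.com/dragon66/icafe/). The sample code works for 288 dpi but, strangely, not for 300 dpi: the exported TIFF is completely white. Does anybody have an idea what the problem is here?

The sample PDF I use in the example is located here: http://www.bergophil.ch/a.pdf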

import java.awt.image.BufferedImage;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

import cafe.image.ImageColorType;
import cafe.image.ImageParam;
import cafe.image.options.TIFFOptions;
import cafe.image.tiff.TIFFTweaker;
import cafe.image.tiff.TiffFieldEnum.Compression;
import cafe.io.FileCacheRandomAccessOutputStream;
import cafe.io.RandomAccessOutputStream;

public class Pdf2TiffConverter {
    public static void main(String[] args) {
        String pdf = "a.pdf";
        PDDocument pddoc = null;
        try {
            pddoc = PDDocument.load(pdf);
        } catch (IOException e) {
        }

        try {
            savePdfAsTiff(pddoc);
        } catch (IOException e) {
        }
    }

    private static void savePdfAsTiff(PDDocument pdf) throws IOException {
        BufferedImage[] images = new BufferedImage[pdf.getNumberOfPages()];
        for (int i = 0; i < images.length; i++) {
            PDPage page = (PDPage) pdf.getDocumentCatalog().getAllPages()
                    .get(i);
            BufferedImage image;
            try {
//              image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 288); //works
                image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 300); // does not work
                images[i] = image;
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        FileOutputStream fos = new FileOutputStream("a.tiff");
        RandomAccessOutputStream rout = new FileCacheRandomAccessOutputStream(
                fos);
        ImageParam.ImageParamBuilder builder = ImageParam.getBuilder();
        ImageParam[] param = new ImageParam[1];
        TIFFOptions tiffOptions = new TIFFOptions();
        tiffOptions.setTiffCompression(Compression.CCITTFAX4);
        builder.imageOptions(tiffOptions);
        builder.colorType(ImageColorType.BILEVEL);
        param[0] = builder.build();
        TIFFTweaker.writeMultipageTIFF(rout, param, images);
        rout.close();
        fos.close();
    }
}

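Or is there another library that can write multipage TIFFs?

EDIT:

Thanks to dragon66, the bug in icafe is now fixed. In the meantime I experimented with other libraries and also with invoking ghostscript. I think ghostscript is very reliable because it is a widely used tool; on the other hand, I have to trust that the users of my code have a ghostscript installation. The call looks something like this: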

/**
 * Converts a given pdf as specified by its path to a tiff using group 4 compression
 *
 * @param pdfFilePath The absolute path of the pdf
 * @param tiffFilePath The absolute path of the tiff to be created
 * @param dpi The resolution of the tiff
 * @throws MyException If the conversion fails
 */
private static void convertPdfToTiffGhostscript(String pdfFilePath, String tiffFilePath, int dpi) throws MyException {
    // location of gswin64c.exe
    String ghostscriptLoc = context.getGhostscriptLoc();

    // enclose src and dest. with quotes to avoid problems if the paths contain whitespaces
    pdfFilePath = "\"" + pdfFilePath + "\"";
    tiffFilePath = "\"" + tiffFilePath + "\"";

    logger.debug("invoking ghostscript to convert {} to {}", pdfFilePath, tiffFilePath);
    String cmd = ghostscriptLoc + " -dQUIET -dBATCH -o " + tiffFilePath + " -r" + dpi + " -sDEVICE=tiffg4 " + pdfFilePath;
    logger.debug("The following command will be invoked: {}", cmd);

    int exitVal = 0;
    try {
        exitVal = Runtime.getRuntime().exec(cmd).waitFor();
    } catch (Exception e) {
        logger.error("error while converting to tiff using ghostscript", e);
        throw new MyException(ErrorMessages.GHOSTSTSCRIPT_ERROR, e);
    }
    if (exitVal != 0) {
        logger.error("error while converting to tiff using ghostscript, exitval is {}", exitVal);
        throw new MyException(ErrorMessages.GHOSTSTSCRIPT_ERROR);
    }
}
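The method above builds a single command string and quotes the paths by hand. As a minimal sketch of an alternative, the same Ghostscript flags can be passed as a list of arguments to ProcessBuilder, so that paths containing whitespace need no manual quoting. The class name GhostscriptCommand and its method names are hypothetical and not part of the original code; the flags are the ones used by the OP.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
final class GhostscriptCommand {

    /**
     * Builds the argument list. Each element reaches the process as one
     * argument, so paths with spaces do not have to be wrapped in quotes.
     */
    static List<String> build(String ghostscriptExe, String pdfPath, String tiffPath, int dpi) {
        List<String> cmd = new ArrayList<>();
        cmd.add(ghostscriptExe);      // e.g. the location of gswin64c.exe
        cmd.add("-dQUIET");
        cmd.add("-dBATCH");
        cmd.add("-o");
        cmd.add(tiffPath);            // output file
        cmd.add("-r" + dpi);          // resolution
        cmd.add("-sDEVICE=tiffg4");   // Group 4 TIFF device
        cmd.add(pdfPath);             // input file
        return cmd;
    }

    /** Starts Ghostscript with the given arguments and returns its exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd)
                .inheritIO()          // forward Ghostscript's output so its pipes cannot fill up
                .start();
        return process.waitFor();
    }
}
```

A caller would check the exit code exactly as in the method above, for example `int exitVal = GhostscriptCommand.run(GhostscriptCommand.build(ghostscriptLoc, pdfFilePath, tiffFilePath, dpi));` with the unquoted paths.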

What I found is that the quality of the TIFF produced by icafe differs strongly from that of the TIFF produced by ghostscript (the Group 4 output looks greyscale-like).

score 11 · Accepted Answer

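It has been a while since the question was asked, and I have finally found the time, and a wonderful ordered dither matrix, that allows me to elaborate on how "icafe" can be used to get results similar to or better than calling an external ghostscript executable. Some new features were added to "icafe" recently, such as better quantization and ordered dither algorithms, which are used in the following example code.

The sample PDF I am going to use here is PrinceCatalogue. Most of the following code is from the OP, with some changes due to the package name change and more ImageParam control settings.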

import java.awt.image.BufferedImage;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

import com.icafe4j.image.ImageColorType;
import com.icafe4j.image.ImageParam;
import com.icafe4j.image.options.TIFFOptions;
import com.icafe4j.image.quant.DitherMethod;
import com.icafe4j.image.quant.DitherMatrix;
import com.icafe4j.image.tiff.TIFFTweaker;
import com.icafe4j.image.tiff.TiffFieldEnum.Compression;
import com.icafe4j.io.FileCacheRandomAccessOutputStream;
import com.icafe4j.io.RandomAccessOutputStream;

public class Pdf2TiffConverter {
    public static void main(String[] args) {
        String pdf = "princecatalogue.pdf";
        PDDocument pddoc = null;
        try {
            pddoc = PDDocument.load(pdf);
        } catch (IOException e) {
        }

        try {
            savePdfAsTiff(pddoc);
        } catch (IOException e) {
        }
    }

    private static void savePdfAsTiff(PDDocument pdf) throws IOException {
        BufferedImage[] images = new BufferedImage[pdf.getNumberOfPages()];
        for (int i = 0; i < images.length; i++) {
            PDPage page = (PDPage) pdf.getDocumentCatalog().getAllPages()
                    .get(i);
            BufferedImage image;
            try {
//              image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 288); //works
                image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 300); // does not work
                images[i] = image;
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        FileOutputStream fos = new FileOutputStream("a.tiff");
        RandomAccessOutputStream rout = new FileCacheRandomAccessOutputStream(
                fos);
        ImageParam.ImageParamBuilder builder = ImageParam.getBuilder();
        ImageParam[] param = new ImageParam[1];
        TIFFOptions tiffOptions = new TIFFOptions();
        tiffOptions.setTiffCompression(Compression.CCITTFAX4);
        builder.imageOptions(tiffOptions);
        builder.colorType(ImageColorType.BILEVEL).ditherMatrix(DitherMatrix.getBayer8x8Diag()).applyDither(true).ditherMethod(DitherMethod.BAYER);
        param[0] = builder.build();
        TIFFTweaker.writeMultipageTIFF(rout, param, images);
        rout.close();
        fos.close();
    }
}

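For ghostscript, I used the command line directly with the same parameters provided by the OP. Screenshots of the first page of the resulting TIFF images are shown below:

The left-hand side shows the output of "ghostscript" and the right-hand side the output of "icafe". As can be seen, at least in this case, the output from "icafe" is better than the output from "ghostscript".

With CCITTFAX4 compression, the file size is 2.22M from "ghostscript" and 2.08M from "icafe". Neither is great, given that dithering is used when creating the black-and-white output. In fact, a different compression algorithm creates a much smaller file. For example, with LZW the same output from "icafe" is only 634K, and with DEFLATE compression the output file size goes down to 582K.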
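To illustrate what ordered dithering does when a colour page is reduced to black and white, here is a minimal sketch in plain Java. It is not icafe's implementation: it uses the classic 4x4 Bayer threshold matrix rather than the 8x8 diagonal matrix selected in the code above, and the class name OrderedDither is hypothetical.

```java
import java.awt.image.BufferedImage;

// Hypothetical illustration of ordered (Bayer) dithering; not icafe code.
final class OrderedDither {

    // Classic 4x4 Bayer threshold matrix with values 0..15.
    private static final int[][] BAYER_4X4 = {
        { 0,  8,  2, 10},
        {12,  4, 14,  6},
        { 3, 11,  1,  9},
        {15,  7, 13,  5}
    };

    /** Converts an image to a 1-bit image by comparing each pixel with a position-dependent threshold. */
    static BufferedImage toBilevel(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness, range 0..255.
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                // Scale the matrix entry to the 0..255 range (8, 24, ..., 248).
                int threshold = BAYER_4X4[y & 3][x & 3] * 16 + 8;
                dst.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return dst;
    }
}
```

Because the threshold changes from pixel to pixel, a uniform grey area turns into a regular pattern of black and white dots instead of a solid block. Such fine patterns are what make dithered pages compress poorly with the CCITT fax schemes, which fits the file sizes reported above.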

score 2 · Accepted Answer

Here is some code for saving to a multipage TIFF that I use with PDFBox. It requires the TIFFUtil class from PDFBox (it isn't public, so you have to make a copy).

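import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;

import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageTypeSpecifier;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.stream.ImageOutputStream;

// TIFFUtil is the local copy of the PDFBox class mentioned above.
void saveAsMultipageTIFF(ArrayList<BufferedImage> bimTab, String filename, int dpi) throws IOException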
{
    Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
    ImageWriter imageWriter = writers.next();

    ImageOutputStream ios = ImageIO.createImageOutputStream(new File(filename));
    imageWriter.setOutput(ios);
    imageWriter.prepareWriteSequence(null);
    for (BufferedImage image : bimTab)
    {
        ImageWriteParam param = imageWriter.getDefaultWriteParam();
        IIOMetadata metadata = imageWriter.getDefaultImageMetadata(new ImageTypeSpecifier(image), param);
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        TIFFUtil.setCompressionType(param, image);
        TIFFUtil.updateMetadata(metadata, image, dpi);
        imageWriter.writeToSequence(new IIOImage(image, null, metadata), param);
    }
    imageWriter.endWriteSequence();
    imageWriter.dispose();
    ios.flush();
    ios.close();
}

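I experimented with this for myself some time ago, using the following code: https://www.java.net/node/670205 (I used solution 2)

However...

If you create an array with lots of images, your memory consumption really goes up. So it would probably be better to render one image, add it to the TIFF file, then render the next page and drop the reference to the previous one, so that the GC can reclaim the space if needed.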
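The page-at-a-time idea can be sketched as follows. This is only an illustration under some assumptions: it relies on a TIFF ImageWriter being registered with ImageIO (the JDK ships one from Java 9 on; older versions need a plugin), the class name StreamingTiffWriter and the renderPage callback are hypothetical, and the compression name is passed in by the caller because the supported names depend on the writer.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.function.IntFunction;

import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

// Hypothetical helper: writes pages one by one instead of collecting them in an array.
final class StreamingTiffWriter {

    static void write(File target, int pageCount, IntFunction<BufferedImage> renderPage,
                      String compressionType) throws IOException {
        // Throws NoSuchElementException if no TIFF writer is registered.
        ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
        try (ImageOutputStream ios = ImageIO.createImageOutputStream(target)) {
            writer.setOutput(ios);
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionType(compressionType);

            writer.prepareWriteSequence(null);
            for (int i = 0; i < pageCount; i++) {
                // Only the current page is referenced; the previous one
                // becomes eligible for garbage collection.
                BufferedImage page = renderPage.apply(i);
                writer.writeToSequence(new IIOImage(page, null, null), param);
            }
            writer.endWriteSequence();
        } finally {
            writer.dispose();
        }
    }
}
```

With the JDK's built-in writer, the compression name for Group 4 should be "CCITT T.6", and it only applies to 1-bit images such as TYPE_BYTE_BINARY; `param.getCompressionTypes()` lists what a given writer actually accepts. With PDFBox 2.x, renderPage would wrap a call such as `pdfRenderer.renderImageWithDPI(i, dpi, ImageType.BINARY)`, converting its IOException into an unchecked exception.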

score 1 · Accepted Answer

Some of the dependencies used by the solutions to this question look unmaintained, so I worked out a solution using the latest version (2.0.16) of pdfbox:

ByteArrayOutputStream imageBaos = new ByteArrayOutputStream();
ImageOutputStream output = ImageIO.createImageOutputStream(imageBaos);
ImageWriter writer = ImageIO.getImageWritersByFormatName("TIFF").next();

try (final PDDocument document = PDDocument.load(new File("/tmp/tmp.pdf"))) {

            PDFRenderer pdfRenderer = new PDFRenderer(document);

            int pageCount = document.getNumberOfPages();

            BufferedImage[] images = new BufferedImage[pageCount];
            // ByteArrayOutputStream[] baosArray = new ByteArrayOutputStream[pageCount];

            writer.setOutput(output);

            ImageWriteParam params = writer.getDefaultWriteParam();

            params.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);

            // Compression: None, PackBits, ZLib, Deflate, LZW, JPEG and CCITT
            // variants allowed
            params.setCompressionType("Deflate");

            writer.prepareWriteSequence(null);

            for (int page = 0; page < pageCount; page++) {
                BufferedImage image = pdfRenderer.renderImageWithDPI(page, DPI, ImageType.RGB);
                images[page] = image;
                IIOMetadata metadata = writer.getDefaultImageMetadata(new ImageTypeSpecifier(image), params);
                writer.writeToSequence(new IIOImage(image, null, metadata), params);
                // ImageIO.write(image, "tiff", baosArray[page]);
            }

            System.out.println("imageBaos size: " + imageBaos.size());
            // Finished write to output

            writer.endWriteSequence();

            document.close();
        } catch (IOException e) {
            e.printStackTrace();
            throw new Exception(e);
        } finally {
            // avoid memory leaks
            writer.dispose();
        }

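You can then use imageBaos to write to a local file. However, if you want to put the image into a ByteArrayOutputStream and return it to a calling method, as I did, some additional steps are needed.

After processing, the image bytes are available in the ImageOutputStream output object. We need to seek to the beginning of the output object and then read the bytes and write them into a new ByteArrayOutputStream; a concise way is the following: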

ByteArrayOutputStream bos = new ByteArrayOutputStream();
long counter = 0;
// rewind to the beginning of the image stream before copying it
output.seek(0);
while (true) {
    try {
        bos.write(output.readByte());
        counter++;
    } catch (EOFException e) {
        // readByte() signals the end of the stream with an EOFException
        System.out.println("End of Image Stream");
        break;
    } catch (IOException e) {
        System.out.println("Error processing the Image Stream");
        break;
    }
}
return bos;

Alternatively, you can call ImageOutputStream.flush() at the end so that the bytes end up in your imageBaos, and then return that.
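The flush-based variant can be sketched without any byte-by-byte copying. The following is a minimal illustration, assuming a TIFF ImageWriter is registered with ImageIO (built in from Java 9 on); the class name InMemoryTiff is hypothetical. Closing the ImageOutputStream pushes all cached data into the underlying ByteArrayOutputStream, after which its bytes can be returned directly.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;

import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

// Hypothetical helper: encodes pages as one multipage TIFF held in memory.
final class InMemoryTiff {

    static byte[] encode(List<BufferedImage> pages) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
        // MemoryCacheImageOutputStream buffers in memory and does not close baos.
        try (ImageOutputStream ios = new MemoryCacheImageOutputStream(baos)) {
            writer.setOutput(ios);
            writer.prepareWriteSequence(null);
            for (BufferedImage page : pages) {
                // null metadata and null params select the writer's defaults
                writer.writeToSequence(new IIOImage(page, null, null), null);
            }
            writer.endWriteSequence();
        } finally {
            writer.dispose();
        }
        // The image stream is closed at this point, so baos holds the complete file.
        return baos.toByteArray();
    }
}
```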

score 1 · Accepted Answer

For an implementation with PDFBox, see my code on GitHub.

answered 2020-03-05T21:52:26.250

score 0 · Accepted Answer

Inspired by Yusaku's answer, I made my own version, which converts a PDF with multiple pages into a byte array.

I use pdfbox 2.0.16 together with imageio-tiff 3.4.2.

// Toolbox method: converts the bytes of a PDF into the bytes of a multipage TIFF.
private byte[] bytesToTIFF(@Nonnull byte[] in) {

        int dpi = 300;
        ImageWriter writer = ImageIO.getImageWritersByFormatName("TIFF").next();

        try (ByteArrayOutputStream imageBaos = new ByteArrayOutputStream();
             PDDocument document = PDDocument.load(in)) {

            // Close the ImageOutputStream before reading imageBaos,
            // so that all cached bytes have been flushed into it.
            try (ImageOutputStream ios = ImageIO.createImageOutputStream(imageBaos)) {
                writer.setOutput(ios);
                writer.prepareWriteSequence(null);

                PDFRenderer pdfRenderer = new PDFRenderer(document);
                ImageWriteParam params = writer.getDefaultWriteParam();

                for (int page = 0; page < document.getNumberOfPages(); page++) {
                    BufferedImage image = pdfRenderer.renderImageWithDPI(page, dpi, ImageType.RGB);
                    IIOMetadata metadata = writer.getDefaultImageMetadata(new ImageTypeSpecifier(image), params);
                    writer.writeToSequence(new IIOImage(image, null, metadata), params);
                }

                writer.endWriteSequence();
            }

            LOG.trace("size found: {}", imageBaos.size());

            return imageBaos.toByteArray();

        } catch (Exception ex) {
            LOG.warn("can't convert the PDF to TIFF", ex);
            // Assumption: callers treat an empty array as a failed conversion.
            return new byte[0];
        } finally {
            writer.dispose();
        }
}
