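Open-source Java PDF library PDFBox 1.8.8 released
jopen, 10 years ago
PDFBox is an open-source Java PDF library that lets you access the various pieces of information stored in a PDF file.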
PDFBox: www.pdfbox.org
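It provides the following features:
- Text extraction, including Unicode characters.
- Straightforward integration with text search engines such as Jakarta Lucene.
- Encrypting and decrypting PDF documents.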
- Importing and exporting form data in FDF and XFDF formats.
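- Appending content to an existing PDF document.
- Splitting one PDF document into several documents.
- Overlaying one PDF document on top of another.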
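
Example: extracting the text of a PDF file and writing it to a text file.

    import java.io.BufferedWriter;
    import java.io.FileInputStream;
    import java.io.FileWriter;

    // PDFBox 1.8.x package names (releases before the move to Apache used org.pdfbox.*)
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PdfParser {
        public static void main(String[] args) throws Exception {
            FileInputStream fis = new FileInputStream("F:\\task\\lerman-atem2001.pdf");
            BufferedWriter writer = new BufferedWriter(new FileWriter("F:\\task\\pdf_change.txt"));
            // Parse the PDF into an in-memory document.
            PDFParser p = new PDFParser(fis);
            p.parse();
            PDDocument doc = p.getPDDocument();
            // Extract the text of all pages.
            PDFTextStripper ts = new PDFTextStripper();
            String s = ts.getText(doc);
            writer.write(s);
            System.out.println(s);
            // Release the document, the input stream and the output file.
            doc.close();
            fis.close();
            writer.close();
        }
    }
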
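The sample above writes the extracted text with FileWriter, which uses the platform default charset on Java versions before 18, so Unicode characters in the extracted text may not survive on every machine. The following is a minimal sketch of writing the text as UTF-8 explicitly, using only the Java standard library. The class name Utf8TextWriter, the output file name, and the sample string (a stand-in for the value returned by PDFTextStripper) are all assumptions made for illustration and are not part of PDFBox.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: stores extracted text as UTF-8.
public class Utf8TextWriter {

    /** Writes text to the target file as UTF-8, replacing any existing content. */
    public static void writeText(Path target, String text) {
        // try-with-resources closes the writer even if write() throws.
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file back, decoding it as UTF-8. */
    public static String readText(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PDFTextStripper.getText(...).
        String extracted = "PDFBox \u4e2d\u6587 caf\u00e9";
        // Hypothetical output location in the system temp directory.
        Path out = Paths.get(System.getProperty("java.io.tmpdir"), "pdf_change_utf8.txt");
        writeText(out, extracted);
        // Check that the non-ASCII characters survived the round trip.
        System.out.println(readText(out).equals(extracted));
    }
}
```

In the original sample, the same effect could be had by passing the extracted string to a writer opened this way instead of to FileWriter; the rest of the extraction code would stay unchanged.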
Apache PDFBox 1.8.8 has been released. This is an incremental bug-fix release that fixes a large number of bugs, including:
Bug
- [PDFBOX-649] - loading an fdf containing a file attachment throws IOException
- [PDFBOX-1036] - FDFExport/Import gives strange results
- [PDFBOX-1060] - convertToImage includes "ghost" annotation outlines
- [PDFBOX-1087] - FDF parsing is unreliable when xref are missing
- [PDFBOX-1273] - java.io.IOException: Error: Unknown annotation type null
- [PDFBOX-1512] - TextPositionComparator is not compatible with Java 7
- [PDFBOX-1574] - ImportFDF fails to do anything
- [PDFBOX-1595] - PDFMerger failed with the following exception: java.lang.NullPointerException
- [PDFBOX-1918] - PDF with incorrect startxref
- [PDFBOX-2001] - Digital Signature information (parser bug?)
- [PDFBOX-2015] - Hybrid reference pdf still contain XRefStm info in the trailer dictionary after PDDocument#save
- [PDFBOX-2173] - Nullpointer when validating empty file
- [PDFBOX-2296] - Wrong stream length
- [PDFBOX-2306] - Error reading stream, expected='endstream' actual='endobj'
- [PDFBOX-2320] - IOException: Could not read embedded TTF for font TimesNewRoman
- [PDFBOX-2332] - Error reading stream, expected='endstream' actual='endstream8' at offset 1993
- [PDFBOX-2342] - WriteDecodedDoc cant decrypt pdf form correctly
- [PDFBOX-2351] - /XRefStm content missing in saved file
- [PDFBOX-2356] - Error Validating PDF Archive Document with half hour timezone
- [PDFBOX-2371] - Overlay page off by one when using -useAllPages
- [PDFBOX-2376] - Small regression in text extraction with PDFBox 1.8.7 vs. 1.8.6
- [PDFBOX-2377] - Apparent regression in character mapping in a few files from govdocs1
- [PDFBOX-2385] - inline image with EI at the end incorrectly parsed
- [PDFBOX-2395] - Signing PDF document changes documentID
- [PDFBOX-2401] - Image has wrong colors after Merge
- [PDFBOX-2402] - NonSequentialPDFParser cannot recover from spurious closing brackets
- [PDFBOX-2406] - fix typo "AlpaConstant"
- [PDFBOX-2411] - Pushback buffer is full on seamingly small PDF
- [PDFBOX-2412] - Loading XFDF document fails with ClassCastException
- [PDFBOX-2413] - Loaded FDF document returns null fields
- [PDFBOX-2419] - XFDF export is not XML compliant
- [PDFBOX-2424] - ClassCastException in getMetaData if no real meta data
- [PDFBOX-2434] - ClassCastException in readVersionInTrailer
- [PDFBOX-2435] - ConvertToImage Appears To Invert Colors
- [PDFBOX-2441] - Improve XRef self healing mechanism when more than one xref table
- [PDFBOX-2443] - About to return NULL from unhandled branch when constructing a PDJpeg
- [PDFBOX-2449] - Character missing in text extraction
- [PDFBOX-2455] - NonSequentialParser does not tolerate missing %%EOF markers
- [PDFBOX-2458] - Signing doesn't work anymore using BC 1.51 instead of 1.50
- [PDFBOX-2465] - NPE in PdfaExtensionHelper.populateSchemaMapping
- [PDFBOX-2469] - javax.crypto.BadPaddingException in PDFBox 1.8.8-SNAPSHOT
- [PDFBOX-2470] - Exception in PDDocument.addSignature(PDSignature sigObject, SignatureInterface signatureInterface, SignatureOptions options))
- [PDFBOX-2471] - AES encryption failing to write Acroform field names and values
- [PDFBOX-2477] - NPE in DomXmpParser.createProperty
- [PDFBOX-2478] - NPE in XObjImageValidator.checkColorSpaceAndImageMask
- [PDFBOX-2481] - Adding large TYPE_BYTE_BINARY image to pdf document generates distorted result
- [PDFBOX-2483] - StackOverflowError in preflight
- [PDFBOX-2484] - Cannot decrypt AES256 encrypted files with nonSeq parser
- [PDFBOX-2488] - NPE in FontValidator.isSubSet in preflight
- [PDFBOX-2490] - Return value of COSDocument#isEncrypted is unclear
- [PDFBOX-2491] - NPE in PDFAIdentificationValidation.checkConformanceLevel()
- [PDFBOX-2492] - Java 8u25 IllegalBlockSizeException decrypting pdf
- [PDFBOX-2497] - GRAVE: FlateFilter: stop reading corrupt stream due to a DataFormatException
- [PDFBOX-2498] - ArrayIndexOutOfBoundsException in PreflightParser.lastIndexOf
- [PDFBOX-2500] - ClassCastException in StreamValidationProcess.checkFilters
- [PDFBOX-2502] - false negative? 1.4.6 : Trailer Syntax error, ID is different in the first and the last trailer
- [PDFBOX-2503] - false negative? 1: 7.2 : Error on MetaData, Producer present in the document catalog dictionary doesn't match with XMP information
- [PDFBOX-2504] - ClassCastException in preflight: PDAnnotationWidget cannot be cast to PDField
- [PDFBOX-2512] - OutOfMemory while signing large documents
- [PDFBOX-2517] - Better error message on pdfA identification
- [PDFBOX-2520] - Don't decrypt already decrypted pdfs
- [PDFBOX-2521] - Don't throw IOException if stream length is missing in lenient mode
- [PDFBOX-2522] - javax.crypto.IllegalBlockSizeException in ExtractText
- [PDFBOX-2523] - IOException: Error: Expected a long type at offset 1218571, instead got 'xref'
- [PDFBOX-2528] - IOException: Object must be defined and must not be compressed object: 0:0
- [PDFBOX-2533] - Poor rendering with non-sequential parser
- [PDFBOX-2541] - ClassCastException in BaseParser.parseCOSDictionaryValue

Improvement
- [PDFBOX-543] - Document the dependencies of PDFBox
- [PDFBOX-1224] - Angle units are not consistent
- [PDFBOX-1648] - FontBox can't load CMaps with no spaces between tokens
- [PDFBOX-1738] - PDF with parsing IOException
- [PDFBOX-1798] - Performance problem with PDDocument.saveIncremental (when signing document)
- [PDFBOX-1833] - BaseParser tidy up
- [PDFBOX-2197] - Add sample how to import a page as PDFormXObject
- [PDFBOX-2250] - Improve XRef self healing mechanism
- [PDFBOX-2394] - Add example code to extract embedded files in annotations
- [PDFBOX-2414] - Allow non-sequential parser for PDFMerger in app
- [PDFBOX-2456] - create TestSymmetricKeyEncryption.java
- [PDFBOX-2468] - Switch FDFDocument.load from PDFParser to NonSequentialParser
- [PDFBOX-2475] - Fix Checkstyle errors in the 1.8 branch
- [PDFBOX-2480] - Add information about Snapshots to download section