If you need to extract text from a PDF file, you have a range of options, from Linux command-line tools to libraries for your preferred programming language.
One of the best libraries I have used with the Java programming language is Apache PDFBox; you can download the full jar here.
This tutorial was written with NetBeans 8, but you can use whichever IDE you like; the steps should not be difficult to adapt.
In NetBeans, create a new project (Java Application) named PDFBox.
Right-click on the “Libraries” folder and select “Add JAR/Folder”.
Select the downloaded version of PDFBox; it should be named pdfbox-app-1.x.y.jar.
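// Imports go at the top of the class file.
// In PDFBox 1.x the stripper lives in org.apache.pdfbox.util
// (PDFBox 2.x moved it to org.apache.pdfbox.text).
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// static, so that it can be called directly from main
private static void extractTextFromPdf() {
    // The input PDF file from which you want to extract text
    File input = new File("C:\\input.pdf");
    // The text file where the extracted text will be stored
    File output = new File("C:\\output.txt");
    try {
        PDDocument pd = PDDocument.load(input); // Loads the input PDF file
        try (BufferedWriter wr = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(output)))) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.writeText(pd, wr); // Extracts and saves the text
        } finally {
            pd.close(); // Always release the document, even on failure
        }
    } catch (IOException e) {
        // Report the failure instead of silently ignoring it
        e.printStackTrace();
    }
}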
Now, in your main method, call extractTextFromPdf().
If you want to extract the text from only a range of pages (say, from page 4 to page 7), add the following code after the “PDFTextStripper stripper = new PDFTextStripper();” statement (page numbers are 1-based and both ends are inclusive):
stripper.setStartPage(4);
stripper.setEndPage(7);
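If the page range comes from user input rather than being hard-coded, it may be worth sanity-checking it before handing it to the stripper. The following is only a minimal sketch of that idea: PageRange and clamp are hypothetical names of my own, not part of PDFBox, and the class uses nothing but the standard library. It assumes the same 1-based, inclusive numbering as setStartPage and setEndPage.

```java
// Hypothetical helper (not part of PDFBox): normalizes a 1-based, inclusive page range.
final class PageRange {
    final int start;
    final int end;

    private PageRange(int start, int end) {
        this.start = start;
        this.end = end;
    }

    // Clamps the requested range to [1, pageCount] and rejects ranges
    // that are reversed or that do not overlap the document at all.
    static PageRange clamp(int start, int end, int pageCount) {
        if (pageCount < 1) {
            throw new IllegalArgumentException("document has no pages");
        }
        if (start > end) {
            throw new IllegalArgumentException("start page is after end page");
        }
        int s = Math.max(1, start);
        int e = Math.min(pageCount, end);
        if (s > e) {
            throw new IllegalArgumentException("range is outside the document");
        }
        return new PageRange(s, e);
    }
}
```

The page count could come from pd.getNumberOfPages() on the loaded document, and the result would then be passed on with stripper.setStartPage(range.start) and stripper.setEndPage(range.end).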
Nothing more!
Gg1