If you need to extract text from a PDF file, you have plenty of options, from Linux command-line tools to libraries for your preferred programming language.

One of the best libraries I have used with Java is Apache PDFBox. You can download the full jar from the Apache PDFBox website.

This tutorial was written with NetBeans 8, but you can use any IDE you like; the steps should be easy to adapt.
In NetBeans, create a new project (Java Application) named PDFBox.


Right-click on the “Libraries” folder and select “Add JAR/Folder”.
Select the downloaded PDFBox jar; it should be named pdfbox-app-1.x.y.jar.
Then add the following method to the project's main class:

 

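    // Imports needed at the top of the file (PDFBox 1.x package names):
    // import java.io.*;
    // import org.apache.pdfbox.pdmodel.PDDocument;
    // import org.apache.pdfbox.util.PDFTextStripper;

    // static so that it can be called directly from main
    private static void extractTextFromPdf() {
        PDDocument pd = null;
        BufferedWriter wr = null;
        try {
            // The input PDF file to extract text from
            File input = new File("C:\\input.pdf");
            // The text file where the extracted text will be stored
            File output = new File("C:\\output.txt");
            pd = PDDocument.load(input); // Loads the input PDF file
            PDFTextStripper stripper = new PDFTextStripper();
            wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
            stripper.writeText(pd, wr); // Extracts the text and writes it to the output file
        } catch (IOException e) {
            e.printStackTrace(); // Report the error instead of silently ignoring it
        } finally {
            // Runs whether or not extraction succeeded
            try {
                if (wr != null) wr.close();
                if (pd != null) pd.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }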

Now call extractTextFromPdf() from your main method.
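The paths in the method above are hard-coded. A minimal sketch of a main method that takes them from the command line instead could look like the following. The class name PdfPaths, the helper defaultOutputFor, and the two-argument extractTextFromPdf(input, output) variant mentioned in the comment are hypothetical and not part of PDFBox. The sketch only covers the path handling, so it runs without the PDFBox jar.

```java
import java.io.File;

public class PdfPaths {

    // Hypothetical helper: derives "report.txt" from "report.pdf".
    // Names without a ".pdf" extension simply get ".txt" appended.
    static File defaultOutputFor(File input) {
        String name = input.getName();
        String base = name.toLowerCase().endsWith(".pdf")
                ? name.substring(0, name.length() - 4)
                : name;
        // Keep the output next to the input (the parent may be null)
        return new File(input.getParentFile(), base + ".txt");
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("Usage: java PdfPaths <input.pdf> [output.txt]");
            return;
        }
        File input = new File(args[0]);
        File output = args.length > 1 ? new File(args[1]) : defaultOutputFor(input);
        // Assumption: extractTextFromPdf is changed to accept the two files,
        // e.g. extractTextFromPdf(input, output);
        System.out.println(input + " -> " + output);
    }
}
```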
If you want to extract text only from a range of pages (say, pages 4 to 7), add the following code after the “PDFTextStripper stripper = new PDFTextStripper();” statement:

 

            stripper.setStartPage(4);
            stripper.setEndPage(7);
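If the range should not be hard-coded, it can be parsed from a string such as "4-7". The following is a minimal sketch; the PageRange class and its parse method are hypothetical helpers, not part of PDFBox. It relies on setStartPage and setEndPage taking 1-based page numbers with the end page included, so the parsed values could be passed to them directly.

```java
public class PageRange {
    final int start; // first page to extract (1-based)
    final int end;   // last page to extract (inclusive)

    PageRange(int start, int end) {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("Invalid page range: " + start + "-" + end);
        }
        this.start = start;
        this.end = end;
    }

    // Parses "4-7" into pages 4..7, or a single number such as "5" into 5..5.
    static PageRange parse(String text) {
        String[] parts = text.trim().split("-");
        if (parts.length < 1 || parts.length > 2) {
            throw new IllegalArgumentException("Expected N or N-M, got: " + text);
        }
        // Integer.parseInt throws NumberFormatException on non-numeric input
        int start = Integer.parseInt(parts[0].trim());
        int end = parts.length == 2 ? Integer.parseInt(parts[1].trim()) : start;
        return new PageRange(start, end);
    }

    public static void main(String[] args) {
        PageRange range = PageRange.parse("4-7");
        // With PDFBox on the classpath, the values would be used as:
        // stripper.setStartPage(range.start);
        // stripper.setEndPage(range.end);
        System.out.println(range.start + " to " + range.end); // prints "4 to 7"
    }
}
```

Validating the range up front means a reversed range such as "7-4" is rejected with an exception before any extraction is attempted.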

 

Nothing more!
Gg1