Monday, 10 March 2008

How to read text from a PDF document?

We had a requirement to index PDF documents. To index a PDF, its content first has to be converted to plain text. After searching for a suitable API, we settled on PDFBox, a free, open-source library. With PDFBox it is very easy to extract text from a PDF document.
The following code snippet extracts the text from a PDF document.

Download PDFBox at http://www.pdfbox.org/

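// Imports go at the top of the class file. Package names are those of the
// pdfbox.org releases; later Apache releases use org.apache.pdfbox.* instead.
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

// Method body: pdfPath is the path of the file to be read, log is the class logger.
Writer output = new StringWriter();
PDDocument document = null;
try {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(false);       // do not reorder text by position on the page
    stripper.setStartPage(1);                // page numbers start at 1
    stripper.setEndPage(Integer.MAX_VALUE);  // read through to the last page

    document = PDDocument.load(pdfPath);
    stripper.writeText(document, output);
} catch (IOException e) {
    log.error("Could not extract text from " + pdfPath, e);
} finally {
    // Close the document even when extraction fails part-way.
    if (document != null) {
        try { document.close(); } catch (IOException e) { log.error("Could not close " + pdfPath, e); }
    }
}
return output.toString();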
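
Once the text is available as a plain String, it can be handed to whatever indexer you use. As a rough illustration only, here is a toy in-memory inverted index written with nothing but the JDK. The class SimpleTextIndex and its add and search methods are made-up names for this sketch and are not part of PDFBox; a real project would more likely pass the text to a search library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of PDFBox: a toy in-memory inverted index. */
public class SimpleTextIndex {
    // term -> names of the documents that contain it, in insertion order
    private final Map<String, List<String>> postings = new TreeMap<>();

    /** Splits the extracted text into lowercase terms and records docName for each term. */
    public void add(String docName, String text) {
        // Split on any run of characters that are neither letters nor digits.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (term.isEmpty()) {
                continue; // split() yields an empty first element if the text starts with a separator
            }
            List<String> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            if (!docs.contains(docName)) {
                docs.add(docName);
            }
        }
    }

    /** Returns the names of the documents containing the term (case-insensitive). */
    public List<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new ArrayList<>());
    }

    public static void main(String[] args) {
        SimpleTextIndex index = new SimpleTextIndex();
        // In practice the second argument would be the String returned by the snippet above.
        index.add("a.pdf", "Extract text from a PDF document");
        index.add("b.pdf", "Index the extracted text");
        System.out.println(index.search("text")); // [a.pdf, b.pdf]
        System.out.println(index.search("pdf"));  // [a.pdf]
    }
}
```

The index only remembers which documents contain a term, not where or how often, which is enough to show why the PDF has to be turned into text first.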
