Low performance when extracting text from a PDF file

miloskovacevic · October 24, 2018, 1:21pm

Hello,
I am new to Kotlin, and I decided to compare the performance of PDF text extraction in Kotlin and in Java, using the PDFBox 2.0.12 library in both cases. Here is the Java version, which extracts the text from a PDF and creates a list of text segments of at least a specified minimum size:

   ...
   import org.apache.pdfbox.pdmodel.PDDocument;
   import org.apache.pdfbox.text.PDFTextStripper;
   ...
    public static List<Segment> parse(InputStream is) throws Exception {

        int segNo = 0;
        List<Segment> segments = new ArrayList<>();
        try (PDDocument pdfDocument = PDDocument.load(is)) {
            if (!pdfDocument.isEncrypted()) {
                PDFTextStripper str = new PDFTextStripper();
                str.setLineSeparator("\n");
                str.setSortByPosition(true);
                StringBuilder accumulator = new StringBuilder();
                int n = pdfDocument.getNumberOfPages();
                for(int i = 1; i <= n; i++) { // PDFTextStripper page numbers are 1-based
                    str.setStartPage(i); str.setEndPage(i);
                    String[] pars = Segment.PDF_END_OF_LINE.split(str.getText(pdfDocument).trim());
                    for(String content : pars) {
                        content = content.trim();
                        if(!content.isEmpty()) {
                            accumulator.append(Segment.REMOVE_MULTI_SPACES.matcher(content)
                                    .replaceAll(" ")).append(".");
                            if(accumulator.length() < Segment.MIN_NUM_OF_CHARS) {
                                accumulator.append("\n");
                            } else {
                                segments.add(new Segment(Segment.PARAGRAPH, 
                                        accumulator.toString().trim(), segNo++));
                                accumulator.setLength(0);
                            }
                        }
                    }
                    if(accumulator.length() > 0) {
                        segments.add(new Segment(Segment.PARAGRAPH, 
                                            accumulator.toString().trim(), segNo++));
                        accumulator.setLength(0);
                    }
                }
            }
        }
        return segments;
    }

Here is my Kotlin code:

...
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
...

object PdfParser {

    fun parse(fis: InputStream): List<Segment> {

        var segNo = 0
        val segments = mutableListOf<Segment>()
        PDDocument.load(fis).use { pdfDocument ->
            if (!pdfDocument.isEncrypted) {
                val stripper = PDFTextStripper()
                stripper.lineSeparator = "\n"
                stripper.sortByPosition = true
                val accumulator = StringBuilder()
                val n = pdfDocument.numberOfPages
                for (i in 1..n) { // PDFTextStripper page numbers are 1-based
                    stripper.startPage = i
                    stripper.endPage = i
                    val pars = PDF_END_OF_SENT.split(stripper.getText(pdfDocument).trim())
                    for (p in pars) {
                        val content = p.trim()
                        if (content.isNotEmpty()) {
                            accumulator.append(SEG_REMOVE_MULTI_SPACES.replace(content, " ") + ".")
                            if (accumulator.length < SEG_MIN_NUM_OF_CHARS) {
                                accumulator.append("\n")
                            } else {
                                segments.add(Segment(SEG_PARAGRAPH, accumulator.toString().trim(), segNo++))
                                accumulator.setLength(0)
                            }
                        }
                    }
                    if (accumulator.isNotEmpty()) {
                        segments.add(Segment(SEG_PARAGRAPH, accumulator.toString().trim(), segNo++))
                        accumulator.setLength(0)
                    }
                }
            }
        }
        return segments
    }
}

When I measure the performance (passing the same FileInputStream pointing to the 7.18 MB tornadofx-guide.pdf), the Java version performs the task in 3 seconds and the Kotlin version in 5 seconds!
In both cases I use the same JDK version (1.8.something) and the same PDFBox libraries.

Here are the test sequences in Java and Kotlin:

       // Java
        try(InputStream is = new FileInputStream("C:\\Users\\Andjelko\\Desktop\\tornadofx-guide.pdf")){
            long st = System.currentTimeMillis();
            List<Segment> list = parse(is);
            long e = System.currentTimeMillis();
            System.out.println(e-st);
        } catch(Exception e) {
            e.printStackTrace();
        }

        // Kotlin
        val fis = FileInputStream("C:\\Users\\Andjelko\\Desktop\\tornadofx-guide.pdf")/*, 2 * 1024)*/
        val st = System.currentTimeMillis()
        val list = PdfParser.parse(fis)
        val e = System.currentTimeMillis()
        println(e - st)

Am I missing something, or is Kotlin much slower than Java (apart from being much nicer)?

darksnake · October 25, 2018, 7:15am

Since you are using a Java library and apparently the same code, the time should be exactly the same. The only difference I see is that the file closing in your Kotlin code is inside the measured block, while in the Java code it is outside, but that should not take that much time. The problem is probably with your test setup. Try running the tests in a different order or multiple times.
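
To make the point about the measured block concrete, here is a minimal Java sketch in which the stream is opened, consumed and closed inside the timed region for whatever parser is plugged in. The names `SameRegionTiming`, `StreamTask`, `timeNanos` and `countBytes` are made up for illustration, and `countBytes` is only a placeholder for a `parse(InputStream)` method like the ones in the first post.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class SameRegionTiming {
    // Hypothetical shape of the code under test; stands in for either parse(InputStream).
    interface StreamTask {
        int run(InputStream in) throws IOException;
    }

    // Opens, consumes and closes the stream inside the timed region, so every
    // implementation under test is charged for exactly the same steps.
    static long timeNanos(byte[] data, StreamTask task) {
        long start = System.nanoTime();
        try (InputStream in = new ByteArrayInputStream(data)) {
            task.run(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return System.nanoTime() - start;
    }

    // Placeholder workload: counts the bytes in the stream.
    static int countBytes(InputStream in) throws IOException {
        int n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1_000_000]; // assumed in-memory copy of the input file
        long nanos = timeNanos(data, SameRegionTiming::countBytes);
        System.out.println("elapsed ns: " + nanos);
    }
}
```

Reading the file into a byte array beforehand is a design choice of this sketch, not something from the original tests. It keeps disk access out of the comparison and lets each run start from a fresh stream.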

darksnake · October 25, 2018, 7:20am

Also, you use slightly different logic in the replacer. In Kotlin, the + operation creates an additional StringBuilder.
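
As a rough illustration of that difference, the Java sketch below puts the two styles side by side. `MULTI_SPACES` is an assumed stand-in for the thread's multi-space pattern, whose definition is not shown, and the class and method names are made up. With a Java 8 toolchain, the concatenation variant is typically compiled to a temporary StringBuilder whose result is then copied into the accumulator, while the chained variant appends both pieces directly.

```java
import java.util.regex.Pattern;

public class AppendStyles {
    // Assumed stand-in for the multi-space pattern used in the thread (definition not shown there).
    static final Pattern MULTI_SPACES = Pattern.compile(" {2,}");

    // Concatenation first: builds a temporary String, then appends it to the accumulator.
    static String withConcatenation(String content) {
        StringBuilder accumulator = new StringBuilder();
        accumulator.append(MULTI_SPACES.matcher(content).replaceAll(" ") + ".");
        return accumulator.toString();
    }

    // Chained append: writes both pieces straight into the accumulator.
    static String withChainedAppend(String content) {
        StringBuilder accumulator = new StringBuilder();
        accumulator.append(MULTI_SPACES.matcher(content).replaceAll(" ")).append(".");
        return accumulator.toString();
    }

    public static void main(String[] args) {
        String sample = "Kotlin   and    Java";
        System.out.println(withConcatenation(sample)); // prints "Kotlin and Java."
        System.out.println(withChainedAppend(sample)); // prints "Kotlin and Java."
    }
}
```

Both variants produce the same text. Only the number of intermediate objects differs, which is unlikely to explain a gap of seconds on its own.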

miloskovacevic · October 25, 2018, 12:54pm

I changed that, but the difference is still nearly 2 seconds! My test logic is correct. Even when I repeat the tests multiple times, the same thing happens. The only difference is that I run the Java code from NetBeans and the Kotlin code from IntelliJ. Does it matter?

darksnake · October 25, 2018, 1:09pm

It means that you are not building a jar, but running the code directly from the IDE. That could affect performance. You must at least run the tests under the same conditions. Also, if by repeating you mean that you click the “play” button multiple times, it does not help. You do not eliminate VM warm-up this way.
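
A minimal sketch of what repeating inside a single JVM could look like: a few untimed warm-up runs, then several timed runs summarized by their median. `WarmedUpTimer`, `measureMillis` and `median` are hypothetical names, and the workload in `main` is a placeholder rather than the PDF parser from the first post. A dedicated harness such as JMH is generally the more reliable tool. This only shows the idea.

```java
import java.util.Arrays;
import java.util.function.Supplier;

public class WarmedUpTimer {
    // Runs the task untimed first so the JIT has a chance to compile hot paths,
    // then returns the duration of each timed run in milliseconds.
    static <T> long[] measureMillis(Supplier<T> task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.get();
        }
        long[] millis = new long[timedRuns];
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.get();
            millis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }

    // Median of the samples (the upper median when the count is even).
    static long median(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) {
        // Placeholder workload; swap in a call to the parser under test.
        Supplier<String> task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
            return sb.toString();
        };
        long[] samples = measureMillis(task, 10, 100);
        System.out.println("median ms: " + median(samples));
    }
}
```

Running both the Java and the Kotlin parser through the same loop, in the same process and the same IDE or jar, keeps warm-up effects comparable for the two.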

miloskovacevic · October 25, 2018, 1:19pm

Hello,
I repeated the tests in the same IDE (IDEA), running them programmatically 100 times, and the results are nearly the same (now Kotlin is a little bit faster)!

Sorry for this stupid error.

darksnake · October 25, 2018, 1:32pm

In this case the performance should be exactly the same, because your main work is hidden in Java classes (so the bytecode is the same). Kotlin could give some boost for operations on strings, since it optimizes some things, but not in this case.
