Low performance when extracting text from a PDF file


#1

Hello,
I am new to Kotlin and decided to compare the performance of PDF text extraction in Kotlin and in Java, using the PDFBox 2.0.12 library. Here is the Java version, which extracts the text from a PDF and creates a list of text segments of a specified minimum size:

    ...
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    ...
    public static List<Segment> parse(InputStream is) throws Exception {

        int segNo = 0;
        List<Segment> segments = new ArrayList<>();
        try (PDDocument pdfDocument = PDDocument.load(is)) {
            if (!pdfDocument.isEncrypted()) {
                PDFTextStripper str = new PDFTextStripper();
                str.setLineSeparator("\n");
                str.setSortByPosition(true);
                StringBuilder accumulator = new StringBuilder();
                int n = pdfDocument.getNumberOfPages();
                // PDFTextStripper page numbers are 1-based, so iterate 1..n
                for(int i = 1; i <= n; i++) {
                    str.setStartPage(i); str.setEndPage(i);
                    String[] pars = Segment.PDF_END_OF_LINE.split(str.getText(pdfDocument).trim());
                    for(String content : pars) {
                        content = content.trim();
                        if(!content.isEmpty()) {
                            accumulator.append(Segment.REMOVE_MULTI_SPACES.matcher(content)
                                    .replaceAll(" ")).append(".");
                            if(accumulator.length() < Segment.MIN_NUM_OF_CHARS) {
                                accumulator.append("\n");
                            } else {
                                segments.add(new Segment(Segment.PARAGRAPH, 
                                        accumulator.toString().trim(), segNo++));
                                accumulator.setLength(0);
                            }
                        }
                    }
                    if(accumulator.length() > 0) {
                        segments.add(new Segment(Segment.PARAGRAPH, 
                                            accumulator.toString().trim(), segNo++));
                        accumulator.setLength(0);
                    }
                }
            }
        }
        return segments;
    }

Here is my Kotlin code:

...
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
...

object PdfParser {

    fun parse(fis: InputStream): List<Segment> {

        var segNo = 0
        val segments = mutableListOf<Segment>()
        PDDocument.load(fis).use { pdfDocument ->
            if (!pdfDocument.isEncrypted) {
                val stripper = PDFTextStripper()
                stripper.lineSeparator = "\n"
                stripper.sortByPosition = true
                val accumulator = StringBuilder()
                val n = pdfDocument.numberOfPages
                // PDFTextStripper page numbers are 1-based, so iterate 1..n
                for (i in 1..n) {
                    stripper.startPage = i
                    stripper.endPage = i
                    val pars = PDF_END_OF_SENT.split(stripper.getText(pdfDocument).trim())
                    for (p in pars) {
                        val content = p.trim()
                        if (content.isNotEmpty()) {
                            accumulator.append(SEG_REMOVE_MULTI_SPACES.replace(content, " ") + ".")
                            if (accumulator.length < SEG_MIN_NUM_OF_CHARS) {
                                accumulator.append("\n")
                            } else {
                                segments.add(Segment(SEG_PARAGRAPH, accumulator.toString().trim(), segNo++))
                                accumulator.setLength(0)
                            }
                        }
                    }
                    if (accumulator.isNotEmpty()) {
                        segments.add(Segment(SEG_PARAGRAPH, accumulator.toString().trim(), segNo++))
                        accumulator.setLength(0)
                    }
                }
            }
        }
        return segments
    }
}

When I measure the performance (giving both the same FileInputStream pointing to the 7.18 MB tornadofx-guide.pdf), the Java version performs the task in 3 seconds and the Kotlin version in 5 seconds!
In both cases I use the same JDK version (1.8.something) and the same PDFBox libraries.

Here are the test sequences in Java and Kotlin:

       // Java
        try(InputStream is = new FileInputStream("C:\\Users\\Andjelko\\Desktop\\tornadofx-guide.pdf")){
            long st = System.currentTimeMillis();
            List<Segment> list = parse(is);
            long e = System.currentTimeMillis();
            System.out.println(e-st);
        } catch(Exception e) {
            e.printStackTrace();
        }

        // Kotlin
        val fis = FileInputStream("C:\\Users\\Andjelko\\Desktop\\tornadofx-guide.pdf")
        val st = System.currentTimeMillis()
        val list = PdfParser.parse(fis)
        val e = System.currentTimeMillis()
        println(e - st)

Am I missing something, or is Kotlin much slower than Java (apart from being much nicer)?


#2

Since you are using a Java library and apparently the same code, the time should be exactly the same. The only difference I see is that the file closing in your Kotlin code is inside the measured block, while in the Java code it is outside, but that should not take that much time. The problem is probably with your test setup. Try running the tests in a different order or multiple times.


#3

Also, you use slightly different logic in the replacer. In Kotlin, the + operation builds an additional intermediate string (via an extra StringBuilder) before the append.
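
To make the difference concrete, here is a minimal sketch in Java. The class name, the method names and the regex are made up for illustration (the thread does not show how the multi-space pattern is defined). Both variants produce the same text; the first one just builds a temporary string before appending it.

```java
import java.util.regex.Pattern;

public class AppendDemo {
    // Assumed stand-in for the multi-space pattern; the real definition is not shown in the thread.
    static final Pattern MULTI_SPACES = Pattern.compile(" {2,}");

    // Concatenates first: a temporary String is built, then appended to the accumulator.
    static String withConcat(String content) {
        StringBuilder acc = new StringBuilder();
        acc.append(MULTI_SPACES.matcher(content).replaceAll(" ") + ".");
        return acc.toString();
    }

    // Chains append calls: both pieces are written directly into the accumulator.
    static String withChainedAppend(String content) {
        StringBuilder acc = new StringBuilder();
        acc.append(MULTI_SPACES.matcher(content).replaceAll(" ")).append(".");
        return acc.toString();
    }

    public static void main(String[] args) {
        String sample = "Hello   PDF    world";
        System.out.println(withConcat(sample));
        System.out.println(withChainedAppend(sample));
    }
}
```

The extra allocation is small compared to the PDF parsing itself, so on its own it is unlikely to explain a difference of seconds.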


#4

I changed that, but the difference is still nearly 2 seconds! My test logic is correct. Even when I repeat the tests multiple times, the same thing happens. The only difference is that I run the Java code from NetBeans and the Kotlin code from IntelliJ. Does it matter?


#5

It means that you are not building a jar, but running the code directly from the IDE. That can affect performance. You must at least run the tests under the same conditions. Also, if by repeating you mean clicking the “play” button multiple times, that does not help: each click starts a fresh JVM, so you do not eliminate the VM warm-up this way.
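
A rough sketch of what repeating the measurement inside one JVM could look like. The class, the method names and the workload are hypothetical stand-ins; in your case the task would be the parse call on a freshly opened stream. This is only an illustration of the warm-up idea, and a dedicated harness such as JMH is the more reliable option.

```java
import java.util.function.Supplier;

public class WarmupBenchmark {

    // Runs the task `warmups` times without timing it, then returns the
    // average time in milliseconds over `runs` timed calls.
    static double averageMillis(Supplier<Object> task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.get(); // give the JIT a chance to compile the hot paths first
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.get();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (runs * 1_000_000.0);
    }

    // Hypothetical stand-in workload; replace with the code under test.
    static String workload() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append(i % 10);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double avg = averageMillis(WarmupBenchmark::workload, 20, 100);
        System.out.printf("average: %.3f ms%n", avg);
    }
}
```

Running both the Java and the Kotlin version through the same loop, in the same JVM and from the same IDE or jar, removes most of the setup differences.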


#6

Hello,
I repeated the tests in the same IDE (IDEA) and repeated them programmatically 100 times, and the results are nearly the same (now Kotlin is a little bit faster)!

Sorry for this stupid error :)


#7

In this case the performance should be essentially the same, because the main work happens inside Java classes (so the bytecode doing it is the same). Kotlin can sometimes give a small boost for string operations, since it optimizes some of them, but not in this case.