I'm starting work on a Kettle plugin that ingests tables from PDF files and converts them to CSV, to be used as input to the TextInput plugin.
What is a good method for recognizing tables in a PDF?
1. Use markers and an API such as Apache PDFBox.
2. Convert PDF to image and use image recognition algorithms.
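
To make option 1 more concrete, here is a minimal sketch of the grouping step only, not a working plugin. It assumes a PDF library has already produced text fragments with page coordinates; with PDFBox, one possible source is overriding `PDFTextStripper.writeString` and reading the `TextPosition` list it receives. `TableSketch`, `Fragment`, and the `tolerance` parameter are hypothetical names made up for illustration, and y is assumed to increase down the page. Fragments whose y values are close together become one row, and distinct left edges become the columns:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TableSketch {

    /** Hypothetical holder: a piece of text plus the x of its left edge and its y position. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups fragments into rows (similar y) and columns (similar x) and renders CSV. */
    static String toCsv(List<Fragment> fragments, double tolerance) {
        // Column anchors: left edges that differ by more than the tolerance.
        List<Fragment> byX = new ArrayList<>(fragments);
        byX.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        List<Double> columns = new ArrayList<>();
        for (Fragment f : byX) {
            if (columns.isEmpty() || f.x - columns.get(columns.size() - 1) > tolerance) {
                columns.add(f.x);
            }
        }

        // Rows: walk top to bottom and start a new row when y jumps past the tolerance.
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder csv = new StringBuilder();
        String[] row = null;
        double rowY = 0;
        for (Fragment f : byY) {
            if (row == null || f.y - rowY > tolerance) {
                if (row != null) {
                    appendRow(csv, row);
                }
                row = new String[columns.size()];
                rowY = f.y;
            }
            int col = nearest(columns, f.x);
            row[col] = (row[col] == null) ? f.text : row[col] + " " + f.text;
        }
        if (row != null) {
            appendRow(csv, row);
        }
        return csv.toString();
    }

    /** Index of the column anchor closest to x. */
    private static int nearest(List<Double> columns, double x) {
        int best = 0;
        for (int i = 1; i < columns.size(); i++) {
            if (Math.abs(columns.get(i) - x) < Math.abs(columns.get(best) - x)) {
                best = i;
            }
        }
        return best;
    }

    private static void appendRow(StringBuilder csv, String[] row) {
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                csv.append(',');
            }
            csv.append(escape(row[i]));
        }
        csv.append('\n');
    }

    /** Quotes a field when it contains a comma, a quote, or a line break; empty cells stay empty. */
    private static String escape(String field) {
        if (field == null) {
            return "";
        }
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        // Made-up sample coordinates, only to exercise the grouping.
        List<Fragment> sample = Arrays.asList(
                new Fragment("Name", 50, 100),
                new Fragment("Qty", 200, 100.5),
                new Fragment("Apple", 50, 120),
                new Fragment("3", 201, 120));
        System.out.print(toCsv(sample, 2.0));
    }
}
```

A single fixed tolerance is the weak point of this sketch: it would likely break on right-aligned numeric columns, wrapped cell text, or merged cells, which is part of why I'm asking how others handle detection.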
Please share your experience.