Excalibur: PDF Table Extraction for Humans
Release v0.4.3. (Installation)
Excalibur is a web interface to extract tabular data from PDFs, written in Python 3! It is powered by Camelot.
Excalibur only works with text-based PDFs and not scanned documents. (As Tabula explains, “If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based”.)
Note: You need to install ghostscript before moving forward.
After installing Excalibur with pip, you can initialize the metadata database using:
$ excalibur initdb
And then start the webserver using:
$ excalibur webserver
That’s it! Now you can go to http://localhost:5000 and start extracting tabular data from your PDFs.
- Upload a PDF and enter the page numbers you want to extract tables from.
- Go to each page and select the table by drawing a box around it. (You can skip this step, since Excalibur can detect tables on its own. Click on “Autodetect tables” to see what Excalibur sees.)
- Choose a flavor (Lattice or Stream) from “Advanced”.
  - Lattice: For tables whose cells are separated by ruled lines.
  - Stream: For tables whose cells are separated by whitespace.
- Click on “View and download data” to see the extracted tables.
- Select your favorite format (CSV/Excel/JSON/HTML) and click on “Download”!
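Since Excalibur is powered by Camelot, the steps above roughly correspond to a single Camelot call followed by an export. The sketch below is illustrative only: `build_read_kwargs` and `extract_and_export` are hypothetical helper names, not part of Excalibur, and the mapping from the web form to `camelot.read_pdf` arguments (`pages`, `flavor`, `table_areas`) is an assumption about how the same extraction could be reproduced in a script.

```python
VALID_FLAVORS = {"lattice", "stream"}
EXPORT_FORMATS = {"csv", "excel", "json", "html"}


def build_read_kwargs(pages="1", flavor="lattice", table_areas=None):
    """Collect keyword arguments for camelot.read_pdf (hypothetical helper)."""
    flavor = flavor.lower()
    if flavor not in VALID_FLAVORS:
        raise ValueError(f"flavor must be one of {sorted(VALID_FLAVORS)}")
    kwargs = {"pages": pages, "flavor": flavor}
    if table_areas:
        # Each area is an "x1,y1,x2,y2" string in PDF coordinate space.
        # Leaving it out lets Camelot look for tables on its own.
        kwargs["table_areas"] = list(table_areas)
    return kwargs


def extract_and_export(pdf_path, out_path, fmt="csv", **options):
    """Read tables with Camelot and export them (requires camelot installed)."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"fmt must be one of {sorted(EXPORT_FORMATS)}")
    import camelot  # imported lazily so build_read_kwargs works without it

    tables = camelot.read_pdf(pdf_path, **build_read_kwargs(**options))
    tables.export(out_path, f=fmt)
    return tables
```

A call such as `extract_and_export("report.pdf", "report.csv", pages="1,3", flavor="stream")` would then play the role of the upload, flavor, and download steps, where `report.pdf` is a placeholder file name.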
You can also download executables for Windows and Linux from the releases page and run them directly!
Extracting tables from PDFs is hard. A simple copy-and-paste from a PDF into an Excel spreadsheet doesn’t preserve table structure. Excalibur makes PDF table extraction much easier by automatically detecting tables in PDFs and letting you save them as CSV and Excel files.
Excalibur uses Camelot under the hood, which gives you additional settings to tweak table extraction and get the best results. You can see how it performs better than other open-source tools and libraries in this comparison.
You can save table extraction settings (like table areas) for a PDF once, and apply them on new PDFs to extract tables with similar structures.
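The idea of saving settings once and reusing them can be sketched with a plain JSON file. How Excalibur itself stores these settings is not described here, so the layout below (keys `pages`, `flavor`, `table_areas`) and the helper names are purely hypothetical.

```python
import json
from pathlib import Path


def save_settings(path, settings):
    """Write extraction settings to a JSON file (hypothetical layout)."""
    Path(path).write_text(json.dumps(settings, indent=2, sort_keys=True))


def load_settings(path):
    """Read previously saved extraction settings back into a dict."""
    return json.loads(Path(path).read_text())


def plan_extractions(settings, pdf_paths):
    """Pair each new PDF with its own copy of the saved settings."""
    return [(str(pdf), dict(settings)) for pdf in pdf_paths]
```

Each `(pdf, settings)` pair could then be handed to whatever performs the extraction, so PDFs with a similar layout share one set of table areas.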
You get complete control over your data. All file storage and processing happens on your own local or remote machine.
Excalibur can be configured with MySQL and Celery for parallel and distributed workloads. By default, SQLite and multiprocessing are used for sequential workloads.
Support us on OpenCollective
If Excalibur helped you extract tables from PDFs, please consider supporting its development by becoming a backer or a sponsor on OpenCollective!
The User Guide
This part of the documentation focuses on instructions to get you up and running with Excalibur.
- Installation of Excalibur
- How-to Guides
- Usage with screenshots