Parabola's PDF parser

Parsing PDFs in Parabola

Parabola’s PDF parser lets you extract, transform, and structure data from PDF documents—whether they’re invoices, reports, or scanned documents. It can parse handwritten notes on PDFs too! Using a combination of optical character recognition (OCR) and large language models (LLMs), Parabola makes working with messy PDFs much easier.

What You Can Do with PDFs in Parabola

  • Extract tables and key values from PDFs
  • Parse structured and unstructured content
  • Fine-tune results to improve accuracy for messy or complex documents
  • Batch process multiple files
  • Import PDF file attachments via email using the "Extract from email" step, which automates file delivery and extraction and eliminates manual email tasks

Bringing PDFs into Parabola

You can import PDF files in a few different ways:

  • Upload a file directly using the Extract from PDF file step
  • Pull PDFs from inbound email using the Extract from email step
  • Bulk process PDF files using the Pull from file queue step

Working with data from PDF files

Check out this Parabola University video for a quick intro to our PDF parsing capabilities, and see below for an overview of how to read and configure your PDF data in Parabola.

Understanding your PDF data

Parabola’s Pull from PDF file step can be configured to return columns or keys:

  • Columns are parts of tables that are likely to have more than one row associated with them
  • Keys are single pieces of data that apply to the entire document. For example, “Total” rows or fields like dates that appear only once at the top of a document are best expressed as keys
  • Sometimes the AI interprets a value as a column where a human would consider it a key, or vice versa. If the tool is not correctly pulling a piece of information, try switching that data point between a column and a key
  • Both columns and keys can be given additional information from you to ensure the tool is identifying and returning the correct information - more on that below!
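
To make the distinction concrete, here is a minimal Python sketch of how columns and keys relate in the output. It is not Parabola's implementation; the invoice fields, values, and the `combine` helper are hypothetical. It only illustrates the behavior described in the Extract values section, where each key becomes its own column with the value repeated on every row.

```python
# Hypothetical invoice data; field names and values are illustrative only.
line_items = [  # "columns": table data with one value per row
    {"Description": "Widget", "Quantity": 2},
    {"Description": "Gadget", "Quantity": 5},
]
document_keys = {  # "keys": one value for the whole document
    "Invoice Date": "2024-01-15",
    "Total": "70.00",
}


def combine(rows, keys):
    """Attach every key to every row, so each key becomes a repeated column."""
    return [{**row, **keys} for row in rows]


result = combine(line_items, document_keys)
for row in result:
    print(row)
```

Each printed row carries its own `Description` and `Quantity` plus the same `Invoice Date` and `Total`.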

Step Configuration

You can use Extract from PDF, Extract from email, and Pull from file queue to parse PDFs. Once you have a PDF file uploaded into your Flow, the configuration settings are uniform.

Extract a table

1. Auto-detected Table (default)
Parabola scans your PDF, detects possible tables, and labels the most likely columns. This option uses LLM technology and works exceptionally well if the PDF document has a clear, structured table. All detected tables will be available in the sub-dropdown under the "Use an auto-detected table" dropdown.

  • Quickest setup
  • Works best when your table has headers
  • You can manually add more columns or keys after

2. Define a Custom Table
Manually define the structure of your table if the AI didn’t pick it up. You can name the table and define the columns that you want to extract from the PDF by clicking on the + Add Column button.

  • Good for multi-table documents
  • Works well with tables spread across multiple pages
  • Requires a bit more setup

3. Extract All Data (OCR-first mode)
Use OCR to return all text from the PDF — helpful if the structure is complex or you're feeding the result into an AI step later. We only recommend this option if the first two extraction methods aren't yielding the desired results.

Return formats:

  • All data → Every value, one per row
  • Table data → Tables split by page, each with a table ID
  • Key-value pairs → Labeled items like SKU: 12345
  • Raw text → One cell per page, useful for follow-up AI parsing
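
As an illustration of what a labeled item looks like, the Python sketch below pulls `Label: value` pairs out of a page of raw text. It is an assumption-laden example for follow-up processing outside Parabola, not a description of how the step detects pairs; the sample text and the `key_value_pairs` helper are hypothetical.

```python
import re
from typing import Dict

# Matches lines shaped like "Label: value".
PAIR = re.compile(r"^\s*([^:\n]+?)\s*:\s*(.+?)\s*$")


def key_value_pairs(raw_text: str) -> Dict[str, str]:
    """Collect "Label: value" lines from raw text; other lines are skipped."""
    pairs = {}
    for line in raw_text.splitlines():
        match = PAIR.match(line)
        if match:
            pairs[match.group(1)] = match.group(2)
    return pairs


# Hypothetical raw text for one page.
page_text = "SKU: 12345\nShip to warehouse B\nPO Number: 77-A"
pairs = key_value_pairs(page_text)
print(pairs)
```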

Extract values

If there are document-level values, such as an invoice date or PO number, that you want to extract, add them as keys in this section by clicking the “+ Add key” button. Each key that you configure is represented as its own column, and its value is repeated across all rows of the resulting data set.

  • Column and key names can be descriptive or instructive, and do not need to match exactly what the PDF says. However, you should try to ensure the name is something that the underlying AI can associate with the desired column of data
  • Providing examples is the best way to increase the accuracy of column (or key) parsing
  • The “Additional instructions to find this value” field is optional. Use it to give further instructions on how to identify a value or how to manipulate it. For example, to make two distinct columns out of a single value in the file, such as an order number in the format “ABC:123”, you might use the prompt “Take the order ID and extract all of the characters before the “:” into a new column”
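
For reference, the result that the example prompt asks for can be expressed as a short Python sketch. This is only an illustration of the intended split, not what the AI runs internally; `split_order_id` is a hypothetical helper.

```python
from typing import Tuple


def split_order_id(order_id: str, sep: str = ":") -> Tuple[str, str]:
    """Return (prefix, remainder) split at the first separator.

    If the separator is missing, the whole value is returned as the prefix.
    """
    prefix, _, remainder = order_id.partition(sep)
    return prefix, remainder


prefix, number = split_order_id("ABC:123")
print(prefix, number)  # ABC 123
```

Writing down the expected output like this can also help you phrase the instruction and check the parsed results.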

In the handwriting example below, additional instructions let the tool determine whether there is writing next to the word “YES” or “NO”.

Fine Tuning

You can give the AI more context by typing additional instructions into this text box. Try using specific examples, or explain the situation and the specific desired outcome. Consult the chat interface on the left-hand side for help writing clear instructions.

Advanced Settings

1. Text parsing approach
You can specify the text parsing approach if necessary. The default setting is “Auto” and we recommend keeping it this way if possible. If it’s not properly parsing your PDF, you can choose between “OCR” and “Markdown”.

  • OCR - This will use a more sophisticated version of OCR text extraction that can be helpful for complex documents such as those with handwriting. This more advanced model may, however, result in the tool running slower.
  • Markdown - This will use Markdown for parsing. It is generally faster and may work better for certain documents, like PDFs that have nested columns and rows.

2. Retry step on error
The checkbox will be checked by default. LLMs can occasionally return unexpected errors and oftentimes, re-running the step will resolve the issue. When checked, this step will automatically attempt to re-run one time when encountering an unexpected error.
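
The retry behavior described above can be sketched in Python as follows. This is a generic retry-once pattern written under the assumption that any exception counts as an "unexpected error"; `run_with_one_retry` and `flaky_parse` are hypothetical names, not part of Parabola.

```python
def run_with_one_retry(step, *args, **kwargs):
    """Call step; if it raises, call it exactly one more time."""
    try:
        return step(*args, **kwargs)
    except Exception:
        # Second and final attempt; an error here propagates to the caller.
        return step(*args, **kwargs)


# Hypothetical step that fails on its first call only.
calls = {"count": 0}


def flaky_parse():
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("unexpected model error")
    return "parsed"


outcome = run_with_one_retry(flaky_parse)
print(outcome, calls["count"])  # parsed 2
```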

3. Auto-update prompt versions
The checkbox will be unchecked by default. Occasionally Parabola updates step prompts to make parsing results more accurate and reliable. Because these updates may change output results, auto-updating is turned off by default. Enable this setting to always use the most recent prompt versions.

4. Page filtering
The checkbox will be unchecked by default. This setting allows users to define specific pages of a document to parse. If you only need specific values that are consistently on the same page(s), this can drastically improve run time. If you check this box, make sure to complete the dropdown settings that appear below it.

  • Keep, Remove, or Autodetect
    • The Autodetect option will allow the parser to choose what pages to use.
  • The first, the last, or these
    • If you select “the first”, input a number in the “#” box to instruct how many pages from the beginning of the file should be parsed.
    • If you select “the last”, input a number in the “#” box to instruct how many pages from the end of the file should be parsed.
    • If you select “these”, input a comma-separated list of numbers in the blank box to specify which pages. For example, if you put “1, 10, 16”, the step will parse only the first, tenth, and sixteenth pages of the file.
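
The Keep/Remove and first/last/these combinations can be reasoned about with a small Python sketch. This is not Parabola's code: `select_pages` and `parse_page_list` are hypothetical helpers, pages are assumed to be numbered from 1, and the Autodetect option is left out because the parser chooses the pages itself.

```python
from typing import List


def parse_page_list(text: str) -> List[int]:
    """Turn a comma-separated string such as "1, 10, 16" into page numbers."""
    return [int(part) for part in text.split(",") if part.strip()]


def select_pages(total_pages: int, action: str, scope: str, value) -> List[int]:
    """Return the 1-indexed pages that would be parsed.

    action: "keep" or "remove"
    scope:  "first", "last", or "these"
    value:  a page count for "first"/"last", or a string like "1, 10, 16"
    """
    all_pages = list(range(1, total_pages + 1))
    if scope == "first":
        chosen = set(all_pages[: int(value)])
    elif scope == "last":
        count = int(value)
        # Guard against all_pages[-0:], which would select every page.
        chosen = set(all_pages[-count:]) if count > 0 else set()
    elif scope == "these":
        chosen = set(parse_page_list(value)) & set(all_pages)
    else:
        raise ValueError(f"unknown scope: {scope}")

    if action == "keep":
        return [p for p in all_pages if p in chosen]
    if action == "remove":
        return [p for p in all_pages if p not in chosen]
    raise ValueError(f"unknown action: {action}")


print(select_pages(20, "keep", "these", "1, 10, 16"))  # [1, 10, 16]
print(select_pages(5, "remove", "last", 2))            # [1, 2, 3]
```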

Usage tips & Other Notes

  • The more document pages that are needed for parsing, the longer it may take. To expedite this process, you can configure the step to only review certain pages from your file. The fewer the pages, the faster the results!
  • If you need to pull data across multiple tables (from a single file), you will likely need multiple steps – one per table.
  • File limits: PDF files must be under 500 MB and within the 30-page limit
  • PDFs cannot be password protected
  • We recommend always auditing the results returned in Parabola to ensure that they’re complete
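
If you prepare files programmatically before sending them to Parabola, a pre-check along these lines may catch problems early. It is a sketch under stated assumptions: "500 MB" is read as 500 × 1024 × 1024 bytes, the page limit is read as "at most 30 pages", and the caller already knows the file's size, page count, and whether it is password protected. `limit_problems` is a hypothetical helper.

```python
from typing import List

MAX_BYTES = 500 * 1024 * 1024  # assumption: 500 MB counted in binary megabytes
MAX_PAGES = 30                 # assumption: the limit means "at most 30 pages"


def limit_problems(size_bytes: int, page_count: int, password_protected: bool) -> List[str]:
    """Return a list of reasons the file would likely be rejected (empty if none)."""
    problems = []
    if size_bytes >= MAX_BYTES:
        problems.append("file is 500 MB or larger")
    if page_count > MAX_PAGES:
        problems.append("file has more than 30 pages")
    if password_protected:
        problems.append("file is password protected")
    return problems


print(limit_problems(2 * 1024 * 1024, 12, False))  # []
print(limit_problems(MAX_BYTES, 31, True))
```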

Using child columns

Mark columns as “Child columns” if they contain rows that have values unique from the parent columns:

Before:

After marking “Size” as a child column:

Extract from PDF

Use Extract from PDF to work with a single PDF file. Upload a file either by dragging a PDF file anywhere onto the canvas or by clicking "Click to upload a file" to select a file from your file picker.

Step configuration instructions can be found in the Step Configuration section above.

Extract from email - PDF attachments

Extract from email can pull in data from a number of file types, including attached PDF files. Once configured, Parabola can be set to parse PDFs any time the relevant email address receives a PDF file.

Step configuration instructions can be found in the Step Configuration section above.

Pull from file queue - PDF files

Pull from file queue can receive PDF files and parse the relevant data. The file queue is a way to enqueue a Flow run with a set of metadata plus a file that is accessible via URL.

Runs can be added to the file queue via API (webhook) or via Run another Parabola Flow.