PDF parsing overview

How to best work with PDFs in Parabola

Parabola’s PDF parsing leverages both optical character recognition (OCR) and large language model (LLM) technology to extract information from PDF documents.

Each billing plan includes a certain number of PDF pages per month (see pricing). If your usage exceeds this threshold, we will reach out to discuss pricing for additional PDF parsing!

Known Limitations

In building this step we’ve seen tens of thousands of PDF pages, and we know that these documents can be messy. Based on extensive testing and production usage of this step, we have identified a few common scenarios in which specific document features or components may impact the quality and/or consistency of the output. Many of these can be solved with additional configuration and rule setting - so we encourage you to use fine tuning or additional instructions before determining this step won’t work for your file!

Documents that tend to experience challenges or may not be parsable:

  • Files not in PDF format
  • Files with tables that have no column headers, or with tables that span multiple pages and show the header only on the first page
  • Files that contain multiple versions of the same document - for example, a 20-page file in which each page is a separate invoice with the same or similar columns
  • Long documents (currently 30 or more pages)
  • Documents with images containing text that needs to be extracted
  • Documents with rows that are not clearly delimited
  • Some documents with handwriting or significant image distortion (e.g., blur) that make values hard to decipher
  • Something else? We’re always eager for feedback on our PDF Parsing step - if you have a file you cannot parse, even after various attempts at configuration, feel free to reach out to us.

Use PDF file

See below for the various ways you can bring a PDF file into Parabola.

Upload PDF file

Use the Pull from PDF file step to work with a single PDF file. Upload a file by either dragging one into the outlined box or selecting "Click to upload a file."

Pull from PDF file step

Email PDF file attachment

The Pull from inbound email step can pull in data from a number of file types, including attached PDF files. Once configured, Parabola can be set to parse PDFs any time the relevant email address receives a PDF file.

Working with data from PDF files

Check out this Parabola University video for a quick intro to our PDF parsing capabilities, and see below for an overview of how to read and configure your PDF data in Parabola.

Understanding your PDF data

Parabola’s Pull from PDF file step can be configured to return Columns or Keys:

  • Columns are parts of tables that are likely to have more than one row associated with them
  • Keys are single pieces of data that are applicable to the entire document. As an example - “Total” rows or fields like dates that only appear once at the top of a document are best expressed as keys
  • Sometimes the AI interprets a value as a column when a human would consider it a key, or vice versa. If the tool is not correctly pulling a piece of information, try switching that data point between a column and a key
  • Both columns and keys can be given additional information from you to ensure the tool is identifying and returning the correct information - more on that below!
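
To make the distinction concrete, here is a minimal Python sketch of how an invoice might split into keys and columns. The `ParsedDocument` class and the field names (`invoice_date`, `total`, `sku`, `quantity`) are hypothetical illustrations, not Parabola's actual output schema.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedDocument:
    """Hypothetical container; not Parabola's actual output format."""

    # Keys: a single value that applies to the whole document.
    keys: dict[str, str] = field(default_factory=dict)
    # Columns: one dict per table row, so each column can hold many values.
    rows: list[dict[str, str]] = field(default_factory=list)

    def flatten(self) -> list[dict[str, str]]:
        """Repeat every key on each row to produce one flat table."""
        return [{**self.keys, **row} for row in self.rows]


doc = ParsedDocument(
    keys={"invoice_date": "2024-01-15", "total": "42.00"},
    rows=[
        {"sku": "A-1", "quantity": "2"},
        {"sku": "B-7", "quantity": "1"},
    ],
)
flat = doc.flatten()
```

A value that appears once per document (the date, the total) fits naturally as a key, while anything that repeats per line item fits as a column.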

Step Configuration

Selecting PDF Data

Once you have a PDF file in your flow, you will see a prompt for the second step, “Select table columns,” where you tell the tool which fields it should extract from the file. Parabola offers three methods for this configuration:

  1. Use an auto-detected table (default)
  2. Define a custom table
  3. Extract all data

First, we’ll outline how these choices will impact your results and then we will discuss tips and best practices for fine tuning these results:

  1. Use an auto-detected table
    • This selection, which is the default, will send the file through our PDF parsing pipeline, where our LLM will identify tables within the document, identify possible columns, name them, and extract available values.
    • Once this step finishes its first calculation, you should see a table selected with a set of columns. You can always add more columns to this original list - see the Manual Inputs selection below for more info!
    • Note that initial auto-detection does not provide any keys. However, there is an option to run a full-document or key-specific auto-detect to have the tool provide these values
  2. Define a custom table
    • If you don’t want the step to take a first pass at auto-detection, or if the auto-detection is excluding columns you were hoping to extract, you can manually define a specific table. This is an advanced feature that can extract data from tables that are not obvious to the AI. Auto-detected tables are easier to work with, but if the AI did not find your table, try defining it with this custom setting.
  3. Extract all data
    • This option will primarily use OCR instead of an LLM to process your file. As a result, it is discouraged for most use cases.
    • Should you want to use this option, however, we provide four options for how you’d like your data returned:
      1. All data: this will return all of the data in the PDF, listed as one value per row
      2. Table data: this will return only data from OCR-identified tables within the PDF file. If your file has multiple tables, each will have a unique ID (which you can use to later filter results, for example), and results will be returned sequentially (e.g. table 1, then table 2, and so on). Note: tables that span multiple pages will be broken into individual tables for each page
      3. Key-Value pairs: this will return all identifiable key/value pairs – things that are clearly associated or labeled, such as “color: red” or “Customer name- Parabola”
      4. Raw text: this will return all of the PDF data, in a single cell (one cell per file page). This format is most useful if you plan to apply an AI step, like Extract or Categorize
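
As a rough illustration of filtering Table data results by table ID, here is a small Python sketch. The row layout and the column names (`table_id`, `value`) are assumptions made for the example; the actual column names in your results may differ.

```python
# Hypothetical rows as they might look after "Table data" extraction.
# The column names ("table_id", "value") are assumptions for illustration.
rows = [
    {"table_id": 1, "value": "Widget"},
    {"table_id": 1, "value": "Gadget"},
    {"table_id": 2, "value": "Gizmo"},
    {"table_id": 3, "value": "Doohickey"},
]


def rows_for_tables(rows, table_ids):
    """Keep rows whose table_id is in table_ids, preserving row order."""
    wanted = set(table_ids)
    return [row for row in rows if row["table_id"] in wanted]


# Tables 2 and 3 might be one logical table that was split across two
# pages, so selecting both IDs stitches it back together.
stitched = rows_for_tables(rows, [2, 3])
```

Because results come back sequentially, selecting several IDs at once keeps the rows in their original page order.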

Manual Inputs

  • Parabola’s ability to make informed predictions on what fields you might be expecting from any PDF file is incredibly powerful, but you can always provide additional inputs to the tool to increase accuracy and ensure the output is aligned with your needs. Generally speaking, the more input you provide, the more accurate the results!
  • Adding columns or keys is as easy as clicking the “+ Add column” or “+ Add key” button, which will open a prompt box. Here are a few tips for best results:
    1. Column and key names can be descriptive or instructive, and do not need to match exactly what the PDF says. However, you should try to ensure the name is something that the underlying AI can associate with the desired column of data
    2. Providing examples is the best way to increase the accuracy of column (or key) parsing
    3. The “Additional instructions to find this value” field is optional. Use it to give further instructions on how to identify a value or how to manipulate it. For example, to make two distinct columns out of a single value in the file, such as an order number in the format “ABC:123”, you might use the prompt: “Take the order ID and extract all of the characters before the ‘:’ into a new column”
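
When writing an instruction like the order ID example, it can help to spell out the exact result you expect so you can audit what the step returns. The Python sketch below shows the intended transformation only; `split_order_id` is a hypothetical helper, not something Parabola exposes.

```python
def split_order_id(order_id: str, sep: str = ":") -> tuple[str, str]:
    """Split an order ID at the first separator.

    Returns (characters before sep, characters after sep). If sep is
    missing, the whole value is returned as the first part and the
    second part is empty.
    """
    prefix, _, suffix = order_id.partition(sep)
    return prefix, suffix


prefix, suffix = split_order_id("ABC:123")
```

Deciding up front what should happen when the separator is missing makes it easier to word the instruction and to spot rows where the AI guessed.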

See below how in this case with handwriting, with more instructions the tool is able to determine if there is writing next to the word “YES” or “NO”

Fine Tuning

  • In addition to column- or key-specific inputs, you can use “Fine Tuning” to explain more holistically the type of document you are providing and the outcome you expect. If an expected outcome is not being auto-generated, adding more examples and inputs should help improve accuracy

Advanced Settings

Parabola’s Pull from PDF step has four additional configurations:

  1. Use advanced text extraction. Defaults to FALSE. If toggled on, this step will use a more sophisticated version of OCR text extraction that can be helpful for complex documents such as those with handwriting. This more advanced model also runs slower, so we suggest only toggling it on if you are not satisfied with the results from simple text extraction. Note that if a run fails with simple text extraction, Parabola uses advanced extraction by default when retrying
  2. Retry step on error. Defaults to TRUE. LLMs can occasionally return unexpected errors, and re-running the step will often resolve the issue. When checked, this step will automatically attempt to re-run one time when encountering an unexpected error
  3. Auto-update prompt versions. Defaults to FALSE. Occasionally Parabola updates step prompts in order to make parsing results more accurate and reliable. These updates may change output results, so auto-updating is turned off by default. Enable this setting to always use the most recent prompt versions.
  4. Page Filtering. Defaults to FALSE. Lets you define the specific pages of a document to run through the Pull from PDF step. If you only need specific values that are consistently on the same page(s), this can drastically improve run time.
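
The retry behavior described in settings 1 and 2 can be pictured with the following Python sketch. It is a simplified model under stated assumptions, not Parabola's implementation: `run_with_one_retry` and `flaky_step` are hypothetical names, and the step is modeled as a function that takes a single "use advanced extraction" flag.

```python
def run_with_one_retry(step, use_advanced=False, retry_on_error=True):
    """Run step once; on an unexpected error, retry a single time.

    The retry switches to advanced extraction, mirroring the note that
    a failed simple-extraction run is retried with advanced extraction.
    """
    try:
        return step(use_advanced)
    except Exception:
        if not retry_on_error:
            raise
        # Single retry; if this also fails, the error propagates.
        return step(True)


attempts = []


def flaky_step(advanced):
    """Stand-in for a parse that fails under simple extraction."""
    attempts.append(advanced)
    if not advanced:
        raise RuntimeError("simple extraction failed")
    return "parsed"


result = run_with_one_retry(flaky_step)
```

With retry disabled, the first error surfaces immediately, which can be useful when you want to see exactly what failed.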

Usage tips & Other Notes

  • This step can take many minutes to run! Grab a coffee and relax while the AI does the work for you. The more pages that need parsing, the longer it may take. To speed things up, you can configure the step to review only certain pages from your file. The fewer the pages, the faster the results!
  • If you need to pull data across multiple tables (from a single file), you will likely need multiple steps – one per table.
  • File size: PDF files must be under 500 MB and 30 pages
  • PDFs cannot be password protected
  • We recommend always auditing the results returned in Parabola to ensure that they’re complete
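
If you prepare files programmatically before sending them to Parabola, a pre-check along these lines can catch the limits above early. This is a sketch under assumptions: `precheck` is a hypothetical helper, 1 MB is taken as 1024 × 1024 bytes, and the page limit is treated as exclusive (based on the "30 or more pages" note under Known Limitations). The caller is expected to supply the size, page count, and password status from whatever PDF tooling they already use.

```python
MAX_BYTES = 500 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MAX_PAGES = 30  # assumption: 30 or more pages counts as too long


def precheck(size_bytes, page_count, password_protected):
    """Return a list of human-readable problems; empty means OK."""
    problems = []
    if size_bytes >= MAX_BYTES:
        problems.append("file is 500 MB or larger")
    if page_count >= MAX_PAGES:
        problems.append("file has 30 or more pages")
    if password_protected:
        problems.append("file is password protected")
    return problems


ok = precheck(10 * 1024 * 1024, 12, False)
bad = precheck(600 * 1024 * 1024, 45, True)
```

A file that fails only the page check may still be usable if you restrict the step to specific pages.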

Using child columns

Mark columns as “Child columns” if they contain rows that have values unique from the parent columns:

Before

After marking “Size” as a child column