Parsing messy address data
Take incorrect or human-entered address data, clean it, and parse it into a list of fields usable by another system. It can even find and extract unit numbers.
Do you need this flow?
Human-entered address data can be varied and difficult to parse, especially in large quantities. Notoriously, the most difficult piece of a US-based address to extract is the second line, which usually indicates a suite, building, apartment, or some other subset of the address. Our Google Maps API does a great job of normalizing and filling in address data, and using it to normalize addresses before passing them through this flow can improve the success rate of parsing.
If you need any sort of automation around sending physical mail, or want to collect analytics about the addresses you have, the first step of that process is to parse the address data into consistent, computer-friendly fields. Have you noticed how most address forms on e-commerce sites require a separate field for each piece of your address? That is so they avoid the problem this flow solves: unstructured data. A benefit of this flow is that you simply feed it a file, or connect a source, with a single column of addresses, and it generates a table of the addresses broken down into their components. Be advised that some very poorly formatted or error-filled addresses will not be parsed correctly.
How the flow works
Because of the size of this flow, this is an overview rather than a step-by-step explanation. The generally accepted way to parse address data, and many other types of semi-structured data, is to use regular expressions. Parabola's Find & Replace object can use RegEx statements to find the parts of an address that match certain patterns, much like how a human would read an address.
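To make the idea concrete, here is a small Python sketch of the kind of pattern such a step might use. The regex below is illustrative only (it is not the flow's actual pattern): it looks for a common second-line designator such as Apt, Suite, Ste, Unit, or #, and captures the unit number that follows it.

```python
import re

# Illustrative pattern (an assumption, not the flow's actual RegEx):
# match a unit designator word, or a bare "#", then capture what follows.
UNIT_RE = re.compile(
    r"(?:\b(?:apt|apartment|suite|ste|unit)\.?\s+|#\s*)([\w-]+)",
    re.IGNORECASE,
)

m = UNIT_RE.search("123 Main St Apt 4B, Springfield, IL 62704")
print(m.group(1))  # -> 4B

m = UNIT_RE.search("500 Oak Ave # 12")
print(m.group(1))  # -> 12
```

A human reads "Apt 4B" the same way: spot the designator word, then take the token after it. A real-world pattern would need more alternatives (Bldg, Fl, Rm, and so on), which is why the flow uses a series of expressions rather than one.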
The first segment of the flow adds an ID to each address, so that when the flow takes apart each address, it knows where to add the data back. Since most addresses are separated by commas into major sections, the flow splits the addresses on commas and then unpivots the data into a single column. The flow then separates out the easy data, the state and the zip code, to be joined back in later.
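These first steps can be sketched in Python (illustrative only; the flow itself is built from Parabola steps, not code). Each address keeps its ID through the split, and the state and zip are peeled off the end first because they follow the most predictable pattern.

```python
import re

# Trailing two-letter state plus 5-digit ZIP (or ZIP+4) at the end of the string.
STATE_ZIP_RE = re.compile(r"\b([A-Z]{2})\s+(\d{5}(?:-\d{4})?)\s*$")

def split_into_sections(addr_id, address):
    """Mimic the split-by-comma-then-unpivot step: one row per section,
    each tagged with the ID of the address it came from."""
    return [(addr_id, part.strip()) for part in address.split(",")]

def extract_state_zip(address):
    """Pull out the 'easy data': the trailing state abbreviation and zip code."""
    m = STATE_ZIP_RE.search(address)
    return (m.group(1), m.group(2)) if m else (None, None)

print(split_into_sections(1, "123 Main St, Springfield, IL 62704"))
# -> [(1, '123 Main St'), (1, 'Springfield'), (1, 'IL 62704')]
print(extract_state_zip("123 Main St, Springfield, IL 62704"))
# -> ('IL', '62704')
```

The ID is what makes the unpivot reversible: after every section has been processed on its own row, the flow can group by ID to reassemble one row per original address.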
The flow then diverges: a lower path extracts the city name, and two inner flows use a series of regular expressions to try to separate the first and second lines of the address, regardless of bad formatting and unusual address types. Eventually all of the input joins back up and is exported as a single table.
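Putting the pieces together, an end-to-end sketch of the flow's shape might look like the following. The field names (`line1`, `line2`, `city`, `state`, `zip`) are assumptions for illustration, not Parabola's actual output schema, and the two regexes stand in for the flow's longer series of expressions.

```python
import re

STATE_ZIP_RE = re.compile(r",\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)\s*$")
UNIT_RE = re.compile(
    r",?\s*(?:\b(?:apt|apartment|suite|ste|unit)\.?\s+|#\s*)([\w-]+)",
    re.IGNORECASE,
)

def parse_address(addr_id, address):
    """Parse one address string into ID-keyed fields, mimicking the flow's
    diverge-then-rejoin structure: easy fields first, then city, then lines."""
    row = {"id": addr_id, "line1": None, "line2": None,
           "city": None, "state": None, "zip": None}
    m = STATE_ZIP_RE.search(address)
    if m:
        row["state"], row["zip"] = m.group(1), m.group(2)
        address = address[:m.start()]
    # City path: the last remaining comma-separated section.
    parts = [p.strip() for p in address.split(",") if p.strip()]
    if len(parts) > 1:
        row["city"] = parts.pop()
    line1 = ", ".join(parts)
    # Second-line path: pull the unit out of what is left.
    u = UNIT_RE.search(line1)
    if u:
        row["line2"] = line1[u.start():u.end()].strip(" ,")
        line1 = (line1[:u.start()] + line1[u.end():]).strip(" ,")
    row["line1"] = line1
    return row

print(parse_address(1, "123 Main St Apt 4B, Springfield, IL 62704"))
```

Each field is extracted independently and keyed back to the ID, which is the same join-by-ID pattern the flow uses to merge its divergent paths into one final table.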