Synthesize webscraped data and send to Google Sheets

Pull in data from a webscraper (ParseHub), and process the results into usable data sets sent to Google Sheets for further use or reporting.

Synthesize webscraped data and send to Google Sheets

Web scraping for all

Much to our frustration here at Parabola, all data on the web is not neatly organized into quality APIs. In some cases, web services don't want folks interacting with their data programmatically. In other cases they simply haven't got around to making an API yet. Either way it leaves you, the builder, in a lurch.

That's where web scraping platforms like Parsehub come in handy. Parsehub lets you mark parts of web pages for regular scraping, and makes the results available to you in a handy API.

Parabola lets you build on what you've done with Parsehub and regularly take action based on the content you've scraped. This flow shows how you could fetch content from Parsehub (eBay listings, in this case), clean it, and send it off to Google Sheets for further analysis or consumption.

Cleaning eBay data in Parabola

In this example our Parsehub endpoint send us a mashed up list of eBay listings and categories, e.g., we have 'Beanies & Winter Hats' alongside 'Dainty Name Necklace - Tiny Name Necklace (trunc). This won't do!

In Parabola, we first want to separate the listings from the categories. We do this by using a Remove rows step to check whether the retrieved URL contains '/listing.' If it does, we know it's a listing and not a category. From there the program splits into two tracks: one extracts the listing url and id from the listings, while the other extracts a category hierarchy from the categories.

Both tracks use the same tools though:

  • The Extract text from column step pulls text based off of where it is in the cell, or which characters precede it or come after it.
  • Remove columns steps let you trim extraneous columns from a dataset
  • The Split column step creates multiple new columns based on values that come after a specific character (called a delimiter) in the source column