Parse data gathered from web scraping and send it to Google Sheets
Pull in data from a web scraper (ParseHub) and process the results into usable data sets sent to Google Sheets for further use or reporting.
Web scraping for all
Much to our frustration here at Parabola, not all data on the web is neatly organized into quality APIs. In some cases, web services don't want folks interacting with their data programmatically; in others, they simply haven't gotten around to building an API yet. Either way, it leaves you, the builder, in the lurch.
That's where web scraping platforms like ParseHub come in handy. ParseHub lets you mark parts of web pages for regular scraping and makes the results available to you through a handy API.
Parabola lets you build on what you've done in ParseHub and regularly take action based on the content you've scraped. This flow shows how you can fetch content from ParseHub (eBay listings, in this case), clean it, and send it to Google Sheets for further analysis or consumption.
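Parabola handles the fetch for you, but it can help to see what that request looks like under the hood. Below is a minimal sketch of building the URL for ParseHub's last-ready-run data endpoint; the key and project token shown are placeholders, and the exact endpoint should be confirmed against ParseHub's own API docs.

```python
import urllib.parse

# Placeholder credentials -- substitute your own ParseHub API key and project token.
API_KEY = "your_api_key"
PROJECT_TOKEN = "your_project_token"

def last_run_data_url(project_token: str, api_key: str) -> str:
    """Build the URL that returns the data from a project's last completed run."""
    query = urllib.parse.urlencode({"api_key": api_key, "format": "json"})
    return (
        f"https://www.parsehub.com/api/v2/projects/{project_token}"
        f"/last_ready_run/data?{query}"
    )

url = last_run_data_url(PROJECT_TOKEN, API_KEY)
# Actually fetching would look like: json.load(urllib.request.urlopen(url))
```

A GET request to that URL returns the scraped results as JSON, which is what Parabola pulls in at the start of the flow.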
Cleaning eBay data in Parabola
In this example, our ParseHub endpoint sends us a mashed-up list of eBay listings and categories; for example, 'Beanies & Winter Hats' sits alongside 'Dainty Name Necklace - Tiny Name Necklace (trunc)'. This won't do!
In Parabola, we first want to separate the listings from the categories. We do this with a row filter that checks whether the retrieved URL contains '/listing'. If it does, we know it's a listing and not a category. From there the flow splits into two tracks: one extracts the listing URL and ID from the listings, while the other extracts a category hierarchy from the categories.
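In plain code, that row filter is just a check on the URL column. Here's a minimal sketch with made-up rows (the URLs and text are hypothetical, not real eBay data):

```python
# Hypothetical scraped rows, shaped like the mashed-up list ParseHub returns.
rows = [
    {"url": "https://www.ebay.com/listing/12345", "text": "Dainty Name Necklace"},
    {"url": "https://www.ebay.com/category/fashion", "text": "Beanies & Winter Hats"},
]

# Row filter: anything whose URL contains '/listing' is a listing;
# everything else goes down the category track.
listings = [r for r in rows if "/listing" in r["url"]]
categories = [r for r in rows if "/listing" not in r["url"]]
```

Each list then feeds its own track of cleanup steps, just as the two branches do in the Parabola flow.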
Both tracks use the same tools, though:
- Parabola's text extract pulls text based on where it sits in the cell, or on which characters precede or follow it.
- Column filters let you trim extraneous columns from a dataset.
- Column split creates multiple new columns from values separated by a specific character (called a delimiter) in the source column.
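The three tools above map onto simple string and dict operations. This sketch shows rough equivalents on a single hypothetical row (the column names and values are made up for illustration):

```python
# One hypothetical row after the filter step.
row = {
    "url": "https://www.ebay.com/listing/12345",
    "category_path": "Fashion > Hats > Beanies",
}

# Text extract: pull the listing ID, i.e. the characters after the final "/".
listing_id = row["url"].rsplit("/", 1)[-1]

# Column split: break the hierarchy on the ">" delimiter into separate values.
levels = [part.strip() for part in row["category_path"].split(">")]

# Column filter: keep only the columns we still need.
kept = {key: value for key, value in row.items() if key in {"url"}}
```

Once both tracks are cleaned this way, the tidy rows are ready to be sent on to Google Sheets.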