I am very close to finishing the latest iteration of NewsBot, which scrapes for Moab-related stories from the Deseret News, The New York Times, The Washington Post, The Salt Lake Tribune, KUER, FOX4, KSL, and many others. I have been finagling with the file that defines the news sources to scrape and how to scrape them to try to reduce redundancy, and I am flirting with the line that separates “too clever” from “just abstract enough.”
It turns out Google has Custom Search Site Restricted JSON API, which allows you to list up to 10 sites to search, and Google places no quota on the number of queries you can do on such a custom search.
As I was trying to figure this out, I contemplated abandoning my solution of scraping sites directly simply going through Google for all my search needs. This has seemed like a sane solution to the problem I am solving, which is just to keep get notifications about new stories from a set of news sources about a set of topics.
However, I had already defined the classes I needed to go forward with my own scraping, and I have already solved for the problem of redundancy (what if the same news story comes up from separate searches?), so I figured I would add a Programmable Search Engine from Google to the list of sites or endpoints to scrape rather than replacing it.
So, I have defined such a search engine. It is restricted to the following 10 sites:
- The Salt Lake Tribune
- The New York Times
- The Washington Post
- Utah Public Radio
- Deseret News
Excluded from that list are KZMU and The Moab Sun News. Those I am still trying to figure out how to handle. Do I put every article from them in my inbox? How do I programmatically filter the stories I actually want to read from them?
Now that I have that custom search engine defined, I just need to straighten out definitions for direct scraping. The database and emailing parts I believe are sorted, though I might need to make minor tweaks to get it to production quality.