Have you ever wanted to get data for your Arduino projects, only to find there is no public API for it? Or found, as with the Instagram API, that the setup process is not very convenient?
In this tutorial we are going to look at two different options for scraping data from a website for your ESP8266 or ESP32 projects.
Step 1: Check Out the Video
I have made a video that covers the same thing as this tutorial, so if you are interested, please check it out!
Step 2: Before We Start
Just a heads up that the data I will be talking about scraping is public-facing data that does not require any authentication. For example, my exact YouTube subscriber count is only available to me inside Creator Studio, so the device would have to make a request authenticated as me to load it. These types of requests are out of scope for this tutorial. A quick way to check whether a page is covered is to try loading it in an incognito window, as that won’t automatically log you in to any sites.
For the techniques covered in this tutorial we will have to use some of the developer tools that are available in browsers. I’ll be demonstrating them with Firefox, but I know for certain Chrome has similar tools and I’m sure other browsers have them too.
Step 3: Non-Public APIs
The first way we’ll look at is using a non-public API. This will not always be available, but if it is, this is definitely the method you should aim to use. What I’m calling a “non-public API” is basically where a site uses an unadvertised API behind the scenes to fetch the data we are looking to get.
There are a few reasons why this would be the preferred option to use.
- The biggest advantage is that it is unlikely to change as often as a webpage. If you scrape data directly from the web page HTML, your parsing might break every time they make a change to the site.
- It's normally more data efficient. When you are scraping a webpage you are basically downloading the entire HTML page to extract pieces of info from it. APIs only return the data points, so the responses are normally much smaller.
- It's usually easier to parse. APIs normally return data in JSON format, which is straightforward to parse. This is especially true if you are extracting multiple pieces of data.
We first have to find out if the webpage uses a setup like this. The biggest clue is if the site updates the value in real-time like it does on Kickstarter, but even if it doesn’t there is still hope that it might use this setup. Instructables uses a non-public API for fetching some data for their site even though it doesn't refresh in real time.
To check if the site is using this setup, enter the developer mode of your browser. I find the easiest way of doing this is to right click on the page and select “inspect element”.
You'll then want to go to the network tab, which displays the requests the webpage makes in the background. Note that you might need to reload the page after opening this tab, because it only shows requests made from that point on.
You normally want to look for the ones with the type “json”. There can be a lot of requests here, so it may help to sort by type. It’s very obvious on a Kickstarter campaign page that it’s using this setup, as you can see constant requests being made to a “stats.json” endpoint. On Instructables' authors page (e.g. mine is "https://www.instructables.com/member/witnessmenow/"), they don’t make constant requests, but hidden amongst the others you can see a request to a “showAuthorStats” endpoint.
To find out more information about a request, you can click on it. You should be able to get all the information you need from here to replicate the request. But before you do that, double check that it has the data you want. Click on the response tab and see if the data is there.
If it does contain the data you need, you are all set! You can then use the same approaches discussed in my previous video about connecting to APIs. The short version of that is to make sure the request works as expected on a tool like Postman first and then use this example project to test that it works on your device.
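As a rough illustration of what "replicating the request" means on the device, here is a minimal sketch in plain C++ that builds the text of a simple GET request. The function name, host and path are all placeholders of mine, not taken from the example project: copy the real values from the request shown in the network tab. I have assumed HTTP/1.0 here because the server then should not reply with a chunked response, which keeps the body simpler to read.

```cpp
#include <string>

// Builds the raw text of a minimal HTTP GET request.
// host and path are placeholders: use the values from the request
// you found in the browser's network tab.
std::string buildGetRequest(const std::string& host, const std::string& path) {
    std::string request;
    request += "GET " + path + " HTTP/1.0\r\n";
    request += "Host: " + host + "\r\n";
    request += "Connection: close\r\n";
    request += "\r\n";  // a blank line marks the end of the request headers
    return request;
}
```

On the device you would print the same lines to the connected client rather than build a string first. Some endpoints may also expect extra headers that the browser sent, which is another reason to get the request working in Postman before moving to the device.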
For parsing the JSON data I would recommend using ArduinoJson in most scenarios. If this is something you would like a tutorial about, just let me know!
Step 4: Scraping Data Directly
Next we will look at scraping the data directly from the webpage. This means requesting the full webpage on the device and parsing out the data we want. I already mentioned the advantages the non-public API has over this method, but sometimes needs must!
One thing that is important to note here: if you are familiar with web development, you might be used to using the inspect element feature to find out information about a particular element and how it’s structured. This should be avoided for this approach, because modern web pages are usually changed dynamically using JavaScript, which will not happen on your device. The HTML available on your device will only be the original webpage as downloaded. A good example of this is the TeamTrees page: the current donation count starts as 0 and gets loaded into the page later with an animation. Unlike the two examples we have seen before, though, it doesn’t load the data in the background, so the correct data must be somewhere else.
To view the original web page code, you can right click on the page and select “View Source”. You then want to search for the particular data you want. In the TeamTrees example, when we search for the current donation count, we can see the actual count is stored in the data-count attribute of the count element, so this is where we need to scrape the data from.
You need to find a search string that leads you to your data. It’s much easier to figure this out before coding for the device. For this example, searching for “data-count\”” brings me right up to the data we want, which is perfect. We don’t need to worry that it also matches in other places in the page, because it will hit the first one first. If you did need to hit the 3rd one, you could just program it to ignore the first 2 you hit.
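To make the "ignore the first 2" idea concrete, here is a small sketch in plain C++ that works on a page held in a string. The function name is my own, and the search string in the usage note assumes the attribute appears in the page source as data-count="...", so check that against View Source for your own page.

```cpp
#include <cstddef>
#include <string>

// Returns the index just past the nth (1-based) occurrence of `needle` in
// `page`, or std::string::npos if there are fewer than n occurrences.
std::size_t positionAfterNthMatch(const std::string& page,
                                  const std::string& needle,
                                  int n) {
    if (needle.empty() || n < 1) return std::string::npos;
    std::size_t pos = 0;
    for (int i = 0; i < n; ++i) {
        std::size_t hit = page.find(needle, pos);
        if (hit == std::string::npos) return std::string::npos;
        pos = hit + needle.size();  // continue searching after this match
    }
    return pos;
}
```

For example, positionAfterNthMatch(page, "data-count=\"", 3) would point at the first character of the third value. On the device you don't hold the whole page in memory, but because client.find moves the stream past each match, calling it once per match you want to skip should have the same effect.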
If we take a look at the TeamTrees example, like before we have skipped over the response headers and are now looking at the body of the response (which is the webpage). What comes back from the client is a stream of data. We don’t care about anything up to our search query, so we do a client.find. If it finds the search query, it returns true and moves the stream to the end of the query. The next thing available from the stream will be the data we are looking for. In this case we are unsure how long the data will be, but we do know it is everything between our current place in the stream and the next inverted comma. We can achieve this by using “client.readBytesUntil”, which does what it says: it reads bytes into a buffer until it hits the specified terminator character. Just make sure the buffer you are reading into is big enough to hold all the data. I think we're pretty safe here with 32!
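The flow described above can be sketched off-device as follows. FakeClient is a desktop stand-in I have written to model how find and readBytesUntil behave on an Arduino stream. It is not the Arduino implementation and it has no timeouts. The scrapeCount name and the data-count="..." markup are also my assumptions, not copied from the example project.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Desktop stand-in for the two Arduino Stream calls used in the text.
// It only models their behaviour on data that has already arrived.
class FakeClient {
public:
    explicit FakeClient(std::string data) : data_(std::move(data)), pos_(0) {}

    // Like Stream::find: consume input up to and including `target`.
    bool find(const char* target) {
        std::size_t hit = data_.find(target, pos_);
        if (hit == std::string::npos) { pos_ = data_.size(); return false; }
        pos_ = hit + std::strlen(target);
        return true;
    }

    // Like Stream::readBytesUntil: copy bytes until `terminator` is seen, the
    // buffer is full, or the data runs out. The terminator is not stored.
    std::size_t readBytesUntil(char terminator, char* buffer, std::size_t length) {
        std::size_t count = 0;
        while (count < length && pos_ < data_.size()) {
            char c = data_[pos_++];
            if (c == terminator) break;
            buffer[count++] = c;
        }
        return count;
    }

private:
    std::string data_;
    std::size_t pos_;
};

// Skips the headers, finds the search string, and returns the value that
// follows it up to the next double quote ("" if anything is missing).
std::string scrapeCount(FakeClient& client) {
    if (!client.find("\r\n\r\n")) return "";       // end of response headers
    if (!client.find("data-count=\"")) return "";  // assumed attribute markup
    char buffer[32] = {0};
    // Leave one byte spare so the buffer always stays null-terminated.
    std::size_t n = client.readBytesUntil('"', buffer, sizeof(buffer) - 1);
    return std::string(buffer, n);
}
```

Note that readBytesUntil does not add a terminating null character itself, which is why the buffer is zeroed first and only 31 of the 32 bytes are offered to it. If the value were longer than that it would be cut short, so size the buffer for the largest value you expect.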
Once you have all the data you need, you don’t need to read any more of the stream. I originally didn’t close the connection here because leaving it open didn’t seem to cause a problem on the ESP8266, but it did seem to cause problems on the ESP32, so I added a client.stop(). To be completely honest, I’m not sure why I put it at the top of the method. I would think it makes more sense to close the connection once you have the data you want.
Step 5: Scraping Data Using an External Server
Just one other topic to touch on: there are much better tools for parsing in regular computer-based environments such as NodeJS than on a microcontroller, so sometimes it might make sense to make a service that fetches the data from a webpage and provides a simpler endpoint for your ESP8266 or ESP32. One example of this was scraping the CrowdSupply page to get a live count of how many TinyPICOs were sold. It may have been possible to achieve this directly on an ESP8266 or ESP32, but as it involved parsing multiple data points from several different elements, it would have been complicated.
I ended up creating a NodeJS project that parsed the data using a library called cheerio, and it worked out very well. I hosted this project on a cloud server I already had, but you could run this kind of project on a Raspberry Pi if you didn’t have something like that set up.
Step 6: Usage Limits
One thing that could potentially impact all of these approaches is hitting a site's usage limits. With regular APIs it’s normally pretty well documented how many requests you can make per minute or per day, and you can limit your project's requests based on this. When you are scraping, you don’t know what these limits are, so you run the risk of hitting them and potentially getting blocked. I can’t give any exact advice on limiting requests so you stay in their good books, but I would think requesting more often than once a minute would be too often, other than maybe in cases like Kickstarter, where they seem to make requests every few seconds themselves.
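One simple way to hold a project to a limit like that is to check the time since the last request before making a new one. This is a minimal sketch in plain C++ with names of my own choosing. The one minute interval is only the rough figure suggested above, not a documented limit for any site.

```cpp
#include <cstdint>

// Minimum gap between requests: one minute, as a conservative default.
const std::uint32_t REQUEST_INTERVAL_MS = 60UL * 1000UL;

// Returns true when at least `intervalMs` has passed since `lastRequestMs`.
// Unsigned subtraction keeps this correct when the millisecond counter wraps.
bool isRequestDue(std::uint32_t nowMs, std::uint32_t lastRequestMs,
                  std::uint32_t intervalMs = REQUEST_INTERVAL_MS) {
    return static_cast<std::uint32_t>(nowMs - lastRequestMs) >= intervalMs;
}
```

On the device you would pass millis() as nowMs from loop(), make the request when this returns true, and then store the current millis() value as the new lastRequestMs.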
Step 7: Thanks for Reading!
Hopefully this tutorial helped if you are interested in parsing data directly from webpages on your ESP8266 or ESP32. Do you have any other questions on the topic that I didn’t cover? Please let me know in the comments below, or join me and a bunch of other makers on my Discord server, where we can discuss this topic or any other maker-related one you have. People are really helpful there, so it’s a great place to hang out.
I would also like to give a huge thanks to my GitHub Sponsors who help support what I do, I really do appreciate it. If you don’t know, GitHub are matching sponsorships for the first year, so if you make a sponsorship they will match it 100% for the next few months.
Thanks for reading!