What is a crawler and how does the ReadCast crawler work?
24th June 2019
This week I'm working on improving the ReadCast crawler to make it work better on more websites.
But let's take a step back and talk about what a crawler actually is and how ReadCast uses a crawler to do its thing.
So what's a crawler?
A crawler (sometimes also known as a scraper, spider or bot) is a script that goes to a webpage and grabs specific information.
For example, Google uses a crawler to visit websites, collect the links and metadata on each page and store them in its large index of the web.
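To make "grabs specific information" a bit more concrete, here is a minimal sketch of the extraction half of a crawler, using only Python's standard library. It is not how Google or ReadCast actually do it; the class name `LinkAndMetaParser` and the function `extract` are made up for this example. It takes the HTML of a page and pulls out the title, the links and the named meta tags.

```python
from html.parser import HTMLParser


class LinkAndMetaParser(HTMLParser):
    """Hypothetical helper: collects the title, links and meta tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Every <a href="..."> is a link the crawler could follow later.
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Named meta tags, e.g. <meta name="description" content="...">.
            self.meta[attrs["name"]] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(html):
    """Return the title, links and metadata found in an HTML string."""
    parser = LinkAndMetaParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links, "meta": parser.meta}
```

A real crawler would also download the page first and then visit the links it found, but the idea stays the same: fetch a page, keep the bits you care about.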
How does the ReadCast crawler work?
When we crawl your article, we will do the following things:
- Visit the webpage of the article
- Remove things like the header, footer, sidebar or other unneeded parts of the page
- Save the remaining content of the webpage
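The three steps above can be sketched roughly like this. This is an illustration under assumptions, not the actual ReadCast code: the tag set `SKIPPED_TAGS` and the functions `fetch`, `clean` and `crawl` are invented for the example, and it only strips whole HTML elements by tag name and keeps the remaining text.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Assumed for this sketch: elements whose content we never want to keep.
SKIPPED_TAGS = {"header", "footer", "aside", "nav", "script", "style"}


class ContentExtractor(HTMLParser):
    """Keeps text that is not inside any of the skipped elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def fetch(url):
    """Step 1: visit the webpage and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def clean(html):
    """Step 2: drop the header, footer, sidebar and similar parts."""
    extractor = ContentExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)


def crawl(url, out_path):
    """Steps 1 to 3: fetch the page, clean it and save what is left."""
    text = clean(fetch(url))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

Calling `crawl("https://example.com/some-article", "article.txt")` (a placeholder URL) would write the cleaned text to `article.txt`. Matching on tag names alone is the simplest thing that could work; many sites mark their sidebars with classes or ids instead, which is why a list of removal rules is needed.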
We've developed our own list of things that get removed from websites. You can find it all in this list. Some elements are removed from every page, no matter what site it is on, and others are removed only when the webpage is on a certain domain.
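A list like that could be structured as one set of global rules plus extra rules per domain. The sketch below is only a guess at the shape, not the real ReadCast list: the selectors, the domains and the names `GLOBAL_SELECTORS`, `DOMAIN_SELECTORS` and `selectors_for` are all placeholders.

```python
from urllib.parse import urlparse

# Placeholder rules: removed from every page, whatever the site.
GLOBAL_SELECTORS = ["header", "footer", "nav", ".sidebar"]

# Placeholder rules: removed only when the page is on the given domain.
DOMAIN_SELECTORS = {
    "example.com": [".newsletter-signup"],
    "blog.example.org": ["#related-posts"],
}


def selectors_for(url):
    """Return the global rules plus any rules for the URL's domain."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.example.com like example.com
    return GLOBAL_SELECTORS + DOMAIN_SELECTORS.get(host, [])
```

With these placeholder rules, `selectors_for("https://www.example.com/post/1")` returns the four global selectors followed by `.newsletter-signup`, while a page on an unlisted domain gets only the global ones.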
If you find a website that doesn't work in ReadCast, meaning the content comes out wrong, then that's an issue with our crawler. If you find one of those issues, you could try to update our list yourself, or you could let us know.