Crawl Your Website Like Googlebot!

Search Engine Optimization - SEO

Updated on Sep 2, 2019


We crawl your website like Googlebot to identify every technical error, loop and blockage that prevents Google from actually seeing and indexing it.

Our customers get an average increase of 3-5X in organic traffic after deploying our recommendations!


Crawl up to 20,000 pages during the free trial.

Save 10% for the 1st year when the free trial ends!

JetOctopus SEO Crawler Demo Overview

[0:01] The JetOctopus report consists of several basic sections: Problems, Analytics (no longer in the dashboard; we rebuilt it), Data table, and Segments, which work across all the sections. Now we're going to take a closer look at each of them.

[0:17] The Problems section shows the most critical problems and errors found on the website by various automatic checks. Each section starts with a dashboard, where we can see the total number of crawled pages, the number of pages with critical errors and the number of pages with warning issues. Note that this is not the total number of errors on the website but the number of pages with errors. For instance, a page may lack a title and a meta description and also have duplicate content; such a page is counted just once. This way you see the real number of problem pages rather than a raw error count that may vary from case to case.
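To illustrate the difference between counting errors and counting pages with errors, here is a minimal Python sketch. The issue names and the `issues_by_page` structure are made up for the example and are not JetOctopus data formats.

```python
from typing import Dict, List

# Hypothetical crawl result: URL -> list of issues detected on that page.
issues_by_page: Dict[str, List[str]] = {
    "/a": ["missing_title", "empty_meta_description", "duplicate_content"],
    "/b": ["missing_h1"],
    "/c": [],
}


def count_problem_pages(issues: Dict[str, List[str]]) -> int:
    """Count pages that have at least one issue; each page counts once."""
    return sum(1 for page_issues in issues.values() if page_issues)


def count_total_issues(issues: Dict[str, List[str]]) -> int:
    """Count every individual issue across all pages."""
    return sum(len(page_issues) for page_issues in issues.values())


problem_pages = count_problem_pages(issues_by_page)
total_issues = count_total_issues(issues_by_page)
print(problem_pages, total_issues)
```

Page `/a` has three issues but contributes only one to the problem-page count, which is the figure the dashboard is described as showing.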

[1:04] In this case we can see that most errors are HTML-related.

[1:09] Moving here, we can see the list of main issues for the given section: empty meta descriptions, iframe tags (considered warning issues) and pages lacking H1 tags.

[1:30] There are additional diagrams for many reports. In this case we see the duplication details diagram, which shows the proportion of duplicate and unique titles, meta descriptions and headers. It should be mentioned that JetOctopus counts duplicates not by strict match but by a normalized form. For example, take two pages with the titles “buy black iphone” and “buy iphone black”: under a strict match they won’t be considered duplicates, because the titles are different. But as you know, search engines process text much more smartly, and in normalized form these titles are considered duplicates. All this information is shown in the normalized form chart.
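The transcript does not say how JetOctopus builds its normalized form. As a rough sketch of the idea only, the Python below assumes one simple normalization (lowercase, strip punctuation, sort the words) so that word order stops mattering; the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def normalize_title(title: str) -> str:
    """Assumed normalization: lowercase, keep word characters, sort words."""
    words = re.findall(r"\w+", title.lower())
    return " ".join(sorted(words))


def duplicates_by_normalized_form(titles: Iterable[str]) -> Dict[str, List[str]]:
    """Group titles whose normalized forms match; keep groups of 2 or more."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for title in titles:
        groups[normalize_title(title)].append(title)
    return {key: group for key, group in groups.items() if len(group) > 1}


titles = ["buy black iphone", "buy iphone black", "iphone cases"]
print(duplicates_by_normalized_form(titles))
```

Under a strict string comparison the first two titles differ, but they collapse to the same key here, which is the behaviour the normalized form chart is described as capturing.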

[2:30] Here you can see empty tags, title lengths, meta description lengths.

[2:37] The same goes for the Technical section: we can see that the site has no serious problems and its technical setup is rather good. You can see the breakdown of status codes, load time and page sizes.

[2:53] In the Content section we've also got very nice diagrams that help when working with web stores and other sites that may have a thin-pages problem. With the Words Count Range chart you can see at once the pages containing fewer than 50, 100 or 200 words. There's an option of switching to unique words mode. Sometimes widgets are duplicated or SEO text appears twice; this shows up at once as a gap between the All words and Unique words numbers. [3:40] In the next chart we can see that the total number of words on the page is 1023 and the number of unique ones is 405. It's hard to tell whether that is good or bad in this case, but a page with 5000 words and 500 unique words would be an issue that demands deeper analysis of what the problem is.
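The all-words versus unique-words comparison can be sketched in a few lines of Python. The tokenization rule (lowercased runs of word characters) is an assumption for the example; the crawler's own word-splitting rules are not described in the video.

```python
import re
from typing import Tuple


def word_counts(text: str) -> Tuple[int, int]:
    """Return (all words, unique words) for a block of page text."""
    words = re.findall(r"\w+", text.lower())
    return len(words), len(set(words))


# A text block that accidentally appears twice on the page.
page_text = "Free shipping on all orders. Free shipping on all orders."
all_words, unique_words = word_counts(page_text)
print(all_words, unique_words)
```

When a widget or SEO text is repeated, the all-words figure grows while the unique-words figure stays put, so a large gap between the two is worth a closer look.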

[4:01] There’s also a Text/HTML Ratio chart that shows the ratio of the text size to the whole HTML size of the page. How does it help? We often see websites, especially ones built on old versions, that generate a lot of service code, like 80 kB, which is not visible on the page and is practically useless but adds too much weight to the page.
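One possible way to compute a text/HTML ratio is sketched below with Python's standard `html.parser`. It assumes "text" means character data outside `script` and `style` elements and that sizes are measured in UTF-8 bytes; the exact formula JetOctopus uses is not given in the video.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collects character data, skipping script and style contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def text_html_ratio(html: str) -> float:
    """Visible-text bytes divided by total HTML bytes (0.0 for empty input)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so indentation does not count as text.
    text = " ".join(" ".join(parser.chunks).split())
    html_bytes = len(html.encode("utf-8"))
    return len(text.encode("utf-8")) / html_bytes if html_bytes else 0.0


sample = "<html><body><script>var x = 1;</script><p>Hello</p></body></html>"
print(round(text_html_ratio(sample), 3))
```

A page carrying tens of kilobytes of invisible service code would show a very low ratio with this kind of measure.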

[4:37] The same goes for the Links section. We consider it bad when only one link points to a page, especially on a site with 1 mln pages. We consider it a warning issue when there are fewer than 10 links.
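The two thresholds mentioned here can be written as a small Python rule. The severity labels and the treatment of pages with zero inbound links are assumptions for the example.

```python
def classify_inlinks(inlink_count: int) -> str:
    """Classify a page by its number of inbound internal links.

    Thresholds follow the video: a single link is bad, fewer than 10 is a
    warning. Treating 0 links as bad too is an assumption.
    """
    if inlink_count <= 1:
        return "critical"
    if inlink_count < 10:
        return "warning"
    return "ok"


for count in (1, 9, 10):
    print(count, classify_inlinks(count))
```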

[4:53] In general the Problems section helps to quickly find critical errors on the website. It is especially useful when there’s a new customer and you need to get a general picture of how his or her site is doing as soon as possible, or when you are launching a new site and need to quickly estimate how well it works.

[5:20] The next section is Analytics. Here we’re not speaking about what is good or what is bad; it is more like X-raying the website. The system collects the data and users decide for themselves what is good and what is bad.

[5:41] Here you can see the number of crawled pages and the number of indexable pages. Indexable means the page must have status code 200, it must be allowed for indexing by directives such as robots.txt, the meta robots tag, the X-Robots-Tag header and so on, plus it must not be canonicalized to another URL (it either has no canonical tag or is self-canonical).
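The indexability conditions listed here can be expressed as a single check. The Python sketch below is one reading of them, not the crawler's actual logic: the `Page` fields are hypothetical, the robots.txt decision is assumed to be precomputed, and a page with no canonical tag or a canonical pointing to itself is assumed to count as indexable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    # Hypothetical per-page crawl record.
    url: str
    status_code: int
    allowed_by_robots_txt: bool
    meta_robots: str = ""          # e.g. "noindex, follow"
    x_robots_tag: str = ""         # value of the X-Robots-Tag header
    canonical_url: Optional[str] = None


def _blocks_indexing(directives: str) -> bool:
    """True if a comma-separated directive list contains noindex or none."""
    tokens = {d.strip().lower() for d in directives.split(",")}
    return bool(tokens & {"noindex", "none"})


def is_indexable(page: Page) -> bool:
    if page.status_code != 200:
        return False
    if not page.allowed_by_robots_txt:
        return False
    if _blocks_indexing(page.meta_robots) or _blocks_indexing(page.x_robots_tag):
        return False
    # Assumption: no canonical tag, or a self-referencing one, is fine.
    return page.canonical_url is None or page.canonical_url == page.url


print(is_indexable(Page("/a", 200, True)))
print(is_indexable(Page("/a?sort=price", 200, True, canonical_url="/a")))
```

The second page returns 200 and is not blocked, but it is canonicalized to `/a`, so it falls into the non-indexable group.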

[6:01] We also can see the number of known pages – for instance we’ve crawled 1 mln pages and we’ve got another 27 mln found pages in a queue that are available for further crawling.

[6:13] And here we can see the average load time of the website.

[6:17] The first chart, Page Type, shows the split between indexable and non-indexable pages. It should be mentioned that all the diagrams, almost everything that you see, are clickable.

[6:32] For example, if you want to find the non-indexable pages on the 4th nesting level, you click here and go to a data table with the filters already set, “Distance from index” 4 and “Is indexable” No, so you can immediately start working with the pages. I’ll tell you later about the data table.

[6:53] So, in the Analytics dashboard you can see general information: status codes, load time, top directories found and links distribution.

[7:04] The next important section is Indexation. From just one graph you can see at once how the site is indexed: what is indexable and what is blocked. And as we are studying the example of a web store, we get a classic picture: lots of non-canonical pages, used to hide filter pages.

[7:24] Here is the Follow chart. It displays the way instructions are followed – every site follows them differently – and here you can see the correspondence between the instructions and the way they are followed and indexed.

[7:38] Below there is a rather rare chart called extra Meta Robots. These directives (nosnippet, noarchive, notranslate, noimageindex) are sometimes used but more often forgotten, which can cause various errors afterwards. In our example we can see that there are pages with noarchive. You can’t say this is bad, but if you've never set these directives, or you don't want them, and they still show up in this diagram, that's a sign to pay attention to them.

[8:05] The Real tags subsection shows the breakdown of pages with no tags, self-canonical and canonical pages. The same data is displayed by depth.

[8:17] One of the most important diagrams is Canonical evaluation, which shows how many non-canonical pages there are (here we can see 810 thousand) and how many canonical pages they point to: out of 810,000 there are only 3,200 canonical pages. This lets you estimate the volume of filters. Here it is clear that the number of filter pages is simply gigantic, perhaps an infinite number of filter combinations.

[8:48] The next diagrams here are Pagination, Relative mobile alteration and Relative hreflang by depth.

[8:54] Let’s move on to the Technical subsections, starting with Performance. In the Load time breakdown we can see how fast the site works: the load time of this site is 0.5 seconds, which is good; a little lower we see pages with a load time from 0.5 to 1 second, which is not bad, though faster would be better; plus there are 11 pages that didn't load due to timeout.

[9:19] There’s another graph called Load time by status code – sometimes a site loads 200 code pages rather fast, but with some redirects or 404 pages it slows down to 5-7 or even more seconds – this diagram will provide you immediately with this information. Here we can see that there are no such problems found with this site.

[9:42] Then there are the breakdowns Load time by languages and Page size. You can see small pages under 200 kB, medium ones from 200 to 500 kB and big ones over 500 kB. These thresholds are set in the account settings or the crawler settings, so everyone can choose their own size limits.
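A configurable size bucketing like the one described could look like this in Python. The parameter names are hypothetical, 1 kB is taken as 1000 bytes, and the handling of pages exactly on a boundary is an assumption.

```python
def page_size_bucket(size_bytes: int,
                     small_max_kb: float = 200,
                     medium_max_kb: float = 500) -> str:
    """Bucket a page by size using user-adjustable thresholds in kB."""
    size_kb = size_bytes / 1000  # assumption: 1 kB = 1000 bytes
    if size_kb < small_max_kb:
        return "small"
    if size_kb <= medium_max_kb:
        return "medium"
    return "big"


print(page_size_bucket(150_000))                      # default thresholds
print(page_size_bucket(150_000, small_max_kb=100))    # stricter custom limit
```

Passing different thresholds mirrors the point that each account can choose its own size settings.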

[10:05] The next subsection contains a breakdown of Status codes: here you can see the information on status codes and a Page loaded after error diagram. Let me tell you how the crawler works with error pages. If we can’t load a page due to a timeout, or the server gives a 5xx code (a server error), we try to load this page again later, over and over; there’s a special algorithm that keeps working to load the page eventually. If a page couldn’t load at first but loaded after several tries, it goes into this report so that you can see there are some problems with it. Take this report to technical support so they can check the server logs and find out why the page didn’t load initially.

[10:51] Because even if all the pages do load eventually, if some pages fail from time to time there’s a chance that Google might later send you a message saying that errors were found on the website.
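The crawler's actual retry algorithm is not described, so the following Python is only a sketch of the general idea: retry on a timeout or 5xx response, and report pages that succeeded only after an earlier failure. The `fetch` callable, the attempt limit and the exponential delay are all assumptions; `fetch` is expected to return a status code, or `None` on timeout.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple

Fetch = Callable[[str], Optional[int]]  # status code, or None on timeout


def fetch_with_retries(url: str, fetch: Fetch, max_attempts: int = 4,
                       base_delay: float = 0.0) -> Tuple[Optional[int], int]:
    """Return (final status, attempts used), retrying timeouts and 5xx."""
    status: Optional[int] = None
    for attempt in range(1, max_attempts + 1):
        status = fetch(url)
        if status is not None and status < 500:
            return status, attempt
        if attempt < max_attempts:
            # Assumed back-off: wait longer before each new attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    return status, max_attempts


def loaded_after_error(urls: Iterable[str], fetch: Fetch,
                       max_attempts: int = 4) -> List[str]:
    """URLs that failed at first but loaded on a later attempt."""
    report = []
    for url in urls:
        status, attempts = fetch_with_retries(url, fetch, max_attempts)
        if attempts > 1 and status is not None and status < 500:
            report.append(url)
    return report
```

A page that never recovers stays out of this list; it would show up among the error status codes instead.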

[11:04] Let’s move to the HTML section. This section contains data concerning titles, meta descriptions and everything that is not in the body tag. Here we can see a breakdown of words in titles, meta descriptions and headers.

[11:22] This is the Title Words analysis. Titles, like any other text content, are analyzed by unique words. Some websites, like aggregators, web stores and catalogues, widely use auto-generation; for example, you may come across patterns such as “buy a phone”, “buy a phone in moscow”, “best choice of phones in moscow”, something like that. These titles seem long and unique, but if you look at the text content you’ll see that the number of unique words is rather small.

[12:10] With the help of this chart we can compare all words with unique ones. Here you can see that we’ve got quite a lot of titles with over 8 words, but far fewer titles with that many unique words; by unique words, these titles fall into the “0-5” group. This means there are duplicate words on the site, so pay attention to it.

[12:39] If the total number of words is two or even three times the number of unique words, that’s a sign of over-spammed content.

[12:45] The next chart displays the same analysis by depth. We see that it happens more often on the 3rd and 4th levels, so there must be some catalogue pages with duplicate words.

[12:57] The same applies to meta descriptions: they ought to be relevant.

[13:01] The same with the headers.

[13:04] A separate subsection is Duplications. The data on duplications is collected here, as duplication is considered one of the most common problems. In the Title Duplication Breakdown chart we can see a clear picture of how many titles are unique, empty, duplicated and duplicated by normalized form. I’d like to remind you that if you want to see the details of, say, the duplicated titles, you just click here [13:34] and you go to a data table containing all the information on the problem in question.

[13:40] Another chart, Title Duplication Group Size, shows how exactly the titles are duplicated. For instance, on 1 mln pages there might be one title repeated 1 mln times, giving a single group of 1 mln pages, or there might be 500,000 titles each used twice, in which case you’ll see that each title has one duplicate.
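The group-size idea can be sketched with Python's `collections.Counter`: first count how many pages share each title, then count how many titles fall into each group size. The function name and sample titles are illustrative only.

```python
from collections import Counter
from typing import Dict, Iterable


def duplication_group_sizes(titles: Iterable[str]) -> Dict[int, int]:
    """Map group size (pages sharing a title) -> number of such titles."""
    pages_per_title = Counter(titles)
    return dict(Counter(pages_per_title.values()))


titles = ["Sale"] * 3 + ["Dresses", "Dresses", "Tops", "Tops", "About us"]
print(duplication_group_sizes(titles))

# The most duplicated titles themselves, largest group first.
print(Counter(titles).most_common(2))
```

One title shared by every page and many titles each used twice give the same number of duplicated pages but very different group-size pictures, which is why the chart is useful.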

[14:05] The next diagram Most Duplicated Titles shows the examples of the duplicate titles.

[14:12] We’ve also got the same breakdown for meta descriptions. This subsection lets you assess the site’s condition in terms of duplicates.

[14:20] The next section is Content. Here we can see the breakdown of words on the page. Special attention should be paid to the extreme values. For example, there’s a group of quite a lot of pages with 10-20 words. Sometimes, for different reasons such as a poor implementation or AJAX loading, pages appear that contain only fragments of other pages, for example pages with men’s and women’s clothing sizes: they have very few words, yet they are indexable. Or you see 10,000 words on a page when you know that can’t be right. In other words, extreme values always need further analysis.

[15:10] Again you can switch to the unique words mode.

[15:13] The next is Text/HTML ratio breakdown, the same by depth and a chart displaying the correlation between all words and unique ones.

[15:22] The next important section is Links Overview. Our crawler can analyze any number of links there are on the site – here we see that on the site with 1 mln pages it crawled 323 mln links. If you’ve got 20 bln links you’ll get statistics on all 20 bln links without any looping.

[15:42] Here is a general diagram showing how many internal, external and social links there are on the site.

[15:51] Now we go to the Internal Links section. When we analyze an internal link we can always see whether it is allowed by robots.txt or not. And in the given example we see that all the links are allowed.

[16:07] In the Follow diagram we see the split between follow and nofollow links by their rel attribute. Some SEO experts say that there shouldn’t be any nofollow links, others believe that you can sculpt PageRank with their help, and so on. So here you see this breakdown.

[16:26] Below you can see how many image and text links there are.

[16:31] Then there’s HREF Protocol. This one is of use when you’ve moved your site to HTTPS and see that you still have almost 1 mln http links; knowing this, you can export them and start replacing the old links.

[16:50] Another two diagrams, HREF Absolute/Relative and HREF Target, show whether links use absolute or relative URLs and whether they open in a new tab.
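A protocol breakdown of link targets, as in the HREF Protocol chart, can be approximated with the standard `urllib.parse` module. Labeling links without a scheme as "relative" is a choice made for this sketch; it also covers protocol-relative URLs such as `//cdn.example.com/x`.

```python
from collections import Counter
from typing import Dict, Iterable
from urllib.parse import urlsplit


def href_protocol_breakdown(hrefs: Iterable[str]) -> Dict[str, int]:
    """Count link targets by URL scheme; scheme-less hrefs are 'relative'."""
    counts: Counter = Counter()
    for href in hrefs:
        scheme = urlsplit(href).scheme.lower()
        counts[scheme or "relative"] += 1
    return dict(counts)


hrefs = [
    "http://example.com/old-page",
    "https://example.com/new-page",
    "/cart",
]
print(href_protocol_breakdown(hrefs))
```

After an HTTPS migration, anything still counted under `http` is a candidate for the export-and-fix workflow mentioned in the video.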

[16:56] The next point is Text Analysis. Here we can see that out of 310 mln anchors, 278 mln are not unique; it also shows the number of unique and empty ones. The empty ones may be image links rather than text links.

[17:15] There’s a breakdown of words in anchors.

[17:20] Image ALT words analysis, title links analysis, words links analysis.

[17:27] Here we can see examples of top anchors, examples of top images ALT.

[17:33] In External links there are additional breakdowns of social links and of follow and nofollow external links. Many websites, such as news sites, always use nofollow for external links.

[17:57] Then there are top domains, the correlation between text and images, HREF Protocol and target types.

[18:05] The Text Analysis of external links is much the same as that of internal links. You can see a breakdown of all the attributes. But usually text analysis of external links is not as important as that of internal ones.

[18:16] And the last one is Sitemaps. You need to enable loading and analysis of sitemaps in the crawler settings. The most important point here is orphan pages: pages that are listed in the sitemaps but were not found by the crawler. The presence of such pages is considered a critical error that should be corrected as soon as possible. The breakdown below shows how many such pages there are.
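Given the definition above, orphan pages are a set difference: URLs listed in the sitemaps minus URLs the crawler reached by following links. A minimal Python sketch with made-up URLs (real data would also need URL normalization, which is left out here):

```python
from typing import Iterable, List


def find_orphan_pages(sitemap_urls: Iterable[str],
                      crawled_urls: Iterable[str]) -> List[str]:
    """URLs present in the sitemaps but never reached by the crawl."""
    return sorted(set(sitemap_urls) - set(crawled_urls))


sitemap_urls = [
    "https://example.com/",
    "https://example.com/sale",
    "https://example.com/old-landing",
]
crawled_urls = ["https://example.com/", "https://example.com/sale"]
print(find_orphan_pages(sitemap_urls, crawled_urls))
```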

[18:46] The last report is Data table. We call it ‘Excel’ for the site. There are over 140 filters with the help of which you can make various data selections on the site. For example let’s find the pages with duplicate titles.

[19:10] Titles Duplicated Count, greater or equal 1, start searching.

[19:19] 1 mln pages are processed. And we add a column Title Duplications and the same one by normalized form –

[19:30] we’ve got the results. Now let’s look at the most frequently duplicated pages – you can see that the title New in on Women Gamiss is duplicated 372 thousand times. And if we want to look at them we click here –

[19:49] and we’ve got 372 thousand filtered titles and we can start working with them, can add some attributes.

[20:05] For instance we can look at Count of in Links All to find the most linked pages

[20:14] – start searching – we can put ticks to see additionally the number of In redirects and, say, of In Canonicals.

[20:27] Here we are. We see the URL of the page – New Women – it has a duplicated title and contains 378 thousand canonicals and 1.8 mln links. Now you can start working with optimization and interlinking. Any result can be further adjusted with the help of columns – there’re lots of options so that you could find whatever you need. Any report can be exported, no matter how big its size is.

[21:13] There is the same kind of Data table for Links. You can easily filter all the 323 mln links by different attributes, export them and so on.
There are also Data tables for Sitemap files and Sitemap URLs, if sitemaps are analyzed.

[21:30] This is very convenient when you want to see what’s inside XML sitemaps. You don’t need to crawl all the 20 mln URLs that are in the sitemap; crawling 1 thousand pages would be enough, and all the 20 mln would be loaded from the sitemaps. They won’t be crawled, but you’ll be able to analyze the number of URLs in each sitemap file, the number of incorrect URLs and error URLs, and work with them.

[22:07] And last of all is Segments. They are especially useful when working with big sites, or even medium ones, as there are often various subsections like catalogues, news, blogs and so on.

[22:23] There’re always two preset segments – All Pages and Indexable pages. For instance we want to add a segment.

[22:40] Let’s check if there’s – url contain women

[22:50] – no, too few, let’s take wholesale – right 100 thousand pages – now we are in the Analytics section and we want to add a Wholesale segment

[23:04] – we click ‘Add a segment’ – name it Wholesale – and add a filter that the page url should contain wholesale

[23:15] – save the segment – done. The calculations are made in no time and we see that it contains 107 thousand pages, 32 thousand of them indexable. You can view and work with the information within the given segment. Segments are incredibly convenient and work really fast.

[23:35] You can easily switch between them. The number of segments is unlimited. Again you can see all the problems, work in the Data table and so on. I mean you can slice the site the way you need and work with the slice you want.
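Conceptually, a segment is a saved filter over the crawled pages. The Python below sketches that idea with a "URL contains wholesale" rule like the one in the demo; the row structure, helper names and URLs are invented for the example and say nothing about how JetOctopus stores or computes segments.

```python
from typing import Callable, Dict, List

PageRow = Dict[str, object]

# Hypothetical crawled pages.
pages: List[PageRow] = [
    {"url": "https://example.com/wholesale/dresses", "indexable": True},
    {"url": "https://example.com/wholesale/dresses?color=red", "indexable": False},
    {"url": "https://example.com/women/tops", "indexable": True},
]


def build_segment(rows: List[PageRow],
                  predicate: Callable[[PageRow], bool]) -> List[PageRow]:
    """Return the rows that match the segment's filter."""
    return [row for row in rows if predicate(row)]


def url_contains(fragment: str) -> Callable[[PageRow], bool]:
    """Filter factory: page URL contains the given fragment."""
    return lambda row: fragment in str(row["url"])


wholesale = build_segment(pages, url_contains("wholesale"))
wholesale_indexable = build_segment(wholesale, lambda row: bool(row["indexable"]))
print(len(wholesale), len(wholesale_indexable))
```

Because a segment is just a filter, any report (problems, indexation, data table) can be recomputed on the filtered rows alone, which matches the idea of slicing the site and working with one slice.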

[23:54] I hope I’ve made it clear enough. We’ll be glad to welcome you as our clients.