Sitemap Generator - HELP
For more info on how to use Sitemap Generator -- read below...
To build your 1st sitemap, follow the instructions below:
Here is an overview of how it works and what's where:
To create a sitemap -- use the "New Project" wizard or put the domain name in the URL box and click "Create" (as shown above). The "New" button will give you a more "organized" way to set up the project. The quick way (writing the domain in the URL field) will apply the default settings.
1.1) Base URL + Create
This way is quick and uses system defaults (you can set the default project options in the "Settings" screen). Once you click "Create", you can adjust the settings from the "Project Options" button and/or just click "Run" to crawl the target site.
Hint: take a look at the bottom-left and adjust the number of threads and the delay between pages.
1.2) New Project Wizard
The New Project button will open the following screen:
If you use the New Project wizard, you can adjust these options in steps 2 and 3. If you start the project directly from the Base URL field, you can change the preloaded defaults by clicking the Project Options button:
In all cases, these are the main options you can set for the sitemap project:
2.1) General / Crawler Settings
Here you set the general spider behavior:
Note that you can change the number of spiders and the delay for the currently open project "on the fly" -- see the controls in the bottom-left corner (before clicking the "Run" button). Even if you have something different in the options, these values will take precedence and will be used for the current "Run".
2.2) Start Pages
This option is mainly useful in two scenarios:
So, the "Start Pages" option is here for anyone who needs it, but it's better to make your site more readable for search engines. Organize things so all the important sections are findable from the front page.
2.3) Exclude Patterns
Ever since version 0.95 of Sitemap Generator you can set "exclude patterns". Exclude patterns look like this:
*blog/*
*freeware*
In the above example, we tell the crawler to avoid any page that includes the text "blog/" or "freeware" in the URL. Pages like "http://wonderwebware.com/blog/index.html" or "/freeware/index.html" will not be added to the sitemap.
By default the tool will exclude a number of file types from the sitemap, such as image files, exe files, etc. If you want to include such links in the final sitemap -- remove the respective pattern from the list.
2.4) Must-Follow Patterns
You can set "must-follow patterns". Must-follow pattern looks like this:
*blog/*
In the above example we tell the crawler to crawl only pages that include the text "blog/" in the URL. In that case, only pages like "http://wonderwebware.com/blog/index.html" or "/blog/index.html" will be added to the sitemap. Pages that do not include the text "blog/" in the URL will be skipped. The asterisk ( * ) tells the crawler that anything (or nothing) in place of the asterisk fits the pattern, so if the pattern is *blog/* then all of these will be crawled/included:
www.mysite.com/blog/index.html
mysite.com/blog/page.html
/blog/index.html
/blog/
blog/
Note that you can set the pattern in a different form, for example:
http://mysite.com/nofollow*
but in this case, if the link in a given HTML page looks like this: "/nofollow/program.html", it will not be crawled (because the must-follow pattern requires the full domain URL to be present in the link as written in the HTML). So, to crawl all pages from this /nofollow/ folder, you should use a different pattern:
*nofollow/*
Now, no matter how the link is written in the HTML code, if it contains the text "nofollow/" it will be crawled and added to the sitemap.
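If you are curious what this wildcard matching boils down to, here is a minimal Python sketch of the same idea, using the standard fnmatch module. It only illustrates the pattern semantics described above -- it is not the tool's actual code:

    # Illustration only -- not the tool's internal implementation.
    from fnmatch import fnmatch

    def matches(url, pattern):
        # "*" stands for "anything or nothing", so "*blog/*" matches any
        # URL that contains the text "blog/" anywhere in it.
        return fnmatch(url, pattern)

    print(matches("www.mysite.com/blog/index.html", "*blog/*"))              # True
    print(matches("/nofollow/program.html", "http://mysite.com/nofollow*"))  # False
    print(matches("/nofollow/program.html", "*nofollow/*"))                  # True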
IMPORTANT NOTE: If you set a must-follow pattern, you must add a start page that fits the pattern in the "Start Pages" box. For example, if we have the must-follow pattern below:
*blog*
We must add some page that fits that pattern to the Start Pages box -- in our example, the main /blog/ page:
http://mysite.com/blog/
Otherwise we will get nothing: if we just leave the default start page (www.mysite.com), no link will be followed, because the very first page visited by the spider will not fit the must-follow pattern.
So far you have created a sitemap project and set the desired crawling speed (threads & delay). Now click "Run" and wait until the spiders crawl your site.
on "Run" button clicked:
Once the crawlers finish their job and the SEO postprocessing ends, you'll get the sitemap.
The ready-to-use sitemap is in the "Final Sitemap" tab. You can save the XML sitemap, upload it to your server and submit it to Google.
If you want to know the SEO reasoning and how to use the information in the other tabs, read below.
4.1) Total Pages Found
This is the list of all URLs found by the spiders before the search engine optimization part. The SEO "postprocessor" will remove some of these.
At this stage, the exclude patterns and must-follow patterns are already applied, so only acceptable (pattern-matching) pages should be listed. Anything disallowed by the "robots.txt" rules should also be excluded here.
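By the way, if you ever want to double-check a robots.txt rule by hand, Python's standard library can do it in a few lines. The URL and user agent below are placeholders, not anything the tool requires:

    # Standalone robots.txt check (placeholder URL and user agent).
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://mysite.com/robots.txt")
    rp.read()  # downloads and parses the file

    # True if this user agent is allowed to fetch the given URL
    print(rp.can_fetch("SitemapGenerator", "http://mysite.com/blog/index.html"))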
At this step, the so-called "postprocessor" kicks in. It will apply some "SEO" rules to produce the final sitemap.
4.2) Excluded URLs
In this tab you can see why the postprocessor removed some pages from the final sitemap. It tries to follow the best SEO practices (a short illustrative sketch of these checks follows the list):
avoid non-HTML results
removes "noindex" pages (where it's set with HTML tag, not by robots.txt rules)
even if the "noindex" tag is only for Googlebot, it's still "no go"
"non canonical" pages are pages that point to other (canonical) page, so the canonical version must go in the sitemap
"not found" pages (server errors, internet connection errors... there should be no ages here in general, so check things out)
4.3) Issues & Notes
In this tab, I've put some "extra" tips -- pages kept in the sitemap, but with SEO issues. Those should be self-evident, but here is the reasoning behind these notes:
"orphan" canonical -- I made up this term to describe pages that was found as "canonical" while crawling the site, but were not downloaded. Some false positives could sneak here, so don't worry if there is one or two of these. Only take care of this if you see too many of these. Then you may want to check how are your pages canonicalized.
multiple canonical -- this is bad (from the SEO perspective). A page should only have ONE canonical version. Fix this or you risk search engines getting bad feelings about your site.
4.4) Final Sitemap
Well, as the name suggests, this is the final, SEO-processed sitemap, ready for export.
To get it in your preferred format, click the respective button in the bottom-right (a rough sketch of the three output formats follows the list):
Save XML Sitemap -- to get the "industry-standard" XML version. Note that I didn't bother to add attributes like "refresh rate", "last updated" etc... We know Google doesn't use those.
Save TXT Sitemap -- will export the URL list in plain text
Save as HTML -- please note this (HTML) version is NOT to be submitted to search engines. This will export all URLs in <a href="url">url</a> format.
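For reference, here is a rough Python sketch of what these three outputs contain. The URLs are made up; the XML part follows the standard sitemaps.org structure (without the optional attributes mentioned above):

    # Made-up URLs, just to show the shape of the three export formats.
    urls = ["http://mysite.com/", "http://mysite.com/blog/index.html"]

    # XML sitemap (the sitemaps.org format, no optional attributes)
    xml = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    xml += ["  <url><loc>%s</loc></url>" % u for u in urls]
    xml.append("</urlset>")

    # TXT sitemap: one URL per line; HTML sitemap: one <a href> link per URL
    txt = "\n".join(urls)
    html = "\n".join('<a href="%s">%s</a>' % (u, u) for u in urls)

    print("\n".join(xml))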
Click the Save to XML button, save the sitemap, upload it to the server and submit it to Google, Yahoo, etc.
Oh, yes, don't forget to add the live link in your "robots.txt" file.
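The line to add to robots.txt is the standard "Sitemap:" directive, pointing to wherever you uploaded the file (the URL below is just an example):

    Sitemap: http://mysite.com/sitemap.xml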
These are a few "maintenance" functions you may find useful.
5.1) Delete Project
To remove a project from the database (i.e. delete all related files):
Open the project
Click the "Project Options" button
Click the "Delete" button on the bottom-left
Confirm removal
5.2) Export Project (Save As...)
The Import/Export functionality is in the "Data & Tools" menu:
Save Status Log will save the content of the "Status Log" window.
Export Including Cache -- this will save everything in a file "as is", including all the cached files.
Clear Cache and Export will first clear the cache and then export only the options and current state.
5.3) Import Project
Import Project does the reverse of "Export" and will import the project from a file. Note that if you import with cache, it will take a while. Also, close any open project before using this. If the same project already exists, the import will fail.
To change default project options and program settings, click the "Settings" button.
Default settings are used when you create a new sitemap project. They don't affect already created projects, only future ones.
User Agent — or how the Sitemap Generator presents itself to the server.
This will set the “user-agent” for the http request header sent.
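For example, if you set the user agent string to "SitemapGenerator/2.0" (a made-up value, not a default of the tool), every crawler request will carry a header line like this:

    User-Agent: SitemapGenerator/2.0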
Number of threads — or how many spiders will scan the site in parallel
Don't put too big a number here. Each additional crawler will use system resources (memory, CPU) and, past some point, it will start degrading performance instead of speeding up the crawl. In my tests, 10-12 threads are the maximum useful number. The default of 4 is more on the safe side.
Be sure to check with your webmaster how the server will react to a big number of requests if you decide to start many parallel bots all requesting pages at the same time.
Delay between pages — or how long the crawlers will wait before requesting the next page.
Note that this is per-crawler, so if you have 4 threads in parallel with 1000ms delay for each one, the theoretical delay (from the server-side) will be something like 250ms.
As a rule of thumb: don't lower the delay if you use more threads. 10 threads with a 50ms delay, for example, will push the theoretical request rate well past 10 pages per second (up to 200, by the same arithmetic). Please be careful what pressure you put on the target site.
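Here is the same arithmetic as a tiny Python helper, if you want to play with the numbers. It is an upper bound: it ignores the time each page actually takes to download:

    # Theoretical request rate, ignoring download time.
    def theoretical_rate(threads, delay_ms):
        return threads / (delay_ms / 1000.0)   # requests per second

    print(theoretical_rate(4, 1000))   # 4.0   -> roughly one request every 250 ms
    print(theoretical_rate(10, 50))    # 200.0 -> far too aggressive for most servers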
Timeout — or how long the tool will wait for the crawlers if no new pages are found.
This is a safety measure. If there is some strange never-ending redirection loop, we want to stop crawling and proceed. The same goes if there is no internet connection, or if too many spiders are doing nothing but processing the same old pages from the cache -- we may want to stop wasting time. In general, if nothing happens for 3 minutes (the default value), the tool will stop crawling.
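If you wonder what such a safety timeout amounts to, here is a minimal Python sketch of the idea. The names and structure are illustrative, not taken from the tool:

    # Minimal "give up if nothing new happens" watchdog (illustrative only).
    import time

    TIMEOUT_SECONDS = 180          # the default "3 minutes"
    last_progress = time.monotonic()

    def note_progress():
        # Call this whenever a crawler discovers a new page.
        global last_progress
        last_progress = time.monotonic()

    def should_stop():
        # True when no new pages have shown up for TIMEOUT_SECONDS.
        return time.monotonic() - last_progress > TIMEOUT_SECONDS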
These are the app settings not related to the sitemaps themselves.
Start Maximized — will start the program in maximized screen mode
Root Folder — Use this to change the database root directory. Sitemap Generator keeps all its data as subfolders in this folder. Pro tip: change this if you want to play around with different versions of the same projects or if you want to move things between computers. It may be faster than Import/Export (but I never tested it).
Clear Cache on Close — this controls the default behavior (clearing the cache on close). By default the tool always erases the cached files when you click the “close” button and re-downloads the target pages on the next “open”. If you change this setting, the cache will remain intact. That speeds things up when you experiment with settings, but when there are updates on the target site, changes can get missed. You can always clean the cache manually (from the respective Clear Cache button on the “Status” tab).
The tool comes with some limitations that don't prevent it from working and can be lifted with a few clicks and a few bucks:
Most of those are there to "protect" you from mistakes. If you start too many threads with no delay between pages, you'll put real pressure on the server. Are you sure you know what you are doing? Are you sure you aren't using my software in some wrong way?
I wanted to make it harder for end users to make a mistake (and get their IP banned from the target site, for example), and at the same time I didn't want to limit the tool... So I decided that if this is a "paid" feature, it removes any responsibility from me. If you paid to get this working, and then you download too much from some server and get banned, it's not my responsibility, because you are a grown-up and even donated for the tool to get this feature...
Pontius Pilate, anyone...
Anyway, some of the "locked" features are here as a "hint" -- if you are a pro and need to scan such big sites and so on, well, buy me a coffee to keep me in the loop.
Feel free to "unlock" the "locked" features by donating few bucks (the button above). You'll get an unlock code that you can put in the respective window ("Help | Unlock...") and will get the restrictions lifted as follows:
If things are not going according to plan, try the "debug" mode.
To do that, open the project and click the "debug" button on top-right.
You'll see "Debug View" tab next to the main log. (Both in green above).
This "debug" window will show you 2 things: currently found redirects and the current list of waiting links in the buffer.
This can help you if the tool doesn't produce the expected results, at least to see what's going on, but at the same time will make the crawling process slower.
Don't use it unless something doesn't work.
Also, at the bottom-right, you'll find the "View Robots.txt" button. As one can expect, this will download and show you... you guessed it, the target site's "robots.txt" file.
I spent almost a month, full time, rewriting Sitemap Generator from scratch. This time it is SEO-oriented and works much better than all the alternative desktop solutions I've tested. Still, I am only one man, with no time or resources for long and tedious testing. I'm sure there are bugs, and I'll try to fix them in the future, if there is interest in this tool at all. Time and donations will tell ;-)
Meanwhile...
Please, accept and understand that:
[DISCLAIMER OF WARRANTIES] THIS SOFTWARE PRODUCT IS PROVIDED "AS-IS". NO WARRANTY OF ANY KIND IS EXPRESSED OR IMPLIED. YOU (RECIPIENT) USE THIS SOFTWARE AT YOUR (RECIPIENT'S) OWN RISK.
Last Updated: April 4, 2022
Author: Vlad Hristov