Sitemap Generator


 

 

Sitemap Generator - HELP

For details on how to use Sitemap Generator, read the sections below.

 

 

Quick Guide

To build your first sitemap, follow the steps below:

  1. Enter your domain name
  2. Click the "Create" button

    [screenshot: sitemap generator step 1]

    (NOTE: enter only the domain name, without the http:// prefix – for example, wonderwebware.com or www.wonderwebware.com)

  3. Then click the “Run” button and wait...

    [screenshot]

  4. Depending on how many pages your site has, the program may need anywhere from a few minutes to several hours to extract all the links.

    Be aware that the maximum speed is about 200 pages per minute, so a site with thousands of pages can take a long time to download (at that rate, 10,000 pages work out to roughly 50 minutes, and real crawls are usually slower).

    For beginners: please leave the default settings. The sitemap generator will use only a small fraction of your bandwidth, so it won't overload your connection and you can safely multitask while the app is working.


  5. [screenshot]

     

  6. When the entire site is processed, the tool will let you know. Now, click the "Final Sitemap" tab.

    [screenshot]

  7. Then click "Save XML Sitemap" to export the sitemap...

    [screenshot]


 

 

 

Sitemap Projects

Here is an overview of how it works and what's where:

1) New Sitemap Project

To create a sitemap, use the "New Project" wizard or enter the domain name in the URL box and click "Create" (as shown above). The "New" button gives you a more "organized" way to set up the project; the quick way (typing the domain in the URL field) applies the default settings.

1.1) Base URL + Create

[screenshot: sitemap generator step 1]

This way is quick and uses system defaults (you can set the default project options in the "Settings" screen). Once you click "Create", you can adjust the settings from the "Project Options" button and/or just click "Run" to crawl the target site.

Hint: take a look at the bottom-left and adjust the number of threads and the delay between pages.

1.2) New Project Wizard

The New Project button will open the following screen:

The main reason to use this instead of creating the project from the Base URL field is the option to "test" the base URL. As in the screenshot above, the "Test" button will show you some information: how the different entry-point variants are redirected (or not), plus some initial hints on whether this is OK from an SEO point of view.

All in all, use this option if you prefer a more "organized" way to set up the project...

 


 

2) Project Options

If you use the New Project wizard, you can adjust these options on steps 2 and 3. If you start the project directly from the Base URL field, you can change the preloaded defaults by clicking the Project Options button:

[screenshot: Sitemap Project Options]

 

In all cases, these are the main options you can set for the sitemap project:

2.1) General / Crawler Settings

Here you set the general spider behavior:

[screenshot: general sitemap options]

  • Number of Threads -- or how many spiders will crawl the site. By default, the Sitemap Generator will use 4.

    If you want to be more "polite" and careful with the target server, set "1" here so only one spider will crawl the site.

    If you set more, you'll get somewhat faster crawling, but this will put more pressure on the server.

    From my tests, more than 10 parallel threads are a bad idea. Don't do it if you are unsure how the server will react; check with your webmaster what the firewall rules are, etc.
    Otherwise, you can get your IP banned on the server side (depending on the security features of the server).
    The default of 4 threads plus a 1-second delay means the spiders will request about 50-80 pages per minute, which should be the top of the safe zone.

    Besides, I don't use more than 10-12 threads even on my fastest sites: each spider uses memory and CPU time, so the law of diminishing returns kicks in and makes higher numbers useless.
  • Delay between pages -- or how many milliseconds each spider will wait before downloading the next page.
    1000 ms = 1 second (which is the default value).

    If you set 1 crawler and a 1000 ms delay, the theoretical maximum will be under 30 pages downloaded per minute (there is a rough throughput sketch at the end of this section).

    It gets more complicated when you start several threads, because they share the cache and start from different "start pages". But 2 to 4 threads with a 1000 ms delay should be safe.

    Again: it depends on your server setup. For example, my site handles 12 threads with a 50 ms delay pretty well, but I use my own Linode VPS with Nginx, Cloudflare cache, etc.
  • User Agent -- I don't know why one would want to change this, but it was one of the user-requested features in the old version, so I preserved it here. If you don't know why you are changing the user agent, don't change it.

Note that you can change the number of spiders and the delay for the currently open project "on the fly". See the controls in the bottom-left corner (before clicking the "Run" button). Even if the project options say something different, these values take precedence and will be used for the current "Run".
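To get a feeling for these numbers, here is a rough back-of-the-envelope sketch (plain Python, not part of the tool). The average download time per page is an assumption -- around one second is typical for an ordinary page, but it varies a lot per site -- and the results are theoretical upper bounds; the real rate is lower because of caching, slower pages and the postprocessing.

# Rough crawl-rate estimate: each spider repeatedly downloads a page and then waits.
# avg_download_ms is an assumption (about 1 second per page); real values vary per site.
def pages_per_minute(threads: int, delay_ms: int, avg_download_ms: int = 1000) -> float:
    seconds_per_page = (delay_ms + avg_download_ms) / 1000.0
    return threads * 60.0 / seconds_per_page

print(pages_per_minute(1, 1000))   # ~30 pages/min  -- one "polite" spider
print(pages_per_minute(4, 1000))   # ~120 pages/min -- default settings, upper bound (50-80 in practice)
print(pages_per_minute(10, 50))    # ~570 pages/min -- a lot of pressure on the target server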

2.2) Start Pages


This option is mainly useful in two scenarios:

  • If different sections of the site are not well interlinked (which is bad SEO). For example, if there are no internal links between the "main" and "blog" sections of your site, you can add the URL of the blog section as an extra start page. This still looks more like two separate sites, though, and I don't see how it will help Google treat these unconnected sections as one site...
  • If you use more threads, it's a good idea to set different starting points for the different threads; that should make the whole crawling process faster. By default, the tool will extract all the links from the front page and use them as starting pages anyway, so when you run the crawl there will be something for the rest of the crawlers.

So, the "Start Pages" option is here for anyone who needs it, but it's better to make your site more readable for search engines. Organize things so all the important sections are findable from the front page. 
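For example, if the "blog" and "freeware" sections of a site are poorly linked from the front page, the Start Pages box could contain something like this (the URLs are just an illustration):

http://mysite.com/
http://mysite.com/blog/
http://mysite.com/freeware/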

2.3) Exclude Patterns

Since version 0.95 of Sitemap Generator, you can set "exclude patterns". An exclude pattern looks like this:

*blog/*
*freeware*

In the above example, we tell the crawler to avoid any page whose URL contains the text "blog/" or "freeware". Pages like "http://wonderwebware.com/blog/index.html" or "/freeware/index.html" will not be added to the sitemap.

The asterisk ( * ) tells the crawler that anything (or nothing) can stand in its place. So if the pattern is *blog/*, then all of these will be excluded:

www.mysite.com/blog/index.html
mysite.com/blog/page.html
/blog/index.html
/blog/
blog/

Note that you can set the exclude pattern in a different form, for example:

http://mysite.com/nofollow*

but in this case, if the link in a given HTML page looks like "/nofollow/program.html", it will not be skipped (because this exclude pattern requires the full domain URL to be present in the link). So, to exclude all pages from the /nofollow/ folder, you should use a different pattern:

*nofollow/*

Now, no matter how the link is written in the HTML page, if it contains the text "nofollow/" it will be skipped.
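If you want to check in advance what a pattern will catch, the wildcard behavior described above is essentially shell-style globbing. Here is a minimal, hypothetical Python sketch of the matching logic -- an illustration of the rules above, not the tool's actual source code:

from fnmatch import fnmatch

def is_excluded(url, exclude_patterns):
    # A link is skipped as soon as it fits any exclude pattern.
    return any(fnmatch(url, pattern) for pattern in exclude_patterns)

patterns = ["*blog/*", "*freeware*"]
print(is_excluded("http://wonderwebware.com/blog/index.html", patterns))     # True  -> skipped
print(is_excluded("/freeware/index.html", patterns))                         # True  -> skipped
print(is_excluded("/nofollow/program.html", ["http://mysite.com/nofollow*"]))
# False -> not skipped: the relative link does not contain the full domain URL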

 

Default "Exclude Patterns"

By default, the tool will exclude a number of file types from the sitemap: image files, exe files, etc. If you want to include such links in the final sitemap, remove the respective pattern from the list.

 

2.4) Must-Follow Patterns

[screenshot: sitemap must-follow patterns]

 

You can also set "must-follow patterns". A must-follow pattern looks like this:

*blog/*

In the above example, we tell the crawler to crawl only pages whose URL contains the text "blog/". In that case, only pages like "http://wonderwebware.com/blog/index.html" or "/blog/index.html" will be added to the sitemap; pages that do not include the text "blog/" in the URL will be skipped. As before, the asterisk ( * ) means that anything (or nothing) in its place fits the pattern, so if the pattern is *blog/* then all of these will be crawled and included:

www.mysite.com/blog/index.html
mysite.com/blog/page.html
/blog/index.html
/blog/
blog/

Note that you can set the pattern in a different form, for example:

http://mysite.com/nofollow*

but in this case, if the link in a given HTML page looks like "/nofollow/program.html", it will not be crawled (because this must-follow pattern requires the full domain URL to be present in the link). So, to crawl all pages from the /nofollow/ folder, you should use a different pattern:

*nofollow/*

Now, no matter how the link is written in the HTML code, if it contains the text "nofollow/" it will be crawled and added to the sitemap.

IMPORTANT NOTE: If you set a must-follow pattern, you must add a start page that fits the pattern in the "Start Pages" box. For example, if we have the must-follow pattern below:

*blog*

We must add a page that fits that pattern to the Start Pages box -- in our example, the main /blog/ page:

http://mysite.com/blog/

Otherwise we will get nothing: if we just leave the default start page (www.mysite.com), no link will be followed, because the very first page visited by the spider will not fit the must-follow pattern.
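Putting the two pattern types together, the decision for every link found boils down to: it must fit at least one must-follow pattern (when any are defined) and must not fit any exclude pattern. A small, hypothetical Python sketch of that rule -- again just an illustration, including why the start page itself has to fit the pattern:

from fnmatch import fnmatch

def should_crawl(url, must_follow, exclude):
    # When must-follow patterns are defined, the URL has to fit at least one of them...
    if must_follow and not any(fnmatch(url, p) for p in must_follow):
        return False
    # ...and in any case it must not fit an exclude pattern.
    return not any(fnmatch(url, p) for p in exclude)

# With the must-follow pattern "*blog*" the default start page is rejected,
# which is why a fitting start page such as http://mysite.com/blog/ is needed.
print(should_crawl("http://www.mysite.com/", ["*blog*"], []))    # False
print(should_crawl("http://mysite.com/blog/", ["*blog*"], []))   # True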


 

 

 

3) Run / Stop Crawlers

So far you have created a sitemap project and set the desired crawling speed (threads & delay). Now click "Run" and wait while the spiders crawl your site.

 

[screenshot: run sitemap generator]

 

on "Run" button clicked:


1) the bot downloads the "Base URL" (the first start page)
2) it extracts the links from that page and adds them to the "start pages"
3) it starts all the threads (bots, spiders, crawlers) with those start pages
4) because each spider is independent, they will often find the same pages; here the cache comes into play -- when one of the threads has downloaded a page, the rest will take it from the cache
5) when a thread finishes and there are more "start pages" left, it restarts
6) when all spiders have finished their job, the postprocessor starts
7) the "postprocessor" checks the SEO requirements -- things such as redirects, "nofollow" tags, canonicals...
8) to prevent dead loops, there is a hard timeout (when no new pages have been downloaded for a few minutes, the crawlers are stopped)
9) the tool reports all major info in the "Status Log" screen
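For readers who think in code, here is a deliberately simplified, hypothetical Python sketch of that workflow (shared queue of start pages, shared cache so no page is downloaded twice, per-spider delay, crude idle timeout). It is not the tool's implementation, the URL is a placeholder, and the SEO postprocessing is skipped entirely:

import queue
import threading
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(base_url, threads=4, delay_ms=1000, max_pages=200):
    cache = {}                      # url -> html, shared between all spiders
    lock = threading.Lock()
    todo = queue.Queue()            # the "start pages" / waiting-links buffer
    todo.put(base_url)
    host = urlparse(base_url).netloc

    def spider():
        while True:
            try:
                url = todo.get(timeout=30)      # crude stand-in for the idle timeout
            except queue.Empty:
                return                          # nothing new for a while -> stop
            with lock:
                if url in cache or len(cache) >= max_pages:
                    continue                    # another spider already took this page
                cache[url] = ""                 # reserve it so the others skip it
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                html = ""                       # download errors end up as empty pages here
            with lock:
                cache[url] = html
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == host:   # stay on the target site
                    todo.put(absolute)
            time.sleep(delay_ms / 1000.0)       # the per-spider delay between pages

    spiders = [threading.Thread(target=spider) for _ in range(threads)]
    for s in spiders:
        s.start()
    for s in spiders:
        s.join()
    return sorted(cache)    # the real tool would now run the SEO postprocessor

if __name__ == "__main__":
    for page in crawl("http://mysite.com/", threads=2, delay_ms=1000, max_pages=20):
        print(page)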

 

[screenshot: stop sitemap generator]


When you click "Stop":

1) all running spiders will receive the "end" command; allow some time for the threads to stop
2) the postprocessor will start working with what has been collected so far
3) a sitemap will be produced, but it may be partial
4) click "Stop" again to stop the postprocessing too; in that case the partial sitemap will not even be SEO-optimized -- don't use it at all


 

 

 

4) Analyze Results & Save the Sitemap

Once the crawlers finish their job and the SEO postprocessing ends, you'll get the sitemap.

The ready-to-use sitemap is in the "Final Sitemap" tab. You can save the XML sitemap, upload it to your server and submit it to Google.

If you want to know the SEO reasoning and how to use the information in the other tabs, read below.

4.1) Total Pages Found

[screenshot: total URLs found]


This is the list of all URLs found by the spiders before the search engine optimization part. The SEO "postprocessor" will remove some of these.

At this stage, the exclude patterns and must-follow patterns are already in place, so only acceptable (pattern-fitting) pages should be listed. Also, anything excluded by the "robots.txt" rules should already be excluded here.

At this step, the so-called "postprocessor" kicks in. It will apply some SEO rules to produce the final sitemap.

4.2) Excluded URLs

[screenshot: excluded from sitemap]

In this tab you can see why the postprocessor removed some pages from the final sitemap. It tries to follow the best SEO practices:

  • non-HTML results are avoided

  • "noindex" pages are removed (where noindex is set with an HTML tag, not by robots.txt rules)

  • even if the "noindex" tag is only for Googlebot, it's still a "no go"

  • "non-canonical" pages are pages that point to another (canonical) page, so the canonical version is the one that goes into the sitemap

  • "not found" pages (server errors, internet connection errors...) -- in general there should be no pages here, so check things out if there are

4.3) Issues & Notes

[screenshot: seo issues & notes]

In this tab, I've put some "extra" tips -- pages kept in the sitemap, but with SEO issues. Those should be self-evident, but here is the reasoning behind these notes:

  • "orphan" canonical -- I made up this term to describe pages that was found as "canonical" while crawling the site, but were not downloaded. Some false positives could sneak here, so don't worry if there is one or two of these. Only take care of this if you see too many of these. Then you may want to check how are your pages canonicalized.

  • missing canonical tag -- Being "SEO aware", Sitemap Generator will warn you about these. You know the drill: If you don't canonicalize your site, Google will, but you may not like the result ;-). Go and fix this by adding proper "canonical" tags to your pages.
  • multiple canonical -- this is bad (from the SEO perspective). A page should only have ONE canonical version. Fix this or you risk search engines getting bad feelings about your site.
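For reference, the tags these checks look at are the standard ones placed in a page's <head>. A generic example (the URL is made up):

<link rel="canonical" href="https://www.mysite.com/blog/post.html">
<meta name="robots" content="noindex">
<meta name="googlebot" content="noindex">

The first one tells search engines which URL is the "official" version of the page; either of the last two keeps the page out of the index, which is why such pages are dropped from the sitemap.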

4.4) Final Sitemap

[screenshot: final xml sitemap]


Well, as the name suggests, this is the final, SEO-processed sitemap, ready for export.

To get it in your preferred format, click the respective button in the bottom-right corner:

  • Save XML Sitemap -- to get the "industry-standard" XML version. Note that I didn't bother to add optional attributes like "last updated" (lastmod) or "refresh rate" (changefreq)... We know Google doesn't use those.

  • Save TXT Sitemap -- will export the URL list in plain text

  • Save as HTML -- please note that this (HTML) version is NOT to be submitted to search engines. It will export all URLs as "<a href="url">url</a>" links.

Click the Save to XML button, save the sitemap, upload it to the server and submit it to Google, Yahoo, etc.

Oh, yes, don't forget to add a link to the live sitemap in your "robots.txt" file.
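For reference, a standard XML sitemap is essentially a list of <loc> entries; since this tool skips the optional attributes, the exported file should look roughly like this (the URLs are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.mysite.com/</loc>
  </url>
  <url>
    <loc>https://www.mysite.com/blog/</loc>
  </url>
</urlset>

And the robots.txt reference is a single extra line pointing to the uploaded file, for example:

Sitemap: https://www.mysite.com/sitemap.xml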


 

 

 

5) Delete, Import, Export Projects

These are a few "maintenance" functions you may find useful.

5.1) Delete Project

To remove a project from the database (i.e. delete all related files):

  1. Open the project

  2. Click the "Project Options" button

  3. Click the "Delete" button on the bottom-left

  4. Confirm removal

     

5.2) Export Project (Save As...)

The Import/Export functionality is in the "Data & Tools" menu:

[screenshot: import export project]

 

  • Save Status Log will save the content of the "Status Log" window.

  • Export Including Cache -- this will save everything in a file "as is", including all the cached files.

  • Clear Cache and Export will first clear the cache and then export only the options and current state.

 

5.3) Import Project

  • Import Project does the reverse of "Export" and will import a project from a file. Note that if you import with the cache, it can take a while. Also, to use this, close any open project first. If the same project already exists, the import will fail.


 

 

 

Settings

To change default project options and program settings, click the "Settings" button.

[screenshot: settings]

 

Default Project Settings

[screenshot: default project settings]

 

Default settings are used when you create a new sitemap project. They don't affect already created projects, only future ones.

  • User Agent — or how the Sitemap Generator presents itself to the server.

    This sets the "user-agent" field of the HTTP request headers sent by the crawler (see the small example after this list).

  • Number of threads — or how many spiders will scan the site in parallel

    Don't put too big a number here. Each additional crawler uses system resources (memory, CPU), and beyond a certain point it will start degrading performance instead of speeding up the crawl. In my tests, 10-12 threads are the maximum useful number; the default of 4 is on the safe side.

    Be sure to check with your webmaster how the server will react to a big number of requests if you decide to start many parallel bots all requesting pages at the same time.

  • Delay between pages — or how long the crawlers will wait before requesting the next page.

    Note that this is per crawler, so if you have 4 threads in parallel with a 1000 ms delay for each one, the theoretical delay (from the server's point of view) will be something like 250 ms.

    As a rule of thumb: don't lower the delay if you use more threads. 10 threads with a 50 ms delay, for example, will push the theoretical request rate to around 10 pages per second. Please be careful about the pressure you put on the target site.

  • Timeout — or how long the tool will wait for the crawlers if no new pages are found.

    This is a safety measure. If there is some strange never-ending redirection loop, if the internet connection drops, or if too many spiders are doing nothing but processing the same old pages from the cache, we want to stop wasting time and proceed. In general, if nothing happens for 3 minutes (the default value), the tool will stop crawling.
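About the user agent mentioned above: changing it only changes the User-Agent header sent with every HTTP request. A generic Python illustration (not the tool's code; the URL and the user agent string are made up):

import urllib.request

request = urllib.request.Request(
    "http://mysite.com/",                                    # placeholder URL
    headers={"User-Agent": "SitemapGenerator-Example/1.0"},  # what the server will see
)
with urllib.request.urlopen(request, timeout=10) as response:
    print(response.status, "-", len(response.read()), "bytes")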


 

Program Settings

 

These are the app settings not related to the sitemaps themselves.

  • Start Maximized — will start the program in maximized screen mode

  • Root Folder — Use this to change the database root directory. Sitemap Generator keeps all its data as subfolders of this folder. Pro tip: change this if you want to play around with different versions of the same projects or if you want to move things between computers. It may be faster than Import/Export (but I have never tested it).

  • Clear Cache on Close — this will prevent the default behavior (clearing the cache on close). By default, the tool always erases the cached files when you click the "Close" button and re-downloads the target pages on the next "Open". If you check this box, the cache will remain intact. This speeds things up when you experiment with settings, but if the target site is updated, changes can be missed. You can always clean the cache manually (with the respective Clear Cache button on the "Status" tab).


 

Limitations


The tool comes with some limitations that don't prevent it from working and can be lifted with a few clicks and a few bucks:

  • Max Threads
  • Min.Delay
  • Max. start pages
  • Max URLs crawled
  • User Agent

Most of these are there to "protect" you from mistakes. If you start too many threads with no delay between pages, you'll put pressure on the server. Are you sure you know what you are doing? Are you sure you aren't using my software in the wrong way?

I wanted to make it harder for end users to make a mistake (and get their IP banned from the target site, for example), and at the same time I didn't want to limit the tool... So I decided that making this a "paid" feature removes any responsibility from me. If you paid to get this working and then you download too much from some server and get banned, it's not my responsibility -- you are a grown-up and even donated for the tool to get this feature...

Pontius Pilate, anyone...

Anyway, some of the "locked" features are there as a "hint" -- if you are a pro and really need to scan big sites and the rest, well, buy me a coffee to keep me in the loop.


 

Donate & Unlock (Remove Limitations)

Feel free to "unlock" the "locked" features by donating a few bucks (the button above). You'll get an unlock code that you can enter in the respective window ("Help | Unlock...") to get the restrictions lifted as follows:

  • Unlimited number of threads
  • Min.delay between downloaded pages down to 100ms (theoretical maximum of 10 pages per second)
  • Unlimited start pages
  • 50,000 pages crawl limit (which is the maximum for a single-file sitemap)
  • The ability to change the User Agent.


 

Debug Mode


If things are not going according to plan, try the "debug" mode.

To do that, open the project and click the "Debug" button in the top-right corner.

You'll see a "Debug View" tab next to the main log.

This "debug" window will show you 2 things: currently found redirects and the current list of waiting links in the buffer.

This can help when the tool doesn't produce the expected results -- at least you can see what's going on -- but it will also make the crawling process slower.

Don't use it unless something doesn't work.

Also, at the bottom-right, you'll find the "View Robots.txt" button. As you might expect, this will download and show you... you guessed it, the target site's "robots.txt" file.


 

Final Notes

I spent almost a month, full time, rewriting Sitemap Generator from scratch. This time it is SEO-oriented and works much better than all the alternative desktop solutions I've tested. Still, I am only one man, with no time and resources for long and tedious testing. I'm sure there are bugs, and I'll try to fix them in the future if there is any interest in this tool at all. Time and donations will tell ;-)

Meanwhile...

Please, accept and understand that:

[DISCLAIMER OF WARRANTIES] THIS SOFTWARE PRODUCT IS PROVIDED "AS-IS". NO WARRANTY OF ANY KIND IS EXPRESSED OR IMPLIED. YOU (THE RECIPIENT) USE THIS SOFTWARE AT YOUR (THE RECIPIENT'S) OWN RISK.


Last Updated: April 4, 2022

Author: Vlad Hristov