Sitemap generator application

You can download the source code from our repo or download the portable program.

What is a Sitemap?

A sitemap is a set of URLs pointing to a website's pages. The site owner should list the most important page URLs together with extra information such as a priority number, the last modification date, and the change frequency. As stated in the Sitemap protocol description:

Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.
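For reference, a minimal sitemap document carrying the optional fields mentioned above might look like this (the URL is a placeholder):

   <?xml version="1.0" encoding="UTF-8"?>
   <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
         <loc>https://www.example.com/</loc>
         <lastmod>2020-01-01</lastmod>
         <changefreq>monthly</changefreq>
         <priority>0.8</priority>
      </url>
   </urlset>

Only <loc> is required; <lastmod>, <changefreq>, and <priority> are the optional hints.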

How to create a sitemap?

There are several ways to make a sitemap:

  • Add URLs to an XML document manually
  • Collect URLs recursively with a web client and an HTML parser, then write them to an XML document
  • Use a plugin or create an endpoint that returns an XML document (the implementation differs on each web platform)

Afterwards, it is necessary to submit the sitemap in a search console; otherwise search engines won't register it.
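Besides submitting in a search console, the Sitemap protocol also lets crawlers discover the file through a line in robots.txt, for example:

   Sitemap: https://www.example.com/sitemap.xml

The URL above is a placeholder; the line should point at the sitemap's real location on your domain.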

.NET Core C# application for generating a sitemap XML document

With this application it is possible to create a sitemap for a website with an unlimited number of pages.

The application crawls the website by following the links (a tags) found on each page. This means that if a specific page is not linked from any crawled page, its URL won't be added to the sitemap.

Structure of the sitemap generator

The full source code is kept in our repo. Here is a simplified explanation.
There are two main functional parts:

Crawl management
  1. go to the homepage and extract links from its html
  2. save the visited link; save the found inner links (excluding visited ones)
  3. go to every inner link and extract links from its html
  4. save the visited link; save the found inner links (excluding visited ones)
  5. repeat the cycle until the set of found links minus visited links is empty
   List<string> new_urls = new List<string>();
   List<string> visited = new List<string>();
   new_urls.Add(BaseUrl);  //first url
   do
   {
      List<string> hrefs = new List<string>();
      foreach (var url in new_urls)
      {
         string text = await _loader.Get(url);
         visited.Add(url);  //mark page as visited
         if (string.IsNullOrEmpty(text)) continue;

         List<string> meta = Parser.GetAHrefs(text).Distinct().ToList();  //getting list of links
         Parser.Normalize(Domain, url, ref meta);
         if (Exclude)  //option to exclude query from url
            meta = meta.Select(u => u.Contains('?') ? u.Split('?')[0] : u).ToList();
         hrefs.AddRange(meta);  //collect links found on this page
      }
      hrefs = hrefs.Distinct().ToList();
      new_urls = hrefs.Except(visited).ToList();  //excluding visited pages
   } while (new_urls.Count != 0);
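The crawl loop only collects URLs; writing the sitemap document itself is a separate step. A minimal sketch of that step, assuming the `visited` list from the loop above and System.Xml.Linq (this is an illustration, not the application's actual output code; here every page simply gets today's date as lastmod):

   XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
   var urlset = new XElement(ns + "urlset",
      visited.Select(u => new XElement(ns + "url",
         new XElement(ns + "loc", u),
         new XElement(ns + "lastmod", DateTime.UtcNow.ToString("yyyy-MM-dd")))));
   new XDocument(new XDeclaration("1.0", "utf-8", null), urlset)
      .Save("sitemap.xml");  //write the finished sitemap to disk

The optional fields (priority, change frequency) would be added as further XElement children of each url node, depending on the application's settings.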
HTML parsing

HtmlAgilityPack is used for HTML parsing.

   public static IEnumerable<string> GetAHrefs(string text)
   {
      HtmlDocument document = new HtmlDocument();
      document.LoadHtml(text);  //parse the downloaded html
      var tags = document.DocumentNode.SelectNodes(".//a");
      if (tags == null) yield break;  //page contains no links
      foreach (var tag in tags)
      {
         string href = tag.GetAttributeValue("href", string.Empty);
         if (!string.IsNullOrEmpty(href))
            yield return href;
      }
   }
Application Interface

The application takes the following inputs:

  • Homepage URL of the site
  • Domain name of the site
  • Whether to include optional parameters in the sitemap
  • Whether to clear URLs of query parameters

