Announcing SitemapGen4j 1.0

Redfin is happy to announce SitemapGen4j 1.0. SitemapGen4j is a library to generate XML sitemaps in Java.

Download SitemapGen4j 1.0

What’s an XML sitemap?

Quoting from sitemaps.org:

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.

Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.

Sitemap 0.90 is offered under the terms of the Attribution-ShareAlike Creative Commons License and has wide adoption, including support from Google, Yahoo!, and Microsoft.
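
To make that description concrete, here is a minimal sketch that assembles a one-URL sitemap by hand with plain string concatenation. The class and method names (MinimalSitemapSketch, urlEntry, sitemap) are made up for illustration and are not part of SitemapGen4j; the element names and namespace are the ones defined by the Sitemap 0.90 protocol. It does no escaping, splitting, or validation, which is a large part of why you would reach for a library instead.

```java
final class MinimalSitemapSketch {
    /** Renders one <url> entry with the optional metadata described above (no escaping is done). */
    static String urlEntry(String loc, String lastMod, String changeFreq, double priority) {
        return "  <url>\n"
            + "    <loc>" + loc + "</loc>\n"
            + "    <lastmod>" + lastMod + "</lastmod>\n"
            + "    <changefreq>" + changeFreq + "</changefreq>\n"
            + "    <priority>" + priority + "</priority>\n"
            + "  </url>\n";
    }

    /** Wraps already-rendered entries in the <urlset> root element. */
    static String sitemap(String... entries) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String entry : entries) {
            xml.append(entry);
        }
        xml.append("</urlset>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        String entry = urlEntry("http://www.example.com/index.html", "2009-02-07", "hourly", 1.0);
        System.out.print(sitemap(entry));
    }
}
```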

Getting started

The easiest way to get started is to just use the WebSitemapGenerator class, like this:

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
wsg.addUrl("http://www.example.com/index.html"); // repeat multiple times
wsg.write();

Configuring options

But there are a lot of nifty options available for URLs and for the generator as a whole. To configure the generator, use a builder:

WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .gzip(true).build(); // enable gzipped output
wsg.addUrl("http://www.example.com/index.html");
wsg.write();

To configure the URLs, construct a WebSitemapUrl with WebSitemapUrl.Options.

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
WebSitemapUrl url = new WebSitemapUrl.Options("http://www.example.com/index.html")
    .lastMod(new Date()).priority(1.0).changeFreq(ChangeFreq.HOURLY).build();
// this will configure the URL with lastmod=now, priority=1.0, changefreq=hourly 
wsg.addUrl(url);
wsg.write();

Configuring the date format

One important configuration option for the sitemap generator is the date format. The W3C datetime standard allows you to choose the precision of your datetime (anything from just specifying the year like “1997” to specifying the fraction of the second like “1997-07-16T19:20:30.45+01:00”); if you don’t specify one, we’ll try to guess which one you want, and we’ll use the default timezone of the local machine, which might not be what you prefer.

// Use DAY pattern (2009-02-07), Greenwich Mean Time timezone
W3CDateFormat dateFormat = new W3CDateFormat(Pattern.DAY); 
dateFormat.setTimeZone(TimeZone.getTimeZone("GMT"));
WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .dateFormat(dateFormat).build(); // actually use the configured dateFormat
wsg.addUrl("http://www.example.com/index.html");
wsg.write();
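
If the precision levels feel abstract, the following sketch formats one instant at several granularities and in two timezones. It uses only java.time from the JDK, not W3CDateFormat, and the class name and constants are hypothetical; treat it as an illustration of the W3C datetime shapes, not of how SitemapGen4j formats dates internally.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class W3cPrecisionSketch {
    // Four of the granularities allowed by the W3C datetime note.
    static final DateTimeFormatter YEAR = DateTimeFormatter.ofPattern("uuuu");
    static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("uuuu-MM-dd");
    static final DateTimeFormatter MINUTE = DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mmXXX");
    static final DateTimeFormatter SECOND = DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ssXXX");

    /** Shifts the instant into the given offset, then formats it at the chosen precision. */
    static String format(OffsetDateTime when, DateTimeFormatter precision, ZoneOffset zone) {
        return when.withOffsetSameInstant(zone).format(precision);
    }

    public static void main(String[] args) {
        // Half past midnight in a +01:00 zone is still the previous day in UTC.
        OffsetDateTime when = OffsetDateTime.parse("1997-07-16T00:30:00+01:00");
        System.out.println(format(when, DAY, ZoneOffset.ofHours(1))); // 1997-07-16
        System.out.println(format(when, DAY, ZoneOffset.UTC));        // 1997-07-15
        System.out.println(format(when, SECOND, ZoneOffset.UTC));     // 1997-07-15T23:30:00Z
    }
}
```

The same instant lands on a different calendar day depending on the timezone, which is why it is worth setting the timezone explicitly instead of relying on the machine default.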

Lots of URLs: a sitemap index file

One sitemap can contain a maximum of 50,000 URLs. (Some sitemaps, like Google News sitemaps, can contain only 1,000 URLs.) If you need to put more URLs than that in a sitemap, you’ll have to use a sitemap index file. Fortunately, WebSitemapGenerator can manage the whole thing for you.

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
for (int i = 0; i < 60000; i++) wsg.addUrl("http://www.example.com/doc"+i+".html");
wsg.write();
wsg.writeSitemapsWithIndex(); // generate the sitemap_index.xml

That will generate two sitemaps for 60K URLs: sitemap1.xml (with 50K URLs) and sitemap2.xml (with the remaining 10K), and then generate a sitemap_index.xml file describing the two.
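
The arithmetic behind that split is simple enough to sketch on its own. The code below is not SitemapGen4j's implementation, just an illustration of chunking a URL list at the 50,000 limit; the class name, constant, and split method are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class SitemapSplitSketch {
    static final int MAX_URLS_PER_SITEMAP = 50_000;

    /** Splits urls into consecutive chunks of at most maxPerFile entries each. */
    static List<List<String>> split(List<String> urls, int maxPerFile) {
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < urls.size(); start += maxPerFile) {
            int end = Math.min(start + maxPerFile, urls.size());
            chunks.add(urls.subList(start, end)); // a view onto the original list, not a copy
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 60_000; i++) {
            urls.add("http://www.example.com/doc" + i + ".html");
        }
        List<List<String>> chunks = split(urls, MAX_URLS_PER_SITEMAP);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println("sitemap" + (i + 1) + ".xml: " + chunks.get(i).size() + " URLs");
        }
    }
}
```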

It’s also possible to carefully organize your sub-sitemaps. For example, it’s recommended to group URLs with the same changeFreq together (have one sitemap for changeFreq “daily” and another for changeFreq “yearly”), so you can modify the lastMod of the daily sitemap without modifying the lastMod of the yearly sitemap. To do that, just construct your sitemaps one at a time using the WebSitemapGenerator, then use the SitemapIndexGenerator to create a single index for all of them.

WebSitemapGenerator wsg;
// generate foo sitemap
wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .fileNamePrefix("foo").build();
for (int i = 0; i < 5; i++) wsg.addUrl("http://www.example.com/foo"+i+".html");
wsg.write();
// generate bar sitemap
wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .fileNamePrefix("bar").build();
for (int i = 0; i < 5; i++) wsg.addUrl("http://www.example.com/bar"+i+".html");
wsg.write();
// generate sitemap index for foo + bar 
SitemapIndexGenerator sig = new SitemapIndexGenerator("http://www.example.com", myFile);
sig.addUrl("http://www.example.com/foo.xml");
sig.addUrl("http://www.example.com/bar.xml");
sig.write();

You could also use the SitemapIndexGenerator to incorporate sitemaps generated by other tools. For example, you might use Google’s official Python sitemap generator to generate some sitemaps, and use WebSitemapGenerator to generate some sitemaps, and use SitemapIndexGenerator to make an index of all of them.
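
Whichever tools produce the individual sitemaps, the index itself is just a short XML document listing their locations. As a rough sketch of its shape (hand-rolled, with a hypothetical class name, and without the optional lastmod element or any escaping), it could be assembled like this:

```java
import java.util.Arrays;
import java.util.List;

final class SitemapIndexSketch {
    /** Renders a sitemap index that points at each of the given sitemap URLs. */
    static String index(List<String> sitemapUrls) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : sitemapUrls) {
            xml.append("  <sitemap>\n");
            xml.append("    <loc>").append(url).append("</loc>\n");
            xml.append("  </sitemap>\n");
        }
        xml.append("</sitemapindex>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.print(index(Arrays.asList(
            "http://www.example.com/foo.xml",
            "http://www.example.com/bar.xml")));
    }
}
```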

Validate your sitemaps

SitemapGen4j can also validate your sitemaps using the official XML Schema Definition (XSD). If you used SitemapGen4j to make the sitemaps, you shouldn’t need to do this unless there’s a bug in our code. But you can use it to validate sitemaps generated by other tools, and it provides an extra level of safety.

It’s easy to configure the WebSitemapGenerator to automatically validate your sitemaps right after you write them (but this does slow things down, naturally).

WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .autoValidate(true).build(); // validate the sitemap after writing
wsg.addUrl("http://www.example.com/index.html");
wsg.write();

You can also use the SitemapValidator directly to validate sitemaps. It has two methods: validateWebSitemap(File f) and validateSitemapIndex(File f).
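
For readers curious what XSD validation involves, here is a sketch using the JDK's own javax.xml.validation package. It is not SitemapValidator and makes no claim about how SitemapGen4j validates internally; the class name and the toy one-element schema are assumptions chosen to keep the example self-contained. A real check would load the official sitemap XSD instead.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class XsdValidationSketch {
    /** Returns true if xmlText validates against xsdText; false on any schema or document error. */
    static boolean isValid(String xsdText, String xmlText) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsdText)));
            schema.newValidator().validate(new StreamSource(new StringReader(xmlText)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Toy schema for illustration only: a single string-valued root element.
        String xsd = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
            + "<xs:element name=\"urlset\" type=\"xs:string\"/></xs:schema>";
        System.out.println(isValid(xsd, "<urlset>ok</urlset>"));
        System.out.println(isValid(xsd, "<unexpected/>"));
    }
}
```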

Google-specific sitemaps

Google can understand a wide variety of custom sitemap formats that they made up, including Mobile sitemaps, Geo sitemaps, Code sitemaps (for Google Code Search), Google News sitemaps, and Video sitemaps. SitemapGen4j can generate any or all of these different types of sitemaps.

To generate a special type of sitemap, just use GoogleMobileSitemapGenerator, GoogleGeoSitemapGenerator, GoogleCodeSitemapGenerator, GoogleNewsSitemapGenerator, or GoogleVideoSitemapGenerator instead of WebSitemapGenerator.

You can’t mix-and-match regular URLs with Google-specific sitemaps, so you’ll also have to use a GoogleMobileSitemapUrl, GoogleGeoSitemapUrl, GoogleCodeSitemapUrl, GoogleNewsSitemapUrl, or GoogleVideoSitemapUrl instead of a WebSitemapUrl. Each of them has unique configurable options not available to regular web URLs.

Discussion

  • http://blog.caffeinatedsoftware.com Robbie

    I’ll have to take a look at this. I just recently wrote dynamic sitemap generators for RPA’s sites and I was toying with creating a mobile sitemap. It seems a bit more cumbersome than the offline approach you seem to have taken (on the plus side, my sitemaps are always current).

    On Redfin, I notice you use gzipped sitemaps. Does this tool do that as well?

  • http://blog.redfin.com/devblog/2009/02/announcing_sitemapgen4j_10.html Dan

    Yes, SitemapGen4j supports gzipped sitemaps; just use .gzip(true) when configuring your Sitemap Generator. There’s an example embedded earlier in this blog post.

  • http://www.ernestojustiniano.org ernesto

    Hi, I have just seen your post and downloaded the archive.

    What do I do now? Please teach me where to put the archive, and give a complete example of the page to be written.

    • http://devblog.redfin.com/author/dan.fabulich Dan Fabulich

      Hi Ernesto,

      You can use SitemapGen4j the same way you would use any other Java jar library. However, using a Java library requires a lot more information than I can give you in this blog post comment. Here’s a good page explaining how to add a jar to your classpath on Windows XP and here’s another page explaining more generally how to write Java programs that use the classpath. Sorry I can’t be more help.

      Note that if you don’t need to build a very complicated sitemap, you may be happier using Google’s Python-based sitemap generator, which is easier to use for people who aren’t familiar with compiling Java applications using jar files.

  • http://www.ernestojustiniano.org ernesto

    TYVM, Dan

  • Bala

    Hi

    Does your class provide a way of excluding patterns?
    For example, I don’t want *.htm files in a certain place,
    http://www.mysite.com/xxx/**/*.htm, to be included in sitemap creation. Is it possible to exclude patterns?

    Thanks
    Bala

    • http://devblog.redfin.com/author/dan.fabulich Dan Fabulich

      Hi Bala… SitemapGen4j only includes URLs that you explicitly add using “addUrl”. You have complete control over which URLs are added.

      If you want to exclude certain URLs, just make sure your Java program doesn’t add them. If you want to automatically include an array of URLs but exclude just some of them, you can use a “for” loop to exclude specific URLs and add the rest.
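
A sketch of that filtering loop, assuming the exclusion rule is "any .htm page under /xxx/" expressed as a regular expression. The class, the pattern, and the keepAllowed helper are hypothetical, and the sketch collects URLs into a list so it runs without the library; in a real program the marked line is where the call to addUrl would go.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class UrlFilterSketch {
    // Hypothetical exclusion rule: any URL under /xxx/ that ends in .htm.
    static final Pattern EXCLUDED =
        Pattern.compile("http://www\\.mysite\\.com/xxx/.*\\.htm");

    /** Returns the candidate URLs that do not match the exclusion pattern. */
    static List<String> keepAllowed(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String url : candidates) {
            if (EXCLUDED.matcher(url).matches()) {
                continue; // excluded: never handed to the generator
            }
            kept.add(url); // with SitemapGen4j, this is where wsg.addUrl(url) would go
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
            "http://www.mysite.com/xxx/a/b.htm",
            "http://www.mysite.com/xxx/a/b.html",
            "http://www.mysite.com/yyy/c.htm");
        System.out.println(keepAllowed(candidates));
    }
}
```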

  • Raphael Carvalho

    Hi Dan !
    I downloaded SitemapGen4j, and it works.
    But now I need to know…
    In this app, is there a way to automatically collect all the URLs of a site? …
    … something like a spider or web crawler to build a sitemap.xml automatically.

    Thx a lot!

    • http://devblog.redfin.com/author/dan.fabulich Dan Fabulich

      SitemapGen4j does not crawl the web; if you had a web crawler, you could pass its URLs to SitemapGen4j to generate a sitemap.

      If you don’t have a list of URLs, you may prefer to use Google’s sitemap generator tool, which parses your web server logs to find URLs. http://googlesitemapgenerator.googlecode.com/svn/trunk/doc/gsg-installation.html

  • Raphael Carvalho

    Thanks by explain me.

  • Raphael Carvalho

    Thanks for explaining that*

  • Greg Georges

    Thanks for your great generator! One thing that I would change though, since we experienced this here, is that in your SitemapGenerator class you use a FileWriter, which by default does not use UTF-8 (it uses the platform default encoding); it should be changed to a FileOutputStream with the encoding set to UTF-8 (for example via an OutputStreamWriter). We had this problem when we saw Chinese characters were not written to the file :) Take care and thanks again

  • David

    I see Google has changed the format for News sitemaps. Is this now supported? It doesn’t seem to be, but I may be mistaken. The details of the new sitemap format are here: http://www.google.com/support/news_pub/bin/answer.py?answer=74288&cbid=sznwmrlbev7e&src=cb&lev=answer

    In particular, it seems that an element called publication is required.

  • Ed

    Seems like a great tool.

    Does it automatically do xml entity-escaping on sitemaps it creates, or is that to be done on our side?

    Thanks :)

    • http://devblog.redfin.com/author/dan.fabulich Dan Fabulich

      Unfortunately, we overlooked XML entity-escaping in v1. It’s filed as a bug; I’ll try to get around to it soon.
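
For anyone unsure what entity-escaping means here: the five characters that XML reserves need to be replaced with entities before a URL is placed inside a loc element. The helper below is a generic sketch with a hypothetical name, not code from SitemapGen4j. Whether to pre-escape URLs before handing them to the library depends on the version in use, since escaping twice would corrupt them.

```java
final class XmlEscapeSketch {
    /** Replaces the five XML-reserved characters with their predefined entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("http://www.example.com/view?a=1&b=2"));
    }
}
```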

  • Ed

    Oh yeah, another suggestion for future improvement: consider a method signature that returns the sitemap File object as opposed to insisting on writing it to disk.

    It would enable the avoidance of file i/o when dynamically generating sitemaps.

  • http://devblog.redfin.com/author/dan.fabulich Dan Fabulich

    @Ed, I don’t think I understand your feature. SitemapGen4j actually returns a list of files (since it needs to split the sitemap into multiple files when the sitemap gets too large).

    What are you imagining returning to the user? If it just returns the File objects without writing to them, it won’t actually *do* anything (except, perhaps, counting the URLs…?)

  • Tyler

    The Google News sitemap functionality does not support the required elements title and publication (and possibly others). This means it is not possible to use your library to create a valid Google News sitemap.

    The full list of options needs to be added before the GoogleNewsSitemapGenerator can be used.

  • Lyndon

    Hi

    I have used SitemapGen4J successfully with Arachnid to produce a sitemap.xml. Thanks for such a great piece of code.

    What I want to do now is pass a sitemapType parameter to my spider and have it generate the specific type of sitemap file, e.g. Mobile or Web.

    I see that the Class SitemapGenerator is abstract and that the specific 'Generator' subclasses extend it.

    So, I want to instantiate a specific 'Generator' subclass, but I am confused about how to do this. I am not that good with generics. I see that SitemapGeneratorBuilder does this, but I just cannot figure out how.

    An example of using SitemapGeneratorBuilder to do this would be much appreciated.

  • Lyndon

    I know it's naughty but I changed SitemapGenerator from abstract to public abstract which allowed me to do this:
          sg = new SitemapGeneratorBuilder(base, outputFolder, Class.forName(sitemapType)).build();

    Now by passing in one of the SitemapGenerator's subclasses as sitemapType I have one Spider class, clean code and any type of sitemap I want. 

    But I would still dearly love to do this properly with the unchanged code. 

    Any help would be much appreciated.

  • Gustavo

    How can I add multiple URL subtags? Let’s say I want:

    <url>
    <loc>http://www.example.com/videos/…</loc>
    <video:video>…</video:video>
    <video:video>…</video:video>
    </url>

  • Quique

    I'm trying to use SitemapGen4j but I'm getting an out of memory error. I'm trying to index around 400k URLs.