Saturday, October 22, 2011

crawlajax

search engines never liked javascript and ajax. as a person who advocates seperation of HTML and data, SEO always became a problem for me in software design.

thanks google, finally they suggested a solution like this:

http://code.google.com/web/ajaxcrawling/docs/getting-started.html

if you use hashbang in your URL, then googlebot does a request to your backend with _escaped_fragment_ parameter in which your url fragment (the part after hashbang) is passed, so google can index the response you returned.

at backend you can render your frontend page via a tool using one of layout engines or native libraries generally used for html unit tests, then you can publish the DOM XML you gained for response.

google had suggested htmlunit for that purpose. but we had performance problems with htmlunit because our application was a very heavy javascript application and render time was too long.

at this point, for performance, i'll have 3 advice independent from tool you preferred:
  • do not render CSS and be sure you're not doing requests for CSS files.
  • do not request images, flash and etc.
  • sign your requests done from your backend via a special user agent. by that way, you can understand it's googlebot request at javascript of your frontend and you can prevent unnecessary operations (like doing facebook and google analytics requests, these are really slowing rendering).
after performance problems, as an alternative, we wrote a crawler to create and save html snapshot of rendered DOM, and served created folder statically to googlebot.

for this, we used phantomjs for rendering, and wrote a small python program to crawl site:


project satisfies only our application's requirements for now, you're free to fork and enrich it according to your needs.

when i googled with keywords 'ajax' and 'seo', there were many advices but there was no working application. i hope this project can fill this gap.

1 comment:

Oliver Jones said...


Great web site you've got here.. It's difficult to find high-quality writing like yours nowadays. I truly appreciate individuals like you! Take care!! capitalone.com login