Using PhantomJS to serve HTML content of a Single-page App
One big drawback of single-page applications is that they cannot fully support SEO the way traditional sites do. The reason is that a single-page application often uses a technique called lazy rendering, in which the server returns only the basic layout and the real content is rendered later on the client. Many search engines crawl your website much like you would fetch it with the "curl" command, which prevents them from seeing what is really inside the page, so they cannot index it correctly. In this post, I am going to show you how we can use PhantomJS to tackle this issue.
The main idea of this post is taken from this blog post: http://davidchin.me/blog/create-seo-friendly-angularjs-app/; you might want to read it for more information. I will explain it in my own words, and the approach applies to any single-page application, not just AngularJS as described in the original post. Also, this technique focuses on Google Search, so it might not work well with other search engines (but do you care much about the others?).
First of all, we need a special <meta> tag to let Google know that our app is a single-page application:

<meta name="fragment" content="!"/>

When Google Search sees this tag, it will append an extra parameter, ?_escaped_fragment_=, to the URL it requests. Our application will return the full HTML version whenever this extra parameter is present.
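To make this concrete, here is roughly how the crawler rewrites a URL once it sees the meta tag (example.com and the path are just placeholders):

http://example.com/products  ->  http://example.com/products?_escaped_fragment_=

(Apps that still use #! in their URLs get the part after the #! moved into the _escaped_fragment_ parameter instead.)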
Now, here comes the fun part. We need a Rack middleware in Rails to catch requests containing the extra parameter above. Create a file named snapshot.rb in the config/initializers folder with content like this:
require 'snapshot/renderer'
YourApp::Application.config.middleware.use(Snapshot::Renderer)
Remember to replace YourApp with your application's name. Next, create a file named renderer.rb inside the lib/snapshot/ folder with the following content:
module Snapshot
  class Renderer
    # A list of bots that don't support _escaped_fragment_
    BOTS = [
      'baiduspider',
      'facebookexternalhit',
      'twitterbot',
      'rogerbot',
      'linkedinbot',
      'embedly',
      'bufferbot',
      'quora link preview',
      'showyoubot',
      'outbrain',
      'pinterest',
      'developers.google.com/+/web/snippet',
      'slackbot'
    ]

    def initialize(app)
      @app = app
    end

    def call(env)
      fragment = parse_fragment(env)
      if fragment
        # The request explicitly carries _escaped_fragment_: pre-render it
        fragment[:path] = env['REQUEST_PATH']
        render_fragment(env, fragment)
      elsif bot_request?(env) && page_request?(env)
        # A bot that doesn't use _escaped_fragment_: pre-render the requested page
        fragment = { path: env['REQUEST_PATH'], query: env['QUERY_STRING'] }
        render_fragment(env, fragment)
      else
        @app.call(env)
      end
    end

    private

    def parse_fragment(env)
      regexp = /(?:_escaped_fragment_=)([^&]*)/
      query = env['QUERY_STRING']
      match = regexp.match(query)
      # Interpret _escaped_fragment_ and figure out which page needs to be rendered
      { path: URI.unescape(match[1]), query: query.sub(regexp, '') } if match
    end

    def render_fragment(env, fragment)
      url = "#{ env['rack.url_scheme'] }://#{ env['HTTP_HOST'] }#{ fragment[:path] }"
      url += "?#{ fragment[:query] }" if fragment[:query].present?
      # Run PhantomJS (its options must come before the script path)
      body = `phantomjs --load-images=false lib/snapshot/browser.js #{ url }`
      # Output the pre-rendered body, reusing the status and headers from the app
      status, headers = @app.call(env)
      response = Rack::Response.new(body, status, headers)
      response.finish
    end

    def bot_request?(env)
      user_agent = env['HTTP_USER_AGENT']
      buffer_agent = env['HTTP_X_BUFFERBOT']
      buffer_agent || (user_agent && BOTS.any? { |bot| user_agent.downcase.include?(bot) })
    end

    def page_request?(env)
      method = env['REQUEST_METHOD'] || 'GET'
      accept = env['HTTP_ACCEPT']
      path = env['REQUEST_PATH']
      # Only return true if it is a GET request accepting a text/html response,
      # not hitting an API endpoint, and not requesting a static asset
      method.upcase == 'GET' &&
        accept =~ /text\/html/ &&
        !(path =~ /^\/(?:assets|api)/) &&
        !(path =~ /\.[a-z0-9]{2,4}$/i)
    end
  end
end
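To check that the middleware is wired up correctly, you can request any page with the extra parameter yourself. Assuming the app is running locally on port 3000 (adjust the host and path to your setup), something like this should come back as pre-rendered HTML:

curl "http://localhost:3000/?_escaped_fragment_="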
This file is pretty long, but it is easy to understand. Whenever a request comes in, the call method is invoked. In this method, we check whether the URL contains a fragment or the request comes from a bot; if so, we return the full HTML using the render_fragment method, otherwise we call @app.call(env) to let the Rails server handle the request as usual.
Inside render_fragment, we rebuild the URL and use phantomjs to re-fetch the full HTML page of the same URL. We have to rebuild the URL because we do not want to use the original URL, which includes ?_escaped_fragment_ in it, since that can cause a recursive issue (can you explain why?).
Let's talk a bit about PhantomJS (http://phantomjs.org/). It is a headless WebKit browser that lets you load a web page from a URL and execute the JavaScript in that page just like a normal browser would. By using PhantomJS, we can fetch the fully rendered HTML of a given URL. It requires us to write a bit of code, but it does not take much time to understand. As you can see in the render_fragment method, we execute the shell command phantomjs --load-images=false lib/snapshot/browser.js #{ url }. We assume that phantomjs is installed on the machine serving this web page (your local development computer or the server). The option --load-images=false, which must come before the script path, tells PhantomJS not to make requests for the images referenced by the page; after it come two arguments: the path to the script that processes the URL, and the URL itself. Now, create lib/snapshot/browser.js with the following content:
// Dependencies
var system = require('system'),
    webpage = require('webpage');

// Arguments check
var url = system.args[1];

if (url) {
  // Load page
  var page = webpage.create();

  // Set viewport
  page.viewportSize = {
    width: 1024,
    height: 800
  };

  page.open(url, function(status) {
    var attempts = 0;

    function checkPageReady() {
      var html;

      // Evaluate page: only return the markup once the app has rendered its content
      html = page.evaluate(function() {
        var content = document.getElementsByClassName("content")[0];
        if (content && content.childElementCount > 0) {
          return document.getElementsByTagName('html')[0].outerHTML;
        }
      });

      // Output HTML if defined and exit
      if (html) {
        console.log(html);
        phantom.exit();
      }
      // Otherwise try again, unless too many attempts were already made
      else if (attempts < 100) {
        attempts++;
        setTimeout(checkPageReady, 100);
      } else {
        console.error('Failed to wait for the requested page to load');
        phantom.exit();
      }
    }

    if (status === 'success') {
      // Check if page is fully loaded
      checkPageReady();
    } else {
      // Otherwise, if the page cannot be loaded, exit
      phantom.exit();
    }
  });
}
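You can also run this script by hand to see exactly what a crawler would receive; the address below assumes the app is running locally and is only an example:

phantomjs --load-images=false lib/snapshot/browser.js http://localhost:3000/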
Again, it is a bit long but not too complex to understand. The main function to look at is page.open(url, ...). It fetches the HTML from a URL, but unlike curl, it also executes any JavaScript code present in the fetched page. Because we want to make sure the HTML is fully rendered, we call page.evaluate to check whether the page contains some special element indicating that the content has finished loading. Of course, this check is totally application-dependent. For example, you might set a data-ready attribute on the <body> tag to true in your Angular code to say that the page has been fully loaded, and then check that attribute in the page.evaluate function, as sketched below. When the expected element is present, we use console.log to write the HTML to standard output, which is what gets captured into the body variable inside the render_fragment method. We also call checkPageReady() repeatedly until the HTML is fully loaded, and the attempts variable prevents it from waiting forever.
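As a rough sketch of that application-dependent check (the data-ready attribute and the snippet below are hypothetical, not part of the code above), you could flag readiness from the app and test for it inside checkPageReady:

// In your Angular code, once the data has been fetched and rendered:
document.body.setAttribute('data-ready', 'true');

// In browser.js, the evaluate call inside checkPageReady then becomes:
html = page.evaluate(function() {
  // Only return the markup once the app has flagged itself as ready
  if (document.body.getAttribute('data-ready') === 'true') {
    return document.getElementsByTagName('html')[0].outerHTML;
  }
});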
As you can see, this is one of the solutions we can use to tackle the SEO issue in a single-page app. However, pre-rendering like this costs server resources, and no caching is done here. It would be better to cache the rendered HTML so we only have to generate it once per page. You can also consider third-party services, even though they are often expensive. Another thing to note is that sometimes, no matter how you configure things, Google just does not behave the way you want.
I hope this is helpful. Happy coding!