Using PhantomJS to serve HTML content of a Single-page App
One big drawback of single-page applications is that they cannot fully support SEO the way traditional sites do. The reason is that a single-page application often uses a technique called lazy rendering, in which the server returns only the basic layout and the real content is rendered later on the client. Many search engines crawl your website much like you would fetch it with the "curl" command, which prevents them from seeing what is really inside the page, so they cannot index it correctly. In this post, I am going to show you how we can use PhantomJS to tackle this issue.
The main idea of this post is taken from this blog post: http://davidchin.me/blog/create-seo-friendly-angularjs-app/; you might want to read it for more information. I will explain it in my own words, and the approach applies to any single-page application, not just AngularJS as described in the original post. Also, this technique focuses on Google Search, so it might not work well with other search engines (but do you care much about the others?).
First of all, we need a special <meta> tag to let Google know that our app is a single-page application:

<meta name="fragment" content="!"/>

When Google Search sees this tag, it will append an extra parameter, ?_escaped_fragment_=, to the URL it requests. Our application will return the full HTML version whenever this extra parameter is present.
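To make this concrete, here is roughly how the crawler rewrites a URL once it sees the meta tag (example.com and the path are just placeholders):

http://example.com/products  ->  http://example.com/products?_escaped_fragment_=

(Apps that still use #! in their URLs get the part after the #! moved into the _escaped_fragment_ parameter instead.)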
Now, here comes the fun part. We need a Rack middleware in Rails to catch requests containing the extra parameter above. Create a file named snapshot.rb in the config/initializers folder with content like this:
require 'snapshot/renderer'
YourApp::Application.config.middleware.use(Snapshot::Renderer)
Remember to replace YourApp with your application's name. Next, create a file named renderer.rb inside the lib/snapshot/ folder with the following content:
module Snapshot
  class Renderer
    # A list of bots that don't support _escaped_fragment_
    BOTS = [
      'baiduspider',
      'facebookexternalhit',
      'twitterbot',
      'rogerbot',
      'linkedinbot',
      'embedly',
      'bufferbot',
      'quora link preview',
      'showyoubot',
      'outbrain',
      'pinterest',
      'developers.google.com/+/web/snippet',
      'slackbot'
    ]

    def initialize(app)
      @app = app
    end

    def call(env)
      fragment = parse_fragment(env)
      if fragment
        # The request explicitly carries _escaped_fragment_: pre-render it
        fragment[:path] = env['REQUEST_PATH']
        render_fragment(env, fragment)
      elsif bot_request?(env) && page_request?(env)
        # A bot that doesn't use _escaped_fragment_: pre-render the requested page
        fragment = { path: env['REQUEST_PATH'], query: env['QUERY_STRING'] }
        render_fragment(env, fragment)
      else
        @app.call(env)
      end
    end

    private

    def parse_fragment(env)
      regexp = /(?:_escaped_fragment_=)([^&]*)/
      query = env['QUERY_STRING']
      match = regexp.match(query)
      # Interpret _escaped_fragment_ and figure out which page needs to be rendered
      { path: URI.unescape(match[1]), query: query.sub(regexp, '') } if match
    end

    def render_fragment(env, fragment)
      url = "#{ env['rack.url_scheme'] }://#{ env['HTTP_HOST'] }#{ fragment[:path] }"
      url += "?#{ fragment[:query] }" if fragment[:query].present?
      # Run PhantomJS (its options must come before the script path)
      body = `phantomjs --load-images=false lib/snapshot/browser.js #{ url }`
      # Output the pre-rendered body, reusing the status and headers from the app
      status, headers = @app.call(env)
      response = Rack::Response.new(body, status, headers)
      response.finish
    end

    def bot_request?(env)
      user_agent = env['HTTP_USER_AGENT']
      buffer_agent = env['HTTP_X_BUFFERBOT']
      buffer_agent || (user_agent && BOTS.any? { |bot| user_agent.downcase.include?(bot) })
    end

    def page_request?(env)
      method = env['REQUEST_METHOD'] || 'GET'
      accept = env['HTTP_ACCEPT']
      path = env['REQUEST_PATH']
      # Only return true if it is a GET request accepting a text/html response,
      # not hitting an API endpoint, and not requesting a static asset
      method.upcase == 'GET' &&
        accept =~ /text\/html/ &&
        !(path =~ /^\/(?:assets|api)/) &&
        !(path =~ /\.[a-z0-9]{2,4}$/i)
    end
  end
end
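To check that the middleware is wired up correctly, you can request any page with the extra parameter yourself. Assuming the app is running locally on port 3000 (adjust the host and path to your setup), something like this should come back as pre-rendered HTML:

curl "http://localhost:3000/?_escaped_fragment_="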
This file is pretty long, but it is easy to understand. Whenever a request comes in, the call method is invoked. In this method, we check whether the URL contains a fragment or the request comes from a bot; if so, we return the full HTML using the render_fragment method, otherwise we call @app.call(env) to let the Rails server handle the request as usual.
Inside render_fragment, we rebuild the URL and use phantomjs to re-fetch the full HTML page of the same URL. We have to rebuild the URL because we do not want to use the original URL, which includes ?_escaped_fragment_ in it, since that can cause a recursive issue (can you explain why?).
Let's talk a bit about PhantomJS (http://phantomjs.org/). It is a headless WebKit browser that lets you load a web page from a URL and execute the JavaScript in that page just like a normal browser would. By using PhantomJS, we can fetch the fully rendered HTML of a given URL. It requires us to write a bit of code, but it does not take much time to understand. As you can see in the render_fragment method, we execute the shell command phantomjs --load-images=false lib/snapshot/browser.js #{ url }. We assume that phantomjs is installed on the machine serving this web page (your local development computer or the server). The option --load-images=false, which must come before the script path, tells PhantomJS not to make requests for the images referenced by the page; after it come two arguments: the path to the script that processes the URL, and the URL itself. Now, create lib/snapshot/browser.js with the following content:
// Dependencies
var system = require('system'),
    webpage = require('webpage');

// Arguments check
var url = system.args[1];

if (url) {
  // Load page
  var page = webpage.create();

  // Set viewport
  page.viewportSize = {
    width: 1024,
    height: 800
  };

  page.open(url, function(status) {
    var attempts = 0;

    function checkPageReady() {
      var html;

      // Evaluate page: only return the markup once the app has rendered its content
      html = page.evaluate(function() {
        var content = document.getElementsByClassName("content")[0];
        if (content && content.childElementCount > 0) {
          return document.getElementsByTagName('html')[0].outerHTML;
        }
      });

      // Output HTML if defined and exit
      if (html) {
        console.log(html);
        phantom.exit();
      }
      // Otherwise try again, unless too many attempts were already made
      else if (attempts < 100) {
        attempts++;
        setTimeout(checkPageReady, 100);
      } else {
        console.error('Failed to wait for the requested page to load');
        phantom.exit();
      }
    }

    if (status === 'success') {
      // Check if page is fully loaded
      checkPageReady();
    } else {
      // Otherwise, if the page cannot be loaded, exit
      phantom.exit();
    }
  });
}
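You can also run this script by hand to see exactly what a crawler would receive; the address below assumes the app is running locally and is only an example:

phantomjs --load-images=false lib/snapshot/browser.js http://localhost:3000/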
Again, it is a bit long but not too complex to understand. The main function to look at is page.open(url, ...). It fetches the HTML from a URL, but unlike curl, it also executes any JavaScript code present in the fetched page. Because we want to make sure the HTML is fully rendered, we call page.evaluate to check whether the page contains some special element indicating that the content has finished loading. Of course, this check is totally application-dependent. For example, you might set a data-ready attribute on the <body> tag to true in your Angular code to say that the page has been fully loaded, and then check that attribute in the page.evaluate function, as sketched below. When the expected element is present, we use console.log to write the HTML to standard output, which is what gets captured into the body variable inside the render_fragment method. We also call checkPageReady() repeatedly until the HTML is fully loaded, and the attempts variable prevents it from waiting forever.
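As a rough sketch of that application-dependent check (the data-ready attribute and the snippet below are hypothetical, not part of the code above), you could flag readiness from the app and test for it inside checkPageReady:

// In your Angular code, once the data has been fetched and rendered:
document.body.setAttribute('data-ready', 'true');

// In browser.js, the evaluate call inside checkPageReady then becomes:
html = page.evaluate(function() {
  // Only return the markup once the app has flagged itself as ready
  if (document.body.getAttribute('data-ready') === 'true') {
    return document.getElementsByTagName('html')[0].outerHTML;
  }
});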
As you can see, this is one of the solutions we can use to tackle the SEO issue in a single-page app. However, pre-rendering like this costs server resources, and no caching is done here. It would be better to cache the rendered HTML so we only have to generate it once per page. You can also consider third-party services, even though they are often expensive. Another thing to note is that sometimes, no matter how you configure things, Google just does not behave the way you want.
I hope this is helpful. Happy coding!