Using big data tools for small data (how PeerJ moved from Google Analytics to EMR)

Sep 22, 2015

At PeerJ, each article and preprint has counters for how many times it has been viewed, how many times it has been downloaded as a PDF, and how many visitors have been to its page. We've recently updated the analytics process behind those counts and wanted to share it with our users.

Previously, we were using a combination of Google Analytics (GA) to measure page views and visitors, plus a custom process in Varnish to detect PDF downloads. We found that GA just wasn't reliable enough for our needs. The biggest issue was its use of sampling* to return results, which is fine for most applications where you don't need an exact, reproducible figure. In our case, however, where the figures were article and preprint page views, we found they would often remain stuck on a particular value for months on end. Additionally, GA would occasionally return an outright incorrect value, so we had to have checks in place to catch counts that significantly increased or decreased. Obviously this wasn't a great solution, and given that we had only ever planned GA as an interim measure, it was time to move on.

Our new process uses Amazon Web Services (AWS) Elastic MapReduce (EMR) to process our log data and deliver individual article and preprint analytics. We've found that EMR significantly reduces the learning curve around Hadoop and Hive and is as close to plug-and-play as you can get with big data tools. We were also impressed with how quickly Hue let us start exploring the data, which is great when you're just starting out or want to extract small datasets.

We've set up a fairly simple yet reliable process that launches an EMR cluster on a daily basis with steps to copy our recent log data into a combined log file, run several Hive scripts to gather analytics data, and finally extract a usable dataset to AWS DynamoDB. We then import the previous day's figures from DynamoDB into our production database and update the total counts for each article and preprint. This process is quite cost effective (currently under $20/month) and provides reliable, usable analytics for our users.
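As a rough illustration of the kind of launch involved, here is a minimal sketch assuming Python with boto3 and an EMR 4.x release (bucket names, script paths, and instance sizes are placeholders rather than our production values):

```python
# Minimal sketch of a daily EMR launch, assuming boto3 and an EMR 4.x
# release. Bucket names, script paths, and instance sizes are placeholders.
import boto3

emr = boto3.client('emr')

steps = [
    {
        # Combine the month's small log files into one large file;
        # Hadoop handles one big file far better than many tiny ones.
        'Name': 'Combine logs',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                's3-dist-cp',
                '--src', 's3://example-logs/raw/',
                '--dest', 's3://example-logs/combined/2015/09/',
                '--groupBy', '.*(2015-09).*',  # merge matching files into one
            ],
        },
    },
    {
        # Run one of the templated Hive scripts previously copied to S3.
        'Name': 'Import hits',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'hive-script', '--run-hive-script', '--args',
                '-f', 's3://example-scripts/import_hits.hql',
            ],
        },
    },
]

emr.run_job_flow(
    Name='daily-analytics',
    ReleaseLabel='emr-4.0.0',
    Applications=[{'Name': 'Hive'}],  # add {'Name': 'Hue'} for debug clusters
    Instances={
        'MasterInstanceType': 'm3.xlarge',
        'SlaveInstanceType': 'm3.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': False,  # terminate once steps finish
    },
    Steps=steps,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
```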

We chose DynamoDB as the intermediate data store primarily for its ease of use and reasonable cost. We considered extracting the data to S3 and importing it via CSV files, and we also looked at the cost of keeping the EMR cluster running all the time and querying it directly. However, with the way we have DynamoDB configured, it costs under $1/month per table (we currently use three tables), which seemed by far the most cost- and time-effective solution.
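That price works because DynamoDB bills on provisioned capacity, and the tables sit at 1 read/1 write unit except around the nightly run (more on that below). A minimal sketch of the throughput dance, assuming Python with boto3 and a hypothetical table name:

```python
import boto3

dynamodb = boto3.client('dynamodb')

def set_throughput(table, read, write):
    """Change provisioned capacity; DynamoDB bills whatever is provisioned."""
    dynamodb.update_table(
        TableName=table,
        ProvisionedThroughput={
            'ReadCapacityUnits': read,
            'WriteCapacityUnits': write,
        },
    )

# Before the nightly EMR run: give the Hive export room to write.
# 'article_views_daily' is a hypothetical table name.
set_throughput('article_views_daily', read=1, write=100)

# ... launch the cluster and wait for its steps to finish ...

# Afterwards: back down to the 1 read/1 write floor that keeps
# each table under $1/month.
set_throughput('article_views_daily', read=1, write=1)
```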

For the more technically inclined, I’ll go into a bit more depth on our process.

  1. We first copy our Hive SQL files to S3. The files are templated so that we can reuse the same table names while changing where data is imported from and which DynamoDB table we export to (a cut-down example of such a template appears after this list).
  2. We then launch our EMR cluster with a list of steps. Our launcher takes a variety of options so we can control the instance sizes as well as the type of run, such as the daily import or a cluster for debug analytics. The launcher also increases our DynamoDB read/write throughput before the run and drops it back to 1 read/1 write unit afterwards to minimize our DynamoDB costs (as in the throughput sketch above).
    1. By default, EMR just launches a base cluster. As we use Hive, and sometimes Hue, we add those package-installation steps to all of our clusters.
    2. We use S3DistCp to copy the current month's log files into a combined log file in a separate S3 bucket under a /year/month path (as in the launch sketch above). Because our log filenames contain the date in year-month-day format, S3DistCp can select only the files we want and combine them into one log file; Hadoop is much more efficient processing one large file than many tiny ones.
    3. Our initial Hive script populates a local table by selecting from an external table pointing at the S3 /year/month folder. We only keep requests with a 200 status that match the pages we're interested in (article and preprint pages) and that don't carry a user agent we wish to exclude. We exclude bot user agents (using the list from Project COUNTER) and, taking a cue from GA, we also exclude IE6-8, as we found a lot of machine requests use those user agents. We also GROUP BY every column to eliminate duplicate rows in case a log file is copied twice; this also collapses genuine duplicate requests within the same second, which for our purposes is acceptable.
    4. We then use a separate table to group similar requests (those with the same IP, user agent, and URL) into daily figures, counting at most 5 requests per day per similar request so repeat hits can't artificially inflate the view counts.
    5. Finally, we group the similar counts into daily totals and insert those figures into DynamoDB (steps 4 and 5 are sketched after this list).
    6. We repeat steps 4 and 5 with slightly different queries to get our PDF downloads and monthly visitor figures, which are stored in separate DynamoDB tables. Although we store daily figures for both page views and PDF downloads, we couldn't use the same table: Hive overwrites the full row in DynamoDB rather than updating just the one column value.
  3. As a completely separate process, we have a PHP Symfony CLI application that loops through our published articles and preprints, then queries and saves the previous day's DynamoDB analytics counts (the equivalent query is sketched below). As with the EMR process, we increase the DynamoDB read throughput before running and drop it back to 1 read unit afterwards.
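To make the templating in steps 1 and 3 concrete, here is a hypothetical, cut-down version of such an import script, rendered from a Python template (the real log format, filter lists, and table names are more involved):

```python
# Hypothetical, cut-down version of a templated Hive import script.
# Assumes tab-delimited logs with these five fields; the real format,
# filter lists, and table names are more involved.
IMPORT_HQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_logs (
    ip         STRING,
    ts         STRING,
    request    STRING,
    status     INT,
    user_agent STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://example-logs/combined/{year}/{month}/';

CREATE TABLE IF NOT EXISTS hits (
    ip STRING, ts STRING, request STRING, user_agent STRING
);

-- Keep only successful article/preprint page requests whose user agent
-- is not excluded. Grouping by every column drops duplicate rows in
-- case a log file was copied twice (it also collapses genuine repeat
-- requests within the same second, which is acceptable here).
INSERT OVERWRITE TABLE hits
SELECT ip, ts, request, user_agent
FROM raw_logs
WHERE status = 200
  AND request RLIKE '^/(articles|preprints)/'
  AND user_agent NOT RLIKE 'bot|crawler|spider|MSIE [6-8]'
GROUP BY ip, ts, request, user_agent;
"""

# The same template serves daily imports and debug runs; only the
# source path (and, in later scripts, the target table) changes.
print(IMPORT_HQL.format(year='2015', month='09'))
```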
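Steps 4 and 5 then look roughly like this: the 5-per-day cap, followed by an export through the EMR DynamoDB connector. The table layouts and names here are guesses rather than our actual schema:

```python
# Rough shape of steps 4 and 5: cap similar requests at 5 per day, then
# roll the capped counts up into daily totals stored in DynamoDB via the
# EMR DynamoDB connector. Table layouts and names are guesses.
AGGREGATE_HQL = """
-- Group similar requests (same ip, user agent, and url) into daily
-- figures, counting at most 5 requests per group per day. Timestamps
-- are assumed to look like '2015-09-22 13:05:01'.
CREATE TABLE similar_daily AS
SELECT to_date(ts) AS dt, request, ip, user_agent,
       IF(COUNT(*) > 5, 5, COUNT(*)) AS views
FROM hits
GROUP BY to_date(ts), request, ip, user_agent;

-- External table backed by DynamoDB; inserting here writes items
-- straight into the {dynamo_table} table. The connector puts whole
-- items rather than updating single attributes, which is why page
-- views and pdf downloads live in separate tables.
CREATE EXTERNAL TABLE IF NOT EXISTS views_export (
    request STRING, dt STRING, views BIGINT
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    'dynamodb.table.name' = '{dynamo_table}',
    'dynamodb.column.mapping' = 'request:request,dt:day,views:views'
);

INSERT OVERWRITE TABLE views_export
SELECT request, dt, SUM(views)
FROM similar_daily
GROUP BY request, dt;
"""

print(AGGREGATE_HQL.format(dynamo_table='article_views_daily'))
```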
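The import application itself is PHP/Symfony, but purely to illustrate the query involved, here is the equivalent previous-day lookup in Python/boto3 (the key schema is hypothetical):

```python
import boto3
from datetime import date, timedelta

dynamodb = boto3.client('dynamodb')
yesterday = (date.today() - timedelta(days=1)).isoformat()

# Fetch yesterday's view count for one article page. Assumes a hash key
# on the request path and a range key on the day, which is a guess;
# "day" is a DynamoDB reserved word, hence the alias.
resp = dynamodb.query(
    TableName='article_views_daily',
    KeyConditionExpression='request = :r AND #d = :d',
    ExpressionAttributeNames={'#d': 'day'},
    ExpressionAttributeValues={
        ':r': {'S': '/articles/1000'},
        ':d': {'S': yesterday},
    },
)
for item in resp.get('Items', []):
    print(item['views']['N'])
```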

Prior to going live, we went through quite a few revisions of this process to make it as accurate as we could. There will naturally be differences, as GA is a JavaScript implementation whereas log files record the actual requests, but in most cases the final figures correlated well. We did discover a few drops in figures that appear to be repeated requests from the same IP/user agent on the same day that GA did not catch.

Going through this process, we realized we should have started logging unique device IDs a while ago, and we've recently begun doing so to improve our visitor numbers. Recognizing the same visitor is always going to be imperfect (cookies rely on the browser accepting them, while IP/user-agent combinations are not necessarily unique), and it will be interesting to see how the numbers match up in the future.

* GA apparently offers a paid tier that can return unsampled data, but we decided not to pursue that option due to cost.
