May 13th, 2007
Analytics vs. FireStats vs. Raw Access Logs
I love data collection, and summarizing that data into useful information. I’ve done this on some epic scales with some of my employers recently, and I also enjoy doing it with the data collected on this site. One thing I have noticed, however, is that Google Analytics often displays very different results from what I’m finding with two other sources, and it makes me question the validity of Google’s data.
At first, I was interested in numbers. But after 8 months of blogging, I believe that this site has pretty much peaked for the time being. Unless I can offer something of real value to the online community, I don’t see my existing numbers changing too much. So aside from sheer access counts, I’ve also been seeing what operating systems people are using, what browsers, and (more importantly) where people are accessing this site from.
To share just a little of this data, over the last 8 months I have logged 1.8 million page visits. Of these, 128,912 have been from real people (as best as I’ve been able to weed out). From these 129 thousand people I’ve learned that 88.3% of them use a variation of Windows, and 3.2% use Mac OSX. Ubuntu is the most common flavour of Linux seen, with 0.4% of all visitors using that user-friendly OS. IE is still on top of the market with 72.6% of the share, Firefox with 19.5%, and the remaining 12 browsers duking it out for the rest.
While that data is partially amusing, it doesn’t really hold much value for me. My site will load properly in all the major browsers and I’m content running Windows, FreeBSD and Solaris for the various roles and tasks that my computers must fulfill. What really fascinates me is the global locations of my visitors, and how the data went completely against my expectations (data rarely ever surprises me at work).
In the first few months of operation, traffic was as expected. The United States made up the lion’s share of my traffic, followed by Canada and Japan. The occasional hit from Korea, Italy and Mexico would catch my eye, but I had expected people might stumble across this site while looking for something completely different.
However, shortly after I moved my site to ANHosting in January (mainly because my home webserver was seemingly overwhelmed with all the crawlers, and my monthly bandwidth usage with my ISP was starting to break records) I noticed that I was receiving far more hits from these unexpected countries. In the same month, I had added Global Translator to my site and the crawlers had a field day with it. Every page was translated into nine other languages, then stored and indexed for future requests on Google, Yahoo, MSN and a plethora of other search engines and universities. From here, the international traffic took off. No wonder my little home server was choking….
In the last 90 days, Spain, Brazil and the US have been the three countries to visit this site the most. Spanish seems to be the language of choice for most people reading my content, which makes me wonder just how accurate BabelFish’s machine translation engine really is. Greece is right behind, with Italy, Portugal, France, the Netherlands and Britain trailing. Then come Japan, Canada, Mexico, Colombia, Ukraine and Korea. Another 44 countries share the rest of the traffic.
This site is read mostly in Spanish, followed by English, German, Chinese, Arabic and Japanese. Russian is the least accessed language.
All of this has been gleaned from the raw access logs on the web server, then downloaded into a custom database developed in SQL 2000, and sorted from there. I’ve been using IP2Location’s IP-Country-Region-City database in order to determine approximately which cities people are in to narrow the criteria down further. Please note that I don’t do any of this for marketing purposes. I will not have any AdSense ads or anything that remotely looks like an advertisement on this site. The most that I’ll do is offer a link for a product or service that I find useful. I try to give credit where credit is due.
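For anyone curious what that kind of lookup involves, here is a minimal sketch in Python rather than the SQL 2000 database described above. It assumes the access log is in Common Log Format and that the geolocation data can be reduced to sorted (first IP, last IP, country) ranges. The `RANGES` table, the function names and the country labels are all made up for illustration (the addresses are reserved documentation ranges), and this is not the actual IP2Location schema.

```python
import bisect
import ipaddress
import re

def ip_to_int(ip):
    """Convert a dotted IP address string to its integer value."""
    return int(ipaddress.ip_address(ip))

# Hypothetical range table, sorted by first address: (first_ip, last_ip, country).
# The addresses are documentation ranges; the country labels are invented.
RANGES = [
    (ip_to_int("192.0.2.0"), ip_to_int("192.0.2.255"), "ES"),
    (ip_to_int("198.51.100.0"), ip_to_int("198.51.100.255"), "BR"),
    (ip_to_int("203.0.113.0"), ip_to_int("203.0.113.255"), "US"),
]
STARTS = [first for first, _, _ in RANGES]

# Start of a Common Log Format line:
# client IP, identity, user, [timestamp], "request", status code.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3})')

def country_for(ip):
    """Return the country whose range contains ip, or None if no range matches."""
    n = ip_to_int(ip)
    # Find the last range that starts at or before n, then check its upper bound.
    i = bisect.bisect_right(STARTS, n) - 1
    if i >= 0 and RANGES[i][0] <= n <= RANGES[i][1]:
        return RANGES[i][2]
    return None

def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    ip, timestamp, request, status = m.groups()
    return {
        "ip": ip,
        "time": timestamp,
        "request": request,
        "status": int(status),
        "country": country_for(ip),
    }
```

Feeding every line of a log through `parse_line` and counting the `country` values gives the sort of per-country breakdown discussed here. The binary search keeps each lookup fast even with a range table of millions of rows.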
All this started when I first installed Omry Yadan’s FireStats. I really like this plugin for WordPress (and almost any other site if you know how to integrate it) as it’s easy to install, collects data very quickly and displays accurate information that’s less than 0.5% different from what I find in the raw access logs for this site. The differences occur primarily with 404s and downloads. I can live with this as my raw data lets me know how often people go to a non-existent page, or when one of my plugins is downloaded.
Because FireStats is so in line with these other logs, I tend to use this as my primary source of information.
Yet with all the talk about Google Analytics, I decided to give this a shot just to see if it could provide value that isn’t easily available elsewhere. And while this is mostly true, it also shows data that I cannot confirm in my own raw access logs.
Over the last few weeks, Analytics has shown hits from countries that I’m surprised have access to the internet, let alone time to use it. According to their data, I’ve had visits from South Africa, Kenya, Iraq, and Somalia. South Africa I can kind of understand, but I can’t find any South African IPs in my access logs. Nor can I find any Iraqi, Kenyan or Somali IPs. My IP2Location database is right up to date, and while IP ranges can hop between countries, I can’t see this happening often enough that these countries all show up as false positives within the same week.
Using May 10th, 2007 as the base, I took a sample of all the traffic between 00:00:00 GMT and 23:59:59 GMT and found that Google was collecting only 4% of my actual access data. Thinking that they were blocking out all the crawlers (which makes sense), I then compared the data for the 10th using only valid users, and found that Google was still showing just under 70% of my expected traffic. Just for giggles, I then compared the country information between the two sources and found that three countries reported by Google were not found in my access logs. To further validate whether I had data from the exact same time frame, I searched the access logs for these three countries and found that I have not had hits from two of them in more than 14 days, and the last hit from the other was the day before.
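The comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the queries actually run against the database: the `CRAWLER_MARKERS` list is a crude, hypothetical stand-in for real crawler detection, the hit records are assumed to carry a timezone-aware `time`, a user `agent` string and a `country`, and the Analytics figures are assumed to be typed in by hand from its reports.

```python
from datetime import datetime, timezone

# Hypothetical substrings used to spot crawlers in the User-Agent header.
# Real crawler detection needs a much longer, regularly maintained list.
CRAWLER_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_crawler(user_agent):
    """Return True if the user agent contains any known crawler marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)

def in_gmt_day(ts, year, month, day):
    """Return True if the timezone-aware datetime ts falls on the given GMT day."""
    utc = ts.astimezone(timezone.utc)
    return (utc.year, utc.month, utc.day) == (year, month, day)

def compare(hits, analytics_visits, analytics_countries, day):
    """Compare one GMT day of access-log hits against Analytics figures.

    hits: list of dicts with "time" (aware datetime), "agent" and "country".
    analytics_visits: visit count reported by Analytics for that day.
    analytics_countries: country codes reported by Analytics for that day.
    day: (year, month, day) tuple naming the GMT day to sample.
    """
    same_day = [h for h in hits if in_gmt_day(h["time"], *day)]
    humans = [h for h in same_day if not is_crawler(h["agent"])]
    log_countries = {h["country"] for h in humans}
    return {
        # Fraction of all logged hits that Analytics accounts for.
        "share_of_all": analytics_visits / len(same_day) if same_day else None,
        # Fraction of non-crawler hits that Analytics accounts for.
        "share_of_humans": analytics_visits / len(humans) if humans else None,
        # Countries Analytics reports that never appear in the log sample.
        "unmatched_countries": sorted(set(analytics_countries) - log_countries),
    }
```

One caveat with any comparison like this: a log hit and an Analytics visit are not the same unit of measure, so the two shares are only a rough indication of how much traffic each source sees.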
So I’ll give Analytics that one country. They may not be using GMT in their access logs, and I can live with that (it seems to be PST when I examine it, so perhaps the logs are shaped based on the time zone the viewer is in). But where is this other data coming from?
Aside from the Geolocation map and language graphs that Analytics offers, I do not see much value in this for me. Maybe if I was taking part in AdSense or some other marketing campaign management where visits and clicks equate to dollars and cents … but even then, if my raw access logs are showing so much more activity on my site, I wonder just how accurate the dollars and cents reporting from Analytics would be.
I’ve tried putting the Google JavaScript in the header, footer, and everywhere else on my site in the event some people stripped out the sidebar, but no dice. I am forced to wonder what value Analytics would offer to businesses if the data collection could be foiled just by a user preventing JavaScript from running on their machine….
So for anyone who hosts a site and would like to know where their users are from or how many hits they receive in a day, I’d suggest using FireStats. The interface is very clean and it integrates quite easily into WordPress. If anyone knows why Analytics’ data is so different from my access logs and/or FireStats, I’d love to hear it.

[...] Google Analytics reports was the Geomap Overlay. However, after comparing data for a set period, I was unable to validate the information shown. If I had this problem when creating reports at work, I would need to go line by line into the [...]
[...] no fan of Google Analytics, either. I’ve tried confirming the numbers they show, but everything seems to be completely incorrect when put up next to my raw access logs or the FireStats WordPress plugin, and you can’t tell [...]