Channel: Occam's Razor by Avinash Kaushik

Smarter Data Analysis of Google's https (not provided) change: 5 Steps


It is astonishingly common that we are asked to analyze the impossible. In perhaps a career-limiting move I'm going to try to do that today (and for a controversial topic to boot!).

In this post about an important Google change, I want you to focus less on the data and focus more on the methodology. And – so important – I want you to help me with your ideas of how we can do this impossible analysis better, in the complete absence of data :). So please share your ideas via comments and let's together make a smarter ecosystem.

[Update: As of late 2013 secure search now results in almost all of our keywords not being provided. For the latest, please see this post: Search: Not Provided: What Remains, Keyword Data Options, the Future.]

On board? Let's go….

In an effort to make search more secure, on Oct. 18th Google announced that users logged into their Google accounts using www.google.com would be redirected to https://www.google.com. The search queries by these users would hence be encrypted and not available to website owners via web analytics tools such as Omniture, WebTrends, Open Stats, Google Analytics etc.

Having all the search queries in our keyword reports was our normal state, so not having them feels different. As the change ramped up and more user queries came to be represented, in at least Google Analytics, under the moniker "(not provided)", we all got worried. From our perspective it would be immensely preferable to be able to analyze all the keywords individually. Sadly we don't have that now.

The wonderful thing is that in addition to passionate commentary on Twittersphere / industry blogs / gurus, we also have access to data for our own websites. We can, and should, look beyond simplistic "it is this high or that low" to see if we can understand something (anything!) deeper.

Most analytics vendors, including Google Analytics, reacted immediately to the change in order to help us quantify its impact in multiple ways. As you can imagine, my reaction was to unleash a flurry of custom reports, apply smart advanced segments, compare data pre and post change, and go down a bunch of holes.

From that experience here are five steps I recommend you follow to gain a smarter understanding of this change…

#1: Establish macro context.

On Oct 20th on my Google+ page I'd shared a custom report for Google Analytics that makes it extremely simple for you to look at this data: Visits, Unique Visitors, Bounce Rates and Goal Completions for (not provided).

You can download that report into your GA account by clicking on this link after you are logged into GA: Google httpS Change Impact.

Here's what the data for this blog looks like for one month:

not provided custom report

Like me, you should first compute the high-level impact of the change. From Oct. 31 (when the trend started to spike and subsequently stabilized) to Nov 15…

Total site visits: 57,672
Search engine visits: 27,534
Google visits: 26,548
(not provided) – i.e. keyword unknown – visits: 4,651

User search queries not available: 4651 / 26548 = 18%

Please note that this number will vary dramatically depending on the type of website you have, audience attributes, geographic location and a number of other factors.

Now you know what the number is for your site, and you can keep the custom report handy to continue to watch what happens over time. Remember to divide the number by total Google traffic. I see people using total search traffic or total site traffic or… other imprecise metrics.
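If you would rather script this than do it by hand, here is a minimal sketch in Python using the numbers from this post. The function name `not_provided_share` is my own illustration and not part of any analytics tool. It also shows how much the answer shifts when the wrong denominator is used.

```python
def not_provided_share(not_provided_visits: int, denominator_visits: int) -> float:
    """Return (not provided) visits as a percentage of the chosen denominator."""
    if denominator_visits <= 0:
        raise ValueError("denominator_visits must be positive")
    return 100.0 * not_provided_visits / denominator_visits


# Numbers from the custom report above (Oct. 31 to Nov. 15).
total_site_visits = 57_672
search_engine_visits = 27_534
google_visits = 26_548
not_provided_visits = 4_651

# Recommended denominator: total Google visits.
share_of_google = not_provided_share(not_provided_visits, google_visits)
# The two imprecise denominators, shown only for contrast.
share_of_search = not_provided_share(not_provided_visits, search_engine_visits)
share_of_site = not_provided_share(not_provided_visits, total_site_visits)

print(f"Share of Google visits: {share_of_google:.1f}%")
print(f"Share of search visits: {share_of_search:.1f}%")
print(f"Share of site visits:   {share_of_site:.1f}%")
```

Swap in your own four numbers; the larger denominators will always make the impact look smaller than it is.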

All numbers in aggregate are at best marginally useful, and that rule applies to this one too.

We want to know more. Who are these people? Are they people I should care about? Not care about? And what kind of search queries are these? Brand? Non-brand? What else?

Sadly we can't answer all of those questions, but we can make a small clump of informed judgments based on data we do have. It just needs a pinch of passion, some smarts and a lot of effort.

Let's deep dive into some very cold and choppy waters…

#2: Understand the performance profile of the (not provided) traffic.

One of the things I hate about standard reports in all web analytics tools is that they scatter necessary data across tabs, multiple reports, or outright hide it. #aaarrrrrh

So I always use custom reports. In most web analytics tools it takes as little as 20 seconds to create one. I did one for this particular purpose. It provides me the end-to-end view of search keyword performance in one place.

Here is what it looks like:

keyword analysis custom report

You can download it into your Google Analytics account by clicking here: Keyword Performance Analysis Report

Two quick things to note.

1. Never ever never never never create a custom report without three critical elements: Acquisition, Behavior, Outcomes. Without the end-to-end view you'll make bad decisions.

2. It is a bit odd that my first dimension is Source (essentially All Traffic) for a keyword report. Before I dive into search data, I always like to set context in my mind for how important this (or any other) traffic is. It is rare that we see the big picture before we go for the weeds, and I personally find that suboptimal.

Though in this case if you drill down into any source other than a search engine, that second drill down won't make sense, but that is okay. Small sacrifice to be smart, right? :)

So how does (not provided) look? Here's my end to end view:

keyword performance data

The numbers in red were added to the report by me. I wanted to know what percentage of the total Visits and Goal Completions (not provided) was. [On that last point, if you have an ecommerce website you can use Orders or an appropriate proxy instead of Goal Completions.]

Bottom-line: 18% of the Visits and 22% of the Conversions.

Big numbers! But with a quick scan of the report, I think I already see that there is something delightful going on here. Stick with me. I think we have a surprise coming.

The custom report has eight metrics (two more than I normally use) simply to try to tease out some nuance of the performance as we look across keywords.

One hypothesis I had was that (not provided) might be mostly returning visitors. The overall search avg % New Visits is 67.96%, for (not provided) it is 65.06%. Very similar to the "average site visitor." But notice that all Brand Terms above (avinash, kaushik, occam's razor) have very low % New Visits. So it is possible that (not provided), contrary to my hypothesis, are mostly new people.

Overall bounce rate is 70.2% (not unusual for a blog/pure content site), and (not provided) is 66%. Again, scanning across the top ten terms you can see higher rates for non-brand searchers (people looking for specific, perhaps quick, answers) when compared to brand terms.

Content consumption, Pages/Visit, seems to be a bit on the higher side compared to the average (1.76). But like the other metrics above, there is a pattern between brand and non-brand (with brand higher on this metric).

I really, really care about Goal 2, hence that conversion rate is in the report. The average is 2.21%, (not provided) is around 2.37%. There's not much conversion going on with the broad non-brand terms (you can't get lower than 0% :).

Goal Completions is very interesting. (not provided) is a huge bucket of goal completions (and it is easy to understand why so many SEOs and Marketers and Lovers are in a tizzy!). The things to note here are the numbers in red (% of each bucket compared to total Goal Completions, 4,816). See how quickly things fall off the cliff. Note the difference between brand and non-brand.

Finally, my absolute favorite: Per Visit Goal Value. There is no obvious monetization on this blog, but I have 8 distinct goals and I have goal values assigned to each for the long term impact each adds. (How's that for focusing on customer lifetime value? :)). $1.27 for (not provided), compared to overall of $1.01, and the number does not come close to the other brand terms.

We still don't know what keywords are contained in the (not provided) bucket.

But what we do know is that for this site (not provided) visitors fit this bill: They seem to be new people with behavior that is quite distinct from the "head" brand terms and closer to the non-brand terms.
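To make the comparison against the overall search average a little more concrete, here is a small sketch that computes the relative difference for each metric quoted above. Only the metrics where the text gives both values are included (Pages/Visit is left out because only the average is quoted). The helper name `relative_difference` and the dictionary layout are my own choices for illustration.

```python
# Metric values quoted in the text above: (overall search average, (not provided)).
profile = {
    "% New Visits": (67.96, 65.06),
    "Bounce Rate %": (70.2, 66.0),
    "Goal 2 Conversion Rate %": (2.21, 2.37),
    "Per Visit Goal Value $": (1.01, 1.27),
}


def relative_difference(baseline: float, value: float) -> float:
    """Percentage by which value differs from baseline (positive means higher)."""
    return 100.0 * (value - baseline) / baseline


deltas = {
    name: relative_difference(average, not_provided_value)
    for name, (average, not_provided_value) in profile.items()
}

for name, delta in deltas.items():
    print(f"{name:<26} {delta:+.1f}% vs. overall average")
```

Relative differences are used here because the metrics live on very different scales, so raw differences would not be comparable across rows.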

In the past I've lovingly termed non-brand long tail visitors as "impression virgins." The hint at the end of this step is that I've got myself a lot of impression virgins in (not provided)!

Let's go and see if we can validate that theory.

#3: Deep dive: Match up performance profile to Brand & Non-brand visits.

Based on the clues above, I'm going to try to understand whether the performance profile for (not provided) is closer to that of brand searchers or that of non-brand searchers.

I create this simple segment in GA… should take you five seconds to do it for your own business…

brand keywords segment

Apply it to my custom report and boom!

brand traffic performance

[sidebar] A quick thing to note is the ratio of Unique Visitors to Visits. In context of % New Visits that makes sense. But just make a note of it. [/sidebar]

How does this compare, purely from a performance of the key performance indicators perspective, with (not provided) for the same period?

not provided keyword performance

Quite a stark difference as you look across metrics like % New Visits, Bounce Rate, Pages/Visit, Conversion Rate and Per Visit Goal Value.

So how does the performance of (not provided) compare to that of non-branded keywords? Not a difficult question to answer.

Back into GA to create a segment like the one above, except change "Include" to "Exclude", and I have my non-branded traffic segment.
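If you work with exported keyword data rather than inside the tool, the same Include/Exclude idea can be sketched with a regular expression. This is only an illustration: the brand terms are the ones mentioned earlier in this post, the sample keywords are hypothetical, and the function name `classify_keyword` is my own.

```python
import re

# Brand terms mentioned earlier in this post; replace the pattern with your own brand terms.
BRAND_PATTERN = re.compile(r"avinash|kaushik|occam'?s\s+razor", re.IGNORECASE)


def classify_keyword(keyword: str) -> str:
    """Label a search keyword as 'not provided', 'brand', or 'non-brand'."""
    if keyword.strip().lower() == "(not provided)":
        return "not provided"
    return "brand" if BRAND_PATTERN.search(keyword) else "non-brand"


# Hypothetical keywords, purely for illustration.
sample_keywords = [
    "avinash kaushik blog",
    "Occam's Razor analytics",
    "web analytics custom reports",
    "(not provided)",
]

labels = {keyword: classify_keyword(keyword) for keyword in sample_keywords}
for keyword, label in labels.items():
    print(f"{keyword!r}: {label}")
```

Keeping (not provided) as its own label matters here, otherwise a plain "exclude brand" rule would quietly fold it into the non-brand bucket.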

Here's what the non-brand numbers look like in the aforementioned custom report:

non-brand keyword performance

When you do this with your data you'll have a similar image, and you'll compare it to your (not provided) segment performance and your brand segment performance. In the comparison above it is clear that these three buckets are distinct, but that the performance of (not provided) is not as close to brand as it is to non-brand. That holds even though the (not provided) segment is small (4.6k visits) compared to non-brand (21.9k visits), which is worth keeping in mind when thinking about how averaging affects these metrics.

There are two likely scenarios in terms of what you'll find…

In your case (not provided) segment might match overall Google traffic or one of the above segments. In which case you continue business as usual with the assumption of an even distribution.

It is possible that (not provided) segment does not match overall Google traffic, or one of the above segments, in your case. In this case you understand a bit better how to treat it in your thinking (more keywords connected to your brand or non-brand segments). At the moment you can't take action based on this information (how do you react to visitors whose keyword you don't know at all?). But when presenting to your senior executives you can give them a bit more context.
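One rough way to decide which segment (not provided) resembles most is to treat each segment's metrics as a profile and measure the distance between profiles. The sketch below is an assumption-laden illustration, not the method used for the screenshots above: the brand and non-brand values are hypothetical placeholders (the real ones are only in the images), the (not provided) row uses figures quoted in this post, and the distance measure (Euclidean distance over relative differences) is just one reasonable choice.

```python
from math import sqrt

METRICS = ("new_visits_pct", "bounce_rate_pct", "conversion_rate_pct", "per_visit_goal_value")


def profile_distance(target: dict, candidate: dict) -> float:
    """Euclidean distance over relative differences, so metrics on larger scales don't dominate."""
    return sqrt(sum(((target[m] - candidate[m]) / candidate[m]) ** 2 for m in METRICS))


def closest_segment(target: dict, candidates: dict) -> str:
    """Name of the candidate segment whose profile is nearest to the target."""
    return min(candidates, key=lambda name: profile_distance(target, candidates[name]))


# (not provided) figures quoted earlier in this post.
not_provided = {
    "new_visits_pct": 65.06,
    "bounce_rate_pct": 66.0,
    "conversion_rate_pct": 2.37,
    "per_visit_goal_value": 1.27,
}

# HYPOTHETICAL segment profiles, purely for illustration; use your own report values.
segments = {
    "brand": {"new_visits_pct": 35.0, "bounce_rate_pct": 45.0,
              "conversion_rate_pct": 6.0, "per_visit_goal_value": 3.00},
    "non-brand": {"new_visits_pct": 70.0, "bounce_rate_pct": 72.0,
                  "conversion_rate_pct": 1.5, "per_visit_goal_value": 0.80},
}

for name, segment_profile in segments.items():
    print(f"Distance to {name}: {profile_distance(not_provided, segment_profile):.2f}")
print("Closest segment:", closest_segment(not_provided, segments))
```

Adding an "overall Google traffic" row to `segments` lets the same function tell you which of the two scenarios above you are in.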

It does not eliminate all the questions, but it does help me go from "I have no idea who all these people/keywords are" to "Okay looks like it might be my non-brand possibly long tail traffic."

Something of value, right?

All of the above is still kind of at an aggregate level. But we all have a lot of keyword level historical data. At some point we should have enough post change data that we can throw it all into a delightful regression model to fine tune our understanding at a keyword level.

At the moment we just know a little bit more than "here's my total (not provided)."

#4: Tentative conclusions. Why this seems so scary, but might not be (at least for now).

Most, but not all, of my branded traffic is my "head" traffic, i.e. traffic that results from a few keywords used by lots of visitors. After all your brand is unique to you and, for any type of website, drives loads of search traffic to you because you rank high in SERPs for those brand queries.

Most of my non-brand traffic is my "tail" traffic, i.e. traffic that results from a lot of keywords used by a few people each. For example you'll notice at the very start of this post that during this time period I had 27k search engine visits. Of these, my "tail" traffic comprised 21,921 visits. These delightful folks used 10,498 distinct non-branded key phrases to find my website.

10,498 distinct search queries drove 21,921 visits!

Remember the two scenarios I'd mentioned above? Let's look at one of them (performance closer to non-brand traffic) and understand what is happening a little more visually. What is happening when (not provided) shows up as your #1 keyword in your search keyword reports?

In my case above, closer to scenario #2 for me, the performance of (not provided) as shown by the metrics above looks more like that of the visitors who came via those 10,498 non-branded search key phrases.

Here's what's happening when (not provided) shows up #1 for me (clear in the screen shot in part #2 above), as explained by my head – tail illustration:

long tail slivers

The gray slivers above represent traffic that, prior to this change by Google, was spread across individual keywords and that became (not provided) after the change.

In the past only a small part, if any, of this traffic, for me, would ever show up in the top ten or twenty keywords in the report (head traffic). Because much of it was in the long tail I never noticed it (it is hard to look at all 10,498 key words individually! :).

But after the change by Google, these tiny, in the past invisible, slivers combined look like one scary beast. I've painfully combined every pixel of gray sliver above:

long tail not provided combined

OMG! I've lost a huge chunk of something that was a very important part of my traffic!!

Not really. It just looks scarier than it really is because tiny shavings of your other keywords (now used by logged in users who are opted into https sessions on google.com) appear in one big piece. Individual cells don't look that scary. But combined they look like Darth Vader himself. :)
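The sliver effect can be reproduced with a toy model. The sketch below uses an entirely hypothetical keyword report (three head keywords plus a long tail) and makes the simplifying assumption that the same share of every keyword's visits is hidden; the 18% figure echoes the share seen on this blog, and the function name `apply_not_provided` is mine.

```python
def apply_not_provided(keyword_visits: dict, hidden_share: float) -> dict:
    """Move hidden_share of every keyword's visits into one '(not provided)' bucket."""
    after = {kw: visits * (1 - hidden_share) for kw, visits in keyword_visits.items()}
    after["(not provided)"] = sum(keyword_visits.values()) * hidden_share
    return after


# HYPOTHETICAL keyword report: three "head" keywords plus a long tail of 5,000 keywords.
before = {"head keyword 1": 900, "head keyword 2": 600, "head keyword 3": 400}
before.update({f"tail keyword {i}": 2 for i in range(1, 5001)})

after = apply_not_provided(before, hidden_share=0.18)

# The biggest sliver shaved off any single keyword.
largest_sliver = max(before.values()) * 0.18
# The keyword that now sits in the first row of the report.
top_keyword = max(after, key=after.get)

print(f"Largest single sliver: {largest_sliver:.0f} visits")
print(f"(not provided) bucket: {after['(not provided)']:.0f} visits")
print(f"Top row in the report: {top_keyword}")
```

No individual keyword loses much, yet the combined bucket outranks every real keyword. If your data looks more like scenario two, pass a different share per keyword instead of one uniform value.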

Let me hasten to add that this does not mean that these "slivers" from user search queries are not important. Or that just because they are mostly non-branded traffic we should ignore them (I argue 100% contrary to that here: Monetize The Long Tail of Search). Or that you should not worry and that the sun is shining, there is no US debt problem, we have universal health care and Ashton and Demi are still together.

No. Not at all.

But the sky is not falling either.

We can use the actual data we have to keep a very close eye on this traffic and its performance. We can use advanced segmentation and custom reports to understand where this big scary block of traffic used to be. Is it (to repeat the scenarios we outlined at the end of part 3 above) closer to the average performance and hence possibly evenly distributed, or closer to non-brand and less evenly distributed?

We sadly still won't know what actual long tail or non-brand keywords or overall keywords they represent or how much of a particular keyword/phrase they used to be. But my POV is that we'll be in a better place.

You can be, if the data in your case justifies this, just a little less worried.

#5: Additional awesomeness: Landing page keyword referral analysis.

One final idea I had was to wonder if the (not provided) traffic enters the website at a disproportionate rate on some landing pages when compared to all other traffic from Google. If that is the case we could do pre post analysis on referring keywords to those landing pages and get additional clues.

It is not very hard to go check out that theory.

First, create an advanced segment for the (not provided) traffic:

not provided traffic segment

Then go and apply it to your standard Landing Pages report in Google Analytics (or SiteCatalyst or WebTrends or Yahoo! Web Analytics):

top landing pages report search

The analysis from here on is not very difficult (though in the new version of GA it is harder as the UI designers got rid of the % delta for comparative segments – what a shame). Just use our bff MS Excel.

For example 14% of the (not provided) traffic enters on the home page.

I was able to find a small clump of pages where the (not provided) traffic, at least currently, entered the site at a higher rate than overall Google traffic. I can see the referring keywords to those pages prior to the change and after the https change and attempt to identify which keywords might be contributing traffic to (not provided).
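If you export the Landing Pages report for both segments, the "enters at a higher rate" check is a few lines of code. The sketch below is illustrative only: the page paths and entrance counts are hypothetical, and the function names and the 1.25 threshold are my own assumptions rather than anything built into an analytics tool.

```python
def entrance_shares(entrances: dict) -> dict:
    """Convert entrance counts per landing page into shares of the segment total."""
    total = sum(entrances.values())
    return {page: count / total for page, count in entrances.items()}


def over_indexed_pages(segment: dict, baseline: dict, min_ratio: float = 1.25) -> list:
    """Pages whose share of segment entrances is at least min_ratio times the baseline share."""
    segment_share = entrance_shares(segment)
    baseline_share = entrance_shares(baseline)
    flagged = []
    for page, share in segment_share.items():
        base = baseline_share.get(page, 0.0)
        if base > 0 and share / base >= min_ratio:
            flagged.append((page, share / base))
    # Most over-indexed pages first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# HYPOTHETICAL entrance counts per landing page, purely for illustration.
all_google = {"/": 3000, "/post-a": 2000, "/post-b": 1000, "/post-c": 4000}
not_provided = {"/": 280, "/post-a": 600, "/post-b": 220, "/post-c": 900}

flagged = over_indexed_pages(not_provided, all_google)
for page, ratio in flagged:
    print(f"{page}: (not provided) enters at {ratio:.2f}x the overall Google rate")
```

The flagged pages are the ones worth a pre/post look at referring keywords.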

For me this analysis provided a better idea about some long tail non-brand keywords. But it was not as much as I would have liked to learn. Partly that is a function of the fact that those keywords are used by a handful of people and, this makes it worse, they are quite transient – they are not used too many times again.

But since everyone's site and visitor behavior would be different I did want to share this idea with you. It is not a hard bit of analysis to do, and you can let the data tell you something (or not).

That's it.

A simple five step process to go from reacting based on an aggregate number in your keyword reports to a much more nuanced (if imperfect) understanding based on your own data.

Caveats:

Before we go, a few important reminders that are spread throughout the post above but bear repeating….

* Perhaps the most important one is that your business might be nothing like my business. For example, you could have a lot more volatility in your search behavior (e.g.: your top ten search keywords look dramatically different every week/day), which would make my comparative analysis in part two moot.

Use the steps above, but with your own data, to arrive at unique conclusions.

* I'm comparing two weeks of data here, because that is all we have so far. I plan to revisit this analysis again in two more weeks, and then periodically to reaffirm my conclusions above or to burn them and start anew.

* We actually don't have any idea what keywords / key phrases comprise (not provided). We just have a better understanding of how that traffic performs.

* It is important to point out that Webmaster Tools and the AdWords Keyword Tool still have a lot of keyword-specific data related to your website. They don't have any (not provided) – mostly because their view is from Google and not from your website. Please use those two tools – both free – to understand keywords that cause your website to show up in Google SERPs, and queries that subsequently get clicks. Not exactly revealing 100% of what the (not provided) search queries might be, but something.

Anything else I should have here that I've forgotten?

I would love to know how you would go about doing this impossible analysis. What other path would you take in your web analytics tool? What segment, report, metric, walk on water effort would you undertake? Regarding my five step effort above… what flawed assumptions am I making? What would you change in terms of the approach/conclusions in any of the steps?

Was this nuanced understanding of what might be happening better than where you started?

Please share your alternative ideas (please!), critique of the above analysis, ideas for world peace via comments.

Thank you.

P.S: A request. This blog focuses on digital marketing and web analytics; it is not a policy blog. If you are up for it I would love for your comments to focus on the former and not the latter. If for no other reason than that my skills don't extend to the policy part and I would not be able to share anything of value with you.

I appreciate your consideration.

The post Smarter Data Analysis of Google's https (not provided) change: 5 Steps appeared first on Occam's Razor by Avinash Kaushik.

