March 11, 2019
One of our clients, who runs a large Danish portal site, came to us with a challenge: they wanted to know whether ML and AI could help them identify what characterizes “good” user journeys on the site, i.e. how a user travels through the site, which searches they perform, what they click along the way, etc.
How to Get More People to Click Away
With such knowledge, our client would be able to adjust and optimize their site. They could adapt the site to the actual behavior of users, improving the chances that users do what the client most wants: click on exit links and move on to partner websites. Each such click triggers a referral fee from the partner to our client, and these fees constitute our client’s primary revenue stream.
Using ML and AI in this way could give our client predictive insights into what to change so that users click on exit links even more often, and thereby potentially increase revenue considerably.
The User ID is Missing From Google Analytics
Before getting the predictive analyses up and running, there was a serious problem to solve: Google Analytics stores the data in a way that is nowhere near good enough for ML/AI purposes. The ML/AI methodology relies on algorithms traversing large volumes of detailed data to find patterns hidden from plain sight. That is an exercise Google Analytics can’t handle, at least not out of the box.
By default, Google Analytics only allows aggregate numbers to be extracted, which rules out single-user analysis. Although Google Analytics tracks everything that all users do, you can only pull out data based on the total number of users, or segments thereof, who have e.g. clicked on x, y or z. Data on the behavior of individual users is not accessible, because the user ID needed for this purpose is not included in the extract.
Custom Dimensions Implemented in Google Tag Manager
Although Google Analytics does not by default provide any kind of user ID in the data extract, there is still a way to extract behavioural data on an individual level. The user ID must be added using Google Analytics “custom dimensions”.
Custom dimensions are dimensions you set up yourself, in addition to the dimensions Google provides out of the box, such as device category, page name and traffic source, all of which are available in the standard version of the Analytics interface.
Normally, a custom dimension is used to collect and analyze data that Analytics does not automatically track, but custom dimensions can also be used more simply. Valtech did just that: we set up Client ID as a custom dimension, so that every call the Analytics server received was stamped with a unique user ID. This made it possible to dive into the details of what individual users had clicked on and what other actions they had performed on the site.
Adding a Client ID to the records was not quite enough, though. Google Analytics also lacks another important piece of information: a timestamp that shows when each call was sent. Therefore, Hit Timestamp was implemented as our custom dimension number two.
The two custom dimensions were implemented as tags through Google Tag Manager. The tag “stamps” every user and every action with our two new custom dimensions, making it possible to follow the user’s journey through the site. Strictly speaking, it could have been done without Tag Manager too, but Tag Manager makes the implementation much simpler and more straightforward.
Sampled Data Will Not Work
Before moving on, there was another stumbling block in Google Analytics to overcome. If there is a lot of traffic during the period to be analyzed, i.e. more than 500,000 sessions, Google Analytics uses so-called “sampled” numbers, which are only approximations and therefore pose a problem.
Sampled numbers mean that instead of giving all the data, Google Analytics makes estimates based on small but significant samples. This may be acceptable if you want to know how many users have seen which page on the site, but it is of no use when our purpose is precisely to analyze data on each individual user.
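Whether a given extract was sampled can also be checked in code. The following is a minimal sketch, assuming the data is pulled through the Google Analytics Reporting API v4, where a sampled report carries the fields `samplesReadCounts` and `samplingSpaceSizes` in its `data` object. The helper name `sampling_ratio` and the example responses are our own illustration, not real data.

```python
def sampling_ratio(report):
    """Return the fraction of sessions a report is based on, or None if unsampled.

    `report` is one entry of the "reports" list in a Reporting API v4
    response, already parsed into a dict.
    """
    data = report.get("data", {})
    read_counts = data.get("samplesReadCounts")
    space_sizes = data.get("samplingSpaceSizes")
    if not read_counts or not space_sizes:
        return None  # both fields are absent when the report is unsampled
    # The API returns the counts as strings, one entry per date range.
    return int(read_counts[0]) / int(space_sizes[0])


# Made-up responses for illustration only.
sampled = {"data": {"samplesReadCounts": ["500000"], "samplingSpaceSizes": ["2000000"]}}
unsampled = {"data": {"rowCount": 1200}}
```

A check like this can be run on every extract, so that a sampled result is rejected before it reaches the analysis.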
Download and Analyze Unsampled Data
Of course, the problem could be avoided by extracting data for a period short enough to contain no more than 500,000 sessions, then extracting data for the next short period, and so on until you have pulled out enough data to adequately feed your ML and AI algorithms. Doing this by hand is a lot of manual work and very inflexible.
Therefore, we disregarded Google’s standard features and instead coded our own extracts with all the data needed for each user: unsampled and complete with Client ID and Timestamp.
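Splitting a long period into short extraction windows is easy to automate. Here is a minimal sketch using only the Python standard library. The function name `split_date_range` and the window length are assumptions for illustration; the right window length depends on how much traffic the site has.

```python
from datetime import date, timedelta


def split_date_range(start, end, days_per_chunk=1):
    """Yield (chunk_start, chunk_end) pairs covering start..end inclusive."""
    if days_per_chunk < 1:
        raise ValueError("days_per_chunk must be at least 1")
    step = timedelta(days=days_per_chunk)
    one_day = timedelta(days=1)
    chunk_start = start
    while chunk_start <= end:
        # The last chunk is cut short so it never runs past `end`.
        chunk_end = min(chunk_start + step - one_day, end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + one_day


# One week split into windows of at most three days.
chunks = list(split_date_range(date(2019, 1, 1), date(2019, 1, 7), days_per_chunk=3))
```

Each pair can then be used as the start and end date of one extract.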
You can code such an extract in different programming languages, such as R or Python. We used Python, because with Python it was possible to extract data with the desired structure for a full day at a time. For larger websites, a full day of data can amount to far more than 1,000,000 rows. Additionally, Python was significantly faster than R in our case: in Python the extract job took only five to ten minutes to run, whereas in R the same extract took half an hour.
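To give an idea of what such a day-by-day extract can look like, here is a minimal sketch, assuming the Google Analytics Reporting API v4. It is not the code used in the project. The custom dimension slots `ga:dimension1` (Client ID) and `ga:dimension2` (Hit Timestamp) are assumptions and depend on how the property is configured, and `build_request` and `fetch_day` are hypothetical helper names. Since a single request returns at most 100,000 rows, the sketch follows `nextPageToken` until the day is complete.

```python
def build_request(view_id, day, page_token=None):
    """Build a Reporting API v4 request body for one day of hit-level rows."""
    request = {
        "viewId": view_id,
        "dateRanges": [{"startDate": day, "endDate": day}],
        "dimensions": [
            {"name": "ga:dimension1"},  # assumed slot for Client ID
            {"name": "ga:dimension2"},  # assumed slot for Hit Timestamp
            {"name": "ga:pagePath"},
        ],
        "metrics": [{"expression": "ga:hits"}],
        "samplingLevel": "LARGE",
        "pageSize": 100000,  # maximum number of rows per request
    }
    if page_token:
        request["pageToken"] = page_token
    return {"reportRequests": [request]}


def fetch_day(fetch, view_id, day):
    """Collect all rows for one day, following nextPageToken until exhausted.

    `fetch` is any callable that takes a request body and returns the
    parsed API response as a dict.
    """
    rows, page_token = [], None
    while True:
        report = fetch(build_request(view_id, day, page_token))["reports"][0]
        for row in report.get("data", {}).get("rows", []):
            rows.append(row["dimensions"])
        page_token = report.get("nextPageToken")
        if not page_token:
            return rows
```

With the `google-api-python-client` package, `fetch` could for example be `lambda body: analytics.reports().batchGet(body=body).execute()`, where `analytics` is an authorized `analyticsreporting` v4 service object. Passing the call in as a parameter keeps the paging logic testable without network access.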
With the individual-level data extracted and enriched with user ID and timestamp, the actual analysis could proceed, using R to apply ML and AI to find patterns in the now nicely formatted data.
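As an illustration of what the two custom dimensions make possible, the sketch below groups extracted rows into one ordered journey per user. It is written in Python for illustration only and is not the R code used in the project. The function name `build_journeys`, the row layout and the example rows are assumptions, and the timestamps are assumed to be ISO 8601 strings in a single time zone, which sort chronologically as plain text.

```python
from collections import defaultdict


def build_journeys(rows):
    """Group (client_id, hit_timestamp, page_path) rows into ordered journeys."""
    journeys = defaultdict(list)
    for client_id, hit_timestamp, page_path in rows:
        journeys[client_id].append((hit_timestamp, page_path))
    # Sort each user's hits chronologically and keep only the page paths.
    return {
        client_id: [page for _, page in sorted(hits)]
        for client_id, hits in journeys.items()
    }


# Made-up rows for illustration only.
rows = [
    ("111.222", "2019-03-01T10:00:05", "/search"),
    ("333.444", "2019-03-01T10:00:07", "/"),
    ("111.222", "2019-03-01T10:00:01", "/"),
    ("111.222", "2019-03-01T10:00:09", "/exit/partner"),
]
journeys = build_journeys(rows)
```

Without the Client ID the rows could not be assigned to a user, and without the Hit Timestamp they could not be put in order.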