Extract Raw Data from Adobe Analytics to use in ML and AI

Troels Moltsen

Senior Consultant at Valtech

Troels.Molsten@valtech.com

Andreas Petersson

Director Analytics & Optimization

andreas.petersson@valtech.com

agosto 28, 2019

When trying to run machine learning (ML) algorithms on Adobe Analytics data or even combining this data with other source data, the data from Adobe is not nearly fine grained enough by default. With a little Adobe-massaging, a lot can be done. Here’s a guide of how to do so.

The other day a client brought forward an issue. The page load speed on the website was at an unacceptable rate and they feared how it would impact visitor’s user journeys and ultimately sales. To help them in the next steps, we looked to their Adobe Analytics installation, which stored all activity by all users on the website. Surely, this data would be able to tell us which users actually experienced the delayed page load, and how page load time affected the conversion rates of individual users. In particular, we wanted to use ML and AI to analyze the huge amount of data, the Adobe system holds (I had my eyes set on open-source XGBoost for this job).

Adobe-Data are Not Fine Grained Enough by Default

To pull tricks like these, you need data to be as fine grained as possible – at least on an individual user-level. Even though the Adobe Analytics Workspace UI is versatile, the Adobe-data is always aggregated, summing up the behavior of many different users to deliver metrics as page views, bounce or conversion rates. Consequently, the UI is not well equipped to analyze data from each individual users’ perspective.

Data Warehouse to The Rescue

Fortunately, Adobe lets you extract the huge amounts of data through two separate approaches.

The first approach is the Data Warehouse. Here you can select from all breakdowns (dimensions), metrics and segments for any predefined data range. This data is already preprocessed and aggregated by Adobe.
The second approach is to use Data Feeds. Here you get partially processed data that has been sent to Adobe. Compared to the Data Warehouse, this is highly granular hit level data.

I went for the first method, the Data Warehouse approach.

Pulling the Data Warehouse Data from The Adobe Analytics API Using R and RsiteCatalyst

The best way to pull data out of the Adobe Data Warehouse is by requesting the export through the Adobe Analytics API using the statistical programming software, R. It’s fast and it provides highly granular, high volume datasets. However, you do need some coding skills to proceed this way. I prefer to pull data out with the RSiteCatalyst package as this enables you to request large amounts data without dealing with JSON (if you don’t mind JSON, though, feel free to access the Adobe Analytics API via JSON with e.g. Postman). R can be installed on Linux, Windows and Mac OS X from cran.r-project.org. If you want to follow my lead with R, I recommend that you also download RStudio from rstudio.com as it provides a user-friendly integrated development environment. Having installed R and RStudio you need to ensure that your user account has Web Services Access. Afterwards you can leverage the power of RSiteCatalyst by identifying your username and secret within Adobe Analytics:

Go to Admin. From the default landing page in Adobe Analytics click through to the “Admin” section.
Go to “Analytics Users & Assets”. From the “Admin” section, click through to the “Analytics Users & Assets” section.
Locate yourself. Use the search field in order to locate yourself.
Connect API. Once you click on your “USER ID” this will open the details associated with your user account. Under the “Web Service Credentials” header you will find the necessary credentials to connect to the API. The “User Name” should be your email followed by the company name whereas the “Shared Secret” is a 32 character string.

Now that you have located your credentials you can connect to the API with R using the below script which installs and loads RSiteCatalyst in R:

Request report suites data frame. Considering that a connection with the API now has been established you can start querying data. First, request a data frame that contains the report suites you want to extract data associated with.
Store as vector. You are now able to open a data frame that contains the report suite ID under the “rsid” header. The report suite you want to export data from can be stored as a character vector.
Request elements, metrics and segment data frames. Since you now have defined which report suite you will be extracting data from, a sensible next step is to request data frames that contain all relevant elements (dimensions), metrics, segments, props and eVars.
Append IDs. Importantly, both the Analytics Visitor Id and the Experience Cloud ID is not part of the elements data frame. Consequently, we need to append these dimensions to the elements data frame.
Specify headers. We can now access the data frames and specify which items we want to use for our report. Note that the items must be referenced by their value in the “id” headers. Below we specify these headers which will be used to create a request for the number of page views and average time spent on page broken out by visitor id, page name and device.
Make names correspond. As mentioned, the “id” headers do not always have meaningful names. For instance, “evar1” represents the page name. Fortunately, the “id” header also has a corresponding “name” header. Having completed the previous steps, a reference data frame containing the corresponding names can be created with the below snippet.

The above will also be beneficial when creating meaningful headers for the export.

Export data. Having completed the preceding steps, we are now ready to export data with the “QueueDataWarehouse” function. In the below export we input nine arguments into the function:
1. id– report suite id stored in the character vector.
2. from – start date for the report (YYYY-MM-DD).
3. to– end date for the report (YYYY-MM-DD).
4. metrics– metrics specified in the “used_metrics” object.
5. elements– elements specified in the “used_elements” object.
6. granularity– time granularity of the report (year/month/week/day/hour), the default is “day”.
7. seconds– how long to wait between attempts.
8. attempts– number of API attempts before stopping.
9. enqueueOnly– only enqueue the report, don’t get the data. Returns report id, which you can later use to get the data.

By default, the function will run for ten minutes before it stops (120 attempts separated by 5 second pauses). In my experience, these defaults need to be adjusted upward in order to complete requests for larger exports.

Furthermore, it is also possible to simply enqueue the report without actually receiving the data by setting “enqueueOnly” equal to true. When the following snippet is run, a request will be made for a report with the predefined metrics and elements with an upward adjusted number of attempts and pauses:

Make headers meaningful. Once you have successfully retrieved the data the below snippet can be used to map the meaningful header names to the export data frame. Note that “datetime” will always be the first column.
Excel it. Finally, if you would like to work with the data in excel format, R offers an easy way to export data frames as .csv files.

Of course, this was only an example of what kind of data you could potentially export. Actually, you could export data with a lot of other metrics and elements and transform the data as you see fit.

Sending the Data Warehouse Request to Your Email As a .CSV File

If you’d rather not embark on a journey into R-programming, there’s another way to get hold of the data. All you need to do is go into the Adobe Analytics interface, specify the report details and enter your email address in order to receive the desired data. If the data volume you want to extract can be held under 10 MB and if you’re not in a rush to get the data (it’s sent through email and can be several hours underway) this method might be just fine for you.

Log into Adobe Analytics…
Hover over the “Tools” header and click on “Data Warehouse”.
Specify the “Request Name”. This is done in order for you to locate your request in the “Request Manager” afterwards.
Click “Request this Report” to start scheduling the report.

Now all you’ll have to do is to wait for the report to arrive at your inbox (which, as noted, might not happen instantaneously).

Happy Truth Hunting

I used the R method to get hold of the client’s data and helped them with looking at how much pain the page load caused their visitors – and their business. I could have used the latter method as well, but the data size cap would have caused us trouble – and the timing issues as well.

If you’re planning to do something similar yourself, either of the two methods would help you towards your goals – and once you have gone through the above steps, you’ll end up with a nice, fine grained data extract from your Adobe Analytics. A data extract, which is just waiting for you to apply your own advanced analysis, find interesting patterns in the data and predict future actions for each user on your website. Good luck hunting for your “truths”.