Case study: How to daily monitor sitemap stats of multiple websites

Trang Nguyen Ngoc
8 min readJul 7, 2021

--

While there are many tutorials on how to solve different individual technical questions, for example how to write a particular formula, how to remove duplicates with Google Apps Script, and so on; I find it quite rare to find a complete guideline of how you would approach a business task as a whole. Of course, as every company is different — every job is different, a guideline for solving a task might be less applicable. But I personally think it’s the way how you approach a business problem that matters the most, and that’s what you could learn by seeing how others solve similar problems.

As such, for the case-study series, I would like to share my approach to the tasks I encounter, and would be happy to discuss how you would approach the same problem.

Case Overview

  • A sitemap is a file that contains a list of important URLs to be submitted to Google. On the first day of each month, you have an automated system that generates the sitemap for 15 different websites, each website operates in its own country (e.g domain.com.au in Australia, domain.de in Germany, domain.co.uk in the UK, and so on)
  • Then, every day until the first day of the next month, your system would take a certain number of URLs out of the sitemap if the URLs do not meet pre-defined criteria.
  • You have separate databases to store each website’s data (e.g all the data related to domain.com.au is saved in the database named Sitemap_AU). Each database has the same list of tables, store procedures, and functions.
  • The sitemap that is generated at the beginning of the month is saved in Sitemap_Live table, with 3 columns: UrlHash and BaseUrl, RemovedFromLive (1 if the URL is removed after the first day of the month, 0 if it is not).
  • The traffic data is saved in Sitemap_Traffic table, with 3 columns: UrlHash, DateOfTraffic, TotalTraffic.
  • The country name is saved in Country table with 1 column: Value
  • The number of URLs that were removed, together with their traffic, are saved in Sitemap_EventStatistics table, with 3 columns: Date, NumberOfUrlsRemoved, PageViewsOfRemovedUrls.
  • All those databases are in the same SQL Server.

The task

Your manager is worried that if something goes wrong, too many URLs or too much traffic can be cut down from the sitemap. He wants to have a spreadsheet that provides an overview of how much traffic is removed every day after the sitemap is modified.

The problems

To fulfill the task above, the following problems need to be solved:

  • What metrics to focus on?
  • How to get the data I need and in which structure should it be?
  • How can the data be updated in the spreadsheet every day?
  • How to visualize the data?

Below I am going to explain my thought process for solving each of the problems above.

The solutions

What metrics to focus on?

At the end of the day, the manager’s goal is to ensure that not too much traffic is removed. But how much is too much? 100? 1000? or 10,000? Obviously, it depends on the original amount of traffic the sitemap has. So, the relative comparison is important.

At the same time, the absolute number is important too. 5% of 1000 page views is different from 5% of 1 million page views. Thus the relative traffic should be accompanied by the absolute traffic.

How much traffic was already removed up to a certain day is also vital. The system might remove only 0.5% of the traffic each day, but after 30 days, it actually removes 15% of the traffic already, which is a lot more significant than what the daily number shows.

Therefore, I would focus on the following metrics:

  • the amount of traffic removed
  • % of traffic removed (compared to the original traffic)
  • the cumulative sum of % of traffic removed (compared to the original traffic)

Note: From the SEO point of view, of course, not only those traffic metrics are important but the metrics regarding the number of URLs in the sitemap are also important. However, to narrow down the scope of this document I only listed the traffic metrics here.

How to get the data and in which structure?

With the above-defined metrics, the following data is needed:

  • The traffic of the URLs in the sitemap that is generated on the first day of the month (taken from Sitemap_Traffic table)
  • The amount of traffic of URLs that are removed each day (taken from Sitemap_EventStatistics table)

I need to design my data in a way that makes it possible to:

  • apply the same query to all 15 databases
  • differentiate the countries so that the overview for each country can be made out of the data

As such, I think having the data in a long-form would be the best. I would design my query such that the results will be presented in 4 columns: Country, Metric, Date, and Value.

The SQL query is as follow:

There are two things that concern me in this step:

  • Should I calculate the percentage of pageviews removed by SQL or do that by Spreadsheet instead? — Here I chose to do it with Spreadsheet as I will need to process data for my visualization anyway.
  • As the data will be taken daily, should the above query take only today’s data or should it just take data of all days? If data of all days is taken there will be duplicated data in the spreadsheet, while if only the daily data is taken then the data is only collected from when the script is scheduled. I decided to take all data just to be safe, and after that remove the duplicates by using another script later.

How can the data be updated in the spreadsheet every day?

  • One important required feature of this spreadsheet is that the data needs to be updated daily. To do this effectively, I chose to connect to MS SQL Server to Google Sheets by Google Apps Script, which I already have a separate post introducing how to do it.
  • Combining the SQL query with the code to connect to MS SQL Server, the full script for this Spreadsheet is as follow:

How to visualize the data?

To decide what visualization is needed, we need to look again at the goal of the task: to have a spreadsheet that provides an overview of how much traffic is removed every day after the sitemap is generated.

As such, what would be useful for my manager include:

  • An overview sheet for all countries’ data today. With some conditional formatting, this sheet would allow my manager to be able to grasp very quickly if there is anything going wrong with the sitemap modification. He can also compare one country to another to spot any unusual behavior.
  • An overview sheet for each country with daily data, from the first day of the month to the last day of the month. It’s always helpful to have historical data to compare. For example, from the all countries sheet, my manager sees that the traffic is down by 5% today, which is critical. Then he can check the country sheet to see if this is a gradual change in traffic or a sudden change, and subsequentially decide if we need to dig deeper into the problem or not.

As such, I would push the SQL data to a “Raw Data” sheet, and from there create the country sheets and the “All countries Overview” sheet. Here is an example of my spreadsheet.

The country sheets

Example of how the country sheet looks like
  • The country sheet takes the data from the “Raw Data” sheet, where the rows are the metrics with the exact same name as the name of the metrics in the source sheet.
  • The columns are the date, starting from the first day of the current month. The dates are added automatically from the Raw Data, as long as the date is bigger than the starting day of the month.
  • To get the data for the Traffic of Original Sitemap and Traffic of removed URLs, I use the SUMIF function. Basically, it SUMs the Value column of the Raw Data sheet if the Country column= country of this sheet (cell A1) AND the Metric column = the metric of interest AND the Date column = the column header. SUMIFS can be used too if you calculate the data for a fixed list of Date, but as my Date is running, I use SUMIF in order to be able to apply ARRAYFORMULA.
  • The get the Cumulative sum of % of Traffic of Removed URLs, I get the cumulative sum of Traffic of Removed URLs divided by the Traffic of Original Sitemap (or you can just take the sum of % Traffic of removed URLs). In order to calculate cumulative sum across columns, use:
=ArrayFormula(If(len(B6:6),(SUMIF(COLUMN(B6:6),"<="&COLUMN(B6:6),B6:6)),))

This formula is adapted from the instruction here, which does a sum of a range if the column of a cell in the range is smaller or equal to the column of the current cell.

  • To create a sheet for a new country, I duplicate the current sheet and just change the value of cell A1 to the country of interest. To be sure that the country that I chose has data to display, I added data validation to the cell options: the country can only be one of the countries that appear in the “Raw Data” sheet.

The “All countries Today Overview” sheet

Example of All countries Today Overview sheet
  • This sheet focuses on only Today’s data, with the metrics of interest as rows and the countries as columns.
  • To get the data for each of the cells, I use VLOOKUP to look up the metric of interest in the correspondent country sheet. The formula is as follow:
=Vlookup($A4, indirect(C$3&"!A:AM"), match($C$1, indirect(C$3&"!4:4"), 0), false)

To break the formula down:

  • A4 is the metric of interest. In order to look up successfully, make sure that the name of the metrics here is exactly the same as the name of the metrics in the country sheet.
  • INDIRECT(C$3&"!A:AM") refers to the range to look up. INDIRECT is a formula that I love, as it allows me to automate all of my formulas to all the countries. As such, remember to name your country sheets in a consistent way, so that you can use INDIRECT to construct your country sheet name. (The only thing I don’t like about INDIRECT, it doesn’t let me use ARRAYFORMULA! :( However in this case it is good enough as my list of countries is limited and I can just Ctrl+R to apply the same formula for all columns).
  • MATCH($C$1, INDIRECT(C$3&"!4:4"), 0) is used to find the column of results that should be returned. Here I need the data of “Today”, which is equivalent to the value of cell $C$1, thus I use MATCH to find out which column my data of cell C1 is.

And that’s it! The task is done! Remember to triple-check your data before sharing it with your manager :D

Conclusion

Huff! It is a long post, thank you for reading if you ever reach it here! While many big companies nowadays prefer sophisticated data visualization tools such as Tableau and PowerBI, many small and medium companies still prefer to use Google Sheets. And sometimes, like this case, I find tables — the simplest data visualization — work better than any other graphs or pictures. Sometimes simple is the best!

I hope that I bring an interesting task for you to take a look at, and if any thought of another approach reaches you when you read this post, please let me know, I would be happy to discuss your suggestions!

Thank you for reading ❤

--

--

Trang Nguyen Ngoc
Trang Nguyen Ngoc

Written by Trang Nguyen Ngoc

I document my data analysis learning journey to be better together <3

No responses yet