Error logging puzzle for DevOps: Cloudflare vs NEL

Alex Vorona

May 31, 2023

Everybody knows that it’s a good sign when your DevOps engineer is relaxed, calm, and confident. It means the project is under control and everything works as it should. When they look concerned, everyone else has a reason to worry, especially if the troublemaker is the project’s error logging.

Today I want to share the story of an enormous gap between the Cloudflare and NEL analytics data that we discovered on one of our biggest projects (a blockchain infrastructure handling over 800 million requests per day). Don’t worry, it has a happy ending.

Here’s what happened.

20 million 408 errors

In my daily routine as a Senior DevOps and Tech Lead, there’s nothing special. My work should be boring, but sometimes I stumble upon disruptions and imperfections that stimulate me to hunt down the problem and solve it. Yes, sometimes before you solve a problem, you have to find it.

Browsing the project status reports in Cloudflare, I saw a staggering number of 408 errors on the website: around 20 million in the last 24 hours. I got curious about that.

That was far too many: if even 1/10,000 of those 408 errors had really occurred, our infrastructure would have been piping hot with other kinds of failures. Since we get this data not directly from the website but through Cloudflare, I decided to take a closer look. The questions I had to answer: should I worry about Cloudflare’s numbers if everything is fine on the infrastructure side? How can I get truly reliable stats on our errors? Is there a better tool to track them?

Spoiler: of course there was a better instrument for error tracking, because Cloudflare’s “errors” were not what they appeared to be. And I’ll prove it to you now.

The logic of Cloudflare error tracking

As you might know, Cloudflare Analytics is a web analytics platform that provides real-time insights into traffic, page views, and visitor behavior. It collects data using a variety of methods, including JavaScript tags, web beacons, and server logs. Cloudflare Analytics provides a wealth of information on website performance, including page load times, bounce rates, and conversion rates. It also allows users to drill down into specific metrics to get a more detailed view of website performance. Typically, that’s enough for the website owner to have a general picture of what’s happening.

To collect this data, Cloudflare uses a variety of methods, including real user monitoring (RUM), synthetic monitoring, and server-side monitoring. RUM involves tracking user interactions with a website in real time, using JavaScript tags or browser plugins. Synthetic monitoring involves simulating user interactions with a website from different locations and devices. Server-side monitoring involves tracking server-side errors and performance metrics, such as response times and server uptime. These methods work with both aggregated, processed data and raw events.

Why Cloudflare registered such an enormous number of 408 errors

A 408 (Request Timeout) response means the server gave up waiting for the client to finish sending its request. Such timeouts are typically caused by slow or unstable network connections and by idle connections that never complete a request, and they can be made worse by server overload or poorly optimized code. In our case, we had optimized the infrastructure of this project earlier, so the main causes on our side were ruled out.
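To make the 408 semantics concrete, here is a minimal sketch in Python of a server-side read loop that answers 408 when a client starts a request but never finishes it. It illustrates the status code only and is not a description of how Cloudflare or our infrastructure is implemented; the function name `read_request_head` and the 0.2-second timeout are assumptions chosen for the demo.

```python
import socket


def read_request_head(conn: socket.socket, timeout_s: float) -> bytes:
    """Read until the end of the HTTP headers; answer 408 if the client stalls."""
    conn.settimeout(timeout_s)
    data = b""
    try:
        while b"\r\n\r\n" not in data:
            chunk = conn.recv(4096)
            if not chunk:  # the client closed the connection early
                break
            data += chunk
    except socket.timeout:
        # The client did not finish sending its request in time.
        conn.sendall(b"HTTP/1.1 408 Request Timeout\r\nConnection: close\r\n\r\n")
    return data


# Demo: a "client" sends only part of a request and then goes quiet.
server_side, client_side = socket.socketpair()
client_side.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n")  # no final blank line
received = read_request_head(server_side, timeout_s=0.2)
reply = client_side.recv(1024)
print(reply.split(b"\r\n")[0].decode())
server_side.close()
client_side.close()
```

The point of the sketch is that a 408 can be produced by a perfectly healthy server: the trigger is a client that is slow or silent, which is why a large 408 count does not necessarily mean the backend is struggling.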

Reason #1: Sensitivity to a network timeout

One possible explanation for why Cloudflare reports more 408 errors is that it is more sensitive to network timeouts than other logging systems. It uses a global network of servers and data centers to collect data from a wide range of locations and devices, and this system may be more likely to detect network timeouts that occur in remote or poorly connected locations. Other logging systems may not be as sensitive to these types of errors, or they may be configured to ignore them.

Reason #2: Special filtering approach

Another possible explanation is that Cloudflare is reporting all 408 errors, while other logging systems may be filtering out some of these errors. For example, some logging systems may be configured to exclude 408 errors that occur during periods of high traffic or server overload, as these errors may be considered "expected" or "normal" under these circumstances. Cloudflare, on the other hand, may report all 408 errors, regardless of the context in which they occur.

Reason #3: Operational conditions and environment state

It's also worth noting that the number of reported 408 errors can be affected by factors such as website traffic, server load, and network conditions. For example, a website that experiences high levels of traffic may be more likely to experience 408 errors, as the server may be overloaded or unable to keep up with incoming requests. Similarly, a website that is hosted on a slow or poorly connected server may be more likely to experience network timeouts and 408 errors.

Any of these reasons might hide the truth from us. So we decided to look for another error-tracking solution and picked the most reliable one available at the time.

Solution revealed: A short NEL guide

Network Error Logging (NEL) is a relatively new HTTP header-based mechanism for logging and analyzing network errors that occur on websites. NEL can be used to monitor network errors such as failed requests, DNS resolution failures, and connection timeouts, and to quickly diagnose and fix problems that can lead to poor user experience and loss of revenue. It’s an easy-to-deploy, open, and reliable tool, and each of those traits suits my DevOps taste.

NEL is defined in a W3C specification that describes how web browsers should handle the NEL header. The NEL header configures a client (browser) to send network error reports to the endpoint defined in the Report-To header. It's like a link to a feedback form where any user can leave some thoughts about the site, except that it's intended for automatic reports.

Projects can benefit from NEL in a number of ways. By monitoring network errors, developers can quickly identify and fix issues that can lead to poor user experience or lost revenue. They can also use NEL data to identify trends in error occurrences and optimize their websites or applications to reduce the frequency of errors.

Benefit #1: Both client-side and server-side errors

One of the main benefits of NEL is that it enables web developers to collect more comprehensive data on network errors. Traditionally, developers relied on client-side error logging or server-side logs to diagnose network errors, but these methods often failed to capture all errors. With NEL, developers can see errors that occur on the client side and the server side, and they can use this information to identify patterns and trends in error occurrences.

Benefit #2: Standard format

Another benefit of NEL is that it provides a standard format for logging network errors, which makes it easier for developers to analyze and compare data from different sources. This can be particularly useful for large projects with many developers or for companies with multiple websites or applications.
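Because every report has the same shape, grouping errors takes only a few lines. The sketch below parses a batch of reports and counts them by error type and status code. The sample batch is hypothetical data written for this example (not taken from our project), and it uses the field names from the NEL specification (`type`, `url`, `body.type`, `body.phase`, `body.status_code`); the helper `count_by_error_type` is an illustrative name.

```python
import json
from collections import Counter

# Hypothetical sample batch, shaped like the reports a browser would send.
SAMPLE_BATCH = """
[
  {"age": 120, "type": "network-error", "url": "https://example.com/api",
   "body": {"type": "tcp.timed_out", "phase": "connection", "status_code": 0,
            "elapsed_time": 30000, "sampling_fraction": 1.0}},
  {"age": 45, "type": "network-error", "url": "https://example.com/",
   "body": {"type": "http.error", "phase": "application", "status_code": 408,
            "elapsed_time": 823, "sampling_fraction": 1.0}},
  {"age": 10, "type": "network-error", "url": "https://example.com/api",
   "body": {"type": "tcp.timed_out", "phase": "connection", "status_code": 0,
            "elapsed_time": 30000, "sampling_fraction": 1.0}}
]
"""


def count_by_error_type(raw_batch: str) -> Counter:
    """Count network-error reports by (error type, HTTP status code)."""
    counts = Counter()
    for report in json.loads(raw_batch):
        if report.get("type") != "network-error":
            continue  # the same endpoint may receive other report types
        body = report.get("body", {})
        counts[(body.get("type", "unknown"), body.get("status_code", 0))] += 1
    return counts


counts = count_by_error_type(SAMPLE_BATCH)
for (error_type, status), n in counts.most_common():
    print(f"{error_type} (status {status}): {n}")
```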

Benefit #3: Simple implementation

To use NEL, web developers need to include the NEL header in their HTTP responses. The NEL header names a reporting group, and the accompanying Report-To header lists the URL of the report endpoint for that group, where the browser sends a JSON object describing each network error. Developers can then use this information to analyze and diagnose network errors.
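As an illustration, the pair of response headers can be generated like this. This is a minimal sketch, not our production configuration: the endpoint URL `https://reports.example.com/nel`, the group name `network-errors`, and the 30-day `max_age` are placeholder assumptions, while the field names (`group`, `max_age`, `endpoints`, `report_to`, `failure_fraction`) come from the NEL and Reporting API specifications.

```python
import json


def build_nel_headers(
    report_url: str,
    group: str = "network-errors",   # assumed group name, any label works
    max_age_s: int = 2_592_000,      # 30 days, an assumed policy lifetime
    failure_fraction: float = 1.0,   # report every failed request
) -> dict:
    """Return the Report-To and NEL response headers as JSON strings."""
    report_to = {
        "group": group,
        "max_age": max_age_s,
        "endpoints": [{"url": report_url}],
    }
    nel = {
        "report_to": group,          # must match the Report-To group name
        "max_age": max_age_s,
        "failure_fraction": failure_fraction,
    }
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


headers = build_nel_headers("https://reports.example.com/nel")
for name, value in headers.items():
    print(f"{name}: {value}")
```

The two headers are linked only by the group name, so a mismatch between `report_to` and `group` would leave the browser with nowhere to send reports. Browser support for NEL varies, so it is worth checking which of your visitors' browsers will actually report.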

One of the challenges of using NEL is that it requires a backend system to process and store the error reports. Developers need to set up a report endpoint that can handle and store the error reports, and they need to have a system in place to analyze and act on the data.
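For a sense of scale, a report endpoint can start out very small. The sketch below uses only the Python standard library and keeps reports in memory; it is an assumption-laden toy, not the collector we run in production. The names `STORED_REPORTS`, `record_reports`, `NelHandler`, and `serve`, as well as port 8080, are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory storage for the demo; a real setup would write to a database
# or a metrics store that a dashboard can query.
STORED_REPORTS = []


def record_reports(payload: bytes) -> int:
    """Store every network-error report from a JSON batch; return the count."""
    stored = 0
    for report in json.loads(payload):
        if report.get("type") == "network-error":
            STORED_REPORTS.append(report)
            stored += 1
    return stored


class NelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            record_reports(self.rfile.read(length))
            self.send_response(204)  # accepted, nothing to return
        except (ValueError, TypeError, AttributeError):
            self.send_response(400)  # malformed or unexpected payload
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the collector; blocks until interrupted."""
    HTTPServer(("", port), NelHandler).serve_forever()


# serve()  # uncomment to start listening on the assumed port
```

From there, the stored reports can be exported to whatever analysis and alerting stack the team already uses.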

By using NEL, you can quickly diagnose and fix issues that can lead to poor user experience and lost revenue and optimize your projects to reduce the frequency of errors over time. And in our case, we needed NEL to contrast the picture of 408 errors given by Cloudflare and find a single truth to rely on.

So what does NEL say?

After we implemented NEL, we got a much more realistic picture of what was really going on. We also presented this solution to the client and got another round of applause.

The NEL dashboard in Grafana showed that everything was calm and neat, with around 400 errors, a negligible number for an 800-million-requests-per-day infrastructure.
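To put the two pictures side by side, here is the arithmetic as a short Python sketch. The three input numbers are the ones quoted in this article; treating the roughly 400 NEL errors as a per-day figure and 800 million as the exact daily request count are simplifying assumptions.

```python
REQUESTS_PER_DAY = 800_000_000        # "over 800 million requests per day"
CLOUDFLARE_408_PER_DAY = 20_000_000   # 408s shown by Cloudflare for 24 hours
NEL_ERRORS_PER_DAY = 400              # assumed daily figure from the NEL dashboard


def error_rate_percent(errors: int, requests: int) -> float:
    """Share of requests that ended in an error, in percent."""
    return errors * 100 / requests


cloudflare_rate = error_rate_percent(CLOUDFLARE_408_PER_DAY, REQUESTS_PER_DAY)
nel_rate = error_rate_percent(NEL_ERRORS_PER_DAY, REQUESTS_PER_DAY)
ratio = CLOUDFLARE_408_PER_DAY / NEL_ERRORS_PER_DAY

print(f"Cloudflare: {cloudflare_rate:.2f}% of requests")
print(f"NEL:        {nel_rate:.5f}% of requests")
print(f"Cloudflare reports about {ratio:,.0f} times more errors")
```

Under these assumptions, Cloudflare's figure corresponds to about 2.5% of all requests, while the NEL figure is about 0.00005%, a gap of roughly 50,000 times.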

*Recent numbers of errors on the project. Source: NEL monitoring in Grafana*

The last two days showed us error numbers that we can trust, because we get them directly from the website's visitors and not through a third-party provider.

Calm and Curious: The motto of the best DevOps engineers

This little story shows the importance of understanding what our instruments are for rather than using them blindly. Another lesson is that the solution to even the most alarming problem might be a simple tool lying in plain sight, not some kind of Holy Grail.

We appreciate NEL for its quick start, explainability, and reliability. These are the features we like in all our solutions. This is the true path to efficient tracking and analytics: avoid black-box, closed third-party solutions and stick to transparent and open instruments.

And you’ll be as calm as a professional DevOps engineer.

Alex Vorona

DevOps Lead

Sharing the in-house secrets of DevOps mastery originated from Dysnix.