Over the last few years, we have completely refactored the setup described in our previous article about how we use the ELK stack to get an overview of our test automation results, but some core concepts remain valid and applicable.
In 2020 we started to migrate one of our most significant workloads, our Node.js-based GraphQL API and many of its microservices, from our datacenter to Google Kubernetes Engine. We deploy it in three GCP regions, each with its own Kubernetes cluster. Since then, our monitoring infrastructure has changed due to various periods of instability and pandemic-induced scaling challenges.
Metrics are one of the main building blocks of observability.
Hence, we have a lot of metrics within our applications, and especially for the connections between our applications. Every outgoing request has its latency measured, and we also record the sizes of the request and the response. These numbers are collected in histograms, and based on that data we build Grafana graphs that show us, for example, the median size of request and response payloads or the 99th percentile of call durations.
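To make the histogram idea concrete, here is a minimal sketch, assuming a Node.js service instrumented with the prom-client library; the metric names, labels, bucket boundaries and the `instrumentedFetch` helper are invented for illustration and are not our production code.

```typescript
// Sketch only: record outgoing-request latency and response sizes in
// Prometheus histograms with prom-client. Names and buckets are assumptions.
import { Histogram } from "prom-client";

const requestDuration = new Histogram({
  name: "outgoing_request_duration_seconds",
  help: "Latency of outgoing HTTP requests",
  labelNames: ["target"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

const responseSize = new Histogram({
  name: "outgoing_response_size_bytes",
  help: "Size of responses to outgoing HTTP requests",
  labelNames: ["target"],
  buckets: [256, 1024, 4096, 16384, 65536, 262144],
});

// Assumes Node 18+, where fetch is available globally.
export async function instrumentedFetch(target: string, url: string): Promise<string> {
  const stopTimer = requestDuration.startTimer({ target });
  const res = await fetch(url);
  const body = await res.text();
  stopTimer(); // observes the elapsed seconds into the latency histogram
  responseSize.observe({ target }, Buffer.byteLength(body));
  return body;
}

// Exposing the prom-client registry on an HTTP endpoint lets Prometheus
// scrape these histograms; Grafana can then plot medians and p99s from them.
```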
At trivago, we generate a huge amount of logs and we have our own custom setup for shipping logs using mostly Protocol Buffers. Eventually we end up with some fields in Elasticsearch (ES) that contain partial (or full) URLs. For instance, in our specific case we store the query component of the URL in a field called query and the path component in a field named url_path. Sample values for these fields could be:
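For illustration only, hypothetical values (not the actual examples from the article) could look like `url_path: /hotel/berlin-12345` and `query: search=barcelona&adults=2`.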
tl;dr: continuously monitor your CDN and origin servers on layer 3 with tools like MTR. Layer 3 issues on external middleware can have a significant impact on layer 7 web performance.
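As a rough illustration of what "continuously monitor on layer 3" can look like, here is a small TypeScript sketch, assuming mtr is installed on the host; the target hostnames and interval are made up, and mtr flags can differ between versions.

```typescript
// Sketch only: periodically run `mtr` in report mode against a host and log
// the summary. Hostnames and the schedule are purely illustrative.
import { execFile } from "node:child_process";

const HOSTS = ["cdn.example.com", "origin.example.com"]; // hypothetical targets
const INTERVAL_MS = 5 * 60 * 1000; // every five minutes

function runMtr(host: string): void {
  // --report runs a fixed number of cycles and prints a per-hop summary;
  // --report-cycles controls how many probes are sent per hop.
  execFile("mtr", ["--report", "--report-cycles", "10", host], (err, stdout, stderr) => {
    if (err) {
      console.error(`mtr failed for ${host}:`, stderr || err.message);
      return;
    }
    // In a real setup you would parse loss/latency per hop and ship it to a
    // time series database instead of just printing it.
    console.log(`layer-3 report for ${host}\n${stdout}`);
  });
}

HOSTS.forEach(runMtr);
setInterval(() => HOSTS.forEach(runMtr), INTERVAL_MS);
```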
Hello from trivago's performance & monitoring team. One important part of our job is to ship more than a terabyte of logs and system metrics per day, from various data sources into Elasticsearch, several time series databases and other data sinks. We do so by reading most of the data from multiple Kafka clusters and processing them with nearly 100 Logstashes. Our clusters currently consist of ~30 machines running Debian 7 with bare-metal installations of the aforementioned services. This summer we decided to migrate all of this to an on-premise [Nomad](https://www.nomadproject.io/) cluster.
Back in April 2015, I felt the need to do some work and earn some money alongside my studies in Computer Science at the University of Düsseldorf. After doing some research and crawling a few job platforms, I finally applied for a job in IT support at trivago. The job offer looked very appealing and life at trivago promised to be fun.
At trivago, we use a Cucumber-based framework for end-to-end tests of our most important web applications. Cucumber stores test results as JSON files which can be turned into human-readable test reports.
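As a minimal sketch of what can be done with those JSON files, the snippet below counts passed and failed scenarios in a Cucumber report; the file name and the trimmed-down types are assumptions, not our actual reporting code.

```typescript
// Sketch only: summarise a Cucumber JSON report by counting scenario results.
import { readFileSync } from "node:fs";

interface Step { result?: { status: string } }
interface Scenario { name: string; steps?: Step[] }
interface Feature { name: string; elements?: Scenario[] }

// Cucumber's JSON output is an array of features, each containing scenarios.
const report: Feature[] = JSON.parse(readFileSync("cucumber-report.json", "utf8"));

let passed = 0;
let failed = 0;
for (const feature of report) {
  for (const scenario of feature.elements ?? []) {
    // A scenario counts as failed if any of its steps failed.
    const hasFailure = (scenario.steps ?? []).some((s) => s.result?.status === "failed");
    if (hasFailure) {
      failed++;
    } else {
      passed++;
    }
  }
}

console.log(`scenarios: ${passed} passed, ${failed} failed`);
```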
We're a data-driven company. At trivago we love measuring everything. Collecting metrics and making decisions based on them comes naturally to all our engineers. This workflow also applies to performance, which is key to succeeding on the modern Internet.
At trivago we store a subset of our real-time metric data in InfluxDB and we are quite impressed by the load it can handle. Despite all the joy, we had to learn some lessons the hard way. It is pretty easy to overload the database or the web browser by executing queries that return too many datapoints. To prevent that, we wrote Protector, a circuit breaker for time series databases that blocks malicious queries.
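Protector has its own rule engine; the TypeScript sketch below only illustrates the underlying idea of estimating how many datapoints a query would return and rejecting it above a budget. The heuristic, the threshold and all names are assumptions for illustration.

```typescript
// Sketch of a query circuit breaker: estimate the number of datapoints a
// time series query would return and reject it above a limit.

const MAX_DATAPOINTS = 50_000; // hypothetical budget per query

interface QueryPlan {
  rangeSeconds: number;    // time span covered by the query
  intervalSeconds: number; // GROUP BY time(...) resolution
  seriesCount: number;     // number of series the query touches
}

function estimateDatapoints(plan: QueryPlan): number {
  return Math.ceil(plan.rangeSeconds / plan.intervalSeconds) * plan.seriesCount;
}

export function guardQuery(plan: QueryPlan): void {
  const estimate = estimateDatapoints(plan);
  if (estimate > MAX_DATAPOINTS) {
    // Blocking here protects both the database and the dashboard rendering it.
    throw new Error(`query rejected: ~${estimate} datapoints exceeds ${MAX_DATAPOINTS}`);
  }
}

// Example: 30 days at 10s resolution across 5 series is roughly 1.3M points,
// so the guard rejects it.
try {
  guardQuery({ rangeSeconds: 30 * 24 * 3600, intervalSeconds: 10, seriesCount: 5 });
} catch (e) {
  console.log((e as Error).message);
}
```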
At trivago we rely heavily on the ELK stack for our log processing. We stream our webserver access logs, error logs, performance benchmarks and all kinds of diagnostic data into Kafka and process it from there into Elasticsearch using Logstash. Our preferred encoding within this pipeline is Google's Protocol Buffers, protobuf for short. In this blog post, we will explain with an example how to read protobuf-encoded messages from Kafka using Logstash.
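The article itself does this with a Logstash pipeline; as a language-agnostic illustration of the same consume-and-decode step, here is a sketch in TypeScript using kafkajs and protobufjs. The broker address, topic, `.proto` file and message type name are all assumptions.

```typescript
// Sketch only: consume protobuf-encoded messages from Kafka and decode them.
// All connection details and the schema are hypothetical.
import { Kafka } from "kafkajs";
import protobuf from "protobufjs";

async function main(): Promise<void> {
  const root = await protobuf.load("access_log.proto");    // hypothetical schema file
  const AccessLog = root.lookupType("logging.AccessLog");  // hypothetical message type

  const kafka = new Kafka({ clientId: "protobuf-reader", brokers: ["localhost:9092"] });
  const consumer = kafka.consumer({ groupId: "protobuf-reader" });

  await consumer.connect();
  await consumer.subscribe({ topic: "access-logs" });

  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return;
      // Decode the raw bytes into a structured object using the loaded schema.
      const decoded = AccessLog.decode(message.value);
      console.log(AccessLog.toObject(decoded));
    },
  });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```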
The advances and growth of our Selenium-based automated testing infrastructure generated an unexpected number of test results to evaluate, so we had to rethink our reporting systems. Combining the power of Selenium with Kibana's graphing and filtering features totally changed our way of working. Now we have real-time testing feedback and the ability to filter through thousands of tests, all in one dashboard.
At trivago we love hotels above everything else, but we also like metrics: we love to measure everything, compare, decide, improve, and then rinse and repeat. In this blog entry we are going to describe our experience with InfluxDB, a time series database that we are using to store some real-time metrics.