
Building RoadRunner for a near real-time feedback loop

We use Hadoop/MapReduce batch jobs extensively to process content and activity streams. The home feed is a prime example, where we employ such batch jobs to compute signals and features to create a personalized feed of interesting Pins for every Pinner. While this batch approach is effective and scalable, it’s not responsive to recent activity. RoadRunner, a stream compute infrastructure, was born to address the need for a near real-time feedback loop.

Introducing RoadRunner

RoadRunner processes a stream of activities (repins, searches, clicks, etc.) as they happen and aggregates them in near real-time. From these we derive meaningful estimates of the quality and category of content, interests of Pinners and more. These signals are made available to our search and home feed ranking models that determine which Pins are most likely to appeal to a Pinner. RoadRunner also preserves a lot of temporal information about the activity stream. For example, we can discern how quality signals vary over time which allows us to understand what content is popular on weekends, what’s trending or what has changed recently.

Under the hood

RoadRunner is built on top of Apache Storm for streaming computation, HBase for storage and Finagle for serving. It crunches about 10 billion events daily at a peak rate of ~30K/sec with end-to-end latency of just a few seconds (p99 < 10s).

We needed a storage technology that supports low-latency, high-QPS increments and very low-latency reads and scans (e.g., looking up hour-by-hour metrics for a Pin over the last seven days). HBase was a great fit: it supports atomic increments and scans, and it’s horizontally scalable and widely used across our engineering teams. Our HBase cluster uses SSDs to meet our latency requirements. Each HBase key combines an object ID and a time-window ID, which lets us expire old data, capture time series and aggregate over dynamic ranges. This design also allows merging results from both the batch systems and RoadRunner.
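The objectID + time-window-ID key scheme can be sketched as follows. This is a minimal illustration, not the production code: the dict stands in for HBase, the hourly window size and key format are assumptions, and in HBase the increment would be an atomic counter update and the range aggregation a key-range scan.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=1)  # assumed bucket size for illustration

def row_key(object_id: str, ts: datetime) -> str:
    """Compose a row key from the object ID plus a time-window ID,
    mirroring the objectID + time-window-ID scheme described above.
    Zero-padding keeps lexicographic order consistent with time order."""
    window_id = int(ts.timestamp() // WINDOW.total_seconds())
    return f"{object_id}:{window_id:012d}"

# Stand-in for HBase: a plain dict of counters keyed by row key.
counters = defaultdict(int)

def increment(object_id: str, ts: datetime, delta: int = 1) -> None:
    # HBase would perform this as an atomic increment on the row.
    counters[row_key(object_id, ts)] += delta

def scan_range(object_id: str, start: datetime, end: datetime) -> int:
    """Aggregate counts over a dynamic time range by covering the
    contiguous key range [start_key, end_key] -- the same access
    pattern an HBase scan over the key range would use."""
    lo, hi = row_key(object_id, start), row_key(object_id, end)
    return sum(v for k, v in counters.items() if lo <= k <= hi)
```

Because the window ID is part of the key, expiring old data reduces to dropping keys below a cutoff, and hour-by-hour time series for an object sit in adjacent rows.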

We also spent time tuning the Storm topology. To keep it free-flowing, we use a future pool for the blocking updates to HBase; this worked better than increasing the number of bolts responsible for performing updates. We also built an end-to-end audit system that pumps events for test objects and checks the serving layer for the expected metric updates within a few seconds. This periodic check helps ensure the entire system is healthy.
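The future-pool pattern can be sketched roughly as below. The post’s implementation uses Finagle futures inside Storm bolts; this pure-Python analogue with a thread pool is an assumption-laden stand-in (the bolt class, `blocking_store_increment`, and the pool size are all hypothetical), showing only the shape of the idea: the tuple-processing path submits the blocking write and returns immediately.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def blocking_store_increment(key: str, results: list) -> None:
    """Hypothetical stand-in for a blocking HBase increment."""
    time.sleep(0.01)  # simulate store round-trip latency
    results.append(key)

class CounterBolt:
    """Sketch of a Storm-style bolt that hands blocking writes to a
    future pool so the tuple-processing thread stays free-flowing."""

    def __init__(self, pool_size: int = 8):
        self.pool = ThreadPoolExecutor(max_workers=pool_size)
        self.results = []   # completed writes (list.append is thread-safe in CPython)
        self.pending = []   # in-flight futures

    def execute(self, tup: str) -> None:
        # Returns immediately; the write completes on a pool thread.
        self.pending.append(
            self.pool.submit(blocking_store_increment, tup, self.results))

    def drain(self) -> None:
        # Block until all in-flight writes finish, surfacing any errors.
        for f in self.pending:
            f.result()
```

Compared with simply adding more update bolts, a pool inside the bolt decouples write concurrency from topology parallelism, so slow store round-trips don’t back up the tuple stream.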


Using RoadRunner data

The data from RoadRunner is combined with data from nightly aggregates to generate the feature vectors that go into the home feed ranking models. There’s some overlap between the features computed by RoadRunner and those computed by the Hadoop jobs: both compute estimates of Pin quality, but the Hadoop estimates cover the lifetime of a Pin (or at least up to the last nightly run), while the RoadRunner estimates cover the last several days. Each has strengths and weaknesses. The RoadRunner estimates are fresher and more flexible, which lets us incorporate time-of-day or day-of-week variation into the models; however, they’re noisier and less complete since they’re based on less data. Our models treat the batch and real-time estimates as two completely different features. The training procedure we use doesn’t require the features to be independent, so this works well for us.
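The two-features-per-signal idea amounts to something like the sketch below. The feature names and the hour-of-day feature are illustrative assumptions, not the actual feature set; the point is simply that the lifetime (batch) and recent (RoadRunner) estimates enter the vector side by side and the model learns how to weigh them.

```python
def make_feature_vector(batch_quality: float,
                        recent_quality: float,
                        hour_of_day: int) -> dict:
    """Combine the lifetime estimate from the nightly Hadoop job and
    the last-several-days estimate from the stream pipeline as two
    distinct features (hypothetical names for illustration)."""
    return {
        "quality_batch": batch_quality,      # lifetime estimate, nightly job
        "quality_recent": recent_quality,    # fresher, noisier stream estimate
        "hour_of_day": hour_of_day,          # lets the model capture time-of-day variation
    }
```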

We built and launched RoadRunner for search and home feed first. Since then, we’ve found several additional use cases, from personalizing new-user in-product guides to monetization products. We’ve started work on version 2 to support this vast array of needs. We envision the next version as a self-service platform that makes it easy to declare features and also applies a lambda architecture for improved fault tolerance.

Stay tuned for future blog posts that discuss training and evaluation of models that use these dynamic features, and what’s new in version 2.

Ramki Venkatachalam and Mukund Narasimhan are software engineers on the Data Team and the Recommendations Team, respectively.

Acknowledgements: RoadRunner is part of a long-term strategic effort to enable streaming compute solutions at Pinterest. It was developed in collaboration with many folks from the Platform, Discovery and Cloud teams.

For Pinterest engineering news and updates, follow our engineering Pinterest, Facebook and Twitter. Interested in joining the team? Check out our Careers site.
