notfried 16 hours ago

Netflix Engineering is probably far ahead of any competitor streaming service, but I wonder what the ROI is on all that effort and cost. As a user, I don't see much difference in reliability between Netflix, Disney, HBO, Hulu, Peacock and Paramount+. They all error out every now and then. Maybe 5-10 years ago you needed to be much more sophisticated because of lower bandwidth and less mature tech. Ultimately, the only real difference that makes me go for one service over another is the content.

  • Galaco 11 hours ago

    All the money they spend still fails at basic UX, creating a finely crafted, polished turd of engineering perfection. I don't see the Netflix experience as superior in any way to the other big services; they all have their own issues that engineering hasn't solved properly, most likely different ones for each.

    An example (in this case ONE PIECE on Netflix's Japan region): My Netflix language is set to English, but I am not in an English-speaking country. All shows have an English title and description, as well as episode lists with descriptions also in English. And yet the program itself is not in English, nor does it have English audio or subtitle options, and the UI does not indicate whether English subtitles exist until I play an episode and open the subtitles menu. This is made even worse by the many shows that are only half subtitled (different rights holders, I assume). Why does the UI lie to me by showing everything about the series/movie in my native language except the actual content itself? This is really common in non-English-speaking regions, and it looks like a basic engineering failure: looking up their global catalog metadata in MY language while ignoring whether the actual CONTENT is available in my language. I suppose this could be a UX issue, but it also looks like an engineering one to me. And aren't those intertwined to some extent anyway?

    • flakeoil 8 hours ago

      I agree with your example, and this functionality does suck. However, it cannot purely be an engineering issue, as it must be technically very easy to implement in a better way. There must be a product/UX/marketing decision behind it, or just plain ignorance.

      • zoover2020 8 hours ago

        This. There are no technical limitations behind offering an A-Z catalog experience; it's all decided by some VP of user experience who wants to squeeze the most screen/device time out of each user.

        Am I really alone in noticing that our collective UX across consumer (SaaS) software gets worse every year?

    • aprilthird2021 8 hours ago

      Netflix has far better UI/UX and features than its competitors. I'm sure your specific issue is a legit one, but Netflix as a whole is much better.

      • agos 7 hours ago

        I don't know if Netflix has caught up recently, but during the pandemic Disney+'s group viewing was a very good feature with good UX, and I haven't seen anything like it on other services.

  • com2kid 10 hours ago

    I worked at HBO Max when it launched. Fewer than 300 engineers, backend services written in either Node or Go. Mostly plain REST calls between services; fancier stuff was used only when there was a performance bottleneck.

    The engineering was very thoughtful and the internal tooling was really good.

    One underappreciated aspect of simply engineered systems is that they are simple to think about, simple to scale up, and simple to debug when something goes wrong.

  • brundolf 14 hours ago

    I notice their front-end engineering (and design). The app experience has always been noticeably better than every other streaming service, which matters to me

    It's also possible that savings on infrastructure costs, reduced engineering hours spent on debugging, etc. give them a quiet competitive advantage

    Most engineering is not visible to the end-user

    • mark_and_sweep 8 hours ago

      Lately, the user experience has gotten much worse for Windows users: in July they removed the ability to download content for offline viewing. So everyone who wants to watch content while travelling, for example, has no option but to use a competitor like Prime Video.

      I would also like to point out that those "savings on infrastructure costs" seemingly do not benefit their users: They have repeatedly increased prices over the past few years.

      I'm also unsure whether they are using their competitive advantage properly. Anecdotally, it used to be that almost everyone I knew was only streaming on Netflix. These days, the streaming services that people use seem much more diverse.

    • lofaszvanitt 13 hours ago

      Their front end is just plain garbage. It does everything except help you discover content. And people look up to Netflix like it's some miracle thing.

      • thereisnospork 12 hours ago

        It is well-engineered and meticulously manicured garbage, though. In your example, the harder it is to discover content, the harder it is to realize there is nothing new to watch (hence the rotating thumbnails, for a trivial example).

        • alexander2002 10 hours ago

          One thing I hate about Netflix is that category recommendations also include content from other categories, like Trending and Recently Watched. If I just want to watch some classic movies, please show me your whole catalog of such movies. I don't use Netflix often, so this issue is more prominent for me, at least, since I am not used to the UI bloat.

        • lofaszvanitt 10 hours ago

          There are always a lot of good things to watch, but you have to dance around the idiotic recommendation system to find the hidden pearls. I'm sure there are a lot of good, talented people working there, but to what end, when the frontend team kneecaps their work with an abysmal user interface?

          And the best part... you chat with support to report an annoying UI issue and they try to force a "how to use our webpage" script on you. It's really mind-boggling how a congregation of idiots with their heads up their asses runs that place.

      • assimpleaspossi 9 hours ago

        That you don't like how you have to use the Netflix interface has nothing to do with how it's engineered.

  • paxys 12 hours ago

    All of the services you mention have large engineering/ops teams and similar monitoring systems in place. There is absolutely an ROI on spending on reliability.

  • dheerkt 14 hours ago

    Not an expert by any means, but streaming HQ video is pretty expensive (even more so for live content); it seems the only providers that can do so profitably are YouTube and Netflix. I'm sure a big reason for that is the engineering (esp. the CDN).

  • TZubiri 14 hours ago

    The value of Netflix is probably not only in its technical prowess and app experience; it seems they are also pretty involved in content direction through metrics. [1] [2]

    Sources:

    [1] https://youtu.be/xL58d1l-6tA

    [2] https://youtu.be/uFpK_r-jEXg

    • rhplus 11 hours ago

      The jury's out on their content direction. They cancel great shows before the first season has even had a chance to permeate. It's clear they have little interest in long-term artistic investments.

  • l33t7332273 15 hours ago

    I've thought this for a while, and it's sad because I want to reward good tech. I usually think about this while I'm waiting for Paramount+ to load for the third time because picture-in-picture has gone black again when I tried to go full screen.

  • parhamn 14 hours ago

    Although I agree with the general point, I am thankful for the bit of extra work they do for when I travel abroad. The quality and availability of Netflix is far superior (many competing services don't even work outside the US).

millipede 19 hours ago

> EVCache

EVCache is a disaster. The code base has no concept of a threading model, and the code is almost completely untested* too. I was on call at least two times when EVCache blew up on us. I tried root-causing it, and the code is a rat's nest. Avoid!

* https://github.com/Netflix/EVCache

  • jolynch 15 hours ago

    (I work at Netflix on these Datastores)

    EVCache definitely has some sharp edges and can be hard to use, which is one of the reasons we are putting it behind gRPC abstractions like this Counter one or e.g. KeyValue [1], which offer CompletableFuture APIs with clean async and blocking modalities. We are also starting to add proper async APIs to EVCache itself, e.g. getAsync [2], which the abstractions use under the hood.

    At the same time, EVCache is the cheapest (by about 10x in our experiments) caching solution with global replication [3] and cache warming [4] that we are aware of. Every time we've tried alternatives like Redis or managed services, they either fail to scale (e.g. cannot leverage flash storage effectively [5]) or cost waaay too much at our scale.

    I absolutely agree, though, that EVCache is probably the wrong choice for most folks; most folks aren't doing 100 million operations/second with 4-region full-active replication and applications that expect p50 client-side latency <500us. Similar, I think, to how most folks should probably start with PostgreSQL and not Cassandra.

    [1] https://netflixtechblog.com/introducing-netflixs-key-value-d...

    [2] https://github.com/Netflix/EVCache/blob/11b47ecb4e15234ca99c...

    [3] https://www.infoq.com/articles/netflix-global-cache/

    [4] https://netflixtechblog.medium.com/cache-warming-leveraging-...

    [5] https://netflixtechblog.com/evolution-of-application-data-ca...
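
    For a rough illustration of those "clean async and blocking modalities", here's a hypothetical sketch; CounterClient and its getAsync are assumed names, not the real gRPC abstraction's API:

    ```java
    import java.util.concurrent.CompletableFuture;

    // Hypothetical sketch of a CompletableFuture-based client: one future
    // serves both async composition and blocking callers.
    // CounterClient and getAsync are assumptions, not the real API.
    interface CounterClient {
        CompletableFuture<Long> getAsync(String counterName);
    }

    class CounterReads {
        static void example(CounterClient client) {
            // Async modality: compose without blocking a thread.
            client.getAsync("plays:some-show")
                  .thenAccept(count -> System.out.println("count=" + count));

            // Blocking modality: just join the same future.
            long blocking = client.getAsync("plays:some-show").join();
            System.out.println("blocking count=" + blocking);
        }
    }
    ```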

    • lksajdl3lfkjasl 13 hours ago

      Curious how you're getting <500us latencies. Connection pooling, gRPC?

      • jolynch 12 hours ago

        Every zone has a copy, and clients always read their local zone's copy first (via pooled memcached connections), falling back only once to another zone on a miss. The key is staying in-zone, plus the memcached protocol and super fast server latencies. It's been a little while since we measured, but memcached has a time to first byte of around 10us and then scales sublinearly with payload size [1]. Single-zone latency is variable but generally between 150 and 250us round trip; cross-AZ is terrible at up to a millisecond [2].

        So you put 200us of network against a 30us response time and get about 250us average latency. Of course the p99 tail is closer to a millisecond, and you have to do things like hedged requests to fight things like the hard-coded 200ms TCP retransmit timer (an eternity at these latencies)... but that's a whole other can of worms to talk about.

        [1] https://github.com/Netflix-Skunkworks/service-capacity-model...

        [2] https://jolynch.github.io/pdf/wlllb-apachecon-2022.pdf
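
        To make that read path concrete, a minimal sketch of the local-zone-first strategy; ZoneClient and the zone wiring are hypothetical, not the actual EVCache client internals:

        ```java
        import java.util.Optional;

        // Hypothetical sketch: read the local zone's replica first and fall
        // back exactly once to a remote zone on a cache miss.
        interface ZoneClient {
            Optional<byte[]> get(String key); // empty on miss
        }

        class ZoneAwareReader {
            private final ZoneClient localZone;  // same-AZ replica, ~150-250us RTT
            private final ZoneClient remoteZone; // cross-AZ fallback, up to ~1ms RTT

            ZoneAwareReader(ZoneClient localZone, ZoneClient remoteZone) {
                this.localZone = localZone;
                this.remoteZone = remoteZone;
            }

            Optional<byte[]> read(String key) {
                Optional<byte[]> hit = localZone.get(key);
                if (hit.isPresent()) {
                    return hit; // common case: stay in-zone for low latency
                }
                return remoteZone.get(key); // single cross-zone fallback
            }
        }
        ```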

  • jedberg 17 hours ago

    I'm surprised it's still there! It was built over a decade ago when I was still there. At the time there were no other good solutions.

    But Momento exists now. It solves every problem EVCache was supposed to solve.

    There are other options too. They should have retired it by now.

  • Alupis 17 hours ago

    Can you elaborate?

    From the looks of it, each module has plenty of tests, and the codebase is written in a Spring Boot style, making it fairly intuitive to navigate.

vlovich123 19 hours ago

It's a bit weird not to compare this to HyperLogLog and similar techniques that are designed to solve exactly this problem, but much more cheaply (at least as far as I understand).

  • rshrin 11 hours ago

    Added a note on why we chose EvCache for the "Best-Effort" use-case instead of probabilistic data structures like HLL/CMS. Appreciate the discussion.

  • zug_zug 19 hours ago

    I came here to write the same thing. Getting an estimate accurate to at least 5 digits on all Netflix video watches worldwide could be done with intelligent sampling (like HyperLogLog) and likely one MacBook Air as the backend. And aside from the compute savings, the complexity and implementation time would be much lower too.

    • rshrin 18 hours ago

      Fwiw, we didn't mention any probabilistic data structures because they don't satisfy some of the basic requirements we had for the Best-Effort counter. HyperLogLog is designed for cardinality estimation, not for incrementing or decrementing specific counts (which in our case could be any arbitrary positive/negative delta per key). AFAIK, neither Count-Min Sketch nor HyperLogLog supports clearing counts for specific keys, and I believe Count-Min Sketch cannot support decrements either. The core EvCache solution for the Best-Effort counter is like 5 lines of code, and EvCache can handle millions of operations/second relatively cheaply.
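
      For a flavor of what those few lines could look like, here is a hedged sketch against plain memcached semantics via spymemcached (which EVCache builds on); the endpoint, key names, and TTL are illustrative assumptions, and this is not the actual EvCache API:

      ```java
      import java.io.IOException;
      import java.net.InetSocketAddress;
      import net.spy.memcached.MemcachedClient;

      // Hedged sketch of a best-effort counter on raw memcached semantics.
      public class BestEffortCounter {
          public static void main(String[] args) throws IOException {
              MemcachedClient client =
                  new MemcachedClient(new InetSocketAddress("localhost", 11211));

              // Atomic server-side increment: add 5; if the key is missing,
              // initialize it to 5; expire after one day (TTL on counts).
              long count = client.incr("views:show123", 5L, 5L, 86400);

              // Decrement (memcached floors at 0) and clear a specific key.
              client.decr("views:show123", 2L);
              client.delete("views:show123");

              client.shutdown();
          }
      }
      ```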

      • vlovich123 16 hours ago

        Including this in the blog would have been helpful, although I don't think the decrement problem is unsolvable: just keep a second field for decrements that is incremented when you want to decrement, and the final result is the difference of the two.
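
        For what it's worth, that trick is essentially a PN-counter from the CRDT literature. A minimal single-process sketch (names are made up; in a distributed setting each field would map to an increment-only counter):

        ```java
        import java.util.concurrent.atomic.AtomicLong;

        // Sketch of the two-field trick: both fields only ever grow, which
        // suits increment-only backends; the value is their difference.
        class TwoFieldCounter {
            private final AtomicLong increments = new AtomicLong();
            private final AtomicLong decrements = new AtomicLong();

            void increment(long by) { increments.addAndGet(by); }

            // A "decrement" just increments the second field.
            void decrement(long by) { decrements.addAndGet(by); }

            long value() { return increments.get() - decrements.get(); }
        }
        ```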

        • rshrin 16 hours ago

          True. You could do decrements that way. We trimmed this from the article as the post is already quite long, but considering the multiple threads on this, we might add a few lines. There is also something to be said about operating data stores that support HLL or similar probabilistic data structures; our goal is to build taller on what we already operate and deploy (like EvCache).

  • willcipriano 15 hours ago

    Can you do a distributed HyperLogLog? Wouldn't you have to have a single instance of it somewhere?

    • adgjlsfhk1 14 hours ago

      HyperLogLog distributes perfectly. Each node keeps its own registers (the max leading-zero count seen per bucket), and to query the global estimate you just ask each node for its registers and merge them.
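
      A minimal sketch of that merge, assuming each node ships its register array (the per-bucket max of leading-zero counts):

      ```java
      // Sketch: merging two HLL register arrays is just an element-wise max,
      // which is why the structure distributes and replicates so cleanly.
      class HllMerge {
          static int[] merge(int[] a, int[] b) {
              int[] out = new int[a.length];
              for (int i = 0; i < a.length; i++) {
                  out[i] = Math.max(a[i], b[i]);
              }
              return out;
          }
      }
      ```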

      • rshrin 13 hours ago

        Querying each instance can lead to availability and latency challenges. Moreover, HLL is not suited for operations like increments/decrements, TTLs on counts, and clearing of counts. Count-Min Sketch could work if we were willing to forgo certain requirements like TTLs, but relying solely on in-memory data structures on every instance isn't ideal (think instance crashes, restarts, new code deployments, etc.). Instead, using data stores that support HLL or Count-Min Sketch, like Redis, would offer better reliability. That said, we prefer to build on top of data stores we already operate. Also, the "Best-Effort" counter is literally 5 lines of code for EvCache. The bulk of the post focuses on "Eventually Consistent Accurate" counters, along with a way to track which systems sent the increments, which probabilistic counting is not ideal for.

        • rshrin 13 hours ago

          Just to add, we are also trying to support idempotency wherever possible to enable safe hedging and retries. This is mentioned a bunch in the article on Accurate counters. So take that into consideration.

mannyv 20 hours ago

I wonder how they're going to go about purging all the counters that end up unused once the employee and/or team leaves?

I can see someone setting up a huge number of counters and then leaving... and in a hundred years their counters will be taking up terabytes of space and serving thousands of requests per second.

  • singron 20 hours ago

    There is a retention policy, so the raw events aren't kept very long. The rollups probably compress really well in their time series database, which I'm guessing also has a retention policy.

    If you have high cardinality metrics, it can still be really painful, although I think you will feel the pain initially and it won't take years. Usually these systems have a way to inspect what metrics or counters are using the most resources and then they can be reduced or purged from time to time.

    • rshrin 18 hours ago

      Yes, once the events are aggregated (and optionally moved to cost-effective storage for audits), we don't need them in the primary storage anymore. You can check the retention section in the article. The rollups themselves can have a TTL if users wish to set one on a namespace, although in doing so they have to be fine with certain timing issues around when the rollups expire versus when new events are aggregated. We also have automation to truncate/delete namespaces.

fire_lake 9 hours ago

I think the design could have been simpler with Kafka (which they touched on briefly):

- Write counter changes to a Kafka topic with many partitions. The partition key is derived from the counter name.

- Use Kafka connect to push all counter events to S3 for audit and analysis.

- Write a Kafka consumer that reads events in batches and updates a persistent store with the current count (see the sketch after this list).

- Pick a good Kafka message lifetime to ensure that topic size is kept under control, but data is not lost.

This gives us:

- Fast reads (count is precomputed, but potentially stale)

- Fast writes (Kafka)

- Correctness (every counter is assigned exactly one consumer)

- Durability (all state is in Kafka or the persistent store)

- Scalable storage and compute requirements over time

If I were to really go crazy with this, I would shard each counter further and use CRDTs to compute the total across all shards.
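
A hedged sketch of the consumer step above, using the plain kafka-clients API; the topic name, the Long-delta serialization, and the flushToStore step are assumptions:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch: read counter deltas in batches, fold them into per-counter
// totals, flush to a persistent store, then commit offsets.
public class CounterConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "counter-rollup");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.LongDeserializer");

        try (KafkaConsumer<String, Long> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("counter-events"));
            while (true) {
                ConsumerRecords<String, Long> batch =
                    consumer.poll(Duration.ofMillis(500));
                Map<String, Long> deltas = new HashMap<>();
                for (ConsumerRecord<String, Long> rec : batch) {
                    // Key is the counter name; value is a signed delta.
                    deltas.merge(rec.key(), rec.value(), Long::sum);
                }
                // flushToStore(deltas) would apply the summed deltas to the
                // persistent store (hypothetical; a crash before commitSync
                // replays this batch, so the flush should be idempotent).
                consumer.commitSync(); // commit only after a durable flush
            }
        }
    }
}
```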

  • rshrin 8 hours ago

    Yes, this is one of the approaches mentioned in the article, and it is indeed a valid one. One thing to keep in mind is that we already operate the TimeSeries service for a lot of other high-ROI use cases within Netflix. There already exists a lot of automation to self-provision, configure, deploy and scale TimeSeries, as well as automation to move data from Cassandra to S3/Iceberg, so we somewhat get all that for free. The Counter service is really just the Rollup layer on top of it; the Rollup operational nuances are just to give it that extra edge when it comes to accuracy and reliability.

    • fire_lake 5 hours ago

      Did you also consider AWS managed services? Like a direct write to Dynamo?

bob1029 8 hours ago

This seems a bit overcooked to me.

I suspect that if I were to recursively ask "why?", we would eventually wind up at some triviality (to the actual business/customer) that could easily have gone another way and obviated the need for this abstraction in the first place.

Just thinking about the raw information, I don't see how the average streaming media consumer produces more than a few hundred KB of useful data per month. I get that the delivery of the content is hard, but I don't see why we need to torture ourselves over the gathering of basic statistics.

Dylan16807 8 hours ago

Well okay, that's some neat implementation stuff.

But what in the world are they using a global counter for that needs "low millisecond latencies"? I don't see a single example in the entire article, and I can't think of any.

dopamean 18 hours ago

Why would netflix put their blog on medium?

  • jedberg 17 hours ago

    Better distribution. More people will read it there.

    • paxys 12 hours ago

      That was probably true 5 years ago, but today a large chunk of readers are going to be driven away by the intrusive popups and auth walls.

      • dopamean an hour ago

        I didn't read it, and that was the reason I asked.

  • lmm 10 hours ago

    Because it's the least bad way to do a blog when you want to focus on the content.

est 10 hours ago

If anyone wants a non-distributed but still very powerful counter service, I'd recommend Telegraf (from InfluxData) or https://vector.dev/

leakyabstxns 17 hours ago

Given the complexity of the system, I'm curious to know how many people maintain this service

  • rshrin 17 hours ago

    5 people (who also maintain a lot of other services, like the linked TimeSeries service). The self-service to create new namespaces is pretty much autonomous (see the link in the article on "Provisioning"). The stateless layer auto-scales up and down based on attached CPU/network-based scaling policies. The exports to audit stores can be scheduled at a cadence. The only intervention is when we have to scale the storage layer (although parts of that are also automated using the same Provisioning workflow). I guess the other intervention is when we decide to change the configs (like the number of queues) and trigger a re-deploy. But that's about it. So far, we have spent a very small percentage of our support budget on this.

ilrwbwrkhv 20 hours ago

Looks a bit overengineered due to Netflix's own microservices nonsense.

I would be more interested in how a higher-traffic video company like Pornhub handles things like this.

  • Alupis 19 hours ago

    > Netflix's own microservices nonsense

    How many times has Netflix been entirely down over the years?

    Seems it's not "nonsense".

  • philjohn 19 hours ago

    Have you worked with distributed counters before? It's a hard problem to solve; the typical tradeoff is accepting lower cardinality in exchange for exact counts.

    The queue solution is pretty elegant.

    • ElevenLathe 18 hours ago

      The queuing reminds me of old "tote board" technology. I can't find a reference easily, but these were machines used to automatically figure and display parimutuel payouts at horse tracks. One particular kind had the cashiers' terminals emit ball bearings on a track back to the central (digital but electromechanical) counter for each betting pool. This arrangement allowed the cashiers to sell as fast as they liked without jamming the works, while letting the calculation/display of odds be "eventually consistent".

  • oreoftw 20 hours ago

    How would you design it to support the mentioned use cases?