Netflix's Key-Value Data Abstraction Layer

ericmcer 10 months ago

Can anyone explain why Netflix is considered to have such high-tier engineering? Just from a super high-level view they store and serve ~5000 videos saved at a few different qualities (4?), so let's say a total of 20,000 videos. Those files only change when specific privileged users update them.

Compare that with Youtube where ~5,000 videos are uploaded, processed into different formats/qualities every minute, and can be added by anyone with an email. It seems like Netflix has a fairly trivial problem when compared with video sharing or content sharing sites.

jolynch 10 months ago

My experience has been that the talent density is the main difference. Netflix tackles huge problems with a small number of engineers. I think one angle of complexity you may be missing is efficiency - both in engineering cost and infrastructure cost.
Also YouTube has _excellent_ engineering (e.g. Vitess in the data space), and they are building atop an excellent infrastructure (e.g. Borg and the godly Google network). It's worth noting though that the whole Netflix infrastructure team is probably smaller than a small to medium satellite org at Google.
- SR2Z 9 months ago
  
  YouTube hasn't used Vitess for a few years - they've moved into Spanner almost entirely.
  Otherwise I agree with you.
  - jolynch 9 months ago
    
    That's good to know - Spanner is even more impressive engineering!
    I wish we had the staff and time to build and maintain something like Spanner. Different constraints lead to different solutions.
  - pjmlp 9 months ago
    
    So basically, out of Go, into C++.
    One of the conference demos on how Go is used internally is now gone.
dumah 10 months ago

Netflix serves over 40,000 hours of video in more than 100 encodings. Source files are high quality and can reach bitrates of terabytes per hour. Source data must be retained for re-encoding. Encoding is adaptive to scene. The ETL can generate a quarter million tasks to process one source file.
I’m not saying YouTube’s problem isn’t much larger, but your assumptions are off by orders of magnitude. If you are curious about the workload there are many blog articles and stories here.
thecosmicfrog 10 months ago

As soon as a streaming service starts having availability issues, it will garner a reputation very quickly and lose customers just as quickly. Being able to serve N amount of content reliably and consistently (even if less than M amount) is still a strong demonstration of good engineering practice in my opinion.
On that point, I can't honestly recall a time I had Netflix streaming issues that weren't because of a problem on my side. Maybe I've just been lucky though, so ymmv.
- Iulioh 9 months ago
  
  Personally netflix went downhill the moment they stopped the "forced bitrate" option. The quality is shit. This was the moment for me where piracy was a more enticing option.
  I still kept netflix because it was cheap divided by 5, the value proposition was still there, it was easy enough for 4€/m.
  When they stopped it was over for me.
NBJack 10 months ago

Hype for the engineering culture? Helps attract the right talent. It is a relatively small team that is...ah, heavily motivated to come up with good solutions around the clock. And they maintain an excellent tech blog.
Don't get me wrong; serving the level of traffic they handle isn't easy to scale or do cost-effectively around the globe. They are also considered by some to be pioneers in chaos engineering, and made headlines years ago by running a competition to find the "best" suggestion algorithm.
dangus 10 months ago

On top of that, their competition didn’t need any of that technical adeptness to catch up in the span of a decade or so.
There is now zero value to the technology advantage of Netflix. Perhaps it’s impressive that they managed to become a new major studio because of that early success, but we could argue that the incumbent studios’ inability to snuff them out is more of a failure of their leadership than anything impressive about Netflix itself. Heck, the incumbents gave Netflix their place in the market by licensing content to them in the first place.
So why did Netflix need to build this “pro sports team-like” team of highly paid technologists where they actively fire/lay off low performers again? Netflix was bragging all over the internet about how their culture is so different and better.
I think ideas like this are something engineers should keep in mind in their careers. You can have the technical advantage but the money and the business environment win in the end. If you’re in an oligopoly market like Netflix it doesn’t matter that you had a 5-10 year lead and the best technology: Disney and Time Warner and everyone else already had content production, and Apple and Amazon have unlimited money.
loire280 10 months ago

You're probably right, but Netflix does a good job building their engineering brand by writing up and sharing their technical work publicly.
ianbutler 10 months ago

Netflix still has to serve 20k videos to 300 million people. That's about 750 million hours of streamed content. Serving that content is challenging.
Then they have their ad network on top of it. Then they have their analytics apparatus. Then they probably have a whole suite of tools for content producers. Then they probably have a bunch of janky tools for things that didn't exist as products 15 years ago.
Seems reasonable to me if you put in a little more thought about the problem and scale.
evnix 9 months ago

Yeah it's overrated, pornhub has more videos and plenty more free users. They don't seem to boast about their engineering.

snicker7 10 months ago

This API is very similar to DynamoDB, which is basically a hash table of B-trees.

My experience is that this architecture can lead to very chatty applications if you have a rich data model (eg a graph).
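
A minimal sketch of the chattiness described above, assuming a graph whose adjacency lists are stored one node per key. `CountingKV`, `get_neighbors`, and `reachable` are hypothetical names for illustration, not part of any real API; the counter stands in for network round trips.

```python
from collections import deque
from typing import Dict, List, Set


class CountingKV:
    """Toy key-value store that counts lookups as if each were a network call."""

    def __init__(self, adjacency: Dict[str, List[str]]) -> None:
        self._adjacency = adjacency
        self.round_trips = 0

    def get_neighbors(self, node: str) -> List[str]:
        self.round_trips += 1  # one lookup per node visited
        return self._adjacency.get(node, [])


def reachable(store: CountingKV, start: str, max_depth: int) -> Set[str]:
    """Breadth-first traversal; every expanded node costs one store lookup."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand nodes at the depth limit
        for neighbor in store.get_neighbors(node):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

Even on a four-node graph, a two-hop traversal issues one lookup per expanded node, which is where the chattiness comes from once each lookup is a remote call.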

jolynch 10 months ago

(post author)
It is indeed similar to DynamoDB as well as the original Cassandra Thrift API! This is intentional since those are both targeted backends and we need to be able to migrate customers between Cassandra Thrift, Cassandra CQL and DynamoDB. One of the most important things we use this abstraction for is seamless migration [1] as use cases and offerings evolve. Rather than think of KeyValue as the only database you ever need, think of it like your language's Map interface, and depending on the problem you are solving you need different implementations of that interface (different backing databases).
Graphs are indeed a challenge (and Relational is completely out of scope), but the high-scale Netflix graph abstraction is actually built atop KV just like a Graph library might be built on top of a language's built in Map type.
[1] https://www.youtube.com/watch?v=3bjnm1SXLlo
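
The "Map interface with different implementations" analogy above can be sketched roughly as follows. This is an illustration only, not Netflix's implementation: `KeyValueStore`, `InMemoryStore`, and `MigratingStore` are hypothetical names, and the dual-write-with-fallback strategy is an assumption about how a migration behind such an interface might work.

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """The interface callers depend on, independent of the backing database."""

    def get(self, key: str) -> Optional[bytes]: ...

    def put(self, key: str, value: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for a real backend (e.g. one per database engine)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value


class MigratingStore:
    """Writes to both backends; reads prefer the new one and fall back to the old."""

    def __init__(self, old: KeyValueStore, new: KeyValueStore) -> None:
        self._old = old
        self._new = new

    def put(self, key: str, value: bytes) -> None:
        self._old.put(key, value)
        self._new.put(key, value)

    def get(self, key: str) -> Optional[bytes]:
        value = self._new.get(key)
        if value is None:
            value = self._old.get(key)
            if value is not None:
                self._new.put(key, value)  # lazily backfill the new backend
        return value
```

Because callers only see `get` and `put`, swapping `MigratingStore` in and later replacing it with the new backend alone requires no change on the client side.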
pradn 10 months ago

Graphs are inherently "chatty" because there are more shapes in which you could store them. The same goes for querying. Similar to "degrees of freedom".
Even storing a graph in memory, you're going to have to load a lot more cache lines to traverse/query its structure. For remote graphs, this translates into more network calls.
The smart thing Netflix did here is finding the minimal abstraction that supports their online querying needs. Turns out that they need a few things above a bare KV store:
1) Idempotency keys allow multiple reads/writes without reordering issues. You can use them to do request hedging, which greatly helps w/ tail latency, at the cost of higher resource usage.
2) KV, with the value being a map. A little more structure, which can use the backing store's native structure.
3) Passing client/server parameters back and forth in a handshake. This allows clear request policy propagation, so the whole path behaves the way the client op wants it to.
4) Filtering/selection - to reduce the set of items returned, on the server side. So the network + client don't have to bear the extra burden.
The summary is: "minimal viable structure", "maximal chances to hedge requests / reduce data movement".
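
Point 1 above can be sketched as follows: a store that ignores writes whose token it has already seen, plus a helper that sends a second copy of a slow request. This is a sketch under stated assumptions, not the article's design: `IdempotentLog`, `append`, and `hedged` are hypothetical names, and the token is an opaque string chosen by the caller.

```python
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Dict, List, Set, TypeVar

T = TypeVar("T")


class IdempotentLog:
    """Toy store whose appends are deduplicated by idempotency token."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._seen: Set[str] = set()
        self.entries: Dict[str, List[bytes]] = {}

    def append(self, key: str, value: bytes, token: str) -> bool:
        """Apply the write once per token; return False for a duplicate."""
        with self._lock:
            if token in self._seen:
                return False
            self._seen.add(token)
            self.entries.setdefault(key, []).append(value)
            return True


def hedged(call: Callable[[], T], hedge_after_s: float) -> T:
    """Run call; if it has not finished after hedge_after_s, race a second copy."""
    pool = ThreadPoolExecutor(max_workers=2)
    try:
        futures = [pool.submit(call)]
        done, _ = wait(futures, timeout=hedge_after_s)
        if not done:
            futures.append(pool.submit(call))  # the hedge request
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)  # let the slower copy finish in the background
```

Hedging is only safe here because both copies carry the same token: whichever arrives second is dropped by the store, so the extra request costs resources but cannot apply the write twice.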

jerf 10 months ago

For anyone looking for a TL;DR, I'd suggest starting at https://netflixtechblog.com/introducing-netflixs-key-value-d... , which HN is truncating so you can't see it but I've directly linked to a later section in the post with a #. Up to that point it's basically "a networked HashMap<String, SortedMap<Bytes, Bytes>>". But the ability to return partial results based on a timeout with a pagination token is somewhat unusual and the next section called "Signaling" is at least worth a look.
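
The "HashMap<String, SortedMap<Bytes, Bytes>>" shape with a pagination token might look roughly like this in memory. This is a toy sketch: `TwoLevelMap`, `put_items`, and `get_items` are hypothetical names, the page is cut by item count rather than by a timeout, and using the last returned key as the token is an assumption made for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple


class TwoLevelMap:
    """Record id -> sorted map of item key -> item value."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[bytes, bytes]] = {}

    def put_items(self, record_id: str, items: Dict[bytes, bytes]) -> None:
        self._records.setdefault(record_id, {}).update(items)

    def get_items(
        self,
        record_id: str,
        page_size: int,
        page_token: Optional[bytes] = None,
    ) -> Tuple[List[Tuple[bytes, bytes]], Optional[bytes]]:
        """Return up to page_size items in key order, plus a token if more remain."""
        items = self._records.get(record_id, {})
        keys = sorted(items)
        # Resume strictly after the last key returned by the previous page.
        start = 0 if page_token is None else bisect_right(keys, page_token)
        page = keys[start:start + page_size]
        has_more = start + len(page) < len(keys)
        next_token = page[-1] if page and has_more else None
        return [(k, items[k]) for k in page], next_token
```

A caller keeps passing the returned token back until it comes back as `None`; a server that stops early on a deadline could return the same kind of token with a shorter page.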

throwaway984393 10 months ago

Back in the 2000s it was common to have libraries and services which would expose high level database functions to applications rather than give them direct database access. It solved so many problems.