The Pragmatic Programmer for Machine Learning (2023)

195 points by rramadass a year ago

simonw a year ago

Does this have anything to do with the original Pragmatic Programmer book or the https://pragprog.com/ publishing company?

If not I think the name should be reconsidered. It's a distraction from the content of the book itself if it's not actually related to that other text.

rramadass a year ago

The phrase "Pragmatic Programmer" is a common one to denote somebody focused on "pragmatic issues" in Software Development and can be used in that capacity wherever it is applicable.
This book deals with such practical issues in ML software engineering and is hence very much worthy of this name.
- sja a year ago
  
  I tend to agree with the parent comment. The name of the book sounds like it could be part of a series authored by the same people as “The Pragmatic Programmer”. For me, I subconsciously internalized the grammar of the title as “The Pragmatic Programmer: for ML”.
  I have no idea what the expectations are legally, but given the original “pragmatic programmer” book has been out for ~25 years and is extremely well known, it seems like a reasonable name collision to avoid.
  - graypegg a year ago
    
    The cover of the book has an Addison Wesley logo on it, and the hard cover also has a Pearson logo on it. So that name has some textbook companies backing up PragProg as well.
    https://pragprog.com/titles/tpp20/the-pragmatic-programmer-2...
    It also DOESN'T have any indication that "The Pragmatic Programmer" is any sort of trademark, so who knows. Either way, IMO calling your own writing "X for Y" where "X" is a commonly known specific work, and "Y" is a generic term, just means that you've diluted your own discoverability into a very big pot.
  - rramadass a year ago
    
    Why are we fixated on the name? "A Rose, by any other name, would Smell as Sweet" and all that.
    What I am looking for in this submission is insights/opinions from people working in this domain on the topics presented in the book. For example, the book talks about "Concept/Data Drift"; so what is it exactly, how does an ML engineer encounter it in his data and how does he deal with it over time?
    
    gessha a year ago
    
    Because names mean a lot in and outside software. Try naming it “ML: The Big Nerd Ranch Guide” or use an O’Reilly/No Starch-style cover and you will get a similar reaction.
    
    rramadass a year ago
    
    Let the authors deal with it; it doesn't concern us here.
    What I am looking for is a discussion of the contents in the book which they have kindly made available for free (the book is expensive).
    PS: I am always very appreciative and thankful of people who make their knowledge/books/software available for free and am sure they would like us to focus on the core contents rather than ancillary issues (which they doubtless are aware of and cleared with publishers).
    
    ibash a year ago
    
    It very clearly does concern us… hence the thread.
    FWIW I’m only familiar with the term “pragmatic programmer” because of the book. I don’t think I’ve ever heard it in any other context.
    When I saw the post I thought it was written by the same author.
    
    rramadass a year ago
    
    The point is; "it doesn't have to".
    
    loco5niner a year ago
    
    Honestly, I only clicked into this thread because I associated the phrase "The Pragmatic Programmer" with the famous book, and if it's not by the same people, I am less interested in their content specifically because of the "borrowed"(stolen?) term.
    
    rramadass a year ago
    
    All your assumptions/preconceived notions are only keeping you from good Knowledge.
    
    loco5niner a year ago
    
    It's possible, however I've already wasted time with a click based on the book title, and based on that I would prefer not to give the authors any more of my time regardless of missing out.
    
    rramadass a year ago
    
    [flagged]
    
    loco5niner a year ago
    
    My primary purpose is to use good judgement on choosing where to spend my time. Thus I choose not to read the book, it's not hate, it's judgement. My secondary purpose is to not reward those who use underhanded schemes to get ahead. This may not have been their intention, but it is how I perceive it.
    
    rramadass a year ago
    
    Your logic, perception and judgement are all flawed. You merely looked at a familiar phrase in the title and immediately jumped to a conclusion with negative connotations. That's on you. You have no idea about the book, have read no summary/review of it and hence do not have a clue about it and yet are trying to justify your "judgement"?. The book is published by well-known publishers who would have cleared its title to make sure that there are no legal violations (i.e. underhanded schemes) which could get them into trouble. So on that count also your "judgement" fails.
    I am advising you to browse/read the book because we (I and a few others in this thread) have browsed/read the book and found it worthwhile (if you are interested in the domain in the first place, of course).
    I have seen some silly arguments in my time but you take the cake on "judging a book by its cover" to a whole new absurd level.
    
    aulin a year ago
    
    Because that name is associated with one of the best and most successful books about software engineering.
    I'm almost sure that "The Pragmatic Programmer" is a trademark, so it comes naturally to associate the book with either the same authors or the same publisher as the original book.
    
    rramadass a year ago
    
    https://news.ycombinator.com/item?id=41565030
    
    auraham a year ago
    
    X-men reference?

rramadass a year ago

I am quite surprised there is no discussion here. The book actually gives a nice overview of practical Software Engineering principles applied to ML Engineering and is hence of use to regular Programmers moving to ML from other domains. I personally found it quite useful for understanding the practices employed in ML Engineering and how they differ from "normal" programming, which is where I come from.

Part II titled "Best Practices for Machine Learning Pipelines" and starting from chapter 5 is where the meat lies.

nerdponx a year ago

It's new to me! It certainly looks good, or at least like something that could be increasingly useful to an increasingly large group of people. But it's a whole book and I have only a few minutes' break to check HN, so I can't evaluate it for quality until I've had a chance to read it (and I definitely will, because it looks useful to me). I assume, being new, few other people have experience with it to comment. And fortunately whenever actual math and code show up, the AI maximalist/doomer blabbermouths tend to stay away.
- dijksterhuis a year ago
  
  I’ve had a skim through bits. It looks good and thorough for ML Engineer / Data Engineer stuff.
  Definitely recommend checking it out yourself, even if it’s only to bookmark for sending onto others as a reference manual.
  > And fortunately whenever actual math and code show up, the AI maximalist/doomer blabbermouths tend to stay away.
  When I see something mentioning machine learning + engineering, I’m interested.
  When I see AI mentioned I automatically discount it as marketing fluff.
  It’s worked well as a filter for almost 10 years now (the last hype cycle).
FrustratedMonky a year ago

came here to ask.
Since I don't know enough about ML to know if this is good, is it as good as the original 'Pragmatic Programmer', to justify the title?
A lot of books riff off a popular name, so I'm asking if this lives up to it and is worth more time investment to read.
- rramadass a year ago
  
  https://news.ycombinator.com/item?id=41563211
  It is an overview of practical issues and hence an easy read.

dijksterhuis a year ago

Had a skim, looks good. Bookmarked.

+1 adversarial robustness [0] & privacy were included in the analysis stage. People forget that stuff.

+1 on having to rewrite academic code (or code from some Jupyter notebook). Bane of my life sometimes.

+1 versioning data and code, running pipelines based on changes in either

+1 ingest your data, then validate, then use it. Data/model drift etc.

+1 on consistent tooling and language use.

+1 references everywhere

Wasn’t sure about the super specific approach to the commit history (squashing specific file changes together with validation/safety changes in a separate commit).

But then I’ve rebased my MRs to do something similar before and enjoy doing it. I guess I’m just pointing out that trying to get other people to do this regularly is a massive PITA and usually doomed to fail.

—

[0]: adv robustness was a bit light on content unfortunately. But then I researched the topic full time for three years so probably always gonna be light for me KEKW

rramadass a year ago

Finally! Somebody who is actually talking about the contents :-)
Could you clarify a little bit on what is meant by "Concept/Data Drift"? Any examples/links you can point us to? Wikipedia (https://en.wikipedia.org/wiki/Concept_drift) describes it but without a specific example to walk through I am not really "getting" it.
- dijksterhuis a year ago
  
  is mentioned in the text in a couple of places
  https://ppml.dev/design-code.html#data-debt
  https://ppml.dev/troubleshooting-code.html#troubleshooting-d...
  Probably a very oversimplified example below, because data doesn’t usually drift in this obvious way. It’s usually more subtle and happens over a longer period.
  My model is learning on my business data of orders over time.
  People keep ordering every day, but usually in small amounts.
  But today we got a new customer and they put in monthly orders which are 1000x larger than all others combined.
  They are going to keep making orders for the next year or so. At which point they stop ordering from us.
  Two data drift “episodes” here:
  1. When we get the new customer. We’ve now got an outlier. They aren’t like all the other customers. How will the model react to this when being trained? Will it skew the output? Do we exclude the new customer from the training data? Or do we change the model to account for them?
  2. When that customer stops ordering after a year. Now the outlier is gone. But maybe we changed some model settings and tweaked it a bit to account for it. Now we need to account for that customer not being around anymore.
  Data drift is a big PITA.
  - rramadass a year ago
    
    I got the obvious way. What I was asking about is how do you identify drift in the data in the first place? The Model has been deployed after the training/test data-set passes. Presumably with drift in the input the model's predictions will not be "good" anymore. How do you disambiguate this case from the model itself being wrong for other reasons?
    
    dijksterhuis a year ago
    
    It's a continuous process checking the training data is from the same "distribution". Usually through automated pipelines running against the ingested training data (i.e. once you've got the new data fully processed and ready for training, but prior to actually training the model).
    In the pipelines you do some checks on statistical outliers/differences. Check the current training data against historical versions of the training set. If anything goes beyond some specified tolerances you highlight that for manual testing/checks.
    Using the toy example from before, something like checking the sum of orders per customer in a month compared to the last N months. If the maximum per-customer orders this month is 100x higher than in any previous month then something has significantly changed in the data. It may affect training, so we need to investigate.
    If you've identified some statistical changes/differences, that's usually where someone needs to investigate in more depth. Train a dev model on the brand new training data. Pass multiple unseen test dataset(s) through it. What happens?
    * Is global test accuracy up or down?
    * Is robustness affected?
    * Is the accuracy degrading for specific classes?
    * How does this compare to drifts we've seen before?
    Then you make decisions about whether you need to:
    * exclude parts of the new training data?
    * tweak some model hyperparameters?
    * tweak the architecture of the model?
    There's no single right answer on what to do at this point. This is the difficult and expensive bit of machine learning. It requires a lot of continuous experimentation even after you've got something running initially.
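For anyone who wants to see the toy check above in code, here is a minimal sketch. It assumes orders arrive as `(month, customer_id, amount)` tuples; the function names, the sample data and the 100x tolerance are all made up for illustration and are not from the book.

```python
from collections import defaultdict


def max_per_customer_by_month(orders):
    """Return {month: largest per-customer order total in that month}.

    `orders` is an iterable of (month, customer_id, amount) tuples
    (a hypothetical layout chosen for this sketch).
    """
    totals = defaultdict(float)
    for month, customer_id, amount in orders:
        totals[(month, customer_id)] += amount

    maxima = {}
    for (month, _customer_id), total in totals.items():
        maxima[month] = max(maxima.get(month, 0.0), total)
    return maxima


def flag_drift(orders, current_month, history_months, tolerance=100.0):
    """Flag current_month for manual review if its largest per-customer
    total exceeds `tolerance` times the largest one seen in history_months."""
    maxima = max_per_customer_by_month(orders)
    baseline = max(maxima.get(m, 0.0) for m in history_months)
    return maxima.get(current_month, 0.0) > tolerance * baseline


# Illustrative data only: two small regular customers, then one huge newcomer.
orders = [
    ("2024-01", "a", 10.0), ("2024-01", "b", 12.0),
    ("2024-02", "a", 11.0), ("2024-02", "b", 9.0),
    ("2024-03", "a", 10.0), ("2024-03", "big", 50000.0),
]

needs_review = flag_drift(orders, "2024-03", ["2024-01", "2024-02"])
```

A flag here only says "a human should look at this month"; it does not decide whether to exclude the data or change the model.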
    
    rramadass a year ago
    
    Nice. It is these sorts of issues that made me realize that ML Engineering/MLOps are a very different kind of beast where Statistics and the coupling of input data to the Model play a very significant part. Awareness of the data domain is vital.
- tomtom1337 a year ago
  
  I haven't read the text, but data drift refers to how, after deploying a machine learning model, the input data changes over time to something that wasn't tested on. For instance, let's say you create a gradient boosting forecasting model that does a great job at predicting tomorrow's earnings. At the time of training, the earnings might be in the $1000 per day range. But a year later, the earnings might be in the $100k range. The model has never seen numbers this high before, so it doesn't know how to handle them well. That is data drift.
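One cheap guard for the situation described above is to measure how much of the incoming data falls outside the range seen at training time. This is only a sketch under assumptions: the function name and the numbers are hypothetical, and a real pipeline would likely compare whole distributions rather than just min/max.

```python
def fraction_outside_training_range(training_values, new_values):
    """Return the share of new_values outside [min, max] of training_values."""
    if not training_values or not new_values:
        raise ValueError("both inputs must be non-empty")
    low, high = min(training_values), max(training_values)
    outside = sum(1 for value in new_values if value < low or value > high)
    return outside / len(new_values)


# Illustrative numbers: trained around $1000/day, now seeing ~$100k/day.
training_earnings = [900.0, 1000.0, 1100.0, 1050.0]
recent_earnings = [1000.0, 95000.0, 100000.0, 120000.0]

share_unseen = fraction_outside_training_range(training_earnings, recent_earnings)
```

A high share suggests the model is being asked about values it was never trained or tested on.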
  - rramadass a year ago
    
    Right. Can you share how such issues are handled in the ML pipeline?
    
    tomtom1337 a year ago
    
    The most common solution is to frequently retrain on the latest data. A forecasting model might retrain every week, including the last week's data, and might even drop older data, for instance training data older than a year.
    It's best to transform your target variables, like "number of orders", to "number of orders per customer per day" or something like that. And then in your pipeline, you feed the latest estimate on your number of customers (e.g. average of the last two weeks). That's way more robust over time.
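The two ideas above (a rolling training window and a per-customer target) can be sketched roughly like this. The row layout, function names and numbers are assumptions for illustration; the model itself is left out and stands in as a `predicted_rate` value.

```python
from datetime import date, timedelta


def training_window(rows, today, max_age_days=365):
    """Keep only rows recent enough to train on (drops data older than a year by default)."""
    cutoff = today - timedelta(days=max_age_days)
    return [row for row in rows if row["day"] >= cutoff]


def orders_per_customer(rows):
    """Normalized target: orders divided by customer count, per row."""
    return [row["orders"] / row["customers"] for row in rows]


def forecast_orders(predicted_rate, recent_customer_counts):
    """Scale a predicted per-customer rate by the latest customer estimate
    (here a plain average of recent counts)."""
    estimate = sum(recent_customer_counts) / len(recent_customer_counts)
    return predicted_rate * estimate


# Illustrative rows only.
rows = [
    {"day": date(2023, 1, 1), "orders": 100, "customers": 50},
    {"day": date(2024, 6, 1), "orders": 400, "customers": 200},
    {"day": date(2024, 6, 8), "orders": 420, "customers": 210},
]

window = training_window(rows, today=date(2024, 6, 15))
targets = orders_per_customer(window)
total_forecast = forecast_orders(2.0, [200, 210])
```

Note that the raw order count quadruples across these rows while the per-customer rate stays flat, which is why the normalized target tends to hold up better over time.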
    
    rramadass a year ago
    
    Makes sense. We need to continuously monitor the performance of the model deployed in the field with our preexisting statistical knowledge of the data and then accordingly schedule regular "model updates".

vismit2000 a year ago

A good related book is 'Designing Machine Learning Systems' by Chip Huyen which covers similar topics: https://www.oreilly.com/library/view/designing-machine-learn...

rramadass a year ago

This looks pretty good and more aligned to what I was looking for;
Thanks for the pointer.

pmg101 a year ago

> Even so, it is difficult to understate the impact that machine learning is having on many aspects of our lives.

I'm struggling to parse this. Does this mean that the impact has been so small that it is very difficult to understate it? Do they mean "it is difficult to OVERstate the impact", meaning that the impact has been large?

Jeff_Brown a year ago

They definitely meant the latter.

dcchambers a year ago

Wow what's the context behind this? Is it worth reading? Who are the authors? Looks like this was first published in print in 2023 but there's basically no reviews online.

rramadass a year ago

https://news.ycombinator.com/item?id=41563220
Read the introduction to see for yourself.

BodyCulture a year ago

I can haz PDF, please…

Don’t eat me!

rramadass a year ago

Psst...I heard about this awesome invention called "Search Engines" ...