This article contains affiliate links. See my affiliate disclosure for more information.
Python trainer Matt Harrison has been creating a bit of a stir.
Some of his pandas examples, like the one below, have elicited emotional responses from different folks in the Twitterverse:
One person commented: "Nobody in their right mind writes code like this. Right?" Someone Quote Tweeted it saying: "How not to write Python code."
The thing is, Matt's an experienced programmer with "years in the trenches," as he puts it. He's written best-selling books on Python and pandas, and regularly trains data science teams at top companies.
I knew there had to be more to the story.
Was there some disconnect between the folks criticizing Matt's code and the reality that has shaped Matt's lambda-laden, method-chaining style of writing pandas? I decided to reach out and see if Matt was interested in chatting with me about his code and the reactions to it on Twitter.
I'm so glad he said yes.
David: I've just been absolutely fascinated by the discourse I've seen on your posts on Twitter. And, you know, it's been interesting to watch you handle that as well. But I get the sense that there's some context around the code examples that you post that some people might be missing. Is that the case?
Matt: Yeah, maybe I should talk quickly about my background so that people know where I'm coming from. I have a CS degree and, you know, my original thought out of school was I'd be a software engineer. My first job out of school was doing natural language processing, and I've been in data most of my career using Python. I'm not a statistician. I'm not an admin. My background is in writing code.
What I do now is train professionals in the Python data space, but I also do some consulting and advising for companies. I'm an educator, but I would say that I have time in the trenches. So I'm not reading slides or just making stuff up just because I think it's fun to harass people on the internet.
My goal as an educator is to, as I say, covertly teach best practices and, more generally, software engineering best practices to people who are using code as a tool. People who claim they don't want to be coders. They're just using code as a tool, right? But they're coding, and my goal is to help them write professional code that they won't hate themselves later for writing.
That said, I get that code is like everyone's baby, and people who are coders put a lot of time, sweat, and tears into their code. When you call someone's baby ugly or say that their baby isn't doing things right, that's like a personal offense to a lot of people and they take that the wrong way. So, I get that.
But you aren't your code, just like you aren't the company you work for. Your identity should be separate from your code. And so, if you're using code your goal would be that you would want to be able to use code better, or that's what I think your goal might be. If it's not, that's maybe weird to me.
Having said that, I don't think you'd ever get a hundred people to agree on the one true way or one right way to code. 70% of them will say you should do object-oriented programming. 20% of them might say you need to do Rust. And maybe 1% of them would say you need to do functional programming or something like that. So you're never going to get a consensus on that.
David: What's the elevator pitch for writing pandas code the way that you do?
Matt: One common thing that you'll see in the data science world is this notion that there's like Untitled2.ipynb. This notion that when data scientists go to work in the morning, they take whatever Jupyter notebook they worked on yesterday, they copy it and paste it and start off again. It's kind of like, "Well when I use Excel, I just save it as a new file and then I start from that," right?
My goal is to help with that so you don't have Untitled28.ipynb, you have Analysis_for_ClientA.ipynb and that's the only notebook you have. And you can come back to it tomorrow and pick it up where you left off and you're going to be productive.
Your code will be easier to read. Others can use your code, you can test your code, you can come back to your code in a week and pick it up and others can do the same.
David: You've been called a gatekeeper by some people on Twitter. In one Tweet you said you write code for professionals. Is that where this idea of gatekeeping comes from? That if you don't write code a certain way it's not "professional?"
Matt: Yeah, I think so. You know, someone was upset because I said that I teach professionals. I asked them who their audience was and they said they work with teachers or grad students or something. And they thought that my claim that I teach professionals was a jab at them. That somehow they aren't professional.
I see what they're saying. But, I do teach professionals. Like, that's what I do. I go into big companies that you've heard of and watch shows on their platforms, and I teach them how to write code that will serve them well. A lot of these people aren't necessarily software engineers, but they are writing code.
David: I see your content as filling a gap, in a way. If you search for pandas you're going to find the pandas docs and a whole bunch of beginner-oriented content. That's true of almost anything in programming. So there's this gap between what's easy to find on search engines and what, I think, professionals actually need. Does that play into this at all?
Matt: The style of code that you find out there through search — I've written code like that. And it was painful. Most data science posts on Medium are like "Top Pandas Functions To Remember" sort of things. These posts teach operations in isolation, which is okay. But in practice, I've never had a data set where I've done one thing to it in isolation.
In my 20-plus years of working with data, I have multiple steps and I don't care about the intermediate steps. I care about the raw data, what's coming in, and I care about the clean data that I'm going to visualize or I'm going to send into a machine learning system. I care about the end results. Writing a chain is the recipe that gets me those end results.
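As an illustration of the "chain as recipe" idea (this is not code from the interview), here is a minimal sketch. The DataFrame, its column names, and the individual cleaning steps are all made up for the example; the point is only that raw data goes in at the top, clean data comes out at the bottom, and no intermediate variables are left behind.

```python
import pandas as pd

# Hypothetical raw data standing in for whatever arrives from upstream.
raw = pd.DataFrame({
    "City": ["Provo", "Provo", "Ogden", "Ogden"],
    "Temp F": ["41", "45", None, "39"],
    "Date": ["2023-01-01", "2023-01-02", "2023-01-01", "2023-01-02"],
})

# Each line is one step of the recipe; only the end result gets a name.
clean = (
    raw
    .rename(columns={"City": "city", "Temp F": "temp_f", "Date": "date"})
    .dropna(subset=["temp_f"])                      # drop rows with no reading
    .assign(
        temp_f=lambda df_: df_["temp_f"].astype("float64"),
        date=lambda df_: pd.to_datetime(df_["date"]),
    )
    .sort_values(["city", "date"])
    .reset_index(drop=True)
)
```

The raw frame is left untouched, so the whole recipe can be rerun from the top at any time.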
David: What is it that separates beginner pandas code from professional pandas code?
Matt: I would say that if you want to write good pandas code — let's draw out the term professional and just say good pandas code — you should know how to write lambdas. You should know how to do list and dictionary comprehensions. Dictionary unpacking, which is something that a lot of people probably don't use, is super useful in pandas world.
Some people say, "I want to write my code so that someone who's never used pandas or Python can look at it and use it." Well, if that's your audience, good luck with that. I don't want to cater to the lowest common denominator.
I assume that they understand what lambdas, comprehensions, and dictionary unpackings are. I also assume the audience has some minimum level of Python experience. I don't think lambdas and dictionary unpacking and list comprehensions are necessarily beginner-level Python code.
So I'm not saying that people who write code in this beginner style aren't professionals. I'm saying that their code is written in a naive way. I'm sorry if that offends them. I don't think it's their fault. I think it's due to, like you said, a lot of the content floating around the internet being written in this beginner style that shows you how to do something in isolation.
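To show how the three Python features Matt lists can fit together in pandas, here is a small sketch under assumed data (it is not taken from his material). The column names and the Fahrenheit-to-Celsius conversion are hypothetical; the pattern is a list comprehension to pick columns, a dictionary comprehension of lambdas to describe new columns, and dictionary unpacking to hand them to .assign().

```python
import pandas as pd

# Hypothetical data: two columns measured in Fahrenheit.
df = pd.DataFrame({
    "temp_f": [32.0, 212.0],
    "dewpoint_f": [14.0, 50.0],
})

# List comprehension: pick the columns to convert.
fahrenheit_cols = [col for col in df.columns if col.endswith("_f")]

converted = df.assign(
    # Dictionary comprehension + unpacking: one new "_c" column per "_f" column.
    # `col=col` binds the current column name to each lambda; without it, every
    # lambda would see the last value of `col` by the time assign() calls it.
    **{
        col.replace("_f", "_c"): (lambda df_, col=col: (df_[col] - 32) * 5 / 9)
        for col in fahrenheit_cols
    }
)
```

The original columns are kept and the new Celsius columns are appended after them.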
David: You've mentioned a couple of times that you work with people who don't necessarily consider themselves programmers, but are professional pandas users. Is there maybe some other style of writing pandas code that is appropriate in a more traditional software engineering setting?
Matt: No. I would say that if you're writing pandas code you should embrace the chain. Basically, it's a constraint. If you limit yourself to the constraint, it forces you to think about each step of what you're doing along the way.
David: Is the size of the project a factor at all? How big are the notebooks you see people work with, typically?
Matt: I have seen some notebooks where they do longer things. But, I mean, people complain about five lines of a chain like that's the worst thing ever.
David: Yeah, I've seen some comments where people object to just a few chains in a row. And it's like, come on, really?
Matt: Well, there's Demeter's Law, which is a general programming principle. If you have an object and you call an instance member of an object and you call an instance member of that object and you chain those operations, then that's a violation of Demeter's Law, which says that you shouldn't ask the internal part about the internal part. Whatever you're starting from should expose the functionality to do that.
I don't really buy that I'm violating Demeter's Law. We're not really dealing with internal parts of internal parts here. This is completely different from that. We have a DataFrame and we're returning another DataFrame. We're not digging into the DataFrame and pulling out this part and then pulling out some subpart of it.
David: A lot of software engineering best practices make sense at a large scale. But they don't necessarily make sense on a small scale. You end up with a lot of boilerplate. Like, you could have just done this in a couple of functions in one file, so why are you scattering things around and doing all this stuff?
But, you know, at a certain scale it makes sense because, if you don't do that, it's just too hard to deal with things. Are there scalability issues that might make chaining less desirable?
Matt: From my point of view, the scale here is you've got super complicated data. And maybe you end up with a very long chain. In that case, one of the things you can do is leverage .pipe(). Maybe these twenty lines are cleaning up the temperature data, or whatever. So you could say .pipe(clean_temperature_data) and just remove those twenty lines.
But I've seen people write a whole bunch of pipes that all dispatch to a single line of code. And my point is: you should be able to just read what's in the pipe. By separating things you're putting more cognitive overhead on yourself because now you have to scroll up and down to read what all these things are when previously the line in the chain told you what it was.
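A rough sketch of what moving a block of steps behind .pipe() might look like follows. The name clean_temperature_data comes from Matt's example, but its body, the data, and the column names are assumptions made for illustration, and the helper here is much shorter than the twenty lines he describes.

```python
import pandas as pd

def clean_temperature_data(df):
    """Hypothetical helper: the multi-step temperature cleanup, kept as its own chain."""
    return (
        df
        # Turn unparseable readings into NaN, then drop them.
        .assign(temp_f=lambda df_: pd.to_numeric(df_["temp_f"], errors="coerce"))
        .dropna(subset=["temp_f"])
        # Discard physically implausible readings.
        .loc[lambda df_: df_["temp_f"].between(-100, 150)]
        .assign(temp_c=lambda df_: (df_["temp_f"] - 32) * 5 / 9)
    )

# Hypothetical raw data with one bad string and one implausible value.
raw = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "temp_f": ["50", "bad", "999", "68"],
})

summary = (
    raw
    .pipe(clean_temperature_data)   # one readable line instead of the whole cleanup
    .groupby("station")
    .agg(mean_temp_c=("temp_c", "mean"))
)
```

In line with Matt's caveat, a helper like this is probably only worth extracting when it replaces many steps; a pipe that wraps a single line mostly adds scrolling.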
There are cases, though, where chaining legitimately makes things hard. Maybe you really do need an intermediate variable because you're calculating something that's derived from two different things. There are ways to do that without breaking a chain. You can use .pipe() to make an intermediate variable, and then you can refer to that later on if you need to. So, I haven't seen anything that tells me chaining doesn't scale as far as code size goes.
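One possible way to keep an intermediate result without breaking the chain is sketched below. The capture helper, the snapshots dictionary, and the sales data are all hypothetical; the sketch stores the intermediate frame in a plain dictionary, which is one option among several.

```python
import pandas as pd

def capture(df, store, key):
    """Hypothetical helper: save a reference to the frame at this step, pass it on unchanged."""
    store[key] = df
    return df

# Hypothetical data.
raw = pd.DataFrame({
    "region": ["west", "west", "east", "east"],
    "sales": [10, 30, 40, 20],
})

snapshots = {}

west = (
    raw
    .pipe(capture, snapshots, "all_regions")   # remember the unfiltered frame
    .query("region == 'west'")
    .assign(
        # Derived from two things: the filtered rows and the earlier, unfiltered total.
        share_of_total=lambda df_: df_["sales"] / snapshots["all_regions"]["sales"].sum()
    )
)
```

Because the steps run top to bottom, the snapshot already exists by the time the lambda inside .assign() looks it up. Note that anything stored this way stays in memory for as long as the dictionary holds a reference to it.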
As far as the other scale that we might think about, which is data size, one thing to be aware of is that pandas is an in-memory tool. Your data needs to fit in memory. That's why the first thing I do when I load my data is set the correct types to shrink it down. I gave an example the other day where I went 95% smaller with a few lines of code.
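The following is a minimal sketch of shrinking a DataFrame by setting narrower types. It is not the example Matt refers to: the data is synthetic, the column names are invented, and the savings on real data will depend on its contents.

```python
import numpy as np
import pandas as pd

# Synthetic data: a repetitive text column plus two numeric columns
# that use wide default types.
n = 10_000
raw = pd.DataFrame({
    "station": ["north", "south", "east", "west"] * (n // 4),
    "temp_f": np.linspace(-10.0, 110.0, n),   # stored as float64
    "humidity": np.arange(n) % 101,           # whole numbers from 0 to 100
})

slim = raw.astype({
    "station": "category",   # few distinct values, so store integer codes instead
    "temp_f": "float32",     # half the width; assumes the reduced precision is acceptable
    "humidity": "uint8",     # values fit comfortably in 0..255
})

before = raw.memory_usage(deep=True).sum()
after = slim.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

Downcasting is only safe when the values actually fit the narrower type, so it is worth checking the range of each column first.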
And the other thing is that pandas is not particularly smart about doing copy-on-write semantics. In fact, it doesn't really have that. For my clients, I usually recommend a 3X–10X memory overhead so that you have space to do these operations. So, yes, data size can be a problem.
One of the objections people have to chaining is that it's problematic with data size. But it's actually less problematic than what we might call the naive style. Maybe the word naive has a bad connotation, but it is kind of like that. You make all these intermediate variables, right? You're actually storing references to all of these intermediate objects that are all copies of your data.
When you chain there is often a copy being made, but there's no pointer or variable holding it. It gets garbage collected. You make an intermediate variable and then the next thing does something with it and, at that point, no one else is using that intermediate variable so it's gone. You don't have to worry about that or manage that.
David: Switching gears a little bit, I'm curious to know what readability means to you. What is it that makes code readable?
Matt: I think that's a question that there's no standard answer for. If you ask ten people, you're going to get ten different answers. I will say that I don't just teach pandas, I also teach Python. A lot of people in my fundamentals of Python class will hear me say, "Your goal is not to write code that's easy to write. Your goal is to write code that's easy to read."
But beauty is in the eye of the beholder and so, again, you need to take into account who your audience is. I don't think a professional who's writing Python code or a professional who's writing pandas code should write code that's easily read by someone who's fresh and hasn't had any training or background.
I would say there's some baseline. Write Python in whatever idiomatic Python style looks like, rather than if it were C or Java. I see a lot of people coming out of university saying they learned Python, but really they learned C++ or Java. The instructor took their Java content and translated it into Python, which isn't particularly hard, and now the student knows Python, right? But they don't really know Python.
Having said that, just because I write pandas in this chaining style doesn't mean that when I write Python code I'm writing chains all over the place. Someone asked me where object-oriented programming is used with pandas. When I'm writing pandas code, I don't write classes. Does that mean that I never write classes? No, I write classes all the time. It's just that I don't really need to use classes when I'm writing pandas.
So I guess I'll answer this from a pandas point of view. For me, readable pandas code looks like a recipe. It's got steps in it. You can say this is what the first step is, this is what the second step is, and this is what the third step is. Chaining is the constraint that, if you follow that constraint, basically forces you to write your code as if it was a recipe.
David: To me, there's a kind of unsatisfyingly obvious answer to "what is readable code?" and that is any code that I can read and understand without having to work too hard. If I have to write things down and take notes in order to understand a few lines of code, then we're starting to leave the territory of readability.
One of the things that attracted me to the chaining style when I first saw it in your Twitter posts was that I only have to think about what each step is doing and not how it's done. I find myself writing more of my code like this. Not chaining, necessarily, but in a more declarative style. That's something I see a lot in pandas code that I come across, and, I think, especially in method chaining.
But there's another objection people have to chaining that we haven't talked about yet. I see people argue that the lack of access to intermediate steps in a chain makes the code hard to test and debug. Can you talk about that?
Matt: I post this code online and I think a lot of people are like, "Oh, you're just like walking up to a computer, typing in this whole piece of code, and then you're done? Where are the intermediate steps? How do I debug this?"
I recommend people go and search for my idiomatic pandas talks on YouTube where I show that this is the end result. It's not like I sat down at a computer and wrote this whole thing in one go. I sat down at a computer and made this line-by-line, testing it as I go.
I don't care about the intermediate results at the end. I'm inspecting those as I'm going through them and validating that what I'm doing actually works. I think a lot of it is just like they don't understand how I got to that endpoint because they just see the end.
And, you know, I generally challenge people and ask "How would you rewrite this?" Most people don't take that, but one person did take that and they're like, "I'm going to write this big long notebook where I declare some variables up here." And they're like, "You need to put some Markdown in there."
A lot of people say I need to use Markdown because you need comments. And so you need to have multiple cells because you need to have Markdown between them, right? And then each cell needs to have like some Markdown above it explaining what it's doing and that sort of thing.
I mean, if that's readable to you, that's great. But the problem is, when I come back to this tomorrow, I've got to find those cells and run them in order. And if you happen to make them out of order or something, then I'm kind of in a bad place.
In my notebooks, I put my chain into a function when I'm done, and then I just put that function at the very top of my code and I can come back tomorrow, load my raw data, and run that and I'm good to go.
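A sketch of that workflow, with made-up names and data, might look like the following. The function name tweak_sales, the columns, and the steps are assumptions for illustration; in a real notebook the raw frame would come from a file or database rather than being built inline.

```python
import pandas as pd

def tweak_sales(df):
    """Hypothetical cleanup function: raw frame in, analysis-ready frame out."""
    return (
        df
        .rename(columns=str.lower)
        .assign(
            date=lambda df_: pd.to_datetime(df_["date"]),
            revenue=lambda df_: df_["units"] * df_["price"],
        )
        .loc[:, ["date", "revenue"]]
    )

# Stand-in for loading the raw data (for example with pd.read_csv).
raw = pd.DataFrame({
    "Date": ["2023-01-02", "2023-01-01"],
    "Units": [3, 5],
    "Price": [2.0, 4.0],
})

# One call reproduces the clean data, no matter what else ran in the notebook.
sales = tweak_sales(raw)
```

Since the function never modifies its input, loading the raw data and calling the function again should always give the same result, which avoids the out-of-order-cells problem described above.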
Having said that, if I do want the intermediate variables, I can use .pipe() and just make a global variable. I show examples of that in my Effective Pandas book. How do you debug that? You can comment out the pipe and walk through. That's really easy. People ask if you can use debug tools. Yes. You put a breakpoint at the line and you can step into the method at that point. The claim that you can't debug it is kind of bogus to me.
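As a hedged illustration of inspecting a chain while debugging (this is not the code from Effective Pandas), the sketch below uses a hypothetical peek helper. Instead of creating a global variable, it records each intermediate shape in a list and prints it; the data and step names are invented.

```python
import pandas as pd

def peek(df, label, log):
    """Hypothetical debugging helper: record and print the frame's shape, pass it on."""
    log.append((label, df.shape))
    print(f"{label}: {df.shape}")
    return df

# Hypothetical data with one missing reading.
raw = pd.DataFrame({
    "city": ["Provo", "Ogden", "Provo", "Ogden"],
    "temp_f": [41.0, None, 45.0, 39.0],
})

shapes = []

result = (
    raw
    .pipe(peek, "raw", shapes)
    .dropna(subset=["temp_f"])
    .pipe(peek, "after dropna", shapes)
    # .query("temp_f > 40")   # steps can be commented in or out while walking the chain
    .groupby("city")
    .agg(mean_temp_f=("temp_f", "mean"))
    .pipe(peek, "after groupby", shapes)
)
```

Once the chain behaves as expected, the peek lines can be deleted without touching any other step. A breakpoint set inside peek is also a convenient place to stop and look at the frame in a debugger.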
David: So what are your final thoughts on all of this? What should people take away from our conversation?
Matt: If people don't like chaining, that's probably bad news for them because my belief is that we're just going to see more of that with next-gen tools like Polars, which is a DataFrame implementation in Rust.
You kind of have to chain in Polars. You can do a .filter() at the very end of the chain and it will go back up to read the CSV file and limit which columns and rows it reads based on the filter. You can do query optimization from the chain, which you wouldn't get if you didn't chain in Polars.
And really, my advice would be just to try out chaining in pandas. I think a lot of people have an adverse reaction to it, but they never try it out. A lot of people who read my book say, "I was skeptical, but I tried it out and now it's changed how I write pandas code."
Want to learn how to write effective pandas code?
Check out Matt's latest book Effective Pandas. It shows you how to clean your data, create powerful visualizations, and write your own data recipes. And there's an entire chapter dedicated to debugging.
Get instant access to the eBook on Matt's website or order a print copy on Amazon.
My favorite part is how Matt uses diagrams to explain operations and help you build a mental model for working with pandas DataFrames: