Wes McKinney, the developer of Pandas (qz.com)
155 points by gk1 on Dec 8, 2017 | hide | past | favorite | 50 comments


The majority of credit is due Wes, no doubt, for bringing pandas to life, but it's a glaring omission not to mention the huge contributions from the wider community toward pandas (shout-out to Jeff Reback!) https://github.com/pandas-dev/pandas/graphs/contributors


It's true -- as Jeff (lead core dev/maintainer for the last several years) often says, "Wes gets the kudos, I get the hate-mail."


There were three people who left their full-time jobs in 2012 to found a startup with Wes promoting pandas; that's the year this article highlights as pandas' inflection point in popularity. You were one of them. That omission also seems worth correcting.


Probably also worth at least a small shout-out to the R project (and the S+ and S teams before them) for pioneering the data frame API that Pandas mimics.


Wes noted himself that pandas.read_csv() with its 50 or so parameters probably accounts for a substantial part of its popularity :)


For years, I maintained my own python tool for loading/saving CSVs from numpy formats. It was slow, buggy, constantly hitting edge cases. When `pd.read_csv()` and `.to_csv()` came onto the scene, it was like the clouds opened up and a chorus of angels sang. And then you have all the other `read` and `to` functions, it's glorious.
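To illustrate the round trip being praised here, a minimal sketch (the column names and data are made up, not from the thread):

```python
# Write a small DataFrame to CSV and read it back, using a couple of
# read_csv's many parameters (index suppression, explicit dtype).
import io
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "value": [1.5, 2.5]})

buf = io.StringIO()
df.to_csv(buf, index=False)  # serialize without the index column
buf.seek(0)

df2 = pd.read_csv(buf, dtype={"value": "float64"})  # parse with an explicit dtype
assert df.equals(df2)  # lossless round trip
```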


I'm a fan and frequent user of Pandas, but does the increase in Stack Overflow questions indicate a surge in popularity, or difficulty in use? I for one run into Pandas issues frequently and often find myself searching for the succinct solution (though, to be fair, this may also be an indication of the impressive scope Pandas strives for).


I think you're right. There are many, many things I've come across for which I search Stack Overflow excessively, because I'm surprised there isn't a better method of achieving the task. Try to do a cross join in pandas; it's deeply dissatisfying.
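For context, a sketch of the workaround being complained about: before pandas 1.2 a cross join had to go through a constant dummy key (pandas 1.2+ later added `merge(how="cross")`). The frames here are illustrative:

```python
# The classic dummy-key cross-join idiom in pandas.
import pandas as pd

left = pd.DataFrame({"x": [1, 2]})
right = pd.DataFrame({"y": ["a", "b"]})

cross = (left.assign(_k=1)           # add a constant join key to both sides
             .merge(right.assign(_k=1), on="_k")
             .drop(columns="_k"))    # then throw the key away

assert len(cross) == len(left) * len(right)  # 2 x 2 = 4 rows
```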


Pandas is useful and I don't want to bad-mouth it, as people obviously find it valuable. However, it has a complicated API and contains about 200k lines of code. So it is not a surprise that documentation is a challenge and that there are a lot of Stack Overflow questions. For example, figuring out which methods result in copies of the data vs. new views is hard.
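A small sketch of the copy-vs-view trap mentioned here (the classic SettingWithCopy situation; the data is made up):

```python
# Chained indexing may assign into a temporary copy, so the original
# frame is silently unchanged; a single .loc call writes in place.
import warnings
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")     # suppress SettingWithCopyWarning
    df[df["a"] > 1]["b"] = 0            # chained indexing: writes to a copy
assert df["b"].tolist() == [4, 5, 6]    # original unchanged!

df.loc[df["a"] > 1, "b"] = 0            # correct: one .loc call
assert df["b"].tolist() == [4, 0, 0]
```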

Compare with dplyr. It solves a similar problem to pandas but has a vastly simpler API. To be fair, pandas does do more, but dplyr is also more flexible in other ways. I looked at implementing something like dplyr in Python, but you really need a lazy evaluation syntax; dplyr makes extensive use of this feature of R. The downside is that this lazy evaluation code can be very confusing to new users, as it is hard to debug.

Rather than adopting pandas to build our product, I built a very minimal version of it (on top of numpy) that only does what we need. That was some extra work, but I'm happy I did it, as we avoid this huge dependency. I understand quite well what my little minimal version does; it is only about 1,000 lines of Python code and some tiny C extensions.


I've used pandas frequently and find their docs entirely unsatisfying. Stack Overflow provides examples and fairly good insight on using pandas as a whole, compared to the docs, which basically just say that a function exists.


Pandas has much more breadth than depth. The first few months I used it, I felt the same. At some point it all just clicked and I more or less knew where to look for stuff in the docs.


Anything mildly complex is difficult. And I guess an unpopular tool wouldn't see an increase in questions for long: people would either abandon it or learn it, so the number of questions should decline or at least stay around the same.


I really love pandas and dplyr, but honestly both of them are inferior to modern SQL. In my workflows, I've almost exclusively replaced them with Postgres and its foreign data wrappers: spit out the results to a text file, then load into R or Python.

It’s a more complicated environment for sure, but still more efficient.


I've found pandas great for interactive sessions, for the most part, but I found doing joins was way too fiddly and I'd much rather do it in SQL. Could be I missed an important pandas concept in there somewhere that would have made it make more sense, or maybe the API has improved since the last time I tried. Generally I've found my SQL ends up being clearer and more readable after the fact.



Have you tried Apache Spark + pyspark + Jupyter? Spark allows SQL against all data frames bound to the current context:

  df1.createOrReplaceTempView("table1")
  df2.createOrReplaceTempView("table2")
  spark.sql("SELECT * FROM table1 t1, table2 t2 WHERE t1.id=t2.fid")


I haven't tried it but there is a package for R that I find similarly useful, called sqldf. The back end is sqlite, which is quite fast for anything that fits on disk.

I find both hobbled, however, by their rather simple SQL dialects. No window functions, common table expressions, grouping sets or similar, etc. An improvement, no doubt, but Postgres still has it beat.

I honestly wish the vast corpus of effort put into statistics and machine learning in R and Python were more easily portable as Postgres foreign function packages. I've tried the ones I've seen mentioned and they were all horrible, mostly due to embedding an entire language's runtime and memory model on top of the SQL inner workings. A project on my backburner, waiting for the time to work on it :)


I've only touched enough Pandas to get a feel for it, so I'm talking more here about dplyr, but I think most of this applies to Pandas as well.

Yes, SQL is incredibly battle-tested, has a ton of features that newer systems need to catch up on, and there's a ton of work sunk into performance on too-big-for-RAM datasets that standalone Pandas and dplyr don't have. (But see dbplyr, for example.)

But actually programming in SQL sucks. There's no such thing as code reuse, as parameterization, as composability, etc. Everything about DRY goes right out the window when writing SQL, and copy-pasta is the law of the land. Writing dplyr is so much more comfortable for me with just a little bit of learning. And I think it's going to be much easier to extend dplyr to cover the gaps between it and SQL than it would be to fix the ergonomics of SQL.
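To make the composability point concrete in pandas terms (the `top_n` helper below is hypothetical, purely for illustration): a transformation step is just a parameterized function you can reuse, which raw SQL strings make awkward.

```python
# A reusable, parameterized pipeline step -- the kind of thing that in
# SQL would typically be copy-pasted with edits rather than shared.
import pandas as pd

def top_n(df, by, n=2):
    """Return the n rows with the largest values in column `by`."""
    return df.sort_values(by, ascending=False).head(n)

df = pd.DataFrame({"v": [3, 1, 2]})
result = df.pipe(top_n, by="v", n=2)  # .pipe keeps the chain readable
assert result["v"].tolist() == [3, 2]
```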


Are there any specific examples that will illustrate your point ?


The biggest painpoints IMO involve:

* anything with data that doesn't fit in memory

* anything that could trivially be written in an SQL window function

* anything involving table joins, especially multi-column and/or non-inner joins.

* anything involving subtotal-like summaries (easily implemented using GROUPING SETS, CUBE, and/or ROLLUP)
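A sketch of the window-function and subtotal gaps from the list above: SQL's `AVG(v) OVER (PARTITION BY g)` maps onto `groupby(...).transform` in pandas, but what `GROUPING SETS` would hand you directly has to be assembled by hand. The data below is illustrative:

```python
# Window-function analogue plus a hand-built subtotal in pandas.
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1.0, 3.0, 10.0]})

# SQL: SELECT v, AVG(v) OVER (PARTITION BY g) AS g_mean FROM df
df["g_mean"] = df.groupby("g")["v"].transform("mean")
assert df["g_mean"].tolist() == [2.0, 2.0, 10.0]

# Subtotal rows (one line of GROUPING SETS in SQL) are a separate step:
totals = df.groupby("g")["v"].sum()
assert totals.tolist() == [4.0, 10.0]
```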


This is so true for me as well. I actually use more SQL, Bash, and even MS Access DBs and Excel pivots connecting to text files to get the data in the right, aggregated shape first. Then I load into R or Python as you say so I can work with the data in memory.


Dan should do a Wes McKinney vs Hadley Wickham type article about the future of data science instead. (hint: very bright)

I use pandas more than any other tool, but every time I look at R, I go through a period of re-evaluating all my life decisions.


Hadley and I are also friends and collaborators! I think we're going to see a lot more interesting collaborations between the R and Python communities in the future, since at the end of the day we're solving a lot of the same problems.


Oh yeah, I didn't mean to imply that it wouldn't all be in good fun. I was thinking of the obvious counterpoint for the author to use.

You know, Heinrich Wölfflin, one of the pioneers of art history, invented a breakthrough teaching method where he would set up dual projectors, because demonstrating contrasts was a much more effective pedagogical tool than simply describing a painting.


Today, McKinney works full time on Pandas and other open-source data science projects as a software engineer(...)

What? It seems McKinney has committed almost nothing to pandas for the past few years (not to diminish his contributions per se, but the article makes it sound like he's still dedicated to improving pandas directly).

https://github.com/pandas-dev/pandas/graphs/contributors


I've been working on innovating core computational and IO infrastructure for pandas (and projects like pandas) -- much of this work has been happening in other codebases. See: http://wesmckinney.com/blog/apache-arrow-pandas-internals/


And it's looking good, too. Thank you.


Pandas, Seaborn and Jupyter. A gift from the gods for any (starting but also more advanced) programming biologist.


Great, but no love for R and Hadley Wickham ?


I thought the article was going to be about Hadley too. But I'm pretty sure he was the HN star yesterday.


In case you're interested in reading his book online -- or 'running' it -- here it is:

https://notebooks.azure.com/wesm/libraries/python-for-data-a...

If you want to read, click on any notebook, for example:

https://notebooks.azure.com/wesm/libraries/python-for-data-a...

If you want to run, click Clone, sign in, then Run. It's basically a collection of Jupyter notebooks. This is from his personal repo.

[disclaimer: work at msft]


Or you could clone his Github repo, and do it in Jupyter:

https://github.com/wesm/pydata-book

I was really pleased to notice that the second edition of the "Pandas book" (https://www.amazon.com/Python-Data-Analysis-Wrangling-IPytho...) just came out in late October. I'm about halfway through reading it now.


Ted Petrou has written a very detailed critical review of this book. He finds it lacking in certain areas.

https://medium.com/dunder-data/python-for-data-analysis-a-cr...


"The man behind Pandas, the most important Python tool in data science"


Not even. That's unarguably NumPy.

... Followed by scikit-learn. Then huge projects like TensorFlow and PyTorch. Only then, pandas.


I would say that Pandas likely has way more users than tensorflow or pytorch. Every burgeoning data analyst / non-ML data scientist uses Pandas.


"The man behind Pandas, a reasonably important Python tool in data science"


the title used to read "Pandas, the most important data science tool"


Pandas really is great. It's not just that it's a convenient library, it's also really nicely implemented and super efficient. And it has a lot of tools that guide the user towards doing things the right way.


I tried to zoom in on the photo to actually see his face, but the website hijacked zooming on mobile.


Which photo? The header photo at the beginning of the article, showing a man working on an Apple laptop, is not Wes McKinney. That is a stock photo of "a man works at his computer at the Airbnb office headquarters in San Francisco" [1]. There is a portrait of the actual Wes McKinney a little ways down the article.

I don't know why QZ thought it was a good idea to start a "Meet the man" article with a large photo of a man who is not the titular man.

[1] https://www.citylab.com/environment/2017/09/lab-report-shari...


Wes McKinney is a good guy, and Pandas is very useful. The title of this article is grandiose though.


Exactly. I thought this was an article about John Chambers, Ross Ihaka, and Robert Gentleman.

https://en.wikipedia.org/wiki/R_(programming_language)


Funny, but you have to admit pandas is the lingua franca of data science now.

A couple of years ago the whole team was using only R. Now some guys use it here and there a couple of days a month.


Not necessarily in my world. My theory...

Programmers grok pandas.
Statisticians grok R.


That matches my observations very well.


Thanks, we've updated the headline.


Basically, Pandas makes it so that data analysis tasks that would have taken 50 complex lines of code in the past now only take 5 simple lines

A metric worth underlining.
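A hedged illustration of that claim (with made-up data): a group-aggregate-sort that would once have required manual parsing and loops fits in a handful of lines.

```python
# Five lines of pandas: parse a CSV, total sales per city, rank the result.
import io
import pandas as pd

csv = io.StringIO("city,sales\nNYC,10\nNYC,30\nSF,25\n")
top = (pd.read_csv(csv)
         .groupby("city", as_index=False)["sales"].sum()
         .sort_values("sales", ascending=False))
assert top.iloc[0]["city"] == "NYC"  # NYC totals 40, SF totals 25
```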


Can we update the title to 'behind the pandas library'?


Added "Pandas" to title to make it less clickbaity.



