
It's crystal clear that this page was written for people who already know what they're looking at; the first line of the first paragraph, far from describing the tool, talks about its qualities: "Polars is written from the ground up with performance in mind"

And the rest follows the same line.

Could anyone ELI5 what this is and for what needs it is a good solution?

EDIT: So an alternative implementation of Pandas DataFrame. Google gave me [0] which explains:

> The pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields.

> DataFrames are similar to SQL tables or the spreadsheets that you work with in Excel or Calc. In many cases, DataFrames are faster, easier to use, and more powerful than tables or spreadsheets because they’re an integral part of the Python and NumPy ecosystems.

[0]: https://realpython.com/pandas-dataframe/



Yes, it's an annoying negative feature of many tech products. Of course it's natural to want to speak to your target audience (in this case, data scientists who like Pandas but find it annoyingly slow/inflexible), but it's quite alienating to newbies who might otherwise become your most enthusiastic customers.

I am the target audience for Polars and have been meaning to try it for several months, but I keep procrastinating because I feel residual loyalty to Pandas: Wes McKinney (its creator) took the time to write a helpful book about the most common analytical tools: https://wesmckinney.com/book/


Ritchie Vink (the creator of Polars) deliberately decided not to write a book so that he (and his team) can focus full time on Polars itself.

Thijs Nieuwdorp and I are currently working on the O'Reilly book "Python Polars: The Definitive Guide" [1]. It'll be a while before it gets released, but the Early Release version on the O'Reilly platform gets updated regularly. We also post draft chapters on the Polars Discord server [2].

The Discord server is also a great place to ask questions and interact with the Polars team.

[1] More information about the book: https://jeroenjanssens.com/pp/

[2] Polars Discord server: https://discord.gg/fngBqDry


Slightly offtopic: it's a tragedy that projects like this use Discord as the primary discussion forum. It's like Slack in that knowledge goes there to die.


Yeah, one thing that helps a bit is that they encourage you to post your questions to Stack Overflow, where they'll answer them.


Disagree. Yes, forums with nested comment structure and voting are better for preserving and indexing data, but synchronous communication (like Discord) is better for having actual conversations and building community. Back and forth volleys of messages let people completely explore threads of thinking, rather than waiting days (or months) for a single response that misunderstands the question.

There’s a reason people prefer it: it’s better.


> There’s a reason people prefer it: it’s better.

That's a very "black and white" view of the issue.

It could be that it's better for the specific people having the current conversation. But what about afterwards?

It used to be that you could search Google and if something was answered in a forum post, you'd find it. But since a lot of this discussion is now locked behind closed doors (like Discord), it becomes really hard to find, even with Discord's own search.

Everything being ephemeral could help someone right now, but what about the others in the future, wanting the exact same answer?


Just ask it again, so you can burn out and drive away the existing members with repetitive questions!


Okay I'll elaborate, it's better for community building and answering questions for nascent companies or organizations. A lot of the time there's a big disconnect between a community and how a product is intended to be used.

I built a paper trading app for stocks and options and Discord was the primary place where the users talked. The subreddit was almost completely empty, nobody responded to tweets, the Instagram was thriving but there was no sense of community because you couldn't tell if anyone commenting actually used the product or a meme just showed up in their feed.

Did I have to repeat myself a bunch? Yes, but that's fine, especially because the answer sometimes changed (rapidly), like "No we don't support multiple trading accounts per user" > "Yeah I implemented that last week, it's in the release notes, to add a second account..."

For a mature product that's not changing as much and more or less has all its features built out, it makes sense to branch out into more structured forums that are easily searchable, especially as you progress through different versions and users are looking for answers to past versions.


> I built a paper trading app for stocks and options and Discord was the primary place where the users talked. The subreddit was almost completely empty, nobody responded to tweets, the Instagram was thriving but there was no sense of community because you couldn't tell if anyone commenting actually used the product or a meme just showed up in their feed.

What I'm hearing is that you think Discord is better than a forum, because people talked in Discord but they didn't talk in the forum, a forum which you didn't have?

Do you have comparable experience building a community via a traditional forum VS doing so with Discord? As far as I can tell, it doesn't seem like you have tried a normal forum, yet you say Discord is better for community building.


Shouldn't the knowledge go into a wiki anyways? Reading old discussions in Reddit can turn up interesting things, but it's time consuming and bad information doesn't get fixed later.


Luckily our book will also be available in hard copy so you can digest all that hard-won knowledge in an offline manner :)


I'll wait until chatGPT can regurgitate it.


I do understand the humor in your snarky comment; however, do buy a copy if you want to support them. Making a book is neither cheap nor easy.


losing all nuance by virtue of getting dopamine quicker? count me in!


I often see this comment, and every time I think: but having people come to the information AND the community is better for the project.


Short term perhaps, but long term having a non-indexed community is inconvenient for newcomers.


Microsoft Copilot can summarize discussions. With some orchestration it could even extract question+answer pairs from past discussions and structure them in a Stack Overflow-like format.

source: we use this feature in beta as part of the enterprise copilot license to summarize Teams calls. Yes, it listens to us talking and spits out bullet points from our discussions at the end of the call. It's so good it feels like magic sometimes.

note on copilot: any capable model could probably do it. I just said copilot because it does it today.


That's why Ritchie is very active on, and often refers to, Stack Overflow as well! Exactly to document frequent questions instead of losing them to chat history.


There are projects you can use to index Discord servers; unfortunately, a lot of communities just don't use them.


Why would they? A person who picks Discord has no idea what knowledge discovery is.


Forums work really well for this. I personally avoid using Discord because chatrooms are too much of a time suck. There's far more chaff to sift through and trying to keep up with everything leads to FOMO.


by community do you mean all the people who make an account just to ask a question on the project's discord, only ever open it to check if someone answered and then never use discord again?


There is a free Polars user guide [0] as part of the Polars project. It was known as "polars-book" before it was moved in-tree [1].

[0] https://docs.pola.rs/user-guide/

[1] https://github.com/pola-rs/polars/tree/main/docs/user-guide


I'm not suggesting people need to write books to introduce their projects, but that landing pages should be more accessible to newbies if you want to build a big user base. A lot of projects introduce themselves by ticking off a list of currently-desirable buzzwords ('performant', 'beautiful' etc.) but neglect to articulate clearly what their project is and why someone might want to use it.


Any plans to try fine-tuning an LLM specialised in Polars? That would really be the killer feature to get major adoption, IMO.


Is there an even more basic book on dataframe / storage solutions for ML applications that you'd recommend to more junior people? Thank you


> it's quite alienating to newbies who might otherwise become your most enthusiastic customers.

Newbies are your best target audience too! They aren't already ingrained in a system and have to learn a new framework. They are starting from yours. If a newbie can't get through your docs, you need to improve your docs. But it's strange to me how mature Polars is and that the docs are still this bad. It makes it feel like it isn't in active/continued development. Polars is certainly a great piece of software, but that doesn't mean much if you can't get people to use it. And the better your docs, the quicker you turn noobs into wizards. The quicker you do that, the quicker you offload support onto your newfound wizards.


"Newbies" to data science are indeed a good target audience, before they are already attached to pandas. But this doesn't imply they know nothing. It's very unlikely that someone both 1. has a need to do the kind of data analysis that polars is good at, and 2. has never heard of the "data frame" concept.


The docs are okay, but the feature set is lacking compared to pandas, which is understandable since this is at version 0.2. I was exploring if it's possible to use this, but we need diff aggregation which it doesn't have, so it's a no go right now.


Do you mean something like `.agg(pl.col("foo").diff())`?

Or is diff aggregation its own thing? (I tried searching for the term, but didn't find much.)


Nevermind, it has it but it's under Computation in polars.Series.diff and I was looking under Aggregation. This is great.

For instance, you've got a time series with an odometer value and you want the delta from the previous sample to compute the distance of each trip.


> But it's strange to me how mature Polars is and that the docs are still this bad.

Interesting, I've personally found them quite good, and compared to datafusion or duckdb they're dramatically better. I agree pandas has better docs, but one of the strengths of polars is that I often don't need the docs, thanks to the careful thought that went into designing a minimal and elegant API; not to mention they actually care about subtle quirks like making autocomplete, type hinting, etc. work well.


Sounds like we might be coming from different perspectives. I honestly don't use any DF libraries often, and really only Pandas. I used to use pandas a fair amount, but that was years ago, and now I only have to reach for it a few times a year. So maybe the docs are good for people who already have deeper experience. I think the fact that you've used datafusion and duckdb illustrates that you're more skilled in this domain than I am; I haven't used those, haha.

But I do think making good docs is quite hard. You usually have multiple audiences that you might not even be aware of, which makes keeping an open ear out for them one of the most important things to do. It's easy to get trapped thinking you've got your audience figured out while you're actually (unintentionally) closing the door on many more groups. It's also just easy to be focused on the "real" work and not think about docs.


What, specifically, is bad about the docs? This whole thread is people who just looked at the home page, saw that it is "DataFrames", but didn't know what that means and came here to complain. Nobody has said anything about issues with the docs for someone who understands what a data frame is (or spent like two minutes looking that up) but is struggling to figure out how to use this library specifically.


I think your experience is probably making it difficult to understand the noob side of things. For me, I've struggled with simply slicing up a dataframe. And as I specified, these aren't tools I use a lot, so the "who understands what a data frame is" probably doesn't apply to me very well and we certainly don't need the pejorative nature suggesting that it is trivially understood or something I should know through divine intervention. I'm sure it's not difficult, but it can take time for things to click.

Hell, I can do pretty complex integrals and derivatives and now so much of that seems trivial to me now but I did struggle when learning it. Don't shame people for not already knowing things when they are explicitly trying to learn things. Shame the people that think they know and refuse to learn. There's no reason to not be nice.

Having done a lot of teaching, I have a note: don't expect noobs to be able to articulate their problems well. They're noobs. They have the capacity to complain, but it takes expertise to clarify that complaint, turning it into a critique. I get that this is frustrating, but being nice turns noobs into experts, and often friends too.


I really think this is a misunderstanding of the purpose of different kinds of documentation. The documentation of a new tool for a mature technique is just not the primary place to focus on writing a beginners' tutorial / course on using that technique. Certainly, "the more the merrier" is a good mantra for documentation, so if they do add such material, all the better. But it is very sensible for it to not be the focus. The focus should be, "how can you use this specific iteration of a tool for this technique to do the things you already know how to do".

Nobody is suggesting that you should be an expert on data frames "through divine intervention". But the place to expect to learn about those things is the many articles, tutorials, courses, and books on the subject, not the website of one specific new tool in the space.

If you're really interested in learning about this, a fairly canonical place to start would be "Python for Data Analysis"[0] by Wes McKinney, the creator of pandas and one of the creators of the arrow in-memory columnar data format that most of these projects build atop now.

This is a (multiple-) book length topic, not a project landing page length topic.

0: https://wesmckinney.com/book/


> But it is very sensible for it to not be the focus.

Sure. I mean devs can do whatever they want. But the package is a few years old now and they do frequently advertise, so I don't think it's too much to ask that they make it more approachable for... you know... noobs.

This is a bit difficult of a conversation too, because you've moved the goal post. I've always held the context of noob, but now you've shifted to just be dismissive of noobs. Totally fine, but different from the last comment.

> But the place to expect to learn about those things is the many articles, tutorials, courses, and books on the subject, not the website of one specific new tool in the space.

I actually disagree. This is the outsourcing I expressed previously, but it's clear from the number of complaints that this is not sufficient for a novice. You do seem passionate about this issue, and so maybe you have the opportunity to fill that gap. But I very much think that official documentation is supposed to be the best place. Frankly because it is written by the people who have a full understanding of the system and how it all integrates together. I'm sure you've run into tons of Medium tutorials that get the job done but are also utter garbage and misinform users. It isn't surprising when most of these are written by those in the process of learning, and are better than nothing, but they are entirely insufficient. The whole point of creating a library is to save people time. That time includes onboarding. For an example of good docs, I highly recommend the Vim docs. Even man pages are often surprisingly good.


> now you've shifted to just be dismissive of noobs

No, I'm sorry, this is getting ridiculous. I'm not being dismissive of noobs, I'm saying "noobs should seek introductory material when attempting to learn an entirely new subject, like books, courses, or tutorials on the subject matter".

It's just so freaking weird for you to expect every single tool in some space to create that introductory material.

I promise you that the ruby on rails website did not assume total ignorance of the term "web application" when I first came across it as a "noob". I was a total noob at ruby on rails, but I had to understand why I might be interested in "web applications, but easier".

I could spend all day coming up with examples that are just like this. And this is not some kind of failure of imagination in how to document specific projects, it's just specialization. The website of a new tool for something that has been done a bunch of times over multiple decades is not the right place to put the canonical text on what the thing you're doing is; you put that in a book or in college courses or other kinds of training materials.

Unless what you have made is a brand new entirely unfamiliar thing (which is very rare) with no introductory materials for your brand new novel concept available anywhere, it makes more sense to focus your documentation on "why choose this specific solution over the other ones people are already familiar with" rather than "what even is the thing that we're doing here from first principles". Sure, add some links to the best introductory materials, but don't try to write them yourself, that's crazy!

> I actually disagree. This is the outsourcing I expressed previously, but it's clear from the number of complaints that this is not sufficient for a novice. You do seem passionate about this issue, and so maybe you have the opportunity to fill that gap.

No, I'm not passionate about this issue. I think people who actually want to learn things will continue doing research and reading books and taking classes to learn about new subjects, and that people who just want to complain will continue to do so. There is no "gap" to fill. There are tons of great materials that will describe in great depth what "data frames" are, and how to work with them, for anyone who is even the tiniest bit interested.

> I very much think that official documentation is supposed to be the best place. Frankly because it is written by the people who have a full understanding of the system and how it all integrates together.

I think what you seem to be confused by is the difference between this one library - polars - and an entire large subject - tabular data analysis using data frames. It certainly does make sense for the polars website to document the polars library, which (in my view) it already does. But if you want to learn the subject, you need to do that in the normal way that people have always learned new subjects. I'm sorry, because you seem resistant to this, but again, the way to do that is with books and courses, not by reading the documentation of one tool comprising a tiny sliver of a very large subject.

> I'm sure you've run into tons of Medium tutorials that get the job done but are also utter garbage and misinform users.

No, Medium tutorials should not be your go-to source for learning about a new subject! Your go-to source should be books and courses.

This is why I keep commenting here. I want to get through to you that you seem to be going about the acquisition of knowledge in a very weird and fundamentally misguided way. It just isn't the case that knowledge is mostly found in the documentation of tools! There is way more foundational knowledge to learn than it would ever make sense for every little tool to document themselves.

This is, in a very literal sense, why people write books about things, and why schools exist. We don't teach algebra by linking to the Mathematica documentation.


I can't speak for the Python side of the Polars docs but coming from Python and Pandas to Rust and Polars hasn't always been easy. To be fair, that isn't just about docs but also finding articles or Stack Overflow answers for people doing similar things.


That certainly makes sense!


I'm a dataframes noob. I saw this post and the performance claims attracted me. I went to ChatGPT to understand what dataframes were about. Then on Udemy, I searched for a Polars course. The course required prerequisites: a bit about Jupyter notebooks and pandas. So I went through a few modules of a pandas course first, and now I'm going through a Polars course. Altogether, I spent about 2-3 hours to set up the environment and learn what this is all about.

A little bit of context would have helped attract a lot more noobs.


Your first paragraph makes perfect sense! I was nodding along. But then your concluding sentence was a bit of a record scratch for me. This all worked as intended! You knew what the project was about - "data frames" - and what might make it attractive to you - the performance claims - and then you went and followed exactly the right path to get the context you needed to understand what's going on with it. It's a big topic that you were able to spin up on to a basic level in 2-3 hours, by pulling on strings starting at this landing page. This is a very successful outcome.

I'd also recommend this book: https://wesmckinney.com/book/. It's not about polars, but you'd be able to transfer its ideas to polars easily once you read it.


"How To Be A Pandas Expert"[1] is a good primer on dataframes. There's a certain mental model you need to use dataframes effectively but it's not apparent from reading the official docs. The video makes it explicit: dataframes are about like-indexed one-dimensional data, and every dataframe operation can be understood in terms of what it does to the index.

[1] https://www.youtube.com/watch?v=oazUQPrs8nw


The Rust docs are for some reason much worse than the Python docs, or at least that used to be the case


I'm a data engineering newbie and I found it very clear, and it gave me an enthusiastic feeling (not an "alienating" feeling).

This whole thread just comes across as unmitigated pedantry to me.


Presumably you were introduced to the concept of DataFrames and how they're used through some other source, because the Polars landing page doesn't even bother to mention it's used for data analysis, and the documentation simply assumes you're already familiar with the core concepts.

Compare that to Pandas which starts with the basics, "pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language." It then leads you to "Getting started" guide which features "Intro to pandas" that explains the core concepts.


Wes has also worked hard to improve a lot of the missteps of pandas, such as through pyarrow, which may prove even more impactful than pandas has been to date.

Polars is also a wonderful project!


Polars is also based on McKinney’s Arrow project.

Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model.

https://github.com/pola-rs/polars/blob/main/README.md


Wes also literally created another Python dataframe project, Ibis, to overcome many of the issues with pandas

https://ibis-project.org

most data engines and dataframe tools these days use Apache Arrow, it's a bit orthogonal


It's annoying only because it's on Hacker News; what are the odds of landing on this page if you don't know what it is and don't have a need for it?


It's annoying because a single leading sentence would be enough to explain a product. Some of the words (for example "Data Frame") in that sentence can be links to other pages if that's necessary. It's a small change but it makes a huge difference.


HTML has an element to provide definitions of terms without having to link out, but almost nobody uses it.


I mean, pretty high. What if your boss just tells you to learn polars, and you don’t know why? Saying what something is, is just good communication, and can help clarify for people who are confused.


I guess that in the remote event you're told to learn a new skill you don't know anything about, you go to the pola.rs website, see "DataFrames for the new era", and start reading the documentation from there to learn what a DataFrame is. The website clearly shows what it is; it's your duty to understand it. I would argue that if you already knew what DataFrames are, you'd be saying "Why is it telling me something so basic instead of just showing me the good stuff?"

I, for example, hate websites that try to serve newbies. Newbies have a lot of content available if they're interested; not all of the web needs to serve them.


> What if your boss just tells you to learn polars, and you don’t know why? Saying what something is, is just good communication

Shouldn't the good communication happen when the boss tells you to learn polars? Like, why are you telling me this, boss; what is it that you need done?


These workplaces where bosses tell employees to learn unheard-of tools with zero context sound terrible.


> Yes, it's an annoying negative feature of many tech products.

Sadly it's not only tech products, but also things like security disclosures.

It always follows the same pattern:

    - Spend $X time coding/researching something.
    - Spend $not_enough_time documenting it.
    - Spend $far_too_much_time thinking about / "engaging with the community" in deciding on a cute name, fancy logo and cool looking website.


It’s pandas, but fast. Pandas is the original open source data frame library. Pandas is robust and widely used, but sprawling and apparently slower than this newcomer. The word “data frames” keys in people who have worked with them before.


Actually pandas is not the original open source data frame library, perhaps only in Python. There is a very rich tradition in R on data.frames, which includes the unjustly neglected data.table.


Yep! Unless I'm mistaken, R (and its predecessor S) seems to have been the first to introduce the concept of a dataframe.

One could also argue that dataframes are basically in-memory database tables. And in that case, S and SQL probably tie in terms of the creation timeline.


The difference is dataframes can also be seen like matrices. You can do row operations, row + column operations, multiply rows and columns, multiply different matrices, transpose them etc. These kinds of things don't really make sense in DB tables (and they are generally not supported; you have to jump through hoops to do similar things in DBs).


Yes, that's totally fair; dataframes are more flexible in that sense.


Oh, and another important difference is memory layout. The dataframe implementations mostly (or all) use column-major format. Whereas most conventional SQL implementations use row-major format, I believe.


I think most OLTP databases are row oriented whilst most OLAP are column.


> The difference is dataframes can also be seen like matrices. You can do row operations, row + column operations, multiply rows and columns, multiply different matrices, transpose them etc.

I think this is overblowing the similarities to matrices. Matrices have elements all of the same type, while data.frames mix numbers, characters, factors, etc. You certainly cannot transpose a data.frame and still have a data.frame that makes sense. Multiplying rows would not make sense either, since within one row you will have different types of data. Unless you have a data.frame that is all numeric, but in that case one should probably be using a matrix in the first place.


> Unless you have a data.frame that is all numeric, but in that case one should probably be using a matrix in the first place.

They still have their advantages with row/column labels, NaN handling etc. These are not operations I am speculating about by the way. I am most familiar with pandas and the dataframe there has transpose, dot product operations and almost all column operations have their correspondence in rows (i.e. you either sum(axis=0) or sum(axis=1)).
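A small illustration of those label-aligned, matrix-like operations in pandas (the data is made up):

```python
import pandas as pd

# An all-numeric frame with row labels (index) and column labels.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["x", "y"])

col_sums = df.sum(axis=0)   # sum down each column  -> a: 3.0, b: 7.0
row_sums = df.sum(axis=1)   # sum across each row   -> x: 4.0, y: 6.0
transposed = df.T           # rows and columns (and their labels) swap
product = df.dot(df.T)      # matrix multiplication, aligned on labels
```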


Oh, based on the comment you replied to I thought this was about R. In R matrices can handle NaNs and NAs, have column and row labels, have dot products and much more.


I feel like the predecessor of R should be Q!


The way that I've heard the story, S was short for "statistics", and R was chosen because the authors were _R_obert [Gentleman] and _R_oss [Ihaka].

Statisticians are funny!


Yeah. I think Wes McKinney liked the data frames in R, but preferred the programming language of Python. I've heard somewhere that he also got a lot of inspiration from APL.


R is literally designed to do statistics and has first class support and language feature support for many specialized tasks in statistics and closely related fields.

Python is literally designed to be easy to program with in general.

Well, it turns out when you’re dealing with terabytes of data and TFLOPS, the programming becomes more important than the math. Not all R devs are happy about this and they are very loud about it.

But it shouldn’t really surprise anyone. That is literally how those languages are designed.

Most of the R devs I know like this are just butthurt they are paid less and refuse to switch because they’re obstinate, or they’re a little scared they’re being left behind. first group is all over the place, but the second group tends to skew older of course


R is heavily influenced by Scheme. Not only is it heavily functional, but it has metaprogramming capabilities allowing a high level of flexibility and expressiveness. The tidyverse libraries use this heavily to produce very nice composable APIs that aren't really practically possible in Python.

R is fine. The issue is more in the ecosystem (with the aforementioned exception of the tidyverse).


Imho forecasting is still way better in R than in Python


> Most of the R devs I know like this are just butthurt they are paid less and refuse to switch because they’re obstinate, or they’re a little scared they’re being left behind. first group is all over the place, but the second group tends to skew older of course

Look, I started with R and use mostly Python these days, but this is not really a fair take.

R is (still) much, much, much better for analytics and graphing (the only decent plotting library in python is a ggplot clone). The big change (and why Python ended up winning) is that integrating R with other tools (like web stuff, for example) is harder than just using Python.

pandas (for instance) is like an unholy clone of the worst features from both R and Python. Polars is pretty rocking, though (mostly because it takes cues from Spark/dplyr/LINQ).

It's another example of Python being the second best language for everything winning out in the marketplace.

That being said, if I was starting a data focused company and needed to pick a language, I'd almost certainly build all the DS focused stuff in R as it would be many many times quicker, as long as I didn't need to hire too many people.


> which includes the unjustly neglected data.table

So so true.

I was working on an adhoc project that needed a quick result by the end of the day. I had to pull this series of parquet files and do some quick and dirty analysis. My first reflex was to use python with pandas, quick and easy. Python could not handle the datasets, too large. I decided to give R and data.table a go and it went smoothly. I am usually a python user but from time to time I feel compelled to jump back to R and data.table. Phenomenal tool.


Indeed.

Python is great for getting things into a data.frame. But once in a data.frame, R is so much easier to work with for all things stats and ml-related.


My friend. You cannot make people like R. We all know about and study data.table, so it’s not neglected, we just don’t use that implementation.

Mainly because R sucks for anything that isn’t statistics.


Ah, like polar bears are a much more aggressive implementation of the idea behind panda bears? That’s a pretty funny name if so.


Yeah. The name always makes me chuckle


I don't know, I think the name is kind of polar-izing

/pun


Oh, I'm not sure. I'd say it's bear-ly polarizing.

I'm so sorry.


Depends on your frame of mind.


This thread is turning into pandamonium


I'm worried it's going to get grizzly.


Thanks folks, you all made my day. "Frame of mind" was my favorite. I'm surprised I didn't think of some of these, I must be getting... Rust-y


Next re-implementation will be called grizzl.ys, hand-written in Y86 assembly.


Pandas has also moved to Apache Arrow as a backend [1], so it’s likely performance will be similar when comparing recent versions. But it’s great to have some friendly competition.

[1] https://datapythonista.me/blog/pandas-20-and-the-arrow-revol...


Not according to DuckDB benchmarks. Not even close.

https://duckdblabs.github.io/db-benchmark/


Ouch! It is going to take a lot of work to get Polars this fast. If ever.


Polars has an OLAP query engine so without any significant pandas overhaul, I highly doubt it will come close to polars in performance for many general case workloads.


This is a great chance to ELI5: what is an OLAP query engine and why does it make polars fast?


Polars can use lazy processing: it collects all of the operations together and creates a graph of what needs to happen, whereas pandas executes each operation as soon as it's called.

Spark does this too, and it makes complete sense for distributed setups, but apparently it's still faster locally as well.


Laziness in this context has huge advantages in reducing memory allocation. Many operations can be fused together, so there's less of a need to allocate huge intermediate data structures at every step.


yeah, totally, I can see that. I think that polars is the first library to do this locally, which is surprising if it has so many advantages.


It's been around in R-land for a while with dplyr and its variety of backends (including Arrow, the same as Polars). Pandas is just an incredibly mediocre library in nearly all respects.


> It's been around in R-land for a while with dplyr and its variety of backends

Only for SQL databases, so not really. Source: have been running dplyr since 2011.


The Arrow backend does allow for lazy eval.

https://arrow.apache.org/cookbook/r/manipulating-data---tabl...


Memory and CPU usage is still really high though.


Not with eager API.


> Pandas is the original open source data frame library

...ehh, not quite. R and its predecessor S have Pandas beat by decades. Pandas wasn't even the first data frame library for Python. But it sure is popular now.


That's interesting! I didn't realize there had been prior dataframe libraries in Python!

Out of curiosity, what was/were the previous libraries?


It's a built-in data structure (and set of functions) in R.


Oh, yes, I was aware that R (and its predecessor S) have a native dataframe object in the language.

It seemed that gmfawcett was indicating that there was a dataframe library in _Python_ that existed prior to Pandas. I was curious what that library was/is, as I'd not heard that before.


ok, guess I misunderstood both of your comments. ´_>`


Sorry :) Pandas is undisputed king. But there were multiple bindings from Python into R available in the early 2000's. Some like rpy and rpy2 are still around, others are long defunct. I concede that these weren't standalone dataframe libraries, but rather dataframe features built into a language binding.


Not *original* but probably most commonly used.


Yeah, I believe Pandas was inspired by similar functionality in R.


yup, I first met data frames in R, and pandas is the Python answer to R, isn't it?


If I understand correctly, Pandas' original scope was indexed in-memory data frames for use in high-frequency trading, using the numpy library under the hood. At the time it was written you had JPMC's Athena, GS's platform, and several HFT internal systems (written in C++, friends in that space have mentioned). Pandas is just so darn useful! I've been using it since maybe version 0.10, and even got to contribute a tiny bit to the sas7bdat handling.


indeed, it's both: it was created for financial analytics, and it provides R dataframe features to Python. Thanks for making me take a detour into its history.


I may be a rare bird, having started with R dataframes (still newbie+ level), then moved to Python polars (intermediate- ?). Frankly, whenever I have to use pandas or dataframes in R, I am not convinced that they are more intuitive or easier to master; e.g. I do not like the concept of row names.

Polars can be overkill for small/medium datasets, but since I have been bitten by corrupted/badly formatted CSVs/TSVs, I love the fact that Polars will throw in the towel and complain about types/column-number mismatches etc. And the fact that it can scale up to millions of rows on a modest workstation compensates for the fact that sometimes one can spend hours finding a proper way to manipulate a dataset.


I was going to say - it always feels so humbling seeing pages like this. "DataFrames for the new era" okay… maybe I know what data frames are? "Multi-threaded query engine" ahh, so it’s like a database. A graph comparing it to things called pandas, modin, and vaex - I have no clue what any of these are either! I guess this really isn’t for me.

It’s a shame because I like to read about new tech or project and try and learn more, even if I don’t understand it completely. But there’s just nothing here for me.

This must be what normal people go through when I talk about my lowly web development work…


It's pretty much just an alternative to SQL that's a lot easier/natural to use for more hardcore data analysis.

You can much more easily compose the operations you want to run.

Just think of it as an API for manipulating tabular data stored somewhere (often parquet files, though they can query many different data sources).


Dataframes and SQL have overlapping functionality, but I wouldn't say that dataframes are an "alternative" to SQL. The tradeoffs are very different. You don't have to worry about minimizing disk reads or think about concurrency issues like transactions or locks, because a dataframe is just an in-memory data structure like a list or a dict, rather than a database. Dataframes also aren't really about relational algebra like SQL is.


Have you tried polars? If pandas is all you've tried, I agree it's pretty far from an alternative frontend for query engines. But polars maps pretty cleanly to SQL and can be optimized into a query plan in much the same way. I should've made it clear that I'm talking about an alternative to SQL in an OLAP context, not OLTP.


Data tables tend to also be a standard ingestion format for statistical tools in many cases.


> It's pretty much just an alternative to SQL

Polars actually has SQL support, so you can mix and match: e.g. construct a DataFrame from SQL.


In fairness, the title of the page is “Dataframes for the new Era”. The “Get Started” link below the title links to a document that points to the GitHub page, which explains what the library is about to people with data analysis backgrounds: https://github.com/pola-rs/polars


But annoyingly, not the <title>, thus the useless HN headline.


I wish HN had secondary taglines we could use to talk about the actual content or relevance of an article apart from its headline.


I'm currently getting dragged into "data" stuff, and I get the impression it's a parallel universe, with its own background and culture. A lot of stuff is like "connect to your Antelope or Meringue instances with the usability of Nincompoop and the performance of ARSE2".

Anyway, probably the interesting things about polars are that it's like pandas, but implemented in Rust on top of a more efficient memory format called Arrow (which pandas can now use too, and which was co-created by pandas' original author), plus something like a "query planner" that makes combining operations more efficient. Typically doing things in polars is much more efficient than in pandas, to the extent that things that previously required complicated infrastructure can often be done on a single machine. It's a very friendly competition.

As far as I can tell everybody loves it and it'll probably supplant pandas over time.


> As far as I can tell everybody loves it and it'll probably supplant pandas over time.

I've been using pandas heavily, everyday, for something like 8 years now. I also did contribute to it, as well as wrote numpy extensions. That is to say, I'm fairly familiar with the pandas/numpy ecosystem, strengths and weaknesses.

Polars is a breeze of fresh air. The API of pandas is a mess:

* overuse of polymorphic parameters and return types (functions that accept lists, ndarrays, dataframes, series or scalar) and return differently shaped dataframes or series.

* lots of indirections behind layers of trampoline functions that hide default values behind undocumented "=None" default values.

* abuse of half baked immutable APIs, favoring "copy everything" style of coding, mixed with half supported, should-have-been-deprecated, in-place variants.

* lots and lots of regressions at every new release, of the worst kind ("oh yeah we changed the behavior of function X when there is more than Y NaNs over the window")

* Very hard to actually know what is delegated to numpy, what is Cython/pandas, and what is pure python/pandas.

* Overall the beast seemed to have won against its masters, and the maintainers seem lost as to what to fix versus what to keep backward compatible.
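The polymorphic-return-type point above is easy to demonstrate with standard pandas behavior: nearly identical indexing expressions hand back three different types.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

s = df["a"]            # a Series
sub = df[["a"]]        # a one-column DataFrame
cell = df.loc[0, "a"]  # a scalar
```

Code downstream of such calls has to handle every possible shape, which is exactly the kind of API surface being criticized here.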

Polars fixes a lot of these issues, but it has some shortcomings as well. Mainly I found that:

* the API is definitely more consistent, but also more rigid than pandas. Some things can be very verbose to write. It will take some years for nicer simpler "shortcuts" and patterns to emerge.

* The main issue IMHO is polars' handling of cross-sectional (axis=1) computations. Polars is _very_ time-series (axis=0) oriented, and most cross-sectional computations require transposing the data frame, which is very slow. Pandas has a lot of dedicated axis=1 implementations that avoid a full transposition.


Many axis=1 operations in pandas do a transpose under the hood, mind you. Axis=1 belongs in matrices, not in heterogeneous data. They are a performance footgun. We make the transpose explicit.


> Many axis=1 operations in pandas do a transpose under the hood, mind you

Sure, but many others are natively axis=1-aware and avoid full transposition.

> Axis=1 belongs in matrices, not in heterogeneous data.

I'm not sure I understand what that means. Care to elaborate?

> They are a performance footgun.

You don't get to only solve the problems that are efficient to solve...

> We make the transpose explicit.

Yes, but when you do mixed time-series / cross-sectional computations, you cannot always untangle both dimensions and transpose once. Sometimes your computation intrinsically interleaves cross-sectional and time-series work. In these cases, which happen a lot in financial computations, explicitly doing a full transpose is very slow.


Arrow is an in-memory data format designed to allow zero-copy interop between languages, including Rust, C++, and Python. It's a bit more sophisticated than just some arrays, but ultimately everything is just arrays.

Polars is implemented in Rust and uses Arrow to represent some data in memory.

Pandas is often used in a way that results in a lot of processing happening in Python. Data processing in Python is comically inefficient and single-threaded. Beating that sort of pipeline by 10-100x is not that difficult if you have optimization experience and are able to work in a more suitable language.


> A lot of stuff is like "connect to your Antelope or Meringue instances with the usability of Nincompoop and the performance of ARSE2".

I've been recently getting into DevOps stuff, and this is exactly what that sounds like to me. Same thing whenever I look at any new web framework:

Okay, cool, so your product brings the power of "WazzupJS" to "the edge"... But it also has "teleports" like "Blimper" but without all of the downsides of "Vormhole Blues"?

I'm sure that's really useful, but I kinda wish I knew what the product actually does


All domains seem to have this kind of in-group shorthand, regardless of scale of the community.


I try to use polars each time I have to do some analysis where dataframes help, so basically any time I'd reach for pandas, which isn't too often. So each time it's fairly "new". This makes it hard for me to believe that everyone saying "Pandas but faster" has actually used Polars, because I can often write Pandas from memory.

There's enough subtle and breaking changes that it is a bit frustrating. I really think Polars would be much more popular if the learning curve wasn't so high. It wouldn't be so high if there were just good docs. I'm also confused why there's a split between "User Guide" and "Docs".

To all devs:

Your docs are incredibly important! They are not an afterthought. And dear god, don't treat them as an afterthought and then tell people opening issues to RTFM. It's totally okay to point people in the right direction without hostility. It even takes less energy! It's okay to have a bad day and apologize later too; you'll even get more respect!

Your docs are just as important as your code; even if you don't agree with me, they are to everyone but you. Besides technical debt there is also design debt. If you're getting the same questions over and over, you probably have poor design or you've miscommunicated somewhere. You're not expected to be a pro at everything you do, and that's okay; we're all learning.

This isn't about polars, but I'm sure I'm not the only one to experience main character devs. It makes me (and presumably others) not want to open issues on __any__ project, not just bad projects, and that's bad for the whole community (including me, because users find mistakes. And we know there's 2 types of software: those with bugs and those that no one uses). Stupid people are just wizards in training and you don't get more wizards without noobs.


As another data point, I switched to Polars because I found it much more intuitive than Pandas - I couldn't remember how to do much in pandas in the rare (maybe twice a year) times I want to do data analysis. In contrast, Polars has a (to me, anyway) wonderfully consistent API that reminds me a lot of SQL.


Maybe that's it, because I don't really use SQL much. Reaching for pandas about the same rate. Maybe that's the difference? But I come from a physics background and I don't know many physicists and mathematicians that are well versed in SQL. But I do know plenty that use pandas and python. So there's definitely a lot of people like me. Also I could be dumb. Totally willing to accept that lol.


I also use Pandas very infrequently (it has been years). Usually, for data analysis, I'm reaching for R + tidyverse/data.table/arrow. I have found Python and Pandas to be inelegant and verbose for these tasks.

As of last week, I have a need to process tabular data in Python. I started working with polars on Friday, and I have an analysis running across 16 nodes today. I find it very intuitive.


Yep, been using pandas for years now, still have no real mental model for it and constantly have to experiment or chat with our AI overlords to figure out how to use it. But SQL and polars make sense to me.


Any new library will have a learning curve. I used Pandas for many years and never could get used to it, never mind memorize most of it. It was always all over the place with no consistency. Switched to Polars a year or so ago and never looked back. I can now usually write it from memory, as the method structure is clear. Still running Pandas in a production system that doesn't get many updates, but that's all. Even if you like Pandas, you cannot ignore how incredibly slow it is and, more importantly, what a memory hog it is.


The story of Polars seems to be shaping up a bit like the story of Python 3000: everything probably could have been done in a slow series of migrations, but the BDFL was at their limit, and had to start fresh. So it takes 10 years for the community to catch up. In the mean time, there will be a lot of heartache.


I honestly don't believe something like polars could've evolved out of pandas.

It's a complete paradigm shift.

There's honestly not much that probably could've been shared at the point polars was conceived. Maybe now there's a little more (due to the arrow backend) but still very little probably.


Above that it says “DataFrames for a new era” hidden in their graphics. I believe it’s a competitor to the Python library “Pandas”, which makes it easy to do complex transformations on tabular data in Python.


It seems like it's a disease endemic to data products. Everybody, the big cloud providers and the small data products, build something whose selling point is "I'm the same as Apache X but better." But if you don't know what Apache X is, you have to go read up on that, and its website might say "I'm the same as Whatever Else but better," and you have to go read up on that. I don't want to figure out what a product does by walking a "like X but better" chain and applying diffs in my head. Just tell me what it does!

I get that these are general purpose tools with a lot of use cases, but some real quick examples of "this is a good use case" and "this is a bad use case, maybe prefer SQL/nosql/quasisql/hadoop/a CSV file and sed" would be really helpful, please.


I dunno, I get the criticism, but also, every field assumes a large amount of "lingua franca" in order to avoid documenting foundational things over and over again.

Programming language documentation doesn't all start with "programming languages are used to direct computers to do things"; it is assumed the target audience knows that. Database documentation similarly doesn't start out with discussing what it means to store and access data and why you'd want to do that.

It's always hard to know where to draw this line, and the early iterations of a new idea really do need to put more time into describing what they are from first principles.

I remember this from the early days of "NoSQL" databases. They spilled lots of ink on what they even were trying to do and why.

But in my view this isn't one of those times. I think "DataFrames" are well within a "lingua franca" that is reasonable to expect the audience of this kind of tool to understand. This is not an early iteration of a concept that is not widely familiar, it is an iteration of an old, mature, and foundational concept with essentially universal penetration in the field where it is relevant.

Having said all that, I came across this "what is mysql" documentation[0] which does explain what a relational database is for. It's not the main entry point to the docs, but yeah, sure, it's useful to put that somewhere!

0: https://dev.mysql.com/doc/refman/8.0/en/what-is-mysql.html


If you don't know what the comparison product is either then you are not the target customer. This is a library for analyzing and transforming (mostly numerical) data in memory. Data scientists use it.


See also: Is it pokemon or big data https://pixelastic.github.io/pokemonorbigdata/


I run into the same problem. I don't know what Pandas are (besides the bears) and at some point up the "it's like X" chain, I guess you have to stop and admit you're just not the target user of this tech product.


> I guess you have to stop and admit you're just not the target user of this tech product.

On the other hand, how can you become a target user if you don't know that a product category exists?


This project is a solution to a particular kind of problem. The way you become a target user of that solution is by first having the problem it's a solution to.

If you have the problem "I want to analyze a bunch of tabular data", you'll start researching and asking around about it, and you'll quickly discover a few things: 1. people do this with (usually columnar / "OLAP") sql query interfaces, 2. people usually end up augmenting that with some in memory analyses in a general purpose programming environment, 3. people often choose R or python for this, 4. both of those languages lean heavily on a concept they both call "data frames", 5. in python, this is most commonly done using the pandas library, which is pervasive in the python data science / engineering world.

Once you've gotten to that point, you'll be primed for new solutions to the new problems you now have, one of which is that pandas is old and pretty creaky and does some things in awkward and suboptimal ways that can be greatly improved upon with new iterations of the concept, like polars.

But if you don't have these problems, then the solution won't make much sense.


That's on you. If you want to become a data engineer and data scientist -- the two software positions most likely to use polars -- get learning. Or don't: learn it when you need it.


I think something like dataframes suffers from having a name that isn't obscure enough. You read "dataframes" and think those are two words you know, so you should understand what it is.

If they'd called them flurzles you wouldn't feel like you should understand if it's not something you work with.


For me, “data frames” are forever associated with MPEG


How come some submissions don't even describe what they're about, beyond just the name? It's really puzzling how everyone is meant to know what something is from its name alone.


I've mentioned this before and got downvoted because of course everyone is a web dev and knows what xyz random framework (name and version number in the title, nothing else) is.


Right... but the title before the first line reads "DataFrames for the new era". If you don't know what a data frame is then, yes, it's for people who already know that.


You were right that the page is written for those who know what they are looking for, which is just fine. If you are getting started in DS/ML/etc. and have used numpy, pandas, etc., polars is useful in some cases. A simple one: it loads dataframes faster than pandas (from experience with a team I help).

I haven't played with it enough to know all its benefits, but yes, it's the next logical step if you are in the space using the above-mentioned libraries; it's something one will find.


Dataframes in Python are a wrapper around 2D numpy arrays, with labels and various accessors. Operations on them are orders of magnitude slower than using the underlying arrays.


There's a very good point here but I don't think it's made clear.

If your data fits into numpy arrays or structured arrays (mainly if it is in numeric types), numpy is designed for this and will likely be much faster than pandas/polars (though I've also heard pandas can be faster on very large tables).

Pandas and Polars are designed for ease of use on heterogeneous data. They also include a Python 'Object' data type, which numpy very much does not. They are also designed more like a database (e.g. 'join' operations). This allows you to work directly with imported data that numpy won't accept - after which Pandas uses numpy for the underlying operations.

So I think the point is if you are running into speed issues in Pandas/Polars, you may find that the time-critical operations could be things that are more efficiently done in numpy (and this would be a much bigger gain than moving from Pandas to Polars)
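A sketch of that "drop to numpy for the hot path" pattern (column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# For a numeric hot loop, extract the underlying ndarray and work there:
arr = df.to_numpy()           # a plain 2D float64 array
col_means = arr.mean(axis=0)  # pure numpy, no pandas per-call overhead
```

This only pays off when the columns share a numeric dtype; mixed-type frames fall back to object arrays and lose the benefit.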


I don't know where this myth originated from but I have seen this in multiple places. Even if you think about it 2d numpy arrays can't have different type for different columns.


"myth originated"

It's in the documentation.

I just learned pandas recently, and I would have said this same thing. Not because I read through the numpy code, but because I read the documentation.

Is it wrong? Can't a user pick up a new tool and trust some documentation without reading through hundreds of libraries built on libraries?

When was the last time someone traced out every dependency so they could confidently say something is a "myth"?


Where is it written in pandas documentation? Pandas dataframe is stored in list of 1d numpy arrays, not a single 2d array.


The columns with common dtypes are grouped together in something called "blocks" and inside those blocks are 2D numpy arrays. It is probably not in the documentation because it is seen as implementation detail but you can see the block manager's structure in this article (https://dkharazi.github.io/blog/blockmanager/) or in this talk (https://thomasjpfan.github.io/scipy-2020-lightning-talk-pand...).


Actually, numpy has something called a structured array that is pretty much what you described.


Well, if you use structured arrays or record arrays, you can do this (more or less).

https://numpy.org/doc/stable/user/basics.rec.html
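For example, a structured array gives you named, typed fields without leaving numpy (the field names here are invented for illustration):

```python
import numpy as np

# One dtype per named field, akin to a typed table row.
people = np.array(
    [(1, "Ada", 36.5), (2, "Grace", 45.0)],
    dtype=[("id", "i4"), ("name", "U10"), ("score", "f8")],
)

ids = people["id"]  # field access, similar to a dataframe column
```

Unlike a dataframe, though, there are no labeled rows, joins, or missing-value handling, which is where pandas/polars earn their keep.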


This is true specifically for Pandas DataFrames. Numpy arrays are themselves just wrappers around C arrays, which are contiguous chunks of memory. Polars supports operations on files larger than memory.


Marketing is a skill that needs to be learned. You have to put yourself in the shoes of a person who knows nothing about your product. This does not come naturally to the engineers who make these products and are used to talking to other specialists like themselves.


This is true in general but I'm not sure it's what's going on here.

Marketing is also very concerned with understanding who your target audience(s) are and speaking their language.

I think talking about "DataFrames" is exactly that; the target audience of this project knows what that means. What they are interested in is "ok but who cares about data frames? I've been using pandas for like fifteen years", so what you want to tell them is why this is an improvement, how it would help them. Dumbing it down to spend a bunch of space describing what data frames are would just be a distraction. You'd probably lose the target audience before you ever got to the actual benefits of the project.


Noticed exactly the same - there's no description of the library whatsoever on the landing page. It is implied that it is a DataFrame library, whatever that means.


Maybe this is sort of like the opposite of how scam emails are purposefully scammy, so that only people who can't recognize scams will fall for them. Only people who know what "a DataFrame library" is - which is an enormous number of people, since this is probably the most broadly known concept in data science / engineering - will keep reading this, and they are the target audience.


> which is an enormous number of people

While that may be, I think it would make sense to describe the project in a succinct way on the first page a visitor lands.


It is described in a succinct way. "DataFrames" is that description. It's the very first text on the page. It's really the same as having the word "database" be the first text on the landing page of a new database project. If you don't know what the word "database" means, the landing page for a new database project is really not the place to expect to learn about that. The "data frame" concept is not quite as old or broad as the concept of "databases", but it's really not that far off. It's decades old, and is about as close to a universal concept for data work as it's possible to get.


But you're not the audience? There is very little to gain by tailoring the introduction to people who aren't the audience.

You don't go car parts manufacturer expecting an explanation of what an intercooler is.


It’s not written for you and that’s fine. This is a library targeted at a very specific subset of people and you’re not in it.


pandas dataframes but faster


It's fine to me. Tech UI copy is often bad and weird, but it's not as if a better UX would gain you 5x the customers.


I don't use dataframes in my day job but have dabbled in them enough that I found this website pretty easy to digest.

You'd really have to be a complete data engineering newbie to not understand it I think?

I mean, where do you draw the line? You wouldn't expect a software tool like this to explain what it is in language my grandma would understand, I don't think?


> You'd really have to be a complete data engineering newbie to not understand it I think?

I do occasionally use Pandas in my day job, but I honestly think very few programmers that could have use for a data frame library would describe themselves as a “data engineer” at all.

In my case, for example, I’m just a physicist - I don’t work with machine learning, big data, or in the software industry at all. I just use Pandas + Seaborn to process the results of numerical simulations and physical experiments similarly to how someone else might use Excel. Works great.


Had the exact same thought seeing this. Too many of these websites are missing a simple tldr of what the thing actually is. Great, it's fast, but fast at what??


It has that simple tldr, it's the very first word, "DataFrames". Everyone in this thread just doesn't know what that means, and that's fine, I get that, but seriously, that's the simple summary. Data frames aren't an obscure or esoteric concept in the data analysis space; quite the opposite.


Hard agree. People post links to websites with technical descriptions and little basic info all the time, and this is the first time I'm seeing a thread of people complaining about it. If I'm interested in something I see, I start Googling terms; I don't expect a specification for software in a specific field to cater to my beginner-level knowledge.


I hate this doc style that has become so popular lately. They get so wrapped up in selling you their story that they forget to tell you basic shit. Like what it is. Or how to install it.

The PMs literally simplified things so much they simplified the product right out of the docs.


It is right there on the page, set to Python by default:

> Quick install
>
> Polars is written from the ground up making it easy to install. Select your programming language and get started!


Just once I’d like to see “this library was written to fulfill head-in-the clouds demands by management that we have some implementation, without regards to quality.”


For posterity, polars was a hobby product that started in 2020: https://news.ycombinator.com/item?id=23768227

> As a hobby project I tried to build a DataFrame library in Rust. I got excited about the Apache Arrow project and wondered if this would succeed.

> After two months of development it is faster than pandas for groupby's and left and inner joins. I still got some ideas for the join algorithms. Eventually I'd also want to add a query planner for lazy evaluation.


Definitely not intended as a slight toward this project, just (what I thought was) a funny thought about that expression.


That has absolutely no relation to this project. What in the world are you talking about?


I’m just responding to

> "Polars is written from the ground up with performance in mind"

It is a common thing to see, I thought it would be funny to imagine the opposite.


Trust me. It does. ;)


What do you mean? What "management" was this created to fulfill the demands of?


Oh... I misread. Somehow I read this as saying that the previous polars dataframe post on my blog had no relation to this website.

You can ignore my comment. It doesn't make sense.



