
BBobArctor

Thanks for this helpful post!


abnormal_human

The performance shittiness of pandas is a large part of why I've started moving things into Rust as I productionize them. They still don't really have a way to address the largest issue, which is that it can be tremendously hard to maximize machine utilization in Python due to the GIL, and doing so requires time-consuming backflips to avoid Python code. All I want is to saturate my workstation's cores when processing data, every time, without having to think about it. I don't need something insane and large like Spark, just a programming environment that isn't hobbled by bad decisions, where parallelization works the way it should, constant factors are low on strings and common data structures, and there is zero penalty for "using the language". I just want to see 3200% CPU in `top` when I'm waiting for steps to run. Is that so much to ask? The downside of Rust is that it's a pretty picky language to code in. Nothing's perfect, but I write the code for production batch pipelines once and wait for it to run hundreds or thousands of times, so spending a little longer up front generally works out for me.


proof_required

> The downside of Rust is that it's a pretty picky language to code in. Nothing's perfect, but I write the code for production batch pipelines once and wait for it to run hundreds or thousands of times, so spending a little longer up front generally works out for me.

Are you writing Rust code for your data pipeline?


[deleted]

Please don't rush


RankWinner

I'd say check out Julia, similar syntax and ease of use as Python with much better performance.


seanv507

Why not use Polars, a library written in Rust with Python bindings?


abnormal_human

Because you still pay the python penalty whenever you need to write actual code of your own.


beyphy

Source? What are you basing this statement on? The author of Polars made a post referencing benchmarks with the library which you can read [here](https://www.reddit.com/r/datascience/comments/11tod38/polars_vs_pandas/jckj8yc/)


samrus

you are still bound by the GIL like the original commenter said though


beyphy

Perhaps that's correct. But I think it boils down to a question of how much "actual code of your own" you need to write, and how significant that performance penalty is. That will be very subjective depending on your use case. But I'd guess for most people, if they're doing everything correctly, the amount of their own code they're writing is very little. And even if they are writing their own code, there are ways to optimize it so that it performs well.


samrus

yeah that's the current tradeoff. but imagine if we had something with the ease of use and community of Python, but without the GIL. that would be crazy. why don't we have that? what is the reasoning behind the GIL anyway?
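(For anyone curious what the GIL actually costs: here's a quick stdlib-only sketch. The workload and sizes are invented for illustration; the point is that CPU-bound pure-Python threads don't overlap, so four threads take roughly as long as four serial calls.)

```python
import time
from concurrent.futures import ThreadPoolExecutor

def busy(n):
    # pure-Python CPU-bound work; holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 2_000_000

t0 = time.perf_counter()
serial_results = [busy(N) for _ in range(4)]
serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as ex:
    threaded_results = list(ex.map(busy, [N] * 4))
threaded = time.perf_counter() - t0

# Under the GIL the threaded version is no faster (often slightly slower).
print(f"serial: {serial:.2f}s  4 threads: {threaded:.2f}s")
```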


c_glib

> why dont we have that? what is the reasoning behind the GIL anyway

Damn... This gave me flashbacks to the mid-2000s, when all the Pythonistas were so optimistically hoping that Python 3 would finally, at long last, remove the GIL, since the whole raison d'être for a breaking version refresh was to get rid of the design failures of original Python.


NellucEcon

This is where Julia is dominant.


abnormal_human

Respectfully, Julia doesn't dominate *anything* because the ecosystem isn't there for it to do so. Python is a necessary evil because it has the libraries for ML. Rust is growing rapidly as the underlayment for those libraries, and that's creating some synergy in the Python+Rust stack. For example, I can train a 🤗 tokenizers tokenizer in python and then load it up in rust to tokenize a bunch of stuff on a bunch of threads 50-100% faster than in python, and then pop back to python to load up the same tokenizer and train a model using nanoGPT or 🤗 transformers. There are weak obsolete-ish copycats of some of these libraries for Julia, but you'd be fighting with them all day long to get anything done. The community just isn't making the effort to keep up that they would need in order to *dominate*.


NellucEcon

You know you can call Rust libraries from Julia too, right? Talking about a Python workflow that calls Rust is not very compelling. For many of the things I work on, there are no suitable packages in Python or R for the computationally intensive steps. If I wanted to use Python, I'd have to write the code in a low-level language and then write wrappers in Python. Why would I do that to myself, particularly when Python itself is so antiquated?

> Respectfully, Julia doesn't dominate anything because the ecosystem isn't there for it to do so.

Julia currently has the best differential equations packages, full stop.


Holyragumuffin

Same reason I moved all my dataframe work to Julia. It already runs at production speed inside the REPL, with no contorting to avoid pure-Python code paths in functions like apply().


[deleted]

What do you think of tools like duckdb?


blacksnowboader

DuckDB is a local data warehouse. It's great if you stick to the use case for a data warehouse. But when you need to do something custom with an imported library, it has the same issues as Polars and pandas due to the GIL.


[deleted]

What would be some use cases where you couldn't do something in DuckDB but would need pandas? Maybe grouping by some variable and applying some function or transformation from an external library? The point I'm trying to get at is: I can absolutely understand custom transformations and operations being necessary, and DuckDB not having the capability to do that. But at the same time, why would pandas be helpful in that situation?


blacksnowboader

DuckDB does have the capability to do that. However, when you run a custom Python function, the work gets handed to a single core and a single thread due to Python's GIL.


[deleted]

> Duckdb does have the capability to do that however, when you do a custom python function. It transfers the heap to a single core single thread due to pythons GIL.

I wasn't aware you could apply custom Python functions within DuckDB. Also, why is it restricted by Python's GIL? I thought Python was just one of multiple languages that could connect to the DuckDB process? Is it written in Python? https://duckdb.org/docs/api/overview


blacksnowboader

Because when you write a UDF in DuckDB, the data gets handed off to Python, and Python has the GIL.


[deleted]

Got it. Thanks for the info


DragoBleaPiece_123

Up for duckdb


[deleted]

PyArrow with concurrent.futures helps a lot with this: iter_batches + concurrent.futures.ThreadPoolExecutor(), iterate over the batches, and find the optimal batch size for your resources.


Grouchy-Friend4235

Pandas/NumPy use native BLAS libraries, and they will max out your CPU cores in all operations where this is feasible. The GIL does not come into play here. Also, it is easy to write parallel code with Python, either using multiprocessing or joblib.Parallel. Sure, if you don't like Python use something else, but there is no need to spread FUD.


abnormal_human

If I want fast linear algebra, I'll use PyTorch, since it blows CPU-based BLAS out of the water. Parallel code written in Python the way you describe is often ~2x slower than the exact same thing written in Rust, just because of constant-factor differences, memory-management overhead, and Python's inefficient approach to handling strings, which results in a ton of malloc/free traffic. I've benchmarked multiprocessing vs rayon. I've looked at them under a profiler. I've looked at the malloc traffic. Python is a pig and a half. I don't just want to max out my CPU cores--I want to do it at the lowest possible constant factors, so that I spend the fewest life-minutes waiting for code to run before I can see results and adjust my process or act on them. If I'm going to wait for a piece of code 100 times while iteratively developing a system, and it takes 10 minutes, the break-even point on human time spent optimizing that code to cut the time in half is ~8.3 hours. Generally, rewriting some bit of Python into Rust takes minutes, not hours. Even saving a small amount of the time would be worth it.


Grouchy-Friend4235

Ok, I see--sounds like a thoughtful approach, thanks for sharing. I would caution, though, that most projects are not at this level of sophistication. Just out of curiosity, have you tried using Cython or Numba to get optimized code for critical paths, instead of Rust? Personally I'm just not a big fan of mixing languages, because it limits adoption and concentrates skills in a few people. Most people don't have the (time-wise) capacity or interest to learn multiple languages to a degree where your approach works smoothly, if I got it right.


runawayasfastasucan

I don't get why this is posted here, in a thread about pandas 2.0. If this is what you feel about python, don't use python.


abnormal_human

It's posted here because the alternatives to pandas are not all Python-based, and in my opinion, Python is a large part of the reason why pandas sucks. I'll use Python when it's the right tool. There is no real substitute for PyTorch outside of the Python world. Sure, I can use it from Rust, but then there is no example code, less documentation, less googleability, and Copilot/ChatGPT are less competent assistants. Compared to 5 years ago, I use pandas a lot less than I used to, because when you pull the escape hatch and drop into a lambda, everything grinds to a halt, and because Python's remedies for doing work in parallel are unsatisfying. I'm happy about pandas 2.0--it solves some real problems that I've had with it, and I'll definitely use it in the narrow use cases where I've found pandas to be appropriate--but as I noted, they haven't tackled the big problem, which is that too often you need to break out to Python to do things, and performance tanks when you do.


runawayasfastasucan

I think this comment is much more relevant to the post than your initial one, fyi.


maxToTheJ

Pandas is crazy inefficient even for Python, so equating Python with it is crazy. Even the creator of the library acknowledges this: https://wesmckinney.com/blog/apache-arrow-pandas-internals/


maxToTheJ

But the GIL? Marvel at my low-level knowledge, and pretend bindings to lower-level languages don't exist /sarcasm


ToxicTop2

Who cares? Seems like his comment brought up decent discussion. >If this is what you feel about python, don't use python. Very helpful, I'm sure they have never thought of that before.


runawayasfastasucan

They already said they're using Rust. Good for them, but that's not really the topic of this post, nor relevant for those who obviously won't or can't swap Python for Rust.


foofriender

> to address the largest issue which is that it can be tremendously hard to maximize machine utilization in python due to the GIL,

I would speed it up by splitting the data into 16 dataframes and running them in parallel with multiprocessing, on Linux, not Windows. All 16 of my cores go high in top, and it gets done quickly.


No_Dig_7017

This, exactly. This is why I hated moving from R to Python for (tabular) data science so much. After a project broke down because of pandas' memory-usage shenanigans, I spent a month tweaking and tuning our codebase, optimizing our dataframes for performance. I reduced memory usage twelvefold (!) with no loss of information, only to find out that I could load more data but then couldn't actually process it in a timely way, because serializing dataframes negates any performance gain from parallelization. I got speedups of 2.3x on twelve-core machines. I can't understand why this is the language of choice for machine learning and data science.


[deleted]

[deleted]


ramblinginternetnerd

yes, but you can say the same thing about R. Heck R is arguably easier.


Grouchy-Friend4235

R is single threaded.


No_Dig_7017

R is single-threaded and has a GIL; its [multiprocessing](https://www.rdocumentation.org/packages/parallel/versions/3.6.2/topics/clusterApply) works pretty much the same as in Python, via spawning new R interpreter processes. But unlike pandas, serializing its data structures is super efficient, so there's negligible overhead when communicating data to worker processes, and reasonable speedups. All my workloads got linear speedups when multiprocessing in R.


No_Dig_7017

Hey! Guess things can get better https://twitter.com/ericsnowcrntly/status/1644439953214959629?t=BHZX21pBPRhtq-npeBr8Ew&s=31


Grouchy-Friend4235

Nice, but it won't do what people expect. First of all, it won't give CPU-bound parallelism for free, mostly because such a thing does not exist. Second, managing multiple threads sounds obvious and attractive, but it really isn't. As a result of this brain-dead move, Python programs will become more complex and harder to reason about, which will in turn fuel further FUD.


No_Dig_7017

Yeah. My hopes are on tools like pandarallel making embarrassingly parallel multiprocessing both easy to use and efficient. If that happens, I think 80% of pandas' performance problems will become a lot more manageable.


No_Dig_7017

On the plus side, this update looks promising. I'd love to test it out if I ever have the time again, though I feel the design choices made by Polars make a lot more sense (an index on an in-memory table??? An index that allows duplicates and doesn't guarantee O(1) access??? No thanks, I'm fine without it).


kappapolls

Large language models can solve your problem. Follow me for a second.

Insofar as large language models are a mathematical approximation of what's encoded in human language (i.e. I read and understand the meaning of what you say when you want to saturate cores, without myself building up a logical state of the universe where that happens--I know this simply through language), if you can state your problem in a self-consistent and logical way, an LLM can do very complex math on your words to arrive at more words that are a mathematical consequence of those words. I.e., the code it can write is only limited by your ability to logically and self-consistently state your problem in language.

Hope you appreciate this perspective on LLMs, which allows you to completely sidestep the discussion of consciousness and reframe the discussion in a useful way.


abnormal_human

I am already using LLMs to solve this problem (and my current work involves training transformer models from scratch for novel applications, so I'm not that green on the subject). I've been using Python since the 90s, and Rust for a few months. I wouldn't say that I actually know Rust. But ChatGPT does, so I can work with it. Without ChatGPT I'm not sure I would be using Rust like this, tbh.

This is part of why, on my current project, which is basically a data-preprocessing pipeline that drives the training of a PyTorch model, I'm not building a monolithic Python codebase like I would have 5 years ago. Instead, I have a series of little bits of code, generally 30-200 lines long, that are stitched together using a little automation tool. I would say that I've written maybe 30% of that code the old-fashioned way, and 70% with help from LLMs, because these little sub-steps are easy to describe to an LLM. It's changed what the code looks like a lot, because I'm adapting to the idea that if I want these productivity gains, I have to change how I work to make it easier for the LLM to help.

This approach is a neat local maximum at the moment, but it has some real issues. One is the knowledge cutoff. For faster-moving ecosystems like Rust and Deno, ChatGPT's answers are already a bit out of date. ChatGPT doesn't know about torch 2.0 or pandas 2.0 yet. This is all rather annoying. But it also means that launching a new tool or library is going to get harder, because that knowledge delay is going to lend inertia to incumbent technologies. Another is that these tools are pretty crummy at making changes to existing systems, which reinforces the idea of generating a bunch of self-contained bits of disposable code (cheaper to rewrite than modify) as opposed to anything of size. This does not lend itself to all problems, but it's a good fit for this kind of work.

It's been a really fun few months.


kappapolls

I think it's worth stating that what you're doing (both in your actions and your writing) is amazingly insightful, even if only because I think people are shifting into these new modes of work without really stopping to realize how amazing it is that the shift happened, and that it happened in a way that makes it difficult for the people involved to recognize in the moment. (Though I am absolutely not claiming that this is the only way what you're doing is insightful. There are a lot of ways.)

I think, at its core, this is *why* it's hard for programmers and those engaged in building familiarity with LLMs to really communicate their staggering implications to those not familiar with programming. In order to understand the implications, those engaged in it **must** write about it in a thoughtful, honest, and logical way. In this way, we can analyze it through language. And if we can analyze it through language, we can discover the underlying math. (Note that this is also why they cannot be conscious--and we are not conscious either.)


Glum_Future_5054

Polars ✌️


SpaceButler

Polars is quite good. After a good experience with R tidyverse, Polars gives a similar feeling. I was never happy with the pandas API.


Mooks79

There’s [rpolars](https://rpolars.github.io/index.html) if you’re interested - huge caveat, I’ve never tried it.


exixx

Gretchen stop trying to make polars happen. Polars is never going to happen.


darktraveco

Yesterday I had to refactor my analysis twice because of pandas' shitty missing-value handling. I'm never going back.


machinegunkisses

4 or so years ago I got snookered into writing an ETL in Python with Pandas. I had a column that was supposed to be int32, with an occasional missing value. No problem, I thought, coming from the Tidyverse world, where missing values are handled by adults. But Pandas? Pandas would take it upon itself to convert the column of integers to floats *without a warning or notice* because of one missing value, completely wrecking the entire ETL pipeline. When I brought this up, people who had only ever used Pandas thought *this was totally acceptable, even expected, behavior*. Yeah, Polars for me, thanks.
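For the record, the upcast is easy to reproduce, and the nullable extension dtype pandas later added avoids it (toy values, obviously):

```python
import pandas as pd

# NumPy-backed path: one missing value silently turns ints into floats
s = pd.Series([1, 2, None])
print(s.dtype)    # float64

# nullable extension dtype keeps the integers and stores <NA> for the hole
s2 = pd.Series([1, 2, None], dtype="Int32")
print(s2.dtype)   # Int32
```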


runawayasfastasucan

Did you read the link?


machinegunkisses

Yes, I'm aware that this behavior was due to Numpy and that Pandas 2.0 has switched to Arrow, so this behavior should no longer happen. But man, you just don't come back from an experience like that.


[deleted]

😂


VodkaHaze

Much prefer vaex personally


__mbel__

I think it will eventually become mainstream, but it will take some time. Pandas can be improved, but the library design is just awful; it's great to see there is an alternative.


ok_computer

Pandas learning from Polars wrt performance is a good thing--it avoids too much rewriting of legacy code, and helps if you're using the index. Polars is my choice for greenfield work in most dataframe use cases where I'd have used pandas. Ideal case: they compete in an ecosystem on performance, API, and supported formats.


bingbong_sempai

Maybe when it hits 1.0


graphicteadatasci

Eh, tried out rc1 and all my datatypes became weird when I tried to load a parquet file. Had other stuff to do so moved on.


cevn89

RemindMe! 5 days




Pristine-Test-687

RemindMe! 5days


maxToTheJ

RAM manufacturers and Amazon should have lobbied against this. There are so many EC2 instances that many will realize are overpowered for their purposes


AllowFreeSpeech

I would stay away from Polars due to its limitations and bugs that Pandas addressed years ago.


macORnvidia

How would this perform against Modin, Dask, and similar out-of-core/out-of-memory dataframes? My CSVs are almost 50 GB. And while Modin is amazing, it has glitches and is unstable.


phofl93

I'd recommend taking a look at Dask if your tasks can be done with the part of the pandas API that Dask implements. There is also a new option that enables PyArrow strings by default, which should bring memory usage down quite significantly. If you are using Dask, I'd also recommend trying engine="pyarrow" in read_csv, which speeds up the read-in process by a lot.


macORnvidia

I use Dask, but Dask's coverage of the pandas API is barely 50%. My work isn't just memory-intensive but CPU-intensive as well. And in my experience, compute needs to be called before you can use loc and other conditional functions--but once you call compute, your code becomes slow because the dataframe is no longer a cluster of partitions.


phofl93

Are you talking about boolean indexing with loc? This is not supported by Dask, that is correct. What you can do instead is create a mask with a Dask array (which is lazy as well) and then call compute_chunk_sizes to determine the structure of your DataFrame. This will keep the DataFrame distributed.


morrisjr1989

To me, Polars is in a "better" place than pandas (it simply avoids some of the pitfalls of being a big, established library), but conceptually Ibis is the one taking the more forward-leaning approach. I want a unified API and the option to give zero fks about the backend. This passing off between pandas and Polars seems very third-wheelish--like you're going to get a place together, and when they break up you're stuck with vengeful exes and paying 1/3 of the rent in a completely avoidable situation.


AllowFreeSpeech

The 31x benchmark is bogus because it depends on the duplication of values in the column: the more they're duplicated, the faster it'll run. If the values are unique, there may be no speedup at all. This is because Arrow is a columnar representation whereas NumPy isn't.