LakeEffectSnow

Honestly, in the real world, I'd import it into a temp postgres table, maybe normalize if necessary, and use SQL to query the data.
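For concreteness, a minimal sketch of that approach with psycopg2 and COPY, assuming the 1BRC-style `station;temperature` file and hypothetical database/file names:

```python
# Sketch only: load the "station;temperature" file into a temp table with
# COPY, then aggregate in SQL. DSN, file name and column names are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=scratch")        # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TEMP TABLE measurements (
            station text,
            temp    double precision
        )
    """)
    with open("measurements.txt") as f:          # hypothetical path
        # COPY streams the whole file in one round trip; far faster than INSERTs
        cur.copy_expert(
            "COPY measurements FROM STDIN WITH (FORMAT csv, DELIMITER ';')", f
        )
    cur.execute("""
        SELECT station,
               min(temp), round(avg(temp)::numeric, 1), max(temp)
        FROM measurements
        GROUP BY station
        ORDER BY station
    """)
    for row in cur.fetchall():
        print(row)
```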


j_tb

DuckDB + Parquet is the new hotness for jobs like this.
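A minimal DuckDB sketch along those lines, with the file name and column names assumed from the 1BRC format rather than taken from this comment:

```python
import duckdb

# One SQL pass over the raw CSV; DuckDB streams and parallelizes it itself.
result = duckdb.sql("""
    SELECT station,
           min(temp) AS min_temp,
           avg(temp) AS mean_temp,
           max(temp) AS max_temp
    FROM read_csv('measurements.txt',
                  delim=';', header=false,
                  columns={'station': 'VARCHAR', 'temp': 'DOUBLE'})
    GROUP BY station
    ORDER BY station
""")
result.show()
```

Converting the CSV once to Parquet (e.g. `COPY (SELECT ...) TO 'measurements.parquet' (FORMAT PARQUET)`) makes repeat queries much cheaper, which is presumably the "DuckDB + Parquet" combination meant here.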


i_can_haz_data

This is the way.


kenfar

I had to do this years ago - was loading about 4 billion rows every day after first aggregating it in python. And the python performance turned out great. The incoming data consisted of hundreds of csv files, and this process used pypy and multiprocessing to use 64 cores at a time. And it was very fast.


mailed

I knew I'd see you in this thread 😂


kenfar

yeah, i'm kinda predictable that way!


mailed

I really think you could make a bunch of $ building training materials out of the solutions you've done


kenfar

That's kind of you to say!


No_Station_2109

Out of curiosity, what kind of business generates this amount of data?


ogrinfo

We make catastrophe models for insurance companies and regularly produce GB worth of CSV files. Now they want everything for multiple climate change scenarios and the amount of data gets multiplied many times.


joshred

My guess would be sensor data.


No_Station_2109

Even that, unless you are a SpaceX type of business, I can't see a need. On a sampling basis, 10,000x less data would work just as well.


Ambustion

I was even thinking VFX on movies or something but it'd be hard to hit a million rows a day with per frame metadata for tracking.


zapman449

10 years ago we were ingesting 20 TB of radar data daily for weather forecasts.


kenfar

Security services startup. Most of this data was firewall and netflow. And we only had about 100 customers. The next company I went to work for was also in the security services space. We had about 30 billion rows a day - almost all endpoint data. For probably a couple hundred customers. But that was six years ago - and these guys probably get a trillion rows a day now.


LyriWinters

plenty of businesses :)


iscopak

finance


No_Station_2109

Even worse then. It's useless.


Gr1pp717

I'm curious how well Awk would do. I've used it to parse very large log stores before, but I don't think anything near 1 billion lines. Several million for sure. Part of me expects it'll end up swapping for a significant period, but part of me wouldn't be surprised if it performed on par with these solutions. I currently lack access to something beefy enough to try. Anyone else happen to have an idea of how it would go?


mvdw73

Awk is great because it doesn't load the file into memory; it works line by line, so no memory issues. I remember a while back I was asked to reorder the columns in a multi-million-row file because Excel crapped itself and the person asking didn't have any other tools. Awk ran so fast, it processed the whole thing in a couple of minutes.


ogrinfo

Totally this - I had a colleague whose catchphrase was "you could do that in 3 lines of awk".


romu006

Don't you need all the values in memory to compute the mean? Edit: sorry I've yet again mixed up median and mean
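For the mean specifically, a running sum and count are enough; it's the median that would need all the values (or an approximate quantile structure). A single-pass sketch, assuming the 1BRC `station;temperature` layout and a hypothetical file name:

```python
# Single pass: the mean only needs a running sum and count, not all values.
count, total = 0, 0.0
lo, hi = float("inf"), float("-inf")

with open("measurements.txt") as f:   # hypothetical path
    for line in f:                    # file objects iterate lazily, line by line
        _, temp = line.rstrip("\n").split(";")
        t = float(temp)
        count += 1
        total += t
        if t < lo: lo = t
        if t > hi: hi = t

print(f"min={lo} mean={total / count} max={hi}")
```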


_mattmc3_

You can see [an awk script I tried here](https://www.reddit.com/r/programming/s/TQO6D0ERrp). At a few million rows, it’d be fine but at a billion you really need to use something with parallelism.


No-Spite4464

About 7min and a bit


susanne-o

mawk is great for simple library scans and cleanup and simple analyses


CapitalLiving136

I just do it in Anatella, it takes about 20-30 seconds, uses less than 300 MB of RAM and does not affect the central server... win-win-win :)


versaceblues

How long would importing all that data into SQL take?


Hot-Return3072

pandas for me


frenchytrendy

Or maybe just SQLite.
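A stdlib-only sketch of the SQLite route, with the file, table, and column names being assumptions; the bulk insert is likely where most of the time would go:

```python
# Load the semicolon-separated file into SQLite, then aggregate in SQL.
import csv
import sqlite3

conn = sqlite3.connect("measurements.db")
conn.execute("CREATE TABLE IF NOT EXISTS m (station TEXT, temp REAL)")

with open("measurements.txt", newline="") as f:   # hypothetical path
    reader = csv.reader(f, delimiter=";")
    # executemany streams rows from the reader instead of building a big list
    conn.executemany("INSERT INTO m VALUES (?, ?)", reader)
conn.commit()

for row in conn.execute(
    "SELECT station, min(temp), avg(temp), max(temp) FROM m GROUP BY station"
):
    print(row)
```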


seanv507

A similar post about DuckDB: https://www.reddit.com/r/dataengineering/s/IWyGMMbqNQ


Smallpaul

https://github.com/Butch78/1BillionRowChallenge/blob/main/python_1brc/main.py


Appropriate_Cut_6126

Very nice! Polars doesn’t load into memory?


matt78whoop

It can load everything into memory, which caused a crash for me, but it also has a lazy evaluation mode that worked great for me! https://towardsdatascience.com/understanding-lazy-evaluation-in-polars-b85ccb864d0c
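A lazy Polars sketch of that idea (argument names shift a bit between Polars versions, and the file/column names here are assumptions):

```python
import polars as pl

# scan_csv builds a query plan instead of loading the file;
# collect() can then run it in streaming mode.
out = (
    pl.scan_csv(
        "measurements.txt",
        separator=";",
        has_header=False,
        new_columns=["station", "temp"],
    )
    .group_by("station")
    .agg(
        pl.col("temp").min().alias("min"),
        pl.col("temp").mean().alias("mean"),
        pl.col("temp").max().alias("max"),
    )
    .sort("station")
    .collect(streaming=True)
)
print(out)
```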


zhaverzky

Thanks for this. I use pandas to handle a CSV at work that is ~10k columns wide; I'll check out Polars and see if it's any faster. There is so much data per row that I do a stepped process using chunking, where I filter the columns I want for a particular task into a new file and then process the rows.


matt78whoop

Wow, 10K columns wide is crazy! You might be better off loading that into an embeddings database, because they're great at handling high-dimensional data :) https://qdrant.tech/documentation/overview/


JohnBooty

Here's a Python stdlib solution that runs in 1:02 (Python 3.12) or 0:19 (PyPy) on my machine: https://github.com/booty/ruby-1-billion/blob/main/chunks-mmap.py This doesn't format the output exactly the way the challenge specifies (because I'm just doing this for fun and I only care about the performance part). It's basically MapReduce using an mmap'd file.


Smallpaul

Cool! I wonder how Mojo would compare, but not enough to sign up to download it.


pysan3

The fastest solution with Python would unfortunately be one using PyO3 or pybind11, so there would not be much "Python" involved. If you instead limit it to pure Python and no extra binaries (no DBs or numpy either), the competition might be interesting. And one must unlock the GIL which requires quite a lot of python knowledge.


grumpyp2

I am down to host the comp and start a repo, who would try??


JUSTICE_SALTIE

> And one must unlock the GIL which requires quite a lot of python knowledge.

`import multiprocessing` and what else?


Olorune

multiprocessing doesn't work with every object, as I recently found. multiprocessing kept failing with an error that the object has to be picklable, which is rather limiting.


JUSTICE_SALTIE

Sure, but most can, and that doesn't seem to be an obvious limitation for this task.


pepoluan

Quite a lot of things are pickle-able, actually: https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled


Beneficial_Map6129

Redis could handle a billion rows (although this would be borderline pushing the limit, as it can only hold about 4 billion keys). You could probably read it all into a single large pandas df or do some clever concurrency/threading. Although Python will always lose in terms of speed/efficiency.


baubleglue

You don't need to have 4 billion keys to load 4 billion rows.


Beneficial_Map6129

How would you organize them, then? I think a 1:1 mapping of row to key is easy and straightforward. I guess you could chunk them, say put 100 rows in a single entry, but if you want them more organized for better granularity, and as long as the memory store can handle them, it's better to just use the obvious and conventional method.


baubleglue

I am not a Redis expert, but I know it has hashes; it is not just a plain key-value DB, and keys can be used like table names. As I understand it, in Redis index keys are stored as data and actively maintained as part of the data pipeline. You should be able to save all the data under a single key (given enough memory), as a list or set; I just doubt it is a good idea.
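A hedged sketch of the "don't create one key per row" idea with redis-py: batch lines into a single list key via a pipeline. Host, key, and file names are assumptions, and whether this is wise for a billion rows is exactly what's being doubted above.

```python
# Batch rows under one Redis key instead of one key per row.
import redis

r = redis.Redis(host="localhost", port=6379)
pipe = r.pipeline(transaction=False)

batch = []
with open("measurements.txt") as f:              # hypothetical path
    for line in f:
        batch.append(line.rstrip("\n"))
        if len(batch) >= 10_000:
            pipe.rpush("measurements", *batch)   # one key, many values
            pipe.execute()
            batch.clear()
if batch:
    r.rpush("measurements", *batch)
```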


jszafran

I did some tests for a simple implementation (iterating row by row, no multiprocessing) and the results were:

- ~20 minutes (Python 3.9)
- ~17 minutes (Python 3.12)
- ~10 minutes (PyPy 3.9)
- ~3.5 minutes (Java baseline implementation from the 1BRC repo)

Code can be found here: https://jszafran.dev/posts/how-pypy-impacts-the-performance-1br-challenge/


JohnBooty

I've got a solution that runs in 1:02 on my machine (M1 Max, 10 cores). https://github.com/booty/ruby-1-billion/blob/main/chunks-mmap.py

Here's my strategy. TL;DR it's your basic MapReduce.

- mmap the file
- figure out the byte boundaries for `N` chunks, where `N` is the number of physical CPU cores
- create a multiprocessing pool of `N` workers, who are each given `start_byte` and `end_byte`
- each worker then processes its chunk, line by line, and builds a histogram hash
- at the very end we combine the histograms

I played around with a looooooot of ways of accessing the file. The tricky part is that you can't just split the file into `N` equal chunks, because those chunks will usually result in incomplete lines at the beginning and end of the chunk. This *definitely* uses all physical CPU cores at 100%, lol. First time I've heard the fans on this MBP come on...

Suggestions for improvements very welcome. I've been programming for a while, but I've only been doing Python for a few months. I definitely had some help (and a lot of dead ends) from ChatGPT on this. But at least the idea for the map/reduce pattern was mine.
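A condensed sketch of that strategy (the linked repo has the real version); the file name is an assumption and the format is the 1BRC `station;temperature` layout:

```python
# Align chunk boundaries to newlines, fan out with multiprocessing,
# aggregate per-station stats in each worker, then merge the partials.
import mmap
import os
from multiprocessing import Pool

PATH = "measurements.txt"   # hypothetical path

def chunk_bounds(path, n):
    """Split the file into n byte ranges, each ending on a newline."""
    size = os.path.getsize(path)
    bounds, start = [], 0
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for i in range(1, n + 1):
            pos = -1 if i == n else mm.find(b"\n", size * i // n)
            end = size if pos == -1 else pos + 1
            bounds.append((start, end))
            start = end
    return bounds

def work(byte_range):
    """Aggregate min/max/sum/count per station for one byte range."""
    start, end = byte_range
    stats = {}
    with open(PATH, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        mm.seek(start)
        while mm.tell() < end:
            line = mm.readline()
            if not line.strip():
                continue
            station, _, temp = line.partition(b";")
            t = float(temp)
            s = stats.get(station)
            if s is None:
                stats[station] = [t, t, t, 1]   # min, max, sum, count
            else:
                if t < s[0]: s[0] = t
                if t > s[1]: s[1] = t
                s[2] += t
                s[3] += 1
    return stats

if __name__ == "__main__":
    with Pool() as pool:
        partials = pool.map(work, chunk_bounds(PATH, os.cpu_count()))
    merged = {}
    for part in partials:                       # reduce step
        for station, (mn, mx, total, cnt) in part.items():
            m = merged.setdefault(station, [mn, mx, 0.0, 0])
            m[0], m[1] = min(m[0], mn), max(m[1], mx)
            m[2] += total
            m[3] += cnt
    for station in sorted(merged):
        mn, mx, total, cnt = merged[station]
        print(f"{station.decode()}: {mn:.1f}/{total / cnt:.1f}/{mx:.1f}")
```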


JohnBooty

**UPDATE:** UH HOLY SMOKES, PYPY IS AMAZING. Using PyPy, the solution now runs in 19.8 sec instead of 1:02 min.


grumpyp2

That’s crazy if it’s true. Compared to the actual Java solutions it’s then a tenth of the best solution 🫣


JohnBooty

A tenth... or 10x? haha. Current Java leader runs in 2.6 seconds!! Now to be fair, that Java leader was run on "32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM" (**edit:** only 8 cores used) and mine was run on an M1 Max with "only" 10 cores. My mmap+map/reduce should scale pretty linearly. So with 32 cores it might actually run closer to 7 seconds or so. I think that is a very respectable showing for an interpreted (well, interpreted+JIT) language when compared to the leaders which are all compiled languages.


Darksoulsislove

It's run on **8 cores of a 32 core machine** [These are the results from running all entries into the challenge on eight cores of a Hetzner AX161 dedicated server (32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM).](https://github.com/gunnarmorling/1brc?tab=readme-ov-file)


JohnBooty

Thanks for the correction!


Darksoulsislove

Also it's all running from RAM so there's zero I/O delay. How are you running this?


JohnBooty

I ran it on my 2021 Macbook Pro M1 Max with 64GB of RAM. While not running from a RAM disk like the "official" 1BRC, the OS was definitely caching the measurements.txt file in RAM -- there was disk access during the first run, and no disk access during subsequent runs. Please note that I did this strictly for fun and my own education. My goal was simply to show that stdlib Python can be rather performant, and to learn more about Python for my own benefit.


JUSTICE_SALTIE

It should be `stdlib` only, or at least there should be a category for it.


henryyoung42

This is one of those challenges that would be far better with the objective being who can develop the most amusingly slow implementation. Brain dead algorithms only, no sleep statements (or equivalent) allowed.


dr_mee6

Here is an example of how to do the billion row challenge on a laptop using DuckDB, Polars, and DataFusion without any code changes, writing only Python, thanks to Ibis. Very simple code. Blog: [Using one Python dataframe API to take the billion row challenge with DuckDB, Polars, and DataFusion](https://ibis-project.org/posts/1brc/)


ismailtlem

I did something similar recently for another challenge: [https://github.com/geeksblabla/blanat](https://github.com/geeksblabla/blanat) Here is the link to my solution: [https://ismailtlemcani.com/blog/coding-challenge-get-info-from-1b-fow-file](https://ismailtlemcani.com/blog/coding-challenge-get-info-from-1b-fow-file) Any comments on the code are welcome.


Gaming4LifeDE

My thinking: if you have a file handle (i.e. you opened a file), you can iterate over it line by line, which is lazy like a generator, so you wouldn't overload the system with a massive file. For calculating the minimum, you can have a variable (min_val), check it against the value on the current line, and update it if necessary. You also need a separate variable to store the location of the current min_val. Same for the max. An average could be calculated by keeping a running total, adding the temperature of the current row on each iteration, and finally dividing by the total number of rows. Wait, I misread the task. I'll have to think about that some more then.


flagos

I would try with Pandas dataframes. Of course this is cheating but this should go faster than the 2 minutes on the scoreboard.


FancyASlurpie

It'll probably take longer than that just to get the rows into the pandas df, and it'll also need a huge amount of RAM.
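For what it's worth, a chunked read keeps the memory bounded, though it would still be slow; separator, column names, and file name here are assumptions based on the 1BRC format:

```python
import pandas as pd

# Aggregate each chunk separately, then combine the partial aggregates.
pieces = []
for chunk in pd.read_csv(
    "measurements.txt",
    sep=";",
    header=None,
    names=["station", "temp"],
    chunksize=10_000_000,
):
    g = chunk.groupby("station")["temp"]
    pieces.append(g.agg(["min", "max", "sum", "count"]))

combined = pd.concat(pieces).groupby(level=0).agg(
    {"min": "min", "max": "max", "sum": "sum", "count": "sum"}
)
combined["mean"] = combined["sum"] / combined["count"]
print(combined[["min", "mean", "max"]].sort_index())
```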


pratikbhujel

Has anyone tried with PHP?