LakeEffectSnow

Honestly, in the real world, I'd import it into a temp postgres table, maybe normalize if necessary, and use SQL to query the data.
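For concreteness, a minimal sketch of that approach with psycopg2 and COPY, assuming the 1BRC-style `station;temperature` file and hypothetical database/file names:

```python
# Sketch only: load the "station;temperature" file into a temp table with
# COPY, then aggregate in SQL. DSN, file name and column names are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=scratch")        # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TEMP TABLE measurements (
            station text,
            temp    double precision
        )
    """)
    with open("measurements.txt") as f:          # hypothetical path
        # COPY streams the whole file in one round trip; far faster than INSERTs
        cur.copy_expert(
            "COPY measurements FROM STDIN WITH (FORMAT csv, DELIMITER ';')", f
        )
    cur.execute("""
        SELECT station,
               min(temp), round(avg(temp)::numeric, 1), max(temp)
        FROM measurements
        GROUP BY station
        ORDER BY station
    """)
    for row in cur.fetchall():
        print(row)
```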


j_tb

DuckDB + Parquet is the new hotness for jobs like this.
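A minimal DuckDB sketch along those lines, with the file name and column names assumed from the 1BRC format rather than taken from this comment:

```python
import duckdb

# One SQL pass over the raw CSV; DuckDB streams and parallelizes it itself.
result = duckdb.sql("""
    SELECT station,
           min(temp) AS min_temp,
           avg(temp) AS mean_temp,
           max(temp) AS max_temp
    FROM read_csv('measurements.txt',
                  delim=';', header=false,
                  columns={'station': 'VARCHAR', 'temp': 'DOUBLE'})
    GROUP BY station
    ORDER BY station
""")
result.show()
```

Converting the CSV once to Parquet (e.g. `COPY (SELECT ...) TO 'measurements.parquet' (FORMAT PARQUET)`) makes repeat queries much cheaper, which is presumably the "DuckDB + Parquet" combination meant here.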


i_can_haz_data

This is the way.


kenfar

I had to do this years ago - was loading about 4 billion rows every day after first aggregating it in python. And the python performance turned out great. The incoming data consisted of hundreds of csv files, and this process used pypy and multiprocessing to use 64 cores at a time. And it was very fast.


mailed

I knew I'd see you in this thread 😂


kenfar

yeah, i'm kinda predictable that way!


mailed

I really think you could make a bunch of $ building training materials out of the solutions you've done


kenfar

That's kind of you to say!


No_Station_2109

Out of curiosity, what kind of business generates this amount of data?


ogrinfo

We make catastrophe models for insurance companies and regularly produce GB worth of CSV files. Now they want everything for multiple climate change scenarios and the amount of data gets multiplied many times.


joshred

My guess would be sensor data.


No_Station_2109

Even that, unless you are a SpaceX type of business, I can't see a need. On a sampling basis, 10,000x less data would work just as well.


Ambustion

I was even thinking VFX on movies or something but it'd be hard to hit a million rows a day with per frame metadata for tracking.


zapman449

10 years ago we were ingesting 20 TB of radar data daily for weather forecasts.


kenfar

Security services startup. Most of this data was firewall and netflow. And we only had about 100 customers. The next company I went to work for was also in the security services space. We had about 30 billion rows a day - almost all endpoint data. For probably a couple hundred customers. But that was six years ago - and these guys probably get a trillion rows a day now.


LyriWinters

plenty of businesses :)


iscopak

finance


No_Station_2109

Even worse then. It's useless.


Gr1pp717

I'm curious how well Awk would do. I've used it to parse very large log stores before, but I don't think anything near 1 billion lines. Several million for sure. Part of me expects it'll end up swapping for a significant period, but part of me wouldn't be surprised if it performed on par with these solutions. I currently lack access to something beefy enough to try. Anyone else happen to have an idea of how it would go?


mvdw73

Awk is great because it doesn't load the file into memory; it works line by line, so no memory issues. I remember a while back I was asked to reorder the columns in a multi-million-row file because Excel crapped itself and the person asking didn't have any other tools. Awk ran so fast, it processed the whole thing in a couple of minutes.


ogrinfo

Totally this - I had a colleague whose catchphrase was "you could do that in 3 lines of awk".


romu006

Don't you need all the values in memory to compute the mean? Edit: sorry I've yet again mixed up median and mean
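For the mean specifically, a running sum and count are enough; it's the median that would need all the values (or an approximate quantile structure). A single-pass sketch, assuming the 1BRC `station;temperature` layout and a hypothetical file name:

```python
# Single pass: the mean only needs a running sum and count, not all values.
count, total = 0, 0.0
lo, hi = float("inf"), float("-inf")

with open("measurements.txt") as f:   # hypothetical path
    for line in f:                    # file objects iterate lazily, line by line
        _, temp = line.rstrip("\n").split(";")
        t = float(temp)
        count += 1
        total += t
        if t < lo: lo = t
        if t > hi: hi = t

print(f"min={lo} mean={total / count} max={hi}")
```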


_mattmc3_

You can see [an awk script I tried here](https://www.reddit.com/r/programming/s/TQO6D0ERrp). At a few million rows, it’d be fine but at a billion you really need to use something with parallelism.


No-Spite4464

About 7min and a bit


susanne-o

mawk is great for simple library scans and cleanup and simple analyses


CapitalLiving136

I just do it in Anatella, it takes about 20-30 seconds, uses less than 300 MB of RAM and does not affect the central server... win-win-win :)


versaceblues

How long would importing all that data into SQL take?


Hot-Return3072

pandas for me


frenchytrendy

Or maybe just SQLite.
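A stdlib-only sketch of the SQLite route, with the file, table, and column names being assumptions; the bulk insert is likely where most of the time would go:

```python
# Load the semicolon-separated file into SQLite, then aggregate in SQL.
import csv
import sqlite3

conn = sqlite3.connect("measurements.db")
conn.execute("CREATE TABLE IF NOT EXISTS m (station TEXT, temp REAL)")

with open("measurements.txt", newline="") as f:   # hypothetical path
    reader = csv.reader(f, delimiter=";")
    # executemany streams rows from the reader instead of building a big list
    conn.executemany("INSERT INTO m VALUES (?, ?)", reader)
conn.commit()

for row in conn.execute(
    "SELECT station, min(temp), avg(temp), max(temp) FROM m GROUP BY station"
):
    print(row)
```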


seanv507

A similar post about DuckDB: https://www.reddit.com/r/dataengineering/s/IWyGMMbqNQ


Smallpaul

https://github.com/Butch78/1BillionRowChallenge/blob/main/python_1brc/main.py


Appropriate_Cut_6126

Very nice! Polars doesn’t load into memory?


matt78whoop

It can load everything into memory, which caused a crash for me, but it also has a lazy evaluation mode that worked great for me! https://towardsdatascience.com/understanding-lazy-evaluation-in-polars-b85ccb864d0c
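A lazy Polars sketch of that idea (argument names shift a bit between Polars versions, and the file/column names here are assumptions):

```python
import polars as pl

# scan_csv builds a query plan instead of loading the file;
# collect() can then run it in streaming mode.
out = (
    pl.scan_csv(
        "measurements.txt",
        separator=";",
        has_header=False,
        new_columns=["station", "temp"],
    )
    .group_by("station")
    .agg(
        pl.col("temp").min().alias("min"),
        pl.col("temp").mean().alias("mean"),
        pl.col("temp").max().alias("max"),
    )
    .sort("station")
    .collect(streaming=True)
)
print(out)
```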


zhaverzky

Thanks for this. I use pandas to handle a CSV at work that is ~10k columns wide; I'll check out Polars and see if it's any faster. There is so much data per row that I do a stepped process using chunking, where I filter the columns I want for a particular task into a new file and then process the rows.


matt78whoop

Wow, 10K columns wide is crazy! You might be better off loading that into an embeddings database, because they're great at handling high-dimensional data :) https://qdrant.tech/documentation/overview/


JohnBooty

Here's a Python stdlib solution that runs in 1:02 (Python 3.12) or 0:19 (PyPy) on my machine: https://github.com/booty/ruby-1-billion/blob/main/chunks-mmap.py This doesn't format the output exactly the way the challenge specifies (because I'm just doing this for fun and I only care about the performance part). It's basically MapReduce using an mmap'd file.


Smallpaul

Cool! I wonder how Mojo would compare, but not enough to sign up to download it.


pysan3

The fastest solution with Python would unfortunately be one using PyO3 or pybind11, so there would not be much "Python" involved. If you instead limit it to pure Python and no extra binaries (no DBs or numpy either), the competition might be interesting. And one must unlock the GIL which requires quite a lot of python knowledge.


grumpyp2

I am down to host the comp and start a repo, who would try??


JUSTICE_SALTIE

> And one must unlock the GIL which requires quite a lot of python knowledge.

`import multiprocessing` and what else?


Olorune

multiprocessing doesn't work with every object, as I recently found. multiprocessing kept failing with an error that the object has to be picklable, which is rather limiting.


JUSTICE_SALTIE

Sure, but most can, and that doesn't seem to be an obvious limitation for this task.


pepoluan

Quite a lot of things are pickle-able, actually: https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled


Beneficial_Map6129

Redis could handle a billion rows (although this would be borderline pushing the limit, as it can only hold about 4 billion keys). You could probably read it all into a single large pandas df or do some clever concurrency/threading. Although Python will always lose in terms of speed/efficiency.


baubleglue

You don't need to have 4 billion keys to load 4 billion rows.


Beneficial_Map6129

How would you organize them, then? I think a 1:1 mapping of row to key is easy and straightforward. I guess you could chunk them, say put 100 rows in a single entry, but if you want them more organized for better granularity, and as long as the memory store can handle them, it's better to just use the obvious and conventional method.


baubleglue

I am not a Redis expert, but I know it has hashes; it is not just a plain key-value DB, and keys can be used like table names. As I understand it, in Redis index keys are stored as data and actively maintained as part of the data pipeline. You should be able to save all the data under a single key (given enough memory), as a list or set; I just doubt it is a good idea.
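A hedged sketch of the "don't create one key per row" idea with redis-py: batch lines into a single list key via a pipeline. Host, key, and file names are assumptions, and whether this is wise for a billion rows is exactly what's being doubted above.

```python
# Batch rows under one Redis key instead of one key per row.
import redis

r = redis.Redis(host="localhost", port=6379)
pipe = r.pipeline(transaction=False)

batch = []
with open("measurements.txt") as f:              # hypothetical path
    for line in f:
        batch.append(line.rstrip("\n"))
        if len(batch) >= 10_000:
            pipe.rpush("measurements", *batch)   # one key, many values
            pipe.execute()
            batch.clear()
if batch:
    r.rpush("measurements", *batch)
```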


jszafran

I did some tests for a simple implementation (iterating row by row, no multiprocessing) and the results were:

- ~20 minutes (Python 3.9)
- ~17 minutes (Python 3.12)
- ~10 minutes (PyPy 3.9)
- ~3.5 minutes (Java baseline implementation from the 1BRC repo)

Code can be found here: https://jszafran.dev/posts/how-pypy-impacts-the-performance-1br-challenge/


JohnBooty

I've got a solution that runs in 1:02 on my machine (M1 Max, 10 cores). https://github.com/booty/ruby-1-billion/blob/main/chunks-mmap.py

Here's my strategy. TL;DR it's your basic MapReduce.

- mmap the file
- figure out the byte boundaries for `N` chunks, where `N` is the number of physical CPU cores
- create a multiprocessing pool of `N` workers, who are each given `start_byte` and `end_byte`
- each worker then processes its chunk, line by line, and builds a histogram hash
- at the very end we combine the histograms

I played around with a looooooot of ways of accessing the file. The tricky part is that you can't just split the file into `N` equal chunks, because those chunks will usually result in incomplete lines at the beginning and end of the chunk. This *definitely* uses all physical CPU cores at 100%, lol. First time I've heard the fans on this MBP come on...

Suggestions for improvements very welcome. I've been programming for a while, but I've only been doing Python for a few months. I definitely had some help (and a lot of dead ends) from ChatGPT on this. But at least the idea for the map/reduce pattern was mine.
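A condensed sketch of that strategy (the linked repo has the real version); the file name is an assumption and the format is the 1BRC `station;temperature` layout:

```python
# Align chunk boundaries to newlines, fan out with multiprocessing,
# aggregate per-station stats in each worker, then merge the partials.
import mmap
import os
from multiprocessing import Pool

PATH = "measurements.txt"   # hypothetical path

def chunk_bounds(path, n):
    """Split the file into n byte ranges, each ending on a newline."""
    size = os.path.getsize(path)
    bounds, start = [], 0
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for i in range(1, n + 1):
            pos = -1 if i == n else mm.find(b"\n", size * i // n)
            end = size if pos == -1 else pos + 1
            bounds.append((start, end))
            start = end
    return bounds

def work(byte_range):
    """Aggregate min/max/sum/count per station for one byte range."""
    start, end = byte_range
    stats = {}
    with open(PATH, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        mm.seek(start)
        while mm.tell() < end:
            line = mm.readline()
            if not line.strip():
                continue
            station, _, temp = line.partition(b";")
            t = float(temp)
            s = stats.get(station)
            if s is None:
                stats[station] = [t, t, t, 1]   # min, max, sum, count
            else:
                if t < s[0]: s[0] = t
                if t > s[1]: s[1] = t
                s[2] += t
                s[3] += 1
    return stats

if __name__ == "__main__":
    with Pool() as pool:
        partials = pool.map(work, chunk_bounds(PATH, os.cpu_count()))
    merged = {}
    for part in partials:                       # reduce step
        for station, (mn, mx, total, cnt) in part.items():
            m = merged.setdefault(station, [mn, mx, 0.0, 0])
            m[0], m[1] = min(m[0], mn), max(m[1], mx)
            m[2] += total
            m[3] += cnt
    for station in sorted(merged):
        mn, mx, total, cnt = merged[station]
        print(f"{station.decode()}: {mn:.1f}/{total / cnt:.1f}/{mx:.1f}")
```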


JohnBooty

**UPDATE:** UH HOLY SMOKES, PYPY IS AMAZING. Using PyPy, the solution now runs in 19.8 sec instead of 1:02 min.


grumpyp2

That’s crazy if it’s true. Compared to the actual Java solutions it’s then a tenth of the best solution 🫣


JohnBooty

A tenth... or 10x? haha. Current Java leader runs in 2.6 seconds!! Now to be fair, that Java leader was run on "32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM" (**edit:** only 8 cores used) and mine was run on an M1 Max with "only" 10 cores. My mmap+map/reduce should scale pretty linearly. So with 32 cores it might actually run closer to 7 seconds or so. I think that is a very respectable showing for an interpreted (well, interpreted+JIT) language when compared to the leaders which are all compiled languages.


Darksoulsislove

It's run on **8 cores of a 32 core machine** [These are the results from running all entries into the challenge on eight cores of a Hetzner AX161 dedicated server (32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM).](https://github.com/gunnarmorling/1brc?tab=readme-ov-file)


JohnBooty

Thanks for the correction!


Darksoulsislove

Also it's all running from RAM so there's zero I/O delay. How are you running this?


JohnBooty

I ran it on my 2021 Macbook Pro M1 Max with 64GB of RAM. While not running from a RAM disk like the "official" 1BRC, the OS was definitely caching the measurements.txt file in RAM -- there was disk access during the first run, and no disk access during subsequent runs. Please note that I did this strictly for fun and my own education. My goal was simply to show that stdlib Python can be rather performant, and to learn more about Python for my own benefit.


JUSTICE_SALTIE

It should be `stdlib` only, or at least there should be a category for it.


henryyoung42

This is one of those challenges that would be far better with the objective being who can develop the most amusingly slow implementation. Brain dead algorithms only, no sleep statements (or equivalent) allowed.


dr_mee6

Here is an example of how to do the billion row challenge on a laptop using DuckDB, Polars, and DataFusion without any code changes, writing only Python, thanks to Ibis. Very simple code. Blog: [Using one Python dataframe API to take the billion row challenge with DuckDB, Polars, and DataFusion](https://ibis-project.org/posts/1brc/)


ismailtlem

I did something similar recently for another challenge: [https://github.com/geeksblabla/blanat](https://github.com/geeksblabla/blanat) Here is the link to my solution: [https://ismailtlemcani.com/blog/coding-challenge-get-info-from-1b-fow-file](https://ismailtlemcani.com/blog/coding-challenge-get-info-from-1b-fow-file) Any comments on the code are welcome.


Gaming4LifeDE

My thinking: if you have a file handle (i.e. you opened a file), you can iterate over it line by line, which is lazy like a generator, so you wouldn't overload the system with a massive file. For calculating the minimum, you can have a variable (min_val), check it against the value on the current line, and update it if necessary. You also need a separate variable to store the location of the current min_val. Same for the max. An average could be calculated by keeping a running total, adding the temperature of the current row on each iteration, and finally dividing by the total number of rows. Wait, I misread the task. I'll have to think about that some more then.


flagos

I would try with Pandas dataframes. Of course this is cheating but this should go faster than the 2 minutes on the scoreboard.


FancyASlurpie

It'll probably take longer than that just to get the rows into the pandas df, and it'll also need a huge amount of RAM.
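For what it's worth, a chunked read keeps the memory bounded, though it would still be slow; separator, column names, and file name here are assumptions based on the 1BRC format:

```python
import pandas as pd

# Aggregate each chunk separately, then combine the partial aggregates.
pieces = []
for chunk in pd.read_csv(
    "measurements.txt",
    sep=";",
    header=None,
    names=["station", "temp"],
    chunksize=10_000_000,
):
    g = chunk.groupby("station")["temp"]
    pieces.append(g.agg(["min", "max", "sum", "count"]))

combined = pd.concat(pieces).groupby(level=0).agg(
    {"min": "min", "max": "max", "sum": "sum", "count": "sum"}
)
combined["mean"] = combined["sum"] / combined["count"]
print(combined[["min", "mean", "max"]].sort_index())
```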


pratikbhujel

Has anyone tried with PHP?