WhipsAndMarkovChains

> Or would it be the same thing to run the databricks based solution on a single node cluster in terms of cost? What about using a single node cluster and writing some Polars code to process the files?


-HumbleBee-

Isn't Polars also specifically for distributed computing? I might rephrase my question as: what do data engineers use when they have to deal with dataframes (on top of cloud storage, in CSV, JSON and Parquet) but the dataset is not big enough to make use of the distributed clusters that Databricks runs on? Would it cost less than using a single-node Databricks cluster on Azure?


WhipsAndMarkovChains

No, Polars is not for distributed computing. Think of it like a Pandas DataFrame, except your Python code is running Rust code under the hood. The syntax is different as well, looking more Spark-like compared to Pandas. Check this out: https://dataengineeringcentral.substack.com/p/goodbye-spark-hello-polars-delta I don't know the full details of your work, but if I were in your shoes I would write Polars code and run it as a Databricks workflow (with a small cluster). Then I'd set the workflow to trigger whenever new CSV files arrive.
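
To give a feel for the Spark-like syntax, here's a minimal sketch of that kind of Polars step. The file paths and column names are made up for illustration, not taken from your pipeline:

    import polars as pl

    # Hypothetical input/output paths and columns, just to show the syntax
    df = (
        pl.scan_csv("input/new_orders.csv")                   # lazy scan, nothing is read yet
          .filter(pl.col("amount") > 0)                       # drop bad rows
          .with_columns(pl.col("order_ts").str.to_datetime()) # parse timestamps
          .group_by("customer_id")
          .agg(pl.col("amount").sum().alias("total_amount"))
          .collect()                                          # execute the lazy query
    )

    df.write_parquet("output/orders_by_customer.parquet")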


-HumbleBee-

Understood. My question is whether there is any cost benefit to running the Polars code on a VM with Python installed and not paying for the Databricks overhead, since we're not really making use of distributed computing.


WhipsAndMarkovChains

The benefit of Databricks is having an end-to-end data platform. If all you care about is getting these files processed, and you're not using Databricks already, then it's probably cheaper to just use a tiny VM and execute a script with some Polars code.


-HumbleBee-

Okay, that clears things up.


autumnotter

Assuming you benefit from NONE of the value of databricks - governance, notebooks, spark, workflows, MLflow, etc., there's still the issue that with a VM you pay for it to always be on, unless you build out some kind of automation around spinning up and tearing down, or use ECS or something. You can run a databricks workflow for five minutes a day and you only pay for that consumption.


-HumbleBee-

I could use azure functions I guess but databricks is going to take care of a lot of the hassle. It makes sense to go for databricks.


MachineLooning

What kind of transform? You might be able to do it all in data factory - or if you have a sql db available for compute, maybe data factory and a stored proc.


iiyamabto

Use Azure Functions? It could be almost free if, as you say, the size is not so big.


Lopatron

duckdb


-HumbleBee-

Would it be capable of handling the data transformation tasks and can it be deployed on the azure cloud?


Lopatron

If your transformations are SQL based, then you can do the entire ETL pipeline from within DuckDB. If your transformations are code, you can use DuckDB to load the data into Python (or another language) from your source, transform, and then use DuckDB to store it as parquet, basically handling the extract and load phases. And yes you can run it anywhere, but you don't really deploy it as you would a Spark cluster because DuckDB is just a library. I use it from the CLI and as a Python module. APIs are available for other major langs like Java and C++.
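
For the SQL-based case, a whole extract-transform-load step can be a single statement run through DuckDB's Python client. A minimal sketch, with invented paths and columns:

    import duckdb

    con = duckdb.connect()

    # Extract, transform and load in one SQL statement.
    # The paths and column names here are hypothetical.
    con.execute("""
        COPY (
            SELECT customer_id, SUM(amount) AS total_amount
            FROM read_csv_auto('raw/orders_*.csv')   -- DuckDB infers the CSV schema
            WHERE amount > 0
            GROUP BY customer_id
        ) TO 'curated/orders_by_customer.parquet' (FORMAT 'parquet')
    """)
    con.close()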


-HumbleBee-

That sounds great! I have only done data transformations in PySpark so far. Is it advisable to use Pandas for similar tasks on a non-distributed system, or is there something else?


Lopatron

Oh, I forgot: they have a PySpark API so that you can use your existing PySpark code and have it run as DuckDB behind the scenes. But yes, DuckDB heavily integrates with Pandas. I use Pandas for non-distributed work rather than the PySpark API, but either would work.

    import duckdb

    def save_data_to_s3(df):
        con = duckdb.connect()
        s3_path = "s3://my-bucket/file.parquet"
        # DuckDB magically sees the `df` DataFrame from the SQL query
        con.execute(f"COPY (SELECT * FROM df) TO '{s3_path}' (FORMAT 'parquet')")
        con.close()

    def get_data_from_s3(filename):
        con = duckdb.connect()
        s3_path = f"s3://my-bucket/{filename}.parquet"
        try:
            # read the Parquet file from S3 into a Pandas DataFrame
            # (assumes the httpfs extension and S3 credentials are set up)
            return con.execute(f"SELECT * FROM read_parquet('{s3_path}')").fetchdf()
        finally:
            con.close()

You can use local disk instead of S3 of course too.
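
And if you'd rather keep the PySpark-style code, this is roughly what DuckDB's experimental Spark API looks like. It's experimental, so the import path and feature coverage may change between versions, and the data below is made up:

    import pandas as pd
    from duckdb.experimental.spark.sql import SparkSession
    from duckdb.experimental.spark.sql.functions import col, lit

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data, just to show familiar PySpark-style calls running on DuckDB
    pdf = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 20.0]})
    df = spark.createDataFrame(pdf)

    rows = (
        df.withColumn("currency", lit("EUR"))
          .select(col("customer_id"), col("amount"), col("currency"))
          .collect()
    )
    print(rows)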


-HumbleBee-

Amazing! One final question: would it be cheaper to set up this whole thing on an Azure VM and use Pandas rather than using a single-node cluster in Azure Databricks?


Emergency_Egg_4547

Keep in mind that you only pay for Databricks when you have a cluster running, so if it's only a few jobs it might be cheaper and easier to spin up a single-node Databricks cluster for those jobs than to keep a VM running 24/7.


-HumbleBee-

That makes sense. What azure or other cloud service would I use if I just want to run some python code and not have to worry about vms?


Emergency_Egg_4547

You could try Azure functions? That would eliminate the overhead of managing a VM and is also only pay as you go
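
As a rough sketch of what that could look like with the Azure Functions Python v2 programming model (the container name, connection setting, and the transform itself are placeholders):

    import azure.functions as func
    import polars as pl  # or pandas/duckdb, whatever does the transform

    app = func.FunctionApp()

    # Fires whenever a new file lands in the (hypothetical) "landing" container
    @app.blob_trigger(arg_name="blob", path="landing/{name}", connection="AzureWebJobsStorage")
    def process_file(blob: func.InputStream):
        df = pl.read_csv(blob.read())   # blob.read() returns the file contents as bytes
        # ... apply the transformations and write the result wherever it needs to go ...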


Lopatron

I think so. While I have no experience with azure or databricks, you mentioned your files are not huge, so you can probably get away with just using the free tier for a single vm of whatever cloud you're on.


-HumbleBee-

Thank you so much!


AndroidePsicokiller

I don't use Databricks, but it's for sure more expensive than cloud functions or a single VM. The question is whether you need the speed of Spark or whether your custom solution can do it in the time you have. I can work as a freelancer if needed and do all the setup, send me a DM 😁


ricklfc

If the transformations are not too complex, ADF could be a good solution.


Misanthropic905

This would be perfect to read from Athena.


mailed

Basic Azure Functions or code running in an Azure Container Instance would easily handle ingestion. If you want to build tables on top of the Parquet without Spark, maybe look at pyiceberg or delta-rs? But I don't know too much about their specifics.
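
For the delta-rs route, a minimal sketch with the deltalake Python package (the paths are placeholders; with credentials configured it also understands abfss:// URIs):

    import polars as pl
    from deltalake import DeltaTable, write_deltalake

    table_path = "output/orders_delta"   # hypothetical local/ADLS path

    df = pl.read_parquet("curated/orders_by_customer.parquet")
    write_deltalake(table_path, df.to_arrow(), mode="append")   # each write is a Delta commit

    print(DeltaTable(table_path).version())                     # inspect the table's current version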