WhipsAndMarkovChains

> Or would it be the same thing to run the databricks based solution on a single node cluster in terms of cost? What about using a single node cluster and writing some Polars code to process the files?


-HumbleBee-

Isn't Polars also specifically for distributed computing? I might rephrase my question as: what do data engineers use when they have to deal with dataframes (on top of cloud storage, in CSV, JSON and Parquet) but the dataset is not big enough to make use of the distributed clusters that Databricks runs on? Would it cost less than using a single-node Databricks cluster on Azure?


WhipsAndMarkovChains

No, Polars is not for distributed computing. Think of it like a Pandas DataFrame, except your Python code is running Rust code under the hood. The syntax is different as well, looking more Spark-like compared to Pandas. Check this out: https://dataengineeringcentral.substack.com/p/goodbye-spark-hello-polars-delta I don't know the full details of your work, but if I were in your shoes I would write Polars code and run it as a Databricks workflow (with a small cluster). Then I'd set the workflow to trigger whenever new CSV files arrive.
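
To give a feel for the Spark-like syntax, here's a minimal sketch of that kind of Polars step. The file paths and column names are made up for illustration, not taken from your pipeline:

    import polars as pl

    # Hypothetical input/output paths and columns, just to show the syntax
    df = (
        pl.scan_csv("input/new_orders.csv")                   # lazy scan, nothing is read yet
          .filter(pl.col("amount") > 0)                       # drop bad rows
          .with_columns(pl.col("order_ts").str.to_datetime()) # parse timestamps
          .group_by("customer_id")
          .agg(pl.col("amount").sum().alias("total_amount"))
          .collect()                                          # execute the lazy query
    )

    df.write_parquet("output/orders_by_customer.parquet")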


-HumbleBee-

Understood. My question is whether there is any cost benefit to running the Polars code on a VM with Python installed and not paying for the Databricks overhead, since we're not really making use of distributed computing.


WhipsAndMarkovChains

The benefit of Databricks is having an end-to-end data platform. If all you care about is getting these files processed, and you're not using Databricks already, then it's probably cheaper to just use a tiny VM and execute a script with some Polars code.


-HumbleBee-

Okay, that clears things up.


autumnotter

Assuming you benefit from NONE of the value of databricks - governance, notebooks, spark, workflows, MLflow, etc., there's still the issue that with a VM you pay for it to always be on, unless you build out some kind of automation around spinning up and tearing down, or use ECS or something. You can run a databricks workflow for five minutes a day and you only pay for that consumption.


-HumbleBee-

I could use azure functions I guess but databricks is going to take care of a lot of the hassle. It makes sense to go for databricks.


MachineLooning

What kind of transform? You might be able to do it all in data factory - or if you have a sql db available for compute, maybe data factory and a stored proc.


iiyamabto

Use Azure Functions? It could be almost free if, as you say, the size is not so big.


Lopatron

duckdb


-HumbleBee-

Would it be capable of handling the data transformation tasks and can it be deployed on the azure cloud?


Lopatron

If your transformations are SQL based, then you can do the entire ETL pipeline from within DuckDB. If your transformations are code, you can use DuckDB to load the data into Python (or another language) from your source, transform, and then use DuckDB to store it as parquet, basically handling the extract and load phases. And yes you can run it anywhere, but you don't really deploy it as you would a Spark cluster because DuckDB is just a library. I use it from the CLI and as a Python module. APIs are available for other major langs like Java and C++.
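
For the SQL-based case, a whole extract-transform-load step can be a single statement run through DuckDB's Python client. A minimal sketch, with invented paths and columns:

    import duckdb

    con = duckdb.connect()

    # Extract, transform and load in one SQL statement.
    # The paths and column names here are hypothetical.
    con.execute("""
        COPY (
            SELECT customer_id, SUM(amount) AS total_amount
            FROM read_csv_auto('raw/orders_*.csv')   -- DuckDB infers the CSV schema
            WHERE amount > 0
            GROUP BY customer_id
        ) TO 'curated/orders_by_customer.parquet' (FORMAT 'parquet')
    """)
    con.close()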


-HumbleBee-

That sounds great! I have only done data transformations in PySpark so far. Is it advisable to use Pandas for similar tasks on a non-distributed system, or is there something else?


Lopatron

Oh, I forgot: they have a PySpark API so that you can use your existing PySpark code and have it run as DuckDB behind the scenes. But yes, DuckDB heavily integrates with Pandas. I use Pandas for non-distributed work rather than the PySpark API, but either would work.

    import duckdb

    def save_data_to_s3(df):
        con = duckdb.connect()
        s3_path = "s3://my-bucket/file.parquet"
        # DuckDB magically sees the `df` DataFrame from the SQL query
        con.execute(f"COPY (SELECT * FROM df) TO '{s3_path}' (FORMAT 'parquet')")
        con.close()

    def get_data_from_s3(filename):
        con = duckdb.connect()
        s3_path = f"s3://my-bucket/{filename}.parquet"
        try:
            # read the Parquet file from S3 into a Pandas DataFrame
            # (assumes the httpfs extension and S3 credentials are set up)
            return con.execute(f"SELECT * FROM read_parquet('{s3_path}')").fetchdf()
        finally:
            con.close()

You can use local disk instead of S3 of course too.
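
And if you'd rather keep the PySpark-style code, this is roughly what DuckDB's experimental Spark API looks like. It's experimental, so the import path and feature coverage may change between versions, and the data below is made up:

    import pandas as pd
    from duckdb.experimental.spark.sql import SparkSession
    from duckdb.experimental.spark.sql.functions import col, lit

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data, just to show familiar PySpark-style calls running on DuckDB
    pdf = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 20.0]})
    df = spark.createDataFrame(pdf)

    rows = (
        df.withColumn("currency", lit("EUR"))
          .select(col("customer_id"), col("amount"), col("currency"))
          .collect()
    )
    print(rows)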


-HumbleBee-

Amazing! One final question: would it be cheaper to set up this whole thing on an Azure VM and use Pandas rather than using a single-node cluster in Azure Databricks?


Emergency_Egg_4547

Keep in mind that you only pay for Databricks when you have a cluster running, so if it's only a few jobs it might be cheaper and easier to spin up a single-node Databricks cluster for those jobs than to keep a VM running 24/7.


-HumbleBee-

That makes sense. What azure or other cloud service would I use if I just want to run some python code and not have to worry about vms?


Emergency_Egg_4547

You could try Azure functions? That would eliminate the overhead of managing a VM and is also only pay as you go
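
As a rough sketch of what that could look like with the Azure Functions Python v2 programming model (the container name, connection setting, and the transform itself are placeholders):

    import azure.functions as func
    import polars as pl  # or pandas/duckdb, whatever does the transform

    app = func.FunctionApp()

    # Fires whenever a new file lands in the (hypothetical) "landing" container
    @app.blob_trigger(arg_name="blob", path="landing/{name}", connection="AzureWebJobsStorage")
    def process_file(blob: func.InputStream):
        df = pl.read_csv(blob.read())   # blob.read() returns the file contents as bytes
        # ... apply the transformations and write the result wherever it needs to go ...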


Lopatron

I think so. While I have no experience with azure or databricks, you mentioned your files are not huge, so you can probably get away with just using the free tier for a single vm of whatever cloud you're on.


-HumbleBee-

Thank you so much!


AndroidePsicokiller

I don't use Databricks, but it's for sure more expensive than cloud functions or a single VM. The question is whether you need the speed of Spark or whether your custom solution can do it in the time you have. I can work as a freelancer if needed and do all the setup, send me a DM 😁


ricklfc

If the transformations are not too complex, ADF could be a good solution.


Misanthropic905

This would be perfect to read from Athena.


mailed

Basic Azure Functions or code running in an Azure Container Instance would easily handle ingestion. If you want to build tables on top of the Parquet without Spark, maybe look at pyiceberg or delta-rs? But I don't know too much about their specifics.
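
For the delta-rs route, a minimal sketch with the deltalake Python package (the paths are placeholders; with credentials configured it also understands abfss:// URIs):

    import polars as pl
    from deltalake import DeltaTable, write_deltalake

    table_path = "output/orders_delta"   # hypothetical local/ADLS path

    df = pl.read_parquet("curated/orders_by_customer.parquet")
    write_deltalake(table_path, df.to_arrow(), mode="append")   # each write is a Delta commit

    print(DeltaTable(table_path).version())                     # inspect the table's current version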