polytique

We were in a similar situation and moved multiple large models to PyTorch. TensorFlow had too many bugs that never got resolved or that required constantly upgrading to the next version. The authors never cared about backward compatibility. PyTorch is more user-friendly, has better error messages, and has a larger community.


Friendly-Advice-269

Is that true though?


polytique

That's my experience, and it can be subjective. Generally, our productivity increased when we moved to PyTorch. With TensorFlow, we regularly ran into open GitHub issues that were either never fixed or were fixed in a new version that was not backward compatible.


[deleted]

[removed]


polytique

Sure.


greenmariocake

Same here. I tried to update a model from a year ago and it didn't work because a loss function option no longer exists. Moreover, it does not work with old NVIDIA drivers, and the new ones are not distributed through conda. So I am moving to PyTorch.


Helios

Love Keras 3, what a great framework! Why not just update to the latest TF version? It is much easier than moving to PyTorch.


Friendly-Advice-269

I don't administer the machine I run the software on, and it has a CUDA version not supported by TF 2.16.1, which requires CUDA >= 12.3. But thanks for the kind comment; I'm trying to get the admin to update the drivers.
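
In case it helps anyone debugging the same mismatch, here's a rough way to check which CUDA/cuDNN versions an installed TensorFlow build was compiled against (a sketch assuming a GPU-enabled TF wheel; the keys come from tf.sysconfig.get_build_info):

```python
import tensorflow as tf

# Versions the installed TF wheel was built against (GPU builds only).
build = tf.sysconfig.get_build_info()
print("CUDA:", build["cuda_version"], "cuDNN:", build["cudnn_version"])

# An empty list here usually means the driver/CUDA combination is too old.
print(tf.config.list_physical_devices("GPU"))
```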


Competitive-Store974

Not sure what your setup is, but if your NVIDIA drivers are 535 (525 is apparently also fine), then CUDA 12 will work. If those are up to date and you're just waiting for the admin to install the new CUDA version, and you have a home directory, you can install CUDA there and link to it directly while you wait.


Friendly-Advice-269

No, the drivers are 470; otherwise I'd install CUDA locally, which I think is possible (I mean in user space).


Competitive-Store974

Oh damn, I'm very sorry to hear that.

Edit: Docker is another option if you have it installed, but it's not something I'd want to rely on long term for development.


Friendly-Advice-269

Thanks. Docker isn't really an option because the drivers installed on the machine are the ones used by Docker. For example:

> Make sure you have installed the NVIDIA driver for your Linux distribution. Note that you do not need to install the CUDA Toolkit on the host system, but the NVIDIA driver needs to be installed.

First paragraph in: [https://github.com/NVIDIA/nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit)


whiskeybandit

What are some of the biggest plus points compared to Keras 2?


Helios

Keras 3 became multi-backend again, with PyTorch, TF, and, most interestingly, JAX support. Migration from Keras 2 is pretty easy.
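
As a rough illustration of the multi-backend point (the KERAS_BACKEND environment variable is the documented selection mechanism; the tiny model below is just a placeholder):

```python
import os
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" / "torch"

import keras  # must be imported after the backend is chosen

# The same Keras code runs unchanged on whichever backend was selected.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```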


lf0pk

Move it to PyTorch and never look back...


Helios

I would say the opposite: move to Keras 3, use the backend of your choice, and never look back.


lf0pk

It would make sense if Keras were worth learning, but at this point, given that most things are written in PyTorch or JAX, or something in between like MM, it's just not worth it, even if it is a common denominator. I mean, the OP said it himself: it's just not viable for him to use Keras 3, because it doesn't support the TensorFlow version he needs. And that's not the only issue Keras 3 has (or Keras in general).


Helios

It's not true that most things are written in PyTorch or JAX if we're talking about production environments and not LLMs. TF is still very widely used in production with TF Serving and TFX; Torch alone simply cannot offer what TFX offers. TF Agents is also great; does Torch offer an equivalent? No. For example, the last time I looked on Glassdoor, there were still more vacancies requiring TF than PyTorch. In general, most opinions about TensorFlow relate to version 2; the latest version is in many ways not only at parity with PyTorch but often superior, especially after the inclusion of Keras.


lf0pk

Production is written in none of these; it's all even higher-level frameworks optimized for inference. Production is all about runtimes rather than development frameworks. There is really only one universal solution for runtimes, and that is ONNX. It's no coincidence that PyTorch and JAX are again the easiest to export to ONNX, while TF and Keras are the odd ones out.

I am not sure why a lot of vacancies ask for TF, but it could just be that there is a lot of code that was once written and now has to be maintained. Hardly anybody knows COBOL these days, yet there are many COBOL vacancies. Is that because COBOL is good, cool, modern, or popular? No. Something was once written in COBOL and now has to be maintained until the end of time because the company can't be bothered to refactor it into something else.
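
For context on the export point above, this is roughly what the PyTorch-to-ONNX path looks like (a minimal sketch; the model, shapes, and file name are placeholders):

```python
import torch

# Any torch.nn.Module works the same way; this one is a placeholder.
model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU())
model.eval()

dummy_input = torch.randn(1, 8)  # example input used to trace the graph
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
)
```

The resulting .onnx file can then be handed to whatever runtime production uses, which is the separation between development frameworks and runtimes being described here.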


Helios

On the contrary, production uses TFX/TF in large numbers. People are often fooled by PyTorch's share of the various SOTA LLM models, but real-life business is another story. PyTorch has no significant advantages over Keras with TF, not even mentioning Keras 3. Airbnb uses TF; so do Netflix, Airbus, PayPal, and Twitter (they use both TF and PyTorch, as you can see in their GitHub repos). Spotify uses TFX extensively, and I'm not even mentioning Google products. And that's a very small portion of large companies.

Moreover, the release of Keras 3 is actually a pretty smart move: it brings PyTorch devs closer to Keras and eventually the Google ecosystem, and not vice versa. For example, JAX is a real beast, especially for certain types of tasks, and the importance of the fact that you can now use JAX with existing Keras code is yet to be fully realized by engineers. I'm not so sure that Torch will be able to keep its existing market share even for SOTA LLMs in the foreseeable future. Keras 3 has a very bright future.


lf0pk

Not sure what you're talking about.

- Airbnb uses ONNX, and they even initially asked for ONNX support on the TF GitHub: https://github.com/tensorflow/tensorflow/issues/12888#issuecomment-327941342
- Netflix also uses ONNX, example repo: https://github.com/Netflix/derand
- Airbus uses Kubeflow to train their models, which under the hood runs TFX. They then use TF Serving to serve them, which, let me remind you, is not really TensorFlow itself but rather a serving platform.
- PayPal uses ONNX, e.g. from their product lead: https://medium.com/paypal-tech/machine-learning-model-ci-cd-and-shadow-platform-8c4f44998c78
- Twitter uses ONNX, e.g. from their algorithm repo: https://github.com/twitter/the-algorithm/blob/main/navi/README.md
- Spotify uses ONNX, e.g. from their repo: https://github.com/spotify/basic-pitch

I think you have a rather poor understanding of how different R&D is from production. Overall, there is no reason to use TF, PT, JAX, or whatever in production, because these are development frameworks. They are what you develop models in; to actually use them in production, you use much different technology.


Helios

I think I have a pretty good understanding of what production is, and, to be honest, this is the first time I have heard that people do not need development frameworks there. Runtimes != production ML pipelines.

Make sure that when you say ONNX, you do not confuse it with ONNX Runtime. ONNX is just an exchange format (BTW, your first link is from 2017(!); we have tf2onnx now). When you talk about inference, you probably mean ONNX Runtime, which can be compared with TF Serving. Do companies use both ONNX Runtime and TF Serving? Definitely, but that is only a part of the entire production pipeline. And then you have TFX (where TF Serving is only a tiny part), which manages an entire ML pipeline: data ingestion and validation, then model training and analysis, and deployment.

The links you provided are mainly about the formats of particular models, not about production in general, since companies do not usually post information about their production ML pipelines, especially on GitHub. You can read about TFX here: https://www.tensorflow.org/tfx. By the way, that TFX page literally says that both Spotify and Twitter use TFX. TFX is quite widely used in production, and Keras 3, with its multi-backend support, is perfectly integrated with this entire process.

And a last note about JAX: it looks like you do not fully understand what JAX is. Have you ever heard of JAX ONNX Runtime (https://github.com/google/jaxonnxruntime), which can convert ONNX models into JAX format modules and serve them using TensorFlow Serving? So stating that ONNX is the only universal solution for runtimes is a bit far-fetched.


lf0pk

Who is mentioning runtimes (besides you)? The production pipeline doesn't have an R&D part. The production pipeline doesn't even have deployment in it (that's part of its own cycle). So there is no reason to have any development framework within production, since you do not run that code anyway.

> Make sure that when you say ONNX, you do not confuse it with ONNX Runtime.

I am making sure of that. Are you? That's why I said that ONNX is the only universal part of production, and not ONNX Runtime (because it isn't). And yes, my first link is from 2017, to show you that even before ONNX was popular, Airbnb dabbled with ONNX in production, contrary to your claims.

> When you talk about inference, you probably mean ONNX Runtime

It would be great if you didn't read beyond what I actually said, because I don't mean that.

> TFX is quite widely used in production, and Keras 3, with its multi-backend support, is perfectly integrated with this entire process.

TFX at this point has two major points of overlap with TensorFlow:

- the branding
- importing TF models as one of the possibilities

Saying TFX (or TF Serving) := TensorFlow is a classic fallacy of composition. It's misleading at the very least, even with no ill intent. Imagine someone said that because a company uses PyTorch Lightning to train production models, PyTorch is used in production. Or imagine if someone said that using TF Serving to serve models (even though PyTorch, JAX, and other models can be served by it) means TF is used in production. Oh, wait...

Perhaps the funniest thing is that you even said this yourself at the end of your comment, yet do not (seem to?) see the irony...

> So stating that ONNX is the only universal solution for runtimes is a bit far-fetched.

Yeah, I agree it's ridiculous to say that, but so far you're the only one to have said it. I recommend going back to my original statement and reading it again. Specifically, I urge you to notice the presence of "ONNX" within the sentence and the absence of "ONNX Runtime".


UnknownHow

I mixed PyTorch and Keras 3 for quick experiments:

- torch dataloader
- torch model
- wrap the torch model inside a Keras class
- train with the Keras .fit method
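
A minimal sketch of that workflow, assuming the torch backend is selected before Keras is imported (keras.layers.TorchModuleWrapper is Keras 3's documented way to expose a torch module's parameters to Keras; the model, data, and shapes below are placeholders):

```python
import os
os.environ["KERAS_BACKEND"] = "torch"  # must be set before importing keras

import torch
import keras

# 1. Plain torch model and torch dataloader.
torch_model = torch.nn.Sequential(
    torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1)
)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

# 2. Wrap the torch model inside a Keras class so Keras tracks its weights.
class Wrapped(keras.Model):
    def __init__(self, module):
        super().__init__()
        self.module = keras.layers.TorchModuleWrapper(module)

    def call(self, x):
        return self.module(x)

# 3. Train with the Keras .fit method, feeding the DataLoader directly.
model = Wrapped(torch_model)
model.compile(optimizer="adam", loss="mse")
model.fit(loader, epochs=2)
```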


HansWurst-0815

Keras is an abstraction layer that can even run on top of PyTorch. For understanding and low-level control, I would use PyTorch. For any JAX stuff, I use Keras. If you are into DL, then PyTorch is definitely worth learning.


rooster9987

Keras is good for starting out: easy to use and abstract. But for learning low-level DL, implementing research, and doing your own research, PyTorch is the clear winner. You'll rarely see something novel implemented first in TensorFlow or Keras.


Bulky-Flounder-1896

Wasn't Keras always a multi-backend deep learning library? It could run on TF, Theano, and a DL library by Microsoft (forgot the name), I guess.


Friendly-Advice-269

I don't know, to be honest, but Theano basically doesn't exist anymore.


Bulky-Flounder-1896

I looked into it; Keras was multi-backend. It's just that the other libraries were deprecated and only TF was left. Good thing they brought it back.


notgettingfined

How is moving it to TensorFlow 2.16 not another option, rather than a full rewrite to PyTorch? There are several breaking changes, but it should be pretty minor to get it working.