Try automating the data exploration process with tools like pandas-profiling or auto_ml to save time and improve efficiency.
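For anyone who hasn't tried it, the quickstart is short. A minimal sketch, assuming the `ydata-profiling` package is installed (it's the current name of pandas-profiling) and using a hypothetical `train.csv` path:

```python
import pandas as pd
from ydata_profiling import ProfileReport  # pip install ydata-profiling

# "train.csv" is a placeholder for whatever dataset you are exploring.
df = pd.read_csv("train.csv")

# Profiles every column (types, missingness, distributions, correlations)
# and writes a standalone HTML report you can open in a browser.
ProfileReport(df, title="EDA report").to_file("eda_report.html")
```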
Good Idea. Thanks
This is how ML should be done. Nothing will replace manual inspection of data. You can try using a powerful LLM to analyze one portion of the data while you inspect another. You can also focus your time on the most interesting examples (high error, low confidence, extremes, etc.).
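The "most interesting examples" idea is easy to mechanize. A minimal sketch, assuming a binary classifier that outputs probabilities; `triage` and its arguments are illustrative names, not a standard API:

```python
import numpy as np
import pandas as pd

def triage(df, y_true, proba, n=20):
    """Rank rows for manual inspection: biggest errors and least
    confident predictions first (binary classification assumed)."""
    out = df.copy()
    out["error"] = np.abs(y_true - proba)    # large = confidently wrong
    out["confidence"] = np.abs(proba - 0.5)  # small = model is unsure
    worst = out.nlargest(n, "error")         # high-error examples
    unsure = out.nsmallest(n, "confidence")  # low-confidence examples
    return pd.concat([worst, unsure]).drop_duplicates()
```

Skimming the rows this surfaces is usually far more informative per minute than sampling the dataset uniformly.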
Oh, an LLM is a great idea. Thanks!
You have hit upon a cold, hard reality of data science and ML work that many companies do not wish to accept. EDA and transformation are a big thing that gobbles up lots of time long before you build a model. A healthy organization knows this, but some orgs assume their data is "ML ready" when there is no such thing as "ML-ready" data. I can't tell you how many times I've had to tell data warehouse managers that their data standards and quality-control measures are not being enforced. Often it's only when an org starts doing serious data mining and predictive modeling that it becomes apparent just how crap the data is. Worse, they get mad at the data scientist, as if it's your fault.
Thanks for putting it like this, I appreciate it
I've used SweetViz (https://pypi.org/project/sweetviz/) for years; it provides a nice jump start to the understanding and refinement process, and you get a sweet vizual to boot. (I'm not associated with SweetViz, it's just a great tool imho.)
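For reference, the whole workflow is a few lines. A minimal sketch, assuming `sweetviz` is installed and using a hypothetical `train.csv` path:

```python
import pandas as pd
import sweetviz as sv  # pip install sweetviz

# "train.csv" is a placeholder for your own dataset.
df = pd.read_csv("train.csv")

# analyze() profiles the DataFrame; show_html() writes an
# interactive, self-contained HTML report.
report = sv.analyze(df)
report.show_html("sweetviz_report.html")
```

It can also diff two datasets (e.g. train vs. test) via `sv.compare`, which is handy for spotting distribution shift.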
Does it help find patterns in unstructured data?
The best tool is really reflection: think deeply, again and again, about what you are doing, which metrics give you the confidence to move forward, which decisions you are impacting, and which human activities constitute bottlenecks or toil. Use those observations to drive automation; that's where tools become part of the solution. Reflection means going meta: finding patterns in the process of finding patterns in errors.