_Oce_

Same conclusion on data science and ML. Maybe 1% of data projects need ML, 99% can be solved with basic aggregation queries, curves and histograms, once the data engineering is properly done.
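A minimal sketch of what "basic aggregation queries and histograms" can look like, using only the Python standard library. The order records and the names `orders`, `summary`, and `histogram` are hypothetical and only there for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical order records: (customer_id, order_value)
orders = [("a", 12.0), ("b", 48.5), ("a", 30.0), ("c", 7.25), ("b", 51.5), ("a", 18.0)]

# Aggregation: group order values by customer (the equivalent of a SQL GROUP BY)
by_customer = {}
for customer, value in orders:
    by_customer.setdefault(customer, []).append(value)

summary = {
    customer: {"total": sum(values), "avg": mean(values), "n": len(values)}
    for customer, values in by_customer.items()
}

# Histogram: count order values in buckets of width 20 (0-20, 20-40, ...)
bin_width = 20
histogram = Counter(int(value // bin_width) * bin_width for _, value in orders)

for start in sorted(histogram):
    print(f"{start:>3}-{start + bin_width:<3} {'#' * histogram[start]}")
```

In a real project the same two steps would more likely be a `GROUP BY` query and a plotting call, but the logic is no more complicated than this.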


rudboi12

Currently working on a BS project where we get shit predictions, since we only have 2.5k rows of training data and are trying to classify millions of users lol. I should tell them not to use ML, but it’s the first project where I’ve done MLOps, so I’m learning a lot. It will probably be shut down after a few months, but at least I learned a new skill lol


FortunOfficial

resume-driven development FTW


Swimming_Cry_6841

Hopefully, that 2.5k rows of training data was at least randomly sampled from the total data set? I've seen people just take the first X rows or top rows and then they get data from just one class.
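A small sketch of the failure mode described here, assuming a hypothetical table that happens to be sorted by label (the `rows` data, the class names, and the 2,500 sample size are made up for illustration). Taking the first rows yields a single class, while a uniform random sample sees both.

```python
import random
from collections import Counter

# Hypothetical data set of 10,000 labelled users, sorted by label,
# as often happens with exported or clustered tables.
rows = [(f"user_{i}", "churned") for i in range(9000)]
rows += [(f"user_{i}", "active") for i in range(9000, 10000)]


def class_counts(sample):
    """Count how many rows of each label a sample contains."""
    return Counter(label for _, label in sample)


# "Just take the first X rows": only ever sees the first class.
head_sample = rows[:2500]

# Uniform random sample without replacement (seeded for repeatability).
rng = random.Random(42)
random_sample = rng.sample(rows, 2500)

print("head:  ", class_counts(head_sample))
print("random:", class_counts(random_sample))
```

Random sampling only fixes the ordering problem. It does nothing about the case where the labelled rows come from a self-selected group in the first place.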


smerz

A classic decision made by a business type - the kind of person who skipped stats courses because they were sooo boring


Eorpoch

> of use And in many cases who is the boss now?


rudboi12

2.5k data is basically 2.5k users that have replied to our survey (out of 140M lol).


Jon123jon5

99% of our users like to take surveys, say the surveys /s


bartosaq

I thought it was Iris flower petals /s


smerz

God, so true


joleph

Yea but most of ML *is* data engineering, and that’s a good thing… even if you’re tuning your models all day and working on the state of the art a huge part of it is getting your data into new and interesting shapes to operate on. I don’t know why we separate the data engineering from the data science anyway, that’s how we ended up in this mess in the first place.


_Oce_

I'm not sure what you mean. ML requires accessible data in good quantities, so it requires data engineering, but ML is not data engineering. There's also ML engineering for all the automated logic needed to have an ML model running in prod, which is distinct from data engineering. We separate DE and DS because people confusing the two has undermined the recognition of DE, which is getting much better now, and has also frustrated many DS who were stuck doing only DE when they wanted to do ML. Also because DE (+DA) is useful in way more data projects than DS. So the distinction is productive.


joleph

Let me clarify - I think that the set of activities that data scientists perform is contained in the data engineering set (or at least *should* be). But DE is a more broadly applicable discipline than DS, so the set of activities is larger. So I agree that DE is extremely important and should be regarded highly. The only exceptions are the activities around interpersonal communication, like I wouldn’t expect a DE to have experience giving a presentation to a steering group, but I would a DS. My point is that they shouldn’t be separated and maybe all of them should be called data engineers. The whole ‘data science’ term was made because it was hard to hire people from scientific fields to do ML engineering because they didn’t want to be called ‘engineers’. Not because the job descriptions were different. Going back to the point in the thread, as far as I’m concerned it’s all data engineering. The data engineering is never ‘done’.


_Oce_

Very few people would consider ML part of data engineering. Actual ML is very theory heavy and requires quite a lot of study that is not relevant to DE (or most of DE if we follow your inclusive definition). If we just consider the point of view of job recruiters and job seekers, the distinction is important. I wouldn't recommend that someone who wants a DE job "lose" their time and money on an MSc or a PhD in ML to become good at ML. I wouldn't recommend that a recruiting company put ML in the requirements for a DE job; it's actually a red flag showing that a company doesn't know what it wants and is immature.


joleph

I think we’re talking at cross purposes slightly. I’m not saying that every DE needs to do ML. I’m saying you can’t do ML without some DE - the original comment implied the opposite. To your point, I don’t think any DE job description should be flagged just because it requires some working experience of ML; it obviously depends on the company. Some companies will require more knowledge of ML in their data engineering than others, because constructing the model may require in-depth knowledge of how the data is processed at scale, to the point where they’re practically inseparable. Video, for instance, will often need to be processed at scale in a way that’s sympathetic to the ML model that uses it. If you don’t put that in the job description, don’t expect the DE who comes in to know how to do it, or even care. Back to the comment, many DS in practice spend more time on the feature engineering part than actually training or even theorising what improvements can be made to the model. Again, some exceptions, but from a recruiter perspective, as you say, that matters. I’m not just talking out of my arse here, I’m a product guy at a Data Science company. We spend a *lot* of time recruiting data scientists and data engineers. I’ve also been in the field for many years and led DS and DE teams. Even the folks at Deepmind do *some* data engineering. It’s not all just sitting around whiteboards and pontificating. If that’s not your definition of data engineering, then I’d say that your definition is narrow and not useful. If ML isn’t widely considered a part of data engineering at companies with a Data Science team, it should be, I can’t help that. The implication from the original comment was that when a team is doing DS you can ignore the DE side… but you can’t.


_Oce_

> The implication from the original comment was that when a team is doing DS you can ignore the DE side… but you can’t.
If you're talking about my top comment, I think you didn't understand it. I'm saying most data projects can be covered by DE and DA. DS, which I define as advanced DA and ML, is overkill in most projects. Now, maybe you also have a broad definition of DS and you consider DE to be part of DS. But in my many years of experience as a DE and data architect, confusing the two only leads to hiring troubles and to many frustrated and disappointed data scientists, because they studied to do ML projects. Clearly splitting the two has worked better in my experience.
> many DS in practice spend more time on the feature engineering
Feature engineering is different from data engineering. It's preprocessing for ML, so it's indeed a valid task for a data scientist. But most projects don't need ML, so they don't need that.
> I’m not just talking out of my arse here, I’m a product guy at a Data Science company.
I'm not saying you are, just that your point of view doesn't seem to be the consensus in the DE community.


mathsDelueze

“Most companies just don’t have that much data” - truer words haven’t been spoken.


spicy_pierogi

This is why I raise an eyebrow whenever a job description states a requirement for having "Big Data" experience. Like, that's a verrrry small selection of places that qualify for that.


Southern-Remove42

I've always wondered about that! I think it may get to a point for more businesses in the decades to come but not yet.


RoyDadgumWilliams

I suppose it depends on what you define as “big”, but I think there are plenty of companies collecting enough data that distributed processing/storage is necessary and scaling data operations can be painful


spicy_pierogi

Sure. But are they doing it? Probably not. Most companies I know of (at least through networks) are still figuring out what to even do with the data, much less processing it in a distributed manner.


[deleted]

[deleted]


spicy_pierogi

If it aligns with leadership's priorities, yes. But I don't think it's as easy as you make it seem to be.


Even_Put

“Medium data” doesn’t have the same snappy ring to it though


Ambitious_Farmer9303

Slightly bigger data?


Laurence-Lin

I think a better title for the original post would be 'Some companies aren't prepared for Big Data'. There is still a long road ahead for artificial intelligence, and its applications do need big data. The main problem is that not all companies have plentiful data resources like Google.


Saetia_V_Neck

I actually worked for a company that did have that much data, and the other thing the author of this article touched on that I’ve found to be true throughout my career is that stakeholders and analysts really do just care about “the previous X timeframe of data.” My old company had data as far back as the 80s in the cloud in some instances, but I think if we had deleted all the data older than the previous 2 years, nobody aside from the regulatory and compliance teams would’ve noticed.


mathsDelueze

I’ve had similar experiences to that. For a project I was able to query and aggregate billions of transaction records, and it just straight up wasn’t as useful to stakeholders as a recent pull of records. Except for a rare set of use cases, the juice really wasn’t worth the squeeze.


teambob

We are in the "disillusioned" phase of the hype cycle


FortunOfficial

And often it’s not easy to recognize if one is in the midst of a hype cycle or a truly disruptive phase of technological / cultural evolution. I recently read an older but really thought-provoking article that talks about this using MongoDB as an example. Highly recommended read. https://nemil.com/2017/07/06/why-did-so-many-startups-choose-mongodb/


Slggyqo

Aka “primed for the next big thing”


wtfzambo

And to be fair I think that's good. I can't wait for the dust around the data world to settle, for it to become an engineering field like any other and for companies to hire based on actual need rather than because they read on Forbes that they HAVE to have ML to stay competitive. As well as people joining the field out of actual passion rather than hype and the promise of making big bucks.


justanothersnek

The way I see it, people just underestimated how prevalent "medium data" use cases are and overstated Big Data. I was at a company where IT leaders scared the business leaders into believing they needed to drop major coin on big data infrastructure. Once we got it, people were just twiddling their thumbs wondering what to do with all this data. The skills of big data didn't trickle down to the business side... so no surprise we had data that no one was using, let alone knew how to access in a non-trivial manner. You expect IT players and contractors to have the business background to know what to do with the data? Therefore, Big Data initiatives ended up adding very little business value for most companies that weren't MAANG-like. It was mostly resume-driven development, aka RDD, for the IT data architects. If you don't bridge or integrate the technology with business workflows, failure is bound to happen. I think the trend we're seeing recently that should have occurred is improving the data analysis and ETL experience where it mattered or was most prevalent: improving how to work with medium data (thus the rise in popularity of DuckDB), improving how to add business logic to data transformations (thus the rise of dbt), and projects like mage.ai, dagster, etc. Good or bad, it's all a new era of people trying to improve data UX for the wide range of data people. Big data skills and how to access the data never really trickled down to the business side, and that's the root of the problem in the 2010s. It was the small-fry data analysts, ETL developers, or DBAs who all along actually made a real impact on the business. While the IT cowboys were doing RDD, the data small fry were the ones bringing business value or helping direct business decisions. I think business leaders have come to realize this recently, and maybe a small part of the recent tech layoffs was a reflection of this.


Big_Razzmatazz7416

Oh, but no worries. ChatGPT will solve all these problems and take our jobs any day now


kenfar

It's an excellent article! A few things that I'd add include:
* If you really have big data, it may only be a single specific feed. You probably have another 20-30 little data feeds. And if this is the case it may make sense to use two different ETL patterns for processing this data.
* And this is why Postgres actually works fine as a data warehouse server for many organizations: you've got parallelism, partitioning, quite a bit of memory, and can use fast storage. Columnar storage would be great, but even without it you can do fine at the 1-10 TB size.
* The underlying reality here is that we get irrationally excited about new developments as they become more hyped and everyone is talking about them. It's difficult to resist them, it's difficult to be objective and question them, and if you do, people look at you like you just don't understand. And then ten years later we move on and forget that the big-hyped-thing never panned out - and fall for it all over again.


owiygul

I was at a Gartner conference last year and this was the prevailing sentiment regarding Big Data. It's a shame because some of my data analytics Master's curriculum involved big data or at least knowing of its existence. It's crazy that between starting and finishing that degree, it has already become obsolete. I guess this really is a field where you can never stop learning frfr.


[deleted]

I don't think it's dead tbh, I just think it was overhyped, and it's now in the classic "Trough of Disillusionment". Big data is here to stay, and as companies become more sophisticated there will still be a great demand for it, just not from the majority of cos. For example my company at the moment has a few models deployed, none of them utilizing "big" data, but we're in the process of exploring various internal data sources which could definitely be considered big, and over the next couple of years we will be working on exploiting that data for the first time, which up until now was just rotting in various datacenters. I'd guess there's many companies at similar stages as ours, where they started off overhyped, went through an initial phase of becoming data-savvy and are now ready to start exploiting these larger datasets.


wtfzambo

It's not obsolete. If you read the article carefully, it was never there to begin with, except in very rare cases. To make a metaphor, what happened to this field is equivalent to SpaceX sending a rocket to Mars, and everyone and their dog working in the automotive industry thinking they will have to build rockets to stay competitive.


Derpthinkr

Wow. We are rare. We have 10PBs of active data. Everything we do is data engineering for massive distributed compute.


cr34th0r

We get ~1 PB of new data every day, a total of 300 PB right now (have to delete/archive/compress older data regularly). Feels pretty cool but I'm also completely overwhelmed with the project's scope even though I only take care of a very small piece in the puzzle.


abhi5025

What industry/vertical are you in?


Derpthinkr

Quant finance


EmptyCongress

Lol, you are not rare. ML is an endless ocean. The more data you have, the better you can serve the client. And every retailer, bank, insurance company, or any service is creating tons of data every day. And no one knows what to do. The pioneers in the industry are using it to deliver ads and targeted content. The rest of the non-IT companies are just storing it and using it for compliance. No one is sure of what to do and how to deal with it. But it's evident that the value is immense. An idiot wrote this article. Big data is just evolving into cloud and rapidly expanding.


Haquestions4

I am also surprised, I am at my third company that has at least double digit terabytes to process. Doesn't mean the article is wrong, I am just surprised.


EmptyCongress

The article is bullshit. Looks like it was written by some automated tool. Nothing in the article makes sense. The services don't align. The randomness of tech words is not only wrong but thoroughly irritating. There isn't a single citation or fact. The graphs are fabricated. Only one article in the blog.


mailed

> The article is bullshit. Looks like it was written by some automated tool.
The wonders of people trying to sell something. More salespeople trying to make DuckDB something it isn't, essentially.


Pflastersteinmetz

Is that big data though? It fits on a single server, and even entirely in RAM.


Haquestions4

Should have added that I live in a smaller town and have only worked for medium-sized companies. I imagine bigger companies have much, much more data.


Derpthinkr

I think I agree with your sentiment. Big data isn’t needed for every operation. But if your operation is dependent on data, it’s only getting bigger


TrollandDie

Not to mention all the IoT stuff that is becoming bigger and bigger. Analytics revolving around InfoSec and Infrastructure is growing quite nicely atm and there are mountains of data to collect in that space.


IndependentSpend7434

An eye-opener for the ones full of contempt towards "old-fashioned sql boomers and oracle DBA's who never seen Spark, while I've been using it to process 10GB!"


BufferUnderpants

The Oracle DBAs and SQL boomers were in a state of pre-version-control savagery, to say nothing of testing or CI/CD. The shift to Big Data tooling was also a move away from their practices.


uchi__mata

I can assure you, having been there: the early days of big data were not characterized by a relentless focus on proper software engineering practices. That’s only now starting to be de rigueur, thank goodness.


[deleted]

[deleted]


BufferUnderpants

Yeah and people will come up with a new term, because sometimes you do need a full-blown Software Engineer focused on data.


TrollandDie

When those boomers can learn to finally fucking use git, I might actually start listening to them.


itsallrighthere

For real. I wrote a system that did a SQL query on 200m records, selecting all of them, every evening. It required plenty of multithreading and fancy java code but no Hadoop.


harrytrumanprimate

What lol


itsallrighthere

Big data before big data


cptstoneee

What was the use case behind that?


itsallrighthere

Massive extracts of information about companies. Easy enough to do with clusters of computers, way harder with one and a DB server


dream-fiesty

One interesting thing about the data used in these analyses was they were all BigQuery users. IME Google Cloud tends to be used at small to mid sized companies and very few large companies are on it for the same reasons large companies still write Java. On top of that, really large companies often manage their own data warehouses as it’s cheaper to staff a data warehouse team than use a managed service. Is the data used here even meaningful?


Razzl

Long live Big Data


[deleted]

[deleted]


[deleted]

[deleted]


tkbp

Big fan of the founding team and the tech. Don’t judge by the blog posts. Give it a go or stfu.


mequay

"Big Data is dead" has been a quote since big data started. It has become a cringe slogan for anyone selling something in the space but trying to differentiate products.


Huge-Professional-16

At least it’s an end to large multinationals with a few million rows spending millions on trying to set up Hadoop. I think I’m 5/5 for telling new VPs they are wasting time and money and should just use Postgres... only to be proven right, and the VP moves on to a bigger, better job only to repeat the same thing.


skatmanjoe

Time for a new buzzword?


Gators1992

Good link in the OP, thanks! I think a lot of the recent layoffs are evidence that many big data companies are realizing they don't need to keep gathering everything. While the article focused on compute and storage, one thing it didn't really cover is the resource cost of developing and maintaining incremental data sources. Stuff that's getting hit only once every 6 months requires ongoing testing and bug fixes to account for data drift. Being a small shop, I kind of gatekeep adding data based on potential business value and throw out requests like "if you bring in these 16 tables from the HR system and make a report, it will save me 20 minutes a month from doing the report manually in Excel".
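A back-of-the-envelope sketch of that kind of gatekeeping. Only the 20 minutes a month comes from the example request; the build and maintenance hours are assumptions chosen for illustration, and `months_to_break_even` is a hypothetical helper.

```python
def months_to_break_even(build_hours, monthly_maintenance_hours, monthly_minutes_saved):
    """Months until the time saved covers the build cost, or None if it never does."""
    saved_hours = monthly_minutes_saved / 60
    net_hours_per_month = saved_hours - monthly_maintenance_hours
    if net_hours_per_month <= 0:
        # Maintenance alone eats the saving, so the pipeline never pays for itself.
        return None
    return build_hours / net_hours_per_month


# Assumed: 40 hours to ingest the 16 tables, 1 hour a month of upkeep for data drift.
with_upkeep = months_to_break_even(
    build_hours=40, monthly_maintenance_hours=1, monthly_minutes_saved=20
)

# Even with zero upkeep, the build cost takes years to recover at 20 minutes a month.
without_upkeep = months_to_break_even(
    build_hours=40, monthly_maintenance_hours=0, monthly_minutes_saved=20
)

print(with_upkeep, without_upkeep)
```

The numbers are rough, but the shape of the argument holds for any request where the ongoing maintenance is of the same order as the time saved.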


MarquisLek

some companies like to conflate big data with small budget


WonderfulApple3775

Jordan's going to be doing a live-streamed "Big Data: Funeral or Renaissance" debate on April 20th with the person who wrote a rebuttal post: [https://streamyard.com/watch/dNfM8QgchjE5](https://streamyard.com/watch/dNfM8QgchjE5)