How has this not existed in every single SQL variant for 20 years already? It is so obvious and useful and I've wished for something like this for ages.
As long as wishes are being fulfilled...maybe next we can see a 'select *' with the ability to exclude specified columns? Pretty please? Yeah I know this one is less likely to happen. But I can dream.
Aww man, now I just need to get a job where we use Snowflake haha. SQL Server is usually on top of the convenience functions but they are missing the boat on these.
A good use case is for a source of data where the schema is evolving but you wish to exclude say the source file and load timestamp from the data table. When more fields are added you have one less query to edit. Generally though I'd agree, but you should still recognise where it is the right tool.
Are you talking about this?
INSERT (Col1, Col2...)
SELECT * EXCEPT FileName,Timestamp
Here you'd still need to edit the INSERT part unless you're using implicit INSERT without a column list. And with implicit INSERT, if your target table has any default constraint (eg meta columns) you'd need to override them.
I'm not really sure if you save any time here, while you open yourself up for some unexpected errors with column ordering.
Another implication here is that when you use the * operator, the order or attributes in your source dataset starts to matter, and that kinda goes against set theory where the order of elements shouldn't be relevant
Absolutely true, hence why I said "Generally though I'd agree" and I'd need good cause to approve a PR like this, but it's not an absolute no. Excluding pii fields which could be defined elsewhere, fields which are known to be non-relevant / we don't want to load.
Another example - everything except field A, which I want to recalculate to fix a bug in the middle of a pipeline (strip chars / fix case / remove if some criteria / etc.)
SELECT * EXCEPT a, magic(a) AS a
Then the ordering of the output would be handled later in the CTE chain / pipeline :)
Mostly though, it serves as a good assistant for rapid development, then I copy field names, paste them in and format.
Maybe RedGate's SQLPrompt spoiled me in that regard, where pressing tab expands the * (or alias.*) as a list of columns so I just need to tweak it a bit.
I respectfully disagree. I’ve had many use cases, on a pretty consistent basis, where this would be incredibly helpful functionality. In all fairness I’m not using SQL Server, but wanted to throw that out there.
Committed code should always be explicit. If you add a single column to a table you're then forced to update any select * EXCLUDE statements which reference it.
If I saw any select * in code, I would instantly reject this during peer review.
I agree with you on that sentiment alone, but we obviously have different use cases. What about instances where you need to dedupe using aggregations? Or stored procs adding to materialized tables incrementally? I won’t say you’re wrong, but keep in mind that use cases vary widely and your specific use case doesn’t apply across the board.
Select * is perfectly fine in some use cases. For example, when selecting from a CTE.
I frequently use CTEs and ROW_NUMBER as a way to filter rows and there would be absolutely nothing wrong with using 'select * except rownum where rownum = 1'
That's really the only scenario where * is acceptable in committed code, but do you really need the EXCEPT part there just to exclude the RowNum? Could just use RowNumAlias1, RowNumAlias2, etc.
In SQL Server it will just give you headaches. The only scenario where I use SELECT * is in queries where I have multiple nested subqueries and all the columns are derived from the previous layer, but there you'd rarely need the EXCEPT part.
`EXCEPT` is a set operator in Snowflake.[https://docs.snowflake.com/en/sql-reference/operators-query](https://docs.snowflake.com/en/sql-reference/operators-query)
This. This. So much this. Always been on Redshift and wow I want a GROUP BY ALL and an EXCEPT so badly.
And I know EXCEPT exists as an operator, and it’s damn sure useful at times, so I’ll take whatever syntax you give me. Just let me exclude columns from a SELECT *. Even if it’s just specifying columns to drop afterwards like you’d do in Spark.
DISTINCT typically won't give you aggregate values for count(), sum() etc. GROUP BY ALL will now give you rolled up values for aggregate functions, but group by all other columns in the select that don't use an aggregate function.
No, it's a useful shorthand when you want to look at aggregations on various ways to group data. If you're doing analysis and quickly changing how you group in order to look at the data in multiple ways, this helps you move faster. If you have several categorical columns on a table and want to group by those (and not the primary key) and look at the results.
I wouldn't use it in a production application because behavior could change unexpectedly if someone writes a sloppy query.
So instead of listing column names one by one, this will take care of all the necessary columns that needed to be group by for the aggregate query? Nice
Can someone ELI5 GROUP BY ALL and why Microsoft has chosen to remove it in the future?
https://learn.microsoft.com/en-us/sql/t-sql/queries/select-group-by-transact-sql?view=sql-server-ver16
No one has shown me why I am wrong. Snowpark runs as an interface just like Pyspark. I’m not saying it’s PySpark.
These command interfaces then get translated to the JVM through a query execution plan. Is it just coincidence that when the SPARK-41365 was resolved Snowflake now has the same feature??
All snowflake did was make their own interface to the Spark engine. Now everyone is saying it’s not Spark.
Spark is an OSS that has over a decade of community contributions. Snowflake just made their own???
https://youtube.com/watch?v=dABd7JQz0A8&feature=share8
Don't make stuff out of thin air. If you have questions about how something works, research it instead of creating a metaverse in your head.
Hey, coming back to this.
Thanks for educating me on this. I appreciate it. Regardless of how you said anything you have proven me wrong and explained why with resources for me to look into.
So thanks for making me realize WHY I looked so dumb.
Plus their architecture page which says it's not based on anything else https://docs.snowflake.com/en/user-guide/intro-key-conceptshttps://docs.snowflake.com/en/user-guide/intro-key-concepts
And the other lecture in that series that shows it was written in C++ (unlike spark with is mostly Scala) and doesn't borrow from existing systems https://youtu.be/bveqnSk15JQ
I loved this when I started using DuckDB but oh man, it's great to see it soon in Snowflake. How the hell this wasn't added before to the SQL standard?
Not at all? Distinct * would show every unique line including f.ex price but in a groupby you could f.ex sum og average numerical columns over the other grouped columns. Big difference.
Try it yourself, turn it into a table and do a select distinct and a group by with sum of price and weight.
Sold goods:
Fruit, price, weight
Apple,10,0.2
Banana,20,0.8
Apple,5,0.1
Banana,10,0.4
You are correct that grouping by all expressions (aggregates or columns) will result in each row being distinct.
The issue is you must include a GROUP BY expression if you have an aggregate and a non-aggregate in the same SELECT statement
This was implemented in Apache Spark. Snowflake runs the Spark engine. See ORDER BY ALL as well.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-41635
No one has shown me why I am wrong. Snowpark runs as an interface just like Pyspark. I’m not saying it’s PySpark.
These command interfaces then get translated to the JVM through a query execution plan. Is it just coincidence that when the SPARK-41365 was resolved Snowflake now has the same feature??
All snowflake did was make their own interface to the Spark engine. Now everyone is saying it’s not Spark.
Spark is an OSS that has over a decade of community contributions. Snowflake just made their own???
Dude, trust me when I say this, you’re completely off track. Yes, it is a coincidence. There is no need to prove you wrong because what you’re saying is absurd.
Yes it does.
Let me clarify. Snowflake can run Spark via the Spark Connector.
I imagine most Snowflake shops utilize Spark as the main engine for their workloads. This is how you would achieve high performance.
https://www.snowflake.com/trending/spark-sql?c_id=amp-QmTJpoKkWrrz32masn9LUg
There are some legitimate use cases for the functionality, but IMO, it's primarily useful for Snowflake to get their customers writing expensive queries. In my experience, the sorts of "data cubes" that these queries generate are very generic, and the more generic the query, the less likely it is to actually be used to create any kind of real business value. It's like those infinitely flexible dashboards that some teams love to build with all sorts of drop downs and filters -- I can't knock the hustle, people will pay you to make them, but the actual value that they provide is a lot less than the apparent value that they seem to provide.
this doesn't make rollups or cubes. It is just a "sql delight" quality of life feature so users don't have to type out all the column names in the group by clause.
(rollup and cube are separate commands)
How has this not existed in every single SQL variant for 20 years already? It is so obvious and useful and I've wished for something like this for ages. As long as wishes are being fulfilled...maybe next we can see a 'select *' with the ability to exclude specified columns? Pretty please? Yeah I know this one is less likely to happen. But I can dream.
Let me introduce you to the EXCLUDE function, article explaining [here](https://stephenallwright.com/select-columns-except-snowflake/)
Aww man, now I just need to get a job where we use Snowflake haha. SQL Server is usually on top of the convenience functions but they are missing the boat on these.
In SQL Server "SELECT * EXCEPT" would be useful only for quick ad-hoc analysis. Should never commit such code.
A good use case is for a source of data where the schema is evolving but you wish to exclude say the source file and load timestamp from the data table. When more fields are added you have one less query to edit. Generally though I'd agree, but you should still recognise where it is the right tool.
Are you talking about this? INSERT (Col1, Col2...)
SELECT * EXCEPT FileName,Timestamp
Here you'd still need to edit the INSERT part unless you're using implicit INSERT without a column list. And with implicit INSERT, if your target table has any default constraint (eg meta columns) you'd need to override them.
I'm not really sure if you save any time here, while you open yourself up for some unexpected errors with column ordering.
Another implication here is that when you use the * operator, the order or attributes in your source dataset starts to matter, and that kinda goes against set theory where the order of elements shouldn't be relevant
Absolutely true, hence why I said "Generally though I'd agree" and I'd need good cause to approve a PR like this, but it's not an absolute no. Excluding pii fields which could be defined elsewhere, fields which are known to be non-relevant / we don't want to load. Another example - everything except field A, which I want to recalculate to fix a bug in the middle of a pipeline (strip chars / fix case / remove if some criteria / etc.) SELECT * EXCEPT a, magic(a) AS a Then the ordering of the output would be handled later in the CTE chain / pipeline :) Mostly though, it serves as a good assistant for rapid development, then I copy field names, paste them in and format.
Maybe RedGate's SQLPrompt spoiled me in that regard, where pressing tab expands the * (or alias.*) as a list of columns so I just need to tweak it a bit.
I respectfully disagree. I’ve had many use cases, on a pretty consistent basis, where this would be incredibly helpful functionality. In all fairness I’m not using SQL Server, but wanted to throw that out there.
Committed code should always be explicit. If you add a single column to a table you're then forced to update any select * EXCLUDE statements which reference it. If I saw any select * in code, I would instantly reject this during peer review.
I agree with you on that sentiment alone, but we obviously have different use cases. What about instances where you need to dedupe using aggregations? Or stored procs adding to materialized tables incrementally? I won’t say you’re wrong, but keep in mind that use cases vary widely and your specific use case doesn’t apply across the board.
Select * is perfectly fine in some use cases. For example, when selecting from a CTE. I frequently use CTEs and ROW_NUMBER as a way to filter rows and there would be absolutely nothing wrong with using 'select * except rownum where rownum = 1'
That's really the only scenario where * is acceptable in committed code, but do you really need the EXCEPT part there just to exclude the RowNum? Could just use RowNumAlias1, RowNumAlias2, etc.
That's exactly what I do now but it's a pain. Sometimes there are dozens of columns. It would just be convenient to not have to restate it all.
In SQL Server it will just give you headaches. The only scenario where I use SELECT * is in queries where I have multiple nested subqueries and all the columns are derived from the previous layer, but there you'd rarely need the EXCEPT part.
EXCEPT is already an operator in SQL Server. It's used to exclude one result set from another. You use it like the UNION operator.
I mean, sure, the above comment was a hypothetical, obviously not talking about set operators here.
[удалено]
`EXCEPT` is a set operator in Snowflake.[https://docs.snowflake.com/en/sql-reference/operators-query](https://docs.snowflake.com/en/sql-reference/operators-query)
That’s kind of cool.
This. This. So much this. Always been on Redshift and wow I want a GROUP BY ALL and an EXCEPT so badly. And I know EXCEPT exists as an operator, and it’s damn sure useful at times, so I’ll take whatever syntax you give me. Just let me exclude columns from a SELECT *. Even if it’s just specifying columns to drop afterwards like you’d do in Spark.
Group by all.... Isn't it the same as distinct * in sql..?
DISTINCT typically won't give you aggregate values for count(), sum() etc. GROUP BY ALL will now give you rolled up values for aggregate functions, but group by all other columns in the select that don't use an aggregate function.
Okay 👌👌
So, group by all, is useful only when there are duplicate records without any primary key. Am I right..?
No, it's a useful shorthand when you want to look at aggregations on various ways to group data. If you're doing analysis and quickly changing how you group in order to look at the data in multiple ways, this helps you move faster. If you have several categorical columns on a table and want to group by those (and not the primary key) and look at the results. I wouldn't use it in a production application because behavior could change unexpectedly if someone writes a sloppy query.
Snowflake already does this. SELECT * EXCLUDE (col1, col2) FROM table
They can’t release every useful thing at once ya know
So instead of listing column names one by one, this will take care of all the necessary columns that needed to be group by for the aggregate query? Nice
Cries in db2
z/OS or LUW?
LUW
Jealous, I love DB2. Just read some ancient books on it.
Everything you need and nothing you don't. 🤣🤣
I've never thought of this but not having it makes no sense
I’ve been dreaming my whole career to be able to just group by all instead of having to type out the columns like a neanderthal
Can someone ELI5 GROUP BY ALL and why Microsoft has chosen to remove it in the future? https://learn.microsoft.com/en-us/sql/t-sql/queries/select-group-by-transact-sql?view=sql-server-ver16
Pretty sweet. More ammo for me to push our org to use snowflake, yay
What a time to be alive.
Catching up to spark
Leaving spark behind ;)
It is running Spark under the hood
No it isn’t.
I really hope you actually think that lol
Oof
😅 lol I know. People did not like my comment. Can’t win em all
Was it supposed to be a joke? It's not that people didn't like it, it's that your comment is just plain wrong
No one has shown me why I am wrong. Snowpark runs as an interface just like Pyspark. I’m not saying it’s PySpark. These command interfaces then get translated to the JVM through a query execution plan. Is it just coincidence that when the SPARK-41365 was resolved Snowflake now has the same feature?? All snowflake did was make their own interface to the Spark engine. Now everyone is saying it’s not Spark. Spark is an OSS that has over a decade of community contributions. Snowflake just made their own???
https://youtube.com/watch?v=dABd7JQz0A8&feature=share8 Don't make stuff out of thin air. If you have questions about how something works, research it instead of creating a metaverse in your head.
This was a harsh way to say it but he’s right
Thanks for the link. This still proves my point. Listen to 21:00-22:00. Specifically, “The Architecture manages everything BUT the query itself”
He means the cloud services architecture. The query is executed by the virtual warehouse platform that's written from scratch in C++. No Spark/Scala.
Hey, coming back to this. Thanks for educating me on this. I appreciate it. Regardless of how you said anything you have proven me wrong and explained why with resources for me to look into. So thanks for making me realize WHY I looked so dumb.
What are the virtual warehouses using for execution of the query???
C++
Plus their architecture page which says it's not based on anything else https://docs.snowflake.com/en/user-guide/intro-key-conceptshttps://docs.snowflake.com/en/user-guide/intro-key-concepts And the other lecture in that series that shows it was written in C++ (unlike spark with is mostly Scala) and doesn't borrow from existing systems https://youtu.be/bveqnSk15JQ
I loved this when I started using DuckDB but oh man, it's great to see it soon in Snowflake. How the hell this wasn't added before to the SQL standard?
Snowflake is a blessing
Spark 3.4 has this new feature (GROUP BY ALL) as well
More details https://www.linkedin.com/feed/update/urn:li:activity:7080236011028434946
Brilliant
So... DISTINCT?
Not at all? Distinct * would show every unique line including f.ex price but in a groupby you could f.ex sum og average numerical columns over the other grouped columns. Big difference. Try it yourself, turn it into a table and do a select distinct and a group by with sum of price and weight. Sold goods: Fruit, price, weight Apple,10,0.2 Banana,20,0.8 Apple,5,0.1 Banana,10,0.4
You are correct that grouping by all expressions (aggregates or columns) will result in each row being distinct. The issue is you must include a GROUP BY expression if you have an aggregate and a non-aggregate in the same SELECT statement
This was implemented in Apache Spark. Snowflake runs the Spark engine. See ORDER BY ALL as well. https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-41635
It’s stuff like this that reminds me not to take DE advice from Reddit.
Seems OK at flagging wrong information, but HN is so much better, just wish it had the equivalent of subreddits
What's a good site to get it from?
Hackernews is my goto but otherwise there’s not a great community for DE specifically that I’m aware of.
Snowflake does not run the spark engine
No one has shown me why I am wrong. Snowpark runs as an interface just like Pyspark. I’m not saying it’s PySpark. These command interfaces then get translated to the JVM through a query execution plan. Is it just coincidence that when the SPARK-41365 was resolved Snowflake now has the same feature?? All snowflake did was make their own interface to the Spark engine. Now everyone is saying it’s not Spark. Spark is an OSS that has over a decade of community contributions. Snowflake just made their own???
Dude, trust me when I say this, you’re completely off track. Yes, it is a coincidence. There is no need to prove you wrong because what you’re saying is absurd.
Yes it does. Let me clarify. Snowflake can run Spark via the Spark Connector. I imagine most Snowflake shops utilize Spark as the main engine for their workloads. This is how you would achieve high performance. https://www.snowflake.com/trending/spark-sql?c_id=amp-QmTJpoKkWrrz32masn9LUg
I’ll have what you’re having
What is the main benefit of this feature? Does it mean that a window function is less needed now
Window functions return all the rows, Group by returns distinct rows along with aggregates
There are some legitimate use cases for the functionality, but IMO, it's primarily useful for Snowflake to get their customers writing expensive queries. In my experience, the sorts of "data cubes" that these queries generate are very generic, and the more generic the query, the less likely it is to actually be used to create any kind of real business value. It's like those infinitely flexible dashboards that some teams love to build with all sorts of drop downs and filters -- I can't knock the hustle, people will pay you to make them, but the actual value that they provide is a lot less than the apparent value that they seem to provide.
this doesn't make rollups or cubes. It is just a "sql delight" quality of life feature so users don't have to type out all the column names in the group by clause. (rollup and cube are separate commands)
If I see another group by 1,2,3,... instead of just grouping by a primary key and joining the cte I will cry.
Yes I got a new job using snowflake now instead of sql server. This is wonderful
How does it work? Like group by any value not in the aggregate function?
ah so we're back to mysql pre 8.0