Optimal way to concatenate/aggregate strings

SqlSql ServerAzureAggregate FunctionsAzure Sql-Database

Sql Problem Overview


I'm finding a way to aggregate strings from different rows into a single row. I'm looking to do this in many different places, so having a function to facilitate this would be nice. I've tried solutions using COALESCE and FOR XML, but they just don't cut it for me.

String aggregation would do something like this:

id | Name                    Result: id | Names
-- - ----                            -- - -----
1  | Matt                            1  | Matt, Rocks
1  | Rocks                           2  | Stylus
2  | Stylus

I've taken a look at CLR-defined aggregate functions as a replacement for COALESCE and FOR XML, but apparently SQL Azure does not support CLR-defined stuff, which is a pain for me because I know being able to use it would solve a whole lot of problems for me.

Is there any possible workaround, or similarly optimal method (which might not be as optimal as CLR, but hey I'll take what I can get) that I can use to aggregate my stuff?

Sql Solutions


Solution 1 - Sql

SOLUTION

The definition of optimal can vary, but here's how to concatenate strings from different rows using regular Transact SQL, which should work fine in Azure.

;WITH Partitioned AS
(
	SELECT 
		ID,
		Name,
		ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Name) AS NameNumber,
		COUNT(*) OVER (PARTITION BY ID) AS NameCount
	FROM dbo.SourceTable
),
Concatenated AS
(
	SELECT 
		ID, 
		CAST(Name AS nvarchar) AS FullName, 
		Name, 
		NameNumber, 
		NameCount 
	FROM Partitioned 
	WHERE NameNumber = 1

	UNION ALL

	SELECT 
		P.ID, 
		CAST(C.FullName + ', ' + P.Name AS nvarchar), 
		P.Name, 
		P.NameNumber, 
		P.NameCount
	FROM Partitioned AS P
		INNER JOIN Concatenated AS C 
				ON P.ID = C.ID 
				AND P.NameNumber = C.NameNumber + 1
)
SELECT 
	ID,
	FullName
FROM Concatenated
WHERE NameNumber = NameCount

EXPLANATION

The approach boils down to three steps:

  1. Number the rows using OVER and PARTITION grouping and ordering them as needed for the concatenation. The result is Partitioned CTE. We keep counts of rows in each partition to filter the results later.

  2. Using recursive CTE (Concatenated) iterate through the row numbers (NameNumber column) adding Name values to FullName column.

  3. Filter out all results but the ones with the highest NameNumber.

Please keep in mind that in order to make this query predictable one has to define both grouping (for example, in your scenario rows with the same ID are concatenated) and sorting (I assumed that you simply sort the string alphabetically before concatenation).

I've quickly tested the solution on SQL Server 2012 with the following data:

INSERT dbo.SourceTable (ID, Name)
VALUES 
(1, 'Matt'),
(1, 'Rocks'),
(2, 'Stylus'),
(3, 'Foo'),
(3, 'Bar'),
(3, 'Baz')

The query result:

ID          FullName
----------- ------------------------------
2           Stylus
3           Bar, Baz, Foo
1           Matt, Rocks

Solution 2 - Sql

Are methods using FOR XML PATH like below really that slow? Itzik Ben-Gan writes that this method has good performance in his T-SQL Querying book (Mr. Ben-Gan is a trustworthy source, in my view).

create table #t (id int, name varchar(20))

insert into #t
values (1, 'Matt'), (1, 'Rocks'), (2, 'Stylus')

select	id
  		,Names = stuff((select ', ' + name as [text()]
		from #t xt
		where xt.id = t.id
		for xml path('')), 1, 2, '')
from #t t
group by id

Solution 3 - Sql

STRING_AGG() in SQL Server 2017, Azure SQL, and PostgreSQL: https://www.postgresql.org/docs/current/static/functions-aggregate.html
https://docs.microsoft.com/en-us/sql/t-sql/functions/string-agg-transact-sql

GROUP_CONCAT() in MySQL
http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_group-concat

(Thanks to @Brianjorden and @milanio for Azure update)

Example Code:
select Id
, STRING_AGG(Name, ', ') Names 
from Demo
group by Id

SQL Fiddle: http://sqlfiddle.com/#!18/89251/1

Solution 4 - Sql

Although @serge answer is correct but i compared time consumption of his way against xmlpath and i found the xmlpath is so faster. I'll write the compare code and you can check it by yourself. This is @serge way:

DECLARE @startTime datetime2;
DECLARE @endTime datetime2;
DECLARE @counter INT;
SET @counter = 1;

set nocount on;

declare @YourTable table (ID int, Name nvarchar(50))

WHILE @counter < 1000
BEGIN
	insert into @YourTable VALUES (ROUND(@counter/10,0), CONVERT(NVARCHAR(50), @counter) + 'CC')
	SET @counter = @counter + 1;
END

SET @startTime = GETDATE()

;WITH Partitioned AS
(
    SELECT 
        ID,
        Name,
        ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Name) AS NameNumber,
        COUNT(*) OVER (PARTITION BY ID) AS NameCount
    FROM @YourTable
),
Concatenated AS
(
    SELECT ID, CAST(Name AS nvarchar) AS FullName, Name, NameNumber, NameCount FROM Partitioned WHERE NameNumber = 1

    UNION ALL

    SELECT 
        P.ID, CAST(C.FullName + ', ' + P.Name AS nvarchar), P.Name, P.NameNumber, P.NameCount
    FROM Partitioned AS P
        INNER JOIN Concatenated AS C ON P.ID = C.ID AND P.NameNumber = C.NameNumber + 1
)
SELECT 
    ID,
    FullName
FROM Concatenated
WHERE NameNumber = NameCount

SET @endTime = GETDATE();

SELECT DATEDIFF(millisecond,@startTime, @endTime)
--Take about 54 milliseconds

And this is xmlpath way:

DECLARE @startTime datetime2;
DECLARE @endTime datetime2;
DECLARE @counter INT;
SET @counter = 1;

set nocount on;

declare @YourTable table (RowID int, HeaderValue int, ChildValue varchar(5))

WHILE @counter < 1000
BEGIN
	insert into @YourTable VALUES (@counter, ROUND(@counter/10,0), CONVERT(NVARCHAR(50), @counter) + 'CC')
	SET @counter = @counter + 1;
END

SET @startTime = GETDATE();

set nocount off
SELECT
    t1.HeaderValue
        ,STUFF(
                   (SELECT
                        ', ' + t2.ChildValue
                        FROM @YourTable t2
                        WHERE t1.HeaderValue=t2.HeaderValue
                        ORDER BY t2.ChildValue
                        FOR XML PATH(''), TYPE
                   ).value('.','varchar(max)')
                   ,1,2, ''
              ) AS ChildValues
    FROM @YourTable t1
    GROUP BY t1.HeaderValue
    
SET @endTime = GETDATE();

SELECT DATEDIFF(millisecond,@startTime, @endTime)
--Take about 4 milliseconds

Solution 5 - Sql

Update: Ms SQL Server 2017+, Azure SQL Database

You can use: STRING_AGG.

Usage is pretty simple for OP's request:

SELECT id, STRING_AGG(name, ', ') AS names
FROM some_table
GROUP BY id

Read More

Well my old non-answer got rightfully deleted (left in-tact below), but if anyone happens to land here in the future, there is good news. They have implimented STRING_AGG() in Azure SQL Database as well. That should provide the exact functionality originally requested in this post with native and built in support. @hrobky mentioned this previously as a SQL Server 2016 feature at the time.

--- Old Post: Not enough reputation here to reply to @hrobky directly, but STRING_AGG looks great, however it is only available in SQL Server 2016 vNext currently. Hopefully it will follow to Azure SQL Datababse soon as well..

Solution 6 - Sql

You can use += to concatenate strings, for example:

declare @test nvarchar(max)
set @test = ''
select @test += name from names

if you select @test, it will give you all names concatenated

Solution 7 - Sql

I found Serge's answer to be very promising, but I also encountered performance issues with it as-written. However, when I restructured it to use temporary tables and not include double CTE tables, the performance went from 1 minute 40 seconds to sub-second for 1000 combined records. Here it is for anyone who needs to do this without FOR XML on older versions of SQL Server:

DECLARE @STRUCTURED_VALUES TABLE (
	 ID					INT
	,VALUE				VARCHAR(MAX) NULL
	,VALUENUMBER		BIGINT
	,VALUECOUNT			INT
);

INSERT INTO @STRUCTURED_VALUES
SELECT	 ID
		,VALUE
		,ROW_NUMBER() OVER (PARTITION BY ID ORDER BY VALUE)	AS VALUENUMBER
		,COUNT(*) OVER (PARTITION BY ID)	AS VALUECOUNT
FROM	RAW_VALUES_TABLE;

WITH CTE AS (
	SELECT	 SV.ID
			,SV.VALUE
			,SV.VALUENUMBER
			,SV.VALUECOUNT
	FROM	@STRUCTURED_VALUES SV
	WHERE	VALUENUMBER = 1

	UNION ALL

	SELECT	 SV.ID
			,CTE.VALUE + ' ' + SV.VALUE AS VALUE
			,SV.VALUENUMBER
			,SV.VALUECOUNT
	FROM	@STRUCTURED_VALUES SV
	JOIN	CTE 
		ON	SV.ID = CTE.ID
		AND	SV.VALUENUMBER = CTE.VALUENUMBER + 1

)
SELECT	 ID
		,VALUE
FROM	CTE
WHERE	VALUENUMBER = VALUECOUNT
ORDER BY ID
;

Solution 8 - Sql

Try this, i use it in my projects

DECLARE @MetricsList NVARCHAR(MAX);

SELECT @MetricsList = COALESCE(@MetricsList + '|', '') + QMetricName
FROM #Questions;

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionmattView Question on Stackoverflow
Solution 1 - SqlSerge BelovView Answer on Stackoverflow
Solution 2 - SqlslachtermanView Answer on Stackoverflow
Solution 3 - SqlHrobkyView Answer on Stackoverflow
Solution 4 - SqlQMasterView Answer on Stackoverflow
Solution 5 - SqlBrian JordenView Answer on Stackoverflow
Solution 6 - SqljvcView Answer on Stackoverflow
Solution 7 - SqlTom HalladayView Answer on Stackoverflow
Solution 8 - SqlVladyslav FominView Answer on Stackoverflow