[SQL SERVER]String Splitting效能比較

比較使用TSQL那些方法效能可最接近CLR

撰寫TSQL達到string splitting並不難,以前我使用CTE來處理但不太關心效能,但如你是效能至上的話,透過SQL CLR在任何案例中,效能都有不錯表現,

這篇我要測試使用TSQL還有那些方法可以達到string splitting且效能接近CLR。

CTE

CREATE FUNCTION dbo.StrSplit_CTE

(

   @inputcontent nvarchar(MAX),

   @Delimiter  nvarchar(255)

)

RETURNS @Items TABLE (Item NVARCHAR(4000))

WITH SCHEMABINDING

AS

BEGIN

   DECLARE @ll INT = LEN(@inputcontent) + 1, @ld INT = LEN(@Delimiter);

 

   WITH a AS

   (

       SELECT

           [start] = 1,

           [end]   = COALESCE(NULLIF(CHARINDEX(@Delimiter,

                       @inputcontent, 1), 0), @ll),

           [value] = SUBSTRING(@inputcontent, 1,

                     COALESCE(NULLIF(CHARINDEX(@Delimiter,

                       @inputcontent, 1), 0), @ll) - 1)

       UNION ALL

       SELECT

           [start] = CONVERT(INT, [end]) + @ld,

           [end]   = COALESCE(NULLIF(CHARINDEX(@Delimiter,

                       @inputcontent, [end] + @ld), 0), @ll),

           [value] = SUBSTRING(@inputcontent, [end] + @ld,

                     COALESCE(NULLIF(CHARINDEX(@Delimiter,

                       @inputcontent, [end] + @ld), 0), @ll)-[end]-@ld)

       FROM a

       WHERE [end] < @ll

   )

   INSERT @Items SELECT [value]

   FROM a

   WHERE LEN([value]) > 0

   OPTION (MAXRECURSION 0);

 

   RETURN;

END

GO

透過遞迴CTE取得每個部份字串,但要注意MVTF無法使用統計值(因為需要本地變數資料表),而且要設定最大遞迴數0,以免預設100不夠用而出現錯誤。

 

Test

declare @inputcontent nvarchar(max)=N'Patriots,Red Sox,Bruins,abcd,rico,sa,ricoisme,bar,mort,splunge,bacon'

select * from dbo.StrSplit_CTE(@inputcontent,',')

select * from dbo.StrSplit_CTE(REPLICATE(@inputcontent,1000),',')

 

XML

create FUNCTION dbo.StrSplit_XML

(

   @inputcontent       nvarchar(MAX),

   @Delimiter  nvarchar(255)

)

RETURNS TABLE

WITH SCHEMABINDING

AS

   RETURN

   ( 

      SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')

      FROM

      (

        SELECT x = CONVERT(XML, '<i>'

          + REPLACE(@inputcontent, @Delimiter, '</i><i>')

          + '</i>').query('.')

      ) AS a CROSS APPLY x.nodes('i') AS y(i)

   );

GO

如果可以確保輸入的XML資料正確的話,那麼這方法我還滿推薦的,且並不像CTE屬於MTVF(all inline)

 

TEST

declare @inputcontent nvarchar(max)=N'Patriots,Red Sox,Bruins,abcd,rico,sa,ricoisme,bar,mort,splunge,bacon'

select * from dbo.StrSplit_XML(@inputcontent,',')

select * from dbo.StrSplit_XML(REPLICATE(@inputcontent,1000),',')

 

Numbers table

DECLARE @UpperLimit INT = 100000;

 

WITH n AS

(

    SELECT

        x = ROW_NUMBER() OVER (ORDER BY (select null))

    FROM       sys.all_objects AS s1

    CROSS JOIN sys.all_objects AS s2

    CROSS JOIN sys.all_objects AS s3

)

SELECT Number = x

  INTO dbo.testNumbers

  FROM n

  WHERE x BETWEEN 1 AND @UpperLimit;

 

 

CREATE UNIQUE CLUSTERED INDEX n ON dbo.testNumbers(Number)

    WITH (DATA_COMPRESSION = PAGE);--使用頁面壓縮減少I/O

GO

 

CREATE FUNCTION dbo.StrSplit_Numbers

(

   @inputcontent   nvarchar(MAX),

   @Delimiter  NVARCHAR(255)

)

RETURNS TABLE

WITH SCHEMABINDING

AS

   RETURN

   (

       SELECT Item = SUBSTRING(@inputcontent, Number,

         CHARINDEX(@Delimiter, @inputcontent + @Delimiter, Number) - Number)

       FROM dbo.testNumbers

       WHERE Number <= CONVERT(INT, LEN(@inputcontent))

         AND SUBSTRING(@Delimiter + @inputcontent, Number, LEN(@Delimiter)) = @Delimiter

   );

GO

該方法須先建立數字位置資料表表來存放字串起始位置,而且還要確保該資料表總筆數>=字串最大長度,這方法有一點麻煩。

 

TEST

declare @inputcontent nvarchar(max)=N'Patriots,Red Sox,Bruins,abcd,rico,sa,ricoisme,bar,mort,splunge,bacon'

select * from dbo.StrSplit_Numbers(@inputcontent,',')

select * from dbo.StrSplit_Numbers(REPLICATE(@inputcontent,1000),',')

 

Jeff Moden

利用Tally Table分配每個字串的起始字元位置和長度(類似Numbers table

概念),我覺得很Cool

create FUNCTION dbo.StrSplit_Moden

(

  @inputcontent nvarchar(max),

  @Delimiter nvarchar(255)

)

RETURNS TABLE

WITH SCHEMABINDING AS

RETURN

WITH E1(N)        AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

                         UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

                         UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1),

       E2(N)        AS (SELECT 1 FROM E1 a, E1 b),

       E4(N)        AS (SELECT 1 FROM E2 a, E2 b),

       E42(N)       AS (SELECT 1 FROM E4 a, E2 b),

       cteTally(N)  AS (SELECT 0 UNION ALL SELECT TOP (DATALENGTH(ISNULL(@inputcontent,1)))

                         ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E42),

       cteStart(N1) AS (SELECT t.N+1 FROM cteTally t

                         WHERE (SUBSTRING(@inputcontent,t.N,1) = @Delimiter OR t.N = 0))

  SELECT Item = SUBSTRING(@inputcontent, s.N1, ISNULL(NULLIF(CHARINDEX(@Delimiter,@inputcontent,s.N1),0)-s.N1,8000))

  FROM cteStart s;

 

Test

declare @inputcontent nvarchar(max)=N'Patriots,Red Sox,Bruins,abcd,rico,sa,ricoisme,bar,mort,splunge,bacon'

select * from dbo.StrSplit_Moden(@inputcontent,',')

select * from dbo.StrSplit_Moden(REPLICATE(@inputcontent,1000),',')

 

結論:

方法

10001 rows (長字串)

11 rows(短字串)

CTE

CPU 時間 = 812 ms,經過時間 = 1330 ms。

掃描計數 1,邏輯讀取 31,讀取前讀取 6。

CPU 時間 = 15 ms,經過時間 = 16 ms。

掃描計數 1,邏輯讀取 1

XML

CPU 時間 = 141 ms,經過時間 = 352 ms。

掃描計數 0,邏輯讀取 27,讀取前讀取 0,LOB 邏輯讀取 1461。

CPU 時間 = 0 ms,經過時間 = 1 ms。

Numbers table

CPU 時間 = 156 ms,經過時間 = 296 ms。

掃描計數 1,邏輯讀取 95,實體讀取 0,讀取前讀取 85。

CPU 時間 = 0 ms,經過時間 = 1 ms。

掃描計數 1,邏輯讀取 3,實體讀取 1,讀取前讀取 8。

Jeff Moden

 

CPU 時間 = 985 ms,經過時間 = 1161 ms。

CPU 時間 = 0 ms,經過時間 = 0 ms。

String splitting透過CLR效能還是最佳,如果嫌維護CLR麻煩的話,那麼XML是第二選擇,因為Numbers table和Jeff Moden需額外考慮字串最大長度,這一點我個人覺得彈性不是那麼好。

 

參考

[SQL SERVER][Performance]資料表值函數(2)

SQLCLR String Splitting Part 2: Even Faster, Even More Scalable

Tally OH! An Improved SQL 8K “CSV Splitter” Function

Splitting Strings : Now with less T-SQL

Split strings the right way – or the next best way