How to strip HTML tags from a string in SQL Server?
HtmlSqlSql ServerStringSql Server-2005Html Problem Overview
I've got data in SQL Server 2005 that contains HTML tags and I'd like to strip all that out, leaving just the text between the tags. Ideally also replacing things like <
with <
, etc.
Is there an easy way to do this or has someone already got some sample T-SQL code?
I don't have the ability to add extended stored procs and the like, so would prefer a pure T-SQL approach (preferably one backwards compatible with SQL 2000).
I just want to retrieve the data with stripped out HTML, not update it, so ideally it would be written as a user-defined function, to make for easy reuse.
So for example converting this:
<B>Some useful text</B>
<A onclick="return openInfo(this)"
href="http://there.com/3ce984e88d0531bac5349"
target=globalhelp>
<IMG title="Source Description" height=15 alt="Source Description"
src="/ri/new_info.gif" width=15 align=top border=0>
</A>> <b>more text</b></TD></TR>
to this:
Some useful text > more text
Html Solutions
Solution 1 - Html
There is a UDF that will do that described here:
User Defined Function to Strip HTML
CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX) AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0 AND @End > 0 AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
GO
Edit: note this is for SQL Server 2005, but if you change the keyword MAX to something like 4000, it will work in SQL Server 2000 as well.
Solution 2 - Html
Derived from @Goner Doug answer, with a few things updated:
-
using REPLACE where possible
-
conversion of predefined entities like
é
(I chose the ones I needed :-) -
some conversion of list tags
<ul> and <li>
ALTER FUNCTION [dbo].[udf_StripHTML] --by Patrick Honorez --- www.idevlop.com --inspired by http://stackoverflow.com/questions/457701/best-way-to-strip-html-tags-from-a-string-in-sql-server/39253602#39253602 ( @HTMLText varchar(MAX) ) RETURNS varchar(MAX) AS BEGIN DECLARE @Start int DECLARE @End int DECLARE @Length int
set @HTMLText = replace(@htmlText, '
',CHAR(13) + CHAR(10)) set @HTMLText = replace(@htmlText, '
',CHAR(13) + CHAR(10)) set @HTMLText = replace(@htmlText, '
',CHAR(13) + CHAR(10)) set @HTMLText = replace(@htmlText, ' - ','- ') set @HTMLText = replace(@htmlText, ' ',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '’' collate Latin1_General_CS_AS, '''' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '"' collate Latin1_General_CS_AS, '"' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&' collate Latin1_General_CS_AS, '&' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '€' collate Latin1_General_CS_AS, '€' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '<' collate Latin1_General_CS_AS, '<' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '>' collate Latin1_General_CS_AS, '>' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'œ' collate Latin1_General_CS_AS, 'oe' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, ' ' collate Latin1_General_CS_AS, ' ' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '©' collate Latin1_General_CS_AS, '©' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '«' collate Latin1_General_CS_AS, '«' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '®' collate Latin1_General_CS_AS, '®' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '±' collate Latin1_General_CS_AS, '±' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '²' collate Latin1_General_CS_AS, '²' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '³' collate Latin1_General_CS_AS, '³' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'µ' collate Latin1_General_CS_AS, 'µ' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '·' collate Latin1_General_CS_AS, '·' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'º' collate Latin1_General_CS_AS, 'º' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '»' collate Latin1_General_CS_AS, '»' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '¼' collate Latin1_General_CS_AS, '¼' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '½' collate Latin1_General_CS_AS, '½' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '¾' collate Latin1_General_CS_AS, '¾' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Aelig' collate Latin1_General_CS_AS, 'Æ' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'Ç' collate Latin1_General_CS_AS, 'Ç' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'È' collate Latin1_General_CS_AS, 'È' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'É' collate Latin1_General_CS_AS, 'É' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'Ê' collate Latin1_General_CS_AS, 'Ê' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'Ö' collate Latin1_General_CS_AS, 'Ö' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'à' collate Latin1_General_CS_AS, 'à' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'â' collate Latin1_General_CS_AS, 'â' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'ä' collate Latin1_General_CS_AS, 'ä' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'æ' collate Latin1_General_CS_AS, 'æ' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'ç' collate Latin1_General_CS_AS, 'ç' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'è' collate Latin1_General_CS_AS, 'è' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'é' collate Latin1_General_CS_AS, 'é' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'ê' collate Latin1_General_CS_AS, 'ê' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'ë' collate Latin1_General_CS_AS, 'ë' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'î' collate Latin1_General_CS_AS, 'î' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'ô' collate Latin1_General_CS_AS, 'ô' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'ö' collate Latin1_General_CS_AS, 'ö' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '÷' collate Latin1_General_CS_AS, '÷' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'ø' collate Latin1_General_CS_AS, 'ø' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'ù' collate Latin1_General_CS_AS, 'ù' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'ú' collate Latin1_General_CS_AS, 'ú' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'û' collate Latin1_General_CS_AS, 'û' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, 'ü' collate Latin1_General_CS_AS, 'ü' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '"' collate Latin1_General_CS_AS, '"' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&' collate Latin1_General_CS_AS, '&' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '‹' collate Latin1_General_CS_AS, '<' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '›' collate Latin1_General_CS_AS, '>' collate Latin1_General_CS_AS)
-- Remove anything between ', @HTMLText, CHARINDEX('<', @HTMLText)) + 7 SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '') SET @Start = CHARINDEX('
-- Remove anything between
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '') SET @Start = CHARINDEX('<', @HTMLText) SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText)) SET @Length = (@End - @Start) + 1 END
RETURN LTRIM(RTRIM(@HTMLText))
END
Solution 3 - Html
Here's an updated version of this function that incorporates the RedFilter answer (Pinal's original) with the LazyCoders additions and the goodeye typo corrections AND my own addition to handle in-line <STYLE>
tags inside the HTML.
ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start int
DECLARE @End int
DECLARE @Length int
-- Replace the HTML entity & with the '&' character (this needs to be done first, as
-- '&' might be double encoded as '&amp;')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity < with the '<' character
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity > with the '>' character
SET @Start = CHARINDEX('>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
SET @Start = CHARINDEX('>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity & with the '&' character
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity with the ' ' character
SET @Start = CHARINDEX(' ', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX(' ', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END
-- Replace any <br> tags with a newline
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END
-- Replace any <br/> tags with a newline
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END
-- Replace any <br /> tags with a newline
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END
-- Remove anything between <STYLE> tags
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
END
-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
Solution 4 - Html
If your HTML is well formed, I think this is a better solution:
create function dbo.StripHTML( @text varchar(max) ) returns varchar(max) as
begin
declare @textXML xml
declare @result varchar(max)
set @textXML = REPLACE( @text, '&', '' );
with doc(contents) as
(
select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk)
)
select @result = contents.value('.', 'varchar(max)') from doc
return @result
end
go
select dbo.StripHTML('This <i>is</i> an <b>html</b> test')
Solution 5 - Html
Here is a version that doesn't require an UDF and works even if the HTML contains tags without matching closing tags.
TRY_CAST(REPLACE(REPLACE(REPLACE([HtmlCol], '>', '/> '), '</', '<'), '--/>', '-->') AS XML).value('.', 'NVARCHAR(MAX)')
Solution 6 - Html
This is not a complete new solution but a correction for afwebservant's solution:
--note comments to see the corrections
CREATE FUNCTION [dbo].[StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
--DECLARE @TempStr varchar(255) (this is not used)
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0 AND @End > 0 AND @Length > 0
BEGIN
IF (UPPER(SUBSTRING(@HTMLText, @Start, 4)) <> '<BR>') AND (UPPER(SUBSTRING(@HTMLText, @Start, 5)) <> '</BR>')
begin
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
end
-- this ELSE and SET is important
ELSE
SET @Length = 0;
-- minus @Length here below is important
SET @Start = CHARINDEX('<',@HTMLText, @End-@Length)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText, @Start))
-- instead of -1 it should be +1
SET @Length = (@End - @Start) + 1
END
RETURN RTRIM(LTRIM(@HTMLText))
END
Solution 7 - Html
How about using XQuery with a one liner:
DECLARE @MalformedXML xml, @StrippedText varchar(max)
SET @MalformedXML = @xml.query('for $x in //. return ($x)//text()')
SET @StrippedText = CAST(@MalformedXML as varchar(max))
This loops through all elements and returns the text() only.
To avoid text between elements concatenating without spaces, use:
DECLARE @MalformedXML xml, @StrippedText varchar(max)
SET @MalformedXML = @xml.query('for $x in //. return concat((($x)//text())[1]," ")')
SET @StrippedText = CAST(@MalformedXML as varchar(max))
And to respond to "How do you use this for a column:
SELECT CAST(html_column.query('for $x in //. return concat((($x)//text()) as varchar(max))
FROM table
For the above code, ensure your html_column
is of data type xml
, if not, you need to save a casted version of the html as xml
. I would do this as a separate exercise when you are loading HTML data, as SQL will throw an error if it finds malformed xml, e.g. mismatched start/end tags, invalid characters.
These are excellent for when you want to build seachh phrases, strip HTML, etc.
Just note that this returns type xml, so CAST or COVERT to text where appropriate. The xml version of this data type is useless, as it is not a well formed XML.
Solution 8 - Html
Try this. It's a modified version of the one posted by RedFilter ... this SQL removes all tags except BR, B, and P with any accompanying attributes:
CREATE FUNCTION [dbo].[StripHtml] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
DECLARE @TempStr varchar(255)
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0 AND @End > 0 AND @Length > 0
BEGIN
IF (UPPER(SUBSTRING(@HTMLText, @Start, 3)) <> '<BR') AND (UPPER(SUBSTRING(@HTMLText, @Start, 2)) <> '<P') AND (UPPER(SUBSTRING(@HTMLText, @Start, 2)) <> '<B') AND (UPPER(SUBSTRING(@HTMLText, @Start, 3)) <> '</B')
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
END
SET @Start = CHARINDEX('<',@HTMLText, @End)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText, @Start))
SET @Length = (@End - @Start) - 1
END
RETURN RTRIM(LTRIM(@HTMLText))
END
Solution 9 - Html
Try this if you don't want to use the UDF function.
SELECT COLUMN1, TRY_CONVERT(xml, COLUMN2).value('.', 'nvarchar(max)') as COL2, COLUMN3
FROM DBO.TABLENAME
Solution 10 - Html
While Arvin Amir's answer comes close to a full one-line solution you can drop in anywhere; he's got a slight bug in his select statement (missing the end of the line), and I wanted to handle the most common character references.
What I ended up doing was this:
SELECT replace(replace(replace(CAST(CAST(replace([columnNameHere], '&', '&') as xml).query('for $x in //. return concat((($x)//text())[1]," ")') as varchar(max)), '&', '&'), ' ', ' '), ' ', ' ')
FROM [tableName]
Without the character reference code it can be simplified to this:
SELECT CAST(CAST([columnNameHere] as xml).query('for $x in //. return concat((($x)//text())[1]," ")') as varchar(max))
FROM [tableName]
Solution 11 - Html
Patrick Honorez code needs a slight change.
It returns incomplete results for html that contains <
or >
This is because the code below the section
> -- Remove anything between
will in fact replace the < > to nothing. The fix is to the apply the below two lines at the end:
set @HTMLText = replace(@htmlText, '<' collate Latin1_General_CS_AS, '<' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '>' collate Latin1_General_CS_AS, '>' collate Latin1_General_CS_AS)