Find non-ASCII characters in varchar columns using SQL Server

Sql ServerTsqlSql Server-2005Non Ascii-Characters

Sql Server Problem Overview


How can rows with non-ASCII characters be returned using SQL Server?
If you can show how to do it for one column would be great.

I am doing something like this now, but it is not working

select *
from Staging.APARMRE1 as ar
where ar.Line like '%[^!-~ ]%'

For extra credit, if it can span all varchar columns in a table, that would be outstanding! In this solution, it would be nice to return three columns:

  • The identity field for that record. (This will allow the whole record to be reviewed with another query.)
  • The column name
  • The text with the invalid character

 Id | FieldName | InvalidText       |
----+-----------+-------------------+
 25 | LastName  | Solís             |
 56 | FirstName | François          |
100 | Address1  | 123 Ümlaut street |

Invalid characters would be any outside the range of SPACE (3210) through ~ (12710)

Sql Server Solutions


Solution 1 - Sql Server

Here is a solution for the single column search using PATINDEX.
It also displays the StartPosition, InvalidCharacter and ASCII code.

select line,
  patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line) as [Position],
  substring(line,patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line),1) as [InvalidCharacter],
  ascii(substring(line,patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line),1)) as [ASCIICode]
from  staging.APARMRE1
where patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line) >0

Solution 2 - Sql Server

try something like this:

DECLARE @YourTable table (PK int, col1 varchar(20), col2 varchar(20), col3 varchar(20));
INSERT @YourTable VALUES (1, 'ok','ok','ok');
INSERT @YourTable VALUES (2, 'BA'+char(182)+'D','ok','ok');
INSERT @YourTable VALUES (3, 'ok',char(182)+'BAD','ok');
INSERT @YourTable VALUES (4, 'ok','ok','B'+char(182)+'AD');
INSERT @YourTable VALUES (5, char(182)+'BAD','ok',char(182)+'BAD');
INSERT @YourTable VALUES (6, 'BAD'+char(182),'B'+char(182)+'AD','BAD'+char(182)+char(182)+char(182));

--if you have a Numbers table use that, other wise make one using a CTE
WITH AllNumbers AS
(   SELECT 1 AS Number
    UNION ALL
    SELECT Number+1
        FROM AllNumbers
        WHERE Number<1000
)
SELECT 
    pk, 'Col1' BadValueColumn, CONVERT(varchar(20),col1) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
    FROM @YourTable           y
        INNER JOIN AllNumbers n ON n.Number <= LEN(y.col1)
    WHERE ASCII(SUBSTRING(y.col1, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col1, n.Number, 1))>127
UNION
SELECT 
    pk, 'Col2' BadValueColumn, CONVERT(varchar(20),col2) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
    FROM @YourTable           y
        INNER JOIN AllNumbers n ON n.Number <= LEN(y.col2)
    WHERE ASCII(SUBSTRING(y.col2, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col2, n.Number, 1))>127
UNION
SELECT 
    pk, 'Col3' BadValueColumn, CONVERT(varchar(20),col3) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
    FROM @YourTable           y
        INNER JOIN AllNumbers n ON n.Number <= LEN(y.col3)
    WHERE ASCII(SUBSTRING(y.col3, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col3, n.Number, 1))>127
order by 1
OPTION (MAXRECURSION 1000);

OUTPUT:

pk          BadValueColumn BadValue
----------- -------------- --------------------
2           Col1           BA¶D
3           Col2           ¶BAD
4           Col3           B¶AD
5           Col1           ¶BAD
5           Col3           ¶BAD
6           Col1           BAD¶
6           Col2           B¶AD
6           Col3           BAD¶¶¶

(8 row(s) affected)

Solution 3 - Sql Server

I've been running this bit of code with success

declare @UnicodeData table (
     data nvarchar(500)
)
insert into 
	@UnicodeData
values 
	(N'Horse�')
	,(N'Dog')
	,(N'Cat')

select
	data
from
	@UnicodeData 
where
	data collate LATIN1_GENERAL_BIN != cast(data as varchar(max))
    

Which works well for known columns.

For extra credit, I wrote this quick script to search all nvarchar columns in a given table for Unicode characters.

declare 
	@sql	varchar(max)	= ''
	,@table	sysname			= 'mytable' -- enter your table here

;with ColumnData as (
	select
		RowId				= row_number() over (order by c.COLUMN_NAME)
		,c.COLUMN_NAME
		,ColumnName			= '[' + c.COLUMN_NAME + ']'
		,TableName			= '[' + c.TABLE_SCHEMA + '].[' + c.TABLE_NAME + ']'	
	from
		INFORMATION_SCHEMA.COLUMNS c
	where
		c.DATA_TYPE			= 'nvarchar'
		and c.TABLE_NAME	= @table
)
select
	@sql = @sql + 'select FieldName = ''' + c.ColumnName + ''',         InvalidCharacter = [' + c.COLUMN_NAME + ']  from ' + c.TableName + ' where ' + c.ColumnName + ' collate LATIN1_GENERAL_BIN != cast(' + c.ColumnName + ' as varchar(max)) '  +  case when c.RowId <> (select max(RowId) from ColumnData) then  ' union all ' else '' end + char(13)
from
	ColumnData c

-- check
-- print @sql
exec (@sql)

I'm not a fan of dynamic SQL but it does have its uses for exploratory queries like this.

Solution 4 - Sql Server

This script searches for non-ascii characters in one column. It generates a string of all valid characters, here code point 32 to 127. Then it searches for rows that don't match the list:

declare @str varchar(128);
declare @i int;
set @str = '';
set @i = 32;
while @i <= 127
	begin
	set @str = @str + '|' + char(@i);
	set @i = @i + 1;
	end;

select  col1
from    YourTable
where   col1 like '%[^' + @str + ']%' escape '|';

Solution 5 - Sql Server

running the various solutions on some real world data - 12M rows varchar length ~30, around 9k dodgy rows, no full text index in play, the patIndex solution is the fastest, and it also selects the most rows.

(pre-ran km. to set the cache to a known state, ran the 3 processes, and finally ran km again - the last 2 runs of km gave times within 2 seconds)

patindex solution by Gerhard Weiss -- Runtime 0:38, returns 9144 rows

select dodgyColumn from myTable fcc
WHERE  patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,dodgyColumn ) >0

the substring-numbers solution by MT. -- Runtime 1:16, returned 8996 rows

select dodgyColumn from myTable fcc
INNER JOIN dbo.Numbers32k dn ON dn.number<(len(fcc.dodgyColumn ))
WHERE ASCII(SUBSTRING(fcc.dodgyColumn , dn.Number, 1))<32 
    OR ASCII(SUBSTRING(fcc.dodgyColumn , dn.Number, 1))>127

udf solution by Deon Robertson -- Runtime 3:47, returns 7316 rows

select dodgyColumn 
from myTable 
where dbo.udf_test_ContainsNonASCIIChars(dodgyColumn , 1) = 1

Solution 6 - Sql Server

There is a user defined function available on the web 'Parse Alphanumeric'. Google UDF parse alphanumeric and you should find the code for it. This user defined function removes all characters that doesn't fit between 0-9, a-z, and A-Z.

Select * from Staging.APARMRE1 ar
where udf_parsealpha(ar.last_name) <> ar.last_name

That should bring back any records that have a last_name with invalid chars for you...though your bonus points question is a bit more of a challenge, but I think a case statement could handle it. This is a bit psuedo code, I'm not entirely sure if it'd work.

Select id, case when udf_parsealpha(ar.last_name) <> ar.last_name then 'last name'
when udf_parsealpha(ar.first_name) <> ar.first_name then 'first name'
when udf_parsealpha(ar.Address1) <> ar.last_name then 'Address1'
end, 
case when udf_parsealpha(ar.last_name) <> ar.last_name then ar.last_name
when udf_parsealpha(ar.first_name) <> ar.first_name then ar.first_name
when udf_parsealpha(ar.Address1) <> ar.last_name then ar.Address1
end
from Staging.APARMRE1 ar
where udf_parsealpha(ar.last_name) <> ar.last_name or
udf_parsealpha(ar.first_name) <> ar.first_name or
udf_parsealpha(ar.Address1) <> ar.last_name 

I wrote this in the forum post box...so I'm not quite sure if that'll function as is, but it should be close. I'm not quite sure how it will behave if a single record has two fields with invalid chars either.

As an alternative, you should be able to change the from clause away from a single table and into a subquery that looks something like:

select id,fieldname,value from (
Select id,'last_name' as 'fieldname', last_name as 'value'
from Staging.APARMRE1 ar
Union
Select id,'first_name' as 'fieldname', first_name as 'value'
from Staging.APARMRE1 ar
---(and repeat unions for each field)
)
where udf_parsealpha(value) <> value

Benefit here is for every column you'll only need to extend the union statement here, while you need to put that comparisson three times for every column in the case statement version of this script

Solution 7 - Sql Server

To find which field has invalid characters:

SELECT * FROM Staging.APARMRE1 FOR XML AUTO, TYPE

You can test it with this query:

SELECT top 1 'char 31: '+char(31)+' (hex 0x1F)' field
from sysobjects
FOR XML AUTO, TYPE

The result will be:

> Msg 6841, Level 16, State 1, Line 3 FOR XML could not serialize the > data for node 'field' because it contains a character (0x001F) which > is not allowed in XML. To retrieve this data using FOR XML, convert it > to binary, varbinary or image data type and use the BINARY BASE64 > directive.

It is very useful when you write xml files and get error of invalid characters when validate it.

Solution 8 - Sql Server

Here is a UDF I built to detectc columns with extended ascii charaters. It is quick and you can extended the character set you want to check. The second parameter allows you to switch between checking anything outside the standard character set or allowing an extended set:

create function [dbo].[udf_ContainsNonASCIIChars]
(
@string nvarchar(4000),
@checkExtendedCharset bit
)
returns bit
as
begin

	declare @pos int = 0;
	declare @char varchar(1);
	declare @return bit = 0;
	
	while @pos < len(@string)
	begin
		select @char = substring(@string, @pos, 1)
		if ascii(@char) < 32 or ascii(@char) > 126 
			begin
				if @checkExtendedCharset = 1
					begin
						if ascii(@char) not in (9,124,130,138,142,146,150,154,158,160,170,176,180,181,183,184,185,186,192,193,194,195,196,197,199,200,201,202,203,204,205,206,207,209,210,211,212,213,214,216,217,218,219,220,221,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,248,249,250,251,252,253,254,255)
							begin
								select @return = 1;
								select @pos = (len(@string) + 1)
							end
						else
							begin
								select @pos = @pos + 1
							end
					end
				else
					begin
						select @return = 1;
						select @pos = (len(@string) + 1)	
					end
			end
		else
			begin
				select @pos = @pos + 1
			end
	end
	
	return @return;
	
end

USAGE:

select Address1 
from PropertyFile_English
where udf_ContainsNonASCIIChars(Address1, 1) = 1

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionGerhard WeissView Question on Stackoverflow
Solution 1 - Sql ServerGerhard WeissView Answer on Stackoverflow
Solution 2 - Sql ServerKM.View Answer on Stackoverflow
Solution 3 - Sql ServerVashView Answer on Stackoverflow
Solution 4 - Sql ServerAndomarView Answer on Stackoverflow
Solution 5 - Sql ServerAndrew HillView Answer on Stackoverflow
Solution 6 - Sql ServerTwelfthView Answer on Stackoverflow
Solution 7 - Sql Serverlynx_74View Answer on Stackoverflow
Solution 8 - Sql ServerDeon RobertsonView Answer on Stackoverflow