Isaac Sloan - How to safely remove Byte Order Marks (BOM) from files created on Windows

How to safely remove Byte Order Marks (BOM) from files created on Windows

The problem

If you read enough files created on a Windows box, you'll eventually run into this: "\xEF\xBB\xBFyour expected string" instead of "your expected string". I recently ran into it again while parsing some CSV files exported on Windows. To parse the string, you'll need to remove what is called a Byte Order Mark (BOM), which for UTF-8 is the three bytes EF BB BF at the very start of the file.

Solution

Just search for "\xEF\xBB\xBF" and remove it.

file = File.read("filename.txt").sub("\xEF\xBB\xBF", '')

That will usually work, but if the string encoding is ASCII-8BIT it will throw the error Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string). If you force it to be UTF-8 first, you shouldn't have any issues.

file = File.read("filename.txt").force_encoding('utf-8').encode.sub("\xEF\xBB\xBF", '')
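If you'd rather not depend on what encoding the string happens to carry, here is a minimal sketch of two alternatives. The names UTF8_BOM, strip_utf8_bom and read_without_bom are hypothetical helpers made up for illustration, not part of Ruby. The first compares raw bytes, so it should never hit the compatibility error and only touches a BOM at the very start of the string. The second leans on Ruby's built-in "BOM|UTF-8" read encoding, which drops a leading BOM while reading.

```ruby
# Hypothetical constant: the UTF-8 BOM as a binary (ASCII-8BIT) string.
UTF8_BOM = "\xEF\xBB\xBF".b.freeze

# Hypothetical helper: strip a leading UTF-8 BOM by comparing raw bytes,
# so the encoding label on the input string does not matter.
def strip_utf8_bom(text)
  bytes = text.b # binary copy; the original string is left untouched
  return text unless bytes.start_with?(UTF8_BOM)

  # Drop the three BOM bytes and restore the caller's encoding label.
  bytes.byteslice(3..-1).force_encoding(text.encoding)
end

# Hypothetical helper: let Ruby skip the BOM while reading the file.
# "BOM|UTF-8" is a built-in read encoding; the path is supplied by the caller.
def read_without_bom(path)
  File.read(path, encoding: "BOM|UTF-8")
end
```

Something like file = read_without_bom("filename.txt") would replace the one-liners above. Unlike sub, neither helper removes the same three bytes if they show up later in the file.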
October 15, 2017
ruby, utf-8, ASCII-8Bit