Batch convert encoding in files

How can I batch-convert the encoding of all files in a directory (e.g. ANSI → UTF-8) with a command or tool?

Asked by: Guest | Views: 261
Total answers/comments: 5
Guest [Entry]

Cygwin or GnuWin32 provide Unix tools like iconv and dos2unix (and unix2dos). Under Unix/Linux/Cygwin, you'll want to use "windows-1252" as the encoding instead of ANSI (see below). (If you know your system uses a codepage other than 1252 as its default, you'll need to tell iconv that codepage to translate from instead.)
Convert from one encoding (-f) to the other (-t) with:
$ iconv -f windows-1252 -t utf-8 infile > outfile

Or in a find-all-and-conquer form:
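## this will clobber the original files!
## the redirection must run inside a shell: an escaped \> only hands iconv a literal ">" argument,
## and redirecting straight onto the input file would truncate it before it is read
$ find . -name '*.txt' -exec sh -c 'iconv --verbose -f windows-1252 -t utf-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' sh {} \;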

Alternatively:
## this will clobber the original files!
## (writing -o onto the input file is not reliable with every iconv build; try it on copies first)
$ find . -name '*.txt' -exec iconv --verbose -f windows-1252 -t utf-8 -o {} {} \;
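Both find commands above replace files in place. A more cautious variant, sketched here under the assumption of a POSIX shell with mktemp available, writes to a temporary file first and replaces the original only if iconv succeeds. The function name convert_in_place and its default encodings are made up for illustration.

```shell
#!/bin/sh
# convert_in_place FILE [FROM] [TO]
# Hypothetical helper: convert FILE through a temporary file in the same
# directory and replace the original only when iconv exits successfully.
# Note: the replaced file gets the temp file's permissions (mktemp uses 0600).
convert_in_place() {
    file=$1
    from=${2:-windows-1252}   # assumed default source encoding
    to=${3:-utf-8}            # assumed default target encoding
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if iconv -f "$from" -t "$to" "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"          # leave the original untouched on failure
        return 1
    fi
}

# Possible usage (commented out so that sourcing this file changes nothing):
# find . -name '*.txt' | while IFS= read -r f; do convert_in_place "$f"; done
```

If a file contains bytes that are invalid in the source encoding, iconv fails and the original is kept as it was.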

This question has been asked many times on this site, so here's some additional information about "ANSI". In an answer to a related question, CesarB mentions:

There are several encodings which are called "ANSI" in Windows. In fact, ANSI is a misnomer. iconv has no way of guessing which you want.
The ANSI encoding is the encoding used by the "A" functions in the Windows API (the "W" functions use UTF-16). Which encoding it corresponds to usually depends on your Windows system language. The most common is CP 1252 (also known as Windows-1252). So, when your editor says ANSI, it is meaning "whatever the API functions use as the default ANSI encoding", which is the default non-Unicode encoding used in your system (and thus usually the one which is used for text files).

The page he links to gives this historical tidbit (quoted from a Microsoft PDF) on the origins of CP 1252 and ISO-8859-1, another oft-used encoding:

[...] this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft, which became ISO Standard 8859-1. However, in adding code points to the range reserved for control codes in the ISO standard, the Windows code page 1252 and subsequent Windows code pages originally based on the ISO 8859-x series deviated from ISO. To this day, it is not uncommon to have the development community, both within and outside of Microsoft, confuse the 8859-1 code page with Windows 1252, as well as see "ANSI" or "A" used to signify Windows code page support.
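The deviation described in the quote can be seen on a single byte. The following is a small sketch, assuming an iconv that knows both encoding names: it converts byte 0x80 from each encoding and prints the resulting UTF-8 bytes in hex. hex_of_0x80 is a hypothetical helper name.

```shell
#!/bin/sh
# Print, in hex, the UTF-8 bytes that the single byte 0x80 (octal 200)
# turns into when it is interpreted in the given source encoding.
hex_of_0x80() {
    printf '\200' | iconv -f "$1" -t utf-8 | od -An -tx1 | tr -d ' \n'
}

hex_of_0x80 windows-1252; echo   # e282ac -> U+20AC EURO SIGN
hex_of_0x80 iso-8859-1;   echo   # c280   -> U+0080, a C1 control character
```

Picking the wrong one of the two as the source encoding therefore does not raise an error. It silently turns characters such as the euro sign into control characters.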
Guest [Entry]

The Wikipedia page on newlines has a section on conversion utilities.

This seems to be your best bet for converting line endings (Unix to DOS) using only tools that ship with Windows:

TYPE unix_file | FIND "" /V > dos_file
Guest [Entry]

Use the Python script at github.com/goerz/convert_encoding.py (requires Python 2.7). It works on any platform.
Guest [Entry]

iconv -f original_charset -t utf-8 originalfile > newfile

Run the above command in a for loop, once per file.
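Such a loop could look roughly like the sketch below. It assumes a POSIX shell, that the files to convert end in .txt, and that all of them share one source charset. The function name convert_dir and its arguments are illustrative only. Output goes to a separate directory, so the originals stay untouched.

```shell
#!/bin/sh
# convert_dir SRC_DIR DEST_DIR FROM_CHARSET
# Hypothetical helper: convert every *.txt file in SRC_DIR to UTF-8,
# writing the results under the same file names into DEST_DIR.
convert_dir() {
    src=$1
    dest=$2
    from=$3
    mkdir -p "$dest" || return 1
    for f in "$src"/*.txt; do
        [ -f "$f" ] || continue   # the glob matched nothing, or not a regular file
        # Stop at the first failure; a partial output file may be left behind.
        iconv -f "$from" -t utf-8 "$f" > "$dest/$(basename "$f")" || return 1
    done
}

# Possible usage (commented out so that sourcing this file changes nothing):
# convert_dir . ./utf8 windows-1252
```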
Guest [Entry]

In my use case, I needed automatic detection of the input encoding, and there were a lot of files in Windows-1250 encoding, for which the command file -bi <FILE> returns charset=unknown-8bit. This is not a valid parameter for iconv.
I have had the best results with enca.
Convert all files with the txt extension to UTF-8:
find . -type f -iname '*.txt' -exec sh -c 'echo "$1" && enca "$1" -x utf-8' -- {} \;
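When a batch is re-run, or a directory mixes converted and unconverted files, it may help to find out which files are already valid UTF-8 before handing the rest to enca or iconv. One way to sketch that check, which is not part of the answer above, is to ask iconv to convert a file from UTF-8 to UTF-8 and look at the exit status. The function names is_utf8 and list_non_utf8 are hypothetical. Note that pure ASCII files also count as valid UTF-8.

```shell
#!/bin/sh
# Succeeds (exit status 0) if FILE decodes cleanly as UTF-8.
# Assumption: iconv exits non-zero on an invalid input sequence.
is_utf8() {
    iconv -f utf-8 -t utf-8 "$1" > /dev/null 2>&1
}

# Print the .txt files below DIR (default: current directory) that are
# not valid UTF-8 and so presumably still need converting.
# File names containing newlines are not handled by this sketch.
list_non_utf8() {
    find "${1:-.}" -type f -iname '*.txt' | while IFS= read -r f; do
        is_utf8 "$f" || printf '%s\n' "$f"
    done
}
```

This only answers "is it UTF-8 or not". Working out which legacy encoding a non-UTF-8 file uses still needs a detector such as enca, or knowledge of where the files came from.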