MySQL / PHP: fixing umlauts and special characters (UTF-8 / ISO)

Everyone knows the problem: for some reason, text was written to the database in the wrong encoding. When that has happened, you can tell by character sequences like these mixed into the text:

'Ã€', 'Ã‚', 'Ãƒ', 'Ã„', 'Ã…', 'Ã†', 'Ã‡', 'Ãˆ', 'Ã‰', 'ÃŠ', 'Ã‹', 'ÃŒ', 'ÃŽ', 'Ã–', 'Ã˜', 'Ã™', 'Ãš', 'Ã›', 'Ãœ', 'Ãž', 'ÃŸ', 'Ã¡', 'Ã¢', 'Ã£', 'Ã¤', 'Ã¥', 'Ã¦', 'Ã§', 'Ã¨', 'Ã©', 'Ãª', 'Ã«', 'Ã¬', 'Ã®', 'Ã¯', 'Ã°', 'Ã±', 'Ã²', 'Ã³', 'Ã´', 'Ãµ', 'Ã¶', 'Ã¸', 'Ã¹', 'Ãº', 'Ã»', 'Ã¼', 'Ã½', 'Ã¾', 'Ã¿'

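The cause is an encoding mismatch: the characters were written in one encoding (for example as UTF-8 bytes) but are read back or displayed as if they were in another (for example ISO-8859-1), or the other way round. This can happen for a variety of reasons.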
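To make the mechanism concrete, the following minimal sketch reproduces the effect for a single character. It assumes the mbstring extension is loaded and that the misreading is UTF-8 bytes interpreted as ISO-8859-1, which is only one of several possible causes. The function name simulateMisreadUtf8 is made up for illustration.

```php
<?php
// Hypothetical helper: treat the UTF-8 bytes of $utf8 as if they were ISO-8859-1
// and convert them to UTF-8 a second time ("double encoding").
function simulateMisreadUtf8(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

$correct = "\xC3\xA4";                    // "ä" as UTF-8: two bytes
$broken  = simulateMisreadUtf8($correct); // each byte becomes its own character

echo bin2hex($correct), "\n"; // c3a4
echo bin2hex($broken), "\n";  // c383c2a4, displayed as "Ã¤"
```

The strings are written as hex escapes so that the result does not depend on the encoding of the PHP file itself.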

Convert ISO-encoded strings to UTF-8

To avoid this, ISO-8859-1 strings can be converted to UTF-8 with the function utf8_encode:

$string = utf8_encode($string);

Note that utf8_encode is deprecated as of PHP 8.2.
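As a sketch of the same conversion with the mbstring extension, mb_convert_encoding can be given the source encoding explicitly. The wrapper name latin1ToUtf8 is hypothetical, and the example assumes the input really is ISO-8859-1; applying it to text that is already UTF-8 produces exactly the kind of broken sequences listed at the top.

```php
<?php
// Hypothetical wrapper: convert an ISO-8859-1 string to UTF-8.
// Assumes the mbstring extension is loaded.
function latin1ToUtf8(string $latin1): string
{
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "M\xFCller";            // "Müller" in ISO-8859-1 (ü is the single byte 0xFC)
$utf8   = latin1ToUtf8($latin1);  // ü becomes the two bytes 0xC3 0xBC

echo bin2hex($utf8), "\n";        // 4dc3bc6c6c6572
```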

Checking the encoding

The encoding of a string can be checked with the function mb_detect_encoding:

echo mb_detect_encoding($string);
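Detection is only a guess, so it helps to pass an explicit candidate list and strict mode, and to use mb_check_encoding when the only question is whether a string is valid UTF-8. A minimal sketch, assuming mbstring is loaded; the function names and the candidate list are illustrative choices.

```php
<?php
// Hypothetical helper: true if $s is a valid UTF-8 byte sequence.
function isValidUtf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

// Hypothetical helper: choose between UTF-8 and ISO-8859-1 in strict mode.
// mb_detect_encoding returns false when no candidate matches; map that to null.
function guessEncoding(string $s): ?string
{
    $encoding = mb_detect_encoding($s, ['UTF-8', 'ISO-8859-1'], true);
    return $encoding === false ? null : $encoding;
}

var_dump(isValidUtf8("\xC3\xA4")); // bool(true): "ä" encoded as UTF-8
var_dump(isValidUtf8("\xE4"));     // bool(false): "ä" encoded as ISO-8859-1
echo guessEncoding("\xE4"), "\n";  // ISO-8859-1
```

Because every byte sequence is formally valid ISO-8859-1, a result of ISO-8859-1 here really means "not valid UTF-8" rather than a positive identification.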

The following solution can be used for a quick and dirty fix:

Change the database connection encoding

Another source of error is the transfer of the data to the database. The connection encoding should always be set to UTF-8 once, directly after opening the database connection:

...
mysql_connect();
mysql_query("SET NAMES 'utf8'");
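The mysql_* functions were removed in PHP 7. With PDO, the connection charset can be set in the DSN instead. The sketch below only builds that DSN string, so it runs without a database. The helper name buildMysqlDsn, the host and database names, and the choice of utf8mb4 (MySQL's 4-byte UTF-8 character set) are assumptions for illustration.

```php
<?php
// Hypothetical helper: build a PDO DSN for MySQL that fixes the connection charset.
function buildMysqlDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = buildMysqlDsn('localhost', 'example_db');
echo $dsn, "\n"; // mysql:host=localhost;dbname=example_db;charset=utf8mb4

// With a reachable server and placeholder credentials this would be used as:
// $pdo = new PDO($dsn, 'user', 'password');
```

With mysqli, the equivalent step is calling set_charset('utf8mb4') on the connection object right after connecting.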

Loading UTF-8 encoded PHP files into an ISO encoded project

If a UTF-8 encoded PHP file is included by mistake, the output encoding may switch to UTF-8 despite all other precautions.

In that case, setting the header explicitly after the include helps:

require_once "utf-8.php";
header('Content-Type: text/html; charset=ISO-8859-1');

Helper function to encode arrays as UTF-8

A simple recursive function that encodes a multi-dimensional array as UTF-8 (working with references would be better):

function utf8encodeArray($array)
{
    foreach ($array as $key => $value) {
        if (is_array($value)) {
            // Recurse into nested arrays.
            $array[$key] = utf8encodeArray($value);
        } elseif (!mb_detect_encoding($value, 'UTF-8', true)) {
            // Only convert values that are not already valid UTF-8.
            $array[$key] = utf8_encode($value);
        }
    }
    // Return the converted copy to the caller.
    return $array;
}
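The reference-based variant mentioned above can be sketched as follows. It is an illustration rather than the author's code: the function name is made up, it assumes mbstring is loaded, and it assumes that every string which is not valid UTF-8 is ISO-8859-1.

```php
<?php
// Hypothetical in-place variant: converts all non-UTF-8 strings in a nested array.
function utf8EncodeArrayInPlace(array &$array): void
{
    array_walk_recursive($array, function (&$value): void {
        // Leave non-strings (ints, null, ...) and valid UTF-8 strings untouched.
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
            $value = mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
        }
    });
}

$row = ['name' => "M\xFCller", 'tags' => ["gr\xFCn", 'ok'], 'id' => 7];
utf8EncodeArrayInPlace($row);

echo bin2hex($row['name']), "\n"; // 4dc3bc6c6c6572
```

Because the array is passed by reference, no copy is returned and nested levels are modified directly.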

Update: a simpler function that converts arrays between arbitrary encodings:

function encodeArray(array $array, string $sourceEncoding, string $destinationEncoding = 'UTF-8'): array
{
    // Nothing to do if source and destination encoding are the same.
    if ($sourceEncoding === $destinationEncoding) {
        return $array;
    }
    // Convert every leaf value of the (possibly nested) array.
    array_walk_recursive($array, function (&$value) use ($sourceEncoding, $destinationEncoding) {
        $value = mb_convert_encoding($value, $destinationEncoding, $sourceEncoding);
    });
    return $array;
}

The header

You should also check whether the head of the HTML document declares UTF-8:

<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

or set it with PHP:

header('Content-Type: text/html; charset=utf-8');

File encoding

The PHP file itself must also be saved as UTF-8, otherwise umlauts in the file are displayed incorrectly as well. This can be checked and changed with Notepad++, for example (main menu -> Encoding -> UTF-8). Any good IDE lets you preset the encoding for an entire project.
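Whether a file is valid UTF-8 can also be checked from PHP itself. A minimal sketch, assuming mbstring is loaded; isFileUtf8 is a hypothetical helper, and a positive result only means the bytes form valid UTF-8, which is also true of pure ASCII files.

```php
<?php
// Hypothetical helper: true if the file's bytes form valid UTF-8.
function isFileUtf8(string $path): bool
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Cannot read file: $path");
    }
    return mb_check_encoding($contents, 'UTF-8');
}

// Demonstration with two temporary files.
$utf8File   = tempnam(sys_get_temp_dir(), 'enc');
$latin1File = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($utf8File, "gr\xC3\xBCn");  // "grün" with ü as UTF-8
file_put_contents($latin1File, "gr\xFCn");    // "grün" with ü as ISO-8859-1

$utf8Ok   = isFileUtf8($utf8File);
$latin1Ok = isFileUtf8($latin1File);

unlink($utf8File);
unlink($latin1File);

var_dump($utf8Ok);   // bool(true)
var_dump($latin1Ok); // bool(false)
```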

Other sources of error

There are a number of ways to lose the encoding of a string or document. Particularly insidious are PHP string manipulation functions that silently convert the string to UTF-8 before returning it, which is a problem if the website is encoded in ISO-8859-1. Unfortunately, I could not find those functions again; I would be very happy to receive feedback on the topic.

If it is already too late and the broken data has been saved in the database, you can replace the wrong umlauts as follows:

Caution: the script only works if the PHP file itself is encoded as UTF-8 (this can be checked and changed with Notepad++, for example: main menu -> Encoding -> UTF-8).

In other words, the script only works in a UTF-8 encoded project (see the article: PHP puzzles).
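A repair of this kind can be sketched as follows, assuming the damage is the common double encoding (UTF-8 bytes misread as ISO-8859-1 and then stored as UTF-8 again). The function name repairDoubleEncoded is hypothetical, and this is not necessarily the script referred to above. The byte sequences are written as hex escapes, so this particular example does not depend on the encoding of the PHP file.

```php
<?php
// Hypothetical repair for double-encoded text. Strings that do not look
// double-encoded are returned unchanged. Assumes mbstring is loaded.
function repairDoubleEncoded(string $s): string
{
    if (!mb_check_encoding($s, 'UTF-8')) {
        return $s; // not UTF-8 at all, so this repair does not apply
    }
    // Undo the second encoding step.
    $candidate = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    // Converting back must reproduce the input; otherwise characters were lost
    // (for example "€", which has no ISO-8859-1 representation).
    $roundTrip = mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1');

    // Correctly encoded text fails the last check: "ä" would become the lone
    // byte 0xE4, which is not valid UTF-8.
    if ($candidate !== $s && $roundTrip === $s && mb_check_encoding($candidate, 'UTF-8')) {
        return $candidate;
    }
    return $s;
}

echo bin2hex(repairDoubleEncoded("\xC3\x83\xC2\xA4")), "\n"; // c3a4 ("ä" repaired)
echo bin2hex(repairDoubleEncoded("\xC3\xA4")), "\n";         // c3a4 (already correct)
```

Sequences that were damaged through Windows-1252 rather than ISO-8859-1 (such as 'Ã„') are left unchanged by this sketch. In any case it is worth trying the repair on a copy of the data first, since a string can look double-encoded by coincidence.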