encoding - PHP writes garbled UTF-8 characters to output file
[Please see the answer below]
I am using preg_match_all to extract hashtags from strings, for example:
#tree#ztdf #n4# night
contains the hashtags: tree, ztdf, n4, night
The strings can contain characters from any language as well as emojis. Therefore, I enabled the UTF-8 flag (/u) in preg_match_all:
preg_match_all('/#([\p{L}\p{Mn}]+)/u', $media_caption, $matches);
However, some characters are wrongly matched as byte sequences.
I read about this problem with preg_match_all, UTF-8 encoding, and PHP, and tried to add the additional UTF-8 flag (*UTF8) for PCRE:
preg_match_all('(*UTF8)/#([\p{L}\p{Mn}]+)/u', $media_caption, $matches);
but I am getting an error:
syntax error, unexpected 'enabled' t-flag
Does anyone know how I can extract #hashtags containing UTF-8 characters using preg_match_all?
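For reference, a minimal sketch of the extraction with only the /u modifier is shown here. The helper name extract_hashtags is hypothetical, and the pattern is the one from the question, so it only matches letters and non-spacing marks (digits and emojis would need additional character classes).

```php
<?php
// Hypothetical helper, not part of the original script.
function extract_hashtags(string $caption): array
{
    // \p{L} matches any letter, \p{Mn} any non-spacing mark.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $count = preg_match_all('/#([\p{L}\p{Mn}]+)/u', $caption, $matches);
    if ($count === false) {
        // preg_match_all fails, for example, when the subject is not valid UTF-8.
        return [];
    }
    return $matches[1];
}

// "\u{00ED}" is the letter í, written as an escape so the example does not
// depend on the encoding of the source file.
$tags = extract_hashtags("#tree#ztdf #presadebuend\u{00ED}a");
echo implode(', ', $tags), PHP_EOL;
```

With the /u modifier set, a false return value is a useful signal in itself: it usually means the input was already not valid UTF-8 before the regex ran.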
[Edit]
OK, another day, another problem: I realized yesterday that the garbled characters I got after json_decode() only appeared because I was looking at the output in the Windows command line, which can't handle UTF-8. Today I ran the program using the Git Bash console, and:
- it shows that the input to preg_match_all looks fine in UTF-8.
- after this, no problems: str_replace(array("\r\n", "\r", "\n", ","), ";", $media_caption); (replace line breaks)
- and no problems after this: preg_replace('!\s+!u', ' ', $media_caption); (replace multiple whitespace characters with one)
- the funny part: it looks fine after this as well: preg_match_all('/#([\p{L}\p{Mn}]+)/u', $media_caption, $matches);
For example, when I var_dump the following string in Git Bash, I get:
string(15) "presadebuendía"
but in the written CSV/TXT file it looks like this: presadebuend㮡, while "Embalse de Buendía" is written to the file correctly.
I am looking for parts of my code that may mess up the character encoding during data processing. So far, I have tried adding:
header('Content-Encoding: UTF-8');
header('Content-Type: text/csv; charset=utf-8');
mb_internal_encoding("UTF-8");
and replacing the fopen function:
function utf8_fopen_read($filename) {
    // Convert the file contents from Windows-1250 to UTF-8
    $fc = iconv('windows-1250', 'utf-8', file_get_contents($filename));
    // Serve the converted contents from an in-memory stream
    $handle = fopen("php://memory", "rw");
    fwrite($handle, $fc);
    fseek($handle, 0);
    return $handle;
}
but none of this solved the issue.
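One way to narrow down which processing step corrupts a string is to validate the value as UTF-8 after every step. The sketch below assumes the mbstring extension is available; the helper first_broken_step and the step names are hypothetical, and the last step is only a stand-in that imitates a byte-oriented case conversion by changing a single byte.

```php
<?php
// Hypothetical helper: runs the steps in order and returns the name of the
// first step whose output is no longer a valid UTF-8 string, or null.
function first_broken_step(string $input, array $steps): ?string
{
    $value = $input;
    foreach ($steps as $name => $step) {
        $value = $step($value);
        // preg_replace returns null on error, so check the type as well.
        if (!is_string($value) || !mb_check_encoding($value, 'UTF-8')) {
            return $name;
        }
    }
    return null;
}

$steps = [
    'replace_linebreaks'  => fn(string $s) => str_replace(["\r\n", "\r", "\n", ","], ";", $s),
    'collapse_whitespace' => fn(string $s) => preg_replace('!\s+!u', ' ', $s),
    // Stand-in for a byte-wise lowercase: rewrites the lead byte 0xC3 to 0xE3,
    // which breaks two-byte sequences such as the one encoding "í".
    'simulated_bytewise_lowercase' => fn(string $s) => strtr($s, "\xC3", "\xE3"),
];

$broken = first_broken_step("presadebuend\u{00ED}a,  test", $steps);
echo $broken ?? 'all steps kept the string valid', PHP_EOL;
```

Printing bin2hex() of the value before and after the reported step then shows exactly which bytes changed, independent of what the console is able to display.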
Thank you for commenting. I apologize for pointing in the wrong direction: preg_match_all and the other regex functions were not the problem that messed up the characters. A couple of things confused me (such as the Windows command line not being able to output UTF-8). In the end, there was one issue in my code:
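- Before writing the strings to the file, I used the strtolower function, which works byte by byte and tried to lowercase everything, including the bytes of special characters such as í (\u00ed). The solution was to use mb_strtolower instead, which is multibyte-aware and only converts characters that are alphabetic according to Unicode.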
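A minimal sketch of the fix, assuming the mbstring extension is loaded. The wrapper name to_lower_utf8 is hypothetical; passing the encoding explicitly avoids relying on the mb_internal_encoding() setting. Whether plain strtolower actually damages a given string depends on the PHP version and locale, so the sketch only shows the multibyte-safe call.

```php
<?php
// Hypothetical wrapper: lowercases a UTF-8 string character by character
// instead of byte by byte.
function to_lower_utf8(string $text): string
{
    return mb_strtolower($text, 'UTF-8');
}

// "\u{00CD}" is the uppercase letter Í; its lowercase form is "\u{00ED}" (í).
$lower = to_lower_utf8("PRESADEBUEND\u{00CD}A");

// The hex dump makes the result visible even on a console without UTF-8 support.
echo bin2hex($lower), PHP_EOL;
echo strlen($lower), ' bytes, ', mb_strlen($lower, 'UTF-8'), ' characters', PHP_EOL;
```

The byte length matches the string(15) reported by var_dump in the question, because í takes two bytes in UTF-8.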
Of course, you couldn't spot the problem because I didn't include that specific part of the code in the question! While searching for the problem, I also added
header('Content-Encoding: UTF-8');
header('Content-Type: text/csv; charset=utf-8');
mb_internal_encoding("UTF-8");
to the PHP script file, which doesn't seem to have any effect on the output file. Anyway, this solved my problem. Thank you!
