encoding - PHP writes garbled UTF-8 characters to output file


[Please see the answer below]

I am using preg_match_all to extract hashtags from strings, for example:

#tree#ztdf #n4# night

contains the hashtags: tree, ztdf, n4, night

The strings can contain characters from any language, as well as emojis. Therefore, I enabled the UTF-8 flag (/u) in preg_match_all:

preg_match_all('/#([\p{L}\p{Mn}]+)/u', $media_caption, $matches);

However, some characters are wrongly matched as byte sequences.

I read about a problem with preg_match_all, UTF-8 encoding, and PHP here. I tried to add the additional UTF-8 flag (*UTF8) for PCRE:

preg_match_all('(*UTF8)/#([\p{L}\p{Mn}]+)/u', $media_caption, $matches)

... but I am getting an error:

syntax error, unexpected 'enabled' t-flag

Does anyone know how I can extract #hashtags containing UTF-8 characters using preg_match_all?

[edit]

OK, a day later the problem looks different. I realized yesterday that the garbled characters I got after json_decode() came from looking at the output on the Windows command line, which can't handle UTF-8. Today I ran the program in the Git Bash console, and:

  • it shows that the input to preg_match_all looks fine in UTF-8;
  • there are no problems after this: str_replace(array("\r\n", "\r", "\n", ","), ";", $media_caption); (replace line breaks and commas)
  • and no problems after this: preg_replace('!\s+!u', ' ', $media_caption); (replace multiple whitespace characters with one)
  • the funny part: it even looks fine after this: preg_match_all('/#([\p{L}\p{Mn}]+)/u', $media_caption, $matches);
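The steps above can be sketched as one small script. This is only a minimal sketch: the sample caption is made up, and it assumes the script file itself is saved as UTF-8 and that the mbstring extension is available for the validity check.

```php
<?php
// Hypothetical sample caption; the real $media_caption comes from json_decode().
$media_caption = "Embalse de Buendía,\n#presadebuendía   #tree#ztdf";

// 1. Replace line breaks and commas with semicolons.
$media_caption = str_replace(array("\r\n", "\r", "\n", ","), ";", $media_caption);

// 2. Collapse runs of whitespace into a single space (UTF-8 aware).
$media_caption = preg_replace('!\s+!u', ' ', $media_caption);

// 3. Extract hashtags made of letters and combining marks.
preg_match_all('/#([\p{L}\p{Mn}]+)/u', $media_caption, $matches);
$hashtags = $matches[1];

// Check that the processed caption is still valid UTF-8, then inspect the tags.
var_dump(mb_check_encoding($media_caption, 'UTF-8'));
var_dump($hashtags);
```

Running the validity check after each step, rather than only at the end, is one way to narrow down where an encoding gets damaged without depending on how the console renders the text.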

For example, var_dump shows the following string in Git Bash:

 string(15) "presadebuendía" 

... but in the written CSV/TXT file it looks like this: presadebuend㮡, while embalse de buendía is written to the file correctly.

I am looking for parts of the code that may mess up the character encoding during data processing. So far, I have tried:

  • header('content-encoding: utf-8');
  • header('content-type: text/csv; charset=utf-8');
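  • mb_internal_encoding("utf-8");
  • replacing the fopen function with:

function utf8_fopen_read($filename) {
    // Read the whole file and convert it from Windows-1250 to UTF-8.
    $fc = iconv('windows-1250', 'utf-8', file_get_contents($filename));
    // Copy the converted content into an in-memory stream and rewind it.
    $handle = fopen("php://memory", "r+");
    fwrite($handle, $fc);
    fseek($handle, 0);
    return $handle;
}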

... but none of these solved the issue.

Thank you for commenting. I apologize for pointing in the wrong direction: preg_match_all and the other regex functions were not what was messing up the characters. A couple of things confused me (such as the Windows command line not being able to output UTF-8). In the end, there was one issue in the code:

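  • Before writing the strings to the file, I used the strtolower function, which converts everything to lowercase byte by byte and, depending on the PHP version and locale, can alter the bytes of special characters such as í (\u00ED). The solution was to use mb_strtolower instead, which works on whole characters and only converts what Unicode defines as alphabetic.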
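A minimal sketch of the difference, assuming a made-up input string and a script file saved as UTF-8. What strtolower does to non-ASCII bytes depends on the PHP version and the active locale (as far as I know, recent PHP versions only convert ASCII letters), so the sketch only checks whether its result is still valid UTF-8 instead of assuming a particular output.

```php
<?php
// Hypothetical input; in the question the strings come from media captions.
$tag = "PresaDeBuendÍa";

// mb_strtolower() is told the encoding explicitly, so it lowercases whole
// characters (Í becomes í) instead of touching individual bytes.
$lower = mb_strtolower($tag, 'UTF-8');

// strtolower() works on bytes; on older PHP versions with a single-byte
// locale it may change a byte inside a multibyte character.
$byteLower = strtolower($tag);

var_dump($lower);
// false here would indicate that strtolower() corrupted the UTF-8 sequence.
var_dump(mb_check_encoding($byteLower, 'UTF-8'));
```

Passing the encoding to mb_strtolower explicitly avoids relying on whatever mb_internal_encoding happens to be set to.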

Of course, you couldn't spot the problem because I didn't include that specific part of the code in the question! While searching for the problem, I added

  • header('content-encoding: utf-8');
  • header('content-type: text/csv; charset=utf-8');
  • mb_internal_encoding("utf-8");

to the PHP script file, which doesn't seem to have any effect on the output file. Anyway, the problem is solved. Thank you!

