python - PySpark: help filtering out rows which contain unwanted characters
Writing a parquet file gives me an error stating that the characters " ,;{}()\n\t=" are not allowed. I'd like to eliminate any rows that contain any of these characters anywhere. Should I use "like", "rlike", or something else?
I have tried this:
df = df.filter(df.account_number.rlike('*\n*', '*\ *','*,*','*;*','*{*','*}*','*)*','*(*','*\t*') == false)
This obviously doesn't work. I'm unsure what the right regex syntax is, or whether I even need a regex in this particular case.
You can use rlike, since it accepts regular expressions:
df.filter(~df.account_number.rlike("[ ,;{}()\n\t=]"))
When you put characters between [], the class matches any one of the listed characters. Note that rlike takes a single regex string, not a list of patterns, and you negate the condition with ~ rather than comparing to false.
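To see what that character class matches, the same pattern can be checked with Python's re module outside of Spark. The sample account numbers below are made up for illustration:

```python
import re

# The same character class used in the rlike() filter above:
# matches a space, comma, semicolon, brace, paren, newline, tab, or equals sign.
bad_chars = re.compile(r"[ ,;{}()\n\t=]")

samples = ["ACC12345", "ACC 123", "ACC,123", "ACC=123", "ACC{123}"]

# Keep only values with no forbidden character anywhere,
# mirroring ~df.account_number.rlike("[ ,;{}()\n\t=]")
kept = [s for s in samples if not bad_chars.search(s)]
print(kept)  # ['ACC12345']
```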
That said, I don't see why these characters wouldn't be allowed in DataFrame rows; the invalid character is more likely in a column name instead. You can use .withColumnRenamed() to rename it.