python - Pyspark: help filtering out any rows which have unwanted characters -


Writing a parquet file gives me an error stating the characters " ,;{}()\n\t=" are not allowed.

I'd like to eliminate any rows that have these characters anywhere.

Would I use "like", "rlike", or something else?

I have tried this:

df = df.filter(df.account_number.rlike('*\n*', '*\ *','*,*','*;*','*{*','*}*','*)*','*(*','*\t*') == False)

Obviously this does not work. I'm unsure what the right regex syntax is, or whether I even need regex in this particular case.

You can use rlike since it takes regular expressions:

df.filter(~df.account_number.rlike("[ ,;{}()\n\t=]")) 

Putting characters between [] means "match any one of the following characters".
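The filter above keeps only rows where the column matches none of the forbidden characters. A minimal plain-Python sketch of the same character class (Spark's rlike uses Java regex, which treats [...] the same way; the sample account numbers are made up for illustration):

```python
import re

# Same character class as the rlike pattern: matches any single
# space, comma, semicolon, brace, paren, newline, tab, or equals sign.
bad_chars = re.compile(r"[ ,;{}()\n\t=]")

rows = ["ACC123", "ACC 456", "ACC;789", "ACC=000", "ACC\t111"]

# Equivalent of ~rlike: keep rows containing none of those characters.
clean = [r for r in rows if not bad_chars.search(r)]
print(clean)  # ['ACC123']
```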

I don't see why these characters wouldn't be allowed in DataFrame rows; it's more likely there is an invalid character in the column names instead. You can use .withColumnRenamed() to rename such a column.
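One way to derive clean column names is to substitute the forbidden characters before renaming. A hedged sketch in plain Python (the column names here are hypothetical; in PySpark you would then call df.withColumnRenamed(old, new) for each changed pair):

```python
import re

# Hypothetical column names containing characters parquet rejects.
cols = ["account number", "amount(usd)", "date"]

# Replace each forbidden character with an underscore to get
# parquet-safe names; feed old/new pairs to withColumnRenamed().
clean = [re.sub(r"[ ,;{}()\n\t=]", "_", c) for c in cols]
print(clean)  # ['account_number', 'amount_usd_', 'date']
```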

