python - PySpark error handling with a file name containing spaces


I am using PySpark 2.1.

Problem statement: I need to validate an HDFS path and file; if the file exists, I need to copy the file name into a variable.

Below is the code I have used so far, after referring to a few websites and Stack Overflow:

import os
import subprocess
import sys
from datetime import datetime

import pandas as pd


def run_cmd(args_list):
    # Run a system command and return its exit code.
    print('running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    proc.communicate()
    return proc.returncode


today = datetime.now().date().strftime('%d%b%y')
source_dir = '/user/dev/input/' + today
hdfs_file_path = source_dir + '\'student marks details.csv\''

cmd = ['hdfs', 'dfs', '-find', '{}', '-name', hdfs_file_path]
code = run_cmd(cmd)

if code <> 1:
    print 'file doesnot exist'
    sys.exit(1)
else:
    print 'file exist'

With the above code I get the error "file doesn't exist", even though the file is present in the folder.

The problem is that I am able to run the command below in the shell console, and there I get the complete path:

hdfs dfs -find {} -name /user/dev/input/08aug2017/'student marks details.csv'

When I tried to run the code detailed above in PySpark, it did not work because there is a space in the file name. Please help me resolve this issue.

The problem

Your problem is on this line:

 hdfs_file_path = source_dir + '\'student marks details.csv\'' 

You are adding two unneeded single quotes, and forgetting to add a directory separator.

The reason the path works in this command:

hdfs dfs -find {} -name /user/dev/input/08aug2017/'student marks details.csv'

is that it is a shell command. In the shell you are using (presumably bash), the following two commands are equivalent:

echo '/user/dev/input/08aug2017/student marks details.csv'
echo /user/dev/input/08aug2017/'student marks details.csv'

Bash removes the quotes and merges the adjacent strings together, yielding the same string as a result in both cases: /user/dev/input/08aug2017/student marks details.csv. The quotes are not part of the path. They are only a way to tell bash not to split the string at the spaces but to create a single string, after which bash removes the quotes.
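The quote handling described above can be reproduced from Python with the standard library's shlex module, which splits a command line using POSIX shell-like rules. This is only an approximation of what bash does, offered as a small sketch; the variable names are illustrative.

```python
import shlex

# Both command lines from the text, tokenised with shell-like quoting rules.
whole_quoted = shlex.split(
    "echo '/user/dev/input/08aug2017/student marks details.csv'"
)
part_quoted = shlex.split(
    "echo /user/dev/input/08aug2017/'student marks details.csv'"
)

# In both cases the quotes are gone and the path is one single argument.
print(whole_quoted[1])
print(part_quoted[1])
print(whole_quoted == part_quoted)
```

In both cases the second token is the plain path with its spaces intact and without any quote characters, which is the string the program on the receiving end actually sees.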

When you write:

 hdfs_file_path = source_dir + '\'student marks details.csv\'' 

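the path you end up with is /user/dev/input/08aug2017'student marks details.csv', instead of the correct /user/dev/input/08aug2017/student marks details.csv.

The subprocess call requires plain strings that correspond exactly to the values you want. When given a list of arguments, it does not process them the way a shell does, so any quotes in a string are passed on as literal characters.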
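To see that subprocess passes list arguments through untouched, the following sketch starts a child Python interpreter that simply writes back the argument it received. The helper name argument_seen_by_child is made up for this illustration, and the date folder is hard-coded to the one used in the question.

```python
import subprocess
import sys


def argument_seen_by_child(arg):
    """Return the first argument exactly as a child process receives it."""
    child_code = 'import sys; sys.stdout.write(sys.argv[1])'
    output = subprocess.check_output([sys.executable, '-c', child_code, arg])
    return output.decode('utf-8')


source_dir = '/user/dev/input/08aug2017'
# Same concatenation as in the question: literal quotes, no separator.
wrong_path = source_dir + '\'student marks details.csv\''

received = argument_seen_by_child(wrong_path)
print(received)
```

The child receives the single quotes as part of the argument, and no separator appears between the directory and the file name, so a path built this way cannot match the real file.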

Solution

In Python, joining paths is best done by calling os.path.join. I suggest you replace these lines:

source_dir = '/user/dev/input/' + today
hdfs_file_path = source_dir + '\'student marks details.csv\''

with the following:

source_dir = os.path.join('/user/dev/input/', today)
hdfs_file_path = os.path.join(source_dir, 'student marks details.csv')

os.path.join takes care to add a single directory separator (/ on Unix, \ on Windows) between its arguments, so you can't accidentally either forget the separator or add it twice.
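A short sketch of the separator handling follows. It uses posixpath.join, which is the implementation os.path.join resolves to on Unix; the choice is an assumption made here because HDFS paths always use /, so the result stays correct even if the script runs on Windows. The helpers build_hdfs_path and build_exists_cmd are hypothetical names, and checking existence with hdfs dfs -test -e is an alternative to the -find command in the question, not the original author's method.

```python
import posixpath


def build_hdfs_path(base_dir, day, file_name):
    """Join the parts with exactly one '/' between them."""
    return posixpath.join(base_dir, day, file_name)


def build_exists_cmd(hdfs_path):
    """Argument list for an existence check (assumed alternative to -find).

    `hdfs dfs -test -e PATH` exits with 0 when PATH exists.
    """
    return ['hdfs', 'dfs', '-test', '-e', hdfs_path]


file_name = 'student marks details.csv'
with_slash = build_hdfs_path('/user/dev/input/', '08aug2017', file_name)
without_slash = build_hdfs_path('/user/dev/input', '08aug2017', file_name)

print(with_slash)
print(with_slash == without_slash)
print(build_exists_cmd(with_slash))
```

The path is a single list element with no added quotes, so the space in the file name needs no special treatment. The resulting list can be handed to a helper such as run_cmd from the question, where a return code of 0 would indicate that the file exists.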

