tensorflow - Code changes needed for custom distributed ML Engine Experiment


I completed the tutorial on distributed TensorFlow experiments within an ML Engine experiment, and I am looking to define my own custom tier instead of the STANDARD_1 tier used in the config.yaml file. If I am using the tf.estimator.Estimator API, are any additional code changes needed to create a custom tier of any size? For example, the article suggests: "if you distribute 10,000 batches among 10 worker nodes, each node works on 1,000 batches." That suggests the config.yaml file below would be possible:

    trainingInput:
      scaleTier: CUSTOM
      masterType: complex_model_m
      workerType: complex_model_m
      parameterServerType: complex_model_m
      workerCount: 10
      parameterServerCount: 4

Are any code changes needed to the MNIST tutorial to be able to use this custom configuration? Would it distribute the X number of batches across the 10 workers as the tutorial suggests is possible? I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but it appears to have defined its own runconfig.cluster_spec within the trainer package (task.py), even though it is also using the Estimator API. So, is there additional configuration needed? My current understanding is that if you are using the Estimator API (even within your own defined model) there should not be a need for any additional changes.

Does any of this change if the config.yaml specifies using GPUs? The article suggests that for the Estimator API "no code changes are necessary as long as your ClusterSpec is configured properly. If a cluster is a mixture of CPUs and GPUs, map the ps job name onto the CPUs and the worker job name onto the GPUs." However, since the config.yaml is specifically identifying the machine types for the parameter servers and workers, I am expecting that within ML Engine the ClusterSpec will be configured properly based on the config.yaml file. However, I am not able to find any ML Engine documentation that confirms no changes are needed to take advantage of GPUs.
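For illustration, a custom tier that requests GPU machines would look much the same; the GPU machine type below (complex_model_m_gpu) is my assumption of an appropriate ML Engine type, with the parameter servers kept on CPU machines as the article recommends:

    trainingInput:
      scaleTier: CUSTOM
      masterType: complex_model_m_gpu
      workerType: complex_model_m_gpu
      parameterServerType: complex_model_m
      workerCount: 10
      parameterServerCount: 4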

Lastly, within ML Engine I am wondering if there are any ways to identify the usage of the different configurations. The line "if you distribute 10,000 batches among 10 worker nodes, each node works on 1,000 batches" suggests that the benefit of additional workers is roughly linear, but I don't have any intuition around how to determine whether more parameter servers are needed. What would one be able to check (either within the cloud dashboards or TensorBoard) to determine whether they have a sufficient number of parameter servers?

Are any additional code changes needed to create a custom tier of any size?

No. No changes are needed for the MNIST sample to work with a different number or type of worker. To use tf.estimator.Estimator on CloudML Engine, you must have your program invoke learn_runner.run, as exemplified in the samples. When you do so, the framework reads in the TF_CONFIG environment variable and populates a RunConfig object with the relevant information, such as the ClusterSpec. It will automatically do the right thing on parameter server nodes, and it will use the provided Estimator to start training and evaluation.
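A minimal sketch of that pattern, assuming TF 1.x and the contrib-era learn_runner/Experiment APIs that the samples used at the time; the toy input function and LinearClassifier are placeholders of mine, not code from the samples:

    import tensorflow as tf
    from tensorflow.contrib.learn import learn_runner

    def toy_input_fn():
        # Toy input: one numeric feature and a binary label.
        features = {'x': tf.constant([[1.0], [2.0], [3.0], [4.0]])}
        labels = tf.constant([[0], [0], [1], [1]])
        return features, labels

    def experiment_fn(run_config, hparams):
        # run_config is populated from the TF_CONFIG environment variable that
        # CloudML Engine sets on every node, so the ClusterSpec comes for free.
        estimator = tf.estimator.LinearClassifier(
            feature_columns=[tf.feature_column.numeric_column('x')],
            config=run_config)
        return tf.contrib.learn.Experiment(
            estimator,
            train_input_fn=toy_input_fn,
            eval_input_fn=toy_input_fn,
            train_steps=100)

    if __name__ == '__main__':
        learn_runner.run(
            experiment_fn,
            run_config=tf.contrib.learn.RunConfig(model_dir='/tmp/toy_output'))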

Most of the magic happens because tf.estimator.Estimator automatically uses a device setter that distributes ops correctly. That device setter uses the cluster information from the RunConfig object, whose constructor, by default, uses TF_CONFIG to do its magic (e.g. here). You can see where the device setter is being used here.
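Roughly what that looks like under the hood (a simplified sketch of the mechanism, not the library's actual source):

    import tensorflow as tf

    config = tf.estimator.RunConfig()  # parses TF_CONFIG from the environment
    device_fn = tf.train.replica_device_setter(
        cluster=config.cluster_spec,                        # ps/worker jobs from TF_CONFIG
        worker_device='/job:worker/task:%d' % config.task_id)

    with tf.device(device_fn):
        # Variables are placed on the ps jobs; other ops stay on this worker.
        weights = tf.get_variable('weights', shape=[10, 1])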

This means you can change your config.yaml by adding/removing workers and/or changing their types, and things should just work.

For sample code using a custom model_fn, see the census/customestimator example.

That said, please note that as you add workers, you are increasing your effective batch size (this is true regardless of whether or not you are using tf.estimator). That is, if your batch_size was 50 and you were using 10 workers, then each worker is processing batches of size 50, for an effective batch size of 10*50=500. If you then increase the number of workers to 20, your effective batch size becomes 20*50=1000. You may find that you need to decrease your learning rate accordingly (linear scaling seems to work well; ref).
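To make the bookkeeping concrete, here is a small sketch of the effective-batch-size arithmetic and one way to read the linear adjustment suggested above (the base learning rate of 0.01 is purely illustrative):

    batch_size = 50          # per-worker batch size
    base_lr = 0.01           # illustrative learning rate tuned for one worker

    for num_workers in (10, 20):
        effective_batch_size = batch_size * num_workers
        adjusted_lr = base_lr / num_workers   # linear decrease, per the note above
        print(num_workers, effective_batch_size, adjusted_lr)
    # prints: 10 500 0.001  and  20 1000 0.0005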

I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but it appears to have defined its own runconfig.cluster_spec within the trainer package (task.py), even though it is also using the Estimator API. So, is there additional configuration needed?

No additional configuration is needed. The reddit_tft sample does instantiate its own RunConfig; however, the constructor of RunConfig grabs any properties not explicitly set during instantiation by using TF_CONFIG. And it does so only as a convenience, to figure out how many parameter servers and workers there are.
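For illustration, a sketch (not the sample's actual code) of how a RunConfig built with no cluster arguments still picks the cluster up from TF_CONFIG; the cluster addresses below are made up:

    import json
    import os
    import tensorflow as tf

    # CloudML Engine exports something like this on each node of the job.
    os.environ['TF_CONFIG'] = json.dumps({
        'cluster': {
            'master': ['master-0:2222'],
            'worker': ['worker-0:2222', 'worker-1:2222'],
            'ps': ['ps-0:2222'],
        },
        'task': {'type': 'worker', 'index': 1},
    })

    config = tf.estimator.RunConfig()            # no cluster passed explicitly
    print(config.cluster_spec.as_dict())         # parsed from TF_CONFIG
    print(config.task_type, config.task_id)      # 'worker', 1
    print(config.num_ps_replicas, config.num_worker_replicas)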

Does any of this change if the config.yaml specifies using GPUs?

You should not need to change anything to use tf.estimator.Estimator with GPUs, other than possibly needing to manually assign ops to the GPU (but that's not specific to CloudML Engine); see this article for more info. We will look into clarifying the documentation.
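A minimal sketch of manually pinning an op to a GPU, which is plain TensorFlow and not specific to CloudML Engine:

    import tensorflow as tf

    with tf.device('/gpu:0'):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.matmul(a, a)

    # allow_soft_placement falls back to CPU if no GPU is available.
    with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
        print(sess.run(b))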

