python - PySpark: reversing StringIndexer in a nested array


I'm using PySpark to do collaborative filtering with ALS. My original user and item ids are strings, so I used StringIndexer to convert them to numeric indices (PySpark's ALS model obliges us to do so).

After I've fitted the model, I can get the top 3 recommendations for each user like so:

recs = (
    model
    .recommendForAllUsers(3)
)

The recs DataFrame looks like this:

+-----------+--------------------+
|userIdIndex|     recommendations|
+-----------+--------------------+
|       1580|[[10096,3.6725707...|
|       4900|[[10096,3.0137873...|
|       5300|[[10096,2.7274625...|
|       6620|[[10096,2.4493625...|
|       7240|[[10096,2.4928937...|
+-----------+--------------------+
only showing top 5 rows

root
 |-- userIdIndex: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- productIdIndex: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)

I want to create a huge JSON dump of the DataFrame, and I can do it like so:

(
    recs
    .toJSON()
    .saveAsTextFile("name_i_must_hide.recs")
)

And a sample of these JSONs is:

{
  "userIdIndex": 1580,
  "recommendations": [
    {
      "productIdIndex": 10096,
      "rating": 3.6725707
    },
    {
      "productIdIndex": 10141,
      "rating": 3.61542
    },
    {
      "productIdIndex": 11591,
      "rating": 3.536216
    }
  ]
}

The userIdIndex and productIdIndex keys are due to the StringIndexer transformation.

How can I get the original values of these columns back? I suspect I must use the IndexToString transformer, but I can't quite figure out how, since the data is nested in an array inside the recs DataFrame.

I tried using a Pipeline (stages=[StringIndexer, ALS, IndexToString]), but it looks like the Pipeline doesn't support these indexers.

cheers!

In both cases you'll need access to the lists of labels. They can be accessed using either a StringIndexerModel:

user_indexer_model = ...  # type: StringIndexerModel
user_labels = user_indexer_model.labels

product_indexer_model = ...  # type: StringIndexerModel
product_labels = product_indexer_model.labels

or from the column metadata.
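Once the label lists are in hand, reversing an index is just a positional lookup, because StringIndexer assigns index i to labels[i] (labels are ordered by frequency). A minimal pure-Python illustration; the label values here are made up:

```python
# Hypothetical label lists, as they would come from StringIndexerModel.labels.
# StringIndexer maps the i-th label to index i.
user_labels = ["u_alice", "u_bob", "u_carol"]
product_labels = ["p_book", "p_film", "p_game"]

def index_to_label(index, labels):
    # Indices may arrive as floats (StringIndexer outputs doubles),
    # so cast before the lookup.
    return labels[int(index)]

print(index_to_label(1, user_labels))      # u_bob
print(index_to_label(2.0, product_labels)) # p_game
```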

For userIdIndex you can simply apply IndexToString:

from pyspark.ml.feature import IndexToString

user_id_to_label = IndexToString(
    inputCol="userIdIndex", outputCol="userId", labels=user_labels)
user_id_to_label.transform(recs)

For recommendations you'll need either a UDF or an expression like this:

from pyspark.sql.functions import array, col, lit, struct

n = 3  # Same as numItems

product_labels_ = array(*[lit(x) for x in product_labels])
recommendations = array(*[struct(
    product_labels_[col("recommendations")[i]["productIdIndex"]].alias("productId"),
    col("recommendations")[i]["rating"].alias("rating")
) for i in range(n)])

recs.withColumn("recommendations", recommendations)
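Alternatively, if the JSON dump has already been written, the same relabeling can be applied after the fact to each JSON line. This is a pure-Python sketch under assumptions: the label lists below are placeholders for the real StringIndexerModel.labels, and each line is assumed to have the shape shown in the sample above:

```python
import json

# Hypothetical label lists, standing in for StringIndexerModel.labels.
user_labels = ["u0", "u1", "u2"]
product_labels = ["p0", "p1", "p2"]

def relabel_record(line, user_labels, product_labels):
    """Map the numeric indices in one dumped JSON line back to the original ids."""
    rec = json.loads(line)
    return {
        "userId": user_labels[rec["userIdIndex"]],
        "recommendations": [
            {"productId": product_labels[r["productIdIndex"]],
             "rating": r["rating"]}
            for r in rec["recommendations"]
        ],
    }

line = '{"userIdIndex": 1, "recommendations": [{"productIdIndex": 2, "rating": 3.5}]}'
relabel_record(line, user_labels, product_labels)
# {'userId': 'u1', 'recommendations': [{'productId': 'p2', 'rating': 3.5}]}
```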
