python - PySpark: reversing StringIndexer in a nested array
I'm doing collaborative filtering in PySpark using ALS. The original user and item IDs are strings, so I used StringIndexer to convert them to numeric indices (PySpark's ALS model requires this).
After I've fitted the model, I can get the top 3 recommendations for each user like so:
recs = model.recommendForAllUsers(3)

The recs DataFrame looks like this:
+-----------+--------------------+
|userIdIndex|     recommendations|
+-----------+--------------------+
|       1580|[[10096,3.6725707...|
|       4900|[[10096,3.0137873...|
|       5300|[[10096,2.7274625...|
|       6620|[[10096,2.4493625...|
|       7240|[[10096,2.4928937...|
+-----------+--------------------+
only showing top 5 rows

root
 |-- userIdIndex: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- productIdIndex: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)

I want to create a huge JSON dump of this DataFrame, and I can do so like this:
recs.toJSON().saveAsTextFile("name_i_must_hide.recs")

A sample of these JSONs is:
{
    "userIdIndex": 1580,
    "recommendations": [
        {
            "productIdIndex": 10096,
            "rating": 3.6725707
        },
        {
            "productIdIndex": 10141,
            "rating": 3.61542
        },
        {
            "productIdIndex": 11591,
            "rating": 3.536216
        }
    ]
}

The userIdIndex and productIdIndex keys are there because of the StringIndexer transformation.
How can I get the original values of these columns back? I suspect I must use the IndexToString transformer, but I can't quite figure out how, since the data is nested in an array inside the recs DataFrame.
I tried to use a Pipeline evaluator (stages=[StringIndexer, ALS, IndexToString]), but it looks like the evaluator doesn't support these indexers.
Cheers!
In both cases you'll need access to the list of labels. It can be accessed using either the StringIndexerModel:
user_indexer_model = ...  # type: StringIndexerModel
user_labels = user_indexer_model.labels

product_indexer_model = ...  # type: StringIndexerModel
product_labels = product_indexer_model.labels

or the column metadata.
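The labels list is positional: index i holds the original string that was encoded as i, and IndexToString is just that lookup. A minimal plain-Python sketch (the label values here are hypothetical, not from the original post):

```python
# StringIndexerModel.labels is an ordered list: position i holds the
# original string that StringIndexer encoded as index i.
user_labels = ["u_900", "u_123", "u_456"]  # hypothetical label list

def index_to_string(index, labels):
    # Reverse the StringIndexer mapping: numeric index -> original string ID.
    return labels[index]

print(index_to_string(1, user_labels))  # prints "u_123"
```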
For userIdIndex you can simply apply IndexToString:
from pyspark.ml.feature import IndexToString

user_id_to_label = IndexToString(
    inputCol="userIdIndex", outputCol="userId", labels=user_labels)

user_id_to_label.transform(recs)

For recommendations you'll need either a udf or an expression like this:
from pyspark.sql.functions import array, col, lit, struct

n = 3  # Same as numItems

product_labels_ = array(*[lit(x) for x in product_labels])
recommendations = array(*[struct(
    product_labels_[col("recommendations")[i]["productIdIndex"]].alias("productId"),
    col("recommendations")[i]["rating"].alias("rating")
) for i in range(n)])

recs.withColumn("recommendations", recommendations)
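To see what the column expression computes, here is a plain-Python sketch of the same remapping applied to one row (hypothetical labels and ratings; in Spark this runs as a native expression, not as Python code):

```python
# Hypothetical label list: position i is the original product ID for index i.
product_labels = ["p_10096", "p_10141", "p_11591", "p_99999"]

# One row of the recs DataFrame: a list of (productIdIndex, rating) structs.
row = [(0, 3.6725707), (1, 3.61542), (2, 3.536216)]

n = 3  # same as numItems

# For each of the n recommendation slots, replace the numeric index with
# its original string label, keeping the rating unchanged.
relabeled = [(product_labels[row[i][0]], row[i][1]) for i in range(n)]
print(relabeled)
# [('p_10096', 3.6725707), ('p_10141', 3.61542), ('p_11591', 3.536216)]
```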