PySpark: reversing StringIndexer in a nested array
I'm using PySpark to do collaborative filtering using ALS. My original user and item IDs are strings, so I used `StringIndexer` to convert them to numeric indices (PySpark's ALS model obliges us to do so).
After I've fitted the model, I can get the top 3 recommendations for each user like so:

```python
recs = (
    model
    .recommendForAllUsers(3)
)
```
The `recs` DataFrame looks like so:

```
+-----------+--------------------+
|userIdIndex|     recommendations|
+-----------+--------------------+
|       1580|[[10096,3.6725707...|
|       4900|[[10096,3.0137873...|
|       5300|[[10096,2.7274625...|
|       6620|[[10096,2.4493625...|
|       7240|[[10096,2.4928937...|
+-----------+--------------------+
only showing top 5 rows

root
 |-- userIdIndex: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- productIdIndex: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)
```
I want to create a huge JSON dump of this DataFrame, and I can do so like this:

```python
(
    recs
    .toJSON()
    .saveAsTextFile("name_i_must_hide.recs")
)
```
A sample of these JSONs is:

```json
{
    "userIdIndex": 1580,
    "recommendations": [
        {"productIdIndex": 10096, "rating": 3.6725707},
        {"productIdIndex": 10141, "rating": 3.61542},
        {"productIdIndex": 11591, "rating": 3.536216}
    ]
}
```
The `userIdIndex` and `productIdIndex` keys are there due to the `StringIndexer` transformation.
How can I get the original values of these columns back? I suspect I must use the `IndexToString` transformer, but I can't quite figure out how, since the data is nested in an array inside the `recs` DataFrame.
I tried to use a `Pipeline` evaluator (`stages=[StringIndexer, ALS, IndexToString]`), but it looks like the evaluator doesn't support these indexers.
Cheers!
In both cases you'll need access to the list of labels. It can be accessed using either the `StringIndexerModel`:

```python
user_indexer_model = ...  # type: StringIndexerModel
user_labels = user_indexer_model.labels

product_indexer_model = ...  # type: StringIndexerModel
product_labels = product_indexer_model.labels
```
or column metadata.
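For the metadata route, `StringIndexer` records its labels as nominal ML-attribute metadata on the indexed column; on a live DataFrame you would read `recs.schema["userIdIndex"].metadata["ml_attr"]["vals"]`. As a sketch, the lookup itself is plain dictionary access (the label values below are made up for illustration):

```python
# Sketch of the metadata structure StringIndexer attaches to its output
# column; "vals" holds the labels in index order. With a real DataFrame:
#   user_labels = recs.schema["userIdIndex"].metadata["ml_attr"]["vals"]
metadata = {"ml_attr": {"type": "nominal",
                        "name": "userIdIndex",
                        "vals": ["user-42", "user-7", "user-99"]}}  # hypothetical labels

user_labels = metadata["ml_attr"]["vals"]
```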
For `userIdIndex` you can simply apply `IndexToString`:

```python
from pyspark.ml.feature import IndexToString

user_id_to_label = IndexToString(
    inputCol="userIdIndex", outputCol="userId", labels=user_labels)
user_id_to_label.transform(recs)
```
For `recommendations` you'll need either a UDF or an expression like this:

```python
from pyspark.sql.functions import array, col, lit, struct

n = 3  # Same as numItems

product_labels_ = array(*[lit(x) for x in product_labels])
recommendations = array(*[struct(
    product_labels_[col("recommendations")[i]["productIdIndex"]].alias("productId"),
    col("recommendations")[i]["rating"].alias("rating")
) for i in range(n)])

recs.withColumn("recommendations", recommendations)
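To see what that expression computes per row, here is a self-contained pure-Python sketch of the same relabeling (the labels, rows, and the `relabel_recommendations` helper are all hypothetical, not part of the Spark code above): each `productIdIndex` is used as a position in the labels list to recover the original product ID.

```python
# labels[i] is the original string that StringIndexer mapped to index i.
product_labels = ["p-100", "p-200", "p-300"]  # made-up labels

def relabel_recommendations(recommendations, labels):
    """Replace each productIdIndex with its original productId string."""
    return [
        {"productId": labels[int(rec["productIdIndex"])],
         "rating": rec["rating"]}
        for rec in recommendations
    ]

row = [{"productIdIndex": 2, "rating": 3.67},
       {"productIdIndex": 0, "rating": 3.61}]
print(relabel_recommendations(row, product_labels))
# [{'productId': 'p-300', 'rating': 3.67}, {'productId': 'p-100', 'rating': 3.61}]
```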