python - PySpark: reversing StringIndexer in a nested array
I'm doing collaborative filtering in PySpark using ALS. The original user and item IDs are strings, so I used StringIndexer to convert them to numeric indices (PySpark's ALS model requires this).
After I've fitted the model, I can get the top 3 recommendations for each user like so:
recs = model.recommendForAllUsers(3)

The recs DataFrame looks like this:
+-----------+--------------------+
|userIdIndex|     recommendations|
+-----------+--------------------+
|       1580|[[10096,3.6725707...|
|       4900|[[10096,3.0137873...|
|       5300|[[10096,2.7274625...|
|       6620|[[10096,2.4493625...|
|       7240|[[10096,2.4928937...|
+-----------+--------------------+
only showing top 5 rows

root
 |-- userIdIndex: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- productIdIndex: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)

I want to create a huge JSON dump of this DataFrame, and I can do so like this:
recs.toJSON().saveAsTextFile("name_i_must_hide.recs")

A sample of these JSONs is:
{
    "userIdIndex": 1580,
    "recommendations": [
        {
            "productIdIndex": 10096,
            "rating": 3.6725707
        },
        {
            "productIdIndex": 10141,
            "rating": 3.61542
        },
        {
            "productIdIndex": 11591,
            "rating": 3.536216
        }
    ]
}

The userIdIndex and productIdIndex keys are there because of the StringIndexer transformation.
How can I get the original values of these columns back? I suspect I must use the IndexToString transformer, but I can't quite figure out how, since the data is nested in an array inside the recs DataFrame.
I tried to use a Pipeline evaluator (stages=[StringIndexer, ALS, IndexToString]), but it looks like the evaluator doesn't support these indexers.
Cheers!
In both cases you'll need access to the list of labels. It can be accessed using either the StringIndexerModel:
user_indexer_model = ...  # type: StringIndexerModel
user_labels = user_indexer_model.labels

product_indexer_model = ...  # type: StringIndexerModel
product_labels = product_indexer_model.labels

or the column metadata.
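The labels list is positional: index i holds the original string that was encoded as i, and IndexToString is just that lookup. A minimal plain-Python sketch (the label values here are hypothetical, not from the original post):

```python
# StringIndexerModel.labels is an ordered list: position i holds the
# original string that StringIndexer encoded as index i.
user_labels = ["u_900", "u_123", "u_456"]  # hypothetical label list

def index_to_string(index, labels):
    # Reverse the StringIndexer mapping: numeric index -> original string ID.
    return labels[index]

print(index_to_string(1, user_labels))  # prints "u_123"
```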
For userIdIndex you can simply apply IndexToString:
from pyspark.ml.feature import IndexToString

user_id_to_label = IndexToString(
    inputCol="userIdIndex", outputCol="userId", labels=user_labels)

user_id_to_label.transform(recs)

For recommendations you'll need either a udf or an expression like this:
from pyspark.sql.functions import array, col, lit, struct

n = 3  # Same as numItems

product_labels_ = array(*[lit(x) for x in product_labels])
recommendations = array(*[struct(
    product_labels_[col("recommendations")[i]["productIdIndex"]].alias("productId"),
    col("recommendations")[i]["rating"].alias("rating")
) for i in range(n)])

recs.withColumn("recommendations", recommendations)
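To see what the column expression computes, here is a plain-Python sketch of the same remapping applied to one row (hypothetical labels and ratings; in Spark this runs as a native expression, not as Python code):

```python
# Hypothetical label list: position i is the original product ID for index i.
product_labels = ["p_10096", "p_10141", "p_11591", "p_99999"]

# One row of the recs DataFrame: a list of (productIdIndex, rating) structs.
row = [(0, 3.6725707), (1, 3.61542), (2, 3.536216)]

n = 3  # same as numItems

# For each of the n recommendation slots, replace the numeric index with
# its original string label, keeping the rating unchanged.
relabeled = [(product_labels[row[i][0]], row[i][1]) for i in range(n)]
print(relabeled)
# [('p_10096', 3.6725707), ('p_10141', 3.61542), ('p_11591', 3.536216)]
```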