Structure dataset from rows to columns pandas python -


i have dataframe following many feature columns 3 mentioned below:

productid   |feature1   |value1 |feature2    |value2     | feature3    |value3 100001      |weight     | 130g   |                       |price        |$140.50 100002      |weight     | 200g   |pieces     |12 pcs     | dimensions  |150x75cm 100003      |dimensions |70x30cm |price      |$22.90         100004      |price      |$12.90  |manufacturer| abc    |calories    |556kcal 100005      |calories   |1320kcal|dimensions |20x20cm  |manufacturer   | xyz 

and want structure in following way using pandas:

productid   weight  dimensions  price   calories    no. of pieces   manufacturer 100001       130g              $140.50           100002       200g    150x75cm                         12 pcs     100003               70x30cm    $22.90           100004                          $12.90   556kcal                          abc 100005               20x20cm            1320kcal                         xyz 

i studied various pandas methods reset_index, stack etc didn't convert in required way.

you looking code unpack dataframe. straightforward way (with many features , possibly repeating productids):

import pandas pd import numpy np  def expand(frame):     df = pd.dataframe()     row in frame.iterrows():         data = row[1]         feature_name, feature_value in zip(data[1::2], data[2::2]):             if feature_name:                 df.loc[data.productid, feature_name] = feature_value     return df.replace(np.nan, '')   df = pd.dataframe([("100001", "weight", "130g", none, none, "price", "$140.50"), ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150x75cm"), ("100003", "dimensions", "70x30cm", "price", "$22.90"), ("100004", "price", "$12.90", "manufacturer", "abc", "calories", "556kcal"), ("100005", "calories", "1320kcal", "dimensions", "20x20cm", "manufacturer", "xyz")],                   columns=["productid", "feature1", "value1", "feature2", "value2", "feature3", "value3"])  xdf = expand(df) print(xdf) 

output:

       weight    price  pieces dimensions manufacturer  calories 100001   130g  $140.50                                           100002   200g           12 pcs   150x75cm                        100003          $22.90            70x30cm                        100004          $12.90                             abc   556kcal 100005                            20x20cm          xyz  1320kcal 

edit1: compressed form: (slow!)

def expand2(frame):     return pd.dataframe.from_dict(         {data.productid: {f: v f, v in zip(data[1::2], data[2::2]) if f} _, data in frame.iterrows()},         orient='index') 

edit2: using generator expression:

def expand3(frame):     return pd.dataframe.from_records(         ({f: v f, v in itertools.chain((('productid', data.productid),), zip(data[1::2], data[2::2])) if f}          _, data          in frame.iterrows()), index='productid').replace(np.nan, '') 

some testing (decorate function s @timeit):

def timeit(f):     @functools.wraps(f)     def timed(*args, **kwargs):         try:             start_time = time.time()             return f(*args, **kwargs)         finally:             end_time = time.time()             function_invocation = "x"             sys.stdout.flush()             print(f'function {f.__name__}({function_invocation}), took: {end_time - start_time:2.4f} seconds.',                   flush=true, file=sys.stderr)      return timed  def generate_wide_df(n_rows, n_features):     possible_labels = [f'label_{i}' in range(n_features)]     columns = ['productid']     in range(1, n_features):         columns.append(f'feature_{i}')         columns.append(f'value_{i}')      df = pd.dataframe(columns=columns)     row_n in range(n_rows):         df.loc[row_n, 'productid'] = int(1000000 + row_n)         _ in range(n_features):             feature_num = random.randint(1, n_features)             df.loc[row_n, f'feature_{feature_num}'] = random.choice(possible_labels)             df.loc[row_n, f'value_{feature_num}'] = random.randint(1, 10000)     return df.where(df.notnull(), none)   df = generate_wide_df(4000, 30)   expand(df) expand3(df) expand2(df) 

and results:

function expand(x), took: 1.1576 seconds. function expand3(x), took: 1.1185 seconds. function expand2(x), took: 16.3055 seconds. 

Comments

Popular posts from this blog

python - Operations inside variables -

Generic Map Parameter java -

arrays - What causes a java.lang.ArrayIndexOutOfBoundsException and how do I prevent it? -