Structure dataset from rows to columns pandas python -
i have dataframe following many feature columns 3 mentioned below:
productid |feature1 |value1 |feature2 |value2 | feature3 |value3 100001 |weight | 130g | |price |$140.50 100002 |weight | 200g |pieces |12 pcs | dimensions |150x75cm 100003 |dimensions |70x30cm |price |$22.90 100004 |price |$12.90 |manufacturer| abc |calories |556kcal 100005 |calories |1320kcal|dimensions |20x20cm |manufacturer | xyz and want structure in following way using pandas:
productid weight dimensions price calories no. of pieces manufacturer 100001 130g $140.50 100002 200g 150x75cm 12 pcs 100003 70x30cm $22.90 100004 $12.90 556kcal abc 100005 20x20cm 1320kcal xyz i studied various pandas methods reset_index, stack etc didn't convert in required way.
you looking code unpack dataframe. straightforward way (with many features , possibly repeating productids):
import pandas pd import numpy np def expand(frame): df = pd.dataframe() row in frame.iterrows(): data = row[1] feature_name, feature_value in zip(data[1::2], data[2::2]): if feature_name: df.loc[data.productid, feature_name] = feature_value return df.replace(np.nan, '') df = pd.dataframe([("100001", "weight", "130g", none, none, "price", "$140.50"), ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150x75cm"), ("100003", "dimensions", "70x30cm", "price", "$22.90"), ("100004", "price", "$12.90", "manufacturer", "abc", "calories", "556kcal"), ("100005", "calories", "1320kcal", "dimensions", "20x20cm", "manufacturer", "xyz")], columns=["productid", "feature1", "value1", "feature2", "value2", "feature3", "value3"]) xdf = expand(df) print(xdf) output:
weight price pieces dimensions manufacturer calories 100001 130g $140.50 100002 200g 12 pcs 150x75cm 100003 $22.90 70x30cm 100004 $12.90 abc 556kcal 100005 20x20cm xyz 1320kcal edit1: compressed form: (slow!)
def expand2(frame): return pd.dataframe.from_dict( {data.productid: {f: v f, v in zip(data[1::2], data[2::2]) if f} _, data in frame.iterrows()}, orient='index') edit2: using generator expression:
def expand3(frame): return pd.dataframe.from_records( ({f: v f, v in itertools.chain((('productid', data.productid),), zip(data[1::2], data[2::2])) if f} _, data in frame.iterrows()), index='productid').replace(np.nan, '') some testing (decorate function s @timeit):
def timeit(f): @functools.wraps(f) def timed(*args, **kwargs): try: start_time = time.time() return f(*args, **kwargs) finally: end_time = time.time() function_invocation = "x" sys.stdout.flush() print(f'function {f.__name__}({function_invocation}), took: {end_time - start_time:2.4f} seconds.', flush=true, file=sys.stderr) return timed def generate_wide_df(n_rows, n_features): possible_labels = [f'label_{i}' in range(n_features)] columns = ['productid'] in range(1, n_features): columns.append(f'feature_{i}') columns.append(f'value_{i}') df = pd.dataframe(columns=columns) row_n in range(n_rows): df.loc[row_n, 'productid'] = int(1000000 + row_n) _ in range(n_features): feature_num = random.randint(1, n_features) df.loc[row_n, f'feature_{feature_num}'] = random.choice(possible_labels) df.loc[row_n, f'value_{feature_num}'] = random.randint(1, 10000) return df.where(df.notnull(), none) df = generate_wide_df(4000, 30) expand(df) expand3(df) expand2(df) and results:
function expand(x), took: 1.1576 seconds. function expand3(x), took: 1.1185 seconds. function expand2(x), took: 16.3055 seconds.
Comments
Post a Comment