
Calculating the sum of a column for each row of the outer index of a multi-index pandas DataFrame

百变鹏仔 · 1 day ago · #Python
Article tags: index
Question

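I have a DataFrame with the columns seller, item, price, shipping, free shipping minimum, count available, and count needed. My goal is to find the cheapest combination of seller and item based on a total that is calculated later (the calculation code is shown below). Sample data: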

import pandas as pd

item1 = ['item 1', 'item 2', 'item 1', 'item 1', 'item 2']
seller1 = ['seller 1', 'seller 2', 'seller 3', 'seller 4', 'seller 1']
price1 = [1.85, 1.94, 2.00, 2.00, 2.02]
shipping1 = [0.99, 0.99, 0.99, 2.99, 0.99]
freeship1 = [5, 5, 5, 50, 5]
countavailable1 = [1, 2, 2, 5, 2]
countneeded1 = [2, 1, 2, 2, 1]

df1 = pd.DataFrame({'seller': seller1,
                    'item': item1,
                    'price': price1,
                    'shipping': shipping1,
                    'free shipping minimum': freeship1,
                    'count available': countavailable1,
                    'count needed': countneeded1})

# create a column that states if the seller has all counts needed.
# this will be used to sort by, to prioritize the smallest number of orders possible
for index, row in df1.iterrows():
    if row['count available'] >= row['count needed']:
        df1.at[index, 'fulfills count needed'] = 'yes'
    else:
        df1.at[index, 'fulfills count needed'] = 'no'

# don't want to calc price based on [count available], so check if the seller
# has the count I need and calc cost based on [count needed].
# if the seller doesn't have [count needed], calc cost on [count available].
for index, row in df1.iterrows():
    if row['count available'] >= row['count needed']:
        df1.at[index, 'price x count'] = row['count needed'] * row['price']
    else:
        df1.at[index, 'price x count'] = row['count available'] * row['price']

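However, any seller may sell several items. I want to minimize the shipping I pay, so I want to group the items by seller. Following an approach I saw in another thread, I grouped them with the .first() method so that every column is kept in the new grouped DataFrame.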

# don't calc [total] until sellers have been grouped
# use first() method to return all columns and perform no other aggregations
grouped1 = df1.sort_values('price').groupby(['seller', 'item']).first()

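At this point I want to calculate total per seller. I have the following code, but it calculates total for each item rather than for each seller. That means shipping is added once for every item in a seller's group, or free shipping is not applied when a seller's combined price x count exceeds the free shipping minimum.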

# calc [total]
# column names are lowercase here to match the DataFrame defined above
for index, row in grouped1.iterrows():
    if row['free shipping minimum'] == 50 and row['price x count'] > 50:
        grouped1.at[index, 'total'] = row['price x count'] + 0
    elif row['free shipping minimum'] == 5 and row['price x count'] > 5:
        grouped1.at[index, 'total'] = row['price x count'] + 0
    else:
        grouped1.at[index, 'total'] = row['price x count'] + row['shipping']
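
To make the double-counting concrete, here is a small sketch that uses only the two 'seller 1' rows from the sample data (their price x count values, 1.85 and 2.02, follow from the loops above). The variable names per_row_total and per_seller_total are illustrative and not part of the original code.

```python
import pandas as pd

# The two rows of the sample data that belong to the same seller.
rows = pd.DataFrame({
    'seller': ['seller 1', 'seller 1'],
    'item': ['item 1', 'item 2'],
    'price x count': [1.85, 2.02],
    'shipping': [0.99, 0.99],
})

# Per-row totals, as in the loop above: shipping is added once per item.
per_row_total = (rows['price x count'] + rows['shipping']).sum()

# Per-seller total: sum the items first, then add shipping a single time.
per_seller_total = rows['price x count'].sum() + rows['shipping'].iloc[0]

print(round(per_row_total, 2))     # 5.85
print(round(per_seller_total, 2))  # 4.86
```

The 0.99 difference is the second shipping charge that the per-item loop adds for the same seller.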

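It looks like I may actually need to sum price x count for each seller when calculating total, but that is essentially the same problem, because I don't know how to calculate a column for each row of the outer index. What method can I use to do this?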

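Also, if anyone has suggestions on how to achieve the second half of my goal, please share them. I only want to return the count I need of each item. For example, say I need 2 of 'item 1' and 2 of 'item 2'. If 'seller 1' has 2 of 'item 1' and 1 of 'item 2', and 'seller 2' has 1 of 'item 1' and 1 of 'item 2', then I want all of the items from 'seller 1' (assuming it is the cheapest), but only 1 of 'item 1' from 'seller 2'. This seems like it would affect how the total column is calculated, but I'm not sure how to implement it.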


Correct answer


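I ended up grouping by seller first and summing price x count to get the subtotals, converting that result to a DataFrame, and then merging df1 with the new subtotal DataFrame to create the grouped DataFrame. I then created the totals column using the suggested np.where (which is much more elegant than my for loop and handles NaN values easily). Finally, grouping by seller, total, and item returns the result I wanted. The final code is as follows: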

import pandas as pd
import numpy as np

item1 = ['item 1', 'item 2', 'item 1', 'item 1', 'item 2']
seller1 = ['Seller 1', 'Seller 2', 'Seller 3', 'Seller 4', 'Seller 1']
price1 = [1.85, 1.94, 2.69, 2.00, 2.02]
shipping1 = [0.99, 0.99, 0.99, 2.99, 0.99]
freeship1 = [5, 5, 5, 50, 5]
countavailable1 = [1, 2, 2, 5, 2]
countneeded1 = [2, 1, 2, 2, 1]

df1 = pd.DataFrame({'Seller': seller1,
                    'Item': item1,
                    'Price': price1,
                    'Shipping': shipping1,
                    'Free Shipping Minimum': freeship1,
                    'Count Available': countavailable1,
                    'Count Needed': countneeded1})

# create a column that states if the seller has all counts needed.
# this will be used to sort by, to prioritize the smallest number of orders possible
for index, row in df1.iterrows():
    if row['Count Available'] >= row['Count Needed']:
        df1.at[index, 'Fulfills Count Needed'] = 'Yes'
    else:
        df1.at[index, 'Fulfills Count Needed'] = 'No'

# don't want to calc price based on [count available], so check if the seller
# has the count I need and calc cost based on [count needed].
# if the seller doesn't have [count needed], calc cost on [count available].
for index, row in df1.iterrows():
    if row['Count Available'] >= row['Count Needed']:
        df1.at[index, 'Price x Count'] = row['Count Needed'] * row['Price']
    else:
        df1.at[index, 'Price x Count'] = row['Count Available'] * row['Price']

# subtotals by seller, then assign calcs to a column called [Subtotal] and merge into the dataframe
subtotals = df1.groupby(['Seller'])['Price x Count'].sum().reset_index()
subtotals.rename({'Price x Count': 'Subtotal'}, axis=1, inplace=True)
grouped = df1.merge(subtotals[['Subtotal', 'Seller']], on='Seller')

# calc [Total]
grouped['Total'] = np.where(grouped['Subtotal'] > grouped['Free Shipping Minimum'],
                            grouped['Subtotal'],
                            grouped['Subtotal'] + grouped['Shipping'])

grouped.groupby(['Seller', 'Total', 'Item']).first()
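
As a possible alternative to the merge step, the per-seller subtotal can be broadcast onto every row with groupby(...).transform('sum'), which also works directly on the outer level of a MultiIndex. This is a minimal sketch rather than the answer's own method. It assumes the same sample data, and the names units and indexed are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Seller': ['Seller 1', 'Seller 2', 'Seller 3', 'Seller 4', 'Seller 1'],
    'Item': ['item 1', 'item 2', 'item 1', 'item 1', 'item 2'],
    'Price': [1.85, 1.94, 2.69, 2.00, 2.02],
    'Shipping': [0.99, 0.99, 0.99, 2.99, 0.99],
    'Free Shipping Minimum': [5, 5, 5, 50, 5],
    'Count Available': [1, 2, 2, 5, 2],
    'Count Needed': [2, 1, 2, 2, 1],
})

# Buy the needed count, capped at what the seller has in stock
# (same result as the two-branch loop, without iterating over rows).
units = df[['Count Available', 'Count Needed']].min(axis=1)
df['Price x Count'] = units * df['Price']

# Index by (Seller, Item), then broadcast each seller's sum to all of its rows.
indexed = df.set_index(['Seller', 'Item']).sort_index()
indexed['Subtotal'] = (
    indexed.groupby(level='Seller')['Price x Count'].transform('sum')
)

# Shipping is added once per seller, and only below the free shipping minimum.
indexed['Total'] = np.where(
    indexed['Subtotal'] > indexed['Free Shipping Minimum'],
    indexed['Subtotal'],
    indexed['Subtotal'] + indexed['Shipping'],
)

print(indexed[['Price x Count', 'Subtotal', 'Total']])
```

Because transform returns a result aligned to the original rows, no intermediate subtotal DataFrame or rename is needed, and both 'Seller 1' rows end up with the same Subtotal and Total.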