Small Tricks

Packing and Merging

# Pack several per-task features into dicts, then sort by one of them.
# line_orders, line_product_name, line_standard_time and line_deadline are
# equal-length lists defined elsewhere.
tasks = [{'order': o, 'product': p, 'time': t, 'deadline': d} for o, p, t, d in zip(line_orders, line_product_name, line_standard_time, line_deadline)]
tasks = sorted(tasks, key=lambda x: x['deadline'])

# Merge consecutive tasks that belong to the same order
merged_tasks = []
current_order = None

for task in tasks:
    if merged_tasks and task['order'] == current_order:
        # Same order as the previous task: just bump its count
        merged_tasks[-1]['number'] += 1
    else:
        # New order: start a new merged entry
        current_order = task['order']
        merged_tasks.append({
            'order': task['order'],
            'product': task['product'],
            'time': task['time'],
            'deadline': task['deadline'],
            'number': 1
        })
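The loop above only merges tasks that end up next to each other after sorting by deadline, so two tasks of the same order separated by another order stay separate. If that is not what you want, one possible alternative is to merge by key with a dict. This is a minimal sketch; `merge_by_order` and the sample data are hypothetical names made up for illustration, and each merged entry keeps the fields of the first task seen for that order.

```python
def merge_by_order(tasks):
    """Merge tasks sharing the same 'order', whether or not they are adjacent."""
    merged = {}  # dicts preserve insertion order (Python 3.7+)
    for task in tasks:
        key = task["order"]
        if key in merged:
            merged[key]["number"] += 1
        else:
            # Copy the first task of this order and start counting
            merged[key] = {**task, "number": 1}
    return list(merged.values())


# Hypothetical sample data, already sorted by deadline
sample_tasks = [
    {"order": "A", "product": "p1", "time": 3, "deadline": 5},
    {"order": "B", "product": "p2", "time": 2, "deadline": 6},
    {"order": "A", "product": "p1", "time": 3, "deadline": 7},
]
merged_sample = merge_by_order(sample_tasks)
print(merged_sample)
```

Here order "A" is counted twice even though an order "B" task sits between its two entries.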

Input + Selection

Implementing "Space+Enter" input

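import readline  # enables line editing and history for input()


def input_str():
    """Read multi-line input. A line holding a single space ends it; "q" aborts."""
    user_input = ""
    while True:
        line = input()
        if line == " ":
            # Space+Enter: finish and drop the trailing newline
            user_input = user_input.rstrip("\n")
            break
        elif line == "q":
            # A lone "q" discards what was typed so far and returns "q"
            user_input = "q"
            break
        else:
            user_input = user_input + line + "\n"
    return user_input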

Fuzzy Search Selection

from pyfzf.pyfzf import FzfPrompt  # needs the fzf binary installed

f = FzfPrompt()
result = f.prompt(['', ''])  # pick a function; put the real options in this list
function = result[0]

pandas

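import numpy as np
import pandas as pd

# Read the spreadsheet, keeping only columns B to F and H to M
data = pd.read_excel('example.xlsx', usecols='B:F,H:M')

# Extract the first and second of the selected columns
a = np.array(data.iloc[:, 0])
b = np.array(data.iloc[:, 1])

# Export the data
data_out = pd.DataFrame({
    'first column': a,
    'second column': b
})
data_out.to_excel('exported_data.xlsx', index=False)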

Sorting

Single-key sort

# reverse=True sorts in descending order; the default is ascending

# The key can be x[0], x['name'], x.value, etc., depending on the list elements

sorted_list = sorted(my_list, key=lambda x: x.value, reverse=True)

Multi-key sort

sorted_list = sorted(my_list, key=lambda x: (x.A, -x.B))
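Negating a key as in `-x.B` only works for numeric fields. For strings or other non-numeric keys, one option is to rely on the fact that Python's sort is stable and sort in several passes, least significant key first. The sketch below uses made-up records and field names (`dept`, `name`) purely for illustration.

```python
from operator import itemgetter

# Hypothetical records: sort by dept ascending, then name descending
records = [
    {"dept": "ops", "name": "amy"},
    {"dept": "dev", "name": "bob"},
    {"dept": "ops", "name": "zed"},
    {"dept": "dev", "name": "cat"},
]

# Pass 1: secondary key (name), descending
by_name_desc = sorted(records, key=itemgetter("name"), reverse=True)

# Pass 2: primary key (dept), ascending; stability keeps the name order within each dept
result = sorted(by_name_desc, key=itemgetter("dept"))

names = [r["name"] for r in result]
print(names)  # ['cat', 'bob', 'zed', 'amy']
```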

Speed-up Tips

Tip	Why
Use CSV files instead of xlsx	More efficient
Get comfortable with NumPy
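As a small illustration of the NumPy tip, the sketch below computes the same sum of squares twice: once with an explicit Python loop and once with a single vectorized call. The array and variable names are made up for the example. On large arrays the vectorized form is usually much faster because the loop runs in compiled code rather than in the interpreter.

```python
import numpy as np

values = np.arange(1_000, dtype=np.float64)

# Explicit Python loop: one interpreter step per element
total_loop = 0.0
for v in values:
    total_loop += v * v

# Vectorized: a single call that does the same work in compiled code
total_vec = float(np.dot(values, values))

print(total_loop, total_vec)
```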

Data Preprocessing

Clustering

KNN (K-Nearest Neighbors)

SVM (Support Vector Machine)

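import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the data set: the label is in the first column, the features in columns 1 to 5
df = pd.read_excel('~/数模/2020C/附件/指标选择.xlsx', sheet_name='第一题')
X = df.iloc[:, 1:6]
y = df.iloc[:, 0]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=36)

# Create the SVM model
svm_model = SVC()

'''
C: controls the regularization strength of the SVM by setting the penalty for misclassification. A smaller C gives a wider margin but may misclassify more training points; a larger C gives a narrower margin with fewer misclassifications. This example tries [0.1, 1, 10].

kernel: the SVM uses a kernel function to map the input data into a higher-dimensional feature space, where a separating hyperplane is easier to find. Available kernels include 'linear', 'rbf' (radial basis function) and 'poly' (polynomial). This example tries ['linear', 'rbf'].

gamma: the kernel coefficient, which sets how far the influence of a single training sample reaches. A smaller gamma means a wider reach and a larger gamma a narrower one. The options tried here are ['scale', 'auto']. The linear kernel ignores gamma.
'''

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10],  # regularization parameter
    'kernel': ['linear', 'rbf'],  # kernel function
    'gamma': ['scale', 'auto']  # kernel coefficient
}

# Create the GridSearchCV object (5-fold cross-validation)
grid_search = GridSearchCV(estimator=svm_model, param_grid=param_grid, cv=5)

# Run the grid search on the training set
grid_search.fit(X_train, y_train)

# Print the best parameter combination
print("Best parameters:", grid_search.best_params_)

# Predict with the model refit on the best parameters
y_pred = grid_search.predict(X_test)

# Compute the accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on the test set:", accuracy)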

Association Rules

Feature Selection

Techniques

Unsupervised Feature Selection
Supervised Feature Selection

Category	Methods
Filter Methods	Missing Values, $\chi^2$ test, Information Gain, Fisher Score
Embedded Methods	Regularization $L_1, L_2 …$, Random Forest
Wrapper Methods	Forward/Backward/Exhaustive/Recursive Feature Selection
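To make the filter category concrete, here is a minimal sketch of a $\chi^2$ filter using scikit-learn's `SelectKBest` with the `chi2` score function. It uses the bundled iris data set only as a stand-in, and the choice of `k=2` is arbitrary. Note that `chi2` expects non-negative feature values.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Stand-in data: 150 samples, 4 non-negative features, 3 classes
X, y = load_iris(return_X_y=True)

# Score every feature against the label and keep the 2 highest-scoring ones
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

selected_idx = selector.get_support(indices=True)
print("chi2 scores:", selector.scores_.round(1))
print("selected feature indices:", selected_idx)
```

A filter like this scores each feature independently of any model, which makes it cheap but blind to interactions between features.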

TOPSIS (Technique for Order Preference by Similarity to Ideal Solution)

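Definition

A multi-attribute decision analysis method that uses the concepts of a positive ideal solution and a negative ideal solution to solve multi-criteria decision problems.

Basic idea

The positive ideal solution is a hypothetical alternative in which every indicator takes its best value;
the negative ideal solution is a hypothetical alternative in which every indicator takes its worst value;
the best alternative should be as close as possible to the positive ideal solution and as far as possible from the negative ideal solution.

The Euclidean distance from each alternative to the positive and negative ideal solutions is computed, and the two distances are combined into a relative closeness score that is used to rank and select the alternatives.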
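Written out, the idea above takes the following form. The notation is an assumption chosen for this note: $v_{ij}$ is the weighted normalized value of alternative $i$ on indicator $j$, $J^+$ is the set of benefit indicators (larger is better) and $J^-$ the set of cost indicators (smaller is better).

```latex
A_j^+ =
\begin{cases}
\max_i v_{ij}, & j \in J^+ \\
\min_i v_{ij}, & j \in J^-
\end{cases}
\qquad
A_j^- =
\begin{cases}
\min_i v_{ij}, & j \in J^+ \\
\max_i v_{ij}, & j \in J^-
\end{cases}

S_i^+ = \sqrt{\sum_j \left(v_{ij} - A_j^+\right)^2},
\qquad
S_i^- = \sqrt{\sum_j \left(v_{ij} - A_j^-\right)^2}

C_i = \frac{S_i^-}{S_i^+ + S_i^-}, \qquad 0 \le C_i \le 1
```

A larger $C_i$ means the alternative is closer to the positive ideal solution, so alternatives are ranked by $C_i$ in descending order.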

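Advantages

TOPSIS takes all indicators into account fairly comprehensively and reflects the relative merits of the alternatives, so it is widely used in multi-attribute decision analysis.

Code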

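import numpy as np
import pandas as pd

# Load the data set: three indicator columns, one row per supplier
df = pd.read_excel('Data1.xlsx', sheet_name='供应量', usecols=[243, 245, 246], skiprows=1, header=None)

# Min-max normalize every indicator to [0, 1]
data_norm = ((df - df.min()) / (df.max() - df.min())).to_numpy(dtype=float)
n = data_norm.shape[0]

# Entropy of each indicator, computed from column-wise proportions.
# Zero proportions contribute nothing (0 * log(1) = 0).
p = data_norm / data_norm.sum(axis=0)
safe_p = np.where(p > 0, p, 1.0)
entropies = -(p * np.log(safe_p)).sum(axis=0) / np.log(n)

# Entropy weights: lower entropy means more information, hence a larger weight
weights = (1 - entropies) / (1 - entropies).sum()

# Benefit indicators (larger is better) and cost indicators (smaller is better)
I_plus = [0, 1]
I_minus = [2]

# Weighted normalized decision matrix (a new array, df is left untouched)
weighted_norm_matrix = data_norm * weights

# Positive ideal solution: best value of every indicator;
# negative ideal solution: worst value of every indicator
col_max = weighted_norm_matrix.max(axis=0)
col_min = weighted_norm_matrix.min(axis=0)
is_cost = np.isin(np.arange(weighted_norm_matrix.shape[1]), I_minus)
A_plus = np.where(is_cost, col_min, col_max)
A_minus = np.where(is_cost, col_max, col_min)

# Euclidean distance to each ideal solution, over all indicators
S_plus = np.sqrt(np.sum((weighted_norm_matrix - A_plus) ** 2, axis=1))
S_minus = np.sqrt(np.sum((weighted_norm_matrix - A_minus) ** 2, axis=1))

# Relative closeness
C_i = S_minus / (S_plus + S_minus)

# Sort by relative closeness C_i in descending order
order = np.argsort(-C_i)

# Print the ranking
for rank, index in enumerate(order, start=1):
    print(f"Rank {rank}: supplier S{index + 1}, relative closeness C_i={C_i[index]:.4f}")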
