While studying machine learning, I wanted to split data with continuous values between 0 and 1 into 10 groups.
Here is the original data.
normalized_train.head()
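Each column holds continuous values between 0 and 1.
What I wanted to do was map
values in the range 0.0 <= x < 0.1 to 0,
values in the range 0.1 <= x < 0.2 to 1,
...
and values in the range 0.9 <= x < 1.0 to 9.
I expected this to reduce the time the machine learning step would take.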
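To make the mapping rule concrete, here is a minimal sketch using NumPy's `np.digitize`. The sample values are made up for illustration and are not taken from the original dataset.

```python
import numpy as np

# Inner bin edges. With the default right=False, np.digitize returns i
# such that edges[i-1] <= x < edges[i], which matches the rule above.
edges = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Hypothetical sample values (not from the original data).
samples = np.array([0.0, 0.05, 0.1, 0.25, 0.5, 0.89, 0.9, 0.99])

groups = np.digitize(samples, edges)
print(groups.tolist())  # [0, 0, 1, 2, 5, 8, 9, 9]
```

Each boundary value belongs to the upper group: 0.1 maps to 1 and 0.9 maps to 9.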
First, I stored the columns I wanted to group in the column_names variable.
Then I stored the name of the group column for each of them in the group_names variable.
column_names = ['EState_VSA2', 'HallKierAlpha', 'MaxAbsEStateIndex', 'MinEStateIndex', 'PEOE_VSA10', 'PEOE_VSA14', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'frCOO_family', 'bertzCT_family_ext']
group_names = ['EState_VSA2_group', 'HallKierAlpha_group', 'MaxAbsEStateIndex_group', 'MinEStateIndex_group', 'PEOE_VSA10_group', 'PEOE_VSA14_group', 'PEOE_VSA6_group', 'PEOE_VSA7_group', 'PEOE_VSA8_group', 'frCOO_family_group', 'bertzCT_family_ext_group']
Next, I combined column_names and group_names into a dictionary.
column_group_names = dict(zip(column_names, group_names))
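Since every group name is just the column name plus a `_group` suffix, the same dictionary could probably be built without typing the second list by hand. A minimal sketch, using a shortened column list for brevity:

```python
# Shortened list for illustration; the post uses 11 columns.
column_names = ['EState_VSA2', 'HallKierAlpha', 'PEOE_VSA10']

# Derive each group column name by appending the "_group" suffix.
column_group_names = {name: f"{name}_group" for name in column_names}

print(column_group_names)
```

This yields the same mapping as `dict(zip(column_names, group_names))` as long as the naming pattern holds for every column.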
Iterating over this dictionary, I assigned each column's values to groups according to the rule described above.
for column_name, group_name in column_group_names.items():
    normalized_train[group_name] = 0
    normalized_train.loc[normalized_train[column_name] < 0.1, group_name] = 0
    normalized_train.loc[(normalized_train[column_name] >= 0.1) & (normalized_train[column_name] < 0.2), group_name] = 1
    normalized_train.loc[(normalized_train[column_name] >= 0.2) & (normalized_train[column_name] < 0.3), group_name] = 2
    normalized_train.loc[(normalized_train[column_name] >= 0.3) & (normalized_train[column_name] < 0.4), group_name] = 3
    normalized_train.loc[(normalized_train[column_name] >= 0.4) & (normalized_train[column_name] < 0.5), group_name] = 4
    normalized_train.loc[(normalized_train[column_name] >= 0.5) & (normalized_train[column_name] < 0.6), group_name] = 5
    normalized_train.loc[(normalized_train[column_name] >= 0.6) & (normalized_train[column_name] < 0.7), group_name] = 6
    normalized_train.loc[(normalized_train[column_name] >= 0.7) & (normalized_train[column_name] < 0.8), group_name] = 7
    normalized_train.loc[(normalized_train[column_name] >= 0.8) & (normalized_train[column_name] < 0.9), group_name] = 8
    normalized_train.loc[(normalized_train[column_name] >= 0.9) & (normalized_train[column_name] < 1.0), group_name] = 9
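One thing to watch in the loop above: a value of exactly 1.0 satisfies none of the conditions, so it keeps the initial value 0. As a possible alternative, the ten `.loc` assignments could be replaced by a single `pd.cut` call per column. This is only a sketch on a toy DataFrame with made-up values; the open-ended last bin is my own assumption, chosen so that 1.0 falls into group 9, and it is not what the original code does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for normalized_train (values are made up for illustration).
toy = pd.DataFrame({
    "EState_VSA2": [0.05, 0.15, 0.95, 1.0],
    "HallKierAlpha": [0.0, 0.33, 0.5, 0.9],
})

# Left-closed bins [0.0, 0.1), ..., [0.8, 0.9), [0.9, inf).
# The open-ended last bin is an assumption so that 1.0 lands in group 9.
edges = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, np.inf]

for column_name in ["EState_VSA2", "HallKierAlpha"]:
    # labels=False returns the integer bin index instead of an interval label.
    toy[f"{column_name}_group"] = pd.cut(
        toy[column_name], bins=edges, labels=False, right=False
    )

print(toy)
```

With `right=False` each bin includes its left edge and excludes its right edge, which matches the `>=` / `<` comparisons in the loop. Values below 0.0 would become NaN here, so this sketch assumes the data really is within 0 to 1.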
Now that the values are grouped, I dropped the original columns.
normalized_train = normalized_train.drop(column_names, axis=1)
Printing the remaining columns gives:
normalized_train.head()
Each column now holds one of 10 integers (0 to 9) instead of a continuous value!
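Beyond eyeballing `head()`, a quick check can confirm that every group column contains only the integers 0 to 9. The sketch below uses a small hypothetical DataFrame named `grouped` in place of the real result.

```python
import pandas as pd

# Hypothetical grouped result (stand-in for the real normalized_train).
grouped = pd.DataFrame({
    "EState_VSA2_group": [0, 1, 9, 9],
    "HallKierAlpha_group": [0, 3, 5, 9],
})

# True only if every cell is one of the allowed group ids 0..9.
valid = bool(grouped.isin(list(range(10))).all().all())

# How many rows fall into each group for one column.
counts = grouped["EState_VSA2_group"].value_counts().sort_index()

print(valid)
print(counts)
```

Looking at `value_counts()` per column also shows whether the groups are badly unbalanced, which equal-width bins can produce when the data is skewed.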
Full code:
column_names = ['EState_VSA2', 'HallKierAlpha', 'MaxAbsEStateIndex', 'MinEStateIndex', 'PEOE_VSA10', 'PEOE_VSA14', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'frCOO_family', 'bertzCT_family_ext']
group_names = ['EState_VSA2_group', 'HallKierAlpha_group', 'MaxAbsEStateIndex_group', 'MinEStateIndex_group', 'PEOE_VSA10_group', 'PEOE_VSA14_group', 'PEOE_VSA6_group', 'PEOE_VSA7_group', 'PEOE_VSA8_group', 'frCOO_family_group', 'bertzCT_family_ext_group']
column_group_names = dict(zip(column_names, group_names))
for column_name, group_name in column_group_names.items():
    normalized_train[group_name] = 0
    normalized_train.loc[normalized_train[column_name] < 0.1, group_name] = 0
    normalized_train.loc[(normalized_train[column_name] >= 0.1) & (normalized_train[column_name] < 0.2), group_name] = 1
    normalized_train.loc[(normalized_train[column_name] >= 0.2) & (normalized_train[column_name] < 0.3), group_name] = 2
    normalized_train.loc[(normalized_train[column_name] >= 0.3) & (normalized_train[column_name] < 0.4), group_name] = 3
    normalized_train.loc[(normalized_train[column_name] >= 0.4) & (normalized_train[column_name] < 0.5), group_name] = 4
    normalized_train.loc[(normalized_train[column_name] >= 0.5) & (normalized_train[column_name] < 0.6), group_name] = 5
    normalized_train.loc[(normalized_train[column_name] >= 0.6) & (normalized_train[column_name] < 0.7), group_name] = 6
    normalized_train.loc[(normalized_train[column_name] >= 0.7) & (normalized_train[column_name] < 0.8), group_name] = 7
    normalized_train.loc[(normalized_train[column_name] >= 0.8) & (normalized_train[column_name] < 0.9), group_name] = 8
    normalized_train.loc[(normalized_train[column_name] >= 0.9) & (normalized_train[column_name] < 1.0), group_name] = 9
normalized_train = normalized_train.drop(column_names, axis=1)
normalized_train.head()