Splitting continuous DataFrame values into a handful of groups

While studying machine learning, I wanted to split data with continuous values between 0 and 1 into 10 groups.

This is the original data:

normalized_train.head()

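The values in each column are continuous and lie between 0 and 1.

What I wanted was to map

values in the range 0.0 <= x < 0.1 to 0,

values in the range 0.1 <= x < 0.2 to 1,

...

and values in the range 0.9 <= x < 1.0 to 9.

My reasoning was that this would make model training take less time.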
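As a quick sanity check of this rule, the group index can also be read off arithmetically: multiply by 10 and round down. The sketch below uses a made-up Series (`values` is hypothetical, not the data from this post) and additionally assumes that a value of exactly 1.0 should go into group 9, which the rule above leaves unspecified.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values between 0 and 1 (not the real training data).
values = pd.Series([0.0, 0.05, 0.1, 0.25, 0.95, 1.0])

# floor(x * 10) gives the index of the 0.1-wide bin that x falls into.
# clip(upper=9) is an assumption: it sends x == 1.0 to group 9 rather than 10.
groups = np.floor(values * 10).clip(upper=9).astype(int)

print(groups.tolist())  # [0, 0, 1, 2, 9, 9]
```

Because of floating-point rounding, values extremely close to a bin boundary might not always land in the same group as with explicit comparisons, so this is best treated as a check rather than a drop-in replacement.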

 

First, I put the columns I wanted to group into a column_names variable.

Then I put the name of the group column for each of them into a group_names variable.

column_names = ['EState_VSA2', 'HallKierAlpha', 'MaxAbsEStateIndex', 'MinEStateIndex', 'PEOE_VSA10', 'PEOE_VSA14', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'frCOO_family', 'bertzCT_family_ext']
group_names = ['EState_VSA2_group', 'HallKierAlpha_group', 'MaxAbsEStateIndex_group', 'MinEStateIndex_group', 'PEOE_VSA10_group', 'PEOE_VSA14_group', 'PEOE_VSA6_group', 'PEOE_VSA7_group', 'PEOE_VSA8_group', 'frCOO_family_group', 'bertzCT_family_ext_group']

Next, I combined column_names and group_names into a dictionary.

column_group_names = dict(zip(column_names, group_names))
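Since every group name here is just the column name plus a `_group` suffix, the same dictionary could presumably be built without maintaining a second list by hand. A minimal sketch, using a shortened column list for illustration:

```python
# Shortened list for illustration; the post uses eleven columns.
column_names = ['EState_VSA2', 'HallKierAlpha', 'PEOE_VSA10']

# Derive each group column name by appending the '_group' suffix.
column_group_names = {name: f'{name}_group' for name in column_names}

print(column_group_names)
```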

 

 

I then iterated over this dictionary and split each column into groups according to the rule above.

for column_name, group_name in column_group_names.items():
    # Start every row in group 0; the lines below overwrite it where needed.
    normalized_train[group_name] = 0

    normalized_train.loc[normalized_train[column_name] < 0.1, group_name] = 0
    normalized_train.loc[(normalized_train[column_name] >= 0.1) & (normalized_train[column_name] < 0.2), group_name] = 1
    normalized_train.loc[(normalized_train[column_name] >= 0.2) & (normalized_train[column_name] < 0.3), group_name] = 2
    normalized_train.loc[(normalized_train[column_name] >= 0.3) & (normalized_train[column_name] < 0.4), group_name] = 3
    normalized_train.loc[(normalized_train[column_name] >= 0.4) & (normalized_train[column_name] < 0.5), group_name] = 4
    normalized_train.loc[(normalized_train[column_name] >= 0.5) & (normalized_train[column_name] < 0.6), group_name] = 5
    normalized_train.loc[(normalized_train[column_name] >= 0.6) & (normalized_train[column_name] < 0.7), group_name] = 6
    normalized_train.loc[(normalized_train[column_name] >= 0.7) & (normalized_train[column_name] < 0.8), group_name] = 7
    normalized_train.loc[(normalized_train[column_name] >= 0.8) & (normalized_train[column_name] < 0.9), group_name] = 8
    # No upper bound on the last bin: a value of exactly 1.0 belongs in group 9,
    # but with "< 1.0" it would silently keep the default group 0.
    normalized_train.loc[normalized_train[column_name] >= 0.9, group_name] = 9
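The same binning can likely be written more compactly with `pd.cut`. The following is only a sketch under a few assumptions: `df` and its two columns are made-up stand-ins for `normalized_train`, the last bin is left open-ended so that a value of exactly 1.0 goes into group 9, and the data has no missing or negative values (those would come out as NaN).

```python
import numpy as np
import pandas as pd

# Made-up stand-in for normalized_train (hypothetical values).
df = pd.DataFrame({
    'EState_VSA2': [0.0, 0.05, 0.1, 0.25, 0.95, 1.0],
    'HallKierAlpha': [0.5, 0.55, 0.75, 0.85, 0.15, 0.45],
})
column_group_names = {'EState_VSA2': 'EState_VSA2_group',
                      'HallKierAlpha': 'HallKierAlpha_group'}

# Left edges 0.0, 0.1, ..., 0.9 plus an open-ended upper edge.
# i / 10 reproduces the literals 0.1, 0.2, ... used in the comparisons.
bin_edges = [i / 10 for i in range(10)] + [np.inf]

for column_name, group_name in column_group_names.items():
    # right=False makes each bin closed on the left: [0.0, 0.1), [0.1, 0.2), ...
    # labels=False returns the integer bin index (0 to 9) instead of interval labels.
    df[group_name] = pd.cut(df[column_name], bins=bin_edges,
                            right=False, labels=False)

print(df['EState_VSA2_group'].tolist())    # [0, 0, 1, 2, 9, 9]
print(df['HallKierAlpha_group'].tolist())  # [5, 5, 7, 8, 1, 4]
```

Compared with the ten `.loc` assignments, this keeps the bin edges in one place, so changing the number of groups only means changing `bin_edges`.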

 

Now that the values were grouped, I dropped the original columns.

normalized_train = normalized_train.drop(column_names, axis=1)

Printing the remaining columns:

normalized_train.head()

Each column now holds one of ten integers (0 to 9) instead of a continuous value!

Full code:

column_names = ['EState_VSA2', 'HallKierAlpha', 'MaxAbsEStateIndex', 'MinEStateIndex', 'PEOE_VSA10', 'PEOE_VSA14', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'frCOO_family', 'bertzCT_family_ext']
group_names = ['EState_VSA2_group', 'HallKierAlpha_group', 'MaxAbsEStateIndex_group', 'MinEStateIndex_group', 'PEOE_VSA10_group', 'PEOE_VSA14_group', 'PEOE_VSA6_group', 'PEOE_VSA7_group', 'PEOE_VSA8_group', 'frCOO_family_group', 'bertzCT_family_ext_group']

# Map each source column to the name of its group column.
column_group_names = dict(zip(column_names, group_names))

for column_name, group_name in column_group_names.items():
    # Start every row in group 0; the lines below overwrite it where needed.
    normalized_train[group_name] = 0

    normalized_train.loc[normalized_train[column_name] < 0.1, group_name] = 0
    normalized_train.loc[(normalized_train[column_name] >= 0.1) & (normalized_train[column_name] < 0.2), group_name] = 1
    normalized_train.loc[(normalized_train[column_name] >= 0.2) & (normalized_train[column_name] < 0.3), group_name] = 2
    normalized_train.loc[(normalized_train[column_name] >= 0.3) & (normalized_train[column_name] < 0.4), group_name] = 3
    normalized_train.loc[(normalized_train[column_name] >= 0.4) & (normalized_train[column_name] < 0.5), group_name] = 4
    normalized_train.loc[(normalized_train[column_name] >= 0.5) & (normalized_train[column_name] < 0.6), group_name] = 5
    normalized_train.loc[(normalized_train[column_name] >= 0.6) & (normalized_train[column_name] < 0.7), group_name] = 6
    normalized_train.loc[(normalized_train[column_name] >= 0.7) & (normalized_train[column_name] < 0.8), group_name] = 7
    normalized_train.loc[(normalized_train[column_name] >= 0.8) & (normalized_train[column_name] < 0.9), group_name] = 8
    # No upper bound on the last bin: a value of exactly 1.0 belongs in group 9,
    # but with "< 1.0" it would silently keep the default group 0.
    normalized_train.loc[normalized_train[column_name] >= 0.9, group_name] = 9

# Drop the original continuous columns and inspect the result.
normalized_train = normalized_train.drop(column_names, axis=1)
normalized_train.head()

 

 

 
