[NLP][ML] String-Based Category Classification Model

1. Collecting the string data

This project was built on in-house data, so the data itself cannot be shared.

2. Text preprocessing

1) Separating Korean and English

Using the re module

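import re

# `text` is the raw input string.
# Extract runs of Hangul syllables and runs of ASCII letters separately.
korean_text = re.findall(r"[가-힣]+", text)
english_text = re.findall(r"[a-zA-Z]+", text)

print("Korean:", korean_text)
print("English:", english_text)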
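The two `re.findall` calls above can be wrapped in a small helper so that the same split is applied to every record. This is a minimal sketch: the function name `split_korean_english` and the sample string are illustrative and do not come from the original project.

```python
import re

# Precompiled patterns: runs of precomposed Hangul syllables and runs of ASCII letters.
KOREAN_RE = re.compile(r"[가-힣]+")
ENGLISH_RE = re.compile(r"[a-zA-Z]+")


def split_korean_english(text):
    """Return (korean_tokens, english_tokens) found in `text`."""
    return KOREAN_RE.findall(text), ENGLISH_RE.findall(text)


# Hypothetical sample input, for illustration only.
korean, english = split_korean_english("로그인 error 발생 login 실패")
print("Korean:", korean)
print("English:", english)
```

Note that digits and standalone jamo such as "ㅋㅋ" match neither pattern, so they are silently dropped by this split.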

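Korean text is tokenized and stripped of stopwords with the konlpy library.

In Korean, whitespace alone is not enough to separate morphemes, and context is hard to recover with simple rules.
konlpy relies on Java packages, so a JDK must be installed first. The Okt analyzer is used for stopword removal.
With large datasets the JVM can run out of memory. In that case it is important to preprocess the data as thoroughly as possible and to switch from Okt to Mecab, a lighter analyzer. Increasing the memory allocated to the JVM is another option, but it should be done with care.
Okt and Mecab are dictionary-based and are well suited to well-formed text (news, documents).
soynlp is well suited to informal text (SNS posts, comments) and to text without proper spacing.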

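Item	soynlp	Okt/Mecab
Analysis method	Unsupervised learning (statistical)	Dictionary-based
Spacing correction	Yes	No
Handling of new words	Strong	Weak
Speed	Fast	Medium
Accuracy	Data-dependent (can vary)	Stable, thanks to the dictionary
Grammatical analysis	No	Yes (POS tagging provided)
Recommended use	SNS, crawled data, neologism analysis	News, papers, legal documents, and other well-formed data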

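English text is tokenized and stripped of stopwords with the nltk and spaCy libraries.

In English, as long as spacing is handled properly, word boundaries are clear and many mature libraries exist, so stopword removal is straightforward.
Stopwords are removed with nltk's stopwords corpus.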

# nltk can conflict with gensim, and it shares dependencies with
# scipy, seaborn, and numpy, so take care when installing it.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
english_stop_words = set(stopwords.words('english'))

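2) Embedding (vectorizing the text)

The work was done on a laptop without a GPU, so only embedding methods that run on a CPU were used.

If a GPU environment is available, the models provided by Hugging Face are expected to improve accuracy.

TF-IDF: reflects the importance of words

A frequency-based measure of how important a given word is within a document.

Word2Vec: uses context information

The model learns semantic similarity between words.

TF-IDF + Word2Vec combined

Goes beyond word frequency and importance by also using context information, which allows finer-grained classification.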
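To make the frequency-based weighting of TF-IDF concrete, the sketch below computes it by hand on a hypothetical three-document corpus. It uses the smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, which is what scikit-learn's `TfidfVectorizer` applies with its default settings. The corpus and the function name `tfidf` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized corpus (not the in-house data).
docs = [
    ["login", "error", "login"],
    ["payment", "error"],
    ["payment", "refund", "request"],
]


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


for row in tfidf(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

Within a document, a term that occurs in fewer documents overall ("refund") gets a larger weight than one that occurs in many ("payment"), which is the "importance" the description above refers to.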

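3. Generating the target data

Building a category classification model requires data whose categories are already fully labeled, but no labeled data existed at this point. Two methods were devised to generate the target.

1) Rule-based category classification

Classification was carried out using the background knowledge of domain experts.

For data augmentation, recurring patterns were identified and turned into conditional statements that generate the target data.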
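A rule-based labeler of this kind can be sketched as an ordered list of conditions. The category names and keywords below are hypothetical placeholders; the actual rules came from domain experts and cannot be shared.

```python
# Hypothetical rules: (category, keywords). The first matching rule wins, so order matters.
RULES = [
    ("Login", ["login", "로그인", "password"]),
    ("Payment", ["payment", "결제", "refund"]),
    ("Network", ["timeout", "connection"]),
]


def assign_target(text):
    """Return the first category whose keyword occurs in `text`, or '' if none match."""
    lowered = text.lower()
    for category, keywords in RULES:
        if any(keyword in lowered for keyword in keywords):
            return category
    return ""  # left unlabeled


# Hypothetical sample strings, for illustration only.
samples = ["로그인 실패 (password reset)", "Refund request", "기타 문의"]
for s in samples:
    print(repr(s), "->", repr(assign_target(s)))
```

Returning an empty string for unmatched rows fits the split used in section 4, where rows with `target == ''` are set aside as unlabeled.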

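2) Unsupervised learning (clustering) with KMeans

Even when clustering with KMeans, an embedding method still had to be chosen for the data from which the target would be generated.

The comparison below uses the TF-IDF and Word2Vec methods from the preprocessing step above.

*) The test set consists of 200 records selected at random.

TF-IDF embedding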

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# 1. Input data (df_train holds the in-house data, which is not shared)
texts = df_train['String'].to_list()

# 2. TF-IDF vectorization (sentence -> numeric vector)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()

# 3. Dimensionality reduction (PCA to 2D, used only for plotting)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# 4. K-Means clustering on the full TF-IDF vectors (number of clusters fixed in advance)
num_clusters = 8
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
predicted_labels = kmeans.fit_predict(X)

# 5. Visualize the clustering result
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_reduced[:, 0], y=X_reduced[:, 1], hue=predicted_labels, palette="viridis", s=100)
for i, txt in enumerate(texts):
    # plt.annotate(txt, (X_reduced[i, 0], X_reduced[i, 1]), fontsize=9, alpha=0.75)
    plt.annotate(i, (X_reduced[i, 0], X_reduced[i, 1]), fontsize=9, alpha=0.75)
plt.title("TF-IDF Embeddings - KMeans Clustering")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.legend(title="Cluster")
plt.show()

# 6. Evaluate the clustering (Silhouette Score)
silhouette_avg = silhouette_score(X, predicted_labels)
print(f"Silhouette Score: {silhouette_avg:.3f}")

Silhouette Score: 0.017

Word2Vec

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Input text data
texts = df_train['String'].to_list()

# 1. Preprocessing and tokenization
# word_tokenize and stopwords need NLTK data (e.g. nltk.download('punkt'), nltk.download('stopwords')).
stop_words = set(stopwords.words('english'))  # English stopword list, built once

def preprocess_text(text):
    words = word_tokenize(text.lower())  # lowercase, then tokenize
    # keep alphanumeric tokens that are not stopwords
    return [word for word in words if word.isalnum() and word not in stop_words]

tokenized_texts = [preprocess_text(text) for text in texts]

# 2. Train the Word2Vec model (sg=1 selects Skip-gram; sg=0 would be CBOW)
model = Word2Vec(sentences=tokenized_texts, vector_size=100, window=5, min_count=1, workers=4, sg=1)

# 3. Document vectors (mean of the word vectors in each document)
def document_vector(words, model):
    vectors = [model.wv[word] for word in words if word in model.wv]
    return np.mean(vectors, axis=0) if len(vectors) > 0 else np.zeros(model.vector_size)

X = np.array([document_vector(doc, model) for doc in tokenized_texts])

# 4. K-Means clustering
num_clusters = 9
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
predicted_labels = kmeans.fit_predict(X)

# 5. PCA dimensionality reduction and visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_reduced[:, 0], y=X_reduced[:, 1], hue=predicted_labels, palette="viridis", s=100)
for i, txt in enumerate(texts):
    plt.annotate(i, (X_reduced[i, 0], X_reduced[i, 1]), fontsize=9, alpha=0.75)
plt.title("Word2Vec Embeddings - KMeans Clustering")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.legend(title="Cluster")
plt.show()

from sklearn.metrics import silhouette_score

silhouette_avg = silhouette_score(X, predicted_labels)
print(f'Silhouette Score: {silhouette_avg:.3f}')

Silhouette Score: 0.253

TF-IDF + Word2Vec

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from gensim.models import Word2Vec
from konlpy.tag import Okt

texts = df_train['String'].to_list()

# 1. Input data
X = df_train['String']

# 2. TF-IDF vectorization (sentence -> numeric vector)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(X)

# TF-IDF vocabulary (term -> column index), used to look up weights
tfidf_vocab = vectorizer.vocabulary_

# Morphological analysis
okt = Okt()
tokenized_texts = [okt.morphs(text) for text in X]

# Train the Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_texts, vector_size=100, window=5, min_count=1, workers=4)

# Build a sentence vector from Word2Vec vectors, weighted by TF-IDF
def get_weighted_word2vec(sentence, model, tfidf_vocab, tfidf_matrix, idx):
    tokens = okt.morphs(sentence)
    weighted_vector = np.zeros(model.vector_size)
    for token in tokens:
        if token in model.wv and token in tfidf_vocab:
            tfidf_idx = tfidf_vocab[token]
            tfidf_weight = tfidf_matrix[idx, tfidf_idx]
            weighted_vector += model.wv[token] * tfidf_weight
    return weighted_vector

# Build the weighted vectors for all sentences
word2vec_vectors = np.array([
    get_weighted_word2vec(sentence, word2vec_model, tfidf_vocab, tfidf_matrix, idx)
    for idx, sentence in enumerate(X)
])

# Concatenate the TF-IDF and Word2Vec features
combined_features = np.hstack([tfidf_matrix.toarray(), word2vec_vectors])

# 3. Dimensionality reduction (PCA to 2D, used only for plotting)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(combined_features)

# 4. K-Means clustering (number of clusters fixed in advance)
num_clusters = 8
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
predicted_labels = kmeans.fit_predict(combined_features)

# 5. Visualize the clustering result
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_reduced[:, 0], y=X_reduced[:, 1], hue=predicted_labels, palette="viridis", s=100)
for i, txt in enumerate(texts):
    # plt.annotate(txt, (X_reduced[i, 0], X_reduced[i, 1]), fontsize=9, alpha=0.75)
    plt.annotate(i, (X_reduced[i, 0], X_reduced[i, 1]), fontsize=9, alpha=0.75)
plt.title("TF-IDF Word2Vec Embeddings - KMeans Clustering")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.legend(title="Cluster")
plt.show()

# 6. Evaluate the clustering (Silhouette Score)
silhouette_avg = silhouette_score(combined_features, predicted_labels)
print(f"Silhouette Score: {silhouette_avg:.3f}")

Silhouette Score: 0.265
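One detail worth checking in the combined approach is that the TF-IDF vocabulary and the Word2Vec tokens come from different tokenizers. With default settings, `TfidfVectorizer` lowercases the text and keeps only tokens of two or more word characters (token pattern `(?u)\b\w\w+\b`), while `okt.morphs` returns morphemes. Any morpheme that is missing from the TF-IDF vocabulary is skipped inside `get_weighted_word2vec`. The sketch below reproduces the default pattern with `re`; the morpheme list is written by hand as a stand-in for `okt.morphs` output, so treat it as an assumption.

```python
import re

# Default token pattern of scikit-learn's TfidfVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tfidf_default_tokens(text):
    """Tokens TfidfVectorizer would keep with default settings (lowercased, 2+ chars)."""
    return TOKEN_PATTERN.findall(text.lower())


# Hypothetical sentence, for illustration only.
sentence = "앱 로그인이 안 됨"
# Hand-written stand-in for okt.morphs(sentence); real Okt output may differ.
morphemes = ["앱", "로그인", "이", "안", "됨"]

vocab = set(tfidf_default_tokens(sentence))
covered = [m for m in morphemes if m in vocab]
skipped = [m for m in morphemes if m not in vocab]
print("TF-IDF tokens:", sorted(vocab))
print("Morphemes with a TF-IDF weight:", covered)
print("Morphemes skipped:", skipped)
```

If this mismatch turns out to matter on the real data, one option is to give both steps the same tokenizer, for example through the `tokenizer` parameter of `TfidfVectorizer`, so that the two vocabularies stay aligned.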

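For a clustering model, a Silhouette Score of 0.5 or higher is generally taken to indicate a reliable model, and a score of at least 0.3 is needed before a model can be adopted with some overlap between clusters tolerated. None of the three embeddings reached 0.3, so the clustering results could not be used.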
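For reference, the silhouette value of a single point is s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The reported score is the mean of s over all points. The following is a minimal pure-Python sketch on hypothetical 2-D points, not on the embeddings used above.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so dividing by len - 1 is correct).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical points: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print("good labels:", round(silhouette(points, [0, 0, 1, 1]), 3))
print("bad labels: ", round(silhouette(points, [0, 1, 0, 1]), 3))
```

Labels that follow the two groups give a score close to 1, while labels that cut across them give a negative score. Scores near 0, like the 0.017 obtained with TF-IDF, mean the clusters overlap heavily.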

[Appendix]

The main hyperparameter of the KMeans model is the number of clusters, n_clusters. The Elbow method is used to find the best number of clusters for the data.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

distortions = []
K_range = range(1, 15)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    distortions.append(kmeans.inertia_)

plt.plot(K_range, distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion (Inertia)')
plt.title('Elbow Method for Optimal k')
plt.show()

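In the plot produced above, the point where the slope suddenly flattens is the best value for the hyperparameter.

On the plot, a value between 4 and 6 is optimal, but the data requires a finer-grained category split, so n_clusters was set to a value between 8 and 10.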
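Reading the elbow off a plot is subjective. One simple heuristic, offered here as an assumption rather than the method used in this post, is to pick the k where the inertia curve bends most sharply, that is, where the second difference of the inertia values is largest. The inertia values below are made up for illustration.

```python
def elbow_k(k_values, inertias):
    """Return the k with the largest second difference of inertia (sharpest bend).

    Assumes k_values are evenly spaced and sorted in increasing order.
    """
    if len(k_values) != len(inertias) or len(inertias) < 3:
        raise ValueError("need at least three (k, inertia) pairs")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        # Drop before k minus drop after k: large when the curve flattens at k.
        bend = (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k


# Hypothetical inertia values for k = 1..8 (not measured on the project data).
ks = list(range(1, 9))
inertias = [1000, 700, 420, 150, 120, 100, 85, 75]
print("Elbow at k =", elbow_k(ks, inertias))
```

As the text notes, the elbow is only a starting point: domain requirements, such as needing finer-grained categories, can justify a larger n_clusters.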

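Because the clustering results were unusable, the rule-based category classification, labeled directly on the basis of domain experts' input, was used to build the prediction model.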

4. Building the prediction model

Reference: Comparative Analysis of Machine Learning Algorithms for Email Phishing Detection Using TF-IDF, Word2Vec, and BERT

Model: LogisticRegression
Hyperparameter selection: Grid Search
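Grid search fits one model per combination of the listed values, so it helps to know how many combinations there are and which ones the estimator will accept. In scikit-learn's `LogisticRegression`, at least in the versions that take these penalty strings, the `liblinear` solver supports the `l1` and `l2` penalties, `saga` additionally supports `elasticnet`, and `elasticnet` also requires `l1_ratio` to be set. The pure-Python sketch below enumerates a grid with the same values as the one used in this section and counts the combinations that can actually be fitted. The compatibility table is hand-written and covers only these two solvers.

```python
from itertools import product

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2", "elasticnet"],
    "solver": ["liblinear", "saga"],
    "max_iter": [100, 200, 500],
}

# Penalties each solver accepts (hand-written; only the two solvers used here).
SUPPORTED = {
    "liblinear": {"l1", "l2"},
    "saga": {"l1", "l2", "elasticnet"},
}


def is_fittable(params, l1_ratio=None):
    """True if the penalty/solver pair (and l1_ratio) is an accepted combination."""
    if params["penalty"] not in SUPPORTED[params["solver"]]:
        return False
    # elasticnet needs an explicit l1_ratio; the grid above does not provide one.
    if params["penalty"] == "elasticnet" and l1_ratio is None:
        return False
    return True


names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]
fittable = [c for c in combos if is_fittable(c)]
print(f"{len(combos)} combinations, {len(fittable)} fittable, "
      f"{len(combos) - len(fittable)} not fittable")
```

With `error_score=np.nan`, the combinations that cannot be fitted are scored as NaN instead of aborting the search. Passing `param_grid` as a list of dictionaries, one per solver, is another way to avoid spending time on them.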

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from konlpy.tag import Okt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Prepare the data
# Split into train and test sets: rows with an empty target are unlabeled
train = df_total[~(df_total['target'] == '')]
test = df_total[(df_total['target'] == '')]

# Separate the text and the labels
X = train["issue-title"]
y = train["target"]

# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(X)

# TF-IDF vocabulary (term -> column index), used to look up weights
tfidf_vocab = tfidf_vectorizer.vocabulary_

# Morphological analysis
okt = Okt()
tokenized_texts = [okt.morphs(text) for text in X]

# Train the Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_texts, vector_size=100, window=5, min_count=1, workers=4)

# Build a sentence vector from Word2Vec vectors, weighted by TF-IDF
def get_weighted_word2vec(sentence, model, tfidf_vocab, tfidf_matrix, idx):
    tokens = okt.morphs(sentence)
    weighted_vector = np.zeros(model.vector_size)
    for token in tokens:
        if token in model.wv and token in tfidf_vocab:
            tfidf_idx = tfidf_vocab[token]
            tfidf_weight = tfidf_matrix[idx, tfidf_idx]
            weighted_vector += model.wv[token] * tfidf_weight
    return weighted_vector

# Build the weighted vectors for all sentences
word2vec_vectors = np.array([
    get_weighted_word2vec(sentence, word2vec_model, tfidf_vocab, tfidf_matrix, idx)
    for idx, sentence in enumerate(X)
])

# Concatenate the TF-IDF and Word2Vec features
combined_features = np.hstack([tfidf_matrix.toarray(), word2vec_vectors])

# Split the labeled data into training and hold-out sets
X_train, X_test, y_train, y_test = train_test_split(combined_features, y, test_size=0.2, random_state=42)


# Grid search
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],             # inverse regularization strength (smaller = stronger)
    'penalty': ['l1', 'l2', 'elasticnet'],    # regularization type
    'solver': ['liblinear', 'saga'],          # optimization algorithm
    'max_iter': [100, 200, 500]               # maximum number of iterations
}

# max_iter set here is overridden by the values in param_grid
model = LogisticRegression(max_iter=1000)

# GridSearchCV setup
grid_lr = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='accuracy',   # evaluation metric
    cv=5,                 # number of cross-validation folds
    error_score=np.nan,  # score failed combinations (e.g. unsupported penalty/solver pairs) as NaN
    verbose=1,            # verbosity level
    n_jobs=-1             # parallelism (use all CPUs)
)

# Fit the model
grid_lr.fit(X_train, y_train)

# Validation score on the hold-out set
grid_lr.score(X_test, y_test)

# Predict on the hold-out set
y_pred = grid_lr.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Detailed evaluation report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy : 0.98

Class	Precision	Recall	F1-score	Support
Text1	1.00	1.00	1.00	17
Text2	1.00	1.00	1.00	23
Text3	1.00	1.00	1.00	13
Text4	1.00	0.94	0.97	47
Text5	0.91	1.00	0.95	29

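• Precision: of the samples the model predicted as 'True', the fraction that are actually 'True'
• Recall: of the samples that are actually 'True', the fraction the model predicted as 'True'
• F1-Score: the harmonic mean of Precision and Recall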
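These definitions can be checked by hand. The post does not show the confusion matrix, so the sketch below assumes one pattern that is consistent with the reported table: every sample is classified correctly except 3 Text4 samples that are predicted as Text5. The per-class supports are taken from the table; everything else is an assumption.

```python
from collections import Counter

support = {"Text1": 17, "Text2": 23, "Text3": 13, "Text4": 47, "Text5": 29}

# Assumed labels: all correct, except 3 Text4 samples predicted as Text5.
y_true, y_pred = [], []
for label, n in support.items():
    y_true += [label] * n
    y_pred += [label] * n
for i in range(3):
    y_pred[y_true.index("Text4") + i] = "Text5"


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed one-vs-rest."""
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    actual = Counter(y_true)
    metrics = {}
    for label in sorted(actual):
        precision = tp[label] / predicted[label] if predicted[label] else 0.0
        recall = tp[label] / actual[label]
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.2f}")
for label, (p, r, f1) in per_class_metrics(y_true, y_pred).items():
    print(f"{label}\t{p:.2f}\t{r:.2f}\t{f1:.2f}\t{support[label]}")
```

Under this assumption the rounded values match the table: Text4 has recall 44/47 and Text5 has precision 29/32, and overall accuracy is 126/129.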

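[Limitations]

The model's accuracy is high, but the labeled data is skewed toward a single class, so the results are not evenly distributed across classes and the reliability of the model is reduced.

The causes behind the categories overlap, which makes classification difficult with the collected data alone. More data needs to be collected to improve accuracy.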
