English material follows Korean material.
Last time, we looked at how to use PyTorch in Google Colaboratory.
https://not-a-robot.tistory.com/11
PyTorch Tutorial (1): Getting Started with Google Colab
This time, we will look at how to build a Deep Neural Network (DNN) using a sample dataset.
The dataset comes from Kaggle. Let's see how well ChatGPT handles it.

Here is a step-by-step walkthrough of solving the Kaggle "House Prices - Advanced Regression Techniques" problem with PyTorch.
Step 1: Data Preparation
1.1 Load and Explore the Data
- Load the data with pandas, then check for missing values and data types.
import torch
import pandas as pd
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using Device : {device}")
direc = "./"
train_data = pd.read_csv(direc + "train.csv")
test_data = pd.read_csv(direc + "test.csv")
# Check data info and summary statistics
print(train_data.info())
print(train_data.describe())
1.2 Handle Missing Values
- Numerical features: fill missing values with the median.
- Categorical features: fill missing values with the most frequent value (mode) or 'Unknown'.
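Before committing to an imputation strategy, it helps to see which columns are actually missing values. A quick check using nothing beyond standard pandas:
# Columns with missing values, sorted by count
missing = train_data.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False).head(10))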
1.3 Data Preprocessing
- Numerical features: standardize with StandardScaler.
- Categorical features: one-hot encode with OneHotEncoder.
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
import numpy as np
# Separate features and target; log1p-transform the target
# (the competition metric is computed on log prices)
X = train_data.drop(columns=["SalePrice", "Id"])
y = np.log1p(train_data["SalePrice"])
# Separate numerical and categorical features
num_features = X.select_dtypes(include=['int64', 'float64']).columns
cat_features = X.select_dtypes(include=['object']).columns
# Preprocessing pipeline for numerical data
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# Preprocessing pipeline for categorical data
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine the preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ])
# Apply the preprocessing
X_preprocessed = preprocessor.fit_transform(X)
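Because OneHotEncoder emits sparse output by default, ColumnTransformer typically returns a SciPy sparse matrix here, which is why .toarray() appears before the tensor conversion below. A quick sanity check (the exact column count depends on how many categories the data contains):
# One-hot encoding expands the feature count well beyond the original 79 columns
print(type(X_preprocessed))  # usually a scipy.sparse matrix for this dataset
print(X_preprocessed.shape)  # (n_samples, n_features_after_encoding)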
1.4 Split the Data
- Split into training and validation sets at an 80:20 ratio.
- Convert the data to tensors for use with PyTorch's DataLoader.
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Dataset, TensorDataset
# Train-validation split (fixed seed for reproducibility)
X_train, X_valid, y_train, y_valid = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)
# Convert to tensors (created directly on the target device)
X_train_tensor = torch.tensor(X_train.toarray(), dtype=torch.float32, device=device)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32, device=device).view(-1, 1)
X_valid_tensor = torch.tensor(X_valid.toarray(), dtype=torch.float32, device=device)
y_valid_tensor = torch.tensor(y_valid.values, dtype=torch.float32, device=device).view(-1, 1)
# Create the DataLoaders; the tensors already live on `device`, so extra
# worker processes are unnecessary (CUDA tensors cannot be shared with num_workers > 0)
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
valid_dataset = TensorDataset(X_valid_tensor, y_valid_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=32)
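To confirm the loaders produce what the model expects, pull a single batch and check its shapes; this is just a sanity check:
# One batch: features are (batch, n_features), targets are (batch, 1)
X_batch, y_batch = next(iter(train_loader))
print(X_batch.shape, y_batch.shape, X_batch.device)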
Step 2: Define the Model
Building the Model with Dropout
- Dropout helps prevent overfitting.
- The network is built with nn.Sequential.
import torch.nn as nn
from torchinfo import summary
# The input dimension comes from the preprocessed feature matrix
input_dim = X_train_tensor.shape[1]
# Build the model and move it to the device
model = nn.Sequential(
    nn.Linear(input_dim, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.3),  # Dropout
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(0.3),  # Dropout
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Dropout(0.2),  # Dropout
    nn.Linear(32, 1)  # Output layer
).to(device)
summary(model)
Step 3: Define the Loss Function and Optimizer
- Loss function: MSELoss (mean squared error). Because the target was log1p-transformed, minimizing MSE here tracks the competition's log-scale RMSE metric.
- Optimizer: Adam.
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
Step 4: Train the Model
- Loop over the epochs, periodically printing the training and validation loss.
def train_model(model, train_loader, valid_loader, criterion, optimizer, epochs=100):
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        # Report progress every 100 epochs
        if epoch % 100 == 0:
            # Validation phase
            model.eval()
            valid_loss = 0.0
            with torch.no_grad():
                for X_batch, y_batch in valid_loader:
                    y_pred = model(X_batch)
                    loss = criterion(y_pred, y_batch)
                    valid_loss += loss.item()
            print(f"Epoch {epoch+1:4d}/{epochs:4d}, Train Loss: {train_loss/len(train_loader):.4f}, Valid Loss: {valid_loss/len(valid_loader):.4f}")
# Train the model
train_model(model, train_loader, valid_loader, criterion, optimizer, epochs=501)
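Because the model was trained on log1p(SalePrice), its outputs live in log space. Here is a minimal sketch of checking validation error back on the original price scale; np.expm1 inverts np.log1p:
# Convert validation predictions back to prices and compute RMSE
model.eval()
with torch.no_grad():
    valid_pred_log = model(X_valid_tensor).cpu().numpy()
valid_pred = np.expm1(valid_pred_log).flatten()
valid_true = np.expm1(y_valid_tensor.cpu().numpy()).flatten()
rmse = np.sqrt(np.mean((valid_pred - valid_true) ** 2))
print(f"Validation RMSE (original scale): {rmse:,.0f}")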
Step 5: Predict on the Test Data and Create the Submission File
- Apply the same preprocessing to the test data and generate predictions.
# Preprocess the test data (create the tensor on the same device as the model)
X_test = test_data.drop(columns=["Id"])
X_test_preprocessed = preprocessor.transform(X_test)
X_test_tensor = torch.tensor(X_test_preprocessed.toarray(), dtype=torch.float32, device=device)
# Predict; expm1 undoes the log1p transform to recover actual prices
model.eval()
with torch.no_grad():
    test_predictions = np.expm1(model(X_test_tensor).cpu().numpy())
# Create the submission file
submission = pd.DataFrame({
    "Id": test_data["Id"],
    "SalePrice": test_predictions.flatten()
})
submission.to_csv("submission.csv", index=False)
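If you want to reuse the trained network later without retraining, saving the state dict is the usual PyTorch pattern (the filename house_dnn.pt is arbitrary):
# Save the learned weights (the architecture is rebuilt separately on load)
torch.save(model.state_dict(), "house_dnn.pt")
# Later: recreate the same nn.Sequential, then
# model.load_state_dict(torch.load("house_dnn.pt", map_location=device))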
Summary
- Load and preprocess the data (handle missing values, encode, scale).
- Convert to PyTorch tensors and create DataLoaders.
- Define a DNN model with dropout.
- Set up the loss function and optimizer, then train.
- Predict on the test data and create the submission file.
Here’s the same explanation in English, detailing the step-by-step process for solving the Kaggle "House Prices - Advanced Regression Techniques" problem using PyTorch.
Step 1: Data Preparation
1.1 Load and Explore the Data
- Use pandas to load the data and check for missing values and data types.
import pandas as pd
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
# Check data info and summary statistics
print(train_data.info())
print(train_data.describe())
1.2 Handle Missing Values
- Numerical features: Fill missing values with the median.
- Categorical features: Fill missing values with the most frequent value (mode) or 'Unknown'.
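The pipeline below uses the mode (most_frequent); if you prefer the 'Unknown' placeholder mentioned above, SimpleImputer supports that directly. A sketch of the alternative, which could be swapped into the categorical pipeline:
from sklearn.impute import SimpleImputer
# Alternative: fill categorical NaNs with a literal 'Unknown' category
cat_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')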
1.3 Data Preprocessing
- Numerical features: Use StandardScaler to standardize the data.
- Categorical features: Use OneHotEncoder for one-hot encoding.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Separate features and target first, so Id and SalePrice are excluded
X = train_data.drop(columns=["SalePrice", "Id"])
y = train_data["SalePrice"]
# Separate numerical and categorical features
num_features = X.select_dtypes(include=['int64', 'float64']).columns
cat_features = X.select_dtypes(include=['object']).columns
# Pipeline for numerical data
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# Pipeline for categorical data
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ])
# Apply preprocessing
X_preprocessed = preprocessor.fit_transform(X)
1.4 Split the Data
- Split the data into training and validation sets (80:20).
- Convert the data into PyTorch tensors for use with DataLoader.
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import DataLoader, TensorDataset
# Train-validation split
X_train, X_valid, y_train, y_valid = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)
# Convert to tensors
X_train_tensor = torch.tensor(X_train.toarray(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1)
X_valid_tensor = torch.tensor(X_valid.toarray(), dtype=torch.float32)
y_valid_tensor = torch.tensor(y_valid.values, dtype=torch.float32).view(-1, 1)
# Create DataLoader for batching
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
valid_dataset = TensorDataset(X_valid_tensor, y_valid_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=32)
Step 2: Define the Model
Add Dropout to Prevent Overfitting
- Dropout randomly disables some neurons during training, improving generalization.
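Dropout is only active in training mode; in eval mode it becomes a pass-through. A small standalone demonstration (the tensor values are purely illustrative):
import torch
import torch.nn as nn
drop = nn.Dropout(0.5)
x = torch.ones(1, 8)
drop.train()
print(drop(x))  # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2
drop.eval()
print(drop(x))  # identical to the input: dropout is disabled at inference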
import torch.nn as nn
# The input dimension comes from the preprocessed feature matrix
input_dim = X_train_tensor.shape[1]
# Build the model; nn.Sequential already returns a ready-to-use module
model = nn.Sequential(
    nn.Linear(input_dim, 128),
    nn.ReLU(),
    nn.Dropout(0.3),  # Dropout added
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(0.3),  # Dropout added
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(0.2),  # Dropout added
    nn.Linear(32, 1)  # Output layer
)
Step 3: Define Loss Function and Optimizer
- Use Mean Squared Error (MSE) as the loss function for regression tasks.
- Use the Adam optimizer for efficient gradient descent.
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
Step 4: Train the Model
- Train the model using the training dataset and validate it on the validation dataset.
- Print training and validation losses at each epoch.
def train_model(model, train_loader, valid_loader, criterion, optimizer, epochs=100):
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        # Validation phase
        model.eval()
        valid_loss = 0
        with torch.no_grad():
            for X_batch, y_batch in valid_loader:
                y_pred = model(X_batch)
                loss = criterion(y_pred, y_batch)
                valid_loss += loss.item()
        # Print progress
        print(f"Epoch {epoch+1}/{epochs}, Train Loss: {train_loss/len(train_loader):.4f}, Valid Loss: {valid_loss/len(valid_loader):.4f}")
# Train the model
train_model(model, train_loader, valid_loader, criterion, optimizer, epochs=50)
Step 5: Predict on Test Data and Create Submission File
- Preprocess the test data, make predictions, and save them as a CSV file for Kaggle submission.
# Preprocess test data
X_test = test_data.drop(columns=["Id"])
X_test_preprocessed = preprocessor.transform(X_test)
X_test_tensor = torch.tensor(X_test_preprocessed.toarray(), dtype=torch.float32)
# Predict
model.eval()
with torch.no_grad():
    test_predictions = model(X_test_tensor).numpy()
# Create submission file
submission = pd.DataFrame({
    "Id": test_data["Id"],
    "SalePrice": test_predictions.flatten()
})
submission.to_csv("submission.csv", index=False)
Summary of Steps
- Load and preprocess data: Handle missing values, standardize numerical features, and one-hot encode categorical features.
- Convert to tensors: Prepare data for PyTorch using DataLoader.
- Define the model: Build a deep neural network with Dropout layers to prevent overfitting.
- Train the model: Use MSE loss and the Adam optimizer to train the model.
- Predict and submit: Make predictions on the test set and save them in the required format.
If you have any further questions or need clarification, feel free to ask! 😊