
PyTorch Tutorial (2): House Price Prediction with a Deep Neural Network (DNN)


The same walkthrough in English follows at the end of this post.

Last time, we looked at how to use PyTorch in Google Colaboratory:

https://not-a-robot.tistory.com/11

 


 

This time, let's look at how to build a Deep Neural Network (DNN) using a sample dataset.

The dataset comes from Kaggle. Let's see what ChatGPT can do.


House price prediction tutorial. Source: GitHub MuhammadMooazam

Here is a step-by-step walkthrough of solving the Kaggle "House Prices - Advanced Regression Techniques" problem with PyTorch.


Step 1: Data Preparation

1.1 Load and Explore the Data

  • Use pandas to load the data, then check for missing values and data types.
 
import torch
import pandas as pd

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using Device : {device}")

direc = "./"
train_data = pd.read_csv(direc + "train.csv")
test_data = pd.read_csv(direc + "test.csv")

# Check data info and summary statistics
print(train_data.info())
print(train_data.describe())

1.2 Handle Missing Values

  • Numerical features: fill missing values with the median.
  • Categorical features: fill missing values with the most frequent value (mode) or 'Unknown'. (A quick way to inspect the affected columns is shown below.)
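
Before choosing an imputation strategy, it helps to see which columns actually contain missing values. A minimal check using the train_data frame loaded above:

# List columns with missing values, most affected first
missing = train_data.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])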

1.3 Data Preprocessing

  • Numerical features: standardize with StandardScaler.
  • Categorical features: one-hot encode with OneHotEncoder.
 
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
import numpy as np

# Select columns: drop the target and Id, and log-transform the right-skewed SalePrice target
X = train_data.drop(columns=["SalePrice", "Id"])
y = np.log1p(train_data["SalePrice"])

# Separate numerical and categorical features
num_features = X.select_dtypes(include=['int64', 'float64']).columns
cat_features = X.select_dtypes(include=['object']).columns

# Preprocessing pipeline for numerical features
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing pipeline for categorical features
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both pipelines into a single transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ])

# Fit the preprocessing on the training features and transform them
X_preprocessed = preprocessor.fit_transform(X)
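
One-hot encoding expands the feature space considerably, and the transformer returns a SciPy sparse matrix here, which is why the code below converts it with .toarray(). A quick shape check confirms the dimensionality:

# Rows x columns after imputation, scaling, and one-hot encoding
print(X_preprocessed.shape)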

1.4 Split the Data

  • Split into training and validation sets at an 80:20 ratio.
  • Convert the data to tensors for use with PyTorch's DataLoader.
 
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Dataset, TensorDataset

# Train-validation split (fixed seed for reproducibility)
X_train, X_valid, y_train, y_valid = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

# Convert to tensors (the sparse matrices need .toarray() first) and move them to the device
X_train_tensor = torch.tensor(X_train.toarray(), dtype=torch.float32, device=device)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32, device=device).view(-1, 1)
X_valid_tensor = torch.tensor(X_valid.toarray(), dtype=torch.float32, device=device)
y_valid_tensor = torch.tensor(y_valid.values, dtype=torch.float32, device=device).view(-1, 1)

# Create DataLoaders. The tensors already live on `device`, so multiprocessing
# workers would fail with CUDA tensors; keep the default num_workers=0.
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
valid_dataset = TensorDataset(X_valid_tensor, y_valid_tensor)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=32)
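
As a sanity check, pull a single batch from the loader and confirm the tensor shapes:

# One batch: features [batch_size, n_features], targets [batch_size, 1]
xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)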

Step 2: Model Definition

Building the Model with Dropout

  • Dropout helps prevent overfitting.
  • nn.Sequential is used to assemble the network.
import torch.nn as nn
from torchinfo import summary

# Model input dimension (number of features after preprocessing)
input_dim = X_train_tensor.shape[1]

# Build the model: Linear -> BatchNorm -> ReLU -> Dropout blocks
model = nn.Sequential(
    nn.Linear(input_dim, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.3),  # dropout against overfitting
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(0.3),  # dropout
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Dropout(0.2),  # dropout
    nn.Linear(32, 1)  # output layer
).to(device)

summary(model)

 

Step 3: Define the Loss Function and Optimizer

  • Loss function: MSELoss (mean squared error). Since the targets were log1p-transformed, minimizing MSE here amounts to optimizing a squared-log error, in the spirit of Kaggle's RMSLE-style leaderboard metric.
  • Optimizer: Adam.
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Step 4: Train the Model

  • Loop over the epochs; every 100 epochs, run a validation pass and print the training and validation losses.
 
def train_model(model, train_loader, valid_loader, criterion, optimizer, epochs=100):
    for epoch in range(epochs):
        # training phase
        model.train()
        train_loss = 0.0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        # Every 100 epochs, run a validation pass and report progress
        if epoch % 100 == 0:
            # validation
            model.eval()
            valid_loss = 0.0
            with torch.no_grad():
                for X_batch, y_batch in valid_loader:
                    y_pred = model(X_batch)
                    loss = criterion(y_pred, y_batch)
                    valid_loss += loss.item()
            
            print(f"Epoch {epoch+1:4d}/{epochs:4d}, Train Loss: {train_loss/len(train_loader):.4f}, Valid Loss: {valid_loss/len(valid_loader):.4f}")
            
# Train the model (with epochs=501, progress is printed at epochs 1, 101, ..., 501)
train_model(model, train_loader, valid_loader, criterion, optimizer, epochs=501)
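
Because SalePrice was log1p-transformed, the printed losses live in log space. As a rough check after training (a sketch, not part of the original recipe), the validation predictions can be mapped back with np.expm1 to get an RMSE in the original price units:

# Evaluate validation RMSE in the original price scale
model.eval()
with torch.no_grad():
    pred_log = model(X_valid_tensor).cpu().numpy().flatten()

pred_price = np.expm1(pred_log)                                # undo log1p
true_price = np.expm1(y_valid_tensor.cpu().numpy().flatten())
rmse = np.sqrt(np.mean((pred_price - true_price) ** 2))
print(f"Validation RMSE (original units): {rmse:,.0f}")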

Step 5: Predict on the Test Data and Create the Submission File

  • Apply the fitted preprocessing to the test data and generate predictions. Because the model was trained on log1p-transformed prices, the predictions must be mapped back with np.expm1 before writing the submission file.
# Preprocess the test data with the already-fitted preprocessor
X_test = test_data.drop(columns=["Id"])
X_test_preprocessed = preprocessor.transform(X_test)
X_test_tensor = torch.tensor(X_test_preprocessed.toarray(), dtype=torch.float32, device=device)

# Predict (move the results back to the CPU before converting to NumPy)
model.eval()
with torch.no_grad():
    test_predictions = model(X_test_tensor).cpu().numpy()

# Create the submission file; undo the log1p transform applied to the targets
submission = pd.DataFrame({
    "Id": test_data["Id"],
    "SalePrice": np.expm1(test_predictions.flatten())
})
submission.to_csv("submission.csv", index=False)

 

Summary

  1. Load and preprocess the data (handle missing values, encode, scale).
  2. Convert to PyTorch tensors and create DataLoaders.
  3. Define a DNN model with dropout.
  4. Set up the loss function and optimizer, then train.
  5. Predict on the test data and create the submission file.


Here’s the same explanation in English, detailing the step-by-step process for solving the Kaggle "House Prices - Advanced Regression Techniques" problem using PyTorch.


Step 1: Data Preparation

1.1 Load and Explore the Data

  • Use pandas to load the data and check for missing values and data types.
import pandas as pd

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# Check data info and summary statistics
print(train_data.info())
print(train_data.describe())

1.2 Handle Missing Values

  • Numerical features: Fill missing values with the median.
  • Categorical features: Fill missing values with the most frequent value (mode) or 'Unknown'.

1.3 Data Preprocessing

  • Numerical features: Use StandardScaler to standardize the data.
  • Categorical features: Use OneHotEncoder for one-hot encoding.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Separate the features and target first, so Id and SalePrice are excluded from preprocessing
X = train_data.drop(columns=["SalePrice", "Id"])
y = train_data["SalePrice"]

# Separate numerical and categorical features
num_features = X.select_dtypes(include=['int64', 'float64']).columns
cat_features = X.select_dtypes(include=['object']).columns

# Pipeline for numerical data
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Pipeline for categorical data
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ])

# Apply preprocessing
X_preprocessed = preprocessor.fit_transform(X)

1.4 Split the Data

  • Split the data into training and validation sets (80:20).
  • Convert the data into PyTorch tensors for use with DataLoader.
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import DataLoader, TensorDataset

# Train-validation split
X_train, X_valid, y_train, y_valid = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

# Convert to tensors
X_train_tensor = torch.tensor(X_train.toarray(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1)
X_valid_tensor = torch.tensor(X_valid.toarray(), dtype=torch.float32)
y_valid_tensor = torch.tensor(y_valid.values, dtype=torch.float32).view(-1, 1)

# Create DataLoader for batching
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
valid_dataset = TensorDataset(X_valid_tensor, y_valid_tensor)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=32)

Step 2: Define the Model

Add Dropout to Prevent Overfitting

  • Dropout randomly disables some neurons during training, improving generalization.
import torch.nn as nn

# The input dimension is the number of features after preprocessing,
# so it must be defined before the model is built
input_dim = X_train_tensor.shape[1]

model = nn.Sequential(
    nn.Linear(input_dim, 128),
    nn.ReLU(),
    nn.Dropout(0.3),  # Dropout added
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(0.3),  # Dropout added
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(0.2),  # Dropout added
    nn.Linear(32, 1)  # Output layer
)
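
The same network can equivalently be written as an nn.Module subclass, which scales better once the forward pass needs custom logic. A minimal sketch (the class name HousePriceNet is ours, not from the original tutorial):

class HousePriceNet(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        return self.net(x)

# Drop-in replacement for the Sequential model above:
# model = HousePriceNet(input_dim)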

Step 3: Define Loss Function and Optimizer

  • Use Mean Squared Error (MSE) as the loss function for regression tasks.
  • Use the Adam optimizer for efficient gradient descent.
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Step 4: Train the Model

  • Train the model using the training dataset and validate it on the validation dataset.
  • Print training and validation losses at each epoch.
def train_model(model, train_loader, valid_loader, criterion, optimizer, epochs=100):
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        
        # Validation phase
        model.eval()
        valid_loss = 0
        with torch.no_grad():
            for X_batch, y_batch in valid_loader:
                y_pred = model(X_batch)
                loss = criterion(y_pred, y_batch)
                valid_loss += loss.item()
        
        # Print progress
        print(f"Epoch {epoch+1}/{epochs}, Train Loss: {train_loss/len(train_loader):.4f}, Valid Loss: {valid_loss/len(valid_loader):.4f}")

# Train the model
train_model(model, train_loader, valid_loader, criterion, optimizer, epochs=50)
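
Printing the validation loss each epoch also makes overfitting easy to spot: the training loss keeps falling while the validation loss starts to rise. A simple guard is early stopping. The sketch below is an optional extension, not part of the original tutorial; patience is a hypothetical hyperparameter for how many non-improving epochs to tolerate:

def train_with_early_stopping(model, train_loader, valid_loader, criterion,
                              optimizer, max_epochs=200, patience=10):
    best_valid, bad_epochs, best_state = float('inf'), 0, None
    for epoch in range(max_epochs):
        # Training phase
        model.train()
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            loss.backward()
            optimizer.step()

        # Validation phase
        model.eval()
        valid_loss = 0.0
        with torch.no_grad():
            for X_batch, y_batch in valid_loader:
                valid_loss += criterion(model(X_batch), y_batch).item()
        valid_loss /= len(valid_loader)

        # Track the best model; stop if no improvement for `patience` epochs
        if valid_loss < best_valid:
            best_valid, bad_epochs = valid_loss, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                print(f"Early stop at epoch {epoch+1}, best valid loss {best_valid:.4f}")
                break

    if best_state is not None:
        model.load_state_dict(best_state)  # restore the best checkpoint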

Step 5: Predict on Test Data and Create Submission File

  • Preprocess the test data, make predictions, and save them as a CSV file for Kaggle submission.
# Preprocess test data
X_test = test_data.drop(columns=["Id"])
X_test_preprocessed = preprocessor.transform(X_test)
X_test_tensor = torch.tensor(X_test_preprocessed.toarray(), dtype=torch.float32)

# Predict
model.eval()
with torch.no_grad():
    test_predictions = model(X_test_tensor).numpy()

# Create submission file
submission = pd.DataFrame({
    "Id": test_data["Id"],
    "SalePrice": test_predictions.flatten()
})
submission.to_csv("submission.csv", index=False)

Summary of Steps

  1. Load and preprocess data: Handle missing values, standardize numerical features, and one-hot encode categorical features.
  2. Convert to tensors: Prepare data for PyTorch using DataLoader.
  3. Define the model: Build a deep neural network with Dropout layers to prevent overfitting.
  4. Train the model: Use MSE loss and the Adam optimizer to train the model.
  5. Predict and submit: Make predictions on the test set and save them in the required format.

If you have any further questions or need clarification, feel free to ask! 😊
