[Paper Review] Deep Neural Networks and Tabular Data: A Survey


이무기뱀술 2022. 11. 15. 17:15


SUBMITTED TO THE IEEE, JUNE 2022

Deep Neural Networks and Tabular Data: A Survey

Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk and Gjergji Kasneci

Goals of the Paper

- a thorough review of existing scientific literature on deep learning for tabular data
- a taxonomic categorization of the available approaches for classification and regression tasks on heterogeneous tabular data
- a presentation of the state of the art and promising paths towards tabular data generation
- an overview of existing explanation approaches for deep models for tabular data
- an extensive empirical comparison of traditional machine learning methods and deep learning models on multiple real-world heterogeneous tabular data sets
- a discussion on the main reasons for the limited success of deep learning on tabular data
- a list of open challenges related to deep learning for tabular data

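Terminology

Tabular Data
- Data in table form
  - data points as rows and features as columns.
  - heterogeneous tabular data
  - heterogeneous data: a mix of several kinds of data (for example, numerical and categorical features)
- A widely used kind of data
  - medical diagnosis, housing prices, and so on
- Traditionally analyzed mostly with GBDT (Gradient Boosted Decision Trees)
GBDT (Gradient Boosted Decision Trees)
- Gradient: slope
  - the first derivative of the loss function, used to search for its minimum
  - Gradient descent
- Boosting: an ensemble algorithm
  - Ensemble: combining weak classifiers into a strong classifier
  - Boosting: build classifier A, then use what it learned to build classifier B, then use that in turn to build classifier C. Finally, all of the classifiers built this way are combined into the final model.
- Decision Tree
  - Structures the data as a tree (a combination of rules)
  - (illustration to be added)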
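To make the three words in "Gradient Boosted Decision Trees" concrete, here is a minimal sketch of boosting for regression. It is not how XGBoost or LightGBM are implemented. It assumes a single numerical feature with at least two distinct values, squared-error loss, and depth-1 trees (stumps), and every function name in it is made up for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Pick the threshold on a 1-D feature whose two-leaf split best fits the residual."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right leaf is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_boosting(x, y, n_rounds=50, learning_rate=0.1):
    """Each new stump is fit to what the previous stumps still get wrong."""
    base = float(y.mean())
    prediction = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - f)^2 the negative gradient w.r.t. f is the residual y - f.
        residual = y - prediction
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps

def predict_boosting(model, x):
    base, learning_rate, stumps = model
    prediction = np.full(len(x), base)
    for stump in stumps:
        prediction = prediction + learning_rate * predict_stump(stump, x)
    return prediction

# Toy data: a step function that one rule ("x > 0.5") describes perfectly.
x = np.linspace(0.0, 1.0, 40)
y = np.where(x > 0.5, 2.0, -1.0)
model = fit_boosting(x, y)
mse = float(((predict_boosting(model, x) - y) ** 2).mean())
```

The classifiers A, B, C from the description above correspond to the stumps appended in each round, and the final model is their weighted sum.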
deep neural network
- An artificial implementation of the structure of neurons in the brain.
- Perceptron, then multilayer perceptron, then deep neural networks with many hidden layers.
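As a rough illustration of the step from a perceptron to a multilayer network, the sketch below runs a forward pass through a small multilayer perceptron. The layer sizes, the ReLU activation, and the random weights are all assumptions chosen for the example, and no training is shown.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(inputs, layers):
    """layers is a list of (weights, bias) pairs; ReLU after every layer except the last."""
    hidden = inputs
    for weights, bias in layers[:-1]:
        hidden = relu(hidden @ weights + bias)
    weights, bias = layers[-1]
    return hidden @ weights + bias  # linear output layer

rng = np.random.default_rng(0)
sizes = [5, 16, 16, 1]  # 5 input features, two hidden layers, 1 output (assumed sizes)
layers = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
batch = rng.normal(size=(3, 5))  # 3 rows of a table with 5 numerical columns
outputs = mlp_forward(batch, layers)
```

A single perceptron would be the special case with one layer; stacking more `(weights, bias)` pairs is what makes the network "deep".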

History of Deep Learning on Tabular Data

Kadra et al. (Arlind Kadra and colleagues at the University of Freiburg) called tabular data sets the last “unconquered castle” for deep neural network models.

Deep Learning
- Has focused on images, natural language, audio, and similar data
- Tabular data has only recently come back into focus
- Why the renewed interest? E-commerce
  - advertising and click-through rate prediction problems
Research so far
- regularization
- attention-based approaches
  - transformer architectures
Which of these is actually useful for tabular data?

Challenges of Learning With Tabular Data

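Why deep learning underperforms on tabular data

Low-Quality Training Data
- Missing values
- Outliers
- Errors in the data itself
- Relatively small number of samples
- Class imbalance
- Modern decision tree algorithms can cope with these problems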
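Missing or Complex Irregular Spatial Dependencies
- Often no spatial correlation between variables
- Dependencies are complex and irregular
  - relationships between features must be learned from scratch
- Inductive biases such as those built into CNNs cannot be used
Dependency on Preprocessing
- Strong dependence on preprocessing
  - simple for homogeneous data (images and the like).
- Example
  - categorical data: one-hot encoding
- Preprocessing can lose information, which lowers performance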
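The one-hot example above can be sketched in a few lines. The snippet compares it with ordinal encoding on a made-up `colors` column. The helper names are hypothetical, and libraries such as scikit-learn ship their own encoders that would normally be used instead.

```python
import numpy as np

def ordinal_encode(values):
    """Map each category to an integer; the order is alphabetical, so it is arbitrary."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return np.array([index[value] for value in values]), categories

def one_hot_encode(values):
    """One column per category, with a single 1 in each row."""
    codes, categories = ordinal_encode(values)
    one_hot = np.zeros((len(values), len(categories)), dtype=int)
    one_hot[np.arange(len(values)), codes] = 1
    return one_hot, categories

colors = ["red", "green", "blue", "green"]  # toy categorical column
codes, categories = ordinal_encode(colors)
one_hot, _ = one_hot_encode(colors)
```

Neither choice is free: the integer codes suggest an ordering (blue < green < red) that the data does not have, while the one-hot matrix grows by one column per category and becomes sparse for high-cardinality features.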
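Importance of Single Features
- In an image, changing a single feature is harmless; changing the class requires changing many features.
- In tabular data, however, changing a single value can be risky, because it may be enough to change the prediction.
  - Modern decision tree algorithms can cope with this problem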

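Unified Taxonomy

A unified taxonomy of the deep tabular models studied so far, category by category...

data transformation methods
- Transform categorical and numerical data so that the model can extract information more easily
- Preprocessing takes a long time and needs considerable compute
- This approach is used in computer vision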
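specialized architectures
- Characteristics
  - the most heavily researched group
  - suggests that new deep neural network architectures are needed
- hybrid models
  - combine classical machine learning with deep learning
- transformer-based models
  - rely on attention mechanisms.
  - What is an attention mechanism?
    - a way to address the long-term dependency problem in natural language processing and similar tasks
    - separates the words that matter for the current prediction from those that do not, and focuses more on the former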
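The "focus more on what matters" idea can be shown with scaled dot-product self-attention, the building block of transformer architectures. This is only a sketch: it leaves out the learned query, key, and value projections and the multiple heads that real models use, and the random `tokens` matrix stands in for feature embeddings.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    """Each output row is a weighted average of the value rows."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)  # similarity of every query to every key
    weights = softmax(scores)                 # each row sums to 1: where to focus
    return weights @ values, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))  # 4 embedded features (or words), dimension 8
output, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Row `i` of `weights` shows how much item `i` attends to every other item. In a tabular transformer the items would be column embeddings rather than words.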
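regularization models
- Apply strong regularization to cope with nonlinearity and model complexity
Significance of the taxonomy
- Helps practitioners choose an appropriate model
- data transformation methods
  - keep the current architecture and improve performance
- specialized architectures
  - keep the data preprocessing pipeline
(table image to be added)
- There are many kinds of architectures.
- Further details are omitted here.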

TABULAR DATA GENERATION

Purposes
- augmentation
- imputation
- rebalancing
- privacy-aware machine learning
  - generated data can potentially be used to overcome privacy concerns
Types
- Generative Adversarial Networks (GANs)
  - used for image generation
  - A generator G that generates samples from the data distribution, and a discriminator D that estimates the probability that a sample came from the ground truth distribution. Both G and D are usually chosen to be non-linear functions such as multilayer perceptrons.
- Variational Autoencoders (VAEs)
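For the GAN entry above, the game between G and D is usually written as the minimax objective below. This is the standard formulation from the GAN literature. The symbols p_data (the real data distribution) and p_z (the noise distribution fed to G) do not appear in these notes and are introduced here only for the formula.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

D tries to make both terms large by scoring real rows near 1 and generated rows near 0, while G tries to make the second term small by producing rows that D scores near 1.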
(table image to be added)
- There are many kinds of methods.
- The performance of each is evaluated.
- Further details are omitted here.

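Experiments

Model performance evaluation

Data Sets

Standardized data sets for the experiments
Computer vision has MNIST, CIFAR, ImageNet, and so on,
but heterogeneous data has no standard benchmark.
Data sets used
- Home Equity Line of Credit (HELOC) data set
  - provided by FICO
  - home equity lines of credit offered by banks
- Adult Income data set
  - independent variables (features): personal information such as age, sex, and education
  - dependent variable (target): binary; it represents high and low income.
- HIGGS
  - Higgs boson
- The Covertype data set
  - stores cartographic information about land cells (elevation, slope)
  - multi-class classification data set
- California Housing data set
  - housing price prediction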

Open Performance Benchmark

Hyperparameter Selection
- Optuna library
Data Preprocessing
- numerical features
  - zero-mean
  - unit-variance normalization
- categorical features
  - ordinal encoding
  - some models take them as-is: LightGBM, DeepFM, DeepGBM, TabNet, TabTransformer, and SAINT
- missing values: replaced with 0
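The preprocessing steps listed above could look roughly like the sketch below. It is not the benchmark's actual code. The order of operations (scale first, then fill missing values with 0), the use of NaN to mark missing entries, and the toy `ages` and `jobs` columns are all assumptions made for illustration.

```python
import numpy as np

def standardize(column):
    """Zero-mean, unit-variance scaling; statistics ignore missing entries (NaN)."""
    mean = np.nanmean(column)
    std = np.nanstd(column)
    scaled = (column - mean) / (std if std > 0 else 1.0)
    return np.nan_to_num(scaled, nan=0.0)  # missing values become 0 after scaling

def ordinal_encode(column):
    """Replace each category by its index in the sorted list of categories."""
    categories = sorted(set(column))
    index = {category: i for i, category in enumerate(categories)}
    return np.array([index[value] for value in column])

ages = np.array([20.0, 30.0, np.nan, 40.0])  # numerical column with one missing value
jobs = ["clerk", "engineer", "clerk", "nurse"]  # categorical column

ages_scaled = standardize(ages)
jobs_encoded = ordinal_encode(jobs)
features = np.column_stack([ages_scaled, jobs_encoded])
```

Filling with 0 after standardization places a missing value at the column mean, which is one reason the order of the two steps matters.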
Reproducibility and Extensibility
- docker container
- source code released

Results

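(Table 5 to be inserted)

Boosted decision tree ensembles gave the best results in every case except HIGGS.
On HIGGS, SAINT was the best.
Among the deep learning models, SAINT was the best overall.
The other deep learning models vary widely in performance depending on the data set.

Run Time Comparison

Gradient-boosting-based models are faster than the deep learning models.
On the HIGGS data set, deep learning was faster.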

Interpretability Assessment

What is interpretability?
- the degree to which a human can understand the cause of a decision
- the degree to which a human can consistently predict the model’s result
Low across all models, and sometimes even negative
- ?

We come to the conclusion that more profound benchmarks of the claimed interpretability characteristics and their usefulness in practice are necessary.

DISCUSSION AND FUTURE PROSPECTS

Summary and Trends

Decision Tree Ensembles are still State-of-the-Art.
- XGBoost: 6 years old
- gradient boosting: 20 years old
We think that a fundamental reorientation of the domain may be necessary.
- In other words, we may have been climbing the wrong mountain.
the question of whether the use of current deep learning techniques is beneficial for tabular data can generally be answered in the negative.
- Is deep learning the right tool for analyzing tabular data? No.

Analysis of the Causes of the Performance Gap

Unified Benchmarking
- There is no consensus on how to compare models fairly and efficiently.
- Results can differ even on the same data set
Tabular Data Preprocessing
- heterogeneity of the data
- numerical, categorical, and so on
Architectures for Deep Learning on Tabular Data
- transformer-based solutions
- self-supervised or unsupervised pre-training is better in some cases.
- special-purpose architectures for tabular data are needed
Regularization Models for Tabular Data
- Regularization matters!
Deep Generative Models for Tabular Data
- Generating data can
  - improve model quality
  - address privacy problems
- Unfortunately, the generation task is as hard as inference in predictive models, so progress in both areas will likely go hand in hand.
Interpretable Deep Learning Models for Tabular Data
- It would be good if interpretability could be built in....
Learning From Evolving Data Streams
- Data today changes continuously, so models need to keep up with it.

Open Research Questions

Information-theoretic Analysis of Encodings
Computational Efficiency in Hybrid Models
Specialized Regularizations
Novel Processes for Tabular Data Generation
Interpretability
Transfer of Deep Learning Methods to Data Streams
Transfer Learning for Tabular Data
Data Augmentation for Tabular Data
Self-supervised Learning


Attribution-NonCommercial-ShareAlike