[kaggle] American Express - Default Prediction - (1) Competition Overview

    American Express - Default Prediction

    For the first time in a while, a default prediction competition has opened on Kaggle. I think it is the first one since the Home Credit Default Risk competition.

    It is hosted by Amex [1], and it comes with prize money and a recruiting opportunity.

    1st Place - $40,000
    2nd Place - $30,000
    3rd Place - $20,000
    4th Place - $10,000

    In addition to cash prizes to the top winners, American Express is hiring!

    Highly ranked contestants who indicate their interest will be considered by American Express for interviews, based on their work in the competition and additional background.

    JOB DESCRIPTION
    American Express is seeking experienced data scientists and machine learning researchers to join our Global Decision Science team. Members of Global Decision Science are responsible for managing enterprise risks throughout the customer lifecycle by developing industry-first data capabilities, building profitable decision-making frameworks and creating machine learning-powered predictive models. Our Global Decision Science team uses industry-leading modeling and AI practices to predict customer behavior. We develop, deploy and validate predictive models and support the use of models in economic logic to enable profitable decisions across credit, fraud, marketing and servicing optimization engines.
    Positions are available in the US, UK and India.

    If you'd like your work to be considered for review by the American Express team:

    - Please upload your resume through the Team tab on the competition’s menu bar. Scroll down to “Your Model” and “Upload file” with your solution.
    - You acknowledge that at the end of the competition, the American Express team may request to review your model for purposes of reviewing your capabilities for the job. This license is limited for recruiting and review purposes only.
    - Note that applicants who are one member of a team may be requested to provide documentation of their specific contribution to a team model.

    Overview - Description

    The competition's Description reads as follows.

    Whether out at a restaurant or buying tickets to a concert, modern life counts on the convenience of a credit card to make daily purchases. It saves us from carrying large amounts of cash and also can advance a full purchase that can be paid over time. How do card issuers know we’ll pay back what we charge? That’s a complex problem with many existing solutions—and even more potential improvements, to be explored in this competition.

    Credit default prediction is central to managing risk in a consumer lending business. Credit default prediction allows lenders to optimize lending decisions, which leads to a better customer experience and sound business economics. Current models exist to help manage risk. But it's possible to create better models that can outperform those currently in use.

    American Express is a globally integrated payments company. The largest payment card issuer in the world, they provide customers with access to products, insights, and experiences that enrich lives and build business success.

    In this competition, you’ll apply your machine learning skills to predict credit default. Specifically, you will leverage an industrial scale data set to build a machine learning model that challenges the current model in production. Training, validation, and testing datasets include time-series behavioral data and anonymized customer profile information. You're free to explore any technique to create the most powerful model, from creating features to using the data in a more organic way within a model.

    If successful, you'll help create a better customer experience for cardholders by making it easier to be approved for a credit card. Top solutions could challenge the credit default prediction model used by the world's largest payment card issuer—earning you cash prizes, the opportunity to interview with American Express, and potentially a rewarding new career.

     

    The phrase "industrial scale data set" stands out. Let's check how large the data actually is.

    The total is about 50GB: train is 16.39GB and test is 33.82GB.

    train and test are split at a specific point in time.

    • train period: 2017.03.01~2018.03.31
    • test period: 2018.04.01~2019.10.31

    Data Description

    Now let's look at the Data Description.


    The objective of this competition is to predict the probability that a customer does not pay back their credit card balance amount in the future based on their monthly customer profile. The target binary variable is calculated by observing 18 months performance window after the latest credit card statement, and if the customer does not pay due amount in 120 days after their latest statement date it is considered a default event.

    The dataset contains aggregated profile features for each customer at each statement date. Features are anonymized and normalized, and fall into the following general categories:

    • D_* = Delinquency variables
    • S_* = Spend variables
    • P_* = Payment variables
    • B_* = Balance variables
    • R_* = Risk variables

    with the following features being categorical:

    ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

    Your task is to predict, for each customer_ID, the probability of a future payment default (target = 1).

    Note that the negative class has been subsampled for this dataset at 5%, and thus receives a 20x weighting in the scoring metric.
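
    To make the 5% / 20x note concrete, here is a minimal sketch of how the weighting changes the apparent default rate. The function names and the toy labels are my own, and this is not the competition's scoring code, only the arithmetic behind "subsampled at 5%, so weighted 20x".

```python
def sample_weights(targets, negative_weight=20.0):
    """Give each negative (target == 0) weight 1 / 0.05 = 20; positives keep weight 1."""
    return [negative_weight if t == 0 else 1.0 for t in targets]


def weighted_default_rate(targets, negative_weight=20.0):
    """Default rate after undoing the 5% subsampling of the negative class."""
    weights = sample_weights(targets, negative_weight)
    positive_mass = sum(w for w, t in zip(weights, targets) if t == 1)
    return positive_mass / sum(weights)


# Toy labels (hypothetical): 1 default among 4 customers in the subsampled data.
toy_targets = [1, 0, 0, 0]
raw_rate = sum(toy_targets) / len(toy_targets)    # 1 / 4 in the subsampled data
adjusted_rate = weighted_default_rate(toy_targets)  # 1 / (1 + 3 * 20) = 1 / 61

if __name__ == "__main__":
    print("raw rate:", raw_rate)
    print("rate with 20x negative weight:", adjusted_rate)
```

    In other words, the default rate seen in the provided data is much higher than the rate implied once each negative counts 20 times.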


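    You can tell that a lot of effort went into building this dataset. The profile variables are already built, and although every feature is anonymized, the prefix tells you what kind of feature each one is. The negative class (target=0) has also already been undersampled. Participants only need to get the modeling right. The problem is that the data is quite large to handle on a personal PC. RAM is the first bottleneck: loading train alone nearly fills 16GB of RAM. Since test is about 33.82GB, loading all of train and test at once would take roughly 50GB of RAM or more, going by simple arithmetic on the file sizes.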
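
    As for the prefixes and the categorical list in the Data Description, a small lookup like the one below is probably enough to tag each column. This is only a sketch: `describe_feature` is a name I made up, the example column names in the loop are hypothetical, and treating every non-listed feature as numeric is my assumption.

```python
# Prefix-to-category mapping taken from the Data Description.
PREFIX_CATEGORY = {
    "D": "Delinquency",
    "S": "Spend",
    "P": "Payment",
    "B": "Balance",
    "R": "Risk",
}

# Categorical features listed in the Data Description.
CATEGORICAL = {
    "B_30", "B_38", "D_114", "D_116", "D_117", "D_120",
    "D_126", "D_63", "D_64", "D_66", "D_68",
}


def describe_feature(name):
    """Return (category, kind) for a column name such as 'D_63'."""
    prefix = name.split("_", 1)[0]
    if prefix not in PREFIX_CATEGORY:
        return "Unknown", "unknown"  # e.g. an ID column, not a feature
    # Assumption: anything not in the categorical list is treated as numeric.
    kind = "categorical" if name in CATEGORICAL else "numeric"
    return PREFIX_CATEGORY[prefix], kind


if __name__ == "__main__":
    # Hypothetical example names; only D_63 and B_30 appear in the description.
    for column in ["D_63", "B_30", "S_3", "customer_ID"]:
        print(column, describe_feature(column))
```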

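    A few approaches come to mind:

    (1) Convert float columns to a smaller float type or to an integer type
    (2) Store each feature separately, load the features one at a time to measure univariate feature importance, then screen them and use only a subset

    That is about all I can think of for now. As the competition goes on, people will probably share code that deals with the data size.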
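
    The two ideas above can be sketched with the standard library alone. Everything here is an assumption for illustration: the helper names, the toy feature names and values, and the use of absolute Pearson correlation as the univariate importance (any other univariate score would do). Only the float downcast from (1) is shown, not the integer conversion. With real data the same downcast would be something like pandas `astype("float32")`, and `load_column` would read one per-feature file from disk.

```python
from array import array
import math


def downcast_to_float32(values):
    """Approach (1): store values as 4-byte floats instead of 8-byte floats."""
    return array("f", values)


def abs_correlation(xs, ys):
    """Absolute Pearson correlation, used here as a simple univariate score."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return abs(cov) / math.sqrt(var_x * var_y)


def screen_features(load_column, feature_names, target, keep):
    """Approach (2): load one feature at a time, score it, keep the top `keep`."""
    scores = {}
    for name in feature_names:
        column = load_column(name)  # only this one feature is held in memory
        scores[name] = abs_correlation(column, target)
    ranked = sorted(feature_names, key=lambda n: scores[n], reverse=True)
    return ranked[:keep], scores


# Toy stand-in for per-feature files on disk (hypothetical names and values).
STORED = {
    "feat_a": [0.1, 0.2, 0.8, 0.9, 0.15, 0.85],
    "feat_b": [0.5, 0.4, 0.5, 0.6, 0.5, 0.4],
    "feat_c": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}
TARGET = [0, 0, 1, 1, 0, 1]

f64 = array("d", STORED["feat_a"])
f32 = downcast_to_float32(STORED["feat_a"])
bytes_f64 = f64.itemsize * len(f64)
bytes_f32 = f32.itemsize * len(f32)
max_error = max(abs(a - b) for a, b in zip(f64, f32))

selected, scores = screen_features(
    lambda name: STORED[name], list(STORED), TARGET, keep=2
)

if __name__ == "__main__":
    print("float64 bytes:", bytes_f64, "float32 bytes:", bytes_f32)
    print("max rounding error after downcast:", max_error)
    print("selected features:", selected)
```

    The downcast halves the storage per value at the cost of a small rounding error, and the screening loop never needs more than one feature plus the target in memory at a time.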

    1. American Express Company (Amex) is an American multinational corporation specializing in payment card services.