{"id":46626,"date":"2021-01-14T00:00:00","date_gmt":"2021-01-14T08:00:00","guid":{"rendered":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/"},"modified":"2025-11-13T12:55:07","modified_gmt":"2025-11-13T20:55:07","slug":"predicting-credit-card-attrition-using-python-and-griddb","status":"publish","type":"post","link":"https:\/\/www.griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/","title":{"rendered":"Predicting Credit Card Attrition Using Python and GridDB"},"content":{"rendered":"<p>Data Analysis aims to extract useful information from data and then aids the decision making process. However, the raw data we get from external sources, be it mobile devices or sensors, has many outliers. Moreover, the data may be high dimensional, so it becomes hard to interpret the data&#39;s summary statistics. As a result, nowadays, data analysis is the umbrella term for the process of getting raw data to getting human interpretable results. Thus, data analysis consists of data cleaning, transforming and modelling such that meaningful information can be extracted from it. <\/p>\n<p><a href=\"#source-code\"> FULL SOURCE CODE <\/a><\/p>\n<p>The most crucial precursor of a good data analysis system has a reliable database. Our database should be scalable, and we should be able to query large datasets easily from it. One such modern database that allows for all these functionalities is GridDB. GridDB is a high performance and can easily be integrated with many programming languages. In this post, we will analyze some data with python and GridDB. 
As there are many types of analyses we can do, we will focus on a random forest model in this post.<\/p>\n<h3 id=\"griddb-setup\">GridDB setup<\/h3>\n<p>This <a href=\"https:\/\/www.youtube.com\/watch?v=yWCVfLoV9_0&amp;t=61s\">video<\/a> has the setup guide for the GridDB python client.<\/p>\n<h3 id=\"python-libraries\">Python libraries<\/h3>\n<p>We will use <code>Python 3.6<\/code> and pandas to do our analysis.<br \/>\nTo install the libraries, we use the following commands:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\">pip install pandas\npip install scikit-learn\npip install plotly\npip install matplotlib\n<\/code><\/pre>\n<\/div>\n<p>Next, we import the libraries (including <code>numpy<\/code>, which we will need later to sort the feature importances):<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\">import pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport plotly.graph_objs as go\nfrom plotly.subplots import make_subplots\nfrom sklearn.model_selection import train_test_split, cross_val_score\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.metrics import f1_score as f1\n<\/code><\/pre>\n<\/div>\n<h3 id=\"data-collection\">Data Collection<\/h3>\n<p>GridDB provides an excellent interface to access data. 
The <a href=\"https:\/\/griddb.net\/en\/blog\/using-griddbs-cpythonruby-apis\/\">GridDB python client blog<\/a> goes into great detail to link a GridDB database and push all the data to a pandas data frame.<\/p>\n<p>For this analysis we will use credit card data to predict attrition or churn. The data can be found <a href=\"https:\/\/www.kaggle.com\/sakshigoyal7\/credit-card-customers\">here<\/a>.<\/p>\n<p>We can set up GridDB as our database by instantiating the container and dumbing all the data into a pandas dataframe.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\"><span class=\"hljs-built_in\">import<\/span> griddb_python as griddb\n<span class=\"hljs-comment\"># Initialize container<\/span>\n<span class=\"hljs-attr\">gridstore<\/span> = factory.get_store(<span class=\"hljs-attr\">host=<\/span> host, <span class=\"hljs-attr\">port=port,<\/span> \n            <span class=\"hljs-attr\">cluster_name=cluster_name,<\/span> <span class=\"hljs-attr\">username=uname,<\/span> \n            <span class=\"hljs-attr\">password=pwd)<\/span>\n\n<span class=\"hljs-attr\">conInfo<\/span> = griddb.ContainerInfo(<span class=\"hljs-string\">\"attrition\"<\/span>,\n                    [[<span class=\"hljs-string\">\"CLIENTNUM\"<\/span>, griddb.Type.LONG],\n                    [<span class=\"hljs-string\">\"Gender\"<\/span>,griddb.Type.STRING],\n              .... 
<span class=\"hljs-comment\">#for all 23 variables      <\/span>\n                    griddb.ContainerType.COLLECTION, True)\n\n<span class=\"hljs-attr\">cont<\/span> = gridstore.put_container(conInfo)    \ncont.create_index(<span class=\"hljs-string\">\"CLIENTNUM\"<\/span>, griddb.IndexType.DEFAULT)\n<\/code><\/pre>\n<\/div>\n<p>We can retrive data from GridDB using the following SQL query:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\">  <span class=\"hljs-keyword\">query<\/span> = cont.<span class=\"hljs-keyword\">query<\/span>(<span class=\"hljs-string\">\"select *\"<\/span>)\n<\/code><\/pre>\n<\/div>\n<p>The data analysis pipeline has the following steps:<\/p>\n<ol>\n<li>\n    <a href=\"#data-collection-and-exploration\"> <strong>Data exploration<\/strong><\/a>: We first run some summary statistics on the various variables we have and try to understand the correlation with the dependent variable, i.e. survival. We also clean the dataset to remove outliers, if any.\n  <\/li>\n<li>\n    <a href=\"#feature-engineering\"> <strong>Feature Engineering<\/strong><\/a>: We will then select the features that can be used for modelling. We can create new features either from existing data or open-source resources.\n  <\/li>\n<li>\n     <a href=\"#data-modelling\"><strong>Modelling<\/strong><\/a>: We then use a machine learning model, random forest, in our case. We would first split the data into a test set and a training set. Typically we train the model on the training data and evaluate it on the test set. Sometimes we either have a validation set or do cross-validation to tune the hyperparameters of the model.\n  <\/li>\n<li>\n    <a href=\"#evaluation\"><strong>Evaluation<\/strong><\/a>: Finally, we will use the model for prediction and analyze its performance.\n  <\/li>\n<\/ol>\n<h3 id=\"data-collection-and-exploration\">Data Collection and Exploration<\/h3>\n<p>We load the data using <code>pandas<\/code>. 
We remove the last two columns as they are the outputs of a different classifier.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\">data = pd.read_csv('\/kaggle\/input\/credit-card-customers\/BankChurners.csv')\ndata = data[data.columns[:-2]]\n<\/code><\/pre>\n<\/div>\n<p>We first create summary statistics of some of the variables. Ideally, we would check every variable, but for brevity, we showcase a few important ones.<\/p>\n<h4 id=\"attrition_flag-\">Attrition_Flag:<\/h4>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\">attdata = data.groupby(['Attrition_Flag']).count()[[\"CLIENTNUM\"]].reset_index()\nattdata['percentage'] = attdata['CLIENTNUM']\/attdata['CLIENTNUM'].sum()\nattdata[attdata.Attrition_Flag == \"Attrited Customer\"]<\/code><\/pre>\n<\/div>\n<table>\n<thead>\n<tr>\n<th style=\"text-align:left\">CLIENTNUM<\/th>\n<th style=\"text-align:center\">percentage<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align:left\">1627<\/td>\n<td style=\"text-align:center\">0.16066<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><\/p>\n<p>We have an attrition rate of about 16%.<\/p>\n<h4 id=\"demographic-variables\">Demographic Variables<\/h4>\n<h5 id=\"gender\">Gender<\/h5>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\">genderdata = data.groupby(['Gender']).count()[[\"CLIENTNUM\"]].reset_index()\ngenderdata['percentage'] = genderdata['CLIENTNUM']\/genderdata['CLIENTNUM'].sum()\ngenderdata[genderdata.Gender == \"F\"]<\/code><\/pre>\n<\/div>\n<table>\n<thead>\n<tr>\n<th 
style=\"text-align:left\">CLIENTNUM<\/th>\n<th style=\"text-align:center\">percentage<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align:left\">5358<\/td>\n<td style=\"text-align:center\">0.529081<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><\/p>\n<p>We see that there are 52.9% females. But the difference in the genders is not that significant.<\/p>\n<h5 id=\"education\">Education<\/h5>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\"><span class=\"hljs-class\"><span class=\"hljs-keyword\">data<\/span>.groupby(['<span class=\"hljs-type\">Education_Level<\/span>']).count()[[\"<span class=\"hljs-type\">CLIENTNUM<\/span>\"]].reset_index()<\/span>\n<\/code><\/pre>\n<\/div>\n<h5 id=\"education_level\">Education_Level<\/h5>\n<table>\n<thead>\n<tr>\n<th style=\"text-align:left\">Type<\/th>\n<th style=\"text-align:center\">Number<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align:left\">College<\/td>\n<td style=\"text-align:center\">1013<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">Doctorate<\/td>\n<td style=\"text-align:center\">451<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">Graduate<\/td>\n<td style=\"text-align:center\">3128<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">High School<\/td>\n<td style=\"text-align:center\">2013<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">Post-Graduate<\/td>\n<td style=\"text-align:center\">516<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">Uneducated<\/td>\n<td style=\"text-align:center\">1487<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">Unknown<\/td>\n<td style=\"text-align:center\">1519<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p> We see about 70% of the customers are educated. 
<\/p>\n<h5 id=\"income_category\">Income_Category<\/h5>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\"><span class=\"hljs-class\"><span class=\"hljs-keyword\">data<\/span>.groupby(['<span class=\"hljs-type\">Income_Category<\/span>']).count()[[\"<span class=\"hljs-type\">CLIENTNUM<\/span>\"]].reset_index()<\/span>\n<\/code><\/pre>\n<\/div>\n<table>\n<thead>\n<tr>\n<th style=\"text-align:left\">Income_Category<\/th>\n<th style=\"text-align:center\">Number<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align:left\">$120K +<\/td>\n<td style=\"text-align:center\">727<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">40k\u00e2\u02c6\u2019 60k<\/td>\n<td style=\"text-align:center\">1790<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">60k\u00e2\u02c6\u2019 *80K<\/td>\n<td style=\"text-align:center\">1402<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">Less than $40k<\/td>\n<td style=\"text-align:center\">3561<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">Unknown<\/td>\n<td style=\"text-align:center\">1112<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p> We see that most people earn less than $40k.<\/p>\n<h4 id=\"bank-variables\">Bank variables<\/h4>\n<p>We will draw histograms for <strong>Months_on_book<\/strong>, &#39;<strong>Total_Relationship_Count<\/strong>&#39;,<br \/>\n  &#39;<strong>Months_Inactive_12_mon<\/strong>&#39; and &#39;<strong>Credit_Limit<\/strong>&#39;<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\">fig = make_subplots(rows=<span class=\"hljs-number\">2<\/span>, cols=<span class=\"hljs-number\">2<\/span>)\ntr1=<span class=\"hljs-keyword\">go<\/span>.Histogram(<span class=\"hljs-keyword\">x<\/span>=data[<span class=\"hljs-string\">'Months_on_book'<\/span>],name=<span class=\"hljs-string\">'Months on book Box Plot'<\/span>)\ntr2=<span class=\"hljs-keyword\">go<\/span>.Histogram(<span class=\"hljs-keyword\">x<\/span>=data[<span class=\"hljs-string\">'Total_Relationship_Count'<\/span>],name=<span 
class=\"hljs-string\">'Total no. of products Histogram'<\/span>)\ntr3=<span class=\"hljs-keyword\">go<\/span>.Histogram(<span class=\"hljs-keyword\">x<\/span>=data[<span class=\"hljs-string\">'Months_Inactive_12_mon'<\/span>],name=<span class=\"hljs-string\">'number of months inactive Histogram'<\/span>)\ntr4=<span class=\"hljs-keyword\">go<\/span>.Histogram(<span class=\"hljs-keyword\">x<\/span>=data[<span class=\"hljs-string\">'Credit_Limit'<\/span>],name=<span class=\"hljs-string\">'Credit_Limit Histogram'<\/span>)\n\nfig.add_trace(tr1,row=<span class=\"hljs-number\">1<\/span>,<span class=\"hljs-keyword\">col<\/span>=<span class=\"hljs-number\">1<\/span>)\nfig.add_trace(tr2,row=<span class=\"hljs-number\">1<\/span>,<span class=\"hljs-keyword\">col<\/span>=<span class=\"hljs-number\">2<\/span>)\nfig.add_trace(tr3,row=<span class=\"hljs-number\">2<\/span>,<span class=\"hljs-keyword\">col<\/span>=<span class=\"hljs-number\">1<\/span>)\nfig.add_trace(tr4,row=<span class=\"hljs-number\">2<\/span>,<span class=\"hljs-keyword\">col<\/span>=<span class=\"hljs-number\">2<\/span>)\n\nfig.update_layout(height=<span class=\"hljs-number\">700<\/span>, width=<span class=\"hljs-number\">1200<\/span>, title_text=<span class=\"hljs-string\">\"Distribution of bank variables\"<\/span>)\nfig.show()\n<\/code><\/pre>\n<\/div>\n<p><img decoding=\"async\"\n    src=\"https:\/\/lh3.googleusercontent.com\/vqSOhxLVLzv8vgKtwqRAbplYd__UE7cZS2PaXjvgJxOsWjhs3xDMGdRKS4gRmxyYERrq-vVEPb2_sJ47a50qA2XE__UbrUw1H-IaXApUbA69ioxwP3LFP-_Cwsvuk3_PVlsSMXNU\"\n    alt=\"\"><\/p>\n<p>We see that the distribution of the total number of products is mostly uniform so ideally this variable can be removed from our analysis. 
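<\/p>\n<p>One way to make &quot;mostly uniform&quot; precise is to compare each category&#39;s share against a perfectly even split: if every share is close to 1\/k for k categories, the variable carries little signal for classification. A small sketch of such a check (the tolerance is an arbitrary choice for illustration, not part of the original analysis):<\/p>

```python
import pandas as pd

def is_near_uniform(series, tolerance=0.5):
    """True if every category's share lies within `tolerance`
    (as a fraction) of a perfectly uniform split."""
    shares = series.value_counts(normalize=True)
    uniform = 1.0 / len(shares)
    return bool(((shares - uniform).abs() <= tolerance * uniform).all())

# Stand-in data: an evenly spread count variable vs. a skewed one.
near_uniform = pd.Series([1, 2, 3, 4, 5, 6] * 10)
skewed = pd.Series([1] * 50 + [2] * 5 + [3] * 5)
print(is_near_uniform(near_uniform), is_near_uniform(skewed))  # True False
```

<p>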
Other variables show a lot of variation, so we can keep them.<\/p>\n<h4 id=\"card_category\">Card_Category<\/h4>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\">data.groupby(['Card_Category']).count()[[\"CLIENTNUM\"]].reset_index()\n<\/code><\/pre>\n<\/div>\n<table>\n<thead>\n<tr>\n<th style=\"text-align:left\">Card_Category<\/th>\n<th style=\"text-align:left\">Total<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align:left\">Blue<\/td>\n<td style=\"text-align:left\">9496<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">Gold<\/td>\n<td style=\"text-align:left\">116<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">Platinum<\/td>\n<td style=\"text-align:left\">20<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">Silver<\/td>\n<td style=\"text-align:left\">555<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p> We see that most customers use the <strong>Blue<\/strong> card, so we can ignore this variable as well.<\/p>\n<h3 id=\"feature-engineering\">Feature Engineering<\/h3>\n<p>We can ignore the <strong>client number<\/strong> for this analysis. However, if the bank had more data sources, then the <strong>client number<\/strong> could be used to match records across datasets.<\/p>\n<p>The <strong>attrition flag<\/strong> is our dependent variable, as we want to predict attrition. We will encode it as a 0\/1 variable.<\/p>\n<p>Next, we create dummy variables. Dummy variables encode each level of a categorical variable as a separate indicator column. We will leave one category out; otherwise, we would face the issue of collinearity. We create dummy variables from the demographic variables. 
This is also known as one-hot encoding.<\/p>\n<ul>\n<li>Education_Level<\/li>\n<li>Income_Category<\/li>\n<li>Marital_Status<\/li>\n<\/ul>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\">data.Attrition_Flag = data.Attrition_Flag.replace({'Attrited Customer':1,'Existing Customer':0})\ndata.Gender = data.Gender.replace({'F':1,'M':0})\ndata = pd.concat([data,pd.get_dummies(data['Education_Level']).drop(columns=['Unknown'])],axis=1)\ndata = pd.concat([data,pd.get_dummies(data['Income_Category']).drop(columns=['Unknown'])],axis=1)\ndata = pd.concat([data,pd.get_dummies(data['Marital_Status']).drop(columns=['Unknown'])],axis=1)\n\ndata.drop(columns = ['Education_Level','Income_Category','Marital_Status',<span 
class=\"hljs-string\">'Card_Category'<\/span>,'CLIENTNUM'],inplace=True, errors = \"ignore\")\n<\/code><\/pre>\n<\/div>\n<h3 id=\"data-modelling\">Data Modelling<\/h3>\n<p>Next, we run a simple random forest model with 100 trees after splitting the data into training and test sets.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\">X_features = ['Customer_Age', 'Gender', 'Dependent_count',\n'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',\n'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',\n'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',\n'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',\n'College', 'Doctorate', 'Graduate', 'High School', 'Post-Graduate',\n'Uneducated', '$120K +', '$40K - $60K', <span class=\"hljs-string\">'<span class=\"hljs-variable\">$60<\/span>K - <span 
class=\"hljs-variable\">$80<\/span>K'<\/span>, '$80K - $120K',\n'Less than $40K', 'Divorced', 'Married', 'Single']\n\nX = data[X_features]\ny = data['Attrition_Flag']\ntrain_x,test_x,train_y,test_y = train_test_split(X,y,random_state=42)\n\nrf = RandomForestClassifier(n_estimators = 100, random_state = 42)\nrf.fit(train_x,train_y)\n<\/code><\/pre>\n<\/div>\n<h3 id=\"evaluation\">Evaluation<\/h3>\n<p>We compute the F1 score on the test-set predictions. F1 is defined as the harmonic mean of precision and recall. <\/p>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\">rf_prediction = rf.predict(test_x)\nprint('F1 Score of Random Forest Model On Test Set {}'.format(f1(test_y, rf_prediction)))\n<\/code><\/pre>\n<\/div>\n<p>We can also get relative variable importances from the random forest model.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"lang-python\">importances = rf.feature_importances_\nindices = np.argsort(importances)\nplt.title('Feature Importances')\nplt.barh(range(len(indices)), importances[indices], <span 
class=\"hljs-built_in\">color<\/span>='b', align='<span class=\"hljs-built_in\">center<\/span>')\nplt.yticks(<span class=\"hljs-built_in\">range<\/span>(len(<span class=\"hljs-built_in\">indices<\/span>)), [X_features[i] <span class=\"hljs-keyword\">for<\/span> i <span class=\"hljs-keyword\">in<\/span> <span class=\"hljs-built_in\">indices<\/span>])\nplt.<span class=\"hljs-built_in\">xlabel<\/span>('Relative Importance')\nplt.<span class=\"hljs-built_in\">show<\/span>()\n<\/code><\/pre>\n<\/div>\n<p><img decoding=\"async\"\n    src=\"https:\/\/lh5.googleusercontent.com\/qxvRYnlyPxkqzcxyz6ncyqUE9M1haCPAS9o84JIovW7EMrISruZVNGXlFmHyWKiY_pZKbp7t_9YANX_1CqmALpJARcJckERHgtQMajM7qFP-0M1SGnAzlF5cbNLxBn28t_Ss5kKO\"\n    alt=\"\"><\/p>\n<p>We see that the total transaction amount is the most important variable.<\/p>\n<h3 id=\"conclusion\">Conclusion<\/h3>\n<p>In this post we learned the basics of data analysis and predictive modelling using python and GridDB.<\/p>\n<h3 id=\"source-code\"> Source Code <\/h3>\n<p><a href=\"https:\/\/griddb.net\/en\/download\/27183\/\"> <span class=\"download-button\"> Click here to download the full source code <\/span><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data Analysis aims to extract useful information from data and then aids the decision making process. However, the raw data we get from external sources, be it mobile devices or sensors, has many outliers. Moreover, the data may be high dimensional, so it becomes hard to interpret the data&#39;s summary statistics. 
As a result, nowadays, [&hellip;]<\/p>\n","protected":false},"author":41,"featured_media":27161,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[121],"tags":[],"class_list":["post-46626","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.1.1 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Predicting Credit Card Attrition Using Python and GridDB | GridDB: Open Source Time Series Database for IoT<\/title>\n<meta name=\"description\" content=\"Data Analysis aims to extract useful information from data and then aids the decision making process. However, the raw data we get from external sources,\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Predicting Credit Card Attrition Using Python and GridDB | GridDB: Open Source Time Series Database for IoT\" \/>\n<meta property=\"og:description\" content=\"Data Analysis aims to extract useful information from data and then aids the decision making process. 
However, the raw data we get from external sources,\" \/>\n<meta property=\"og:url\" content=\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/\" \/>\n<meta property=\"og:site_name\" content=\"GridDB: Open Source Time Series Database for IoT\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/griddbcommunity\/\" \/>\n<meta property=\"article:published_time\" content=\"2021-01-14T08:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-11-13T20:55:07+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.griddb.net\/wp-content\/uploads\/2020\/12\/pasted_image_0.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1197\" \/>\n\t<meta property=\"og:image:height\" content=\"703\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"griddb-admin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@GridDBCommunity\" \/>\n<meta name=\"twitter:site\" content=\"@GridDBCommunity\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"griddb-admin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/\"},\"author\":{\"name\":\"griddb-admin\",\"@id\":\"https:\/\/griddb.net\/en\/#\/schema\/person\/4fe914ca9576878e82f5e8dd3ba52233\"},\"headline\":\"Predicting Credit Card Attrition Using Python and GridDB\",\"datePublished\":\"2021-01-14T08:00:00+00:00\",\"dateModified\":\"2025-11-13T20:55:07+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/\"},\"wordCount\":850,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/griddb.net\/en\/#organization\"},\"image\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/#primaryimage\"},\"thumbnailUrl\":\"\/wp-content\/uploads\/2020\/12\/pasted_image_0.png\",\"articleSection\":[\"Blog\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/\",\"url\":\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/\",\"name\":\"Predicting Credit Card Attrition Using Python and GridDB | GridDB: Open Source Time Series Database for 
IoT\",\"isPartOf\":{\"@id\":\"https:\/\/griddb.net\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/#primaryimage\"},\"thumbnailUrl\":\"\/wp-content\/uploads\/2020\/12\/pasted_image_0.png\",\"datePublished\":\"2021-01-14T08:00:00+00:00\",\"dateModified\":\"2025-11-13T20:55:07+00:00\",\"description\":\"Data Analysis aims to extract useful information from data and then aids the decision making process. However, the raw data we get from external sources,\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb.net\/en\/blog\/predicting-credit-card-attrition-using-python-and-griddb\/#primaryimage\",\"url\":\"\/wp-content\/uploads\/2020\/12\/pasted_image_0.png\",\"contentUrl\":\"\/wp-content\/uploads\/2020\/12\/pasted_image_0.png\",\"width\":1197,\"height\":703},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/griddb.net\/en\/#website\",\"url\":\"https:\/\/griddb.net\/en\/\",\"name\":\"GridDB: Open Source Time Series Database for IoT\",\"description\":\"GridDB is an open source time-series database with the performance of NoSQL and convenience of 
SQL\",\"publisher\":{\"@id\":\"https:\/\/griddb.net\/en\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/griddb.net\/en\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/griddb.net\/en\/#organization\",\"name\":\"Fixstars\",\"url\":\"https:\/\/griddb.net\/en\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb.net\/en\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png\",\"contentUrl\":\"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png\",\"width\":200,\"height\":83,\"caption\":\"Fixstars\"},\"image\":{\"@id\":\"https:\/\/griddb.net\/en\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/griddbcommunity\/\",\"https:\/\/x.com\/GridDBCommunity\",\"https:\/\/www.linkedin.com\/company\/griddb-by-toshiba\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/griddb.net\/en\/#\/schema\/person\/4fe914ca9576878e82f5e8dd3ba52233\",\"name\":\"griddb-admin\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb.net\/en\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5bceca1cafc06886a7ba873e2f0a28011a1176c4dea59709f735b63ae30d0342?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5bceca1cafc06886a7ba873e2f0a28011a1176c4dea59709f735b63ae30d0342?s=96&d=mm&r=g\",\"caption\":\"griddb-admin\"},\"url\":\"https:\/\/www.griddb.net\/en\/author\/griddb-admin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->"}