Bank Statement Cleaner

Abstract

Bank statements can be utilized to unveil our financial situation as they contain almost the complete history of our financial activities. However, they cannot be used raw as complications, such as double counting and nonatomic data, still exist. This program organizes bank statements such that users can perform more advanced financial analysis to create a personal cash flow statement. Before use, download the CIBC chequing, savings, and credit CSV files. Run this program following this guide, and users will obtain structured data in an Excel sheet.

Introduction

Bank statements are one of the richest information sources on your financial activities. Bank statement cleaner is a program that processes statements and produces structured data that can be used to create a personal cash flow statement, which is the foundation of a budget. The program combines transactions from debit and credit accounts and yields pure inflows to and outflows from your bank account. The source code is available at bank-statement-cleaner. You can run the software on Windows and test the features using the synthetic data.

A cash flow statement is different from a budget. A cash flow statement is a financial document that tracks cash going in and out of a business over a specific period. It shows the ability of a company to maintain its operations, invest, and pay its debts. In the context of personal finance, the cash flow statement records the income and expenditures of an individual with less detail than its corporate version. In contrast, a budget is an approximation of revenue and expenses in a future period that is regularly updated. Implementing a budget helps individuals and families to have greater control of their financial situation and provides more safety in unexpected times. It is paramount to analyze past cash flow to create a realistic and achievable budget.

Generally, Canadian banks provide their customers with digital account statements, which are downloadable from the banks’ websites. This program consumes CSV files of the statements and outputs an Excel worksheet containing the result. Currently, the application only supports CIBC statements. However, support for other banks might be added in the future.

This software is a great tool for people who want to analyze and manage their or their clients’ finances. However, please read the LICENSE before using the software. Note that the software is provided as is and without any warranty.

Data Analysis

CIBC Statements

Bank statements consist of debit and credit statements. Savings and chequing statements are categorized as debit statements. The debit statements have 4 headerless columns. Based on the content of the column, the headers can be inferred as follows.

  • Date: date when the transaction occurs
  • Description: text containing transaction details, such as the counterparty’s name
  • Debit: the amount debited from an account
  • Credit: the amount credited from an account

All debit transaction descriptions have at least a transaction method and type. Nevertheless, some transactions have more elements in their descriptions. The following is an example of a complete description.

Point of Sale - Interac RETAIL PURCHASE 111001001111 NO FRILLS

The anatomy of a debit description is as follows.

  • Method: the device that facilitates the transaction, e.g., Point of Sale - Interac
  • Type: the purpose of the transaction, e.g., RETAIL PURCHASE
  • ID: a series of characters that acts as an identifier for a transaction, e.g., 111001001111. Usually, an ID consists of 12 characters, and not all transactions have it.
  • Party: the individual or institution that is on the other side of the transaction, e.g., NO FRILLS

Transactions in both the debit and credit accounts have either a debit or credit amount. Cash coming into an account is recorded as a credit to the account. This might seem counterintuitive at first because more cash entails an increase in assets. Consequently, the account should have been debited. However, the statement is created from the perspective of the bank. Deposits are a liability to the bank as deposits are owed by the bank to the depositors, coupled with the fact that depositors can withdraw their funds at any time. When a depositor adds money to their account, the bank’s liability increases. As a result, the liability account is credited. Fun fact: Banks lend out most of their deposits as loans. The interest rates on the loans are higher than the interest paid to the depositors. The difference, or usually called the net interest rate spread, is the sweet and succulent profit for the bank.

Likewise, the credit statement has all the columns of debit statements with one additional column that is filled with only the same censored card number, which can be discarded. Credit transaction descriptions can be grouped either as a charge or a payment. All credit payments have “PAYMENT THANK YOU/PAIEMEN T MERCI” as the description, whereas purchase descriptions contain the party’s name and location. For example, “IKEA TORONTO, ON” where “IKEA” is the party and “TORONTO, ON” is the location.

Lastly, pending transactions are not included in the statements. Only posted transactions are present in the statements. Pending transactions are already approved but have not been posted to an account. It takes up to 5 days for a pending transaction to be posted.

Problem

Two critical problems hindering our analysis of the bank statements: double counting and unorganized data.

Double Counting

The error of including an item in a calculation more than once, which inflates figures and leads to inaccuracies. This problem occurs if internal transfers and internal payments are not separated from the bank statements. The raw statements contain both transfers between chequing and savings, and also payments from debit accounts to a credit account.

Internal transfers are exchanges of money between debit accounts. An account holder can transfer their funds from savings to chequing and vice versa. One of the reasons to do an internal transfer is to limit the amount of money in the chequing account to avoid overspending. Internal exchanges pose a double-counting problem because the amount of income and expenses will be gradually increased by each transfer amount. The problem is illustrated in the schedule below.

Transaction Income Expense
Deposit cash to savings 100
Transfer funds from savings to chequing 50
(chequing account receives the funds)
-50
(savings account loses the funds)
Buy pizza -20
Total 150 -70

The total income and expenses should have been $100 and $20. However, because the internal transfer of $50 is still in the record, the income and expense are both overstated to $150 and $70, respectively. Although the net cash flow is unaffected, this still produces inaccurate income and expenses.

Internal payments are payments from a debit account to a credit account to decrease the credit balance. When savings, chequing, and credit statements are merged into one statement, the records of inbound and outbound from each account still persist, leading to double-counting. Paying using a credit card can be a good thing if it is paid afterwards in a timely manner. By paying bills on time, the account holder’s credit score will increase. A high credit score opens avenues to attractive offers when applying for a loan. Moreover, account holders are rewarded with cashback when paying with credit cards, which, if redeemed, can decrease the overall credit payment. The double-counting problem caused by internal payments is illustrated in the following table.

Transaction Income Expense
Credit purchase at Tim Hortons -10
Credit purchase at Eaton Centre -20
Pay for the credit balance owing 30
(credit account receives the funds)
-30
(debit account surrenders the funds)
Total 30 -60

The actual total of the expenses is $30 because that is the amount spent using the credit card, and there is no income. Conversely, if the internal payments are not separated, the income can be inflated to $30, and the expenses can hike to $60, which distorts the results. To calculate the cash flow, only the actual credit expenditures are required.

A payment can cover one or more charges. If we only consider the payment from the debit account to the credit account as an expense and ignore the credit purchases, we will lose valuable information because the payment does not reflect the different categories of the actual expenses, namely food and beverages (i.e., Tim Hortons) and clothing (i.e., Eaton Centre).

What if we pay more than we owe? The excess amount will go to our credit balance, and it is irreversible. However, it will be used to pay for future credit purchases. Curiously, the credit balance will be negative.

Unrefined Data

Some useful data is lumped together that needs to be separated to allow further analysis. In bank statements, the transaction description contains important tokens that describe the exchange. Unfortunately, the tokens are fused together. If we want to query certain transactions based on their method or type, the description needs to be deconstructed and tokens put into separate columns first. Additionally, the number of records can be hundreds or even thousands, which requires a significant amount of work to process each description manually.

The transaction date is another example. The components of year, month, and day need to be extracted to allow more meaningful transaction groupings. By separating the date components, we can organize transactions temporally and gain insights into our financial activities. For instance, group by year and month to analyze month-to-month income and expenses.

In creating a personal cash flow statement, our concern is to know the inflows and outflows. Classifying each transaction as debit and credit may be useful to create more formal statements, but not all people understand the concept of debit and credit. Moreover, I think that format is harder to read in general. Therefore, the debit and credit columns need to be combined into one signed value. Positive for inflows and negative for outflows.

Solution

To solve the double-counting problem, the internal transfers and internal payments need to be removed from the records. Instead of deleting them, they are transferred into different sheets for future analysis.

An internal transfer can be found by searching for a transaction pair from the savings and chequing accounts that have an identical description. After internal transfers are removed, the residual transactions that have “internet banking” as their method and are categorized as “internet transfer” are the internal payments from the debit accounts. Further, the payments that are received by the credit account all have a description of “PAYMENT THANK YOU/PAIEMEN T MERCI”. Knowing that condition, internal payments can be queried and transferred to a separate sheet. Hence, the remainders contain only refined fund inflows and outflows representing external transactions.

Tokens required for analysis need to be extracted from the description using regular expressions (regex). Regex can be used to find specific patterns in text. Based on observation, transactions always have a method and type, and a transaction type is always written in capital letters, therefore, regex can be utilized to extract the substring with only capital letters and spaces. The method can be obtained as the piece preceding the transaction type denotes the method. The remaining string consists of the transaction ID and party. To distill the party, the ID must be eliminated. Regex is utilized to find a substring that contains only digits or a mix of letters and digits with at least 1 letter and a digit. Finally, the method, type, and party (if it exists) can be forwarded to the next process.

The original date format from the statements is [month]/[day]/[year], which can be broken down and stored into their respective columns to allow for more specific groupings. The debit and credit amounts need to be melded into one value for ease of querying. If the debit is empty, then the income is the credited amount; else, the debit amount is the expense. Additionally, a new column is created to clarify the flow with “income” as the entry if the amount is positive and “expense” otherwise.

Usage

Data and Application Preparation

Prior to executing the program, bank statements need to be downloaded, grouped, and renamed first. For CIBC clients, the transaction history can be downloaded as CSVs on the CIBC Banking web application. It is recommended to download all transactions from the opening of the account. You can follow the guide on the CIBC website for the complete steps.

Required steps to create the input:

  • Put all of the CSV files in one folder
  • Rename the chequing statement to chequing.csv
  • Rename the savings statement to savings.csv
  • Rename the credit statement to credit.csv

For this guide, the folder will be referred to as the dataset directory. In addition, the synthetic dataset in the “synthetic_data” folder can be used as an example and input for testing the features.

You can download the program at this link. I suggest using the latest release.

To see all arguments and options, type cleaner.exe -h or cleaner.exe -help.

Create a New Worksheet

Construct a worksheet using the debit and credit statements. This command makes a new file with a customizable filename.

  1. Open Command Prompt and navigate to the directory where the program (i.e., cleaner.exe) is located
  2. Type cleaner.exe [path to the dataset directory] -c [output filename]
  3. A novel worksheet will appear in the current working directory
Example: cleaner.exe synthetic_data -c out

Build a worksheet with the filename “out” (input is without file extension) using the dataset in the “synthetic_data” folder.

Get Complements

Build a worksheet that only contains entries that are not in an existing worksheet. The existing worksheet is an Excel file. This is an extension to the create command.

  1. Type cleaner.exe [path to the dataset directory] -c [output filename] --complement [path to the existing worksheet]
  2. A new worksheet will appear in the current working directory
Example: cleaner.exe synthetic_data -c new_out --complement out.xlsx

A worksheet named “new_out” will appear in the current working directory that has complementary entries that are present in “synthetic_data” but not in the existing worksheet “out.xlsx” yet.

Update an Existing Worksheet

Append all new entries to the existing worksheet in a new file. This will create a new worksheet instead of modifying the existing one.

  1. Type cleaner.exe [path to the dataset directory] -u [path to the existing worksheet] [output filename]
  2. A new worksheet named [output filename] will appear in the current working directory
Example: cleaner.exe synthetic_data -u synthetic_data/outdated.xlsx updated

Make an updated worksheet named “updated” using the data from “synthetic_data” and the existing worksheet “synthetic_data/outdated.xlsx”.

Note: Banks provide the option to download statements from a certain start date and an end date. Note that the complement and update functions work properly only if the beginning date of the existing worksheet and the input statements are the same. For example, the current worksheet was built using statements that started from January 1, 2025. To update or complement that worksheet, the new dataset must also start from January 1, 2025.

Customize Output Path

If we want to change the path of the output file to something other than the current working directory, we can set a different directory path. This applies to all commands, but for this example, the create command will be used.

  1. Type cleaner.exe [path to the dataset directory] -c [output filename] --dirpath [path to the custom directory]
  2. A worksheet will appear in the directory at [path to the custom directory]
Example: cleaner.exe synthetic_data -c out --dirpath output

Create a worksheet “out” based on data from “synthetic_data” and save it in the “output” directory.

With these features and your willingness to learn, you can now periodically update your worksheet so that you have the latest data from your bank statements. You are free to choose the update period. I recommend updating the worksheet weekly.

Method

Before building this program, the bank statements were first investigated to find issues and patterns using Excel. Transaction records on the banking application were used as a comparison to make sense of the dataset. Moreover, the list items in the filters of the application are used to determine possible transaction types and methods.

Ideas on how to organize the statements are experimented with in a Jupyter notebook. A notebook is a collection of text and code where authors can swiftly explore, visualize, and test their code in one document. The main libraries to query and manipulate the data are Pandas and numpy. Additionally, re is used to find tokens with specific patterns. I highly recommend the Regular Expressions Quick Start by Jan Goyvaerts if you want to start learning regex. In fact, I used that guide to write the expressions in this program. Lastly, the regex expressions were tested using regex101.

After the ideas have ripened and the prototype is built, the code is translated into a command-line application. PyCharm is the development environment for this project because it provides many features that accelerates the development process. Features such as the virtual environment and debugger make the development process more convenient in isolating dependencies and tracing issues. The main library to create the command-line interface is the argparse, with the help of a GeeksforGeeks article titled “Command Line Interface Programming in Python”.

Finally, the scripts and their dependencies are bundled into one single executable using the pyinstaller library with the guidance of another useful post titled “Convert Python Script to .exe File” by GeeksforGeeks.

To create a continuously updatable database, which in this case is the worksheet, a unique identifier (UID) must be established. This program uses incrementing integers starting from 1. A UID is used to identify new and existing transactions, which makes database updates possible. However, I was thinking about a way of making a unique ID using transaction details. It occurs to me that it is possible to create one by concatenating the date, description, and amount, but this raises a problem when there is more than one transaction with an identical description, e.g., “CREDIT MEMO”, with the same amount. The resulting ID of these transactions will not be unique. To solve this potential problem, each transaction can be augmented with the order of its occurrence, starting from 1 up to infinity. Nevertheless, this is still an untested idea, and I put it here just for food for thought.

Benefits and Limitations

Naturally, this program has its advantages and constraints, which will be discussed further in this section.

Advantages

This application assists us in understanding our habits in handling our money. We can find a highly accurate approximation of our monthly and annual income and spending. Furthermore, through building a cash flow statement, our financial capacity can be revealed, which helps us set future limits and plan our budget more confidently. As a consequence, we can stay on track in our arduous and challenging journey to our financial goals.

The worksheet produced by the program can be accessed using Microsoft Excel or other spreadsheet programs, such as LibreOffice. This allows us to do more advanced analysis on our financial situation. Features such as Pivot Tables and Charts can group and visualize data so that we can unearth insights from our financial condition. The output of this program can be processed further locally where you can group it in one place, such that you have a one true source of your personal information easing the access to your critical information.

Equally important, not a single penny is charged to use this program. All it needs is time and effort to understand how to use it. As The Ink Spots beautifully serenades, “The best things in life are free”.

Below is an example of a basic analysis using a Pivot Table based on the output. The rows consist of year and month fields, respectively, the sign field as the columns, and the amount field is grouped and summed.

As mentioned in Update an Existing Worksheet, this program works with entries that are inserted manually. Adding a cash account to our cash flow statement is a case in point. A cash account means a group of transactions that we perform using cash. We may want to track it along with our digital balance cash flow to obtain a more complete image. Moreover, with the Get Complements feature, we can obtain entries that are missing due to certain reasons (e.g., forgetting to record some transactions manually) from the existing worksheet and reconcile our records.

Constraints

At this stage, the program can only process CIBC bank statements. It might work for other banks if the structure is identical to CIBC’s, however, this is not a guarantee. More samples and data are needed to further enhance the support for various banks.

Some names of parties need more refinement. There are some entries where the party still includes details, which might not be relevant. For example, “steam purchase 15.00 cad @ 1.000000”. The party is “steam” but it also includes the price and quantity, which might be irrelevant since we only want to know if the transaction is related to Steam.

Another instance is “7 eleven store #12345”. Some institutions that have branches attach certain codes to the institution’s name in the bank statements. This might obstruct our analysis as we might only be concerned about the name “7 eleven” instead of the specific branch. The cause of this is that the program assumes that the remaining string after the method and type have been extracted is the party. This allows impurities to still attach to the party. Further data refinement is necessary to produce more structured data and accurate analysis.

The cash flow only reflects the inflows and outflows in the bank account. It does not reflect spending in accounts other than the bank account. For example, Presto can be used to store funds from the bank account for paying the public transport in Ontario. We might only see that there is a lump sum payment from the bank to the Presto, but we cannot see the actual usage from the Presto, at least for now. As a result, the inflows and outflows are not exact, but they provide a good approximation for a basic cash flow statement.

Let us assume that we have a bank account and a Presto card. We deposit $100 from the bank account to the Presto card on January 1, 2025. Suppose that the current date is May 31, 2025. In our cash flow statement, it may seem that $100 was spent on transportation (i.e., Presto) in January 2025, but the following months have $0 transaction expense. Meanwhile, in the Presto account, the remaining balance is $50. If we want to approximate how much we spent each month, we can evenly distribute the total spent, which is $50, calculated by subtracting $50 (Presto’s current balance) from the $100 (Presto’s initial balance), over 5 months. The estimation is $10 per month on transportation.

Moreover, you may notice that the net cash flow in the output does not equal the current balance of deposit accounts. This might be caused by paying credit purchases using cash backs, which is not explicitly shown in the cash flow statement.

Recommendation and Credits

Moving forward, I would like to add more support for other banks, starting with major banks in Canada, such as TD, RBC, and Scotiabank. I plan to extend the input to other non-bank accounts, e.g., Presto. The reason is that more data on our financial activities may gradually lead to a more precise financial plan.

I am excited to announce that I am currently drafting a guide on cash management using various tools and how to apply this program to a real-world scenario. I will publish the article by June 21, 2025.

I would love to hear your honest questions, critiques, comments, or ideas on this project. You can contact me through email or LinkedIn. Please let me know if you find a problem with this project. If you are motivated to contribute to this project, feel free to send a pull request on the Github repository.

I hope that you are inspired to explore more about your finances and create at least a basic financial plan going forward. If we know where we are going, even though it might not be the best or most correct direction, at least we can start to work on our wishes, which leads to learning a variety of lessons, notably dealing with failures and making decisions ourselves. This idea resonates with a Bible passage that says, “The plans of the diligent lead to profit as surely as haste leads to poverty” (Proverbs 21:5, NIV).

Lastly, I would like to give my special thanks to Ninomae Ina'nis for the thumbnail art. I am grateful that she introduced me to Spice and Wolf, which inspires me to study more about economics. Wah!

Sampai jumpa di waktu berikutnya!