Kanjis project episode 1: find data and prepare it with PyTorch

As we said in episode 0, I need a lot of handwritten kanjis with good labels. And here my adventure begins.

Find data (easy? Maybe not so much)

I naively thought that there were a lot of datasets of handwritten kanjis. Well, I was wrong. In fact, I only found one, and it comes with a lot of downsides: Kuzushiji-MNIST. This dataset was created from old handwritten books, and that is the first problem: some kanjis are barely recognizable in this form:

人: this one is pretty close

This one is supposed to be 也

Other datasets seem to exist (like the ETL Character Database or the JEITA database), but they are not freely accessible. So for now, I will use Kuzushiji-MNIST (which I'll call KMNIST from now on) and accept its downsides. My idea is to later create another dataset from printed characters, and apply filters to make them look handwritten.

Since I'm going to use KMNIST, the first step is to understand its composition better. You can find my data exploration notebook on my GitHub, but here are the main characteristics:

  • The kanjis are separated into sub-folders (each sub-folder contains one or more images of a given kanji); all images are 64×64 pixels in grayscale. There are 3832 different kanjis.
  • One of the big problems with KMNIST is that it is very unbalanced: some kanjis have thousands of images, but 75% of them have fewer than 24 images. Here is a representation:
Number of images per kanji

I cannot really use kanjis with only one image alongside kanjis with 1768 images. I need to choose a minimum number of images for a kanji to be included in my models; here are some possibilities:

  • More than 1000 images: only 13 kanjis, for a total of 16985 images.
  • More than 500 images: only 47 kanjis, for a total of 39381 images.
  • More than 100 images: only 314 kanjis, for a total of 92832 images.
  • More than 50 images: only 568 kanjis, for a total of 110578 images.
  • More than 10 images: only 1566 kanjis, for a total of 133480 images.
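
These numbers come from a quick count of the images in each sub-folder. Here is a minimal sketch of how such a count can be done (the `kuzushiji-kanji` folder name and the `.png` extension are assumptions about my local layout, not something imposed by the dataset):

```python
from pathlib import Path

# Assumed layout: one sub-folder per kanji, each containing that kanji's images.
data_root = Path("kuzushiji-kanji")

# Count how many images are available for each kanji.
counts = {folder.name: len(list(folder.glob("*.png")))
          for folder in data_root.iterdir() if folder.is_dir()}

# For a few candidate minimums, how many kanjis and images would remain?
for minimum in (1000, 500, 100, 50, 10):
    kept = {kanji: n for kanji, n in counts.items() if n > minimum}
    print(f"more than {minimum}: {len(kept)} kanjis, {sum(kept.values())} images")
```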

I will put a parameter in my models to change this minimum number. But this will be a problem for training, because it introduces a bias towards the kanjis that have many images (if you have 500 images of kanji A and 1 of kanji B, then guessing that a new image is A without even looking at it is statistically the only reasonable thing to do).

Let’s start working: prepare data with PyTorch

For this project, I have decided to use PyTorch. There is no very profound reason for that: I have used Caffe in the past, and I want to learn both PyTorch and TensorFlow. PyTorch seems a little easier to understand at the beginning, so let's start with it. I'll probably use TensorFlow in my next project.

So I followed PyTorch's Quickstart tutorial with MNIST, which is great for getting the basics but does not really cover the data preparation side of things. What I take away is that there are three important concepts in PyTorch:

  • Tensors: they are essentially matrices, but with added metadata (I've read here that the internal representation differs from what we are used to with arrays, but for now I don't need that level of detail). Everything is a tensor in PyTorch: the images, the labels, the results of training, the layers of a network… (see the small sketch after this list).
  • Datasets: this is how we describe to our model how to get the data we are going to use. The important point is that we generally work with large datasets, so we don't want to load all the data into memory at once. The idea of a custom PyTorch Dataset is mainly to describe the data and how to retrieve it.
  • Dataloaders: an iterable that goes through our Dataset, batch after batch.
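
To make the tensor idea concrete, here is a tiny sketch that loads one image as a tensor (the file path is purely illustrative):

```python
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

# Load one image (hypothetical path) and turn it into a tensor.
image = Image.open("kuzushiji-kanji/U+4E5F/sample_001.png")
x = to_tensor(image)        # shape (1, 64, 64), values scaled to [0, 1]
print(x.shape, x.dtype)     # torch.Size([1, 64, 64]) torch.float32

# Labels are tensors too, for example the index of the kanji class.
y = torch.tensor(42)
```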

The first step is to describe my dataset by creating a subclass of Dataset; that's the KuzushijiDataset in my code. The most important part is probably the __getitem__ function, which will be used by my Dataloader. I don't apply any transforms to my images because I don't need data augmentation yet.
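
In rough outline, the class looks something like this (a simplified sketch, not the exact code from my repository; the min_images parameter is my illustration of the minimum discussed earlier):

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset
from torchvision.transforms.functional import to_tensor


class KuzushijiDataset(Dataset):
    """Sketch: one sub-folder per kanji, each image yields a (tensor, label) pair."""

    def __init__(self, root, min_images=50):
        folders = [f for f in Path(root).iterdir() if f.is_dir()]
        # Keep only the kanjis that have enough images.
        folders = [f for f in folders if len(list(f.glob("*.png"))) >= min_images]
        self.classes = sorted(f.name for f in folders)
        class_to_index = {name: i for i, name in enumerate(self.classes)}
        self.samples = [(path, class_to_index[f.name])
                        for f in folders for path in sorted(f.glob("*.png"))]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        image = to_tensor(Image.open(path))   # (1, 64, 64) grayscale tensor
        return image, label
```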

The next part is directly in my main function: creating the Dataloaders. Since I am training with this data, I of course need to split my Dataset into train/validation/test: train and validation let me control overfitting, and test lets me evaluate the model once it is finally trained. You can see that I fix the random seed; that's because for now I want the same split for every training run (but if I implement cross-validation later, I'll need to change that). Once my dataset is separated into train/validation/test, I just need to create a DataLoader over each of these datasets. That's the easy part if the Dataset job was done right.
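
In outline, this part of the main function looks something like this (the split proportions, seed and batch size are illustrative choices, not the exact values from my code):

```python
import torch
from torch.utils.data import DataLoader, random_split

dataset = KuzushijiDataset("kuzushiji-kanji", min_images=50)

# Example 80/10/10 split.
n_train = int(0.8 * len(dataset))
n_val = int(0.1 * len(dataset))
n_test = len(dataset) - n_train - n_val

# Fixed seed so every run works with exactly the same split.
generator = torch.Generator().manual_seed(42)
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test], generator=generator)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)
test_loader = DataLoader(test_set, batch_size=64)
```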

Now my data is ready for training! Next I will need to create and train some models. But before that, I need to find a way to compare different models; that's what we will see in episode 2.