Accelerate Your Machine Learning Model Development in Under an Hour
Introduction
As data scientists, we're well-acquainted with the various Python libraries employed in constructing machine learning algorithms. Most likely, you've utilized tools like pandas, NumPy, scikit-learn, TensorFlow, or PyTorch to create your models, followed by deployment with Docker, Kubernetes, or major cloud providers. While these tools are incredibly powerful and flexible, integrating them can often be complex and time-consuming, frequently stretching the timeline from data collection to production deployment to several weeks.
Machine learning platforms aim to simplify this entire process, enabling users to complete all essential steps within a single tool and in significantly less time. This not only enhances productivity but also allows for rapid prototyping and can lower the budget required for developing machine learning solutions.
In this article, I will show how to build a text classification model on the Toloka ML platform in under an hour and without writing any code. Because you start from a pre-trained model in the platform's catalog, you don't need a GPU or your own infrastructure for training or deployment. This approach is much faster than building a solution from scratch and is well suited to quick prototyping.
Let's dive in.
Problem Statement: Text Classification
In this tutorial, we will create a text classifier capable of distinguishing between three categories: politics, wellness, and entertainment. We will train the model on a small dataset of 298 samples, distributed as follows:
- POLITICS: 142 instances
- WELLNESS: 80 instances
- ENTERTAINMENT: 76 instances
These entries are taken from the News Category dataset, which was released under the Attribution 4.0 International (CC BY 4.0) license by Rishabh Misra. You can download the dataset needed for this tutorial from [here](#).
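If you prefer to rebuild a similar training sample yourself, the sketch below shows one way to do it with pandas. It assumes the public JSON-lines release of the News Category dataset (with its usual "category", "headline", and "short_description" fields) and writes a two-column CSV with "text" and "label" columns; the file and column names are my own choices, so adjust them as needed.

```python
# Optional: rebuild a similar training sample from the public News Category dataset.
# Assumes the JSON-lines release with "category", "headline", and "short_description"
# fields; adjust the file path and field names if your copy differs.
import pandas as pd

raw = pd.read_json("News_Category_Dataset_v3.json", lines=True)

# Keep only the three categories used in this tutorial.
subset = raw[raw["category"].isin(["POLITICS", "WELLNESS", "ENTERTAINMENT"])].copy()

# Combine the headline and short description into a single text column.
subset["text"] = subset["headline"].str.cat(subset["short_description"], sep=". ")

# Draw a small sample per class, roughly matching the counts above.
sample_sizes = {"POLITICS": 142, "WELLNESS": 80, "ENTERTAINMENT": 76}
parts = [
    subset[subset["category"] == label].sample(n=n, random_state=42)
    for label, n in sample_sizes.items()
]
train = pd.concat(parts)[["text", "category"]].rename(columns={"category": "label"})

train.to_csv("news_classification_sample.csv", index=False)
```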
Selecting a Base Model
For our classifier, we will utilize a pre-trained Multilingual Large Transformer model available on the Toloka Machine Learning platform. You can sign up for a free version of this tool through this [link](#). After registration, you will gain access to a diverse selection of pre-trained machine learning models in the Models/Collections section.
Among the models, two are particularly suitable for our final application: the Multilingual Large Transformer and the MultiClass Text Classification models. Both leverage extensive pre-trained transformers, enabling you to develop new natural language processing models without the need for extensive data labeling, numerous experiments, or GPU setups.
In this tutorial, we will focus on the Multilingual Large Transformer, but feel free to explore the MultiClass Text Classification model for comparison.
You can test the Multilingual Large Transformer's inference capabilities by entering text and observing the model's responses. Keep in mind that these are general model outputs; we will need to fine-tune it for our specific task.
Fine-Tuning the Model
The Multilingual Large Transformer offers two fine-tuning options based on your requirements. The first option is for text generation, where you provide text to train the model to produce similar outputs. The second option is for text classification, which requires examples of text paired with their corresponding labels. We will utilize this option since our goal is to classify news.
To begin fine-tuning, click the "Tune Classifier" tab in the model section.
After this, you will have the choice to use an existing dataset for fine-tuning or create a new one. If you're new to the platform, you'll need to create a new dataset. Just follow the "Create your dataset" link.
Creating a Dataset
To set up a dataset on the platform, you must upload a CSV file containing the instances required for training the classifier. Additionally, you will need to assign it a name and a brief description.
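Before uploading, it's worth giving the CSV a quick sanity check. The snippet below is a minimal sketch that assumes the "text" and "label" column names from the sample file prepared earlier; substitute whatever names your own file uses.

```python
# Quick sanity check of the CSV before uploading it to the platform.
# Column names "text" and "label" are assumptions from the sample above;
# adjust them to whatever your file actually uses.
import pandas as pd

df = pd.read_csv("news_classification_sample.csv")

# The file should contain one text column and one label column.
assert {"text", "label"}.issubset(df.columns), "missing expected columns"

# No empty or missing texts.
assert df["text"].notna().all() and (df["text"].str.strip() != "").all()

# Inspect the class distribution before training.
print(df["label"].value_counts())
```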
Once the dataset is created, its overview will be displayed. The datasets are versioned, with the initial version labeled as 0.0, corresponding to the raw data you uploaded.
You can annotate your dataset by clicking the "Label this version" button next to the version number. Since we are working with text, select the appropriate option and choose the columns to be used as text and labels.
In the following step, you will need to define the labels. The CSV file we provided already has three classes specified: POLITICS, WELLNESS, and ENTERTAINMENT, which should be automatically detected by the platform.
You can proceed to the next step, where you will label each individual instance. Since we already included labels in our CSV file, they should appear for each entry. This is your opportunity to verify and adjust the labels as necessary.
Upon completing this step, your dataset will be updated to version 0.1, which will be utilized for fine-tuning the model.
Fine-Tuning the Existing Model
With your dataset ready, navigate back to the Models/Collections section, select the Multilingual Large Transformer model, and go to the "Tune classifier" section.
You will need to provide information regarding the training data you wish to use for fine-tuning, specifically the labeled version of the dataset you created previously.
In this section, you will also need to give the model a name and a brief description. Once completed, click the "Run scenario" button to initiate the fine-tuning process. Once finished, you will be able to access the model and test it using the prompt.
Testing the Model
To evaluate the model's performance, input a short phrase in the model prompt, such as:
"The President of the US visited Germany last week."
The output will classify it as POLITICS!
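Since testing happens through the prompt interface, it helps to keep a few extra phrases on hand, one per category, and paste them in one at a time. The examples below are illustrative phrases I made up (they are not part of the training data), paired with the labels we would hope to see.

```python
# Illustrative spot-check phrases (not taken from the training data) to paste into
# the model prompt one at a time; the expected labels reflect what we'd hope to see.
spot_checks = [
    ("Parliament passed the new budget after a late-night vote.", "POLITICS"),
    ("A daily ten-minute walk can noticeably reduce stress levels.", "WELLNESS"),
    ("The award-winning series returns for its final season this fall.", "ENTERTAINMENT"),
]

for text, expected in spot_checks:
    print(f"{expected:>13}: {text}")
```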
You've just fine-tuned a large transformer model for your specific needs. It can now categorize news into the three designated classes: politics, wellness, and entertainment. Remarkably, this was accomplished with fewer than 300 training examples and took less than an hour. The model may not be flawless, but it shows how quickly you can prototype and experiment with new machine learning ideas on this platform, even if your dataset still needs labeling, since annotation is built into the same workflow.
If you're interested in utilizing our model, you can access it anytime through the Models/Collections section of the platform.
Further Enhancements
The machine learning platform utilized in this tutorial is currently in its Beta phase, with new features being developed continuously. Upcoming capabilities will include experiment tracking, model comparison functionalities, and tools for visualizing metrics.
For seasoned machine learning practitioners seeking more advanced features and interaction through code, a Python library is set to be released soon.
Summary
In this article, you have seen how to quickly build a simple text classification model using the Toloka ML platform. I encourage you to adapt this example to your own needs and build your own text classifier or sentiment analysis model.
Feel free to explore models designed for other data types, such as images or audio. Enjoy the experience, and please share your feedback on the platform, highlighting what you like and what could be improved.