Running a Self-Hosted LLM for Email Address Classification
Introduction
Recently I was asked to come up with a way to classify a list of tens of thousands of email addresses. Although the original request was a bit different, in this blog post I will use the example of classifying each address as coming either from a personal email provider (like Gmail or Yahoo) or from a business or other institution. As you can imagine, this seemed tedious enough that I was in no rush to start doing it manually.
However, after experimenting a bit with GPT-based models, it turned out that an LLM can do this job quite effectively when given detailed instructions. Because I expected to send a large number of requests and also wanted a detailed explanation for each classification, I felt it was more appropriate to run the model locally and not worry about the number of tokens invoiced by OpenAI.
Architecture
In this blog post I will demonstrate a way to use a self-hosted LLM for this purpose. I will use the LitServe package from the PyTorch Lightning authors (docs at https://lightning.ai/docs/litserve/home) to expose the model through an API endpoint, and write a simple client to query it. This approach is easy to adapt to other scenarios and can also be deployed to a cloud provider.
This leads to the following two separate components:
- A generic LLM server that can run any LLM of choice and listens for incoming requests on its API.
- A client component that iterates over the list of email addresses to be classified and sends a single query (see below) for every address. This prevents the LLM from having to store a long conversation in its context and gives fresh output every time.
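The per-address loop of the client component could be sketched as follows. This is a minimal sketch and not the code from the repository: the URL, the `{"prompt": ...}` request shape and the `{"output": ...}` response shape are assumptions that depend on how your server decodes requests, and the names `post_json` and `classify_all` are illustrative.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable

# Assumed endpoint: adjust to wherever your server listens.
SERVER_URL = "http://127.0.0.1:8000/predict"


def post_json(payload: dict, url: str = SERVER_URL, timeout: float = 120.0) -> dict:
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def classify_all(
    addresses: Iterable[str],
    make_prompt: Callable[[str], str],
    send: Callable[[dict], dict] = post_json,
) -> Dict[str, str]:
    """Send one independent request per address so no conversation history builds up."""
    results = {}
    for address in addresses:
        # Assumed request/response shape: {"prompt": ...} in, {"output": ...} out.
        reply = send({"prompt": make_prompt(address)})
        results[address] = reply["output"]
    return results
```

Passing the `send` function as a parameter keeps the loop easy to try out without a running server, for example by handing in a stub that returns a fixed reply.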
The model I will be using is Mistral-7B-Instruct-v0.3 from Mistral AI, which is available from the Hugging Face Hub. Don’t worry about downloading it yourself because the code will take care of this.
I will be using the following prompt:
Please assist in classifying email addresses as being from a personal email provider. You can identify personal email providers by their domain names. Please answer with YES when the email address is from a personal email provider or NO if it is another type of address like a small business or corporate address. A country-specific top level domain does not imply the address is a personal email address. If you answered YES, specify why the address is classified as being from a personal email provider. Only look at the domain names and not the usernames. If an address is from an Internet Service Provider (ISP), also classify it as YES. Now please classify the following address: [INSERT ADDRESS HERE]
As you can see, I am asking the model to explain why it gave each classification (the prompt only asks for this in case of YES, but the model always does it anyway, which is fine for me).
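Filling the [INSERT ADDRESS HERE] placeholder for each address could look roughly like this. The template text is the prompt above; the names `PROMPT_TEMPLATE` and `build_prompt` are illustrative and not taken from the repository.

```python
# Hypothetical helper: names are illustrative, not taken from the repository.
PROMPT_TEMPLATE = (
    "Please assist in classifying email addresses as being from a personal "
    "email provider. You can identify personal email providers by their domain "
    "names. Please answer with YES when the email address is from a personal "
    "email provider or NO if it is another type of address like a small "
    "business or corporate address. A country-specific top level domain does "
    "not imply the address is a personal email address. If you answered YES, "
    "specify why the address is classified as being from a personal email "
    "provider. Only look at the domain names and not the usernames. If an "
    "address is from an Internet Service Provider (ISP), also classify it as "
    "YES. Now please classify the following address: {address}"
)


def build_prompt(address: str) -> str:
    """Return the full classification prompt for a single email address."""
    return PROMPT_TEMPLATE.format(address=address.strip())
```

Because every prompt is built from scratch, each request carries the full instructions and nothing from earlier addresses.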
Results
I have used a very short mock-up list of non-existent email addresses for testing purposes. In practice you will probably want to read the addresses from a pandas DataFrame containing real-world data. The classification results are listed below:
Address | Outcome | Explanation |
---|---|---|
nonexistent@gmail.com | YES | the email address nonexistent@gmail.com is from a personal email provider (Gmail). Gmail is a free email service provided by Google, which is a well-known personal email provider. |
nonexistent@hotmail.com | YES | The domain name is hotmail.com, which is a personal email provider (Hotmail is now Outlook.com). |
nonexistent@philips.com | NO | The domain name philips.com is not a personal email provider. It is a company domain name. |
nonexistent@accenture.com | NO | Accenture is a multinational management consulting, technology services, and outsourcing corporation. It is not a personal email provider. |
nonexistent@utwente.nl | YES | The domain name utwente.nl is associated with the University of Twente, a public research university in the Netherlands. This suggests that the email address is likely not a personal email address, but rather an address associated with the university. However, it is possible that the university also provides personal email accounts to its students or staff, in which case the address could still be considered a personal email address. To be certain, further investigation would be needed to confirm the nature of the email account. However, given the domain name, it is more likely that the address is not a personal email address. |
nonexistent@chello.nl | YES | The domain name chello.nl is owned by Chello, a Dutch Internet Service Provider (ISP) that provides personal email services to its customers. Therefore, the email address nonexistent@chello.nl is classified as being from a personal email provider. |
nonexistent@gmx.de | YES | The domain name gmx.de is owned by GMX, a German Internet Service Provider that offers free email accounts to its customers. Therefore, the email address nonexistent@gmx.de is classified as being from a personal email provider. |
nonexistent@umich.edu | NO | The domain name umich.edu is associated with the University of Michigan, which is not a personal email provider. It is a public university, and its email addresses are not typically personal email addresses. |
nonexistent@yahoo.com | YES | the email address nonexistent@yahoo.com is from a personal email provider (Yahoo). This is because Yahoo is a well-known personal email provider, similar to Gmail, Hotmail, and AOL. |
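As you can see, the classification results are almost all correct, and the model’s explanations of what the domains entail are also good. The one exception is utwente.nl, which is labeled YES although its explanation concludes that the address is most likely not a personal one. In practice you will also see a bit of hallucination around lesser-used domains, so don’t trust the results blindly.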
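One way to avoid trusting the output blindly is to compare the model's labels against a small hand-maintained list of domains you already know. The sketch below is only an assumption about how such a spot check could look: `KNOWN_PERSONAL_DOMAINS` and `find_disagreements` are hypothetical names, and the list contains only domains from the table above.

```python
from typing import Dict, List

# Hand-maintained reference list (illustrative; extend with domains you trust).
KNOWN_PERSONAL_DOMAINS = {"gmail.com", "hotmail.com", "yahoo.com", "gmx.de", "chello.nl"}


def find_disagreements(labels: Dict[str, str]) -> List[str]:
    """Return addresses whose known-personal domain was not labeled YES by the model."""
    flagged = []
    for address, label in labels.items():
        # Everything after the last "@" is the domain; compare case-insensitively.
        domain = address.rsplit("@", 1)[-1].lower()
        if domain in KNOWN_PERSONAL_DOMAINS and label != "YES":
            flagged.append(address)
    return flagged
```

This only catches known personal domains that the model labeled NO. Addresses on unknown domains still need a manual look at the explanation.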
Getting started
If you want to get started yourself, please have a look at the code at https://github.com/kemperd/email-classification
Just follow the steps in the README:
- Clone the repo
- Create a Python venv and install packages from requirements.txt
- Start server: python server.py
- Start client (after server has fully started): python client.py
Code details
- I am using 4-bit quantization of the model weights to fit them in the 12 GB of VRAM on my GPU. Reported memory usage is about 5.2 GB, but switching to 8-bit quantization gave me memory issues. I assume 8-bit quantization would work with 16 GB of VRAM.
- The response formatting code in the client is very much tailored to the Mistral 7B model, which is quite stable in its responses. If you plan to use another model that gets more creative at times, you’ll want to revisit the response parsing code.
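As an illustration of what response parsing involves, here is a deliberately strict sketch. It is not the parsing code from the repository, and `parse_reply` is a hypothetical name. It assumes the reply starts with the label and returns no label otherwise, so that unexpected replies can be set aside for manual review.

```python
import re
from typing import Optional, Tuple

# Accept YES or NO only at the very start of the reply, ignoring case and
# any leading whitespace or punctuation such as markdown asterisks.
_LABEL = re.compile(r"^\W*(YES|NO)\b", re.IGNORECASE)


def parse_reply(reply: str) -> Tuple[Optional[str], str]:
    """Split a model reply into (label, explanation); label is None if none is found."""
    match = _LABEL.match(reply)
    if match is None:
        return None, reply.strip()
    label = match.group(1).upper()
    # Drop separators between the label and the explanation text.
    explanation = reply[match.end():].lstrip(" \t\r\n.,:;-*").strip()
    return label, explanation
```

Anchoring the match at the start means a reply like "Not sure about this one." yields no label, where a looser search could pick up a stray "no" somewhere in the explanation.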
Wrapup
In this blog post I have shown a way to quickly set up a client-server architecture that serves an LLM behind an API endpoint using LitServe. The LitServe package lets you quickly deploy an LLM on either your own hardware or on cloud provider infrastructure. The email classification problem as presented seems a good use case for an LLM, and the Mistral 7B model performed well given its relatively limited size.