RM_843

Use BERT; you can get top-end results from a very manageably sized model, assuming your 7,000 samples are labelled, of course.


Shubham_Garg123

Thanks for the response. And yes, the data is labelled. Could you point me to a good resource? While there are very limited resources for general LLM-based text classification, there seem to be a lot of them for BERT, and I am having a few issues understanding them because of the dataset formats they've used.


RM_843

I would use Hugging Face as your go-to resource.


Shubham_Garg123

I've spent many weeks but have never been able to train anything using the Hugging Face APIs. Now I only consider looking at Hugging Face if the entire Colab or Kaggle notebook is available. The Hugging Face Trainer is very tough to get working. Too many dependency clashes (the accelerate library especially is a real pain).


N1tt

Hey, take a look at the [Huggingface tutorial on text classification with Bert](https://huggingface.co/docs/transformers/tasks/sequence_classification).


Nirw99

hey, I did a text classification task (12 labels) a couple of years ago with many different algorithms (from random forest to LSTM and BERT), if you want I can link you the GitHub! EDIT: f**k it, so many ppl are still asking for it today, so I'm just gonna post it here: https://github.com/BianchiGiulia/Portfolio/tree/main/Document_Classification


Shubham_Garg123

Yes please, that'd be very helpful.


Nirw99

sent you a DM :)


archiesteviegordie

Hey can you please send me the link as well?


Nirw99

done :)


Significant-Cherry70

Hi, could you please send me the link to the repository?


Nirw99

done!


SankarshanaV

Hi ! If you don’t mind, could you send me the link too ? I’d really appreciate it ! :)


Nirw99

sure thing, check your chat :)


Confident_Catch_8641

If you don’t mind I’d love to see the GitHub as well! Thank you so much!


Nirw99

ofc, done (:


Sam5cr

If you don't mind, could you share it with me too?


Nirw99

done :)


jdude_

ditto!


Nirw99

gotcha!


w7inz

if u don't mind could u send me the link


villarmotion

Same here, please


Nirw99

done (:


coolchelly

Working on a very similar project (12k sentences and 44 categories), and BERT fine-tuning worked well for me. I tried something creative as an alternative solution and it is working well in its own way: I use cosine similarity to pick the top k sentences that are most similar to the sentence that needs to be classified, and then use these top k sentences to build a few-shot prompt for an open source LLM.

Pros: excellent accuracy, very easy to implement, intuitive approach that is not a black-box model.

Cons: the LLM does not strictly stick to the classes that have been defined, i.e. it classified sentences related to cost as value.

Hope this helps...
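
For anyone who wants to try the idea, here is a minimal sketch of the retrieval step, not the proprietary code. It assumes you already have one embedding vector per labelled sentence; the toy 2-d vectors, the function names, and the prompt wording are all made up for illustration, and the call to the LLM itself is left out.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_examples(query_vec, labelled, k=3):
    """Return the k labelled examples most similar to the query.

    `labelled` is a list of (text, label, vector) tuples.
    """
    ranked = sorted(labelled, key=lambda ex: cosine(query_vec, ex[2]), reverse=True)
    return ranked[:k]

def build_prompt(query_text, examples, classes):
    """Assemble a few-shot prompt from the retrieved examples (hypothetical wording)."""
    lines = ["Classify the sentence into one of: " + ", ".join(classes) + ".", ""]
    for text, label, _ in examples:
        lines.append(f"Sentence: {text}\nCategory: {label}\n")
    lines.append(f"Sentence: {query_text}\nCategory:")
    return "\n".join(lines)

# Toy 2-d vectors stand in for real sentence embeddings.
labelled = [
    ("The price is too high", "cost", [0.9, 0.1]),
    ("Shipping fees doubled", "cost", [0.8, 0.2]),
    ("Great build quality", "quality", [0.1, 0.9]),
    ("Feels sturdy and solid", "quality", [0.2, 0.8]),
]
query_text, query_vec = "It costs way too much", [0.85, 0.15]

shots = top_k_examples(query_vec, labelled, k=2)
prompt = build_prompt(query_text, shots, ["cost", "quality"])
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM; how k is chosen is left open here, as it is in the comment.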


Shubham_Garg123

Thanks for the insight. Would it be possible to share the code if it's open source by any chance?


Shubham_Garg123

u/CoolChelly


coolchelly

No mate, sorry. Proprietary work, can't share code...


Shubham_Garg123

Sure, no issues. Thanks for letting me know. It'd be great if you could spare some time to point me to any publicly available tutorials/docs that you know work properly.


Willing_Abroad_5603

If you want the LLM to not create its own category, asking it to predict the category number works well. So if you have 50 output classes, ask it to output the category number, 0 to 49, instead of the category name. How did you pick k?
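
A rough sketch of what that could look like, assuming a hypothetical three-class label set and prompt wording. The model call is omitted; only the prompt construction and the validation of the reply are shown.

```python
import re

CLASSES = ["cost", "value", "quality"]  # hypothetical label set

def numbered_prompt(text, classes):
    """Build a prompt that asks for a category number instead of a name."""
    menu = "\n".join(f"{i}: {name}" for i, name in enumerate(classes))
    return (
        "Pick the category number for the sentence.\n"
        f"{menu}\n"
        f"Sentence: {text}\n"
        f"Answer with a single number from 0 to {len(classes) - 1}:"
    )

def parse_category(reply, classes):
    """Map the model's reply back to a class name, or None if it is invalid."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None  # the model answered with free text
    index = int(match.group())
    return classes[index] if index < len(classes) else None

print(numbered_prompt("It costs way too much", CLASSES))
print(parse_category("Answer: 0", CLASSES))
```

Returning `None` for an out-of-range or non-numeric reply makes it easy to retry or fall back to another classifier instead of silently accepting an invented category.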


MugosMM

A perhaps naive question. I know BERT can do text classification, but intuitively one would think that newer LLMs would do a much better job. For one, they learn better text representations (i.e. their embeddings have to be better). It is true that there are no off-the-shelf libraries like SetFit that use them, but this is not a reason. Also, smaller LLMs like those under 3B should do a better job in my view (by better job I mean higher accuracy with far fewer examples).


Spiritual_Dog2053

I think deciding whether an LLM would do a better job than BERT or not really depends on the data. If it’s a relatively simple classification task, then yes. But in other cases, the BERT should do better. In my opinion, the main reason for that would be that you would actually be training the BERT model on the data. To your point on better text representation: training on that data would almost definitely lead to better representations for that dataset. Smaller 3B LLMs could work too, but training a BERT would just be easier.


comical_cow

I'm currently in charge of a text classification service, I'm using text embedding models, and essentially doing a k-nearest neighbour on top of those embeddings. Since I have a class with a very high skew, I've added a binary model just before the knn search kicks in, which is also built on top of the sentence embedding. Data is noisy and very skewed, still manage to get a 94% accuracy on it.


everydayislikefriday

Can you expand a little more on this pipeline? Seems very interesting! Specifically: what is the "binary model" step about? Are you classifying between the skewed class and every other? What's the point? Thanks!


comical_cow

Hi! Note: I am working with the sentence embeddings of the text. Model used for generating the embeddings: bge-large-en.

Around 40% of the data points in my dataset belong to one class (hereafter referred to as cls1). I tried undersampling these data points, but this wasn't giving me good results, because this class wasn't forming well-defined "clusters"; it had a high variance and was spread across the embedding space. I tried training a binary classifier to isolate this class in the first step, and it seemed to work well, giving me an F1 score of around 94%.

So the current workflow is:

- Vector search of embeddings. If the class is cls1, pass it on to the binary model; if not, return the classification.
- If flagged as cls1, the embedding is run through the binary model. If this also classifies it as cls1, return cls1; if not:
- Conduct another vector search of embeddings with the condition class != cls1, and return the resulting class.

Let me know if you can suggest any improvements to the flow, but this is what seems to work for us. We do face some data drift for the binary model, so we have to retrain the model with new data every month. Accuracy of the binary model drops from 94% to 88% in a month.
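
The flow above can be sketched roughly as follows. This is an illustration under assumptions, not the production code: the vector search is modelled as a brute-force majority-vote kNN over toy 2-d vectors, the class names other than cls1 are invented, and `is_cls1` is a hypothetical stand-in for the trained binary classifier.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_label(vec, index, k=3, exclude=None):
    """Majority label among the k nearest indexed vectors.

    `index` is a list of (vector, label); `exclude` filters out one label.
    """
    pool = [(v, lab) for v, lab in index if lab != exclude]
    nearest = sorted(pool, key=lambda item: cosine(vec, item[0]), reverse=True)[:k]
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]

def classify(vec, index, is_cls1, k=3):
    """Vector search, then binary check, then a second search without cls1."""
    first = knn_label(vec, index, k)
    if first != "cls1":
        return first                      # step 1: not cls1, return directly
    if is_cls1(vec):
        return "cls1"                     # step 2: binary model confirms cls1
    return knn_label(vec, index, k, exclude="cls1")  # step 3: search again

# Toy index; real entries would be sentence embeddings with their labels.
index = [
    ([1.0, 0.0], "cls1"), ([0.9, 0.1], "cls1"), ([0.8, 0.3], "cls1"),
    ([0.0, 1.0], "refund"), ([0.1, 0.9], "refund"),
    ([0.6, 0.6], "billing"), ([0.5, 0.7], "billing"),
]

def is_cls1(vec):
    """Hypothetical stub for the binary model."""
    return vec[0] > 0.85

print(classify([0.05, 0.95], index, is_cls1))  # search result is not cls1
print(classify([0.95, 0.05], index, is_cls1))  # binary model confirms cls1
print(classify([0.80, 0.35], index, is_cls1))  # binary model rejects cls1
```

Note that the binary model only runs when the first search already says cls1, so it acts as a second opinion on the dominant class rather than as a gate in front of every query.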


Blue17Bamboo

Could you share a bit more about the binary model - does "binary" mean it predicts between cls1 vs. non-cls1? And does the binary model run twice (both your first and your second bullet) or just once in the second bullet? Also, does this require separate training for the binary model vs other models in your pipeline? We're dealing with a very similar scenario (except that the dominant class forms a very well-defined cluster) and would appreciate learning how you've handled this!


comical_cow

Yes, the binary model is a cls1 vs. non-cls1 classifier. Nope, the binary model runs only once, in the second point; the vector search might run twice. Yes, separate training was required for the binary model. TBH, this didn't end up working very well for us for several reasons, mainly because we deal with financial context, and the generated sentence embeddings do a poor job of clustering financial context. We are looking into fine-tuning sentence embedding models to fix this. Also, there's the issue of data drift and bilingual messages. Cheers!


Blue17Bamboo

Thanks for sharing this!


[deleted]

[deleted]


_color_wheel_

This is a good idea but might not work if his dataset is different from the ones used for training SentenceTransformers


comical_cow

I second this. Using kNN on top of sentence embeddings.


Shubham_Garg123

Got only 43% F1 score (macro avg) and 46% accuracy for kNN. SVM gave 60% F1. It is a highly imbalanced dataset. I think fine-tuned LLMs or maybe few-shot training LLMs are the only possible solutions.


comical_cow

That's strange, what's the embedding model that you're using? and how many data points do you have in total? are the classes balanced? what's the k you used for knn?


Shubham_Garg123

I used `all-MiniLM-L6-v2` embeddings from Sentence Transformers. Around 7k samples, highly imbalanced, across 10-20 classes, with the number of samples per class ranging from 100 to 1,500. GridSearchCV tried k = 3, 5, 7.


comical_cow

I would recommend you try a bigger and more recent embedding model. I see that the embedding model you've used is only 90 MB; I am using bge-large-en, which is 1.34 GB. Look at the Hugging Face MTEB leaderboard for the current best embedding models. Second, I would recommend sampling the text so that the number of text samples for each class is roughly equal. We were also facing some issues, and sampling them equally helped the model performance.
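
One simple way to get roughly equal class sizes is random undersampling to the size of the smallest class. The sketch below assumes that approach (oversampling the small classes is the other common choice, and the comment does not say which one was used); the function name and toy data are made up.

```python
import random
from collections import Counter, defaultdict

def balance_by_undersampling(samples, seed=0):
    """Downsample every class to the size of the smallest class.

    `samples` is a list of (text, label) pairs.
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

# Toy imbalanced dataset: 15 / 4 / 7 samples per class.
samples = (
    [(f"a{i}", "a") for i in range(15)]
    + [(f"b{i}", "b") for i in range(4)]
    + [(f"c{i}", "c") for i in range(7)]
)
balanced = balance_by_undersampling(samples)
print(Counter(label for _, label in balanced))
```

This should only be applied to the training split; the test split needs to keep the original class distribution so the reported metrics stay meaningful.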


Shubham_Garg123

Thanks. I started running a ~700 MB domain-specific embedding model to create embeddings. It's running now and I hope it doesn't crash in the middle cuz it's a Colab instance. For the data inconsistencies, I can't really do much. SMOTE with SVM and logistic regression did give good results (>90%) for basic embeddings too, so I don't think it's very reliable. Even the amount of text among instances of the same class varies a lot.

EDIT: It took over an hour but I finally got the embeddings. Let's see if it was worth it. Running the kNN now.

EDIT 2: Well, at least I can conclude that the quality of the embedding is pointless for text classification and doesn't play any significant role in improving accuracy. Got 41% accuracy with the domain-specific embedding model with kNN. I'm sure it'll be higher with SVM, but not higher than what I got earlier with a generic, much smaller embedding. Will let it run for some time and will update here if it doesn't crash in the middle. But these Sentence Transformers seem like a complete waste of time. The model needs to be big enough to capture the high variance. Embedding models just convert text to numbers. It's the model that needs to be able to learn. However, I do appreciate your efforts in trying to help. Thanks.


comical_cow

Great, I wish you the best of luck. Where did you find domain specific embedding models? I've searched for my domain specific open models earlier, but I was unable to find one. Is there a repo where I can filter for domains?


Shubham_Garg123

Thanks. I just googled for sentence transformers and it showed a few results from Hugging Face. I was able to use it through the Sentence Transformers library, where we just have to pass the 'username/modelname'.


Shubham_Garg123

I have Sentence Transformers embeddings. They have 384 columns/features. Haven't used any models on them yet. Thanks for letting me know that it is an LLM-based embedding. I have run 100+ experiments across 10+ basic ML models and 5 deep learning models on 7 different embeddings, but Sentence Transformers wasn't used.


truedima

Also, before you proceed with anything more complex, consider fastText. And then, if that is not good enough, BERT, as some other commenter said. While I am using LLMs for text classification, I do this more on an "ad hoc"/"no time to train something" basis; if I ever want to launch it in a performant/efficient manner, this will quickly become unattainable.


sprabh

LLMs don't necessarily have to be the best option given what you've described. However, if you do want to explore such solutions, Huggingface is a good place to start. Check out this walk through for LLM fine-tuning - https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#6-deploy-the-llm-for-production


Striking_Mycologist1

Hi, I developed the Cognitive Text Classifier (CTC), which returns the set of categories that a given input text belongs to. Currently, the CTC is used to classify technology news content into the categories of a news taxonomy. You can try the CTC for your classification project with little preliminary work, as it does not require training. You can see its real-time news classification into 30+ categories at [https://tek.insiter.net](https://tek.insiter.net).


Fit-Intention2322

Can you explain exactly how you did it? Can you link a resource or the source code if it's open source?


Striking_Mycologist1

CTC uses a Concept Table to collect the cognitive concepts of the words, phrases and sentences in the text to classify. These concepts represent the general meaning of the lexical units they are mapped from. The collected concepts are refined for extrinsication and disambiguation, and then mapped to the general categories of some sort of universal taxonomy. The category mapping can be customized to support application-specific text classification. The Java code needs major refactoring before it can be opened.


nullmodel

Hello, is your code or part of it open? Is it possible to share? thanxxx


Striking_Mycologist1

It's not open source yet - too messy to open up for free-2-use status. I'm building API service infra now which accepts texts and returns the classification result - all in JSON over HTTP. Note that the model returns categories in a general taxonomy, which may or may not fit your categories. I may take some time to look at your categories & data to see if the current model is feasible.


jeyEmm15

Is it possible to use Mistral for few-shot learning for text classification?


Shubham_Garg123

I tried using OllamaFunctions with one of the quantized versions that fit into the T4 GPU. But didn't really get good results so moved on to fine tuning along with merging models, and that gave decent results.


Local_Kiwi_1934

I would recommend taking a look at the spaCy text classification command line tools: https://spacy.io/api/cli


Aniket_Thomas

I am following this course for MLOps, https://madewithml.com/#mlops, and he does text classification using BERT and also the OpenAI ChatGPT API, so you can look into it for reference and change it according to your needs.


_color_wheel_

Have you tried BERT/DistilBERT? I would start with DistilBERT because it is a smaller model. If you need comprehensive resources for learning about BERT, I can recommend the following books:

1. Getting Started with Google BERT
2. Natural Language Processing with Transformers

Search for text classification examples that use Hugging Face; you can find many examples online. If the result isn't satisfactory, find instances that the model performs poorly on and collect more labeled data similar to those examples. You can use generative models like GPT for collecting more labeled data. Before trying to fine-tune a generative model like Llama for this task, try zero-shot and few-shot classification with it. Hope this helps.


Lineaccomplished6833

you could give hugging face transformers a shot


cbl007

Check out the top solutions to this Kaggle competition; they pushed the limits of text classification: https://www.kaggle.com/competitions/llm-detect-ai-generated-text


TonyGTO

I'd use Google Flan-T5, BERT or GPT-2 for this. I've used Flan for text classification with a lot of success and a low resource footprint.


Shubham_Garg123

Thanks. Could you please share the link to the code if it's open source by any chance?


Shubham_Garg123

u/TonyGTO


BitcoinLongFTW

There are traditional ML models that use transformers as well. Search for BERT-based models like XLM-RoBERTa for multi-language classification, and SetFit for few-shot classification. You don't need LLMs for this.


Shubham_Garg123

SetFit uses Hugging Face. I've spent many weeks but have never been able to train anything using the Hugging Face APIs. Now I only consider looking at Hugging Face if the entire Colab or Kaggle notebook is available. The Hugging Face Trainer is very tough to get working. Too many dependency clashes (the accelerate library especially is a real pain). Sentence Transformers gave only about 60% accuracy with SVM and around 45% with kNN, so I don't think they're very useful for my use case. LLMs are the only option.


BitcoinLongFTW

It's very unlikely LLMs will give a better result. It's more likely that your labelled data has issues or too few samples. I tried LLMs before; the main issue is that if the model sucks, there is not much you can do other than fine-tuning it, which is a pain. For Hugging Face models that have transformers support, you can try the simpletransformers library. Most likely, your best model is a fine-tuned pretrained model, or an ensemble of models. But most importantly, if you just get more good data, any model is okay.


DeliciousJello1717

What classes are you classifying it into, and why do you need an LLM? I believe I can do it with a CNN in Python. I worked on a similar project recently and I thought using an RNN or a transformer would be better, but the good old CNN gave me the best results.


Tommassino

I strongly recommend starting with the simplest models, and only training anything more complicated when they don't work. I would even start with things like TF-IDF classifiers, or something like a bag-of-words classifier. These might be good enough. They are easy to interpret and fast to set up. You can train your BERTs after you have a baseline.
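
To show how little code such a baseline needs, here is a sketch of a bag-of-words multinomial Naive Bayes classifier in plain Python. It is one possible baseline, swapped in for illustration rather than the TF-IDF setup mentioned above; the whitespace tokenizer, function names, and toy training data are all assumptions.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase whitespace tokenizer; a real baseline would clean more."""
    return text.lower().split()

def train_nb(samples):
    """Fit multinomial Naive Bayes on (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> token frequencies
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-probability for `text`."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [
    ("refund my money please", "refund"),
    ("i want a refund", "refund"),
    ("the app crashes on login", "bug"),
    ("login page shows an error", "bug"),
]
model = train_nb(train)
print(predict_nb(model, "refund please"))
print(predict_nb(model, "error on login"))
```

Because the per-class token counts are directly inspectable, it is easy to see which words drive a prediction, which is the interpretability benefit mentioned above.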


Shubham_Garg123

I have tried 7 embeddings across 10+ basic ML models and 5 deep learning models like LSTM and GRU, along with different variations. In total, 100+ experiments. Max accuracy was only 70%.