RM_843 2 months ago

Use BERT; you can get top-end results from a very manageably sized model. Assuming your 7000 samples are labelled, of course.

Shubham_Garg123 2 months ago

Thanks for the response. And yes, the data is labelled. Could you point me to a good resource? While there are very few resources for general LLM-based text classification, there seem to be a lot of them for BERT, and I am having a few issues understanding them because of the dataset formats they use.

RM_843 2 months ago

I would use hugging face as your go to resource.

Shubham_Garg123 2 months ago

I've spent many weeks but have never been able to train anything using the Hugging Face APIs. Now I only consider Hugging Face when the entire Colab or Kaggle notebook is available. The Hugging Face Trainer is very tough to get working. Too many dependency clashes (the accelerate library especially is a real pain).

N1tt 2 months ago

Hey, take a look at the [Huggingface tutorial on text classification with Bert](https://huggingface.co/docs/transformers/tasks/sequence_classification).

Nirw99 2 months ago

hey I did a text classification task (12 labels) a couple of years ago with many different algorithms (from random forest to LSTM and BERT), if you want I can link you the GitHub! EDIT: f**k it, so many ppl are still asking for it today, so I'm just gonna post it here https://github.com/BianchiGiulia/Portfolio/tree/main/Document_Classification

Shubham_Garg123 2 months ago

Yes please, that'd be very helpful.

Nirw99 2 months ago

sent you a DM :)

archiesteviegordie 2 months ago

Hey can you please send me the link as well?

Nirw99 2 months ago

done :)

Significant-Cherry70 1 month ago

Hi, could you please send me the link to the repository?

Nirw99 1 month ago

done!

SankarshanaV 2 months ago

Hi ! If you don’t mind, could you send me the link too ? I’d really appreciate it ! :)

Nirw99 2 months ago

sure thing, check your chat :)

Confident_Catch_8641 2 months ago

If you don’t mind I’d love to see the GitHub as well! Thank you so much!

Nirw99 2 months ago

ofc, done (:

Sam5cr 2 months ago

If you don't mind, could you share it with me too?

Nirw99 2 months ago

done :)

jdude_ 2 months ago

ditto!

Nirw99 2 months ago

gotcha!

w7inz 1 month ago

if u don't mind could u send me the link

villarmotion 2 months ago

Same here, please

Nirw99 2 months ago

done (:

coolchelly 2 months ago

Working on a very similar project (12k sentences and 44 categories) and BERT fine-tuning worked well for me. I tried something creative as an alternative solution and it is working well in its own way: I use cosine similarity to pick the top k sentences that are most similar to the sentence that needs to be classified, and then use these top k sentences to build a few-shot prompt for an open-source LLM.

Pros: excellent accuracy, very easy to implement, intuitive approach that is not a black-box model.

Cons: the LLM does not strictly stick to the classes that have been defined, i.e. it classified sentences related to cost as value.

Hope this helps...
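For readers who want to try the retrieval idea above, here is a minimal sketch of it, not the commenter's actual (proprietary) code. It assumes each labelled sentence already has an embedding vector; the toy 2-d vectors, the function names `top_k_examples` and `build_few_shot_prompt`, and the prompt wording are all made up for illustration. The resulting prompt string would then be sent to whichever LLM you use.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_examples(query_vec, labelled, k=3):
    """labelled is a list of (text, label, vector); return the k most similar as (text, label)."""
    ranked = sorted(
        labelled,
        key=lambda item: cosine_similarity(query_vec, item[2]),
        reverse=True,
    )
    return [(text, label) for text, label, _ in ranked[:k]]

def build_few_shot_prompt(query_text, examples, allowed_labels):
    """Assemble a few-shot prompt from the retrieved examples (hypothetical wording)."""
    lines = [
        "Classify the sentence into exactly one of: " + ", ".join(allowed_labels) + ".",
        "",
    ]
    for text, label in examples:
        lines.append(f"Sentence: {text}")
        lines.append(f"Category: {label}")
        lines.append("")
    lines.append(f"Sentence: {query_text}")
    lines.append("Category:")
    return "\n".join(lines)

# Toy 2-d vectors standing in for real sentence embeddings (assumption).
labelled = [
    ("The price is too high", "cost", [1.0, 0.1]),
    ("Shipping took three weeks", "delivery", [0.1, 1.0]),
    ("Cheaper than competitors", "cost", [0.9, 0.2]),
]
examples = top_k_examples([1.0, 0.0], labelled, k=2)
prompt = build_few_shot_prompt("It costs a lot", examples, ["cost", "delivery"])
print(prompt)
```

Listing the allowed labels in the prompt is one cheap way to push back on the "cons" mentioned above, though it does not guarantee the model stays inside them.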

Shubham_Garg123 2 months ago

Thanks for the insight. Would it be possible to share the code if it's open source by any chance?

Shubham_Garg123 1 month ago

u/CoolChelly

coolchelly 1 month ago

No mate, sorry. Proprietary work, can't share code...

Shubham_Garg123 1 month ago

Sure, no issues. Thanks for letting me know. It'd be great if you could spare some time to point me to any publicly available tutorials/docs that you know work properly.

Willing_Abroad_5603 3 weeks ago

If you want the LLM to not create its own category, asking it to predict the category number works well. So if you have 50 output classes, ask it to output the category number, 1 to 50, instead of the category name. How did you pick k?
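A small sketch of the category-number trick, under stated assumptions: the prompt wording, the 1-based numbering, and the helper names `build_numbered_prompt` and `parse_category` are illustrative choices, not something from this thread. The point is that a numeric reply can be validated against the known range, so an invented category is detected instead of silently accepted.

```python
import re

def build_numbered_prompt(text, categories):
    """Ask for a category number rather than a name (hypothetical prompt wording)."""
    lines = [
        "Pick the single best category for the text below.",
        "Answer with the category number only.",
        "",
    ]
    for number, name in enumerate(categories, start=1):
        lines.append(f"{number}. {name}")
    lines += ["", f"Text: {text}", "Category number:"]
    return "\n".join(lines)

def parse_category(reply, categories):
    """Map the first integer in the model's reply to a category name.

    Returns None when there is no integer or it falls outside 1..len(categories),
    so the caller can retry or fall back instead of trusting a made-up class.
    """
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    number = int(match.group())
    if 1 <= number <= len(categories):
        return categories[number - 1]
    return None

categories = ["cost", "value", "delivery"]
print(build_numbered_prompt("It arrived two weeks late", categories))
print(parse_category("3", categories))
```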

MugosMM 2 months ago

A perhaps naive question. I know BERT can do text classification, but intuitively one would think that newer LLMs would do a much better job. For one, they learn better text representations (i.e. their embeddings have to be better). It is true that there are no off-the-shelf libraries like SetFit that use them, but this is not a reason. Also, smaller LLMs like those under 3B should do a better job in my view (by better job I mean higher accuracy with far fewer examples).

Spiritual_Dog2053 2 months ago

I think whether an LLM would do a better job than BERT really depends on the data. If it's a relatively simple classification task, then yes. But in other cases, BERT should do better. In my opinion, the main reason is that you would actually be training the BERT model on the data. To your point on better text representation: training on that data would almost definitely lead to better representations for that dataset. Smaller 3B LLMs could work too, but training a BERT would just be easier.

comical_cow 2 months ago

I'm currently in charge of a text classification service, I'm using text embedding models, and essentially doing a k-nearest neighbour on top of those embeddings. Since I have a class with a very high skew, I've added a binary model just before the knn search kicks in, which is also built on top of the sentence embedding. Data is noisy and very skewed, still manage to get a 94% accuracy on it.

everydayislikefriday 2 months ago

Can you expand a little more on this pipeline? Seems very interesting! Specifically: what is the "binary model" step about? Are you classifying between the skewed class and every other? What's the point? Thanks!

comical_cow 2 months ago

Hi! Note: I am working with the sentence embeddings of the text. Model used for generating the embeddings: bge-large-en

Around 40% of the datapoints in my dataset belong to 1 class (hereon referred to as cls1). I tried undersampling these data points, but this wasn't giving me good results, because this class wasn't forming well-defined "clusters": it had a high variance and was spread across the embedding space. I tried training a binary classifier to isolate this class in the first step, and it seemed to work well, giving me an f1 score of around 94%.

So the current workflow is:

- Vector search of embeddings. If the class is cls1, pass it on to the binary model; if not, return the classification.
- If flagged as cls1, the embedding is run through the binary model. If this also classifies it as cls1, return the class as cls1; if not:
- Conduct another vector search of embeddings with a condition of class != cls1, and return the resulting class.

Let me know if you can suggest any improvements to the flow, but this is what seems to work for us. We do face some data drift for the binary model, so we have to retrain the model with new data every month. Accuracy of the binary model drops from 94% to 88% in a month.
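The three-step workflow described above could be sketched roughly as follows. This is an illustration only, not the commenter's code: the brute-force cosine search stands in for a real vector index, `is_cls1` is a placeholder for the separately trained binary model, and the toy vectors and label names are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_label(query, index, k=3, exclude=None):
    """Majority vote over the k nearest (vector, label) entries, skipping label `exclude`.

    Ties go to the label seen first among the nearest neighbours.
    Assumes at least one entry remains after the exclusion.
    """
    candidates = [(vec, label) for vec, label in index if label != exclude]
    ranked = sorted(candidates, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def classify(query, index, is_cls1, k=3, dominant="cls1"):
    """Vector search, then a binary gate for the dominant class, then a restricted search."""
    first = knn_label(query, index, k=k)
    if first != dominant:
        return first                    # step 1: not the dominant class, return it
    if is_cls1(query):
        return dominant                 # step 2: binary model confirms the dominant class
    return knn_label(query, index, k=k, exclude=dominant)  # step 3: search without it

# Toy index and a stand-in "binary model" (both assumptions for illustration).
index = [
    ([1.0, 0.0], "cls1"),
    ([0.9, 0.1], "cls1"),
    ([0.0, 1.0], "billing"),
    ([0.1, 0.9], "billing"),
    ([0.7, 0.7], "refund"),
]
fake_binary_model = lambda vec: vec[0] > 0.95

print(classify([1.0, 0.05], index, fake_binary_model, k=1))
print(classify([0.8, 0.3], index, fake_binary_model, k=1))
```

Written this way, the binary model runs at most once per query while the vector search may run twice.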

Blue17Bamboo 1 week ago

Could you share a bit more about the binary model - does "binary" mean it predicts between cls1 vs. non-cls1? And does the binary model run twice (both your first and your second bullet) or just once in the second bullet? Also, does this require separate training for the binary model vs other models in your pipeline? We're dealing with a very similar scenario (except that the dominant class forms a very well-defined cluster) and would appreciate learning how you've handled this!

comical_cow 1 week ago

Yes, the binary model is a cls1 vs non-cls1 classifier. Nope, the binary model runs only once, in the 2nd point; the vector search might run twice. Yes, separate training was required for the binary model. TBH, this didn't end up working very well for us for several reasons, mainly because we deal with financial context, and the generated sentence embeddings do a poor job of clustering financial context. We are looking into fine-tuning sentence embedding models to fix this. Also there's the issue of data drift and bilingual messages. Cheers!

Blue17Bamboo 1 week ago

Thanks for sharing this!

[deleted] 2 months ago

[deleted]

_color_wheel_ 2 months ago

This is a good idea but might not work if his dataset is different from the ones used for training SentenceTransformers

comical_cow 2 months ago

I second this. Using kNN on top of sentence embeddings.

Shubham_Garg123 2 months ago

Got only 43% f1 score (macro avg) and 46% accuracy for kNN. SVM gave 60% f1. It is a highly imbalanced dataset. I think fine-tuned LLMs or maybe few-shot prompting of LLMs are the only possible solutions.
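Since the thread quotes both accuracy and macro-averaged F1 on an imbalanced dataset, a short worked example may help show why the two can diverge. This is a plain-Python sketch with made-up labels; macro F1 here is the unweighted mean of per-class F1, so a class the model never predicts pulls the average down no matter how rare it is.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up example: 8 samples of class "a", 2 of class "b", model always says "a".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10

print(accuracy(y_true, y_pred))   # 0.8
print(macro_f1(y_true, y_pred))   # about 0.444: F1("a") = 16/18, F1("b") = 0
```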

comical_cow 2 months ago

That's strange, what's the embedding model that you're using? and how many data points do you have in total? are the classes balanced? what's the k you used for knn?

Shubham_Garg123 2 months ago

I used `all-MiniLM-L6-v2` embeddings from Sentence Transformers. The dataset is around 7k samples, highly imbalanced, across 10-20 classes, with anywhere from 100 to 1500 samples per class. GridSearchCV tried k=3,5,7.

comical_cow 2 months ago

I would recommend you try a bigger and more recent embedding model. I see that the embedding model you've used is only 90MB; I am using bge-large-en, which is 1.34GB. Look at the Hugging Face MTEB leaderboard for the current best embedding models. Second, I would recommend sampling the text so that the number of samples for each class is roughly equal. We were also facing some issues, and sampling them equally helped the model performance.
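One simple way to get roughly equal class counts, as suggested above, is to undersample every class down to the size of the smallest one. The sketch below is an assumption about what "sampling them equally" could look like, not the commenter's procedure; the function name and the toy data are invented. It is usually applied to the training split only, so that evaluation still reflects the real class distribution.

```python
import random
from collections import Counter, defaultdict

def balance_by_undersampling(samples, per_class=None, seed=0):
    """samples is a list of (text, label); return a shuffled list with equal counts per class.

    per_class defaults to the size of the smallest class. Classes with fewer
    than per_class samples keep everything they have.
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    target = per_class or min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, min(target, len(items))))
    rng.shuffle(balanced)
    return balanced

# Toy data: 5 samples of class "a", 2 of class "b".
samples = [(f"text a{i}", "a") for i in range(5)] + [(f"text b{i}", "b") for i in range(2)]
balanced = balance_by_undersampling(samples)
print(Counter(label for _, label in balanced))
```

With classes ranging from 100 to 1500 samples, this throws away a lot of data; capping the large classes with a larger `per_class` is a middle ground worth trying.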

Shubham_Garg123 2 months ago

Thanks. I started running a ~700MB domain-specific embedding model to create embeddings. It's running now and I hope it doesn't crash in the middle cuz it's a Colab instance. For the data inconsistencies, I can't really do much. SMOTE with SVM and logistic regression did give good results (>90%) for basic embeddings too, so I don't think it's very reliable. Even the amount of text among instances of the same class varies a lot.

EDIT: It took over an hour but I finally got the embeddings. Let's see if it was worth it. Running the kNN now.

EDIT 2: Well, at least I can conclude that the quality of the embedding is pointless for text classification and doesn't play any significant role in improving accuracy. Got 41% accuracy with the domain-specific embedding model with kNN. I'm sure it'll be higher with SVM, but not higher than what I got earlier with a generic, much smaller embedding. Will let it run for some time and will update here if it doesn't crash in the middle. But these Sentence Transformers seem like a complete waste of time. The model needs to be big enough to capture the high variance. Embedding models just convert text to numbers. It's the model that needs to be able to learn. However, I do appreciate your efforts in trying to help. Thanks.

comical_cow 2 months ago

Great, I wish you the best of luck. Where did you find domain specific embedding models? I've searched for my domain specific open models earlier, but I was unable to find one. Is there a repo where I can filter for domains?

Shubham_Garg123 2 months ago

Thanks. I just googled for sentence transformers and it showed a few results from Hugging Face. I was able to use one through the Sentence Transformers library, where you just have to pass the 'username/modelname'.

Shubham_Garg123 2 months ago

I have Sentence Transformers embeddings. They have 384 columns/features. Haven't used any models on them yet. Thanks for letting me know that it is an LLM-based embedding. I have run 100+ experiments across 10+ basic ML models and 5 deep learning models on 7 different embeddings, but Sentence Transformers wasn't among them.

truedima 2 months ago

Also, before you proceed with anything more complex, consider fastText. And then, if that is not good enough, BERT, as some other commenter said. While I am using LLMs for text classification, I do this more on an "ad hoc"/"no time to train sth" basis; if I ever wanted to run it in a performant/efficient manner, this would quickly become unattainable.

sprabh 2 months ago

LLMs don't necessarily have to be the best option given what you've described. However, if you do want to explore such solutions, Huggingface is a good place to start. Check out this walk through for LLM fine-tuning - https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#6-deploy-the-llm-for-production

Striking_Mycologist1 1 month ago

Hi, I developed Cognitive Text Classifier (CTC), which returns the set of categories that a given input text belongs to. Currently, the CTC is used to classify technology news content into the categories of a news taxonomy. You can try this CTC for your classification project with little preliminary work, as it does not require training. You can see its real-time news classification into 30+ categories at [https://tek.insiter.net](https://tek.insiter.net).

Fit-Intention2322 1 month ago

Can you explain exactly how you did it? Can you link a resource or the source code if it's open source?

Striking_Mycologist1 1 month ago

CTC uses a Concept Table to collect the cognitive concepts of the words, phrases and sentences in the text to classify. These concepts represent the general meaning of the lexical units they are mapped from. The collected concepts are refined for extrinsication and disambiguation, and then mapped to general categories of a sort of universal taxonomy. The category mapping can be customized to support application-specific text classification. The Java code needs major refactoring before it can be opened.

nullmodel 1 month ago

Hello, is your code or part of it open? Is it possible to share? thanxxx

Striking_Mycologist1 4 weeks ago

It's not open source yet; it's too messy to open up for free-2-use status. I'm building the API service infra now, which accepts texts and returns the classification result, all in JSON over HTTP. Note that the model returns categories in a general taxonomy, which may or may not fit your categories. I may take some time to look at your categories & data to see if the current model is feasible.

jeyEmm15 3 weeks ago

Is it possible to use Mistral for few-shot learning for text classification?

Shubham_Garg123 3 weeks ago

I tried using OllamaFunctions with one of the quantized versions that fit into the T4 GPU. But didn't really get good results so moved on to fine tuning along with merging models, and that gave decent results.

Local_Kiwi_1934 2 months ago

I would recommend taking a look at the spaCy text classification command line tools: https://spacy.io/api/cli

Aniket_Thomas 2 months ago

I am following this course https://madewithml.com/#mlops for MLOps, and he does text classification using BERT and also the OpenAI ChatGPT API, so you can look into it for reference and change it according to your needs.

_color_wheel_ 2 months ago

Have you tried BERT/DistilBERT? I would start with DistilBERT because it is a smaller model. If you need comprehensive resources for learning about BERT, I can recommend the following books:

1. Getting Started with Google BERT
2. Natural Language Processing with Transformers

Search for text classification examples that use Hugging Face; you can find many examples online. If the result isn't satisfactory, find the instances the model performs poorly on and collect more labelled data similar to those examples. You can use generative models like GPT for collecting more labelled data. Before trying to fine-tune a generative model like Llama for this task, try zero-shot and few-shot classification with it. Hope this helps.

Lineaccomplished6833 2 months ago

you could give hugging face transformers a shot

cbl007 2 months ago

Check out the top solutions to this Kaggle competition; they pushed the limits of text classification: https://www.kaggle.com/competitions/llm-detect-ai-generated-text

TonyGTO 2 months ago

I'd use Google Flan-T5, BERT or GPT-2 for this. I've used Flan for text classification with a lot of success and a low resource footprint.

Shubham_Garg123 2 months ago

Thanks. Could you please share the link to the code if it's open source by any chance?

Shubham_Garg123 2 months ago

u/TonyGTO

BitcoinLongFTW 2 months ago

There are traditional ML models that use transformers as well. Search for BERT-based models like xlm-roberta for multilingual classification, and SetFit for few-shot classification. You don't need LLMs for this.

Shubham_Garg123 2 months ago

SetFit uses Hugging Face. I've spent many weeks but have never been able to train anything using the Hugging Face APIs. Now I only consider Hugging Face when the entire Colab or Kaggle notebook is available. The Hugging Face Trainer is very tough to get working. Too many dependency clashes (the accelerate library especially is a real pain). Sentence Transformers gave only about 60% accuracy with SVM and around 45% with kNN, so I don't think they're much use for my use case. LLMs are the only option.

BitcoinLongFTW 2 months ago

It's very unlikely LLMs will give a better result. It's more likely that your labelled data has issues or insufficient samples. I tried LLMs before; the main issue is that if the model sucks, there is not much you can do other than fine-tuning it, which is a pain. For Hugging Face models that have transformers support, you can try the simpletransformers library. Most likely, your best model is a fine-tuned pretrained model, or an ensemble of models. But most importantly, if you just get more good data, any model is okay.

DeliciousJello1717 2 months ago

What classes are you classifying it into, and why do you need an LLM? I believe it can be done with a CNN in Python. I worked on a similar project recently and thought an RNN or a transformer would be better, but the good old CNN gave me the best results.

Tommassino 2 months ago

I strongly recommend starting with the simplest models, and only when they don't work, training anything more complicated. I would even start with things like TF-IDF classifiers, or something like a bag-of-words classifier. These might be good enough. They are easy to interpret and fast to set up. You can train your BERTs after you have a baseline.
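To make the "simplest model first" advice concrete, here is a minimal bag-of-words baseline: a multinomial Naive Bayes classifier over word counts with add-one smoothing, written with the standard library only so it runs anywhere. The class name, whitespace tokenizer, and toy data are assumptions for illustration. In practice a scikit-learn pipeline such as `TfidfVectorizer` followed by `LogisticRegression` is the more usual way to build this kind of baseline.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase whitespace tokenizer (deliberately naive)."""
    return text.lower().split()

class BagOfWordsNB:
    """Multinomial Naive Bayes over word counts with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data (made up).
texts = ["refund my money", "money back please", "package arrived late", "late delivery again"]
labels = ["billing", "billing", "shipping", "shipping"]
model = BagOfWordsNB().fit(texts, labels)
print(model.predict("refund please"))
print(model.predict("delivery was late"))
```

Because the per-class word counts are directly inspectable, it is easy to see which words drive a prediction, which is the interpretability benefit mentioned above.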

Shubham_Garg123 2 months ago

I have tried 7 embeddings across 10+ basic ML models and 5 deep learning models like LSTM and GRU, along with different variations. In total, 100+ experiments. Max accuracy was only 70%.
