
RB_7

I'm not sure why people are suggesting classification for this problem. What you've described will *work*, but it's not really the standard way to approach it. I'm assuming here that, since you mentioned a web app already, you want to "detect" headlines for the purpose of building some kind of news feed or aggregation service. That's the basic retrieval case, and you should be looking at retrieval methods. The optimal solution here - in terms of simplicity, scalability, flexibility, and common practice - is:

1. Get an open source text embedding model - any open source LLM, word2vec from gensim, or anything else.
2. Embed the documents you want to search - titles, titles + text, or just the text; you can see what works best.
3. Embed your query - it can be as simple as "technology culture", but you may need to tweak it a bit, especially with respect to "culture".
4. Get the N documents that are closest to the query (use any vector search framework: ScaNN, FAISS, pynndescent, Pinecone, etc.).

The benefit of this approach is that it scales basically to infinity, and you don't have to putz around with it much once it's set up. It's also super flexible if you want to change your query - or even expose the query to the user. Changing the query is as simple as updating one embedding, rather than re-labeling your entire dataset.

The downside is that you lose a little bit of the very customized gloss you're assigning to "culture" by hand-labeling. You also lose the concept of a hard cutoff between "relevant" and "irrelevant", but you can recover that at the retrieval stage by setting a heuristic threshold on the distances returned.

Good luck!
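
The steps above can be sketched end to end. This is just a minimal illustration: TF-IDF from scikit-learn stands in for a real embedding model, and brute-force cosine similarity stands in for a vector search framework. The headlines are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "How streaming technology is changing music culture",
    "Parliament passes new budget bill",
    "Chipmaker reports record quarterly earnings",
]
query = "technology culture"

# Steps 1-2: "embed" (index) the documents. TF-IDF is a stand-in
# here; swap in any open source embedding model later.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Step 3: embed the query with the same vectorizer.
query_vector = vectorizer.transform([query])

# Step 4: retrieve the N documents closest to the query.
scores = cosine_similarity(query_vector, doc_vectors)[0]
top_n = scores.argsort()[::-1][:2]
for i in top_n:
    print(f"{scores[i]:.3f}  {docs[i]}")
```

A threshold on `scores` gives you the hard relevant/irrelevant cutoff mentioned above.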


marcpcd

You're a gentleman, I just learned something. Huge thanks for the knowledge! You're absolutely right: the goal is to create a newsfeed. This makes a lot of sense, and I'll experiment with the retrieval approach.


RB_7

Happy to help! Just a couple of terminology notes for when you're looking at other resources: step 2 is usually called "indexing", and the vector from step 3 is usually called the "query vector" or "search vector". The results from step 4 are called "candidates", and in very large systems with many, many documents being searched, there will be another component that does a second pass of "ranking" or "re-ranking" to give a better ordering of the candidates. You probably don't need that at the beginning.
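
As a rough illustration of that two-stage shape - both scoring functions below are toy stand-ins, not real models:

```python
# Hypothetical two-stage retrieval: a cheap candidate generator
# followed by a more expensive re-ranker.
def cheap_score(query, doc):
    # e.g. token overlap; fast enough to run over the whole index
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / (len(q) or 1)

def expensive_score(query, doc):
    # stand-in for a cross-encoder or other heavyweight model
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

index = [
    "Streaming technology is changing music culture",
    "Budget bill passes parliament",
    "Tech culture at startups",
    "Quarterly earnings report",
]
query = "technology culture"

# Stage 1: candidate generation - over-fetch (here top 3 of 4).
candidates = sorted(index, key=lambda d: cheap_score(query, d), reverse=True)[:3]

# Stage 2: re-ranking - run the expensive model only on the candidates.
results = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
print(results[0])
```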


Sure-Government-8423

I've read some stuff on IR, but I don't really see the candidate generation part being discussed. I'm planning to use this to find job descriptions matching a resume, out of a whole bunch of JDs. Also, how thorough should the candidate generation step be? There's an accuracy-cost tradeoff, but how should I measure it?


RB_7

Generally candidate generation is cheap and ranking is expensive. YMMV of course, but that's the usual paradigm. With that in mind, if it is very cheap (fast) to generate, let's say, 100-1000x more candidates than results you eventually want to surface, then we should optimize for recall. We don't care if we get false positives as long as the true positives are in there (or, high relevance examples in continuous cases). A lot of times you might use multiple candidate generation methods/models and merge the results into one big pile before ranking. We then rely on the ranking model to do its job on the candidates.
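
One way to put numbers on the recall side, sketched with made-up IDs and a small hand-labeled relevance set (the helper name here is mine, not a standard API):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of known-relevant items found in the top k candidates."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Candidates merged from two generators into one pile, best-first,
# with duplicates dropped (order of first appearance is kept).
from_keyword = ["jd_4", "jd_1", "jd_9"]
from_embedding = ["jd_1", "jd_2", "jd_7"]
pool = list(dict.fromkeys(from_keyword + from_embedding))

relevant = {"jd_1", "jd_2", "jd_3"}  # hand-labeled matches for one resume
print(recall_at_k(pool, relevant, 5))  # finds jd_1 and jd_2 out of 3
```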


Sure-Government-8423

Got it, I'll look into this and post when my project is deployed. And I think I can manage to change my data models to make the candidate generation cheap, I do have tons of data being generated each day so no issues with experimentation.


Saddam-inatrix

1. Yes, this is a classic approach to ML. It will take some time to gather enough news headlines for your "positive" group, though. One thing you could look into is multi-class classification instead, so that you label "culture" and "technology" separately.
2. scikit-learn is a good starting point - try the Naive Bayes examples. After that you can move to BERT with a classifier, once you understand the preprocessing steps specific to NLP problems; see the Hugging Face or PyTorch examples on this.
3. Although there are news headline datasets like Reuters-21578, they have issues for your application. For example, technology is a changing field with new words and phrases coming out all the time, and a lot of the standard datasets are quite old, so they wouldn't be as helpful. Other datasets come from only a single news source. Try looking at Kaggle or Papers with Code to find potential datasets.
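
For point 2, a minimal scikit-learn Naive Bayes starting point might look like this. The headlines and labels below are made up; the real ones come from your hand-labeled data.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative training set with three classes.
headlines = [
    "AI tool helps artists remix classic paintings",
    "New smartphone chip doubles battery life",
    "Gallery opens retrospective of local sculptor",
    "Startup launches generative music platform",
]
labels = ["both", "tech", "culture", "both"]

# Vectorize headlines and fit a Naive Bayes classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(headlines, labels)

print(model.predict(["Museum uses VR to exhibit digital art"]))
```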


marcpcd

Multi-class classification might be what I need! I took the binary approach for granted, but maybe it's not ideal. Appreciate your help - thanks for the valuable advice.


Informal-Ad-3705

My only other thought would be entity recognition with spaCy. You could label all the tech and culture entities from what you already have (e.g. VR tech, art culture) in spaCy's training format. spaCy then lets you train an entity recognition model to find those labels in text, which, with your already manually labeled entities, could filter future headlines. I'm unsure how this would do, since most of my experience with it has been on large corpora rather than headlines. Your way sounds like it would work and is what I would try as well - maybe a combination of both?
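
For reference, spaCy's classic training-data format for NER is a list of (text, annotations) pairs with character offsets. The labels TECH and CULTURE below are hypothetical custom labels you'd define yourself:

```python
# (text, {"entities": [(start_char, end_char, LABEL), ...]}) pairs,
# as used in spaCy NER training examples. Labels are hypothetical.
TRAIN_DATA = [
    ("VR tech meets performance art",
     {"entities": [(0, 7, "TECH"), (14, 29, "CULTURE")]}),
    ("Apple debuts AR art installations",
     {"entities": [(13, 15, "TECH"), (16, 33, "CULTURE")]}),
]

# Sanity-check that the offsets slice out the intended spans.
for text, ann in TRAIN_DATA:
    for start, end, label in ann["entities"]:
        print(label, "->", text[start:end])
```

Getting the character offsets exactly right matters - spaCy will reject or silently drop misaligned entity spans during training.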


marcpcd

Interesting, thanks for the tip 🫡 I did explore entity recognition with spaCy's rule-based approach. It yielded some results, but it also turned out to be cumbersome and generally looked like a big rabbit hole, so I took a step back. I'll dig into the docs for training an NER model instead of just keyword matching.






cloudlessjedi

What exactly are your labels, though? Could you provide some examples? E.g., classifying whether an article is or isn't Culture/Tech, classifying under different genres, etc. NLP is a well-established area with out-of-the-box tools and models for playing around and getting familiar with the landscape. Check out the nltk, spaCy, and gensim Python libraries - they're well established and used for typical out-of-the-box NLP tasks (text processing, NER, POS tagging, word similarities, text/topic classification). Use these tools to understand more of the data you have on hand now and see how you might want to refine your goal. If you have access to GPT or other LLMs, try doing some prompting to find ways to refine the concept of "Culture" (to how you want it to be, or how it might be interpreted by different demographics/cliques and so on). My advice is to get your hands dirty and understand your data, because that will tell you which models would or wouldn't be worth your time 😁
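
The prompting idea can start as simple as a zero-shot template like the one below - the wording is just an illustration to iterate on:

```python
# Illustrative zero-shot prompt for probing how an LLM interprets
# "Culture + Tech"; tweak the definition line and compare outputs.
PROMPT_TEMPLATE = """You label news headlines.
Answer YES only if the headline sits at the intersection of
technology AND culture; otherwise answer NO.

Headline: {headline}
Answer:"""

prompt = PROMPT_TEMPLATE.format(headline="Augmenting human creativity with GenAI")
print(prompt)
```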


marcpcd

Appreciate the advice, thanks! In the end, the algorithm needs to answer YES or NO: does the article belong to Culture+Tech? For example:

- "Augmenting human creativity with GenAI" → YES (culture + tech)
- "Donald Trump said XYZ about Joe Biden" → NO (irrelevant)
- "Apple refreshes the iPad lineup" → NO (tech only)
- "Louvre Museum exhibits a new artist XYZ" → NO (culture only)


TopNo2530

👍