Top LinkedIn Content on Training Content Management Systems

building AI systems @meta

207,118 followers 1y

Disclosing the full list of datasets used to train IBM LLMs Granite 3.0. This is true transparency - no other LLM provider shares such detailed information about their training datasets.

WEB DATA
- FineWeb: More than 15T tokens of cleaned and deduplicated English data from CommonCrawl.
- Webhose: Unstructured web content in English converted into machine-readable data.
- DCLM-Baseline: A 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks.

CODE
- Code Pile: Sourced from publicly available datasets like GitHub Code Clean and StarCoderdata.
- FineWeb-Code: Contains programming/coding-related documents filtered from the FineWeb dataset using annotation.
- CodeContests: Competitive programming dataset with problems, test cases, and human solutions in multiple languages.

DOMAIN
- USPTO: Collection of US patents granted from 1975 to 2023.
- Free Law: Public-domain legal opinions from US federal and state courts.
- PubMed Central: Biomedical and life sciences papers.
- EDGAR Filings: Annual reports from US publicly traded companies over 25 years.

MULTILINGUAL
- Multilingual Wikipedia: Data from 11 languages to support multilingual capabilities.
- Multilingual Webhose: Multilingual web content converted into machine-readable data feeds.
- MADLAD-12: Document-level multilingual dataset covering 12 languages.

INSTRUCTIONS
- Code Instructions Alpaca: Instruction-response pairs about code generation problems.
- Glaive Function Calling: Dataset focused on function calling in real scenarios.

ACADEMIC
- peS2o: A collection of 40M open-access academic papers for pre-training.
- arXiv: Scientific paper pre-prints posted to arXiv. Full author acknowledgement can be found here.
- IEEE: Technical content from IEEE acquired by IBM.

TECHNICAL
- Wikipedia: Technical articles sourced from Wikipedia.
- Library of Congress Public Domain Books: More than 140,000 public domain English books.
- Directory of Open Access Books: Publicly available technical books from the Directory of Open Access Books.
- Cosmopedia: Synthetic textbooks, blog posts, stories, and WikiHow articles.

MATH
- OpenWebMath: Mathematical text from the internet, filtered from 200B HTML files.
- Algebraic-Stack: Mathematical code dataset including numerical computing and formal mathematics.
- Stack Exchange: User-contributed content from the Stack Exchange network.
- MetaMathQA: Dataset of rewritten mathematical questions.
- StackMathQA: A curated collection of 2 million mathematical questions from Stack Exchange.
- MathInstruct: Focused on chain-of-thought (CoT) and program-of-thought (PoT) rationales for mathematical reasoning.
- TemplateGSM: Collection of over 7 million grade-school math problems with code and natural language solutions.

BOOM!

114 Comments

Sneha Vijaykumar

25,712 followers 2mo

You’re in an AI Engineer interview. Interviewer asks: How do you handle multi-language prompting effectively? Most people jump to translation APIs. A strong answer goes deeper.

1. Detect language first. Never assume. Identify the user’s language and script before prompting.
2. Preserve intent, not just words. Literal translation often breaks tone, context, and business meaning.
3. Prompt in the user’s language when possible. Models usually respond better when instructions and output language align.
4. Use English for complex reasoning, then localize output. For harder logic tasks, reasoning in English + final response in the target language often works better.
5. Handle mixed-language inputs. Real users switch languages mid-sentence. Your system should too.
6. Keep terminology consistent. Especially for healthcare, finance, legal, and product names.
7. Test by language, not globally. Kannada, Hindi, Tamil, Japanese, Arabic, Spanish all fail differently.
8. Build fallback layers. If confidence is low, ask clarifying questions instead of hallucinating.

What interviewers want to hear: You understand that multilingual AI is a product problem, not just a translation problem.

#AI #GenerativeAI #PromptEngineering #LLM #AIEngineer #MachineLearning #NLP #AIEngineering

Follow Sneha Vijaykumar for more... 😊
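As a rough illustration of "detect first" and "handle mixed inputs", here is a minimal Python sketch that guesses the dominant script of a prompt from Unicode character names and flags code-switched text. It is only a sketch under stated assumptions: `script_counts`, `route_prompt`, and the 0.8 threshold are hypothetical names and values, not something from the post. It also detects script, not language, so Hindi vs. Marathi or Spanish vs. English would still need a real language-identification model on top.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters per script, using the first word of each Unicode character name."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1  # e.g. "LATIN", "DEVANAGARI", "ARABIC"
    return counts


def route_prompt(text: str, min_share: float = 0.8) -> dict:
    """Return the dominant script, its share of letters, and a mixed-input flag.

    min_share is an assumed threshold: below it the input is treated as mixed.
    Note: Japanese text mixing kana and kanji will also be flagged as mixed here.
    """
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        # Nothing to detect: a caller could ask a clarifying question here.
        return {"script": None, "share": 0.0, "mixed": False}
    script, n = counts.most_common(1)[0]
    share = n / total
    return {"script": script, "share": share, "mixed": share < min_share}


single = route_prompt("नमस्ते दुनिया")
mixed = route_prompt("Hello नमस्ते")
```

A caller could use the `mixed` flag to pick a code-switching prompt template, and treat `script is None` as the low-confidence case where a clarifying question is safer than a guess.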

1 Comment

Karen Kim

CEO @ Human Managed, the AI-Native Service Operator for Enterprise Cyber, Risk, and Digital.

5,957 followers 1y

User Feedback Loops: the missing piece in AI success? AI is only as good as the data it learns from -- but what happens after deployment? Many businesses focus on building AI products but miss a critical step: ensuring their outputs continue to improve with real-world use. Without a structured feedback loop, AI risks stagnating, delivering outdated insights, or losing relevance quickly. Instead of treating AI as a one-and-done solution, companies need workflows that continuously refine and adapt based on actual usage. That means capturing how users interact with AI outputs, where it succeeds, and where it fails. At Human Managed, we’ve embedded real-time feedback loops into our products, allowing customers to rate and review AI-generated intelligence. Users can flag insights as: 🔘Irrelevant 🔘Inaccurate 🔘Not Useful 🔘Others Every input is fed back into our system to fine-tune recommendations, improve accuracy, and enhance relevance over time. This is more than a quality check -- it’s a competitive advantage. - for CEOs & Product Leaders: AI-powered services that evolve with user behavior create stickier, high-retention experiences. - for Data Leaders: Dynamic feedback loops ensure AI systems stay aligned with shifting business realities. - for Cybersecurity & Compliance Teams: User validation enhances AI-driven threat detection, reducing false positives and improving response accuracy. An AI model that never learns from its users is already outdated. The best AI isn’t just trained -- it continuously evolves.
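To make the flagging idea concrete, here is a small Python sketch of how per-insight feedback could be recorded and summarised. It is purely illustrative: the post does not describe Human Managed's implementation, so `InsightFeedback`, the flag names, and the flag-rate measure are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field

# Flag categories taken from the post; the identifiers themselves are assumed.
FLAGS = ("irrelevant", "inaccurate", "not_useful", "other")


@dataclass
class InsightFeedback:
    """Hypothetical record of how users reacted to one AI-generated insight."""
    insight_id: str
    views: int = 0
    flags: Counter = field(default_factory=Counter)

    def record_view(self) -> None:
        self.views += 1

    def record_flag(self, flag: str) -> None:
        if flag not in FLAGS:
            raise ValueError(f"unknown flag: {flag}")
        self.flags[flag] += 1

    def flag_rate(self) -> float:
        """Share of views that produced a flag (0.0 when there are no views)."""
        return sum(self.flags.values()) / self.views if self.views else 0.0


fb = InsightFeedback("insight-001")
for _ in range(4):
    fb.record_view()
fb.record_flag("inaccurate")
fb.record_flag("not_useful")
```

In a setup like this, insights whose flag rate crosses some agreed threshold could be queued for review or used as negative examples when the recommendations are retrained.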

1 Comment

Allys Parsons

Co-Founder at techire ai. Hiring in AI since ’19 ✌️ Speech AI, TTS, Audio, Multimodal AI & more! Top 200 Women Leaders in Conversational AI ‘23 | No.1 Conversational AI Leader ‘21

18,350 followers 1y

Latest research from KAIST and Imperial College London introduces Zero-AVSR, an innovative framework that enables audio-visual speech recognition across languages without requiring training data in target languages. By learning language-agnostic speech representations through romanisation and leveraging LLMs, it can recognise speech even in languages never seen during training. What makes this approach interesting is the scale of language support. The team created MARC, a dataset spanning 2,916 hours of audio-visual speech across 82 languages—far beyond the 9 languages typical systems support. Their results show comparable performance to traditional multilingual systems while supporting this vastly larger language inventory. Zero-AVSR represents a significant advancement for speech tech in low-resource languages, potentially democratising access across thousands of languages without requiring extensive labelled datasets for each. The approach particularly excels when recognising languages from families similar to those in the training data, suggesting promising pathways for further expansion. Paper: https://lnkd.in/dnw_V7XK Authors: Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro #SpeechRecognition #MultilingualAI #SpeechAI

Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations arxiv.org

2 Comments

Kuldeep Singh Sidhu

Senior Data Scientist @ Walmart | BITS Pilani

16,749 followers 1y

Exciting breakthrough in multilingual embedding models! A team of researchers from HIT and Tongji University has developed KaLM-Embedding, setting a new standard for models under 1B parameters.

What makes this model special? It leverages cleaner, more diverse training data and introduces three game-changing techniques:
1. Persona-based synthetic data generation using Qwen2-72B-Instruct, creating 550k diverse examples across 6 task types
2. Ranking consistency filtering to remove noise and improve data quality by ensuring positive examples rank within top-k matches
3. Semi-homogeneous task batching that balances negative sample hardness with false negative risks

Under the hood, KaLM-Embedding uses Qwen2-0.5B as its foundation and implements Matryoshka Representation Learning for flexible embedding dimensions (896 down to 64). The model excels in Chinese and English while showing strong performance across other languages.

The results? KaLM-Embedding achieves state-of-the-art performance on the MTEB benchmark, outperforming larger models with scores of 64.13 for Chinese and 64.94 for English tasks. This work demonstrates how thoughtful data curation and innovative training techniques can push the boundaries of what's possible with compact models. The team has open-sourced their work for the research community.
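Two of the ideas above can be sketched in a few lines of Python. This is a minimal sketch, not the KaLM-Embedding code: the post does not say which scorer or top-k value the authors use, so cosine similarity, the function names, and the toy vectors are assumptions. `keep_example` keeps a training pair only if its positive ranks within the top-k candidates for the query, and `truncate_embedding` shows the usual way a Matryoshka-trained embedding is shortened (keep the leading components, then re-normalise).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def positive_rank(query, positive, candidates):
    """1-based rank of the positive among the other candidates for this query."""
    pos_score = cosine(query, positive)
    return 1 + sum(1 for c in candidates if cosine(query, c) > pos_score)


def keep_example(query, positive, candidates, top_k):
    """Ranking consistency filter: keep the pair only if the positive is in the top-k."""
    return positive_rank(query, positive, candidates) <= top_k


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalise to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in head]


# Toy data (assumed): one candidate is closer to the query than the labelled positive.
query = [1.0, 0.0]
positive = [0.9, 0.1]
candidates = [[1.0, 0.05], [0.0, 1.0], [-1.0, 0.0]]
rank = positive_rank(query, positive, candidates)
short = truncate_embedding([3.0, 4.0, 12.0], 2)
```

Truncation only preserves quality when the model was trained with a Matryoshka-style objective; applying it to an ordinary embedding model would generally lose information.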

Tom Aarsen

🤗 Sentence Transformers & NLTK maintainer, MLE @ Hugging Face

20,646 followers 9mo

ModernBERT goes MULTILINGUAL! One of the most requested models I've seen, The Johns Hopkins University's CLSP has trained state-of-the-art massively multilingual encoders using the ModernBERT architecture: mmBERT.

Model details:
- 2 model sizes: 42M non-embed (140M total) and 110M non-embed (307M total)
- Uses the ModernBERT architecture, but with the Gemma2 multilingual tokenizer (so: flash attention, alternating global/local attention, unpadding/sequence packing, etc.)
- Maximum sequence length of 8192 tokens, on the high end for encoders
- Trained on 1833 languages using DCLM, FineWeb2, and many more sources
- 3 training phases: 2.3T tokens pretraining on 60 languages, 600B tokens mid-training on 110 languages, and 100B tokens decay training on all 1833 languages.
- Also uses model merging and clever transitions between the three training phases.
- Both models are MIT Licensed, and the full datasets and intermediary checkpoints are also publicly released

Evaluation details:
- Very competitive with ModernBERT at equivalent sizes on English (GLUE, MTEB v2 English after finetuning)
- Consistently outperforms equivalently sized models on all Multilingual tasks (XTREME, classification, MTEB v2 Multilingual after finetuning)
- In short: beats commonly used multilingual base models like mDistilBERT, XLM-R (multilingual RoBERTa), multilingual MiniLM, etc.
- Additionally: the ModernBERT-based mmBERT is much faster than the alternatives due to its architectural benefits. Easily up to 2x throughput in common scenarios.

Check out the full blogpost with more details. It's super dense & gets straight to the point: https://lnkd.in/ebqTK3JS

Based on these results, mmBERT should be the new go-to multilingual encoder base model at 300M parameters and below. Do note that the mmBERT models are "base" models, i.e. they're currently only trained to perform Mask Filling. They'll need to be finetuned for downstream tasks like semantic search, classification, clustering, etc.

I'm very much looking forward to seeing embedding models based on mmBERT! Great work by Marc Marone, Orion Weller, and the rest of the team at JHU!

mmBERT: ModernBERT goes Multilingual huggingface.co

20 Comments

Zain Ul Hassan

Freelance Senior Analyst, Alibaba Group | Writing on Data, Operations, Supply Chain, AI & Modern Business

82,173 followers 1y

A few years ago, I worked with an online education platform facing challenges with student engagement. While they had a significant number of users enrolling in courses, they struggled with low participation rates in course discussions and activities, leading to a decline in course completion rates. The platform needed to identify the causes behind low engagement and implement strategies to encourage more active participation.

Improving Student Engagement Using Data Analytics

1️⃣ Analyzing Engagement Data
We began by analyzing user interaction data, focusing on metrics such as time spent on the platform, participation in discussions, video completion rates, and quiz scores. Using SQL, we aggregated the data to identify patterns and pinpoint where students were losing interest.

    SELECT student_id,
           course_id,
           AVG(time_spent) AS avg_time_spent,
           COUNT(discussion_post_id) AS posts_made,
           AVG(quiz_score) AS avg_quiz_score
    FROM student_activity
    GROUP BY student_id, course_id;

🔹 Insight: We identified that students who interacted with course discussions and quizzes had higher completion rates, while others dropped off quickly.

2️⃣ Building a Predictive Model
We then created a predictive model to determine which students were at risk of disengaging based on their activity patterns. The model incorporated features such as time spent on the platform, participation in discussions, and progress through the course material.

    # Pseudocode for Predictive Model
    # train_engagement_model stands in for whatever training routine is used
    def predict_student_engagement(student_data):
        model = train_engagement_model(student_data)
        predictions = model.predict(student_data)
        return predictions

🔹 Insight: This model helped us flag students who were likely to disengage early, allowing for timely interventions.

3️⃣ Implementing Engagement Strategies
Based on insights from the model, we implemented strategies such as sending personalized emails with reminders, offering incentives for completing activities, and increasing interaction opportunities through live Q&A sessions.

    # Pseudocode for Engagement Follow-Up
    # The trained model is passed in explicitly rather than read from a global
    def send_engagement_reminder(model, student_data):
        if model.predict(student_data) == 'at_risk':
            send_email_reminder(student_data)

🔹 Insight: Personalized engagement and incentives led to an increase in student participation.

Challenges Faced
- Identifying meaningful engagement metrics that were predictive of success.
- Finding the right balance between engaging students and overwhelming them.

Business Impact
✔ Student engagement improved, leading to higher completion rates.
✔ Retention rates increased, as more students continued with courses.
✔ Revenue grew, driven by more active and satisfied students.

Key Takeaway: By analyzing user activity and leveraging predictive analytics, businesses can identify disengaged customers early and implement strategies to improve engagement and retention.
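The aggregation and flagging steps above can be tried end to end with Python's built-in `sqlite3` module. This is a minimal sketch, not the platform's pipeline: the table schema, the sample rows, the assumption that `time_spent` is in minutes, and the `flag_at_risk` thresholds are all made up for illustration, and the simple rule only stands in for the trained model the post mentions.

```python
import sqlite3

# The aggregation query from the post, with an ORDER BY added for stable output.
QUERY = """
SELECT student_id,
       course_id,
       AVG(time_spent) AS avg_time_spent,
       COUNT(discussion_post_id) AS posts_made,
       AVG(quiz_score) AS avg_quiz_score
FROM student_activity
GROUP BY student_id, course_id
ORDER BY student_id, course_id;
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a few made-up activity rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE student_activity ("
        "student_id INTEGER, course_id INTEGER, time_spent REAL, "
        "discussion_post_id INTEGER, quiz_score REAL)"
    )
    rows = [
        (1, 10, 45.0, 501, 88.0),
        (1, 10, 35.0, 502, 92.0),
        (2, 10, 5.0, None, 40.0),   # no discussion post in this session
        (2, 10, 15.0, None, None),  # no post and no quiz taken
    ]
    conn.executemany("INSERT INTO student_activity VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def flag_at_risk(row, min_minutes=20.0, min_posts=1) -> bool:
    """Hypothetical rule standing in for the predictive model."""
    _student, _course, avg_time, posts, _avg_quiz = row
    return avg_time < min_minutes or posts < min_posts


conn = build_demo_db()
summary = conn.execute(QUERY).fetchall()
at_risk = [row[0] for row in summary if flag_at_risk(row)]
```

Note that `COUNT(discussion_post_id)` counts only non-NULL values and `AVG` ignores NULLs, so sessions without a post or quiz do not distort the per-student aggregates.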

3 Comments

Wes Bush

Author of Product-Led Growth & The Product-Led Playbook | I’ve been told I make PLG simple but you tell me!

43,199 followers 8mo

40-60% of first-time users never come back. Most companies focus on one type of onboarding support and neglect the other. Some build great in-app experiences but never follow up when users drop off. Others send tons of emails but their product experience is confusing. The best PLG companies use both product bumpers (inside the app) and conversational bumpers (outside the app) working together. Here are 11 bumpers that could double your activation rates:

Product Bumpers (Inside Your App)
1. Welcome Messages: Restate your value prop and set expectations for what users will experience. Make them feel invited, not lost.
2. Product Tours: Eliminate distractions and give users only the options they care about. Use profiling questions to launch them into the right part of your product.
3. Progress Bars: Show how close users are to completion. They'll know onboarding won't take long and they're just a few steps away.
4. Checklists: Break big tasks into bite-sized steps. Pre-fill some items before users see them to boost motivation.
5. Onboarding Tooltips: Provide just-in-time guidance, but don't drown users in tooltips. Keep it simple and guide them only through the critical steps.
6. Empty States: Turn blank dashboards into clear next steps that lead users closer to value.

Conversational Bumpers (Outside Your Product)
7. External Messaging: Emails, texts, LinkedIn - meet users wherever they spend their time. The best messages provide clear next steps to re-engage.
8. Knowledge Base: Give users instant answers to common questions. They solve problems independently while you deflect support tickets.
9. In-app Messaging: When users need to ask specific questions, let them message your team in-app for near-instant solutions.
10. Community Forums: Let users help each other. Notion does this brilliantly - users share templates that simultaneously increase adoption, engagement, and retention.
11. Training & Specialists: Close the knowledge gap with coaching calls, academies, or cohorts. For high-value users, assign specialists to speed up time to value.

Pro Tip: If you're just starting, give 10-40 signups per week the white-glove treatment. Welcome emails, onboarding calls, trial extensions - do everything to get them to value. Find what works, then automate it. Manual emails become automated. Common questions get added to your knowledge base. The unscalable path is how you identify the scalable path.

What bumpers are you using today?

19 Comments

Lisa Trosien

Multifamily Keynote Speaker, Consultant, Educator and Thought Leader | Leasing, Marketing, Resident Retention, Customer Service Expert| Proptech Advisor | Founder, Apartment All Stars, Apartment Expert

20,018 followers 6mo

PropTech Tuesday: Your Tech Is Only as Good as Your Training I keep hearing the same thing from management companies: their sites aren't using the tech they have. 📚 That's not a tech problem. That's a training problem. A proptech exec once told me something that has always stuck with me. He said he could always tell when a site had staff turnover. How? Management companies would reach out saying they weren't seeing the results they used to get, so the product "wasn't working" anymore. But the product was fine. The new team members brought on just hadn't been trained, so they weren’t using it. Some didn't even know the product was available. ⚙️ Most platforms offer solid training resources - live sessions, recorded webinars, help docs. But here's what happens: teams go through initial onboarding, then they're pretty much on their own. When results drop off six months later, the software gets blamed. But the reality? People can't use tools they don't fully understand. 🔄 Research backs this up - organizations that prioritize ONGOING employee development with new technology rollouts see a 30-50% spike in user engagement (McKinsey, 2023). Not just onboarding. Actual, continuous learning. 💬 The difference between tech that sits idle and tech that transforms operations is training. The ongoing, "let me show you this shortcut," actually-talking-to-people kind. 🎯 You want better results from your proptech stack? Teach your teams how to use it. KEEP teaching them. 💡 Great tech makes you efficient. Great training makes you unstoppable. Happy PropTech Tuesday. ✨ (This is the first in a series of Tuesday posts I'm dedicating to all things proptech.)

13 Comments

Fred Thompson

buildempire.co.uk • claruswms.co.uk • thirst.io | Helping logistics and professional development through technology.

3,458 followers 1y

If Your Learners Aren’t Engaged, Nothing Else Matters.👎

You can build the world’s most beautifully designed training program. But if learners don’t finish it, don’t remember it, and don’t apply it? Then it’s just content. Not learning. And that’s exactly where many L&D teams are stuck.

Here’s what the data shows:
* 70% of training content is forgotten within 24 hours
* Engaged learners are 3x more likely to apply what they’ve learned
* High engagement = higher productivity, stronger retention, and real business impact

So, how do the best L&D teams drive engagement...and keep it? These are the three biggest game-changers we’re seeing in 2025 👀👇

1️⃣ Make Learning Feel Personal
If a course doesn’t connect with someone’s day-to-day role, they’ll disengage...Fast. Relevance is everything. What forward-thinking teams are doing:
→ Adapting content based on role, skill level, and performance
→ Letting AI adjust learning pathways in real-time
→ Giving learners more say in their own development
✅ Teams making this shift are seeing 2x to 3x higher engagement.

2️⃣ Make It Impossible to Just Click Next
No one remembers a 60-slide eLearning deck. Passive content is forgotten content. What’s working now:
* Scenario-based challenges that mimic real decisions
* Interactive formats like quizzes and simulations
* Collaborative elements that get people talking and solving together
✅ One SME switched to interactive compliance training and jumped from 20% to 92% completion overnight.

3️⃣ Make Learning Continuous
When learning is personal, interactive, and continuous, people pay attention. Annual training? It’s forgotten before the next login. The best teams are shifting to learning that’s consistent, quick, and embedded in the flow of work. How they’re doing it:
→ Microlearning delivered in bite-sized bursts each week
→ Spaced repetition to strengthen memory
→ Turning learning into a habit, not a one-off
✅ One team replaced a yearly course with weekly 5-minute refreshers and saw engagement and on-the-job application soar.

Engagement isn’t a “nice-to-have” in L&D. It’s the foundation of every successful learning strategy. When learning is personal, interactive, and continuous - people pay attention. And when people are paying attention, performance improves. If you’re looking to future-proof your L&D approach, this is where to begin.

But what’s stopping most teams from getting it right?

45 Comments
