Training Content Management Systems


  • Armand Ruiz

    building AI systems @meta

    207,118 followers

    Disclosing the full list of datasets used to train IBM LLMs Granite 3.0. This is true transparency - no other LLM provider shares such detailed information about their training datasets.

    WEB Data
    - FineWeb: More than 15T tokens of cleaned and deduplicated English data from CommonCrawl.
    - Webhose: Unstructured web content in English converted into machine-readable data.
    - DCLM-Baseline: A 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks.

    CODE
    - Code Pile: Sourced from publicly available datasets like GitHub Code Clean and StarCoderdata.
    - FineWeb-Code: Contains programming/coding-related documents filtered from the FineWeb dataset using annotation.
    - CodeContests: Competitive programming dataset with problems, test cases, and human solutions in multiple languages.

    DOMAIN
    - USPTO: Collection of US patents granted from 1975 to 2023.
    - Free Law: Public-domain legal opinions from US federal and state courts.
    - PubMed Central: Biomedical and life sciences papers.
    - EDGAR Filings: Annual reports from US publicly traded companies over 25 years.

    MULTILINGUAL
    - Multilingual Wikipedia: Data from 11 languages to support multilingual capabilities.
    - Multilingual Webhose: Multilingual web content converted into machine-readable data feeds.
    - MADLAD-12: Document-level multilingual dataset covering 12 languages.

    INSTRUCTIONS
    - Code Instructions Alpaca: Instruction-response pairs about code generation problems.
    - Glaive Function Calling: Dataset focused on function calling in real scenarios.

    ACADEMIC
    - peS2o: A collection of 40M open-access academic papers for pre-training.
    - arXiv: Scientific paper pre-prints posted to arXiv. Full author acknowledgement can be found here.
    - IEEE: Technical content from IEEE acquired by IBM.

    TECHNICAL
    - Wikipedia: Technical articles sourced from Wikipedia.
    - Library of Congress Public Domain Books: More than 140,000 public domain English books.
    - Directory of Open Access Books: Publicly available technical books from the Directory of Open Access Books.
    - Cosmopedia: Synthetic textbooks, blog posts, stories, and WikiHow articles.

    MATH
    - OpenWebMath: Mathematical text from the internet, filtered from 200B HTML files.
    - Algebraic-Stack: Mathematical code dataset including numerical computing and formal mathematics.
    - Stack Exchange: User-contributed content from the Stack Exchange network.
    - MetaMathQA: Dataset of rewritten mathematical questions.
    - StackMathQA: A curated collection of 2 million mathematical questions from Stack Exchange.
    - MathInstruct: Focused on chain-of-thought (CoT) and program-of-thought (PoT) rationales for mathematical reasoning.
    - TemplateGSM: Collection of over 7 million grade-school math problems with code and natural language solutions.

    BOOM!

  • Sneha Vijaykumar

    Data Scientist @ Takeda | Ex-Shell | Gen AI | Agentic AI | RAG | AI Agents | Azure | NLP | AWS

    25,712 followers

    You’re in an AI Engineer interview. Interviewer asks: How do you handle multi language prompting effectively?

    Most people jump to translation APIs. Strong answer goes deeper.

    1. Detect language first
       Never assume. Identify the user’s language and script before prompting.
    2. Preserve intent, not just words
       Literal translation often breaks tone, context, and business meaning.
    3. Prompt in the user’s language when possible
       Models usually respond better when instructions and output language align.
    4. Use English for complex reasoning, then localize output
       For harder logic tasks, reasoning in English + final response in target language often works better.
    5. Handle mixed language inputs
       Real users switch languages mid sentence. Your system should too.
    6. Keep terminology consistent
       Especially for healthcare, finance, legal, and product names.
    7. Test by language, not globally
       Kannada, Hindi, Tamil, Japanese, Arabic, Spanish all fail differently.
    8. Build fallback layers
       If confidence is low, ask clarifying questions instead of hallucinating.

    What interviewers want to hear: You understand that multilingual AI is a product problem, not just a translation problem.

    #AI #GenerativeAI #PromptEngineering #LLM #AIEngineer #MachineLearning #NLP #AIEngineering

    Follow Sneha Vijaykumar for more... 😊
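A minimal sketch of how points 1, 4, 5, and 8 above might fit together. It is only an illustration: it detects the Unicode script rather than the language (Latin script alone cannot separate Spanish from English), and `SCRIPT_HINTS`, `detect_scripts`, `plan_prompt`, and the `mixed_share` threshold are hypothetical names and values, not part of the post.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name
# to a rough language hint. Script is only a proxy for language.
SCRIPT_HINTS = {
    "LATIN": "a Latin-script language",
    "DEVANAGARI": "Hindi",
    "KANNADA": "Kannada",
    "TAMIL": "Tamil",
    "ARABIC": "Arabic",
    "HIRAGANA": "Japanese",
    "KATAKANA": "Japanese",
    "CJK": "Japanese or Chinese",
}


def detect_scripts(text):
    """Count letters per script keyword; unknown scripts go to 'OTHER'."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, combining marks
        first_word = unicodedata.name(ch, "").split(" ")[0]
        counts[first_word if first_word in SCRIPT_HINTS else "OTHER"] += 1
    return counts


def plan_prompt(user_text, mixed_share=0.2):
    """Decide whether to answer or ask a clarifying question."""
    counts = detect_scripts(user_text)
    total = sum(counts.values())
    known = {s: c for s, c in counts.items() if s != "OTHER"}
    if total == 0 or not known:
        # Fallback layer: nothing recognisable, so ask instead of guessing.
        return {
            "action": "clarify",
            "language_hint": None,
            "mixed": False,
            "system_prompt": "Ask the user which language they prefer.",
        }
    dominant = max(known, key=known.get)
    # Mixed input: more than one script holds a meaningful share of letters.
    mixed = sum(1 for c in counts.values() if c / total >= mixed_share) > 1
    hint = SCRIPT_HINTS[dominant]
    lines = [
        "Reason through the problem in English.",
        f"Write the final answer in the user's language (script hint: {hint}).",
    ]
    if mixed:
        lines.append("The user mixes languages; keep their terms as written.")
    return {
        "action": "answer",
        "language_hint": hint,
        "mixed": mixed,
        "system_prompt": " ".join(lines),
    }
```

In a real system the script check would likely be replaced by a proper language-identification model, and the per-language testing in point 7 would sit on top of this.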

  • Karen Kim

    CEO @ Human Managed, the AI-Native Service Operator for Enterprise Cyber, Risk, and Digital.

    5,957 followers

    User Feedback Loops: the missing piece in AI success?

    AI is only as good as the data it learns from -- but what happens after deployment? Many businesses focus on building AI products but miss a critical step: ensuring their outputs continue to improve with real-world use. Without a structured feedback loop, AI risks stagnating, delivering outdated insights, or losing relevance quickly.

    Instead of treating AI as a one-and-done solution, companies need workflows that continuously refine and adapt based on actual usage. That means capturing how users interact with AI outputs, where it succeeds, and where it fails.

    At Human Managed, we’ve embedded real-time feedback loops into our products, allowing customers to rate and review AI-generated intelligence. Users can flag insights as:
    🔘 Irrelevant
    🔘 Inaccurate
    🔘 Not Useful
    🔘 Others

    Every input is fed back into our system to fine-tune recommendations, improve accuracy, and enhance relevance over time. This is more than a quality check -- it’s a competitive advantage.

    - for CEOs & Product Leaders: AI-powered services that evolve with user behavior create stickier, high-retention experiences.
    - for Data Leaders: Dynamic feedback loops ensure AI systems stay aligned with shifting business realities.
    - for Cybersecurity & Compliance Teams: User validation enhances AI-driven threat detection, reducing false positives and improving response accuracy.

    An AI model that never learns from its users is already outdated. The best AI isn’t just trained -- it continuously evolves.
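One way the flagging step described above could be represented in code is sketched below. The post does not say how Human Managed stores or uses feedback, so the `Feedback` record, the `summarize` helper, and the `review_threshold` value are all hypothetical; only the four flag names come from the post.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from enum import Enum


class Flag(Enum):
    # The four flag categories listed in the post.
    IRRELEVANT = "irrelevant"
    INACCURATE = "inaccurate"
    NOT_USEFUL = "not_useful"
    OTHER = "other"


@dataclass(frozen=True)
class Feedback:
    # Hypothetical record shape for one user flag on one AI-generated insight.
    insight_id: str
    user_id: str
    flag: Flag
    comment: str = ""


def summarize(feedback, review_threshold=2):
    """Count flags per insight and list insights that reach the threshold."""
    per_insight = defaultdict(Counter)
    for item in feedback:
        per_insight[item.insight_id][item.flag] += 1
    needs_review = sorted(
        insight_id
        for insight_id, flags in per_insight.items()
        if sum(flags.values()) >= review_threshold
    )
    return per_insight, needs_review


events = [
    Feedback("insight-7", "u1", Flag.INACCURATE),
    Feedback("insight-7", "u2", Flag.INACCURATE, "wrong asset owner"),
    Feedback("insight-9", "u1", Flag.NOT_USEFUL),
]
counts, review_queue = summarize(events)
```

The per-insight counts are the raw material; how they feed back into retraining or ranking depends on the system and is left out here.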

  • Allys Parsons

    Co-Founder at techire ai. Hiring in AI since ’19 ✌️ Speech AI, TTS, Audio, Multimodal AI & more! Top 200 Women Leaders in Conversational AI ‘23 | No.1 Conversational AI Leader ‘21

    18,350 followers

    Latest research from KAIST and Imperial College London introduces Zero-AVSR, an innovative framework that enables audio-visual speech recognition across languages without requiring training data in target languages. By learning language-agnostic speech representations through romanisation and leveraging LLMs, it can recognise speech even in languages never seen during training. What makes this approach interesting is the scale of language support. The team created MARC, a dataset spanning 2,916 hours of audio-visual speech across 82 languages—far beyond the 9 languages typical systems support. Their results show comparable performance to traditional multilingual systems while supporting this vastly larger language inventory. Zero-AVSR represents a significant advancement for speech tech in low-resource languages, potentially democratising access across thousands of languages without requiring extensive labelled datasets for each. The approach particularly excels when recognising languages from families similar to those in the training data, suggesting promising pathways for further expansion. Paper: https://lnkd.in/dnw_V7XK Authors: Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro #SpeechRecognition #MultilingualAI #SpeechAI

  • Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    16,749 followers

    Exciting breakthrough in multilingual embedding models! A team of researchers from HIT and Tongji University has developed KaLM-Embedding, setting a new standard for models under 1B parameters.

    What makes this model special? It leverages cleaner, more diverse training data and introduces three game-changing techniques:
    1. Persona-based synthetic data generation using Qwen2-72B-Instruct, creating 550k diverse examples across 6 task types
    2. Ranking consistency filtering to remove noise and improve data quality by ensuring positive examples rank within top-k matches
    3. Semi-homogeneous task batching that balances negative sample hardness with false negative risks

    Under the hood, KaLM-Embedding uses Qwen2-0.5B as its foundation and implements Matryoshka Representation Learning for flexible embedding dimensions (896 down to 64). The model excels in Chinese and English while showing strong performance across other languages.

    The results? KaLM-Embedding achieves state-of-the-art performance on the MTEB benchmark, outperforming larger models with scores of 64.13 for Chinese and 64.94 for English tasks.

    This work demonstrates how thoughtful data curation and innovative training techniques can push the boundaries of what's possible with compact models. The team has open-sourced their work for the research community.
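The ranking consistency filter in technique 2 can be illustrated with a toy sketch: keep a (query, positive) training pair only if the positive document ranks within the top k candidates for that query. The vectors, document IDs, cosine scoring, and `top_k` value below are assumptions made for illustration; the post does not specify which scoring model or k the authors used.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_of_positive(query_vec, positive_id, corpus):
    """1-based rank of the labelled positive among all corpus documents."""
    ranked = sorted(
        corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]), reverse=True
    )
    return ranked.index(positive_id) + 1


def filter_pairs(pairs, query_vecs, corpus, top_k):
    """Drop pairs whose positive does not rank within the top k."""
    return [
        (query_id, doc_id)
        for query_id, doc_id in pairs
        if rank_of_positive(query_vecs[query_id], doc_id, corpus) <= top_k
    ]


# Toy embeddings (hypothetical); a real pipeline would use a trained encoder.
CORPUS = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
QUERY_VECS = {"q1": [1.0, 0.1], "q2": [0.1, 1.0]}
PAIRS = [("q1", "d1"), ("q2", "d1")]  # the second pair is a noisy label

clean_pairs = filter_pairs(PAIRS, QUERY_VECS, CORPUS, top_k=2)
```

Here the pair `("q2", "d1")` is dropped because `d1` ranks last for `q2`, which is the kind of noisy positive the filter is meant to remove.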

  • Tom Aarsen

    🤗 Sentence Transformers & NLTK maintainer, MLE @ Hugging Face

    20,646 followers

    ModernBERT goes MULTILINGUAL! One of the most requested models I've seen, The Johns Hopkins University's CLSP has trained state-of-the-art massively multilingual encoders using the ModernBERT architecture: mmBERT.

    Model details:
    - 2 model sizes: 42M non-embed (140M total) and 110M non-embed (307M total)
    - Uses the ModernBERT architecture, but with the Gemma2 multilingual tokenizer (so: flash attention, alternating global/local attention, unpadding/sequence packing, etc.)
    - Maximum sequence length of 8192 tokens, on the high end for encoders
    - Trained on 1833 languages using DCLM, FineWeb2, and many more sources
    - 3 training phases: 2.3T tokens pretraining on 60 languages, 600B tokens mid-training on 110 languages, and 100B tokens decay training on all 1833 languages.
    - Also uses model merging and clever transitions between the three training phases.
    - Both models are MIT Licensed, and the full datasets and intermediary checkpoints are also publicly released

    Evaluation details:
    - Very competitive with ModernBERT at equivalent sizes on English (GLUE, MTEB v2 English after finetuning)
    - Consistently outperforms equivalently sized models on all Multilingual tasks (XTREME, classification, MTEB v2 Multilingual after finetuning)
    - In short: beats commonly used multilingual base models like mDistilBERT, XLM-R (multilingual RoBERTa), multilingual MiniLM, etc.
    - Additionally: the ModernBERT-based mmBERT is much faster than the alternatives due to its architectural benefits. Easily up to 2x throughput in common scenarios.

    Check out the full blogpost with more details. It's super dense & gets straight to the point: https://lnkd.in/ebqTK3JS

    Based on these results, mmBERT should be the new go-to multilingual encoder base models at 300M and below. Do note that the mmBERT models are "base" models, i.e. they're currently only trained to perform Mask Filling. They'll need to be finetuned for downstream tasks like semantic search, classification, clustering, etc.

    I'm very much looking forward to seeing embedding models based on mmBERT! Great work by Marc Marone, Orion Weller, and the rest of the team at JHU!
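As a quick arithmetic check on the three training phases listed above, the sketch below totals the token budget and shows each phase's share. The token and language counts are taken from the post; the `PHASES` layout and `budget_summary` helper are hypothetical.

```python
# (phase name, tokens in billions, number of languages), as listed in the post.
PHASES = [
    ("pretraining", 2300, 60),
    ("mid-training", 600, 110),
    ("decay", 100, 1833),
]


def budget_summary(phases):
    """Return the total token count and each phase's percentage share."""
    total = sum(tokens for _, tokens, _ in phases)
    rows = [
        (name, tokens, languages, round(100 * tokens / total, 1))
        for name, tokens, languages in phases
    ]
    return total, rows


total_billion, rows = budget_summary(PHASES)
for name, tokens, languages, share in rows:
    print(f"{name:>12}: {tokens:>5}B tokens, {languages:>4} languages, {share}%")
```

Under these figures the total is 3T tokens, and only the final phase, a small fraction of the budget, covers all 1833 languages.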

  • Zain Ul Hassan

    Freelance Senior Analyst, Alibaba Group | Writing on Data, Operations, Supply Chain, AI & Modern Business

    82,173 followers

    A few years ago, I worked with an online education platform facing challenges with student engagement. While they had a significant number of users enrolling in courses, they struggled with low participation rates in course discussions and activities, leading to a decline in course completion rates. The platform needed to identify the causes behind low engagement and implement strategies to encourage more active participation.

    Improving Student Engagement Using Data Analytics

    1️⃣ Analyzing Engagement Data
    We began by analyzing user interaction data, focusing on metrics such as time spent on the platform, participation in discussions, video completion rates, and quiz scores. Using SQL, we aggregated the data to identify patterns and pinpoint where students were losing interest.

        -- Per-student, per-course engagement summary
        SELECT student_id,
               course_id,
               AVG(time_spent) AS avg_time_spent,
               COUNT(discussion_post_id) AS posts_made,
               AVG(quiz_score) AS avg_quiz_score
        FROM student_activity
        GROUP BY student_id, course_id;

    🔹 Insight: We identified that students who interacted with course discussions and quizzes had higher completion rates, while others dropped off quickly.

    2️⃣ Building a Predictive Model
    We then created a predictive model to determine which students were at risk of disengaging based on their activity patterns. The model incorporated features such as time spent on the platform, participation in discussions, and progress through the course material.

        # Pseudocode for Predictive Model
        def predict_student_engagement(student_data):
            # train_engagement_model is a placeholder for the training step
            model = train_engagement_model(student_data)
            predictions = model.predict(student_data)
            return predictions

    🔹 Insight: This model helped us flag students who were likely to disengage early, allowing for timely interventions.

    3️⃣ Implementing Engagement Strategies
    Based on insights from the model, we implemented strategies such as sending personalized emails with reminders, offering incentives for completing activities, and increasing interaction opportunities through live Q&A sessions.

        # Pseudocode for Engagement Follow-Up
        def send_engagement_reminder(model, student_data):
            # The trained model is passed in explicitly so it is defined here
            if model.predict(student_data) == 'at_risk':
                send_email_reminder(student_data)

    🔹 Insight: Personalized engagement and incentives led to an increase in student participation.

    Challenges Faced
    - Identifying meaningful engagement metrics that were predictive of success.
    - Finding the right balance between engaging students without overwhelming them.

    Business Impact
    ✔ Student engagement improved, leading to higher completion rates.
    ✔ Retention rates increased, as more students continued with courses.
    ✔ Revenue grew, driven by more active and satisfied students.

    Key Takeaway: By analyzing user activity and leveraging predictive analytics, businesses can identify disengaged customers early and implement strategies to improve engagement and retention.
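A runnable sketch of the aggregation step from this post is shown below, run against an in-memory SQLite database. The table schema, the sample rows, and the thresholds are assumptions. The `is_at_risk` rule is a plain rule-based stand-in, not the predictive model the author built, which the post does not describe in detail.

```python
import sqlite3

# The post's aggregation query, with ORDER BY added for stable output.
QUERY = """
SELECT student_id,
       course_id,
       AVG(time_spent) AS avg_time_spent,
       COUNT(discussion_post_id) AS posts_made,
       AVG(quiz_score) AS avg_quiz_score
FROM student_activity
GROUP BY student_id, course_id
ORDER BY student_id, course_id;
"""

# Hypothetical activity rows:
# (student_id, course_id, time_spent, discussion_post_id, quiz_score)
ROWS = [
    ("s1", "c1", 45, 101, 82),
    ("s1", "c1", 50, 102, 90),
    ("s2", "c1", 5, None, 40),
    ("s2", "c1", 10, None, 55),
]


def aggregate(rows):
    """Load rows into an assumed schema and run the aggregation query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE student_activity ("
        "student_id TEXT, course_id TEXT, time_spent REAL, "
        "discussion_post_id INTEGER, quiz_score REAL)"
    )
    conn.executemany("INSERT INTO student_activity VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


def is_at_risk(avg_time, posts, avg_quiz, min_time=15, min_posts=1, min_quiz=60):
    """Rule-based stand-in for a model: at risk if two or more signals are weak."""
    weak_signals = (avg_time < min_time) + (posts < min_posts) + (avg_quiz < min_quiz)
    return weak_signals >= 2


def flag_students(rows):
    """Map (student_id, course_id) to an at-risk flag."""
    return {
        (student_id, course_id): is_at_risk(avg_time, posts, avg_quiz)
        for student_id, course_id, avg_time, posts, avg_quiz in aggregate(rows)
    }


flags = flag_students(ROWS)
```

`COUNT(discussion_post_id)` counts only non-null values, so rows without a post do not inflate `posts_made`. A trained classifier could replace `is_at_risk` once labelled completion data is available.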

  • Wes Bush

    Author of Product-Led Growth & The Product-Led Playbook | I’ve been told I make PLG simple but you tell me!

    43,199 followers

    40-60% of first-time users never come back.

    Most companies focus on one type of onboarding support and neglect the other. Some build great in-app experiences but never follow up when users drop off. Others send tons of emails but their product experience is confusing. The best PLG companies use both product bumpers (inside the app) and conversational bumpers (outside the app) working together.

    Here are 11 bumpers that could double your activation rates:

    Product Bumpers (Inside Your App)
    1. Welcome Messages
       Restate your value prop and set expectations for what users will experience. Make them feel invited, not lost.
    2. Product Tours
       Eliminate distractions and give users only the options they care about. Use profiling questions to launch them into the right part of your product.
    3. Progress Bars
       Show how close users are to completion. They'll know onboarding won't take long and they're just a few steps away.
    4. Checklists
       Break big tasks into bite-sized steps. Pre-fill some items before users see them to boost motivation.
    5. Onboarding Tooltips
       Provide just-in-time guidance, but don't drown users in tooltips. Keep it simple and guide them only through the critical steps.
    6. Empty States
       Turn blank dashboards into clear next steps that lead users closer to value.

    Conversational Bumpers (Outside Your Product)
    7. External Messaging
       Emails, texts, LinkedIn - meet users wherever they spend their time. The best messages provide clear next steps to re-engage.
    8. Knowledge Base
       Give users instant answers to common questions. They solve problems independently while you deflect support tickets.
    9. In-app Messaging
       When users need to ask specific questions, let them message your team in-app for near-instant solutions.
    10. Community Forums
       Let users help each other. Notion does this brilliantly - users share templates that simultaneously increase adoption, engagement, and retention.
    11. Training & Specialists
       Close the knowledge gap with coaching calls, academies, or cohorts. For high-value users, assign specialists to speed up time to value.

    Pro Tip: If you're just starting, give 10-40 signups per week the white-glove treatment. Welcome emails, onboarding calls, trial extensions - do everything to get them to value. Find what works, then automate it. Manual emails become automated. Common questions get added to your knowledge base. The unscalable path is how you identify the scalable path.

    What bumpers are you using today?

  • Lisa Trosien

    Multifamily Keynote Speaker, Consultant, Educator and Thought Leader | Leasing, Marketing, Resident Retention, Customer Service Expert| Proptech Advisor | Founder, Apartment All Stars, Apartment Expert

    20,018 followers

    PropTech Tuesday: Your Tech Is Only as Good as Your Training

    I keep hearing the same thing from management companies: their sites aren't using the tech they have. 📚 That's not a tech problem. That's a training problem.

    A proptech exec once told me something that has always stuck with me. He said he could always tell when a site had staff turnover. How? Management companies would reach out saying they weren't seeing the results they used to get, so the product "wasn't working" anymore. But the product was fine. The new team members brought on just hadn't been trained, so they weren’t using it. Some didn't even know the product was available.

    ⚙️ Most platforms offer solid training resources - live sessions, recorded webinars, help docs. But here's what happens: teams go through initial onboarding, then they're pretty much on their own. When results drop off six months later, the software gets blamed. But the reality? People can't use tools they don't fully understand.

    🔄 Research backs this up - organizations that prioritize ONGOING employee development with new technology rollouts see a 30-50% spike in user engagement (McKinsey, 2023). Not just onboarding. Actual, continuous learning.

    💬 The difference between tech that sits idle and tech that transforms operations is training. The ongoing, "let me show you this shortcut," actually-talking-to-people kind.

    🎯 You want better results from your proptech stack? Teach your teams how to use it. KEEP teaching them.

    💡 Great tech makes you efficient. Great training makes you unstoppable.

    Happy PropTech Tuesday. ✨ (This is the first in a series of Tuesday posts I'm dedicating to all things proptech.)

  • Fred Thompson

    buildempire.co.uk • claruswms.co.uk • thirst.io | Helping logistics and professional development through technology.

    3,458 followers

    If Your Learners Aren’t Engaged, Nothing Else Matters.👎

    You can build the world’s most beautifully designed training program. But if learners don’t finish it, don’t remember it, and don’t apply it? Then it’s just content. Not learning. And that’s exactly where many L&D teams are stuck.

    Here’s what the data shows:
    * 70% of training content is forgotten within 24 hours
    * Engaged learners are 3x more likely to apply what they’ve learned
    * High engagement = higher productivity, stronger retention, and real business impact

    So, how do the best L&D teams drive engagement...and keep it? These are the three biggest game-changers we’re seeing in 2025 👀👇

    1️⃣ Make Learning Feel Personal
    If a course doesn’t connect with someone’s day-to-day role, they’ll disengage...Fast. Relevance is everything.
    What forward-thinking teams are doing:
    → Adapting content based on role, skill level, and performance
    → Letting AI adjust learning pathways in real-time
    → Giving learners more say in their own development
    ✅ Teams making this shift are seeing 2x to 3x higher engagement.

    2️⃣ Make It Impossible to Just Click Next
    No one remembers a 60-slide eLearning deck. Passive content is forgotten content.
    What’s working now:
    * Scenario-based challenges that mimic real decisions
    * Interactive formats like quizzes and simulations
    * Collaborative elements that get people talking and solving together
    ✅ One SME switched to interactive compliance training and jumped from 20% to 92% completion overnight.

    3️⃣ Make Learning Continuous
    When learning is personal, interactive, and continuous, people pay attention. Annual training? It’s forgotten before the next login. The best teams are shifting to learning that’s consistent, quick, and embedded in the flow of work.
    How they’re doing it:
    → Microlearning delivered in bite-sized bursts each week
    → Spaced repetition to strengthen memory
    → Turning learning into a habit, not a one-off
    ✅ One team replaced a yearly course with weekly 5-minute refreshers and saw engagement and on-the-job application soar.

    Engagement isn’t a “nice-to-have” in L&D. It’s the foundation of every successful learning strategy. When learning is personal, interactive, and continuous - people pay attention. And when people are paying attention, performance improves.

    If you’re looking to future-proof your L&D approach, this is where to begin. But what’s stopping most teams from getting it right?
