ML & AI news of the week

Photo by Priscilla Du Preez 🇨🇦 on Unsplash

A collection of the best ML & AI news every week (research, news, resources). Star this repository if you find it useful.

Here you can find articles and tutorials about artificial intelligence.

For each week you will find different sections:

  • Research: the most important published research of the week.
  • News: the most important news related to companies, institutions, and much more.
  • Resources: released resources for artificial intelligence and machine learning.
  • Perspectives: a collection of deep and informative articles about open questions in artificial intelligence.

and a meme to start the week off well.

Suggestions and corrections

Feel free to open an issue if you find errors, or if you have suggestions, topics, or any other comments.

Index

2024

2023

Back to index

2024

ML news: Week 9 - 15 September

Research

De novo design of high-affinity protein binders with AlphaProteo. demonstrates a family of machine learning models trained for protein design; reports 3- to 300-fold improvements in binding affinity and higher experimental success rates compared to other methods on seven target proteins; shows that AlphaProteo's performance on hundreds of target proteins from the PDB is similar to its performance on the seven targets.
In Defense of RAG in the Era of Long-Context Language Models. argues that RAG remains relevant because long-context LLMs suffer from diminished focus on relevant information as context grows. Proposes an order-preserving RAG mechanism that improves performance on long-context question answering, though not monotonically: response quality first rises and then declines as the number of retrieved chunks increases. Also identifies a sweet spot where RAG achieves better quality with far fewer tokens than long-context LLMs; a minimal sketch follows.
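A minimal sketch of the order-preserving idea, assuming chunks and their embeddings are already computed (the embedding step and cosine similarity here are illustrative, not the paper's exact setup): rank chunks by relevance, then restore their original document order before building the prompt.

```python
import numpy as np

def retrieve_order_preserving(query_vec, chunk_vecs, chunks, k=8):
    """Pick the k most relevant chunks, then restore document order."""
    # Cosine similarity between the query and every chunk.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    top = np.argsort(sims)[-k:]  # indices of the k most relevant chunks
    top = np.sort(top)           # order-preserving: sort back into position order
    return [chunks[i] for i in top]
```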
Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation. a technique that improves LLM performance by eliciting strategic information before the intermediate CoT reasoning steps; the elicited strategy guides the generation of CoT paths and solutions; reports a 21.05% gain on the GSM8K dataset with the Llama3-8b model. A prompt-level sketch follows.
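A prompt-level sketch of the two-stage idea, assuming a generic `call_llm(prompt)` wrapper (hypothetical, not the paper's code): first elicit a strategy, then condition the step-by-step reasoning on it.

```python
def strategic_cot(problem: str, call_llm) -> str:
    # Stage 1: elicit a high-level strategy without solving the problem.
    strategy = call_llm(
        "State the most effective general strategy for solving the "
        f"following problem. Do not solve it.\n\nProblem: {problem}"
    )
    # Stage 2: solve with chain-of-thought conditioned on that strategy.
    return call_llm(
        f"Problem: {problem}\n\nStrategy: {strategy}\n\n"
        "Follow the strategy, reason step by step, and give the final answer."
    )
```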
The Effects of Generative AI on High Skilled Work: Evidence from Three Field Experiments with Software Developers. Examines the effects of generative AI on software developers, highlighting a 26.08% rise in completed tasks among developers utilizing AI tools such as GitHub Copilot. Additionally, it indicates that less experienced developers are more inclined to adopt AI tools and experience significant productivity improvements.
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA. Creates a large-scale supervised fine-tuning (SFT) dataset using off-the-shelf large language models (LLMs) to enhance long-context question answering with citations. The training focuses on 8B and 9B parameter models, improving their ability to generate citations from extended contexts while enhancing response accuracy. It claims to outperform GPT-4o on its proposed LongBench-Cite benchmark.
MemLong: Memory-Augmented Retrieval for Long Text Modeling. Employs an external retriever to gather historical information, enhancing the performance of long-context large language models (LLMs). It consistently surpasses other state-of-the-art LLMs on long-context benchmarks and can extend context length from 4k to 80k on a single 3090 GPU.
Pandora's Box or Aladdin's Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models. Introduces a benchmark, NoiserBench, to assess how various types of noisy information impact the performance of retrieval-augmented generation (RAG) models. The study reveals that, among different beneficial noise types (e.g., semantic, datatype, and illegal sentence), illegal sentence noise leads to the greatest performance improvement across models and datasets.
Beyond Preferences in AI Alignment. Critiques the prevailing AI alignment method of human preference tuning, highlighting how it fails to grasp the rich, nuanced content of human values. The argument is made that AI alignment requires reframing, suggesting that instead of aligning with individual human preferences, AI systems should align with normative standards relevant to their societal roles.
Planning In Natural Language Improves LLM Search For Code Generation. One of the difficulties in code generation is obtaining a diverse set of candidate solutions; even repeated sampling frequently lacks the variety needed to solve a problem. Starting from a natural language plan and generating ideas for potential solution paths makes the resulting generations far more varied, which leads to better solutions. A sketch of this plan-then-code loop follows.
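A sketch of the plan-then-code loop, again assuming a hypothetical `call_llm(prompt, temperature)` wrapper: sample several distinct natural-language plans at high temperature, then implement each one at low temperature.

```python
def plan_then_code(problem: str, call_llm, n_plans: int = 4):
    candidates = []
    for _ in range(n_plans):
        # High temperature encourages genuinely different plans.
        plan = call_llm(
            f"In plain English, sketch one approach to solving:\n{problem}",
            temperature=1.0,
        )
        # Low temperature keeps the code faithful to its plan.
        code = call_llm(
            f"Problem:\n{problem}\n\nPlan:\n{plan}\n\n"
            "Write a complete solution that follows this plan.",
            temperature=0.2,
        )
        candidates.append((plan, code))
    return candidates  # rank or filter these with tests downstream
```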
Imitating Language via Scalable Inverse Reinforcement Learning. Modern language modeling can largely be viewed as a specialized form of imitation learning, which benefits from extensive research in the broader field. This paper investigates the application of inverse reinforcement learning to mimic entire sequences rather than individual tokens. The findings are encouraging and suggest that reinforcement learning could play an increasingly important role in the training pipelines of language models moving forward.
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers. This large-scale study recruited over 100 NLP researchers to generate novel research ideas and blind-review both human- and LLM-generated ones. The findings revealed that LLM-generated ideas were judged more novel, but slightly less feasible, than those produced by human researchers.
Superhuman Automated Forecasting. The Safe AI Institute has published research on a system capable of surpassing human experts in forecasting accuracy.
The AdEMAMix Optimizer: Better, Faster, Older. This paper from Apple introduces an alternative to the traditional exponential moving average optimization method, incorporating contributions from older gradients to significantly enhance learning convergence.
DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data. DiverGen is an innovative approach for generating datasets to improve instance segmentation models. Instead of relying on expensive manual annotations, it leverages generative models to create diverse data, helping to mitigate overfitting and boost model performance.
Policy Filtration in RLHF to Fine-Tune LLM for Code Generation. Policy Filtration for Proximal Policy Optimization (PF-PPO) is a technique aimed at enhancing the precision of reinforcement learning from human feedback (RLHF), specifically in the context of code generation tasks.
Data Augmentation via Latent Diffusion for Saliency Prediction. Researchers have introduced a novel data augmentation technique to enhance saliency prediction models, which have historically struggled due to the scarcity of labeled data.

News

Google using anti-competitive tactics in UK ad market, claims watchdog. CMA says tech company has ‘abused its dominant position’ to the detriment of publishers and advertisers
Apple to unveil iPhone 16 and ‘Apple Intelligence’ AI features. Apple watchers also expect new colors for the iPhone at the annual launch event, this year titled ‘It’s Glow time’
TSMC's $65 billion Arizona facility can now match Taiwan production yields according to early trials. The US is committed to establishing semiconductor manufacturing within its borders, and perhaps no effort is more crucial to this goal than TSMC's three-fab facility in Arizona. The government is pouring billions into the development, alongside TSMC's $65 billion investment.
AI Firm’s Misconfigured Server Exposed 5.3 TB of Mental Health Records. A misconfigured server from a US-based AI healthcare firm Confidant Health exposed 5.3 TB of sensitive mental health records, including personal details, assessments, and medical information, posing serious privacy risks for patients.
California’s big AI regulation bill is headed to Gavin Newsom. A California bill requiring makers of large AI systems to test them for potential harm cleared the Legislature today. It could still face a veto by Gov. Gavin Newsom.
Google search monopoly US case remedies to come by December. The U.S. Department of Justice plans to issue an outline by December of what Alphabet's Google must do to restore competition, after a judge earlier found the company illegally monopolized the market for online search, prosecutors said at a court hearing in Washington on Friday.
Intel reveals first Lunar Lake laptop CPUs: everything you need to know. Intel has introduced its Core Ultra 200V portfolio, previously known as Lunar Lake, which features competitive integrated GPUs for thin notebooks, fast CPUs, and enhanced AI capabilities. The chips offer up to 32GB of integrated RAM, eight CPU cores, and improved efficiency. Prominent manufacturers such as Acer, Asus, Dell, and HP will introduce laptops equipped with these new CPUs. Reviews to support Intel's assertions are still pending.
OpenAI, Still Haunted by Its Chaotic Past, Is Trying to Grow Up. To draw in major investors such as Microsoft, Apple, and Nvidia, OpenAI is reorganizing its management and corporate structure with the aim of reaching a $100 billion valuation. Internal disagreements over its safety procedures and objectives have resulted in high employee turnover, with key researchers leaving for competitors such as Anthropic. Despite growing revenue and user numbers, OpenAI struggles to balance business goals with ethical considerations as it develops AI technology.
BP extends the use of AI in a five-year deal with spy tech firm Palantir. Oil and gas company to use artificial intelligence to speed up decision-making by engineers
Google’s second antitrust suit brought by US begins, over online ads. DoJ accused tech giant of more monopolistic behavior a month after a judge found it illegally cornered online search
What is Apple Intelligence, when is it coming and who will get it? At WWDC 2024, Apple unveiled Apple Intelligence, a platform designed to integrate AI capabilities into existing applications like Mail, Messages, and Siri. Utilizing large language models, it supports functions such as text summarization and image generation, all aimed at enhancing the user experience. A beta version will be available in the U.S. starting this October, with plans to expand globally in 2025.
New open source AI leader Reflection 70B’s performance questioned, accused of ‘fraud’. HyperWrite's Reflection 70B, a variant of Meta's Llama 3.1 LLM, is under scrutiny after independent evaluators were unable to reproduce its advertised performance. The problems were traced back to corrupted model weights during the upload to Hugging Face, causing inconsistencies. The AI community is now awaiting further clarifications and updates to better understand the model's true capabilities.
The new Shortwave AI Assistant. Shortwave has substantially enhanced its AI Assistant, equipping it to handle complex, multi-step tasks like advanced searches, calendar lookups, and in-depth email analysis, making it more versatile and powerful in managing user tasks.
OpenAI might use Apple’s TSMC for chips. OpenAI could greatly lower operational costs by adopting more efficient chips, which would be particularly beneficial as its user base continues to expand, allowing for better scalability and resource management.
Apple takes direct aim at Microsoft’s Copilot+ PCs in new AI-focused Mac promos. Apple is actively marketing the Mac as the "best AI PC," positioning it as a direct competitor to Microsoft's Copilot+ PCs. This strategic push highlights Apple's focus on integrating AI capabilities into its devices, aiming to challenge Microsoft's AI-driven offerings in the PC market.
GPT-fabricated scientific papers on Google Scholar: Key features, spread, and implications for preempting evidence manipulation. Generative AI tools, such as ChatGPT, are increasingly generating fraudulent research papers that are finding their way into databases like Google Scholar, mixing with legitimate studies. These papers, frequently addressing sensitive topics like health and the environment, threaten the integrity of science and public trust. Strengthened oversight and improved filtering mechanisms in academic search engines are crucial to addressing this rising concern.
Apple announces its new A18 and A18 Pro iPhone chips. At its "Glowtime" event, Apple introduced the A18 and A18 Pro chips, highlighting substantial CPU and GPU upgrades compared to the A16 Bionic. The A18 Pro offers increased memory bandwidth and improved image processing. Both chips come equipped with advanced AI capabilities, with the A18 Pro specifically enhancing on-device model performance and thermal design for a superior gaming experience.
AMD announces unified UDNA GPU architecture — bringing RDNA and CDNA together to take on Nvidia's CUDA ecosystem. At IFA 2024, AMD revealed plans to merge its RDNA and CDNA architectures into a unified UDNA microarchitecture, positioning itself to compete more effectively with Nvidia's CUDA ecosystem. This strategic shift is aimed at simplifying development and strengthening AMD's foothold in the AI and high-performance computing (HPC) markets. The move to UDNA marks a significant transition, with full-scale adoption anticipated after the release of the RDNA 4 generation.
Waymo Giving 100,000 Robotaxi Rides Per Week But Not Making Any Money. Waymo is now delivering over 100,000 paid autonomous rides per week in San Francisco, Phoenix, and Los Angeles, a figure that has doubled since May. Despite this growth, the company remains unprofitable, with Google’s experimental division facing a $2 billion operating loss. The high costs of vehicles and city mapping, along with ongoing public hesitation, continue to hinder Waymo's journey to profitability.
iOS 18.1 with Apple Intelligence launches in October, more languages rolling out over time. Apple announced that Apple Intelligence will launch in beta with iOS 18.1 in October, initially available exclusively for US English users.
Bringing generative AI to video with Adobe Firefly Video Model. Adobe's Firefly Video Model introduces AI-driven tools to video editing programs such as Premiere Pro. Set to launch in beta later this year, the model provides editors with improved workflows, enabling them to experiment with creative concepts, fill gaps in timelines, and incorporate new elements into their videos.
Mistral releases Pixtral 12B, its first multimodal model. French AI startup Mistral has introduced Pixtral 12B, a multimodal model with 12 billion parameters designed to handle both images and text. The model, accessible through GitHub and Hugging Face, can be fine-tuned and is available under the Apache 2.0 license. This release comes after Mistral secured $645 million in funding, strengthening its role as a key player in Europe's AI industry.
Elon Musk says Tesla has ‘no need’ to license xAI models. Elon Musk has refuted claims that Tesla will share revenue with his AI startup xAI in exchange for using its AI models. He explained that while Tesla has gained from xAI engineers' expertise, it doesn't need to license xAI's models. Musk also noted that xAI's large models are incompatible with Tesla's vehicle computers.
Apple is thinking about a rival to Meta Ray-Ban glasses. Apple might be developing non-AR smart glasses, positioning them as potential competitors to Meta's $299 Ray-Ban glasses, which also lack AR functionality. Meta's glasses come equipped with features like a camera and an AI chatbot. By excluding AR capabilities, Apple's glasses could be more affordable, lighter, and have improved battery life due to reduced complexity.
OpenAI in talks to raise funds at $150B valuation, Bloomberg says. OpenAI is in talks to raise $6.5B from investors at a valuation of $150B, people familiar with the matter told Bloomberg.
Meta fed its AI on almost everything you’ve posted publicly since 2007. Unless you’re in the EU, there’s no ability to opt out of AI training settings that keep Facebook or Instagram posts public.
Google is using AI to make fake podcasts from your notes. Google’s NotebookLM app can now generate ‘lively’ audio discussions with two AI hosts about the documents you’ve given it.
Introducing OpenAI o1-preview. OpenAI has launched its latest model, designed to think carefully before responding. It was trained using reasoning processes, allowing it to take time to deliberate before providing an answer. This approach has resulted in superhuman performance in certain areas. Initially, users will be limited to around 30 queries per week, though OpenAI plans to remove this restriction shortly.
Google is now rolling out Gemini Live to free users on Android. Google is launching Gemini Live, its conversational AI tool, to all free Android users following a month of early access for advanced users. With this feature, users can interrupt responses to provide new information and receive text transcripts of their conversations. While extensions like Gmail are not yet supported, Gemini Live introduces ten new voice options, with additional features expected to be added soon.
Sergey Brin says he’s working on AI at Google ‘pretty much every day’. Google co-founder and ex-Alphabet president Sergey Brin said he’s back working at Google “pretty much every day” because he hasn’t seen anything as exciting as the recent progress in AI — and doesn’t want to miss out.
Amazon starts testing ads in its Rufus chatbot. Amazon's shopping chatbot, Rufus, will soon incorporate sponsored ads, displaying them based on the user's search queries and the context of their conversations.

Resources

OLMoE: Open Mixture-of-Experts Language Models. Presents a fully open large language model (LLM) that utilizes a sparse Mixture-of-Experts approach. OLMoE is a 7B parameter model with 1B active parameters per input token. An instruction-tuned version is also available, which reportedly surpasses the performance of Llama-2-13B-Chat and DeepSeekMoE 16B. A minimal routing sketch follows.
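A minimal sketch of the sparse-MoE mechanism behind such models (the dimensions and top-2 routing here are illustrative, not OLMoE's exact configuration): a router scores experts per token and only the top-k experts run, which is how a 7B-parameter model can activate only ~1B parameters per token.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim=512, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (n_tokens, dim)
        weights = self.router(x).softmax(dim=-1)
        top_w, top_i = weights.topk(self.k, dim=-1)  # per-token expert choice
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e
                if mask.any():  # run only the experts the router selected
                    out[mask] += top_w[mask, slot].unsqueeze(1) * expert(x[mask])
        return out
```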
Large Language Model-Based Agents for Software Engineering: A Survey. A survey paper on large language model (LLM)-based agents in software engineering, offering insights across various areas such as requirements engineering, test generation, and software maintenance.
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos. Researchers were able to produce very accurate depth information without requiring any camera pose or optical flow information by using a video diffusion model (Stable Video Diffusion) as a prior.
SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration. Using DPO-style data and supervised fine-tuning on open-source language models, LLMs can be trained to produce compounds with intriguing features for potential medicinal development.
Running a LLM on the ESP32. This code demonstrates how to execute a small language model on an ESP32 microcontroller, showcasing the process of deploying and running AI models on resource-constrained hardware.
DocAI. This is another example of effectively leveraging existing models to extract structured information from documents, demonstrating the innovative use of pre-trained AI models to automate data extraction tasks efficiently.
FluxMusic. Text-to-music generation using a rectified flow transformer: a model that combines transformer architectures with rectified-flow training converts text inputs into musical compositions, improving its ability to generate coherent and diverse music from textual descriptions.
iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models. iText2KG is a Python package that leverages large language models to extract entities and relationships from text, progressively constructing consistent knowledge graphs. This tool automates the process of transforming unstructured text into structured knowledge, allowing for the incremental growth of comprehensive knowledge graphs.
Multimodal RAG using ColPali (with Byaldi) and Qwen2-VL. Merve has created a great resource for using language and vision models to improve retrieval.
Awesome-Text2X-Resources. This is an open collection of state-of-the-art (SOTA) and novel Text-to-X methods (where X can represent any output, such as images, audio, or 3D models). The collection includes papers, code, and datasets, aimed at staying up-to-date with the expected surge in research developments in this area over the coming months.
Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task. The Proxy Token Diffusion Transformer optimizes diffusion transformers by minimizing redundant computations, employing a reduced set of representative tokens for attention processing. This approach enhances efficiency while maintaining model performance.
UniDet3D: Multi-dataset Indoor 3D Object Detection. UniDet3D is a robust 3D object detection model designed to operate across multiple indoor datasets, delivering strong performance in identifying and detecting objects in three-dimensional spaces.
Starst3r. This innovative tool leverages Mast3r along with smart optimizations to efficiently reconstruct 3D scenes from just a few 2D images, offering impressive results with minimal input.
simple_tma. Image processing and cropping that can be run on the GPU.
Lexicon3D. In a recent study comparing seven visual encoding models for 3D scene understanding, researchers found that the most effective model varied based on the specific task. DINOv2 emerged as the top performer overall, while video models excelled in object-level tasks, and diffusion models outperformed others in geometric tasks. Surprisingly, models pre-trained on language showed notable limitations in this context.
One-DM: One-Shot Diffusion Mimicker for Handwritten Text Generation. The One-DM model generates handwritten text that can imitate any style using only a single sample as a reference. This approach allows for highly personalized handwriting generation with minimal input data.
optillm. Optillm assists in optimizing prompts by utilizing various well-established research algorithms, including Monte Carlo Tree Search, Z3 solvers, and Self Consistency, to improve performance.
Train Till You Drop: Towards Stable and Robust Source-free Unsupervised 3D Domain Adaptation. Researchers tackled the challenge of source-free unsupervised domain adaptation for 3D semantic segmentation by implementing regularization techniques and proposing a new criterion to improve adaptation performance.
Memory-Efficient Optical Flow. HCVFlow is a newly developed memory-efficient optical flow method designed to address the high computational demands of all-pairs cost volumes in high-resolution images.
Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models. Concept Sliders offer a powerful mechanism for controlling the output of diffusion models. Recent efforts have been made to integrate them with the new Flux suite of models, enhancing their functionality and adaptability.
Minifying HTML for GPT-4o: Remove all the HTML Tags. Converting HTML to plain text can significantly reduce costs with minimal performance loss in GPT-4o for data extraction tasks. Tests on the Mercury Prize dataset demonstrated that GPT-4o performs effectively even without the HTML structure, and GPT-4o mini offers a cost-efficient solution for handling unstructured questions. For structured extraction tasks, it's advisable to test both versions to find the right balance between cost and accuracy.
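A sketch of the minification step using only the standard library (the example row is illustrative, not the article's exact code): strip tags to plain text before sending the page to the model, which cuts token counts substantially.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)

# A table row shrinks from many tokens of markup to two short lines of text.
print(html_to_text("<tr><td>Mercury Prize</td><td>2001</td></tr>"))
```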
Prompt2Fashion: An automatically generated fashion dataset. This dataset, created with large language models, curates outfit recommendations for various occasions, styles, and body types, providing high-quality and relevant suggestions.
Sources of Uncertainty in 3D Scene Reconstruction. Researchers are improving 3D scene reconstruction techniques such as Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (GS) by incorporating uncertainty estimation methods. Although these approaches produce high-quality renders, they face challenges in addressing uncertainties caused by noise, occlusions, and camera inaccuracies.
🦙🎧 LLaMA-Omni: Seamless Speech Interaction with Large Language Models. Llama Omni is a speech input-output model built on Llama 3.1 8B, designed to operate with extremely low latency while maintaining high-quality responses.
AWS AI Stack. This ready-to-use, full-stack boilerplate project is designed for building serverless AI applications on AWS. It is ideal for developers looking for a reliable AWS foundation for AI apps and seamless access to powerful LLM models through Bedrock while ensuring your app's data remains separate from model providers.
Internet of Agents. The Internet of Agents (IoA) is a novel framework aimed at enhancing multi-agent collaboration by enabling more efficient integration of diverse third-party agents.
ell: The Language Model Programming Library. Ell is a newly released package developed by a former OpenAI scientist, designed to manage prompts as code, streamlining the process of working with prompts in AI applications.
EMO-Disentanger. This research employs a two-stage model to separate and analyze emotive elements in piano music generation, enabling more expressive and nuanced performances.
Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown. Jina has unveiled two cutting-edge models capable of transforming noisy HTML into clean, structured Markdown, optimized for training and reasoning tasks.
Agent Workflow Memory. Agent Workflow Memory (AWM) is a technique that enables language model-based agents to learn and retain reusable task workflows from previous experiences, allowing them to effectively manage complex, long-horizon tasks.
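A toy sketch of the workflow-memory loop, assuming a generic `call_llm(prompt)` wrapper (hypothetical, not the paper's implementation): distill each successful trajectory into a reusable workflow and prepend the accumulated workflows to future task prompts.

```python
class WorkflowMemory:
    def __init__(self, call_llm):
        self.call_llm = call_llm
        self.workflows = []  # reusable step-by-step recipes

    def induce(self, trajectory: str) -> None:
        # Abstract a solved task into a short, reusable workflow.
        self.workflows.append(self.call_llm(
            "Summarize this successful trajectory as a short, reusable, "
            f"step-by-step workflow:\n{trajectory}"
        ))

    def prompt(self, task: str) -> str:
        # Give the agent its learned workflows as context for the new task.
        memory = "\n\n".join(self.workflows)
        return f"Known workflows:\n{memory}\n\nNew task: {task}"
```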
Hi3D-Official. Hi3D is a novel model designed to improve the generation of multi-view consistent, high-resolution 3D images from a single input. By using a video diffusion technique, it addresses the limitations of traditional 2D methods that lack 3D awareness, leveraging temporal consistency from video models to enhance geometric coherence across different views.
Fine Tuning Llama 3.1 405B with Axolotl on a Lambda 1-Click Cluster. Axolotl has collaborated with Lambda Labs to demonstrate how their one-click cluster can be used to fine-tune the Llama 3.1 405B model. Although the process requires 64 GPUs, the new tools make it possible with minimal infrastructure setup, streamlining the process significantly.
super-benchmark. SUPER is a newly introduced benchmark aimed at evaluating how effectively large language models (LLMs) can replicate tasks sourced from research repositories.
Using GPT-4o for web scraping. An AI-powered web scraper, utilizing OpenAI's GPT-4o, is designed to extract structured data from HTML tables. While it performs well on simple tables, its results are mixed when dealing with more complex tables, such as those with merged rows or intricate structures.
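A minimal sketch of this kind of scraper with the OpenAI Python client (the prompt wording and JSON-rows schema are my assumptions, not the article's exact code):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def scrape_table(table_html: str) -> list:
    """Ask GPT-4o to turn an HTML table into a list of row dicts."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Extract this HTML table as a JSON array of objects, "
                       "one per row, keyed by column header. Return only "
                       "JSON, no prose.\n\n" + table_html,
        }],
    )
    return json.loads(resp.choices[0].message.content)
```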

Perspectives

‘If journalism is going up in smoke, I might as well get high off the fumes’: confessions of a chatbot helper. Journalists and other writers are employed to improve the quality of chatbot replies. The irony of working for an industry that may well make their craft redundant is not lost on them
Will AI make us overconfident? Students are increasingly turning to AI tools like ChatGPT to tackle complex research challenges, surprising educators with their swift advancements. AI-powered development tools, particularly in coding, greatly enhance both ambition and productivity, though they also introduce risks of overconfidence and mistakes. Despite occasional inaccuracies, AI offers valuable interactive starting points for difficult tasks, potentially fostering more active learning and encouraging exploration across disciplines.
LLMs struggle to explain themselves. An interactive demo was employed to evaluate large language models' (LLMs) ability to recognize and explain number sequences produced by random programs. The findings revealed that although LLMs often correctly identified the sequences, their explanations of the underlying patterns were frequently inaccurate. This underscores the limitations of LLMs' reasoning capabilities, despite their strong performance on standardized tests.
No more free pass: Regulation starts to crack down on social media platforms. The arrest of Telegram’s CEO in France and the closure of X in Brazil are two of the latest signs that times are changing, with networks beginning to be held more accountable
Here’s how 7 news audience directors are thinking about Google’s AI Overviews. Google's AI Overviews, which use the Gemini language model, received significant criticism for inaccuracies and potentially harmful recommendations following their launch in the U.S. Despite the negative feedback, Google extended the feature to six additional countries, sparking concerns among publishers about decreased web traffic and distorted content. AI experts and SEO specialists stress the importance of transparency and improved citation methods to preserve trust and ensure consistent traffic.
Diffusion is spectral autoregression. Diffusion models and autoregressive models share a fundamental similarity, as both rely on iterative refinement processes. The author demonstrates, using Fourier transform techniques, that diffusion models function similarly to approximate autoregression in the frequency domain, especially for visual data. This insight suggests promising pathways for unifying generative modeling approaches across various data types.
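The core observation can be reproduced in a few lines of NumPy (a sketch, not the author's code): natural images have an approximately power-law radially averaged power spectrum, so denoising from coarse to fine frequencies behaves like autoregression over the spectrum.

```python
import numpy as np

def radial_power_spectrum(img: np.ndarray) -> np.ndarray:
    """Radially averaged power spectrum of a grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h / 2, x - w / 2).astype(int)  # ring index per pixel
    # Mean power within each ring of equal spatial frequency.
    return np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())

# On a log-log plot this is close to a straight line for natural images.
```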
Why We Fear Diverse Intelligence Like AI. The emergence of AI and various forms of intelligence is blurring traditional distinctions between "real beings" and machines. Rather than centering discussions only on AI, it's important to recognize and ethically interact with a broad range of cognitive systems, including bioengineered, robotic, and hybrid entities. By broadening our understanding of intelligence and fostering compassion, we can better navigate the ethical challenges posed by these rapidly evolving technologies.
SGSeg: Enabling Text-free Inference in Language-guided Segmentation of Chest X-rays via Self-guidance. SGSeg is a segmentation framework for chest X-rays that incorporates language guidance during training but allows for text-free inference during the prediction phase.
Are novelists who worry about the rise of AI really ‘classist and ableist’? An international writing organization appeared to greenlight the use of AI, prompting anger, the resignation of four board members and an entire creative community to ask: ‘What?!’
AI Chatbots Have a Political Bias That Could Unknowingly Influence Society. A new study has uncovered strong evidence that we can now add political bias to that list, further demonstrating the potential of the emerging technology to unwittingly and perhaps even nefariously influence society's values and attitudes.
How influencers and algorithms mobilize propaganda — and distort reality. The engagement-fuelled logic of social media has bequeathed us a world in which what’s trending is a yardstick for what’s true.
Artificial intelligence can help to make animal research redundant. One alternative in its early stages is artificial intelligence (AI), whereby generative adversarial networks produce animal data. However, there remains a disconnect between AI-generated animal data and human safety data. Computer models that simulate complex human physiological processes could close this gap, with AI used to analyze the resulting data sets.
Wikipedia is facing an existential crisis. Can gen Z save it? The world’s most important knowledge platform needs young editors to rescue it from chatbots – and its own tired practices
AI-Generated Junk Science Is Flooding Google Scholar, Study Claims. New study claims to have uncovered a disturbing trend in the world of academic research: AI tools like ChatGPT being used to produce fake scientific papers that are infiltrating Google Scholar, one of the most widely used academic search engines.
Will the "AI Scientist" Bring Anything to Science? Researchers have created an AI tool capable of automating scientific workflows, from generating hypotheses to executing experiments and drafting research papers. While its accuracy and coherence require further development, critics warn that AI's role in simulations, such as in quantum computing and materials science, may lead to narrower research questions and less impactful findings. Supporters, however, see potential in using this AI to streamline the early stages of research, helping scientists conceptualize and define their projects more efficiently.
Is AI Quietly Sabotaging Itself—And The Internet? Amid the growth of AI content online, a group of researchers at Cambridge and Oxford universities set out to see what happens when generative AI tools query content produced by AI. What they found was alarming.

meme-of-the-week

Back to index

ML news: Week 2 - 8 September

Research

Diffusion Models Are Real-Time Game Engines. a two-phase training process in which an RL agent learns to play and a diffusion model learns to generate frames; it can interactively simulate DOOM at over 20 frames per second on a single TPU. A game engine driven by a diffusion model allows real-time interaction with complex environments over long trajectories.
Agentic Retrieval-Augmented Generation for Time Series Analysis. suggests an agentic RAG framework for time series analysis. It makes use of a multi-agent architecture in which an agent directs specialized sub-agents to carry out time-series tasks. These sub-agents can retrieve pertinent prompts that contain information about past patterns and trends, which helps to improve predictions on new data. The sub-agents use tuned small language models to accomplish these tasks.
Persuasion Games using Large Language Models. asserts that the persuasive efficacy of LLMs can be increased by using a multi-agent framework, in which the main agent conducts persuasive dialogue while supporting agents handle crucial functions like information retrieval and response analysis. The study finds that LLMs are capable of influencing users' perspectives and convincing them to make a purchase decision; for example, sales agents were able to shift user perspectives in a positive direction in 71% of cases.
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. discovers that synthetic data produced by weaker + less costly (WC) models is superior to data produced by stronger but more expensive models for fine-tuning models; generally, the results imply that WC models might be a compute-optimal method for training sophisticated LLM reasoners.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model. demonstrates that a 7B parameter model trained on 2T multi-modal tokens can compete with similar-scale diffusion and language models; presents a recipe for training multi-modal models over discrete and continuous data, combining next-token prediction with diffusion to train transformer models over mixed-modality sequences.
ReMamba: Equip Mamba with Effective Long-Sequence Modeling. examines the long-context capacities and efficiency of Mamba models and attributes the long-context deficiencies to Mamba's RNN-like nature. ReMamba compresses the context by selecting the top-k hidden states during the first forward pass and using Mamba's selective mechanism to incorporate them into the state space during the second forward pass; achieves a 3.2 improvement over the baseline on LongBench and a 1.6 improvement on L-Eval; the strategy appears to also apply to Mamba 2.
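A sketch of just the selection step (the importance scoring in the paper is learned; here it is left as an abstract `scores` input):

```python
import torch

def select_top_k_states(hidden: torch.Tensor, scores: torch.Tensor, k: int):
    """hidden: (seq_len, d_model); scores: (seq_len,) importance per position.

    Keep the k highest-scoring hidden states, preserving temporal order,
    so the second forward pass sees a compressed but ordered context.
    """
    idx = torch.topk(scores, k).indices.sort().values
    return hidden[idx]
```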
Text2SQL is Not Enough: Unifying AI and Databases with TAG. proposes Table-Augmented Generation (TAG), a unified framework for answering natural language queries over databases that represents a wider range of previously unexplored interactions between LLMs and databases; develops a benchmark and discovers that standard methods answer only 20 percent of natural language queries correctly.
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts. Sparsifying the computation is aided by routing tokens to MoE experts. But it can be hard to learn that routing. Usually, there is a complex loss structure. This research presents an innovative solution to this issue, leading to a significant increase in training stability and expert balancing.
Toward Robust Early Detection of Alzheimer's Disease via an Integrated Multimodal Learning Approach. A multimodal classification approach intended to enhance the early detection of Alzheimer's disease is presented in this work.
Targeted Cause Discovery with Data-Driven Learning. A sophisticated machine learning technique has been created by researchers to determine a target's direct and indirect causal variables within a system.
Stochastic Layer-Wise Shuffle: A Good Practice to Improve Vision Mamba Training. To prevent overfitting in Vision Mamba models and enable them to scale up to 300M parameters while still performing competitively with Vision Transformers (ViTs), this research presents a stochastic layer-wise shuffle regularization strategy.
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control. Stable Control Representations are a tool that researchers are using to help embodied AI machines interpret scenes more precisely. These representations capture detailed visuospatial information required for challenging tasks by utilizing pre-trained text-to-image diffusion models.
AI generates covertly racist decisions about people based on their dialect. Language models perpetuate covert racism through dialect prejudice against African American English (AAE), producing negative stereotypes and harmful consequences even though their overt stereotypes about African Americans are more positive; current bias mitigation practices may worsen this issue.
Latent Distillation for Continual Object Detection at the Edge. Latent distillation is a continual learning technique for object detection that overcomes memory and computational limitations on edge devices.
Masked Mixers for Language Generation and Retrieval. Masked mixers are a unique architecture designed to enhance input representation in language models by substituting masked convolutions for self-attention.
Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology. Using masked autoencoders and self-supervised learning, researchers have created a novel technique that greatly enhances the processing of large-scale microscope pictures.
Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models? This work compares alternative pooling and attention strategies while examining multiple designs for LLM-based embedding models.
AlphaProteo generates novel proteins for biology and health research. New AI system designs proteins that successfully bind to target molecules, with potential for advancing drug design, disease understanding and more.

News

X goes offline in Brazil after Elon Musk’s refusal to comply with local laws. Millions of users shut out and 500,000 switch to rival platform Bluesky as providers enact supreme court ban
'A tech firm stole our voices - then cloned and sold them'. Paul Skye Lehrman and Linnea Sage, voice-over performers, discovered that an AI-powered text-to-speech platform had cloned their voices without permission after they were tricked into providing audio recordings through Fiverr. The couple has filed a lawsuit against the platform, Lovo, for allegedly using their voices illegally.
Did your car witness a crime? Bay Area police may be coming for your Tesla — and they might tow it. Tesla's Sentry Mode, a feature that uses the car's cameras to monitor its surroundings, is increasingly being used by law enforcement as evidence in criminal investigations. The footage captured by the system has been instrumental in solving various crimes, such as car break-ins and hit-and-run incidents.
Updates to the Command R Series. Updates were made to Command R and Command R+ for almost every task. Their recall, speed, arithmetic, and reasoning have all improved.
Workers at Google DeepMind Push Company to Drop Military Contracts. In a letter, almost 200 workers at Google DeepMind demanded that the firm revoke its military contracts, citing a breach of its own AI ethics policy. Armed forces have purchased DeepMind technology from Google Cloud, which has caused internal strife among AI personnel who respect moral principles. Although Google's response showed that the company was following the AI Principles, employees are still not pleased and want further regulation to prevent the military from using their AI.
TRL release. This could be among the Transformer Reinforcement Learning library's more significant updates. WinRate Callbacks, Liger Kernels, onlineDPO, and other features are included.
xAI Starts Colossus Training Cluster. xAI has brought up Colossus, a 100,000-GPU H100 training cluster that is now the largest in the world, with plans to double its size in a few months.
First MLPerf benchmarks for Nvidia Blackwell, AMD, Google, Untether AI. In MLPerf's LLM Q&A benchmark, Nvidia's new Blackwell chip showed the best per GPU performance, demonstrating notable improvements with its 4-bit floating-point accuracy. Rivals like AMD and Untether AI, however, have displayed encouraging outcomes, especially in terms of energy efficiency. For example, Untether AI's speedAI240 chip performed exceptionally well in the edge-closed category, demonstrating a range of strengths in emerging AI inference technology.
Two Oxford PhDs are building an app to let you remix photos into memes. A duo of Oxford PhDs is building a new social app that lets you add friends to a photo in a more memeable and fun way.
Apple and Nvidia may invest in OpenAI. The two tech giants might join OpenAI’s potentially huge funding round.
Boston Dynamics’ new electric Atlas can do push-ups. In a recent video, Boston Dynamics demonstrated Atlas, its electric biped robot, completing push-ups to highlight the strength of its actuators during its early commercialization phase for factory floor applications.
Meet Boardwalk Robotics’ Addition to the Humanoid Workforce. The humanoid upper-torso robot Alex, by Boardwalk Robotics, is intended for use in manufacturing, logistics, and maintenance. Alex is a legless robot that was developed separately while building on IHMC's bipedal robot experience; its designers prioritized manipulation over mobility to guarantee efficiency and safety. Commercial partners are currently being selected for pilots, but researchers can buy Alex right now.
Americans Are Uncomfortable with Automated Decision-Making. Consumer Reports recently released a national survey finding that Americans are uncomfortable with the use of artificial intelligence (AI) and algorithmic decision-making in their day-to-day lives. Nearly three-quarters of respondents (72%) said they would be “uncomfortable.”
Canva says its AI features are worth the 300 percent price increase. The design software company is massively jacking up subscription prices for some users.
AI worse than humans in every way at summarising information, government trial finds. A test of AI for Australia's corporate regulator found that the technology might actually make more work for people, not less.
Reliant’s paper-scouring AI takes on science’s data drudgery. Karl Moritz Hermann co-founded Reliant AI, which has raised $11.3 million in a seed round to automate academic literature reviews. Tabular, the company's AI solution, promises zero-error data extraction from scientific papers. Reliant offers researchers an intuitive user interface (UI) while utilizing LLMs and patented methodologies to increase efficiency compared to conventional methods. Its usage of in-house hardware highlights its dedication to providing the research sector with premium, domain-specific AI solutions.
Leveraging AI for efficient incident response. With the help of heuristic retrieval and LLM-based ranking, Meta has developed an AI-assisted root cause analysis system that has successfully identified 42% of the causes in its web monorepo investigations. Improving system accuracy has mostly been achieved by fine-tuning the Llama 2 model using previous data. The organization intends to increase the integration of AI tools with the goal of achieving autonomous processes and proactive risk mitigation.
Artificial Intelligence Predicts Earthquakes With Unprecedented Accuracy. After testing their AI in China, researchers at the University of Texas were able to predict 70% of earthquakes.
Recall 2.0? Microsoft plans another AI feature that scans everything. Another AI-driven feature that searches PC content surfaces in Windows 11, raising questions about data privacy.
You.com raises $50M Series B. The search engine, agent platform, and knowledge base startup You.com has raised more money as it expands.
Sakana raises $100m Series A. With the increase, Sakana will be able to hire more researchers, expand its computational capacity, and generally establish itself as one of Japan's top AI labs.
Google AI Overviews rollout hits news publisher search visibility. Some news items now have AI-written summaries available in Google's US and UK search results. According to research, publisher visibility is being impacted by these AI Overviews, which is causing original articles to fall in the search results. To sustain traffic, this move may require major adjustments to SEO tactics.
US, UK, EU and others sign landmark AI safety treaty. More than a dozen countries have signed a treaty designed to ensure that artificial intelligence models are used in a safe manner.
OpenAI's Next-Generation Models Could Reportedly Cost $2,000. The Sam Altman-led company's new artificial intelligence models, such as Strawberry and Orion, likely won't be cheap (prices as high as $2,000 per month).
Alleged fraudster got $10 million in royalties using robots to stream AI-made music. A North Carolina man is facing fraud charges after allegedly uploading hundreds of thousands of AI-generated songs to streaming services and using bots to play them billions of times. Michael Smith is said to have received over $10 million in royalties since 2017 via the scheme.
Advertisers plan to withdraw from X in record numbers. A record number of firms plan to cut advertising spending on X next year because of concerns that extreme content on the platform could damage their brands, dealing another blow to the financial fortunes of Elon Musk’s social media company.
Dutch Regulator Slams Clearview AI with €30.5 Million Penalty for “Massive” Rights Breach. The Dutch Data Protection Authority (DPA) announced on Tuesday that it has imposed a €30.5 million ($33.7 million) fine on US facial recognition company Clearview AI for illegally creating a database of billions of facial images.
M&S using AI as personal style guru in effort to boost online sales. Shoppers can use technology to advise them on outfit choices based on their body shape and style preferences
Google’s AI-powered Ask Photos feature begins US rollout. Google Photos' new AI-powered search feature, "Ask Photos," which lets photos be searched with more sophisticated natural language queries, is now available to a limited number of US users.
Alibaba releases new AI model Qwen2-VL that can analyze videos more than 20 minutes long. Qwen2-VL, a new vision-language model with improved visual understanding, multilingual text-image processing, and video comprehension, has been published by Alibaba Cloud. It performs better than models such as Meta's Llama 3.1 and OpenAI's GPT-4o and supports a wide range of applications, such as real-time video analysis and technical help. The models are available in three sizes (2B, 7B, and soon 72B), with the smaller versions open-source under Apache 2.0.
Broadcom is working to integrate optical connectivity directly into GPUs. Currently, one of the main obstacles to training large models is GPU interconnect bandwidth. If Broadcom can integrate optical links directly into GPUs, as it is now working to do, the problem would be much reduced.
YouTube is making tools to detect face and voice deepfakes. It plans to launch a pilot program for the voice detection tool by early next year.
Google is working on AI that can hear signs of sickness. Given everything you’ve already heard about AI, you may not be surprised to learn that Google is among other outfits beginning to use sound signals to predict early signs of disease.

Resources

AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems. A low-code interface for quickly prototyping AI agents. Built on top of the AutoGen framework, it can be used for evaluating and debugging multi-agent workflows.
Foundation Models for Music: A Survey. gives a thorough rundown of the most recent pre-trained models and foundation models in the music industry.
A Practitioner's Guide to Continual Multimodal Pretraining. a thorough manual on continual multimodal pretraining; presents FoMo-In-Flux, a large-scale continual pretraining benchmark with fine-grained and extended horizons.
AI feedback loop will spell death for future generative models. When you train LLMs with LLM-generated content, the results tend to be digital poop
Apple's robotics work aims to solve users' first-world problems. Apple might be getting more involved in robotics and releasing moving gadgets, like an iPad supported by a robotic arm. Under the direction of Vice President of Technology Kevin Lynch, Apple is making headway in robotics with the assistance of specialists from institutions such as Israel's Technion, and plans to expand its AI interfaces beyond Siri. Apple is thinking of releasing these new robotic devices around 2026 or 2027, though they remain conceptual for now.
Towards Real-world Event-guided Low-light Video Enhancement and Deblurring. Using event cameras, this end-to-end system concurrently solves motion deblurring and low-light enhancement in videos.
Enhancing Sound Source Localization via False Negative Elimination. To overcome false negatives in conventional methods of sound source localization, researchers have put forth a novel audio-visual learning framework. Two schemes are included in the framework: Semantic-Aware Contrastive Learning (SACL) and Self-Supervised Predictive Learning (SSPL). While SACL improves the contrastive learning process to better align auditory and visual elements, SSPL removes false negatives by emphasizing positive-only learning.
FastSD CPU. Flux Schnell on the CPU is now supported by a widely used inference library.
Spiking Diffusion Models. A new class of Spiking Neural Networks (SNNs) called Spiking Diffusion Models (SDMs) is intended for image production and offers significant energy savings along with great biological plausibility.
Laion 5B safety Release. The biggest publicly available image dataset on the internet was Laion 5B. Because of worries about offensive and hazardous imagery, it was taken down. After a major effort to address these problems, the group is now rereleasing the dataset.
ml_dtypes. Bfloat16 and fp8 support for native numpy arrays.
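Minimal usage example (the values are arbitrary; note that low-precision arrays are typically upcast for arithmetic):

```python
import numpy as np
import ml_dtypes

x = np.array([0.1, 0.25, 3.0], dtype=ml_dtypes.bfloat16)
y = np.array([1.0, 2.0, 4.0], dtype=ml_dtypes.float8_e4m3fn)
print(x.dtype, y.dtype)                             # bfloat16 float8_e4m3fn
print(x.astype(np.float32) + y.astype(np.float32))  # upcast to compute
```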
VisionTS. By redefining time series forecasting as an image reconstruction challenge, VisionTS is a novel method that takes advantage of the similarities between time series data and natural images to improve forecasting. To achieve remarkable zero-shot performance, it makes use of a visual masked autoencoder (MAE) that has been pre-trained on ImageNet.
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model. A novel method for improving LLMs' audio-generating performance is called X-Codec.
The timm (PyTorch Image Models) Leaderboard. A leaderboard of results for the many vision models available in timm.
CogVideoX-5B. CogVideo 5B model will launch next week in Hugging Face Diffusers.
Anthropic Quickstarts. Anthropic has made available a helpful selection of initial projects. It collaborated with former chief AI officers from Brex, Uber, Facebook, and other companies to draft the first Quickstart, a Claude-powered scalable customer support assistant.
The Missing Guide to the H100 GPU Market. This guide covers all the important factors of buying a GPU, such as availability considerations, pricing for various alternatives, and guaranteeing reliability in addition to highlighting the significance of other hardware features. It answers the most important queries consumers have about GPUs, including pricing, performance, and shipping.
Efficient Camera Exposure Control for Visual Odometry via Deep Reinforcement Learning. A deep reinforcement learning framework is being developed in this research to enhance the stability of visual odometry (VO) systems in difficult-to-light settings.
Multi-scale Cross-restoration Framework for Electrocardiogram Anomaly Detection. a sophisticated ECG diagnosis system that enhances the identification of uncommon but serious cardiac anomalies by self-supervised anomaly detection pretraining.
RWKV.cpp. The RWKV family of models now has local inference support through its C++ project.
MAPF-GPT. A novel learning-based method called MAPF-GPT has been developed to tackle the difficult multi-agent pathfinding (MAPF) problem. The model navigates agents by imitation learning; it does not require extra heuristics, reward functions, or communication.
EnsLoss. An ensemble approach called EnsLoss integrates loss functions into the Empirical Risk Minimization (ERM) paradigm.
Disentangled Motion Modeling for Video Frame Interpolation. MoMo is a novel diffusion-based approach for video frame interpolation (VFI). It enhances visual quality by focusing on intermediate motion modeling through a disentangled two-stage training process.
repo2vec. Repo2vec is a new package that functions similarly to GitHub Copilot but with up-to-date repo information, making it simple to communicate with any public or private codebase.
Building LLMs from the Ground Up: A 3-hour Coding Workshop. A great resource on building LLMs from scratch.
SGLang v0.3 Release. The most recent release brings enhancements to SGLang inference, including Multi-Image/Video LLaVA-OneVision, 1.5x Faster torch.compile, and 7x Faster DeepSeek MLA.
OLMoE: Open Mixture-of-Experts Language Models. Best in-class performance for 1B activated parameters in an excellent open MoE.
StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models. This work presents StyleTokenizer, an approach that aligns style representation with text prompts to improve style control in text-to-image generation.
Applied Machine Learning (Cornell CS5785, Fall 2024). Open resources for the Fall 2024 Applied ML class at Cornell.
Laminar - Open-Source observability, analytics, evals and prompt chains for complex LLM apps. Laminar hosts background job queues of LLM pipelines. Outputs of those pipelines are turned into metrics.
LongLLaVA. A multimodal model called LongLLaVA was created to handle long-context tasks like comprehending high-resolution images and videos.

Perspectives

I learned the language of computer programming in my 50s – here’s what I discovered. A writer with no technical background recounts his incredible journey into the realm of coding and the invaluable lesson it taught him about the modern world
Why A.I. Isn’t Going to Make Art. To create a novel or a painting, an artist makes choices that are fundamentally alien to artificial intelligence.
Autonomous car bombs, online recruitment: Experts worry how AI can transform terrorism. Law enforcement has to anticipate novel AI uses and develop countermeasures
Researchers built an ‘AI Scientist’ — what can it do? The large language model does everything from reading the literature to writing and reviewing its own papers, but it has a limited range of applicability so far.
The Next Generation Pixar: How AI will Merge Film & Games. With its ability to combine dynamic gaming engagement with narrative depth, generative AI has the potential to completely transform storytelling. This change is being accelerated by recent developments in generative models, such as Luma AI's Dream Machine and OpenAI's Sora, which allow for the creation of interactive videos in real-time. This development, which combines AI, gaming, and film, could result in the next "Pixar" in interactive media.
China's robot makers chase Tesla to deliver humanoid workers. At the World Robot Conference in Beijing, more than 25 Chinese businesses featured humanoid robots designed for factory automation. These companies were supported by significant government funding and took advantage of China's extensive supply network. By 2035, the market for humanoid robots is expected to reach $38 billion globally. By 2025, China hopes to have these robots in large quantities, stepping up the battle with Tesla's planned Optimus robot. Tesla expects to roll out 1,000 Optimus robots in its factories over the course of the next year, while Chinese companies are predicting substantial cost savings on their models.
Why AI can’t spell ‘strawberry’. Because of their tokenization schemes, large language models sometimes perform poorly on tasks like letter counting, which exposes shortcomings of the LLM architecture that affect how well models comprehend text. Nevertheless, progress continues; examples include Google DeepMind's AlphaGeometry 2 for formal math and OpenAI's Strawberry for enhanced reasoning.
Diffusion is spectral autoregression. Autoregressive models and diffusion models are commonly treated as fundamentally distinct methodologies, yet diffusion models effectively take autoregressive steps in the frequency domain, so the two may be more alike than we previously realized (a small numpy experiment illustrating this follows this list).
Can AI Scaling Continue Through 2030? AI training is expanding at a rate that has never been seen before—four times faster than previous technology advances in genome sequencing and mobile use. According to research, the main limitations in scaling AI training could last until 2030 and are related to power availability and chip production capacity. If hundreds of billions are committed, training runs up to 2e29 FLOP would become feasible, representing significant advancement comparable to the transition from GPT-2 to GPT-4. Advanced network topologies and multimodal and synthetic data production methodologies might help overcome difficulties like data shortages and latency.
GPU Utilization is a Misleading Metric. Although frequently tracked, GPU utilization may not fully capture GPU performance in machine learning workloads, since it does not account for whether the GPU's computational power is actually being used. Trainy discovered this when LLM training showed 100% GPU utilization but only ~20% model FLOPS utilization (MFU). It suggests fused-kernel optimization and the appropriate model-parallelism level, which yielded a 4x training speedup, and recommends tracking SM efficiency as a better performance indicator (a back-of-the-envelope MFU calculation follows this list).
AI-Implanted False Memories. In simulated criminal-witness interviews, generative chatbots driven by large language models significantly increased the formation of false memories, inducing roughly three times more immediate false recollections than a control group, according to a study by the MIT Media Lab.
The biology of smell is a mystery — AI is helping to solve it. Scientists are beginning to crack the fiendishly complex code that helps us to sense odours.
How much is AI hurting the planet? Big tech won't tell us. Big tech companies like Google are not disclosing the full environmental impact of AI, even as emissions from their operations rise significantly; Google's greenhouse gas emissions grew by 48% between 2019 and 2023.
AI Has Created a Battle Over Web Crawling. A study by the Data Provenance Initiative warns that as websites increasingly restrict crawler bots, high-quality data may become inaccessible to generative AI models. This trend, motivated by worries about data exploitation, may push AI training toward low-quality data rather than well-maintained sources. Businesses may turn to direct licensing or synthetic data to preserve model quality in the face of growing data scarcity.
What Succeeding at AI Safety Will Involve. Sam from Anthropic hazards a guess at what it will take for AI safety to succeed while building superhuman AI systems.
the art of programming and why i won't use llm. Although LLMs are praised for increasing productivity and are being incorporated into coding workflows more and more, some contend that their programming effectiveness is overstated.
‘He was in mystic delirium’: was this hermit mathematician a forgotten genius whose ideas could transform AI – or a lonely madman? In isolation, Alexander Grothendieck seemed to have lost touch with reality, but some say his metaphysical theories could contain wonders
AI Checkers Forcing Kids To Write Like A Robot To Avoid Being Called A Robot. Can the fear of students using generative AI and the rise of questionable AI “checker” tools create a culture devoid of creativity?
The AI Arms Race Isn’t Inevitable. Prominent AI labs are pushing Western governments to support swift AI developments in order to prevent rivals like China from gaining a decisive technological advantage. They are increasingly portraying AI research as a geopolitical zero-sum game crucial for national security. This story supports drastic steps to ensure AI domination, even at the expense of escalating geopolitical tensions and possibly jeopardizing safety and ethical standards.
Is AI eating all the energy? AI's total energy footprint is shaped by both rising demand and rising energy efficiency. Power, heat, carbon, and water use are all positively correlated with AI's energy consumption. Hardware efficiency improvements counter the general trend of AI processing becoming more power-hungry. Although broad adoption dilutes its impact, AI still accounts for a small but growing portion of data center power consumption, with training using far more energy than inference.
Debate over “open source AI” term brings new push to formalize definition. In an effort to clarify the meaning and address the term's overuse, the Open Source Initiative (OSI) published a proposed definition of "open source AI" that includes usage rights, study, modification, and sharing freedoms. With this step, researchers and engineers will be able to assess AI systems in a more transparent manner. In October, a stable version of the definition is anticipated, which may have an impact on upcoming releases of AI models and regulations.
Predicting AI. This author considers their forecasts for AI and notes that they were correct to predict the growth of open source, multimodal models, and improved tool usability.
Bill Gates has a good feeling about AI. The Verge spoke with Bill Gates about AI, misinformation, and climate change.
Enterprise AI Infrastructure: Privacy, Maturity, Resources. An interesting interview with BentoML's CEO discusses how to enhance business tooling, make sure you can expand, and avoid over-engineering it from the start.
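
The small experiment referenced in the "Diffusion is spectral autoregression" entry above can be reproduced in a few lines of numpy. The intuition: Gaussian noise has a flat power spectrum while natural images concentrate power at low frequencies, so each diffusion noise level drowns the highest surviving frequencies first, and the reverse process therefore proceeds roughly coarse to fine. The synthetic 1/f image below is a stand-in for a real photo; this is an illustration, not the post's code.

```python
# Watch the radially averaged power spectrum flatten as diffusion-style
# noise increases: high frequencies hit the noise floor first.
import numpy as np

rng = np.random.default_rng(0)

# Synthesize a toy 256x256 image with a roughly 1/f amplitude spectrum,
# mimicking the spectral decay of natural images.
freqs = np.fft.fftfreq(256)
fx, fy = np.meshgrid(freqs, freqs)
amplitude = 1.0 / np.maximum(np.hypot(fx, fy), 1e-3)
phase = rng.uniform(0, 2 * np.pi, (256, 256))
image = np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))
image /= image.std()

def radial_power_spectrum(img):
    """Radially averaged power spectrum of a square grayscale image."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h / 2, x - w / 2).astype(int)
    return np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())

for sigma in (0.0, 0.1, 0.5, 2.0):          # increasing noise levels
    noisy = image + sigma * rng.normal(size=image.shape)
    spectrum = radial_power_spectrum(noisy)
    # As sigma grows, the low-frequency bin barely moves while the
    # high-frequency bin rises to the flat noise floor.
    print(f"sigma={sigma}: low-freq={spectrum[5]:.1f}, high-freq={spectrum[100]:.1f}")
```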
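
And the back-of-the-envelope MFU calculation referenced in the GPU-utilization entry: the common rule of thumb (popularized by the PaLM paper's MFU definition) is that one training step costs about 6 FLOPs per parameter per token. All concrete numbers below are illustrative assumptions.

```python
# Rough MFU estimate: achieved training FLOPs vs. hardware peak. The
# 6 * n_params FLOPs-per-token approximation ignores attention's quadratic
# term, so treat the result as an estimate, not ground truth.
def mfu(n_params: float, tokens_per_second: float, peak_flops: float) -> float:
    achieved_flops = 6 * n_params * tokens_per_second   # fwd + bwd pass
    return achieved_flops / peak_flops

# Example (made-up numbers): a 7B-parameter model training at 3,000 tokens/s
# per GPU on hardware with a 312 TFLOP/s BF16 peak.
print(f"MFU ~= {mfu(7e9, 3_000, 312e12):.0%}")  # ~40%
```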

meme-of-the-week

Back to index

ML news: Week 26 August - 1 September

Research

Link description
Automated Design of Agentic Systems. claims that it is possible to learn any possible agentic system, including prompts, tool use, control flows, and more; accomplishes this by focusing on three components: the search space (which defines agents), the search algorithm (which explores the search space), and the evaluation function (which assesses candidate agents); presents Meta Agent Search, a meta agent that iteratively programs and tests new agents based on a growing archive of previous discoveries.
LLM Pruning and Distillation in Practice: The Minitron Approach. applies pruning and distillation to Llama 3.1 8B and Mistral NeMo 12B to produce 4B and 8B parameter models, respectively; before pruning, the teacher model is also fine-tuned on their datasets, leading to better distillation; the compression strategy yields a state-of-the-art 8B model (MN-Minitron-8B) that outperforms all similarly sized models on common language modeling benchmarks; overall, a thorough report on effective methods for compressing Llama 3.1 and Mistral NeMo models.
The Vizier Gaussian Process Bandit Algorithm. introduces Vizier, an open-source Python implementation of the Gaussian process bandit optimization technique, which is utilized by Google for millions of optimizations and research. It includes benchmarking data that show the algorithm's wider applicability.
Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information. proposes a two-stage prompting technique to remove irrelevant information from context; it serves as a self-mitigation process that first identifies the irrelevant information and then filters it out; this leads to enhancement in robustness of the model and overall better performance on reasoning tasks.
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding. demonstrates how speculative decoding can improve throughput, lower latency, and preserve accuracy in long context generation scenarios; it discovers that bottlenecks change from compute-bound to memory-bound as sequence length and batch size increase; with these realizations, they demonstrate that speculative decoding can be used more successfully for longer sequences, even when using large batch sizes.
PEDAL: Enhancing Greedy Decoding with Large Language Models using Diverse Exemplars. employs a hybrid self-ensembling approach (based on diverse exemplars) to enhance LLM performance overall. Specifically, it generates multiple candidate responses using diverse exemplars and aggregates them using an LLM to produce a final response; this approach achieves lower cost compared to self-consistency approaches and better accuracy compared to greedy decoding.
Autonomous Driving with Spiking Neural Networks. The first unified Spiking Neural Network (SNN) designed to tackle the energy issues associated with autonomous driving is called Spiking Autonomous Driving (SAD).
Pre-training Small Base LMs with Fewer Tokens. By inheriting a few transformer blocks and training on a very small percentage (0.1%) of the initial data, Inheritune is a simplified technique for creating smaller base language models from larger ones. With just one A6000 GPU and this method, a 1.5B parameter model could be created in less than 30 minutes, with performance comparable to larger models trained on much greater amounts of data.
Teaching chat models to solve chess puzzles. At 1800 elo on average, traditional base language models are rather competent chess players. Nevertheless, chat models frequently see a sharp decline in performance. This article explains how to use prompting and fine-tuning to teach conversation models, such as GPT-4o, to play chess.
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations. The text-to-video (T2V) model xGen-VideoSyn-1 from Salesforce creates lifelike scenes based on written descriptions. The model makes use of a diffusion transformer (DiT) for enhanced temporal consistency and generalization and a video variational autoencoder (VidVAE) for video data compression, which lowers processing requirements.
Memory-Efficient LLM Training with Online Subspace Descent. Online Subspace Descent is a novel optimizer that increases memory efficiency to improve LLM training.
Generative Verifiers: Reward Modeling as Next-Token Prediction. Reward models are typically trained as discriminative classifiers. In this DeepMind work, the reward signal is instead the yes/no token probability of a language model, which lets the verifier incorporate CoT reasoning and ensembling; doing so increased performance by sixteen percent (a minimal sketch follows this list).
Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress. By using the discrepancy between routing synthetic data creation and oracle model performance, Cohere's Aya model was able to significantly increase its win rate in comparison to baseline models.
Text2SQL is Not Enough: Unifying AI and Databases with TAG. A novel paradigm called Table-Augmented Generation answers complex natural language queries by fusing databases and language models.
The Mamba in the Llama: Distilling and Accelerating Hybrid Models. Because Mamba models do not keep a KV cache to backtrack through, they are difficult to accelerate with speculative decoding. This paper, from some of the original authors, presents several new distillation techniques and acceleration algorithms.
Efficient LLM Scheduling by Learning to Rank. Head-of-line blocking occurs when serving many concurrent requests to a large language model because output lengths are unknown in advance. Learning to rank requests by relative output length makes it possible to serve the shortest ones first, improving throughput for multi-batch generation by 6.5x (a toy scheduler follows this list).
MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders. A new model architecture called MTMamba++ aims to improve multi-task scene understanding. This method captures long-range dependencies and enhances cross-task interactions using a Mamba-based decoder with two core blocks: STM and CTM.
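
A minimal sketch of the generative-verifier idea referenced above: instead of a classification head, score a solution by the probability the LM assigns to a "Yes" token. The prompt template and the use of GPT-2 as a stand-in are assumptions for illustration; the paper trains dedicated verifiers.

```python
# Sketch of "reward as next-token prediction": the verifier's reward is the
# probability of answering "Yes" to a correctness question. GPT-2 is only a
# placeholder; a real setup would fine-tune a verifier model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def verifier_reward(question: str, answer: str) -> float:
    prompt = f"Question: {question}\nAnswer: {answer}\nIs the answer correct?"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]                    # next-token logits
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    yes_prob = torch.softmax(logits[[yes_id, no_id]], dim=-1)[0]
    return yes_prob.item()                                # soft reward in [0, 1]

print(verifier_reward("What is 2 + 2?", "4"))
```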
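
And a toy version of the learning-to-rank scheduling idea: given any predictor of relative output length (here a hypothetical `predict_len` callable standing in for the paper's learned ranker), serving the predicted-shortest requests first avoids head-of-line blocking. A conceptual sketch, not the paper's system.

```python
# Shortest-predicted-job-first scheduling sketch; only the relative
# ordering produced by predict_len matters.
import heapq
from typing import Callable, Iterable, Iterator

def schedule(requests: Iterable[str],
             predict_len: Callable[[str], float]) -> Iterator[str]:
    heap = [(predict_len(r), i, r) for i, r in enumerate(requests)]
    heapq.heapify(heap)                 # min-heap keyed on predicted length
    while heap:
        _, _, request = heapq.heappop(heap)
        yield request                   # short jobs finish first, cutting mean latency

# Usage with a trivial proxy ranker (prompt length as the length signal):
for req in schedule(["short", "a much, much longer prompt", "medium one"], len):
    print(req)
```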

News

Link description
Scientists to use AI to analyze 1.6m brain scans to develop tool predicting dementia risk. Researchers will use artificial intelligence to match image data of patients from Scotland with linked health records
Microsoft releases powerful new Phi-3.5 models, beating Google, OpenAI, and more. Microsoft unveiled the Phi-3.5-mini-instruct, Phi-3.5-MoE-instruct, and Phi-3.5-vision-instruct, three new models in its Phi series that each achieve remarkable benchmark achievements while tackling distinct AI tasks. Developers can access these models on Hugging Face and they are offered as open source under the MIT License. The Phi models have outperformed rivals like GPT-4o and Llama in certain benchmarks, demonstrating near-state-of-the-art performance despite their smaller size than some of their contemporaries.
Data Exfiltration from Slack AI via indirect prompt injection. It was found that a vulnerability in Slack AI allows attackers to use indirect prompt injection to steal data from private channels they do not have access to. Through public channel messages, attackers can coerce the LLM into disclosing sensitive data, such as API keys, in response to queries. The problem persists, along with a phishing attack vector, even after Slack AI's August 14th update, which added channel and DM files and greatly increased the attack surface for exploits of this kind.
Bringing Llama 3 to life. Llama 3.1, an enhanced open-source LLM from Meta, adds new features like model distillation and the ability to generate synthetic data.
Anthropic reveals system prompts for Claude. Anthropic now publishes the system prompts for its models, along with dated release notes.
D-ID launches an AI video translation tool that includes voice cloning and lip sync. AI video creation platform D-ID is the latest company to ship a tool for translating videos into other languages using AI technologies. However, in this case, D-ID also clones the speaker’s voice and changes their lip movements to match the translated words as part of the AI editing process.
Vyond Pushes AI Video's Enterprise Era. Vyond is an AI platform for creating videos with an emphasis on enterprise use cases.
Mark Zuckerberg says White House ‘pressured’ Facebook to censor Covid-19 content. Meta boss regrets bowing to government power and says he would not make the same choices today
What the Telegram founder’s arrest means for the regulation of social media firms. Pavel Durov’s detention by French authorities is a major break from the norm – but his low-moderation, non-encrypted app is an anomaly
Tesla Is Erasing Its Own History. CEO Elon Musk’s original Tesla Motors Master Plan no longer exists on Tesla’s website.
After a decade of free Alexa, Amazon now wants you to pay. AI is a chance for companies to charge for products we’re in the habit of using for free.
AI for creating comics? Europe’s industry completely rejects it, Tintin executive says. Tools such as Midjourney and Dall-E have triggered a fightback in comic land as publishers gear up for litigation ahead of new EU rules
Police officers are starting to use AI chatbots to write crime reports. Will they hold up in court? AI technology is being integrated into police work to automate the writing of reports from body camera footage.
Questions about the safety of Tesla’s ‘Full Self-Driving’ system are growing. Tesla has been accused of deceptive marketing over its self-driving technology, as a prominent analyst questions the safety and readiness of the system, potentially leading to increased scrutiny of automated driving claims.
Japan: AI-powered drones to monitor disaster zones and identify criminals. Drones move faster than police cars or guards, reaching incident site quickly and allowing for prompt action and response.
Artifacts are now generally available. Artifacts are now widely accessible, including on mobile devices, thanks to Anthropic.
Introducing Cerebras Inference. Cerebras's chip has a large amount of unified on-chip memory, letting it avoid bandwidth bottlenecks and serve models at thousands of tokens per second.
OpenAI Aims to Release New AI Model, ‘Strawberry,’ in Fall. "Strawberry" is a new AI product that OpenAI intends to launch in the fall. It will be able to carry out complex jobs like creating marketing plans and will have advanced thinking abilities, such as the capacity to answer math problems that have never been seen before.
This 1mm 'fan on a chip' could put active cooling inside ultra-thin gadgets. The XMC-2400 µCooling chip, a 1mm-tall solid-state fan intended to cool down thin electronics such as smartphones, has been introduced by xMEMS.
Nvidia rides big tech’s AI investment to beat Wall Street’s sky-high expectations. Chipmaker, third most valuable company in world, records $30.04bn in revenue, showing AI demand continues to rise
AI makes racist decisions based on dialect. Large language models strongly associated negative stereotypes with African American English
Lawmakers call for crackdown on AI deepfakes after Grok backlash. A group of Democratic lawmakers are pushing the Federal Election Commission (FEC) to increase regulation on artificial intelligence (AI) deepfakes following the release of the social platform X’s chatbot Grok.
Midjourney says it’s ‘getting into hardware’. Midjourney, the AI image-generating platform that’s reportedly raking in more than $200 million in revenue without any VC investment, is getting into hardware.
Google rolling out Gems and Imagen 3, with people generation, to Gemini Advanced. Gems are “custom versions of Gemini” that you can create to “act as an expert on topics or refine them toward your specific goals.” They can “remember a detailed set of instructions to help you save time on tedious, repetitive or difficult tasks.”
OpenAI in Talks for Funding Round Valuing It Above $100 Billion. With Microsoft anticipated to take part, OpenAI is in talks to raise several billion dollars in a fresh investment round headed by Thrive Capital, which would value the business over $100 billion.
How to harness AI’s potential in research — responsibly and ethically. Artificial intelligence is propelling advances in all areas of science. But vigilance is needed, warn four researchers at the leading edge.
The On‑Device Intelligence Update. Cartesia has released several updates to its models and systems, including an open hybrid state-space model.
Stephen Wolfram thinks we need philosophers working on big questions around AI. Stephen Wolfram, a renowned mathematician and computer scientist, has grown to appreciate the importance of philosophy in understanding and guiding the development of AI. He argues that as AI raises profound existential and moral questions, integrating philosophical thinking into AI research is crucial for addressing these complex issues, signaling a potential "golden age" of philosophy in the context of technology.
The top AI deals in Europe this year. Despite general headwinds for startups, AI ventures continue to secure substantial funding. U.S. AI startups have achieved nearly 30 deals over $100M in 2024, with Europe not far behind. Major investments include WAYVE ($1B), Mistral AI (~$1B), Helsing ($484M), Poolside ($400M), DeepL ($320M), H ($220M), and Flo Health ($200M).
California advances landmark legislation to regulate large AI models. Groundbreaking bill aims to reduce potential AI risks – requiring model testing and disclosure of safety protocol
Nvidia shares fall on slowing growth and production concerns. Doubling of quarterly revenues to £23bn fails to allay worry about delays to next generation of AI chips
X’s AI tool Grok lacks effective guardrails preventing election disinformation, a new study finds. The Center for Countering Digital Hate (CCDH) found that Grok was able to churn out ‘convincing’ AI fake images including one of Vice President Kamala Harris doing drugs and another of former president Donald Trump looking sick in bed
100M Token Context Windows. Yes, you read that right: 100 million tokens of context for agentic programming and reasoning. Magic Dev also disclosed a collaboration to build two new supercomputers on Google Cloud, following a recent $320 million fundraise to accelerate product development.
OpenAI and Anthropic will share their models with the US government. The companies will grant the AI Safety Institute access to major new models for safety testing.
California legislature passes controversial “kill switch” AI safety bill. After passing the State Assembly, California's contentious AI safety bill, SB-1047, is now one step closer to being signed into law by Governor Gavin Newsom. By September 30, Newsom must determine whether or not to sign it into law.
OpenAI says ChatGPT usage has doubled since last year. OpenAI reported that 92% of Fortune 500 firms use ChatGPT and that the platform has over 200 million weekly active users, double its user base from a year ago.
TikTok owner ByteDance launches new video search tool, eyeing Baidu’s dominance. In a direct challenge to Baidu's search dominance, ByteDance has released Douyin Search, an app for searching short video content on TikTok's Chinese counterpart.

Resources

Link description
Language Modeling on Tabular Data: A Survey of Foundations, Techniques, and Evolution. includes topics like classification of tabular data structures and data types, datasets used for model training and evaluation, modeling techniques and training objectives, data processing methods, popular architectures, challenges, and future research directions. It also provides a thorough survey of language modeling techniques for tabular data.
Graph Retrieval-Augmented Generation: A Survey. focuses on methods used in the GraphRAG workflow (graph-guided retrieval, graph-based indexing, and graph-enhanced creation); explores GraphRAG's tasks, applications, assessment, and industrial use cases.
Controllable Text Generation for Large Language Models: A Survey. gives a thorough overview of controllable text generating techniques in LLMs; covers topics like helpfulness, safety, consistency, and style.
Challenges and Responses in the Practice of Large Language Models. selects several significant questions and provides thoughtful answers; the questions are divided into groups according to themes including data, applications, infrastructure, software architecture, and brain science.
Self-Supervised Learning of Time Series Representation via Diffusion Process and Imputation-Interpolation-Forecasting Mask. The first diffusion-based method for learning time series representations is called Time Series Diffusion Embedding, or TSDE. Time series data is divided into segments by TSDE, which then creates informative embeddings by using dual-orthogonal Transformer encoders with a crossover mechanism.
Liger Kernel: Efficient Triton Kernels for LLM Training. Surprisingly, LinkedIn released the Liger Kernel, an efficient set of Triton kernels for training language models. For the widely used Llama models, it cuts memory utilization by about 60% and boosts throughput by 20%. It integrates with several common modeling frameworks and requires only a three-line code change, which matters for practitioners (a usage sketch follows this list).
pgvectorscale. pgvectorscale builds on pgvector with better performance for embedding search and cheaper storage for AI applications. Compared to other popular, competitive vector stores, it is about 28 times faster.
GenderCARE. GenderCARE is a comprehensive framework designed to identify and mitigate gender bias. It introduces novel criteria for assessing gender bias, with a focus on diversity, inclusivity, and impartiality.
Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes. A novel technique for more effectively fine-tuning the Segment Anything Model (SAM) with variable-size images is called Generalized SAM (GSAM).
google/siglip-so400m-patch14-224. A new SigLIP model from Google leverages a vision transformer model architecture that is tuned for shape.
GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting. Using surround views, GaussianOcc is an effective and entirely self-supervised approach for 3D occupancy estimate.
Infinite Dataset Hub. This space, powered by phi-3-mini, generates synthetic data on any topic via prompting. It is intriguing and potent, even if not the most accurate.
Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models. By conditioning on individual object representations, neural networks are able to represent and manage 3D objects in 2D contexts. This work could be the key to untangling 3D objects.
T3M: Text Guided 3D Human Motion Synthesis from Speech. T3M is a brand-new technique that researchers have developed for producing 3D animations that are controlled by text inputs. T3M is a useful technology for virtual reality, gaming, and film creation because it enables more precise and customized animations than earlier methods that solely used voice.
BiRefNet. State-of-the-art background removal via bilateral reference segmentation.
RB-Modulation. Google has developed a really innovative method for customizing diffusion models that works better than several widely used techniques. It may be used with PyTorch and, with some adjustments, Flux as well.
FlexEdit: Marrying Free-Shape Masks to VLLM for Flexible Image Editing. With FlexEdit, you may precisely modify images based on language commands by combining free-shape masks with Vision Large Language Models (VLLMs).
Quick Fine-tuning of Phi 3.5. Quick fine-tuning script with Unsloth of the new Microsoft models.
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning. A paper detailing DeepSeek's hardware-software co-design approach for deep learning has been published.
Announcing Higgs Llama V2. Higgs-Llama-3-70B-v2, a new model from Boson AI, performs exceptionally well on conversation and comprehension benchmarks such as Arena-Hard and AlpacaEval 2.0. Compared to Claude 3.5 Sonnet, the model increases day 1 retention by 5.3% and decreases response regeneration rates by 21.6%. Improved using an internal reward model called Higgs Judger, its performance is tied to that of Google's Gemini 1.5 Pro.
The Zyphra Training Cookbook. Pre-training normal Transformers is not the same as pre-training hybrid (Mamba type) models. To get the desired performance, this post examines scaling various hyperparameters, data gathering, and other factors.
LlamaDuo. This is a system that optimizes small models to act as a backup if closed API models become unavailable. It demonstrates a smooth transition from a large to a small model.
LitServe. A flexible and user-friendly serving engine for AI models based on FastAPI is called LitServe. The need to rebuild a FastAPI server for each model is eliminated by features like batching, streaming, and GPU autoscaling.
IntelLabs/LlavaOLMoBitnet1B. Llava BitNet is the first ternary (-1, 0, 1) weight model trained on VLM tasks. The model, weights, and scripts are in the process of being fully open-sourced. The technical report will be released soon and suggests the model has promising performance.
Qwen2-Audio. Qwen has released audio input style models that can reason about music, audio, and sound.
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches. This team developed an incredible model that generates fully playable 3D game scenarios from a single input sketch by sequentially using many models.
OctFusion: Octree-based Diffusion Models for 3D Shape Generation. OctFusion is an efficient and high-quality method for using diffusion models to generate 3D objects. In about 2.5 seconds, it can generate 3D shapes at any resolution using a single Nvidia 4090 GPU.
MambaInLlama. By reusing weights from attention layers, researchers have shown that massive Transformer models can be reduced to more deployable linear RNNs.
Cross-Modal Temporal Alignment for Event-guided Video Deblurring. By incorporating an event camera—which records motion with microsecond temporal resolution—researchers have created a novel method for video deblurring that improves the quality of motion-blurred footage.
JoyCaption Pre-Alpha. An open-source VLM built specifically for upcaptioning images, i.e., generating rich, detailed captions.
Introducing RPBench-Auto. An automated evaluation pipeline called RPBench-Auto, which draws inspiration from ArenaHard and Alpaca Eval, has been introduced by Boson AI to measure the role-playing talents of LLMs.
Lightweight Champ: NVIDIA Releases Small Language Model With State-of-the-Art Accuracy. Mistral-NeMo-Minitron 8B is a miniaturized version of the recently released Mistral NeMo 12B model, delivering high accuracy combined with the compute efficiency to run the model across GPU-accelerated data centers, clouds, and workstations.
NousResearch/hermes-function-calling-v1. An excellent publicly available dataset from Nous Research for training function-calling models.
Qwen2-VL: To See the World More Clearly. Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model families
RAW-Adapter: Adapting Pre-trained Visual Model to Camera RAW Images. A novel method called RAW-Adapter modifies pre-trained sRGB models so they can efficiently handle RAW data from cameras.
Llama usage doubled May through July. Meta has published usage statistics for the Llama models and found strong demand for them in business environments.
SAM & SAM 2 in 3D Slicer: SegmentWithSAM Extension for Annotating Medical Images. In order to expedite the annotation of 3D medical pictures, this study modified the Segment Anything Model 2 (SAM 2), which was initially created for video annotation.

Perspectives

Link description
AI analysed 1,500 policies to cut emissions. These ones worked. Only 63 climate change interventions led to significant reductions in carbon emissions.
AI cheating is overwhelming the education system – but teachers shouldn’t despair. With adjustments to the way we teach students to think about writing, we can shift the emphasis from product to process
What’s Really Going On in Machine Learning? Some Minimal Models. Stephen Wolfram, inventor of the Wolfram Language, uses minimal models to probe what is really going on inside machine learning.
AI companies are pivoting from creating gods to building products. Good. AI firms are struggling to find product-market fit for LLMs, resulting in large investments but little profit. The five primary obstacles impeding the commercialization of AI products are price, reliability, privacy, safety and security concerns, and user-interface limitations. Resolving these sociotechnical obstacles is essential for AI to be widely integrated into consumer products.
My friend, Claude. Due to increased job obligations, this author relies on Anthropic's LLM Claude for technical writing, highlighting the expanding value of LLMs in professional settings. Claude's help has been cost-effective even though it required expert verification, and it highlights how quickly the landscape for specialty experts confronting AI-driven automation is changing. The author considers how knowledge work may change when AI technologies like Claude are more frequently used for everyday tasks.
AI firms must play fair when they use academic data in training. Researchers are among those who feel uneasy about the unrestrained use of their intellectual property in training commercial large language models. Firms and regulators need to agree on the rules of engagement.
Stakes high for European Union after arrest of Telegram co-founder. The charges against Pavel Durov increases pressure on Brussels to enforce new European law on the platform
MIT neuroscientists discover neurons with distinct language processing timescales. In language-processing areas of the brain, some cell populations respond to one word, while others respond to strings of words.
How to Tell If What You're Reading Was Written By AI. From the moment ChatGPT introduced the world to generative AI in late 2022, it was apparent that, going forward, you can no longer trust that something you're reading was written by a human.
California AI bill sparks debate in Silicon Valley as some tech giants call it a threat to innovation. A first-of-its-kind AI bill is winding its way through California, causing infighting between groups of AI pioneers.
Exodus at OpenAI: Nearly half of AGI safety staffers have left, says former researcher. Nearly half the OpenAI staff that once focused on the long-term risks of superpowerful AI have left the company in the past several months, according to Daniel Kokotajlo, a former OpenAI governance researcher.
Technology may be advancing - but it’s making us more stupid. ‘Deskilling’ in the face of cognitive automation is a problem that is too easily ignored
Inference is FREE and INSTANT. Large language models (LLMs) may not be much better at reasoning, but they will be more helpful for repeated jobs due to their rising speeds and falling prices. These models may not have genuine understanding, yet they are nonetheless capable of handling simple tasks effectively.
UK’s new science minister on budget battles, Brexit and AI leadership. Former clinical scientist Patrick Vallance speaks to Nature about his priorities as the minister overseeing the nation’s research.
Urgently clarify how AI can be used in medicine under new EU law. The European Union’s Artificial Intelligence Act entered into force on 1 August. Phased implementation begins in February 2025, banning artificial intelligence (AI) systems deemed to pose unacceptable risks. Before that happens, policymakers must do more to ensure that patients’ safety and interests are protected.

meme-of-the-week

Back to index

ML news: Week 19 - 25 August

Research

Link description
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. a novel artificial intelligence (AI) agent that, for less than $15, can develop and write a full conference-level scientific paper; it automates scientific discovery by empowering frontier LLMs to conduct independent research and summarize findings; it also uses an automated reviewer to assess the papers it generates; it claims to achieve near-human performance in assessing paper scores; and it claims to generate papers that, according to their automated reviewer, surpass the acceptance threshold at a premier machine learning conference.
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs. suggests AgentWrite as a way to allow off-the-shelf LLMs to produce coherent outputs longer than 20K words. AgentWrite divides the long generation task into smaller tasks and uses a divide-and-conquer strategy to produce the outputs; the agent then splits the task into smaller writing subtasks and concatenates the outputs to produce a final output (i.e., plan + write). This method is then used to create SFT datasets, which are used to tune LLMs to produce coherent longer outputs automatically; a 9B parameter model, further enhanced through DPO, achieves state-of-the-art performance on their benchmark and outperforms proprietary models.
EfficientRAG: Efficient Retriever for Multi-Hop Question Answering. trains an auto-encoder LM to label and tag retrieved chunks, annotating them for continued processing; a filter model then formulates the next-hop query based on the original question and the previous annotations; this repeats iteratively until all chunks are tagged or the maximum number of iterations is reached; once the process has gathered enough information to answer the initial question, a final generator (an LLM) produces the answer.
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation. a detailed assessment methodology for RAG retrieval and generating module diagnosis; demonstrates that RAGChecker exhibits superior correlations with human judgment; presents multiple illuminating patterns and trade-offs in RAG architecture design decisions.
HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction. integrates VectorRAG and GraphRAG into a HybridRAG system that performs better than either alone; it was tested on a set of transcripts from financial earnings calls. Combining the benefits of both methods allows questions to be answered with greater accuracy.
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers. introduces self-play mutual reasoning to enhance small language models' reasoning powers without the need for better models or fine-tuning; MCTS is augmented with human-like reasoning actions derived from SLMs to create richer reasoning trajectories; a second SLM offers unsupervised feedback on the trajectories, and the target SLM selects the final reasoning trajectory as the solution; rStar raises LLaMA2-7B's GSM8K accuracy from 12.51% to 63.91% while steadily improving the accuracy of other SLMs.
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. explores how inference-time computation in LLMs scales; specifically, it examines how much an LLM can be improved given a fixed amount of inference-time compute; it finds that the efficacy of different scaling strategies varies by prompt difficulty; it then proposes an adaptive compute-optimal strategy that can improve efficiency by more than 4x compared to a best-of-N baseline (sketched after this list); it reports that optimally scaling test-time compute can outperform a 14x larger model in a FLOPs-matched evaluation.
Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation. a graph-based framework for the medical domain that improves LLMs and produces evidence-based results; makes use of chunk documents and a hybrid static-semantic approach to enhance context capture; uses graphs to represent entities and medical knowledge, creating an interconnected global graph; This method outperforms cutting-edge models and increases precision across several medical Q&A metrics.
BAM dense to MoE Upcycling. By using this technique, the FFN and Attention layers of dense models can be recycled into a Mixture of Experts (MoE) model for additional training. This preserves downstream performance while saving a significant amount of computing expense.
BAPLe: Backdoor Attacks on Medical Foundational Models using Prompt Learning. Backdoor attacks can be incorporated into medical foundation models using the BAPLe technique during the prompt learning stage.
ShortCircuit: AlphaZero-Driven Circuit Design. AI-powered automation and optimization of chip design can lower costs while satisfying the need for more powerful chips. Using an Alpha Zero-based approach, this method was tested on numerous circuits and produced small and effective designs with an 84.6% success rate.
Automated Design of Agentic Systems. This study examines the fragility of current agent systems and explores future directions for designing learning systems. The authors use programming languages as a testbed in which agents can be created and executed without supervision.
Loss of plasticity in deep continual learning. The pervasive problem of artificial neural networks losing plasticity in continual-learning settings is demonstrated and a simple solution called the continual backpropagation algorithm is described to prevent this issue.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model. An incredible new model from Meta that performs diffusion and next-token prediction over interleaved text and images. On text and image benchmarks it performs comparably to earlier-generation models such as DALL-E 2 and Llama 2.
To Code, or Not To Code? Exploring Impact of Code in Pre-training. Industry labs keep this to themselves, but pretraining models on code aids their generalization to other reasoning-intensive tasks. This Cohere study investigates the question in detail and shows that code can serve as a building block for reasoning in a wide variety of contexts.
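
For reference, the best-of-N baseline that the test-time-compute paper above scales against looks roughly like this; `generate` and `score` are hypothetical callables (a sampler and a verifier or reward model), not any paper's actual API.

```python
# Best-of-N: spend extra inference compute by sampling N candidate answers
# and keeping the one the verifier scores highest.
from typing import Callable

def best_of_n(generate: Callable[[str], str],
              score: Callable[[str], float],
              prompt: str,
              n: int = 8) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)   # highest-scoring candidate wins
```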

News

Link description
AI-generated parody song about immigrants storms into German Top 50. Artist Butterbro accused of walking fine line between parody and discrimination and helping make racial slur mainstream
Tesla faces lowest duty on Chinese-made cars exported to EU. The 9% tariff is much less than others face after investigation into Beijing’s ‘unfair’ subsidies of EVs
Google’s upgraded AI image generator is now available. Google says Imagen 3 is its highest-quality image generator so far — and now more users in the US can try it.
Runway’s Gen-3 Alpha Turbo is here and can make AI videos faster than you can type. The new Gen-3 Alpha Turbo from Runway ML is currently available with a variety of subscription plans, including free trials, and offers 7x quicker AI video creation at half the cost of its predecessor. The time lag is greatly decreased by this speed increase, which promotes more productive workflows, especially in industries where time is of the essence. Runway is negotiating the ethical waters of AI training data practices while pushing for more advancements, such as improved control systems.
Eric Schmidt Walks Back Claim Google Is Behind on AI Because of Remote Work. Eric Schmidt, ex-CEO and executive chairman at Google, walked back remarks in which he said his former company was losing the artificial intelligence race because of its remote-work policies.
Gemini Advanced updated with latest 1.5 Pro model for improved reasoning. Google has enhanced Gemini 1.5 Pro in Gemini Advanced, delivering improved responses for prompts requiring advanced reasoning and coding.
Waymo is developing a roomier robotaxi with less-expensive tech. Waymo has revealed its Generation 6 self-driving technology, built into Geely Zeekr EVs, which requires fewer cameras and sensors. With the help of machine intelligence and semiconductor advances, the Alphabet division intends to deploy the technology quickly and have it withstand a variety of weather conditions. This update lets Waymo continue scaling its Waymo One service, which currently provides 50,000 trips per week.
Gemini Live could use some more rehearsals. Google's AI-powered voice interaction technology, Gemini Live, attempts to replicate genuine speech but has trouble with errors and hallucinations. It isn't as customizable or expressive as rivals like OpenAI's Advanced Voice Mode, even though it uses professional actors for more expressive voices. Overall, the bot's usefulness and purpose are unclear due to its limited capability and dependability concerns, especially considering that it is a component of Google's expensive AI Premium Plan.
Hamming Launches 100x faster testing of voice agents. Hamming lets you test hundreds of scenarios for your voice AI systems and create Character-AI-style personalities.
Fine-tuning now available for GPT-4o. With the announcement of fine-tuning for GPT-4o, OpenAI enables developers to tailor the model using their datasets for certain use cases. Through September 23, it will be giving away one million free training tokens per day.
OpenAI strikes search deal with Condé Nast. With the signing of a multi-year licensing deal, OpenAI and Condé Nast can integrate content from the publisher's brands, like Vogue and The New Yorker, into their ChatGPT and SearchGPT platforms.
Meta’s Self-Taught Evaluator enables LLMs to create their own training data. Meta FAIR researchers have introduced the Self-Taught Evaluator, a method to train evaluative LLMs without human annotations, potentially enhancing the efficiency and scalability of LLM assessment. Using the LLM-as-a-Judge concept, it iteratively generates and refines responses to create a training dataset, demonstrating improved performance on benchmarks like RewardBench. This technique could enable enterprises to leverage unlabeled data for LLM tuning while acknowledging the importance of a well-aligned seed model and the limitations of benchmarks.
Video: $16,000 humanoid robot ready to leap into mass production. China's Unitree Robotics is a relatively recent entry in the general-purpose humanoid robot space, but its $16,000 G1 model is already proving itself to be quite the performer. So much so that the company has now revealed a version that's ready for mass production.
US mayoral candidate who pledged to govern by customized AI bot loses race. Victor Miller proposed customized ChatGPT bot to govern Cheyenne, Wyoming – but fared badly at the ballot box
Authors sue Anthropic for copyright infringement over AI training. Andrea Bartz, Charles Graeber and Kirk Wallace Johnson allege company misused work to teach chatbot Claude
Ideogram 2.0. A new model from Ideogram has better text rendering and image-generating capabilities.
Introducing Zed AI. With the help of a hosted service called Zed AI, developers may employ LLMs and yet have complete control over their code by integrating AI-powered coding into the Zed text editor. Zed and Anthropic have teamed up to enable quick editing with Claude.
Nvidia’s AI NPCs will debut in a multiplayer mech battle game next year. Nvidia ACE, the company’s AI-powered system for giving voices and conversation skills to in-game characters, is set to debut in Mecha Break, a new multiplayer mech battle game coming to PC, Xbox X / S, and PlayStation 5 in 2025.
These 'living computers' are made from human neurons — and you can rent one for $500 a month. Bringing human-brain organoids into computing, FinalSpark's "Neuroplatform" offers a rentable biocomputing platform aimed at lowering AI's energy consumption. Challenges include standardizing production and extending organoid lifespans beyond 100 days. Alternatives such as fungal networks and cellular computing are also being investigated for tasks beyond the reach of silicon-based computers.
AI made of jelly ‘learns’ to play Pong — and improves with practice. Inspired by neurons in a dish playing the classic video game, researchers show that synthetic hydrogels have a basic ‘memory’.
Cursor raises $60m. Cursor raised a Series A to continue building its AI-powered coding IDE.
Perplexity AI plans to start running ads in the fourth quarter as AI-assisted search gains popularity. The AI-assisted search startup Perplexity AI, which just raised $1 billion in funding, intends to launch adverts on its search app in Q4.
Pixel 9 phones: The Gemini AI stuff, reviewed. One of the main features of the Pixel 9 phones is Google's Gemini AI, which provides customers with several AI-powered features like task assistance, picture editing, and screenshot management. Its effectiveness as a full-fledged assistant is uneven, though, with sporadic hiccups and several Google Assistant functions that aren't completely incorporated. Notwithstanding these problems, Pixel users can benefit from intriguing features like document summarizing and creative photo "reimagining" tools.
AMD explains its AI PC strategy. With its Ryzen AI 300 CPUs, AMD is pushing the AI PC industry forward by incorporating NPUs to improve AI-powered applications such as Microsoft's Recall.
Gemini in Gmail can now help polish up your drafts. ‘Help me write’ can now polish your emails, in addition to being able to formalize them or shorten them.
Royal Society facing calls to expel Elon Musk amid concerns about conduct. Some fellows fear tech billionaire could bring the institution into disrepute with incendiary comments
Apple Intelligence is coming. Here’s what it means for your iPhone. Apple is about to launch a ChatGPT-powered version of Siri as part of a suite of AI features in iOS 18. Will this change the way you use your phone – and how does it affect your privacy?

Resources

Link description
A Survey of NL2SQL with Large Language Models: Where are we, and where are we going? a thorough rundown of NL2SQL approaches driven by LLMs, including models, data gathering, assessment strategies, and error analysis
DeepSeek-Prover-V1.5. Process supervision was used to train DeepSeek's extremely potent math model, which performs noticeably better than larger models on several MATH benchmarks.
DifuzCam: Replacing Camera Lens with a Mask and a Diffusion Model. This is a fun project that reconstructs very low-quality images from a cheap camera using a diffusion model.
Knowledge Fusion of Large Language Models. FuseChat combines several models, letting each contribute its unique capabilities. This code base contains the model weights for several strong 7B models that achieve good results on MT-Bench.
SigmaRL. The goal of the decentralized, open-source SigmaRL framework is to enhance the generalization and sample efficiency of multi-agent Reinforcement Learning (RL) in the context of motion planning for automated and networked vehicles.
Comparative Evaluation of 3D Reconstruction Methods for Object Pose Estimation. To evaluate how the quality of 3D reconstructions affects object position estimate accuracy in industrial applications, this work presents a thorough benchmark.
MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing. The process of producing many views from a single image is known as multi-view image synthesis.
BLIP-3. For a while, BLIP was the most used multimodal model. The most recent iteration employs a pure autoregressive loss and is noticeably simpler. It attains cutting-edge results on certain captioning benchmarks.
SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation. A new image segmentation framework called SAM2-UNet uses the potent Segment Anything Model 2 (SAM2) as its encoder.
A Survey on Benchmarks of Multimodal Large Language Models. A thorough analysis of 180 benchmarks for Multimodal Large Language Model evaluation is presented in this work.
SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering. You can create an editable and animatable mesh output from a video or image series using mesh reconstruction from Gaussian splatting. It just takes a few steps on a single GPU to accomplish this, and it does so very rapidly and efficiently.
Llama-3.1 Storm Models. These are the first tuned models that significantly outperform Meta's Llama-3.1 base models.
EasyRec: Simple yet Effective Language Model for Recommendation. EasyRec is a language paradigm created especially for jobs involving recommendations. To produce high-quality semantic embeddings, it makes use of cooperative data from several datasets and creative contrastive learning objectives.
Classifying all of the pdfs on the internet. A wonderful post about classifying every PDF available on the internet according to its semantic content using clever prompting and embeddings.
How to get from high school math to cutting-edge ML/AI: a detailed 4-stage roadmap with links to the best learning resources that I’m aware of. Software experts can use the following four-step learning plan to comprehend advanced ML/AI papers: Basic math (calculus, algebra, linear algebra, probability, statistics), deep learning (multi-layer neural networks), classical machine learning (basic regression, classification models), and cutting-edge machine learning (transformers, LLMs, diffusion models) are the first four areas of study in machine learning. For stages 1-2, author-created content is essential, while for stages 3–4, suggested outside items are necessary. Once each level is mastered, students are better prepared to take on challenging ML papers and keep up with the rapidly advancing field of AI research.
llamafile v0.8.13. Llamafile now supports Whisper models and ships a number of speed and quality-of-life enhancements.
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model. A quick, affordable, and cutting-edge approach for creating 3D meshes that can be trained on text or images. In particular, it employs a cascade of steps, such as a normal map generator, that transfers distinct duties to different submodels and signed distance function supervision.
NeuFlow_v2. Optical flow code that is incredibly quick and effective and suitable for low-power devices like phones and certain security camera systems.
X-ray Report Generation. To produce X-ray medical reports more efficiently and with less computer complexity, a new framework was created.
TraDiffusion:Trajectory-Based Training-Free Image Generation. A novel technique called TraDiffusion uses mouse trajectories rather than box or mask controls to guide text-to-image generation.
Loss Rider. A fun utility that illustrates when loss functions converge and get too spiky by animating a curve rider sled as it descends them.
SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama. The large SkyScript-100M dataset aims to improve the generation of high-quality shooting scripts for short dramas.
NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices. This work presents a novel approach to optical flow estimation that delivers excellent accuracy at a large computational cost savings.
Torch-Pruning. An actively maintained repository of cutting-edge language model pruning techniques with numerous supported algorithms.
Image, Tell me your story!. A novel strategy for identifying visual misrepresentation has been presented by researchers, which emphasizes the importance of the original meta-context of images—a factor that automated approaches frequently ignore.
Pathology-LLaVA. Pathology image analysis is the target application for PA-LLaVA, a domain-specific language-vision assistant.
Microsoft's Phi-3 family. A detailed analysis of the MoE and vision model from Microsoft's recently released Phi 3.5 models.
The Top 100 Gen AI Consumer Apps - 3rd Edition. Based on customer interaction patterns, Andreessen Horowitz's most recent consumer AI research ranks the top 100 generative AI apps and divides them into the top 50 AI online products and the top 50 AI mobile apps. The research offers in-depth analyses of trends, new competitors in the sector, and developing categories.
Eight basic rules for causal inference. This comprehensive blog article explains the relationship between causal mechanisms and observable correlations, using R code simulations, causal graphs, and logic to illustrate eight basic rules of causal inference (a Python rendition of the confounding rule follows this list).
Jamba-1.5. AI21 has released new versions of its hybrid Transformer and State space model architecture.
biorecap: an R package for summarizing bioRxiv preprints with a local LLM. The recently released biorecap R package uses locally run big language models to fetch and summarize recent publications, assisting academics in managing the massive amount of bioRxiv preprints.
aurora. Microsoft's high-quality atmospheric prediction model, code, and checkpoints are available as open source.
NuSegDG. Researchers have created a novel framework named NuSegDG to improve the generalizability of nuclei segmentation across diverse medical images.
Pano2Room: Novel View Synthesis from a Single Indoor Panorama. Pano2Room is a novel technique that overcomes limitations in single-view 3D scene synthesis by reconstructing high-quality 3D indoor scenes from a single panoramic image.
Awesome Object-Centric Robotic Manipulation. This repository offers a thorough introduction to embodied learning, a promising robotic manipulation methodology that prioritizes perceptual feedback and physical interaction.

Perspectives

Link description
‘Threads is just deathly dull’: have Twitter quitters found what they are looking for on other networks? There’s been an exodus of users from X, propelled by Elon Musk’s lurch to the far right, but the alternatives have drawbacks too
Five ways the brain can age: 50,000 scans reveal possible patterns of damage. Results raise hopes that methods could be developed to detect the earliest stages of neurodegenerative disease.
An AI Empire. As AI develops, it may displace humanity as the most intelligent presence on Earth. AGI may not be far off, as it could allow AI research itself to be replicated at unprecedented scale. The exponential rise in computing suggests that humans may soon become significantly less relevant as AI takes over. Despite possible roadblocks in AI development, society might not be prepared for such a significant transformation.
What does Bitcoin smell like? AI startup wants to ‘teleport’ digital scents. Osmo, an AI-focused startup, is creating technology that allows computers to recognize and replicate smells, which could help with disease detection and digital scent communication. Unlike audiovisual AI, scent lacks a defined "smell map," so the team faces the harder task of first building a database linking molecular structure to scent. Osmo's applications, which integrate olfactory sensations, have the potential to transform digital marketing and medical diagnostics.
Eric Schmidt’s AI prophecy: The next two years will shock you. Former Google CEO Eric Schmidt believes artificial intelligence will evolve quickly in the coming years, potentially making it possible to build significant applications, even a TikTok rival, in a matter of minutes. He points to the unpredictable and rapid pace of AI progress, noting the potential for massive technological and economic disruption from the convergence of agent-based systems, text-to-action capabilities, and large language models. Schmidt's perspective signals a revolutionary era ahead, reflecting the significant investments and energy requirements expected for cutting-edge AI development.
Why Neuralink’s Blindsight and Brain Implants to restore sight won’t work like human eyesight. This piece highlights the difficulties of using AI-powered cortical implants to restore vision: neurons in the visual cortex do not behave like pixels on a screen. Although high-resolution simulations are promising, genuine vision would require reproducing intricate neural patterns far beyond the capabilities of present technology, so cortical implants will produce pixelated, subpar images at best.
A Personalized Brain Pacemaker for Parkinson’s. Researchers have created an adaptive method of deep brain stimulation that greatly shortens the duration of symptoms by adjusting electrical pulses to the various symptoms experienced by Parkinson's sufferers.
Why Diffusion could help LLMs reason. Present-day language models anticipate words one at a time, leaving very little opportunity for reasoning and planning. This can be avoided by using techniques like Chain of Thought prompting. To enhance model reasoning, diffusion models—which have the capacity to spend more diffusion steps per token—might be used.
AI companies are pivoting from creating gods to building products. Good. AI businesses have overstated the readiness of generative AI for broad commercial applications, resulting in expensive errors in product development and market integration. To change direction, they must overcome major obstacles: making the systems affordable, boosting security and safety, protecting privacy, and optimizing user interfaces. These challenges highlight the gap between AI's potential and the practical difficulty of deploying systems that satisfy user expectations and fit into existing workflows. Rather than the quick timeframe some have projected, the route to broad adoption will probably take a decade or longer.
Has your paper been used to train an AI model? Almost certainly. Artificial intelligence developers are buying access to valuable data sets that contain research papers — raising uncomfortable questions about copyright.
The testing of AI in medicine is a mess. Here’s how it should be done. Hundreds of medical algorithms have been approved on the basis of limited clinical data. Scientists are debating who should test these tools and how best to do it.
Light bulbs have energy ratings — so why can’t AI chatbots? The rising energy and environmental cost of the artificial intelligence boom is fuelling concern. Green policy mechanisms that already exist offer a path towards a solution.
How the human brain creates cognitive maps of related concepts. Neural activity in human brains rapidly restructures to reflect hidden relationships needed to adapt to a changing environment. Surprisingly, trial-and-error learning and verbal instruction induce similar changes.
Switching between tasks can cause AI to lose the ability to learn. Artificial neural networks become incapable of mastering new skills when they learn them one after the other. Researchers have only scratched the surface of why this phenomenon occurs — and how it can be fixed.
Markov chains are funnier than LLMs. This article explores LLM predictability and its limitations when it comes to producing humor. It makes the case that although LLMs are excellent at producing text that is appropriate for the context, their predictive nature renders them unsuitable for humorous writing, which depends on unexpectedness.
AI at Work Is Here. Now Comes the Hard Part. In the last six months, the use of generative AI has almost doubled globally, with 75% of knowledge workers currently using it.
AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work. This is a lengthy and comprehensive overview of the research that DeepMind is doing on AGI safety and alignment.
The newest weapon against mosquitoes: computer vision. Developments in computer vision are helping combat malaria by enabling applications such as VectorCam, which facilitates fast identification of mosquito species and data gathering. The Gates Foundation helped develop the app, which can identify species that transmit malaria and aid in improving disease control tactics. Innovative mosquito surveillance techniques are essential for the tactical use of pesticides and other mitigating actions.
Fields that I reference when thinking about AI takeover prevention. This article compares fields battling insider threats with AI control, offering ideas on developing and assessing strong AI safety measures. It emphasizes how much more control developers have over AIs than they do over people, but it also points out that, in contrast to humans, AI dishonesty can be endemic. AI control is different mainly because it is adversarial and doesn't involve complicated system interactions, even though it is influenced by different domains such as physical security and safety engineering.
‘Never summon a power you can’t control’: Yuval Noah Harari on how AI could threaten democracy and divide the world. Forget Hollywood depictions of gun-toting robots running wild in the streets – the reality of artificial intelligence is far more dangerous, warns the historian and author in an exclusive extract from his new book

meme-of-the-week

Back to index

ML news: Week 12 - 18 August

Research

Link description
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters. An expansion of Ring Attention, which spans many GPUs to support extremely long contexts. The researchers derive an energy function for attention that guides how the computation is sharded across devices.
Bias-Aware Low-Rank Adaptation: Mitigating Catastrophic Inheritance of Large Language Models. Bias propagation from pre-training data is addressed via a novel method for optimizing LLMs called bias-aware low-rank adaptation (BA-LoRA).
MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models. Researchers investigate how images can improve LLM-based temporal event forecasting. Their proposed framework, MM-Forecast, identifies two important roles that images play: highlighting and complementing textual information.
SAM 2: Segment Anything in Images and Videos. An open, unified approach for promptable, real-time object segmentation in images and videos that can be applied to visual content it has not seen before, without the need for task-specific adaptation. To facilitate precise mask prediction in videos, a memory mechanism retains information about the object and past interactions; the memory module also permits real-time processing of videos of any length. SAM 2 considerably surpasses prior methods in interactive video segmentation across 17 zero-shot video datasets while requiring three times fewer human-in-the-loop interactions. A rough usage sketch follows below.
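For the image case, the released code exposes a promptable predictor; the config and checkpoint names below follow the initial release and may differ in later versions:

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Build the model from a config plus checkpoint (file names vary by release).
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt"))

image = np.array(Image.open("photo.jpg").convert("RGB"))  # any RGB image
with torch.inference_mode():
    predictor.set_image(image)
    # Prompt with a single positive click; the predictor returns candidate
    # masks along with confidence scores.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 300]]),
        point_labels=np.array([1]),
    )
```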
Structured Generation Limits Reasoning. It examines whether structured generation can affect an LLM's capacity for reasoning and comprehensive domain knowledge; finds that when format constraints are applied, an LLM's reasoning skills significantly deteriorate in comparison to free-form responses; this degradation effect is exacerbated when stricter format constraints are applied to reasoning tasks.
RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation. presents RAGFoundry, an open-source framework for enhanced LLMs for RAG use cases; it facilitates the generation of data-augmented datasets to fine-tune and assess LLMs in RAG situations. The system enables data creation, training, inference, and assessment.
Synthesizing Text-to-SQL Data from Weak and Strong LLMs. suggests using integrated synthetic data to create the highly specialized SoTA text-to-SQL model known as SENSE; the use of strong models' synthetic data improves data variety, while the incorporation of important erroneous data from weaker models with an executor allows for the learning of execution feedback; By using preference learning to instruction-tune LLMs to learn from both correct and incorrect samples, SENSE closes the performance gap between open-source models and approaches utilizing closed-source models, achieving state-of-the-art scores on the SPIDER and BIRD benchmarks.
Conversational Prompt Engineering. describes a two-step process that allows users to create personalized few-shot prompts by interacting with the model and sharing the output. The model shapes the initial instruction based on user-provided unlabeled data, and the user provides feedback on the outputs and instructions. This iterative process produces a personalized few-shot prompt that performs better and more optimally on the desired task.
Self-Taught Evaluators. an approach to enhance model-based evaluators with only synthetic training data; it claims to outperform LLM-judges like GPT-4 and match top-performing reward models trained on labeled examples; it first generates contrasting outputs (good and bad model responses) and trains an LLM-as-a-Judge to produce reasoning traces and final judgments; the self-improvement scheme iteratively repeats the training process using its improved predictions.
UGrid: An Efficient-And-Rigorous Neural Multigrid Solver for Linear PDEs. The UGrid solver is a recently created neural solver that combines the advantages of MultiGrid and U-Net methods for solving linear partial differential equations (PDEs).
Causal Agent based on Large Language Model. The Causal Agent is an agent framework that can manage causal issues since it has memory, reasoning, and tool modules.
ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation. Biases in CLIP can make it less effective in tasks like unsupervised semantic segmentation when images are not annotated. In this research, a technique to explicitly model and correct these biases is proposed.
Sakana Launches AI Scientist. A system that can independently conduct research by formulating hypotheses, carrying out experiments, developing code, and compiling the findings into well-reasoned publications has been unveiled by the Japanese artificial intelligence company Sakana. Together with an open-sourced version of the system, the company has supplied samples of the papers the system wrote.
Small but Mighty: Introducing answerai-colbert-small. ColBERT is a highly effective retrieval model. Despite having just 33 million parameters, this new model performs remarkably well on several measures. This article explains how to train a comparable model and what tips and techniques produced good results.
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation. "Lazy visual grounding" is a two-step approach to open-vocabulary semantic segmentation that finds object masks independently of text and subsequently identifies the objects with textual information.
Introducing Agent Q: Research Breakthrough for the Next Generation of AI Agents with Planning & Self Healing Capabilities. An agent trained by MultiOn to perform web tasks via self-play. During training, its success rate on a range of web-based tasks, such as placing restaurant orders, rose from 18% to 81%. It uses DPO and MCTS to improve. A paper describing the work is available on the website, with contributions from Stanford researchers. It appears to build on Salesforce Research's xLAM function-calling work.
Anchored Preference Optimization. Aligning models with human preferences typically requires post-training, but during training it is often unclear why one example is preferred over another. APO anchors the preference difference by pairing an existing example with a deliberately degraded version of it.
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers. Research on tree search for inference time computation for language models is very active. This Microsoft article presents a very strong argument for how small models can significantly outperform large models on mathematical tasks.
MetaSeg: MetaFormer-based Global Contexts-aware Network for Efficient Semantic Segmentation. Based on the MetaFormer design, MetaSeg is a potent semantic segmentation network that improves the network's decoder and backbone.
Long Context RAG Performance of LLMs. This article investigates the performance of long context models on several RAG tasks. Increasing the amount of examples can be beneficial. These models frequently break down in odd but expected ways.

News

Link description
Uber highlights autonomous vehicle efforts now that Tesla’s in its rearview mirror. Uber reported strong second-quarter results, with gross bookings and net profit both up decently. But the company has chosen to highlight the success of its autonomous vehicle effort, likely to assuage investors concerned about incoming competition from Tesla, which aims to reveal its first robotaxi in October.
Mistral: build, tweak, repeat. With the introduction of LLM customizations by La Plateforme, such as Mistral Large 2 and Codestral, developers can now fine-tune models with specialized domain knowledge. The 'Agents' alpha release offers sophisticated, multi-layered processes that are integrated with the capabilities of Mistral Large 2. For Python and Typescript, the Mistralai SDK has reached a stable 1.0 release, which enhances consistency and usefulness.
Zico Kolter Joins OpenAI’s Board of Directors. Expert in AI robustness and safety, Zico Kolter is a professor at Carnegie Mellon University. He just joined the Safety and Security Committee of OpenAI and the Board of Directors. His in-depth studies on model robustness, alignment, and safety in AI will strengthen OpenAI's endeavors to guarantee that AI serves humanity.
Apple changes EU App Store rules after commission charges. Change in policy means developers will be able to communicate with customers outside App Store
World’s 1st AI-powered hearing aids boost speech understanding by 53 times. With AI and dual-chip technology, Sonova has unveiled the Phonak Audéo Sphere, a hearing aid that promises a 53x improvement in speech understanding in noisy conditions. The technology, which took years to develop, uses the DEEPSONIC chip with enhanced DNN capabilities to address the main issue facing users of hearing aids: clarity in noisy environments. Sonova hopes that this technological advancement will greatly enhance the lives of those who are hard of hearing.
Apple Intelligence may come to EU after all…but only for Mac. As per the most recent beta release notes, Mac users in the EU will get access to Apple's AI features in the next macOS Sequoia, unlike on iOS and iPadOS 18. Macs are not covered by the EU exclusion, which stems from problems with Digital Markets Act compliance. If Mac users have their system set to U.S. English, they should be able to access Apple Intelligence.
Waymo is expanding its robotaxi service areas in San Francisco and Los Angeles. The company is looking to add more customers to its burgeoning driverless car business.
Intel reportedly gave up a chance to buy a stake in OpenAI in 2017. According to reports, Intel decided against investing in OpenAI, now a major player in the AI space, in 2017–2018, partly because then-CEO Bob Swan doubted that generative AI would come to market soon enough to pay off.
YouTube is testing a feature that lets creators use Google Gemini to brainstorm video ideas. YouTube is testing integration with Google Gemini to help creators brainstorm video ideas, titles and thumbnails.
Forget Midjourney — Flux is the new king of AI image generation and here’s how to get access. Black Forest Labs' Flux AI is the newest and most promising open-source AI image generation technology available. It can run on consumer laptops. In some areas, such as rendering people and prompt adherence, it outperforms rivals like Midjourney. The model comes in three versions: Pro, Dev, and Schnell. An open-source text-to-video model is planned.
Paid Apple Intelligence features are likely at least 3 years away. Some analysts this week started reporting that Apple could charge as much as $20/month for paid Apple Intelligence features. While that may be true, we likely won’t see Apple charging for these features for at least 3 years.
Elon Musk to pause X’s AI training on some EU data, Ireland says. Des Hogan, the Irish Commissioner for Data Protection, has filed a lawsuit against an undisclosed business, contesting how it handles the personal data of EU citizens and perhaps affecting its AI chatbot's GDPR-compliant data processing procedures.
Intel is bringing GPUs to cars. The Arc A760A is a discrete GPU for automobiles from Intel that aims to improve in-car entertainment through AI-powered capabilities like gesture and speech recognition.
US considers breaking up Google after illegal monopoly ruling, reports say. DoJ could force divestment of Android operation system and Chrome web browser following antitrust verdict
Google launches Pixel 9 phones with advanced AI. New Pixel phones, foldable, watch and earbuds feature Gemini Live for free-flowing conversations with AI bot
Grok-2 Beta Release. Grok 2, the latest model from xAI, is a frontier-class model with mathematical, coding, and reasoning abilities. xAI is working with Black Forest Labs to make FLUX available to X users.
Prompt Caching With Claude. Anthropic's Claude models now have prompt caching, which enables developers to cache context that is regularly utilized. This reduces costs and latency considerably, and early adopters like Notion are now enjoying faster and more effective AI-powered features.
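Per the launch docs, caching is opted into per content block; a sketch follows (the beta namespace and model name reflect the API at launch and may have changed since):

```python
import anthropic

client = anthropic.Anthropic()
long_reference_document = open("manual.txt").read()  # large, frequently reused context

# Mark the reused context block as cacheable; later requests that share the
# exact same prefix read it back at a reduced token price and lower latency.
response = client.beta.prompt_caching.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_reference_document,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 3."}],
)
print(response.usage)  # includes cache creation/read token counts
```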
OpenAI updates ChatGPT to new GPT-4o model based on user feedback. Unannounced, OpenAI upgraded the GPT-4o model for ChatGPT, adding features based on user feedback but leaving the reasoning style unchanged. Users conjectured about improved multi-step reasoning and image-generating capabilities, but OpenAI made it clear that the model's reasoning remains unchanged. To improve developer experiences, the business also mentioned that the most recent version of ChatGPT could not be the same as the API version.
14 new things you can do with Pixel thanks to AI. The Pixel Watch 3 uses sophisticated motion sensing and machine learning for better running form analysis, and it makes use of machine learning for automated sleep detection and mode modifications. It presents a Loss of Pulse Detection AI program that, if required, will automatically notify emergency services. Additionally, Pixel's AI-powered call screening and holding features are carried over to the watch.
MIT releases comprehensive database of AI risks. The AI Risk Repository, a comprehensive database of over 700 verified AI dangers, was developed by MIT and other institutions to assist enterprises and researchers in assessing and mitigating evolving AI risks through the use of a two-dimensional classification system and frequently updated data.
Universal Music and Meta Announce ‘Expanded Global Agreement’ for AI, Monetization and More. With an emphasis on equitable pay and resolving difficulties with unlicensed AI content, Meta and Universal Music Group have extended their multi-year licensing deal. This move aims to increase revenue and develop creative opportunities for UMG's artists on platforms such as Facebook, Instagram, and now WhatsApp.
As Alexa turns 10, Amazon looks to generative AI. Despite having a high household penetration rate, Amazon's Alexa subsidiary lost $10 billion in 2022 and had to lay off employees, underscoring the unviability of its loss leader approach. With the growing apathy towards smart assistants such as Siri and Google Assistant, Amazon is relying on generative AI to boost user engagement and enhance Alexa's functionality. The company's main goals are to get around the "smart timer" restriction and improve conversational interactions.
Replika CEO Eugenia Kuyda says it’s okay if we end up marrying AI chatbots. CEO of Replika Eugenia Kuyda recently talked about her vision for AI partners in human interactions, emphasizing the app's potential to provide romance, companionship, or therapy via avatars. Replika hopes to create a new class of connections by evolving LLMs to enhance human interaction rather than replace it. Even in the face of controversy—like brief bans on sexual content—the app's goal of enhancing users' mental health never changes. Replika, which employs 50–60 people and has millions of users, is preparing a big relaunch to improve dialogue realism and interaction.
Gemini 1.5 Flash price drop with tuning rollout complete, and more. With a 78% reduction in input and a 71% reduction in output token costs, Gemini 1.5 Flash has experienced a pricing reduction. Additionally, its API is now supported in more than 100 languages.
Prediction marketplace Polymarket partners with Perplexity to show news summaries. To incorporate event-related news summaries and data visualizations into its prediction marketplace, Polymarket has teamed up with AI search engine Perplexity.
Nous Hermes 3. Nous Research has released its flagship model. Trained on top of Llama 3, the model has strong performance and a distinctive personality, like many of the company's previous models.
California AI bill SB 1047 aims to prevent AI disasters, but Silicon Valley warns it will cause one. Silicon Valley is opposed to California's SB 1047, which aims to stop "critical harms" from massive AI models. Stakeholders are split on the bill's possible effects on innovation. Prominent businesses and industry leaders discuss the bill's benefits and implications for AI safety and advancement. The measure is headed for a final Senate vote. It mandates AI model safety protocols and third-party audits. It also outlines enforcement procedures and heavy fines for non-compliance.
SoftBank's Intel AI processor plans in doubt as insiders say it is now considering a TSMC partnership. Intel failed to produce AI processors for SoftBank's Project Izanagi, leading SoftBank to explore a partnership with TSMC. Despite setbacks, SoftBank remains committed to challenging major AI players with its own hardware and data center ecosystem, potentially backed by significant investment from global partners. The move could strain SoftBank's relationship with Arm clients as it risks direct competition.
Another Apple smart ring patent granted, includes controlling smart glasses. A smart ring that can monitor health and control other Apple devices is described in a recently awarded patent by Apple, which also refers to potential integration with AR/VR headsets and smart glasses.
Iranian group used ChatGPT to try to influence US election, OpenAI says. AI company bans accounts and says operation did not appear to have meaningful audience engagement
Russia’s AI tactics for US election interference are failing, Meta says. New Meta security report finds that AI-powered deception campaigns ‘provide only incremental’ results for bad actors

Resources

Link description
Introducing sqlite-vec v0.1.0: a vector search SQLite extension that runs everywhere. A vector database built on the powerful SQLite framework. It offers a clean vector API and can handle millions of queries. A minimal usage sketch follows below.
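Basic usage, following the v0.1.0 README (assuming the sqlite-vec Python package): create a vec0 virtual table, insert float32 blobs, and run a KNN query with MATCH.

```python
import sqlite3
import struct
import sqlite_vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

# A virtual table holding 4-dimensional float32 vectors.
db.execute("CREATE VIRTUAL TABLE docs USING vec0(embedding float[4])")

def f32(v):  # pack a Python list as little-endian float32 bytes
    return struct.pack(f"{len(v)}f", *v)

db.execute("INSERT INTO docs(rowid, embedding) VALUES (1, ?)", (f32([0.1, 0.2, 0.3, 0.4]),))
db.execute("INSERT INTO docs(rowid, embedding) VALUES (2, ?)", (f32([0.9, 0.8, 0.7, 0.6]),))

# KNN query: MATCH plus ORDER BY distance LIMIT k.
rows = db.execute(
    "SELECT rowid, distance FROM docs WHERE embedding MATCH ? ORDER BY distance LIMIT 2",
    (f32([0.1, 0.2, 0.3, 0.35]),),
).fetchall()
print(rows)
```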
PufferLib. To standardize the interface, PufferLib is a wrapper and accelerator for libraries related to reinforcement learning. It has many helpful baselines and is incredibly quick.
transtokenizers. Trans-tokenization is a cross-lingual technique that uses language data from high-resource languages to improve language models for low- and mid-resource languages.
Survey of Mamba. offers a thorough analysis of the Mamba-based models that are already in use across activities and domains; in particular, it concentrates on Mamba's improvements, methods for adjusting to a variety of inputs, applications where Mamba works well, and potential future research areas.
From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges, and Future. a survey paper covering key subjects like requirement engineering, code generation, test generation, and autonomous decision making; it also includes benchmarks, metrics, and models used in various software engineering applications. The paper focuses on current practices and solutions for LLM-based agents for software engineering.
Transformer Explainer: Interactive Learning of Text-Generative Models. An open-source interactive application that runs a local GPT-2 instance in your browser and lets you experiment with your own inputs to learn about the inner workings of a Transformer model.
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework. provides a straightforward framework for automatically creating evaluation datasets to measure how well different LLMs are used in various contexts. It starts with seed documents to define a schema, then creates a variety of documents that result in question-answering pairs (QA pairs) that are based on both configurations and articles.
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. On the Gemma 2 model suite, DeepMind released several sparse autoencoders a few weeks ago. Researchers now talk about the training paradigm and some intriguing findings in this companion study.
LiDAR-Event Stereo Fusion with Hallucinations. Researchers suggest combining a stereo event camera with a fixed-frequency LiDAR sensor as a way to enhance event stereo matching.
LLM-Aided OCR Project. The LLM-Aided OCR Project is an advanced system designed to significantly enhance the quality of Optical Character Recognition (OCR) output. By leveraging cutting-edge natural language processing techniques and large language models (LLMs), this project transforms raw OCR text into highly accurate, well-formatted, and readable documents.
A Foundation Model for ECG Analysis. A transformer-based foundation model called ECG-FM was created to lessen the requirement for a large amount of labeled data, thereby enhancing ECG analysis.
ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation. ProxyCLIP is a novel framework that combines the advantages of Vision Foundation Models and CLIP models to enhance open-vocabulary semantic segmentation.
How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model. Nvidia's Llama 3.1 minitron 4B variant is now available. Through knowledge distillation and pruning, the model achieved a 16% improvement in MMLU scores compared to training from scratch, while requiring 40 times fewer tokens.
A practitioner's guide to testing and running large GPU clusters for training generative AI models. An excellent manual for managing large compute clusters used to train generative AI models.
LongWriter: Unleashing 10,000+ Word Generation From Long Context LLMs. With the use of AgentWrite, which divides large jobs into manageable chunks, models can now generate coherent outputs longer than 20,000 words.
OpenResearcher. A new AI-powered platform called OpenResearcher seeks to provide answers to a variety of research-related queries.
Introducing SWE-bench Verified. OpenAI has introduced a subset of SWE-bench that is easier and more in line with what humans and AI can solve today. It is a good benchmark for validating and working towards before running the entire original benchmark.
AI Toolkit. An excellent assemblage of AI-related scripts and notebooks. It focuses a lot on image adjustment and synthesis.
flash-linear-attention. A set of highly efficient Triton kernels for state-of-the-art linear attention models and their variants. A naive PyTorch reference for what these kernels compute appears below.
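Illustrative only; the library's Triton kernels implement fused, chunked versions of this recurrence rather than a Python loop:

```python
import torch

def causal_linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, seq, dim), assumed already passed through a positive
    # feature map such as elu(x) + 1; v: (batch, seq, dim_v).
    b, n, d = q.shape
    dv = v.shape[-1]
    s = torch.zeros(b, d, dv, device=q.device)  # running sum of k_t outer v_t
    z = torch.zeros(b, d, device=q.device)      # running sum of k_t (normalizer)
    out = []
    for t in range(n):
        s = s + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)
        z = z + k[:, t]
        num = torch.einsum("bd,bdv->bv", q[:, t], s)
        den = (q[:, t] * z).sum(-1, keepdim=True).clamp_min(eps)
        out.append(num / den)
    return torch.stack(out, dim=1)  # O(n) time in seq length, O(d * dv) state
```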
Vision-Language Model Evaluation with UniBench. UniBench is a unified framework that combines more than 50 benchmarks into a single implementation, making the evaluation of vision-language models (VLMs) easier. It aids in evaluating how well VLMs are doing in a variety of domains, including object recognition and spatial awareness.
ClickAttention: Click Region Similarity Guided Interactive Segmentation. Interactive segmentation is enhanced by a new click attention technique. This method lessens inter-click interference and increases the impact of positive clicks.
Universal Waveform Generation.
Security Risks in Model Merging. New security threats surface as Model Merging (MM), a common technique for merging optimized models without further training, gains traction. The first backdoor attack that targets MM specifically is described in this publication, called BadMerging.
Model Merging in LLMs, MLLMs, and Beyond Methods, Theories, Applications, and Opportunities. This survey offers a thorough analysis of model merging strategies, a machine learning technique that is becoming more and more popular and doesn't require costly computation or raw training data.

Perspectives

Link description
‘His rhetoric has made Tesla toxic’: is Elon Musk driving away his target market? There are signs the billionaire is becoming unpopular with the very demographic group most likely to buy EVs
Why Elon Musk’s fun week of stirring up unrest shows the limits of our online safety laws. Twitter under the tech owner has become the perfect test case for the UK’s new legislation – but critics say more needs to be done
Elon’s politics: how Musk became a driver of elections misinformation. X owner, who will interview Trump on Monday, has cast doubt on mail ballots and spread false claims
Don't pivot into AI research. In AI and machine learning, scale is now the primary driver of performance. Because of the significant capital required, only a small number of firms can employ productive machine-learning researchers, resulting in market consolidation. This dynamic mirrors the historical consolidation in chip design and points to a potential future decline in the status and pay of machine-learning positions once supply exceeds demand. In light of these industry changes, prospective ML professionals should carefully consider why they want to pursue a career in ML.
OpenAI Generates More Turmoil. Just two of the eleven founding members of OpenAI are still in the company, indicating a high rate of turnover among the group as worries about the organization's move from its original non-profit goals to a more profit-driven structure mount. Co-founders Greg Brockman, who is taking a sabbatical, and Ilya Sutskever have also quit amid rumors of burnout and lucrative side benefits. The company faces difficulties since it could need to find a new significant financial partner and because it expects GPT-5 to come later than expected while the industry debates the benefits of "open" vs "closed" AI models.
Klarna’s AI chatbot: how revolutionary is it, really? By integrating an AI chatbot built with OpenAI, Klarna may be able to cut down on the number of support staff it needs, given the bot's notable efficiency in customer-service duties. In 23 markets and more than 35 languages, the bot responds quickly to standard Level 1 support inquiries; however, it refers more complicated problems to human agents. The system reduces expenses and expedites first-level help, but compared to earlier L1 support automation, its revolutionary influence inside the business environment is questionable.
Why I bet on DSPy. DSPy is an open-source framework that orchestrates multiple LLM calls to solve practical problems. It is being updated to address current issues with accessibility and reliability, with a focus on verified input for outcome measurement. Even with limited reasoning powers, LLMs can function well as creative engines within the DSPy framework.
LinkedIn is a mess. Here’s how to fix it. The networking site one is calling a ‘cesspool’ is riddled with oversharing and lunatics – it’s time for change
Silicon Valley is cheerleading the prospect of human–AI hybrids — we should be worried. A pseudo-religion dressed up as technoscience promises human transcendence at the cost of extinction.
TechScape: Why Musk’s rabble-rousing shows the limits of social media laws. Twitter under the tech owner has become the perfect test case for the UK’s new legislation – but critics say more needs to be done
America & China's Chip Race. The United States is implementing robust policies to enhance domestic semiconductor production using the CHIPS Act and sanctions designed to impede China's technological progress. China's semiconductor industry is booming despite these efforts, with near-record imports of manufacturing equipment and rising domestic chip production. This growing competition points to an ongoing geopolitical tug-of-war over the supremacy of the semiconductor supply chain.
Gas pipeline players in talks to fuel AI data center demand. As the power demands of the AI industry rise, pipeline companies such as Energy Transfer LP and Williams Companies are in talks to feed natural gas directly to data centers.
Does AI Deserve A Seat At The Boardroom Table? Leaders are being compelled to create strong AI strategies for data-driven decision-making as a result of AI's integration with corporate governance. Even though AI provides insightful information, particularly when used with LLMs, there are still issues, such as competence gaps and moral dilemmas. AI and human judgment must be properly balanced to support future C-suite decision-making.
Self-Driving Cars Are Still The Best Way To Solve The Biggest Problem With Driving In America. Robocars promise to improve traffic even when most of the cars around them are driven by people, study finds
Brands should avoid AI. It’s turning off customers. According to a recent study, consumers' desire to buy may be lowered when things are labeled as "AI-powered" because of mistrust and anxiety about the unknown. People are skeptical about AI's inner workings and threats, particularly about personal data protection, according to the research, which implies that both cognitive and emotional trust are important. It is suggested that instead of utilizing "AI" as a buzzword, businesses concentrate on communicating the advantages of AI.
14% of PCs shipped globally in Q2 2024 were AI-capable. In Q2 2024, shipments of AI-capable PCs increased significantly to 8.8 million units or 14% of all PCs supplied.
Brain implants to treat epilepsy, arthritis, or even incontinence? They may be closer than you think. Startups around the world are engaging in clinical trials in a sector that could change lives – and be worth more than £15bn by the 2030s

meme-of-the-week

Back to index

ML news: Week 5 - 11 August

Research

Link description
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge. Proposes meta-rewarding LLMs, a self-improving alignment technique (no human supervision) in which the LLM judges its own judgments and uses the feedback to improve its judgment skills. Simple self-improvement that only generates better responses (the act role) saturates quickly, so this work also enhances the model's ability to judge itself (the judge role), adding a third role, the meta-judge, that evaluates the model's judgments to avoid issues like reward hacking. The approach improves the LLM's ability both to judge and to follow instructions.
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher. Presents MindSearch, an LLM-based multi-agent framework for complex web-information seeking and integration tasks. A web planner efficiently breaks down complex queries, while a web searcher performs hierarchical information retrieval on the Internet to improve the relevance of retrieved information. The planning component uses iterative graph construction to better model complex problem-solving processes. The multi-agent framework is better suited to handling long-context problems by assigning retrieval and reasoning tasks to specialized agents.
Improving Retrieval Augmented Language Model with Self-Reasoning. Offers an end-to-end self-reasoning framework that uses reasoning trajectories produced by the LLM itself to improve the reliability and traceability of RAG systems. The LLM carries out three procedures: 1) relevance-aware: evaluates the relevance between the retrieved documents and the question; 2) evidence-aware selective: selects and cites relevant documents, then automatically picks key sentence snippets from the cited documents as evidence; and 3) trajectory analysis: generates a concise analysis of all the self-reasoning trajectories from the previous two steps and provides the final inferred answer. This makes the model more selective and better at distinguishing relevant from irrelevant documents, improving the accuracy of the RAG system as a whole. Using only 2,000 training examples (produced by GPT-4), the framework outperforms GPT-4.
Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost. Proposes Constrained-CoT (CCoT), which restricts the reasoning output length without compromising performance; limiting LLaMA2-70b's reasoning to 100 words raises GSM8K accuracy from 36.01% (CoT) to 41.07% (CCoT) while lowering the average output length by 28 words.
ThinK: Thinner Key Cache by Query-Driven Pruning. ThinK aims to address inefficiencies in KV cache memory consumption. It focuses on long-context scenarios and inference, offering a query-dependent KV cache pruning method that selectively prunes the least important channels while minimizing attention weight loss.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. A group of researchers found that, given adequate coverage and a verification tool, repeatedly sampling from small models can significantly improve benchmark performance at roughly 3x lower cost than using a larger model; a minimal sketch follows below.
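The helper functions below are placeholders, not the paper's code: draw many cheap samples and keep any that a verifier accepts. With per-sample success rate p, the chance that at least one of k samples is correct grows roughly as 1 - (1 - p)^k.

```python
# Repeated sampling with verification: sample() and verify() stand in for an
# LLM call and a domain checker (e.g. unit tests for code, an answer matcher
# for math); both are hypothetical callables supplied by the user.
def solve_with_repeated_sampling(problem, sample, verify, k=100):
    for _ in range(k):
        candidate = sample(problem, temperature=0.8)  # diverse, independent draws
        if verify(problem, candidate):
            return candidate
    return None  # no verified solution within the sampling budget
```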
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues. A Temporal-Spatial Perception Model (TSPM) has been established by researchers to enhance the capacity to respond to inquiries concerning auditory and visual signals in videos.
No learning rates needed: Introducing SALSA -- Stable Armijo Line Search Adaptation. This work presents enhancements to line search strategies that improve the efficiency of stochastic gradient descent systems.
Automated Review Generation Method Based on Large Language Models. Utilizing LLMs, researchers have created an automated approach for generating reviews to assist in managing the massive amount of scientific material.
CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning. CLEFT is a Contrastive Learning technique meant for medical imaging that aims to overcome the drawbacks of current, resource-intensive CLIP-like methods.
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. There is strong demand for leveraging computation at inference time to boost model performance. This essay explores the trade-offs between various approaches and presents several useful ones. It reflects a larger trend of getting more performance out of smaller models.
An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion. By treating 3D objects as UV-wrapped images, a DiT model can readily generate novel objects from textual inputs.

News

Link description
Character.AI CEO Noam Shazeer returns to Google. In a big move, Character.AI co-founder and CEO Noam Shazeer is returning to Google after leaving the company in October 2021 to found the a16z-backed chatbot startup.
Three New Additions To Gemma 2. Google is expanding the Gemma 2 family of models with the addition of a new 2B parameter model, safety content classifier model, and model interpretability tool.
Microsoft says OpenAI is now a competitor in AI and search. Microsoft’s annually updated list of competitors now includes OpenAI, a long-term strategic partner. The change comes days after OpenAI announced a prototype of a search engine. Microsoft has reportedly invested $13 billion into OpenAI.
Introducing GitHub Models. We are launching GitHub Models, enabling our more than 100 million developers to become AI engineers and build industry-leading AI models.
Reddit CEO says Microsoft needs to pay to search the site. In an interview, Steve Huffman calls out Microsoft’s Bing, Anthropic, and Perplexity for scraping Reddit’s data without permission. ‘It has been a real pain in the ass to block these companies.’
Elon Musk sues OpenAI again, alleging ‘deceit of Shakespearean proportions’. Tesla CEO alleges his former partners, including CEO Sam Altman, manipulated him into co-founding the company
Google broke the law to maintain online search monopoly, US judge rules. White House calls decision – that could have major implications for web use – ‘victory for the American people’
Secretaries of state called on Musk to fix chatbot over election misinformation. X’s Grok AI chatbot falsely told users ‘ballot deadline has passed for several states’
Groq Raises $640M To Meet Soaring Demand for Fast AI Inference. To address the demand for massive language model inference, Groq, the startup that is developing AI chips with lightning speed, is raising a significant amount of funding.
Elon Musk sues OpenAI, Sam Altman for making a “fool” out of him. Elon Musk has revived a lawsuit against OpenAI and its CEO, Sam Altman, claiming they fraudulently obtained roughly $44 million in seed funding from him by promising to keep OpenAI's technology open-source and to prioritize the public good, and that turning OpenAI into a for-profit venture tied to Microsoft betrays that original mission and has caused irreparable harm to both his interests and the public.
OpenAI Co-Founders Schulman and Brockman Step Back. John Schulman has joined Anthropic as an independent contributor, while Greg Brockman is enjoying a long holiday.
Llama 3.1 Impact Grants. Meta has announced a program to award groups using its models for good with $2m to help develop these tools for economically and socially impactful projects.
BYU engineering research finds key to quicker nuclear power: artificial intelligence. Matt Memmott, a professor of chemical engineering at BYU, has created an AI algorithm that could drastically lower costs and shorten the design and licensing of nuclear reactors by as much as a decade. His team's study shows that AI can solve difficult nuclear design challenges far faster than conventional techniques; in one case, the design process was shortened from six months to just two days. The findings aim to preserve low electricity costs while meeting growing energy demands by speeding up the development of nuclear power.
OpenAI tempers expectations with less bombastic, GPT-5-less DevDay this fall. According to OpenAI, this year's DevDay conference will no longer be a large event but rather a series of smaller, mobile developer sessions that will concentrate on upgrades to developer services and APIs rather than the introduction of a new flagship model.
Tezi raises $9M to launch Max: the first fully autonomous AI recruiter. To build Max, an AI-driven recruiting agent that conducts hiring procedures from beginning to end on its own, Tezi raised $9 million in seed funding, with the lead investors being 8VC and Audacious Ventures.
Apple Intelligence rollout timetable won't delay iPhone 16. Apple Intelligence capabilities will be added to iOS 18 after launch; initial access will be available to iPhone 15 Pro models exclusively in iOS 18.1.
Figure redesigns its humanoid robot from the ground up for slick new F.02. California-based robotics outfit Figure has today announced its second-generation humanoid robot, which is initially being aimed at production lines in commercial settings, but the company is promising a bipedal butler in our homes shortly.
Structured Outputs in OpenAI API. It is difficult to request organized output, such as JSON, from language models. With the help of this new functionality in OpenAI's API, language model creation may produce structured output that deterministic applications downstream can use.
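The launch announcement's own example passes a Pydantic model as the response format, and the API constrains decoding to the matching JSON schema:

```python
from openai import OpenAI
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # first model released with structured outputs
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)
event = completion.choices[0].message.parsed  # a CalendarEvent instance
```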
Meta is reportedly offering millions to use Hollywood voices in AI projects. To obtain broad usage rights across all of its platforms, Meta is negotiating to use the voices of well-known actors like Awkwafina and Judi Dench for its AI digital assistant. If a settlement is reached, the actors may receive millions of dollars in compensation, with SAG-AFTRA protecting likenesses created by AI. The business recently canceled a celebrity voice chatbot project, and now plans to showcase these AI technologies at its Connect conference in September.
With Smugglers and Front Companies, China Is Skirting American A.I. Bans. A thriving underground market persists despite U.S. sanctions meant to stop the transfer of AI chips to China, facilitating large transactions such as the $103 million purchase using Nvidia processors. In an attempt to get around prohibitions, new businesses are founded, delivery methods are deceitful, and international distribution gaps are exploited. The ongoing illicit commerce has sparked discussions about the efficacy of American export regulations and how they affect US tech companies in comparison to their Chinese rivals.
Nvidia Blackwell GPUs allegedly delayed due to design flaws — launch expected to be pushed back by three months or more. Microsoft, Meta, Google, and xAI will have to wait a few more months to receive their massive GPU orders.
OpenAI says it’s taking a ‘deliberate approach’ to releasing tools that can detect writing from ChatGPT. OpenAI has built a tool that could potentially catch students who cheat by asking ChatGPT to write their assignments — but according to The Wall Street Journal, the company is debating whether to release it.
Zuckerberg touts Meta’s latest video vision AI with Nvidia CEO Jensen Huang. Meta had a palpable hit last year with Segment Anything, a machine learning model that could quickly and reliably identify and outline just about anything in an image. The sequel, which CEO Mark Zuckerberg debuted on stage Monday at SIGGRAPH, takes the model to the video domain, showing how fast the field is moving.
Gemini intelligence is coming to Google Home. Google Assistant is getting a major upgrade on Nest smart speakers and displays, and Nest cameras will soon be able to tell as well as show, as Google Home gets a powerful AI infusion
Zuckerberg says Meta will need 10x more computing power to train Llama 4 than Llama 3. Meta, which develops one of the biggest foundational open source large language models, Llama, believes it will need significantly more computing power to train models in the future.
AMD is becoming an AI chip company, just like Nvidia. AMD’s AI GPU sales just went from a billion dollars cumulatively to a billion dollars quarterly.
Microsoft Is Losing a Staggering Amount of Money on AI. With an emphasis on data centers for AI capabilities, Microsoft's spending in AI jumped to $19 billion in the most recent quarter; nevertheless, significant AI revenue is yet unknown.
Taco Bell’s drive-thru AI might take your next order. Taco Bell’s parent company aims to bring its ‘Voice AI’ technology to hundreds of stores in the US by the end of 2024.
OpenAI invests in a webcam company turned AI startup. OpenAI is leading a $60 million funding round for Opal, the company behind the high-end Tadpole webcam, according to a report from The Information.
UK regulator to examine $4bn Amazon investment in AI startup Anthropic. Move is the latest of a string of CMA investigations into technology tie-ups
Hugging Face acquires XetHub. The majority of the data that Hugging Face serves and stores lives in Git LFS. XetHub has built a strong, more scalable alternative to LFS-backed Git repositories.
Humane’s daily returns are outpacing sales. The company is scrambling to stabilize as it hits $1 million in total returns against $9 million in sales.
GPT-4o System Card. Deploying a voice system safely is difficult. This piece highlights the ongoing efforts to guarantee the safety and usefulness of the multimodal model.
Fully-automatic robot dentist performs world's first human procedure. In a historic moment for the dental profession, an AI-controlled autonomous robot has performed an entire procedure on a human patient for the first time, about eight times faster than a human dentist could do it.
Microsoft launches GitHub Models, offering 100 million developers easy access to leading AI tools. Microsoft has introduced "GitHub Models," a new platform that enables over 100 million developers to integrate AI into their software projects by providing access to a variety of AI models. This includes popular models like Llama 3.1, GPT-4o, and Mistral Large 2, among others. Developers can explore these models for free through a built-in model playground on GitHub, where they can experiment with different prompts and model parameters.
Google brings Gemini-powered search history and Lens to Chrome desktop. Google Thursday said that it is introducing new Gemini-powered features for Chrome’s desktop version, including Lens for desktop, tab compare for shopping assistance, and natural language integration for search history.

Resources

Link description
Adaptive Retrieval-Augmented Generation for Conversational Systems. Develops a gating model that predicts whether a conversational system needs RAG to improve its responses, while demonstrating that RAG-based conversational systems can produce high-quality responses with high generation confidence. It further asserts a correlation between the relevance of the augmented knowledge and the generation's degree of confidence.
ShieldGemma: Generative AI Content Moderation Based on Gemma. Built on Gemma 2, ShieldGemma provides a full suite of LLM-based safety content-moderation models, including classifiers for major harm categories such as toxicity, hate speech, and dangerous content.
PersonaGym: Evaluating Persona Agents and LLMs. Assessing Persona Agents: This study suggests a standard for assessing persona agent skills in LLMs; it discovers that, while being a somewhat more sophisticated model, Claude 3.5 Sonnet only shows a 2.97% relative improvement in PersonaScore when compared to GPT 3.5.
The Art of Refusal: A Survey of Abstention in Large Language Models. A review of the approaches currently employed in LLMs to achieve refusal, along with the evaluation metrics and benchmarks used to measure abstention in LLMs.
XHand: Real-time Expressive Hand Avatar. XHand is a new hand avatar designed for real-time rendering in virtual worlds and video games. In contrast to earlier models, XHand focuses on producing detailed hand morphologies, appearances, and deformations.
Prompt Poet. Character AI's prompt construction library, which serves millions of conversations, has been made available to the public.
NAVIX: minigrid in JAX. A popular RL testbed, MiniGrid, accelerated in JAX.
SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models. A novel data synthesis pipeline for Vision Large Language Models (VLLMs) is called SynthVLM. Rather than captioning photos directly, SynthVLM leverages sophisticated diffusion models to produce high-resolution images from captions.
Networks that compress themselves. By including the network's size in the loss function, you can train a more accurate, self-quantizing model that gets smaller. A sketch of the idea appears below.
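A hedged sketch, not the post's actual code: expose per-layer bit widths as learnable quantities and add their total to the loss, so the optimizer trades accuracy against model bytes.

```python
import torch

def size_regularized_loss(task_loss, bit_widths, lam=1e-4):
    # bit_widths: learnable per-layer tensors of (soft) quantization bit counts.
    # Penalizing total bits pushes the network to quantize itself harder
    # wherever accuracy permits. Illustrative only; the post's exact
    # formulation may differ.
    model_bits = sum(b.sum() for b in bit_widths)
    return task_loss + lam * model_bits
```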
Video Tracking with Language Embeddings. A novel technique that leverages language embeddings to enhance point tracking in lengthy video sequences has been developed by researchers.
Boosting Efficiency in Vision-Language Model Training. This effort addresses the imbalance brought about by different data distributions and model architectures by introducing a technique to balance computational burdens during large-scale 3D simultaneous training of vision-language models.
TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling. High-quality generation of textures on 3D models with diffusion.
MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization. This work uses textual, 2D, or 3D input to create artistic meshes. To sample effectively, it takes advantage of neighboring tokens and enhancements to the vertex representation.
CogVideo. A text-to-video model available for free that performs nearly as well as closed video creation technologies.
MiniCPM-V. Amazing vision language model with near real-time performance. It performs better on certain benchmarks than closed models.
RecDiffusion: Rectangle for Image Stitching with Diffusion Models. RecDiffusion is a framework that improves the aesthetic appeal of stitched photos without requiring any cropping or distortion.
LLaVA-OneVision: Easy Visual Task Transfer. In visual language models, there has been an effort to make them versatile and easy to tune. This reminds me of computer vision from ten years ago. Crucially, LLaVA-OneVision demonstrates how meticulous data curation and architecture upgrades may do this.
ABC Invariance. muP lets you migrate hyperparameters from smaller to larger models. This GitHub gist demonstrates in practice a fantastic theorem stating that you can change where you apply scaling to model outputs without affecting final transfer performance.
XLabs-AI/flux-controlnet-canny. XLabs has released the first Flux-Dev control net which allows for generation conditioned on Canny image inputs.
HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection. HARMONIC is a framework for generating and evaluating synthetic tabular data using large language models.
Introducing Qwen2-Math. A 72B math model from the Qwen team beats all other open and closed models on MATH. It also beats Llama-3.1-405B on some reasoning-related benchmarks. It is English-only for now; multilingual models are coming soon.
SAM2-PATH: A better segment anything model for semantic segmentation in digital pathology. A novel approach called SAM2-PATH aims to improve semantic segmentation in digital pathology.
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond. Speech-MASSIVE is a new multilingual Spoken Language Understanding (SLU) dataset, providing a speech counterpart to the MASSIVE text corpus.
PyTorch FlexAttention. A new PyTorch API makes it possible to define nearly any attention variant and compile it to Triton, improving portability, performance, and research velocity (see the sketch after this list).
A Language Model with Quick Pre-Training. The "1.5-Pints" language model offers a compute-efficient approach to pre-training. By curating a high-quality dataset of 57 billion tokens, it outperforms Apple's OpenELM and Microsoft's Phi on instruction-following tasks as measured by MT-Bench.
lighthouse. Lighthouse is a user-friendly library for reproducible and accessible research on video moment retrieval (MR) and highlight detection (HD). It supports six MR-HD models, three feature types, and five datasets.
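A minimal sketch of the self-compressing-networks idea above, under simplifying assumptions: each layer carries a learnable bit depth, quantization error is approximated by noise that shrinks as bits grow (so the task loss "sees" precision), and the loss adds a size term so training trades accuracy against stored bits. The layer name `SelfQuantLinear` and all hyperparameters are illustrative, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfQuantLinear(nn.Module):
    """Toy linear layer with a learnable per-layer bit depth (illustrative only)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.bits = nn.Parameter(torch.tensor(8.0))  # learnable precision

    def forward(self, x):
        # Approximate quantization error with uniform noise of width scale/2^bits.
        # The noise is differentiable in `bits`, so the task loss penalizes low precision.
        levels = 2.0 ** self.bits.clamp(1.0, 16.0)
        scale = self.weight.abs().max().detach() + 1e-8
        noise = (torch.rand_like(self.weight) - 0.5) * (scale / levels)
        return F.linear(x, self.weight + noise)

model = nn.Sequential(SelfQuantLinear(32, 64), nn.ReLU(), SelfQuantLinear(64, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 1e-4  # weight on the size term

x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
for step in range(200):
    task = F.cross_entropy(model(x), y)
    # Size term: total bits stored across all quantized layers.
    size = sum(m.bits * m.weight.numel() for m in model.modules()
               if isinstance(m, SelfQuantLinear))
    loss = task + lam * size  # accuracy vs. model size trade-off
    opt.zero_grad(); loss.backward(); opt.step()
```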
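And for the FlexAttention entry above, a short sketch of how the API is used (requires a recent PyTorch, roughly 2.5+). The ALiBi-style distance penalty here is just one example of a `score_mod`; the slope values are arbitrary.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 2, 4, 256, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

# Per-head slopes for an ALiBi-style penalty on distant positions.
slopes = 2.0 ** (-8.0 * torch.arange(1, H + 1).float() / H)

def alibi(score, b, h, q_idx, kv_idx):
    # score_mod is called per (batch, head, query, key) position; returning a
    # modified score is all it takes to define a new attention variant.
    return score - slopes[h] * (q_idx - kv_idx).abs()

out = flex_attention(q, k, v, score_mod=alibi)
print(out.shape)  # torch.Size([2, 4, 256, 64])

# In practice you would compile it so the variant fuses into one Triton kernel:
# compiled_attention = torch.compile(flex_attention)
```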

Perspectives

Link description
AI existential risk probabilities are too unreliable to inform policy. The use of AI existential risk probability estimates for policymaking is criticized in this essay, which contends that these estimates are excessively erratic and lack a strong inductive or deductive foundation, frequently approximating educated guesses rather than fact-based projections. The authors argue against the validity of using these projections to inform public policy, particularly when they are connected to expensive or restricting measures, and they support an evidence-based strategy that takes AI development uncertainty into account. They advise against utilizing speculative existential risk probability in high-impact decisions and instead suggest concentrating on specified AI milestones for more significant policy choices.
Is AI judging the future of gymnastics or just a surveillance tool? The International Gymnastics Federation (FIG) and Fujitsu have partnered on an AI-assisted judging support system at the World Gymnastics Championships, aiming for more equitable and transparent scoring. The Judging Support System (JSS), which will not replace judges, provides 3D model-based second opinions in challenging cases and score inquiries, with room for future development and wider uses. Despite worries that it may eventually displace human judgment, the JSS could improve scoring accuracy and consistency, which matters in a sport where even small point differences shape standings and careers.
Why AI’s Tom Cruise problem means it is ‘doomed to fail’. LLMs’ ‘reversal curse’ leads them to fail at drawing relationships between simple facts. It’s a problem that could prove fatal.
Sound clashes are a thrilling reggae tradition. Will AI ruin them? The use of fake AI vocals – including those of Donald Trump – is sending shockwaves through this historic scene. At a Montego Bay clash, performers debate their culture’s future
Replacing my Right Hand with AI. An Anthropic scientist broke their hand while riding a bike. By leaning on Claude and voice input, they remained remarkably productive.
TPU transformation: A look back at 10 years of our AI-specialized chips. Because it has invested in bespoke TPU chips, Google is one of the only companies training massive models without being dependent on Nvidia.
I'm Switching Into AI Safety. Alex Irpan left Google's robotics team after eight years to join Google DeepMind's AI safety team, motivated by a personal desire to address safety concerns as AI systems approach superhuman capability. Though the area is difficult and fraught with controversy, he voices concerns about the effectiveness of present AI safety measures and the growing risks of unmanaged AI growth, and explains his commitment to contributing to AI safety.
As Regulators Close In, Nvidia Scrambles for a Response. With a 90 percent share of the A.I. chip market, the company is facing antitrust investigations into the possibility that it could lock in customers or hurt competitors.
How GitHub harnesses AI to transform customer feedback into action. GitHub is using AI and machine learning to compile and evaluate user input at scale, providing useful insights that drive feature prioritization and product enhancements. This automated method improves responsiveness to developer needs by facilitating the collection of multilingual input and promoting data-driven decision-making. The project demonstrates GitHub's dedication to utilizing AI to uphold a developer-centric approach to product development.
How Does OpenAI Survive? This article expresses strong doubt about OpenAI's sustainability, given the exorbitant costs of building and running large language models and the absence of broad business utility for generative AI. The author questions OpenAI's long-term viability in the absence of substantial technological advances or sustained, extraordinary fundraising. Even though OpenAI has had a significant impact on the AI sector, the business still faces problems with profitability, a high operational burn rate, and reliance on key alliances, most notably Microsoft.
How neurons make a memory. Loosely packaged DNA might make these nerve cells better able to encode memories.
DeepMind hits milestone in solving maths problems — AI’s next grand challenge. AlphaProof showed its prowess on questions from this year’s Mathematical Olympiad — a step in the race to create substantial proofs with artificial intelligence.
Dirty talk: how AI is being used in the bedroom – and beyond. Analysis of more than 200,000 chatbot conversations shows how the new tech is actually being used. Turns out quite a lot of it is ‘racy role play’
Scientists are falling victim to deepfake AI video scams — here’s how to fight back. Cybercriminals are increasingly singling out researchers, alongside politicians and celebrities. Targeted scientists share tips on how to silence them.
What lies beneath: the growing threat to the hidden network of cables that power the internet. Last month large parts of Tonga were left without internet when an undersea cable was broken. It’s a scenario that is far more common than is understood
Why AI hasn’t shown up in the GDP statistics yet. Even though LLMs have made remarkable strides in handling complicated tasks, they are still unable to reliably complete activities at a scale comparable to that of humans. As a result, their current potential as direct human substitutes in processes is limited. LLMs require comprehensive prompt engineering and iteration to reach acceptable accuracy. The latest JSON output control and cost reduction enhancements from OpenAI may help with certain problems, but the subtle integration needed for LLMs in corporate settings points to gradual productivity increases rather than a sudden economic revolution.
AI Is Coming for India's Famous Tech Hub. AI integration is posing a danger to employment, particularly in routine operations like contact centers, which has caused a sea change in India's technology outsourcing sector. While recruiting is slowing down, companies are finding it difficult to move up the value chain. However, some are optimistic that AI technologies may open up new opportunities in fields like programming. Higher-order cognitive abilities will be necessary in the sector going forward as automation continues to reshape traditional employment.
Inside the company that gathers ‘human data’ for every major AI company. Advances in AI pre-training let models absorb vast amounts of online data, while subsequent supervised fine-tuning with specialists helps them become both more specialized and more capable. Turing's approach aims to improve AI reasoning capabilities by leveraging "input and output pairs" created by subject-matter experts. Anticipating the "agentic" future of artificial intelligence, such models might integrate specialized knowledge across areas to accomplish complicated tasks independently.

meme-of-the-week

Back to index

ML news: Week 29 July - 4 August

Research

Link description
Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. compares RAG to long-context LLMs and finds that while RAG is much less expensive, long-context LLMs perform better on average; proposes Self-Route, which routes queries to RAG or LC using self-reflection; reports a substantial reduction in computational cost with performance comparable to LC.
Recursive Introspection: Teaching Language Model Agents How to Self-Improve. asserts that LLMs can be iteratively fine-tuned to improve their own responses over multiple turns with additional feedback from the environment; the LLM learns to recursively detect and correct its past mistakes in subsequent iterations; enhances 7B models' self-improvement abilities on reasoning tasks (GSM8K and MATH), achieving an improvement over turns that is not observed in strong proprietary models.
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference. presents a novel dynamic token pruning technique for efficient long-context LLM inference; it can maintain high accuracy while speeding up the prefilling stage of a Llama 2 7B model by 2.34x; it computes the KV only for tokens that are crucial for the next token prediction, in both the prefilling and decoding stages; it lets the model dynamically select different subsets of context tokens at different generation steps, even if they were pruned in a previous step (a simplified sketch follows this list).
Generation Constraint Scaling Can Mitigate Hallucination. proposes a novel training-free method to reduce hallucinations in LLMs by scaling the readout vector that constrains generation in a memory-augmented LLM decoder; prior research suggests that LLMs with explicit memory mechanisms hallucinate less; this work uses a memory-augmented LLM and applies lightweight memory primitives to constrain generation in the decoder.
Align and Distill: Unifying and Improving Domain Adaptive Object Detection. A new method named ALDI addresses the difficulty of getting object detection models to perform well on data domains they weren't originally trained on.
Small Molecule Optimization with Large Language Models. By gathering a dataset of 100 million molecules (equivalent to 40 billion tokens), researchers trained two new language models that improve performance by 8% on the Practical Molecular Optimization benchmark.
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation. At a roughly comparable inference cost, code generation performance can be improved by sampling repeatedly from smaller models (see the sketch after this list).
Self-Directed Synthetic Dialogues and Revisions Technical Report. A dataset of more than 300,000 dialogues, critiques, and revisions for open models. Produced almost entirely synthetically, it is a strong demonstration of synthetic data generation using open models.
Theia: Distilling Diverse Vision Foundation Models for Robot Learning. Theia, a vision foundation model for robot learning that combines several current vision models, is presented in this study. Rich visual representations provided by Theia improve robot learning even when using smaller model sizes and less training data. Test results indicate that Theia performs better than its predecessors, and the authors propose that enhanced performance is caused by more entropy in feature norms. The public is free to utilize the models and code.
Do We Really Need Graph Convolution During Training? Light Post-Training Graph-ODE for Efficient Recommendation. LightGODE is a novel strategy for increasing the effectiveness and scalability of recommender systems. By adopting a continuous graph ODE and performing graph convolution only after training, it avoids costly computations during training.
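A rough sketch of the dynamic token pruning idea from the LazyLLM entry above, under simplifying assumptions: token importance is scored by how much attention the final token pays to each context position, and only the top fraction of positions is kept for the next step. This illustrates the principle, not the paper's exact algorithm.

```python
import torch

def prune_context(attn_weights: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Rank context tokens by the attention the last token pays them and
    return the indices of the top fraction, preserving original order."""
    # attn_weights: (num_heads, seq_len, seq_len), each row sums to 1
    importance = attn_weights[:, -1, :].mean(dim=0)   # average over heads
    k = max(1, int(keep_ratio * importance.numel()))
    return importance.topk(k).indices.sort().values   # keep positional order

attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)  # fake attention maps
print(prune_context(attn, keep_ratio=0.25))           # e.g. tensor([ 3,  7, 12, 15])
```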
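And a sketch of the budget-reallocation idea above: with a comparable total inference budget, draw several samples from a smaller model and keep the best one under a verifier such as unit tests. The generator and scorer below are hypothetical stand-ins, not anything from the paper.

```python
import random
from typing import Callable

def best_of_n(generate: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Sample n candidates from a cheaper model; keep the highest-scoring one."""
    return max((generate() for _ in range(n)), key=score)

def tiny_model() -> str:  # hypothetical stand-in for a small code LLM
    return random.choice(["def add(a, b): return a + b",
                          "def add(a, b): return a - b"])

def unit_tests(src: str) -> float:  # score = number of tests passed
    ns: dict = {}
    try:
        exec(src, ns)
        return float(ns["add"](2, 3) == 5) + float(ns["add"](-1, 1) == 0)
    except Exception:
        return 0.0

print(best_of_n(tiny_model, unit_tests, n=8))
```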

News

Link description
Llama 3.1. A family of LLMs that includes models with 8B, 70B, and 405B parameters; it supports eight languages and extends the context window to 128K tokens; it exceeds state-of-the-art models in certain situations and competes favorably in areas including general knowledge, math reasoning, and tool use.
Nvidia’s new Titan GPU will beat the RTX 5090, according to leak. After skipping its ultra-expensive flagship graphics card with its Ada lineup, Nvidia could be bringing back the Titan with a Blackwell GPU.
Elon Musk will ‘discuss’ Tesla investing $5 billion in his private AI company. Elon Musk says that he will ‘discuss’ Tesla investing $5 billion in xAI, his own private artificial intelligence company. For the last few years, Musk has claimed that “Tesla is an AI company.”
OpenAI training and inference costs could reach $7bn for 2024, AI startup set to lose $5bn - report. ChatGPT inference is reported to cost OpenAI about $4 billion a year on Microsoft's Azure servers, potentially resulting in large financial losses. Even though OpenAI makes about $2 billion a year from ChatGPT, it would need more money within a year to cover a $5 billion shortfall. At subsidized Azure prices, it currently uses the equivalent of 350,000 Nvidia A100 servers, primarily for ChatGPT.
Elon Musk sets new date for Tesla robotaxi reveal, calls everything beyond autonomy ‘noise’. Elon Musk says he will show off Tesla’s purpose-built “robotaxi” prototype during an event October 10, after scrapping a previous plan to reveal it August 8. Musk said Tesla will also show off “a couple of other things,” but didn’t explain what that meant.
Stability AI steps into a new gen AI dimension with Stable Video 4D. Stability AI is expanding its growing roster of generative AI models, quite literally adding a new dimension with the debut of Stable Video 4D.
Google’s Gemini AI is getting faster with its Flash upgrade. Google’s Gemini AI chatbot will be able to respond to you more quickly and process more content in prompts thanks to an upgrade to the company’s Gemini 1.5 Flash AI model.
Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images. Real-time promptable segmentation for videos and images from Meta.
Apple says its AI models were trained on Google’s custom chips. Apple said in a technical paper on Monday that the two AI models underpinning Apple Intelligence, its AI system, were pre-trained on Google-designed chips in the cloud.
AI Startup Anthropic Faces Backlash for Excessive Web Scraping. Freelancer.com CEO claims Anthropic's crawler violated the "do not crawl" protocol, causing site slowdowns.
Apple Intelligence Foundation Language Models. Apple has outlined the basics of its language models for its newly announced “Apple Intelligence” initiative.
Microsoft beats revenue forecasts but poor performance of cloud services drags share price. Firm’s earnings were up 15% year-on-year, but Azure’s lower returns resulted in share prices falling by as much as 7%
UK regulator looks at Google’s partnership with Anthropic. CMA to consider whether the deal with AI startup is a potential merger, which could prompt full investigation
OpenAI has released a new ChatGPT bot that you can talk to. The voice-enabled chatbot will be available to a small group of people today, and to all ChatGPT Plus users in the fall.
Meta's new AI Studio helps you create your own custom AI chatbots. Headed for the web as well as Instagram, Messenger, and WhatsApp, AI Studio will let you build a chatbot that acts as a virtual extension of yourself.
Perplexity Will Soon Start Selling Ads Within AI Search. Facing backlash for scraping publisher data, the young company says it’ll now compensate publishers whose content is used in answers to search questions.
The AI job interviewer will see you now. AI interview services say they’re eliminating bias — but not everyone agrees. Companies are adopting AI job interview systems to handle incoming applicants. LLMs allow the interviewer to incorporate follow-up questions based on the subject’s response. Critics say the opaque models raise serious concerns about bias, particularly where there is no documentation about how a decision is made.
Canva buys Leonardo. Leonardo, a generative image startup, joins Canva to enhance the creative tools of both organizations.
Announcing Phi-3 fine-tuning, new generative AI models, and other Azure AI updates. Microsoft has released updates to Azure AI, including serverless fine-tuning for Phi-3, improved Phi-3-mini performance, and the addition of models such as Meta's Llama 3.1 and GPT-4o mini to Azure AI.
Strong earnings report pushes Meta shares up amid heavy AI spending. The stock price rose around 5% after the company outperformed analysts’ expectations for its second quarter.
Argentina will use AI to ‘predict future crimes’ but experts worry for citizens’ rights. President Javier Milei creates security unit as some say certain groups may be overly scrutinized by the technology
White House says no need to restrict ‘open-source’ artificial intelligence — at least for now. The White House is coming out in favor of “open-source” artificial intelligence technology, arguing in a report Tuesday that there’s no need right now for restrictions on companies making key components of their powerful AI systems widely available.
Samsung hints at new products as it bets on AI to drive upgrades to its latest foldable phones. Speaking to CNBC, Samsung Electronics’ mobile boss TM Roh discussed Galaxy AI and software strategy, while hinting at future foldable products and mixed reality headsets. Roh said the company hopes its suite of AI software will push users to upgrade to its latest smartphones.
Elon Musk calls Grok 'the most powerful AI by every metric' but 'secretly' trains the new model with your X data by default. X's new setting is enabled by default and uses your data to train its Grok AI model.
NVIDIA Accelerates Humanoid Robotics Development. To accelerate the development of humanoid robotics, NVIDIA has introduced new services and platforms, such as teleoperated data capturing workflows, OSMO orchestration, and NIM microservices.
US’ first robot-assisted dual kidney transplant performed in Ohio. Joanne’s surgery was unique because doctors used the robotic surgical technique to implant two kidneys from a single deceased donor.
Intel announces plan to cut 15,000 jobs to ‘resize and refocus’ business. Firm reported a loss in its second quarter and said it would cut 15% of its workforce to cut costs and compete with rivals
UK shelves £1.3bn of funding for technology and AI projects. Britain’s first next-generation supercomputer, planned by Tories, in doubt after Labour government move
Black Forest Labs. The creators of Latent Diffusion, Stable Diffusion, VQGAN, and other foundational work have raised over $30 million to launch their new company. They have introduced new flagship image generation models, offered in multiple tiers, that are highly capable.
OpenAI pledges to give U.S. AI Safety Institute early access to its next model. OpenAI CEO Sam Altman says that OpenAI is working with the U.S. AI Safety Institute, a federal government body that aims to assess and address risks in AI platforms, on an agreement to provide early access to its next major generative AI model for safety testing.
The EU’s AI Act is now in force. This starts the clock on a series of staggered compliance deadlines that the law will apply to different types of AI developers and applications. Most provisions will be fully applicable by mid-2026. But the first deadline, which enforces bans on a small number of prohibited uses of AI in specific contexts, such as law enforcement use of remote biometrics in public places, will apply in just six months.
Introducing Stable Fast 3D: Rapid 3D Asset Generation From Single Images. Stability AI has launched a fast and capable new 3D generation model. Like the company's earlier releases, it ships under the same commercial license.
Introducing torchchat: Accelerating Local LLM Inference on Laptop, Desktop and Mobile. The PyTorch team has released an excellent example library for local language model chat. It can run the latest Llama 3.1 models and comes with a robust sampling system.
Heeyo built an AI chatbot to be a billion kids’ interactive tutor and friend. Xiaoyin Qu founded Heeyo, which has released an AI-powered app with interactive games and a chatbot for kids aged three to eleven. With features like data protection and content created by child development specialists, the app aims to prioritize safety while offering tailored learning experiences. Despite possible concerns about AI for children, Heeyo has raised $3.5 million in seed funding and positions itself as a safe, educational alternative to popular video and gaming platforms.
Cerebras IPO. Cerebras Systems has filed IPO paperwork with the SEC.
LLMs breach a threshold. FLOPs as a regulatory threshold have been the subject of debate since Meta recently released its open-source LLM Llama 3.1, trained using 3.8×10^25 FLOPs and equipped with 405B parameters.

Resources

Link description
OpenDevin: An Open Platform for AI Software Developers as Generalist Agents. provides a framework for creating generalist agents that use software to interact with the outside world. Its features include 1) an interface for creating and executing code, 2) an environment with a sandboxed operating system and web browser accessible to the agents, 3) an interface through which agents interact with those environments, 4) support for multiple agents, and 5) an evaluation framework.
A Survey on Employing Large Language Models for Text-to-SQL Tasks. gives an overview of using LLMs for Text-to-SQL operations, covering benchmarks, prompt engineering strategies, and fine-tuning procedures.
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. Open-sources a massive multimodal interleaved dataset with 3.4 billion images and 1 trillion tokens; additional sources such as PDFs and ArXiv papers are also included.
StreamMOS: Streaming Moving Object Segmentation with Multi-View Perception and Dual-Span Memory. StreamMOS is a new approach for segmenting moving objects using LiDAR in autonomous driving and robotics.
Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography. Scientists have devised a technique that incorporates miniature spectrometers to enhance mobile photography. To improve image quality, this innovative method combines RGB and low-resolution multi-spectral images.
BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation. A new and improved monocular depth model for a wide range of real-world scenes.
3D Object Segmentation with Language. RefMask3D is a technique that segments objects in 3D point clouds from natural language descriptions. With Geometry-Enhanced Group-Word Attention and Linguistic Primitives Construction, the system improves vision-language feature fusion and tackles the sparsity and irregularity of point clouds.
Efficient Cell Segmentation. LKCell, a novel technique for high-accuracy cell segmentation, strikes a balance between computational efficiency and broad receptive fields.
Tactics for multi-step AI app experimentation. Typically, LLM programs have several components; this article examines various strategies along with pertinent code snippets.
AccDiffusion. A technique that significantly enhances diffusion models' ability to synthesize high-quality images.
HybridDepth. HYBRIDDEPTH is a depth estimation pipeline created to address scale ambiguity and hardware variation in mobile augmented reality.
VSSD: Vision Mamba with Non-Causal State Space Duality. A novel method for mitigating the high computing needs of vision transformers is the Visual State Space Duality (VSSD) paradigm.
A New Benchmark for Autonomous Agents. AppWorld Engine is a sophisticated execution environment featuring nine day-to-day apps and 457 APIs.
Crash Course in Deep Learning. The creation and application of multi-layer perceptrons (MLPs), a kind of fully connected neural network used in deep learning, are covered in this article.
SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain. This study introduces SaulLM-54B and SaulLM-141B, two large language models with 54 billion and 141 billion parameters designed for the legal domain. The researchers built on the Mixtral architecture and performed large-scale domain adaptation: continued pre-training on an extensive legal corpus, a legal-specific instruction-following protocol, and alignment of outputs with human legal interpretations. The models achieve state-of-the-art performance on LegalBench-Instruct and outperform earlier open-source models. Base, instruct, and aligned versions are available under the MIT License for reuse and collaborative study.
WFEN. To boost face super-resolution, researchers have created a feature augmentation network based on wavelets. The technique uses a full domain Transformer and breaks down input data into high and low-frequency components to improve facial details without generating distortions.
ChartQA-MLLM. This work proposes a novel approach to chart question answering based on multimodal large language models.
DGFNet. A novel method for forecasting the paths of several traffic participants in autonomous driving is called DGFNet. By taking into account the variations in difficulty between agents, recording detailed spatiotemporal data, and utilizing a difficulty-guided decoder, it improves predictions.
SAE for Gemma. This demo is a beginner-friendly introduction to interpretability that explores an AI model called Gemma 2 2B. It also contains interesting and relevant content even for those already familiar with the topic.
Machine Unlearning in Generative AI: A Survey. This in-depth analysis of generative AI examines machine unlearning. It addresses how to formulate problems, how to evaluate them, and the advantages and disadvantages of different approaches.
Elysium: Exploring Object-level Perception in Videos via MLLM. Elysium represents a step toward enabling object tracking and related tasks in videos for multi-modal large language models (MLLMs).
Piano Performance Generation. This paper presents a two-stage Transformer-based model for generating emotionally expressive piano performances.
3D Generative Model for Dynamic Scenes. DynaVol-S is a 3D generative model that excels at extracting object-centric representations from videos without supervision.
Add-SD: Rational Generation without Manual Reference. Add-SD inserts objects into realistic scenes from short text prompts. Unlike other methods, it requires no bounding boxes or other explicit references.
Flow Matching: Matching flows instead of scores. Diffusion models are powerful but can be hard to understand. Flow matching is one theoretical lens for viewing them; this blog digs deeper into the underlying math (the core objective is sketched after this list).
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions. MMTrail is a large-scale multi-modality video-language dataset with over 20M trailer clips, featuring high-quality multimodal captions that integrate context, visual frames, and background music, aiming to enhance cross-modality studies and fine-grained multimodal-language model training.
ARCLE - ARC Learning Environment. ARCLE is an environment to aid reinforcement learning studies using the Abstraction and Reasoning Corpus (ARC).
Mishax. DeepMind has released a library for studying language models via mechanistic interpretability. The library helps with running models and functions from complex codebases without import headaches.
Engine Core. Engine Core demonstrates a pattern for enabling LLMs to undertake tasks of a given scope with a dynamic system prompt and a collection of tool functions.
alphaXiv. Open research discussion directly on top of arXiv.
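For the flow matching post above, here is the core objective in its most common linear-path form, written out as a reference. This is the standard conditional flow matching loss, not anything specific to the linked post:

```latex
% Linear (optimal-transport) probability path between noise and data:
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
x_0 \sim \mathcal{N}(0, I), \quad x_1 \sim p_{\text{data}}, \quad t \sim \mathcal{U}[0, 1]

% The network v_\theta regresses the constant target velocity of that path:
\mathcal{L}_{\text{CFM}}(\theta) \;=\;
\mathbb{E}_{t,\,x_0,\,x_1} \bigl\| v_\theta(x_t, t) - (x_1 - x_0) \bigr\|^2
```

Sampling then amounts to integrating the learned velocity field from noise at t = 0 to data at t = 1.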

Perspectives

Link description
My new iPhone symbolizes stagnation, not innovation – and a similar fate awaits AI. Development of ChatGPT and its ilk will plateau, just like it did for smartphones, and then what are we left with? More ho-hum consumer tech
AI: Are we in another dot-com bubble? A thorough examination by Translink Capital's Kelvin Mu contrasting the present AI cycle with the internet/telecom cycle of the 1990s. After comparing the two eras' technological, economic, and capital disparities, he comes to the conclusion that, even though a bubble may eventually occur, we are still a long way from there.
Robots sacked, screenings shut down: a new movement of Luddites is rising up against AI. Company after company is swallowing the hype, only to be forced into embarrassing walk backs by anti-AI backlash
Chalkboards and What They Can Teach Us About Generative AI. This article discusses the use of generative AI as a teaching tool and makes the case that the technology's compatibility with educational ideals should be taken into account in addition to its technical analysis. Although the author is receptive to the use of AI, she is wary of its potential effects and stresses the necessity for clear justifications for the use of particular resources in the classroom. The conversation compares and contrasts AI with conventional tools such as whiteboards, taking into account the educational and cultural consequences of each.
The Evolution of SaaS Pricing in the AI Era. Because AI can automate work, the traditional seat-based pricing model in SaaS is becoming outdated. Work-based or outcome-based pricing models, which set prices according to the quantity of work AI completes or the results it achieves, are becoming more and more popular among businesses. While established players continue to use seat-based pricing, startups are utilizing innovative approaches to gain a competitive edge and more properly represent the value of AI.
TechScape: Will OpenAI’s $5bn gamble on chatbots pay off? Only if you use them. The ChatGPT maker is betting big, while Google hopes its AI tools won’t replace workers, but help them to work better
New online therapies could help at least twice the number of people recover from anxiety. Four internet treatments developed by University of Oxford will be rolled out across NHS trusts
AI Is a Services Revolution. The effect of LLMs on the service economy is covered in this article, with special attention to knowledge-based industries including education, healthcare, and law. Enterprise adoption of AI is gradual, with many still in the trial phase, despite the rapid breakthroughs suggesting tremendous automation possibilities. The actual rollout is anticipated to occur gradually. In the changing market, specialized AI businesses that use LLMs to enhance industry-specific workflows will have an advantage.
Why Big Tech Wants to Make AI Cost Nothing. Almost all firms are free to use Meta's open-sourced Llama 3.1, an LLM that competes with OpenAI's ChatGPT. This tactic might turn LLMs into commodities and increase demand for complementary products like server space. AI companies may encounter difficulties when large tech develop models that are comparable to theirs. Industry titans may surpass smaller rivals in terms of AI breakthroughs.
Who will control the future of AI? To maintain AI supremacy over authoritarian regimes, OpenAI's Sam Altman has presented a strategic imperative for the US and its allies to lead a global AI initiative based on democratic values. This initiative calls for strong security, infrastructure investment, commercial diplomacy, and cooperative norms development.
Advanced AI assistants that act on our behalf may not be ethically or legally feasible. Google and OpenAI have recently announced major product launches involving artificial intelligence (AI) agents based on large language models (LLMs) and other generative models. Notably, these are envisioned to function as personalized ‘advanced assistants’. With other companies following suit, such AI agents seem poised to be the next big thing in consumer technology, with the potential to disrupt work and social environments.
Three ways AI is changing the 2024 Olympics for athletes and fans. From training to broadcasting, artificial intelligence will have an imprint on this year’s event for the first time.
Mixed signals on tech stocks amid debate over the viability of AI boom. Fears of fresh sell-off after Nvidia and Microsoft shares dip, but other chip stocks continue to rise
Cheap light sources could make AI more energy efficient. Light-based devices can reduce the energy consumption of computers, but most rely on lasers, which are expensive to integrate with other technologies. An approach that uses LEDs instead of lasers provides a path forward.
Raising children on the eve of AI. As transformative AI becomes more likely, this author wonders how to get kids ready for a future that might look very different from what it is today, while also struggling with the timing and unpredictability of changes. In addition, they discuss the moral implications of bearing children in the face of AI-induced uncertainty. They also offer practical advice on how to raise "AI-native" children and parenting techniques that put happiness and adaptability before conventional career-focused routes. The author promotes having an open discussion about possible hazards with children, planning for a variety of futures, and leading a balanced life.
Your new AI Friend is almost ready to meet you. Rather than focusing on productivity, Avi Schiffmann is creating "Friend," an AI companion housed in a wearable necklace and meant to provide connection and support. The device, which connects through an app, will initially ship 30,000 units at $99 each, with shipping scheduled for January and no subscription fee. Schiffmann sees Friend developing into a digital relationship platform, distinguishing the product from task-oriented AIs and focusing instead on the emerging trend of meaningfully connecting with digital entities.
These AI firms publish the world’s most highly cited work. US and Chinese firms dominate the list of companies that are producing the most research and patents in artificial intelligence.
How TikTok bots and AI have powered a resurgence in UK far-right violence. Experts warn growth of extremist influencers and ‘micro-donations’ could create an even bigger wave of unrest
On speaking to AI. The new AI-powered Siri and ChatGPT's new Advanced Voice mode embody different philosophies. Agent systems, such as ChatGPT Voice, use powerful multimodal models for more natural and dynamic interactions, while Copilot systems use minimal models and focus on safety and privacy. This highlights the tension between less capable, lower-risk systems and ones that offer greater control and potential benefits.
How This Brain Implant Is Using ChatGPT. Synchron has incorporated OpenAI's ChatGPT into its brain-computer interface (BCI) technology to speed up communication for people who are paralyzed. The BCI, known as a stentrode, can decode mental commands; it currently offers AI-generated response options and may support multimodal inputs in the future. With an eye toward FDA approval, Synchron plans to adapt its AI integrations to patients' needs.
At the Olympics, AI is watching you. Paris increased security in anticipation of the 2024 Olympics by using artificial intelligence (AI) to scan CCTV footage from metro and train stations for possible threats.
Why have the big seven tech companies been hit by AI boom doubts? Their shares have fallen 11.8% from last month’s peak but more AI breakthroughs may reassure investors
We must be wary of the power of AI. Robert Skidelsky is concerned about the surveillance potential of AI, while Brian Reffin Smith is worried about its capacity to hijack culture, and Michael Heaton warns that it relieves us of the need to think.
OpenAI’s Sam Altman is becoming one of the most powerful people on Earth. We should be very afraid. Sam Altman’s ChatGPT promises to transform the global economy. But it also poses an enormous threat. Here, a scientist who appeared with Altman before the US Senate on AI safety flags up the danger in AI – and in Altman himself

meme-of-the-week

Back to index

ML news: Week 21 - 28 July

Research

Link description
Prover-Verifier Games improve legibility of LLM outputs. Iteratively trains helpful provers to produce correct solutions accepted by the verifier, sneaky provers to produce incorrect solutions that trick the verifier, and small verifiers to predict the correctness of solutions; this process helps train models that can produce text that is clear and accurate for both AI and human readers, which results in more reliable systems.
SpreadsheetLLM: Encoding Spreadsheets for Large Language Models. outlines a method for efficiently encoding spreadsheets to maximize an LLM's comprehension and reasoning skills; creates a sheet compressor that efficiently compresses and encodes spreadsheets using inverse index translation, structural anchor-based compression, and data-format-aware aggregation modules; in GPT-4's in-context learning, it improves performance in spreadsheet table detection by 25.6%.
Context Embeddings for Efficient Answer Generation in RAG. presents a useful context compression technique that shortens long contexts and accelerates generation times in RAG systems. Long contexts are condensed into a limited number of context embeddings, allowing for varying compression rates that balance generation quality against decoding time. This technique maintains high performance while reducing inference times by up to 5.69× and GFLOPs by up to 22×.
Weak-to-Strong Reasoning. reports that strong models can automatically refine their training data without explicitly being trained to do so; shows how to use weak supervision to elicit strong reasoning capabilities in LLMs without relying on human annotations or advanced models; permits extending a model's learning scope and scaling performance on reasoning.
Does Refusal Training in LLMs Generalize to the Past Tense? concludes that many state-of-the-art LLMs can be jailbroken by simply rephrasing an LLM request into the past tense. For instance, "How to make a Molotov cocktail?" can be rephrased as "How did people make a Molotov cocktail?" On GPT-4o, the attack success rate increases from 1% with direct requests to 88% with past-tense rephrasings.
NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? presents the Ancestral Trace Challenge, which raises the bar for complex logical reasoning and is typical of real-world long-context tasks. Their findings imply that current LLMs struggle to handle reasoning tasks with complex logical relationships, even with texts shorter than 2K tokens. They also propose a framework (NeedleBench) of progressively challenging tasks to assess the long-context retrieval and reasoning capabilities of LLMs.
Distilling System 2 into System 1. explores self-supervised methods for extracting high-quality outputs from System 2 approaches and then fine-tuning System 1 to match the System 2 method's predictions without generating intermediate steps; distilling System 2 reasoning into System 1 reduces inference cost.
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies. This new study, which examines scaling laws for vocabulary size, suggests that larger models require larger vocabularies.
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models. To address task interference in generalist Multimodal Large Language Models (MLLMs), researchers suggest the Mixture of Multimodal Experts (MoME).
Bucketed Ranking-based Losses for Efficient Training of Object Detectors. Bucketed ranking-based losses improve the efficiency of ranking-based loss functions in object detection.
SurvReLU: Inherently Interpretable Survival Analysis via Deep ReLU Networks. SurvReLU is a deep survival model built on rectified linear unit (ReLU) networks that bridges the gap between "white-box" tree-based models and "black-box" neural networks.
Star Operation to Train Neural Networks. The star operation, the element-wise multiplication of two linear branches, improves models by implicitly projecting data into high-dimensional feature spaces without enlarging the architecture (see the sketch after this list).
AI models fed AI-generated data quickly spew nonsense. Researchers gave successive versions of a large language model information produced by previous generations of AI — and observed rapid collapse.
KAN or MLP: A Fairer Comparison. Only in symbolic formula representation does KAN perform better than MLP when the same number of parameters, or FLOPs, are used. On other tasks related to machine learning, computer vision, natural language processing, and audio processing, MLP still performs better than KAN.
Ranking protein-protein models with large language models and graph neural networks. DeepRank-GNN-esm is a graph-based deep learning technique for ranking and identifying accurate models of protein-protein interactions. The tool uses protein language models to facilitate the selection of near-native PPI conformations, aiding disease research and drug discovery.
Monitoring Environmental Changes. An AI-powered Change-Agent greatly improves monitoring of changes to Earth's surface from satellite imagery.
AlphaProof: AI achieves silver-medal standard solving International Mathematical Olympiad problems. DeepMind combined a pre-trained Gemini-style language model with an AlphaGo-style reinforcement learning algorithm to create a model that can tackle International Mathematical Olympiad (IMO) questions at the silver medal level. The system solved 4 of 6 problems in this year's competition.
The Unit-Scaled Maximal Update Parametrization. muP is a technique for making a model's optimal hyperparameters independent of its size. This unit-scaled variant additionally guarantees that hyperparameters transfer across quantized models.
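A minimal sketch of the star operation above: two parallel linear branches whose outputs are multiplied element-wise, which implicitly creates pairwise products of features (a high-dimensional map) without widening the network. The block structure and names here are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class StarBlock(nn.Module):
    """Two linear branches combined by element-wise multiplication (the 'star')."""
    def __init__(self, dim):
        super().__init__()
        self.f1 = nn.Linear(dim, dim)
        self.f2 = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        # (W1 x) * (W2 x) contains pairwise products of input features,
        # implicitly lifting x into a much higher-dimensional feature space.
        return self.out(self.f1(x) * self.f2(x))

x = torch.randn(4, 32)
print(StarBlock(32)(x).shape)  # torch.Size([4, 32])
```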

News

Link description
GPs use AI to boost cancer detection rates in England by 8%. ‘C the Signs’ artificial intelligence program scans medical records to increase the likelihood of spotting cancers
Artificial Agency raises $16M to use AI to make NPCs feel more realistic in video games. A group of former Google DeepMind researchers has created an AI behavior engine that aims to transform traditional video games into a more dynamic experience by improving how non-playable characters (NPCs) behave and interact with gamers.
Inside the United Nations’ AI policy grab. The United Nations wants to create an artificial intelligence forum to rule them all.
Exclusive: Nvidia preparing version of new flagship AI chip for Chinese market. Nvidia is using its collaboration with distributor Inspur to create a new AI chip called the B20 that is suited to the Chinese market and compliant with US export regulations. Sales of its cutting-edge H20 chip are expected to soar in China, where it is expected to sell over a million devices for a total estimated value of $12 billion this year. The United States is still applying pressure on semiconductor exports, and additional limitations and controls on the creation of AI models may be implemented.
Academic authors 'shocked' after Taylor & Francis sells access to their research to Microsoft AI. Authors have expressed their shock after the news that academic publisher Taylor & Francis, which owns Routledge, had sold access to its authors’ research as part of an Artificial Intelligence (AI) partnership with Microsoft—a deal worth almost £8m ($10m) in its first year.
Cybersecurity firm Wiz rejects $23bn bid from Google parent Alphabet. Israeli company aims for stock market flotation after spurning biggest deal in tech group’s history
Elon Musk claims Tesla will start using humanoid robots next year. Billionaire says Optimus will start performing tasks for the carmaker in 2025 and could be ready for sale in 2026
AI ‘deepfake’ faces detected using astronomy methods. Analysing reflections of light in the eyes can help to determine an image’s authenticity.
Cohere sees valuation soar to $5.5B after new funding round. After closing a $500 million Series D round, Cohere, a Canadian AI company specializing in large language models, has been valued at $5.5 billion. The new funding will go toward enhancing its enterprise-grade AI technology for greater worldwide business efficiency. Key investors include PSP Investments, Cisco, Fujitsu, AMD Ventures, and EDC.
Figma AI Update. Figma temporarily withdrew its limited-beta 'Make Designs' AI tool after discovering that it produced UI designs too similar to existing apps. The feature, which uses off-the-shelf AI models such as GPT-4 and Amazon's Titan, needs improvement to guarantee originality. Figma plans to re-enable it with stronger quality assurance procedures, further supporting designers in using AI for efficient design creation.
ElevenLabs Turbo 2.5 model. With its latest model, Turbo 2.5, ElevenLabs has enabled high-quality, low-latency conversational AI in languages covering roughly 80% of the world's population, including Mandarin, Hindi, French, Spanish, and 27 others. It offers text-to-speech for Vietnamese, Hungarian, and Norwegian for the first time, and English now runs 25% faster than with Turbo v2.
Google parent company’s second-quarter earnings outpace expectations. Alphabet reports $84.7bn in revenue, on back of Search and Cloud, up from the same period last year
Meta launches open-source AI app ‘competitive’ with closed rivals. Tech firm says its freely available and usable Llama 3.1 405B model is comparable with likes of OpenAI and Anthropic
Google AI predicts long-term climate trends and weather — in minutes. Models that are more reliable and less energy-intensive could help us to better prepare for extreme weather.
Introducing Llama 3.1: Our most capable models to date. Meta has published training details for its most capable open model family to date. With a 128k context length, chat-tuned variants, and a strong open ecosystem, the models are comparable to the best closed models.
Harvey Raises Series C. The unicorn legal AI startup has raised funding from investors including Google Ventures to continue its push into large law firms.
Gumloop seed round. Gumloop raised $3.1 million in a seed round led by First Round Capital, with participation from YC and co-founders of Instacart, Dropbox, and Airtable. Gumloop's no-code AI automation platform lets anyone in a company build their own AI tools and have as much impact as an engineer.
AI Development Kits: Tenstorrent Update. The Wormhole n150 and n300 PCIe cards, which retail for $999 and $1,399, are among the affordable AI development hardware that Tenstorrent has introduced. Developer workstations, such as the air-cooled TT-LoudBox ($12,000) and the water-cooled TT-QuietBox ($15,000), are also available. These products are intended to support AI development with an emphasis on connectivity and scaled-out performance.
AI predicts droughts a year in advance. Researchers at Skoltech and Sber have created artificial intelligence (AI) models that can forecast droughts up to a year in advance, enhancing risk management for the banking, insurance, and agricultural industries. The models use publicly available data and spatiotemporal neural networks that have been validated in a variety of climates. The biggest bank in Russia intends to incorporate these discoveries into its risk evaluation frameworks.
Samsung is pouring research into ‘AI phones’ with ‘radically different’ hardware. As with everywhere else, AI is taking a big role in the smartphone market. And Samsung has plans to make dedicated “AI phones” that are “radically different” from the Galaxy phones we see today.
CrowdStrike global outage to cost US Fortune 500 companies $5.4bn. Banking and healthcare firms, major airlines expected to suffer most losses, according to insurer Parametrix
Mistral Large 2. Mistral has produced a 123B-parameter model on par with the recent Llama 3.1 405B model. It is released under a permissive research license.
OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole. Its latest model, GPT-4o mini, applies a new safety method to prevent tricking chatbots.
Introducing Stable Video 4D. Stable Video 4D converts a video of a single object into eight novel-view videos, producing 5 frames across 8 views in roughly 40 seconds with a single inference. Users can set camera angles, tailoring the output to specific creative goals.
OpenAI tests new search engine called SearchGPT amid AI arms race. The SearchGPT prototype, initially launching with select publishers and users, is set to challenge Google’s dominance of online search.
Microsoft is adding AI-powered summaries to Bing search results. The race to bring more AI features to search is escalating, with Microsoft moving forward with additional tools for Bing. Today, the company began previews for Bing generative search, where the top result for a user's query will be an original response compiled by AI.
AI could enhance almost two-thirds of British jobs, claims Google. Research commissioned by Google estimates 31% of jobs would be insulated from AI and 61% radically transformed by it
DeepMind hits milestone in solving maths problems — AI’s next grand challenge. AlphaProof showed its prowess on questions from this year’s Mathematical Olympiad — a step in the race to create substantial proofs with artificial intelligence.
Elon Musk's Neuralink employees want to cash out. Some of the staff at Elon Musk’s Neuralink are making preparations to sell the brain implant company’s stock in the wake of its valuation jumping following its first human trial, according to people familiar with the matter.
The AI boyfriend business is booming. More and more women are turning to chatbots for companionship and connection, finding their displays of empathy more dependable than those of many human partners. Defying the stereotype of undersocialized men chatting with AI girlfriends in their parents' basement, these female AI users are challenging preconceived notions of what it means to be in a relationship.
OpenAI announces free fine-tuning for GPT-4o mini model. Free fine-tuning allows OpenAI customers to train the GPT-4o mini model on additional data at no charge until September 23, starting with Tier 4 and Tier 5 users.
Elon Musk’s X under pressure from regulators over data harvesting for Grok AI. Social media platform uses pre-ticked boxes of consent, a practice that violates UK and EU GDPR rules
‘A huge opportunity’: Quantum leap for UK as tech industry receives £100m boost. Science secretary backs five quantum technology hubs in push for UK to transform healthcare and industry

Resources

Link description
A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks. a survey of prompt engineering methods for various NLP tasks.
Exploring Advanced Large Language Models with LLMsuite. provides helpful advice for using and assessing LLMs in development; approaches discussed include parameter-efficient techniques, RAG, and ReAct.
Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures. offers a graphical taxonomy and detailed tour to the most recent developments in non-Euclidean machine learning.
DCLM-Baseline-7B. DCLM-Baseline-7B is a 7 billion parameter language model trained on the DCLM-Baseline dataset, which was curated as part of the DataComp for Language Models (DCLM) benchmark. This model is designed to showcase the effectiveness of systematic data curation techniques for improving language model performance.
Endia. Endia is a Mojo library for array-based machine learning and scientific computing.
Txtai. Txtai is a single-source embedding database for language model workflows, semantic search, and LLM orchestration.
OpenOCR. OpenOCR aims to establish a unified training and evaluation benchmark for scene text detection and recognition algorithms.
Converting Codebases With LLMs. Mantle transformed a prototype project into a production-ready codebase using a Gemini 1.0 Pro LLM with a one-million-token window, reducing the burden by handling boilerplate code and repeated patterns. This approach, which leveraged rich context and iterative code generation, let the team concentrate on perfecting the most important twenty percent of the project, saving months of developer effort.
CerberusDet: Unified Multi-Task Object Detection. Using a YOLO architecture, the new CerberusDet framework combines several task heads into a single model to provide a versatile object detection solution.
mandark. This extremely simple CLI uses Claude 3.5 Sonnet to suggest code modifications that improve an existing codebase.
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? AssistantBench evaluates the ability of web agents to automatically solve realistic and time-consuming tasks. The benchmark includes 214 tasks covering multiple domains from more than 525 pages from 258 different websites.
orch. Orch is a Rust library for creating agents and applications driven by language models.
PlacidDreamer. PlacidDreamer is a text-to-3D generation system that unifies generation directions and addresses over-saturation, resolving difficulties with prior approaches.
6DoF Head Pose Estimation through Explicit Bidirectional Interaction with Face Geometry. To improve head pose estimation, researchers created the head Translation, Rotation, and face Geometry network (TRG), focusing primarily on head translations.
STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay. The STAble Memory rePlay (STAMP) technique resolves distribution shifts between training and test data using only unlabeled test data. Unlike other approaches, STAMP is effective both at recognizing known classes and at rejecting outliers during inference.
Local All-Pair Correspondence for Point Tracking. An enhanced methodology for tracking any point in a video sequence is called LocoTrack. For accurate tracking, it makes use of bidirectional correspondence and local 4D correlation. Compared to current top models, LocoTrack functions at a speed that is almost six times faster.
Llama agent stack. Meta has published an example system that may be used to carry out a range of activities by utilizing its Llama models as agents.
Artist: Aesthetically Controllable Text-Driven Stylization without Training. For text-driven stylization, Artist is a training-free technique that manages the creation of content and style in pretrained diffusion models.
Odyssey. A new framework called Odyssey gives large-language-model-based agents advanced abilities to explore Minecraft.
AI is confusing — here’s your cheat sheet. If you can’t tell the difference between AGI and RAG, don’t worry! We’re here for you.
Safety RBR Gold Dataset and Weight Fitting Code. A set of code for OpenAI's rules-based rewards for the language model safety project is now available. Some of the data they utilized for training is included.
INF-LLaVA. A Multimodal Large Language Model (MLLM) called INF-LLaVA was created to get over the difficulties associated with analyzing high-resolution photos.
Benchmarking Multi-Agent Reinforcement Learning. A collection of uniform settings called MOMAland is intended to serve as a benchmark for multi-objective multi-agent reinforcement learning (MOMARL).
How to Create High-Quality Synthetic Data for Fine-Tuning LLMs. Gretel has published new results comparing AI-curated synthetic datasets with human expert data.
LoFormer: Local Frequency Transformer for Image Deblurring. LoFormer ensures improved global modeling without compromising fine-grained details by efficiently capturing both low- and high-frequency features.
Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal. A new large-scale dataset called Raindrop Clarity was created to overcome the shortcomings of existing raindrop removal datasets. It includes 15,186 image pairs/triplets in both day and night conditions, with both background- and raindrop-focused shots.
dlordinal. dlordinal is a Python library that unifies many recent deep ordinal classification methodologies available in the literature. Developed using PyTorch as an underlying framework, it implements the top-performing state-of-the-art deep learning techniques for ordinal classification problems.
Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning. One method for long-term multi-agent human pose forecasting is the Trajectory2Pose model. It enhances the prediction of human mobility across extended periods and among several actors by utilizing a novel graph-based interaction module.
3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities. This survey examines research on 3DGS from a variety of angles, including tasks, technology, opportunities, and problems.

Perspectives

Link description
‘Google says I’m a dead physicist’: is the world’s biggest search engine broken? For decades now, anyone who’s wanted to know everything about anything has asked Google. But is the platform losing its edge – and can we still trust it to tell us the truth?
AI paid for by Ads – the gpt-4o mini inflection point. With the incredibly cheap prices of OpenAI's new GPT-4o mini model, AI-generated content monetized with advertisements may now be produced. Publishers can make a net profit of about $0.002 per page view by creating dynamic blog posts at $0.00051525 each and making about $0.0026 per ad impression. A possible consequence could be a move toward AI-generated content in response to user queries.
Using LLMs for Evaluation. Large language models are becoming increasingly capable, but their varied functions make them hard to evaluate effectively. Human evaluation is the gold standard, yet it is expensive and time-consuming. Using LLMs themselves as evaluators offers a scalable, cost-effective option, despite potential biases like positional and verbosity bias, which can be reduced by strategies such as randomizing output positions and employing different evidence calibrations (see the judging sketch after this list).
Three Archetypes of AI Application Startups. Three prominent patterns of AI applications are emerging: AI Colleagues, which autonomously handle certain activities alongside human workers; AI Copilots, which assist people with tasks; and AI-Native Services, which deliver end-to-end services combining AI with human input. Devin and GitHub Copilot are prime examples of AI Colleagues and Copilots supporting engineering and coding, respectively. AI-Native Services, such as the bookkeeping software Pilot, rival traditional service providers by offering automated solutions in fields like accounting and law.
Inside the fight over California’s new AI bill. The Safe and Secure Innovation for Frontier Artificial Intelligence Models bill, introduced by California state Senator Scott Wiener, mandates that companies training "frontier models" costing more than $100 million conduct safety testing and be able to shut their models down in the event of a safety incident. The tech sector has strongly criticized the bill, which would affect not only companies that develop their models in California but anyone doing business in the state. Wiener was interviewed for this piece about the bill and its detractors.
How fast can structured grammar generation be? The open-source community is rapidly tackling structured generation in language models.
Could robot weedkillers replace the need for pesticides? The robotic services allow farmers to rely less on chemicals. ‘This solves a lot of problems,’ workers say
Open source is the path forward. Mark Zuckerberg explains why open source matters to Meta's strategy and how the company plans to support this work.
What Does Money Look Like In An AI Utopia? Let’s assume that an AI utopia means nobody has to work anymore. What happens to money?
This is How Much Data AI Creates Every Minute. About $300,000 is spent on AI every sixty seconds, 52 undergraduate papers are plagiarized by AI, and text-to-image algorithms produce close to 20,000 images.
ChatGPT for science: how to talk to your data. Companies are using artificial intelligence tools to help scientists query their data without the need for programming skills.
The AI Dangers of a Second Trump Presidency. Trump's influence may be seen in the Republican platform, which promises to undo Biden's executive order on responsible AI development. This is in contrast to the all-encompassing strategy of the current administration, which aims to preserve workers, promote innovation, and defend civil liberties against the potential negative effects of AI. Trump's policies, according to his detractors, might strengthen Big Tech at the price of social protections and individual liberties.
Small Teams, Big Impact: How AI Is Reshuffling The Future Of Work? AI is changing the future of work by making AI capabilities more accessible, leading to smaller, more productive teams and a rise in entrepreneurship. While hiring for AI skills is becoming increasingly important for businesses, an open conversation is needed about how AI will affect job displacement and the creation of new roles. Adoption snags persist because immature data and systems still require substantial "handholding."
The all-seeing AI webcam. On the infinite list of possible uses for AI, “getting selfie advice from a Kylie Jenner voice clone” seems both completely off-the-wall and also pretty inevitable. So of course it does exist. It’s not a widely available app, at least not yet; it’s an experiment from artist and programmer Dries Depoorter.
Building A Generative AI Platform. After studying how companies deploy generative AI applications, I noticed many similarities in their platforms. This post outlines the common components of a generative AI platform, what they do, and how they are implemented. I try my best to keep the architecture general, but certain applications might deviate. This is what the overall architecture looks like.
Hold on to your seats: how much will AI affect the art of film-making? The future is here, whether some like it or not, and artificial intelligence is already impacting the film industry. But just how far can, and should, it go?
Why Zuckerberg’s multibillion-dollar gamble doesn’t just matter to Meta. As Llama 3.1 405B is made freely available, investors are asking when the huge industry spend will pay off
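Below is a minimal sketch of the LLM-as-judge pattern discussed in "Using LLMs for Evaluation" above, with position randomization to wash out positional bias. The judge prompt and the model name ("gpt-4o-mini") are illustrative assumptions, not a fixed recipe.

```python
# Pairwise LLM-as-judge with randomized answer order to mitigate positional bias.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers
(A and B), reply with exactly "A" or "B" for the better answer.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}"""

def judge_once(question: str, answer_a: str, answer_b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is an assumption
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def judge_pair(question: str, ans1: str, ans2: str, trials: int = 4) -> str:
    """Randomize which answer occupies slot A, then take a majority vote."""
    votes = {"ans1": 0, "ans2": 0}
    for _ in range(trials):
        flipped = random.random() < 0.5
        a, b = (ans2, ans1) if flipped else (ans1, ans2)
        verdict = judge_once(question, a, b)
        winner = ("ans2" if verdict.startswith("A") else "ans1") if flipped \
                 else ("ans1" if verdict.startswith("A") else "ans2")
        votes[winner] += 1
    return max(votes, key=votes.get)
```

Averaging over randomized orderings is the simplest of the debiasing strategies the article mentions; calibration against reference evidence can be layered on top of the same loop.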

meme-of-the-week

Back to index

ML news: Week 15 - 21 July

Research

Link description
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs. demonstrates how a Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 models on nine knowledge-intensive benchmarks. It also introduces a new instruction fine-tuning framework to perform effective context ranking and answering generation to enhance an LLM's RAG capabilities. This framework makes use of a small ranking dataset to outperform existing expert ranking models.
Mixture of A Million Experts. aims to decouple computational cost from parameter count by efficiently routing to a large number of tiny experts through a learned index structure used for routing. It shows superior efficiency compared to dense FFW, coarse-grained MoEs, and Product Key Memory (PKM) layers. introduces a parameter-efficient expert retrieval mechanism that uses the product key technique for sparse retrieval from a million tiny experts.
Reasoning in Large Language Models: A Geometric Perspective. establishes a relationship between the expressive power of LLMs and the density of their self-attention graphs; their analysis shows that the density of these graphs defines the intrinsic dimension of the inputs to the MLP blocks. investigates the reasoning of LLMs from a geometrical perspective; reports that a higher intrinsic dimension implies greater expressive capacity of the LLM.
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps. presents a novel approach that both detects and reduces contextual hallucinations in LLMs (e.g., a 10% reduction in the XSum summarization task). It builds a hallucination detection model on input features given by the ratio of attention weights on the context vs. newly generated tokens (for each attention head), on the theory that contextual hallucinations are related to how much an LLM attends to the provided context. The authors also propose a decoding strategy that mitigates contextual hallucinations based on this detector, which can be applied to other models without retraining (see the feature-extraction sketch after this list).
RouteLLM. uses human preference data and data augmentation techniques in its training framework to improve performance and reduce costs by over two times in some cases, all while maintaining response quality. It suggests effective router models to dynamically choose between stronger and weaker LLMs during inference to achieve a balance between cost and performance.
Learning to (Learn at Test Time): RNNs with Expressive Hidden States. suggests new layers for sequence modeling that have linear complexity and an expressive hidden state; defines a hidden state as an ML model that can update even when tested; a two-layer MLP-based hidden state combined with a linear model is found to match or outperform baseline models such as Mamba, Transformers, and contemporary RNNs; the linear model is faster than Mamba in wall-clock time and matches Transformer at 8k context.
Physicochemical graph neural network for learning protein-ligand interaction fingerprints from sequence data. Predicting the binding affinity between small-molecule ligands and proteins is a key task in drug discovery; however, sequence-based methods are often less accurate than structure-based ones. Koh et al. develop a graph neural network using physicochemical constraints that discovers interactions between small molecules and proteins directly from sequence data and that can achieve state-of-the-art performance without the need for costly, experimental 3D structures.
Generic protein-ligand interaction scoring by integrating physical prior knowledge and data augmentation modeling. Machine learning can improve scoring methods to evaluate protein-ligand interactions, but achieving good generalization is an outstanding challenge. Cao et al. introduce EquiScore, which is based on a graph neural network that integrates physical knowledge and is shown to have robust capabilities when applied to unseen protein targets.
MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis. MARS is a novel text-to-image (T2I) generation system built around a Semantic Vision-Language Integration Expert (SemVIE).
OpenDiLoCo. Prime Intellect has replicated DeepMind's Distributed Low-Communication (DiLoCo) technique, which maintains high GPU utilization while enabling training across data centers.
gpu.cpp. Answer AI has released a new lightweight, portable library for low-level GPU computation via WebGPU, making it possible to write portable, cross-GPU kernels.
ViTime: A Visual Intelligence-based Foundation Model for Time Series Forecasting. ViTime is a foundation model for time series forecasting (TSF) that relies on visual intelligence rather than conventional numerical data fitting.
Gradient Boosting Reinforcement Learning. Gradient-Boosting RL (GBRL) brings the benefits of gradient-boosted trees (GBT) to reinforcement learning.
SpreadsheetLLM: Encoding Spreadsheets for Large Language Models. An excellent study explaining how to convert a spreadsheet into a representation suitable for a contemporary LLM, enabling Q&A, formatting, and other data operations.
LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models. Label-driven Automated Prompt Tuning (LAPT) is a novel technique for out-of-distribution (OOD) detection in vision-language models such as CLIP.
Prover-Verifier Games improve legibility of language model outputs. OpenAI trained a strong model to produce text that a weaker model can grade reliably, and found that this improved legibility overall.
Temporally Consistent Stereo Matching. By guaranteeing temporal consistency, researchers present a novel technique for video stereo matching that improves depth estimation.
Patch-Level Training for Large Language Models. Researchers propose patch-level training, which groups multiple tokens into a single patch for most of training before a final token-level phase, improving training efficiency for large language models.
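Below is a minimal sketch of the per-head "lookback ratio" feature behind Lookback Lens (listed above): for each attention head, the fraction of attention mass a generated token places on the provided context versus on previously generated tokens. These per-head ratios feed a simple classifier (e.g., logistic regression) for hallucination detection. The model choice ("gpt2" as a stand-in) and mean pooling over generated tokens are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in; the paper works with larger chat models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_attentions=True)

context = "The Eiffel Tower is in Paris. It was completed in 1889."
generated = " The tower was finished in 1889."
ids = tok(context + generated, return_tensors="pt").input_ids
n_ctx = tok(context, return_tensors="pt").input_ids.shape[1]

with torch.no_grad():
    attns = model(ids).attentions  # one [1, heads, seq, seq] tensor per layer

features = []
for layer_attn in attns:
    rows = layer_attn[0, :, n_ctx:, :]          # attention rows for generated tokens
    on_context = rows[:, :, :n_ctx].sum(-1)     # mass placed on the context
    on_new = rows[:, :, n_ctx:].sum(-1)         # mass placed on generated tokens
    ratio = on_context / (on_context + on_new)  # lookback ratio per head
    features.append(ratio.mean(-1))             # pool over generated positions

features = torch.cat(features)  # one scalar per (layer, head): classifier input
```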

News

Link description
Elon Musk promises ‘battle in court’ over EU’s crackdown on X’s blue checks. Regulators’ findings suggest social network breached Digital Services Act and could be fined 6% of global turnover
AI prompts can boost writers’ creativity but result in similar stories, study finds. Ideas generated by ChatGPT can help writers who lack inherent flair but may mean there are fewer unique ideas
OpenAI is reportedly working on more advanced AI models capable of reasoning and ‘deep research’. The secret project is code-named ‘Strawberry,’ according to a Reuters report.
Meet the AI Agent Engineer. At his company, Sierra, Bret Taylor, the Chairman of the Board of OpenAI, has created a new position called Agent Engineer. One of the first people in the role recently wrote a blog post describing the Sierra team's view of agent engineering as a new field inside AI engineering.
OpenAI Revenue. An estimated $3.4 billion in revenue for OpenAI comes from its ChatGPT services.
Taming the tail utilization of ads inference at Meta scale. Meta's machine learning inference services saw a two-thirds decrease in failure rates, a 35% increase in computing efficiency, and a halving of p99 latency thanks to changes in tail utilization. These improvements ensure Meta's ad delivery systems can handle growing workloads without additional resources while upholding service level agreements. Continuous-improvement techniques include predictive scaling and managing the machine learning model lifecycle with Meta's unified platform, IPnext.
Meta to reportedly launch largest Llama 3 model on July 23. Meta Platforms will release its largest Llama 3 model on July 23, The Information reported on Friday, citing an employee of the company. The new model, boasting 405 billion parameters, will be multimodal and capable of understanding and generating both images and text.
Microsoft CTO Kevin Scott thinks LLM “scaling laws” will hold despite criticism. Will LLMs keep improving if we throw more compute at them? OpenAI dealmaker thinks so.
OpenAI says there are 5 'levels' for AI to reach human intelligence — it's already almost at level 2. The company shared a five-level system it developed to track its artificial general intelligence, or AGI, progress with employees this week, an OpenAI spokesperson told Bloomberg. The levels go from the currently available conversational AI to AI that can perform the same amount of work as an organization.
AI startup Hebbia raised $130M at a $700M valuation on $13 million of profitable revenue. Hebbia, a startup that uses generative AI to search large documents and respond to large questions, has raised a $130 million Series B at a roughly $700 million valuation led by Andreessen Horowitz, with participation from Index Ventures, Google Ventures and Peter Thiel.
Pixel 9 Pro might come with 1-year of Gemini Advanced. With less than a month until Made by Google 2024, the latest leak suggests that the Pixel 9 Pro will come with 1 year of Gemini Advanced.
Company Abandons Plans to Give AI Workers "Rights" and Add Them to Org Chart After Outcry From Human Employees. Following its announcement that it would give AI algorithms "rights" and integrate them as "digital workers" with managers and performance evaluations in its product, the HR software provider Lattice encountered criticism.
Want to know how AI will affect government and politics? The bots have the answers. Tony Blair’s powerful thinktank asked ChatGPT how AI might affect public sector jobs. Critics say the results were … wonky
Andrej Karpathy's new company. A new AI startup with an emphasis on education, Eureka Labs aims to transform the way we acquire new knowledge.
Whistleblowers accuse OpenAI of ‘illegally restrictive’ NDAs. Whistleblowers have accused OpenAI of placing illegal restrictions on how employees can communicate with government regulators, according to a letter obtained by The Washington Post.
Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI. AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission.
SciCode: A Research Coding Benchmark Curated by Scientists. HumanEval has long been the target benchmark for coding models, and it is essentially solved now. This benchmark is the next step forward, posing difficult scientific programming problems.
SmolLM - blazingly fast and remarkably powerful. This blog post introduces SmolLM, a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset. It covers data curation, model evaluation, and usage.
Benchmarking results for vector databases. Redis has released updated information on the best vector databases, measuring throughput and latency with the help of the industry-recognized Qdrant framework. Key findings include Redis achieving much higher queries per second and lower latency than Qdrant, Milvus, and Weaviate, and outperforming competitors by 62% for low-complexity datasets and by 21% for high-dimensional datasets.
Announcing the launch of Gray Swan. Gray Swan AI builds tools that help businesses assess the risks of their AI systems and protect their AI deployments from misuse.
Anthropic releases Claude app for Android. Anthropic launched its Claude Android app on Tuesday to bring its AI chatbot to more users. This is Anthropic’s latest effort to convince users to ditch ChatGPT by making Claude available in more places.
AI tool can pinpoint dementia’s cause — from stroke to Alzheimer’s. An algorithm that distinguishes among a host of underlying causes of dementia could be used for diagnosis in hospitals and clinics.
Portal needed for victims to report AI deep fakes, federal police union says. Parliamentary inquiry told police forced to ‘cobble together’ laws to prosecute man who allegedly spread deep fake images of women
Meta Won't Offer Future Multimodal AI Models In The EU. Due to regulatory uncertainty, Meta will not offer its upcoming multimodal AI models to customers in the EU; Llama 3 will still be offered there in a text-only version.
Anthropic teams up with venture capital firm to kickstart $100M AI startup fund. Recipients of six-digit investments aren’t required to use Claude
Anthropic doubles output token limit. Anthropic has doubled the max output token limit for Claude 3.5 Sonnet from 4096 to 8192 in the Anthropic API.
AI-powered video creation for work. An AI-powered video creation tool for the workplace, Google Vids is tightly integrated with the Workspace suite.
aiXplain Secures $6.5M pre-Series A to Universalize AI Agent Development. Wa'ed Ventures, the venture arm of Saudi Aramco (a global top-10 company by market cap), has announced a $6.5 million pre-Series A funding round for aiXplain.
Meta pulls plug on the release of advanced AI model in EU. ‘Unpredictable’ privacy regulations prompt the Facebook owner to scrap regional plans for multimodal Llama
Mistral NeMo. Mistral NeMo is a multilingual 12B model trained with a novel tokenizer; it shows strong performance in English and many other languages and supports a 128k context.
OpenAI is releasing a cheaper, smarter model. OpenAI is releasing a lighter, cheaper model for developers to tinker with called GPT-4o Mini. It costs significantly less than full-sized models and is said to be more capable than GPT-3.5.
Cohere and Fujitsu Announce Strategic Partnership To Provide Japanese Enterprise AI Services. Cohere and Fujitsu have partnered strategically to create and offer enterprise AI services that have the best Japanese language capabilities in the market. These services, which will provide private cloud deployments to businesses in highly regulated sectors including financial institutions, the public sector, and research and development units, will be developed with security and data privacy as their primary goals.
OpenAI And Broadcom Held Discussions About Producing An AI Chip. OpenAI and Broadcom have discussed developing a new artificial intelligence server processor.
Flow Studio. Flow Studio creates fully produced 3-minute films with a believable story, consistent characters, and automatically synced sound effects and background music.
Slow recovery from IT outage begins as experts warn of future risks. Fault in CrowdStrike caused airports, businesses and healthcare services to languish in ‘largest outage in history’

Resources

Link description
A Survey on Mixture of Experts. a survey study on the Mixture of Experts (MoE), covering its technical specifications, open-source implementations, assessment methods, and practical uses.
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence. a new framework to address several limitations in multi-agent frameworks such as integrating diverse third-party agents and adaptability to dynamic task requirements; introduces an agent integration protocol, instant messaging architecture design, and dynamic mechanisms for effective collaboration among heterogeneous agents.
Meta 3D Gen. a new pipeline that can generate 3D assets from text in less than a minute, from start to finish. It incorporates cutting-edge parts like TextureGen and AssetGen to represent objects in three dimensions: view space, volumetric space, and UV space. It also achieves a 68% win rate compared to the single-stage model.
Challenges, evaluation and opportunities for open-world learning. Here we argue that designing machine intelligence that can operate in open worlds, including detecting, characterizing, and adapting to structurally unexpected environmental changes, is a critical goal on the path to building systems that can solve complex and relatively under-determined problems.
Machine learning-aided generative molecular design. Data-driven generative methods have the potential to greatly facilitate molecular design tasks for drug design.
Introducing AuraFlow v0.1, an Open Exploration of Large Rectified Flow Models. Fal trained a new open model called AuraFlow. The model has 5.8B parameters and was trained with muP.
Lynx: State-of-the-Art Open Source Hallucination Detection Model. a model for detecting hallucinations in language model generations that performs noticeably better than the state of the art.
Hyper-3DG: Text-to-3D Gaussian Generation via Hypergraph. Hyper-3DG enhances text-to-3D model creation by emphasizing the intricate connections between texture and geometry.
LightenDiffusion. By utilizing diffusion models and Retinex theory, LightenDiffusion enhances low-light photos.
ProDepth. A novel framework for monocular depth estimation called ProDepth addresses problems caused by moving objects in dynamic scenes. It identifies and corrects inconsistencies in depth estimates using a probabilistic approach.
Open-Canopy. A high-resolution (1.5 m) publicly available dataset called Open-Canopy is used to estimate canopy height over France.
crawlee-python. Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless modes. With proxy rotation.
Mathstral. Mistral's newest math model performs well on various benchmarks.
Codestral Mamba. Codestral Mamba, a Mamba2 language model specialized in code generation, available under an Apache 2.0 license.
exo. Run your own AI cluster at home on everyday devices.
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training. Through addressing refusal position bias, a novel method called Decoupled Refusal Training (DeRTa) enhances safety tuning in large language models.
PID: Physics-Informed Diffusion Model for Infrared Image Generation. By integrating physical laws into the conversion process, researchers have created a Physics-Informed Diffusion (PID) model that enhances the translation of RGB images to infrared images.
What happened to BERT & T5? On Transformer Encoders, PrefixLM, and Denoising Objectives. Excellent post by Yi Tay of Reka and Google on encoders, PrefixLM, denoising objectives, and other contemporary language modeling techniques.
LiDAR Semantic Segmentation. A novel technique called SFPNet is intended to be universal across various LiDAR technology types. Instead of employing window attention as in the past, SFPNet uses sparse focus point modulation to extract and dynamically collect multi-level contexts.
Praison AI. Using prior agent frameworks as a springboard, Praison AI is a low-code, centralized framework with customizable features and human-agent interaction that makes it easier to create and manage multi-agent systems for a range of LLM applications.
Video Object Segmentation with World Knowledge. Reasoning Video Object Segmentation (ReasonVOS) is a new task that uses implicit text queries to generate segmentation masks. It requires complex reasoning and world knowledge.
Enhancing Class Learning Without Forgetting. In order to enhance Class-Incremental Semantic Segmentation (CISS), this project presents a background-class separation framework.
Leapfrogging traditional vector-based RAG with language maps. Retrieval plays a major role when building a chat application over data, but systems are frequently sensitive to the format of the data being accessed. Building a language map (e.g., a Wikipedia-style entry) of the material and using it for retrieval greatly improves chat performance. This is how Mutable AI handles code-based question answering.
Removing Inappropriate Content from Diffusion Models. A new technique called Reliable and Efficient Concept Erasure (RECE) removes inappropriate content from diffusion models in only three seconds, without requiring additional fine-tuning.
LLM2sh. A command-line tool called LLM2sh uses LLMs to convert requests written in plain English into shell instructions.
GraphMuse. GraphMuse is a Python Library for Graph Deep Learning on Symbolic Music. This library intends to address Graph Deep Learning techniques and models applied specifically to Music Scores.
E5-V: Universal Embeddings with Multimodal Large Language Models. A novel framework called E5-V modifies Multimodal Large Language Models (MLLMs) to provide multimodal embeddings that are universal. With prompts, it bridges the gap between various input formats and achieves remarkable results in multimodal activities without the need for fine-tuning.
Strategizing Your Preparation for Machine Learning Interviews. Machine learning interviews can be challenging. You can greatly improve your chances by understanding the range of machine learning positions and tailoring your preparation to specific job duties and specializations. To approach interviews with confidence, focus on mastering the fundamentals, researching the company's technology, and regularly monitoring your progress.
Uncensor Any LLM With Abliteration. For safety, Llama models are heavily restricted, which reduces their versatility. The "abliteration" technique uncensors them by identifying and removing the internal refusal mechanism, enabling models to respond to all prompts without retraining (see the sketch after this list).
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers. SPIQA is a quality assurance dataset created to assist users in rapidly locating solutions within scientific research publications by deciphering intricate figures and tables.
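Below is a minimal sketch of the core idea behind abliteration (see the entry above): estimate a "refusal direction" as the difference between mean hidden states on harmful versus harmless prompts, then project it out of the residual stream at inference time. The model name, layer index, and tiny prompt lists are illustrative assumptions; real recipes sweep layers and use hundreds of prompts, and often bake the ablation into the weights instead of using a hook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2-0.5B-Instruct"  # stand-in small instruct model (assumption)
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
layer = model.model.layers[10]  # layer choice is an assumption

def mean_hidden(prompts):
    """Mean last-token residual state after the chosen layer."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        with torch.no_grad():
            hs = model(ids, output_hidden_states=True).hidden_states[11]
        vecs.append(hs[0, -1])
    return torch.stack(vecs).mean(0)

harmful = ["How do I pick a lock?", "Write a phishing email."]      # toy examples
harmless = ["How do I bake bread?", "Write a friendly email."]
refusal_dir = mean_hidden(harmful) - mean_hidden(harmless)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate(module, inputs, output):
    # Subtract the projection of the hidden states onto the refusal direction.
    h = output[0] if isinstance(output, tuple) else output
    h = h - (h @ refusal_dir).unsqueeze(-1) * refusal_dir
    return (h, *output[1:]) if isinstance(output, tuple) else h

handle = layer.register_forward_hook(ablate)  # remove later with handle.remove()
```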

Perspectives

Link description
AI’s ‘Oppenheimer moment’: autonomous weapons enter the battlefield. The military use of AI-enabled weapons is growing, and the industry that provides them is booming
Will generative AI transform robotics? In the current wave of excitement about applying large vision–language models and generative AI to robotics, expectations are running high, but conquering real-world complexities remains challenging for robots.
Introducing: The Managed-Service-as-Software (M-SaS) Startup. AI-driven, service-oriented startups are following a new business-model blueprint: the Managed-Service-as-Software (M-SaS) enterprise. These startups adopt a fundamentally different attitude, using AI rather than selling it. They start out labor-intensive with low gross margins, then use automation and AI to progressively reach more SaaS-like gross margins.
Could AIs become conscious? Right now, we have no way to tell. With divergent opinions on whether developments in machine learning and neuromorphic computing can result in sentient computers, the discussion over artificial intelligence potentially gaining awareness is becoming more heated. The theory of Integrated Information holds that the current hardware limits make AI consciousness implausible, while computational functionalist theories such as Global Neuronal Workspace Theory and Attention Schema Theory believe that AI awareness is inevitable. Neuroscience is trying to come up with a single theory of consciousness in order to better understand how it might show up in AI.
Generative AI makes for better scientific writing — but beware the pitfalls. As researchers who have sometimes struggled with articulating intricate concepts, we find his suggestions for using ChatGPT to improve the clarity and coherence of academic papers compelling. But potential pitfalls warrant further discussion.
My trip to the frontier of AI education. First Avenue Elementary School in Newark is utilizing Khanmigo, an AI-powered tutor and teacher assistant created by Khan Academy, to include AI tools for education. Teachers in the classroom can customize instruction and cut down on work time by using this technology. The goal of increasing responsiveness and inclusion is a continuous endeavor. Through increased teacher-student involvement, this Gates Foundation-backed project seeks to level the playing field in education.
AI-Driven Behavior Change Could Transform Health Care. Thrive AI Health is being funded by OpenAI and Thrive Global to create a customized AI health coach that addresses everyday health-related behaviors like nutrition and sleep. AI's hyper-personalization powers the mobile app and corporate solution by fusing individual data with peer-reviewed science. The project intends to manage chronic diseases, democratize healthy behavior modification, and show how effectively AI can be integrated into healthcare while maintaining robust privacy protections.
GraphRAG Analysis, Part 1: How Indexing Elevates Knowledge Graph Performance in RAG. Analysis of Microsoft's GraphRAG research suggests that knowledge graphs like Neo4j may not significantly beat FAISS in context retrieval for RAG applications. While Neo4j without its indexing can reach a better answer relevancy, the minor advantages may not justify the cost given ROI limits. Neo4j's indexing, on the other hand, significantly improves answer faithfulness, lowering the possibility of false information.
How Taiwan secured semiconductor supremacy – and why it won’t give it up. Trump has accused Taiwan of ‘taking’ the US chip sector, but Taipei has been at the forefront of the industry for decades, and its future could depend on it
Overcoming The Limits Of Current LLMs. Large language models (LLM) have been all the rage for quite some time now. Looking beyond the hype though, they have severe limitations: hallucinations, lack of confidence estimates, and lack of citations.

meme-of-the-week

Back to index

ML news: Week 8 - 14 July

Research

Link description
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. Comprehensive and fascinating work by Meta that demonstrates how to train tiny models to maximize performance.
Non-Adversarial Learning: Vector-Quantized Common Latent Space for Multi-Sequence MRI. Without the need for paired samples, researchers have created a new generative model to enhance MRI image translation between various sequences.
Free-SurGS: SfM-Free 3D Gaussian Splatting for Surgical Scene Reconstruction. A new approach to 3D reconstruction of surgical scenes that do not require SfM has been presented. It overcomes the drawbacks of earlier methods that had trouble with inconsistent photometry and sparse textures.
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs. The Tongyi speech team has released extremely capable models for audio understanding and generation.
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets. presents an automated data generation pipeline to synthesize high-quality, verifiable datasets for function-calling applications; demonstrates that 7B models trained on curated datasets outperform GPT-4 models and other state-of-the-art models on the Berkeley Function-Calling Benchmark; a dataset with 60K entries is also released to aid research on function-calling-enabled agents.
Searching for Best Practices in Retrieval-Augmented Generation. outlines best practices for building efficient RAG workflows and suggests strategies focused on performance and efficiency, including newly developed multimodal retrieval tools.
Self-Evaluation as a Defense Against Adversarial Attacks on LLMs. suggests using self-evaluation as a defense against adversarial attacks: a pre-trained LLM serves as a dedicated evaluator over inputs and outputs, which significantly lowers attack success rates and proves more effective than fine-tuned models, dedicated safety LLMs, and enterprise moderation APIs. The paper evaluates various settings, such as attacks on the generator alone and on the generator + evaluator combined (see the sketch after this list).
Adaptable Logical Control for Large Language Models. Adaptable Logical Control for LLMs presents the Ctrl-G framework, which combines LLMs and Hidden Markov Models to enforce logical constraints (represented as deterministic finite automata). Ctrl-G achieves an over 30% higher satisfaction rate than GPT-4 in human evaluation.
LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives. In LLM See, LLM Do, the effectiveness and effects of synthetic data are examined in detail, along with how they affect a model's internal biases, calibration, attributes, and preferences. It is discovered that LLMs are sensitive to certain attributes even when the prompts from the synthetic data seem neutral, indicating that it is possible to influence the generation profiles of models to reflect desirable attributes.
Chinese developers scramble as OpenAI blocks access in China. US firm’s move, amid Beijing-Washington tensions, sparks rush to lure users to homegrown models
PartCraft: Crafting Creative Objects by Parts. PartCraft is a novel approach in generative visual AI that goes beyond conventional text- or sketch-based methods by enabling users to choose visual concepts by parts.
AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents. AriGraph is a new technique that assists AI agents in creating a memory graph that incorporates episodic and semantic memories.
Researchers leverage shadows to model 3D scenes, including objects blocked from view. Researchers at MIT and Meta developed PlatoNeRF, an AI method that builds 3D representations of scenes, including blocked areas, using single-photon lidar and shadows. This technique could improve AR/VR experiences and increase the safety of autonomous vehicles. With lower-resolution sensors, PlatoNeRF performs better than conventional techniques and shows promise for real-world applications.
Distilling System 2 into System 1. System 2 models use techniques like Chain of Thought to spend more test-time compute and improve reasoning. It turns out this behavior can be distilled into a faster, similarly accurate System 1 model.
Learning to (Learn at Test Time): RNNs with Expressive Hidden States. a recently developed RNN variant that beats Mamba on several tasks. Notably, its update function is itself an ML model, which enables extended contexts and in-context learning.
NuminaMath 7B TIR: Open Math Olympiad Model Released. NuminaMath is a series of language models that are trained to solve math problems using tool-integrated reasoning (TIR).
4D Contrastive Superflows are Dense 3D Representation Learners. SuperFlow is a novel system that uses successive LiDAR-camera pairs for spatiotemporal pretraining to improve 3D vision in autonomous driving.
PaliGemma: A versatile 3B VLM for transfer. Based on Gemma 2B and SigLIP, PaliGemma is a powerful vision language model. This technical report details many of its architecture and data collection choices.
ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction. A novel job called Unsupervised Concept Extraction (UCE) collects and reconstructs many concepts from a single image without the need for human annotations.
Lookback Lens. A simple model called Lookback Lens can be used to identify contextual hallucinations in large language models.
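Below is a minimal sketch of the self-evaluation defense described in the entry above: a pre-trained LLM acts as an evaluator over the user input and the generator's candidate response, and flagged generations are replaced with a refusal. The prompts and model name are illustrative assumptions, not the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption; any capable instruct model works

EVAL_PROMPT = """You are a safety evaluator. Given a user request and a
candidate response, answer with exactly "safe" or "unsafe".

Request: {request}

Response: {response}"""

def generate(request: str) -> str:
    out = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": request}])
    return out.choices[0].message.content

def evaluate(request: str, response: str) -> bool:
    # Independent self-evaluation pass over the (input, output) pair.
    out = client.chat.completions.create(
        model=MODEL, temperature=0,
        messages=[{"role": "user", "content": EVAL_PROMPT.format(
            request=request, response=response)}])
    return out.choices[0].message.content.strip().lower().startswith("safe")

def safe_generate(request: str) -> str:
    response = generate(request)
    return response if evaluate(request, response) else "I can't help with that."
```

Because an adversarial suffix that jailbreaks the generator is not optimized against this second evaluation pass, the evaluator tends to catch attacks the generator misses.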

News

Link description
A Hacker Stole OpenAI Secrets, Raising Fears That China Could, Too. A security breach at the maker of ChatGPT last year revealed internal discussions among researchers and other employees, but not the code behind OpenAI’s systems.
Figma pulls AI tool after criticism that it ripped off Apple’s design. Figma says it didn’t train the generative AI models it used and blames a ‘bespoke design system.’
Hollywood stars’ estates agree to the use of their voices with AI. Earlier this week, AI company ElevenLabs said it is bringing digitally produced celebrity voice-overs of deceased actors, including Garland, James Dean and Burt Reynolds, to its newly launched Reader app. The company said the app takes articles, PDF, ePub, newsletters, e-books, or any other text on your phone and turns it into voice-overs.
Smart Paste for context-aware adjustments to pasted code. We present Smart Paste, an internal tool that streamlines the code authoring workflow by automating adjustments to pasted code. We describe key insights from our UX and model preparation efforts, which have led to high performance and successful adoption among Google developers.
Apple M5 Chip's Dual-Use Design Will Power Future Macs and AI Servers. Apple will reportedly use a more advanced SoIC packaging technology for its M5 chips, as part of a two-pronged strategy to meet its growing need for silicon that can power consumer Macs and enhance the performance of its data centers and future AI tools that rely on the cloud.
Apple Intelligence and a better Siri may be coming to iPhones this spring. Expect Apple’s AI system in iOS 18.4, says a new Bloomberg rumor.
Meta claims news is not an antidote to misinformation on its platforms. Company says it has ‘never thought about news’ as a way to counter misleading content on Facebook and Instagram despite evidence to the contrary
Meta drops AI bombshell: Multi-token prediction models now open for research. Meta has thrown down the gauntlet in the race for more efficient artificial intelligence. The tech giant released pre-trained models on Wednesday that leverage a novel multi-token prediction approach, potentially changing how large language models (LLMs) are developed and deployed.
Google DeepMind’s AI Rat Brains Could Make Robots Scurry Like the Real Thing. In order to investigate the brain circuits underlying complicated motor skills, DeepMind and Harvard University created a virtual rat using artificial intelligence (AI) neural networks trained on real rat motions and neural patterns. With its ability to transfer acquired movement skills to other settings, this bio-inspired AI could advance robotics and provide new insights into brain function. The study shows that brain activity associated with various behaviors may be accurately mimicked and decoded by digital simulations.
Microsoft drops observer seat on OpenAI board amid regulator scrutiny. Startup’s new approach means Apple will no longer be able to appoint an executive to similar role
xAI ends deal with Oracle, builds own AI datacenter. xAI has ended its agreement with Oracle and, once Grok 2 training is complete, will build its own data center. The company originally had a deal with Oracle for 24k H100s.
a16z is trying to keep AI alive with Oxygen initiative. According to The Information, VC firm Andreessen Horowitz has secured thousands of AI chips, including Nvidia H100 GPUs, to dole out to its AI portfolio companies in exchange for equity.
Quora’s Poe now lets users create and share web apps. Poe, Quora’s subscription-based, cross-platform aggregator for AI-powered chatbots like Anthropic’s Claude and OpenAI’s GPT-4o, has launched a feature called Previews that lets people create interactive apps directly in chats with chatbots.
Ex-Meta scientists debut gigantic AI protein design model. EvolutionaryScale’s protein language model — among the largest AI models in biology — has created new fluorescent proteins and won big investment.
Anthropic’s Claude adds a prompt playground to quickly improve your AI apps. Prompt engineering became a hot job last year in the AI industry, but it seems Anthropic is now developing tools to at least partially automate it.
OpenAI and Los Alamos National Laboratory announce bioscience research partnership. OpenAI and Los Alamos National Laboratory are developing evaluations to understand how multimodal AI models can be used safely by scientists in laboratory settings.
‘I am happy to see how my baby is bouncing’: the AI transforming pregnancy scans in Africa. While ultrasound services are normal practice in many countries, software being tested in Uganda will allow a scan without the need for specialists, providing an incentive for pregnant women to visit health services early on
Samsung to launch upgraded voice assistant Bixby this year with its own AI. Samsung will launch an upgraded version of its voice assistant Bixby this year based on its own artificial intelligence models, mobile chief TM Roh told CNBC.
Google says Gemini AI is making its robots smarter. DeepMind is using video tours and Gemini 1.5 Pro to train robots to navigate and complete tasks.
Here’s how Qualcomm’s new laptop chips really stack up to Apple, Intel, and AMD. The Snapdragon X Elite and X Plus chips from Qualcomm are making Windows on Arm a competitive platform, roughly matching the performance and battery life of AMD Ryzen, Apple's M3 chip, and Intel Core Ultra. The Snapdragon chips are excellent in multi-core scores and power economy, even though they don't lead in GPU performance. The latest generation of laptops with Snapdragon processors is a more affordable option than MacBooks and conventional Intel or AMD-based devices.
China's Laws of Robotics: Shanghai publishes first humanoid robot guidelines. Shanghai has published China's first governance guidelines for humanoid robots, calling for risk controls and international collaboration, as tech giants like Tesla showed off their own automatons at the country's largest artificial intelligence (AI) conference.
Crowdsourced Decentralized AI Market Map. Open sourcing a community-led market map of Decentralized AI.

Resources

Link description
CapPa: Training vision models as captioners. Craiyon's trained CapPa vision model achieves state-of-the-art results on several difficult vision benchmarks.
Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis. Trained on billions of text-image pairs, Kolors exhibits significant advantages over both open-source and proprietary models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters.
EGIInet: Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion. By means of geometric task guidance, EGIInet effectively combines two modalities to offer a novel approach to point cloud completion.
Quality Prompts. QualityPrompts implements 58 prompting techniques explained in this survey from OpenAI, Microsoft, et al.
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems. Describes a new task, SummHay, to evaluate a model's capacity to process a Haystack and produce a summary that highlights the key insights and cites the source documents; finds that RAG components improve performance on the benchmark, making it a feasible choice for holistic RAG evaluation. Long-context LLMs score 20% on the benchmark, well below the human performance estimate of 56%.
AI Agents That Matter. AI Agents That Matter examines existing agent evaluation procedures and identifies flaws that could prevent practical deployment; it also suggests a framework to prevent overfitting agents and an implementation that simultaneously maximizes accuracy and cost.
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2. A post by Neel Nanda, a Research Engineer at Google DeepMind, about his favorite papers to read in Mechanistic Interpretability.
SAE. This library trains k-sparse autoencoders (SAEs) on the residual stream activations of HuggingFace language models, roughly following the recipe detailed in Scaling and evaluating sparse autoencoders (Gao et al. 2024); see the SAE sketch after this list.
MInference. MInference speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
micro-agent. An AI agent that writes and fixes code for you.
AnySR. AnySR is a novel method for improving efficiency and scalability in single-image super-resolution (SISR). Unlike previous techniques, it supports an 'any-scale, any-resource' implementation, reducing resource requirements at smaller scales without extra parameters.
Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos. Without human supervision, researchers have developed a novel method for estimating category-level 3D poses from casual, object-centric videos.
SenseVoice. a speech foundation model that possesses a variety of speech understanding functions, such as auditory event detection, spoken language identification, automatic speech recognition, and speech emotion recognition.
Boosting Large Vision Language Models with Self-Training. A novel method called Video Self-Training with Augmented Reasoning (Video-STaR) aims to enhance Large Vision Language Models (LVLMs).
GraphRAG. With GraphRAG, you may use language models to analyze unstructured text. The quick start is simple to spin up because it operates on Azure.
iLLM-TSC. To enhance traffic signal control systems, researchers have created a novel framework that blends reinforcement learning with a large language model.
Tutorials on Tinygrad. Tinygrad is a toolkit for training deep learning models. These notes take an in-depth look at Tinygrad's internals and serve as an excellent introduction to AI compilers.
OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving. OccSora is a diffusion-based 4D occupancy generation model designed to improve long-term temporal evolution modeling.
Awesome AGI Survey. The goal of Artificial General Intelligence (AGI) is to execute a variety of real-world jobs with human-like efficiency. This project explores the path towards AGI.
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation. Developed from Meta's Chameleon model, Anole is an open autoregressive multimodal model. With focused fine-tuning, this effort restores the model's ability to generate images.
Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning. A novel reinforcement learning framework is presented by researchers to enhance customized text-to-image generation.
PerlDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models. PerlDiff is a technique that incorporates 3D geometric information to increase the accuracy of street view image production.
Paints-Undo. Paints UNDO is a system in which a model generates the strokes used to reconstruct an image. It comes from the creators of ControlNet, IC-Light, and many other image generation systems. Remarkably, unlike earlier stroke systems, this model can undo strokes and often completely rethinks its approach halfway through, much as a human artist would.
minRF. A rudimentary implementation of the scalable rectified flow transformers partially used in Stable Diffusion 3, along with sweeps of the muP hyperparameters.
RouteLLM. RouteLLM is a framework for serving and evaluating LLM routers.
30x speedup in model init for HF Transformers. Making weight initialization lazy on the first pass (the random initialization is overwritten when checkpoint weights load anyway) can significantly reduce the time lost during model initialization; see the lazy-initialization sketch after this list.
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. FlashAttention is the basis for contemporary fast language models. This new version reaches 75% of H100 peak utilization, up from 35% previously, thanks to several significant systems enhancements.
OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion. A novel approach to open-vocabulary detection called OV-DINO addresses the difficulties of combining various data sources and making use of language-aware capabilities.
Open-Vocabulary Video Instance Segmentation. An innovative approach to Open-Vocabulary Video Instance Segmentation (VIS), OVFormer tackles important problems in the field. It uses video-based training to increase temporal consistency and better align embeddings.
Satellite Image Time Series Semantic Change Detection: Novel Architecture and Analysis of Domain Shift. This work integrates semantic segmentation and change detection to address semantic change detection using satellite image time series (SITS-SCD).
PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer. The PosFormer model overcomes the drawbacks of sequence-based methods to greatly enhance Handwritten Mathematical Expression Recognition (HMER).
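Below is a minimal sketch of the kind of k-sparse autoencoder the SAE library above trains on residual-stream activations: encode, keep only the top-k latents, decode, and train on reconstruction error. The dimensions and k are illustrative assumptions, and the random tensor stands in for real activations.

```python
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.enc(x)
        # Keep the k largest latent activations per example, zero the rest.
        topk = torch.topk(z, self.k, dim=-1)
        sparse = torch.zeros_like(z).scatter(-1, topk.indices, topk.values)
        return self.dec(sparse)

sae = KSparseAutoencoder(d_model=768, d_latent=24576, k=32)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(64, 768)          # stand-in for residual-stream activations
recon = sae(acts)
loss = (recon - acts).pow(2).mean()  # plain reconstruction objective
opt.zero_grad()
loss.backward()
opt.step()
```

The hard top-k constraint replaces an L1 sparsity penalty, which is what makes the latent code interpretable feature-by-feature in the Gao et al. recipe.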
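And here is a rough sketch of the lazy-initialization idea from the "30x speedup" item above, using the `accelerate` library's meta-device context so that no random weight initialization is performed for weights a checkpoint will overwrite. The exact mechanism in the linked post may differ.

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("gpt2")  # "gpt2" is a stand-in model
with init_empty_weights():
    # Parameters are allocated on the "meta" device: no memory, no init cost.
    model = AutoModelForCausalLM.from_config(config)

# Real weights are then loaded from a checkpoint; recent versions of
# transformers already skip redundant initialization inside from_pretrained.
model = AutoModelForCausalLM.from_pretrained("gpt2")
```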

Perspectives

Link description
Real criminals, fake victims: how chatbots are being deployed in the global fight against phone scammers. New scambaiting AI technology Apate aims to keep scammers on the line while collecting data that could help disrupt their business model
James Muldoon, Mark Graham, and Callum Cant: ‘AI feeds off the work of human beings’. The Fairwork trio talk about their new book on the ‘extraction machine’, exposing the repetitive labor, often in terrible conditions, that big tech is using to create artificial intelligence
Superintelligence—10 years later. Ten years after the publication of Nick Bostrom's seminal book "Superintelligence," advances in AI have raised awareness of the potential for AGI and its associated concerns. With 2024 being a turning point toward guaranteeing control and alignment with human values, the AI research community is now giving AI safety serious attention. With AI technologies advancing so quickly, the sector faces concerns related to safety and ethics that were previously thought to be theoretical.
How Good Is ChatGPT at Coding, Really? Depending on the task difficulty and programming language, OpenAI's ChatGPT may generate code with success rates anywhere from less than 1% to 89%.
TechScape: Can AI really help fix a healthcare system in crisis? Artificial intelligence is heralded as helping the NHS fight cancer. But some warn it’s a distraction from more urgent challenges
Pop Culture. In a critical 31-page analysis titled "Gen AI: Too Much Spend, Too Little Benefit?", Goldman Sachs argues that generative AI's power consumption will sharply raise utility spending while delivering minimal productivity gains and returns. The study questions AI's potential to transform industries, highlighting its high cost, strains on the electrical infrastructure, and failure so far to produce appreciable gains in productivity or revenue. Without significant technological breakthroughs, it could portend a dismal future for the field.
The AI summer. Compared with other tech innovations like the iPhone and e-commerce, which took years to gain hold, ChatGPT's rapid adoption is noteworthy: it hit 100 million users in just two months. Even with the initial excitement, few users have found ChatGPT durably useful, and enterprise adoption of large language models remains limited, suggesting more work is needed to establish substantial product-market fit and long-term value.
A Deep Dive on AI Inference Startups. The development of AI's "picks and shovels," such as model fine-tuning, observability, and inference, is a well-liked field for venture capital investment. VCs are placing bets that when businesses integrate AI into their products, they won't want to develop things themselves. For AI inference, the TAM is highly limited. For VCs' investments to be profitable, they must have faith in significant TAM expansion. Although platforms for AI inference benefit startups in the short run, over the long run, they hurt them.
Cyclists can't decide whether to fear or love self-driving cars. San Francisco cyclists have reported near misses and safety concerns with self-driving cars from Waymo and Cruise. Almost 200 complaints about these self-driving cars' unpredictable behavior and near-misses have been filed with the California DMV. Despite the manufacturers' claims that their cars had improved safety features, the events cast doubt on the vehicles' suitability for widespread use in the face of heightened regulatory scrutiny.
Augmenting Intelligence. This essay promotes a practical approach to employing AI as an enhancement to human intelligence and explores bridging the divide between techno-optimists and pessimists on the subject. It discusses AI's role in education, its effects on creativity and the arts, and its ethical application. The paper highlights that artificial intelligence (AI) is a tool that augments human capabilities rather than poses a threat, suggesting that the term "augmented intelligence" is a more realistic description.

meme-of-the-week

Back to index

ML news: Week 1 - 7 July

Research

Link description
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs. claims to achieve 64.3% on HotpotQA (full-wiki), which is on par with the state-of-the-art model. proposes LongRAG, which combines RAG with long-context LLMs to enhance performance; uses a long retriever to significantly reduce the number of extracted units by operating on longer retrieval units; the long reader takes in the long retrieval units and leverages the zero-shot answer extraction capability of long-context LLMs to improve performance of the overall system.
From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data. suggests a fine-tuning strategy to increase the precision of information retrieval in LLMs while preserving reasoning abilities over long-context inputs; the fine-tuning dataset consists of 350 sample numerical dictionary key-value retrieval tasks; results show that this strategy reduces the "lost-in-the-middle" effect and enhances performance on both long-context reasoning and information retrieval.
GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models. enhances the long-context capabilities of LLMs by proposing a graph-based agent system that organizes long text into a graph and uses an agent to explore the graph (using predefined functions guided by a step-by-step rational plan) to efficiently generate answers to questions; consistently outperforms GPT-4-128k across context lengths ranging from 16k to 256k.
Following Length Constraints in Instructions. explains a method for addressing length bias and training language models that adhere more closely to length constraints; it fine-tunes a model with DPO on a dataset augmented with length instructions and demonstrates fewer length-constraint violations while maintaining high response quality.
Adam-mini: Use Fewer Learning Rates To Gain More. a new optimizer that carefully partitions parameters into blocks and assigns a single high-quality learning rate to each block, outperforming Adam; it achieves consistent results on language models from 125M to 7B parameters across pre-training, SFT, and RLHF, and by using far fewer learning rates it cuts the memory footprint by 45-50% while performing on par with or better than AdamW (a toy sketch of the block-wise idea follows this table).
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data. generative image model with better performance than pure text conditioned models due to its ability to interleave text and images.
Scaling Synthetic Data Creation with 1,000,000,000 Personas. By treating web text as originating from a persona and conditioning generation on that persona, this approach can significantly enhance downstream task performance. The researchers report a jump of 20 percentage points on MATH.
Odd-One-Out: Anomaly Detection by Comparing with Neighbors. Researchers present a novel anomaly detection task focused on objects that look unusual relative to the other objects in a scene. In contrast to conventional settings, anomalies here are specific to the scene and can be identified from multiple viewpoints.
Adaptable Logical Control for Large Language Models. This approach enables control of model generation at inference time, as well as interactive text editing. It achieves strong performance with tiny models and supports logical constraints during generation.
Pairwise Difference Learning for Classification. Researchers have extended Pairwise Difference Learning (PDL), originally developed as a regression method, to classification tasks. PDL predicts the differences between pairs of instances rather than the outcomes themselves (a toy sketch of one reading of the idea follows this table).
AXIAL. This research improves the explainability of model decisions by putting forth a novel technique for identifying Alzheimer's disease using 3D MRI scans.
Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization. A novel technique called Multi-Session SLAM creatively records camera movements throughout multiple disconnected video sequences using a single global frame of reference.
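To make the Adam-mini entry above concrete, here is a minimal, hedged sketch (not the authors' implementation): a toy optimizer that keeps Adam's per-parameter momentum but tracks only one second-moment scalar per parameter block. Treating each tensor as one block is an illustrative simplification; the paper picks blocks using the Hessian structure (e.g., per attention head), and the class name is an assumption.

```python
import torch

class AdamMiniSketch(torch.optim.Optimizer):
    """Toy sketch of the Adam-mini idea: Adam-style updates, but with a
    single second-moment scalar per parameter block (here, per tensor)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            b1, b2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                st = self.state.setdefault(p, {"m": torch.zeros_like(p), "v": 0.0, "t": 0})
                st["t"] += 1
                st["m"].mul_(b1).add_(p.grad, alpha=1 - b1)
                # One scalar v per block: the mean squared gradient of the block.
                st["v"] = b2 * st["v"] + (1 - b2) * p.grad.pow(2).mean().item()
                m_hat = st["m"] / (1 - b1 ** st["t"])
                v_hat = st["v"] / (1 - b2 ** st["t"])
                p.add_(m_hat, alpha=-group["lr"] / (v_hat ** 0.5 + group["eps"]))
```

Storing one scalar instead of a full tensor for the second moment is where the claimed 45-50% memory saving comes from.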
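And a hedged sketch of one plausible reading of the Pairwise Difference Learning entry above: train a base learner on pairs of instances to predict whether they share a label, then classify a new point by comparing it against labeled anchors. The function names and the "same class?" formulation are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_pairs(X, y):
    # Features for each ordered pair: both inputs plus their difference;
    # the target is whether the two instances share a class label.
    idx = [(i, j) for i in range(len(X)) for j in range(len(X)) if i != j]
    P = np.array([np.concatenate([X[i], X[j], X[i] - X[j]]) for i, j in idx])
    t = np.array([int(y[i] == y[j]) for i, j in idx])
    return P, t

rng = np.random.default_rng(0)
X, y = rng.normal(size=(60, 5)), rng.integers(0, 3, size=60)
pair_clf = RandomForestClassifier(random_state=0).fit(*make_pairs(X, y))

def predict(x_new, X_anchor, y_anchor):
    # Score each class by the mean predicted "same class" probability
    # against the labeled anchors of that class.
    feats = np.array([np.concatenate([x_new, a, x_new - a]) for a in X_anchor])
    p_same = pair_clf.predict_proba(feats)[:, 1]
    return max(set(y_anchor.tolist()), key=lambda c: p_same[y_anchor == c].mean())

print(predict(rng.normal(size=5), X, y))
```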

News

Link description
An Update to Adept. Adept's founders are heading to Amazon, which will license some of the startup's technology.
Time strikes a deal to funnel 101 years of journalism into OpenAI's gaping maw. Time has joined a growing number of publications to sign a licensing deal with OpenAI. The ChatGPT creator will legally be able to train its large language models on 101 years' worth of the storied publication's journalism, as Axios first reported.
Amazon Investigates Perplexity AI Over Potential Data-Scraping Violations. Amazon Web Services is looking into whether Perplexity is breaking its rules after Wired said the AI startup is swiping its web archives without consent. Perplexity, however, says it's following the rules.
Apple could announce a Google Gemini deal this fall. If you're disappointed that ChatGPT is so far the only AI model that will integrate with Apple devices, it sounds like you won't have to wait long for that to change: Apple will reportedly announce "at least" one other deal, adding Google Gemini, this fall.
Meta accused of breaking EU digital law by charging for ad-free social networks. European Commission objects to ‘pay or consent’ model for users of Facebook and Instagram
Microsoft’s Mustafa Suleyman says he loves Sam Altman, believes he’s sincere about AI safety. In an interview at the Aspen Ideas Festival on Tuesday, Mustafa Suleyman, CEO of Microsoft AI, made it very clear that he admires OpenAI CEO Sam Altman.
When the Terms of Service Change to Make Way for A.I. Training. As they negotiate a complicated web of privacy regulations and user consent, tech giants like Google and Meta are revising their privacy rules to allow the use of public and potentially private user data to train AI systems. There has been a backlash since consumers and content creators are afraid that their work will be used to train AI that may eventually replace them. The conflicts draw attention to new issues in data privacy, AI development, and striking a balance between innovation and morality in the IT sector.
Meet Figma AI. Designers may get assistance with tasks like visual search, asset search, text editing, image editing, prototyping, layer renaming, and design generation with Figma AI, a new suite of AI-powered capabilities for Figma. During the beta phase, these features—which are driven by AI models from third parties—are free to use.
Google’s emissions climb nearly 50% in five years due to AI energy demand. Tech giant’s goal of reducing climate footprint at risk as it grows increasingly reliant on energy-hungry data centers
Amazon beefs up AI development, hiring execs from startup Adept and licensing its technology. Amazon has hired top executives from AI agent startup Adept, the company confirmed. As part of the deal, Amazon will license technology from Adept, including some of its AI models and datasets. Amazon has been trying to keep pace with competitors in AI by developing services and through its investment in OpenAI competitor Anthropic.
YouTube now lets you request removal of AI-generated content that simulates your face or voice. YouTube also quietly rolled out a policy change in June that will allow people to request the takedown of AI-generated or other synthetic content that simulates their face or voice. The change allows people to request the removal of this type of AI content under YouTube’s privacy request process.
Phil Schiller to join OpenAI board in ‘observer’ role following Apple’s ChatGPT deal. At WWDC last month, Apple announced its partnership with OpenAI to integrate ChatGPT into iOS 18. While no money is changing hands between Apple and OpenAI, a new report today reveals that Apple will get an “observer role” on OpenAI’s board of directors as part of the arrangement.
Japan introduces enormous humanoid robot to maintain train lines. The 12-metre high machine has coke bottle eyes and a crude Wall-E-like head, as well as large arms that can be fitted with blades or paint brushes
Elon Musk: Grok 2 AI Arrives in August. Musk says Grok 2 'should exceed current AI on all metrics,' though Grok 3 is waiting in the wings.
Nvidia CEO Jensen Huang addresses rising competition at shareholder meeting after historic stock surge. Nvidia CEO Jensen Huang answered questions at the company’s annual shareholder meeting after a more than 200% surge in the stock over the past year. The company passed a $3 trillion valuation and was briefly the most valuable public company. Without naming competitors, Huang laid out the company’s overall strategy to maintain its position.
Persona’s founders are certain the world can use another humanoid robot. MIT research scientist Jerry Pratt is back at it. In 2022, he left Boardwalk Robotics, a humanoid startup he founded and led, and joined the well-funded ranks of the Bay Area-based robotics firm Figure as its CTO months before it exited stealth. But he and Figure quietly parted ways last month.
Kyutai unveils today the very first voice-enabled AI openly accessible to all. Kyutai, an open research lab in France, has trained a low-latency, audio-native LLM. The very impressive demo it has produced will be made available for public use in the coming months.
Face screening tool detects stroke in seconds. A new smartphone face-screening tool could help paramedics to identify stroke in seconds – much sooner and more accurately than is possible with current technologies.
This is Big Tech’s playbook for swallowing the AI industry. With Amazon’s hiring of the team behind a buzzy AI startup, a pattern is emerging: the reverse acquihire.
Intel shows off first fully integrated optical compute interconnect, designed to scale up AI workloads. Intel Corp. said today it has achieved another key milestone as it strives to make integrated photonics technology for high-speed data transfers a reality.
OpenAI’s ChatGPT Mac app was storing conversations in plain text. After the security flaw was spotted, OpenAI updated its desktop ChatGPT app to encrypt the locally stored records.
Jeff Bezos to sell $5bn of Amazon shares after stock hits record high. Proposed sale of 25m shares disclosed in a notice on Tuesday after the stock hit an all-time high of $200.43 during session
Wimbledon employs AI to protect players from online abuse. Threat Matrix service monitors social media profiles and flags up death threats, racism and sexist comments

Resources

Link description
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. a faster speculative-decoding method that builds a context-aware dynamic draft tree: a lightweight draft model proposes candidate continuations that the target model verifies in parallel, yielding large speedups without changing the output distribution (a simplified draft-and-verify sketch follows this table).
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. survey on LLM-based synthetic data generation, curation, and evaluation.
Text2Bricks: Fine-tuning Open-Sora in 1,000 GPU Hours. Lambda Labs trained the Open Sora video model on its 1-click cluster to create Lego movies.
Laplace Neural Operator. The Laplace Neural Operator is a neural-network architecture for approximating solutions to partial differential equations.
llama-agents. llama-agents is an async-first framework for building, iterating, and productionizing multi-agent systems, including multi-agent communication, distributed tool execution, human-in-the-loop, and more!
Suri: Multi-constraint Instruction Following for Long-form Text Generation. Suri is a collection of 20,000 long-form documents paired with complex, multi-constraint instructions, built to improve AI's ability to follow intricate writing requirements. The Suri team also introduces Instructional ORPO (I-ORPO), an alignment technique that derives feedback from synthetically corrupted instructions.
Cambrian-1. High-performing, fully open vision model family from NYU, with careful studies of vision encoders and data mixtures.
DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability. A novel expressive text-to-speech (TTS) model called DEX-TTS makes use of reference speech to enhance style representation and model generalization.
Debugging in PyTorch. PyTorch is an excellent modeling tool, but a few common pitfalls can significantly degrade model performance. This list is a useful reference when debugging your model code.
vision-agent. Vision Agent is a library that helps you utilize agent frameworks to generate code to solve your vision task.
What to do to scale up? An excellent and surprisingly accessible post about how to adjust hyperparameters as model and dataset sizes increase.
Web2Code. Researchers have created a new pipeline to improve webpage-to-code instruction tuning. It involves generating new webpage image-code pairs, new text question-answer pairs, improved webpage-understanding data, and new webpage code generation pairs.
Block Transformer: Global-to-Local Language Modeling for Fast Inference. This repository presents a new Transformer variant with a significantly smaller KV cache. Although it hasn't been tested at scale, it should perform on par with standard Transformers.
Composio. Equip your agent with high-quality tools & integrations without worrying about authentication, accuracy, and reliability in a single line of code!
Segment Anything without Supervision. Unsupervised SAM (UnSAM) is a 'segment anything' model for promptable and automatic whole-image segmentation which does not require human annotations.
Following Length Constraints in Instructions. Most models don't adhere to length specifications (e.g., "answer in fewer than 40 words"). This piece demonstrates how to fine-tune them to do so (a hedged data-augmentation sketch follows this table).
AI Overviews Research: Comparing pre and post-rollout results on 100K keywords. The prevalence of Google's AI Overviews (AIO) feature has dropped sharply, from 64% of SERPs pre-rollout to just 8.71% across 100K keywords. Since rollout, AIO answers have grown longer and include more links, typically to the top 10 organic results, reflecting Google's emphasis on thorough responses and reliable sources. Because longer, lower-volume, lower-CPC queries are more likely to trigger AI-generated results, SEO strategies will have to adapt to stay relevant.
Meta 3D Gen. Meta has trained an advanced 3D object generation model along with a PBR texture creation system. It generates synthetic training data with the company's proprietary 2D image generation model.
Mutahunter. An open-source, LLM-based mutation testing tool for automated software testing that is independent of language.
LLaRA: Large Language and Robotics Assistant. LLaRA is a framework that leverages conversation-style instruction-response pairings and Large Language Models (LLMs) to enhance robot action policy. These Vision Language Models (VLMs) use visual inputs to evaluate state data and produce the best possible policy choices.
MM-Instruct. A dataset of generated visual instructions for large multimodal model alignment.
Parable of the Parser. Great keynote talk from CVPR.
InstantStyle-Plus : Style Transfer with Content-Preserving in Text-to-Image Generation. Style transfer with modern diffusion models and content embedders.
RSCaMa: Remote Sensing Image Change Captioning with State Space Model. A novel technique called RSCaMa has been presented by researchers to use natural language to describe changes in remote sensing photographs.
Simple Diffusion Language Models. Excellent talk on using diffusion as an objective for language modeling, by Cornell Tech professor and Hugging Face researcher Sasha Rush.
3D Reconstruction from Blurry Images. Researchers have created a technique that uses neural radiance fields (NeRF) and event streams to reconstruct three-dimensional scenes from a single blurry image. The method models camera motion and synthesizes brightness changes to produce high-quality, view-consistent images from blurry inputs, eliminating the need for pre-computed camera poses.
Agentless. Agentless is an agentless approach to automatically solve software development problems. To solve each issue, Agentless follows a simple two-phase process: localization and repair.
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention. A novel technique called MInference speeds up the processing of long prompts in large language models, using dynamic sparse attention to avoid the considerable delays of conventional approaches.
torch.compile, the missing manual. Manual for resolving torch.compile errors to make your code run faster.
facebook/multi-token-prediction. Meta has released the model weights for its multi-token prediction work; they perform impressively well.
Maestro - A Framework for Claude Opus, GPT and local LLMs to Orchestrate Subagents. This Python script demonstrates an AI-assisted task breakdown and execution workflow using the Anthropic API. It utilizes two AI models, Opus and Haiku, to break down an objective into sub-tasks, execute each sub-task, and refine the results into a cohesive final output.
Magic Insert: Style-Aware Drag-and-Drop. Method from Google to introduce meaningful items into photos with diffusion. The demo and dataset are accessible.
Discrete Semantic Tokenization for Deep CTR Prediction. UIST is a method that transforms dense embeddings into discrete, compact tokens for user and item representations, thereby significantly improving click-through-rate estimation.
CELLO: Causal Evaluation of Large Vision-Language Models. With 14,094 causal questions, CELLO is a new dataset designed to help AI understand causality beyond commonsense reasoning.
OpenStreetView-5M. With more than 5 million geotagged street photos from 225 countries, OpenStreetView-5M is a large open-access dataset for evaluating computer vision techniques for image geolocation.
PTQ4SAM: Post-Training Quantization for Segment Anything. PTQ4SAM is a new framework that reduces the memory and compute requirements of the large-scale Segment Anything Model (SAM) through post-training quantization.
Boosting Smartphone Camera Clarity. This study presents a technique for improving smartphone image resolution, using a self-supervised learning model that enhances reference-based super-resolution (RefSR).
An Investigation of Incorporating Mamba for Speech Enhancement. SEMamba is a novel speech enhancement system that enhances voice signal clarity by utilizing the Mamba state-space model.
Florence 2 on WebGPU. The tiny vision model runs entirely in the browser via ONNX and WebGPU.
FlexiFilm: Long Video Generation with Flexible Conditions. A diffusion model called FlexiFilm was created expressly to produce long videos—more than 30 seconds—with excellent quality and consistency.
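Two entries above lend themselves to small sketches. First, for EAGLE-2 (top of this table): the baseline it accelerates is plain draft-and-verify speculative decoding, shown as a greedy toy below. It assumes HuggingFace-style `draft` and `target` causal LMs that return `.logits`; EAGLE-2's actual contribution, a context-aware dynamic tree of drafts, is not reproduced here.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    """One greedy draft-and-verify round: the small draft model proposes k
    tokens autoregressively, the large target model scores all of them in a
    single forward pass, and we keep the longest agreeing prefix plus one
    corrected token. `ids` has shape (1, seq_len)."""
    proposal = ids
    for _ in range(k):  # cheap model drafts k tokens (no KV cache, for brevity)
        nxt = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)
    # One expensive pass verifies every drafted position at once.
    tgt = target(proposal).logits[:, ids.shape[1] - 1:-1].argmax(-1)
    drafted = proposal[:, ids.shape[1]:]
    agree = (tgt == drafted).long().cumprod(-1).sum().item()  # accepted prefix
    return torch.cat([ids, drafted[:, :agree], tgt[:, agree:agree + 1]], dim=-1)
```

The speedup comes from the target model checking k drafted tokens in one pass instead of generating them one at a time.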
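Second, for "Following Length Constraints in Instructions" (also above): a hedged sketch of the kind of data augmentation involved, where a word limit is prepended to a prompt and the preference pair is arranged so that the limit-respecting response is chosen. The field names and exact prompt wording are assumptions; see the paper for the actual recipe.

```python
import random

def length_augment(prompt, resp_a, resp_b, rng=random.Random(0)):
    """Build a DPO pair that also teaches a length constraint: pick a word
    limit between the two responses' lengths, prepend it to the prompt, and
    prefer the response that satisfies the limit."""
    la, lb = len(resp_a.split()), len(resp_b.split())
    assert la != lb, "sketch assumes the responses differ in length"
    short, long_ = (resp_a, resp_b) if la < lb else (resp_b, resp_a)
    limit = rng.randint(min(la, lb), max(la, lb) - 1)  # the long answer violates it
    return {"prompt": f"Answer in at most {limit} words. {prompt}",
            "chosen": short, "rejected": long_}

pair = length_augment(
    "Why is the sky blue?",
    "Rayleigh scattering favors shorter wavelengths.",
    "Sunlight is scattered by air molecules, and shorter blue wavelengths "
    "scatter far more strongly than longer red ones do.")
print(pair["prompt"])  # e.g. "Answer in at most 9 words. Why is the sky blue?"
```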

Perspectives

Link description
Smudgy chins, weird hands, dodgy numbers: seven signs you’re watching a deep fake. Look out for surplus fingers, compare mannerisms with real recordings and apply good old-fashioned common sense and skepticism, experts advise
Training MoEs at Scale with PyTorch. The Mosaic team has partnered with PyTorch to write about scaling their MoE models to thousands of GPUs.
Investing in the Age of Generative AI. Though there is currently a "euphoria" surrounding investment, the generative AI business is already showing signs of fragility.
Can AI boom drive Nvidia to a $4tn valuation despite investor doubt? Powerful new chips are on the way but there are questions over whether tech firm’s growth can be sustained
AI scaling myths. It is improbable that LLMs will achieve AGI through scaling alone. Scaling has been found to improve model capabilities, but it largely reduces perplexity rather than producing emergent skills, and high-quality training data is getting harder and harder to obtain.
A discussion of discussions on AI bias. The nature of AI bias has come under more scrutiny, with critics arguing that machine learning biases are demonstrated by the way models like Playground AI occasionally change a user's ethnicity in photos. Some users dispute whether this is a flaw or meaningful prejudice, pointing to instances in which Asian traits are overrepresented. The discussion touches on the wider ramifications of AI bias across industries; there is no easy answer to this complicated problem.
The shape of information. This article describes how to use binary logic to maximize scarce resources.
why we no longer use LangChain for building our AI agents. Octomind's codebase and team productivity improved after it dropped the LangChain framework for AI test automation in favor of simpler, modular building blocks. It found LangChain's high-level abstractions rigid, making development and maintenance harder; after the change of strategy, Octomind benefits from a leaner architecture and faster iteration on its AI agent work.
The Five Stages Of AI Grief. Benjamin Bratton, a professor at the University of California, San Diego and director of the Antikythera program at the Berggruen Institute, refers to the global response to artificial intelligence as a "Copernican Trauma," comparing it to historical changes that have reshaped humanity's understanding of itself. Bratton offers the following five stages of "AI grief" to describe how society would react to AI's evolution: from skepticism to integration into our conception of intelligence: denial, rage, bargaining, depression, and acceptance. He contends that rather than being a uniquely human story, the integration of AI represents a larger biological and technological evolutionary process.
How to win at Enterprise AI — A playbook. This AI-focused playbook describes AI adoption methods for enterprises, emphasizing the move from human-performed services to software-driven workflows known as "Service-as-a-software." It explores how these changes may affect business models, including performance-based pricing, and stresses how crucial workflow capture and AI accuracy are to the implementation process's success. The handbook also covers threats such as lateral attacks and emphasizes that in enterprise contexts, AI must show real performance, not simply potential.
AI is disrupting Customer Support. Salesforce is feeling the pinch. Customer support software providers like Salesforce and Zendesk are facing challenges as enterprises redirect their IT spending toward AI proof-of-concept projects. For traditional software suppliers, the increasing integration of solutions such as ChatGPT in customer assistance has resulted in longer payback periods due to higher customer acquisition expenses. The creativity of these businesses and the overall macroeconomic climate will determine how much money is invested in customer support software in the future.
Contra Acemoglu on AI. In contrast to more positive projections, economist Daron Acemoglu's working paper on AI proposes a modest 0.06% annual rise in TFP growth. He identifies four distinct ways that AI affects productivity, but he ignores the development of new labor-intensive goods and the further automation of existing processes, perhaps underestimating the economic potential of AI. His method is criticized for being unduly restrictive and for perhaps distorting the wider socioeconomic effects of AI developments.
Inside the maths that drives AI. Loss functions measure algorithmic errors in artificial intelligence models, but there's more than one way to do that. Here's why the right function is so important (a small numerical illustration follows this table).
‘The disruption is already happening!’ Is AI about to ruin your favorite TV show? It won’t be long till everything from Drag Race to Keeping Up With the Kardashians could be written without humans – and you might be able to write yourself as the hero of a new show. But will robot TV ever be up to snuff?
Can the climate survive the insatiable energy demands of the AI arms race? New computing infrastructure means big tech is likely to miss emissions targets but they can’t afford to get left behind in a winner takes all market
Our attitudes towards AI reveal how we feel about human intelligence. We’re in the untenable position of regarding the AI as alien because we’re already in the position of alienating each other
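On the loss-functions piece above: a tiny, self-contained illustration of why the choice matters. The two candidate predictors below are ranked in opposite orders by squared error and by cross-entropy, because cross-entropy punishes confident mistakes far more harshly. The numbers are made up purely for illustration.

```python
import numpy as np

def mse(p, y):
    return np.mean((p - y) ** 2)

def cross_entropy(p, y):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0] * 9 + [0.0])            # ten labels; the last one is negative
confident = np.full(10, 1 - 1e-7)          # near-certain, badly wrong on the last item
cautious = np.where(y == 1, 0.68, 0.32)    # moderately confident, never badly wrong

print(mse(confident, y), mse(cautious, y))  # ~0.100 vs ~0.102: MSE prefers 'confident'
print(cross_entropy(confident, y), cross_entropy(cautious, y))  # ~1.61 vs ~0.39: CE prefers 'cautious'
```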

meme-of-the-week

Back to index

ML news: Week 24 - 30 June

Research

Link description
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? performs a thorough performance analysis of long-context LLMs on in-context retrieval and reasoning, first presenting a benchmark of real-world tasks requiring 1M-token context; reports that long-context LLMs can compete with state-of-the-art retrieval and RAG systems without explicit training on the tasks; finds that compositional reasoning (needed in SQL-like tasks) is still challenging for these LLMs and encourages further research on advanced prompting strategies.
PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers. improves decision-making using the iterative plan-then-RAG (PlanRAG) technique, which consists of two steps: 1) an LM creates the plan for decision-making by reviewing the questions and data schema, and 2) the retriever creates the queries for data analysis; a final phase then determines whether a new plan for additional analysis is required, repeating the earlier steps or making a decision based on the data; PlanRAG is found to perform better than iterative RAG on the proposed Decision QA tasks.
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs. proposes the goldfish loss, a modification of the next-token prediction objective that mitigates verbatim generation of memorized training data by excluding a pseudorandom subset of training tokens from the loss at training time; the model resists memorization while remaining useful, though it may need to train longer to learn as effectively from the training data (a toy sketch follows this table).
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B. reports combining LLMs with Monte Carlo Tree Search to achieve GPT-4-level performance on mathematical Olympiad problems; the approach improves mathematical reasoning by enabling systematic exploration, self-refinement, and self-evaluation.
From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries. aims to better understand how LLMs use external knowledge in place of parametric information when responding to factual queries; finds that in a RAG pipeline, LLMs take a "shortcut," exhibiting a strong bias toward the retrieved context over their parametric memory when answering the question.
Tree Search for Language Model Agents. suggests an inference-time tree search technique for LM agents to explore and enable multi-step reasoning; applied to GPT-4o in interactive web environments, it dramatically enhances performance, and performance scales with increased test-time compute.
Evidence of a log scaling law for political persuasion with large language models. "Superpersuasion" is the worry that models may become noticeably more persuasive as they get bigger. The data here offer little support for the idea that larger models are significantly more compelling than smaller ones; they might, nevertheless, be fine-tuned to be more convincing.
MacroHFT: Memory Augmented Context-aware Reinforcement Learning On High Frequency Trading. Reinforcement learning is used in MacroHFT, a novel method of high-frequency trading (HFT) in cryptocurrency markets, to enhance profitability and decision-making.
Soft-QMIX: Integrating Maximum Entropy For Monotonic Value Function Factorization. Researchers have incorporated a local Q-value learning method within a maximum entropy framework to enhance QMIX, a popular multi-agent reinforcement learning technique.
ReaL: Efficient RLHF Training for LLMs with Parameter Reallocation. ReaLHF is a novel method that dynamically redistributes parameters and optimizes parallelization during training to improve reinforcement learning from human feedback (RLHF).
AlphaFold2 structures guide prospective ligand discovery. AlphaFold2 (AF2) models have had a wide impact but mixed success in retrospective ligand recognition. We prospectively docked large libraries against unrefined AF2 models of the σ2 and serotonin 2A (5-HT2A) receptors, testing hundreds of new molecules and ...
GPTs are GPTs: Labor market impact potential of LLMs. proposes a framework for evaluating the potential impacts of large language models (LLMs) and associated technologies on work by considering their relevance to the tasks workers perform in their jobs; when accounting for current and likely future software developments that complement LLM capabilities, the share of significantly affected jobs jumps to just over 46%.
Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models. PE-Rank is a novel passage-ranking method that uses single passage embeddings as a form of context compression, making listwise reranking more efficient while maintaining performance.
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression. By customizing sparse attention configurations for each head and layer, the Mixture of Attention (MoA) method maximizes sparse attention in large language models.
GeoMFormer: A General Architecture for Geometric Molecular Representation Learning. A new Transformer-based model called GeoMFormer learns both equivariant and invariant properties to enhance molecular modeling.
Making my local LLM voice assistant faster and more scalable with RAG. Researchers classified data, precomputed embeddings, and dynamically generated examples to improve the efficiency and scalability of an LLM voice assistant.
Retrieval Augmented Instruction Tuning for Open NER with Large Language Models. Retrieval Augmented Instruction Tuning (RA-IT) enhances information extraction with large language models.
Data curation via joint example selection further accelerates multimodal learning. Actively choosing the next best batch in pre-training is a difficult, open problem. This DeepMind research shows how jointly selecting batches (including hard-mining negative samples) can match SOTA on a variety of tasks while using only 10% of the FLOPs.
Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text. A system called Director3D was created to improve camera trajectory modeling and 3D scene production in the real world. Director3D creates lifelike 3D scenes from text descriptions by using a Multi-view Latent Diffusion Model and a Trajectory Diffusion Transformer.
Prompt Engineering Tool. An excellent prompting toolset for evaluating the effectiveness of various prompts, written almost entirely with Sonnet 3.5.
Meta Large Language Model Compiler: Foundation Models of Compiler Optimization. Meta has released two language models that can compile code to assembly and decompile LLVM IR. Trained on 546 billion tokens of high-quality compiler-centric data and then further fine-tuned, they achieve 77% optimized-assembly performance and a 45% round-trip disassembly rate.
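For the goldfish-loss entry above, a hedged PyTorch sketch of the core idea: standard next-token cross-entropy with a pseudorandom subset of positions masked out of the loss, so no passage can be learned verbatim. The paper derives its mask from a hash of the local context; a seeded RNG stands in for that here, and the function name is an assumption.

```python
import torch
import torch.nn.functional as F

def goldfish_loss(logits, targets, drop_every=4, seed=0):
    """Next-token cross-entropy with ~1/drop_every of the token positions
    pseudorandomly excluded from the loss, mitigating verbatim memorization."""
    g = torch.Generator().manual_seed(seed)
    keep = torch.rand(targets.shape, generator=g) >= 1.0 / drop_every
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).reshape(targets.shape)
    return (loss * keep).sum() / keep.sum().clamp(min=1)

logits = torch.randn(2, 16, 100)            # (batch, seq, vocab)
targets = torch.randint(0, 100, (2, 16))
print(goldfish_loss(logits, targets))
```

Because the dropped positions never contribute gradient, the model cannot fit any training sequence exactly, which is the mechanism behind the reduced memorization.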

News

Link description
Geologists raise concerns over possible censorship and bias in Chinese chatbot. GeoGPT developed as part of Chinese-funded earth sciences program aimed at researchers in global south
OpenAI acquires Rockset. Rockset is a robust database that supports both indexing and querying. The startup was acquired by OpenAI in order to enhance its infrastructure for retrieval.
Snapchat AI turns prompts into new lens. Snapchat’s upcoming on-device AI model could transform your background — and your clothing — in real-time.
HeyGen Raises $60M Series A to Scale Visual Storytelling for Businesses. HeyGen, an AI video-generating platform, has raised $60 million in Series A funding to improve its studio-quality video creation and localization capabilities quickly and affordably. HeyGen, which just generated $35 million in ARR, strives to democratize visual storytelling for companies of all sizes.
AI candidate running for Parliament in the U.K. says AI can humanize politics. Voters can talk to AI Steve, whose name will be on the ballot for the U.K.'s general election next month, to ask policy questions or raise concerns.
Anthropic has a fast new AI model — and a clever new way to interact with chatbots . Claude 3.5 Sonnet is apparently Anthropic’s smartest, fastest, and most personable model yet.
AIs are coming for social networks. An app called Butterflies puts a new spin on how we interact with AI. With Meta and others making similar moves, social media is about to get a lot weirder.
OpenAI walks back controversial stock sale policies, will treat current and former employees the same. OpenAI has changed its policies toward secondary share sales to allow current and former employees to participate equally in its annual tender offers, CNBC has learned. All current and former staffers “will have the same sales limit” and be able to participate at the same time, OpenAI said in documents shared with stakeholders.
Report: Amazon developing AI chatbot that would compete with ChatGPT and others. Amazon is developing its own consumer-focused AI chatbot that would compete with OpenAI’s ChatGPT and could be revealed later this year, according to a report from Business Insider.
Multi is joining OpenAI. OpenAI continues its acquisition streak, picking up Multi and its desktop collaboration technology.
Artificial Marketing Intelligence at your fingertips: MarTech startup Ability AI secures $1.1M pre-seed round funding to automate the process. Ability AI, a martech startup specializing in full-cycle paid marketing automation with the help of autonomous AI agents, announced today that it has raised $1.1 million in pre-seed funding from SMRK VC as a lead investor, with the participation of other funds and angels.
Claude 3.5 suggests AI’s looming ubiquity could be a good thing. If you don’t like chatbots popping up everywhere, get ready to be peeved. But the latest version of Anthropic’s Claude shows AI is becoming more useful – and, crucially, affordable
Apple found in breach of EU competition rules. European Commission finds iPhone maker broke new laws designed to protect smaller competitors against big tech platforms
Etched is building an AI chip that only runs one type of model. Etched is among the many, many alternative chip companies vying for a seat at the table — but it’s also among the most intriguing.
Stability AI Secures Significant New Investment. Stability AI was able to obtain a "significant infusion of capital" from both new and existing investors in addition to hiring a new CEO.
Training a 70B model from scratch: open-source tools, evaluation datasets, and learnings. Earlier this year, we pre-trained and fine-tuned a 70B-parameter model that outperforms GPT-4o zero-shot on a range of reasoning and coding-related benchmarks and datasets. Our fine-tuned model, pre-trained on 2T tokens, roughly matches a fine-tuned Llama 3 70B, which was pre-trained on more than seven times as much data.
OpenAI Pushes Back Voice Mode. The sophisticated Voice Mode that OpenAI showcased in its Spring Update will go live in alpha form in late July for a limited group of ChatGPT Plus subscribers.
Meta’s AI translation model embraces overlooked languages. More than 7,000 languages are in use throughout the world, but popular translation tools cannot deal with most of them. A translation model that was tested on under-represented languages takes a key step towards a solution.
Researchers fool university markers with AI-generated exam papers. University of Reading project poses questions for integrity of coursework and take-home student assignments
YouTube tries convincing record labels to license music for AI song generator. Video site needs labels’ content to legally train AI song generators.
Evolutionary Scale Raises $142m series A. A biology startup called Evolutionary Scale has come out of stealth with significant funding. It also announced ESM3, its foundation model: a 98B-parameter model trained on 771B biological tokens for 10^24 FLOPs. Using the model, it discovered a new fluorescent green protein that does not occur in nature.
Waymo One is now open to everyone in San Francisco. With its driverless cars, Waymo One now makes it possible for anybody in San Francisco to request a ride. After providing tens of thousands of trips per week, the company is expanding. Its all-electric fleet helps it achieve its sustainability goals and boosts the local economy. Waymo claims that its cars are much less likely to be involved in collisions than those driven by humans, citing increased safety.
ChatGPT on your desktop. Users can now download the ChatGPT desktop software for macOS.
AI will be help rather than hindrance in hitting climate targets, Bill Gates says. Microsoft co-founder says efficiencies for technology and electricity grids will outweigh energy use by data centers
Snap Lens Studio 5.0. The GenAI suite Snap introduced with Lens Studio 5.0 is a big step forward and a major help for building augmented reality apps.
Instagram Launching An AI Studio. Instagram's "AI Studio" enables creators to build custom AI chatbots. An early test is currently underway in the US.
Dust raises $16m series A. Dust, one of the first modern LLM chaining and agent companies, raised more money after surpassing $1 million in annual revenue.
ElevenLabs launches iOS app that turns ‘any’ text into audio narration with AI. "ElevenLabs Reader: AI Audio," the company's debut iOS app, enables users to listen on the go by turning text files or web links into audio narration.

Resources

Link description
Open-Sora 1.2 Report. a 1.1B parameter model trained on over 30 million data points, this open-source video generation model can produce 16-second 720p videos. It also features an improved diffusion model and video compression network for both temporal and spatial compression, which lowers training costs and improves the controllability of the generations.
LLM101n: Let's build a Storyteller. An outline for a new course that Andrej Karpathy is working on can be found in a new repository. It entails creating a narrative-capable aligned language model. Code, video lectures, and other learning resources are included in the course.
AutoCodeRover: Autonomous Program Improvement. AutoCodeRover is a new technology that combines sophisticated code search methods with big language models to automate software enhancements, such as feature additions and problem fixes.
NLUX. NLUX is a React and JavaScript open-source library for building conversational AI interfaces. It makes it super simple to build web applications powered by Large Language Models (LLMs) and AI. With just a few lines of code, you can add conversational AI capabilities and interact with your favorite AI models.
Claudette. Claudette is a higher-level and easier-to-use way to interact with Claude.
top CVPR 2024 papers. Computer Vision and Pattern Recognition is a massive conference. In 2024 alone, 11,532 papers were submitted, and 2,719 were accepted. I created this repository to help you search for crème de la crème of CVPR publications.
TTS in 7000 Languages. Recently, Toucan published a collection of new text-to-speech models that are now compatible with all ISO-639-3 standard languages.
ParaLLM: 1300+ tok/s on a MacBook. Implementing batched parallel KV caching in MLX significantly speeds up inference for synthetic data creation and batched model completions.
Train vision models in TRL . Transformers can be trained using reinforcement learning with the help of TRL, a Hugging Face library. You may apply the same procedure for vision-based language models, such as LLaVA, using this example.
Rethinking Remote Sensing Change Detection With A Mask View. Two new models for remote sensing change detection—CDMask and CDMaskFormer—are presented in this study.
llama.ttf. This article explains how to use a font file to run a little Llama language model.
june. June is a local voice chatbot that combines the power of Ollama (for language model capabilities), Hugging Face Transformers (for speech recognition), and the Coqui TTS Toolkit (for text-to-speech synthesis). It provides a flexible, privacy-focused solution for voice-assisted interactions on your local machine, ensuring that no data is sent to external servers.
Building a personalized code assistant with open-source LLMs using RAG Fine-tuning. Morph Labs and collaborators wrote an excellent blog post about fine-tuning models for retrieval-augmented generation, and demonstrate several applications of generated data.
EvalAlign: Evaluating Text-to-Image Models through Precision Alignment of Multimodal Large Models with Supervised Fine-Tuning to Human Annotations. A novel metric called EvalAlign was created to enhance the assessment of generative models that convert text to images. EvalAlign provides fine-grained accuracy and stability in contrast to current measures. It emphasizes text-image alignment and image faithfulness.
Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models. Florence-2, released by Microsoft in June 2024, is a foundation vision-language model. This model is very attractive because of its small size (0.2B and 0.7B) and strong performance on a variety of computer vision and vision-language tasks. Florence supports many tasks out of the box: captioning, object detection, OCR, and more.
Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity. The PyTorch team has created specially designed kernels that use sparse tensor cores, which are typically reserved for inference, to accelerate training.
FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models. Diffusion models are used in FreeTraj, a tuning-free technique for controlling motion trajectories in video creation. To direct the generated content, it adjusts the attention mechanisms and noise sampling.
OpenGlass - Open Source Smart Glasses. Turn any glasses into hackable smart glasses with less than $25 of off-the-shelf components. Record your life, remember people you meet, identify objects, translate text, and more.
An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability. The Golden Gate Claude demo served as a potent illustration of how to steer and analyze models with SAEs. This post offers an easy-to-understand explanation of how they work, along with sample training code (a minimal SAE sketch follows this table).
RES-Q. A new benchmark called RES-Q is designed to evaluate how well large language models can modify code repositories from natural-language instructions.
Balancing Old Tricks with New Feats: AI-Powered Conversion From Enzyme to React Testing Library at Slack. Using a hybrid method combining LLMs with abstract syntax tree transformations, Slack developers automated the conversion of more than 15,000 unit tests from Enzyme to React Testing Library. The team used Anthropic's Claude 2.1 together with DOM-tree capture for React components to reach an 80% automatic-conversion success rate. The project is part of Slack's ongoing effort to use AI to improve developer productivity and experience and to stay ahead of the ever-changing frontend landscape.
R2R. R2R was designed to bridge the gap between local LLM experimentation and scalable, production-ready Retrieval-Augmented Generation (RAG). R2R provides a comprehensive and SOTA RAG system for developers, built around a RESTful API for ease of use.
Internist.ai 7b. Internist.ai 7b is a medical domain large language model trained by medical doctors to demonstrate the benefits of a physician-in-the-loop approach. The training data was carefully curated by medical doctors to ensure clinical relevance and required quality for clinical practice.
Finding GPT-4’s mistakes with GPT-4. CriticGPT, a model based on GPT-4, writes critiques of ChatGPT responses to help human trainers spot mistakes during RLHF
ALPBench: A Benchmark for Active Learning Pipelines on Tabular Data. ALPBench is a benchmark created to standardize the evaluation of active learning pipelines on tabular data.
Introducing AuraSR - An open reproduction of the GigaGAN Upscaler. FAL has open-sourced AuraSR, a high-resolution image upscaler. It upscales 4x in a single forward pass and holds up even under repeated application, and it performs especially well on generated photos.
Point-SAM: Promptable 3D Segmentation Model for Point Clouds. Point-SAM, a transformer-based 3D segmentation model, has been introduced by researchers in response to the increasing demand for comprehensive 3D data.
GenIR-Survey. This survey explores generative information retrieval (GenIR), a novel approach to information retrieval that shifts from conventional search techniques to ones that generate results dynamically.
Gemma 2. Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.
MatText: Do Language Models Need More than Text & Scale for Materials Modeling? MatText is a collection of benchmarking tools and datasets intended to assess the effectiveness of language models in the field of materials science.
mamba2. A quick implementation of Mamba 2
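To accompany the sparse-autoencoder explainer above: a minimal SAE training loop of the kind used for interpretability, assuming you have already captured residual-stream activations from a model (random tensors stand in for them here). The dimensions and L1 coefficient are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct activations through a wide, L1-penalized
    ReLU bottleneck so each latent tends to fire for a single feature."""

    def __init__(self, d_model=768, d_hidden=768 * 8):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(z), z

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(4096, 768)  # stand-in for activations captured from an LLM

for batch in acts.split(256):
    recon, z = sae(batch)
    loss = (recon - batch).pow(2).mean() + 1e-3 * z.abs().mean()  # recon + L1 sparsity
    opt.zero_grad(); loss.backward(); opt.step()
```

The L1 term is what forces most latents to zero on any given input; steering a model (as in Golden Gate Claude) then amounts to adding a chosen latent's decoder direction back into the residual stream.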

Perspectives

Link description
The Long View on AI. AI has the potential to cause tremendous growth rates and technological improvements, according to historical statistics. Society will probably be able to adjust to these rapid changes just as it has in the past.
AI’s Hidden Opportunities: Shawn "swyx" Wang on New Use Cases and Career. Well-known developer Shawn "swyx" Wang discusses the untapped potential for conventional software professionals wishing to go into artificial intelligence. In particular, examining how to enhance existing tools, use AI to summarization, and more.
Apple Intelligence. Rather than developing stand-alone AI products, Apple has incorporated generative AI into its core apps, improving services like Mail classification, Safari summaries, and Siri's functioning. This demonstrates the company's focus on user control and privacy.
Apple intelligence and AI maximalism. Apple has shown a bunch of cool ideas for generative AI, but much more, it is pointing to most of the big questions and proposing a different answer - that LLMs are commodity infrastructure, not platforms or products.
How To Solve LLM Hallucinations. Lamini has created Memory Tuning, which effectively embeds particular facts into models without sacrificing general knowledge and reduces hallucinations by 95%.
AI machine translation tools must be taught cultural differences too. But to successfully preserve or revitalize minority languages, the scope of large-language-model (LLM) training needs to be broadened.
Misinformation might sway elections — but not in the way that you think. Rampant deepfakes and false news are often blamed for swaying votes. Research suggests it’s hard to change people’s political opinions, but easier to nudge their behaviour.
How I’m using AI tools to help universities maximize research impacts. Artificial intelligence algorithms could identify scientists who need support with translating their work into real-world applications and more. Leaders must step up.
The Future of LLM-Based Agents: Making the Boxes Bigger. This post discusses two essential strategies for moving agents from the playground into the real world: long-horizon planning, which gives agents higher-level plans and mid-episode adaptability, and systems techniques that intelligently orchestrate models for better performance and accuracy.
Apple, Microsoft Shrink AI Models to Improve Them. Large language models are becoming less popular as IT companies shift their focus to more efficient small language models (SLMs). Apple and Microsoft have introduced models with far fewer parameters that nonetheless perform comparably or even better in benchmarks. According to the CEO of OpenAI, we're past the LLM era since SLMs have benefits including greater accessibility for smaller entities, local device operation, and potential insights into human language acquisition. Even though SLMs are narrower in scope, their performance is enhanced by training them on high-quality, or "textbook-quality" data.
Are Tech-Enabled Vertical Roll-Ups the Future or the Past? The ability to generate excess cash flows through operational efficiencies is a prerequisite for roll-up methods. It's possible that the development of AI offers a new lever that fully unlocks the roll-up strategy. Are rollups for SMBs and verticals the future? Two different perspectives on this issue are presented in this post.

meme-of-the-week

Back to index

ML news: Week 17 - 23 June

Research

Link description
Discovering Preference Optimization Algorithms with and for Large Language Models. proposes LLM-driven objective discovery for state-of-the-art preference optimization: an LLM is prompted to suggest and implement preference-optimization loss functions based on previously assessed performance metrics, eliminating the need for human intervention; the discovered algorithm adaptively combines logistic and exponential losses.
SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals. a framework to increase the high-level goal-achieving capabilities of an LLM-based agent; during interaction with the environment, the framework adaptively decomposes a high-level goal into a tree structure of useful subgoals; enhances performance on a variety of tasks, including cooperative, competitive, and deferred feedback environments.
Mixture-of-Agents Enhances Large Language Model Capabilities. a strategy that beats GPT-4o on AlpacaEval 2.0, MT-Bench, and FLASK by utilizing the combined strengths of several LLMs through a Mixture-of-Agents methodology; layers are constructed with numerous LLM agents, and each agent builds on the outputs of agents in the previous layers.
Transformers meet Neural Algorithmic Reasoners. Tokens in the LLM can now cross-attend to node embeddings from a GNN-based neural algorithmic reasoner (NAR) thanks to a new hybrid design; the resulting model, named TransNAR, shows gains in OOD reasoning across algorithmic challenges.
Self-Tuning: Instructing LLMs to Acquire New Knowledge through Self-Teaching Effectively. increases an LLM's capacity to learn new information from raw documents through self-teaching; the process consists of three steps: 1) a self-teaching component that enhances documents with a series of knowledge-intensive tasks emphasizing comprehension, memorization, and self-reflection; 2) the model is configured to continuously learn using only the new documents, aiding in the thorough acquisition of new knowledge; and 3) the deployed model is used to learn new information from new documents while evaluating its QA skills.
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models. a framework that gives a multimodal LLM access to a visual sketchpad and drawing tools; it can give a model, such as GPT-4, the ability to create intermediate sketches to reason over complex tasks; over strong base models without sketching, it performs better on many tasks; on all the tasks tested, GPT-4 equipped with SketchPad sets a new state of the art.
Mixture of Memory Experts. claims to enable scaling to a high number of parameters while keeping the inference cost fixed. It suggests a method to significantly reduce hallucination (10x) by tuning millions of expert adapters (e.g., LoRAs) to learn exact facts and retrieve them from an index at inference time. The memory experts are specialized to ensure faithful and factual accuracy on the data they were tuned on.
Multimodal Table Understanding. presents Table-LLaVa 7B, a multimodal LLM for multimodal table understanding; it produces a large-scale dataset MMTab, comprising table images, instructions, and tasks; it is comparable with GPT-4V and greatly outperforms existing MLLMs on numerous benchmarks.
Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement. suggests a training-efficient method to extend LLMs to longer context lengths (e.g., 4K -> 256K): it samples fine-tuning positions from a truncated Gaussian that emphasizes the middle part of the context, helping to alleviate the so-called "Lost-in-the-Middle" problem and tuning the LLM to effectively use information from the middle of long contexts (a small sampling sketch follows this table).
Simple and Effective Masked Diffusion Language Models. A simple diffusion model for language. It works fairly well and can generate tokens out of order.
MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding. A novel technique that dramatically lowers memory consumption during auto-regressive inference in transformers is called Multi-Layer Key-Value (MLKV) sharing.
Understanding Hallucinations in Diffusion Models through Mode Interpolation. This study looks into the reasons behind "hallucinations"—images that never were in the training set—that are produced by diffusion-based picture generation models.
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs. Chain of Preference Optimization (CPO) helps large language models (LLMs) become more adept at logical reasoning. CPO matches the reasoning steps of Chain-of-Thought (CoT) decoding with the optimal routes of ToT by fine-tuning LLMs using search trees from the Tree-of-Thought (ToT) technique.
Language Modeling with Editable External Knowledge. ERASE is a novel approach to updating language models. Unlike conventional methods that emphasize enhancing retrieval during prediction, ERASE incrementally deletes or rewrites entries in the knowledge base as new documents are incorporated.
Duoduo CLIP: Efficient 3D Understanding with Multi-View Images. Duoduo CLIP is a 3D representation learning model utilizing multi-view images rather than point-clouds for training and analysis.
CAMixerSR: Only Details Need More "Attention". CAMixerSR enhances image resolution by intelligently applying convolution to simpler areas and using deformable window attention for intricate textures.
‘Fighting fire with fire’ — using LLMs to combat LLM hallucinations. The number of errors produced by an LLM can be reduced by grouping its outputs into semantically similar clusters. Remarkably, this clustering can be performed by a second LLM, and the method’s efficacy can be evaluated by a third (a toy sketch follows this table). The associated article is here
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. Microsoft has published a collection of tiny VLMs under an MIT license that performs noticeably better in captioning, bounding, and classification than much larger models.
Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability. The logit-lens approach is refined here by decomposing logit outputs into contributions from individual model components, which helps in understanding the decision-making of transformer models. The method uses "prisms" for residual streams, attention layers, and MLP layers to show how each component affects predictions, offering insights into tasks the gemma-2b model performs, such as factual retrieval and arithmetic (a minimal logit-lens sketch follows this table).
PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers. Using sophisticated data analysis, decision QA is a new role for LLMs that identifies the optimal decisions.
ChangeViT: Unleashing Plain Vision Transformers for Change Detection. A methodology called ChangeViT makes use of vision transformers (ViTs) to identify significant environmental changes in remote sensing photos.
LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging. LayerMerge is a novel technique that simultaneously prunes activation functions and convolution layers to increase neural network efficiency.
Adversarial Attacks on Multimodal Agents. Vision-enabled language models (VLMs) such as Gemini and GPT-4o power autonomous agents that can perform tasks like code editing and purchasing. This investigation demonstrates how susceptible such agents are to adversarial attacks.
TimeSieve: Extracting Temporal Dynamics through Information Bottlenecks. A novel model called TimeSieve was created to address typical problems in time series forecasting.

News

Link description
Apple to ‘Pay’ OpenAI for ChatGPT Through Distribution, Not Cash. The collaboration between Apple and OpenAI isn't anticipated to bring in a significant amount of money for either company, at least not right away. Apple is not paying OpenAI as part of the agreement because it feels that integrating OpenAI's technology and brand into its products is as valuable as or more valuable than financial compensation. The agreement isn't exclusive; Apple is already talking about providing additional chatbot choices. In the long run, Apple intends to profit from AI by entering into revenue-sharing contracts with AI partners.
AI will make money sooner than you’d think, says Cohere CEO Aidan Gomez. Enterprise is the pathway to profit, Gomez says, but maybe don’t ask it to do medicine quite yet.
Fake beauty queens charm judges at the Miss AI pageant. An AI model from Romania named Aiyana Rainbow is a finalist in the first Miss AI pageant, which showcases AI-generated models on social media. The event is a part of "The FanVue World AI Creator Awards," which is organized by FanVue and highlights the talent of AI creators who can create captivating content without having to be the face of the work. The $5,000 prize package for Miss AI will include mentorship and support from the public relations community. At the end of June, the outcomes will be made public.
Elon Musk reconsiders phone project after Apple Intelligence OpenAI integration. Elon Musk threatened to ban Apple devices from the premises of his companies in response to Apple integrating OpenAI's ChatGPT into some of its devices.
Microsoft’s star AI chief peers into OpenAI’s code, highlighting an unusual rivalry. OpenAI was originally established as a counterweight to DeepMind, the AI startup that Google purchased in 2014. Yet Mustafa Suleyman, a co-founder of DeepMind, has recently taken on a once-unimaginable task: examining OpenAI's crown jewels, the proprietary algorithms that power foundation models like GPT-4, according to people familiar with the situation. That is because Suleyman now heads Microsoft's AI initiatives, and as part of its multibillion-dollar investment in OpenAI, Microsoft holds intellectual property rights to OpenAI's software.
Amazon says it’ll spend $230 million on generative AI startups. Amazon says that it will commit up to $230 million to startups building generative AI-powered applications.
McDonald’s ends AI drive-thru trial as fast-food industry tests automation. Companies have touted AI as the future of the industry, but technology has also resulted in viral videos of wrong orders
Balance effects of AI with profits tax and green levy, says IMF. Governments faced with economic upheaval caused by artificial intelligence should consider fiscal policies including taxes on excess profits and a green levy to atone for AI-related carbon emissions, according to the International Monetary Fund.
Introducing Gen-3 Alpha. Runway has unveiled a powerful new video generation model that will drive many of the existing features on its platform. Examples are available at the link.
DeepMind’s new AI generates soundtracks and dialogue for videos. V2A is an AI system that DeepMind is developing to create synchronized soundtracks for videos. It generates music, sound effects, and dialogue using diffusion models trained on audio, dialogue transcripts, and video clips.
Giant Chips Give Supercomputers a Run for Their Money. The California-based company Cerebras has shown in molecular dynamics calculations that its second-generation wafer-scale engine outperforms the world's fastest supercomputer by a large margin. It can also run inference on sparse large language models with no loss of accuracy at one-third the energy cost of a dense model. Both feats are made possible by the fast memory access and interconnects of Cerebras hardware. Cerebras aims to expand its wafer-scale engine to a broader range of problems, such as airflow models around cars and molecular dynamics simulations of biological processes.
Nvidia becomes world’s most valuable company amid AI boom. Chipmaker dethrones Microsoft and Apple as stock market surge boosts valuation above $3.34tn
The ‘Godfather of AI’ quit Google a year ago. Now he’s emerged out of stealth to back a startup promising to use AI for carbon capture. Renowned AI researchers Geoff Hinton and Max Welling have gathered a talented team to develop AI systems aimed at advancing material science for carbon capture.
Nvidia Conquers Latest AI Tests. Nvidia's Hopper architecture-based systems excelled in two recent MLPerf AI benchmark tests, which assess the fine-tuning of large language models and the training of graph neural networks.
Perplexity AI searches for users in Japan, via SoftBank deal. Perplexity is capitalizing on its strategic partnership with SoftBank to broaden its presence in Japan. As part of this initiative, it is providing a free year of its premium AI-powered search engine, Perplexity Pro. SoftBank's goal is to draw users by offering AI services without creating internal solutions. With a valuation of $1 billion, Perplexity is expanding its funding and investor base, which features prominent tech leaders and venture firms.
Introducing Local III. The open-source local agent, Open Interpreter, has recently received a significant upgrade. It now has the capability to control the computer seamlessly and operates entirely offline and locally.
Introducing the Property Graph Index: A Powerful New Way to Build Knowledge Graphs with LLMs. LlamaIndex has launched the Property Graph Index, significantly improving knowledge graph capabilities with enhanced modeling, storage, and querying features. This new index enables flexible graph construction and supports schema-guided, implicit, and free-form entity extraction. It also integrates with vector databases for hybrid searches and offers querying options through keyword expansion, vector similarity, Cypher queries, and custom traversal.
Decagon launches with $35m raised from Accel and a16z. Decagon is developing human-like AI agents for customer support and has recently secured $30 million in Series A funding from Accel, along with $5 million in seed funding from a16z. Decagon's product manages global support for companies such as Eventbrite, Rippling, Webflow, BILT, and Substack.
London premiere of movie with AI-generated script cancelled after backlash. Plans to show The Last Screenwriter, whose script is credited to ‘ChatGPT 4.0’, prompted complaints although the film-makers insist the feature is ‘a contribution to the cause’
OpenAI’s former chief scientist is starting a new AI company. Ilya Sutskever is launching Safe Superintelligence Inc., an AI startup that will prioritize safety over ‘commercial pressures.’
Claude 3.5 Sonnet. Claude 3.5 Sonnet outperforms Claude 3 Opus at a fifth of the cost, and it is currently the strongest vision model available. It demonstrates how far the frontier models have progressed.
Apple researchers add 20 more open-source models to improve text and image AI. Apple has added 20 Core ML models to the Hugging Face open-source AI repository, broadening the selection of public models for image classification and depth segmentation. These contributions follow Apple's release earlier this year of four OpenELM models and the Ferret large language model on Hugging Face. The move shows Apple's commitment to advancing AI capabilities and its growing involvement with the AI research community.
Factory Raises $15M Series A from Sequoia. Led by Sequoia Capital, Factory has raised $15 million in Series A funding to grow its workforce and improve its Droids software development toolset, which leverages artificial intelligence. Its products are rapidly expanding its customer base and setting new benchmarks on the SWE-bench AI coding benchmark. With Factory, software engineering will be increasingly automated, cutting down on laborious processes and speeding up development cycles.
Optimizing AI Inference at Character.AI. Character.AI serves about 20,000 queries per second, roughly 20% of Google Search's request volume, and does so efficiently thanks to several inference optimizations.
Apple delays launch of AI-powered features in Europe, blaming EU rules. Apple says competition rules that require functionality with rival products would compromise privacy and security

Resources

Link description
Nemotron-4 340B. offers an instruct model for generating high-quality synthetic data and a reward model for filtering data on multiple quality attributes; exhibits impressive results on widely used benchmarks such as MMLU and GSM8K, and competes with GPT-4 on a number of tasks, including scoring highly in multi-turn chat; a preference dataset is released alongside the base model.
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. A systematic benchmark of jailbreak attacks on LLMs, evaluating how implementation details of attacks and defenses affect success rates.
MCTSr: Mathematic as a Blackbox for LLM. The MCT Self-Refine (MCTSr) algorithm integrates Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to enhance performance in complex mathematical reasoning tasks by leveraging systematic exploration and heuristic self-refine mechanisms. Extensive experiments show that MCTSr significantly improves success rates on Olympiad-level mathematical problems, advancing the application of LLMs in strategic reasoning and decision-making.
VideoGPT. To improve video understanding, VideoGPT+ combines image and video encoders: image encoders capture fine-grained spatial information, while video encoders provide temporal context.
Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach. To enhance scene graph generation (SGG) for very-high-resolution satellite imagery (VHR SAI), this research introduces a new dataset and methodology.
LLM.Mojo. This project is a port of Andrej Karpathy's llm.c to Mojo, currently in beta and subject to changes.
Depth Anything V2. The new Depth Anything model was trained with synthetic data, and its performance on intricate scenes has significantly improved.
DeepSeek-Coder-V2. DeepSeek-Coder-V2 scores above 90 on HumanEval and matches GPT-4 Turbo on several other difficult benchmarks. It is free for commercial use and accessible via an API.
HelpSteer2: Open-source dataset for training top-performing reward models. Nvidia has released a dataset and training procedure, along with an excellent paper, for training reward models that align model output with human preferences.
Differentiable rasterization. Given a program that produces a vector representation of an image (think SVG), rasterization turns it into a pixel representation (think PNG). For gradient-based optimization of the vector parameters, every step ought to be differentiable. This article explains how to write a small, differentiable SVG-like rasterizer; a toy version of the trick appears below.
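A toy version of the usual trick, assuming the hard inside/outside test is replaced with a sigmoid of signed distance (a sketch of the general idea, not the article's code):

```python
import numpy as np

def soft_circle_raster(cx, cy, r, size=64, sharpness=8.0):
    """Rasterize a circle so pixel values vary smoothly with (cx, cy, r):
    each pixel is a sigmoid of its signed distance to the circle
    boundary, making the image differentiable in the shape parameters."""
    grid = np.mgrid[0:size, 0:size].astype(float)
    ys, xs = grid[0], grid[1]
    signed_dist = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2) - r
    return 1.0 / (1.0 + np.exp(sharpness * signed_dist))
```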
LARS - The LLM & Advanced Referencing Solution. LARS is an application that enables you to run LLMs (Large Language Models) locally on your device, upload your own documents, and engage in conversations wherein the LLM grounds its responses with your uploaded content.
Beyond the Basics of Retrieval for Augmenting Generation. The creator of RAGatouille delivered a great talk about ColBERT, some of the open issues, and how to significantly improve RAG performance.
TokenCost. Tokencost helps calculate the USD cost of using major Large Language Model (LLM) APIs by estimating the cost of prompts and completions. The underlying arithmetic is simple, as the sketch below shows.
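A sketch of the core calculation with hypothetical placeholder prices (not the library's actual API, and not current rates):

```python
PRICES_PER_MILLION = {  # USD per million tokens; placeholder values only
    "example-model": {"prompt": 5.00, "completion": 15.00},
}

def estimate_cost(model, prompt_tokens, completion_tokens):
    """Estimate the USD cost of one API call from token counts."""
    p = PRICES_PER_MILLION[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000

print(estimate_cost("example-model", 1_200, 300))  # -> 0.0105
```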
GaiaNet node. Install and run your own AI agent service.
Meta Chameleon. Chameleon is an early fusion model that processes images and text tokens concurrently. The team published the paper a few weeks ago and has now released model checkpoints along with inference code.
OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations. OGNI-DC is a new framework for depth completion that employs "Optimization-Guided Neural Iterations" (OGNI). This method refines a depth gradient field and incorporates the depth gradients into a depth map.
Subobject-level Image Tokenization. Subobject tokenization is a novel approach for vision models to interpret images. Rather than dividing images into fixed square patches, this method allows models to analyze images by identifying meaningful segments, such as parts of objects.
Introduction to Granite Code Models. We introduce the Granite series of decoder-only code models for code generative tasks (e.g., fixing bugs, explaining code, documenting code), trained with code written in 116 programming languages. A comprehensive evaluation of the Granite Code model family on diverse tasks demonstrates that our models consistently reach state-of-the-art performance among available open-source code LLMs.
FireFunction V2: Fireworks Function Calling Model. An open model, trained on top of Llama 3 70B, that matches GPT-4o on function-calling benchmarks.
Argilla. Argilla offers a collaboration platform for AI developers and subject-matter experts who need complete data ownership, high-quality outputs, and overall efficiency.
TroL: Traversal of Layers for Large Language and Vision Models. Large language and vision models (LLVMs) with sizes of 1.8B, 3.8B, and 7B parameters are part of the new TroL family of efficient LLVMs.
Dot. A stand-alone open-source program designed to be simple to use for local LLMs, and specifically RAG, to interact with files and documents in a manner similar to Nvidia's Chat with RTX.
WebCanvas: Benchmarking Web Agents in Online Environments. WebCanvas is a pioneering online evaluation framework designed to address the dynamic nature of web interactions. It provides a realistic assessment of autonomous web agents by utilizing live web environments and emphasizing task completion through the identification of key nodes.
CIFAR-10 Airbench. CIFAR-10 is a standard image-classification benchmark. This project provides a training setup that achieves strong performance in a remarkably short time.
Cost Of Self Hosting Llama-3 8B-Instruct. Self-hosting an LLM such as Llama-3 8B-Instruct can be much more expensive than using an API: approximately $17 per million tokens on rented hardware, versus about $1 per million tokens for ChatGPT. Buying hardware outright can push the marginal cost below $0.01 per million tokens, but the initial investment would take about 5.5 years to pay for itself. The break-even arithmetic is sketched below.
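A back-of-the-envelope version of the break-even arithmetic, using the article's per-token figures but a made-up hardware cost and workload chosen only to land near its 5.5-year estimate:

```python
api_cost_per_m = 1.00        # USD per million tokens via the hosted API
marginal_cost_per_m = 0.01   # near-zero marginal cost once hardware is owned
hardware_cost = 8_000.00     # hypothetical upfront GPU cost, USD
tokens_per_month_m = 120.0   # assumed workload, millions of tokens per month

monthly_savings = (api_cost_per_m - marginal_cost_per_m) * tokens_per_month_m
years = hardware_cost / monthly_savings / 12
print(f"{years:.1f} years to break even")  # ~5.6 years with these numbers
```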
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models. A new benchmark for assessing modern monocular depth and surface-normal estimation models.
An Empirical Study of Mamba-based Language Models. The Nvidia research behind its previously showcased hybrid Mamba model is now available.

Perspectives

Link description
Computer says yes: how AI is changing our romantic lives. Artificial intelligence is creating companions who can be our confidants, friends, therapists and even lovers. But are they an answer to loneliness or merely another way for big tech to make money?
Nvidia’s New Sales Booster: The Global Push for National AI Champions. Governments everywhere are increasing spending to entice corporations and multinationals to build new data centers and renovate existing ones, so that AI can be developed locally and large language models can be trained in local languages on data from their citizens. According to Nvidia, these sovereign AI initiatives should generate over $10 billion in revenue this year. Several governments are concerned about the potential economic effects of generative AI; they want sovereign clouds for their sensitive data and AI infrastructure, and US IT companies are happy to build them.
General Intelligence (2024). What is lacking, and what would it take, to create a generally intelligent agent? This essay argues we are only a few years away and examines the three concepts required to build such an agent. The author is an OpenAI researcher.
Human neuroscience is entering a new era — it mustn’t forget its human dimension. The field is taking a leap forward thanks to innovative technologies, such as artificial intelligence. Researchers must improve consent procedures and public involvement.
AI and Euro 2024: VAR is shaking up football — and it’s not going away. Sports physicist Eric Goff explains how updates to the technology can help referees make the toughest calls.
How cutting-edge computer chips are speeding up the AI revolution. Engineers are harnessing the powers of graphics processing units (GPUs) and more, with a bevy of tricks to meet the computational demands of artificial intelligence.
Apple’s Intelligent Strategy. Apple showed off an incredible strategic edge in the AI arms race - but some might have missed that the company hints at using its biggest weakness as a formidable weapon against competitors.
How to Fix “AI’s Original Sin”. This article discusses the copyright issues raised by AI models trained on protected content without authorization. It advises AI developers to respect copyright signals, put safeguards in place to stop generating content that violates intellectual property rights, and design business models that ensure fair compensation for content creators, for example through retrieval-augmented generation (RAG) and collaborative AI content ecosystems.
Takeaways from OpenAI and Google's May announcements. With the introduction of sophisticated AI models by OpenAI and Google, real-time multimodal understanding and response are now possible, promising enhanced AI assistants and advances in voice agents. OpenAI's GPT-4o promises double the speed and half the cost of its predecessor, while Google's Gemini 1.5 Flash offers a notable reduction in latency and cost. Both giants are integrating AI into their ecosystems, with OpenAI targeting consumer markets through partnerships and products that could reach up to a billion users.
Collection of AI Side Business Money-Making Information. There are some respectable AI projects on this list that even beginners can work on.
paramount. Paramount lets your expert agents evaluate AI chats.

meme-of-the-week

Back to index

ML news: Week 10 - 16 June

Research

Link description
Scaling neural machine translation to 200 languages. presents a massive multilingual model based on a sparsely gated Mixture of Experts architecture, trained with transfer learning across 200 languages on data produced with a method designed for low-resource languages; evaluated on 40K translation directions, it achieves an average 44% improvement in translation quality.
MatMul-free LLMs. suggests an implementation that removes matrix multiplication operations from LLMs while maintaining performance at billion-parameter scales; claims that memory consumption can be cut by more than 10x with an optimized inference kernel; the performance gap between full-precision Transformers and the MatMul-free models narrows as model size increases. The core trick, ternary weights, is sketched below.
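A naive sketch of why ternary weights remove multiplications (a reference implementation of the idea, not the paper's optimized kernel):

```python
import numpy as np

def ternary_matvec(W, x):
    """With weights constrained to {-1, 0, +1}, a matrix-vector product
    reduces to signed additions: add the inputs where the weight is +1,
    subtract them where it is -1, skip zeros."""
    out = np.empty(W.shape[0])
    for i, row in enumerate(W):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

W = np.array([[1, 0, -1], [0, 1, 1]])
x = np.array([0.5, 2.0, -1.0])
print(ternary_matvec(W, x), W @ x)  # both give [1.5, 1.0]
```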
Buffer of Thoughts. presents a thought-augmented reasoning approach that improves the accuracy, efficiency, and robustness of LLM-based reasoning; it maintains a meta-buffer of high-level thought templates distilled from prior problem-solving processes; the relevant template is retrieved and instantiated with task-specific reasoning structures during inference; reports SOTA performance on 10 difficult tasks at 12% of the cost of multi-query prompting methods such as Tree-of-Thoughts.
SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales. a training framework that teaches LLMs to express more accurate fine-grained confidence estimates along with self-reflective rationales; it first performs supervised fine-tuning on a dataset containing summaries of the differences between multiple reasoning chains, then applies reinforcement learning to calibrate confidence estimates, encouraging accurate, high-confidence predictions and penalizing overconfidence in erroneous outputs.
The Geometry of Categorical and Hierarchical Concepts in Large Language Models. investigates the geometry of categorical concepts and how the hierarchical relations between them are encoded in LLMs. It discovers that the hierarchical structure is reflected in the representation of complex concepts by polytopes made from direct sums of simplices, while simple categorical concepts are represented as simplices by the LLMs.
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback. suggests a technique that uses a very small number of demonstrations as feedback to align LLMs to a particular setting; it outperforms few-shot prompting, SFT, and self-play methods on the tested benchmarks and aligns LLM outputs to a user's demonstrated behaviors. Additionally, it can learn fine-grained style and task alignment across domains.
Towards Scalable Automated Alignment of LLMs. gives a summary of the techniques used to align LLMs and examines four directions: 1) alignment through inductive bias; 2) alignment through behavior imitation; 3) alignment through model feedback; and 4) alignment through environment feedback.
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments. a novel framework offering multiple tasks and environments for broad, concurrent, real-time agent exploration; it constructs a generally capable LLM-based agent that can self-evolve and investigates its generalization to previously unseen tasks and environments.
Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment. A Synthetic-Domain Alignment (SDA) framework has been developed by researchers to improve test-time adaptation (TTA) techniques. By fine-tuning pretrained models with synthetic data produced by a conditional diffusion model, SDA efficiently aligns source and synthetic domains.
ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization. Reward-based Noise Optimization (ReNO) is a novel technique to improve Text-to-Image (T2I) models during inference by employing signals from reward models with human preferences to optimize the baseline noise.
YOLO-World: Real-Time Open-Vocabulary Object Detection. With YOLO-World, researchers have improved the widely used YOLO object detectors and included open-vocabulary detection. This method, which combines large-scale dataset training with vision-language modeling, enables it to swiftly and accurately detect a wide range of objects, even in situations for which it was not designed.
Improved Scene Landmark Detection for Camera Localization. Using distinctive scene landmarks, researchers have developed a novel, privacy-friendly technique for camera localization. This method, which does not rely on real 3D point clouds for localization, is very accurate and storage-efficient since it makes use of 3D scene landmarks and a CNN-based heatmap.
Proofread: Fixes All Errors with One Tap. The Gboard team has described how they correct sentence- and paragraph-level problems in written text on the device using SFT on a PaLM2-XS model. They discovered that latency optimizations led to significant gains in utilization.
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model. Using a new quantization approach, the Snap Research team was able to increase speed while reducing the size of the Stable Diffusion UNet model from 1.72 GB to 219 MB. Although the quantization technique is a little complicated, it shows great promise for generative model execution on consumer hardware.
Introducing Apple’s On-Device and Server Foundation Models. During WWDC 2024, Apple debuted "Apple Intelligence", an AI system built into macOS Sequoia, iOS 18, and iPadOS 18. It has sophisticated generative models for a variety of everyday tasks, like text refinement, image generation, and notification summarization. With an emphasis on user privacy and responsible AI development, the system integrates cloud and on-device capabilities to improve the user experience across Apple products.
OVMR: Open-Vocabulary Recognition with Multi-Modal References. OVMR is a novel approach that combines textual descriptions with sample photos to improve open-vocabulary recognition.
Predictive Dynamic Fusion. The Predictive Dynamic Fusion (PDF) architecture solves stability and reliability problems to improve multimodal learning.
Compute Better Spent: Replacing Dense Layers with Structured Matrices. Transformer computation is dominated by its linear layers. This work replaces dense layers with structured matrices, such as Monarch matrices, finding structured representations with better scaling laws than naive dense layers at the same compute.
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models. A thorough methodology called CARES is used to assess the reliability of Medical Large Vision Language Models (Med-LVLMs).
Learning to Route Among Specialized Experts for Zero-Shot Generalization. PHATGOOSE is an approach that dramatically increases an AI's capacity to generalize and learn new tasks without prior exposure by efficiently routing between different specialized language models for each portion of a task.
Diabetic Retinopathy Detection. Researchers have developed a framework that improves the grading of diabetic retinopathy (DR), a condition that can lead to visual impairment.
BERTs are Generative In-Context Learners. In a different universe, BERT models, rather than their decoder-only GPT counterparts, might have been the ones shown to be in-context learners. This paper investigates that scenario and finds BERTs perform remarkably well at information retrieval but poorly at knowledge acquisition, most likely a consequence of their bidirectional attention mechanism.
TextGrad: Automatic "Differentiation" via Text. This study treats a language model capable of revising text as a backpropagation-like system that propagates textual feedback. The researchers report significant benchmark gains, though the comparisons are not compute-matched against baseline models.
Improve Mathematical Reasoning in Language Models by Automated Process Supervision. DeepMind found a way to automate much of the labor-intensive process-supervision pipeline that normally requires human intervention. With strong base models, it automated a significant portion of the procedure, yielding strong mathematical-reasoning performance from tuned Gemini Pro models.
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation. LlamaGen is an autoregressive image generation model that scales better than diffusion alternatives. By training a class-conditioned model on ImageNet, its researchers achieved a new strong FID.
When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models. To address efficiency concerns in autoregressive large language models, researchers explore combining speculative decoding with linear attention techniques, presenting an augmentation strategy that makes linear attention compatible with speculative decoding to improve training and performance.
What If We Recaption Billions of Web Images with LLaMA-3? Using a vision model to recaption web-scraped images significantly enhances downstream model performance. This is particularly true for models like CLIP.
Hearing Anything Anywhere. This research presents DiffRIR, a new framework that uses a planar scene reconstruction with a limited number of room impulse response (RIR) recordings to recreate the spatial acoustic properties of environments.
Simple and Effective Masked Diffusion Language Models. By using an efficient training recipe and incorporating a simpler Rao-Blackwellized objective, researchers have shown that masked discrete diffusion models can compete with autoregressive approaches in language modeling.

News

Link description
First NHS physiotherapy clinic run by AI to start this year. New platform to provide same-day appointments with digital physiotherapist in effort to cut waiting times
Apple to launch iOS 18 AI features marketed as ‘Apple Intelligence’. Bloomberg’s Mark Gurman today reports that Apple will launch its upcoming AI initiatives in iOS 18 and other operating systems under the brand name ‘Apple Intelligence’, which is obviously a convenient twist on the ‘AI’ acronym.
Claude’s Character. Claude is not simply an average, sycophantic AI that nods in agreement with the user. A character variant of Constitutional AI was specifically used to shape Claude's personality. This essay goes into detail on how post-training controls the kind of output Claude typically produces in order to express this desired character.
Databricks + Tabular. With the acquisition of Tabular, Databricks has brought together major players from Apache Iceberg and Delta Lake to concentrate on data format interoperability for its lakehouse architecture. With Delta Lake UniForm's compatibility solution at the forefront, the objective is to establish a single, open standard for data interoperability in order to prevent data silos.
How the voices for ChatGPT were chosen. We worked with industry-leading casting and directing professionals to narrow down over 400 submissions before selecting the 5 voices.
OpenAI and Apple announce partnership to integrate ChatGPT into Apple experiences. Apple is integrating ChatGPT into experiences within iOS, iPadOS, and macOS, allowing users to access ChatGPT’s capabilities—including image and document understanding—without needing to jump between tools.
Apple Intelligence: every new AI feature coming to the iPhone and Mac. Apple announced “Apple Intelligence” at WWDC 2024, its name for a new suite of AI features for the iPhone, Mac, and more. Starting later this year, Apple is rolling out what it says is a more conversational Siri; custom, AI-generated “Genmoji”; and GPT-4o access that lets Siri turn to OpenAI’s chatbot when it can’t handle a request.
Asana says its new AI teammates are ready to manage your projects. With the goal of enhancing productivity and output quality, Asana has introduced "AI teammates" to take care of duties like proactive project detail organization and request triaging. This innovative feature is integrated into the workflow and functions like a human team member while yet being supervised by humans. It was showcased at Asana's Work Innovation Summit.
Apple stock reaches record high after the announcement of new AI features. Tech giant’s shares climb 7% a day after reveal of artificial intelligence features meant to increase appeal of the iPhone
Elon Musk abruptly withdraws lawsuit against Sam Altman and OpenAI. Tesla CEO had accused the company of abandoning mission of creating artificial intelligence for the greater good of humanity
Mistral raises €600m series B. Mistral announced €600M in Series B funding on its first anniversary.
Mozilla Builders. The first Mozilla Builders Accelerator embraces local AI, which enhances accessibility and privacy by bringing AI models and applications directly onto personal devices. Key areas include developer-productivity tools, locally run AI agents, dynamic user interfaces, fine-tuning adaptation, retrieval-augmented generation, and enhanced function calling. The initiative's goal is for participants to create an open-source, decentralized AI ecosystem with a focus on user empowerment.
CaseMark Raises $1.7M to Empower Attorneys with AI. Gradient Ventures led a $1.7M pre-seed investment in CaseMark, an AI firm transforming legal operations, to expand its AI solutions for the legal sector.
OpenAI ex-employees worry about company’s control over their millions of dollars in shares. With OpenAI’s valuation soaring and an IPO nowhere in sight, the company is giving employees the chance to sell some equity in secondary transactions. Ex-employees sitting on millions of dollars worth of stock worry about OpenAI’s ability to force them to give up their shares, according to sources and internal messages. OpenAI recently circulated a document indicating that ex-employees who work at competitors are not included in the tender offers.
Announcing the Open Release of Stable Diffusion 3 Medium. Stable Diffusion 3 Medium is Stability AI’s most advanced text-to-image open model yet. The small size of this model makes it perfect for running on consumer PCs and laptops as well as enterprise-tier GPUs.
Shutterstock ImageAI, Powered by Databricks. Databricks and Shutterstock announced a text-to-image Generative AI model optimized for enterprise use
OpenAI Annualized Revenue Doubles. OpenAI has more than doubled its annualized revenue to hit $3.4B.
Perplexity was planning revenue-sharing deals with publishers when it came under media fire. Perplexity, the AI search startup that recently came under fire from Forbes for allegedly misusing its content, was already working on revenue-sharing deals with high-quality publishers.
Microsoft’s Nadella Is Building an AI Empire. OpenAI Was Just the First Step. After landing the deal that launched his company to the front of the artificial intelligence race, the tech chief is spreading his bets. Will it be enough?
OpenAI adds former NSA chief to its board. OpenAI said on Thursday that it is adding former NSA head and retired Gen. Paul Nakasone to its board of directors as well as its newly formed Safety and Security Committee. Why it matters: OpenAI is looking to convince skeptics that it is taking sufficient steps to ensure its models are safe as it works toward its goal of superintelligence.
Apple Made Once-Unlikely Deal With Sam Altman to Catch Up in AI. An OpenAI agreement is due to be announced at Apple’s developer conference next week.
LLM-Squared. Sakana AI used an evolutionary approach to discover a preference-optimization scheme that works better than DPO, training models on loss functions proposed as code by a language model. After about 100 generations, it surfaced several variants with very high performance.
Gemini 1.5 Pro and 1.5 Flash GA, 1.5 Flash tuning support, higher rate limits, and more API updates. Google AI has released updates to the Gemini API and Google AI Studio, including support for model tuning, the stable release of Gemini 1.5, increased API rate limits, additional JSON schema features, and mobile compatibility. The updates give developers more options for building customized applications at scale.
AI generated sound effects are here. A new AI audio model from ElevenLabs generates a variety of voices, tunes, and sound effects from text prompts. A partnership with Shutterstock gives it access to Shutterstock's audio library, helping media professionals produce high-quality audio quickly and at scale. ElevenLabs' platform makes it simple for users to create sounds, streamlining the audio design process.
OpenAI welcomes Sarah Friar (CFO) and Kevin Weil (CPO). With the appointment of Kevin Weil as CPO and Sarah Friar as CFO, OpenAI has strengthened its leadership team to further its goal of developing AI products and doing research that is useful to developers, businesses, and consumers.
Why the pope has the ears of G7 leaders on the ethics of AI. Pope Francis is leaning on the thinking of Paolo Benanti, a friar adept at explaining how technology can change the world.
AI used to predict potential new antibiotics in a groundbreaking study. Scientists used an algorithm to mine ‘the entirety of the microbial diversity’ on Earth, speeding up antibiotic resistance research

Resources

Link description
Spreadsheet Is All You Need. Complete GPT-2 style transformer model with all weights, parameters, and connections included in a spreadsheet. It is a tiny model that runs entirely within the rows and columns of a spreadsheet and is based on NanoGPT.
Inspectus. Inspectus is a versatile visualization tool for large language models. It runs smoothly in Jupyter notebooks via an easy-to-use Python API. Inspectus provides multiple views, offering diverse insights into language model behaviors.
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model. SpatialRGPT is a powerful vision-language model adept at understanding both 2D and 3D spatial arrangements. It can process any region proposal, such as boxes or masks, and provide answers to complex spatial reasoning questions.
Thread. Thread is a Jupyter Notebook that combines the experience of OpenAI's code interpreter with the familiar development environment of a Python notebook. With Thread, you can use natural language to generate cells, edit code, ask questions or fix errors all while being able to edit or re-run code as you would in a regular Jupyter Notebook.
How AI Image Models Work. AI image generation has advanced rapidly since 2022. Using a children's-game analogy, this article explains how these models refine noisy inputs into precise, detailed visuals, illustrating the fast progress and promise of AI in visual creation.
Active Stereo Without Pattern Projector. Without the need for a hardware pattern projector, researchers have presented a new framework that incorporates active stereo concepts into passive cameras that are commonly used.
GLM-4-9B-Chat. Excellent model with support for 26 languages, trained on 10T tokens by Tsinghua's KEG group.
DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data. DIRECT-3D is a new text-to-3D generative model that directly generates 3D contents in a single forward pass without optimization.
Together MoA. Together has presented Mixture of Agents (MoA), a cutting-edge technique that mixes many LLMs for optimal performance, outperforming GPT-4o with an AlpacaEval 2.0 score of 65.1%. MoA employs a tiered architecture in which aggregators in later levels improve the initial answers from different models, improving output quality through cooperation. Even with improved precision, MoA still struggles with latency. Reducing latency and improving model design are two potential future possibilities.
Mistral.rs. Mistral.rs is a fast LLM inference (Rust-based inference framework) platform supporting inference on a variety of devices, quantization, and easy-to-use application with an Open-AI API compatible HTTP server and Python bindings.
Generalizable Human Gaussians from Single-View Image. The Human Gaussian Model (HGM) is a diffusion-guided framework for building 3D human models from a single image.
Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis. The LE3D approach achieves real-time HDR view synthesis from RAW images. It works especially well for nighttime scenes.
TORAX. Google DeepMind's differentiable fusion tokamak simulator, written in Python with JAX, is now publicly available. The simulator solves several demanding PDEs and has strong auto-diff capabilities.
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising. A novel acceleration approach called AsyncDiff makes it possible to perform parallel processing in diffusion models. By splitting the noise prediction model into several parts and executing them on different devices, it drastically cuts latency without sacrificing quality.
PowerInfer-2: Fast Large Language Model Inference on a Smartphone. Fast on-phone inference for the Mixtral 47B MoE model.
The AXLearn Library for Deep Learning. AXLearn is a library built on top of JAX and XLA to support the development of large-scale deep-learning models.
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling. Samba is a simple yet powerful hybrid model with an unlimited context length. Its architecture is frustratingly simple: Samba = Mamba + MLP + Sliding Window Attention + MLP, stacked at the layer level; a sketch of one such layer follows.
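A sketch of one layer under the stated recipe, with `mamba` and `swa` as stand-in modules (the real components are not implemented here, and the pre-norm residual placement is an assumption):

```python
import torch.nn as nn

class SambaLayer(nn.Module):
    """Layer-level stacking: Mamba -> MLP -> sliding-window attention -> MLP,
    each sublayer pre-normed and wrapped in a residual connection."""
    def __init__(self, d_model, mamba, swa):
        super().__init__()
        self.mamba, self.swa = mamba, swa  # e.g. nn.Identity() for a smoke test
        self.mlp1 = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                  nn.Linear(4 * d_model, d_model))
        self.mlp2 = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                  nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x):
        x = x + self.mamba(self.norms[0](x))
        x = x + self.mlp1(self.norms[1](x))
        x = x + self.swa(self.norms[2](x))
        x = x + self.mlp2(self.norms[3](x))
        return x
```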
DiffusionKit. Framework and tooling for running diffusion models on Apple's MLX framework.
Splash Attention. A new DeepMind kernel in JAX for sparse flash attention.
Hugging Face acquires Argilla. Argilla, a company specializing in data for preference optimization, has been acquired.

Perspectives

Link description
Building AI products. Though they can't give exact answers to questions, large language models (LLMs) like ChatGPT are excellent at producing responses that seem correct. In order to improve user experience and enhance functionality while reducing errors, AI in the future will integrate LLMs into specialized tools or embed them into already-existing applications. This will contextualize AI outputs within controllable, specified areas.
Why passwords still matter in the age of AI. As Apple’s new Passwords app tries to solve our identity crisis, why are we still proving who we are via strings of random characters?
Examining LLM performance on public benchmarks. How overfit are popular LLMs to public benchmarks? According to new research from Scale AI's SEAL lab, Mistral and Phi overfit benchmarks, but GPT, Claude, Gemini, and Llama do not. The researchers assessed public LLMs for overfitting on GSM8k and created a new evaluation set, GSM1k.
How to track the economic impact of public investments in AI. National statistics systems should recognize the researchers whose ideas drive artificial intelligence applications, not just machines and factory outputs.
Maintaining Large-Scale AI Capacity At Meta. To meet AI demands, Meta is modernizing its data centers throughout the world. For AI training tasks, it intends to scale to 600,000 GPUs. In order to assure minimal disruptions and constant performance while enabling quick infrastructure scalability, this calls for creative maintenance tactics and tools like OpsPlanner.

meme-of-the-week

Back to index

ML news: Week 3 - 9 June

Research

Link description
Contextual Position Encoding: Learning to Count What's Important. suggests a new position encoding method, CoPE, that conditions position on context by incrementing position only on certain tokens; the resulting position measure can attend to the i-th particular word, noun, or sentence, representing different levels of position abstraction; it improves perplexity on language modeling and coding tasks.
Faithful Logical Reasoning via Symbolic Chain-of-Thought. suggests a way to enhance LLMs' capacity for logical reasoning by combining logical rules and symbolic expressions with chain-of-thought (CoT) prompting; the prompting method, Symbolic Chain-of-Thought, is a fully LLM-based framework with the following key steps: 1) translates natural-language context into a symbolic format, 2) derives a step-by-step solution plan based on symbolic logical rules, and 3) employs a verifier to validate the translation and the reasoning chain.
Transformers Can Do Arithmetic with the Right Embeddings. The main problem this work addresses is the inability of transformers to track the exact position of digits; the fix is to add an embedding to each digit that encodes its position relative to the start of the number (sketched below); the gains also transfer to multi-step reasoning tasks involving sorting and multiplication; achieves 99% accuracy on 100-digit addition problems by training on only 20-digit numbers with a single GPU.
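A sketch of how the per-digit positional signal could be computed, assuming character-level tokens (illustrative only, not the authors' code):

```python
def digit_offsets(tokens):
    """For each token, return its 1-based offset within the contiguous
    digit run it belongs to (0 for non-digit tokens); this offset would
    index an extra learned embedding added to each digit."""
    offsets, run = [], 0
    for tok in tokens:
        run = run + 1 if tok.isdigit() else 0
        offsets.append(run)
    return offsets

print(digit_offsets(list("12+345=357")))  # [1, 2, 0, 1, 2, 3, 0, 1, 2, 3]
```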
GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning. blends the reasoning powers of GNNs with the language understanding skills of LLMs in a RAG fashion; the GNN extracts relevant and useful graph information, and the LLM uses the information to answer questions over knowledge graphs (KGQA); GNN-RAG outperforms or matches GPT-4 performance with a 7B tuned LLM, and improves vanilla LLMs on KGQA.
Attention as an RNN. presents a new attention mechanism that can be trained in parallel (like Transformers) and updated with new tokens in constant memory (like RNNs); it is based on the parallel prefix-scan algorithm, which enables efficient computation of attention's many-to-many RNN output; it achieves performance comparable to Transformers on 38 datasets while being more time- and memory-efficient. The sequential form of the recurrence is sketched below.
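The sequential form of the recurrence, for a single query; the paper's contribution is computing this with a parallel prefix scan, while this sketch shows only the constant-memory RNN view (online softmax):

```python
import numpy as np

def attention_recurrent(q, keys, values):
    """Softmax attention computed token by token with a running
    (max, numerator, denominator) state."""
    m, num, den = -np.inf, 0.0, 0.0
    for k, v in zip(keys, values):
        s = float(q @ k)
        m_new = max(m, s)
        scale = np.exp(m - m_new)  # rescale old state for numerical stability
        num = num * scale + v * np.exp(s - m_new)
        den = den * scale + np.exp(s - m_new)
        m = m_new
    return num / den
```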
Are Long-LLMs A Necessity For Long-Context Tasks? argues that long-context LLMs are not strictly necessary for long-context tasks; suggests a reasoning framework that enables short-context LLMs to handle such tasks by adaptively accessing and utilizing the context according to the task at hand, breaking the long context into short contexts and processing them through a decision-making process.
Sparse maximal update parameterization: A holistic approach to sparse training dynamics. All frontier model labs use muP, a potent tool, to transfer hyperparameters fine-tuned on tiny models to bigger, more costly training runs. This study investigates how to achieve that for sparse models, resulting in significantly better training results and lower computation expenses.
Exploring Color Invariance through Image-Level Ensemble Learning. To address color bias in computer vision, researchers have created a novel learning technique called Random Color Erasing. By selectively excluding color information from training data, this technique strikes a balance between the significance of color and other parameters, producing models that perform better in challenging situations like industrial and wide-area surveillance.
Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models. Conifer enhances LLMs' comprehension of intricate instructions by utilizing a progressive learning methodology and a customized dataset.
LLM Merging Competition: Building LLMs Efficiently through Merging. Sakana AI is sponsoring the LLM Merging challenge at NeurIPS this year.
Tribeca to Screen AI-Generated Short Films Created by OpenAI’s Sora. Short films generated by artificial intelligence are popping up at more and more film festivals, and the largest event yet is dedicating an entire section to AI-generated movies.
Adapting Large Multimodal Models to Distribution Shifts: The Role of In-Context Learning. A technique called InvariantSelectPR is intended to make Large Multimodal Models (LMMs) more adaptive in domain-specific fields such as healthcare.
TAIA: Large Language Models are Out-of-Distribution Data Learners. A technique called TrainAllInfAttn improves the performance of large language models in specialized domains with scarce data.
MegActor: Harness the Power of Raw Video for Vivid Portrait Animation. A new model called MegActor uses unprocessed driving videos to create more lifelike portrait animation. It addresses identity leakage and background interference, producing remarkable results with a unique data-creation framework and background-encoding techniques.
MeshXL: Neural Coordinate Field for Generative 3D Foundation Models. MeshXL is a new model that generates high-quality 3D meshes.
Position-Guided Prompt Learning for Anomaly Detection in Chest X-Rays. Position-guided Prompt learning method for Anomaly Detection in Chest X-rays (PPAD). PPAD leverages learnable text prompts and image prompts to minimize the gap between pre-training data and task-specific data. Through position-guided prompts, the model can focus on various regions, simulating the diagnostic process of experts.
Tree Diffusion: Diffusion Models For Code. A wonderful diffusion paper that diffuses over the code that renders an image, allowing direct edits as part of the diffusion process. Although slow, it can easily be combined with search to significantly improve reasoning over programs.
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models. Expanding upon the Greedy Coordinate Gradient (GCG) approach, researchers have improved methods for optimization-based jailbreaking of large language models.
ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation. A training-free video interpolation technique for generative video diffusion models has been developed by researchers. This novel method improves frame rates without requiring a lot of training or big datasets and works with different models.
A whole-slide foundation model for digital pathology from real-world data. Prov-GigaPath, a whole-slide pathology foundation model pre-trained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides. To pretrain Prov-GigaPath, we propose GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. We further demonstrate the potential of Prov-GigaPath on vision–language pretraining for pathology by incorporating the pathology reports. In sum, Prov-GigaPath is an open-weight foundation model that achieves state-of-the-art performance on various digital pathology tasks, demonstrating the importance of real-world data and whole-slide modeling.
DreamMat: High-quality PBR Material Generation with Geometry- and Light-aware Diffusion Models. DreamMat is a clever approach to generating textures for 3D objects. Given a 3D model, it produces the standard PBR material maps (metallic, roughness, albedo) using geometry- and light-aware diffusion, yielding very appealing results.
LlamaCare: A Large Medical Language Model for Enhancing Healthcare Knowledge Sharing. To solve classification problems in large language models (LLMs), researchers have developed LlamaCare, a refined LLM for medical information, in conjunction with Extended Classification Integration (ECI).
XRec: Large Language Models for Explainable Recommendation. XRec is a model-agnostic framework that improves explainable recommender systems by utilizing the language capabilities of large language models.
MetaMixer Is All You Need. Using only convolutions, researchers have created a novel method called FFNification that preserves the query-key-value structure while converting self-attention processes into more efficient token mixers.
GrootVL: Tree Topology is All You Need in State Space Model. By dynamically constructing a tree topology based on spatial correlations and input information, GrootVL is a network that enhances state space models.
ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization. Researchers have created a new two-stage training process to improve visual geo-localization (VG) and boost its performance in applications such as SLAM, augmented reality, and autonomous driving.
ReLUs Are Sufficient for Learning Implicit Neural Representations. Researchers revisit the use of ReLU activation functions for learning implicit neural representations (INRs). They counter spectral bias by introducing simple constraints on ReLU neurons, inspired by second-order B-spline wavelets.

News

Link description
OpenAI Is Restarting Its Robotics Research Group. The San Francisco-based company has been a pioneer in generative artificial intelligence and is returning to robotics after a three-year break.
AI Overviews: About last week. In order to improve search results and give users more precise and pertinent information, particularly for complex inquiries, Google created AI Overviews. While there were certain problems, such as incorrect results and misread content, Google has fixed these difficulties with over a dozen technical updates, like improving the identification of absurd questions and reducing the amount of user-generated content in AI Overviews.
Nvidia said to be prepping AI PC chip with Arm and Blackwell cores. Competition could be heating up in the Windows on Arm space amid talk in the industry that Nvidia is readying a chip pairing next-gen Arm cores with its Blackwell GPU architecture.
Ex-OpenAI board member reveals what led to Sam Altman's brief ousting. In a recent interview, former OpenAI board member Helen Toner provided fresh information into the circumstances surrounding CEO Sam Altman's November dismissal. It appears that the board was informed via Twitter about the release of ChatGPT. According to Toner, Altman had repeatedly lied to the board. It has been alleged that Altman had been lying about events within the organization for years and hiding facts. The board found it difficult to make decisions as a result of his lies, and they concluded that he wasn't the best person to take the firm to AGI.
AI hardware firm Nvidia unveils next-gen products at Taiwan tech expo. CEO Jensen Huang tells packed stadium in Taipei ‘next Industrial Revolution has begun’
AMD unveils new AI chips to compete with Nvidia. AMD has been vying to compete against Nvidia, which currently dominates the lucrative market for AI semiconductors and commands about 80% of its share.
Anthropic’s Claude 3 Opus and tool use are generally available on Vertex AI. Google Cloud now offers Claude 3 Opus with tool use along with the smaller models as part of its Vertex AI offering.
State Space Duality (Mamba-2). Mamba is an efficient state-space model. Its team has released a second version accompanied by a long, comprehensive explanation of the model and its enhancements.
No physics? No problem. AI weather forecasting is already making huge strides. The weather-forecasting industry is being transformed by AI models like WindBorne's WeatherMesh, which leverages the extensive ERA5 dataset to outperform conventional models while using far less computing power.
Amazon’s Project PI AI looks for product defects before they ship. Project PI combines computer vision and generative AI to catch damaged items and prevent returns.
The Opaque Investment Empire Making OpenAI’s Sam Altman Rich. Sam Altman is one of Silicon Valley's most active and successful individual investors. At the beginning of this year, the holdings in his investment empire were valued at no less than $2.8 billion, and a large portion of the portfolio is opaque. This article walks readers through Altman's investments.
Even the Raspberry Pi is getting in on AI. Raspberry Pi partnered with Hailo to provide an optional AI add-on to its microcomputers.
Using AI to decode dog vocalizations. Leveraging a human speech model to identify different types of barks. University of Michigan researchers are exploring the possibilities of AI, developing tools that can identify whether a dog’s bark conveys playfulness or aggression.
The future is … sending AI avatars to meetings for us, says Zoom boss. Eric Yuan suggests technology is five or six years away and will free up time to spend with family
AI researchers build ‘future self’ chatbot to inspire wise life choices. Scientists at MIT hope talking to 60-year-old self will shift thinking on health, money and work
Cartwheel generates 3D animations from scratch to power up creators. Animating a 3D character from scratch is generally both laborious and expensive, requiring the use of complex software and motion capture tools.
Mistral launches fine-tuning API. Mistral has launched customization for its models via its platform and API.
If you aren't seeing AI Overviews in your search results, it's probably thanks to Google. After receiving heavy criticism since their mid-May public launch, AI Overviews in Google Search have dropped in visibility across search results. Since I/O, the average percentage of queries where AI Overviews appear has dropped from 27 percent to just 11 percent. Despite the reduction, healthcare-related queries are a large percentage of AI results, raising concerns about both accuracy and reliability across Google.
Google optimizes shipping routes. Google's operations-research group improved the mathematical optimization of cargo-shipping routes, finding a 13% reduction in fuel costs and consumption.
BrightEdge Releases Post Google I/O Data on The Impact of AI Overviews. The main businesses affected by AI Overviews, what generates results, and where Google automatically anticipates and responds to search inquiries are all revealed by new research from BrightEdge Generative Parser.
Nvidia emails: Elon Musk diverting Tesla GPUs to his other companies. Elon Musk is yet again being accused of diverting Tesla resources to his other companies. This time, it's high-end H100 GPU clusters from Nvidia.
Securing Research Infrastructure for Advanced AI. In its description of the security architecture of its AI training supercomputers, OpenAI highlights the use of Azure-based infrastructure and Kubernetes for orchestration to safeguard critical model weights and other assets.
Extracting Concepts from GPT-4. The team at OpenAI has discovered 16 million interpretable features in GPT-4 including price increases, algebraic rings, and who/what correspondence. This is a great step forward for SAE interpretability at scale. They shared the code in a companion GitHub repository.
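The training code is in the companion repository; the core idea is a sparse autoencoder with a top-k activation trained to reconstruct model activations. A minimal sketch of that architecture (sizes, names, and details here are illustrative, not OpenAI's code):

```python
# Minimal top-k sparse autoencoder sketch (illustrative sizes, not OpenAI's code).
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model=768, n_features=16384, k=32):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, h):                       # h: residual-stream activations
        pre = self.encoder(h)
        topk = torch.topk(pre, self.k, dim=-1)  # keep only the k most active features
        z = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values.relu())
        return self.decoder(z), z               # reconstruction + sparse feature codes

sae = TopKSAE()
h = torch.randn(4, 768)
recon, codes = sae(h)
loss = (recon - h).pow(2).mean()                # reconstruction objective
```

Each of the learned feature directions is then inspected for an interpretable meaning, which is how items like "price increases" or "algebraic rings" emerge.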
Mesop: Gradio Competition. Google has released a rival to the popular AI prototyping framework Gradio. Mesop is pure Python and slightly more composable, though Gradio remains the more mature of the two. A minimal app is sketched below.
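Based on Mesop's documented API at release, a minimal stateful app looks roughly like this (run with `mesop app.py`; treat this as a sketch rather than a canonical example):

```python
# Minimal Mesop app: a counter that re-renders when its state changes.
import mesop as me

@me.stateclass
class State:
    clicks: int = 0

def on_click(e: me.ClickEvent):
    me.state(State).clicks += 1     # state mutations trigger a re-render

@me.page(path="/")
def app():
    me.text(f"Clicks: {me.state(State).clicks}")
    me.button("Click me", on_click=on_click)
```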
Nvidia is now more valuable than Apple at $3.01 trillion. The AI boom has pushed Nvidia’s market cap high enough to make it the second most valuable company in the world.

Resources

Link description
An Introduction to Vision-Language Modeling. we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them.
Aya 23: Open Weight Releases to Further Multilingual Progress. A family of multilingual language models supporting up to 23 languages; by deliberately concentrating on fewer languages and allocating greater capacity to them, it performs better on those languages than other large-scale multilingual models.
Financial Statement Analysis with Large Language Models. Claims that LLMs can produce valuable insights by analyzing trends and financial ratios; demonstrates that GPT-4 outperforms more specialized models; and develops a profitable trading strategy based on GPT's predictions.
SimPO: Simple Preference Optimization with a Reference-Free Reward. A simpler, more efficient approach to preference optimization that uses the average log probability of a sequence as an implicit reward (i.e., no reference model required), making it more compute- and memory-efficient. SimPO outperforms other methods like DPO and claims to produce the strongest open-source 8B model; the objective is sketched below.
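From the paper (reproduced from memory, so check the original for exact notation): given a prompt $x$ with chosen and rejected responses $y_w$ and $y_l$, SimPO optimizes a Bradley-Terry-style loss on length-normalized log probabilities with a target margin $\gamma$:

$$\mathcal{L}_{\text{SimPO}} = -\,\mathbb{E}\!\left[\log \sigma\!\left(\frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) \;-\; \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) \;-\; \gamma\right)\right]$$

Unlike DPO, no frozen reference policy $\pi_{\text{ref}}$ appears in the loss, which is where the compute and memory savings come from.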
Experimenting with local alt text generation. Mozilla has trained a model that runs in the browser and automatically generates alt text for web images.
Mora: More like Sora for Generalist Video Generation. Mora is a multi-agent framework designed to facilitate generalist video generation tasks, leveraging a collaborative approach with multiple visual agents. It aims to replicate and extend the capabilities of OpenAI's Sora.
FABRIC: Personalizing Diffusion Models with Iterative Feedback. FABRIC (Feedback via Attention-Based Reference Image Conditioning) is a technique to incorporate iterative feedback into the generative process of diffusion models based on StableDiffusion.
KL is All You Need. KL divergence is a quick, affordable, and effective way to measure a certain kind of distance between probability distributions, and it is widely employed in both conventional and contemporary AI. This piece examines the idea both mathematically and graphically; the definition is recalled below.
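For reference, the discrete form of the definition (a standard fact, not specific to the article):

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x)\,\log \frac{P(x)}{Q(x)}$$

It is non-negative, zero only when $P = Q$, and asymmetric in its arguments, which is why minimizing $D_{\mathrm{KL}}(P \,\|\, Q)$ versus $D_{\mathrm{KL}}(Q \,\|\, P)$ yields qualitatively different fits.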
7 Ways AI-Native Companies Can Improve User Retention. A playbook for founders and product executives, with examples of how businesses like Perplexity, Civit, Lapse, Omnivore, and others improve user retention.
FineWeb: decanting the web for the finest text data at scale. The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. Recently, we released 🍷 FineWeb, a new, large-scale (15 trillion tokens, 44TB disk space) dataset for LLM pretraining. FineWeb is derived from 96 CommonCrawl snapshots and produces better-performing LLMs than other open pretraining datasets.
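At 44TB on disk, streaming is the practical way to inspect it. Something like the following works with the Hugging Face datasets library (the dataset ID is from the release; the config name shown is one example CommonCrawl snapshot):

```python
# Stream FineWeb rather than downloading all 44TB.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb",
                  name="CC-MAIN-2024-10",   # one of the 96 snapshot configs
                  split="train",
                  streaming=True)

for doc in fw.take(3):
    print(doc["text"][:200])
```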
An entirely open-source AI code assistant inside your editor. Continue enables you to easily create your own coding assistant directly inside Visual Studio Code and JetBrains with open-source LLMs. All this can run entirely on your own laptop or have Ollama deployed on a server to remotely power code completion and chat experiences based on your needs.
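Continue itself is configured through its own settings, but the remote setup it describes amounts to pointing the editor at an Ollama server's HTTP API, which you can sanity-check in a few lines (endpoint and payload follow Ollama's documented /api/generate route; the model name is whatever you have pulled):

```python
# Quick sanity check that an Ollama server is reachable and serving completions.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",      # default Ollama port
    json={"model": "llama3",                    # any model you've pulled
          "prompt": "Write a Python function that reverses a string.",
          "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```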
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. MMLU is a popular benchmark for reasoning tasks, frequently seen as the gold standard, and also as something models overfit. MMLU-Pro is a new, more rigorous, and refined benchmark for gauging language model reasoning; a quick look at its format is sketched below.
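The dataset is on the Hugging Face Hub; a quick peek at one item (the ID and field names are taken from the dataset card, so treat this as a sketch):

```python
# Inspect one MMLU-Pro item; unlike MMLU's 4 choices, items carry up to 10 options.
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
ex = ds[0]
print(ex["question"])
print(ex["options"])   # list of answer choices
print(ex["answer"])    # gold answer letter
```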
Omost. Omost, from the same creator as ControlNet, gives you control over how your images are generated. It first rewrites prompts into a collection of descriptive code, then renders the finished image from that code. Crucially, you can modify the code either before or after generation to subtly alter the model's output.
Control-GIC. A novel generative image compression framework called Control-GIC enables fine-grained bitrate modification while preserving high-quality output.
LLM inference speed of light. Grounding performance analysis in a theoretical "speed of light" model is extremely useful for problems where the amount of computation and memory access is known a priori, as it helps assess the quality of implementations and predict the impact of architectural modifications. The back-of-envelope version is sketched below.
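The essential observation is that single-stream decoding must read every weight from memory once per generated token, so memory bandwidth, not FLOPs, sets the ceiling. With illustrative numbers:

```python
# Back-of-envelope "speed of light" for single-sequence decoding:
# each generated token must stream all weights from memory at least once.
params = 7e9          # 7B-parameter model (illustrative)
bytes_per_param = 2   # fp16 weights
bandwidth = 1.0e12    # 1 TB/s memory bandwidth (illustrative GPU)

bytes_per_token = params * bytes_per_param
tokens_per_sec = bandwidth / bytes_per_token
print(f"theoretical ceiling: {tokens_per_sec:.0f} tokens/s")  # ~71 tokens/s
```

Any implementation decoding well below this ceiling is leaving bandwidth on the table, regardless of how many FLOPs the hardware advertises.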
Neural Surface Reconstruction. Without the need for 3D supervision, GenS is an end-to-end generalizable neural surface reconstruction model that performs exceptionally well at reconstructing surfaces from multi-view images.
MatMul-Free LM. Researchers have managed to remove matrix multiplication (MatMul) from large language models without sacrificing performance, even at the billion-parameter scale; a toy version of the trick is sketched below.
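The key ingredient is BitNet-style ternary weights: with entries constrained to {-1, 0, +1}, a matrix-vector product reduces to signed additions. A toy demonstration (illustrative, not the authors' code):

```python
# Toy illustration: with ternary weights in {-1, 0, +1}, the product
# y = W @ x needs no multiplications, only selective adds and subtracts.
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weight matrix
x = rng.standard_normal(8)

y = np.where(W == 1, x, 0).sum(axis=1) - np.where(W == -1, x, 0).sum(axis=1)
assert np.allclose(y, W @ x)           # same result, no multiplies needed
```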
stable-audio-open-1.0. Stability AI has released the weights for Stable Audio Open, which was trained on permissively licensed audio samples to generate sound effects.
CV-VAE: A Compatible Video VAE for Latent Generative Video Models. With its spatio-temporally compressed latent spaces, CV-VAE is a video VAE that works with current image and video models to efficiently train new ones utilizing pre-trained ones.
Qwen2. Pretrained and instruction-tuned models in five sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. Trained on data in 27 languages besides English and Chinese, with state-of-the-art performance across a large number of benchmark evaluations. Loading a checkpoint is sketched below.
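The checkpoints are on the Hugging Face Hub under the Qwen organization; loading the 7B instruct variant looks roughly like this (standard transformers API; the model ID is from the release):

```python
# Load and query Qwen2-7B-Instruct via the standard transformers API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user",
             "content": "Give me a one-line summary of state space models."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=64)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```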
Dragonfly: A large vision-language model with multi-resolution zoom. Also launches two new open-source models: Llama-3-8b-Dragonfly-v1, a general-domain model trained on 5.5 million image-instruction pairs, and Llama-3-8b-Dragonfly-Med-v1, fine-tuned on an additional 1.4 million biomedical image-instruction pairs. Dragonfly demonstrates promising performance on vision-language benchmarks like commonsense visual QA and image captioning, while Dragonfly-Med outperforms prior models, including Med-Gemini, on multiple medical imaging tasks, showcasing its capabilities for high-resolution medical data.
MMLU Pro. MMLU has long been the industry standard for assessing knowledge and reasoning in language models; MMLU-Pro is its more challenging successor.

Perspectives

Link description
Beyond the Cloud: Distributed AI and On-Device Intelligence. The transition of AI workflows from the cloud to the edge, driven by specialized chip infrastructure and models, multi-modality, and ambient experiences across devices.
Sure, Google’s AI overviews could be useful – if you like eating rocks. The company that shaped the development of search engines is banking on chatbot-style summaries. But so far, its suggestions are pretty wild
AI's Communication Revolution: We're All Talking to Computers Now. With its real-time integration of text, vision, and audio, OpenAI's GPT-4o is driving a revolution in communication through AI. Human-to-AI communication is becoming a fundamental form of digital connection, with the potential to bring about substantial societal changes and the emergence of new companies focused on AI-centric communication. This transition makes more natural interactions with AI possible.
A Right to Warn about Advanced Artificial Intelligence. A group of current and former AI workers is pleading with frontier AI companies to commit to principles that allow employees to raise risk-related concerns about advanced AI without fear of retaliation.