new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jan 2

GeoLLM: Extracting Geospatial Knowledge from Large Language Models

The application of machine learning (ML) in a range of geospatial tasks is increasingly common but often relies on globally available covariates such as satellite imagery that can either be expensive or lack predictive power. Here we explore the question of whether the vast amounts of knowledge found in Internet language corpora, now compressed within large language models (LLMs), can be leveraged for geospatial prediction tasks. We first demonstrate that LLMs embed remarkable spatial information about locations, but naively querying LLMs using geographic coordinates alone is ineffective in predicting key indicators like population density. We then present GeoLLM, a novel method that can effectively extract geospatial knowledge from LLMs with auxiliary map data from OpenStreetMap. We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods. Across these tasks, our method demonstrates a 70% improvement in performance (measured using Pearson's r^2) relative to baselines that use nearest neighbors or use information directly from the prompt, and performance equal to or exceeding satellite-based benchmarks in the literature. With GeoLLM, we observe that GPT-3.5 outperforms Llama 2 and RoBERTa by 19% and 51% respectively, suggesting that the performance of our method scales well with the size of the model and its pretraining dataset. Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe. Crucially, GeoLLM shows promise in mitigating the limitations of existing geospatial covariates and complementing them well. Code is available on the project website: https://rohinmanvi.github.io/GeoLLM

  • 6 authors
·
Oct 9, 2023

GenUP: Generative User Profilers as In-Context Learners for Next POI Recommender Systems

Traditional POI recommendation systems often lack transparency, interpretability, and scrutability due to their reliance on dense vector-based user embeddings. Furthermore, the cold-start problem -- where systems have insufficient data for new users -- limits their ability to generate accurate recommendations. Existing methods often address this by leveraging similar trajectories from other users, but this approach can be computationally expensive and increases the context length for LLM-based methods, making them difficult to scale. To address these limitations, we propose a method that generates natural language (NL) user profiles from large-scale, location-based social network (LBSN) check-ins, utilizing robust personality assessments and behavioral theories. These NL profiles capture user preferences, routines, and behaviors, improving POI prediction accuracy while offering enhanced transparency. By incorporating NL profiles as system prompts to LLMs, our approach reduces reliance on extensive historical data, while remaining flexible, easily updated, and computationally efficient. Our method is not only competitive with other LLM-based and complex agentic frameworks but is also more scalable for real-world scenarios and on-device POI recommendations. Results demonstrate that our approach consistently outperforms baseline methods, offering a more interpretable and resource-efficient solution for POI recommendation systems. Our source code is available at: https://github.com/w11wo/GenUP.

  • 3 authors
·
Oct 27, 2024

LaDe: The First Comprehensive Last-mile Delivery Dataset from Industry

Real-world last-mile delivery datasets are crucial for research in logistics, supply chain management, and spatio-temporal data mining. Despite a plethora of algorithms developed to date, no widely accepted, publicly available last-mile delivery dataset exists to support research in this field. In this paper, we introduce LaDe, the first publicly available last-mile delivery dataset with millions of packages from the industry. LaDe has three unique characteristics: (1) Large-scale. It involves 10,677k packages of 21k couriers over 6 months of real-world operation. (2) Comprehensive information. It offers original package information, such as its location and time requirements, as well as task-event information, which records when and where the courier is while events such as task-accept and task-finish events happen. (3) Diversity. The dataset includes data from various scenarios, including package pick-up and delivery, and from multiple cities, each with its unique spatio-temporal patterns due to their distinct characteristics such as populations. We verify LaDe on three tasks by running several classical baseline models per task. We believe that the large-scale, comprehensive, diverse feature of LaDe can offer unparalleled opportunities to researchers in the supply chain community, data mining community, and beyond. The dataset homepage is publicly available at https://huggingface.co/datasets/Cainiao-AI/LaDe.

  • 13 authors
·
Jun 18, 2023

Unboxing Occupational Bias: Grounded Debiasing LLMs with U.S. Labor Data

Large Language Models (LLMs) are prone to inheriting and amplifying societal biases embedded within their training data, potentially reinforcing harmful stereotypes related to gender, occupation, and other sensitive categories. This issue becomes particularly problematic as biased LLMs can have far-reaching consequences, leading to unfair practices and exacerbating social inequalities across various domains, such as recruitment, online content moderation, or even the criminal justice system. Although prior research has focused on detecting bias in LLMs using specialized datasets designed to highlight intrinsic biases, there has been a notable lack of investigation into how these findings correlate with authoritative datasets, such as those from the U.S. National Bureau of Labor Statistics (NBLS). To address this gap, we conduct empirical research that evaluates LLMs in a ``bias-out-of-the-box" setting, analyzing how the generated outputs compare with the distributions found in NBLS data. Furthermore, we propose a straightforward yet effective debiasing mechanism that directly incorporates NBLS instances to mitigate bias within LLMs. Our study spans seven different LLMs, including instructable, base, and mixture-of-expert models, and reveals significant levels of bias that are often overlooked by existing bias detection techniques. Importantly, our debiasing method, which does not rely on external datasets, demonstrates a substantial reduction in bias scores, highlighting the efficacy of our approach in creating fairer and more reliable LLMs.

  • 3 authors
·
Aug 20, 2024 4

A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models

Bangla -- ranked as the 6th most widely spoken language across the world (https://www.ethnologue.com/guides/ethnologue200), with 230 million native speakers -- is still considered as a low-resource language in the natural language processing (NLP) community. With three decades of research, Bangla NLP (BNLP) is still lagging behind mainly due to the scarcity of resources and the challenges that come with it. There is sparse work in different areas of BNLP; however, a thorough survey reporting previous work and recent advances is yet to be done. In this study, we first provide a review of Bangla NLP tasks, resources, and tools available to the research community; we benchmark datasets collected from various platforms for nine NLP tasks using current state-of-the-art algorithms (i.e., transformer-based models). We provide comparative results for the studied NLP tasks by comparing monolingual vs. multilingual models of varying sizes. We report our results using both individual and consolidated datasets and provide data splits for future research. We reviewed a total of 108 papers and conducted 175 sets of experiments. Our results show promising performance using transformer-based models while highlighting the trade-off with computational costs. We hope that such a comprehensive survey will motivate the community to build on and further advance the research on Bangla NLP.

  • 7 authors
·
Jul 8, 2021

Datasets for Large Language Models: A Comprehensive Survey

This paper embarks on an exploration into the Large Language Model (LLM) datasets, which play a crucial role in the remarkable advancements of LLMs. The datasets serve as the foundational infrastructure analogous to a root system that sustains and nurtures the development of LLMs. Consequently, examination of these datasets emerges as a critical topic in research. In order to address the current lack of a comprehensive overview and thorough analysis of LLM datasets, and to gain insights into their current status and future trends, this survey consolidates and categorizes the fundamental aspects of LLM datasets from five perspectives: (1) Pre-training Corpora; (2) Instruction Fine-tuning Datasets; (3) Preference Datasets; (4) Evaluation Datasets; (5) Traditional Natural Language Processing (NLP) Datasets. The survey sheds light on the prevailing challenges and points out potential avenues for future investigation. Additionally, a comprehensive review of the existing available dataset resources is also provided, including statistics from 444 datasets, covering 8 language categories and spanning 32 domains. Information from 20 dimensions is incorporated into the dataset statistics. The total data size surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for other datasets. We aim to present the entire landscape of LLM text datasets, serving as a comprehensive reference for researchers in this field and contributing to future studies. Related resources are available at: https://github.com/lmmlzn/Awesome-LLMs-Datasets.

  • 5 authors
·
Feb 27, 2024 1

Urban Mobility Assessment Using LLMs

Understanding urban mobility patterns and analyzing how people move around cities helps improve the overall quality of life and supports the development of more livable, efficient, and sustainable urban areas. A challenging aspect of this work is the collection of mobility data by means of user tracking or travel surveys, given the associated privacy concerns, noncompliance, and high cost. This work proposes an innovative AI-based approach for synthesizing travel surveys by prompting large language models (LLMs), aiming to leverage their vast amount of relevant background knowledge and text generation capabilities. Our study evaluates the effectiveness of this approach across various U.S. metropolitan areas by comparing the results against existing survey data at different granularity levels. These levels include (i) pattern level, which compares aggregated metrics like the average number of locations traveled and travel time, (ii) trip level, which focuses on comparing trips as whole units using transition probabilities, and (iii) activity chain level, which examines the sequence of locations visited by individuals. Our work covers several proprietary and open-source LLMs, revealing that open-source base models like Llama-2, when fine-tuned on even a limited amount of actual data, can generate synthetic data that closely mimics the actual travel survey data, and as such provides an argument for using such data in mobility studies.

  • 3 authors
·
Aug 22, 2024

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as N_E sum_{i=1}^{N_E} f_i p_i, where N_E is the total number of experts, f_i represents the frequency of expert i being selected, and p_i denotes the average gating score of the expert i. Existing MoE training frameworks usually employ the parallel training strategy so that f_i and the LBL are calculated within a micro-batch and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences. So, the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute the token evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a global-batch to loose this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, which will encourage load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize f_i across micro-batches and then use it to calculate the LBL. Through experiments on training MoEs-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.

  • 10 authors
·
Jan 20, 2025 2

PowerNorm: Rethinking Batch Normalization in Transformers

The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN). This is different than batch normalization (BN), which is widely-adopted in Computer Vision. The preferred use of LN in NLP is principally due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks; however, a thorough understanding of the underlying reasons for this is not always evident. In this paper, we perform a systematic study of NLP transformer models to understand why BN has a poor performance, as compared to LN. We find that the statistics of NLP data across the batch dimension exhibit large fluctuations throughout training. This results in instability, if BN is naively implemented. To address this, we propose Power Normalization (PN), a novel normalization scheme that resolves this issue by (i) relaxing zero-mean normalization in BN, (ii) incorporating a running quadratic mean instead of per batch statistics to stabilize fluctuations, and (iii) using an approximate backpropagation for incorporating the running statistics in the forward pass. We show theoretically, under mild assumptions, that PN leads to a smaller Lipschitz constant for the loss, compared with BN. Furthermore, we prove that the approximate backpropagation scheme leads to bounded gradients. We extensively test PN for transformers on a range of NLP tasks, and we show that it significantly outperforms both LN and BN. In particular, PN outperforms LN by 0.4/0.6 BLEU on IWSLT14/WMT14 and 5.6/3.0 PPL on PTB/WikiText-103. We make our code publicly available at https://github.com/sIncerass/powernorm.

  • 5 authors
·
Mar 17, 2020

VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models

The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. 300 literary essays have been included, and there are over 19,000 multiple-choice questions on a range of topics. The dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more by including both textual data and accompanying images. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and contrasted their performance with that of Vietnamese students to see how well they performed. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have space to grow, though, especially in the areas of mathematics, physics, chemistry, and biology. The VNHSGE dataset seeks to provide an adequate benchmark for assessing the abilities of LLMs with its wide-ranging coverage and variety of activities. We intend to promote future developments in the creation of LLMs by making this dataset available to the scientific community, especially in resolving LLMs' limits in disciplines involving mathematics and the natural sciences.

  • 8 authors
·
May 20, 2023

Enhancing Price Prediction in Cryptocurrency Using Transformer Neural Network and Technical Indicators

This study presents an innovative approach for predicting cryptocurrency time series, specifically focusing on Bitcoin, Ethereum, and Litecoin. The methodology integrates the use of technical indicators, a Performer neural network, and BiLSTM (Bidirectional Long Short-Term Memory) to capture temporal dynamics and extract significant features from raw cryptocurrency data. The application of technical indicators, such facilitates the extraction of intricate patterns, momentum, volatility, and trends. The Performer neural network, employing Fast Attention Via positive Orthogonal Random features (FAVOR+), has demonstrated superior computational efficiency and scalability compared to the traditional Multi-head attention mechanism in Transformer models. Additionally, the integration of BiLSTM in the feedforward network enhances the model's capacity to capture temporal dynamics in the data, processing it in both forward and backward directions. This is particularly advantageous for time series data where past and future data points can influence the current state. The proposed method has been applied to the hourly and daily timeframes of the major cryptocurrencies and its performance has been benchmarked against other methods documented in the literature. The results underscore the potential of the proposed method to outperform existing models, marking a significant progression in the field of cryptocurrency price prediction.

  • 2 authors
·
Mar 6, 2024

RedOne: Revealing Domain-specific LLM Post-Training in Social Networking Services

As a primary medium for modern information dissemination, social networking services (SNS) have experienced rapid growth, which has proposed significant challenges for platform content management and interaction quality improvement. Recently, the development of large language models (LLMs) has offered potential solutions but existing studies focus on isolated tasks, which not only encounter diminishing benefit from the data scaling within individual scenarios but also fail to flexibly adapt to diverse real-world context. To address these challenges, we introduce RedOne, a domain-specific LLM designed to break the performance bottleneck of single-task baselines and establish a comprehensive foundation for the SNS. RedOne was developed through a three-stage training strategy consisting of continue pretraining, supervised fine-tuning, and preference optimization, using a large-scale real-world dataset. Through extensive experiments, RedOne maintains strong general capabilities, and achieves an average improvement up to 14.02% across 8 major SNS tasks and 7.56% in SNS bilingual evaluation benchmark, compared with base models. Furthermore, through online testing, RedOne reduced the exposure rate in harmful content detection by 11.23% and improved the click page rate in post-view search by 14.95% compared with single-tasks finetuned baseline models. These results establish RedOne as a robust domain-specific LLM for SNS, demonstrating excellent generalization across various tasks and promising applicability in real-world scenarios.

  • 25 authors
·
Jul 12, 2025 2

Large Language Models on Graphs: A Comprehensive Survey

Large language models (LLMs), such as ChatGPT and LLaMA, are creating significant advancements in natural language processing, due to their strong text encoding/decoding ability and newly found emergent capability (e.g., reasoning). While LLMs are mainly designed to process pure texts, there are many real-world scenarios where text data are associated with rich structure information in the form of graphs (e.g., academic networks, and e-commerce networks) or scenarios where graph data are paired with rich textual information (e.g., molecules with descriptions). Besides, although LLMs have shown their pure text-based reasoning ability, it is underexplored whether such ability can be generalized to graph scenarios (i.e., graph-based reasoning). In this paper, we provide a systematic review of scenarios and techniques related to large language models on graphs. We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-rich graphs, and text-paired graphs. We then discuss detailed techniques for utilizing LLMs on graphs, including LLM as Predictor, LLM as Encoder, and LLM as Aligner, and compare the advantages and disadvantages of different schools of models. Furthermore, we mention the real-world applications of such methods and summarize open-source codes and benchmark datasets. Finally, we conclude with potential future research directions in this fast-growing field. The related source can be found at https://github.com/PeterGriffinJin/Awesome-Language-Model-on-Graphs.

  • 6 authors
·
Dec 5, 2023

ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks

Large Language Models (LLMs) can revolutionize how we deploy and operate Open Radio Access Networks (O-RAN) by enhancing network analytics, anomaly detection, and code generation and significantly increasing the efficiency and reliability of a plethora of O-RAN tasks. In this paper, we present ORAN-Bench-13K, the first comprehensive benchmark designed to evaluate the performance of Large Language Models (LLMs) within the context of O-RAN. Our benchmark consists of 13,952 meticulously curated multiple-choice questions generated from 116 O-RAN specification documents. We leverage a novel three-stage LLM framework, and the questions are categorized into three distinct difficulties to cover a wide spectrum of ORAN-related knowledge. We thoroughly evaluate the performance of several state-of-the-art LLMs, including Gemini, Chat-GPT, and Mistral. Additionally, we propose ORANSight, a Retrieval-Augmented Generation (RAG)-based pipeline that demonstrates superior performance on ORAN-Bench-13K compared to other tested closed-source models. Our findings indicate that current popular LLM models are not proficient in O-RAN, highlighting the need for specialized models. We observed a noticeable performance improvement when incorporating the RAG-based ORANSight pipeline, with a Macro Accuracy of 0.784 and a Weighted Accuracy of 0.776, which was on average 21.55% and 22.59% better than the other tested LLMs.

  • 2 authors
·
Jul 8, 2024

Satellite Connectivity Prediction for Fast-Moving Platforms

Satellite connectivity is gaining increased attention as the demand for seamless internet access, especially in transportation and remote areas, continues to grow. For fast-moving objects such as aircraft, vehicles, or trains, satellite connectivity is critical due to their mobility and frequent presence in areas without terrestrial coverage. Maintaining reliable connectivity in these cases requires frequent switching between satellite beams, constellations, or orbits. To enhance user experience and address challenges like long switching times, Machine Learning (ML) algorithms can analyze historical connectivity data and predict network quality at specific locations. This allows for proactive measures, such as network switching before connectivity issues arise. In this paper, we analyze a real dataset of communication between a Geostationary Orbit (GEO) satellite and aircraft over multiple flights, using ML to predict signal quality. Our prediction model achieved an F1 score of 0.97 on the test data, demonstrating the accuracy of machine learning in predicting signal quality during flight. By enabling seamless broadband service, including roaming between different satellite constellations and providers, our model addresses the need for real-time predictions of signal quality. This approach can further be adapted to automate satellite and beam-switching mechanisms to improve overall communication efficiency. The model can also be retrained and applied to any moving object with satellite connectivity, using customized datasets, including connected vehicles and trains.

  • 2 authors
·
Jul 22, 2025

Leveraging Large Language Models for Effective Label-free Node Classification in Text-Attributed Graphs

Graph neural networks (GNNs) have become the preferred models for node classification in graph data due to their robust capabilities in integrating graph structures and attributes. However, these models heavily depend on a substantial amount of high-quality labeled data for training, which is often costly to obtain. With the rise of large language models (LLMs), a promising approach is to utilize their exceptional zero-shot capabilities and extensive knowledge for node labeling. Despite encouraging results, this approach either requires numerous queries to LLMs or suffers from reduced performance due to noisy labels generated by LLMs. To address these challenges, we introduce Locle, an active self-training framework that does Label-free node Classification with LLMs cost-Effectively. Locle iteratively identifies small sets of "critical" samples using GNNs and extracts informative pseudo-labels for them with both LLMs and GNNs, serving as additional supervision signals to enhance model training. Specifically, Locle comprises three key components: (i) an effective active node selection strategy for initial annotations; (ii) a careful sample selection scheme to identify "critical" nodes based on label disharmonicity and entropy; and (iii) a label refinement module that combines LLMs and GNNs with a rewired topology. Extensive experiments on five benchmark text-attributed graph datasets demonstrate that Locle significantly outperforms state-of-the-art methods under the same query budget to LLMs in terms of label-free node classification. Notably, on the DBLP dataset with 14.3k nodes, Locle achieves an 8.08% improvement in accuracy over the state-of-the-art at a cost of less than one cent. Our code is available at https://github.com/HKBU-LAGAS/Locle.

  • 6 authors
·
Dec 16, 2024

A Survey of LLM times DATA

The integration of large language model (LLM) and data management (DATA) is rapidly redefining both domains. In this survey, we comprehensively review the bidirectional relationships. On the one hand, DATA4LLM, spanning large-scale data processing, storage, and serving, feeds LLMs with high quality, diversity, and timeliness of data required for stages like pre-training, post-training, retrieval-augmented generation, and agentic workflows: (i) Data processing for LLMs includes scalable acquisition, deduplication, filtering, selection, domain mixing, and synthetic augmentation; (ii) Data Storage for LLMs focuses on efficient data and model formats, distributed and heterogeneous storage hierarchies, KV-cache management, and fault-tolerant checkpointing; (iii) Data serving for LLMs tackles challenges in RAG (e.g., knowledge post-processing), LLM inference (e.g., prompt compression, data provenance), and training strategies (e.g., data packing and shuffling). On the other hand, in LLM4DATA, LLMs are emerging as general-purpose engines for data management. We review recent advances in (i) data manipulation, including automatic data cleaning, integration, discovery; (ii) data analysis, covering reasoning over structured, semi-structured, and unstructured data, and (iii) system optimization (e.g., configuration tuning, query rewriting, anomaly diagnosis), powered by LLM techniques like retrieval-augmented prompting, task-specialized fine-tuning, and multi-agent collaboration.

  • 17 authors
·
May 23, 2025

Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models

As current training data for Large Language Models (LLMs) are dominated by English corpus, they are English-centric and they present impressive performance on English reasoning tasks.This paper primarily studies English-centric models, but our method could be universal by using the centric language in the dictionary for non-English-centric LLMs. Yet, they usually suffer from lower performance in other languages. There are about 7,000 languages over the world, and many are low-resourced on English-centric LLMs. For the sake of people who primarily speak these languages, it is especially urgent to enable our LLMs in those languages. Model training is usually effective, but computationally expensive and requires experienced NLP practitioners. This paper presents a novel and simple yet effective method called Dictionary Insertion Prompting (DIP). When providing a non-English prompt, DIP looks up a word dictionary and inserts words' English counterparts into the prompt for LLMs. It then enables better translation into English and better English model thinking steps which leads to obviously better results. We experiment with about 200 languages from FLORES-200. Since there are no adequate datasets, we use the NLLB translator to create synthetic multilingual benchmarks from the existing 4 English reasoning benchmarks such as GSM8K and AQuA. Despite the simplicity and computationally lightweight, we surprisingly found the effectiveness of DIP on math and commonsense reasoning tasks on multiple open-source and close-source LLMs.Our dictionaries, code, and synthetic benchmarks will be open-sourced to facilitate future research.

  • 3 authors
·
Nov 2, 2024

Evaluating Large Language Models for Health-Related Text Classification Tasks with Public Social Media Data

Large language models (LLMs) have demonstrated remarkable success in NLP tasks. However, there is a paucity of studies that attempt to evaluate their performances on social media-based health-related natural language processing tasks, which have traditionally been difficult to achieve high scores in. We benchmarked one supervised classic machine learning model based on Support Vector Machines (SVMs), three supervised pretrained language models (PLMs) based on RoBERTa, BERTweet, and SocBERT, and two LLM based classifiers (GPT3.5 and GPT4), across 6 text classification tasks. We developed three approaches for leveraging LLMs for text classification: employing LLMs as zero-shot classifiers, us-ing LLMs as annotators to annotate training data for supervised classifiers, and utilizing LLMs with few-shot examples for augmentation of manually annotated data. Our comprehensive experiments demonstrate that employ-ing data augmentation using LLMs (GPT-4) with relatively small human-annotated data to train lightweight supervised classification models achieves superior results compared to training with human-annotated data alone. Supervised learners also outperform GPT-4 and GPT-3.5 in zero-shot settings. By leveraging this data augmentation strategy, we can harness the power of LLMs to develop smaller, more effective domain-specific NLP models. LLM-annotated data without human guidance for training light-weight supervised classification models is an ineffective strategy. However, LLM, as a zero-shot classifier, shows promise in excluding false negatives and potentially reducing the human effort required for data annotation. Future investigations are imperative to explore optimal training data sizes and the optimal amounts of augmented data.

  • 4 authors
·
Mar 27, 2024

Benchmark Datasets for Lead-Lag Forecasting on Social Platforms

Social and collaborative platforms emit multivariate time-series traces in which early interactions-such as views, likes, or downloads-are followed, sometimes months or years later, by higher impact like citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardized datasets. To anchor research in LLF, here we present two high-volume benchmark datasets-arXiv (accesses -> citations of 2.3M papers) and GitHub (pushes/stars -> forks of 3M repositories)-and outline additional domains with analogous lead-lag dynamics, including Wikipedia (page views -> edits), Spotify (streams -> concert attendance), e-commerce (click-throughs -> purchases), and LinkedIn profile (views -> messages). Our datasets provide ideal testbeds for lead-lag forecasting, by capturing long-horizon dynamics across years, spanning the full spectrum of outcomes, and avoiding survivorship bias in sampling. We documented all technical details of data curation and cleaning, verified the presence of lead-lag dynamics through statistical and classification tests, and benchmarked parametric and non-parametric baselines for regression. Our study establishes LLF as a novel forecasting paradigm and lays an empirical foundation for its systematic exploration in social and usage data. Our data portal with downloads and documentation is available at https://lead-lag-forecasting.github.io/.

  • 12 authors
·
Nov 5, 2025

Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

This paper presents a comprehensive and practical guide for practitioners and end-users working with Large Language Models (LLMs) in their downstream natural language processing (NLP) tasks. We provide discussions and insights into the usage of LLMs from the perspectives of models, data, and downstream tasks. Firstly, we offer an introduction and brief summary of current GPT- and BERT-style LLMs. Then, we discuss the influence of pre-training data, training data, and test data. Most importantly, we provide a detailed discussion about the use and non-use cases of large language models for various natural language processing tasks, such as knowledge-intensive tasks, traditional natural language understanding tasks, natural language generation tasks, emergent abilities, and considerations for specific tasks.We present various use cases and non-use cases to illustrate the practical applications and limitations of LLMs in real-world scenarios. We also try to understand the importance of data and the specific challenges associated with each NLP task. Furthermore, we explore the impact of spurious biases on LLMs and delve into other essential considerations, such as efficiency, cost, and latency, to ensure a comprehensive understanding of deploying LLMs in practice. This comprehensive guide aims to provide researchers and practitioners with valuable insights and best practices for working with LLMs, thereby enabling the successful implementation of these models in a wide range of NLP tasks. A curated list of practical guide resources of LLMs, regularly updated, can be found at https://github.com/Mooler0410/LLMsPracticalGuide.

  • 8 authors
·
Apr 26, 2023

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Large Language Models (LLMs) have made great strides in recent years to achieve unprecedented performance across different tasks. However, due to commercial interest, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosing the training details. Recently, many institutions have open-sourced several strong LLMs like LLaMA-3, comparable to existing closed-source LLMs. However, only the model's weights are provided with most details (e.g., intermediate checkpoints, pre-training corpus, and training code, etc.) being undisclosed. To improve the transparency of LLMs, the research community has formed to open-source truly open LLMs (e.g., Pythia, Amber, OLMo), where more details (e.g., pre-training corpus and training code) are being provided. These models have greatly advanced the scientific study of these large models including their strengths, weaknesses, biases and risks. However, we observe that the existing truly open LLMs on reasoning, knowledge, and coding tasks are still inferior to existing state-of-the-art LLMs with similar model sizes. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the first fully open-sourced bilingual LLM with comparable performance compared to existing state-of-the-art LLMs. Moreover, we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided. Finally, we hope our MAP-Neo will enhance and strengthen the open research community and inspire more innovations and creativities to facilitate the further improvements of LLMs.

  • 45 authors
·
May 29, 2024 3

Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning

Representation learning on text-attributed graphs (TAGs) has become a critical research problem in recent years. A typical example of a TAG is a paper citation graph, where the text of each paper serves as node attributes. Initial graph neural network (GNN) pipelines handled these text attributes by transforming them into shallow or hand-crafted features, such as skip-gram or bag-of-words features. Recent efforts have focused on enhancing these pipelines with language models (LMs), which typically demand intricate designs and substantial computational resources. With the advent of powerful large language models (LLMs) such as GPT or Llama2, which demonstrate an ability to reason and to utilize general knowledge, there is a growing need for techniques which combine the textual modelling abilities of LLMs with the structural learning capabilities of GNNs. Hence, in this work, we focus on leveraging LLMs to capture textual information as features, which can be used to boost GNN performance on downstream tasks. A key innovation is our use of explanations as features: we prompt an LLM to perform zero-shot classification, request textual explanations for its decision-making process, and design an LLM-to-LM interpreter to translate these explanations into informative features for downstream GNNs. Our experiments demonstrate that our method achieves state-of-the-art results on well-established TAG datasets, including Cora, PubMed, ogbn-arxiv, as well as our newly introduced dataset, tape-arxiv23. Furthermore, our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv. Lastly, we believe the versatility of the proposed method extends beyond TAGs and holds the potential to enhance other tasks involving graph-text data. Our codes and datasets are available at: https://github.com/XiaoxinHe/TAPE.

  • 6 authors
·
May 30, 2023

BanglishRev: A Large-Scale Bangla-English and Code-mixed Dataset of Product Reviews in E-Commerce

This work presents the BanglishRev Dataset, the largest e-commerce product review dataset to date for reviews written in Bengali, English, a mixture of both and Banglish, Bengali words written with English alphabets. The dataset comprises of 1.74 million written reviews from 3.2 million ratings information collected from a total of 128k products being sold in online e-commerce platforms targeting the Bengali population. It includes an extensive array of related metadata for each of the reviews including the rating given by the reviewer, date the review was posted and date of purchase, number of likes, dislikes, response from the seller, images associated with the review etc. With sentiment analysis being the most prominent usage of review datasets, experimentation with a binary sentiment analysis model with the review rating serving as an indicator of positive or negative sentiment was conducted to evaluate the effectiveness of the large amount of data presented in BanglishRev for sentiment analysis tasks. A BanglishBERT model is trained on the data from BanglishRev with reviews being considered labeled positive if the rating is greater than 3 and negative if the rating is less than or equal to 3. The model is evaluated by being testing against a previously published manually annotated dataset for e-commerce reviews written in a mixture of Bangla, English and Banglish. The experimental model achieved an exceptional accuracy of 94\% and F1 score of 0.94, demonstrating the dataset's efficacy for sentiment analysis. Some of the intriguing patterns and observations seen within the dataset and future research directions where the dataset can be utilized is also discussed and explored. The dataset can be accessed through https://huggingface.co/datasets/BanglishRev/bangla-english-and-code-mixed-ecommerce-review-dataset.

  • 4 authors
·
Dec 17, 2024

Label-free Node Classification on Graphs with Large Language Models (LLMS)

In recent years, there have been remarkable advancements in node classification achieved by Graph Neural Networks (GNNs). However, they necessitate abundant high-quality labels to ensure promising performance. In contrast, Large Language Models (LLMs) exhibit impressive zero-shot proficiency on text-attributed graphs. Yet, they face challenges in efficiently processing structural data and suffer from high inference costs. In light of these observations, this work introduces a label-free node classification on graphs with LLMs pipeline, LLM-GNN. It amalgamates the strengths of both GNNs and LLMs while mitigating their limitations. Specifically, LLMs are leveraged to annotate a small portion of nodes and then GNNs are trained on LLMs' annotations to make predictions for the remaining large portion of nodes. The implementation of LLM-GNN faces a unique challenge: how can we actively select nodes for LLMs to annotate and consequently enhance the GNN training? How can we leverage LLMs to obtain annotations of high quality, representativeness, and diversity, thereby enhancing GNN performance with less cost? To tackle this challenge, we develop an annotation quality heuristic and leverage the confidence scores derived from LLMs to advanced node selection. Comprehensive experimental results validate the effectiveness of LLM-GNN. In particular, LLM-GNN can achieve an accuracy of 74.9% on a vast-scale dataset \products with a cost less than 1 dollar.

  • 8 authors
·
Oct 6, 2023

Financial Risk Assessment via Long-term Payment Behavior Sequence Folding

Online inclusive financial services encounter significant financial risks due to their expansive user base and low default costs. By real-world practice, we reveal that utilizing longer-term user payment behaviors can enhance models' ability to forecast financial risks. However, learning long behavior sequences is non-trivial for deep sequential models. Additionally, the diverse fields of payment behaviors carry rich information, requiring thorough exploitation. These factors collectively complicate the task of long-term user behavior modeling. To tackle these challenges, we propose a Long-term Payment Behavior Sequence Folding method, referred to as LBSF. In LBSF, payment behavior sequences are folded based on merchants, using the merchant field as an intrinsic grouping criterion, which enables informative parallelism without reliance on external knowledge. Meanwhile, we maximize the utility of payment details through a multi-field behavior encoding mechanism. Subsequently, behavior aggregation at the merchant level followed by relational learning across merchants facilitates comprehensive user financial representation. We evaluate LBSF on the financial risk assessment task using a large-scale real-world dataset. The results demonstrate that folding long behavior sequences based on internal behavioral cues effectively models long-term patterns and changes, thereby generating more accurate user financial profiles for practical applications.

  • 7 authors
·
Nov 22, 2024

Self-Refined Generative Foundation Models for Wireless Traffic Prediction

With a broad range of emerging applications in 6G networks, wireless traffic prediction has become a critical component of network management. However, the dynamically shifting distribution of wireless traffic in non-stationary 6G networks presents significant challenges to achieving accurate and stable predictions. Motivated by recent advancements in Generative AI (GAI)-enabled 6G networks, this paper proposes a novel self-refined Large Language Model (LLM) for wireless traffic prediction, namely TrafficLLM, through in-context learning without parameter fine-tuning or model training. The proposed TrafficLLM harnesses the powerful few-shot learning abilities of LLMs to enhance the scalability of traffic prediction in dynamically changing wireless environments. Specifically, our proposed TrafficLLM embraces an LLM to iteratively refine its predictions through a three-step process: traffic prediction, feedback generation, and prediction refinement. Initially, the proposed TrafficLLM conducts traffic predictions using task-specific demonstration prompts. Recognizing that LLMs may generate incorrect predictions on the first attempt, we subsequently incorporate feedback demonstration prompts designed to provide multifaceted and valuable feedback related to these initial predictions. Following this comprehensive feedback, our proposed TrafficLLM introduces refinement demonstration prompts, enabling the same LLM to further refine its predictions and thereby enhance prediction performance. The evaluations on two realistic datasets demonstrate that the proposed TrafficLLM outperforms state-of-the-art methods with performance improvements of 23.17% and 17.09%, respectively.

  • 6 authors
·
Aug 19, 2024

A Global Context Mechanism for Sequence Labeling

Global sentence information is crucial for sequence labeling tasks, where each word in a sentence must be assigned a label. While BiLSTM models are widely used, they often fail to capture sufficient global context for inner words. Previous work has proposed various RNN variants to integrate global sentence information into word representations. However, these approaches suffer from three key limitations: (1) they are slower in both inference and training compared to the original BiLSTM, (2) they cannot effectively supplement global information for transformer-based models, and (3) the high time cost associated with reimplementing and integrating these customized RNNs into existing architectures. In this study, we introduce a simple yet effective mechanism that addresses these limitations. Our approach efficiently supplements global sentence information for both BiLSTM and transformer-based models, with minimal degradation in inference and training speed, and is easily pluggable into current architectures. We demonstrate significant improvements in F1 scores across seven popular benchmarks, including Named Entity Recognition (NER) tasks such as Conll2003, Wnut2017 , and the Chinese named-entity recognition task Weibo, as well as End-to-End Aspect-Based Sentiment Analysis (E2E-ABSA) benchmarks such as Laptop14, Restaurant14, Restaurant15, and Restaurant16. With out any extra strategy, we achieve third highest score on weibo NER benchmark. Compared to CRF, one of the most popular frameworks for sequence labeling, our mechanism achieves competitive F1 scores while offering superior inference and training speed. Code is available at: https://github.com/conglei2XU/Global-Context-Mechanism

  • 4 authors
·
May 31, 2023

BanglaAutoKG: Automatic Bangla Knowledge Graph Construction with Semantic Neural Graph Filtering

Knowledge Graphs (KGs) have proven essential in information processing and reasoning applications because they link related entities and give context-rich information, supporting efficient information retrieval and knowledge discovery; presenting information flow in a very effective manner. Despite being widely used globally, Bangla is relatively underrepresented in KGs due to a lack of comprehensive datasets, encoders, NER (named entity recognition) models, POS (part-of-speech) taggers, and lemmatizers, hindering efficient information processing and reasoning applications in the language. Addressing the KG scarcity in Bengali, we propose BanglaAutoKG, a pioneering framework that is able to automatically construct Bengali KGs from any Bangla text. We utilize multilingual LLMs to understand various languages and correlate entities and relations universally. By employing a translation dictionary to identify English equivalents and extracting word features from pre-trained BERT models, we construct the foundational KG. To reduce noise and align word embeddings with our goal, we employ graph-based polynomial filters. Lastly, we implement a GNN-based semantic filter, which elevates contextual understanding and trims unnecessary edges, culminating in the formation of the definitive KG. Empirical findings and case studies demonstrate the universal effectiveness of our model, capable of autonomously constructing semantically enriched KGs from any text.

  • 4 authors
·
Apr 4, 2024

Understanding Graph Databases: A Comprehensive Tutorial and Survey

This tutorial serves as a comprehensive guide for understanding graph databases, focusing on the fundamentals of graph theory while showcasing practical applications across various fields. It starts by introducing foundational concepts and delves into the structure of graphs through nodes and edges, covering different types such as undirected, directed, weighted, and unweighted graphs. Key graph properties, terminologies, and essential algorithms for network analysis are outlined, including Dijkstras shortest path algorithm and methods for calculating node centrality and graph connectivity. The tutorial highlights the advantages of graph databases over traditional relational databases, particularly in efficiently managing complex, interconnected data. It examines leading graph database systems such as Neo4j, Amazon Neptune, and ArangoDB, emphasizing their unique features for handling large datasets. Practical instructions on graph operations using NetworkX and Neo4j are provided, covering node and edge creation, attribute assignment, and advanced queries with Cypher. Additionally, the tutorial explores common graph visualization techniques using tools like Plotly and Neo4j Bloom, which enhance the interpretation and usability of graph data. It also delves into community detection algorithms, including the Louvain method, which facilitates clustering in large networks. Finally, the paper concludes with recommendations for researchers interested in exploring the vast potential of graph technologies.

  • 3 authors
·
Nov 15, 2024

G2PTL: A Pre-trained Model for Delivery Address and its Applications in Logistics System

Text-based delivery addresses, as the data foundation for logistics systems, contain abundant and crucial location information. How to effectively encode the delivery address is a core task to boost the performance of downstream tasks in the logistics system. Pre-trained Models (PTMs) designed for Natural Language Process (NLP) have emerged as the dominant tools for encoding semantic information in text. Though promising, those NLP-based PTMs fall short of encoding geographic knowledge in the delivery address, which considerably trims down the performance of delivery-related tasks in logistic systems such as Cainiao. To tackle the above problem, we propose a domain-specific pre-trained model, named G2PTL, a Geography-Graph Pre-trained model for delivery address in Logistics field. G2PTL combines the semantic learning capabilities of text pre-training with the geographical-relationship encoding abilities of graph modeling. Specifically, we first utilize real-world logistics delivery data to construct a large-scale heterogeneous graph of delivery addresses, which contains abundant geographic knowledge and delivery information. Then, G2PTL is pre-trained with subgraphs sampled from the heterogeneous graph. Comprehensive experiments are conducted to demonstrate the effectiveness of G2PTL through four downstream tasks in logistics systems on real-world datasets. G2PTL has been deployed in production in Cainiao's logistics system, which significantly improves the performance of delivery-related tasks.

  • 8 authors
·
Apr 4, 2023

So2Sat LCZ42: A Benchmark Dataset for Global Local Climate Zones Classification

Access to labeled reference data is one of the grand challenges in supervised machine learning endeavors. This is especially true for an automated analysis of remote sensing images on a global scale, which enables us to address global challenges such as urbanization and climate change using state-of-the-art machine learning techniques. To meet these pressing needs, especially in urban research, we provide open access to a valuable benchmark dataset named "So2Sat LCZ42," which consists of local climate zone (LCZ) labels of about half a million Sentinel-1 and Sentinel-2 image patches in 42 urban agglomerations (plus 10 additional smaller areas) across the globe. This dataset was labeled by 15 domain experts following a carefully designed labeling work flow and evaluation process over a period of six months. As rarely done in other labeled remote sensing dataset, we conducted rigorous quality assessment by domain experts. The dataset achieved an overall confidence of 85%. We believe this LCZ dataset is a first step towards an unbiased globallydistributed dataset for urban growth monitoring using machine learning methods, because LCZ provide a rather objective measure other than many other semantic land use and land cover classifications. It provides measures of the morphology, compactness, and height of urban areas, which are less dependent on human and culture. This dataset can be accessed from http://doi.org/10.14459/2018mp1483140.

  • 17 authors
·
Dec 19, 2019

Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement

The rapid development of large language models (LLMs), like ChatGPT, has resulted in the widespread presence of LLM-generated content on social media platforms, raising concerns about misinformation, data biases, and privacy violations, which can undermine trust in online discourse. While detecting LLM-generated content is crucial for mitigating these risks, current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-LLM collaboration. To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content. This approach introduces two novel tasks: LLM Role Recognition (LLM-RR), a multi-class classification task that identifies specific roles of LLM in content generation, and LLM Influence Measurement (LLM-IM), a regression task that quantifies the extent of LLM involvement in content creation. To support these tasks, we propose LLMDetect, a benchmark designed to evaluate detectors' performance on these new tasks. LLMDetect includes the Hybrid News Detection Corpus (HNDC) for training detectors, as well as DetectEval, a comprehensive evaluation suite that considers five distinct cross-context variations and two multi-intensity variations within the same LLM role. This allows for a thorough assessment of detectors' generalization and robustness across diverse contexts. Our empirical validation of 10 baseline detection methods demonstrates that fine-tuned PLM-based models consistently outperform others on both tasks, while advanced LLMs face challenges in accurately detecting their own generated content. Our experimental results and analysis offer insights for developing more effective detection models for LLM-generated content. This research enhances the understanding of LLM-generated content and establishes a foundation for more nuanced detection methodologies.

  • 5 authors
·
Oct 18, 2024

Semantic Topic Analysis of Traffic Camera Images

Traffic cameras are commonly deployed monitoring components in road infrastructure networks, providing operators visual information about conditions at critical points in the network. However, human observers are often limited in their ability to process simultaneous information sources. Recent advancements in computer vision, driven by deep learning methods, have enabled general object recognition, unlocking opportunities for camera-based sensing beyond the existing human observer paradigm. In this paper, we present a Natural Language Processing (NLP)-inspired approach, entitled Bag-of-Label-Words (BoLW), for analyzing image data sets using exclusively textual labels. The BoLW model represents the data in a conventional matrix form, enabling data compression and decomposition techniques, while preserving semantic interpretability. We apply the Latent Dirichlet Allocation (LDA) topic model to decompose the label data into a small number of semantic topics. To illustrate our approach, we use freeway camera images collected from the Boston area between December 2017-January 2018. We analyze the cameras' sensitivity to weather events; identify temporal traffic patterns; and analyze the impact of infrequent events, such as the winter holidays and the "bomb cyclone" winter storm. This study demonstrates the flexibility of our approach, which allows us to analyze weather events and freeway traffic using only traffic camera image labels.

  • 3 authors
·
Sep 27, 2018

ResearchTown: Simulator of Human Research Community

Large Language Models (LLMs) have demonstrated remarkable potential in scientific domains, yet a fundamental question remains unanswered: Can we simulate human research communities with LLMs? Addressing this question can deepen our understanding of the processes behind idea brainstorming and inspire the automatic discovery of novel scientific insights. In this work, we propose ResearchTown, a multi-agent framework for research community simulation. Within this framework, the human research community is simplified and modeled as an agent-data graph, where researchers and papers are represented as agent-type and data-type nodes, respectively, and connected based on their collaboration relationships. We also introduce TextGNN, a text-based inference framework that models various research activities (e.g., paper reading, paper writing, and review writing) as special forms of a unified message-passing process on the agent-data graph. To evaluate the quality of the research simulation, we present ResearchBench, a benchmark that uses a node-masking prediction task for scalable and objective assessment based on similarity. Our experiments reveal three key findings: (1) ResearchTown can provide a realistic simulation of collaborative research activities, including paper writing and review writing; (2) ResearchTown can maintain robust simulation with multiple researchers and diverse papers; (3) ResearchTown can generate interdisciplinary research ideas that potentially inspire novel research directions.

  • 8 authors
·
Dec 23, 2024 2

BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction

Battery Life Prediction (BLP), which relies on time series data produced by battery degradation tests, is crucial for battery utilization, optimization, and production. Despite impressive advancements, this research area faces three key challenges. Firstly, the limited size of existing datasets impedes insights into modern battery life data. Secondly, most datasets are restricted to small-capacity lithium-ion batteries tested under a narrow range of diversity in labs, raising concerns about the generalizability of findings. Thirdly, inconsistent and limited benchmarks across studies obscure the effectiveness of baselines and leave it unclear if models popular in other time series fields are effective for BLP. To address these challenges, we propose BatteryLife, a comprehensive dataset and benchmark for BLP. BatteryLife integrates 16 datasets, offering a 2.4 times sample size compared to the previous largest dataset, and provides the most diverse battery life resource with batteries from 8 formats, 80 chemical systems, 12 operating temperatures, and 646 charge/discharge protocols, including both laboratory and industrial tests. Notably, BatteryLife is the first to release battery life datasets of zinc-ion batteries, sodium-ion batteries, and industry-tested large-capacity lithium-ion batteries. With the comprehensive dataset, we revisit the effectiveness of baselines popular in this and other time series fields. Furthermore, we propose CyclePatch, a plug-in technique that can be employed in a series of neural networks. Extensive benchmarking of 18 methods reveals that models popular in other time series fields can be unsuitable for BLP, and CyclePatch consistently improves model performance establishing state-of-the-art benchmarks. Moreover, BatteryLife evaluates model performance across aging conditions and domains. BatteryLife is available at https://github.com/Ruifeng-Tan/BatteryLife.

  • 9 authors
·
Feb 25, 2025

An Empirical Study of NetOps Capability of Pre-Trained Large Language Models

Large language models (LLMs) can respond to human language queries and have shown powerful potential applications in network operations (NetOps). Thanks to the large amount of commonsense knowledge inherent, LLMs achieve much better inference accuracy than traditional models and emerge with strong abilities in generalization, reasoning, and code generation. These abilities may have a crucial boost to automated and intelligent NetOps. However, it remains under-explored how well LLMs perform in various NetOps tasks. In this work, we make a systematic assessment of the capabilities, strengths, and limitations of selected LLMs in the field of NetOps. The evaluation is conducted on a collection of 5,732 questions about NetOps, encompassing 26 publicly available general-domain LLMs, including ChatGPT, LLaMA, Falcon, etc. We also finetune some of these LLMs with our collected NetOps corpus and evaluate the resulting models. The evaluation method follows the widely adopted benchmarks for general-domain LLMs, combined with Chain-of-Thought Prompts and Retrieval-Augmented Generation. The results show that only GPT-4 achieves high accuracy equivalent to passing the NetOps certification exam for humans, while all the other LLMs have much lower accuracy. However, some open models like LLaMA 2 still demonstrate significant potential. Furthermore, we evaluate the impact of factors such as model parameters, prompt engineering, instruction fine-tuning etc. This work shall be treated as the initial effort to systematic evaluation of LLMs in NetOps, and a more rigorous study is required for production use. The evaluation code and dataset will be released to benefit future research.

  • 9 authors
·
Sep 11, 2023 1

VecCity: A Taxonomy-guided Library for Map Entity Representation Learning

Electronic maps consist of diverse entities, such as points of interest (POIs), road networks, and land parcels, playing a vital role in applications like ITS and LBS. Map entity representation learning (MapRL) generates versatile and reusable data representations, providing essential tools for efficiently managing and utilizing map entity data. Despite the progress in MapRL, two key challenges constrain further development. First, existing research is fragmented, with models classified by the type of map entity, limiting the reusability of techniques across different tasks. Second, the lack of unified benchmarks makes systematic evaluation and comparison of models difficult. To address these challenges, we propose a novel taxonomy for MapRL that organizes models based on functional module-such as encoders, pre-training tasks, and downstream tasks-rather than by entity type. Building on this taxonomy, we present a taxonomy-driven library, VecCity, which offers easy-to-use interfaces for encoding, pre-training, fine-tuning, and evaluation. The library integrates datasets from nine cities and reproduces 21 mainstream MapRL models, establishing the first standardized benchmarks for the field. VecCity also allows users to modify and extend models through modular components, facilitating seamless experimentation. Our comprehensive experiments cover multiple types of map entities and evaluate 21 VecCity pre-built models across various downstream tasks. Experimental results demonstrate the effectiveness of VecCity in streamlining model development and provide insights into the impact of various components on performance. By promoting modular design and reusability, VecCity offers a unified framework to advance research and innovation in MapRL. The code is available at https://github.com/Bigscity-VecCity/VecCity.

  • 4 authors
·
Oct 31, 2024

Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks

We introduce llama-embed-nemotron-8b, an open-weights text embedding model that achieves state-of-the-art performance on the Multilingual Massive Text Embedding Benchmark (MMTEB) leaderboard as of October 21, 2025. While recent models show strong performance, their training data or methodologies are often not fully disclosed. We aim to address this by developing a fully open-source model, publicly releasing its weights and detailed ablation studies, and planning to share the curated training datasets. Our model demonstrates superior performance across all major embedding tasks -- including retrieval, classification and semantic textual similarity (STS) -- and excels in challenging multilingual scenarios, such as low-resource languages and cross-lingual setups. This state-of-the-art performance is driven by a novel data mix of 16.1 million query-document pairs, split between 7.7 million samples from public datasets and 8.4 million synthetically generated examples from various open-weight LLMs. One of our key contributions is a detailed ablation study analyzing core design choices, including a comparison of contrastive loss implementations, an evaluation of synthetic data generation (SDG) strategies, and the impact of model merging. The llama-embed-nemotron-8b is an instruction-aware model, supporting user-defined instructions to enhance performance for specific use-cases. This combination of top-tier performance, broad applicability, and user-driven flexibility enables it to serve as a universal text embedding solution.

nvidia NVIDIA
·
Nov 10, 2025 2

When Do Neural Nets Outperform Boosted Trees on Tabular Data?

Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and question the importance of this debate. To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs. A remarkable exception is the recently-proposed prior-data fitted network, TabPFN: although it is effectively limited to training sets of size 3000, we find that it outperforms all other algorithms on average, even when randomly sampling 3000 training datapoints. Next, we analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well. For example, we find that GBDTs are much better than NNs at handling skewed or heavy-tailed feature distributions and other forms of dataset irregularities. Our insights act as a guide for practitioners to determine which techniques may work best on their dataset. Finally, with the goal of accelerating tabular data research, we release the TabZilla Benchmark Suite: a collection of the 36 'hardest' of the datasets we study. Our benchmark suite, codebase, and all raw results are available at https://github.com/naszilla/tabzilla.

  • 9 authors
·
May 4, 2023

Local Augmentation for Graph Neural Networks

Graph Neural Networks (GNNs) have achieved remarkable performance on graph-based tasks. The key idea for GNNs is to obtain informative representation through aggregating information from local neighborhoods. However, it remains an open question whether the neighborhood information is adequately aggregated for learning representations of nodes with few neighbors. To address this, we propose a simple and efficient data augmentation strategy, local augmentation, to learn the distribution of the node features of the neighbors conditioned on the central node's feature and enhance GNN's expressive power with generated features. Local augmentation is a general framework that can be applied to any GNN model in a plug-and-play manner. It samples feature vectors associated with each node from the learned conditional distribution as additional input for the backbone model at each training iteration. Extensive experiments and analyses show that local augmentation consistently yields performance improvement when applied to various GNN architectures across a diverse set of benchmarks. For example, experiments show that plugging in local augmentation to GCN and GAT improves by an average of 3.4\% and 1.6\% in terms of test accuracy on Cora, Citeseer, and Pubmed. Besides, our experimental results on large graphs (OGB) show that our model consistently improves performance over backbones. Code is available at https://github.com/SongtaoLiu0823/LAGNN.

  • 9 authors
·
Sep 8, 2021

Neural Locality Sensitive Hashing for Entity Blocking

Locality-sensitive hashing (LSH) is a fundamental algorithmic technique widely employed in large-scale data processing applications, such as nearest-neighbor search, entity resolution, and clustering. However, its applicability in some real-world scenarios is limited due to the need for careful design of hashing functions that align with specific metrics. Existing LSH-based Entity Blocking solutions primarily rely on generic similarity metrics such as Jaccard similarity, whereas practical use cases often demand complex and customized similarity rules surpassing the capabilities of generic similarity metrics. Consequently, designing LSH functions for these customized similarity rules presents considerable challenges. In this research, we propose a neuralization approach to enhance locality-sensitive hashing by training deep neural networks to serve as hashing functions for complex metrics. We assess the effectiveness of this approach within the context of the entity resolution problem, which frequently involves the use of task-specific metrics in real-world applications. Specifically, we introduce NLSHBlock (Neural-LSH Block), a novel blocking methodology that leverages pre-trained language models, fine-tuned with a novel LSH-based loss function. Through extensive evaluations conducted on a diverse range of real-world datasets, we demonstrate the superiority of NLSHBlock over existing methods, exhibiting significant performance improvements. Furthermore, we showcase the efficacy of NLSHBlock in enhancing the performance of the entity matching phase, particularly within the semi-supervised setting.

  • 9 authors
·
Jan 31, 2024

NAICS-Aware Graph Neural Networks for Large-Scale POI Co-visitation Prediction: A Multi-Modal Dataset and Methodology

Understanding where people go after visiting one business is crucial for urban planning, retail analytics, and location-based services. However, predicting these co-visitation patterns across millions of venues remains challenging due to extreme data sparsity and the complex interplay between spatial proximity and business relationships. Traditional approaches using only geographic distance fail to capture why coffee shops attract different customer flows than fine dining restaurants, even when co-located. We introduce NAICS-aware GraphSAGE, a novel graph neural network that integrates business taxonomy knowledge through learnable embeddings to predict population-scale co-visitation patterns. Our key insight is that business semantics, captured through detailed industry codes, provide crucial signals that pure spatial models cannot explain. The approach scales to massive datasets (4.2 billion potential venue pairs) through efficient state-wise decomposition while combining spatial, temporal, and socioeconomic features in an end-to-end framework. Evaluated on our POI-Graph dataset comprising 94.9 million co-visitation records across 92,486 brands and 48 US states, our method achieves significant improvements over state-of-the-art baselines: the R-squared value increases from 0.243 to 0.625 (a 157 percent improvement), with strong gains in ranking quality (32 percent improvement in NDCG at 10).

  • 6 authors
·
Jul 25, 2025

Can LLMs Convert Graphs to Text-Attributed Graphs?

Graphs are ubiquitous structures found in numerous real-world applications, such as drug discovery, recommender systems, and social network analysis. To model graph-structured data, graph neural networks (GNNs) have become a popular tool. However, existing GNN architectures encounter challenges in cross-graph learning where multiple graphs have different feature spaces. To address this, recent approaches introduce text-attributed graphs (TAGs), where each node is associated with a textual description, which can be projected into a unified feature space using textual encoders. While promising, this method relies heavily on the availability of text-attributed graph data, which is difficult to obtain in practice. To bridge this gap, we propose a novel method named Topology-Aware Node description Synthesis (TANS), leveraging large language models (LLMs) to convert existing graphs into text-attributed graphs. The key idea is to integrate topological information into LLMs to explain how graph topology influences node semantics. We evaluate our TANS on text-rich, text-limited, and text-free graphs, demonstrating its applicability. Notably, on text-free graphs, our method significantly outperforms existing approaches that manually design node features, showcasing the potential of LLMs for preprocessing graph-structured data in the absence of textual information. The code and data are available at https://github.com/Zehong-Wang/TANS.

  • 6 authors
·
Dec 13, 2024

NT-LLM: A Novel Node Tokenizer for Integrating Graph Structure into Large Language Models

Graphs are a fundamental data structure for representing relationships in real-world scenarios. With the success of Large Language Models (LLMs) across various natural language processing (NLP) tasks, there has been growing interest in integrating LLMs for graph learning. However, applying LLMs to graph-related tasks poses significant challenges, as these models are not inherently designed to capture the complex structural information present in graphs. Existing approaches address this challenge through two strategies: the chain of tasks approach, which uses Graph Neural Networks (GNNs) to encode the graph structure so that LLMs are relieved from understanding spatial positions; and Graph-to-Text Conversion, which translates graph structures into semantic text representations that LLMs can process. Despite their progress, these methods often struggle to fully preserve the topological information of graphs or require extensive computational resources, limiting their practical applicability. In this work, we introduce Node Tokenizer for Large Language Models (NT-LLM), a novel framework that efficiently encodes graph structures by selecting key nodes as anchors and representing each node based on its relative distance to these anchors. This position-anchored encoding effectively captures the graph topology, enabling enhanced reasoning capabilities in LLMs over graph data. Additionally, we implement a task-specific tuning procedure to further improve structural understanding within LLMs. Through extensive empirical evaluations, NT-LLM demonstrates significant performance improvements across a variety of graph-related tasks.

  • 8 authors
·
Oct 14, 2024

Efficient Telecom Specific LLM: TSLAM-Mini with QLoRA and Digital Twin Data

General-purpose large language models (LLMs), despite their broad capabilities accrued from open-world data, frequently exhibit suboptimal performance when confronted with the nuanced and specialized demands inherent in real-time telecommunications applications. This investigation addresses this critical limitation through the meticulous fine-tuning of TSLAM-Mini developed by NetoAI, a compact (3.8-billion parameter) causal language model architecturally derived from Phi-4 Mini Instruct 4B. The fine-tuning regimen leverages a bespoke dataset comprising 100,000 samples, strategically engineered to address 20 pivotal telecommunications use-cases, encompassing domains such as Network Fundamentals, IP Routing, MPLS, Network Security, Automation, OSS/BSS, RAN, Mobile Core, Satellite Communications, and Ethical AI. This dataset was curated utilizing NetoAI's DigiTwin platform, enriched with granular insights from venerated network Subject Matter Experts (SMEs) and authoritative RFC documents, thereby capturing high-fidelity representations of real-world network dynamics through simulations inspired by digital twin paradigms. Employing Quantized Low-Rank Adaptation (QLoRA), a state-of-the-art Parameter Efficient Fine-Tuning (PEFT) technique, we achieved substantial training efficiency and enabled prospective deployment on resource-constrained hardware. A novel evaluation framework, predicated on a high-capacity LLM (Qwen3-235B-A22B) functioning as an automated adjudicator, was instituted to rigorously assess instruction-following fidelity and response quality across the specified telecom use-cases. Empirical results unequivocally demonstrate TSLAM-Mini's superior aptitude in telecom-centric applications, underscoring the profound efficacy of domain-specific datasets and PEFT methodologies for advancing intelligent network management.

  • 4 authors
·
May 10, 2025