Summary: This paper proposes a new neural network architecture called the Transformer that is based solely on attention mechanisms, without using sequence aligned RNNs or convolutions. The Transformer achieves state-of-the-art results in machine translation while being more parallelizable and requiring significantly less time to train. Key contributions:

Proposes multi-head self-attention as a replacement for recurrence and convolutions in encoder-decoder architectures. Self-attention connects all positions with a constant number of sequentially executed operations, whereas recurrent layers require O(n) sequential operations.

Introduces scaled dot-product attention, which performs better than additive attention for large values of attention dimension. Applies attention scaling to improve training.

Employs positional encodings instead of recurrence to enable the model to make use of sequence order. Shows that learned positional embeddings can replace sinusoids with negligible loss in quality.

Achieves state-of-the-art BLEU scores on WMT 2014 English-to-German and English-to-French translation at a fraction of the training cost of previous models. Outperforms all previously published models on English constituency parsing with limited training data.

The Transformer's reliance on attention and positional encodings rather than recurrence make it very promising for parallelization and scaling to longer sequences. The results demonstrate the potential of attention-based models to supplant RNNs and CNNs in sequence transduction tasks.

Evaluation: The Transformer architecture presents several advantages for using large language models and generative adversarial networks:

The Transformer is highly parallelizable since it does away with sequence-aligned RNNs. This makes it very suitable for scaling up with more parameters and data.

The multi-head self-attention provides a way to jointly attend to information from different representation subspaces at different positions, allowing modeling of dependencies regardless of distance. This is useful for long-range dependencies in large contexts.

Positional encodings allow the model to make use of sequence order without recurrence. This can enable generating coherent, ordered outputs in GANs and large LMs.

The Transformer achieves excellent results with limited training data, suggesting its representations transfer well. This is promising for few-shot learning and fine-tuning large LMs.

The paper provides useful analysis into the roles different attention heads learn, which can inform work on interpretable attention-based representations.

Overall, the Transformer architecture seems very promising as a foundation for large scale language modeling and GAN training. The representations it learns appear powerful yet transparent. The results on parsing suggest it can capture linguistic phenomena well. The parallelizability enables scaling. Much follow-on work has already adapted and refined the Transformer, making it very relevant today.

4

6

Attention Is Off By One (www.evanmiller.org)

submitted 1 year ago by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

5

1

ChatGPT an ENFJ, Bard an ISTJ: Empirical Study on Personalities of Large Language Models (lemmy.intai.tech)

submitted 1 year ago by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

1 comments fedilink

Title: ChatGPT an ENFJ, Bard an ISTJ: Empirical Study on Personalities of Large Language Models

https://arxiv.org/pdf/2305.19926.pdf

Authors: Jen-tse Huang, Wenxuan Wang, Man Ho Lam, Eric John Li, Wenxiang Jiao, Michael R. Lyu

Word count: 4249 words

Estimated read time: 9 minutes

Source code repos: https://github.com/CUHK-ARISE/LLMPersonality ↗ Supporting links: https://chat.openai.com/, ↗ https://www.16personalities.com/ ↗

Summary: This paper presents an empirical study evaluating the personality traits of large language models (LLMs) like ChatGPT using the Myers-Briggs Type Indicator (MBTI) framework. The key findings are:

ChatGPT consistently exhibits an ENFJ (Extraverted, Intuitive, Feeling, Judging) personality type across different prompts, question orders, rephrases, and languages.

Other LLMs show distinct personality types - text-davinci-003 and GPT-4 are also ENFJ, while Bard is ISTJ, Spark is ISFP, ERNIE Bot is ISTJ, and ChatGLM is ESFJ.

Attempts to modify ChatGPT's inherent ENFJ personality by assigning a persona or inducing moods were unsuccessful, indicating its personality is quite fixed.

The study analyzes ChatGPT's personality robustness, cross-lingual consistency, model differences, and controllability. It provides novel insights into personalization of LLMs.

The findings suggest LLMs exhibit distinct personality traits that persist irrespective of modifications. This has implications for human-AI interaction design and prompts further research into steering LLM behavior. The study methodology and analysis of multiple LLMs make valuable contributions to the field.

6

1

Personality Traits in Large Language Models (lemmy.intai.tech)

submitted 1 year ago* (last edited 1 year ago) by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

Personality Traits in Large Language Models

https://arxiv.org/pdf/2307.00184.pdf

Authors: Mustafa Safdari, Gregory Serapio-García, Clément Crepy, Stephen Fitz, Peter Romero, Luning Sun, Marwa Abdulhai, Aleksandra Faust, Maja Matarić

Word count: Approximately 4900 words

Estimated read time: 25-30 minutes

Source code repo: Not provided

Supporting links: References section contains 113 links to related work and methodological sources

Summary This paper presents a comprehensive methodology for characterizing, measuring, and shaping personality traits synthesized in the language generated by large language models (LLMs). The authors administer validated personality inventories from psychology to probe the Big Five traits (extraversion, agreeableness, conscientiousness, neuroticism, openness) in LLMs like PaLM. They establish the construct validity of these LLM-simulated personality scores using best practices from psychometrics. The authors find that larger, instruction fine-tuned LLMs like Flan-PaLM exhibit more human-like patterns of personality based on rigorous statistical assessments. They also demonstrate methods to precisely shape levels of LLM-simulated personality traits using lexical cues. The shaped personality traits persist in downstream LLM behaviors like generating text. The authors discuss implications for responsible AI, human alignment, transparency, and application development.

Evaluation This paper provides a robust methodology grounded in psychometrics to characterize personality synthesized in LLM outputs. The statistical analyses quantify the reliability, dimensionality, and external validity of LLM-simulated personality, establishing firm techniques to measure these emergent phenomena. The lexical prompting method to shape personality traits along multiple levels also enables fine-grained control over LLM behavior.

These methods have clear applications for developing safer, aligned LLMs. The personality profiling and shaping techniques can steer LLM outputs away from toxic traits. They also increase transparency about how LLMs may be perceiving users based on synthesized personality. For conversational agents and interactive LLMs, these methods open pathways to customizing personality and increasing user engagement. The demonstrated personality shaping also has uses for domain-specific LLMs that benefit from particular traits. Overall, this work provides a rigorous basis for responsible application development using emergent aspects of LLMs like synthetic personality. The techniques generalize to other social constructs beyond personality as well.^___^

Releated: https://lemmy.intai.tech/post/125756

7

5

Large Language Models as General Pattern Machines (lemmy.intai.tech)

submitted 1 year ago by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

2 comments fedilink

Large Language Models as General Pattern Machines

Suvir Mirchandani, Fei Xia, Pete Florence, Brian Ichter, Danny Driess, Montserrat Gonzalez Arenas, Kanishka Rao, Dorsa Sadigh, Andy Zeng

2982 words

Read time: 01:11:54

References and Support: https://general-pattern-machines.github.io ↗ https://arxiv.org/pdf/2307.04721.pdf

This paper investigates the capabilities of large language models (LLMs) as general pattern machines that can perform sequence transformations, completions, and improvements in a zero-shot manner when prompted with examples. The authors demonstrate that LLMs like GPT-3 can solve a subset of problems in the Abstract Reasoning Corpus, a benchmark for spatial reasoning, as well as complete patterns generated by context-free grammars. They also show LLMs can complete periodic functions like sinusoids, which enables completing periodic motions on a robot. By providing trajectories with increasing rewards, LLMs can generate improved trajectories and even learn stabilizing controllers for CartPole. Overall, the results suggest LLMs have inherent capabilities for general pattern manipulation that could be applied to robotics problems, despite not being explicitly trained for such tasks. The authors propose this as an alternative approach compared to task-specific finetuning, and suggest it provides insights into abilities that may transfer from pretraining on textual data. However, deploying LLMs for real robotics systems faces challenges like latency, context limitations, and compute costs.

This paper provides compelling evidence that large language models have built-in capabilities for pattern recognition and manipulation that can be exploited in a zero-shot manner, without task-specific fine-tuning. The experiments on solving spatial reasoning problems, completing periodic functions and robotic motions, and iteratively improving trajectories are practically relevant for potential robotics applications. However, as the authors note, there are still significant barriers to real-world deployment on physical systems due to factors like latency, context limitations, and compute costs. Nonetheless, this provides useful insights into the generalization abilities of large language models, and suggests promising directions for developing more general and adaptable agents by pretraining on diverse data. The proposed framework of prompting pattern transformations, completions, and improvements could be beneficial for sample-efficient learning in simulated environments. Overall the work is technically strong with rigorously designed experiments, and has high applicability for developing large language model based systems.

8

1

Large Language Models as Tool Makers (lemmy.intai.tech)

submitted 1 year ago* (last edited 1 year ago) by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

Github
Paper

Large Language Models as Tool Makers Authors: Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou

Word count: 4579 words

Estimated read time: 12 minutes

Source code: https://github.com/ctlllll/LLM-ToolMaker ↗

Summary:

This paper proposes a framework called LLMs As Tool Makers (LATM) that enables large language models (LLMs) to create and utilize their own tools for solving complex reasoning tasks. The key idea is to separate the process into two stages - tool making and tool using. In the tool making stage, a powerful yet expensive LLM acts as the "tool maker" to generate reusable Python functions for solving demonstrations of a task. In the tool using stage, a lightweight and cost-effective LLM acts as the "tool user" to call these tools to solve new instances of the task.

Experiments on tasks like logical deduction, tracking shuffled objects, Dyck language parsing, etc show that with tools made by GPT-4, GPT-3.5 Turbo as the tool user can match or exceed the performance of GPT-4 at lower cost. The authors also introduce a "dispatcher" LLM to handle streaming tasks by identifying when to reuse existing tools or request new ones.

Overall, this work demonstrates a promising approach to enabling LLMs to create their own tools, reducing reliance on human-crafted tools. The division of labor also allows using smaller models for most of the inferences, improving cost-efficiency. This technique could significantly expand the capabilities of LLMs in a scalable manner.

The proposed LATM framework demonstrates an interesting and promising approach to improving the reasoning and problem-solving capabilities of large language models in a cost-effective manner. Here are some thoughts on its applicability:

The ability for LLMs to create their own tools could be very useful for building practical applications. For any recurring task, the model could generate a reusable tool instead of solving from scratch each time. This could make applications more efficient and scalable.

The staged approach allows combining different sized models optimally - a powerful model makes tools, while lightweight models use the tools. This cost-effectiveness is attractive for real-world applications with budget constraints.

The tools being in Python allows them to integrate into application codebases easily. The dispatcher model also provides flexibility to handle new tasks.

The method's applicability does seem more geared towards logical reasoning, procedural and algorithmic tasks right now. Further research may be needed to extend it to other domains.

There are still open challenges around rigorously testing and validating the quality and safety of automatically generated tools. Methods to provide human oversight would be important.

Overall, the LATM paradigm does appear promising for augmenting LLMs and enabling them to participate more actively in their own learning and tooling. With further research to broaden its scope, it could become a general framework for efficiently enhancing LLM capabilities.

So in summary, LATM seems quite promising as a technique for unlocking more of the potential of LLMs for practical applications requiring complex reasoning in a scalable and cost-efficient manner. More research is still needed, but the principles demonstrated align well with enabling wider usage of LLMs and GANs in applications.

9

3

Language models can explain neurons in language models (openai.com)

submitted 1 year ago by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

10

2

Curious Replay for Model-based Adaptation (lemmy.intai.tech)

submitted 1 year ago* (last edited 1 year ago) by taters@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

Link: https://www.nature.com/articles/s41746-023-00873-0

Title: The Imperative for Regulatory Oversight of Large Language Models (or Generative AI) in Healthcare

Author(s): Bertalan Meskó & Eric J. Topol

Word count: 2,222

Estimated average read time: 10 minutes

Summary: This article emphasizes the need for regulatory oversight of large language models (LLMs) in healthcare. LLMs, such as GPT-4 and Bard, have the potential to revolutionize healthcare, but they also pose risks that must be addressed. The authors argue for differentiated regulation of LLMs in comparison to other AI-based medical technologies due to their unique characteristics and challenges.

The article discusses the scale, complexity, hardware requirements, broad applicability, real-time adaptation, societal impact, and data privacy concerns associated with LLMs. It highlights the need for a tailored regulatory approach that considers these factors. The authors also provide insights into the current regulatory landscape, particularly in the context of the United States' Food and Drug Administration (FDA), which has been adapting its framework to address AI and machine learning technologies in medical devices.

The authors propose practical recommendations for regulators, including the creation of a new regulatory category for LLMs, guidelines for deployment, consideration of future iterations with advanced capabilities, and focusing on regulating the companies developing LLMs rather than each individual model.

Evaluation for Applicability to Applications Development: This article provides valuable insights into the challenges and considerations regarding regulatory oversight of large language models in healthcare. While it specifically focuses on healthcare, the principles and recommendations discussed can be applicable to application development using large language models or generative AI systems in various domains.

Developers working on applications utilizing large language models should consider the potential risks and ethical concerns associated with these models. They should be aware of the need for regulatory compliance and the importance of transparency, fairness, data privacy, and accountability in their applications.

The proposed recommendations for regulators can also serve as a guide for developers, helping them shape their strategies for responsible and compliant development of applications using large language models. Understanding the regulatory landscape and actively addressing potential risks and challenges can lead to successful deployment and use of these models in different applications.

11

3

The imperative for regulatory oversight of large language models (or generative AI) in healthcare (www.nature.com)

submitted 1 year ago* (last edited 1 year ago) by taters@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

Title: The Imperative for Regulatory Oversight of Large Language Models (or Generative AI) in Healthcare

Author(s): Bertalan Meskó & Eric J. Topol Word count: 2,222

Estimated average read time: 10 minutes

Summary: This article highlights the need for regulatory oversight of large language models (LLMs), such as GPT-4 and Bard, in healthcare settings. LLMs have the potential to transform healthcare by facilitating clinical documentation, summarizing research papers, and assisting with diagnoses and treatment plans. However, these models come with significant risks, including unreliable outputs, biased information, and privacy concerns.

The authors argue that LLMs should be regulated differently from other AI-based medical technologies due to their unique characteristics, including their scale, complexity, broad applicability, real-time adaptation, and potential societal impact. They emphasize the importance of addressing issues such as transparency, accountability, fairness, and data privacy in the regulatory framework.

The article also discusses the challenges of regulating LLMs, including the need for a new regulatory category, consideration of future iterations with advanced capabilities, and the integration of LLMs into already approved medical technologies.

The authors propose practical recommendations for regulators to bring this vision to reality, including creating a new regulatory category, providing guidance for deployment of LLMs, covering different types of interactions (text, sound, video), and focusing on companies developing LLMs rather than regulating each iteration individually.

Evaluation for Applicability to Applications Development: This article provides valuable insights into the regulatory challenges and considerations related to large language models in healthcare. While it primarily focuses on the medical field, the principles and recommendations discussed can be applicable to applications development using large language models or generative AI systems in various domains.

Developers working on applications that utilize large language models should be aware of the potential risks and ethical concerns associated with these models. They should also consider the need for regulatory compliance and the importance of transparency, fairness, data privacy, and accountability in their applications.

Additionally, developers may find the proposed practical recommendations for regulators helpful in shaping their own strategies for responsible and compliant development of applications using large language models. Understanding the regulatory landscape and being proactive in addressing potential risks and challenges can lead to the successful deployment and use of these models in various applications.

12

7

Microsoft Announces: LongNet - Scaling LLM Transformers to 1,000,000,000 Tokens & Context Length (lemmy.intai.tech)

submitted 1 year ago by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

2 comments fedilink

cross-posted from: https://lemmy.world/post/1115946

cross-posted from: https://lemmy.world/post/1115513

Microsoft Announces a New Breakthrough: LongNet: Scaling AI/LLM Transformers to 1,000,000,000 Tokens & Context Length

Official Microsoft Breakthroughs:

https://arxiv.org/pdf/2307.02486.pdf

https://github.com/microsoft/unilm/tree/master

See one of the first implementations of LongNet here:

https://github.com/kyegomez/LongNet

In the realm of large language models, scaling sequence length has emerged as a significant challenge. Current methods often grapple with computational complexity or model expressivity, limiting the maximum sequence length. This paper introduces LongNet, a Transformer variant designed to scale sequence length to over 1 billion tokens without compromising performance on shorter sequences. The key innovation is dilated attention, which exponentially expands the attentive field as the distance increases.

Features

LongNet offers several compelling advantages:

Linear Computation Complexity: It maintains a linear computational complexity and a logarithmic dependency between tokens.

Distributed Trainer: LongNet can serve as a distributed trainer for extremely long sequences.

Dilated Attention: This new feature is a drop-in replacement for standard attention and can be seamlessly integrated with existing Transformer-based optimization.

(+ many others that are hard to fit here - please read the full paper here for more insights)

Experimental results show that LongNet delivers strong performance on both long-sequence modeling and general language tasks. This work paves the way for modeling very long sequences, such as treating an entire corpus or even the whole Internet as a sequence.

If computation and inference hurdles are continually overcome the way they are now - we may be seeing near infinite context lengths sooner than many had initially thought. How exciting!

Arxiv Paper | The Abstract:

(take this graph with a grain of salt - this is not indicative of logarithmic scaling)

Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted. In this work, we introduce LONGNET, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LONGNET has significant advantages:

It has a linear computation complexity and a logarithm dependency between tokens.

It can be served as a distributed trainer for extremely long sequences.

Its dilated attention is a drop-in replacement for standard attention, which can be seamlessly integrated with the existing Transformer-based optimization.

Experiments results demonstrate that LONGNET yields strong performance on both long-sequence modeling and general language tasks.

Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence. Code is available at https://aka.ms/LongNet.

Click here to read the rest of the paper!

13

1

Large Language Models Enable Few-Shot Clustering (lemmy.intai.tech)

submitted 1 year ago by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

Large Language Models Enable Few-Shot Clustering

paper

Authors: Vijay Viswanathan, Kiril Gashteovski, Carolin Lawrence, Tongshuang Wu, Graham Neubig

Word Count: 4893 words

Estimated Average Read Time: 9 - 10 minutes

This article discusses how large language models (LLMs) can be used to enable more query-efficient, few-shot semi-supervised text clustering. They explore three stages where LLMs can be incorporated into clustering:

Before clustering - by improving input features using keyphrases generated by an LLM

During clustering - by providing constraints to the clusterer using an LLM as a pairwise constraint pseudo-oracle

After clustering - using LLMs for post-correction of cluster assignments

They find that incorporating LLMs before and during clustering can provide significant improvements in cluster quality. Using an LLM to generate keyphrases for textual representation achieved state-of-the-art results on the datasets they tested.

While incorporating LLMs comes at additional cost, they find that an LLM can achieve comparable or better performance than a human oracle at a fraction of the labeling cost. This suggests that LLMs have the potential to enable more effective semi-supervised text clustering.

For applications development using LLMs or GANs, their work demonstrates how large language models can be used in a simple and targeted way to augment existing machine learning models and achieve performance gains. While their examples focused on clustering, a similar approach could potentially benefit other tasks where LLMs can provide "pseudo labels" or simulated demonstrations.

Overall, their results indicate that LLMs have great potential to enable more efficient use of human feedback for improved modeling performance.

14

1

Preference Ranking Optimization for Human Alignment (lemmy.intai.tech)

submitted 1 year ago by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

Preference Ranking Optimization for Human Alignment with Large Language Models

paper

By Feifan Song et al.

Word Count: 5,841

Estimated Read Time: 26 - 43 minutes

Source Code

The paper proposes a method called Preference Ranking Optimization (PRO) for aligning large language models with human preferences. The key idea is that human alignment can be modeled as aligning the probability ranking of responses generated by the language model with the preference ranking of human annotators. PRO directly optimizes the language model to learn this ranking, avoiding the complexity of reinforcement learning approaches.

The authors evaluate PRO on a dataset of human feedback and find that it outperforms baselines in aligning with human preferences, achieving similar results to ChatGPT. They also analyze factors that influence the performance of PRO, such as the length and diversity of the preference rankings.

In summary, PRO presents an effective and efficient method for aligning language models to human values. With further research, similar techniques could improve the safety, trustworthiness and interpretability of large language models.

However, PRO relies on heuristics rather than formalizing human preferences as a well-defined objective. This may limit the generalizability of the approach. Additionally, the paper focuses on conversational data, so it remains to be seen how well PRO can align language models for other tasks.

15

1

Pushing the Limits of Machine Design Automated CPU Design with AI (lemmy.intai.tech)

submitted 1 year ago by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

Title: Pushing the Limits of Machine Design: Automated CPU Design with AI

paper

Authors: Shuyao Cheng1,2 Pengwei Jin1,2 Qi Guo1 Zidong Du1,3 Rui Zhang1,3 Yunhao Tian1,2 Xing Hu1,2 Yongwei Zhao1,3 Yifan Hao1 Xiangtao Guan1,4 Husheng Han1,2 Zhengyue Zhao1,2 Ximing Liu1,2 Ling Li5 Xishan Zhang1,3 Yuejie Chu1 Weilong Mao1 Tianshi Chen3 & Yunji Chen1,2,∗

Word Count: 10,203
Estimated Read Time: 16-17 minutes

Summary:

This paper proposes an AI approach to automatically design a CPU, pushing the limits of machine design. The approach generates the circuit logic of the CPU design in the form of a Binary Speculation Diagram (BSD) by efficiently exploring a search space of 1010540, which is the largest of any machine designed object.

The approach addresses two key challenges:

Accuracy: The inferred circuit logic must ensure extremely high accuracy (>99.99999999999%). This is achieved using Monte Carlo-based expansion during BSD generation. Scalability: The circuit logic consists of millions of lines of formulas. This is addressed using Boolean distance-based node reduction to significantly reduce the number of BSD nodes. The researchers demonstrate the capability of the approach by automatically generating a RISC-V CPU design. The generated CPU runs Linux, performs comparably to an Intel 80486 CPU, and reduces the design cycle by about 1000x compared to human designed CPUs.

In summary, the research pushes the limits of machine design capabilities and represents one of the largest objects designed by AI to date. The approach also discovers aspects of the von Neumann architecture from scratch and has the potential to automatically generate CPU optimization.

Applicability to Large Language Models: This research demonstrates the capability of AI methods to design complex real-world systems from empirical data. The approach relies on techniques like Monte Carlo sampling, hierarchical graph generation, and node merging.

Large language models and GANs could potentially be applied to automate aspects of the approach, for example:

Generating an initial BSD structure from input-output examples using a language model Predicting optimal expansion orders using a language model Automating the Boolean distance calculation and node merging using a similarity model Overall, while the current approach is rule-based, incorporating large language models and GANs could help automate and improve various aspects, making the process more data-driven and efficient. This could potentially extend the approach to automating the design of even more complex systems.

16

1

Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine (lemmy.intai.tech)

submitted 1 year ago* (last edited 1 year ago) by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine

paper

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Zhaopeng Tu

Word Count: 1504 words
Estimated Read Time: 8-9 minutes
Source Code: https://github.com/wxjiao/Is-ChatGPT-A-Good-Translator

Supporting Links:

ChatGPT: https://chat.openai.com
GPT-4: https://openai.com/research/gpt-4/

Summary:

This article evaluates the machine translation capabilities of ChatGPT compared to commercial translation products like Google Translate and DeepL Translate. The evaluation covers three aspects: translation prompts, multilingual translation, and translation robustness.

The results show that ChatGPT performs comparably with commercial systems on high-resource languages but lags behind on low-resource languages. However, a pivot translation strategy that uses English as an intermediate language significantly improves translation performance for distant languages. Furthermore, the newly launched GPT-4 engine boosts ChatGPT's translation abilities, making it competitive even for distant languages.

The article concludes that with GPT-4 as its foundation, ChatGPT has become a good translator. However, the evaluation has some limitations like the limited test data size and reproducibility issues due to ChatGPT's latency.

Applicability to Applications Development: This evaluation demonstrates the quality and limitations of ChatGPT's (and now GPT-4's) machine translation capabilities. While it shows promising results for high-resource languages, it still lags behind commercial products on domain-specific or noisy texts. However, with large language models like GPT-4 as the backend, chatbot and conversational AI applications that require translation abilities have the potential to improve significantly.

The insights from this study into translation prompts, multilingual translation strategies, and robustness can help guide the development and optimization of such large language model powered applications. But more rigorous and extensive evaluations are needed to fully understand the capabilities and remaining gaps.

17

1

SequenceMatch - Imitation Learning for Autoregressive Sequence Modelling with Backtracking (lemmy.intai.tech)

submitted 1 year ago* (last edited 1 year ago) by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

paper

Authors: Chris Cundy, Stefano Ermon

Word Count: 6700

Estimated Read Time: 21-25 minutes

Source Code: Not available (research paper)

Supporting Links: None provided

Summary: The paper proposes a method called SequenceMatch for training autoregressive sequence models like language models. The key ideas are:

Formulate sequence generation as an imitation learning problem to minimize divergences between the model and data distributions. This allows for penalties for out-of-distribution sequences.

Introduce a action to allow the model to correct erroneous generations, reducing compounding error.

Minimize alternative divergences like the χ2-divergence instead of maximum likelihood, which leads to improved generations.

Implement SequenceMatch without architectural changes by masking the action and recomputing logits efficiently.

They show empirically that SequenceMatch leads to better text generation compared to maximum likelihood training, as measured by MAUVE score, diversity, and fluency.

Evaluation: The SequenceMatch approach seems applicable to improving the quality of generations from large language models. The key ingredients - alternative loss functions, backtracking, and masking techniques - are general enough to apply to other autoregressive models like GANs. The main limitation is the increased training cost due to sampling during training. However, the authors discuss approaches to mitigate this. In summary, the techniques proposed in the paper are promising for developing better generative models for text and other sequential data.

18

1

The RefinedWeb Dataset for Falcon LLM - Outperforming Curated Corpora with Web Data, and Web Data Only (lemmy.intai.tech)

submitted 1 year ago by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

paper

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Authors: The Falcon LLM Team (see paper for full list of authors)

Word Count: Approximately 9,000 words

Estimated Read Time: Around 30-45 minutes

Source Code: The codebase used to develop the Falcon-LLM models and the RefinedWeb dataset is not publicly available.

Relevant Links:

• RefinedWeb dataset (600 billion token extract): https://huggingface.co/datasets/tiiuae/falcon-refinedweb • Falcon-LLM website: falconllm.tii.ae

Summary:

The paper proposes RefinedWeb, a large-scale dataset derived from CommonCrawl web data. Through extensive filtering and deduplication of the web data, the authors argue that RefinedWeb can be used to train language models that match or outperform models trained on curated corpora. They publicly release a 600 billion token extract of RefinedWeb, as well as language models trained on the dataset.

The authors find that properly filtered and deduplicated web data alone can lead to language models with powerful zero-shot capabilities, even outperforming publicly available models trained on curated data like The Pile. They achieve this by rigorously removing duplicate and low-quality web content from CommonCrawl.

The RefinedWeb dataset and the Falcon-LLM models trained on it could be useful resources for developing large language models or GAN-based systems. The dataset size of 5 trillion tokens would allow for scaling these systems to a large size. However, the lack of publicly available code makes reproducing and building upon the presented results more difficult.

19

1

On the Coverage of Cognitive mmWave Networks with Directional Sensing and Communication (lemmy.intai.tech)

submitted 1 year ago by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

On the Coverage of Cognitive mmWave Networks with Directional Sensing and Communication

Authors: Shuchi Tripathi, Abhishek K. Gupta, SaiDhiraj Amuru

Word Count: 5400

Average Reading Time: ~30 minutes

Highlights:

• The authors propose an analytical framework to evaluate the performance of a cognitive mmWave network consisting of a primary link and multiple secondary links using stochastic geometry.

• They consider directional channel sensing and communication in contrast to omnidirectional sensing, which allows secondary transmitters to transmit based on their orientation instead of being outside a certain distance. This provides better spatial reuse for secondary transmitters.

• They analyze the medium access probability, activity factor, and coverage probability of the primary and secondary links considering various parameters like directionality, threshold, density, etc.

• They show that directionality can improve the trade-off between the primary and secondary link performances by increasing both link coverages for appropriate threshold values.

• However, the effect of primary and secondary directionality depends on the location and orientation of the secondary links. While primary directionality does not always aid secondary coverage, secondary directionality always improves it.

• They propose an adaptive directional sensing where secondary links can choose higher or lower directionality based on their location to achieve similar coverage performances.

In summary, this work provides useful analytical insights into the performance of cognitive mmWave networks with directional sensing and communication. The proposed mathematical framework and results could potentially aid in the design and optimization of such networks.

Regarding applications of large language models, the analytical approach and results in this work could provide useful guidelines and inputs for developing agent-based simulations of cognitive mmWave networks. The simulations could leverage language models to emulate the behaviors of cognitive transmitters based on the derived insights to validate and extend the proposed framework.

20

1

Generate Anything Anywhere in Any Scene. (lemmy.intai.tech)

submitted 1 year ago* (last edited 1 year ago) by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

PACGen: Generate Anything Anywhere in Any Scene

Yuheng Li, Haotian Liu, Yangming Wen, Yong Jae Lee University of Wisconsin–Madison

Word Count: 2,683
Estimated Read Time: 9-12 minutes

Summary: The content proposes PACGen, a method that combines techniques from DreamBooth and GLIGEN models to enable both personalized and controllable text-to-image generation. PACGen addresses entanglement issues in existing personalized generative models through data augmentation during training. This allows PACGen to associate a personalized concept solely with its identity, not its location or size. During inference, a regionally-guided sampling technique ensures high quality generation while maintaining location control. Experimental results show that PACGen can generate personalized concepts with controllable location and size, achieving comparable or better fidelity than alternative baselines. The authors envision potential applications for PACGen in art, advertising and entertainment design.

Evaluation: PACGen demonstrates significant potential to enable impactful new creative applications by providing fine-grained control in personalized generative models. The ability to generate personalized concepts with controllable placement within desired contexts could enable improvements in AI creation tools for advertisement designers, artists and citizens. However, the potential misuse of this technology by bad actors generating manipulated content remains a concern. While PACGen represents an important advance, further research is needed to address potential issues and ensure the responsible development and use of such generative modeling techniques. Overall, PACGen is applicable for augmenting and improving large language model and generative adversarial network systems by enabling controllable personalized generative priors.

21

1

DreamDiffusion - Generating High-Quality Images from Brain EEG Signals (lemmy.intai.tech)

submitted 1 year ago* (last edited 1 year ago) by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

paper

Title: DreamDiffusion: Generating High-Quality Images from Brain EEG Signals Authors: Yunpeng Bai1 , Xintao Wang2 , Yanpei Cao2 , Yixiao Ge2 , Chun Yuan1,3 , Ying Shan2

Word Count: 3,697
Estimated Reading Time: ~15 minutes
Source Code: No source code provided

Summary: The paper proposes DreamDiffusion, a method to generate high-quality images based on EEG signals recorded from the human brain. This is achieved by:

Pre-training an EEG encoder using masked signal modeling on a large EEG dataset to learn robust EEG representations.

Fine-tuning a pre-trained Stable Diffusion text-to-image model using limited paired EEG-image data.

Using CLIP image embeddings to further optimize the EEG embeddings, aligning the EEG, text and image embeddings for improved image generation.

The results show that DreamDiffusion can generate realistic images from EEG signals alone, demonstrating progress towards more portable and affordable "thought-to-image" systems.

Applicability: The proposed method demonstrates the promising potential of large language models and GANs for applications that generate images directly from brain activity. The key ingredients - masked modelling pre-training, fine-tuning on diffusion models, and alignment with multi-modal embeddings - are applicable techniques for developing other brain-computer interface systems. However, the current results still show limitations in capturing fine-grained semantic information from EEG data. Overall, the paper outlines a path forward for building more capable brain-to-image generation systems.

22

1

One-2-3-45 (lemmy.intai.tech)

submitted 1 year ago* (last edited 1 year ago) by manitcor@lemmy.intai.tech to c/mltheory@lemmy.intai.tech

0 comments fedilink

One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization

Minghua Liu1∗ Chao Xu2∗ Haian Jin3,4∗ Linghao Chen1,4∗ Mukund Varma T5 Zexiang Xu6Hao Su1

Word count: 8458 words
Estimated read time : ~ 37 minutes
Source code: http://one-2-3-45.com

Summary: This paper presents a method to reconstruct 3D shapes from a single image in an end-to-end manner without time-consuming optimization. Their approach consists of three main parts:

Multi-view synthesis: They leverage a view-conditioned 2D diffusion model, Zero123, to generate multi-view images of the input object.

Pose estimation: They estimate the elevation angle of the input image to determine the camera poses of the multi-view images.

3D reconstruction: They employ a neural surface reconstruction method based on signed distance fields to reconstruct a 3D textured mesh from the multi-view images in a single feed-forward pass.

Their key contributions are:

Reconstruction in just 45 seconds without per-shape optimization Producing higher quality geometry due to the use of SDF representation Generating more 3D consistent results thanks to the multi-view synthesis module Achieving better adherence to the input image compared to existing methods They evaluate their approach on synthetic data and real images, demonstrating superior performance in terms of both mesh quality and runtime compared to existing zero-shot single image 3D reconstruction approaches.

Evaluation: This approach has strong potential for applications in 3D content creation and augmented/virtual reality. The key benefits are:

Fast inference time of 45 seconds, which is orders of magnitude faster than optimization-based approaches. This makes it suitable for production environments with low latency requirements.

Ability to reconstruct 3D shapes from a single image of any object, not restricted to specific object categories. This enables a wide range of applications.

Good adherence to the input image, producing realistic 3D shapes that match the given input. This is important for applications where fidelity to the input is critical.

The ability to extend to text-to-3D tasks by integrating with text-to-image diffusion models, providing an unrestricted input domain.

The main limitation is the dependence on the Zero123 diffusion model for multi-view synthesis, which occasionally produces inconsistent predictions that can impact reconstruction quality. However, the overall results demonstrate strong potential for real-world applications. With further improvements to the multi-view synthesis module and additional regularizations, this approach could enable a wide range of novel applications that require reconstructing realistic 3D shapes from a single image in near real-time.