Naver trained a ‘GPT-3-like’ Korean language model


Naver, the Seongnam, South Korea-based company that operates the eponymous search engine, this week announced that it trained one of the largest AI language models of its kind, called HyperCLOVA. Naver claims that the system learned 6,500 times more Korean data than OpenAI’s GPT-3 and contains 204 billion parameters, the parts of the machine learning model learned from historical training data. (GPT-3 has 175 billion parameters.)

For the better part of a year, OpenAI’s GPT-3 has remained among the largest AI language models ever created. Via an API, people have used it to automatically write emails and articles, summarize text, compose poetry and recipes, create website layouts, and generate code for deep learning in Python. But GPT-3 has key limitations, chief among them that it’s only available in English.

According to Naver, HyperCLOVA was trained on 560 billion tokens of Korean data — 97% of the Korean language — compared with the 499 billion tokens on which GPT-3 was trained. Tokens, a way of separating pieces of text into smaller units in natural language, can be either words, characters, or parts of words.
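To make the three kinds of tokens concrete, here is a minimal Python sketch. It is not the tokenizer used by HyperCLOVA or GPT-3; the function names and the tiny `TOY_VOCAB` are hypothetical, and the subword splitter is a simple greedy longest-match stand-in for the learned vocabularies real models use.

```python
def word_tokens(text):
    """Split on whitespace: one token per word."""
    return text.split()


def char_tokens(text):
    """One token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def subword_tokens(word, vocab):
    """Greedy longest-match split of a single word into vocabulary pieces.

    Any character not covered by the vocabulary becomes its own token.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining slice first, shrinking toward one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Hypothetical vocabulary, chosen only for illustration.
TOY_VOCAB = {"token", "iz", "ation"}

sentence = "tokenization splits text"
words = word_tokens(sentence)
chars = char_tokens(sentence)
subwords = subword_tokens("tokenization", TOY_VOCAB)
```

With this toy vocabulary, the single word "tokenization" becomes three subword tokens, which is why token counts such as 560 billion are not the same thing as word counts.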

In a translated press release, Naver said that it’ll use HyperCLOVA to provide “differentiated” experiences across its services, including the Naver search engine’s autocorrect feature. “Naver plans to support HyperCLOVA [for] small- and medium-sized businesses, creators, and startups,” the company said. “Since AI can be operated with a few-shot learning method that provides simple explanations and examples, anyone who is not an AI expert can easily create AI services.”
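The few-shot approach Naver describes amounts to placing a short instruction and a handful of worked examples in front of the new input, then letting the model continue the pattern. The sketch below only assembles such a prompt as a string; the `build_few_shot_prompt` helper, the "Input:"/"Output:" labels, and the spelling-correction examples are assumptions for illustration, and no HyperCLOVA or GPT-3 API is called.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the new input.

    `examples` is a list of (input_text, output_text) pairs.
    """
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final "Output:" is left blank for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical task and examples, loosely modeled on a search autocorrect feature.
prompt = build_few_shot_prompt(
    "Correct the spelling of each search query.",
    [("wether forecast", "weather forecast"), ("restarant near me", "restaurant near me")],
    "recieve package",
)
```

The resulting string would be sent to a language model as-is; the quality of the answer depends on the model, not on this helper.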

Jack Clark, the policy director for OpenAI, called HyperCLOVA a “notable” achievement both because of the scale of the model and because it fits into the trend of generative model diffusion, in which multiple actors are developing “GPT-3-style” models. In April, a research team at Chinese company Huawei quietly detailed PanGu-Alpha (stylized PanGu-α), a 750-gigabyte model with up to 200 billion parameters that was trained on 1.1 terabytes of Chinese-language ebooks, encyclopedias, news, social media, and web pages.

“Generative models ultimately reflect and magnify the data they’re trained on — so different nations care a lot about how their own culture is represented in these models. Therefore, the Naver announcement is part of a general trend of different nations asserting their own AI capacity [and] capability via training frontier models like GPT-3,” Clark wrote in his weekly Import AI newsletter. “[We’ll] await more technical details to see if [it’s] truly comparable to GPT-3.”

Skepticism

Some experts believe that while HyperCLOVA, GPT-3, PanGu-α, and similarly large models are impressive with respect to their performance, they don’t move the ball forward on the research side of the equation. Rather, they’re prestige projects that demonstrate the scalability of existing techniques or that serve as a showcase for a company’s products.

Naver makes no claim that HyperCLOVA overcomes other blockers in natural language, like answering math problems correctly or responding to questions without paraphrasing training data. More problematically, there’s also the possibility that HyperCLOVA contains the types of bias and toxicity found to exist in models like GPT-3. Among others, leading AI researcher Timnit Gebru has questioned the wisdom of building large language models, examining who benefits from them and who’s disadvantaged. In particular, the effects of AI and machine learning model training on the environment have been brought into relief.

The coauthors of an OpenAI and Stanford paper suggest ways to address the negative consequences of large language models, such as enacting laws that require companies to acknowledge when text is generated by AI, perhaps along the lines of California’s bot law. Other recommendations include:

  • Training a separate model that acts as a filter for content generated by a language model
  • Running models through a suite of bias tests before allowing people to use them
  • Avoiding some specific use cases
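The first recommendation, a separate model that filters generated text, can be sketched as a thin wrapper around whatever produces the text. Everything here is hypothetical: `toy_filter` is a keyword check standing in for a trained classifier, `BLOCKLIST` holds placeholder words, and `generate` is any function that maps a prompt to text.

```python
# Placeholder terms; a real filter would be a trained classifier, not a word list.
BLOCKLIST = {"badword", "worseword"}


def toy_filter(text):
    """Stand-in for a filter model: return True if the text should be withheld."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return bool(words & BLOCKLIST)


def guarded_generate(generate, prompt, is_unsafe=toy_filter, fallback="[withheld]"):
    """Run the generator, then pass its output through the filter before returning it."""
    text = generate(prompt)
    return fallback if is_unsafe(text) else text


# Hypothetical generators used only to exercise the wrapper.
safe_output = guarded_generate(lambda p: "Hello there.", "greet the user")
blocked_output = guarded_generate(lambda p: "That is a badword!", "greet the user")
```

The design choice is that the filter sits outside the language model, so it can be retrained or replaced without touching the generator.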

The consequences of failing to take any of these steps could be catastrophic over the long term. In recent research, the Middlebury Institute of International Studies’ Center on Terrorism, Extremism, and Counterterrorism claims that GPT-3 could reliably generate “informational” and “influential” text that might radicalize people into violent far-right extremist ideologies and behaviors. And toxic language models deployed into production might struggle to understand aspects of minority languages and dialects. This could force people using the models to switch to “white-aligned English,” for example, to ensure that the models work better for them, which could discourage minority speakers from engaging with the models to begin with.

VentureBeat
