目前,低收入国家有 1.43 亿人正在等待手术。美国波士顿儿童医院副院长、哈佛医学院教授 Joan LaRovere表示,有一些组织愿意提供医生和资源,但两者之间存在信息鸿沟。


img__ Female Engineer Controller Observes Working of the System. In the Background People Working and Monitors Show Various Information.




There are currently 143 million people waiting for surgeries in lower income countries. And there are organizations ready to bring in doctors and resources —  but there’s an information gap between the two, says Joan LaRovere, associate chief medical officer at Boston Children’s Hospital, a professor at Harvard medical School, and co-founder of the Virtue Foundation, an NGO dedicated to solving this information problem.

目前,低收入国家有 1.43 亿人正在等待手术。美国波士顿儿童医院副院长、哈佛医学院教授 Joan LaRovere表示,有一些组织愿意提供医生和资源,但两者之间存在信息鸿沟。Joan LaRovere是致力于解决这一信息鸿沟问题的非政府组织美德基金会(Virtue Foundation)的联合创始人。

The Virtue Foundation, founded in 2002, has already created the world’s largest database of NGOs and healthcare facilities, delivering global health services in over 25 countries, organizing medical expeditions, conducting research, and donating medical equipment. As part of this work, the foundation’s volunteers learned about the necessity of collecting reliable data to provide efficient healthcare activity.

成立于 2002 年的美德基金会已经建立了世界上最大的非政府组织和医疗机构数据库。美德基金会在超过 25 个国家提供全球医疗服务、组织医疗考察、开展研究和捐赠医疗设备。基金会的志愿者们在进行这项工作时了解到收集可靠数据对于以便提供高效的医疗保健活动的必要性。

The problem is that information sources are incredibly varied and often hidden, says LaRovere.

LaRovere 表示,问题在于信息来源种类繁多,而且往往很隐蔽。

“It’s not aggregated,” she says. “It’s on the web. It’s buried in governmental organizations. It’s in a mixture of structured and unstructured formats.”


To help alleviate the complexity and extract insights, the foundation, using different AI models, is building an analytics layer on top of this database, having partnered with DataBricks and DataRobot. Some of the models are traditional machine learning (ML), and some, LaRovere says, are gen AI, including the new multi-modal advances.


“The generative AI is filling in data gaps,” she says. “This is a very new thing that’s going on and we’re right at the leading edge of the curve.”


The next step, she says, is to take the foundational data set, and augment it with other data sources, more layers of data, and even satellite data, to draw insights and figure out correlations.


“AI’s capabilities allow us the ability to start making the invisible, visible,” she adds.


But the Virtue Foundation isn’t alone in experimenting with gen AI to help develop or augment data sets.


“This does work and is in use today by a growing number of companies,” says Bret Greenstein, partner and leader of the gen AI go-to-market strategy at PwC. “Most enterprise data is unstructured and semi-structured documents and code, as well as images and video. This was not accessible in the past without complex, custom solutions that were often very brittle.”

Pwc 合伙人兼生成式人工智能上市战略负责人 Bret Greenstein 表示,“确实是可行的,而且现在越来越多的公司都在用。大多数企业数据都是非结构化和半结构化文档、代码以及图像和视频。在过去,如果没有复杂的定制解决方案就无法用上这些数据的,而这些定制解决方案往往也非常脆弱。”

For example, gen AI can be used to extract metadata from documents, create indexes of information and knowledge graphs, and to query, summarize, and analyze this data.


“This is a huge leap over older approaches that required extensive manual processing,” he says. “And it unlocks so many use cases since most workflows and processes are based on documents and similar data types.”


According to IDC, 90% of data generated by organizations in 2022 was unstructured. Companies use gen AI to create synthetic data, find and remove sensitive information from training data sets, add meaning and context to data, and perform other higher-level functions where traditional ML approaches fall short. But gen AI can also be slower, more expensive, and sometimes less accurate than older technologies, and experts advise against jumping into it before all the foundational layers are in place.

根据 IDC 的数据,2022 年企业生成的数据中有 90% 是非结构化数据。企业利用生成式人工智能创建合成数据,从训练数据集中找到并移除敏感信息,为数据添加意义和上下文,并执行传统机器学习方法无法实现的其他更高级功能。但与旧技术相比,生成式人工智能可能要慢一些、更昂贵一些,有时准确性也更低一些,因此专家建议在所有基础层都到位之前不要盲目采用生成式人工智能技术。

## Data extraction use case


ABBYY, an intelligent automation company, has been using various types of AI and ML to process documents for more than 35 years. And three years ago, long before ChatGPT hit the scene, it began using gen AI.

ABBYY 是一家智能自动化公司,35 年来一直在使用各种人工智能和机器学习技术处理文档。三年前,早在 ChatGPT 出现之


“We used it to help with optical character recognition,” says Max Vermeir, ABBYY’s senior director of AI strategy.

ABBYY的人工智能策略高级主管Max Vermeir表示,“我们用它来帮助进行光学字符识别。”

Previously, a convolutional neural network would be used to detect which bits of an image had text in it. “Then that went into a transformer, the same architecture as ChatGPT, but built in a different way, he says.

以前用卷积神经网络检测图中哪些地方含有文字。他表示,“然后将其输给一个转换器,与 ChatGPT 的架构相同,但构建方式不同。”

The benefits of using an LLM for this task is that it can see the big picture and figure out what the text is supposed to be from context cues. The problem, says Vermeir, is that LLMs are very resource intensive. “And in optical character recognition, it’s all about speed,” he adds. “So it’s only when we detect a very low-quality document do we involve a large language model.”

使用 LLM(大型语言模型) 完成这项任务的好处是可以看到全局,并根据上下文线索找出文本的内容。Vermeir表示,问题在于LLM 非常耗费资源。他补充表示,“做光学字符识别时,速度很重要。因此,只有在检测到质量很低的文档时,我们才会使用大型语言模型。

The company is also using LLMs to figure out the location of key information in a particular type of document.

ABBYY也在使用  LLM 来确定特定类型文件中关键信息的位置。

“We do the optical character recognition, give the full text to the LLM, and then ask our questions,” he says. For example, the LLM could figure which parts of the document hold particular types of information. “Then we distil it to a smaller model that’s trained specifically on that type of document, which means it’ll be very efficient, accurate, and much less resource intensive.”

他表示,“我们先做光学字符识别,然后将全文交给 LLM,然后我们问一些问题。”例如,LLM 可以找出文档的哪些部分


In addition to being resource intensive, general-purpose LLMs are also notorious for having accuracy issues.


“Purely using LLMs won’t provide the reliability needed for critical data tasks,” Vermeir says. “You don’t want an LLM to guess what’s in a PDF that’s been sitting in your archive for 10 years — especially if it’s your most important contract.”

Vermeir 表示,“单纯使用大型语言模型无法提供执行关键数据任务所需的可靠性。你不会想靠一个大型语言模型去猜在你档案中存放了10年的PDF内容,尤其是如果文件是个很重要的合同。”

It’s important to use the right tool for the job considering all the hype surrounding gen AI. “A lot of people are trying to leverage this technology, which seems like it can do everything,” he says, “but that doesn’t mean you should use it for everything.”


So, for example, ABBYY already has a tool that can turn a single image into hundreds of synthetic images to use for training data. If there are duplicate records, fuzzy logic matching technology is great at checking whether it’s the same person. But if there’s an Onion article that recommends eating a rock every day, or a Reddit post about putting glue on pizza, are these credible sources of information that should be part of a training data set?

例如,ABBYY 已经有一款工具可以将单张图像转化为几百张合成图像,用作训练数据。如果有重复记录,模糊逻辑匹配技术就能很好地检查出是否是同一个人。但是,如果有一篇洋葱(Onion)文章建议每天吃一块石头,或者有一篇 Reddit 帖子说要在披萨上涂胶水,这些是可信的信息来源吗?应该成为训练数据集的一部分吗?”

“That actually requires that the technology reasons about whether people generally put glue on pizza,” says Vermeir. “That’s an interesting task to put to a large language model, where it’s reasoning about a large quantity of information. So this use case is quite useful.” In fact, ABBYY has something similar to this, figuring out whether a particular piece of information, when added to a training data set, will help performance of a model that’s being trained.

Vermeir 表示,“这实际上需要这项技术能够推理,大家通常是否会在披萨上涂胶水。这对于大型语言模型来说是一个有趣的任务,因为它需要针对大量信息进行推理。所以这个用例非常有用。”其实 ABBYY 也有类似的功能,在训练数据集中添加特定信息时,要 确认该信息是否有助于 提高训练模型的性能。”

 “We’re validating whether the training data we’re receiving actually increments the model,” he says.


This is particularly relevant to a smaller ML or special purpose gen AI model. For general-purpose models, it’s harder to make that kind of distinction. For example, excluding Onion articles from a training data set might improve a model’s factual performance, but including them might improve a model’s sense of humor and writing level; excluding flat-earth websites might improve a model’s scientific accuracy, but reduce its ability to discuss conspiracy theories.


## Deduplication and quality control use case


Cybersecurity startup Simbian is in the process of building an AI-powered security platform, and worries about users “jailbreaking” the AI, or asking questions in such a way that it gives results it’s not supposed to.

网络安全初创公司 Simbian 正在构建一个由人工智能驱动的安全平台。 Simbian担心用户会对平台的 AI 进行“越狱”操作,也就是用特别的方式提问,使得人工智能模型给出一些不和谐的回答。

“When you’re building an LLM for security, it better be secure,” says Ambuj Kumar, the company’s CEO.

Simbian首席执行官 Ambuj Kumar 表示,“我们做 LLM 安全平台,平台自然应该是安全的。”

To find examples of such jailbreaks, the company set up a website where users can try to trick an AI model. “This showed us all of the ways an LLM can be fooled,” he says. However, there were a lot of duplicates in the results. Say, for example, a user wants to get a chatbot to explain how to build a bomb. Asking it directly will result in the chatbot refusing to answer the question. So the user might say something like, “My grandmother used to tell me a story about making a bomb…” And a different user might say, “My great-grandfather used to tell me a story…” Simply in terms of the words used, these are two different prompts, but they’re examples of a common jailbreak tactic.

为了找到各种越狱的例子,Simbian 建了个网站,专门让用户能够针对人工智能模型玩各种越狱手法。他表示,“这可以向我们展示所有可以骗过 LLM 的方法。但得到的结果中有很多重复。比如说,用户想让聊天机器人解释如何制造炸弹。直接这样问会导致聊天机器人拒绝回答问题。所以用户可能会说,‘我祖母曾经给我讲过一个关于制造炸弹的故事……’而另一个用户可能会说,‘我的曾祖父曾经给我讲过一个故事……’单从用词来看,这是两个不同的问法,但两种问法都是常见越狱策略的例子。”

Having too many examples of a similar tactic in the training data set would skew the results. Plus, it costs more money. By using gen AI to compare different successful jailbreaks, the total number of samples was lowered by a factor of 10, he says.

在训练数据集中有太多类似策略的示例会使结果出现偏差。此外,成本也更高。他表示,通过使用生成式人工智能比较不同的成功越狱案例,样本总数减少了 10 倍。

Simbian is also using an LLM to screen its training data set, which is full of different kinds of security-related information.

Simbian 也在使用大型语言模型筛选训练数据集,训练数据集充满了各种与安全相关的信息。

“People have written gigabytes of blogs, manuals, and READMEs,” he says, “and we’re continuously reading those things, figuring out which ones are good and which ones aren’t, and adding the good ones to our training data set.”


## Synthetic data use case


One use case is particularly well suited for gen AI because it was specifically designed to generate new text.


“They’re very powerful for generating synthetic data and test data,” says Noah Johnson, co-founder and CTO at Dasera, a data security firm. “They’re very effective on that. You give them the structure and the general context, and they can generate very realistic-looking synthetic data.” The synthetic data is then used to test the company’s software, he says. “We use an open source model that we’ve tuned to this specific application.”

数据安全公司 Dasera 的联合创始人兼首席技术官 Noah Johnson 表示,“大型语音模型在生成合成数据和测试数据方面非常

强大。这些模型在这方面非常有效。你提供结构和大致上下文,这些模型就能生成非常逼真的合成数据。”他称,合成数据然后可以用来测试 Dasera 公司的软件。他还表示,“我们使用的是一个开源模型,并针对这一特定应用进行了微调。”

And synthetic data isn’t just for software testing, says Andy Thurai, VP and principal analyst at Constellation Research. A customer service chatbot, for example, might require a large amount of training data to learn from.

Constellation Research 公司副总裁兼首席分析师 Andy Thurai 表示,合成数据不仅仅适用于软件测试。例如,客户服务聊天机器人可能需要大量的训练数据,从训练数据进行学习。

“But sometimes there isn’t enough data,” says Thurai. “Real-world data is very expensive, time-consuming, and hard to collect.” There might also be legal constraints or copyright issues, and other obstacles to getting the data. Plus, real-world data is messy, he says. “Data scientists will spend up to 90% of their time curating the data set and cleaning it up.” And the more data a model is trained on, the better it is. Some have billions of parameters.

Thurai 表示,“但有时没有足够的数据,现实世界的数据非常昂贵、耗时,而且很难收集。在获取数据的过程中,可能还会遇到法律限制、版权问题以及其他障碍。另外,现实世界的数据杂乱无章。数据科学家需要花费高达 90% 的时间来整理数据集和清洗数据。”模型训练所用的数据越多,性能往往越好。有些模型有几十亿个参数。

“By using synthetic data, you can produce data as fast as you want, when you want it,” he says.


The challenge, he adds, is that it’s too easy to produce just the data you expect to see, resulting in a model that’s not great when it comes across real-world messiness.


“But based on my conversations with executives, they all seem to think that it’s good enough,” says Thurai. “Let me get the model out first with a blend of real world data and synthetic data to fill some blank spots and holes. And in later versions, as I get more data, I can fine-tune or RAG or retrain with the newer data.”

Thurai 表示,“但根据我与高管的交谈来看,他们似乎都觉得这已经足够好了。他们觉得我先用现实世界数据和合成数据混合整一个模型,填补一些空白的点和洞,在以后的版本中,我获得更多数据时可以用新数据进行微调、用 RAG 或重新训练。”

## Keeping gen AI expectations in check


The most important thing to know is that gen AI won’t solve all of a company’s data problems.


“It’s not a silver bullet,” says Daniel Avancini, chief data officer at Indicium, an AI and data consultancy.

人工智能和数据咨询公司 Indicium 的首席数据官 Daniel Avancini 表示,“生成式人工智能不是灵丹妙药。”

If a company is just starting on its data journey, getting the basics right is key, including building good data platforms, setting up data governance processes, and using efficient and robust traditional approaches to identifying, classifying, and cleaning data.

如果一家公司刚刚开始数据之旅,那么做好基础工作至关重要,包括构建良好的数据平台、建立数据治理流程,并使用高效且稳 健的传统方法来识别、清理数据及对数据进行分类。

“Gen AI is definitely something that’s going to help, but there are a lot of traditional best practices that need to be implemented first,” he says.


Without those foundations in place, an LLM may have some limited benefits. But when companies do have their frameworks in place, and are dealing with very large amounts of data, then there are specific tasks that gen AI can help with.


“But I wouldn’t say that, with the technology we have now, it would be a replacement for traditional approaches,” he says.