From Web2 to Web3: Why I Am Bullish on the AI Sector

Crypto News | 2024-04-22 12:53:51


Author: Zixi.eth, investor at Matrix Partners China (经纬中国). Source: X (formerly Twitter) @Zixi41620514

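I have recently started spending most of my time on the Web2/Web3 AI sector. My focus is on the open-source model community within the global model track, the data track, the middleware that serves large models (for example, end-to-end services that turn a foundation model into an industry-specific model), and some applications. Founders of all kinds are welcome to reach out to us at Matrix; we believe AI will be a very long-term sector.

This first installment covers the data labeling industry within the data track, where we have recently invested. It is also one of the deals I am personally most satisfied with this year.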

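AI development can be divided into data preparation (data collection, cleaning, labeling, augmentation, and so on) and algorithm development (model building, training, tuning, and deployment). Because the new generation of AI demands data that is multimodal, highly precise, and heavily customized, AI data work still depends heavily on human labor, and smoother interaction between AI and people is needed to raise efficiency. Data labeling means identifying and distinguishing the feature elements in the data samples used for model training. AI is still largely in the supervised learning stage: during training, algorithms typified by deep learning learn and validate the information contained in the data, and the logical relationships within it, on the basis of those feature labels. Labeling is therefore indispensable, and it is one of the core tasks of data preparation and of AI project development as a whole. Like the other steps of data preparation, labeling depends heavily on human labor, and its long turnaround times and large labor costs have become one of the main constraints on the AI industry. These supply-side pain points have created market demand for automation tools and are driving the development and large-scale adoption of intelligent data labeling technology.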

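Figure 1: From data collection to AI-ready datasets

In intelligent driving, currently the largest downstream application of data labeling, a great deal of manual work is still needed to label all kinds of objects in a scene, such as cats, dogs, utility poles, and strollers. Scale AI, for example, is an important data provider to OpenAI; it has set up its own data labeling studios in third-world countries around the world to help OpenAI label text and image data.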

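As AI advances, however, pre-labeling is taking up a growing share of the workflow. In the early days, labeling was done mostly by hand to build and accumulate machine learning datasets. Efficiency was relatively low and costs were high, but as long as the labeling was done properly, the resulting data was highly valuable to the model. Over time, the center of gravity for manual labeling shifted from the United States to third-world countries such as Venezuela and the Philippines in order to cut costs.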

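As models have improved, automated labeling has become more accurate, and models can now assist human labelers: a model can pre-process the data before handing it to people, or people can review and correct the labels an automated model produces. Compared with purely manual labeling, AI-assisted labeling is faster. The world's largest data labeling companies, Scale AI among them, are all working to reduce the share of human involvement in the labeling process.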
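The two assisted modes described above (model first, then a human labels or corrects) can be sketched as a simple routing step. This is only an illustration under stated assumptions: the `PreLabel` record, the `route` and `finalize` helpers, and the 0.9 confidence threshold are all hypothetical and do not describe how Scale AI or any other vendor actually works.

```python
from dataclasses import dataclass


@dataclass
class PreLabel:
    item_id: str
    label: str         # label proposed by the pre-labeling model
    confidence: float  # model's self-reported confidence in [0, 1] (assumed to exist)


def route(prelabels, threshold=0.9):
    """Split pre-labels into a light spot-check queue and a full human-review queue.

    The threshold is an assumption for illustration, not an industry standard.
    """
    spot_check, full_review = [], []
    for p in prelabels:
        (spot_check if p.confidence >= threshold else full_review).append(p)
    return spot_check, full_review


def finalize(prelabels, corrections):
    """Keep the model's label unless a human reviewer supplied a correction."""
    return {p.item_id: corrections.get(p.item_id, p.label) for p in prelabels}


batch = [
    PreLabel("img-1", "cat", 0.97),
    PreLabel("img-2", "dog", 0.62),
    PreLabel("img-3", "stroller", 0.91),
    PreLabel("img-4", "utility pole", 0.40),
]
spot_check, full_review = route(batch)
# A reviewer corrects one low-confidence item; everything else keeps the model's label.
final = finalize(batch, {"img-4": "traffic light"})
```

In this toy batch, half of the items still land in the full human-review queue, which is the sense in which assisted labeling shrinks the human share without eliminating it.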

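Pre-labeling already works well on computer vision data, but in the new era of language and large models it remains very immature and cannot fully replace human labor, for these reasons:

1. Accuracy is low, especially on complex tasks and edge cases.
2. Sample bias and model hallucination.
3. Some verticals need large datasets labeled by industry experts.
4. Pre-labeling scales poorly, particularly for low-resource languages and uncommon scenarios, where it is costly and low in quality and still needs dedicated human work.

In short, pre-labeling will not fully replace manual labeling in the near term; the two will coexist. The share of manual labeling may fall, but the workflow will still need reviewers to check the labels.

Figure: Data labeling workflow with pre-labeling

Data labeling is not a new industry; it began to emerge in 2017-2018 alongside the rise of intelligent driving. The figure below shows the projected market size for data labeling providers in China. Notably, the US data labeling market is roughly 3 to 5 times the size of China's.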

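Data labeling is a relatively fragmented market. It is not a field with extremely high technical barriers; rather, technology, labor, and organizational management each make up roughly a third of the barrier. Core competitiveness comes down to: 1. price; 2. quality; 3. expertise and breadth of knowledge coverage (diversity?); 4. speed.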

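Price is the obvious one, because everyone needs large volumes of cheap data. Price pressure drives a form of geographic arbitrage: completing one labeling task might cost 1 US dollar in wages in the developed United States, only 0.5 dollars in less developed China, and perhaps just 0.1 dollars in the Philippines. One solution in the market is therefore to win orders in first-world countries, recruit labelers in third-world countries, and fulfill the work through directly operated studios.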
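The arbitrage arithmetic can be worked through with the illustrative per-task wages quoted above. The sketch below treats those figures as rough examples rather than measured rates; the function names and the one-million-task volume are hypothetical.

```python
# Illustrative per-task wages from the text, in US cents (not measured data).
WAGE_CENTS_PER_TASK = {"United States": 100, "China": 50, "Philippines": 10}


def labor_cost_usd(region, num_tasks):
    """Total labeling wages in US dollars for num_tasks tasks in a region."""
    return WAGE_CENTS_PER_TASK[region] * num_tasks / 100


def saving_vs(baseline, region):
    """Fraction of wage cost saved by labeling in `region` instead of `baseline`."""
    base = WAGE_CENTS_PER_TASK[baseline]
    return (base - WAGE_CENTS_PER_TASK[region]) / base


for region in WAGE_CENTS_PER_TASK:
    cost = labor_cost_usd(region, 1_000_000)
    saving = saving_vs("United States", region)
    print(f"{region}: ${cost:,.0f} for 1M tasks, {saving:.0%} cheaper than the US")
```

On these numbers, a million tasks cost 1,000,000 dollars in US wages versus 100,000 dollars in the Philippines, a 90% reduction, which is why orders won in high-wage markets tend to be fulfilled in low-wage ones.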

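Data quality is also easy to understand: large models and intelligent driving both need high-quality data, and if the data fed to a model is poor, the model's performance suffers. One effective approach is to have a model pre-label the raw data, have people label it, and then keep iterating with reinforcement learning and human feedback to improve label quality. Alternatively, the team needs a very clear understanding of the downstream customer's labeling process so that it can write standard operating procedures (SOPs) for labelers to follow, which raises quality.

But what do expertise and breadth of knowledge coverage mean? Three examples:

1. For a general-purpose large model, this is a real challenge. Labeling for a text model may be relatively easy, but you have to find people who can label Chinese, English, French, German, Russian, Arabic, and other languages, and recruiting and managing that many distributed workers worldwide is a serious problem for a data labeling company.

2. Consider an AI application startup working on voice bots or digital humans. Startups usually lack the time, people, and money to build an in-house labeling team. They need an outsourced team that can label Chinese varieties such as Sichuan, Cantonese, Shanghai, and Northeastern accents, as well as English varieties such as North American, British, and Singaporean accents. Finding a good labeling studio that can handle all of this can be very difficult. With a direct-operation or subcontracting model, it can take one to two months to go from taking the order to finishing recruitment, which badly hurts delivery speed.

3. Consider an even narrower field: a startup focused on legal large models needs a great deal of legal data labeling. Law still demands considerable expertise, and the startup needs a labeling supplier that meets these conditions: (1) at least a dozen or so people who understand law, possibly covering the mainland Chinese, Hong Kong, and US legal systems; (2) the ability to read both Chinese and English; (3) a cost that is not too high. Lawyers are well paid and may be unwilling to do this kind of work. For now, the only workable solution in niches like this is to recruit student interns in-house to do the labeling. For vendors on a direct-operation or subcontracting model, serving such niches is still quite difficult.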

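The main players in the market therefore fall into three groups: 1. large companies that do the work in-house (for example, Baidu Crowdsourcing); 2. startups using a direct-operation or subcontracting model (analyzed below); 3. small and mid-sized data labeling studios.

Figure: Size of the data market within China's AI market

Before going deeper, let us look at the leading companies in the field today: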

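1. Scale AI: The US company Scale AI has four main lines of business: data labeling; management and evaluation (controlling the quality of labeled data and improving labeling efficiency); automation (assisted labeling to raise efficiency); and data synthesis (as models proliferate and real data runs short, synthetic data has to be generated to feed them; we will cover the synthetic data track separately later). Scale AI started out mainly in autonomous driving labeling. Two years ago, 80-90% of its orders came from autonomous driving (2D, 3D, lidar, and so on); that share has fallen in recent years. The company's order mix follows industry trends: government, e-commerce, robotics, and large models have all grown quickly in recent years, and because the team is quick to spot such trends, it has kept a high market share in each segment. Scale AI has also launched its own Model as a Service offering, for example helping customers fine-tune, host, and deploy models.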

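There are two pricing models:

- Consumption-based: for example, Scale Image starts at 2 cents per image and 6 cents per annotation; Scale Video starts at 13 cents per frame and 3 cents per annotation; Scale Text starts at 5 cents per task and 3 cents per annotation; Scale Document AI starts at 2 cents per task and 7 cents per annotation.

- Project-based: fees are set by contract according to data volume and other project terms. In practice most revenue is project-based, with contract sizes ranging from a few hundred thousand dollars to tens of millions of dollars.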
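To make the consumption-based list prices concrete, the sketch below estimates a bill from the starting rates quoted above. It assumes the per-unit and per-annotation charges simply add up, which the source does not confirm; the `Rate` type, the `estimate_usd` helper, and the example volumes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    unit: str                  # what one billable unit is
    cents_per_unit: int        # starting price per unit, in US cents
    cents_per_annotation: int  # starting price per annotation, in US cents


# Starting prices as quoted in the text; real quotes may differ.
STARTING_RATES = {
    "Scale Image": Rate("image", 2, 6),
    "Scale Video": Rate("frame", 13, 3),
    "Scale Text": Rate("task", 5, 3),
    "Scale Document AI": Rate("task", 2, 7),
}


def estimate_usd(product, units, annotations_per_unit):
    """Estimated floor price in US dollars, assuming the two charges are additive."""
    rate = STARTING_RATES[product]
    cents = units * (rate.cents_per_unit + annotations_per_unit * rate.cents_per_annotation)
    return cents / 100


image_job = estimate_usd("Scale Image", 10_000, 5)  # 10,000 images, 5 annotations each
video_job = estimate_usd("Scale Video", 1_000, 4)   # 1,000 frames, 4 annotations each
```

Under these assumptions the image job comes to 3,200 dollars and the video job to 250 dollars, which shows how far list-price jobs sit from project contracts worth hundreds of thousands of dollars or more.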

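Scale AI's estimated 2022 revenue was 290 million US dollars, and its current valuation is 7 billion US dollars, making it the largest data labeling company in the world. Its investor roster is also very impressive.

2. Haitian Ruisheng (海天瑞声): China's Haitian Ruisheng also plays an important role in data labeling. The company has extensive experience in data labeling, data cleaning, and data analysis. Details of its business model, pricing, and financing, however, are not yet clear.

3. Appen: Australia's Appen is another leading global data labeling company. Like Scale AI, Appen offers data labeling, speech data collection, translation, and related services. It has a large pool of labelers around the world and provides customers with high-quality labeling. Appen's detailed business model and financing also deserve a closer look.

These three companies hold important positions in the global data labeling field, representing the leaders from the United States, China, and Australia respectively. Before we dig into startup business models and market competition, understanding these leaders gives a fuller picture of the industry.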

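Haitian Ruisheng (海天瑞声) is listed on the A-share market, but it is not purely a data labeling company. Whereas Scale.AI builds its own teams and runs labeling directly, Haitian is essentially a technology services provider that outsources orders to various studios. Its scale in China rests on three things: 1. deep experience in speech labeling, covering more than 190 languages (70-80% of revenue); 2. economies of scale; 3. solid international capabilities. In China the data labeling industry is still wild and early-stage: highly fragmented and disorderly, with no industry standards or norms.

We can compare the business models of Appen and Haitian to see how the direct-operation and outsourcing models differ and how their margins have fared.
Figure: Direct-operation vs. outsourcing business models…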

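After all this groundwork, readers with good memories may recall that our title was about using blockchain to reshape data labeling. We have not mentioned blockchain at all so far, so how exactly does it reshape anything?

The AI of the future should be open and sovereign. Data, compute, and models should all offer society universal and open access while maintaining high quality and efficiency. Everyone who helps advance AI should own their contributions and output and receive a fair share of the benefits and rewards.

Quest Labs, a company we recently invested in, aims to redefine the relationship between AI and people in the new era, using AI and blockchain to disrupt the industry and resolve its existing pain points. Data services are the indispensable "shovel" upstream in the AI value chain, and they are the first problem Quest wants to solve: AI raises data production efficiency, while blockchain redefines the economic model and value capture of open datasets. The two reinforce each other to keep producing high-value data and to improve labelers' skills and understanding.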

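1. AI and human collaborative intelligence

- An intelligent human-in-the-loop, AI-centered infrastructure that enables and incentivizes human teams to interact smoothly with co-pilot models, delivering high-precision data and iteratively improving quality so that high-value data is generated throughout the lifecycle.
- A decentralized marketplace supported by the Humans Ops Tool, which maximizes the efficiency of decentralized workforce management and optimizes collaboration and communication across a global network of distributed teams.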

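2. Data openness, privacy, and ownership

- The platform uses paid cash flow and tokens to strongly incentivize user traffic and stickiness while continuously driving the data flywheel: it captures behavioral and historical data from both the supply and demand sides so that the two keep learning from each other. Algorithms recommend and shape data requirement frameworks to secure future commercial value (hard domain mining), covering a large number of vertical niche scenarios. All data labeling participants can start contributing datasets early; these are then used commercially again and again, earning contributors cash flow and token rewards. The end result is a valuable, open AI data network for the new era.
- Data encryption and privacy protection: techniques such as zero-knowledge proofs (ZK) and fully homomorphic encryption (FHE) are used to process and store user data in better-encrypted form.
- Blockchain technology is used to trace and verify participants' ownership of data, including different outputs such as collection and labeling and the value attached to each.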
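The idea of tracing who collected or labeled a piece of data can be illustrated with a minimal hash-linked contribution log. This is a generic sketch of the underlying technique, not Quest Labs' design, which the source does not describe; every name here (`append_contribution`, `verify`, the record fields) is hypothetical, and a real blockchain adds consensus, signatures, and distribution on top.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def record_hash(body):
    """Deterministic SHA-256 over a record's fields."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_contribution(log, contributor, action, data):
    """Append who did what to which data, linked to the previous entry's hash."""
    body = {
        "contributor": contributor,
        "action": action,  # e.g. "collect" or "label"
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry = dict(body, hash=record_hash(body))
    log.append(entry)
    return entry


def verify(log):
    """Return True only if every entry is intact and correctly chained."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or record_hash(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log = []
append_contribution(log, "alice", "collect", b"raw audio clip")
append_contribution(log, "bob", "label", b'{"transcript": "hello"}')
```

Because each entry commits to the one before it, rewriting an earlier attribution (for example, changing the collector's name) makes `verify(log)` fail, which is the property that makes contribution records auditable.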

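3. A new economic model

- A globally auto-matched AI data services platform (the Meituan of AI data services) moves the industry from a centrally planned economy to a market economy.
- Blockchain underpins reputation and credibility, and digital currency streamlines settlement, so the supply-side labor pool can scale without limit and be matched precisely; efficiency and quality come from having the right people do the right work. Because data labeling work overlaps with low-income populations, this creates jobs and, indirectly, delivers inclusive finance.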

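4. Tokens are awarded to users to incentivize continuous learning and high-quality service and output, and to encourage useful, high-quality feedback that improves the platform's models and raises the efficiency and capacity of the whole pipeline (human and AI mutual continuous learning).

Tokens are distributed according to POPW for fair benefit sharing and value capture, which lowers CAC and increases retention.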

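From a Web2 perspective, this is a dispatch platform for data labeling, somewhat like Didi or Meituan Waimai. From a Web3 perspective, it is Axie Infinity plus YGG with real cash flow. In the 2021 bull market, the combination of Axie and YGG brought a great many third-world users into Web3, and gaming guilds of this kind supported a great many third-world families during the pandemic, especially in the Philippines. The market rewarded Axie and YGG handsomely; they were very interesting sources of alpha. As investors bridging Web2 and Web3, we are very willing to back projects and teams that use blockchain technology to strengthen real businesses, and we look forward to seeing how this team performs. It is also one of the few directions we have seen where Web3 technology can give a Web2 business wings.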
