owned this note
owned this note
Published
Linked with GitHub
## 3. Open source security challenges
Security is an important factor in determining whether an open-source product can be successfully commercialized. Business users usually need to conduct a comprehensive security assessment of the products they use to ensure that the overall business is secure and controllable, which includes cyber-attack security, data security, and commercial license controllability.
According to Synopsys, by the end of 2022, 84% of repository contain at least one known open-source vulnerability, 48% contain high-risk vulnerabilities, and 34% of respondents also said they had experienced "an attack launched using a known vulnerability in open-source software in the past 12 months. Open-source security is an issue that requires a great deal of attention, and it greatly affects customer trust in open-source software, as well as whether the large open-source ecosystem can be stabilized in the future. Only by ensuring security, open-source software can go farther on the road to commercialization. <br>
<div style="margin-left: auto; margin-right: auto; width: 55%">
| ![image029](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_3/3-1.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 3.1 Open Source Codebase Vulnerabilities (Data Source:Synopsys)</center>
<br>
### 3.1 Open source software cybersecurity
#### 3.1.1 Open source software security vulnerabilities can be exploited with serious consequences
Open-source software plays a key role in driving technological innovation and facilitating knowledge sharing, but they are also inherently at risk of security vulnerabilities. The root causes of these security vulnerabilities usually lie in open-source code management and maintenance issues, such as programming errors, lack of continuous security reviews, and lagging application of updates and patches. Particularly where programs are not active enough or lack effective regulation, these vulnerabilities may go unrecognized or unfixed for long periods of time. Historically, several serious security incidents have occurred due to security vulnerabilities in open-source software, resulting in sensitive data breaches and financial losses.
In April 2014, a major security vulnerability in the widely used open-source component OpenSSL, known as Heartbleed, emerged. This vulnerability has existed since the May 2012 release and allows an attacker to obtain data containing certificate private keys, usernames, passwords, email addresses, and other sensitive information. Because this vulnerability went undetected for nearly two years, its impact was extremely widespread and almost impossible to accurately measure. Again, in December 2021, another widely used open-source component, Apache Log4j2, was found to have a serious remote code execution vulnerability called Log4Shell. This vulnerability quickly spread globally due to the high performance and low exploitation barrier of Apache Log4j2, affecting a number of well-known companies and service platforms, including Steam, Twitter, Amazon, and others.
#### 3.1.2 The relative prevalence of open source software cybersecurity issues
**Open source software is inherently more vulnerable**
According to the results of "2022 [QiAnXin QAX](https://en.qianxin.com/) Open-source Project Inspection Program", the overall defect density of open-source software is 21.06/thousand lines, and the density of high-risk defects is 1.29/thousand lines. The number of defect densities and high-risk defect densities has been increasing for three consecutive years, with an accelerating trend. The overall detection rate of the ten categories of typical defects in open-source software was 72.3%, while this figure was only 56.3% two years ago. There is a rapid increase in the detection rate of open-source software, suggesting the security issue of the software itself is quite serious.
<br>
<div style="margin-left: auto; margin-right: auto; width: 75%">
| ![image030](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_3/3-2.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 3.2 Three-Year Comparison of Average Defect Density of Open Source Software </center>
<center>(Source:2023 China Software Supply Chain Security Analysis Report)</center>
<br>
In terms of the absolute number of open-source software flaws and vulnerabilities, according to data from [QiAnXin (QAX)](https://en.qianxin.com/), by the end of 2022, 57,610 vulnerabilities related to open-source software will be included in the public vulnerability database, and 7,682 new vulnerabilities will be added in 2022, an incremental increase of about 15%, which is a worrisome situation.
:::info Expert Review
**Yu Jie**:The security of open-source software urgently needs to be given sufficient attention, and it is clear that the strength of individual communities alone is not enough to deal with it. How to build an effective systems and regimes to comprehensively protect the security of open-source software has become a major issue that cannot be avoided with its rapid development.
:::
**Open-source projects with too low or too high levels of activity are more likely to have security risks**
Open-source software that is too inactive and updated too infrequently will result in vulnerabilities not being fixed in a timely manner, thus increasing the risk exposure of the software; if it is too active and updated too quickly, it will also result in users not being able to update accordingly in a timely manner, which puts more pressure on security operations and maintenance.
According to the data of QAX, if the open-source projects that have not been updated for more than a year are regarded as inactive projects, the number of inactive open-source projects in the mainstream open-source software package system will be 3,967,204 in 2022, accounting for 72.1%, while this ratio was 69.9% and 61.6% in 2021 and 2020, respectively, which indicates that the overall motivation of the open-source authors to maintain the software has decreased, which is not favorable to the long-term development of the security of the open-source software ecosystem.
<br>
<div style="margin-left: auto; margin-right: auto; width: 90%">
| ![image031](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_3/3-3.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 3.3 Statistics of Inactive Open Source Projects </center>
<br>
Against the backdrop of generally low activity, there are also some open-source software that are overly active, again putting a lot of security O&M pressure on users. According to QAX, there will be 22,403 open-source projects with more than 100 versions in the mainstream open-source package ecosystem in 2022, compared to 19,265 and 13,411 in 2021 and 2020, respectively.
<br>
<div style="margin-left: auto; margin-right: auto; width: 90%">
| ![image032](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_3/3-4.png)|
| -------- |
</div>
<center> Figure 3.4 Extremely Active Program Statistics </center>
<br>
Too little or too much activity poses a high security risk to users of the open-source ecosystem, and a balance is urgently needed to ensure the healthy and sustainable development of open-source software. A more scientific version management and release mechanism is needed to ensure that updates respond to security and functionality needs in a timely manner without disturbing users too frequently. For projects with insufficient activity, their activity can be enhanced by increasing community participation and providing incentives. For projects with frequent updates, more attention should be paid to communicating with users, providing clear update logs and support guidelines to help users better understand and adapt to these changes.
At the same time, users should also be encouraged to actively participate in the feedback and contribution of the open-source project to form a positive interaction. Users' actual experience and feedback are important references for adjusting the update pace and optimizing software functions. By establishing a healthy user-developer interaction mechanism, we can effectively balance the activity and update frequency to ensure the safety and usability of the software.
**Some users are using software that is outdated or with version usage being disorganized**
According to QAX , many software projects use very outdated versions of open-source software, even versions released 30 years ago, with many vulnerabilities and very high risk exposure. One of the earliest software is IJG JPEG 6 released in 1995, which is still used by many projects. Older versions often come with older vulnerabilities, and there are still very old open-source vulnerabilities in some software projects. The oldest vulnerability is from 2002, 21 years ago, and is still used by 11 projects.
<br>
<div style="margin-left: auto; margin-right: auto; width: 90%">
| ![image033](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_3/3-5.png) |
|-------------------------------------------------------|
</div>
<center> Figure 3.5 Aged Open Source Vulnerabilities and Their Usage </center>
<br>
There is a lot of confusion over the use of versions of open-source software, not all of which are up-to-date. For example, there are 181 versions of Spring Framework in use. The use of earlier versions can lead to a large number of vulnerabilities that have been fixed in newer versions can still be exploited maliciously, thus posing a significant security risk.
#### 3.1.3 Strategies for dealing with vulnerability risks in open source software
**Regular security audits and code checks**
A clear audit process needs to be defined that includes a comprehensive review of the overall architecture, codebase, and dependencies of the software. These audits can be performed by assembling specialized security teams or utilizing third-party security services. These teams or service providers should have an in-depth understanding of open-source software.
Regular code review meetings are also held to encourage team members to review each other's code, which not only helps identify potential security issues, but also improves the team's programming skills and code quality. Audits and code review should be an continuous process, constantly monitoring and updating the code base in response to newly discovered vulnerabilities and security threats.
**Using the SCA (Software Component Analysis) tool**
Software Component Analysis (SCA) is a methodology for managing the security of open-source components, enabling development teams to quickly track and analyze the open-source components used in their projects. SCA tools identify all relevant components and supporting libraries, as well as direct and indirect dependencies between them. In addition, they can check software licenses, identify deprecated dependencies, and discover potential vulnerabilities and threats. A SCA scan produces a software bill of materials (SBOM) that contains a complete list of the project's software assets.
With the widespread use of open-source components in software development, SCA is emerging as a key component of application security, although the concept itself is not new. The number of SCA tools has grown with its importance. In modern software development practices, including DevSecOps, SCA not only needs to provide ease of use for developers, but also needs to guide and direct developers safely throughout the software development lifecycle (SDLC).
When using SCA for open-source security, the following points should be considered:
- Adopt developer-friendly SCA tools: Developers are often busy writing and optimizing code, and they need tools that promote efficient thinking and rapid iteration. Unfriendly SCA tools can slow down the development process. An easy-to-use SCA tool simplifies setup and operation. Such tools should integrate easily with existing development workflows and tools, and should be implemented early in the software development life cycle (SDLC). It is important that developers understand the importance of SCA and incorporate its security checking process into their daily work to minimize code rewrites due to security issues.
- Integrate SCA into the CI/CD process: Using SCA tools does not mean that they will interfere with the development, testing, and production processes. Instead, organizations should integrate SCA scanning into Continuous Integration/Continuous Deployment (CI/CD) processes so that vulnerabilities can be identified and remediated as a functional part of the software development and build process. This approach also helps developers make code security part of their daily workflow.
- Effective Use of Reports and Software Bills of Materials: Many organizations, including the U.S. Federal Government, require a software bill of materials (SBOM) when purchasing software. Providing a detailed SBOM means that organizations recognize the importance of keeping track of every component within an application. Clear security scanning and remediation reports are also critical, as they provide detailed information about an organization's security practices and the number of vulnerabilities remediated, demonstrating a commitment to and actual action on software security.
**Enhancing education and training**
Conduct regular security awareness training for developers to increase their knowledge of security threats and best security practices, including educating them on identifying common security vulnerabilities and attack tactics. Use hands-on simulation exercises and workshops to allow developers to learn how to handle security incidents in a secure environment. These exercises can include vulnerability mining, code remediation, and security testing.
Given the rapid changes in the security landscape, encourage developers to continuously learn and update their knowledge, including by participating in online courses, seminars and industry conferences. Create a platform, such as an internal forum or regular meetings, for developers to share their knowledge and experience in security to foster learning and collaboration among teams.
### 3.2 Controllable open source licences
#### 3.2.1 Open source licenses are a constraint on users of open source resources, with a wide range of categories
An open-source license is a binding for open-source resources (including, but not limited to, software, code, and web users). Based on the open-source license, the user gets the right to use, modify and share the open-source resources. If the software is not licensed, it means that the copyright is retained and the user can only view the source code and not use it. Therefore, an open-source license is essentially a legal permit that protects project contributors and users of open-source resources, ensures that contributors can open-source the resources they own in the way they want to, and also ensures that users can use the resources in a reasonable and legal way to avoid being caught in intellectual property disputes, which greatly contributes to the prosperity of the open-source community.
Open-source licenses are divided into three overall categories based on how restrictive the license is:Permissive, Weak Copyleft, Strong Copyleft <br>
<div style="margin-left: auto; margin-right: auto; width: 90%">
| ![image034](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_3/3-6.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 3.6 Open Source License Classification </center>
<br>
**The Permissive category** is the most flexible category of licenses, including BSD, MIT, Apache, ISC, etc., which provide extremely permissive licensing conditions that allow people to freely use, modify, copy, and distribute the software. They equally support the use of software for commercial or non-commercial purposes.The only requirement is that the appropriate license text and copyright information be included in each copy of the software.
**The Weak Copyleft category** is a more restrictive license than the Permissive category, including LGPL, MPL, etc., which requires that any changes made to the code be released under the same license. Also, the modified code must contain the license and copyright information of the original code. However, they do not mandate that the entire project be released under the same license.
**The Strong Copyleft category** is an even more restrictive type of license, including GPL, AGPL, CPL, etc. This type of license states that the entire project must be released under the same license, including those cases where only a portion of the software is used. In addition, these licenses require that all modified versions of the code be publicly released.
Under these broad categories, specific licenses and license families will have unique restrictions, permissions, and specific differences in additional parameters, and the overall logical relationship of licenses is organized as follows: <br>
<div style="margin-left: auto; margin-right: auto; width: 80%">
| ![image035](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_3/3-7.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 3.7 License Logic Relationships </center>
<br>
Kaiyuanshe provides an open-source license filter, which provides good help to understand the best license options faster and better, and is highly recommended for readers who need it:https://kaiyuanshe.cn/tool/license-filter
#### 3.2.2 Risk of infringement by using open source resources without complying with the license
**Open source license infringement**
"Open-source license infringement" is the use of open-source software without complying with the terms and conditions of the open-source license associated with the software, thereby violating the legal constraints imposed by the license. Such behavior can lead to a host of legal and ethical problems. While open-source software is freely available to the public for use and modification, such use and modification is still subject to certain limitations, which are specified by the corresponding open-source license.
Specific instances include, but are not limited to, the following:
Ignoring Copyright Notices and Attribution:Many open-source licenses require that original copyright notices and author attributions be retained when copying, distributing, or modifying software. Ignoring this requirement, such as removing the original author's copyright information or failing to properly attribute the work, is considered an infringement.
Non-availability of Source Code:Some licenses, such as the GPL (General Public License), require that the source code be made available along with the distribution of the software. If a piece of software based on such a license is distributed without the source code being made available at the same time, this also constitutes infringement.
Restrictive Use:Some licenses have restrictions on the scenarios in which the software can be used. For example, certain licenses may prohibit the use of the software in certain types of business activities. Violation of these restrictive covenants is also a tort.
Violating Conditions for Distribution and Re-licensing:Copyleft open-source licenses such as the GPL requires that any modifications and derivative works based on GPL-licensed software must also be released under the GPL license. Violations of this rule, such as privatizing GPL code or distributing derivative works under non-GPL licenses, constitute copyright infringement.
Violation of Specific Terms:In addition to the common scenarios described above, there are specific license terms that may be violated under certain circumstances. This depends on the specific requirements of the particular license.
**License Reciprocity Requirement Leads to Expanded Scope of Open Source Copyright Problems**
The so-called "reciprocity requirement" of an open-source license, i.e., whether a derivative work follows the license of the original work, refers to the fact that the terms and conditions of an open-source license tend to continue to apply during the process of open sourcing the software, which includes copying, modifying, manipulating, redistributing, and displaying. The permissions and limitations of such licenses can extend vertically to derivative works and modified versions based on the original software development, and even horizontally affect other parts of the software developed based on such open-source software.
Of the many open-source licenses, the GPL has the strongest reciprocity requirements and the most lawsuits associated with it. The main reason for this is:Any derivative software based on GPL code modifications needs to be open source. If a piece of software contains GPL code, even if it is only a portion, the software as a whole is usually required to be open-source (unless it meets the terms of a specific exception). Failure to open-source portions of proprietary software affected by the GPL may result in infringement by the user in violation of the obligations of the GPL license. Moreover, the GPL is extremely complex, containing 17 terms. It has more stringent requirements for users, and once these requirements are violated, the user's license agreement is terminated and continued use of GPL-licensed open-source software may constitute copyright infringement. <br>
<div style="margin-left: auto; margin-right: auto; width: 90%">
| ![image036](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_3/3-8.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 3.8 GPL License Related Litigation </center>
<br>
**Infringement of open source licenses may lead to serious consequences**
Once an open-source license is characterized as an infringement, the loss to the defendant company or individual is far more than just compensation payment, but also includes a series of issues such as reputation and partnership:
Lawsuits and Fines:In 2017, Versata Software sued Ameriprise Financial for violating Versata's patents. While this is not a pure case of open-source license infringement, it involves software licensing and copyright issues. The case eventually ended in a settlement, but the legal fees and time costs involved were prohibitive.
Enforcing Compliance with License Requirements:A famous case is the 2015 VMware vs. Hellwig case. Hellwig, a Linux kernel developer, accused VMware of using GPL-based Linux code in its ESXi products without following the open-source requirements of the GPL license. Although the court did not ultimately rule in Hellwig's favor, the case sparked a broader discussion about GPL license obligations and derivative works.
Reputational Damage:Red Hat filed a lawsuit against Speakeasy, Inc. in 2004 for allegedly failing to comply with the requirements of the GPL license. Despite the settlement of the case, Speakeasy's reputation has suffered, especially in the open-source community.
Business Impact:Cisco was sued by the Free Software Foundation (FSF) in 2008 for violating the GPL license for its Linksys products. Cisco ultimately agreed to comply with the GPL license and pay an undisclosed amount as a donation. The lawsuit led Cisco to reconsider its open-source strategy for its products.
Partnership Damage:a company is found to be in violation of an open-source license, its business partners may reevaluate their relationship with the company, especially if the collaborative project involves open-source software.
#### 3.2.3 Open source large model licenses are largely distinct from traditional licenses
As open-source LLMs are still evolving and iterating, two highly influential open-source LLMs of the year:Llama2 and Falcon, have both been questioned as to whether or not they are truly "open source" due to tweaks to the terms of their open-source licenses. Both do not use commercially available licenses, but rather their own "LLAMA 2 COMMUNITY LICENSE AGREEMENT" and "TII Falcon LLM License", respectively; and both impose additional restrictions on their commercial use. Both have additional restrictions on their commercial use.
**Difference in open source licenses for LLaMA2**
Much of the discussion of Llama2's violation of open-source guidelines comes from its more unique terms:
- The Llama2 open-source model may not be used in products or service platforms with monthly active MAUs greater than 700 million, unless approved and licensed by Meta;
- The Llama2 open-source model may not be used in any manner that violates applicable laws or regulations, including trade compliance laws. Also not applicable to use in languages other than English;
- Other LLMs (not including Llama2 or its derivatives)
The Open Source Initiative (OSI) has published ten definitions of open source, which are currently recognized internationally, and the Llama2 protocol conflicts with two of them
- Non-Discrimination Against Individuals or Groups:The Llama License prevents enterprise users with more than 700 million monthly users from obtaining licenses directly through this License.
- Non-Discrimination Against Fields:The license shall not restrict anyone from using the program in a particular field. The Llama License prohibits the use of Llama2 outputs to improve other AI LLMs, which would be a restriction on the domain of use. Llama2's language restrictions also lead to limitations in the use of Chinese language domains.
**Difference in open source licenses for Falcon**
The TII Falcon LLM License makes some key changes from the Apache License. The Apache License is a popular open-source license that is friendly to commercial use and allows users to distribute or sell their modified code as an open-source or commercial product after meeting certain conditions.
Falcon's license is similar to the Apache License in that it also provides broad permissions to use, modify, and distribute the licensed work, and requires that the license text be included in the distribution and properly attributed, in addition to a disclaimer of limitations of liability and warranties.
However, the TII Falcon LLM License introduces additional commercial use terms that require commercial applications to pay a 10% license fee on annual revenues in excess of $1 million. It also places additional restrictions on the manner in which the work may be published or distributed, such as emphasizing the need for attribution to "Falcon LLM technology from the Technology Innovation Institute."
**The purpose of open-source for LLMs of open-source is different from that of traditional open-source software**
In the case of Llama2, for example, the license is essentially a guiding framework for organizations that intend to develop and deploy AI systems while adhering to Meta's established specifications and standards. The purpose of this framework is to ensure that these organizations meet specific rules and standards set by Meta when developing and deploying AI technologies. Such an approach helps Meta manage the scope and manner in which its AI technology is applied, thereby safeguarding its business interests and brand image.
The Llama2 license may constitute a compliance requirement that must be adhered to for those who plan to conduct AI development on the Meta platform. This means that these organizations must follow Meta's specific specifications and requirements when using Meta-provided tools and resources to develop and deploy AI models. In doing so, these companies may need to apply to Meta for the appropriate licenses, of which the Llama2 license is a part.
#### 3.2.4 Means of securing controllable licenses
**Document the use of open source components**
When the enterprise or individual user's software reaches a certain size, the burden of managing the included open-source components becomes heavier, which leads to infringement problems due to the inability to manage them in a timely manner. According to Synopsys, 89% of the codebase contains open-source code that has been out of date for at least four years, and 88% of the codebase contains components that have been inactive for the past 2 years and contain components that are not the latest version. In many cases, developers may have completely forgotten which open-source components have been used and are unable to react in a timely manner when licenses for those open-source components are updated, leading to infringement issues. Therefore, it becomes very necessary to manage open-source components in a reasonable way.
Developers can manually or automatically maintain a detailed dependency list of all used open-source components and their version information in the project's documentation. For example, in many programming languages, dependencies can be tracked using files such as requirements.txt (Python), package.json (Node.js), and so on.
Create an internal document or knowledge base that records all relevant information about the open-source components used, including their origin, license information, and how they are used, and regularly check their licenses for updates. Track in detail in the documentation which open-source components are used, and add comments in the corresponding places in the code to indicate this. Add the corresponding license website to the document to check it regularly and find out the changes of the license terms in time. Also document in your programming how you have complied with valid license conditions.
For larger volumes of development work, manually recorded text may not be able to meet the project requirements, at this time you can use related tools, such as code component analysis (SCA) software. These tools automatically identify and document the open-source components used in a project. They are usually able to provide detailed reports that include component license information, versions, and possible security vulnerabilities.
**Cautious use of supplementary coding tools**
Intelligent programming assistants such as ChatGPT and GitHub Copilot provide programming advice and code snippets by analyzing a large number of codebases and documentation. While these tools are extremely valuable in improving programming efficiency, there are several key points to consider when using the code they generate to avoid potential open-source license infringement issues:
- License Issues with Source Code:Assistive programming software may generate suggestions based on code in its training datasets. These training datasets may contain code from different open-source projects that may have various license requirements. Usually supplementary programming results do not index the corresponding licenses, and copyright issues may be involved if the generated code snippets are too close to the original code and are copied directly by the user.
- Attribution of Responsibility:When using code generated by an intelligent programming assistant, it needs to be clear that the ultimate responsibility lies with the user. This means that the developer is responsible for the legality and suitability of the generated code. As a result, developers conduct regular code reviews, especially for sections generated using assisted programming, to ensure that they do not violate the terms of any open-source license.
**Adequate code audits during mergers and acquisitions**
An adequate code audit during the M&A process is essential, especially to avoid infringement issues involving open-source licenses. M&A activities usually involve a thorough evaluation of the target company's assets, of which technology assets, especially software assets, occupy an important place. The following issues need to be highlighted in M&A audits:
- Identifying Open-source Components:An important task of a code audit is to identify all open-source components used in the target company's products. This includes open-source libraries and frameworks that are used directly, as well as open-source software that is indirectly relied upon. Understanding these components and their versions is critical to assessing the associated license requirements.
- Reviewing License Compliance:After confirming an open-source component, its corresponding license needs to be reviewed. This includes determining the types, limitations and obligations of these licenses. In particular, note that some licenses may have specific restrictions on commercial use or require disclosure of modified source code.
- Assessing Risks and Responsibilities:During the audit, the legal and financial risks that may arise from non-compliance with open-source licenses should be assessed. This includes potential infringement lawsuits, fines, or the need to refactor parts of the product that rely on specific open-source components.
- Post-Integration Compliance Strategies:After an M&A is completed, there needs to be a clear plan for integrating the target company's codebase and ensuring continued compliance with all relevant open-source license requirements. This may involve implementing new code management and compliance monitoring processes throughout the organization.
- Professional Legal Advice:Because open-source licenses can be very complex, obtaining professional legal advice is critical. A professional attorney can help correctly interpret the terms of the license and provide advice on how to handle potential license conflicts.
### 3.3 Open Source AI Security
With the popularity of LLMs, in addition to the LLM license issues mentioned above, more AI safety and control issues have gradually entered people's view. Since the technology is relatively new and there is no clear definition and specification, this paragraph lists the topics of greater concern to the relevant practitioners at the moment based on desk research, in the hope of triggering readers' thinking, and welcomes discussion and feedback.
#### 3.3.1 Open Source AI Poses New Requirements for Data Security
Unlike traditional data security, since a large part of the output results of AI LLMs depends on the training dataset, issues such as the quality of the dataset and whether the dataset contains malicious data are particularly important for AI LLMs, especially open-source LLMs, because many of the datasets of the open-source LLMs provide data internally by the enterprise, and the cleansing, monitoring, and compliance can't be done as professionally as those of the professional closed-source LLM vendors.
**Improper handling of the training dataset triggers a range of biases**
Data bias occurs when certain elements in a data set are overemphasized or underrepresented. When training AI or machine learning models based on such biased data, it can lead to biased, unfair and inaccurate results.
- **Selective Bias**:Some facial recognition systems, trained primarily on white images, have relatively low accuracy in recognizing faces of different races;
- **Exclusionary Bias**:This bias usually occurs at the data preprocessing stage, and if the data is based on stereotypes or false assumptions, then the results will be biased regardless of which algorithm is used;
- **Observer Bias**:Researchers may consciously or unconsciously bring their personal views into a research project, which can influence the results;
- **Racial Bias**:Racial bias occurs when a dataset is biased toward a particular group;
- **Measurement Bias**:This bias occurs when the data used for training does not match the data in the real world, or when incorrect measurements distort the data.
These biases, when used maliciously, can lead to outputs that are significantly politically or racially biased, or data errors that can significantly affect the performance and credibility of the larger model.
**Training data sources should be taken into account when choosing a LLM of an open-source base**
Many of the LLM training data sources are obtained directly from the Internet via crawler tools, where discriminatory, hateful and offensive speech and information is prevalent. In practice, people read, comment, like and spread negative messages far more than positive ones. As a result, human-generated information sources have long been in a more chaotic and unhealthy state. LLMs in this environment may contribute to the spread of racial discrimination and disinformation by being influenced by such data.
Once the data source at the base of the LLM is contaminated, even if the enterprise itself is fine-tuned to use a perfect data source, it can lead to significant bias in the final output. Therefore, when choosing a LLM for the base, users should not only consider the performance of the LLM, but should also take the source of the training data into consideration. The focus should be on LLMs that select annotated datasets from multiple sources in a responsible manner, while considering bias minimization as a factor to focus on throughout the model building process and even after deployment.
#### 3.3.2 The extensive use of open-source AI LLMs raises ethical considerations for society
**The problem of LLM hallucinations can lead to serious consequences**
There is an unresolved problem with current LLMs - hallucinations. According to the Sail Lab at HIT (Harbin Institute of Technology), hallucination refers to "text generation tasks in which unfaithful or meaningless text is sometimes produced. "While hallucinatory texts are unfaithful and meaningless, they are often so readable due to the powerful context generation capabilities of the LLM that the reader is led to believe that they are based on the provided context, even though it is actually very difficult to find or verify that such a context actually exists. This phenomenon is similar to mental hallucinations that are difficult to distinguish from other "real" perceptions, and it is also difficult to capture hallucinatory texts at a glance.
There are many types of illusions and they are still emerging as the use of LLMs expands. The main types of common hallucinations are the following:
- **Logic Errors**:The LLM makes logical errors in its reasoning, which results in outputs that seem reasonable but don't stand up to scrutiny;
- **Fabricated Facts**:The database of the LLM itself does not support its answer to this question, but since the LLM cannot define its own boundaries, it will confidently assert facts that simply do not exist;
- **Data-Driven Bias**:As mentioned in the previous section, due to the prevalence of certain data, the output of the model may be biased in certain directions, leading to erroneous results.
False outputs due to LLM hallucinations may cause harm to some users who are convinced by them. On May 16, 2023, the World Health Organization issued a statement of caution on the use of AI LLM tools. They noted that while these tools facilitate access to health information and may enhance the efficiency of diagnosis, particularly in resource-poor areas, their use requires a rigorous assessment of potential risks. The World Health Organization further emphasized that rushing into the use of inadequately tested systems could lead to mistakes by healthcare professionals, harm to patients and reduced trust in AI technologies, which could undermine or delay the potential long-term benefits and applications of such technologies globally.
<br>
<div style="margin-left: auto; margin-right: auto; width: 75%">
| ![image037](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_3/3-9.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 3.9 Classification of hallucinations by Harbin Institute of Technology </center>
<br>
Since there is not yet a clear accountability entity for LLMs, and even more so for open-source LLMs, in the event of serious consequences, it will be very difficult for users who have suffered losses to defend their rights and their losses to be mitigated. Currently there are 2 pressing issues to be addressed in this regard:
- How LLMs hallucinations can be better addressed - technical aspects
- How to define more clearly who is responsible for LLMs - legal aspects
**Outputs from LLMs may output content that violates ethical laws**
At present, some LLMs lack content filtering mechanisms, resulting in output content that violates domestic laws and regulations, public order and morals, mainly containing the following situations:
- Copyright Issues:LLMs may generate content that contains or resembles copyrighted material. For example, the model may create text that is similar to pre-existing literary works, song lyrics, movie scripts, and so on.Such a generation may violate the rights of the original author or copyright holder, leading to legal disputes;
- Territorial legislation:Different countries and regions have their own unique legal systems. For example, certain countries have stricter censorship of Internet content, such as explicit bans on politically sensitive content, religious messages or specific expressions on gender issues. When the LLM runs in these regions, the generated content must comply with local laws. For example, when someone asked an LLM "how to cook wild giant salamander", the model answered "braise it" and even provided detailed steps. Such answers may mislead the questioner. As a matter of fact, wild giant salamander are Class II protected animals and should not be captured, killed or eaten.
- Defamation and Misinformation:If model-generated content contains false accusations or defamatory statements about individuals or organizations, legal action may result. This places high demands on ensuring the accuracy and legitimacy of the content.
In order to ensure compliance with various legal requirements, organizations using LLMs may need to put in place regulatory mechanisms, such as auditing generated content to ensure that it does not violate any legal requirements. Especially for open-source models used by enterprises, they are relatively more leniently scrutinized for content output, and enterprises need to pay extra attention to related issues to prevent getting into legal disputes and incurring losses. Here again, it can be summarized in 2 questions:
- How to Enhance Information Filtering Mechanisms for LLMs - Technical Aspects
- How to define whether LLM output content is infringing and illegal - legal aspects
**LLMs may exacerbate social divide**
The Secretary General of the Digital Economy Committee of the Beijing Computer Society has said:The potential security issues of LLMs are of particular concern for those who lack critical thinking and analytical skills, and who are not well-informed about paid knowledge and healthcare services. With the dramatic increase in the number of Internet users and the widespread use of mobile devices, such as cell phones, low-education and low-income populations are increasingly relying on these avenues for medical, educational, and daily life advice. However, large-scale generative language models may exacerbate discriminatory portrayals and social biases against these marginalized groups, deepen social divisions, increase the harm of misleading, malicious information, and raise the risk of disclosure and misuse of individuals' real information.
The use of LLMs is like a double-edged sword; on the one hand, it can reintegrate network resources and improve the efficiency of information collection; on the other hand, it may exacerbate information barriers due to problems such as hallucinations and lead to the misinformation of many populations with scarce information sources. There are 2 issues that need to be addressed at this point:
- Enhancing public education that LLMs are not a panacea and need to be viewed with caution - Social communication aspect
- How to ensure the quality of LLM training datasets and reduce their bias - technical aspect
## 4 Capital market situation for open source projects
### 4.1 The status of global markets
#### 4.1.1 Global VC Investment Declines in 2023, but AIGC is in the Spotlight
Since 2023, volatility in global financial markets has increased due to growing interest rates, challenging economic conditions, geopolitical conflicts, and concerns about the stability of the international financial system, which has led to a bleak picture for the global VC capital markets. According to KPMG, global venture capital activity has declined for seven consecutive quarters through Q3 2023 (see Figure 4.1). <br>
<div style="margin-left: auto; margin-right: auto; width: 75%">
| ![image038](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_4/4-1.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 4.1 Global Venture Capital Activity (Source:KPMG)</center>
<br>
Against the backdrop of a declining equity market, fund managers have generally reduced their allocations to private equity assets to maintain portfolio proportions; at the same time, due to the high volatility of venture capital and the uncertainty of the future global economic situation, the scale of venture capital fundraising in 2023 will drop significantly compared with that of previous years. Compared to an average of more than $250 billion annually over the past five years (2018-2022), venture capital commitments as of 2023Q3 amounted to just $116 billion (according to KPMG). Overlaying the trend of seven consecutive quarters of declining venture capital activity, fundraising will shrink significantly in 2023Q4 and for the full year. <br>
<div style="margin-left: auto; margin-right: auto; width: 75%">
| ![image039](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_4/4-2.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 4.2 Global Venture Capital Fundraising Scale (Source:KPMG)</center>
<br>
At the valuation level, investor caution is also growing. Compared to 2021 and 2022, the proportion of premium financing has decreased by about 10%, and the proportion of par and discount financing has risen by about 5%, which creates an obstacle to the exit of early-stage capital. <br>
<div style="margin-left: auto; margin-right: auto; width: 75%">
| ![image040](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_4/4-3.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 4.3 Global VC Premium, Parity, and Decline Investment Ratios (Source:KPMG)</center>
<br>
However, against the backdrop of an overall bleak environment, AIGC-related financings have been in the global spotlight, with a significant increase in the size of related financings. In North America, the largest number of AI-related companies will be unicorns in 2023, including AI agent startup Imbue, AI + biotech company TrueBinding, generative AI company Runway, and natural language processing company Cohere; in Europe, despite the overall slowdown in funding, AI companies have been particularly strong, with a large number of startups receiving funding, such as French AI platform company Poolside; in Asia, investor interest in AI is also rising, but national regulators are also increasing the regulation of generative AI. In Europe, despite the overall funding slowdown, AI companies are doing particularly well, with a large number of startups receiving funding, such as French AI platform company Poolside; and while investor interest in AI in Asia continues to grow, so too does regulatory oversight of generative AI by national regulators.
It is expected that along with the rapid iteration of AI technology, the concepts of LLM and AI Agent continue to be hot, the investment and financing related to the AI field will be less affected by the contraction of the scale of global venture capital investment.
#### 4.1.2 Global Open Source Financing
The growth of commercial open-source companies has been remarkable in recent years, with the combined market capitalization of these companies growing rapidly from $10 billion to surpass the $500 billion mark. This significant growth not only demonstrates the huge potential of open-source technology in the commercial sector, but also reflects the high level of investor recognition and trust in the open-source model. According to OSS Capital, the market capitalization of commercial open-source companies is expected to reach a staggering $3 trillion in the future.
The open-source business sector has shown solid growth over the past four years. Over 400 startups raised approximately 700 rounds totaling $29 billion during this period.Specifically, annual financing increases from $270 million in 2020 to $12.5 billion in 2023, a compound annual growth rate of 255%.
Although the size of the financing showed a downward trend in 2022, this trend was mitigated in 2023. Beginning in February 2023, financing begins to pick up gradually. In the first 11 months of 2023, total funding has already surpassed the amount raised in all of 2022. However, volatility in the scale of financing increased throughout the year, influenced by geopolitical conflicts and the post-epidemic economic recovery. Financing peaked at around $2 billion or so in March, May and September, and was below average in June and August.
Even in the lowest funding month of 2023, $386 million in monthly funding exceeded the highest monthly funding in 2021 and even surpassed the total funding for all of 2020 ($272 million). This trend reflects the capital market's continued interest in and recognition of open-source business. This apparent trend of growth in funding shows the growing interest and confidence of the capital markets in open-source business. Investors value not only the innovative potential and technological advantages of open-source models, but also their sustainability in the marketplace and long-term growth potential.
<br>
<div style="margin-left: auto; margin-right: auto; width: 75%">
| ![image041](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_4/4-4.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 4.4 Amount of Global VC Funds Invested in Commercialized Open-source Software Companies (Source:OSS Capital)</center>
<br>
Analyzing from the perspective of financing scale of each round, the capital prefers medium-term financing such as B, C, D, and so on. This reflects the characteristics of commercial open-source companies:In the early stage, the technical details are still unclear, and the business model is not clear; however, when they gradually cross the start-up stage, commercial open-source companies will explode with stronger growth momentum, attracting more capital; in the later stage when the business model is gradually matured and the open-source product becomes well-known and generates stable cash flow, the need for financing will be reduced. <br>
<div style="margin-left: auto; margin-right: auto; width: 75%">
| ![image042](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_4/4-5.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 4.5 Distribution of Financing Rounds for Commercialized Open Source Software Companies ($M) (Source:OSS Capital)</center>
<br>
A total of 328 commercial open-source companies have received more than $10 million in funding over the past four years. Of these, the main concentration was in the US$10-50 million range, with a total of 210 rounds, or 64% of all rounds, in the US$10-20 million and US$20-50 million ranges. There were 49 rounds of $50-100 million and 46 rounds of $100-200 million, accounting for 29% of all rounds. A total of 23 companies received more than $200 million in funding, with two of them even receiving more than $500 million in a single round. <br>
<div style="margin-left: auto; margin-right: auto; width: 75%">
| ![image043](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_4/4-6.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 4.6 Distribution of Financing Rounds for Commercialized Open-source Software Companies ($M) (Source:OSS Capital)</center>
<br>
### 4.2 The status of China market
#### 4.2.1 Overview of the development of China's equity capital market
**The number and size of newly established funds declined, but the overall trend is gradually improving**
In the first half of 2023, 3,930 new funds were launched in the (PE/VC) market, down 12% from 4,456 new funds launched in the same period last year. During this period, new fund launches totaled $364.2 billion, a decrease of 3% year-over-year. Despite the decline in size and volume compared to last year, the second quarter performed better than the first quarter, with an overall improving trend:Specifically, new fund launches in the first quarter amounted to $161.4 billion, a decline of nearly 20% year-on-year, while the second quarter recorded $202.8 billion, an increase of 16% year-on-year. <br>
<div style="margin-left: auto; margin-right: auto; width: 75%">
| ![image044](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_4/4-7.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 4.7 Domestic Private Equity Fund Contributions and Volume (Source:investment.com, KPMG)</center>
<br>
**Increase in the size of RMB funds and a significant decrease in the size of foreign currency funds**
In the first half of 2023, the number of new RMB funds launched was 3,840, a decrease of 13% compared to the same period last year. The total size of RMB funds reached US$339.5 billion, a 13% increase compared to the same period last year. The size of foreign currency funds was $24.7 billion, a significant decline of 67% from the previous year. Despite the increase in the number of foreign currency funds in 2023, their impact on the total size is small as most are small funds.
This trend indicates that the domestic equity investment market prefers the more conservative investment style of RMB funds:and requires a higher degree of stability in the portfolio companies. For open-source business startups in China, simply following the market buzz is no longer enough to attract investment. Technological strength and long-term growth potential become key factors in assessing whether to make further investments. <br>
<div style="margin-left: auto; margin-right: auto; width: 85%">
| ![image045](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_4/4-8.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 4.8 Size and number of domestic private equity RMB funds (Source:KPMG)</center>
<br>
<div style="margin-left: auto; margin-right: auto; width: 85%">
| ![image046](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_4/4-9.png) |
| -------------------------------------------------------- |
</div>
<center> Figure 4.9 Domestic Private Equity Foreign Currency Fund Size and Volume (Source:KPMG)</center>
<br>
**Economic recovery falls short of expectations and decline in overall investment volume and size**
Against the macro backdrop of unstable roots of economic recovery, slowdown in overall demand, and instability in external markets, the total number of investments in the H1 equity market in 2023 will be 3,750, a year-on-year decline of 31%; the total amount of investment supplied will be USD56.9 billion, a decline of 6% compared to the same period last year. Compared to the financing side where the size of newly established funds declined by 3%, a stronger contraction has been shown on the investment side, which further illustrates the cautious sentiment of investors, which is consistent with the trend shown by international markets. <br>
<div style="margin-left: auto; margin-right: auto; width: 85%">
| ![image047](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_4/4-10.png) |
| --------------------------------------------------------- |
</div>
<center> Figure 4.10 Amount and number of investments in the domestic equity market (Source:KPMG)</center>
<br>
#### 4.2.2 Steady development of domestic open source ecology
**Open-source industry is gradually improving in all aspects of the ecosystem and is steadily growing**
At present, the domestic open-source industry is experiencing the development pattern of both top-level design and industrial progress, talent reserve and technological innovation, making progress together in all aspects from laws and regulations, policy support, competition selection, and all links of the industry chain.
In terms of laws and regulations, Zhang Guofeng, deputy director of the Institute of Artificial Intelligence and Change Management at the University of International Business and Economics in Shanghai and secretary general of the Shanghai Open-source Information Technology Association on November 2, 2023, said at the media communication meeting of the 2023 Open-source Industry Ecological Conference that Shanghai's open-source industry planning and policies are in the process of being drafted and pushed forward, and that Shanghai must seize the historic opportunity to actively participate in digital governance and digital public goods international cooperation (news from The Paper); in terms of policy support, at the 2023 Global Open-source Technology Summit (GOTC), the Shanghai open-source industry service platform was officially announced to start:Shanghai Pudong Software Park signed a contract with the Linux Foundation Asia-Pacific to officially land the Linux Foundation Asia-Pacific Open-source Community Service Center, and signed a strategic cooperation agreement with OSChina to build the Shanghai open-source ecological (News from Wen Hui Bao). In terms of competition selection, China has already had a series of open-source competitions such as "China Software Open-source Innovation Competition" and "OpenHarmony Competition Training Camp", which have attracted students from Shanghai Jiaotong University, Fudan University and other domestic universities to participate in the competitions, and a large number of innovative highlights have emerged from the competitions, fully reflecting the momentum and great potential of the flourishing co-construction of the open-source ecosystem. A large number of innovative highlights emerged in the competition, fully reflecting the good momentum and great potential of the open-source ecological construction.
All segments of the open-source chain are thriving. In the field of artificial intelligence, numerous companies have open-sourced base LLMs, including Alibaba open-sourcing Tongyi Qianwen, High-Flyer Quant open-sourcing DeepSeek, and more. Startups in Baichuan Intelligence, Zhipu AI, Zero One Everything and so on have respectively released a variety of LLMs of their own training base, it is worth mentioning that these companies are favored by the capital market, respectively, in this year, one or more high-value financing. In the developer tools layer, a number of startups that are already deep in the game are joined by new players and there are already products that are trying to go global. In the foreseeable future, there are also opportunities for open-source AI applications to usher in more opportunities at the application layer.
In the area of underlying operating systems, large companies are promoting the localization of operating systems, including the Anolis OS open-source community developed by Alibaba and the openEuler community supported by the OpenAtom Open-Source Foundation. These large enterprises also have notable open-source project layouts in a number of key areas, including cloud native, big data, artificial intelligence, and front-end technologies. For example, ant-design, Ant Group's enterprise UI design tool, PaddlePaddle, Baidu's deep learning platform, and Apache Echarts, a data visualization charting library, all have a wide reach and large user base in the GitHub community.
In the big data and database industry, a number of startups are actively strategizing in response to the large and diverse data generated by domestic and international markets, as well as the growing demand for data processing. For example, PingCAP launched TiDB, a distributed relational database, and TiKV, a distributed key-value database; TDengine, a time-series database; and ShardingSphere, a distributed database middleware from SphereEx. With the development of AI technology, innovative products have emerged in the AI field, such as Zilliz's vector database developed for AI applications and Jina.ai's neural search engine, which enables searches across all types of content. <br>
<div style="margin-left: auto; margin-right: auto; width: 85%">
| ![image048](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_4/4-11.png) |
| --------------------------------------------------------- |
</div>
<center> Figure 4.11 Map of domestic AI-related tech companies' open source projects and open source companies (partial)</center>
<br>
**ModelScope has become the first portal for domestic open source LLMs, marking the gradual growth of China's open-source AI community construction**
ModelScope Community is an AI modeling community launched by Ali Dharma Institute in collaboration with the Open-source Development Committee of China Computer Federation (CCF), aiming to build a next-generation open-source model-as-a-service sharing platform, and strive to lower the threshold of AI applications. Since its launch, it has expanded rapidly:The community now has over 2,300 models, over 2.8 million developers, and over 100 million model downloads. Baichuan Intelligence, Wisdom Spectrum AI, Shanghai Artificial Intelligence Laboratory, IDEA Research Institute and other leading LLMing organizations use ModelScope as their open-source model debut platform.
The ModelScope community upholds the concept of "Model as a Service" and treats AI models as an important element of production, providing services around the model lifecycle, from model pre-training to secondary tuning and finally to model deployment. Compared to the foreign community Hugging Face, ModelScope pays more attention to domestic needs, provides a large number of Chinese models, and promotes the application of relevant AI scenes in China. <br>
<div style="margin-left: auto; margin-right: auto; width: 75%">
| ![image049](https://raw.githubusercontent.com//kaiyuanshe/2023-China-Open-Source-Report/main//public/image/commercialization/chapter_4/4-12.png) |
| --------------------------------------------------------- |
</div>
<center> Figure 4.12 So far, ModelScope community has 11 model classes including LLM, zero-sample learning, etc. </center>
<br>
The establishment and rapid development of the ModelScope community has set a benchmark for China's open-source community culture, which is conducive to further promoting the spread of open-source culture in China, attracting more creative, open-source spirit of technology creators, technology users to join, and promoting the further prosperity of China's open-source cause.
#### 4.2.3 Domestic Open Source Company Financing Remains Hot
The market heat maintained in 2023, with several large investments taking place and some startups raising multiple rounds of funding in a year, reflecting the high level of investor interest. Open Source China is an open-source community platform company, including nearly 100,000 world-renowned open-source projects, under the banner of open-source community Landscape and Japan's old open-source community OSDN, and also owns the code hosting platform Gitee, which is the leading code hosting service platform in China, and has obtained a 775 million yuan of strategic financing in the B+ round; SelectDB develops and promotes open-source real-time data warehouse Apache Doris, and provides technical support and commercial services for Apache Doris users, and has obtained a new round of several hundred million yuan of financing so far. Flywheel Technology, which develops and promotes the open-source real-time data warehouse Apache Doris and provides technical support and commercial services for Apache Doris users, has obtained a new round of financing of hundreds of millions of yuan, and the total financing scale has reached nearly 1 billion yuan up to now; Lanboat Technology, which provides a new generation of cognitive intelligence platform based on NLP technology, has completed the investment of the Pre-A+ round, and the total financing scale has reached hundreds of millions of yuan in less than a year.
At present, the development of China's open-source ecosystem is still at an early stage, and the financing events in 2023 will mainly focus on round B and before, involving artificial intelligence, open-source communities, data warehouses and LLMing platforms, and other fields, with vast market opportunities.
<center>
Table 4.1 Investment and Financing of Domestic Open Source Software Startups (slide to right to view full content)
</center>
<center>
(Github statistics as of December 7, 2023)
</center>
| **Company** | **Open source project** | **Corporate operations** | **Latest round of financing round** | \*\* Amount of latest round of financing\*\* | **Time of latest round of financing** | **GitHub Star** | **GitHub Fork** |
| --------------------------------------------- | ------------------------------------ | -------------------------------------------------- | ---------------------------------------- | ----------------------------------------------------- | ------------------------------------- | --------------- | --------------- |
| **Tributary Technologies** | Apache APISIX | Microservices API Gateway | A + round | Millions of dollars. | June 2021 | 10.8k | 2k |
| **Moby Dick Open Source** | Apache DolphinScheduler | Cloud-Native DataOps Platform | Pre-A round | tens of millions of dollars | July 2022 | 9.4k | 3.5k |
| **Flywheel Technologies** | Apache Doris | Cloud Native Real-Time Warehouse | Pre-A round | several hundred million dollars | June 2023 | 6.5k | 1.9k |
| **Even Tech** | Apache HAWQ | Hadoop SQL Analysis Engine | B + round | Nearly $200 million | August 2021 | 672 | 324 |
| **Tianmou Technology** | Apache IoTDB | Time Series Database System | angel round (finance) | nearly a billion dollars | June 2022 | 2.8k | 750 |
| **Short step information technology** | Apache Kylin | Big Data online analytical processing engine | D round | $70 million. | April 2021 | 3.4k | 1.5k |
| **StreamNative** | Apache Pulsar | distributed message queue | A + round | - | 2023 | 12k | 3.2k |
| **SphereEx** | Apache ShardingSphere | Distributed Database Pluggable Ecology | Pre-A round | Nearly $10 million | January 2022 | 17.7k | 6.1k |
| **Antoine Mound (AutoMQ)** | automq-for-rocketmq automq-for-kafka | Streaming storage software and message queues | Angel rounds + | Tens of millions of RMB | November 2023 | 195 | 34 |
| **Smart Spectrum AI** | ChatGLM | Large Prophecy Model | B++++ | RMB 1.2 billion | September 2023 | 36.3k | 4.9k |
| **Luchen Technology** | Colossal-AI | High-Performance Enterprise AI Solutions | angel round (finance) | $6 million | September 2022 | 6.8k | 637 |
| **Chatopera** | cskefu | Multi-Channel Intelligent Customer Service System | angel round (finance) | millions of dollars | August 2018 | 2.2k | 742 |
| **Digital Change Technology** | Databend | cloud warehouse (computing) | angel round (finance) | Millions of dollars. | August 2021 | 4.8k | 500 |
| **Dify.AI** | Dify | LLMOps platform | fund | undisclosed | 44986 | 11.8k | 1596 |
| **Image Cloud Technology** | EMQX | MQTT Message Middleware | B round | 150 million | December 2020 | 10.8k | 1.9k |
| **TensorChord** | Envd | MLOps | seed round | Millions of dollars. | November 2022 | 1.3k | 102 |
| **Stoneware Technology** | FydeOS | Chromium-based operating systems | Pre-A round | tens of millions of dollars | February 2022 | 1.5k | 192 |
| **Generalized intelligence** | GAAS | Autonomous UAV flight program | * | undisclosed | October 2018 | 1.7k | 411 |
| **GeekCode** | Geekcode.cloud | cloud development environment | seed round | Millions of RMB | April 2022 | 42 | 2 |
| **Gitee** | git | Git Code Hosting | B + round | 775 million | July 2023 | - | * |
| **Polar Fox** | GitLab | DevOps Tooling Platform | A++ round | tens of millions of dollars | September 2022 | - | * |
| **White Sea Technology** | IDP | AI Data Development Platform | seed round | tens of millions of dollars | December 2021 | 17 | 3 |
| **Ella Yunko** | illa-builder | Low-code development platform | angel round (finance) | Millions of dollars. | September 2022 | 2.3k | 126 |
| **Gina Technology** | Jina | A multimodal neural network search framework | Series A | $30 million | November 2021 | 16.8k | 2k |
| **Juicedata** | JuiceFS | distributed file system (DFS) | angel round (finance) | millions of dollars | October 2018 | 7.1k | 605 |
| **Harmonic Cloud Technology** | Kingdling | Container Cloud Products and Solutions | B + round | over one hundred million dollars | January 2022 | 270 | 56 |
| **Fly to Cloud** | JumpServer | Cloud & DevOps | D + Wheel | 100 million | April 2022 | 19.5k | 4.8k |
| **Talent Cloud Technology** | Kubernetes | Container Cloud Platform | Mergers and Acquisitions - Bytes | undisclosed | July 2020 | 94.1k | 34.5k |
| **Zeto Technology** | Kunlun | distributed database | angel round (finance) | tens of millions of dollars | August 2021 | 112 | 15 |
| **Deepness Technology** | LinuxDeepin | Linux operating system | B round | tens of millions of dollars | April 2015 | 413 | 70 |
| **Matrix origin** | Matrixone | data intelligence | angel + round | Tens of millions of dollars | October 2021 | 1.3k | 212 |
| **Mission Technologies** | Mengzi | macrolanguage model | Pre-A+ round | several hundred million yuan (RMB) | March 2023 | 530 | 61 |
| **Zilliz** | milvus | vector search engine | B + round | $60 million. | August 2022 | 14.4k | 1.9k |
| **Euronet** | Nebula | distributed graph database | Pre-A + round | Nearly $10 million | November 2020 | 8.3k | 926 |
| **PLEASURE NUMBER TECHNOLOGY** | NebulaGraph | distributed graph database | Series A | Tens of millions of dollars | September 2022 | 9.7k | 1.1k |
| **First class technology** | oneflow | Deep Learning Framework | Mergers and Acquisitions - Meituan | - | 2023 | 4.1k | 478 |
| **Facial Intelligence** | OpenBMB | Large model applications | seed round | undisclosed | August 2021 | 359 | 49 |
| **EasyJet Travel Cloud** | OpenStack | IaaS | Round E | undisclosed | July 2021 | 4.6k | 1.6k |
| **Original Language Technology** | PrimiHub | privacy calculations | Angel rounds + | multi-million dollar | October 2022 | 263 | 60 |
| **Good Rain Technology** | Rainbond | Cloud Operating System for Enterprise Applications | Pre-A round | millions of dollars | August 2016 | 3.6k | 664 |
| **Quick use of cloud computing** | QuickTable | Code-free data modeling tools | * | undisclosed | August 2021 | 7 | 3 |
| **Rayside Technology** | RT-Thread | Internet of Things Operating System | - | undisclosed | January 2020 | 7.6k | 4.2k |
| **Giant Sequoia Database** | SequoiaDB | Distributed relational database | D round | several hundred million dollars | October 2020 | 305 | 115 |
| **Borderless Technology** | Shifu | IoT Software Development Framework | Series A | undisclosed | June 2022 | 205 | 21 |
| **Dingshi Vertical** | StarRocks | MPP Analytical Database | B round | undisclosed | January 2022 | 3.6k | 793 |
| **Stone Atomic Technology** | StoneDB | Real-time HTAP database | angel round (finance) | tens of millions of dollars | February 2022 | 639 | 100 |
| **TabbyML** | TabbyML | Open Source AI Programming Assistant | seed round | undisclosed | 45108 | 13.9k | 515 |
| **Taiji graphic** | Taichi | Digital content creation infrastructure | Series A | $50 million | February 2022 | 21.7k | 2.1k |
| **Titanium-platinum data** | Tapdata | Real-time data service platform | Pre-A + round | Tens of millions of dollars | July 2021 | 223 | 52 |
| **Throughout data** | TDengine | Time-Series Spatial Big Data Engine | B round | $47 million | May 2021 | 20.1k | 4.6k |
| **PingCAP** | TiDB | distributed database | Round E | undisclosed | July 2021 | 32.9k | 5.3k |
| **Digital Paradise** | uni-app | A Unified Front-End Framework with Vue Syntax | B + round | undisclosed | September 2018 | 37.4k | 3.4k |
| **LINGO TECHNOLOGY** | Vanus | Large Model Middleware | seed round | Millions of dollars. | 45108 | 2.2k | 110 |
| **Future speed** | Xorbits | Distributed Data Science Computing Framework | angel round (finance) | Millions of dollars. | 44958 | 933 | 58 |
| **Levi Software** | Zabbix | IT operations management | Series A | undisclosed | November 2022 | 2.6k | 766 |
| **KodeRover** | Zadig | Cloud Native Software Delivery Cloud | Pre-A round | tens of millions of dollars | August 2021 | 1.8k | 636 |
| **EasySoft Tianchuang** | zentaopms | Agile Project Management | Series A | tens of millions of dollars | October 2021 | 946 | 275 |
| **Cloud Axis Information** | ZStack | IaaS | * | undisclosed | March 2021 | 1.2k | 380 |
<br>
<center>
Table 4.2 Investment and Financing of Domestic Open-source LLMing Startups (slide to right to view full content)
</center>
<center>
(Hugging Face statistics as of December 7, 2023)
</center>
<table>
<tr>
<th> Company </th>
<th> Latest financing round </th>
<th> Date of last financing </th>
<th> Recent financing volume </th>
<th> Model Introduction </th>
<th> model name </th>
<th> likes </th>
<th> download </th>
</tr>
<tr>
<td rowspan="3"> 百川智能 </td>
<td rowspan="3">A 轮 </td>
<td rowspan="3">2023-10-17 00:00:00</td>
<td rowspan="3">3 亿美元 </td>
<td rowspan="3"> 在知识问答、文本创作领域表现突出 </td>
<td>Baichuan-7B</td>
<td>795</td>
<td>102k</td>
</tr>
<tr>
<td>Baichuan-13B-Chat</td>
<td>612</td>
<td>8.29k</td>
</tr>
<tr>
<td>Baichuan2-13B-Chat</td>
<td>321</td>
<td>133k</td>
</tr>
<tr>
<td rowspan="3"> 智谱 AI</td>
<td rowspan="3">B+++++ 轮 </td>
<td rowspan="3">2023-09-19 00:00:00</td>
<td rowspan="3">12 亿人民币 </td>
<td rowspan="3"> 多模态理解、工具调用、代码解释、逻辑推理 </td>
<td>ChatGLM-6B</td>
<td>2.67k</td>
<td>56.8k</td>
</tr>
<tr>
<td>ChatGLM2-6B</td>
<td>1.91k</td>
<td>97.7k</td>
</tr>
<tr>
<td>ChatGLM3-6B</td>
<td>501</td>
<td>104k</td>
</tr>
<tr>
<td rowspan="3"> 元语智能 </td>
<td rowspan="3"> 出资设立 </td>
<td rowspan="3">2022-11-24 00:00:00</td>
<td rowspan="3">—</td>
<td rowspan="3"> 功能型对话大模型 </td>
<td>ChatYuan-large-v2</td>
<td>171</td>
<td>669</td>
</tr>
<tr>
<td>ChatYuan-large-v1</td>
<td>108</td>
<td>120</td>
</tr>
<tr>
<td>ChatYuan-7B</td>
<td>9</td>
<td>3</td>
</tr>
<tr>
<td rowspan="3"> 面壁智能 </td>
<td rowspan="3"> 天使轮 </td>
<td rowspan="3">2023-04-14 00:00:00</td>
<td rowspan="3"> 数千万人民币 </td>
<td rowspan="3"> 大语言模型,包括包括文字填空、文本生成、问答 </td>
<td>cpm-bee-10b</td>
<td>158</td>
<td>19</td>
</tr>
<tr>
<td>cpm-ant-10b</td>
<td>22</td>
<td>12.6k</td>
</tr>
<tr>
<td>cpm-bee-1b</td>
<td>12</td>
<td>7</td>
</tr>
<tr>
<td rowspan="3"> 澜舟科技 </td>
<td rowspan="3">Pre-A + 轮 </td>
<td rowspan="3">2023-03-14 00:00:00</td>
<td rowspan="3"> 数亿人民币 </td>
<td rowspan="3"> 处理多语言、多模态数据,文本理解、文本生成 </td>
<td>mengzi-t5-base</td>
<td>41</td>
<td>1.42k</td>
</tr>
<tr>
<td>mengzi-bert-base</td>
<td>32</td>
<td>1.46k</td>
</tr>
<tr>
<td>mengzi-t5-base-mt</td>
<td>17</td>
<td>44</td>
</tr>
<tr>
<td rowspan="3"> 虎博科技 </td>
<td rowspan="3">A 轮 </td>
<td rowspan="3">2019-03-01 00:00:00</td>
<td rowspan="3">3300 万美元 </td>
<td rowspan="3"> 多语言任务大模型,覆盖生成、开放问答、编程、画图、翻译、头脑风暴等 15 大类能力 </td>
<td>tigerbot-70b-chat-v2</td>
<td>40</td>
<td>1.68k</td>
</tr>
<tr>
<td>tigerbot-180b-research</td>
<td>33</td>
<td>12</td>
</tr>
<tr>
<td>tigerbot-70b-base-v1</td>
<td>15</td>
<td>3.25k</td>
</tr>
<tr>
<td rowspan="2"> 深势科技 </td>
<td rowspan="2">C 轮 </td>
<td rowspan="2">2023-08-18 00:00:00</td>
<td rowspan="2"> 超 7 亿人民币 </td>
<td> 高精度蛋白质结构预测模型 </td>
<td>Uni-Fold-Data</td>
<td>—</td>
<td>6</td>
</tr>
<tr>
<td> 三维分子预训练模型 </td>
<td>Uni-Mol-Data</td>
<td>—</td>
<td>3</td>
</tr>
<tr>
<td rowspan="3"> 元象 XVERSE</td>
<td rowspan="3">A + 轮 </td>
<td rowspan="3">2022-03-11 00:00:00</td>
<td rowspan="3">1.2 亿美元 </td>
<td rowspan="3"> 大语言模型,具备认知、规划、推理和记忆能力 </td>
<td>XVERSE-13B</td>
<td>117</td>
<td>42</td>
</tr>
<tr>
<td>XVERSE-13B-Chat</td>
<td>42</td>
<td>412</td>
</tr>
<tr>
<td>XVERSE-65B</td>
<td>35</td>
<td>6.18k</td>
</tr>
<tr>
<td rowspan="3"> 零一万物 </td>
<td rowspan="3"> 天使轮 </td>
<td rowspan="3">2023-11-06 00:00:00</td>
<td rowspan="3">—</td>
<td rowspan="3"> 通用型 LLM,其次是图像、语音、视频等多模态能力。</td>
<td>Yi-34B</td>
<td>1.07k</td>
<td>109k</td>
</tr>
<tr>
<td>Yi-6B</td>
<td>303</td>
<td>26.7k</td>
</tr>
<tr>
<td>Yi-34B-200K</td>
<td>107</td>
<td>4.55k</td>
</tr>
</table>