Chan Ka Hei
08-Aug-2023

# Introduction

CoreWeave addresses the demand for cloud-based accelerated computing, notably generative AI training and inference, visual effects rendering, pixel streaming, and high-fidelity physics simulation. The company started in 2017 as a cryptocurrency miner. It pivoted in 2019 during the crypto bust, repurposing its GPU hardware to provide on-demand visual effects rendering cloud services. Since then the company has developed and operated a collection of data centers providing accelerated cloud computing services.

Major use cases for CoreWeave include generative AI and visual effects rendering. Research predicts that the market for generative AI cloud services will reach USD 247 billion by 2032, a compound annual growth rate of 42% over the coming decade[^bloombery]. For visual effects rendering, research predicts that the market will grow to USD 21 billion by 2029, a compound annual growth rate of 8.2%[^maximize].

The following sections analyze CoreWeave's competitiveness and predict how much market share it can acquire. Finally, a valuation and an investment recommendation are given.

[^bloombery]: https://www.bloomberg.com/company/press/generative-ai-to-become-a-1-3-trillion-market-by-2032-research-finds/

[^maximize]: https://www.maximizemarketresearch.com/market-report/visual-effects-vfx-market-global-market/148265/

# Solution

## Cloud computing infrastructure

CoreWeave focuses on NVIDIA GPUs and networking hardware, which offer the highest compute efficiency (FLOPS-to-dollar ratio) on the market[^mlcommon]. The company provides the following cloud computing infrastructure, consumable via the company's browser-based portal or command line interface. Customers can choose between on-demand instances and reserved instances based on their usage pattern and budget[^pricing].

- GPU compute
    - NVIDIA HGX
    - NVIDIA H100
    - NVIDIA A100
    - NVIDIA A40
    - NVIDIA RTX 6000
- CPU compute
    - AMD Milan
    - AMD Rome
    - Intel Xeon Scalable
- Managed Kubernetes
- Virtual Server
    - Linux
    - Windows
- Storage
    - HDD
    - SSD
    - Object Storage
- Networking
    - NVIDIA InfiniBand GPUDirect RDMA

[^mlcommon]: https://mlcommons.org/en/training-normal-30/

[^pricing]: https://www.coreweave.com/gpu-cloud-pricing

## Major use cases

CoreWeave's infrastructure as a service (IaaS) business gravitates toward compute-intensive use cases where a high FLOPS-to-dollar ratio brings an immense competitive edge.

### Generative AI Training

A widely accepted consensus in academia ([Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556), [Deep Learning Scaling is Predictable, Empirically](https://arxiv.org/abs/1712.00409)) holds that deep learning models perform better with more data and a larger model size (i.e. number of parameters). In particular, large language models (LLMs), popularized by OpenAI's ChatGPT, are optimally trained at about 20 tokens per parameter (one token corresponds to roughly 3/4 of a word). Whether an organization is training an AI model from scratch or fine-tuning a publicly available model, the amount of compute needed for an optimal training run is directly proportional to the amount of data the organization has access to. Following this narrative, since most organizations have a relatively stable volume of data available for AI training, it is in their best interest to distribute this fixed amount of compute over a large number of GPUs to achieve the quickest time to market.
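To make the "compute scales with data" argument concrete, below is a minimal sizing sketch. The 20 tokens-per-parameter ratio comes from the Chinchilla paper cited above; the $C \approx 6ND$ FLOPs rule of thumb is a standard approximation for dense transformer training, and the per-GPU throughput figure is a hypothetical illustration rather than a measured number.

```python
def chinchilla_training_budget(n_params: float, flops_per_gpu: float = 4e14):
    """Estimate compute-optimal training requirements for a dense LLM.

    Assumptions (illustrative):
    - compute-optimal data volume: ~20 tokens per parameter (Chinchilla)
    - total training compute: C ~= 6 * N * D FLOPs (standard approximation)
    - flops_per_gpu: hypothetical sustained throughput of one accelerator
    """
    tokens = 20 * n_params                # compute-optimal dataset size
    total_flops = 6 * n_params * tokens   # total training compute
    gpu_hours = total_flops / flops_per_gpu / 3600
    return tokens, total_flops, gpu_hours

# Example: a 70-billion-parameter model
tokens, flops, gpu_hours = chinchilla_training_budget(70e9)
print(f"{tokens:.1e} tokens, {flops:.1e} FLOPs, {gpu_hours:,.0f} GPU-hours")
```

With the GPU-hour budget pinned down by the data volume, spreading the work over more GPUs is the main lever for shortening wall-clock training time.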
However, orchestrating compute infrastructure at this scale incurs additional overhead that reduces efficiency, and without specialized hardware like InfiniBand GPUDirect RDMA the overhead can become prohibitively expensive, since it grows rapidly with cluster size[^scaling] (a short sketch after the summary table below quantifies the effect). CoreWeave's data centers are designed from the ground up to cater to this application, providing highly efficient computing infrastructure even at large scale.

[^scaling]: A typical HPC cluster training transformer models has a scaling factor of 0.87, i.e. 1000 GPUs yield about $1000^{0.87} \approx 407\times$ the throughput of a single GPU. https://arxiv.org/pdf/2201.12423.pdf CoreWeave HGX systems have a scaling factor of 0.93, calculated from the MLCommons benchmark. That implies 1000 GPUs yield about $617\times$ acceleration, roughly a 50% improvement in scaling efficiency over a typical system.

### Generative AI Inference

Generative AI models can take significant computing resources to host. For example, a popular choice for self-hosted LLMs, Llama-2 published by Meta, requires about 140 GB of GPU memory to run its 70-billion-parameter variant in half precision (70B parameters × 2 bytes each), which translates to a minimum hardware requirement of 2x A100 80GB. In contrast to AI training, AI inference typically has a scaling factor close to 1 (near-perfect scaling), since requests can be served independently. Like other on-demand services hosted on the internet, an adaptive system that provisions the right amount of resources to host an AI model at low latency is key to a satisfactory user experience while minimizing cost. CoreWeave operates multiple data centers across the US. Together with its managed Kubernetes service, the company can provide cloud computing resources matched to where and when demand appears.

### VFX and Rendering

Similar to generative AI inference, VFX and rendering workloads enjoy near-perfect scaling. NVIDIA GPUs offer hardware acceleration for this workload in the form of ray tracing cores, which allow rendering jobs to complete quickly and cheaply. Rendering workloads typically prioritize throughput and are not latency-sensitive. CoreWeave has one of the largest GPU pools among cloud providers, at prices as low as 30% of those offered by the major cloud companies, which makes it ideal for customers who simply need a large pool of GPUs to rent for a short period of time.

_Summary_

| Use Case | Major pain point | CoreWeave's Solution |
| -------- | -------- | -------- |
| AI Training | Sublinear scaling | Advanced GPU and networking hardware |
| AI Inference | Varied demand, latency | Managed Kubernetes, data centers in different geolocations |
| VFX render | High volume of GPUs needed only for a short period | Large pool of available GPUs |
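As a quick illustration of the "sublinear scaling" pain point above, the following sketch evaluates the power-law scaling model from the scaling-factor footnote, using the cited exponents of 0.87 (typical HPC cluster) and 0.93 (CoreWeave HGX):

```python
def effective_speedup(n_gpus: int, scaling_factor: float) -> float:
    """Throughput relative to one GPU under a power-law scaling model."""
    return n_gpus ** scaling_factor

for n in (8, 512, 1024, 2048):
    typical = effective_speedup(n, 0.87)  # typical HPC cluster
    hgx = effective_speedup(n, 0.93)      # CoreWeave HGX (MLCommons-derived)
    print(f"{n:5d} GPUs: typical {typical:7.0f}x, HGX {hgx:7.0f}x "
          f"(advantage {hgx / typical - 1:+.0%})")
```

The gap widens with cluster size: at 8 GPUs the two systems are within about 13% of each other, while at 2048 GPUs the higher scaling factor yields roughly 58% more effective throughput.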
# Competition

The accelerated computing market is crowded with competitors both larger and smaller than CoreWeave. A few are listed here:

- Major Cloud Providers
    - Google Cloud Platform
    - Amazon Web Services
    - Microsoft Azure
- Specialized Cloud Providers
    - Lambda Labs
    - Vultr
- Regional Micro Cloud Providers
    - Taiwan Web Service

## Integration

### Major Cloud Providers

While the availability of GPU resources on CoreWeave is similar to that of the major cloud providers, it lacks their breadth of services. Customers who currently host their applications on major cloud providers would need to invest heavily in solution architecture before they can effectively utilize CoreWeave.

The major cloud providers also offer managed generative AI services that ease custom generative AI training and deployment, which under suitable circumstances removes the need to rent dedicated servers for generative AI training/inference altogether. These services also remove the need to manage the underlying servers.

_Managed generative AI services offered by major cloud providers_

| Major Cloud Provider | Generative AI Model | Managed Training | Managed Inference |
| -------------------- | ------------------- | ---------------- | ----------------- |
| GCP | PaLM2 | TRUE | TRUE |
| AWS | Stable-Diffusion | TRUE | TRUE |
| AWS | Jurassic-2 | TRUE | TRUE |
| AWS | Claude 2 | TRUE | TRUE |
| AWS | Titan | TRUE | TRUE |
| AWS | Command and Embed | TRUE | TRUE |
| Azure | Llama-2 | TRUE | TRUE |
| Azure | OpenAI GPT-3 | TRUE | TRUE |
| Azure | OpenAI GPT-3.5 | FALSE | TRUE |
| Azure | OpenAI GPT-4 | FALSE | TRUE |
| CoreWeave | None | FALSE | FALSE |

[^gcp_ref] [^aws_ref] [^azure_ref]

[^gcp_ref]: Tune language foundation models https://cloud.google.com/vertex-ai/docs/generative-ai/models/tune-models

[^aws_ref]: AWS Bedrock https://aws.amazon.com/bedrock/

[^azure_ref]: Azure Machine Learning supports Llama-2 fine-tuning with DeepSpeed and LoRA integration https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/introducing-llama-2-on-azure/ba-p/3881233

### Specialized Cloud Providers

Aside from the major cloud providers, CoreWeave also faces competition from cloud providers of similar scale and niche. These providers offer limited integration options, and in comparison CoreWeave holds a slight competitive edge.

_Technology stack of similarly sized cloud providers_

| Cloud Provider | Training Orchestration | Inference Orchestration | Virtual Workstation |
| -------- | -------- | -------- | -------- |
| CoreWeave | Managed Kubernetes | Managed Kubernetes | Yes |
| Lambda Labs | Self-managed with Docker | Self-managed with Docker | No |
| Vultr | Managed Kubernetes | Managed Kubernetes | No |

### Regional Micro Cloud Providers

Regional micro cloud providers like Taiwan Web Service (a subsidiary of Asus) offer custom generative AI models trained on non-English corpora, which gives them a unique competitive edge in non-English-speaking economies. [^tws-ref]

[^tws-ref]: TWS AI Foundry Service https://tws.twcc.ai/en/afs/

## Privacy and security

Customers who value security and privacy above all else will choose to build their own on-premises data centers; hence, the widely accepted ISO/IEC certification statuses on information security and privacy are listed below for comparison.

_ISO/IEC 27001 and 27701 certification status of different cloud providers for AI-related services_

| Cloud Provider | ISO/IEC 27001 (security) | ISO/IEC 27701 (privacy) |
| -------- | -------- | ------- |
| CoreWeave | TRUE | FALSE |
| GCP | TRUE | TRUE |
| AWS | TRUE | TRUE |
| Azure | TRUE | TRUE |
| Lambda Labs | FALSE | FALSE |
| Vultr | Planned for 2024 | FALSE |

While CoreWeave is behind the major cloud vendors in terms of security and privacy, it leads its similarly sized peers. This becomes a point of differentiation when a customer is moderately concerned with data privacy and security. However, major privacy regulations like GDPR[^gdpr] dictate that services collecting personal data of European citizens must store it within European jurisdictions or in countries with a similar scope of privacy protection.
With all of its data centers operated in the US, CoreWeave's ability to solicit business from international organizations is therefore limited. To mitigate the legal risk, potential customers would need to invest in hybrid cloud infrastructure to incorporate CoreWeave.

[^gdpr]: GDPR art. 44 https://gdpr-info.eu/art-44-gdpr/

## Unit economics

### Generative AI Training

For AI training, the unit economics depend heavily on the extent of distribution, in accordance with the scaling factor discussed above. Llama-2 is a popular open-source language foundation model with well-documented training characteristics, trained by Meta in its in-house data center on 2048 A100 GPUs. It is used here as a reference workload to illustrate the typical time and monetary cost of training generative AI. Drawing on published benchmarks[^nvidia-ref], the following assumptions are made to extrapolate the training characteristics:

- A100 clusters have a scaling factor of 0.87
- H100 clusters have a scaling factor of 0.93
- For a single 8x GPU system, the H100 is roughly 4.5x faster than the A100 in LLM training

[^nvidia-ref]: H100 vs A100 performance comparison https://blogs.nvidia.com/blog/2022/09/08/hopper-mlperf-inference/

![](https://hackmd.io/_uploads/SyHI4IPn3.png)

_Estimated time and monetary cost of training Llama-2-70b from scratch with different cloud vendors (N/A = no price available)_

| GPU SKU | GPU count | GPU-hours | Training hours | CoreWeave | GCP | AWS | Azure | Lambda Labs | Vultr |
| ------- | --------- | -------- | ------------- | --------- | --- | --- | ----- | ---------- | ----- |
| A100 | 512 | 1436618 | 2805 | $ 3,174,925 | $ 7,281,734 | $ 7,356,516 | $ 6,667,703 | $ 2,154,927 | $ 3,741,132 |
| A100 | 1024 | 1572083 | 1535 | $ 3,474,303 | $ 7,968,361 | $ 8,050,194 | $ 7,296,430 | $ 2,358,124 | $ 4,093,900 |
| A100 Meta (original paper) | 2048 | 1720320 | 840 | $ 3,801,907 | $ 8,719,724 | $ 8,809,274 | $ 7,984,435 | $ 2,580,480 | $ 4,479,928 |
| H100 | 512 | 219569 | 429 | $ 1,045,148 | N/A | $ 2,698,503 | N/A | $ 568,683 | N/A |
| H100 | 1024 | 230486 | 225 | $ 1,097,113 | N/A | $ 2,832,672 | N/A | $ 596,958 | N/A |
| H100 | 2048 | 241945 | 118 | $ 1,151,658 | N/A | $ 2,973,504 | N/A | $ 626,637 | N/A |

Generative AI training with a specialized cloud vendor like CoreWeave can represent over 60% in cost savings compared to the major cloud providers, which translates to over USD 1.5 million saved on an H100 training run. Such savings could justify the overhead cost of a hybrid cloud implementation that places training workloads on CoreWeave.

CoreWeave and Vultr are at a similar price point, but Lambda Labs is at a significantly lower price point than CoreWeave. Whether this constitutes a major competitive edge is unclear: the absolute difference is around USD 0.5 million, but since Lambda Labs does not provide any orchestration service for distributed training, the chances of errors and restarts are much higher, quickly eroding the pricing advantage and creating a time disadvantage.
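The GPU-hour figures in the table can be reproduced from Meta's published run using the scaling assumptions above. A minimal sketch of the extrapolation (vendor cost is then just GPU-hours multiplied by the vendor's hourly GPU price):

```python
META_GPUS, META_GPU_HOURS = 2048, 1_720_320   # Llama-2-70b original A100 run

def training_hours(n_gpus: int, alpha: float, speedup_vs_a100: float = 1.0) -> float:
    """Wall-clock hours to train Llama-2-70b on n_gpus, extrapolated from
    Meta's run via the power-law model time(n) ~ t1 / n**alpha."""
    t1 = (META_GPU_HOURS / META_GPUS) * META_GPUS ** 0.87  # single-A100 equivalent
    return (t1 / speedup_vs_a100) / n_gpus ** alpha

for n in (512, 1024, 2048):
    a100 = training_hours(n, alpha=0.87)
    h100 = training_hours(n, alpha=0.93, speedup_vs_a100=4.5)
    print(f"{n:4d} GPUs: A100 {a100:6.0f} h ({a100 * n:9,.0f} GPU-h), "
          f"H100 {h100:5.0f} h ({h100 * n:9,.0f} GPU-h)")
```

Multiplying, for example, the 512-GPU H100 figure by CoreWeave's $4.76/GPU-hour on-demand price recovers the roughly $1.05 million estimate in the table.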
### Generative AI Inference

For inference, whether it is justified to give up the integration advantages offered by the major cloud providers depends on the volume and location of the inference requests. As CoreWeave's infrastructure is deployed mainly in the US, its advantage lies in high-volume domestic use cases.

| Scenario | Low Volume | High Volume |
| -------- | -------- | -------- |
| Inference in US | Not enough cost saving to give up integration advantage | CoreWeave wins |
| Inference not in US | Not enough cost saving to give up integration advantage | High latency might be prohibitive in some applications |

### VFX and Rendering

VFX rendering is a highly parallelizable workload that requires little orchestration effort. In this space CoreWeave has an advantage over the major cloud providers, as it provides GPU computing time at a vastly more affordable price point. Compared with other specialized cloud providers, CoreWeave has a unique edge in that the necessary software, such as virtual servers, is managed for customers so they can complete their whole workflow in the cloud.

## Sales and distribution

All three major cloud providers have a global presence in sales and distribution that is vastly superior to CoreWeave's. Among similarly sized peers, one would expect Lambda Labs to be in the leading position: the company was successfully selling GPU hardware before venturing into GPU cloud services, which gave it contacts with a wide variety of users who need GPU computing. CoreWeave, on the other hand, had a network only in the visual effects rendering market, which is comparatively limited and might not translate into an extensive network among AI practitioners. Web traffic data sourced from similarweb.com suggests that CoreWeave lags behind similarly sized cloud providers in reach to potential customers.

![](https://hackmd.io/_uploads/Hyx0WCZ23.png)

Hence, to compensate for its limited sales and distribution network, CoreWeave might need to seek strategic partnerships overseas. In the visual effects rendering market, by contrast, CoreWeave already has a mature industry network in the US.

# Key risks

## Defensibility

As of today, CoreWeave's product is poorly defended compared to those of the major cloud providers. Managed generative AI services on major cloud providers, like PaLM2 on GCP, Llama-2 on Azure, and the numerous foundation models on AWS Bedrock, share vendor lock-in characteristics: models trained on the platform cannot be exported and are only deployable on that platform. In contrast, CoreWeave's service provides raw computing infrastructure. Customers own the model artifacts and can freely transfer their models to other cloud vendors, putting CoreWeave's revenue at risk.

## Business model

CoreWeave generates revenue by renting out well-managed GPU computing infrastructure. Lacking a key differentiator from the similar offerings of Lambda Labs and Vultr, CoreWeave risks being trapped in a race to the lowest price. Cloud providers usually enjoy steep hardware discounts through bulk purchasing. Given that CoreWeave has also received investment from NVIDIA[^crunchbase], it is reasonable to believe that CoreWeave can keep its GPU purchase costs below those of its peers. Hence, even in a race to the lowest price, CoreWeave can remain in a moderately defensible position.

[^crunchbase]: CoreWeave raised USD 221M in Series B https://www.crunchbase.com/funding_round/coreweave-series-b--cf5d3f7f

## Scalability

As the scaling factor discussion above shows, a bigger cluster with a higher scaling factor is a more competitive cluster for delivering fast results in AI training. CoreWeave recently announced a USD 2.3 billion debt financing collateralized by its NVIDIA GPUs[^crunchbase-2].
If the raised capital is invested in building more and bigger data centers across multiple geolocations, CoreWeave will be in a good position to address its weakness in privacy compliance and to enhance the competitiveness of its unit economics. With more data centers in operation to serve as collateral, CoreWeave can then raise more capital, forming a positive feedback loop.

Taking advantage of this positive feedback loop for rapid expansion depends on low interest rates. If interest rates rise in the future, CoreWeave's ability to expand and improve its quality of service simply by building bigger and more data centers will be severely limited.

The current market price for a single H100 is about USD 30,000[^wcctech]. Under the following assumptions, the viability of debt financing for scaling up can be analyzed:

- In a data center, half of the hardware value is in the GPUs, i.e. each rentable GPU represents USD 60,000 in hardware
- Data center hardware is amortized straight-line over 5 years
- Operating overhead is not accounted for

[^crunchbase-2]: CoreWeave raised USD 2.3B in debt financing https://www.crunchbase.com/funding_round/coreweave-debt-financing--1eac4917

[^wcctech]: https://wccftech.com/nvidia-h100-80-gb-pcie-accelerator-with-hopper-gpu-is-priced-over-30000-us-in-japan/

_Debt financing for H100 GPUs on the cloud (interest per GPU, simple interest on USD 60,000 over 5 years)_

| Interest Rate | Interest expense over 5 years | Minimum utilization rate to break even at $4.76/hour (CoreWeave on-demand price) | Minimum utilization rate to break even at $2.59/hour (price matched with Lambda Labs) |
| ------------- | -------------------------- | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| 4% | $ 12,000 | 34.6% | 63.5% |
| 5.5% | $ 16,500 | 36.8% | 67.4% |
| 7% | $ 21,000 | 38.9% | 71.4% |
| 8.5% | $ 25,500 | 41.0% | 75.4% |

Once operating costs are also accounted for, including but not limited to

- electricity
- property management
- sales commission
- customer support
- software engineering
- corporate management

CoreWeave will need to leave ample room above the break-even utilization rate in order to make a profit and ensure quality of service. Note that to avoid the worst-case scenario of racing the price to the bottom, CoreWeave will need to create further value through intelligent investment in data center locations and supporting software, much like its success in the visual effects rendering market.
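A minimal sketch reproducing the break-even figures in the table above, under the stated assumptions (simple, non-compounding interest on the full USD 60,000 of hardware per GPU, and 24/365 availability across the 5-year amortization window); small differences from the table come from rounding:

```python
HOURS_5Y = 24 * 365 * 5   # hours in the 5-year amortization window
HW_COST = 60_000          # hardware cost per rentable GPU (USD)

def break_even_utilization(rate: float, price_per_hour: float) -> float:
    """Fraction of hours a GPU must be rented out to cover hardware plus interest."""
    interest = HW_COST * rate * 5   # simple interest over 5 years
    return (HW_COST + interest) / (price_per_hour * HOURS_5Y)

for rate in (0.04, 0.055, 0.07, 0.085):
    on_demand = break_even_utilization(rate, 4.76)  # CoreWeave on-demand price
    matched = break_even_utilization(rate, 2.59)    # matched to Lambda Labs
    print(f"{rate:5.1%}: {on_demand:.1%} at $4.76/h, {matched:.1%} at $2.59/h")
```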
# Team

- CEO: Michael Intrator
- CFO: Evan Meagher
- CTO: Brian Venturo
- CSO: Brannin McBee

The leadership team at CoreWeave has a rich history in the finance, investment, and business industries, having held positions such as proprietary trader, vice president, and partner at their previous firms. Notably, no member of the senior leadership team comes from a technical background. Since generative AI is generally considered a much more complex system than visual effects rendering, talent with deep technical knowledge could fill this gap.

# Recommendation

I recommend incrementally investing in CoreWeave based on a set list of milestone achievements. The potential exponential growth fueled by debt financing is enticing. With careful risk management and product development decisions, CoreWeave could represent an attractive investment opportunity.

## Valuation

### Generative AI

Based on the competition analysis, customers who would be drawn to CoreWeave share the following characteristics:

- US-based
- Does not operate in the EU / able to operate a hybrid cloud
- Not overly concerned about data security and privacy
- Has the capability to run large-scale distributed training and inference on bare metal
- Has a budget of about USD 1 million per model trained

If each constraint reduces the total addressable market by 50%, CoreWeave has a predicted maximum annual revenue of 247 × 0.5^5^ ≈ USD 7.72 billion. If CoreWeave realizes a 33% share (an equal split of the addressable market with two remaining competitors), then at a price-to-sales ratio of 4-6[^msft], CoreWeave could grow into a USD 10-15 billion market capitalization from the generative AI market within 10 years.

[^msft]: MSFT Q2 financial statement for net margin estimate https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/FinancialStatementFY23Q2.xlsx?version=dc48d17a-1912-afb7-ca8e-36ecf5030952

### Visual Effects Rendering

Based on the competition analysis, CoreWeave holds an overwhelming advantage in cloud rendering services. An optimistic projection of CoreWeave becoming the market leader (50% of the total addressable market) in this sector, combined with the assumption that 5% of the VFX market value is spent on cloud services, yields the following calculation:

Revenue: 21 × 0.5 × 0.05 ≈ USD 0.53 billion
Market capitalization (at a 4-6 price-to-sales ratio): USD 2.1-3.15 billion

### Combined

Hence, in total, CoreWeave's market capitalization could grow to USD 12.1-18.15 billion in the coming decade. The current valuation[^seriesb] of CoreWeave is about USD 2 billion. While a 6-9x return on investment over 10 years is not in itself attractive, with some changes to the corporate structure and product offering the addressable market can grow substantially.

[^seriesb]: Series B investment disclosed on Crunchbase https://www.crunchbase.com/funding_round/coreweave-series-b--f576e161
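A short sketch consolidating the valuation arithmetic above. All figures are in USD billions; the market sizes, 50% haircuts, market shares, and 4-6 price-to-sales range are the assumptions stated in the text, and small differences from the rounded figures quoted above come from carrying full precision:

```python
PS_RATIO = (4, 6)  # assumed price-to-sales range

# Generative AI: five 50% TAM haircuts, then a one-third market share
genai_revenue = 247 * 0.5 ** 5 / 3   # ~2.57

# VFX rendering: 50% share of the 5% of the market spent on cloud services
vfx_revenue = 21 * 0.5 * 0.05        # ~0.53

for label, rev in (("Generative AI", genai_revenue), ("VFX rendering", vfx_revenue)):
    lo, hi = rev * PS_RATIO[0], rev * PS_RATIO[1]
    print(f"{label}: revenue ~{rev:.2f}B, market cap {lo:.1f}-{hi:.1f}B")

total = genai_revenue + vfx_revenue
print(f"Combined: {total * PS_RATIO[0]:.1f}-{total * PS_RATIO[1]:.1f}B "
      f"vs a current valuation of ~2B")
```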
## Milestone 1: Global Expansion

Privacy regulations around the world, alongside rising geopolitical tension, motivate local storage and processing of consumer data. If CoreWeave builds out services in Europe and Asia with the capital raised through debt financing, not only can the company reach those local markets, it can also increase the product's competitiveness with multinational corporate customers.

- Addressable market upon completion of this milestone: 2x
- Return on investment over 10 years: 12-18x
- Recommended investment: USD 100 million

## Milestone 2: Product Development

Managed generative AI services ease the friction of development and create a vendor lock-in effect. With this key feature, CoreWeave would be on par with the major cloud providers and could reach the numerous companies with less technical expertise.

- Addressable market upon completion of this milestone: 2x
- Return on investment over 10 years: 24-36x
- Recommended investment: USD 100 million

## Exit Strategy

As the competitiveness and efficiency of generative AI cloud services grow with data center scale, market consolidation is expected to eventually take place. This dynamic predicts that the vendor with the biggest data centers will become the market leader. CoreWeave shareholders will be in a relatively advantaged position if, when a merger or acquisition event occurs, the combined company owns the largest accelerated computing data center footprint in the market. Once CoreWeave has scaled to this point, a merger & acquisition deal should be solicited to maximize shareholder value.

If, by luck, CoreWeave experiences exponential growth and quickly becomes the largest GPU cloud provider in the market, shareholders should instead solicit an IPO, using the raised capital to maintain the company's leading position in the market.