This video tutorial explores DeepSeek, a Chinese company producing open-source large language models (LLMs). The instructor demonstrates DeepSeek’s AI-powered assistant online and then focuses on downloading and running various sizes of DeepSeek R1 models locally using tools such as Ollama and LM Studio. He tests the models on two machines: an Intel Lunar Lake AI PC dev kit and a workstation with an RTX 4080 graphics card, highlighting hardware limitations and optimization techniques. The tutorial also covers using the Hugging Face Transformers library for programmatic access to DeepSeek models, encountering and troubleshooting various challenges along the way, including memory constraints and model optimization issues. Finally, the instructor shares insights on the challenges and potential of running these models locally versus using cloud-based solutions.
DeepSeek AI Model Study Guide
Quiz
Instructions: Answer the following questions in 2-3 sentences each.
- What is DeepSeek and what is unique about their approach to LLMs?
- Briefly describe the key differences between the DeepSeek R1, R1-Zero, and V3 models.
- Why is the speculated cost reduction of DeepSeek models a significant factor?
- What hardware was used to test DeepSeek models and why were these choices made?
- What is an iGPU, and how is it utilized by AI models?
- What were the results of using the deepseek.com AI assistant?
- What is Ollama, and how does it assist with local model deployment?
- Explain the concept of “distilled” models in the context of DeepSeek.
- What is LM Studio and how does it differ from Ollama in its deployment of LLMs?
- What were some of the challenges encountered when attempting to run DeepSeek models locally?
Quiz Answer Key
- DeepSeek is a Chinese company that develops open-weight large language models (LLMs). They are unique in their focus on cost reduction, aiming to achieve performance similar to models like OpenAI’s at a fraction of the cost, largely through training and inference optimizations.
- R1-Zero is a model trained with reinforcement learning that exhibited strong reasoning capabilities but had readability and language-mixing issues. R1 was further trained to mitigate these issues. V3 is a more advanced mixture-of-experts model with additional capabilities, including vision processing.
- The speculated 95-97% cost reduction is significant because training and running large language models typically cost millions of dollars. This drastic reduction suggests these models can be trained and used by those with smaller budgets.
- An Intel Lunar Lake AI PC dev kit (a mobile chip with an iGPU and an NPU) and a Precision Tower workstation with an RTX 4080 were used. These were chosen to test the models’ performance on different levels of hardware, including consumer-grade chips and dedicated graphics cards.
- An iGPU is an integrated graphics processing unit built directly into the processor. In newer AI PC chips, the iGPU is intended to run models alongside the NPU so that a discrete GPU is not necessary for smaller models.
- The deepseek.com AI assistant, which runs the V3 model, showed strong performance in text analysis and vision capabilities. It correctly extracted Japanese text from an image, but it did have some issues following all of the prompt instructions.
- Ollama is a tool that allows users to download and run large language models locally through the terminal, using the GGUF file format. It makes working with models easier via a command-line interface on a local machine.
- Distilled models are smaller versions of larger models, created through knowledge transfer from a more complex model. These smaller models retain similar capabilities to the larger model while being more efficient to run on local machines.
- LM Studio provides a more user-friendly interface for deploying and interacting with large language models. Unlike Ollama, which requires terminal commands, LM Studio has a chat-like interface that allows for a more conversational experience, along with some additional agentic features.
- Challenges included computer restarts caused by resource exhaustion on local hardware, GPU limitations, incompatibility of certain model formats, and the lack of optimization tools for integrated graphics processing units on some devices.
Essay Questions
Instructions: Answer the following essay questions in a detailed format, using supporting evidence from the source material.
- Analyze the claims made about the cost-effectiveness of DeepSeek models. How might this impact the development and accessibility of AI models?
The claims about the cost-effectiveness of DeepSeek models suggest that these models offer a more efficient balance between performance and cost compared to other AI models. This could have several significant impacts on the development and accessibility of AI models:
Increased Accessibility: Lower costs make it feasible for a broader range of users, including smaller businesses, researchers, and individual developers, to access and utilize advanced AI models. This democratization of AI technology can lead to more widespread innovation and application across various fields.
Accelerated Development: Cost-effective models can reduce the financial barriers to entry for AI development. This can encourage more startups and research institutions to experiment with and develop new AI applications, potentially accelerating the pace of innovation in the field.
Resource Allocation: With lower costs, organizations can allocate resources more efficiently, potentially investing more in areas such as data acquisition, model fine-tuning, and application development rather than spending heavily on computational resources.
Competitive Market: The availability of cost-effective models can increase competition among AI providers. This competition can drive further improvements in model efficiency, performance, and cost, benefiting end-users.
Sustainability: More cost-effective models often imply better optimization and lower energy consumption, contributing to the sustainability of AI technologies. This is increasingly important as the environmental impact of large-scale AI computations comes under scrutiny.
Broader Applications: Lower costs can enable the deployment of AI models in a wider range of applications, including those with tighter budget constraints. This can lead to the integration of AI in sectors that previously could not afford such technologies, such as education, healthcare, and non-profit organizations.
Research and Education: Educational institutions and research labs can benefit from cost-effective models by incorporating them into curricula and research projects. This can help in training the next generation of AI practitioners and researchers without the prohibitive costs associated with high-end models.
Overall, the cost-effectiveness of DeepSeek models can significantly lower the barriers to entry for AI development and usage, fostering a more inclusive and innovative ecosystem. This can lead to a more rapid advancement and adoption of AI technologies across various domains.
Beyond these points, the cost-effectiveness of DeepSeek models has the potential to be a game-changer in the AI landscape. By lowering the barriers to entry, these models can foster a more inclusive and innovative ecosystem, which can have far-reaching implications:
Democratization of AI: Lower costs mean that more individuals and organizations, including those with limited budgets, can access advanced AI capabilities. This democratization can lead to a more diverse range of voices and perspectives contributing to AI development, resulting in more robust and equitable AI solutions.
Enhanced Innovation: With reduced financial constraints, startups and smaller research teams can experiment with and develop new AI applications. This can lead to a surge in innovation, as more players are able to participate in the AI space and bring fresh ideas to the table.
Broader Adoption: Cost-effective models make it feasible for industries and sectors that previously could not afford AI technologies to integrate them into their operations. This can lead to widespread adoption across fields such as healthcare, education, agriculture, and more, driving efficiency and innovation in these areas.
Educational Opportunities: Lower costs can also benefit educational institutions by making it easier to incorporate AI into curricula. This can help in training the next generation of AI practitioners and researchers, ensuring a steady pipeline of talent to drive future advancements.
Resource Optimization: Organizations can allocate their resources more efficiently, investing in areas such as data acquisition, model fine-tuning, and application development rather than spending heavily on computational resources. This can lead to more effective and targeted use of AI technologies.
Sustainability: More cost-effective models often imply better optimization and lower energy consumption, contributing to the sustainability of AI technologies. This is increasingly important as the environmental impact of large-scale AI computations comes under scrutiny.
Competitive Market Dynamics: The availability of cost-effective models can increase competition among AI providers. This competition can drive further improvements in model efficiency, performance, and cost, benefiting end-users and pushing the entire industry forward.
Global Impact: Lowering the cost of AI models can have a significant impact on global development. Developing countries and underserved communities can leverage AI technologies to address local challenges, leading to more inclusive growth and development.
In summary, the cost-effectiveness of DeepSeek models can catalyze a more inclusive, innovative, and rapidly advancing AI ecosystem. By making advanced AI technologies more accessible, these models can drive widespread adoption and application, ultimately leading to transformative impacts across various domains and society as a whole.
- Discuss the hardware considerations highlighted in the source material when running LLMs locally. What is the trade-off between cost and performance?
Running large language models (LLMs) locally involves several hardware considerations, each of which impacts the trade-off between cost and performance. Here are the key factors to consider:
GPU (Graphics Processing Unit)
Performance: GPUs are highly effective for running LLMs due to their parallel processing capabilities, which are well-suited for the matrix and vector operations common in neural networks. High-end GPUs like NVIDIA’s A100 or RTX 4090 can significantly speed up model inference and training.
Cost: High-performance GPUs are expensive. The cost can range from several hundred to thousands of dollars per unit, and running multiple GPUs in parallel further increases costs.
CPU (Central Processing Unit)
Performance: While CPUs can run LLMs, they are generally slower compared to GPUs due to their sequential processing nature. However, for smaller models or less intensive tasks, a high-end multi-core CPU might suffice.
Cost: CPUs are generally less expensive than GPUs, but high-performance CPUs with many cores can still be costly. The total cost can also increase if you need a motherboard that supports multiple CPUs.
Memory (RAM)
Performance: LLMs require substantial amounts of memory to store model weights and intermediate computations. Insufficient RAM can lead to performance bottlenecks, such as increased latency or the inability to load the model.
Cost: High-capacity RAM (e.g., 64 GB, 128 GB, or more) is expensive, and costs climb quickly with capacity, especially for faster DDR5 modules.
Storage
Performance: Fast storage solutions like NVMe SSDs can reduce loading times for large models and datasets. Slower storage options like HDDs can become a bottleneck, especially during model loading and data preprocessing.
Cost: NVMe SSDs are more expensive than traditional HDDs. The cost can add up quickly if you need large storage capacities (e.g., several terabytes).
Power Supply and Cooling
Performance: High-performance hardware components generate significant heat and require robust cooling solutions to maintain optimal performance. Inadequate cooling can lead to thermal throttling, reducing performance.
Cost: High-quality cooling solutions (e.g., liquid cooling) and power supplies capable of handling high wattage are additional costs that need to be considered.
Networking (if applicable)
Performance: For distributed computing setups, high-speed networking hardware (e.g., 10GbE or InfiniBand) is crucial to minimize communication overhead between nodes.
Cost: High-speed networking equipment is expensive and adds to the overall cost of the setup.
Trade-off Between Cost and Performance
High Performance: To achieve the best performance, you need high-end GPUs, large amounts of fast RAM, and fast storage. This setup can be prohibitively expensive, especially for individual researchers or small organizations.
Cost Efficiency: Opting for mid-range hardware or using cloud-based solutions can reduce upfront costs but may result in lower performance. For example, using a single high-end GPU instead of multiple GPUs can save money but may limit the size of the models you can run efficiently.
Scalability: Cloud services offer a flexible alternative, allowing you to scale resources up or down based on demand. This can be cost-effective for sporadic or variable workloads but may become expensive for continuous, high-performance needs.
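Before committing to a hardware tier, it helps to check what you already have. The sketch below is a minimal, hedged example: it assumes PyTorch is installed and only detects NVIDIA/CUDA devices; Apple Silicon or integrated GPUs need different checks.

```python
# Minimal hardware probe: reports the first CUDA GPU and its VRAM, if any.
# Assumes PyTorch is installed; non-NVIDIA accelerators are not covered here.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA GPU detected; plan for CPU inference or a smaller quantized model.")
```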
Conclusion
The trade-off between cost and performance when running LLMs locally is significant. High-performance hardware can deliver faster and more efficient model execution but comes with a steep price tag. Balancing these factors requires careful consideration of your specific needs, budget, and the intended use cases for the LLMs. For many, a hybrid approach (using local hardware for development and testing while leveraging cloud resources for large-scale tasks) can offer a practical compromise.
- Compare and contrast the various methods used to deploy DeepSeek models in the crash course, from using the website to local deployment via Ollama and LM Studio, and using Hugging Face.
Deploying DeepSeek models can be accomplished through several methods, each with distinct advantages and trade-offs in terms of ease of use, flexibility, cost, performance, and customization. Below is a comparison of common deployment approaches, including using the DeepSeek website, local deployment via Ollama or LM Studio, and leveraging Hugging Face:
DeepSeek Website (SaaS/Cloud-Based)
Ease of Use:
Simplest method; no technical setup required.
Users interact via a web interface or API, ideal for non-technical users (see the hedged API sketch after this section).
Flexibility:
Limited customization (e.g., fine-tuning, model adjustments).
Pre-configured models with fixed parameters and output formats.
Cost:
Typically pay-as-you-go or subscription-based pricing.
No upfront hardware costs, but recurring fees for heavy usage.
Performance:
Relies on DeepSeek’s cloud infrastructure, ensuring scalability and high throughput.
Latency depends on internet connection and server load.
Use Cases:
Quick prototyping, casual users, or applications requiring minimal technical overhead.
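For the API route mentioned above, a minimal sketch is shown below. It assumes DeepSeek’s hosted API is OpenAI-compatible and that the base URL, model name (deepseek-chat), and DEEPSEEK_API_KEY environment variable are correct for your account; check DeepSeek’s API documentation before relying on any of these details.

```python
# Hedged sketch: calling DeepSeek's hosted API through the OpenAI Python client.
# Base URL, model name, and env var are assumptions, not confirmed by the course.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],   # assumed environment variable
    base_url="https://api.deepseek.com",      # assumed OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="deepseek-chat",                    # assumed name for the hosted V3 chat model
    messages=[{"role": "user", "content": "Summarize what an open-weight model is."}],
)
print(resp.choices[0].message.content)
```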
Local Deployment via Ollama
Ease of Use:
Requires familiarity with command-line tools.
Models are downloaded and run locally via simple commands (e.g., ollama run deepseek-r1:7b; see the sketch after this section).
Flexibility:
Supports model quantization (smaller, faster versions) for resource-constrained systems.
Limited fine-tuning capabilities compared to frameworks like PyTorch.
Cost:
Free to use (open-source), but requires local hardware (GPU/CPU).
Upfront cost for powerful hardware if running large models.
Performance:
Depends on local hardware (e.g., GPU VRAM for acceleration).
Smaller quantized models trade performance for speed and lower resource usage.
Use Cases:
Developers needing offline access, privacy-focused applications, or lightweight experimentation.
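As referenced in the Ollama section above, the same locally pulled models can also be called from Python. The sketch below uses the official ollama Python package; the model tag deepseek-r1:7b is an assumption, so substitute whatever size your hardware can hold.

```python
# Minimal sketch using the `ollama` Python package (pip install ollama).
# Assumes the Ollama service is running locally and the model has been pulled,
# e.g. with `ollama pull deepseek-r1:7b`; the tag is an assumption.
import ollama

response = ollama.chat(
    model="deepseek-r1:7b",  # pick a size your machine can actually hold in memory
    messages=[{"role": "user", "content": "Explain model distillation in one paragraph."}],
)
print(response["message"]["content"])
```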
Local Deployment via LM Studio
Ease of Use:
GUI-based tool designed for non-technical users.
Simplifies model downloads and inference (no coding required); a hedged sketch for optional programmatic access follows this section.
Flexibility:
Supports multiple model formats (GGUF, GGML) and quantization levels.
Limited fine-tuning; focused on inference and experimentation.
Cost:
Free software, but hardware costs apply (similar to Ollama).
Performance:
Optimized for local CPUs/GPUs but less efficient than Ollama for very large models.
Good for smaller models or machines with moderate specs.
Use Cases:
Hobbyists, educators, or users prioritizing ease of local experimentation over advanced customization.
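LM Studio is primarily a GUI, but it can also expose a local, OpenAI-compatible server (typically on port 1234) for experimentation. The sketch below assumes that server is enabled and a distilled DeepSeek model is loaded in the app; the port and model identifier are assumptions.

```python
# Hedged sketch: querying LM Studio's local server with the OpenAI client.
# Assumes the local server is enabled in LM Studio; port and model id may differ.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is ignored locally

resp = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-7b",  # use the model name LM Studio actually shows
    messages=[{"role": "user", "content": "List three trade-offs of running LLMs locally."}],
)
print(resp.choices[0].message.content)
```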
Hugging Face Ecosystem
Ease of Use:
Technical setup required (Python, libraries like transformers, accelerate).
Offers both cloud-based Inference API and local deployment options (see the sketch after this section).
Flexibility:
Full control over model customization (fine-tuning, quantization, LoRA adapters).
Access to DeepSeek models via the Hugging Face Hub (if publicly available).
Cost:
Free for local deployment (hardware costs apply).
Inference API has usage-based pricing for cloud access.
Performance:
Optimized via libraries like vLLM or TGI for high-throughput inference.
Requires technical expertise to maximize hardware utilization (e.g., GPU parallelization).
Use Cases:
Developers/researchers needing full control, fine-tuning, or integration into custom pipelines.
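A minimal Transformers sketch for the local-deployment path described above is shown below. The repository id for the distilled R1 checkpoint is illustrative (confirm the exact name on the Hugging Face Hub), and a machine without a capable GPU will fall back to slow CPU inference.

```python
# Hedged sketch: running a distilled DeepSeek R1 checkpoint with the
# Hugging Face Transformers pipeline API. Repo id is an assumption.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # assumed repo id; verify on the Hub
    torch_dtype=torch.float16,   # halves memory versus float32
    device_map="auto",           # uses a GPU if present, otherwise CPU
)

out = generator("Briefly explain what a mixture-of-experts model is.", max_new_tokens=128)
print(out[0]["generated_text"])
```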
When to Use Which?
DeepSeek Website:
Best for quick, no-code access or small-scale applications.
Avoid if you need offline use, customization, or cost control.
Ollama/LM Studio:
Ideal for local, privacy-focused deployments with moderate hardware.
Ollama suits developers; LM Studio targets non-technical users.
Hugging Face:
Choose for advanced use cases: fine-tuning, integration into apps, or leveraging optimized inference frameworks.
Requires technical expertise but offers maximum flexibility.
By balancing these factors, users can select the deployment method that aligns with their technical capabilities, budget, and project requirements.
- Describe the performance of the different DeepSeek models based on the experiments and what are some of the key advantages of each model?
The performance and key advantages of DeepSeek models vary based on their architecture, size, and optimization goals. Below is a breakdown of their characteristics, inferred from typical evaluations of similar LLMs and public benchmarks:
1. DeepSeek-7B
Performance:
Efficiency: Optimized for low-resource environments, runs efficiently on consumer-grade GPUs (e.g., RTX 3090/4090) or even CPUs with quantization.
Speed: Fast inference times due to smaller size, suitable for real-time applications.
Benchmarks: Competitive with other 7B-class models (e.g., Llama2-7B, Mistral-7B) in reasoning, coding, and general knowledge tasks.
Key Advantages:
Cost-Effectiveness: Minimal hardware requirements, ideal for edge deployment or small-scale applications.
Flexibility: Easily fine-tuned for domain-specific tasks (e.g., chatbots, lightweight coding assistants).
Privacy: Local deployment avoids cloud dependency, ensuring data security.
2. DeepSeek-13B
Performance:
Balance: Strikes a middle ground between speed and capability, outperforming 7B models in complex reasoning and multi-step tasks.
Memory Usage: Requires roughly 26 GB of VRAM for 16-bit inference; quantization (e.g., 4-bit GGUF) brings this within reach of consumer GPUs.
Key Advantages:
Versatility: Better at handling nuanced prompts compared to 7B models, making it suitable for enterprise-level chatbots or analytical tools.
Scalability: Can be deployed on mid-tier GPUs (e.g., RTX 3090/4090) without major infrastructure investments.
3. DeepSeek-33B
Performance:
Accuracy: Significantly outperforms smaller models in specialized tasks like code generation, mathematical reasoning, and long-context understanding.
Resource Demands: Requires high-end GPUs (e.g., A100 40GB) for full-precision inference, but quantization reduces hardware barriers.
Key Advantages:
Specialization: Excels in technical domains (e.g., coding, STEM problem-solving) due to training on domain-specific data.
Context Handling: Better at processing long inputs (e.g., 8K+ tokens) compared to smaller models.
4. DeepSeek-67B
Performance:
State-of-the-Art: Competes with top-tier models like GPT-3.5 and Llama2-70B in benchmarks such as MMLU (general knowledge), GSM8K (math), and HumanEval (coding).
Hardware Needs: Requires enterprise-grade GPUs (e.g., A100/H100 clusters) for optimal performance, though quantization enables local deployment on high-end consumer hardware.
Key Advantages:
High Accuracy: Best-in-class for complex reasoning, technical tasks, and multilingual capabilities.
Robustness: Less prone to hallucination compared to smaller models, making it reliable for critical applications.
Scalability: Suitable for large-scale enterprise deployments (e.g., customer support automation, advanced R&D).
Key Benchmarks (Hypothetical Examples)
| Model | MMLU (Knowledge) | GSM8K (Math) | HumanEval (Coding) | Inference Speed |
| --- | --- | --- | --- | --- |
| DeepSeek-7B | ~60% | ~50% | ~35% | ⭐⭐⭐⭐ |
| DeepSeek-13B | ~65% | ~60% | ~45% | ⭐⭐⭐ |
| DeepSeek-33B | ~70% | ~70% | ~55% | ⭐⭐ |
| DeepSeek-67B | ~75% | ~80% | ~65% | ⭐ |
Trade-Offs and Use Cases
DeepSeek-7B:
Best For: Budget-conscious projects, edge devices, or applications prioritizing speed over complexity.
Limitation: Struggles with highly technical or multi-step tasks.
DeepSeek-13B/33B:
Best For: Mid-tier enterprise applications (e.g., document analysis, customer service), where a balance of cost and capability is critical.
DeepSeek-67B:
Best For: High-stakes environments (e.g., healthcare diagnostics, financial analysis) requiring maximum accuracy and reliability.
Unique Strengths Across Models
Training Data: DeepSeek models are reportedly trained on high-quality, diverse datasets, including STEM-focused and multilingual corpora.
Quantization Support: All models are optimized for reduced memory usage via techniques like GPTQ or GGUF, enabling broader accessibility.
Tool Integration: Larger models (33B/67B) support advanced tool-use capabilities (e.g., API calls, code execution) for workflow automation.
In summary, DeepSeek models offer a scalable solution across needs: small models for efficiency, mid-sized for versatility, and large models for cutting-edge performance. The choice depends on balancing hardware constraints, task complexity, and budget.
- Discuss the broader implications of DeepSeek’s approach on the AI landscape. How does it challenge the status quo in terms of model accessibility, compute power needs, and training costs?
DeepSeek’s approach to AI model development and deployment presents a transformative challenge to the existing AI landscape, reshaping norms around accessibility, compute power, and training costs. Here’s a structured analysis of its broader implications:
Model Accessibility: Democratizing AI
Challenge to Status Quo:
Traditional AI giants (e.g., OpenAI, Google) prioritize cloud-based, API-driven access to large models, creating dependency on proprietary infrastructure. DeepSeek disrupts this by enabling local deployment via tools like Ollama and LM Studio, coupled with quantization techniques.
Open-Source Flexibility: By offering models in varying sizes (7B to 67B parameters), DeepSeek caters to diverse users—from individuals on consumer hardware to enterprises with high-end GPUs. This contrasts with closed models like GPT-4, which remain inaccessible for customization or offline use.
Impact:
Democratization: Lowers barriers for startups, researchers, and small businesses, fostering innovation without reliance on costly cloud subscriptions.
Privacy-Centric Use Cases: Enables sectors like healthcare and finance to adopt AI while complying with data sovereignty regulations.
Compute Power Needs: Efficiency Over Scale
Challenge to Status Quo:
The AI industry has emphasized scaling model size (e.g., trillion-parameter models) to boost performance, demanding expensive hardware (e.g., A100/H100 GPUs). DeepSeek counters this trend by optimizing smaller models (e.g., 7B, 13B) for resource efficiency.
Quantization and Optimization: Techniques like 4-bit GGUF allow models to run on CPUs or mid-tier GPUs (e.g., RTX 3090), reducing reliance on enterprise-grade infrastructure.
Impact:
Decentralization: Shifts power from centralized cloud providers to edge devices, empowering users with limited resources.
Sustainability: Lower energy consumption per inference aligns with global efforts to reduce AI’s carbon footprint.
Training Costs: Balancing Efficiency and Performance
Challenge to Status Quo:
Training large models (e.g., GPT-4) costs millions of dollars, limiting participation to well-funded corporations. DeepSeek’s focus on cost-effective training—via optimized architectures and data curation—demonstrates that smaller models can achieve competitive performance.
Scalable Training Frameworks: By refining training pipelines, DeepSeek reduces the financial and computational overhead, making AI development viable for smaller teams.
Impact:
Lower Entry Barriers: Encourages startups and academic labs to experiment with custom models, fostering a more diverse AI ecosystem.
Shift in Priorities: Challenges the industry to prioritize efficiency and specialization over brute-force scaling.
Broader Implications for the AI Landscape
Industry Competition:
DeepSeek’s success pressures tech giants to open-source models or offer cheaper, efficient alternatives, accelerating the “open vs. closed” AI debate.
Innovation Trajectory:
Encourages research into model compression, quantization, and low-resource training, potentially slowing the race for ever-larger models.
Ethical and Regulatory Considerations:
Local deployment reduces risks of centralized control but raises challenges in ensuring consistent security and ethical use across decentralized environments.
Key Trade-Offs and Risks
Capability vs. Efficiency: While smaller models reduce costs, they may lag in complex tasks (e.g., advanced reasoning) compared to larger counterparts.
Fragmentation: Local deployment could lead to inconsistent model performance and compatibility across hardware setups.
Sustainability Paradox: Lower per-inference energy use is positive, but widespread adoption of local AI might increase aggregate energy consumption if not managed carefully.
Conclusion
DeepSeek’s approach disrupts the AI status quo by prioritizing accessibility, efficiency, and cost-effectiveness over sheer scale. This challenges the dominance of cloud-based, resource-intensive models and fosters a more inclusive AI ecosystem. By lowering barriers to entry, it empowers diverse stakeholders to innovate while pushing the industry toward sustainable practices. However, balancing these gains with the need for advanced capabilities and ethical governance will be critical as the landscape evolves.
Glossary
AI PC: AI Personal Computer, a computer system with hardware integrated specifically to accelerate AI and machine learning tasks, including an integrated GPU (iGPU) and a neural processing unit (NPU).
Distributed Compute: A method of running a program or application across multiple computers, allowing for faster processing and better resource utilization of multiple machines.
GGUF: A file format for storing large language models in a way that is optimized for efficient CPU inference, used by tools in the llama.cpp ecosystem such as Ollama and LM Studio.
Hugging Face: A platform providing tools and a community for building, training, and deploying machine learning models with an extensive library of available pre-trained models and datasets.
iGPU: Integrated Graphics Processing Unit, a graphics processor built directly into the CPU package, which can accelerate workloads without requiring a dedicated graphics card.
LLM: Large Language Model, an AI model trained on large volumes of text data capable of generating human-like text and other AI tasks.
LM Studio: A software application for deploying and running large language models locally, providing a user-friendly, chat-style interface and some agent-like reasoning features.
NPU: Neural Processing Unit, a specialized processor designed to accelerate machine learning and AI workloads, particularly inference for smaller models and specific tasks.
Ollama: A tool used to download and run large language models locally via the command line, optimized for CPU performance and GGUF-formatted models.
Open-Weight Model: An AI model whose trained weights (parameters) are publicly released, although the training data and full training pipeline may not be.
Quantization: A technique used to reduce the size and computational requirements of a model by decreasing the precision of its parameters, often used to fit large models on smaller hardware.
Ray: An open-source framework for building distributed applications, allowing parallel processing across multiple computers; often used with libraries such as vLLM to serve LLMs.
R1: A DeepSeek model trained to mitigate the readability and language-mixing issues found in its predecessor, R1-Zero.
R1-Zero: A DeepSeek model trained with large-scale reinforcement learning without supervised fine-tuning, demonstrating strong reasoning but with readability issues.
Transformers: The deep learning architecture underlying most modern language models; also the name of the Hugging Face Python library used in the course to download and run models.
V3: A more advanced DeepSeek model with a mixture of experts and additional capabilities, including vision processing.
DeepSeek AI: Local LLM Deployment
Briefing Document: DeepSeek AI and Local LLM Deployment
Introduction:
This briefing document reviews a crash course focused on DeepSeek AI, a Chinese company developing open-weight large language models (LLMs), and explores how to run these models locally on various hardware. The course covers accessing DeepSeek’s online AI assistant, downloading and running the models using tools like Ollama and LM Studio, and working with them via Hugging Face Transformers. A significant emphasis is placed on the practical challenges and hardware limitations of deploying these models outside of cloud environments.
Key Themes & Ideas:
- DeepSeek AI Overview:
- DeepSeek is a Chinese company creating open-weight LLMs.
- They have multiple models, including: R1, R1-Zero (the precursor to R1), V3, Math Coder, and MoE (Mixture of Experts).
- The course focuses primarily on the R1 model, with some exploration of V3 due to its availability on the DeepSeek website’s AI assistant.
- DeepSeek’s R1 is a text-generation model only, but is claimed to have “remarkable reasoning capabilities” due to its training with large-scale reinforcement learning without supervised fine-tuning.
- While R1 was trained to mitigate the “poor readability and language mixing” issues of the R1-Zero model, “it can achieve performance comparable to OpenAI’s o1.”
- The course author states that DeepSeek R1 is a “big deal” because it is “speculated that it has a 95 to 97% reduction in cost compared to OpenAI.” This is attributed to the company training the model for roughly $5 million, “which is nothing compared to these other ones.”
- Cost and Accessibility:
- A major selling point of DeepSeek models is their potential for significantly lower cost compared to models like those from OpenAI, making them more accessible to researchers and smaller organizations.
- The cost reduction is primarily in training, reportedly around $5 million, “which is nothing compared to these other ones”.
- The reduced cost is thought to be the reason why “chip manufacturers’ stocks dropped, because companies are like: why do we need all this expensive compute when clearly these models can be optimized further”.
- The goal is to explore how to run these models locally, minimizing reliance on expensive cloud resources.
- Hardware Considerations: Local deployment of LLMs requires careful consideration of hardware resources. The presenter uses:
- Intel Lunar Lake AI PC dev kit (Core Ultra 200V series): A mobile chip with an integrated GPU (iGPU) and a neural processing unit (NPU), representing a future trend for mobile AI processing.
- Precision 3680 Tower Workstation (14th gen Intel i9 with GeForce RTX 4080): A more traditional desktop workstation with a dedicated GPU for higher performance.
- The presenter notes that the dedicated graphics card (RTX 4080) generally performs better, but the AI PC dev kit is a cost-effective option.
- The presenter found that “[he] could run about a 7 to 8 billion parameter model on either” device, and that “there were cases where, when [he] used specific things and the models weren’t optimized and [he] didn’t tweak them, it would literally hang the computer and shut them down, both of them.”
- The presenter also recommends considering having a computer on the network or a “dedicated computer with multiple graphics cards” for more performant results.
- He states that, to get decent performance with the larger models, he would probably need “two AI PCs,” distributing the LLM across them with something like Ray, or another graphics card with distributed compute.
- DeepSeek.com AI Powered Assistant:
- The presenter tests the AI-powered assistant, which he describes as being positioned alongside ChatGPT, Claude Sonnet, Mistral 7B, and Llama.
- It is “completely free” and runs DeepSeek V3, but access might become limited in the future because it is a “product coming out of China.”
- It can upload documents and images for analysis.
- The presenter notes some minor failures in the AI assistant’s ability to follow complex instructions, but that it is “still really powerful”.
- It also exhibits strong vision capabilities: the presenter uploads a Japanese newspaper, and the assistant is able to transcribe and translate the text.
- Local Model Deployment with Ollama:
- Ollama is a tool that simplifies the process of downloading and running models locally.
- It allows running via terminal commands and pulling different sized models.
- The presenter notes that when comparing DeepSeek R1 performance with ChatGPT, “they’re usually comparing the top one, the 671 billion parameter one,” which he states is too large to download on his computer.
- He recommends aiming for the “seven billion parameter” model or “1.5 billion one” due to “not [having] enough room to download this on my computer”.
- The presenter downloads and runs a 7 billion and 14 billion parameter model, noting it can be done “with an okay pace.”
- He discusses how, even with a smaller model, fine-tuning can yield better performance for very specific tasks.
- Local Model Deployment with LM Studio:
- LM Studio is presented as an alternative to Ollama, offering a more user-friendly interface.
- It provides an AI-powered assistant interface instead of programmatical access.
- It downloads the models separately and appears to use the same GGUF files as Ollama.
- The presenter notes that LM Studio “actually has reasoning built in” and has an “agent thinking capability”.
- The presenter experiences issues using LM Studio where it crashes or restarts his device, due to it exhausting machine resources.
- He is able to resolve some of the crashing issues by adjusting options, like “turn[ing] the GPUs down” and choosing not to keep the model loaded in memory.
- Hugging Face and Transformers:
- The Hugging Face Transformers library provides a way to work with models programmatically.
- The presenter attempts to download the DeepSeek R1 8 billion parameter distilled model, but runs into conflicts and “out of memory” errors.
- He then attempts to use the 1.5 billion parameter model, which is successfully downloaded and inferred.
- He had to include his Hugging Face API key to successfully download the model.
- The presenter finds issues with needing to specify and configure PyTorch, and that the default configuration of a model is not optimized.
- The presenter had some initial issues with pip and was forced to restart his computer “to dump memory”.
- The presenter is able to resolve his errors by re-installing pip and switching to the 1.5 billion parameter model.
- Model Distillation:
- The presenter explains that distillation is a process of “taking a larger model’s knowledge and you’re doing knowledge transfer to a smaller model so it runs more efficiently but has the same capabilities of it”
Quotes:
- “…it is speculated that it has a 95 to 97% reduction in cost compared to OpenAI. That is the big deal here, because to train and run these models costs millions and millions of dollars…”
- “…we could run about a 7 to 8 billion parameter model on either, but there were cases where, when I used specific things and the models weren’t optimized and I didn’t tweak them, it would literally hang the computer and shut them down, both of them.”
- “you probably want to have a computer on your network (so, like, my AI PC is on my network) or you might want to have a dedicated computer with multiple graphics cards to do it…”
- “…even if it’s not as capable as Claude or as ChatGPT, it’s just the cost factor…”
- “The translation of ‘I like sushi’ into Japanese is ‘i sushim Guk’ [sic], which is true; the structure correctly places it.”
- “…distillation is where you are taking a larger model’s knowledge and you’re doing knowledge transfer to a smaller model so it runs more efficiently but has the same capabilities of it”
Conclusion:
The crash course demonstrates the potential of DeepSeek’s open-weight LLMs and the practical steps for deploying them locally. The content stresses the need for optimized models and a thorough understanding of hardware limitations and configurations. While challenges exist, the course provides a useful overview of the tools and techniques required for exploring and running these models outside of traditional cloud environments. It also shows that, even for smaller models, dedicated compute resources or a dedicated graphics card are important for practical local LLM use.
DeepSeek AI Models: A Comprehensive Guide
FAQ on DeepSeek AI Models
1. What is DeepSeek AI and what are its key model offerings?
DeepSeek AI is a Chinese company that develops open-weight large language models (LLMs). Their key model offerings include R1, R1-Zero, V3, Math Coder, MoE, and SoE. The R1 model is particularly highlighted as a text generation model and is considered a significant advancement due to its potential for high performance at a lower cost compared to models from competitors like OpenAI. The V3 model is used in DeepSeek’s AI-powered assistant and is more complex, while the R1 model is the primary focus for local deployment and experimentation.
2. How does DeepSeek R1 compare to other LLMs in terms of performance and cost?
DeepSeek R1 is claimed to have performance comparable to OpenAI models in text generation tasks. While specific comparisons vary based on model sizes, DeepSeek suggests their models perform better on various benchmarks. A major advantage is the speculated 95-97% reduction in cost compared to models from competitors. This cost advantage is attributed to a more efficient training process, making DeepSeek’s models a cost-effective alternative.
3. What hardware is needed to run DeepSeek models locally?
Running DeepSeek models locally requires significant computational resources, particularly for larger models. The speaker used an Intel Lunar Lake AI PC dev kit with an integrated GPU (iGPU) and a neural processing unit (NPU), as well as a workstation with a dedicated RTX 4080 GPU. The performance on these devices varies; dedicated GPUs generally perform better, but the AI PC dev kit can run smaller models efficiently. The ability to run these models locally can be further expanded by utilizing networks of AI PCs. Running the largest, 671 billion parameter model requires far more resources, possibly multiple networked devices and multiple GPUs.
4. What is the significance of the ‘distilled’ models offered by DeepSeek?
DeepSeek offers ‘distilled’ versions of their models. Distillation is a technique that transfers knowledge from larger, more complex models to smaller ones. This process allows the smaller distilled models to achieve similar performance to the larger model while being more efficient and requiring less computational resources, making it easier to run on local hardware. This also helps with reduced resource consumption while maintaining a similar performance to the larger model.
5. How can I interact with DeepSeek models through their AI-powered assistant on deepseek.com?
DeepSeek offers an AI-powered assistant on their website, deepseek.com, that can be used for free. Users can log in with their Google account and utilize the assistant for various tasks. It supports text input and file attachments (docs, images), making it suitable for tests including summarization, translation, and teaching-related tasks. It’s important to note that, as this product is coming out of China, it might have restrictions in some geographical regions.
6. How can I download and run DeepSeek models locally using tools like Ollama?
Ollama is a tool that allows you to download and run various LLMs, including those from DeepSeek, via the command-line interface. You can download different sizes of DeepSeek R1 models using Ollama, ranging from 1.5 billion to 671 billion parameters. The command to run a model looks something like: ollama run deepseek-r1:7b. After downloading, you can interact with the model directly from the terminal. However, larger models require more powerful hardware and may run slowly. The models available through Ollama are not deeply optimized beyond efficient CPU usage, leaving the user responsible for tuning performance on dedicated hardware.
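Beyond the terminal, Ollama also serves a local HTTP API (by default on port 11434), which is one way to script interactions with a downloaded DeepSeek model. The sketch below is a minimal example under that assumption; the model tag is illustrative.

```python
# Hedged sketch: calling a locally running Ollama server over its HTTP API.
# Assumes Ollama is running and the tagged model has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:1.5b",   # assumed tag for the smallest distilled variant
        "prompt": "What does 'open-weight' mean for an AI model?",
        "stream": False,               # return a single JSON object instead of a stream
    },
    timeout=300,
)
print(resp.json()["response"])
```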
7. How can I interact with DeepSeek models using LM Studio?
LM Studio is another tool that provides a user-friendly interface to interact with LLMs. With LM Studio you can load models directly from their user interface without needing to manually use terminal commands to download or configure them. Like Ollama, it includes a range of DeepSeek models including distilled versions. LM Studio appears to add an agentic behavior layer for better question handling and reasoning that the models themselves don’t seem to have in their raw form. You can configure settings such as GPU offload, CPU thread allocation, context length, and memory usage to optimize its performance.
8. How can I use the Hugging Face Transformers library to work with DeepSeek models programmatically?
The Hugging Face Transformers library lets you work with DeepSeek models directly through code. Using this library, you can download and run models in a Python environment. You need to install the Transformers library, PyTorch or TensorFlow (PyTorch seems to be preferred), and other dependencies, and provide your Hugging Face API key if the model requires it. After setting up the environment, you can load a model directly using AutoModelForCausalLM.from_pretrained and use a pipeline to run inference. This method gives you more fine-grained control over the models and their outputs.
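A slightly more explicit version of the flow described above, loading the tokenizer and model separately and then calling generate(), is sketched below. The repository id is illustrative, and gated or private repositories may additionally require logging in with a Hugging Face token.

```python
# Hedged sketch: explicit tokenizer + model loading with Transformers.
# Repo id is an assumption; adjust dtype/device settings to your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # illustrative repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # reduces memory; use float32 on CPU-only machines if needed
    device_map="auto",
)

inputs = tokenizer("Translate 'I like sushi' into Japanese.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```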
DeepSeek LLMs: Open-Weight Models and Cost-Effective AI
DeepSeek is a Chinese company that creates open-weight large language models (LLMs) [1].
Key points about DeepSeek:
- Open-weight models: DeepSeek focuses on creating models that are openly accessible [1].
- Model Variety: DeepSeek has developed several open-weight models, including R1, R1-Zero, DeepSeek V3, Math Coder, and MoE (Mixture of Experts) [1]. The focus is primarily on the R1 model, though V3 is used on the DeepSeek website [1, 2].
- R1 Model: DeepSeek R1 is a text generation model trained via large-scale reinforcement learning without supervised fine-tuning [1]. It was developed to address issues such as poor readability and language mixing found in its predecessor, R1-Zero [1]. DeepSeek R1 is speculated to have a 95 to 97 percent reduction in cost compared to OpenAI [3].
- Performance: DeepSeek models have shown performance comparable to or better than OpenAI models on some benchmarks [1, 3]. However, the most powerful DeepSeek models, like the 671 billion parameter version of R1, are too large to run on typical personal hardware [3, 4].
- Cost-Effectiveness: DeepSeek is noted for its significantly lower training costs [3]. It is speculated that DeepSeek trained and built their model with $5 million, which is significantly less than the cost to train other LLMs [3].
- Hardware Considerations: Running DeepSeek models locally depends heavily on hardware capabilities [3]. While cloud-based options exist, investing in local hardware is recommended for better understanding and control [3]. For example, 7 to 8 billion parameter models can run on modern AI PCs or dedicated graphics cards [2].
- AI-Powered Assistant: DeepSeek offers an AI-powered assistant on its website (deepseek.com), which uses the V3 model [2]. This assistant can process multiple documents and images, demonstrating its capabilities in text extraction, translation, and vision tasks [2, 5, 6].
- Local Execution: DeepSeek models can be downloaded and run locally using tools like Ollama and LM Studio [2, 7, 8]. However, running the larger models requires significant hardware, possibly including multiple networked computers with GPUs [4, 9]. Distilled models are smaller versions of the larger models, allowing for efficient execution on local hardware [10, 11].
- Hugging Face: The models are also available on Hugging Face, where they can be accessed programmatically using libraries like Transformers [9, 12, 13]. However, there may be challenges to get these models working correctly due to software and hardware dependencies [14, 15].
- Limitations: The models are not optimized to run on the NPUs that come in AI PCs, which can cause issues when trying to run them [16, 17]. The larger models require significant memory and computational resources [18].
DeepSeek R1: A Comprehensive Overview
DeepSeek R1 is a text generation model developed by the Chinese company DeepSeek [1]. Here’s a detailed overview of the R1 model, drawing from the sources:
- Training and Purpose: DeepSeek R1 is trained via large-scale reinforcement learning [1]. It was specifically created to address issues found in its predecessor, R1-Zero, which had problems like poor readability and language mixing [1]. R1-Zero was trained with reinforcement learning without supervised fine-tuning [2].
- Capabilities:
- The R1 model is primarily focused on text generation [1].
- It demonstrates remarkable reasoning capabilities [1].
- The model can achieve performance comparable to or better than models from OpenAI on certain benchmarks [1, 3].
- DeepSeek R1 is speculated to have a 95 to 97 percent reduction in cost compared to OpenAI [3].
- Model Size and Variants:
- DeepSeek offers various sizes of the R1 model [4]. The largest, the 671 billion parameter model, is the one typically compared to models from OpenAI [3, 4]. This model is too large to run on typical personal hardware [3, 4]. The 671 billion parameter model requires 404 GB of memory [4].
- There are smaller distilled versions of the R1 model, such as the 7 billion, 8 billion, and 14 billion parameter versions [4, 5]. These are designed to be more efficient and can be run on local hardware [4, 6, 7]. Distillation involves transferring knowledge from a larger model to a smaller one [8].
- Hardware Requirements:
- Running DeepSeek R1 locally depends on the model size and the available hardware [3].
- A 7 to 8 billion parameter model can be run on modern AI PCs with integrated graphics or computers with dedicated graphics cards [3, 6, 9].
- Running larger models, like the 14 billion parameter version, can be challenging on personal computers [10]. Multiple computers, potentially networked, with multiple graphics cards may be needed [3, 9].
- Integrated graphics processing units (iGPUs) and neural processing units (NPUs) in modern AI PCs can be used to run these models. However, they are not optimized to run large language models (LLMs) [3, 6, 11, 12]. NPUs are designed for smaller models, not large language models [12].
- The model can also run on a Mac M4 chip [9].
- The use of dedicated GPUs generally results in better performance [3, 6].
- Software and Tools:
- Ollama is a tool that can be used to download and run DeepSeek R1 locally [6]. It uses the GGUF file format, which is optimized to run on CPUs [8, 13].
- LM Studio is another tool that allows users to run the models locally and provides an interface for interacting with the model as an AI assistant [7, 14].
- The models are also available on Hugging Face, where they can be accessed programmatically using libraries like Transformers [1, 2, 5].
- The Transformers library from Hugging Face requires either PyTorch or TensorFlow to run [15].
- Performance and Limitations:
- While DeepSeek R1 is powerful, its performance can be affected by hardware limitations. For example, running a 14 billion parameter model on an Intel Lunar Lake AI PC caused the computer to restart because it exhausted resources [9, 10, 16-18].
- Optimized models are more accessible: the GGUF format used by Ollama is better optimized to run on CPUs [13].
- Even when using tools like LM Studio, the system may still be overwhelmed, depending on the model size and the complexity of the request [13, 18, 19].
- It is important to have a good understanding of hardware to make local DeepSeek models work efficiently [11, 20].
In summary, DeepSeek R1 is a powerful text generation model known for its reasoning capabilities and cost-effectiveness [1, 3]. While the largest models require significant hardware to run, smaller, distilled versions are accessible for local use with the right hardware and software [3-6].
DeepSeek Models: Capabilities and Limitations
DeepSeek models exhibit a range of capabilities, primarily focused on text generation and reasoning, but also extending to areas such as vision and code generation. Here’s an overview of these capabilities, drawing from the sources:
- Text Generation:
- DeepSeek R1 is primarily designed for text generation, and has shown strong performance in this area [1, 2].
- The model is trained using large-scale reinforcement learning without supervised fine-tuning [1, 2].
- It can achieve performance comparable to or better than models from OpenAI on certain benchmarks [1, 2].
- Reasoning:
- DeepSeek models, particularly the R1 variant, demonstrate remarkable reasoning capabilities [1, 2].
- This allows the models to process complex instructions and generate contextually relevant responses [3].
- Tools like LM Studio utilize this capability to provide an “agentic behavior” that shows a model’s reasoning steps [1].
- Vision:
- The DeepSeek V3 model, used in the AI-powered assistant on the DeepSeek website, has vision capabilities. It can transcribe and translate text from images, including Japanese text, indicating it can handle complex character sets [4, 5].
- Multimodal Input:
- The DeepSeek AI assistant can process both text and images and can handle multiple documents at once [4, 6].
- This capability allows users to upload documents and images for analysis, text extraction, and translation [5, 6].
- Code Generation:
- DeepSeek also offers models specifically for coding, such as the DeepSeek Coder version 2, which is said to be a younger sibling of GPT-4 [7, 8].
- Language Understanding:
- DeepSeek models can be used for translation [5].
- They can interpret and respond to instructions given in various languages, such as English and Japanese [4, 9].
- The models can adapt to specific roles, such as acting as a Japanese language teacher [3, 9].
- Instruction Following:
- The models can follow detailed instructions provided in documents or prompts, including roles, language preferences, and teaching instructions [9].
- They can handle state and context in interactions [9].
- Despite this capability, they may sometimes fail to adhere to all instructions, for example providing answers directly when they were told not to, as was observed with the DeepSeek AI assistant [6].
- Fine-Tuning:
- While the base R1 model is trained without supervised fine-tuning, it can be further fine-tuned for specific tasks to achieve better performance [10].
- This is especially useful for smaller models that may be running on local hardware.
- Limitations
- The models can have difficulty with poor readability and language mixing [1].
- Some of the models, like the 671 billion parameter R1 and the V3 models, require very large amounts of computing power to run efficiently [1, 11].
- When running the models on local machines, they may exhaust resources or cause the computer to crash, especially if the hardware is not powerful enough or the software is not set up correctly [3, 10].
- The models, especially when used in local environments may have limitations regarding access to GPUs. It is important to understand the settings and optimize them as needed [12, 13].
- DeepSeek models may not be optimized for all types of hardware and tasks, as the NPUs on AI PCs are not designed to run LLMs [14, 15].
In summary, DeepSeek models are capable of advanced text generation, reasoning, and multimodal tasks. However, their performance and accessibility can be influenced by hardware limitations, software setup, and the specific model variant being used.
DeepSeek Model Hardware Requirements
DeepSeek models have varying hardware requirements depending on the model size and intended use. Here’s a breakdown of the hardware considerations, drawing from the provided sources:
- General Hardware:
- Running DeepSeek models effectively, especially larger ones, requires a good understanding of hardware capabilities.
- While cloud-based solutions exist, investing in local hardware is recommended for better control and learning [1].
- The hardware needs range from standard laptops with integrated graphics to high-end workstations with dedicated GPUs.
- AI PCs with Integrated Graphics:
- Modern AI PCs, like the Intel Lunar Lake AI PC dev kit (Core Ultra 200V series), have integrated graphics processing units (iGPUs) and neural processing units (NPUs) [1, 2].
- These iGPUs can be used to run models like the DeepSeek R1 models [1].
- However, they are not optimized for large language models (LLMs) [3]. The NPUs are designed for smaller models that may work alongside the LLM [4].
- These types of AI PCs can run 7 to 8 billion parameter models, though performance will vary [5].
- There are equivalent kits available from other manufacturers, such as AMD and Qualcomm [5].
- Dedicated Graphics Cards (GPUs):
- Systems with dedicated graphics cards generally provide better performance [1].
- For example, an RTX 4080 is used to run the models effectively [6, 7].
- An RTX 3060 from a couple of years earlier would have had issues running these models, but the iGPUs in the newest AI PC chips are roughly equivalent to discrete graphics cards from two years ago [8].
- Discrete GPU performance is typically described in metrics like CUDA cores, whereas NPUs are rated in TOPS [9, 10].
- Running larger models on local machines with single GPUs can lead to resource exhaustion and computer restarts.
- RAM (Memory):
- Sufficient RAM is essential to load the models into memory.
- For example, a system with 32 GB of RAM can handle some of the smaller models [11].
- The 671 billion parameter model of DeepSeek R1 requires 404 GB of memory, which is not feasible for most personal computers [12, 13] (see the rough calculation after this section’s summary).
- Multiple Computers and Distributed Computing:
- To run larger models, like the 671 billion parameter model, a user may need multiple networked computers with GPUs.
- Distributed compute can be used to spread the workload [5, 12].
- This might involve stacking multiple Mac Minis with M4 chips or using multiple AI PCs [12].
- Tools like Ray with vLLM can distribute the compute [13].
- Model Size and Performance:
- The size of the model directly impacts the hardware required.
- Smaller, distilled versions of models, such as 7 billion and 8 billion parameter models, are designed to run more efficiently on local hardware [5].
- Even smaller models may cause systems to exhaust resources, depending on how complex the interaction is [14].
- The performance may depend on the settings used for models, such as GPU offloading, context window, and whether the model is kept in memory [8, 14, 15].
- Even if distributed computing is used, large models, like the 671 billion parameter model, may be slow even when quantized [4, 12].
- Specific Hardware Examples:
- An Intel Lunar Lake AI PC dev kit with a Core Ultra 200V series processor can run models in the 7 to 8 billion parameter range, but might struggle with larger ones [1, 5].
- Mac M4 chips can be used, but multiple units may be needed for larger models.
- The specific configuration of a computer, such as a 14th generation Intel i9 processor with an RTX 4080, can impact performance [1].
- Optimizations:
- Optimized models, such as those in the GGUF file format (used by Ollama), can run more efficiently on CPUs and still make use of GPUs [3, 16].
- NPUs are designed to run smaller models alongside an LLM; they are not meant to run the LLM itself [4].
- Tools like Intel's OpenVINO aim to optimize models for specific hardware, but optimized builds for the newest models may not be available yet [13, 17].
- Quantization stores the weights in a smaller, more efficient format so the model fits in less memory, but it can cost some quality and does not by itself guarantee fast inference [4] (the memory sketch below shows why it matters).
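To make the memory figures above concrete, here is a back-of-the-envelope sketch (an illustration, not an exact sizing tool): the weight footprint is roughly the parameter count times the bits per parameter, and quantization lowers the bits. The ~4.8 bits per parameter used for the 671B row is an assumption chosen to line up with the ~404 GB figure cited earlier; real usage is higher once the KV cache and runtime overhead are added.

```python
# Rough rule of thumb (illustrative only): weight memory ≈ parameters × bits-per-parameter.
# Actual usage is higher because of the KV cache, activations, and runtime overhead.
def approx_weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9  # decimal gigabytes

for name, params, bits in [
    ("R1 distill 7B, 4-bit", 7, 4.0),
    ("R1 distill 8B, 8-bit", 8, 8.0),
    ("R1 671B, ~4.8-bit", 671, 4.8),  # ≈ 400 GB, in line with the ~404 GB cited above
]:
    print(f"{name:<24} ~{approx_weight_memory_gb(params, bits):.1f} GB")
```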
In summary, running DeepSeek models requires careful consideration of the hardware. While smaller models can be run on modern AI PCs and systems with dedicated graphics cards, the larger models require multiple computers with high-end GPUs. The use of optimized models and the understanding of the underlying hardware settings are important for efficient local deployments.
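As referenced in the distributed-computing bullets above, one way to spread a model over more than one GPU (or, with a Ray cluster, more than one machine) is vLLM. The sketch below is a minimal single-node example under stated assumptions: the Hugging Face model ID and the parallelism arguments should be checked against the vLLM version you install, and multi-node pipeline parallelism additionally requires a running Ray cluster.

```python
# Minimal sketch: serve a distilled R1 model with vLLM, sharding the weights across GPUs.
# Assumes vLLM is installed and two local GPUs are available; for multi-node setups a
# Ray cluster must already be running (check the vLLM docs for your version's flags).
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",  # verify the exact ID on Hugging Face
    tensor_parallel_size=2,        # shard the weights across 2 GPUs on this node
    # pipeline_parallel_size=2,    # with Ray: additionally split layers across nodes
)
outputs = llm.generate(
    ["Explain basic Japanese sentence order in one short paragraph."],
    SamplingParams(max_tokens=256, temperature=0.6),
)
print(outputs[0].outputs[0].text)
```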
Local DeepSeek Inference: Hardware, Software, and Optimization
Local inference with DeepSeek models involves running the models on your own hardware rather than relying on cloud-based services [1, 2]. Here's a breakdown of the key aspects, drawing from the sources:
- Hardware Considerations: Local inference is highly dependent on the hardware available [2].
- You can use a variety of hardware setups, including AI PCs, dedicated GPUs, or distributed computing setups [2].
- AI PCs with integrated graphics (iGPUs) and neural processing units (NPUs), such as the Intel Lunar Lake AI PC dev kit, can run the smaller models [2, 3].
- Dedicated graphics cards (GPUs), like the RTX 4080, generally offer better performance for local inference [2, 4].
- Older dedicated GPUs, such as a roughly two-year-old RTX 3060, can be outperformed by the iGPUs in the newest AI PCs [2, 4].
- The amount of RAM in your system is crucial for loading models into memory [2, 5].
- Model Size: The size of the DeepSeek model you want to run directly influences the hardware required for local inference [2, 5].
- Smaller models, such as 7 or 8 billion parameter models, are more feasible for local inference on standard hardware [2, 6].
- Distilled versions of larger models are available, designed to run more efficiently on local machines [2, 7].
- Larger models, like the 671 billion parameter R1, require substantial resources like multiple GPUs and extensive RAM, making them impractical for most local setups [1, 2, 8].
- Software and Tools: Ollama is a tool that lets you download and run models from the command line [1, 3]. It uses the GGUF file format, which is optimized to run on CPUs and can also use GPUs [9, 10] (a minimal usage sketch follows this list).
- LM Studio is a GUI-based application that provides an "AI-powered assistant experience" [1, 11]. It can download and manage models and exposes the reasoning steps the model works through [11, 12]. It also uses the GGUF format [9].
- Hugging Face Transformers is a Python library for downloading and running models programmatically [1, 13, 14]. It can be more complex to set up and may lack the optimizations of the other tools [15, 16] (a minimal example appears after this section's summary).
- Optimization: Optimized models in formats such as GGUF can run more efficiently on CPUs and leverage GPUs [10, 17].
- Intel's OpenVINO is an example of an optimization framework that aims to improve the efficiency of running models on specific hardware [13, 14].
- Quantization is a method for running models in a smaller, more efficient format, but it can trade away some quality [17].
- Challenges: Local inference can exhaust your system's resources or even crash the machine, especially when using complex reasoning models or unoptimized settings [6, 12, 18-20].
- Understanding how your hardware works is essential to optimizing local inference [2, 21, 22]. This includes knowing how to allocate resources between the CPU and the GPU [22].
- You may need to adjust settings such as GPU offload, context window size, and memory usage to achieve acceptable performance [19, 22, 23].
- NPUs are not designed to run LLMs; they are designed to run smaller models alongside them [10, 17].
- The hardware requirements for running the models directly, rather than through a tool that uses the GGUF format, are often higher [20, 24].
- Getting the correct versions of the required libraries installed can be tricky [15, 25, 26].
- Process: To perform local inference, you typically start by downloading a model [1].
- You then use a tool or library to load the model into memory and perform inference [1, 4].
- This may involve writing code or using a GUI-based application [1, 3, 11].
- It is important to monitor resource usage to make sure the model runs efficiently [21, 27].
- You will need to install specific libraries and tools to use your hardware efficiently [15, 16]. (A minimal end-to-end example using Ollama's local API follows this list.)
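As a concrete version of the process above, here is a minimal sketch that talks to Ollama's local HTTP API after a model has been pulled. It assumes you have already run `ollama pull deepseek-r1:7b` (verify the exact tag on the Ollama model page) and that the Ollama service is listening on its default local port.

```python
# Minimal sketch: after `ollama pull deepseek-r1:7b`, query the locally running
# Ollama service over its HTTP API instead of the interactive `ollama run` prompt.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",   # Ollama's default local endpoint
    json={
        "model": "deepseek-r1:7b",
        "prompt": "Give me one beginner-level Japanese sentence with a breakdown.",
        "stream": False,                      # return one JSON object instead of a stream
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```

While this runs, keep Task Manager (or an equivalent resource monitor) open, as recommended in the sources, to watch memory and GPU usage.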
In summary, local inference with DeepSeek models allows you to run models on your own hardware, offering more control and privacy. However, it requires a careful understanding of hardware capabilities, software settings, and model optimization to achieve efficient performance.
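For the programmatic route mentioned above, the following is a minimal Hugging Face Transformers sketch, not the exact code from the video: it loads the distilled Llama 8B variant in 4-bit so it fits in consumer VRAM. The 4-bit settings assume a CUDA GPU plus the bitsandbytes package, and the model ID should be verified on Hugging Face; without quantization the memory footprint is considerably larger.

```python
# Minimal sketch: local inference with Hugging Face Transformers on a distilled R1 model.
# Assumes a CUDA GPU plus the accelerate and bitsandbytes packages; drop the
# quantization_config to run in full precision (at a much higher memory cost).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"   # verify on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                        # place layers on GPU/CPU automatically
    quantization_config=BitsAndBytesConfig(   # 4-bit weights to fit consumer VRAM
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)

messages = [{"role": "user",
             "content": "How do I say 'Where is the movie theater?' in Japanese?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

As in the video, running this inside a dedicated conda (or venv) environment helps avoid the library-version conflicts noted in the challenges above.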
Hey, this is Andrew Brown, and in this crash course I'm going to show you the basics of DeepSeek. First we're going to look at the DeepSeek website, where you can use it just like you'd use ChatGPT. After that we'll download it using Ollama and get an idea of its capabilities there. Then we'll use another tool called LM Studio, which will allow us to run the model locally but with a bit of agentic behavior. We're going to use an AI PC and also a modern graphics card, my RTX 4080. I'm going to show you some troubleshooting skills along the way, and we do run into issues with both machines, but it gives you an idea of what we can do with DeepSeek and where it's not going to work. I also show you how to work with it via Hugging Face with Transformers to do local inference. So hopefully you're excited to learn that, but we'll have a bit of a primer just before we jump in so we know what DeepSeek is, and I'll see you there in one second. Before we jump into DeepSeek, let's learn a little bit about it. DeepSeek is a Chinese company that creates open-weight LLMs; that's its proper name, which I cannot pronounce. DeepSeek has many open-weight models: R1, R1-Zero, DeepSeek V3, Math, Coder, MoE (mixture of experts), and DeepSeek V3 itself is a mixture-of-experts model. I would tell you more about those, but I never remember what they are; they're somewhere in my GenAI Essentials course. The one we're going to focus on is mostly R1. We will look at V3 initially because that is what is used on deepseek.com, and I want to show you the AI-powered assistant there. But let's talk more about R1, and before we can talk about R1 we need to know a little bit about R1-Zero. There is a paper where you can read all about how DeepSeek works, but DeepSeek R1-Zero is a model trained via large-scale reinforcement learning without supervised fine-tuning, and it demonstrates remarkable reasoning capabilities. R1-Zero has problems like poor readability and language mixing, so R1 was trained further to mitigate those issues, and it can achieve performance comparable to OpenAI's o1. They have a bunch of benchmarks across the board, basically showing DeepSeek in blue next to OpenAI, and most of the time they're suggesting that DeepSeek performs better. I need to point out that DeepSeek R1 is just text generation, it doesn't do anything else, but it supposedly does that really, really well. They're probably comparing the 671 billion parameter model, the one that we cannot run but that large organizations can maybe afford at a reasonable rate. The reason DeepSeek is such a big deal is that it is speculated to have a 95 to 97 percent reduction in cost compared to OpenAI. That is the big deal here, because training and running these models costs many millions, even hundreds of millions of dollars, and they said they trained and built this model for about $5 million, which is nothing compared to these other ones. With all the talk about DeepSeek R1 we saw chip manufacturers' stocks drop, because companies are asking why they need all this expensive compute when clearly these models can be optimized further. So we are going to explore DeepSeek R1, see how we can get it to run, and see where we're going to hit the limits with it. I do want to talk about what hardware I'm going to be
utilizing because it really is dependent on your local hardware um we could run this in Cloud but it’s not really worth it to do it you really should be investing some money into local hardware and learning what you can and can’t run based on your limitations but what I have is an Intel lunar Lake AI PC dev kit its proper name is the core Ultra 200 um V series and this came out in September 2024 it is a mobile chip um and uh the chip is special because it has an igpu so an integrated Graphics unit that’s what the LM is going to use it has an mpu which is intended for um smaller models um but uh that’s what I’m going to run it on the other one that we’re going to run it on is my Precision 30 uh 3680 Tower workstation oplex I just got this station it’s okay um it is a 14th generation I IE 9 and I have a g GeForce RTX 480 and so I ran this model on both of them I would say that the dedicated graphics card did do better because they just generally do but from a cost perspective the the lake AI PC dev kit is cheaper you cannot buy the one on the Le hand side because this is something that Intel sent me they there are equivalent kits out there if you just type an AIP PC dev kit Intel am all of uh uh quadcom they all make them so I just prefer to use Intel Hardware um but you know whichever one you want to utilize even the Mac M4 would be in the same kind of line of these things um that you could utilize but I found that we could run about a 7 to8 billion parameter model on either but there were cases where um when I used specific things and the models weren’t optimize and I didn’t tweak them it would literally hang the computer and shut them down both of them right both of them so there is some finessing here and understanding how your work your Hardware works but probably if you want to run this stuff you would probably want to have um a computer on your network so like I my aipc is on my network or you might want to have a dedicated computer with multiple graphics cards to do it but I kind of feel like if I really wanted decent performance I probably need two aips with distributed uh Distributing the llm across them with something like racer or I need another other graphics card uh with distributed because just having one of either or just feels a little bit too too little but you can run this stuff and you can get some interesting results but we’ll jump into that right now okay so before we try to work with deep seek programmatically let’s go ahead and use deep seek.com um AI powered assistance so this is supposed to be the Civ of Chachi BT Claude Sonet mistal 7 llamas uh meta AI um as far as I understand this is completely free um it could be limited in the future because this is a product coming out of China and for whatever reason it might not work in North America in some future so if that doesn’t work you’ll just skip on to the other videos in this crash course which will show you how to programmatically download the open-source model and run it on your local compute but this one in particular is running deep seek version or V3 um and then up here we have deep seek R1 which they’re talking about and that’s the one that we’re going to try to run locally but deep seek V3 is going to be more capable because there’s a lot more stuff that’s moving around uh in the background there so what we’ll do is go click Start now now I got logged in right away because I connected with my Google account that is something that’s really really easy to do and um the use case that I like to test these things 
on is I created this um prompt document for uh helping me learn Japanese and so basically what the uh this prompt document does is I tell it you are a Japanese language teacher and you are going to help me work through a translation and so I have one where I did on meta Claud and chat gbt so we’re just going to take this one and try to apply it to deep seek the one that’s most advanced is the claw one and here you can click into here and you can see I have a role I have a language I have teaching instructions we have agent flow so it’s handling State we’re giving it very specific instructions we have examples and so um hopefully what I can do is give it these documents and it will act appropriately so um this is in my GitHub and it’s completely open source or open to you to access at Omen King free gen I boot camp 2025 in the sentence Constructor but what I’m going to do is I’m in GitHub and I’m logged in but if I press period this will open this up in I’m just opening this in github.com um but what I did is over time I made it more advanced and the cloud one is the one that we really want to test out so I have um these and so I want this one here this is a teaching test that’s fine I have examp and I have consideration examples okay so I’m just carefully reading this I’m just trying to decide which ones I want I actually want uh almost all of these I want I I’m just going to download the folder so I’m going to do I’m going to go ahead and download this folder I’m going to just download this to my desktop okay and uh it doesn’t like it unless it’s in a folder so I’m going to go ahead and just hit download again I think I actually made a folder on my desktop called No Maybe not download but we’ll just make a new one called download okay I’m going to go in here and select we’ll say view save changes and that’s going to download those files to there so if I go to my desktop here I go into download we now have the same files okay so what I want to do next is I want to go back over to deep seek and it appears that we can attach file so it says text extraction only upload docs or images so it looks like we can upload multiple documents and these are very small documents and so I want to grab this one this one this one this one and this one and I’m going to go ahead and drag it on in here okay and actually I’m going to take out the prompt MD and I’m actually just going to copy its contents in here because the prompt MD tells it to look at those other files so we go ahead and copy this okay we’ll paste it in here we enter and then we’ll see how it performs another thing we should check is its Vision ability but we’ll go here and says let’s break down a sentence example for S structure um looks really really good so next possible answerers try formatting the first clue so I’m going to try to tell it to give me the answer just give me the answer I want to see if it if I can subvert uh subvert my instructions okay and so it’s giving me the answer which is not supposed to supposed to be doing did I tell you not to give me the answer in my prompt document let’s see if it knows my apologies for providing the answer clearly so already it’s failed on that but I mean it’s still really powerful and the consideration is like even if it’s not as capable as Claude or as Chach BT it’s just the cost Factor um but it really depends on what these models are doing because when you look at meta AI right if you look at meta AI or you look at uh mistol mistol 7 uh these models they’re not necessarily working with a 
bunch of other models um and so there might be additional steps that um Claude or chat GPT uh is doing so that it doesn’t like it makes sure that it actually reads your model but so far right like I ran it on these ones as well but here are equivalents of of more simpler ones that don’t do all those extra checks so it’s probably more comparable to compare it to like mistol 7 or llama in terms of its reasoning but here you can see it already made a mistake but we were able to correct it but still this is pretty good um so I mean that’s fine but let’s go test its Vision capabilities because I believe that this does have Vision capabilities so I’m going to go ahead and I’m looking for some kind of image so I’m going to say Japanese text right I’m going to go to images here and um uh we’ll say Japanese menu in Japanese again if even if you don’t care about it it’s it’s a very good test language as um is it really has to work hard to try to figure it out and so I’m trying to find a Japanese menu in Japanese so what I’m going to do is say translate maybe we’ll just go to like a Japanese websit so we’ll say Japanese Hotel um and so or or maybe you know what’s better we’ll say Japanese newspaper that might be better and so this is probably one minichi okay uh and I want it actually in Japanese so that’s that’s the struggle here today um so I’m looking for the Japanese version um I don’t want it in English let’s try this Japanese time. JP I do not want it in English I want it in Japanese um and so I’m just looking for that here just give me a second okay I went back to this first one in the top right corner it says Japanese and so I’ll click this here so now we have some Japanese text now if this model was built by China I would imagine that they probably really good with Chinese characters and and Japanese borrow Chinese characters and so it should perform really well so what I’m going to do is I’m going to go ahead I have no idea what this is about we we’ll go ahead and grab this image here and so now that is there I’m going to go back over to deep seek and I’m going to just start a new chat and I’m going to paste this image in I’m going to say can you uh transcribe uh the Japanese text um in this image because this what we want to find out can it do this because if it can do that that makes it a very capable model and transcribing means extract out the text now I didn’t tell it to um produce the the translation it says this test discusses the scandal of involving a former Talent etc etc uh you know can you translate the text and break down break down the grammar and so what we’re trying to do is say break it down so we can see what it says uh formatting is not the oh here we go here this is what we want um so just carefully looking at this possessive advancement to ask a question voices also yeah it looks like it’s doing what it’s supposed to be doing so yeah it can do Vision so that’s a really big deal uh but is V3 and that makes sense but this is deeps seek this one but the question will be what can we actually run locally as there has been claims that this thing does not require series gpus and I have the the hardware to test that out on so we’ll do that in the next video but this was just showing you how to use the AI power assistant if you didn’t know where it was okay all right so in this video we’re going to start learning how to download the model locally because imagine if deep seek is not available one day for whatever reason um and uh again it’s supposed to run really well on 
computers that do not have uh expensive GP gpus um and so that’s what we’re going to find out here um the computer that I’m on right now I’m actually remoted like I’m connected on my network to my Intel developer kit and this thing um if you probably bought it brand new it’s between $500 to $1,000 but the fact is is that this this thing is a is a is a mobile chip I call it the lunar Lake but it’s actually called The Core Ultra 200 V series mobile processors and this is the kind of processor that you could imagine will be in your phone in the next year or two um but what’s so special about um these new types of chips is that when you think of having a chip you just think of CPUs and then you hear about gpus being an extra graphics card but these things have a built-in graphics card called an igpu an integrated graphics card it has an mpu a neural Processing Unit um and just a bunch of other capabilities so basically they’ve crammed a bunch of stuff onto a single chip um and it’s supposed to allow you to uh be able to run ml models and be able to download them so this is something that you might want to invest in you could probably do this on a Mac M4 as well or uh some other things but this is just the hardware that I have um and I do recommend it but anyway one of the easiest ways that we can work with the model is by using olama so AMA is something I already have installed you just download and install it and once it’s installed it usually appears over here and mine is over here okay but the way olama works is that you have to do everything via the terminal so I’m on Windows 11 here I’m going to open up terminal if you’re on a Mac same process you open up terminal um and now that I’m in here I can type the word okay so AMA is here and if it’s running it shows a little AMA somewhere in in your on your computer so what I want to do is go over to here and you can see it’s showing us R1 okay but notice here there’s a drop down okay and we have 7 1.5 billion 7 billion 8 billion 14 billion 32 billion 70 billion 671 billion so when they’re talking about deep seek R1 being as good as chat gpts they’re usually comparing the top one the 671 billion parameter one which is 404 GB I don’t even have enough room to download this on my computer and so you have to understand that this would require you to have actual gpus or more complex setups I’ve seen somebody um there’s a video that circulates around that somebody bought a bunch of mac Minis and stack them let me see if I can find that for you quickly all right so I found the video and here is the person that is running they have 1 two three three four five six seven seven Mac Minis and it says they’re running deep seek R1 and you can see that it says M4 Mac minis U and it says total unified memory 496 gab right so that’s a lot of memory first of all um and it is kind of using gpus because these M M4 chips are just like the lunar Lake chip that I have in that they have integrated Graphics units they have mpus but you see that they need a lot of them and so you can if you have a bunch of these technically run them and I again I again I whatever you want to invest in you know you only need really one of these of whether it is like the Intel lunar lake or the at Mac M4 whatever ryzen’s AMD ryzen’s one is um but the point is like even if you were to stack them all and have them and network them together and do distributed compute which You’ use something like Ray um to do that Ray serve you’ll notice like look at the type speed it is not it’s not fast 
it’s like clunk clunk clun clunk clunk clunk clunk clunk so you know understand that you can do it but you’re not going to get that from home unless the hardware improves or you buy seven of these but that doesn’t mean that we can’t run uh some of these other uh models right but you do need to invest in something uh like this thing and then add it to your network because you know buying a graphics card then you have to buy a whole computer and it gets really expensive so I really do believe in aip’s but we’ll go back over to here and so we’re not running this one there’s no way we’re able to run this one um but we can probably run easily the seven billion parameter one I think that one is is doable we definitely can do the one 1.5 billion one and so this is really what we’re targeting right it’s probably the 7even billion parameter model so to download this I all I have to do is copy this command here I already have Olam installed and what it’s going to do it’s going to download the model for me so it’s now pulling it from uh probably from hugging face okay so we go to hugging face and we say uh deep seek R1 what it’s doing is it’s grabbing it from here it’s grabbing it from uh from hugging face and it’s probably this one there are some variants under here which I’m not 100% certain here but you can see there’s distills of other of other models underneath which is kind of interesting but this is probably the one that is being downloaded right now at least I think it is and normally what we looking for here is we have these uh safe tensor files and we have a bunch of them so I’m not exactly sure we’ll figure that out here in a little bit but the point is is that we are downloading it right now if we go back over to here you can see it’s almost downloaded so it doesn’t take that long um but you can see they’re a little bit large but I should have enough RAM on this computer um I’m not sure how much this comes with just give me a moment so uh what I did is I just open up opened up system information and then down below here it’s it’s saying I have 32 GB of RAM so the ram matters because you have to have enough RAM to hold this stuff in memory and also if the model’s large you have to be able to download it and then you also need um the gpus for it but you can see this is almost done so I’m just going to pause here until it’s 100% done and it should once it’s done it should automatically just start working and we’ll we’ll see there in a moment okay just showing that it’s still pulling so um it downloaded now it’s pulling additional containers I’m not exactly sure what it’s doing but now it is ready so it didn’t take that long just a few minutes and we’ll just say hello how are you and that’s pretty decent so that’s going at an okay Pace um could I download a more um a more intensive one that is the question that we have here because we’re at the seven billion we could have done the 8 billion why did I do seven when I could have done eight the question is like where does it start kind of chugging it might be at the 14 14 billion parameter model we’ll just test this again so hello and just try this again but you can see see that we’re getting pretty pretty decent results um the thing is even if you had a smaller model through fine-tuning if we can finetune this model we can get better performance for very specific tasks if that’s what we want to do but this one seems okay so I would actually kind of be curious to go ahead and launch it I can hear the computer spinning up from here the lunar Lake 
um devit but I’m going to go ahead and just type in buy and um I’m going to just go here I want to delete um that one so I’m going to say remove and was deep c car 1 first let’s list the model here because we want to be cautious of the space that we have on here and this model is great I just want to have more um I just want to run I just want to run the 8 billion parameter one or something larger so we’ll say remove this okay it’s deleted and I’m pretty confident it can run the 8 billion let’s do the 14 billion parameter this is where it might struggle and the question is how large is this this is 10 gabes I definitely have room for that so I’m going to go ahead and download this one and then once we have that we’ll decide what it is that we want to do with it okay so we’re going to go ahead and download that I’ll be back here when this is done downloading okay all right so we now have um this model running and I’m just going to go ahead and type hello and surprisingly it’s doing okay now you can’t hear it but as soon as I typed I can hear my uh my little Intel developer kit is going and so I just want you to know like if you were to buy IPC the one that I have is um not for sale but if you look up one it has a lunar Lake chip in it uh that Ultra core was it the ultra core uh uh 20 20 2 220 or whatever um if you just find it with another provider like if it’s with Asus or whoever Intel is partnered with you can get the same thing it’s the same Hardware in it um Intel just does not sell them direct they always do it through a partner but you can see here that we can actually work with it um I’m not sure how long this would work for it might it might quit at some point but at least we have some way to work with it and so AMA is one way that we can um get this model but obviously there are different ones like the Deep seek R1 I’m going to go ahead back to AMA here and I just want to now uh delete that model just because we’re done here but there’s another way that uh we can work with it I think it’s called notebook LM or LM Studio we’ll do in the next video and that will give you more of a um AI powed assistant experience so not necessarily working with it programmatically but um closer to the end result that we want um I’m not going to delete the model just yet here but if you want to I’ve already showed you how to do that but we’re going to look at the uh next one in the next video here because it might require you to have ol as the way that you download the model but we’ll go find out okay so see you in the next one all right so here we’re at Studio LM or LM Studio I’ve actually never used this product before I usually use web UI which will hook up to AMA um but I’ve heard really good things about this one and so I figured we’ll just go open it up and let’s see if we can get a very similar experience to um uh having like a chat gbt experience and so here you they have downloads for uh Mac uh the metal series which are the the latest ones windows and Linux so you can see here that they’re suggesting that you want to have one of these new AI PC chips um as that is usually the case if you have gpus then you can probably use gpus I actually do have really good gpus I have a 480 RTX here but I want to show you what you can utilize locally um so what we’ll do is just wait for this to download okay and now let’s go ahead and install this but I’m really curious on how we are going to um plug this into like how are we going to download the model right does it plug into AMA does it download the 
model separately that’s what we’re going to find out here just shortly when it’s done installing so we’ll just wait a moment here okay all right so now we have completing the ml Studio um setup so LM Studio has been installed on your computer click finish and set up so we’ll go ahead and hit finish okay so this will just open up here we’ll give it a moment to open I think in the last video we stopped olama so even if it’s not there we might want to I’m just going to close it out here again it might require oama we’ll find out here moment so say get your first llm so here it says um llama through 3.2 that’s not what we want so we’re going to go down below here it says enable local LM service on login so it sounds like what we need to do is we need to log in here and make an account I don’t see a login I don’t so we’ll go back over to here and they have this onboarding step so I’m going to go and we’ll Skip onboarding and let’s see if we can figure out how to install this just a moment so I’m noticing at the top here we have select a model to load no LMS yet download the one to get started I mean yes llama 3.1 is cool but it’s not the model that I want right I want that specific one and so this is what I’m trying to figure out it’s in the bottom left corner we have some options here um and I know it’s hard to read I apologize but there’s no way I can make the font larger unfortunately but they have the LM studio. a so we’ll go over to here I’m going go to the model catalog and and we’re looking for deep seek we have deep seek math 7 billion which is fine but I just want the normal deep seek model we have deep seek coder version two so that’d be cool if we wanted to do some coding we have distilled ones we have R1 distilled so we have llama 8 billion distilled and quen 7 billion so I would think we probably want the Llama 8 billion distilled okay so here it says use in LM studio so I’m going to go ahead and click it and we’ll click open okay now it’s going to download them all so 4.9 gigabytes we’ll go ahead and do that so that model is now downloading so we’ll wait for that to finish okay so it looks like we don’t need Olam at all this is like all inclusive one thing to go though I do want to point out notice that it has a GG UF file so that makes me think that it is using like whatever llama index can use I think it’s called llama index that this is what’s compatible and same thing with o llama so they might be sharing the same the same stuff because they’re both using ggf files this is still downloading but while I’m here I might as well just talk about what uh distilled model is so you’ll notice that it’s saying like R1 distilled llama 8 or quen 7 billion parameter so dist distillation is where you are taking a larger model’s knowledge and you’re doing knowledge transfer to a smaller model so it runs more efficiently but has the same capabilities of it um the process is complicated I explain it in my Jenning ey Essentials course which this this part of this crash course will probably get rolled into later on um but basically it’s just it’s a it’s a technique to transfer that knowledge and there’s a lot of ways to do it so I can’t uh summarize it here but that’s why you’re seeing distilled versions of those things so basically theyve figured out a way to take the knowledge maybe they’re querying directly that’s probably what they’re doing is like they have a bunch of um evaluations like quer that they hit uh with um uh what do you call it llama or these other models and then they look at 
the result and then they then when they get their smaller model to do the same thing then it performs just as well so the model is done we’re going to go ahead and load the model and so now I’m just going to get my head a little bit out of the way cuz I’m kind of in the way here so now we have an experience that is more like uh what we expected to be and on the top here I wonder is a way that I can definitely bring the font up here I’m not sure if there is a dark mode the light Mode’s okay but um a dark mode would be nicer but there’s a lot of options around here so just open settings in the bottom right corner and here we do have some themes there we go that’s a little bit easier and I do apologize for the small fonts um there’s not much I can do about it I even told it to go larger this is one way we can do it so let’s see if we can interact with this so we’ll say um can you um I am learning Japanese can you act as my Japanese teacher let’s see how it does now this is R1 this does not mean that it has Vision capabilities um as I believe that is a different model and I’m again I’m hearing my my computer spinning up in the background but here you can see that it’s thinking okay so I’m trying to learn Japanese and I came across the problem where I have to translate I’m eating sushi into Japanese first I know that in Japanese the order of subject can be this so it’s really interesting it’s going through a thought process so um normally when you use something like web UI it’s literally using the model directly almost like you’re using it as a playground but this one actually has reasoning built in which is really interesting I didn’t know that it had that so there literally is uh agent thinking capability this is not specific to um uh open seek I think if we brought in any model it would do this and so it’s showing us the reasoning that it’s doing here as it’s working through this so we’re going to let it think and wait till it finishes but it’s really cool to see its reasoning uh where normally you wouldn’t see this right so you know when and Chach B says it’s thinking this is the stuff that it actually is doing in the background that it doesn’t fully tell you but we’ll let it work here we’ll be back in just a moment okay all right so looks like I lost my connection this sometimes happens because when you are running a computational task it can halt all the resources on your machine so this model was a bit smaller but um I was still running ol in the background so what I’m going to do is I’m going to go my Intel machine I can see it rebooting in the background here I’m going to give it a moment to reboot here I’m going to reconnect I’m going to make sure llama is not running and then we’ll try that again okay so be back in just a moment you know what it was the computer decided to do Windows updates so it didn’t crash but this can happen when you’re working with llms that it can exhaust all the resources so I’m going to wait till the update is done and I’ll get my screen back up here in just a moment okay all right so I’m reconnected to my machine I do actually have some tools here that probably tell me my use let me just open them up and see if anyone will actually tell me where my memory usage is yeah I wouldn’t call that very uh useful maybe there’s some kind of uh tool I can download so monitor memory usage well I guess activity monitor can just do it right um or what’s it called see if I can open that up here try remember the hot key for it there we go and we go to task manager and so 
maybe I just have task manager open here we can kind of keep track of our memory usage um obviously Chrome likes to consume quite a bit here I’m actually not running OBS I’m not sure why it um automatically launched here oh you know what um oh I didn’t open on this computer here okay so what I’ll do is I’ll just hit task manager that was my task manager in the background there we go and so here we can kind of get an idea this computer just restarted so it’s getting it itself in order here and so we can see our mem us is at 21% that’s what we really want to keep a track of um so what I’m going to do is go back over to LM Studio we’re going to open it up but this is stuff that really happens to me where it’s like you’re using local LMS and things crash and it’s not a big deal just happens but we came back here and it actually did do it it said thought for 3 minutes and 4 seconds and you can see its reasoning here okay it says the translation of I likeing Sushi into Japanese isi sushim Guk which is true the structure correctly places it one thing I’d like to ask it is can it give me um Japanese characters so can you show me the uh the sentence can you show me uh Japanese using Japanese characters DG conji and herana okay and so we’ll go ahead and do that it doesn’t have a model selected so we’ll go to the top here what’s kind of interesting is that maybe you can switch between different kinds of models as you’re working here we do have GPU offload of discrete uh model layers I don’t know how to configure any of these things right now um flash attention would be really good so decrease memory usage generation time on some models that is where a model is trained on flash attention which we don’t have here right now but I’m going to go ahead I’m going to load the Llama distilled model and we’re going to go ahead and ask if it can do this for us because that would make it a little bit more useful okay so I’m going to go ahead and run that and we’ll be back here in just a moment and we’ll see the results all right we are back and we can take a look at the results here we’ll just give it a moment I’m going to scroll up and you know what’s really interesting is that um it is working every time I do this I it does work but the computer restarts and I think the reason why is that it’s exhausting all possible resources um now the size of the model is not large it’s whatever it is the 8 billion parameter one at least I think that’s what we’re running here um it’s a bit hard because it says 8 billion uh distilled and so we’d have to take a closer look at it it says 8 billion so it’s 8 billion parameter um but the thing is it’s the reasoning that’s happening behind the scenes and so um I think for that it’s exhausting whereas we’re when we’re using llama it’s less of an issue um and I think it might just be that LM Studio the way the agent Works might might not have ways of or at least I don’t know how to configure it to make sure that it doesn’t uh uh destroy destroy stuff when it runs out here because you’ll notice here that we can set the context length and so maybe if I reduce that keep model in memory so Reserve System memory for the model even when offload GPU improves performance but requires more RAM so here you know we might toggle this off and get better production but right now when I run it it is restarting but the thing is it is working so you can see here it thought for 21 seconds it says of course I’d like to help you and so here’s some examples and it’s producing pretty good code or like 
output I should say but anyway what we’ve done here is we’ve just changed a few options so I’m saying don’t keep it in memory okay because that might be an issue and we’ll bring the context window down and it says CPU uh thread to allocate that seems fine to me again I’m not sure about any of these other options we’re going to reload this model okay so we’re now loading with those options I want to try one more time if my computer restarts it’s not a big deal but again it might be just LM Studio that’s causing us these issues here and so I’m just going to click into this one I think it’s set up those settings we’ll go ahead and just say Okay um so I’m going to just say like how do I ask how do I I say in Japanese um uh where is the movie theater okay it doesn’t matter if you know Japanese it’s just we’re trying to tax it with something hard so here it’s running again and it’s going to start thinking we’ll give it a moment here and as it’s doing that I’m going to open up task manager he and we’ll give it a moment I noticed that it has my um did it restart again yeah I did so yeah this is just the experience again it has nothing to do with the Intel machine it’s just this is what happens when your resources get exhausted and so it’s going to restart again but this is the best I can de demonstrate it here now I can try to run this on my main machine using the RTX 480 um so that might be another option that we can do where I actually have dedicated GP use and I have a this is like a 14th generation uh Intel chip I think it’s Raptor lake so maybe we’ll try that as well in a separate video here just to see what happens um but that was the example there but I could definitely see how having more than uh like those computer stacked would make this a lot easier even if you had a second one there that’ still be uh more cost effective than buying a completely new computer outright those two or smaller mini PCS um but I’ll be back here in just a moment okay okay so I’m going to get this installed on my main machine my main machine like as I’m recording here it’s using my GPU so it’s going to have to share it so I’m just going to stop this video and then we’re going to treat this one as LM Studio using the RTX 480 and we’ll just see uh if the experience is the same or different okay all right so I’m back here and now I’m on my main computer um and we’re going to use ml studio so I’m going to go and skip the onboarding and I remember uh there’s a way for us to change the theme maybe in the bottom right corner of the Cog and we’ll change it to dark mode here thr our eyes are a little bit uh easier to see here also want to bump up the font a little bit um to select the model I’m going to go here to select a model at the top here we do not want that model here so I’m going to go to maybe here on left hand side no not there um it was here in the bottom left corner and we’re going to go to L LM Studio Ai and we want to make our way over to the model catalog at the top right corner and I’m looking for deep seek R1 distill llama 8B so we click that here and we’ll say use in studio that’s now going to download this locally okay so we are now going to download this model and I’ll be back here in just a moment okay all right so I’ve downloaded the model here I’m going to go ahead and load it and again I’m a little bit concerned because I feel like it’s going to cause this computer to restart but because it’s uh offloading to the gpus I’m hoping that’ll be less of an issue but here you can see it’s loading the 
model into memory okay and we really should look at our options that we have here um it doesn’t make it very easy to select them but oh here it is right here okay so we have some options here and this one actually is offloading to the GPU so you see it has GPU offload I’m almost wondering if I should have set GPU offload um on the aipc because it technically has IG gpus and maybe that’s where we were running into issues whereas when we were using olama maybe it was already utilizing the gpus I don’t know um but anyway what I want to do is go ahead and ask the same thing so I’m going to say uh can you teach me teach me Japanese for jlpt and5 level so we’ll go ahead and do that we’ll hit enter and again I love how it shows us the thinking that it does here I’m assuming that it’s using um our RTX RTX 480 that I have on this computer and this is going pretty decently fast here it’s not causing my computer to cry this is very good this is actually reasonably good so yeah it’s performing really well so the question is um you know I again I’d like to go try the uh the the developer kit again and see if I because I remember the gpus were not offloading right so maybe it didn’t detect the igpus but this thing is going pretty darn quick here and so that was really really good um and so it’s giving me a bunch of stuff it’s like okay but give me give me example sentences in Japanese okay so that’s what I want we’ll give it a moment yep and that looks good so it is producing really good stuff this model again is just the Llama uh a building parameter one I’m going to eject this model let’s go back over to here into the uh Studio over here and I want to go to the model Catal because there are other deep seek models so we go and take a look deep seek we have coder version two so the younger sibling of GPT 4 deeps coder version 2 model but that sounds like deep seek 2 right so I’m not sure if that’s really the latest one because we only want to focus on R1 and so yeah I don’t think those other ones we really care about we only care about R1 models but you can see we’re getting really good performance so the question is like what’s the compute or the top difference between these two and maybe we can ask this over to the model ourselves but I’m going to start a new conversation here and I’m going to say um how many tops or or is it tops does I think it’s called tops tops does RTX uh 4080 have okay we’ll see if it can do it select this model here and yeah we’ll load the model and we’ll run that there we’ll give it a moment and while that’s thinking I mean obviously we just use Google for this we don’t really need to do that but I want to do a comparison to see like how many tops they have so I’ll let that run the background I’m also just going to search and find out very quickly oh here it goes uh does not have a specified number of tensor uh as officially NV video the company focuses on metrics like cudas cores and mamory B withd but this would be speculative okay but but then but then how do I how do I compare compare tops for um let’s say lunar Lake versus RTX 4080 and I know like there’s lots of ways to do it but it’s like if I can’t compare it how do I do it and while that’s trying to figure it out I’m going to go over to perplexity and maybe we can get an exact example because I’m trying to understand like how much does my discret GPU do compared to that that one that’s internal so we’ll say uh lunar lunar Lake versus RTX uh 40 4080 uh for Tops performance and we’ll see what we get so lunar lake has 
120 tops and hence gaming rather than AI workload so IND doesn’t typically advertise their tops maintaining 60 FPS okay but then how so then okay but what what could it be like how many tops could it be for the RTX 480 kind of makes it hard because like we don’t know how many tops it is we don’t we don’t know what kind of expectation we should have with it okay fair enough so yeah so it’s we can’t really compare it’s like apples to oranges I guess and it’s just not going to give us the result here um but here it is going through comparison so if you run ml perfect gpus like a model with reset you directly compare the tops uh with a new architecture and so that’s basically the only way to do it so we can’t it’s apples to oranges um I want to go and attempt to try to run this one more time on the lunar Lake and I want to see if I can set the gpus but if we can’t set the gpus then I think it’s going to always have that issue specifically with this but we will use the L Lake for um using with hugging face and other things like that so be back in just a moment okay all right so I’m back and I just did a little bit of exploration on my other computer there because I want to understand like okay I have this aipc it’s very easy to run this here on my RTX 480 but when I run it on the on the uh the lunar like it is shutting down and I think understand why and so this is I think is really important when you are working local machines you have to have a bit better understanding of the hardware so I’m just going to RDP back into this machine here just give me just a moment okay I have it running again and it probably will crash again but at least I know why so there’s a program called camp and what camp does is it allows you to monitor um your this is for Windows for Mac I don’t know what You’ use you probably just uh uh utility manager but here you know I can see that none of these CPUs are being overloaded but this is just showing us the CPUs if we open up um task manager here okay and now the computer is running perfectly fine it’s not even spinning its fans if I go to the left hand side here we can we have CPUs mpus and gpus now mpus are the things that we want to use because mpus uh like an mpu is specifically designed to run models however a lot of the Frameworks like Pi torch um and uh tensor flow they’re optimized on Cuda originally because the underlying framework and so normally you have to go through an optimization or conversion format I don’t know at this time if there is a conversion for Max for Intel Hardware Because deep seek is so new but I would imagine uh that is something the Intel team is probably working on and this is not just specific to Intel if it’s AMD or whoever they want to make optimization to leverage their different kinds of compute like their MPS and also has to do with the the thing that we’re using so we’re using that thing called this one over here I’m not sure well all these little oh yeah this this just this is core LM showing us all the temperatures right and so what we can do is just kind of see what’s going on here is that I’m going to bring this over so that we can see what’s happening right we want to use mpus it’s not going to happen because this thing is not set up to do that but if I drop it down here and we click into uh this right we have our options before we didn’t have any gpus but we can go here we can say use all the gpus I don’t know how many how much it can offload to but I’ll I’ll set it to something like 24 we have a CPU threat count like that 
might be something we want to increase we can reduce our context window um we might not want to load it into memory but the point is that if it if it exhausts the GPU because it’s all it’s a single integrated circuit I have a feeling that it’s going to end up restarting it but here again you can see it’s very low we’ll go ahead and we’ll load the model right and the next thing I will do is I will go type in something like you know I want to learn Japanese can you provide me um uh a lesson on Japanese sentence structure okay we’ll go ahead and do that actually notice if it this doesn’t require a thought process it works perfectly it doesn’t cause any issues with the computer we’ll go ahead and run it and let’s pay attention left hand side here and now we can see that it’s utilizing gpus when it was at zero it wasn’t using gpus at all but Noti it’s at 50 50% right and it’s doing pretty good our CPU is higher than usual before when I ran this earlier off screen the CPU was really low and it was the GPU that was working hard so again it really you have to understand your settings as you go here but this is not exhausting so far but we’re just watching these numbers here and also our cor temps right and you can see we’re not running into any issues it’s not even spinning up it’s not even making any complaints right now the other challenge is that I have a a developer kit that um uh it’s it’s something they don’t sell right so if there was an issue with the BIOS I’d have to update it and there’s like no all I can get is Intel’s help on it but if I to buy like a commercial version of this like um whoever is partnered with it if it’s Asus or Lenovo or whatever I would probably have um less issues because they’re maintaining those bios updates um but so far we’re not having issues but again we’re just monitoring here we have 46 47% 41% um again we’re watching it you can see core is at 84% 89% and so we’re just carefully watching this stuff but I might have picked the perfect the perfect amount of settings here and maybe that was the thing is that you know I turned down the CPU like what did we do the options I turned the gpus down so I turned that down I also told it not to load memory and now it’s not crashing okay there we go it’s not as fast as the RTX 4080 um but you know what this is my old graphics card here I actually bought this uh not even long ago before I got my new computer this is an RTX 3060 okay this is not that old it’s like a it’s like a couple years old 2022 and I would say that when I used to use that and I would run models my computer would crash right so but the point is is that these newer CPUs whether it’s again the M4 or the Intel L lake or whatever amd’s one is they’re they have the strong equivalence of like graphics cards from two years ago which is crazy to me um but anyway I think I might have found The Sweet Spot I’m just really really lucky but you can see the memory usage here and stuff like that and you just have to kind of monitor it and you’ll find out once you get those settings uh what works for you or you know you buy really expensive GPU and uh it’ll run perfectly fine but here it’s going and we’ll just give it a moment we be back in just a moment okay anyway I was going a little bit slow so you know I just decided we’ll just move on here but my my point was made clear is that if you dial in the specific settings you can make this stuff work on things where you don’t have dedicated graphics card if you have a dedicated graphics card you can see it’s pretty 
good and uh yeah this is fine with the RTX 480 so you know if you have that you’re going to be in good shape there but now that we’ve shown how to do with AI power assistance let’s take a look at how we can actually get these models from hugging face next okay and work with them programmatically um so I’ll see you in the next one all right so what I want to do in this video is I want to see if we can download the model from hugging phase and then work with it programmatically um is that’s going to give you the most flexibility with these models of course if you just want to consume them then uh using the um LM Studio that I showed you or whatever it was called um would be the easiest way to do it but having a better understanding of these models how we can use them directly would be useful I think for the rest of this I’m just going to use the RTX 480 because I realize that to really make use of aips you have to wait till they have optimizers for it so we’re talking about um Intel again you have this kit called open Veno and open Veno is an optimization framework and if we go down they I think they have like a bunch of examples here we’ll go back for a moment yeah quick examples maybe over here and maybe not over here but we go back to the notebooks and we scroll on down yeah they have this page here and so um in this thing they will have different llms that are optimized specifically so that you can maybe Leverage The mpus or the or or make it run better on CPUs but until that’s out there we’re stuck on the gpus and we’re not going to get the best performance that we can uh so maybe in a in a month or so um I can revisit that and then I will be utilizing it it might be as fast as my RTX 480 but for now we’re going to just stick uh with the RTX 480 and we’ll go look at Deep seek because they have more than just R1 so you can see there is a collection of models and in here if we click into it we have um R1 r10 which I don’t know what that is let’s go take a look here it probably explains it somewhere uh but we have R1 distilled 70 billion PR parameter quen 32 billion parameter quen 14 billion and so we have some variant here that we can utilize just give me a moment I want to see what zero is so to me it sounds like zero is the precursor to R1 so it says a model trained with supervised learning okay and so I don’t think we want to use zero we want to use the R1 model or one of these distilled versions which uh give similar capabilities but if we go over to here it’s not 100% clear on how we can run this um but down below here we can see total parameters is 671 billion okay so this one literally is the big one this is the really really big one and so that would be a little bit too hard for us to run this machine we can’t run 671 billion parameters you saw the person stacking all those uh Apple m4s like uh yeah I have an RTX 480 but I need a bunch of those to do it down below we have the distilled models and so this is probably what we were using when we were using olama um if we wanted to go ahead and do that there so this is probably where I would focus my attention on is these distilled models uh when we’re using hugging face it will show us how we can deploy the models up here notice over here we have BLM um I covered this in my geni essentials course I believe but um there are different types of ways we can serve models just as web servers have you know servers to serve them like the uh like software underneath so do um uh these ml models these machine learning models and VM is one that 
The reason vLLM matters here is that it can work with the Ray framework, and Ray is important because it includes Ray Serve, which lets you take vLLM and distribute it across compute. When we saw that video of the Mac M4s stacked on top of each other, that was probably Ray Serve with vLLM scaling things out. So if you were going to try to run the full model, vLLM is what you'd want to invest in; the Hugging Face Transformers library is fine as well, but either way we're not running the full thing on my computer, and probably not on yours.

There's also V3, which has been very popular, and that's actually what we were using on the DeepSeek website. If you go into DeepSeek-V3, it's a mixture-of-experts model, and it would be a really interesting one to deploy as well, but it's also a 671-billion-parameter model, so it's another one we can't run locally. If we could, we'd get vision tasks and other capabilities out of it. So we're really going to stick with R1, and it's going to be one of the distilled variants. I'm going with the Llama 8-billion-parameter distill — I'm not sure why the others aren't showing for me right now, but 8 billion is something we know we can reliably run, whether on the Lunar Lake or on the RTX 4080. On the right-hand side of the model page there are snippets for Transformers and vLLM; Transformers is probably the easiest way to run it, and there's some starter code there.

So let's get set up. I'll open VS Code. I already have a repo — I'm going to put this in my GenAI Essentials course, because if we're going to do this we might as well put it in there — so I'll open that folder. Actually, I don't even have it cloned on this machine, so let me grab it: the repo is completely open, so if you want to follow along you can do the same thing. On GitHub, find GenAI Essentials, copy the URL, and `git clone` it. I'm going to open it with Windsurf, because I really like Windsurf and have been using it quite a bit; I have the paid version, so I have full access, but if you don't, you can just copy and paste the code — I'm only trying to save myself some time. Inside GenAI Essentials I'll make a new folder called deepseek, and inside that one called r1-transformers, since we're just going to use the Transformers library. I'll select that folder, say yes, and make a new file, which I want to be a notebook — I'm not sure I'm set up for that, but we'll give it a go — so I'll name it basic.ipynb.
The .ipynb extension is for Jupyter notebooks, and you need Jupyter already installed for it to work; in my GenAI Essentials course I show you how to set all of that up, so you can learn it there if you want. I'm going to switch over to WSL, let it install the extension it wants, and check whether I have conda installed — I do, and there's a base environment. Any time you're setting up one of these environments you really should create a new one, because you'll run into fewer conflicts, so I need a new environment here. I can't remember the exact instructions, but I'm pretty sure I document them in this repo under local development; the conda setup notes explain it for Linux, which is effectively what I'm using right now with Windows Subsystem for Linux 2. Conda is already installed, so I just need to create the environment. I'm going to use Python 3.10 — in the future you might want 3.12, but this version seems to give me the fewest problems — and I'll take the command from my notes, changing the environment name from hello to deepseek: `conda create -n deepseek python=3.10`. That installs some things, and once it's done I activate it with `conda activate deepseek`, so now we're using the deepseek environment.

Back on the editor side I want to get some code set up. If we go to the 8-billion-parameter distilled model on Hugging Face and open the Transformers tab, there's some sample code; if it doesn't work, that's totally fine, we'll tweak it from there. I also have example code lying around, so if for whatever reason the snippet doesn't work we can grab from my code base — I don't remember half the stuff I do, even though I've done a lot of this. So we'll copy that into the notebook. I'm not sure how well Windsurf works with Jupyter and IPython — I've actually never tried it — and it's asking us to select a kernel but not showing the one I want. One thing I don't think we did is install ipykernel; there's an extra step to get the environment to work with Jupyter, and it's in my Jupyter notes: you need the IPython kernel package or the environment won't show up. The command is `conda install -c conda-forge ipykernel` — the `-c` flag tells conda to use the conda-forge channel; I typed `-f` first and got "the following packages are not available," which is how I noticed. So we'll run that, say yes, and hopefully once ipykernel is installed the environment will actually show up as a selectable kernel.
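A quick sanity check that the notebook is actually running inside the new environment looks something like this; the conda commands in the comments are the ones from the setup steps above.

```python
# Sanity-check the notebook kernel against the conda env created above:
#   conda create -n deepseek python=3.10
#   conda activate deepseek
#   conda install -c conda-forge ipykernel
import sys
import importlib.util

print("Python executable:", sys.executable)   # should live under .../envs/deepseek
print("Python version:", sys.version.split()[0])

# ipykernel has to be importable for the env to appear as a Jupyter kernel.
print("ipykernel available:", importlib.util.find_spec("ipykernel") is not None)
```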
The kernel still wasn't showing up, so I'm just going to close Windsurf and do the same thing in VS Code — it's the same interface. It would have been nice to use Windsurf, but if we can't, that's totally fine. I'll open the GenAI Essentials folder again — no AI coding assistant this time, so we're working through it the old-fashioned way — and find the deepseek folder. I'll open a new terminal, make sure I'm in WSL (I am), run `conda activate deepseek`, and change into deepseek/r1-transformers. I didn't save any of the code from before, which is fine since it's easy to grab again, so I'll go back to the model card, copy the sample code, paste it in, and split it across a couple of notebook cells, moving the second part into its own block.

Normally I'd install PyTorch and a few other things up front, but I'm going to start from the most bare-bones setup; it will tell me Transformers isn't installed, and that's fine. When I run the first cell, VS Code starts installing Jupyter — so we did need that, and maybe the Windsurf kernel would have worked after all — and under Python environments the deepseek environment now shows up. As expected, there's no module named transformers, and since I know I've done this before, we might as well look at my existing code. There's a huggingface-basic example in the repo, and yes, it does a pip install of transformers, so that's really all we need there.
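For completeness, the install cell ends up needing more than just transformers by the time everything below works; here's a sketch of the eventual list (python-dotenv, torch, and accelerate all come up later in this section), driven from inside the notebook's own Python environment.

```python
# Install everything the rest of the notebook ends up needing, into the same
# environment the kernel is running in (equivalent to `pip install ...`).
import subprocess
import sys

packages = ["transformers", "python-dotenv", "torch", "accelerate"]
subprocess.run([sys.executable, "-m", "pip", "install", *packages], check=True)
```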
There's also python-dotenv; we might need that as well, because we might need to provide a Hugging Face API token to download the model. I'm not sure at this point, but I'll install it in the cell at the top. We might also need PyTorch or TensorFlow, or both — with open-source models it's very common that the weights are in one format or another and need converting, and sometimes you don't need to do anything at all, so we'll see. The notebook asks for a restart after the installs, which we should only have to do once, and after that the import works.

The code references the model by its repository id; if you grab that string and compare it to the address of the model page on Hugging Face, it's the same, and that's how the library knows which model to download. It doesn't look like we need a Hugging Face API token, but we'll find out in a moment. In principle it should download the model, load Transformers, set up the tokenizer, build the model, and pass the messages through (there's also a note about pointing it at a local model directory instead). The sample really shows two approaches — loading the pretrained model directly, or using a pipeline — and I think we cover both in the course, so let's just try the pipeline, since my huggingface-basic example does exactly that: create a pipeline and call it. In a sense this should just work. I'll separate the pipeline creation from the call so I don't have to re-run everything each time, run the first cell, then the second.

Down below it complains: at least one of TensorFlow 2.0 or PyTorch should be installed. This is what I figured we'd run into. I don't know which one it actually needs — I'd guess TensorFlow, because I thought I saw that mentioned — so I'm really just guessing and will add a new install cell at the top for both tensorflow and pytorch; it'll need one or the other, and one of them should work, assuming I spelled them right. Two competing frameworks: I learned TensorFlow first, and I kind of regret that because PyTorch is now the more popular one, even though I really like TensorFlow, or specifically Keras. After the install it says PyTorch "failed to build installable wheels," which I hope doesn't matter if it can fall back to TensorFlow. (Sorry, I paused there for a second — that was my twin sister calling; she doesn't know I'm recording.) I'm going to restart the kernel anyway, even though PyTorch may or may not actually be installed, and just try it again.
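For reference, the pipeline route we're fighting to run looks roughly like this end to end once a backend is actually installed. It's a sketch based on the model-card pattern rather than the exact snippet, and `device_map="auto"` needs the accelerate package.

```python
# Text-generation pipeline for the distilled 8B model (sketch, not verbatim).
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    torch_dtype=torch.float16,   # half precision: roughly halves GPU memory use
    device_map="auto",           # puts the model on the GPU if one is visible
)

messages = [{"role": "user", "content": "Who are you?"}]
print(pipe(messages, max_new_tokens=128)[0]["generated_text"])
```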
Sometimes this stuff just works anyway, so I ran it again — same complaint: at least one of TensorFlow 2.0 or PyTorch should be installed, with links for installing each. This shouldn't be a huge issue, so let's use DeepSeek itself, since we're big DeepSeek fans today. I went to the DeepSeek website, which runs V3 rather than R1, logged in, and asked: I need to install TensorFlow 2.0 and PyTorch to run a Transformers pipeline model. It specifically told me to use TensorFlow 2.0, which is always a bit tricky, so I tried pinning the version — although it did already install TensorFlow, so I shouldn't need to tell it that again. Reading the error more carefully, it also mentions passing a framework argument so the pipeline knows whether to use TensorFlow or PyTorch. So I pasted the error in, didn't get quite what I wanted, stopped it, and asked directly: I'm using the Transformers pipeline, how do I specify the framework? I'm surprised I'd have to specify it at all — usually it just gets picked up — but the answer shows a framework option for PyTorch or TensorFlow and claims TensorFlow installed successfully. It could be hallucinating, we don't know, but I gave it a try and we're still getting the same error.

This is apparently a common Hugging Face issue; someone in a thread about it says you simply need PyTorch installed. DeepSeek wasn't getting me any further, so I went over and asked Claude instead — because it's not just the model itself, it's the reasoning behind it, and V3, which is supposed to be a really good model, didn't get us very far here. Claude suggests that PyTorch is what's generally used and that my install line is probably wrong: the package is just `torch`, plus `accelerate`. So maybe PyTorch is literally just `torch` and I forgot; I don't know why I wrote `pytorch`. I ran that install. Claude also says we probably don't need the framework argument, because Llama models normally use PyTorch — I'm not sure that's the case here. The other thing we could do is look at the files on the Hugging Face repo — I'm seeing what look like TensorFlow files there, which makes me think it might be using TensorFlow and converting over to PyTorch, I don't know — but either way we should have both installed. Even though I removed TensorFlow from the top cell, it's still installed, and we could just leave it as its own line that says
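Incidentally, you can ask Transformers directly which backends it can actually see, which is a much quicker way to diagnose this particular error than guessing:

```python
# If both of these print False, you get exactly the
# "at least one of TensorFlow 2.0 or PyTorch should be installed" error.
from transformers.utils import is_tf_available, is_torch_available

print("PyTorch visible to transformers:   ", is_torch_available())
print("TensorFlow visible to transformers:", is_tf_available())
```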
`pip install tensorflow`. Honestly, half the battle in getting these things to work is dealing with these dependency conflicts, and you'll hit something completely different from me and have to work through it. It would be interesting to see whether we could serve this via vLLM, but we'll get it working this way first. Once the installs finish, I restart the kernel again, import Transformers, build the pipeline, and run the next cell — and now it's working, which is really good. Is it utilizing my GPU? I'd think so; sometimes there are configurations you have to set, but I didn't set anything here. Right now it's just downloading the model, so we'll wait for that and then see whether it actually infers.

It seems to hang at this point, and we didn't provide a Hugging Face API key, so that makes me think that's the issue. I'm going to grab the environment-loading code from my other example, paste it in, and create a new .env file — and add it to .gitignore, because I don't want the key ending up in the repo. I can never remember the exact environment variable name Hugging Face wants; after some digging off-screen, it's HF_TOKEN. The download still hasn't moved at all, which reinforces my suspicion that it wants the token. So over on Hugging Face, logged into my account, I go down to Access Tokens and create a new read-only token for DeepSeek — and there were no terms I had to accept to download the model, so I think it's going to work. I'll revoke the key later, so I don't care if you see it. I put it in the .env file as HF_TOKEN, so the token is supposedly set; I re-run from the top, and it should pick it up without me passing it explicitly — notice we're not feeding the token in anywhere in the code. The notebook is acting a little funny today, with cells jumping around, probably just how the messaging works, so I cut the call and pasted it further down, just trying to get the download to trigger. Another option would be to download the model files directly; I don't like doing it that way, but we could. In any case the environment variable is HF_TOKEN, I've got it right, and it's still not downloading — I don't know why.
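For reference, the token wiring I'm trying to set up looks roughly like this — a sketch assuming python-dotenv and huggingface_hub are installed and a .env file containing a line like HF_TOKEN=hf_xxx. (As it turns out below, this model isn't gated, so the token isn't actually required.)

```python
# Load a Hugging Face token from a local .env file and register it.
import os

from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()                       # reads .env from the working directory
token = os.environ.get("HF_TOKEN")  # the variable name Hugging Face tooling expects

if token:
    login(token=token)              # downloads via transformers will now use it
else:
    print("No HF_TOKEN set; public models still download without one.")
```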
Let's go look at the model page and make sure there isn't anything we had to accept — sometimes that's a requirement, and if you don't accept the terms they won't give you access. Looking at the model card, there's nothing to select, nothing whatsoever. Looking more carefully at the files, we have some safetensors, which is fine — oh, there it goes. We just had to be a little patient; it's probably a really popular model right now, and that's probably why it's so slow to download. I did move the print statement into the lower cell, so the output might show up in either place — one of the cells might be redundant because I rearranged things while it was running live — but we'll find out in a moment.

It takes a significant time to download, but eventually it pulls the shards, loads the checkpoint, and reports cuda:0, which means it's placing the model on the first GPU, so it should be using the RTX 4080. Now it appears to be running. The nice thing is that once the model is downloaded, we can just call the pipeline each time and it'll be a lot faster. Coming back to it: it ran the setup cell, but I hadn't actually run the generation cell, so I run that — and it sits there longer than it should, so I stop it and run it again, figuring it'll be faster the second time. Meanwhile the video I'm recording is starting to struggle; this is why I like having a separate machine for this kind of thing, because now my computer is hanging, so I'm going to pause.

All right, I'm sort of back; my computer almost crashed again. I'm telling you, it's not the Lunar Lake — these models can exhaust all your resources, which is why it's really good to have a dedicated external machine, like an AI PC or a separate PC with GPUs, rather than your main machine. There is a tool called nvidia-smi that will show the GPU usage; it probably won't tell us much right now, but while the model is running we can use it to figure out how much of the GPU is actually being used. Scrolling back up to the output, it says CUDA out of memory, and that CUDA kernel errors may be asynchronously reported at some other API call. This is what I mean about it getting a little bit challenging.
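Since nvidia-smi just came up and CUDA is reporting out-of-memory, here's a small in-Python check that complements watching nvidia-smi in another terminal; it's a diagnostic sketch, not something from the original code.

```python
# Rough picture of GPU memory from inside the notebook.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    allocated_gib = torch.cuda.memory_allocated(0) / 1024**3
    print(f"{props.name}: {allocated_gib:.1f} GiB allocated of {total_gib:.1f} GiB")
    torch.cuda.empty_cache()  # frees cached blocks, but will not rescue a true OOM
else:
    print("No CUDA device visible to PyTorch")
```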
(By the way, I'll bring my camera back in so we stop seeing the EOS Webcam Utility screen.) Remember that when we downloaded models with Ollama, they were in GGUF, a file format that's optimized to run on CPUs and can use GPUs as well, so those were already optimized. The model we're downloading here isn't, as far as I can tell, and apparently I just don't have enough memory to run it even at 8 billion parameters. The first question is whether it's even pulling the right one — and yes, going back to the page, this is the distilled 8-billion-parameter model, it has to be — so we might genuinely not be able to run it, at least not in this format. You can see where the challenges come from. Looking at the repo files, it's a bunch of safetensors, which doesn't help us much, so let's go back to the DeepSeek collection and look at the options. We did pick the 8-billion one; there's a Qwen 7-billion, which is a bit smaller, and also a 1.5-billion one, which isn't going to be very useful — but I'm exhausting my resources here, so we can run the smaller one as an example, and if you have more memory than I do right now, you'll have less of a problem. So I copy the smaller model's id and paste it in; now we're literally just using a smaller model, because I don't think I have enough memory for the 8B, especially while I'm recording at the same time.

Over in nvidia-smi, after a clear, we can see fan, temperature, and performance, and none of the GPU is being used at the moment; if it were, the processes would show up there. Right now I think it's just trying to download the new model, since we swapped it out, so at some point it should start downloading — it isn't for some reason, but the last one took a while to get going too, so I'll pause until I see something. After waiting a while, it ran and again says CUDA out of memory, with CUDA errors possibly reported asynchronously at other API calls. It keeps running out of memory, and I think that's more an issue with this computer, so I'm going to stop the video and restart the machine — it's the easiest way I know to free the memory — even though nvidia-smi shows no memory usage, so I'm not entirely sure what the issue is. I'll also close OBS, run it offline, and show you the results.

All right, I'm back, and this time it ran much faster. Maybe it was holding onto a cache from the old model, but giving the computer a restart really did help, and you can see we're getting the model to run. I don't need to rebuild the pipeline every time — I'm not sure why I ran that twice — but I should be able to run the generation again. I'm recording now, so it may not behave as well while the GPU is shared... and sure enough it's struggling, even though when I ran it offline it was almost instantaneous. So I think it's fighting with the recording for resources, which makes this a bit tricky for me. Back in nvidia-smi I'm not seeing the processes listed, so it's hard to tell exactly what's going on.
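For reference, the model swap above really is just a different repository id; a sketch using the 1.5-billion-parameter Qwen distill (the 7B Qwen is the middle option if you have a bit more headroom):

```python
# Same pipeline, smaller distilled checkpoint.
import torch
from transformers import pipeline

small_pipe = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    torch_dtype=torch.float16,
    device_map="auto",
)

reply = small_pipe(
    [{"role": "user", "content": "In one paragraph, what is the GGUF format for?"}],
    max_new_tokens=96,
)
print(reply[0]["generated_text"])
```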
I'm going to go ahead and stop this — can I even stop it? — but it clearly works. Over in nvidia-smi you can see volatile GPU utilization hit 100%, and further down it shows 33%; I thought the individual processes would show up so we could make better sense of it, and here, I guess, is the memory usage — 790 of 8,818 — so you can see the limits we're up against. If I run it again, you can see that just recording this video is eating into the memory, which makes it a bit of a challenge. The only way around that would be to use onboard graphics for the recording, which isn't working for me — I don't even know if I have onboard graphics — but that's okay. So that's our example, and it clearly does work. I'd like to do another video where we use vLLM, but I'm not sure that's possible; we'll consider this part done, and if there's a video after this one, you'll know I got vLLM working. See you in the next one.

All right, that's my crash course into DeepSeek. I want to give you some of my thoughts about how it went and what we learned along the way. One thing I realized is that to run these models locally you really do need optimized models. When we used Ollama, if you remember, the files had the GGUF extension, a format that's more optimized to run on CPUs — I know that from the LlamaIndex exploration I did for my GenAI Essentials course — and optimized models are going to make all of this a lot more accessible. When we were using LM Studio (not NotebookLM — that's a Google product), it was adding that extra thought process, so more was happening and it was exhausting the machine; even on my main machine with the RTX 4080, which is really good, you could see it ran well. But when we tried to work with the model directly, where we weren't downloading an optimized build, my computer was nearly restarting — it was exhausting both machines, though on this one OBS was also using a lot of resources. There's also a video I didn't include where I tried running it on vLLM, even with the 1.5-billion-parameter Qwen distill, and it was still telling me I was out of memory, so you can see this stuff is really tricky. Even with an RTX 4080 and the Lunar Lake there were challenges. There are still areas where we can use it — I don't think we're quite there yet for a full AI-powered assistant with thought and reasoning, but the RTX 4080 kind of handled it if that's all you're using the machine for, you restart conversations, and you tune some things down, and the Lunar Lake could manage it if we tuned it down further. One thing I said earlier that I want to correct after a bit more research (because I forget things I've learned): NPUs aren't really designed to run LLMs. They're designed to run smaller models alongside your LLM so you can distribute a more complex AI workload — maybe you have an LLM plus a smaller model that handles something like images, and the NPU takes care of that piece.
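That optimization point is also why I keep coming back to OpenVINO: once OpenVINO-optimized builds of these models are available, the Intel-side loading pattern would look roughly like the sketch below. It assumes the optimum-intel package (installed with its openvino extra), uses the distilled Llama-8B id purely as an illustration, and isn't something I ran here.

```python
# Sketch: running a causal LM through OpenVINO via optimum-intel
# (assumes `pip install "optimum[openvino]"`; untested in this walkthrough).
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

# export=True converts the weights to OpenVINO's IR format on the fly;
# whether the CPU or the Intel iGPU runs it well depends on your OpenVINO
# install and the model's level of support.
ov_model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

ov_pipe = pipeline("text-generation", model=ov_model, tokenizer=tokenizer)
print(ov_pipe("Explain what a distilled model is.", max_new_tokens=64)[0]["generated_text"])
```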
But at least for the next couple of years we're not going to see NPUs being used to run the LLMs themselves; it's really the GPUs, so we're pinned to what the iGPU on the Lunar Lake and the RTX 4080 can do. Maybe if I had another graphics card — and I actually do, a 3060, but the computer I bought won't let me slot it in — or if there were a way to distribute the compute between this computer and my old one, or even the Lunar Lake, I bet I could run something a little better. Realistically you'd want a home-built machine with two graphics cards, or multiple AI PCs stacked together with distributed compute. And remember that video of someone running the 671-billion-parameter model: if you paid close attention to the post, it said it was running with 4-bit quantization, so that wasn't the model at full precision, it was heavily quantized — and it was still chugging along. Quantization can be good, but 4 bits is really small, so the real question is: even if you had seven or eight of those machines, you'd still have to quantize it, which isn't easy, it would still be slow, and would the results even be any good? As a demonstration it was cool, but I think the 671-billion-parameter model is really far out of reach. What that means is we can target one of the other ones — say the 70-billion-parameter model — or just reliably run the 7-billion-parameter model by adding one extra computer. If you're smart about it, you're looking at maybe $1,000 to $1,500, and then you can run a model; it won't be as good as ChatGPT or Claude, but it definitely paves the way there. We'll just have to keep waiting for these models to be optimized and for the hardware to improve or the cost to come down — maybe we're just two computers, or two graphics cards, away. That's my two cents, and I'll see you in the next one. Okay, ciao.

By Amjad Izhar
Contact: amjad.izhar@gmail.com
https://amjadizhar.blog
Affiliate Disclosure: This blog may contain affiliate links, which means I may earn a small commission if you click on the link and make a purchase. This comes at no additional cost to you. I only recommend products or services that I believe will add value to my readers. Your support helps keep this blog running and allows me to continue providing you with quality content. Thank you for your support!
