Building Your Own GPU-Powered Machine Learning Server Part 1. Building the Server

Preface – Understanding the Graphics Processing Unit (GPU)

The Current Standard – Cloud Computing

Cloud computing services are abundant these days.  You can set up simple Linux or Windows servers for your development or testing environments quickly and easily.  Most of the services have a “free tier” of single or dual vCore virtual machines that you can tunnel into and host your Jupyter Notebooks or Flask apps with relative ease.  Some even come with packaged instances ready-made for us Data Nerds.  Here is a list of links to some of the more popular services:

When you start getting into Neural Networks with a ton of Matrix Operations and constant Gradient Loss Optimization over many Epochs, CPUs simply can’t keep up, and this is where GPUs come in handy. To fill the need, most of the cloud computing providers have implemented GPU-powered cloud services, like AWS’s p2.xlarge instances, that host your processing-heavy ML experiments.

A brief CPU/GPU overview:

  • CPUs are great at handling a few tasks quickly.
    • Essentially, a CPU can do as many tasks simultaneously as the number of cores it has.
    • Double that if it has Hyper-Threading.
      • Example: An Intel Quad Core CPU with Hyper-Threading can do 8 tasks at once.
    • Modern CPUs also have built in GPU capabilities (Intel Integrated Graphics), memory controllers, PCI controllers, and generally handle the majority of low level hardware related communication and control on the chip.
    • They stream data from memory, work on it, and pass it back, but it is generally a handful of operations at one time quickly.
  • GPUs have a ton of cores.  They can do many simpler operations simultaneously, which is great for neural networks and calculating your gradients with a large feature size.
    • Consider that rendering 3D Graphics is also a series of many, many simultaneous matrix operations.
    • They have incredibly fast memory.  The communication between the GPU memory and GPU processors is much faster than CPU to System Memory.
      • Consider the textures it loads and transforms from memory when it renders 3D Graphics.
  • CUDA (Compute Unified Device Architecture) is the parallel computing architecture from Nvidia that allows software to hook into these awesome GPU capabilities for things outside of gaming.
    • CUDA was traditionally used by graphically heavy software like Adobe’s applications and DaVinci Resolve to do color correction and add effects to images and video.
    • Nvidia’s recent stock boom is largely thanks to Machine Learning and blockchain mining, but their advances in the field are largely due to the gaming industry’s need for faster and better GPUs.
  • The accessibility of these awesome GPUs (and most fast computers) is largely thanks to the gaming industry.
    • An awesome gaming computer will be just as capable of running your web of Tensors as it is of running Overwatch on Epic settings at 4K 120Hz.
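
The core-count rule of thumb from the overview can be written down in a couple of lines. This is only a sketch: `max_simultaneous_tasks` is a made-up helper name, and `os.cpu_count()` reports the logical cores the operating system sees, which may or may not match the physical layout of your chip.

```python
import os


def max_simultaneous_tasks(physical_cores, hyper_threading):
    """Rule of thumb: one task per core, doubled with Hyper-Threading."""
    return physical_cores * 2 if hyper_threading else physical_cores


# The quad-core example from the overview.
print(max_simultaneous_tasks(4, hyper_threading=True))  # 8

# Logical core count reported by the OS for whatever machine runs this (may be None).
print(os.cpu_count())
```

A GPU flips this picture: instead of a handful of fast cores, it has a very large number of simpler ones, which is why the same rule of thumb gives such a different answer there.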

Why Ubuntu?  You need a Linux distro to run nvidia-docker, which is what lets you build and launch GPU-based Docker images like Keras and TensorFlow-GPU.  At the time of this writing, OS X and Windows are NOT supported by nvidia-docker.

Consumer vs ‘Pro’ Graphics Cards

So now that we understand a bit about why GPUs are used in machine learning, let’s explore our options.  I recently picked up an EVGA GTX 1080 for $550 at BestBuy. If you are reading this in early to mid May 2018, this is hands down the best price you are going to find for this card anywhere.  “But wait,” you say.  “None of the cloud computing guys use GTX 1080s!  They use Tesla or Volta GPUs.”  You’re right, Amazon and Google don’t slap gaming-level GPUs into their servers. They use Nvidia’s pro line.  But the numbers don’t lie.


╔═════════════════╦═══════════════╦══════════════════╦════════════╗
║    Benchmark    ║ Intel Core i7 ║ GeForce GTX 1080 ║ Tesla P100 ║
╠═════════════════╬═══════════════╬══════════════════╬════════════╣
║ MNIST + ConvNet ║ 0.3777 s      ║ 0.005 s          ║ 0.005 s    ║
║ DeepConvLSTM    ║ 1665.2 s      ║ 26.45 s          ║ 21.21 s    ║
╠═════════════════╬═══════════════╬══════════════════╬════════════╣
║ Price           ║               ║ $550             ║ $5,000     ║
╚═════════════════╩═══════════════╩══════════════════╩════════════╝
Source - MEDIUM - link

That’s right.  That’s an extra 0.  The Nvidia Tesla P100 is less than 25% faster at training an LSTM than a stock Nvidia GTX 1080, but it costs roughly 9 times as much.  Nvidia also offers a Ti variant of the card for $800, which actually outperforms the P100 in most benchmarks.  There is one area where the P100 clearly wins, though.  Heat!

╔══════════════════╦════════════╗
║ GeForce GTX 1080 ║ Tesla P100 ║
╠══════════════════╬════════════╣
║ 77°C             ║ 43°C       ║
║ 118/180W         ║ 110/250W   ║
╚══════════════════╩════════════╝ Source - MEDIUM - link

Naturally, if you are running 1,000 of these in a server room, that 34 degree difference is no joke.  Also, big names don’t slap ‘gaming’ gear into their servers.
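
For anyone who wants to check the speed and price gap themselves, here is a small sketch of the arithmetic. The numbers are copied from the benchmark table; the dictionary and function names are made up for illustration.

```python
# DeepConvLSTM training times and prices, copied from the benchmark table.
LSTM_SECONDS = {"GeForce GTX 1080": 26.45, "Tesla P100": 21.21}
PRICE_USD = {"GeForce GTX 1080": 550, "Tesla P100": 5000}


def percent_faster(slow_seconds, fast_seconds):
    """How much faster the quicker card finishes, as a percentage."""
    return (slow_seconds / fast_seconds - 1) * 100


def price_multiple(cheap_usd, expensive_usd):
    """How many times more the expensive card costs."""
    return expensive_usd / cheap_usd


speedup = percent_faster(LSTM_SECONDS["GeForce GTX 1080"], LSTM_SECONDS["Tesla P100"])
multiple = price_multiple(PRICE_USD["GeForce GTX 1080"], PRICE_USD["Tesla P100"])

print(f"P100 is {speedup:.1f}% faster on the LSTM benchmark")
print(f"P100 costs {multiple:.1f}x as much")
```

In other words, a speedup in the mid-20s percent for about nine times the money.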

If you are reading this, chances are you are somewhat familiar with cloud computing for Data Science.  You might also be somewhat familiar with how much it costs to train a big model for a day or two.  If your employer is footing the bill, good on you, but if you are like me, an enthusiast and tinkerer, then you start to wonder how much it would cost and how much time it would take to build and understand the inner workings of these servers.

My Plan

I have an aging 2008 MacBook Pro with SSD and 8GB of RAM but I also have an amazing Hexa-Core audio workstation / gaming rig.

My Rig

  • Asus Rampage IV Extreme, socket 2011 – X79 (a dated but still capable platform)
  • Intel i7-3930K overclocked to 4.5GHz
  • Custom Water Loop (mostly XSPC)
  • 64GB of Corsair Vengeance 2133 DDR3 Memory
  • Hard drives
    • Samsung 960 Evo NVMe – Windows 10
    • Samsung 860 Pro SSD – macOS Sierra 10.12.6 (Hackintosh)
    • Corsair SSD – Ubuntu 16.04
    • 2 x 1TB Seagate Barracuda Drives
  • 1500W Thermaltake PSU (totally overkill, but it was free)
  • EVGA Nvidia GTX 1080

The system cost about $2,500 spread over the course of 4 years, with little upgrades here and there.  An AWS p2.xlarge (the cheapest GPU-powered P2 instance as of the time of this writing) is $0.90/h.  I had everything but the card, and after a few days of struggling with the P2 (no Python 3.6 on the AMIs), I ran to BestBuy to get the 1080.  It was the last one in the SF area.  Would AWS be cheaper?  Maybe initially, but those training sessions can quickly add up.  Besides, I wouldn’t get Overwatch on Epic settings.  And there is something about building your own system that is ultimately incredibly satisfying.
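
As a rough sketch of how quickly those sessions add up, the break-even point is just the hardware cost divided by the hourly rate. This uses the p2.xlarge rate quoted in this post and ignores electricity, storage and data transfer; `break_even_hours` is an illustrative helper, not anything from AWS.

```python
def break_even_hours(hardware_cost_usd, cloud_rate_usd_per_hour):
    """Hours of cloud GPU time that cost as much as buying the hardware."""
    return hardware_cost_usd / cloud_rate_usd_per_hour


P2_XLARGE_RATE = 0.90  # USD per hour, the figure quoted in this post

# Two scenarios: buying only the card, or paying for a whole rig.
for label, cost in [("GTX 1080 only", 550), ("$2,500 system", 2500)]:
    hours = break_even_hours(cost, P2_XLARGE_RATE)
    print(f"{label}: {hours:.0f} hours (about {hours / 24:.0f} days of nonstop training)")
```

For the card alone that works out to roughly 611 hours, so a few weeks of round-the-clock training is enough to pay for it.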

I have built literally hundreds of systems for my business, www.KoreTechs.com, from ZFS servers to $10k color correction workstations, and let me tell you, it isn’t as scary or complicated as it looks.  I’m not going to write a tutorial on how to build a PC because there are hundreds out there, but I will give you some pointers below.

My Recommendations:

Purchasing or building a computer can seem overwhelming.  Here are some tips:

  • Intel i7s are all great.
    • From the first-generation Core i7 to the latest 8th-generation “Coffee Lake”, the differences in actual performance have been negligible from generation to generation.
    • If you are on a budget, you can get a used i7 gaming rig for as little as $300.
  • Most Intel Processors break down into 3 major categories:
    • Consumer – i3 and i5 – Budget
    • Workstation – i7, i7x and i9 – Mid to high end consumer
    • Server – Xeon – The sky’s the limit
  • Most Intel Chipsets break down into 3 major categories:
    • Consumer – Z77, Z170, Z270, Z370, H110… (Socket 115x) – Budget
    • Workstation – X79, X99, X299 – (Socket 2011 or 2066) – Mid to high end consumer
    • Server – C60x, C612… (Socket 2011 or 2066) – The sky’s the limit
  • When buying from an individual rather than store, keep in mind:
    • An i7 from 2010 is NOT the same as an i7 from 2018.
      • Sellers will slap on the i7 label hoping you won’t know any better.
      • Older models lack certain features like USB3, NVMe support and DDR4 memory.
      • They are older and have a shorter lifespan than a newer computer.
    • Make sure it works.
      • Stress test it before purchase if possible.  Prime95 or even just a quick run of the free Geekbench will suffice.
    • Make sure all of the parts are as advertised.
      • Most model numbers should be visible simply by removing the side cover of the case.
  • RAM Speed
    • Really doesn’t make a ton of difference
    • Match your RAM speed with the class of processor
    • Make sure to get the proper DDR type memory for your motherboard
  • Power Supplies
    • Don’t skimp on this
    • I prefer Seasonic
    • Most high-end Corsair are rebranded Seasonics

Step 1 – Picking Parts

CPU Sockets and the Intel i-families demystified

  • Intel or AMD
    • Not to sound like a fanboy, but Intel is the way to go.  In all of the systems I have built and overclocked, no Intels have failed.
    • Great warranty and replacement program, although you probably won’t need it.
    • AMD can be great, and I’ve built a few HTPCs (Home Theater Personal Computers), but still.  Intel.
  • Intel Processor Type
    • Xeon – Server grade processor.
      • Generally not overclockable.
      • More expensive, but the platform offers motherboards with 10Gbit Ethernet.
      • Offered in higher Core Counts.
      • Can be used in dual, quad or even 8-CPU monstrosities.
    • i3 – Entry level.
      • Low PCIe lane count means not very expandable.
      • Not recommended for our Data Science server.
    • i5 – Mid tier
      • These can be quite powerful little machines capable of running a mid tier data science server.
    • i7 – Intel’s top of the line until the recent release of i9s.
      • Expandable with high PCIe lane counts, lots of IO options and generally the go-to.
    • i9 – Intel’s newest tier.
      • Just a rebranding and added confusion.
      • Basically, today’s i7.
  • Intel Processor Labels
    • Intel’s processor model numbers are often followed by a letter.  Here’s what they mean:
      • No letter – Non-overclockable, standard, everyday processor.
      • K – Overclockable version of its non-lettered counterpart.  Usually worth the $10-$15 extra.
      • X – Stands for Extreme.  The highest end of their “gaming” or “workstation” models.
        • Usually have higher core count, cache and PCIe lane count.
        • Often come a year or 2 after initial release of the chipset.
      • Avoid the following:
        • M, T, U and Y.  Those are low power consumption or mobile processors.
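
When you are scanning used listings, the processor suffix rules can be turned into a quick lookup. This is a minimal sketch: `SUFFIX_MEANING` and `describe_suffix` are made-up names, the table only covers the letters discussed in this post, and anything else (HQ, S and so on) is reported as not covered rather than guessed at.

```python
import re

# Meanings taken from the suffix rules in this post; nothing else is assumed.
SUFFIX_MEANING = {
    "": "standard, non-overclockable",
    "K": "overclockable",
    "X": "Extreme, top of the gaming/workstation range",
    "M": "low-power or mobile, avoid",
    "T": "low-power or mobile, avoid",
    "U": "low-power or mobile, avoid",
    "Y": "low-power or mobile, avoid",
}


def describe_suffix(model):
    """Return the meaning of the trailing letter(s) of a model like 'i7-3930K'."""
    match = re.search(r"\d+([A-Za-z]*)$", model.strip())
    if match is None:
        raise ValueError(f"no model number found in {model!r}")
    suffix = match.group(1).upper()
    return SUFFIX_MEANING.get(suffix, "not covered by the list above")


for model in ("i7-3930K", "i7-8700", "i5-8250U"):
    print(model, "->", describe_suffix(model))
```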

The chart below represents which processors will fit into which sockets (some will require a BIOS update).

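╔═══════════════════╦═════════════╦═══════════╦══════════════════════════════╦════════════════════════════════════╦═════════════╦═══════════════════╗
║ Socket            ║ Socket Name ║ Year      ║ CPU Support                  ║ Intel Chipsets                     ║ Line        ║ Max Price I'd Pay ║
╠═══════════════════╬═════════════╬═══════════╬══════════════════════════════╬════════════════════════════════════╬═════════════╬═══════════════════╣
║ LGA 1366          ║ Socket B    ║ 2008-2011 ║ Core i7 - 7xx, 8xx, 9xx      ║ X58                                ║ Prosumer    ║ $300              ║
╠═══════════════════╬═════════════╬═══════════╬══════════════════════════════╬════════════════════════════════════╬═════════════╬═══════════════════╣
║ LGA 1156          ║ Socket H1   ║ 2009-2010 ║ Core i5 - 6xx, 7xx           ║ H55, H57, P55, Q57                 ║ Consumer    ║ NA                ║
║                   ║             ║           ║ Core i7 - 8xx                ║                                    ║             ║                   ║
╠═══════════════════╬═════════════╬═══════════╬══════════════════════════════╬════════════════════════════════════╬═════════════╬═══════════════════╣
║ LGA 1155          ║ Socket H2   ║ 2010-2012 ║ Sandy Bridge:                ║ Sandy Bridge:                      ║ Consumer    ║ $400              ║
║                   ║             ║           ║ i3 - 21xx                    ║ H61, B65, Q65, Q67, H67, P67, Z68  ║             ║                   ║
║                   ║             ║           ║ i5 - 23xx, 24xx, 25xx        ║ Ivy Bridge:                        ║             ║                   ║
║                   ║             ║           ║ i7 - 26xx, 27xx              ║ B75, Q75, Q77, H77, Z75, Z77       ║             ║                   ║
║                   ║             ║           ║ Ivy Bridge:                  ║                                    ║             ║                   ║
║                   ║             ║           ║ i3 - 31xx                    ║                                    ║             ║                   ║
║                   ║             ║           ║ i5 - 33xx, 34xx, 35xx        ║                                    ║             ║                   ║
║                   ║             ║           ║ i7 - 36xx, 37xx              ║                                    ║             ║                   ║
╠═══════════════════╬═════════════╬═══════════╬══════════════════════════════╬════════════════════════════════════╬═════════════╬═══════════════════╣
║ LGA 2011          ║ Socket R1   ║ 2011-2013 ║ Sandy Bridge-E: i7 3xxx      ║ Sandy/Ivy Bridge E:                ║ Workstation ║ $600              ║
║                   ║             ║           ║ Ivy Bridge-E: i7 4xxx        ║ X79                                ║             ║                   ║
║                   ║             ║           ║ Xeon E5: 1xxx, 2xxx          ║ Xeon E5:                           ║             ║                   ║
║                   ║             ║           ║                              ║ C602J, C602, C604, C606, C608      ║             ║                   ║
╠═══════════════════╬═════════════╬═══════════╬══════════════════════════════╬════════════════════════════════════╬═════════════╬═══════════════════╣
║ LGA 1150          ║ Socket H3   ║ 2013-2015 ║ i3 - 41xx, 43xx              ║ Haswell:                           ║ Consumer    ║ $400              ║
║                   ║             ║           ║ i5 - 44xx, 45xx, 46xx        ║ H81, B85, Q85, Q87, H87, Z87       ║             ║                   ║
║                   ║             ║           ║ i7 - 47xx                    ║ Haswell Refresh:                   ║             ║                   ║
║                   ║             ║           ║                              ║ H97, Z97                           ║             ║                   ║
╠═══════════════════╬═════════════╬═══════════╬══════════════════════════════╬════════════════════════════════════╬═════════════╬═══════════════════╣
║ LGA 2011-3        ║ Socket R3   ║ 2014-2016 ║ Haswell E:                   ║ Haswell/Broadwell E:               ║ Workstation ║ $1000             ║
║                   ║             ║           ║ i7 - 5xxx                    ║ X99                                ║             ║                   ║
║                   ║             ║           ║ Broadwell E:                 ║ Xeon E5 v3:                        ║             ║                   ║
║                   ║             ║           ║ i7 - 6xxx                    ║ C612                               ║             ║                   ║
║                   ║             ║           ║ Xeon E5 v3: 1xxx v3, 2xxx v3 ║                                    ║             ║                   ║
╠═══════════════════╬═════════════╬═══════════╬══════════════════════════════╬════════════════════════════════════╬═════════════╬═══════════════════╣
║ LGA 1151 1xx, 2xx ║ Socket H4   ║ 2015-2018 ║ Skylake:                     ║ Skylake:                           ║ Consumer    ║ $600              ║
║                   ║             ║           ║ i3 - 61xx, 63xx              ║ H110, B150, Q150, H170, Q170, Z170 ║             ║                   ║
║                   ║             ║           ║ i5 - 64xx, 65xx, 66xx        ║ Kaby Lake:                         ║             ║                   ║
║                   ║             ║           ║ i7 - 67xx                    ║ B250, Q250, H270, Q270, Z270       ║             ║                   ║
║                   ║             ║           ║ Kaby Lake:                   ║                                    ║             ║                   ║
║                   ║             ║           ║ i3 - 71xx, 73xx              ║                                    ║             ║                   ║
║                   ║             ║           ║ i5 - 74xx, 75xx, 76xx        ║                                    ║             ║                   ║
║                   ║             ║           ║ i7 - 77xx                    ║                                    ║             ║                   ║
╠═══════════════════╬═════════════╬═══════════╬══════════════════════════════╬════════════════════════════════════╬═════════════╬═══════════════════╣
║ LGA 1151 3xx      ║ Socket H4   ║ 2017-TBD  ║ Coffee Lake:                 ║ Coffee Lake:                       ║ Consumer    ║ $800              ║
║                   ║             ║           ║ i3 - 81xx, 83xx              ║ H310, B360, H370, Q370, Z370, Z390 ║             ║                   ║
║                   ║             ║           ║ i5 - 84xx, 85xx, 86xx        ║                                    ║             ║                   ║
║                   ║             ║           ║ i7 - 87xx                    ║                                    ║             ║                   ║
╠═══════════════════╬═════════════╬═══════════╬══════════════════════════════╬════════════════════════════════════╬═════════════╬═══════════════════╣
║ LGA 2066          ║ Socket R4   ║ 2017-TBD  ║ Kaby/Skylake X:              ║ Kaby/Skylake X:                    ║ Workstation ║ $2,000            ║
║                   ║             ║           ║ i5 - 76xxX                   ║ X299                               ║             ║                   ║
║                   ║             ║           ║ i7 - 77xxX, 78xxX            ║                                    ║             ║                   ║
║                   ║             ║           ║ i9 - 79xxX                   ║                                    ║             ║                   ║
╚═══════════════════╩═════════════╩═══════════╩══════════════════════════════╩════════════════════════════════════╩═════════════╩═══════════════════╝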

 
