{"id":24,"date":"2008-12-07T21:04:08","date_gmt":"2008-12-08T04:04:08","guid":{"rendered":"http:\/\/www.imaginarybillboards.com\/?p=24"},"modified":"2010-04-05T10:45:13","modified_gmt":"2010-04-05T17:45:13","slug":"perl-super-easy-parallelization-with-threadeach","status":"publish","type":"post","link":"http:\/\/www.imaginarybillboards.com\/?p=24","title":{"rendered":"Perl super-easy parallelization with threadeach"},"content":{"rendered":"

I've been thinking about a good way to make Perl easier to parallelize. The thing that keeps coming to mind is that it should be so easy you wouldn't even have to think about it. A lot of the time in sysadmin-land, you just have to do the exact same thing to a whole bunch of things. Some examples from just the last week at work:

For each thing in a list, connect to its database, pull some data, and do some analysis on it.

For each server in a list, connect to it and do something: push a file, get a file, run a command, etc.

For each IP/port in a list, open a socket, listen for some amount of time, then return the results.

So, what to call it? I just like the name threadeach(). Normally, in Perl, you'd write this:

    foreach my $thing (@list_of_things) { ... do something ... }

It'd be nice, when you know the loop body can safely run in parallel, to be able to write it like this instead:

    threadeach my $thing (@list_of_things) { function to be performed on each $thing }
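
For comparison, here is roughly what the same loop looks like with no helper at all, just Perl's core threads module: spawn a thread per item, then join them all. This is only a sketch, and do_something() stands in for whatever the loop body would actually be.

    use threads;

    my @workers;
    foreach my $thing (@list_of_things) {
        # one thread per item; the bodies all run in parallel
        push @workers, threads->create( sub { do_something($thing) } );
    }

    # wait for every thread to finish
    $_->join for @workers;

It works, but that's a fair amount of boilerplate for what is conceptually just "foreach, but in parallel."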

Right now, you can *sort of* do the same thing with a little work. I've got a threadeach module along those lines that I've been using.

Threadeach

I whipped up a module for it with three functions.
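
Just to make the shape concrete, here's a minimal, single-function sketch of the core trick (a sketch only, not the actual three-function module): a (&@) prototype, the same one List::Util uses for first and reduce, lets the block come first in the call, and the rest is just thread create/join.

    package Threadeach;

    use strict;
    use warnings;
    use threads;

    use Exporter 'import';
    our @EXPORT = qw(threadeach);

    # Usage:  my @results = threadeach { ... $_ ... } @list;
    # Runs the block once per element, each in its own thread, with $_
    # set to the current element inside the block, then joins them all
    # and returns whatever the blocks returned.
    sub threadeach (&@) {
        my ( $code, @items ) = @_;
        my @workers = map {
            my $item = $_;
            threads->create( sub { local $_ = $item; $code->($item) } );
        } @items;
        return map { $_->join } @workers;
    }

    1;

With that, the call reads almost like the dream syntax: threadeach { ... } @list_of_things; where the block body is whatever you'd have put inside the foreach. One caveat: Perl ithreads are fairly heavyweight, so one thread per element is fine for a few dozen servers, but for really long lists you'd want some kind of worker pool instead.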