分布式计算Gearman

On 2010年08月9日, in highscalability, linux, by netoearth

Gearman是一个分发任务的程序框架,可以用在各种场合,与Hadoop相 比,Gearman更偏向于任务分发功能。它的任务分布非常简单,简单得可以只需要用脚本即可完成。Gearman最初用于LiveJournal的图片 resize功能,由于图片resize需要消耗大量计算资源,因此需要调度到后端多台服务器执行,完成任务之后返回前端再呈现到界面。

尽管一个 Web 应用程序的大部分内容都与表示有关,但它的价值与竞争优势却可能体现在若干专有服务或算法方面。如果这类处理过于复杂或拖沓,最好是进行异步执行,以免 Web 服务器对传入的请求没有响应。实际上,将一个计算密集型的或专门化的功能放在一个或多个独立的专用服务器上运行,效果会更好。

常用的缩略词

  • API:应用程序编程接口
  • HTTP:超文本传输协议
  • LAMP:Linux、Apache、MySQL 与 PHP

PHP 的 Gearman 库能把工作分发给一组机器。Gearman 会对作业进行排队并少量分派作业,而将那些复杂的任务分发给为此任务预留的机器。这个库对 Perl、Ruby、C、Python 及 PHP 开发人员均可用,并且还可以运行于任何类似 UNIX® 的平台上,包括 Mac OS X、 Linux® 和 Sun Solaris。

向一个 PHP 应用程序添加 Gearman 非常简单。假设您将 PHP 应用程序托管在一个典型的 LAMP 配置上,那么 Gearman 将需要一个额外的守护程序以及一个 PHP 扩展。截止到 2009 年 11 月,Gearman 守护程序的最新版本是 0.10,并且有两个 PHP 扩展可以用 — 一个用 PHP 包裹了 Gearman C 库,另一个用纯 PHP 编写。我们要用的是前者。它的最新版本是 0.6.0,可以从 PECL 或 Github(参见 参考资料)获取它的源代码。

请注意:对于本文而言,producer 指的是生成工作请求的机器;consumer 是执行工作的机器;而 agent 则是连接 producer 与适当 consumer 的中介。

安装 Gearman

向一个机器添加 Gearman 需要两步:第一步构建并启动这个守护程序,第二步构建与 PHP 版本相匹配的 PHP 扩展。这个守护程序包包括构建此扩展所需的所有库。

首先,下载 Gearman 守护程序 gearmand 的最新源代码,解压缩这个 tarball,构建并安装此代码(安装需要有超级用户的权限,即根用户权限)。

$ wget http://launchpad.net/gearmand/trunk/\
  0.10/+download/gearmand-0.10.tar.gz
$ tar xvzf gearmand-0.10.tar.gz
$ cd gearmand-0.10
$ ./configure
$ make
$ sudo make install

安装 gearmand 后,构建 PHP 扩展。您可以从 PECL 获取这个 tarball,也可以从 Github 复制该存储库。

$ wget http://pecl.php.net/get/gearman-0.6.0.tgz
$ cd pecl-gearman
#
# or
#
$ git clone git://github.com/php/pecl-gearman.git
$ cd pecl-gearman

有了这些代码后,就可以开始构建扩展了:

$ phpize
$ ./configure
$ make
$ sudo make install

这个 Gearman 守护程序通常被安装在 /usr/sbin。可以从命令行直接启动此守护程序,也可以将这个守护程序添加到启动配置中,以便在机器每次重启时就可以启动这个守护程序。

接下来,需要安装 Gearman 扩展。打开 php.ini 文件(可以通过 php --ini 命令快速找到这个文件),然后添加代码行 extension = gearman.so

$ php --ini
Loaded Configuration File:         /etc/php/php.ini
$ vi /etc/php/php.ini 
...
extension = gearman.so

保存此文件。要想验证扩展是否启用,请运行 php --info,然后查找 Gearman:

$ php --info | grep "gearman support"
gearman
gearman support => enabled
libgearman version => 0.10

此外,还可以用一个 PHP 代码片段来验证构建和安装是否得当。将这个小应用程序保存到 verify_gearman.php:

<?php
  print gearman_version() . "\n";
?>

接下来,从命令行运行此程序:

$ php verify_gearman.php
0.10

如果这个版本号与之前构建和安装的 Gearman 库的版本号相匹配,那么系统就已准备好了。


回页首

运行 Gearman

我们前面提到过,一个 Gearman 配置有三个角色:

  • 一个或多个 producer 生成工作请求。每个工作请求命名它所想要的函数,例如 email_allanalyze
  • 一个或多个 consumer 完成请求。每个 consumer 命名它所提供的一个或多个函数并向 agent 注册这些功能。一个 consumer 也可以被称为是一个 worker
  • 代理对与之建立连接的那些 consumer 提供的所有服务进行集中编制。它将 producer 与恰当的 consumer 联系起来。

借助如下的命令行,可以立即体验 Gearman:

  1. 启动这个 agent,即 Gearman 守护程序:
    $ sudo /usr/sbin/gearmand --daemon
    
  2. 用命令行实用工具 gearman 运行一个 worker。这个 worker 需要一个名字并能运行任何命令行实用工具。例如,可以创建一个 worker 来列出某个目录的内容。-f 参数命名了该 worker 所提供的函数:
    $ gearman -w -f ls -- ls -lh
    
  3. 最后的一个代码块是一个 producer,或用来生成查找请求的一个作业。也可以用 gearman 生成一个请求。同样,用 -f 选项来指定想要从中获得帮助的那个服务:
    $ gearman -f ls < /dev/null
    drwxr-xr-x@ 43 supergiantrobot  staff   1.4K Nov 15 15:07 gearman-0.6.0
    -rw-r--r--@  1 supergiantrobot  staff    29K Oct  1 04:44 gearman-0.6.0.tgz
    -rw-r--r--@  1 supergiantrobot  staff   5.8K Nov 15 15:32 gearman.html
    drwxr-xr-x@ 32 supergiantrobot  staff   1.1K Nov 15 14:04 gearmand-0.10
    -rw-r--r--@  1 supergiantrobot  staff   5.3K Jan  1  1970 package.xml
    drwxr-xr-x  47 supergiantrobot  staff   1.6K Nov 15 14:45 pecl-gearman
    

回页首

从 PHP 使用 Gearman

从 PHP 使用 Gearman 类似于之前的示例,惟一的区别在于这里是在 PHP 内创建 producer 和 consumer。每个 consumer 的工作均封装在一个或多个 PHP 函数内。

清单 1 给出了用 PHP 编写的一个 Gearman worker。将这些代码保存在一个名为 worker.php 的文件中。
清单 1. Worker.php

				
<?php
  $worker= new GearmanWorker();
  $worker->addServer();
  $worker->addFunction("title", "title_function");
  while ($worker->work());
   
  function title_function($job)
  {
    return ucwords(strtolower($job->workload()));
  }
?>

清单 2 给出了用 PHP 编写的一个 producer,或 client。将此代码保存在一个名为 client.php 的文件内。
清单 2. Client.php

				
<?php
  $client= new GearmanClient();
  $client->addServer();
  print $client->do("title", "AlL THE World's a sTagE");
  print "\n";
?>

现在,可以用如下的命令行连接客户机与 worker 了:

$ php worker.php &
$ php client.php
All The World's A Stage
$ jobs
[3]+  Running                 php worker.php &

这个 worker 应用程序继续运行,准备好服务另一个客户机。


回页首

Gearman 的高级特性

在一个 Web 应用程序内可能有许多地方都会用到 Gearman。可以导入大量数据、发送许多电子邮件、编码视频文件、挖据数据并构建一个中央日志设施 — 所有这些均不会影响站点的体验和响应性。可以并行地处理数据。而且,由于 Gearman 协议是独立于语言和平台的,所以您可以在解决方案中混合编程语言。比如,可以用 PHP 编写一个 producer,用 C、Ruby 或其他任何支持 Gearman 库的语言编写 worker。

一个连接客户机和 worker 的 Gearman 网络实际上可以使用任何您能想象得到的结构。很多配置能够运行多个代理并将 worker 分配到许多机器上。负载均衡是隐式的:每个可操作的可用 worker(可能是每个 worker 主机具有多个 worker)从队列中拉出作业。一个作业能够同步或异步运行并具有优先级。

Gearman 的最新版本已经将系统特性扩展到了包含持久的作业队列和用一个新协议来通过 HTTP 提交工作请求。对于前者,Gearman 工作队列保存在内存并在一个关系型数据库内存有备份。这样一来,如果 Gearman 守护程序故障,它就可以在重启后重新创建这个工作队列。另一个最新的改良通过一个 memcached 集群增加队列持久性。memcached 存储也依赖于内存,但被分散于几个机器以避免单点故障。

Gearman 是一个刚刚起步却很有实力的工作分发系统。据 Gearman 的作者 Eric Day 介绍,Yahoo! 在 60 或更多的服务器上使用 Gearman 每天处理 600 万个作业。新闻聚合器 Digg 也已构建了一个相同规模的 Gearman 网络,每天可处理 400,000 个作业。Gearman 的一个出色例子可以在 Narada 这个开源搜索引擎(参见 参考资料)中找到。

——————————————————————————————————————————————————————————–

下面看看如何安装运行一个例子,条件所限,我们把Client,Job,Worker三个角色运行在一台服务器上:

安装Gearman server and library:

wget http://launchpad.net/gearmand/trunk/0.8/+download/gearmand-0.8.tar.gz
tar zxf gearmand-0.8.tar.gz
cd gearmand-0.8
./configure
make
make install

安装Gearman PHP extension:

wget http://pecl.php.net/get/gearman-0.4.0.tgz
tar zxf gearman-0.4.0.tgz
cd gearman-0.4.0
phpize
./configure
make
make install

编辑php.ini配置文件加载相应模块并使之生效:

extension = “gearman.so”

启动Job:

gearmand -d

如果当前用户是root的话,则需要这样操作:

gearmand -d -u root

缺省会使用4730端口,下面会用到。

注意:如果找不到gearmand命令的路径,别忘了用whereis gearmand确认。

编写Worker:

worker.php文件内容如下:

<?php
$worker= new GearmanWorker();
$worker->addServer(‘127.0.0.1′, 4730);
$worker->addFunction(‘reverse’, ‘my_reverse_function’);

while ($worker->work());

function my_reverse_function($job)
{
return strrev($job->workload());
}
?>

设置后台运行work:

php worker.php &

编写Client:

client.php文件内容如下:

<?php
$client= new GearmanClient();
$client->addServer(‘127.0.0.1′, 4730);
echo $client->do(‘reverse’, ‘Hello World!’), “\n”;
?>

运行client:

php client.php

输出:!dlroW olleH

出于方便的考虑,Worker,Client使用的都是PHP,但这并不影响演示,实际应用中,你完全可以通过Gearman集成不同语言实现的 Worker,Client。或许此时你还想了解前面提到的负载均衡功能:很简单,只要增加多个Worker即可,你可以按照worker.php的样子 多写几个类似的文件,并设置不同的返回值用以识别演示效果。然后依次启动这几个Worker文件,并多次使用client.php去请求,你就会发现 Job会把Client请求转发给不同的Worker。

命令工具

如果你觉得安装PHP之类的东西太麻烦的话,你也可以仅仅通过命令行工具来体验Gearman的功能:

启动Worker:gearman -w -f wc — wc -l &
运行Client:gearman -f wc < /etc/passwd

—————————————————————————————————————————–

How Does Gearman Work?

Gearman Stack

A Gearman powered application consists of three parts: a client, a worker, and a job server. The client is responsible for creating a job to be run and sending it to a job server. The job server will find a suitable worker that can run the job and forwards the job on. The worker performs the work requested by the client and sends a response to the client through the job server. Gearman provides client and worker APIs that your applications call to talk with the Gearman job server (also known as gearmand) so you don’t need to deal with networking or mapping of jobs. Internally, the gearman client and worker APIs communicate with the job server using TCP sockets. To explain how Gearman works in more detail, lets look at a simple application that will reverse the order of characters in a string. The example is given in PHP, although other APIs will look quite similar.

We start off by writing a client application that is responsible for sending off the job and waiting for the result so it can print it out. It does this by using the Gearman client API to send some data associated with a function name, in this case the function “reverse”. The code for this is (with error handling omitted for brevity):

# Reverse Client Code
$client= new GearmanClient();
$client->addServer();
print $client->do("reverse", "Hello World!");

This code initializes a client class, configures it to use a job server with add_server (no arguments means use 127.0.0.1 with the default port), and then tells the client API to run the “reverse” function with the workload “Hello world!”. The function name and arguments are completely arbitrary as far as Gearman is concerned, so you could send any data structure that is appropriate for your application (text or binary). At this point the Gearman client API will package up the job into a Gearman protocol packet and send it to the job server to find an appropriate worker that can run the “reverse” function. Let’s now look at the worker code:

# Reverse Worker Code
$worker= new GearmanWorker();
$worker->addServer();
$worker->addFunction("reverse", "my_reverse_function");
while ($worker->work());
 
function my_reverse_function($job)
{
  return strrev($job->workload());
}

Gearman Flow

This code defines a function “my_reverse_function” that takes a string and returns the reverse of that string. It is used by a worker object to register a function named “reverse” after it is setup to connect to the same local job server as the client. When the job server receives the job to be run, it looks at the list of workers who have registered the function name “reverse” and forwards the job on to one of the free workers. The Gearman worker API then takes this request, runs the function “my_reverse_function”, and sends the result of that function back through the job server to the client.

As you can see, the client and worker APIs (along with the job server) deal with the job management and network communication so you can focus on the application parts. There a few different ways you can run jobs in Gearman, including background for asynchronous processing and prioritized jobs. See the documentation available for the various APIs for details.

How Is Gearman Useful?

The reverse example above seems like a lot of work to run a function, but there are a number of ways this can be useful. The simplest answer is that you can use Gearman as an interface between a client and a worker written in different languages. If you want your PHP web application to call a function written in C, you could use the PHP client API with the C worker API, and stick a job server in the middle. Of course, there are more efficient ways of doing this (like writing a PHP extension in C), but you may want a PHP client and a Python worker, or perhaps a MySQL client and a Perl worker. You can mix and match any of the supported language interfaces easily, you just need all applications to be able to understand the workload being sent. Is your favorite language not supported yet? Get involved with the project, it’s probably fairly easy for either you or one of the existing Gearman developers to put a language wrapper on top of the C library.

The next way that Gearman can be useful is to put the worker code on a separate machine (or cluster of machines) that are better suited to do the work. Say your PHP web application wants to do image conversion, but this is too much processing to run it on the web server machines. You could instead ship the image off to a separate set of worker machines to do the conversion, this way the load does not impact the performance of your web server and other PHP scripts. By doing this, you also get a natural form of load balancing since the job server only sends new jobs to idle workers. If all the workers running on a given machine are busy, you don’t need to worry about new jobs being sent there. This makes scale-out with multi-core servers quite simple: do you have 16 cores on a worker machine? Start up 16 instances of your worker (or perhaps more if they are not CPU bound). It is also seamless to add new machines to expand your worker pool, just boot them up, install the worker code, and have them connect to the existing job server.

Gearman Cluster

Now you’re probably asking what if the job server dies? You are able to run multiple job servers and have the clients and workers connect to the first available job server they are configured with. This way if one job server dies, clients and workers automatically fail over to another job server. You probably don’t want to run too many job servers, but having two or three is a good idea for redundancy. The diagram to the left shows how a simple Gearman cluster may look.

From here, you can scale out your clients and workers as needed. The job servers can easily handle having hundreds of clients and workers connected at once. You can draw your own physical (or virtual) machine lines where capacity allows, potentially distributing load to any number of machines. For more details on specific uses and installations, see the section on use cases.

Tagged with:  

Comments are closed.