November 18th, 2017 • Q-Learning in python
There is a nice tutorial that explains how Q-Learning works here. The following python code implements the basic principles of Q-Learning:
Let’s assume we have a state matrix defining how we can transition between states, and a goal state (5):
import random

import numpy as np

GOAL_STATE = 5
# rows are states, columns are actions
STATE_MATRIX = np.array([[np.nan, np.nan, np.nan, np.nan, 0., np.nan],
                         [np.nan, np.nan, np.nan, 0., np.nan, 100.],
                         [np.nan, np.nan, np.nan, 0., np.nan, np.nan],
                         [np.nan, 0., 0., np.nan, 0., np.nan],
                         [0., np.nan, np.nan, 0., np.nan, 100.],
                         [np.nan, 0., np.nan, np.nan, 0., 100.]])
Q_MATRIX = np.zeros(STATE_MATRIX.shape)
Visually this can be represented as follows:
(Figure: the six states and the possible transitions between them, shown as a graph.)
For example, if we are in state 0, we can go to state 4, as indicated by the 0. entry. If we are in state 4, we can go directly to state 5, as indicated by the 100. entry. np.nan marks impossible transitions. Finally, we initialize an empty Q-matrix.
Now the Q-Learning algorithm itself is simple. The comments in the following code segment will guide you through the steps:
# hyper-parameters - example values.
MAX_EPISODES = 1000
EPSILON = 0.8

i = 0
while i < MAX_EPISODES:
    # pick a random state
    state = random.randint(0, 5)
    while state != GOAL_STATE:
        # find possible actions for this state.
        candidate_actions = _find_next(STATE_MATRIX[state])
        # randomly pick one action.
        action = random.choice(candidate_actions)
        # determine what the next states could be for this action...
        next_actions = _find_next(STATE_MATRIX[action])
        values = []
        for item in next_actions:
            values.append(Q_MATRIX[action][item])
        # add some exploration randomness...
        if random.random() < EPSILON:
            # so we do not always select the best...
            max_val = random.choice(values)
        else:
            max_val = max(values)
        # calc the Q value matrix (EPSILON doubles as the discount factor here)...
        Q_MATRIX[state][action] = STATE_MATRIX[state][action] + \
            EPSILON * max_val
        # next step.
        state = action
    i += 1
We need one little helper routine for this – it will help determine the next possible steps we can take:
def _find_next(items):
    res = []
    for i, item in enumerate(items):
        # np.nan compares as False here, so only valid transitions are kept.
        if item >= 0:
            res.append(i)
    return res
Finally we can output the results:
Q_MATRIX = Q_MATRIX / Q_MATRIX.max()
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print(Q_MATRIX)
This will output the following Q-Matrix:
[[ 0.000 0.000 0.000 0.000 0.800 0.000]
[ 0.000 0.000 0.000 0.168 0.000 1.000]
[ 0.000 0.000 0.000 0.107 0.000 0.000]
[ 0.000 0.800 0.134 0.000 0.134 0.000]
[ 0.044 0.000 0.000 0.107 0.000 1.000]
[ 0.000 0.000 0.000 0.000 0.000 0.000]]
This shows, for example, that the best path to get from state 2 to state 5 is: 2 -> 3 (0.107), 3 -> 1 (0.8), 1 -> 5 (1.0).
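Reading such a path off the Q-Matrix can also be done programmatically by greedily following the action with the highest Q value from each state. A minimal sketch (the helper name walk_greedy is made up for illustration; it assumes the Q-Matrix has converged, otherwise the loop might not terminate):
def walk_greedy(q_matrix, start, goal):
    # follow the highest-valued action from each state until the goal is reached.
    path = [start]
    state = start
    while state != goal:
        state = int(np.argmax(q_matrix[state]))
        path.append(state)
    return path

print(walk_greedy(Q_MATRIX, 2, GOAL_STATE))  # prints [2, 3, 1, 5] for the matrix above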
Categories: Personal • Tags: Machine Learning, Python • Permalink for this article
November 12th, 2017 • Controlling a Mesos Framework
Note: This is purely for fun, and only representing early results.
It is possible to combine more traditional scheduling and resource managers like OpenLava with a DCOS like Mesos [1]. The basic implementation which glues OpenLava and Mesos together is very simple: as long as jobs are in the queue(s) of the HPC/HTC scheduler, it will try to consume offers presented by Mesos to run these jobs on. There is a minor issue with that however: the framework is very greedy, and will consume a lot of offers from Mesos (so be sure to set quotas etc.).
To control how many offers/tasks the framework needs to dispatch the jobs in the queue of the HPC/HTC scheduler, we can use a simple PID controller. By applying a bit of control we can tame the framework, as the following diagram shows:
(Figure: the behaviour of the framework once the PID controller is applied.)
We define the ratio between running and pending jobs as a possible target for the controller (accounting for division by zero). Given this, we can set the PID controller to try to keep the system at a ratio of e.g. 0.3 as a target (semi-randomly picked).
For example: if 10 jobs are running while 100 are in the queue pending, the ratio will be 0.1 and we need to take more resource offers from Mesos. More offers means more resources available for the jobs in the queues – so the number of running jobs will increase. Let’s assume a stable number of jobs in the queue, so e.g. the system will now be running 30 jobs while 100 jobs are in the queue. This represents the steady state and the system is happy. If the number of jobs in the queues decreases, the system will need less resources to process them. For example 30 jobs running while 50 are pending gives us a ratio of 0.6. As this is a higher ratio than the specified target, the system will decrease the number of tasks needed from Mesos.
This approach is very agnostic to job execution times too. Long running jobs will lead to more jobs in the queue (as they are blocking resources) and hence a decreasing ratio, leading to the framework picking up more offers. Short running jobs will lead to the number of pending jobs decreasing faster and hence a higher ratio, which in turn will lead to the framework disregarding the resources offered to it.
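As a side note, the ratio itself can be computed along these lines – a small sketch only, with a made-up guard for the division-by-zero case mentioned above:
def get_ratio(n_running, n_pending):
    # ratio of running to pending jobs; guard against division by zero.
    if n_pending == 0:
        return 1.0  # nothing pending - no need to ask Mesos for more resources.
    return float(n_running) / n_pending

print(get_ratio(10, 100))  # 0.1 -> below the 0.3 target, take more offers
print(get_ratio(30, 50))   # 0.6 -> above the target, disregard offers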
And all the magic happens in very few lines of code running in a thread:
def run(self):
    while not self.done:
        error = self.target - self.current  # target = 0.3, self.current == ratio from last step
        goal = self.pid_ctrl.step(error)  # call the PID controller
        self.current, pending = self.scheduler.get_current()  # get current ratio from the scheduler
        self.scheduler.goal = max(0, int(goal))  # set the new goal of # of needed tasks.
        time.sleep(1)
The PID controller itself is super basic:
class PIDController(object):
    """
    Simple PID controller.
    """

    def __init__(self, prop_gain, int_gain, dev_gain, delta_t=1):
        # P/I/D gains
        self.prop_gain = prop_gain
        self.int_gain = int_gain
        self.dev_gain = dev_gain

        self.delta_t = delta_t

        self.i = 0
        self.d = 0
        self.prev = 0

    def step(self, error):
        """
        Do some work & progress.
        """
        self.i += self.delta_t * error
        self.d = (error - self.prev) / self.delta_t
        self.prev = error

        tmp = \
            self.prop_gain * error + \
            self.int_gain * self.i + \
            self.dev_gain * self.d

        return tmp
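For reference, a tiny usage sketch of this controller – the gain values here are picked arbitrarily for illustration and are not the ones used in the actual framework:
pid_ctrl = PIDController(prop_gain=20.0, int_gain=5.0, dev_gain=0.0)

target = 0.3
ratio = 0.1  # e.g. 10 jobs running vs. 100 pending
error = target - ratio
goal = pid_ctrl.step(error)
print(max(0, int(goal)))  # number of tasks to request from Mesos next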
I can recommend the following book on control theory btw: Feedback Control for Computer Systems.
Categories: Personal • Tags: Control Theory, LSF, Orchestration, Scheduling • Permalink for this article
April 17th, 2017 • Distributed systems: which cluster do I obey?
The topic of cluster formation in itself is not new. There are plenty of methods around to form clusters on the fly [1]. They mostly follow methods which make use of gossip protocols. Implementations can widely be found, even in products like Akka.
Once a cluster is formed, the question becomes whether control (see this post for some background) is centralized or decentralized. Or something in between, with a (hierarchical) set of leaders managing – in a distributed fashion – a cluster of entities that are actuating. Both centralized and decentralized approaches have their advantages and disadvantages. Centralized approaches are sometimes more efficient as they might have more knowledge to base their decisions upon. Note that even in centralized management approaches a gossip protocol might be in use. There are also plenty of algorithms around to support consensus and leader election [2], [3].
Especially in Fog and Edge computing use cases, entities which move (e.g. a car, plane, flying/sailing drone) have to decide which cluster they want to belong to and obey the actions initiated by it. Graph representations are great for this – and, yet again, because of the rich set of algorithms in the graph theory space, the solution might be quite simple to implement. Note that in Fog/Edge use cases there will most likely always be very (geographically) static entities like Cloudlets in the mix – which could be elected as leaders.
For example, the following diagram shows two clusters: the red nodes’ size indicates who is the current leader. The blue dot marks an entity that is connected to both clusters. The width of the edge connecting it to a cluster indicates to which cluster it ‘belongs’ (aka its affinity).
(Figure: two clusters with their leaders, and the blue entity connected to both.)
As the blue entity moves (geographically), the connections/edges to the clusters change. Based on algorithms like this, the entity can then make a decision to hand off control to another cluster and obey the directions given by it.
(Figure: the clusters and the entity’s affinities after it has moved.)
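The hand-off decision itself can be kept very small. The following sketch assumes the affinity is simply the edge weight towards each cluster’s leader, and uses a made-up hysteresis margin so the entity does not flip-flop between clusters while moving:
def pick_cluster(affinities, current, margin=0.2):
    # affinities: dict mapping cluster id -> edge weight (affinity).
    best = max(affinities, key=affinities.get)
    # only hand off if the new cluster is clearly better than the current one.
    if best != current and affinities[best] > affinities.get(current, 0.0) * (1 + margin):
        return best
    return current

print(pick_cluster({'cluster_a': 0.9, 'cluster_b': 0.3}, current='cluster_b'))  # -> cluster_a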
Categories: Personal • Tags: Distributed Systems, Graphs • Permalink for this article
January 7th, 2017 • Example 2: Intelligent Orchestration & Scheduling with OpenLava
This is the second post in a series (the first post can be found here) about how to insert smarts into a resource manager. So let’s look at how a job scheduler or distributed resource management system (DRMS) – OpenLava in an HPC use case – can be used. For the rationale behind all this, check the background section of the last post.
The basic principle behind this particular example is simple: each host in a cluster will report a “rank”; the rank will be used to make a decision on where to place a job. The rank could be defined as: the rank is high when the sub-systems of the host are heavily used, and low when no or only some sub-systems are utilized. How the individual sub-systems’ usage influences the rank value is something that can be machine learned.
Let’s assume the OpenLava managed cluster is up and running and a couple of hosts are available. The concept of elims can be used to get the job done. The first step is to teach the system what the rank is. This is done in the lsf.shared configuration file. The rank is defined to be a numeric value which will be updated every 10 seconds (while not increasing):
Begin Resource
   RESOURCENAME  TYPE     INTERVAL  INCREASING  DESCRIPTION
   ...
   rank          Numeric  10        N           (A rank for this host.)
End Resource
Next OpenLava needs to know for which hosts this rank should be determined. This is done through a concept of ‘resource mapping’ in the lsf.cluster.* configuration file. For now the rank should be used for all hosts by default:
Begin ResourceMap
   RESOURCENAME  LOCATION
   rank          ([default])
End ResourceMap
Next an external load information manager (LIM) script which will report the rank to OpenLava needs to be written. OpenLava expects that the script writes to stdout with the following format: <number of resources to report on> <first resource name> <first resource value> <second resource name> <second resource value> … . So in this case it should spit out ‘1 rank 42.0‘ every 10 seconds. The following python script will do just this – place this script in the file elim in $LSF_SERVERDIR:
#!/usr/bin/python2.7 -u

import time

INTERVAL = 10


def _calc_rank():
    # TODO calc value here...
    value = 0.0
    return {'rank': value}


while True:
    RES = _calc_rank()
    TMP = [k + ' ' + str(v) for k, v in RES.items()]
    print("%s %s" % (len(RES), ' '.join(TMP)))
    time.sleep(INTERVAL)
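To give an idea of what could go into _calc_rank, here is one possible sketch. The weights and the choice of sub-systems are made up (the post deliberately leaves open how the sub-system usage feeds into the rank, e.g. via a machine-learned model), and it assumes the psutil package is available on the hosts:
import psutil


def _calc_rank():
    # higher utilization of the sub-systems -> higher rank (weights are assumptions).
    cpu_util = psutil.cpu_percent(interval=1) / 100.0
    mem_util = psutil.virtual_memory().percent / 100.0
    value = 100.0 * (0.6 * cpu_util + 0.4 * mem_util)
    return {'rank': value}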
Now a special job queue which makes use of the rank can be defined in the lsb.queues configuration file. See the RES_REQ parameter, which defines that the candidate hosts for a job request are ordered by the rank:
Begin Queue
QUEUE_NAME = special
DESCRIPTION = Special queue using the rank coming from the elim.
RES_REQ = order[rank]
End Queue
Submitting a job to this queue is as easy as: bsub -q special sleep 1000. Or the rank can be passed along as a resource requirement on job submission (for any other queue): bsub -R “order[-rank]” -q special sleep 1000. By adding the ‘-’, the submitter requests that the candidate hosts are sorted with the highest rank first.
Let’s assume a couple of hosts are up & running and they have different ranks (see the last column):
openlava@242e2f1f935a:/tmp$ lsload -l
HOST_NAME status r15s r1m r15m ut pg io ls it tmp swp mem rank
45cf955541cf ok 0.2 0.2 0.3 2% 0.0 0 0 2e+08 159G 16G 11G 9.0
b7245f8e6d0d ok 0.2 0.2 0.3 2% 0.0 0 0 2e+08 159G 16G 11G 8.0
242e2f1f935a ok 0.2 0.2 0.3 3% 0.0 0 0 2e+08 159G 16G 11G 98.0
When checking the earlier submitted job, the execution host (EXEC_HOST) is indeed the host with the lowest rank, as expected:
openlava@242e2f1f935a:/tmp$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
101 openlav RUN special 242e2f1f935 b7245f8e6d0 sleep 1000 Jan 7 10:06
The rank can also be seen in a web interface like the one available for the OpenLava Mesos framework. What was described in this post is obviously just an example – other methods to integrate your smarts into the OpenLava resource manager can be realized as well.
Categories: Personal • Tags: LSF, Orchestration, Scheduling, SDI • Permalink for this article
September 18th, 2016 • Example 1: Intelligent Orchestration & Scheduling with Kubernetes
In the last blog post I suggested that analytical capabilities need to move to the core of resource managers. This is very much needed for autonomously controlled large scale systems which figure out the biggest chunk of decisions to be made themselves. While the benefits of this might be obvious, how to inject the insights/intelligence back into the resource manager might not be. Hence this blog post series, documenting a bit how to let systems (they are just examples – trying to cover most domains :-)) like Kubernetes, OpenStack, Mesos, YARN and OpenLava make smarter decisions.
Background
The blog posts are going to cover some generic concepts as well as point to specific documentation bits of the individual resource managers. Some of this is already covered in past blog posts, but to recap let’s look at the 5(+1) Ws of resource manager decision making (click to skip to the technical details):
- What decision needs to be made? – Decisions – and the actuations they lead to – can roughly be categorized into: doing initial placement of workloads on resources, re-balancing of workload and resource landscapes (through either pausing/killing, migrating or tuning resources and workloads) and capacity planning activities (see ref).
- Who is involved? – The two driving forces for data center resource management are the customer and the provider. The customer is looking for good performance and user experience, while the provider is looking to maximize ROI & lower the TCO of the resources. The customer is mostly looking for service orchestration (e.g. doesn’t care where and how the workload runs, as long as it performs and certain policies and rules – like for auto-scaling – are adhered to; or see something like Google’s instance size recommendation feature) while the provider looks at infrastructure orchestration of larger scale geo-distributed infrastructures (and the resources within) with multiple workloads from different customers (tenants are not equal btw – some are low-paying, non-important workloads/customers, some high-paying, important workloads/customers with priorities and SLAs).
- When does the decision/actuation apply? – Decisions can either be made immediately (e.g. an initial placement) or be more forward/backward looking (e.g. handle a maintenance/forklift upgrade request for certain resources).
- Where does the decision need to be made? – This is probably one of the most challenging questions. First of all this covers the full stack from physical resources (e.g. compute hosts, air-conditioning, …), software defined resources (e.g. virtual machines (VM), containers, tasks, …) all the way to the services the customers are running, as well as across the domains of compute (e.g. CPUs, VMs, containers, …), network (e.g. NICs, SDN, …) and storage (e.g. disks, block/object storage, …). Decisions are made on an individual resource, aggregate, group, data center or global level. For example the NIC, the virtual machine/container/task hosting the workload, or even the power supply can be actuated upon (feedback control is great for this). The next level of actuations can be carried out on the aggregate level – in which a set of resources makes up a compute host, ToR switch or SAN (e.g. by tuning the TCP/IP stack in the kernel). Next up is the group level, for which e.g. policies across a set of aggregates can be defined (e.g. an over-subscription policy for all Xeon E5 CPUs, a certain rack determined to run small unimportant jobs vs. a rack needing to run high performance workloads). Next is the data center level, for which we possibly want to enforce certain efficiency goals driven by business objectives (e.g. lowering the PUE). Finally the global level captures possibly multiple distributed data centers for which decisions need to be made which enable e.g. high availability and fault tolerance.
- Why does the decision need to be made? – Most decisions are made for efficiency reasons derived from the business objectives of the provider and customer. This means ultimately striking the right balance between the customer deploying the workload and asking for performance and SLA compliance (customers tend to walk away if the provider doesn’t provide a good experience) and the provider improving TCO (not being able to have a positive cashflow normally leads to a provider running out of business).
- How is the decision/actuation made? – This is the focus of this article series. In case it is determined a decision needs to be made, it needs to be clear how to carry out the actual actuation(s) for all the kinds of decisions described above.
Decisions most of the time cannot be made generically – e.g. decisions made in HPC/HTC systems do not necessarily apply to telco environments, in which the workloads and resources are different. Hence the context of the workloads and resources in place plays a huge role. Ultimately, analytics which embraces the context (in all sorts and forms: deep/machine learning, statistical modelling, artificial intelligence, …) of the environment can drive the intelligence in the decision making through insights. This can obviously happen in multiple places/flows (see the foreground and background flow concepts here) and ultimately enables autonomous control.
Enhancing Kubernetes
For the Kubernetes example let’s focus on a crucial decision point – doing the initial placement of a workload (aka a POD in Kubernetes language) in a cluster. Although much of today’s research focuses on initial placement, I’d urge everybody not to forget about all the other decisions that can be made more intelligent.
Like most orchestrators and schedulers, Kubernetes follows a simple approach of filtering and ranking. After shortlisting possible candidates, the first step involves filtering out those resources which do not meet the workload’s demands. The second step involves prioritization (or ranking) of the resources best suited.
This general part is described nicely in the Kubernetes documentation here: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md
The filtering part is mostly done based on capacities, while the prioritization can involve information like utilization. If you want to see this code, have a look at the generic scheduling implementation here. The available algorithms for filtering (aka predicates) and prioritization can be found here. The default methods that Kubernetes filters upon can be seen here – the default prioritization algorithms here. Note that weights can be applied to the algorithms based on your own needs as a provider. This is a nice way to tune and define how the resources under the control of the provider can be used.
While the process and the defaults already do a great job, let’s assume you’ve found a way to decide when and how to use an accelerator. Thankfully, like most scheduling systems, the scheduler in Kubernetes is extendable. Documentation for this can be found here. 3 ways are possible:
- recompile and alter the scheduler code,
- implement your own scheduler completely and run it in parallel,
- or implement an extension which the default scheduler calls when needed.
The first option is probably hard to manage in the long term, the second option requires you to deal with the messiness of concurrency, while the third option is interesting (although it adds latency to the process of scheduling due to an extra HTTP(s) call being made). The default scheduler can basically call an external process to either ‘filter’ or ‘prioritize’. In the first case a list of possible candidate hosts is returned, in the second case a prioritized list is returned. Now unfortunately the documentation gets a bit vague, but luckily some code is available from the integration tests. For example here you can see some external filtering code, and here the matching prioritization code. Now that just needs to be served up over HTTP and you are ready to go, next to adding some configuration documented here.
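To give an idea of what ‘serving it up over HTTP’ could look like, here is a minimal sketch of an extender written with Flask. The endpoint paths, the node label and the placement logic are made up for illustration, and the request/response field names mirror the extender structures used in the integration tests linked above – treat all of them as assumptions and verify against your Kubernetes version:
from flask import Flask, request, jsonify

app = Flask(__name__)


@app.route('/filter', methods=['POST'])
def filter_nodes():
    args = request.get_json(force=True)
    nodes = args['Nodes']
    # toy logic: keep only nodes labelled as having an accelerator (label name is made up).
    nodes['items'] = [n for n in nodes.get('items', [])
                      if n['metadata'].get('labels', {}).get('accelerator') == 'true']
    return jsonify({'Nodes': nodes, 'FailedNodes': {}, 'Error': ''})


@app.route('/prioritize', methods=['POST'])
def prioritize_nodes():
    args = request.get_json(force=True)
    # toy logic: give every remaining candidate node the same score.
    result = [{'Host': n['metadata']['name'], 'Score': 1}
              for n in args['Nodes'].get('items', [])]
    return jsonify(result)


if __name__ == '__main__':
    app.run(port=8888)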
So now an external scheduler extension can make the decision whether an accelerator should be assigned to a workload or not. The intelligent decision implemented in this extender could e.g. decide if an SR-IOV port is needed based on a bandwidth requirement, or if it is even a good idea to assign an accelerator to a workload as per the previous example.
Corrections, feedback and additional info are more than welcome. I have some scheduler extender code running here – but that is not shareable yet. I will update the post once I’ve completed this. In the next posts OpenStack (e.g. services like Nova, Watcher, Heat and Neutron), Mesos (how e.g. allocator modules can be used to inject smarts) and OpenLava (for which e.g. elims can be used to make better scheduling decisions) and obviously others will be introduced 🙂
Categories: Personal • Tags: Orchestration, Scheduling, SDI • Permalink for this article