Developers Planet

July 15, 2018

Bin Chen

Understand Kubernetes 5: Controller

Controllers in k8s assume the same role and responsibility as the Controller in the classic Model-View-Controller architecture (where the Model is the set of API objects stored in etcd). What is unique about a controller in k8s is that it constantly reconciles the system's current state to the desired state; reconciliation is not a one-time task.
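
In pseudocode, every controller boils down to a control loop of roughly this shape (a sketch only; the helper functions are illustrative placeholders, not real k8s APIs):

package example

import "reflect"

// controlLoop sketches the generic controller pattern: observe, diff, act.
// getDesiredState, getCurrentState and reconcile are placeholders.
func controlLoop() {
    for {
        desired := getDesiredState() // the spec, as submitted via the API server
        current := getCurrentState() // the observed state of the cluster
        if !reflect.DeepEqual(desired, current) {
            reconcile(desired, current) // create/update/delete objects to converge
        }
        // in practice the loop is event-driven (informers), not a busy poll
    }
}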

Replicaset Controller

To make things real, we'll look at the source code of the ReplicaSet Controller and see what exactly a controller is, whom it interacts with, and how.
The core logic of the ReplicaSet Controller is quite simple, as shown below:
func (rsc *ReplicaSetController) manageReplicas(filteredPods []*v1.Pod, rs *apps.ReplicaSet) error {
    diff := len(filteredPods) - int(*(rs.Spec.Replicas))
    if diff < 0 {
        // fewer Pods than desired: create the missing ones
        createPods( )
    } else if diff > 0 {
        // more Pods than desired: delete the surplus
        deletePods( )
    }
    ...
}
To create the Pod, it uses a KubeClient which talks to the API server.
func (r RealPodControl) createPods( ) {
    newPod, _ := r.KubeClient.CoreV1().Pods(namespace).Create(pod)
}
Tracing further into the Create() function, we see it uses a nice builder pattern to set up an HTTP request:
func (c *pods) Create(pod *v1.Pod) (result *v1.Pod, err error) {
    result = &v1.Pod{}
    err = c.client.Post().
        Namespace(c.ns).
        Resource("pods").
        Body(pod).
        Do().
        Into(result)
    return
}
Upon calling Do(), it issues an HTTP POST request and gets the result.
func (r *Request) Do() Result {
    var result Result
    err := r.request(func(req *http.Request, resp *http.Response) {
        result = r.transformResponse(resp, req)
    })
    if err != nil {
        return Result{err: err}
    }
    return result
}
That only covers one direction of the communication: from the controller to the API server.

How about the other direction?

Informer

A controller subscribes itself to the apiserver for the events it cares about.
A controller typically cares about two types of information: controller-specific information, and the core information regarding the Pods.
In k8s, the components used to deliver event notifications are called Informers. FWIW, an Informer is just the Observer pattern.
In the case of the ReplicaSetController: when a ReplicaSet request is submitted, the API server notifies the ReplicaSetController through appsinformers.ReplicaSetInformer; when a Pod gets created, the API server notifies the ReplicaSetController using coreinformers.PodInformer.
See how a ReplicaSetController is initialized:
func startReplicaSetController(ctx ControllerContext) (bool, error) {
    go replicaset.NewReplicaSetController(
        ctx.InformerFactory.Apps().V1().ReplicaSets(), // appsinformers.ReplicaSetInformer
        ctx.InformerFactory.Core().V1().Pods(),        // coreinformers.PodInformer
        ctx.ClientBuilder.ClientOrDie("replicaset-controller"),
    ).Run(int(ctx.ComponentConfig.ReplicaSetController.ConcurrentRSSyncs), ctx.Stop)
    return true, nil
}
And how the ReplicaSetController handles those events:
rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc:    rsc.enqueueReplicaSet,
    UpdateFunc: rsc.updateRS,
    DeleteFunc: rsc.enqueueReplicaSet,
})

podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc:    rsc.addPod,
    UpdateFunc: rsc.updatePod,
    DeleteFunc: rsc.deletePod,
})
Ok, this covers the direction from the API server to the controller.

But we are still missing one thing.

Workqueue and worker

After being notified of the relevant events, a controller pushes the events onto a work queue; meanwhile, a poor worker sits in a dead loop, pulling events off the queue and processing them, as sketched below.
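
A minimal sketch of that pattern, using client-go's workqueue and cache packages (syncHandler stands in for the controller's reconcile logic, e.g. manageReplicas above):

package main

import (
    "fmt"

    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/util/workqueue"
)

// syncHandler stands in for the controller's reconcile logic.
func syncHandler(key string) {
    fmt.Println("syncing", key)
}

// enqueue is what the informer event handlers boil down to: turn the
// object into a "namespace/name" key and push it onto the queue.
func enqueue(queue workqueue.RateLimitingInterface, obj interface{}) {
    if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
        queue.Add(key)
    }
}

// worker is the dead loop: block on the queue, process one event, repeat.
func worker(queue workqueue.RateLimitingInterface) {
    for {
        key, shutdown := queue.Get()
        if shutdown {
            return
        }
        syncHandler(key.(string))
        queue.Done(key)
    }
}

func main() {
    queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
    go worker(queue)
    // ... event handlers call enqueue(queue, obj) as notifications arrive
    select {}
}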

Cached & Shared Informer

We know that etcd provides APIs to list and watch particular resources, and each resource in k8s has its dedicated location. With that, we have everything needed to implement an informer for a controller. However, there are two aspects we can optimize. First, instead of relaying everything to etcd, we can cache the information/events in the apiserver for better performance. Second, since different controllers care about the same set of information, it makes sense for those controllers to share an informer.
With that in mind, here is roughly how a ReplicaSetInformer is currently created.
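
A minimal sketch using client-go's shared informer factory (kubeClient and stopCh are assumed to be provided by the caller):

package example

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
)

// newInformers sketches how controllers obtain shared, cached informers.
func newInformers(kubeClient kubernetes.Interface, stopCh <-chan struct{}) {
    factory := informers.NewSharedInformerFactory(kubeClient, 30*time.Second)

    // Both informers are backed by the same shared, cached watch streams;
    // any number of controllers can add event handlers to them.
    _ = factory.Apps().V1().ReplicaSets() // appsinformers.ReplicaSetInformer
    _ = factory.Core().V1().Pods()        // coreinformers.PodInformer

    factory.Start(stopCh) // kick off the underlying list/watch reflectors
}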

Controller Manager

kube-controller-manager is a daemon that bundles together all the built-in controllers for k8s. It provides a central place to register, initialize, and start the controllers.

Summary

We went through what a controller is, and how it interacts with the API server to do its job.

by Bin Chen (noreply@blogger.com) at July 07, 2018 06:12

July 07, 2018

Bin Chen

Understand kubernetes 4 : Scheduler

The best-known job of a container orchestrator is to "Assign Pods to Nodes", so-called scheduling. If all the Pods and Nodes were the same, it would be a trivial problem to solve - a round-robin policy would do the job. In practice, however, Pods have different resource requirements, and, less obviously, the nodes may have different capabilities - think machines purchased 5 years ago versus brand new ones.

An Analogy: Rent a house

Say you want to rent a house, and you tell the agent that any house with 2 bedrooms and 2 bathrooms is fine; however, you don't want a house with a swimming pool, since you would rather go to the beach and don't want to pay for something you won't use.
That actually covers the main concepts/jobs of the k8s scheduler.
  • You/Tenant: have some requirements (rooms)
  • Agent: k8s scheduler
  • Houses (owned by landlords): the Nodes.
You tell the agent your must-have, definite no-no, and nice-to-have requirements.
The agent's job is to find you a house matching your requirements and anti-requirements.
The owner can also reject an application based on his preferences (say, no pets).

Requirements for Pod scheduler

Let's look at some practical requirements when placing a Pod onto a Node.
1 Run Pods on a specific type of Node: e.g. run this Pod on Ubuntu 17.10 only.
2 Run Pods of different services on the same Node: e.g. place the webserver and memcache on the same Node.
3 Spread Pods of a service across different Nodes: e.g. place the webservers on nodes in different zones for fault tolerance.
4 Make the best use of resources: e.g. run as "many" jobs as possible but be able to preempt the low-priority ones.
In k8s world,
1, 2 can be resolved using Affinity
3 can be resolved using Anti-Affinity
4 can be resolved using Taint and Toleration and Priority and Preemption
Before talking about those scheduling policies, we first need a way to identify the Nodes. Without identification, the scheduler can do nothing more/better than allocating based only on the capacity information of the nodes.

Label the Nodes

Nothing fancy. Nodes are labeled.
You can add any label you want but there are predefined common labels, including
  • hostname
  • os/arch/instance-type
  • zone/region
The first may be used to identify a single node, the second a type of node, and the last is for geolocation-related fault tolerance or scalability.

Affinity

There are two types of Affinity: Node Affinity and Pod Affinity. The first indicates an affinity to a type of Node and can be used to achieve the 1st requirement; the latter indicates an affinity to Nodes where a certain type of Pod is already running, and can be used to achieve the 2nd requirement.
The affinity can be soft or hard, meaning nice-to-have and must-have respectively.
Reverse the logic of Affinity and it becomes Anti-Affinity, meaning a Pod doesn't want to be on Nodes with a certain feature. Requirement 3 can be implemented as "a Pod doesn't want to be on a Node already running the same Pod (using the Pod label)". A sketch of what this looks like in code follows the side note below.
Side note: you might know that in Linux a process can set its CPU affinity, that is, which CPU core it prefers to run on. That resembles the problem of placing a Pod on a specific (type of) Node. As does the cpuset controller in cgroups.
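
For illustration, here is roughly what a hard OS requirement plus a soft zone preference look like when built with the Go API types from k8s.io/api/core/v1 (the zone value is made up; in practice this usually appears as YAML in a Pod manifest):

package example

import v1 "k8s.io/api/core/v1"

// affinityExample sketches requirement 1: a hard (must-have) constraint
// on the node OS plus a soft (nice-to-have) zone preference.
func affinityExample() *v1.Affinity {
    return &v1.Affinity{
        NodeAffinity: &v1.NodeAffinity{
            // hard: the Pod MUST land on a linux node
            RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
                NodeSelectorTerms: []v1.NodeSelectorTerm{{
                    MatchExpressions: []v1.NodeSelectorRequirement{{
                        Key:      "beta.kubernetes.io/os",
                        Operator: v1.NodeSelectorOpIn,
                        Values:   []string{"linux"},
                    }},
                }},
            },
            // soft: prefer, but do not require, a particular zone
            PreferredDuringSchedulingIgnoredDuringExecution: []v1.PreferredSchedulingTerm{{
                Weight: 1,
                Preference: v1.NodeSelectorTerm{
                    MatchExpressions: []v1.NodeSelectorRequirement{{
                        Key:      "failure-domain.beta.kubernetes.io/zone",
                        Operator: v1.NodeSelectorOpIn,
                        Values:   []string{"us-west-1a"},
                    }},
                },
            }},
        },
    }
}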

Taint and Toleration

The landlord tells the agent that he only wants to rent the house to a programmer (for whatever reason). So unless a renter identifies himself as a programmer, the agent won't submit his application to the landlord.
Similarly, a Node can advertise some special requirement (called a Taint) and use it to repel a set of Pods. Unless a Pod can tolerate the taint, it won't be placed on that Node.
I found the concept of Taints and Tolerations a little bit twisted, since a Taint sounds like a bad thing, an unreasonable requirement/restriction that a Pod has to tolerate. It is more like a landlord requiring half a year's rent upfront: only those who will tolerate this are able to apply.
One thing to remember is that a Taint is an attribute of a Node; it gives the Node a voice for its preferences, whereas Affinity is how a Pod shows its preferences for Nodes.
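
As a sketch with the same Go API types: the admin puts a Taint on a Node, and only Pods carrying a matching Toleration may be scheduled onto it (the dedicated=gpu key/value is made up):

package example

import v1 "k8s.io/api/core/v1"

// The admin puts this taint on a Node: only Pods that tolerate
// dedicated=gpu may be scheduled there.
var taint = v1.Taint{
    Key:    "dedicated",
    Value:  "gpu",
    Effect: v1.TaintEffectNoSchedule,
}

// A Pod declares the matching toleration in its spec.
var toleration = v1.Toleration{
    Key:      "dedicated",
    Operator: v1.TolerationOpEqual,
    Value:    "gpu",
    Effect:   v1.TaintEffectNoSchedule,
}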

Priority and Preemption

Maximising resource utilization is important, and it is easily overlooked since most people don't have experience managing thousands of servers. As pointed out in section 5 of the Borg paper, from which k8s was inspired:
One of Borg’s primary goals is to make efficient use of
Google’s fleet of machines, which represents a significant
financial investment: increasing utilization by a few percentage
points can save millions of dollars.
How to increase utilization? That could mean many things, such as: schedule jobs fast, optimize the Pod allocation so that more jobs can be accommodated, and, last but not least, be able to interrupt low-priority jobs with high-priority ones.
The last one just makes sense for a machine: doing something is always better than sitting idle, but when a more important job comes along, the lower-priority one gets preempted.
An implication of possible preemption is that we have to spend a minute thinking about the effect on the Pod/service that may be evicted. Does it matter? How does it gracefully terminate itself?
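
Priorities are expressed through PriorityClass objects that Pods reference by name. A sketch, assuming the scheduling.k8s.io/v1beta1 API of this era (the name, value and description are made up):

package example

import (
    schedulingv1beta1 "k8s.io/api/scheduling/v1beta1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Pods that set spec.priorityClassName to "high-priority" get this
// priority value; the scheduler may preempt lower-priority Pods for them.
var highPriority = schedulingv1beta1.PriorityClass{
    ObjectMeta:    metav1.ObjectMeta{Name: "high-priority"},
    Value:         1000000, // larger value = higher priority
    GlobalDefault: false,
    Description:   "for latency-critical services",
}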

Make it real

To make things more real, take a look at this sample toy scheduler, which binds a Pod to the cheapest Node, as long as that Node can "fit" the resource requirements of the Pod.
Here are a few takeaways:
  1. You can roll your own scheduler.
  2. You can have more than one scheduler in the system. Each scheduler looks after a particular set/type of Pods and schedules them. (It doesn't make sense to have multiple schedulers trying to schedule the same set of Pods - there would be racing.)
  3. A scheduler always talks to the API server, as a client. It asks the API server for unscheduled Pods, schedules them using a defined policy, and posts the scheduling results (i.e. Pod/Node bindings) to the API server.
[Sequence diagram - scheduler and api server: get me unscheduled Pods; get me Node info/status/capacity; schedule according to a predefined policy; post binding result; post binding OK events]
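
In client-go terms, one pass of that loop might look roughly like this sketch (pickCheapestNode is a hypothetical policy function; error handling is elided):

package example

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// pickCheapestNode is a hypothetical policy: choose the cheapest node
// that fits the Pod's resource requests.
func pickCheapestNode(pod v1.Pod, nodes []v1.Node) string {
    return nodes[0].Name // placeholder policy
}

func scheduleOnce(client kubernetes.Interface) {
    // unscheduled Pods have an empty spec.nodeName
    pods, _ := client.CoreV1().Pods("").List(metav1.ListOptions{
        FieldSelector: "spec.nodeName=",
    })
    nodes, _ := client.CoreV1().Nodes().List(metav1.ListOptions{})

    for _, pod := range pods.Items {
        node := pickCheapestNode(pod, nodes.Items)
        // post the binding result back to the API server
        client.CoreV1().Pods(pod.Namespace).Bind(&v1.Binding{
            ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
            Target:     v1.ObjectReference{Kind: "Node", Name: node},
        })
    }
}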
You can find the default scheduler here.

Summary

We went over the requirements for a Pod scheduler and the ways to achieve those requirements in k8s.

by Bin Chen (noreply@blogger.com) at July 07, 2018 04:47

June 30, 2018

Bin Chen

Understand Kubernetes 3 : etcd

In the last article, we said there was a state store in the master node; in practice, it is implemented using etcd. etcd is an open source distributed key-value store (from CoreOS) that uses the Raft consensus algorithm. You can find a good introduction to etcd here. k8s uses etcd to store all the cluster information, and it is the only stateful component in the whole of k8s (not counting the stateful components of the applications themselves).
Notably, it stores the following information:
  • Resource object/spec submitted by the user
  • The scheduler results from master node
  • Current status of work nodes and Pods

etcd is critical

The stability and responsiveness of etcd are critical to the stability and performance of the whole cluster. Here is an excellent blog post from OpenAI sharing that their etcd system, hindered by 1) high disk latency due to the cloud backend and 2) high network I/O load incurred by the monitoring system, was one of the biggest issues they encountered when scaling to 2500 nodes.
For a production system, we set up a separate etcd cluster and connect the k8s master to it. The master stores the requests in etcd, the controllers/schedulers update the results, and the worker nodes watch the relevant state changes through the master and act accordingly, e.g. starting a container locally.
It looks like this diagram:

usage of etcd in k8s

etcd is set up separately, but it has to be set up first so that the node IPs (and TLS info) of the etcd cluster can be passed to the apiserver running on the master nodes. Using that information (etcd-servers and etcd-tls), the apiserver creates an etcd client (or multiple clients) to talk to etcd. That is the entire connection between etcd and k8s.

All the components in the api-server use storage.Interface to communicate with the storage. etcd is the only backend implementation at the moment, and it supports two versions of etcd, v2 and v3, the latter being the default.
// storage.Interface (simplified)
type Interface interface {
    Create(key string, obj runtime.Object)
    Delete(k)
    Watch(k)
    Get(k)
    Count(k)
}
The k8s master (to be specific, the apiserver component) acts as a client of etcd, using the etcd client to implement the storage.Interface API, with a little extra that fits the k8s model.
Let's see two APIs, Create and Watch.
For Create, the value part of the k/v pair is a runtime object, e.g. a Deployment spec; a few more steps (encode, transform) are needed before finally committing it to etcd.
  • Create
apiserver/package/storage/etcd3/store.go
Create(key string, obj runtime.Object)
obj -> encoder -> transformer -> clientv3.OpPut(key, v string)
Besides the normal create/get/delete, there is one operation that is very important for a distributed k/v store: watch. It allows you to block waiting on something and be notified when it changes. As a use case, someone can watch a specific location for new Pod creation/deletion and then take the corresponding action.
The kubelet doesn't watch the storage directly; instead, it watches through the API server.
  • Watch
apiserver/package/storage/etcd3/watcher.go
func (wc *watchChan) startWatching(watchClosedCh chan struct{}) {
    wch := wc.watcher.client.Watch(wc.ctx, wc.key, opts...)
}
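
Outside of k8s, watching a prefix directly with the etcd clientv3 library looks roughly like this sketch (the endpoint and key are illustrative); inside a cluster, components never do this directly, they watch through the apiserver:

package main

import (
    "context"
    "fmt"

    "github.com/coreos/etcd/clientv3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    // block-wait on every key under the pods prefix; each change is
    // delivered as an event on the channel
    for resp := range cli.Watch(context.Background(), "/registry/pods/", clientv3.WithPrefix()) {
        for _, ev := range resp.Events {
            fmt.Printf("%s %q\n", ev.Type, ev.Kv.Key)
        }
    }
}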

pluggable backend storage

In theory, you should be able to replace etcd with other k/v stores, such as Consul or ZooKeeper.
There was a PR to add Consul as a backend, but it was closed (after three years) as "not ready to do this in the near future". Why create a pluggable container runtime but not a pluggable storage backend, which would seem to make sense as well? One possible technical reason is that k8s and etcd are already loosely coupled, so it isn't worth the effort to create another layer just to make the storage pluggable.

Summary

etcd is the component storing all the state of a k8s cluster. Its availability and performance are vital to the whole of k8s. The apiserver is the only component that talks to etcd, using etcd clients; requests submitted to the apiserver are encoded and transformed before being committed to etcd. Anyone can watch a particular state change, but not directly against etcd; watches go through the apiserver instead.

by Bin Chen (noreply@blogger.com) at June 06, 2018 03:59

June 29, 2018

Neil Williams

Automation & Risk

First of two posts reproducing some existing content for a wider audience due to delays in removing viewing restrictions on the originals. The first is a bit long... Those familiar with LAVA may choose to skip forward to Core elements of automation support.

A summary of this document was presented by Steve McIntyre at Linaro Connect 2018 in Hong Kong. A video of that presentation and the slides created from this document are available online: http://connect.linaro.org/resource/hkg18/hkg18-tr10/

Although the content is based on several years of experience with LAVA, the core elements are likely to be transferable to many other validation, CI and QA tasks.

I recognise that this document may be useful to others, so this blog post is under CC BY-SA 3.0: https://creativecommons.org/licenses/by-sa/3.0/legalcode See also https://creativecommons.org/licenses/by-sa/3.0/deed.en

Automation & Risk

Background

Linaro created the LAVA (Linaro Automated Validation Architecture) project in 2010 to automate testing of software using real hardware. Over the seven years of automation in Linaro so far, LAVA has also spread into other labs across the world. Millions of test jobs have been run, across over one hundred different types of devices, ARM, x86 and emulated. Varied primary boot methods have been used alone or in combination, including U-Boot, UEFI, Fastboot, IoT, PXE. The Linaro lab itself has supported over 150 devices, covering more than 40 different device types. Major developments within LAVA include MultiNode and VLAN support. As a result of this data, the LAVA team have identified a series of automated testing failures which can be traced to decisions made during hardware design or firmware development. The hardest part of the development of LAVA has always been integrating new device types, arising from issues with hardware design and firmware implementations. There are a range of issues with automating new hardware and the experience of the LAVA lab and software teams has highlighted areas where decisions at the hardware design stage have delayed deployment of automation or made the task of triage of automation failures much harder than necessary.

This document is a summary of our experience with full background and examples. The aim is to provide background information about why common failures occur, and recommendations on how to design hardware and firmware to reduce problems in the future. We describe some device design features as hard requirements to enable successful automation, and some which are guaranteed to block automation. Specific examples are used, naming particular devices and companies and linking to specific stories. For a generic summary of the data, see Automation and hardware design.

What is LAVA?

LAVA is a continuous integration system for deploying operating systems onto physical and virtual hardware for running tests. Tests can be simple boot testing, bootloader testing and system level testing, although extra hardware may be required for some system tests. Results are tracked over time and data can be exported for further analysis.

LAVA is a collection of participating components in an evolving architecture. LAVA aims to make systematic, automatic and manual quality control more approachable for projects of all sizes.

LAVA is designed for validation during development - testing whether the code that engineers are producing “works”, in whatever sense that means. Depending on context, this could be many things, for example:

  • testing whether changes in the Linux kernel compile and boot
  • testing whether the code produced by gcc is smaller or faster
  • testing whether a kernel scheduler change reduces power consumption for a certain workload etc.

LAVA is good for automated validation. LAVA tests the Linux kernel on a range of supported boards every day. LAVA tests proposed Android changes in gerrit before they are landed, and does the same for other projects like gcc. Linaro runs a central validation lab in Cambridge, containing racks full of computers supplied by Linaro members and the necessary infrastructure to control them (servers, serial console servers, network switches etc.)

LAVA is good for providing developers with the ability to run customised tests on a variety of different types of hardware, some of which may be difficult to obtain or integrate. Although LAVA has support for emulation (based on QEMU), LAVA is best at providing test support for real hardware devices.

LAVA is principally aimed at testing changes made by developers across multiple hardware platforms to aid portability and encourage multi-platform development. Systems which are already platform independent or which have been optimised for production may not necessarily be able to be tested in LAVA or may provide no overall gain.

What is LAVA not?

LAVA is designed for Continuous Integration not management of a board farm.

LAVA is not a set of tests - it is infrastructure to enable users to run their own tests. LAVA concentrates on providing a range of deployment methods and a range of boot methods. Once the login is complete, the test consists of whatever scripts the test writer chooses to execute in that environment.

LAVA is not a test lab - it is the software that can be used in a test lab to control test devices.

LAVA is not a complete CI system - it is software that can form part of a CI loop. LAVA supports data extraction to make it easier to produce a frontend which is directly relevant to particular groups of developers.

LAVA is not a build farm - other tools need to be used to prepare binaries which can be passed to the device using LAVA.

LAVA is not a production test environment for hardware - LAVA is focused on developers and may require changes to the device or the software to enable automation. These changes are often unsuitable for production units. LAVA also expects that most devices will remain available for repeated testing rather than testing the software with a changing set of hardware.

The history of automated bootloader testing

Many attempts have been made to automate bootloader testing and the rest of this document covers the issues in detail. However, it is useful to cover some of the history in this introduction, particularly as it relates to ideas like SDMux - the SD card multiplexer which should allow automated testing of bootloaders like U-Boot on devices where the bootloader is deployed to an SD card. The problem of SDMux details the requirements to provide access to SD card filesystems to and from the dispatcher and the device. Requirements include: ethernet, no reliance on USB, removable media, cable connections, unique serial numbers, introspection and interrogation, avoiding feature creep, scalable design, power control, maintained software and mounting holes. Despite many offers of hardware, no suitable hardware has been found and testing of U-Boot on SD cards is not currently possible in automation. The identification of the requirements for a supportable SDMux unit is closely related to these device requirements.

Core elements of automation support

Reproducibility

The ability to deploy exactly the same software to the same board(s) and run exactly the same tests many times in a row, getting exactly the same results each time.

For automation to work, all device functions which need to be used in automation must always produce the same results on each device of a specific device type, irrespective of any previous operations on that device, given the same starting hardware configuration.

There is no way to automate a device which behaves unpredictably.

Reliability

The ability to run a wide range of test jobs, stressing different parts of the overall deployment, with a variety of tests and always getting a Complete test job. There must be no infrastructure failures and there should be limited variability in the time taken to run the test jobs to avoid the need for excessive Timeouts.

The same hardware configuration and infrastructure must always behave in precisely the same way. The same commands and operations to the device must always generate the same behaviour.

Scriptability

The device must support deployment of files and booting of the device without any need for a human to monitor or interact with the process. The need to press buttons is undesirable but can be managed in some cases by using relays. However, every extra layer of complexity reduces the overall reliability of the automation process and the need for buttons should be limited or eliminated wherever possible. If a device uses LEDs to indicate the success or failure of operations, such LEDs must only be indicative. The device must support full control of that process using only commands and operations which do not rely on observation.

Scalability

All methods used to automate a device must have minimal footprint in terms of load on the workers, complexity of scripting support and infrastructure requirements. This is a complex area and can trivially impact on both reliability and reproducibility as well as making it much more difficult to debug problems which do arise. Admins must also consider the complexity of combining multiple different devices which each require multiple layers of support.

Remote power control

Devices MUST support automated resets either by the removal of all power supplied to the DUT or a full reboot or other reset which clears all previous state of the DUT.

Every boot must reliably start, without interaction, directly from the first application of power without the limitation of needing to press buttons or requiring other interaction. Relays and other arrangements can be used at the cost of increasing the overall complexity of the solution, so should be avoided wherever possible.

Networking support

Ethernet - all devices using ethernet interfaces in LAVA must have a unique MAC address on each interface. The MAC address must be persistent across reboots. No assumptions should be made about fixed IP addresses, address ranges or pre-defined routes. If more than one interface is available, the boot process must be configurable to always use the same interface every time the device is booted. WiFi is not currently supported as a method of deploying files to devices.

Serial console support

LAVA expects to automate devices by interacting with the serial port immediately after power is applied to the device. The bootloader must interact with the serial port. If a serial port is not available on the device, suitable additional hardware must be provided before integration can begin. All messages about the boot process must be visible using the serial port and the serial port should remain usable for the duration of all test jobs on the device.

Persistence

Devices supporting primary SSH connections have persistent deployments and this has implications, some positive, some negative - depending on your use case.

  • Fixed OS - the operating system (OS) you get is the OS of the device and this must not be changed or upgraded.
  • Package interference - if another user installs a conflicting package, your test can fail.
  • Process interference - another process could restart (or crash) a daemon upon which your test relies, so your test will fail.
  • Contention - another job could obtain a lock on a constrained resource, e.g. dpkg or apt, causing your test to fail.
  • Reusable scripts - scripts and utilities your test leaves behind can be reused (or can interfere) with subsequent tests.
  • Lack of reproducibility - an artifact from a previous test can make it impossible to rely on the results of a subsequent test, leading to wasted effort with false positives and false negatives.
  • Maintenance - using persistent filesystems in a test action results in the overlay files being left in that filesystem. Depending on the size of the test definition repositories, this could result in an inevitable increase in used storage becoming a problem on the machine hosting the persistent location. Changes made by the test action can also require intermittent maintenance of the persistent location.

Only use persistent deployments when essential and always take great care to avoid interfering with other tests. Users who deliberately or frequently interfere with other tests can have their submit privilege revoked.

The dangers of simplistic testing

Connect and test

Seems simple enough - it doesn’t seem as if you need to deploy a new kernel or rootfs every time, no need to power off or reboot between tests. Just connect and run stuff. After all, you already have a way to manually deploy stuff to the board. The biggest problem with this method is Persistence as above - LAVA keeps the LAVA components separated from each other but tests frequently need to install support which will persist after the test, write files which can interfere with other tests or break the manual deployment in unexpected ways when things go wrong. The second problem within this fallacy is simply the power drain of leaving the devices constantly powered on. In manual testing, you would apply power at the start of your day and power off at the end. In automated testing, these devices would be on all day, every day, because test jobs could be submitted at any time.

ssh instead of serial

This is an over-simplification which will lead to new and unusual bugs and is only a short step on from connect & test with many of the same problems. A core strength of LAVA is demonstrating differences between types of devices by controlling the boot process. By the time the system has booted to the point where sshd is running, many of those differences have been swallowed up in the boot process.

Test everything at the same time

Issues here include:

Breaking the basic scientific method of test one thing at a time

The single system contains multiple components, like the kernel and the rootfs and the bootloader. Each one of those components can fail in ways which can only be picked up when some later component produces a completely misleading and unexpected error message.

Timing

Simply deploying the entire system for every single test job wastes inordinate amounts of time when you do finally identify that the problem is a configuration setting in the bootloader or a missing module for the kernel.

Reproducibility

The larger the deployment, the more complex the boot and the tests become. Many LAVA devices are prototypes and development boards, not production servers. These devices will fail in unpredictable places from time to time. Testing a kernel build multiple times is much more likely to give you consistent averages for duration, performance and other measurements than if the kernel is only tested as part of a complete system.

Automated recovery

Deploying an entire system can go wrong - whether an interrupted copy or a broken build, the consequences can mean that the device simply does not boot any longer.

Every component involved in your test must allow for automated recovery

This means that the boot process must support being interrupted before that component starts to load. With a suitably configured bootloader, it is straightforward to test kernel builds with fully automated recovery on most devices. Deploying a new build of the bootloader itself is much more problematic. Few devices have the necessary management interfaces with support for secondary console access or additional network interfaces which respond very early in boot. It is possible to chainload some bootloaders, allowing the known working bootloader to be preserved.

I already have builds

This may be true, however, automation puts extra demands on what those builds are capable of supporting. When testing manually, there are any number of times when a human will decide that something needs to be entered, tweaked, modified, removed or ignored which the automated system needs to be able to understand. Examples include /etc/resolv.conf and customised tools.

Automation can do everything

It is not possible to automate every test method. Some kinds of tests and some kinds of devices lack critical elements and do not work well with automation. These are not problems in LAVA; they are design limitations of the kind of test and the device itself. Your preferred test plan may be infeasible to automate and some level of compromise will be required.

Users are all admins too

This will come back to bite! However, there are other ways in which this can occur even after administrators have restricted users to limited access. Test jobs (including hacking sessions) have full access to the device as root. Users, therefore, can modify the device during a test job and it depends on the device hardware support and device configuration as to what may happen next. Some devices store bootloader configuration in files which are accessible from userspace after boot. Some devices lack a management interface that can intervene when a device fails to boot. Put these two together and admins can face a situation where a test job has corrupted, overridden or modified the bootloader configuration such that the device no longer boots without intervention. Some operating systems require a debug setting to be enabled before the device will be visible to the automation (e.g. the Android Debug Bridge). It is trivial for a user to mistakenly deploy a default or production system which does not have this modification.

LAVA and CI

LAVA is aimed at kernel and system development and testing across a wide variety of hardware platforms. By the time the test has got to the level of automating a GUI, there have been multiple layers of abstraction between the hardware, the kernel, the core system and the components being tested. Following the core principle of testing one element at a time, this means that such tests quickly become platform-independent. This reduces the usefulness of the LAVA systems, moving the test into scope for other CI systems which consider all devices as equivalent slaves. The overhead of LAVA can become an unnecessary burden.

CI needs a timely response - it takes time for a LAVA device to be re-deployed with a system which has already been tested. In order to test a component of the system which is independent of the hardware, kernel or core system a lot of time has been consumed before the “test” itself actually begins. LAVA can support testing pre-deployed systems but this severely restricts the usefulness of such devices for actual kernel or hardware testing.

Automation may need to rely on insecure access. Production builds (hardware and software) take steps to prevent systems being released with known login identities or keys, backdoors and other security holes. Automation relies on at least one of these access methods being exposed, typically a way to access the device as the root or admin user. User identities for login must be declared in the submission and be the same across multiple devices of the same type. These access methods must also be exposed consistently and without requiring any manual intervention or confirmation. For example, mobile devices must be deployed with systems which enable debug access which all production builds will need to block.

Automation relies on remote power control - battery powered devices can be a significant problem in this area. On the one hand, testing can be expected to involve tests of battery performance, low power conditions and recharge support. However, testing will also involve broken builds and failed deployments where the only recourse is to hard reset the device by killing power. With a battery in the loop, this becomes very complex, sometimes involving complex electrical bodges to the hardware to allow the battery to be switched out of the circuit. These changes can themselves change the performance of the battery control circuitry. For example, some devices fail to maintain charge in the battery when held in particular states artificially, so the battery gradually discharges despite being connected to mains power. Devices which have no battery can still be a challenge as some are able to draw power over the serial circuitry or USB attachments, again interfering with the ability of the automation to recover the device from being “bricked”, i.e. unresponsive to the control methods used by the automation and requiring manual admin intervention.

Automation relies on unique identification - all devices in an automation lab must be uniquely identifiable at all times, in all modes and all active power states. Too many components and devices within labs fail to allow for the problems of scale. Details like serial numbers, MAC addresses, IP addresses and bootloader timeouts must be configurable and persistent once configured.

LAVA is not a complete CI solution - even including the hardware support available from some LAVA instances, there are a lot more tools required outside of LAVA before a CI loop will actually work. The triggers from your development workflow to the build farm (which is not LAVA), the submission to LAVA from that build farm are completely separate and outside the scope of this documentation. LAVA can help with the extraction of the results into information for the developers but LAVA output is generic and most teams will benefit from some “frontend” which extracts the data from LAVA and generates relevant output for particular development teams.

Features of CI

Frequency

How often is the loop to be triggered?

Set up some test builds and test jobs and run through a variety of use cases to get an idea of how long it takes to get from the commit hook to the results being available to what will become your frontend.

Investigate where the hardware involved in each stage can be improved and analyse what kind of hardware upgrades may be useful.

Reassess the entire loop design and look at splitting the testing if the loop cannot be optimised to the time limits required by the team. The loop exists to serve the team but the expectations of the team may need to be managed compared to the cost of hardware upgrades or finite time limits.

Scale

How many branches, variants, configurations and tests are actually needed?

Scale has a direct impact on the affordability and feasibility of the final loop and frontend. Ensure that the build infrastructure can handle the total number of variants, not just at build time but for storage. Developers will need access to the files which demonstrate a particular bug or regression.

Scale also provides benefits of being able to ignore anomalies.

Identify how many test devices, LAVA instances and Jenkins slaves are needed. (As a hint, start small and design the frontend so that more can be added later.)

Interface

The development of a custom interface is not a small task

Capturing the requirements for the interface may involve lengthy discussions across the development team. Where there are irreconcilable differences, a second frontend may become necessary, potentially pulling the same data and presenting it in a radically different manner.

Include discussions on how or whether to push notifications to the development team. Take time to consider the frequency of notification messages and how to limit the content to only the essential data.

Bisect support can flow naturally from the design of the loop if the loop is carefully designed. Bisect requires that a simple boolean test can be generated, built and executed across a set of commits. If the frontend implements only a single test (for example, does the kernel boot?) then it can be easy to identify how to provide bisect support. Tests which produce hundreds of results need to be slimmed down to a single pass/fail criterion for the bisect to work.

Results

This may take the longest of all elements of the final loop

Just what results do the developers actually want and can those results be delivered? There may be requirements to aggregate results across many LAVA instances, with comparisons based on metadata from the original build as well as the LAVA test.

What level of detail is relevant?

Different results for different members of the team or different teams?

Is the data to be summarised and if so, how?

Resourcing

A frontend has the potential to become complex and need long term maintenance and development

Device requirements

At the hardware design stage, there are considerations for the final software relating to how the final hardware is to be tested.

Uniqueness

All units of all devices must uniquely identify to the host machine as distinct from all other devices which may be connected at the same time. This particularly covers serial connections but also any storage devices which are exported, network devices and any other method of connectivity.

Example - the WaRP7 integration has been delayed because the USB mass storage does not export a filesystem with a unique identifier, so when two devices are connected, there is no way to distinguish which filesystem relates to which device.

All unique identifiers must be isolated from the software to be deployed onto the device. The automation framework will rely on these identifiers to distinguish one device from up to a dozen identical devices on the same machine. There must be no method of updating or modifying these identifiers using normal deployment / flashing tools. It must not be possible for test software to corrupt the identifiers which are fundamental to how the device is identified amongst the others on the same machine.

All unique identifiers must be stable across multiple reboots and test jobs. Randomly generated identifiers are never suitable.

If the device uses a single FTDI chip which offers a single UART device, then the unique serial number of that UART will typically be a permanent part of the chip. However, a similar FTDI chip which provides two or more UARTs over the same cable would not have serial numbers programmed into the chip but would require a separate piece of flash or other storage into which those serial numbers can be programmed. If that storage is not designed into the hardware, the device will not be capable of providing the required uniqueness.

Example - the WaRP7 exports two UARTs over a single cable but fails to give unique identifiers to either connection, so connecting a second device disconnects the first device when the new tty device replaces the existing one.

If the device uses one or more physical ethernet connector(s) then the MAC address for each interface must not be generated randomly at boot. Each MAC address needs to be:

  • persistent - each reboot must always use the same MAC address for each interface.
  • unique - every device of this type must use a unique MAC address for each interface.

If the device uses fastboot, then the fastboot serial number must be unique so that the device can be uniquely identified and added to the correct container. Additionally, the fastboot serial number must not be modifiable except by the admins.

Example - the initial HiKey 960 integration was delayed because the firmware changed the fastboot serial number to a random value every time the device was rebooted.

Scale

Automation requires more than one device to be deployed - the current minimum is five devices. One device is permanently assigned to the staging environment to ensure that future code changes retain the correct support. In the early stages, this device will be assigned to one of the developers to integrate the device into LAVA. The devices will be deployed onto machines which have many other devices already running test jobs. The new device must not interfere with those devices and this makes some of the device requirements stricter than may be expected.

  • The aim of automation is to create a homogenous test platform using heterogeneous devices and scalable infrastructure.

  • Do not complicate things.

  • Avoid extra customised hardware

    Relays, hardware modifications and mezzanine boards all increase complexity

    Examples - X15 needed two relay connections, the 96boards initially needed a mezzanine board where the design was rushed, causing months of serial disconnection issues.

  • More complexity raises failure risk nonlinearly

    Example - The lack of onboard serial meant that the 96boards devices could not be tested in isolation from the problematic mezzanine board. Numerous 96boards devices were deemed to be broken when the real fault lay with intermittent failures in the mezzanine. Removing and reconnecting a mezzanine had a high risk of damaging the mezzanine or the device. Once 96boards devices moved to direct connection of FTDI cables into the connector formerly used by the mezzanine, serial disconnection problems disappeared. The more custom hardware has to be designed / connected to a device to support automation, the more difficult it is to debug issues within that infrastructure.

  • Avoid unreliable protocols and connections

    Example. WiFi is not a reliable deployment method, especially inside a large lab with lots of competing signals and devices.

  • This document is not demanding enterprise or server grade support in devices.

    However, automation cannot scale with unreliable components.

    Example - HiKey 6220 and the serial mezzanine board caused massively complex problems when scaled up in LKFT.

  • Server support typically includes automation requirements as a subset:

    RAS, performance, efficiency, scalability, reliability, connectivity and uniqueness

  • Automation racks have similar requirements to data centres.

  • Things need to work reliably at scale

Scale issues also affect the infrastructure which supports the devices as well as the required reliability of the instance as a whole. It can be difficult to scale up from initial development to automation at scale. Numerous tools and utilities prove to be uncooperative, unreliable or poorly isolated from other processes. One result can be that the requirements of automation look more like the expectations of server-type hardware than of mobile hardware. The reality at scale is that server-type hardware has already had fixes implemented for scalability issues whereas many mobile devices only get tested as standalone units.

Connectivity and deployment methods

  • All test software is presumed broken until proven otherwise
  • All infrastructure and device integration support must be proven to be stable before tests can be reliable
  • All devices must provide at least one method of replacing the current software with the test software, at a level lower than you're testing.

The simplest method to automate is TFTP over physical ethernet, e.g. U-Boot or UEFI PXE. This also puts the least load on the device and automation hardware when delivering large images.

Manually writing software to SD is not suitable for automation. This tends to rule out many proposed methods for testing modified builds or configurations of firmware in automation.

See https://linux.codehelp.co.uk/the-problem-of-sd-mux.html for more information on how the requirements of automation affect the hardware design requirements to provide access to SD card filesystems to and from the dispatcher and the device.

Some deployment methods require tools which must be constrained within an LXC. These include but are not limited to:

  • fastboot - due to a common need to have different versions installed for different hardware devices

    Example - Every fastboot device suffers from this problem - any running fastboot process will inspect the entire list of USB devices and attempt to connect to each one, locking out any other fastboot process which may be running at the time, which sees no devices at all.

  • IoT deployment - some deployment tools require patches for specific devices or use tools which are too complex for use on the dispatcher.

    Example - the TI CC3220 IoT device needs a patched build of OpenOCD, the WaRP7 needs a custom flashing tool compiled from a github repository.

Wherever possible, existing deployment methods and common tools are strongly encouraged. New tools are not likely to be as reliable as the existing tools.

Deployments must not make permanent changes to the boot sequence or configuration.

Testing of OS installers may require modifying the installer to not install an updated bootloader or modify the bootloader configuration. The automation needs to control whether the next reboot boots the newly deployed system or starts the next test job; for example, when a test job has been cancelled, the device needs to be immediately ready to run a different test job.

Interfaces

Automation requires driving the device over serial instead of via a touchscreen or other human interface device. This changes the way that the test is executed and can require the use of specialised software on the device to translate text based commands into graphical inputs.

It is possible to test video output in automation but it is not currently possible to drive automation through video input. This includes BIOS-type firmware interaction. UEFI can be used to automatically execute a bootloader like Grub which does support automation over serial. UEFI implementations which use graphical menus cannot be supported interactively.

Reliability

The objective is to have automation support which runs test jobs reliably. Reproducible failures are easy to fix but intermittent faults easily consume months of engineering time and need to be designed out wherever possible. Reliable testing means only 3 or 4 test job failures per week due to hardware or infrastructure bugs across an entire test lab (or instance). This can involve thousands of test jobs across multiple devices. Some instances may have dozens of identical devices but they still need not to exceed the same failure rate.

All devices need to reach the minimum standard of reliability, or they are not fit for automation. Some of these criteria might seem rigid, but they are not exclusive to servers or enterprise devices. To be useful, mobile and IoT devices need to meet the same standards, even though the software involved and the deployment methods might be different. The reason is that the Continuous Integration strategy remains the same for all devices: the problem is the same, regardless of underlying considerations.

A developer makes a change; that change triggers a build; that build triggers a test; that test reports back to the developer whether that change worked or had unexpected side effects.

  • False positives and false negatives are expensive in terms of wasted engineering time.
  • False positives can arise when not enough of the software is fully tested, or if the testing is not rigorous enough to spot all problems.
  • False negatives arise when the test itself is unreliable, either because of the test software or the test hardware.

This becomes more noticeable when considering automated bisections which are very powerful in tracking the causes of potential bugs before the product gets released. Every test job must give a reliable result or the bisection will not reliably identify the correct change.

Automation and Risk

Linaro kernel functional test framework (LKFT) https://lkft.validation.linaro.org/

We have seen with LKFT that complexity has a non-linear relationship with the reliability of any automation process. This section aims to set out some guidelines and recommendations on just what is acceptable in the tools needed to automate testing on a device. These guidelines are based on our joint lab and software team experiences with a wide variety of hardware and software.

Adding or modifying any tool has a risk of automation failure

Risk increases non-linearly with complexity. Some of this risk can be mitigated by testing the modified code and the complete system.

Dependencies installed count as code in terms of the risks of automation failure

This is a key lesson learnt from our experiences with LAVA V1. We added a remote worker method, which was necessary at the time to improve scalability, but it massively increased the risk of automation failure simply due to the extra complexity that came with the chosen design. These failures did not just show up in the test jobs which actively used the extra features and tools; they caused problems for all jobs running on the system.

The ability in LAVA V2 to use containers for isolation is a key feature

For the majority of use cases, the small extension of the runtime of the test to set up and use a container is negligible. The extra reliability is more than worth the extra cost.

Persistent containers are themselves a risk to automation

Just as with any persistent change to the system.

Pre-installing dependencies in a persistent container does not necessarily lower the overall risk of failure. It merely substitutes one element of risk for another.

All code changes need to be tested

In unit tests and in functional tests. There is a dividing line where if something is installed as a dependency of LAVA, then when that something goes wrong, LAVA engineers will be pressured into fixing the code of that dependency whether or not we have any particular experience of that language, codebase or use case. Moving that code into a container moves that burden but also makes triage of that problem much easier by allowing debug builds / options to be substituted easily.

Complexity also increases the difficulty of debugging, again in a nonlinear fashion

A LAVA dependency needs a higher bar in terms of ease of triage.

Complexity cannot be easily measured

Although there are factors which contribute.

Monoliths

Large programs which appear as a single monolith are harder to debug than the UNIX model of one utility joined with other utilities to perform a wider task. (This applies to LAVA itself as much as any one dependency - again, a lesson from V1.)

Feature creep

Continually adding features beyond the original scope makes complex programs worse. A smaller codebase will tend to be simpler to triage than a large codebase, even if that codebase is not monolithic.

Targeted utilities are less risky than large environments

A program which supports protocol after protocol after protocol will be more difficult to maintain than 3 separate programs for each protocol. This only gets worse when the use case for that program only requires the use of one of the many protocols supported by the program. The fact that the other protocols are supported increases the complexity of the program beyond what the use case actually merits.

Metrics in this area are impossible

The risks are nonlinear, the failures are typically intermittent. Even obtaining or applying metrics takes up huge amounts of engineering time.

Mismatches in expectations

The use case of automation rarely matches up with the more widely tested use case of the upstream developers. We aren't testing the code flows typically tested by the upstream developers, so we find different bugs, raising the level of risk. Generally, the simpler it is to deploy a device in automation, the closer the test flow will be to the developer flow.

Most programs are written for the single developer model

Some very widely used programs are written to scale but this is difficult to determine without experience of trying to run it at scale.

Some programs do require special consideration

QEMU would fail most of these guidelines above, so there are mitigating factors:

  • Programs which can be easily restricted to well understood use cases lower the risk of failure. Not all use cases of the same program need to be covered.
  • Programs which have excellent community and especially in-house support also lower the risk of failure. (Having QEMU experts in Linaro is a massive boost for having QEMU as a dispatcher dependency.)

Unfamiliar languages increase the difficulty of triage

This may affect dependencies in unexpected ways. A program which has lots of bindings into a range of other languages becomes entangled in transitions and bugs in those other languages. This commonly delays the availability of the latest version which may have a critical fix for one use case but which fails to function at all in what may seem to be an unrelated manner.

The dependency chain of the program itself increases the risk of failure in precisely the same manner as the program

In terms of maintenance, this can include the build dependencies of the program as those affect delivery / availability of LAVA in distributions like Debian.

Adding code to only one dispatcher amongst many increases the risk of failure on the instance as a whole

By having an untested element which is at variance to the rest of the system.

Conditional dependencies increase the risk

Optional components can be supported but only increase the testing burden by extending the matrix of installations.

Presence of the code in Debian main can reduce the risk of failure

This does not outweigh other considerations - there are plenty of packages in Debian (some complex, some not) which would be an unacceptable risk as a dependency of the dispatcher, fastboot for one. A small python utility from github can be a substantially lower risk than a larger program from Debian which has unused functionality.

Sometimes, "complex" simply means "buggy" or "badly designed"

fastboot is not actually a complex piece of code but we have learnt that it does not currently scale. This is a result of the disparity between the development model and the automation use case. Disparities like that actually equate to complexity, in terms of triage and maintenance. If fastboot was more complex at the codebase level, it may actually become a lower risk than currently.

Linaro as a whole does have a clear objective of harmonising the ecosystem

Adding yet another variant of existing support is at odds with the overall objective of the company. Many of the tools required in automation have no direct effect on the distinguishing factors for consumers. Adding another one "just because" is not a good reason to increase the risk of automation failure. Just as with standards.

Having the code on the dispatcher impedes development of that code

Bug fixes will take longer to be applied because the fix needs to go through a distribution or other packaging process managed by the lab admins. Applying a targeted fix inside an LXC is useful for proving that the fix works.

Not all programs can work in an LXC

LAVA also provides ways to test using those programs by deploying the code onto a test device. e.g. the V2 support for fastmodels involves only deploying the fastmodel inside a LAVA Test Shell on a test device, e.g. x86 or mustang or Juno.

Speed of running a test job in LAVA is important for CI

The goal of speed must give way to the requirement for reliability of automation

Resubmitting a test job due to a reliability failure is more harmful to the CI process than letting tests take longer to execute without such failures. Test jobs which run quickly are easier to parallelize by adding more test hardware.

Modifying software on the device

Not all parts of the software stack can be replaced automatically, typically the firmware and/or bootloader will need to be considered carefully. The boot sequence will have important effects on what kind of testing can be done automatically. Automation relies on being able to predict the behaviour of the device, interrupt that default behaviour and then execute the test. For most devices, everything which executes on the device prior to the first point at which the boot sequence can be interrupted can be considered as part of the primary boot software. None of these elements can be safely replaced or modified in automation.

The objective is to deploy the device such that as much of the software stack can be replaced as possible whilst preserving the predictable behaviour of all devices of this type so that the next test job always gets a working, clean device in a known state.

Primary boot software

For many devices, this is the bootloader, e.g. U-Boot, UEFI or fastboot.

Some devices include support for a Baseboard Management Controller, or BMC, which allows the bootloader and other firmware to be updated even if the device is bricked. The BMC software itself must then be considered the primary boot software; it cannot be safely replaced.

All testing of the primary boot software will need to be done by developers using local devices. SDMux was an idea which only fitted one specific set of hardware, the problem of testing the primary boot software is a hydra. Adding customised hardware to try to sidestep the primary boot software always increases the complexity and failure rates of the devices.

It is possible to divide the pool of devices into some which only ever use known versions of the primary boot software controlled by admins and other devices which support modifying the primary boot software. However, this causes extra work when processing the results, submitting the test jobs and administering the devices.

A secondary problem here is that it is increasingly common for the methods of updating this software to be esoteric, hacky, restricted and even proprietary.

  • Click-through licences to obtain the tools

  • Greedy tools which hog everything in /dev/bus/usb

  • NIH tools which are almost the same as existing tools but add vendor-specific "functionality"

  • GUI tools

  • Changing jumpers or DIP switches,

    Often in inaccessible locations which require removal of other ancillary hardware

  • Random, untrusted, compiled vendor software running as root

  • The need to press and hold buttons and watch for changes in LED status.

We've seen all of these - in various combinations - just in 2017, as methods of getting devices into a mode where the primary boot software can be updated.

Copyright 2018 Neil Williams linux@codehelp.co.uk

Available under CC BY-SA 3.0: https://creativecommons.org/licenses/by-sa/3.0/legalcode

by Neil Williams at June 06, 2018 14:19

June 18, 2018

Senthil Kumaran

lava-dispatcher docker images - part 1

Introduction, Details and Preparation

The Linaro Automated Validation Architecture (a.k.a. LAVA) project has released official docker images for lava-dispatcher only containers. This blog post series explains how to use these images to run independent LAVA workers along with the devices attached to them. The blog post series is split into three parts as follows:

  1. lava-dispatcher docker images - part 1 - Introduction, Details and Preparation
  2. lava-dispatcher docker images - part 2 - Docker based LAVA Worker running pure LXC job
  3. lava-dispatcher docker images - part 3 - Docker based LAVA Worker running Nexus 4 job with and without LXC Protocol

Before getting into the details of running these images, let us see how these images are organized and what packages are available via them.

The lava-dispatcher only docker images will be officially supported by the LAVA project team, and there will be regular releases of these images whenever there are updates or new releases. As of this writing two images have been released - production and staging. These docker images are based on the Debian Stretch operating system, which is the recommended operating system for installing LAVA.

lava-dispatcher production docker images

The production docker image of lava-dispatcher is based on the official production-repo of the LAVA project. The production-repo holds the latest stable packages released by the LAVA team for each of the LAVA components. The production docker image is available at the following link:

https://hub.docker.com/r/linaro/lava-dispatcher-production-stretch-amd64/

Whenever there is a production release from the LAVA project, a corresponding image is created with a matching tag name in https://hub.docker.com/r/linaro/lava-dispatcher-production-stretch-amd64/tags/ The latest tag as of this writing is 2018.5-3. To see what these production docker images are built with, have a look at the Dockerfile in https://git.linaro.org/ci/dockerfiles.git/tree/lava/dispatcher/production/stretch-amd64/Dockerfile
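For example, pulling the tagged production image mentioned above is a single command (substitute whatever tag is current):

$ sudo docker pull linaro/lava-dispatcher-production-stretch-amd64:2018.5-3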

lava-dispatcher staging docker images

The staging docker image of lava-dispatcher is based on the official staging-repo of the LAVA project. The staging-repo holds the latest packages built every day by the LAVA team for each of the LAVA components, and is thus a source of bleeding edge unreleased software. The staging docker image, which is built daily, is available at the following link:

https://hub.docker.com/r/linaro/lava-dispatcher-staging-stretch-amd64/

Whenever a successful daily build of the staging packages is available, a docker image is made available in https://hub.docker.com/r/linaro/lava-dispatcher-staging-stretch-amd64/tags/ with the tag name 'latest'. Hence, at any point in time there will be only one tag, i.e. latest, in the staging docker image location. To see what these staging docker images are built with, have a look at the Dockerfile in https://git.linaro.org/ci/dockerfiles.git/tree/lava/dispatcher/staging/stretch-amd64/Dockerfile

lava-lxc-mocker

Unlike regular installations of LAVA workers, installations via the above docker images use a package called lava-lxc-mocker instead of the lxc Debian package. lava-lxc-mocker is a pseudo implementation of lxc: it mocks the lxc commands, providing exactly the same output as the original commands without actually running them on the machine. This package exists to provide a (pseudo) alternative to lxc and to avoid the overhead of running nested containers, which simplifies things without losing the ability to run, unmodified, LAVA job definitions that have the LXC protocol defined.

Having seen the details of the lava-dispatcher only docker images, let us now look at three different use cases where jobs are run within a docker container, with and without using the LXC protocol, on an attached device such as a Nexus 4 phone.

In demonstrating all these use cases we will use the lava-dispatcher only staging docker images. We will use the https://lava.codehelp.co.uk instance as the LAVA master to which the docker based LAVA worker will connect. https://lava.codehelp.co.uk is an encrypted LAVA instance which accepts connections only from authenticated LAVA workers. Read more about how to configure encrypted communication between a LAVA master and a LAVA worker in https://staging.validation.linaro.org/static/docs/v2/pipeline-server.html#using-zmq-authentication-and-encryption The following is a preparation step to connect the docker based LAVA slave to the encrypted LAVA master instance.

Creating slave certificate

We will name the docker based LAVA worker 'docker-slave'. Let us create a slave certificate which can be shared with the LAVA master. On a previously running LAVA worker, issue the following command to create a slave certificate:

stylesen@hanshu:~$ sudo /usr/share/lava-dispatcher/create_certificate.py \
docker-slave-1
Creating the certificate in /etc/lava-dispatcher/certificates.d
 - docker-slave-1.key
 - docker-slave-1.key_secret

We can see the certificates are created successfully in /etc/lava-dispatcher/certificates.d. As explained in https://staging.validation.linaro.org/static/docs/v2/pipeline-server.html#distribute-public-certificates copy the public component of the above slave certificate to the master instance (https://lava.codehelp.co.uk), as shown below:

stylesen@hanshu:~$ scp /etc/lava-dispatcher/certificates.d/docker-slave-1.key \
stylesen@lava.codehelp.co.uk:/tmp

docker-slave-1.key                            100%  364     1.4KB/s   00:00   

Then login to lava.codehelp.co.uk to do the actual copy as follows (since we need sudo rights to copy directly, this is done in two steps):

stylesen@hanshu:~$ ssh lava.codehelp.co.uk
stylesen@codehelp:~$ sudo mv /tmp/docker-slave-1.key /etc/lava-dispatcher/certificates.d/
[sudo] password for stylesen:
stylesen@codehelp:~$ sudo ls -alh /etc/lava-dispatcher/certificates.d/docker-slave-1.key
-rw-r--r-- 1 stylesen stylesen 364 Jun 18 00:05 /etc/lava-dispatcher/certificates.d/docker-slave-1.key

Now, we have the slave certificate copied to appropriate location on the LAVA master. For convenience, on the host machine from where we start the docker based LAVA worker, copy the slave certificates to a specific directory as shown below:

stylesen@hanshu:~$ mkdir docker-slave-files
stylesen@hanshu:~$ cd docker-slave-files/
stylesen@hanshu:~/docker-slave-files$ cp /etc/lava-dispatcher/certificates.d/docker-slave-1.key* .

Similarly, copy the master certificate's public component to the above folder, in order to enable communication.

stylesen@hanshu:~/docker-slave-files$ scp \
stylesen@lava.codehelp.co.uk:/etc/lava-dispatcher/certificates.d/master.key .

master.key                                    100%  364     1.4KB/s   00:00   
stylesen@hanshu:~/docker-slave-files$ ls -alh
total 20K
drwxr-xr-x  2 stylesen stylesen 4.0K Jun 18 05:48 .
drwxr-xr-x 17 stylesen stylesen 4.0K Jun 18 05:45 ..
-rw-r--r--  1 stylesen stylesen  364 Jun 18 05:45 docker-slave-1.key
-rw-r--r--  1 stylesen stylesen  313 Jun 18 05:45 docker-slave-1.key_secret
-rw-r--r--  1 stylesen stylesen  364 Jun 18 05:48 master.key
stylesen@hanshu:~/docker-slave-files$

We are all set with the required files to start and run our docker based LAVA workers.

... Continue Reading Part 2

by stylesen at June 06, 2018 02:30

lava-dispatcher docker images - part 2

This is part 2 of the three part blog post series on lava-dispatcher only docker images. If you haven't read part 1 already, then read it on - https://www.stylesen.org/lavadispatcher_docker_images_part_1

Docker based LAVA Worker running pure LXC job

This is the first use case, in which we will look at starting a docker based LAVA worker and running a job that requests an LXC device type. The following command is used to start a docker based LAVA worker:

stylesen@hanshu:~$ sudo docker run \
-v /home/stylesen/docker-slave-files:/fileshare \
-v /var/run/docker.sock:/var/run/docker.sock -itd \
-e HOSTNAME='docker-slave-1' -e MASTER='tcp://lava.codehelp.co.uk:5556' \
-e SOCKET_ADDR='tcp://lava.codehelp.co.uk:5555' -e LOG_LEVEL='DEBUG' \
-e ENCRYPT=1 -e MASTER_CERT='/fileshare/master.key' \
-e SLAVE_CERT='/fileshare/docker-slave-1.key_secret' -p 2222:22 \
--name ld-latest linaro/lava-dispatcher-staging-stretch-amd64:latest

Unable to find image 'linaro/lava-dispatcher-staging-stretch-amd64:latest' locally
latest: Pulling from linaro/lava-dispatcher-staging-stretch-amd64
cc1a78bfd46b: Pull complete
5ddb65a5b8b4: Pull complete
41d8dcd3278b: Pull complete
071cc3e7e971: Pull complete
39bedb7bda2f: Pull complete
Digest: sha256:1bc7c7b2bee09beda4a6bd31a2953ae80847c706e8500495f6d0667f38fe0c9c
Status: Downloaded newer image for linaro/lava-dispatcher-staging-stretch-amd64:latest
522f079649816a931247c5917efea281846e394dba7ec19f522bba5f1e433fd5
stylesen@hanshu:~$

Let's have a closer look at the 'docker run' command above and see what options are used:

'-v /home/stylesen/docker-slave-files:/fileshare' - mounts the directory /home/stylesen/docker-slave-files from the host machine inside the docker container at the location /fileshare. This location is used to exchange files from the host to the container and vice versa.

'-v /var/run/docker.sock:/var/run/docker.sock' - similarly the docker socket file is exposed within the container. This is optional and may be required for advanced job runs and use cases.

For options such as '-itd', '-p' and '--name', refer to https://docs.docker.com/engine/reference/commandline/run/ to learn what these options do when running docker images.

'-e' - This option is used to set environment variables inside the docker container being run. The following environment variables are set in the above command line; they are consumed by the entrypoint.sh script inside the container, which starts the lava-slave daemon based on their values (a quick way to verify them is shown after the list).

  1. HOSTNAME - Name of the slave
  2. MASTER - Main master socket
  3. SOCKET_ADDR - Log socket
  4. LOG_LEVEL - Log level, default to INFO
  5. ENCRYPT - Encrypt messages
  6. MASTER_CERT - Master certificate file
  7. SLAVE_CERT - Slave certificate file
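To verify that the entrypoint picked these variables up, one quick check (a sketch; 'ld-latest' is the container name used above) is to inspect the environment of the running container:

$ sudo docker exec ld-latest env | grep -E 'MASTER|SLAVE|ENCRYPT|LOG_LEVEL|HOSTNAME'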

We can see that the docker based LAVA worker has started and is running:

stylesen@hanshu:~$ sudo docker ps -a
CONTAINER ID        IMAGE                                               \
  COMMAND             CREATED              STATUS              PORTS    \
              NAMES

522f07964981        linaro/lava-dispatcher-staging-stretch-amd64:latest \
  "/entrypoint.sh"    About a minute ago   Up 58 seconds       \
0.0.0.0:2222->22/tcp   ld-latest

stylesen@hanshu:~$

If everything goes fine, we can see the LAVA master receiving ping messages from the above LAVA worker, as seen in the LAVA master logs:

stylesen@codehelp:~$ sudo tail -f /var/log/lava-server/lava-master.log
2018-06-18 00:24:30,878    INFO docker-slave-1 => HELLO
2018-06-18 00:24:30,878 WARNING New dispatcher <docker-slave-1>
2018-06-18 00:24:34,069   DEBUG lava-logs => PING(20)
2018-06-18 00:24:36,138   DEBUG docker-slave-1 => PING(20)
... <TRUNCATED OUTPUT> ...
^C
stylesen@codehelp:~$

The worker will also get listed on https://lava.codehelp.co.uk/scheduler/allworkers in the web UI. The docker based LAVA worker host docker-slave-1 is up and running. Let us add an LXC device to this worker, on which we will run our LXC protocol based job. The name of the LXC device we will add to docker-slave-1 is 'lxc-docker-slave-01'. Create a jinja2 template file for lxc-docker-slave-01 and copy it to /etc/lava-server/dispatcher-config/devices/ on the LAVA master instance:

stylesen@codehelp:~$ cat \
/etc/lava-server/dispatcher-config/devices/lxc-docker-slave-01.jinja2

{% extends 'lxc.jinja2' %}
{% set exclusive = 'True' %}
stylesen@codehelp:~$ ls -alh \
/etc/lava-server/dispatcher-config/devices/lxc-docker-slave-01.jinja2

-rw-r--r-- 1 lavaserver lavaserver 56 Jun 18 00:36 \
/etc/lava-server/dispatcher-config/devices/lxc-docker-slave-01.jinja2

stylesen@codehelp:~$

In order to add the above device lxc-docker-slave-01 to the LAVA master database and associate it with our docker based LAVA worker docker-slave-1, login to the LAVA master instance and issue the following command:

stylesen@codehelp:~$ sudo lava-server manage devices add \
--device-type lxc --worker docker-slave-1 lxc-docker-slave-01

stylesen@codehelp:~$

The device will now be listed as part of the worker docker-slave-1 and can be seen at https://lava.codehelp.co.uk/scheduler/worker/docker-slave-1

The LXC job we will submit to the above device is https://git.linaro.org/lava-team/refactoring.git/tree/health-checks/lxc.yaml which is a normal LXC job requesting an LXC device type and running a basic smoke test on a Debian based LXC device.

stylesen@harshu:/tmp$ lavacli -i lava.codehelp jobs submit lxc.yaml 
2486
stylesen@harshu:/tmp$

NOTE: lavacli is the official command line tool for interacting with LAVA instances. Read more about lavacli in https://staging.validation.linaro.org/static/docs/v2/lavacli.html

Thus job 2486 has been submitted successfully to the LAVA instance lava.codehelp.co.uk and it ran successfully, as seen in https://lava.codehelp.co.uk/scheduler/job/2486 This job used lava-lxc-mocker instead of lxc, as can be seen from https://lava.codehelp.co.uk/scheduler/job/2486#L3
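You can also follow the job from the command line; assuming the same 'lava.codehelp' identity configured for lavacli above, the logs sub-command should fetch the job output:

$ lavacli -i lava.codehelp jobs logs 2486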

Read part 1 ... Continue Reading part 3

Read all parts of this blog post series from below links:

  1. lava-dispatcher docker images - part 1 - Introduction, Details and Preparation
  2. lava-dispatcher docker images - part 2 - Docker based LAVA Worker running pure LXC job
  3. lava-dispatcher docker images - part 3 - Docker based LAVA Worker running Nexus 4 job with and without LXC Protocol

by stylesen at June 06, 2018 02:30

lava-dispatcher docker images - part 3

This is part 3 of the three part blog post series on lava-dispatcher only docker images. If you haven't read part 2 already, then read it on - https://www.stylesen.org/lavadispatcher_docker_images_part_2

Docker based LAVA Worker running Nexus 4 job with LXC protocol

This is the second use case, in which we will look at starting a docker based LAVA worker and running a job that requests a Nexus 4 device type with the LXC protocol. The following command is used to start a docker based LAVA worker:

stylesen@hanshu:~$ sudo docker run \
-v /home/stylesen/docker-slave-files:/fileshare \
-v /var/run/docker.sock:/var/run/docker.sock -v /dev:/dev -itd --privileged \
-e HOSTNAME='docker-slave-1' -e MASTER='tcp://lava.codehelp.co.uk:5556' \
-e SOCKET_ADDR='tcp://lava.codehelp.co.uk:5555' -e LOG_LEVEL='DEBUG' \
-e ENCRYPT=1 -e MASTER_CERT='/fileshare/master.key' \
-e SLAVE_CERT='/fileshare/docker-slave-1.key_secret' -p 2222:22 \
--name ld-latest linaro/lava-dispatcher-staging-stretch-amd64:latest

76e820c1df7e5f4a7fe45bf130052674f2489f4d0ce7bb5f5a70c21a32696ff4
stylesen@hanshu:~$

There is not much difference between the above command and what we used in use case one, except for a couple of new options.

'-v /dev:/dev' - mounts the host machine's /dev directory inside the docker container at the location /dev. This is required when we deal with actual (physical) devices, in order to access them from within the docker container.

'--privileged' - this option is required to allow seamless passthrough and device access from within the container.

Once we have the docker based LAVA worker up and running with the new options in place, we can add a new nexus4 device to it. The name of the nexus4 device we will add to docker-slave-1 is 'nexus4-docker-slave-01'. Create a jinja2 template file for nexus4-docker-slave-01 and copy it to /etc/lava-server/dispatcher-config/devices/ on the LAVA master instance,

stylesen@codehelp:~$ sudo cat \
/etc/lava-server/dispatcher-config/devices/nexus4-docker-slave-01.jinja2

{% extends 'nexus4.jinja2' %}
{% set adb_serial_number = '04f228d1d9c76f39' %}
{% set fastboot_serial_number = '04f228d1d9c76f39' %}
{% set device_info = [{'board_id': '04f228d1d9c76f39'}] %}
{% set fastboot_options = ['-u'] %}
{% set flash_cmds_order = ['update', 'ptable', 'partition', 'cache', 'userdata', 'system', 'vendor'] %}

{% set exclusive = 'True' %}
stylesen@codehelp:~$ sudo ls -alh \
/etc/lava-server/dispatcher-config/devices/nexus4-docker-slave-01.jinja2

-rw-r--r-- 1 lavaserver lavaserver 361 Jun 18 01:32 \
/etc/lava-server/dispatcher-config/devices/nexus4-docker-slave-01.jinja2

stylesen@codehelp:~$

In order to add the above device nexus4-docker-slave-01 to the LAVA master database and associate it with our docker based LAVA worker docker-slave-1, login to the LAVA master instance and issue the following command:

stylesen@codehelp:~$ sudo lava-server manage devices add \
--device-type nexus4 --worker docker-slave-1 nexus4-docker-slave-01

stylesen@codehelp:~$

The device will now be listed as part of the worker docker-slave-1 and can be seen at https://lava.codehelp.co.uk/scheduler/worker/docker-slave-1

The job definition we will submit to the above device is https://git.linaro.org/lava-team/refactoring.git/tree/health-checks/nexus4.yaml which is a normal job requesting a Nexus4 device type and running a simple test on the device using the LXC protocol.

stylesen@harshu:/tmp$ lavacli -i lava.codehelp jobs submit nexus4.yaml 
2491
stylesen@harshu:/tmp$

Thus job 2491 has been submitted successfully to the LAVA instance lava.codehelp.co.uk and it ran successfully, as seen in https://lava.codehelp.co.uk/scheduler/job/2491

Docker based LAVA Worker running Nexus 4 job without LXC protocol

This is the third use case, in which we will look at starting a docker based LAVA worker and running a job that requests a Nexus 4 device type without the LXC protocol. The following command, used to start the docker based LAVA worker, is exactly the same as in use case two.

stylesen@hanshu:~$ sudo docker run \
-v /home/stylesen/docker-slave-files:/fileshare \
-v /var/run/docker.sock:/var/run/docker.sock -v /dev:/dev -itd --privileged \
-e HOSTNAME='docker-slave-1' -e MASTER='tcp://lava.codehelp.co.uk:5556' \
-e SOCKET_ADDR='tcp://lava.codehelp.co.uk:5555' -e LOG_LEVEL='DEBUG' \
-e ENCRYPT=1 -e MASTER_CERT='/fileshare/master.key' \
-e SLAVE_CERT='/fileshare/docker-slave-1.key_secret' -p 2222:22 \
--name ld-latest linaro/lava-dispatcher-staging-stretch-amd64:latest

76e820c1df7e5f4a7fe45bf130052674f2489f4d0ce7bb5f5a70c21a32696ff4
stylesen@hanshu:~$

We will use the same device added for use case two, i.e. 'nexus4-docker-slave-01', to execute this job.

The job we will submit to the above device is https://git.linaro.org/lava-team/refactoring.git/tree/minus-lxc/nexus4.yaml which is a normal job requesting a Nexus4 device type and running a simple test on the device, without invoking the LXC protocol.

stylesen@harshu:/tmp$ lavacli -i lava.codehelp jobs submit nexus4-minus-lxc.yaml 
2492
stylesen@harshu:/tmp$

Thus job 2492 has been submitted successfully to the LAVA instance lava.codehelp.co.uk and it ran successfully, as seen in https://lava.codehelp.co.uk/scheduler/job/2492

Hope this blog series helps you get started with the lava-dispatcher only docker images and with running your own docker based LAVA workers. If you have any doubts, questions or comments, feel free to email the LAVA team at lava-users [@] lists [dot] linaro [dot] org

Read part 2 ...

Read all parts of this blog post series from below links:

  1. lava-dispatcher docker images - part 1 - Introduction, Details and Preparation
  2. lava-dispatcher docker images - part 2 - Docker based LAVA Worker running pure LXC job
  3. lava-dispatcher docker images - part 3 - Docker based LAVA Worker running Nexus 4 job with and without LXC Protocol

by stylesen at June 06, 2018 02:30

June 17, 2018

Bin Chen

Understand Kubernetes 2: Operation Model

In the last article, we focused on the components in the work nodes. In this one, we'll switch our focus to the user and the components in the master node.

Operation Model

From the user's perspective, the model is quite simple: the user declares a State they want the system to be in, and then it is k8s's job to achieve that.
The user sends the Resources and Operations to k8s using the REST API, which is served by the API server inside the master node; the request is put into a stateStore (implemented using etcd). According to the type of resource, different Controllers are delegated to do the job.
The exact Operations available depend on the Resource type but, in most cases, they mean CRUD. For the create operation, there is a Specification defining the attributes of the resource to be created.
Here are a few examples:
  • create a Pod, according to a Pod spec.
  • create a Deployment called mypetstore, according to a Deployment spec.
  • update the mypetstore deployment with a new container image.
Each Resource (also called Object) has three pieces of information: Spec, Status and Metadata, and those are saved in the stateStore.
  • Spec is specified by the user for resource creation and update; it is the desired state of the resource.
  • Status is updated by the k8s system and queried by the user; it is the actual state of the resource.
  • Metadata is partly specified by the user and can be updated by the k8s system; it is the label of the resource.
The class diagram looks like this:
[Class diagram: a Resource holds a Spec (created by the user), a Status (updated by the k8s system) and Metadata (may be updated by both); a Controller controls (CRUD) Resources via Create/Update/Delete/GetStatus and customized operations, while the user defines the Spec and provides the Metadata.]

Sequence Diagram:

Let's see what really happens when you type kubectl create -f deployment/spec.yaml:
[Sequence diagram: the user runs kubectl create with spec.yaml; kubectl turns it into a REST call (POST .../deployments) to the API server, which saves the spec in the state store and acknowledges asynchronously; the controller, unblocked by the new state, does its work and updates new information (e.g. pod and node bindings); the work nodes, in turn unblocked by that state, do their part to achieve it.]

API

A k8s cluster is managed and accessed through a predefined API. kubectl is a client of that API; it converts shell commands into REST calls, as shown in the sequence diagram above.
You can build your own tools using those APIs to add functionality that is currently not available. Since the API is versioned and stable, your tools stay portable.
Portability and extensibility are the most important benefits k8s brings. In other words, k8s is awesome not only because it does awesome things itself but because it enables you and others to build awesome things on top of it.
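For instance (a sketch, assuming kubectl is already configured against a cluster), you can watch the REST calls kubectl makes under the hood, or talk to the versioned API directly through a local proxy:

# show the underlying REST requests and responses kubectl issues
kubectl get deployments --v=8

# or hit the API server directly via a local proxy
kubectl proxy --port=8001 &
curl http://127.0.0.1:8001/apis/apps/v1/namespaces/default/deployments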

Controllers

A Controller's job is to make sure the actual state of the object matches the desired state.
The idea of matching the actual state to the desired state is the driving philosophy of k8s's design. It doesn't sound quite novel, given that most declarative tools follow the same idea; for example, both Terraform and Ansible are declarative. Where k8s differs is that it keeps monitoring the system state and makes sure the desired state is always maintained. And that means all the goodness of availability and scalability is built into k8s.
The desired state is defined using a Spec, and that is what the user interacts with. It is k8s's job to do whatever you requested.
The most common specs are:
  • Deployments for stateless persistent apps (e.g. http servers)
  • StatefulSets for stateful persistent apps (e.g. databases)
  • Jobs for run-to-completion apps (e.g. batch jobs).
Let's take a close look at the Deployment Spec.

Deployment Spec

Below is the deployment spec that can be used to create a deployment of an nginx server with 3 replicas, each of which uses nginx:1.7.9 as the container image, with the application listening on port 80.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
This should be simple to understand. Compared with a simple Pod/Container specification, it has an extra replicas field. The kind is set to Deployment so that the right Controller will be able to pick it up.
Lots of specs will have a nested PodSpec, as shown below, since at the end of the day, k8s is a Pod/Container management system.
[Class diagram: a DeploymentSpec (replicas, selector, strategy) embeds a PodTemplateSpec, which in turn embeds a PodSpec (containers, volumes); the DeploymentController on the k8s master uses the spec to create, update and monitor the deployment on the work nodes.]
For a complete reference of the field available for deployment spec, you can check here.
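To see the reconciliation in action (a sketch, assuming the spec above is saved as nginx-deployment.yaml and kubectl is configured):

kubectl apply -f nginx-deployment.yaml                    # declare the desired state
kubectl get pods -l app=nginx                             # three replicas appear
kubectl delete pod <one-of-the-nginx-pod-names>           # kill one; the controller recreates it
kubectl scale deployment nginx-deployment --replicas=5    # change the desired state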

Summary

In this article, we looked at the components of the Master node and the overall operation model of k8s: drive and maintain the actual state of the system to be the same as the desired state specified by the user through the various object specifications. In particular, we took a close look at the most used spec, the Deployment.

by Bin Chen (noreply@blogger.com) at June 06, 2018 01:34

June 14, 2018

Marcin Juszkiewicz

OpenStack Days 2018

During the last few days I was in Kraków, Poland at the OpenStack Days conference. It had two (Tuesday) or three (Monday) tracks filled with talks. Of varying quality, as happens at such small events.

A detailed list of presentations is available on the conference’s agenda. As usual I attended some of them and spent time on the hallway track.

There was one issue (common to Polish conferences): should speakers use Polish or English? There were attendees who did not understand Polish, so some talks were a mix of Polish slides with an English presentation, some were fully English and some fully Polish. A few speakers asked for a language preference at the start of their talks.

Interesting talks? The one from OVH about updating OpenStack (Juno on Ubuntu 14.04 -> Newton on Ubuntu 16.04). Interesting, simple to understand. Slides available. Szymon Datko described how they started with the Havana release and how they moved from in-house development to cooperation with upstream.

Another one, given by Sławek Kapłoński from Red Hat, was about becoming an upstream OpenStack developer. Git, gerrit etc. The talk turned into a discussion with several questions and notes from the audience (including me).

DreamLab guys spoke about testing OpenStack. Rally, Shaker and few other names appeared during talk. It was interesting but their voices were making me sleepy ;(

Attended several other presentations but had a feeling that those small conferences give many slots to sponsors which do not always have something interesting to fill them with. Or the title sounds good but the speaker lacks presentation experience and is unable to keep the flow.

Met several people from Polish division of Red Hat, spoke with folks from Mirantis, OVH, Samsung, Suse (and other companies), met local friends. Had several discussions. So it was worth going.

by Marcin Juszkiewicz at June 06, 2018 08:25

June 10, 2018

Ard Biesheuvel

UEFI driver pitfalls and PC-isms

Even though Intel created UEFI (still known by its TLA EFI at the time) for Itanium initially, x86 is by far the dominant architecture when it comes to UEFI deployments in the field, and even though the spec itself is remarkably portable to architectures such as ARM, there are a lot of x86 UEFI drivers out there that cut corners when it comes to spec compliance. There are a couple of reasons for this:

  • the x86 architecture is not as heterogeneous as other architectures, and while the form factor may vary, most implementations are essentially PCs;
  • the way the PC platform organizes its memory and especially its DMA happens to result in a configuration that is rather forgiving when it comes to UEFI spec violations.

UEFI drivers provided by third parties are mostly intended for plugin PCI cards, and are distributed as binary option ROM images. There are very few open source UEFI drivers available (apart from the _HCI class drivers and some drivers for niche hardware available in Tianocore), and even if they were widely available, you would still need to get them into the flash ROM of your particular card, which is not a practice hardware vendors are eager to support.
This means the gap between theory and practice is larger than we would like, and this becomes apparent when trying to run such code on platforms that deviate significantly from a PC.

The theory

As an example, here is some code from the EDK2 EHCI (USB2) host controller driver.

  Status = PciIo->AllocateBuffer (PciIo, AllocateAnyPages,
                     EfiBootServicesData, Pages, &BufHost, 0);
  if (EFI_ERROR (Status)) {
    goto FREE_BITARRAY;
  }

  Bytes = EFI_PAGES_TO_SIZE (Pages);
  Status = PciIo->Map (PciIo, EfiPciIoOperationBusMasterCommonBuffer,
                     BufHost, &Bytes, &MappedAddr, &Mapping);
  if (EFI_ERROR (Status) || (Bytes != EFI_PAGES_TO_SIZE (Pages))) {
    goto FREE_BUFFER;
  }

  ...

  Block->BufHost  = BufHost;
  Block->Buf      = (UINT8 *) ((UINTN) MappedAddr);
  Block->Mapping  = Mapping;

This is a fairly straight-forward way of using UEFI’s PCI DMA API, but there are a couple of things to note here:

  • PciIo->Map () may be called with the EfiPciIoOperationBusMasterCommonBuffer mapping type only if the memory was allocated using PciIo->AllocateBuffer ();
  • the physical address returned by PciIo->Map () in MappedAddr may deviate from both the virtual and physical addresses as seen by the CPU (note that UEFI maps VA to PA 1:1);
  • the size of the actual mapping may deviate from the requested size.

However, none of this matters on a PC, since its PCI is cache coherent and 1:1 mapped. So the following code will work just as well:

  Status = gBS->AllocatePages (AllocateAnyPages, EfiBootServicesData,
                  Pages, &BufHost);
  if (EFI_ERROR (Status)) {
    goto FREE_BITARRAY;
  }

  ...

  Block->BufHost  = BufHost;
  Block->Buf      = BufHost;

So let’s look at a couple of ways a non-PC platform can deviate from a PC when it comes to the layout of its physical address space.

DRAM starts at address 0x0

On a PC, DRAM starts at address 0x0, and most of the 32-bit addressable physical region is used for memory. Not only does this mean that inadvertent NULL pointer dereferences from UEFI code may go entirely unnoticed (one example of this is the NVidia GT218 driver), it also means that PCI devices that only support 32-bit DMA (or need a little kick to support more than that) will always be able to work. In fact, most UEFI implementations for x86 explicitly limit PCI DMA to 4 GB, and most UEFI PCI drivers don’t bother to set the mandatory EFI_PCI_IO_ATTRIBUTE_DUAL_ADDRESS_CYCLE attribute for >32 bit DMA capable hardware either.

On ARM systems, the amount of available 32-bit addressable RAM may be much smaller, or it may even be absent entirely. In the latter case, hardware that is only 32-bit DMA capable can only work if an IOMMU is present and wired into the PCI root bridge driver by the platform, or if DRAM is not mapped 1:1 in the PCI address space. But in general, it should be expected that ARM platforms use at least 40 bits of address space for DMA, and that drivers for 64-bit DMA capable peripherals enable this capability in the hardware.

PCI DMA is cache coherent

Although not that common, it is possible and permitted by the UEFI spec for PCI DMA to be non cache coherent. This is completely transparent to the driver, provided that it uses the APIs correctly. For instance, PciIo->AllocateBuffer () will return an uncached buffer in this case, and the Map () and Unmap () methods will perform cache maintenance under the hood to keep the CPU’s and the device’s view of memory in sync. Obviously, this use case breaks spectacularly if you cut corners like in the second example above.

PCI memory is mapped 1:1 with the CPU

On a PC, the two sides of the PCI host bridge are mapped 1:1. As illustrated in the example above, this means you can essentially ignore the device or bus address returned from the PciIo->Map () call, and just program the CPU physical address into the DMA registers/rings/etc. However, non-PC systems may have much more extravagant PCI topologies, and so a compliant driver should use the appropriate APIs to obtain these addresses. Note that this is not limited to inbound memory accesses (DMA) but also applies to outbound accesses, and so a driver should not interpret BAR fields from the PCI config space directly, given that the CPU side mapping of that BAR may be at a different address altogether.

PC has strongly ordered memory

Whatever. UEFI is uniprocessor anyway, and I don’t remember seeing any examples where this mattered.

Using encrypted memory for DMA

Interestingly, and luckily for us in the ARM world, there are other reasons why hardware vendors are forced to clean up their drivers: memory encryption. This case is actually rather similar to the non cache coherent DMA case, in the sense that the allocate, map and unmap actions all involve some extra work performed by the platform under the hood. Common DMA buffers are allocated from unencrypted memory, and mapping or unmapping involve decryption or encryption in place depending on the direction of the transfer (or bounce buffering if encryption in place is not possible, in which case the device address will deviate from the host address like in the non-1:1 mapped PCI case above). Cutting corners here means that attempted DMA transfers will produce corrupt data, usually a strong motivator to get your code fixed.

Conclusion

The bottom line is really that the UEFI APIs appear to be able to handle anything you throw at them when it comes to unconventional platform topologies, but this only works if you use them correctly, and having been tested on a PC doesn’t actually prove all that much in this regard.

by ardbiesheuvel at June 06, 2018 17:45

Bin Chen

Understand Kubernetes 1: Container Orchestration

So far, we know the benefits of containers and how a container is implemented using Linux primitives.
If we only need to run one or two containers, we should be satisfied. That's all we need. But if we want to run dozens or thousands of containers to build a stable and scalable web service able to serve millions of transactions per second, we have more problems to solve. To name a few:
  • scheduling: Which host to put a container?
  • update: How to update the container image and ensure zero downtime?
  • self-healing: How to detect and restart a container when it is down?
  • scaling: How to add more containers when more processing capacity is needed?
None of those issues are new; only the subject has become containers, rather than physical servers (in the old days) or virtual machines (more recently). The functionalities described above are usually referred to as Container Orchestration.

Kubernetes

kubernetes, abbreviated as k8s, is one of many container orchestration solutions. But, as of mid-2018, many would agree the competition is over; k8s is the de facto standard. I think that is good news, freeing you from the hassle of picking from many options and worrying about investing in the wrong one. K8s is completely open source, with a variety of contributors, from big companies to individuals.
k8s has a very good documentation, mostly here and here.
In this article, we'll take a different perspective. Instead of starting with how to use the tools, we'll start with the very object the k8s platform is trying to manage - the container. We'll try to see what extra things k8s can do compared with a single machine container runtime such as runc or docker, and how k8s integrates with those container runtimes.
However, we can't do that without an understanding of the high-level architecture of k8s.

At the highest level, k8s is a master and slave architecture, with a master node controlling multiple slave or work nodes. The master and slave nodes together are called a k8s cluster. The user talks to the cluster using the API, which is served by the master. We intentionally left the master node diagram empty, with a focus on how things are connected on the work node.
The master talks to the work nodes through the kubelet, which primarily runs and stops Pods through CRI, which in turn is connected to a container runtime. The kubelet also monitors Pods for liveness and pulls debug information and logs.
We'll go over the components in a little more detail below.

Nodes

There are two types of nodes, master nodes and slave nodes. A node can be either a physical machine or a virtual machine.
You can jam the whole k8s cluster into a single machine, for example by using minikube.

Kubelet

Each work node has a kubelet; it is the agent that enables the master node to talk to the slaves.
The responsibility of kubelet includes:
  • Creating/running the Pod
  • Probe Pods
  • Monitor Nodes/Pod
  • etc.
We can go no further without first introducing the Pod.

Pod

In k8s, the smallest scheduling or deployment unit is the Pod, not the container. But there shouldn't be any cognitive overhead if you already know containers well. The benefit of the Pod is to add another wrapper on top of the container, making sure closely coupled containers are guaranteed to end up scheduled on the same host, so that they can share a volume or network that would otherwise be difficult or inefficient to implement if they were on different hosts.
A pod is a group of one or more containers, with shared storage and network, and a specification for how to run the containers. A pod’s contents are always co-located and co-scheduled and run in a shared context, such as namespaces and cgroups.
For details, you can find here.

Config, Schedule and Run Pod

You configure a Pod using a YAML file, called its spec. As you can imagine, the Pod spec will include configurations for each container, which include the image and the runtime configuration.
With this spec, k8s will pull the image and run the container, just as you would do using a simple docker command. Nothing quite innovative here.
What's missing there is that in the spec we also describe the resource requirements for the containers/Pod, and k8s will use that information, along with the current cluster status, to find a suitable host for the Pod. This is called Pod scheduling. The functionality and effectiveness of the scheduler may be overlooked; in the Borg paper, it is mentioned that a better scheduler can actually save millions of dollars at Google scale.
In the spec, we can also specify the Liveness and Readiness Probes.

Probe Pods

The kubelet uses liveness probes to know when to restart a container, and readiness probes to know when a container is ready to start accepting traffic. The first is the foundation for self-healing and the second for load balancing.
Without k8s, you would have to do all of this on your own. Time and $$ saved.
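As a sketch of what the probes look like in practice (a hypothetical nginx Pod; the probe fields are part of the standard PodSpec), they sit alongside the container definition:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: web
    image: nginx:1.7.9
    ports:
    - containerPort: 80
    livenessProbe:            # failing this gets the container restarted
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 3
      periodSeconds: 10
    readinessProbe:           # traffic is only routed once this succeeds
      httpGet:
        path: /
        port: 80
      periodSeconds: 5
EOF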

Container Runtime: CRI

k8s isn't bound to a particular container runtime; instead, it defines an interface for image management and the container runtime. Anyone who implements the interface can be plugged into k8s, or to be more accurate, the kubelet.
There are multiple implementations of CRI. Docker has cri-containerd, which plugs containerd/docker into the kubelet. cri-o is another implementation, which wraps runc for the container runtime service and wraps a bunch of other libraries for the image service. Both use cni for the network setup.
Assuming a Pod/Container is assigned to a particular node, the kubelet on that node will operate as follows:
[Sequence diagram: the kubelet asks its CRI client to run a container; the request goes to the CRI server over gRPC; the image service pulls the image from a registry and unpacks it to create the rootfs; the runtime service creates the runtime config (config.json) from the pod spec and runs the container, e.g. via runc.]
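If you want to poke at the CRI layer yourself, the cri-tools project ships crictl, a client that speaks CRI to whichever server is listening (a sketch; the socket path depends on your runtime):

sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock pods    # pod sandboxes
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps      # containers
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock pull nginx:1.7.9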

Summary

We went through why we need a container orchestration system, and then the high-level architecture of k8s, with a focus on the components in the work node and their integration with the container runtime.

by Bin Chen (noreply@blogger.com) at June 06, 2018 07:04

June 06, 2018

Marcin Juszkiewicz

From a diary of AArch64 porter — parallel builds

Imagine that you have a package to build. Sometimes it takes minutes. Another one takes hours. And then you run htop and see that your machine is idle during such a build… You may ask “Why?” and the answer would be simple: multiple cpu cores.

On x86-64 developers usually have from two to four cpu cores. Can be double of that due to HyperThreading. And that’s all. So for some weird reason they go for using make -jX where X is half of their cores. Or completely forget to enable parallel builds.

And then I came with ARM64 system. With 8 or 24 or 32 or 48 or even 96 cpu cores. And have to wait and wait and wait for package to build…

So next step is usually similar — edit of debian/rules file and adding --parallel argument to dh call. Or removal of --max-parallel option. And then build makes use of all those shiny cpu cores. And it goes quickly…
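For reference, a minimal debian/rules using the dh sequencer with parallel builds forced on (needed for compat levels below 10, see the update below) looks like this:

$ cat debian/rules
#!/usr/bin/make -f
%:
	dh $@ --parallel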

UPDATE: Riku Voipio told me that Debhelper 10 does parallel builds by default. If you set ‘debian/compat’ value to at least ’10’.

by Marcin Juszkiewicz at June 06, 2018 10:46

June 04, 2018

Tom Gall

Developing Android apps on ChromeOS

Having been at Google IO 2018 I happened to be lucky enough to attend the “What’s new on ChromeOS” session, at the end of which they handed out not only groovy socks but also 75% off a Pixelbook.

During the session however Google had all sorts of things to say about enabling both Linux and Android development on ChromeOS. Now these are two things the world has needed and wanted for some time.

The Chromebook offer is for the midrange i5 based Chromebook. I received mine on Friday so I’ve had a few days with it.


Getting set up for Android development, meaning having Android Studio running as well as being able to run/debug Android apps on the Chromebook, wasn’t too hard.

Instructions are here but they are wrong in a few spots.

First, do turn on the developer (unstable) channel and turn on Linux. BUT in order to debug Android apps via Android Studio, you then need to turn on developer mode on your Pixelbook (or other akin device). You can’t debug Android apps over USB (yet) so really I view this as an essential step.

Developer mode of course wipes the device so yeah, takes a bit longer to get to the end goal. You’ll live. I’ll link to the ‘snarky’ guide because there’s reasons not to enable developer mode if you don’t know what you’re doing. Remember you JUST need to enable developer mode, nothing else from this guide.

With developer mode on, again, turn on Linux mode, and now follow the rest of the guide. When you get to the point where you need to Mount Linux Files, you first need to enable the ssh server in the Debian environment:

> sudo rm /etc/ssh/sshd_not_to_be_run

> sudo service ssh restart

Ok now go back to the guide.

Then when you get to the Android Studio part, make sure you download the current preview, 3.2. If you don’t, you’ll end up in a world of frustration where your new shiny Pixelbook will be at great risk of you throwing it across the room.

That done, you’ll find App development is pretty darn smooth. I’ve pounded out a couple of simple apps this weekend and everything ‘just worked’.  I’ll note that your very first compile will take awhile. This is down to some gradle files getting downloaded in the background. In real world terms, my first “hello world” app took about 3 minutes to build. After that, more like a second or two.

 

by tgallfoo at June 06, 2018 14:03

June 03, 2018

Bin Chen

Understand Container 7: use CNI to setup network

CNI stands for Container Network Interface (not to be confused with CRI, the Container Runtime Interface). It originated at CoreOS as rkt's network solution, and beat Docker's CNM to become the network plugin interface adopted by k8s.
In this blog we are going to see how to use CNI, specifically the bridge plugin, to set up the network for containers spawned by runc, achieving the same result/topology as we did in the last blog using netns as the hook.

Overview

The caller/user of CNI (e.g. you calling from a shell, or a container runtime/orchestrator such as runc or k8s) interacts with a plugin using two things: a network configuration file and some environment variables. The configuration file holds the config of the network (or subnet) the container is supposed to connect to; the environment variables include the paths where the plugin binaries and network configuration files can be found, plus "add/delete which container to/from which network namespace", which could just as well be implemented by passing arguments to the plugin (instead of using environment variables). It's not a big issue, but it looks a little bit "unusual" to use the environment to pass arguments.
For a more detailed introduction of CNI, see here and here.

Use CNI plugins

build/install plugins

go get github.com/containernetworking/plugins
cd $GOPATH/src/github.com/containernetworking/plugins
./build.sh
sudo mkdir -p /opt/cni/bin
sudo cp bin/* /opt/cni/bin/

Use CNI

We'll be using the following simple (and dirty) script to exercise CNI with runc. It covers all the essential concepts in one place, which is nice.
$ cat runc_cni.sh
#!/bin/sh

# need run with root
# ADD or DEL or VERSION
action=$1
cid=$2
pid=$(runc ps $cid | sed '1d' | awk '{print $2}')
plugin=/opt/cni/bin/bridge

export CNI_PATH=/opt/cni/bin/
export CNI_IFNAME=eth0
export CNI_COMMAND=$action
export CNI_CONTAINERID=$cid
export CNI_NETNS=/proc/$pid/ns/net

$plugin <<EOF
{
    "cniVersion": "0.2.0",
    "name": "mynet",
    "type": "bridge",
    "bridge": "cnibr0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "subnet": "172.19.1.0/24",
        "routes": [
            { "dst": "0.0.0.0/0" }
        ],
        "dataDir": "/run/ipam-state"
    },
    "dns": {
        "nameservers": [ "8.8.8.8" ]
    }
}
EOF

It may not be obvious to a newcomer that we are using two plugins here: the bridge plugin and host-local. The former sets up a bridge network (as well as the veth pair) and the latter allocates and assigns IPs to the containers (and the bridge gateway), which is called IPAM (IP Address Management), as you might have noticed in the config key.
The internal working of the bridge plugin is almost the same as what netns does, and we are not going to repeat it here.
Start a container called c1: sudo runc run c1
Then, put c1 into the network:
sudo ./runc_cni.sh ADD c1
Below is the output, telling you the IP and gateway of c1, among other things.
{
    "cniVersion": "0.2.0",
    "ip4": {
        "ip": "172.19.1.6/24",
        "gateway": "172.19.1.1",
        "routes": [
            {
                "dst": "0.0.0.0/0",
                "gw": "172.19.1.1"
            }
        ]
    },
    "dns": {
        "nameservers": [
            "8.8.8.8"
        ]
    }
}
You can create another container c2 and put it into the same network in a similar way; now we have a subnet with two containers inside. They can talk to each other and can ping outside IPs, thanks to the route settings and IP masquerade. However, DNS won't work.
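For example (a sketch; the exact addresses are handed out by host-local, so check the ADD output for yours):

$ sudo ./runc_cni.sh ADD c2
$ sudo runc exec c1 ping -c 2 172.19.1.7   # c2's address, taken from the ADD output
$ sudo runc exec c1 ping -c 2 8.8.8.8      # an outside IP, via the gateway and masquerade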
You can also remove a container from the network, after which the container won't be connected to the bridge anymore.
sudo ./runc_cni.sh DEL c1
However, the IP resource won't be reclaimed automatically; you have to do that "manually".
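Since the config above points host-local's dataDir at /run/ipam-state, the allocations are just files on disk, grouped by network name; a sketch of the manual cleanup (the exact layout is an implementation detail of host-local):

$ sudo ls /run/ipam-state/mynet/           # one entry per allocated address
$ sudo rm /run/ipam-state/mynet/172.19.1.6 # hand c1's address back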
That is it; as we said, this was going to be a short ride. Have fun with CNI.

by Bin Chen (noreply@blogger.com) at June 06, 2018 07:56

June 01, 2018

Alex Bennée

dired-rsync 0.4 released

I started hacking on this a while back but I’ve finally done the house-keeping tasks required to make it a proper grown up package.

dired-rsync is a simple command which you can use to trigger an rsync copy from within dired. This is especially useful when you want to copy across large files from a remote server without locking up Emacs/Tramp. The rsync just runs as an inferior process in the background.

Today was mainly a process of cleaning up the CI and fixing any issues with it. I’d still like to add some proper tests but the whole thing is interactive and that seems to be tricky for Emacs to test. Anyway I’ve now tagged 0.4 so it will be available from MELPA Stable once it rebuilds. You can of course grab the bleeding edge from MELPA any time 😉

by Alex at June 06, 2018 17:12

May 29, 2018

Leif Lindholm

Running UniFi Controller on arm64 (or ppc64el)

Sometime last year I decided to switch my home wireless infrastructure over to Ubiquiti UniFi. This isn't just standalone access points, so they rely on controller software - to be run on Someone Else's Computer (just no), or using their UniFi Controller on a machine of your choice. Since the controller is written in Java, it will run pretty much anywhere that can also run its other dependencies. They even provide their own Debian/Ubuntu repository, and a pretty howto on setting it up.

UniFi on armhf

I initially actually ran this on armhf/Stretch, and still have a post in draft state on how to achieve this (since one of the prerequisites is MongoDB, no longer supported on armhf), but probably won't bother publishing it since it is a bit of a dead end.

(Short short version: grab the 2.6.10 sources from Ubuntu Xenial and fix the most awfully broken bits of code until it actually compiles. This includes the parts of the testsuite that try to verify undefined behaviour of the programming languages used. ?!?)

But since I now have always-on arm64 machines in my home network, I decided it was time to move to the architecture that has been my main development target for the past 8 years...

UniFi on arm64

Unsurprisingly, this hit a snag; while the package itself is completely architecture-independent, the Debian repository format is not. With the instructions from the howto, apt expects to find $ARCHIVE_ROOT/dists/$DISTRIBUTION/ubiquiti/binary-$arch/Packages.gz to tell it which packages are available in the repo and what their dependencies are. Which works fine when there is a populated entry for $arch. There is for (at least) i386, amd64 and armhf - but not for arm64 or ppc64el.

The $ARCHIVE_ROOT specified in the above-linked howto is http://www.ubnt.com/downloads/unifi/debian. Not sure why that does not specify https (which also works), but I will use the actually documented variant below.

Workaround

The package itself is fully architecture independent. So what we can do instead is grab the Packages.gz for armhf and have a peek:

$ wget http://dl.ubnt.com/unifi/debian/dists/stable/ubiquiti/binary-armhf/Packages.gz
...
$ zcat Packages.gz
Package: unifi
Version: 5.7.23-10670
Architecture: all
Depends: binutils, coreutils, jsvc (>=1.0.8) , mongodb-server (>=2.4.10) | mongodb-10gen (>=2.4.14) | mongodb-org-server (>=2.6.0), java8-runtime-headless, adduser, libcap2
Conflicts: unifi-controller
Provides: unifi-controller
Replaces: unifi-controller
Installed-Size: 113416
Maintainer: UniFi developers <unifi-dev@ubnt.com>
Priority: optional
Section: java
Filename: pool/ubiquiti/u/unifi/unifi_5.7.23-10670_all.deb
Size: 64571866
SHA256: e7b60814c27d85c13e54fc3041da721cc38ad21bb0a932bdfe810c2ad3855392
SHA1: 49f16c3d0c6334cb2369cd2ac03ef3f0d0dfe9e8
MD5sum: 478b56465bf652993e9870912713fab2
Description: Ubiquiti UniFi server
 Ubiquiti UniFi server is a centralized management system for UniFi suite of devices.
 After the UniFi server is installed, the UniFi controller can be accessed on any
 web browser. The UniFi controller allows the operator to instantly provision thousands
 of UniFi devices, map out network topology, quickly manage system traffic, and further
 provision individual UniFi devices.
Homepage: http://www.ubnt.com/unifi

Download the package

The Filename: field tells us the current unifi packages can be found at pool/ubiquiti/u/unifi/unifi_5.7.23-10670_all.deb - relative to the $ARCHIVE_ROOT, not the binary-$arch - so we can download it with

$ wget http://dl.ubnt.com/unifi/debian/pool/ubiquiti/u/unifi/unifi_5.7.23-10670_all.deb

Verify the integrity of the package by running

$ sha256sum unifi_5.7.23-10670_all.deb

and comparing the output with the value from the SHA256: field.

Install dependencies and UniFi

The Depends: field tells us we need

  • binutils
  • coreutils

(both of these are likely to be installed already, unless you like me had just accidentally tried to install a broken home-built toolchain package in the host instead of a chroot ... oops!)

  • jsvc
  • mongodb-server
  • java8-runtime-headless
  • adduser (also likely to already be installed)
  • libcap2

Resolving this is straightforward enough, with perhaps the single exception of java8-runtime-headless, which is a virtual package. But if you try to install that, apt will let you know, and point out which available packages provide it. So, as a one-liner:

$ sudo apt-get install jsvc mongodb-server openjdk-8-jre-headless libcap2

Then we're ready to:

$ sudo dpkg -i unifi_5.7.23-10670_all.deb

Setup

Nothing architecture-specific about this: go to https://$HOST:8443 to set up. In my case, I just imported my downloaded backup from the armhf server and had everything back up and running quickly without manual intervention.

Final notes

Of course, this will leave you without automatic updates, so you'll need to periodically have a look at one of the actually enabled architectures for version changes and manually install updates.

And if you have an account on the Ubiquiti forum, consider upvoting my proposal to add the missing architectures to the repository.

by Leif Lindholm at May 05, 2018 11:24

May 28, 2018

Marcin Juszkiewicz

Yet another blog theme change

During morning discussions I had to check something on my website and decided that it was time to change the theme. For the nth time.

So I looked, checked several ones and then started editing the ‘Spacious’ one. The usual stuff — no categories, colours/fonts/styles changes. It went much faster than the previous time.

But then I realised that I do not remember all previous ‘looks’ of my blog. Web archive to the rescue ;D

When I started on the 1st of April 2005 I used some theme. I do not remember what it was called:

About one year later I decided to change it, to the Barthelme theme. Widgets arrived, a clean view etc. At that time all my FOSS work was done in my free time. As people were asking about donating money/hardware I had a special page about it. Anyone remember Moneybookers?

A year passed, another theme change. “Big Blue” this time. Something is wrong with the styles, as that white area in the top left corner should have a blue background. At that time I had my own one-person company, so the website had information about available services. And the blog got moved to the “blog.haerwu.biz” domain instead of the “hrw.one.pl” one.

In 2009 I played with the Atahualpa theme. It looks completely broken when loaded through the web archive. I also changed the site name to my full name instead of a nickname, and got rid of the hard-to-pronounce domain in favour of “marcin.juszkiewicz.com.pl”, which may not be easier to pronounce, but several people were already able to say my last name properly.

The same year I went for the “Carrington blog” theme. It looks much better than the previous one.

2012 happened. And the change to Twenty Twelve happened too. The end of the world did not happen.

Some restyling was done later. And subtitle went from OpenEmbedded to ARM/AArch64 stuff.

Three years with one theme. Quite a long time. So another change: Twenty Sixteen. This one was supposed to look proper on mobile devices (and did).

And now new theme: Spacious. For another few years?

One website and so many changes… Still keeping it simple; no plans for adding images to every post etc.

by Marcin Juszkiewicz at May 05, 2018 15:54

May 27, 2018

Bin Chen

Understand Container 6: Hooks and Network

When it comes to the container network, the OCI runtime spec does no more than creating or joining a network namespace. All the other work is left to be dealt with by hooks, which let you inject into different stages of the container runtime lifecycle and do some customization.


With the default config.json, you will see only a loopback device, but not the eth0 that you normally see on the host, which allows you to talk to the outside world. But we can set up a simple bridge network by using netns as the hook.

Go and get netns and copy the binary to /usr/local/bin, which is where the following config.json assumes it lives. It is worth noting that hooks are executed in the runtime namespace, not the container namespace. That means, among other things, that the hook binary should reside on the host system, not in the container. Hence, you don't need to put netns into the container rootfs.

Set up a bridge network using netns

Make the following changes to config.json. In addition to the hooks, we also need the CAP_NET_RAW capability, so that we can use ping inside the container to do some basic network checking.

binchen@m:~/container/runc$ git diff
diff --git a/config.json b/config.json
index 25a3154..d1c0fb2 100644
--- a/config.json
+++ b/config.json
@@ -18,12 +18,16 @@
"bounding": [
"CAP_AUDIT_WRITE",
"CAP_KILL",
- "CAP_NET_BIND_SERVICE"
+ "CAP_NET_BIND_SERVICE",
+ "CAP_NET_RAW"
],
"effective": [
"CAP_AUDIT_WRITE",
"CAP_KILL",
- "CAP_NET_BIND_SERVICE"
+ "CAP_NET_BIND_SERVICE",
+ "CAP_NET_RAW"
],
"inheritable": [
"CAP_AUDIT_WRITE",
@@ -33,7 +37,9 @@
"permitted": [
"CAP_AUDIT_WRITE",
"CAP_KILL",
- "CAP_NET_BIND_SERVICE"
+ "CAP_NET_BIND_SERVICE",
+ "CAP_NET_RAW"
],
"ambient": [
"CAP_AUDIT_WRITE",
@@ -131,6 +137,16 @@
]
}
],
+
+ "hooks":
+ {
+ "prestart": [
+ {
+ "path": "/usr/local/bin/netns"
+ }
+ ]
+ },
+
"linux": {
"resources": {
"devices": [

Start a container with this new config.
Inside the container, we find an eth0 device, in addition to the loopback device that is always there.

/ # ifconfig
eth0 Link encap:Ethernet HWaddr 8E:F3:5C:D8:CA:2B
inet addr:172.19.0.2 Bcast:172.19.255.255 Mask:255.255.0.0
inet6 addr: fe80::8cf3:5cff:fed8:ca2b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:21992 errors:0 dropped:0 overruns:0 frame:0
TX packets:241 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2610155 (2.4 MiB) TX bytes:22406 (21.8 KiB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:6 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:498 (498.0 B) TX bytes:498 (498.0 B)

And you will be able to ping the outside world (here, one of Google's IPs).

/ # ping 216.58.199.68
PING 216.58.199.68 (216.58.199.68): 56 data bytes
64 bytes from 216.58.199.68: seq=0 ttl=55 time=18.382 ms
64 bytes from 216.58.199.68: seq=1 ttl=55 time=17.936 ms

So, how does it work?

Bridge, Veth, Route and iptable/NAT

Upon a hook being called, the container runtime passes the hook the container's state, including, among other things, the pid of the container (in the runtime namespace). The hook, netns in this case, uses that pid to find the network namespace the container is supposed to run in. With that pid, netns will do a few things:
  1. Create a Linux bridge with the default name netns0 (if there isn't one already), and set up the MASQUERADE rule on the host.
  2. Create a veth pair, connecting one endpoint of the pair to the bridge netns0 and placing the other one (renamed to eth0) into the container's network namespace.
  3. Allocate and assign an IP to the container interface (eth0), and set up the route table for the container.
Soon we'll go over the stuff mentioned above in detail, but first let's start another container with the same config.json. Hopefully it'll make things clearer and more interesting than having only one container.
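As a preview, the rough shape of what netns sets up can be reproduced by hand with iproute2. This is an illustrative sketch, not the literal commands netns runs; $PID stands for the container pid:

# create the bridge (once) and give it the gateway address
sudo ip link add netns0 type bridge
sudo ip addr add 172.19.0.1/16 dev netns0
sudo ip link set netns0 up

# create a veth pair; attach one end to the bridge, move the other into the container
sudo ip link add netnsv0-$PID type veth peer name eth0-tmp
sudo ip link set netnsv0-$PID master netns0 up
sudo ip link set eth0-tmp netns $PID

# inside the container's network namespace: rename, address, default route
sudo nsenter -t $PID -n ip link set eth0-tmp name eth0
sudo nsenter -t $PID -n ip addr add 172.19.0.2/16 dev eth0
sudo nsenter -t $PID -n ip link set eth0 up
sudo nsenter -t $PID -n ip route add default via 172.19.0.1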
  • bridge and interfaces
A bridge netns0 is created and two interfaces are associated with it. The names of the interfaces follow the format netnsv0-$(containerPid).

$ brctl show netns0
bridge name bridge id STP enabled interfaces
netns0 8000.f2df1fb10980 no netnsv0-8179
netnsv0-10577

As explained before, netnsv0-8179 is one endpoint of the veth pair, connected to the bridge; the other endpoint is inside container 8179. Let's verify that.
  • veth pair
On the host, we can see the peer of netnsv0-8179 is index 7

$ ethtool -S netnsv0-8179
NIC statistics:
peer_ifindex: 7

And in container 8179, we can see that eth0's index is 7. This confirms that the eth0 in container 8179 is paired with netnsv0-8179 on the host. The same is true for netnsv0-10577 and the eth0 in container 10577.

/ # ip a
1: lo: mtu 65536 qdisc noqueue qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
7: eth0@if8: mtu 1500 qdisc noqueue qlen 1000
link/ether 8e:f3:5c:d8:ca:2b brd ff:ff:ff:ff:ff:ff
inet 172.19.0.2/16 brd 172.19.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::8cf3:5cff:fed8:ca2b/64 scope link
valid_lft forever preferred_lft forever

So far, we have seen how a container is connected to the host's virtual bridge using a veth pair. We have the network interfaces, but we still need a bit more setup: the route table and iptables.

Route Table

Here is the route table in container 8179:

/ # route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default 172.19.0.1 0.0.0.0 UG 0 0 0 eth0
172.19.0.0 * 255.255.0.0 U 0 0 0 eth0

We can see that all traffic goes through eth0 to the gateway, which is the bridge netns0, as shown by:

# in container
/ # ip route get 216.58.199.68 from 172.19.0.2
216.58.199.68 from 172.19.0.2 via 172.19.0.1 dev eth0

On the host:

$ route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default 192-168-1-1 0.0.0.0 UG 0 0 0 wlan0
172.19.0.0 * 255.255.0.0 U 0 0 0 netns0
192.168.1.0 * 255.255.255.0 U 9 0 0 wlan0
192.168.122.0 * 255.255.255.0 U 0 0 0 virbr0

Also:

# on host
$ ip route get 216.58.199.68 from 172.19.0.1
216.58.199.68 from 172.19.0.1 via 192.168.1.1 dev wlan0
cache

192.168.1.1 is the IP of my home router, which is a real bridge.
Piecing together the routes in the container and on the host, we can see that when we ping Google from the container, the packet first goes to the virtual bridge created by netns, then to the real gateway at my home, then out into the wild internet, and finally to one of the Google servers.

Iptable/NAT

Another change made by netns is to set up the MASQUERADE target, which means all traffic with a source of 172.19.0.0/16 will be MASQUERADEd, or NAT-ed, with the host address, so that the outside world only sees the host IP, not the container IP.
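Setting such a rule up by hand looks roughly like this (a sketch; the exact rule netns installs may differ in details), and IP forwarding must be enabled on the host for the traffic to flow at all:

sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -s 172.19.0.0/16 -j MASQUERADE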

# sudo iptables -t nat --list
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
MASQUERADE all -- 172.19.0.0/16 anywhere

Putting all of this together, it looks like this:

+--------------------------------------------------------------+
| |
| |
| +------------------+ |
| | wlan/eth0 +------+
| | | |
| +---------+--------+ |
| | |
| +-----+----+ |
| +-----+route | |
| | |table | |
| | +----------+ |
| +-------------------------------+----------+ |
| | | |
| | bridge:netns0 | |
| | | |
| +-----+-----------------------+------------+ |
| | interface | interface |
| +-----+-----+ +------+----+ |
| | | |10:netnsv0 | |
| |8:netnsv0- | +-10577@if9 | |
| |8179@if7 | | | |
| +---+-------+ +----+------+ |
| | | |
| | | |
| +-----------------+ +-----------------+ |
| | | | | | | |
| | +---+------+ | | +----+------+ | |
| | | | | | | | | |
| | |7:eth0@if8| | | | 9:eth0@if10 | |
| | | | | | | | | |
| | | | | | | | | |
| | +----------+ | | +-----------+ | |
| | | | | |
| | c8179 | | c10577 | |
| +-----------------+ +-----------------+ |
| |
+--------------------------------------------------------------+

Share network namespace

To join the network namespace of another container, set up the network namespace path pointing to the one you want to join. In our example, we'll join the network namespace of container 8179.

{
- "type": "network"
+ "type": "network",
+ "path": "/proc/8179/ns/net"

Remember to remove the prestart hook, since we don't need to create a new network interface (veth pair and route table) this time.
Start a new container, and we'll find that it has the same eth0 device (as well as the same IP) as container 8179, and the same route table, since they are in the same network namespace.

/ # ifconfig
eth0 Link encap:Ethernet HWaddr 8E:F3:5C:D8:CA:2B
inet addr:172.19.0.2 Bcast:172.19.255.255 Mask:255.255.0.0
inet6 addr: fe80::8cf3:5cff:fed8:ca2b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:22371 errors:0 dropped:0 overruns:0 frame:0
TX packets:241 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2658017 (2.5 MiB) TX bytes:22406 (21.8 KiB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:6 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:498 (498.0 B) TX bytes:498 (498.0 B)

So, despite being in different containers, they share the same network device, route table, port space and all the other network resources. For example, if you start a web service on port 8100 in container 8179, you will be able to access the service from this new container using localhost:8100.
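A quick way to convince yourself, assuming a BusyBox rootfs like the one used here (so the nc applet is available):

# in container 8179: listen on port 8100
/ # nc -l -p 8100

# in the new container sharing the namespace: reach it via localhost
/ # echo hello | nc 127.0.0.1 8100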

Summary

We saw how to use netns as a prestart hook to set up a bridge network for our containers, so that the containers can talk to the internet as well as to each other.

by Bin Chen (noreply@blogger.com) at May 05, 2018 01:10

May 24, 2018

Marcin Juszkiewicz

GDPR?

Generic Data Protected Reduction or something like that. Everyone in the EU (those in the UK too) knows about it due to the amount of spam from all those services/pages you registered on in the past.

I would not bother writing anything about it but we had a discussion (beer was involved) recently in a pub and I decided to blog.

So to make sure you know: there is some data stored in this system. Every time you leave a comment, all the data you wrote is recorded. It can be used to identify the author, so we can agree that those are personal details, right?

If by any chance you want that data removed, then write to me, with the URL of the comment you wrote, from the email address used in that comment. I will remove your email, the link to your website (if present), and replace your name with some random words (like Herman Humpalla, for example).

If I remember correctly there is no other data stored in my system. All statistics done by WordPress are anonymous.

by Marcin Juszkiewicz at May 05, 2018 18:09

May 20, 2018

Naresh Bhat

A dream come true: Himalayan Odyssey - 2016 (Day-6 to 10)

Day-6: Leh

Leh, a high-desert city in the Himalayas, is the capital of the Leh region in northern India’s Jammu and Kashmir state. Originally a stop for trading caravans, Leh is now known for its Buddhist sites and nearby trekking areas. The massive 17th-century Leh Palace, modeled on the Dalai Lama’s former home (Tibet’s Potala Palace), overlooks the old town’s bazaar and maze-like lanes.

Leh city

Apricot seller 

Vegetable seller

Leh is at an altitude of 3,524 metres (11,562 ft), and is connected via National Highway 1 to Srinagar in the southwest and to Manali in the south via the Leh-Manali Highway. In 2010, Leh was heavily damaged by the sudden floods caused by a cloud burst.

Dry fruits shop

Indian spices seller
Leh was an important stopover on trade routes along the Indus Valley between Tibet to the east, Kashmir to the west and also between India and China for centuries. The main goods carried were salt, grain, pashm or cashmere wool, charas or cannabis resin from the Tarim Basin, indigo, silk yarn and Banaras brocade.

Day-7: Leh To Hunder

This was the day we all had been eagerly waiting for: riding to the Hunder (Nubra) valley via the highest motorable road, the "Khardung La" pass. The pass sits at an elevation of 5,602 metres (18,379 ft) in the Ladakh region and is 39.7 km from Leh, which is itself at an altitude of 3,524 metres (11,562 ft). You can imagine the steep uphill journey; from Leh to Khardung La was a painful 3-hour ride up a winding road. Khardung La is the highest motorable pass in the world.

Khardungla top

Highest motorable pass ..Yuppie..reached..:)
Best known as the gateway to the Nubra and Shyok valleys in the Ladakh region of Jammu and Kashmir, the Khardung La Pass, commonly pronounced as Khardzong La, is a very important strategic pass into the Siachen glacier.

The pristine air, the scenic beauty one sees all around and the feeling that you are on top of the world has made Khardung La a very popular tourist attraction in the past few years.

 The first 24 km, as far as the South Pullu check point, are paved. From there to the North Pullu check point about 15 km beyond the pass the roadway is primarily loose rock, dirt, and occasional rivulets of snow melt.

Nubra valley is a beautiful place where you can see sand dunes, water, and green apricot trees. We stayed at Hunder in a tent. After reaching the valley we had hot snacks and went for double-humped camel rides.
Nubra river

Sand dunes @Nubra valley

Nubra is a mix of everything in summer: water, trees, sand dunes, rocks and mountains. But it is completely frozen for 6 months.
We had a campfire and party night.

Party all night..:)
The Siachen glacier water was flowing just beside our tent. The villagers use the flowing water directly. We were just 80 km away from the Siachen glacier.

Tents just beside glacier water flow

You can directly drink glacier water

Karnataka state boys outside Royal Camp..Ready to ride out

Day-8: Hunder To Leh

Hundar is a village in the Leh district of Jammu and Kashmir, India. It is located in the Nubra tehsil, on the bank of Shyok River. The Hunder Monastery is located here. Hundar was once the capital of former Nubra kingdom.

Indian Army check post
You can see the Nubra river flowing in the background in the picture below.

Nubra valley view
Nubra was the last destination of our journey. Now it was time to start the return journey, heading back to Leh via the Khardung La pass. When I was halfway up the Khardung La pass it started snowing. Hands almost frozen, and the roads slippery; could not have asked for more 😊. It was a struggle riding up to the Khardung La pass; because of the low oxygen I noticed a very weak throttle response.

It was fun to ride highest motorable pass in rain and snow
I finally reached the highest motorable road, the Khardung La pass. The snowfall had only increased. Sipping lemon tea gave a good feeling like never before. We took a couple of pictures and started descending. A headache was already hitting back due to high-altitude sickness. At a couple of places we even faced landslides; when snow settles on the mountains, landslides start on their own because of the weight of the snow.

It started raining heavily when we reached south pullu check point.  We took a break and had a lunch. After the rain stopped, we continued our journey and reached Hotel Namgyal Palace in Leh.


Hotel
Day-9: Leh To Debring (Tso Kar)

Today we rode back towards Debring, which is near Moreplanes. We stayed in a camp near a salt lake called Tso Kar. We were also about to touch the world's second highest pass, "Tanglang La". The high altitude, sub-zero temperatures and cold wind are pretty common, and one needs all one's physical and mental strength to withstand them and ride along.

We had our first break and regroup point at a place called Rumtse, a small village even by Ladakh standards. Rumtse is the first human settlement on the way from Lahaul to Ladakh after the Taglang pass. It is located 70 km east of Leh and is the starting point for the trek to Tso Moriri. Rumtse lies in the Rupshu valley, sandwiched between Tibet, Zanskar and Ladakh.
Tea break
The Tanglang La pass is located in the Zanskar range, at the northernmost tip of India, and is famed as the second highest mountain pass in the Leh-Ladakh region. It sits at an altitude of around 17,000 ft, on the Manali-Leh highway. With such an altitude, the Tanglang La pass is like the gateway to Leh.

The pass provides for a scenic view as it sways away from the main highway. Ample vegetation on both sides further cools the already chilled air and at times, the sharp bends provide just the adrenaline push adventurists crave.

Second highest motorable pass

Second highest pass
 After reaching Moreplanes we had a group photo session.

Ready for group photo

60+ riders lined up for group photo at Moreplanes
Next we continued riding towards the Tso Kar camp site. There were no roads; it is a very flat area, full of dust and small stones. After approximately 15 km we reached the camp. We had evening snacks and tea, and rested at Tso Kar for the night.

Tsokar camp site
It was a nightmare because of the sub-zero temperature and cold, windy weather. In the early morning we could not bear to touch the cold water to brush and bathe. There was no hot water available, since we were camped in the middle of nowhere. You can see nothing but a plain for miles and miles.

Day-10: Debring (Tso Kar) To Keylong

The distance between Tso Kar and Keylong is around 236 km, but the time taken to cover this distance is around 7+ hours. The road conditions are very bad, hence we just needed to focus on the road and try to cover more distance with fewer breaks. I stopped only at Moreplanes and took some pics.

A view from Moreplanes

Dusty and tested thoroughly..:)
We reached the hotel at Keylong by 5PM. The weather was very chilly and the location beautiful. I visited the local city market and purchased items like a winter cap and gloves. The local market is very small and the roads are narrow.

Motorcycles lined up outside Keylong hotel for check-up

Waiting for my turn
There was a fantastic view from our room balcony. We also completed a round of motorcycle check-ups, because the next day's ride would be very challenging, with more water crossings...:)

To be continued.....:)

by Naresh (noreply@blogger.com) at May 05, 2018 15:56

Bin Chen

Understand Container 5: User and Root

Users and permissions are the oldest and most basic security mechanism in Linux. Briefly, here is how it works: 1) the system has a number of users and groups; 2) every file belongs to an owner and a group; 3) every process belongs to a user and one or more groups; 4) lastly, to link 1, 2 and 3 together, every file has a mode setting that defines the permissions for three types of processes: owner, group and other. Note that the kernel knows and cares only about uids and gids, not user names and group names.
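You can see this directly: ls -l resolves the ids through /etc/passwd and /etc/group, while ls -n shows the numeric ids the kernel actually stores (illustrative output from a Debian-ish host; the exact numbers will differ):

$ ls -l /etc/shadow
-rw-r----- 1 root shadow 1038 May 10 11:20 /etc/shadow
$ ls -n /etc/shadow
-rw-r----- 1 0 42 1038 May 10 11:20 /etc/shadow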

Specify the uid of a container process

The user property can be used to specify which user the process will run as. It is optional, and by default it is 0, or root (note that root is required to run runc itself).
That means you can delete the following section from the default config.json and still be able to start the container.

diff --git a/config.json b/config.json
- "user": {
- "uid": 0,
- "gid": 0
- },

Start the container and list the user:

$ sudo runc run xyxy12
/ # id
uid=0(root) gid=0(root)

On host,

binchen@m:~/container/runc$ sudo runc ps xyxy12
UID PID PPID C STIME TTY TIME CMD
root 27544 27535 0 12:05 pts/0 00:00:00 sh

As seen, it is running as root.
Running a container process as root is worrisome. Fortunately, by default, the container process, even when run as root, has other constraints (such as capabilities) in place, so it is usually less powerful than root on the host, which by default has more capabilities assigned.
But still, it is more secure to run the process as a non-privileged normal user, and you can do so by specifying a non-zero uid/gid.
Let's change the uid/gid of the user config to 1000 and start the container.
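That is, restore the section we deleted earlier, this time with non-zero ids (a sketch of the config.json fragment):

"user": {
    "uid": 1000,
    "gid": 1000
},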

/ $ id
uid=1000 gid=1000

It doesn't show a username, since there isn't a matching one in the container, but from the host side (see the UID column):

binchen@m:~/container/runc$ sudo runc ps xyxy12
UID PID PPID C STIME TTY TIME CMD
binchen 24904 24895 0 11:44 pts/0 00:00:00 sh

By default, creating a container won't create a new user namespace, and the uid you see in the container and on the host refer to the same user - i.e. they share the same user namespace, to say it in a fancy way.

User namespace and UID/GID mapping

Let's see what happens when using a user namespace.
Here are the user namespaces before starting a container with user namespace support:

$ sudo cinf | grep user
4026531837 user 297 0,1,7,101,102,106,107,109,111,113,116,121,125,126,127,1000,65534 /sbin/init
4026532254 user 1 1000 /opt/google/chrome/n
4026532423 user 25 1000 /opt/google/chrome/c

Make the following changes to enable the user namespace:

$ git diff
diff --git a/config.json b/config.json
index 25a3154..466eae8 100644
--- a/config.json
+++ b/config.json
@@ -155,6 +155,23 @@
},
{
"type": "mount"
+ },
+ {
+ "type": "user"
+ }
+ ],
+ "uidMappings": [
+ {
+ "containerID": 0,
+ "hostID": 1000,
+ "size": 32000
+ }
+ ],
+ "gidMappings": [
+ {
+ "containerID": 0,
+ "hostID": 1000,
+ "size": 32000
}

It is an error to enable the user namespace without a uid/gid mapping; similarly, a uid/gid mapping is useless and will be ignored if the user namespace isn't enabled, which is effectively a wrong configuration as well.
Start a container with the new config and list the user namespaces in the system:

$ sudo cinf | grep user
4026532423 user 25 1000 /opt/google/chrome/c
4026532254 user 1 1000 /opt/google/chrome/n
4026532450 user 1 1000 sh
4026531837 user 297 0,1,7,101,102,106,107,109,111,113,116,121,125,126,127,1000,65534 /sbin/init

We can see that we have a new user namespace (4026532450) and our new container process (sh) is running inside of it.
Inside the container, it is running as uid/gid 0, and is considered to be root.

/ # id
uid=0(root) gid=0(root)

However, from the outside, the process is indeed considered to be running as binchen, which is 1000.

binchen@m:~/container/runc$ sudo runc ps xyxy12
UID PID PPID C STIME TTY TIME CMD
binchen 4356 4347 0 11:18 pts/0 00:00:00 sh

That's the user namespace and uid/gid mapping in play: uid 0 inside the container is 1000 on the host, a constant offset as specified in the mapping. The mapping can also be seen on the host by checking proc:

binchen@m:~/container/runc$ cat /proc/4356/uid_map
0 1000 32000

Exercise

Let's do an exercise to verify that uid 0 inside the container is actually 1000 on the host, and that ultimately it is 1000 that is checked by the kernel.
Inside the rootfs, but on the host, create two directories, bindir and rootdir, owned by the current user (uid 1000) and root respectively, and each accessible only by its owner.
Type the following commands:

mkdir bindir
chmod 700 bindir

mkdir rootdir
sudo chgrp 0 rootdir
sudo chown 0 rootdir
sudo chmod 700 rootdir

Here is what it should look like:

drwx------ 2 binchen binchen 4096 May 10 11:27 bindir/
drwx------ 2 root root 4096 May 10 11:27 rootdir/

On the host, test the ownership and permissions. The expectation is that the current user (binchen) is able to enter bindir but not rootdir; after you switch to root, you can access not only rootdir (since root owns that dir) but also bindir (because it is root!).
To make the exercise more convincing, let's change the uid/gid offset to 2000, so that the container's root maps to nobody we know on the host. We expect that, inside the container, root can access neither of the directories, since root in the container is really uid 2000, and the kernel won't allow uid 2000 to access either of them.
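Relative to the earlier diff, only the hostID changes (a sketch):

"uidMappings": [
    {
        "containerID": 0,
        "hostID": 2000,
        "size": 32000
    }
],
"gidMappings": [
    {
        "containerID": 0,
        "hostID": 2000,
        "size": 32000
    }
]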
Start the container:

binchen@m:~/container/runc$ sudo runc run xyxy12
/ # id
uid=0(root) gid=0(root)
/ # ls -l
drwx------ 2 nobody nogroup 4096 May 10 01:27 bindir
drwx------ 2 nobody nogroup 4096 May 10 01:27 rootdir
/ # cd bindir/
sh: cd: can't cd to bindir/: Permission denied
/ # cd ..
/ # cd rootdir/
sh: cd: can't cd to rootdir/: Permission denied

This is actually a great time to mention that you always have to make sure the rootfs (or runc runtime bundle) has permission settings that match the uid/gid mapping you want to use. The runtime won't modify file system ownership to realize the mapping.

Benefit

What's the benefit of using user namespace?
  1. A user namespace is useful when the process requires root to run but you don't want to give it real root power. (Otherwise, just using a non-zero user id is fine.)
  2. When there are multiple users (for different processes) inside a single container, putting them in different user namespaces allows you to monitor and control multiple instances of the same container.

Summary

Don't run your container process as the root user; if you have to, put it into a separate user namespace.

by Bin Chen (noreply@blogger.com) at May 05, 2018 04:12

May 11, 2018

Leif Lindholm

Turn the page

On a long and lonesome highway... Err, nevermind.

Anyway, after nearly 12 and a half years, Friday 11 May 2018 will be my last day at ARM. I'm not going very far - after a short break I will be joining Linaro as a full time employee on 21 May.

I will keep my roles in LEG and TianoCore.

I joined ARM back in December 2005 to work in Support (cough, sorry, "applications engineering") for the Embedded Software series of products - which mainly meant TrustZone Software and a little bit of the software components required to make use of the Jazelle DBX (Direct Bytecode eXecution or Dogs BolloX, depending on context) extensions.

As is traditional, the job quickly turned into something quite different, and I spent the next few years supporting development boards and writing and delivering ARM software training. Both with a particular focus on multicore, following the release of the ARM11MPCore and Cortex-A9. I also spent a while in the compilation tools support team. It's impossible to overstate what an amazing time this was for learning. New things. All the time. Solving real problems for real people.

Then followed a short period (9 months) in the TechPubs group, where I worked on standalone documentation to help fill the gaps between the architecture specification and what a programmer is trying to find out. But at this point I had somewhat recovered from my startup years and was itching to get back to development.

I found a role advertised looking for someone to work on multicore software enablement. This sounded like fun, and I ended up getting the job. That was the last time I changed roles in ARM, but (as is traditional) the role itself kept changing. After a period including SWP emulation, Open MPI and Android, I ended up first being and then leading the original ARM server software project. Meanwhile Linaro was created, and after identifying that the IP paranoia overhead of running the server software project in-house was prohibitive, I first started working unofficially with the not-yet-announced Linaro Enterprise Group from around Q2 2012, and then became a full-time assignee into LEG from 1 January 2013.

I will look back at my time at ARM with fondness, and am making this move because I believe it will actually enable me to be more useful to the ARM ecosystem.

So long, and thanks for all the chips.

by Leif Lindholm at May 05, 2018 10:49

May 09, 2018

Marcin Juszkiewicz

Android at Google I/O: what’s the point?

Another year, another Google I/O. Another set of articles with “what’s new in xyz Google product”. Maps, Photos, AI, this, that. And then all those Android P features which nearly no one will see on their phones (tablets look like dead part of market already).

I have a feeling that this part is more or less useless with current state of Android. Latest release is Oreo. On 5.7% of devices. Which sounds like “feel free to ignore” value. Every 4th device runs 3 years old version (and usually lacks two years of security updates). Every 3rd one has 2 years old Nougat one.

How many users will remember what's new in their phones when Android P lands on their devices? Probably only a small group of crazy geeks. Some features will get renamed by device vendors. Others will be removed. Or changed (not always in a positive way). Reviewers will write “OMG that feature added by VENDORNAME is so awesome” as no one will remember that it is part of the base system.

In other words: I stopped caring what is happening in the Android space. With the most popular version being a few years old I do not see a point in tracking new features. Who would use them in their apps when you have to care about running on four-year-old Android?

by Marcin Juszkiewicz at May 05, 2018 07:49

May 03, 2018

Neil Williams

Upgrading the home server rack

My original home server rack is being upgraded to use more ARM machines as the infrastructure of the lab itself. I've also moved house, so there is more room for stuff and kit. This has allowed space for a genuine machine room. I will be using that to host test devices which do not need manual intervention despite repeated testing. (I'll also have the more noisy / brightly illuminated devices in the machine room.) The more complex devices will sit on shelves in the office upstairs. (The work to put the office upstairs was a major undertaking involving my friends Steve and Andy - embedding ethernet cables into the walls of four rooms in the new house. Once that was done, the existing ethernet cable into the kitchen could be fixed (Steve) and then connected to my new Ubiquiti AP (a present from Steve and Andy).)

Before I moved house, I found that the wall-mounted 9U communications rack was too confined once there were a few devices in use. A lot of test devices now need many cables to each device. (Power, ethernet, serial, second serial and USB OTG - and then add a relay board with its own power and cables onto the DUT....)

Devices like beaglebone-black, cubietruck and other U-Boot devices will go downstairs, albeit in a larger Dell 24U rack purchased from Vince who has moved to a larger rack in his garage. Vince also had a gigabit 16 port switch available which will replace the Netgear GS108 8-port Gigabit Ethernet Unmanaged Switch downstairs.

I am currently still using the same microserver to run various other services around the house (firewall, file server etc.): HP 704941-421 ProLiant Micro Server

I've now repurposed a reconditioned Dell Compact Form Factor desktop box to be my main desktop machine in my office. This was formerly my main development dispatcher and the desktop box was chosen explicitly to get more USB host controllers on the motherboard than is typically available with an x86 server. There have been concerns that this could be causing bottlenecks when running multiple test jobs which all try to transfer several hundred megabytes of files over USB-OTG at the same time.

I've now added a SynQuacer Edge ARM64 Server to run a LAVA dispatcher in the office, controlling several of the more complex devices to test in LAVA - Hikey 620, HiKey 960 and Dragonboard 410c via a Cambrionix PP15s to provide switchable USB support to enable USB network dongles attached to the USB OTG port which is also used for file deployment during test jobs. There have been no signs of USB bottlenecks at this stage.

This arm64 machine then supports running test jobs on the development server used by the LAVA software team as azrael.codehelp. It runs headless from the supplied desktop tower case. I needed to use a PCIe network card from TPlink to get the device operating but this limitation should be fixed with new firmware. (I haven't had time to upgrade the firmware on that machine yet, still got the rest of the office to kit out and the rack to build.) The development server itself is an ARM64 virtual machine, provided by the Linaro developer cloud and is used with a range of other machines to test the LAVA codebase, doing functional testing.

The new dispatcher is working fine, I've not had any issues with running test jobs on some of the most complex devices used in LAVA. I haven't needed to extend the RAM from the initial 4G and the 24 cores are sufficient for the work I've done using the machine so far.

The rack was moved into place yesterday (thanks to Vince & Steve) but the patch panel which Andy carefully wired up is not yet installed and there are cables everywhere, so a photo will have to wait. The plan now is to purchase new UPS batteries and put each of the rack, the office and the ISP modem onto a dedicated UPS. The objective is not to keep the lab running in the event of a complete power cut lasting hours, just to survive brownouts and power cuts lasting a minute or two, e.g. when I finally get around to labelling up the RCD downstairs. (The new house was extended a few years before I bought it and the organisation of the circuits is a little unexpected in some parts of the house.)

Once the UPS batteries are in, the microserver, a PDU, the network switch and patch panel, as well as the test devices, will go into the rack in the machine room. I've recently arranged to add a second SynQuacer server into the rack - this time fitted into a 1U server case. (Definite advantage of the new full depth rack over the previous half-depth comms box.) I expect this second SynQuacer to have a range of test devices to complement our existing development staging instance which runs the nightly builds which are available for both amd64 and arm64.

I'll post again once I've got the rest of the rack built and the second SynQuacer installed. The hardest work, by far, has been fitting out the house for the cabling. Setting up the machines, installing and running LAVA has been trivial in comparison. Thanks to Martin Stadler for the two SynQuacer machines and the rest of the team in Linaro Enterprise Group (LEG) for getting this ARM64 hardware into useful roles to support wider development. With the support from Debian for building the arm64 packages, the new machine simply sits on the network and does "TheRightThing" without fuss or intervention. I can concentrate on the test devices and get on with things. The fact that the majority of my infrastructure now runs on ARM64 servers is completely invisible to my development work.

by Neil Williams at May 05, 2018 07:05

April 25, 2018

Peter Maydell

Debian on QEMU’s Raspberry Pi 3 model

For the QEMU 2.12 release we added support for a model of the Raspberry Pi 3 board (thanks to everybody involved in developing and upstreaming that code). The model is sufficient to boot a Debian image, so I wanted to write up how to do that.

Things to know before you start

Before I start, some warnings about the current state of the QEMU emulation of this board:

  • We don’t emulate the boot rom, so QEMU will not automatically boot from an SD card image. You need to manually extract the kernel, initrd and device tree blob from the SD image first. I’ll talk about how to do that below.
  • We don’t have an emulation of the BCM2835 USB controller. This means that there is no networking support, because on the raspi devices the ethernet hangs off the USB controller.
  • Our raspi3 model will only boot AArch64 (64-bit) kernels. If you want to boot a 32-bit kernel you should use the “raspi2” board model.
  • The QEMU model is missing models of some devices, and others are guesswork due to a lack of documentation of the hardware; so although the kernel I tested here will boot, it’s quite possible that other kernels may fail.

You’ll need the following things on your host system:

  • QEMU version 2.12 or better
  • libguestfs (on Debian and Ubuntu, install the libguestfs-tools package)

Getting the image

I’m using the unofficial preview images described on the Debian wiki.

$ wget https://people.debian.org/~stapelberg/raspberrypi3/2018-01-08/2018-01-08-raspberry-pi-3-buster-PREVIEW.img.xz
$ xz -d 2018-01-08-raspberry-pi-3-buster-PREVIEW.img.xz

Extracting the guest boot partition contents

I use libguestfs to extract files from the guest SD card image. There are other ways to do this but I think libguestfs is the easiest to use. First, check that libguestfs is working on your system:

$ virt-filesystems -a 2018-01-08-raspberry-pi-3-buster-PREVIEW.img
/dev/sda1
/dev/sda2

If this doesn’t work, then you should sort that out first. A couple of common reasons I’ve seen:

  • if you’re on Ubuntu then your kernels in /boot are installed not-world-readable; you can fix this with sudo chmod 644 /boot/vmlinuz*
  • if you’re running Virtualbox on the same host it will interfere with libguestfs’s attempt to run KVM; you can fix that by exiting Virtualbox

Now you can ask libguestfs to extract the contents of the boot partition:

$ mkdir bootpart
$ guestfish --ro -a 2018-01-08-raspberry-pi-3-buster-PREVIEW.img -m /dev/sda1

Then at the guestfish prompt type:

copy-out / bootpart/
quit

This should have copied various files into the bootpart/ subdirectory.

Run the guest image

You should now be able to run the guest image:

$ qemu-system-aarch64 \
  -kernel bootpart/vmlinuz-4.14.0-3-arm64 \
  -initrd bootpart/initrd.img-4.14.0-3-arm64 \
  -dtb bootpart/bcm2837-rpi-3-b.dtb \
  -M raspi3 -m 1024 \
  -serial stdio \
  -append "rw earlycon=pl011,0x3f201000 console=ttyAMA0 loglevel=8 root=/dev/mmcblk0p2 fsck.repair=yes net.ifnames=0 rootwait memtest=1" \
  -drive file=2018-01-08-raspberry-pi-3-buster-PREVIEW.img,format=raw,if=sd

and have it boot to a login prompt (the root password for this Debian image is “raspberry”).

There will be several WARNING logs and backtraces printed by the kernel as it starts; these will have a backtrace like this:

[  145.157957] [] uart_get_baud_rate+0xe4/0x188
[  145.158349] [] pl011_set_termios+0x60/0x348
[  145.158733] [] uart_change_speed.isra.3+0x50/0x130
[  145.159147] [] uart_set_termios+0x7c/0x180
[  145.159570] [] tty_set_termios+0x168/0x200
[  145.159976] [] set_termios+0x2b0/0x338
[  145.160647] [] tty_mode_ioctl+0x358/0x590
[  145.161127] [] n_tty_ioctl_helper+0x54/0x168
[  145.161521] [] n_tty_ioctl+0xd4/0x1a0
[  145.161883] [] tty_ioctl+0x150/0xac0
[  145.162255] [] do_vfs_ioctl+0xc4/0x768
[  145.162620] [] SyS_ioctl+0x8c/0xa8

These are ugly but harmless. (The underlying cause is that QEMU doesn’t implement the undocumented ‘cprman’ clock control hardware, and so Linux thinks that the UART is running at a zero baud rate and complains.)

by pm215 at April 04, 2018 08:07

April 23, 2018

Marcin Juszkiewicz

Mass removal of image tags on Docker hub

At Linaro we moved from packaged OpenStack to virtualenv tarballs. Then we packaged those. But as that took us a lot of maintenance time, we switched to Docker container images for OpenStack and whatever it needs to run. And then we added a CI job to our Jenkins to generate hundreds of images per build. So now we have a lot of images with a lot of tags…

Finding out which tags are the latest is quite easy — you just have to go to the Docker hub page of the linaro/debian-source-base image and switch to the tags view. But how do you know which build is complete? We had some builds where all images except one got built and pushed. And the missing one was the first in deployment… So the whole set was b0rken.

How to remove those tags? One solution is to log in to the Docker hub website and go image by image, clicking every tag to be removed. No one is insane enough to suggest that. And we do not have the credentials to do it anyway.

So let’s handle it as we do things in the SDI team: by automation. Docker has an API, so its hub should have one too, right? Hmm…

I went through some pages, then issues, bug reports, random projects. Saw code in JavaScript, Ruby and Bash, but nothing usable in Python. Some of the projects assume that no one has more than one hundred images (no paging when getting the list of images) and limit themselves to a few queries.

Started reading docs and some code. Learnt that GET/POST are not the only HTTP methods; there is also DELETE, which was exactly what I needed. Sorted out authentication and paths, and something started to work.

The first version was simple: log in and remove a tag from an image. Then I added querying for the whole list of images (with proper paging) and looping through the list, removing the requested tags from the requested images:

15:53 (s) hrw@gossamer:docker$ ./delimage.py haerwu debian-source 5.0.0
haerwu/debian-source-memcached:5.0.0 removed
haerwu/debian-source-glance-api:5.0.0 removed
haerwu/debian-source-nova-api:5.0.0 removed
haerwu/debian-source-rabbitmq:5.0.0 removed
haerwu/debian-source-nova-consoleauth:5.0.0 removed
haerwu/debian-source-nova-placement-api:5.0.0 removed
haerwu/debian-source-glance-registry:5.0.0 removed
haerwu/debian-source-nova-compute:5.0.0 removed
haerwu/debian-source-keystone:5.0.0 removed
haerwu/debian-source-horizon:5.0.0 removed
haerwu/debian-source-neutron-dhcp-agent:5.0.0 removed
haerwu/debian-source-openvswitch-db-server:5.0.0 removed
haerwu/debian-source-neutron-metadata-agent:5.0.0 removed
haerwu/debian-source-heat-api:5.0.0 removed
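
The core of the script is just two HTTP calls against the hub API. In shell, the equivalent is roughly this (a sketch against the hub.docker.com v2 endpoints as I understand them; username, password, image and tag are placeholders):

# log in to get a JWT token
TOKEN=$(curl -s -X POST -H "Content-Type: application/json" \
    -d '{"username": "haerwu", "password": "secret"}' \
    https://hub.docker.com/v2/users/login/ | \
    python -c 'import json, sys; print(json.load(sys.stdin)["token"])')

# remove one tag from one image
curl -s -X DELETE -H "Authorization: JWT $TOKEN" \
    https://hub.docker.com/v2/repositories/haerwu/debian-source-memcached/tags/5.0.0/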

The final version got the MIT license as usual; I created a git repo for it and pushed the code. Next step? Probably creating a job on Linaro CI to have a way of removing no-longer-supported builds. And some more helper scripts.

by Marcin Juszkiewicz at April 04, 2018 17:14

April 10, 2018

Marcin Juszkiewicz

XGene1: cursed processor?

Years ago Applied Micro (APM) released the XGene processor. It went into the APM BlackBird, APM Mustang, HPE M400 and several other systems. For some time there was no other AArch64 CPU available on the market, so those machines got popular as distribution builders, developer machines etc…

Then APM got acquired by someone, the CPU part got bought by someone else, and any support just vanished. Their developers moved on to work on the XGene2/XGene3 CPUs (APM Merlin etc. systems). And people woke up with unsupported hardware.

For some time it was not an issue – Linux boots, system works. Some companies got rid of their XGene systems by sending them to Linaro lab, some moved them to ‘internal use only, no external support’ queue etc.

Each mainline kernel release was “let us check what is broken on XGene this time” time. No serial console output again? Ok, we have that ugly patch for it (got cleaned and upstreamed). Now we have kernel 4.16 and guess what? Yes, it broke. Turned out that 4.15 was already faulty (we skipped it at Linaro).

Red Hat's bugzilla has a Fedora bug for it. It turns out that the firmware has wrong ACPI tables. Nothing new, right? We already knew that it lacks PPTT, for example (but that is quite a new table, for processor topology). This time the bug is in the DSDT.

Sounds familiar? If you had an x86 laptop about 10 years ago, then it might. DSDT stands for Differentiated System Description Table. It is a major ACPI table used to describe what peripherals the machine has. And the serial ports are described wrongly there, so the kernel ignores them.

One solution is bundling a fixed DSDT into the kernel/initrd, but that would require adding support for it to Debian, and it would probably not get merged, as no one needs that nowadays (unless they have an XGene1).
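For reference, the mechanism itself is simple: a kernel built with CONFIG_ACPI_TABLE_UPGRADE will pick up replacement ACPI tables from an uncompressed cpio archive prepended to the initramfs. A sketch, assuming you have a fixed DSDT.aml (paths and kernel version are illustrative):

mkdir -p kernel/firmware/acpi
cp DSDT.aml kernel/firmware/acpi/
find kernel | cpio -H newc --create > dsdt.cpio
cat dsdt.cpio /boot/initrd.img-4.16.0 > /boot/initrd.img-4.16.0.dsdt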

So far I have decided to stay on 4.14 for my development cartridges. It works and allows me to continue my Nova work. I do not plan to move to another platform, as at Linaro we probably have over a hundred XGene1 systems (M400s and Mustangs) which will stay there for development (it is hard to replace a 4.3U case with 45 cartridges with something else).

by Marcin Juszkiewicz at April 04, 2018 09:35

April 07, 2018

Alex Bennée

Working with dired

I’ve been making a lot more use of dired recently. One use case is copying files from my remote server to my home machine. Doing this directly from dired, even with the power of tramp, is a little too time consuming and potentially locks up your session for large files. While browsing reddit r/emacs I found a reference to this post that spurred me to look at spawning rsync from dired some more.

Unfortunately the solution is currently sitting in a pull-request to what looks like an orphaned package. I also ran into some other problems with the handling of where rsync needs to be run from so rather than unpicking some unfamiliar code I decided to re-implement everything in my own package.

I’ve still got some debugging to do to get it to cleanly handle multiple sessions as well as a more detailed mode-line status. Once I’m happy I’ll tag a 0.1 and get it submitted to MELPA.

While getting more familiar with dired I also came up with this little helper:

(defun my-dired-frame (directory)
  "Open up a dired frame which closes on exit."
  (interactive)
  (switch-to-buffer (dired directory))
  (local-set-key
   (kbd "C-x C-c")
   (lambda ()
     (interactive)
     (kill-this-buffer)
     (save-buffers-kill-terminal 't))))

Which is paired with a simple alias in my shell setup:

alias dired="emacsclient -a '' -t -e '(my-dired-frame default-directory)'"

This works really nicely for popping up a dired frame in your terminal window and cleaning itself up when you exit.

by Alex at April 04, 2018 10:12

April 02, 2018

Gema Gomez

What to make next?

One of the most complicated parts of the fiber crafts, and a part that normally takes at least a couple of weeks for me, is the planning phase. As soon as you are done with a project, you try to figure out what you want to do next. The first step is to decide what I feel inspired to make:

  • Quick project
  • Long and intrincate project
  • Use existing yarn project
  • Use existing pattern project
  • Learn a new skill only project
  • Garment or accessory project
  • Something I have done before or something new
  • Who will be the owner? Is it for me? Someone in my family? Friends? A special occasion?

In my case, it depends on the time of the year, the plans I have for the coming months, whether I have stumbled upon something super cool that I could make for someone and how much spare time I have over the coming months.

The first thing I decided is that I want to use this gorgeous variegated yarn I bought a few months back:

Yarn

I only have one skein, it is 100% merino, Unic from Bergere. The weight of it is DK, but it comes on 4ply untangled fibre, so it will be like working with 4 strands of fingering yarn at once. I have 660m of material (200g).

With this amount of yarn I cannot really make an adult size garment, but I could make a rather gorgeous complement, either cowl, infinity scarf or a shawl. I could also make a garment for a child or a baby. The changing color of the fibre also makes for a nice color effect if I were to find the right pattern for it.

Q&A

Knitting or crochet?

Either one would work for me this time around.

What are you making? For whom?

Something easy and quick that showcases the yarn's colors. Probably a cowl/shawl/infinity scarf for myself. Not in the mood for learning a new skill, so a pattern with some known techniques will have to do.

Which patterns are worth considering? Are there any nice examples out there of projects made with this yarn?

I looked at the patterns showcased by the manufacturer of the yarn, but none of them were really my cup of tea. Kept searching until I found a book of shawls that has patterns specific for variegated yarn like this one. I bought the book yesterday and I am trying to decide which one to make, it is called The Shawl Project: Book Four, by The Crochet Project.

Now the only question left is to figure out which of the projects in the book I like best and get crocheting. Will post a picture of the project when it is finished!

by Gema Gomez at April 04, 2018 23:00

April 01, 2018

Gema Gomez

Olca Cowl

As part of my yarn shopping spree in San Francisco last October, I bought some Berroco Mykonos (66% linen, 26% nylon, 8% cotton), color hera (8570). I decided to make a crocheted Olca Cowl with it, it required 2 x 50g hanks (260 m):

Olca cowl finished

The pattern was followed verbatim, I used a 3.75mm (F) hook as per pattern description:

hook and yarn

This was a quick and fun pattern to work; I managed to finish it in about a month of spare time. I recommend it for any advanced crochet beginner. Once the first three rows are worked, the rest is mechanical and quick to grow.

by Gema Gomez at April 04, 2018 23:00

March 30, 2018

Naresh Bhat

Benchmarking BigData


Purpose:

The purpose of this blog is to explain the different types of benchmark tools available for BigData components. We gave a talk on BigData benchmarking at Linaro Connect @ Las Vegas in 2016. This is my effort to pull it all together in one place, with more information.

We have to remember that all the BigData components/benchmarks were developed
  • with the x86 architecture in mind.
    • So in the first place we should make sure that all the relevant benchmark tools compile and run on AArch64.
    • Then we should go ahead and try to optimize them for AArch64.
Different types of benchmarks and standards
  • Micro benchmarks: To evaluate specific lower-level, system operations
    • E.g. HiBench, HDFS DFSIO, AMP Lab Big Data Benchmark, CALDA, Hadoop Workload Examples (sort, grep, wordcount and Terasort, Gridmix, Pigmix)
  • Functional/Component benchmarks: Specific to low level function
    • E.g. Basic SQL: Individual SQL operations like select, project, join, Order-by..
  • Application level
    • Bigbench
    • Spark bench
The tables below summarize the different benchmark efforts.
Benchmark Efforts - Microbenchmarks

Benchmark        | Workloads                                                                  | Software Stacks | Metrics
DFSIO            | Generate, read, write, append, and remove data for MapReduce jobs         | Hadoop          | Execution Time, Throughput
HiBench          | Sort, WordCount, TeraSort, PageRank, K-means, Bayes classification, Index | Hadoop and Hive | Execution Time, Throughput, resource utilization
AMPLab benchmark | Part of CALDA workloads (scan, aggregate and join) and PageRank           | Hive, Tez       | Execution Time
CALDA            | Load, scan, select, aggregate and join data, count URL links              | Hadoop, Hive    | Execution Time

Benchmark Efforts - TPC

Benchmark | Workloads                                                          | Software Stacks | Metrics
TPCx-HS   | HSGen, HSData, Check, HSSort and HSValidate                        | Hadoop          | Performance, price and energy
TPC-H     | Datawarehousing operations                                         | Hive, Pig       | Execution Time, Throughput
TPC-DS    | Decision support benchmark: data loading, queries and maintenance  | Hive, Pig       | Execution Time, Throughput

Benchmark Efforts - Synthetic

Benchmark  | Workloads                                                                                                          | Software Stacks | Metrics
SWIM       | Synthetic user-generated MapReduce jobs of reading, writing, shuffling and sorting                                | Hadoop          | Multiple metrics
GridMix    | Synthetic and basic operations to stress test the job scheduler and compression/decompression                     | Hadoop          | Memory, Execution Time, Throughput
PigMix     | 17 Pig-specific queries                                                                                            | Hadoop, Pig     | Execution Time
MRBench    | MapReduce benchmark as a complement to TeraSort - datawarehouse operations with 22 TPC-H queries                   | Hadoop          | Execution Time
NNBench    | Load testing the namenode and HDFS I/O with small payloads                                                         | Hadoop          | I/O
SparkBench | CPU, memory, shuffle and IO intensive workloads; Machine Learning, Streaming, Graph Computation and SQL workloads  | Spark           | Execution Time, Data process rate
BigBench   | Interactive-based queries based on synthetic data                                                                  | Hadoop, Spark   | Execution Time

Benchmark Efforts - BigDataBench

Benchmark    | Workloads | Software Stacks | Metrics
BigDataBench | 1. Micro benchmarks (sort, grep, WordCount); 2. Search engine workloads (index, PageRank); 3. Social network workloads (connected components (CC), K-means and BFS); 4. E-commerce site workloads (relational database queries (select, aggregate and join), collaborative filtering (CF) and Naive Bayes); 5. Multimedia analytics workloads (Speech Recognition, Ray Tracing, Image Segmentation, Face Detection); 6. Bioinformatics workloads | Hadoop, DBMSs, NoSQL systems, Hive, Impala, Hbase, MPI, Libc, and other real-time analytics systems | Throughput, Memory, CPU (MIPS, MPKI - misses per kilo-instructions)

Let's go through each of the benchmarks in detail.

Hadoop benchmark and test tool:

The hadoop source comes with a number of benchmarks. TestDFSIO, nnbench and mrbench are in the hadoop-*test*.jar file, and TeraGen, TeraSort and TeraValidate are in the hadoop-*examples*.jar file in the hadoop source code.

You can check them using the commands

       $ cd /usr/local/hadoop
       $ bin/hadoop jar hadoop-*test*.jar
       $ bin/hadoop jar hadoop-*examples*.jar

While running the benchmarks you might want to use the time command, which measures the elapsed time. This saves you the hassle of navigating to the hadoop JobTracker interface. The relevant metric is the real value in the first row.

      $ time hadoop jar hadoop-*examples*.jar ...
      [...]
      real    9m15.510s
      user    0m7.075s
      sys     0m0.584s

TeraGen, TeraSort and TeraValidate

This is the most well-known Hadoop benchmark. The goal of TeraSort is to sort the data as fast as possible. The test suite combines the HDFS and MapReduce layers of a hadoop cluster. The TeraSort benchmark consists of 3 steps: generate input via TeraGen, run TeraSort on the input data, and validate the sorted output data via TeraValidate. We have a wiki page which explains this test suite; you can refer to the Hadoop Build Install And Run Guide.
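A typical run of the three steps looks like this (the row count and paths are illustrative; teragen takes the number of 100-byte rows to generate):

      $ hadoop jar hadoop-*examples*.jar teragen 10000000 /benchmarks/terasort-input
      $ hadoop jar hadoop-*examples*.jar terasort /benchmarks/terasort-input /benchmarks/terasort-output
      $ hadoop jar hadoop-*examples*.jar teravalidate /benchmarks/terasort-output /benchmarks/terasort-validate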

TestDFSIO

It is part of the hadoop-mapreduce-client-jobclient.jar file. It stress-tests I/O performance (throughput and latency) on a clustered setup. This test will shake out the hardware, OS and Hadoop setup of your cluster machines (NameNode/DataNode). The tests are run as a MapReduce job using a 1:1 mapping (1 map per file), and are helpful to discover performance bottlenecks in your network. The write test should be followed by the read test. Use the -write switch for write tests and -read for read tests. The results are stored by default in TestDFSIO_results.log; use the -resFile switch to choose a different file name.
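
A typical write-then-read cycle looks like the sketch below. The -nrFiles and -fileSize values are placeholder assumptions; size them so the total data exceeds your cluster's RAM if you want to measure disks rather than the page cache.

    import subprocess

    def dfsio(args):
        subprocess.check_call("hadoop jar hadoop-*test*.jar TestDFSIO " + args, shell=True)

    dfsio("-write -nrFiles 10 -fileSize 1000")  # write test first: 10 files x 1000 MB
    dfsio("-read -nrFiles 10 -fileSize 1000")   # then the matching read test
    dfsio("-clean")                             # remove the test data when done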

MRBench: MR (MapReduce) Benchmark

The test loops a small job a number of times and checks whether small job runs are responsive and running efficiently on your cluster. It puts the focus on the MapReduce layer, as its impact on the HDFS layer is very limited. The multiple parallel MRBench issue is resolved, hence you can run it from different boxes.

Test command to run 50 small test jobs
      $ hadoop jar hadoop-*test*.jar mrbench -numRuns 50

Exemplary output, which means the job finished in about 31 seconds:
      DataLines       Maps    Reduces AvgTime (milliseconds)
      1               2       1       31414

NN (Name Node) Benchmark for HDFS

This test is useful for load testing the NameNode hardware & configuration. It generates a lot of HDFS-related requests with normally very small payloads, putting high HDFS management stress on the NameNode. The test can be run simultaneously from several machines, e.g. from a set of DataNode boxes, in order to hit the NameNode from multiple locations at the same time.
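
A hedged example of a create_write run is sketched below; the flag set follows commonly documented NNBench usage, but check the options your Hadoop version actually supports before relying on them.

    import subprocess

    # Assumed NNBench flags; invoke `hadoop jar hadoop-*test*.jar nnbench`
    # with no arguments to see the options in your Hadoop version.
    cmd = ("hadoop jar hadoop-*test*.jar nnbench "
           "-operation create_write -maps 12 -reduces 6 "
           "-blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 "
           "-replicationFactorPerFile 3 -readFileAfterOpen true "
           "-baseDir /benchmarks/NNBench")
    subprocess.check_call(cmd, shell=True)  # run from several boxes to multiply the load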


The TPC is a non-profit, vendor-neutral organization with a reputation for providing the most credible performance results to the industry; it plays the role of a “consumer reports” body for the computing industry. It offers a solid foundation for complete system-level performance, a methodology for calculating total-system price and price-performance, and a methodology for measuring the energy efficiency of complete systems.

TPC Benchmark 
  • TPCx-HS
We have a collaboration page for TPCx-HS (the X: Express, H: Hadoop, S: Sort). The TPCx-HS kit contains the TPCx-HS specification documentation, the TPCx-HS User's Guide documentation, scripts to run the benchmark and Java code to execute the benchmark load. A valid run consists of 5 separate phases run sequentially with no overlap in their execution. The benchmark test consists of 2 runs (the runs with the lower and the higher TPCx-HS Performance Metric). No configuration or tuning changes or reboots are allowed between the two runs.

The TPC Express benchmark standard is easy to implement, run and publish, and less expensive. The test sponsor is required to use the TPCx-HS kit as provided. The vendor may choose an independent audit or a peer audit, to which a 60-day review/challenge window applies (as per TPC policy), approved by a super majority of the TPC General Council. All publications must follow the TPC Fair Use Policy.
  • TPC-H
    • TPC-H benchmark focuses on ad-hoc queries
The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions. The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries. These aspects include the selected database size against which the queries are executed, the query processing power when queries are submitted by a single stream, and the query throughput when queries are submitted by multiple concurrent users. The TPC-H Price/Performance metric is expressed as $/QphH@Size.
  • TPC-DS
    • This is the standard benchmark for decision support
The TPC Benchmark DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general purpose decision support system. A benchmark result measures query response time in single user mode, query throughput in multi user mode and data maintenance performance for a given hardware, operating system, and data processing system configuration under a controlled, complex, multi-user decision support workload. The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users. TPC-DS Version 2 enables emerging technologies, such as Big Data systems, to execute the benchmark.
  • TPC-C
    • TPC-C is an On-Line Transaction Processing Benchmark

Approved in July of 1992, TPC Benchmark C is an on-line transaction processing (OLTP) benchmark. TPC-C is more complex than previous OLTP benchmarks such as TPC-A because of its multiple transaction types, more complex database and overall execution structure. TPC-C involves a mix of five concurrent transactions of different types and complexity either executed on-line or queued for deferred execution. The database is comprised of nine types of tables with a wide range of record and population sizes. TPC-C is measured in transactions per minute (tpmC). While the benchmark portrays the activity of a wholesale supplier, TPC-C is not limited to the activity of any particular business segment, but, rather represents any industry that must manage, sell, or distribute a product or service.

TPC vs SPEC models

Here is our comparison between the TPC and SPEC benchmark models:

TPC model                                    SPEC model
Specification based                          Kit based
Performance, price, energy in one benchmark  Performance and energy in separate benchmarks
End-to-end                                   Server centric
Multiple tests (ACID, Load)                  Single test
Independent review, full disclosure          Summary disclosure
TPC Technology Conference                    SPEC Research Group, ICPE (International Conference on Performance Engineering)



BigBench is a joint effort with partners in industry and academia to create a comprehensive and standardized big data benchmark. One reference reading about BigBench is “Toward An Industry Standard Benchmark for Big Data Analytics”. BigBench builds upon and borrows elements from existing benchmarking efforts (such as TPCx-HS, GridMix, PigMix, HiBench, Big Data Benchmark, YCSB and TPC-DS). BigBench is a specification-based benchmark with an open-source reference implementation kit; as a specification-based benchmark it is technology-agnostic and provides the necessary formalism and flexibility to support multiple implementations. It is focused on execution time calculation and consists of around 30 queries/workloads (10 of them are from TPC). The drawback is that it is a structured-data-intensive benchmark.

Spark Bench for Apache Spark

We are able to build it on ARM64. The setup completed for a single node, but the run scripts are failing: when the spark-bench examples are run, a KILL signal is observed which terminates all workers. This is still under investigation, as there are no useful logs to debug; the lack of a proper error description and of documentation is a challenge. A ticket has already been filed on the spark-bench git, which is unresolved.


Hive TestBench

It is based on the TPC-H and TPC-DS benchmarks. You can experiment with Apache Hive at any data scale. The benchmark contains a data generator and a set of queries, and is very useful to test basic Hive performance on large data sets. We have a wiki page for Hive TestBench.


GridMix

This is a stripped-down version of common MapReduce jobs (sorting text data and SequenceFiles). It is a tool for benchmarking Hadoop clusters: a trace-based benchmark for MapReduce that evaluates MapReduce and HDFS performance.

It submits a mix of synthetic jobs, modeling a profile mined from production loads. The benchmark attempts to model the resource profiles of production jobs to identify bottlenecks.

Basic command line usage:

 $ hadoop gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
                <iopath> - Destination directory
                <trace>  - Path to a job trace

Con: it is challenging to explore the performance impact of combining or separating workloads, e.g. through consolidation from many clusters.


PigMix

PigMix is a set of queries used to test Pig component performance. There are queries that test latency (how long does it take to run this query?) and queries that test scalability (how many fields or records can Pig handle before it fails?).

Usage: Run the below commands from pig home

ant -Dharness.hadoop.home=$HADOOP_HOME pigmix-deploy (generate test dataset)
ant -Dharness.hadoop.home=$HADOOP_HOME pigmix (run the PigMix benchmark)

The documentation can be found at Apache pig - https://pig.apache.org/docs/ 


SWIM

This benchmark enables rigorous performance measurement of MapReduce systems. It contains suites of workloads of thousands of jobs, with complex data, arrival, and computation patterns, which informs highly targeted, workload-specific optimizations. This tool is highly recommended for MapReduce operators. Performance measurement: https://github.com/SWIMProjectUCB/SWIM/wiki/Performance-measurement-by-executing-synthetic-or-historical-workloads


AMPLab Big Data Benchmark

This is a big data benchmark from AMPLab, UC Berkeley, that provides quantitative and qualitative comparisons of five systems:
  • Redshift – a hosted MPP database offered by Amazon.com based on the ParAccel data warehouse
  • Hive – a Hadoop-based data warehousing system
  • Shark – a Hive-compatible SQL engine which runs on top of the Spark computing framework
  • Impala – a Hive-compatible* SQL engine with its own MPP-like execution engine
  • Stinger/Tez – Tez is a next generation Hadoop execution engine currently in development
This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDFs, across different data sizes.


This is a specification-based benchmark with two key components: a data model specification and a workload/query specification. It is a comprehensive end-to-end big data benchmark suite. See the GitHub repository for BigDataBenchmark.

BigDataBench is a benchmark suite for scale-out workloads, different from SPEC CPU (sequential workloads), and PARSEC (multithreaded workloads). Currently, it simulates five typical and important big data applications: search engine, social network, e-commerce, multimedia data analytics, and bioinformatics.

Currently, BigDataBench includes 15 real-world data sets, and 34 big data workloads.


HiBench

This benchmark test suite is for Hadoop. It contains 4 different categories of tests, 10 workloads and 3 types. It is a very good benchmark, with Time (sec) & Throughput (bytes/sec) as metrics.



References

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-21.pdf 

Terasort, TestDFSIO, NNBench, MRBench

https://wiki.linaro.org/LEG/Engineering/BigData
https://wiki.linaro.org/LEG/Engineering/BigData/HadoopTuningGuide 
https://wiki.linaro.org/LEG/Engineering/BigData/HadoopBuildInstallAndRunGuide 
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/ 

GridMix3, PigMix, HiBench, TPCx-HS, SWIM, AMPLab, BigBench

https://hadoop.apache.org/docs/current/hadoop-gridmix/GridMix.html 
https://cwiki.apache.org/confluence/display/PIG/PigMix 
https://wiki.linaro.org/LEG/Engineering/BigData/HiBench 
https://wiki.linaro.org/LEG/Engineering/BigData/TPCxHS 
https://github.com/SWIMProjectUCB/SWIM/wiki 
https://github.com/amplab
https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench
http://www.academia.edu/15636566/Handbook_of_BigDataBench_Version_3.1_A_Big_Data_Benchmark_Suite 



Industry Standard benchmarks

TPC - Transaction Processing Performance Council http://www.tpc.org 
SPEC - The Standard Performance Evaluation Corporation https://www.spec.org 
CLDS - Center for Largescale Data System Research http://clds.sdsc.edu/bdbc 

by Naresh (noreply@blogger.com) at March 03, 2018 09:30

March 29, 2018

Marcin Juszkiewicz

Shenzhen trip

A few months ago, at the end of the previous Linaro Connect gathering, there was an announcement that the next one would take place in Hong Kong. This gave me the idea of repeating the Shenzhen trip, but in a bit longer version.

So I mailed people at Linaro and there were some responses. We quickly agreed on going there before Connect. Alex, Arnd, Green and I were landing around noon, Riku a few hours later, so we decided that we would meet in Shenzhen.

We crossed the border at Lok Ma Chau, my visa had the highest price again, and then we took a taxi to the Maker Hotel (still called “Quchuang Hotel” in Google Maps and on Booking.com) next to all those shops we wanted to visit. Then we went for a quick walk through the Seg Electronics Market. Lots of mining gear: 2000W power supplies, strange PCI Express expanders etc. Dinner, a meeting with Riku, and the day ended.

I woke up at 02:22 and was not able to fall asleep again. Around 6:00 it turned out that the rest of the team was awake as well, so we decided to go out and search for some breakfast. The deserted streets looked a bit weird.

Back at the hotel we were discussing random things. Then someone from Singapore joined and we talked about changes in how Shenzhen stores/factories operate. He told us that there are fewer and fewer stores as business moves to the Internet. Then a Chinese family came in, with a boy of about seven. He said something, his mother translated, and it turned out that he wanted to touch my beard. As it was not the first time my beard got such attention, I allowed him. That surprise on his face was worth it. And then we realized that we had not seen a bearded Chinese man on the street.

As the stores were opening at 10:00 we still had a lot of time, so we went for a random walk, including Shenzhen Center Park, which is a really nice place:

Then the stores started to open. Fake phones, real phones, tablets, components, devices, misc things… Walking there was fun in itself. I bought some items from my list.

They also had a lot of old things. Intel Overdrive system for example or 386/486 era processors and FPUs.

From weird things: 3.5″ floppy disks and Intel Xeon Platinum 8175 made for Amazon cloud only.

Lots and lots of stuff everywhere. Need a power supply? There were several stores with industrial ones, regulated ones etc. Used computers/laptops? Piles after piles. New components? Lots to choose from. Etc, etc, etc…

After several hours we finally decided to go back to Hong Kong and rest. The whole trip was fun; I really enjoyed it, even without getting half of the items from my ‘buy while in Shenzhen’ list ;D

And I ordered a Shenzhen fridge magnet on AliExpress… they were not available to buy at any place we visited.

by Marcin Juszkiewicz at March 03, 2018 11:54

March 26, 2018

Alex Bennée

Solving the HKG18 puzzle with org-mode

One of the traditions I like about Linaro’s Connect event is the conference puzzle. Usually set by Dave Piggot, they provide a challenge to your jet-lagged brain. Full disclosure: I did not complete the puzzle in time. In fact when Dave explained it I realised the answer had been staring me in the face. However I thought a successful walk-through would make for a more entertaining read 😉

First the Puzzle:

Take the clues below and solve them. Once solved, figure out what the hex numbers mean and then you should be able to associate each of the clue solutions with their respective hex numbers.

Clue Hex Number
Lava Ale Code 1114DBA
Be Google Roe 114F6BE
Natural Gin 114F72A
Pope Charger 121EE50
Dolt And Hunk 12264BC
Monk Hops Net 122D9D9
Is Enriched Tin 123C1EF
Bran Hearing Kin 1245D6E
Enter Slim Beer 127B78E
Herbal Cabbages 1282FDD
Jan Venom Hon Nun 12853C5
A Cherry Skull 1287B3C
Each Noun Lands 1298F0B
Wave Zone Kits 12A024C
Avid Null Sorts 12A5190
Handcars All Trim 12C76DC

Clues

It looks like all the clues are anagrams. I was lazy and just used the first online anagram solver that Google pointed me at. However we can automate this by combining org-mode with Python and the excellent Beautiful Soup library.

from bs4 import BeautifulSoup
import requests
import re

# ask internet to solve the puzzle
url="http://anagram-solver.net/%s" % (anagram.replace(" ", "%20"))
page=requests.get(url)

# fish out the answers
soup=BeautifulSoup(page.text, "html.parser")
answers=soup.find("ul", class_="answers")
for li in answers.find_all("li"):
    result = li.text
    # filter out non computer related or poor results
    if result in ["Elmer Berstein", "Tim-Berners Lee", "Babbage Charles", "Calude Shannon"]:
        continue
    # filter out non proper names
    if re.search("[a-z] [A-Z]", result):
        break

return result

So with :var anagram=clues[2,0] we get

Ada Lovelace

I admit the “if result in []” is a bit of a hack.

Hex Numbers

The hex numbers could be anything. But let’s first start by converting them to something else.

Hex Prompt Number
1114DBA 17911226
114F6BE 18151102
114F72A 18151210
121EE50 19000912
12264BC 19031228
122D9D9 19061209
123C1EF 19120623
1245D6E 19160430
127B78E 19380110
1282FDD 19410909
12853C5 19420101
1287B3C 19430204
1298F0B 19500811
12A024C 19530316
12A5190 19550608
12C76DC 19691228

The #+TBLFM: is $1='(identity remote(clues,@@#$2))::$2='(string-to-number $1 16)

This is where I went down a blind alley. The fact that they all had the top bit set made me think that Dave was giving a hint to the purpose of the hex number in the way many cryptic crosswords do (I know he is a fan of these). However the more obvious answer is that everyone in the list was born in the last millennium.
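
You can see the trick by decoding one value in Python; the decimal expansion reads as a YYYYMMDD date of birth, matching the first row of the table above:

    # 0x1114DBA decodes to 17911226, i.e. the date 1791-12-26.
    n = int("1114DBA", 16)
    print(n)                                                    # 17911226
    print("%s-%s-%s" % (str(n)[:4], str(n)[4:6], str(n)[6:]))   # 1791-12-26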

Looking up Birth Dates

Now I could go through all the names by hand and look up their birth dates but as we are automating things perhaps we can use computers for what they are good at. Unfortunately there isn’t a simple web-api for looking up this stuff. However there is a project called DBpedia which takes Wikipedia’s data and attempts to make it semantically useful. We can query this using a semantic query language called SparQL. If only I could call it from Emacs…

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>

select ?birthDate where {
  { dbr:$name dbo:birthDate ?birthDate }
  UNION
  { dbr:$name dbp:birthDate ?birthDate }
}

So calling with :var name="Ada_Lovelace" we get

"birthDate"
1815-12-10
1815-12-10

Of course it shouldn’t be a surprise this exists. And in what I hope is a growing trend, sparql-mode supports org-mode out of the box. The $name in the snippet is expanded from the variables passed in to the function. This makes it a general-purpose lookup function we can use for all our names.

There are a couple of wrinkles. We need to format the name we are looking up with underscores to make a valid URL. Also the output spits out a header and possible multiple birth dates. We can solve this with a little wrapper function. It also introduces some rate limiting so we don’t smash DBpedia’s public SPARQL endpoint.

;; rate limit
(sleep-for 1)
;; do the query
(let* ((str (s-replace-all '((" " . "_") ("Von" . "von")) name))
       (ret (eval
             (car
              (read-from-string
               (format "(org-sbe get-dob (name $\"%s\"))" str))))))
  (string-to-number (replace-regexp-in-string "-" "" (car (cdr (s-lines ret))))))

Calling with :var name="Ada Lovelace" we get

18151210

Full Solution

So now we know what we are doing we need to solve all the puzzles and lookup the data. Fortunately org-mode’s tables are fully functional spreadsheets except they are not limited to simple transformations. Each formula can be a fully realised bit of elisp, calling other source blocks as needed.

Clue Solution DOB
Herbal Cabbages Charles Babbage 17911226
Be Google Roe George Boole 18151102
Lava Ale Code Ada Lovelace 18151210
A Cherry Skull Haskell Curry 19000912
Jan Venom Hon Nun John Von Neumann 19031228
Pope Charger Grace Hopper 19061209
Natural Gin Alan Turing 19120623
Each Noun Lands Claude Shannon 19160430
Dolt And Hunk Donald Knuth 19380110
Is Enriched Tin Dennis Ritchie 19410909
Bran Hearing Kin Brian Kernighan 19420101
Monk Hops Net Ken Thompson 19430204
Wave Zone Kits Steve Wozniak 19500811
Handcars All Trim Richard Stallman 19530316
Enter Slim Beer Tim Berners-Lee 19550608
Avid Null Sorts Linus Torvalds 19691228

The #+TBLFM: is $1='(identity remote(clues,@@#$1))::$2='(org-sbe solve-anagram (anagram $$1))::$3='(org-sbe frob-dob (name $$2))

The hex numbers are helpfully sorted so as long as we sort the clues table by the looked up date of birth using M-x org-table-sort-lines we are good to go.
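
The association step is easy to sanity-check outside org-mode too. Here is a small Python sketch using the first three clues from the tables above: once both sides are sorted, they pair up one-to-one.

    # Decoded hex numbers and looked-up dates of birth pair up once sorted.
    hex_numbers = ["114F72A", "1114DBA", "114F6BE"]   # any order will do
    dobs = {"Charles Babbage": 17911226,
            "George Boole": 18151102,
            "Ada Lovelace": 18151210}

    decoded = sorted(int(h, 16) for h in hex_numbers)
    for (name, dob), value in zip(sorted(dobs.items(), key=lambda kv: kv[1]), decoded):
        assert dob == value
        print(name, value)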

You can find the full blog post in raw form here.

by Alex at March 03, 2018 10:19

March 22, 2018

Naresh Bhat

A dream come true: Himalayan Odyssey - 2016 (Day-0 to 5)

History:

THE HIMALAYAS as most everyone knows are the highest mountains in the world, with 30 peaks over 24,000 feet. The adventure of a lifetime doesn't get much bigger or higher than riding and chasing mountains of the Himalayas.

Royal Enfield (RE) motorcycles have been manufactured and sold in INDIA since 1907. These motorcycles are best suited for INDIAN road conditions, and have been used by the INDIAN ARMY since the Second World War.

There is a saying "FOUR WHEELS MOVE THE BODY-BUT TWO WHEELS MOVE THE SOUL". I am a motorcycle enthusiast from my childhood days. I always had dreams to own a RE motorcycle after getting into a job. Right now I own two variants of RE motorcycles, “Royal Enfield Thunderbird Twinspark” (TBTS) and “Squadron Blue Classic Dispatch” which is a Limited Edition.

Thunder Bird Twin Spark 350cc 2011 model

Squadron Blue Dispatch 500cc  2015 model


The TBTS is 350cc, good for cruising on long stretches of highway. The Dispatch has a 500cc EFI engine which gives a quick response to the throttle. Hence I decided to take the Classic 500cc motorcycle for the Himalayan Odyssey (HO).

In INDIA, Royal Enfield conducts different motorcycling tours, e.g. the HO, Tour of Tibet, Tour of Nepal, Tour of Rajasthan, etc. Out of all these tours the HO is considered the toughest one. The reason is very simple: riding in the Himalayan mountains is not easy, considering the road conditions, unpredictable weather, high altitudes, etc. The Himalayan mountain roads are completely shut down for 6 months; the INDIAN army clears the snow, then opens and maintains them for the other 6 months. Every year, the army announces the opening and closing dates.

RE has been conducting the HO for the past 12 years. I took part in HO-2016, the 13th HO - “18 DAYS LIKE NO OTHER IN RIDING”. It was conducted between the 6th and 23rd of July 2016. Our group had 70 men and 14 women from all over the world. The men's and women's odyssey routes were different, but they met at LADAKH, took separate routes again, and met for the last-day celebration party in Chandigarh. The men's group route map is as below.


HO Preparation:


It takes a lot of effort to convince your family and make suitable arrangements at the office. I had been planning the HO ride for the last 5 years, accumulating leave, and tried to get as physically fit as possible by exercising on a regular basis. After registration you are required to go through a physical fitness test and submit the documents. The test includes a 5 km run and 50 push-ups in 45 minutes. A physical fitness certificate from a local doctor also needs to be submitted to RE. The documents to be submitted include medical test reports for blood, urine and a treadmill test (TMT), a self-written medical history, a medical check-up fitness certificate from a doctor, and an indemnity bond.

The HO team includes a doctor, backup van, mechanics, media people, 3-4 lead riders from RE etc. All the information will be communicated to you post registration.

The HO ride starts from Delhi and ends at Chandigarh. I am located in Bangalore and hence had to plan to reach Delhi by July 7th with my motorcycle. I knew I would need 3 days to reach Delhi from Bangalore by road. Since I had a very limited amount of time, I planned to ship my motorcycle in a container and fly to Delhi. The transport of my motorcycle cost INR Rs. 5,780.00 one way; the cost of transporting the motorcycle was actually more than my air tickets 😅. The round-trip flight tickets cost INR Rs. 7,000.00. Once you register for the HO trip they will include you in closed Facebook and WhatsApp groups, where it is very easy to discuss all your questions.

Ready to ship
I used VRL Logistics (Vijayanand Road Lines) to ship my bike from Bangalore to Delhi. Many of you may ask: why not just rent a motorcycle in Delhi? Simply because if I ride my own motorcycle in the mountains, I will understand it in a better way and the personal attachment to the motorcycle will be stronger. That's the reason RE suggests taking your own motorcycle on any of its rides.

locked in a container
Luggage types and split-up:

When we start our ride, our overall luggage will be split into two. 

1. The luggage that we carry on the motorcycle; we call it “satellite luggage”.
    A duffel bag is a good choice. You can fasten it to your motorcycle using bungee cords or luggage straps. Remember to waterproof this bag well, as it is exposed to the elements in whatever terrain you ride. Packing this bag is crucial: distribute the weight evenly, and if there is some space left in the bag, use compression straps to ensure the contents do not move around inside it. Tie the bag down after checking its placement thoroughly, and do so only on the centre stand. We will end up doing this even at camp sites, and finding a flat piece of land can be tricky; use stones to ensure that your motorcycle is as upright as possible when you're fastening your luggage. It is very tempting to use saddle bags for satellite luggage, but this will leave you with more empty space. Avoid starting the trip with saddlebags on your bike and then shifting them to the luggage vehicle.

What my satellite luggage will definitely have
1. A change of clothes- a pair of denims/cargos, a T-shirt and a casual jacket
2. A hat
3. A pair of running shoes
4. Winter gloves - depending on where we're on the Odyssey
5. Toiletries - I'll have my lip balm/guard and sunscreen
6. GoPro, some mounts, batteries and a power bank
7. a Beanie or a woolen buff
8. a Torch
9. Spare cables and a tube

2. The luggage that is carried in the luggage vehicle.
     This will be minus the riding gear that you bring, as that will be worn by you for the duration of the entire ride. This luggage is restricted to one piece per rider with a max limit of 15 kilos. Why 15 kilos? After you have removed all the gear and your satellite luggage, we have found that this is a comfortable cut-off. It is also a comfortable weight for you to carry to your rooms and load/unload to the luggage vehicle every day. This luggage will need to be loaded and unloaded every day, and in case of rain the bags can get soiled and wet. It is best to use some level of waterproofing to safeguard what's in the bags; a waterproof cover or waterproofing from the inside could do the job.


Day-0:

Everybody needs to reach two days before the HO trip; they will book the accommodation for you. On the very first day I just checked in and collected my motorcycle at Delhi.

The next day schedule was as below


Flag Off day and complete Itinerary:

The 13th edition of the Royal Enfield Himalayan Odyssey flagged off from New Delhi on 9th July 2016. This is a call to all those who love to ride on tough and testing terrain and have the passion to ride with RE. The year 2016 saw 75 riders on one of the most spectacular motorcycle journeys in the world.

Here is our detailed itinerary



Day-1: Delhi To Chandigarh

The first day started as below

  • 5 AM luggage loading - HO
  • 6:30 AM - breakfast
  • 7:15 AM HO start to India gate
Let this begin!

Group photo @INDIA Gate
The first day's ride always starts from India Gate, Delhi. We took a group photo, did some Buddhist rituals and prayed for a safe ride. The briefing covered the regroup points, road conditions and some common mistakes committed by riders.


We were 12 people from Karnataka State and grouped together to take some group photos.

Riders joined from Karnataka State

The flag-off was done by the RE sales director. Just after the flag-off there was some news channel coverage: Auto Today and NDTV.




flag-off
Chandigarh, the capital of the northern Indian states of Punjab and Haryana, was designed by the Swiss-French modernist architect Le Corbusier. The city is a union territory: it is not part of either of the two states it serves, and is governed directly by the Union Government, which administers all such territories in the country.

In the afternoon we reached Chandigarh and checked in to the hotel. Chandigarh is a very well-planned and beautiful city, with lots of trees and parks, so we did a quick tour of a couple of places in the city.

Day-2: Chandigarh To Manali

In the HO every day is a learning day. You will become much closer to your motorcycle each day; in other words, you will understand the motorcycle's handling in a better way. The day starts with luggage loading, breakfast, briefing and ride-out, at the same times every day.

Briefing

The briefing lasts about 10-15 minutes. It is very important for a rider, because it covers the kind of road you are going to ride that day and important riding tips.

We reached the Manali Highland hotel by 5 PM and visited the local market to purchase items required for the ride. This is the last city on the onward journey to Leh; after Manali, the real ride starts. There will be less tarmac and more rough roads, and you will see all the shops in tents until you reach Leh. I also met a couple of cyclists who were cycling up to Leh.

Cyclists @Manali hotel
Manali is a high-altitude Himalayan resort town in India’s northern Himachal Pradesh state. It has a reputation as a backpacking center and honeymoon destination. Set on the Beas River, it’s a gateway for skiing in the Solang Valley and trekking in Parvati Valley. It's also a jumping-off point for paragliding, rafting and mountaineering in the Pir Panjal mountains, home to 4,000m-high Rohtang Pass.

Day-3: Manali To Keylong (Jispa)

The road from Manali to Rothang pass is a single road.  Although it had tarmac, it was not in a good condition.  We took a break at Rothang pass base camp.
Base camp
We started slowly climbing the pass. I could feel the thin air and the altitude change; my motorcycle was also responding slowly to the throttle, for the machine needs oxygen for combustion too. The weather on Rothang pass changes every 15 minutes. The last leg of the climb was very foggy and I could hardly see the road.

Rothang Pass roads
After couple of kms it was very sunny and bright.  We were warned not to stay more than 10 min at high altitude region.
Top of Rothang Pass
We just took a couple of photos and started descending the Rothang pass. It is good that after crossing the Rothang pass the road is completely empty and traffic-free; you can only see some Indian Army trucks or goods carriers on the road. But suddenly the road becomes very rough and dusty. After travelling a few kms on these rough roads my motorcycle started behaving in a weird way: the headlight, horn and indicators stopped working. Hence I stopped to check the problem. Fortunately, I spotted one of the RE HO trip coordinators there. He did a basic check and identified that a fuse had blown. In a couple of minutes he replaced the fuse, which is readily available in the side panel box. I continued my ride till the lunch break.


Lunch time..:)

Dusty roads on the way to Tandi
At some places the roads were under construction. Since they had put down wet mud with stones, it was very difficult to handle a motorcycle weighing around 200 kg.
Road construction
Finally we reached the Tandi fuel pump and filled the tank up, since there would be no filling station for the next 365 km.
Tandi
Tandi gas station
The rough and dusty road continued. At some places the dust settled on the road was nearly 10-15 cm deep.


We continued to ride and reached Jispa camp. The river was flowing just behind our tents. It is really heaven on earth; a very beautiful village.
Jispa camp

Our Tent
We had snacks and hung out. From the evening onwards it was very cold because of the wind and the cold river just behind our tents. I felt I could have taken a room instead of a tent; that was purely our mistake, since we reached early and grabbed a tent to stay in.

Day-4: Keylong (Jispa) To Sarchu

Jispa is a village in Lahaul, in the Indian state of Himachal Pradesh. Jispa is located 20 km north of Keylong and 7 km south of Darcha, along the Manali-Leh Highway and the Bhaga river. There are approximately 20 villages between Jispa and Keylong.

In the briefing we were given instructions on how to do water crossings. In all the water crossings there are small pebbles and the water is very cold. One should make sure that the motorcycle's tires do not get stuck between these small stone beds.
Ready to leave Jispa valley
The distance between Jispa and Sarchu is short, but it is difficult to ride on terrain with no roads. We finished the morning briefing and started riding.
Briefing @Jispa
We crossed a couple of water streams before reaching Sarchu. The technique to cross a water stream is very simple: first, hold the motorcycle's tank tightly with your knees; next, keep your upper body free, focus and look ahead at the water-covered road, and give it throttle.
Riding beside Bhaga river

Water crossing
Valley view
Lunch break
We had a break for lunch. I had some noodles; you will not get any food in these tents other than omelettes, noodles and plain rice.

We reached Sarchu quite early, around 3-4 PM. But within 15-20 minutes of reaching Sarchu the headaches started; almost everyone had mountain sickness. Acute Mountain Sickness (AMS) is the mildest form and it's very common. The symptoms can feel like a hangover – dizziness, headache, muscle aches, nausea. The camp doctors checked the heartbeats of all affected people.

We were unable to eat anything and could not sleep or rest. Even walking 100 metres left us unable to breathe. That was a horrible day which I will never forget in my life.

Sarchu camp
We again had tented accommodation, with only solar-charged lights and no army hospital nearby. After the sun goes down there is a sudden drop in temperature. It felt like the situation was life-threatening.

Sarchu is a major halt point with tented accommodation in the Himalayas on the Leh-Manali Highway, on the boundary between Himachal Pradesh and Ladakh (Jammu and Kashmir) in India. It is situated between Baralacha La to the south and Lachulung La to the north, at an altitude of 4,290 m (14,070 ft).

Day-5: Sarchu To Leh

I was very eager to start from Sarchu. Between the high altitude and the very cold weather I could not get a good sleep. The RE guys brought petrol (gas) in the backup van and all of us queued up to top up. The stay in the Sarchu tent was the most uncomfortable one, but it is true that once you acclimatize to the Sarchu altitude, you are more prepared to travel further.
@Patso
I shifted my satellite luggage to the backup van; as experience shows, it is very uncomfortable to ride with saddle bags on the motorcycle. After Sarchu the roads are open, with no traffic for several kms. I was riding alone and stopped to take pictures. When I reached the bottom of the "Gata Loops", a couple of my friends joined me.

GATA Loops begin
Gata Loops is a name that is unknown to everyone except the few who have traveled on the Manali-Leh highway or are planning to do so. It is a series of twenty-one hairpin bends that takes you to the top of the 3rd high-altitude pass on this highway, Nakeela, at a height of 15,547 ft.
More (Mo-ray) plains
I have covered hundreds of mountain miles but had never seen a plateau. When I came upon the More (pronounced ‘mo-ray’) Plains, they were much bigger than the plateaus I’d visualized from school geography books.
They are endless. Well, 50 km of flatlands at an elevation of 15,000 feet deserves that epithet! And they are flat, for miles after miles, till they run into the surrounding mountains.  Camp here for the evening and you’ll see the most stunning of sunsets. The area is surprisingly active here. You will always have workers building or repairing roads.


We continued the ride towards Leh after taking a few pics at the More plains. We passed through Pang, Meroo and Debring, and at Rumtse we had a lunch break. The Indus river flows parallel to the road, with the steep cliffs of the mountains on the other side. I remember each mountain having different colors after Debring. By evening we reached Leh and checked into the hotel "Namgyal Palace".


by Naresh (noreply@blogger.com) at March 03, 2018 10:16

March 11, 2018

Gema Gomez

Azufral Capelet

A few months ago I bought some Berroco Mykonos yarn in San Francisco. I also bought a pattern for it, the Azufral pattern, written by Donna Yacino. Now, after a few months with not a lot of spare time to work on it, I have managed to finish the capelet:

Capelet

The pattern was followed verbatim, adjusting for gauge and the measurements of the desired garment. The needles used were Knit Pro Symfonie Cubic Square Needles - 30cm (Pair) - 4.00mm, single pointed.

The yarn is Berroco Mykonos (66% linen, 26% nylon, 8% cotton), color aura (8544), handwash in lukewarm water only and lay flat to dry. I hardly ever go for yarn that is not machine washable, but this one was so shiny and nice to the touch that I could not help it.

The fabric looks as follows once finished:

fabric

by Gema Gomez at March 03, 2018 00:00

March 01, 2018

Gema Gomez

OpenStack Queens on ARM64

We are in Dublin this week, at the OpenStack PTG. We happen to be here on a week that has red weather warnings all over Europe, so most of us are stuck in Dublin for longer than we expected.

Queens has been released!

During Pike/Queens my team at Linaro (Software Defined Infrastructure) have been enabling different parts of OpenStack on ARM64 and making sure the OpenStack code is multiarch when necessary (note that I use the terms AArch64 and ARM64 interchangeably).

There seems to be some confusion about the nature of the servers we are using, here is a picture of one of our racks:

servers

Queens is the first release that we feel confident will run out of the box on ARM64, a milestone of collaboration not only from the Linaro member companies but also from the OpenStack community at large. OpenStack projects have been welcoming of the diversity and inclusive, helping us ramp up: either giving direction and reviewing our code or fixing issues themselves.

We will be deploying Queens with Kolla on the Linaro Developer Cloud (ARM64 servers) and documenting the experience for new Kolla users, including brownfield upgrades.

The Linaro Developer Cloud is a collaborative effort of the Linaro Enterprise Group to ensure ARM64 building and testing capabilities are available for different upstream projects, including OpenStack.

This cycle we added resources from one of our clouds to the openstack-infra project so the community can start testing multiarch changes regularly. The bring-up of the ARM64 cloud in infra is in progress; there are only 8 executors currently available to run jobs, which we’ll be using for experimental jobs for the time being. The long-term goal of this effort is to be able to run ARM64 jobs on the gates by default for all projects.

What next? Next steps include running experimental gate jobs for Kolla and any other project that volunteers, ironing out any leftover issues, making sure devstack runs smoothly, incrementally making sure we have a stable platform to run tests on, and inviting all OpenStack projects to take part if they are interested. If you want to discuss any specifics or have questions, either use the Kolla mailing list or reach out to hrw or gema on freenode.

by Gema Gomez at March 03, 2018 00:00

February 21, 2018

Alex Bennée

Workbooks for Benchmarking

While working on a major re-factor of QEMU’s softfloat code I’ve been doing a lot of benchmarking. It can be quite tedious work as you need to be careful you’ve run the correct steps on the correct binaries and keeping notes is important. It is a task that cries out for scripting but that in itself can be a compromise as you end up stitching a pipeline of commands together in something like perl. You may script it all in a language designed for this sort of thing like R but then find your final upload step is a pain to implement.

One solution to this is to use a literate programming workbook like this. Literate programming is a style where you interleave your code with natural prose describing the steps you go through. This is different from simply having well-commented code in a source tree. For one thing you do not have to leap around a large code base, as everything you need is in the file you are reading, from top to bottom. There are many solutions out there, including various python based examples. Of course being a happy Emacs user I use one of its stand-out features, org-mode, which comes with multi-language org-babel support. This allows me to document my benchmarking while scripting up the steps in a variety of “languages” depending on my needs at the time. Let’s take a look at the first section:

1 Binaries To Test

Here we have several tables of binaries to test. We refer to the
current benchmarking set from the next stage, Run Benchmark.

For a final test we might compare the system QEMU with a reference
build as well as our current build.

Binary title
/usr/bin/qemu-aarch64 system-2.5.log
~/lsrc/qemu/qemu-builddirs/arm-targets.build/aarch64-linux-user/qemu-aarch64 master.log
~/lsrc/qemu/qemu.git/aarch64-linux-user/qemu-aarch64 softfloat-v4.log

Well that is certainly fairly explanatory. These are named org-mode tables which can be referred to in other code snippets and passed in as variables. So the next job is to run the benchmark itself:

2 Run Benchmark

This runs the benchmark against each binary we have selected above.

    import subprocess

    runs = []

    # `files` and `tests` are passed in by org-babel :var headers
    # from the tables above.
    for qemu, logname in files:
        cmd = "taskset -c 0 %s ./vector-benchmark -n %s | tee %s" % (qemu, tests, logname)
        subprocess.call(cmd, shell=True)
        runs.append(logname)

    return runs


So why use python as the test runner? Well truth is whenever I end up munging arrays in shell script I forget the syntax and end up jumping through all sorts of hoops. Easier just to have some simple python. I use python again later to read the data back into an org-table so I can pass it to the next step, graphing:

set title "Vector Benchmark Results (lower is better)"
set style data histograms
set style fill solid 1.0 border lt -1

set xtics rotate by 90 right
set yrange [:]
set xlabel noenhanced
set ylabel "nsecs/Kop" noenhanced
set xtics noenhanced
set ytics noenhanced
set boxwidth 1
set xtics format ""
set xtics scale 0
set grid ytics
set term pngcairo size 1200,500

plot for [i=2:5] data using i:xtic(1) title columnhead

This is a GNU Plot script which takes the data and plots an image from it. org-mode takes care of the details of marshalling the table data into GNU Plot so all this script is really concerned with is setting styles and titles. The language is capable of some fairly advanced stuff but I could always pre-process the data with something else if I needed to.

Finally I need to upload my graph to an image hosting service to share with my colleagues. This can be done with an elaborate curl command, but I have another trick at my disposal thanks to the excellent restclient-mode. This mode is actually designed for interactive debugging of REST APIs but it is also easy to use from an org-mode source block. So the whole thing looks like an HTTP session:

:client_id = feedbeef

# Upload images to imgur
POST https://api.imgur.com/3/image
Authorization: Client-ID :client_id
Content-type: image/png

< benchmark.png

Finally because the above dumps all the headers when run (which is very handy for debugging) I actually only want the URL in most cases. I can do this simply enough in elisp:

#+name: post-to-imgur
#+begin_src emacs-lisp :var json-string=upload-to-imgur()
  (when (string-match
         (rx "link" (one-or-more (any "\":" whitespace))
             (group (one-or-more (not (any "\"")))))
         json-string)
    (match-string 1 json-string))
#+end_src

The :var line calls the restclient-mode function automatically and passes it the result which it can then extract the final URL from.

And there you have it, my entire benchmarking workflow document in a single file which I can read through tweaking each step as I go. This isn’t the first time I’ve done this sort of thing. As I use org-mode extensively as a logbook to keep track of my upstream work I’ve slowly grown a series of scripts for common tasks. For example every patch series and pull request I post is done via org. I keep the whole thing in a git repository so each time I finish a sequence I can commit the results into the repository as a permanent record of what steps I ran.

If you want even more inspiration I suggest you look at John Kitchen’s scimax work. As a publishing scientist he makes extensive use of org-mode when writing his papers. He is able to include the main prose with the code to plot the graphs and tables in a single source document from which his camera ready documents are generated. Should he ever need to reproduce any work his exact steps are all there in the source document. Yet another example of why org-mode is awesome 😉

by Alex at February 02, 2018 20:34

February 13, 2018

Riku Voipio

Making sense of /proc/cpuinfo on ARM

Ever stared at the output of /proc/cpuinfo and wondered what the CPU is?

...
processor : 7
BogoMIPS : 2.40
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 3
Or maybe like:

$ cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 2 (v7l)
BogoMIPS : 50.00
Features : half thumb fastmult vfp edsp thumbee vfpv3 tls idiva idivt vfpd32 lpae
CPU implementer : 0x56
CPU architecture: 7
CPU variant : 0x2
CPU part : 0x584
CPU revision : 2
...
The bits "CPU implementer" and "CPU part" could be mapped to human understandable strings. But the Kernel developers are heavily against the idea. Therefor, to the next idea: Parse in userspace. Turns out, there is a common tool almost everyone has installed does similar stuff. lscpu(1) from util-linux. So I proposed a patch to do ID mapping on arm/arm64 to util-linux, and it was accepted! So using lscpu from util-linux 2.32 (hopefully to be released soon) the above two systems look like:

Architecture: aarch64
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A53
Stepping: r0p3
CPU max MHz: 1200.0000
CPU min MHz: 208.0000
BogoMIPS: 2.40
L1d cache: unknown size
L1i cache: unknown size
L2 cache: unknown size
NUMA node0 CPU(s): 0-7
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
And

$ lscpu
Architecture: armv7l
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: Marvell
Model: 2
Model name: PJ4B-MP
Stepping: 0x2
CPU max MHz: 1333.0000
CPU min MHz: 666.5000
BogoMIPS: 50.00
Flags: half thumb fastmult vfp edsp thumbee vfpv3 tls idiva idivt vfpd32 lpae
As we can see, lscpu is quite versatile and can show more information than just what is available in cpuinfo.
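
For the curious, the mapping itself is simple enough to sketch in a few lines of Python. The tables below contain only the two CPUs shown above (the real util-linux tables are much larger), so treat this as an illustration of the idea rather than a complete decoder.

    # Sketch of the ID mapping lscpu performs; only the two example CPUs
    # from this post are in the tables. On multi-core systems this keeps
    # the values from the last processor entry, which is fine for a demo.
    IMPLEMENTERS = {0x41: "ARM", 0x56: "Marvell"}
    PARTS = {(0x41, 0xd03): "Cortex-A53", (0x56, 0x584): "PJ4B-MP"}

    fields = {}
    with open("/proc/cpuinfo") as f:
        for line in f:
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()

    impl = int(fields.get("CPU implementer", "0"), 16)
    part = int(fields.get("CPU part", "0"), 16)
    print("Vendor ID:", IMPLEMENTERS.get(impl, hex(impl)))
    print("Model name:", PARTS.get((impl, part), hex(part)))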

by Riku Voipio (noreply@blogger.com) at February 02, 2018 14:33

February 11, 2018

Siddhesh Poyarekar

Optimizing toolchains for modern microprocessors

About 2.5 years ago I left Red Hat to join Linaro in a move that surprised even me for the first few months. I still work on the GNU toolchain with a glibc focus, but my focus changed considerably. I am no longer looking at the toolchain in its entirety (although I do that on my own time whenever I can, either as glibc release manager or reviewer); my focus is making glibc routines faster for one specific server microprocessor; no prizes for guessing which processor that is. I have read architecture manuals in the past to understand specific behaviours but this is the first time that I have had to pore over the entire manual and optimization guides and try to eke out the last cycle of performance from a chip.

This post is an attempt to document my learnings and make a high-level guide of the various things my team and I looked at to improve performance of the toolchain. Note that my team is continuing to work on this chip (and I continue to learn new techniques, I may write about them later) so this ‘guide’ is more of a personal journey. I may add more follow-ups or modify this post to reflect any changes in my understanding of this vast topic.

All of my examples use ARM64 assembly since that’s what I’ve been working on and translating the examples to something x86 would have discouraged me enough to not write this at all.

What am I optimizing for?

CPUs today are complicated beasts. Vulnerabilities like Spectre allude to how complicated CPU behaviour can get but in reality it can get a lot more complicated and there’s never really a universal solution to get the best out of them. Due to this, it is important to figure out what the end goal for the optimization is. For string functions for example, there are a number of different factors in play and there is no single set of behaviours that trumps over all others. For compilers in general, the number of such combinations is even higher. The solution often is to try and ensure that there is a balance and there are no exponentially worse behaviours.

The first line of defence for this is to ensure that the algorithm used for the routine does not exhibit exponential behaviour. I wrote about algorithmic changes I did to the multiple precision fallback implementation in glibc years ago elsewhere so I’m not going to repeat that. I will however state that the first line of attack to improve any function must be algorithmic. Thankfully barring strcmp, string routines in glibc had a fairly sound algorithmic base. strcmp fall back to a byte comparison when inputs are not mutually aligned, which is now fixed.

Large strings vs small

This is one question that gets asked very often in the context of string functions and different developers have different opinions on it, some differences even leading to flamewars in the past. One popular approach to ‘solving’ this is to quote usage of string functions in a popular benchmark and use that as a measuring stick. For a benchmark like CPU2006 or CPU2017, it means that you optimize for smaller strings because the number of calls to smaller strings is very high in those benchmarks. There are a few issues to that approach:

  • These benchmarks use glibc routines for a very small fraction of time, so you’re not going to win a lot of performance in the benchmark by improving small string performance
  • Small string operations have other factors affecting them a lot more, i.e. things like cache locality, branch predictor behaviour, prefetcher behaviour, etc. So while it might be fun to tweak behaviour exactly the way a CPU likes it, it may not end up resulting in the kind of gains you’re looking for
  • A 10K string (in theory) takes at least 10 times more cycles than a 1K string, often more. So effectively, there is 10x more incentive to look at improving performance of larger strings than smaller ones.
  • There are CPU features specifically catered for larger sequential string operations and utilizing those microarchitecture quirks will guarantee much better gains
  • There are a significant number of use cases outside of these benchmarks that use glibc far more than the SPEC benchmarks. There’s no established set of benchmarks that represent them though.

I won’t conclude with a final answer for this because there is none. This is also why I had to revisit this question for every single routine I targeted, sometimes even before I decide to target it.

Cached or not?

This is another question that comes up for string routines and the answer is actually a spectrum - a string could be cached, not cached or partially cached. What’s the safe assumption then?

There is a bit more consensus on the answer to this question. It is generally considered safe to consider that shorter string accesses are cached and then focus on code scheduling and layout for its target code. If the string is not cached, the cost of getting it into cache far outweighs the savings through scheduling and hence it is pointless looking at that case. For larger strings, assuming that they’re cached does not make sense due to their size. As a result, the focus for such situations should be on ensuring that cache utilization is optimal. That is, make sure that the code aids all of the CPU units that populate caches, either through a hardware prefetcher or through judiciously placed software prefetch instructions or by avoiding caching altogether, thus avoiding evicting other hot data. Code scheduling, alignment, etc. is still important because more often than not you’ll have a hot loop that does the loads, compares, stores, etc. and once your stream is primed, you need to ensure that the loop is not suboptimal and runs without stalls.

My branch is more important than yours

Branch predictor units in CPUs are quite complicated and the compiler does not try to model them. Instead, it tries to do the simpler and more effective thing: make sure that the more probable branch target is accessible through sequential fetching. This is another aspect of the large-vs-small strings question for string functions, and more often than not, smaller sizes are assumed to be more probable for hand-written assembly because it seems to be that way in practice, and also because the cost of a mispredict hits the smaller size more than it does the larger one.

Don’t waste any part of a pig CPU

CPUs today are complicated beasts. Yes I know I started the previous section with this exact same line; they’re complicated enough to bear repeating that. However, there is a bit of relief in the fact that the first principles of their design hasn’t changed much. The components of the CPU are all things we heard about in our CS class and the problem then reduces to understanding specific quirks of the processor core. At a very high level, there are three types of quirks you look for:

  1. Something the core does exceedingly well
  2. Something the core does very badly
  3. Something the core does very well or badly under specific conditions

Typically this is made easy by CPU vendors when they provide documentation that specifies a lot of this information. Then there are cases where you discover these behaviours through profiling. Oh yes, before I forget:

Learn how to use perf or a similar tool and read its output; it will save your life.

For example, the falkor core does something interesting with respect to loads and addressing modes. Typically, a load instruction takes a specific number of cycles to fetch from L1, more if the memory is not cached, but that’s not relevant here. If you issue a load instruction with a pre/post-incrementing addressing mode, the microarchitecture issues two micro-instructions: one load and another that updates the base address. So:

   ldr  x1, [x2, 16]!

effectively is:

  ldr   x1, [x2, 16]
  add   x2, x2, 16

and that increases the net cost of the load. While it saves us an instruction, this addressing mode isn’t always preferred in unrolled loops since you could avoid the base address increment at the end of every instruction and do that at the end. With falkor however, this operation is very fast and in most cases, this addressing mode is preferred for loads. The reason for this is the way its hardware prefetcher works.

Hardware Prefetcher

A hardware prefetcher is a CPU unit that speculatively loads the memory location after the location requested, in an attempt to speed things up. This forms a memory stream, and the larger the string, the more it gains from prefetching. This however also means that in the case of multiple prefetcher units in a core, one must ensure that the same prefetcher unit is hit so that the unit gets trained properly, i.e. knows what’s the next block to fetch. The way a prefetcher typically knows is if it sees a consistent stride in memory accesses, i.e. it sees loads of X, X+16, X+32, etc. in a sequence.

On falkor the addressing mode plays an important role in determining which hardware prefetcher unit is hit by a load, and a pre/post-incrementing load effectively ensures that all the loads hit the same prefetcher. That, combined with register renaming, means it is much quicker to just keep fetching into the same architectural register and pre/post-increment the base address than to try to second-guess the CPU and outsmart it. The memcpy and memmove routines use this quirk extensively; the falkor routines even carry detailed comments explaining the basis of this behaviour.

Doing something so badly that it is easier to win

A colleague once said that the best targets for toolchain optimizations are CPUs that do things badly. There is always one behaviour or set of behaviours that CPU designers decided to sacrifice to benefit others. On falkor, for example, reading some system registers with the mrs instruction is painfully slow, whereas it is close to single-cycle latency on most other processors. Simply avoiding such slow paths can by itself result in tremendous performance wins; this was evident with the memset function for falkor, which became twice as fast for medium-sized strings.

Another example of this is in the compiler rather than glibc: a str instruction storing a 128-bit register with the register-offset addressing mode is very slow on falkor. Simply avoiding that form altogether results in pretty good gains.

CPU Pipeline

Both gcc and llvm allow you to specify a model of the CPU pipeline, i.e.

  1. The number of each type of unit the CPU has. That is, the number of load/store units, number of integer math units, number of FP units, etc.
  2. The latency for each type of instruction
  3. The number of micro-operations each instruction splits into
  4. The number of instructions the CPU can fetch/dispatch in a single cycle

and so on. This information is then used to sequence instructions in the function being optimized. It also helps the compiler choose between instructions based on how long they take. For example, it may be cheaper to declare a literal in the code and load from it than to construct the constant using mov/movk. Similarly, it could be cheaper to use csel to select a value to load into a register than to branch to a different piece of code that loads the register, or vice versa.
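
As a contrived illustration of that last point, these two sequences compute the same select; which one wins depends on the latencies in the pipeline model:

   // branchy version: if (x2 == 0) x0 = x1; else x0 = x3;
   cbnz  x2, 1f
   mov   x0, x1
   b     2f
1: mov   x0, x3
2:

   // branchless version
   cmp   x2, 0
   csel  x0, x1, x3, eq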

Optimal instruction sequencing can often result in significant gains. For example, interspersing load and store instructions with unrelated arithmetic instructions could result in those instructions executing in parallel, thus saving time. On the contrary, sequencing multiple load instructions back to back could result in other units being underutilized and all instructions being serialized onto the load unit. The pipeline model allows the compiler to make an optimal decision in this regard.
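
A minimal made-up fragment shows the idea on an in-order core:

   // poorly scheduled: the add stalls until the load completes
   ldr   x4, [x1]
   add   x5, x4, 1
   mul   x6, x7, x8

   // better: the independent mul hides some of the load latency
   ldr   x4, [x1]
   mul   x6, x7, x8
   add   x5, x4, 1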

Vector unit - to use or not to use, that is the question

The vector unit is this temptress that promises to double your execution rate, but it doesn’t come without cost. The most important cost is that of moving data between general purpose and vector registers, which can quite often eat into your gains. The vector instructions themselves may also be expensive, or a CPU might have multiple integer units and just one SIMD unit, in which case code may get a better schedule on the integer units than through the vector unit.

I had seen an opposite example of this in powerpc years ago, when I noticed that many of the integer operations in multiple-precision math were also implemented in FP. This was because the original authors were from IBM and they had noticed a significant performance gain with that on powerpc (possibly power7 or earlier, given the timelines) because the CPU had 4 FP units!

Final Thoughts

This is really just the tip of the iceberg when it comes to performance optimization in toolchains and utilizing CPU quirks. There are more behaviours that could be exploited (such as aliasing behaviour in branch prediction or core topology) but the cost benefit of doing that is questionable.

Despite how much fun it is to hand-write assembly for such routines, the best approach is always to write code simple enough (yes, clever tricks might actually defeat compiler optimization passes!) that the compiler can optimize for you. If there are missed optimizations, improve compiler support for them. For glibc on aarch64, there is also the case of the impending multiarch explosion. Due to the presence of multiple vendors, having a perfectly tuned routine for each vendor may pose code maintenance problems and also secondary performance issues, like code layout in the binary and instruction cache utilization. There are some random ideas floating about for that already, like making separate text sections for vendor-specific code, but that’s something we would like to avoid doing if we can.

by Siddhesh at February 02, 2018 19:37

February 06, 2018

Alex Bennée

FOSDEM 2018

I’ve just returned from a weekend in Brussels for my first ever FOSDEM – the Free and Open Source Developers, European Meeting. It’s been on my list of conferences to go to for some time and thanks to getting my talk accepted, my employer financed the cost of travel and hotels. Thanks to the support of the Université libre de Bruxelles (ULB) the event itself is free and run entirely by volunteers. As you can expect from the name they also have a strong commitment to free and open source software.

The first thing that struck me about the conference is how wide-ranging it was. There were talks on everything from the internals of debugging tools to developing public policy. When I first loaded up their excellent companion app (naturally via the F-Droid repository) I was somewhat overwhelmed by the choice. As it is a free conference there is no limit on the number who can attend, which means you are not always guaranteed to get into every talk. In fact during the event I walked past many long queues for the more popular talks. In the end I ended up just bookmarking all the talks I was interested in and deciding which one to go to depending on how I felt at the time. Fortunately FOSDEM has a strong archiving policy and videos most of its talks, so I’ll be spending the next few weeks catching up on the ones I missed.

There now follows a non-exhaustive list of the most interesting ones I was able to see live:

Dashamir’s talk on EasyGPG dealt with the opinionated decisions it makes to try and make the use of GnuPG more intuitive to those not versed in the full gory details of public key cryptography. Although I use GPG mainly for signing GIT pull requests, I really should make better use of it overall. The split-key solution to backups was particularly interesting. I suspect I’ll need a little convincing before I put part of my key in the cloud, but I’ll certainly check out his scripts.

Liam’s A Circuit Less Travelled was an entertaining tour of some of the technologies and ideas from early computer history that were abandoned by the wayside. These ideas were often re-invented later in inferior forms as engineers realised the error of their ways as technology advanced. The latter half of the talk turns into a bit of a LISP love-fest, but as an Emacs user with an ever-growing config file that is fine by me 😉

Following on in the history vein was Steven Goodwin’s talk on Digital Archaeology, which was a salutary reminder of the amount of recent history that is getting lost as computing’s breakneck pace has discarded old physical formats in favour of newer, equally short-lived formats. It reminded me I should really do something about the 3 boxes of floppy disks I have under my desk. I also need to schedule a visit to the Computer History Museum with my children, seeing as it is more or less on my doorstep.

There was a tongue-in-cheek preview that described the EDSAC talk as recreating “an ancient computer without any of the things that made it interesting”. This was a little unkind. Although the project re-implemented the computation parts in a tiny little FPGA, the core idea was to introduce potential students to the physicality of the early computers. After an introduction to the hoary architecture of the original EDSAC and the Wheeler Jump, Mary introduced the hardware they re-imagined for the project. The first was an optical reader developed to read in paper tapes, although this time ones printed on thermal receipt paper. This included an in-depth review of the problems of smoothing out analogue inputs to get reliable signals from their optical sensors, which mirrors the problems the rebuild is facing with the nature of the valves used in EDSAC. It is a shame they couldn’t come up with some way to involve a valve, but I guess high-tension supplies and school kids don’t mix well. However they did come up with a way of re-creating the original acoustic mercury delay lines, but this time with a tube of air and some 3D-printed parabolic ends.

The big geek event was the much anticipated announcement of RISC-V hardware during the RISC-V enablement talk. It seemed to be an open secret that the announcement was coming, but it still garnered hearty applause when it finally came. I should point out I’m indirectly employed by companies with an interest in a competing architecture, but it is still good to see other stuff out there. The board is fairly open but there are still some peripheral IPs which were closed, which shows just how tricky getting to fully-free hardware is going to be. As I understand RISC-V’s licensing model, the ISA is open (unlike for example an ARM Architecture License) but individual companies can still have closed implementations which they license to be manufactured, which is how I assume SiFive funds development. The actual CPU implementation is still very much a black box you have to take on trust.

Finally, my talk is already online for those that are interested in what I’m currently working on. The slides have been slightly cropped in the video, but if you follow the link to the HTML version you can read along on your machine.

I have to say FOSDEM’s setup is pretty impressive. Although there was a volunteer in each room to deal with fire safety and replace microphones, all the recording is fully automated. There are rather fancy hand-crafted wooden boxes in each room which take the feed from your laptop and mux it with the camera. I got the email from the automated system asking me to review a preview of my talk about half an hour after I gave it. It took a little longer for the final product to get encoded and online, but it’s certainly the nicest system I’ve come across so far.

All in all I can heartily recommend FOSDEM for anyone with an interest in FLOSS. It’s a packed schedule and there is going to be something for everyone there. Big thanks to all the volunteers and organisers, and I hope I can make it next year 😉

by Alex at February 06, 2018 09:36

January 23, 2018

Leif Lindholm

Fun and games with gnu-efi

gnu-efi is a set of scripts, libraries, header files and code examples to make it possible to write applications and drivers for the UEFI environment directly from your POSIX world. It supports i386, Ia64, X64, ARM and AArch64 targets ... but it would be dishonest to say it is beginner friendly in its current state. So let's do something about that.

Rough Edges

gnu-efi comes packaged for most Linux distributions, so you can simply run

$ sudo apt-get install gnu-efi

or

$ sudo dnf install gnu-efi gnu-efi-devel

to install it. However, there is a bunch of Makefile boilerplate that is not covered by said packaging, meaning that getting from "hey, let's check this thing out" to "hello, world" involves a fair bit of tedious makefile hacking.

... serrated?

Also, the whole packaging story here is a bit ... special. It means installing headers and libraries into /usr/lib and /usr/include solely for the inclusion into images to be executed by the UEFI firmware during Boot Services, before the operating system is running. And don't get me started on multi-arch support.

Simplification

Like most other programming languages, Make supports including other source files into the current context. The gnu-efi codebase makes use of this, but not in a way that's useful to a packaging system.

Now, at least GNU Make looks in /usr/include and /usr/local/include as well as the current working directory and any directories specified on the command line with -I. This means we can stuff most of the boilerplate into makefile fragments and include them where we need them.

Hello World

So, let's start with the (almost) most trivial application imaginable:

#include <efi/efi.h>
#include <efi/efilib.h>

EFI_STATUS
efi_main(
    EFI_HANDLE image_handle,
    EFI_SYSTEM_TABLE *systab
    )
{
    InitializeLib(image_handle, systab);

    Print(L"Hello, world!\n");

    return EFI_SUCCESS;
}

Save that as hello.c.

Reducing the boiler-plate

Now grab Make.defaults and Make.rules from the gnu-efi source directory and stick them in a subdirectory called efi/.

Then download this gnuefi.mk I prepared earlier, and include it in your Makefile:

include gnuefi.mk

ifeq ($(HAVE_EFI_OBJCOPY), y)
FORMAT := --target efi-app-$(ARCH)      # Boot time application
#FORMAT := --target efi-bsdrv-$(ARCH)   # Boot services driver
#FORMAT := --target efi-rtdrv-$(ARCH)   # Runtime driver
else
SUBSYSTEM=$(EFI_SUBSYSTEM_APPLICATION)  # Boot time application
#SUBSYSTEM=$(EFI_SUBSYSTEM_BSDRIVER)    # Boot services driver
#SUBSYSTEM=$(EFI_SUBSYSTEM_RTDRIVER)    # Runtime driver
endif

all: hello.efi

clean:
    rm -f *.o *.so *.efi *~

The hello.efi dependency for the all target invokes implicit rules (defined in Make.rules) to generate hello.efi from hello.so, which is generated by an implicit rule from hello.o, which is generated by an implicit rule from hello.c.

NOTE: there are two bits of boiler-plate that still need addressing.

First of all, in gnuefi.mk, GNUEFI_LIBDIR needs to be manually adjusted to fit the layout implemented by your distribution. Template entries for Debian and Fedora are provided.

Secondly, there is a bit of boiler-plate we cannot easily get rid of: we need to inform the toolchain whether the desired output is an application, a boot-time driver or a runtime driver. Templates for this are included in the Makefile snippet above, but note that different options must currently be set for toolchains where objcopy supports efi- targets directly and ones where it does not.

Building and running

Once the build environment has been completed, build the project as you would with any regular codebase.

$ make
gcc -I/usr/include/efi -I/usr/include/efi/x86_64 -I/usr/include/protocol -mno-red-zone -fpic  -g -O2 -Wall -Wextra -Werror -fshort-wchar -fno-strict-aliasing -fno-merge-constants -ffreestanding -fno-stack-protector -fno-stack-check -DCONFIG_x86_64 -DGNU_EFI_USE_MS_ABI -maccumulate-outgoing-args --std=c11 -c hello.c -o hello.o
ld -nostdlib --warn-common --no-undefined --fatal-warnings --build-id=sha1 -shared -Bsymbolic /usr/lib/crt0-efi-x86_64.o -L /usr/lib64 -L /usr/lib /usr/lib/gcc/x86_64-linux-gnu/6/libgcc.a -T /usr/lib/elf_x86_64_efi.lds hello.o -o hello.so -lefi -lgnuefi
objcopy -j .text -j .sdata -j .data -j .dynamic -j .dynsym -j .rel \
        -j .rela -j .rel.* -j .rela.* -j .rel* -j .rela* \
        -j .reloc --target efi-app-x86_64       hello.so hello.efi
rm hello.o hello.so
$ 

Then get the resulting application (hello.efi) over to a filesystem accessible from UEFI and run it.

UEFI Interactive Shell v2.2
EDK II
UEFI v2.60 (EDK II, 0x00010000)
Mapping table
FS0: Alias(s):HD1a1:;BLK3:
     PciRoot(0x0)/Pci(0x1,0x1)/Ata(0x0)/HD(1,MBR,0xBE1AFDFA,0x3F,0xFBFC1)
BLK2: Alias(s):
     PciRoot(0x0)/Pci(0x1,0x1)/Ata(0x0)
BLK4: Alias(s):
     PciRoot(0x0)/Pci(0x1,0x1)/Ata(0x0)
BLK0: Alias(s):
     PciRoot(0x0)/Pci(0x1,0x0)/Floppy(0x0)
BLK1: Alias(s):
     PciRoot(0x0)/Pci(0x1,0x0)/Floppy(0x1)
Press ESC in 5 seconds to skip startup.nsh or any other key to continue.
Shell> fs0:
FS0:\> hello
Hello, world!
FS0:\>

Wohoo, it worked! (I hope.)

Summary

gnu-efi provides a way to easily develop drivers and applications for UEFI inside your POSIX environment, but it comes with some unnecessarily rough edges. Hopefully this post makes it easier for you to get started with developing real applications and drivers using gnu-efi quickly.

Clearly, we should be working towards getting this sort of thing included in upstream and installed with distribution packages.

by Leif Lindholm at January 23, 2018 16:07

Ard Biesheuvel

Per-task stack canaries for arm64

Due to the way the stack of a thread (or task in kernelspeak) is shared between control flow data (frame pointer, return address, caller saved registers) and temporary buffers, overflowing such buffers can completely subvert the control flow of a program, and the stack is therefore a primary target for attacks. Such attacks are referred to as Return Oriented Programming (ROP), and typically consist of a specially crafted array of forged stack frames, where each return from a function is directed at another piece of code (called a gadget) that is already present in the program. By piecing together gadgets like this, powerful attacks can be mounted, especially in a big program such as the kernel where the supply of gadgets is endless.

One way to mitigate such attacks is the use of stack canaries, which are known values that are placed inside each stack frame when entering a function, and checked again when leaving the function. This forces the attacker to craft his buffer overflow attack in a way that puts the correct stack canary value inside each stack frame. That by itself is rather trivial, but it does require the attacker to discover the value first.

GCC support

GCC implements support for stack canaries, which can be enabled using the various ‑fstack-protector[‑xxx] command line switches. When enabled, each function prologue will store the value of the global variable __stack_chk_guard inside the stack frame, and each epilogue will read the value back and compare it, and branch to the function __stack_chk_fail if the comparison fails.
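
Schematically, on AArch64 that looks something like this (a hand-written sketch, not actual GCC output; the frame offset is arbitrary):

foo:
        // prologue: copy the canary into this frame
        adrp    x8, __stack_chk_guard
        ldr     x8, [x8, :lo12:__stack_chk_guard]
        str     x8, [sp, 40]
        // ... function body ...
        // epilogue: compare the stored copy against the global
        ldr     x9, [sp, 40]
        adrp    x8, __stack_chk_guard
        ldr     x8, [x8, :lo12:__stack_chk_guard]
        cmp     x8, x9
        b.ne    1f
        ret
1:      bl      __stack_chk_fail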

This works fine for user programs, with the caveat that all threads will use the same value for the stack canary. However, each program will pick a random value at program start, and so this is not a severe limitation. Similarly, for uniprocessor (UP) kernels, where only a single task will be active at the same time, we can simply update the value of the __stack_chk_guard variable when switching from one task to the next, and so each task can have its own unique value.

However, on SMP kernels, this model breaks down. Each CPU will be running a different task, and so any combination of tasks could be active at the same time. Since each will refer to __stack_chk_guard directly, its value cannot be changed until all tasks have exited, which only occurs at a reboot. Given that servers don’t usually reboot that often, leaking the global stack canary value can seriously compromise security of a running system, as the attacker only has to discover it once.

x86: per-CPU variables

To work around this issue, Linux/x86 implements support for stack canaries using the existing Thread-local Storage (TLS) support in GCC, which replaces the reference to __stack_chk_guard with a reference to a fixed offset in the TLS block. This means each CPU has its own copy, which is set to the stack canary value of that CPU’s current task when it switches to it. When the task migrates, it just takes its stack canary value along, and so all tasks can use a unique value. Problem solved.
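
Concretely, the x86-64 prologue load becomes an access at a fixed offset from the segment base (0x28 is the offset glibc reserves for the stack guard in its TLS block; the frame slot shown is arbitrary):

mov     %fs:0x28, %rax        # load this thread's canary
mov     %rax, 0x18(%rsp)      # store it into the current frame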

On arm64, we are not that lucky, unfortunately. GCC only supports the global stack canary value, although discussions are underway to decide how this is best implemented for multitask/thread environments, i.e., in a way that works for userland as well as for the kernel.

Per-CPU variables and preemption

Loading the per-CPU version of __stack_chk_guard could look something like this on arm64:

adrp    x0, __stack_chk_guard
add     x0, x0, :lo12:__stack_chk_guard
mrs     x1, tpidr_el1
ldr     x0, [x0, x1]

There are two problems with this code:

  • the arm64 Linux kernel implements support for Virtualization Host Extensions (VHE), and uses code patching to replace all references to tpidr_el1 with tpidr_el2 on VHE capable systems,
  • the access is not atomic: if this code is preempted after reading the value of tpidr_el1 but before loading the stack canary value, and is subsequently migrated to another CPU, it will load the wrong value.

In kernel code, we can deal with this easily: every emitted reference to tpidr_el1 is tagged so we can patch it at boot, and on preemptible kernels we put the code in a non-preemptible block to make it atomic. However, this is impossible to do in GCC generated code without putting elaborate knowledge of the kernel’s per-CPU variable implementation into the compiler, and doing so would severely limit our future ability to make any changes to it.

One way to mitigate this would be to reserve a general purpose register for the per-CPU offset, and ensure that it is used as the offset register in the ldr instruction. This addresses both problems: we use the same register regardless of VHE, and the single ldr instruction is atomic by definition.

However, as it turns out, we can do much better than this. We don’t need per-CPU variables if we can load the task’s stack canary value directly, and each CPU already keeps a pointer to the task_struct of the current task in system register sp_el0. So if we replace the above with

movz    x0, :abs_g0:__stack_chk_guard_offset
mrs     x1, sp_el0
ldr     x0, [x0, x1]

we dodge both issues, since all of the values involved are per-task values which do not change when migrating to another CPU. Note that the same sequence could be used in userland for TLS if you swap out sp_el0 for tpidr_el0 (and use the appropriate relocation type), so adding support for this to GCC (with a command line configurable value for the system register) would be a flexible solution to this problem.

Proof of concept implementation

I implemented support for the above, using a GCC plugin to replace the default sequence

adrp    x0, __stack_chk_guard
add     x0, x0, :lo12:__stack_chk_guard
ldr     x0, [x0]

with

mrs     x0, sp_el0
add     x0, x0, :lo12:__stack_chk_guard_offset
ldr     x0, [x0]

This limits __stack_chk_guard_offset to 4 KB, but this is not an issue in practice unless struct randomization is enabled. Another caveat is that it only works with GCC’s small code model (the one that uses adrp instructions) since the plugin works by looking for those instructions and replacing them.

Code can be found here.

by ardbiesheuvel at January 23, 2018 11:12

January 17, 2018

Alex Bennée

Edit with Emacs v1.15 released

After a bit of a hiatus there was enough of a flurry of patches to make it worth pushing out a new release. I’m in a little bit of a quandary about what to do with this package now. It’s obviously a useful extension for a good number of people, but I notice the slowly growing number of issues which I’m not making much progress on. It’s hard to find time to debug and fix things when its main state is Works For Me. There is also competition from the Atomic Chrome extension (and its related emacs extension). It’s an excellent package and has the advantage of a more actively developed Chrome extension that uses a bi-directional web-socket to communicate with the edit server. That’s been a feature I’ve wanted to add to Edit with Emacs for a while, but my re-factoring efforts are slowed down by the fact that Javascript is not a language I’m fluent in, and finding a long enough period of spare time is hard with a family. I guess this is a roundabout way of saying that realistically this package is in maintenance mode and you shouldn’t expect to see any new development for the time being. I’ll of course try my best to address reproducible bugs and process pull requests in a timely manner. That said, please enjoy v1.15:

Extension

* Now builds for Firefox using WebExtension hooks
* Use chrome.notifications instead of webkitNotifications
* Use a styled element instead of inline styles for the edit button
* Fake “input” event to stop active page components overwriting the text area

edit-server.el

* avoid calling make-frame-on-display for TTY setups (#103/#132/#133)
* restore edit-server-default-major-mode if auto-mode lookup fails
* delete window when done editing with no new frame

Get the latest from the Chrome Webstore.

by Alex at January 17, 2018 16:47

December 30, 2017

Gema Gomez

Add new ball for knitting

I knit less than I crochet, and this means that I forget all the basic things from time to time. Up until now, I had never had to join a new ball of yarn to a project, because my projects were small and used just one skein.

After some research, I have found this video quite clear on how to add a new ball of yarn safely:

Instructions

  1. In the middle of a row, insert the needle as if getting ready to knit a stitch normally.
  2. Instead of using the old yarn end, create a loop with the new one, and finish the stitch with it.
  3. Loop the old end of yarn over the top of the two new ones; this prevents a hole from forming.
  4. Holding both strands of the new ball of yarn do three or four more regular stitches to secure everything.
  5. Drop the short end from the new ball and just pick up the long strand and continue as normal.

Note: be careful on the way back not to work increases on the stitches that have been knitted with two strands; work them together. If the loose ends loosen up whilst you are working, give them little tugs, then weave them in.

by Gema Gomez at December 30, 2017 00:00

December 29, 2017

Gema Gomez

Autumn Knitting and Stitching Show 2017

This year, once again I took a day off during October and headed to Alexandra Palace in London to enjoy a day off looking at knitting/sewing supplies and ideas. This year’s Autumn Knitting and Stitching Show has been as interesting as always. I started the day doing some fabrics shopping (everything was so colorful):

sewing

Then, inevitably, admired all the art that was on display at the show. This time I was quite surprised by two scenes made of yarn, a railway station and a church. Here is proof that it can be knitted and it can look gorgeous:

railway station church

Awesome day out, as always with the Knitting and Stitching Show, cannot wait to see what things are there next year!

by Gema Gomez at December 29, 2017 00:00

October 14, 2017

Gema Gomez

ImagiKnit

A couple of weeks ago I was in San Francisco for work. This was not my first time in San Francisco, so I didn’t really have a very packed agenda. Since it was Sunday, I went out with a couple of colleagues; we stopped at the Presidio for a picnic and ate some amazing food from the lovely food trucks there (Off the Grid). Afterwards we headed to what would be an amazing visit to a yarn shop abroad, Imagiknit:

Imagiknit shop

I had never heard of it before one of my friends at work mentioned it a couple of weeks prior to our trip. The shop was a delight: spacious and welcoming, with a nice atmosphere. Lots of different brands of yarn, and plenty of ideas on display near each of them.

Inside the shop

It took us a while to do our shopping; there was a lot of wall space to cover, and we wanted to make sure we got enough yarn to have something to remember this little corner of the world by. The shopkeepers were knowledgeable and helpful: they got me some of the colors I needed that were not on display, gave me advice on some of the patterns I was interested in, and found the books I was looking for. They did not only have yarn; they had plenty of accessories and books to choose from too.

Inside the shop

And this is what my shopping looked like when I arrived to the hotel:

Shopping

ImagiKnit has become a must-go place for me whenever I am next in San Francisco. Totally worth a couple of hours if you are ever visiting the city and are into knitting or crochet.

by Gema Gomez at October 14, 2017 23:00

September 12, 2017

Siddhesh Poyarekar

Across the Charles Bridge - GNU Tools Cauldron 2017

Since I joined Linaro back in 2015 around this time, my travel has gone up 3x with 2 Linaro Connects a year added to the one GNU Tools Cauldron. This year I went to FOSSAsia too, so it’s been a busy traveling year. The special thing about Cauldron though is that it is one of those conferences where I ‘work’ as well as have a lot of fun. The fun bit is because I get to meet all of the people that I work with almost every day in person and a lot of them have become great friends over the years.

I still remember the first Cauldron I went to in 2013 at Mountain View where I felt dwarfed by all of the giants I was sitting with. It was exaggerated because it was the first time I met the likes of Jeff Law, Richard Henderson, etc. in personal meetings since I had joined the Red Hat toolchain team just months before; it was intimidating and exciting all at once. That was also the first time I met Roland McGrath (I still hadn’t met Carlos, he had just had a baby and couldn’t come), someone I was terrified of back then because his patch reviews would be quite sharp and incisive. I had imagined him to be a grim old man hammering out those words from a stern laptop, so it was a surprise to see him use the same kinds of words but with a sarcastic smile, completely changing the context and tone. That was the first time I truly realized how emails often lack context. Years later, I still try to visualize people when I read their emails.

Skip to 4 years later and I was at my 5th Cauldron last week and despite my assumptions on how it would go, it was a completely new experience. A lot of it had to do with my time at Linaro and very little to do with technical growth. I felt like an equal to Linaro folks all over the world and I seemed to carry that forward here, where I felt like an equal with all of the people present, I felt like I belonged. I did not feel insecure about my capabilities (I still am intimately aware of my limitations), nor did I feel the need to constantly prove that I belonged. I was out there seeking toolchain developers (we are hiring btw, email me if you’re a fit), comfortable with the idea of leading a team. The fact that I managed to not screw up the two glibc releases I managed may also have helped :)

Oh, and one wonderful surprise was that an old friend decided to drop in at Cauldron and spend a couple of days.

This year’s Cauldron had the most technical talks submitted in recent years. We had 5 talks in the glibc area, possibly also the highest for us; just as well because we went over time in almost all of them. I won’t say that it’s a surprise since that has happened in every single year that I attended. The first glibc talk was about tunables where I briefly recapped what we have done in tunables so far and talked about the future a bit more at length. Pedro Alves suggested putting pretty printers for tunables for introspection and maybe also for runtime tuning in the coming future. There was a significant amount of interest in the idea of auto-tuning, i.e. collecting profiling data about tunable use and coming up with optimal default values and possibly even eliminating such tunables in future if we find that we have a pretty good default. We also talked about tuning at runtime and the various kinds of support that would be required to make it happen. Finally there were discussions on tuning profiles and ideas around creating performance-enhanced routines for workloads instead of CPUs. The video recording of the talk will hopefully be out soon and I’ll link the video here when it is available.

Florian then talked about glibc 3.0, a notional concept (i.e. won’t be a soname bump) where we rewrite sections of code that have been rotting due to having to support some legacy platforms. The most prominent among them is libio, the module in glibc that implements stdio. When libio was written, it was designed to be compatible with libstdc++ so that FILE streams could be compatible with C++ stdio streams. The only version of gcc that really supports that is 2.95 since libstdc++ has since moved on. However because of the way we do things in glibc, we cannot get rid of them even if there is just one user that needs that ABI. We toyed with the concept of a separate compatibility library that becomes a graveyard for such legacy interfaces so that they don’t hold up progress in the library. It remains to be seen how this pans out, but I would definitely be happy to see this progress; libio was one of my backlog projects for years. I had to miss Raji’s talk on powerpc glibc improvements since I had to be in another meeting, so I’ll have to catch it when the video comes out.

The two BoFs for glibc dealt with a number of administrative and development issues, details of which Carlos will post on the mailing list soon. The highlights for me were the malloc instrumented benchmarks that Carlos wants to add to benchtests, and build and review tools. Once I clear up my work backlog a bit, I’ll attempt to set up something like phabricator or gerrit and see how that works out for the community instead of patchwork. I am convinced that all of the issues that we want to solve, like crediting reviewers, ensuring good git commit logs, running automated builds and tests, etc., can only be effectively solved with a proper review tool in place to review patches.

There was also a discussion on redoing the makefiles in glibc so that it doesn’t spend so much time doing dependency resolution, but I am going to pretend that it didn’t happen because it is an ugly ugly task :/

I’m back home now, recovering from the cold that worsened while I was in Prague before I head out again in a couple of weeks to SFO for Linaro Connect. I’ve booked tickets for whale watching tours there, so hopefully I’ll be posting some pictures again after a long break.

by Siddhesh at September 12, 2017 06:16

August 25, 2017

Steve McIntyre

Let's BBQ again, like we did last summer!

It's that time again! Another year, another OMGWTFBBQ! We're expecting 50 or so Debian folks at our place in Cambridge this weekend, ready to natter, geek, socialise and generally have a good time. Let's hope the weather stays nice, but if not we have gazebo technology... :-)

Many thanks to a number of awesome companies and people near and far who are sponsoring the important refreshments for the weekend:

I've even been working on the garden this week to improve it ready for the event. If you'd like to come and haven't already told us, please add yourself to the wiki page!

August 25, 2017 02:00

August 19, 2017

Leif Lindholm

OpenPlatformPkg is dead, long live edk2-platforms!

For a few years now, I have been working towards improving the availability of open source platform ports and device drivers for EDK2.

Initially, this began by setting up OpenPlatformPkg. This has been used both for platforms from Linaro members and external parties, and has already led to some amount of reduced code duplication, and moving common functionality to EDK2.

Now, the platforms that were in OpenPlatformPkg have been moved into the master branch of edk2-platforms, and OpenPlatformPkg itself has become a read-only archive.

So ... what changes?

Well, the first and most obvious change is that the repository now lives in the TianoCore area on github: https://github.com/tianocore/edk2-platforms

Like OpenPlatformPkg, this is not part of the main EDK2 repository. Unlike OpenPlatformPkg, there is an official way to work with this repository as part of the TianoCore group of projects. Code contributions to this repository are reviewed on the edk2-devel mailing list.

Secondly, the directory structure changes slightly. I will let you discover the specifics for yourself.

Thirdly, edk2-platforms is being kept license clean and source only. So binary-only content from OpenPlatformPkg was moved to a separate edk2-non-osi repository. We still want to enable platforms that have a number of non-open-source components to be able to share part of their code, but edk2-platforms will contain only free software.

At the same time, we change the build behavior from having OpenPlatformPkg nested under edk2 to building with edk2, edk2-platforms and (if needed) edk2-non-osi located "wherever" and individual packages located using PACKAGES_PATH.

Updates to uefi-tools

As before, I am way too lazy to keep figuring out the build command lines for each platform/toolchain combination, so I added support to uefi-tools for the new structure as well. Rather than breaking the compatibility of uefi-build.sh with OpenPlatformPkg, or making it more complex by making it support both, I added a new script called edk2-build.sh (which uses a new default platform configuration file called edk2-platforms.config).

Usage-wise, the most visible change is that the script no longer needs to be executed inside the edk2 directory; any directory it is executed from becomes the WORKSPACE, and build output, including intermediary stages, will be placed underneath it.

Secondly, the addition of new command line parameters to point out the locations of the various repositories involved in a build:

-e <edk2 directory>
-p <edk2-platforms directory>
-n <edk2-non-osi directory>

Release management

Well, the old strategies that could be used with edk2/OpenPlatformPkg to achieve a coherent commit on a single hash (git subrepos or submodules) are no longer much use. In order to make a tagged release over multiple repositories, a tool such as mr or repo will be necessary.

I will have to figure out which I pick for the Linaro Enterprise 17.10 release, but I have several weeks left for that :)

by Leif Lindholm at August 19, 2017 23:00

August 05, 2017

Leif Lindholm

Another new blog...

Well, I guess it's that time again. Much as I liked blosxom, it's not really maintained anymore, and the plugin architecture is ... archaic ... to say the least. So last year I started looking into pelican and found it would simplify my life a bit ... and then I started trying to move my existing blosxom theme over to pelican, and then I got bored of that and dropped everything.

However, I do need to be posting some more, and pelican has very nice and simple tags and drafts handling, as well as a lot more useful metadata functionality, whilst remaining as no-frills as I like (markdown support is good enough).

So here is a migration of all of the old content to the new architecture. Hopefully, I will get around to sorting the theme at some point, but at least I am functional again.

by Leif Lindholm at August 05, 2017 21:55

August 03, 2017

Siddhesh Poyarekar

Tunables story continued - glibc 2.26

Those of you tuned in to the wonderful world of system programming may have noticed that glibc 2.26 was released last night (or daytime if you live west of me or middle of the night/dawn if you live east of me, well you get the drift) and it came out with a host of new improvements, including the much awaited thread cache for malloc. The thread cache for malloc is truly a great step forward - it brings down latency of a bulk of allocations from hundreds of cycles to tens of cycles. The other major improvement that a bulk of users and developers will notice is the fact that glibc now detects when resolv.conf has changed and reloads the lookup configuration. Yes, this was long overdue but hey, it’s not like we were refusing patches for the past half a decade, so thank the nice soul (Florian Weimer) who actually got it done in the end.

We are not here to talk about the improvements mentioned in the NEWS. We are here to talk about an improvement that will likely have a long term impact on how optimizations are implemented in libraries. We are here to talk about…

TUNABLES!

Yes, I’m back with tunables, but this time I am not the one who did the work, it’s the wonderful people from Cavium and Intel who have started using tunables for a use case I had alluded to in my talk at Linaro Connect BKK 2016 and also in my previous blog post on tunables, which was the ability to influence IFUNCs.

IFUNCs? International functions? Intricate Functions? Impossibly ridiculous Functions?

There is a short introduction of the GNU Indirect Functions on the glibc wiki that should help you get started on this very powerful yet very complicated concept. In short, ifuncs extend the GOT/PLT mechanism of loading functions from dynamic libraries to loading different implementations of the same function depending on some simple selection criteria. Traditionally this has been based on querying the CPU for features that it supports and as a result we have had multiple variants of some very common functions such as memcpy_sse2 and memcpy_ssse3 for x86 processors that get executed based on the support declared by the processor the program is running on.
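
Here is a minimal sketch of the mechanism in C (all names and the feature check are made up; glibc wraps this in its own macros):

#include <stddef.h>

static void *my_memcpy_generic(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

static void *my_memcpy_fancy(void *dst, const void *src, size_t n)
{
    /* pretend this one uses wider, CPU-specific loads and stores */
    return my_memcpy_generic(dst, src, n);
}

/* The resolver runs when the dynamic linker resolves the symbol and
   returns the implementation that all callers will be bound to. */
static void *(*resolve_my_memcpy(void))(void *, const void *, size_t)
{
    int cpu_has_fancy_feature = 0;   /* query HWCAP/CPUID here */
    return cpu_has_fancy_feature ? my_memcpy_fancy : my_memcpy_generic;
}

void *my_memcpy(void *dst, const void *src, size_t n)
    __attribute__((ifunc("resolve_my_memcpy")));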

Tunables allow you to take this idea further because there are two ways to get performance benefits, (1) by utilizing all of the CPU features that help and (2) by catering to the workload. For example, you could have a workload that performs better with a supposedly sub-optimal memcpy variant for the CPU purely because of the way your data is structured or laid out. Tunables allow you to select that routine by pretending that the CPU has a different set of capabilities than it actually reports, by setting the glibc.tune.hwcaps tunable on x86 processors. Not only that, you can even tune cache sizes and non-temporal thresholds (i.e. threshold beyond which some routines use non-temporal instructions for loads and stores to optimize cache usage) to suit your workload. I won’t be surprised if some years down the line we see specialized implementations of these routines that cater to specific workloads, like memcpy_db for databases or memset_paranoid for a time invariant (or mostly invariant) implementation of memset.

Beyond x86

Here’s where another very important feature landed in glibc 2.26: multiarch support in aarch64. The ARMv8 spec is pretty standard and as a result the high level instruction set and feature set of vendor chips is pretty much the same with some minor trivial differences. However, even though the spec is standard, the underlying microarchitecture implementation could be very different and that meant that selection of instructions and scheduling differences could lead to sometimes very significant differences in performance and vendors obviously would like to take advantage of that.

The only way they could reliably (well, kind of; that deserves a whole blog post of its own) identify their processor variant (and hence deploy routines for their processors) was by reading the machine identification register, MIDR_EL1. If you’re familiar with aarch64 registers, you’ll notice that this register cannot be read from userspace; it can only be read by the kernel. The kernel thus has to trap and emulate this instruction, support for which has been available since Linux 4.11. In glibc 2.26, we now use MIDR_EL1 to identify which vendor processor the program is running on and deploy an optimal routine (in this case for the Cavium thunderxt88).

But wait, what about earlier kernels, how do they take advantage of this? There’s a tunable for it! There’s glibc.tune.cpu for aarch64 that allows you to select the CPU variant you want to emulate. For some workloads you’ll find the generic memcpy actually works better and the tunable allows you to select that as well.
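
Both of these are driven through the GLIBC_TUNABLES environment variable. Something along these lines (the values here are illustrative; check the manual for the names your glibc accepts):

# x86: mask out AVX2 so the non-AVX2 string routines get picked
GLIBC_TUNABLES=glibc.tune.hwcaps=-AVX2_Usable ./myworkload

# aarch64: ask for the generic routines instead of a vendor variant
GLIBC_TUNABLES=glibc.tune.cpu=generic ./myworkload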

Finally due to tunables, the much needed cleanup of LD_HWCAP_MASK happened, giving rise to the tunable glibc.tune.hwcap_mask. Tunables also eliminated a lot of the inconsistency in environment variable behaviour due to the way static and dynamic executables are initialized, so you’ll see much less differences in the way your applications behave when they’re built dynamically vs when they’re built statically.

Wow, that sounds good, where do I sign up for your newsletter?

The full list of hardware capability tunables is documented in the glibc manual, so take a look and feel free to hop on to the libc-help mailing list to discuss these tunables and suggest more ways in which you would like to tune the library for your workload. Remember that tunables don’t have any ABI/API guarantees for now, so they can be added or removed between releases as we deem fit. Also, your distribution may end up adding its own tunables too in future, so look out for those as well. Finally, system-level tunables are coming up real soon to allow system administrators to control how users use these tunables.

Happy hacking!

by Siddhesh at August 03, 2017 06:57

July 25, 2017

Rémi Duraffort

Using requests with xmlrpc

Using XML-RPC with Python3 is really simple. Calling system.version on http://localhost/RPC2 is as simple as:

import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost/RPC2")
print(proxy.system.version())

However, the default client is missing many features, like handling proxies. Using requests for the underlying connection allows for greater control of the http request.

The xmlrpc client allows the underlying transport class to be replaced by a custom one. In order to use requests, we create a simple Transport class:

import requests
import xmlrpc.client

class RequestsTransport(xmlrpc.client.Transport):

    def request(self, host, handler, data, verbose=False):
        # set the headers, including the user-agent
        headers = {"User-Agent": "my-user-agent",
                   "Content-Type": "text/xml",
                   "Accept-Encoding": "gzip"}
        url = "https://%s%s" % (host, handler)
        try:
            response = None
            response = requests.post(url, data=data, headers=headers)
            response.raise_for_status()
            return self.parse_response(response)
        except requests.RequestException as e:
            if response is None:
                raise xmlrpc.client.ProtocolError(url, 500, str(e), "")
            else:
                raise xmlrpc.client.ProtocolError(url, response.status_code,
                                                  str(e), response.headers)

    def parse_response(self, resp):
        """
        Parse the xmlrpc response.
        """
        p, u = self.getparser()
        p.feed(resp.text)
        p.close()
        return u.close()

To use this Transport class, we should use:

proxy = xmlrpc.client.ServerProxy(uri, transport=RequestsTransport())

We can now use requests to:

  • use proxies
  • skip ssl verification (on a development server) or adding the right certificate chain
  • set the headers
  • set the timeouts
  • ...

See the documentation or an example for more information.
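
As a rough sketch of the first two points, a variant of the transport can pass proxies and a timeout straight to requests.post (the class name, proxy URL and timeout value below are made up):

class TunedTransport(RequestsTransport):
    """RequestsTransport variant with a timeout and an https proxy."""

    def request(self, host, handler, data, verbose=False):
        headers = {"User-Agent": "my-user-agent",
                   "Content-Type": "text/xml",
                   "Accept-Encoding": "gzip"}
        url = "https://%s%s" % (host, handler)
        try:
            response = None
            response = requests.post(url, data=data, headers=headers,
                                     timeout=10,  # seconds before giving up
                                     proxies={"https": "http://proxy.example.com:3128"})
            response.raise_for_status()
            return self.parse_response(response)
        except requests.RequestException as e:
            if response is None:
                raise xmlrpc.client.ProtocolError(url, 500, str(e), "")
            raise xmlrpc.client.ProtocolError(url, response.status_code,
                                              str(e), response.headers)

proxy = xmlrpc.client.ServerProxy("https://localhost/RPC2",
                                  transport=TunedTransport())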

by Rémi Duraffort at July 25, 2017 08:33

July 24, 2017

Peter Maydell

Installing Debian on QEMU’s 64-bit ARM “virt” board

This post is a 64-bit companion to an earlier post of mine where I described how to get Debian running on QEMU emulating a 32-bit ARM “virt” board. Thanks to commenter snak3xe for reminding me that I’d said I’d write this up…

Why the “virt” board?

For 64-bit ARM QEMU emulates many fewer boards, so “virt” is almost the only choice, unless you specifically know that you want to emulate one of the 64-bit Xilinx boards. “virt” supports PCI, virtio, a recent ARM CPU and large amounts of RAM. The only thing it doesn’t have out of the box is graphics.

Prerequisites and assumptions

I’m going to assume you have a Linux host, and a recent version of QEMU (at least QEMU 2.8). I also use libguestfs to extract files from a QEMU disk image, but you could use a different tool for that step if you prefer.

I’m going to document how to set up a guest which directly boots the kernel. It should also be possible to have QEMU boot a UEFI image which then boots the kernel from a disk image, but that’s not something I’ve looked into doing myself. (There may be tutorials elsewhere on the web.)

Getting the installer files

I suggest creating a subdirectory for these and the other files we’re going to create.

wget -O installer-linux http://http.us.debian.org/debian/dists/stretch/main/installer-arm64/current/images/netboot/debian-installer/arm64/linux
wget -O installer-initrd.gz http://http.us.debian.org/debian/dists/stretch/main/installer-arm64/current/images/netboot/debian-installer/arm64/initrd.gz

Saving them locally as installer-linux and installer-initrd.gz means they won’t be confused with the final kernel and initrd that the installation process produces.

(If we were installing on real hardware we would also need a “device tree” file to tell the kernel the details of the exact hardware it’s running on. QEMU’s “virt” board automatically creates a device tree internally and passes it to the kernel, so we don’t need to provide one.)

Installing

First we need to create an empty disk drive to install onto. I picked a 5GB disk but you can make it larger if you like.

qemu-img create -f qcow2 hda.qcow2 5G

(Oops — an earlier version of this blogpost created a “qcow” format image, which will work but is less efficient. If you created a qcow image by mistake, you can convert it to qcow2 with mv hda.qcow2 old-hda.qcow && qemu-img convert -O qcow2 old-hda.qcow hda.qcow2. Don’t try it while the VM is running! You then need to update your QEMU command line to say “format=qcow2” rather than “format=qcow”. You can delete the old-hda.qcow once you’ve checked that the new qcow2 file works.)

Now we can run the installer:

qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel installer-linux \
  -initrd installer-initrd.gz \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic -no-reboot

The installer will display its messages on the text console (via an emulated serial port). Follow its instructions to install Debian to the virtual disk; it’s straightforward, but if you have any difficulty the Debian installation guide may help.

The actual install process will take a few hours as it downloads packages over the network and writes them to disk. It will occasionally stop to ask you questions.

Late in the process, the installer will print the following warning dialog:

   +-----------------| [!] Continue without boot loader |------------------+
   |                                                                       |
   |                       No boot loader installed                        |
   | No boot loader has been installed, either because you chose not to or |
   | because your specific architecture doesn't support a boot loader yet. |
   |                                                                       |
   | You will need to boot manually with the /vmlinuz kernel on partition  |
   | /dev/vda1 and root=/dev/vda2 passed as a kernel argument.             |
   |                                                                       |
   |                              <Continue>                               |
   |                                                                       |
   +-----------------------------------------------------------------------+  

Press continue for now, and we’ll sort this out later.

Eventually the installer will finish by rebooting — this should cause QEMU to exit (since we used the -no-reboot option).

At this point you might like to make a copy of the hard disk image file, to save the tedium of repeating the install later.

Extracting the kernel

The installer warned us that it didn’t know how to arrange to automatically boot the right kernel, so we need to do it manually. For QEMU that means we need to extract the kernel the installer put into the disk image so that we can pass it to QEMU on the command line.

There are various tools you can use for this, but I’m going to recommend libguestfs, because it’s the simplest to use. To check that it works, let’s look at the partitions in our virtual disk image:

$ virt-filesystems -a hda.qcow2 
/dev/sda1
/dev/sda2

If this doesn’t work, then you should sort that out first. A couple of common reasons I’ve seen:

  • if you’re on Ubuntu then your kernels in /boot are installed not-world-readable; you can fix this with sudo chmod 644 /boot/vmlinuz*
  • if you’re running Virtualbox on the same host it will interfere with libguestfs’s attempt to run KVM; you can fix that by exiting Virtualbox

Looking at what’s in our disk we can see the kernel and initrd in /boot:

$ virt-ls -a hda.qcow2 /boot/
System.map-4.9.0-3-arm64
config-4.9.0-3-arm64
initrd.img
initrd.img-4.9.0-3-arm64
initrd.img.old
lost+found
vmlinuz
vmlinuz-4.9.0-3-arm64
vmlinuz.old

and we can copy them out to the host filesystem:

virt-copy-out -a hda.qcow2 /boot/vmlinuz-4.9.0-3-arm64 /boot/initrd.img-4.9.0-3-arm64 .

(We want the longer filenames, because vmlinuz and initrd.img are just symlinks and virt-copy-out won’t copy them.)

An important warning about libguestfs, or any other tools for accessing disk images from the host system: do not try to use them while QEMU is running, or you will get disk corruption when both the guest OS inside QEMU and libguestfs try to update the same image.

If you subsequently upgrade the kernel inside the guest, you’ll need to repeat this step to extract the new kernel and initrd, and then update your QEMU command line appropriately.

Running

To run the installed system we need a different command line which boots the installed kernel and initrd, and passes the kernel the command line arguments the installer told us we’d need:

qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel vmlinuz-4.9.0-3-arm64 \
  -initrd initrd.img-4.9.0-3-arm64 \
  -append 'root=/dev/vda2' \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic

This should boot to a login prompt, where you can log in with the user and password you set up during the install.

The installed system has an SSH client, so one easy way to get files in and out is to use “scp” from inside the VM to talk to an SSH server outside it. Or you can use libguestfs to write files directly into the disk image (for instance using virt-copy-in), but make sure you only use libguestfs when the VM is not running, or you will get disk corruption.
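
For example, with the VM shut down, copying a local file into the guest’s filesystem looks like this (substitute your own file and destination directory):

virt-copy-in -a hda.qcow2 myfile.tar /home/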

by pm215 at July 24, 2017 09:25

July 21, 2017

Gema Gomez

Acer Shawl

Last weekend I attended a class at The Sheep Shop. It was the Easy crochet lace class by Joanne Scrace. Just for attending the class, we got a copy of the Acer Shawl pattern by Joanne. It was well explained and easy to get into the rhythm of. This is the sample I managed to do during the three hours of the class:

Class sample

I have continued working on it this week, and I managed to finish two skeins of 50g each of Louisa Harding Yarn, Amitola, color Tinkerbell (134). I have bought a third skein to make it slightly bigger, but it is looking lovely:

Shawl

Crochet hook used for this: 5.0mm.

This was the first time I worked with a colour-changing yarn on a project like this. I was rather careful when changing skeins to match the tones of both ends of the yarn, and the trick worked wonders for a very neat finish.

Thank you Joanne for such a lovely and simple pattern!

by Gema Gomez at July 21, 2017 23:00

July 05, 2017

Ard Biesheuvel

GHASH for low-end ARM cores

The Galois hash algorithm (GHASH) is a fairly straightforward keyed hash algorithm based on finite field multiplication, using the field GF(2^128) with characteristic polynomial x^128 + x^7 + x^2 + x + 1. (An excellent treatment of Galois fields can be found here)

The significance of GHASH is that it is used as the authentication component in the GCM algorithm, which is an implementation of authenticated encryption with associated data (AEAD), a cryptographic mode that combines authentication of data sent in the clear with authentication of data that is sent in encrypted form at the same time. It is widely used these days, primarily in the networking domain (IPsec, IEEE 802.11)

ISA support

Both the Intel and ARMv8 instruction sets now contain support for carry-less multiplication (also known as polynomial multiplication), primarily to allow for accelerated implementations of GHASH to be created, which formerly had to rely on unwieldy and less secure table-based implementations. (The Linux implementation pre-computes a 4 KB lookup table for each instance of the hash algorithm that is in use, i.e., for each session having a different key. 4 KB per IPsec connection does not sound too bad in terms of memory usage, but the D-cache footprint may become a bottleneck when serving lots of concurrent connections.) In contrast, implementations based on these special instructions are time invariant, and are significantly faster (around 16x on high-end ARMv8 cores).

Unfortunately, though, while ARMv8 specifies a range of polynomial multiplication instructions with various operand sizes, the one we are most interested in, which performs carry-less multiplication on two 64-bit operands to produce a 128-bit result, is optional in the architecture. So on low-end cores such as the Cortex-A53 (as can be found in the Raspberry Pi 3), the accelerated driver is not available because this particular instruction is not implemented.
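
You can check whether a given core implements it by looking at the kernel's feature flags; a quick sketch (on AArch64 Linux, this instruction is advertised as the pmull hwcap):

# Prints "pmull" if the optional 64x64 -> 128 bit instruction is available
grep -m1 Features /proc/cpuinfo | grep -ow pmull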

Using vmull.p8 to implement vmull.p64

The other day, I stumbled upon the paper Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine by Danilo Camara, Conrado Gouvea, Julio Lopez and Ricardo Dahab, which describes how 64×64 to 128 bit polynomial multiplication (vmull.p64) can be composed using 8×8 to 16 bit polynomial multiplication (vmull.p8) combined with other SIMD arithmetic instructions. The nice thing about vmull.p8 is that it is a standard NEON instruction, which means all NEON capable CPUs implement it, including the Cortex-A53 on the Raspberry Pi 3.

Transliterating 32-bit ARM code to the 64-bit ISA

The algorithm as described in the paper is based on the 32-bit instruction set (retroactively named AArch32), which deviates significantly from the new 64-bit ISA called AArch64. The primary difference is that the number of SIMD registers has increased to 32, which is nice, but which has a downside as well: it is no longer possible to directly use the top half of a 128-bit register as a 64-bit register, which is something the polynomial multiplication algorithm relies on.

The original code looks something like this (note the use of the ‘high’ and ‘low’ halves of the same 128-bit register in a single instruction):

.macro          vmull_p64, rq, ad, bd
vext.8          t0l, \ad, \ad, #1       @ A1
vmull.p8        t0q, t0l, \bd           @ F = A1*B
vext.8          \rq\()_L, \bd, \bd, #1  @ B1
vmull.p8        \rq, \ad, \rq\()_L      @ E = A*B1
vext.8          t1l, \ad, \ad, #2       @ A2
vmull.p8        t1q, t1l, \bd           @ H = A2*B
vext.8          t3l, \bd, \bd, #2       @ B2
vmull.p8        t3q, \ad, t3l           @ G = A*B2
vext.8          t2l, \ad, \ad, #3       @ A3
vmull.p8        t2q, t2l, \bd           @ J = A3*B
veor            t0q, t0q, \rq           @ L = E + F
vext.8          \rq\()_L, \bd, \bd, #3  @ B3
vmull.p8        \rq, \ad, \rq\()_L      @ I = A*B3
veor            t1q, t1q, t3q           @ M = G + H
vext.8          t3l, \bd, \bd, #4       @ B4
vmull.p8        t3q, \ad, t3l           @ K = A*B4
veor            t0l, t0l, t0h           @ t0 = (L) (P0 + P1) << 8
vand            t0h, t0h, k48
veor            t1l, t1l, t1h           @ t1 = (M) (P2 + P3) << 16
vand            t1h, t1h, k32
veor            t2q, t2q, \rq           @ N = I + J
veor            t0l, t0l, t0h
veor            t1l, t1l, t1h
veor            t2l, t2l, t2h           @ t2 = (N) (P4 + P5) << 24
vand            t2h, t2h, k16
veor            t3l, t3l, t3h           @ t3 = (K) (P6 + P7) << 32
vmov.i64        t3h, #0
vext.8          t0q, t0q, t0q, #15
veor            t2l, t2l, t2h
vext.8          t1q, t1q, t1q, #14
vmull.p8        \rq, \ad, \bd           @ D = A*B
vext.8          t2q, t2q, t2q, #13
vext.8          t3q, t3q, t3q, #12
veor            t0q, t0q, t1q
veor            t2q, t2q, t3q
veor            \rq, \rq, t0q
veor            \rq, \rq, t2q
.endm

However, things like veor t1l, t1l, t1h, or using ext with the upper halves of registers, are not possible in AArch64, and so we need to transpose the contents of some of the registers using the tbl and/or zip/unzip instructions. Also, the vmull.p8 instruction now exists in two variants: pmull, operating on the lower halves, and pmull2, operating on the upper halves of the input operands.

We end up with the following sequence, which is 3 instructions longer than the original:

.macro          __pmull_p8, rq, ad, bd, i
.ifb            \i
ext             t4.8b, \ad\().8b, \ad\().8b, #1         // A1
ext             t8.8b, \bd\().8b, \bd\().8b, #1         // B1
ext             t5.8b, \ad\().8b, \ad\().8b, #2         // A2
ext             t7.8b, \bd\().8b, \bd\().8b, #2         // B2
ext             t6.8b, \ad\().8b, \ad\().8b, #3         // A3
ext             t9.8b, \bd\().8b, \bd\().8b, #3         // B3
ext             t3.8b, \bd\().8b, \bd\().8b, #4         // B4

pmull           t4.8h, t4.8b, \bd\().8b                 // F = A1*B
pmull           t8.8h, \ad\().8b, t8.8b                 // E = A*B1
pmull           t5.8h, t5.8b, \bd\().8b                 // H = A2*B
pmull           t7.8h, \ad\().8b, t7.8b                 // G = A*B2
pmull           t6.8h, t6.8b, \bd\().8b                 // J = A3*B
pmull           t9.8h, \ad\().8b, t9.8b                 // I = A*B3
pmull           t3.8h, \ad\().8b, t3.8b                 // K = A*B4
pmull           \rq\().8h, \ad\().8b, \bd\().8b         // D = A*B
.else
tbl             t4.16b, {\ad\().16b}, perm1.16b         // A1
tbl             t8.16b, {\bd\().16b}, perm1.16b         // B1
tbl             t5.16b, {\ad\().16b}, perm2.16b         // A2
tbl             t7.16b, {\bd\().16b}, perm2.16b         // B2
tbl             t6.16b, {\ad\().16b}, perm3.16b         // A3
tbl             t9.16b, {\bd\().16b}, perm3.16b         // B3
tbl             t3.16b, {\bd\().16b}, perm4.16b         // B4

pmull2          t4.8h, t4.16b, \bd\().16b               // F = A1*B
pmull2          t8.8h, \ad\().16b, t8.16b               // E = A*B1
pmull2          t5.8h, t5.16b, \bd\().16b               // H = A2*B
pmull2          t7.8h, \ad\().16b, t7.16b               // G = A*B2
pmull2          t6.8h, t6.16b, \bd\().16b               // J = A3*B
pmull2          t9.8h, \ad\().16b, t9.16b               // I = A*B3
pmull2          t3.8h, \ad\().16b, t3.16b               // K = A*B4
pmull2          \rq\().8h, \ad\().16b, \bd\().16b       // D = A*B
.endif

eor             t4.16b, t4.16b, t8.16b                  // L = E + F
eor             t5.16b, t5.16b, t7.16b                  // M = G + H
eor             t6.16b, t6.16b, t9.16b                  // N = I + J

uzp1            t8.2d, t4.2d, t5.2d
uzp2            t4.2d, t4.2d, t5.2d
uzp1            t7.2d, t6.2d, t3.2d
uzp2            t6.2d, t6.2d, t3.2d

// t4 = (L) (P0 + P1) << 8
// t5 = (M) (P2 + P3) << 16
eor             t8.16b, t8.16b, t4.16b
and             t4.16b, t4.16b, k32_48.16b

// t6 = (N) (P4 + P5) << 24
// t7 = (K) (P6 + P7) << 32
eor             t7.16b, t7.16b, t6.16b
and             t6.16b, t6.16b, k00_16.16b

eor             t8.16b, t8.16b, t4.16b
eor             t7.16b, t7.16b, t6.16b

zip2            t5.2d, t8.2d, t4.2d
zip1            t4.2d, t8.2d, t4.2d
zip2            t3.2d, t7.2d, t6.2d
zip1            t6.2d, t7.2d, t6.2d

ext             t4.16b, t4.16b, t4.16b, #15
ext             t5.16b, t5.16b, t5.16b, #14
ext             t6.16b, t6.16b, t6.16b, #13
ext             t3.16b, t3.16b, t3.16b, #12

eor             t4.16b, t4.16b, t5.16b
eor             t6.16b, t6.16b, t3.16b
eor             \rq\().16b, \rq\().16b, t4.16b
eor             \rq\().16b, \rq\().16b, t6.16b
.endm

On the Raspberry Pi 3, this code runs 2.8x faster than the generic, table based C code. This is a nice improvement, but we can do even better.

GHASH reduction

The accelerated GHASH implementation that uses the vmull.p64 instruction looks like this:

ext		T2.16b, XL.16b, XL.16b, #8
ext		IN1.16b, T1.16b, T1.16b, #8
eor		T1.16b, T1.16b, T2.16b
eor		XL.16b, XL.16b, IN1.16b

pmull2		XH.1q, XL.2d, SHASH.2d		// a1 * b1
eor		T1.16b, T1.16b, XL.16b
pmull	 	XL.1q, XL.1d, SHASH.1d		// a0 * b0
pmull		XM.1q, T1.1d, SHASH2.1d		// (a1 + a0)(b1 + b0)

eor		T2.16b, XL.16b, XH.16b
ext		T1.16b, XL.16b, XH.16b, #8
eor		XM.16b, XM.16b, T2.16b

pmull		T2.1q, XL.1d, MASK.1d
eor		XM.16b, XM.16b, T1.16b

mov		XH.d[0], XM.d[1]
mov		XM.d[1], XL.d[0]

eor		XL.16b, XM.16b, T2.16b
ext		T2.16b, XL.16b, XL.16b, #8
pmull		XL.1q, XL.1d, MASK.1d

eor		T2.16b, T2.16b, XH.16b
eor		XL.16b, XL.16b, T2.16b

What should be noted here is that the finite field multiplication consists of a multiplication step and a reduction step, where the latter essentially performs the modulo division involving the characteristic polynomial (which is how we normalize the result, i.e., ensure that it remains inside the field).
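
Concretely, writing the 256-bit product as L(x) + x^{128} \cdot H(x), the characteristic polynomial gives

x^{128} \equiv x^7 + x^2 + x + 1 \pmod{x^{128} + x^7 + x^2 + x + 1}

so the high half can be folded back in as H(x) \cdot (x^7 + x^2 + x + 1), using only shifts and XORs. (GHASH stores field elements in bit-reflected form, which is why this folding shows up in the sequence below as shifts by 63, 62 and 57, i.e., 64 minus 1, 2 and 7.)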

So while this sequence is optimal for cores that implement vmull.p64 natively, on cores that have to emulate it we can switch to a reduction step that does not involve polynomial multiplication at all, avoiding two invocations of the 40-instruction fallback vmull.p64 sequence.

ext		T2.16b, XL.16b, XL.16b, #8
ext		IN1.16b, T1.16b, T1.16b, #8
eor		T1.16b, T1.16b, T2.16b
eor		XL.16b, XL.16b, IN1.16b

__pmull_p8	XH, XL, SHASH, 2		// a1 * b1
eor		T1.16b, T1.16b, XL.16b
__pmull_p8 	XL, XL, SHASH			// a0 * b0
__pmull_p8	XM, T1, SHASH2			// (a1 + a0)(b1 + b0)

eor		T2.16b, XL.16b, XH.16b
ext		T1.16b, XL.16b, XH.16b, #8
eor		XM.16b, XM.16b, T2.16b

eor		XM.16b, XM.16b, T1.16b

mov		XL.d[1], XM.d[0]
mov		XH.d[0], XM.d[1]

shl		T1.2d, XL.2d, #57
shl		T2.2d, XL.2d, #62
eor		T2.16b, T2.16b, T1.16b
shl		T1.2d, XL.2d, #63
eor		T2.16b, T2.16b, T1.16b
ext		T1.16b, XL.16b, XH.16b, #8
eor		T2.16b, T2.16b, T1.16b

mov		XL.d[1], T2.d[0]
mov		XH.d[0], T2.d[1]

ushr		T2.2d, XL.2d, #1
eor		XH.16b, XH.16b, XL.16b
eor		XL.16b, XL.16b, T2.16b
ushr		T2.2d, T2.2d, #6
ushr		XL.2d, XL.2d, #1

eor		T2.16b, T2.16b, XH.16b
eor		XL.16b, XL.16b, T2.16b

Loop invariants

Another observation one can make when looking at this code is that the vmull.p64 calls that remain all involve right-hand sides that are invariant during the execution of the loop. For the version that uses the native vmull.p64, this does not matter much, but for our fallback sequence, it means that some instructions essentially calculate the same value each time, and that computation can be taken out of the loop instead.

Since we have plenty of spare registers on AArch64, we can dedicate 8 of them to prerotated B1/B2/B3/B4 values of SHASH and SHASH2. With that optimization folded in as well, this implementation runs at 4x the speed of the generic GHASH driver. When combined with the bit-sliced AES driver, GCM performance on the Cortex-A53 increases twofold, from 58 to 29 cycles per byte.

The patches implementing this for AArch64 and for AArch32 can be found here.

by ardbiesheuvel at July 07, 2017 12:12

June 30, 2017

Gema Gomez

Stitching group

A couple of months ago we started a Stitch ‘n B*tch group at work. We meet every week on Thursdays at lunchtime in a meeting room for those of us in the office and via online conference for the rest.

We work in technology and most of our workforce is remote, so we decided to make the group inclusive and invite not only people in the office who might be interested, but also colleagues working from home. So far there are four of us regularly attending this once-a-week meetup at lunchtime, and we are having lots of fun sharing stories from our respective areas of the company. We are all from different departments, and if it weren't for this hobby we all share, we might never have gotten to know each other this well.

I cannot encourage crafters out there enough to get organised and do something other than sitting in front of the computer during the lunch hour. Stitching or walking are fun activities that help you socialize with your colleagues. And they make us so much more productive afterwards!

We are currently planning to attend fibre-east.co.uk at the end of the month together :-)

by Gema Gomez at June 06, 2017 23:00

June 23, 2017

Riku Voipio

Cross-compiling with debian stretch

Debian stretch comes with cross-compiler packages for selected architectures:
 $ apt-cache search cross-build-essential
crossbuild-essential-arm64 - Informational list of cross-build-essential packages for
crossbuild-essential-armel - ...
crossbuild-essential-armhf - ...
crossbuild-essential-mipsel - ...
crossbuild-essential-powerpc - ...
crossbuild-essential-ppc64el - ...

Let's have a quick, exact-steps guide. But first: while you could do all of this in your desktop PC's rootfs, it is wiser to contain yourself. Fortunately, Debian comes with a container tool out of the box:

sudo debootstrap stretch /var/lib/container/stretch http://deb.debian.org/debian
echo "strech_cross" | sudo tee /var/lib/container/stretch/etc/debian_chroot
sudo systemd-nspawn -D /var/lib/container/stretch
Then we set up a cross-building environment for arm64 inside the container:

# Tell dpkg we can install arm64
dpkg --add-architecture arm64
# Add src line to make "apt-get source" work
echo "deb-src http://deb.debian.org/debian stretch main" >> /etc/apt/sources.list
apt-get update
# Install cross-compiler and other essential build tools
apt install --no-install-recommends build-essential crossbuild-essential-arm64
Now that we have a nice build environment, let's choose something more complicated than the usual kernel/BusyBox to cross-build: qemu.

# Get qemu sources from debian
apt-get source qemu
cd qemu-*
# New in stretch: build-dep works in unpacked source tree
apt-get build-dep -a arm64 .
# Cross-build Qemu for arm64
dpkg-buildpackage -aarm64 -j6 -b
Now, that works perfectly for Qemu. For other packages, challenges may appear. For example, you may have to set the "nocheck" flag to skip build-time unit tests, or some of the build dependencies may not be multiarch-enabled. So work continues :)
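
For the "nocheck" case, a minimal sketch using the standard DEB_BUILD_OPTIONS mechanism (whether an individual package honours it can vary):

# Skip build-time test suites while cross-building
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage -aarm64 -j6 -b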

by Riku Voipio (noreply@blogger.com) at June 06, 2017 13:36

June 22, 2017

Steve McIntyre

-1, Trolling

Here's a nice comment I received by email this morning. I guess somebody was upset by my last post?

From: Tec Services <tecservices911@gmail.com>
Date: Wed, 21 Jun 2017 22:30:26 -0700
To: steve@einval.com
Subject: its time for you to retire from debian...unbelievable..your
         the quality guy and fucked up the installer!

i cant ever remember in the hostory of computing someone releasing an installer
that does not work!!

wtf!!!

you need to be retired...due to being retarded..

and that this was dedicated to ian...what a
disaster..you should be ashames..he is probably roling in his grave from shame
right now....

It's nice to be appreciated.

June 06, 2017 21:59

June 20, 2017

Steve McIntyre

So, Stretch happened...

Things mostly went very well, and we've released Debian 9 this weekend past. Many many people worked together to make this possible, and I'd like to extend my own thanks to all of them.

As a project, we decided to dedicate Stretch to our late founder Ian Murdock. He did much of the early work to get Debian going, and inspired many more to help him. I had the good fortune to meet up with Ian years ago at a meetup attached to a Usenix conference, and I remember clearly he was a genuinely nice guy with good ideas. We'll miss him.

For my part in the release process, again I was responsible for producing our official installation and live images. Release day itself went OK, but as is typical, the process ran late into Saturday night / early Sunday morning. We made and tested lots of different images, although the numbers were down from previous releases as we've stopped making the full CD sets now.

Sunday was the day for the release party in Cambridge. As is traditional, a group of us met up at a local hostelry for some revelry! We hid inside the pub to escape from the ridiculously hot weather we're having at the moment.

Party

Due to a combination of the lack of sleep and the heat, I nearly forgot to even take any photos - apologies to the extra folks who'd been around earlier whom I missed with the camera... :-(

June 06, 2017 22:21