An Accelerated Communication-Efficient Primal-Dual Optimization Framework for Structured Machine Learning

Chenxin Ma, Martin Jaggi, Frank E. Curtis, Nathan Srebro and Martin Takáč
2017 arXiv Preprint arXiv

Abstract

Distributed optimization algorithms are essential for training machine learning models on very large-scale datasets. However, they often suffer from communication bottlenecks. Confronting this issue, a communication-efficient primal-dual coordinate ascent framework (CoCoA) and its improved variant CoCoA+ have been proposed, achieving a convergence rate of O(1/t) for solving empirical risk minimization problems with Lipschitz continuous losses. In this paper, an accelerated variant of CoCoA+ is proposed and shown to possess a convergence rate of O(1/t^2) in terms of reducing suboptimality. The analysis of this rate is also notable in that the convergence rate bounds involve constants that, except in extreme cases, are significantly reduced compared to those previously provided for CoCoA+. The results of numerical experiments are provided to show that acceleration can lead to significant performance gains.

A Deep Q-Network for the Beer Game with Partial Information

Afshin Oroojlooyjadid, MohammadReza Nazari, Lawrence Snyder and Martin Takáč
2017 arXiv Preprint arXiv

Abstract

The beer game is a decentralized, multi-agent, cooperative problem that can be modeled as a serial supply chain network in which agents cooperatively attempt to minimize the total cost of the network even though each agent can only observe its own local information. We develop a variant of the Deep Q-Network algorithm to solve this problem. Extensive numerical experiments show the effectiveness of our algorithm. Unlike most algorithms in the literature, our algorithm does not have any limits on the parameter values, and it provides good solutions even if the agents do not follow a rational policy. The algorithm can be extended to other decentralized multi-agent cooperative games with partially observed information, which is a common type of situation in supply chain problems.

Stock-out Prediction in Multi-echelon Networks

Afshin Oroojlooyjadid, Lawrence Snyder and Martin Takáč
2017 arXiv Preprint arXiv

Abstract

In multi-echelon inventory systems the performance of a given node is affected by events that occur at many other nodes and at many other time periods. For example, a supply disruption upstream will have an effect on downstream, customer-facing nodes several periods later as the disruption "cascades" through the system. There is very little research on stock-out prediction in single-echelon systems and (to the best of our knowledge) none on multi-echelon systems. However, in the real world, it is clear that there is significant interest in techniques for this sort of stock-out prediction. Therefore, our research aims to fill this gap by using deep neural networks (DNNs) to predict stock-outs in multi-echelon supply chains.

A Coordinate-Descent Algorithm for Tracking Solutions in Time-Varying Optimal Power Flows

Jie Liu, Jakub Marecek, Andrea Simonetto and Martin Takáč
2017 arXiv Preprint arXiv

Abstract

Consider a polynomial optimisation problem whose instances vary continuously over time. We propose to use a coordinate-descent algorithm for solving such time-varying optimisation problems. In particular, we focus on relaxations of transmission-constrained problems in power systems. On the example of the alternating-current optimal power flows (ACOPF), we bound from above the difference between the current approximate optimal cost generated by our algorithm and the optimal cost of a relaxation using the most recent data, by a function of the properties of the instance and the rate of change of the instance over time. We also bound the number of floating-point operations that need to be performed between two updates in order to guarantee that the error is bounded from above by a given constant.

Underestimate Sequences via Quadratic Averaging

Chenxin Ma, Naga Venkata C. Gudapati, Majid Jahani, Rachael Tappenden and Martin Takáč
2017 arXiv Preprint arXiv

Abstract

In this work we introduce the concept of an Underestimate Sequence (UES), which is a natural extension of Nesterov's estimate sequence. Our definition of a UES utilizes three sequences, one of which is a lower bound (or under-estimator) of the objective function. The question of how to construct an appropriate sequence of lower bounds is also addressed, and we present lower bounds for strongly convex smooth functions and for strongly convex composite functions, which adhere to the UES framework. Further, we propose several first order methods for minimizing strongly convex functions in both the smooth and composite cases. The algorithms, based on efficiently updating lower bounds on the objective functions, have natural stopping conditions, which provide the user with a certificate of optimality. Convergence of all algorithms is guaranteed through the UES framework, and we show that all presented algorithms converge linearly, with the accelerated variants enjoying the optimal linear rate of convergence.
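For intuition, one standard minorant that a UES-type construction can build on in the strongly convex smooth case (a generic inequality, not necessarily the exact lower bound used in the paper): for a $\mu$-strongly convex differentiable $f$ and any iterate $x_k$,

    f(y) \;\ge\; f(x_k) + \langle \nabla f(x_k),\, y - x_k \rangle + \frac{\mu}{2}\|y - x_k\|^2 \qquad \text{for all } y.

Combining such quadratic minorants over the iterates gives a computable under-estimator of the objective, and the gap between $f(x_k)$ and the minimum of that under-estimator is the kind of optimality certificate referred to above.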

On the Complexity of Parallel Coordinate Descent

Rachael Tappenden, Martin Takáč and Peter Richtárik
2017 Journal Paper Optimization Methods and Software

Abstract

In this work we study the parallel coordinate descent method (PCDM) proposed by Richtarik and Takac [26] for minimizing a regularized convex function. We adopt elements from the work of Xiao and Lu [39], and combine them with several new insights, to obtain sharper iteration complexity results for PCDM than those presented in [26]. Moreover, we show that PCDM is monotonic in expectation, which was not confirmed in [26], and we also derive the first high-probability iteration complexity result for the case where the initial level set is unbounded.

Hybrid Methods in Solving Alternating-Current Optimal Power Flows

Alan C. Liddell, Jie Liu, Jakub Marecek and Martin Takáč
2017 Journal Paper IEEE Transactions on Smart Grid

Abstract

Many steady-state problems in power systems, including rectangular power-voltage formulations of optimal power flows in the alternating-current model (ACOPF), can be cast as polynomial optimisation problems (POP). For a POP, one can derive strong convex relaxations, or rather hierarchies of ever stronger, but ever larger relaxations. We study means of switching from solving the convex relaxation to a Newton method working on a non-convex Lagrangian of the POP.

A Robust Multi-Batch L-BFGS Method for Machine Learning

Albert S. Berahas and Martin Takáč
2017 arXiv Preprint arXiv

Abstract

This paper describes an implementation of the L-BFGS method designed to deal with two adversarial situations. The first occurs in distributed computing environments where some of the computational nodes devoted to the evaluation of the function and gradient are unable to return results on time. A similar challenge occurs in a multi-batch approach in which the data points used to compute function and gradients are purposely changed at each iteration to accelerate the learning process. Difficulties arise because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the updating process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, studies the convergence properties for both convex and nonconvex functions, and illustrates the behavior of the algorithm in a distributed computing platform on binary classification logistic regression and neural network training problems that arise in machine learning.
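For intuition, a minimal sketch of stable quasi-Newton updating in this setting, assuming (as one natural choice) that the gradient difference is formed only on the samples shared by two consecutive batches; grad_fn, the batch handling and the memory management are placeholders, not the paper's exact implementation.

    import numpy as np

    def stable_curvature_pair(grad_fn, w_old, w_new, batch_old, batch_new):
        """Form the L-BFGS pair (s, y) on the overlap of two consecutive batches,
        so both gradients in the difference are computed on the same data points."""
        overlap = np.intersect1d(batch_old, batch_new)   # shared sample indices
        s = w_new - w_old
        y = grad_fn(w_new, overlap) - grad_fn(w_old, overlap)
        return s, y

    def two_loop_direction(grad, s_list, y_list):
        """Standard L-BFGS two-loop recursion: returns -H_k applied to grad."""
        q = grad.copy()
        alphas = []
        for s, y in zip(reversed(s_list), reversed(y_list)):
            rho = 1.0 / (y @ s)
            a = rho * (s @ q)
            alphas.append((rho, a))
            q -= a * y
        if y_list:
            s, y = s_list[-1], y_list[-1]
            q *= (s @ y) / (y @ y)                       # initial Hessian scaling
        for (rho, a), (s, y) in zip(reversed(alphas), zip(s_list, y_list)):
            b = rho * (y @ q)
            q += (a - b) * s
        return -q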

Distributed Optimization with Arbitrary Local Solvers

Chenxin Ma, Jakub Konečný, Martin Jaggi, Virginia Smith, Michael I. Jordan, Peter Richtárik and Martin Takáč
2017 Journal Paper Optimization Methods and Software

Abstract

With the growth of data and necessity for distributed optimization methods, solvers that work well on a single machine must be re-designed to leverage distributed computation. Recent work in this area has been limited by focusing heavily on developing highly specific methods for the distributed environment. These special-purpose methods are often unable to fully leverage the competitive performance of their well-tuned and customized single machine counterparts. Further, they are unable to easily integrate improvements that continue to be made to single machine methods. To this end, we present a framework for distributed optimization that both allows the flexibility of arbitrary solvers to be used on each (single) machine locally, and yet maintains competitive performance against other state-of-the-art special-purpose distributed methods. We give strong primal-dual convergence rate guarantees for our framework that hold for arbitrary local solvers. We demonstrate the impact of local solver selection both theoretically and in an extensive experimental comparison. Finally, we provide thorough implementation details for our framework, highlighting areas for practical performance gains.

A low-rank coordinate-descent algorithm for semidefinite programming relaxations of optimal power flow

Jakub Mareček and Martin Takáč
2017 Journal Paper Optimization Methods and Software

Abstract

A novel rank-constrained re-formulation of the alternating-current optimal power flow problem makes it possible to derive novel semidefinite programming (SDP) relaxations. For these, we develop a solver that is often as fast as Matpower's interior-point method, at the same accuracy.

Stochastic Reformulations of Linear Systems: Algorithms and Convergence Theory

Peter Richtárik and Martin Takáč
2017 arXiv Preprint arXiv

Abstract

We develop a family of reformulations of an arbitrary consistent linear system into a stochastic problem. The reformulations are governed by two user-defined parameters: a positive definite matrix defining a norm, and an arbitrary discrete or continuous distribution over random matrices. Our reformulation has several equivalent interpretations, allowing for researchers from various communities to leverage their domain specific insights. In particular, our reformulation can be equivalently seen as a stochastic optimization problem, stochastic linear system, stochastic fixed point problem and a probabilistic intersection problem. We prove sufficient, and necessary and sufficient conditions for the reformulation to be exact. Further, we propose and analyze three stochastic algorithms for solving the reformulated problem---basic, parallel and accelerated methods---with global linear convergence rates. The rates can be interpreted as condition numbers of a matrix which depends on the system matrix and on the reformulation parameters. This gives rise to a new phenomenon which we call stochastic preconditioning, and which refers to the problem of finding parameters (matrix and distribution) leading to a sufficiently small condition number. Our basic method can be equivalently interpreted as stochastic gradient descent, stochastic Newton method, stochastic proximal point method, stochastic fixed point method, and stochastic projection method, with fixed stepsize (relaxation parameter), applied to the reformulations.
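As one concrete instance of the basic method, a sketch under simple parameter choices (identity norm matrix, sketches given by coordinate vectors), which reduces the iteration to a randomized Kaczmarz-type projection step; the sampling distribution below is one natural option, not the tuned "stochastic preconditioning" discussed above.

    import numpy as np

    def basic_method_kaczmarz(A, b, iters=10000, stepsize=1.0, seed=0):
        """Stochastic iteration for a consistent system Ax = b: sample one row and
        move toward the hyperplane a_i^T x = b_i -- interpretable as SGD with a
        fixed stepsize on a stochastic reformulation of the linear system."""
        rng = np.random.default_rng(seed)
        m, n = A.shape
        x = np.zeros(n)
        row_norms = (A ** 2).sum(axis=1)
        probs = row_norms / row_norms.sum()    # sample rows proportionally to their norms
        for _ in range(iters):
            i = rng.choice(m, p=probs)
            residual = A[i] @ x - b[i]
            x -= stepsize * residual / row_norms[i] * A[i]
        return x

    # quick sanity check on a small consistent system
    A = np.random.randn(50, 10)
    x_true = np.random.randn(10)
    print(np.linalg.norm(basic_method_kaczmarz(A, A @ x_true) - x_true))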

Stochastic Recursive Gradient Algorithm for Nonconvex Optimization

Lam Nguyen, Jie Liu, Katya Scheinberg and Martin Takáč
2017 arXiv Preprint arXiv

Abstract

In this paper, we study and analyze the mini-batch version of StochAstic Recursive grAdient algoritHm (SARAH), a method employing the stochastic recursive gradient, for solving empirical loss minimization for the case of nonconvex losses. We provide a sublinear convergence rate (to stationary points) for general nonconvex functions and a linear convergence rate for gradient dominated functions, both of which have some advantages compared to other modern stochastic gradient algorithms for nonconvex losses.

SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient

Lam Nguyen, Jie Liu, Katya Scheinberg and Martin Takáč
2017 Conference Paper ICML 2017 (34th International Conference on Machine Learning)

Abstract

In this paper, we propose a StochAstic Recursive grAdient algoritHm (SARAH), as well as its practical variant SARAH+, as a novel approach to finite-sum minimization problems. Different from vanilla SGD and other modern stochastic methods such as SVRG, S2GD, SAG and SAGA, SARAH admits a simple recursive framework for updating stochastic gradient estimates; compared to SAG/SAGA, SARAH does not require storage of past gradients. The linear convergence rate of SARAH is proven under a strong convexity assumption. We also prove a linear convergence rate (in the strongly convex case) for an inner loop of SARAH, a property that SVRG does not possess. Numerical experiments demonstrate the efficiency of our algorithm.
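A minimal single-machine sketch of the recursive gradient estimate described above; grad_i (a component-gradient oracle), the stepsize and the loop lengths are illustrative placeholders.

    import numpy as np

    def sarah(grad_i, n, w0, eta=0.01, outer=20, inner=100, seed=0):
        """The estimate v is updated recursively from its previous value using the
        gradient of a single component at the current and previous iterates, so no
        table of past gradients is stored (unlike SAG/SAGA)."""
        rng = np.random.default_rng(seed)
        w = w0.copy()
        for _ in range(outer):
            v = np.mean([grad_i(w, i) for i in range(n)], axis=0)   # full gradient
            w_prev, w = w, w - eta * v
            for _ in range(inner):
                i = rng.integers(n)
                v = grad_i(w, i) - grad_i(w_prev, i) + v            # recursive estimate
                w_prev, w = w, w - eta * v
        return w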

Projected Semi-Stochastic Gradient Descent Method with Mini-Batch Scheme under Weak Strong Convexity Assumption

Jie Liu and Martin Takáč
2017 Conference Paper Proceedings of MOPTA 2016

Abstract

We propose a projected semi-stochastic gradient descent method with mini-batches for improving both the theoretical complexity and practical performance of the general stochastic gradient descent method (SGD). We are able to prove linear convergence under a weak strong convexity assumption. This requires no strong convexity assumption for minimizing the sum of smooth convex functions subject to a compact polyhedral set, a setting which remains popular across the machine learning community. Our PS2GD preserves low cost per iteration and high optimization accuracy via a stochastic gradient variance-reduction technique, and admits a simple parallel implementation with mini-batches. Moreover, PS2GD is also applicable to the dual problem of SVM with hinge loss.

Linear Convergence of the Randomized Feasible Descent Method Under the Weak Strong Convexity Assumption

Chenxin Ma, Rachael Tappenden and Martin Takáč
2016 Journal Paper Journal of Machine Learning Research

Abstract

In this paper we generalize the framework of the feasible descent method (FDM) to a randomized (R-FDM) and a coordinate-wise random feasible descent method (RC-FDM) framework. We show that the famous SDCA algorithm for optimizing the SVM dual problem, or the stochastic coordinate descent method for the LASSO problem, fits into the framework of RC-FDM. We prove linear convergence for both R-FDM and RC-FDM under the weak strong convexity assumption. Moreover, we show that the duality gap converges linearly for RC-FDM, which implies that the duality gap also converges linearly for SDCA applied to the SVM dual problem.

Structural Damage Detection Using Convolutional Neural Networks

Nur Sila Gulgec, Martin Takáč and Shamim N. Pakzad
2016 Conference Paper IMAC-XXXV

Abstract

Detection of the deficiencies affecting the performance of structures has been studied over the past few decades. However, with long-term data collection from dense sensor arrays, accurate damage diagnosis has become a computationally challenging task. To address this problem, this paper introduces the convolutional neural network (CNN), which has led to breakthrough results in computer vision, to the damage detection challenge. The CNN technique has the ability to discover abstract features which can discriminate various aspects of interest. In our case, these features are used to classify "damaged" and "healthy" cases modeled through finite element simulations. The CNN is implemented using the Python library Theano with a graphics processing unit (GPU) to achieve higher performance on these data-intensive calculations. The accuracy and sensitivity of the proposed technique are assessed with a cracked steel gusset connection model with multiplicative noise. During the training procedure, strain distributions generated from different crack and loading scenarios are adopted. Completely unseen damage setups are introduced to the simulations during testing. Based on the findings of the proposed study, high accuracy, robustness and computational efficiency are achieved for the damage diagnosis.

CoCoA: A General Framework for Communication-Efficient Distributed Optimization

Virginia Smith, Simone Forte, Chenxin Ma, Martin Takáč, Michael I. Jordan, Martin Jaggi
2016 arXiv Preprint arXiv

Abstract

The scale of modern datasets necessitates the development of efficient distributed optimization methods for machine learning. We present a general-purpose framework for the distributed environment, CoCoA, that has an efficient communication scheme and is applicable to a wide variety of problems in machine learning and signal processing. We extend the framework to cover general non-strongly convex regularizers, including L1-regularized problems like lasso, sparse logistic regression, and elastic net regularization, and show how earlier work can be derived as a special case. We provide convergence guarantees for the class of convex regularized loss minimization objectives, leveraging a novel approach in handling non-strongly convex regularizers and non-smooth loss functions. The resulting framework has markedly improved performance over state-of-the-art methods, as we illustrate with an extensive set of experiments on real distributed datasets.

A Multi-Batch L-BFGS Method for Machine Learning

Albert S. Berahas, Jorge Nocedal and Martin Takáč
2016 Conference Paper NIPS

Abstract

The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This inherently gives the algorithm a stochastic flavor that can cause instability in L-BFGS, a popular batch method in machine learning. These difficulties arise because L-BFGS employs gradient differences to update the Hessian approximations; when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.

On optimal probabilities in stochastic coordinate descent methods (code: 'NSync)

Peter Richtárik and Martin Takáč
2016 Journal Paper Optimization Letters, 10(6), 1233-1243

Abstract

We propose and analyze a new parallel coordinate descent method—NSync—in which at each iteration a random subset of coordinates is updated, in parallel, allowing for the subsets to be chosen using an arbitrary probability law. This is the first method of this type. We derive convergence rates under a strong convexity assumption, and comment on how to assign probabilities to the sets to optimize the bound. In both complexity and practical performance, the method can outperform its uniform variant by an order of magnitude. Surprisingly, the strategy of updating a single randomly selected coordinate per iteration—with optimal probabilities—may require fewer iterations, both in theory and practice, than the strategy of updating all coordinates at every iteration.
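A serial illustration of the sampling idea, assuming one simple law (each coordinate included independently with its own probability); the probabilities p and the per-coordinate stepsize parameters v below are placeholder arrays, not the optimized choices derived in the paper.

    import numpy as np

    def nsync_step(x, grad, p, v, rng):
        """Update a random subset of coordinates: coordinate i is included with
        probability p[i]; each selected coordinate takes a step of length 1/v[i]
        along its negative partial derivative (v plays the role of an ESO-type
        parameter tied to the sampling)."""
        mask = rng.random(x.size) < p
        g = grad(x)
        x = x.copy()
        x[mask] -= g[mask] / v[mask]
        return x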

Matrix Completion under Interval Uncertainty

Jakub Mareček, Peter Richtárik and Martin Takáč
2016 Journal Paper European Journal of Operational Research (to appear)

Abstract

Matrix completion under interval uncertainty can be cast as a matrix completion problem with element-wise box constraints. We present an efficient alternating-direction parallel coordinate-descent method for the problem. We show that the method outperforms any other known method on a benchmark in image in-painting in terms of signal-to-noise ratio, and that it provides high-quality solutions for an instance of collaborative filtering with 100,198,805 recommendations within 5 minutes on a single personal computer.

Applying Deep Learning to the Newsvendor Problem

Afshin Oroojlooyjadid, Lawrence Snyder and Martin Takáč
2016 arXiv Preprint arXiv

Abstract

The newsvendor problem is one of the most basic and widely applied inventory models. There are numerous extensions of this problem. One important extension is the multi-item newsvendor problem, in which the demand of each item may be correlated with that of other items. If the joint probability distribution of the demand is known, the problem can be solved analytically. However, approximating the probability distribution is not easy and is prone to error; therefore, the resulting solution to the newsvendor problem may not be optimal. To address this issue, we propose an algorithm based on deep learning that optimizes the order quantities for all products based on features of the demand data. Our algorithm integrates the forecasting and inventory-optimization steps, rather than solving them separately as is typically done. The algorithm does not require knowledge of the probability distributions of the demand. Numerical experiments on real-world data suggest that our algorithm outperforms other approaches, including data-driven and SVM approaches, especially for demands with high volatility.
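To illustrate training directly on the newsvendor cost rather than on a separate forecasting loss, a toy version with a linear order-quantity model (the paper uses a deep network; the cost coefficients, features and model below are placeholders).

    import numpy as np

    def newsvendor_cost(q, d, cp=5.0, ch=1.0):
        """Per-sample cost: shortage penalty cp per unit of unmet demand,
        holding cost ch per unit of leftover inventory."""
        return cp * np.maximum(d - q, 0.0) + ch * np.maximum(q - d, 0.0)

    def fit_linear_order_policy(X, d, cp=5.0, ch=1.0, lr=1e-3, epochs=200):
        """Subgradient descent on the average newsvendor cost of q = X @ w,
        i.e. forecasting and inventory optimization are merged into one loss."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            q = X @ w
            g_q = np.where(q < d, -cp, ch)   # subgradient of the piecewise-linear cost w.r.t. q
            w -= lr * X.T @ g_q / len(d)
        return w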

Large Scale Distributed Hessian-Free Optimization for Deep Neural Network

Xi He, Dheevatsa Mudigere, Mikhail Smelyanskiy and Martin Takáč
2016 arXiv Preprint arXiv

Abstract

Training deep neural networks is a high-dimensional and highly non-convex optimization problem. The stochastic gradient descent (SGD) algorithm and its variants are the current state-of-the-art solvers for this task. However, due to the non-convex nature of the problem, it has been observed that SGD slows down near saddle points. Recent empirical work claims that by detecting and escaping saddle points efficiently, training performance is likely to improve. With this objective, we revisit the Hessian-free optimization method for deep networks. We also develop its distributed variant and demonstrate superior scaling potential to SGD, which allows larger computing resources to be utilized more efficiently, thus enabling larger models and faster time to the desired solution. Furthermore, unlike the truncated Newton method (Martens' HF), which ignores negative curvature information by using a naive conjugate gradient method and Gauss-Newton Hessian approximation information, we propose a novel algorithm that explores negative curvature directions by solving the sub-problem with a stabilized bi-conjugate method involving possibly indefinite stochastic Hessian information. We show that these techniques accelerate the training process for both the standard MNIST dataset and the TIMIT speech recognition problem, demonstrating robust performance with up to an order of magnitude larger batch sizes. This increased scaling potential is illustrated with near-linear speed-up on up to 16 CPU nodes for a simple 4-layer network.
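For orientation, the vanilla Hessian-free (truncated Newton) template that the above builds on: Hessian-vector products obtained without forming the Hessian, fed into a few conjugate-gradient iterations. This sketch is the generic single-machine recipe, not the paper's distributed, stabilized bi-conjugate-gradient variant, and the finite-difference product below is one common approximation.

    import numpy as np

    def hess_vec(grad, w, v, eps=1e-6):
        """Approximate the Hessian-vector product H(w) v with a central finite
        difference of gradients, avoiding explicit Hessian storage."""
        return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

    def truncated_newton_direction(grad, w, cg_iters=10):
        """Run a few CG iterations on H d = -g to obtain an approximate Newton
        direction (plain CG assumes H is positive definite on the Krylov space)."""
        g = grad(w)
        d = np.zeros_like(w)
        r = -g.copy()                  # residual of H d + g = 0 at d = 0
        p = r.copy()
        for _ in range(cg_iters):
            Hp = hess_vec(grad, w, p)
            alpha = (r @ r) / (p @ Hp + 1e-12)
            d += alpha * p
            r_new = r - alpha * Hp
            beta = (r_new @ r_new) / (r @ r + 1e-12)
            p = r_new + beta * p
            r = r_new
        return d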

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization

Zheng Qu, Peter Richtárik, Martin Takáč and Olivier Fercoq
2016 Conference Paper ICML 2016 (33rd International Conference on Machine Learning)

Abstract

We propose a new algorithm for minimizing regularized empirical loss: Stochastic Dual Newton Ascent (SDNA). Our method is dual in nature: in each iteration we update a random subset of the dual variables. However, unlike existing methods such as stochastic dual coordinate ascent, SDNA is capable of utilizing all curvature information contained in the examples, which leads to striking improvements in both theory and practice - sometimes by orders of magnitude. In the special case when an L2-regularizer is used in the primal, the dual problem is a concave quadratic maximization problem plus a separable term. In this regime, SDNA in each step solves a proximal subproblem involving a random principal submatrix of the Hessian of the quadratic function; whence the name of the method. If, in addition, the loss functions are quadratic, our method can be interpreted as a novel variant of the recently introduced Iterative Hessian Sketch.
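A toy version of the step in the quadratic-dual regime described above, with the separable term omitted for brevity: maximize a concave quadratic by repeatedly solving exactly over a random block, using the corresponding principal submatrix of the curvature matrix. The subset size, sampling and problem data are illustrative.

    import numpy as np

    def sdna_quadratic(M, b, tau=5, iters=500, seed=0):
        """Maximize b^T a - 0.5 a^T M a (M positive definite): draw a random subset
        S of tau coordinates and take the exact block-Newton step using M[S, S]."""
        rng = np.random.default_rng(seed)
        n = len(b)
        a = np.zeros(n)
        for _ in range(iters):
            S = rng.choice(n, size=tau, replace=False)
            g_S = b[S] - M[S] @ a                        # block of the dual gradient
            a[S] += np.linalg.solve(M[np.ix_(S, S)], g_S)
        return a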

Distributed Inexact Damped Newton Method: Data Partitioning and Load-Balancing

Chenxin Ma and Martin Takáč
2016 arXiv Preprint arXiv

Abstract

In this paper we study an inexact damped Newton method implemented in a distributed environment. We start with the original DiSCO algorithm [Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and Lin Xiao, 2015]. We show that this algorithm may not scale well and propose algorithmic modifications which lead to less communication, better load-balancing and more efficient computation. We perform numerical experiments with a regularized empirical loss minimization instance described by a 273GB dataset.

Primal-Dual Rates and Certificates

Celestine Dunner, Simone Forte, Martin Takáč and Martin Jaggi
2016 Conference Paper ICML 2016 (33rd International Conference on Machine Learning)

Abstract

We propose an algorithm-independent framework to equip existing optimization methods with primal-dual certificates. Such certificates and corresponding rate of convergence guarantees are important for practitioners to diagnose progress, in particular in machine learning applications. We obtain new primal-dual convergence rates e.g. for the Lasso as well as many L1, Elastic-Net and group-lasso-regularized problems. The theory applies to any norm-regularized generalized linear model. Our approach provides efficiently computable duality gaps which are globally defined, without modifying the original problems in the region of interest.

Distributed coordinate descent method for learning with big data (code: Hydra)

Peter Richtárik and Martin Takáč
2016 Journal Paper Journal of Machine Learning Research

Abstract

In this paper we develop and analyze Hydra: HYbriD cooRdinAte descent method for solving loss minimization problems with big data. We initially partition the coordinates (features) and assign each partition to a different node of a cluster. At every iteration, each node picks a random subset of the coordinates from those it owns, independently from the other computers, and in parallel computes and applies updates to the selected coordinates based on a simple closed-form formula. We give bounds on the number of iterations sufficient to approximately solve the problem with high probability, and show how it depends on the data and on the partitioning. We perform numerical experiments with a LASSO instance described by a 3TB matrix.
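A single-machine simulation of the update pattern, instantiated for an L1-regularized least-squares (LASSO) objective so that the per-coordinate update has the closed form mentioned above; the partitioning, subset size tau and stepsize parameter beta are placeholders (in the paper the stepsize constant comes from the analysis of the data and the partitioning).

    import numpy as np

    def soft_threshold(t, lam):
        return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

    def hydra_simulated(A, b, lam, nodes=4, tau=2, beta=2.0, iters=2000, seed=0):
        """Coordinates are split across 'nodes'; in each round every node picks tau
        of its own coordinates at random and applies a closed-form prox update."""
        rng = np.random.default_rng(seed)
        n = A.shape[1]
        x = np.zeros(n)
        col_norms = (A ** 2).sum(axis=0)
        partitions = np.array_split(rng.permutation(n), nodes)
        residual = A @ x - b
        for _ in range(iters):
            for part in partitions:                       # conceptually in parallel across nodes
                for j in rng.choice(part, size=min(tau, len(part)), replace=False):
                    grad_j = A[:, j] @ residual
                    Lj = beta * col_norms[j]              # stepsize constant
                    x_new = soft_threshold(x[j] - grad_j / Lj, lam / Lj)
                    residual += (x_new - x[j]) * A[:, j]
                    x[j] = x_new
        return x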

Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting

Jakub Konečný, Jie Liu, Peter Richtárik and Martin Takáč
2016 Journal Paper IEEE Journal of Selected Topics in Signal Processing (to appear)

Abstract

We propose mS2GD: a method incorporating a mini-batching scheme for improving the theoretical complexity and practical performance of semi-stochastic gradient descent (S2GD). We consider the problem of minimizing a strongly convex function represented as the sum of an average of a large number of smooth convex functions, and a simple nonsmooth convex regularizer. Our method first performs a deterministic step (computation of the gradient of the objective function at the starting point), followed by a large number of stochastic steps. The process is repeated a few times with the last iterate becoming the new starting point. The novelty of our method is in the introduction of mini-batching into the computation of stochastic steps. In each step, instead of choosing a single function, we sample b functions, compute their gradients, and compute the direction based on this. We analyze the complexity of the method and show that it benefits from two speedup effects. First, we prove that as long as b is below a certain threshold, we can reach any predefined accuracy with less overall work than without mini-batching. Second, our mini-batching scheme admits a simple parallel implementation, and hence is suitable for further acceleration by parallelization.
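A compact sketch of this scheme; grad_i (a component-gradient oracle), the proximal operator, the mini-batch size and the loop lengths are illustrative placeholders.

    import numpy as np

    def ms2gd(grad_i, prox, n, w0, b=8, eta=0.01, outer=10, inner=200, seed=0):
        """Outer loop: one full gradient at the starting point.  Inner loop:
        mini-batch variance-reduced steps followed by a proximal step that
        handles the nonsmooth regularizer."""
        rng = np.random.default_rng(seed)
        w = w0.copy()
        for _ in range(outer):
            w_start = w.copy()
            mu = np.mean([grad_i(w_start, i) for i in range(n)], axis=0)
            for _ in range(inner):
                batch = rng.choice(n, size=b, replace=False)
                v = mu + np.mean([grad_i(w, i) - grad_i(w_start, i) for i in batch], axis=0)
                w = prox(w - eta * v, eta)
        return w

    # example prox for an L1 regularizer lam * ||w||_1
    def prox_l1(z, eta, lam=0.1):
        return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)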

Hybrid Methods in Solving Alternating-Current Optimal Power Flows

Jie Liu, Jakub Marecek and Martin Takáč
2015 arXiv Preprint

Abstract

Optimisation problems in power systems employing alternating-current models of power flows have recently driven much interest in (convergent hierarchies of) convex relaxations for polynomial optimisation problems. Readily available second-order methods for solving the convex relaxations on real-world large-scale power systems often fail to perform even a single iteration within reasonable run-times. First-order methods have much lower per-iteration computational and memory requirements, but require many more iterations than second-order methods to converge to the same accuracy. We hence study means of switching from first-order methods for solving the convex relaxation to a Newton method working on the original non-convex problem, which would allow for convergence under the same conditions as in solvers for the convex relaxation, but with an improved rate of convergence. We illustrate our approach on alternating-current power flows (ACPF) and alternating-current optimal power flows (ACOPF).

Partitioning Data on Features or Samples in Communication-Efficient Distributed Optimization?

Chenxin Ma and Martin Takáč
2015 Conference Paper OptML@NIPS 2015

Abstract

In this paper we study the effect of the way that the data is partitioned in distributed optimization. The original DiSCO algorithm [Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and Lin Xiao, 2015] partitions the input data based on samples. We describe how the original algorithm has to be modified to allow partitioning on features and show its efficiency both in theory and also in practice.

Dual Free SDCA for Empirical Risk Minimization with Adaptive Probabilities

Xi He and Martin Takáč
2015 Conference Paper OptML@NIPS 2015

Abstract

In this paper we develop dual-free SDCA with adaptive probabilities for regularized empirical risk minimization. This extends the recent work of Shai Shalev-Shwartz [SDCA without Duality, arXiv:1502.06177] to allow non-uniform selection of the "dual" coordinate in SDCA. Moreover, the probabilities can change over time, making the method more efficient than uniform selection. Our work focuses on generating adaptive probabilities through an iterative process, preferring coordinates with the highest potential to decrease suboptimality. We also propose a more aggressive practical variant, adfSDCA+. The work concludes with multiple experiments which show the efficiency of the proposed algorithms.

Distributed Mini-Batch SDCA

Martin Takáč, Peter Richtárik and Nathan Srebro
2015 arXiv Preprint

Abstract

We present an improved analysis of mini-batched stochastic dual coordinate ascent for regularized empirical loss minimization (i.e. SVM and SVM-type objectives). Our analysis allows for flexible sampling schemes, including those where the data is distributed across machines, and combines a dependence on the smoothness of the loss and/or the spread of the data (measured through the spectral norm).

Adding vs. Averaging in Distributed Primal-Dual Optimization

Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik and Martin Takáč
2015 Conference Paper ICML 2015 (32nd International Conference on Machine Learning)

Abstract

Distributed optimization algorithms for large-scale machine learning suffer from a communication bottleneck. Reducing communication makes the efficient aggregation of partial work from different machines more challenging. In this paper we present a novel generalization of the recent communication efficient primal-dual coordinate ascent framework (CoCoA). Our framework, CoCoA+, allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes only allowed conservative averaging. We give stronger (primal-dual) convergence rate guarantees for both CoCoA as well as our new variants, and generalize the theory for both methods to also cover non-smooth convex loss functions. We provide an extensive experimental comparison on several real-world distributed datasets, showing markedly improved performance, especially when scaling up the number of machines.
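The difference in the aggregation step can be illustrated in a few lines (the local solver, the update representation and the names are placeholders):

    import numpy as np

    def aggregate(local_deltas, scheme="add"):
        """Combine per-machine updates to the shared parameters: averaging uses
        gamma = 1/K as in CoCoA, while CoCoA+ permits adding (gamma = 1), with
        the local subproblems made correspondingly more conservative."""
        K = len(local_deltas)
        gamma = 1.0 if scheme == "add" else 1.0 / K
        return gamma * np.sum(local_deltas, axis=0)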

Parallel Coordinate Descent Methods for Big Data Optimization

Peter Richtárik and Martin Takáč
2015 Journal Paper Mathematical Programming

Abstract

In this work we show that randomized (block) coordinate descent methods can be accelerated by parallelization when applied to the problem of minimizing the sum of a partially separable smooth convex function and a simple separable convex function. The theoretical speedup, as compared to the serial method, and referring to the number of iterations needed to approximately solve the problem with high probability, is a simple expression depending on the number of parallel processors and a natural and easily computable measure of separability of the smooth component of the objective function. In the worst case, when no degree of separability is present, there may be no speedup; in the best case, when the problem is separable, the speedup is equal to the number of processors.
Our analysis also works in the mode when the number of blocks being updated at each iteration is random, which allows for modeling situations with busy or unreliable processors. We show that our algorithm is able to solve a LASSO problem involving a matrix with 20 billion nonzeros in 2 hours on a large memory node with 24 cores.

mS2GD: Mini-batch semi-stochastic gradient descent in the proximal setting (code: mS2GD)

Jakub Konečný, Jie Liu, Peter Richtárik and Martin Takáč
2014 Conference Paper OPT 2014: Optimization for Machine Learning @NIPS 2014

Abstract

We propose a mini-batching scheme for improving the theoretical complexity and practical performance of semi-stochastic gradient descent applied to the problem of minimizing a strongly convex composite function represented as the sum of an average of a large number of smooth convex functions, and a simple nonsmooth convex function. Our method first performs a deterministic step (computation of the gradient of the objective function at the starting point), followed by a large number of stochastic steps. The process is repeated a few times with the last iterate becoming the new starting point. The novelty of our method is in the introduction of mini-batching into the computation of stochastic steps. In each step, instead of choosing a single function, we sample b functions, compute their gradients, and compute the direction based on this. We analyze the complexity of the method and show that the method benefits from two speedup effects. First, we prove that as long as b is below a certain threshold, we can reach a predefined accuracy with less overall work than without mini-batching. Second, our mini-batching scheme admits a simple parallel implementation, and hence is suitable for further acceleration by parallelization.

Communication-Efficient Distributed Dual Coordinate Ascent

Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Thomas Hofmann and Michael I. Jordan
2014 Conference Paper NIPS

Abstract

Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In this paper, we propose a communication-efficient framework, CoCoA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations in Spark. In our experiments, we find that as compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, CoCoA converges to the same .001-accurate solution quality on average 25x as quickly.

Randomized Coordinate Descent Methods for Big Data Optimization

Martin Takáč
2014 PhD thesis University of Edinburgh

Abstract

Distributed Block Coordinate Descent for Minimizing Partially Separable Functions

Jakub Mareček, Peter Richtárik and Martin Takáč
2014 Journal Paper Numerical Analysis and Optimization 2014, Springer Proceedings in Mathematics and Statistics

Abstract

In this work we propose a distributed randomized block coordinate descent method for minimizing a convex function with a huge number of variables/coordinates. We analyze its complexity under the assumption that the smooth part of the objective function is partially block separable, and show that the degree of separability directly influences the complexity. This extends the results in [22] to a distributed environment. We first show that partially block separable functions admit an expected separable overapproximation (ESO) with respect to a distributed sampling, compute the ESO parameters, and then specialize complexity results from recent literature that hold under the generic ESO assumption. We describe several approaches to distribution and synchronization of the computation across a cluster of multi-core computers and provide promising computational results.

Fast Distributed Coordinate Descent for Non-Strongly Convex Losses

Olivier Fercoq, Zheng Qu, Peter Richtárik and Martin Takáč
2014 Conference Paper MLSP2014: IEEE International Workshop on Machine Learning for Signal Processing

Abstract

We propose an efficient distributed randomized coordinate descent method for minimizing regularized non-strongly convex loss functions. The method attains the optimal O(1/k^2) convergence rate, where k is the iteration counter. The core of the work is the theoretical study of stepsize parameters. We have implemented the method on Archer - the largest supercomputer in the UK - and show that the method is capable of solving a (synthetic) LASSO optimization problem with 50 billion variables.

TOP-SPIN: TOPic discovery via Sparse Principal component INterference

Martin Takáč, Selin Damla Ahipasaoglu, Ngai-Man Cheung and Peter Richtárik
2013 arXiv Preprint

Abstract

We propose a novel topic discovery algorithm for unlabeled images based on the bag-of-words (BoW) framework. We first extract a dictionary of visual words and subsequently for each image compute a visual word occurrence histogram. We view these histograms as rows of a large matrix from which we extract sparse principal components (PCs). Each PC identifies a sparse combination of visual words which co-occur frequently in some images but seldom appear in others. Each sparse PC corresponds to a topic, and images whose interference with the PC is high belong to that topic, revealing the common parts possessed by the images. We propose to solve the associated sparse PCA problems using an Alternating Maximization (AM) method, which we modify for purpose of efficiently extracting multiple PCs in a deflation scheme. Our approach attacks the maximization problem in sparse PCA directly and is scalable to high-dimensional data. Experiments on automatic topic discovery and category prediction demonstrate encouraging performance of our approach.
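A small sketch of one alternating-maximization iteration for a single L0-constrained sparse principal component; the deflation scheme used for extracting multiple components is omitted, and the sparsity level k and the data are placeholders.

    import numpy as np

    def sparse_pc_am(A, k, iters=100, seed=0):
        """Alternate y = A x / ||A x|| with a hard-threshold of A^T y that keeps the
        k largest-magnitude entries, renormalizing x.  This maximizes y^T A x over
        ||y|| = ||x|| = 1 with ||x||_0 <= k."""
        rng = np.random.default_rng(seed)
        n = A.shape[1]
        x = rng.standard_normal(n)
        x /= np.linalg.norm(x)
        for _ in range(iters):
            y = A @ x
            y /= np.linalg.norm(y)
            z = A.T @ y
            keep = np.argsort(np.abs(z))[-k:]      # indices of the k largest entries
            x = np.zeros(n)
            x[keep] = z[keep]
            x /= np.linalg.norm(x)
        return x                                    # sparse loading vector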

Mini-Batch Primal and Dual Methods for SVMs

Martin Takáč, Avleen Bijral, Peter Richtárik and Nathan Srebro
2013 Conference Paper ICML 2013 (30th International Conference on Machine Learning)

Abstract

We address the issue of using mini-batches in stochastic optimization of SVMs. We show that the same quantity, the spectral norm of the data, controls the parallelization speedup obtained for both primal stochastic subgradient descent (SGD) and stochastic dual coordinate ascent (SDCA) methods and use it to derive novel variants of mini-batched SDCA. Our guarantees for both methods are expressed in terms of the original nonsmooth primal problem based on the hinge-loss.
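A minimal mini-batch primal subgradient step for the hinge loss, in the Pegasos style; the safe mini-batch scaling derived from the spectral norm in the paper is replaced here by a plain average over the batch, and the stepsize schedule is illustrative.

    import numpy as np

    def minibatch_hinge_sgd(X, y, lam=1e-3, b=16, epochs=5, seed=0):
        """Primal stochastic subgradient descent for the L2-regularized SVM,
        averaging hinge-loss subgradients over a mini-batch of size b."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        t = 0
        for _ in range(epochs):
            for _ in range(n // b):
                t += 1
                eta = 1.0 / (lam * t)              # Pegasos-style stepsize schedule
                idx = rng.choice(n, size=b, replace=False)
                margins = y[idx] * (X[idx] @ w)
                active = margins < 1               # samples violating the margin
                g = lam * w
                if active.any():
                    y_act = y[idx][active]
                    X_act = X[idx][active]
                    g = g - (y_act[:, None] * X_act).sum(axis=0) / b
                w = w - eta * g
        return w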

Alternating Maximization: Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes

Peter Richtárik, Martin Takáč and Selin Damla Ahipasaoglu
2012 arXiv Preprint

Abstract

Given a multivariate data set, sparse principal component analysis (SPCA) aims to extract several linear combinations of the variables that together explain the variance in the data as much as possible, while controlling the number of nonzero loadings in these combinations.
In this paper we consider 8 different optimization formulations for computing a single sparse loading vector; these are obtained by combining the following factors: we employ two norms for measuring variance (L2, L1) and two sparsity-inducing norms (L0, L1), which are used in two different ways (constraint, penalty). Three of our formulations, notably the one with L0 constraint and L1 variance, have not been considered in the literature.
We give a unifying reformulation which we propose to solve via a natural alternating maximization (AM) method. We show that the AM method is nontrivially equivalent to GPower (Journee et al; JMLR 11:517--553, 2010) for all our formulations. Besides this, we provide 24 efficient parallel SPCA implementations: 3 codes (multi-core, GPU and cluster) for each of the 8 problems.
Parallelism in the methods is aimed at

  1. speeding up computations (our GPU code can be 100 times faster than an efficient serial code written in C++),
  2. obtaining solutions explaining more variance and
  3. dealing with big data problems (our cluster code is able to solve a 357 GB problem in about a minute).

Efficient serial and parallel coordinate descent methods for huge-scale truss topology design

Peter Richtárik and Martin Takáč
2011 Conference Paper Operations Research Proceedings 2011, pp. 27-32, Springer-Verlag 2012

Abstract

In this work we propose solving huge-scale instances of the truss topology design problem with coordinate descent methods. We develop four efficient codes: serial and parallel implementations of randomized and greedy rules for the selection of the variable (potential bar) to be updated in the next iteration. Both serial methods enjoy an O(n/k) iteration complexity guarantee, where n is the number of potential bars and k the iteration counter. Our parallel implementations, written in CUDA and running on a graphical processing unit (GPU), are capable of speedups of up to two orders of magnitude when compared to their serial counterparts. Numerical experiments were performed on instances with up to 30 million potential bars.

Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function

Peter Richtárik and Martin Takáč
2011 Journal Paper Mathematical Programming, Series A, 38 pages, 2012

Abstract

In this paper we develop a randomized block-coordinate descent method for minimizing the sum of a smooth and a simple nonsmooth block-separable convex function and prove that it obtains an $\epsilon$-accurate solution with probability at least $1-\rho$ in at most $O(\frac{n}{\epsilon} \log \frac{1}{\rho})$ iterations, where $n$ is the number of blocks. For strongly convex functions the method converges linearly. This extends recent results of Nesterov [Efficiency of coordinate descent methods on huge-scale optimization problems, CORE Discussion Paper \#2010/2], which cover the smooth case, to composite minimization, while at the same time improving the complexity by a factor of 4 and removing $\epsilon$ from the logarithmic term. More importantly, in contrast with the aforementioned work in which the author achieves the results by applying the method to a regularized version of the objective function with an unknown scaling factor, we show that this is not necessary, thus achieving true iteration complexity bounds. In the smooth case we also allow for arbitrary probability vectors and non-Euclidean norms. Finally, we demonstrate numerically that the algorithm is able to solve huge-scale $\ell_1$-regularized least squares and support vector machine problems with a billion variables.

Efficiency of randomized coordinate descent methods on minimization problems with a composite objective function

Peter Richtárik and Martin Takáč
2011 Conference Paper Proceedings of SPARS11 (4th Workshop on Signal Processing with Adaptive Sparse Structured Representations), June 27-30, 2011

Abstract

Sensitivity analysis of the early exercise boundary for American style of Asian options

Daniel Ševcovič and Martin Takáč
2011 Journal Paper International Journal of Numerical Analysis and Modeling, Ser. B, 2(2-3), 2011 231-247

Abstract

In this paper we analyze American-style floating strike Asian call options, which belong to the class of financial derivatives whose payoff diagram depends not only on the underlying asset price but also on the path average of underlying asset prices over some predetermined time interval.
The mathematical model for the option price leads to a free boundary problem for a parabolic partial differential equation. Applying fixed domain transformation and transformation of variables we develop an efficient numerical algorithm based on a solution to a non-local parabolic partial differential equation for the transformed variable representing the synthesized portfolio. For various types of averaging methods we investigate the dependence of the early exercise boundary on model parameters.