Metadata-Version: 2.1
Name: DaMa-ML
Version: 1.0a0
Summary: A framework for data management and used to do data science and machine learning's pipelines
Home-page: https://github.com/elaeon/dama_ml
Author: Alejandro G. Martínez Romero
Author-email: mara80@gmail.com
License: Apache
Keywords: data python management machine learning science
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.6
Requires-Dist: matplotlib (>=2.2)
Requires-Dist: networkx (>=1.11)
Requires-Dist: numpy (>=1.12)
Requires-Dist: pandas (>=0.19.2)
Requires-Dist: scikit-learn (>=0.18)
Requires-Dist: scipy (>=0.18.1)
Requires-Dist: tabulate (>=0.7.5)
Requires-Dist: tensorflow (>=1.0.1)
Requires-Dist: tqdm (>=4.11.2)
Requires-Dist: h5py (>=2.6.0)
Requires-Dist: climin (>=0.1a1)
Requires-Dist: xmltodict (>=0.10.2)
Requires-Dist: keras (>=2.0.2)
Requires-Dist: psycopg2 (>=2.7.3.2)
Requires-Dist: dask (>=0.18.2)
Requires-Dist: xarray (>=0.10.9)
Requires-Dist: zarr (>=2.2.0)
Requires-Dist: numcodecs (>=0.6.1)
Requires-Dist: toolz (>=0.9.0)
Requires-Dist: cytoolz (>=0.9.0.1)
Requires-Dist: msgpack-python (>=0.5.6)
Requires-Dist: scikit-image (>=0.14.0)
Requires-Dist: GitPython (>=2.1.11)
Requires-Dist: psutil (>=5.5.1)
Requires-Dist: colorlog (>=4.0.2)
Provides-Extra: docs
Requires-Dist: sphinx (>=1.4.0) ; extra == 'docs'
Provides-Extra: test

.. image:: https://travis-ci.org/elaeon/dama_ml.svg?branch=master
    :target: https://travis-ci.org/elaeon/dama_ml

.. image:: https://api.codacy.com/project/badge/Grade/0ab998e72f4f4e31b3dc7b3c9921374a
    :target: https://www.codacy.com/app/elaeon/dama_ml?utm_source=github.com&utm_medium=referral&utm_content=elaeon/dama_ml&utm_campaign=Badge_Grade


Warning
=============
    Although the API is stable, this work is in an alpha stage, and some methods have limited functionality or are not implemented.


Overview
=====================================

Dama ML is a framework for data management that is used to build data science and machine learning pipelines. It also tries to unify diverse data sources (csv, sql db, hdf5, zarr, etc.) and machine learning frameworks (sklearn, Keras, LightGBM, etc.) behind a simplified interface.

For more detail read the docs_. 

.. _docs: https://elaeon.github.io/dama_ml/


Installation
=====================

.. code-block:: bash

    git clone https://github.com/elaeon/dama_ml.git
    pip install dama_ml/
    # or
    pip install dama_ml


You can install the Python dependencies with pip, but we strongly recommend installing them with conda and conda-forge.

.. code-block:: bash

    conda config --add channels conda-forge
    conda create -n new_environment --file dama_ml/requirements.txt
    source activate new_environment
    pip install dama_ml/


Quick start
==================

First, configure the paths where all data will be saved. This can be done with the dama_ml CLI tools.

.. code-block:: bash

    dama-cli config --edit

This will display a nano editor where you can edit data_path, models_path, code_path, class_path, metadata_path.

* data_path is where all datasets will be saved.
* models_path is where all files from your models will be saved.
* code_path is the code repository.
* metadata_path is where the metadata database will be saved.

Building a dataset:

.. code-block:: python

    from dama.data.ds import Data
    from dama.drivers.core import Zarr, HDF5
    import numpy as np

    array_0 = np.random.rand(100, 1)
    array_1 = np.random.rand(100,)
    array_2 = np.random.rand(100, 3)
    array_3 = np.random.rand(100, 6)
    array_4 = (np.random.rand(100)*100).astype(int)
    array_5 = np.random.rand(100).astype(str)
    # "reg0" is the dataset name that appears in the CLI listing later in this document
    with Data(name="reg0", driver=Zarr(mode="w")) as data:
        data.from_data({"x": array_0, "y": array_1, "z": array_2, "a": array_3, "b": array_4, "c": array_5})
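
Before storing the arrays, it can help to check the shape and dtype of each group, since the dictionary keys become the group names. The following is a minimal sketch that uses only NumPy. The groups dictionary and the check that every group has the same number of rows are illustrative assumptions, not part of the dama API.

.. code-block:: python

    import numpy as np

    # The same kinds of arrays as in the example above, keyed by group name.
    groups = {
        "x": np.random.rand(100, 1),
        "y": np.random.rand(100),
        "b": (np.random.rand(100) * 100).astype(int),
        "c": np.random.rand(100).astype(str),
    }

    # Assumption for this sketch: every group has the same number of rows.
    row_counts = {arr.shape[0] for arr in groups.values()}
    assert len(row_counts) == 1

    for name, arr in groups.items():
        # dtype.kind is "f" for floats, "i" for integers, "U" for unicode strings
        print(name, arr.shape, arr.dtype.kind)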


Next, we train a regression model, in this case RandomForestRegressor:

.. code-block:: python

    from dama.reg.extended.w_sklearn import RandomForestRegressor
    from dama.utils.model_selection import CV

    data.driver.mode = "r"  # switch mode from "w" to "r" so the previously saved data is not overwritten
    with data, Data(name="test_from_hash", driver=HDF5(mode="w")) as ds:
        cv = CV(group_data="x", group_target="y", train_size=.7, valid_size=.1)  # cross validation class
        stc = cv.apply(data)
        ds.from_data(stc, from_ds_hash=data.hash)
        reg = RandomForestRegressor()
        model_params = dict(n_estimators=25, min_samples_split=2)
        reg.train(ds, num_steps=1, data_train_group="train_x", target_train_group='train_y',
                  data_test_group="test_x", target_test_group='test_y', model_params=model_params,
                  data_validation_group="validation_x", target_validation_group="validation_y")
        reg.save(name="test_model", model_version="1")
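
To make the split sizes concrete, here is a rough equivalent written directly against scikit-learn rather than dama. It is a sketch under assumptions: it uses scikit-learn's own RandomForestRegressor and train_test_split, and it assumes that the 20% of rows left over after train_size=.7 and valid_size=.1 form the test split. It does not show how CV is implemented internally.

.. code-block:: python

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    X = rng.rand(100, 1)
    y = rng.rand(100)

    # 70 rows for training, leaving 30 rows for validation and test.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=70, random_state=0)
    # 10 rows for validation; the remaining 20 become the test split (assumption).
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_rest, y_rest, train_size=10, random_state=0)

    # Same hyperparameters as model_params in the example above.
    reg = RandomForestRegressor(n_estimators=25, min_samples_split=2, random_state=0)
    reg.fit(X_train, y_train)
    test_predictions = reg.predict(X_test)
    print(len(X_train), len(X_valid), len(X_test), test_predictions.shape)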

To make predictions with the saved RandomForestRegressor:

.. code-block:: python

    with RandomForestRegressor.load(model_name="test_model", model_version="1") as reg:
        for pred in reg.predict(data):
            prediction = pred.batch.to_ndarray()
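
The loop above consumes the predictions one batch at a time, and each iteration overwrites prediction. To keep every result, the batches can be collected and concatenated. The following is a generic NumPy sketch of that pattern. Both iter_batches and predict_batch are hypothetical stand-ins and are not dama functions.

.. code-block:: python

    import numpy as np

    def iter_batches(array, batch_size):
        """Yield consecutive slices of at most batch_size rows (hypothetical helper)."""
        for start in range(0, len(array), batch_size):
            yield array[start:start + batch_size]

    def predict_batch(batch):
        """Stand-in for a model: returns the mean of each row."""
        return batch.mean(axis=1)

    data = np.arange(20, dtype=float).reshape(10, 2)
    # Predict batch by batch, then join the per-batch results into one array.
    predictions = np.concatenate(
        [predict_batch(batch) for batch in iter_batches(data, batch_size=4)])
    print(predictions.shape)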


CLI
==============
dama-ml has a CLI where you can view your datasets and models.
For example:

.. code-block:: bash

    dama-cli datasets

This returns a table of previously saved datasets.

.. code-block:: text

    Using metadata ..../metadata/metadata.sqlite3
    Total 2 / 2

    hash                    name            driver    group name    size       num groups  datetime UTC
    ---------------------  --------------  --------  ------------  --------  ------------  -------------------
    sha1.3124d5f16eb0e...  test_from_hash  HDF5      s/n           9.12 KB              6  2019-02-27 19:39:00
    sha1.e832f56e33491...  reg0            Zarr      s/n           23.68 KB             6  2019-02-27 19:39:00



Use "--help" to view more options.


