Guangning Yu's Blog

Building Java Projects with Maven

2018-11-30 11:07:21  |  Java Maven

Create the directory structure

    mkdir -p src/main/java/hello
    mkdir -p src/test/java/hello

Create classes

src/main/java/hello/HelloWorld.java

    package hello;

    import org.joda.time.LocalTime;

    public class HelloWorld {
        public static void main(String[] args) {
            LocalTime currentTime = new LocalTime();
            System.out.println("The current local time is: " + currentTime);
            Greeter greeter = new Greeter();
            System.out.println(greeter.sayHello());
        }
    }

src/main/java/hello/Greeter.java

    package hello;

    public class Greeter {
        public String sayHello() {
            return "Hello world!";
        }
    }

Write a test

src/test/java/hello/GreeterTest.java

    package hello;

    import static org.hamcrest.CoreMatchers.containsString;
    import static org.junit.Assert.*;
    import org.junit.Test;

    public class GreeterTest {

        private Greeter greeter = new Greeter();

        @Test
        public void greeterSaysHello() {
            assertThat(greeter.sayHello(), containsString("Hello"));
        }
    }
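A pom.xml declaring the joda-time and JUnit/Hamcrest dependencies (not shown above) completes the project; with it in place, the standard Maven lifecycle commands build and test everything:

    mvn compile   # compile the main sources to target/classes
    mvn test      # compile and run the unit tests
    mvn package   # run the tests and bundle the classes into a jar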

Regression using Keras

2017-07-18 11:24:12  |  DeepLearning Keras
    #!/usr/bin/env python
    # Python 2 example (urllib2); on Python 3 use urllib.request instead
    import urllib2
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.wrappers.scikit_learn import KerasRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import KFold
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline

    def load_data():
        # Boston housing data: 13 feature columns, last column is the target
        X = []
        Y = []
        data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
        for line in urllib2.urlopen(data_url).readlines():
            line = map(float, line.split())
            X.append(line[0:13])
            Y.append(line[13])
        return X, Y

    def basic_model():
        # create model
        model = Sequential()
        model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
        model.add(Dense(1, kernel_initializer='normal'))
        # compile model
        model.compile(loss='mean_squared_error', optimizer='adam')
        return model

    def evaluate(build_fn):
        # evaluation sketch implied by the imports above: standardize the
        # features, then score the wrapped model with 10-fold cross-validation
        X, Y = load_data()
        pipeline = Pipeline([
            ('standardize', StandardScaler()),
            ('mlp', KerasRegressor(build_fn=build_fn, epochs=100, batch_size=5, verbose=0)),
        ])
        kfold = KFold(n_splits=10)
        results = cross_val_score(pipeline, np.array(X), np.array(Y), cv=kfold)
        print('MSE: %.2f (%.2f)' % (results.mean(), results.std()))

    evaluate(basic_model)

Neural Network

2016-03-05 11:12:57  |  MachineLearning

Hive Basics

2015-11-18 11:44:16  |  Hive
  • Date functions
    -- change date format
    from_unixtime(unix_timestamp('20150101', 'yyyyMMdd'), 'yyyy-MM-dd')
    -- add n days
    date_add('2015-11-01', 30) -- returns '2015-12-01'
    -- calculate date difference
    datediff('2015-12-01', '2015-11-01') -- returns 30
  • Generate row number
    row_number() over (DISTRIBUTE BY ... SORT BY ... DESC)
  • Get partition information
    analyze table xxx.yyy partition(dt = '2015-12-11') compute statistics;
    describe formatted xxx.yyy partition (dt = '2015-12-11');

Calculate the similarity of two vectors

2015-03-13 11:33:36  |  MachineLearning

Euclidean distance

    from sklearn.metrics.pairwise import euclidean_distances
    euclidean_distances([[1, 2, 3], [100, 200, 300]])
    # returns:
    # array([[  0.        , 370.42408129],
    #        [370.42408129,   0.        ]])

Cosine similarity

    from sklearn.metrics.pairwise import cosine_similarity
    cosine_similarity([[1, 2, 3], [100, 200, 300]])
    # returns:
    # array([[1., 1.],
    #        [1., 1.]])

Pearson correlation

    from scipy.stats import pearsonr
    pearsonr([1, 2, 3], [100, 200, 300])
    # returns (1.0, 0.0): (Pearson's correlation coefficient, 2-tailed p-value)

Cosine Similarity and Pearson Correlation Coefficient

2015-03-12 16:46:53  |  MachineLearning

Logistic Regression

2014-04-07 11:13:31  |  MachineLearning


    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Python 2 example (urllib2); on Python 3 use urllib.request instead
    import urllib2
    from numpy import mat, ones, shape, exp, array, arange
    import matplotlib.pyplot as plt

    def createDataSet():
        features = []
        labels = []
        lines = urllib2.urlopen('https://raw.github.com/pbharrin/machinelearninginaction/master/Ch05/testSet.txt').readlines()
        for line in lines:
            line = line.strip().split()
            features.append([1.0, float(line[0]), float(line[1])])  # set x0 to 1.0
            labels.append(int(line[2]))
        return features, labels

    def sigmoid(value):
        return 1.0 / (1 + exp(-value))

    def gradAscent(features, labels, alpha=0.001, iterations=500):
        '''
        Gradient ascent:
        - batch algorithm: every update of the regression coefficients
          traverses the whole dataset
        '''
        featureMatrix = mat(features)
        labelMatrix = mat(labels).transpose()
        m, n = shape(featureMatrix)
        weights = ones((n, 1))
        for k in range(iterations):
            h = sigmoid(featureMatrix * weights)
            error = labelMatrix - h
            weights = weights + alpha * featureMatrix.transpose() * error
        return weights
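The matplotlib and arange imports above go unused in the excerpt, which suggests the post continued by plotting the fitted decision boundary. A minimal sketch of that step (plotBestFit here is an assumption modeled on the helper of the same name in Machine Learning in Action):

    def plotBestFit(features, labels, weights):
        # scatter the two classes, then draw the line where w0 + w1*x1 + w2*x2 = 0
        w = array(weights).flatten()
        x1_pos = [f[1] for f, l in zip(features, labels) if l == 1]
        x2_pos = [f[2] for f, l in zip(features, labels) if l == 1]
        x1_neg = [f[1] for f, l in zip(features, labels) if l == 0]
        x2_neg = [f[2] for f, l in zip(features, labels) if l == 0]
        plt.scatter(x1_pos, x2_pos, c='green', label='class 1')
        plt.scatter(x1_neg, x2_neg, c='red', marker='s', label='class 0')
        x = arange(-3.0, 3.0, 0.1)
        plt.plot(x, (-w[0] - w[1] * x) / w[2])  # decision boundary
        plt.xlabel('x1')
        plt.ylabel('x2')
        plt.legend()
        plt.show()

    features, labels = createDataSet()
    plotBestFit(features, labels, gradAscent(features, labels))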

Collaborative Filtering

2014-03-02 10:39:57  |  MachineLearning

user-based collaborative filtering

  1. for each user, find similar users by computing the similarity of their ratings (e.g. Euclidean distance, Pearson correlation)
  2. for each item rated by the selected users, calculate a rating weighted by each user's similarity
  3. select the top n new items for this user (see the sketch after this list)
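A minimal Python sketch of these three steps, assuming ratings stored as {user: {item: rating}} (the helper names and data layout are illustrative, not from the original post):

    from math import sqrt

    def pearson(a, b):
        # Pearson correlation between two rating dicts, over their common keys
        common = [k for k in a if k in b]
        n = len(common)
        if n < 2:
            return 0.0
        xs = [a[k] for k in common]
        ys = [b[k] for k in common]
        mx, my = sum(xs) / float(n), sum(ys) / float(n)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx and sy else 0.0

    def recommend(ratings, user, n=5):
        # step 1: similarity between this user and every other user
        # steps 2-3: similarity-weighted ratings for items the user has not rated
        scores, weights = {}, {}
        for other, their in ratings.items():
            if other == user:
                continue
            sim = pearson(ratings[user], their)
            if sim <= 0:
                continue
            for item, rating in their.items():
                if item not in ratings[user]:
                    scores[item] = scores.get(item, 0.0) + sim * rating
                    weights[item] = weights.get(item, 0.0) + sim
        ranked = sorted(((s / weights[i], i) for i, s in scores.items()), reverse=True)
        return ranked[:n]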

item-based collaborative filtering

  1. for each item, calculate its similarity to every other item
  2. select this user's top-rated items
  3. for each selected item, find similar items and calculate a rating weighted by each item's similarity
  4. select the top n new items for this user (see the sketch after this list)
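The item-based variant is the same computation transposed: precompute an item-item similarity table once, then weight each candidate item by its similarity to the items the user already rated. A compact sketch, reusing the hypothetical ratings layout and pearson helper above:

    def item_similarities(ratings):
        # step 1: invert {user: {item: rating}} to {item: {user: rating}},
        # then build the item-item similarity table
        by_item = {}
        for user, prefs in ratings.items():
            for item, rating in prefs.items():
                by_item.setdefault(item, {})[user] = rating
        return {a: {b: pearson(by_item[a], by_item[b])
                    for b in by_item if b != a}
                for a in by_item}

    def recommend_items(ratings, sims, user, n=5):
        # steps 2-4: weight each unseen item by its similarity to rated items
        scores, weights = {}, {}
        for item, rating in ratings[user].items():
            for other, sim in sims[item].items():
                if other in ratings[user] or sim <= 0:
                    continue
                scores[other] = scores.get(other, 0.0) + sim * rating
                weights[other] = weights.get(other, 0.0) + sim
        ranked = sorted(((s / weights[i], i) for i, s in scores.items()), reverse=True)
        return ranked[:n]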

user-based or item-based?

  • the item-based method needs to maintain an item-similarity table
  • for a sparse dataset, the item-based method works better
  • for a dense dataset, the two methods perform similarly

Awk Basics

2013-09-30 11:33:44
  • Get absolute value
    awk '{printf("%d", sqrt($1*$1))}' test.csv
  • User defined function
    echo "4 105" | awk 'function max(a,b){return a>b?a:b}{print max($1, $2)}'