10 Best Practices for Apache Hive

Apache Hive is SQL-like software used with Hadoop that lets users run SQL-like queries, written in its own language, HiveQL, quickly and efficiently. It also gives users additional query and analytical capabilities that are not available in traditional SQL structures.

With Apache Hive, users can use HiveQL or traditional MapReduce, depending on individual needs and preferences. Hive is particularly well suited to analyzing large datasets (petabytes) and also includes a variety of storage options.
Hive is full of unique tools that allow users to quickly and efficiently perform data queries and analysis. In order to make full use of all these tools, it’s important for users to use best practices for Hive implementation. Here are 10 ways to make the most of Hive.

  1. Partitioning Tables:

    Hive partitioning is an effective method to improve query performance on larger tables. Partitioning lets you store data in separate sub-directories under the table location, which greatly helps queries that filter on the partition key(s). Choosing a partition key is always a sensitive decision: it should always be a low-cardinality attribute. For example, if your data is associated with a time dimension, then date is a good partition key. Similarly, if the data is associated with a location, like a country or state, it's a good idea to use hierarchical partitions like country/state.
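
    As a minimal sketch (the table and column names here are made up for illustration), a table partitioned by date and by a country/state hierarchy could look like this:

    CREATE TABLE page_views (
        user_id BIGINT,
        url     STRING
    )
    PARTITIONED BY (dt STRING, country STRING, state STRING);

    -- Writing into a specific partition creates the matching sub-directories,
    -- and queries that filter on dt/country/state only scan those directories.
    INSERT OVERWRITE TABLE page_views PARTITION (dt='2016-01-01', country='US', state='CA')
    SELECT user_id, url FROM raw_page_views;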

  2. De-normalizing data:

    Normalization is a standard process used to model your data tables with certain rules that deal with redundancy and anomalies. In simpler words, if you normalize your data sets, you end up creating multiple relational tables that have to be joined at run time to produce the results. Joins are expensive and difficult operations to perform, and they are one of the common reasons for performance issues. Because of that, it's a good idea to avoid highly normalized table structures, since they require join queries to derive the desired metrics.
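
    As a hedged sketch (the orders and customers tables are hypothetical), one common approach is to pre-join the normalized tables once into a flat table, so repeated reporting queries no longer need the join:

    CREATE TABLE orders_flat AS
    SELECT o.order_id, o.order_date, o.amount,
           c.customer_name, c.country
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id;

    -- Downstream queries scan a single table, with no join at query time
    SELECT country, SUM(amount) FROM orders_flat GROUP BY country;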

  3. Compress map/reduce output:

    Compression techniques significantly reduce the volume of intermediate data, which in turn reduces the amount of data transferred between mappers and reducers; most of that transfer happens over the network. Compression can be applied to the mapper and reducer output individually. Keep in mind that gzip-compressed files are not splittable, so this should be applied with caution: a compressed file should not be larger than a few hundred megabytes, otherwise it can lead to an imbalanced job. Other compression codec options include Snappy, LZO, and bzip2. The two relevant settings are listed below; a short example with a codec follows the list.

    • For map output compression, set mapred.compress.map.output to true.
    • For job output compression, set mapred.output.compress to true.

      For more functions, check out the Hive Cheat Sheet.
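
    As an illustration, the settings above can be paired with a codec, for example Snappy (whether Snappy is available depends on your cluster):

    SET mapred.compress.map.output=true;
    SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

    SET mapred.output.compress=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;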

  4. Map join:

    Map joins are really efficient if the table on one side of the join is small enough to fit in memory. Hive supports a parameter, hive.auto.convert.join, which, when set to true, lets Hive convert a suitable join into a map join automatically. When relying on this behavior, be sure that auto-conversion is actually enabled in your Hive environment.
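
    A minimal sketch of enabling the automatic conversion (the threshold shown is the common default value, and fact_table/small_dim are made-up names):

    SET hive.auto.convert.join=true;
    -- Tables smaller than this size (in bytes) are considered for a map join
    SET hive.mapjoin.smalltable.filesize=25000000;

    -- If small_dim falls under the threshold, Hive can map-join it in memory
    SELECT f.order_id, d.dim_name
    FROM fact_table f
    JOIN small_dim d ON f.dim_id = d.dim_id;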

  5. Bucketing:

    Bucketing improves join performance if the bucket key and the join key are the same. Bucketing in Hive distributes the data into different buckets based on a hash of the bucket key. It also reduces I/O scans during the join process if the join happens on the same keys (columns).
    Additionally, it's important to ensure the bucketing flag is set (SET hive.enforce.bucketing=true;) every time before writing data to the bucketed table. To leverage bucketing in a join operation, set hive.optimize.bucketmapjoin=true. This setting hints to Hive to do a bucket-level join during the map-side join. It also reduces the scan cycles needed to find a particular key, because bucketing guarantees that the key is present in a specific bucket.
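
    A hedged sketch of a bucketed table and the related flags (the table, column names, and bucket count of 32 are only examples):

    CREATE TABLE orders_bucketed (
        order_id BIGINT,
        user_id  BIGINT,
        amount   DOUBLE
    )
    CLUSTERED BY (user_id) INTO 32 BUCKETS;

    SET hive.enforce.bucketing=true;       -- honor the bucket definition when writing
    INSERT OVERWRITE TABLE orders_bucketed
    SELECT order_id, user_id, amount FROM orders;

    SET hive.optimize.bucketmapjoin=true;  -- allow bucket-level map joins on user_id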

  6. Input Format Selection:

    Input formats play a critical role in Hive performance. Text-based input formats such as JSON, for example, are not a good choice for a large production system where data volume is really high. These human-readable formats take a lot of space and carry parsing overhead (e.g., JSON parsing). To address these problems, Hive comes with columnar input formats like RCFile and ORC. Columnar formats reduce the read operations in analytics queries by allowing each column to be accessed individually. There are other binary formats, such as Avro, SequenceFile, Thrift, and Protocol Buffers, which can be helpful in various use cases too.
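
    For instance, a table could be stored in ORC roughly like this (the table names are illustrative):

    CREATE TABLE page_views_orc (
        user_id BIGINT,
        url     STRING,
        dt      STRING
    )
    STORED AS ORC;

    -- Populate the columnar table from an existing text-format table
    INSERT OVERWRITE TABLE page_views_orc
    SELECT user_id, url, dt FROM page_views_text;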

  7. Parallel execution:

    Hadoop can execute MapReduce jobs in parallel, and several queries executed on Hive automatically use this parallelism. However, a single complex Hive query is commonly translated into a number of MapReduce jobs that are executed sequentially by default. Often, some of a query's MapReduce stages are not interdependent and could be executed in parallel. They can then take advantage of spare capacity on the cluster and improve cluster utilization while reducing overall query execution time. The configuration change in Hive is merely a single flag: SET hive.exec.parallel=true.
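
    For example (the thread count shown is just the usual default):

    SET hive.exec.parallel=true;
    -- Optional cap on how many independent stages may run at the same time
    SET hive.exec.parallel.thread.number=8;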

  8. Vectorization:

    Vectorization allows Hive to process a batch of rows together instead of processing one row at a time. Each batch is made up of column vectors, which are usually arrays of primitive types. Operations are performed on entire column vectors, which improves instruction pipelining and cache usage. To enable vectorization, set hive.vectorized.execution.enabled=true.
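
    For example (note that early Hive releases only vectorize queries over ORC-backed tables, and the reduce-side flag only exists in newer versions):

    SET hive.vectorized.execution.enabled=true;
    -- Reduce-side vectorization is controlled separately in newer Hive versions
    SET hive.vectorized.execution.reduce.enabled=true;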

  9. Unit Testing:

    Simply speaking, unit testing determines whether the smallest testable piece of your code works exactly as you expect. Unit testing brings several benefits: detecting problems early, making it easier to change and refactor code, and serving as a form of documentation that explains how the code works, to name a few.
    In Hive, you can unit test UDFs, SerDes, streaming scripts, Hive queries, and more. To a large extent, it is possible to verify the correctness of your whole HiveQL query by running quick local unit tests without even touching a Hadoop cluster. Because executing a HiveQL query in local mode takes seconds, compared to minutes, hours, or days in Hadoop mode, this saves huge amounts of development time.

    There are several tools available that help you test Hive queries. Some you might want to look at include HiveRunner, Hive_test, and Beetest.

  10. Sampling:

    Sampling allows users to take a subset of a dataset and analyze it, without having to analyze the entire data set. With a representative sample, a query can return meaningful results while finishing more quickly and consuming fewer compute resources.

    Hive offers a built-in TABLESAMPLE clause that allows you to sample your tables. TABLESAMPLE can sample at various granularity levels: it can return only subsets of buckets (bucket sampling), or HDFS blocks (block sampling), or only the first N records from each input split. Alternatively, you can implement your own UDF that filters out records according to your sampling algorithm.
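
    Two hedged examples against the hypothetical tables used earlier (exact TABLESAMPLE support depends on your Hive version):

    -- Bucket sampling: read only 1 of 32 buckets, hashed on user_id
    SELECT * FROM orders_bucketed TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id) t;

    -- Block sampling: read roughly 1 percent of the input data
    SELECT * FROM page_views_orc TABLESAMPLE (1 PERCENT) t;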

    Reference: 10 Best Practices for Apache Hive

Feature Location Based on K-Means

I. Problem Description

This post performs feature location on jEdit 4.3. The core algorithm is K-Means, implemented in Python 3 with scikit-learn, matplotlib, and numpy.

K-means is a hard clustering algorithm and a typical representative of prototype-based, objective-function clustering methods: it takes some distance from the data points to the prototypes as the objective function to optimize, and derives the iterative update rules by minimizing that function. K-means uses Euclidean distance as its similarity measure and, for a given initial set of cluster centers V, seeks the partition that minimizes the evaluation criterion J; the criterion function used is the sum of squared errors.

Because the choice of K is very important in K-Means, during the analysis I iterated over candidate values, used the silhouette coefficient as the metric for judging the clustering results, and then picked a suitable K (i.e., the number of features in jEdit 4.3); the details are in the Analysis and Design section.

The silhouette coefficient combines the cohesion and separation of a clustering and is used to evaluate clustering quality. Its value lies between -1 and 1; the larger the value, the better the clustering.

II. Analysis and Design

In this part, I describe the models, algorithms, and parameter choices in the order they are used in the pipeline.

1. Bag-of-Words Model

Bag of words: in information retrieval, the bag-of-words model assumes that a text's word order, grammar, and syntax can be ignored, and the text is treated simply as a set, or combination, of words. Each word occurrence in the text is independent and does not depend on whether other words appear; in other words, the author's choice of a word at any position is made independently of the preceding sentences.

The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model, a text is viewed as an unordered collection of words, ignoring grammar and even word order.

2. Vectorization

sklearn provides two vectorization options, TfidfVectorizer and HashingVectorizer. TfidfVectorizer converts the text into a TF-IDF matrix, i.e., the documents' feature vectors.

TF-IDF is a statistical method for evaluating how important a word is to a document in a collection or corpus. A word's importance increases in proportion to how often it appears in the document, but decreases in inverse proportion to how often it appears across the corpus.

  • Because HashingVectorizer is a stateless model and does not provide IDF weights, if you take the HashingVectorizer route you can add IDF weighting through a pipeline.

  • TfidfVectorizer can generate the TF-IDF matrix directly, and since feature location follows, using TfidfVectorizer directly is simpler.

3. Features

Although TfidfVectorizer already provides a default tokenizer, it does not fully meet our requirements.

  • In the jEdit 4.3 corpus, tokens such as numbers or words shorter than two characters usually carry no meaning and contribute little to clustering, so we can drop them in a custom tokenizer.

  • Although TF-IDF already lowers the weight of words that appear in every document, different functions contain different numbers of language keywords, so I decided to remove all Java keywords uniformly.

  • The jEdit 4.3 corpus has already been stemmed, so we do not stem it again.

  • Stop words can be removed through a TfidfVectorizer parameter, so the custom tokenizer does not handle them again.

def token_filter(text):
    """
    Return filtered tokens
    :param text: text to split into tokens
    """
    tokens = [word for word in nltk.word_tokenize(text)]
    filtered_tokens = []
    for token in tokens:
        if re.search(r'^\D\w*', token) and len(token) > 2 \
                and token not in keywords:
            filtered_tokens.append(token)
    # stems = [SnowballStemmer("english").stem(token) for token in filtered_tokens]
    return filtered_tokens

In source code, neighboring words are not strongly related, so only distinct single words are selected as features rather than combining multiple words into more complex features; ngram_range therefore keeps its default value.

print("Extracting features:")
start = time()
# hasher = HashingVectorizer(stop_words='english',
# non_negative=True,
# norm=None,
# tokenizer=token_filter)
# vectorizer = make_pipeline(hasher, TfidfTransformer())
# vectorizer = HashingVectorizer(stop_words='english',
# norm='l2',
# tokenizer=token_filter)
vectorizer = TfidfVectorizer(stop_words='english',
tokenizer=token_filter,
# ngram_range=(1, 2),
norm='l2')
file = open(corpus_file)
tfidf_matrix = vectorizer.fit_transform(file)
file.close()
print(tfidf_matrix.shape)
print("done in %fs" % (time() - start))

4. Dimensionality Reduction

Regarding dimensionality reduction, the experiments showed that it does not improve clustering accuracy. An important benefit of dimensionality reduction is that it shrinks the matrix and speeds up later steps, but since only about 3000 features were selected, and reduction makes the later query step more complex, the final solution does not use it. For completeness, the analysis of the dimensionality-reduction step is included here.

  • Because the TF-IDF matrix is sparse, TruncatedSVD from sklearn.decomposition can be used for the reduction.

  • How much of the original information is retained after the reduction can be measured by the explained variance.

So the first step is to determine, through a number of iterations, how many dimensions are appropriate.

print("Dimensionality reduction:")
start = time()
n_explained_variances = []
for n_components in range(500, 1500, 100):
svd = TruncatedSVD(n_components)
lsa = make_pipeline(svd, Normalizer(copy=False))
lsa_matrix = lsa.fit_transform(tfidf_matrix)
explained_variance = svd.explained_variance_ratio_.sum()
n_explained_variances.append(explained_variance)
print("Explained variance of the SVD step: {}%".format(int(explained_variance * 100)))
print("done in %fs" % (time() - start))
plt.plot(range(500, 1500, 100), n_explained_variances)

[Figure 1: explained variance vs. number of dimensions]

As shown in the figure above, the x axis is the number of dimensions and the y axis is the explained variance; 800 dimensions turns out to be a suitable choice.
If you do reduce the dimensionality, the following code reduces the data to 800 dimensions:

print("Dimensionality reduction:")
start = time()
svd = TruncatedSVD(800)
lsa = make_pipeline(svd, Normalizer(copy=False))
lsa_matrix = lsa.fit_transform(tfidf_matrix)
explained_variance = svd.explained_variance_ratio_.sum()
print("Explained variance of the SVD step: {}%".format(int(explained_variance * 100)))
print("done in %fs" % (time() - start))

5. Clustering Method

After analyzing the [scikit-learn clustering][1] methods with respect to the flatness of the data, the number of features, the number of clusters, and the size of the training set, K-Means and MiniBatchKMeans were chosen as the clustering methods (MiniBatchKMeans was used for testing: it speeds K-Means up but does not improve the clustering quality).

[Figure 2]

One difficulty with K-Means is choosing K, and here too I searched for the best K iteratively: iterating K from 100 to 1000 with a step of 100 and using the silhouette coefficient to make a rough choice. To improve the estimate, once a rough range has been found you can narrow the interval, shrink the step size, and search again.

print("KMeans:")
start = time()
k_silhouette_scores = []
for k in range(100, 1000, 100):
km = KMeans(n_clusters=k,
# init='k-means++',
# init='random',
n_init=10)
# km = MiniBatchKMeans(n_clusters=k,
# init_size=1000,
# batch_size=1000,
# n_init=1)
km.fit(lsa_matrix)
silhouette_score = metrics.silhouette_score(lsa_matrix, km.labels_, sample_size=1000)
k_silhouette_scores.append(silhouette_score)
print("%d clusters: Silhouette Coefficient: %0.3f"
% (k, silhouette_score))
print("done in %0.3fs" % (time() - start))
plt.plot(range(100, 1000, 100), k_silhouette_scores)

[Figure 3: silhouette coefficient vs. K]

In the figure above, the x axis is K and the y axis is the silhouette coefficient. Based on this analysis, I finally chose K = 500 to train on our data.

print("KMeans:")
start = time()
k = 500
count = []
km = KMeans(n_clusters=k,
# init='k-means++',
# init='random',
n_init=10)
km.fit(tfidf_matrix)
print("Top terms per cluster:")
clusters = km.labels_.tolist()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
functions = open(function_file).read().split()
for i in range(k):
count.append(clusters.count(i))
print("Cluster %d:" % i, end='')
for ind in order_centroids[i, :10]:
print(' %s' % functions[ind], end='\n')
print()
print("done in %0.3fs" % (time() - start))
fig = plt.figure(figsize=(15,4))
plt.bar(left = range(k), height = count, width = 4)

After K-Means clustering, the number of functions in each cluster can be shown as in the figure below, where the x axis is the cluster and the y axis is the number of functions.

[Figure 4: number of functions per cluster]

III. Location and Results

1. Query

Because no dimensionality reduction was applied, querying is straightforward: the query text is vectorized with TfidfVectorizer using the corpus's feature vocabulary and then predicted with the trained K-Means model. This program uses a file containing 150 query records and predicts a cluster for each of the 150 query texts.

print("query vectorizion:")
start = time()
query_vectorizer= TfidfVectorizer(stop_words='english',
tokenizer=token_filter,
ngram_range=(1, 1),
norm='l2',
vocabulary=vectorizer.get_feature_names())
file = open(queries_file)
query_tfidf_matrix = query_vectorizer.fit_transform(file)
file.close()
print(query_tfidf_matrix)
print("done in %fs" % (time() - start))
query_result = km.predict(query_tfidf_matrix)
print(query_result)

The figure below shows the prediction results for the 150 queries.

[Figure 5: predicted clusters for the 150 queries]

2. Precision

From the figure above, in this run the first query was assigned to cluster 158 (indices start at 0 in Python). Below we intersect the members of cluster 158 with the good set of the first query and compute the model's precision.
Because of the randomness of K-Means, the precision of the predictions is not a fixed value; it changes with each K-Means run. Over several experiments, the precision was between 5% and 10%.

good_set = set(open(goodset_file).read().split())
print(good_set)
# my_set is the set of functions in the predicted cluster (cluster 158 in this run);
# its construction is not shown in the original code
precision = len(set.intersection(my_set, good_set)) / len(my_set)
print(precision)

IV. Summary

The difficulty of this problem lies in understanding the properties of the corpus. No matter which clustering method is used, how many features are selected, how many final clusters are produced, whether dimensionality reduction is applied, or whether hierarchical clustering is applied on top of the K-Means results, there is always one cluster that contains far more data than the others; in the per-cluster function-count figure above, it is the cluster with the tall bar on the y axis. I regard this as a kind of high-dimensional "clumping" of the data, and I did not find a suitable way to solve it.
Another problem is querying against dimensionality-reduced data: because there is far less query data than corpus data, it is hard to match the query dimensions to the feature dimensions. Even if both are transformed into the same dimensionality, the individual dimensions are not necessarily meaningful, so the final solution did not use dimensionality reduction.

PS: This was an assignment for a machine learning course; there are problems with the precision computation.

V. Other Code

This section mainly contains the parts of the program not shown above.

%matplotlib inline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
# from nltk.stem.snowball import SnowballStemmer
import sys
import os
from time import time
base_dir = os.path.join("/Users", "Jack", "Documents", "Projects", "Python")
corpus_file = os.path.join(base_dir, "Corpus-jEdit4.3CorpusTransformedStemmed.OUT")
keywords_file = os.path.join(base_dir, "javaKeywords.txt")
function_file = os.path.join(base_dir, "Corpus-jEdit4.3.mapping")
queries_file = os.path.join(base_dir, "Queries-jEdit4.3ShortLongDescriptionCorpusTransformedStemmed.OUT")
goodset_file = os.path.join(base_dir, "GoldSet950961.txt")
keywords = open(keywords_file).read().split()
print(keywords)

VI. References

  1. http://scikit-learn.org/dev/modules/clustering.html#clustering
  2. http://scikit-learn.org/dev/auto_examples/text/document_clustering.html#example-text-document-clustering-py

Uninstalling R under OS X

I really dislike software that cannot be fully uninstalled once installed, R being one example. If you also need to completely uninstall R, the method below may help.

By default, an R installation on OS X consists of three parts:

  • R framework (/Library/Frameworks/R.framework)
  • R.app (/Applications/R.app, optional)
  • Tcl/Tk (/usr/bin, optional)

Overall, the first two components are fairly easy to remove (add sudo if you need the permissions):
rm -rf /Library/Frameworks/R.framework /Applications/R.app /usr/bin/R /usr/bin/Rscript
The most annoying one is the third component. The official page, "Uninstalling under OS X", actually admits that uninstalling it is not easy, then only shows how to list the files it installed, and leaves it at that.

My method just builds on that. Whether it has any bad side effects, I don't know yet; if you don't care about these leftovers, I suggest simply not deleting it at all.

  1. List the files that were installed and redirect the result to a file.
    pkgutil --files org.r-project.x86_64.tcltk.x11 > tcltk
  2. Take a look at the file's contents, preferably in a text editor, because we still need to edit this result.

    usr
    usr/local
    usr/local/bin
    usr/local/bin/tclsh8.6
    usr/local/bin/wish8.6
    usr/local/include
    usr/local/include/fakemysql.h
    usr/local/include/fakepq.h
    usr/local/include/fakesql.h
    usr/local/include/itcl.h  
    

    These are the first 10 lines of the file. Two things to note:

    • The paths are all relative
    • There are both directories and files
      First we need to remove some of the directory entries (don't worry, there are not many). To be safe I removed them by hand: entries such as usr, usr/local, usr/local/bin, and usr/local/include must be excluded, because Tcl/Tk only touches subdirectories or files inside them.
  3. The most important part is step 2: be very careful to exclude the directories we do not want to delete. This step turns the relative paths into absolute paths: using Vim, Sublime, or a similar editor, add a / at the beginning of every line.
  4. cat tcltk | sudo xargs rm -rf

A word of warning: rm -rf is a very dangerous command.

Creating Localized Folders on a Mac

On your system you have surely noticed folders such as "桌面", "下载", and "文稿", but when we list them in the terminal with ls, we find that their "real" names are the English names "Desktop", "Downloads", and "Documents". Folders that display their name in the current system language like this are localized folders.

How do you create your own localized folder? Below is the most common approach. To keep things clear, this post only gives the simplest example and commands, without explaining why it works this way. As for why anyone would go to this much trouble just to create a folder, opinions will differ.


  1. Suppose I want to create a "test" folder on the Desktop whose Chinese display name is "测试". Enter the following commands in the terminal:
    mkdir -p Desktop/test.localized/.localized
    touch Desktop/test.localized/.localized/zh_CN.strings
    vim Desktop/test.localized/.localized/zh_CN.strings

  2. In the opened zh_CN.strings file, add:
    "test" = "测试";

  3. Click the "test" folder on the Desktop to open it in Finder, and you will see the folder name displayed as the Chinese "测试".

A Brief Look at Java and C# Generics and C++ Templates

I recently happened upon the concept of type erasure. After looking through some material, I learned that Java and C# generics differ in a few ways, and of course both differ from C++ templates. So I put together the following summary:

The essence of generics is letting your types take type parameters. They are also called parameterized types or parametric polymorphism.

Differences

Java generics are implemented with type erasure, which means that all generic type information is removed at run time (see the example below if that is unclear). The JVM itself has no concept of "generics"; Java generics exist only at the compiler level. In .NET, by contrast, generics exist at the CLR level: at run time, the CLR generates different concrete type code for different generic types. That is why some people say Java's generics are "fake" generics, while C#'s generics are implemented more thoroughly.

Pros and Cons

The biggest advantage of Java's approach is compatibility: code that uses generics can run on JVMs that predate generics. .NET generics, on the other hand, require CLR support, so a .NET 2.0 assembly cannot run on CLR 1.0. But C#'s implementation avoids much of the boxing and unboxing overhead compared with Java, which gives it a big performance advantage.

The following simple example illustrates the differences.

  • In C#, with class List<T> {...}, T is a type parameter. We can write List<Person> foo = new List<Person>(); and a new type is constructed from List<T>, effectively as if the type argument had been substituted for the type parameter. After compilation it is as if a ListOfPerson class had been generated, and that class is no different from any other class. The benefit is that this is very fast, needs no casts, and reflection can tell that this is a List containing Person. No type information is lost.

  • In Java, we can write ArrayList<Person> foo = new ArrayList<Person>(); which on the surface looks the same as C#, and likewise the compiler stops you from putting anything that is not a Person into this list. The difference is that Java does not create a distinct ListOfPerson class. The Person type information is erased, leaving what is effectively just an ArrayList<Object>. A cast is usually required, e.g. Person p = (Person) foo.get(1)

  • In C++, we can write std::list<Person>* foo = new std::list<Person>(); which is very similar to the C# approach (or rather, C#'s approach is very similar to C++). It likewise keeps the type information instead of discarding it the way Java does. But while C# and Java compile to output for a virtual machine, C++ produces raw x86 binary code; nothing is an object, there is no boxing or unboxing, and the C++ compiler places no restrictions on what you can do with templates. In that sense, C++ templates are the more powerful mechanism.

PS: Many details of C++ templates are not covered here, because the main goal was to compare whether C# and Java generics use type erasure, so C++ templates get only a passing mention. If anything above is wrong, corrections are welcome.
