# 10 Best Practices for Apache Hive

Apache Hive is SQL-like software built on Hadoop that lets users run SQL-like queries in its own language, HiveQL, quickly and efficiently. It also gives users query and analytical capabilities not available in traditional SQL systems.

With Apache Hive, users can use HiveQL or traditional MapReduce systems, depending on their individual needs and preferences. Hive is particularly well suited to analyzing large datasets (petabytes), and it also includes a variety of storage options.

Hive is full of unique tools that allow users to perform data queries and analysis quickly and efficiently. To make full use of these tools, it's important to follow best practices for Hive implementations. Here are 10 ways to make the most of Hive.

1. Partitioning Tables:

Hive partitioning is an effective way to improve query performance on larger tables. Partitioning stores data in separate sub-directories under the table location, which greatly helps queries that filter on the partition key(s). Choosing a partition key is always a sensitive decision: it should be a low-cardinality attribute. For example, if your data is associated with a time dimension, date is a good partition key; similarly, if the data is associated with location, like a country or state, hierarchical partitions such as country/state work well.
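
As a sketch, a table partitioned on a date column might look like this (table and column names are hypothetical):

```sql
-- Hypothetical table partitioned by a low-cardinality date column.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING);

-- Writing into one partition places the data under .../view_date=2016-01-01/
INSERT OVERWRITE TABLE page_views PARTITION (view_date = '2016-01-01')
SELECT user_id, url FROM raw_page_views WHERE dt = '2016-01-01';

-- Queries filtering on the partition key scan only the matching sub-directories.
SELECT COUNT(*) FROM page_views WHERE view_date = '2016-01-01';
```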

2. De-normalizing data:

Normalization is a standard process for modeling data tables according to rules that deal with redundancy and anomalies. In simpler words, if you normalize your data sets, you end up with multiple relational tables that must be joined at run time to produce results. Joins are expensive operations and one of the most common causes of performance problems. For that reason, it's a good idea to avoid highly normalized table structures in Hive, because they require join queries to derive the desired metrics.

3. Compress map/reduce output:

Compression techniques significantly reduce the volume of intermediate data, which in turn reduces the amount of data transferred between mappers and reducers; most of this transfer happens over the network. Compression can be applied to mapper and reducer output individually. Keep in mind that gzip-compressed files are not splittable, so gzip should be applied with caution: a compressed file should not be larger than a few hundred megabytes, otherwise it can lead to an imbalanced job. Other codec options include Snappy, LZO, and bzip2.

• For map output compression, set `mapred.compress.map.output` to `true`
• For job output compression, set `mapred.output.compress` to `true`
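
In practice these flags can be set per session along with a splittable codec; a sketch using the older `mapred.*` property names referenced above (the choice of Snappy here is an assumption):

```sql
-- Compress intermediate (map) output, e.g. with Snappy:
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final job output as well:
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```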

For more functions, check out the Hive Cheat Sheet.

4. Map join:

Map joins are very efficient when the table on one side of the join is small enough to fit in memory. Hive supports the parameter `hive.auto.convert.join`, which, when set to `true`, tells Hive to try to convert joins to map joins automatically. When relying on this behavior, make sure auto-conversion is actually enabled in your Hive environment.
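
A minimal sketch (table names are hypothetical); the `MAPJOIN` hint is the manual alternative when auto-conversion is off:

```sql
-- Let Hive convert joins to map joins when the smaller table fits in memory.
SET hive.auto.convert.join=true;
-- Tables below this size (bytes) are considered small enough to map-join.
SET hive.mapjoin.smalltable.filesize=25000000;

-- Manual hint: load dim_country into memory and join in the map stage.
SELECT /*+ MAPJOIN(d) */ f.order_id, d.country
FROM orders f JOIN dim_country d ON f.country_id = d.country_id;
```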

5. Bucketing:

Bucketing improves join performance when the bucket key and the join key are the same columns. Bucketing in Hive distributes data into buckets based on the hash of the bucket key, and it reduces I/O scans during a join performed on those same key columns.

Additionally, it's important to ensure the bucketing flag is set (`SET hive.enforce.bucketing=true;`) every time before writing data to a bucketed table. To leverage bucketing in join operations, set `SET hive.optimize.bucketmapjoin=true;`. This hints to Hive to do a bucket-level join during the map-stage join. It also reduces the scan cycles needed to find a particular key, because bucketing guarantees that the key is present in a specific bucket.
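
Putting these settings together, a sketch with a hypothetical table bucketed on `user_id`:

```sql
-- Rows are hashed on user_id into 32 buckets (one file per bucket).
CREATE TABLE users_bucketed (
  user_id BIGINT,
  name    STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

SET hive.enforce.bucketing=true;      -- honor bucketing on writes
INSERT OVERWRITE TABLE users_bucketed SELECT user_id, name FROM users;

SET hive.optimize.bucketmapjoin=true; -- allow bucket-level map joins
-- A join on user_id against another table bucketed the same way on the
-- same key can now proceed bucket-by-bucket in the map stage.
```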

6. Input Format Selection:

Input formats play a critical role in Hive performance. Text-based formats such as JSON are not a good choice for a large production system with high data volumes: these readable formats take a lot of space and carry parsing overhead (e.g. JSON parsing). To address this, Hive provides columnar input formats such as RCFile and ORC. Columnar formats reduce read operations in analytics queries by allowing each column to be accessed individually. Other binary formats such as Avro, SequenceFile, Thrift, and Protocol Buffers can also be helpful in various use cases.
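
A sketch of moving a hypothetical text table over to ORC:

```sql
-- Columnar storage: analytics queries read only the columns they touch.
CREATE TABLE events_orc (
  event_id BIGINT,
  payload  STRING
)
STORED AS ORC;

-- Convert existing text data by reading it and writing into the ORC table.
INSERT OVERWRITE TABLE events_orc
SELECT event_id, payload FROM events_text;
```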

7. Parallel execution:

Hadoop can execute MapReduce jobs in parallel, and several queries executed on Hive automatically use this parallelism. However, a single complex Hive query is commonly translated into a number of MapReduce jobs that are executed sequentially by default. Often, some of a query's MapReduce stages are not interdependent and could be executed in parallel, taking advantage of spare cluster capacity, improving cluster utilization, and reducing overall query execution time. To change this behavior, merely flip a single flag: `SET hive.exec.parallel=true;`.

8. Vectorization:

Vectorization allows Hive to process a batch of rows together instead of processing one row at a time. Each batch consists of column vectors, which are usually arrays of primitive types. Operations are performed on entire column vectors, which improves instruction pipelining and cache usage. To enable vectorization, set `SET hive.vectorized.execution.enabled=true;`.

9. Unit Testing:

Simply speaking, unit testing determines whether the smallest testable piece of your code works exactly as you expect. It brings several benefits: detecting problems early, making code easier to change and refactor, and serving as a form of documentation that explains how the code works, to name a few.

In Hive, you can unit test UDFs, SerDes, streaming scripts, Hive queries, and more. To a large extent, you can verify the correctness of a whole HiveQL query with quick local unit tests, without even touching a Hadoop cluster. Because executing a HiveQL query in local mode takes literally seconds, compared to minutes, hours, or days in Hadoop mode, this saves a huge amount of development time.

Several tools are available to help you test Hive queries. Some you might want to look at are HiveRunner, Hive_test, and Beetest.

10. Sampling:

Sampling allows users to take a subset of a dataset and analyze it, without having to analyze the entire dataset. With a representative sample, a query can return meaningful results while finishing quicker and consuming fewer compute resources.

Hive offers a built-in TABLESAMPLE clause for sampling tables. TABLESAMPLE works at various granularities: it can return subsets of buckets (bucket sampling), HDFS blocks (block sampling), or only the first N records from each input split. Alternatively, you can implement your own UDF that filters records according to your sampling algorithm.
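
Sketches of the three TABLESAMPLE granularities mentioned above (table names hypothetical):

```sql
-- Bucket sampling: read bucket 1 out of 10, hashing rows on user_id.
SELECT * FROM users_bucketed TABLESAMPLE (BUCKET 1 OUT OF 10 ON user_id);

-- Block sampling: read roughly 1 percent of the HDFS blocks.
SELECT * FROM page_views TABLESAMPLE (1 PERCENT);

-- Row sampling: take the first 100 rows from each input split.
SELECT * FROM page_views TABLESAMPLE (100 ROWS);
```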

Reference: 10 Best Practices for Apache Hive

# K-Means-Based Feature Location

## 1. Problem Description

K-means is a hard clustering algorithm and a classic representative of prototype-based, objective-function clustering methods: it takes some distance from the data points to the prototypes as the objective function to optimize, and derives its iterative update rules by seeking extrema of that function. K-means uses Euclidean distance as its similarity measure, and seeks the optimal partition around an initial set of cluster centers V such that the evaluation criterion J is minimized; the sum of squared errors is used as the clustering criterion function.

## 2. Analysis and Design

### 2. Vectorization

TF-IDF is a statistical method for evaluating how important a word is to one document in a collection or corpus. A word's importance increases in proportion to how often it appears in the document, but decreases in inverse proportion to how often it appears across the corpus.

• Because HashingVectorizer is a stateless model and does not provide IDF weights, if you use HashingVectorizer, IDF weighting can be added via a pipeline.

• TfidfVectorizer generates the TF-IDF matrix directly; since feature location comes next, using TfidfVectorizer directly is simpler.
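
As a minimal illustration of the weighting described above, here is a plain-Python sketch (no sklearn; the toy corpus is made up):

```python
import math

def tf_idf(corpus):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in corpus:
        # Term frequency within this document
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        # idf grows as a term becomes rarer across the corpus
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["open", "file", "buffer"], ["open", "menu"], ["buffer", "edit", "buffer"]]
w = tf_idf(docs)
# "menu" appears in only one document, so it outweighs the more common "open"
```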

### 3. Features

• In the jEdit4.3 corpus, tokens such as those shorter than 2 characters or pure numbers usually carry little meaning and contribute little to text clustering, so we can drop them in a custom tokenizer.

• Although TF-IDF already down-weights words that appear in every document, the number of keywords varies from function to function, so I chose to remove all Java language keywords uniformly.

• The jEdit4.3 corpus has already been stemmed, so we did not stem it again.

• Stop words can be removed through TfidfVectorizer's parameters, so the custom tokenizer does not handle them again.

### 4. Dimensionality Reduction

• Since the TF-IDF matrix is sparse, TruncatedSVD from sklearn.decomposition can be used for dimensionality reduction.

• How much of the original information is preserved after the reduction can be measured by the explained variance.

### 5. Clustering Method

A difficult point of K-Means is choosing K, and I also search for the best K iteratively: sweeping K from 100 to 1000 with a step of 100 and using the silhouette coefficient to make a rough choice. To improve the accuracy of the estimate, once a rough range is fixed, you can shrink the interval and the step size and select again.
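
The whole design can be sketched with the sklearn classes named above. This is a toy version: the corpus is made up, and the K range is tiny (the real sweep over 100–1000 would need the actual jEdit4.3 corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy documents standing in for the jEdit4.3 method corpus.
docs = [
    "open file buffer read", "read file buffer close", "open buffer write",
    "paint gui menu widget", "gui menu widget event", "paint widget event",
]

# 1) Vectorize: TfidfVectorizer yields the sparse TF-IDF matrix directly.
X = TfidfVectorizer().fit_transform(docs)

# 2) Reduce dimensionality with TruncatedSVD (it accepts sparse input);
#    explained_variance_ratio_ shows how much information is kept.
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X)

# 3) Pick K by silhouette coefficient over a small candidate range.
best_k, best_score = None, -1.0
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    score = silhouette_score(X_reduced, labels)
    if score > best_score:
        best_k, best_score = k, score
```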

## 4. Summary

PS: This was a machine learning course assignment; there is a problem in how the precision is calculated.

# Uninstalling R on OS X

A default R installation on OS X consists of three parts:

• R framework (/Library/Frameworks/R.framework)
• R.app (/Applications/R.app, optional)
• Tcl/Tk (/usr/bin, optional)

The first two, plus the R launchers, can be removed directly:

rm -rf /Library/Frameworks/R.framework /Applications/R.app /usr/bin/R /usr/bin/Rscript

The Uninstalling under OS X documentation actually says uninstalling is not easy, then only shows how to list which files a package installed, and stops there. For Tcl/Tk, the steps below finish the job.

1. List the files the package installed and redirect the result to a file.
pkgutil --files org.r-project.x86_64.tcltk.x11 > tcltk
2. Inspect the file's contents, preferably in a text editor, because we still need to modify the result.

usr
usr/local
usr/local/bin
usr/local/bin/tclsh8.6
usr/local/bin/wish8.6
usr/local/include
usr/local/include/fakemysql.h
usr/local/include/fakepq.h
usr/local/include/fakesql.h
usr/local/include/itcl.h


These are the first 10 lines of the file. Two things to note:

• All paths are relative
• The list contains both directories and files

First we need to remove the directory entries (don't worry, there are only a few). To be safe I removed them by hand: entries such as usr, usr/local, usr/local/bin, and usr/local/include are directories that must be excluded, because Tcl/Tk only affects their sub-directories and files.
3. The most important part is step 2: be careful there to exclude the directories we do not want deleted. This step then turns the relative paths into absolute ones: using Vim or Sublime, prepend a / to the beginning of every line.
4. cat tcltk | sudo xargs rm -rf
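
The path-prefixing step can also be scripted instead of done by hand; a sketch (the sed step replaces the manual Vim edit, the sample input is made up, and the final rm is left commented out for safety):

```shell
# macOS-only: list the files the package installed (relative paths).
# pkgutil --files org.r-project.x86_64.tcltk.x11 > tcltk

# Stand-in for the pkgutil output while testing the transform:
printf 'usr/local/bin/tclsh8.6\nusr/local/bin/wish8.6\n' > tcltk

# Prepend / to every line to turn relative paths into absolute ones
# (the step done by hand in Vim/Sublime above).
sed 's|^|/|' tcltk > tcltk.abs
cat tcltk.abs

# After carefully reviewing tcltk.abs (directories excluded!), delete:
# cat tcltk.abs | sudo xargs rm -rf
```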

# Creating a Localized Folder on Mac

1. Suppose I want to create a "test" folder on the desktop whose displayed Chinese name is "测试". Enter the following commands in the terminal:
mkdir -p Desktop/test.localized/.localized
touch Desktop/test.localized/.localized/zh_CN.strings
vim Desktop/test.localized/.localized/zh_CN.strings

2. Add the following to the opened zh_CN.strings file:
"test" = "测试";

3. Click the "test" folder on the desktop to open it in Finder, and you will see the folder name displayed as the Chinese "测试".

# A Brief Look at Java and C# Generics and C++ Templates

### Differences

Java generics are implemented by type erasure, meaning all generic type information is removed at runtime (see the examples below if that is unclear). The JVM itself has no concept of "generics"; Java's generics exist only at the compiler level. In .NET, by contrast, generics exist at the CLR level: at runtime the CLR generates separate concrete-type code for different generic instantiations. This is why some say Java's generics are pseudo-generics, while C#'s are implemented more thoroughly.

### Pros and Cons

The biggest advantage of Java's approach is compatibility: code that uses generics can run on JVMs that predate generics. .NET's generics require CLR support, so a .NET 2.0 assembly cannot run on CLR 1.0. On the other hand, C#'s implementation avoids the boxing and unboxing overhead that Java incurs, which is a big performance advantage.

• In C#, class List<T> { ... } declares T as a type parameter. We can write List<Person> foo = new List<Person>(); the new type is constructed from List<T>, as if your type argument had been substituted for the type parameter. After compilation it is as though a ListOfPerson class had been generated, no different from any other class. The benefits are speed, no need for casts, and the fact that reflection can tell this is a List containing Person: no type information is lost.

• In Java, we can write ArrayList<Person> foo = new ArrayList<Person>(); on the surface this looks the same as C#, and likewise the compiler stops you from putting anything other than a Person into the list. The difference is that Java does not create a distinct ListOfPerson class. The Person type information is erased, so at runtime it is effectively just an ArrayList<Object>, and a cast is generally needed, e.g. Person p = (Person) foo.get(1);

• In C++, we can write std::list<Person>* foo = new std::list<Person>(); this closely resembles the C# approach (or rather, C#'s approach resembles C++'s). It likewise preserves type information instead of discarding it the way Java does. But while C# and Java both target virtual machines, C++ compiles to raw x86 machine code: nothing has to be an object, there is no boxing or unboxing, and the C++ compiler places no restrictions on what you can do with templates. In that sense C++ templates are more powerful.
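
The Java behavior described above can be checked directly: after erasure, every ArrayList<T> shares one runtime class. A minimal sketch:

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
    public static void main(String[] args) {
        List<String> names = new ArrayList<String>();
        List<Integer> counts = new ArrayList<Integer>();
        // Type erasure: both are plain java.util.ArrayList at runtime.
        System.out.println(names.getClass() == counts.getClass()); // prints "true"
        System.out.println(names.getClass().getName());            // prints "java.util.ArrayList"
    }
}
```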

PS: Many detailed questions about C++ templates are left out here, because the main goal was only to compare whether C# and Java generics are based on type erasure, so C++ templates get just a passing mention. If anything above is wrong, corrections are welcome.