用Python脚本实现对 Lua表的压缩

前言

问题

在研发过程中，通常会定义一些Excel表格，规定行列值让策划填写，然后，转成lua的table文件，直接使用。

但是，随着研发进行，项目迭代，表格将越来越大。

如果表格中存在大量重复数据，或者表格中很多列数值重复，则可以通过数据压缩给表减减肥。

解决

利用python实现lua表的数据压缩

excel表内存在大量 同列不同行 内容一致
excel表内存在大量 复合型 单元格内容一致

具体代码，均在github中： ltree98’s github -> CompressLua

参照方案

这是我在网上看到的文章：

Lua配置表存储优化方案

总结的来说，就是

利用lua的元表机制。如果在table中取的值不存在，会去它的元表中查找（如果元表存在）。
将重复的table，提取出来，将所有使用的地方引用过去。

问题

在参照方案的文章中，也提供了解决方法，但是在使用过程中遇见一些问题。

对同一文件压缩后文件MD5可能不一致

对同一个lua文件，进行减肥，在减肥后，虽然最终体重一样，但是可能这次瘦了肚子，下次瘦了腿。

主要原因就是参照方案使用lua处理，对同一个table遍历顺序是不稳定的，即table中存在 A B C 元素，可能这次遍历顺序是 A-B-C，下次遍历顺序可能就是 B-A-C；稳定性得不到保证。

lua的官网对于table的遍历描述一直是：

The order in which the indices are enumerated is not specified, even for numeric indices. (To traverse a table in numeric order, use a numerical for or the ipairs function.)

就是索引在遍历过程中的次序是不固定的，即使是数字索引也不固定。

因为，项目中使用热更机制是比较两个文件的MD5，决定是否更新该文件。

所以，这个问题就显得很严重，每次热更要更新所有的表。当然，这个问题也可以通过做备份等方式来弥补，但是毕竟治标不治本。

但是用不同的lua版本发现：

使用 lua5.1 版本生成的压缩文件是一致的

使用 lua5.3版本生成的压缩文件是不一致的

复杂度高

虽然是现成的，但是也没法直接拿来用；

还是要根据项目现状进行修改。

在整合过程中，发现逻辑比较复杂，而且工具集与我们现有的python不符。（如果用其他工具集，又要去配置相应环境等，比较麻烦）。

修改方案

简介

目的

解决 Excel表内存在大量同列不同行内容一致
解决 Excel表内存在大量复合型单元格内容一致

设计

对于重复table，只存一份，其他引用过去
设计一个基础元表，存储Excel每列最频繁的值；其他表缺省最频繁的值
设置只读属性

注意

Lua作用域内存放最大的local变量个数为 200个，超过的需要放入临时数组
输出到lua文件，key值不可存在特殊字符($、- 等)

缺点

Lua表的可读性变差，需要跳转获取最终值
使用元表实现，若后续处理（比如加密，修改操作等）也存在使用元表，增加处理的复杂度

方案

名词解释

Excel表：Excel转换成Lua的表，一般结构为

-- tableName.lua

local tableName = {
    ...
}
return tableName

默认表：未来的元表，存储Excel中每列最频繁的元素
重复表：Excel表中各单元格重复的复合型元素（array/table）

流程

遍历表，进行统计
- 统计Excel表中各列元素出现次数
- 统计Excel表中单元格复合型元素出现次数
构造重复表
- 筛选单元格复合型元素次数大于1的元素，构建重复表
构造默认表
- 根据各列最频繁元素，构造默认表
整理 Excel表
- 根据默认表，将重复的字段忽略
输出各表
- 根据重复表，替换并输出重复表
- 根据重复表，替换并输出 Excel表
- 根据重复表，替换并输出默认表
设置元表及只读属性

关键代码

最新代码请移步 lt-tree的github

###################################################################
##
##  tools
##

def count_dict_deep(dict_temp):
	"""Count dictionary's deep

		Args:
			dict_temp: dict

		Returns:
			int : deep
	"""

    deep = 0
    for item in dict_temp.values():
        if isinstance(item, dict):
            temp_deep = count_dict_deep(item)
            if temp_deep > deep:
                deep = temp_deep

    return deep+1

def calc_weight(obj1):
	"""Calculate obj's weight

		Args:
			obj1: tuple[dict's string, dict's frequency]
		
		Returns:
			int : weight
	"""

    dict1 = eval(obj1[0])
    times1 = obj1[1]

    deep1 = count_dict_deep(dict1)
    ans = deep1 + 1/times1

    return ans

def count_table_frequency(unit_dict, dict_frequency):
    """Count table frequency
        
        Count table frequency and record as {table string: times}

        Args:
            unit_dict: dict, need analyse data
            dict_frequency: dict, the record's set
    """

    unit_str = str(unit_dict)
    if unit_str in dict_frequency:
        dict_frequency[unit_str] = dict_frequency[unit_str] + 1
    else:
        dict_frequency[unit_str] = 1

    # traversing sub dict
    for item in unit_dict.values():
        if isinstance(item, dict):
            count_table_frequency(item, dict_frequency)


def count_table_value_frequency(key, value, item_frequency):
    """Count table value frequency

        Count every excel column element appear times.
        Record as {
                    key1 : {element1 : times, element2: times, ...}
                    key2 : {element1 : times, element2: times, ...}
                    ...
                    }

        Args:
            key: string
            value: string or dict
            item_frequency: dict, the record's set
    """

    if isinstance(value, dict):
        value = str(value)

    if key in item_frequency.keys():
        if value in item_frequency[key].keys():
            item_frequency[key][value] = item_frequency[key][value] + 1
        else:
            item_frequency[key][value] = 1
    else:
        item_frequency[key] = {}
        item_frequency[key][value] = 1


def traverse_table(excel_dict, dict_frequency, item_frequency):
    """Traverse table.
        
        Analyse lua table.

        Args:
            excel_dict: dict
            dict_frequency: dict
            item_frequency: dict
    """

    for key in sorted(excel_dict):
        if isinstance(excel_dict[key], dict):
            count_table_frequency(excel_dict[key], dict_frequency)

            for k, v in excel_dict[key].items():
                count_table_value_frequency(k, v, item_frequency)


def check_repeat_dict(item_dict, repeat_dict):
    """Check repeat dict
        
        Check repeat dict and return the repeat index, if not exist in repeat dict return -1.

        Args:
            item_dict: dict
            repeat_dict: dict

        Returns:
            int
    """

    dict_str = str(item_dict)
    if dict_str in repeat_dict.keys():
        return repeat_dict[dict_str]
    else:
        return -1

def replace_repeat_dict(item_dict, repeat_dict, cur_index = -1):
    """Replace repeat dict

        Check if element exist in repeat dict and replace by designation string.
    
        Args:
            item_dict: dict
            repeat_dict: dict
    """

    cur_index = -1

    for key, value in item_dict.items():
        if isinstance(value, dict):
            index = check_repeat_dict(value, repeat_dict)
            if index != -1 and (index < cur_index or cur_index == -1):
                if index > 190:
                    item_dict[key] = REPEAT_KEY_PREFIX + '[' + str(index - LOCAL_TABLE_MAX) + ']'
                else:
                    item_dict[key] = REPEAT_KEY_PREFIX + str(index)
            else:
                replace_repeat_dict(value, repeat_dict, cur_index)



def output_file(table_name, file_path, repeat_dict, final_dict, default_dict):
    """Output file

        Args:
            table_name: string
            file_path: path
            repeat_dict: dict
            final_dict: dict
            default_dict: dict
    """

    file_handler = codecs.open(file_path, 'a', encoding='utf-8')

    # output repeat dict
    for dictStr, index in sorted(repeat_dict.items(), key=lambda item:item[1]):

        # replace repeat item by repeat_dict 
        repeat_dict_item = eval(dictStr)
        replace_repeat_dict(repeat_dict_item, repeat_dict, index)

        if index <= LOCAL_TABLE_MAX:
            # file_handler.write("local %s = {\n" % (REPEAT_KEY_PREFIX + str(index)))
            file_handler.write("local %s = {\n" % (REPEAT_KEY_PREFIX + str(index)))
            convert.convert_dict_lua_file(file_handler, repeat_dict_item, 1)
            file_handler.write("}\n")
        else:
            if index == (LOCAL_TABLE_MAX + 1):
                file_handler.write("\nlocal __rt = createtable and createtable(%d, 0) or {}\n" % (len(repeat_dict)-LOCAL_TABLE_MAX))
            
            file_handler.write("__rt[%d] = {\n" % (index - LOCAL_TABLE_MAX))
            convert.convert_dict_lua_file(file_handler, repeat_dict_item, 1)
            file_handler.write("}\n")       

    # output final dict
    replace_repeat_dict(final_dict, repeat_dict)
    file_handler.write("\nlocal %s = {\n" % (table_name))
    convert.convert_dict_lua_file(file_handler, final_dict, 1)
    file_handler.write("}\n")

    # output default dict
    replace_repeat_dict(default_dict, repeat_dict)
    file_handler.write("\nlocal %s = {\n" % (DEFAULT_TABLE_NAME))
    convert.convert_dict_lua_file(file_handler, default_dict, 1)
    file_handler.write("}\n")

    # set metatable and read-only
    file_handler.write("\ndo\n")
    file_handler.write("\tlocal base = {__index = %s, __newindex = function() error(\"Attempt to modify read-only table\") end}\n" % (DEFAULT_TABLE_NAME))
    file_handler.write("\tfor k, v in pairs(%s) do\n" % (table_name))
    file_handler.write("\t\tsetmetatable(v, base)\n")
    file_handler.write("\tend\n")
    file_handler.write("\tbase.__metatable = false\n")
    file_handler.write("end\n")

    file_handler.write("\nreturn %s\n" % (table_name))
    file_handler.close()
    

###################################################################
##
##  structure method
##

def structure_repeat_dict(dict_frequency):
    """Structure frequency dict

        Select frequency > 1 element to structure dict.

        Args:
            dict_frequency: dict

        Returns:
            dict; {dict's string : repeat index}
    """

    repeat_dict = {}
    repeat_index = 1

    for key, value in sorted(dict_frequency.items(), key=lambda x:calc_weight(x)):
        if value > 1:
            repeat_dict[key] = repeat_index
            repeat_index = repeat_index + 1

    return repeat_dict


def structure_default_dict(excel_dict, all_item_frequency):
    """Structure default dict
        
        Args:
            excel_dict: dict
            all_item_frequency: dict

        Returns:
            dict; {key : most frequently value}
    """

    excel_item = {}
    for item in excel_dict.values():
        excel_item = item
        break

    default_dict = {}
    for key, value in excel_item.items():
        for k, v in sorted(all_item_frequency[key].items(), key=lambda item:item[1], reverse=True):
            if isinstance(value, dict):
                default_dict[key] = eval(k)
            else:
                default_dict[key] = k
            break

    return default_dict


def structure_final_dict(excel_dict, default_dict):
    """Structure final dict
        
        Structure final dict by default_dict and excel_dict.

        Args:
            excel_dict: dict
            default_dict: dict

        Returns:
            dict
    """

    final_dict = {}
    for key, value in excel_dict.items():
        final_dict[key] = {}

        if isinstance(value, dict):
            for k, v in value.items():
                if default_dict[k] != v:
                    final_dict[key][k] = v
        else:
            final_dict[key] = value

    return final_dict



def process_file(src_path, dst_path, file_name):
    dict_frequency_statistics = {}
    all_item_frequency_statistics = {}

    # conver lua table to python dict
    file_dict = convert.convert_lua_table_dict(src_path)

    # analyse dict
    traverse_table(file_dict, dict_frequency_statistics, all_item_frequency_statistics)

    # get repeat dict
    repeat_dict = structure_repeat_dict(dict_frequency_statistics)

    # get default dict
    default_dict = structure_default_dict(file_dict, all_item_frequency_statistics)

    # structure final dict
    final_dict = structure_final_dict(file_dict, default_dict)

    output_file(file_name, dst_path, repeat_dict, final_dict, default_dict)

文件结构

six.py：专门用于兼容 Python2 和 Python3 的库
slpp.py：用来将 lua的table结构转换成 python dictionary结构
BEBase.py：基础模块，目前包含转换、格式化输出等
CompressLua.py ：压缩Lua工具

其他

为什么 lua5.1 有序， lua5.3 无序

最开始使用lua5.3来执行参照方案，所以出现了压缩后的文件不一致问题；

一开始也没有想是lua版本的问题，只是单纯觉得用lua实现太不方便了，就直接用python重写了。

后来在实现完python版本，然后进行整理的时候，

发现lua5.1版本其实还是固定顺序的（虽然官方文档写的不保证），

但是lua5.3版本的确是无序，而且这个无序是载入无序，

也就是说，对同一个table，载入时会安排好顺序，后面遍历就固定顺序了，但是下次载入可能就换了个顺序。

为什么lua5.3会载入无序呢？

lua中table的 for pairs 遍历是通过next来实现的，next的实现在不同lua版本没有变化

int luaH_next (lua_State *L, Table *t, StkId key) {
  int i = findindex(L, t, key);  /* find original element */
  for (i++; i < t->sizearray; i++) {  /* try first array part */
    if (!ttisnil(&t->array[i])) {  /* a non-nil value? */
      setnvalue(key, cast_num(i+1));
      setobj2s(L, key+1, &t->array[i]);
      return 1;
    }
  }
  for (i -= t->sizearray; i < sizenode(t); i++) {  /* then hash part */
    if (!ttisnil(gval(gnode(t, i)))) {  /* a non-nil value? */
      setobj2s(L, key, key2tval(gnode(t, i)));
      setobj2s(L, key+1, gval(gnode(t, i)));
      return 1;
    }
  }
  return 0;  /* no more elements */
}

都是先查数组部分，再查哈希部分。

所以，key值的设置就是关键。

lua关于table中加入新key值解释是：

/
inserts a new key into a hash table; first, check whether key’s main position is free. If not, check whether colliding node is in its main
position or not: if it is not, move colliding node to an empty place and put new key in its main position; otherwise (colliding node is in its main
** position), new key goes to an empty position. /

向哈希表插入新key值，

先检查key的 mainposition(哈希值) 是否空着
- 是：key值放在 mainposition
- 否：查看占据位置的值的 mainposition 是否是该位置
  - 是：找到空位置，然后把 key值放在该位置
  - 否：将占据位置的值移动到空位置，然后将key值放在该位置

核心就是，

当发生哈希碰撞的时候，会在碰撞位置物理向上查找到空位置，然后作为后加入值的位置。

因此，出现发生碰撞的时候，需要先去判断占据位置的值的哈希值是否是该位置。

由于，对同一个哈希值的不同值，是有链关系的，所以，后来的兄弟被移到别的位置了，还是能被找到，并不会因为后来兄弟把它扔别的地方就丢了。

最后，当没有空位置的时候，就要进行 rehash操作，这个设计到lua表的具体实现，等以后有空再整理一下吧。

mainposition的值是怎么来的呢？

这里先只讨论string类型的值，

对于 lua5.1 来说：

TString *luaS_newlstr (lua_State *L, const char *str, size_t l) {
  
  ...
  
  unsigned int h = cast(unsigned int, l);  /* seed */
  size_t step = (l>>5)+1;  /* if string is too long, don't hash all its chars */
  size_t l1;
  for (l1=l; l1>=step; l1-=step)  /* compute hash */
    h = h ^ ((h<<5)+(h>>2)+cast(unsigned char, str[l1-1]));
  
  ...
  
  return newlstr(L, str, l, h);  /* not found */
}

此处，h的取值即为该string的hash key，可以发现它是将 string长度作为种子，计算出来的，所以 lua5.1 是固定顺序的。

对于 lua5.3 来说：

TString *luaS_createlngstrobj (lua_State *L, size_t l) {
  TString *ts = createstrobj(L, l, LUA_TLNGSTR, G(L)->seed);
  ts->u.lnglen = l;
  return ts;
}

此处，h取值是一个随机种子，在创建的时候随机生成的：

/*
** a macro to help the creation of a unique random seed when a state is
** created; the seed is used to randomize hashes.
*/
#if !defined(luai_makeseed)
#include <time.h>
#define luai_makeseed()		cast(unsigned int, time(NULL))
#endif

至此，破案。

关于 repeat_dict 的嵌套问题（v0.0.2更新）

问题

最终输出的lua会存在以下结构：

local __rt1 = {
    [1] = __rt2,
}
local __rt2 = {
    [1] = 0,
}

这样在lua使用时，由于定义__rt1时，__rt2未被声明定义，所以输出出来的结果是：

__rt1 = {
    
}
__rt2 = {
    [1] = 0,
}

这是由于，repeat_dict只是单纯根据出现频繁次数排序的，所以可能存在上面的去引用下面的情况。

解决

方案一

这种方案是比较正规的方法，

将repeat_dict内所有序号都看作一个节点，然后通过引用关系相连，可以生成多条链结构。
根据链的长度进行遍历，共同维护将要输出的最终链结构
- 每遍历一个节点，都先去最终结构遍历位置关系及是否已经输出
- 找到合适的位置，将结点插入进去
将最终链结构按顺序赋值输出

方案二

这种方案比较投机取巧一些。

因为存在嵌套关系，那肯定是递归深度高的table嵌套了递归深度低的table。

所以先输出递归深度低的table，再输出递归深度高的table，那就不会出现这种问题。

但是，为了更好查找，同样深度的table，根据出现频率要再排序。

针对不同python版本打表不一致问题（v0.0.3更新 & v0.0.4更新）

经使用发现，不同python版本会导致，输出的表内容不一致。（暂时只试验了 Python 3.4 及 Python 3.7）

经过试验查询，发现是Python在存到字典后读取的顺序是不同的，导致sorted排序会有所差异（Python3.7没有问题，Python3.4会有所差异）。

查了相关资料发现，在python3.6开始普通字典可以记住元素添加顺序，之前并没有，所以会导致之前的字典遍历顺序有问题。

PS：注意，此处排序的稳定性和添加顺序一致是两个概念。
- 排序的稳定性：对于待排序数列 1,1,2,3，经历稳定排序，则不会进行交换操作，第一个位置的1 和第二个位置的1 不会交换位置（虽然两者值相同）
- 添加顺序的一致：输入 age, sex, sex, age；如果添加顺序一致，那么 sex永远会优先遍历到第二次，从而顺序会是 {sex: 2, age: 2}；如果添加顺序不一致，就可能出现 {age: 2, sex: 2}

原因查明白了，该想怎么解决了。就是对同样频率的再根据key排个序。（没有什么顺序是排序解决不了的，如果不行，多排几遍…）

基于同样的问题，v0.0.4进一步修改，字典转字符串，要先对字典排序再转，否则将出现根据存储顺序不一致问题，然后比较是否重复元素，也由原来比较字符串相同改为比较字典相同。

参考资料