Python3.5源码分析-Dict概述
发布日期:2021-07-25 13:04:45 浏览次数:18 分类:技术文章

本文共 20466 字,大约阅读时间需要 68 分钟。

Python3源码分析

本文环境python3.5.2。参考书籍<
>python官网

Python3的Dict对象

在生成d = {}和d[‘1’] = ‘1’,执行的字节码如下;

23          94 BUILD_MAP                0             97 STORE_NAME               5 (d) 24         100 LOAD_CONST               5 ('1')            103 LOAD_NAME                5 (d)            106 LOAD_CONST               5 ('1')            109 STORE_SUBSCR

调用了BUILD_MAP和STORE_SUBSCR进行赋值操作,先查看BUILD_MAP进行字典的初始化动作;

TARGET(BUILD_MAP) {        int i;        PyObject *map = _PyDict_NewPresized((Py_ssize_t)oparg);   // 初始化字典,传入字典的大小        if (map == NULL)            goto error;        for (i = oparg; i > 0; i--) {                             // 如果生成字典时有传入的默认的参数            int err;            PyObject *key = PEEK(2*i);                            // 获取key            PyObject *value = PEEK(2*i - 1);                      // 获取value            err = PyDict_SetItem(map, key, value);                // 设值到字典中            if (err != 0) {                Py_DECREF(map);                goto error;            }        }        while (oparg--) {            Py_DECREF(POP());            Py_DECREF(POP());        }        PUSH(map);                                                // 压入字典        DISPATCH();    }

其中调用了_PyDict_NewPresized生成字典;

PyObject *_PyDict_NewPresized(Py_ssize_t minused){    Py_ssize_t newsize;    PyDictKeysObject *new_keys;    for (newsize = PyDict_MINSIZE_COMBINED;            newsize <= minused && newsize > 0;         newsize <<= 1)                         // 生成newsize大小,建立空列表时minused默认为0,则newsize为PyDict_MINSIZE_COMBINED        ;    new_keys = new_keys_object(newsize);        // 生成对应大小的字典key    if (new_keys == NULL)        return NULL;    return new_dict(new_keys, NULL);            // 生成字典}

执行到这里的时候,先了解一下有关dict相关数据的结构。

有关dict的定义如下;

typedef struct {    PyObject_HEAD    Py_ssize_t ma_used;    PyDictKeysObject *ma_keys;    PyObject **ma_values;} PyDictObject;

其中ma_keys就是使用的保存的key值,PyDictKeysObject类型的定义如下;

typedef struct {    /* Cached hash code of me_key. */    Py_hash_t me_hash;                          // hash值    PyObject *me_key;                           // 对应的key    PyObject *me_value; /* This field is only meaningful for combined tables */} PyDictKeyEntry;typedef PyDictKeyEntry *(*dict_lookup_func)(PyDictObject *mp, PyObject *key, Py_hash_t hash, PyObject ***value_addr);struct _dictkeysobject {    Py_ssize_t dk_refcnt;               // 引用计数值    Py_ssize_t dk_size;                 // 大小    dict_lookup_func dk_lookup;         // 查找方法    Py_ssize_t dk_usable;               // 可用值    PyDictKeyEntry dk_entries[1];       // 存储的值};

此时回过头来看初始化过程,查看new_keys_object方法的执行过程;

static PyDictKeysObject *new_keys_object(Py_ssize_t size){    PyDictKeysObject *dk;    Py_ssize_t i;    PyDictKeyEntry *ep0;    assert(size >= PyDict_MINSIZE_SPLIT);    assert(IS_POWER_OF_2(size));    dk = PyMem_MALLOC(sizeof(PyDictKeysObject) +                      sizeof(PyDictKeyEntry) * (size-1));   // 申请内存,由于PyDictKeysObject中包含了一个PyDictKeyEntry,所以只需要申请余下的size-1    if (dk == NULL) {        PyErr_NoMemory();        return NULL;    }    DK_DEBUG_INCREF dk->dk_refcnt = 1;                      // 设置引用计数值     dk->dk_size = size;                                     // 设置大小    dk->dk_usable = USABLE_FRACTION(size);                  // 设置可使用的大小    ep0 = &dk->dk_entries[0];                               // 获取第一个PyDictKeyEntry    /* Hash value of slot 0 is used by popitem, so it must be initialized */    ep0->me_hash = 0;                                       // 设置第一个hash值为0    for (i = 0; i < size; i++) {        ep0[i].me_key = NULL;                               // 依次设置余下的me_key和me_value为NULL        ep0[i].me_value = NULL;    }    dk->dk_lookup = lookdict_unicode_nodummy;               // 设置key的查找方法    return dk;                                              // 返回}

此时就会生成的过程会调用new_dict函数,

/* Consumes a reference to the keys object */static PyObject *new_dict(PyDictKeysObject *keys, PyObject **values){    PyDictObject *mp;    assert(keys != NULL);    if (numfree) {                                          // 判断缓冲池是否有        mp = free_list[--numfree];        assert (mp != NULL);        assert (Py_TYPE(mp) == &PyDict_Type);        _Py_NewReference((PyObject *)mp);                   // 使用缓冲池对象    }    else {         mp = PyObject_GC_New(PyDictObject, &PyDict_Type);   // 缓冲池没有则申请新的对象        if (mp == NULL) {            DK_DECREF(keys);            free_values(values);            return NULL;        }    }    mp->ma_keys = keys;                                     // 设置ma_keys    mp->ma_values = values;                                 // 设置ma_values    mp->ma_used = 0;                                        // 设置ma_used为0    return (PyObject *)mp;                                  // 返回字典值}

至此一个字典的初始化过程完成。

此时继续向字典中赋值d[‘1’]=’1’,此时调用STORE_SUBSCR,

TARGET(STORE_SUBSCR) {        PyObject *sub = TOP();                      // 获取需要设置的key        PyObject *container = SECOND();             // 获取list实例        PyObject *v = THIRD();                      // 获取待设置的值        int err;        STACKADJ(-3);                               // 栈移动三个值        /* v[w] = u */        err = PyObject_SetItem(container, sub, v);  // 设置值        Py_DECREF(v);        Py_DECREF(container);        Py_DECREF(sub);        if (err != 0)            goto error;        DISPATCH();    }

查看PyObject_SetItem函数;

intPyObject_SetItem(PyObject *o, PyObject *key, PyObject *value){    PyMappingMethods *m;    if (o == NULL || key == NULL || value == NULL) {                    // 检查传入参数        null_error();        return -1;    }    m = o->ob_type->tp_as_mapping;                                      // 获取类型的tp_as_mapping方法    if (m && m->mp_ass_subscript)        return m->mp_ass_subscript(o, key, value);                      // 调用字典的设置方法    if (o->ob_type->tp_as_sequence) {                                   // 调用tp_as_sequence方法        if (PyIndex_Check(key)) {                                       // 检查key            Py_ssize_t key_value;            key_value = PyNumber_AsSsize_t(key, PyExc_IndexError);            if (key_value == -1 && PyErr_Occurred())                return -1;            return PySequence_SetItem(o, key_value, value);             // 调用设置值        }        else if (o->ob_type->tp_as_sequence->sq_ass_item) {            type_error("sequence index must be "                       "integer, not '%.200s'", key);            return -1;        }    }    type_error("'%.200s' object does not support item assignment", o);    return -1;}

此时就调用了tp_as_mapping的mp_ass_subscript方法,此时就是调用dict的dict_ass_sub方法;

static intdict_ass_sub(PyDictObject *mp, PyObject *v, PyObject *w){    if (w == NULL)        return PyDict_DelItem((PyObject *)mp, v);    else        return PyDict_SetItem((PyObject *)mp, v, w);}

由于此时是赋值操作,所以w不为NULL,就调用的是PyDict_SetItem函数,

intPyDict_SetItem(PyObject *op, PyObject *key, PyObject *value){    PyDictObject *mp;    Py_hash_t hash;    if (!PyDict_Check(op)) {                            // 检查是否是字典类型        PyErr_BadInternalCall();        return -1;    }    assert(key);    assert(value);    mp = (PyDictObject *)op;    if (!PyUnicode_CheckExact(key) ||        (hash = ((PyASCIIObject *) key)->hash) == -1)   // 检查传入的key是否hash为-1    {        hash = PyObject_Hash(key);                      // 生成hash调用key对应的tp_hash方法        if (hash == -1)            return -1;    }    /* insertdict() handles any resizing that might be necessary */    return insertdict(mp, key, hash, value);            // 插入该值}

其中有关insertdict的搜索条件如下;

/*Internal routine to insert a new item into the table.Used both by the internal resize routine and by the public insert routine.Returns -1 if an error occurred, or 0 on success.*/static intinsertdict(PyDictObject *mp, PyObject *key, Py_hash_t hash, PyObject *value){    PyObject *old_value;    PyObject **value_addr;    PyDictKeyEntry *ep;    assert(key != dummy);    if (mp->ma_values != NULL && !PyUnicode_CheckExact(key)) {        if (insertion_resize(mp) < 0)                                 // 重新设置mp的大小            return -1;    }    ep = mp->ma_keys->dk_lookup(mp, key, hash, &value_addr);          // 调用查找方法    if (ep == NULL) {        return -1;    }    assert(PyUnicode_CheckExact(key) || mp->ma_keys->dk_lookup == lookdict);  // 由于dk_lookup初始是lookdict_unicode_nodummy,在调用该方法后会unicode会被重置为lookdict方法    Py_INCREF(value);    MAINTAIN_TRACKING(mp, key, value);    old_value = *value_addr;                                          // 获取查找之后的值    if (old_value != NULL) {                                          // 如果不为空  则证明对应的key有值        assert(ep->me_key != NULL && ep->me_key != dummy);        *value_addr = value;                                          // 将旧值设置成新值        Py_DECREF(old_value); /* which **CAN** re-enter (see issue #22653) */    }    else {        if (ep->me_key == NULL) {                                     // 如果为空             Py_INCREF(key);            if (mp->ma_keys->dk_usable <= 0) {                        // 判断可用的key是否为0                /* Need to resize. */                if (insertion_resize(mp) < 0) {                       // 重新设置mp的大小                    Py_DECREF(key);                    Py_DECREF(value);                    return -1;                }                ep = find_empty_slot(mp, key, hash, &value_addr);     // 查找一个空闲的位置            }            mp->ma_keys->dk_usable--;                                 // 可用值减一            assert(mp->ma_keys->dk_usable >= 0);            ep->me_key = key;                                         // 设置key            ep->me_hash = hash;                                       // 设置hash值        }        else {            if (ep->me_key == dummy) {                                // 如果值为dummy状态                Py_INCREF(key);                ep->me_key = key;                                     // 设置me_key值                ep->me_hash = hash;                                   // 设置hash值                Py_DECREF(dummy);            } else {                assert(_PyDict_HasSplitTable(mp));            }        }        mp->ma_used++;                                                // 使用值加1        *value_addr = value;                                          // 赋值        assert(ep->me_key != NULL && ep->me_key != dummy);    }    return 0;}

首先会调用相关的查找方法,去查找待搜索的值是否已经存在字典中,如果当前字典数据已经满了则会按照增长大小的函数生成一个新的字典,并把旧数据设置到新的字典中,当找到的字典匹配时则返回。

其中dk_lookup对应的方法,在初始化之后对应的是lookdict_unicode_nodummy;

/* Faster version of lookdict_unicode when it is known that no 
keys * will be present. */static PyDictKeyEntry *lookdict_unicode_nodummy(PyDictObject *mp, PyObject *key, Py_hash_t hash, PyObject ***value_addr){ size_t i; size_t perturb; size_t mask = DK_MASK(mp->ma_keys); PyDictKeyEntry *ep0 = &mp->ma_keys->dk_entries[0]; // 获取第一个entery PyDictKeyEntry *ep; /* Make sure this function doesn't have to handle non-unicode keys, including subclasses of str; e.g., one reason to subclass unicodes is to override __eq__, and for speed we don't cater to that here. */ if (!PyUnicode_CheckExact(key)) { // 检查如果是非unicode mp->ma_keys->dk_lookup = lookdict; // 设置成lookdict查找方法 return lookdict(mp, key, hash, value_addr); // 直接查找 } i = (size_t)hash & mask; // 散列获取第一个entry ep = &ep0[i]; assert(ep->me_key == NULL || PyUnicode_CheckExact(ep->me_key)); if (ep->me_key == NULL || ep->me_key == key || (ep->me_hash == hash && unicode_eq(ep->me_key, key))) { // 如果ep未被使用,或者与查找key相同,货值哈希值相同则找到 *value_addr = &ep->me_value; // 获取value return ep; // 返回该entry } for (perturb = hash; ; perturb >>= PERTURB_SHIFT) { // 循环整个dict i = (i << 2) + i + perturb + 1; ep = &ep0[i & mask]; assert(ep->me_key == NULL || PyUnicode_CheckExact(ep->me_key)); if (ep->me_key == NULL || ep->me_key == key || (ep->me_hash == hash && unicode_eq(ep->me_key, key))) { // 检查是否未被使用,是否key相同,哈希值是否相同 *value_addr = &ep->me_value; // 设置该值 return ep; // 返回entery } } assert(0); /* NOT REACHED */ return 0;}

在查找之后,如果是要重新扩充字典大小则调用insertion_resize函数,

static intinsertion_resize(PyDictObject *mp){    return dictresize(mp, GROWTH_RATE(mp));       // 传入字典和按照当前mp下一步字典的大小}

继续查看dictresize函数,

/*Restructure the table by allocating a new table and reinserting allitems again.  When entries have been deleted, the new table mayactually be smaller than the old one.If a table is split (its keys and hashes are shared, its values are not),then the values are temporarily copied into the table, it is resized asa combined table, then the me_value slots in the old table are NULLed out.After resizing a table is always combined,but can be resplit by make_keys_shared().*/static intdictresize(PyDictObject *mp, Py_ssize_t minused){    Py_ssize_t newsize;    PyDictKeysObject *oldkeys;    PyObject **oldvalues;    Py_ssize_t i, oldsize;/* Find the smallest table size > minused. */    for (newsize = PyDict_MINSIZE_COMBINED;         newsize <= minused && newsize > 0;         newsize <<= 1)                        // 根据传入的minused 获取newsize的值        ;    if (newsize <= 0) {        PyErr_NoMemory();        return -1;    }    oldkeys = mp->ma_keys;                    // 先保存旧值    oldvalues = mp->ma_values;                // 保存旧值    /* Allocate a new table. */    mp->ma_keys = new_keys_object(newsize);   // 生成newsize的keys    if (mp->ma_keys == NULL) {                // 如果申请失败        mp->ma_keys = oldkeys;                // 回复原值并返回        return -1;    }    if (oldkeys->dk_lookup == lookdict)       // 判断旧值查找方法        mp->ma_keys->dk_lookup = lookdict;    // 继续使用旧值的查找方法    oldsize = DK_SIZE(oldkeys);               // 获取旧值的大小    mp->ma_values = NULL;                     // 先置空    /* If empty then nothing to copy so just return */    if (oldsize == 1) {                       // 如果是空则不需要复制        assert(oldkeys == Py_EMPTY_KEYS);        DK_DECREF(oldkeys);        return 0;    }    /* Main loop below assumes we can transfer refcount to new keys     * and that value is stored in me_value.     * Increment ref-counts and copy values here to compensate     * This (resizing a split table) should be relatively rare */    if (oldvalues != NULL) {                  // 旧值不为空        for (i = 0; i < oldsize; i++) {       // 循环旧值            if (oldvalues[i] != NULL) {                Py_INCREF(oldkeys->dk_entries[i].me_key);                oldkeys->dk_entries[i].me_value = oldvalues[i];   // 如果值不为空则依次将旧值赋值到dk_entries的me_value中            }        }    }    /* Main loop */    for (i = 0; i < oldsize; i++) {        PyDictKeyEntry *ep = &oldkeys->dk_entries[i];             // 获取旧值的entries        if (ep->me_value != NULL) {            assert(ep->me_key != dummy);                          // 设置的不为dummy状态的me_key            insertdict_clean(mp, ep->me_key, ep->me_hash, ep->me_value);        }    }    mp->ma_keys->dk_usable -= mp->ma_used;                        // 设置已经使用的数量    if (oldvalues != NULL) {        /* NULL out me_value slot in oldkeys, in case it was shared */        for (i = 0; i < oldsize; i++)            oldkeys->dk_entries[i].me_value = NULL;               // 旧值置空        assert(oldvalues != empty_values);        free_values(oldvalues);                                   // 释放旧值对应的内存        DK_DECREF(oldkeys);    }    else {        assert(oldkeys->dk_lookup != lookdict_split);        if (oldkeys->dk_lookup != lookdict_unicode_nodummy) {            PyDictKeyEntry *ep0 = &oldkeys->dk_entries[0];            for (i = 0; i < oldsize; i++) {                       // oldvalues为空则设置为dummy状态                if (ep0[i].me_key == dummy)                     Py_DECREF(dummy);            }        }        assert(oldkeys->dk_refcnt == 1);        DK_DEBUG_DECREF PyMem_FREE(oldkeys);                      // 释放旧值    }    return 0;}

当申请到新的字典空间后,就会将旧值重新设置到新的字典中,通过insertdict_clean;

static voidinsertdict_clean(PyDictObject *mp, PyObject *key, Py_hash_t hash,                 PyObject *value){    size_t i;    size_t perturb;    PyDictKeysObject *k = mp->ma_keys;    size_t mask = (size_t)DK_SIZE(k)-1;    PyDictKeyEntry *ep0 = &k->dk_entries[0];    PyDictKeyEntry *ep;    assert(k->dk_lookup != NULL);    assert(value != NULL);    assert(key != NULL);    assert(key != dummy);    assert(PyUnicode_CheckExact(key) || k->dk_lookup == lookdict);    i = hash & mask;    ep = &ep0[i];    for (perturb = hash; ep->me_key != NULL; perturb >>= PERTURB_SHIFT) {     // 查找对应的散列位置        i = (i << 2) + i + perturb + 1;        ep = &ep0[i & mask];    }    assert(ep->me_value == NULL);    ep->me_key = key;                                                         // 设置查找出来的位置的key    ep->me_hash = hash;                                                       // 设置查找出来的hash    ep->me_value = value;                                                     // 设置查找出来的value}

至此一个字典的赋值过程基本分析完成,会先去查找该字典是否已经有了对应的key值,如果有则直接重新将value赋值,如果没有找到则先判断字典是否需要重新改变大小,如果改变大小则在新申请的空间后将旧的值重新赋值到新字典中,然后将新插入的值插入,大致流程如上。

总结

有关字典的hash操作,可详细参考dictobject.c的说明,里面详细说明了dict的工作状态,

There are four kinds of slots in the table:1. Unused.  me_key == me_value == NULL   Does not hold an active (key, value) pair now and never did.  Unused can   transition to Active upon key insertion.  This is the only case in which   me_key is NULL, and is each slot's initial state.2. Active.  me_key != NULL and me_key != dummy and me_value != NULL   Holds an active (key, value) pair.  Active can transition to Dummy or   Pending upon key deletion (for combined and split tables respectively).   This is the only case in which me_value != NULL.3. Dummy.  me_key == dummy and me_value == NULL   Previously held an active (key, value) pair, but that was deleted and an   active pair has not yet overwritten the slot.  Dummy can transition to   Active upon key insertion.  Dummy slots cannot be made Unused again   (cannot have me_key set to NULL), else the probe sequence in case of   collision would have no way to know they were once active.4. Pending. Not yet inserted or deleted from a split-table.   key != NULL, key != dummy and value == NULLThe DictObject can be in one of two forms.Either:  A combined table:    ma_values == NULL, dk_refcnt == 1.    Values are stored in the me_value field of the PyDictKeysObject.    Slot kind 4 is not allowed i.e.        key != NULL, key != dummy and value == NULL is illegal.Or:  A split table:    ma_values != NULL, dk_refcnt >= 1    Values are stored in the ma_values array.    Only string (unicode) keys are allowed, no 
keys are present.

讲述了dict的四个状态,对应在代码中如何进行判断可知是哪种状态,

Major subtleties ahead:  Most hash schemes depend on having a "good" hashfunction, in the sense of simulating randomness.  Python doesn't:  its mostimportant hash functions (for strings and ints) are very regular in commoncases:  >>> map(hash, (0, 1, 2, 3))  [0, 1, 2, 3]  >>> map(hash, ("namea", "nameb", "namec", "named"))  [-1658398457, -1658398460, -1658398459, -1658398462]  >>>This isn't necessarily bad!  To the contrary, in a table of size 2**i, takingthe low-order i bits as the initial table index is extremely fast, and thereare no collisions at all for dicts indexed by a contiguous range of ints.The same is approximately true when keys are "consecutive" strings.  So thisgives better-than-random behavior in common cases, and that's very desirable.OTOH, when collisions occur, the tendency to fill contiguous slices of thehash table makes a good collision resolution strategy crucial.  Taking onlythe last i bits of the hash code is also vulnerable:  for example, considerthe list [i << 16 for i in range(20000)] as a set of keys.  Since ints aretheir own hash codes, and this fits in a dict of size 2**15, the last 15 bits of every hash code are all 0:  they *all* map to the same table index.But catering to unusual cases should not slow the usual ones, so we just takethe last i bits anyway.  It's up to collision resolution to do the rest.  Ifwe *usually* find the key we're looking for on the first try (and, it turnsout, we usually do -- the table load factor is kept under 2/3, so the oddsare solidly in our favor), then it makes best sense to keep the initial indexcomputation dirt cheap.

这里也介绍了是如何选用hash来进行散列表的快速定位,大家都看文档说明对照代码可一一阅读。

转载地址:https://blog.csdn.net/qq_33339479/article/details/81835379 如侵犯您的版权,请留言回复原文章的地址,我们会给您删除此文章,给您带来不便请您谅解!

上一篇:Django源码分析7:migrate命令的浅析
下一篇:Python3.5源码分析-List概述

发表评论

最新留言

路过,博主的博客真漂亮。。
[***.116.15.85]2024年04月03日 00时43分42秒